Improve cgroup hook and configuration file support on heterogeneous clusters


Issues with the current hook and configuration file syntax

The current hook source code and the configuration file syntax it supports make it hard to use a single cgroup hook across a number of clusters that do not all run the same PBSPro version, or on a PBSPro server that contains nodes with different configurations.

The main issues faced by people with complex clusters are:

  • Source code that does not support using the same hook code everywhere:
    • For some OSes the imports at the start of the code do not exist, so an exception occurs before the OS check is even reached; that implicitly rejects the event, which means that no jobs can be run on the nodes with these OSes
    • Even past this hurdle, an unsupported OS is treated as a 'configuration error'; instead, the hook should do nothing on OSes that do not support cgroups
    • The master hook was updated to support Python 3 for future PBSPro releases, but in a way that renders it unusable on currently released 18.x and 19.x PBSPro versions
    • The master hook now adds support for events that may not exist in earlier PBSPro releases; the hook will crash when used on older PBSPro releases that do not support these hook events, regardless of whether they are needed to support cgroup functionality.
  • Configuration file syntax that does not allow a single configuration file to be used for the entire cluster when nodes are heterogeneous and require different configuration. E.g.
    • some hosts do not have the kernel boot option to support "memsw" functionality in the mem controller, which often makes a different setting of mem.swappiness necessary
    • some hosts with a large socket count and large memory latency discrepancies when using remote memory benefit from vnode_per_numa_node, but this is unnecessary for other hosts with lower socket counts or that are always assigned exclusively to jobs
    • on some hosts cpusets are desirable, but not on others
    • on some GPU nodes device fencing is desirable to prevent jobs sharing a node from using the wrong GPUs, and because CPUs and GPUs must be located on the same socket for correct performance, vnode_per_numa_node is essential


The proposed changes shall retain backward compatibility with existing cgroup hook configuration files. This design extends the format as it currently exists.

The first set of issues can be fixed fairly easily by rearranging the code and adding some guards.

The second set of issues often forces sites to import N different cgroup hooks, each with its own configuration file, with each host disabling all the hooks but one via those configuration files. This is extremely cumbersome, especially since the hook allows "vntype" to be used to exclude collections of hosts sharing a vntype but does not allow limiting the hook to a set of vntypes; that idiosyncrasy of the current configuration file syntax means that introducing a new vntype and hook requires all the other hooks to have their configuration updated to explicitly exclude support for that vntype. That is extremely hard to manage, and errors in the sequence of steps taken often lead to failures (when, e.g., two competing hooks are both run on a host).

Even what can be expressed in the current configuration file syntax suffers from two main issues:

  • There is a syntax to enable or disable sections of the cgroup hook configuration file depending on hosts, but not to change some important booleans like use_hyperthreads or vnode_per_numa_node depending on which host the hook is running
  • There is an exclude_vntype but no mirror image include_vntype; for hosts there is an exclude_hosts and a run_only_on_hosts (with mysteriously different naming), but these cannot modulate per-vntype decisions and the semantics of some combinations are unclear (to give one example, run_only_on_hosts basically neuters exclude_hosts and exclude_vntypes).

Maintaining the code is also a lot harder than necessary because references to the configuration variables that enable/disable sections are currently scattered throughout the code (it would be better to confine them to one spot that parses the configuration file, so that the rest of the code can trivially find out which controllers and flags are enabled).

Slightly extending the supported configuration file syntax has been demonstrated, at many sites running prototypes of the cgroup hook with these changes, to fix these issues. This forms the crux of the proposal (though not of the code changes, some of which are in essence needed to allow the hook to function properly on most host configurations).

Expressing booleans in a flexible way that determines on which hosts they are enabled

The idea of this change is to introduce a parser that can convert strings into booleans depending on the host on which the cgroup hook runs or on its vntype. This can then be used to make most common configurations that currently rely on the rather idiosyncratic and asymmetric section enablers/disablers a lot simpler, by simply expressing when "enabled" for a section, or for the whole hook, should evaluate to True or False.

The proposed changes are not made to be as general as possible, but to support a large number of use cases; they have already been used widely by sites. The syntax supported in the extension was also designed to let customers easily cut and paste bits of existing configuration files containing lists in exclude_hosts, run_only_on_hosts and exclude_vntypes without having to rewrite them – in other words, it supports comma-separated lists.

If the strings

  • "vntype in:"
  • "vntype not in:"
  • "host in:"
  • "host not in:"

are recognized at the start of a string, the rest of the string is interpreted as a comma-separated list of vntypes/hosts for which the variable is set to True ("in:") or False ("not in:"). For all other hosts, by definition, the variable is set to the inverse. The entries are usually simply names, but the code allows individual entries to be fnmatch.fnmatch patterns; commas are not supported in the patterns since they separate entries. This would mainly be used by sites for wildcards using * or ?.

Every section of the config file now has an "enabled" attribute, which should be set to something that can be transformed into a boolean (i.e. either one of the strings described above or a true boolean). If "enabled" is not defined in a section, it is implicitly taken to be "true" (and possibly modified by what follows).
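A minimal sketch of such a parser could look as follows. The function name eval_bool_string is hypothetical (the hook itself centralizes this in its morph_config_dict_bools routine); it only illustrates the "in:"/"not in:" semantics and the fnmatch pattern matching described above.

```python
import fnmatch

def eval_bool_string(value, host, vntype):
    """Illustrative sketch (hypothetical name): turn a config value
    into a boolean for this host/vntype.

    Recognizes the 'host in:', 'host not in:', 'vntype in:' and
    'vntype not in:' prefixes; entries are comma-separated and each
    entry may be an fnmatch.fnmatch() pattern.
    """
    if isinstance(value, bool):
        return value
    rules = (("host not in:", host, False),
             ("host in:", host, True),
             ("vntype not in:", vntype, False),
             ("vntype in:", vntype, True))
    for prefix, subject, result in rules:
        if value.startswith(prefix):
            entries = [e.strip() for e in value[len(prefix):].split(",")]
            matched = subject is not None and any(
                fnmatch.fnmatch(subject, e) for e in entries)
            # Matching hosts/vntypes get 'result'; all others the inverse
            return result if matched else not result
    return value  # not a recognized boolean string; leave unchanged
```

For example, "host not in: nocg*" evaluates to False on nocg01 and to True on any other host.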

To go back to the examples given in the overview:

  • No cgroups on hosts starting with nocg (e.g. nocg01, nocg02, nocg03...), but enabled on all the other hosts:
    • in the main section, add:
      "enabled" : "host not in: nocg*"
  • some hosts do not have the kernel boot option to support "memsw" functionality in the mem controller
    • e.g. assume four sets of hosts, with the following vntypes:
      • compute_with_memsw
      • compute_no_memsw
      • gpu_with_memsw
      • gpu_no_memsw
    • in the memory subsection, you could then use:
      "swappiness" : "vntype in: compute_with_memsw, gpu_with_memsw"
      or even
      "swappiness" : "vntype in: *_with_memsw"
      and rely on translation of True to 1 (i.e. enable use of swap) or 0 (disable use of swap) – which is already in the existing cgroup hook code
  • some hosts with a large socket count and large memory latency discrepancies when using remote memory benefit from vnode_per_numa_node, but this is unnecessary for other hosts with lower socket counts or that are always assigned exclusively to jobs
    • define three sets of host with vntypes "thin", "fat" and "gpu" (you want per-socket vnodes on GPU nodes, even if they are thin)
    • in the main section, use:
      "vnode_per_numa_node" : "vntype in: fat, gpu"

  • on some hosts cpusets are desirable, but not on others
    • in the cpuset section, use e.g.
      "enabled" :  "host not in: nocpuset01, nocpuset02"

  • on some GPU nodes device fencing is desirable to prevent jobs sharing a node to use the wrong GPUs, and because CPUs and GPUs must be located on the same socket for correct performance vnode_per_numa_node is essential; on other nodes using these is unneeded (and unwanted):
    • in the main section use
      "vnode_per_numa_node" : "vntype in: thin, gpu"
    • in the devices section use
      "enabled" : "vntype in: gpu"

Note that this can also, in a limited fashion, be used for some numeric variables through small additions in the rest of the code, by allowing numeric values but converting the booleans True and False to sensible numerical values – see the swappiness example above.

Most sites can use just this one way of defining "enabled".

There is one exception to the addition of "enabled" to each section: the "cgroup" section of the configuration file has always been a dictionary of dictionaries, and some portions of the existing cgroup hook code rely on all values of that dictionary themselves being iterables (which a boolean, of course, is not). Rather than forcing the rest of the code to comply with an "elegant" structure that would have "enabled" defined at all levels, the cgroup section's "enabled" is stripped, since it expresses the same thing as "enabled" in the main section (if you disable all the controllers then having a main section becomes rather pointless).

Integration of older configuration file options to enable/disable sections and making 'exceptions' possible through their use

In most cases, the existing options for enabling/disabling sections (or the whole cgroup hook) can be replaced by the aforementioned support for strings morphed into booleans if only one option is used:

  • exclude_vntypes: use
    "enabled" : "vntype not in: <comma separated list formerly used in exclude_vntype>"

  • exclude_hosts: use
    "enabled" : "host not in: <comma separated list formerly used in exclude_hosts>"

  • run_only_on_hosts: use
    "enabled" : "host in: <comma separated list formerly used in run_only_on_hosts>"

Since you can define the value for a vntype explicitly, "exclude_vntypes" has become largely redundant. The recommendation would be to deprecate it, but the current implementation still modulates the "enabled" flag for the section based on exclude_vntypes (in theory you can use a host-based selection in "enabled" and then modulate it based on the vntype discovered, but it is counterintuitive to explicitly define a host as enabled and then implicitly exclude it based on vntype).

Some sites would sometimes like to define "exceptions" without having to change a vntype; for this, the other existing options (which are slightly extended) can be used in addition to the base, but now more flexible, "enabled". These options have always been lists and until now did not support wildcards; each entry may now be an fnmatch.fnmatch() pattern instead of a literal string.

If vntype-based lists are used to define "enabled" for a section, then the existing exclude_hosts configuration option can be used to modulate the answer so that site admins can still define exceptions to the rules. In order to ensure consistency, a mirror image "include_hosts" is now also parsed.

  • "exclude_hosts" lists hosts that should still be disabled despite having been enabled earlier (usually using a vntype-based string that would enable the section for this vntype).
    "enabled" : "vntype in: gpu"
    "exclude_hosts" : [ "gpu_test*" ]
    would leave all GPU nodes enabled, except the gpu_test* nodes where you don't want this section of the cgroup hook code to interfere with things you're testing

  • "include_hosts" lists hosts that should still be enabled despite having been disabled earlier (usually by a vntype-based string that disables the section for this vntype).
    "cpuset" : {
        "enabled" : "vntype in: fat",
        "include_hosts" : [ "test_cpuset_thin01", "test_cpuset_thin02" ]
    }
    would allow you to test the behaviour of cpusets on two thin nodes to validate whether they can be enabled in general later.

Finally, run_only_on_hosts has become largely redundant, but in the current implementation it is defined to modulate "enabled" (i.e. if "enabled" was true but run_only_on_hosts is non-empty and does not list the host, "enabled" is set to false instead), in the way that most closely matches the plain vernacular meaning of the option. I.e.

"enabled" : "vntype in: willing"
"run_only_on_hosts" : [ "able01", "able02", "able03" ]

would leave that section disabled for all nodes of "unwilling" vntypes (including able01..03 if they are "unwilling"), and also for all nodes of the "willing" vntype that are not one of the three listed hosts.

It is, of course, strongly discouraged to write these options in any order other than

  1. "enabled",
  2. "exclude_vntypes" and "exclude_hosts",
  3. "include_hosts",
  4. "run_only_on_hosts",

because it could mislead the reader into assuming different semantics than those described here (where the order in which some options modulate others is well defined). But most real-life examples of sites using this feature seem to use only one or, more rarely, two options, in this natural order.
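The modulation order above can be sketched in a few lines. The helper names below are hypothetical, and the sketch assumes "enabled" has already been morphed into a boolean by the string parser:

```python
import fnmatch

def _matches(name, patterns):
    """True if 'name' matches any list entry; entries may be
    fnmatch.fnmatch() patterns (hypothetical helper)."""
    return name is not None and any(
        fnmatch.fnmatch(name, p) for p in patterns)

def effective_enabled(section, host, vntype):
    """Sketch of the modulation order: enabled, then exclude_vntypes
    and exclude_hosts, then include_hosts, then run_only_on_hosts."""
    # 1. start from the (already morphed) boolean "enabled"
    enabled = section.get("enabled", True)
    # 2. exclusions can turn an enabled section off
    if _matches(vntype, section.get("exclude_vntypes", [])):
        enabled = False
    if _matches(host, section.get("exclude_hosts", [])):
        enabled = False
    # 3. include_hosts re-enables matched hosts
    if _matches(host, section.get("include_hosts", [])):
        enabled = True
    # 4. a non-empty run_only_on_hosts not listing this host disables
    run_only = section.get("run_only_on_hosts", [])
    if run_only and not _matches(host, run_only):
        enabled = False
    return enabled
```

With the gpu_test* example above, effective_enabled({"enabled": True, "exclude_hosts": ["gpu_test*"]}, "gpu_test01", "gpu") evaluates to False while any other GPU host stays enabled.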

Technical details

  • Decisions on whether the cgroup hook needs to run at all (whether the OS is supported, whether the kernel supports cgroups, and whether cgroup controllers are mounted) and on whether the configuration matches the cgroup controllers supported on the host are made as early as possible.

    When the cgroup functionality is not there at all on the host, the hook should accept the event without doing anything, e.g. on Windows hosts or on Linux kernels that do not support cgroups at all.

    When the kernel is known to support cgroups and on hosts where the cgroup hook is enabled in its main section, then the hook expects to find a mount for all controllers whose config file sections are enabled. The hook will not silently disable sections of the config file because a controller mount is missing, but will throw a fatal error and reject all events. If a controller mount is missing on a given host, then the site admin is expected to express in the configuration file that that section (or the whole hook) is disabled for the relevant host or vntype.
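The check for mounted controllers can be illustrated with a short sketch that scans /proc/mounts for cgroup (v1) filesystem entries. The function name and the set of controller names are illustrative, not the hook's actual code:

```python
def mounted_cgroup_controllers(mount_lines):
    """Illustrative sketch: collect the cgroup v1 controllers named in
    the given /proc/mounts lines (e.g. the lines of open('/proc/mounts')).

    On platforms without /proc or without cgroup support there are no
    such mounts, so the hook should accept the event and do nothing.
    """
    known = {"blkio", "cpu", "cpuacct", "cpuset", "devices",
             "freezer", "hugetlb", "memory", "net_cls", "pids"}
    controllers = set()
    for line in mount_lines:
        # /proc/mounts fields: device mountpoint fstype options dump pass
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "cgroup":
            controllers |= known.intersection(fields[3].split(","))
    return controllers
```

Comparing this set against the controllers whose config file sections are enabled is what allows the hook to raise a fatal error for a missing mount rather than silently disabling a section.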

  • The decode_list and decode_dict methods, which ensure that the json parser output converts bytes/bytearrays/unicode strings to the "string" type, were changed in master from something working only in Python 2 (where "str" and "bytes" are similar, but "unicode" is different) into something working only in Python 3 (where "str" is now close to the old "unicode" instead of "bytes"). They were changed again to support both Python 2 and Python 3 (in a way that does not create two sections, one for Python 2 and one for Python 3, which could in due course lead authors to fix and/or test changes in only one section, breaking the other), so that the same hook can be used on "master" as well as older versions of PBSPro.
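A version-agnostic decode helper could look like the following sketch (hypothetical name, not the hook's actual code); it relies only on behaviour common to both interpreter lines:

```python
import sys

def decode_to_str(value):
    """Illustrative sketch: convert json parser output to the native
    'str' type under both Python 2 and Python 3, recursing into lists
    and dicts so the whole configuration tree is uniform."""
    if isinstance(value, (bytes, bytearray)) and not isinstance(value, str):
        # Python 3 bytes/bytearray (on Python 2, bytes is str already)
        return value.decode("utf-8")
    if sys.version_info[0] < 3 and isinstance(value, unicode):  # noqa: F821
        # Python 2 unicode -> native str; never evaluated on Python 3
        return value.encode("utf-8")
    if isinstance(value, list):
        return [decode_to_str(item) for item in value]
    if isinstance(value, dict):
        return {decode_to_str(k): decode_to_str(v) for k, v in value.items()}
    return value
```

Keeping the two interpreter cases inside one function, rather than maintaining parallel Python 2 and Python 3 code paths, is what avoids the risk of fixing or testing only one of them.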

  • The "master" hook was changed to avoid exceptions on versions of PBSPro that support only the original set of MoM hook events, and not the newer events pbs.EXECJOB_ABORT, pbs.EXECJOB_PRERESUME, pbs.EXECJOB_POSTSUSPEND and pbs.EXECJOB_RESIZE. That too is necessary to ensure the same hook code can be used on multiple versions, which will be necessary to prevent having to maintain separate branches and fix issues in multiple sources.
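Guarding against missing event constants can be done with attribute checks instead of direct references. A sketch, with a hypothetical helper name (tested here against a stand-in for the pbs module, since the real one only exists inside MoM):

```python
def available_events(module, names):
    """Illustrative sketch: return the event constants from 'names'
    that 'module' actually defines. Older PBSPro releases lack the
    newer pbs.EXECJOB_* constants, so plain attribute access on them
    would raise an AttributeError and crash the hook."""
    return {getattr(module, name) for name in names if hasattr(module, name)}

# The newer MoM hook events mentioned above; only those that exist in
# this PBSPro version's 'pbs' module end up being handled.
NEWER_EVENT_NAMES = ("EXECJOB_ABORT", "EXECJOB_PRERESUME",
                     "EXECJOB_POSTSUSPEND", "EXECJOB_RESIZE")
```

E.g. handled = available_events(pbs, NEWER_EVENT_NAMES) yields an empty set on a release that predates all four events, and the hook simply skips them.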

  • The code that deals with use_hyperthreads was extended to also support AMD processor hosts, and to also work when use_hyperthreads is True but the host has one hardware thread per core (the existing master code was broken, forcing a config file change to make the hook work on all hosts when some hosts and OSes have HT enabled and others do not).

  • An option "propagate_vntype_to_server" was introduced to control whether the "vntype" detected in the MoM's vntype file is propagated to resources_available.vntype on the server's node information structure. Some sites were already using "resources_available.vntype" on the server for other things, it has specific usage on Cray XC machines, and the cgroup hook's overloading of the resource would break some of their existing configuration and scripts/hooks. Setting this to "false" completely decouples the cgroup "vntype" from resources_available.vntype on the server (which then has to be managed manually using qmgr unless MoM sets it regardless of the cgroup hook).

  • Extra subsystems were defined in the configuration file defaults, to allow creation of the directories if the corresponding controllers are mounted. They are disabled by default, but should be enabled if e.g. hooks that run after the cgroup hook create children of the cgroup hook's cgroup; they can also be enabled as long as the controller is mounted (if a controller is enabled on a host where it is not mounted, a CgroupConfigError will be raised, unless the source is changed to silently disable the controller instead).

    In particular, some versions of Docker do not like the cgroup hook creating directories in only some of the mounted cgroup controller mounts, and some others will create the missing directory, but the job cgroups will then not be cleaned up by the cgroup hook when the job ends; with this change a site admin will have stronger hints about how to fix these issues.

  • A routine "morph_config_dict_bools" was introduced to morph the special strings described above into booleans when recognized, and to centralize processing of the older exclude_vntype, exclude_hosts, and run_only_on_hosts configuration file options. It is called when instantiating the cgroup class, since it relies on knowing the host and vntype. Parsing the file can be done earlier, since it is already a static method that can be called using the CgroupUtils class name without instantiating a class object.

  • Early exits were added so that the hook stops processing as soon as no controllers are mounted, or when everything is disabled on this host after parsing and morphing the config file.

  • If running on mother superior, "Cgroup Limit exceeded" messages are written to the job stderr in addition to the MoM log (to avoid users having to nag their system administrators to find out "why their job suddenly stopped").

  • A bug was fixed in the routine that removes cgroups, to avoid spurious deletions of the parent cgroups for all PBSPro jobs when orphans are cleaned up.

  • A confusing description in a message was fixed (when hooks fail to work they will attempt to set a system hold on the relevant job, so "(suspended)" or "(suspend failed)" proved extremely confusing to site admins, since suspending a job is a totally different operation).

  • A log message indicating the type of event was moved to after the acquisition of the lock file, so that it is printed only when the next messages in the log are indeed related to the processing of the event mentioned (otherwise, when two hooks competed, the message could end up interleaved with log messages from an unrelated hook event, sowing confusion about what those messages related to).


The following example configuration file illustrates many of the features outlined in this design.


{
    "cgroup_prefix"             : "pbspro",
    "enabled"                   : "vntype in: type_a, type_b, no_numa, no_numa_no_cpuset",
    "periodic_resc_update"      : true,
    "vnode_per_numa_node"       : "vntype not in: no_numa",
    "online_offlined_nodes"     : "host not in: *keepoff",
    "orphan_cleanup_race_delay" : 5,
    "cgroup" : {
        "cpuacct" : {
            "enabled"            : "host in: dummy, tc72",
            "exclude_hosts"      : []
        },
        "cpuset" : {
            "enabled"            : "vntype not in: *no_cpuset",
            "controller_mount"   : "/sys/fs/cgroup/cpuset",
            "exclude_hosts"      : [],
            "exclude_vntypes"    : [],
            "memory_spread_page" : true,
            "mem_hardwall"       : false,
            "mem_fences"         : "vntype in: mem_fences, uv"
        },
        "devices" : {
            "enabled"            : false,
            "exclude_hosts"      : [],
            "exclude_vntypes"    : [],
            "allow"              : ["b *:* rwm", "c *:* rwm", ["mic/scif","rwm"], ["nvidiactl","rwm","*"], ["nvidia-uvm","rwm"]]
        },
        "hugetlb" : {
            "enabled"            : false,
            "default"            : "0MB",
            "exclude_hosts"      : [],
            "exclude_vntypes"    : []
        },
        "memory" : {
            "enabled"            : true,
            "default"            : "256MB",
            "reserve_memory"     : "2GB",
            "exclude_hosts"      : [],
            "exclude_vntypes"    : []
        },
        "memsw" : {
            "enabled"            : false,
            "default"            : "256MB",
            "reserve_memory"     : "2gb",
            "exclude_hosts"      : [],
            "exclude_vntypes"    : []
        }
    }
}
