...

  • No cgroups on hosts starting with nocg (e.g. nocg01, nocg02, nocg03...), but enabled on all the other hosts:
    • in the main section, add:
      "enabled" : "host not in: nocg*"
  • some hosts do not have the kernel boot option to support "memsw" functionality in the mem controller
    • e.g. assume four sets of hosts, with the following vntypes:
      • compute_with_memsw
      • compute_no_memsw
      • gpu_with_memsw
      • gpu_no_memsw
    • in the memory subsection, you could then use:
      "swappiness" : "vntype in: compute_with_memsw, gpu_with_memsw"
      or even
      "swappiness" : "vntype in: *_with_memsw"
      and rely on the translation of True to 1 (i.e. enable use of swap) and False to 0 (i.e. disable use of swap), which is already in the existing cgroup hook code
  • some hosts with a large socket count and large memory latency discrepancies when using remote memory benefit from vnode_per_numa_node, but this is unnecessary for other hosts with lower socket counts or that are always assigned exclusively to jobs
    • define three sets of hosts with vntypes "thin", "fat" and "gpu" (you want per-socket vnodes on GPU nodes, even if they are thin)
    • in the main section, use:
      "vnode_per_numa_node" : "vntype in: fat, gpu"

  • on some hosts cpusets are desirable, but not on others
    • in the cpuset section, use e.g.
      "enabled" : "host not in: nocpuset01, nocpuset02"

  • on some GPU nodes device fencing is desirable to prevent jobs sharing a node from using the wrong GPUs, and vnode_per_numa_node is essential because CPUs and GPUs must be located on the same socket for good performance; on other nodes these are unneeded (and unwanted):
    • in the main section use
      "vnode_per_numa_node" : "vntype in: thin, gpu"
    • in the devices section use
      "enabled" : "vntype in: gpu"

Note that this can also be used, in a limited fashion, for some numeric variables: with small edits to the rest of the code, numeric values are still allowed, and the booleans True and False are converted to sensible numeric values (see the first example).
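The matching described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the hook's actual code: the function names evaluate_condition and as_number are hypothetical, and the use of shell-style wildcard matching (fnmatchcase) is inferred from the "nocg*" and "*_with_memsw" examples.

```python
from fnmatch import fnmatchcase


def evaluate_condition(value, host, vntype):
    """Resolve strings such as "vntype in: fat, gpu" to True or False.

    Values that are not strings, or that do not use the
    "<host|vntype> [not] in: <list>" form, are returned unchanged.
    (Hypothetical helper; the real hook may be structured differently.)
    """
    if not isinstance(value, str):
        return value
    subjects = {"host": host, "vntype": vntype}
    for name, actual in subjects.items():
        # Test "not in:" before "in:" so that the longer operator wins.
        for operator, negate in ((" not in:", True), (" in:", False)):
            prefix = name + operator
            if value.startswith(prefix):
                patterns = [p.strip() for p in value[len(prefix):].split(",")]
                matched = any(fnmatchcase(actual or "", p)
                              for p in patterns if p)
                return matched != negate
    return value


def as_number(value):
    """Map booleans to 1/0 for numeric settings; pass numbers through."""
    if isinstance(value, bool):
        return 1 if value else 0
    return value
```

With this sketch, as_number(evaluate_condition("vntype in: *_with_memsw", "n01", "gpu_no_memsw")) yields 0, while a plain numeric setting passes through untouched.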

...

  • exclude_vntypes: use
    "enabled" : "vntype not in: <comma separated list formerly used in exclude_vntypes>"

  • exclude_hosts: use
    "enabled" : "host not in: <comma separated list formerly used in exclude_hosts>"

  • run_only_on_hosts: use
    "enabled" : "host in: <comma separated list formerly used in run_only_on_hosts>"
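The three replacements above are mechanical, so converting an existing configuration could be scripted. A minimal sketch follows; the helper name legacy_to_enabled and the lookup table are hypothetical and not part of the hook.

```python
# Prefix of the "enabled" string that replaces each legacy list option.
LEGACY_TO_CONDITION = {
    "exclude_vntypes": "vntype not in: ",
    "exclude_hosts": "host not in: ",
    "run_only_on_hosts": "host in: ",
}


def legacy_to_enabled(option, values):
    """Build the "enabled" string equivalent to one legacy list option."""
    return LEGACY_TO_CONDITION[option] + ", ".join(values)
```

For instance, legacy_to_enabled("exclude_hosts", ["nocg01", "nocg02"]) gives "host not in: nocg01, nocg02". A section that used more than one legacy option at once cannot be expressed as a single string this way.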

...

  • "exclude_hosts" lists hosts that should still be disabled despite having been enabled earlier (usually using a vntype-based string that would enable the section for this vntype).
    e.g.:
    "enabled" : "vntype in: gpu",
    "exclude_hosts" : [ "gpu_test*" ]
    would leave the section enabled on all GPU nodes except the gpu_test nodes, where you don't want this section of the cgroup hook code to interfere with things you're testing

  • "include_hosts" lists hosts that should still be enabled despite having been disabled earlier (usually using a vntype-based string that would disable the section for this vntype).
    e.g.:
    "cpuset"  : {
                        "enabled" : "vntype in: fat",
                        "include_hosts" : [ "test_cpuset_thin01", "test_cpuset_thin02" ]
                        ...
    would allow you to test the behaviour of cpusets on two thin nodes to validate whether they can be enabled in general later.

Finally, run_only_on_hosts has become largely redundant. In the current implementation it modulates "enabled" in the way that most closely matches the plain meaning of the option: if "enabled" was true but run_only_on_hosts is non-empty and does not list the host, "enabled" is set to false instead. For example:

"enabled" : "vntype in: willing",
"run_only_on_hosts" : [ "able01", "able02", "able03" ]

would leave that section disabled for all nodes whose vntype is not "willing" (including able01..03 if they are 'unwilling'), and also for all "willing" nodes other than the three listed hosts.
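How the three list options modulate an already-resolved "enabled" value can be sketched as below. The function names are hypothetical, and two points are assumptions rather than documented behaviour: the order in which the lists are applied (exclude_hosts, then include_hosts, then run_only_on_hosts), and wildcard matching for every list (only the "gpu_test*" example suggests it).

```python
from fnmatch import fnmatchcase


def host_listed(host, patterns):
    """True if host matches any entry; wildcards such as gpu_test* allowed."""
    return any(fnmatchcase(host, p) for p in patterns)


def modulate_enabled(enabled, host, section):
    """Apply the host lists of one config section to a boolean 'enabled'.

    'enabled' is the value after the "enabled" string was resolved.
    The order of the three steps is an assumption of this sketch.
    """
    exclude = section.get("exclude_hosts", [])
    include = section.get("include_hosts", [])
    only = section.get("run_only_on_hosts", [])
    # Hosts that should be disabled despite having been enabled earlier.
    if enabled and host_listed(host, exclude):
        enabled = False
    # Hosts that should be enabled despite having been disabled earlier.
    if not enabled and host_listed(host, include):
        enabled = True
    # A non-empty run_only_on_hosts can only turn 'enabled' off.
    if enabled and only and not host_listed(host, only):
        enabled = False
    return enabled
```

In this sketch run_only_on_hosts never enables a section by itself, which matches the "willing"/"able" example: an 'unwilling' able01 stays disabled, and a "willing" host outside the list is disabled.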

...