Improve memory+swap management in the cgroup hook

Overview of problems

Swap depletion

The current way to reserve swap (using "mem" for requested physical memory and "vmem" for the requested sum of physical memory plus swap usage) leads to situations in which swap can be depleted and jobs can be killed. That is obviously a defect – the scheduler should be unable to schedule jobs unless they are guaranteed the resources they requested.

An illustration: suppose a node has 60GB of physical memory available to jobs in cgroups and 60GB of swap available to jobs in cgroups. resources_available.mem for the host will be set to 60GB and resources_available.vmem will be set to 120GB.

Suppose two jobs requesting -lselect=1:mem=10GB:vmem=50GB are submitted. The amount of physical memory used by these jobs will be at most 20GB, with the rest (up to 80GB) coming from swap. Since the requested vmem (100GB) is less than the host's available vmem (120GB), the jobs will be allowed to use the host. Unfortunately, while the system allows swap to be used as backing store for physical memory, it does not allow physical memory limits to be ignored and physical memory to be used as backing store when swap is depleted, so if both jobs use 50GB the node will run out of swap (80GB > 60GB) and the OOM killer will wake up and kill a job.

To avoid trouble, it is currently necessary to reserve, as unusable to jobs, an amount of swap at least as large as the physical memory available to jobs (in that case there can be no swap depletion). If sites want to allow jobs to use swap, that is, of course, a waste of disk space.

To avoid these situations it would have been better to schedule "memsw" using two separate schedulable resources, physical memory ("mem") and swap, which would be added together to set the memsw limit. Alas, that ship has sailed – we now allow people to submit jobs using "mem" and "vmem", with possible swap usage implicitly being the difference between the two. But since the difference is known, the hook can derive the quantities of swap available on a host and requested by jobs, so we can fix the problem by transparently managing that derived resource (which we'll call 'cgswap', for 'cgroup swap').
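The arithmetic behind the fix can be made concrete with the numbers from the illustration above; the following plain-Python sketch (sizes as byte counts, no PBS modules involved) shows why scheduling on the derived cgswap resource rejects the second job that scheduling on vmem alone would have admitted.

    # Sketch of the arithmetic behind the proposed fix, using the numbers from
    # the illustration above; plain Python, no PBS modules required.
    GB = 2**30

    # Host: 60GB physical memory and 60GB swap available to jobs in cgroups.
    host_mem = 60 * GB                     # resources_available.mem
    host_vmem = 120 * GB                   # resources_available.vmem (mem + swap)
    host_cgswap = host_vmem - host_mem     # derived swap available: 60GB

    # Each job: -lselect=1:mem=10GB:vmem=50GB
    job_mem = 10 * GB
    job_vmem = 50 * GB
    job_cgswap = job_vmem - job_mem        # 40GB of swap per job

    # Scheduling on vmem alone admits both jobs (100GB <= 120GB) even though
    # their combined swap demand (80GB) exceeds the 60GB of swap present.
    assert 2 * job_vmem <= host_vmem
    assert 2 * job_cgswap > host_cgswap    # ...so swap can be depleted

    # Scheduling on the derived cgswap resource rejects the second job instead.
    assert job_cgswap <= host_cgswap                 # first job fits
    assert host_cgswap - job_cgswap < job_cgswap     # second job does not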

Treatment of jobs that do not request memory explicitly

For some jobs it is hard to specify accurate memory requests. When none are specified, the current behaviour of the hook is to set very small limits that only allow jobs with small memory footprints to run. Some sites do not like that behaviour and would like to run larger jobs despite their failure to request memory explicitly; since the defaults are added by MoM and unknown to the scheduler, the scheduler may schedule a number of jobs that together deplete the memory available on the host, which leads to job failures.

Of course, disabling the "memory" subsystem in the hook's config file will prevent this, but then the OS and daemons on the node are also not protected against PBSPro jobs that would use so much memory that the OS is starved of memory resources.

Sites have given two scenarios in which they do want some protection for the OS (i.e. enforcement of "reserve_amount" specified in the cgroup config file's "memory" and "memsw" subsections) but do not want restrictive per-job limits if jobs do not explicitly specify memory requests:

-Nodes running shared jobs for which individual per-job limits are hard to estimate, but where it is unlikely that the sum of all jobs' usage will deplete memory
-Nodes running exclusive-host jobs (and not using suspend/resume preemption), where restricting jobs to an amount smaller than what is available on the host is unnecessary.

It would appear most useful if this could be controlled on a per-host basis; fortunately, the current config file syntax already allows per-host values, so introducing booleans to control this gives us exactly that.

Proposed solution and interfaces

Swap depletion

The new feature only changes behaviour if a configuration parameter called "manage_cgswap" is set to true in the JSON config file's "memsw" section. By default it is false.
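As a purely illustrative sketch (not shipped configuration), the relevant "memsw" fragment might look like the following; it is shown here as JSON embedded in Python, and all keys other than "manage_cgswap" follow the existing pbs_cgroups layout with made-up values.

    # Illustrative "memsw" fragment with the new parameter enabled; values are
    # examples only, not recommended settings.
    import json

    memsw_section = json.loads("""
    {
        "enabled"        : true,
        "manage_cgswap"  : true,
        "default"        : "0B",
        "reserve_amount" : "1GB"
    }
    """)

    # The hook only changes behaviour when the flag evaluates to true; the flag
    # can presumably also be one of the hook's per-host conditional expressions
    # (e.g. "vntype in: ..."), as with other configuration values.
    assert memsw_section["manage_cgswap"] is True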

For all vnodes for which the cgroup hook computes resources_available.vmem and resources_available.mem and publishes these to the server, resources_available.cgswap will be set to the difference. The hook's exechost_startup event will create the resource if it does not exist, but there is no provision in the hook API to set the correct flags; a site will have to either create the resource on the server with the correct flags or set the correct flags before the resource is used. That is not unduly burdensome, since the scheduler also has to be told about the resource.

To make the new feature as transparent as possible, a site may enable a queuejob and modifyjob event on the cgroup hook, which will complete either "old" style -lncpus=X,mem=Y,vmem=Z or new-style -lselect=N:ncpus=X:mem=Y:vmem=Z requests with the computed "cgswap". It will also be possible to supply mem and cgswap, in which case the hook will compute vmem. qalter will support changing either one or two of the quantities (if two new values are supplied the third will be computed; if only one is supplied, the old mem or vmem value will be used to complete the (mem, cgswap, vmem) tuple of new values).
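A minimal sketch of the tuple completion involved is shown below; it is plain Python with sizes as byte counts, and the helper name and calling convention are hypothetical rather than the hook's actual internals.

    # Hypothetical helper illustrating the (mem, cgswap, vmem) completion the
    # queuejob/modifyjob events perform; sizes are byte counts.
    def complete_mem_tuple(mem=None, vmem=None, cgswap=None):
        """Given any two of (mem, cgswap, vmem), derive the third.

        Invariant maintained: vmem == mem + cgswap.
        """
        if mem is not None and vmem is not None and cgswap is None:
            cgswap = vmem - mem
        elif mem is not None and cgswap is not None and vmem is None:
            vmem = mem + cgswap
        elif vmem is not None and cgswap is not None and mem is None:
            mem = vmem - cgswap
        if None in (mem, vmem, cgswap) or cgswap < 0 or mem < 0:
            raise ValueError("inconsistent or incomplete memory request")
        return mem, vmem, cgswap

    GB = 2**30
    # -lselect=1:ncpus=1:mem=10GB:vmem=50GB  ->  cgswap=40GB is filled in
    assert complete_mem_tuple(mem=10 * GB, vmem=50 * GB) == (10 * GB, 50 * GB, 40 * GB)
    # mem and cgswap supplied  ->  vmem is computed
    assert complete_mem_tuple(mem=10 * GB, cgswap=40 * GB) == (10 * GB, 50 * GB, 40 * GB)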

-lnodes syntax is not supported (due to its rather ambiguous nature and the many things into which the server can transform it – unfortunately the hook interface does not have the knowledge the server has about what is meant by some 'properties' in the -lnodes syntax). If sites want to continue to allow -lnodes, then they must either write a queuejob hook that transforms the request into -lselect (which is fairly trivial for a trivial subset of the -lnodes syntax) or specify all three of (mem, vmem, cgswap) in the -lnodes specification – preferably correctly.
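For sites that choose the first route, a rough sketch of such a site-written queuejob hook is shown below; it assumes the PBS Python hook environment (the pbs module), handles only the trivial -lnodes=N:ppn=P form, and the detail of clearing the old-style request by assigning None is an assumption rather than a documented guarantee.

    # Rough sketch of a site queuejob hook that rewrites a trivial -lnodes
    # request as -lselect; anything more exotic (properties, "+" chunks) is
    # left untouched. This is illustrative only, not part of the cgroup hook.
    import pbs

    e = pbs.event()
    job = e.job
    nodes = job.Resource_List["nodes"]

    if nodes is not None:
        spec = str(nodes)
        # Only handle "N" or "N:ppn=P": a single chunk whose first token is a
        # plain node count (no hostnames or properties).
        if "+" not in spec and not any(c.isalpha() for c in spec.split(":")[0]):
            parts = spec.split(":")
            chunks = parts[0]
            ppn = "1"
            for p in parts[1:]:
                if p.startswith("ppn="):
                    ppn = p[len("ppn="):]
            job.Resource_List["select"] = pbs.select("%s:ncpus=%s" % (chunks, ppn))
            # Assumption: assigning None clears the old-style request so the
            # server only sees the -lselect form.
            job.Resource_List["nodes"] = None

    e.accept()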

To enable cgswap management, a site has to:

  • enable the cgroup hook (obviously)
  • set manage_cgswap to true in the hook configuration file (in the "memsw" section), or to a value that evaluates to true on the relevant hosts (e.g. by setting it to "vntype in: ignore_default_mem");
  • on the relevant hosts, killall -HUP pbs_mom or restart MoM, to make the hook compute resources_available.cgswap;
  • qmgr -c "set resource cgswap flag = 'nhm'" so that the 'h' flag is added to make the server manage resources_available.cgswap;
  • add cgswap to the "resources:" line in $PBS_HOME/sched_priv/sched_config
  • killall -HUP pbs_sched (on the host where the server runs)
  • enable the cgroup hook's queuejob and modifyjob events in qmgr

Care must be taken to set the order of the cgroup hook if other queuejob or modifyjob hooks exist that set or modify mem or vmem – in that case you might want the cgroup hook's queuejob and modifyjob events to run after these.

Since the exechost_* events of the cgroup hook usually need to be ordered with respect to other hooks (typically you want cgroups to run last, except for hooks that rely on cgroups, such as container hooks), it may be necessary, depending on all the hooks at a site, to import the hook source as two different hooks and to assign the queuejob/modifyjob events and the other events to different hook instances, so that the order can be controlled independently for the different groups of events.

Treatment of jobs that do not request memory explicitly

The changes introduce two new configuration parameters (booleans) in the memory, memsw and hugetlb sections (illustrated in the sketch after this list):

-"enforce_default", which will control whether  a lack of explicit memory requests will yield (if set to true) limits set through the configuration file default or (if set to false) the hosts's total cgroup-available memory. By default it is true, which leaves behaviour unchanged.
-"exclhost_ignore_default", which will make -lplace=exclhost jobs that do not explicitly request memory always behave as if enforce_default was set to false. By default exclhost_ignore_default is false (i.e. exclhost jobs are not treated differently than other jobs, and comply to the behaviour mandated by "enforce_default").

For "memsw", if no vmem is requested for a job, if the default is enforced, "default" denotes additional swap allowed for the job to use (consistent with reserved_amount and reserved_percent).

It is possible to, for example, disable enforcement of defaults for memory but enforce defaults for memsw. In that case, if the default for memsw is 0B, a job that does not request vmem will be allowed no swap usage and its mem and memsw limits will be the same. If, in addition, memsw's exclhost_ignore_default is enabled, then exclhost jobs will, in contrast, be allowed to use all cgroup-available swap on the host even if they do not request cgswap or vmem.
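The following plain-Python sketch (sizes as byte counts, helper name hypothetical) summarises the limit selection described in the two paragraphs above for a job that requests mem but no vmem.

    # Hypothetical summary of the memsw limit selection for a job without an
    # explicit vmem request; not a transcription of the hook's actual code.
    def memsw_limit(job_mem, requested_vmem, memsw_default, host_avail_memsw,
                    enforce_default, exclhost_ignore_default, is_exclhost):
        if requested_vmem is not None:
            return requested_vmem              # explicit request wins
        if exclhost_ignore_default and is_exclhost:
            return host_avail_memsw            # whole host, including swap
        if enforce_default:
            return job_mem + memsw_default     # default = extra swap allowed
        return host_avail_memsw                # defaults not enforced

    GB = 2**30
    # memsw default 0B enforced: no swap, memsw limit equals the mem limit
    assert memsw_limit(4 * GB, None, 0, 96 * GB, True, False, False) == 4 * GB
    # exclhost_ignore_default: an exclhost job may use all cgroup-available memsw
    assert memsw_limit(4 * GB, None, 0, 96 * GB, True, True, True) == 96 * GB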

Note that this design document does not change the existing behaviour of the cgroup hook that mitigates common misconfigurations: the memsw default is always clamped to the cgroup-available memsw minus the available mem, i.e. the default is lowered if the quantity specified is more than the swap detected to be available for jobs. The existing sanity check that always clamps a job's vmem limit to the mem limit when there is no swap is also untouched (since allowing a memsw limit larger than the mem limit makes no sense on such a host). This behaviour does have an impact on the design of PTL tests, since they may be run on hosts with no or little swap.
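For clarity, the two existing sanity checks mentioned above amount to something like the following plain-Python sketch (names illustrative, sizes as byte counts), not a transcription of the hook's actual code.

    # Illustrative versions of the two existing misconfiguration mitigations.
    def clamp_memsw_default(memsw_default, host_avail_mem, host_avail_memsw):
        # The memsw default may not promise more swap than is actually available.
        return min(memsw_default, host_avail_memsw - host_avail_mem)

    def clamp_vmem_limit(vmem_limit, mem_limit, host_has_swap):
        # On a host without swap, a memsw limit above the mem limit is meaningless.
        return vmem_limit if host_has_swap else min(vmem_limit, mem_limit)

    GB = 2**30
    assert clamp_memsw_default(8 * GB, 60 * GB, 64 * GB) == 4 * GB
    assert clamp_vmem_limit(6 * GB, 4 * GB, host_has_swap=False) == 4 * GB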

Change of default values

Sites have had many problems with the defaults in the shipped configuration file; the documentation does discuss how to set these, but the file as shipped is often unsafe or leads to counterintuitive effects when people have not absorbed all the advice in the documentation (after which, of course, the default configuration file is irrelevant).

  • mem_fences set to true is unsafe unless the workload is organised to prevent NUMA vnodes from being shared by jobs that are not entirely contained in them.

    When the (documented) safety rules are not adhered to, it is possible to deplete the physical memory on NUMA nodes even when no job violates its cgroup memory usage limits, and the fences may then force jobs to use swap instead of physical memory. When there is no swap, or once swap is fully depleted, the jobs are killed by OOM through no fault of their own (just a site admin's inadvertent failure to follow the safety restrictions for the workload); since the cause is resource depletion in swap or in a NUMA node, there will be no hints in the job's stdout/stderr.

    We have seen at least three sites completely baffled by that behaviour.

    mem_fences was set to true by default to mimic the behaviour of the cpuset MoM, but the context for the decision to enable memory fences in the cpuset MoM was different: in that MoM there is no separate cgroup memory controller, so memory fences are the only way to contain rogue jobs that spread beyond the NUMA nodes assigned to them by allocating more memory than they reserved.

    With the cgroup hook, there is now a separate memory controller to stop rogues from allocating too much memory.

    With this in mind, enabling memory fences by default is probably seriously misguided in the cgroup hook.

    If sites adhere to the restrictions necessary to make memory fences safe, then they can enable them, but they should not do so before reading the documentation about them. Enabling them by default makes that impossible – they are enabled even when site admins have no idea of the implications.

  • the 'memory' section reserve_amount used to be 64MB. Sites have invariably felt OOM's wrath when they have not tuned this – if the jobs indeed hog all the memory they requested, the OS processes require far more than 64MB, so the host runs out of memory and calls OOM to kill a job anyway. 1GB is a more reasonable default for today's clusters, even though it is still on the low side for most of them (5-6GB are more typical sane values on HPC nodes running a fairly substantial software stack and accessing a remote shared filesystem).

    We should not ship a default that is invariably too low for any modern compute node. Sites often fail to adjust the default and are puzzled by the inevitable job failures that result. It's better for the default to be slightly more conservative – the net effect will be to reduce resources_available.mem on nodes, and if sites wonder "why is 1GB missing", they will usually be prompted to read the documentation.
     
  • vnode_hidden_mb needs to be at least 1 in all sections of the cgroup configuration file. E.g. when a job requests some 'mem' but no 'vmem' and enforce_default is false for memsw, the cgroup memsw limit is raised to all memory available to jobs. But to the scheduler such jobs use zero vmem; that means that when computing "already assigned" memory for jobs already running on the node, we need to ignore those limits.

    When vnode_hidden_mb is 1, the scheduler sees 1MB less than is really available to jobs, so any job that gets a cgroup limit because the scheduler assigned the memory resource will have a limit that is guaranteed to be lower than an implicitly raised limit, and we can therefore recognize the latter.

  • in contrast to the other values in the "memsw" section, the "default" for memsw used to be inclusive of the default value specified for physical memory in the "memory" subsection.

    That is an aberration:
    • it makes it possible to specify impossible defaults (the kernel will refuse to set memsw to something smaller than mem),
    • it makes it impossible for the memory and memsw sections to be seen as controlling different resources (physical memory vs. swap),
    • and it makes "default" inconsistent with "reserve_amount" and "reserve_percent", which specify some swap to be added to the physical memory in the "memory" section.
    • the default values must also have an independent meaning in both sections if defaults in the two sections can be enforced or not enforced independently, or the behaviour would become completely baffling (you could turn off enforcement of the memory default, but despite that, the default value could remain relevant for the other section).

Most sites that enable both "memory" and "memsw" expect jobs that do not explicitly request swap not to use any (i.e. to have the memory and memsw limits set to identical values). That suggests semantics where "default" in memsw controls an amount of swap (to be added to whatever limit was set for memory) and a default of "0B" for the allowed swap usage of a job.

Implementation note about 0 memory sizes

https://github.com/openpbs/openpbs/pull/2287 fixes a bug in handling pbs.size variables with a zero value but a prefix larger than KB. The hook, however, should make every attempt to run even on currently released versions of OpenPBS and PBS Professional, so zero-valued pbs.size quantities, and pbs.size quantities that are subtracted from each other but have equal values (in bytes), are handled explicitly to avoid triggering the bug.
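A sketch of the kind of defensive helper this implies is shown below; it assumes the PBS Python hook environment (the pbs module), and the helper name is illustrative rather than the hook's actual function.

    # Defensive subtraction sketch: equal quantities map to a zero size
    # expressed in kb, so a zero value with a larger suffix (e.g. "0gb"),
    # which trips the bug fixed by openpbs PR #2287 on older servers, is
    # never produced.
    import pbs

    def size_difference(a, b):
        """Return a - b as a pbs.size, avoiding zero values with large suffixes."""
        if a == b:
            return pbs.size("0kb")
        return a - b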






