Support for the cpu cgroup controller and zero-CPU jobs

  • Link to discussion on Developer Forum: <link to your project's discussion>
  • Link to issue: <issue link if available>
  • Link to pull request: <PR link if available>

Overview

Some sites want to use the "cpu" cgroup controller, in addition to or instead of the cpuset controller:

  • Some sites may want to "overcommit" CPU resources, and use up more "ncpus" resources than there are physical CPU threads; but when they do they would still want to ensure that some jobs do not hog more resources than others (i.e. to make sure a job receives no more than job_ncpus/total_assigned_ncpus of the CPU resources on the node).
  • Some sites may not want to use the cpuset controller (e.g. because of its undesirable interaction with suspend/resume preemption, since the scheduler is unaware of the CPU assignments in cpusets), but may still want to distribute CPU resources fairly across jobs.
  • Some sites may want to support zero-CPU jobs, but these should be "weightless", i.e. they should only use CPU resources left unused by other jobs.


Additionally, they may want to combine zero-CPU jobs (with the cpu controller managing their access to CPU resources) with the cpuset controller, but the current implementation does not allow this, since it assigns a minimum of one CPU to each job it manages.

Approach

For the cpu controller, the implementation is fairly simple: all we need to do is set

  • cpu.shares (set to 1000*ncpus; the scaling factor is irrelevant as long as it is used consistently across all the cgroups under the parent cgroup directory that houses the per-job cgroup directories)
  • cpu.cfs_quota_us

for the different jobs, after having created a per-job cgroup directory in the controller's hierarchy (as we do for other controllers).
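As a hedged illustration of the mechanism (not the hook's actual code), the following Python sketch shows the two writes involved, assuming a cgroup v1 hierarchy mounted under /sys/fs/cgroup with a pbspro.slice parent directory; the function name and the CGROUP_ROOT constant are made up for this example:

    import os

    CGROUP_ROOT = "/sys/fs/cgroup/cpu/pbspro.slice"  # illustrative path
    SHARES_PER_CPU = 1000  # any scaling works if used consistently

    def write_cpu_limits(jobid, ncpus, cfs_period_us=100000, fudge=1.03):
        # Create the per-job cgroup directory and set shares and quota.
        jobdir = os.path.join(CGROUP_ROOT, jobid)
        os.makedirs(jobdir, exist_ok=True)
        with open(os.path.join(jobdir, "cpu.shares"), "w") as f:
            f.write(str(SHARES_PER_CPU * ncpus))
        with open(os.path.join(jobdir, "cpu.cfs_quota_us"), "w") as f:
            f.write(str(int(ncpus * cfs_period_us * fudge)))

(Writing these files requires root privileges on the execution host.)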

In the configuration file for the cgroup hook, there are some extra tunables:

  • cfs_period_us: the period at which the kernel checks whether processes need to be throttled to enforce quotas. The default is 100000 in the hook (0.1 s), which is also the default in most current distributions. Reducing this gives finer-grained control over throttling of application threads that have exceeded their allotted quota, at the expense of extra overhead for the Linux CFS scheduler.
  • cfs_quota_fudge_factor: when quotas are set to exactly N cpus * cfs_period_us, the finite granularity of the checks usually prevents jobs from fully using N CPUs; they may only reach e.g. 98% or 99% of that. To ensure only real rogues are throttled, it is advisable to set the quotas slightly higher. The default is 1.03, which is entirely reasonable for the default cfs_period_us; it may need to be increased for much shorter values of cfs_period_us.
  • enforce_per_period_quota: controls whether hard quotas are enforced. If set to True, an application requesting N CPUs will not be able to use more than N CPUs even when idle CPU resources are available. That yields repeatable execution times for jobs, and users cannot request fewer CPUs and then stealthily use more in an attempt to make their jobs run faster. Obviously, if the intent is indeed to overcommit CPU resources (e.g. to allow resources_available.ncpus=256 on a node with 64 CPU threads), this should be set to False.
  • zero_cpus_shares_fraction: how many "shares" to give jobs requesting zero ncpus, expressed as a fraction of one CPU's shares. The default is 0.002, which makes the application threads weightless (the CFS scheduler will allocate CPU resources to them only when no application threads with more than zero shares are attempting to use CPUs).
  • zero_cpus_quota_fraction: we cannot, of course, use a zero default when quotas are enforced, or "weightless" jobs would always be throttled and prevented from running. The default is 0.2, which lets "weightless" jobs use 0.2 of a CPU thread (this assumes that users will submit weightless jobs for I/O-bound work, for which this poses no problem, and it avoids bursts of CPU activity that might slow down parallel jobs and waste resources). Set this to 1.0 or higher to disable throttling of (single-threaded) zero-CPU jobs; the "shares" will still make them weightless, but they will be able to use more "leftover" CPU resources. A sketch of how these tunables combine is shown below.
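To make the interplay of these tunables concrete, here is a hedged Python sketch of how they could combine into the values written to cpu.shares and cpu.cfs_quota_us; the defaults and the 1000-shares-per-CPU scaling come from this page, but the function name and the exact structure are illustrative, not the hook's actual code:

    SHARES_PER_CPU = 1000  # scaling factor from the section above

    def cpu_settings(ncpus, cfs_period_us=100000, cfs_quota_fudge_factor=1.03,
                     enforce_per_period_quota=True,
                     zero_cpus_shares_fraction=0.002,
                     zero_cpus_quota_fraction=0.2):
        # Return (cpu.shares, cpu.cfs_quota_us) for a job; -1 means "no quota".
        if ncpus > 0:
            shares = SHARES_PER_CPU * ncpus
            quota = int(ncpus * cfs_period_us * cfs_quota_fudge_factor)
        else:
            # Weightless job: a tiny share, and a small but nonzero quota
            # so that it is not throttled into never running.
            shares = max(1, int(SHARES_PER_CPU * zero_cpus_shares_fraction))
            quota = int(cfs_period_us * zero_cpus_quota_fraction)
        if not enforce_per_period_quota:
            quota = -1  # disables the CFS quota entirely
        return shares, quota

With the defaults, cpu_settings(4) returns (4000, 412000) and cpu_settings(0) returns (2, 20000).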

Weightless jobs do not run in a per-job cpuset created for them, but in the main pbspro.slice cgroup. This allows them to coexist with jobs requiring a strictly positive number of CPUs, which can then be placed into cpusets with specific CPU threads assigned to them.

In the "cpuset" section, there is a new allow_zero_cpus flag to prevent sites from unwittingly accepting "weightless" jobs into the pspro.slice cpuset (if "cpus" is disabled, they would not even be weightless!):

  • By default this is set to true, which means that if cpuset cgroups are enabled, jobs that request zero ncpus on a host will have their processes placed in the main pbspro.slice cpuset. Setting it to false disallows that: execjob_launch and execjob_attach events will instead be rejected if no ncpus were assigned on the host. To avoid a single-host job being accepted by mother superior and then failing immediately when the process that runs the job script is created, execjob_begin and execjob_resize hook events are also rejected if a job asks for no ncpus on mother superior and cpuset cgroups are enabled. A sketch of this rejection logic is shown below.
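As an illustration of that rejection logic, here is a hedged Python sketch in the style of a PBS hook; pbs.event() and reject() are the standard hook API, but ALLOW_ZERO_CPUS, CPUSET_ENABLED and assigned_ncpus_on_this_host() are hypothetical stand-ins for the hook's actual configuration lookup and bookkeeping:

    import pbs

    ALLOW_ZERO_CPUS = False  # stand-in for the "allow_zero_cpus" config flag
    CPUSET_ENABLED = True    # stand-in for the "cpuset" section being enabled

    def assigned_ncpus_on_this_host(job):
        # Hypothetical helper: the real hook derives this from the job's
        # resource assignment on the local vnodes.
        return 0

    e = pbs.event()
    if CPUSET_ENABLED and not ALLOW_ZERO_CPUS:
        if assigned_ncpus_on_this_host(e.job) == 0:
            e.reject("zero-ncpus jobs are not allowed when cpuset cgroups are enabled")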

Quota enforcement on hosts with multithreaded cores is a bit tricky to control: an application using the A and B threads of a single core uses "200% of a CPU" but only one core, while an application using the A threads of two different cores also uses "200% of a CPU" but essentially occupies almost two cores. There is no way for the cpu controller to see the difference, although the cpuset controller can often be used to fence jobs into their own sets of threads, which alleviates the problem (but then the utility of the cpu controller is low when you are already using cpusets, except for weightless jobs).

When ncpus_are_cores is set and use_hyperthreads is set, the quotas are multiplied by the number of threads per core detected on the host. If cpusets are not used and the application threads are e.g. placed on the A threads of all cores, then the applications are "lucky" and will run faster than if the threads were placed on the A and B threads of fewer CPU cores, although that advantage will disappear once all CPU threads are in use.
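A hedged sketch of that adjustment; the sysfs topology file is a standard Linux path, but the function names and the way the hook actually detects the topology are illustrative:

    def threads_per_core():
        # Count hardware threads per core from sysfs (Linux); the file
        # contains entries like "0,32" or "0-1".
        with open("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list") as f:
            siblings = f.read().strip()
        count = 0
        for part in siblings.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                count += int(hi) - int(lo) + 1
            else:
                count += 1
        return count

    def quota_for_cores(ncores, cfs_period_us=100000, fudge=1.03):
        # With ncpus_are_cores and use_hyperthreads set, scale the quota
        # by the detected number of threads per core.
        return int(ncores * threads_per_core() * cfs_period_us * fudge)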

Functionality outside of the cgroup hook made possible by enabling the cpu controller

A site may adjust the pbspro parent directory's cpu.cfs_quota_us and cpu.shares in order to leave some more room for the operating system (e.g. using the quota to throttle all jobs on the node so that they use only 95% of its CPU resources). cpu.shares should then be compared to the values systemd sets in e.g. system.slice and user.slice. A sketch of such a cap is shown below.
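For example, a hedged sketch of capping all jobs at 95% of a 64-thread node, assuming a cgroup v1 hierarchy; the path is illustrative:

    # Illustrative: cap the whole pbspro.slice at 95% of 64 CPU threads.
    NCPUS = 64
    CFS_PERIOD_US = 100000
    quota = int(0.95 * NCPUS * CFS_PERIOD_US)  # 6080000 us per 100 ms period
    with open("/sys/fs/cgroup/cpu/pbspro.slice/cpu.cfs_quota_us", "w") as f:
        f.write(str(quota))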
