Make cgroup hook robust in the face of volatile cgroup controller mounts

Links

Link to discussion on Developer Forum: <link to your project's discussion>
Link to issue: <issue link if available>
Link to pull request: <PR link if available>

Overview

We have two sites in Europe that use Docker in combination with the cgroup hook. Sometimes Docker runs in parallel with PBSPro, making its own containers alongside the PBSPro ones made by the cgroup hook; sometimes the two cooperate, with the PBSPro container hook creating cgroups within the per-job ones set up by pbs_cgroups.PY.

Very rarely, parsing /proc/mounts makes the hook pick up a different mount of a cgroup controller as the mount path to use. Most of the time this has few ill effects. But the mounts do not seem to be permanent, and we do not synchronise with the Docker daemon that manipulates them, so the cgroup hook can pick up a mount path and then fail when it actually tries to use it.

The cgroup hook code indeed does not seem to countenance that a controller could be mounted in more than one spot, and certainly not that some of the mount paths encountered could be volatile. Furthermore, it always sets paths up using the last mount encountered for the controller in /proc/mounts, which is a recipe for disaster if we have other software adding mounts and then removing them.

At one site, the mount paths picked up usually refer to paths set up by Docker for individual containers (possibly in preparation for the cgroup controller mounts in the container?). At the other site, we see volatile mounts directly in /sys/fs/cgroup.

If those paths stay valid while an event is being processed, the problem is invisible to the end user. Alas, if the cgroup hook picks up a mount path and then tries to use it when it no longer exists, it fails with a message such as:

"CgroupConfigError ('Failed to create directory: /sys/fs/cgroup/blkio,cpuacct,memory,freezer/pbspro.slice/pbspro-413268.pbs10.slice/ (ENOMEM)',)"

At that site, inspection of the host after the error was discovered did confirm that /sys/fs/cgroup/blkio,cpuacct,memory,freezer is nowhere to be found, but it must have been there when the execjob_begin event for that job caused the hook to start running.

In the past, for one site, we extended the cgroup config file syntax to allow the site admin to hardcode the cgroup controller mount paths. That seems to have fixed all issues, so there is every reason to believe that the /sys/fs/cgroup/<controller> paths are unharmed and always usable. But of course it is onerous for a site admin to have to specify these. This change was not submitted in a pull request, but since the problem is resurfacing at other sites, we probably need to reintroduce it.

In the new build I have created to address the issue, when competing mounts are discovered for a given controller, the hook always uses the one with the shorter path. That picks up a permanent mount in all use cases and for all errors I have seen so far. Note that on some systems, such as HPE/Cray or HPE/SGI, it may pick up /dev/cpuset or /dev/cgmem to manage the cpuset and memory cgroup controllers respectively, instead of the distribution-standard /sys/fs/cgroup mount; that is not a problem, since these are permanent aliases.
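The rule of preferring the shorter mount path can be illustrated with a minimal Python sketch. This is not the hook's actual code: the function name pick_controller_mounts, the constant names and the sample mount lines (including the Docker path) are hypothetical, and the sketch assumes cgroup v1 mounts as they appear in /proc/mounts.

```python
# Controllers named in this design; hypothetical constant for the sketch.
KNOWN_CONTROLLERS = {
    "blkio", "cpu", "cpuacct", "cpuset", "devices", "freezer", "hugetlb",
    "memory", "net_cls", "net_prio", "perf_event", "pids", "rdma", "systemd",
}


def pick_controller_mounts(mounts_text):
    """Map each cgroup v1 controller to the shortest mount point seen."""
    chosen = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, ...
        if len(fields) < 4 or fields[2] != "cgroup":
            continue
        mount_point = fields[1]
        for option in fields[3].split(","):
            # Named hierarchies such as systemd show up as name=systemd.
            if option.startswith("name="):
                option = option[len("name="):]
            # Skip generic mount flags such as rw or nosuid.
            if option not in KNOWN_CONTROLLERS:
                continue
            current = chosen.get(option)
            # Keep the shortest path; on a tie the first one seen wins.
            if current is None or len(mount_point) < len(current):
                chosen[option] = mount_point
    return chosen


# Hypothetical excerpt of /proc/mounts with a competing memory mount.
SAMPLE_MOUNTS = """
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,name=systemd 0 0
cgroup /var/lib/docker/example/cgroup/memory cgroup rw,relatime,memory 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
"""

mounts = pick_controller_mounts(SAMPLE_MOUNTS)
```

With this sample, a rule that keeps the last mount encountered would select the Docker path for the memory controller, whereas the shorter-path rule keeps /sys/fs/cgroup/memory.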

But just in case the problem is caused not by extra mounts but by a kernel bug that reports an incorrect path for the existing mount in rare race conditions, I would also like to introduce code that allows the mount paths for the cgroup controllers to be specified explicitly. That way, sites could work around even those issues without us having to spin another version of the cgroup hook.

I also defined extra controllers that can be enabled, in case that makes Docker play more nicely with us when it tries to create child cgroups in controller hierarchies that we do not manage. (Docker may be manipulating the cgroup controller mounts precisely because we do not create directories for some controllers it knows about.) If you enable these controllers, the hook does nothing for them except create the cgroups, correctly populate their tasks files, and destroy the cgroups when the job ends.

The selection of the shorter mount and the introduction of "no operation" support for extra controllers are not truly interface changes, but I'll still describe them here.

Interface changes

Subsystems recognized by the hook

The hook now knows and supports enabling:

blkio
cpu
cpuacct
cpuset
devices
freezer
hugetlb
memory
net_cls
net_prio
perf_event
pids
rdma
systemd

If enabled, for the following 'unmanaged' subsystems nothing is done except creating and destroying cgroups and filling in the 'tasks' files with processes that belong to the job:

blkio
freezer
net_cls
net_prio
perf_event
pids
rdma
systemd
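The "no operation" handling of unmanaged subsystems can be sketched roughly as follows. This is an illustration under assumptions, not the hook's implementation: the function names and the idea of passing a job directory relative to the mount path are hypothetical.

```python
import os

# Subsystems for which the hook only creates, populates and destroys cgroups.
UNMANAGED_SUBSYSTEMS = (
    "blkio", "freezer", "net_cls", "net_prio",
    "perf_event", "pids", "rdma", "systemd",
)


def create_job_cgroup(mount_path, job_dir):
    """Create the per-job cgroup directory below a controller mount."""
    path = os.path.join(mount_path, job_dir)
    os.makedirs(path, exist_ok=True)
    return path


def add_task(cgroup_path, pid):
    """Attach one process to the cgroup by writing its PID to 'tasks'."""
    # cgroup v1 expects a single PID per write to the tasks file.
    with open(os.path.join(cgroup_path, "tasks"), "w") as tasks_file:
        tasks_file.write("%d\n" % pid)


def destroy_job_cgroup(cgroup_path):
    """Remove the per-job cgroup once no tasks remain in it."""
    # On a real cgroup filesystem the control files are virtual, so
    # rmdir succeeds as soon as the cgroup has no tasks or children.
    os.rmdir(cgroup_path)
```

No limits or other control files are touched for these subsystems; only the directory and its tasks file are involved.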

mount_path option in the sections for the subsystems in the configuration file

The hook configuration file now supports a "mount_path" parameter that describes the mount point of the cgroup controller to be used (a string containing a path). This is supported for all sections except "memsw" since that is not a separate controller but a separate functionality of the "memory" controller, so the corresponding files are in the memory controller's hierarchy.

Example:

"cpuset" : {
    "mount_path" : "/sys/fs/cgroup/cpuset",
    "enabled" : true,
    "exclude_cpus" : [],
    "exclude_hosts" : [],
    "exclude_vntypes" : [],
    "mem_fences" : true,
    "mem_hardwall" : false,
    "memory_spread_page" : false
},
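How an explicit "mount_path" could take precedence over a discovered mount is sketched below in Python. The function resolve_mount_paths and the shape of its arguments (a dictionary of subsystem sections and a dictionary of discovered mounts) are assumptions made for illustration; the hook's real data structures may differ.

```python
def resolve_mount_paths(sections, discovered):
    """Combine discovered mounts with explicit mount_path settings.

    sections   -- subsystem sections of the config, keyed by name
    discovered -- mount paths found by parsing /proc/mounts
    """
    paths = dict(discovered)
    for name, settings in sections.items():
        # memsw is not a separate controller, so it takes no mount_path.
        if name == "memsw":
            continue
        mount_path = settings.get("mount_path")
        if mount_path:
            # An explicit setting overrides whatever was discovered.
            paths[name] = mount_path
    # The memsw files live in the memory controller's hierarchy.
    if "memory" in paths:
        paths["memsw"] = paths["memory"]
    return paths


# Hypothetical inputs for illustration.
sections = {
    "cpuset": {"mount_path": "/sys/fs/cgroup/cpuset", "enabled": True},
    "memory": {"enabled": True},
    "memsw": {"enabled": True},
}
discovered = {"cpuset": "/dev/cpuset", "memory": "/sys/fs/cgroup/memory"}
resolved = resolve_mount_paths(sections, discovered)
```

In this sketch the cpuset section's explicit path wins over the discovered /dev/cpuset, while memory falls back to the discovered mount and memsw follows memory.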
