Fix incorrect MoM RLIMIT_AS when vmem is used for the cgroup hook's memsw functionality

Follow the PBS Pro Design Document Guidelines.

Overview

When enabling the "memsw" functionality in the cgroup hook, the semantics of "vmem" change.

Without it, "vmem" is managed by MoM and denotes the sum of the address spaces used by all of the job's processes on the mother superior node, and MoM sets a per-process RLIMIT_AS limit.

But when the cgroup hook's "memsw" functionality is enabled, a "vmem" request instead specifies a limit on the sum of physical memory and swap used by the job, which is often much smaller than the address space the job's processes allocate.
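For illustration only (the per-job directory path and values below are hypothetical), with cgroup v1 the memory controller enforces such a request through its limit files:

    # Sketch: apply mem/vmem requests via the cgroup v1 memory controller.
    # The per-job cgroup path below is a hypothetical example.
    cgroup_dir = "/sys/fs/cgroup/memory/pbspro/1234.server"
    mem_bytes = 4 * 1024 ** 3   # e.g. a mem=4gb request (physical memory)
    vmem_bytes = 8 * 1024 ** 3  # e.g. a vmem=8gb request (memory + swap)

    # memory.limit_in_bytes must be written first and must not exceed
    # memory.memsw.limit_in_bytes.
    with open(cgroup_dir + "/memory.limit_in_bytes", "w") as f:
        f.write(str(mem_bytes))
    with open(cgroup_dir + "/memory.memsw.limit_in_bytes", "w") as f:
        f.write(str(vmem_bytes))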

Indeed, in the face of kernel-enforced limits on real memory plus swap usage, it is doubtful that anyone still cares about address space limits, especially on 64-bit systems (which have an 8 exabyte positive address space). Address space limits were useful in the past, when the kernel enforced no physical memory or swap limits, but cgroups make them unnecessary.

But on most current versions of PBS Pro, MoM does not know about this change in semantics for vmem, and it still sets an RLIMIT_AS limit for all processes in the job. Since many applications allocate much more address space than the memory they actually use, this can make applications fail even though they stay well within the memory+swap usage corresponding to their vmem requests.
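A small standalone demonstration of the failure mode (this is illustrative, not hook code): reserving address space fails under a modest RLIMIT_AS even though almost no physical memory is touched.

    import mmap
    import resource

    # Cap this process's address space at 512 MiB, much as MoM would
    # for a small vmem request.
    limit = 512 * 1024 ** 2
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

    try:
        # Reserve 2 GiB of address space without touching the pages;
        # physical memory use stays tiny, but the mapping exceeds RLIMIT_AS.
        buf = mmap.mmap(-1, 2 * 1024 ** 3)
    except OSError as e:
        print("allocation failed despite low actual memory use:", e)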

Even MoM itself can fall victim to small vmem limits: when the job starter child forks to run the execjob_launch hook, the fork and exec of pbs_python can use up to 150-200 MB of address space, so setting a small RLIMIT_AS limit can make either the job starter or the hooks fail.

Furthermore, a per-job 'pvmem' resource request unambiguously requests exactly an RLIMIT_AS, so there is no real need to use vmem to specify this limit if one is desired.
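For example (illustrative values), a job can pair a memory+swap cap with an explicit per-process address-space limit:

    qsub -l vmem=8gb -l pvmem=2gb job.sh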

To summarise: when enabling the "memsw" functionality in the cgroup hook, RLIMIT_AS should be unlimited, unless the job also requests "pvmem" (which is an unambiguous request to set RLIMIT_AS).

Note: this really is a MoM bug; ideally it should be possible to enable or disable automatic setting of the RLIMIT_AS limit when 'vmem' is requested in the MoM config file. But until such a mechanism exists, at least the cgroup hook can mitigate the effect for values of 'vmem' large enough to allow the job starter and the hooks to run.

Changes proposed

A new parameter, manage_rlimit_as, would be introduced in the cgroup configuration file; it would default to true.
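In the hook's JSON configuration file this could look roughly as follows; the surrounding keys are only illustrative context, and the exact placement would follow the hook's existing configuration layout:

    {
        "cgroup_prefix": "pbspro",
        "manage_rlimit_as": true,
        "cgroup": {
            "memsw": {
                "enabled": true
            }
        }
    }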

If enabled, then the cgroup hook will reset the RLIMIT_AS process limit for task processes to either unlimited (if pvmem is not specified) or the value of pvmem requested for the job.

This functionality requires a kernel that supports the prlimit system call (i.e. Linux kernel 2.6.36 and above).

If hooks use Python 3, i.e. for PBS versions 2020 and above, that is the only requirement.
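With Python 3 hooks, the standard library's resource.prlimit() (present since Python 3.4, Linux only) can adjust another process's limits directly. A minimal sketch; the helper name and the way the hook obtains the task PID and the pvmem request are assumptions:

    import resource

    def reset_rlimit_as(pid, pvmem_bytes=None):
        # Hypothetical helper: replace the RLIMIT_AS that MoM set on a
        # task process. Honour pvmem if requested, else lift the cap.
        if pvmem_bytes is not None:
            new_limits = (pvmem_bytes, pvmem_bytes)
        else:
            new_limits = (resource.RLIM_INFINITY, resource.RLIM_INFINITY)
        resource.prlimit(pid, resource.RLIMIT_AS, new_limits)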

For older versions of PBS, the 'prlimit' command should be present (a sketch of this fallback follows the list below). The command is available in util-linux versions 2.21 and above on:
- CentOS/Red Hat Enterprise Linux from version 7.0 onward
- SLES from version 12 onward
- Ubuntu from version 16.04 onward
- Debian from version 8.0 onward
(i.e. most OS versions whose kernel supports the prlimit system call also carry util-linux >= 2.21).
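For those systems, a Python 2 compatible fallback could shell out to prlimit(1); a sketch, with the helper name assumed:

    import subprocess

    def reset_rlimit_as_cmd(pid, pvmem_bytes=None):
        # Hypothetical fallback for Python 2 hooks: util-linux prlimit(1).
        # A single value after --as= sets both the soft and hard limits;
        # "unlimited" removes the cap.
        value = str(pvmem_bytes) if pvmem_bytes is not None else "unlimited"
        subprocess.check_call(
            ["prlimit", "--pid", str(pid), "--as=" + value])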

If support for changing the process limits of other processes is not present, limits set by MoM are never changed (i.e. the hook behaves as before, and as if manage_rlimit_as were disabled).
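How the hook might probe for either mechanism before deciding to act is sketched below; the helper name is an assumption:

    import os
    import resource
    import subprocess

    def can_manage_rlimit_as():
        # Python 3.4+ exposes prlimit() directly in the resource module.
        if hasattr(resource, "prlimit"):
            return True
        # Otherwise, check whether the util-linux prlimit command exists.
        try:
            with open(os.devnull, "w") as devnull:
                subprocess.check_call(["prlimit", "--version"],
                                      stdout=devnull, stderr=devnull)
            return True
        except (OSError, subprocess.CalledProcessError):
            return False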

When the flag is disabled, limits set by MoM will not be changed. It is unlikely anyone would want this behaviour, except on MoMs where it is possible to disable enforcement of 'vmem' as an RLIMIT_AS limit set on processes.






