Overview

Some customers would like the cgroups hook to stop managing the ngpus resource entirely on some nodes while continuing to manage it on others.

One use case is a site with test nodes that each have a single GPU to be shared by more than one job, mainly for testing. The site administrators plan to set resources_available.ngpus to a fixed number larger than one, using either version 2 vnode configuration files or qmgr, to control how far the GPU resource is oversubscribed.
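For example, assuming a test node named testnode01 (a hypothetical name used only for illustration), the value could be pinned at 4 with qmgr:

    qmgr -c "set node testnode01 resources_available.ngpus = 4"

or equivalently in a version 2 vnode configuration file:

    $configversion 2
    testnode01: resources_available.ngpus = 4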

The current hook does not allow this: it behaves as the manager of the ngpus resource on vnodes and will set resources_available.ngpus to the number of GPUs it can correctly assign. This means that if the 'devices' controller is disabled, or if 'discover_gpus' is disabled, the hook actively sets the resource to 0 to indicate that there are no GPUs available to assign to jobs; as a result, jobs requesting the ngpus resource will never use vnodes on that host.

The current hook thus offers no way to oversubscribe the ngpus resource: resources_available.ngpus is set to exactly the number of GPUs the hook knows it can manage and assign, and a job that requests ngpus is given one GPU, for its exclusive use, per unit of ngpus in its request.
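For instance, under the current behavior a job submitted as below is assigned two dedicated GPU devices on the host, with no possibility of sharing them with other jobs (job_script.sh is a placeholder for an actual job script):

    qsub -l select=1:ngpus=2 job_script.sh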

This proposal introduces a new boolean flag, 'ngpus_ext_managed', in the main section of the configuration file. Enabling the flag indicates that resources_available.ngpus is managed externally to the hook and that the hook should not assign GPUs to jobs that request the ngpus resource.

By default the flag is false. If it is set to 'true' in the pbs_cgroups.CF configuration file, the hook neither sets resources_available.ngpus nor assigns specific GPU devices to jobs that request ngpus.
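A minimal sketch of the flag in the main section of the configuration file; the surrounding entries are illustrative placeholders for an existing configuration, not required values:

    {
        "cgroup_prefix": "pbs_jobs",
        "ngpus_ext_managed": true,
        "cgroup": {
            "devices": {
                "enabled": true
            }
        }
    }

With this in place, the hook would leave resources_available.ngpus at whatever value the administrator configured and would not bind GPU devices to jobs.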

Since this is a boolean flag, the existing configuration file syntax makes it possible to have ngpus managed by the hook on some nodes and externally managed on others.
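A sketch of such a per-node arrangement; the 'vntype in:' conditional form is assumed here to follow the hook's existing syntax for boolean parameters, and shared_gpu_test is a hypothetical vnode type assigned to the test nodes:

    {
        "ngpus_ext_managed": "vntype in: shared_gpu_test"
    }

Nodes whose vntype is shared_gpu_test would then have ngpus externally managed, while on all other nodes the flag evaluates to false and the hook continues to manage ngpus as it does today.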






