NVIDIA MIG Support

Follow the PBS Pro Design Document Guidelines.

Overview

From the NVIDIA documentation: "the new Multi-Instance GPU (MIG) feature allows the NVIDIA A100 GPU to be securely partitioned into up to seven separate GPU Instances for CUDA applications".

PBS will recognize each GPU Instance as a separate GPU.

Pre-requisites

MIG must be enabled on the GPU. See the NVIDIA documentation for how to enable it.

The nvidia kernel module parameter nv_cap_enable_devfs must be enabled (set to 1).
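
For example, assuming the parameter is set through a modprobe configuration file (the file name below is only an illustration), the setting would look like:

# /etc/modprobe.d/nvidia.conf (example file name)
options nvidia nv_cap_enable_devfs=1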

The admin must create GPU Instances and Compute Instances before starting the MoM.

Terminology

MIG = Multi-Instance GPU

GI = GPU Instance (A MIG GPU can have multiple GIs)

CI = Compute Instance (A GI can have multiple CIs)

How it works

The cgroups hook currently loads the GPU information via nvidia-smi. At this point, it will also note whether MIG is enabled on any GPUs. If a GPU has MIG enabled, the hook will look up the GIs and replace the physical GPU with the GIs it finds.
This means that if a node has a GPU split into 7 GIs, the hook will replace the 1 physical GPU with the 7 GIs, and ngpus will be 7.
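
As an illustration of the replacement step only (a sketch, not the hook's actual code; the function and key names are hypothetical), assuming the list of GIs has already been parsed from nvidia-smi mig -lgi as described under External Dependencies:

def expand_mig_gpus(gpus, mig_enabled, gpu_instances):
    """
    gpus:          dict of {gpu_id: device_info dict} discovered via nvidia-smi
    mig_enabled:   set of gpu_ids whose MIG mode is reported as Enabled
    gpu_instances: list of (gpu_id, gi_id) tuples from "nvidia-smi mig -lgi"

    Returns a new dict in which every MIG-enabled GPU is replaced by one
    entry per GI, so a GPU split into 7 GIs contributes 7 to ngpus.
    """
    expanded = {}
    for gpu_id, info in gpus.items():
        if gpu_id not in mig_enabled:
            expanded["gpu%d" % gpu_id] = info
            continue
        for gid, gi_id in gpu_instances:
            if gid == gpu_id:
                # Each GI is exposed to the scheduler as its own "gpu"
                expanded["gpu%d_gi%d" % (gpu_id, gi_id)] = dict(info, gi_id=gi_id)
    return expanded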

In order for a job to be able to use a GI, one or more CIs need to be created for that GI. Follow the NVIDIA documentation on how to do this.

The job’s cgroup needs multiple devices allowed in order to use the GI (a sketch of granting them follows below). It requires the following devices:

  1. The GI (look through /dev/nvidia-caps)

  2. All CIs that are in the GI (look through /dev/nvidia-caps)

  3. The GPU device that the GIs are created on (/dev/nvidia0, /dev/nvidia1, etc.)

  4. The nvidiactl device (required for the GPU device)

Even though the job has access to the GPU device, it does not have access to ALL of the GIs on the system, because MIG enforces the partitioning.
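
To make the device list above concrete, here is a minimal sketch of granting those devices through the cgroup v1 devices controller. The cgroup path and the /dev/nvidia-caps names are placeholders; the real hook determines the actual cap devices for the job's GI and CIs by looking through /dev/nvidia-caps, as described above.

import os

def allow_device(cgroup_path, dev_path):
    """Whitelist one character device for the job's cgroup."""
    st = os.stat(dev_path)
    major, minor = os.major(st.st_rdev), os.minor(st.st_rdev)
    with open(os.path.join(cgroup_path, "devices.allow"), "w") as fd:
        fd.write("c %d:%d rwm" % (major, minor))

job_cgroup = "/sys/fs/cgroup/devices/pbspro/123.server"  # placeholder path
for dev in (
    "/dev/nvidia-caps/nvidia-cap12",  # the GI's cap device (placeholder name)
    "/dev/nvidia-caps/nvidia-cap13",  # cap device of a CI in that GI (placeholder name)
    "/dev/nvidia0",                   # the GPU the GI was created on
    "/dev/nvidiactl",                 # required alongside the GPU device
):
    allow_device(job_cgroup, dev)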

External Dependencies

This feature relies on the nvidia-smi command. The hook uses the output of nvidia-smi mig -lgi to list the GIs and nvidia-smi mig -lci to list the CIs.

Unfortunately, these commands only output in table format, like so:

[vstumpf@gpusrv-01 ~]$ sudo nvidia-smi mig -lgi
+----------------------------------------------------+
| GPU instances:                                     |
| GPU   Name          Profile  Instance   Placement  |
|                     ID       ID         Start:Size |
|====================================================|
|   0  MIG 1g.5gb     19       7          0:1        |
+----------------------------------------------------+

The hook will use regex to match this output, but if the output format changes, a patch will be required.
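
For illustration, a pattern along the following lines would match the data rows in the table above (a sketch only, not the hook's actual regex):

import re

# Matches rows such as "|   0  MIG 1g.5gb     19       7          0:1        |"
GI_ROW = re.compile(
    r"^\|\s+(?P<gpu>\d+)\s+MIG\s+(?P<profile>\S+)"
    r"\s+(?P<profile_id>\d+)\s+(?P<gi>\d+)"
)

def parse_gi_list(text):
    """Return a list of (gpu_id, gi_id) tuples from 'nvidia-smi mig -lgi' output."""
    return [(int(m.group("gpu")), int(m.group("gi")))
            for m in map(GI_ROW.match, text.splitlines()) if m]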

On the two machines I tested on, nvidia-smi mig -lci had different formats:

[vstumpf@gpusrv-01 ~]$ nvidia-smi -h
NVIDIA System Management Interface -- v455.45.01
[vstumpf@gpusrv-01 ~]$ sudo nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances:                                                 |
| GPU     GPU       Name             Profile   Instance   Placement  |
|         Instance                   ID        ID         Start:Size |
|         ID                                                         |
|====================================================================|
|   0     7         MIG 1g.5gb       0         0          0:1        |
+--------------------------------------------------------------------+

versus

# nvidia-smi -h
NVIDIA System Management Interface -- v450.51.06
# nvidia-smi mig -lci
+-------------------------------------------------------+
| Compute instances:                                    |
| GPU     GPU       Name             Profile   Instance |
|         Instance                   ID        ID       |
|         ID                                            |
|=======================================================|
|   0     7         MIG 1g.5gb       0         0        |
+-------------------------------------------------------+
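
Because the two formats differ only in the trailing Placement column, a regex that anchors on the leading columns can handle both. A sketch (not the hook's actual pattern):

import re

# Matches rows such as "|   0     7      MIG 1g.5gb      0      0      0:1 |"
# as well as the v450-style rows that lack the trailing Placement column.
CI_ROW = re.compile(
    r"^\|\s+(?P<gpu>\d+)\s+(?P<gi>\d+)\s+MIG\s+(?P<profile>\S+)"
    r"\s+(?P<profile_id>\d+)\s+(?P<ci>\d+)"
)

def parse_ci_list(text):
    """Return a list of (gpu_id, gi_id, ci_id) tuples from 'nvidia-smi mig -lci' output."""
    return [(int(m.group("gpu")), int(m.group("gi")), int(m.group("ci")))
            for m in map(CI_ROW.match, text.splitlines()) if m]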

The hook also uses the nvidia-smi -L command to list the UUIDs of each MIG device. This output is used to populate the $CUDA_VISIBLE_DEVICES environment variable, which specifies which CIs a particular job runs on.
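
The MIG entries in nvidia-smi -L output include a "(UUID: ...)" field; a sketch of pulling those UUIDs out follows. The exact line layout can differ between driver versions, so the assumed format below is an illustration, not the hook's actual code.

import re

# Assumed line shape: "  MIG 1g.5gb Device 0: (UUID: MIG-...)"
MIG_DEV = re.compile(r"MIG\s+\S+\s+Device\s+(?P<idx>\d+):\s+\(UUID:\s+(?P<uuid>[^)\s]+)\)")

def mig_uuids(nvidia_smi_l_output):
    """Return {mig_device_index: uuid} for every MIG device line."""
    return {int(m.group("idx")): m.group("uuid")
            for m in MIG_DEV.finditer(nvidia_smi_l_output)}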

CUDA_VISIBLE_DEVICES

Instead of being filled with the UUIDs of the GPUs, CUDA_VISIBLE_DEVICES will contain the UUIDs of the CIs.

Previously, compute instances were identified via the format MIG-GPU-<GPU_UUID>/<GI_ID>/<CI_ID>, but now each compute instance in each GI has its own UUID of the form MIG-<MIG_UUID>.
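
As a purely illustrative example (the UUIDs and the job_env name below are placeholders, not the hook's real variables):

assigned_cis = [
    "MIG-11111111-2222-3333-4444-555555555555",  # placeholder CI UUID
    "MIG-66666666-7777-8888-9999-000000000000",  # placeholder CI UUID
]
job_env = {}  # stands in for the job's environment
job_env["CUDA_VISIBLE_DEVICES"] = ",".join(assigned_cis)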

 

External Interface Changes

There are no changes to the external interface. If MIG is enabled and GPU Instances have been created, the hook will use them automatically.