NVIDIA MIG Support

Follow the PBS Pro Design Document Guidelines.

Overview

From the NVIDIA documentation: "the new Multi-Instance GPU (MIG) feature allows the NVIDIA A100 GPU to be securely partitioned into up to seven separate GPU Instances for CUDA applications".

PBS will recognize each GPU Instance as a separate GPU.

Pre-requisites

MIG must be enabled on the GPU. See the NVIDIA documentation for how to enable it.

The nvidia kernel module parameter nv_cap_enable_devfs must be enabled (set to 1).
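
For example, assuming the parameter is set through a modprobe configuration file (the file name below is only an illustration), the setting would look like:

# /etc/modprobe.d/nvidia.conf (example file name)
options nvidia nv_cap_enable_devfs=1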

The admin must create GPU Instances and Compute Instances before starting the MoM.

Terminology

MIG = Multi-Instance GPU

GI = GPU Instance (A MIG GPU can have multiple GIs)

CI = Compute Instance (A GI can have multiple CIs)

How it works

The cgroups hook currently loads the GPU information via nvidia-smi. At this point, it will also note whether MIG is enabled on any GPUs. If a GPU has MIG enabled, the hook will look up the GIs and replace the physical GPU with the GIs it finds.
This means that if a node has a GPU split into 7 GIs, the hook will replace the 1 physical GPU with the 7 GIs, and ngpus will be 7.
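
As an illustration of the replacement step only (a sketch, not the hook's actual code; the function and key names are hypothetical), assuming the list of GIs has already been parsed from nvidia-smi mig -lgi as described under External Dependencies:

def expand_mig_gpus(gpus, mig_enabled, gpu_instances):
    """
    gpus:          dict of {gpu_id: device_info dict} discovered via nvidia-smi
    mig_enabled:   set of gpu_ids whose MIG mode is reported as Enabled
    gpu_instances: list of (gpu_id, gi_id) tuples from "nvidia-smi mig -lgi"

    Returns a new dict in which every MIG-enabled GPU is replaced by one
    entry per GI, so a GPU split into 7 GIs contributes 7 to ngpus.
    """
    expanded = {}
    for gpu_id, info in gpus.items():
        if gpu_id not in mig_enabled:
            expanded["gpu%d" % gpu_id] = info
            continue
        for gid, gi_id in gpu_instances:
            if gid == gpu_id:
                # Each GI is exposed to the scheduler as its own "gpu"
                expanded["gpu%d_gi%d" % (gpu_id, gi_id)] = dict(info, gi_id=gi_id)
    return expanded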

In order for a job to be able to use a GI, one or more CIs need to be created for that GI. Follow the NVIDIA documentation on how to do this.

The job’s cgroup needs multiple devices allowed in order to use the GI (a sketch of granting them follows below). It requires the following devices:

  1. The GI (look through /dev/nvidia-caps)

  2. All CIs that are in the GI (look through /dev/nvidia-caps)

  3. The GPU device that the GIs are created on (/dev/nvidia0, /dev/nvidia1, etc.)

  4. The nvidiactl device (required for the GPU device)

Even though the job has access to the GPU device, it does not have access to ALL of the GIs on the system, because MIG enforces the partitioning.
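
To make the device list above concrete, here is a minimal sketch of granting those devices through the cgroup v1 devices controller. The cgroup path and the /dev/nvidia-caps names are placeholders; the real hook determines the actual cap devices for the job's GI and CIs by looking through /dev/nvidia-caps, as described above.

import os

def allow_device(cgroup_path, dev_path):
    """Whitelist one character device for the job's cgroup."""
    st = os.stat(dev_path)
    major, minor = os.major(st.st_rdev), os.minor(st.st_rdev)
    with open(os.path.join(cgroup_path, "devices.allow"), "w") as fd:
        fd.write("c %d:%d rwm" % (major, minor))

job_cgroup = "/sys/fs/cgroup/devices/pbspro/123.server"  # placeholder path
for dev in (
    "/dev/nvidia-caps/nvidia-cap12",  # the GI's cap device (placeholder name)
    "/dev/nvidia-caps/nvidia-cap13",  # cap device of a CI in that GI (placeholder name)
    "/dev/nvidia0",                   # the GPU the GI was created on
    "/dev/nvidiactl",                 # required alongside the GPU device
):
    allow_device(job_cgroup, dev)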

External Dependencies

This feature relies on the nvidia-smi command. The hook uses the output of nvidia-smi mig -lgi to list the GIs and nvidia-smi mig -lci to list the CIs.

Unfortunately, these commands only output in table format, like so:

[vstumpf@gpusrv-01 ~]$ sudo nvidia-smi mig -lgi
+----------------------------------------------------+
| GPU instances:                                     |
| GPU   Name          Profile  Instance   Placement  |
|                     ID       ID         Start:Size |
|====================================================|
|   0  MIG 1g.5gb     19       7          0:1        |
+----------------------------------------------------+

The hook will use regex to match this output, but if the output format changes, a patch will be required.
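
For illustration, a pattern along the following lines would match the data rows in the table above (a sketch only, not the hook's actual regex):

import re

# Matches rows such as "|   0  MIG 1g.5gb     19       7          0:1        |"
GI_ROW = re.compile(
    r"^\|\s+(?P<gpu>\d+)\s+MIG\s+(?P<profile>\S+)"
    r"\s+(?P<profile_id>\d+)\s+(?P<gi>\d+)"
)

def parse_gi_list(text):
    """Return a list of (gpu_id, gi_id) tuples from 'nvidia-smi mig -lgi' output."""
    return [(int(m.group("gpu")), int(m.group("gi")))
            for m in map(GI_ROW.match, text.splitlines()) if m]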

On the two machines I tested on, nvidia-smi mig -lci had different formats:

[vstumpf@gpusrv-01 ~]$ nvidia-smi -h
NVIDIA System Management Interface -- v455.45.01
[vstumpf@gpusrv-01 ~]$ sudo nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances:                                                 |
| GPU     GPU       Name             Profile   Instance   Placement  |
|         Instance                   ID        ID         Start:Size |
|         ID                                                         |
|====================================================================|
|   0     7         MIG 1g.5gb       0         0          0:1        |
+--------------------------------------------------------------------+

versus

# nvidia-smi -h
NVIDIA System Management Interface -- v450.51.06
# nvidia-smi mig -lci
+-------------------------------------------------------+
| Compute instances:                                    |
| GPU     GPU       Name             Profile   Instance |
|         Instance                   ID        ID       |
|         ID                                            |
|=======================================================|
|   0     7         MIG 1g.5gb       0         0        |
+-------------------------------------------------------+
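
Because the two formats differ only in the trailing Placement column, a regex that anchors on the leading columns can handle both. A sketch (not the hook's actual pattern):

import re

# Matches rows such as "|   0     7      MIG 1g.5gb      0      0      0:1 |"
# as well as the v450-style rows that lack the trailing Placement column.
CI_ROW = re.compile(
    r"^\|\s+(?P<gpu>\d+)\s+(?P<gi>\d+)\s+MIG\s+(?P<profile>\S+)"
    r"\s+(?P<profile_id>\d+)\s+(?P<ci>\d+)"
)

def parse_ci_list(text):
    """Return a list of (gpu_id, gi_id, ci_id) tuples from 'nvidia-smi mig -lci' output."""
    return [(int(m.group("gpu")), int(m.group("gi")), int(m.group("ci")))
            for m in map(CI_ROW.match, text.splitlines()) if m]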

The hook also uses the nvidia-smi -L command to list the UUIDs of each MIG device. This output is used to populate the $CUDA_VISIBLE_DEVICES environment variable, which specifies which CIs a particular job runs on.
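
The MIG entries in nvidia-smi -L output include a "(UUID: ...)" field; a sketch of pulling those UUIDs out follows. The exact line layout can differ between driver versions, so the assumed format below is an illustration, not the hook's actual code.

import re

# Assumed line shape: "  MIG 1g.5gb Device 0: (UUID: MIG-...)"
MIG_DEV = re.compile(r"MIG\s+\S+\s+Device\s+(?P<idx>\d+):\s+\(UUID:\s+(?P<uuid>[^)\s]+)\)")

def mig_uuids(nvidia_smi_l_output):
    """Return {mig_device_index: uuid} for every MIG device line."""
    return {int(m.group("idx")): m.group("uuid")
            for m in MIG_DEV.finditer(nvidia_smi_l_output)}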

CUDA_VISIBLE_DEVICES

Instead of being filled with the UUIDs of the GPUs, CUDA_VISIBLE_DEVICES will contain the UUIDs of the CIs.

Previously, compute instances were identified via the format MIG-GPU-<GPU_UUID>/<GI_ID>/<CI_ID>, but now each compute instance in each GI has its own UUID of the form MIG-<MIG_UUID>.
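
As a purely illustrative example (the UUIDs and the job_env name below are placeholders, not the hook's real variables):

assigned_cis = [
    "MIG-11111111-2222-3333-4444-555555555555",  # placeholder CI UUID
    "MIG-66666666-7777-8888-9999-000000000000",  # placeholder CI UUID
]
job_env = {}  # stands in for the job's environment
job_env["CUDA_VISIBLE_DEVICES"] = ",".join(assigned_cis)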

 

External Interface Changes

There are no changes to the external interface. If MIG is enabled and GPU Instances have been created, the hook will use them automatically.