...

Link to discussion on Developer Forum: https://community.openpbs.org/t/nvidia-mig-support/2382
Link to issue: <issue link if available>
Link to pull request: https://github.com/openpbs/openpbs/pull/2142
Link to pull request updating the MIG UUID format: "Switch to obtaining nvidia a100 MIG identifier/UUID from nvidia-smi -L directly, rather than constructing tuple format" (Pull Request #2452, openpbs/openpbs)

Overview

From the NVIDIA documentation: "the new Multi-Instance GPU (MIG) feature allows the NVIDIA A100 GPU to be securely partitioned into up to seven separate GPU Instances for CUDA applications".

...

The cgroups hook currently loads GPU information via nvidia-smi. While doing so, it also notes whether MIG is enabled on any GPU. If a GPU has MIG enabled, the hook looks up its GPU Instances (GIs) and replaces the physical GPU with the GIs it finds.
This means that if a node has one MIG GPU split into 7 GIs, the hook replaces the 1 physical GPU with the 7 GIs, and ngpus will be 7.
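
The replacement step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the dictionary layout (`name`, `mig_enabled`, `gpu_instances`) and the function `expand_mig_gpus` are hypothetical and are not taken from the cgroups hook itself.

```python
def expand_mig_gpus(gpus):
    """Replace each MIG-enabled physical GPU with the GIs found on it.

    `gpus` is a hypothetical list of dicts; the real hook keeps its own
    data structures, which are not shown in this document.
    """
    devices = []
    for gpu in gpus:
        if gpu["mig_enabled"]:
            # The physical GPU is dropped; only its GIs are exposed.
            devices.extend(
                "{}/gi{}".format(gpu["name"], gi) for gi in gpu["gpu_instances"]
            )
        else:
            # Non-MIG GPUs are kept as a single schedulable device.
            devices.append(gpu["name"])
    return devices


# Hypothetical node: one MIG GPU split into 7 GIs.
node_gpus = [
    {"name": "gpu0", "mig_enabled": True, "gpu_instances": [7, 8, 9, 10, 11, 12, 13]},
]
devices = expand_mig_gpus(node_gpus)
ngpus = len(devices)
print(ngpus)
```

With the single MIG GPU above, the device list holds the 7 GIs and no entry for the physical GPU, which matches the ngpus value described in the text.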

For a job to be able to use a GI, at least one Compute Instance (CI) must be created on the node for that GI. Follow the NVIDIA documentation on how to do this.

...

Code Block

# nvidia-smi -h
NVIDIA System Management Interface -- v450.51.06

# nvidia-smi mig -lci
+-------------------------------------------------------+
| Compute instances:                                    |
| GPU     GPU       Name             Profile   Instance |
|       Instance                       ID        ID     |
|         ID                                            |
|=======================================================|
|   0      7       MIG 1g.5gb           0         0     |
+-------------------------------------------------------+

The hook also uses the nvidia-smi -L command to list the UUID of each MIG device. These UUIDs are used to set the $CUDA_VISIBLE_DEVICES environment variable, which specifies which CIs a particular job runs on.

Code Block

[abdas@gpusrv-01 ~]$ nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-ade9b969-b95f-7fc1-9075-860078a3b0b7)
  MIG 1g.5gb      Device  0: (UUID: MIG-1587894d-e0db-5f61-8b47-9aac9ed49baf)
  MIG 1g.5gb      Device  1: (UUID: MIG-b77baeb1-8280-5af4-be1c-0ad656409721)
  MIG 1g.5gb      Device  2: (UUID: MIG-d8b08635-c7ba-55aa-88ac-f86867aad854)
  MIG 1g.5gb      Device  3: (UUID: MIG-98782be8-73f5-5cb4-a225-02158827e203)
  MIG 1g.5gb      Device  4: (UUID: MIG-baeafc17-9af0-5df8-be81-36ee538070d9)
  MIG 1g.5gb      Device  5: (UUID: MIG-3f78d2a1-636f-5a9c-ade4-268bb246fbc0)
  MIG 1g.5gb      Device  6: (UUID: MIG-86c5989a-adb4-528d-ad03-c1f85b86c222)
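
As a minimal sketch of how output in this shape could be turned into a list of MIG UUIDs, the following parses the sample text above with regular expressions. The function name `parse_nvidia_smi_list` and both patterns are assumptions made for illustration; the hook's actual parsing code may differ, and other driver versions may print a different layout.

```python
import re

# Sample text copied from the nvidia-smi -L output shown above.
SAMPLE = """\
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-ade9b969-b95f-7fc1-9075-860078a3b0b7)
  MIG 1g.5gb      Device  0: (UUID: MIG-1587894d-e0db-5f61-8b47-9aac9ed49baf)
  MIG 1g.5gb      Device  1: (UUID: MIG-b77baeb1-8280-5af4-be1c-0ad656409721)
  MIG 1g.5gb      Device  2: (UUID: MIG-d8b08635-c7ba-55aa-88ac-f86867aad854)
  MIG 1g.5gb      Device  3: (UUID: MIG-98782be8-73f5-5cb4-a225-02158827e203)
  MIG 1g.5gb      Device  4: (UUID: MIG-baeafc17-9af0-5df8-be81-36ee538070d9)
  MIG 1g.5gb      Device  5: (UUID: MIG-3f78d2a1-636f-5a9c-ade4-268bb246fbc0)
  MIG 1g.5gb      Device  6: (UUID: MIG-86c5989a-adb4-528d-ad03-c1f85b86c222)
"""

# Assumed line shapes, based only on the sample above.
GPU_RE = re.compile(r"^GPU\s+(\d+):.*\(UUID:\s*(GPU-[0-9a-f-]+)\)")
MIG_RE = re.compile(r"^\s+MIG\s+(\S+)\s+Device\s+(\d+):\s*\(UUID:\s*(MIG-[0-9a-f-]+)\)")


def parse_nvidia_smi_list(text):
    """Map each physical GPU UUID to the MIG device UUIDs listed under it."""
    result = {}
    current_gpu = None
    for line in text.splitlines():
        gpu_match = GPU_RE.match(line)
        if gpu_match:
            current_gpu = gpu_match.group(2)
            result[current_gpu] = []
            continue
        mig_match = MIG_RE.match(line)
        if mig_match and current_gpu is not None:
            result[current_gpu].append(mig_match.group(3))
    return result


mig_by_gpu = parse_nvidia_smi_list(SAMPLE)
for gpu_uuid, mig_uuids in mig_by_gpu.items():
    print(gpu_uuid, len(mig_uuids))
```

In practice the text would come from running the command on the node (for example via `subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)`) rather than from a hard-coded string.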

CUDA_VISIBLE_DEVICES

Instead of being filled with the UUIDs of the physical GPUs, CUDA_VISIBLE_DEVICES will contain the UUIDs of the CIs.

Previously, compute instances were identified via the format MIG-GPU-<GPU_UUID>/<GI_ID>/<CI_ID>, but now each compute instance in each GI has its own UUID of the format MIG-<MIG_UUID>.

Code Block

[abdas@gpusrv-01 ~]$ qsub -I -lselect=ngpus=2
qsub: waiting for job 1007368.gpusrv-01 to start
qsub: job 1007368.gpusrv-01 ready

[abdas@gpusrv-01 ~]$ echo $CUDA_VISIBLE_DEVICES
MIG-1587894d-e0db-5f61-8b47-9aac9ed49baf,MIG-b77baeb1-8280-5af4-be1c-0ad656409721
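
To make the difference between the two identifier formats concrete, here is a small sketch that classifies an identifier as the older tuple form or the newer per-instance UUID form, and joins the assigned UUIDs into a CUDA_VISIBLE_DEVICES value. The helper names and the regular expressions are assumptions for illustration only, and the tuple-format example string is constructed from the GPU UUID, GI ID, and CI ID shown earlier rather than copied from real output.

```python
import re

# Newer form: MIG-<MIG_UUID>, where the UUID has the usual 8-4-4-4-12 hex layout.
UUID_FORMAT = re.compile(
    r"^MIG-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)
# Older form: MIG-GPU-<GPU_UUID>/<GI_ID>/<CI_ID>.
TUPLE_FORMAT = re.compile(r"^MIG-GPU-[^/]+/\d+/\d+$")


def classify_mig_identifier(identifier):
    """Return 'uuid', 'tuple', or 'unknown' for a MIG identifier string."""
    if UUID_FORMAT.match(identifier):
        return "uuid"
    if TUPLE_FORMAT.match(identifier):
        return "tuple"
    return "unknown"


def build_cuda_visible_devices(ci_uuids):
    """Join the CI UUIDs assigned to a job into a comma-separated value."""
    return ",".join(ci_uuids)


assigned = [
    "MIG-1587894d-e0db-5f61-8b47-9aac9ed49baf",
    "MIG-b77baeb1-8280-5af4-be1c-0ad656409721",
]
value = build_cuda_visible_devices(assigned)
print(value)
print([classify_mig_identifier(item) for item in value.split(",")])
# Illustrative tuple-format identifier (constructed, not real output).
print(classify_mig_identifier("MIG-GPU-ade9b969-b95f-7fc1-9075-860078a3b0b7/7/0"))
```

A check like this could be useful when debugging a job whose environment still carries identifiers in the older format.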

External Interface Changes

...
