...
Link to discussion on Developer Forum: https://community.openpbs.org/t/nvidia-mig-support/2382
Link to issue: <issue link if available>
Link to pull request: https://github.com/openpbs/openpbs/pull/2142
Link to pull request updating the MIG UUID format: "Switch to obtaining nvidia a100 MIG identifier/UUID from nvidia-smi -L directly, rather than constructing tuple format" (Pull Request #2452, openpbs/openpbs)
Overview
From the NVIDIA documentation, "the new Multi-Instance GPU (MIG) feature allows the NVIDIA A100 GPU to be securely partitioned into up to seven separate GPU Instances for CUDA applications".
...
The cgroups hook currently loads the GPU information via nvidia-smi. At this point, it also notes whether MIG is enabled on any GPUs. If a GPU has MIG enabled, the hook looks up its GIs and replaces the physical GPU with the GIs it finds.
This means that if a node has a MIG GPU split into 7 GIs, the hook replaces the 1 physical GPU with the 7 GIs, and ngpus will be 7.
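The replacement described above can be illustrated with a minimal Python sketch. It is not the hook's actual code: the `expand_mig_devices` function, the dictionary layout, and the device labels are hypothetical, chosen only to show how one MIG-enabled GPU with 7 GIs yields an ngpus of 7.

```python
def expand_mig_devices(gpus):
    """Return the schedulable devices, replacing each MIG-enabled GPU with its GIs.

    `gpus` is a hypothetical list of dicts; the real hook keeps its own structures.
    """
    devices = []
    for gpu in gpus:
        if gpu["mig_enabled"]:
            # The physical GPU is dropped and each GI is listed in its place.
            devices.extend(f"gpu{gpu['index']}/gi{gi}" for gi in gpu["gi_ids"])
        else:
            # A GPU without MIG is still counted as a single device.
            devices.append(f"gpu{gpu['index']}")
    return devices


# One physical GPU split into 7 GIs (the GI IDs here are illustrative).
node = [{"index": 0, "mig_enabled": True, "gi_ids": [7, 8, 9, 10, 11, 12, 13]}]
devices = expand_mig_devices(node)
ngpus = len(devices)
```

With this input, `ngpus` is 7 rather than 1, matching the behavior described above.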
For a job to be able to use a GI, one or more CIs need to be created on the node for that GI. Follow the NVIDIA documentation on how to do this.
...
Code Block |
---|
# nvidia-smi -h
NVIDIA System Management Interface -- v450.51.06
# nvidia-smi mig -lci
+-------------------------------------------------------+
| Compute instances:                                    |
| GPU     GPU       Name             Profile   Instance |
|       Instance                       ID        ID     |
|         ID                                            |
|=======================================================|
|   0      7       MIG 1g.5gb           0         0     |
+-------------------------------------------------------+
|
It also uses the nvidia-smi -L command to list the UUID of each MIG device. The output of this command is used to set the $CUDA_VISIBLE_DEVICES environment variable, which specifies which CIs a particular job runs on.
Code Block |
---|
[abdas@gpusrv-01 ~]$ nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-ade9b969-b95f-7fc1-9075-860078a3b0b7)
MIG 1g.5gb Device 0: (UUID: MIG-1587894d-e0db-5f61-8b47-9aac9ed49baf)
MIG 1g.5gb Device 1: (UUID: MIG-b77baeb1-8280-5af4-be1c-0ad656409721)
MIG 1g.5gb Device 2: (UUID: MIG-d8b08635-c7ba-55aa-88ac-f86867aad854)
MIG 1g.5gb Device 3: (UUID: MIG-98782be8-73f5-5cb4-a225-02158827e203)
MIG 1g.5gb Device 4: (UUID: MIG-baeafc17-9af0-5df8-be81-36ee538070d9)
MIG 1g.5gb Device 5: (UUID: MIG-3f78d2a1-636f-5a9c-ade4-268bb246fbc0)
MIG 1g.5gb Device 6: (UUID: MIG-86c5989a-adb4-528d-ad03-c1f85b86c222)
|
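As a rough illustration of how such a listing could be turned into a list of MIG UUIDs, here is a minimal Python sketch. The regular expression and the `parse_mig_uuids` function are assumptions made for this example and are not taken from the hook. The sample text is an excerpt of the listing above.

```python
import re

# Matches lines such as "MIG 1g.5gb Device 0: (UUID: MIG-...)".
# Leading whitespace is allowed because the MIG lines may be indented.
MIG_LINE = re.compile(
    r"^\s*MIG\s+(?P<profile>\S+)\s+Device\s+(?P<index>\d+):\s+"
    r"\(UUID:\s+(?P<uuid>MIG-[0-9a-fA-F-]+)\)"
)


def parse_mig_uuids(listing):
    """Return the MIG device UUIDs found in `nvidia-smi -L` style text, in order."""
    uuids = []
    for line in listing.splitlines():
        match = MIG_LINE.match(line)
        if match:
            uuids.append(match.group("uuid"))
    return uuids


SAMPLE = """\
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-ade9b969-b95f-7fc1-9075-860078a3b0b7)
MIG 1g.5gb Device 0: (UUID: MIG-1587894d-e0db-5f61-8b47-9aac9ed49baf)
MIG 1g.5gb Device 1: (UUID: MIG-b77baeb1-8280-5af4-be1c-0ad656409721)
"""

mig_uuids = parse_mig_uuids(SAMPLE)
```

On a real node, the text would come from running nvidia-smi -L, for example through `subprocess.run`. The sketch parses a fixed string so that it can run without a GPU.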
CUDA_VISIBLE_DEVICES
Instead of being filled with the UUIDs of the GPUs, CUDA_VISIBLE_DEVICES will contain the UUIDs of the CIs.
Previously, compute instances were identified via the format MIG-<GPU-UUID>/<GI_ID>/<CI_ID>, but now each compute instance in each GI has its own UUID of the format MIG-<MIG_UUID>.
Code Block |
---|
[abdas@gpusrv-01 ~]$ qsub -I -lselect=ngpus=2
qsub: waiting for job 1007368.gpusrv-01 to start
qsub: job 1007368.gpusrv-01 ready
[abdas@gpusrv-01 ~]$ echo $CUDA_VISIBLE_DEVICES
MIG-1587894d-e0db-5f61-8b47-9aac9ed49baf,MIG-b77baeb1-8280-5af4-be1c-0ad656409721
|
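The following minimal Python sketch shows how a comma-separated CUDA_VISIBLE_DEVICES value of this shape could be assembled, and how the older tuple-style identifiers could be told apart from the new per-device UUIDs. Both helper functions are hypothetical names introduced for illustration. The sketch assumes the MIG devices for the job have already been selected, and the legacy identifier used in the check is a made-up placeholder.

```python
import re

# Older tuple-style identifier: MIG-<GPU-UUID>/<GI_ID>/<CI_ID>,
# where the GPU UUID itself starts with "GPU-".
LEGACY_MIG_ID = re.compile(r"^MIG-GPU-[^/]+/\d+/\d+$")


def is_legacy_mig_id(identifier):
    """Return True if `identifier` uses the older tuple-style format."""
    return LEGACY_MIG_ID.match(identifier) is not None


def cuda_visible_devices(assigned_uuids):
    """Join the MIG UUIDs assigned to a job into a CUDA_VISIBLE_DEVICES value."""
    return ",".join(assigned_uuids)


# The two MIG devices assigned to the job in the example session above.
assigned = [
    "MIG-1587894d-e0db-5f61-8b47-9aac9ed49baf",
    "MIG-b77baeb1-8280-5af4-be1c-0ad656409721",
]
value = cuda_visible_devices(assigned)

# Placeholder identifier in the older format, for comparison only.
legacy_example = "MIG-GPU-00000000-0000-0000-0000-000000000000/7/0"
```

Here `value` is the same string that the job prints for $CUDA_VISIBLE_DEVICES in the session above, and `is_legacy_mig_id` accepts only the placeholder in the older format.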
External Interface Changes
...