...
Link to discussion on Developer Forum: https://community.openpbs.org/t/nvidia-mig-support/2382
Link to issue: <issue link if available>
Link to pull request: https://github.com/openpbs/openpbs/pull/2142
Link to pull request updating the MIG UUID format: "Switch to obtaining nvidia a100 MIG identifier/UUID from nvidia-smi -L directly, rather than constructing tuple format" (Pull Request #2452, openpbs/openpbs)
Overview
From the NVIDIA documentation: "the new Multi-Instance GPU (MIG) feature allows the NVIDIA A100 GPU to be securely partitioned into up to seven separate GPU Instances for CUDA applications".
...
Code Block |
---|
# nvidia-smi -h
NVIDIA System Management Interface -- v450.51.06
# nvidia-smi mig -lci
+-------------------------------------------------------+
| Compute instances: |
| GPU GPU Name Profile Instance |
| Instance ID ID |
| ID |
|=======================================================|
| 0 7 MIG 1g.5gb 0 0 |
+-------------------------------------------------------+
|
It also uses the nvidia-smi -L command to list the UUID of each MIG device. The output of this command is used to update the $CUDA_VISIBLE_DEVICES environment variable, which specifies the CIs a particular job runs on.
Code Block |
---|
[abdas@gpusrv-01 ~]$ nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-ade9b969-b95f-7fc1-9075-860078a3b0b7)
MIG 1g.5gb Device 0: (UUID: MIG-1587894d-e0db-5f61-8b47-9aac9ed49baf)
MIG 1g.5gb Device 1: (UUID: MIG-b77baeb1-8280-5af4-be1c-0ad656409721)
MIG 1g.5gb Device 2: (UUID: MIG-d8b08635-c7ba-55aa-88ac-f86867aad854)
MIG 1g.5gb Device 3: (UUID: MIG-98782be8-73f5-5cb4-a225-02158827e203)
MIG 1g.5gb Device 4: (UUID: MIG-baeafc17-9af0-5df8-be81-36ee538070d9)
MIG 1g.5gb Device 5: (UUID: MIG-3f78d2a1-636f-5a9c-ade4-268bb246fbc0)
MIG 1g.5gb Device 6: (UUID: MIG-86c5989a-adb4-528d-ad03-c1f85b86c222)
|
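The listing above can be turned into a per-GPU list of MIG UUIDs with a small parser. The following is a minimal sketch, not the code from the pull request: the function name `parse_mig_devices`, the two regular expressions, and the `SAMPLE` text are hypothetical, and the patterns assume the line layout shown in the sample output above (other driver versions may print a different layout).

```python
import re

# Hypothetical patterns, written against the sample `nvidia-smi -L` output above.
GPU_LINE = re.compile(r"^GPU\s+(\d+):.*\(UUID:\s*(GPU-[0-9a-f-]+)\)")
MIG_LINE = re.compile(
    r"^\s*MIG\s+(\S+)\s+Device\s+(\d+):\s*\(UUID:\s*(MIG-[0-9a-f-]+)\)"
)


def parse_mig_devices(listing):
    """Return {gpu_uuid: [mig_uuid, ...]} parsed from `nvidia-smi -L` text."""
    devices = {}
    current_gpu = None
    for line in listing.splitlines():
        gpu = GPU_LINE.match(line)
        if gpu:
            # A GPU line starts a new group; MIG lines that follow belong to it.
            current_gpu = gpu.group(2)
            devices[current_gpu] = []
            continue
        mig = MIG_LINE.match(line)
        if mig and current_gpu is not None:
            devices[current_gpu].append(mig.group(3))
    return devices


# Two MIG devices copied from the sample output above.
SAMPLE = """\
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-ade9b969-b95f-7fc1-9075-860078a3b0b7)
  MIG 1g.5gb Device 0: (UUID: MIG-1587894d-e0db-5f61-8b47-9aac9ed49baf)
  MIG 1g.5gb Device 1: (UUID: MIG-b77baeb1-8280-5af4-be1c-0ad656409721)
"""
```

Calling `parse_mig_devices(SAMPLE)` yields one key (the GPU UUID) whose value is the list of MIG UUIDs in listing order. In a real hook the text would come from running `nvidia-smi -L`, for example via `subprocess.run`.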
CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES is filled with the UUIDs of the CIs rather than the UUIDs of the GPUs.
Previously, compute instances were identified via the format MIG-GPU-<GPU_UUID>/<GI_ID>/<CI_ID>, but now each compute instance in each GI has its own UUID of the format MIG-<MIG_UUID>.
Code Block |
---|
[abdas@gpusrv-01 ~]$ qsub -I -lselect=ngpus=2
qsub: waiting for job 1007368.gpusrv-01 to start
qsub: job 1007368.gpusrv-01 ready
[abdas@gpusrv-01 ~]$ echo $CUDA_VISIBLE_DEVICES
MIG-1587894d-e0db-5f61-8b47-9aac9ed49baf,MIG-b77baeb1-8280-5af4-be1c-0ad656409721
|
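To illustrate the difference between the two identifier formats, the sketch below classifies an identifier as the old tuple style or the new UUID style and joins assigned identifiers into a CUDA_VISIBLE_DEVICES value. It is an assumption-laden example rather than the hook's actual logic: `classify_mig_id`, `build_cuda_visible_devices`, and both patterns are hypothetical names introduced here, and the new-style pattern assumes the 8-4-4-4-12 hexadecimal layout seen in the sample output.

```python
import re

# Hypothetical patterns for the two identifier formats described above.
NEW_STYLE = re.compile(r"^MIG-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$")
LEGACY_STYLE = re.compile(r"^MIG-GPU-[0-9a-f-]+/\d+/\d+$")


def classify_mig_id(identifier):
    """Return 'uuid', 'tuple', or 'unknown' for a MIG identifier string."""
    if NEW_STYLE.match(identifier):
        return "uuid"    # MIG-<MIG_UUID>
    if LEGACY_STYLE.match(identifier):
        return "tuple"   # MIG-GPU-<GPU_UUID>/<GI_ID>/<CI_ID>
    return "unknown"


def build_cuda_visible_devices(mig_ids):
    """Join the identifiers assigned to a job into a comma-separated value."""
    return ",".join(mig_ids)


# The two identifiers from the interactive job shown above.
assigned = [
    "MIG-1587894d-e0db-5f61-8b47-9aac9ed49baf",
    "MIG-b77baeb1-8280-5af4-be1c-0ad656409721",
]
value = build_cuda_visible_devices(assigned)
```

Here `value` is the same comma-separated string that the interactive job echoes for a request of ngpus=2.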
External Interface Changes
...