...

Overview

From the NVIDIA documentation: "the new Multi-Instance GPU (MIG) feature allows the NVIDIA A100 GPU to be securely partitioned into up to seven separate GPU Instances for CUDA applications".

...

Code Block
# nvidia-smi -h
NVIDIA System Management Interface -- v450.51.06

# nvidia-smi mig -lci
+-------------------------------------------------------+
| Compute instances:                                    |
| GPU     GPU       Name             Profile   Instance |
|       Instance                       ID        ID     |
|         ID                                            |
|=======================================================|
|   0      7       MIG 1g.5gb           0         0     |
+-------------------------------------------------------+

It also uses the nvidia-smi -L command to list the UUID of each MIG device. The UUIDs from this output are used to set the $CUDA_VISIBLE_DEVICES environment variable, which specifies which compute instances (CIs) a particular job runs on.

Code Block
[abdas@gpusrv-01 ~]$ nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-ade9b969-b95f-7fc1-9075-860078a3b0b7)
  MIG 1g.5gb      Device  0: (UUID: MIG-1587894d-e0db-5f61-8b47-9aac9ed49baf)
  MIG 1g.5gb      Device  1: (UUID: MIG-b77baeb1-8280-5af4-be1c-0ad656409721)
  MIG 1g.5gb      Device  2: (UUID: MIG-d8b08635-c7ba-55aa-88ac-f86867aad854)
  MIG 1g.5gb      Device  3: (UUID: MIG-98782be8-73f5-5cb4-a225-02158827e203)
  MIG 1g.5gb      Device  4: (UUID: MIG-baeafc17-9af0-5df8-be81-36ee538070d9)
  MIG 1g.5gb      Device  5: (UUID: MIG-3f78d2a1-636f-5a9c-ade4-268bb246fbc0)
  MIG 1g.5gb      Device  6: (UUID: MIG-86c5989a-adb4-528d-ad03-c1f85b86c222)
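
As a rough illustration of how the listing above could be turned into a CUDA_VISIBLE_DEVICES value, here is a minimal Python sketch. It is not the actual implementation; the function names (parse_mig_devices, cuda_visible_devices) and the regular expressions are assumptions based only on the output format shown above, and the listing is passed in as text rather than captured from nvidia-smi.

```python
import re

# Assumed line shapes, taken from the `nvidia-smi -L` sample output above.
# GPU header, e.g. "GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-...)"
GPU_LINE = re.compile(r"^GPU\s+(\d+):.*\(UUID:\s*(GPU-[0-9a-f-]+)\)")
# Indented MIG device, e.g. "  MIG 1g.5gb  Device  0: (UUID: MIG-...)"
MIG_LINE = re.compile(
    r"^\s+MIG\s+(\S+)\s+Device\s+(\d+):\s*\(UUID:\s*(MIG-[0-9a-f-]+)\)"
)


def parse_mig_devices(listing):
    """Return {gpu_index: [mig_uuid, ...]} parsed from `nvidia-smi -L` text."""
    devices = {}
    current_gpu = None
    for line in listing.splitlines():
        gpu = GPU_LINE.match(line)
        if gpu:
            # Start collecting MIG devices under this physical GPU.
            current_gpu = int(gpu.group(1))
            devices[current_gpu] = []
            continue
        mig = MIG_LINE.match(line)
        if mig and current_gpu is not None:
            devices[current_gpu].append(mig.group(3))
    return devices


def cuda_visible_devices(mig_uuids):
    """Join MIG UUIDs into the comma-separated form CUDA expects."""
    return ",".join(mig_uuids)


# Hypothetical usage with the first two devices from the sample output.
SAMPLE = """\
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-ade9b969-b95f-7fc1-9075-860078a3b0b7)
  MIG 1g.5gb      Device  0: (UUID: MIG-1587894d-e0db-5f61-8b47-9aac9ed49baf)
  MIG 1g.5gb      Device  1: (UUID: MIG-b77baeb1-8280-5af4-be1c-0ad656409721)
"""
devices = parse_mig_devices(SAMPLE)
print(cuda_visible_devices(devices[0][:2]))
```

Taking the listing as a string keeps the parsing step separate from running nvidia-smi, so it can be checked on a machine without a GPU.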
  

CUDA_VISIBLE_DEVICES

Instead of being filled with the UUIDs of the GPUs, CUDA_VISIBLE_DEVICES will contain the UUIDs of the CIs.

Previously, compute instances were identified using the format MIG-GPU-<GPU_UUID>/<GI_ID>/<CI_ID>, but now each compute instance in each GI has its own UUID of the format MIG-<MIG_UUID>.
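
To make the difference between the two identifier formats concrete, the following Python sketch classifies a single CUDA_VISIBLE_DEVICES entry as old-style or new-style. It is illustrative only: the name classify_mig_id is hypothetical, the patterns assume lowercase hexadecimal UUIDs in the usual 8-4-4-4-12 layout (as in the nvidia-smi -L output on this page), and the old-style example string is constructed for the demo rather than taken from a real system.

```python
import re

# Assumption: UUIDs are lowercase hex in 8-4-4-4-12 form.
UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
# Old style: MIG-GPU-<GPU_UUID>/<GI_ID>/<CI_ID>
OLD_FORMAT = re.compile(rf"^MIG-GPU-({UUID})/(\d+)/(\d+)$")
# New style: MIG-<MIG_UUID>
NEW_FORMAT = re.compile(rf"^MIG-({UUID})$")


def classify_mig_id(identifier):
    """Return (kind, fields) where kind is 'old', 'new', or 'unknown'."""
    old = OLD_FORMAT.match(identifier)
    if old:
        return ("old", {
            "gpu_uuid": old.group(1),
            "gi_id": int(old.group(2)),
            "ci_id": int(old.group(3)),
        })
    new = NEW_FORMAT.match(identifier)
    if new:
        return ("new", {"mig_uuid": new.group(1)})
    return ("unknown", {})


# Constructed old-style example: a GPU UUID plus GI 7 / CI 0.
old_kind, old_fields = classify_mig_id(
    "MIG-GPU-ade9b969-b95f-7fc1-9075-860078a3b0b7/7/0"
)
# New-style example: a MIG device UUID.
new_kind, new_fields = classify_mig_id(
    "MIG-1587894d-e0db-5f61-8b47-9aac9ed49baf"
)
print(old_kind, old_fields["gi_id"], old_fields["ci_id"])
print(new_kind, new_fields["mig_uuid"])
```

The old format encodes the GPU, GI and CI explicitly, while the new format is an opaque UUID, so anything that needs the GI or CI for a new-style entry has to look it up elsewhere.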

Code Block
[abdas@gpusrv-01 ~]$ qsub -I -lselect=ngpus=2
qsub: waiting for job 1007368.gpusrv-01 to start
qsub: job 1007368.gpusrv-01 ready

[abdas@gpusrv-01 ~]$ echo $CUDA_VISIBLE_DEVICES
MIG-1587894d-e0db-5f61-8b47-9aac9ed49baf,MIG-b77baeb1-8280-5af4-be1c-0ad656409721

External Interface Changes

...