GPU Nodes

Introduction

We currently have 4 GPUs installed on some of our older ARCHIE nodes, as outlined below:

  node297: 16 core CPU, 64GB RAM, NVIDIA V100 (32GB)
  node298: 16 core CPU, 64GB RAM, NVIDIA V100 (32GB)
  node299: 16 core CPU, 64GB RAM, NVIDIA V100 (32GB)
  node300: 16 core CPU, 64GB RAM, NVIDIA P100 (16GB)

The GPU partition

These are made available via the gpu partition in SLURM and can be accessed by supplying the following line in a job script:

 #SBATCH --partition=gpu
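
For quick interactive tests, you can also request a GPU directly with srun (a minimal sketch, assuming interactive jobs are permitted on the gpu partition):

 srun --partition=gpu --gres=gpu:1 --ntasks=1 --pty bash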

CUDA modules

You will also need to load an appropriate CUDA module, e.g.

 module load cuda/10.0.130

To see currently available CUDA modules, type:

 module avail cuda
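
To confirm that the module has loaded correctly, you can check the CUDA compiler version (this assumes the module places nvcc on your PATH, which is the usual behaviour):

 module load cuda/10.0.130
 nvcc --version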

Requesting access to the GPU partition

Contact Support to request access to the GPU partition

Selecting a particular GPU

If you need to specify a particular GPU, for example a V100 because it has more memory, then you can use the gres (Generic Resource) feature of SLURM as follows:

 #SBATCH --partition=gpu --gres="gpu:V100:1"

Alternatively, if you need to select the P100, then use:

 #SBATCH --partition=gpu --gres="gpu:P100:1"

If you do not supply a "gres" option, then you will simply be allocated any available GPU.
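
Putting this together, a minimal job header requesting a single V100 might look like the following (a sketch only; the account name is a placeholder and should be replaced with your own project account):

 #SBATCH --partition=gpu
 #SBATCH --gres="gpu:V100:1"
 #SBATCH --ntasks=16
 #SBATCH --time=01:00:00
 #SBATCH --account=myproject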

GPU nodes

As mentioned above, the GPUs are installed in our 16 core nodes, each of which has 64GB RAM.

Exclusive access

You can request a whole node, including the GPU, using the --exclusive flag in the usual way, i.e.

 #SBATCH --ntasks=16
 #SBATCH --exclusive

Shared access

If you wish to share a node, simply omit the --exclusive flag and request the required number of CPU tasks (ntasks < 16):

 #SBATCH --ntasks=8

This of course means that you may be sharing the GPU with another user.
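
For example, a shared job requesting 8 CPU tasks and any available GPU might use a header such as the following (a sketch only; adjust the task count and runtime to suit your job):

 #SBATCH --partition=gpu
 #SBATCH --ntasks=8
 #SBATCH --time=01:00:00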

Viewing GPU usage

GPU nodes

To see what GPU nodes are currently in use, simply type

 sinfo

which may produce output similar to the following:

   PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
   gpu          up 2-00:01:00      2    mix node[298-299]
   gpu          up 2-00:01:00      2   idle node[297,300]

In this example, node298 & node299 are in partial (shared) use and node297 & node300 are free.
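
To restrict the output to the gpu partition, or to see which jobs are currently running on it, you can pass the partition name to sinfo and squeue:

 sinfo --partition=gpu
 squeue --partition=gpu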

GPU processes

To see what processes are running on a GPU, do the following:

  1. ssh to the relevant node

    ssh node298
    
  2. execute nvidia-smi

    nvidia-smi
    

This will produce output similar to the following:

  +-----------------------------------------------------------------------------+
  | NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
  |-------------------------------+----------------------+----------------------+
  | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  |===============================+======================+======================|
  |   0  Tesla V100-PCIE...  Off  | 00000000:0A:00.0 Off |                    0 |
  | N/A   41C    P0   174W / 250W |   9548MiB / 32480MiB |     98%      Default |
  +-------------------------------+----------------------+----------------------+

  +-----------------------------------------------------------------------------+
  | Processes:                                                       GPU Memory |
  |  GPU       PID   Type   Process name                             Usage      |
  |=============================================================================|
  |    0     47707      C   python                                      9537MiB |
  +-----------------------------------------------------------------------------+

In this example, we can see that 98% of the GPU is currently in use, with approximately 9.5GB of its 32GB of memory allocated.
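
If you want to monitor GPU usage over time rather than take a single snapshot, nvidia-smi can refresh its output at a fixed interval (every 5 seconds in this example); press Ctrl-C to stop:

 nvidia-smi -l 5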

Sample job script

The sample job script below is also available at:

   /opt/software/job-scripts/gpu-singlenode-exclusive.sh


   #!/bin/bash

   #======================================================
   #
   # Job script for running a parallel job on a single gpu node
   #
   #======================================================

   #======================================================
   # Propagate environment variables to the compute node
   #SBATCH --export=ALL
   #
   # Run in the gpu partition (queue)
   #SBATCH --partition=gpu
   #
   # Specify project account
   #SBATCH --account=testing
   #
   # No. of tasks required (max. of 16)
   #SBATCH --ntasks=16
   #
   # If required, ensure the node is not shared with another job
   #SBATCH --exclusive
   #
   # Use the appropriate line below to select the desired GPU (if required)
   #
   # For P100-16GB (node300)
   ##SBATCH --gres="gpu:P100:1" 
   #
   # For V100-32GB (nodes297-299)
   ##SBATCH --gres="gpu:V100:1" 
   #
   # Specify (hard) runtime (HH:MM:SS)
   #SBATCH --time=01:00:00
   #
   # Job name
   #SBATCH --job-name=gpu_test
   #
   # Output file
   #SBATCH --output=slurm-%j.out
   #======================================================

   module purge
   module load cuda/10.0.130

   #=========================================================
   # Prologue script to record job details
   # Do not change the line below
   #=========================================================
   /opt/software/scripts/job_prologue.sh 
   #----------------------------------------------------------

   # Modify the line below to run your program
   mpirun -np $SLURM_NPROCS myprogram.exe

   #=========================================================
   # Epilogue script to record job endtime and runtime
   # Do not change the line below
   #=========================================================
   /opt/software/scripts/job_epilogue.sh 
   #----------------------------------------------------------
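
A typical workflow is to copy the sample script into your working directory, edit it to run your own program, submit it with sbatch and then check its status with squeue:

 cp /opt/software/job-scripts/gpu-singlenode-exclusive.sh .
 sbatch gpu-singlenode-exclusive.sh
 squeue -u $USER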