GPU Nodes

Introduction

We currently have 4 GPUs installed on some of our older ARCHIE nodes, as outlined below:

  node297: 16 core CPU, 64GB RAM, NVIDIA V100 (32GB)
  node298: 16 core CPU, 64GB RAM, NVIDIA V100 (32GB)
  node299: 16 core CPU, 64GB RAM, NVIDIA V100 (32GB)
  node300: 16 core CPU, 64GB RAM, NVIDIA V100 (32GB)

The NVIDIA V100 has 5120 CUDA cores and 640 Tensor cores.

The GPU partition

These are made available via the gpu partition in SLURM and can be accessed by including the following line in your job script:

 #SBATCH --partition=gpu

Requesting access to the GPU partition

Contact Support to request access to the GPU partition.

CUDA modules

You will also need to load an appropriate CUDA module, e.g.

 module load cuda/10.0.130

To see currently available CUDA modules, type:

 module avail cuda
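
Once a suitable CUDA module is loaded, you can confirm that the toolkit is available on your path by querying the version of the nvcc compiler, e.g.

 module load cuda/10.0.130
 nvcc --version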

GPU nodes

As mentioned above, the GPUs are installed in our 16-core Intel Xeon E5-2660 (2.2GHz) nodes, each of which has 64GB RAM.

Allocation

It is currently only possible to run a GPU job on a single node, with a maximum of two running jobs at any one time. In other words, multi-node GPU jobs are not currently permitted.
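
For example, to check how many GPU jobs you currently have running, you can query the queue for the gpu partition using the standard SLURM squeue command:

 squeue --user=$USER --partition=gpu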

Exclusive access by default

By default, whole nodes are allocated for GPU jobs. Even if fewer than 16 cores are requested, the following is assumed:

 #SBATCH --ntasks=16
 #SBATCH --exclusive

Therefore, core-hours are calculated based on 16 cores being allocated.
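
For example, a GPU job that runs for 10 hours is charged 10 x 16 = 160 core-hours, even if it only uses a single core.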

Maximum runtime

Maximum runtime is currently 48 hours.
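
For example, to request the full 48 hours in your job script:

 #SBATCH --time=48:00:00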

Viewing GPU usage

GPU nodes

To see what GPU nodes are currently in use, simply type

 sinfo

which may produce output similar to the following:

   gpu          up 2-00:01:00      2    alloc node[298-299]
   gpu          up 2-00:01:00      2     idle node[297,300]

In this example, node298 & node299 are in use and node297 & node300 are free.
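
To restrict the output to the GPU nodes only, you can ask sinfo for just the gpu partition:

 sinfo --partition=gpu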

GPU processes

To see what processes are running on a GPU, do the following:

  1. ssh to the relevant node

    ssh node298
    
  2. execute nvidia-smi

    nvidia-smi
    

This will produce output similar to the following:

  +-----------------------------------------------------------------------------+
  | NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
  |-------------------------------+----------------------+----------------------+
  | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  |===============================+======================+======================|
  |   0  Tesla V100-PCIE...  Off  | 00000000:0A:00.0 Off |                    0 |
  | N/A   41C    P0   174W / 250W |   9548MiB / 32480MiB |     98%      Default |
  +-------------------------------+----------------------+----------------------+

  +-----------------------------------------------------------------------------+
  | Processes:                                                       GPU Memory |
  |  GPU       PID   Type   Process name                             Usage      |
  |=============================================================================|
  |    0     47707      C   python                                      9537MiB |
  +-----------------------------------------------------------------------------+

In this example, we can see that 98% of the GPU is currently being utilised, with around 9.5GB of the 32GB of GPU memory in use.
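
To monitor GPU usage continuously rather than as a single snapshot, you can wrap nvidia-smi in the standard watch utility, e.g. refreshing every 5 seconds:

 watch -n 5 nvidia-smi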

Sample job script

The sample job script below is also available at:

   /opt/software/job-scripts/gpu-singlenode-exclusive.sh


   #!/bin/bash

   #======================================================
   #
   # Job script for running a parallel job on a single gpu node
   #
   #======================================================

   #======================================================
   # Propagate environment variables to the compute node
   #SBATCH --export=ALL
   #
   # Run in the gpu partition (queue)
   #SBATCH --partition=gpu
   #
   # Specify project account
   #SBATCH --account=testing
   #
   # No. of tasks required 
   # (not strictly necessary: 16 will be allocated anyway)        
   #SBATCH --ntasks=16
   #
   # Ensure the node is not shared with another job
   # (not strictly necessary: exclusivity enforced anyway)     
   #SBATCH --exclusive
   #
   # Specify (hard) runtime (HH:MM:SS)
   #SBATCH --time=01:00:00
   #
   # Job name
   #SBATCH --job-name=gpu_test
   #
   # Output file
   #SBATCH --output=slurm-%j.out
   #======================================================

   module purge
   module load cuda/10.0.130

   #=========================================================
   # Prologue script to record job details
   # Do not change the line below
   #=========================================================
   /opt/software/scripts/job_prologue.sh 
   #----------------------------------------------------------

   # Modify the line below to run your program
   mpirun -np $SLURM_NPROCS myprogram.exe

   #=========================================================
   # Epilogue script to record job endtime and runtime
   # Do not change the line below
   #=========================================================
   /opt/software/scripts/job_epilogue.sh 
   #----------------------------------------------------------
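
Assuming you have copied the script to your working directory and replaced myprogram.exe with your own executable, the job can be submitted and monitored in the usual way:

   sbatch gpu-singlenode-exclusive.sh
   squeue --user=$USER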