GPU Nodes

Introduction

We currently have 24 NVidia A100 GPUs and 16 P100 GPUs available on ARCHIE as follows:

  • 8 NVidia A100 (40GB on board RAM, 6912 CUDA cores, 432 3rd gen. Tensor cores)
  • 16 NVidia A100SXM (80GB on board RAM, 6912 CUDA cores, 432 3rd gen. Tensor cores, 4-way NVLink)
  • 16 NVidia P100 (16GB on board RAM, 3584 CUDA cores)

For more information on the specification of these GPUs, see NVIDIA's A100 and P100 product pages.

Configuration

A100 (40GB RAM)

The A100 (40GB RAM) GPUs are installed four per server (two Lenovo SR670 40-core Xeon Gold 5218R servers). AI workloads will be prioritised on these GPUs. They are available in the following configurations:

  gpu[01-02]: A100 (40GB GPU RAM), 10 CPU cores, 96GB system RAM

A100SXM (80GB RAM)

The A100SXM (80GB RAM) GPUs are installed four per server (four Dell XE8545 64-core AMD EPYC 7513 servers). AI workloads will be prioritised on these GPUs. They are available in the following configurations:

  gpu[03-06]: A100SXM (80GB GPU RAM), 16 CPU cores, 128GB system RAM

Note that by default only one GPU will be allocated per job; however, access to a full 4-way A100SXM server can be requested (i.e. to run a 4-way GPU job).

Requesting permission to run a 4-way A100SXM GPU job

Contact Support to request permission to run on more than one A100SXM GPU per job.

P100 (16GB RAM)

The P100 (16GB RAM) GPUs are installed one per server (16 Dell PowerEdge C6220 16-core Xeon E5-2660 servers). HPC calculation workloads will be prioritised on these GPUs, e.g. Molecular Dynamics or CFD. They are available in the following configurations:

  wee[01-16]: P100 (16GB GPU RAM), 16 CPU cores, 64GB system RAM

Accessing A100 GPUs (Optimised for AI/ML)

Both the A100 and A100SXM GPUs are made available via the gpu partition:

 #SBATCH --partition=gpu --gpus=1

Requesting access to the gpu partition

Contact Support to request access to the gpu partition.

By default, for both types of GPU, you will be allocated 10 CPU cores with 80GB of system RAM per GPU.
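
Once a job starts, you can confirm what has actually been allocated. The commands below are a minimal sketch to run inside the job (e.g. at the top of a job script or in an interactive session); SLURM normally restricts CUDA_VISIBLE_DEVICES to the GPU(s) allocated to the job:

  nvidia-smi -L                 # list the GPU(s) visible to this job
  echo $CUDA_VISIBLE_DEVICES    # normally limited by SLURM to the allocated GPU(s)
  echo $SLURM_CPUS_ON_NODE      # number of CPU cores allocated on the node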

Requesting any type of GPU

To request a GPU (either type of A100), use the following:

job-script

Add the following to your job script:

 #SBATCH --partition=gpu --gpus=1

interactive job

Use the command below to obtain an interactive session and run a job directly on a GPU node interactively.

For full details on obtaining an interactive session, see "Interactive Jobs".

 srun --account=my-account-id --partition=gpu --gpus=1 --time=6:00:00 --x11 --pty bash

Modify the account name and time as appropriate. Once the interactive session has been allocated, any necessary modules will need to be loaded from the command line.
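
For example, to load the NVIDIA SDK and Anaconda modules referred to elsewhere on this page and confirm that the allocated GPU is visible (a minimal sketch; load whichever module versions you actually need):

  module load nvidia/sdk/23.3
  module load anaconda/python-3.9.7/2021.11
  nvidia-smi    # confirm that the allocated GPU is visible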

Requesting a specific type of GPU

You can request a particular type of GPU using the GRES (generic resource) feature of SLURM:

job-script for A100

Add the following to your job script:

 #SBATCH --partition=gpu --gres=gpu:A100

Note that by default you are allocated 8GB of system RAM per core (80GB per GPU job). However, this can be increased to 9.6GB per core (96GB per GPU) using:

 #SBATCH --partition=gpu --gres=gpu:A100 --mem-per-cpu=9600

job-script for A100SXM

Add the following to your job script:

 #SBATCH --partition=gpu --gres=gpu:A100SXM --ntasks=1 --cpus-per-task=16

Note that the "--ntasks=1 --cpus-per-task=16" is optional; by default you will be allocated 10 cores. However, there are 16 cores available per GPU on the A100SXM servers. Assuming that any parallelisation uses shared memory (and not MPI), this is the recommended way of accessing the maximum number of CPU cores per job.
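
As an illustration, a shared-memory (e.g. OpenMP) program would then pick up all 16 allocated cores by exporting the thread count from the SLURM environment, as the sample job scripts at the end of this page do:

 # in the job script, after the #SBATCH directives and module loads:
 export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK    # use all 16 allocated cores for threading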

interactive job for A100

Use the command below to obtain an interactive session and run a job directly on an A100 GPU node interactively.

For full details on obtaining an interactive session, see "Interactive Jobs".

 srun --account=my-account-id --partition=gpu --gres=gpu:A100 --time=6:00:00 --x11 --pty bash

Once the interactive session has been allocated, any necessary modules will need to be loaded from the command line.

Note that by default you are allocated 8GB of system RAM per core (80GB per GPU job). However, this can be increased to 9.6GB per core (96GB per GPU) using:

 srun --account=my-account-id --partition=gpu --gres=gpu:A100  --mem-per-cpu=9600 --time=6:00:00 --x11 --pty bash

interactive job for A100SXM

Use the command below to obtain an interactive session and run a job directly on an A100SXM GPU node interactively.

For full details on obtaining an interactive session, see "Interactive Jobs".

 srun --account=my-account-id --partition=gpu  --gres=gpu:A100SXM  --ntasks=1 --cpus-per-task=16 --time=6:00:00 --x11 --pty bash

Once the interactive session has been allocated, any necessary modules will need to be loaded from the command line.

Accessing P100 GPUs (for HPC calculations)

The P100 GPUs are made available via the gpu-p100 partition:

 #SBATCH --partition=gpu-p100

Requesting access to the gpu-p100 partition

Contact Support to request access to the gpu-p100 partition.

By default you will be allocated a whole P100-enabled compute node, with 16 CPU cores and 64GB of system RAM per GPU.

Undergraduate and taught postgraduate users will run on these GPUs using the default gpu Quality of Service.

Research users will be able to access these GPUs using the gpu-p100 Quality of Service.
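
For example, a research user's job script might combine the partition and QoS in a single request (a minimal sketch; the GROMACS example at the end of this page does the same):

 #SBATCH --partition=gpu-p100 --qos=gpu-p100 --ntasks=1 --cpus-per-task=16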

Runtime limits

Quality of Service (QoS)

The number of GPUs that can be used by an individual user, and the maximum runtime, are controlled using the "Quality of Service (QoS)" feature of SLURM. By default, jobs run on the GPUs using the "gpu" QoS. Running on more than one A100SXM GPU is made possible by the gpu-A100SXM QoS, to which users must request access.

gpu QoS (default)

By default, GPU users are allocated to the gpu QoS which imposes the following limits:

  • 1 GPU per job
  • Max 2 GPUs per user (i.e. a maximum of two jobs running at any one time)
  • 48 hr maximum runtime

gpu-A100SXM QoS (4-way GPU jobs)

Access to this QoS must be requested and will only be granted if sufficient justification is provided. This QoS allows access to a full 4-way A100SXM server to enable 4-way GPU jobs. The following limits apply:

  • Up to 4 GPUs per job
  • Max 4 GPUs per user (i.e. one 4-GPU job or four 1-GPU jobs)
  • 48 hr maximum runtime

The gpu-A100SXM QoS can be used as follows:

job-script using the A100SXM QoS

To access 4 GPUs along with the maximum of 64 CPUs (shared-memory threads), add the following to your job script (for example):

 #SBATCH --partition=gpu --qos=gpu-A100SXM --gres=gpu:A100SXM:4 --ntasks=1 --cpus-per-task=64

To access 4 GPUs with 1 task per GPU (and 16 threads per task), add the following to your job script:

 #SBATCH --partition=gpu --qos=gpu-A100SXM --gres=gpu:A100SXM:4 --ntasks=4 --cpus-per-task=16

interactive job using the A100SXM QoS

Use the command below to obtain an interactive session and run a job interactively on an A100SXM server with 4 GPUs and 64 shared-memory threads.

For full details on obtaining an interactive session, see "Interactive Jobs".

 srun --account=my-account-id --partition=gpu  --qos=gpu-A100SXM --gres=gpu:A100SXM:4  --ntasks=1 --cpus-per-task=64 --time=6:00:00 --x11 --pty bash

Or with 4 GPUs with 1 task per GPU and 16 threads per task:

 srun --account=my-account-id --partition=gpu  --qos=gpu-A100SXM --gres=gpu:A100SXM:4  --ntasks=4 --cpus-per-task=16 --time=6:00:00 --x11 --pty bash

Modify the account name and time as appropriate. Once the interactive session has been allocated, any necessary modules will need to be loaded from the command line.

gpu-priority QoS

This QoS is for SL1 priority-access customers. In addition to the standard gpu QoS limits, it provides:

  • Higher SLURM priority
  • Maximum runtime of 4 days

Use:

 --qos=gpu-priority
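
For example, a job script requesting a single A100 GPU with priority access and the extended 4-day runtime might contain the following (a sketch; adjust the account and time as required):

 #SBATCH --partition=gpu --qos=gpu-priority --gpus=1
 #SBATCH --account=my-account-id
 #SBATCH --time=4-00:00:00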

gpu-p100 QoS

This QoS is intended for research users who are able to take advantage of the P100 GPUs. The following limits apply:

  • A maximum of 4 jobs
  • 48 hr maximum runtime

Use:

 --qos=gpu-p100

Viewing GPU usage

Allocated GPU nodes

To see what GPU nodes are currently in use, simply type

  sinfo     (or sinfo -p gpu)

which may produce output similar to the following:

PARTITION AVAIL  TIMELIMIT   NODES  CPUS(A/I/O/T)  STATE NODELIST
  gpu       up   5-00:01:00    1       40/0/0/40   alloc gpu01
  gpu       up   5-00:01:00    2     46/58/0/104     mix gpu[02,05]
  gpu       up   5-00:01:00    3     0/192/0/192    idle gpu[03-04,06]

which tells us that gpu01 is fully occupied, gpu[02,05] are partially occupied and gpu[03-04,06] are empty.

To view what gpu jobs are in the queue, type:

  squeue -p gpu    (or  sq -p gpu)

which will produce something like:

  JOBID PARTITION       QOS PRIORITY            NAME     USER ST          START_TIME        TIME   TIME_LEFT  NODES CPUS NODELIST(REASON)
8537544       gpu gpu-prior     1991            bash kXXXXXX0  R 2023-06-14T11:11:08       36:24     5:23:36      1   10 gpu02
8537517       gpu       gpu     1137    RBD_TAG_COOH qXXXXXX3  R 2023-06-14T09:16:39     2:30:53  1-21:29:07      1   10 gpu01
8537515       gpu       gpu     1137    RBD_TAG_COOH qXXXXXX3  R 2023-06-14T09:08:30     2:39:02    21:20:58      1   10 gpu02
8536351       gpu       gpu     1131 3DHSRONN_hypero pXXXXXX8  R 2023-06-12T16:52:47  1-18:54:45     5:05:15      1   10 gpu05
8536316       gpu       gpu     1131        Hyperopt pXXXXXX8  R 2023-06-12T16:08:51  1-19:38:41     4:21:19      1   10 gpu02
8537074       gpu       gpu     1008         bs=2048 bXXXXXX9  R 2023-06-13T15:42:54    20:04:38  1-03:55:22      1   10 gpu01
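
To list only your own GPU jobs, add the standard squeue user filter, for example:

  squeue -p gpu -u $USER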

GPU processes on a node

To see what processes are running on the GPUs of an allocated node, ssh to that node and use the following command:

    nvidia-smi

This will produce output similar to below:

 +-----------------------------------------------------------------------------+
 | NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
 |-------------------------------+----------------------+----------------------+
 | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |                               |                      |               MIG M. |
 |===============================+======================+======================|
 |   0  NVIDIA A100-PCI...  On   | 00000000:06:00.0 Off |                    0 |
 | N/A   73C    P0   254W / 250W |  39957MiB / 40960MiB |     63%      Default |
 |                               |                      |             Disabled |
 +-------------------------------+----------------------+----------------------+
 |   1  NVIDIA A100-PCI...  On   | 00000000:2F:00.0 Off |                    0 |
 | N/A   27C    P0    32W / 250W |      0MiB / 40960MiB |      0%      Default |
 |                               |                      |             Disabled |
 +-------------------------------+----------------------+----------------------+
 |   2  NVIDIA A100-PCI...  On   | 00000000:86:00.0 Off |                    0 |
 | N/A   34C    P0    76W / 250W |    891MiB / 40960MiB |     35%      Default |
 |                               |                      |             Disabled |
 +-------------------------------+----------------------+----------------------+
 |   3  NVIDIA A100-PCI...  On   | 00000000:D8:00.0 Off |                    0 |
 | N/A   27C    P0    34W / 250W |  30573MiB / 40960MiB |     26%      Default |
 |                               |                      |             Disabled |
 +-------------------------------+----------------------+----------------------+

 +-----------------------------------------------------------------------------+
 | Processes:                                                                  |
 |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
 |        ID   ID                                                   Usage      |
 |=============================================================================|
 |    0   N/A  N/A   2423375      C   ...-3.9.7/2021.11/bin/python    39955MiB |
 |    2   N/A  N/A   2331845      C   namd3                             889MiB |
 |    3   N/A  N/A   1386626      C   python                          30571MiB |
 +-----------------------------------------------------------------------------+

In this example, we can see that three of the GPUs on gpu01 are in use.
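
To monitor usage over time, the standard watch utility can be combined with nvidia-smi on the allocated node, for example (assuming your job was allocated gpu01):

  ssh gpu01
  watch -n 5 nvidia-smi    # refresh the GPU status every 5 seconds (Ctrl-C to exit)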

Installed software

The following software is installed. Other software can be installed upon request.

CUDA / NVidia SDK

The latest version of CUDA is provided by the NVIDIA HPC Software Development Kit (SDK).

To view available versions type:

  module avail nvidia

To load a specific version, load the appropriate SDK module:

 module load nvidia/sdk/23.3

The current latest version (23.3, March 2023) uses CUDA 11.8 by default; however, CUDA 12.0 is also available.
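
To check which CUDA toolkit a loaded SDK module provides (assuming the module puts nvcc on your PATH), for example:

  module load nvidia/sdk/23.3
  nvcc --version    # report the CUDA toolkit release supplied by the SDK module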

Python / Anaconda / ML & AI

Python is provided by means of the Anaconda environment. To see currently available versions, type:

 module avail anaconda

To load a specific version, load the appropriate module e.g.

 module load anaconda/python-3.9.7/2021.11

As far as possible, we try to use Anaconda to provide access to the most common ML & AI tools, for example:

  • Jupyter notebooks
  • Scikit-learn
  • Spyder IDE
  • Tensorflow
  • Torch / pyTorch

To view the list of packages that can be made available via Anaconda (upon request) visit: https://docs.anaconda.com/anaconda/packages/py3.8_linux-64.
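
As a quick check that the ML frameworks can see a GPU, the following can be run on a GPU node from within the loaded Anaconda environment (a sketch, assuming PyTorch with CUDA support is present in that environment):

  module load anaconda/python-3.9.7/2021.11
  python -c "import torch; print(torch.cuda.is_available())"    # should print True on a GPU node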

Sample job scripts

Any GPU

This sample job-script is for running on any available GPU with the default gpu QoS, with 10 CPU cores allocated by default and 8GB RAM per core (80GB in total).

The job script below is also available on ARCHIE at: /opt/software/job-scripts/gpu.sh

  #!/bin/bash

  #=================================================================
  #
  # Job script for running a job on a single GPU (any available GPU)
  #
  #=================================================================

  #======================================================
  # Propagate environment variables to the compute node
  #SBATCH --export=ALL
  #
  # Run in the gpu partition (queue) with any GPU
  #SBATCH --partition=gpu --gpus=1
  #
  # Specify project account (replace as required)
  #SBATCH --account=my-account-id
  #
  # Specify (hard) runtime (HH:MM:SS)
  #SBATCH --time=01:00:00
  #
  # Job name
  #SBATCH --job-name=gpu_test
  #
  # Output file
  #SBATCH --output=slurm-%j.out
  #======================================================

  module purge
  module load nvidia/sdk/22.3
  module load anaconda/python-3.9.7/2021.11

  #Uncomment the following if you are running multi-threaded
  #export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
  #
  #=========================================================
  # Prologue script to record job details
  # Do not change the line below
  #=========================================================
  /opt/software/scripts/job_prologue.sh 
  #----------------------------------------------------------

  #Modify the line below to run your program. This is an example

  python myprogram.py

  #=========================================================
  # Epilogue script to record job endtime and runtime
  # Do not change the line below
  #=========================================================
  /opt/software/scripts/job_epilogue.sh 
  #----------------------------------------------------------
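
Assuming you have saved a copy of this script in your working directory as gpu.sh, it can be submitted and monitored in the usual way:

  sbatch gpu.sh       # submit the job to the scheduler
  squeue -u $USER     # check its position and status in the queue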

A100 (gpu QoS)

This sample job-script is for running on the 40GB A100 GPUs with the default gpu QoS, with 10 CPU cores allocated by default and the maximum of 9.6GB RAM per core requested (96GB in total).

The sample job script below is also available on ARCHIE at: /opt/software/job-scripts/gpu-A100.sh

  #!/bin/bash

  #======================================================
  #
  # Job script for running a job on a single A100 GPU

  #======================================================

  #======================================================
  # Propagate environment variables to the compute node
  #SBATCH --export=ALL
  #
  # Run in the gpu partition (queue)
  #SBATCH --partition=gpu 
  #
  # Request an A100 GPU
  #SBATCH --gres=gpu:A100 --mem-per-cpu=9600
  #
  # Specify project account (replace as required)
  #SBATCH --account=my-account-id
  #
  # Specify (hard) runtime (HH:MM:SS)
  #SBATCH --time=01:00:00
  #
  # Job name
  #SBATCH --job-name=gpu_test
  #
  # Output file
  #SBATCH --output=slurm-%j.out
  #======================================================

  module purge
  module load nvidia/sdk/22.3
  module load anaconda/python-3.9.7/2021.11

  #Uncomment the following if you are running multi-threaded
  #export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

  #=========================================================
  # Prologue script to record job details
  # Do not change the line below
  #=========================================================
  /opt/software/scripts/job_prologue.sh 
  #----------------------------------------------------------

  #Modify the line below to run your program. This is an example

  python myprogram.py


  #=========================================================
  # Epilogue script to record job endtime and runtime
  # Do not change the line below
  #=========================================================
  /opt/software/scripts/job_epilogue.sh 
  #----------------------------------------------------------

A100SXM (gpu QoS)

This sample job-script is for running on a single 80GB A100SXM GPU with the default gpu QoS, but with the maximum of 16 CPU cores allocated.

The sample job script below is also available on ARCHIE at: /opt/software/job-scripts/gpu-A100SXM_qos-A100SXM.sh

  #!/bin/bash

  #======================================================
  #
  # Job script for running a job on a single A100SXM GPU 
  #
  #======================================================

  #======================================================
  # Propagate environment variables to the compute node
  #SBATCH --export=ALL
  #
  # Run in the gpu partition (queue)
  #SBATCH --partition=gpu 
  #
  # Request an A100SXM GPU
  #SBATCH --gres=gpu:A100SXM --ntasks=1 --cpus-per-task=16 
  #
  # Specify project account (replace as required)
  #SBATCH --account=my-account-id
  #
  # Specify (hard) runtime (HH:MM:SS)
  #SBATCH --time=01:00:00
  #
  # Job name
  #SBATCH --job-name=gpu_test
  #
  # Output file
  #SBATCH --output=slurm-%j.out
  #======================================================

  module purge
  module load nvidia/sdk/22.3
  module load anaconda/python-3.9.7/2021.11

  #Uncomment the following if you are running multi-threaded
  #export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

  #=========================================================
  # Prologue script to record job details
  # Do not change the line below
  #=========================================================
  /opt/software/scripts/job_prologue.sh 
  #----------------------------------------------------------

  #Modify the line below to run your program. This is an example

  python myprogram.py


  #=========================================================
  # Epilogue script to record job endtime and runtime
  # Do not change the line below
  #=========================================================
  /opt/software/scripts/job_epilogue.sh 
  #----------------------------------------------------------

4-way A100SXM (gpu-A100SXM QoS)

This sample job-script is for running on four 80GB A100SXM GPUs using the gpu-A100SXM QoS, with the maximum of 64 CPU cores allocated.

The sample job script below is also available on ARCHIE at: /opt/software/job-scripts/gpu-A100SXM.sh

  #!/bin/bash

  #======================================================
  #
  # Job script for running a job on four A100SXM GPUs
  #
  #======================================================

  #======================================================
  # Propagate environment variables to the compute node
  #SBATCH --export=ALL
  #
  # Run in the gpu partition (queue)
  #SBATCH --partition=gpu 
  #
  # Request four A100SXM GPUs with 64 cores       
  #SBATCH --gres=gpu:A100SXM:4 --qos=gpu-A100SXM
  #
  #SBATCH --ntasks=1 --cpus-per-task=64 
  #
  # Specify project account (replace as required)
  #SBATCH --account=my-account-id
  #
  # Specify (hard) runtime (HH:MM:SS)
  #SBATCH --time=01:00:00
  #
  # Job name
  #SBATCH --job-name=gpu_test
  #
  # Output file
  #SBATCH --output=slurm-%j.out
  #======================================================

  module purge
  module load nvidia/sdk/22.3
  module load anaconda/python-3.9.7/2021.11

  #Uncomment the following if you are running multi-threaded
  #export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

  #=========================================================
  # Prologue script to record job details
  # Do not change the line below
  #=========================================================
  /opt/software/scripts/job_prologue.sh 
  #----------------------------------------------------------

  #Modify the line below to run your program. This is an example

  python myprogram.py


  #=========================================================
  # Epilogue script to record job endtime and runtime
  # Do not change the line below
  #=========================================================
  /opt/software/scripts/job_epilogue.sh 
  #----------------------------------------------------------

P100 (gpu-p100 QoS)

This sample job-script is for running HPC calculation workloads, e.g. MD or CFD applications, on the P100 GPUs using the gpu-p100 QoS.

The sample job script below is for running GROMACS, and is also available on ARCHIE at: /opt/software/job-scripts/gromacs-gpu-p100.sh

   #!/bin/bash

   #======================================================
   #
   # Job script for running GROMACS on a P100 gpu node 
   #
   #======================================================

   #======================================================
   # Propagate environment variables to the compute node
   #SBATCH --export=ALL
   #
   # Run in the gpu-p100 partition (queue)
   #SBATCH --partition=gpu-p100 --qos=gpu-p100
   #
   # Specify project account (replace as required)
   #SBATCH --account=my-account-id
   #
   # No. of tasks required (max. of 16)
   #SBATCH --ntasks=1 --cpus-per-task=16
   #
   # Specify (hard) runtime (HH:MM:SS)
   #SBATCH --time=01:00:00
   #
   # Job name
   #SBATCH --job-name=gromacs_test
   #
   # Output file
   #SBATCH --output=slurm-%j.out
   #======================================================

   module purge
   module load gromacs/intel-2022.2/2022.1-single

   export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

   #======================================================
   # Prologue script to record job details
   # Do not change the line below
   #======================================================
   /opt/software/scripts/job_prologue.sh  
   #------------------------------------------------------

   gmx mdrun -s gromacs-test.tpr

   #======================================================
   # Epilogue script to record job endtime and runtime
   # Do not change the line below
   #======================================================
   /opt/software/scripts/job_epilogue.sh 
   #------------------------------------------------------