Submitting Batch Jobs

All jobs run on the Research Computing (HPC) systems should be submitted via the job queueing system (scheduler), Slurm. While a job can be submitted directly from the command line, the recommended method is to use a job script. These types of jobs are often referred to as batch jobs.

Job Script

A job script is a file which contains all the necessary instructions to submit a job to the scheduler. The job is submitted using the sbatch command in conjunction with the job script.

Software

Before submitting a job ensure the necessary software modules are loaded.
It is recommended that modules are loaded within a job script.
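For example, a job script would typically start from a clean environment and then load the software it needs (a minimal sketch; the LAMMPS module shown here is the one used in the sample script later on this page):

module purge                              # start from a clean environment
module load lammps/intel-2018.2/16Mar18   # load the software required by the job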

Default job sizes and runtimes

By default all jobs will be restricted to a single node (40 cores, 192 GB RAM) and will have a maximum runtime of 7 days (168 hours).

If you need to run on more than one node (>40 cores) then you must perform some test runs to provide evidence of how well your code scales with the number of cores (see the section on Scalability). To enable the test runs, the ARCHIE-WeSt support team will give you access to the scalability project ID.

Once you have supplied the test run data to the support team, they will grant your project permission to access more than one node, assuming it scales sufficiently well.

It is mandatory to supply an estimated runtime for your job (see below); however, jobs with a long run time will likely take longer to be dispatched by the scheduler (see Job Priority and Backfilling below). It is therefore advisable not to excessively overestimate the run time of your job.

Job Scripts

A sample LAMMPS job script (named lammps-slurm-multinode-exclusive.sh) is given below. Note that it is a parallel job which requests exclusive access to each node. More example job scripts can be found in the Example Job Scripts menu in the left panel.

Tip

Sample job scripts can also be found on ARCHIE at /opt/software/job-scripts

Example

/opt/software/job-scripts/lammps-slurm-multinode-exclusive.sh

#!/bin/bash

#======================================================
#
# Job script for running LAMMPS on multiple nodes
#
#======================================================

#======================================================
# Propagate environment variables to the compute node
#SBATCH --export=ALL
#
# Run in the standard partition (queue)
#SBATCH --partition=standard
#
# Specify project account
#SBATCH --account=testing
#
# No. of tasks required
#SBATCH --ntasks=80 
#
# Distribute processes in round-robin fashion for load balancing
#SBATCH --distribution=cyclic
#
# Ensure the nodes are not shared with other jobs
#SBATCH --exclusive
#
# Specify (hard) runtime (HH:MM:SS)
#SBATCH --time=01:00:00
#
# Job name
#SBATCH --job-name=lammps_test
#
# Output file
#SBATCH --output=slurm-%j.out
#======================================================

module purge
module load lammps/intel-2018.2/16Mar18

#======================================================
# Prologue script to record job details
#======================================================
/opt/software/scripts/job_prologue.sh  
#------------------------------------------------------

mpirun -np $SLURM_NPROCS lmp_mpi -i in.lj.10000

#======================================================
# Epilogue script to record job endtime and runtime
#======================================================
/opt/software/scripts/job_epilogue.sh 
#------------------------------------------------------

The above job script requests a number of cores (tasks), specifies the run time, gives the project account ID, loads the required software module and finally runs the program on the input file, with the output written to a newly created .out file with the specified name (slurm-%j.out, where %j is replaced by the job ID).

The job is submitted to the scheduler using the command:

sbatch lammps-slurm-multinode-exclusive.sh

The job will be placed in the queue until there are enough resources available to run it.
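You can check whether your job is still queued (state PD) or running (state R) using the standard Slurm command squeue, for example:

squeue -u $USER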

Note

Lines starting with #SBATCH are Slurm options

Job Run Time

The job run time is specified in the line:

#SBATCH --time=01:00:00

It is important to specify the run time in the correct format, HH:MM:SS, as above. Note:

  • if not specified, the default run time of the job is 7 days
  • the maximum job run time is 7 days (168 hours)
  • if the job exceeds the run time (specified or default) it is automatically terminated
  • for short jobs it is strongly advised to specify the job run time - this will enable your job to be scheduled more quickly (see "Backfilling" below).
  • always overestimate the run time by a modest amount
  • for long jobs the time specification is not so crucial (but still advisable)

Exceeding the Job Run Time

If the job exceeds the specified run time it will be automatically terminated. Therefore it is advisable to overestimate the run time. The maximum job run time is 7 days: --time=168:00:00

The job run time may be used by the Support Team for queue and system monitoring, as well as to increase the efficiency of use of the facility. For details see the section "Backfilling" below.

Job Name

The job name is specified in the line:

#SBATCH --job-name=lammps_test

In the above case the job name is lammps_test. It is advised not to use job names longer than 8 characters.

Partition (Queue)

The partition is specified in the line:

#SBATCH --partition=standard

The job partition specifies which nodes the job will run on - for example, the standard compute nodes or the bigmem nodes. Other schedulers may use the term queue instead of partition.

Resource Reservation

The standard compute nodes have 40 cores and 192 GB RAM, which equates to just over 4.5GB RAM per core. The scheduler enforces a one-to-one correspondence between the number of cores and the amount of RAM requested. In other words, a 10-core job will provide access to 45GB RAM, and a job which requests 45GB RAM will be allocated 10 cores and charged accordingly.
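For example, under this rule the following two requests are equivalent in terms of the resources allocated and charged (a sketch; a job script would use one or the other, not both):

#SBATCH --ntasks=10   # 10 cores, which also provides access to 45GB RAM

or, equivalently:

#SBATCH --mem=45G     # 45GB RAM, which also provides access to 10 cores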

Cores

The number of cores requested in the above example is specified in the line:

#SBATCH --ntasks=80

ARCHIE-WeSt has 40 cores per node

It is possible to reserve all the cores on a node (as in the example) or only some of them. If the job requests fewer than 40 cores per node, the node may be shared with another job, in which case the memory will also be shared. A user can of course request a whole node but choose not to use all of the allocated cores or RAM; however, the user will be charged for occupying the full node.

Core Allocation per Job

The example job above requests 80 cores in total, i.e. two full 40-core nodes. The maximum allowed core allocation is 400 cores (10 nodes) per job, and per user across all running jobs.

400 cores is the maximum number of cores that can be used by a single user at any given time. This means that there could be:

  • one 400 core job
  • two 200 core jobs
  • three 120 core + one 40 core job
  • five 80 core jobs ... or
  • four hundred 1 core jobs.

If a user were to submit four 120-core jobs, only three would run; the last would wait until one of the running jobs completed, even if the system were otherwise empty, so as not to exceed the maximum core limit.

Memory

By default the amount of RAM available to a job scales with the number of cores, with each core allocated 4.5GB RAM. If you require more RAM than this, it can be requested as follows:

 #SBATCH --mem=45G

This will allocate 45GB of RAM to the job and correspondingly allocate 10 cores for the job. If fewer cores were explicitly requested elsewhere in the job script, then this will override that request.

Node Exclusivity

It may be necessary for a job to require exclusive access to a node so as not to share any resources (memory or network bandwidth) with any other processes. This can be achieved with the following option:

#SBATCH --exclusive

Node Exclusivity Charging

Where exclusive access to a node has been requested, the user will be charged for using the full node, even if only a few cores have been used.

Project Identifiers

The project identifier is specified among the Slurm instructions in the job script, in a line such as:

#SBATCH --account=a17-archie-west

Every job must be associated with a project ID. If the project ID is not correct, or the project has exceeded its core hour allocation, the queueing system will not allow the job to run.

Project identifiers are generated based on the project application and indicate the Principal Investigator's surname and the project title (its acronym). The common format is: PIsurname-acronym.prj. For example, Karina Kubiak with the project entitled "Coarse-Grained Dynamics" would have the project ID kubiak-cgd.

Project Identifiers are used to monitor the core hour usage within the project. Multiple users might be assigned to the same project ID.

Project Core Hour Allocation

Each project is allocated an amount of core hours, based on the request made in the project application. The core hours consumed by each project are used to monitor the progress of the project as well as the utilization of the facility.

Exceeding the Core Hour Allocation

If a job exceeds the project core hour allocation it will be terminated, along with all other running jobs with the same project ID. The system will not allow queued jobs with that project ID to start.

If the project core hour allocation is exceeded, the user is expected to ask the Principal Investigator of the project to complete the appropriate extension form. Queued jobs should be deleted and resubmitted once the extension is granted.

Loading Modules

The required module(s) can be loaded from a terminal or from within the user's .bashrc file. However, the recommended method is to load the module within the job script, as this makes it obvious which software version was used for a particular job. The module is loaded after all the Slurm options are specified, for example:

module load namd/intel-2016.4/2.12

For more details on how to use modules see the Environment Modules section.
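Two standard module commands that are useful when preparing a job script are:

module avail   # list the modules available on the system
module list    # list the modules currently loaded in your session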

Executing Programs

The main task of the job script is to run a program on the requested number of cores/nodes. Again, this is not a Slurm instruction, therefore the line does not begin with #SBATCH. For example, the command line to execute NAMD is:

mpirun namd2 stmv_10nodes.inp > stmv_10nodes.out

In this example, the job configuration file which contains all the input information is called stmv_10nodes.inp, and the output has been explicitly redirected by the user to a file with the given name: stmv_10nodes.out.

Large Memory nodes

ARCHIE-WeSt has two 3TB large memory nodes, named bigmem1 and bigmem2. These nodes are to be used only for jobs with a memory requirement greater than 192GB. All jobs requiring less than 192GB should be run on the standard compute nodes, without exception.

The large memory nodes each have four CPUs with 20 cores each, giving 80 cores per node. Here, each core has access to 38.4GB RAM, so, for example, the 20-core request in the script below provides access to 768GB RAM.

Example

/opt/software/job-scripts/generic-slurm-bigmem.sh

#!/bin/bash

#===========================================================================
#
# Job script for running a parallel job on a single (shared) big memory node
#
#===========================================================================

#======================================================
# Propagate environment variables to the compute node
#SBATCH --export=ALL
#
# Run in the big memory (bigmem) partition (queue)
#SBATCH --partition=bigmem
#
# Specify project account
#SBATCH --account=testing
#
# No. of tasks required (max. of 80)
#SBATCH --ntasks=20
#   
# Distribute processes in round-robin fashion
#SBATCH --distribution=cyclic
#
# Specify (hard) runtime (HH:MM:SS)
#SBATCH --time=01:00:00
#
# Job name
#SBATCH --job-name=bigmem_test
#
# Output file
#SBATCH --output=slurm-%j.out
#======================================================

module purge

# choose which version to load 
# (foss 2018a contains the gcc 6.4.0 toolchain & openmpi 2.12)
module load foss/2018a

#=========================================================
# Prologue script to record job details
# Do not change the line below
#=========================================================
/opt/software/scripts/job_prologue.sh 
#----------------------------------------------------------

# Modify the line below to run your program
mpirun -np $SLURM_NPROCS myprogram.exe

#=========================================================
# Epilogue script to record job endtime and runtime
# Do not change the line below
#=========================================================
/opt/software/scripts/job_epilogue.sh 
#----------------------------------------------------------

Running Multiple Jobs From a Single Script

Occasionally there is a requirement to run several independent jobs at the same time from within a single job script, for example where exclusive access to a node has been requested. The basic way to do this (at least, this works with OpenMPI) is to include the following in the job script (this example runs four 10-process jobs simultaneously):

#SBATCH --nodes=1
#SBATCH --exclusive

mpirun -np 10 -npernode 3 ~/bin/hello-openmpi-wait1 &
mpirun -np 10 -npernode 3 ~/bin/hello-openmpi-wait2 & 
mpirun -np 10 -npernode 3 ~/bin/hello-openmpi-wait3 &
mpirun -np 10 -npernode 3 ~/bin/hello-openmpi-wait4 &

wait

For Intel MPI, we can use the following:

#SBATCH --nodes=1
#SBATCH --exclusive

mpiexec.hydra -np 10 -rr -f hosts.$SLURM_JOB_ID  ~/bin/hello-intelmpi-wait1 &
mpiexec.hydra -np 10 -rr -f hosts.$SLURM_JOB_ID  ~/bin/hello-intelmpi-wait2 &
mpiexec.hydra -np 10 -rr -f hosts.$SLURM_JOB_ID  ~/bin/hello-intelmpi-wait3 & 
mpiexec.hydra -np 10 -rr -f hosts.$SLURM_JOB_ID  ~/bin/hello-intelmpi-wait4 &

wait

For Intel MPI, we need to use mpiexec.hydra to launch the MPI processes. mpirun is a wrapper for mpiexec.hydra, which is what gets used under the bonnet by Intel MPI, but mpirun makes too many assumptions that are difficult to overcome (it should be possible to do this by setting appropriate environment variables, but that seems to be broken in 5.0.3 - allegedly fixed in 5.1). Using mpiexec.hydra directly therefore gives us much more control.

The hostfile in the above example (hosts.$SLURM_JOB_ID) lists the nodes allocated to the job; under Slurm it is not generated automatically and must be created within the job script (see the sketch below). The -rr flag tells mpiexec.hydra to distribute the processes round-robin.
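A minimal sketch for creating that hostfile within the job script, using the standard scontrol utility to expand the Slurm node list:

# Expand the compressed node list (e.g. node[024-025]) into one hostname per line
scontrol show hostnames $SLURM_JOB_NODELIST > hosts.$SLURM_JOB_ID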

Array Jobs

A common requirement is to be able to run the same job a large number of times, with different input parameters and/or different input files. While this could be done by submitting lots of individual jobs, a more efficient and robust way is to submit the job as an array job.

For example, including the following in your job script:

#SBATCH --array=1-40

will cause 40 copies of the job to be submitted each with a different value of

$SLURM_ARRAY_TASK_ID

which in this example is a value from the range 1-40.

So, for example, an array of serial jobs with different inputs could be submitted using:

#!/bin/bash

#SBATCH --export=ALL
#SBATCH --partition=standard
#SBATCH --account=priority
#SBATCH --distribution=block
#SBATCH --ntasks=1
#SBATCH --array=1-40    
# Runtime (hard)
#SBATCH --time=00:20:00
# Job name
#SBATCH --job-name="serial_array_test"

#=========================================================
# Prologue script to record job details
#=========================================================
/opt/software/scripts/job_prologue.sh
#----------------------------------------------------------

./myprogram < my_input_file_$SLURM_ARRAY_TASK_ID

#=========================================================
# Epilogue script to record job endtime and runtime
#=========================================================
/opt/software/scripts/job_epilogue.sh
#----------------------------------------------------------

where the program reads data from the input file my_input_file_$SLURM_ARRAY_TASK_ID (i.e. 40 input files, numbered from my_input_file_1 to my_input_file_40).
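If each task needs a different parameter rather than a different input file, a common pattern is to select one line of a parameter file using the task ID (a sketch; params.txt and the way myprogram takes its argument are hypothetical):

# Read line number $SLURM_ARRAY_TASK_ID from a parameter file (one parameter per line)
PARAM=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)
./myprogram "$PARAM"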

Interactive Jobs

On occasion it might be necessary for a user to run a program interactively, for example via a graphical user interface. While this can be done on the login nodes for preparing jobs and testing, full calculations must be run via the queueing system.

A user can therefore request resources (nodes, cores and/or memory) from the scheduler but launch a program manually and interactively from a terminal allocated by the scheduler. You should attempt the first method below; if you encounter display problems, then use the second method.

First, connect to ARCHIE using ssh, MobaXterm or ThinLinc. If connecting using ThinLinc, then start a Linux Terminal.

A sample command to request 8 cores for an interactive session with access to the command line (e.g. for debugging):

(long version):

srun --account=testing --partition=standard --time=2:00:00 --x11 --ntasks=8 --pty bash

(short version):

srun -A testing -p standard -t 2:00:00 --x11 -n 8 --pty bash

Example Options

  Long option            Short option   Description
  --account=testing      -A testing     the project account for the job
  --partition=standard   -p standard    the partition on which the job should run
  --time=2:00:00         -t 2:00:00     a run time of 2 hours has been requested
  --x11                  --x11          allow graphical windows to be displayed
  --ntasks=8             -n 8           8 cores have been requested
  --pty bash             --pty bash     a bash shell has been requested for executing commands

This will provide the user with a command prompt on the allocated node, which they can use to run their program, e.g.:

[acs01234@node410 projects]$ ./example_job

In this instance, node410 has been allocated

Interactive Job (second method)

This method should only be used if you encounter problems displaying your application GUI using the first method above.

The outline of the process is as follows:

  1. use salloc to reserve some cores (or nodes).
  2. determine the node(s) you have been allocated
  3. use "ssh" to login to the (master) node
  4. run your software
  5. when finished, exit the node
  6. exit from "salloc"

1. Request an "allocation" for your job using the salloc command

Request as many cores as you need and the runtime in the usual way:

(long version):

salloc --account=testing --partition=standard --time=2:00:00 --ntasks=8

(short version):

salloc -A testing -p standard -t 2:00:00 -n 8

2. Determine the node(s) you have been allocated

echo $SLURM_NODELIST

This will return the list of nodes that you have been allocated. The first node will be the "master node" of your allocation e.g. node024.
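If the allocation spans more than one node, the list is returned in a compressed form (e.g. node[024-025]); a sketch for expanding it and picking out the master node:

# Expand the node list and take the first entry as the master node
MASTER=$(scontrol show hostnames $SLURM_NODELIST | head -n 1)
echo $MASTER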

3. Use ssh to login to the master node

 ssh -X node024

for example. The -X flag will allow you to remotely display the GUI of your application

4. Run your software application

You should run your software as you would on the login nodes e.g.

  1. change to your working directory
  2. load any necessary modules, for example, matlab:

     [acs01234@node024] module load matlab/R2019a
    
  3. run your software e.g.

     [acs01234@node024] matlab
    

5. When you are finished, exit from the node:

  exit

6. Exit from the allocation:

  exit

Tip: Short run time

Specifying a short run time will help the scheduler to allocate your job more quickly

Tip: Node exclusivity

Use the --exclusive flag if you require a whole node for your job
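For example, a whole-node interactive session could be requested by combining --exclusive with the srun options shown earlier (a sketch; adjust the account, partition and run time to suit your job):

srun -A testing -p standard -t 1:00:00 --exclusive --pty bash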

Email notifications

You can receive email notifications when your job starts or finishes (along with other events) by adding the following to your job script:

 #SBATCH --mail-user=email.address@strath.ac.uk 
 #SBATCH --mail-type=ALL

Or by using them with the srun command:

 srun --mail-user=email.address@strath.ac.uk --mail-type=ALL --account=testing --partition=standard  ...

see "--mail-type" at https://slurm.schedmd.com/sbatch.html for more details.

Job Priority and Backfilling

The greater the amount of resource requested by a job, the higher the priority given by the scheduler - in terms of both the number of cores requested and the length of the run time. This is because the scheduler needs to reserve nodes in order to allow large jobs to run; otherwise they would queue for a very long time.

However, while nodes are being reserved, the scheduler will allow smaller, shorter jobs to run if it can determine that they will complete before the larger job is scheduled to start. This is known as backfilling.

Therefore, it is in the interests of the user not to request more cores or run time than necessary (although run time should always be slightly overestimated to avoid premature termination).

Tip: Smaller, Shorter jobs will always be scheduled more quickly

Throughput

The measure of the productivity of your workflow is not simply a function of how long your jobs take to run, but how quickly your jobs get through the system. In other words, you need to take into account queueing time as well as execution time.

Given that many parallel jobs do not scale linearly as you increase the number of cores (see Scalability), it can actually be the case that a smaller job with a longer run time can get through the system more quickly than a larger job with a shorter runtime, by the time you factor in queueing times.

As you become more familiar with your jobs, you should optimise your job size in order to maximise your throughput (i.e. minimise [queueing time + execution time]).

Tip: Optimise your job size in order to minimise queueing time + execution time