All jobs run on the Research Computing (HPC) systems should be submitted via the job queueing system (scheduler), Slurm. While a job can be submitted directly from the command line, the recommended method is to use a job script.
A job script is a file containing all the instructions needed to submit a job to the scheduler. The job is submitted by passing the job script to the sbatch command.
Before submitting a job ensure the necessary software modules are loaded.
It is recommended that modules are loaded within a job script.
Default job sizes and runtimes
By default all jobs will be restricted to a single node (40 cores, 192 GB RAM) and will have a maximum runtime of 14 days (336 hours).
If you need to run on more than one node (>40 cores) then you must perform some test runs to provide evidence of how well your code scales with the number of cores (see the section on Scalability). To perform the test runs, the ARCHIE-WeSt support team will give you access to the scalability project ID.
Once you have supplied the test run data to the support team, they will give your project permission to access more than one node, assuming it scales sufficiently well.
It is mandatory to supply an estimated runtime for your job (see below); however, jobs with a long run time will likely take longer to be dispatched by the scheduler (see Job Priority and Backfilling below). It is therefore advisable not to excessively overestimate the run time of your job.
A sample LAMMPS job script (named lammps-slurm-multinode-exclusive.sh) is given below. Note that it is a parallel job which requests exclusive use of each node. More example job scripts can be found via the Example Job Scripts menu in the left panel.
Sample job scripts can also be found on ARCHIE at /opt/software/job-scripts
#!/bin/bash

#======================================================
#
# Job script for running LAMMPS on multiple nodes
#
#======================================================

#======================================================
# Propagate environment variables to the compute node
#SBATCH --export=ALL
#
# Run in the standard partition (queue)
#SBATCH --partition=standard
#
# Specify project account
#SBATCH --account=testing
#
# No. of tasks required
#SBATCH --ntasks=80
#
# Distribute processes in round-robin fashion for load balancing
#SBATCH --distribution=cyclic
#
# Ensure the nodes are not shared with other jobs
#SBATCH --exclusive
#
# Specify (hard) runtime (HH:MM:SS)
#SBATCH --time=01:00:00
#
# Job name
#SBATCH --job-name=lammps_test
#
# Output file
#SBATCH --output=slurm-%j.out
#======================================================

module purge
module load lammps/intel-2018.2/16Mar18

#======================================================
# Prologue script to record job details
#======================================================
/opt/software/scripts/job_prologue.sh
#------------------------------------------------------

mpirun -np $SLURM_NPROCS lmp_mpi -i in.lj.10000

#======================================================
# Epilogue script to record job endtime and runtime
#======================================================
/opt/software/scripts/job_epilogue.sh
#------------------------------------------------------
The above job script requests a number of cores (tasks), specifies the run time, gives the project account ID, loads the required software module and finally runs the program on the named input file, with the results written to a newly created .out file with the specified name.
The job is submitted to the scheduler using the command:

sbatch lammps-slurm-multinode-exclusive.sh
The job will be placed in the queue until there are enough resources available to run it.
Lines starting with #SBATCH are Slurm options
Job Run Time
The job run time is specified in the line:

#SBATCH --time=01:00:00

It is important to specify the run time in the correct format, hh:mm:ss, as above. Note:
- if not specified, the default run time of the job is 14 days
- the maximum job run time is 14 days
- if the job exceeds the run time (specified or default) it is automatically terminated
- for short jobs it is strongly advised to specify the job run time - this will enable your job to be scheduled more quickly (see "Backfilling" below).
- always overestimate the run time by a modest amount
- for long jobs the time specification is not so crucial (but still advisable)
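As a sketch (the values here are illustrative, not from the example job above), Slurm also accepts a days-hours form for the run time; the following two lines request the same 60-hour limit:

```shell
#SBATCH --time=60:00:00     # 60 hours, written as HH:MM:SS
#SBATCH --time=2-12:00:00   # the same limit, written as D-HH:MM:SS
```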
Exceeding the Job Run Time
If the job exceeds the specified run time it will be automatically terminated. Therefore it is advisable to overestimate the run time. The maximum job run time is 14 days: --time=336:00:00
The job run time might be used by the Support Team both for queue and system monitoring and to increase the efficiency of use of the facility. For details see the section "Backfilling" below.
The job name is specified in the line:

#SBATCH --job-name=lammps_test

In the above case the job name is lammps_test. It is advised not to use names longer than 8 characters.
The partition is specified in the line:

#SBATCH --partition=standard
The job partition specifies which nodes the job will run on - for example, the standard compute nodes or the bigmem nodes. Other schedulers may use the term queue instead of partition.
The standard compute nodes have 40 cores and 192 GB RAM, which equates to 4.8GB RAM per core. The scheduler enforces a one-to-one correspondence between the number of cores and the amount of RAM requested. In other words, a 10 core job will provide access to 48GB RAM, and a job which requests 48GB RAM will provide access to 10 cores and will be charged accordingly.
The number of cores requested in the above example is specified in the line:

#SBATCH --ntasks=80
ARCHIE-WeSt has 40 cores per node
It is possible to reserve all the cores on a node (as in the example) or only some of them. If the job requests fewer than 40 cores per node, the node might be shared with another job, in which case the memory will also be shared. A user can of course request a whole node but choose not to use all of the allocated cores or RAM; however, the user will be charged for occupying the full node.
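For instance, a job that needs only half a standard node could request the following (an illustrative fragment; with fewer than 40 cores requested, the node may be shared with another job):

```shell
#SBATCH --ntasks=20   # 20 of the 40 cores; grants 20 x 4.8GB = 96GB RAM
```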
Core Allocation per Job
A job which requests 10 nodes and all 40 cores available on each node requests 400 cores in total. This is the maximum allowed core allocation per job (and per user across all running jobs).
400 cores is the maximum number of cores that can be used by a single user at any given time. This means that there could be:
- one 400 core job
- two 200 core jobs
- three 120 core jobs + one 40 core job
- five 80 core jobs ... or
- four hundred 1 core jobs.
If a user submitted four 120 core jobs, only three would run; the fourth would wait until one of the running jobs completed, even if the system were otherwise empty, so as not to exceed the maximum core limit.
By default the amount of RAM available to a job scales with the number of cores, with each core allocated 4.8GB RAM. If you require more RAM, then this can be requested as follows:

#SBATCH --mem=64G

This will allocate 64GB of RAM to the job and correspondingly allocate 14 cores. If fewer cores were explicitly requested elsewhere in the job script, then this will override that request.
It may be necessary for a job to have exclusive access to a node so as not to share any resources (memory or network bandwidth) with any other processes. This can be achieved with the following option:

#SBATCH --exclusive
Node Exclusivity Charging
Where exclusive access to a node has been requested, the user will be charged for using the full node, even if only a few cores have been used.
The project identifier is specified in the Slurm instructions in the job script, in the line:

#SBATCH --account=testing
Every job must be associated with a "project ID". If the project ID is not correct, or the project has exceeded its core hour allocation, the queueing system will not allow the job to run.
Project Identifiers are generated based on the project application and indicate the Principal Investigator's surname and the project title (its acronym). The common format is: PIsurname-acronym.prj. For example, Karina Kubiak with the project entitled "Coarse-Grained Dynamics" would have the project ID: kubiak-cgd
Project Identifiers are used to monitor the core hour usage within the project. Multiple users might be assigned to the same project ID.
Project Core Hour Allocation
Each project is allocated an amount of core hours, based on the request made in the project application. The core hours consumed by each project are used to monitor the progress of the project as well as utilisation of the facility.
Exceeding the Core Hour Allocation
If a job exceeds the project core hour allocation it will be terminated, along with all other running jobs with the same project ID, and the system will not allow queued jobs to start.
If the project core hour allocation is exceeded, the user is expected to ask the Principal Investigator of the project to complete the appropriate extension form. Queued jobs should be deleted and resubmitted once the extension is granted.
Required module(s) can be loaded from a terminal or from within the user's .bashrc file. However, the recommended method is to load the module within the job script, as this makes it obvious which software version was used for a particular job. The module is loaded after all the Slurm options are specified, for example:
module load namd/intel-2016.4/2.12
For more details on how to use modules, see the Environmental Modules section.
The main task of the job script is to run a program on the requested number of cores/nodes. Again, this is not a Slurm instruction, so the line does not begin with #SBATCH. In this example, the command line to execute NAMD is:
mpirun namd2 stmv_10nodes.inp > stmv_10nodes.out
In this example, the job configuration file containing all the input information is called stmv_10nodes.inp, and the user has explicitly redirected the output to a job output file with the given name: stmv_10nodes.out.
Large Memory nodes
ARCHIE-WeSt has two 3TB large memory nodes with the names bigmem1 & bigmem2. These nodes are to be used only for jobs with a memory requirement > 192GB. All jobs requiring less than 192GB should be run on the standard compute nodes, without exception.
The large memory nodes have four CPUs with 20 cores each, so 80 cores in total. Here, each core has access to 38.4GB RAM.
#!/bin/bash

#===========================================================================
#
# Job script for running a parallel job on a single (shared) big memory node
#
#===========================================================================

#======================================================
# Propagate environment variables to the compute node
#SBATCH --export=ALL
#
# Run in the bigmem partition (queue)
#SBATCH --partition=bigmem
#
# Specify project account
#SBATCH --account=testing
#
# No. of tasks required (max. of 80)
#SBATCH --ntasks=20
#
# Distribute processes in round-robin fashion
#SBATCH --distribution=cyclic
#
# Specify (hard) runtime (HH:MM:SS)
#SBATCH --time=01:00:00
#
# Job name
#SBATCH --job-name=bigmem_test
#
# Output file
#SBATCH --output=slurm-%j.out
#======================================================

module purge

# choose which version to load
# (foss 2018a contains the gcc 6.4.0 toolchain & OpenMPI 2.1.2)
module load foss/2018a

#=========================================================
# Prologue script to record job details
# Do not change the line below
#=========================================================
/opt/software/scripts/job_prologue.sh
#----------------------------------------------------------

# Modify the line below to run your program
mpirun -np $SLURM_NPROCS myprogram.exe

#=========================================================
# Epilogue script to record job endtime and runtime
# Do not change the line below
#=========================================================
/opt/software/scripts/job_epilogue.sh
#----------------------------------------------------------
Running Multiple Jobs From a Single Script
Occasionally a user needs to run several independent jobs at the same time from within a single job script, for example where exclusive access to a node has been requested. The basic way to do this (at least with OpenMPI) is to include the following in the job script (this example runs four 10-process jobs simultaneously):
#SBATCH --nodes=1
#SBATCH --exclusive

mpirun -np 10 -npernode 3 ~/bin/hello-openmpi-wait1 &
mpirun -np 10 -npernode 3 ~/bin/hello-openmpi-wait2 &
mpirun -np 10 -npernode 3 ~/bin/hello-openmpi-wait3 &
mpirun -np 10 -npernode 3 ~/bin/hello-openmpi-wait4 &
wait
For Intel MPI, we can use the following:
#SBATCH --nodes=1
#SBATCH --exclusive

mpiexec.hydra -np 10 -rr -f hosts.$SLURM_JOB_ID ~/bin/hello-intelmpi-wait1 &
mpiexec.hydra -np 10 -rr -f hosts.$SLURM_JOB_ID ~/bin/hello-intelmpi-wait2 &
mpiexec.hydra -np 10 -rr -f hosts.$SLURM_JOB_ID ~/bin/hello-intelmpi-wait3 &
mpiexec.hydra -np 10 -rr -f hosts.$SLURM_JOB_ID ~/bin/hello-intelmpi-wait4 &
wait
For Intel MPI, we need to use mpiexec.hydra to launch the MPI processes. mpirun is in fact a wrapper around mpiexec.hydra, which is what Intel MPI uses under the bonnet, but mpirun makes too many assumptions that are difficult to override (it should be possible by setting appropriate environment variables, but that appears to be broken in 5.0.3 and is allegedly fixed in 5.1). Using mpiexec.hydra directly therefore gives us much more control.
The hostfile in the above example (hosts.$SLURM_JOB_ID) lists the nodes allocated to the job, and -rr tells mpiexec.hydra to distribute the processes using "round-robin".
A common requirement is to be able to run the same job a large number of times, with different input parameters and/or different input files. While this could be done by submitting lots of individual jobs, a more efficient and robust way is to submit the job as an array job.
For example, including the following in your job script:

#SBATCH --array=1-40

will cause 40 copies of the job to be submitted, each with a different value of

$SLURM_ARRAY_TASK_ID

which in this example is a value from the range 1-40.
So, for example, an array of serial jobs with different inputs could be submitted using:
#!/bin/bash

#SBATCH --export=ALL
#SBATCH --partition=standard
#SBATCH --account=priority
#SBATCH --distribution=block
#SBATCH --ntasks=1
#SBATCH --array=1-40
# Runtime (hard)
#SBATCH --time=00:20:00
# Job name
#SBATCH --job-name="serial_array_test"

#=========================================================
# Prologue script to record job details
#=========================================================
/opt/software/job-scripts/job_prologue.sh
#----------------------------------------------------------

./myprogram < my_input_file_$SLURM_ARRAY_TASK_ID

#=========================================================
# Epilogue script to record job endtime and runtime
#=========================================================
/opt/software/job-scripts/job_epilogue.sh
#----------------------------------------------------------
where the program reads data from the input file my_input_file_$SLURM_ARRAY_TASK_ID (so 40 copies numbered from my_input_file_1 to my_input_file_40).
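A common variation, sketched below under the assumption of a file named params.txt with one parameter set per line, is to use $SLURM_ARRAY_TASK_ID to select a line from a single parameter file rather than maintaining 40 separate input files:

```shell
# Toy parameter file; in a real array job this would already exist.
printf '%s\n' "temp=300" "temp=310" "temp=320" > params.txt

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 1
# here so that the sketch can also be run outside the scheduler.
SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-1}

# Pick the line of params.txt matching this task's ID.
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)
echo "task ${SLURM_ARRAY_TASK_ID} runs with: ${PARAMS}"
```

Each array task then passes its own parameter set to the program, e.g. ./myprogram $PARAMS.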
On occasion it might be necessary for a user to run a program interactively, for example via a graphical user interface. While this can be done on the login nodes for preparing jobs and testing, full calculations must be run via the queueing system.
A user can therefore request resources (nodes, cores and/or memory) from the scheduler but launch a program manually and interactively from a terminal allocated by the scheduler. You should try the first method below; if you encounter display problems, use the second method.
Interactive Job with Command Line Access (first method - recommended)
A sample command to request 8 cores for an interactive session with access to the command line (e.g. for debugging):
srun --account=testing --partition=standard --time=2:00:00 --x11 --ntasks=8 --pty bash
or equivalently, using the short option forms:

srun -A testing -p standard -t 2:00:00 --x11 -n 8 --pty bash
| Long form | Short form | Meaning |
| --- | --- | --- |
| --account=testing | -A testing | the project account for the job |
| --partition=standard | -p standard | the partition on which the job should run |
| --time=2:00:00 | -t 2:00:00 | a run time of 2 hours has been requested |
| --x11 | --x11 | allow graphical windows to be displayed |
| --ntasks=8 | -n 8 | 8 cores have been requested |
| --pty bash | --pty bash | a bash shell has been requested for executing commands |
This will provide the user with a command prompt, from which they can run their program:
[acs01234@node410 projects]$ ./example_job
In this instance, node410 has been allocated
Interactive Job (second method)
This method should only be used if you encounter problems displaying your application GUI using the first method above.
The outline of the process is as follows:
- use salloc to reserve some cores (or nodes).
- determine the node(s) you have been allocated
- use "ssh" to login to the (master) node
- run your software
- when finished, exit the node
- exit from "salloc"
1. Request an "allocation" for your job using the salloc command
Request as many cores as you need and the runtime in the usual way:
salloc --account=testing --partition=standard --time=2:00:00 --ntasks=8
or equivalently, using the short option forms:

salloc -A testing -p standard -t 2:00:00 -n 8
2. Determine the node(s) you have been allocated
For example, running echo $SLURM_JOB_NODELIST (or squeue -u $USER) will return the list of nodes that you have been allocated. The first node will be the "master node" of your allocation, e.g. node024.
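Note that Slurm reports the allocation in a compressed form such as node[024-027]; the command scontrol show hostnames "$SLURM_JOB_NODELIST" expands it to one hostname per line. As an illustrative sketch of what that expansion does (pure bash, handling only the simple single-range form):

```shell
# Expand a simple compressed Slurm nodelist like "node[024-027]"
# into one hostname per line. This toy function handles only the
# single-range form; `scontrol show hostnames` covers the general case.
expand_nodelist() {
    local list="$1"
    if [[ "$list" =~ ^([a-z]+)\[([0-9]+)-([0-9]+)\]$ ]]; then
        local prefix="${BASH_REMATCH[1]}"
        local start="${BASH_REMATCH[2]}" end="${BASH_REMATCH[3]}"
        local width=${#start}                  # preserve zero-padding
        for ((i = 10#$start; i <= 10#$end; i++)); do
            printf '%s%0*d\n' "$prefix" "$width" "$i"
        done
    else
        echo "$list"                           # already a single hostname
    fi
}

expand_nodelist "node[024-027]"   # prints node024 through node027, one per line
```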
3. Use ssh to login to the master node
ssh -X node024
for example. The -X flag will allow you to remotely display the GUI of your application
4. Run your software application
You should run your software as you would on the login nodes e.g.
- change to your working directory
- load any necessary modules, for example, matlab:

[acs01234@node024] module load matlab/R2019a

- run your software
5. When you are finished, exit from the node:

exit
6. Exit from the allocation:

exit
Tip: Short run time
Specifying a short run time will help the scheduler to allocate your job more quickly
Tip: Node exclusivity
Use the --exclusive flag if you require a whole node for your job
Job Priority and Backfilling
The greater the amount of resource requested by a job, the greater the priority that is given by the scheduler - both in terms of the number of cores requested and the length of the run time. This is because the scheduler needs to reserve nodes in order to allow large jobs to run, otherwise they would queue for a very long time.
However, while nodes are being reserved, the scheduler will allow smaller, shorter jobs to run if it can determine that they will complete before the larger job is scheduled to start. This is known as backfilling.
Therefore, it is in the interests of the user not to request more cores or run time than necessary (although run time should always be slightly overestimated to avoid premature termination).
Tip: Smaller, shorter jobs will always be scheduled more quickly
The measure of the productivity of your workflow is not simply how quickly your jobs run, but how quickly they get through the system. In other words, you need to take into account queueing time as well as execution time.
Given that many parallel jobs do not scale linearly as you increase the number of cores (see Scalability), it can actually be the case that a smaller job with a longer run time can get through the system more quickly than a larger job with a shorter runtime, by the time you factor in queueing times.
As you become more familiar with your jobs, you should optimise your job size in order to maximise your throughput (i.e. minimise [queueing time + execution time]).
Tip: Optimise your job size in order to minimise queueing time + execution time