
Monitoring Jobs

All jobs run on the Research Computing (HPC) systems must be submitted via the job scheduling system, Slurm. To submit a job you need to write a script which specifies the resources required to run it; the script is then submitted with the sbatch command.

Slurm

Slurm is an open source cluster management and job scheduling system. Slurm's key features include:

  • it allocates exclusive and/or non-exclusive access to resources (compute nodes)

  • it provides a framework for starting, executing, and monitoring work on the set of allocated nodes

  • it arbitrates contention for resources by managing a queue of pending work.

Additionally, Slurm can be used for accounting, advanced reservation, gang scheduling (time sharing for parallel jobs), backfill scheduling, topology optimized resource selection, resource limits by user or bank account, and sophisticated multifactor job prioritization algorithms.

Overview of Slurm Commands

Clicking on the commands below will take you to more detailed information at the Slurm website.

Command Description
sbatch Submits a job script; the script should contain at least one srun command to launch the job
scancel Cancels a job
squeue Lists jobs in the queue
srun Launches a parallel job
sacct Produces a report about active or completed jobs
salloc Allocates resources for a job
sattach Attaches standard input, output and error to a currently running job

With the srun command it is possible to specify resource requirements such as minimum and maximum node count, processor count, specific nodes to use or exclude, and specific node characteristics (amount of memory, disk space, required features, etc.).
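For illustration, a request of this kind might look as follows. This is only a sketch: my_program, the memory figure, the feature name and the node name are placeholders, and the features available are site-specific.

srun --nodes=2 --ntasks=16 --mem=8G ./my_program                    # 2 nodes, 16 tasks, 8 GB per node
srun --ntasks=8 --constraint=bigmem --exclude=node031 ./my_program  # require the 'bigmem' feature, avoid node031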

Job Submission and Monitoring

More information about preparing job scripts is available at the Submitting Jobs page.
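As a rough illustration, a job script such as namd_stmv.sh might look like the following. This is only a sketch: the partition, node and task counts, time limit, and the module and executable names are placeholders that will differ on your system.

#!/bin/bash
#SBATCH --job-name=stmv           # job name shown by squeue
#SBATCH --partition=compute       # partition (queue) to submit to
#SBATCH --nodes=9                 # number of nodes requested
#SBATCH --ntasks-per-node=8       # tasks (MPI ranks) per node
#SBATCH --time=01:00:00           # wall-clock limit (HH:MM:SS)

module load namd                  # load the application environment (module name assumed)

srun namd2 stmv.namd              # launch the parallel job on the allocated nodes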

Assuming an appropriate job script has been prepared (namd_stmv.sh), the job is submitted using the sbatch command:

sbatch namd_stmv.sh

The system should confirm the submission and report the job ID, e.g.:

Submitted batch job 2164

Monitoring job status

Viewing all jobs

Simply type:

squeue

or use the shorthand version:

sq

The sq command is a local alias set up only on ARCHIE.

Viewing your own jobs

To check the status of your own jobs, type:

squeue -u DSusername

or

sqme

which, again, is a local alias.

The output might look something like:

JOBID  PARTITION     NAME       USER    ST       TIME  NODES NODELIST(REASON)
122164   compute     stmv   cxb01102    PD       0:02      9 (AssociationJobLimit)
122164   compute     stmv   cxb01102     R       0:10      9 node[031-035,045-048]

R means the job is running; PD means it is pending.

Column name Meaning
JOBID Shows Job ID number
PARTITION Shows partition (queue) on which the job is running
NAME Shows the job name as specified in the job script
USER Shows the DSusername of the job owner
ST Job status (PD: pending; R: running; CF: configuring; CG: completing; CA: cancelled; F: failed)
TIME Job running time (wall clock)
NODES Shows number of nodes requested
NODELIST Lists the nodes allocated to the job
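squeue also accepts filters, which can be useful on a busy system. For example (a sketch; compute is just the partition name used above):

squeue -u DSusername -t PD        # show only your pending jobs
squeue -p compute                 # show all jobs in the compute partition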

Deleting a job

To delete a job from the queue:

scancel JOBID
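scancel can also act on several jobs at once (a sketch; note that the first form cancels every one of your jobs):

scancel -u DSusername             # cancel all of your jobs
scancel -n stmv                   # cancel your jobs named stmv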

More detailed job information

For detailed information on a running job, e.g. the amount of memory used, type:

sstat -j JOBID

For example:

sstat -j 401

   JobID  MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask      AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite 
------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- -------------- ------------ --------------- --------------- ------------ ------------ ---------------- ---------------- ------------ 
401.0           210092K        node405              0  16380.40K      1944K    node405          0   1930.40K        0      node405              0          0  00:00.000    node405          0  00:00.000       10      2.93M       Unknown       Unknown       Unknown              0        0.31M         node405               0        0.23M        0.24M          node405                0        0.13M
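The default output is very wide, so it is often easier to request only the fields of interest with --format (the field list below is just a suggestion; all of these fields appear in the default output above):

sstat -j JOBID --format=JobID,NTasks,AveCPU,AveRSS,MaxRSS,MaxRSSNode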

For other information, you can use:

scontrol show job JOBID

For example:

scontrol show job 401

JobId=401 JobName=namd-benchmark
    UserId=acx03155(5002) GroupId=users(100) MCS_label=N/A
    Priority=1 Nice=0 Account=testing QOS=normal
    JobState=RUNNING Reason=None Dependency=(null)
    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
    RunTime=00:00:12 TimeLimit=1-00:00:00 TimeMin=N/A
    SubmitTime=2018-03-22T15:50:37 EligibleTime=2018-03-22T15:50:37
    StartTime=2018-03-22T15:50:37 EndTime=2018-03-23T15:50:37 Deadline=N/A
    PreemptTime=None SuspendTime=None SecsPreSuspend=0
    Partition=parallel AllocNode:Sid=node401:9342
    ReqNodeList=(null) ExcNodeList=(null)
    NodeList=node[405-414]
    BatchHost=node405
    NumNodes=10 NumCPUs=80 NumTasks=80 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
    TRES=cpu=80,node=10
    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
    Features=(null) Gres=(null) Reservation=(null)
    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
    Command=/users/acx03155/slurm/NAMD/typical_protein/run_namd.sh
    WorkDir=/users/acx03155/slurm/NAMD/typical_protein
    StdErr=/users/acx03155/slurm/NAMD/typical_protein/slurm-401.out
    StdIn=/dev/null
    StdOut=/users/acx03155/slurm/NAMD/typical_protein/slurm-401.out
    Power=

Estimated job start time

You can obtain estimated job start times from the scheduler by typing:

squeue --start

For a particular job:

squeue --start -j JOBID

Cluster usage overview

To obtain an overview of cluster usage and node availability, type:

sinfo

This will generate output such as:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
standard*    up 1-00:00:00     10  alloc node[405-414]
standard*    up 1-00:00:00      5   idle node[415-417,419-420]

In the above example, 10 nodes are in use (allocated) and 5 are idle (available).
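For a node-by-node breakdown, sinfo can also produce a node-oriented long listing (the exact columns vary between Slurm versions):

sinfo -N -l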

Listing Job History

The sacct command allows you to obtain a history of your jobs. Run without a start or end date, it reports jobs from midnight onwards:

sacct -X

The -X option produces a report on the overall job allocation, and does not break it down into individual job steps (if, for example, a submitted job has several steps launched from a single job-script).

You can supply an arbitrary start date (MMDDYY):

sacct -X -S 010118

Or both a start date and an end date:

sacct -X -S 010118 -E 013118

Info

sacct without a start date or end date only reports jobs from midnight.

A more useful form of the command, which includes additional information such as job timings, could be something like:

sacct -X -S 010718 --format=JobID,User,Account,Partition,Submit,Start,Elapsed,ElapsedRaw,AllocCPUS,CPUTimeRaw

Here, Elapsed gives the wall-clock time (ElapsedRaw reports it in seconds) and CPUTimeRaw is the total core-seconds consumed (divide by 3600 to get core-hours). Submit and Start are the submission and start times, and -S 010718 starts the report from 7th January 2018 (MMDDYY format). -X produces the report for the total allocation of each job, not individual job steps. You should find that for each job ElapsedRaw * AllocCPUS = CPUTimeRaw; for example, a job that ran for 7200 seconds on 80 cores has CPUTimeRaw = 7200 * 80 = 576000 core-seconds, i.e. 160 core-hours.

Tip: Output options

A full list of output options can be obtained by typing:

sacct -e