Monitoring Jobs
All jobs run on the Research Computing (HPC) systems must be submitted via the job scheduling system, Slurm. To submit a job, you write a script that specifies the resources required to run it and submit that script with the sbatch command.
Slurm
Slurm is an open source cluster management and job scheduling system. Its key features include:
- it allocates exclusive and/or non-exclusive access to resources (compute nodes)
- it provides a framework for starting, executing, and monitoring work on the set of allocated nodes
- it arbitrates contention for resources by managing a queue of pending work.
Additionally, Slurm can be used for accounting, advanced reservation, gang scheduling (time sharing for parallel jobs), backfill scheduling, topology-optimized resource selection, resource limits by user or bank account, and sophisticated multifactor job prioritization algorithms.
Overview of Slurm Commands
Clicking on the commands below will take you to more detailed information at the Slurm website.
Command | Description |
---|---|
sbatch | Submits a job script; the script needs to contain at least one srun command to launch the job |
scancel | Cancels a job |
squeue | Lists jobs in the queue |
srun | Command for launching a parallel job |
sacct | Produces a report about active or completed jobs |
salloc | Allocates resources for a job |
sattach | Attaches standard input, output and error streams to a currently running job |
With the srun command it is possible to specify resource requirements such as: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (amount of memory, disk space, certain required features, etc.).
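For illustration, a possible srun invocation using some of these options might look like the following (the node name and program are hypothetical):
# Request between 2 and 4 nodes, 8 tasks in total and 4 GB of memory per node,
# and exclude a specific node from the allocation
srun --nodes=2-4 --ntasks=8 --mem=4G --exclude=node031 ./my_program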
Job Submission and Monitoring
More information about preparing job scripts is available at the Submitting Jobs page.
Assuming an appropriate job script has been prepared (namd_stmv.sh), the job should be submitted using the sbatch command:
sbatch namd_stmv.sh
The system responds with confirmation of the submission, together with a job ID, e.g.:
Submitted batch job 2164
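For reference, a minimal sketch of what a job script such as namd_stmv.sh might contain is shown below; the partition name, resource values and the executable line are assumptions, and the Submitting Jobs page remains the definitive guide:
#!/bin/bash
#SBATCH --job-name=stmv           # job name shown by squeue
#SBATCH --partition=compute       # partition (queue) to submit to (assumed name)
#SBATCH --nodes=9                 # number of nodes (assumed value)
#SBATCH --ntasks-per-node=8       # tasks per node (assumed value)
#SBATCH --time=24:00:00           # wall-clock time limit

# launch the parallel job on the allocated nodes (executable name is hypothetical)
srun ./namd2 stmv.namd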
Monitoring job status
Viewing all jobs
Simply type:
squeue
or use the shorthand version:
sq
The sq command is a local alias set up only on ARCHIE.
Viewing your own jobs
To check the status of your own jobs, type:
squeue -u DSusername
or
sqme
which, again, is a local alias.
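If you prefer not to type your DSusername, the shell's $USER variable gives the same result:
squeue -u $USER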
The output might look something like:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
122164 compute stmv cxb01102 PD 0:02 9 (AssociationJobLimit)
122164 compute stmv cxb01102 R 0:10 9 node[031-035,045-048]
R means the job is running; PD means it is pending.
Column name | Meaning |
---|---|
JOBID | Shows Job ID number |
PARTITION | Shows partition (queue) on which the job is running |
NAME | Shows the job name as specified in the job script |
USER | Shows the DSusername of job owner |
ST | Job status (PD: pending; R: running; CF: configuring; CG: completing; CA: cancelled; F: failed) |
TIME | Job running time (wall clock) |
NODES | Shows number of nodes requested |
NODELIST | Lists the nodes allocated to the job |
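If you want to choose which of these columns are displayed, squeue accepts a --format option; a sketch (the field widths are arbitrary):
# show job ID, name, state, elapsed time, node count and node list for your jobs
squeue -u $USER --format="%.10i %.12j %.4t %.10M %.6D %R"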
Deleting a job
To delete a job from the queue:
scancel JOBID
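scancel can also select jobs by owner or by name rather than by job ID, for example:
# cancel all of your own jobs
scancel -u $USER
# cancel jobs by name (the job name here is hypothetical)
scancel --name=stmv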
More detailed job information
For detailed information on a running job, e.g. the amount of memory used, type:
sstat -j JOBID
For example:
sstat -j 401
JobID MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite
------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- -------------- ------------ --------------- --------------- ------------ ------------ ---------------- ---------------- ------------
401.0 210092K node405 0 16380.40K 1944K node405 0 1930.40K 0 node405 0 0 00:00.000 node405 0 00:00.000 10 2.93M Unknown Unknown Unknown 0 0.31M node405 0 0.23M 0.24M node405 0 0.13M
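The full sstat output is very wide; to report only a few fields of interest you can use the --format option, for example:
# peak memory, average CPU time and task count for the running job
sstat -j 401 --format=JobID,MaxRSS,AveCPU,NTasks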
For other information, you can use
scontrol show job JOBID
For example:
scontrol show job 401
JobId=401 JobName=namd-benchmark
UserId=acx03155(5002) GroupId=users(100) MCS_label=N/A
Priority=1 Nice=0 Account=testing QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:12 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2018-03-22T15:50:37 EligibleTime=2018-03-22T15:50:37
StartTime=2018-03-22T15:50:37 EndTime=2018-03-23T15:50:37 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=parallel AllocNode:Sid=node401:9342
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node[405-414]
BatchHost=node405
NumNodes=10 NumCPUs=80 NumTasks=80 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=80,node=10
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/users/acx03155/slurm/NAMD/typical_protein/run_namd.sh
WorkDir=/users/acx03155/slurm/NAMD/typical_protein
StdErr=/users/acx03155/slurm/NAMD/typical_protein/slurm-401.out
StdIn=/dev/null
StdOut=/users/acx03155/slurm/NAMD/typical_protein/slurm-401.out
Power=
Estimated job start time
You can obtain estimated job start times from the scheduler by typing:
squeue --start
For a particular job:
squeue --start -j JOBID
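To restrict the estimate to your own pending jobs, the options can be combined:
squeue --start -u $USER --states=PENDING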
Cluster usage overview
To obtain an overview of the usage of the cluster and the availability of nodes, type:
sinfo
This will generate output such as:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
standard* up 1-00:00:00 10 alloc node[405-414]
standard* up 1-00:00:00 5 idle node[415-417,419-420]
In the above example, 10 nodes are in use (allocated) and 5 are idle (available).
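For a more detailed, per-node view (including CPU and memory information for each node), sinfo can be run in node-oriented long format:
sinfo -N -l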
Listing Job History
The sacct command allows you to obtain a history of your jobs. Typing sacct on its own will report jobs run since midnight:
sacct -X
The -X option produces a report on the overall job allocation, and does not break it down into individual job steps (if, for example, a submitted job has several steps launched from a single job-script).
You can supply an arbitrary start date (MMDDYY):
sacct -X -S 010118
Or both a start date and an end date:
sacct -X -S 010118 -E 013118
Info
sacct without a start date or end date only reports jobs run since midnight.
A more useful form of the command, which includes additional information such as job timings, could be something like:
sacct -X -S 010718 --format=JobID,User,Account,Partition,Submit,Start,Elapsed,ElapsedRaw,AllocCPUS,CPUTimeRaw
Where Elapsed gives the wallclock time (ElapsedRaw reports it in seconds) and CPUTimeRaw is the total core-seconds consumed (divide by 3600 to get core-hours). Submit and Start are the submission and start times, and -S 010718 starts the report from 7th January 2018 (MMDDYY format). -X produces the report for total allocations for every job, and not individual job steps. You should find that for each job, ElapsedRaw * AllocCPUS = CPUTimeRaw.
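As a worked example of that relationship (the numbers are hypothetical): a job that ran for 2 hours on 40 cores has ElapsedRaw = 7200 and AllocCPUS = 40, so CPUTimeRaw = 7200 * 40 = 288000 core-seconds, i.e. 288000 / 3600 = 80 core-hours. The same fields can be requested for a single job with:
sacct -X -j JOBID --format=JobID,Elapsed,ElapsedRaw,AllocCPUS,CPUTimeRaw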
Tip: Output options
A full list of output options can be obtained by typing
sacct -e