Scalability
Background and Purpose
The performance of parallelised codes can vary widely depending on the chosen parameterisation and the problem at hand. If you want to deploy your code on more than one computing node (40 cores), we ask you to demonstrate the scalability of your software. This is to ensure that the computing resource is used efficiently for the benefit of all users.
We provide a dedicated account to carry out this analysis. Please contact the ARCHIE-WeSt support team to gain access and for more information.
Info
To run on > 1 node (40 cores) you need to demonstrate the scalability of your code.
To request access to the scaling account ID to perform scaling tests, complete the online form at https://www.archie-west.ac.uk/archie-west-scaling-test-request
Speedup and Parallel Efficiency
The speedup of a parallel code is defined as the ratio of runtimes using two different numbers of compute cores, threads or MPI-tasks. It is usually measured against the single core performance of the code. If T_1 is the runtime on one core, and T_N the runtime on N cores, then the speedup S_N is given as
S_N = T_1 / T_N
Perfect speedup means that the runtime T_N on N cores is simply the Nth fraction of the runtime T_1 on one core. This would lead to linear speedup S_N = N.
If the code will be deployed on a larger number of cores, then it might be more appropriate to measure the speedup against the number of cores that form a small sub-part of the computing architecture. This can be for instance a compute node, which consists of 40 cores.
Another way to measure the code performance is the parallel efficiency. This is simply the speedup divided by the number of cores.
P_N = S_N / N = T_1 / (T_N * N)
Perfect parallel efficiency means P_N = 1.
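As a minimal illustration (not part of any particular package; the function names and timings below are hypothetical), both quantities can be computed directly from two measured loop times. The optional reference core count covers the case discussed above where the baseline is a full 40-core node rather than a single core.

```python
# Minimal sketch for computing speedup and parallel efficiency from measured
# loop times. Function names and the example numbers are illustrative only.

def speedup(t_ref, t_n):
    """Speedup as the ratio of two loop times, e.g. S_N = T_1 / T_N."""
    return t_ref / t_n

def parallel_efficiency(t_ref, t_n, n, n_ref=1):
    """Parallel efficiency P_N = S_N / (N / N_ref).

    n_ref is the core count of the reference run (1 for a single core,
    40 if the baseline is one full compute node). P_N = 1 is perfect scaling.
    """
    return speedup(t_ref, t_n) / (n / n_ref)

# Hypothetical timings: T_1 = 600 s on 1 core, T_40 = 20 s on 40 cores
print(speedup(600.0, 20.0))                   # S_40 = 30.0
print(parallel_efficiency(600.0, 20.0, 40))   # P_40 = 0.75
```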
Loop times
It is of fundamental importance that you measure the time the code spends in the main iteration loop, not the total runtime including start-up and finalisation. Parallel jobs incur fixed overheads, typically of the order of tens of seconds, which would render any scalability analysis meaningless if included in the timings. For the same reason we also strongly advise suppressing any significant I/O activity during the timed loop (see the sketch below the warning).
Warning
Only use timings from the main iteration loop of your code, i.e. exclude start-up and finalisation times.
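To make the distinction concrete, the toy Python sketch below times only the main iteration loop and excludes start-up and finalisation; the functions are placeholder stand-ins rather than real application code, and in an MPI code a call such as MPI_Wtime() would play the same role. Note that many packages report this loop time for you (LAMMPS, used in the example below, prints it at the end of a run).

```python
import time

def initialise():
    """Stand-in for start-up work: reading input, domain decomposition, ..."""
    time.sleep(0.1)

def advance_one_step():
    """Stand-in for one iteration of the main loop (no file output)."""
    sum(i * i for i in range(10_000))

def finalise():
    """Stand-in for finalisation: writing restart files, clean-up, ..."""
    time.sleep(0.1)

n_steps = 500

initialise()                              # NOT included in the reported timing

t_start = time.perf_counter()
for _ in range(n_steps):                  # time the main iteration loop only
    advance_one_step()
t_loop = time.perf_counter() - t_start

finalise()                                # NOT included in the reported timing

print(f"loop time: {t_loop:.2f} s")       # report this, not the total runtime
```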
Suitable Benchmark Case and Runtimes
The scaling test aims at what is referred to as strong scaling, i.e. you keep the system size fixed for all core counts. It is essential that you pick a representative but simplified benchmark case. This could be a system of comparable size, but run for only a short integration sequence or a small number of iterations.
A few hundred iterations are usually sufficient to obtain reliable timings and for your code to visit every significant part of the algorithm. You must, however, make sure that your benchmark still runs for a significant amount of time at the highest core count, e.g. for at least a few seconds (see also the example below). As general guidance, you can assume ideal scaling when estimating runtimes: for instance, if your benchmark runs for 5 sec on 400 cores (10 nodes), you should expect it to run for about 2000 sec on a single core. The upper limit of the wallclock time in the benchmarking queue is 1 h = 3600 sec; a short worked estimate of this check follows the tip below.
Tip
The benchmark case should run for at least a few seconds at the highest core count.
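To illustrate the estimate from the paragraph above, a minimal sketch (using the same illustrative numbers: 5 sec on 400 cores and the 3600 sec queue limit) checks whether the implied single-core runtime would fit within the benchmarking queue under the assumption of ideal scaling.

```python
# Worked estimate assuming ideal scaling: the numbers are the illustrative
# values from the text, not measured data.
t_high = 5.0         # expected loop time [s] at the highest core count
n_high = 400         # highest core count (10 nodes x 40 cores)
wall_limit = 3600.0  # wallclock limit of the benchmarking queue [s]

t_single_est = t_high * n_high           # 5 s * 400 cores = 2000 s on one core
print(f"estimated single-core loop time: {t_single_est:.0f} s")

if t_single_est > wall_limit:
    print("benchmark too large for the queue limit: reduce the iteration count")
else:
    print("single-core run fits within the benchmarking queue")
```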
Example
Please provide, as a minimum, a clear description of the code and the benchmark case, a graphical display of the speedup and/or parallel efficiency, and the underlying numerical data.
Code: Molecular dynamics code LAMMPS (Large-scale Atomic/Molecular Massively Parallel Solver).
Benchmark: The benchmark consists of 100,000 Lennard-Jones particles in a cubic box of 100 length units along every coordinate dimension. The loop times reported here correspond to 100,000 iterations of a Langevin integrator. The particles interact via a truncated and shifted Lennard-Jones potential with epsilon = 1.0, sigma = 1.0, rcut = 1.1225 at a reduced temperature T=0.1. The system has been equilibrated prior to this benchmark run. Runtimes:
#nproc loop time [sec]
1 585.63
2 291.33
4 152.41
8 81.20
10 62.73
20 34.12
40 22.15
80 12.41
120 10.06
160 8.06
200 7.21
240 6.72
Table 1: Time spent in the main iteration loop depending on the number of processors
Speedup and Parallel Efficiency
Figure 1: Speedup against number of compute cores
Figure 2: Parallel efficiency against number of compute cores
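For reference, plots such as Figures 1 and 2 can be produced directly from the loop times in Table 1, for example with a short Python/matplotlib script like the sketch below (the script is illustrative only, not a required part of the submission).

```python
import matplotlib.pyplot as plt

# Loop times from Table 1
nproc = [1, 2, 4, 8, 10, 20, 40, 80, 120, 160, 200, 240]
t_loop = [585.63, 291.33, 152.41, 81.20, 62.73, 34.12,
          22.15, 12.41, 10.06, 8.06, 7.21, 6.72]

speedup = [t_loop[0] / t for t in t_loop]             # S_N = T_1 / T_N
efficiency = [s / n for s, n in zip(speedup, nproc)]  # P_N = S_N / N

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(nproc, speedup, "o-", label="measured")
ax1.plot(nproc, nproc, "--", label="ideal (S_N = N)")
ax1.set_xlabel("number of cores N")
ax1.set_ylabel("speedup S_N")
ax1.legend()

ax2.plot(nproc, efficiency, "o-")
ax2.axhline(1.0, linestyle="--")
ax2.set_xlabel("number of cores N")
ax2.set_ylabel("parallel efficiency P_N")

fig.tight_layout()
fig.savefig("scaling.png", dpi=150)
```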
Please be advised that a simple graphical illustration of the loop times is not suitable to demonstrate scalability, as it obscures the performance (see the plot below).
Figure 3: Loop time against number of compute cores
Acceptable Results
You will only be allowed access to the number of nodes for which performance continues to scale linearly.