Evaluating Job Performance
HPC vs HTC
A major benefit of supercomputers is the flexibility in scaling workloads: you can apply a large amount of resources to a single monolithic workload, or a small amount of resources to each of many (thousands of) independent but concurrently running jobs. These computing modes are called High Performance Computing (HPC) and High Throughput Computing (HTC), respectively. Which applies, and whether it can accelerate your code, is determined by the applications you require and your capacity to meaningfully split your computations.
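For example, an HTC workload with many independent inputs can often be expressed as a Slurm job array. The following is only a sketch: the script name, input file naming scheme, and resource values are placeholders, not recommendations.
#!/bin/bash
#SBATCH -c 1                  # each array task is small and serial
#SBATCH -t 0-00:30:00         # 30 minutes per task
#SBATCH --array=1-1000        # 1000 independent tasks, scheduled concurrently as resources allow

# Each task processes its own input file, selected by the array index.
./process_sample.sh input_${SLURM_ARRAY_TASK_ID}.dat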
Code Runs Faster on a Workstation than on the Supercomputer
This frequently occurs when your job is not adapted for parallelization, or when the applications (program binaries) themselves are not suited for it.
Many user scripts, including Python and R scripts, are not written to use multiple CPU cores. Even if you request additional CPU cores, they may not accelerate the job's overall execution. It is important, therefore, to check whether your application can leverage this kind of parallelism across cores (commonly referred to as multi-threading, multi-processing, or OpenMP).
Without specific and deliberate activation of these types of parallelization, your job may simply rely on the speed of a single processor core (typically measured in GHz), and this usually paints a misleading picture: when it comes to cores, high-performance computing is about quantity over quality. Individual cores often run at slower speeds than the cores in your workstation, with the expectation that the sheer volume of available cores offsets that slowness. Thus, it is expected that one core on the supercomputer will be slower than one core on your laptop. Figuring out how to scale across multiple cores is the crux of properly leveraging supercomputer power.
The speed of an individual core is called its clock speed or CPU frequency. The clock speed of supercomputer cores is about 2 GHz, while many workstation CPUs have clock speeds exceeding 3 or even 4 GHz.
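If your application supports OpenMP-style multi-threading, the batch script needs to both request the cores and tell the program to use them. The following is only a sketch; the program name, input file, and resource values are placeholders.
#!/bin/bash
#SBATCH -c 8                  # request 8 cores on a single node
#SBATCH -t 0-01:00:00         # 1 hour of wall time
#SBATCH --mem=4G              # memory for the whole job

# Tell an OpenMP/multi-threaded application how many cores it may use;
# many programs otherwise default to a single thread.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./my_threaded_app input.dat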
Serial vs. Parallel Computing
Serial computing means running commands one at a time, in succession. This is what programs are doing when they use only a single core: they run commands one by one. Parallel computing spreads those commands across multiple cores so they can run simultaneously, each with its own CPU core and its own share of memory. It is analogous to grocery store checkout lanes: the more lanes that are open, the faster all the customers can be served. Some programs do this automatically, such as most MATLAB functions. Some programs have options to do parallel computing, such as many SAS functions. Many programs need to be recoded by hand to run in parallel, such as Python and R scripts. Creating parallel scripts differs greatly between programming languages and will not be discussed here, but the shell sketch below illustrates the basic idea for independent commands.
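In this sketch, ./analyze and the sample files are placeholders for your own program and data; it assumes the commands are fully independent of one another.
# Serial: each command waits for the previous one to finish.
./analyze sample1.dat
./analyze sample2.dat
./analyze sample3.dat
./analyze sample4.dat

# Parallel: the same independent commands start at once, one per core;
# `wait` pauses the script until all of them have finished.
./analyze sample1.dat &
./analyze sample2.dat &
./analyze sample3.dat &
./analyze sample4.dat &
wait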
Not everything can be completely parallelized and sped up. If commands depend on the results of previous commands, the workload is inherently serial. This comes up often in programs that advance through timesteps, such as agent-based models and genome evolution models. It means there is a point where adding more cores cannot meaningfully speed up the calculation: if, say, only 90% of a workload can run in parallel, no number of cores can make the whole job more than about 10 times faster (a consequence known as Amdahl's law).
The process of measuring how much code execution speeds up as the number of cores increases is called profiling (more specifically, a scaling study). To learn more about profiling and parallel computing, we recommend the book Parallel and High Performance Computing, available in eBook format from the ASU library.
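A rough first pass is to time the same run at several core counts and compare. This sketch assumes an OpenMP-style application; my_threaded_app and input.dat are placeholders for your own program and data.
# Time the same run with 1, 2, 4, and 8 cores and compare the results.
for n in 1 2 4 8; do
    export OMP_NUM_THREADS=$n
    echo "--- $n cores ---"
    time ./my_threaded_app input.dat
done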
Use the seff Command to See if your Code is Using Resources Efficiently
seff is short for "Slurm efficiency" and displays the percentage of the CPU time and memory allocated to a job that the job actually used over its run. The goal is high efficiency, so that jobs do not hold resources they are not using.
Example of an inefficient job:
[jeburks2@sol-login02:~]$ seff 11273084
Job ID: 11273084
Cluster: sol
User/Group: jeburks2/grp_rcadmins
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:59
CPU Efficiency: 12.55% of 00:07:50 core-walltime
Job Wall-clock time: 00:07:50
Memory Utilized: 337.73 MB
Memory Efficiency: 16.85% of 2.00 GB
[jeburks2@sol-login02:~]$
Two points stand out:
- The job had a CPU allocated for ~8 minutes but used it for only 59 seconds, a 12.55% CPU efficiency.
- The job used 337 MB of the 2 GB of memory requested, a 16.85% memory efficiency.
Example of a CPU-efficient job:
[jeburks2@sol-login02:~]$ seff 11273083
Job ID: 11273083
Cluster: sol
User/Group: jeburks2/grp_rcadmins
State: TIMEOUT (exit code 0)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:59:39
CPU Efficiency: 98.98% of 01:00:16 core-walltime
Job Wall-clock time: 00:15:04
Memory Utilized: 337.73 MB
Memory Efficiency: 4.12% of 8.00 GB
Here, 4 CPU cores were requested for 15 minutes, which is a total of (4 × 15 =) 60 CPU-minutes.
- The job used all four cores at an appreciable 98.98% efficiency, consuming 59:39 of the 60:16 of core-walltime available: a properly efficient job!
- The job used 337 MB of the 8 GB (8192 MB) of memory requested, a low-efficiency utilization from a memory perspective.
This lets us know that if we run this job in the future, we can allocate less memory. Doing so will reduce the impact on our fair share and use the system more efficiently.
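For example, assuming the memory was requested with --mem, the request in the batch script could be lowered to something closer to actual usage while still leaving headroom (the value below is only an illustration):
#SBATCH --mem=1G    # was 8G; seff showed only ~340 MB actually used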
Note: seff does not account for GPU utilization. A GPU-heavy job may run correctly and properly, yet understandably report a low CPU efficiency.
Computational Research Accelerator
The Computational Research Accelerator is a team within Research Computing that can help with speeding up, or accelerating, your code. This can include optimizing and parallelizing the code, as well as experimenting with novel hardware. They offer short consultation sessions or long-term embedded consultations within a project. To request their services, create a support ticket as described on our RTO Request Help page.