
Pulling Job Statistics

Overview

This page shows researchers how to pull system usage information about running and completed jobs.

Running Jobs

For currently running jobs, seff and sacct will not report accurate statistics. To see the current CPU, memory, and GPU usage, you will need to connect to the node the job is running on.

If you have an sbatch script running, you can use the myjobs command to find the node the job is running on. Look for the NODELIST column.

[jeburks2@sol-login02:~]$ myjobs
JobID ... PARTITION/QOS NAME STATE ... Node/Core/GPU NODELIST(REASON)
11273558 ... general/public myjob RUNNING ... 1/1/NA sc008

In the example above, the job is running on node sc008. We will then connect directly to that node with ssh.

tip

You can only ssh to nodes where you have a job running; otherwise, the ssh connection will fail. When you ssh to a node, you join the cgroup on that node that runs your job.

[jeburks2@sol-login02:~]$ ssh sc008
[jeburks2@sc008:~]$

Notice the bash prompt changed from username@sol-login02 to username@sc008, indicating we are now on node sc008. We can now use the following commands to view information about our job.

CPU Usage - htop / top

htop -u $USER # interactive view of the CPU and memory use of our job's processes
top -u $USER  # classic equivalent; the sample output below is from top
top - 15:12:17 up 62 days, 16:04,  6 users,  load average: 83.96, 83.95, 83.37
Tasks: 1490 total, 7 running, 1483 sleeping, 0 stopped, 0 zombie
%Cpu(s): 63.6 us, 0.8 sy, 0.0 ni, 35.2 id, 0.1 wa, 0.3 hi, 0.0 si, 0.0 st
KiB Mem : 52768358+total, 43774208+free, 44024912 used, 45916584 buff/cache
KiB Swap: 4194300 total, 1393524 free, 2800776 used. 47617328+avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
931372 jeburks2  20   0 7162680   1.0g 341492 S  69.1  0.2   0:07.10 /packag+
931369 jeburks2  20   0   67228   7016   4456 R   1.0  0.0   0:00.34 top -u +
928130 jeburks2  20   0   12868   3100   2908 S   0.0  0.0   5:02.62 /bin/ba+
929092 jeburks2  20   0  175560   5592   4208 S   0.0  0.0   0:00.01 sshd: j+
929093 jeburks2  20   0   24656   4516   3420 S   0.0  0.0   0:00.70 -bash

This shows each process and its CPU and memory usage. CPU usage is expressed as a percentage of one core: 100% is one fully used CPU core, so a process using 8 cores may show 800%, or the job may list 8 processes at 100% each.

Press q to quit top or htop.
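
If you prefer a one-shot snapshot over an interactive view, standard ps options can print the same information. This is a minimal sketch; the format specifiers shown are standard, though exact column availability varies slightly between systems:

ps -u $USER -o pid,pcpu,pmem,rss,comm --sort=-pcpu | head
# pcpu uses the same scale as top: 100% = one fully used core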

GPU Usage - nvtop

nvtop # Displays GPU usage for our job; this only works on GPU nodes
Device 0 [NVIDIA A100-SXM4-80GB] PCIe GEN 4@16x RX: 1.089 GiB/s TX: 1.178 GiB/s
GPU 1410MHz MEM 1593MHz TEMP 38°C FAN N/A% POW 116 / 500 W
GPU[|||||||||||||||| 44%] MEM[ 1.238Gi/80.000Gi]
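
If nvtop is unavailable, nvidia-smi (installed alongside the NVIDIA driver on GPU nodes) provides a one-shot snapshot of the same information:

nvidia-smi       # one-time snapshot of GPU utilization, memory, and processes
nvidia-smi -l 5  # repeat the snapshot every 5 seconds; press Ctrl-C to stop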

When done viewing job statistics, type exit to return to the login node.

[jeburks2@sc008:~]$ exit
logout
Connection to sc008 closed.
[jeburks2@sol-login02:~]$

Completed Jobs

Once a job has completed, been canceled, or failed, pulling its statistics is straightforward. There are two main commands for this: seff and mysacct.

seff

seff is short for "Slurm efficiency" and displays the percentage of CPU and memory a job actually used relative to what it was allocated over the job's runtime. The goal is high efficiency, so that jobs are not allocating resources they are not using.
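
seff takes a job ID as its only argument:

seff <jobid>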

Example of seff for an inefficient job

[jeburks2@sol-login02:~]$ seff 11273084
Job ID: 11273084
Cluster: sol
User/Group: jeburks2/grp_rcadmins
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:59
CPU Efficiency: 12.55% of 00:07:50 core-walltime
Job Wall-clock time: 00:07:50
Memory Utilized: 337.73 MB
Memory Efficiency: 16.85% of 2.00 GB
[jeburks2@sol-login02:~]$

This shows the job held a CPU for nearly 8 minutes but only used it for 59 seconds, resulting in roughly 12% CPU efficiency. Memory efficiency was similarly low, at about 17% of the 2 GB requested.

Example of seff for a CPU efficient job

[jeburks2@sol-login02:~]$ seff 11273083
Job ID: 11273083
Cluster: sol
User/Group: jeburks2/grp_rcadmins
State: TIMEOUT (exit code 0)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:59:39
CPU Efficiency: 98.98% of 01:00:16 core-walltime
Job Wall-clock time: 00:15:04
Memory Utilized: 337.73 MB
Memory Efficiency: 4.12% of 8.00 GB

In this example, the job used all four cores it was allocated for roughly 99% of the time the job ran. The core-walltime is calculated as the number of CPU cores * the length of the job; this 15-minute job with 4 CPUs had a core-walltime of 01:00:16. However, the memory efficiency is rather low. This lets us know that if we run this job in the future, we can request less memory. This will reduce the impact to our fair share and use the system more efficiently.
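
As a quick sanity check against the output above:

4 cores * 00:15:04 wall-clock = 01:00:16 core-walltime
00:59:39 CPU utilized / 01:00:16 core-walltime ≈ 98.98% CPU efficiency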

warning

Note: seff does not display statistics for GPUs, so a GPU-heavy job will likely have inaccurate seff results.

sacct / mysacct

The sacct and mysacct commands allow a user to easily pull up information about jobs that have completed.

Specify either a job ID or a username with the --jobs or --user flag, respectively, to pull up all information on a job:

sacct --jobs=<jobid[,additional_jobids]>
sacct --user=<username>

Some available --format variables are listed in the table below and may be passed as a comma-separated list:

sacct --user=<username> --format=<var_1[,var_2,...,var_N]>
Variable      Description
account       Account the job ran under
allocTRES     Allocated trackable resources (e.g., cores/RAM)
avecpu        Average CPU time of all tasks in the job
cputime       Formatted (elapsed time * core count) used
elapsed       The job's elapsed time, formatted as DD-HH:MM:SS
state         The job's state
jobid         The ID of the job
jobname       The name of the job
maxdiskread   Maximum number of bytes read
maxdiskwrite  Maximum number of bytes written
maxrss        Maximum RAM use of all job tasks
ncpus         The number of allocated CPUs
nnodes        The number of allocated nodes
ntasks        Number of tasks in a job
priority      Slurm priority
qos           Quality of service
user          Username of the person who ran the job
tip

For convenience, the command mysacct has been added to the system. This is equivalent to sacct --user=$USER --format=jobid,avecpu,maxrss,cputime,allocTRES%42,state and accepts the same flags that sacct would, e.g. --starttime=YYYY-MM-DD or --endtime=YYYY-MM-DD.
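
For example, to list your jobs since December 15th, 2020 with mysacct's default format:

mysacct --starttime=2020-12-15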

Examples for better understanding job hardware utilization

Note that by default, only jobs run on the current day will be listed. To search within a different period of time, use the --starttime flag. The --long flag can also be used to show a non-abbreviated version of sacct output. For example, to list detailed job characteristics for a user's jobs since December 15th, 2020:

sacct --user=$USER --starttime=2020-12-15 --long

This produces a lot of output. As an example of formatted output, the following complete command lists information about a user's jobs that ran today: the job ID, average CPU use, maximum amount of RAM (memory) used, the core time (wall time multiplied by the number of cores allocated), and the job's state:

sacct --user=$USER --format=jobid,avecpu,maxrss,cputime,state
note

The above command in conjunction with appropriate --starttime filtering is very useful for understanding more efficient hardware requests for future jobs. For instance, if maxrss is 1 GB, then the default memory allocated to a job (4 GB+) is more than sufficient.
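
For example, if maxrss for a job came back at about 1 GB, a future submission could request memory closer to that figure. The 2G below is an illustrative value, chosen to leave a small buffer above the observed use:

#SBATCH --mem=2G  # observed maxrss was ~1 GB; request a modest buffer above it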

info

An additional useful format flag is allocTRES%42, which prints the job's allocated "trackable resources" with a width of 42 characters; e.g., billing=1,cpu=1,mem=4G,node=1 would be printed for a 1-core job. The allocTRES field is helpful for comparing against the avecpu and maxrss values, for instance.
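
For instance, to compare a single job's requested resources against what it actually used:

sacct --jobs=<jobid> --format=jobid,allocTRES%42,avecpu,maxrss,state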

warning

If a + is listed at the end of a field, that field has likely been truncated to fit into a fixed number of characters. Consider increasing the width by appending a % followed by a number to specify a new width. For example, allocTRES%42 overrides the default width to 42 characters.

Additional Help

If you require further assistance, contact the Research Computing Team.

We also offer Educational Opportunities and Workshops.