
Understanding Slurm Jobstates and Reasons

A submitted job may be in one of numerous job states. This page describes these states in terms of a roller coaster.

You can see the state of your job broadly with the myjobs command.

[rcsparky@sol-login01:~]$ myjobs
JobID PRIORITY PARTITION/QOS NAME STATE TIME TIME_LIMIT Node/Core/GPU NODELIST(REASON)
50488936 1311012 public/public run.sh PENDING 0:00 12:00:00 1/100/NA (Priority)
50489161 1310086 public/public run.sh PENDING 0:00 12:00:00 1/100/NA (Priority)
50489228 1309739 public/public run.sh PENDING 0:00 12:00:00 1/100/NA (Priority)

This shows your pending and running jobs, along with the resources each job is requesting and its time limit.

Roller Coaster Analogy

Slurm Job States

Running State

The state RUNNING is designated for any job whose requested resources have become available and been allocated to it. Whether batch or interactive, the request was fulfillable and the resources are ready for use (or are now in use).

This is analogous to actually being in the coaster car and riding the ride.

Pending State

A PENDING state means the job has been accepted by the scheduler, the resource request has been recognized as valid, and the job is awaiting available resources to run.

This state covers everyone waiting in line for the ride.

Reason Codes

Below is a list of reason codes that may accompany the PENDING state:

ReqNodeNotAvail, May be reserved for other job

$ scontrol show job 12340
JobId=12340 JobName=eval_1600
JobState=PENDING Reason=ReqNodeNotAvail, May be reserved for other job
StartTime=2026-04-06T13:26:25 EndTime=Unknown Deadline=N/A

The ReqNodeNotAvail, May be reserved for other job reason code means your job is designated to run on an intended resource--a specific host such as sg041 or sc012.

This reason indicates the start is imminent: this job is the next to run on the named resources. scontrol show job lists a StartTime, though the job may start earlier if the job currently using that resource finishes ahead of schedule.

This is analogous to a rider in the ready-to-load lines; the order is definitive and you're simply awaiting the return of the ride to the station. If "getting to ride in the front car" (say, getting 400GB of memory on a node) is the resource, others may still board the ride before you in other cars (other jobs may start), provided they don't compete for your spot (the resources you expect).

ReqNodeNotAvail,_Reserved_for_maintenance

$ scontrol show job 12341
JobId=12341 JobName=eval_1600
JobState=PENDING Reason=ReqNodeNotAvail,_Reserved_for_maintenance
StartTime=2026-04-06T13:26:25 EndTime=Unknown Deadline=N/A

This means your job has been accepted into the queue, but based on its requested runtime it does not have enough time to start and finish before scheduled maintenance of the node.

This appears often as scheduled maintenance periods approach. If maintenance is set to begin in 168 hours (7 days), a job that requests 7 days, e.g., -t 7-0 will remain in the queue until at least maintenance ends.

Determine whether your job truly requires this much time, and reduce the request if possible to allow execution before the scheduled maintenance.
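As a rough sketch (all numbers hypothetical), the rule of thumb is that the requested walltime must fit inside the window remaining before the maintenance reservation begins; scontrol show reservation lists upcoming reservations on most clusters:

```shell
# Hypothetical example: maintenance begins 168 hours (7 days) from now.
# (`scontrol show reservation` shows the real maintenance windows.)
hours_until_maintenance=168
requested_hours=$(( 7 * 24 ))   # a request of -t 7-0 is 168 hours
if [ "$requested_hours" -ge "$hours_until_maintenance" ]; then
  # the job cannot finish in time, so it waits until after maintenance;
  # a shorter request could still run beforehand
  echo "will not start until after maintenance; try -t $(( hours_until_maintenance - 1 )):00:00 or less"
fi
```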

Resources

$ scontrol show job 12342
JobId=12342 JobName=eval_1600
JobState=PENDING Reason=Resources
StartTime=Unknown EndTime=Unknown Deadline=N/A

Resources indicates the job is the single highest-priority job in its partition (queue). The next available resources that can fit this job will be allocated to it.

Note that no StartTime is listed, because the scheduler has not yet determined where the job will run. For example, the next available GPU might be planned for release in 10 minutes, but if any other sufficient GPU frees up earlier, this job will take it.

A job waiting on Resources can be demoted to second-in-line if a new job is queued, but only if that new job has a higher priority than every other job in the partition. This is possible but uncommon; it is most often seen with private-QOS submissions on private hardware.

Priority

$ scontrol show job 12343
JobId=12343 JobName=eval_1600
JobState=PENDING Reason=Priority
StartTime=Unknown EndTime=Unknown Deadline=N/A

Priority as the reason your job has not yet started means you are still waiting in the long line.

The analogy is a little shaky here, since real roller coaster lines are strictly first-in-first-out; but if FairShare roller coaster queues existed, this would be the equivalent of permitting some cutting in line because some people haven't had as much ride time as you.
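Slurm computes this priority as a weighted sum of normalized factors (age, fairshare, job size, partition, QOS). A minimal sketch with hypothetical weights -- real weights are site-configured and viewable with sprio -w:

```shell
# Sketch of Slurm's multifactor priority: each factor is normalized to
# [0,1] and multiplied by a site-configured weight. Numbers below are
# hypothetical, for illustration only.
age=0.5          # how long the job has waited (normalized)
fairshare=0.2    # low if your group used a lot of compute recently
age_weight=1000
fairshare_weight=10000
priority=$(awk -v a="$age" -v aw="$age_weight" \
               -v f="$fairshare" -v fw="$fairshare_weight" \
               'BEGIN { printf "%d", a * aw + f * fw }')
echo "$priority"
```

A job that has waited longer, or whose group has used less compute, scores higher and moves up the line.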

QOSMaxCpuPerJobLimit

$ scontrol show job 12344
JobId=12344 JobName=eval_1600
JobState=PENDING Reason=QOSMaxCpuPerJobLimit
StartTime=Unknown EndTime=Unknown Deadline=N/A

QOSMaxCpuPerJobLimit mostly applies to class accounts, which have resource limits.

It means the request is within the allowable limits but cannot run yet: if permitted to run concurrently with the other jobs under the same limits, the aggregate resources requested would exceed them.
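A toy sketch of the aggregate check, with hypothetical numbers: the job's own request is valid, but running it now would push the running total over the QOS cap:

```shell
# Hypothetical numbers: a class QOS caps aggregate usage at 500 CPUs.
grp_cpu_limit=500
cpus_in_use=450   # CPUs already held by running jobs under this QOS
job_request=100   # this job's request (valid on its own)
if [ $(( cpus_in_use + job_request )) -gt "$grp_cpu_limit" ]; then
  # the job stays PENDING until enough running jobs finish
  echo "job waits: running now would exceed the cap by $(( cpus_in_use + job_request - grp_cpu_limit )) CPUs"
fi
```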

None

$ scontrol show job 12345
JobId=12345 JobName=eval_1600
JobState=PENDING Reason=None
StartTime=Unknown EndTime=Unknown Deadline=N/A

None means there is no reason your job shouldn't be running; wait a few more moments and it will be marked as running. This is an interim state indicating your job is starting. In the analogy, this is loading into the car itself.

Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions

$ scontrol show job 12346
JobId=12346 JobName=eval_1600
JobState=PENDING Reason=Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions
StartTime=Unknown EndTime=Unknown Deadline=N/A

This state means your job can only run on nodes that are not currently accepting jobs. The most common cause is that you require a scarce resource, e.g., an uncommon GPU or FPGA, and the node providing it has failed.

Alternatively, this may come up if you request a specific node (-w scc020) for your job. If jobs in other, higher-priority partitions also need that node, their higher priority will be honored first.

RequeueHold

Reason Codes

launch_failure_limit_exceeded_requeued_held

$ scontrol show job 12347
JobId=12347 JobName=eval_1600
JobState=REQUEUEHOLD Reason=launch_failure_limit_exceeded_requeued_held
StartTime=Unknown EndTime=Unknown Deadline=N/A

This reason indicates your job is no longer scheduled to run. You have asked for valid resources, but something else is keeping the job from launching successfully; in most cases this happens to jobs suffering environmental issues. Sometimes the problem is at the node level and should be reported to Research Computing admins to investigate. Admins can also re-enter jobs in this state into the queue.

Backfilling

Jobs that request a comparatively small amount of resources can be backfilled. Consider another user submitting a job requesting 4xA100 GPUs.

If a GPU node has four jobs running, it might look like this:

Time  →   now                 +2h                  +4h
          |-------------------|-------------------|

sg001
GPU 0 [ Job 1 ======================================= ]
GPU 1 [ Job 2 ========== ] ------- free -------------
GPU 2 [ Job 3 ========== ] ------- free -------------
GPU 3 [ Job 4 ========== ] ------- free -------------
      ^^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^^^^^^^^^^
       all 4 GPUs busy       3 GPUs available for
                             backfilling for 2 hours

So while that user may have priority for all 4 GPUs starting 4 hours from now, if your job can use 1-3 GPUs for two hours or less, you can use those resources out of turn. This is allowed because your job does not get in the way of any running job or any planned job.

Such jobs may sit in the queue as PENDING with reason Priority, but if they are backfillable, they can go straight to RUNNING.

Backfilling has the benefit of letting small jobs "skip the line". Consider a group of 7 people on an 8-person ride who all want to ride together: instead of letting the 8th seat go empty, it can be backfilled, maximizing throughput while still keeping the riders together.
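The backfill decision from the diagram above can be sketched as a simple fit test (numbers are hypothetical, matching the diagram): the small job must finish before the reserved job's planned start and fit in the idle GPUs:

```shell
# Backfill fit test sketch: 3 GPUs sit idle for the 2 hours before the
# reserved 4xA100 job is due to start (hypothetical numbers).
hole_hours=2; hole_gpus=3   # the idle window before the big job
req_hours=2; req_gpus=2     # what our small job asks for
if [ "$req_hours" -le "$hole_hours" ] && [ "$req_gpus" -le "$hole_gpus" ]; then
  echo "backfillable: starts now without delaying the reserved job"
else
  echo "not backfillable: must wait in priority order"
fi
```

Note the importance of an accurate -t request: a job asking for 12 hours "just in case" cannot fit a 2-hour hole, even if it would actually finish in time.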