Basics of SLURM Jobs

The Simple Linux Utility for Resource Management (SLURM) is a system providing job scheduling and job management on compute clusters. With SLURM, a user requests resources and submits a job to a queue. The system will then take jobs from queues, allocate the necessary nodes, and execute them.

Do NOT run large, long, multi-threaded, parallel, or CPU-intensive jobs on a front-end login host. All users share the front-end hosts, and running anything but the smallest test job will negatively impact everyone's ability to use Gautschi. Always use SLURM to submit your work as a job.

Link to section 'Submitting a Job' of 'Basics of SLURM Jobs' Submitting a Job

The main steps to submitting a job are:

Follow the links below for information on these steps, and other basic information about jobs. A number of example SLURM jobs are also available.

Queues

On Gautschi, the required options for job submisison deviates from some of the other community clusters you might have experience using. In general every job submission will have four parts:

The number and type of resources you want
The account the resources should come out of
The quality of service (QOS) this job expects from the resources
The partition where the resources are located

If you have used other clusters, you will be familiar with first item. If you have not, you can read about how to format the request on our job submission page. The rest of this page will focus on the last three items as they are specific to Gautschi.

Link to section 'Accounts' of 'Queues' Accounts

On the Gautschi community cluster, users will have access to one or more accounts, also known as queues. These accounts are dedicated to and named after each partner who has purchased access to the cluster, and they provide partners and their researchers with priority access to their portion of the cluster. These accounts can be thought of as bank accounts that contain the resources a group has purchased access to which may include some number of cores, GPU hours or both. To see the list of accounts that you have access to on Gautschi as well as the resources they contain, you can use the command "slist".

On previous clusters, you may have had access to several other special accounts like "standby", "highmem", "gpu", etc. On Gautschi, these modes of access still exist, but they are no longer accounts. As well on Gautschi, you must explicitly define the account that you want to submit to using the "-A/--account=" option.

Link to section 'Quality of Service (QOS)' of 'Queues' Quality of Service (QOS)

On Gautschi, we use a Slurm concept called a Quality of Service or a QOS. A QOS can be thought of as a tag for a job that tells the scheduler how that job should be treated with respect to limits, priority, etc. The cluster administrators define the available QOSes as well as the policies for how each QOS should be treated on the cluster. A toy example of such a policy may be "no single user can have more than 200 jobs" that has been tagged with a QOS named "highpriority". There are two classes of QOSes and a job can have both:

Partition QOSes: A partition QOS is a tag that is automatically added to your job when you submit to a partition that defines a partition QOS.
Job QOSes: A Job QOS is a tag that you explicitly give to a job using the option "-q/--qos=". By explicitly tagging your jobs this way, you can choose the policy that each one of your jobs should abide by. We will describe the policies for the available job QOSes in the partition section below.

As an extended metaphor, if we think of a job as a package that we need to have shipped to some destination, then the partition can be thought of as the carrier we decide to ship our package with. That carrier is going to have some company policies that dictate how you need to label/pack that package, and that company policy is like the partition QOS. It is the policy that is enforced for simply deciding to submit to a particular partition. The Job QOS can then be thought of as the various different types of shipping options that carrier might offer. You might pay extra to have that package shipped overnight or have tracking information for it or you may choose to pay less and have your package arrive as available. Once we decide to go with a particular carrier, we are subject to their company policy, but we may negotiate how they handle that package by choosing one of their available shipping options. In the same way, when you choose to submit to a partition, you are subject to the limits enforced by the partition QOS and you may be able to ask for your job to be handled a particular way by specifying the job QOS, but that option has to be offered by the partition.

In order for a job to use a Job QOS, the user submitting the job must have access to the QOS, the account the job is being submitted to must accept the QOS, and the partition the job is being submitted to must accept the QOS. The following job QOSes every user and every account of Gautschi has access to:

"normal": The "normal" QOS is the default job QOS on the cluster meaning if you do not explicitly list an alternative job QOS, your job will be tagged with this QOS. The policy for this QOS provides a high priority and does not add any additional limits.
"standby": The "standby" QOS must be explicitly used if desired by using the option "-q standby" or "--qos=standby". The policy for this QOS gives access to idle resources on the cluster. Jobs tagged with this QOS are "low priority" jobs and are only allowed to run for up to four hours at a time, however the resources used by these jobs do not count against the resources in your Account. For users of our previous clusters, usage of this QOS replaces the previous -A standby style of submission.

Some of these QOSes may not be available in every partition. Each of the partitions in the following section will enumerate which of these QOSes are allowed in the partition.

Link to section 'Partitions' of 'Queues' Partitions

On Gautschi, the various types of nodes on the cluster are organized into distinct partitions. This allows CPU and GPU nodes to be charged separately and differently. This also means that Instead of only needing to specify the account name in the job script, the desired partition must also be specified. Each of these partitions is subject to different limitations and has a specific use case that will be described below.

Link to section 'CPU Partition' of 'Queues' CPU Partition

This partition contains the resources a group purchases access to when they purchase CPU resources on Gautschi and is made up of 336 Gautschi-A nodes. Each of these nodes contains two Zen 4 AMD EPYC 9654 96-core processors for a total of 192 cores and 384 GB of memory for a total of more than 64,000 cores in the partition. Memory in this partition is allocated proportional to your core request such that each core is given about 2 GB of memory per core requested. Submission to this partition can be accomplished by using the option: "-p cpu" or "--partition=cpu".

The purchasing model for this partition allows groups to purchase high priority access to some number of cores. When an account uses resources in this account by submitting a job tagged with the "normal" QOS, the cores used by that job are withdrawn from the account and deposited back into the account when the job terminates.

When using the CPU partition, jobs are tagged by the "normal" QOS by default, but they can be tagged with the "standby" QOS if explicitly submitted using the "-q standby" or "--qos=standby" option.

Jobs tagged with the "normal" QOS are subject to the following policies:
1. Jobs have a high priority and should not need to wait very long before starting.
2. Any cores requested by these jobs are withdrawn from the account until the job terminates.
3. These jobs can run for up to two weeks at a time.
Jobs tagged with the "standby" QOS are subject to the following policies:
1. Jobs have a low priority and there is no expectation of job start time. If the partition is very busy with jobs using the "normal" QOS or if you are requesting a very large job, then jobs using the "standby" QOS may take hours or days to start.
2. These jobs can use idle resources on the cluster and as such cores requested by these jobs are not withdrawn from the account to which they were submitted.
3. These jobs can run for up to four hours at a time.

Groups who purchased GPU resources on Gautschi will have the ability to run jobs in this partition tagged with the "standby" QOS described above, but will not be able to submit jobs tagged with the "normal" qos unless they have purchased CPU resources on the cluster.

Available QOSes: "normal", "standby"

Link to section 'AI Partition' of 'Queues' AI Partition

This partition contains the resources a group purchases access to when they purcahse GPU resources on Gautschi and is made up of 20 Gautschi-H nodes. Each of these nodes contains two Intel Xeon Platinum 8480+ 56-core processors and 8-way NVLinked H100s for a total of 160 H100s in the partition. CPU memory in this partition is allocated proportional to your core request such that each core is given about 9 GB of memory per core requested. Submission to this partition can be accomplished by using the option: "-p ai" or "--partition=ai".

The purchasing model for this cluster allows groups to purchase access to GPUs. When a group purchases access to a GPU for a year, they are given 365*24 hours of GPU time on the partition. The advantage to tracking resources this way is that if a group needs access to more than one GPU for a project, they can use multiple GPUs at once since instead of having to purchase additional GPUs, they just consume those GPU hours faster. When an account uses resources in this partition, the balance of GPU hours in their account is charged at a rate of at least one GPU hour for each GPU their job uses per hour (this is tracked by the minute), however the usage of some QOSes described below will charge usage at a slower rate.

When using the AI partition, jobs are tagged by the "normal" QOS by default, but they can be tagged with the "preemptible" QOS if explicitly submitted using the "-q preemptible" or "--qos=preemptible" option.

Jobs tagged with the "normal" QOS are subject to the following policies:
1. Jobs have a high priority and should not need to wait very long before starting.
2. The account the job is submitted to will be charged 1 GPU hour for each GPU this job uses per hour (this is tracked by the minute).
3. These jobs can run for up to two weeks at a time.
Jobs tagged with the "preemptible" QOS subject to the following policies:
1. Jobs have a low priority and there is no expectation of job start time. If the partition is very busy with jobs using the "normal" QOS or if you are requesting a very large job, then jobs using the "standby" QOS may take hours or days to start.
2. The account the job is submitted to will be charged 0.25 GPU hour for each GPU this job uses per hour (this is tracked by the minute) which effectively allows jobs to use 4x as many resources as a job tagged with the "normal" qos for the same cost.
3. These jobs can run for up to two weeks at a time.
4. If there are not enough idle resources in this partition for a job tagged with the "normal" QOS to start, then this job may be cancelled to make room for that job. This means it is imperative to use checkpointing if using this QOS.

Available QOSes: "normal", "preemptible"

Link to section 'Highmem Partition' of 'Queues' Highmem Partition

This partition is made up of 6 Gautschi-B nodes which have four times as much memory as a standard Gautschi-A node, and access to this partition is given to all accounts on the cluster to enable work that has higher memory requirements. Each of these nodes contains two Zen 4 AMD EPYC 9654 96-core processors for a total of 192 cores and 1.5 TB of memory. Memory in this partition is allocated proportional to your core request such that each core is given about 8 GB of memory per core requested. Submission to this partition can be accomplished by using the option: "-p highmem" or "--partition=highmem".

When using the Highmem partition, jobs are tagged by the "normal" QOS by default, and this is the only QOS that is available for this partition, so there is no need to specify a QOS when using this partition. Additionally jobs are tagged by a highmem partition QOS that enforces the following policies

There is no expectation of job start time as these nodes are a shared resources that are given as a bonus for purchasing access to high priority access to resources on Gautschi
You can have 2 jobs running in this partition at once
You can have 8 jobs submitted to thie partition at once
Your jobs must use more than 48 of the 192 cores on the node otherwise your memory footprint would fit on a standard Gautschi-A node
These jobs can run for up to 24 hours at a time.

Available QOSes: "normal"

Link to section 'Profiling Partition' of 'Queues' Profiling Partition

This partition is made up of 2 Gautschi-A nodes that have hardware performance counters enabled. By enabling hardware performance counters, profiling applications such as AMD MicroProf can track certain performance criteria for execution on the CPU such as L3 cache events, speculative execution misses, etc. Due to the fact that this allows greater visibility into the execution of each process, the nodes in this partition can only be used if your job uses the entire node. Submission to this partition can be accomplished by using the option: "-p profiling" or "--partition=profiling".

When using the profiling partition, jobs are tagged by the "normal" QOS by default, and this is the only QOS that is available for this partition, so there is no need to specify a QOS when using this partition. Additionally jobs are tagged by a profiling partition QOS that enforces the following policies

There is no expectation of job start time as these nodes are a shared resources that are given as a bonus for purchasing access to high priority access to resources on Gautschi
You can have 1 job running in this partition at once
If you have a single process running on one of the nodes, you must use that entire node when submitting to this partition
These jobs can run for up to 24 hours at a time.

This is a resource that is reserved for groups profiling their applications and we monitor to ensure that this partition is not being used simply for more compute.

Available QOSes: "normal"

Link to section 'Smallgpu Partiton' of 'Queues' Smallgpu Partiton

This partition is made up of 6 Gautschi-G nodes. Each of these nodes contains two NVIDIA L40s and two Zen 4 AMD EPYC 9554 64-core processors for a total of 128 cores and 384GB of memory. Memory in this partition is allocated proportional to your core request such that each core is given about 3 GB of memory per core requested. You should request cores proportional to the number of GPUs you are using in this partition (i.e. if you only need one of the two GPUs, you should request half of the cores on the node) Submission to this partition can be accomplished by using the option: "-p smallgpu" or "--partition=smallgpu".

When using the smallgpu partition, jobs are tagged by the "normal" QOS by default, and this is the only QOS that is available for this partition, so there is no need to specify a QOS when using this partition. Additionally jobs are tagged by a smallgpu partition QOS that enforces the following policies

There is no expectation of job start time as these nodes are a shared resources that are given as a bonus for purchasing access to high priority access to resources on Gautschi
You can use up to 2 GPUs in this partition at once
You can have 8 jobs submitted to thie partition at once
These jobs can run for up to 24 hours at a time.

Available QOSes: "normal"

Job Submission Script

To submit work to a SLURM queue, you must first create a job submission file. This job submission file is essentially a simple shell script. It will set any required environment variables, load any necessary modules, create or modify files and directories, and run any applications that you need:

#!/bin/bash
# FILENAME:  myjobsubmissionfile

# Loads Matlab and sets the application up
module load matlab

# Change to the directory from which you originally submitted this job.
cd $SLURM_SUBMIT_DIR

# Runs a Matlab script named 'myscript'
matlab -nodisplay -singleCompThread -r myscript

Once your script is prepared, you are ready to submit your job.

Link to section 'Job Script Environment Variables' of 'Job Submission Script' Job Script Environment Variables

SLURM sets several potentially useful environment variables which you may use within your job submission files. Here is a list of some:
Name	Description
SLURM_SUBMIT_DIR	Absolute path of the current working directory when you submitted this job
SLURM_JOBID	Job ID number assigned to this job by the batch system
SLURM_JOB_NAME	Job name supplied by the user
SLURM_JOB_NODELIST	Names of nodes assigned to this job
SLURM_CLUSTER_NAME	Name of the cluster executing the job
SLURM_SUBMIT_HOST	Hostname of the system where you submitted this job
SLURM_JOB_PARTITION	Name of the original queue to which you submitted this job

Submitting a Job

Once you have a job submission file, you may submit this script to SLURM using the sbatch command. SLURM will find, or wait for, available resources matching your request and run your job there.

On Gautschi, in order to submit jobs, you need to specify the partition, account and Quality of Service (QoS) name to which you want to submit your jobs. To familiarize yourself with the partitions and QoS available on Gautschi, visit Gautschi Queues and Partitons. To check the available partitions on Gautschi, you can use the showpartitions , and to check your available accounts you can use slist commands. Slurm uses the term "Account" with the option -A or --account= to specify different batch accounts, the option -p or --partition= to select a specific partition for job submission, and the option -q or --qos= .

showpartitions

Partition statistics for cluster gautschi at Fri Dec  6 05:26:30 PM EST 2024
     Partition     #Nodes     #CPU_cores  Cores_pending   Job_Nodes MaxJobTime Cores Mem/Node
     Name State Total  Idle  Total   Idle Resorc  Other   Min   Max  Day-hr:mn /node     (GB)
       ai    up    20    20   2240   2240      0      0     1 infin   infinite   112    1031 
      cpu    up   336   300  64512  57600      0      0     1 infin   infinite   192     386 
  highmem    up     6     6   1152   1152      0      0     1 infin   infinite   192    1547   
 smallgpu    up     6     6    768    768      0      0     1 infin   infinite   128     386 
profiling    up     2     2    384    384      0      0     1 infin   infinite   192     386

Link to section 'CPU Partition' of 'Submitting a Job' CPU Partition

The CPU partition on Gautschi has two Quality of Service (QoS) levels: normal and standby. To submit your job to one compute node on cpu partition and 'normal' QoS which has "high priority":

$ sbatch --nodes=1 --ntasks=1 --partition=cpu --account=accountname --qos=normal myjobsubmissionfile
$ sbatch -N1 -n1 -p cpu -A accountname -q normal myjobsubmissionfile

To submit your job to one compute node on cpu partition and 'standby' QoS which is has "low priority":

$ sbatch --nodes=1 --ntasks=1 --partition=cpu --account=accountname --qos=standby myjobsubmissionfile
$ sbatch -N1 -n1 -p cpu -A accountname -q standby myjobsubmissionfile

Link to section ' AI Partition' of 'Submitting a Job' AI Partition

The CPU partition on Gautschi has two Quality of Service (QoS) levels: normal and preemptible. To submit your job to one compute node requesting one GPU on ai partition and 'normal' QoS which is has "high priority":

$ sbatch --nodes=1 --gpus-per-node=1 --ntasks=14 --partition=ai --account=accountname --qos=normal myjobsubmissionfile
$ sbatch -N1 --gpus-per-node=1 -n14 -p ai -A accountname -q normal myjobsubmissionfile

To submit your job to one compute node requesting one GPU on ai partition and 'preemptible' QoS which has "high priority":

$ sbatch --nodes=1 --gpus-per-node=1 --ntasks=14 --partition=ai --account=accountname --qos=preemptible myjobsubmissionfile
$ sbatch -N1 --gpus-per-node=1 -n14 -p ai -A accountname -q preemptible myjobsubmissionfile

Link to section 'Highmem Partition' of 'Submitting a Job' Highmem Partition

To submit your job to a compute node on the highmem partition, you don’t need to specify the QoS name because only one QoS exists for this partition, and the default is normal. However, the highmem partition is suitable for jobs with memory requirements that exceed the capacity of a standard node, so the number of requested tasks should be appropriately high.

Link to section 'Profiling Partition' of 'Submitting a Job' Profiling Partition

To submit your job to a compute node on the profiling partition, you also don’t need to specify the QoS name because only one QoS exists for this partition, and the default is normal.

$ sbatch --nodes=1 --ntasks=1 --partition=profiling --account=accountname myjobsubmissionfile
$ sbatch -N1 -n1 -p profiling -A accountname myjobsubmissionfile

Link to section 'Smallgpu Partition' of 'Submitting a Job' Smallgpu Partition

To submit your job to a compute node on the smallgpu partition, you don’t need to specify the QoS name because only one QoS exists for this partition, and the default is normal. You should request cores proportional to the number of GPUs you are using in this partition (i.e. if you only need one of the two GPUs, you should request half of the cores on the node).

$ sbatch --nodes=1 --ntasks=64 --gpus-per-node=1 --partition=smallgpu --account=accountname myjobsubmissionfile
$ sbatch -N1 -n64 --gpus-per-node=1 -p smallgpu -A accountname myjobsubmissionfile

Link to section 'General Information' of 'Submitting a Job' General Information

By default, each job receives 30 minutes of wall time, or clock time. If you know that your job will not need more than a certain amount of time to run, request less than the maximum wall time, as this may allow your job to run sooner. To request the 1 hour and 30 minutes of wall time:

$ sbatch -t 01:30:00 -N=1 -n=1 -p=cpu -A=accountname -q=standby myjobsubmissionfile

The --nodes= or -N value indicates how many compute nodes you would like for your job, and --ntasks= or -n value indicates the number of tasks you want to run.

In some cases, you may want to request multiple nodes. To utilize multiple nodes, you will need to have a program or code that is specifically programmed to use multiple nodes such as with MPI. Simply requesting more nodes will not make your work go faster. Your code must support this ability.

To request 2 compute nodes:

$ sbatch -t 01:30:00 -N=2 -n=16 -p=cpu -A=accountname -q=standby myjobsubmissionfile

By default, jobs on Gautschi will share nodes with other jobs.

If more convenient, you may also specify any command line options to sbatch from within your job submission file, using a special form of comment:

#!/bin/sh -l
# FILENAME:  myjobsubmissionfile

#SBATCH --account=accountname
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --partition=cpu
#SBATCH --qos=normal
#SBATCH --time=1:30:00
#SBATCH --job-name myjobname

# Print the hostname of the compute node on which this job is running.
/bin/hostname

If an option is present in both your job submission file and on the command line, the option on the command line will take precedence.

After you submit your job with SBATCH, it may wait in queue for minutes, hours, or even weeks. How long it takes for a job to start depends on the specific queue, the resources and time requested, and other jobs already waiting in that queue requested as well. It is impossible to say for sure when any given job will start. For best results, request no more resources than your job requires.

Once your job is submitted, you can monitor the job status, wait for the job to complete, and check the job output.

Checking Job Status

Once a job is submitted there are several commands you can use to monitor the progress of the job.

To see your jobs, use the squeue -u command and specify your username:

(Remember, in our SLURM environment a queue is referred to as an 'Account')

 

squeue -u myusername

JOBID      USER     ACCOUNT      PART QOS     NAME       NODES TRES_PER_NODE   CPUS  TIME_LIMIT ST TIME
182792     myusername account_name     cpu  standby job1         1 N/A              192     4:00:00  R 35:54
185841     myusername account_name     cpu  normal  job2         1 N/A              192     4:00:00  R 36:51
185844     myusername account_name     smal normal  job3         1 gres/gpu:1        64     8:00:00  R 22:11

To retrieve useful information about your queued or running job, use the scontrol show job command with your job's ID number. The output should look similar to the following:



scontrol show job 3519

JobId=3519 JobName=job.sub
   UserId=myusername GroupId=mygroup MCS_label=N/A
   Priority=1000 Nice=0 Account=myaccount QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:23 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=2025-02-24T14:31:19 EligibleTime=2025-02-24T14:31:19
   AccrueTime=2025-02-24T14:31:19
   StartTime=2025-02-24T14:32:36 EndTime=2025-02-24T18:32:36 Deadline=N/A
   PreemptEligibleTime=2025-02-24T14:32:36 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-02-24T14:32:36 Scheduler=Backfill
   Partition=cpu AllocNode:Sid=login02:4061246
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=a166
   BatchHost=a166
   NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=4,mem=7668M,node=1,gres/hp_cpu=4
   AllocTRES=cpu=4,mem=7668M,node=1,gres/hp_cpu=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1917M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/sh
   WorkDir=/home/myusername
   TresPerJob=gres/hp_cpu=4

There are several useful bits of information in this output.

JobState lets you know if the job is Pending, Running, Completed, or Held.
RunTime and TimeLimit will show how long the job has run and its maximum time.
SubmitTime is when the job was submitted to the cluster.
NumNodes, NumCPUs, NumTasks and CPUs/Task are the number of Nodes, CPUs, Tasks, and CPUs per Task are shown.
WorkDir is the job's working directory.

Checking Job Output

Once a job is submitted, and has started, it will write its standard output and standard error to files that you can read.

SLURM catches output written to standard output and standard error - what would be printed to your screen if you ran your program interactively. Unless you specfied otherwise, SLURM will put the output in the directory where you submitted the job in a file named slurm- followed by the job id, with the extension out. For example slurm-3509.out. Note that both stdout and stderr will be written into the same file, unless you specify otherwise.

If your program writes its own output files, those files will be created as defined by the program. This may be in the directory where the program was run, or may be defined in a configuration or input file. You will need to check the documentation for your program for more details.

Link to section 'Redirecting Job Output' of 'Checking Job Output' Redirecting Job Output

It is possible to redirect job output to somewhere other than the default location with the --error and --output directives:

#!/bin/bash
#SBATCH --output=/home/myusername/joboutput/myjob.out
#SBATCH --error=/home/myusername/joboutput/myjob.out

# This job prints "Hello World" to output and exits
echo "Hello World"

Holding a Job

Sometimes you may want to submit a job but not have it run just yet. You may be wanting to allow lab mates to cut in front of you in the queue - so hold the job until their jobs have started, and then release yours.

To place a hold on a job before it starts running, use the scontrol hold job command:

$ scontrol hold job  myjobid

Once a job has started running it can not be placed on hold.

To release a hold on a job, use the scontrol release job command:

$ scontrol release job  myjobid

You find the job ID using the squeue command as explained in the SLURM Job Status section.

Job Dependencies

Dependencies are an automated way of holding and releasing jobs. Jobs with a dependency are held until the condition is satisfied. Once the condition is satisfied jobs only then become eligible to run and must still queue as normal.

Job dependencies may be configured to ensure jobs start in a specified order. Jobs can be configured to run after other job state changes, such as when the job starts or the job ends.

These examples illustrate setting dependencies in several ways. Typically dependencies are set by capturing and using the job ID from the last job submitted.

To run a job after job myjobid has started:

sbatch --dependency=after:myjobid myjobsubmissionfile

To run a job after job myjobid ends without error:

sbatch --dependency=afterok:myjobid myjobsubmissionfile

To run a job after job myjobid ends with errors:

sbatch --dependency=afternotok:myjobid myjobsubmissionfile

To run a job after job myjobid ends with or without errors:

sbatch --dependency=afterany:myjobid myjobsubmissionfile

To set more complex dependencies on multiple jobs and conditions:

sbatch --dependency=after:myjobid1:myjobid2:myjobid3,afterok:myjobid4 myjobsubmissionfile

Canceling a Job

To stop a job before it finishes or remove it from a queue, use the scancel command:

scancel myjobid

You find the job ID using the squeue command as explained in the SLURM Job Status section.