Basics of SLURM Jobs
The Simple Linux Utility for Resource Management (SLURM) is a system providing job scheduling and job management on compute clusters. With SLURM, a user requests resources and submits a job to a queue. The system will then take jobs from queues, allocate the necessary nodes, and execute them.
Do NOT run large, long, multi-threaded, parallel, or CPU-intensive jobs on a front-end login host. All users share the front-end hosts, and running anything but the smallest test job will negatively impact everyone's ability to use Gilbreth. Always use SLURM to submit your work as a job.
Submitting a Job
The main steps to submitting a job are:
- Prepare a job submission script
- Submit the job to a queue
- Monitor the job's status
- Check the job's output
Follow the links below for information on these steps, and other basic information about jobs. A number of example SLURM jobs are also available.
Queues
Link to section '"mylab" Queues' of 'Queues' "mylab" Queues
Gilbreth, as a community cluster, has one or more queues dedicated to and named after each partner who has purchased access to the cluster. These queues provide partners and their researchers with priority access to their portion of the cluster. Jobs in these queues are typically limited to 336 hours. The expectation is that any jobs submitted to your research lab queues will start within 4 hours, assuming the queue currently has enough capacity for the job (that is, your lab mates aren't using all of the cores currently).
Training Queue
If your job scales well to multiple GPUs and requires longer than 24 hours, use the training queue. Since the training nodes have specialty hardware and are few in number, access is restricted to users whose workloads scale well with the number of GPUs. Please note that staff may ask you to provide evidence that your jobs can fully utilize the GPUs before granting access to this queue. The maximum wall time is 3 days, a user may run at most 2 jobs concurrently, and a user's jobs may consume at most 8 GPUs in total. There are only 5 nodes in this queue, so you may have to wait a considerable amount of time before your job is scheduled.
Standby Queue
Additionally, community clusters provide a "standby" queue which is available to all cluster users. This "standby" queue allows users to utilize portions of the cluster that would otherwise be idle, but at a lower priority than partner-queue jobs, and with a relatively short time limit, to ensure "standby" jobs will not be able to tie up resources and prevent partner-queue jobs from running quickly. Jobs in standby are limited to 4 hours. There is no expectation of job start time. If the cluster is very busy with partner queue jobs, or you are requesting a very large job, jobs in standby may take hours or days to start.
Debug Queue
The debug queue allows you to quickly start small, short, interactive jobs in order to debug code, test programs, or test configurations. You are limited to one running job at a time in the queue, and you may use up to two GPUs for up to 30 minutes. The expectation is that debug jobs should start within a couple of minutes, provided the queue's dedicated nodes are not all occupied by other debug jobs.
List of Queues
To see a list of all queues on Gilbreth that you may submit to, use the slist command.
This lists each queue you can submit to, the number of nodes allocated to the queue, how many are available to run jobs, and the maximum walltime you may request. Options to the command will give more detailed information. This command can be used to get a general idea of how busy an individual queue is and how long you may have to wait for your job to start.
The default output mode of the slist command shows the available GPU counts in each queue:
$ slist
                      Current Number of GPUs                    Node
Account          Total    Queue      Run     Free  Max Walltime  Type
============== ================================= ============== ======
debug              183        0        0      183  00:30:00      B,D,E,F,G,H,I
standby            183       77       55       98  04:00:00      B,D,E,F,G,H,I
training            20        0        8       12  3-00:00:00    C,J
mylab               80        0        0       80  14-00:00:00   F
To check the number of CPUs available in each queue, use the slist -c command.
Summary of Queues
Gilbreth contains several queues and heterogeneous hardware, with nodes that differ in core count and GPU model. Some queues are backed by a single node type, while others may land on multiple node types; on those queues, be mindful of your resource request (see the example after the table below). Below are the current combinations of queues, GPU types, and resources you may request.
Queue | GPU Type | Number of GPUs per node | Intended use-case | Max walltime | Max GPUs per user concurrently | Max jobs running per user |
---|---|---|---|---|---|---|
standby | V100 (16 GB), V100 (32 GB), A100 (40 GB), A100 (80 GB), A10 (24 GB), A30 (24 GB) | 16 (2), 40 (2), 128 (2), 128 (2), 32 (3), 24/16 (3) | Short to moderately long jobs | 4 hours | 16 | 16 |
training | V100 (32 GB, NVLink), A100 (80 GB, NVLink) | 20 (4), 128 (4) | Long jobs that can scale well to multiple GPUs, such as deep learning model training | 3 days | 8 | 2 |
debug | V100 (16 GB), V100 (32 GB), A100 (40 GB), A100 (80 GB), A10 (24 GB), A30 (24 GB) | 16 (2), 40 (2), 128 (2), 128 (2), 32 (3), 24/16 (3) | Quick testing | 30 mins | 2 | 1 |
"mylab" | Based on Purchase | Based on Purchase | There will be a separate queue for each type of GPU the partners have purchased. | 2 Weeks | Amount Purchased | Based on Purchase |
Job Submission Script
To submit work to a SLURM queue, you must first create a job submission file. This job submission file is essentially a simple shell script. It will set any required environment variables, load any necessary modules, create or modify files and directories, and run any applications that you need:
#!/bin/bash
# FILENAME: myjobsubmissionfile
# Loads Matlab and sets the application up
module load matlab
# Change to the directory from which you originally submitted this job.
cd $SLURM_SUBMIT_DIR
# Runs a Matlab script named 'myscript'
matlab -nodisplay -singleCompThread -r myscript
Once your script is prepared, you are ready to submit your job.
Job Script Environment Variables
Name | Description |
---|---|
SLURM_SUBMIT_DIR | Absolute path of the current working directory when you submitted this job |
SLURM_JOBID | Job ID number assigned to this job by the batch system |
SLURM_JOB_NAME | Job name supplied by the user |
SLURM_JOB_NODELIST | Names of nodes assigned to this job |
SLURM_CLUSTER_NAME | Name of the cluster executing the job |
SLURM_SUBMIT_HOST | Hostname of the system where you submitted this job |
SLURM_JOB_PARTITION | Name of the original queue to which you submitted this job |
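As a quick illustration, a minimal job script (the filename is just an example) can record several of these variables in the job's output file:
#!/bin/bash
# FILENAME: envcheck.sub
# Print a few of the SLURM-provided environment variables.
echo "Job $SLURM_JOBID ($SLURM_JOB_NAME) on cluster $SLURM_CLUSTER_NAME"
echo "Submitted from $SLURM_SUBMIT_HOST, directory $SLURM_SUBMIT_DIR"
echo "Queue: $SLURM_JOB_PARTITION  Nodes: $SLURM_JOB_NODELIST"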
Submitting a Job
Once you have a job submission file, you may submit this script to SLURM using the sbatch command. SLURM will find, or wait for, available resources matching your request and run your job there.
To submit your job to one compute node:
$ sbatch --nodes=1 --gpus-per-node=1 myjobsubmissionfile
Slurm uses the word 'Account' and the option '-A' to specify different batch queues. To submit your job to a specific queue:
$ sbatch --nodes=1 --gpus-per-node=1 -A standby myjobsubmissionfile
On Gilbreth, you must specify the number of GPUs with the --gpus-per-node option.
By default, each job receives 30 minutes of wall time, or clock time. If you know that your job will not need more than a certain amount of time to run, request less than the maximum wall time, as this may allow your job to run sooner. To request 1 hour and 30 minutes of wall time:
$ sbatch -t 1:30:00 --nodes=1 --gpus-per-node=1 -A standby myjobsubmissionfile
The --nodes value indicates how many compute nodes you would like for your job. Gilbreth's compute nodes differ in the number of cores they provide; refer to the Hardware Overview and Queue Overview for details.
In some cases, you may want to request multiple nodes. To utilize multiple nodes, you will need to have a program or code that is specifically programmed to use multiple nodes such as with MPI. Simply requesting more nodes will not make your work go faster. Your code must support this ability.
To request 2 compute nodes:
$ sbatch --nodes=2 --gpus-per-node=1 myjobsubmissionfile
By default, jobs on Gilbreth will share nodes with other jobs.
To submit a job using 1 compute node with 4 tasks, each using the default 1 core and 1 GPU per node:
$ sbatch --nodes=1 --ntasks=4 --gpus-per-node=1 myjobsubmissionfile
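Within a job script, the srun launcher starts one copy of a program per requested task. As a sketch (my_program is a placeholder for your own executable), the request above would launch four copies:
# Launches one copy per task; with --ntasks=4, four copies run
srun ./my_program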
If more convenient, you may also specify any command line options to sbatch from within your job submission file, using a special form of comment:
#!/bin/sh -l
# FILENAME: myjobsubmissionfile
#SBATCH -A myqueuename
#SBATCH --nodes=1 --gpus-per-node=1
#SBATCH --time=1:30:00
#SBATCH --job-name myjobname
# Print the hostname of the compute node on which this job is running.
/bin/hostname
If an option is present in both your job submission file and on the command line, the option on the command line will take precedence.
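For example, with the script above, the following submission would override the script's 1 hour and 30 minute limit and request only 30 minutes:
$ sbatch --time=00:30:00 myjobsubmissionfile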
After you submit your job with sbatch, it may wait in the queue for minutes, hours, or even weeks. How long it takes for a job to start depends on the specific queue, the resources and wall time requested, and the other jobs already waiting in that queue. It is impossible to say for sure when any given job will start. For best results, request no more resources than your job requires.
Once your job is submitted, you can monitor the job status, wait for the job to complete, and check the job output.
Job Dependencies
Dependencies are an automated way of holding and releasing jobs. A job with a dependency is held until its condition is satisfied; only then does it become eligible to run, and it must still wait in the queue as normal.
Job dependencies may be configured to ensure jobs start in a specified order. Jobs can be configured to run after other job state changes, such as when the job starts or the job ends.
These examples illustrate setting dependencies in several ways. Typically, dependencies are set by capturing and using the job ID from the last job submitted (a sketch of this pattern follows the examples below).
To run a job after job myjobid has started:
sbatch --dependency=after:myjobid myjobsubmissionfile
To run a job after job myjobid ends without error:
sbatch --dependency=afterok:myjobid myjobsubmissionfile
To run a job after job myjobid ends with errors:
sbatch --dependency=afternotok:myjobid myjobsubmissionfile
To run a job after job myjobid ends with or without errors:
sbatch --dependency=afterany:myjobid myjobsubmissionfile
To set more complex dependencies on multiple jobs and conditions:
sbatch --dependency=after:myjobid1:myjobid2:myjobid3,afterok:myjobid4 myjobsubmissionfile
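As a sketch of the capture-the-job-ID pattern (the submission filenames here are placeholders), sbatch's --parsable option prints just the job ID, which makes it easy to use in scripts:
# Submit the first job and capture its job ID
first=$(sbatch --parsable first_step.sub)
# Submit a second job that runs only if the first ends without error
sbatch --dependency=afterok:$first second_step.sub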
Holding a Job
Sometimes you may want to submit a job but not have it run just yet. For example, you may want to allow lab mates to cut in front of you in the queue: hold your job until their jobs have started, and then release yours.
To place a hold on a job before it starts running, use the scontrol hold job command:
$ scontrol hold job myjobid
Once a job has started running, it cannot be placed on hold.
To release a hold on a job, use the scontrol release job command:
$ scontrol release job myjobid
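You can also submit a job directly into the held state with sbatch's --hold option, then release it later with scontrol release as above:
$ sbatch --hold myjobsubmissionfile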
You can find the job ID using the squeue command, as explained in the SLURM Job Status section.
Checking Job Status
Once a job is submitted there are several commands you can use to monitor the progress of the job.
To see your jobs, use the squeue -u command and specify your username:
(Remember, in our SLURM environment a queue is referred to as an 'Account')
squeue -u myusername
 JOBID   ACCOUNT   NAME        USER  ST   TIME  NODES  NODELIST(REASON)
182792   standby   job1  myusername   R  20:19      1  gilbreth-a000
185841   standby   job2  myusername   R  20:19      1  gilbreth-a001
185844   standby   job3  myusername   R  20:18      1  gilbreth-a002
185847   standby   job4  myusername   R  20:18      1  gilbreth-a003
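If a job is still pending, squeue's --start option asks SLURM for its current estimate of when the job will begin; the estimate shifts as other jobs finish early or new jobs arrive:
squeue -u myusername --start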
To retrieve useful information about your queued or running job, use the scontrol show job command with your job's ID number. The output should look similar to the following:
scontrol show job 3519
JobId=3519 JobName=t.sub
UserId=myusername GroupId=mygroup MCS_label=N/A
Priority=3 Nice=0 Account=(null) QOS=(null)
JobState=PENDING Reason=BeginTime Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=7-00:00:00 TimeMin=N/A
SubmitTime=2019-08-29T16:56:52 EligibleTime=2019-08-29T23:30:00
AccrueTime=Unknown
StartTime=2019-08-29T23:30:00 EndTime=2019-09-05T23:30:00 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-08-29T16:56:52
Partition=workq AllocNode:Sid=mack-fe00:54476
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=2,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/myusername/jobdir/myjobfile.sub
WorkDir=/home/myusername/jobdir
StdErr=/home/myusername/jobdir/slurm-3519.out
StdIn=/dev/null
StdOut=/home/myusername/jobdir/slurm-3519.out
Power=
There are several useful bits of information in this output.
- JobState lets you know if the job is Pending, Running, Completed, or Held.
- RunTime and TimeLimit show how long the job has run and its maximum allowed time.
- SubmitTime is when the job was submitted to the cluster.
- The job's number of Nodes, Tasks, Cores (CPUs), and CPUs per Task are shown.
- WorkDir is the job's working directory.
- StdOut and StdErr are the locations of the job's stdout and stderr, respectively.
- Reason shows why a PENDING job isn't running. In the output above, Reason=BeginTime means the job was requested to start at a specific, later time.
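Note that scontrol show job only reports on jobs that are still queued, running, or recently finished. On clusters where SLURM's job accounting database is enabled (an assumption to verify for your site), the sacct command can summarize completed jobs:
$ sacct -j 3519 --format=JobID,JobName,State,Elapsed,ExitCode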
Checking Job Output
Once a job is submitted, and has started, it will write its standard output and standard error to files that you can read.
SLURM catches output written to standard output and standard error - what would be printed to your screen if you ran your program interactively. Unless you specified otherwise, SLURM will put the output in the directory where you submitted the job, in a file named slurm- followed by the job ID, with the extension out; for example, slurm-3509.out. Note that both stdout and stderr will be written into the same file, unless you specify otherwise.
If your program writes its own output files, those files will be created as defined by the program. This may be in the directory where the program was run, or may be defined in a configuration or input file. You will need to check the documentation for your program for more details.
Redirecting Job Output
It is possible to redirect job output to somewhere other than the default location with the --error and --output directives:
#!/bin/bash
#SBATCH --output=/home/myusername/joboutput/myjob.out
#SBATCH --error=/home/myusername/joboutput/myjob.out
# This job prints "Hello World" to output and exits
echo "Hello World"
Canceling a Job
To stop a job before it finishes or remove it from a queue, use the scancel command:
scancel myjobid
You can find the job ID using the squeue command, as explained in the SLURM Job Status section.
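scancel can also act on several jobs at once. For example, to cancel all of your own jobs:
scancel -u myusername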