Batch Jobs
Job Submission Script
To submit work to a Slurm queue, you must first create a job submission file. This job submission file is essentially a simple shell script. It will set any required environment variables, load any necessary modules, create or modify files and directories, and run any applications that you need:
#!/bin/sh -l
# FILENAME: myjobsubmissionfile
# Loads Matlab and sets the application up
module load matlab
# Change to the directory from which you originally submitted this job.
cd $SLURM_SUBMIT_DIR
# Runs a Matlab script named 'myscript'
matlab -nodisplay -singleCompThread -r myscript
The standard Slurm environment variables that can be used in the job submission file are listed in the table below:
Name | Description |
---|---|
SLURM_SUBMIT_DIR | Absolute path of the current working directory when you submitted this job |
SLURM_JOBID | Job ID number assigned to this job by the batch system |
SLURM_JOB_NAME | Job name supplied by the user |
SLURM_JOB_NODELIST | Names of nodes assigned to this job |
SLURM_SUBMIT_HOST | Hostname of the system where you submitted this job |
SLURM_JOB_PARTITION | Name of the original queue to which you submitted this job |
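For instance, a job script can read these variables directly at run time. A minimal sketch (the file name envinfo.sub is just an example) that records where and as what the job ran:
#!/bin/sh -l
# FILENAME: envinfo.sub
# Print a few of the Slurm-provided environment variables for this job.
echo "Job $SLURM_JOBID ($SLURM_JOB_NAME) in partition $SLURM_JOB_PARTITION"
echo "Submitted from $SLURM_SUBMIT_HOST:$SLURM_SUBMIT_DIR"
echo "Assigned nodes: $SLURM_JOB_NODELIST"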
Once your script is prepared, you are ready to submit your job.
Submitting a Job
Once you have a job submission file, you may submit this script to SLURM using the sbatch command. Slurm will find, or wait for, available resources matching your request and run your job there.
To submit your job to one compute node with one task:
$ sbatch --nodes=1 --ntasks=1 myjobsubmissionfile
By default, each job receives 30 minutes of wall time, or clock time. If you know that your job will not need more than a certain amount of time to run, request less than the maximum wall time, as this may allow your job to run sooner. To request 1 hour and 30 minutes of wall time:
$ sbatch -t 1:30:00 --nodes=1 --ntasks=1 myjobsubmissionfile
Each compute node in Anvil has 128 processor cores. In some cases, you may want to request multiple nodes. To utilize multiple nodes, you will need a program or code that is specifically written to run across multiple nodes, such as with MPI. Simply requesting more nodes will not make your work go faster; your code must be able to utilize all of the requested cores and nodes. To request 2 compute nodes with 256 tasks:
$ sbatch --nodes=2 --ntasks=256 myjobsubmissionfile
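As a rough sketch (not a complete recipe for any particular code), the submission file for a multi-node job loads an MPI module and launches the executable with an MPI-aware launcher. Here my_mpi_program and the openmpi module name are placeholders, and srun is assumed to be a suitable launcher on your system:
#!/bin/sh -l
# FILENAME: mympijob (hypothetical name)
# Load an MPI implementation matching how my_mpi_program was built (module name is an example).
module load openmpi
# Launch one MPI rank per task requested at submission time.
srun ./my_mpi_program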
If more convenient, you may also specify any command line options to sbatch from within your job submission file, using a special form of comment:
#!/bin/sh -l
# FILENAME: myjobsubmissionfile
#SBATCH -A myallocation
#SBATCH -p queue-name # the default queue is the "shared" queue
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=1:30:00
#SBATCH --job-name myjobname
module purge # Unload all loaded modules and reset everything to original state.
module load ...
...
module list # List currently loaded modules.
# Print the hostname of the compute node on which this job is running.
hostname
If an option is present in both your job submission file and on the command line, the option on the command line will take precedence.
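For example, assuming the script above sets #SBATCH --time=1:30:00, submitting it with a shorter limit on the command line would run the job with a 45-minute wall time instead:
$ sbatch -t 0:45:00 myjobsubmissionfile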
After you submit your job with sbatch, it may wait in the queue for minutes, hours, or even days. How long it takes for a job to start depends on the specific queue, the available resources, the time requested, and the other jobs already waiting in that queue. It is impossible to say for sure when any given job will start. For best results, request no more resources than your job requires.
Once your job is submitted, you can monitor the job status, wait for the job to complete, and check the job output.
Checking Job Status
Once a job is submitted, there are several commands you can use to monitor its progress. To see your jobs, use the squeue -u command and specify your username:
$ squeue -u myusername
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
188 wholenode job1 myusername R 0:14 2 a[010-011]
189 wholenode job2 myusername R 0:15 1 a012
To retrieve useful information about your queued or running job, use the scontrol show job command with your job's ID number.
$ scontrol show job 189
JobId=189 JobName=myjobname
UserId=myusername GroupId=mygroup MCS_label=N/A
Priority=103076 Nice=0 Account=myacct QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:01:28 TimeLimit=00:30:00 TimeMin=N/A
SubmitTime=2021-10-04T14:59:52 EligibleTime=2021-10-04T14:59:52
AccrueTime=Unknown
StartTime=2021-10-04T14:59:52 EndTime=2021-10-04T15:29:52 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-10-04T14:59:52 Scheduler=Main
Partition=wholenode AllocNode:Sid=login05:1202865
ReqNodeList=(null) ExcNodeList=(null)
NodeList=a010
BatchHost=a010
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=257526M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=257526M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/home/myusername/jobdir
Power=
- JobState lets you know if the job is Pending, Running, Completed, or Held.
- RunTime and TimeLimit will show how long the job has run and its maximum allowed time.
- SubmitTime is when the job was submitted to the cluster.
- The job's number of Nodes, Tasks, Cores (CPUs), and CPUs per Task are shown.
- WorkDir is the job's working directory.
- StdOut and StdErr are the locations of stdout and stderr of the job, respectively.
- Reason will show why a PENDING job isn't running.
For historic (completed) jobs, you can use the jobinfo command. While not as detailed as scontrol output, it can also report information on jobs that are no longer active.
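For example, to look up the job from above after it has finished (assuming jobinfo accepts a job ID, as scontrol show job does):
$ jobinfo 189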
Checking Job Output
Once a job is submitted, and has started, it will write its standard output and standard error to files that you can read.
SLURM catches output written to standard output and standard error, which is what would be printed to your screen if you ran your program interactively. Unless you specify otherwise, SLURM will put the output in the directory where you submitted the job, in a file named slurm- followed by the job ID, with the extension .out; for example, slurm-3509.out. Note that both stdout and stderr will be written to the same file, unless you specify otherwise.
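For example, assuming the default naming and a job ID of 3509, you can follow the output as the job writes it:
$ tail -f slurm-3509.out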
If your program writes its own output files, those files will be created as defined by the program. This may be in the directory where the program was run, or may be defined in a configuration or input file. You will need to check the documentation for your program for more details.
Redirecting Job Output
It is possible to redirect job output to somewhere other than the default location with the --error and --output directives:
#!/bin/sh -l
#SBATCH --output=/path/myjob.out
#SBATCH --error=/path/myjob.out
# This job prints "Hello World" to output and exits
echo "Hello World"
Holding a Job
Sometimes you may want to submit a job but not have it run just yet. For example, you may want to allow labmates to cut in front of you in the queue: hold the job until their jobs have started, then release yours.
To place a hold on a job before it starts running, use the scontrol hold job command:
$ scontrol hold job myjobid
Once a job has started running, it cannot be placed on hold.
To release a hold on a job, use the scontrol release job command:
$ scontrol release job myjobid
Job Dependencies
Dependencies are an automated way of holding and releasing jobs. Jobs with a dependency are held until the condition is satisfied; once it is, they become eligible to run but must still queue as normal.
Job dependencies may be configured to ensure jobs start in a specified order. Jobs can be configured to run after other job state changes, such as when the job starts or the job ends.
These examples illustrate setting dependencies in several ways. Typically dependencies are set by capturing and using the job ID from the last job submitted.
To run a job after job myjobid has started:
$ sbatch --dependency=after:myjobid myjobsubmissionfile
To run a job after job myjobid ends without error:
$ sbatch --dependency=afterok:myjobid myjobsubmissionfile
To run a job after job myjobid ends with errors:
$ sbatch --dependency=afternotok:myjobid myjobsubmissionfile
To run a job after job myjobid ends with or without errors:
$ sbatch --dependency=afterany:myjobid myjobsubmissionfile
To set more complex dependencies on multiple jobs and conditions:
$ sbatch --dependency=after:myjobid1:myjobid2:myjobid3,afterok:myjobid4 myjobsubmissionfile
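In practice, the job ID of a previous submission is usually captured and passed into the next job's dependency. A minimal sketch using sbatch's --parsable option (which prints just the job ID), assuming two submission files named first.sub and second.sub:
# Submit the first job and capture its job ID.
jobid=$(sbatch --parsable first.sub)
# Submit a second job that runs only if the first ends without error.
sbatch --dependency=afterok:$jobid second.sub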
Canceling a Job
To stop a job before it finishes or remove it from a queue, use the scancel command:
$ scancel myjobid