Generic SLURM Jobs
The following examples demonstrate the basics of SLURM jobs, and are designed to cover common job request scenarios. These example jobs will need to be modified to run your application or code.
Simple Job
Every SLURM job consists of a job submission file. A job submission file contains a list of commands that run your program and a set of resource (nodes, walltime, queue) requests. The resource requests can appear in the job submission file or can be specified at submit-time as shown below.
This simple example submits the job submission file hello.sub to the standby queue on Gilbreth and requests a single node:
#!/bin/bash
# FILENAME: hello.sub
# Show this ran on a compute node by running the hostname command.
hostname
echo "Hello World"
On Gilbreth, specifying the number of GPUs requested per node is required.
sbatch -A standby --nodes=1 --ntasks=1 --cpus-per-task=1 --gpus-per-node=1 --time=00:01:00 hello.sub
Submitted batch job 3521
For a real job you would replace echo "Hello World" with the command, or sequence of commands, that runs your program (a hypothetical sketch is shown at the end of this example).
After your job finishes running, the ls command will show a new file in your directory, the .out file:
ls -l
hello.sub
slurm-3521.out
The file slurm-3521.out contains the output and errors your program would have written to the screen if you had typed its commands at a command prompt:
cat slurm-3521.out
gilbreth-a001.rcac.purdue.edu
Hello World
You should see the hostname of the compute node your job was executed on, followed by the "Hello World" message.
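As an illustration only (the executable, module, and input file names here are hypothetical), a submission file for a real program might look like:
#!/bin/bash
# FILENAME: myjob.sub
# Load any software modules your program needs (module name is illustrative).
module load gcc
# Replace my_program and my_input.txt with your own executable and input file.
./my_program my_input.txt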
Multiple Nodes
In some cases, you may want to request multiple nodes. To utilize multiple nodes, you will need to have a program or code that is specifically programmed to use multiple nodes such as with MPI. Simply requesting more nodes will not make your work go faster. Your code must support this ability.
This example shows a request for multiple compute nodes. The job submission file contains a single command to show the names of the compute nodes allocated:
#!/bin/bash
# FILENAME: myjobsubmissionfile.sub
echo "$SLURM_JOB_NODELIST"
On Gilbreth, specifying the number of GPUs requested per node is required.
sbatch --nodes=2 --ntasks=32 --gpus-per-node=1 --time=00:10:00 -A standby myjobsubmissionfile.sub
Compute nodes allocated:
gilbreth-a[014-015]
The above example will allocate a total of 32 CPU cores across 2 nodes. Note that if your multi-node job requests fewer cores than each node's full complement of 16, by default Slurm provides no guarantee about how this total is distributed among the assigned nodes (i.e. the cores may not be split evenly). If you need a specific arrangement of your tasks and cores, you can use the --cpus-per-task= and/or --ntasks-per-node= flags. See the Slurm documentation or man sbatch for more options.
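For example, to guarantee an even split of 16 tasks on each of the two nodes in the request above, you could make the per-node count explicit:
sbatch --nodes=2 --ntasks-per-node=16 --gpus-per-node=1 --time=00:10:00 -A standby myjobsubmissionfile.sub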
Directives
So far these examples have shown submitting jobs with the resource requests on the sbatch
command line such as:
sbatch -A standby --nodes=1 --gpus-per-node=1 --time=00:01:00 hello.sub
The resource requests can also be put into the job submission file itself. Documenting the resource requests in the job submission file is desirable because the job can be easily reproduced later; details left in your command history are quickly lost. Arguments are specified with the #SBATCH syntax:
#!/bin/bash
# FILENAME: hello.sub
#SBATCH -A standby
#SBATCH --nodes=1 --gpus-per-node=1 --time=00:01:00
# Show this ran on a compute node by running the hostname command.
hostname
echo "Hello World"
The #SBATCH
directives must appear at the top of your submission file. SLURM will stop parsing directives as soon as it encounters a line that does not start with '#'. If you insert a directive in the middle of your script, it will be ignored.
This job can then be submitted with:
sbatch hello.sub
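For instance, in the following sketch (filename illustrative) the --time directive is ignored because it appears after the first command:
#!/bin/bash
# FILENAME: bad_directives.sub
#SBATCH -A standby
#SBATCH --nodes=1 --gpus-per-node=1
hostname
#SBATCH --time=00:01:00    # Ignored: SLURM stops reading directives at the first command.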
Specific Types of Nodes
SLURM allows running a job on specific types of compute nodes to accommodate special hardware requirements (e.g. a certain CPU or GPU type, etc.)
Cluster nodes have a set of descriptive features assigned to them, and users can specify which of these features are required by their job by using the constraint option at submission time. Only nodes having features matching the job constraints will be used to satisfy the request.
Example: a job requires a compute node in an "A" sub-cluster:
sbatch --nodes=1 --ntasks=16 --gres=gpu:1 --constraint=A myjobsubmissionfile.sub
Compute node allocated:
gilbreth-a003
Feature constraints can be used for both batch and interactive jobs, as well as for individual job steps inside a job. Multiple constraints can be specified with a predefined syntax to achieve complex request logic (see detailed description of the '--constraint' option in man sbatch
or online Slurm documentation).
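For example, '&' (AND) and '|' (OR) can combine features; the sub-cluster labels below are placeholders for illustration:
# Request a node from either the "A" or the "B" sub-cluster:
sbatch --nodes=1 --ntasks=16 --gres=gpu:1 --constraint="A|B" myjobsubmissionfile.sub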
Refer to the Detailed Hardware Specification section for a list of available sub-cluster labels, their respective per-node memory sizes and other hardware details. You can also use the sfeatures command to list available constraint feature names for different node types.
Interactive Jobs
Interactive jobs are run on compute nodes, while giving you a shell to interact with. They give you the ability to type commands or use a graphical interface in the same way as if you were on a front-end login host.
To submit an interactive job, use sinteractive to run a login shell on allocated resources.
sinteractive accepts most of the same resource requests as sbatch, so to request a login shell on the cpu account while allocating 2 nodes and 32 total cores, you might do:
sinteractive -A cpu -N2 -n32 --gpus-per-node=1
To quit your interactive job:
exit
or Ctrl-D
The above example will allocate a total of 32 CPU cores across 2 nodes. Note that if your multi-node job requests fewer cores than each node's full complement of 16, by default Slurm provides no guarantee about how this total is distributed among the assigned nodes (i.e. the cores may not be split evenly). If you need a specific arrangement of your tasks and cores, you can use the --cpus-per-task= and/or --ntasks-per-node= flags. See the Slurm documentation or man salloc for more options.
Serial Jobs
This shows how to submit one of the serial programs compiled in the section Compiling Serial Programs.
Create a job submission file:
#!/bin/bash
# FILENAME: serial_hello.sub
./serial_hello
Submit the job:
sbatch --nodes=1 --ntasks=1 --gpus-per-node=1 --time=00:01:00 serial_hello.sub
After the job completes, view results in the output file:
cat slurm-myjobid.out
Runhost:gilbreth-a009.rcac.purdue.edu
hello, world
If the job failed to run, then view error messages in the file slurm-myjobid.out.
OpenMP
A shared-memory job is a single process that takes advantage of a multi-core processor and its shared memory to achieve parallelization.
This example shows how to submit an OpenMP program compiled in the section Compiling OpenMP Programs.
When running OpenMP programs, all threads must be on the same compute node to take advantage of shared memory. The threads cannot communicate between nodes.
To run an OpenMP program, set the environment variable OMP_NUM_THREADS to the desired number of threads:
In csh:
setenv OMP_NUM_THREADS 16
In bash:
export OMP_NUM_THREADS=16
This should almost always be equal to the number of cores on a compute node. You may want to set it to another appropriate value if you are running several processes in parallel in a single job or node.
Create a job submission file:
#!/bin/bash
# FILENAME: omp_hello.sub
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --gpus-per-node=1
#SBATCH --time=00:01:00
export OMP_NUM_THREADS=16
./omp_hello
Submit the job:
sbatch omp_hello.sub
View the results from one of the sample OpenMP programs demonstrating task parallelism:
cat slurm-myjobid.out
SERIAL REGION: Runhost:gilbreth-a003.rcac.purdue.edu Thread:0 of 1 thread hello, world
PARALLEL REGION: Runhost:gilbreth-a003.rcac.purdue.edu Thread:0 of 16 threads hello, world
PARALLEL REGION: Runhost:gilbreth-a003.rcac.purdue.edu Thread:1 of 16 threads hello, world
...
If the job failed to run, then view error messages in the file slurm-myjobid.out.
If an OpenMP program uses a lot of memory and 16 threads use all of the memory of the compute node, use fewer processor cores (OpenMP threads) on that compute node.
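One convenient sketch (an illustrative variant, not a requirement) is to request the cores with --cpus-per-task and derive the thread count from the SLURM_CPUS_PER_TASK environment variable, so that lowering the core request automatically lowers the number of threads:
#!/bin/bash
# FILENAME: omp_hello_fewer.sub  (illustrative variant of omp_hello.sub)
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-node=1
#SBATCH --time=00:01:00
# SLURM sets SLURM_CPUS_PER_TASK when --cpus-per-task is requested.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
./omp_hello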
MPI
An MPI job is a set of processes that take advantage of multiple compute nodes by communicating with each other. OpenMPI and Intel MPI (IMPI) are implementations of the MPI standard.
This section shows how to submit one of the MPI programs compiled in the section Compiling MPI Programs.
Use module load to set up the paths to access these libraries. Use module avail to see all MPI packages installed on Gilbreth.
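For example (module names are illustrative; check module avail for the versions actually installed on Gilbreth):
module load gcc openmpi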
Create a job submission file:
#!/bin/bash
# FILENAME: mpi_hello.sub
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --gpus-per-node=1
#SBATCH --time=00:01:00
#SBATCH -A standby
srun -n 32 ./mpi_hello
SLURM can run an MPI program with the srun command. The number of processes is requested with the -n option. If you do not specify the -n option, it will default to the total number of tasks you requested from SLURM.
If the code is built with OpenMPI, it can be run with a simple srun -n command. If it is built with Intel IMPI, then you also need to add the --mpi=pmi2 option: srun --mpi=pmi2 -n 32 ./mpi_hello in this example.
Submit the MPI job:
sbatch ./mpi_hello.sub
View results in the output file:
cat slurm-myjobid.out
Runhost:gilbreth-a010.rcac.purdue.edu Rank:0 of 32 ranks hello, world
Runhost:gilbreth-a010.rcac.purdue.edu Rank:1 of 32 ranks hello, world
...
Runhost:gilbreth-a011.rcac.purdue.edu Rank:16 of 32 ranks hello, world
Runhost:gilbreth-a011.rcac.purdue.edu Rank:17 of 32 ranks hello, world
...
If the job failed to run, then view error messages in the output file.
If an MPI job uses a lot of memory and 16 MPI ranks per compute node use all of the memory of the compute nodes, request more compute nodes, while keeping the total number of MPI ranks unchanged.
Submit the job with double the number of compute nodes and modify the resource request to halve the number of MPI ranks per compute node.
#!/bin/bash
# FILENAME: mpi_hello.sub
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=1
#SBATCH -t 00:01:00
#SBATCH -A standby
srun -n 32 ./mpi_hello
sbatch ./mpi_hello.sub
View results in the output file:
cat slurm-myjobid.out
Runhost:gilbreth-a010.rcac.purdue.edu Rank:0 of 32 ranks hello, world
Runhost:gilbreth-a010.rcac.purdue.edu Rank:1 of 32 ranks hello, world
...
Runhost:gilbreth-a011.rcac.purdue.edu Rank:8 of 32 ranks hello, world
...
Runhost:gilbreth-a012.rcac.purdue.edu Rank:16 of 32 ranks hello, world
...
Runhost:gilbreth-a013.rcac.purdue.edu Rank:24 of 32 ranks hello, world
...
Notes
- Use slist to determine which queues (--account or -A option) are available to you. The name of the queue which is available to everyone on Gilbreth is "standby".
- Invoking an MPI program on Gilbreth with ./program is typically wrong, since this will use only one MPI process and defeat the purpose of using MPI. Unless that is what you want (rarely the case), you should use srun or mpiexec to invoke an MPI program.
- In general, the exact order in which MPI ranks output similar write requests to an output file is random.
GPU
The Gilbreth cluster nodes contain NVIDIA GPUs that support CUDA and OpenCL. See the detailed hardware overview for the specifics on the GPUs in Gilbreth.
This section illustrates how to use SLURM to submit a simple GPU program.
Suppose that you named your executable file gpu_hello from the sample code gpu_hello.cu
(see the section on compiling NVIDIA GPU codes). Prepare a job submission file with an appropriate name, here named gpu_hello.sub:
#!/bin/bash
# FILENAME: gpu_hello.sub
module load cuda
host=`hostname -s`
echo $CUDA_VISIBLE_DEVICES
# Run on the first available GPU
./gpu_hello 0
Submit the job:
sbatch -A ai --nodes=1 --gres=gpu:1 -t 00:01:00 gpu_hello.sub
Requesting a GPU from the scheduler is required.
You can specify the total number of GPUs, the number of GPUs per node, or even the number of GPUs per task:
sbatch -A ai --nodes=1 --gres=gpu:1 -t 00:01:00 gpu_hello.sub
sbatch -A ai --nodes=1 --gpus-per-node=1 -t 00:01:00 gpu_hello.sub
sbatch -A ai --nodes=1 --gpus-per-task=1 -t 00:01:00 gpu_hello.sub
After job completion, view the new output file in your directory:
ls -l
gpu_hello
gpu_hello.cu
gpu_hello.sub
slurm-myjobid.out
View the results in the standard output file, slurm-myjobid.out:
0
hello, world
If the job failed to run, then view error messages in the file slurm-myjobid.out.
To use multiple GPUs in your job, simply specify a larger value for the GPU specification parameter. However, be aware of the number of GPUs installed on the node(s) you may be requesting. The scheduler cannot allocate more GPUs than physically exist. See the detailed hardware overview and the output of the sfeatures command for specifics on the GPUs in Gilbreth.
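For example, a request for two GPUs on a single node (assuming the node type you land on has at least two GPUs installed) might look like:
sbatch -A ai --nodes=1 --gpus-per-node=2 -t 00:01:00 gpu_hello.sub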
Collecting System Resource Utilization Data
Knowing the precise resource utilization an application had during a job, such as GPU load or memory, can be incredibly useful. This is especially the case when the application isn't performing as expected.
One approach is to run a program like htop during an interactive job and keep an eye on system resources. You can also get precise time-series data for the nodes associated with your job online through XDMoD. But these methods don't gather telemetry in an automated fashion, nor do they give you control over the resolution or format of the data.
As a matter of course, a robust implementation of an HPC workload would include collecting resource utilization data as a diagnostic tool in the event of a failure.
The monitor
utility is a simple command line system resource monitoring tool for gathering such telemetry and is available as a module.
module load utilities monitor
Complete documentation is available online at resource-monitor.readthedocs.io. A full manual page is also available for reference: man monitor.
In the context of a SLURM job you will need to put this monitoring task in the background to allow the rest of your job script to proceed. Be sure to interrupt these tasks at the end of your job.
#!/bin/bash
# FILENAME: monitored_job.sh
module load utilities monitor
# track GPU load
monitor gpu percent >gpu-percent.log &
GPU_PID=$!
# track CPU load
monitor cpu percent >cpu-percent.log &
CPU_PID=$!
# your code here
# shut down the resource monitors
kill -s INT $GPU_PID $CPU_PID
A particularly elegant solution would be to include such tools in your prologue script and have the teardown in your epilogue script.
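A minimal sketch of that pattern (the helper file names and contents below are purely hypothetical):
# FILENAME: start_monitors.sh  (hypothetical prologue helper)
monitor gpu percent >gpu-percent.log &
echo $! >gpu-monitor.pid
monitor cpu percent >cpu-percent.log &
echo $! >cpu-monitor.pid

# FILENAME: stop_monitors.sh  (hypothetical epilogue helper)
kill -s INT $(cat gpu-monitor.pid cpu-monitor.pid)
Your job script would then source start_monitors.sh before your code and stop_monitors.sh after it.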
For large distributed jobs spread across multiple nodes, mpiexec can be used to gather telemetry from all nodes in the job. The hostname is included in each line of output so that data can be grouped as such. A concise way of constructing the needed list of hostnames in SLURM is to simply use srun hostname | sort -u.
#!/bin/bash
# FILENAME: monitored_job.sh
module load utilities monitor
# track all GPUs (one monitor per host)
mpiexec -machinefile <(srun hostname | sort -u) \
monitor gpu percent >gpu-percent.log &
GPU_PID=$!
# track all CPUs (one monitor per host)
mpiexec -machinefile <(srun hostname | sort -u) \
monitor cpu percent --all-cores >cpu-percent.log &
CPU_PID=$!
# your code here
# shut down the resource monitors
kill -s INT $GPU_PID $CPU_PID
To get resource data in a more readily computable format, the monitor program can be told to output in CSV format with the --csv flag.
monitor gpu memory --csv >gpu-memory.csv
For a distributed job you will need to suppress the header lines otherwise one will be created by each host.
monitor gpu memory --csv | head -1 >gpu-memory.csv
mpiexec -machinefile <(srun hostname | sort -u) \
monitor gpu memory --csv --no-header >>gpu-memory.csv