Generic SLURM Jobs
The following examples demonstrate the basics of SLURM jobs, and are designed to cover common job request scenarios. These example jobs will need to be modified to run your application or code.
Simple Job
Every SLURM job consists of a job submission file. A job submission file contains a list of commands that run your program and a set of resource (nodes, walltime, queue) requests. The resource requests can appear in the job submission file or can be specified at submit-time as shown below.
This simple example submits the job submission file hello.sub to the standby queue on Gilbreth and requests a single node:
#!/bin/bash
# FILENAME: hello.sub
# Show this ran on a compute node by running the hostname command.
hostname
echo "Hello World"
On Gilbreth, specifying the number of GPUs requested per node is required.
sbatch -A standby --nodes=1 --ntasks=1 --cpus-per-task=1 --gpus-per-node=1 --time=00:01:00 hello.sub
Submitted batch job 3521
For a real job you would replace echo "Hello World" with the command, or sequence of commands, that runs your program (a hypothetical sketch is shown at the end of this example).
After your job finishes running, the ls command will show a new file in your directory, the .out file:
ls -l
hello.sub
slurm-3521.out
The file slurm-3521.out contains the output and errors your program would have written to the screen if you had typed its commands at a command prompt:
cat slurm-3521.out
gilbreth-a001.rcac.purdue.edu
Hello World
You should see the hostname of the compute node your job was executed on, followed by the "Hello World" message.
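As an illustration only (the executable, module, and input file names here are hypothetical), a submission file for a real program might look like:
#!/bin/bash
# FILENAME: myjob.sub
# Load any software modules your program needs (module name is illustrative).
module load gcc
# Replace my_program and my_input.txt with your own executable and input file.
./my_program my_input.txt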
Multiple Nodes
In some cases, you may want to request multiple nodes. To utilize multiple nodes, you will need to have a program or code that is specifically programmed to use multiple nodes such as with MPI. Simply requesting more nodes will not make your work go faster. Your code must support this ability.
This example shows a request for multiple compute nodes. The job submission file contains a single command to show the names of the compute nodes allocated:
#!/bin/bash
# FILENAME: myjobsubmissionfile.sub
echo "$SLURM_JOB_NODELIST"
On Gilbreth, specifying the number of GPUs requested per node is required.
sbatch --nodes=2 --ntasks=32 --gpus-per-node=1 --time=00:10:00 -A standby myjobsubmissionfile.sub
Compute nodes allocated:
gilbreth-a[014-015]
The above example will allocate a total of 32 CPU cores across 2 nodes. Note that if your multi-node job requests fewer cores than each node's full complement of 16, by default Slurm provides no guarantee about how this total is distributed among the assigned nodes (i.e. the cores may not be split evenly). If you need a specific arrangement of your tasks and cores, you can use the --cpus-per-task= and/or --ntasks-per-node= flags. See the Slurm documentation or man sbatch for more options.
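For example, to guarantee an even split of 16 tasks on each of the two nodes in the request above, you could make the per-node count explicit:
sbatch --nodes=2 --ntasks-per-node=16 --gpus-per-node=1 --time=00:10:00 -A standby myjobsubmissionfile.sub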
Directives
So far these examples have shown submitting jobs with the resource requests on the sbatch
command line such as:
sbatch -A standby --nodes=1 --gpus-per-node=1 --time=00:01:00 hello.sub
The resource requests can also be put into the job submission file itself. Documenting the resource requests in the job submission file is desirable because the job can be easily reproduced later; details left in your command history are quickly lost. Arguments are specified with the #SBATCH syntax:
#!/bin/bash
# FILENAME: hello.sub
#SBATCH -A standby
#SBATCH --nodes=1 --gpus-per-node=1 --time=00:01:00
# Show this ran on a compute node by running the hostname command.
hostname
echo "Hello World"
The #SBATCH
directives must appear at the top of your submission file. SLURM will stop parsing directives as soon as it encounters a line that does not start with '#'. If you insert a directive in the middle of your script, it will be ignored.
This job can then be submitted with:
sbatch hello.sub
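For instance, in the following sketch (filename illustrative) the --time directive is ignored because it appears after the first command:
#!/bin/bash
# FILENAME: bad_directives.sub
#SBATCH -A standby
#SBATCH --nodes=1 --gpus-per-node=1
hostname
#SBATCH --time=00:01:00    # Ignored: SLURM stops reading directives at the first command.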
Specific Types of Nodes
SLURM allows running a job on specific types of compute nodes to accommodate special hardware requirements (e.g. a certain CPU or GPU type, etc.)
Cluster nodes have a set of descriptive features assigned to them, and users can specify which of these features are required by their job by using the constraint option at submission time. Only nodes having features matching the job constraints will be used to satisfy the request.
Example: a job requires a compute node in an "A" sub-cluster:
sbatch --nodes=1 --ntasks=16 --gres=gpu:1 --constraint=A myjobsubmissionfile.sub
Compute node allocated:
gilbreth-a003
Feature constraints can be used for both batch and interactive jobs, as well as for individual job steps inside a job. Multiple constraints can be specified with a predefined syntax to achieve complex request logic (see detailed description of the '--constraint' option in man sbatch
or online Slurm documentation).
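For example, '&' (AND) and '|' (OR) can combine features; the sub-cluster labels below are placeholders for illustration:
# Request a node from either the "A" or the "B" sub-cluster:
sbatch --nodes=1 --ntasks=16 --gres=gpu:1 --constraint="A|B" myjobsubmissionfile.sub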
Refer to the Detailed Hardware Specification section for a list of available sub-cluster labels, their respective per-node memory sizes and other hardware details. You can also use the sfeatures command to list available constraint feature names for different node types.
Interactive Jobs
Interactive jobs are run on compute nodes, while giving you a shell to interact with. They give you the ability to type commands or use a graphical interface in the same way as if you were on a front-end login host.
To submit an interactive job, use sinteractive to run a login shell on allocated resources.
sinteractive accepts most of the same resource requests as sbatch, so to request a login shell on the cpu account while allocating 2 nodes and 32 total cores, you might do:
sinteractive -A cpu -N2 -n32 --gpus-per-node=1
To quit your interactive job:
exit
or Ctrl-D
The above example will allocate a total of 32 CPU cores across 2 nodes. Note that if your multi-node job requests fewer cores than each node's full complement of 16, by default Slurm provides no guarantee about how this total is distributed among the assigned nodes (i.e. the cores may not be split evenly). If you need a specific arrangement of your tasks and cores, you can use the --cpus-per-task= and/or --ntasks-per-node= flags. See the Slurm documentation or man salloc for more options.
Serial Jobs
This shows how to submit one of the serial programs compiled in the section Compiling Serial Programs.
Create a job submission file:
#!/bin/bash
# FILENAME: serial_hello.sub
./serial_hello
Submit the job:
sbatch --nodes=1 --ntasks=1 --gpus-per-node=1 --time=00:01:00 serial_hello.sub
After the job completes, view results in the output file:
cat slurm-myjobid.out
Runhost:gilbreth-a009.rcac.purdue.edu
hello, world
If the job failed to run, then view error messages in the file slurm-myjobid.out.
OpenMP
A shared-memory job is a single process that takes advantage of a multi-core processor and its shared memory to achieve parallelization.
This example shows how to submit an OpenMP program compiled in the section Compiling OpenMP Programs.
When running OpenMP programs, all threads must be on the same compute node to take advantage of shared memory. The threads cannot communicate between nodes.
To run an OpenMP program, set the environment variable OMP_NUM_THREADS to the desired number of threads:
In csh:
setenv OMP_NUM_THREADS 16
In bash:
export OMP_NUM_THREADS=16
This should almost always be equal to the number of cores on a compute node. You may want to set it to another appropriate value if you are running several processes in parallel in a single job or node.
Create a job submission file:
#!/bin/bash
# FILENAME: omp_hello.sub
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --gpus-per-node=1
#SBATCH --time=00:01:00
export OMP_NUM_THREADS=16
./omp_hello
Submit the job:
sbatch omp_hello.sub
View the results from one of the sample OpenMP programs demonstrating task parallelism:
cat slurm-myjobid.out
SERIAL REGION: Runhost:gilbreth-a003.rcac.purdue.edu Thread:0 of 1 thread hello, world
PARALLEL REGION: Runhost:gilbreth-a003.rcac.purdue.edu Thread:0 of 16 threads hello, world
PARALLEL REGION: Runhost:gilbreth-a003.rcac.purdue.edu Thread:1 of 16 threads hello, world
...
If the job failed to run, then view error messages in the file slurm-myjobid.out.
If an OpenMP program uses a lot of memory and 16 threads use all of the memory of the compute node, use fewer processor cores (OpenMP threads) on that compute node.
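One convenient sketch (an illustrative variant, not a requirement) is to request the cores with --cpus-per-task and derive the thread count from the SLURM_CPUS_PER_TASK environment variable, so that lowering the core request automatically lowers the number of threads:
#!/bin/bash
# FILENAME: omp_hello_fewer.sub  (illustrative variant of omp_hello.sub)
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-node=1
#SBATCH --time=00:01:00
# SLURM sets SLURM_CPUS_PER_TASK when --cpus-per-task is requested.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
./omp_hello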
MPI
An MPI job is a set of processes that take advantage of multiple compute nodes by communicating with each other. OpenMPI and Intel MPI (IMPI) are implementations of the MPI standard.
This section shows how to submit one of the MPI programs compiled in the section Compiling MPI Programs.
Use module load to set up the paths to access these libraries. Use module avail to see all MPI packages installed on Gilbreth.
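For example (module names are illustrative; check module avail for the versions actually installed on Gilbreth):
module load gcc openmpi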
Create a job submission file:
#!/bin/bash
# FILENAME: mpi_hello.sub
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --gpus-per-node=1
#SBATCH --time=00:01:00
#SBATCH -A standby
srun -n 32 ./mpi_hello
SLURM can run an MPI program with the srun command. The number of processes is requested with the -n option. If you do not specify the -n option, it will default to the total number of tasks you requested from SLURM.
If the code is built with OpenMPI, it can be run with a simple srun -n command. If it is built with Intel IMPI, then you also need to add the --mpi=pmi2 option: srun --mpi=pmi2 -n 32 ./mpi_hello in this example.
Submit the MPI job:
sbatch ./mpi_hello.sub
View results in the output file:
cat slurm-myjobid.out
Runhost:gilbreth-a010.rcac.purdue.edu Rank:0 of 32 ranks hello, world
Runhost:gilbreth-a010.rcac.purdue.edu Rank:1 of 32 ranks hello, world
...
Runhost:gilbreth-a011.rcac.purdue.edu Rank:16 of 32 ranks hello, world
Runhost:gilbreth-a011.rcac.purdue.edu Rank:17 of 32 ranks hello, world
...
If the job failed to run, then view error messages in the output file.
If an MPI job uses a lot of memory and 16 MPI ranks per compute node use all of the memory of the compute nodes, request more compute nodes, while keeping the total number of MPI ranks unchanged.
Submit the job with double the number of compute nodes and modify the resource request to halve the number of MPI ranks per compute node.
#!/bin/bash
# FILENAME: mpi_hello.sub
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=1
#SBATCH -t 00:01:00
#SBATCH -A standby
srun -n 32 ./mpi_hello
sbatch ./mpi_hello.sub
View results in the output file:
cat slurm-myjobid.out
Runhost:gilbreth-a010.rcac.purdue.edu Rank:0 of 32 ranks hello, world
Runhost:gilbreth-a010.rcac.purdue.edu Rank:1 of 32 ranks hello, world
...
Runhost:gilbreth-a011.rcac.purdue.edu Rank:8 of 32 ranks hello, world
...
Runhost:gilbreth-a012.rcac.purdue.edu Rank:16 of 32 ranks hello, world
...
Runhost:gilbreth-a013.rcac.purdue.edu Rank:24 of 32 ranks hello, world
...
Notes
- Use slist to determine which queues (--account or -A option) are available to you. The name of the queue which is available to everyone on Gilbreth is "standby".
- Invoking an MPI program on Gilbreth with ./program is typically wrong, since this will use only one MPI process and defeat the purpose of using MPI. Unless that is what you want (rarely the case), you should use srun or mpiexec to invoke an MPI program.
- In general, the exact order in which MPI ranks output similar write requests to an output file is random.
GPU
The Gilbreth cluster nodes contain NVIDIA GPUs that support CUDA and OpenCL. See the detailed hardware overview for the specifics on the GPUs in Gilbreth.
This section illustrates how to use SLURM to submit a simple GPU program.
Suppose that you named your executable file gpu_hello from the sample code gpu_hello.cu
(see the section on compiling NVIDIA GPU codes). Prepare a job submission file with an appropriate name, here named gpu_hello.sub:
#!/bin/bash
# FILENAME: gpu_hello.sub
module load cuda
host=`hostname -s`
echo $CUDA_VISIBLE_DEVICES
# Run on the first available GPU
./gpu_hello 0
Submit the job:
sbatch -A ai --nodes=1 --gres=gpu:1 -t 00:01:00 gpu_hello.sub
Requesting a GPU from the scheduler is required.
You can specify the total number of GPUs, the number of GPUs per node, or even the number of GPUs per task:
sbatch -A ai --nodes=1 --gres=gpu:1 -t 00:01:00 gpu_hello.sub
sbatch -A ai --nodes=1 --gpus-per-node=1 -t 00:01:00 gpu_hello.sub
sbatch -A ai --nodes=1 --gpus-per-task=1 -t 00:01:00 gpu_hello.sub
After job completion, view the new output file in your directory:
ls -l
gpu_hello
gpu_hello.cu
gpu_hello.sub
slurm-myjobid.out
View the results in the standard output file, slurm-myjobid.out:
0
hello, world
If the job failed to run, then view error messages in the file slurm-myjobid.out.
To use multiple GPUs in your job, simply specify a larger value for the GPU specification parameter. However, be aware of the number of GPUs installed on the node(s) you may be requesting. The scheduler cannot allocate more GPUs than physically exist. See the detailed hardware overview and the output of the sfeatures command for specifics on the GPUs in Gilbreth.
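For example, a request for two GPUs on a single node (assuming the node type you land on has at least two GPUs installed) might look like:
sbatch -A ai --nodes=1 --gpus-per-node=2 -t 00:01:00 gpu_hello.sub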
Collecting System Resource Utilization Data
Knowing the precise resource utilization an application had during a job, such as GPU load or memory, can be incredibly useful. This is especially the case when the application isn't performing as expected.
One approach is to run a program like htop during an interactive job and keep an eye on system resources. You can also get precise time-series data for the nodes associated with your job online through XDMoD. But these methods don't gather telemetry in an automated fashion, nor do they give you control over the resolution or format of the data.
As a matter of course, a robust implementation of an HPC workload would include collecting resource utilization data as a diagnostic tool in the event of a failure.
The monitor
utility is a simple command line system resource monitoring tool for gathering such telemetry and is available as a module.
module load utilities monitor
Complete documentation is available online at resource-monitor.readthedocs.io. A full manual page is also available for reference: man monitor.
In the context of a SLURM job you will need to put this monitoring task in the background to allow the rest of your job script to proceed. Be sure to interrupt these tasks at the end of your job.
#!/bin/bash
# FILENAME: monitored_job.sh
module load utilities monitor
# track GPU load
monitor gpu percent >gpu-percent.log &
GPU_PID=$!
# track CPU load
monitor cpu percent >cpu-percent.log &
CPU_PID=$!
# your code here
# shut down the resource monitors
kill -s INT $GPU_PID $CPU_PID
A particularly elegant solution would be to include such tools in your prologue script and have the teardown in your epilogue script.
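A minimal sketch of that pattern (the helper file names and contents below are purely hypothetical):
# FILENAME: start_monitors.sh  (hypothetical prologue helper)
monitor gpu percent >gpu-percent.log &
echo $! >gpu-monitor.pid
monitor cpu percent >cpu-percent.log &
echo $! >cpu-monitor.pid

# FILENAME: stop_monitors.sh  (hypothetical epilogue helper)
kill -s INT $(cat gpu-monitor.pid cpu-monitor.pid)
Your job script would then source start_monitors.sh before your code and stop_monitors.sh after it.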
For large distributed jobs spread across multiple nodes, mpiexec can be used to gather telemetry from all nodes in the job. The hostname is included in each line of output so that data can be grouped as such. A concise way of constructing the needed list of hostnames in SLURM is to simply use srun hostname | sort -u.
#!/bin/bash
# FILENAME: monitored_job.sh
module load utilities monitor
# track all GPUs (one monitor per host)
mpiexec -machinefile <(srun hostname | sort -u) \
monitor gpu percent >gpu-percent.log &
GPU_PID=$!
# track all CPUs (one monitor per host)
mpiexec -machinefile <(srun hostname | sort -u) \
monitor cpu percent --all-cores >cpu-percent.log &
CPU_PID=$!
# your code here
# shut down the resource monitors
kill -s INT $GPU_PID $CPU_PID
To get resource data in a more readily computable format, the monitor program can be told to output in CSV format with the --csv flag.
monitor gpu memory --csv >gpu-memory.csv
For a distributed job you will need to suppress the header lines otherwise one will be created by each host.
monitor gpu memory --csv | head -1 >gpu-memory.csv
mpiexec -machinefile <(srun hostname | sort -u) \
monitor gpu memory --csv --no-header >>gpu-memory.csv