Generic SLURM Jobs

The following examples demonstrate the basics of SLURM jobs, and are designed to cover common job request scenarios. These example jobs will need to be modified to run your application or code.

Serial job in shared queue

This shows an example of a job submission file of the serial programs:

#!/bin/bash
# FILENAME:  myjobsubmissionfile

#SBATCH -A myallocation   # Allocation name 
#SBATCH --nodes=1         # Total # of nodes (must be 1 for serial job)
#SBATCH --ntasks=1        # Total # of MPI tasks (should be 1 for serial job)
#SBATCH --time=1:30:00    # Total run time limit (hh:mm:ss)
#SBATCH -J myjobname      # Job name
#SBATCH -o myjob.o%j      # Name of stdout output file
#SBATCH -e myjob.e%j      # Name of stderr error file
#SBATCH -p shared  # Queue (partition) name
#SBATCH --mail-user=useremailaddress
#SBATCH --mail-type=all   # Send email to above address at begin and end of job

# Manage processing environment, load compilers and applications.
module purge
module load compilername
module load applicationname
module list

# Launch serial code
./myexecutablefiles

If you would like to submit one serial job at a time, using shared queue will only charge 1 core, instead of charging 128 cores for wholenode queue.

MPI job in wholenode queue

An MPI job is a set of processes that take advantage of multiple compute nodes by communicating with each other. OpenMPI, Intel MPI (IMPI), and MVAPICH2 are implementations of the MPI standard.

This shows an example of a job submission file of the MPI programs:

#!/bin/bash
# FILENAME:  myjobsubmissionfile

#SBATCH -A myallocation  # Allocation name
#SBATCH --nodes=2        # Total # of nodes 
#SBATCH --ntasks=256     # Total # of MPI tasks
#SBATCH --time=1:30:00   # Total run time limit (hh:mm:ss)
#SBATCH -J myjobname     # Job name
#SBATCH -o myjob.o%j     # Name of stdout output file
#SBATCH -e myjob.e%j     # Name of stderr error file
#SBATCH -p wholenode     # Queue (partition) name
#SBATCH --mail-user=useremailaddress
#SBATCH--mail-type=all   # Send email to above address at begin and end of job

# Manage processing environment, load compilers and applications.
module purge
module load compilername
module load mpilibrary
module load applicationname
module list

# Launch MPI code
mpirun -np $SLURM_NTASKS ./myexecutablefiles

SLURM can run an MPI program with the srun command. The number of processes is requested with the -n option. If you do not specify the -n option, it will default to the total number of processor cores you request from SLURM.

If the code is built with OpenMPI, it can be run with a simple srun -n command. If it is built with Intel IMPI, then you also need to add the --mpi=pmi2 option: srun --mpi=pmi2 -n 256 ./mycode.exe in this example.

Invoking an MPI program on Anvil with ./myexecutablefiles is typically wrong, since this will use only one MPI process and defeat the purpose of using MPI. Unless that is what you want (rarely the case), you should use srun which is the Slurm analog of mpirun or mpiexec, or use mpirun or mpiexec to invoke an MPI program.

OpenMP job in wholenode queue

A shared-memory job is a single process that takes advantage of a multi-core processor and its shared memory to achieve parallelization.

When running OpenMP programs, all threads must be on the same compute node to take advantage of shared memory. The threads cannot communicate between nodes.

To run an OpenMP program, set the environment variable OMP_NUM_THREADS to the desired number of threads. This should almost always be equal to the number of cores on a compute node. You may want to set to another appropriate value if you are running several processes in parallel in a single job or node.

This example shows how to submit an OpenMP program, this job asked for 2 MPI tasks, each with 64 OpenMP threads for a total of 128 CPU-cores:

#!/bin/bash
# FILENAME:  myjobsubmissionfile

#SBATCH -A myallocation         # Allocation name 
#SBATCH --nodes=1               # Total # of nodes (must be 1 for OpenMP job)
#SBATCH --ntasks-per-node=2     # Total # of MPI tasks per node
#SBATCH --cpus-per-task=64      # cpu-cores per task (default value is 1, >1 for multi-threaded tasks)
#SBATCH --time=1:30:00          # Total run time limit (hh:mm:ss)
#SBATCH -J myjobname            # Job name
#SBATCH -o myjob.o%j            # Name of stdout output file
#SBATCH -e myjob.e%j            # Name of stderr error file
#SBATCH -p wholenode            # Queue (partition) name
#SBATCH --mail-user=useremailaddress
#SBATCH --mail-type=all         # Send email to above address at begin and end of job

# Manage processing environment, load compilers and applications.
module purge
module load compilername
module load applicationname
module list

# Set thread count (default value is 1).
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Launch OpenMP code
./myexecutablefiles

The ntasks x cpus-per-task should equal to or less than the total number of CPU cores on a node.

If an OpenMP program uses a lot of memory and 128 threads use all of the memory of the compute node, use fewer processor cores (OpenMP threads) on that compute node.

Hybrid job in wholenode queue

A hybrid program combines both MPI and shared-memory to take advantage of compute clusters with multi-core compute nodes. Libraries for OpenMPI, Intel MPI (IMPI), and MVAPICH2 and compilers which include OpenMP for C, C++, and Fortran are available.

This example shows how to submit a hybrid program, this job asked for 4 MPI tasks (with 2 MPI tasks per node), each with 64 OpenMP threads for a total of 256 CPU-cores:

#!/bin/bash
# FILENAME:  myjobsubmissionfile

#SBATCH -A myallocation       # Allocation name 
#SBATCH --nodes=2             # Total # of nodes 
#SBATCH --ntasks-per-node=2   # Total # of MPI tasks per node
#SBATCH --cpus-per-task=64    # cpu-cores per task (default value is 1, >1 for multi-threaded tasks)
#SBATCH --time=1:30:00        # Total run time limit (hh:mm:ss)
#SBATCH -J myjobname          # Job name
#SBATCH -o myjob.o%j          # Name of stdout output file
#SBATCH -e myjob.e%j          # Name of stderr error file
#SBATCH -p wholenode          # Queue (partition) name
#SBATCH --mail-user=useremailaddress
#SBATCH --mail-type=all       # Send email at begin and end of job

# Manage processing environment, load compilers and applications.
module purge
module load compilername
module load mpilibrary
module load applicationname
module list

# Set thread count (default value is 1).
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Launch MPI code
mpirun -np $SLURM_NTASKS ./myexecutablefiles

The ntasks x cpus-per-task should equal to or less than the total number of CPU cores on a node.

GPU job in GPU queue

The Anvil cluster nodes contain GPUs that support CUDA and OpenCL. See the detailed hardware overview for the specifics on the GPUs in Anvil or use sfeatures command to see the detailed hardware overview..

Link to section 'How to use Slurm to submit a SINGLE-node GPU program:' of 'GPU job in GPU queue' How to use Slurm to submit a SINGLE-node GPU program:

#!/bin/bash
# FILENAME:  myjobsubmissionfile

#SBATCH -A myGPUallocation       # allocation name
#SBATCH --nodes=1             # Total # of nodes 
#SBATCH --ntasks-per-node=1   # Number of MPI ranks per node (one rank per GPU)
#SBATCH --gpus-per-node=1     # Number of GPUs per node
#SBATCH --time=1:30:00        # Total run time limit (hh:mm:ss)
#SBATCH -J myjobname          # Job name
#SBATCH -o myjob.o%j          # Name of stdout output file
#SBATCH -e myjob.e%j          # Name of stderr error file
#SBATCH -p gpu                # Queue (partition) name
#SBATCH --mail-user=useremailaddress
#SBATCH --mail-type=all       # Send email to above address at begin and end of job

# Manage processing environment, load compilers, and applications.
module purge
module load modtree/gpu
module load applicationname
module list

# Launch GPU code
./myexecutablefiles

Link to section 'How to use Slurm to submit a MULTI-node GPU program:' of 'GPU job in GPU queue' How to use Slurm to submit a MULTI-node GPU program:

#!/bin/bash
# FILENAME:  myjobsubmissionfile

#SBATCH -A myGPUallocation       # allocation name
#SBATCH --nodes=2             # Total # of nodes 
#SBATCH --ntasks-per-node=4   # Number of MPI ranks per node (one rank per GPU)
#SBATCH --gpus-per-node=4     # Number of GPUs per node
#SBATCH --time=1:30:00        # Total run time limit (hh:mm:ss)
#SBATCH -J myjobname          # Job name
#SBATCH -o myjob.o%j          # Name of stdout output file
#SBATCH -e myjob.e%j          # Name of stderr error file
#SBATCH -p gpu                # Queue (partition) name
#SBATCH --mail-user=useremailaddress
#SBATCH --mail-type=all       # Send email to above address at begin and end of job

# Manage processing environment, load compilers, and applications.
module purge
module load modtree/gpu
module load applicationname
module list

# Launch GPU code
mpirun -np $SLURM_NTASKS ./myexecutablefiles

Make sure to use --gpus-per-node command, otherwise, your job may not run properly.

NGC GPU container job in GPU queue

Link to section 'What is NGC?' of 'NGC GPU container job in GPU queue' What is NGC?

Nvidia GPU Cloud (NGC) is a GPU-accelerated cloud platform optimized for deep learning and scientific computing. NGC offers a comprehensive catalogue of GPU-accelerated containers, so the application runs quickly and reliably in the high-performance computing environment. Anvil team deployed NGC to extend the cluster capabilities and to enable powerful software and deliver the fastest results. By utilizing Singularity and NGC, users can focus on building lean models, producing optimal solutions, and gathering faster insights. For more information, please visit https://www.nvidia.com/en-us/gpu-cloud and NGC software catalog.

Link to section ' Getting Started ' of 'NGC GPU container job in GPU queue' Getting Started

Users can download containers from the NGC software catalog and run them directly using Singularity instructions from the corresponding container’s catalog page.

In addition, a subset of pre-downloaded NGC containers wrapped into convenient software modules are provided. These modules wrap underlying complexity and provide the same commands that are expected from non-containerized versions of each application.

On Anvil, type the command below to see the lists of NGC containers we deployed.

$ module load modtree/gpu
$ module load ngc 
$ module avail

Once module loaded ngc, you can run your code as with normal non-containerized applications. This section illustrates how to use SLURM to submit a job with a containerized NGC program.

#!/bin/bash
# FILENAME:  myjobsubmissionfile

#SBATCH -A myallocation       # allocation name 
#SBATCH --nodes=1             # Total # of nodes 
#SBATCH --ntasks-per-node=1   # Number of MPI ranks per node (one rank per GPU)
#SBATCH --gres=gpu:1          # Number of GPUs per node
#SBATCH --time=1:30:00        # Total run time limit (hh:mm:ss)
#SBATCH -J myjobname          # Job name
#SBATCH -o myjob.o%j          # Name of stdout output file
#SBATCH -e myjob.e%j          # Name of stderr error file
#SBATCH -p gpu                # Queue (partition) name
#SBATCH --mail-user=useremailaddress
#SBATCH --mail-type=all       # Send email to above address at begin and end of job

# Manage processing environment, load compilers, container, and applications.
module purge
module load modtree/gpu
module load ngc
module load applicationname
module list

# Launch GPU code
myexecutablefiles

BioContainers Collection

Link to section 'What is BioContainers?' of 'BioContainers Collection' What is BioContainers?

The BioContainers project came from the idea of using the containers-based technologies such as Docker or rkt for bioinformatics software. Having a common and controllable environment for running software could help to deal with some of the current problems during software development and distribution. BioContainers is a community-driven project that provides the infrastructure and basic guidelines to create, manage and distribute bioinformatics containers with a special focus on omics fields such as proteomics, genomics, transcriptomics, and metabolomics. For more information, please visit BioContainers project.

Link to section ' Getting Started ' of 'BioContainers Collection' Getting Started

Users can download bioinformatic containers from the BioContainers.pro and run them directly using Singularity instructions from the corresponding container’s catalog page.

Detailed Singularity user guide is available at: sylabs.io/guides/3.8/user-guide

In addition, Anvil team provides a subset of pre-downloaded biocontainers wrapped into convenient software modules. These modules wrap underlying complexity and provide the same commands that are expected from non-containerized versions of each application.

On Anvil, type the command below to see the lists of biocontainers we deployed.

$ module purge
$ module load modtree/cpu
$ module load biocontainers 
$ module avail

Once module loaded biocontainers, you can run your code as with normal non-containerized applications. This section illustrates how to use SLURM to submit a job with a biocontainers program.

#!/bin/bash
# FILENAME:  myjobsubmissionfile

#SBATCH -A myallocation       # allocation name
#SBATCH --nodes=1             # Total # of nodes 
#SBATCH --ntasks-per-node=1   # Number of MPI ranks per node 
#SBATCH --time=1:30:00        # Total run time limit (hh:mm:ss)
#SBATCH -J myjobname          # Job name
#SBATCH -o myjob.o%j          # Name of stdout output file
#SBATCH -e myjob.e%j          # Name of stderr error file
#SBATCH -p wholenode          # Queue (partition) name
#SBATCH --mail-user=useremailaddress
#SBATCH --mail-type=all       # Send email to above address at begin and end of job 

# Manage processing environment, load compilers, container, and applications.
module purge
module load modtree/cpu
module load biocontainers
module load applicationname
module list

# Launch code
./myexecutablefiles

Monitoring Resources

Knowing the precise resource utilization an application had during a job, such as CPU load or memory, can be incredibly useful. This is especially the case when the application isn't performing as expected.

One approach is to run a program like htop during an interactive job and keep an eye on system resources. You can get precise time-series data from nodes associated with your job using XDmod as well, online. But these methods don't gather telemetry in an automated fashion, nor do they give you control over the resolution or format of the data.

As a matter of course, a robust implementation of some HPC workload would include resource utilization data as a diagnostic tool in the event of some failure.

The monitor utility is a simple command line system resource monitoring tool for gathering such telemetry and is available as a module.

module load monitor

Complete documentation is available online at resource-monitor.readthedocs.io. A full manual page is also available for reference, man monitor.

In the context of a SLURM job you will need to put this monitoring task in the background to allow the rest of your job script to proceed. Be sure to interrupt these tasks at the end of your job.

#!/bin/bash
# FILENAME: monitored_job.sh

module load monitor

# track CPU load
monitor cpu percent >cpu-percent.log &
CPU_PID=$!

# track GPU load if any
monitor gpu percent >gpu-percent.log &
GPU_PID=$!

# your code here

# shut down the resource monitors
kill -s INT $CPU_PID $GPU_PID

A particularly elegant solution would be to include such tools in your prologue script and have the tear down in your epilogue script.

For large distributed jobs spread across multiple nodes, mpiexec can be used to gather telemetry from all nodes in the job. The hostname is included in each line of output so that data can be grouped as such. A concise way of constructing the needed list of hostnames in SLURM is to simply use srun hostname | sort -u.

#!/bin/bash
# FILENAME: monitored_job.sh

module load monitor

# track all CPUs (one monitor per host)
mpiexec -machinefile <(srun hostname | sort -u) \
    monitor cpu percent --all-cores >cpu-percent.log &
CPU_PID=$!

# track all GPUs if any (one monitor per host)
mpiexec -machinefile <(srun hostname | sort -u) \
    monitor gpu percent >gpu-percent.log &
GPU_PID=$!

# your code here

# shut down the resource monitors
kill -s INT $CPU_PID $GPU_PID

To get resource data in a more readily computable format, the monitor program can be told to output in CSV format with the --csv flag.

monitor cpu memory --csv >cpu-memory.csv

Or for GPU

monitor gpu memory --csv >gpu-memory.csv

For a distributed job you will need to suppress the header lines otherwise one will be created by each host.

monitor cpu memory --csv | head -1 >cpu-memory.csv
mpiexec -machinefile <(srun hostname | sort -u) \
    monitor cpu memory --csv --no-header >>cpu-memory.csv

Or for GPU

monitor gpu memory --csv | head -1 >gpu-memory.csv
mpiexec -machinefile <(srun hostname | sort -u) \
    monitor gpu memory --csv --no-header >>gpu-memory.csv