NGC GPU container job in GPU queue

What is NGC?

NVIDIA GPU Cloud (NGC) is a GPU-accelerated cloud platform optimized for deep learning and scientific computing. NGC offers a comprehensive catalog of GPU-accelerated containers that are built to run quickly and reliably in high-performance computing environments. The Anvil team deployed NGC to extend the cluster's capabilities and to make this GPU-optimized software readily available. By using Singularity and NGC, users can focus on building lean models, producing optimal solutions, and gathering faster insights. For more information, please visit https://www.nvidia.com/en-us/gpu-cloud and the NGC software catalog.

Getting Started

Users can download containers from the NGC software catalog and run them directly using Singularity instructions from the corresponding container’s catalog page.
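As a minimal sketch of that workflow, the script below pulls an image from the NGC registry and runs a command inside it with GPU support. The image name and tag (`nvcr.io/nvidia/pytorch`, `23.10-py3`) and the `DRY_RUN` switch are illustrative assumptions, not values from this guide; copy the exact pull command from the container's catalog page. With `DRY_RUN=1` (the default here) the commands are only printed, not executed.

```shell
#!/bin/bash
# Sketch: pull an NGC container and run it with Singularity.
# NGC_IMAGE and NGC_TAG are hypothetical examples; take the real
# values from the container's page in the NGC catalog.
NGC_IMAGE="${NGC_IMAGE:-nvcr.io/nvidia/pytorch}"
NGC_TAG="${NGC_TAG:-23.10-py3}"
SIF_FILE="$(basename "$NGC_IMAGE")_${NGC_TAG}.sif"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Download the image and convert it to a local SIF file.
run singularity pull "$SIF_FILE" "docker://${NGC_IMAGE}:${NGC_TAG}"

# --nv exposes the host NVIDIA driver and GPUs inside the container.
run singularity exec --nv "$SIF_FILE" nvidia-smi
```

Setting `DRY_RUN=0` runs the commands for real; doing so inside a GPU job rather than on a login node is probably the safer choice, since `nvidia-smi` needs a visible GPU.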

In addition, a subset of pre-downloaded NGC containers, wrapped into convenient software modules, is provided. These modules hide the underlying complexity and provide the same commands that are expected from non-containerized versions of each application.

On Anvil, run the commands below to list the NGC containers that have been deployed.

$ module load modtree/gpu
$ module load ngc 
$ module avail
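Because the module list can be long, it may help to filter it. The sketch below assumes the cluster uses Lmod, which writes `module avail` output to standard error (hence the `2>&1`); the helper name `find_ngc_module` and the search term `pytorch` are hypothetical.

```shell
#!/bin/bash
# Sketch: filter the module list for a keyword (case-insensitive).
# find_ngc_module is a hypothetical helper, not an Anvil command.
find_ngc_module() {
    local keyword="$1"
    if ! type module >/dev/null 2>&1; then
        echo "module command not available in this shell" >&2
        return 1
    fi
    # Assumes Lmod, which prints 'module avail' output to stderr.
    module avail 2>&1 | grep -i -- "$keyword"
}

# On Anvil, after 'module load modtree/gpu' and 'module load ngc':
find_ngc_module pytorch || echo "no matching module found"
```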

Once the ngc module is loaded, you can run your code as you would with non-containerized applications. The following example job script shows how to use Slurm to submit a job that runs a containerized NGC program.

#!/bin/bash
# FILENAME:  myjobsubmissionfile

#SBATCH -A myallocation       # allocation name 
#SBATCH --nodes=1             # Total # of nodes 
#SBATCH --ntasks-per-node=1   # Number of MPI ranks per node (one rank per GPU)
#SBATCH --gres=gpu:1          # Number of GPUs per node
#SBATCH --time=1:30:00        # Total run time limit (hh:mm:ss)
#SBATCH -J myjobname          # Job name
#SBATCH -o myjob.o%j          # Name of stdout output file
#SBATCH -e myjob.e%j          # Name of stderr error file
#SBATCH -p gpu                # Queue (partition) name
#SBATCH --mail-user=useremailaddress
#SBATCH --mail-type=all       # Send email to above address on all job events (including begin, end, and fail)

# Manage processing environment, load compilers, container, and applications.
module purge
module load modtree/gpu
module load ngc
module load applicationname
module list

# Launch GPU code
myexecutablefiles
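To submit the script above and follow the job, something like the following could be used. `sbatch --parsable` and `squeue` are standard Slurm commands, and the output file names follow the `-o` and `-e` directives in the script; the helper `submit_job` is hypothetical.

```shell
#!/bin/bash
# Sketch: submit the job script and report where output will land.
# submit_job is a hypothetical helper; sbatch and squeue are Slurm commands.
submit_job() {
    local script="$1"
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found; run this on a cluster login node" >&2
        return 1
    fi
    # --parsable prints the job ID, optionally followed by ";cluster".
    local jobid
    jobid="$(sbatch --parsable "$script")" || return 1
    jobid="${jobid%%;*}"
    echo "Submitted job $jobid"
    # Show the job's current state in the queue.
    squeue -j "$jobid"
    # Matches the '-o myjob.o%j' and '-e myjob.e%j' directives above.
    echo "stdout: myjob.o${jobid}  stderr: myjob.e${jobid}"
}

submit_job myjobsubmissionfile || echo "job was not submitted"
```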