ROCm Containers Collection

What is ROCm Containers?

The AMD Infinity Hub contains a collection of advanced AMD GPU software containers and deployment guides for HPC, AI, and machine learning applications, enabling researchers to speed up their time to science. Containerized applications run quickly and reliably in the high-performance computing environment with full support for AMD GPUs. A collection of Infinity Hub tools has been deployed to extend cluster capabilities, enable powerful software, and deliver results faster. By using Singularity and Infinity Hub ROCm-enabled containers, users can focus on building lean models, producing optimal solutions, and gathering insights faster. For more information, please visit the AMD Infinity Hub.

Getting Started

Users can download ROCm containers from the AMD Infinity Hub and run them directly with Singularity, following the instructions on the corresponding container’s catalog page.

In addition, a subset of pre-downloaded ROCm containers wrapped into convenient software modules is provided. These modules hide the underlying complexity and provide the same commands that are expected from non-containerized versions of each application.

On Bell, type the commands below to see the list of ROCm containers we have deployed.

module load rocmcontainers
module avail

------------ ROCm-based application container modules for AMD GPUs -------------
   cp2k/20210311--h87ec1599
   deepspeed/rocm4.2_ubuntu18.04_py3.6_pytorch_1.8.1
   gromacs/2020.3                                    (D)
   namd/2.15a2
   openmm/7.4.2
   pytorch/1.8.1-rocm4.2-ubuntu18.04-py3.6
   pytorch/1.9.0-rocm4.2-ubuntu18.04-py3.6           (D)
   specfem3d/20201122--h9c0626d1
   specfem3d_globe/20210322--h1ee10977
   tensorflow/2.5-rocm4.2-dev
[....]

Some of these modules use the container's built-in MPI libraries and may require unloading the site MPI module first (module unload openmpi). Otherwise, you may see an error message such as "Cannot load module because these module(s) are loaded: openmpi".
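
In that case, a minimal sketch of the load sequence looks like this (the application module shown is just an illustrative pick from the list above; use whichever module reported the conflict):

module unload openmpi       # remove the conflicting site MPI module
module load rocmcontainers
module load namd/2.15a2     # illustrative choice of a module that uses the container's built-in MPI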

Examples of running ROCm-based containers on AMD GPUs

The examples below show how to run some containerized applications using rocmcontainers modules. In all cases, the general workflow follows the same pattern: load the rocmcontainers module, load the specific application's module, and run the application as if it were built natively. Additional information can be found in the module help output and on each application's AMD Infinity Hub page.
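
For instance, a minimal sketch of this pattern using one of the modules listed above (the script name my_training_script.py is a hypothetical placeholder for your own code):

module load rocmcontainers
module load pytorch/1.9.0-rocm4.2-ubuntu18.04-py3.6   # any application module from the list above
python my_training_script.py                          # hypothetical user script, run as if PyTorch were installed natively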

TensorFlow

This example demonstrates how to run TensorFlow on AMD GPUs with rocmcontainers modules.

First, prepare the matrix multiplication example from the TensorFlow documentation:

# filename: matrixmult.py
import tensorflow as tf

# Report the number of GPUs visible to TensorFlow and log device placement
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
tf.debugging.set_log_device_placement(True)

# Create some tensors and multiply them
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(c)

Submit a Slurm job, making sure to request a GPU-enabled queue and the desired number of GPUs. For illustration purposes, the following example shows an interactive job submission asking for one node (128 cores) in the "gpu" account and two GPUs for 6 hours; the same applies to your production batch jobs as well (a batch script sketch is shown after the example below):

sinteractive -A gpu -N 1 -n 128 -t 6:00:00 --gres=gpu:2
salloc: Granted job allocation 5401130
salloc: Waiting for resource configuration
salloc: Nodes bell-g000 are ready for job

Inside the job, load necessary modules:

module load rocmcontainers
module load tensorflow/2.5-rocm4.2-dev

And run the application as usual:

python matrixmult.py
Num GPUs Available:  2
[...]
2021-09-02 21:07:34.087607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 32252 MB memory) -> physical GPU (device: 0, name: Vega 20, pci bus id: 0000:83:00.0)
[...]
2021-09-02 21:07:36.265167: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
2021-09-02 21:07:36.266755: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocblas.so
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
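
The same workflow can also be submitted as a regular batch job. Below is a minimal sketch of such a job script, mirroring the resource request from the interactive example above (the job name and script file name are arbitrary placeholders):

#!/bin/bash
#SBATCH -A gpu              # GPU-enabled account, as in the interactive example
#SBATCH -N 1                # one node
#SBATCH -n 128              # 128 cores
#SBATCH -t 6:00:00          # 6-hour walltime
#SBATCH --gres=gpu:2        # two GPUs
#SBATCH -J matrixmult       # arbitrary job name (placeholder)

module load rocmcontainers
module load tensorflow/2.5-rocm4.2-dev

python matrixmult.py

Save it, for example, as matrixmult.sub and submit it with sbatch matrixmult.sub.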

For more information, see the application’s AMD Infinity Hub page. For applications deployed as modules, use the module help command to get a direct link to the relevant page (e.g. module help tensorflow/2.5-rocm4.2-dev in the above example).
