tensorflow

Description

TensorFlow is an end-to-end open source platform for machine learning.

Versions

Bell: 2.5-rocm4.2-dev, 2.7-rocm5.0-dev
Negishi: 2.5-rocm4.2-dev, 2.7-rocm5.0-dev

Module

Load the modules with:

module load rocmcontainers
module load tensorflow

Example job

Using #!/bin/sh -l as the shebang in a Slurm job script causes some container modules to fail. Use #!/bin/bash instead.

To run TensorFlow on our clusters:

#!/bin/bash
#SBATCH -A gpu
#SBATCH -t 1:00:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8
#SBATCH --gpus-per-node=1
#SBATCH --job-name=tensorflow
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --error=%x-%J-%u.err
#SBATCH --output=%x-%J-%u.out

module --force purge
ml rocmcontainers tensorflow

# Run your TensorFlow script, for example the matrixmult.py prepared below
python matrixmult.py

This example demonstrates how to run TensorFlow on AMD GPUs with the rocmcontainers modules.

First, prepare the matrix multiplication example from the TensorFlow documentation:

# filename: matrixmult.py
import tensorflow as tf

# Report the GPUs visible to TensorFlow, then log which device runs each op
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
tf.debugging.set_log_device_placement(True)

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(c)
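
As a sanity check that does not depend on TensorFlow or a GPU, the product computed by matrixmult.py can be worked out in plain Python. This is only an illustrative sketch: the helper name matmul below is hypothetical and is not part of TensorFlow or the rocmcontainers modules.

```python
# Plain-Python check of the matrix product used in matrixmult.py.
# The helper name `matmul` is hypothetical; it is not TensorFlow's tf.matmul.

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]      # 2 x 3
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # 3 x 2


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    inner = len(y)
    assert all(len(row) == inner for row in x), "inner dimensions must match"
    return [
        [sum(x[i][k] * y[k][j] for k in range(inner)) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


c = matmul(a, b)
print(c)  # [[22.0, 28.0], [49.0, 64.0]]
```

The 2 x 2 result computed here is the value the TensorFlow script should print as well, regardless of which device runs the MatMul op.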

Submit a Slurm job, making sure to request a GPU-enabled queue and the desired number of GPUs. For illustration, the following example submits an interactive job that asks for one node (${resource.nodecores} cores) in the "gpu" account with two GPUs for 6 hours; the same applies to your production batch jobs:

sinteractive -A gpu -N 1 -n ${resource.nodecores} -t 6:00:00 --gres=gpu:2
salloc: Granted job allocation 5401130
salloc: Waiting for resource configuration
salloc: Nodes ${resource.hostname}-g000 are ready for job

Inside the job, load the necessary modules:

module load rocmcontainers
module load tensorflow/2.5-rocm4.2-dev

Then run the application as usual:

python matrixmult.py
Num GPUs Available:  2
[...]
2021-09-02 21:07:34.087607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 32252 MB memory) -> physical GPU (device: 0, name: Vega 20, pci bus id: 0000:83:00.0)
[...]
2021-09-02 21:07:36.265167: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
2021-09-02 21:07:36.266755: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocblas.so
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
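
In a batch job there is no terminal to watch, so it can help to scan the saved job log for the "Executing op ... in device ..." lines shown above and confirm that operations landed on a GPU. The following is a minimal sketch, assuming the log lines keep the format seen in this example output (it may differ between TensorFlow versions); the function names parse_placements and ops_not_on_gpu are hypothetical helpers, not part of TensorFlow.

```python
import re

# Matches device-placement lines in the format shown in the example output, e.g.
# "... Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0"
# Assumption: this format is taken from the sample log and may vary by version.
PLACEMENT_RE = re.compile(r"Executing op (\S+) in device (\S+)")


def parse_placements(log_text):
    """Return a list of (op_name, device) pairs found in log_text."""
    return PLACEMENT_RE.findall(log_text)


def ops_not_on_gpu(log_text):
    """Return the names of ops whose device string does not name a GPU."""
    return [
        op for op, device in parse_placements(log_text)
        if "device:GPU:" not in device
    ]


# One line copied from the example output above.
sample = (
    "2021-09-02 21:07:36.265167: I tensorflow/core/common_runtime/eager/"
    "execute.cc:733] Executing op MatMul in device "
    "/job:localhost/replica:0/task:0/device:GPU:0\n"
)

placements = parse_placements(sample)
print(placements)
print(ops_not_on_gpu(sample))
```

To use it on a real job, read the contents of the job's log file into a string and pass it to ops_not_on_gpu; an empty list means every logged op was placed on a GPU.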