Running TensorFlow code in a batch job
Batch jobs let you train models without human intervention. They are also useful when you need to run a large number of simulations on the clusters. The example below runs the tensor_hello.py script in a batch job (see the TensorFlow guide for the code). Two cases are covered: the first uses the ML-Toolkit modules to run TensorFlow, and the second uses a custom TensorFlow installation.
Using ML-Toolkit modules
Save the following code as tensor_hello.sub in the same directory where tensor_hello.py is located.
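#!/bin/bash
# filename: tensor_hello.sub
# Request one node with 20 tasks and a five-minute limit, under the standby account, with the job name hello_tensor.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
#SBATCH --time=00:05:00
#SBATCH -A standby
#SBATCH -J hello_tensor

# Start from a clean environment, then load the ML-Toolkit TensorFlow modules.
module purge
module load learning/conda-5.1.0-py36-cpu
module load ml-toolkit-cpu/tensorflow
module list

# Run the script from the directory the job was submitted from.
python tensor_hello.py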
Using a custom TensorFlow installation
Save the following code as tensor_hello.sub in the same directory where tensor_hello.py is located.
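#!/bin/bash
# filename: tensor_hello.sub
# Request one node with 20 tasks and a five-minute limit, under the standby account, with the job name hello_tensor.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
#SBATCH --time=00:05:00
#SBATCH -A standby
#SBATCH -J hello_tensor

# Start from a clean environment, then load Anaconda and your own conda environment module.
module purge
module load anaconda/5.1.0-py36
module load use.own
module load conda-env/my_tf_env-py3.6.4
module list

# Print PYTHONPATH so the job output records which package paths were in effect.
echo "$PYTHONPATH"
python tensor_hello.py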
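Both variants assume that tensor_hello.sub and tensor_hello.py sit in the same directory. A quick pre-submission check can be sketched as follows; the helper name check_job_files is hypothetical and is not part of the cluster tooling.

```shell
# Hypothetical helper: confirm that the submission script and the Python
# script are both present in the given directory (default: current directory).
check_job_files() {
    dir="${1:-.}"
    for f in tensor_hello.sub tensor_hello.py; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            return 1
        fi
    done
    return 0
}
```

Running check_job_files with no argument in the job directory returns success only when both files exist, and otherwise reports the missing file on standard error.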
Now you can submit the batch job using the sbatch command.
sbatch tensor_hello.sub
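On success, sbatch normally prints a confirmation line of the form "Submitted batch job <id>". If you want to keep the job ID for later use, one possible sketch is shown here; the helper name job_id_from_sbatch is hypothetical.

```shell
# Hypothetical helper: read sbatch's confirmation line on standard input
# and print only the numeric job ID (the fourth field).
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Usage on the cluster (shown as comments because it needs a Slurm scheduler):
#   job_id=$(sbatch tensor_hello.sub | job_id_from_sbatch)
#   squeue -j "$job_id"                    # show the job while it is pending or running
#   echo "Output file: slurm-${job_id}.out"
```

The same ID appears in the name of the default output file, so saving it makes the output easy to locate once the job ends.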
Once the job finishes, you will find an output file named slurm-xxxxx.out in the directory you submitted from, where xxxxx is the job ID. If TensorFlow ran successfully, the output file will contain the message shown below.
Hello, TensorFlow!
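To check for that message without opening the file, a small helper along these lines may be enough; the function name check_tensor_hello is hypothetical, and the file name passed to it is whatever slurm-xxxxx.out your job produced.

```shell
# Hypothetical helper: succeed (exit status 0) if the given Slurm output
# file contains the expected greeting as a literal string.
check_tensor_hello() {
    grep -qF "Hello, TensorFlow!" "$1"
}

# Usage (replace the file name with your own job's output file):
#   check_tensor_hello slurm-xxxxx.out && echo "TensorFlow job succeeded"
```

The -F flag makes grep treat the message as plain text rather than a regular expression, and -q suppresses output so only the exit status is used.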