Machine Learning

We support several common machine learning (ML) frameworks on the community clusters through pre-installed modules. This collection of pre-installed ML modules is referred to as ML-Toolkit throughout this documentation. The following libraries are currently included in ML-Toolkit.

caffe           cntk            gym            keras
mxnet           opencv          pytorch
tensorflow      tflearn         theano

Managing dependencies for ML applications can be non-trivial, so we recommend that users start with ML-Toolkit. If a custom installation is still required after trying ML-Toolkit, read the documentation for the package carefully.

ML-Toolkit

A set of pre-installed popular machine learning (ML) libraries, called ML-Toolkit, is maintained on Scholar. These are Anaconda/Python-based distributions of the respective libraries. Currently, applications are supported for Python 2 and 3. Detailed instructions for searching and using the installed ML applications are presented below.

Instructions for using ML-Toolkit Modules

Find and Use Installed ML Packages

To search or load a machine learning application, you must first load one of the learning modules. The learning module loads the prerequisites (such as anaconda and cudnn) and makes ML applications visible to the user.

Step 1. Find and load a preferred learning module. Several learning modules may be available, each corresponding to a specific Python version and to whether the ML applications have GPU support. Running module load learning without specifying a version loads the module with the most recent Python version. To see all available modules, run module spider learning, then load the desired module.

Step 2. Find and load the desired machine learning libraries

ML packages are installed under the common application name ml-toolkit-X, where X can be cpu or gpu.

You can use the module spider ml-toolkit command to see all options and versions of each library.

Load the desired modules using the module load command. Both CPU and GPU options may exist for many libraries, so be sure to load the correct one. For example, to load the most recent version of PyTorch for CPU, run module load ml-toolkit-cpu/pytorch

caffe          cntk          gym          keras          mxnet 
opencv         pytorch       tensorflow   tflearn        theano

Step 3. You can list which ML applications are loaded in your environment using the command module list
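The same check can be made from inside a Python script. The sketch below assumes that the module system exports the loaded modules in the LOADEDMODULES environment variable as a colon-separated list, as Lmod and Environment Modules typically do. The function name and the module versions in the usage note are hypothetical.

```python
import os


def loaded_ml_toolkit_modules(loadedmodules=None):
    """Return the loaded modules whose names start with 'ml-toolkit-'.

    Assumption: LOADEDMODULES holds a colon-separated list of module names.
    """
    if loadedmodules is None:
        loadedmodules = os.environ.get("LOADEDMODULES", "")
    return [name for name in loadedmodules.split(":")
            if name.startswith("ml-toolkit-")]


if __name__ == "__main__":
    modules = loaded_ml_toolkit_modules()
    if modules:
        print("Loaded ML-Toolkit modules:")
        for name in modules:
            print("  " + name)
    else:
        print("No ml-toolkit modules appear to be loaded.")
```

For example, calling loaded_ml_toolkit_modules("anaconda/x:ml-toolkit-cpu/pytorch/y") returns only the ml-toolkit-cpu entry.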

Verify application import

Step 4. The next step is to check that you can actually use the desired ML application. You can do this by running the import command in Python. The example below tests if PyTorch has been loaded correctly.

python -c "import torch; print(torch.__version__)"

If the import operation succeeded, then you can run your own ML code. Some ML applications (such as tensorflow) print diagnostic warnings while loading -- this is the expected behavior.

If the import fails with an error, please see the troubleshooting information below.
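When several applications are loaded, it can be convenient to test all of the imports in one pass. The following is a minimal sketch using only the standard library. The function name is hypothetical, and the import names torch and cv2 are the usual Python module names for PyTorch and OpenCV; adjust the list to the applications you loaded.

```python
import importlib


def check_imports(names):
    """Try to import each module and return {name: version or None}.

    A value of None means the import failed; "unknown" means the module
    imported but does not expose a __version__ attribute.
    """
    results = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception as exc:  # ImportError or an error raised while loading
            print("{}: import failed ({})".format(name, exc))
            results[name] = None
        else:
            results[name] = getattr(module, "__version__", "unknown")
    return results


if __name__ == "__main__":
    for name, version in check_imports(["torch", "cv2"]).items():
        if version is not None:
            print("{}: OK (version {})".format(name, version))
```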

Step 5. To load a different set of applications, unload the previously loaded applications and then load the new ones. The example below loads TensorFlow and Keras in place of PyTorch and OpenCV.

module unload ml-toolkit-cpu/opencv
module unload ml-toolkit-cpu/pytorch
module load ml-toolkit-cpu/tensorflow
module load ml-toolkit-cpu/keras

Troubleshooting

ML applications depend on a wide range of Python packages, and mixing multiple versions of these packages can lead to errors. The following guidelines will help you identify the cause of a problem.

Check that you are using the correct version of Python with the command python --version. This should match the Python version in the loaded anaconda module.
Start from a clean environment. Either start a new terminal session or unload all the modules using module purge. Then load the desired modules following Steps 1-2.
Verify that PYTHONPATH does not point to undesired packages. Run the following command to print PYTHONPATH: echo $PYTHONPATH. Make sure that your Python environment is clean. Watch out for any locally installed packages that might conflict.
If you don't see GPU devices in your code, make sure that you are using the ml-toolkit-gpu/ modules and not using their cpu versions.
ML applications often depend on specific versions of the Cuda and CuDNN libraries. Make sure that you have loaded the required versions using the command: module list
Note that Caffe has a conflicting version of PyQt5. So, if you want to use Spyder (or any GUI application that uses PyQt), then you should unload the caffe module.
Use Google search to your advantage. Paste the exact error message into Google and check the probable causes.
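Several of the checks above (Python version, PYTHONPATH, locally installed packages) can be gathered in one place. The sketch below uses only the standard library; the function name and the layout of the report are illustrative choices, not part of ML-Toolkit.

```python
import os
import site
import sys


def environment_report():
    """Collect the interpreter details that most often explain import errors."""
    pythonpath = [entry
                  for entry in os.environ.get("PYTHONPATH", "").split(os.pathsep)
                  if entry]
    user_site = site.getusersitepackages()
    return {
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        "executable": sys.executable,
        "pythonpath": pythonpath,
        # Packages installed with "pip install --user" live here and can
        # shadow the versions provided by the modules.
        "user_site": user_site,
        "user_site_exists": os.path.isdir(user_site),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print("{}: {}".format(key, value))
```

If executable does not point into the loaded anaconda module, or user_site_exists is True and that directory contains ML packages, those are good first places to look.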

More examples showing how to use ML-Toolkit modules in a batch job are presented in the ML Batch Jobs guide.

Installation of Custom ML Libraries

While we try to include as many common ML frameworks and versions as we can in ML-Toolkit, we recognize that there are situations in which a custom installation may be preferable. We recommend using conda-env-mod to install and manage Python packages. Please follow the steps carefully; otherwise, you may end up with a faulty installation. The example below shows how to install TensorFlow in your home directory.

Install

Step 1: Unload all modules and start with a clean environment.

module purge

Step 2: Load the anaconda module with the desired Python version.

module load anaconda

Step 2A: If the ML application requires Cuda and CuDNN, load the appropriate modules. Be sure to check that the versions you load are compatible with the desired ML package.

module load cuda
module load cudnn

Many machine-learning packages (including PyTorch and TensorFlow) now provide installation pathways that include the full cudatoolkit within the environment, making it unnecessary to load these modules.

Step 3: Create a custom anaconda environment. Make sure the Python version of the environment matches the Python version in the anaconda module.

conda-env-mod create -n env_name_here

Step 4: Activate the anaconda environment by loading the modules displayed at the end of step 3.

module load use.own
module load conda-env/env_name_here-py3.6.4

Step 5: Now install the desired ML application. You can install multiple Python packages at this step using either conda or pip.

pip install --ignore-installed tensorflow==2.6

If the installation succeeded, you can now proceed to testing and using the installed application. You must load the environment you created as well as any supporting modules (e.g., anaconda) whenever you want to use this installation. If your installation did not succeed, please refer to the troubleshooting section below as well as documentation for the desired package you are installing.

Note that loading the modules generated by conda-env-mod behaves differently from running conda create -n env_name_here followed by source activate env_name_here. After running source activate, you may not be able to access any Python packages in the anaconda or ML-Toolkit modules. Therefore, conda-env-mod is the preferred way of using your custom installations.

Testing the Installation

Verify the installation by using a simple import statement, like that listed below for TensorFlow:
```
python -c "import tensorflow as tf; print(tf.__version__);"
```
Note that a successful import of TensorFlow will print a variety of system and hardware information. This is expected.

If importing the package leads to errors, verify that all dependencies of the package are installed in the correct versions. Dependency issues between Python packages are the most common cause of errors. For example, with TensorFlow, conflicts with the h5py or numpy versions are common, and upgrading those packages typically solves the problem. Managing dependencies for ML libraries can be non-trivial.
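To see which versions are actually installed in the active environment, the package metadata can be queried without importing the packages themselves. This is a minimal sketch that assumes Python 3.8 or newer, where importlib.metadata is part of the standard library. The function name is hypothetical, and the package list simply reflects the examples named above.

```python
from importlib import metadata  # standard library in Python 3.8 and newer


def installed_versions(distributions):
    """Return {distribution name: installed version, or None if not installed}."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(["tensorflow", "numpy", "h5py"]).items():
        print("{}: {}".format(name, version if version else "not installed"))
```

Comparing this output against the version requirements in the package documentation usually shows which dependency needs to be upgraded or pinned.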

Next, we can test our TensorFlow installation with a GPU run. For this we use the matrix multiplication example from the TensorFlow documentation.

# filename: matrixmult.py
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
tf.debugging.set_log_device_placement(True)

# Place tensors on the CPU
with tf.device('/CPU:0'):
  a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
  b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Run on the GPU
c = tf.matmul(a, b)
print(c)

Run the example
```
$ python matrixmult.py
```

This will produce an output like:

Num GPUs Available:  3
2022-07-25 10:33:23.358919: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-25 10:33:26.223459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22183 MB memory:  -> device: 0, name: NVIDIA A30, pci bus id: 0000:3b:00.0, compute capability: 8.0
2022-07-25 10:33:26.225495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 22183 MB memory:  -> device: 1, name: NVIDIA A30, pci bus id: 0000:af:00.0, compute capability: 8.0
2022-07-25 10:33:26.228514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 22183 MB memory:  -> device: 2, name: NVIDIA A30, pci bus id: 0000:d8:00.0, compute capability: 8.0
2022-07-25 10:33:26.933709: I tensorflow/core/common_runtime/eager/execute.cc:1323] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
2022-07-25 10:33:28.181855: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
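The numerical result can be checked independently of TensorFlow. The plain-Python sketch below multiplies the same two matrices; the helper function matmul is written here only for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    # zip(*b) iterates over the columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# First entry: 1*1 + 2*3 + 3*5 = 22
print(matmul(a, b))  # [[22.0, 28.0], [49.0, 64.0]]
```

The values match the tensor printed by the GPU run, which confirms that the computation itself is correct regardless of the device it ran on.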

For more details, please refer to the TensorFlow User Guide.

Troubleshooting

In most situations, dependencies among Python modules lead to errors. If you cannot use a Python package after installing it, please follow the steps below to find a workaround.

Unload all the modules.
```
module purge
```
Clean up PYTHONPATH.
```
unset PYTHONPATH
```

Next load the modules, e.g., anaconda and your custom environment.

module load anaconda
module load use.own
module load conda-env/env_name_here-py3.6.4

For GPU-enabled applications, you may also need to load the corresponding cuda/ and cudnn/ modules.
Now try running your code again.
A few applications only run on specific versions of Python (e.g. Python 3.6). Please check the documentation of your application if that is the case.
If you have installed a newer version of an ml-toolkit package (e.g., a newer version of PyTorch or Tensorflow), make sure that the ml-toolkit modules are NOT loaded. In general, we recommend that you don't mix ml-toolkit modules with your custom installations.
GPU-enabled ML applications often have dependencies on specific versions of Cuda and CuDNN. For example, Tensorflow version 1.5.0 and higher needs Cuda 9. Please check the application documentation about such dependencies.
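Before investigating Cuda and CuDNN versions, it helps to confirm that the node exposes any GPU to your session at all. The sketch below uses only the standard library and assumes that the NVIDIA driver utility nvidia-smi is on the PATH of GPU nodes; the function name and the layout of the returned dictionary are hypothetical.

```python
import os
import shutil
import subprocess


def gpu_visibility():
    """Report what the current session can see of the node's GPUs."""
    info = {
        # Usually set by the scheduler inside a GPU job; None means unset.
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "nvidia_smi_path": shutil.which("nvidia-smi"),
        "gpus": [],
    }
    if info["nvidia_smi_path"] is None:
        return info
    try:
        # "nvidia-smi -L" prints one line per GPU.
        result = subprocess.run(
            [info["nvidia_smi_path"], "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return info
    if result.returncode == 0:
        info["gpus"] = [line for line in result.stdout.splitlines() if line.strip()]
    return info


if __name__ == "__main__":
    for key, value in gpu_visibility().items():
        print("{}: {}".format(key, value))
```

An empty gpus list on a login node is normal; run the check inside a job that requested a GPU.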

Tensorboard

You can visualize data from a Tensorflow session using Tensorboard. For this, you need to save your session summary as described in the Tensorboard User Guide.
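As a rough illustration of what saving summary data can look like, the sketch below writes a few scalar values with the tf.summary API. It assumes a TensorFlow 2.x installation; the function name, the demo_loss tag, and the temporary log directory are hypothetical, and the Tensorboard User Guide remains the reference for real training code.

```python
import tempfile


def write_demo_summaries(logdir, steps=5):
    """Write a few scalar values that Tensorboard can plot.

    Returns True if summaries were written, False if TensorFlow is missing.
    """
    try:
        import tensorflow as tf  # assumption: TensorFlow 2.x
    except ImportError:
        print("TensorFlow is not importable in this environment.")
        return False

    writer = tf.summary.create_file_writer(logdir)
    with writer.as_default():
        for step in range(steps):
            # Hypothetical metric: a value that shrinks with every step.
            tf.summary.scalar("demo_loss", 1.0 / (step + 1), step=step)
    writer.flush()
    return True


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp(prefix="tb_demo_")
    if write_demo_summaries(demo_dir):
        print("Summaries written to", demo_dir)
```

The directory passed to the function is the one to give to the --logdir option when launching Tensorboard.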

Launch Tensorboard:

$ python -m tensorboard.main --logdir=/path/to/session/logs

When Tensorboard is launched successfully, it will give you the URL for accessing Tensorboard.


<... build related warnings ...> 
TensorBoard 0.4.0 at http://scholar-a000.rcac.purdue.edu:6006

Follow the printed URL to visualize your model.
Please note that, due to firewall rules, the Tensorboard URL may only be accessible from Scholar nodes. If you cannot access the URL directly, you can use the Firefox browser in ThinLinc.
For more details, please refer to the Tensorboard User Guide.

Running ML Code in a Batch Job

Batch jobs allow us to automate model training without human intervention. They are also useful when you need to run a large number of simulations on the clusters. In the examples below, we run a simple tensor_hello.py script in a batch job. We consider two situations: in the first example, we use the ML-Toolkit modules to run TensorFlow, while in the second, we use a custom installation of TensorFlow (see the Custom ML Packages page).
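The contents of tensor_hello.py are not shown in this guide. As a stand-in, a minimal sketch of such a script might look like the following. It assumes TensorFlow 2.x with eager execution, and the function name and greeting text are hypothetical.

```python
# filename: tensor_hello.py (hypothetical contents)


def hello_tensor():
    """Build a small tensor and describe the TensorFlow setup in one line."""
    try:
        import tensorflow as tf  # assumption: TensorFlow 2.x
    except ImportError:
        return "TensorFlow is not available; load a module or environment that provides it."

    greeting = tf.constant("Hello, TensorFlow!")
    gpus = tf.config.list_physical_devices("GPU")
    return "{} (TensorFlow {}, {} GPU(s) visible)".format(
        greeting.numpy().decode("utf-8"), tf.__version__, len(gpus))


if __name__ == "__main__":
    print(hello_tensor())
```

Reporting the number of visible GPUs makes the job output a quick confirmation that the GPU requested from the scheduler was actually usable.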

Using ML-Toolkit Modules

Save the following code as tensor_hello.sub in the same directory where tensor_hello.py is located.

#!/bin/bash
# filename: tensor_hello.sub
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1 
#SBATCH --time=00:05:00
#SBATCH -A scholar
#SBATCH -J hello_tensor

module purge
module load learning
module load ml-toolkit-gpu/tensorflow 
module list

python tensor_hello.py

Using a Custom Installation

Save the following code as tensor_hello.sub in the same directory where tensor_hello.py is located.

#!/bin/bash
# filename: tensor_hello.sub
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1 
#SBATCH --time=00:05:00
#SBATCH -A scholar
#SBATCH -J hello_tensor

module purge
module load anaconda
module load cuda
module load cudnn
module load use.own
module load conda-env/my_tf_env-py3.8.5 
module list

echo $PYTHONPATH

python tensor_hello.py

Running a Job

Now you can submit the batch job using the sbatch command.

sbatch tensor_hello.sub

Once the job finishes, you will find an output file (slurm-xxxxx.out).
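When many jobs have run in the same directory, locating the newest output file by hand gets tedious. The sketch below finds the most recently modified slurm-*.out file and prints its last lines. It assumes the default output file naming described above; both function names are hypothetical.

```python
import glob
import os


def latest_slurm_output(directory="."):
    """Return the most recently modified slurm-*.out file, or None."""
    candidates = glob.glob(os.path.join(directory, "slurm-*.out"))
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)


def tail(path, lines=20):
    """Return the last `lines` lines of a text file as a list."""
    with open(path, "r", errors="replace") as handle:
        return handle.readlines()[-lines:]


if __name__ == "__main__":
    path = latest_slurm_output()
    if path is None:
        print("No slurm-*.out files found in the current directory.")
    else:
        print("Last lines of {}:".format(path))
        print("".join(tail(path)), end="")
```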