Skip to main content

Link to section 'Installation of Custom ML Libraries' of 'Custom ML Packages' Installation of Custom ML Libraries

While we try to include as many common ML frameworks and versions as we can in ML-Toolkit, we recognize that there are also situations in which a custom installation may be preferable. We recommend using conda-env-mod to install and manage Python packages. Please follow the steps carefully, otherwise you may end up with a faulty installation. The example below shows how to install TensorFlow in your home directory.

Link to section 'Install' of 'Custom ML Packages' Install

Step 1: Unload all modules and start with a clean environment.

module purge

Step 2: Load the anaconda module with desired Python version.

module load anaconda

Step 2A: If the ML application requires Cuda and CuDNN, load the appropriate modules. Be sure to check that the versions you load are compatible with the desired ML package.

module load cuda
module load cudnn

Many machine-learning packages (including PyTorch and TensorFlow) now provide installation pathways that include the full cudatoolkit within the environment, making it unnecessary to load these modules.

Step 3: Create a custom anaconda environment. Make sure the python version matches the Python version in the anaconda module.

conda-env-mod create -n env_name_here

Step 4: Activate the anaconda environment by loading the modules displayed at the end of step 3.

module load use.own
module load conda-env/env_name_here-py3.8.5

Step 5: Now install the desired ML application. You can install multiple Python packages at this step using either conda or pip.

For TensorFlow (as of 2024) the recommended approach is to use pip (see tensorflow.org/install/gpu).
pip install --ignore-installed 'tensorflow[and-cuda]'
For PyTorch the recommended approach is to use conda (see pytorch.org).
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

 

If the installation succeeded, you can now proceed to testing and using the installed application. You must load the environment you created as well as any supporting modules (e.g., anaconda) whenever you want to use this installation. If your installation did not succeed, please refer to the troubleshooting section below as well as documentation for the desired package you are installing.

Note that loading the modules generated by conda-env-mod has different behavior than conda create env_name_here followed by source activate env_name_here. After running source activate, you may not be able to access any Python packages in anaconda or ml-toolkit modules. Therefore, using conda-env-mod is the preferred way of using your custom installations.

Link to section 'Testing the Installation' of 'Custom ML Packages' Testing the Installation

  • Verify the installation by using a simple import statement, like that listed below for TensorFlow:

    python -c "import tensorflow as tf; print(tf.__version__);"

    Note that a successful import of TensorFlow will print a variety of system and hardware information. This is expected.

    If importing the package leads to errors, be sure to verify that all dependencies for the package have been managed, and the correct versions installed. Dependency issues between python packages are the most common cause for errors. For example, in TF, conflicts with the h5py or numpy versions are common, but upgrading those packages typically solves the problem. Managing dependencies for ML libraries can be non-trivial.

  • Next, we can test using our installation of TensorFlow for a GPU run. For this we shall use the matrix multiplication example from Tensorflow documentation.

    # filename: matrixmult.py
    import tensorflow as tf
    print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
    tf.debugging.set_log_device_placement(True)
    
    # Place tensors on the CPU
    with tf.device('/CPU:0'):
      a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
      b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    
    # Run on the GPU
    c = tf.matmul(a, b)
    print(c)
    
  • Run the example

    $ python matrixmult.py
  • This will produce an output like:

    Num GPUs Available:  3
    2022-07-25 10:33:23.358919: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2022-07-25 10:33:26.223459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22183 MB memory:  -> device: 0, name: NVIDIA A30, pci bus id: 0000:3b:00.0, compute capability: 8.0
    2022-07-25 10:33:26.225495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 22183 MB memory:  -> device: 1, name: NVIDIA A30, pci bus id: 0000:af:00.0, compute capability: 8.0
    2022-07-25 10:33:26.228514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 22183 MB memory:  -> device: 2, name: NVIDIA A30, pci bus id: 0000:d8:00.0, compute capability: 8.0
    2022-07-25 10:33:26.933709: I tensorflow/core/common_runtime/eager/execute.cc:1323] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
    2022-07-25 10:33:28.181855: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
    tf.Tensor(
    [[22. 28.]
     [49. 64.]], shape=(2, 2), dtype=float32)
    
  • For more details, please refer to Tensorflow User Guide.

Link to section 'Troubleshooting' of 'Custom ML Packages' Troubleshooting

In most situations, dependencies among Python modules lead to errors. If you cannot use a Python package after installing it, please follow the steps below to find a workaround.

  • Unload all the modules.
    module purge
  • Clean up PYTHONPATH.
    unset PYTHONPATH
  • Next load the modules, e.g., anaconda and your custom environment.
    module load anaconda
    module load use.own
    module load conda-env/env_name_here-py3.8.5
  • For GPU-enabled applications, you may also need to load the corresponding cuda/ and cudnn/ modules.
  • Now try running your code again.
  • A few applications only run on specific versions of Python (e.g. Python 3.6). Please check the documentation of your application if that is the case.
  • If you have installed a newer version of an ml-toolkit package (e.g., a newer version of PyTorch or Tensorflow), make sure that the ml-toolkit modules are NOT loaded. In general, we recommend that you don't mix ml-toolkit modules with your custom installations.
  • GPU-enabled ML applications often have dependencies on specific versions of Cuda and CuDNN. For example, Tensorflow version 1.5.0 and higher needs Cuda 9. Please check the application documentation about such dependencies.

Link to section 'Tensorboard' of 'Custom ML Packages' Tensorboard

  • You can visualize data from a Tensorflow session using Tensorboard. For this, you need to save your session summary as described in the Tensorboard User Guide.
  • Launch Tensorboard:
    $ python -m tensorboard.main --logdir=/path/to/session/logs
  • When Tensorboard is launched successfully, it will give you the URL for accessing Tensorboard.
    
    <... build related warnings ...> 
    TensorBoard 0.4.0 at http://gilbreth-a000.rcac.purdue.edu:6006
    
  • Follow the printed URL to visualize your model.
  • Please note that due to firewall rules, the Tensorboard URL may only be accessible from Gilbreth nodes. If you cannot access the URL directly, you can use Firefox browser in Thinlinc.
  • For more details, please refer to the Tensorboard User Guide.
Helpful?

Thanks for letting us know.

Please don't include any personal information in your comment. Maximum character limit is 250.
Characters left: 250
Thanks for your feedback.