Distributed Deep Learning with Horovod

Link to section 'What is Horovod?' of 'Distributed Deep Learning with Horovod' What is Horovod?

Horovod is a framework originally developed by Uber for distributed deep learning. While a traditionally laborious process, Horovod makes it easy to scale up training scripts from single GPU to multi-GPU processes with minimal code changes. Horovod enables quick experimentation while also ensuring efficient scaling, making it an attractive choice for multi-GPU work.

Link to section 'Installing Horovod' of 'Distributed Deep Learning with Horovod' Installing Horovod

Before continuing, ensure you have loaded the following modules by running:

ml modtree/gpu
ml learning

Next, load the module for the machine learning framework you are using. Examples for tensorflow and pytorch are below:

ml ml-toolkit-gpu/tensorflow
ml ml-toolkit-gpu/pytorch

Create or activate the environment you want Horovod to be installed in then install the following dependencies:

pip install pyparsing
pip install filelock

Finally, install Horovod. The following command will install Horovod with support for both Tensorflow and Pytorch, but if you do not need both simply remove the HOROVOD_WITH_...=1 part of the command.

HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_TORCH=1 pip install horovod[all-frameworks]

Link to section 'Submitting Jobs' of 'Distributed Deep Learning with Horovod' Submitting Jobs

It is highly recommended that you run Horovod within batch jobs instead of interactive jobs. For information about how to format a submission file and submit a batch job, please reference Batch Jobs. Ensure you load the modules listed above as well as your environment in the submission script.

Finally, this line will actually launch your Horovod script inside your job. You will need to limit the number of processes to the number of GPUs you requested.

horovodrun -np {number_of_gpus} python {path/to/training/script.py}

An example usage of this is as follows for 4 GPUs and a file called horovod_mnist.py:

horovodrun -np 4 python horovod_mnist.py

Link to section 'Writing Horovod Code' of 'Distributed Deep Learning with Horovod' Writing Horovod Code

It is relatively easy to incorporate Horovod into existing training scripts. The main additional elements you need to incorporate are listed below (syntax for use with pytorch), but much more information, including syntax for other frameworks, can be found on the Horovod website.

#import required horovod framework -- e.g. for pytorch:
import horovod.torch as hvd

# Initialize Horovod
hvd.init()

# Pin to a GPU
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

#Split dataset among workers
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())

#Build Model

#Wrap optimizer with Horovod DistributedOptimizer
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

#Broadcast initial variable states from first worker to all others
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

#Train model