Horovod
Tutorials and documentation
Getting started: Go to this link.
How to implement Horovod with TensorFlow.
Towards Data Science: Distributed Deep Learning with Horovod.
Some tutorials for Horovod are available here:
$ git clone https://github.com/horovod/tutorials
Run Horovod examples on a GPU cluster
The horovod Docker image comes with examples. Run
$ nvidia-docker run -it horovod/horovod
The examples directory comes with an example directory per backend
examples
├── adasum
├── elastic
├── keras
├── ...
├── tensorflow
└── tensorflow2
If you choose the tensorflow2 backend
$ cd tensorflow2
$ CUDA_VISIBLE_DEVICES=0,1,2,3 horovodrun -np 4 -H localhost:4 python tensorflow2_synthetic_benchmark.py
If the terminal flushes stddiag: Read -1, refer to this issue to remove the warning.
Understanding the horovodrun command
CUDA_VISIBLE_DEVICES="0,1"
This part allows you to pass certain gpus to the horovodrun command, in this case we are using the gpu
0 and 1.
horovodrun -np 2 -H localhost:2 python tensorflow2_synthetic_benchmark.py
This part explicitly calls horovodrun with 2 gpus in the localhost, this case is assuming that you are
working on only one machine.
Use horovod
In this section we will implement Horovod to a TensorFlow V2 code from this example.
Import horovod
import tensorflow as tf import horovod.tensorflow as hvd
Initialize horovod
hvd.init()
Pin your GPUS (given the
CUDA_VISIBLE_DEVICESgpus)gpus = tf.config.experimental.list_physical_devices('GPU') for gpu in gpus: tf.config.experimental.set_memory_growth(gpu, True) if gpus: tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
The for part of this code allows you to tell the program to use variable memory and not allocating the entire
GPU VRAM.
After setting your dataset and model
model = ... dataset = ... opt = tf.optimizers.Adam(0.001 * hvd.size()) loss = tf.losses.SparseCategoricalCrossentropy()
Set up the function
checkpoint_dir = './checkpoints' checkpoint = tf.train.Checkpoint(model=model, optimizer=opt)
Function
def training_step(images, labels, first_batch): with tf.GradientTape() as tape: probs = mnist_model(images, training = True) loss_value = loss(labels, probs) # Horovod: add Horovod Distributed GradientTape. tape = hvd.DistributedGradientTape(tape) grads = tape.gradient(loss_value, mnist_model.trainable_variables) opt.apply_gradients(zip(grads, mnist_model.trainable_variables)) # Horovod: broadcast initial variable states from rank 0 to all other processes. # This is necessary to ensure consistent initialization of all workers when # training is started with random weights or restored from a checkpoint. # # Note: broadcast should be done after the first gradient step to ensure optimizer # initialization. if first_batch: hvd.broadcast_variables(mnist_model.variables, root_rank = 0) hvd.broadcast_variables(opt.variables(), root_rank = 0) return loss_value
Run the code
# Horovod: adjust number of steps based on number of GPUs. for batch, (images, labels) in enumerate(dataset.take(10000 // hvd.size())): loss_value = training_step(images, labels, batch == 0) if batch % 10 == 0 and hvd.local_rank() == 0: print('Step #%d\tLoss: %.6f' % (batch, loss_value)) # Horovod: save checkpoints only on worker 0 to prevent other workers from # corrupting it. if hvd.rank() == 0: checkpoint.save(checkpoint_dir)
Observations
Every code would run on the CPU, but with this implementation it will run on the GPU(s).
Example: If we are working with a dataset of 60000 images, with 6 epochs and a batch size of 128. And therefore 469 number of iterations.
In this case, the dataset is finite so we can’t decide how many steps_per_epoch we want.
Infinite amount of data
If we would have an infinite amount of data:
And the quantity number of data that our model will take to train will be in this form.