Distributed Runs

With Spell you can easily distribute your code to run across a number of machines using our first-class Horovod integration. Horovod is a distributed training framework that comes pre-installed in distributed runs and can be used to distribute an existing TensorFlow, Keras, PyTorch, or MXNet model across multiple GPUs, which can drastically reduce training time.
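
To give a sense of what the training script itself looks like, here is a minimal sketch of adapting an existing tf.keras model for Horovod. It is not Spell-specific, and build_model() and dataset are placeholders for your own model definition and training data:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # each launched process initializes Horovod

# Pin each process to its own local GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = build_model()  # placeholder: your existing model definition

# Scale the learning rate by the number of workers and wrap the optimizer
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt,
              metrics=['accuracy'])

model.fit(
    dataset,  # placeholder: your training data
    epochs=10,
    # keep all workers' weights in sync at the start of training
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    # print progress only on rank 0 to avoid duplicated logs
    verbose=1 if hvd.rank() == 0 else 0,
)

Scaling the learning rate by hvd.size() follows the convention used in Horovod's own examples, since the effective batch size grows with the number of workers.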

Open MPI is used to coordinate and communicate across the parallel processes. Additionally, if you are using a machine type with GPUs, the NVIDIA Collective Communications Library (NCCL) is used to accelerate inter-GPU communication and data transfer.

Note

Distributed runs are only available to Teams plan subscribers.

Creating a distributed run

The spell run command has a --distributed N option that you can specify to distribute your run across N machines. N must be an integer greater than or equal to 1. For example, the following command runs <command> on two separate K80 machines simultaneously:

$ spell run --machine-type K80 --distributed 2 <command>

Spell automatically infers the number of GPUs per machine type, if applicable, and will spin up one process per GPU per machine. For example, the following command will spin up 8 processes on each of 4 separate machines to utilize all 32 GPUs:

$ spell run --machine-type K80x8 --distributed 4 <command>

The --distributed 1 option can also be specified to leverage Open MPI and NCCL to start one process per GPU on a single multi-GPU machine.
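
If you want to verify how processes map to machines and GPUs, you can pass a small sanity-check script like the sketch below as your <command>; with --machine-type K80x8 --distributed 4 you would expect hvd.size() to report 32 and hvd.local_size() to report 8 on every machine:

import horovod.tensorflow as hvd  # horovod.torch and horovod.keras expose the same calls

hvd.init()
print(f"global rank {hvd.rank()} of {hvd.size()}, "
      f"local rank {hvd.local_rank()} of {hvd.local_size()} on this machine")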

You can try this feature out using the MNIST example in the Horovod repository:

$ git clone https://github.com/horovod/horovod.git
$ cd horovod/examples
$ spell run \
    --machine-type K80 --distributed 2 \
    'python keras_mnist.py'

You should then see the Keras logs for MNIST training on two separate GPU machines.

Using the Horovod timeline

We automatically enable the Horovod timeline feature when you're doing a distributed run, which can help you analyze the performance of your training. The timeline is written to the run outputs as horovod_timeline.json; you can open that file in Chrome at chrome://tracing.
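
If you would rather skim the timeline programmatically, the rough sketch below assumes horovod_timeline.json follows the Chrome trace event format (a JSON array of event objects) and tolerates a missing closing bracket, which that format permits:

import json
from collections import Counter

with open("horovod_timeline.json") as f:
    text = f.read().strip()
if not text.endswith("]"):
    text = text.rstrip(",") + "]"  # close an array left open by a streaming writer

events = json.loads(text)
# count "begin" ("ph": "B") events by name to see which operations dominate
ops = Counter(e.get("name", "?") for e in events if e.get("ph") == "B")
print(ops.most_common(10))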

Screenshot of Horovod timeline tracing in action.