Distributed Runs

Overview

With Spell you can easily distribute your code to run across a number of processes simultaneously (optionally distributed across multiple machines). Open MPI is used to facilitate coordination and communication across the parallel processes. Furthermore, if you are using a machine type with a GPU, then the NVIDIA Collective Communications Library (NCCL) is used to accelerate the inter-GPU communication and data transfer.

You might find the distributed run feature particularly useful for doing distributed deep learning. The horovod library comes pre-installed in distributed runs and can be used to easily distribute an existing TensorFlow, Keras, PyTorch, or MXNet model across multiple GPUs, which can drastically reduce training time.
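
The snippet below is a minimal sketch of the typical horovod pattern for a TensorFlow Keras model; it is not Spell-specific, and the toy model, dataset loading, and hyperparameters are illustrative assumptions:

    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    # Initialize Horovod; every launched process calls this.
    hvd.init()

    # Pin each process to one GPU, indexed by its local rank on the machine.
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    # A small placeholder model; any tf.keras model works the same way.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

    # Scale the learning rate by the number of workers and wrap the optimizer
    # so gradients are averaged across all processes.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

    # Broadcast rank 0's initial weights so every worker starts from the same
    # state, then train.
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    model.fit(x_train / 255.0, y_train,
              callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
              epochs=1)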

Note

Distributed runs are only available to Teams plan subscribers. If you're interested in upgrading to a Teams plan, contact us at support@spell.run.

Creating a distributed run

The spell run command has a --distributed N option that you can specify to distribute your run across N machines. N must be an integer greater than or equal to 1. For example, the following command will run <command> on two separate K80 machines simultaneously:

$ spell run --machine-type K80 --distributed 2 <command>

Spell automatically infers the number of GPUs per machine type, if applicable, and will spin up 1 process per GPU per machine. For example, the following command will spin up 8 processes on each of 4 separate machines, 32 processes in total, to utilize all 32 GPUs:

$ spell run --machine-type K80x8 --distributed 4 <command>

You can also specify --distributed 1 to leverage Open MPI and NCCL to start one process per GPU on a single machine that has multiple GPUs.
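
For example, assuming the same K80x8 machine type from above, the following command starts 8 processes, one per GPU, on a single machine:

$ spell run --machine-type K80x8 --distributed 1 <command>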

Example

In this example we will do a distributed run to train MNIST on two separate GPUs using the horovod library.

  1. First, you'll need to set up Spell in your AWS account. Next, navigate to "Clusters" in the sidebar and create a machine type that uses one of the GPU instance types. Substitute the name of this machine type for <machine_type_name> in the commands below.

  2. Clone the horovod repository:

        $ git clone https://github.com/horovod/horovod.git
        

  3. Create the distributed run to train MNIST:

        $ cd horovod/examples
        
        $ spell run -t <machine_type_name> --framework tensorflow --distributed 2 python keras_mnist.py
        

You should then see the Keras logs for MNIST training on two separate GPU machines.
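
Because every process runs the same script, it is common to restrict side effects such as checkpointing to a single worker. The following is a minimal sketch, assuming the same horovod setup as above (the checkpoint path is just an illustration):

    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Only rank 0 writes checkpoints so the workers do not overwrite each other.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    if hvd.rank() == 0:
        callbacks.append(tf.keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))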