Cluster Machine Type Management

For users on the Teams plan, we deploy Spell in your cloud and provide the same cluster management tools backing our own internal infrastructure. This means you can keep your data in your own S3 buckets, perform runs on your own machines, and deploy models within your own cloud infrastructure.

Once you have set up an AWS or GCP cluster, you can define distinct machine types on which to execute runs. This guide explains how to create, modify, and delete machine types in your cluster.

Creating a New Machine Type

Machine Type Creation Dialog

  1. Under Clusters, click on your cluster, then click on "Add New Machine Type".
  2. Specify a name. This name will be referenced by the --machine_type parameter when you create runs.
  3. Select whether you want a machine with a graphics processing unit (GPU) or without one (CPU).
  4. Select from our list of preselected instance types. For details, consult the tables under Available Instance Types.
  5. Select the size of disk attached to your new machines. Note that disk usage by Spell-related images will depend on which framework images you assign to this machine type.
  6. Attach framework images. These framework images are cached on each machine and include all the dependencies required for a given machine learning or deep learning framework. Consult the table under Available Frameworks for details.
  7. Specify the minimum and maximum number of machines. Spell will attempt to keep at least Minimum Machines running and will spin up more machines to accommodate bursts in usage. The total number of machines for the machine type will never exceed Maximum Machines.
  8. Specify an idle timeout. This is the time, in minutes, that Spell waits before terminating any extra idle machines. A longer timeout can be helpful if you are manually waiting for a previous run to finish before you start a new one, because it spares you from waiting for a machine to warm up again.
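
The fields collected by the dialog can be summarized as a small record with a few sanity checks. The following is a minimal sketch, not Spell's own data model: the class name MachineTypeSpec, its field names, and the validation rules are all hypothetical and chosen only to mirror the steps above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MachineTypeSpec:
    """Hypothetical record mirroring the fields of the creation dialog."""
    name: str                  # referenced when you create runs
    gpu: bool                  # True for a GPU machine, False for CPU only
    instance_type: str         # e.g. "cpu-big" or "V100" from the instance tables
    disk_size_gb: int          # size of the disk attached to each machine
    frameworks: List[str]      # framework images cached on each machine
    min_machines: int
    max_machines: int
    idle_timeout_minutes: int

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the spec looks sane."""
        problems = []
        if not self.name:
            problems.append("name must not be empty")
        if self.disk_size_gb <= 0:
            problems.append("disk size must be positive")
        if self.min_machines < 0:
            problems.append("minimum machines must not be negative")
        if self.max_machines < self.min_machines:
            problems.append("maximum machines must be >= minimum machines")
        if self.idle_timeout_minutes < 0:
            problems.append("idle timeout must not be negative")
        return problems


spec = MachineTypeSpec(
    name="training-gpu", gpu=True, instance_type="V100", disk_size_gb=100,
    frameworks=["PyTorch"], min_machines=0, max_machines=4,
    idle_timeout_minutes=30,
)
print(spec.validate())
```

Checking that the maximum is at least the minimum before submitting the form avoids a configuration that could never satisfy both limits.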

Modifying an Existing Machine Type

Machine Type Edit Dialog

Once you have a machine type set up, you can modify its Minimum Machines and Maximum Machines values, as well as the Idle Timeout. Raise the limits if you are running many runs in parallel and need more machines, or lower them to terminate all existing machines when you know that you are not going to use them for some amount of time.
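
To see how the two limits interact, it can help to think of the machine count as demand clamped between the minimum and the maximum. The sketch below illustrates that idea only; the function target_machine_count is hypothetical, and Spell's actual scheduler (including how it applies the idle timeout) is not described in this guide.

```python
def target_machine_count(min_machines: int, max_machines: int,
                         machines_needed: int) -> int:
    """Clamp the number of machines needed between the configured limits.

    Hypothetical helper: idle timeouts and spin-up delays are not modeled.
    """
    if min_machines < 0 or max_machines < min_machines:
        raise ValueError("limits must satisfy 0 <= min_machines <= max_machines")
    return max(min_machines, min(machines_needed, max_machines))


# Ten parallel runs but a maximum of 4: the count is capped at 4.
print(target_machine_count(1, 4, 10))
# No demand but a minimum of 2: two machines stay up.
print(target_machine_count(2, 8, 0))
# Both limits set to 0: no machines are kept, whatever the demand.
print(target_machine_count(0, 0, 3))
```

Under this model, raising the maximum is what allows more parallel runs, and lowering both limits to 0 leaves no machines running.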

Deleting a Machine Type

Once you know that you no longer need a machine type, you can delete it. This will do the following:

  1. All runs queued on that machine type will be preemptively killed.
  2. All machines will be terminated.
  3. The machine type will be removed from your cluster dashboard, and you will not be able to schedule future runs on this machine type. However, you can always create a new machine type, even with the same name.
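
The three effects of deletion can be pictured with a small in-memory model. This is an illustrative sketch only: the MachineType class, the cluster dictionary, and delete_machine_type are hypothetical names and do not correspond to any Spell API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MachineType:
    """Hypothetical in-memory stand-in for a machine type."""
    name: str
    queued_runs: List[int] = field(default_factory=list)
    machines: List[str] = field(default_factory=list)


def delete_machine_type(cluster: Dict[str, MachineType], name: str) -> dict:
    """Apply the three deletion steps and report what was affected."""
    machine_type = cluster[name]
    # 1. Kill every run still queued on this machine type.
    killed = list(machine_type.queued_runs)
    machine_type.queued_runs.clear()
    # 2. Terminate all of its machines.
    terminated = list(machine_type.machines)
    machine_type.machines.clear()
    # 3. Remove it from the cluster so no new runs can be scheduled on it.
    del cluster[name]
    return {"killed_runs": killed, "terminated_machines": terminated}


cluster = {"gpu-v100": MachineType("gpu-v100", [101, 102], ["m-1", "m-2"])}
report = delete_machine_type(cluster, "gpu-v100")
print(report)

# The name is free again, so a new machine type can reuse it.
cluster["gpu-v100"] = MachineType("gpu-v100")
```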

Available Instance Types

CPUs

Instance Type   AWS           GCP
cpu             c5.large      n1-standard-2
cpu-big         c5.4xlarge    n1-standard-16
cpu-huge        c5.18xlarge   n1-standard-96
ram-big         r5.4xlarge    -
ram-huge        r5.24xlarge   -

GPUs

Instance Type   AWS             GCP Instance    GCP Accelerator
K80             p2.xlarge       n1-highmem-4    nvidia-tesla-k80 x 1
K80x2           p2.2xlarge      n1-highmem-8    nvidia-tesla-k80 x 2
K80x4           p2.4xlarge      n1-highmem-16   nvidia-tesla-k80 x 4
K80x8           p2.8xlarge      n1-highmem-32   nvidia-tesla-k80 x 8
V100            p3.2xlarge      n1-highmem-4    nvidia-tesla-v100 x 1
V100x4          p3.8xlarge      n1-highmem-32   nvidia-tesla-v100 x 4
V100x8          p3.16xlarge     n1-highmem-64   nvidia-tesla-v100 x 8
V100x8-big      p3dn.24xlarge   -               nvidia-tesla-v100 x 8
P100            -               n1-highmem-8    nvidia-tesla-p100 x 1
P100x2          -               n1-highmem-16   nvidia-tesla-p100 x 2
P100x4          -               n1-highmem-32   nvidia-tesla-p100 x 4

For support of other instance types, please contact support@spell.run.
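
When scripting against these tables, a lookup keyed by Spell instance type can show quickly whether a type is listed for your cloud provider. The sketch below transcribes only a subset of the rows above; the dictionary INSTANCES and the function cloud_instance are hypothetical helpers, not part of any Spell tooling.

```python
from typing import Optional

# Subset of the tables above as (AWS instance, GCP instance).
# None means the table lists no instance for that provider.
INSTANCES = {
    "cpu": ("c5.large", "n1-standard-2"),
    "cpu-big": ("c5.4xlarge", "n1-standard-16"),
    "cpu-huge": ("c5.18xlarge", "n1-standard-96"),
    "ram-big": ("r5.4xlarge", None),
    "ram-huge": ("r5.24xlarge", None),
    "V100": ("p3.2xlarge", "n1-highmem-4"),
    "P100": (None, "n1-highmem-8"),
}

PROVIDER_COLUMN = {"aws": 0, "gcp": 1}


def cloud_instance(instance_type: str, provider: str) -> Optional[str]:
    """Return the provider's instance name, or None if the table has no entry.

    Raises KeyError for an instance type or provider not in the mapping.
    """
    return INSTANCES[instance_type][PROVIDER_COLUMN[provider.lower()]]


print(cloud_instance("cpu-big", "aws"))
print(cloud_instance("ram-big", "gcp"))
```

A result of None corresponds to a "-" cell, for example ram-big on GCP or P100 on AWS.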

Available Frameworks

Framework    Notes                                          Approximate disk size
PyTorch      torch==1.0.1                                   4 GB
TensorFlow   tensorflow==1.13.1, Keras==2.2.4               3-4 GB
Caffe        caffe==1.0.0                                   2 GB
Conda        Using a conda environment file or spec file.   4 GB
Torch        Torch7                                         3 GB
MXNet        mxnet==1.3.1                                   1-3 GB
DyNet        dyNET==2.1                                     2 GB
FastAI       fastai==1.0.52                                 4 GB
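
Because disk usage depends on which framework images you attach, a rough estimate can help when choosing a disk size. The sketch below sums the upper bounds from the table above. It assumes image sizes are simply additive, which may overestimate real usage, and the names FRAMEWORK_DISK_GB and framework_disk_gb are hypothetical.

```python
from typing import Iterable

# Upper bound of the approximate disk size of each framework image, in GB.
FRAMEWORK_DISK_GB = {
    "PyTorch": 4,
    "TensorFlow": 4,
    "Caffe": 2,
    "Conda": 4,
    "Torch": 3,
    "MXNet": 3,
    "DyNet": 2,
    "FastAI": 4,
}


def framework_disk_gb(frameworks: Iterable[str]) -> int:
    """Estimate the disk space used by the selected framework images.

    Assumes sizes are additive; raises KeyError for an unknown framework.
    """
    return sum(FRAMEWORK_DISK_GB[name] for name in frameworks)


print(framework_disk_gb(["PyTorch", "TensorFlow"]))  # prints 8
```

The estimate covers only the framework images, so the disk you select should leave additional room for your own data and run outputs.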