Cluster Machine Type Management

For users on the Spell for Teams plan, we deploy Spell in your cloud and provide the same cluster management tools backing our own internal infrastructure. This means you can keep your data in your own S3 buckets, perform runs on your own machines, and deploy models within your own cloud infrastructure.

Once you have set up an AWS or GCP cluster, you can define distinct machine types to execute runs on. This guide explains how to create, modify, and delete machine types in your cluster.

Creating a new machine type

Machine Type Creation Dialog

Under Clusters click on your cluster, then click on "Add New Machine Type". This will bring up a menu of configuration options:

  • Name. This name will be referenced by the --machine_type parameter when you create runs.
  • Instance Type. Select whether you want a machine with a Graphics Processing Unit (GPU) or without one (CPU). Then use the dropdown menu to choose from our list of preselected instance types. For a list of the instance types available, see the section "Available instance types".
  • On Demand/Spot. Spot instances are significantly cheaper than on demand instances, but come with the caveat that they may be reclaimed by the cloud provider (AWS/GCP) at any time. For recommendations on when to use which type, see the section "Using preemptible instances".
  • Storage Size. This is the size of disk attached to the machine at startup. Keep in mind that some of this space will be reserved by Spell-related images; precisely how much depends on which/how many framework images you assign to this machine type.
  • Additional Images. All instances come with the default framework (TensorFlow 1, PyTorch, and Conda) image pre-installed. Use these checkboxes to attach additional framework images. Consult the section "Available frameworks" for details.
  • Machine Limits. Specify the minimum and maximum number of machines that can be running at one time, and an idle timeout. For guidance on what to set these values to, see the next section of this guide, "Guidance on machine limits".
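The options above can be pictured as a small record with a few sanity checks. The following is a minimal sketch for illustration only, not Spell's API: the class name, field names, and the storage and Maximum Machines defaults are hypothetical. Only the Minimum Machines default (0) and Idle Timeout default (30 minutes) come from this guide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MachineTypeConfig:
    """Hypothetical record mirroring the dialog fields; not a Spell API."""

    name: str                        # referenced when you create runs
    instance_type: str               # e.g. "K80" or "cpu-big"
    spot: bool = False               # False means On Demand
    storage_gb: int = 80             # assumed value, for illustration only
    additional_images: List[str] = field(default_factory=list)
    min_machines: int = 0            # default stated in this guide
    max_machines: int = 1            # assumed value, for illustration only
    idle_timeout_minutes: int = 30   # default stated in this guide

    def validate(self) -> None:
        """Raise ValueError if the limits are inconsistent."""
        if not self.name:
            raise ValueError("name must not be empty")
        if self.storage_gb <= 0:
            raise ValueError("storage_gb must be positive")
        if self.min_machines < 0:
            raise ValueError("min_machines must be >= 0")
        if self.max_machines < self.min_machines:
            raise ValueError("max_machines must be >= min_machines")
        if self.idle_timeout_minutes < 0:
            raise ValueError("idle_timeout_minutes must be >= 0")


# Example: a GPU machine type for a team of eight.
config = MachineTypeConfig(name="gpu-train", instance_type="V100", max_machines=8)
config.validate()
```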

Guidance on machine limits

TLDR: we recommend setting Minimum Machines to 0 (the default), Maximum Machines to the size of your team, and Idle Timeout to 30 minutes (the default).

Spell will attempt to maintain Minimum Machines ready at all times, and will spin up additional machines as needed to accommodate bursts in usage.

Spinning up a machine takes time, so keeping some number of machines idle can help speed up development (connecting to an idle machine is nearly instantaneous). On the other hand, machines that sit idle and unused cost money without providing any benefit. For most use cases we recommend setting Minimum Machines to 0; this ensures that you are only paying for compute that you actually use.

The total machines for the machine type will never exceed Maximum Machines. By default, we recommend matching this value to the size of your team. For example, if you have 8 data scientists on staff, set Maximum Machines to 8.
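One way to read the two limits together is as a clamp on demand. The sketch below illustrates the behavior described in this section and is not Spell's scheduler; the function name is hypothetical, and it deliberately ignores the Idle Timeout.

```python
def target_machine_count(machines_needed: int, min_machines: int, max_machines: int) -> int:
    """Clamp the number of machines needed into [min_machines, max_machines]."""
    if min_machines < 0 or max_machines < min_machines:
        raise ValueError("limits must satisfy 0 <= min_machines <= max_machines")
    return max(min_machines, min(machines_needed, max_machines))


# With Minimum Machines 0 and Maximum Machines 8:
print(target_machine_count(0, 0, 8))   # no demand, nothing is kept running
print(target_machine_count(3, 0, 8))   # demand fits within the limits
print(target_machine_count(12, 0, 8))  # demand is capped at Maximum Machines
```

Under this reading, any demand beyond Maximum Machines has to wait until a machine frees up.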

AWS/GCP enforce resource limits on instance types. Some of these limits are very aggressive, e.g. by default AWS only allows one concurrent machine of most GPU instances. Before finishing cluster setup, double-check your service account limits with your cloud provider. If a limit is too low, you may put in a request for a service limit increase or resource quota increase. These are usually approved quickly, within 24 hours. Avoid setting Maximum Machines higher than your service limit for the instance.
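The advice above (match Maximum Machines to team size, but never exceed the provider's limit) can be sketched as a small helper. The names below are hypothetical, and the service limit is assumed to be a number you looked up yourself in your cloud provider's console.

```python
from typing import NamedTuple


class MaxMachinesAdvice(NamedTuple):
    max_machines: int        # value to enter for Maximum Machines
    request_increase: bool   # True if the provider limit is below team size


def advise_max_machines(team_size: int, service_limit: int) -> MaxMachinesAdvice:
    """Cap the team-size recommendation at the provider's service limit."""
    if team_size < 0 or service_limit < 0:
        raise ValueError("team_size and service_limit must be >= 0")
    capped = min(team_size, service_limit)
    return MaxMachinesAdvice(capped, service_limit < team_size)


# Eight data scientists, but the account only allows one instance of this type:
print(advise_max_machines(team_size=8, service_limit=1))
```

When `request_increase` is true, the limit is the bottleneck rather than the Spell setting, so a limit increase request is the next step.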

The last machine limit value is the Idle Timeout. This value controls how long Spell will wait before spinning down machines that have finished running. If another run gets scheduled in the idle time period, it will get started almost immediately, as it does not have to wait for another cloud machine to become available and spin up. We recommend keeping Idle Timeout on the default setting of 30 minutes for most instance types. Consider setting Idle Timeout to 0 minutes for especially powerful/expensive instance types that you expect to use rarely.
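The Idle Timeout rule can be illustrated with a short check. This is a sketch of the behavior described above, not Spell's implementation; the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def should_spin_down(last_run_finished: datetime, now: datetime,
                     idle_timeout_minutes: int) -> bool:
    """Return True once a machine has been idle for at least the timeout."""
    if idle_timeout_minutes < 0:
        raise ValueError("idle_timeout_minutes must be >= 0")
    return now - last_run_finished >= timedelta(minutes=idle_timeout_minutes)


finished = datetime(2020, 1, 1, 12, 0)
ten_minutes_later = finished + timedelta(minutes=10)

# Default timeout: the machine is still warm, so a new run can start quickly.
print(should_spin_down(finished, ten_minutes_later, idle_timeout_minutes=30))
# Timeout of 0: the machine is released as soon as its run finishes.
print(should_spin_down(finished, finished, idle_timeout_minutes=0))
```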

Modifying an existing machine type

Machine Type Edit Dialog

Once you have a machine type set up, you can modify its Minimum and Maximum machines, as well as the Idle Timeout. Use this if you are running many runs in parallel and need more machines, or to terminate all existing machines when you know that you are not going to use them for some amount of time.

Using preemptible instances

When you create a new machine type you can select Preemptible/Spot instead of On Demand. These machines are significantly cheaper than their On Demand counterparts, but AWS/GCP can terminate them at any time.

Spell will save all the files changed during your run even if your machine is terminated. The platform manually manages the volume on the machine, and if your machine is terminated before the save completes the platform will automatically attach that volume to a new CPU machine and finish saving the rest of your progress. Your run will then reach a final status of Interrupted.

If your code makes use of checkpoints, Preemptible Instances are a great way to save quite a bit of money. Also note, while Spell will make a best effort to spin up machines based on queued runs and minimum machine count, Spell can't guarantee there will be machines available. This might mean you will have to wait for machines to become available.
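A minimal sketch of the checkpointing pattern mentioned above, using only the Python standard library. The file name, state layout, and function names are hypothetical, and the loop body is a stand-in for a real training step. It also assumes the checkpoint file written by an interrupted run is made available to the run that resumes it.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then rename, so a partial write never replaces a good file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def train(path: str, num_steps: int, checkpoint_every: int = 10) -> dict:
    """Resume from the last checkpoint and run up to num_steps."""
    state = load_checkpoint(path)
    for step in range(state["step"], num_steps):
        state["total"] += step        # stand-in for one real training step
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If a run is interrupted, calling `train` again with the same checkpoint path continues from the last saved step instead of starting over, so at most `checkpoint_every` steps of work are repeated.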

Note

Runs executing on On Demand instances on GCP may experience unwanted shutdowns when the GCP machine goes into a REPAIRING state. Very long-running runs, e.g. those over 24 hours in length, are the most susceptible to this. For long-running training jobs on GCP machines it may be preferable to use spot instances, in combination with checkpointing, to avoid losing training progress.

Deleting a machine type

Once you know that you no longer need a machine type, you can delete it. This will do the following:

  1. All runs queued on that machine type will be preemptively killed.
  2. All machines will be terminated.
  3. The machine type will be removed from your cluster dashboard, and you will not be able to schedule future runs on this machine type. However, you can always create a new machine type, even with the same name.

Available instance types

CPUs

Instance Type   AWS           GCP
cpu             c5.large      n1-standard-2
cpu-big         c5.4xlarge    n1-standard-16
cpu-huge        c5.18xlarge   n1-standard-96
ram-big         r5.4xlarge    -
ram-huge        r5.24xlarge   -

GPUs

Instance Type   AWS             GCP Instance    GCP Accelerator
K80             p2.xlarge       n1-highmem-4    nvidia-tesla-k80 x 1
K80x2           p2.2xlarge      n1-highmem-8    nvidia-tesla-k80 x 2
K80x4           p2.4xlarge      n1-highmem-16   nvidia-tesla-k80 x 4
K80x8           p2.8xlarge      n1-highmem-32   nvidia-tesla-k80 x 8
V100            p3.2xlarge      n1-highmem-4    nvidia-tesla-v100 x 1
V100x4          p3.8xlarge      n1-highmem-32   nvidia-tesla-v100 x 4
V100x8          p3.16xlarge     n1-highmem-64   nvidia-tesla-v100 x 8
V100x8-big      p3dn.24xlarge   -               -
P100            -               n1-highmem-8    nvidia-tesla-p100 x 1
P100x2          -               n1-highmem-16   nvidia-tesla-p100 x 2
P100x4          -               n1-highmem-32   nvidia-tesla-p100 x 4

For support of other instance types, please contact support@spell.run.

Available frameworks

All machines support PyTorch (torch==1.4.0 and torchvision==0.5.0), TensorFlow 1 (tensorflow==1.14.0), and Conda.

You may also include additional frameworks by toggling the checkboxes on the machine definition in the cluster management pane. Note that including additional frameworks increases the footprint on disk of Spell-related resources as well as the amount of time the machine takes to start up, so it's a good idea to be picky about which frameworks you provide on a machine type.

Framework      Notes               Approximate disk size
TensorFlow 2   tensorflow==2.0.0   2GB
Caffe          caffe==1.0.0        2GB
Torch          Torch7              3GB
MXNet          mxnet==1.3.1        1-3GB
DyNet          dyNET==2.1          2GB
FastAI         torch==1.0.52       4GB
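To get a rough sense of how much of a machine's Storage Size the additional images take up, the approximate sizes in the table above can be summed. This is a back-of-the-envelope sketch: the function name is hypothetical, the numbers are copied from the table, and the size of the default framework image is not included because this guide does not state it.

```python
from typing import Iterable, Tuple

# Approximate disk size in GB per additional framework, as (low, high) bounds.
FRAMEWORK_DISK_GB = {
    "TensorFlow 2": (2, 2),
    "Caffe": (2, 2),
    "Torch": (3, 3),
    "MXNet": (1, 3),
    "DyNet": (2, 2),
    "FastAI": (4, 4),
}


def additional_image_footprint_gb(frameworks: Iterable[str]) -> Tuple[int, int]:
    """Return (low, high) estimates of the disk used by the selected images."""
    selected = list(frameworks)
    low = sum(FRAMEWORK_DISK_GB[name][0] for name in selected)
    high = sum(FRAMEWORK_DISK_GB[name][1] for name in selected)
    return low, high


low, high = additional_image_footprint_gb(["TensorFlow 2", "MXNet"])
print(f"Additional images use roughly {low}-{high} GB of the attached disk")
```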