Bridging the Gap Between Preemptible VMs and Everyday Apps

One of the most compelling recent innovations in cloud computing has little to do with new technology, and everything to do with managing cost: spot instances, or preemptible VMs. The concept is simple and completely logical. Cloud providers keep large quantities of compute resources in their data centers. As in any system, supply must sufficiently exceed demand to ensure that service levels can be met without delays. That excess creates a buffer, and buffers cost money to keep around. What if, instead of letting the buffer sit idle when unused, that capacity could be sold? But how would you sell the buffer capacity without defeating the purpose of the buffer in the first place? Simple: only sell it for a few hours at a time and make it intentionally unreliable. Then the provider retains the option to pull it back and satisfy mainline demand when needed, all without breaking the promise of the SLA.


This is the basic idea behind “spot instances” (AWS), “preemptible VMs” (GCP), and “low-priority VMs” (Azure). Since they carry a very limited SLA (essentially: “Caution: this machine can and will be destroyed at any time, with only 30 seconds’ notice.”) and since the cloud provider can monetize its spare capacity, these resources are sold at a shockingly large discount. For example, a preemptible VM on Google Cloud can cost 80% less than a standard compute VM. That is compelling! We’ll use “preemptible VM” (or PVM for short) as our term for the concept throughout this article.
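To put the discount in concrete terms, here is a quick arithmetic sketch showing how an 80% discount translates into the “5X” savings figure used later in this article:

```python
def cost_multiple(discount: float) -> float:
    """Return how many times more a full-price VM costs than a discounted one.

    An 80% discount means you pay only 20% of the standard price,
    i.e. the standard VM costs 1 / 0.2 = 5 times as much.
    """
    return 1.0 / (1.0 - discount)

print(round(cost_multiple(0.80), 6))  # prints 5.0
```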


For applications designed to tolerate regular machine failures, PVMs are right up their alley. If a machine is preempted, no problem: a new machine starts up and takes over the load. This is great for workloads that are largely stateless and not mission critical, such as batch processing, certain HPC workloads, and software test farms. But a 5X cost reduction is enough to get anyone’s attention, and it has certainly attracted people to the cloud who may have hesitated based solely on the retail cost of standard VMs. Which means people want to run all sorts of applications on PVMs, even the mission-critical stateful apps not well suited to the reliably unreliable cadence of PVMs.


What to do? How can mainstream applications take advantage of the cost savings of PVMs without suffering data loss or downtime due to their inherent unreliability?


We have found that the missing link between stable mainline applications and PVMs is reliable, performant shared storage. If a PVM goes away, so does its work in progress. In the storage world this is called “data loss,” and it’s a very bad thing for most applications. The storage available to most PVMs is locally attached SSD, which is fast but non-persistent. Any data stored on a PVM’s local SSDs is lost forever when the machine powers off.


The ideal storage solution would be accessible like a local drive, fast, shared, and persistent. Bring these four together and a PVM becomes nearly indistinguishable from a normal VM, but with a 5X cost savings!


At Elastifile, we’ve seen many of our cloud customers embracing our solution for this very reason. Unlike the native storage options in cloud, Elastifile hits the mark on all these categories, all without breaking the bank. This Venn diagram shows the overlap of requirements and compares the various in cloud storage options.



In general the storage offerings all have their own personalities, as shown in the table below.

| Storage | Mountable as a local drive | Transactional performance | State | Writeable shares | Cost |
|---|---|---|---|---|---|
| Local SSD | Yes | Highest | Non-persistent | No | Medium |
| Persistent Disk | Yes | Medium | Persistent | No | High |
| Cloud Storage | No | Low | Persistent | Yes | Low |


So each of the storage approaches has value, but each exhibits tradeoffs. With Elastifile, reliability is ensured while adding sharing functionality and strong aggregate performance. Elastifile does add a new cost variable, but even this can be mitigated depending on the nature of the data and the features Elastifile provides. While the details are outside the scope of this post, Elastifile storage costs can be as low as $0.03/GB/month while still achieving the reliability needed to make PVMs work. This makes Elastifile nearly tradeoff-free!


Let’s take a look at some of the use cases that we’ve seen enabled with Elastifile added into PVM-based architecture.


Naidu Annamaneni from eSilicon Corporation presented with me on stage at Google Cloud Next 2018 in San Francisco, highlighting his team’s hugely successful use of PVMs for a silicon chip design simulation “tape out.” Watch Naidu tell the story here. The punchline? Naidu led his team to complete the project on time, without adding thousands of cores to his datacenter infrastructure, without changing his application code, and while still meeting the performance needs of his demanding workflow. In the architecture diagram below, you can see the heavy usage of PVMs in the final design.

This pattern is repeating across industries. We have another customer doing large-scale video processing in the cloud, and another doing drug discovery completely in silico (a.k.a. computational drug discovery). The common attraction is PVMs; the common enabler is Elastifile persistent shared storage.


In general, a large scale compute architecture based on PVMs enhanced by Elastifile looks like this in cloud:

When leveraging large farms of compute nodes, it’s very useful, if not essential, to manage the work with a job scheduler or queue manager. Most of the popular schedulers have already adapted their systems (or are in the process of doing so) to manage cloud servers just like on-premises servers. The scheduler detects when a PVM gets preempted, then spins up a new machine and reschedules the associated work. This ensures that work continues to flow and downtime is minimized. With Elastifile adding data reliability via persistent shared storage, the new VMs can get right to work without skipping a beat.
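As a concrete illustration of the detection step, on GCP an agent on the VM can ask the instance metadata server whether the machine has been preempted. The sketch below is a minimal Python example assuming a GCP environment; the requeueing hook described in the comments is hypothetical and would be supplied by your scheduler:

```python
import urllib.request

METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
)

def parse_preempted(body: str) -> bool:
    """The preempted endpoint returns the literal string TRUE or FALSE."""
    return body.strip().upper() == "TRUE"

def check_preempted() -> bool:
    """Ask the GCP metadata server whether this VM has been preempted.

    Only callable from inside a GCP VM; requires the Metadata-Flavor header.
    """
    req = urllib.request.Request(
        METADATA_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return parse_preempted(resp.read().decode())

# On the VM, a watchdog would call check_preempted() in a loop (the
# endpoint also supports long-polling via ?wait_for_change=true) and,
# on True, requeue this node's in-flight jobs so a replacement VM can
# resume them from shared storage.
```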


Elastifile has been able to integrate with all of the schedulers our customers use. Three main capabilities make our integrations relatively seamless:


  1. We use NFSv3 as our client protocol. Most schedulers already use NFS as their main storage option, and it’s natively built into every Linux OS.
  2. We have a robust RESTful API for automating any function in the system. Provisioning Elastifile, bulk data loading, snapshots, elastic capacity additions/removals, etc. can all be done programmatically, on the fly, via the API.
  3. We have already integrated with Kubernetes and Docker to handle storage provisioning nuances and to deliver persistent/stateful storage for containers, a popular compute layer on PVMs. This integration, as demonstrated by our Kubernetes storage provisioner, greatly reduces complexity for administrators leveraging containers.
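To make the API-driven automation in point 2 concrete, here is a minimal sketch of what scheduler-driven storage automation against a REST API could look like. The base URL, endpoint path, payload fields, and the `snapshot_payload` helper are hypothetical illustrations, not Elastifile’s actual API:

```python
import json
import urllib.request

def snapshot_payload(fs_name: str, label: str) -> dict:
    """Build a snapshot request body (field names are hypothetical)."""
    return {"filesystem": fs_name, "name": label, "retention_days": 7}

def post_json(base_url: str, path: str, payload: dict, token: str) -> urllib.request.Request:
    """Construct an authenticated JSON POST request (built, not sent, here)."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )

# Example: a scheduler prolog might snapshot the shared filesystem
# before a large batch run starts (URL and path are placeholders).
req = post_json(
    "https://storage.example.com/api",
    "/v1/snapshots",
    snapshot_payload("scratch-fs", "pre-batch-run"),
    token="EXAMPLE_TOKEN",
)
```

A real scheduler hook would send the request with `urllib.request.urlopen(req)` and check the response status before dispatching jobs.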


By design, preemptible VMs are inexpensive but intentionally unreliable, making them attractive yet difficult to use for many applications. Elastifile adds data reliability to preemptible VMs and, in so doing, makes these cost savings accessible to many more applications. When you’re ready to take Elastifile for a spin with your application on PVMs, give it a try in your own cloud project.
