April | 2021 | Tater's Tech Blog

What is a spot instance? Spot instances are how cloud providers sell their unused capacity, often at a big discounts. Currently I see most VM pricing discounted anywhere from 70% to 90%.

Every major cloud provider currently supports spot instances. AWS first released spot instance in Dec 2019. Microsoft nearly 10 years later released spot instances in May 2020. Google cloud also has spot instance offering which they named Preemptible VM Instances. For this post I will mainly focus on Azure’s spot instances and how one can mitigate these downsides.

When you create a Spot instance in Azure, Microsoft provides data on eviction rates and the current reduced rate you can purchase the VM at. For example, here you can purchase the D8s_v3 at a cost of $0.11201 cents per hour versus the standard pay-as-you-go pricing of $0.384 per hour, a 70% reduction in cost.

Fig 2. See all Sizes (Eviction Rate / Discount / Cost)

Create A Azure Spot Instance

When creating a VM one must select Azure Spot Instance. A VM can only be made a Spot Instance when it is created. If your VM is not a spot instance and you would like to make the VM a spot instance there are various scripts available on the internet to do this. You will also want to make sure you select Eviction type capacity. This will reduce the number of evictions that might occur on your Sport Instance. However, the pricing is variable. For example, if current going price for a D8s_v3 is $0.11201 cents per hour, and then there is a spike in demand the price may float up to $0.12480 cents per hour. As the caption says your VM will not be evicted for pricing reasons. Only if Azure needs the capacity for pay-as-you-go workloads. Eviction policy is also another important selection. If you are running static VM images in Azure for your development environment you want to ensure the Eviction Policy is set to Stop/Deallocate.

Another interesting selection is the View pricing history and compare prices in nearby regions. You will be able to view pricing in the last few months as well as compare against other regions.

How do we solve for VM evictions?

Do we really want to ask our developers to login into Azure Portal and restart VMs after each eviction???

My first try at solving the eviction problem.

Initially when I set out to solve this problem I started looking into Azure Functions and Azure Event Grid. Unfortunately after building a working process with Azure function and Azure Event Grid which worked flawlessly using the simulated eviction API, I ran into a big issue. What happens when an actual eviction event is generated by Azure and a simulated eviction are different. 😦 Thanks Microsoft!!!

Microsoft only allows the following event types:

ResourceWriteSuccess
ResourceWriteFailure
ResourceWriteCancel
ResourceDeleteSuccess
ResourceDeleteFailure
ResourceDeleteCancel
ResourceActionSuccess
ResourceActionFailure
ResourceActionCancel

A simulated eviction seems to fall under ResourceActionSuccess event type (see example payload below). From what I could gather using requetbin and the event grid webhook to capture the Event Grid events. There doesn’t seem to be a eviction event which gets sent on event grid when an actual eviction event is generated. I suspect an actual eviction doesn’t fall into one of the categories listed above.

As you can see in the simulated eviction the action field clearly states this is an eviction action. This made it very easy to parse and determine if the event coming off the event grid was an eviction.

Example event grid message snippet from simulated eviction.


 {   
   "subject": "/subscriptions/a6074dfd-ce1d-4687-af3a-6e7c3d2d6852/resourceGroups/rg-eviction-test/providers/Microsoft.Compute/virtualMachines/myvm123",
    "eventType": "Microsoft.Resources.ResourceActionSuccess",
    "id": "d62277a8-ce65-490a-8dfd-c6eab09ab43b",
    "data": {
      "authorization": {
        "scope": "/subscriptions/a6074dfd-ce1d-4687-af3a-6e7c3d2d6852/resourceGroups/rg-eviction-test/providers/Microsoft.Compute/virtualMachines/myvm123",
        "action": "Microsoft.Compute/virtualMachines/simulateEviction/action",
        "evidence": {
          "role": "Contributor",
          "roleAssignmentScope": "/subscriptions/a6074dfd-ce1d-4687-af3a-6e7c3d2d6852",
          "roleAssignmentId": "b8a05722fc9d48b085238e99ae995fbf",
          "roleDefinitionId": "b24988ac618042a0ab8820f7382dd24c",
          "principalId": "654cf51a87364887b083b5aa1e64ddb4",
          "principalType": "Group"
        }
      },
}

Still determined, I considered triggering the Azure Function off a deallocation event, however, I was not keen on using deallocation as a trigger event, as it might cause too many false VM starts when someone was actually trying to deallocate their VM. Here is an example of a deallocate event when an actual eviction occurs.

        "subject": "/subscriptions/a6074dfd-ce1d-4687-af3a-6e7c3d2d6852/resourceGroups/rg-eviction-test/providers/Microsoft.Compute/virtualMachines/myvm123",
        "eventType": "Microsoft.Resources.ResourceActionSuccess",
        "id": "ef032195-f5d9-4276-83f8-c50a52949167",
        "data": {
            "authorization": {
                "scope": "/subscriptions/a6074dfd-ce1d-4687-af3a-6e7c3d2d6852/resourceGroups/rg-eviction-test/providers/Microsoft.Compute/virtualMachines/myvm123",
                "action": "Microsoft.Compute/virtualMachines/deallocate/action",
                "evidence": {

Fig. 5 Azure Function and Event Grid start VM process

My second try at solving the eviction problem. Success!!!

I was determined to make this happen as the cost savings potential for a big environment was too great. I had remembered some earlier work I had done with the Azure Instance Metadata service, I investigated if the metadata service on each instance would expos eviction events. It turns out the VMs do. Each VM on Azure exposes a metadata via Azure Instance Metadata Service. Through this service one can retrieve information about the service like the subscription the VM is running in, resource group, IP address, VM size, tags and etc…

The basic idea behind this process, is we monitor the Azure Instance Metadata Service for an eviction event. Once the eviction event is received the process will call Jenkins via its reset endpoint and start the VM after it has been deallocated/evicted from Azure.

Fig 6. VM restart process initiated from client side

The downside of this architecture is you have to deploy and manage a small client side service running on each VM. However, with tools like Ansible, Chef, Puppet or Cloud-Init this usually isn’t too much trouble. The upside writing this service does not require any knowledge of Azure functions. Theoretically this architecture could be implemented on AWS as well, they provide instance metadata similar to Azure.

Simulating the eviction Process

Microsoft recently released REST api to simulate eviction process on Azure VMs. The ability to simulate an eviction made it much easier for me to develop and a process the eviction monitor service.

# you can log into Azure portal and if using Chrome go to dev tools 
# and get your Bearer token from one of the requests
TOKEN=<my auth token>

curl -H "Accept: */*" \
-H "Content-Length: 0" \
-H "Host: management.azure.com" \
-H "Authorization: Bearer ${TOKEN}" \
-X POST \
https://management.azure.com/subscriptions/a6074dfd-ce1d-4687-af3a-6e7c3d2d6852/resourceGroups//providers/Microsoft.Compute/virtualMachines/myvm123/simulateEviction?api-version=2020-12-01

Fig 7. Azure Portal Active Log Simulated Eviction

I went to work and wrote small service in Python which queries the metadata service on the VM. I the installed the service via systemd service file. The python code, Jenkinsfile and Python code are published on github. Below is the output of systemd logs from the eviction_notify app right after the simulated eviction curl POST. Microsoft will give you a minimum of 30 seconds notification before the VM will be shut down. However, I have seen notifications of up to 3 to 4 minutes.

● eviction-notify.service - Eviction Notify
   Loaded: loaded (/etc/systemd/system/eviction-notify.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2021-04-24 14:41:34 UTC; 3min 0s ago
 Main PID: 1949 (python3)
    Tasks: 1 (limit: 4074)
   CGroup: /system.slice/eviction-notify.service
           └─1949 /usr/bin/python3 /opt/eviction_notify.py

Apr 24 14:44:32 myvm123 /eviction_notify.py[1949]: b'{"DocumentIncarnation":1,"Events":[{"EventId":"2E372AE8-3F48-40A8-BAF0-6C54A23C2669","EventStatus":"Scheduled","EventType":"Preempt","ResourceType":"VirtualMachine","Resources":["myvm123"],"NotBefore":"Sat, 24 Apr 2021 14

Apr 24 14:44:32 myvm123 /eviction_notify.py[1949]: Shutting down in: 5 seconds!!!

Apr 24 14:44:33 myvm123 /eviction_notify.py[1949]: b'{"DocumentIncarnation":1,"Events":[{"EventId":"2E372AE8-3F48-40A8-BAF0-6C54A23C2669","EventStatus":"Scheduled","EventType":"Preempt","ResourceType":"VirtualMachine","Resources":["myvm123"],"NotBefore":"Sat, 24 Apr 2021 14

Apr 24 14:44:33 myvm123 /eviction_notify.py[1949]: Shutting down in: 4 seconds!!!

Apr 24 14:44:33 myvm123 /eviction_notify.py[1949]: Calling Jenkins URL : http://10.111.252.21:8080/job/automation/job/startvm/buildWithParameters?subscription_id=a6074dfd-ce1d-4687-af3a-6e7c3d2d6852&resource_group=rg-eviction-test&name=myvm123

Apr 24 14:44:33 myvm123 /eviction_notify.py[1949]: {"message": "Spot instance evicted, shutting down in: 0:00:04.970472 seconds", "timestamp": "04/24/2021, 14:44:33"}

Apr 24 14:44:34 myvm123 /eviction_notify.py[1949]: b'{"DocumentIncarnation":1,"Events":[{"EventId":"2E372AE8-3F48-40A8-BAF0-6C54A23C2669","EventStatus":"Scheduled","EventType":"Preempt","ResourceType":"VirtualMachine","Resources":["myvm123"],"NotBefore":"Sat, 24 Apr 2021 14

Apr 24 14:44:34 myvm123 /eviction_notify.py[1949]: Shutting down in: 3 seconds!!!

Apr 24 14:44:34 myvm123 /eviction_notify.py[1949]: Calling Jenkins URL : http://10.111.252.21:8080/job/automation/job/startvm/buildWithParameters?subscription_id=a6074dfd-ce1d-4687-af3a-6e7c3d2d6852&resource_group=rg-eviction-test&name=myvm123

Apr 24 14:44:34 myvm123 /eviction_notify.py[1949]: {"message": "Spot instance evicted, shutting down in: 0:00:03.935851 seconds", "timestamp": "04/24/2021, 14:44:34"}

What are some other approaches you can use to reduce the chance your VMs might get evicted?

Another way you can mitigate the chance your VM might get evicted is to run different sizes across different Availability Zones and regions. AWS actually has a really good article on this in reference to its kubernetes offering EKS. Although the post is about Kubernetes one could apply the same principles to any group of VMs. Demand for compute can very across availability zones, VM sizes and regions. Spreading the VMs across all three will greatly reduce the chance that all VMs will get evicted at a single point in time.

What has my experience with Spot Instances been?

Although I have only several months of experience with Spot Instances, I have a pretty good understanding of the ins-and-outs of Spot Instances. Recently I was testing deployment of a 12 node Cassandra cluster using Ansible, I deployed this Cassandra cluster with Spot Instances and set Eviction Type to Capacity. After about a week of testing I noticed on average 3 to 4 evictions total across the whole cluster per day. For the small inconvenience, I was saving 75% per day on the cost of compute. And with the process I built to start the VM upon eviction, most times I did not notice a node went down for a few minutes. I am looking at additional use cases for Spot Instances, and will publish the findings here.

Where can I get the source code?

Source code for eviction_notify has been published on github

Summary

In summary spot instance can save a substantial amount on public cloud costs. For example, if your monthly cloud bill is $10,000 for dev/test environments. By shutting down environments for 12 hours a day you can reduce the cost to $5,000. By implementing spot instances you can potentially lower this to $1,500 per month assuming 70% discount. Between shutting down VMs and Spot instances there is a potential for big savings. In addition, spot instances also allow you to use much more powerful instance types for less cost in dev environments. I also showed how to mitigate the down side of using Spot instances, by engineering a solution to reduce the impact of the evictions.

Tater's Tech Blog

Month: April 2021

Configuring an API Key on Kubernetes nginx.ingress