Rippling is a workforce management platform. As of mid-2023, we have 650 engineers pushing hundreds of pull requests every day. To ensure top quality for our clients, we maintain a vast suite of over 60,000 tests that provides high coverage across all our products and the integrations between them. The CI pipeline that runs this test suite for all our engineers operates at very high scale; we can have hundreds of builds running simultaneously.
Each build uses extensive parallelization to run all the tests (multiple nodes, multiple processes per node). At peak, we can have over 1,200 large VMs running, equivalent to 50,000 vCPUs and 100,000 GB of memory. We use Buildkite as our control plane and AWS as our cloud provider. At this scale, a million-dollar cloud bill is an opportunity to reduce cost. But we also see it as an engineering challenge to do so without compromising performance or efficiency.
This article will focus on how we were able to use Spot Instances to save on infrastructure costs. We had to deal with some unexpected issues during this project, but we also made some interesting discoveries. Let’s dive in.
Why are Spot Instances a good fit for CI/CD?
Using Spot Instances for CI/CD pipelines offers significant cost savings. Spot Instances are spare EC2 capacity offered at a steep discount. A Spot fleet can scale up or down with demand, ensuring efficient resource utilization and cost optimization, and Spot capacity is well suited to the parallel execution of CI/CD tasks, enabling faster pipeline completion.
As an organization grows, it adds more engineers. These new users of the build system increase the demand for CI and contribute to the existing codebase. This leads to a larger test suite that takes longer to execute, resulting in increased costs. Organizations should regularly review and optimize their test suites, but that topic is not covered in this article.
A common strategy for CI is to use Reserved Instances (RIs), which works well if you have a stable workload: RIs typically need to be committed ahead of time, usually for a full year. They offer some flexibility around conversion and scheduling that could be relevant for a build system. In our case, however, our load is variable and mostly trending up, and we make frequent optimizations, so we don't want to end up with unused RIs. We also wanted the flexibility to switch instance types over time. For these reasons, Spot Instances were a better fit for us.
Estimate the potential savings
To begin our journey to a cheaper build system, we started by looking at our AWS bill. Using Spot Instances was an initial idea that came up, but we needed to verify the impact.
Our methodology follows a three-step process:
- Measure the infrastructure cost related to computing, as opposed to network and storage.
- Identify the CI pipelines using the most computing resources.
- Estimate the potential savings.
Our assumption was that most of the cost came from compute (billed as EC2 instance usage), and our initial analysis confirmed it accounted for 90%. It is essential to remember that Spot Instances can only reduce compute costs. There are other options for the remaining cost categories, but they are not covered in this article.
Next, we looked at our build system to identify where to focus our effort. Using the 80/20 rule, our goal was to find the 20% of pipelines contributing 80% of the cost. Since we have a monolith application, the pipeline for this primary application was the only one we needed to focus on. We could simplify the scope even further by considering only the CI pipeline, since it runs hundreds of times more frequently than the CD pipeline.
From this point, we used simple math to estimate how much we could potentially save:
- Our monolith CI represented 80% of the compute cost.
- We estimated 75% savings from using Spot Instances; AWS advertises up to 90%.
- Estimated savings on compute: 80% × 75% = 60%.
- Compute represents 80% of our total cloud cost.
- Estimated savings on total cost: 60% × 80% = 48%, or ~50%.
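The arithmetic above can be sketched in a few lines. The figures are the rough estimates from our analysis, not exact billing data:

```python
# Back-of-the-envelope Spot savings estimate, using the same figures as above.
monolith_share_of_compute = 0.80   # monolith CI as a share of compute cost
spot_discount = 0.75               # assumed average Spot discount vs. On-Demand
compute_share_of_total = 0.80      # compute as a share of the total cloud bill

compute_savings = monolith_share_of_compute * spot_discount   # 0.60
total_savings = compute_savings * compute_share_of_total      # 0.48, i.e. ~50%

print(f"Savings on compute: {compute_savings:.0%}")   # → Savings on compute: 60%
print(f"Savings on total bill: {total_savings:.0%}")  # → Savings on total bill: 48%
```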
This estimation step didn’t require making any changes, but:
- We greatly simplified the problem: we only needed to migrate a single pipeline to Spot.
- We estimated that we could potentially save ~50%.
In our case, having a monolithic application was an advantage, as there was a clear path about which pipelines should be optimized.
Our journey to efficiently use Spot Instances
Step 1 - The naive approach, turn on the Spot checkbox
We use Buildkite, which lets us run agents in our own infrastructure. AWS is our cloud provider, so we have set up several Buildkite Elastic CI Stack for AWS deployments to run our workloads. Each stack is primarily an Auto Scaling group managing EC2 instances that connect to Buildkite as agents (worker nodes). A Lambda function adjusts the size of the Auto Scaling group based on the current load.
At first, we took the simplest approach: lowering the OnDemandPercentage in our stack definition, which directly sets the property of the same name on the Auto Scaling group. We started at 90% and lowered the number gradually, aiming for zero, which would mean using 100% Spot and getting the maximum discount.
There were two major issues with this strategy:
- We could not confidently go below 50% without regularly experiencing capacity shortages. There were periods, usually at peak times, when we were waiting for EC2 instances to be created, creating a backlog of builds that could not be processed. This was a poor experience for developers, whose builds would be stuck in the pipeline.
- We also had a high rate of Spot interruptions; roughly 5% of machines would be interrupted within their first hour. Because each build runs a large number of tests, we rely heavily on parallelization: the test suite is divided into batches, and each batch runs on a worker node. The more batches, the higher the probability that a given build would see an interruption. When a batch was interrupted, it had to be retried, adding time to the critical path of the build. In the worst case, the retry job would also be interrupted.
With this approach, we saved ~25% on compute cost, but we also added 25% to our average build time. The interruptions hit hardest during rush hours, when developers needed the CI system the most.
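The compounding effect of parallelism is worth making explicit: even a modest per-node interruption rate becomes a near-certainty at the build level once you fan out widely. A small model, with an illustrative batch count (the 50 here is not our actual number):

```python
def build_interruption_probability(p_node: float, num_batches: int) -> float:
    """Probability that at least one of a build's parallel batches is interrupted,
    assuming each batch is interrupted independently with probability p_node."""
    return 1.0 - (1.0 - p_node) ** num_batches

# With a ~5% hourly interruption rate and, say, 50 parallel batches:
p = build_interruption_probability(0.05, 50)
print(f"{p:.0%}")  # → 92%
```

At that rate, most builds would see at least one interruption, which is consistent with the slowdown we observed.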
Step 2 - Mitigate Spot outages
The next step for us was to build a service that detects a Spot outage and switches our CI pipeline to On-Demand instances. Here is an example of a simple solution using two different queues.
Over time, we became more efficient at detecting Spot outages, but switching the worker queues to On-Demand was challenging. If a build is in progress when a Spot outage occurs, dynamically switching the queue on retry is not a feature Buildkite offers. We ended up using a single queue and updating the Auto Scaling group's On-Demand percentage directly. This was not ideal either, because the value was not persisted: whenever our Terraform stack was updated by an unrelated change, Terraform would restore the original On-Demand percentage.
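As a sketch of the kind of detector involved, here is a minimal outage heuristic: count recent Spot capacity failures in a sliding window and fail over once they cross a threshold. The window and threshold values are illustrative, not our production configuration:

```python
import time
from collections import deque

class SpotOutageDetector:
    """Track recent Spot capacity failures and decide when to fail over to
    On-Demand. Window and threshold are illustrative values."""

    def __init__(self, window_seconds=600, threshold=5):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.failures = deque()  # timestamps of recent Spot request failures

    def record_failure(self, now=None):
        """Call whenever a Spot request fails (e.g. capacity not available)."""
        self.failures.append(now if now is not None else time.time())

    def should_use_on_demand(self, now=None):
        """True if enough failures happened recently to justify failing over."""
        now = now if now is not None else time.time()
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()  # drop failures outside the window
        return len(self.failures) >= self.threshold
```

In practice the "fail over" action would be an API call to raise the Auto Scaling group's On-Demand percentage, which is exactly the step that clashed with Terraform as described above.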
Step 3 - Making our pipeline Spot friendly
At this stage, we reconsidered our approach to Spot Instances and extended our Spot availability pool to accept a larger set of instance types, which lowers the probability of not having a Spot Instance available when requesting one.
We had initially optimized our pipeline for performance and cost, and ran into issues with EBS volumes: our workload would sometimes deplete a volume's I/O burst balance, causing AWS to throttle operations on the disk and making the EC2 instance unusable until the balance recovered. To prevent this, we tried instances with local SSD storage, such as M6id and C6id instances. They have no burst balance, and this solved the issue immediately.
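The burst-balance problem follows directly from the published gp2 volume model: a volume earns I/O credits at a baseline rate of 3 IOPS per GiB (minimum 100), can burst to 3,000 IOPS, and starts with a bucket of 5.4 million credits. A quick way to estimate how long a sustained load can run before throttling kicks in (the workload numbers below are illustrative):

```python
def seconds_until_throttled(volume_gib, sustained_iops, credits=5_400_000.0):
    """How long a gp2 volume can sustain a given IOPS load before its burst
    balance is depleted and AWS throttles it back to the baseline rate.
    Model per gp2 docs: baseline = max(100, 3 IOPS per GiB), 5.4M credit bucket."""
    baseline = max(100.0, 3.0 * volume_gib)
    if sustained_iops <= baseline:
        return float("inf")  # credits refill at least as fast as they drain
    return credits / (sustained_iops - baseline)

# A 100 GiB volume (baseline 300 IOPS) hammered at 3,000 IOPS:
hours = seconds_until_throttled(100, 3000) / 3600
print(f"{hours:.1f} hours")  # → 0.6 hours
```

A CI workload that saturates a disk for the length of a build can easily drain the bucket, which matches the throttling we saw.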
But this specific instance requirement was now playing against us, limiting our ability to leverage Spot Instances. Once we realized it was a major limiting factor, we revisited it and adjusted our pipeline to run on a larger set of instance types without needing local SSD storage. After these changes, we no longer experienced any wait time when requesting a new Spot instance.
We still had to deal with interruptions, where a job would be terminated unexpectedly before finishing and would need a full retry. To solve this, we modified our pipeline to skip tests that had already completed. This is an extremely interesting topic, and my colleague JD Palomino covers it in the article here.
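Conceptually, the fix amounts to recording which batches finished and only re-running the rest on retry. A toy model (the storage and names here are illustrative; in practice the completion state would live somewhere durable, such as S3 or a database):

```python
# Toy model of interruption-tolerant retries: completed batches are recorded in
# shared storage, so a retried build only re-runs what was lost to interruption.
completed_store = {}  # build_id -> set of completed batch ids

def mark_completed(build_id, batch_id):
    """Record that a batch finished, so a retry of this build can skip it."""
    completed_store.setdefault(build_id, set()).add(batch_id)

def batches_to_run(build_id, all_batches):
    """On (re)start, return only the batches that have not completed yet."""
    done = completed_store.get(build_id, set())
    return [b for b in all_batches if b not in done]

mark_completed("build-42", "batch-1")
mark_completed("build-42", "batch-3")
print(batches_to_run("build-42", ["batch-1", "batch-2", "batch-3"]))
# → ['batch-2']
```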
Step 4 - Using price-optimized instances
At this point, we had a fairly stable pipeline running on 100% Spot Instances. But we still had occasional setbacks due to Spot price fluctuations:
- Some Spot Instance types became more expensive over time.
- For the same Spot Instance type, we experienced very high price volatility within a region, depending on which Availability Zone was used.
To prevent these unexpected cost increases, we started monitoring Spot prices weekly and making manual adjustments. But there was a better solution.
Since our pipeline could tolerate Spot interruptions thanks to the work done in step 3, we were able to leverage the lowest-price allocation strategy, which picks the pool where the Spot price is currently lowest but can result in more frequent interruptions. The Buildkite Elastic CI Stack only supports the capacity-optimized strategy, so we forked the template and modified it to support lowest-price allocation.
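Conceptually, the lowest-price strategy just picks the cheapest eligible pool, where a pool is an instance type in a specific Availability Zone. A toy model of that selection, with made-up prices:

```python
def pick_lowest_price_pool(spot_prices, eligible_types):
    """Pick the (instance_type, availability_zone) pool with the lowest current
    Spot price among eligible types: a toy model of the 'lowest-price'
    allocation strategy. Prices and instance types are illustrative."""
    candidates = {pool: price for pool, price in spot_prices.items()
                  if pool[0] in eligible_types}
    if not candidates:
        raise ValueError("no eligible Spot pools")
    return min(candidates, key=candidates.get)

prices = {
    ("m6i.8xlarge", "us-east-1a"): 0.98,
    ("m6i.8xlarge", "us-east-1b"): 1.42,
    ("m5.8xlarge",  "us-east-1a"): 0.91,
    ("c6i.8xlarge", "us-east-1b"): 1.05,
}
print(pick_lowest_price_pool(prices, {"m6i.8xlarge", "m5.8xlarge"}))
# → ('m5.8xlarge', 'us-east-1a')
```

The real strategy is applied by the Auto Scaling group itself; widening `eligible_types`, as we did in step 3, is what gives it more pools to choose from.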
This strategy has been working very well for us; we no longer need to review Spot prices and make manual adjustments, and our pipeline always gets the lowest-price instance for our needs. At times we end up with larger instance types than we asked for, but we still pay the lowest price available at that moment, so this isn't a concern.
Measuring the impact
When we finished this project, we hit our goal: 60% savings on EC2 compute cost and ~50% savings overall. In the early stages, we yielded some savings but degraded the developer experience by not handling Spot interruptions properly. After a few iterations, we made our infrastructure fault tolerant and extended the Spot Instance pool to improve the experience. Finally, we settled on the most optimal solution for us: the lowest-price allocation strategy.
P.S. – if you love solving similar problems, we're hiring! Click here!