
Unlocking Cost Efficiency: Using EC2 Spot Instances for GitHub Actions

In today's ever-evolving software development landscape, managing costs effectively is paramount. As many teams have discovered, using GitHub-hosted runners for Continuous Integration (CI) and Continuous Deployment (CD) can become prohibitively expensive over time. In early 2023, we faced this challenge head-on and turned to AWS for a self-hosted solution. The result was HyperEnv, a self-hosted GitHub runners solution that has gone through several iterations since. The current version, 2.9.0, introduces support for EC2 spot instances, a game changer for cost savings.

EC2 Spot Instances vs. On-Demand Instances

When moving workloads to the AWS cloud, understanding the different pricing models for virtual machines is crucial. AWS primarily offers three models:

On-Demand Instances: This is the standard pricing model where users pay an hourly fee based on the instance type. For instance, an m5.large instance (with 2 vCPUs and 8 GiB of memory) costs about $0.0960 per hour in the us-east-1 region.
Spot Instances: These instances draw on AWS's unused capacity and are offered at significant discounts (up to 90% according to AWS), with prices that fluctuate based on supply and demand. For example, as of this writing, an m5.large spot instance costs only $0.0348 per hour in us-east-1. However, there's a caveat: AWS can reclaim a spot instance at any time, giving only a two-minute warning. This is known as a spot interruption.
Savings Plans: This is a more predictable approach, where users commit to a consistent amount of usage (measured in dollars per hour) for a one- or three-year term in exchange for a discount. This is ideal for steady workloads that can be planned out for the long term.

Employing spot instances for your GitHub runners can cut costs by approximately 60% compared to running on-demand instances, but there are essential considerations to keep in mind.
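
To make the arithmetic behind that estimate explicit, here is a minimal sketch that computes the discount from the two m5.large prices quoted above. The spot price is a snapshot and will differ when you read this; the function name is only illustrative.

```python
# Hourly prices for m5.large in us-east-1, as quoted in this article.
ON_DEMAND_HOURLY = 0.0960
SPOT_HOURLY = 0.0348  # snapshot only; spot prices fluctuate


def spot_discount(on_demand: float, spot: float) -> float:
    """Return the saving of spot relative to on-demand as a fraction (0..1)."""
    if on_demand <= 0:
        raise ValueError("on-demand price must be positive")
    return (on_demand - spot) / on_demand


discount = spot_discount(ON_DEMAND_HOURLY, SPOT_HOURLY)
print(f"Spot discount: {discount:.0%}")  # prints "Spot discount: 64%"
```

With these two prices the discount comes out slightly above 60%, which is where the rough figure above comes from.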

Ephemeral vs. Long-Running Runners

When leveraging AWS for GitHub runners, there are three different strategies:

Long-running: Keep an EC2 instance alive 24/7, which can become cost-inefficient.
Auto-scaled: Adjust the number of instances automatically based on job queue lengths.
Ephemeral: Launch instances on-the-fly for each job and terminate them right after.

The ephemeral approach keeps costs low because an instance exists only while its job is running, and builds typically take only 5 to 15 minutes. The short runtime also reduces the risk of interruption, since AWS is less likely to reclaim a spot instance within such a brief window. That makes ephemeral runners an ideal fit for spot instances.
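
As a rough illustration of the difference between the long-running and ephemeral strategies, the sketch below compares monthly compute cost using the m5.large prices quoted earlier. The workload (500 builds per month at 10 minutes each) and the 730-hour month are assumptions chosen for the example, and the estimate ignores boot time, storage, and data transfer.

```python
# Hourly prices for m5.large in us-east-1, as quoted in this article.
ON_DEMAND_HOURLY = 0.0960
SPOT_HOURLY = 0.0348

HOURS_PER_MONTH = 730.0  # assumption: average month length in hours


def long_running_monthly_cost(hourly_price: float) -> float:
    """Cost of keeping one instance alive for the whole month."""
    return hourly_price * HOURS_PER_MONTH


def ephemeral_monthly_cost(hourly_price: float, builds: int, minutes_per_build: float) -> float:
    """Cost of launching one instance per build and terminating it afterwards."""
    return hourly_price * builds * minutes_per_build / 60.0


# Hypothetical workload: 500 builds per month, 10 minutes each.
always_on = long_running_monthly_cost(ON_DEMAND_HOURLY)
ephemeral_spot = ephemeral_monthly_cost(SPOT_HOURLY, builds=500, minutes_per_build=10)

print(f"Long-running on-demand: ${always_on:.2f} per month")
print(f"Ephemeral spot:         ${ephemeral_spot:.2f} per month")
```

The gap narrows as build volume grows, so it is worth plugging in your own numbers before drawing conclusions.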

Implementing a Fallback Strategy

While the cost benefits of spot instances are compelling, their availability can fluctuate. You may occasionally encounter situations where spot instances are unavailable in your chosen availability zone. To counteract this, it's wise to adopt a fallback strategy that automatically deploys on-demand instances when spot instances can't be provisioned.

To implement this, use AWS Auto Scaling and CloudWatch metrics to monitor spot instance availability. Based on this data, your Auto Scaling group can effectively manage capacity, ensuring reliable access to GitHub runners. The fallback mechanism ensures operational continuity and helps you capitalize on cost savings whenever possible.
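
Independent of which AWS service drives it, the core of a fallback strategy is a small piece of control flow. The following is a minimal sketch of that logic, not HyperEnv's implementation: the launcher is passed in as a function, and CapacityError, launch_with_fallback, and fake_launch are hypothetical names introduced for the example. In a real setup the launcher would wrap an EC2 launch call and translate capacity errors into CapacityError.

```python
from typing import Callable, Sequence, Tuple


class CapacityError(Exception):
    """Raised by a launcher when the requested capacity is unavailable."""


# Hypothetical launcher signature: (market, availability_zone) -> instance ID.
Launcher = Callable[[str, str], str]


def launch_with_fallback(launch: Launcher, zones: Sequence[str]) -> Tuple[str, str, str]:
    """Try a spot launch in each zone in order, then fall back to on-demand.

    Returns (instance_id, market, zone). A CapacityError raised by the final
    on-demand attempt propagates to the caller.
    """
    if not zones:
        raise ValueError("at least one availability zone is required")
    for zone in zones:
        try:
            return launch("spot", zone), "spot", zone
        except CapacityError:
            continue  # no spot capacity in this zone, try the next one
    # Every spot attempt failed: pay the on-demand price rather than block the job.
    return launch("on-demand", zones[0]), "on-demand", zones[0]


def fake_launch(market: str, zone: str) -> str:
    """Stand-in launcher for the demo: spot capacity exists only in us-east-1c."""
    if market == "spot" and zone != "us-east-1c":
        raise CapacityError(f"no spot capacity in {zone}")
    return f"i-demo-{market}-{zone}"


if __name__ == "__main__":
    print(launch_with_fallback(fake_launch, ["us-east-1a", "us-east-1b", "us-east-1c"]))
```

The sketch retries on-demand only in the first zone to stay short; trying on-demand in every zone would be a natural extension.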

Interruption Resilience in GitHub Workflows

While spot instances can provide substantial savings, not every GitHub Actions job can tolerate interruptions. Jobs such as running unit tests or building artifacts can usually be restarted without harmful consequences. For operations like terraform apply, however, an interruption may leave the state corrupted and cause subsequent runs to fail. It's therefore critical to be able to configure, at the job level, whether a job may run on a spot instance.
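
One way to express such a job-level choice is through runner labels, which GitHub includes in the workflow_job webhook payload under workflow_job.labels. The sketch below assumes a hypothetical opt-in label named "spot"; the label name and the function are illustrative and not necessarily how HyperEnv handles it. Jobs without the label default to on-demand, so an interruption-sensitive job is never placed on spot by accident.

```python
from typing import Any, Mapping

SPOT_LABEL = "spot"  # hypothetical opt-in label, e.g. runs-on: [self-hosted, spot]


def market_for_job(event: Mapping[str, Any]) -> str:
    """Pick the purchase option for a job from its workflow_job webhook payload.

    Returns "spot" only if the job explicitly carries the opt-in label;
    everything else runs on-demand.
    """
    job = event.get("workflow_job") or {}
    labels = {str(label).lower() for label in job.get("labels") or []}
    return "spot" if SPOT_LABEL in labels else "on-demand"


# Trimmed example payloads; real payloads contain many more fields.
unit_tests = {"action": "queued", "workflow_job": {"labels": ["self-hosted", "spot"]}}
terraform_apply = {"action": "queued", "workflow_job": {"labels": ["self-hosted"]}}

print(market_for_job(unit_tests))       # spot
print(market_for_job(terraform_apply))  # on-demand
```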

Designing Your AWS Architecture

To create a seamless setup for running ephemeral GitHub Actions on EC2 spot instances with an automatic fallback, consider an architecture like the following:

An API Gateway receives HTTP requests from GitHub.
This gateway triggers a Lambda function (named webhook, for example) that verifies incoming webhook events.
Upon verification, this function starts a Step Function, orchestrating the runner operations.
The Step Function invokes another Lambda function to launch a spot instance.
If a spot instance request fails, the Step Function attempts to launch a spot instance in another availability zone.
Should all spot attempts fail, it ultimately defaults to launching an on-demand instance.
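
The verification step in the second item can be sketched as follows. GitHub signs each delivery with an HMAC-SHA256 of the request body, using the webhook secret, and sends it in the X-Hub-Signature-256 header; the event type arrives in X-GitHub-Event. The function names and the decision to react only to queued workflow_job events are assumptions made for this example, and a real handler would go on to start the Step Function instead of returning a boolean.

```python
import hashlib
import hmac
import json
from typing import Mapping


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check the X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature_header.encode())


def should_start_runner(secret: bytes, headers: Mapping[str, str], body: bytes) -> bool:
    """Return True only for authentic workflow_job events with action "queued"."""
    lowered = {key.lower(): value for key, value in headers.items()}
    if not verify_signature(secret, body, lowered.get("x-hub-signature-256", "")):
        return False
    if lowered.get("x-github-event") != "workflow_job":
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("action") == "queued"


if __name__ == "__main__":
    secret = b"demo-secret"  # placeholder; load the real secret from secure storage
    body = json.dumps({"action": "queued", "workflow_job": {"labels": ["self-hosted"]}}).encode()
    signature = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    headers = {"X-Hub-Signature-256": signature, "X-GitHub-Event": "workflow_job"}
    print(should_start_runner(secret, headers, body))
```

Verifying against the raw bytes matters: re-serializing the parsed JSON before hashing would change the body and make valid signatures fail.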

Architecture Diagram

Try HyperEnv for Your Needs!

If you're seeking a production-ready and well-maintained solution for managing self-hosted GitHub Actions runners on AWS, HyperEnv is your go-to option. It combines the advantages of cost savings with the reliability that modern software development demands. Give it a try and start optimizing your build costs today!