The ultimate guide to cloud cost optimization


The cloud’s scalability is both a blessing and a curse. Sure, you can experiment with new ideas without having to worry about getting another rack of servers. But if you’re reading this, you know there’s a price to pay for this comfort.

Overprovisioning and cloud sprawl are real. They will make even a tech giant CFO’s eye twitch at the end of the quarter.

Take Pinterest as an example. During one holiday season, increased usage pushed the company’s cloud bill well past its initial estimates, and Pinterest had to pay AWS $20 million on top of the $170 million worth of resources it had already reserved.

The only way you can deal with the long-term cost implications of the cloud is by implementing cloud cost optimization. And if you don’t want optimization to become a drag on your engineering team, automating it is the only move that gets you there.

Check out this guide to optimizing cloud costs step-by-step:

  1. Know what you can win by optimizing cloud costs
  2. Start by understanding your cloud bill
  3. Choose the best compute resources for your application
  4. Achieve greater savings with spot instances
  5. Don’t get lured by the promise of savings plans
  6. Pick the right tool for the job
  7. Cloud automation opens the doors to the greatest savings
  8. What you’re doing in the cloud could be done for 50% less

1. Know what you can win by optimizing cloud costs 

Is optimizing cloud costs worth your time? Take a look at the optimization gains reported by companies in communication, entertainment, SaaS, and e-commerce:

  • In Q1 2021, Zoom reported that its gross margin widened to 73.9% from 69.4% in the previous quarter primarily because of the effort invested in optimizing public cloud resources.
  • Spotify built a custom tool called Cost Insights to track cloud expenses and encourage engineers to take ownership of the cloud spend, reducing its annual cloud spend by millions of dollars.
  • By making a series of smart, incremental infrastructure optimization decisions, Segment increased its gross margin by 20% and reduced its infrastructure costs by 30% while handling 25% more traffic, all within three months.
  • The e-commerce startup La Fourche saw its cloud bill rise dramatically and ran the CAST AI Savings Report to find optimization opportunities. By turning automated optimization on, the company reduced its monthly cloud bill by 69.9% without increasing engineer workload.

Now that you know that it’s worth playing the optimization game, let’s see what methods teams choose to prevent their cloud costs from spiraling out of control.

2. Start by understanding your cloud bill

Take a look at your cloud bill and you’re likely to get lost. 

Bills are long, complex, and hard to unpack because every service has its own billing metric. Understanding your usage well enough to make a confident decision is next to impossible.

And we’re talking about analyzing costs for only one cloud and one team. Try billing for multiple teams or clouds!

This is where cost allocation comes in and reveals who is using which resources. How else can you make anyone accountable for these costs? Cost allocation is especially challenging in dynamic infrastructures running on Kubernetes.
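
As a minimal sketch of what cost allocation means in practice, the Python snippet below sums billing line items per team tag. The line items, team names, and the `allocate_costs` helper are all hypothetical; a real billing export has many more columns and its own tagging scheme.

```python
from collections import defaultdict

# Hypothetical billing line items: (service, team tag, cost in USD).
# A real billing export has far more columns; these values are made up.
LINE_ITEMS = [
    ("compute", "checkout", 1200.0),
    ("storage", "checkout", 300.0),
    ("compute", "search", 800.0),
    ("compute", None, 450.0),  # untagged resource
]


def allocate_costs(line_items):
    """Sum costs per team; untagged spend lands in an 'unallocated' bucket."""
    totals = defaultdict(float)
    for _service, team, cost in line_items:
        totals[team or "unallocated"] += cost
    return dict(totals)


if __name__ == "__main__":
    for team, cost in sorted(allocate_costs(LINE_ITEMS).items()):
        print(f"{team}: ${cost:,.2f}")
```

The size of the "unallocated" bucket is often the first thing worth looking at: spend that nobody owns is spend that nobody is accountable for.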

Why is it worth examining and allocating costs based on your cloud bill? Because it’s a treasure trove of data that will help you forecast your requirements better and secure the right amount of resources (and avoid the curse of overprovisioning!).

But estimating your future resource demands is no small feat.

Here’s an example sequence you may follow:

  1. Gain visibility and analyze your usage reports to identify any patterns in spending.
  2. Detect peak resource usage scenarios by running periodic analytics on your historical usage data.
  3. Take seasonal customer demand patterns into account and check if they correlate with your peak resource usage. If they do, anticipating those peaks gets a bit easier.
  4. Make sure to monitor resource usage reports regularly and set up alerts to keep cloud costs in check.
  5. Create an application-level cost plan by measuring application- or workload-specific costs. This will also open the doors to calculating the total cost of ownership of your cloud infrastructure. 
  6. Next, take a look at the pricing models of your cloud providers and plan capacity requirements over time. Putting all of this data in one place makes understanding your costs easier.

The tasks listed above aren’t one-off jobs. You need to repeat them regularly to get results.
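
Two of the steps above, detecting usage peaks and setting up alerts, can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the daily cost figures are invented, a peak is defined as a day more than two standard deviations above the mean, and the `find_spend_peaks` and `over_budget` helpers and the 80% alert ratio are hypothetical choices rather than anything a cloud provider prescribes.

```python
from statistics import mean, pstdev


def find_spend_peaks(daily_costs, z_threshold=2.0):
    """Return indices of days whose cost sits more than z_threshold
    standard deviations above the mean of the series."""
    avg = mean(daily_costs)
    spread = pstdev(daily_costs)
    if spread == 0:
        return []  # perfectly flat spend, nothing to flag
    return [
        day for day, cost in enumerate(daily_costs)
        if (cost - avg) / spread > z_threshold
    ]


def over_budget(month_to_date, monthly_budget, alert_ratio=0.8):
    """True once month-to-date spend reaches alert_ratio of the budget."""
    return month_to_date >= alert_ratio * monthly_budget


if __name__ == "__main__":
    # Made-up daily costs in USD with one obvious spike on the last day.
    costs = [100, 100, 100, 100, 100, 100, 100, 100, 100, 400]
    print("Peak days:", find_spend_peaks(costs))
    print("Alert:", over_budget(sum(costs), monthly_budget=1500))
```

Running a check like this on a schedule, rather than once, is what turns it from a report into a guardrail.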

Learn more about how to analyze your cloud bill here: Surprised by your cloud bill? 5 common issues & how to deal with them

3. Choose the best compute resources for your application

Choosing the right virtual machine can be a huge game-changer if your application relies on compute. But AWS alone has almost 400 different instance types. Similar instance types deliver different performance across cloud providers – and even in the same cloud, a more expensive instance doesn’t necessarily mean higher performance.

1. Define your minimum requirements 

Make sure to do it across all compute dimensions, including CPU (architecture, count, and choice of processor), memory, SSD storage, and network connectivity.

2. Select the right instance type 

Instance types package various combinations of CPU, memory, storage, and networking capacity, and each one is typically optimized for one of these capabilities.

3. Set the size of your instance 

Remember that the instance should have enough capacity to accommodate your workload’s requirements and include options like bursting if necessary.

4. Examine different pricing models 

The three major cloud providers offer different pricing models: on-demand (pay-as-you-go), reserved capacity, spot instances, and dedicated hosts. Each of these options has its advantages and drawbacks. They’re covered in detail in this guide: How to choose the best VM type for the job and save on your cloud bill
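
The selection steps above boil down to filtering a catalog by your minimum requirements and then comparing prices. Here is a minimal Python sketch of that idea. The catalog, the instance names, and the prices are entirely made up for illustration, and `cheapest_fit` is a hypothetical helper that only looks at vCPUs and memory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gib: float
    hourly_usd: float


# Hypothetical catalog: names and prices are invented for this example.
CATALOG = [
    InstanceType("general-large", 2, 8, 0.10),
    InstanceType("general-xlarge", 4, 16, 0.20),
    InstanceType("compute-xlarge", 8, 16, 0.17),
    InstanceType("memory-xlarge", 4, 32, 0.25),
]


def cheapest_fit(catalog, min_vcpus, min_memory_gib) -> Optional[InstanceType]:
    """Return the cheapest instance type meeting the minimum requirements,
    or None when nothing in the catalog is large enough."""
    candidates = [
        inst for inst in catalog
        if inst.vcpus >= min_vcpus and inst.memory_gib >= min_memory_gib
    ]
    return min(candidates, key=lambda inst: inst.hourly_usd, default=None)


if __name__ == "__main__":
    print(cheapest_fit(CATALOG, min_vcpus=4, min_memory_gib=16))
```

In this invented catalog, the cheapest match for 4 vCPUs and 16 GiB is the larger compute-optimized type, which is the kind of result you only notice when you compare every candidate instead of the obvious one.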

4. Achieve greater savings with spot instances

It’s smart to buy idle capacity from AWS and other large cloud providers because spot instances are up to 90% cheaper than on-demand ones. However, there is a catch: the vendor reserves the right to reclaim these resources at any moment. You need to make sure that your application is prepared for that before jumping on the spot bandwagon.

Here’s how to use spot instances:

1. Examine your workload to see if it’s ready for a spot instance

Can it withstand interruptions? How long will it take to complete the job? Is this a mission-critical workload? These and other questions help you qualify a workload for spot instances.

2. Examine the services of your cloud provider

It’s a good idea to look at less popular instances because they’re less likely to be interrupted and can operate for longer periods of time. Check the frequency of interruption of an instance before settling on it.

3. Now it’s time to bid

Set the highest amount you’re prepared to pay for your chosen spot instance. Note that it will only run as long as the market price stays at or below your offer. Setting the maximum price at the level of on-demand pricing is the rule of thumb here.

4. Manage spot instances in groups

That way, you’ll be able to request numerous instance types at once, increasing your chances of landing a spot instance.

To make all of the above work well, prepare to spend a lot of time on configuration, setup, and maintenance tasks (unless you decide to automate it).
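
To get a feel for the numbers before investing that effort, you can estimate the hourly cost of a fleet that mixes spot and on-demand capacity. The Python sketch below is a back-of-the-envelope calculation only: the prices, the 70% discount, and the `blended_hourly_cost` helper are assumptions for illustration, and it ignores the cost of interruptions and restarts.

```python
def blended_hourly_cost(on_demand_price, spot_discount, spot_fraction, instance_count):
    """Estimate the hourly cost of a fleet that runs spot_fraction of its
    instances on spot capacity priced at a discount to on-demand."""
    if not 0 <= spot_discount <= 1:
        raise ValueError("spot_discount must be between 0 and 1")
    if not 0 <= spot_fraction <= 1:
        raise ValueError("spot_fraction must be between 0 and 1")
    spot_price = on_demand_price * (1 - spot_discount)
    spot_instances = instance_count * spot_fraction
    on_demand_instances = instance_count - spot_instances
    return spot_instances * spot_price + on_demand_instances * on_demand_price


if __name__ == "__main__":
    # Hypothetical figures: $0.20/hour on-demand, spot at a 70% discount.
    all_on_demand = blended_hourly_cost(0.20, 0.70, 0.0, 10)
    half_spot = blended_hourly_cost(0.20, 0.70, 0.5, 10)
    print(f"All on-demand: ${all_on_demand:.2f}/hour")
    print(f"Half on spot:  ${half_spot:.2f}/hour")
```

Keeping part of the fleet on on-demand capacity is a common way to trade some of the savings for protection against interruptions.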

Want to learn more about spot instances? Here’s a complete guide: Spot instances: How to reduce AWS, Azure, and GCP costs by 90%

5. Don’t get lured by the promise of savings plans

Reserving capacity for one or three years in advance at a much cheaper rate seems like an interesting option. Why not buy capacity in advance when you know that you’ll be using the service anyway?

But like anything else in the world of the cloud, this only seems easy.

You already know that forecasting cloud costs is hard. Even companies that have entire teams dedicated to cloud cost optimization miss the mark here. 

How are you meant to plan ahead for capacity when you have no clue how much your teams will require in one or three years? This is the main issue with products like reserved instances and savings plans.

Here are a few things you should know about reserving capacity:

  • A reserved instance works on a “use it or lose it” basis – every hour that it sits idle is an hour lost to your team (along with any financial benefits you might have secured).
  • When you commit to specific resources or levels of consumption, you assume that your needs won’t change throughout the contract’s duration. But even one year of commitment is an eternity in the cloud. And when your requirements go beyond what you reserved, you’ll have to pay the price – just like Pinterest did. 
  • When confronted with a new issue, your team may be forced to commit to even more resources. Or you’ll find yourself with underutilized capacity that you’ve already paid for. In both scenarios, you’re on the losing end of the game.
  • By entering into this type of contract with a cloud service provider, you risk vendor lock-in – i.e. becoming dependent on that provider (and whatever changes they introduce) for the next year or three. 
  • Selecting optimal resources for reservation is complex (just check out point 3 above in this article).
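
The “use it or lose it” point can be made concrete with a little arithmetic. Since you pay the reserved rate for every hour whether or not you use it, the effective price of each hour you actually use is the reserved rate divided by your utilization. The Python sketch below assumes a simple model with one flat reserved rate and no upfront payment; the rates and both helper functions are hypothetical.

```python
def effective_hourly_rate(reserved_rate, utilization):
    """Price per hour actually used when reserved_rate is paid for every
    hour but the capacity is only used for a fraction of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return reserved_rate / utilization


def break_even_utilization(on_demand_rate, reserved_rate):
    """Utilization below which the reservation costs more than on-demand."""
    return reserved_rate / on_demand_rate


if __name__ == "__main__":
    # Hypothetical rates: $0.10/hour on-demand, $0.06/hour reserved.
    on_demand, reserved = 0.10, 0.06
    print(f"Break-even utilization: {break_even_utilization(on_demand, reserved):.0%}")
    for utilization in (1.0, 0.75, 0.5):
        rate = effective_hourly_rate(reserved, utilization)
        print(f"At {utilization:.0%} utilization: ${rate:.3f} per used hour")
```

With these made-up rates, a reservation that is used only half the time costs more per used hour than simply paying the on-demand price.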

The above is just the tip of the iceberg. We wrote an entire article that dives into the details of reserved instances: Do AWS Reserved Instances and Savings Plans really reduce costs?

6. Pick the right tool for the job

To gain control over their cloud expenses, companies apply various cost management and optimization strategies and solutions in tandem:

  1. Cost visibility and allocation – Using a variety of cost allocation, monitoring, and reporting tools, you can figure out where the expenses are coming from. Real-time cost monitoring is especially useful here since it instantly alerts you when you’re going over a set threshold. A computing operation left running on Azure resulted in an unanticipated cloud charge of over $500k for one of Adobe’s teams. One alert could have prevented this.
  2. Cost budgeting and forecasting – You can estimate how many resources your teams will need and plan your budget if you’ve crunched enough historical data and have a fair idea of your future requirements. Sounds simple? It’s anything but – Pinterest’s story shows that really well.
  3. Legacy cost optimization solutions – This is where you combine all of the information you got in the first two points to create a complete picture of your cloud spend and discover potential candidates for improvement. Many solutions on the market can assist with that, like Cloudability or VMware’s CloudHealth. But most of the time, all they give you is static recommendations for engineers to implement manually.
  4. Automated, cloud native cost optimization – This is the most powerful solution for reducing cloud costs you can use. This type of optimization doesn’t require any extra work from teams and results in round-the-clock savings of 50% and more, even if you’ve been doing a great job optimizing manually. A fully autonomous and automated solution that can react quickly to changes in resource demand or pricing is the best approach here. 

Should we continue to rely on software engineers to do all the management and optimization tasks manually? 

Not with so many automation options at hand!

7. Cloud automation opens the doors to the greatest savings

As you can tell from the points above, manual cost optimization is a complex and time-consuming process. 

And regardless of the skill level of engineers, many of the cost optimization tasks are just not suited for humans. 

Try allocating, understanding, analyzing, and forecasting cloud expenses by hand and you’ll see how hard it is. Then you need to make infrastructure adjustments, investigate pricing plans, spin up more instances, and do a variety of other tasks to create a cost-effective infrastructure.

Automation takes many of these tasks off your plate.

Apart from getting rid of all the tasks above, an automated solution adds more value because it:

  • Selects the most cost-effective instance types and sizes to meet your application’s needs.
  • Automatically scales your cloud resources up and down to cope with demand spikes and drops.
  • Removes resources that aren’t in use to eliminate waste.
  • Makes use of spot instances and gracefully manages disruptions.
  • Automates storage and backups, security and compliance management, and changes to configurations and settings to help you save money in other areas.

Most importantly, an automated platform implements all of these modifications in real time, mastering the point-in-time nature of cloud cost optimization.

Automation takes advantage of things you’d never think to check

We used a combination of AWS On-Demand and spot instances to operate our application running on 8 CPUs and 16 GB of RAM. 

Then we decided to run it through CAST AI to check if our configuration was optimized. The platform suggested moving to an Inf1 spot instance. But wait, isn’t that a pricey instance type specialized for machine learning inference?

As it turned out, that instance was at that time actually cheaper than the general-purpose compute instances we were using. We would have lost out on this hidden gem without automation.

8. What you’re doing in the cloud could be done for 50% less

You already learned from examples like Zoom, Spotify, or La Fourche that reducing cloud costs can have a significant impact on your bottom line.

The low-hanging fruit here is cloud cost optimization. But standard tactics such as expense monitoring and reporting will only get you halfway there, and at a significant cost in engineering time.

Discover what automated cloud cost optimization can do for your business. Book a quick call with CAST AI – the #1 cloud optimization platform for Kubernetes.
Did we mention how fast CAST AI can get you results? You wouldn’t believe us if we told you the actual number, but 30 minutes is enough to show you the platform’s key capabilities. Book a demo and see the platform in action.

P.S. You can always check the product out on your own terms and even run a free report. Simply register here to get started.

Leon Kuperman

Leon is co-founder and CTO at CAST AI. Formerly Vice President of Security Products OCI at Oracle, Leon has 20+ years of experience spanning companies such as IBM, Truition, and HostedPCI. He founded and served as the CTO of Zenedge (acquired by Oracle).