Cloud Costs in Practice: What Actually Helped Reduce Spend
FinOps Lessons from Running EKS, EC2, RDS, and Supporting Platform Services on AWS
1. Cost Work Came From a Different Cloud Estate, but the Lesson Was the Same
Most of the earlier posts in this series focused on Azure, AKS, private networking, platform separation, GitOps, and operating model design. This post comes from a different environment on AWS, but I have kept it in the portfolio because it shaped my thinking in exactly the same way. The cloud provider was different. The underlying lesson was not.
Cloud cost is usually discussed as if it belongs to finance, procurement, or a reporting function on the edge of engineering. In practice, the biggest savings I saw came from platform decisions, workload behavior, and the discipline to distinguish between what the system genuinely needed and what it was simply carrying by habit.
That mattered a lot in an AWS estate built around EKS, EC2, RDS, S3, and a mix of supporting platform services. By the time cost became a serious topic, the bill was already large enough that vague optimization advice was not going to help. Nobody needed another reminder to "be mindful of spend." What we needed was a clearer view of which parts of the platform were predictably valuable, which parts were genuinely variable, and which parts were just expensive because nobody had challenged them properly.
The useful part of FinOps, at least in this environment, was not the label. It was the discipline of making cost legible to engineers.
2. FinOps Was Useful Only When It Stopped Being Abstract
I have never found FinOps especially helpful when it is treated as a parallel management activity full of dashboards, allocation models, and generic cost optimization advice. It becomes useful when it is tied directly to how the platform actually behaves.
That meant starting from engineering reality rather than finance language.
The platform had a normal set of cost drivers for a modern SaaS environment. EKS provided the orchestration layer. EC2 carried a large portion of the compute footprint, including worker capacity and other supporting workloads. Aurora PostgreSQL sat under a meaningful part of the application. S3 stored a very large amount of data. GitLab runners and a few heavier job types introduced their own kind of bursty compute demand. None of that was surprising. What mattered was understanding how much of it was steady, how much of it was seasonal or bursty, and how much of it existed because past decisions had simply never been revisited.
That is where FinOps stopped sounding like a corporate program and started sounding like engineering work. Once the discussion moved away from "reduce the monthly bill" and toward "separate baseline, burst, and waste," the decisions became much easier to defend.
3. Cost Was Not the Problem at First. Visibility Was.
When cloud spend starts climbing, the first instinct is often to look for savings instruments, new tooling, or provider discounts. Those can help, but they are not where I started.
The first real problem was visibility.
Without a reliable picture of where spend was going, most optimization effort turns into guesswork. You can right-size a handful of instances and still miss the much larger pattern. You can talk about reservations before you understand the steady floor of the platform. You can argue about whether Kubernetes is expensive without knowing whether the problem is actually EKS itself, oversized node groups, idle environments, or storage growth that nobody is watching closely enough.
The cost work only became productive once it was possible to answer practical questions quickly. Which parts of the platform were stable enough to commit to? Which services or environments were disproportionately expensive? Which resources were heavily used during working hours but mostly idle at night? Which line items reflected deliberate architecture decisions, and which ones were just leftovers from earlier stages of the platform?
That visibility came from usage history, environment knowledge, and cost breakdowns that engineering teams could actually map back to real workloads. It was much less glamorous than a FinOps pitch deck and much more useful.
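To make that concrete, the shape of the breakdown mattered more than the tooling. A minimal sketch with boto3 and Cost Explorer, grouping spend by service (or by a cost-allocation tag once tagging is trustworthy), is enough to start answering those questions. The dates and the noise threshold below are placeholders, not values from this environment.

```python
import boto3

# Sketch: one month of spend grouped by service, assuming Cost Explorer
# is enabled and credentials are configured.
ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer lives in us-east-1

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},  # placeholder dates
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    # Swap in {"Type": "TAG", "Key": "environment"} once tagging is reliable.
)

for group in resp["ResultsByTime"][0]["Groups"]:
    name = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    if amount > 100:  # ignore noise; tune the threshold to the estate
        print(f"{name:60s} ${amount:,.2f}")
```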
4. Usage Patterns Told Us More Than the Invoice Did
One of the more useful things about this environment was that the application usage profile was not random. It was a SaaS platform in a construction context, which meant traffic was strongly tied to working hours.
During the day, especially between roughly 8 AM and 8 PM, usage was predictably higher. Evenings dropped off. Weekends were materially quieter. That pattern mattered because it told us something the invoice alone could not: a meaningful part of the compute footprint was steady enough to plan around, but not everything needed to be paid for at on-demand rates all the time.
This is where a lot of cost work goes wrong. Teams jump straight from "the bill is high" to "we should optimize everything" without separating baseline demand from burst demand. Once those two are mixed together, almost every decision gets worse. You under-commit because peak usage looks scary, or you over-commit because the platform is large and the discounts look attractive.
Historical EKS usage trends were particularly useful here. Looking at node usage over time gave a much more honest picture of what the platform consistently needed in order to operate safely and what only showed up during predictable peaks or occasional spikes. That made later decisions around Reserved Instances much less speculative.
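As a rough illustration of what "looking at node usage over time" means in practice, here is a sketch that pulls hourly in-service instance counts for the Auto Scaling group behind one node group. It assumes group metrics collection is enabled on the ASG, and the group name is hypothetical.

```python
import boto3
from datetime import datetime, timedelta
from statistics import median

# Sketch: hourly in-service node counts for the ASG behind one node group.
# GetMetricStatistics caps out at 1,440 datapoints, so pull longer history
# in chunks.
cw = boto3.client("cloudwatch")

resp = cw.get_metric_statistics(
    Namespace="AWS/AutoScaling",
    MetricName="GroupInServiceInstances",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "eks-app-workers"}],
    StartTime=datetime.utcnow() - timedelta(days=30),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Average"],
)

counts = sorted(dp["Average"] for dp in resp["Datapoints"])
if counts:
    print("min nodes:   ", counts[0])       # the floor the platform never drops below
    print("median nodes:", median(counts))  # typical steady demand
    print("max nodes:   ", counts[-1])      # peaks; not what you commit to
```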
The important step was not identifying the highest traffic hour. It was understanding the floor of the platform well enough to commit to it confidently.
5. Reserved Instances for EC2 Had the Biggest Financial Impact
The single most effective cost measure in this environment was Reserved Instances for the EC2 footprint that represented the platform's steady baseline.
A large portion of the compute layer sat on compute-optimized instances in the c5 family. Those were not chosen because they were fashionable. They matched the actual workload profile well enough that they had become the normal shape of a lot of the platform's compute demand. Once usage history made it clear that a substantial amount of that demand was persistent rather than occasional, keeping it all on on-demand pricing stopped making sense.
The useful part here was not the advice to "buy Reserved Instances." Anyone can say that. The real work was identifying how much of the EC2 footprint was stable enough to reserve without painting the platform into a corner.
That is a more careful decision than it sounds. Overcommitting can be just as bad as staying entirely on-demand. If you reserve too aggressively, you lock yourself into assumptions the platform may outgrow or invalidate. If you avoid reservations entirely because uncertainty feels safer, you end up paying on-demand rates for capacity that is effectively permanent.
What worked well was reserving the baseline rather than the peaks. Some commitments were made on 1-year terms and some on 3-year terms, depending on how stable the underlying usage looked. The point was not to maximize the reservation percentage for its own sake. The point was to cover the part of the platform we were already confident would exist regardless of day-to-day traffic variation.
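The arithmetic behind "reserve the baseline" is simple enough to sketch. The idea is to commit near the floor of hourly usage rather than the average. The numbers below are illustrative only, not our actual history.

```python
# Sketch: size a reservation from the floor of hourly usage, not the mean.
# hourly_counts would normally come from CloudWatch or CUR data.
hourly_counts = sorted([14, 14, 15, 18, 22, 25, 24, 19, 15, 14] * 300)


def percentile(sorted_data, p):
    """Nearest-rank percentile of an already-sorted list."""
    idx = max(0, round(p / 100 * len(sorted_data)) - 1)
    return sorted_data[idx]


baseline = percentile(hourly_counts, 5)    # what exists even in quiet hours
peak = percentile(hourly_counts, 95)       # what only exists at the top

# Commit to the baseline; leave the difference on flexible pricing.
print(f"reserve ~{baseline} instances, keep ~{peak - baseline} on-demand")
```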
That produced the largest savings because it addressed the biggest recurring line item without relying on risky architectural change.
6. RDS Was an Easier Commitment Than Compute
If EC2 required some judgment, the database layer required very little.
Aurora PostgreSQL was carrying a meaningful and relatively steady portion of the platform's workload, and the database shape was much less burst-driven than parts of the application tier. In this kind of environment, that matters. Stateless compute often moves around. Database capacity tends to change more slowly and with more caution.
That made the reservation decision simpler.
For the Aurora PostgreSQL footprint, a 1-year reservation on db.r5.2xlarge was a very easy win. The operational risk was low because the demand was stable and the database was not the kind of component that was likely to disappear or shrink dramatically in the near term. It was exactly the kind of spend that should not have been living at full on-demand pricing once the usage pattern was clear.
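For what it is worth, comparing the reserved offering against on-demand is a one-call exercise. A sketch with boto3 follows; the engine string and region are assumptions, so verify both against your own account.

```python
import boto3

# Sketch: price the 1-year reserved offerings for the class we were running,
# to set against on-demand rates.
rds = boto3.client("rds", region_name="eu-west-1")  # placeholder region

offerings = rds.describe_reserved_db_instances_offerings(
    DBInstanceClass="db.r5.2xlarge",
    Duration="1",                           # 1-year term
    ProductDescription="aurora-postgresql",  # assumed engine string
)

for o in offerings["ReservedDBInstancesOfferings"]:
    hourly = sum(c["RecurringChargeAmount"] for c in o["RecurringCharges"])
    print(f"{o['OfferingType']:<20} upfront=${o['FixedPrice']:>10,.2f} "
          f"hourly=${hourly:.4f}")
```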
I think this is one of the more practical parts of FinOps that gets lost when people only talk at the portfolio level. Different parts of the platform deserve different commitment strategies. Databases are not application node groups. Bursty job runners are not Aurora. Treating them all as one commitment problem is a good way to make mediocre decisions in every direction.
The database layer was a good reminder that cost optimization improves when the platform is discussed as a system of workloads with different behaviors, not as one giant number.
7. DoiT Helped on the Remaining On-Demand Usage
Even after reservations, there was still a meaningful amount of on-demand usage that was not sensible to commit to.
That included the kinds of workloads most platforms always have some amount of: bursty usage, less predictable demand, and capacity that would have been risky to lock into a long commitment. This is where DoiT was useful.
It was not the biggest lever in the environment, and I would not pretend otherwise. The larger savings came from getting the commitment strategy right on the steady-state compute and database footprint. But for the remaining on-demand capacity, DoiT helped deliver roughly 10% savings without forcing awkward engineering changes just to chase a discount.
That was valuable precisely because it addressed the part of the bill that reservations were never meant to solve.
I think this is an important point because cost stories often become too clean in hindsight. They make it sound as if one strategy solved everything. It did not. Reservations were right for baseline demand. DoiT helped with the still-variable on-demand layer. Those were complementary decisions, not competing ones.
The engineering lesson was simple: do not force a financial mechanism to solve the wrong class of usage.
8. Storage Tiering Mattered More Than People Expected
Storage was another major cost area, especially once the S3 footprint moved past 200 TB.
At that scale, it stops making sense to talk about storage as one flat bucket of data. Different data has different access patterns, different business value, and different expectations around retrieval speed. If all of it sits in the same storage class indefinitely, the platform is paying for convenience it does not actually need.
Lifecycle policies made a real difference here because they introduced a more honest relationship between access pattern and storage cost. Frequently used data could remain where it needed to remain. Less frequently accessed data could move to cheaper tiers. Rarely accessed or archival material could move much further down the cost curve.
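A lifecycle policy of this kind is short enough to show in full. This is a sketch, not our actual configuration: the bucket, prefix, and day thresholds are placeholders, and the storage classes are standard S3 tiers chosen for illustration.

```python
import boto3

# Sketch: tie storage class to access pattern with a lifecycle rule.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-platform-data",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-analytics-exports",
                "Status": "Enabled",
                "Filter": {"Prefix": "exports/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "ONEZONE_IA"},  # cheaper, single-AZ
                    {"Days": 90, "StorageClass": "GLACIER"},     # archival
                ],
                # "Expiration": {"Days": 730},  # delete what nobody will read again
            }
        ]
    },
)
```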
For some of the archive-heavy use cases, the One Zone-IA and Glacier storage classes were a sensible fit. This was mostly data not directly needed by customers in a day-to-day operational path and more often touched by data warehouse or downstream analytical use cases. In other words, it did not carry the same retrieval expectations as customer-facing transactional data.
What mattered here was not just the storage class choice. It was acknowledging that "keep everything in the expensive tier forever" is usually a product of indecision, not of actual access requirements.
At 200 TB and beyond, even modest improvements in lifecycle policy discipline become real money. This was one of those areas where the savings were not flashy, but they were undeniable.
9. EKS Was Part of the Story, but Worker Behavior Mattered More Than the Control Plane
It is easy to blame Kubernetes itself for a high bill because it is a visible part of the platform architecture. In practice, the EKS control plane fee was not the heart of the problem. The more important questions lived underneath it.
How large were the worker footprints during the hours that mattered? How much of that size reflected real demand versus inherited assumptions? Which node groups were carrying stable application load, and which were mostly there to absorb variability that could have been treated differently?
This is where historical EKS usage trends paid off again. Once the underlying worker demand was understood, the conversation stopped being "EKS is expensive" and became much more precise. The platform was paying for a combination of baseline worker capacity, daytime peaks, and a handful of supporting workloads that behaved very differently from the core application.
That precision mattered because it prevented the wrong kind of reaction. The answer was not to make the platform fragile by squeezing worker capacity too hard. The answer was to reserve what was demonstrably stable, leave the unpredictable part flexible, and stop confusing variability with waste.
Kubernetes cost work often sounds more complicated than it really is. Most of the time, it comes back to understanding how much of the worker estate is structural and how much of it is situational.
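One quick way to see that structural/situational split is to read each node group's scaling bounds: a group whose minimum sits close to its maximum is structural capacity, while a wide gap is capacity held to absorb variability. A sketch, with the cluster name as a placeholder:

```python
import boto3

# Sketch: enumerate node groups and their scaling bounds.
eks = boto3.client("eks")
cluster = "platform-prod"  # placeholder cluster name

for ng_name in eks.list_nodegroups(clusterName=cluster)["nodegroups"]:
    ng = eks.describe_nodegroup(clusterName=cluster, nodegroupName=ng_name)["nodegroup"]
    sc = ng["scalingConfig"]
    print(
        f"{ng_name:30s} min={sc['minSize']:3d} desired={sc['desiredSize']:3d} "
        f"max={sc['maxSize']:3d} types={ng.get('instanceTypes')}"
    )
```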
10. GitLab Runners and Ephemeral Compute Were Easy Wins
Some of the cleanest cost wins came from workloads that never needed to be running continuously in the first place.
GitLab runners were a good example. A few of the job types required relatively large EC2 instances, and some workloads occasionally needed GPU-backed machines that were completely outside the normal EKS pattern. Keeping those instances alive full time would have been a very expensive way to avoid a small amount of orchestration work.
The better approach was to make them genuinely ephemeral.
Instances were brought up when a job started and shut down again after the job completed. GitLab automation handled the mechanics, which meant the platform did not rely on someone remembering to clean up expensive build infrastructure later. That mattered especially for the larger or more specialized instances, where the financial penalty for laziness would have been obvious very quickly.
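The underlying lifecycle is simple to sketch, even though the real mechanics lived in GitLab's own automation rather than in a script like this. Everything here, including the run_job stub, is a placeholder.

```python
import boto3

ec2 = boto3.client("ec2")


def run_job(instance_id: str) -> None:
    """Hypothetical stand-in for dispatching the CI job to the runner."""


# Bring the machine up only for the job, and make cleanup unconditional.
instance = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",       # placeholder runner image
    InstanceType="c5.4xlarge",             # large job types; GPU types for others
    MinCount=1,
    MaxCount=1,
    InstanceInitiatedShutdownBehavior="terminate",  # a plain shutdown cleans up too
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "lifecycle", "Value": "ephemeral"}],
    }],
)["Instances"][0]

try:
    run_job(instance["InstanceId"])
finally:
    # Terminate regardless of how the job ended; never rely on someone
    # remembering to clean up an expensive machine later.
    ec2.terminate_instances(InstanceIds=[instance["InstanceId"]])
```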
This was one of the clearest examples of a broader principle: turning something off is often more effective than endlessly optimizing something that should not have been running in the first place.
There is a certain kind of cloud waste that comes not from wrong sizing but from unnecessary runtime. Ephemeral compute is where that waste is easiest to challenge because the system itself can enforce the lifecycle instead of hoping people do the right thing manually.
11. Tagging Temporary Resources and Shutting Them Down at Midnight Helped More Than It Sounds
The same logic applied beyond GitLab runners.
Some resources were clearly temporary or non-essential outside working hours, but they still had a habit of surviving overnight simply because nobody was actively thinking about them after the workday ended. Once that pattern exists, the bill fills up with small amounts of runtime that nobody would ever defend individually and nobody gets around to removing systematically.
The simple answer was tagging and scheduled shutdown.
Resources designated as ephemeral were tagged accordingly and automatically shut down around 00:00 each day. This was not a sophisticated piece of cost engineering, but it was effective precisely because it did not depend on intention surviving the end of the day. If a resource genuinely needed to stay alive, it should not have been in the ephemeral category. If it did not need to stay alive, the platform should not have left the choice to memory.
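The shutdown job itself was nothing exotic. A sketch of the idea, assuming a lifecycle=ephemeral tag convention and an external schedule such as an EventBridge cron around 00:00:

```python
import boto3

# Sketch of the nightly cleanup: stop every running instance tagged as
# ephemeral. The tag key and value are assumptions.
ec2 = boto3.client("ec2")

pages = ec2.get_paginator("describe_instances").paginate(
    Filters=[
        {"Name": "tag:lifecycle", "Values": ["ephemeral"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

to_stop = [
    inst["InstanceId"]
    for page in pages
    for res in page["Reservations"]
    for inst in res["Instances"]
]

if to_stop:
    ec2.stop_instances(InstanceIds=to_stop)
    print(f"stopped {len(to_stop)} ephemeral instances: {to_stop}")
```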
There is a tendency in cloud cost discussions to look for sophisticated optimization first. In my experience, a lot of spend disappears once the platform gets better at enforcing lifecycle on things that were never meant to be permanent.
The platform did not become cheaper because we found a clever algorithm. It became cheaper because we stopped paying for nighttime inertia.
12. Tagging, Review Loops, and Cost Ownership Made the Savings Stick
One-off savings are easy to lose if nobody owns the follow-through.
That is why tagging mattered for more than just reporting. Resources needed enough metadata around environment, service, and ownership that cost analysis could be tied back to a real engineering conversation. If a workload was unusually expensive, it should be possible to identify it quickly. If a platform service had grown well beyond what was expected, that should be visible before the end of the quarter. If an environment cost changed materially, the right people should not discover that by accident weeks later.
Regular review loops helped keep the optimization work from turning into a one-time cleanup exercise. Daily checks, weekly summaries, and broader monthly or quarterly reviews were useful not because more meetings are inherently good, but because cost drift is rarely dramatic at the start. It accumulates. The earlier it is made visible, the easier it is to correct without disrupting platform work.
Alerts were part of that as well. The platform team and the relevant departmental leadership could see meaningful changes before they turned into unpleasant surprises. That kept cost discussions grounded in recent data rather than in stale assumptions.
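As an example of the kind of alert that kept this loop honest, an AWS Budgets notification at a percentage threshold is a reasonable sketch. The account ID, amount, and address below are placeholders.

```python
import boto3

# Sketch: a monthly cost budget with an early-warning notification, the
# kind of alert that surfaced drift before it became a surprise.
budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",  # placeholder account
    Budget={
        "BudgetName": "platform-monthly",
        "BudgetLimit": {"Amount": "100000", "Unit": "USD"},  # placeholder amount
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80,               # alert at 80% of the budget
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [
            {"SubscriptionType": "EMAIL", "Address": "platform-team@example.com"},
        ],
    }],
)
```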
This is one of the places where FinOps, when done properly, is just operational hygiene. Visibility only matters if it feeds a loop that people trust and respond to.
13. What Did Not Help
Some approaches were consistently less useful than they sounded.
Over-optimizing very small resources rarely moved the needle. It created noise and made people feel busy, but it did not address the meaningful parts of the bill. The real savings came from baseline compute, database commitments, storage lifecycle policy, and runtime discipline around ephemeral workloads.
Trying to optimize everything at once was also a mistake. Large platforms always have more potential savings ideas than anyone has time to pursue well. The right move was to start with the most structurally important cost drivers and only then work downward. That is much more effective than scattering attention across dozens of small items with unclear impact.
I was also cautious about making commitment decisions too early. Reserved capacity is powerful when it matches reality. It is much less attractive when it is being used to paper over a platform that has never been properly understood. The right order was visibility first, then baseline analysis, then commitments.
The same was true of tooling. Tools can help, but they do not replace judgment. Cost optimization only becomes durable when the platform model itself is sane enough that the numbers mean something.