Why Are Companies Still Struggling with Cloud Costs?
$600 billion.
That's not the GDP of Sweden—though it could be. It's not the combined market cap of Home Depot and Chevron—though it could also be that. It's the rough amount that Gartner predicts companies will spend on public-cloud services in 2023.
If that number doesn't mean anything to you, read the numerous articles from the last calendar year about companies struggling for solvency and training their sights on cloud costs. Companies such as Netflix, Twitter, and Airbnb have made very public efforts to cut cloud costs; others have doubted the virtue of investing in the cloud.
Cloud costs have been accelerating for years. No business leader worth their salt would say that a continuously accelerating, variable cost is a stable foundation for business.
It prompts the question: How did we get here? Why, after nothing but danger signs, are companies still struggling with cloud costs?
We can boil the answer down to one word: denial. Denial that a problem existed, and denial about its true cause.
To trace the roots of that denial, let's look to the past—specifically, to a previous watershed moment for software, in January 2002.
Trustworthy Computing
On January 15, 2002, at 5:22 pm, Bill Gates sent out a now-famous, Microsoft-wide email. The subject line: Trustworthy computing.
"Within 10 years, [computing] will be an integral and indispensable part of almost everything we do," the email read. But "Microsoft and the industry will only succeed in that world if CIOs, consumers and everyone else sees that Microsoft has created a platform for Trustworthy Computing."
For those of you too young to remember, Gates was responding to an existential threat facing Microsoft: compromised security. Microsoft had suffered its first high-profile attack in 1997. Despite billions of dollars of federal investment and counter-efforts by Microsoft's security team, malware and hackers were still finding ways to traumatize the nascent world of computing.
"Before the Bill Gates [trustworthy computing] memo in 2001, Microsoft developers thought they were building great software, but were blissfully unaware of the security mistakes they were making," Chris Wysopal, co-founder and CTO of Veracode and one of the original members of L0pht, a Boston-based hacker group that testified on cybersecurity before Congress in 1998, wrote to me in an email. "This led to a 'patch later' mentality. If someone found a vulnerability and reported it, they would fix that one issue and then go back to work."
"Patch later" was a reactive method that only addressed known security flaws. The core software remained riddled with vulnerabilities, and ambitious hackers found ways around what few patches companies could manage. Trustworthy computing represented a breaking point—an "emergency brake," as Wysopal put it.
Gates's memo included the following decree, paraphrased: Microsoft isn't going to write any new software or release any new features until we get this security thing right. The company accepted substantial delays to its product road map, endured criticism, and redefined security standards for the entire computing industry.
Why? Because it had to. Because Microsoft wouldn't have survived as a business, let alone been a key player in enabling e-commerce, if it hadn't made security a top-level priority.
"It turns out building security in is the only way to achieve secure software," wrote Wysopal. "Many software vendors find this out the hard way."
There is a direct parallel between the security challenges of yesteryear and the cost challenges of today. Where we used to have "patch later," we now have "optimize later." Most organizations don't view cost as a primary criterion of success when building cloud infrastructure. As a result, they build cost-inefficient infrastructure and then try to put a Band-Aid on it after the fact with enterprise discount plans, committed-use discounts, and savings plans—in short: better buying.
"It's the same mindset, applied to a different area of software development," Wysopal said.
We've reached the limit of better buying. I've had a lot of interaction with VCs lately, and without naming names, certain investors have told me that they expect a rash of business failures in 2023—some estimating hundreds. The failures will be broadly attributable to reckless spending—and specifically attributable, in large part, to years of reckless spending in the cloud.
Years of overindulgence, denying that we had a problem.
Why Cloud Spenders Have Been in Denial
Why were companies in denial for so many years? A few reasons:
1. Cloud cost hasn't had its "trustworthy computing" moment.
No Microsoft-scale company has yet faced going out of business for failing to wrangle its cloud spend. Hundreds of smaller companies now face that risk, and thousands more have suffered wounded valuations because of bleak margins. But no company has put production on pause to address the cloud-spend issue once and for all.
2. Cloud infrastructure went from "cutting-edge" to "inefficient" in less than a decade.
Circa 2014, it was an inside joke at AWS re:Invent that every session you went to was given by a Netflix employee. In the mid-2010s, Netflix was at the leading edge of cloud-infrastructure efficiency.
But a lot has happened since 2014. Netflix has faced a siege of competition from players such as Disney+, a platform architected after Netflix was built, with cloud tools that weren’t available when Netflix was building its platform. The cloud world evolves at a breakneck pace; cloud systems that enabled world-scale, booming business 10 years ago are now legacy systems that hinder competitiveness.
3. Cloud costs have long been underwritten by VCs.
For more than a decade, cloud recklessness was underwritten by VCs eager to jump on the SaaS wagon before all the greenfield was settled. "Growth at all costs" was the dictum, and portfolio companies happily spent the money.
Years of liberal VC funding let companies believe they didn't have a problem. Even if reckless spending defied business fundamentals, even if it flew in the face of decades-old profitability standards, the money was there—so they didn't fret.
But the VC market has dried up. Venture funding dropped an unprecedented 35% year over year in 2022. The gravy train is decelerating—and is likely in need of engine repairs. The age of subsidized cloud recklessness is over.
How Modern Companies Can Address Cloud-Cost Recklessness at Its Root
Some companies are Thelma-and-Louise-ing toward the precipice of business solvency. But many others are in a position where, if they adopt the right mindset and make the right changes, they can peer thoughtfully over the edge and change course.
1. Take a cue from 2002 and treat this as an engineering problem.
Cybersecurity reached its crisis point in 2002 because the business world had made the understandable mistake of pinning the issue on security teams. The software of the day was insecure, so companies staffed up security teams and spent millions on third-party security products.
But that didn't work. Why? Because the security issues didn't stem from understaffed or under-armed security teams; they stemmed from the software itself. Engineers ultimately became responsible for securing their own software, writing code in such a way that bad actors couldn't hack it.
It's exactly the same scenario with cloud-cost management, also known as FinOps (short for "financial operations"). Because cloud cost is a financial topic, it’s long fallen to finance teams to control it. But cost issues don’t stem from finance teams; CFOs don’t provision cloud infrastructure. Cost issues stem from, you guessed it, engineers.
Every time an engineer makes a decision—writes a line of code, picks an EC2 instance, ships a product feature—they are spending money. All building decisions are buying decisions.
Organizations don't overspend on the cloud because finance teams are negotiating bad deals with cloud providers, or because they're buying standard reserved instances when they should be buying convertible ones. They overspend because engineers don't have the data at their fingertips to know what their code costs, understand spending benchmarks, and remediate issues as they arise.
2. Get engineers the right visibility at the right time.
According to CloudZero's research, 87% of companies have less than 75% of their spend allocated. Put differently, the vast majority of companies are missing more than a quarter of their cloud-spend data, keeping the relevant engineers in the dark on cloud spend.
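As a rough illustration of what "allocated" means here, the sketch below computes the share of spend that can be attributed to an owning team. The line items, team names, and dollar amounts are invented for the example, and the owner field is an assumption about how a billing export might be labeled, not a description of any particular tool.

```python
from typing import Optional

# Hypothetical billing line items: (service, owning team or None, cost in USD).
# All values are made up for illustration.
LINE_ITEMS: list[tuple[str, Optional[str], float]] = [
    ("compute", "live-chat", 4000.0),
    ("storage", "live-chat", 800.0),
    ("compute", "live-demo", 2200.0),
    ("storage", None, 1800.0),  # no owner recorded, so this spend is unallocated
    ("network", None, 1200.0),  # no owner recorded, so this spend is unallocated
]


def allocated_share(items: list[tuple[str, Optional[str], float]]) -> float:
    """Return the fraction of total spend that has a known owner."""
    total = sum(cost for _, _, cost in items)
    if total == 0:
        return 0.0
    allocated = sum(cost for _, owner, cost in items if owner is not None)
    return allocated / total


share = allocated_share(LINE_ITEMS)
print(f"Allocated: {share:.0%}, unallocated: {1 - share:.0%}")
```

With these invented numbers, $7,000 of $10,000 has an owner, so this hypothetical company would sit below the 75% threshold cited above.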
For engineers to spend responsibly, they have to know (1) what they're spending, (2) when they're spending it, and (3) how it compares to what they should be spending. They need near-real-time data on the cost of their infrastructure. What they don't need is the organization's lump-sum S3 spend—which is noncontextual, and therefore meaningless.
In short, engineers need to get relevant, accurate cost visibility in a timely fashion.
I chose the above words carefully.
"Relevant" means cost data concerning the right infrastructure. The team responsible for your live-chat feature needs immediate access to current cost data about the live-chat feature—not the live-demo feature that another team maintains.
"Accurate" means precisely reflective of your actual billing info—which platforms that base their spend data on tags are not. There are too many tags to reliably label by hand, tag labels often have typos, and tagging frameworks can differ by organization (which is particularly problematic in the event of a merger or acquisition).
"In a timely fashion" means as soon as the code is shipped or as close to it as possible—not at the end of the month, when the cost damage is done. The data should be granularly detailed, readily accessible, and easy to incorporate into an engineer's daily workflow.
3. Get out of the way.
Great engineers have one common superpower: They are experts at solving problems, which includes inspecting code, diagnosing, debugging, and iterating. They can understand and address challenges that most people would struggle to understand. They can look at data and the code it references, then draw nuanced conclusions about ambiguous, confusing, often uncorrelated events and activities.
Great engineers don't need finance teams knocking on the door at the end of the month, AWS invoices in hand, stressing the need to reduce cloud spend by 22%. Great engineers need the raw materials for critical thinking and decision making. Once you get them relevant, accurate cost visibility in a timely fashion, step out of the way—and let them do what they do best.
Cost-Efficient Code
Speaking of timely, it's relevant that Microsoft joined the FinOps Foundation earlier this year. The company that, 20 years ago, recognized the need for secure engineering has now planted its four-toned flag in the world of cost-efficient engineering. Microsoft isn't alone. The FinOps Foundation has ballooned to more than 8,700 members (of which I’m one) working for more than 3,500 companies.
FinOps practitioners tilt technical. In the foundation's most recent State of FinOps survey, only 13.4% of respondents indicated that they report to the CFO in their organizations; another 60.6% of respondents, meanwhile, report to either a CTO or CIO, which suggests that many of them sit close to engineering and want to lead the cost-efficient code charge.
Why are companies still struggling with their cloud costs? Partly because they've denied that cloud costs are a problem, and partly because many of those who acknowledge the problem wrongly pin it on finance teams.
Excessive cloud cost is an engineering problem. Effective cloud-cost management starts with empowering engineers with relevant, accurate cost visibility in a timely fashion—then letting them do what they do best.