As more workloads find their way to the cloud and demand increases, it would not be surprising for cloud customers to experience a lot more outages. Yet this has not proved to be the case—at least not directly.
According to the Uptime Institute’s 2022 annual outage analysis, public cloud outages have occurred at about the same historical rate over the past three years. In Uptime’s 2022 Data Center Resiliency Survey, 80% of data center managers and operators said they had experienced an outage within the past three years. According to Uptime, that number is roughly consistent with the normal range of 70% to 80%.
While on the surface this may seem like good news, it actually masks what is really going on, said Andy Lawrence, founding member and executive director of Uptime Institute Intelligence (the Uptime Institute’s research division). What’s interesting, said Lawrence, is that the amount of IT being built out and consumed every year is growing faster than the number of outages.
“One of the things we see is that ... a single CIO has way more IT than [they] did five years ago,” said Lawrence. “So we can still expect some significant outages even though CIOs are running much more IT without outages overall.”
If the reliability numbers continue to hold, the percentage of customers experiencing significant and severe outages should, in time, come down, according to the report.
“It is likely systems will become more resilient, as operators become better at managing complex, at-scale architectures,” the report said.
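The arithmetic behind that expectation can be sketched in a few lines of Python. The figures below are purely hypothetical, chosen only to illustrate the reasoning, and are not taken from the Uptime report; the function name `outages_per_unit` is likewise an assumption made for this example.

```python
def outages_per_unit(outages: float, it_units: float) -> float:
    """Return the number of outages per unit of IT (for example, per workload)."""
    if it_units <= 0:
        raise ValueError("it_units must be positive")
    return outages / it_units


# Hypothetical numbers, not from the Uptime report:
# the outage count stays flat while the IT footprint doubles.
year_1 = outages_per_unit(outages=10, it_units=1_000)
year_5 = outages_per_unit(outages=10, it_units=2_000)

print(f"Year 1: {year_1:.4f} outages per unit of IT")
print(f"Year 5: {year_5:.4f} outages per unit of IT")
```

Under these assumed numbers, the outage count is unchanged but the rate per unit of IT is cut in half, which is the effect Lawrence describes.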
Increased IT reliability is a good thing. The challenge today for cloud providers—and their customers—is that, when outages occur, they are often more impactful and high profile. This is because many of the workloads that are being shifted to the cloud are customer-facing, said Neil Miles, senior product marketing manager for Micro Focus’s ITOM Portfolio.
“I had a consultant in Toronto tell me that his slogan was ‘Slow is the new down,’” said Miles. “If a service is running badly, people can quickly jump to another vendor. The impact to the company is higher because the percentage of services that are represented in that failed cloud instance is higher.”
What’s causing outages hasn’t changed much
The leading cause of outages is human error, said Lawrence. Since measuring human error directly is difficult, Lawrence and his research team look at the failures those mistakes create.
Over the past three years, almost 40% of organizations have experienced a major outage caused by human error. Of these incidents, 85% were caused either by staff who do not follow procedures or by flaws in the processes and procedures themselves, according to the report.
Historically, a loss of power has been the No. 1 cause of significant outages. But most of those outages, too, were caused by human error. Over the last 25 years, electrical failures “have accounted for 80% of all IT load losses in data centers,” the report said.
Networking- and connectivity-related issues represent another leading cause of significant outages. These outages are driven primarily by the fact that IT architectures and application topologies are becoming more complex by the day. Organizations today increasingly rely on a mix of on-premises hardware and services, multiple cloud providers, and third-party APIs running in containers and virtual machines.
“On the whole, [cloud] architectures provide high levels of service availability at scale,” reads the report. “Despite this, no architecture is fail-safe, and many failures have now been recorded that can be attributed to the difficulties of managing such complex, at-scale software, data and networks.”
This could be why only 13% of survey respondents said the cloud is resilient enough to run all of their workloads.
When you add in the human-error element, the resulting cascade of failure can be dramatic and quick, said Forrester senior analyst Brent Ellis. When AWS’s East region went down in December of 2021, it was caused by a scripting error on AWS’s internal billing network that scaled out of control. In minutes, the mistake had consumed enough network resources to shut down the entirety of AWS East’s production network.
“This was big enough that, after the US East outage, the European Union proposed legislation requiring that banks and companies in the financial sector diversify across either multiple clouds or cloud plus on-prem infrastructure,” said Ellis.
In another instance, a company Ellis works with was importing data into its Salesforce CRM system from Google Cloud—but the data had to go through a serverless function in Microsoft Azure to be transformed into a data format Salesforce’s app could consume. This worked well until the function hit undocumented threshold limits for Azure and Salesforce’s APIs. It took many hours of troubleshooting to isolate the issues and get the service functioning smoothly again.
Even though all of the technologies involved were functioning as designed, when combined into a system, they created a situation that led to an outage.
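One common way to soften this kind of failure is to retry rate-limited calls with exponential backoff rather than letting the first rejection break the pipeline. The following is a minimal sketch of that general technique, not the approach the company in Ellis's example took. `RateLimitError`, `call_with_backoff`, and `make_flaky_transform` are hypothetical names invented for illustration and do not correspond to any Azure or Salesforce API.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error raised when an API rejects a call for exceeding a limit."""


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying with exponential backoff and jitter on RateLimitError."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # The base delay doubles on each attempt; random jitter keeps
            # many clients from retrying at the same instant.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")


def make_flaky_transform(failures: int) -> Callable[[], str]:
    """Build a stand-in for a rate-limited call that fails `failures` times."""
    remaining = {"count": failures}

    def transform() -> str:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise RateLimitError("threshold exceeded")
        return "transformed"

    return transform


# The stand-in call fails twice, then succeeds on the third attempt.
# Sleeping is stubbed out so the example finishes immediately.
result = call_with_backoff(make_flaky_transform(2), sleep=lambda _: None)
print(result)
```

Backoff only helps when a limit is transient. It does not reveal an undocumented threshold, so logging each rejected call alongside the retry would still be needed to shorten the kind of troubleshooting described above.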
Recovery times and costs are increasing
Agile programming methodologies, DevOps, and automated continuous integration and deployment (CI/CD) pipelines all push updates from development into production more quickly. This can create situations where IT administrators (who are tasked with the mundane business of keeping the lights on) and developers don't talk to one another about important production-related issues. This lack of communication means IT is not up to date on all of the changes that are taking place in the production environments it is charged with maintaining. When outages happen, it can take IT much longer to figure out what the root causes are and what to do about them.
Like human error, these longer recovery times affect the cloud reliability equation, according to Miles.
“When there is a performance problem or an outage, if a dev team has turned on a [cloud] instance the IT team doesn't even know about, they’re blind,” said Miles.
All of these factors are combining to increase the cost of an outage. Although the Uptime Institute does not calculate an average cost of outages overall, it does look at the most expensive ones reported in its annual surveys. In 2019, 60% of respondents said their average cost of an outage was less than $100,000. By 2021, that share had dropped to 39%, while the share of outages costing between $100,000 and $1 million rose from 28% in 2019 to 47% in 2021.
“Things used to be a little simpler when it was just a VM, because you'd just restart a VM,” said Miles. “Now, you've got containers, you’ve got Kubernetes, you've got everything out there in the environment. It’s, in some ways, a more fragile mix.”