If your enterprise is like most others, the pandemic took your CloudOps work remote and spread it over a widely distributed team. Everyone collaborates using a series of web-delivered AIOps dashboards that report issues with your applications and data in the cloud and provide interfaces to most systems and infrastructure to remotely fix the issues. Could things be any better?
Well, yes. Despite all the tools that now allow us to see inside applications, data, platforms, and most infrastructure, we still have system outages. Indeed, while CloudOps may be the next big thing, most of CloudOps is reactive: It waits for something to happen and then reacts to correct the issues.
But reactive is no longer acceptable. With all the tools and data at our fingertips, most system problems, such as storage failures, data corruption, and network outages, can and should be known before they occur. In this ideal world, the problems are fixed before they are known to the end users, or even to the CloudOps team.
Some sort of sorcery?
We've been doing this magic act for years, with both on-premises and cloud-based systems, using standard operational tools. Granted, until now, their abilities were somewhat primitive, and most predictions were made using analytics tools that were outside of the standard ops tool chest.
Called "decoupled predictive analytics" for on-premises and cloud operations, this grew out of the emerging data analytics world that includes data warehouses, data lakes, and big data. The idea is that massive amounts of data can be "weaponized" for all sorts of purposes. Analytics could outright predict the future using any number of data sources.
While it seems as if we should be telling others that this is a "coming soon" technology, it's actually been around for a while. Most traditional ops tools support some types of predictive analytics as related to cloud and non-cloud systems operations.
The game-changer is the rise of the AIOps space, which has pushed the outer limits of what is possible when we apply predictive analytics to CloudOps. We can use predictive analytics to spot system troubles, and then engage an orchestration engine to automate most of the fixes needed to resolve the problems.
If the automated fixes don't work, we can even automatically coordinate with humans inside and outside of the company to resolve the issue. Done well, system failures and the resulting outages become far less of an issue.
Anybody can do it
Ops-oriented predictive analytics seem like something that would be only in the wheelhouse of data scientists. But it's not that complex when you break the concept down into its component parts. These parts include:
- Data gathered
- Patterns understood
- Policies set
- Action taken
Data is at the core of all predictive analytics. It includes data gathered from the systems under management, as well as other data that can be correlated.
Here's an example. After two years of monitoring sales, you know that processing orders from both web- and human-based systems stresses the allocated resources on your public cloud. This may include compute, storage, databases, and application-oriented container processing using Kubernetes clusters.
The balancing act here is to leverage the fewest number of cloud resources required to support the business processing requirements. Leverage too many resources and you'll get huge cloud bills for resources you never use. Leverage too few resources and spikes will lead to performance problems and outright failures. (This example puts aside things such as serverless computing and storage that auto-scale.)
The objective is to use the minimum number of resources required to do sales order processing for the entire company but to have enough resources available to handle spikes in sales processing. At the same time, you need to avoid performance and stability issues when unexpected spikes mean not enough resources are available. CloudOps teams deal with this balancing act every day.
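The balancing act can be made concrete with a small sketch. The function below is illustrative only: the "compute units," the hourly demand figures, and the names `evaluate_capacity` and `CapacityReport` are assumptions made up for this example, not part of any cloud provider's tooling. It simply totals how much provisioned capacity sat idle and how much demand went unserved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CapacityReport:
    idle_unit_hours: float       # capacity that was paid for but not used
    shortfall_unit_hours: float  # demand that exceeded what was provisioned
    hours_over_capacity: int     # number of hours in which demand exceeded capacity


def evaluate_capacity(hourly_demand: List[float], provisioned_units: float) -> CapacityReport:
    """Compare hourly demand against a fixed provisioned capacity."""
    idle = 0.0
    shortfall = 0.0
    hours_over = 0
    for demand in hourly_demand:
        if demand > provisioned_units:
            shortfall += demand - provisioned_units
            hours_over += 1
        else:
            idle += provisioned_units - demand
    return CapacityReport(idle, shortfall, hours_over)


# Hypothetical demand for six hours, including one sales spike.
demand = [40, 45, 50, 120, 55, 42]

too_few = evaluate_capacity(demand, provisioned_units=60)
too_many = evaluate_capacity(demand, provisioned_units=130)
```

With the made-up numbers above, the smaller allocation misses the spike entirely, while the larger one covers it at the cost of far more idle capacity. Predictive analytics aims to avoid both outcomes by adding capacity only when a spike is expected.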
The data you'll need for analytics
For this cloud-based sales order-processing example, two types of data are available to do some rudimentary predictive analytics.
First, there is primary data that comes from the cloud resources themselves, such as databases, compute, network, and storage. It includes metrics such as memory and CPU utilization, network saturation, and database growth and processing. You could have several gigabytes of this information gathered and stored weekly. You can think of primary data as telling you what has happened in the past and what is happening in the present.
Second, there is correlated data, which is a very different concept. This data should tell you why something is happening. For example, if the business sells stand-up paddleboards (SUPs) primarily in the US, your team will soon discover that sales directly relate to weather patterns. This includes seasonal weather, such as spring and summer being the best time for paddleboard sales. However, intermittent weather, such as a cold front over several weeks or months, may cause sales to decline.
So the business knows there is a direct correlation between the weather and the sales of SUPs. The amount of warm, storm-free weather, both current and future, largely determines the sales of SUPs. Because the ops team can anticipate when the sales order-entry system will be at or near capacity, it must proactively and automatically add resources before that load arrives.
Now we understand a rather simple pattern at the SUP business: Sales of SUPs are directly related to the weather, which directly relates to system utilization. This means nice weather will be more likely to stress the system, which will result in spikes and cloud-resource shortage problems.
Public databases track and predict weather. The SUP business can use those databases to set up correlated data sources.
Connections among different types of data
Next, the SUP business can set up the relationship between the primary and correlated data, which will provide the foundation of the predictive analytics model for CloudOps. In this case, the predictive analytics engine determines the degree of correlation. It looks at past patterns on both the primary and correlated data sides and determines likely issues as related to cloud operations.
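As a rough illustration of what "degree of correlation" can mean, the sketch below computes a Pearson correlation coefficient between one correlated series (temperature) and one primary series (peak CPU utilization). The weekly figures are invented for the example, and a real analytics engine would likely use more robust methods and many more attributes.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up weekly samples: average temperature (deg F) and peak CPU utilization (%).
avg_temp_f = [52, 58, 63, 70, 76, 81, 85]
peak_cpu_pct = [31, 35, 42, 51, 60, 66, 74]

r = pearson(avg_temp_f, peak_cpu_pct)
```

A value of `r` close to 1 would suggest that utilization rises with temperature in this sample. Correlation alone does not establish cause, so the business knowledge about why SUP sales follow the weather still matters.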
Of course, there are other likely correlations. These might include economic data, the market for SUPs, and so forth.
This simple example was designed to illustrate how the most basic predictive analytics work. Indeed, this type of basic analytics might still be performed manually on a spreadsheet.
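The spreadsheet-style version of this analysis can be approximated with an ordinary least-squares line, much like a spreadsheet trendline. The sketch below is a minimal example under stated assumptions: the temperature and order figures, the 200-orders-per-hour capacity, and the function names are all hypothetical.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("cannot fit a line when all x values are identical")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope: float, intercept: float, x: float) -> float:
    return slope * x + intercept


# Made-up history: daily high temperature (deg F) vs. peak orders per hour.
temps = [60, 70, 80, 90]
orders_per_hour = [110, 140, 210, 240]

slope, intercept = fit_line(temps, orders_per_hour)

# Hypothetical forecast high for next week, and an assumed current capacity.
forecast_orders = predict(slope, intercept, 85)
CURRENT_CAPACITY_ORDERS_PER_HOUR = 200
needs_more_capacity = forecast_orders > CURRENT_CAPACITY_ORDERS_PER_HOUR
```

A straight line is a deliberate simplification. It is enough to show the idea of turning a weather forecast into a load forecast, but real demand is rarely this linear.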
When considering more complex models, automated predictive analytics become more important. It's not unusual to find a setup where thousands of data attributes about hundreds of cloud-based resources provide the primary data (the what), and many other correlated data sources come from internal and external systems (the why).
Keep in mind that correlated data from internal business systems might include supply chain and logistics data, which would also determine how fast things are produced. Correlated data might also include external data such as key economic indicators, market data, changes in demographics, etc.
You'll need some heavy-duty data analytics systems to span these internal and external data sources. These systems can be either general-purpose or purpose-built for CloudOps, such as AIOps tools and other ops tools that include analytics. Most AIOps and ops tools have data analysis to some degree, either as an integrated subsystem or as add-ons from technology partners.
Get your policies straight
Policies, or the small procedures that determine how you will act upon the results of predictive analytics, are important to set up. These policies would be easy to define for our simple SUP business example:
- If the weather trends warmer by 27%
  - Then cloud system utilization will trend higher:
    - Allocate 10% more cloud storage.
    - Allocate 20% more cloud compute.
    - Add 25% more capacity to the cloud database.
    - Schedule 15% more cloud system support staff.
    - Add more fine-grained monitoring of network traffic in and between the clouds.
- Else do nothing
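The policy above could be expressed as a small rule in code. This is only a sketch: the resource labels and function name are hypothetical, and it assumes "warmer by 27%" means a warming trend of at least 27%, which the policy does not spell out.

```python
from typing import List

# Threshold and percentages are taken from the example policy.
WARMING_THRESHOLD_PCT = 27.0

# (hypothetical resource label, percent increase)
SCALE_UP_STEPS = [
    ("cloud storage", 10),
    ("cloud compute", 20),
    ("cloud database capacity", 25),
    ("cloud system support staff", 15),
]


def evaluate_weather_policy(warming_trend_pct: float) -> List[str]:
    """Return the actions the policy calls for, or an empty list for 'do nothing'."""
    # Assumption: the rule fires when the trend meets or exceeds the threshold.
    if warming_trend_pct < WARMING_THRESHOLD_PCT:
        return []
    actions = [f"increase {resource} by {pct}%" for resource, pct in SCALE_UP_STEPS]
    actions.append("enable fine-grained network traffic monitoring")
    return actions


warm_spell_actions = evaluate_weather_policy(30.0)
mild_week_actions = evaluate_weather_policy(10.0)
```

Keeping the rule as data (a threshold plus a list of steps) makes it easier to review and adjust the percentages without touching the logic.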
The idea is to define a condition that affects your business, such as weather changes. Next you define the proactive action to ensure that the cloud system can handle the new condition, such as processing load increases.
Also note that these policies define actions to be taken, such as allocate 10% more cloud storage. This means that some process is invoked that can carry out the action. It's usually a system orchestration tool that carries out the action using a public cloud API.
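To show how an action might be handed off to a process that carries it out, here is a dry-run sketch of an orchestrator. Everything in it is hypothetical: the `Action` and `DryRunOrchestrator` names, the resource labels, and the handlers, which only describe what they would do. A real handler would call the cloud provider's API instead.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Action:
    resource: str      # hypothetical resource label, e.g. "storage"
    increase_pct: int  # extra capacity the policy asked for


class DryRunOrchestrator:
    """Routes each action to a registered handler and records the outcome."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[Action], str]] = {}
        self.log: List[str] = []

    def register(self, resource: str, handler: Callable[[Action], str]) -> None:
        self._handlers[resource] = handler

    def execute(self, actions: List[Action]) -> List[str]:
        for action in actions:
            handler = self._handlers.get(action.resource)
            if handler is None:
                # No automated fix available; a person would need to follow up.
                self.log.append(f"skipped {action.resource}: no handler registered")
            else:
                self.log.append(handler(action))
        return self.log


def grow_storage(action: Action) -> str:
    # A real handler would call the cloud provider's API at this point.
    return f"would allocate {action.increase_pct}% more storage"


def grow_compute(action: Action) -> str:
    return f"would allocate {action.increase_pct}% more compute"


orchestrator = DryRunOrchestrator()
orchestrator.register("storage", grow_storage)
orchestrator.register("compute", grow_compute)

results = orchestrator.execute(
    [Action("storage", 10), Action("compute", 20), Action("database", 25)]
)
```

Running actions in dry-run mode first is one way to check that a policy does what was intended before it is allowed to change live resources.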
What about AI?
Although AI is not required to create predictive analytics for CloudOps, it's nice to have. AI systems can determine patterns using data. They can also leverage the data for training to build knowledge over time around the likely trends of that data.
The key here is to understand that there are degrees of sophistication within the approaches and tooling. As outlined above, simple predictive analytics models can be built and deployed on spreadsheets at low cost.
As your systems get more complex, you'll need higher-end data analytics tools that leverage AI. The capabilities of these more complex predictive analytics models add much more value and accuracy to these types of proactive CloudOps systems. However, their purchase price is higher, and they require special skills to set them up and maintain them.
You need to consider the true business case with these tools or you'll have a tough conversation with whatever fiscal oversight you deal with. You could take the risk and spend $1 million for a predictive analytics CloudOps system that saves you only $100,000 a year. Will the business wait 10 years for the system's initial ROI? Or are there other tangible and intangible factors that could increase the system's value to the business sooner?
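The arithmetic behind that question is a simple payback calculation. The sketch below uses the $1 million cost and $100,000 annual savings from the example. The additional $150,000 a year in avoided outage costs is a made-up figure, included only to show how other factors could shorten the payback period.

```python
def payback_years(upfront_cost: float, annual_benefit: float) -> float:
    """Years needed for cumulative annual benefit to equal the upfront cost."""
    if annual_benefit <= 0:
        raise ValueError("annual benefit must be positive for the system to pay back")
    return upfront_cost / annual_benefit


# Figures from the example: $1 million system, $100,000 saved per year.
savings_only = payback_years(1_000_000, 100_000)

# Hypothetical: also count $150,000 per year in avoided outage costs.
with_avoided_outages = payback_years(1_000_000, 100_000 + 150_000)
```

This ignores ongoing license, staffing, and maintenance costs, which would lengthen the payback period in practice.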
There is a pragmatic balance between what can be done when using predictive analytics for CloudOps and the value that it can bring. The tooling and capabilities will become more attractive as time progresses. The fact that cloud deployments continue to become much more operationally complex means that you'll need to get better at predictive analytics very soon.
At least, that's our prediction.