All too often, companies rely on customers as their alerting system for priority IT incidents. Their IT operations teams are first notified of an outage or a slowdown when the phones light up or an email server starts straining.
Finding out about problems from customers or end users should be the exception though, right? Companies have a number of monitoring tools in place to gather a plethora of metrics, service management tools to notify people, and dashboard tools to display the state of any piece of the IT environment during any given time.
Unfortunately for all concerned, IT service providers and subscribers alike, the data tell us something different. For example, we work with a global Internet content provider that tells us that its various monitoring tools reported only about 45 percent of incidents, which means that 55 percent were first reported by its customers.
This ratio is pretty consistent across enterprise IT teams we talk to. In fact, the aforementioned content provider was already doing better than the norm. A Forrester Research survey showed that end users report nearly three-quarters of outages on average.
Users have problems? You have problems
"So what," you might say. As long as you know about the problem, you can go and fix it.
If you are looking at a process in the abstract, everything starts from when a problem is first known by the organization, whether the detecting source is a monitoring tool or a customer. The crucial aspect that gets left out of this simplistic representation is that there is a lot that happens before anyone calls the service desk. Users get frustrated, retry a couple of times, ask their friends or colleagues whether they are experiencing the same problem, then vent and gripe on social media. Eventually, they may get around to contacting the service desk and notifying the company about the issue. Meanwhile, other users are being impacted by the same outage, while the service desk remains blind to the problem, its details hidden behind a dashboard that is showing solid green. Well, more likely, the dashboard is showing "solid red" without any useful context or prioritization, but either way, the result is the same: Crucial information is hidden.
The impact of this blindness can be severe. The same content provider that we mentioned earlier found that the duration of impact in minutes was ten times higher for user-reported incidents than for machine-reported ones. In other words, the impact was far more widespread and lasted far longer before the organization could even begin to troubleshoot and (hopefully) resolve the issue. Given that a recent IDC survey also tells us that the average hourly cost of an infrastructure outage is $100,000, anything that can reduce the duration of an outage will help. And let's not forget that outages are actually the simplest issues to detect; performance degradation is much harder precisely because it is subtler and less obviously catastrophic.
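To put rough numbers on that, here is a back-of-the-envelope sketch. It uses the IDC hourly figure quoted above and treats the "ten times higher" finding as a simple multiplier on outage duration, which is a simplifying assumption. The 12-minute baseline is invented purely for illustration and does not come from the content provider.

```python
# Assumption: $100,000 per hour is the IDC average quoted in the article.
HOURLY_OUTAGE_COST_USD = 100_000


def outage_cost(duration_minutes, hourly_cost=HOURLY_OUTAGE_COST_USD):
    """Return the estimated cost in dollars of an outage of the given length."""
    return duration_minutes * hourly_cost / 60


# Hypothetical baseline: an incident caught by monitoring lasts 12 minutes.
machine_detected_minutes = 12
# Simplification: apply the article's tenfold ratio directly to duration.
user_reported_minutes = machine_detected_minutes * 10

print(f"Machine-detected: ${outage_cost(machine_detected_minutes):,.0f}")
print(f"User-reported:    ${outage_cost(user_reported_minutes):,.0f}")
```

With these assumed inputs, the same incident costs about $20,000 when a tool catches it and about $200,000 when a customer does. The real gap depends entirely on your own durations and costs.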
Get there early
The implication is clear: IT organizations must be able to detect issues earlier so they can minimize, if not prevent, negative customer impacts.
Unfortunately, the way to do this is not so clear. The problem in IT is rarely a lack of data anymore. The issue is far more likely to be that there is too much data, so much that nobody can make sense of it all, or at least not in real time.
Most IT operations centers have several big screens dotted around with system status displays. All too often, these will be showing a sea of red, with no context or situational awareness, making it difficult to discern any meaning from it all, let alone to work out the relative priorities of the events and alerts received.
Detect earlier, diagnose faster
The goal is to identify problems well before users notice them, and certainly before they have become so irritated that they are forced to call the service desk. There are three aspects to this aspiration:
1. See more
You might think this one is solved, with all the different tools at practically every level of the stack: traditional monitoring, log management, application performance monitoring, specialist tools for network or database performance, and more. However, if you can't make sense of all the data and how it all relates, there is limited value in collecting it. There is a re-emerging class of tool, the manager of managers (MoM), that can clean all these data streams by using analytics to eliminate repetitive and low-content events, sorting out only the important alerts for further processing.
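As a rough illustration of what "eliminating repetitive and low-content events" can mean, the sketch below drops low-severity events and suppresses repeats of the same check on the same source within a time window. The `Event` fields, the severity names, and the five-minute window are all assumptions made for this example. Real MoM products use far richer analytics than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A hypothetical, simplified monitoring event."""
    timestamp: float  # seconds since some reference point
    source: str       # host or component that raised the event
    check: str        # what was being measured
    severity: str     # "info", "warning", or "critical"


def reduce_noise(events, window_seconds=300.0, drop_severities=("info",)):
    """Keep only events that are neither low-content nor recent repeats."""
    last_kept = {}  # (source, check) -> timestamp of the last event kept
    alerts = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.severity in drop_severities:
            continue  # low-content event: discard
        key = (event.source, event.check)
        previous = last_kept.get(key)
        if previous is not None and event.timestamp - previous < window_seconds:
            continue  # repeat of something already surfaced: discard
        last_kept[key] = event.timestamp
        alerts.append(event)
    return alerts


raw_events = [
    Event(0, "db1", "disk", "critical"),
    Event(5, "web1", "heartbeat", "info"),
    Event(10, "db1", "disk", "critical"),
    Event(20, "db1", "disk", "critical"),
    Event(400, "db1", "disk", "critical"),
]
for alert in reduce_noise(raw_events):
    print(alert)
```

Here five raw events shrink to two alerts: the first disk alarm, and the one that recurs after the window has passed.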
2. Detect earlier
Cleaning all the data streams is just the first step. Many events and alerts can all relate to a single incident. Additional algorithms can then correlate related alerts and cluster them together into "situations." This helps to detect earlier the anomalous conditions in unfolding incidents that are starting to affect service, instead of wasting time wading through a massive volume of unrelated events.
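One very simple way to picture this clustering is to group alerts that affect the same service and arrive close together in time. The sketch below does exactly that and nothing more. The host-to-service map, the two-minute gap, and the alert tuple layout are hypothetical, and commercial tools rely on more sophisticated correlation than this heuristic.

```python
from collections import defaultdict


def correlate(alerts, service_of, max_gap_seconds=120.0):
    """Cluster (timestamp, host, message) alerts into per-service situations.

    Alerts for the same service belong to one situation as long as each
    arrives within max_gap_seconds of the previous one.
    """
    by_service = defaultdict(list)
    for timestamp, host, message in sorted(alerts):
        # Hosts missing from the map are treated as their own service.
        service = service_of.get(host, host)
        by_service[service].append((timestamp, host, message))

    situations = []
    for service, items in by_service.items():
        current = [items[0]]
        for item in items[1:]:
            if item[0] - current[-1][0] <= max_gap_seconds:
                current.append(item)  # same unfolding incident
            else:
                situations.append({"service": service, "alerts": current})
                current = [item]      # quiet period passed: new situation
        situations.append({"service": service, "alerts": current})
    return situations


# Hypothetical topology and alerts, for illustration only.
service_of = {"db1": "checkout", "web1": "checkout", "web2": "checkout",
              "mail1": "email"}
alerts = [
    (0, "db1", "slow queries"),
    (20, "mail1", "queue growing"),
    (30, "web1", "5xx errors"),
    (60, "web2", "timeouts"),
    (1000, "web1", "5xx errors"),
]
for situation in correlate(alerts, service_of):
    print(situation["service"], len(situation["alerts"]))
```

In this example the database and web alerts from the first minute collapse into a single checkout situation, so an operator sees one unfolding incident rather than three separate alarms.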
3. Diagnose faster
The final step, of course, is to troubleshoot and resolve the underlying problem. The traditional approach involves lots of finger-pointing, duplicated effort, and wasted time. But new tools can improve on that by notifying the relevant domain experts and bringing them into a virtual war room, where they can work collaboratively to resolve the complex incident. Such virtual war rooms should also integrate with other operational management tools that might be useful or required, such as a service desk and domain-specific diagnostic or forensic analysis tools.
Happy users make everyone happy
Adopting a modern approach to incident management will ultimately help shift the balance from customer-reported incidents to machine-reported ones, as well as ensure that incidents are resolved before customers' experiences degrade. A new generation of tools does exactly this, plugging right into existing systems and procedures to make the whole of IT more effective and efficient.
The result will be to improve the quality of experience that customers or end users receive, while also minimizing the costs that an IT organization has to manage.