
4 Predictive Analytics Challenges Facing SRE Teams

By: contributor on August 23, 2017

“Predictive analytics” refers to the many techniques used to analyze data to make predictions about the future, such as data mining, statistics, modeling, machine learning and artificial intelligence. Being able to predict technical and business outcomes is one of the main promises of big data. Indeed, many businesses have invested heavily in infrastructure to anticipate key performance indicators such as demand, pricing or maintenance. The investment required is substantial but, when it works, it yields a very positive ROI.


In this post, we’ll examine why implementing predictive analytics in a site reliability engineering (SRE) context requires a significant investment in both setting up the right data sets and applying the right logic to that data.


SRE should in theory be a prime target area for predictive analytics. Why? Because companies have already widely adopted monitoring tools over the past decade. In doing so, they've amassed extensive system health and behavior data that is ripe for analysis. SRE teams have also increased their ability to make sense of this data as data science tools have become more accessible, and basic data science aptitude is increasingly becoming a desired skill when hiring engineers.

Additionally, the SRE team is responsible for a mission-critical KPI: system availability. This single metric is directly tied to customer SLAs in digitally dependent businesses. SRE teams have a strong incentive to implement predictive analytics. Why? Having the ability to prevent downtime by addressing potential issues before they hit the production environment is hugely valuable. More system availability translates into more revenue, higher customer satisfaction and lower costs of problem mitigation.

However, the reality is that an SRE's job is still mostly focused on diagnosis, mitigation and "fixing," as opposed to more proactive tasks. This is in large part because predictive analytics applied to SRE/DevOps has proven difficult to implement and perform in a disciplined manner. As a result, it remains an aspirational goal at most companies.

SRE Challenges in Predictive Analytics

Up next, let's explore the four challenges that need to be overcome to implement a high-quality predictive analytics practice in an SRE context:

  • Data collection
  • Data quality
  • Data volume
  • Model usability

Challenge 1: Data Collection

First, the working data sets need to fit what you are trying to predict. In this case, it is challenging to get the right data set because the SRE team's mission is very broad: ensuring the reliability of the entire system. The difficulty lies in the fact that there is no such thing as "full system" data readily available. All the relevant data is distributed across siloed or overlapping repositories and monitoring tools. Centralizing the data requires either a full-system agent (which takes a lot of work to set up) or a multitude of API integrations, each of which must be mastered individually. An additional obstacle in getting the right data lies in the inconsistent nature of events, time series data and logs. These disparate data types can't be analyzed together unless they are transformed into a uniform data set.
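To make the "uniform data set" idea concrete, here is a minimal sketch of normalizing a time-series metric point and a discrete event into one common record shape. The field names (`ts`, `host`, `metric`, `service`, `type`) are hypothetical stand-ins for whatever your monitoring tools actually emit, not a reference to any particular product's schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Record:
    """One observation in a uniform shape, regardless of its source."""
    timestamp: datetime
    source: str   # e.g., "metrics", "events", "logs"
    entity: str   # the host, service or container the data refers to
    name: str     # metric name, event type or log level
    value: float  # numeric payload (1.0 for pure occurrences)

def from_metric(point: dict) -> Record:
    # Time-series point, e.g. {"ts": 1503446400, "host": "web-1", "metric": "cpu", "value": 0.93}
    return Record(datetime.fromtimestamp(point["ts"], tz=timezone.utc),
                  "metrics", point["host"], point["metric"], point["value"])

def from_event(event: dict) -> Record:
    # Discrete event, e.g. {"ts": 1503446460, "service": "api", "type": "deploy"}
    # Events have no magnitude, so we record their occurrence as 1.0.
    return Record(datetime.fromtimestamp(event["ts"], tz=timezone.utc),
                  "events", event["service"], event["type"], 1.0)
```

Once everything is a `Record`, metrics, events and logs can be sorted, joined and aggregated together, which is the precondition for analyzing them as one data set.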

Challenge 2: Data Quality

SRE environments emit lots of data, and some of it won't make sense for the purpose of the analysis. Whichever data collection method is used, the information tracked often reflects some level of bias on the part of whoever set up the tracking mechanism in the first place. This leads to capturing irrelevant data while inadvertently leaving out important information. In other words, the data is often noisy and incomplete at the same time.

The key to successful modeling lies in the selection of prediction variables whose data is obtained before the predicted event (in this case, a failure) happens. Defining the variables that lead to the determination of a subsystem failure is critical and requires specific domain knowledge. It is difficult to define and extract a pure data set sanitized of noise. The more data you leverage, the higher the chances that you are actually "feeding the beast" with irrelevant information. Too much irrelevant data leads to inaccurate models and false positives.
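One mechanical way to enforce the "obtained before the predicted event" rule is to carve a fixed feature window that ends some lead time before each failure, so no post-event (or during-event) data leaks into the training set. This is a sketch under assumed parameters (a one-hour window, a five-minute lead); real values would come from your own domain knowledge:

```python
from datetime import datetime, timedelta

def build_feature_window(records, failure_time,
                         window=timedelta(hours=1),
                         lead=timedelta(minutes=5)):
    """Keep only observations that fall inside a window ending `lead`
    before the failure, so the model never sees data from after (or
    during) the event it is supposed to predict."""
    cutoff = failure_time - lead            # last usable moment
    start = cutoff - window                 # earliest usable moment
    return [r for r in records if start <= r["timestamp"] < cutoff]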

Challenge 3: Data Volume

Here you can have two different types of challenges: too little data and too much.

Not enough data

You typically need at least 100 instances of what you are trying to predict, and probably at least 100 counterexamples where it didn't happen, to train a model. This assumes the data set is clean and directly relevant to a very narrow, specifically defined problem. In the case of SRE, the scope of issues that can affect a system is such that the data set required is very substantial: likely several months' worth of a wide variety of system behavior measurements. For example, if you collect data only through passive collection, consider the lag time between t-zero (when you start collecting data) and the time when you would have results. During that time, you have a bootstrapping problem: your model can't yet be trained to accurately predict what you want.
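The "at least 100 of each" figure is a rough rule of thumb, not a statistical law, but it is cheap to check before spending any training cycles. A minimal sketch of that gate, assuming labeled examples are already collected:

```python
from collections import Counter

MIN_PER_CLASS = 100  # rough rule of thumb, not a hard statistical law

def ready_to_train(labels, minimum=MIN_PER_CLASS):
    """Return (ok, counts): ok is True only when both outcomes (failure
    and non-failure) are present with at least `minimum` examples each."""
    counts = Counter(labels)
    ok = len(counts) >= 2 and all(n >= minimum for n in counts.values())
    return ok, dict(counts)
```

During the bootstrapping period described above, this check simply keeps returning `False`, which is an honest signal that the model isn't trainable yet.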

Too much data

There are two practical points to make around the notion of “too much data.”

  • Lack of standardization: Processing large volumes of data for analysis remains a difficult engineering challenge. There is no standardization for data pipelines; every method has a limit in terms of how many inputs it can ingest per second, and ingest rates often end up becoming bottlenecks in large enterprise environments.
  • Volume does not equal accuracy: There is a point beyond which you get diminishing returns: statistical theorems show that after a certain point, feeding more data into a predictive analytics model will not provide more accurate results.

Overall, getting the quality and quantity of the training data right is the best investment you can make in building your predictive capabilities.

Challenge 4: Model Usability

From a pure "mechanics" standpoint, the analytical concepts and methods exist to reflect the complex thinking required to predict the behavior of an enterprise's combination of technology, information and infrastructure. Predictive software relies heavily on advanced algorithms and methodologies such as logistic regression, time series analysis and decision trees. The problem is that the workings of the kinds of systems that SREs deal with are very hard to understand in the first place, let alone predict. It is not uncommon even for SREs to not entirely understand how a system works if they were not the ones who initially set it up. In addition, every company's system is different.

When data scientists design predictive models without years of experience in an SRE capacity across multiple environments (in other words, without leveraging subject matter expertise), the results rarely work perfectly. Often the models they create lack usability, even after a validation process to confirm their quality. A modeler will split the available data into a training set and a holdout set, run the model on the training set to generate predictions, and compare its performance against the actual outcomes. Producing a model with high overall predictive accuracy requires multiple iterations, and nine times out of 10 the result is overfitting: a model that looks perfect on the training data but generalizes poorly. A subject matter expert in the domain of technical operations can cut the cycles of iteration and get to a predictive model that works with an acceptable level of risk.
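The split-train-compare loop described above can be sketched generically. This toy harness takes any fit/predict pair, holds out a fraction of the data, and flags overfitting when training accuracy beats test accuracy by more than a threshold; the `max_gap` value is an illustrative assumption, not an established cutoff:

```python
import random

def validate(model_fit, model_predict, X, y,
             test_frac=0.3, max_gap=0.1, seed=0):
    """Split the data, fit on the training portion, and flag overfitting
    when training accuracy exceeds holdout accuracy by more than max_gap."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    train, test = idx[:cut], idx[cut:]

    params = model_fit([X[i] for i in train], [y[i] for i in train])

    def accuracy(ids):
        return sum(model_predict(params, X[i]) == y[i] for i in ids) / len(ids)

    train_acc, test_acc = accuracy(train), accuracy(test)
    return {"train_acc": train_acc, "test_acc": test_acc,
            "overfit": train_acc - test_acc > max_gap}
```

The `overfit` flag is exactly the "perfect on training data" failure mode the text warns about: high `train_acc` paired with noticeably lower `test_acc`.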

In short, implementing predictive analytics in an SRE context takes a significant investment both in setting up the right data set and applying the right logic. Stay tuned for a future blog post that will provide some pointers on how to achieve the best prediction results.

About the Author / JP Emelie Marcos

JP Emelie Marcos is the co-founder and CEO of SignifAI. Previously, JP served as the Chief Operating Officer of Tango, a leading mobile consumer application that reached 400 million regular users in five years. JP’s career spans 20+ years leading teams in technology startups and large companies as an investor, GM and CEO.

Filed Under: Blogs, DevOps Practice Tagged With: big data, data modeling, data science, data set, developers, logic, Predictive Analytics, site reliability engineering
