
4 Predictive Analytics Challenges Facing SRE Teams

By: contributor on August 23, 2017

“Predictive analytics” refers to the many techniques used to analyze data to make predictions about the future, such as data mining, statistics, modeling, machine learning and artificial intelligence. Being able to predict technical and business outcomes is one of the main promises of big data. Indeed, many businesses have invested heavily in infrastructure to anticipate key performance indicators such as demand, pricing or maintenance. The investment required is substantial but, when it works, it yields a very positive ROI.


In this post, we’ll examine why implementing predictive analytics in a site reliability engineering (SRE) context requires a significant investment in both setting up the right data sets and applying the right logic to that data.


SRE should in theory be a prime target area for predictive analytics. Why? Because companies have already widely adopted monitoring tools over the past decade. In doing so, they've amassed extensive system health and behavior data that is ripe for analysis. SRE teams have also increased their ability to make sense of this data as data science tools have become more accessible, and basic data science aptitude is increasingly becoming a desired skill when hiring engineers.

Additionally, the SRE team is responsible for a mission-critical KPI: system availability. This single metric is directly tied to customer SLAs in digitally dependent businesses. SRE teams have a strong incentive to implement predictive analytics. Why? Having the ability to prevent downtime by addressing potential issues before they hit the production environment is hugely valuable. More system availability translates into more revenue, higher customer satisfaction and lower costs of problem mitigation.

However, the reality is that an SRE's job is still mostly focused on diagnosis, mitigation and "fixing," as opposed to more proactive tasks. This is in large part because predictive analytics applied to SRE/DevOps has proven difficult to implement and perform in a disciplined manner. As a result, it remains an aspirational goal at most companies.

SRE Challenges in Predictive Analytics

Up next, let's explore the four challenges that need to be overcome to implement a high-quality predictive analytics practice in an SRE context:

  • Data collection
  • Data quality
  • Data volume
  • Model usability

Challenge 1: Data Collection

First, the working data sets need to fit what you are trying to predict. In this case, it is challenging to get the right data set because the SRE team's mission is very broad: ensuring the reliability of the entire system. The difficulty lies in the fact that there is no such thing as "full system" data readily available. All the relevant data is distributed across siloed or overlapping repositories and monitoring tools. Centralizing the data requires either a full-system agent (which takes a lot of work to set up) or a multitude of API integrations, each of which must be mastered individually. An additional obstacle in getting the right data lies in the inconsistent nature of events, time series data and logs. These disparate data types can't be analyzed together unless they are transformed into a uniform data set.
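To make the "uniform data set" idea concrete, here is a minimal sketch of normalizing a time-series metric point and a discrete event into one common record shape. The field names (`ts`, `host`, `metric`, `service`, `type`) are hypothetical stand-ins for whatever your monitoring tools actually emit, not a reference to any particular product's schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Record:
    """One observation in a uniform shape, regardless of its source."""
    timestamp: datetime
    source: str   # e.g., "metrics", "events", "logs"
    entity: str   # the host, service or container the data refers to
    name: str     # metric name, event type or log level
    value: float  # numeric payload (1.0 for pure occurrences)

def from_metric(point: dict) -> Record:
    # Time-series point, e.g. {"ts": 1503446400, "host": "web-1", "metric": "cpu", "value": 0.93}
    return Record(datetime.fromtimestamp(point["ts"], tz=timezone.utc),
                  "metrics", point["host"], point["metric"], point["value"])

def from_event(event: dict) -> Record:
    # Discrete event, e.g. {"ts": 1503446460, "service": "api", "type": "deploy"}
    # Events have no magnitude, so we record their occurrence as 1.0.
    return Record(datetime.fromtimestamp(event["ts"], tz=timezone.utc),
                  "events", event["service"], event["type"], 1.0)
```

Once everything is a `Record`, metrics, events and logs can be sorted, joined and aggregated together, which is the precondition for analyzing them as one data set.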

Challenge 2: Data Quality

SRE environments emit lots of data, and some of it won't make sense for the purpose of the analysis. Whichever data collection method is used, the information tracked often reflects some level of bias on the part of whoever set up the tracking mechanism in the first place. This leads to capturing irrelevant data while inadvertently leaving out important information. In other words, the data is often noisy and incomplete at the same time.

The key to successful modeling lies in the selection of prediction variables whose data is obtained before the predicted event (in this case, a failure) happens. Defining the variables that lead to the determination of a subsystem failure is critical and requires specific domain knowledge. It is difficult to define and extract a pure data set sanitized of noise. The more data you leverage, the higher the chances that you are actually "feeding the beast" with irrelevant information. Too much irrelevant data leads to inaccurate models and false positives.
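One mechanical way to enforce the "obtained before the predicted event" rule is to carve a fixed feature window that ends some lead time before each failure, so no post-event (or during-event) data leaks into the training set. This is a sketch under assumed parameters (a one-hour window, a five-minute lead); real values would come from your own domain knowledge:

```python
from datetime import datetime, timedelta

def build_feature_window(records, failure_time,
                         window=timedelta(hours=1),
                         lead=timedelta(minutes=5)):
    """Keep only observations that fall inside a window ending `lead`
    before the failure, so the model never sees data from after (or
    during) the event it is supposed to predict."""
    cutoff = failure_time - lead            # last usable moment
    start = cutoff - window                 # earliest usable moment
    return [r for r in records if start <= r["timestamp"] < cutoff]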

Challenge 3: Data Volume

Here you can have two different types of challenges: too little data and too much.

Not enough data

You typically need at least 100 instances of what you are trying to predict, and probably at least 100 counterexamples where it didn't happen, to train a model. This assumes the data set is clean and directly relevant to a very narrow, specifically defined problem. In the case of SRE, the scope of issues that can affect a system is such that the data set required is very substantial: likely several months' worth of a wide variety of system behavior measurements. For example, if you collect data only through passive collection, consider the lag time between t-zero (when you start collecting data) and the time when you would have results. During that time, you have a bootstrapping problem: your model can't yet be trained to accurately predict what you want.
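The "at least 100 of each" figure is a rough rule of thumb, not a statistical law, but it is cheap to check before spending any training cycles. A minimal sketch of that gate, assuming labeled examples are already collected:

```python
from collections import Counter

MIN_PER_CLASS = 100  # rough rule of thumb, not a hard statistical law

def ready_to_train(labels, minimum=MIN_PER_CLASS):
    """Return (ok, counts): ok is True only when both outcomes (failure
    and non-failure) are present with at least `minimum` examples each."""
    counts = Counter(labels)
    ok = len(counts) >= 2 and all(n >= minimum for n in counts.values())
    return ok, dict(counts)
```

During the bootstrapping period described above, this check simply keeps returning `False`, which is an honest signal that the model isn't trainable yet.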

Too much data

There are two practical points to make around the notion of “too much data.”

  • Lack of standardization: Processing large volumes of data for analysis remains a difficult engineering challenge. There is no standardization for data pipelines; every method has a limit in terms of how many inputs it can ingest per second, and ingest rates often end up becoming bottlenecks in large enterprise environments.
  • Volume does not equal accuracy: There is a point beyond which you get diminishing returns: statistical theorems show that after a certain point, feeding more data into a predictive analytics model will not provide more accurate results.

Overall, getting the quality and quantity of the training data right is the best investment you can make in building your predictive capabilities.

Challenge 4: Model Usability

From a pure "mechanics" standpoint, the analytical concepts and methods exist to reflect the complex thinking required to predict the behavior of an enterprise's combination of technology, information and infrastructure. Predictive software relies heavily on advanced algorithms and methodologies such as logistic regression, time series analysis and decision trees. The problem is that the workings of the kinds of systems that SREs deal with are very hard to understand in the first place, let alone predict. It is not uncommon even for SREs to not entirely understand how a system works if they were not the ones who initially set it up. In addition, every company's system is different.

When data scientists design predictive models without years of experience in an SRE capacity across multiple environments (in other words, without leveraging subject matter expertise), the results rarely work perfectly. Often the models they create lack usability, even after a validation process to confirm their quality. A modeler will split the available data into a training set and a holdout set, run the model on the training set to generate predictions, and compare its performance against the actual outcomes. Producing a model with high overall predictive accuracy requires multiple iterations, and nine times out of 10 the result is overfitting: a model that looks perfect on the training data but generalizes poorly. A subject matter expert in the domain of technical operations can cut the cycles of iteration and get to a predictive model that works with an acceptable level of risk.
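The split-train-compare loop described above can be sketched generically. This toy harness takes any fit/predict pair, holds out a fraction of the data, and flags overfitting when training accuracy beats test accuracy by more than a threshold; the `max_gap` value is an illustrative assumption, not an established cutoff:

```python
import random

def validate(model_fit, model_predict, X, y,
             test_frac=0.3, max_gap=0.1, seed=0):
    """Split the data, fit on the training portion, and flag overfitting
    when training accuracy exceeds holdout accuracy by more than max_gap."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    train, test = idx[:cut], idx[cut:]

    params = model_fit([X[i] for i in train], [y[i] for i in train])

    def accuracy(ids):
        return sum(model_predict(params, X[i]) == y[i] for i in ids) / len(ids)

    train_acc, test_acc = accuracy(train), accuracy(test)
    return {"train_acc": train_acc, "test_acc": test_acc,
            "overfit": train_acc - test_acc > max_gap}
```

The `overfit` flag is exactly the "perfect on training data" failure mode the text warns about: high `train_acc` paired with noticeably lower `test_acc`.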

In short, implementing predictive analytics in an SRE context takes a significant investment both in setting up the right data set and applying the right logic. Stay tuned for a future blog post that will provide some pointers on how to achieve the best prediction results.

About the Author / JP Emelie Marcos

JP Emelie Marcos is the co-founder and CEO of SignifAI. Previously, JP served as the Chief Operating Officer of Tango, a leading mobile consumer application that reached 400 million regular users in five years. JP’s career spans 20+ years leading teams in technology startups and large companies as an investor, GM and CEO.

Filed Under: Blogs, DevOps Practice Tagged With: big data, data modeling, data science, data set, developers, logic, Predictive Analytics, site reliability engineering
