Part 2: From Reactive to Predictive: Training LLMs on Your Incident History

Every organization has a goldmine buried in its incident management system: thousands of hours of human reasoning, debugging insights, and hard-won knowledge about how systems actually fail in production. We write post-mortems, tag them with root causes, and then never systematically learn from them again.

The real breakthrough in autonomous SRE isn’t detecting known problems faster—it’s predicting unknown problems before they cascade. After training AI agents on three years of incident history at My Workplace, we’ve built systems that identify failure patterns 15-45 minutes before they would trigger traditional alerts. In production environments serving millions of users, 15 minutes is the difference between graceful degradation and a P1 outage.

Here’s how to transform your incident history from a documentation graveyard into predictive intelligence.

The Incident History Problem

Your incident database contains critical signals, but they’re trapped in unstructured narrative:

“API latency increased gradually over 6 hours starting at 04:00 UTC. Initial investigation focused on database performance but metrics looked normal. Eventually traced to a memory leak in the new caching layer introduced in v2.3.1 deployment on Tuesday. Cache objects weren’t being properly garbage collected under specific query patterns involving nested JSON fields.”

That paragraph contains multiple learnable patterns:

Gradual degradation suggesting resource exhaustion

Misdirection toward the database (red herring)

Connection to recent deployment

Interaction between feature (caching) and data structure (nested JSON)

Time-based pattern (overnight gradual buildup)

Traditional monitoring would only catch this once latency breaches thresholds. A trained agent can recognize the early signature: slight memory creep + recent deployment + specific query patterns = probable memory leak scenario.
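As a rough illustration of that early signature, the sketch below hard-codes the three conditions as a rule. The field names and cutoffs (`memory_slope_mb_per_min`, `hours_since_deploy`, the 24-hour window, and so on) are hypothetical, and a trained agent would learn such combinations from history rather than have them written by hand.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    # All fields are hypothetical names for signals your telemetry would provide.
    memory_slope_mb_per_min: float   # sustained growth in resident memory
    hours_since_deploy: float        # time since the last deployment
    nested_json_query_share: float   # fraction of queries touching nested JSON


def looks_like_memory_leak(s: Snapshot) -> bool:
    """Return True only when all three weak conditions hold at once."""
    memory_creep = s.memory_slope_mb_per_min > 0.5
    recent_deploy = s.hours_since_deploy <= 24
    risky_queries = s.nested_json_query_share > 0.10
    return memory_creep and recent_deploy and risky_queries
```

No single condition is alarming; the rule fires only on the conjunction, which is the property the learned version is meant to generalize.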

Building the Training Dataset

The quality of your predictive system depends entirely on how you structure your historical incident data. Raw post-mortems aren’t enough; you need to extract structured reasoning chains.

Phase 1: Incident Decomposition

Transform narrative post-mortems into structured knowledge:

    class IncidentDecomposer:
        def __init__(self, llm_client):
            # llm_client is assumed to expose an async extract() returning parsed JSON
            self.llm = llm_client

        async def extract_structure(self, postmortem):
            # f-string so the post-mortem text is actually interpolated into the prompt
            prompt = f"""
            Extract structured data from this incident post-mortem:

            {postmortem}

            Provide:
            1. Timeline of observed symptoms (with timestamps)
            2. Initial hypotheses considered (correct and incorrect)
            3. Diagnostic actions taken
            4. Root cause factors (technical and organizational)
            5. Precursor signals that existed before detection
            6. Dependencies and interaction effects
            7. Resolution actions and their effectiveness

            Format as JSON with confidence scores for each extraction.
            """

            structured = await self.llm.extract(prompt)

            # build_reasoning_graph is defined elsewhere (not shown here)
            return {
                'symptoms': structured['symptoms'],
                'reasoning_chain': self.build_reasoning_graph(structured),
                'precursors': structured['precursor_signals'],
                'causal_factors': structured['root_causes'],
                'resolution_pattern': structured['resolution']
            }
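LLM output is not guaranteed to be valid JSON or to contain every field, so it is worth validating before anything is indexed. The sketch below assumes the model returns a JSON object whose values are lists of `{"value": ..., "confidence": ...}` items. That shape, the function name `parse_extraction`, and the 0.5 cutoff are assumptions for illustration, not part of the pipeline above.

```python
import json

REQUIRED_KEYS = ("symptoms", "precursor_signals", "root_causes", "resolution")


def parse_extraction(raw: str, min_confidence: float = 0.5) -> dict:
    """Parse the model's JSON reply and drop low-confidence items."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed output
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        raise ValueError(f"extraction is missing keys: {missing}")
    return {
        key: [item for item in data[key]
              if item.get("confidence", 0.0) >= min_confidence]
        for key in REQUIRED_KEYS
    }
```

Rejecting incomplete extractions early keeps a single bad post-mortem from polluting the similarity indices built later.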

Phase 2: Creating Causal Embeddings

Standard embeddings capture semantic similarity but miss causal relationships. We need embeddings that understand “A caused B” not just “A and B are related.”

I built a custom embedding pipeline that creates separate vector spaces for:

Symptom Space: What the system looked like when failing

Cause Space: Why it was actually failing

Resolution Space: What actions fixed it

Precursor Space: What early signals existed

    class CausalEmbeddings:
        # load_finetuned_model, compute_causal_relationship and embed_precursors
        # are helper methods defined elsewhere (not shown here).
        def __init__(self):
            self.symptom_model = self.load_finetuned_model('symptoms')
            self.cause_model = self.load_finetuned_model('causes')

        def embed_incident(self, structured_incident):
            # Create embeddings in different semantic spaces
            symptom_embedding = self.symptom_model.embed(
                structured_incident['symptoms']
            )
            cause_embedding = self.cause_model.embed(
                structured_incident['causal_factors']
            )

            # Learn the relationship between symptom and cause
            causal_vector = self.compute_causal_relationship(
                symptom_embedding,
                cause_embedding
            )

            return {
                'symptom': symptom_embedding,
                'cause': cause_embedding,
                'causal_vector': causal_vector,
                'precursor': self.embed_precursors(
                    structured_incident['precursors']
                )
            }

The magic happens in the causal_vector: it encodes the transformation from “what we observed” to “what was actually wrong.” This lets the agent pattern-match on reasoning chains, not just surface symptoms.
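The post does not show how `compute_causal_relationship` works. One simple possibility, borrowed from translation-style embedding models, is to store the difference between the cause and symptom vectors, so that adding it to a new symptom vector yields a guess at the cause. The sketch below is that substitute technique, not the author's implementation, and it assumes both spaces share a dimensionality, which separately fine-tuned models would not guarantee.

```python
from typing import List

Vector = List[float]


def compute_causal_vector(symptom: Vector, cause: Vector) -> Vector:
    """Offset that maps a symptom embedding onto its cause embedding."""
    if len(symptom) != len(cause):
        raise ValueError("symptom and cause embeddings must share a dimension")
    return [c - s for s, c in zip(symptom, cause)]


def predict_cause(symptom: Vector, causal_vector: Vector) -> Vector:
    """Apply a stored offset to a new symptom embedding."""
    return [s + d for s, d in zip(symptom, causal_vector)]
```

A predicted cause vector can then be searched against the cause space to retrieve historical root causes that resemble it.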

The Predictive Architecture

Once you have causal embeddings, you can build a system that recognizes failure patterns before they fully manifest.

Layer 1: Continuous Pattern Matching

Every 60 seconds, the agent compares the current system state against the precursor embeddings:

    class PredictiveAgent:
        async def scan_for_precursors(self):
            current_state = await self.gather_system_state()

            # Embed current observability data
            current_embedding = self.embed_current_state(current_state)

            # Search for similar precursor patterns
            similar_precursors = self.precursor_index.query(
                current_embedding,
                top_k=10,
                threshold=0.75  # High confidence only
            )

            if similar_precursors:
                # We've seen this pattern lead to incidents before
                await self.initiate_predictive_investigation(
                    similar_precursors
                )
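`precursor_index.query` is not defined in the post. A minimal in-memory version, assuming cosine similarity and treating `threshold` as a minimum score, might look like the following. The class name `PrecursorIndex` and its `add` method are hypothetical; a production system would more likely use a vector store such as FAISS or Pinecone.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class PrecursorIndex:
    def __init__(self) -> None:
        self._items: List[Tuple[str, List[float]]] = []

    def add(self, incident_id: str, embedding: List[float]) -> None:
        self._items.append((incident_id, embedding))

    def query(self, embedding: List[float], top_k: int = 10,
              threshold: float = 0.75) -> List[Tuple[str, float]]:
        """Return up to top_k (incident_id, score) pairs scoring at least threshold."""
        scored = [(iid, cosine(embedding, vec)) for iid, vec in self._items]
        hits = [pair for pair in scored if pair[1] >= threshold]
        return sorted(hits, key=lambda pair: pair[1], reverse=True)[:top_k]
```

A linear scan like this is fine for a few hundred incidents, which is roughly the scale of a typical incident archive.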

Layer 2: Temporal Pattern Recognition

Incidents rarely appear instantly—they develop. The agent tracks time-series patterns in metric drift:

    class TemporalAnalyzer:
        # NOTE: `window` is currently unused; the three timescales below are fixed.
        def analyze_drift_pattern(self, metric_history, window='6h'):
            # Calculate rate of change across different timescales
            drift_signature = {
                '5min': self.compute_drift(metric_history, '5min'),
                '1hour': self.compute_drift(metric_history, '1hour'),
                '6hour': self.compute_drift(metric_history, '6hour')
            }

            # Match against known incident velocity patterns
            similar_velocities = self.velocity_index.query(
                drift_signature
            )

            # Incidents that started with similar drift patterns
            return similar_velocities

A memory leak develops differently from a connection pool exhaustion or a cascading retry storm. Each has a velocity signature—the agent learns these from historical data.
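`compute_drift` is left undefined above. One way to sketch it, assuming `metric_history` is a time-ordered list of `(unix_seconds, value)` samples, is a least-squares slope over the trailing window. The signature, the seconds-based windows, and the `drift_signature` helper here are assumptions rather than the post's actual code.

```python
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, metric value)

WINDOWS = {"5min": 300, "1hour": 3600, "6hour": 21600}


def compute_drift(history: List[Sample], window_seconds: float) -> float:
    """Least-squares slope (metric units per second) over the trailing window."""
    if not history:
        return 0.0
    cutoff = history[-1][0] - window_seconds
    points = [(t, v) for t, v in history if t >= cutoff]
    if len(points) < 2:
        return 0.0
    mean_t = sum(t for t, _ in points) / len(points)
    mean_v = sum(v for _, v in points) / len(points)
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        return 0.0
    cov = sum((t - mean_t) * (v - mean_v) for t, v in points)
    return cov / var_t


def drift_signature(history: List[Sample]) -> Dict[str, float]:
    """Slope at each timescale, in the same shape as the dict built above."""
    return {name: compute_drift(history, secs) for name, secs in WINDOWS.items()}
```

A steady leak shows a similar slope at every timescale, while a retry storm would show a steep 5-minute slope against a flat 6-hour one; that contrast is what a velocity signature captures.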

Layer 3: Multi-Signal Correlation

The breakthrough moment: combining weak signals that individually mean nothing but collectively indicate impending failure.

Real example from production:

Timestamp: 08:23 UTC

Weak Signal 1: Error rate increased 0.3% (below threshold)

Weak Signal 2: 99th percentile latency up 8% (below threshold)

Weak Signal 3: Database connection pool at 67% (normal range)

Weak Signal 4: Deployment occurred 4 hours ago (routine)

Weak Signal 5: Memory usage increasing 2MB/minute (very subtle)

None of these trigger alerts. But the agent recognized the pattern:

    precursor_match = {
        'incident_id': 'INC-2847',
        'similarity': 0.89,
        'historical_outcome': 'P1 - Memory leak in caching layer',
        'time_to_incident': '43 minutes',
        'confidence': 0.91
    }

The agent alerted our team 43 minutes before traditional monitoring would have caught it. We proactively rolled back the deployment, with zero user impact.
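To see why weak signals matter together, here is a toy calculation using a noisy-OR combination. This is a stand-in for illustration, not the embedding match the agent actually uses, and it assumes the signals are independent. The per-signal probabilities are invented; none would clear an alert cutoff alone.

```python
from typing import Dict


def noisy_or(signal_probs: Dict[str, float]) -> float:
    """Probability that at least one independent signal indicates trouble."""
    none_fire = 1.0
    for p in signal_probs.values():
        none_fire *= 1.0 - p
    return 1.0 - none_fire


# Hypothetical per-signal probabilities; these are not measured values.
weak_signals = {
    "error_rate_up": 0.15,
    "p99_latency_up": 0.20,
    "pool_at_67_percent": 0.05,
    "recent_deploy": 0.20,
    "memory_creep": 0.30,
}
combined = noisy_or(weak_signals)
```

With these made-up inputs the combined score comes out near 0.64, roughly double the strongest individual signal.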

Training on Your Own Incident History

You don’t need years of data to start. Here’s how to bootstrap with limited history:

Minimum Viable Dataset: 30-50 well-documented incidents

1. Extract incident narratives from your ticketing system (PagerDuty, Jira, etc.)
2. Use GPT-4 to structure them using the decomposition prompt above
3. Generate embeddings for symptoms, causes, and precursors
4. Build similarity indices using FAISS or Pinecone
5. Deploy passive monitoring to gather confidence data

Start with detection, not prediction. Get the agent recognizing known failure patterns faster than humans can. Build trust with your team.

Iteration Strategy:

Week 1-2: Passive monitoring, generate insights but don’t alert

Week 3-4: Shadow mode, compare agent detections to actual incidents

Week 5+: Gradually increase confidence threshold for automated alerts
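Shadow mode needs a scoreboard. The sketch below assumes each prediction and each incident can be reduced to a `(service, unix_seconds)` pair, and counts a prediction as a hit if an incident on the same service starts within a lead window afterwards. The one-hour window, the data shapes, and the function name are assumptions.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, float]  # (service name, unix timestamp in seconds)


def score_shadow_mode(predictions: List[Event], incidents: List[Event],
                      max_lead_seconds: float = 3600.0) -> Dict[str, float]:
    """Compare agent predictions with incidents that actually occurred."""
    matched_incidents = set()
    hits = 0
    for service, predicted_at in predictions:
        for idx, (inc_service, started_at) in enumerate(incidents):
            lead = started_at - predicted_at
            if (idx not in matched_incidents and inc_service == service
                    and 0 <= lead <= max_lead_seconds):
                matched_incidents.add(idx)  # each incident matches at most once
                hits += 1
                break
    precision = hits / len(predictions) if predictions else 0.0
    recall = hits / len(incidents) if incidents else 0.0
    return {"precision": precision, "recall": recall}
```

Tracking both numbers week over week gives the team concrete evidence for deciding when automated alerts can be switched on.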

The Fine-Tuning Breakthrough

Generic LLMs are good, but in our experience fine-tuned models are substantially better for incident prediction; the results below quantify the gap. We fine-tuned Claude Sonnet on our incident history:

Training Data Format:

    {
      "system": "You are an SRE analyzing system observability data to predict incidents.",
      "input": "Current metrics: [metric_data], Recent changes: [change_log]",
      "output": "Precursor pattern detected: Memory leak signature similar to INC-2847. Confidence: 89%. Predicted escalation: 35-50 minutes. Recommended action: Review caching layer in service-auth."
    }

We generated 500 training examples from our incident database:

300 from real incidents (precursor signals → actual outcome)

200 synthetic variations (using GPT-4 to generate plausible scenarios)
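Turning structured incidents into records of that shape is mostly string assembly. The sketch below writes one JSON object per line (JSONL). The input field names (`metrics`, `changes`, `outcome`) are hypothetical, and the exact schema a fine-tuning service expects differs by provider, so treat the output format as a placeholder.

```python
import json
from typing import Dict, Iterable

SYSTEM_PROMPT = "You are an SRE analyzing system observability data to predict incidents."


def to_training_record(incident: Dict[str, str]) -> Dict[str, str]:
    """Build one system/input/output record from a structured incident."""
    return {
        "system": SYSTEM_PROMPT,
        "input": (f"Current metrics: {incident['metrics']}, "
                  f"Recent changes: {incident['changes']}"),
        "output": incident["outcome"],
    }


def to_jsonl(incidents: Iterable[Dict[str, str]]) -> str:
    """Serialize incidents as one JSON training record per line."""
    return "\n".join(json.dumps(to_training_record(i)) for i in incidents)
```

Keeping real and synthetic examples in separate files makes it easy to measure later whether the synthetic variations help or hurt.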

Results after fine-tuning:

False positive rate: 6% (down from 23% with base model)

Detection time improvement: 15-45 minutes earlier

Root cause accuracy: 78% (up from 41%)

Handling False Positives: The Trust Problem

Predictive systems fail if they cry wolf. Our approach:

Confidence Calibration:

    class ConfidenceCalibrator:
        # The calculate_* helpers are defined elsewhere and are assumed to
        # return weights between 0 and 1.
        def calculate_prediction_confidence(self, match, current_version,
                                            current_conditions):
            base_similarity = match['similarity_score']

            # Adjust for recency (older patterns less reliable)
            recency_factor = self.calculate_recency_weight(
                match['incident_date']
            )

            # Adjust for system evolution (architecture changes)
            drift_factor = self.calculate_architecture_drift(
                match['system_version'],
                current_version
            )

            # Adjust for environmental similarity
            env_factor = self.calculate_environment_similarity(
                match['conditions'],
                current_conditions
            )

            calibrated = base_similarity * recency_factor * drift_factor * env_factor
            return calibrated

We only alert on predictions above 85% confidence. Anything 70-85% goes to a “watch list” for passive monitoring. Below 70%, we log but don’t act.
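The tiers can be expressed as a small routing function. The cutoffs come from the text; the function name and the example factor values are hypothetical. The text does not say which tier exactly 85% falls into, so this sketch places it on the watch list.

```python
def route_prediction(confidence: float) -> str:
    """Map a calibrated confidence in [0, 1] to an action tier."""
    if confidence > 0.85:
        return "alert"
    if confidence >= 0.70:
        return "watch"
    return "log"


# Example: a strong raw match is demoted by multiplicative calibration.
# The three factor values below are made up for illustration.
base_similarity = 0.90
calibrated = base_similarity * 0.95 * 0.90 * 1.00
tier = route_prediction(calibrated)
```

Because the factors multiply, a few values slightly below 1 pull the score down quickly: here a 0.90 similarity lands at about 0.77, which goes to the watch list instead of paging anyone.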

Feedback Loop:

Every prediction gets labeled:

True Positive: Predicted incident occurred as expected

False Positive: Predicted incident didn’t occur

Near Miss: Different incident occurred with similar symptoms

Delayed Hit: Predicted incident occurred but outside time window

This data feeds back into the confidence calibration model, continuously improving accuracy.
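One simple way to use those labels, sketched here under the assumption that each labeled prediction is a `(confidence, label)` pair, is to compare stated confidence with the observed hit rate in each 10-point confidence bucket. Whether a delayed hit counts as a success is a policy choice; this sketch counts it. The label strings are hypothetical spellings of the four categories above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SUCCESS_LABELS = {"true_positive", "delayed_hit"}  # policy assumption


def observed_hit_rates(labeled: List[Tuple[float, str]]) -> Dict[int, float]:
    """Observed success rate per bucket, keyed by lower bound in percent."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for confidence, label in labeled:
        # Work in whole percents to avoid float edge cases, e.g. 0.87 -> 80
        bucket = (round(confidence * 100) // 10) * 10
        totals[bucket] += 1
        if label in SUCCESS_LABELS:
            hits[bucket] += 1
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```

If predictions in the 80-89% bucket come true only half the time, the calibration factors are too generous and should be tightened.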

Real-World Impact: The Numbers

Six months of predictive monitoring at My Workplace:

Incident Prevention:

23 potential P1 incidents caught in precursor phase

47 P2 incidents detected 10+ minutes before threshold breach

3 critical security issues identified through anomaly correlation

Operational Efficiency:

Mean time to detection: 4.3 minutes (down from 18 minutes)

Mean time to root cause: 12 minutes (down from 47 minutes)

False positive rate: 6% (acceptable for our risk tolerance)

The Incident That Didn’t Happen:

Our most impressive catch: the agent detected a subtle interaction between a CDN configuration change and backend timeout settings that would have caused cascading failures during the morning traffic spike. Confidence score: 87%. Time before projected incident: 38 minutes.

We adjusted the timeout values, monitored the morning spike, and saw zero issues. Traditional monitoring would have caught this only after user impact began.

What’s Next

We’ve moved from reactive runbooks to predictive intelligence, but we’re still dependent on human approval for remediation actions. In Part 3, I’ll show you how to build the final piece: zero-touch infrastructure that not only predicts failures but automatically prevents them through autonomous healing actions.

The future of SRE isn’t humans investigating alerts; it’s systems that fix themselves before humans even know there was a problem.