Every organization has a goldmine buried in its incident management system: thousands of hours of human reasoning, debugging insights, and hard-won knowledge about how systems actually fail in production. We write post-mortems, tag them with root causes, and then never systematically learn from them again.
The real breakthrough in autonomous SRE isn’t detecting known problems faster—it’s predicting unknown problems before they cascade. After training AI agents on three years of incident history at My Workplace, we’ve built systems that identify failure patterns 15-45 minutes before they would trigger traditional alerts. In production environments serving millions of users, 15 minutes is the difference between graceful degradation and a P1 outage.
Here’s how to transform your incident history from a documentation graveyard into predictive intelligence.
The Incident History Problem
Your incident database contains critical signals, but they’re trapped in an unstructured narrative:
“API latency increased gradually over 6 hours starting at 04:00 UTC. Initial investigation focused on database performance but metrics looked normal. Eventually traced to a memory leak in the new caching layer introduced in v2.3.1 deployment on Tuesday. Cache objects weren’t being properly garbage collected under specific query patterns involving nested JSON fields.”
That paragraph contains multiple learnable patterns:
- Gradual degradation suggesting resource exhaustion
- Misdirection toward the database (red herring)
- Connection to recent deployment
- Interaction between feature (caching) and data structure (nested JSON)
- Time-based pattern (overnight gradual buildup)
Traditional monitoring would only catch this once latency breaches thresholds. A trained agent can recognize the early signature: slight memory creep + recent deployment + specific query patterns = probable memory leak scenario.
Building the Training Dataset
The quality of your predictive system depends entirely on how you structure your historical incident data. Raw post-mortems aren’t enough; you need to extract structured reasoning chains.
Phase 1: Incident Decomposition
Transform narrative post-mortems into structured knowledge:
class IncidentDecomposer:
    def __init__(self, llm_client):
        self.llm = llm_client

    async def extract_structure(self, postmortem):
        # f-string so the post-mortem text is actually interpolated into the prompt
        prompt = f"""
        Extract structured data from this incident post-mortem:
        {postmortem}
        Provide:
        - Timeline of observed symptoms (with timestamps)
        - Initial hypotheses considered (correct and incorrect)
        - Diagnostic actions taken
        - Root cause factors (technical and organizational)
        - Precursor signals that existed before detection
        - Dependencies and interaction effects
        - Resolution actions and their effectiveness
        Format as JSON with confidence scores for each extraction.
        """
        structured = await self.llm.extract(prompt)
        # build_reasoning_graph is a helper method not shown here
        return {
            'symptoms': structured['symptoms'],
            'reasoning_chain': self.build_reasoning_graph(structured),
            'precursors': structured['precursor_signals'],
            'causal_factors': structured['root_causes'],
            'resolution_pattern': structured['resolution']
        }
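Because the prompt asks for confidence scores, it is worth validating the response before trusting it. The following is a minimal sketch, not our production code: it assumes the LLM client returns raw JSON text and that each key maps to a list of objects with "text" and "confidence" fields. The function name parse_extraction and the 0.6 cutoff are hypothetical.

```python
import json

# Keys the decomposer above reads from the LLM response.
REQUIRED_KEYS = ("symptoms", "precursor_signals", "root_causes", "resolution")


def parse_extraction(raw_json, min_confidence=0.6):
    """Validate an extraction and drop items below min_confidence.

    Assumption: each key maps to a list of
    {"text": ..., "confidence": ...} objects.
    """
    data = json.loads(raw_json)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"extraction is missing keys: {missing}")
    return {
        key: [
            item for item in data[key]
            if item.get("confidence", 0.0) >= min_confidence
        ]
        for key in REQUIRED_KEYS
    }


# Illustrative input modeled on the post-mortem quoted earlier.
sample = json.dumps({
    "symptoms": [{"text": "API latency rose over 6 hours", "confidence": 0.9}],
    "precursor_signals": [{"text": "slow memory growth", "confidence": 0.4}],
    "root_causes": [{"text": "memory leak in caching layer", "confidence": 0.8}],
    "resolution": [],
})
cleaned = parse_extraction(sample)
print(cleaned["precursor_signals"])  # low-confidence item is filtered out
```

Dropping low-confidence extractions early keeps guesses out of the embeddings built in the next phase.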
Phase 2: Creating Causal Embeddings
Standard embeddings capture semantic similarity but miss causal relationships. We need embeddings that understand “A caused B” not just “A and B are related.”
I built a custom embedding pipeline that creates separate vector spaces for:
- Symptom Space: What the system looked like when failing
- Cause Space: Why it was actually failing
- Resolution Space: What actions fixed it
- Precursor Space: What early signals existed
class CausalEmbeddings:
    def __init__(self):
        # load_finetuned_model and the other helper methods are not shown here
        self.symptom_model = self.load_finetuned_model('symptoms')
        self.cause_model = self.load_finetuned_model('causes')

    def embed_incident(self, structured_incident):
        # Create embeddings in different semantic spaces
        symptom_embedding = self.symptom_model.embed(
            structured_incident['symptoms']
        )
        cause_embedding = self.cause_model.embed(
            structured_incident['causal_factors']
        )
        # Learn the relationship between symptom and cause
        causal_vector = self.compute_causal_relationship(
            symptom_embedding,
            cause_embedding
        )
        return {
            'symptom': symptom_embedding,
            'cause': cause_embedding,
            'causal_vector': causal_vector,
            'precursor': self.embed_precursors(
                structured_incident['precursors']
            )
        }
The magic happens in the causal_vector: it encodes the transformation from “what we observed” to “what was actually wrong.” This lets the agent pattern-match on reasoning chains, not just surface symptoms.
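The code above does not show how compute_causal_relationship works. One simple possibility, assumed here purely for illustration, is to treat the causal vector as the offset from the symptom embedding to the cause embedding. That only makes sense if both spaces share a dimensionality, which is itself an assumption. The function names and the toy 3-dimensional vectors are hypothetical.

```python
import math


def causal_vector(symptom, cause):
    """Offset that maps a symptom embedding onto its cause embedding."""
    if len(symptom) != len(cause):
        raise ValueError("embeddings must share a dimensionality")
    return [c - s for s, c in zip(symptom, cause)]


def project_cause(new_symptom, historical_offset):
    """Apply a historical offset to a new symptom embedding."""
    return [s + v for s, v in zip(new_symptom, historical_offset)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy vectors standing in for real embeddings.
past_symptom = [1.0, 0.0, 0.0]
past_cause = [0.0, 1.0, 0.0]
offset = causal_vector(past_symptom, past_cause)

# A new symptom close to the historical one should project near the old cause.
new_symptom = [0.9, 0.1, 0.0]
guess = project_cause(new_symptom, offset)
print(round(cosine(guess, past_cause), 3))
```

A learned mapping would likely capture more than a fixed offset can, but the offset version is easy to inspect when you are debugging matches.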
The Predictive Architecture
Once you have causal embeddings, you can build a system that recognizes failure patterns before they fully manifest.
Layer 1: Continuous Pattern Matching
Every 60 seconds, the agent compares the current system state against the precursor embeddings:
class PredictiveAgent:
    async def scan_for_precursors(self):
        current_state = await self.gather_system_state()
        # Embed current observability data
        current_embedding = self.embed_current_state(current_state)
        # Search for similar precursor patterns
        similar_precursors = self.precursor_index.query(
            current_embedding,
            top_k=10,
            threshold=0.75  # High confidence only
        )
        if similar_precursors:
            # We've seen this pattern lead to incidents before
            await self.initiate_predictive_investigation(
                similar_precursors
            )
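The precursor_index above can be anything that supports a thresholded nearest-neighbor query. As a sketch, here is a brute-force cosine-similarity index with the same query(embedding, top_k, threshold) shape. The class name, the incident IDs, and the 2-dimensional vectors are made up for the example; with more than a few thousand incidents you would likely swap in a real vector index.

```python
import math
from dataclasses import dataclass


@dataclass
class PrecursorMatch:
    incident_id: str
    similarity: float


class BruteForcePrecursorIndex:
    """Hypothetical stand-in for a vector index; fine for small datasets."""

    def __init__(self):
        self._items = []  # list of (incident_id, unit-length vector)

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0:
            raise ValueError("cannot index or query a zero vector")
        return [x / norm for x in vec]

    def add(self, incident_id, embedding):
        self._items.append((incident_id, self._normalize(embedding)))

    def query(self, embedding, top_k=10, threshold=0.75):
        q = self._normalize(embedding)
        # Dot product of unit vectors equals cosine similarity.
        scored = [
            PrecursorMatch(iid, sum(a * b for a, b in zip(q, vec)))
            for iid, vec in self._items
        ]
        hits = [m for m in scored if m.similarity >= threshold]
        hits.sort(key=lambda m: m.similarity, reverse=True)
        return hits[:top_k]


index = BruteForcePrecursorIndex()
index.add("INC-A", [1.0, 0.0])
index.add("INC-B", [0.0, 1.0])
index.add("INC-C", [1.0, 1.0])
matches = index.query([1.0, 0.2])
print([m.incident_id for m in matches])
```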
Layer 2: Temporal Pattern Recognition
Incidents rarely appear instantly—they develop. The agent tracks time-series patterns in metric drift:
class TemporalAnalyzer:
    def analyze_drift_pattern(self, metric_history, window='6h'):
        # Calculate rate of change across different timescales
        drift_signature = {
            '5min': self.compute_drift(metric_history, '5min'),
            '1hour': self.compute_drift(metric_history, '1hour'),
            '6hour': self.compute_drift(metric_history, '6hour')
        }
        # Match against known incident velocity patterns
        similar_velocities = self.velocity_index.query(
            drift_signature
        )
        # Incidents that started with similar drift patterns
        return similar_velocities
A memory leak develops differently from a connection pool exhaustion or a cascading retry storm. Each has a velocity signature—the agent learns these from historical data.
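To make the idea of a velocity signature concrete, here is one way compute_drift could work: a least-squares slope over each window. This is a sketch under my own assumptions (the function names, the window table, and the synthetic series are all hypothetical), not the analyzer we run in production.

```python
from datetime import datetime, timedelta


def slope_per_minute(samples):
    """Least-squares slope of (timestamp, value) samples, in units per minute."""
    if len(samples) < 2:
        return 0.0
    t0 = samples[0][0]
    xs = [(t - t0).total_seconds() / 60.0 for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return 0.0
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov / var_x


WINDOWS = {
    "5min": timedelta(minutes=5),
    "1hour": timedelta(hours=1),
    "6hour": timedelta(hours=6),
}


def drift_signature(samples, now):
    """Slope of the metric over each trailing window ending at `now`."""
    return {
        name: slope_per_minute([(t, v) for t, v in samples if t >= now - span])
        for name, span in WINDOWS.items()
    }


# Synthetic series: memory growing a steady 2 MB per minute for 6 hours.
start = datetime(2024, 1, 1)
samples = [(start + timedelta(minutes=m), 1000.0 + 2.0 * m) for m in range(361)]
signature = drift_signature(samples, now=samples[-1][0])
print({name: round(value, 3) for name, value in signature.items()})
```

A steady leak like this one shows roughly the same slope at every timescale. I would expect a sudden event such as a retry storm to show a steep short-window slope and a nearly flat long-window slope, which is what makes the signatures distinguishable.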
Layer 3: Multi-Signal Correlation
The breakthrough moment: combining weak signals that individually mean nothing but collectively indicate impending failure.
Real example from production:
Timestamp: 08:23 UTC
Weak Signal 1: Error rate increased 0.3% (below threshold)
Weak Signal 2: 99th percentile latency up 8% (below threshold)
Weak Signal 3: Database connection pool at 67% (normal range)
Weak Signal 4: Deployment occurred 4 hours ago (routine)
Weak Signal 5: Memory usage increasing 2MB/minute (very subtle)
None of these trigger alerts. But the agent recognized the pattern:
precursor_match = {
    'incident_id': 'INC-2847',
    'similarity': 0.89,
    'historical_outcome': 'P1 – Memory leak in caching layer',
    'time_to_incident': '43 minutes',
    'confidence': 0.91
}
The agent alerted our team 43 minutes before traditional monitoring would have caught it. We proactively rolled back the deployment, with zero user impact.
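If you want intuition for why several sub-threshold signals can add up to a strong one, a noisy-OR combination is a simple way to see it. To be clear, this is a swapped-in illustration and not how the agent scores matches (that uses embedding similarity). The per-signal probabilities are made-up values, and noisy-OR assumes the signals are independent, which real metrics rarely are.

```python
def noisy_or(probabilities):
    """Probability that at least one independent signal indicates a real problem."""
    remaining = 1.0
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must be between 0 and 1")
        remaining *= 1.0 - p
    return 1.0 - remaining


# Illustrative (invented) chance that each weak signal alone precedes an incident.
weak_signals = {
    "error_rate_up": 0.30,
    "p99_latency_up": 0.30,
    "recent_deployment": 0.40,
    "memory_creep": 0.50,
}

combined = noisy_or(weak_signals.values())
print(round(combined, 3))  # 1 - 0.7 * 0.7 * 0.6 * 0.5 = 0.853
```

No single signal here exceeds 0.50, yet the combination lands above 0.85.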
Training on Your Own Incident History
You don’t need years of data to start. Here’s how to bootstrap with limited history:
Minimum Viable Dataset: 30-50 well-documented incidents
- Extract incident narratives from your ticketing system (PagerDuty, Jira, etc.)
- Use GPT-4 to structure them using the decomposition prompt above
- Generate embeddings for symptoms, causes, and precursors
- Build similarity indices using FAISS or Pinecone
- Deploy passive monitoring to gather confidence data
Start with detection, not prediction. Get the agent recognizing known failure patterns faster than humans can. Build trust with your team.
Iteration Strategy:
- Week 1-2: Passive monitoring, generate insights but don’t alert
- Week 3-4: Shadow mode, compare agent detections to actual incidents
- Week 5+: Gradually increase confidence threshold for automated alerts
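Shadow mode only builds trust if you score it. Here is a minimal sketch of that comparison: match each agent detection to the first unmatched incident that starts within a lead window, then report precision, recall, and mean lead time. The function name, the 60-minute window, and the timestamps are assumptions for the example.

```python
from datetime import datetime, timedelta


def evaluate_shadow_mode(predictions, incidents, max_lead=timedelta(minutes=60)):
    """Score predictions (datetimes) against actual incident start times."""
    incidents = sorted(incidents)
    matched = set()
    lead_times = []
    for predicted_at in sorted(predictions):
        for i, started_at in enumerate(incidents):
            lead = started_at - predicted_at
            # A hit is an unmatched incident starting within max_lead of the prediction.
            if i not in matched and timedelta(0) <= lead <= max_lead:
                matched.add(i)
                lead_times.append(lead)
                break
    hits = len(lead_times)
    return {
        "precision": hits / len(predictions) if predictions else 0.0,
        "recall": hits / len(incidents) if incidents else 0.0,
        "mean_lead": sum(lead_times, timedelta(0)) / hits if hits else None,
    }


# Invented example day: three detections, three incidents.
day = datetime(2024, 1, 1)
predictions = [day + timedelta(hours=h) for h in (8, 12, 15)]
incidents = [
    day + timedelta(hours=8, minutes=40),
    day + timedelta(hours=15, minutes=20),
    day + timedelta(hours=20),
]
report = evaluate_shadow_mode(predictions, incidents)
print(report)
```

In this example two of three detections precede a real incident, and one incident is missed entirely, so both precision and recall come out at two thirds with a mean lead time of 30 minutes.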
The Fine-Tuning Breakthrough
Generic LLMs are good, but in our testing a fine-tuned model did substantially better at incident prediction (see the results below). We fine-tuned Claude Sonnet on our incident history:
Training Data Format:
{
    "system": "You are an SRE analyzing system observability data to predict incidents.",
    "input": "Current metrics: [metric_data], Recent changes: [change_log]",
    "output": "Precursor pattern detected: Memory leak signature similar to INC-2847. Confidence: 89%. Predicted escalation: 35-50 minutes. Recommended action: Review caching layer in service-auth."
}
We generated 500 training examples from our incident database:
- 300 from real incidents (precursor signals → actual outcome)
- 200 synthetic variations (using GPT-4 to generate plausible scenarios)
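Turning incidents into records in the format above is mostly string assembly. The sketch below builds one record and serializes it as a JSON Lines row. JSONL is an assumption on my part (check what your fine-tuning service expects), and the function names and metric field names are hypothetical.

```python
import json

SYSTEM_PROMPT = (
    "You are an SRE analyzing system observability data to predict incidents."
)


def to_training_record(metric_data, change_log, expected_output):
    """Build one record with the system/input/output fields shown above."""
    return {
        "system": SYSTEM_PROMPT,
        "input": (
            f"Current metrics: {json.dumps(metric_data, sort_keys=True)}, "
            f"Recent changes: {json.dumps(change_log)}"
        ),
        "output": expected_output,
    }


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


# Values taken from the weak-signal example earlier in this post.
example = to_training_record(
    metric_data={"memory_growth_mb_per_min": 2, "db_pool_utilization": 0.67},
    change_log=["deployment 4 hours ago"],
    expected_output=(
        "Precursor pattern detected: Memory leak signature similar to INC-2847."
    ),
)
line = to_jsonl([example])
print(line)
```

Sorting the metric keys keeps the input string stable across runs, which makes duplicate examples easier to spot.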
Results after fine-tuning:
- False positive rate: 6% (down from 23% with base model)
- Detection time improvement: 15-45 minutes earlier
- Root cause accuracy: 78% (up from 41%)
Handling False Positives: The Trust Problem
Predictive systems fail if they cry wolf. Our approach:
Confidence Calibration:
class ConfidenceCalibrator:
    # current_version and current_conditions are passed in explicitly;
    # the helper methods called below are not shown here
    def calculate_prediction_confidence(self, match, current_version,
                                        current_conditions):
        base_similarity = match['similarity_score']
        # Adjust for recency (older patterns less reliable)
        recency_factor = self.calculate_recency_weight(
            match['incident_date']
        )
        # Adjust for system evolution (architecture changes)
        drift_factor = self.calculate_architecture_drift(
            match['system_version'],
            current_version
        )
        # Adjust for environmental similarity
        env_factor = self.calculate_environment_similarity(
            match['conditions'],
            current_conditions
        )
        calibrated = base_similarity * recency_factor * drift_factor * env_factor
        return calibrated
We only alert on predictions above 85% confidence. Anything 70-85% goes to a “watch list” for passive monitoring. Below 70%, we log but don’t act.
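The three tiers are easy to express in code, and a worked example shows how quickly multiplicative factors pull a score down. This is a sketch: the function names and the factor values are illustrative, and I am assuming every factor lies between 0 and 1.

```python
def calibrate(similarity, recency, drift, environment):
    """Multiply the base similarity by the three adjustment factors."""
    factors = (similarity, recency, drift, environment)
    if any(not 0.0 <= f <= 1.0 for f in factors):
        raise ValueError("all factors must be between 0 and 1")
    return similarity * recency * drift * environment


def route(confidence):
    """Map a calibrated confidence to the tiers described above."""
    if confidence > 0.85:
        return "alert"
    if confidence >= 0.70:
        return "watch"
    return "log"


# Illustrative factors: a strong match on a slightly older, slightly drifted system.
score = calibrate(similarity=0.89, recency=0.95, drift=0.90, environment=0.95)
print(round(score, 4), route(score))  # 0.7229 watch
```

Even a 0.89 similarity lands on the watch list once modest discounts are applied, so an alert requires a strong match and factors that are all close to 1.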
Feedback Loop:
Every prediction gets labeled:
- True Positive: Predicted incident occurred as expected
- False Positive: Predicted incident didn’t occur
- Near Miss: Different incident occurred with similar symptoms
- Delayed Hit: Predicted incident occurred but outside time window
This data feeds back into the confidence calibration model, continuously improving accuracy.
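The labeling itself can be automated once you know what actually happened. A minimal sketch follows, assuming each prediction carries a predicted cause and a deadline for its time window. The names are hypothetical, and treating "a different cause label" as a near miss is a simplification of "different incident with similar symptoms".

```python
from datetime import datetime
from enum import Enum


class Outcome(Enum):
    TRUE_POSITIVE = "true_positive"
    FALSE_POSITIVE = "false_positive"
    NEAR_MISS = "near_miss"
    DELAYED_HIT = "delayed_hit"


def label_prediction(predicted_cause, window_end,
                     actual_cause=None, actual_start=None):
    """Label one prediction given the incident (if any) that followed it."""
    if actual_start is None:
        return Outcome.FALSE_POSITIVE  # nothing happened
    if actual_cause != predicted_cause:
        return Outcome.NEAR_MISS  # something else happened
    if actual_start <= window_end:
        return Outcome.TRUE_POSITIVE  # right cause, inside the window
    return Outcome.DELAYED_HIT  # right cause, but later than predicted


# Invented example: predicted a memory leak by 09:00, and it began at 08:45.
window_end = datetime(2024, 1, 1, 9, 0)
label = label_prediction(
    "memory_leak",
    window_end,
    actual_cause="memory_leak",
    actual_start=datetime(2024, 1, 1, 8, 45),
)
print(label.value)
```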
Real-World Impact: The Numbers
Six months of predictive monitoring at My Workplace:
Incident Prevention:
- 23 potential P1 incidents caught in precursor phase
- 47 P2 incidents detected 10+ minutes before threshold breach
- 3 critical security issues identified through anomaly correlation
Operational Efficiency:
- Mean time to detection: 4.3 minutes (down from 18 minutes)
- Mean time to root cause: 12 minutes (down from 47 minutes)
- False positive rate: 6% (acceptable for our risk tolerance)
The Incident That Didn’t Happen:
Our most impressive catch: the agent detected a subtle interaction between a CDN configuration change and backend timeout settings that would have caused cascading failures during the morning traffic spike. Confidence score: 87%. Time before projected incident: 38 minutes.
We adjusted the timeout values, monitored the morning spike, and saw zero issues. Traditional monitoring would have caught this only after user impact began.
What’s Next
We’ve moved from reactive runbooks to predictive intelligence, but we’re still dependent on human approval for remediation actions. In Part 3, I’ll show you how to build the final piece: zero-touch infrastructure that not only predicts failures but automatically prevents them through autonomous healing actions.
The future of SRE isn’t humans investigating alerts; it’s systems that fix themselves before humans even know there was a problem.

