Every organization has a goldmine buried in its incident management system: thousands of hours of human reasoning, debugging insights, and hard-won knowledge about how systems actually fail in production. We write post-mortems, tag them with root causes, and then never systematically learn from them again.
The real breakthrough in autonomous SRE isn’t detecting known problems faster; it’s predicting unknown problems before they cascade. After training AI agents on three years of incident history at My Workplace, we’ve built systems that identify failure patterns 15-45 minutes before they would trigger traditional alerts. In production environments serving millions of users, 15 minutes can be the difference between graceful degradation and a P1 outage.
Here’s how to transform your incident history from a documentation graveyard into predictive intelligence.
The Incident History Problem
Your incident database contains critical signals, but they’re trapped in an unstructured narrative:
“API latency increased gradually over 6 hours starting at 04:00 UTC. Initial investigation focused on database performance but metrics looked normal. Eventually traced to a memory leak in the new caching layer introduced in v2.3.1 deployment on Tuesday. Cache objects weren’t being properly garbage collected under specific query patterns involving nested JSON fields.”
That paragraph contains multiple learnable patterns:
- Gradual degradation suggesting resource exhaustion
- Misdirection toward the database (red herring)
- Connection to recent deployment
- Interaction between feature (caching) and data structure (nested JSON)
- Time-based pattern (overnight gradual buildup)
Traditional monitoring would only catch this once latency breaches thresholds. A trained agent can recognize the early signature: slight memory creep + recent deployment + specific query patterns = probable memory leak scenario.
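To make the "early signature" idea concrete, here is a minimal hand-written sketch of that three-signal check. It is only an illustration: the Snapshot fields and every cutoff value are assumptions invented for this example, and a trained agent would learn such patterns from data instead of hard-coding them.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical summary of recent observability data for one service."""
    memory_growth_mb_per_min: float   # slope of memory usage
    hours_since_last_deploy: float    # time since the most recent deployment
    nested_json_query_ratio: float    # share of traffic using the risky query shape


def looks_like_memory_leak(s: Snapshot) -> bool:
    """Return True only when all three early signals are present at once.

    The cutoffs are illustrative assumptions, not tuned values.
    """
    memory_creep = s.memory_growth_mb_per_min > 1.0
    recent_deploy = s.hours_since_last_deploy < 24.0
    risky_queries = s.nested_json_query_ratio > 0.10
    return memory_creep and recent_deploy and risky_queries


if __name__ == "__main__":
    # All three signals present
    print(looks_like_memory_leak(Snapshot(2.0, 4.0, 0.25)))
    # Same memory creep, but no recent deployment
    print(looks_like_memory_leak(Snapshot(2.0, 400.0, 0.25)))
```

No single field here would justify an alert on its own; the combination is what carries the signal.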
Building the Training Dataset
The quality of your predictive system depends entirely on how you structure your historical incident data. Raw post-mortems aren’t enough; you need to extract structured reasoning chains.
Phase 1: Incident Decomposition
Transform narrative post-mortems into structured knowledge:

class IncidentDecomposer:
    def __init__(self, llm_client):
        # llm_client is assumed to expose an async extract(prompt) method
        # that returns the model's JSON reply already parsed into a dict.
        self.llm = llm_client

    async def extract_structure(self, postmortem):
        # f-string so the post-mortem text is actually interpolated
        prompt = f"""
        Extract structured data from this incident post-mortem:

        {postmortem}

        Provide:
        - Timeline of observed symptoms (with timestamps)
        - Initial hypotheses considered (correct and incorrect)
        - Diagnostic actions taken
        - Root cause factors (technical and organizational)
        - Precursor signals that existed before detection
        - Dependencies and interaction effects
        - Resolution actions and their effectiveness

        Format as JSON with confidence scores for each extraction.
        """

        # Assumes the reply uses the keys read below
        structured = await self.llm.extract(prompt)

        return {
            'symptoms': structured['symptoms'],
            # build_reasoning_graph is a helper not shown here
            'reasoning_chain': self.build_reasoning_graph(structured),
            'precursors': structured['precursor_signals'],
            'causal_factors': structured['root_causes'],
            'resolution_pattern': structured['resolution']
        }

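Model replies are not guaranteed to be well-formed, so it may be worth validating them before they enter the dataset. The sketch below assumes the reply arrives as a raw JSON string and checks only for the four keys the decomposer reads; parse_extraction and REQUIRED_KEYS are hypothetical names, not part of any library.

```python
import json

# Keys that the decomposer above reads from the model's reply
REQUIRED_KEYS = ("symptoms", "precursor_signals", "root_causes", "resolution")


def parse_extraction(raw_response: str) -> dict:
    """Parse the model's JSON reply and check the fields used downstream."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"extraction is missing fields: {missing}")
    return data
```

Rejecting a malformed extraction early is cheaper than discovering a missing field after embeddings have been built on top of it.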
Phase 2: Creating Causal Embeddings
Standard embeddings capture semantic similarity but miss causal relationships. We need embeddings that understand “A caused B” not just “A and B are related.”
I built a custom embedding pipeline that creates separate vector spaces for:
- Symptom Space: What the system looked like when failing
- Cause Space: Why it was actually failing
- Resolution Space: What actions fixed it
- Precursor Space: What early signals existed

class CausalEmbeddings:
    def __init__(self):
        # load_finetuned_model is a helper not shown here
        self.symptom_model = self.load_finetuned_model('symptoms')
        self.cause_model = self.load_finetuned_model('causes')

    def embed_incident(self, structured_incident):
        # Create embeddings in different semantic spaces
        symptom_embedding = self.symptom_model.embed(
            structured_incident['symptoms']
        )

        cause_embedding = self.cause_model.embed(
            structured_incident['causal_factors']
        )

        # Learn the relationship between symptom and cause
        causal_vector = self.compute_causal_relationship(
            symptom_embedding,
            cause_embedding
        )

        return {
            'symptom': symptom_embedding,
            'cause': cause_embedding,
            'causal_vector': causal_vector,
            'precursor': self.embed_precursors(
                structured_incident['precursors']
            )
        }

The magic happens in the causal_vector: it encodes the transformation from “what we observed” to “what was actually wrong.” This lets the agent pattern-match on reasoning chains, not just surface symptoms.
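The article does not say how compute_causal_relationship works. One simple possibility, shown here purely as an assumption, is the difference between the unit-normalized cause and symptom vectors. That only makes sense if both embeddings have the same dimensionality and live in comparable spaces, which two separately fine-tuned models do not guarantee.

```python
import math
from typing import List, Sequence


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def compute_causal_relationship(
    symptom: Sequence[float], cause: Sequence[float]
) -> List[float]:
    """Hypothetical causal vector: the offset from symptom to cause.

    Adding the result to the normalized symptom vector yields the
    normalized cause vector.
    """
    if len(symptom) != len(cause):
        raise ValueError("embeddings must have the same dimensionality")
    s = normalize(symptom)
    c = normalize(cause)
    return [ci - si for si, ci in zip(s, c)]
```

A learned projection between the two spaces would be a more flexible choice; the offset is just the smallest thing that illustrates "a transformation from observed to actual."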
The Predictive Architecture
Once you have causal embeddings, you can build a system that recognizes failure patterns before they fully manifest.
Layer 1: Continuous Pattern Matching
Every 60 seconds, the agent compares the current system state against the precursor embeddings:

class PredictiveAgent:
    async def scan_for_precursors(self):
        current_state = await self.gather_system_state()

        # Embed current observability data
        current_embedding = self.embed_current_state(current_state)

        # Search for similar precursor patterns
        similar_precursors = self.precursor_index.query(
            current_embedding,
            top_k=10,
            threshold=0.75  # High confidence only
        )

        if similar_precursors:
            # We've seen this pattern lead to incidents before
            await self.initiate_predictive_investigation(
                similar_precursors
            )

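The precursor_index.query call is not defined in the article. As a rough sketch of what such a lookup could do, the function below runs a linear scan with cosine similarity, then applies the threshold and top_k cut. The dict-based index and the incident IDs in the usage example are assumptions; at scale a vector index library would take the place of the loop.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_precursors(
    index: Dict[str, Sequence[float]],
    current: Sequence[float],
    top_k: int = 10,
    threshold: float = 0.75,
) -> List[Tuple[str, float]]:
    """Return up to top_k (incident_id, similarity) pairs at or above threshold."""
    scored = [(incident_id, cosine_similarity(current, vec))
              for incident_id, vec in index.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:top_k]


if __name__ == "__main__":
    # Hypothetical two-dimensional precursor embeddings
    index = {"INC-A": [1.0, 0.0], "INC-B": [0.0, 1.0], "INC-C": [1.0, 1.0]}
    print(query_precursors(index, [1.0, 0.1]))
```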
Layer 2: Temporal Pattern Recognition
Incidents rarely appear instantly; they develop. The agent tracks time-series patterns in metric drift:

class TemporalAnalyzer:
    def analyze_drift_pattern(self, metric_history, window='6h'):
        # Calculate rate of change across different timescales
        drift_signature = {
            '5min': self.compute_drift(metric_history, '5min'),
            '1hour': self.compute_drift(metric_history, '1hour'),
            '6hour': self.compute_drift(metric_history, '6hour')
        }

        # Match against known incident velocity patterns
        similar_velocities = self.velocity_index.query(
            drift_signature
        )

        # Incidents that started with similar drift patterns
        return similar_velocities

A memory leak develops differently from connection pool exhaustion or a cascading retry storm. Each has a velocity signature, and the agent learns these from historical data.
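As one possible reading of compute_drift, the sketch below fits a least-squares slope over the trailing 5 minutes, 1 hour, and 6 hours of a metric. The (minute, value) sample format and the function names are assumptions made for this example, so the signature differs from the method call above.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in minutes, metric value)

# Trailing window widths, in minutes, matching the three timescales above
WINDOWS_MINUTES = {"5min": 5.0, "1hour": 60.0, "6hour": 360.0}


def slope_per_minute(points: Sequence[Sample]) -> float:
    """Least-squares slope of value over time; 0.0 if it cannot be fitted."""
    n = len(points)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    denom = sum((t - mean_t) ** 2 for t, _ in points)
    if denom == 0.0:
        return 0.0
    return sum((t - mean_t) * (v - mean_v) for t, v in points) / denom


def drift_signature(history: List[Sample]) -> Dict[str, float]:
    """Slope over each trailing window; history must be sorted by time."""
    if not history:
        return {name: 0.0 for name in WINDOWS_MINUTES}
    latest = history[-1][0]
    return {
        name: slope_per_minute([p for p in history if p[0] >= latest - width])
        for name, width in WINDOWS_MINUTES.items()
    }


if __name__ == "__main__":
    # Synthetic memory series growing 2 MB per minute for six hours
    history = [(float(t), 100.0 + 2.0 * t) for t in range(361)]
    print(drift_signature(history))
```

A steady leak shows a similar slope at every timescale, while a sudden spike would dominate the 5-minute slope and barely move the 6-hour one; that contrast is what a velocity signature captures.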
Layer 3: Multi-Signal Correlation
The breakthrough moment: combining weak signals that individually mean nothing but collectively indicate impending failure.
Real example from production:
Timestamp: 08:23 UTC
Weak Signal 1: Error rate increased 0.3% (below threshold)
Weak Signal 2: 99th percentile latency up 8% (below threshold)
Weak Signal 3: Database connection pool at 67% (normal range)
Weak Signal 4: Deployment occurred 4 hours ago (routine)
Weak Signal 5: Memory usage increasing 2 MB/minute (very subtle)

None of these trigger alerts. But the agent recognized the pattern:

precursor_match = {
    'incident_id': 'INC-2847',
    'similarity': 0.89,
    'historical_outcome': 'P1 – Memory leak in caching layer',
    'time_to_incident': '43 minutes',
    'confidence': 0.91
}

The agent alerted our team 43 minutes before traditional monitoring would have caught it. We proactively rolled back the deployment, with zero user impact.
Training on Your Own Incident History
You don’t need years of data to start. Here’s how to bootstrap with limited history:
Minimum Viable Dataset: 30-50 well-documented incidents
- Extract incident narratives from your ticketing system (PagerDuty, Jira, etc.)
- Use GPT-4 to structure them using the decomposition prompt above
- Generate embeddings for symptoms, causes, and precursors
- Build similarity indices using FAISS or Pinecone
- Deploy passive monitoring to gather confidence data
Start with detection, not prediction. Get the agent recognizing known failure patterns faster than humans can. Build trust with your team.
Iteration Strategy:
- Week 1-2: Passive monitoring, generate insights but don’t alert
- Week 3-4: Shadow mode, compare agent detections to actual incidents
- Week 5+: Gradually increase confidence threshold for automated alerts
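For the shadow-mode weeks, one way to compare agent detections with actual incidents is to compute precision (the share of detections followed by an incident) and recall (the share of incidents preceded by a detection). The sketch below assumes both are recorded as minutes since a common start time and uses a hypothetical 60-minute lookahead window.

```python
from typing import List, Tuple


def shadow_mode_report(
    detections: List[float],
    incidents: List[float],
    lookahead_minutes: float = 60.0,
) -> Tuple[float, float]:
    """Return (precision, recall) for agent detections versus real incidents.

    A detection counts as a hit when some incident starts within
    lookahead_minutes after it.
    """
    def matches(detected_at: float, incident_at: float) -> bool:
        return 0.0 <= incident_at - detected_at <= lookahead_minutes

    useful_detections = sum(
        1 for d in detections if any(matches(d, i) for i in incidents)
    )
    caught_incidents = sum(
        1 for i in incidents if any(matches(d, i) for d in detections)
    )
    precision = useful_detections / len(detections) if detections else 0.0
    recall = caught_incidents / len(incidents) if incidents else 0.0
    return precision, recall


if __name__ == "__main__":
    # Three detections and three incidents, in minutes since the start of the trial
    print(shadow_mode_report([10.0, 200.0, 500.0], [40.0, 520.0, 900.0]))
```

Tracking both numbers over the shadow period gives the team evidence for when automated alerts can be switched on.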
The Fine-Tuning Breakthrough
Generic LLMs are good, but in our experience fine-tuned models are far better for incident prediction. We fine-tuned Claude Sonnet on our incident history:
Training Data Format:

{
  "system": "You are an SRE analyzing system observability data to predict incidents.",
  "input": "Current metrics: [metric_data], Recent changes: [change_log]",
  "output": "Precursor pattern detected: Memory leak signature similar to INC-2847. Confidence: 89%. Predicted escalation: 35-50 minutes. Recommended action: Review caching layer in service-auth."
}

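A small helper can turn structured incidents into records of this shape. The sketch below is an assumption about tooling, not a description of our pipeline: to_training_record and to_jsonl are hypothetical names, and one-JSON-object-per-line output is a common convention that may not match what a given fine-tuning service expects.

```python
import json
from typing import Dict, Iterable

SYSTEM_PROMPT = (
    "You are an SRE analyzing system observability data to predict incidents."
)


def to_training_record(
    metric_data: str, change_log: str, expected_output: str
) -> Dict[str, str]:
    """Build one example using the system/input/output fields shown above."""
    return {
        "system": SYSTEM_PROMPT,
        "input": f"Current metrics: {metric_data}, Recent changes: {change_log}",
        "output": expected_output,
    }


def to_jsonl(records: Iterable[Dict[str, str]]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)
```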
We generated 500 training examples from our incident database:
- 300 from real incidents (precursor signals → actual outcome)
- 200 synthetic variations (using GPT-4 to generate plausible scenarios)
Results after fine-tuning:
- False positive rate: 6% (down from 23% with base model)
- Detection time improvement: 15-45 minutes earlier
- Root cause accuracy: 78% (up from 41%)
Handling False Positives: The Trust Problem
Predictive systems fail if they cry wolf. Our approach:
Confidence Calibration:

class ConfidenceCalibrator:
    def calculate_prediction_confidence(self, match, current_version,
                                        current_conditions):
        # current_version and current_conditions describe the live system
        # that the historical match is being compared against.
        base_similarity = match['similarity_score']

        # Adjust for recency (older patterns less reliable)
        recency_factor = self.calculate_recency_weight(
            match['incident_date']
        )

        # Adjust for system evolution (architecture changes)
        drift_factor = self.calculate_architecture_drift(
            match['system_version'],
            current_version
        )

        # Adjust for environmental similarity
        env_factor = self.calculate_environment_similarity(
            match['conditions'],
            current_conditions
        )

        calibrated = base_similarity * recency_factor * drift_factor * env_factor

        return calibrated

We only alert on predictions above 85% confidence. Anything 70-85% goes to a “watch list” for passive monitoring. Below 70%, we log but don’t act.
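The three tiers map directly onto a small routing function. In this sketch the function names are hypothetical, and the calibrate helper assumes every factor lies between 0 and 1, which the article does not state explicitly.

```python
def calibrate(
    base_similarity: float,
    recency_factor: float,
    drift_factor: float,
    env_factor: float,
) -> float:
    """Multiply the similarity by each adjustment factor (all assumed in [0, 1])."""
    for value in (base_similarity, recency_factor, drift_factor, env_factor):
        if not 0.0 <= value <= 1.0:
            raise ValueError("similarity and factors must be between 0 and 1")
    return base_similarity * recency_factor * drift_factor * env_factor


def route_prediction(confidence: float) -> str:
    """Map a calibrated confidence to one of the three tiers described above."""
    if confidence > 0.85:
        return "alert"   # above 85%: alert the team
    if confidence >= 0.70:
        return "watch"   # 70-85%: watch list, passive monitoring
    return "log"         # below 70%: log only


if __name__ == "__main__":
    # A strong match discounted for age drops from alert to watch
    print(route_prediction(0.89))
    print(route_prediction(calibrate(0.89, 0.9, 1.0, 1.0)))
```

Because each factor is at most 1, calibration under this assumption can only lower a score, so an old or architecturally stale match has to be very similar to reach the alert tier.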
Feedback Loop:
Every prediction gets labeled:
- True Positive: Predicted incident occurred as expected
- False Positive: Predicted incident didn’t occur
- Near Miss: Different incident occurred with similar symptoms
- Delayed Hit: Predicted incident occurred but outside time window
This data feeds back into the confidence calibration model, continuously improving accuracy.
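The four labels can be assigned mechanically once the outcome of a prediction is known. The sketch below is one possible encoding: it assumes each prediction carries a predicted cause and a time window, and it treats a mismatched cause as a near miss, which simplifies the "similar symptoms" criterion.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, Optional


class Outcome(Enum):
    TRUE_POSITIVE = "true_positive"
    FALSE_POSITIVE = "false_positive"
    NEAR_MISS = "near_miss"
    DELAYED_HIT = "delayed_hit"


def label_prediction(
    predicted_cause: str,
    window_minutes: float,
    actual_cause: Optional[str],
    minutes_until_incident: Optional[float],
) -> Outcome:
    """Label one prediction; pass None for both actual_* values if nothing happened."""
    if actual_cause is None or minutes_until_incident is None:
        return Outcome.FALSE_POSITIVE
    if actual_cause != predicted_cause:
        return Outcome.NEAR_MISS
    if minutes_until_incident <= window_minutes:
        return Outcome.TRUE_POSITIVE
    return Outcome.DELAYED_HIT


def false_positive_rate(labels: Iterable[Outcome]) -> float:
    """Share of labeled predictions that were false positives."""
    counts = Counter(labels)
    total = sum(counts.values())
    return counts[Outcome.FALSE_POSITIVE] / total if total else 0.0
```

Keeping near misses and delayed hits separate from false positives matters: both suggest the agent saw something real, so they should adjust calibration differently than a prediction where nothing happened.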
Real-World Impact: The Numbers
Six months of predictive monitoring at My Workplace:
Incident Prevention:
- 23 potential P1 incidents caught in precursor phase
- 47 P2 incidents detected 10+ minutes before threshold breach
- 3 critical security issues identified through anomaly correlation
Operational Efficiency:
- Mean time to detection: 4.3 minutes (down from 18 minutes)
- Mean time to root cause: 12 minutes (down from 47 minutes)
- False positive rate: 6% (acceptable for our risk tolerance)
The Incident That Didn’t Happen:
Our most impressive catch: the agent detected a subtle interaction between a CDN configuration change and backend timeout settings that would have caused cascading failures during the morning traffic spike. Confidence score: 87%. Time before projected incident: 38 minutes.
We adjusted the timeout values and monitored the morning spike: zero issues. Traditional monitoring would have caught this only after user impact began.
What’s Next
We’ve moved from reactive runbooks to predictive intelligence, but we’re still dependent on human approval for remediation actions. In Part 3, I’ll show you how to build the final piece: zero-touch infrastructure that not only predicts failures but automatically prevents them through autonomous healing actions.
The future of SRE isn’t humans investigating alerts; it’s systems that fix themselves before humans even know there was a problem.

