I have been called at 3 a.m. more times than I would like to admit. The payment system went down during Black Friday; a database silently filled up until it crashed; a certificate expired on a Sunday morning. Each incident taught a painful lesson about what we were not watching closely enough.
After a decade of handling production incidents and implementing monitoring across startups and enterprise environments, I developed a framework that actually works. Most monitoring guides tell you to install tools such as Prometheus and Grafana. That advice is not wrong, but it rarely answers the most important question: What should you actually monitor?
This article outlines the 10-layer monitoring framework we use in production environments.
Every layer exists because it was missed at some point, and missing it caused real outages.
Every system is different, but these layers cover the fundamentals that apply to most Kubernetes platforms and even traditional VM-based setups.
The Layers
Monitoring works best when broken down into layers. Each layer answers a different operational question. Skipping a layer creates blind spots that only show up during incidents.
The 10 layers are:
- System and Infrastructure
- Application Performance
- HTTP, API and Real User Monitoring
- Database
- Cache
- Message Queues
- Tracing Infrastructure
- SSL and Certificates
- External Dependencies
- Log Patterns and Errors
Let’s walk through each one.
Layer 1: System and Infrastructure
This is the foundation. If the infrastructure is unhealthy, everything above it suffers. Monitoring happens at two levels: Nodes and pods.
Node Level
Pods run on nodes. When nodes struggle, pods follow. Using Prometheus with Node Exporter, we monitor:
- CPU usage and load average
- Memory usage and available memory
- Disk usage and disk I/O
- Network I/O
- Node availability
- Kubelet health
A common mistake is focusing only on pod metrics. In one incident, an e-commerce application repeatedly crashed during a flash sale. Pod CPU and memory looked normal. After an hour of debugging application code, the real issue surfaced: The node disk was 98% full due to unrotated container logs. The application failed because it could not write temporary files. The root cause was visible only at the node level.
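A node-level disk check of the kind that would have caught this incident can be sketched in a few lines. This is only an illustration using Python's standard library, not a replacement for Node Exporter; the function names and the 90% threshold are assumptions made for the example.

```python
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def disk_alert(used_percent: float, threshold: float = 90.0) -> bool:
    """True when usage is at or above the alert threshold (90% is an assumed default)."""
    return used_percent >= threshold


if __name__ == "__main__":
    pct = disk_used_percent("/")
    print(f"/ is {pct:.1f}% full, alert={disk_alert(pct)}")
```

The point of the sketch is where the check runs: on the node's filesystem, not inside the pod's resource metrics.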
Pod and Container Level
At the pod level, we track:
- Pod availability
- Container restart counts
- Resource requests versus actual usage
- CPU and memory limit saturation
Kubernetes Error States
Kubernetes exposes error states that should never be ignored:
- CrashLoopBackOff
- ImagePullBackOff
- OOMKilled
- Pending pods
- Evicted pods
- Failed liveness or readiness probes
If a production workload enters CrashLoopBackOff, it is an immediate signal that something is broken.
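As a rough sketch, the states above can be detected from the pod objects returned by `kubectl get pods -o json`. The function name `pod_problems` is hypothetical, and the code assumes the standard pod status fields (`phase`, `reason`, `containerStatuses`); probe failures are not covered here because they surface as events rather than as status fields.

```python
from typing import Any

# Waiting reasons from the list above that should never be ignored.
BAD_WAITING_REASONS = {"CrashLoopBackOff", "ImagePullBackOff"}


def pod_problems(pod: dict[str, Any]) -> list[str]:
    """Return the error states found in one pod object (kubectl JSON shape)."""
    status = pod.get("status", {})
    problems = []
    if status.get("phase") == "Pending":
        problems.append("Pending")
    if status.get("reason") == "Evicted":
        problems.append("Evicted")
    for cs in status.get("containerStatuses", []):
        # Current state: is the container stuck waiting for a bad reason?
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in BAD_WAITING_REASONS:
            problems.append(waiting["reason"])
        # Previous state: was the last termination an out-of-memory kill?
        terminated = cs.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            problems.append("OOMKilled")
    return problems
```

In practice, kube-state-metrics exposes similar information as Prometheus metrics; the sketch only shows which fields carry the signal.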
Layer 2: Application Performance
Infrastructure metrics show whether the system is alive. Application metrics show whether the code is behaving.
This is where application performance monitoring (APM) tools are essential. Common tools include:
- New Relic
- Datadog APM
- Dynatrace
- Elastic APM
- SigNoz
For distributed tracing specifically:
- Jaeger
- Zipkin
- Grafana Tempo
OpenTelemetry is used for vendor-neutral instrumentation. Key metrics include:
- Endpoint response times
- Error rates
- Transaction traces
- Slow database queries
- Slow external API calls
When latency increases, tracing should make it immediately clear whether the bottleneck is code, database or an external dependency.
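To make two of these metrics concrete, here is a minimal sketch of how a tail-latency percentile and an error rate can be computed from raw request samples. APM tools do this for you; the helper names are hypothetical, and counting only 5xx responses as errors is an assumption.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def error_rate(status_codes: list[int]) -> float:
    """Fraction of requests that returned a 5xx status (assumed definition of an error)."""
    if not status_codes:
        return 0.0
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)
```

Percentiles matter because a single slow request among ten barely moves the median but dominates the 95th percentile, which is what the slowest users experience.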
Layer 3: HTTP, API and Real User Monitoring
This layer answers a critical question: Can users actually use the system?
It consists of synthetic monitoring, API monitoring and real user monitoring (RUM).
Synthetic Monitoring
Synthetic monitoring checks systems from the outside, without caring how they work internally.
Using tools such as the Prometheus Blackbox Exporter, we probe:
- Health endpoints
- Critical user flows
- HTTP status codes
- Response latency
Running probes from multiple regions is essential. An application may be reachable from one location but unavailable elsewhere due to CDN or routing issues.
API Monitoring
An endpoint returning HTTP 200 does not mean it is working correctly.
API monitoring validates behavior and data, not just availability. Commonly used tools include:
- Checkly
- Runscope
- Postman Monitors
- Assertible
Checks typically validate:
- Response schemas
- Authentication flows
- Correct data returned
- Sequential API workflows
- Proper error responses
API monitoring often catches failures that health checks completely miss.
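The difference between a health check and an API check can be illustrated with a small validator that looks past the status code. This is a sketch under assumptions: the function name, the failure messages and the example field names are hypothetical, and real tools such as those listed above offer much richer assertions.

```python
import json


def check_api_response(status: int, body: str, required: dict[str, type]) -> list[str]:
    """Return a list of failures; an empty list means the check passed."""
    failures = []
    if status != 200:
        failures.append(f"unexpected status {status}")
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return failures + ["body is not valid JSON"]
    if not isinstance(data, dict):
        return failures + ["body is not a JSON object"]
    # Validate that each required field is present and has the expected type.
    for field, expected_type in required.items():
        if field not in data:
            failures.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            failures.append(f"wrong type for field: {field}")
    return failures
```

A response such as `200` with the body `{"order_id": "A1"}` would pass a plain health check but fail here if the schema also requires a `total` field.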
Real User Monitoring
Real user monitoring shows what actual users experience in their browsers. Tools include:
- Google Analytics 4
- Datadog RUM
- New Relic Browser
- LogRocket
- Sentry
- SpeedCurve
Metrics tracked:
- Core Web Vitals
- Page load times by region and device
- Front-end JavaScript errors
- User session flows
- Time to interactive
Back-end metrics cannot reveal front-end performance problems. RUM fills that gap.
Layer 4: Database
Databases deserve their own monitoring layer. Common tools include:
- Prometheus PostgreSQL Exporter
- Prometheus MySQL Exporter
- PgHero
- Percona Monitoring and Management
Key signals include:
- Active connections
- Query latency
- Slow queries
- Replication lag
- Lock waits and deadlocks
- Disk and memory usage
Connection pool exhaustion is one of the most common production failure modes and often goes unnoticed until users are already affected.
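One way to catch pool exhaustion early is to alert on saturation rather than on failure. The sketch below assumes you already have the number of active connections and the configured maximum (in PostgreSQL, for example, these can be read from `pg_stat_activity` and the `max_connections` setting); the function names and the 80% and 95% thresholds are assumptions for illustration.

```python
def pool_saturation(active: int, max_connections: int) -> float:
    """Fraction of the connection limit currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections


def pool_status(active: int, max_connections: int,
                warn: float = 0.80, critical: float = 0.95) -> str:
    """Classify saturation as 'ok', 'warn' or 'critical' (thresholds are assumed)."""
    ratio = pool_saturation(active, max_connections)
    if ratio >= critical:
        return "critical"
    if ratio >= warn:
        return "warn"
    return "ok"
```

Alerting at the warning level leaves time to act before new connections start being refused.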
Layer 5: Cache
Caches such as Redis or Memcached are critical performance components. Commonly used tools include:
- Prometheus Redis Exporter
- Redis INFO metrics
- Prometheus Memcached Exporter
- Cloud provider metrics for managed services
Important metrics include:
- Cache hit and miss ratio
- Memory usage
- Eviction rate
- Connection count
- Availability
A dropping hit ratio or rising eviction rate usually indicates misconfiguration or insufficient memory.
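For Redis, the hit ratio can be derived from the `keyspace_hits` and `keyspace_misses` counters in the output of the `INFO` command. The following is a minimal sketch that parses that text output; the helper names are hypothetical, and it works on a captured string rather than a live connection.

```python
from typing import Optional


def parse_info(text: str) -> dict[str, str]:
    """Parse `key:value` lines from Redis INFO output, skipping section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def hit_ratio(info: dict[str, str]) -> Optional[float]:
    """Hits divided by total lookups, or None if there have been no lookups yet."""
    hits = int(info.get("keyspace_hits", 0))
    misses = int(info.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None
```

These counters are cumulative since the server started, so a monitoring system would normally compare successive samples to compute the ratio over a recent window. The same applies to `evicted_keys` when tracking the eviction rate.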
Layer 6: Message Queues
Message queues power asynchronous processing. When they back up, work stops.
Tools include:
- Kafka Exporter
- Burrow
- RabbitMQ Prometheus Plugin
- SQS exporters or cloud-native metrics
Key metrics include:
- Queue depth
- Consumer lag
- Message throughput
- Dead letter queue size
Growing consumer lag is an early warning that the system is falling behind.
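For a Kafka-style log, consumer lag per partition is the latest offset in the log minus the offset the consumer group has committed. The sketch below shows that arithmetic on plain dictionaries; the function names are hypothetical, treating a partition with no committed offset as starting from 0 is an assumption, and tools such as Kafka Exporter or Burrow compute this for you.

```python
def consumer_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Per-partition lag: latest log offset minus the group's committed offset."""
    # Assumption: a partition with no committed offset is treated as offset 0.
    return {
        partition: max(0, end - committed.get(partition, 0))
        for partition, end in end_offsets.items()
    }


def lag_is_growing(samples: list[int]) -> bool:
    """True when total lag increased at every step across consecutive samples."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))
```

The trend check matters more than any single value: a large but shrinking lag means consumers are catching up, while a small but steadily growing lag means they are not.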
Layer 7: Tracing Infrastructure
Tracing systems need monitoring, too. Metrics to watch:
- Collector availability
- Span ingestion rate
- Storage back-end health
- Dropped spans
If tracing infrastructure fails, visibility disappears exactly when it is most needed.
Layer 8: SSL and Certificates
Certificate expiry causes avoidable outages.
Monitor:
- Certificate expiration dates
- Days remaining until expiry
- TLS versions
Alerting well in advance (30 days or more) prevents last-minute emergencies.
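The days-remaining calculation can be sketched with Python's standard `ssl` module. The `notAfter` string format below is the one returned by `ssl.SSLSocket.getpeercert()`; the function names are hypothetical, and the 30-day default mirrors the guidance above. Fetching the certificate from a live host is left out so the example stays self-contained.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` and a certificate's notAfter timestamp."""
    # cert_time_to_seconds parses strings such as "Jan  1 00:00:00 2030 GMT".
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True when the certificate expires within `warn_days` days (or already has)."""
    return days_until_expiry(not_after, now) <= warn_days
```

Passing `now` explicitly, for example `datetime.now(timezone.utc)`, keeps the function easy to test against fixed dates.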
Layer 9: External Dependencies
Most applications depend on services outside their control. Monitoring should include:
- External API response times
- Error rates from third-party calls
- Availability of critical services
- Aggregated status page alerts
When an external provider is down, knowing it immediately saves hours of internal debugging.
Layer 10: Log Patterns and Errors
Metrics show you something is wrong. Logs explain why. Centralized logging should track:
- Sudden spikes in error rates
- New or unusual error patterns
- Timeouts and connection failures
- Memory and disk errors
- Database deadlocks
Pattern-based log monitoring often detects issues before metrics cross alert thresholds.
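One simple way to spot new or unusual error patterns is to strip the variable parts out of each message and compare the resulting templates against a known baseline. The sketch below is illustrative only: the function names are hypothetical, and the normalization rules (hex values and numbers) are assumptions that a real pipeline would extend.

```python
import re
from collections import Counter
from typing import Iterable


def normalize(message: str) -> str:
    """Collapse variable parts (hex values, numbers) so similar errors group together."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)
    return re.sub(r"\d+", "<n>", message)


def new_patterns(baseline: Iterable[str], current: Iterable[str]) -> Counter:
    """Count normalized patterns in `current` that never appeared in `baseline`."""
    known = {normalize(m) for m in baseline}
    return Counter(p for p in map(normalize, current) if p not in known)
```

With this approach, "timeout after 30 ms on shard 4" and "timeout after 45 ms on shard 7" count as the same known pattern, while a first-time "deadlock detected" message stands out immediately.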
What We Don’t Monitor
Monitoring everything is neither practical nor useful. We intentionally avoid:
- High-cardinality metrics
- Debug-level logging in production
- Per-request logs for high-traffic endpoints
Reducing noise keeps dashboards usable and alerts meaningful.
Alerting Philosophy
Alerts should reflect user impact, not internal noise. We follow three levels:
- Immediate paging for user-facing failures
- Notifications for issues requiring attention soon
- Silent logging for informational signals
If an alert consistently fires without anyone acting on it, we remove it.
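The three levels and the pruning rule can be expressed as a small routing function. This is a sketch of the idea, not our actual tooling: the names, the two boolean inputs and the pruning thresholds are all assumptions chosen for illustration.

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"      # immediate paging for user-facing failures
    NOTIFY = "notify"  # notification for issues requiring attention soon
    LOG = "log"        # silent logging for informational signals


def route_alert(user_facing: bool, needs_attention_soon: bool) -> Route:
    """Map an alert to one of the three levels; user impact always wins."""
    if user_facing:
        return Route.PAGE
    if needs_attention_soon:
        return Route.NOTIFY
    return Route.LOG


def should_remove(times_fired: int, times_acted_on: int, min_fired: int = 10) -> bool:
    """Flag an alert for removal if it fired often and nobody ever acted (assumed rule)."""
    return times_fired >= min_fired and times_acted_on == 0
```

Encoding the policy this way forces every new alert to answer one question first: does it reflect user impact?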
Wrapping Up
Effective monitoring is not about collecting more data. It is about asking the right questions:
- Is the infrastructure healthy?
- Is the application behaving correctly?
- Can users actually use the system?
- Are dependencies working?
- What errors are occurring?
When monitoring can answer these questions clearly, teams are prepared for production reality.

