The 10-Layer Monitoring Framework That Saved Our Clients From 3 a.m. Pages

I have been called at 3 a.m. more times than I would like to admit. The payment system went down during Black Friday; a database silently filled up until it crashed; a certificate expired on a Sunday morning. Each incident taught a painful lesson about what we were not watching closely enough.

After a decade of handling production incidents and implementing monitoring across startups and enterprise environments, I developed a framework that actually works. Most monitoring guides tell you to install tools such as Prometheus and Grafana. That advice is not wrong, but it rarely answers the most important question: What should you actually monitor?

This article outlines the 10-layer monitoring framework we use in production environments.

Every layer exists because it was missed at some point, and missing it caused real outages.

Every system is different, but these layers cover the fundamentals that apply to most Kubernetes platforms and even traditional VM-based setups.

The Layers

Monitoring works best when broken down into layers. Each layer answers a different operational question. Skipping a layer creates blind spots that only show up during incidents.

The 10 layers are:

System and Infrastructure
Application Performance
HTTP, API and Real User Monitoring
Database
Cache
Message Queues
Tracing Infrastructure
SSL and Certificates
External Dependencies
Log Patterns and Errors

Let’s walk through each one.

Layer 1: System and Infrastructure

This is the foundation. If the infrastructure is unhealthy, everything above it suffers. Monitoring happens at two levels: nodes and pods.

Node Level

Pods run on nodes. When nodes struggle, pods follow. Using Prometheus with Node Exporter, we monitor:

CPU usage and load average

Memory usage and available memory

Disk usage and disk I/O

Network I/O

Node availability

Kubelet health

A common mistake is focusing only on pod metrics. In one incident, an e-commerce application repeatedly crashed during a flash sale. Pod CPU and memory looked normal. After an hour of debugging application code, the real issue surfaced: The node disk was 98% full due to unrotated container logs. The application failed because it could not write temporary files. The root cause was visible only at the node level.
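The disk condition in that incident can be illustrated with a minimal standalone sketch. It is not how Node Exporter works; it only shows the kind of node-level check that would have flagged the problem. The function names and the 80% and 90% thresholds are assumptions chosen for illustration.

```python
import shutil


def disk_used_percent(total_bytes: int, free_bytes: int) -> float:
    """Return the percentage of a filesystem that is in use."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * (total_bytes - free_bytes) / total_bytes


def check_node_disk(path: str = "/", warn_at: float = 80.0, page_at: float = 90.0) -> str:
    """Classify disk usage for the filesystem that contains `path`.

    The thresholds are illustrative assumptions, not recommendations.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent = disk_used_percent(usage.total, usage.free)
    if percent >= page_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"
```

Run on the node itself (or against a host path mounted into a privileged pod), a check like this sees the filesystem that pod-level CPU and memory metrics never touch.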

Pod and Container Level

At the pod level, we track:

Pod availability

Container restart counts

Resource requests versus actual usage

CPU and memory limit saturation

Kubernetes Error States

Kubernetes exposes error states that should never be ignored:

CrashLoopBackOff

ImagePullBackOff

OOMKilled

Pending pods

Evicted pods

Failed liveness or readiness probes

If a production workload enters CrashLoopBackOff, it is an immediate signal that something is broken.
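As a rough sketch, several of these states can be read from a pod object shaped like the JSON the Kubernetes API returns. The function name and the example pod are hypothetical. Probe failures are not covered here because they surface as events rather than as fields in the pod status.

```python
WAITING_REASONS = {"CrashLoopBackOff", "ImagePullBackOff"}


def pod_error_states(pod: dict) -> set:
    """Collect error states from a pod dict shaped like Kubernetes API JSON."""
    status = pod.get("status", {})
    found = set()
    if status.get("phase") == "Pending":
        found.add("Pending")
    if status.get("reason") == "Evicted":
        found.add("Evicted")
    for container in status.get("containerStatuses", []):
        # A container stuck in a back-off loop reports the reason while waiting.
        waiting = container.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in WAITING_REASONS:
            found.add(waiting["reason"])
        # An OOM kill is recorded on the previous (terminated) container state.
        terminated = container.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            found.add("OOMKilled")
    return found


# Hypothetical pod used only to illustrate the input shape.
example_pod = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {
                "name": "web",
                "restartCount": 7,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ],
    }
}
```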

Layer 2: Application Performance

Infrastructure metrics show whether the system is alive. Application metrics show whether the code is behaving.

This is where application performance monitoring (APM) tools are essential. Common tools include:

New Relic

Datadog APM

Dynatrace

Elastic APM

SigNoz

For distributed tracing specifically:

Jaeger

Zipkin

Grafana Tempo

OpenTelemetry is used for vendor-neutral instrumentation. Key metrics include:

Endpoint response times

Error rates

Transaction traces

Slow database queries

Slow external API calls

When latency increases, tracing should make it immediately clear whether the bottleneck is code, database or an external dependency.
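To make that idea concrete, here is a minimal sketch of attributing a trace's time to a category. The span shape (a `category` and a `duration_ms` per span) is an assumption for illustration; real tracing back ends expose richer span data, and the sketch assumes the durations do not overlap.

```python
from collections import defaultdict


def slowest_category(spans: list) -> tuple:
    """Return (category, total_ms) for the category with the most time.

    Assumes each span is a dict with 'category' and 'duration_ms', and that
    durations are non-overlapping (self time), which is a simplification.
    """
    if not spans:
        raise ValueError("no spans to analyze")
    totals = defaultdict(float)
    for span in spans:
        totals[span["category"]] += span["duration_ms"]
    category = max(totals, key=totals.get)
    return category, totals[category]


# Hypothetical trace for one slow request.
example_spans = [
    {"category": "code", "duration_ms": 40.0},
    {"category": "database", "duration_ms": 310.0},
    {"category": "external", "duration_ms": 95.0},
    {"category": "database", "duration_ms": 20.0},
]
```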

Layer 3: HTTP, API and Real User Monitoring

This layer answers a critical question: Can users actually use the system?

It consists of synthetic monitoring, API monitoring and real user monitoring (RUM).

Synthetic Monitoring

Synthetic monitoring checks systems from the outside, without caring how they work internally.

Using tools such as the Prometheus Blackbox Exporter, we probe:

Health endpoints

Critical user flows

HTTP status codes

Response latency

Running probes from multiple regions is essential. An application may be reachable from one location but unavailable elsewhere due to CDN or routing issues.
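For illustration only, an outside-in probe can be sketched with the Python standard library. This is not the Blackbox Exporter; the function names and the one-second latency budget are assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Optional, Tuple


def evaluate_probe(status: Optional[int], latency_s: float, max_latency_s: float = 1.0) -> str:
    """Turn a raw probe result into a verdict. `status` is None if unreachable."""
    if status is None:
        return "down"
    if not 200 <= status < 300:
        return "failing"
    if latency_s > max_latency_s:
        return "slow"
    return "ok"


def probe(url: str, timeout_s: float = 5.0) -> Tuple[Optional[int], float]:
    """Fetch `url` once and return (HTTP status or None, elapsed seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with a 4xx or 5xx status
    except OSError:
        status = None  # DNS failure, refused connection, timeout, TLS error
    return status, time.monotonic() - start
```

Running the same `probe` call from hosts in several regions and comparing the verdicts is one way to surface the CDN and routing failures described above.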

API Monitoring

An endpoint returning HTTP 200 does not mean it is working correctly.

API monitoring validates behavior and data, not just availability. Commonly used tools include:

Checkly

Runscope

Postman Monitors

Assertible

Checks typically validate:

Response schemas

Authentication flows

Correct data returned

Sequential API workflows

Proper error responses

API monitoring often catches failures that health checks completely miss.
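A minimal sketch of the schema part of such a check follows. The expected fields describe a hypothetical order endpoint and are not taken from any of the tools listed above.

```python
import json

# Hypothetical schema for an order endpoint: field name -> expected Python type.
EXPECTED_ORDER_FIELDS = {"id": int, "status": str, "total_cents": int}


def validate_payload(body: str, expected: dict) -> list:
    """Return a list of problems found in a JSON response body (empty if none)."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return ["body is not valid JSON"]
    if not isinstance(data, dict):
        return ["body is not a JSON object"]
    problems = []
    for field, expected_type in expected.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems
```

A response with status 200 and the body `{"id": "1", "status": "paid"}` would pass a plain health check but fail this one twice: `id` has the wrong type and `total_cents` is missing.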

Real User Monitoring

Real user monitoring shows what actual users experience in their browsers. Tools include:

Google Analytics 4

Datadog RUM

New Relic Browser

LogRocket

Sentry

SpeedCurve

Metrics tracked:

Core Web Vitals

Page load times by region and device

Front-end JavaScript errors

User session flows

Time to interactive

Back-end metrics cannot reveal front-end performance problems. RUM fills that gap.

Layer 4: Database

Databases deserve their own monitoring layer. Common tools include:

Prometheus PostgreSQL Exporter

Prometheus MySQL Exporter

PgHero

Percona Monitoring and Management

Key signals include:

Active connections

Query latency

Slow queries

Replication lag

Lock waits and deadlocks

Disk and memory usage

Connection pool exhaustion is one of the most common production failure modes and often goes unnoticed until users are already affected.
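One way to catch exhaustion before users do is to alert on saturation rather than on failure. The sketch below assumes the active and maximum connection counts are already available (for PostgreSQL, they could come from a count over `pg_stat_activity` and the `max_connections` setting); the 80% threshold is an illustrative assumption.

```python
def pool_saturation(active: int, max_connections: int) -> float:
    """Return the fraction of available connections currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections


def pool_needs_attention(active: int, max_connections: int, warn_at: float = 0.8) -> bool:
    """True when usage reaches the warning threshold (an assumed 80%)."""
    return pool_saturation(active, max_connections) >= warn_at
```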

Layer 5: Cache

Caches such as Redis or Memcached are critical performance components. Commonly used tools include:

Prometheus Redis Exporter

Redis INFO metrics

Prometheus Memcached Exporter

Cloud provider metrics for managed services

Important metrics include:

Cache hit and miss ratio

Memory usage

Eviction rate

Connection count

Availability

A dropping hit ratio or rising eviction rate usually indicates misconfiguration or insufficient memory.
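Both signals can be derived from counters that Redis reports in its INFO output (`keyspace_hits`, `keyspace_misses` and `evicted_keys`). The sketch below assumes those counters have already been parsed into dictionaries; the function names are hypothetical.

```python
from typing import Optional


def cache_hit_ratio(info: dict) -> Optional[float]:
    """Hit ratio from Redis INFO counters, or None if there was no traffic."""
    hits = int(info.get("keyspace_hits", 0))
    misses = int(info.get("keyspace_misses", 0))
    total = hits + misses
    if total == 0:
        return None
    return hits / total


def eviction_rate(previous: dict, current: dict, interval_s: float) -> float:
    """Evictions per second between two INFO samples taken interval_s apart."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = int(current.get("evicted_keys", 0)) - int(previous.get("evicted_keys", 0))
    return max(delta, 0) / interval_s  # a negative delta means the counter was reset
```

Because these counters are cumulative since the server started, the ratio computed from a single sample is a lifetime average. Comparing two samples over a short interval gives a better picture of current behavior.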

Layer 6: Message Queues

Message queues power asynchronous processing. When they back up, work stops.

Tools include:

Kafka Exporter

Burrow

RabbitMQ Prometheus Plugin

SQS exporters or cloud-native metrics

Key metrics include:

Queue depth

Consumer lag

Message throughput

Dead letter queue size

Growing consumer lag is an early warning that the system is falling behind.
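For a Kafka-style log, lag per partition is the latest offset minus the consumer group's committed offset. The following sketch assumes those offsets have already been collected; the function names and the three-sample rule for "growing" are illustrative assumptions.

```python
def total_lag(end_offsets: dict, committed_offsets: dict) -> int:
    """Sum of (latest offset - committed offset) across partitions."""
    return sum(
        max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    )


def lag_is_growing(samples: list, min_samples: int = 3) -> bool:
    """True if the most recent `min_samples` lag readings strictly increase."""
    if len(samples) < min_samples:
        return False
    recent = samples[-min_samples:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

Alerting on the trend rather than on a fixed lag value avoids paging for a brief burst that the consumers are already working through.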

Layer 7: Tracing Infrastructure

Tracing systems need monitoring, too. Metrics to watch:

Collector availability

Span ingestion rate

Storage back-end health

Dropped spans

If tracing infrastructure fails, visibility disappears exactly when it is most needed.

Layer 8: SSL and Certificates

Certificate expiry causes avoidable outages.

Monitor:

Certificate expiration dates

Days remaining until expiry

TLS versions

Alerting well in advance (30 days or more) prevents last-minute emergencies.
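A minimal sketch of the days-remaining check, using only the Python standard library, might look like this. The function names and the default port are assumptions. Note that a certificate that has already expired fails verification, so the fetch raises an error instead of returning a date.

```python
import socket
import ssl
import time
from typing import Optional


def days_remaining(not_after: str, now: Optional[float] = None) -> float:
    """Days until expiry, given a certificate's notAfter string.

    `not_after` uses the format returned by getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def fetch_not_after(host: str, port: int = 443, timeout_s: float = 5.0) -> str:
    """Connect to host:port and return the server certificate's notAfter value."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        # Raises ssl.SSLCertVerificationError if the certificate is invalid or expired.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

An alert rule would then fire when `days_remaining(fetch_not_after(host))` drops below the chosen lead time, such as the 30 days mentioned above.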

Layer 9: External Dependencies

Most applications depend on services outside their control. Monitoring should include:

External API response times

Error rates from third-party calls

Availability of critical services

Aggregated status page alerts

When an external provider is down, knowing it immediately saves hours of internal debugging.
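As a sketch of how the first two signals could be tracked in-process, the class below keeps a sliding window of recent calls to one dependency. The class name and window size are hypothetical; in practice these numbers would usually be exported as metrics instead of held in memory.

```python
from collections import deque


class DependencyWindow:
    """Sliding window of recent calls to a single external dependency."""

    def __init__(self, size: int = 100):
        self.calls = deque(maxlen=size)  # oldest entries drop off automatically

    def record(self, ok: bool, latency_ms: float) -> None:
        """Store the outcome and latency of one call."""
        self.calls.append((ok, latency_ms))

    def error_rate(self) -> float:
        """Fraction of calls in the window that failed (0.0 if empty)."""
        if not self.calls:
            return 0.0
        failures = sum(1 for ok, _ in self.calls if not ok)
        return failures / len(self.calls)

    def average_latency_ms(self) -> float:
        """Mean latency across the window (0.0 if empty)."""
        if not self.calls:
            return 0.0
        return sum(latency for _, latency in self.calls) / len(self.calls)
```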

Layer 10: Log Patterns and Errors

Metrics show you something is wrong. Logs explain why. Centralized logging should track:

Sudden spikes in error rates

New or unusual error patterns

Timeouts and connection failures

Memory and disk errors

Database deadlocks

Pattern-based log monitoring often detects issues before metrics cross alert thresholds.
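One simple way to spot new error patterns is to normalize away the variable parts of each message and compare against a baseline. The sketch below replaces only runs of digits, which is a deliberate simplification; the function names and sample messages are hypothetical.

```python
import re
from collections import Counter

_NUMBER = re.compile(r"\d+")


def normalize(message: str) -> str:
    """Collapse runs of digits so similar messages share one pattern."""
    return _NUMBER.sub("<n>", message)


def new_patterns(baseline: list, current: list) -> list:
    """Patterns seen in `current` but not in `baseline`, most frequent first."""
    known = {normalize(message) for message in baseline}
    counts = Counter(normalize(message) for message in current)
    return [pattern for pattern, _ in counts.most_common() if pattern not in known]
```

With a baseline containing only timeout messages, two `deadlock detected on table 7` and `deadlock detected on table 9` lines in the current window collapse into one new pattern, `deadlock detected on table <n>`.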

What We Don’t Monitor

Monitoring everything is neither practical nor useful. We intentionally avoid:

High-cardinality metrics

Debug-level logging in production

Per-request logs for high-traffic endpoints

Reducing noise keeps dashboards usable and alerts meaningful.

Alerting Philosophy

Alerts should reflect user impact, not internal noise. We follow three levels:

Immediate paging for user-facing failures
Notifications for issues requiring attention soon
Silent logging for informational signals

If an alert consistently fires without action, it is removed.
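The three levels and the removal rule can be expressed as a small sketch. The names, the two boolean inputs and the minimum of 10 firings are assumptions made for illustration, not a description of any particular alerting tool.

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"      # wake someone up
    NOTIFY = "notify"  # needs attention soon, but not at 3 a.m.
    LOG = "log"        # informational only


def route_alert(user_facing: bool, needs_attention_soon: bool) -> Route:
    """Pick a route based on user impact first, then urgency."""
    if user_facing:
        return Route.PAGE
    if needs_attention_soon:
        return Route.NOTIFY
    return Route.LOG


def should_remove(times_fired: int, times_acted_on: int, min_fires: int = 10) -> bool:
    """True if an alert has fired at least `min_fires` times and never led to action."""
    return times_fired >= min_fires and times_acted_on == 0
```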

Wrapping Up

Effective monitoring is not about collecting more data. It is about asking the right questions:

Is the infrastructure healthy?

Is the application behaving correctly?

Can users actually use the system?

Are dependencies working?

What errors are occurring?

When monitoring can answer these questions clearly, teams are prepared for production reality.

Amjad Syed
