Most Kubernetes monitoring tutorials show you how to install Prometheus and call it done. But raw metrics without the right alert rules and dashboards are just noise. This is the monitoring stack I actually deploy in production, refined over years of managing clusters at FAANG scale.
The Stack
- Prometheus — metrics scraping and storage
- Grafana — visualization and alerting UI
- Alertmanager — alert routing (PagerDuty/Slack)
- kube-state-metrics — Kubernetes object state metrics
- node-exporter — node-level hardware/OS metrics
Everything deployed via Helm for reproducibility.
Installation
# Add the kube-prometheus-stack chart (bundles everything)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install with custom values
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values values.yaml \
  --wait
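Once the release settles, a quick sanity check (names assume the monitoring release and namespace used above):

# All pods should reach Running and Ready within a few minutes
kubectl get pods -n monitoring

# The operator should have created Prometheus and Alertmanager custom resources
kubectl get prometheus,alertmanager -n monitoring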
values.yaml: The Opinionated Config
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: "40GB"
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 2000m
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

grafana:
  persistence:
    enabled: true
    size: 10Gi
  adminPassword: "${GRAFANA_ADMIN_PASSWORD}"
  grafana.ini:
    server:
      root_url: "https://grafana.yourdomain.com"

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          resources:
            requests:
              storage: 10Gi
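One caveat before installing: Helm does not expand ${GRAFANA_ADMIN_PASSWORD}-style placeholders by itself. Two options, sketched below, are rendering the file through envsubst first, or passing secrets with --set so they never land in the file at all:

# Option 1: substitute environment variables before installing
envsubst < values.yaml > values.rendered.yaml
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring -f values.rendered.yaml

# Option 2: keep secrets out of the file entirely
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring -f values.yaml \
  --set grafana.adminPassword="$GRAFANA_ADMIN_PASSWORD"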
The Alerts That Matter
This is the part most guides skip. Default Prometheus alerts are either too noisy or miss the real issues. Here are the rules I've refined over hundreds of production incidents.
Node-Level Alerts
groups:
  - name: node.rules
    rules:
      # node-exporter metrics carry the node name in the instance label
      - alert: NodeHighCPU
        expr: |
          (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} CPU above 85% for 10 minutes"
      - alert: NodeMemoryPressure
        expr: |
          (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} has less than 10% memory available"
      - alert: NodeDiskPressure
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} disk is 85%+ full"
Pod-Level Alerts
- name: pod.rules
  rules:
    - alert: PodCrashLooping
      expr: |
        increase(kube_pod_container_status_restarts_total[1h]) > 5
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
    - alert: PodNotReady
      expr: |
        kube_pod_status_ready{condition="false"} == 1
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready for 15 minutes"
    - alert: PodOOMKilled
      expr: |
        kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} was OOMKilled"
Deployment Alerts
- name: deployment.rules
  rules:
    - alert: DeploymentReplicasMismatch
      expr: |
        kube_deployment_spec_replicas != kube_deployment_status_replicas_available
      for: 15m
      labels:
        severity: critical
      annotations:
        summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has unavailable replicas"
    - alert: DeploymentRolloutStuck
      expr: |
        kube_deployment_status_observed_generation
          != kube_deployment_metadata_generation
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} rollout appears stuck"
The 4 Grafana Dashboards I Always Import
Don't build dashboards from scratch. Start with these IDs from grafana.com/dashboards:
| Dashboard | ID | Purpose |
|-----------|----|---------|
| Kubernetes Cluster Overview | 7249 | Node health, capacity, pod count |
| Kubernetes Pod Resources | 6781 | Per-pod CPU/memory usage and limits |
| Node Exporter Full | 1860 | Detailed node metrics (disk I/O, network) |
| Kubernetes Deployment Stats | 8588 | Replica status, rollout history |
Import them via Grafana UI: Dashboards → Import → Enter ID.
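If you'd rather not click through the UI on every cluster, the bundled Grafana chart can also pull dashboards by ID at install time. A sketch for values.yaml (dashboard ID 1860 from the table above; the revision value is a placeholder, pin whichever one you have reviewed):

grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default
          orgId: 1
          folder: Kubernetes
          type: file
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      node-exporter-full:
        gnetId: 1860        # grafana.com dashboard ID
        revision: 1         # placeholder: pin the revision you actually want
        datasource: Prometheus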
Recording Rules: Speed Up Expensive Queries
Raw Prometheus queries over long time ranges are slow. Recording rules pre-compute expensive aggregations.
groups:
  - name: recording.rules
    rules:
      # Name recording rules as level:metric:operations, where level matches the aggregation labels
      - record: namespace_pod_container:container_cpu_usage_seconds_total:rate5m
        expr: |
          sum by(namespace, pod, container) (
            rate(container_cpu_usage_seconds_total{container!=""}[5m])
          )
      - record: namespace_pod:container_memory_working_set_bytes:avg
        expr: |
          avg by(namespace, pod) (
            container_memory_working_set_bytes{container!=""}
          )
These make your Grafana dashboards snappy even when querying months of data.
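In dashboards the swap is mechanical: wherever a panel computes the rate inline, point it at the recorded series instead. For example, a per-namespace CPU panel:

# Before: computed from raw samples on every dashboard refresh
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))

# After: reads the pre-computed series
sum by (namespace) (namespace_pod_container:container_cpu_usage_seconds_total:rate5m)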
SLO Monitoring: The Part Most Setups Miss
Alert thresholds on raw metrics are reactive. SLO-based alerting is proactive — it fires when you're on track to breach a service level objective, not after you've already breached it.
Here's a simple availability SLO for an HTTP service:
groups:
  - name: slo.rules
    rules:
      # Track the error rate as a ratio of all requests
      - record: slo:http_error_rate:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      # Burn-rate alert: the error budget is being consumed ~14x faster than sustainable,
      # which would exhaust a 30-day budget in roughly 2 days
      - alert: SLOHighBurnRate
        expr: |
          slo:http_error_rate:ratio_rate5m > (14 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error burn rate — SLO at risk"
          description: "Error rate {{ $value | humanizePercentage }} is consuming error budget 14x faster than sustainable"
The 0.001 in the expression assumes a 99.9% availability SLO (0.1% allowed errors). Adjust this to match your actual SLO. This pattern, called "burn rate alerting," was popularized by Google's SRE Workbook and dramatically reduces both alert fatigue and missed incidents.
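The single 5-minute window above can flap on short error spikes. The SRE Workbook's refinement is multi-window alerting: require both a long and a short window to show the elevated burn rate before paging. A sketch that adds a 1-hour variant of the recording rule to the same slo.rules group:

# Added to the slo.rules group above
- record: slo:http_error_rate:ratio_rate1h
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[1h]))
    /
    sum(rate(http_requests_total[1h]))
- alert: SLOHighBurnRateMultiWindow
  expr: |
    slo:http_error_rate:ratio_rate1h > (14 * 0.001)
    and
    slo:http_error_rate:ratio_rate5m > (14 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Burn rate high over both the 5m and 1h windows; SLO at risk"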
Alertmanager Routing
route:
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-general'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack-general'

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_KEY}"
        description: '{{ template "pagerduty.default.description" . }}'
  - name: 'slack-general'
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_URL}"
        channel: '#alerts'
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'
Critical alerts → PagerDuty (wakes someone up). Warnings → Slack (visible, not urgent).
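One addition worth making once this routing is in place: an inhibition rule, so that when a critical page is already firing, the matching warning doesn't also ping Slack. A minimal sketch using the same severity labels as above:

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'namespace']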
Managing Prometheus Cardinality
The most common reason Prometheus clusters run out of memory: cardinality explosion. Every unique combination of label values is a separate time series. One label with 1,000 unique values multiplied across 10 metrics = 10,000 series. If that label contains user IDs or request URLs, you could be generating millions of series per hour.
Signs of cardinality problems:
- Prometheus memory usage growing unbounded
- Queries returning "too many samples" errors
- Ingestion lag increasing over time
Prevention in scrape configs:
# Drop high-cardinality labels before ingestion
scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['app:8080']
    metric_relabel_configs:
      # Drop user_id label — too many unique values
      - action: labeldrop
        regex: user_id
      # Drop any metric with "debug" in the name
      - source_labels: [__name__]
        regex: '.*debug.*'
        action: drop
Detection after the fact:
# Find the top 10 metrics by time series count
topk(10, count by (__name__) ({__name__=~".+"}))
Run this query to identify which metrics are consuming the most series, then decide whether to drop labels or entire metrics from those sources.
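Once a metric stands out, the follow-up question is which label is exploding. Counting distinct values of a suspect label answers it; the metric and label names below are illustrative, not from this stack:

# How many distinct values does the path label take on this metric?
count(count by (path) (http_requests_total))

# Total series in the TSDB head right now; watch the trend over days
prometheus_tsdb_head_series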
The Metrics I Check First During an Incident
When something breaks, this is my 5-minute triage sequence:
1. kube_pod_container_status_restarts_total — anything spiking in the last 30m?
2. container_memory_working_set_bytes vs limits — anyone near OOM?
3. node_cpu_seconds_total — saturated nodes will slow everything
4. kube_deployment_status_replicas_available — is the desired count matching?
5. container_network_receive_bytes_total — unexpected traffic spikes
Five queries, five minutes, you've eliminated 80% of common failure modes.
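The first two translate directly into queries worth keeping bookmarked. A sketch, assuming the cAdvisor and kube-state-metrics series that ship with this stack (containers without a memory limit simply drop out of the second query):

# 1. Restart spikes in the last 30 minutes
topk(20, increase(kube_pod_container_status_restarts_total[30m]) > 0)

# 2. Containers above 90% of their memory limit
max by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
  / on (namespace, pod, container)
max by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
  > 0.9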
Common Mistakes
Too many alerts, too few actions. If your on-call gets 50 alerts per shift and silences half of them, fix the alerts — not the human. Every alert should be either actionable or removed.
No runbooks linked. Every alert annotation should have a runbook_url field. "NodeHighCPU" means nothing at 3am without a link to what to actually do.
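In practice that is one extra annotation per alert; the URL below is a placeholder for wherever your runbooks live:

annotations:
  summary: "Node {{ $labels.instance }} CPU above 85% for 10 minutes"
  runbook_url: "https://runbooks.yourdomain.com/node-high-cpu"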
Prometheus scraping intervals too frequent. Default 15s is fine. Don't set 5s unless you have a specific need — it multiplies storage costs.
Forgetting persistent storage. Prometheus data in an emptyDir is lost on pod restart. Always use PVCs.
Not setting resource limits on Prometheus itself. A runaway Prometheus instance consuming all node memory is a monitoring system that takes down the thing it's supposed to monitor. Always set memory requests and limits on the Prometheus pod, and keep retention and cardinality (see above) in check so it actually stays within them.
Alerting on percentages without minimum volume floors. An alert that fires when error rate > 5% will page you when 1 request fails out of 20. Add a minimum volume check:
- alert: HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    ) > 0.05
    and
    sum(rate(http_requests_total[5m])) > 10
The and clause requires at least 10 RPS before the error rate check fires. No more 3am pages for a single failed health check.
Good monitoring isn't about having metrics — it's about having the right alerts that fire at the right time, with enough context to act fast. Start with these patterns and tune from there.