Most Kubernetes monitoring tutorials show you how to install Prometheus and call it done. But raw metrics without the right alert rules and dashboards are just noise. This is the monitoring stack I actually deploy in production, refined over years of managing clusters at FAANG scale.
The Stack
- Prometheus — metrics scraping and storage
- Grafana — visualization and alerting UI
- Alertmanager — alert routing (PagerDuty/Slack)
- kube-state-metrics — Kubernetes object state metrics
- node-exporter — node-level hardware/OS metrics
Everything deployed via Helm for reproducibility.
Installation
# Add the kube-prometheus-stack chart (bundles everything)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install with custom values
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values values.yaml \
  --wait
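Once the release settles, a quick sanity check (names assume the monitoring release and namespace used above):

# All pods should reach Running and Ready within a few minutes
kubectl get pods -n monitoring

# The operator should have created Prometheus and Alertmanager custom resources
kubectl get prometheus,alertmanager -n monitoring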
values.yaml: The Opinionated Config
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: "40GB"
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 2000m
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

grafana:
  persistence:
    enabled: true
    size: 10Gi
  adminPassword: "${GRAFANA_ADMIN_PASSWORD}"
  grafana.ini:
    server:
      root_url: "https://grafana.yourdomain.com"

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          resources:
            requests:
              storage: 10Gi
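One caveat before installing: Helm does not expand ${GRAFANA_ADMIN_PASSWORD}-style placeholders by itself. Two options, sketched below, are rendering the file through envsubst first, or passing secrets with --set so they never land in the file at all:

# Option 1: substitute environment variables before installing
envsubst < values.yaml > values.rendered.yaml
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring -f values.rendered.yaml

# Option 2: keep secrets out of the file entirely
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring -f values.yaml \
  --set grafana.adminPassword="$GRAFANA_ADMIN_PASSWORD"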
The Alerts That Matter
This is the part most guides skip. Default Prometheus alerts are either too noisy or miss the real issues. Here are the rules I've refined over hundreds of production incidents.
Node-Level Alerts
groups:
  - name: node.rules
    rules:
      # node-exporter metrics carry the node name in the instance label
      - alert: NodeHighCPU
        expr: |
          (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} CPU above 85% for 10 minutes"
      - alert: NodeMemoryPressure
        expr: |
          (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} has less than 10% memory available"
      - alert: NodeDiskPressure
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} disk is 85%+ full"
Pod-Level Alerts
- name: pod.rules
  rules:
    - alert: PodCrashLooping
      expr: |
        increase(kube_pod_container_status_restarts_total[1h]) > 5
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
    - alert: PodNotReady
      expr: |
        kube_pod_status_ready{condition="false"} == 1
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready for 15 minutes"
    - alert: PodOOMKilled
      expr: |
        kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} was OOMKilled"
Deployment Alerts
- name: deployment.rules
  rules:
    - alert: DeploymentReplicasMismatch
      expr: |
        kube_deployment_spec_replicas != kube_deployment_status_replicas_available
      for: 15m
      labels:
        severity: critical
      annotations:
        summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has unavailable replicas"
    - alert: DeploymentRolloutStuck
      expr: |
        kube_deployment_status_observed_generation
          != kube_deployment_metadata_generation
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} rollout appears stuck"
The 4 Grafana Dashboards I Always Import
Don't build dashboards from scratch. Start with these IDs from grafana.com/dashboards:
| Dashboard | ID | Purpose |
|-----------|----|---------|
| Kubernetes Cluster Overview | 7249 | Node health, capacity, pod count |
| Kubernetes Pod Resources | 6781 | Per-pod CPU/memory usage and limits |
| Node Exporter Full | 1860 | Detailed node metrics (disk I/O, network) |
| Kubernetes Deployment Stats | 8588 | Replica status, rollout history |
Import them via Grafana UI: Dashboards → Import → Enter ID.
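If you'd rather not click through the UI on every cluster, the bundled Grafana chart can also pull dashboards by ID at install time. A sketch for values.yaml (dashboard ID 1860 from the table above; the revision value is a placeholder, pin whichever one you have reviewed):

grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default
          orgId: 1
          folder: Kubernetes
          type: file
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      node-exporter-full:
        gnetId: 1860        # grafana.com dashboard ID
        revision: 1         # placeholder: pin the revision you actually want
        datasource: Prometheus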
Recording Rules: Speed Up Expensive Queries
Raw Prometheus queries over long time ranges are slow. Recording rules pre-compute expensive aggregations.
groups:
  - name: recording.rules
    rules:
      # Name recording rules as level:metric:operations, where level matches the aggregation labels
      - record: namespace_pod_container:container_cpu_usage_seconds_total:rate5m
        expr: |
          sum by(namespace, pod, container) (
            rate(container_cpu_usage_seconds_total{container!=""}[5m])
          )
      - record: namespace_pod:container_memory_working_set_bytes:avg
        expr: |
          avg by(namespace, pod) (
            container_memory_working_set_bytes{container!=""}
          )
These make your Grafana dashboards snappy even when querying months of data.
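In dashboards the swap is mechanical: wherever a panel computes the rate inline, point it at the recorded series instead. For example, a per-namespace CPU panel:

# Before: computed from raw samples on every dashboard refresh
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))

# After: reads the pre-computed series
sum by (namespace) (namespace_pod_container:container_cpu_usage_seconds_total:rate5m)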
SLO Monitoring: The Part Most Setups Miss
Alert thresholds on raw metrics are reactive. SLO-based alerting is proactive — it fires when you're on track to breach a service level objective, not after you've already breached it.
Here's a simple availability SLO for an HTTP service:
groups:
  - name: slo.rules
    rules:
      # Track the error rate as a ratio of all requests
      - record: slo:http_error_rate:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      # Burn-rate alert: the error budget is being consumed ~14x faster than sustainable,
      # which would exhaust a 30-day budget in roughly 2 days
      - alert: SLOHighBurnRate
        expr: |
          slo:http_error_rate:ratio_rate5m > (14 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error burn rate — SLO at risk"
          description: "Error rate {{ $value | humanizePercentage }} is consuming error budget 14x faster than sustainable"
The 0.001 in the expression assumes a 99.9% availability SLO (0.1% allowed errors). Adjust this to match your actual SLO. This pattern, called "burn rate alerting," was popularized by Google's SRE Workbook and dramatically reduces both alert fatigue and missed incidents.
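The single 5-minute window above can flap on short error spikes. The SRE Workbook's refinement is multi-window alerting: require both a long and a short window to show the elevated burn rate before paging. A sketch that adds a 1-hour variant of the recording rule to the same slo.rules group:

# Added to the slo.rules group above
- record: slo:http_error_rate:ratio_rate1h
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[1h]))
    /
    sum(rate(http_requests_total[1h]))
- alert: SLOHighBurnRateMultiWindow
  expr: |
    slo:http_error_rate:ratio_rate1h > (14 * 0.001)
    and
    slo:http_error_rate:ratio_rate5m > (14 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Burn rate high over both the 5m and 1h windows; SLO at risk"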
Alertmanager Routing
route:
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-general'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack-general'

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_KEY}"
        description: '{{ template "pagerduty.default.description" . }}'
  - name: 'slack-general'
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_URL}"
        channel: '#alerts'
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'
Critical alerts → PagerDuty (wakes someone up). Warnings → Slack (visible, not urgent).
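One addition worth making once this routing is in place: an inhibition rule, so that when a critical page is already firing, the matching warning doesn't also ping Slack. A minimal sketch using the same severity labels as above:

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'namespace']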
Managing Prometheus Cardinality
The most common reason Prometheus clusters run out of memory: cardinality explosion. Every unique combination of label values is a separate time series. One label with 1,000 unique values multiplied across 10 metrics = 10,000 series. If that label contains user IDs or request URLs, you could be generating millions of series per hour.
Signs of cardinality problems:
- Prometheus memory usage growing unbounded
- Queries returning "too many samples" errors
- Ingestion lag increasing over time
Prevention in scrape configs:
# Drop high-cardinality labels before ingestion
scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['app:8080']
    metric_relabel_configs:
      # Drop user_id label — too many unique values
      - action: labeldrop
        regex: user_id
      # Drop any metric with "debug" in the name
      - source_labels: [__name__]
        regex: '.*debug.*'
        action: drop
Detection after the fact:
# Find the top 10 metrics by time series count
topk(10, count by (__name__) ({__name__=~".+"}))
Run this query to identify which metrics are consuming the most series, then decide whether to drop labels or entire metrics from those sources.
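Once a metric stands out, the follow-up question is which label is exploding. Counting distinct values of a suspect label answers it; the metric and label names below are illustrative, not from this stack:

# How many distinct values does the path label take on this metric?
count(count by (path) (http_requests_total))

# Total series in the TSDB head right now; watch the trend over days
prometheus_tsdb_head_series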
The Metrics I Check First During an Incident
When something breaks, this is my 5-minute triage sequence:
1. kube_pod_container_status_restarts_total — anything spiking in the last 30m?
2. container_memory_working_set_bytes vs limits — anyone near OOM?
3. node_cpu_seconds_total — saturated nodes will slow everything
4. kube_deployment_status_replicas_available — is the desired count matching?
5. container_network_receive_bytes_total — unexpected traffic spikes
Five queries, five minutes, you've eliminated 80% of common failure modes.
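The first two translate directly into queries worth keeping bookmarked. A sketch, assuming the cAdvisor and kube-state-metrics series that ship with this stack (containers without a memory limit simply drop out of the second query):

# 1. Restart spikes in the last 30 minutes
topk(20, increase(kube_pod_container_status_restarts_total[30m]) > 0)

# 2. Containers above 90% of their memory limit
max by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
  / on (namespace, pod, container)
max by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
  > 0.9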
Common Mistakes
Too many alerts, too few actions. If your on-call gets 50 alerts per shift and silences half of them, fix the alerts — not the human. Every alert should be either actionable or removed.
No runbooks linked. Every alert annotation should have a runbook_url field. "NodeHighCPU" means nothing at 3am without a link to what to actually do.
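In practice that is one extra annotation per alert; the URL below is a placeholder for wherever your runbooks live:

annotations:
  summary: "Node {{ $labels.instance }} CPU above 85% for 10 minutes"
  runbook_url: "https://runbooks.yourdomain.com/node-high-cpu"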
Prometheus scraping intervals too frequent. Default 15s is fine. Don't set 5s unless you have a specific need — it multiplies storage costs.
Forgetting persistent storage. Prometheus data in an emptyDir is lost on pod restart. Always use PVCs.
Not setting resource limits on Prometheus itself. A runaway Prometheus instance consuming all node memory is a monitoring system that takes down the thing it's supposed to monitor. Always set memory requests and limits on the Prometheus pod, and keep retention and cardinality (see above) in check so it actually stays within them.
Alerting on percentages without minimum volume floors. An alert that fires when error rate > 5% will page you when 1 request fails out of 20. Add a minimum volume check:
- alert: HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    ) > 0.05
    and
    sum(rate(http_requests_total[5m])) > 10
The and clause requires at least 10 RPS before the error rate check fires. No more 3am pages for a single failed health check.
Good monitoring isn't about having metrics — it's about having the right alerts that fire at the right time, with enough context to act fast. Start with these patterns and tune from there.