Production Monitoring with Prometheus + Grafana: Complete Setup Guide (2026)

Set up production-grade monitoring with Prometheus and Grafana. Step-by-step guide covering Node Exporter, AlertManager, dashboards, and alerting rules for your infrastructure.

June 25, 2026 · 9 min read
#prometheus #grafana #monitoring #observability #alertmanager #node-exporter

Introduction

Your application is live. Users are hitting your endpoints. Traffic is growing. Then it happens — 3 AM, your phone buzzes. A customer reports the site is down. You scramble to check: is it the database? The API server? Did you run out of disk space?

This is the nightmare scenario that Prometheus and Grafana help you avoid.

In this guide, I will walk you through setting up a production-ready monitoring stack. We will cover everything from scraping metrics to building dashboards to configuring alerts that reach you while a problem is still developing, not after your users have already noticed it.

By the end, you will have:

  • Prometheus collecting metrics from your servers
  • Grafana dashboards showing real-time system health
  • AlertManager sending notifications to Slack, email, or PagerDuty
  • Node Exporter exposing Linux system metrics

Let us build this.


Prerequisites

You need:

  • A Linux server (Ubuntu 22.04 or 24.04 recommended)
  • At least 2 GB RAM and 10 GB disk for the monitoring server
  • SSH access with sudo privileges
  • Docker and Docker Compose installed (we will use containers for simplicity)

If you do not have Docker installed:

sudo apt update && sudo apt install -y docker.io docker-compose-v2
sudo usermod -aG docker $USER
newgrp docker
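
Later steps also call curl and jq, which the list above does not mention. Here is a minimal sketch for confirming that the tools are on your PATH before you continue. The function name check_prereqs is my own invention for this example, not part of Docker or any other tool.

```shell
# check_prereqs is a hypothetical helper: it reports whether each named
# command is available on PATH and returns non-zero if any is missing.
check_prereqs() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

Run it as check_prereqs docker curl jq. If jq is reported missing, sudo apt install -y jq should add it on Ubuntu.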

Architecture Overview

Here is what we are building:

┌─────────────┐     scrape     ┌──────────────┐
│  Node       │◄───────────────│  Prometheus   │
│  Exporter   │   every 15s    │  (metrics DB) │
│  :9100      │                │  :9090        │
└─────────────┘                └──────┬───────┘
                                      │ query
┌─────────────┐                ┌──────▼───────┐
│  App Server │                │   Grafana    │
│  metrics    │                │   :3000      │
│  :8080      │                └──────────────┘
└─────────────┘
                                      │ alerts
                               ┌──────▼───────┐
                               │ AlertManager │
                               │  :9093       │
                               └──────┬───────┘
                                      │ notify
                               ┌──────▼───────┐
                               │ Slack/Email   │
                               │ PagerDuty     │
                               └──────────────┘

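Prometheus pulls metrics from targets (Node Exporter, your app, databases) and evaluates the alerting rules. Grafana queries Prometheus for visualization. Prometheus, not Grafana, sends firing alerts to AlertManager, which handles grouping, routing, and silencing before notifying Slack, email, or PagerDuty.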


Step 1: Project Structure

Create a directory for our monitoring stack:

mkdir -p ~/monitoring-stack
cd ~/monitoring-stack

Create the directory layout:

monitoring-stack/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── alerts/
│       └── node_alerts.yml
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── prometheus.yml
│       └── dashboards/
│           └── dashboard.yml
└── alertmanager/
    └── alertmanager.yml
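
If you would rather script the layout than create it by hand, something like the following should work. It is only a sketch: create_layout is a hypothetical helper name, and the files it creates are empty placeholders that the later steps fill in.

```shell
# create_layout is a hypothetical helper that builds the tree shown above
# under the directory passed as its first argument.
create_layout() {
  base="$1"
  mkdir -p "$base/prometheus/alerts" \
           "$base/grafana/provisioning/datasources" \
           "$base/grafana/provisioning/dashboards" \
           "$base/alertmanager"
  # touch creates empty placeholders; files that already exist keep
  # their contents and only get a new timestamp.
  touch "$base/docker-compose.yml" \
        "$base/prometheus/prometheus.yml" \
        "$base/prometheus/alerts/node_alerts.yml" \
        "$base/grafana/provisioning/datasources/prometheus.yml" \
        "$base/grafana/provisioning/dashboards/dashboard.yml" \
        "$base/alertmanager/alertmanager.yml"
}
```

Call it as create_layout ~/monitoring-stack. It is safe to run again later because nothing is overwritten.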

Step 2: Prometheus Configuration

Create prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'production'

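# Tell Prometheus where to send firing alerts. Without this block the
# rules are evaluated but nothing ever reaches AlertManager.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alerts/*.yml'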

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

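  - job_name: 'node_exporter'
    static_configs:
      - targets:
          # Placeholder: replace with your application server's address.
          # Node Exporter must be running there on port 9100.
          - 'your-app-server:9100'
        labels:
          environment: 'production'
          role: 'application-server'

  - job_name: 'node_exporter_monitoring'
    static_configs:
      # The node-exporter container from docker-compose.yml. Inside the
      # Prometheus container, localhost would point at Prometheus itself.
      - targets: ['node-exporter:9100']
        labels:
          environment: 'production'
          role: 'monitoring-server'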

Key decisions here:

  • 15-second scrape interval — balances freshness with storage cost
  • external_labels — helps identify metrics when you have multiple Prometheus instances
  • Separate jobs per role — makes filtering in Grafana much easier
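
To put a number on "storage cost": the Prometheus storage documentation gives a rough sizing formula of retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample on average. The sketch below applies that formula. The function name and the figure of 3000 active series are illustrative assumptions, so plug in your own series count.

```shell
# estimate_tsdb_mib SERIES INTERVAL_SECONDS RETENTION_DAYS [BYTES_PER_SAMPLE]
# Hypothetical helper that prints a rough disk estimate in MiB using
#   retention_seconds * (series / scrape_interval) * bytes_per_sample
estimate_tsdb_mib() {
  series="$1"
  interval="$2"
  days="$3"
  bytes_per_sample="${4:-2}"   # assumption: upper end of the 1-2 byte range
  # Multiply before dividing so integer arithmetic does not lose precision.
  total_bytes=$(( days * 86400 * series * bytes_per_sample / interval ))
  echo "$(( total_bytes / 1048576 )) MiB"
}
```

With 30 days of retention, estimate_tsdb_mib 3000 15 30 prints 988 MiB, while the same series scraped every 60 seconds (estimate_tsdb_mib 3000 60 30) comes to 247 MiB. Treat both as order-of-magnitude guidance only.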

Now create the alerting rules in prometheus/alerts/node_alerts.yml:

groups:
  - name: node_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage has been above 80% for 10 minutes (current: {{ $value }}%)"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% (current: {{ $value }}%)"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Less than 10% disk space remaining on / (current: {{ $value }}%)"

      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been unreachable for 2 minutes."

These four alerts cover the most common production issues: CPU, memory, disk, and instance availability. The for clause prevents flapping — an alert must persist for the specified duration before firing.
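
Before starting anything, it is worth checking the config and rule files for syntax errors. The prom/prometheus image ships with promtool, so one way to do that is sketched below. validate_prometheus is a hypothetical wrapper name, the image tag is the one used in the Compose file later in this guide, and the DOCKER variable is an assumption I added so you can substitute another Docker-compatible CLI.

```shell
# validate_prometheus is a hypothetical wrapper. It mounts ./prometheus
# read-only into a throwaway container and runs promtool against it.
# "check config" also validates every file matched by rule_files.
validate_prometheus() {
  stack_dir="${1:-$PWD}"
  "${DOCKER:-docker}" run --rm \
    -v "$stack_dir/prometheus:/etc/prometheus:ro" \
    --entrypoint promtool \
    prom/prometheus:v2.53.0 \
    check config /etc/prometheus/prometheus.yml
}
```

Run validate_prometheus ~/monitoring-stack after every edit. A non-zero exit status means promtool found a problem and printed the offending file and line.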


Step 3: Grafana Provisioning

Grafana supports declarative provisioning — configure datasources and dashboards as code. No more manual UI setup after redeploy.

Create grafana/provisioning/datasources/prometheus.yml:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: "15s"

Create grafana/provisioning/dashboards/dashboard.yml:

apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards

Step 4: AlertManager Configuration

Create alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

route:
  receiver: 'slack-critical'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # matchers replaces the deprecated match syntax.
    - matchers:
        - severity="critical"
      receiver: 'slack-critical'
    - matchers:
        - severity="warning"
      receiver: 'slack-warning'

receivers:
  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts-critical'
        title: '🔴 {{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          *Instance:* {{ .Labels.instance }}
          {{ end }}

  - name: 'slack-warning'
    slack_configs:
      - channel: '#alerts-warning'
        title: '🟡 {{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          {{ end }}

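This routes critical alerts to one Slack channel and warnings to another; an alert with any other severity falls through to the default receiver, slack-critical. Three timers keep the noise down. group_wait (30 seconds) batches alerts that start firing together into a single first notification. group_interval (5 minutes) sets how long AlertManager waits before notifying about new alerts that join a group it has already notified. repeat_interval (4 hours) controls how often a still-firing alert is re-sent.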
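
You can check the routing tree without sending a real alert. The prom/alertmanager image includes amtool, whose config routes test subcommand prints the receiver that a given label set would reach. The wrapper below is a sketch: test_route is a hypothetical name, and the DOCKER override is an assumption I added for Docker-compatible CLIs.

```shell
# test_route STACK_DIR LABEL=VALUE...
# Hypothetical wrapper around amtool. It mounts ./alertmanager read-only
# and asks which receiver an alert with the given labels would reach.
test_route() {
  stack_dir="$1"
  shift
  "${DOCKER:-docker}" run --rm \
    -v "$stack_dir/alertmanager:/etc/alertmanager:ro" \
    --entrypoint amtool \
    prom/alertmanager:v0.27.0 \
    config routes test \
    --config.file=/etc/alertmanager/alertmanager.yml "$@"
}
```

With the config above, test_route ~/monitoring-stack severity=warning should print slack-warning and severity=critical should print slack-critical. If either prints something else, the routing tree is not doing what you expect.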


Step 5: Docker Compose — Putting It All Together

Create docker-compose.yml:

services:
  prometheus:
    image: prom/prometheus:v2.53.0
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts:/etc/prometheus/alerts
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    restart: unless-stopped
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.8.2
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    restart: unless-stopped
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:11.1.0
    container_name: grafana
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=change-me-now
      - GF_SERVER_ROOT_URL=https://monitoring.yourdomain.com
      - GF_AUTH_ANONYMOUS_ENABLED=false
    ports:
      - "3000:3000"
    restart: unless-stopped
    networks:
      - monitoring
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    restart: unless-stopped
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring:
    driver: bridge

Important details:

  • prometheus_data volume — persists metrics across container restarts
  • 30-day retention — adjust based on your disk capacity
  • --web.enable-lifecycle — allows hot-reloading Prometheus config without restart
  • Node Exporter mounts are read-only for security
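
Because --web.enable-lifecycle is set, editing prometheus.yml or a rule file does not require a container restart. A minimal sketch of the reload call follows. reload_prometheus is a hypothetical helper name, and the default URL assumes you run it on the monitoring server itself.

```shell
# reload_prometheus [BASE_URL]
# Hypothetical helper that asks a running Prometheus to re-read its config
# and rule files. The endpoint only exists with --web.enable-lifecycle.
reload_prometheus() {
  base_url="${1:-http://localhost:9090}"
  # -f makes curl exit non-zero on HTTP errors, such as a rejected config.
  if curl -fsS --max-time 5 -X POST "$base_url/-/reload"; then
    echo "reload requested"
  else
    echo "reload failed: check the Prometheus logs" >&2
    return 1
  fi
}
```

If the reload fails, docker compose logs prometheus usually shows which line of the config was rejected. Prometheus keeps running with the previous config in that case.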

Step 6: Launch and Verify

# Start everything
docker compose up -d

# Check all containers are running
docker compose ps

# Verify Prometheus is scraping targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

Expected output:

{ "job": "prometheus", "health": "up" }
{ "job": "node_exporter", "health": "up" }
{ "job": "node_exporter_monitoring", "health": "up" }

Now visit:

  • Prometheus: http://your-server:9090 — try querying node_memory_MemAvailable_bytes
  • Grafana: http://your-server:3000 — login with admin / change-me-now
  • AlertManager: http://your-server:9093 — view alert status
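
The same check can be scripted, which is handy after a reboot or an upgrade. Prometheus and AlertManager both expose /-/healthy, and Grafana exposes /api/health. The helper below is a sketch; check_health is a hypothetical name, and it assumes the default ports from the Compose file.

```shell
# check_health [HOST]
# Hypothetical helper that probes each component's health endpoint and
# prints one line per component. Returns non-zero if any probe fails.
check_health() {
  host="${1:-localhost}"
  status=0
  for entry in \
    "prometheus:9090/-/healthy" \
    "grafana:3000/api/health" \
    "alertmanager:9093/-/healthy"
  do
    name="${entry%%:*}"   # text before the first colon
    path="${entry#*:}"    # port and path after the first colon
    if curl -fs --max-time 5 -o /dev/null "http://$host:$path"; then
      echo "$name: healthy"
    else
      echo "$name: NOT healthy"
      status=1
    fi
  done
  return "$status"
}
```

Run check_health on the monitoring server, or check_health your-server from another machine that can reach those ports.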

Step 7: Import a Pre-Built Dashboard

Grafana has a massive community dashboard library. The most popular Node Exporter dashboard is ID 1860 ("Node Exporter Full").

  1. In Grafana, go to Dashboards → Import
  2. Enter 1860 in the "Import via grafana.com" field
  3. Select your Prometheus datasource
  4. Click Import

You will immediately see CPU, memory, disk, network, and dozens of other panels populated with real data.


Step 8: Add Application Metrics

So far we are monitoring the server itself. Let us add application-level metrics.

For a Node.js app, install prom-client:

npm install prom-client

Add this to your Express app:

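const promClient = require('prom-client');

// Assumes `app` is your existing Express instance.

// Dedicated registry so /metrics exposes only what is registered here.
const register = new promClient.Registry();
promClient.collectDefaultMetrics({ register });

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
  // Without this the histogram lands in the default global registry
  // and never appears in the output of register.metrics().
  registers: [register],
});

// Time every request and record it once the response has been sent.
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    // req.route is only set for matched routes; the raw path is the fallback.
    end({ method: req.method, route: req.route?.path || req.path, status: res.statusCode });
  });
  next();
});

// Endpoint that Prometheus scrapes.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});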

Then add a new scrape job in prometheus.yml:

  - job_name: 'nodejs_app'
    static_configs:
      - targets: ['your-app-server:3000']
        labels:
          app: 'my-api'
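
Before blaming Prometheus for a missing series, confirm that the app really exposes it. A histogram shows up in the text exposition format as _bucket, _sum, and _count series. The helper below is a sketch; has_metric is a hypothetical name, and it simply greps the scrape output for a sample line of the named metric.

```shell
# has_metric NAME
# Hypothetical helper: reads Prometheus text exposition format on stdin and
# succeeds if at least one sample line belongs to exactly that metric name,
# with or without labels. Comment lines (# HELP, # TYPE) never match.
has_metric() {
  grep -Eq "^$1(\{| )"
}
```

For example: curl -s http://your-app-server:3000/metrics | has_metric http_request_duration_seconds_count && echo found. Send at least one request to the app first, since a labelled histogram may not emit any samples until it has observed something.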

Step 9: Nginx Reverse Proxy for Grafana (Optional but Recommended)

Running Grafana on port 3000 without TLS is not ideal. Let us put Nginx in front:

server {
    listen 443 ssl;
    server_name monitoring.yourdomain.com;

    ssl_certificate     /etc/letsencrypt/live/monitoring.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/monitoring.yourdomain.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

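Generate the TLS certificate before enabling this server block, because Nginx refuses to load a configuration that points at certificate files that do not exist yet: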

sudo certbot --nginx -d monitoring.yourdomain.com

Monitoring Best Practices

Here are a few principles I have learned from running monitoring in production:

  1. Alert on symptoms, not causes. "Site is returning 500s" is better than "CPU is high" — the former tells you there is a user impact.
  2. Avoid alert fatigue. If an alert fires every day and nobody acts on it, delete it or silence it. Noisy alerts teach your team to ignore all alerts.
  3. Dashboard hierarchy. Create three levels: high-level (exec summary), service-level (per team), and drill-down (debugging). Nobody needs to see 50 panels at once.
  4. Retention is a tradeoff. 30 days is enough for most teams. If you need long-term trending, consider Thanos or VictoriaMetrics for Prometheus long-term storage.
  5. Test your alerts. Intentionally trigger each alert once to verify the notification pipeline works end to end.
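
For point 5, one low-risk way to exercise the notification path is to post a synthetic alert straight to AlertManager's v2 API. This checks routing and Slack delivery only, not the Prometheus rules themselves. The sketch below builds the payload; test_alert_json, the alert name PipelineTest, and the instance label manual-test are all made up for this example.

```shell
# test_alert_json [SEVERITY]
# Hypothetical helper that prints a one-alert payload in the shape accepted
# by AlertManager's POST /api/v2/alerts endpoint.
test_alert_json() {
  severity="${1:-warning}"
  cat <<EOF
[
  {
    "labels": {
      "alertname": "PipelineTest",
      "severity": "$severity",
      "instance": "manual-test"
    },
    "annotations": {
      "summary": "Test alert, please ignore",
      "description": "Sent by hand to verify the notification pipeline"
    }
  }
]
EOF
}
```

Send it with: test_alert_json critical | curl -fsS -X POST -H 'Content-Type: application/json' --data-binary @- http://localhost:9093/api/v2/alerts. The message should appear in the matching Slack channel after the group_wait delay. Since no end time is set, the alert should resolve on its own once resolve_timeout has passed.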

Troubleshooting Common Issues

"Targets show DOWN in Prometheus"

Check network connectivity between Prometheus and the target, then verify that the target is actually exporting metrics on the expected port:

curl http://target-host:9100/metrics

"Grafana says 'Datasource not found'"

Verify that the datasource URL matches the service name in docker-compose.yml: http://prometheus:9090

"Alerts fire but no Slack notification"

Check the Slack webhook URL by testing it manually:

curl -X POST -H 'Content-type: application/json' --data '{"text":"test"}' YOUR_WEBHOOK_URL

If the webhook works, check whether the alerts appear in the AlertManager UI. If they do not, prometheus.yml is probably missing an alerting section that points at alertmanager:9093.


Conclusion

You now have a production-grade monitoring stack. Here is what is running:

Component       Port   Purpose
Prometheus      9090   Metrics collection and storage
Node Exporter   9100   Linux system metrics
Grafana         3000   Dashboards and visualization
AlertManager    9093   Alert routing and silencing

From here, you might want to:

  • Add cAdvisor for Docker container metrics
  • Add Blackbox Exporter for HTTP endpoint probing
  • Add PostgreSQL Exporter or Redis Exporter for database monitoring
  • Set up Loki for log aggregation alongside metrics
  • Configure Prometheus remote write to Grafana Cloud or another long-term store

Monitoring is not a one-time setup — it is an ongoing practice. Revisit your dashboards monthly. Ask: "What broke last month? Did we have a dashboard for it? Did we get alerted?" That cycle of continuous improvement is what separates monitoring from observability.


What monitoring challenges are you facing? Drop a comment below — I read every one.

DevToCash · Author

Senior DevOps/SRE Engineer · 10+ years · Professional Trader (IDX, Crypto, US Equities)

I write about real infrastructure patterns and trading strategies I use in production and in live markets. No courses, no affiliate hype — just documentation of what actually works.
