Introduction
Your application is live. Users are hitting your endpoints. Traffic is growing. Then it happens — 3 AM, your phone buzzes. A customer reports the site is down. You scramble to check: is it the database? The API server? Did you run out of disk space?
This is the scenario a Prometheus and Grafana stack is meant to head off: you see the warning signs before your customers do.
In this guide, I will walk you through setting up a production-ready monitoring stack. We will cover everything from scraping metrics to building dashboards to configuring alerts that reach you while something is starting to break, not after it has already failed.
By the end, you will have:
- Prometheus collecting metrics from your servers
- Grafana dashboards showing real-time system health
- AlertManager sending notifications to Slack, email, or PagerDuty
- Node Exporter exposing Linux system metrics
Let us build this.
Prerequisites
You need:
- A Linux server (Ubuntu 22.04 or 24.04 recommended)
- At least 2 GB RAM and 10 GB disk for the monitoring server
- SSH access with sudo privileges
- Docker and Docker Compose installed (we will use containers for simplicity)
If you do not have Docker installed:
sudo apt update && sudo apt install -y docker.io docker-compose-v2
sudo usermod -aG docker $USER
newgrp docker
Architecture Overview
Here is what we are building:
┌─────────────┐     scrape     ┌──────────────┐
│    Node     │◄───────────────│  Prometheus  │
│  Exporter   │   every 15s    │ (metrics DB) │
│    :9100    │                │    :9090     │
└─────────────┘                └──────┬───────┘
                                      │ query
┌─────────────┐                ┌──────▼───────┐
│ App Server  │                │   Grafana    │
│   metrics   │                │    :3000     │
│    :8080    │                └──────────────┘
└─────────────┘
                                      │ alerts
                               ┌──────▼───────┐
                               │ AlertManager │
                               │    :9093     │
                               └──────┬───────┘
                                      │ notify
                               ┌──────▼───────┐
                               │ Slack/Email  │
                               │  PagerDuty   │
                               └──────────────┘
Prometheus pulls metrics from targets (Node Exporter, your app, databases). Grafana queries Prometheus for visualization. AlertManager handles alert routing and silencing.
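To make the pull model concrete, here is a minimal sketch of the kind of plain-text payload a target serves at /metrics and that Prometheus fetches on each scrape. The helper name renderGauge and the demo_queue_depth metric are hypothetical, chosen only for illustration; real exporters such as Node Exporter generate this output for you.

```javascript
// Hypothetical helper: render one gauge in the Prometheus text exposition format.
// This sketch does not escape special characters in label values.
function renderGauge(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} gauge`];
  for (const { labels, value } of samples) {
    // Labels are rendered as key="value" pairs inside curly braces.
    const pairs = Object.entries(labels).map(([k, v]) => `${k}="${v}"`);
    const labelPart = pairs.length > 0 ? `{${pairs.join(',')}}` : '';
    lines.push(`${name}${labelPart} ${value}`);
  }
  return lines.join('\n') + '\n';
}

// Made-up metric, used only to show the shape of a scrape response.
console.log(
  renderGauge('demo_queue_depth', 'Jobs waiting in the queue.', [
    { labels: { queue: 'email' }, value: 3 },
    { labels: { queue: 'billing' }, value: 0 },
  ])
);
```

Each scrape is just an HTTP GET that returns text like this, which is why curl is enough to check whether a target is exposing metrics.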
Step 1: Project Structure
Create a directory for our monitoring stack:
mkdir -p ~/monitoring-stack
cd ~/monitoring-stack
Create the directory layout:
monitoring-stack/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── alerts/
│       └── node_alerts.yml
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── prometheus.yml
│       └── dashboards/
│           └── dashboard.yml
└── alertmanager/
    └── alertmanager.yml
Step 2: Prometheus Configuration
Create prometheus/prometheus.yml:
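global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'production'
# Without this block Prometheus evaluates the rules but never forwards
# firing alerts to the AlertManager container
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
rule_files:
  - 'alerts/*.yml'
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_exporter'
    static_configs:
      - targets:
          - 'node-exporter:9100'
        labels:
          environment: 'production'
          role: 'application-server'
  - job_name: 'node_exporter_monitoring'
    static_configs:
      # localhost is resolved inside the Prometheus container, so this target is
      # reachable only if Node Exporter listens in that same network namespace
      - targets: ['localhost:9100']
        labels:
          environment: 'production'
          role: 'monitoring-server'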
Key decisions here:
- 15-second scrape interval: balances freshness with storage cost
- external_labels: helps identify metrics when you have multiple Prometheus instances
- Separate jobs per role: makes filtering in Grafana much easier
Now create the alerting rules in prometheus/alerts/node_alerts.yml:
groups:
  - name: node_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage has been above 80% for 10 minutes (current: {{ $value }}%)"
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% (current: {{ $value }}%)"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Less than 10% disk space remaining on / (current: {{ $value }}%)"
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been unreachable for 2 minutes."
These four alerts cover the most common production issues: CPU, memory, disk, and instance availability. The for clause prevents flapping — an alert must persist for the specified duration before firing.
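The effect of the for clause is easier to see in a small simulation. The sketch below is a simplified model written for this guide, not Prometheus code: the function name alertStates and its parameters are hypothetical, and it tracks only a single series.

```javascript
// Simplified model of how a `for` duration gates an alert.
// conditionResults: one boolean per rule evaluation (true = expression matched).
// evalIntervalSec: seconds between evaluations; forSec: the rule's `for` duration.
function alertStates(conditionResults, evalIntervalSec, forSec) {
  const states = [];
  let activeSince = null; // time at which the condition most recently became true
  conditionResults.forEach((matched, i) => {
    const now = i * evalIntervalSec;
    if (!matched) {
      activeSince = null; // any gap resets the timer
      states.push('inactive');
      return;
    }
    if (activeSince === null) activeSince = now;
    states.push(now - activeSince >= forSec ? 'firing' : 'pending');
  });
  return states;
}

// A one-evaluation spike never fires; a sustained breach does.
console.log(alertStates([true, false, true, true, true, true], 15, 30));
// [ 'pending', 'inactive', 'pending', 'pending', 'firing', 'firing' ]
```

The first spike goes back to inactive without notifying anyone, while the later breach has to hold for the full 30 seconds before it turns into a firing alert.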
Step 3: Grafana Provisioning
Grafana supports declarative provisioning — configure datasources and dashboards as code. No more manual UI setup after redeploy.
Create grafana/provisioning/datasources/prometheus.yml:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: "15s"
Create grafana/provisioning/dashboards/dashboard.yml:
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
Step 4: AlertManager Configuration
Create alertmanager/alertmanager.yml:
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
route:
  receiver: 'slack-critical'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'slack-critical'
      continue: true
    - match:
        severity: warning
      receiver: 'slack-warning'
receivers:
  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts-critical'
        title: '🔴 {{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          *Instance:* {{ .Labels.instance }}
          {{ end }}
  - name: 'slack-warning'
    slack_configs:
      - channel: '#alerts-warning'
        title: '🟡 {{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          {{ end }}
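This routes critical alerts to one Slack channel and warnings to another; an alert with any other severity falls back to the root receiver, slack-critical. Grouping keeps the noise down: group_wait (30 seconds) batches alerts that start firing together into a single first notification, and group_interval (5 minutes) sets how long AlertManager waits before sending an update when new alerts join a group it has already notified about.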
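The routing decision can be sketched as a small function. This is a simplified model of a single level of child routes with exact-match labels, written for illustration; receiversFor is a hypothetical name and the real AlertManager routing tree supports nesting and regex matchers that are left out here.

```javascript
// Simplified model of the routing tree from alertmanager.yml above.
const route = {
  receiver: 'slack-critical', // root receiver, used when no child route matches
  routes: [
    { match: { severity: 'critical' }, receiver: 'slack-critical', continue: true },
    { match: { severity: 'warning' }, receiver: 'slack-warning' },
  ],
};

function receiversFor(labels, root) {
  const matched = [];
  for (const child of root.routes) {
    const ok = Object.entries(child.match).every(([k, v]) => labels[k] === v);
    if (!ok) continue;
    matched.push(child.receiver);
    if (!child.continue) break; // stop at the first match unless `continue` is set
  }
  // No child matched: the alert falls back to the root receiver.
  return matched.length > 0 ? matched : [root.receiver];
}

console.log(receiversFor({ alertname: 'HighCPUUsage', severity: 'warning' }, route));
// [ 'slack-warning' ]
console.log(receiversFor({ alertname: 'InstanceDown', severity: 'critical' }, route));
// [ 'slack-critical' ]
console.log(receiversFor({ alertname: 'Custom', severity: 'info' }, route));
// [ 'slack-critical' ]
```

The last call shows a side effect worth knowing about: an alert with an unexpected severity still lands in the critical channel, because that is the root receiver.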
Step 5: Docker Compose — Putting It All Together
Create docker-compose.yml:
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts:/etc/prometheus/alerts
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    restart: unless-stopped
    networks:
      - monitoring
  node-exporter:
    image: prom/node-exporter:v1.8.2
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    restart: unless-stopped
    networks:
      - monitoring
  grafana:
    image: grafana/grafana:11.1.0
    container_name: grafana
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=change-me-now
      - GF_SERVER_ROOT_URL=https://monitoring.yourdomain.com
      - GF_AUTH_ANONYMOUS_ENABLED=false
    ports:
      - "3000:3000"
    restart: unless-stopped
    networks:
      - monitoring
    depends_on:
      - prometheus
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    restart: unless-stopped
    networks:
      - monitoring
volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:
networks:
  monitoring:
    driver: bridge
Important details:
- prometheus_data volume: persists metrics across container restarts
- 30-day retention: adjust based on your disk capacity
- --web.enable-lifecycle: allows hot-reloading the Prometheus config without a restart
- Node Exporter mounts are read-only for security
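To size the disk for a given retention period, a rough rule of thumb is retention time in seconds, times ingested samples per second, times bytes per sample. The sketch below applies it; the function name is hypothetical, the 3,000 active series figure is an illustrative guess rather than a measurement, and 2 bytes per sample is an assumed average after compression.

```javascript
// Rough disk estimate: retention seconds * samples per second * bytes per sample.
function estimateDiskBytes({ activeSeries, scrapeIntervalSec, retentionDays, bytesPerSample }) {
  const samplesPerSecond = activeSeries / scrapeIntervalSec;
  const retentionSeconds = retentionDays * 24 * 60 * 60;
  return retentionSeconds * samplesPerSecond * bytesPerSample;
}

// Assumed workload: 3,000 active series scraped every 15 seconds, kept for 30 days.
const estimatedBytes = estimateDiskBytes({
  activeSeries: 3000,
  scrapeIntervalSec: 15,
  retentionDays: 30,
  bytesPerSample: 2,
});
console.log(`${(estimatedBytes / 1e9).toFixed(2)} GB`); // 1.04 GB
```

Halving the scrape interval or doubling the retention doubles the estimate, so check the real series count on your own server before committing to a disk size.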
Step 6: Launch and Verify
# Start everything
docker compose up -d
# Check all containers are running
docker compose ps
# Verify Prometheus is scraping targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
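Expected output (a target reports "up" only once Prometheus can reach it; localhost:9100 is resolved inside the Prometheus container, so node_exporter_monitoring may report "down" unless Node Exporter is reachable at that address):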
{ "job": "prometheus", "health": "up" }
{ "job": "node_exporter", "health": "up" }
{ "job": "node_exporter_monitoring", "health": "up" }
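If you would rather script this check than read the output by eye, the same fields the jq filter extracts can be processed in Node.js. The sketch below runs against a trimmed, hand-written sample payload, not captured output; summarizeTargets and unhealthyJobs are hypothetical helper names, and a real response carries many more fields per target.

```javascript
// Summarize the JSON returned by the /api/v1/targets endpoint.
function summarizeTargets(response) {
  return response.data.activeTargets.map((t) => ({
    job: t.labels.job,
    health: t.health,
  }));
}

// List the jobs that have at least one target not reporting "up".
function unhealthyJobs(response) {
  return summarizeTargets(response)
    .filter((t) => t.health !== 'up')
    .map((t) => t.job);
}

// Hand-written sample limited to the fields used above.
const sampleResponse = {
  status: 'success',
  data: {
    activeTargets: [
      { labels: { job: 'prometheus', instance: 'localhost:9090' }, health: 'up' },
      { labels: { job: 'node_exporter', instance: 'node-exporter:9100' }, health: 'down' },
    ],
  },
};
console.log(unhealthyJobs(sampleResponse)); // [ 'node_exporter' ]
```

An empty array means every target is healthy, which makes this easy to wire into a deploy script as a post-launch check.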
Now visit:
- Prometheus: http://your-server:9090 (try querying node_memory_MemAvailable_bytes)
- Grafana: http://your-server:3000 (log in with admin / change-me-now)
- AlertManager: http://your-server:9093 (view alert status)
Step 7: Import a Pre-Built Dashboard
Grafana has a massive community dashboard library. The most popular Node Exporter dashboard is ID 1860 ("Node Exporter Full").
- In Grafana, go to Dashboards → Import
- Enter 1860 in the "Import via grafana.com" field
- Select your Prometheus datasource
- Click Import
You will immediately see CPU, memory, disk, network, and dozens of other panels populated with real data.
Step 8: Add Application Metrics
So far we are monitoring the server itself. Let us add application-level metrics.
For a Node.js app, install prom-client:
npm install prom-client
Add this to your Express app:
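const promClient = require('prom-client');
// Assumes `app` is your existing Express application instance
const register = new promClient.Registry();
promClient.collectDefaultMetrics({ register });
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
  // Attach to the custom registry; otherwise the histogram lands in the
  // default global registry and never appears in the /metrics output below
  registers: [register],
});
// Time every request and record the duration once the response is sent
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    end({ method: req.method, route: req.route?.path || req.path, status: res.statusCode });
  });
  next();
});
// Expose everything in the custom registry for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});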
Then add a new scrape job in prometheus.yml:
  - job_name: 'nodejs_app'
    static_configs:
      - targets: ['your-app-server:3000']
        labels:
          app: 'my-api'
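The buckets chosen for http_request_duration_seconds decide how much latency detail you get, because Prometheus histogram buckets are cumulative: each one counts every observation less than or equal to its upper bound. The sketch below is a simplified model of that counting, not prom-client code; bucketCounts is a hypothetical name and the sample durations are made up.

```javascript
// Simplified model of cumulative histogram buckets (the `le` label).
function bucketCounts(bounds, observations) {
  return [...bounds, Infinity].map((bound) => ({
    le: bound === Infinity ? '+Inf' : String(bound),
    // Each bucket counts every observation at or below its upper bound.
    count: observations.filter((v) => v <= bound).length,
  }));
}

// Same bucket boundaries as the histogram above; durations in seconds are invented.
const bounds = [0.01, 0.05, 0.1, 0.5, 1, 5];
console.log(bucketCounts(bounds, [0.004, 0.03, 0.2, 0.2, 7]));
// le 0.01 -> 1, 0.05 -> 2, 0.1 -> 2, 0.5 -> 4, 1 -> 4, 5 -> 4, +Inf -> 5
```

The 7-second request shows up only in the +Inf bucket, so if slow requests matter to you, make sure the largest finite bucket sits above the slowest latency you expect to see.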
Step 9: Nginx Reverse Proxy for Grafana (Optional but Recommended)
Running Grafana on port 3000 without TLS is not ideal. Let us put Nginx in front:
server {
    listen 443 ssl;
    server_name monitoring.yourdomain.com;
    ssl_certificate /etc/letsencrypt/live/monitoring.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/monitoring.yourdomain.com/privkey.pem;
    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
Generate the TLS certificate:
sudo certbot --nginx -d monitoring.yourdomain.com
Monitoring Best Practices
Here are a few principles I have learned from running monitoring in production:
- Alert on symptoms, not causes. "Site is returning 500s" is better than "CPU is high" because the former tells you users are affected.
- Avoid alert fatigue. If an alert fires every day and nobody acts on it, delete it or silence it. Noisy alerts teach your team to ignore all alerts.
- Dashboard hierarchy. Create three levels: high-level (exec summary), service-level (per team), and drill-down (debugging). Nobody needs to see 50 panels at once.
- Retention is a tradeoff. 30 days is enough for most teams. If you need long-term trending, consider Thanos or VictoriaMetrics for Prometheus long-term storage.
- Test your alerts. Intentionally trigger each alert once to verify the notification pipeline works end to end.
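As an illustration of the first principle, a symptom-based check looks at what users experience, such as the share of requests that fail. The sketch below computes that ratio from request counts per status code; the function name, the sample counts, and the 5% threshold are all hypothetical.

```javascript
// Share of requests that ended in a 5xx status, given counts per status code.
function errorRatio(countsByStatus) {
  let total = 0;
  let errors = 0;
  for (const [status, count] of Object.entries(countsByStatus)) {
    total += count;
    const code = Number(status);
    if (code >= 500 && code < 600) errors += count;
  }
  return total === 0 ? 0 : errors / total;
}

const ERROR_THRESHOLD = 0.05; // hypothetical: act when more than 5% of requests fail
const currentRatio = errorRatio({ 200: 940, 404: 20, 500: 30, 503: 10 });
console.log(currentRatio, currentRatio > ERROR_THRESHOLD); // 0.04 false
```

A check like this stays quiet during a CPU spike that users never notice and fires during an outage that leaves CPU idle, which is the point of alerting on symptoms.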
Troubleshooting Common Issues
"Targets show DOWN in Prometheus"
Check network connectivity between Prometheus and the target. Verify the target is actually exporting metrics on the expected port: curl http://target-host:9100/metrics
"Grafana says 'Datasource not found'"
Verify the datasource URL matches the service name in docker-compose: http://prometheus:9090
"Alerts fire but no Slack notification"
Check the Slack webhook URL. Test it manually: curl -X POST -H 'Content-type: application/json' --data '{"text":"test"}' YOUR_WEBHOOK_URL
Conclusion
You now have a production-grade monitoring stack. Here is what is running:
| Component | Port | Purpose |
|---|---|---|
| Prometheus | 9090 | Metrics collection and storage |
| Node Exporter | 9100 | Linux system metrics |
| Grafana | 3000 | Dashboards and visualization |
| AlertManager | 9093 | Alert routing and silencing |
From here, you might want to:
- Add cAdvisor for Docker container metrics
- Add Blackbox Exporter for HTTP endpoint probing
- Add PostgreSQL Exporter or Redis Exporter for database monitoring
- Set up Loki for log aggregation alongside metrics
- Configure Prometheus remote write to Grafana Cloud or another long-term store
Monitoring is not a one-time setup — it is an ongoing practice. Revisit your dashboards monthly. Ask: "What broke last month? Did we have a dashboard for it? Did we get alerted?" That cycle of continuous improvement is what separates monitoring from observability.
What monitoring challenges are you facing? Drop a comment below — I read every one.