
PostgreSQL Performance Tuning: Fix Slow Queries in Production Without Downtime

The exact commands to diagnose and fix slow PostgreSQL queries. Real examples from production databases handling millions of rows — no downtime required.

April 9, 2026 · 8 min read
#postgresql#database#performance#devops#backend

Your app is slow. The logs say it's the database. But which query? And how do you fix it without a 3am maintenance window?

This is the exact process I use to diagnose and fix PostgreSQL performance problems in production.

Step 1: Find What's Actually Slow

Before optimizing anything, you need data.

Enable pg_stat_statements

-- Check if it's enabled
SELECT * FROM pg_extension WHERE extname = 'pg_stat_statements';

-- Enable it (requires superuser)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

Add to postgresql.conf (changing shared_preload_libraries requires a restart):

shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
pg_stat_statements.max = 10000

Find Your Top 10 Slowest Queries

SELECT 
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round(total_exec_time::numeric, 2) AS total_ms,
    calls,
    round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS pct_total,
    substring(query, 1, 100) AS query_snippet
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

This shows you queries sorted by total time spent — not just slowest individual run, but slowest total impact on your database.
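To see why total time is the right sort key, here is a toy comparison (made-up numbers, not from a real pg_stat_statements dump):

```python
# Total impact = mean time per call * number of calls.
queries = [
    ("SELECT ... FROM sessions WHERE token = $1", 2.0, 100_000),  # fast but hot
    ("SELECT ... FROM reports WHERE month = $1", 500.0, 30),      # slow but rare
]

impact = sorted(
    ((q, mean_ms * calls) for q, mean_ms, calls in queries),
    key=lambda pair: pair[1],
    reverse=True,
)
for q, total_ms in impact:
    print(f"{total_ms:>9.0f} ms  {q}")
# The 2 ms query burns 200,000 ms in total; the "slow" 500 ms query only 15,000 ms.
```

The hot 2 ms query is the bigger win, even though it would never show up in a "slowest single query" list.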

Find Queries with Worst Cache Hit Ratio

SELECT 
    schemaname,
    tablename,
    heap_blks_read,
    heap_blks_hit,
    round(heap_blks_hit::numeric / nullif(heap_blks_hit + heap_blks_read, 0) * 100, 2) AS cache_hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 20;

You want cache hit > 99%. Below 95% means you're hitting disk constantly.

Step 2: EXPLAIN ANALYZE — Read It Like a Map

Once you have your slow query, understand why it's slow:

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT u.id, u.email, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.created_at > NOW() - INTERVAL '30 days'
GROUP BY u.id, u.email
ORDER BY order_count DESC
LIMIT 50;

Reading EXPLAIN Output

Limit  (cost=45231.88..45231.90 rows=50) (actual time=2341.221..2341.234 rows=50)
  ->  Sort  (cost=45231.88..45481.88 rows=100000) (actual time=2341.220..2341.225 rows=50)
        ->  HashAggregate  (cost=35000.00..37500.00 rows=100000) (actual time=2298.445..2330.112 rows=84231)
              ->  Hash Left Join  (cost=2500.00..30000.00 rows=1000000) (actual time=45.221..1987.332 rows=1000000)
                    ->  Seq Scan on users  (cost=0.00..15000.00) (actual time=0.021..342.551 rows=500000)
                          Filter: (created_at > (now() - '30 days'::interval))
                          Rows Removed by Filter: 450000
                    ->  Hash  (actual time=44.221..44.221 rows=1000000)
                          ->  Seq Scan on orders  (cost=0.00..20000.00) (actual time=0.015..35.221 rows=1000000)

Red flags to look for:

| Warning Sign | What It Means |
|--------------|---------------|
| Seq Scan on a large table | Missing index, reads every row |
| Rows Removed by Filter: 450000 | An index would eliminate these |
| `rows=` estimate far from actual rows | Statistics are stale, run ANALYZE |
| Hash Join with a huge hash | Consider whether the join order can be improved |
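One way to spot the stale-statistics case mechanically is to compare the planner's `rows=` estimate against the actual row count on each node. A rough sketch (the regex is tuned to the text format above, not a general EXPLAIN parser):

```python
import re

# Matches "... rows=<est> ...) (actual ... rows=<actual>" on one plan line.
NODE = re.compile(r"rows=(\d+)[^)]*\)\s+\(actual[^)]*\brows=(\d+)")

def misestimated(plan_line: str, factor: float = 10.0):
    """Return (estimated, actual, ratio) when the estimate is off by
    >= factor, else None. A big ratio suggests running ANALYZE."""
    m = NODE.search(plan_line)
    if not m:
        return None
    est, actual = int(m.group(1)), int(m.group(2))
    if est == 0 or actual == 0:
        return None
    ratio = max(est, actual) / min(est, actual)
    return (est, actual, ratio) if ratio >= factor else None

line = "Seq Scan on users  (cost=0.00..15000.00 rows=1000) (actual time=0.021..342.551 rows=500000)"
print(misestimated(line))  # (1000, 500000, 500.0)
```

A 500x misestimate like this almost always means ANALYZE has not run since the table changed shape.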

Step 3: Fix the Most Common Problems

Problem 1: Missing Index on Filter Column

-- Slow: full table scan
SELECT * FROM orders WHERE user_id = 12345;

-- Fix: create index
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
-- CONCURRENTLY means no table lock — safe for production

Index on timestamp for time-range queries:

-- Slow: scanning all users
SELECT * FROM users WHERE created_at > NOW() - INTERVAL '30 days';

-- Fix:
CREATE INDEX CONCURRENTLY idx_users_created_at ON users(created_at);

Problem 2: Index Exists but Not Being Used

Postgres can ignore an index even when one exists. The most common cause is wrapping the indexed column in a function:

-- Check: why isn't index being used?
EXPLAIN SELECT * FROM users WHERE lower(email) = 'user@example.com';
-- Output: Seq Scan (because index is on email, not lower(email))

-- Fix: function-based index
CREATE INDEX CONCURRENTLY idx_users_email_lower ON users(lower(email));

-- Now this will use the index:
SELECT * FROM users WHERE lower(email) = 'user@example.com';

Index not used due to type mismatch:

-- Column is INTEGER but the value is compared as a different type
SELECT * FROM users WHERE id = 12345.0;  -- numeric literal forces a cast → index may be skipped
-- The same happens when an ORM/driver binds the parameter with the wrong type

-- Fix: match types
SELECT * FROM users WHERE id = 12345;  -- integer literal

Problem 3: N+1 Query Problem

Your ORM fires 1 query to get users, then 1 query per user to get their orders = N+1 queries.

-- Bad pattern: called 10,000 times for 10,000 users
SELECT * FROM orders WHERE user_id = ?;

-- Fix: single query with JOIN or IN clause
SELECT u.id, u.email, o.id as order_id, o.total
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.id = ANY(ARRAY[1,2,3,...,10000]);
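The same pattern at the application layer, simulated in memory so the round trips are countable (hypothetical helper names; a dict stands in for the orders table):

```python
# In-memory sketch of the N+1 pattern vs a batched fetch.
ORDERS = {1: ["o1", "o2"], 2: ["o3"], 3: []}
round_trips = 0

def query_orders_for(user_id):        # one round trip per call: the N+1 trap
    global round_trips
    round_trips += 1
    return ORDERS.get(user_id, [])

def query_orders_for_many(user_ids):  # one round trip, like WHERE user_id = ANY(...)
    global round_trips
    round_trips += 1
    return {uid: ORDERS.get(uid, []) for uid in user_ids}

for uid in [1, 2, 3]:                 # N+1: one call per user
    query_orders_for(uid)
print(round_trips)  # 3

round_trips = 0
query_orders_for_many([1, 2, 3])      # batched: one call total
print(round_trips)  # 1
```

Most ORMs expose the batched form as eager loading (e.g. `select_related`/`prefetch_related` in Django, `includes` in Rails).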

Problem 4: Missing Composite Index

Sometimes you need an index on multiple columns:

-- Query filters on two columns
SELECT * FROM orders 
WHERE user_id = 12345 AND status = 'pending';

-- Single-column index on user_id helps
-- but a composite index is better
CREATE INDEX CONCURRENTLY idx_orders_user_status 
ON orders(user_id, status);

-- Column order matters: lead with the column you also filter on alone.
-- (user_id, status) serves both "user_id = ?" and "user_id = ? AND status = ?",
-- but a query filtering only on status can't use this index efficiently.

Problem 5: VACUUM Not Keeping Up (Table Bloat)

Heavy UPDATE/DELETE workloads leave dead rows. PostgreSQL autovacuum should handle this, but sometimes it can't keep up.

-- Check table bloat
SELECT 
    schemaname,
    tablename,
    n_dead_tup,
    n_live_tup,
    round(100 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0)::numeric, 2) AS dead_pct,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

If dead_pct > 20% or autovacuum hasn't run recently:

-- Manual vacuum (non-blocking)
VACUUM ANALYZE orders;

-- If very bloated (need full rebuild, requires lock)
VACUUM FULL orders;  -- Warning: table lock! Schedule maintenance window.

-- Better alternative to VACUUM FULL (no lock):
-- Use pg_repack extension

Tune autovacuum for high-write tables:

ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- vacuum when 1% of rows are dead (default: 20%)
    autovacuum_analyze_scale_factor = 0.005,
    autovacuum_vacuum_cost_delay = 2        -- milliseconds; lower = faster vacuuming (2ms is already the default since PG 12)
);
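The reason the scale factor matters so much on big tables: per the PostgreSQL docs, autovacuum triggers roughly when dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * live rows`. Plugging in numbers:

```python
def autovacuum_trigger(n_live_tup: int,
                       scale_factor: float = 0.2,  # default autovacuum_vacuum_scale_factor
                       base_threshold: int = 50):  # default autovacuum_vacuum_threshold
    """Dead tuples needed before autovacuum kicks in:
    threshold = base + scale_factor * live rows."""
    return base_threshold + scale_factor * n_live_tup

# 10M-row table with defaults: ~2M dead rows pile up before autovacuum runs
print(int(autovacuum_trigger(10_000_000)))        # 2000050
# Same table with scale_factor = 0.01: triggers after ~100k dead rows
print(int(autovacuum_trigger(10_000_000, 0.01)))  # 100050
```

With the default 20%, a 10M-row table accumulates two million dead tuples before any cleanup starts, which is exactly the bloat the per-table settings above prevent.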

Step 4: postgresql.conf Tuning

The defaults are conservative. For a dedicated database server:

# Memory (for 16GB RAM server)
shared_buffers = 4GB              # 25% of RAM
effective_cache_size = 12GB       # 75% of RAM (planner hint)
work_mem = 64MB                   # Per sort/hash operation (watch for OOM with many connections)
maintenance_work_mem = 1GB        # For VACUUM, CREATE INDEX

# Checkpoints (reduce I/O spikes)
checkpoint_completion_target = 0.9
wal_buffers = 64MB
max_wal_size = 4GB

# Connection settings
max_connections = 100             # Use connection pooler (PgBouncer) instead of raising this

Critical: Don't raise max_connections to solve connection issues. Use PgBouncer. Each Postgres connection consumes roughly 5-10MB of RAM, so 500 connections means 2.5-5GB just for connection overhead.
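The back-of-the-envelope math (5-10 MB per backend is a ballpark that varies by version and workload, not a fixed constant):

```python
def connection_overhead_mb(connections: int, mb_per_conn: float = 7.5) -> float:
    """Rough RAM cost of idle Postgres backends at ~5-10 MB each."""
    return connections * mb_per_conn

print(connection_overhead_mb(500))  # 3750.0 -> roughly 2.5-5 GB across the 5-10 MB range
print(connection_overhead_mb(20))   # 150.0  -> a PgBouncer-sized pool is nearly free
```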

Step 5: Connection Pooling (PgBouncer)

Most applications don't need persistent connections. PgBouncer sits between your app and Postgres:

App (100 threads) → PgBouncer (20 server connections) → PostgreSQL

# pgbouncer.ini
[databases]
myapp = host=localhost port=5432 dbname=myapp

[pgbouncer]
pool_mode = transaction          # Best performance for stateless apps
max_client_conn = 1000          # How many app connections
default_pool_size = 20          # How many actual PG connections
min_pool_size = 5

Transaction pool mode returns the connection to the pool after each transaction. It works for most apps, but is incompatible with session state: SET commands, session-level advisory locks, and (on older PgBouncer versions) prepared statements.

Monitoring Queries to Run Weekly

-- Tables without primary keys (missing indexes, can cause slow seq scans)
SELECT tablename 
FROM pg_tables 
WHERE schemaname = 'public' 
AND tablename NOT IN (
    SELECT table_name FROM information_schema.table_constraints 
    WHERE constraint_type = 'PRIMARY KEY' AND table_schema = 'public'
);

-- Unused indexes (wasting write performance)
SELECT 
    schemaname || '.' || relname AS table,
    indexrelname AS index,
    idx_scan AS scans,
    pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan < 10
ORDER BY pg_relation_size(indexrelid) DESC;
-- Check before dropping: unique indexes enforce constraints even if never scanned

-- Long-running queries (> 5 minutes)
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND (now() - query_start) > interval '5 minutes'
ORDER BY duration DESC;

The 15-Minute Production Fix Checklist

When someone pages you at 2am because the database is slow:

  1. Check active queries: SELECT pid, wait_event_type, wait_event, query FROM pg_stat_activity WHERE state = 'active';
  2. Kill long-running queries if needed: SELECT pg_terminate_backend(pid);
  3. Check for lock contention: SELECT * FROM pg_locks l JOIN pg_stat_activity a ON l.pid = a.pid WHERE NOT granted;
  4. Find top slow queries: Run pg_stat_statements query above
  5. Check indexes exist: EXPLAIN ANALYZE on the worst query
  6. Create missing index: CREATE INDEX CONCURRENTLY ...
  7. Run ANALYZE if statistics stale: ANALYZE <tablename>;
  8. Monitor: Watch pg_stat_activity while app traffic recovers

Most production database incidents are solved by one of: missing index, stale statistics, or lock contention. This checklist covers all three.

FAQ

Q: How do I add an index without downtime? A: Always use CREATE INDEX CONCURRENTLY in production. It takes longer to build but doesn't lock the table. A regular CREATE INDEX blocks all writes (reads still work) until it completes. One caveat: if a concurrent build fails, it leaves an INVALID index behind; drop it and retry.

Q: My query was fast, now it's slow after adding more data. Why? A: Postgres query planner uses statistics (row counts, value distributions) to pick execution plans. Run ANALYZE tablename to refresh statistics. If that doesn't help, the planner may be choosing wrong — try SET enable_seqscan = off temporarily to force index use and confirm the index helps.

Q: Should I use UUID or integer for primary keys? A: Integer (BIGINT) primary keys are faster for joins and indexing. UUID v4 is random — inserting random UUIDs into a B-tree index causes page splits, inflating the index. If you need UUIDs (for distributed systems), use UUIDv7 which is time-ordered, or use gen_random_uuid() with awareness that write performance degrades at large scale.

Q: What's the max table size PostgreSQL can handle? A: Theoretically 32TB per table. Practically, tables over 100GB start needing careful partition management. Postgres table partitioning (declarative partitioning in PG 10+) solves this — partition by date or ID range and queries that filter on the partition key skip irrelevant partitions entirely.

DevToCash

Senior DevOps/SRE Engineer · 10+ years · Professional Trader (IDX, Crypto, US Equities)

I write about real infrastructure patterns and trading strategies I use in production and in live markets. No courses, no affiliate hype — just documentation of what actually works.

More about me →