When a production server slows down at 2 AM, you don't have time to Google. You need a mental toolkit that lets you pinpoint the problem in under 5 minutes. After responding to hundreds of production incidents, these are the 10 Linux commands I reach for first — and what each output actually tells you.
Why Performance Tuning Matters More Than Ever
Modern stacks are complex. A single "slow page" complaint could be CPU contention, disk I/O saturation, memory pressure, network latency, or a runaway process. Blindly restarting services wastes time and masks the real cause. Good diagnostics tell you the why before you touch anything.
1. top / htop — Your First Look
Everyone knows top, but few use it effectively.
top -c -d 1
What to look at:
- %us — User CPU. High = your app is the problem
- %sy — System CPU. High = kernel overhead (check I/O)
- %wa — I/O wait. Above 5-10% = disk bottleneck
- load average — If consistently above CPU core count, you're overloaded
Install htop for a better view:
htop --sort-key=PERCENT_CPU
Press F5 for tree view to see parent-child process relationships. This instantly shows you if a single worker is spawning hundreds of children.
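If htop isn't installed and you can't add packages mid-incident, you can get a similar tree view from stock tools (pstree comes with the psmisc package on most distros):
# Process tree with PIDs: spot a worker spawning hundreds of children
pstree -p | less
# Plain ps alternative, no extra packages needed
ps -ef --forest | less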
2. vmstat — CPU, Memory, and I/O Snapshot
vmstat 1 10
Runs 10 samples, 1 second apart. Key columns:
| Column | Meaning | Alert Level |
|--------|---------|-------------|
| r | Processes waiting for CPU | > CPU cores |
| b | Processes in uninterruptible sleep (I/O) | > 5 |
| swpd | Swap usage in KB | Growing, or paired with si/so activity |
| si/so | Swap in/out per second | > 0 means RAM pressure |
| wa | % time waiting for I/O | > 10% is a problem |
If r is consistently above your CPU count, you need more CPU or need to find what's consuming it.
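A quick way to answer "what's consuming it" is a per-process CPU ranking; a minimal example with standard ps options:
# Top 10 CPU consumers right now (header plus 10 rows)
ps -eo pid,ppid,user,%cpu,%mem,comm --sort=-%cpu | head -11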
3. iostat — Disk I/O Deep Dive
iostat -x 1 5
The -x flag gives extended stats. Focus on:
- %util — How busy the device is. Above 80% = saturation approaching
- await — Average time for I/O requests in milliseconds. SSD should be <1ms, HDD <10ms
- r/s, w/s — Read/write operations per second
# Check a specific disk
iostat -x sda 1 5
Pro tip: If %util is 100% but await is low, you have sequential I/O. If await is high, you have random I/O or a slow disk.
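Once you know a device is saturated, the next question is which process is driving the I/O. Assuming the sysstat package is installed, pidstat answers that; iotop is an interactive alternative if you have it:
# Per-process disk read/write rates, 5 samples 1 second apart (part of sysstat)
pidstat -d 1 5
# Interactive view showing only processes actually doing I/O
sudo iotop -o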
4. ss / netstat — Network Connection Analysis
# Count connections by state
ss -s
# Show listening sockets and which process owns them
ss -tulpn
# Count connections to a specific port, grouped by state
ss -an | grep :8080 | awk '{print $2}' | sort | uniq -c | sort -rn
High TIME_WAIT count means your app is churning through short-lived connections faster than the OS expires them. Fix: tune net.ipv4.tcp_tw_reuse or use connection pooling.
High CLOSE_WAIT means your app isn't closing sockets properly — that's a code bug.
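To see the full state breakdown, and to apply the tcp_tw_reuse fix mentioned above, something like this works; treat the sysctl change as an experiment to validate, not a blanket default:
# Count TCP connections grouped by state (TIME_WAIT, CLOSE_WAIT, ESTAB, ...)
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
# Allow reuse of TIME_WAIT sockets for new outgoing connections
sudo sysctl -w net.ipv4.tcp_tw_reuse=1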
5. lsof — Find What's Using Resources
# Who has the most open files
lsof | awk '{print $1}' | sort | uniq -c | sort -rn | head -20
# Check open files for a specific process
lsof -p <PID>
# Find what process has a port
lsof -i :8080
If a process is leaking file descriptors, lsof -p <PID> | wc -l will grow over time. Compare across restarts.
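A lightweight way to watch for a leak during an incident is to sample the count and compare it against the process's limit, for example:
# Sample the open file count every 30 seconds (reading /proc is cheaper than lsof)
watch -n 30 "ls /proc/<PID>/fd | wc -l"
# Compare against the per-process limit
grep "Max open files" /proc/<PID>/limits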
6. strace — See Exactly What a Process Is Doing
This is the nuclear option for diagnosing stuck processes.
# Attach to a running process
strace -p <PID> -e trace=network,file
# Time each system call
strace -T -p <PID> 2>&1 | head -50
Look for:
- epoll_wait loops — normal for event-driven servers
- read hanging — blocked on file or network
- Repeated open/close on the same file — inefficient file handling
- futex waits — lock contention between threads
Warning: strace adds overhead (10-100x slower). Use briefly on one process at a time.
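A lower-impact pattern is summary mode, which aggregates call counts and latencies instead of printing every call; stop it after a few seconds with Ctrl-C and it prints the report as it detaches:
# Aggregate syscall counts and time instead of a full trace (-f follows child processes)
strace -c -f -p <PID>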
7. perf top — CPU Profiler Without Code Changes
sudo perf top -p <PID>
Shows which functions are consuming CPU in real time. No recompilation needed. This is how I identified that a regex in a logging library was consuming 40% of CPU on a high-traffic service.
# Record 30 seconds then analyze
sudo perf record -g -p <PID> -- sleep 30
sudo perf report
If you see a lot of time in __libc_malloc or malloc_consolidate, you have allocation pressure — consider an arena allocator or reducing allocations in hot paths.
8. free + /proc/meminfo — Memory Pressure Diagnosis
free -h
cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable|Cached|SwapTotal|SwapFree|Dirty|Slab"
What matters:
- MemAvailable — not MemFree. Available includes reclaimable cache
- Dirty — Pages waiting to be written to disk. High value with slow disk = write I/O bottleneck
- Slab — Kernel memory for data structures. High value means kernel is caching heavily (usually fine)
If MemAvailable is near zero and swap is active, you need more RAM or need to find and kill memory hogs.
# Find memory hogs
ps aux --sort=-%mem | head -15
9. dmesg — Kernel Messages That No App Log Shows
dmesg -T | tail -50
dmesg -T | grep -E "OOM|killed|error|fail"
The OOM killer silently murders processes when RAM runs out. Your app "crashes" with no error in its own logs — because it was killed externally. The kernel log, read with dmesg (or journalctl -k), is the only place this is recorded:
[Tue Apr 1 03:22:17 2025] Out of memory: Kill process 12345 (java) score 892 or sacrifice child
[Tue Apr 1 03:22:17 2025] Killed process 12345 (java) total-vm:8192000kB, anon-rss:7850000kB
Also watch for:
- EXT4-fs error — filesystem corruption
- TCP: out of memory — network socket exhaustion
- nf_conntrack: table full — connection tracking overflow (common in NAT-heavy setups)
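One grep covers those signatures, and if the nf_conntrack module is loaded you can check how full the tracking table is directly:
# Scan for the failure signatures above
dmesg -T | grep -iE "ext4-fs error|tcp: out of memory|nf_conntrack: table full"
# Current vs. maximum tracked connections
cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max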
10. sar — Historical Performance Data
Unlike the above tools that show current state, sar shows historical data — crucial for post-mortems.
# CPU usage for yesterday
sar -u -f /var/log/sa/sa$(date -d yesterday +%d)
# Memory usage history for today
sar -r
# Per-device I/O history for today
sar -d -p
Enable sysstat to collect data automatically:
apt install sysstat
systemctl enable sysstat
systemctl start sysstat
Data is collected every 10 minutes by default. You can replay exactly what happened during an incident window — hours after the fact.
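For example, to pull CPU data for a one-hour incident window from an earlier day's file (sa file names are the day of the month; adjust the date and times to your case):
# CPU usage between 03:00 and 04:00 on the 1st of the month
sar -u -s 03:00:00 -e 04:00:00 -f /var/log/sa/sa01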
The 5-Minute Incident Triage Checklist
When a server is slow:
1. top → Is CPU pegged? Who? (high %wa = I/O problem)
2. vmstat 1 5 → Swap in use? Many processes sleeping?
3. iostat -x 1 3 → Disk saturated?
4. ss -s → Connection count exploding?
5. free -h → Out of RAM?
6. dmesg | tail → Anything killed by OOM?
In 90% of cases, the answer is visible in these 6 commands within 2 minutes.
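If you want the checklist in one place, here is a minimal triage script that runs the commands in order; the name and output format are my own, not a standard tool, and it assumes sysstat is installed for iostat:
#!/usr/bin/env bash
# triage.sh: a 60-second first look at a slow Linux server
set -u
echo "=== Load and top CPU consumers ==="
uptime
ps -eo pid,user,%cpu,%mem,comm --sort=-%cpu | head -6
echo "=== CPU / memory / swap (vmstat) ==="
vmstat 1 5
echo "=== Disk saturation (iostat, needs sysstat) ==="
iostat -x 1 3
echo "=== Socket summary ==="
ss -s
echo "=== Memory ==="
free -h
echo "=== Recent kernel messages (OOM kills, hardware errors) ==="
dmesg -T | tail -20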
Turning Diagnostics Into Income
These aren't just survival skills — they're billable skills. Rate cards for Linux performance consulting range from $150–$400/hr. If you've resolved real incidents using these tools, you have the portfolio to charge for it.
The pattern that works: document your incidents (anonymized), publish the diagnosis and fix, and let the SEO bring clients to you. That's the core loop behind devtocash.
What's Next
In the next post, I'll cover kernel parameter tuning — the sysctl values that give you 20-40% more throughput with zero hardware changes. Subscribe to get notified when it drops.