When a production server slows down at 2 AM, you don't have time to Google. You need a mental toolkit that lets you pinpoint the problem in under 5 minutes. After responding to hundreds of production incidents, these are the 10 Linux commands I reach for first — and what each output actually tells you.
Why Performance Tuning Matters More Than Ever
Modern stacks are complex. A single "slow page" complaint could be CPU contention, disk I/O saturation, memory pressure, network latency, or a runaway process. Blindly restarting services wastes time and masks the real cause. Good diagnostics tell you the why before you touch anything.
1. top / htop — Your First Look
Everyone knows top, but few use it effectively.
top -c -d 1
What to look at:
- %us — User CPU. High = your app is the problem
- %sy — System CPU. High = kernel overhead (check I/O)
- %wa — I/O wait. Above 5-10% = disk bottleneck
- load average — If consistently above CPU core count, you're overloaded
Install htop for a better view:
htop --sort-key=PERCENT_CPU
Press F5 for tree view to see parent-child process relationships. This instantly shows you if a single worker is spawning hundreds of children.
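If htop isn't installed and you can't add packages mid-incident, you can get a similar tree view from stock tools (pstree comes with the psmisc package on most distros):
# Process tree with PIDs: spot a worker spawning hundreds of children
pstree -p | less
# Plain ps alternative, no extra packages needed
ps -ef --forest | less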
2. vmstat — CPU, Memory, and I/O Snapshot
vmstat 1 10
Runs 10 samples, 1 second apart. Key columns:
| Column | Meaning | Alert Level |
|--------|---------|-------------|
| r | Processes waiting for CPU | > CPU cores |
| b | Processes in uninterruptible sleep (I/O) | > 5 |
| swpd | Swap usage in KB | Growing, or paired with si/so activity |
| si/so | Swap in/out per second | > 0 means RAM pressure |
| wa | % time waiting for I/O | > 10% is a problem |
If r is consistently above your CPU count, you need more CPU or need to find what's consuming it.
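A quick way to answer "what's consuming it" is a per-process CPU ranking; a minimal example with standard ps options:
# Top 10 CPU consumers right now (header plus 10 rows)
ps -eo pid,ppid,user,%cpu,%mem,comm --sort=-%cpu | head -11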
3. iostat — Disk I/O Deep Dive
iostat -x 1 5
The -x flag gives extended stats. Focus on:
- %util — How busy the device is. Above 80% = saturation approaching
- await — Average time for I/O requests in milliseconds. SSD should be <1ms, HDD <10ms
- r/s, w/s — Read/write operations per second
# Check a specific disk
iostat -x sda 1 5
Pro tip: If %util is 100% but await is low, you have sequential I/O. If await is high, you have random I/O or a slow disk.
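Once you know a device is saturated, the next question is which process is driving the I/O. Assuming the sysstat package is installed, pidstat answers that; iotop is an interactive alternative if you have it:
# Per-process disk read/write rates, 5 samples 1 second apart (part of sysstat)
pidstat -d 1 5
# Interactive view showing only processes actually doing I/O
sudo iotop -o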
4. ss / netstat — Network Connection Analysis
# Count connections by state
ss -s
# Show listening sockets and which process owns them
ss -tulpn
# Count connections to a specific port, grouped by state
ss -an | grep :8080 | awk '{print $2}' | sort | uniq -c | sort -rn
High TIME_WAIT count means your app is churning through short-lived connections faster than the OS expires them. Fix: tune net.ipv4.tcp_tw_reuse or use connection pooling.
High CLOSE_WAIT means your app isn't closing sockets properly — that's a code bug.
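To see the full state breakdown, and to apply the tcp_tw_reuse fix mentioned above, something like this works; treat the sysctl change as an experiment to validate, not a blanket default:
# Count TCP connections grouped by state (TIME_WAIT, CLOSE_WAIT, ESTAB, ...)
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
# Allow reuse of TIME_WAIT sockets for new outgoing connections
sudo sysctl -w net.ipv4.tcp_tw_reuse=1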
5. lsof — Find What's Using Resources
# Who has the most open files
lsof | awk '{print $1}' | sort | uniq -c | sort -rn | head -20
# Check open files for a specific process
lsof -p <PID>
# Find what process has a port
lsof -i :8080
If a process is leaking file descriptors, lsof -p <PID> | wc -l will grow over time. Compare across restarts.
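A lightweight way to watch for a leak during an incident is to sample the count and compare it against the process's limit, for example:
# Sample the open file count every 30 seconds (reading /proc is cheaper than lsof)
watch -n 30 "ls /proc/<PID>/fd | wc -l"
# Compare against the per-process limit
grep "Max open files" /proc/<PID>/limits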
6. strace — See Exactly What a Process Is Doing
This is the nuclear option for diagnosing stuck processes.
# Attach to a running process
strace -p <PID> -e trace=network,file
# Time each system call
strace -T -p <PID> 2>&1 | head -50
Look for:
- epoll_wait loops — normal for event-driven servers
- read hanging — blocked on file or network
- Repeated open/close on the same file — inefficient file handling
- futex waits — lock contention between threads
Warning: strace adds overhead (10-100x slower). Use briefly on one process at a time.
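A lower-impact pattern is summary mode, which aggregates call counts and latencies instead of printing every call; stop it after a few seconds with Ctrl-C and it prints the report as it detaches:
# Aggregate syscall counts and time instead of a full trace (-f follows child processes)
strace -c -f -p <PID>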
7. perf top — CPU Profiler Without Code Changes
sudo perf top -p <PID>
Shows which functions are consuming CPU in real time. No recompilation needed. This is how I identified that a regex in a logging library was consuming 40% of CPU on a high-traffic service.
# Record 30 seconds then analyze
sudo perf record -g -p <PID> -- sleep 30
sudo perf report
If you see a lot of time in __libc_malloc or malloc_consolidate, you have allocation pressure — consider an arena allocator or reducing allocations in hot paths.
8. free + /proc/meminfo — Memory Pressure Diagnosis
free -h
cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable|Cached|SwapTotal|SwapFree|Dirty|Slab"
What matters:
- MemAvailable — not MemFree. Available includes reclaimable cache
- Dirty — Pages waiting to be written to disk. High value with slow disk = write I/O bottleneck
- Slab — Kernel memory for data structures. High value means kernel is caching heavily (usually fine)
If MemAvailable is near zero and swap is active, you need more RAM or need to find and kill memory hogs.
# Find memory hogs
ps aux --sort=-%mem | head -15
9. dmesg — Kernel Messages That No App Log Shows
dmesg -T | tail -50
dmesg -T | grep -E "OOM|killed|error|fail"
The OOM killer silently murders processes when RAM runs out. Your app "crashes" with no error in its own logs — because it was killed externally. The kernel log, read with dmesg (or journalctl -k), is the only place this is recorded:
[Tue Apr 1 03:22:17 2025] Out of memory: Kill process 12345 (java) score 892 or sacrifice child
[Tue Apr 1 03:22:17 2025] Killed process 12345 (java) total-vm:8192000kB, anon-rss:7850000kB
Also watch for:
- EXT4-fs error — filesystem corruption
- TCP: out of memory — network socket exhaustion
- nf_conntrack: table full — connection tracking overflow (common in NAT-heavy setups)
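One grep covers those signatures, and if the nf_conntrack module is loaded you can check how full the tracking table is directly:
# Scan for the failure signatures above
dmesg -T | grep -iE "ext4-fs error|tcp: out of memory|nf_conntrack: table full"
# Current vs. maximum tracked connections
cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max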
10. sar — Historical Performance Data
Unlike the above tools that show current state, sar shows historical data — crucial for post-mortems.
# CPU usage for yesterday
sar -u -f /var/log/sa/sa$(date -d yesterday +%d)
# Memory usage history for today
sar -r
# Per-device I/O history for today
sar -d -p
Enable sysstat to collect data automatically:
apt install sysstat
systemctl enable sysstat
systemctl start sysstat
Data is collected every 10 minutes by default. You can replay exactly what happened during an incident window — hours after the fact.
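For example, to pull CPU data for a one-hour incident window from an earlier day's file (sa file names are the day of the month; adjust the date and times to your case):
# CPU usage between 03:00 and 04:00 on the 1st of the month
sar -u -s 03:00:00 -e 04:00:00 -f /var/log/sa/sa01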
The 5-Minute Incident Triage Checklist
When a server is slow:
1. top → Is CPU pegged? Who? (high %wa = I/O problem)
2. vmstat 1 5 → Swap in use? Many processes sleeping?
3. iostat -x 1 3 → Disk saturated?
4. ss -s → Connection count exploding?
5. free -h → Out of RAM?
6. dmesg | tail → Anything killed by OOM?
In 90% of cases, the answer is visible in these 6 commands within 2 minutes.
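If you want the checklist in one place, here is a minimal triage script that runs the commands in order; the name and output format are my own, not a standard tool, and it assumes sysstat is installed for iostat:
#!/usr/bin/env bash
# triage.sh: a 60-second first look at a slow Linux server
set -u
echo "=== Load and top CPU consumers ==="
uptime
ps -eo pid,user,%cpu,%mem,comm --sort=-%cpu | head -6
echo "=== CPU / memory / swap (vmstat) ==="
vmstat 1 5
echo "=== Disk saturation (iostat, needs sysstat) ==="
iostat -x 1 3
echo "=== Socket summary ==="
ss -s
echo "=== Memory ==="
free -h
echo "=== Recent kernel messages (OOM kills, hardware errors) ==="
dmesg -T | tail -20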
Turning Diagnostics Into Income
These aren't just survival skills — they're billable skills. Rate cards for Linux performance consulting range from $150–$400/hr. If you've resolved real incidents using these tools, you have the portfolio to charge for it.
The pattern that works: document your incidents (anonymized), publish the diagnosis and fix, and let the SEO bring clients to you. That's the core loop behind devtocash.
What's Next
In the next post, I'll cover kernel parameter tuning — the sysctl values that give you 20-40% more throughput with zero hardware changes. Subscribe to get notified when it drops.