I've been managing Kubernetes clusters at scale for 8 years. I've seen $50K/month waste from a single misconfigured HPA, clusters that brought down production because of missing resource limits, and deployments that rolled back silently for weeks.
These aren't edge cases. They happen at well-funded startups, mid-size tech companies, and even FAANG teams. Here are the 10 mistakes that hurt the most — and exactly how to fix them.
1. No Resource Requests or Limits
This is the silent killer. Without resource requests, the Kubernetes scheduler has no idea where to place your pods. Without limits, a single buggy service can starve everything else on the node.
```yaml
# BAD — no limits
containers:
- name: api
  image: myapp:latest

# GOOD — explicit requests and limits
containers:
- name: api
  image: myapp:latest
  resources:
    requests:
      memory: "256Mi"
      cpu: "250m"
    limits:
      memory: "512Mi"
      cpu: "500m"
```
Real cost: At 100 pods with no limits, one memory-leaking service can OOM the node and trigger cascading restarts. Fix this first.
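You can also backstop this at the namespace level with a LimitRange, which injects defaults into any container that omits them. A minimal sketch (the namespace name and the values here are placeholders, not from a real cluster):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-payments   # placeholder namespace
spec:
  limits:
  - type: Container
    default:                 # applied as limits when none are set
      memory: "512Mi"
      cpu: "500m"
    defaultRequest:          # applied as requests when none are set
      memory: "256Mi"
      cpu: "250m"
```

Defaults are a safety net, not a substitute: services with unusual footprints still need explicit values.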
2. Misconfigured HPA — Scaling on the Wrong Metric
Horizontal Pod Autoscaler scaling on CPU sounds right, but CPU is a lagging indicator for most web services. If your service is I/O bound (waiting on DB, Redis, external APIs), CPU stays low while latency spikes.
```yaml
# Better: scale on custom metrics or RPS
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
```
Use Prometheus Adapter to expose custom metrics. CPU-only HPA is almost always wrong for production workloads.
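For reference, the Prometheus Adapter rule that would expose an http_requests_per_second pods metric from a standard http_requests_total counter looks roughly like this (a sketch; the metric name and the 2m rate window are assumptions to adapt to your instrumentation):

```yaml
# prometheus-adapter config: derive a per-second rate as a custom pods metric
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```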
3. Storing Secrets in ConfigMaps (or Worse, Environment Variables in YAML)
This is committed to git at thousands of companies right now:
```yaml
# BAD — don't do this
env:
- name: DB_PASSWORD
  value: "supersecretpassword123"
```
Use Kubernetes Secrets at minimum. Better: use External Secrets Operator with HashiCorp Vault or AWS Secrets Manager:
```shell
kubectl create secret generic db-creds \
  --from-literal=password=supersecretpassword123
```

```yaml
# Reference in deployment
env:
- name: DB_PASSWORD
  valueFrom:
    secretKeyRef:
      name: db-creds
      key: password
```
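With External Secrets Operator, the Secret is synced from the backing store instead of created by hand. A sketch, assuming a ClusterSecretStore named vault-backend already exists (all names and the Vault path are placeholders):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-creds
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend      # assumed pre-existing store
    kind: ClusterSecretStore
  target:
    name: db-creds           # the Kubernetes Secret it creates
  data:
  - secretKey: password
    remoteRef:
      key: prod/db           # path in Vault (placeholder)
      property: password
```

The deployment then references db-creds exactly as above, but the plaintext never touches git.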
4. No Pod Disruption Budgets (PDB)
When you run kubectl drain for node maintenance, without a PDB Kubernetes is free to evict every replica of a deployment at once. The site goes down, and you discover this during your next cluster upgrade.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2  # always keep at least 2 pods running
  selector:
    matchLabels:
      app: api
```
Set a PDB for every stateless service that runs at least two replicas. This is a 10-minute fix with catastrophic upside.
5. Using latest Image Tags in Production
The latest tag is non-deterministic. Your pods will drift. A rollback becomes impossible when you can't know what image version each pod is running.
```yaml
# BAD
image: myapp:latest

# GOOD — use immutable tags (git SHA or semver)
image: myapp:sha-abc1234
image: myapp:v2.4.1
```
Enforce this with an admission webhook or OPA Gatekeeper policy that rejects latest in production namespaces.
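A Gatekeeper version of that policy is two objects: a ConstraintTemplate carrying the Rego check and a Constraint scoping it to production. A rough sketch (names and the production namespace are placeholders; note it only catches an explicit :latest tag, not untagged images, which also default to latest):

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdisallowlatest
spec:
  crd:
    spec:
      names:
        kind: K8sDisallowLatest
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8sdisallowlatest
      violation[{"msg": msg}] {
        container := input.review.object.spec.containers[_]
        endswith(container.image, ":latest")
        msg := sprintf("container %v uses the :latest tag", [container.name])
      }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDisallowLatest
metadata:
  name: no-latest-in-prod
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
    namespaces: ["production"]   # placeholder namespace
```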
6. Ignoring Namespace Resource Quotas
Without namespace quotas, a single team's buggy deployment can consume all cluster resources, starving every other team.
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-payments
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    count/pods: "50"
```
Set quotas per namespace. Cost accountability becomes trivial when you track it at namespace level.
7. Not Setting Liveness vs Readiness Probes Correctly
Most teams set liveness probes that are too aggressive. The probe fails during a brief slowdown, Kubernetes restarts the pod, which causes more load, which causes more failures. Death spiral.
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30  # give the app time to start
  periodSeconds: 10
  failureThreshold: 3      # 3 failures before restart
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2      # remove from LB faster
```
Rule: liveness probe checks if the process is alive (restart if dead). Readiness probe checks if it can handle traffic (remove from load balancer if not ready). They are different things.
8. Single Replica Stateful Services in Production
StatefulSets running a single replica are a single point of failure. When its node goes down, your database is offline for minutes while Kubernetes reschedules the pod.
For PostgreSQL: use a HA operator (CloudNativePG, Zalando Postgres Operator) with at least 1 primary + 1 replica. For Redis: use Redis Sentinel or Redis Cluster. For anything stateful: plan for the primary going away.
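With CloudNativePG, for example, the replica topology is a single field. A minimal sketch (the cluster name and storage size are placeholders):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-main
spec:
  instances: 3   # 1 primary + 2 streaming replicas, with automated failover
  storage:
    size: 20Gi
```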
9. Cluster-Admin Bindings Everywhere
Teams hand out cluster-admin because it's easy. Then someone runs a script that accidentally deletes production namespaces. Use RBAC properly:
```yaml
# Give a team access only to their namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-payments-edit
  namespace: team-payments
subjects:
- kind: Group
  name: team-payments
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
```
Audit all cluster-admin bindings today:

```shell
kubectl get clusterrolebindings -o json \
  | jq '.items[] | select(.roleRef.name == "cluster-admin")'
```
10. Not Testing Chaos — No Failover Drills
You don't know if your redundancy works until you test it. Most teams discover their PDB was misconfigured, their readiness probe was wrong, or their multi-AZ setup didn't actually spread pods — during a real outage.
Schedule monthly chaos drills:
- Kill the primary node in a nodegroup
- Delete a random pod from each critical deployment
- Simulate DNS failure for external dependencies
- Fill a node's disk to 95%
Use Chaos Mesh or Litmus Chaos to automate this. The cost of a drill is 30 minutes. The cost of an avoidable outage is measured in revenue.
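The second drill on the list maps directly onto a Chaos Mesh PodChaos experiment. A sketch, assuming your critical deployment's pods carry the label app: api:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-one-api-pod
spec:
  action: pod-kill
  mode: one              # kill a single randomly chosen matching pod
  selector:
    labelSelectors:
      app: api           # placeholder label for the target deployment
```

If the PDB, probes, and replica count from the earlier sections are right, this experiment should be a non-event.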
Key Takeaways
- Resource requests/limits are mandatory — no exceptions
- Use immutable image tags; ban :latest in production namespaces
- PodDisruptionBudgets take 10 minutes to set and save hours of downtime
- RBAC: principle of least privilege, always
- Test your failover before it fails you
Conclusion
Every one of these mistakes has a "but we'll fix it later" story behind it. Later becomes a 3am incident. Fix them in the order I listed — resource limits, HPA metrics, and PDBs alone will reduce your on-call burden by 40%.
Want a full checklist? Drop a comment and I'll email it to you.
Published: 2026-04-11 | Category: DevOps | Read time: 8 min