
10 Kubernetes Mistakes That Cost Companies Millions (And How to Fix Them)

Real Kubernetes misconfigurations I've seen destroy uptime and budgets at scale. Avoid these 10 mistakes and save your team thousands per month.

April 11, 2026 · 6 min read
#kubernetes #k8s #infrastructure #devops #sre #cost-optimization

I've been managing Kubernetes clusters at scale for 8 years. I've seen $50K/month waste from a single misconfigured HPA, clusters that brought down production because of missing resource limits, and deployments that rolled back silently for weeks.

These aren't edge cases. They happen at well-funded startups, mid-size tech companies, and even FAANG teams. Here are the 10 mistakes that hurt the most — and exactly how to fix them.

1. No Resource Requests or Limits

This is the silent killer. Without resource requests, the Kubernetes scheduler has no idea where to place your pods. Without limits, a single buggy service can starve everything else on the node.

# BAD — no limits
containers:
  - name: api
    image: myapp:latest

# GOOD — explicit requests and limits
containers:
  - name: api
    image: myapp:latest
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"

Real cost: At 100 pods with no limits, one memory-leaking service can OOM the node and trigger cascading restarts. Fix this first.
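As a backstop, a LimitRange applies default requests and limits to any container that forgets to set its own. A minimal sketch, assuming a namespace named production:

# Defaults for containers that specify no resources
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - type: Container
    defaultRequest:     # used when a container sets no requests
      cpu: "250m"
      memory: "256Mi"
    default:            # used when a container sets no limits
      cpu: "500m"
      memory: "512Mi"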


2. Misconfigured HPA — Scaling on the Wrong Metric

Scaling the Horizontal Pod Autoscaler on CPU sounds right, but CPU is a lagging indicator for most web services. If your service is I/O bound (waiting on a DB, Redis, or external APIs), CPU stays low while latency spikes.

# Better: scale on custom metrics or RPS
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"

Use Prometheus Adapter to expose custom metrics. CPU-only HPA is almost always wrong for production workloads.
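For reference, a sketch of a Prometheus Adapter rule that turns a request counter into the per-second pod metric the HPA above consumes. The metric and label names are assumptions; match them to your own instrumentation:

# Adapter rule: expose http_requests_total as http_requests_per_second
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'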


3. Storing Secrets in ConfigMaps (or Worse, Environment Variables in YAML)

This is committed to git at thousands of companies right now:

# BAD — don't do this
env:
  - name: DB_PASSWORD
    value: "supersecretpassword123"

Use Kubernetes Secrets at minimum. Better: use External Secrets Operator with HashiCorp Vault or AWS Secrets Manager:

kubectl create secret generic db-creds \
  --from-literal=password=supersecretpassword123

# Reference in deployment
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: db-creds
        key: password
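With External Secrets Operator, the secret value never touches your manifests at all. A sketch, assuming a ClusterSecretStore named vault-backend and a Vault path prod/db (both illustrative):

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-creds
spec:
  refreshInterval: 1h          # re-sync from the backend hourly
  secretStoreRef:
    name: vault-backend        # assumed ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: db-creds             # the Kubernetes Secret to create
  data:
  - secretKey: password
    remoteRef:
      key: prod/db             # illustrative Vault path
      property: password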

4. No Pod Disruption Budgets (PDBs)

Without a PDB, running kubectl drain for node maintenance can evict ALL replicas of a deployment simultaneously. Site goes down. You discover this during your next cluster upgrade.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2    # always keep at least 2 pods running
  selector:
    matchLabels:
      app: api

Set a PDB for every stateless service that runs at least 2 replicas. It's a 10-minute fix that removes a catastrophic failure mode.


5. Using latest Image Tags in Production

The latest tag is non-deterministic, so your pods drift. Rollbacks become impossible when you can't tell which image version each pod is actually running.

# BAD
image: myapp:latest

# GOOD — use immutable tags (git SHA or semver)
image: myapp:sha-abc1234
image: myapp:v2.4.1

Enforce this with an admission webhook or OPA Gatekeeper policy that rejects latest in production namespaces.
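For example, if you install the gatekeeper-library K8sDisallowedTags template, a constraint along these lines (namespace name illustrative) blocks latest:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDisallowedTags
metadata:
  name: no-latest-in-prod
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
    namespaces: ["production"]
  parameters:
    tags: ["latest"]           # reject any container image tagged :latest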


6. Ignoring Namespace Resource Quotas

Without namespace quotas, a single team's buggy deployment can consume all cluster resources, starving every other team.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-payments
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    count/pods: "50"

Set quotas per namespace. Cost accountability becomes trivial when you track spend at the namespace level.
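To check a team's current consumption against its quota:

kubectl describe quota team-quota -n team-payments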


7. Not Setting Liveness vs Readiness Probes Correctly

Most teams set liveness probes that are too aggressive. The probe fails during a brief slowdown, Kubernetes restarts the pod, which causes more load, which causes more failures. Death spiral.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # give the app time to start
  periodSeconds: 10
  failureThreshold: 3       # 3 failures before restart

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2       # remove from LB faster

Rule: liveness probe checks if the process is alive (restart if dead). Readiness probe checks if it can handle traffic (remove from load balancer if not ready). They are different things.


8. Single Replica Stateful Services in Production

StatefulSets running a single replica are a single point of failure. A node goes down and your database pod is unavailable for minutes while Kubernetes reschedules it and reattaches its volume.

For PostgreSQL: use a HA operator (CloudNativePG, Zalando Postgres Operator) with at least 1 primary + 1 replica. For Redis: use Redis Sentinel or Redis Cluster. For anything stateful: plan for the primary going away.
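For example, a minimal CloudNativePG cluster sketch (name and sizing illustrative); the operator handles replication and failover:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-main
spec:
  instances: 3       # 1 primary + 2 replicas
  storage:
    size: 20Gi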


9. Cluster-Admin Bindings Everywhere

Teams hand out cluster-admin because it's easy. Then someone runs a script that accidentally deletes production namespaces. Use RBAC properly:

# Give a team access only to their namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-payments-edit
  namespace: team-payments
subjects:
- kind: Group
  name: team-payments
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io

Audit all cluster-admin bindings today: kubectl get clusterrolebindings -o json | jq '.items[] | select(.roleRef.name=="cluster-admin")'
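Then verify the binding scopes access the way you intended (user and group names are illustrative):

kubectl auth can-i delete pods -n team-payments \
  --as=dev@example.com --as-group=team-payments   # expect: yes
kubectl auth can-i delete namespaces \
  --as=dev@example.com --as-group=team-payments   # expect: no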


10. Not Testing Chaos — No Failover Drills

You don't know if your redundancy works until you test it. Most teams discover their PDB was misconfigured, their readiness probe was wrong, or their multi-AZ setup didn't actually spread pods — during a real outage.

Schedule monthly chaos drills:

  1. Terminate a random node in a node group
  2. Delete a random pod from each critical deployment
  3. Simulate DNS failure for external dependencies
  4. Fill a node's disk to 95%

Use Chaos Mesh or LitmusChaos to automate this. The cost of a drill is 30 minutes. The cost of an avoidable outage is measured in revenue.
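As a starting point, a Chaos Mesh PodChaos sketch that kills one randomly selected api pod (namespace and labels are assumptions):

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-random-api-pod
spec:
  action: pod-kill
  mode: one                    # pick one matching pod at random
  selector:
    namespaces: ["production"]
    labelSelectors:
      app: api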


Key Takeaways

  • Resource requests/limits are mandatory — no exceptions
  • Use immutable image tags; ban :latest in production namespaces
  • PodDisruptionBudgets take 10 minutes to set and save hours of downtime
  • RBAC: principle of least privilege, always
  • Test your failover before it fails you

Conclusion

Every one of these mistakes has a "but we'll fix it later" story behind it. Later becomes a 3am incident. Fix them in the order I listed — resource limits, HPA metrics, and PDBs alone will reduce your on-call burden by 40%.

Want a full checklist? Drop a comment and I'll email it to you.

