I've been managing Kubernetes clusters at scale for 8 years. I've seen $50K/month waste from a single misconfigured HPA, clusters that brought down production because of missing resource limits, and deployments that rolled back silently for weeks.
These aren't edge cases. They happen at well-funded startups, mid-size tech companies, and even FAANG teams. Here are the 10 mistakes that hurt the most — and exactly how to fix them.
1. No Resource Requests or Limits
This is the silent killer. Without resource requests, the Kubernetes scheduler has no idea where to place your pods. Without limits, a single buggy service can starve everything else on the node.
```yaml
# BAD — no limits
containers:
- name: api
  image: myapp:latest

# GOOD — explicit requests and limits
containers:
- name: api
  image: myapp:latest
  resources:
    requests:
      memory: "256Mi"
      cpu: "250m"
    limits:
      memory: "512Mi"
      cpu: "500m"
```
Real cost: At 100 pods with no limits, one memory-leaking service can OOM the node and trigger cascading restarts. Fix this first.
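You can also backstop this at the namespace level with a LimitRange, which injects defaults into any container that omits them. A minimal sketch (the namespace name and the values here are placeholders, not from a real cluster):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-payments   # placeholder namespace
spec:
  limits:
  - type: Container
    default:                 # applied as limits when none are set
      memory: "512Mi"
      cpu: "500m"
    defaultRequest:          # applied as requests when none are set
      memory: "256Mi"
      cpu: "250m"
```

Defaults are a safety net, not a substitute: services with unusual footprints still need explicit values.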
2. Misconfigured HPA — Scaling on the Wrong Metric
Horizontal Pod Autoscaler scaling on CPU sounds right, but CPU is a lagging indicator for most web services. If your service is I/O bound (waiting on DB, Redis, external APIs), CPU stays low while latency spikes.
```yaml
# Better: scale on custom metrics or RPS
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
```
Use Prometheus Adapter to expose custom metrics. CPU-only HPA is almost always wrong for production workloads.
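For reference, the Prometheus Adapter rule that would expose an http_requests_per_second pods metric from a standard http_requests_total counter looks roughly like this (a sketch; the metric name and the 2m rate window are assumptions to adapt to your instrumentation):

```yaml
# prometheus-adapter config: derive a per-second rate as a custom pods metric
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```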
3. Storing Secrets in ConfigMaps (or Worse, Environment Variables in YAML)
This is committed to git at thousands of companies right now:
```yaml
# BAD — don't do this
env:
- name: DB_PASSWORD
  value: "supersecretpassword123"
```
Use Kubernetes Secrets at minimum. Better: use External Secrets Operator with HashiCorp Vault or AWS Secrets Manager:
```shell
kubectl create secret generic db-creds \
  --from-literal=password=supersecretpassword123
```

```yaml
# Reference in deployment
env:
- name: DB_PASSWORD
  valueFrom:
    secretKeyRef:
      name: db-creds
      key: password
```
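With External Secrets Operator, the Secret is synced from the backing store instead of created by hand. A sketch, assuming a ClusterSecretStore named vault-backend already exists (all names and the Vault path are placeholders):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-creds
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend      # assumed pre-existing store
    kind: ClusterSecretStore
  target:
    name: db-creds           # the Kubernetes Secret it creates
  data:
  - secretKey: password
    remoteRef:
      key: prod/db           # path in Vault (placeholder)
      property: password
```

The deployment then references db-creds exactly as above, but the plaintext never touches git.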
4. No Pod Disruption Budgets (PDB)
When you run kubectl drain for node maintenance, without a PDB Kubernetes is free to evict every replica of a deployment at once. The site goes down, and you discover this during your next cluster upgrade.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2  # always keep at least 2 pods running
  selector:
    matchLabels:
      app: api
```
Set a PDB for every stateless service that runs at least two replicas. This is a 10-minute fix with catastrophic upside.
5. Using latest Image Tags in Production
The latest tag is non-deterministic. Your pods will drift. A rollback becomes impossible when you can't know what image version each pod is running.
```yaml
# BAD
image: myapp:latest

# GOOD — use immutable tags (git SHA or semver)
image: myapp:sha-abc1234
image: myapp:v2.4.1
```
Enforce this with an admission webhook or OPA Gatekeeper policy that rejects latest in production namespaces.
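A Gatekeeper version of that policy is two objects: a ConstraintTemplate carrying the Rego check and a Constraint scoping it to production. A rough sketch (names and the production namespace are placeholders; note it only catches an explicit :latest tag, not untagged images, which also default to latest):

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdisallowlatest
spec:
  crd:
    spec:
      names:
        kind: K8sDisallowLatest
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8sdisallowlatest
      violation[{"msg": msg}] {
        container := input.review.object.spec.containers[_]
        endswith(container.image, ":latest")
        msg := sprintf("container %v uses the :latest tag", [container.name])
      }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDisallowLatest
metadata:
  name: no-latest-in-prod
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
    namespaces: ["production"]   # placeholder namespace
```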
6. Ignoring Namespace Resource Quotas
Without namespace quotas, a single team's buggy deployment can consume all cluster resources, starving every other team.
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-payments
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    count/pods: "50"
```
Set quotas per namespace. Cost accountability becomes trivial when you track it at namespace level.
7. Not Setting Liveness vs Readiness Probes Correctly
Most teams set liveness probes that are too aggressive. The probe fails during a brief slowdown, Kubernetes restarts the pod, which causes more load, which causes more failures. Death spiral.
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30  # give the app time to start
  periodSeconds: 10
  failureThreshold: 3      # 3 failures before restart
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2      # remove from LB faster
```
Rule: liveness probe checks if the process is alive (restart if dead). Readiness probe checks if it can handle traffic (remove from load balancer if not ready). They are different things.
8. Single Replica Stateful Services in Production
StatefulSets running a single replica are a single point of failure. When its node goes down, your database is offline for minutes while Kubernetes reschedules the pod.
For PostgreSQL: use a HA operator (CloudNativePG, Zalando Postgres Operator) with at least 1 primary + 1 replica. For Redis: use Redis Sentinel or Redis Cluster. For anything stateful: plan for the primary going away.
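With CloudNativePG, for example, the replica topology is a single field. A minimal sketch (the cluster name and storage size are placeholders):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-main
spec:
  instances: 3   # 1 primary + 2 streaming replicas, with automated failover
  storage:
    size: 20Gi
```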
9. Cluster-Admin Bindings Everywhere
Teams hand out cluster-admin because it's easy. Then someone runs a script that accidentally deletes production namespaces. Use RBAC properly:
```yaml
# Give a team access only to their namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-payments-edit
  namespace: team-payments
subjects:
- kind: Group
  name: team-payments
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
```
Audit all cluster-admin bindings today:

```shell
kubectl get clusterrolebindings -o json \
  | jq '.items[] | select(.roleRef.name == "cluster-admin")'
```
10. Not Testing Chaos — No Failover Drills
You don't know if your redundancy works until you test it. Most teams discover their PDB was misconfigured, their readiness probe was wrong, or their multi-AZ setup didn't actually spread pods — during a real outage.
Schedule monthly chaos drills:
- Kill the primary node in a nodegroup
- Delete a random pod from each critical deployment
- Simulate DNS failure for external dependencies
- Fill a node's disk to 95%
Use Chaos Mesh or Litmus Chaos to automate this. The cost of a drill is 30 minutes. The cost of an avoidable outage is measured in revenue.
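The second drill on the list maps directly onto a Chaos Mesh PodChaos experiment. A sketch, assuming your critical deployment's pods carry the label app: api:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-one-api-pod
spec:
  action: pod-kill
  mode: one              # kill a single randomly chosen matching pod
  selector:
    labelSelectors:
      app: api           # placeholder label for the target deployment
```

If the PDB, probes, and replica count from the earlier sections are right, this experiment should be a non-event.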
Key Takeaways
- Resource requests/limits are mandatory — no exceptions
- Use immutable image tags; ban :latest in production namespaces
- PodDisruptionBudgets take 10 minutes to set and save hours of downtime
- RBAC: principle of least privilege, always
- Test your failover before it fails you
Conclusion
Every one of these mistakes has a "but we'll fix it later" story behind it. Later becomes a 3am incident. Fix them in the order I listed — resource limits, HPA metrics, and PDBs alone will reduce your on-call burden by 40%.
Want a full checklist? Drop a comment and I'll email it to you.
Published: 2026-04-11 | Category: DevOps | Read time: 8 min