Introduction
Your Kubernetes cluster hums along until a traffic spike hits. Pods crash under load, latency skyrockets, and users see 502 errors. Kubernetes autoscaling solves this. But with three different mechanisms--Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), and KEDA--knowing which to use and when isn't always obvious.
In this guide, you'll learn how each autoscaler works, see real configuration examples you can copy and deploy today, and walk away with a decision framework for your own workloads.
Horizontal Pod Autoscaler (HPA)
HPA adjusts the number of pod replicas based on observed metrics like CPU or memory. It runs a simple control loop, every 15 seconds by default, and computes the replica count it wants as:
desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
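To see the formula at work, here is a minimal Python sketch. The function name `desired_replicas` is made up for illustration, and the 10% tolerance band mirrors the controller's default setting rather than anything configured in this guide.

```python
import math

def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Apply the HPA formula to a single metric."""
    ratio = current_value / target_value
    # Assumption: default tolerance of 0.1. Inside this band the
    # controller leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

print(desired_replicas(4, 90, 70))  # 4 * 90/70 = 5.14, rounded up to 6
print(desired_replicas(4, 74, 70))  # ratio 1.06 is inside the tolerance, stays 4
print(desired_replicas(6, 30, 70))  # 6 * 30/70 = 2.57, rounded up to 3
```

Rounding up means HPA errs on the side of one pod too many rather than one too few.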
Prerequisites
First, install Metrics Server:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl top nodes
Then set resource requests on your pods--HPA needs these to calculate utilization percentages.
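Utilization here means usage as a percentage of the request, not of the node or the limit. The sketch below shows the arithmetic for CPU; the helper names and the usage figures are hypothetical, and it assumes every pod has the same request.

```python
def parse_cpu(quantity: str) -> int:
    """Convert a CPU quantity such as "200m" or "1" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)

def average_utilization(usages: list[str], request: str) -> float:
    """Mean usage across pods as a percentage of the per-pod request."""
    request_m = parse_cpu(request)
    total_usage_m = sum(parse_cpu(u) for u in usages)
    return 100 * total_usage_m / (request_m * len(usages))

# Two pods using 180m and 140m against a 200m request
print(average_utilization(["180m", "140m"], "200m"))  # 80.0
```

Without a request there is no denominator, which is why HPA cannot compute a utilization target for pods that omit one.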
Basic HPA Configuration
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: app
          image: myapp:latest
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
---
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
Apply and watch:
kubectl apply -f deployment.yaml
kubectl apply -f hpa.yaml
kubectl get hpa -w
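This configuration targets an average CPU utilization of 70% and an average memory utilization of 80%, both measured against the pods' resource requests, and keeps the replica count between 2 and 10. When the two metrics disagree, HPA uses the higher of the two proposed replica counts.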
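With more than one metric, HPA evaluates each one separately, takes the largest proposal, and clamps it to the configured bounds. A simplified sketch, with a hypothetical function name and made-up readings, ignoring the tolerance band:

```python
import math

def replicas_for_metrics(current: int, metrics: list[tuple[float, float]],
                         min_replicas: int, max_replicas: int) -> int:
    """metrics holds (current_utilization, target_utilization) pairs."""
    proposals = [math.ceil(current * cur / target) for cur, target in metrics]
    # The metric asking for the most replicas wins, within min/max bounds.
    return max(min_replicas, min(max_replicas, max(proposals)))

# 4 replicas running; CPU at 95% (target 70), memory at 60% (target 80)
print(replicas_for_metrics(4, [(95, 70), (60, 80)], 2, 10))  # 6
```

Here CPU asks for 6 replicas and memory for 3, so the deployment grows to 6.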
Custom Metrics HPA
CPU and memory work for basic workloads. Real applications need application-level metrics. Scale on anything Prometheus measures:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
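This requires an adapter that serves the custom metrics API, such as the Prometheus Adapter, configured to expose http_requests_per_second for these pods. The pattern is powerful: you can scale on queue depth, request latency, or any other custom metric.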
HPA Behavior Tuning
Prevent flapping with stabilization windows and rate policies, set under spec.behavior of the HPA:
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 30
        periodSeconds: 60
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
    selectPolicy: Max
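Scale-down acts on the highest replica recommendation from the past 5 minutes, so a brief dip in load removes nothing, and then sheds at most 30% of the pods per minute. Scale-up has no stabilization window: every 15 seconds it may add whichever is larger, 100% of the current replicas or 4 pods.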
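The limits these settings impose can be worked out by hand. The sketch below hard-codes the values from the behavior block; the function names are hypothetical, and the real controller also tracks changes made earlier within a period, which this simplification ignores.

```python
import math

def scale_up_limit(current: int) -> int:
    """Most replicas allowed after one 15 s period (selectPolicy: Max)."""
    by_percent = math.ceil(current * (100 + 100) / 100)  # Percent policy: +100%
    by_pods = current + 4                                # Pods policy: +4
    return max(by_percent, by_pods)

def scale_down_floor(current: int) -> int:
    """Fewest replicas allowed after one 60 s period (Percent: 30)."""
    return current - (current * 30) // 100

def stabilized(recommendations: list[int]) -> int:
    """Scale-down stabilization: act on the highest recent recommendation."""
    return max(recommendations)

print(scale_up_limit(2))         # max(4, 6) = 6: the +4 pods policy wins
print(scale_up_limit(10))        # max(20, 14) = 20: the 100% policy wins
print(scale_down_floor(10))      # 7
print(stabilized([8, 5, 6]))     # 8
```

The Pods policy matters most for small deployments, where doubling 2 replicas would add too little capacity during a spike.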
Vertical Pod Autoscaler (VPA)
HPA adds more pods. VPA makes existing pods bigger. Use VPA when your workload can't easily be replicated (stateful apps, databases) or when it is memory-bound and crashes under pressure instead of slowing down.
VPA is a separate component. Install it:
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
VPA Update Modes
| Mode | Behavior |
|---|---|
| Off | Only recommends, never applies changes |
| Initial | Sets resources only on pod creation |
| Recreate | Evicts and recreates pods with new resources |
| Auto | Currently same as Recreate |
Start with Off mode to see what VPA recommends before letting it make changes:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2"
          memory: "2Gi"
Check recommendations after a few days:
kubectl describe vpa myapp-vpa
You'll see lowerBound, target, and upperBound values. Once you're confident in the recommendations, switch to Auto mode.
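One way to read those three values is to compare them with the request you currently set. The sketch below is an illustration only: the function name and the numbers are hypothetical, values are in CPU millicores, and treating the bounds as a "safe range" is a simplification of how VPA decides whether a pod needs updating.

```python
def review_request(request_m: int, lower_m: int, target_m: int, upper_m: int) -> str:
    """Compare a current CPU request with a VPA recommendation (millicores)."""
    if request_m < lower_m:
        return f"under-provisioned: raise toward {target_m}m"
    if request_m > upper_m:
        return f"over-provisioned: lower toward {target_m}m"
    return "within recommended range"

# Hypothetical recommendation: lowerBound 150m, target 250m, upperBound 600m
print(review_request(100, 150, 250, 600))   # under-provisioned: raise toward 250m
print(review_request(1000, 150, 250, 600))  # over-provisioned: lower toward 250m
print(review_request(300, 150, 250, 600))   # within recommended range
```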
Important: VPA in Recreate/Auto mode will evict pods to apply changes. Ensure your application handles graceful shutdowns and you have proper PodDisruptionBudgets in place.
KEDA: Event-Driven Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) scales pods based on external events--message queues, database changes, cron schedules. Its killer feature? It can scale to zero pods when there's no work.
Install KEDA:
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
RabbitMQ Queue Scaling
Scale workers based on queue depth:
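apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: rabbitmq-scaler
spec:
  scaleTargetRef:
    name: worker-deployment
  minReplicaCount: 0
  maxReplicaCount: 30
  triggers:
    - type: rabbitmq
      metadata:
        queueName: processing-queue
        mode: QueueLength   # replaces the deprecated queueLength field
        value: "50"         # target number of messages per replica
        # Demo only: in production, pass the connection string through
        # hostFromEnv or a TriggerAuthentication instead of inlining credentials.
        host: amqp://user:password@rabbitmq:5672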
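With this trigger KEDA aims for roughly 50 queued messages per worker: the first message wakes the deployment from zero, and a backlog of 500 calls for about 10 workers, capped at 30. Once the queue has stayed empty for the cooldown period (300 seconds by default), KEDA scales back to zero, so no idle pods waste resources.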
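As a rough estimate of where the worker count settles, assuming KEDA hands the queue length to an HPA as an average-value target of 50 messages per worker, the arithmetic looks like this. The function name is hypothetical, and the cooldown delay before scaling to zero is not modeled.

```python
import math

def worker_replicas(queue_length: int, target_per_replica: int = 50,
                    min_replicas: int = 0, max_replicas: int = 30) -> int:
    """Estimate the steady-state worker count for a given backlog."""
    if queue_length <= 0:
        return min_replicas  # empty queue: back to the minimum (after cooldown)
    wanted = math.ceil(queue_length / target_per_replica)
    # Any backlog needs at least one worker; never exceed the maximum.
    return max(1, min_replicas, min(max_replicas, wanted))

for backlog in (0, 1, 51, 400, 5000):
    print(backlog, worker_replicas(backlog))
# 0 0
# 1 1
# 51 2
# 400 8
# 5000 30
```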
Cron-Based Scaling
Add a cron trigger to a ScaledObject's spec to schedule scaling for predictable patterns:
triggers:
  - type: cron
    metadata:
      timezone: Asia/Jakarta
      start: "30 8 * * 1-5"
      end: "0 18 * * 1-5"
      desiredReplicas: "10"
Scale to 10 replicas during business hours, back to minimum outside. KEDA supports 50+ scalers including Kafka, PostgreSQL, Prometheus, and cloud-specific triggers like AWS SQS and GCP Pub/Sub.
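The two cron expressions define a window from 08:30 to 18:00, Monday through Friday, in Jakarta time. A small sketch of that window check follows; the function name is hypothetical, `minimum` stands in for the ScaledObject's minReplicaCount, and Jakarta is modeled as a fixed UTC+7 offset, which holds because the zone has no daylight saving time.

```python
from datetime import datetime, time, timedelta, timezone

JAKARTA = timezone(timedelta(hours=7))  # assumption: fixed UTC+7, no DST

def cron_replicas(now: datetime, desired: int = 10, minimum: int = 0) -> int:
    """Return the replica count the cron trigger asks for at a given instant."""
    local = now.astimezone(JAKARTA)
    is_weekday = local.weekday() < 5                      # Mon=0 .. Fri=4
    in_hours = time(8, 30) <= local.time() < time(18, 0)  # start inclusive, end exclusive
    return desired if is_weekday and in_hours else minimum

# Monday 2024-01-08, 02:00 UTC is 09:00 in Jakarta
print(cron_replicas(datetime(2024, 1, 8, 2, 0, tzinfo=timezone.utc)))   # 10
# Monday 2024-01-08, 12:00 UTC is 19:00 in Jakarta
print(cron_replicas(datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc)))  # 0
```

Writing the window out this way is a quick check that the start and end expressions cover the hours you intended before deploying the trigger.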
Decision Framework: Which Autoscaler When?
| Scenario | Use | Why |
|---|---|---|
| Stateless web API, CPU/memory spikes | HPA | Built-in, simple, works with standard metrics |
| Stateful app, memory-bound | VPA | Can't easily replicate; resize instead |
| Queue workers, event-driven processing | KEDA | Scale to zero, event triggers, 50+ integrations |
| Database (PostgreSQL, MySQL) | VPA | Stateful, resizing safer than replicating |
| Mixed: API + background workers | HPA + KEDA | HPA for the API tier, KEDA for event-driven workers |
| Batch/cron jobs on Kubernetes | KEDA | Cron triggers with scale-to-zero between runs |
| Microservices with unpredictable traffic | HPA + custom metrics | Scale on request rate or latency, not just CPU |
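Pro tip: You can combine them. Use HPA for horizontal scaling and VPA in Off mode for resource recommendations, but don't let VPA actively resize pods on the same CPU or memory metric that HPA scales on, or the two will work against each other. KEDA also builds on HPA: each ScaledObject creates and manages an HPA for its target, so use KEDA triggers in place of a hand-written HPA for that workload, not alongside one.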
Conclusion
Start with HPA for stateless services: it's built in, battle-tested, and covers most use cases with nothing to install beyond Metrics Server. Add VPA in recommendation mode to discover over- or under-provisioned pods. Reach for KEDA when you need event-driven scaling or scale-to-zero for cost efficiency.
The key takeaway: no autoscaler replaces proper load testing. Know your application's limits before you let Kubernetes decide them. Test your HPA configurations under real traffic, validate VPA recommendations in staging, and ensure KEDA scalers respond within your latency budget.
Once configured correctly, Kubernetes autoscaling transforms your cluster from a static deployment into a self-regulating system that adapts to demand automatically. Your users won't notice the difference--but your cloud bill and on-call stress levels will.