Kubernetes Pod Autoscaling: HPA, VPA, and KEDA Explained

Introduction

Your Kubernetes cluster hums along until a traffic spike hits. Pods crash under load, latency skyrockets, and users see 502 errors. Kubernetes autoscaling solves this. But with three different mechanisms--Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), and KEDA--knowing which to use and when isn't always obvious.

In this guide, you'll learn how each autoscaler works, see real configuration examples you can copy and deploy today, and walk away with a decision framework for your own workloads.

Horizontal Pod Autoscaler (HPA)

HPA adjusts the number of pod replicas based on observed metrics like CPU or memory. Its control loop runs every 15 seconds by default and applies this formula:

desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
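
To make the formula concrete, here is a minimal Python sketch. The function name and sample numbers are illustrative, and the 10% tolerance is an assumption that mirrors the controller's default setting; your cluster may be configured differently.

```python
import math

def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Apply the HPA formula; skip scaling when the ratio is within tolerance."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

print(desired_replicas(4, 90, 70))  # 4 * 90/70 = 5.14..., rounded up to 6
print(desired_replicas(4, 72, 70))  # ratio is about 1.03, inside tolerance: stays at 4
print(desired_replicas(6, 35, 70))  # half the target load: 3
```

Rounding up means HPA errs on the side of extra capacity, and the tolerance band keeps small metric wobbles from triggering a resize.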

Prerequisites

First, install Metrics Server:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl top nodes

Then set resource requests on your pods--HPA needs these to calculate utilization percentages.
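
As a rough illustration of why requests matter, the sketch below computes a utilization percentage the way a Utilization target is interpreted: total usage divided by total requests. The function name and usage readings are hypothetical.

```python
def cpu_utilization_percent(usage_millicores: list[int], request_millicores: int) -> int:
    """Total usage across pods as a whole-number percentage of total requests."""
    total_usage = sum(usage_millicores)
    total_requests = request_millicores * len(usage_millicores)
    return total_usage * 100 // total_requests

# Three pods, each requesting 200m CPU (hypothetical usage readings).
print(cpu_utilization_percent([180, 140, 160], 200))  # 80
```

Without a request there is no denominator, so a Utilization target cannot be evaluated for that container.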

Basic HPA Configuration

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
      - name: app
        image: myapp:latest
        resources:
          requests:
            cpu: "200m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
---
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

Apply and watch:

kubectl apply -f deployment.yaml
kubectl apply -f hpa.yaml
kubectl get hpa -w

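This configuration scales between 2 and 10 replicas, aiming for average CPU utilization around 70% and memory around 80% of the requested amounts. With multiple metrics, HPA computes a replica count for each one and uses the largest.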
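
A small sketch of how two metrics combine, assuming the same targets and replica bounds as the manifest above. The helper name and the utilization readings are made up for illustration, and the tolerance check is left out for brevity.

```python
import math

def replicas_for_metrics(current: int, readings: dict[str, tuple[float, float]],
                         min_replicas: int, max_replicas: int) -> int:
    """readings maps metric name -> (current utilization, target utilization)."""
    proposals = [math.ceil(current * now / target) for now, target in readings.values()]
    # The largest proposal wins, then it is clamped to the configured bounds.
    return max(min_replicas, min(max_replicas, max(proposals)))

# CPU is hot (90% vs 70% target) while memory is under target (60% vs 80%).
print(replicas_for_metrics(4, {"cpu": (90, 70), "memory": (60, 80)}, 2, 10))  # 6
```

Here CPU asks for 6 replicas and memory for 3, so the workload scales to 6. A memory target therefore never holds back a scale-up that CPU demands.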

Custom Metrics HPA

CPU and memory work for basic workloads. Real applications need application-level metrics. Scale on anything Prometheus measures:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"

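This requires a custom metrics adapter such as the Prometheus Adapter, configured to expose http_requests_per_second through the custom metrics API. The pattern is powerful--scale on queue depth, request latency, or any custom metric.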

HPA Behavior Tuning

Prevent flapping with stabilization windows:

  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 30
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max

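Scale-down acts on the highest recommendation from the last 5 minutes and removes at most 30% of pods per minute. Scale-up is aggressive: with no stabilization window, each 15-second period allows whichever is larger, 100% growth or +4 pods.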
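
The effect of these policies can be approximated in a few lines. This is a simplified sketch, not the controller's actual code: it treats the current replica count as the start of the period, and the function names are hypothetical.

```python
def scale_up_limit(current: int, percent: int = 100, pods: int = 4) -> int:
    """Largest replica count one 15-second period allows (selectPolicy: Max)."""
    by_percent = -(-current * (100 + percent) // 100)  # ceiling division
    by_pods = current + pods
    return max(by_percent, by_pods)

def scale_down_limit(current: int, percent: int = 30) -> int:
    """Smallest replica count one 60-second period allows."""
    return current * (100 - percent) // 100

print(scale_up_limit(3))     # max(6, 7) = 7: the Pods policy wins at small sizes
print(scale_up_limit(10))    # max(20, 14) = 20: the Percent policy wins at larger sizes
print(scale_down_limit(10))  # 7
```

The Pods policy helps small deployments grow quickly, while the Percent policy takes over once doubling exceeds +4.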

Vertical Pod Autoscaler (VPA)

HPA adds more pods. VPA makes existing pods bigger. Use VPA when your workload can't easily replicate (stateful apps, databases) or when memory-bound pods crash instead of slowing down.

VPA is a separate component. Install it:

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

VPA Update Modes

Mode	Behavior
Off	Only recommends, never applies changes
Initial	Sets resources only on pod creation
Recreate	Evicts and recreates pods with new resources
Auto	Currently the same as Recreate

Start with Off mode to see what VPA recommends before letting it make changes:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      maxAllowed:
        cpu: "2"
        memory: "2Gi"

Check recommendations after a few days:

kubectl describe vpa myapp-vpa

You'll see lowerBound, target, and upperBound values. Once you're confident in the recommendations, switch to Auto mode.

Important: VPA in Recreate/Auto mode will evict pods to apply changes. Ensure your application handles graceful shutdowns and you have proper PodDisruptionBudgets in place.

KEDA: Event-Driven Autoscaling

KEDA (Kubernetes Event-Driven Autoscaling) scales pods based on external events--message queues, database changes, cron schedules. Its killer feature? It can scale to zero pods when there's no work.

Install KEDA:

helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace

RabbitMQ Queue Scaling

Scale workers based on queue depth:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: rabbitmq-scaler
spec:
  scaleTargetRef:
    name: worker-deployment
  minReplicaCount: 0
  maxReplicaCount: 30
  triggers:
  - type: rabbitmq
    metadata:
      queueName: processing-queue
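      mode: QueueLength  # replaces the deprecated queueLength field in KEDA 2.x
      value: "50"        # target messages per replica
      # Demo only: prefer hostFromEnv or a TriggerAuthentication over inline credentials
      host: amqp://user:password@rabbitmq:5672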

KEDA activates the deployment as soon as messages arrive, then targets about 50 messages per replica, so 500 queued messages call for 10 workers. At zero messages, it scales to zero--no idle pods wasting resources.
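
As a back-of-the-envelope check, the steady-state replica count for a queue-length trigger can be estimated as below. This is a sketch under stated assumptions: the function name is hypothetical, the defaults copy the ScaledObject above, and it ignores cooldown periods and HPA stabilization.

```python
import math

def worker_replicas(queue_depth: int, target_per_replica: int = 50,
                    min_replicas: int = 0, max_replicas: int = 30) -> int:
    """Estimate replicas for a queue depth with a per-replica message target."""
    if queue_depth == 0:
        return min_replicas  # nothing to do: fall back to the minimum (zero here)
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))

for depth in (0, 1, 500, 5000):
    print(depth, worker_replicas(depth))
```

A single message is enough to bring up one worker, and a backlog of 5,000 messages is capped at the 30-replica maximum.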

Cron-Based Scaling

Schedule scaling for predictable patterns:

  triggers:
  - type: cron
    metadata:
      timezone: Asia/Jakarta
      start: "30 8 * * 1-5"
      end: "0 18 * * 1-5"
      desiredReplicas: "10"

Scale to 10 replicas on weekdays from 08:30 to 18:00 Jakarta time, and back to the minimum outside that window. KEDA supports 50+ scalers including Kafka, PostgreSQL, Prometheus, and cloud-specific triggers like AWS SQS and GCP Pub/Sub.
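
To sanity-check a schedule before deploying it, the window above can be modeled in a few lines. This sketch assumes the timestamp is already expressed in the trigger's timezone (Asia/Jakarta) and that the minimum replica count is 0; the function name is hypothetical.

```python
from datetime import datetime, time

def cron_replicas(local_now: datetime, desired: int = 10, minimum: int = 0) -> int:
    """Replicas requested by the cron window: weekdays, 08:30 up to 18:00."""
    is_weekday = local_now.weekday() < 5  # Monday=0 ... Friday=4
    in_window = time(8, 30) <= local_now.time() < time(18, 0)
    return desired if is_weekday and in_window else minimum

print(cron_replicas(datetime(2024, 1, 8, 9, 0)))   # Monday morning, inside the window
print(cron_replicas(datetime(2024, 1, 6, 12, 0)))  # Saturday, outside the window
```

If the same ScaledObject also has other triggers, the cron value acts as one input among them rather than a hard override.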

Decision Framework: Which Autoscaler When?

Scenario	Use	Why
Stateless web API, CPU/memory spikes	HPA	Built-in, simple, works with standard metrics
Stateful app, memory-bound	VPA	Can't easily replicate; resize instead
Queue workers, event-driven processing	KEDA	Scale to zero, event triggers, 50+ integrations
Database (PostgreSQL, MySQL)	VPA	Stateful, resizing safer than replicating
Mixed: API + background workers	HPA + KEDA	HPA for the API tier, KEDA for event-driven workers
Batch/cron jobs on Kubernetes	KEDA	Cron triggers with scale-to-zero between runs
Microservices with unpredictable traffic	HPA + custom metrics	Scale on request rate or latency, not just CPU

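Pro tip: You can combine them. Use HPA for horizontal scaling and VPA in Off mode for resource recommendations, but avoid letting VPA actively resize pods on the same CPU or memory metric that HPA scales on. KEDA itself creates and manages an HPA for each ScaledObject, so its triggers already drive custom-metric horizontal scaling; don't attach a separate HPA to the same workload.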

Conclusion

Start with HPA for stateless services--it's built in, battle-tested, and covers most use cases with nothing to install beyond Metrics Server. Add VPA in recommendation mode to discover over- or under-provisioned pods. Reach for KEDA when you need event-driven scaling or scale-to-zero for cost efficiency.

The key takeaway: no autoscaler replaces proper load testing. Know your application's limits before you let Kubernetes decide them. Test your HPA configurations under real traffic, validate VPA recommendations in staging, and ensure KEDA scalers respond within your latency budget.

Once configured correctly, Kubernetes autoscaling transforms your cluster from a static deployment into a self-regulating system that adapts to demand automatically. Your users won't notice the difference--but your cloud bill and on-call stress levels will.