Zero-downtime deployments shouldn't be a luxury for large teams. With GitHub Actions and Kubernetes, you can have a production-grade deployment pipeline in an afternoon. Here's the exact workflow I use — battle-tested across dozens of production services.
The Goal
Every merge to main should:
- Build a Docker image with a deterministic tag
- Run tests in parallel (fail fast)
- Push to container registry
- Deploy to Kubernetes with zero downtime
- Verify the deployment succeeded
- Auto-rollback if health checks fail
All within 5–8 minutes.
The Workflow File
# .github/workflows/deploy.yml
name: Build and Deploy

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Run tests
        run: npm test -- --coverage --passWithNoTests
      - name: Lint
        run: npm run lint

  build:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
      image-digest: ${{ steps.build.outputs.digest }}
    steps:
      - uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Log in to registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=sha-,format=short
            type=ref,event=branch
            type=raw,value=latest,enable=${{ github.ref == 'refs/heads/main' }}
      - name: Build and push
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    needs: build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Set up kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: 'v1.28.0'
      - name: Configure kubeconfig
        run: |
          mkdir -p $HOME/.kube
          echo "${{ secrets.KUBECONFIG }}" | base64 -d > $HOME/.kube/config
          chmod 600 $HOME/.kube/config
      - name: Deploy to Kubernetes
        run: |
          IMAGE_TAG="sha-$(echo ${{ github.sha }} | cut -c1-7)"
          kubectl set image deployment/app \
            app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:$IMAGE_TAG \
            --namespace=production
          kubectl annotate deployment/app \
            kubernetes.io/change-cause="Deploy $IMAGE_TAG from commit ${{ github.sha }}" \
            --namespace=production \
            --overwrite
      - name: Wait for rollout
        run: |
          kubectl rollout status deployment/app \
            --namespace=production \
            --timeout=300s
      - name: Smoke test
        run: |
          # Hit the health endpoint after deploy
          sleep 10
          ENDPOINT="${{ secrets.APP_HEALTH_URL }}"
          STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$ENDPOINT")
          if [ "$STATUS" != "200" ]; then
            echo "Health check failed with status $STATUS — rolling back"
            kubectl rollout undo deployment/app --namespace=production
            exit 1
          fi
          echo "Deployment healthy — status $STATUS"
The Kubernetes Deployment Config
Your Kubernetes deployment must be configured correctly for zero-downtime rollouts to work:
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  namespace: production
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take pods down before new ones are ready
      maxSurge: 1         # one extra pod during transition
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: app
          image: ghcr.io/your-org/your-app:latest  # placeholder; the pipeline overwrites this with a SHA tag via kubectl set image
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "1000m"
              memory: "512Mi"
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]
      terminationGracePeriodSeconds: 30
Why These Settings Matter
maxUnavailable: 0 — Kubernetes will not kill an old pod until its replacement passes readiness checks, so serving capacity never drops below the replica count during a rollout. This is what makes the rollout zero-downtime on the Kubernetes side; graceful shutdown (below) handles the app side.
readinessProbe — The new pod only receives traffic once /health/ready returns 200. Your app should return 503 until it's fully initialized (DB connections established, caches warmed).
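What that looks like depends on your stack. Here is a minimal sketch with Node's built-in http module, where initDependencies() is a hypothetical stand-in for whatever your app does at startup (opening DB pools, warming caches):

// health.js: sketch of a readiness-aware health server (assumes port 3000, as in the manifest)
const http = require('http');

let ready = false;

// Stand-in for real startup work; replace with your own initialization
async function initDependencies() {
  return new Promise((resolve) => setTimeout(resolve, 2000));
}

const server = http.createServer((req, res) => {
  if (req.url === '/health/ready') {
    res.writeHead(ready ? 200 : 503);   // 503 until startup work is done
    return res.end(ready ? 'ready' : 'starting');
  }
  if (req.url === '/health/live') {
    res.writeHead(200);                 // alive as long as the event loop responds
    return res.end('ok');
  }
  res.writeHead(200);
  res.end('hello');
});

server.listen(3000, async () => {
  await initDependencies();
  ready = true;                         // readiness probe starts passing here
});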
preStop sleep 5 — When a pod terminates, Kubernetes removes it from the service endpoints and sends SIGTERM in parallel, not in sequence, so traffic can still arrive for a moment after the process is told to shut down. The 5-second sleep keeps the process alive and serving while kube-proxy and load balancers catch up.
terminationGracePeriodSeconds: 30 — Gives your app 30 seconds to finish handling requests after SIGTERM before a hard kill.
Manual Rollback
If automated rollback fails or you need to roll back manually:
# See rollout history (shows the change-cause annotations)
kubectl rollout history deployment/app --namespace=production
# Roll back to previous version
kubectl rollout undo deployment/app --namespace=production
# Roll back to a specific revision
kubectl rollout undo deployment/app --namespace=production --to-revision=5
# Watch the rollback progress
kubectl rollout status deployment/app --namespace=production
With this setup, rolling back to the previous version takes under 60 seconds.
GitHub Environment Protection Rules
In GitHub repo settings, create an environment called production with:
- Required reviewers — require 1 approval for production deploys (optional but recommended)
- Deployment branches — only allow main
- Wait timer — 0 minutes (don't slow down routine deploys)
The environment: production line in the workflow triggers these checks.
Secrets to Configure
In GitHub → Settings → Secrets:
| Secret | Value |
|--------|-------|
| KUBECONFIG | Base64-encoded kubeconfig with deploy permissions |
| APP_HEALTH_URL | Full URL to your health check endpoint |
Generate the kubeconfig secret (pbcopy is macOS-only; on Linux, pipe to xclip or redirect to a file instead):
cat ~/.kube/config | base64 | pbcopy
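"Deploy permissions" can be narrower than a full admin kubeconfig: this pipeline only patches deployments and reads rollout state. A sketch of a namespace-scoped Role covering the kubectl set image, annotate, and rollout commands above; the name deployer is illustrative, and you may need to widen the verbs if your deploy does more:

# k8s/deployer-role.yaml: sketch; bind this to the service account behind the CI kubeconfig
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployer
  namespace: production
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "patch", "update"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]   # rollout status/undo read ReplicaSet history
    verbs: ["get", "list", "watch"]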
Build Time Benchmarks
With this setup and GitHub Actions cache for Docker layers:
- Test job: ~2 min
- Build + push: ~3 min (cold), ~90s (warm cache)
- Deploy + verify: ~2 min
- Total: ~5–7 minutes from merge to production
That's fast enough that you can deploy multiple times per day without friction.
Common Pitfalls
Probes too aggressive. If the liveness probe's initialDelaySeconds is too short and your app takes 20 seconds to start, Kubernetes kills and restarts it in a loop. An over-eager readiness probe won't restart anything, but it will keep the pod out of rotation and stall the rollout. Give your app breathing room.
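If startup time varies too much to guess, a startupProbe is the safer tool: Kubernetes suspends liveness and readiness checks until it succeeds. A sketch to sit alongside the probes in the manifest above:

# Sketch: add under the same container as the readiness/liveness probes
startupProbe:
  httpGet:
    path: /health/ready
    port: 3000
  periodSeconds: 5
  failureThreshold: 12   # 12 x 5s = up to 60s to come up before a restart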
Not draining connections. Containers need to handle SIGTERM gracefully — stop accepting new connections, finish existing ones, then exit. Most frameworks support this natively (Node.js: server.close(), Go: the http.Server Shutdown method); see the sketch below.
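In Node, a minimal sketch of that shutdown path; the 25-second safety timeout is an assumption, chosen to beat the 30-second grace period:

// shutdown.js: sketch of graceful SIGTERM handling, paired with the preStop sleep above
const http = require('http');

let ready = true;
const server = http.createServer((req, res) => {
  if (req.url === '/health/ready') {
    res.writeHead(ready ? 200 : 503);
    return res.end();
  }
  res.end('ok');
});
server.listen(3000);

process.on('SIGTERM', () => {
  ready = false;                         // readiness probe starts failing, so no new traffic is routed here
  server.close(() => process.exit(0));   // stop accepting connections, let in-flight requests finish
  // Safety net: exit on our own terms before the 30s grace period triggers SIGKILL
  setTimeout(() => process.exit(1), 25000).unref();
});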
Image tag latest in production. Always deploy specific SHA-based tags. latest is non-deterministic and kills your ability to roll back to a known version. (The workflow above still pushes a latest tag as a convenience alias, but every deploy pins the SHA tag.)
Smoke test too fast. The sleep 10 after deploy gives load balancers time to notice the new endpoints. Remove it and you might hit old pods during the smoke test.
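A fixed sleep is a guess; polling with retries is more robust. A sketch of a drop-in body for the Smoke test step, assuming up to 12 attempts at 5-second intervals:

# Sketch: poll the health endpoint instead of sleeping a fixed 10 seconds
ENDPOINT="${{ secrets.APP_HEALTH_URL }}"
for i in $(seq 1 12); do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$ENDPOINT")
  if [ "$STATUS" = "200" ]; then
    echo "Healthy after $i attempt(s)"
    exit 0
  fi
  sleep 5
done
echo "Health check never passed (last status: $STATUS)"
kubectl rollout undo deployment/app --namespace=production
exit 1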
The core insight: zero-downtime deployment is a property of correct configuration, not clever code. Get the readiness probes and rolling update strategy right, and Kubernetes handles the rest.