Kubernetes LLM Inference: Deploy and Scale Open-Source LLMs in 2026

Introduction

Running LLM inference at scale has become one of the most demanding infrastructure challenges of 2026. The default answer for most teams is to call an API: OpenAI, Anthropic, or managed model endpoints. But as inference volumes grow, that API bill can become the single largest line item in the cloud budget. A quantized Llama 70B model running 24/7 on a single A100 costs roughly $4,745 per month in raw cloud GPU compute, while the equivalent per-token pricing from a managed API can easily reach $15,000 to $20,000 per month at production throughput.

This article is a hands-on guide to running open-source LLM inference on Kubernetes. You will learn GPU scheduling strategies, deploy vLLM with real YAML, configure autoscaling that actually works for inference workloads, solve the cold start problem, and cut your GPU bill by up to 70% using spot instances, model quantization, and GPU sharing.

By the end, you will have production-ready manifests to deploy Llama 4, Mistral, or DeepSeek on your own K8s cluster — with privacy, cost control, and the flexibility to tune models for agentic workflows rather than generic endpoints.

Why Run LLMs on Kubernetes Instead of Managed APIs?

Three reasons drive teams toward self-hosted LLM inference in 2026: privacy, cost, and control.

Privacy. Legal and compliance teams increasingly refuse to let customer data leave the VPC. Healthcare, fintech, and enterprise contracts routinely prohibit sending PII or proprietary data to third-party model endpoints. Running inference on your own Kubernetes cluster keeps data within your network boundary — a non-negotiable requirement for regulated industries.

Cost. Per-token pricing looks cheap at first but becomes punishing at scale. A production workload generating 500,000 tokens per hour — roughly 500 concurrent user sessions — can cost $8,000 to $15,000 per month on managed endpoints. The same workload on a dedicated A100 or H100 instance costs $2,000 to $4,500 per month. At 10x scale, the gap widens to hundreds of thousands of dollars annually. Most teams we have spoken with overspend 50-70% on GPU inference before migrating to self-hosted Kubernetes.

Control. Managed APIs serve general-purpose models with fixed inference parameters. If you are building agentic workflows — chains of reasoning steps with tool calls, retries, and function-calling loops — you need control over temperature, max tokens, stop sequences, and batch processing strategies. Self-hosting lets you tune the serving stack for your specific workload, not the lowest common denominator.
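The arithmetic behind the cost comparison can be checked directly. The sketch below is a rough illustration, not a pricing model: it assumes a 730-hour month and a perfectly steady token rate (both assumptions, not figures from this article) and converts the monthly bills quoted above into an implied price per million tokens.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def monthly_tokens(tokens_per_hour: int) -> int:
    """Total tokens generated in one month at a steady hourly rate."""
    return tokens_per_hour * HOURS_PER_MONTH


def implied_price_per_million(monthly_cost_usd: float, tokens_per_hour: int) -> float:
    """Monthly bill divided by monthly volume, expressed per 1M tokens."""
    return monthly_cost_usd / (monthly_tokens(tokens_per_hour) / 1_000_000)


if __name__ == "__main__":
    rate = 500_000  # tokens per hour, from the cost paragraph above
    scenarios = [
        ("managed, low", 8_000),
        ("managed, high", 15_000),
        ("self-hosted, low", 2_000),
        ("self-hosted, high", 4_500),
    ]
    for label, cost in scenarios:
        price = implied_price_per_million(cost, rate)
        print(f"{label:>18}: ${price:.2f} per 1M tokens")
```

Running the same function against your own traffic numbers is a quick way to see where your break-even point sits before committing to GPU capacity.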

GPU Scheduling on Kubernetes: Beyond Atomic Device Allocation

Kubernetes has treated GPUs as atomic, whole-device resources since the device plugin API was introduced. Requesting nvidia.com/gpu: 1 means one entire GPU is assigned to your pod, even if you only use 15% of its memory and compute. That model was fine for training jobs. For inference, it is a colossal waste.

NVIDIA GPU Operator: The Foundation

Every K8s cluster running LLM inference needs the NVIDIA GPU Operator. It installs GPU drivers, the Kubernetes device plugin, the Data Center GPU Manager (DCGM) for metrics export, and GPU Feature Discovery for node labeling — all via a single Helm chart.

# clusterpolicy.yaml — NVIDIA GPU Operator via ClusterPolicy
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  operator:
    defaultRuntime: containerd
  driver:
    enabled: true
    version: "570"
  toolkit:
    enabled: true
  devicePlugin:
    enabled: true
  dcgmExporter:
    enabled: true
    config:
      name: dcgm-metrics
  gfd:
    enabled: true

Apply this once per cluster. The operator handles driver upgrades, node reboots, and CUDA toolkit compatibility — saving you from the maintenance burden that used to make GPU clusters brittle.

GPU Sharing: Three Methods Compared

Once the operator is running, the question becomes: how do you share one physical GPU across multiple inference pods? Three methods exist, each with distinct tradeoffs.

MIG (Multi-Instance GPU). Hardware-level partitioning of the GPU itself. An H100 can be carved into seven independent 1g.10gb instances, each with its own isolated memory, cache, and compute pipeline. Memory is physically isolated, so no pod can overflow into another's memory. The downside: changing the MIG layout requires draining all workloads and resetting the GPU, so treat the layout as a provisioning-time decision rather than a dynamic allocation. MIG is limited to select data-center GPUs such as the A100, H100, and B200.

Time-Slicing. Software-level multiplexing where pods take turns on the GPU. The device plugin exposes "virtual GPUs" as separate resources, but under the hood they share the same physical device. Time-slicing works on any NVIDIA GPU, requires no GPU reset to change, and is ideal for bursty inference workloads where no single request saturates the GPU. The critical limitation: no memory isolation. If one pod allocates too much GPU memory, it can OOM-kill other pods on the same device.

MPS (Multi-Process Service). Uses NVIDIA's MPS control daemon to let multiple processes run concurrently on one GPU. Isolation is partial: the daemon can enforce per-client memory and compute limits, but a fatal fault in one process can still take down the others sharing the device. That makes MPS a poor fit for mutually untrusted tenants and a reasonable fit when one team owns every process on the GPU.

Here is the comparison at a glance:

Method	Memory Isolation	GPU Models	Layout Change	Best For
MIG	Hardware-isolated	A100/H100/B200 only	Requires GPU reset	Multi-tenant, strict isolation
Time-Slicing	None	All NVIDIA GPUs	Dynamic (no reset)	Bursty inference, dev/test
MPS	Partial	All NVIDIA GPUs	Requires process restart	Single-owner GPU sharing

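DRA (Dynamic Resource Allocation) is the K8s-native direction for GPU scheduling. It allows pods to request GPU resources with more granularity than the device plugin model: think "256 MiB of GPU memory" rather than "1 GPU." The core DRA APIs reached beta in Kubernetes 1.32 and general availability in Kubernetes 1.34, but fine-grained GPU sharing through DRA still depends on the vendor's DRA driver, so check its maturity before relying on it in production.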

Model Serving Frameworks: vLLM, TGI, and Ray Serve

Three frameworks dominate self-hosted LLM inference on Kubernetes in 2026. Here is how they compare.

vLLM (Recommended)

vLLM is the de facto standard for production LLM serving on Kubernetes. Its killer feature is PagedAttention — a KV cache management algorithm that borrows from operating system virtual memory. Instead of allocating one contiguous block of GPU memory for the key-value cache, PagedAttention splits it into small fixed-size blocks allocated on demand. Fragmentation drops to nearly zero, and vLLM can serve far more concurrent requests per GPU than frameworks using naive contiguous allocation.
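To make the idea concrete, here is a toy block allocator in Python. It is a simplified illustration of on-demand, fixed-size block allocation, not vLLM's actual implementation; the class name, method names, and the 16-token block size are all hypothetical choices for this sketch.

```python
import math


class PagedKVCache:
    """Toy allocator: fixed-size blocks handed out only as sequences grow."""

    def __init__(self, total_blocks: int, block_size: int = 16):
        self.block_size = block_size                   # tokens per block
        self.free_blocks = list(range(total_blocks))   # physical block ids
        self.block_tables: dict[str, list[int]] = {}   # request id -> blocks
        self.lengths: dict[str, int] = {}              # request id -> tokens

    def append_tokens(self, request_id: str, num_tokens: int) -> None:
        """Grow a request's cache, allocating new blocks only when needed."""
        table = self.block_tables.get(request_id, [])
        new_length = self.lengths.get(request_id, 0) + num_tokens
        blocks_needed = math.ceil(new_length / self.block_size) - len(table)
        if blocks_needed > len(self.free_blocks):
            raise MemoryError("KV cache exhausted; request must wait")
        table.extend(self.free_blocks.pop() for _ in range(blocks_needed))
        self.block_tables[request_id] = table
        self.lengths[request_id] = new_length

    def release(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

    def wasted_slots(self) -> int:
        """Allocated but unused token slots (the tail of each last block)."""
        allocated = sum(len(t) for t in self.block_tables.values())
        return allocated * self.block_size - sum(self.lengths.values())


if __name__ == "__main__":
    cache = PagedKVCache(total_blocks=8)
    cache.append_tokens("req-a", 20)   # needs 2 blocks
    cache.append_tokens("req-b", 5)    # needs 1 block
    print(cache.block_tables, cache.wasted_slots())
```

Because blocks are allocated only as a sequence grows, the only unused slots are in the tail of each request's last block, which is why waste stays small no matter how much request lengths vary.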

The vLLM Production Stack — an official Helm chart, Kubernetes operator, KEDA-based autoscaling, and multi-model routing — reached version 0.1.11 in May 2026 with 22 releases and 2,400+ GitHub stars. It is the closest thing to a turnkey LLM inference platform on Kubernetes today.

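# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama-inference
  template:
    metadata:
      labels:
        app: llama-inference
    spec:
      # Tolerate the taint that keeps non-GPU pods off GPU nodes
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # pin a specific tag in production
          args:
            - "--model"
            - "meta-llama/Llama-3.1-70B-Instruct"
            # Sets the model_name label on vLLM metrics (used by the autoscaler query)
            - "--served-model-name"
            - "llama-70b"
            - "--tensor-parallel-size"
            - "4"
            - "--max-model-len"
            - "8192"
            - "--gpu-memory-utilization"
            - "0.92"
            - "--enable-prefix-caching"
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 4
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
          volumeMounts:
            # Tensor-parallel workers communicate through shared memory
            - name: dshm
              mountPath: /dev/shm
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300
            periodSeconds: 30
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory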

Key flags explained:

--tensor-parallel-size 4: Splits the model across 4 GPUs using tensor parallelism. For a 70B model at FP16 (~140 GB), 4x A100-40GB or 2x H100-80GB is the minimum.
--gpu-memory-utilization 0.92: Leaves 8% GPU memory headroom. Going higher risks OOM during request spikes.
--enable-prefix-caching: Reuses KV cache blocks when prompts share a common prefix — critical for agentic workflows where the system prompt is repeated across tool-call loops.
--max-model-len 8192: Caps context length. Longer contexts consume more KV cache memory. Set this to the smallest value your workload actually needs.
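A back-of-the-envelope calculation shows why these flags interact. The sketch below assumes Llama 3.1 70B's published architecture (80 layers, 8 KV heads, head dimension 128), an FP16 KV cache, and decimal gigabytes. It ignores activation memory and CUDA graph overhead, so treat the result as an optimistic upper bound rather than what vLLM will actually report.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_token_budget(num_gpus: int, gpu_mem_gb: float, utilization: float,
                    weight_gb: float, bytes_per_token: int) -> int:
    """Tokens of KV cache that fit once the weights are loaded."""
    usable_bytes = num_gpus * gpu_mem_gb * 1e9 * utilization
    free_bytes = usable_bytes - weight_gb * 1e9
    return max(0, int(free_bytes // bytes_per_token))


if __name__ == "__main__":
    # Assumed architecture values for Llama 3.1 70B (not stated in this article)
    per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
    budget = kv_token_budget(num_gpus=4, gpu_mem_gb=40, utilization=0.92,
                             weight_gb=140, bytes_per_token=per_token)
    print(f"{per_token} bytes per token")
    print(f"{budget} tokens of KV cache")
    print(f"{budget // 8192} full-length sequences at --max-model-len 8192")
```

Under these assumptions, 4x A100-40GB leaves room for only about two full 8,192-token sequences, which is why that configuration is a floor rather than a comfortable target.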

TGI (Text Generation Inference)

Hugging Face's TGI is vLLM's primary competitor. It offers solid throughput, native Hugging Face Hub integration, and optional output watermarking. Its main strength is that ecosystem integration: if your team already standardizes on HF models and libraries, TGI reduces friction. Its GPU memory efficiency is comparable to vLLM's for most workloads, though it tends to trail slightly on throughput at high request concurrency.

Ray Serve

Ray Serve is the right choice when inference is one component of a larger distributed pipeline that includes pre-processing, retrieval (RAG), ranking, and post-processing. Ray's actor model lets you co-locate these steps on the same GPU node, avoiding network hops between inference and pre-processing. For standalone inference serving, vLLM or TGI is simpler and more performant.

Auto-Scaling for LLM Inference: Why CPU-Based HPA Is Wrong

The standard Kubernetes Horizontal Pod Autoscaler (HPA) scales on CPU and memory utilization. For LLM inference, this is fundamentally wrong. A GPU processing inference requests is always at high utilization — that is the desired state. The metric that matters is request queue depth: how many requests are waiting because all current replicas are busy.

vLLM exposes a Prometheus metric vllm:num_requests_waiting that tracks exactly this. KEDA can map it to a scaling trigger:

# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-inference-autoscaler
spec:
  scaleTargetRef:
    name: llama-inference
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: vllm_queue_depth
        query: |
          sum(vllm:num_requests_waiting{model_name="llama-70b"})
        threshold: "5"
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Percent
              value: 50
              periodSeconds: 60

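The threshold: "5" is a per-replica target rather than a simple trip wire. KEDA hands the metric to an HPA, which sizes the deployment to roughly ceil(total queued requests / 5) replicas, so 12 queued requests across the fleet yields three replicas. The scaleDown stabilizationWindowSeconds: 300 prevents replica thrashing: if a cold start takes 120 seconds, you do not want the HPA scaling down before the new replica is warm.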
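The scaling arithmetic is easy to reason about in isolation. The sketch below assumes KEDA's default AverageValue metric type and deliberately ignores the HPA's tolerance band and stabilization windows, so it approximates the steady-state target rather than reproducing the controller's behavior; the function name is illustrative.

```python
import math


def desired_replicas(total_waiting: float, threshold: float,
                     min_replicas: int, max_replicas: int) -> int:
    """ceil(metric / threshold), clamped to the ScaledObject's replica bounds."""
    raw = math.ceil(total_waiting / threshold)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Bounds and threshold taken from the ScaledObject above
    for waiting in (0, 4, 12, 80):
        replicas = desired_replicas(waiting, threshold=5,
                                    min_replicas=1, max_replicas=10)
        print(f"{waiting:>3} queued requests -> {replicas} replicas")
```

Plugging in your own peak queue depth shows quickly whether maxReplicaCount is high enough, or whether the threshold is so low that ordinary bursts pin the deployment at its ceiling.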

Node-Level Autoscaling with Karpenter

GPU nodes are expensive. You do not want idle GPU instances running when no inference workloads are scheduled. Karpenter provisions GPU nodes on demand and terminates them when empty:

# karpenter-nodepool-gpu.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - p4d.24xlarge    # 8x A100-40GB
            - p5.48xlarge     # 8x H100-80GB
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-node-class
  limits:
    nvidia.com/gpu: 16
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m

This NodePool provisions GPU instances as spot first, falling back to on-demand. The consolidateAfter: 5m setting gives Karpenter five minutes before it replaces an underutilized node — preventing churn when inference pods are briefly idle between request bursts.

The Cold Start Problem: Why Your Inference Pod Takes Five Minutes to Start

LLM inference pods have the worst cold start in all of Kubernetes infrastructure. Three phases contribute:

Container image pull (30-90 seconds). The vLLM image is large — usually 4-8 GB. Caching it on nodes via a DaemonSet pre-puller or using a container registry with local node caching helps.
Model weight download and deserialization (60-180 seconds). A 70B model in FP16 weighs roughly 140 GB. If pulled from Hugging Face Hub on every pod start, this dominates cold start time. The solution: pre-load models onto a ReadWriteMany PersistentVolume (EFS, Filestore, or CephFS) and mount it into the pod.
GPU warmup (10-30 seconds). After weights are loaded into GPU memory, vLLM runs a warmup forward pass to initialize CUDA graphs. This is fast but adds latency before the readiness probe passes.

# PVC for model caching
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 500Gi

---
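# Pod spec fragment: volumes sits at the pod level,
# volumeMounts inside the vLLM container
spec:
  volumes:
    - name: models
      persistentVolumeClaim:
        claimName: model-cache
  containers:
    - name: vllm
      volumeMounts:
        - name: models
          mountPath: /models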

Configure vLLM to load from the shared volume:

--model /models/Llama-3.1-70B-Instruct --download-dir /models

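Probe tuning is critical. A failing readiness probe only keeps the pod out of the Service's endpoints, but a liveness probe that fires before the model has loaded makes the kubelet restart the container, resetting the entire cold start. Set initialDelaySeconds generously, with the liveness delay longer than your worst-case load time (a startupProbe is an alternative way to get the same protection):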

readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120   # 2-minute delay for model load
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 300   # 5 minutes — Liveness kills the pod!
  periodSeconds: 30

Never scale to zero. Keep a warm pool of at least one replica. A scale-from-zero cold start can take five minutes — long enough for users to assume your service is down. If you are using Karpenter spot instances, ensure at least one on-demand node is always available to host the warm replica.
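The five-minute figure is simply the worst case of the three phases added together. The sketch below sums the ranges quoted earlier and, as a side calculation, shows the sustained read throughput the shared volume would need in order to hit the weight-loading window; the function and variable names are illustrative.

```python
# (best case, worst case) in seconds, from the three phases listed above
PHASES = {
    "image_pull": (30, 90),
    "weight_load": (60, 180),
    "gpu_warmup": (10, 30),
}


def cold_start_range(phases: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the best-case and worst-case durations across all phases."""
    best = sum(low for low, _ in phases.values())
    worst = sum(high for _, high in phases.values())
    return best, worst


def required_read_gb_per_s(model_gb: float, seconds: float) -> float:
    """Sustained throughput needed to read the weights within the window."""
    return model_gb / seconds


if __name__ == "__main__":
    best, worst = cold_start_range(PHASES)
    print(f"cold start: {best}-{worst} seconds")
    for window in PHASES["weight_load"]:
        rate = required_read_gb_per_s(140, window)
        print(f"140 GB in {window}s needs {rate:.2f} GB/s")
```

If your storage class cannot sustain the throughput this prints, the weight-loading phase will run longer than the quoted range, and the probe delays should be raised to match.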

Cost Optimization: Cutting Your GPU Bill by 70%

GPU inference on Kubernetes can be dramatically cheaper than managed APIs — but only if you apply these four strategies.

Spot and Preemptible GPUs

Inference is stateless. If a spot instance is reclaimed mid-request, the client retries. There is no model checkpoint to lose. Spot GPUs save 60-70% versus on-demand pricing. Use Karpenter with a spot-first NodePool (shown above) and a preStop hook to drain in-flight requests gracefully:

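# Pod spec fragment: the grace period must exceed the preStop sleep
terminationGracePeriodSeconds: 120
containers:
  - name: vllm
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/sh
            - -c
            - sleep 90   # let in-flight requests finish before SIGTERM

AWS sends a two-minute termination notice for spot instances. The 90-second sleep gives in-flight requests time to complete and new requests time to route away, but only if the pod's terminationGracePeriodSeconds is longer than the sleep. With the 30-second default, the kubelet kills the container mid-drain.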
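Since spot reclamation surfaces to callers as a dropped connection, the client side needs a retry policy. Below is a minimal sketch of retry with exponential backoff and jitter. The function name, the defaults, and the choice to retry only on ConnectionError are assumptions for illustration, and `send` stands in for whatever HTTP call your client actually makes.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(send: Callable[[], T], max_attempts: int = 4,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry a request on connection errors with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt; jitter spreads out retry storms.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("unreachable")


if __name__ == "__main__":
    attempts = []

    def flaky_request() -> str:
        """Hypothetical request that fails twice, as if the pod was reclaimed."""
        attempts.append(1)
        if len(attempts) < 3:
            raise ConnectionError("replica terminated")
        return "completion text"

    print(call_with_retries(flaky_request, base_delay=0.01))
```

Retrying a whole request is only safe when the request is idempotent from the caller's point of view; for streamed responses, the client also has to discard any partial output before retrying.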

Model Quantization

AWQ 4-bit quantization reduces GPU memory requirements by roughly 75% with under 1% quality loss on standard benchmarks. A 70B model that requires 4x A100-40GB at FP16 fits on a single A100-80GB at AWQ 4-bit. GPTQ 4-bit achieves similar memory savings with a 1-3% quality tradeoff. Both are supported natively in vLLM:

--model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --quantization awq
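The 75% figure follows directly from the bit width. The sketch below estimates weight memory only, in decimal gigabytes; it ignores the small overhead of quantization scales and zero points as well as the KV cache, so real footprints will be somewhat higher. The function name is illustrative.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight footprint: parameter count times bits, in gigabytes."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    fp16 = weight_memory_gb(70, 16)
    awq4 = weight_memory_gb(70, 4)
    print(f"FP16: {fp16:.0f} GB, 4-bit: {awq4:.0f} GB, "
          f"reduction: {1 - awq4 / fp16:.0%}")
```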

GPU Bin Packing

Without GPU sharing, ten services each using 15% of a GPU still consume ten whole GPUs while doing 1.5 GPUs' worth of work, an 85% waste rate. Time-slicing or MIG partitions let you pack multiple small inference workloads onto one physical GPU:

# Time-slicing config in NVIDIA Device Plugin
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4

With replicas: 4, each physical GPU is exposed as four "virtual GPUs," allowing up to four inference pods to share one device — provided their memory requirements fit within the total GPU memory.
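Two quick checks are worth running before enabling time-slicing: how many physical GPUs a set of pods needs at a given replica count, and whether the pods sharing one device fit in its memory, since time-slicing enforces no memory limits. The sketch below uses hypothetical per-pod memory figures and illustrative function names.

```python
import math


def gpus_needed(num_pods: int, replicas_per_gpu: int) -> int:
    """Physical GPUs required when each GPU is advertised as several replicas."""
    return math.ceil(num_pods / replicas_per_gpu)


def fits_in_memory(pod_mem_gb: list[float], gpu_mem_gb: float) -> bool:
    """True if the pods sharing one GPU fit within its total memory."""
    return sum(pod_mem_gb) <= gpu_mem_gb


if __name__ == "__main__":
    print(gpus_needed(num_pods=10, replicas_per_gpu=4))   # ten small services
    print(fits_in_memory([8, 8, 8, 8], gpu_mem_gb=40))    # hypothetical pod sizes
    print(fits_in_memory([12, 12, 12, 12], gpu_mem_gb=40))
```

With replicas: 4, the ten services from the example above need three physical GPUs instead of ten, as long as every group of four passes the memory check.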

The GPU Waste Tax

If you are not using MIG or time-slicing, measure your cluster's GPU utilization with DCGM metrics. Many teams discover their GPUs average 20-35% utilization in production — a direct result of the "one GPU per pod" model. Moving to GPU sharing can cut your GPU count in half without reducing throughput.

Monitoring LLM Serving: Metrics That Matter

Your inference stack needs monitoring on two levels: the model serving layer and the GPU hardware layer.

vLLM Prometheus Metrics

vLLM exposes rich Prometheus metrics at :8000/metrics. The essential ones:

Metric	What It Tells You
`vllm:time_per_output_token_seconds`	P50/P95/P99 token generation latency
`vllm:num_requests_waiting`	Queue depth — your primary autoscaling signal
`vllm:num_requests_running`	Active requests being processed
`vllm:gpu_cache_usage_perc`	KV cache utilization — approaching 100% means you need more GPUs or shorter context
`vllm:request_success_total`	Successful completions
`vllm:prompt_tokens_total`	Input token throughput
`vllm:generation_tokens_total`	Output token throughput
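For ad hoc checks without a Prometheus server, the text served at /metrics can be parsed directly. The sketch below is a minimal parser, assuming the exposition lines carry no trailing timestamps; the sample text is made up for illustration and is not captured vLLM output.

```python
# Made-up sample in Prometheus text exposition format
SAMPLE = """\
# HELP vllm:num_requests_waiting Requests waiting in the queue.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="llama-70b"} 7.0
vllm:num_requests_running{model_name="llama-70b"} 12.0
vllm:gpu_cache_usage_perc{model_name="llama-70b"} 0.83
"""


def read_gauge(metrics_text: str, metric_name: str) -> float:
    """Sum every sample of metric_name across all label sets."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric_name:
            total += float(value)
    return total


if __name__ == "__main__":
    print(read_gauge(SAMPLE, "vllm:num_requests_waiting"))
    print(read_gauge(SAMPLE, "vllm:gpu_cache_usage_perc"))
```

In production you would query Prometheus instead; a parser like this is mainly useful for smoke tests that confirm a freshly deployed pod is exporting the metrics your autoscaler depends on.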

NVIDIA DCGM Metrics

The GPU Operator's DCGM Exporter provides hardware-level metrics:

DCGM_FI_DEV_GPU_UTIL: GPU compute utilization
DCGM_FI_DEV_FB_USED: GPU memory used (framebuffer)
DCGM_FI_DEV_GPU_TEMP: GPU temperature; sustained high readings are an early sign of thermal throttling
DCGM_FI_DEV_POWER_USAGE: Power draw in watts

Grafana Dashboard Essentials

Build a Grafana dashboard with these four panels for every inference deployment:

Token throughput (input + output tokens/sec): Your north-star metric. If throughput plateaus while queue depth rises, scale out.
P95 time per output token: Latency as experienced by users. Spikes indicate KV cache pressure or GPU memory pressure.
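GPU memory utilization: Track both framebuffer usage (DCGM) and KV cache usage (vLLM). Because vLLM reserves its --gpu-memory-utilization share up front, framebuffer usage sits near that level from startup, and KV cache usage is the signal that actually moves with load. Framebuffer usage that keeps climbing beyond the reserved share without releasing signals a memory leak.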
Request queue depth: Overlaid with replica count to validate that autoscaling is working. Queue depth rising while replicas are flat means your KEDA trigger may be misconfigured.

Conclusion

Running LLM inference on Kubernetes in 2026 is the financially responsible choice for any team serving more than a few hundred thousand tokens per day. The tooling — vLLM with PagedAttention, NVIDIA GPU Operator for driver lifecycle management, KEDA for inference-aware autoscaling, and Karpenter for spot-first GPU node provisioning — has matured to the point where a single engineer can deploy and operate a production inference stack.

The key decisions are:

Pick vLLM as your serving framework unless you have specific Hugging Face ecosystem requirements (then TGI) or need inference as part of a distributed pipeline (then Ray Serve).
Use time-slicing for dev, staging, and bursty workloads. Use MIG for multi-tenant production where memory isolation matters.
Scale on queue depth, not CPU. Standard HPA will waste GPUs and miss traffic spikes.
Never scale to zero. Cold starts kill user experience. Keep a warm pool.
Spot GPUs plus quantization can reduce your inference bill by 60-70% with negligible quality loss.

The gap between "calling an API" and "running your own inference" has narrowed to a single Kubernetes deploy. If your team already operates Kubernetes for application workloads, adding an inference deployment is a natural extension of your existing infrastructure skills — not a separate discipline.

For securing the Kubernetes clusters that host your inference workloads, see our Kubernetes security best practices guide. And for what happens when production inference goes down at 3 AM, our incident management and blameless postmortem guide will help your on-call team respond effectively.

Cost Optimization: Stop Burning Money on Idle GPUs

Most teams overspend on GPU inference by 50-70%. Here are four tactics to cut costs without degrading throughput.

Spot and Preemptible GPUs

Inference is stateless — no training checkpoint to lose. Spot GPUs cost 60-70% less than on-demand. Add a preStop hook to drain requests before termination:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 90"]

With a 90-second drain window, in-flight requests complete before the pod terminates, provided the pod's terminationGracePeriodSeconds is set above 90 (the default is 30).

Model Quantization

Quantization	Memory Reduction	Quality Impact
AWQ 4-bit	75%	Under 1% accuracy drop
GPTQ 4-bit	75%	1-3% accuracy drop
FP8	50%	Minimal (H100/B200 only)

A 70B model such as Llama 3.1 70B needs roughly 140 GB of GPU memory at FP16 (2x H100). AWQ 4-bit brings it to roughly 35 GB, which fits on a single H100. That halves the GPU count, and with it the compute bill for that model.

Bin Packing with GPU Sharing

Without GPU sharing, 10 services each using 15% of a GPU still consume 10 GPUs. With MIG, those same services share 2 GPUs:

Without sharing: 10 GPUs × $3.06/hr (A100 spot) = $22,000/month
With MIG: 2 GPUs × $2.48/hr (H100 spot) = $3,570/month — 84% savings
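The figures above can be reproduced with a short calculation. The sketch below assumes a 720-hour (30-day) month, which is what the rounded numbers imply; the hourly rates are the ones quoted in the comparison, and the function names are illustrative.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month, matching the rounded figures


def monthly_cost(gpu_count: int, hourly_rate_usd: float) -> float:
    """Monthly spend for a fleet of GPUs running around the clock."""
    return gpu_count * hourly_rate_usd * HOURS_PER_MONTH


def savings_fraction(before_usd: float, after_usd: float) -> float:
    """Fraction of the original bill that the change removes."""
    return 1 - after_usd / before_usd


if __name__ == "__main__":
    unshared = monthly_cost(10, 3.06)   # one GPU per service
    shared = monthly_cost(2, 2.48)      # ten services on two MIG-partitioned GPUs
    print(f"${unshared:,.0f} vs ${shared:,.0f}: "
          f"{savings_fraction(unshared, shared):.0%} saved")
```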

Right-Size Your GPU

A fine-tuned 7B model for internal tooling runs fine on an L4 ($0.40/hr spot). Do not pay for an H100 if you do not need one.

Monitoring LLM Inference: Metrics That Matter

Standard infrastructure metrics mislead you for LLM workloads. Monitor these instead.

vLLM Prometheus Metrics

Metric	Signal
`vllm:num_requests_waiting`	Queue depth, the primary autoscaling trigger
`vllm:time_to_first_token_seconds`	P50/P95/P99 TTFT
`vllm:gpu_cache_usage_perc`	KV cache utilization, an early warning for OOM
`vllm:request_success_total`	Throughput counter

NVIDIA DCGM Metrics

Metric	Alert When
`DCGM_FI_DEV_GPU_UTIL`	Sustained above 95%
`DCGM_FI_DEV_FB_USED` / total	Above 90% — OOM risk
`DCGM_FI_DEV_GPU_TEMP`	Above 85°C — throttling

Grafana Dashboard Essentials

Build panels for: request throughput (rate(vllm:request_success_total[5m])), P99 TTFT heatmap, queue depth, GPU memory pressure, and KV cache usage trend.

Alert on three conditions: queue depth above threshold for 3+ minutes, GPU memory above 92%, and P99 TTFT exceeding SLO. These three cover the failure modes that affect users.
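The "for 3+ minutes" qualifier matters: alerting on a single sample above the threshold would page on every brief burst. The sketch below shows one way to express a sustained condition over timestamped samples, similar in spirit to the for: clause of a Prometheus alerting rule. The function name and the sample data are hypothetical.

```python
def sustained_above(samples: list[tuple[float, float]], threshold: float,
                    min_duration_s: float) -> bool:
    """True if the most recent unbroken run above threshold spans min_duration_s.

    samples: (unix_timestamp, value) pairs in ascending time order.
    """
    run_start = None
    for timestamp, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = timestamp  # a new run above the threshold begins
        else:
            run_start = None           # any dip resets the run
    if run_start is None:
        return False
    return samples[-1][0] - run_start >= min_duration_s


if __name__ == "__main__":
    # Hypothetical queue-depth samples scraped every 60 seconds
    queue_depth = [(0, 2), (60, 6), (120, 7), (180, 9), (240, 8)]
    print(sustained_above(queue_depth, threshold=5, min_duration_s=180))
```

The same helper works for the GPU memory and TTFT conditions by swapping in the relevant series and threshold.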

For a deeper dive into SLO-based alerting and defining error budgets, read our error budgets SRE guide.
