Introduction
Running LLM inference at scale has become one of the most demanding infrastructure challenges of 2026. The default answer for most teams is to call an API: OpenAI, Anthropic, or managed model endpoints. But as inference volumes grow, that API bill can become the single largest line item in the cloud budget. A single quantized Llama 70B model running 24/7 on an A100-80GB costs roughly $4,745 per month on bare cloud GPU compute, while the equivalent per-token pricing from a managed API can easily reach $15,000 to $20,000 per month at production throughput.
This article is a hands-on guide to running open-source LLM inference on Kubernetes. You will learn GPU scheduling strategies, deploy vLLM with real YAML, configure autoscaling that actually works for inference workloads, solve the cold start problem, and cut your GPU bill by up to 70% using spot instances, model quantization, and GPU sharing.
By the end, you will have production-ready manifests to deploy Llama 4, Mistral, or DeepSeek on your own K8s cluster — with privacy, cost control, and the flexibility to tune models for agentic workflows rather than generic endpoints.
Why Run LLMs on Kubernetes Instead of Managed APIs?
Three reasons drive teams toward self-hosted LLM inference in 2026: privacy, cost, and control.
Privacy. Legal and compliance teams increasingly refuse to let customer data leave the VPC. Healthcare, fintech, and enterprise contracts routinely prohibit sending PII or proprietary data to third-party model endpoints. Running inference on your own Kubernetes cluster keeps data within your network boundary — a non-negotiable requirement for regulated industries.
Cost. Per-token pricing looks cheap at first but becomes punishing at scale. A production workload generating 500,000 tokens per hour — roughly 500 concurrent user sessions — can cost $8,000 to $15,000 per month on managed endpoints. The same workload on a dedicated A100 or H100 instance costs $2,000 to $4,500 per month. At 10x scale, the gap widens to hundreds of thousands of dollars annually. Most teams we have spoken with overspend 50-70% on GPU inference before migrating to self-hosted Kubernetes.
Control. Managed APIs serve general-purpose models with fixed inference parameters. If you are building agentic workflows — chains of reasoning steps with tool calls, retries, and function-calling loops — you need control over temperature, max tokens, stop sequences, and batch processing strategies. Self-hosting lets you tune the serving stack for your specific workload, not the lowest common denominator.
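The cost argument above comes down to simple arithmetic, sketched here in Python. The per-token price and the GPU hourly rate are hypothetical inputs chosen only to land inside the ranges quoted above; substitute your own quotes. The sketch also assumes a steady token rate and ignores engineering time and whether a single GPU can actually sustain the throughput.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12


def api_monthly_cost(tokens_per_hour: float, usd_per_million_tokens: float) -> float:
    """Monthly bill for a per-token API at a steady token rate."""
    tokens_per_month = tokens_per_hour * HOURS_PER_MONTH
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def gpu_monthly_cost(usd_per_gpu_hour: float, gpu_count: int = 1) -> float:
    """Monthly bill for GPUs that run around the clock."""
    return usd_per_gpu_hour * gpu_count * HOURS_PER_MONTH


def break_even_tokens_per_hour(usd_per_gpu_hour: float,
                               usd_per_million_tokens: float,
                               gpu_count: int = 1) -> float:
    """Steady token rate above which the dedicated GPUs cost less than the API."""
    hourly_gpu_cost = usd_per_gpu_hour * gpu_count
    return hourly_gpu_cost / usd_per_million_tokens * 1_000_000


# Hypothetical inputs: $30 per million tokens, one GPU at $4.00/hour.
api = api_monthly_cost(500_000, usd_per_million_tokens=30.0)
gpu = gpu_monthly_cost(usd_per_gpu_hour=4.0)
print(f"API: ${api:,.0f}/month, self-hosted: ${gpu:,.0f}/month")
print(f"Break-even: {break_even_tokens_per_hour(4.0, 30.0):,.0f} tokens/hour")
```

With these inputs the API bill is $10,950 per month against $2,920 for the GPU, and the break-even point sits near 133,000 tokens per hour. Below that rate the managed API is the cheaper option.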
GPU Scheduling on Kubernetes: Beyond Atomic Device Allocation
Kubernetes has treated GPUs as atomic, whole-device resources since the device plugin API was introduced. Requesting nvidia.com/gpu: 1 means one entire GPU is assigned to your pod, even if you only use 15% of its memory and compute. That model was fine for training jobs. For inference, it is a colossal waste.
NVIDIA GPU Operator: The Foundation
Every K8s cluster running LLM inference needs the NVIDIA GPU Operator. It installs GPU drivers, the Kubernetes device plugin, the Data Center GPU Manager (DCGM) for metrics export, and GPU Feature Discovery for node labeling — all via a single Helm chart.
# clusterpolicy.yaml: NVIDIA GPU Operator via ClusterPolicy
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  operator:
    defaultRuntime: containerd
  driver:
    enabled: true
    version: "570"
  toolkit:
    enabled: true
  devicePlugin:
    enabled: true
  dcgmExporter:
    enabled: true
    config:
      name: dcgm-metrics
  gfd:
    enabled: true
Apply this once per cluster. The operator handles driver upgrades, node reboots, and CUDA toolkit compatibility — saving you from the maintenance burden that used to make GPU clusters brittle.
GPU Sharing: Three Methods Compared
Once the operator is running, the question becomes: how do you share one physical GPU across multiple inference pods? Three methods exist, each with distinct tradeoffs.
MIG (Multi-Instance GPU). Hardware-level partitioning of the GPU itself. An H100 can be carved into seven independent 1g.10gb instances, each with its own isolated memory, cache, and compute pipeline. Memory is physically isolated, so no pod can overflow into another's memory. The downside: changing the MIG layout requires draining the affected workloads and reconfiguring the GPU, so treat it as a planned, infrequent change rather than a dynamic allocation. MIG is available only on select data-center GPUs, including the A100, H100, and B200.
Time-Slicing. Software-level multiplexing where pods take turns on the GPU. The device plugin exposes "virtual GPUs" as separate resources, but under the hood they share the same physical device. Time-slicing works on any NVIDIA GPU, requires no GPU reset to change, and is ideal for bursty inference workloads where no single request saturates the GPU. The critical limitation: no memory isolation. If one pod allocates too much GPU memory, it can OOM-kill other pods on the same device.
MPS (Multi-Process Service). Uses NVIDIA's MPS control daemon to let several processes run on one GPU concurrently. It provides partial isolation: the control daemon can enforce per-client memory and compute limits, but the clients still share one fault domain, so a fatal GPU error in a single misbehaving process can take the others down with it. That makes MPS better than time-slicing on memory limits but far weaker than MIG on fault isolation.
Here is the comparison at a glance:
| Method | Memory Isolation | GPU Models | Layout Change | Best For |
|---|---|---|---|---|
| MIG | Hardware-isolated | Select data-center GPUs (A100, H100, B200) | Requires GPU reset | Multi-tenant, strict isolation |
| Time-Slicing | None | All NVIDIA GPUs | Dynamic (no reset) | Bursty inference, dev/test |
| MPS | Partial | All NVIDIA GPUs | Requires process restart | Single-owner GPU sharing |
DRA (Dynamic Resource Allocation) is the K8s-native future direction. It allows pods to request GPU resources with more granularity than the device plugin model: think "256 MiB of GPU memory" rather than "1 GPU." DRA's core API reached beta in Kubernetes 1.32 and became generally available in 1.34, but GPU driver support for it is still maturing. Watch this space.
Model Serving Frameworks: vLLM, TGI, and Ray Serve
Three frameworks dominate self-hosted LLM inference on Kubernetes in 2026. Here is how they compare.
vLLM (Recommended)
vLLM is the de facto standard for production LLM serving on Kubernetes. Its killer feature is PagedAttention — a KV cache management algorithm that borrows from operating system virtual memory. Instead of allocating one contiguous block of GPU memory for the key-value cache, PagedAttention splits it into small fixed-size blocks allocated on demand. Fragmentation drops to nearly zero, and vLLM can serve far more concurrent requests per GPU than frameworks using naive contiguous allocation.
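To make the paging idea concrete, here is a toy block allocator in Python. It is a sketch of the concept only, not vLLM's implementation: the class name, the 16-token block size, and the pool size are assumptions made for illustration.

```python
class PagedKVCache:
    """Toy allocator: each sequence owns a list of fixed-size blocks, grown on demand."""

    def __init__(self, total_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(total_blocks))
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> physical block ids
        self.seq_lens: dict[str, int] = {}            # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        # A new block is needed only when every block the sequence holds is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


def contiguous_capacity(total_token_slots: int, max_model_len: int) -> int:
    """Sequences that fit if each one reserves max_model_len slots up front."""
    return total_token_slots // max_model_len


cache = PagedKVCache(total_blocks=64, block_size=16)  # 1,024 token slots in total
for seq in range(9):
    for _ in range(100):
        cache.append_token(f"seq-{seq}")

print(len(cache.free_blocks))          # 1 block left after nine 100-token sequences
print(contiguous_capacity(1024, 512))  # 2 sequences with 512 slots reserved each
```

Nine 100-token sequences fit in the paged pool because each one takes only seven blocks, while up-front reservation for a 512-token maximum length would admit two.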
The vLLM Production Stack — an official Helm chart, Kubernetes operator, KEDA-based autoscaling, and multi-model routing — reached version 0.1.11 in May 2026 with 22 releases and 2,400+ GitHub stars. It is the closest thing to a turnkey LLM inference platform on Kubernetes today.
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama-inference
  template:
    metadata:
      labels:
        app: llama-inference
    spec:
      # Tolerate the GPU taint if your GPU nodes carry one
      # (the Karpenter NodePool later in this article sets it)
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-3.1-70B-Instruct"
            - "--tensor-parallel-size"
            - "4"
            - "--max-model-len"
            - "8192"
            - "--gpu-memory-utilization"
            - "0.92"
            - "--enable-prefix-caching"
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 4
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300
            periodSeconds: 30
Key flags explained:
- --tensor-parallel-size 4: Splits the model across 4 GPUs using tensor parallelism. For a 70B model at FP16 (~140 GB), 4x A100-40GB or 2x H100-80GB is the minimum.
- --gpu-memory-utilization 0.92: Leaves 8% GPU memory headroom. Going higher risks OOM during request spikes.
- --enable-prefix-caching: Reuses KV cache blocks when prompts share a common prefix, which is critical for agentic workflows where the system prompt is repeated across tool-call loops.
- --max-model-len 8192: Caps context length. Longer contexts consume more KV cache memory. Set this to the smallest value your workload actually needs.
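A rough sizing sketch shows why these flags interact. The model shape below (80 layers, 8 KV heads, head size 128) is an assumption taken from the publicly listed Llama 70B configuration, and the budget ignores activations, CUDA graphs, and other runtime overhead, so treat the output as an upper bound rather than a measurement.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Keys and values for one token across all layers (the leading 2 is K plus V)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama 70B shape: 80 layers, 8 KV heads (grouped-query attention), head size 128.
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
per_sequence_gib = per_token * 8192 / 2**30

# Memory budget: 4 GPUs x 40 GB, 92% usable, minus 140 GB of FP16 weights.
usable_bytes = 4 * 40e9 * 0.92
weight_bytes = 70e9 * 2
kv_token_budget = int((usable_bytes - weight_bytes) // per_token)

print(per_token)         # 327680 bytes, i.e. 320 KiB per token
print(per_sequence_gib)  # 2.5 GiB for one full 8,192-token sequence
print(kv_token_budget)   # roughly 22 thousand cached tokens across all requests
```

Under these assumptions, 4x A100-40GB leaves room for fewer than three full-length sequences at once, which is why it is only the minimum and why lowering --max-model-len or quantizing the weights pays off so quickly.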
TGI (Text Generation Inference)
Hugging Face's TGI is vLLM's primary competitor. It offers solid throughput, native Hugging Face Hub integration, and output watermarking support. TGI's strength is that ecosystem integration: if your team already standardizes on HF models and libraries, TGI reduces friction. Its GPU memory efficiency is comparable to vLLM for most workloads, though it tends to trail slightly on throughput at very high request concurrency, so benchmark both on your own traffic before committing.
Ray Serve
Ray Serve is the right choice when inference is one component of a larger distributed pipeline that includes pre-processing, retrieval (RAG), ranking, and post-processing. Ray's actor model lets you co-locate these steps on the same GPU node, avoiding network hops between inference and pre-processing. For standalone inference serving, vLLM or TGI is simpler and more performant.
Auto-Scaling for LLM Inference: Why CPU-Based HPA Is Wrong
The standard Kubernetes Horizontal Pod Autoscaler (HPA) scales on CPU and memory utilization by default. For LLM inference, those are the wrong signals: CPU usage says little about GPU load, and a GPU that is busy serving requests runs at high utilization as its normal, desired state. The metric that matters is request queue depth: how many requests are waiting because all current replicas are busy.
vLLM exposes a Prometheus metric vllm:num_requests_waiting that tracks exactly this. KEDA can map it to a scaling trigger:
# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-inference-autoscaler
spec:
  scaleTargetRef:
    name: llama-inference
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: vllm_queue_depth
        # model_name must match the name vLLM serves the model under
        # (the --model value unless --served-model-name overrides it)
        query: |
          sum(vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-70B-Instruct"})
        threshold: "5"
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Percent
              value: 50
              periodSeconds: 60
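With threshold: "5", KEDA asks the HPA to keep the average number of queued requests per replica at or below five, so the deployment scales up once the total queue exceeds five times the current replica count. The scaleDown stabilizationWindowSeconds: 300 prevents replica thrashing: if a cold start takes 120 seconds, you do not want the HPA scaling down before the new replica is warm.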
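The replica count that results from a queue-depth trigger can be approximated in a few lines of Python. This is a simplified sketch that assumes the trigger is evaluated as an average-per-replica target (KEDA's default metric type); it ignores the HPA's tolerance band and stabilization windows, and the function name is made up for illustration.

```python
import math


def desired_replicas(total_waiting: float, threshold: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Replicas needed so the queue, averaged per replica, stays at or below threshold."""
    raw = math.ceil(total_waiting / threshold)
    return max(min_replicas, min(max_replicas, raw))


for waiting in (0, 4, 12, 80):
    print(waiting, desired_replicas(waiting, threshold=5))
```

With the bounds from the ScaledObject above, an empty queue stays at the one-replica floor, 12 waiting requests ask for 3 replicas, and 80 waiting requests are capped at the 10-replica ceiling.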
Node-Level Autoscaling with Karpenter
GPU nodes are expensive. You do not want idle GPU instances running when no inference workloads are scheduled. Karpenter provisions GPU nodes on demand and terminates them when empty:
# karpenter-nodepool-gpu.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - p4d.24xlarge  # 8x A100-40GB
            - p5.48xlarge   # 8x H100-80GB
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-node-class
  limits:
    nvidia.com/gpu: 16
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
This NodePool provisions GPU instances as spot first, falling back to on-demand. The consolidateAfter: 5m setting gives Karpenter five minutes before it replaces an underutilized node — preventing churn when inference pods are briefly idle between request bursts.
The Cold Start Problem: Why Your Inference Pod Takes Five Minutes to Start
LLM inference pods have the worst cold start in all of Kubernetes infrastructure. Three phases contribute:
- Container image pull (30-90 seconds). The vLLM image is large, usually 4-8 GB. Caching it on nodes via a DaemonSet pre-puller or using a container registry with local node caching helps.
- Model weight download and deserialization (60-180 seconds). A 70B model in FP16 weighs roughly 140 GB. If pulled from Hugging Face Hub on every pod start, this dominates cold start time. The solution: pre-load models onto a ReadWriteMany PersistentVolume (EFS, Filestore, or CephFS) and mount it into the pod.
- GPU warmup (10-30 seconds). After weights are loaded into GPU memory, vLLM runs a warmup forward pass to initialize CUDA graphs. This is fast but adds latency before the readiness probe passes.
# PVC for model caching
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 500Gi
---
# Pod volume mount (fragment: volumes goes under the pod spec,
# volumeMounts under the vLLM container)
volumes:
  - name: models
    persistentVolumeClaim:
      claimName: model-cache
volumeMounts:
  - name: models
    mountPath: /models
Configure vLLM to load from the shared volume:
--model /models/Llama-3.1-70B-Instruct --download-dir /models
Probe tuning is critical. If the liveness probe starts failing before the model has finished loading, Kubernetes restarts the container and the entire cold start begins again. A failing readiness probe is gentler: it only keeps the pod out of the Service endpoints until the model is ready. Set initialDelaySeconds generously on both:
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120  # 2-minute delay for model load
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 300  # 5 minutes: a failing liveness probe restarts the container
  periodSeconds: 30
Never scale to zero. Keep a warm pool of at least one replica. A scale-from-zero cold start can take five minutes — long enough for users to assume your service is down. If you are using Karpenter spot instances, ensure at least one on-demand node is always available to host the warm replica.
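The five-minute figure follows directly from the three cold start phases listed earlier, as this small Python sketch shows. The storage throughput is a hypothetical value, the restart estimate assumes the Kubernetes default failureThreshold of 3, and the helper names are illustrative.

```python
# Phase ranges in seconds, as quoted in this section.
PHASES_SECONDS = {
    "image_pull": (30, 90),
    "weight_load": (60, 180),
    "gpu_warmup": (10, 30),
}

best_case = sum(low for low, _ in PHASES_SECONDS.values())
worst_case = sum(high for _, high in PHASES_SECONDS.values())


def weight_load_seconds(model_gb: float, read_gb_per_s: float) -> float:
    """Time to read the weights from storage at a sustained throughput."""
    return model_gb / read_gb_per_s


def earliest_liveness_restart(initial_delay_s: int, period_s: int,
                              failure_threshold: int = 3) -> int:
    """Earliest restart if the health endpoint never comes up (ignores kubelet jitter)."""
    return initial_delay_s + period_s * (failure_threshold - 1)


print(best_case, worst_case)               # 100 300
print(weight_load_seconds(140, 1.0))       # 140.0 seconds at an assumed 1 GB/s
print(earliest_liveness_restart(300, 30))  # 360
```

The worst-case cold start of 300 seconds stays under the 360-second point at which the liveness probe above could first restart the container, but only by a minute. If your weights load more slowly than assumed here, raise the liveness delay accordingly.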
Cost Optimization: Cutting Your GPU Bill by 70%
GPU inference on Kubernetes can be dramatically cheaper than managed APIs — but only if you apply these four strategies.
Spot and Preemptible GPUs
Inference is stateless. If a spot instance is reclaimed mid-request, the client retries. There is no model checkpoint to lose. Spot GPUs save 60-70% versus on-demand pricing. Use Karpenter with a spot-first NodePool (shown above) and a preStop hook to drain in-flight requests gracefully:
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # Let in-flight requests complete before termination
          sleep 90
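AWS sends a two-minute termination notice for spot instances. The 90-second sleep gives in-flight requests time to complete and new requests time to route away. Set terminationGracePeriodSeconds on the pod to more than 90 seconds (the default is 30); otherwise Kubernetes kills the container before the sleep finishes.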
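The drain window a pod really gets is the smallest of three timers, which is easy to check with a short Python sketch. It assumes the node starts draining as soon as the spot notice arrives and that the grace period countdown includes the time spent in the preStop hook; the function name is illustrative.

```python
def effective_drain_seconds(prestop_sleep_s: int, grace_period_s: int,
                            notice_s: int = 120) -> int:
    """Seconds in-flight requests actually get before the container is killed."""
    return min(prestop_sleep_s, grace_period_s, notice_s)


print(effective_drain_seconds(90, 30))    # 30: the default grace period cuts the sleep short
print(effective_drain_seconds(90, 110))   # 90: the full preStop sleep runs
print(effective_drain_seconds(150, 180))  # 120: the spot notice itself is the limit
```

The first case is the common trap: a 90-second preStop sleep has no effect beyond 30 seconds unless the pod's grace period is raised above it.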
Model Quantization
AWQ 4-bit quantization reduces GPU memory requirements by roughly 75% with under 1% quality loss on standard benchmarks. A 70B model that requires 4x A100-40GB at FP16 fits on a single A100-80GB at AWQ 4-bit. GPTQ 4-bit achieves similar memory savings with a 1-3% quality tradeoff. Both are supported natively in vLLM:
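--model <path-or-repo-of-your-AWQ-checkpoint> --quantization awq
(Replace the placeholder with an AWQ-quantized checkpoint of your model, either a Hugging Face repository you trust or a local path such as one under /models.)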
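The memory figures above follow from the number of bits stored per weight. The Python sketch below reproduces them; it counts weights only, ignores the small overhead of quantization scales, and uses a 92% usable-memory fraction as an assumption borrowed from the vLLM flags earlier in the article.

```python
import math


def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight size in GB: parameters x bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


def min_gpus(weights_gb: float, gpu_gb: float, usable_fraction: float = 0.92) -> int:
    """Fewest GPUs whose usable memory can hold the weights (no KV cache counted)."""
    return math.ceil(weights_gb / (gpu_gb * usable_fraction))


fp16 = weight_gb(70, 16)
awq4 = weight_gb(70, 4)
print(fp16, awq4)              # 140.0 35.0
print(1 - awq4 / fp16)         # 0.75
print(min_gpus(fp16, 40))      # 4 (A100-40GB)
print(min_gpus(awq4, 80))      # 1 (A100-80GB)
```

The counts match the hardware named above, but they are floors: the KV cache needs additional memory on top of the weights.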
GPU Bin Packing
Without GPU sharing, ten services each using 15% of a GPU still consume ten whole GPUs — a 90% waste rate. Time-slicing or MIG partitions let you pack multiple small inference workloads onto one physical GPU:
# Time-slicing config in NVIDIA Device Plugin
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4
With replicas: 4, each physical GPU is exposed as four "virtual GPUs," allowing up to four inference pods to share one device — provided their memory requirements fit within the total GPU memory.
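Because time-slicing does not enforce memory limits, it helps to check a packing plan before relying on the scheduler. The Python sketch below is a planning aid only, using a first-fit-decreasing heuristic; the pod sizes are hypothetical, and Kubernetes itself performs no such memory-aware placement for time-sliced GPUs.

```python
def pack_pods(pod_mem_gib: list[float], gpu_mem_gib: float,
              slots_per_gpu: int) -> list[list[float]]:
    """First-fit-decreasing packing limited by both slot count and GPU memory."""
    gpus: list[list[float]] = []
    for mem in sorted(pod_mem_gib, reverse=True):
        if mem > gpu_mem_gib:
            raise ValueError(f"a {mem} GiB pod cannot fit on a {gpu_mem_gib} GiB GPU")
        for gpu in gpus:
            if len(gpu) < slots_per_gpu and sum(gpu) + mem <= gpu_mem_gib:
                gpu.append(mem)
                break
        else:
            gpus.append([mem])  # no existing GPU had room, so start a new one
    return gpus


# Ten small services at 6 GiB each (15% of a 40 GiB GPU), four slots per GPU.
plan = pack_pods([6] * 10, gpu_mem_gib=40, slots_per_gpu=4)
print(len(plan))  # 3 GPUs instead of 10

# Mixed sizes: memory, not the slot count, decides the layout here.
print(pack_pods([20, 20, 10, 10, 6, 6], gpu_mem_gib=40, slots_per_gpu=4))
```

In the first case the slot count is the binding constraint; raising replicas to 6 would fit the same ten pods on two GPUs, since six 6 GiB pods still sit inside 40 GiB.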
The GPU Waste Tax
If you are not using MIG or time-slicing, measure your cluster's GPU utilization with DCGM metrics. Many teams discover their GPUs average 20-35% utilization in production — a direct result of the "one GPU per pod" model. Moving to GPU sharing can cut your GPU count in half without reducing throughput.
Monitoring LLM Serving: Metrics That Matter
Your inference stack needs monitoring on two levels: the model serving layer and the GPU hardware layer.
vLLM Prometheus Metrics
vLLM exposes rich Prometheus metrics at :8000/metrics. The essential ones:
| Metric | What It Tells You |
|---|---|
| vllm:time_per_output_token_seconds | P50/P95/P99 token generation latency |
| vllm:num_requests_waiting | Queue depth: your primary autoscaling signal |
| vllm:num_requests_running | Active requests being processed |
| vllm:gpu_cache_usage_perc | KV cache utilization: approaching 100% means you need more GPUs or shorter context |
| vllm:request_success_total | Successful completions |
| vllm:prompt_tokens_total | Input token throughput |
| vllm:generation_tokens_total | Output token throughput |
NVIDIA DCGM Metrics
The GPU Operator's DCGM Exporter provides hardware-level metrics:
- DCGM_FI_DEV_GPU_UTIL: GPU compute utilization
- DCGM_FI_DEV_FB_USED: GPU memory used (framebuffer)
- DCGM_FI_DEV_GPU_TEMP: GPU temperature (rising temps indicate thermal throttling)
- DCGM_FI_DEV_POWER_USAGE: Power draw in watts
Grafana Dashboard Essentials
Build a Grafana dashboard with these four panels for every inference deployment:
- Token throughput (input + output tokens/sec): Your north-star metric. If throughput plateaus while queue depth rises, scale out.
- P95 time per output token: Latency as experienced by users. Spikes indicate KV cache pressure or GPU memory pressure.
- GPU memory utilization: Track both framebuffer usage (DCGM) and KV cache usage (vLLM). Memory usage that climbs over time without releasing signals a memory leak.
- Request queue depth: Overlaid with replica count to validate that autoscaling is working. Queue depth rising while replicas are flat means your KEDA trigger may be misconfigured.
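The P95 panel is derived from a Prometheus histogram, so the value on the dashboard is an estimate interpolated between bucket boundaries. The Python sketch below mirrors the linear interpolation that Prometheus's histogram_quantile performs; the bucket boundaries and counts are invented for illustration and are not vLLM's actual buckets.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for index, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the overflow bucket: report the highest finite bound.
                return buckets[index - 1][0] if index > 0 else math.nan
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical time-per-output-token buckets: (upper bound in seconds, cumulative count).
tpot = [(0.01, 10), (0.025, 60), (0.05, 90), (0.1, 99), (math.inf, 100)]
print(histogram_quantile(0.50, tpot))
print(histogram_quantile(0.95, tpot))
```

For these buckets the median lands near 22 ms and the P95 near 78 ms. The estimate can never be more precise than the bucket widths, which is worth remembering when a latency SLO sits inside a wide bucket.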
Conclusion
Running LLM inference on Kubernetes in 2026 is the financially responsible choice for any team serving more than a few hundred thousand tokens per day. The tooling — vLLM with PagedAttention, NVIDIA GPU Operator for driver lifecycle management, KEDA for inference-aware autoscaling, and Karpenter for spot-first GPU node provisioning — has matured to the point where a single engineer can deploy and operate a production inference stack.
The key decisions are:
- Pick vLLM as your serving framework unless you have specific Hugging Face ecosystem requirements (then TGI) or need inference as part of a distributed pipeline (then Ray Serve).
- Use time-slicing for dev, staging, and bursty workloads. Use MIG for multi-tenant production where memory isolation matters.
- Scale on queue depth, not CPU. Standard HPA will waste GPUs and miss traffic spikes.
- Never scale to zero. Cold starts kill user experience. Keep a warm pool.
- Spot GPUs plus quantization can reduce your inference bill by 60-70% with negligible quality loss.
The gap between "calling an API" and "running your own inference" has narrowed to a single Kubernetes deploy. If your team already operates Kubernetes for application workloads, adding an inference deployment is a natural extension of your existing infrastructure skills — not a separate discipline.
For securing the Kubernetes clusters that host your inference workloads, see our Kubernetes security best practices guide. And for what happens when production inference goes down at 3 AM, our incident management and blameless postmortem guide will help your on-call team respond effectively.
Cost Optimization: Stop Burning Money on Idle GPUs
Most teams overspend on GPU inference by 50-70%. Here are four tactics to cut costs without degrading throughput.
Spot and Preemptible GPUs
Inference is stateless — no training checkpoint to lose. Spot GPUs cost 60-70% less than on-demand. Add a preStop hook to drain requests before termination:
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 90"]
With a 90-second drain window, in-flight requests complete before the pod terminates.
Model Quantization
| Quantization | Memory Reduction | Quality Impact |
|---|---|---|
| AWQ 4-bit | 75% | Under 1% accuracy drop |
| GPTQ 4-bit | 75% | 1-3% accuracy drop |
| FP8 | 50% | Minimal (requires FP8-capable GPUs such as H100 or B200) |
A 70B model such as Llama 3.1 70B needs roughly 140GB of GPU memory at FP16 (2x H100). AWQ 4-bit brings the weights down to roughly 35GB, which fits on a single H100. That halves the GPU count, a 50% reduction in GPU cost before any other optimization.
Bin Packing with GPU Sharing
Without GPU sharing, 10 services each using 15% of a GPU still consume 10 GPUs. With MIG, those same services share 2 GPUs:
- Without sharing: 10 GPUs × $3.06/hr (A100 spot) = $22,000/month
- With MIG: 2 GPUs × $2.48/hr (H100 spot) = $3,570/month — 84% savings
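The two monthly totals above can be reproduced with a few lines of Python. The hourly prices are the ones quoted in the list; the 720-hour month (30 days) is an assumption inferred from the totals, since a 730-hour month would give slightly higher figures.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours, the month length the figures above imply


def monthly_cost(gpu_count: int, usd_per_hour: float) -> float:
    """Cost of running gpu_count GPUs around the clock for one month."""
    return gpu_count * usd_per_hour * HOURS_PER_MONTH


without_sharing = monthly_cost(10, 3.06)  # ten whole A100s at the quoted spot price
with_mig = monthly_cost(2, 2.48)          # two MIG-partitioned H100s at the quoted spot price
savings = 1 - with_mig / without_sharing

print(f"${without_sharing:,.0f} vs ${with_mig:,.0f} per month, {savings:.0%} saved")
```

This prints $22,032 against $3,571 and 84% saved, matching the rounded figures in the list.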
Right-Size Your GPU
A fine-tuned 7B model for internal tooling runs fine on an L4 ($0.40/hr spot). Do not pay for an H100 if you do not need one.
Monitoring LLM Inference: Metrics That Matter
Standard infrastructure metrics mislead you for LLM workloads. Monitor these instead.
vLLM Prometheus Metrics
| Metric | Signal |
|---|---|
| vllm:num_requests_waiting | Queue depth: primary autoscaling trigger |
| vllm:time_to_first_token_seconds | P50/P95/P99 TTFT |
| vllm:gpu_cache_usage_perc | KV cache utilization: OOM early warning |
| vllm:request_success_total | Throughput counter |
NVIDIA DCGM Metrics
| Metric | Alert When |
|---|---|
| DCGM_FI_DEV_GPU_UTIL | Sustained above 95% |
| DCGM_FI_DEV_FB_USED / total | Above 90% (OOM risk) |
| DCGM_FI_DEV_GPU_TEMP | Above 85°C (throttling) |
Grafana Dashboard Essentials
Build panels for: request throughput (rate(vllm:request_success_total[5m])), P99 TTFT heatmap, queue depth, GPU memory pressure, and KV cache usage trend.
Alert on three conditions: queue depth above threshold for 3+ minutes, GPU memory above 92%, and P99 TTFT exceeding SLO. These three cover the failure modes that affect users.
For a deeper dive into SLO-based alerting and defining error budgets, read our error budgets SRE guide.