Running large language model inference in production means juggling latency SLAs, GPU memory constraints, and cloud costs. Off-the-shelf API services charge a premium per token, and they lock you into their model availability and data handling policies. A self-hosted solution built on vLLM and Kubernetes gives you direct control over throughput, cost, and privacy — but the path from a local test script to a production-grade autoscaling cluster has sharp edges. This guide walks through the exact architecture, configuration, and operational knobs needed to serve open-weight LLMs with sub-200ms latency while keeping GPU utilization above 70 percent.
vLLM (Virtual Large Language Model) is an inference engine developed at UC Berkeley that achieves up to 24x higher throughput than Hugging Face Transformers on the same hardware. The critical innovation is PagedAttention, a memory management technique that handles the key-value cache in fixed-size blocks instead of contiguous chunks. This eliminates the fragmentation that forces other frameworks to waste up to 60 percent of GPU memory.
For a self-hosted setup, vLLM offers three concrete advantages: an OpenAI-compatible API server, so existing client code only needs a base-URL change; continuous batching, which admits new requests into the running batch instead of waiting for the current one to drain; and PagedAttention's block-based KV cache, which packs far more concurrent sequences onto the same GPU.
A common mistake is deploying vLLM without tuning the max-model-len and max-num-seqs parameters. If you leave max-model-len at the model's full context window (say 32k tokens) but your average request is only 1024 tokens, you reserve KV-cache headroom for sequences that never arrive. On an 80GB A100, reducing max-model-len from 32k to 8k for Llama 3.1 70B increases serving capacity from 4 concurrent requests to 16.
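As a concrete sketch, the tuned launch flags might look like the following container args (the model ID, tensor-parallel degree, and limits are illustrative assumptions, not measured recommendations):

# Illustrative flags for a tuned 70B deployment; adjust tensor parallelism
# and the sequence limits to your GPUs and traffic profile.
args: ["--model", "meta-llama/Llama-3.1-70B-Instruct",
       "--tensor-parallel-size", "4",
       "--max-model-len", "8192",
       "--max-num-seqs", "64",
       "--gpu-memory-utilization", "0.90"]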
Vanilla Kubernetes does not understand GPUs. You need the NVIDIA GPU Operator installed on the cluster to expose GPU resources as schedulable quantities. Without it, pods requesting nvidia.com/gpu: 1 will sit in Pending indefinitely because no node advertises the resource.
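A quick way to confirm the operator works is a throwaway pod that requests one GPU and runs nvidia-smi. A minimal sketch (the pod name and CUDA image tag are assumptions; any CUDA base image will do):

# One-shot GPU smoke test: schedules onto a GPU node, prints the device
# table, and exits. If it stays Pending, the GPU Operator is not healthy.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1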
Create at least two node pools: one for GPU workloads and one for CPU-only services (like the API gateway or Prometheus). For the GPU pool, use a single instance type per pool to avoid unpredictable scheduling behavior. On AWS, p4d.24xlarge (8x A100) gives the best price-to-performance for serving large models, while g5.12xlarge (4x A10G) can handle smaller models cost-effectively. On GCP, G2 instances with L4 GPUs offer a solid middle ground at roughly $1.20 per GPU-hour versus $4.50 for A100. A declarative sketch of such a pool definition follows the toleration snippet below.
Add a taint like gpu-node=true:NoSchedule to the GPU nodes so that only inference pods with the matching toleration land on them. This prevents a misconfigured logging agent from consuming GPU memory:
tolerations:
- key: "gpu-node"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
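If you provision with eksctl, the GPU pool and its taint can be declared together so every new node comes up pre-tainted. A sketch under assumed names, sizes, and instance types:

# Hypothetical eksctl config: one CPU pool for gateways and monitoring,
# one tainted GPU pool that only pods with the matching toleration can use.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: llm-inference
  region: us-east-1
managedNodeGroups:
- name: cpu-services
  instanceType: m6i.2xlarge
  minSize: 2
  maxSize: 4
- name: gpu-a10g
  instanceType: g5.12xlarge
  minSize: 1
  maxSize: 8
  labels:
    workload: llm-inference
  taints:
  - key: gpu-node
    value: "true"
    effect: NoSchedule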
The vLLM Docker image (vllm/vllm-openai:latest) bundles the OpenAI-compatible API server. You launch it with environment variables and command-line flags, not a configuration file. The critical parameters are --model (the Hugging Face model ID to serve), --max-model-len (the longest sequence the scheduler will admit), --max-num-seqs (the cap on concurrently running sequences), and --gpu-memory-utilization (the fraction of GPU memory vLLM claims for weights plus KV cache).
Here is a representative Deployment YAML snippet for Llama 3 8B on a single A10G:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-8b
  template:
    metadata:
      labels:
        app: vllm-llama3-8b
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        # meta-llama weights are gated: supply a Hugging Face token,
        # e.g. via the HUGGING_FACE_HUB_TOKEN environment variable.
        args: ["--model", "meta-llama/Meta-Llama-3-8B-Instruct",
               "--trust-remote-code",
               "--max-model-len", "4096",
               "--max-num-seqs", "256",
               "--gpu-memory-utilization", "0.90"]
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000
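Clients and the load tests later in this guide reach the pods through a Service. A minimal ClusterIP Service matching the labels above (the name is an assumption; it is what stands in for your-service in the test commands):

apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b
spec:
  selector:
    app: vllm-llama3-8b
  ports:
  - port: 80           # cluster-facing port
    targetPort: 8000   # vLLM's serving port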
CPU-based autoscaling metrics (like CPU utilization) are useless for LLM inference; the GPU is the bottleneck, not the CPU. Instead, scale on the number of waiting requests in vLLM's internal scheduler queue. vLLM exposes Prometheus metrics at /metrics on the serving port (8000 here), including vllm:num_requests_waiting.
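Prometheus has to scrape those pods before any of this works. If your Prometheus uses the common annotation-based discovery (an assumption; with the Prometheus Operator you would define a ServiceMonitor or PodMonitor instead), the Deployment's pod template can advertise the endpoint:

# Conventional scrape annotations on the pod template; only honored if your
# Prometheus scrape config is set up to look for them.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8000"
    prometheus.io/path: "/metrics"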
Install the Prometheus Adapter (https://github.com/kubernetes-sigs/prometheus-adapter) and configure it to expose the waiting-requests metric. In the adapter configuration, define a series that targets the vLLM namespace and pod label:
rules:
- seriesQuery: 'vllm:num_requests_waiting'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "(.*)"
    as: "waiting_requests"
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
Then create a HorizontalPodAutoscaler that scales the deployment when the average waiting requests across all pods exceeds 10:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama3-8b
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: waiting_requests
      target:
        type: AverageValue
        averageValue: "10"
Test this setup with a load generator. If the target value is too low, the HPA scales up prematurely, leaving you with many under-utilized replicas across which the request load is fragmented. A value between 5 and 15 works for most scenarios; tune it against your latency target.
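If scaling still flaps under bursty traffic, the autoscaling/v2 behavior block gives you direct control over how quickly the HPA reacts. A sketch with illustrative windows, appended to the HPA spec above:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 60    # require a sustained queue before adding pods
    policies:
    - type: Pods
      value: 2                        # add at most 2 replicas per minute
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300   # hold capacity for 5 minutes after load drops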
Spot instances on AWS and preemptible VMs on GCP reduce GPU costs by 60 to 80 percent. The catch: they can be reclaimed with two minutes of warning. For inference workloads, implement a two-step strategy: drain in-flight requests gracefully when a node is reclaimed, and spread replicas so a single reclaim cannot take them all out. The draining half is a preStop hook:
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15 && kill 1"]
The 15-second sleep lets vLLM drain active requests; Kubernetes sends SIGTERM only after the preStop hook returns, so in-flight generations get those extra seconds to finish. Combine this with a readiness probe that returns 200 only when the model is fully loaded. The second half of the strategy is placement: spread replicas across availability zones with topology spread constraints or pod anti-affinity so a single zone reclaim event does not take down all pods.
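A minimal spread constraint for the pod template above, assuming the app: vllm-llama3-8b label:

# Balance replicas across zones; ScheduleAnyway avoids blocking scheduling
# when only one zone currently has spare GPU capacity.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: vllm-llama3-8b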
Standard Kubernetes monitoring (CPU, memory, network) tells you nothing about whether the GPU is actually working. You need NVIDIA's DCGM Exporter (https://github.com/NVIDIA/dcgm-exporter) deployed as a DaemonSet on GPU nodes. It exposes metrics like DCGM_FI_DEV_GPU_UTIL (GPU utilization), DCGM_FI_DEV_FB_USED (framebuffer memory in use), DCGM_FI_DEV_POWER_USAGE (power draw), and DCGM_FI_DEV_GPU_TEMP (temperature).
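To defend the 70 percent utilization target from the introduction, it helps to alert when the fleet drifts below it. A sketch using the Prometheus Operator's PrometheusRule resource (the rule name and threshold are assumptions):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-rules
spec:
  groups:
  - name: gpu-inference
    rules:
    - alert: GPUUnderutilized
      expr: avg(DCGM_FI_DEV_GPU_UTIL) < 70   # fleet-wide average below target
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Average GPU utilization has been below 70% for 15 minutes"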
On the vLLM side, the metric vllm:avg_generation_throughput_toks_per_s reports the actual token output rate. For Llama 3 70B on a single A100 with continuous batching, expect roughly 40-60 tokens per second per request at batch size 8. If you see single-digit throughput, check whether the scheduler is swapping KV cache to CPU memory, indicated by a high vllm:cpu_cache_usage_perc.
Rolling out a new model version requires more care than a typical container update. When the new deployment pod starts, it loads the model weights into GPU memory — which takes 30 seconds for a 7B model and up to 3 minutes for a 70B model on NVMe storage. During this time, the pod is not ready to serve traffic, and the HPA may interpret the lack of requests as a signal to scale down.
Solve this with a two-phase startup: let the pod load weights while it is still unready, then flip it into service only once the health endpoint answers. The readiness probe handles the transition:
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 10
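For 70B-class models, where loading can take several minutes, a fixed initialDelaySeconds is either wasteful or too short. A startupProbe with a generous failureThreshold adapts to the actual load time (values are illustrative):

# Polls /health every 10 seconds and allows up to 5 minutes of startup;
# the readiness probe takes over once the startup probe succeeds.
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 30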
For frequent model updates, use a tag-based image version (e.g., vllm/vllm-openai:llama3.1-8b-v2 for an image you build and push yourself) rather than the latest tag. This gives the Deployment a deterministic rollout trigger: when the image tag changes, the old pods are terminated through the preStop hook, and new pods start with the updated weights.
Start by deploying a single-replica vLLM pod on one GPU node and run a stress test with the hey load generator. The /v1/chat/completions endpoint expects a POST with a JSON body, for example: hey -z 60s -c 20 -m POST -T application/json -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "ping"}]}' http://your-service/v1/chat/completions. Measure the 95th percentile latency. If it exceeds 500ms, reduce max-num-seqs in steps of 32 until it falls into range. Once latency is stable, enable the HPA and repeat the test with 50 concurrent users. Watch the HPA scale from 1 to 3 replicas within 90 seconds. That moment, when the system self-adjusts to demand without manual intervention, is the signal that your self-hosted inference server is ready for production traffic.