Running large language model inference in production means juggling latency SLAs, GPU memory constraints, and cloud costs. Off-the-shelf API services charge a premium per token, and they lock you into their model availability and data handling policies. A self-hosted solution built on vLLM and Kubernetes gives you direct control over throughput, cost, and privacy — but the path from a local test script to a production-grade autoscaling cluster has sharp edges. This guide walks through the exact architecture, configuration, and operational knobs needed to serve open-weight LLMs with sub-200ms latency while keeping GPU utilization above 70 percent.
vLLM (Virtual Large Language Model) is an inference engine developed at UC Berkeley that achieves up to 24x higher throughput than Hugging Face Transformers on the same hardware. The critical innovation is PagedAttention, a memory management technique that handles the key-value cache in fixed-size blocks instead of contiguous chunks. This eliminates the fragmentation that forces other frameworks to waste up to 60 percent of GPU memory.
For a self-hosted setup, vLLM offers three concrete advantages: an OpenAI-compatible API server, so existing client code only needs a base-URL change; continuous batching, which admits new requests into the running batch instead of waiting for the current one to drain; and PagedAttention's block-based KV cache, which packs far more concurrent sequences onto the same GPU.
A common mistake is deploying vLLM without tuning the max-model-len and max-num-seqs parameters. If you leave max-model-len at the model's full context window (say 32k tokens) but your average request is only 1024 tokens, you reserve KV-cache headroom for sequences that never arrive. On an 80GB A100, reducing max-model-len from 32k to 8k for Llama 3.1 70B increases serving capacity from 4 concurrent requests to 16.
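As a concrete sketch, the tuned launch flags might look like the following container args (the model ID, tensor-parallel degree, and limits are illustrative assumptions, not measured recommendations):

# Illustrative flags for a tuned 70B deployment; adjust tensor parallelism
# and the sequence limits to your GPUs and traffic profile.
args: ["--model", "meta-llama/Llama-3.1-70B-Instruct",
       "--tensor-parallel-size", "4",
       "--max-model-len", "8192",
       "--max-num-seqs", "64",
       "--gpu-memory-utilization", "0.90"]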
Vanilla Kubernetes does not understand GPUs. You need the NVIDIA GPU Operator installed on the cluster to expose GPU resources as schedulable quantities. Without it, pods requesting nvidia.com/gpu: 1 will sit in Pending indefinitely because no node advertises the resource.
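A quick way to confirm the operator works is a throwaway pod that requests one GPU and runs nvidia-smi. A minimal sketch (the pod name and CUDA image tag are assumptions; any CUDA base image will do):

# One-shot GPU smoke test: schedules onto a GPU node, prints the device
# table, and exits. If it stays Pending, the GPU Operator is not healthy.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1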
Create at least two node pools: one for GPU workloads and one for CPU-only services (like the API gateway or Prometheus). For the GPU pool, use a single instance type per pool to avoid unpredictable scheduling behavior. On AWS, p4d.24xlarge (8x A100) gives the best price-to-performance for serving large models, while g5.12xlarge (4x A10G) can handle smaller models cost-effectively. On GCP, G2 instances with L4 GPUs offer a solid middle ground at roughly $1.20 per GPU-hour versus $4.50 for A100. A declarative sketch of such a pool definition follows the toleration snippet below.
Add a taint like gpu-node=true:NoSchedule to the GPU nodes so that only inference pods with the matching toleration land on them. This prevents a misconfigured logging agent from consuming GPU memory:
tolerations:
- key: "gpu-node"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
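If you provision with eksctl, the GPU pool and its taint can be declared together so every new node comes up pre-tainted. A sketch under assumed names, sizes, and instance types:

# Hypothetical eksctl config: one CPU pool for gateways and monitoring,
# one tainted GPU pool that only pods with the matching toleration can use.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: llm-inference
  region: us-east-1
managedNodeGroups:
- name: cpu-services
  instanceType: m6i.2xlarge
  minSize: 2
  maxSize: 4
- name: gpu-a10g
  instanceType: g5.12xlarge
  minSize: 1
  maxSize: 8
  labels:
    workload: llm-inference
  taints:
  - key: gpu-node
    value: "true"
    effect: NoSchedule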
The vLLM Docker image (vllm/vllm-openai:latest) bundles the OpenAI-compatible API server. You launch it with environment variables and command-line flags, not a configuration file. The critical parameters are --model (the Hugging Face model ID to serve), --max-model-len (the longest sequence the scheduler will admit), --max-num-seqs (the cap on concurrently running sequences), and --gpu-memory-utilization (the fraction of GPU memory vLLM claims for weights plus KV cache).
Here is a representative Deployment YAML snippet for Llama 3 8B on a single A10G:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-8b
  template:
    metadata:
      labels:
        app: vllm-llama3-8b
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        # meta-llama weights are gated: supply a Hugging Face token,
        # e.g. via the HUGGING_FACE_HUB_TOKEN environment variable.
        args: ["--model", "meta-llama/Meta-Llama-3-8B-Instruct",
               "--trust-remote-code",
               "--max-model-len", "4096",
               "--max-num-seqs", "256",
               "--gpu-memory-utilization", "0.90"]
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000
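Clients and the load tests later in this guide reach the pods through a Service. A minimal ClusterIP Service matching the labels above (the name is an assumption; it is what stands in for your-service in the test commands):

apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b
spec:
  selector:
    app: vllm-llama3-8b
  ports:
  - port: 80           # cluster-facing port
    targetPort: 8000   # vLLM's serving port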
CPU-based autoscaling metrics (like CPU utilization) are useless for LLM inference; the GPU is the bottleneck, not the CPU. Instead, scale on the number of waiting requests in vLLM's internal scheduler queue. vLLM exposes Prometheus metrics at /metrics on the serving port (8000 here), including vllm:num_requests_waiting.
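Prometheus has to scrape those pods before any of this works. If your Prometheus uses the common annotation-based discovery (an assumption; with the Prometheus Operator you would define a ServiceMonitor or PodMonitor instead), the Deployment's pod template can advertise the endpoint:

# Conventional scrape annotations on the pod template; only honored if your
# Prometheus scrape config is set up to look for them.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8000"
    prometheus.io/path: "/metrics"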
Install the Prometheus Adapter (https://github.com/kubernetes-sigs/prometheus-adapter) and configure it to expose the waiting-requests metric. In the adapter configuration, define a series that targets the vLLM namespace and pod label:
rules:
- seriesQuery: 'vllm:num_requests_waiting'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "(.*)"
    as: "waiting_requests"
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
Then create a HorizontalPodAutoscaler that scales the deployment when the average waiting requests across all pods exceeds 10:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama3-8b
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: waiting_requests
      target:
        type: AverageValue
        averageValue: "10"
Test this setup with a load generator. If the target value is too low, the HPA scales up prematurely, leaving you with many under-utilized replicas across which the request load is fragmented. A value between 5 and 15 works for most scenarios; tune it against your latency target.
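If scaling still flaps under bursty traffic, the autoscaling/v2 behavior block gives you direct control over how quickly the HPA reacts. A sketch with illustrative windows, appended to the HPA spec above:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 60    # require a sustained queue before adding pods
    policies:
    - type: Pods
      value: 2                        # add at most 2 replicas per minute
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300   # hold capacity for 5 minutes after load drops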
Spot instances on AWS and preemptible VMs on GCP reduce GPU costs by 60 to 80 percent. The catch: they can be reclaimed with two minutes of warning. For inference workloads, implement a two-step strategy: drain in-flight requests gracefully when a node is reclaimed, and spread replicas so a single reclaim cannot take them all out. The draining half is a preStop hook:
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15 && kill 1"]
The 15-second sleep lets vLLM drain active requests; Kubernetes sends SIGTERM only after the preStop hook returns, so in-flight generations get those extra seconds to finish. Combine this with a readiness probe that returns 200 only when the model is fully loaded. The second half of the strategy is placement: spread replicas across availability zones with topology spread constraints or pod anti-affinity so a single zone reclaim event does not take down all pods.
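A minimal spread constraint for the pod template above, assuming the app: vllm-llama3-8b label:

# Balance replicas across zones; ScheduleAnyway avoids blocking scheduling
# when only one zone currently has spare GPU capacity.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: vllm-llama3-8b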
Standard Kubernetes monitoring (CPU, memory, network) tells you nothing about whether the GPU is actually working. You need NVIDIA's DCGM Exporter (https://github.com/NVIDIA/dcgm-exporter) deployed as a DaemonSet on GPU nodes. It exposes metrics like DCGM_FI_DEV_GPU_UTIL (GPU utilization), DCGM_FI_DEV_FB_USED (framebuffer memory in use), DCGM_FI_DEV_POWER_USAGE (power draw), and DCGM_FI_DEV_GPU_TEMP (temperature).
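To defend the 70 percent utilization target from the introduction, it helps to alert when the fleet drifts below it. A sketch using the Prometheus Operator's PrometheusRule resource (the rule name and threshold are assumptions):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-rules
spec:
  groups:
  - name: gpu-inference
    rules:
    - alert: GPUUnderutilized
      expr: avg(DCGM_FI_DEV_GPU_UTIL) < 70   # fleet-wide average below target
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Average GPU utilization has been below 70% for 15 minutes"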
On the vLLM side, the metric vllm:avg_generation_throughput_toks_per_s reports the actual token output rate. For Llama 3 70B on a single A100 with continuous batching, expect roughly 40-60 tokens per second per request at batch size 8. If you see single-digit throughput, check whether the scheduler is swapping KV cache to CPU memory, indicated by a high vllm:cpu_cache_usage_perc.
Rolling out a new model version requires more care than a typical container update. When the new deployment pod starts, it loads the model weights into GPU memory — which takes 30 seconds for a 7B model and up to 3 minutes for a 70B model on NVMe storage. During this time, the pod is not ready to serve traffic, and the HPA may interpret the lack of requests as a signal to scale down.
Solve this with a two-phase startup: let the pod load weights while it is still unready, then flip it into service only once the health endpoint answers. The readiness probe handles the transition:
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 10
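For 70B-class models, where loading can take several minutes, a fixed initialDelaySeconds is either wasteful or too short. A startupProbe with a generous failureThreshold adapts to the actual load time (values are illustrative):

# Polls /health every 10 seconds and allows up to 5 minutes of startup;
# the readiness probe takes over once the startup probe succeeds.
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 30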
For frequent model updates, use a tag-based image version (e.g., vllm/vllm-openai:llama3.1-8b-v2 for an image you build and push yourself) rather than the latest tag. This gives the Deployment a deterministic rollout trigger: when the image tag changes, the old pods are terminated through the preStop hook, and new pods start with the updated weights.
Start by deploying a single-replica vLLM pod on one GPU node and run a stress test with the hey load generator. The /v1/chat/completions endpoint expects a POST with a JSON body, for example: hey -z 60s -c 20 -m POST -T application/json -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "ping"}]}' http://your-service/v1/chat/completions. Measure the 95th percentile latency. If it exceeds 500ms, reduce max-num-seqs in steps of 32 until it falls into range. Once latency is stable, enable the HPA and repeat the test with 50 concurrent users. Watch the HPA scale from 1 to 3 replicas within 90 seconds. That moment, when the system self-adjusts to demand without manual intervention, is the signal that your self-hosted inference server is ready for production traffic.