Chapter 18: Model Serving
Traffic doubles at 2 PM. The autoscaler adds 2 inference replicas. Each takes 3 minutes to load model weights into GPU memory. For 3 minutes, the existing replicas handle double the load at 95% KV cache utilization. Latency degrades from 200ms to 3 seconds. Users notice.
This is the cold-start problem — and it’s why inference workloads can’t use the same scaling model as web services.
The decisions — what metric to scale on, how to handle cold start, when inference deserves its own CRD — are universal to any platform serving models. The reference implementation’s LatticeModel CRD is one approach; the scaling and lifecycle problems exist regardless of the specific CRD design:
- What metric do you scale on? (CPU utilization is wrong for inference. KV cache usage, queue depth, and time-to-first-token are better — but each has trade-offs.)
- How do you handle cold start? (Long stabilization windows prevent flapping but hold expensive GPUs idle. Short windows risk repeated cold starts.)
- When does a workload type deserve its own CRD? (Inference has different lifecycle, different scaling, different resource types — meeting all four criteria from Chapter 11.)
18.1 How Inference Differs from Services
Cold start. A web service starts in seconds. An inference endpoint starts in minutes — loading model weights from storage into GPU memory. A 70B parameter model in FP16 needs ~140GB just for weights — more than any single GPU’s memory (both the A100 and the H100 top out at 80GB). Models this size require tensor parallelism: the weights are sharded across multiple GPUs, and each GPU holds a slice. The CRD declares the total GPU count per replica (e.g., nvidia.com/gpu: 4 for 4-way tensor parallelism); the inference engine (vLLM, SGLang) handles the sharding automatically. Loading time depends on storage speed, model size, and shard count. This means: scaling up is slow, scaling down is expensive (you lose the loaded model across all GPUs), and the first request after a cold start waits.
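The weight-memory arithmetic above is worth making explicit. A minimal sketch, assuming FP16 (2 bytes per parameter) and ignoring activation and KV-cache overhead:

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for model weights alone (FP16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9

def per_gpu_shard_gb(num_params: float, tensor_parallel: int) -> float:
    """With N-way tensor parallelism, each GPU holds roughly 1/N of the weights."""
    return weight_memory_gb(num_params) / tensor_parallel

print(weight_memory_gb(70e9))       # 140.0 GB -- exceeds any single 80GB GPU
print(per_gpu_shard_gb(70e9, 4))    # 35.0 GB per GPU with 4-way tensor parallelism
```

The 4-way split fits comfortably on 80GB GPUs, which is why the CRD example below requests multiple GPUs per replica for large models.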
Request heterogeneity. Web service requests are roughly uniform in cost. Inference requests vary by orders of magnitude — a 10-token prompt costs 100x less than a 1000-token prompt. Load balancing by request count therefore produces unbalanced GPU utilization: a count-balanced distribution may send five 1000-token prompts to one GPU and five 10-token prompts to another, a 100x difference in actual work.
GPU memory as the bottleneck. Services are bounded by CPU and memory. Inference is bounded by GPU memory — the model weights must fit in GPU memory, plus the KV cache for active requests. The number of concurrent requests a GPU can handle depends on how much memory the KV cache consumes, not CPU load.
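The KV-cache bound on concurrency can be sketched with simple arithmetic. The model dimensions below (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16) are illustrative assumptions, not figures from any specific model card:

```python
def kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    """K and V tensors per token: 2 * layers * kv_heads * head_dim * dtype_bytes."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_concurrent_requests(gpu_mem_gb, weights_gb, avg_seq_len):
    """How many average-length sequences fit in the remaining KV cache budget."""
    budget_bytes = (gpu_mem_gb - weights_gb) * 1e9
    return int(budget_bytes // (kv_bytes_per_token() * avg_seq_len))

# 80GB GPU, 16GB of weights, 2048-token average sequences:
print(kv_bytes_per_token())                  # 131072 bytes (128 KiB per token)
print(max_concurrent_requests(80, 16, 2048))
```

Note that the ceiling is set entirely by GPU memory left over after weights — CPU never appears in the formula.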
Scaling metrics. Services scale on CPU utilization or request rate. Inference should scale on KV cache usage (how full is the GPU memory?) or time-to-first-token (how long does the user wait?). Scaling on CPU utilization for an inference endpoint is wrong — GPU utilization can be high while the queue is empty, or low while the queue is full.
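Whatever metric is chosen, KEDA and the native HPA apply the same proportional rule: desired replicas grow until the metric returns to target. A sketch with illustrative numbers (the min/max bounds are assumptions):

```python
import math

def desired_replicas(current: int, metric: float, target: float,
                     min_r: int = 1, max_r: int = 16) -> int:
    """HPA-style proportional scaling: ceil(current * metric / target), clamped."""
    desired = math.ceil(current * metric / target)
    return max(min_r, min(max_r, desired))

# 4 decode replicas averaging 0.9 KV cache usage against a 0.8 target:
print(desired_replicas(4, 0.9, 0.8))              # 5
print(desired_replicas(10, 2.0, 0.8, max_r=16))   # 16 -- clamped at the max
```

The formula is metric-agnostic; the chapter’s point is that feeding it CPU utilization gives the wrong answer for inference, because the queue can be full while CPU is idle.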
18.2 The LatticeModel CRD
```yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeModel
metadata:
  name: qwen-8b
  namespace: serving
spec:
  modelSource:
    uri: "hf://Qwen/Qwen3-8B"
    cacheSize: "50Gi"
  defaults:
    entryWorkload:
      containers:
        main:
          image: registry.example.com/vllm@sha256:abc123
          variables:
            MODEL_NAME: "Qwen/Qwen3-8B"
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: "1"
  roles:
    prefill:
      replicas: 2
    decode:
      replicas: 4
      autoscaling:
        max: 16
        metrics:
          - metric: gpu_kv_cache_usage
            target: 0.8
        behavior:
          scaleDown:
            stabilizationWindow: "10m"
  routing:
    inferenceEngine: vLLM
    model: "Qwen/Qwen3-8B"
```

Model source. The modelSource specifies where to download model weights. The pipeline generates an init container that downloads from HuggingFace (hf://), S3 (s3://), or GCS (gs://) into a cache volume before the serving container starts. The cache volume can be a PVC (persistent across restarts) or an emptyDir (re-downloaded on restart).
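The scheme dispatch that init-container generation depends on can be sketched. The function below and its return format are illustrative assumptions, not the pipeline’s actual code:

```python
from urllib.parse import urlparse

# Schemes the chapter names for modelSource URIs.
SUPPORTED_SCHEMES = {"hf", "s3", "gs"}

def parse_model_source(uri: str) -> tuple[str, str]:
    """Split a modelSource URI into (scheme, repo-or-bucket path).

    The scheme selects which download tool the generated init
    container runs; the rest is the location to fetch.
    """
    parsed = urlparse(uri)
    if parsed.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported model source scheme: {parsed.scheme!r}")
    return parsed.scheme, parsed.netloc + parsed.path

print(parse_model_source("hf://Qwen/Qwen3-8B"))   # ('hf', 'Qwen/Qwen3-8B')
```

Validating the scheme at compile time (rather than letting the init container fail at pod start) keeps bad URIs from consuming a GPU slot.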
Roles. LatticeModel supports multiple roles — prefill and decode in this example. Each role gets its own Deployment with its own replica count, resource requests, and autoscaling configuration. This enables disaggregated inference: prefill workers handle the compute-intensive prompt processing (computing the KV cache for the full input sequence), decode workers handle the autoregressive token generation (reading the KV cache and producing output tokens one at a time). The CRD’s roles map uses the standard terminology — prefill and decode — not a custom abstraction.
Autoscaling. The decode role scales on gpu_kv_cache_usage — a custom metric from the inference engine (vLLM exposes this). The stabilizationWindow: 10m prevents rapid scale-down — scaling down means evicting a loaded model, which takes minutes to reload if demand returns. This is 10x longer than a typical service’s stabilization window because the cost of scaling back up is 10x higher.
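The stabilization-window semantics can be sketched: the controller remembers recent replica recommendations and only scales down to the highest recommendation still inside the window, so a brief dip cannot evict a loaded model. This mirrors the HPA behavior KEDA delegates to; the class below is a simplified model, not KEDA’s implementation:

```python
from collections import deque

class ScaleDownStabilizer:
    """Scale down only to the max recommendation seen in the last window_s seconds."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self.history: deque = deque()  # (timestamp, recommended_replicas)

    def recommend(self, now: float, desired: int) -> int:
        self.history.append((now, desired))
        # Drop recommendations that have aged out of the window.
        while self.history and now - self.history[0][0] > self.window_s:
            self.history.popleft()
        return max(r for _, r in self.history)

stab = ScaleDownStabilizer(window_s=600)  # the CRD's 10-minute window
stab.recommend(0, 6)           # steady state: 6 replicas wanted
print(stab.recommend(60, 4))   # brief dip: window still remembers 6 -> stays at 6
print(stab.recommend(700, 4))  # dip sustained past the window -> scales to 4
```

The window length is the knob: 600 seconds here means a loaded model survives any demand dip shorter than ten minutes.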
Routing. The routing block configures inference-specific request routing. The inference engine (vLLM, SGLang) determines the API protocol. The platform compiles this into routing resources that direct inference requests to the appropriate role.
18.3 Scaling Under Load: A Walkthrough
Walk through what happens when an inference endpoint receives a traffic spike.
Steady state. The qwen-8b model is running with 2 prefill replicas and 4 decode replicas. KV cache usage on the decode replicas averages 0.5 (50%). Latency is stable at 200ms P99.
t=0. Traffic doubles. The marketing team launched a campaign. Request rate goes from 100 RPS to 200 RPS.
t=30s. KV cache usage on decode replicas rises to 0.85 (85%). The KEDA ScaledObject triggers: gpu_kv_cache_usage > 0.8 (the configured target). KEDA wants to scale decode from 4 to 6 replicas.
t=1m. Two new decode pods are scheduled. Each needs 1 GPU. The cluster has available GPUs. The pods start — but the model isn’t loaded yet. The vLLM container begins loading model weights from the cache volume. Whether this means a fresh download depends on the volume type: with a ReadWriteMany PVC or node-local storage such as hostPath, the weights may already be cached; a ReadWriteOnce PVC cannot be shared across replicas on different nodes; an emptyDir always means a fresh download from HuggingFace.
t=3m. Model loading completes. The new replicas register with the routing layer and begin accepting requests. KV cache usage across all 6 replicas drops to 0.55. Latency returns to 200ms P99.
t=60m. Traffic returns to 100 RPS. KV cache usage drops to 0.3. The scale-down stabilization window (10 minutes) starts counting. KEDA waits to confirm the demand decrease is sustained — not a brief lull.
t=70m. Stabilization window expires. KEDA scales decode from 6 to 4. Two replicas are terminated. Their loaded models are evicted from GPU memory.
t=75m. Traffic spikes again unexpectedly. KV cache usage rises. KEDA scales up again. The two new replicas must reload the model — another 2-3 minutes of cold start.
The lesson. The 10-minute stabilization window prevented premature scale-down during the initial demand period (t=0 to t=60m). But it couldn’t prevent the scale-up/scale-down/scale-up cycle at t=60-75m. Long stabilization windows smooth out noise but can’t predict demand. If this pattern is frequent, the operator should increase minReplicas to hold more warm capacity — accepting higher cost for lower cold-start risk.
This is the fundamental trade-off of inference scaling: GPU capacity is expensive to hold idle and expensive to warm up. The platform can’t solve this trade-off — it can only make the parameters (minReplicas, maxReplicas, stabilization windows, scaling metrics) visible and configurable in the spec.
18.4 What the Shared Compiler Handles
Same as LatticeJob (Chapter 16, Section 16.4): image verification, Cedar authorization, secret resolution, environment compilation, mesh member generation. The shared WorkloadCompiler produces the pod template, ConfigMaps, Secrets, ExternalSecrets for each role.
What LatticeModel adds:
- Multiple Deployments (one per role) instead of a single Deployment or VCJob.
- Model download init container from modelSource.
- Role-specific autoscaling with GPU metrics instead of KEDA’s standard CPU/memory triggers.
- Routing configuration for the inference engine.
- Volcano scheduling for multi-GPU roles (roles that need gang placement).
- Worker pods within a role (an entry pod receives requests and distributes pipeline stages across worker pods for pipeline parallelism).
18.5 Scaling Configuration
Scaling inference is different from scaling services in three ways.
Scale-up stabilization (3-5 minutes). Longer than services because you need to confirm demand is sustained before committing to a cold start. Transient spikes shouldn’t trigger model loading.
Scale-down stabilization (10-15 minutes). Much longer than services. Scaling down means evicting a model from GPU memory. If demand returns, the model must be reloaded — minutes of cold start. The platform is conservative about removing capacity.
Scaling to zero. Useful for dev/staging or low-traffic models. The first request after scale-down waits minutes for model loading. Not appropriate for production endpoints with latency SLOs.
The trade-off. Long stabilization windows mean the platform holds GPU capacity longer than necessary during demand drops. Short windows risk rapid scale-up/scale-down cycles that waste minutes of model loading time. The right values depend on the model size (larger models = longer loading = longer stabilization), the traffic pattern (bursty vs. steady), and the cost tolerance (inference-optimized GPUs like L4 or A10G at ~$2/hour — roughly half the cost of H100-class training hardware — are not free to hold idle).
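The cost side of this trade-off is simple arithmetic. The GPU price and cold-start duration below are the chapter’s illustrative figures, not vendor quotes:

```python
def idle_cost_usd(replicas: int, hours: float, usd_per_gpu_hour: float = 2.0) -> float:
    """Cost of holding warm GPU replicas through a demand gap."""
    return replicas * hours * usd_per_gpu_hour

def cold_start_penalty_s(model_load_minutes: float = 3.0) -> float:
    """Latency the first requests absorb if capacity was released instead."""
    return model_load_minutes * 60

# Hold 4 L4/A10G-class replicas through a 6-hour overnight gap,
# versus eating one cold start when traffic returns:
print(idle_cost_usd(4, 6))      # 48.0 USD of idle GPU time
print(cold_start_penalty_s())   # 180.0 seconds of degraded or queued requests
```

Whether $48 is cheaper than 3 minutes of degraded latency is a product decision, not a platform one — which is exactly why the spec exposes the knobs instead of choosing.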
18.6 When to Build a New CRD
Chapter 11 asked: when does a workload type need its own CRD vs. an escape hatch?
LatticeModel exists because inference meets all the criteria:
- Different lifecycle. Models have load/unload semantics (cold start, cache management).
- Different scaling. GPU-specific metrics, long stabilization windows.
- Different output resources. Multiple Deployments per CRD, routing configuration, model download init containers.
- Multiple teams need it. Any organization doing ML inference has this pattern.
The shared WorkloadCompiler handles the common security and networking infrastructure. The LatticeModel-specific compiler handles the inference-specific resources. Adding a fourth workload type (DaemonSets for agents? StatefulSets for databases?) means building another wrapper on the same foundation.
18.7 What Goes Wrong in Practice
Scenario: the scale-up/scale-down/scale-up cycle. An inference endpoint runs 4 decode replicas. Traffic drops at 10 PM. KV cache usage falls to 0.2. The 10-minute stabilization window starts. At 10:10 PM, KEDA scales down from 4 to 2 replicas. Two models are evicted from GPU memory.
At 10:15 PM, a batch of scheduled reports hits the endpoint. Traffic spikes. KV cache usage on the 2 remaining replicas jumps to 0.9. KEDA scales up from 2 to 4. Two new replicas start loading models — 3 minutes of cold start. For 3 minutes, the 2 existing replicas handle the report traffic at near-capacity. Some reports time out.
At 10:18 PM, the new replicas are ready. Traffic is served. At 10:30 PM, the reports are done. Traffic drops again. The cycle repeats.
The cost: Two model loads at 3 minutes each, with degraded latency during each. The stabilization window (10 minutes) was long enough to prevent scale-down during the main traffic period but too short to cover the 15-minute gap between the 10 PM traffic drop and the report burst; a window longer than 15 minutes would have held the replicas through it.
The lesson: Stabilization windows are a guess about traffic patterns. They prevent the common case (transient dips) but can’t prevent every case (gaps followed by bursts). If this pattern is regular (nightly reports at 10:15 PM), the operator should increase minReplicas to hold capacity through the gap — accepting higher cost for no cold start. The platform can surface the pattern: “service qwen-8b scaled down and up 3 times this week between 10-11 PM. Consider increasing minReplicas to avoid cold starts.”
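The recommendation the platform surfaces can be driven by a simple flap detector over scaling events. A sketch; the event format and the 15-minute threshold are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class ScaleEvent:
    timestamp: float  # seconds since midnight (or epoch)
    delta: int        # +N for scale-up, -N for scale-down

def count_flaps(events: list[ScaleEvent], max_gap_s: float = 900) -> int:
    """Count scale-downs followed by a scale-up within max_gap_s.

    Each flap means a model was evicted and then reloaded almost
    immediately -- the pattern that argues for a higher minReplicas.
    """
    flaps = 0
    last_down: float | None = None
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.delta < 0:
            last_down = ev.timestamp
        elif last_down is not None and ev.timestamp - last_down <= max_gap_s:
            flaps += 1
            last_down = None
    return flaps

# The nightly pattern: down at 22:10, back up at 22:15.
events = [ScaleEvent(79800, -2), ScaleEvent(80100, 2)]
if count_flaps(events) >= 1:
    print("consider raising minReplicas to hold capacity through the gap")
```

Run over a week of events, a count of three or more in the same hour is exactly the “scaled down and up 3 times this week between 10-11 PM” signal described above.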
18.8 Part VI Complete
The advanced workload types:
- Chapter 16: Batch and gang scheduling — atomic placement through Volcano, LatticeJob CRD.
- Chapter 17: GPU infrastructure — failure modes, DCGM monitoring, health classification.
- Chapter 18: Model serving — disaggregated inference, KV cache scaling, cold-start management.
In each case, the shared WorkloadCompiler handled security and networking; the type-specific wrapper handled lifecycle. Part VII covers operations: observability (Chapter 19), disaster recovery (Chapter 20), and testing (Chapter 21).
Exercises
18.1. [M10] A model serving endpoint scales on gpu_kv_cache_usage. The metric says 0.95 (95% full). The autoscaler adds a replica. The new replica takes 3 minutes to load the model. During those 3 minutes, the existing replicas are at 95% — near capacity. New requests queue. What happens to latency during the scale-up? How should the platform report this to the developer?
18.2. [H30] The modelSource downloads model weights into a cache volume. Design the caching strategy for a model deployed with 4 replicas across 4 nodes. Options: emptyDir (each pod downloads independently — 4 downloads), hostPath (shared per node — 1 download per node), PVC with ReadWriteMany (shared across nodes — 1 download total). What are the trade-offs in download time, storage cost, node affinity constraints, and fault tolerance?
18.3. [R] LatticeModel supports roles (prefill, decode). In practice, many teams start with a single role (a combined prefill+decode deployment) and disaggregate later as they optimize. Should the CRD require roles, or should it support a “simple mode” with a single deployment? What’s the cost of requiring roles for every model (complexity for simple cases) vs. allowing single-deployment mode (two code paths in the compiler)?
18.4. [M10] An inference endpoint is serving production traffic. The scale-down stabilization is 10 minutes. Traffic drops to zero at 2 AM and returns at 8 AM. The platform holds 4 GPU replicas for 6 hours at ~$2/hour per inference-optimized GPU (L4 or A10G — smaller and cheaper than the H100s used for training) = $48 wasted. Should the platform scale to zero overnight? What is the cold-start latency when traffic returns? Is $48 cheaper or more expensive than 3 minutes of degraded latency at 8 AM?
18.5. [H30] Disaggregated inference separates prefill workers from decode workers. But it introduces a new failure mode: the decode workers are healthy but all prefill workers are overloaded. Decode workers accept requests (they’re not at capacity) but can’t get KV caches from prefill workers (they’re all full). Design the backpressure mechanism: how does the routing layer signal to clients that it can’t handle more requests? Should the platform’s health check reflect prefill load, not just decode health?