Workloads

LatticeModel

A model serving resource backed by Volcano. It defines named roles (e.g. prefill, decode), each with disaggregated entry and worker pod templates that can be sized and scaled independently. Uses the same WorkloadSpec composition model as LatticeService and LatticeJob.

group: lattice.dev version: v1alpha1 scope: namespaced

Examples

Model serving workloads with disaggregated entry/worker pod templates, GPU resources, shared model caches, and ingress.

Disaggregated LLM Inference (Prefill + Decode)

llama-70b.yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeModel
metadata:
  name: llama-70b
  namespace: inference
spec:
  recoveryPolicy: Restart
  restartGracePeriodSeconds: 30
  roles:
    prefill:                          # entry-only role (no workers)
      replicas: 1
      entryWorkload:
        containers:
          main:
            image: ghcr.io/myorg/vllm-prefill:v0.4
            variables:
              MODEL_NAME: meta-llama/Llama-2-70b-chat-hf
              TENSOR_PARALLEL_SIZE: "4"
            volumes:
              /models:
                source: ${resources.model-cache}
            resources:
              requests:
                cpu: 8000m
                memory: 64Gi
              limits:
                cpu: 16000m
                memory: 128Gi
        resources:
          gpu:
            type: gpu
            params:
              count: 4
              memory: 80Gi
              model: A100
          model-cache:
            type: volume
            id: llm-model-cache
    decode:                            # entry + worker pods
      replicas: 2
      workerReplicas: 4
      entryWorkload:                   # coordinator pods
        containers:
          main:
            image: ghcr.io/myorg/vllm-decode:v0.4
            variables:
              MODEL_NAME: meta-llama/Llama-2-70b-chat-hf
              ROLE: entry
            volumes:
              /models:
                source: ${resources.model-cache}
            resources:
              requests:
                cpu: 4000m
                memory: 32Gi
              limits:
                cpu: 8000m
                memory: 64Gi
        resources:
          gpu:
            type: gpu
            params:
              count: 2
              memory: 80Gi
              model: A100
          model-cache:
            type: volume
            id: llm-model-cache
      workerWorkload:                  # worker pods (different resources)
        containers:
          main:
            image: ghcr.io/myorg/vllm-decode:v0.4
            variables:
              MODEL_NAME: meta-llama/Llama-2-70b-chat-hf
              ROLE: worker
            volumes:
              /models:
                source: ${resources.model-cache}
            resources:
              requests:
                cpu: 2000m
                memory: 16Gi
              limits:
                cpu: 4000m
                memory: 32Gi
        resources:
          gpu:
            type: gpu
            params:
              count: 1
              memory: 40Gi
              model: A100
          model-cache:
            type: volume
            id: llm-model-cache
  ingress:
    routes:
      api:
        hosts: [llama.inference.example.com]
        tls:
          issuerRef:
            name: letsencrypt-prod
        port: http

Single-Role Embedding Service

embedding-service.yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeModel
metadata:
  name: embedding-service
  namespace: inference
spec:
  roles:
    server:
      replicas: 2
      entryWorkload:
        containers:
          main:
            image: ghcr.io/myorg/tei:latest
            variables:
              MODEL_ID: BAAI/bge-large-en-v1.5
            resources:
              requests:
                cpu: 4000m
                memory: 16Gi
              limits:
                cpu: 8000m
                memory: 32Gi
        service:
          ports:
            http:
              port: 8080
        resources:
          gpu:
            type: gpu
            params:
              count: 1
              model: L4

Spec

LatticeModel spec fields
Field Type Description
roles map<string, ModelRoleSpec> Named role definitions (e.g. prefill, decode, server). Each role maps to a Volcano ModelServing role.
schedulerName string Volcano scheduler name. Default: "volcano".
recoveryPolicy string? Recovery policy for the serving group (e.g. "Restart").
restartGracePeriodSeconds u32? Grace period in seconds before restarting a failed role.
ingress IngressSpec? Optional Gateway API ingress configuration for exposing the model externally.
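Taken together, the top-level fields compose as follows. This is a minimal sketch assembled from the table above; the role name, image, and values are illustrative placeholders, not defaults.

```yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeModel
metadata:
  name: example-model
spec:
  schedulerName: volcano          # the default; shown here for completeness
  recoveryPolicy: Restart
  restartGracePeriodSeconds: 30
  roles:
    server:                       # placeholder role name
      replicas: 1
      entryWorkload:
        containers:
          main:
            image: ghcr.io/example/server:latest   # placeholder image
```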

ModelRoleSpec

A single role within a LatticeModel serving workload. Each role has separate entry and worker pod templates, enabling independent resource allocation for disaggregated inference patterns. If workerWorkload is omitted, the role runs entry pods only.

ModelRoleSpec fields
Field Type Description
entryWorkload WorkloadSpec Workload specification for entry (coordinator) pods.
replicas u32 Number of entry pod replicas. Default: 1.
workerWorkload WorkloadSpec? Workload specification for worker pods. If omitted, the role has no workers.
workerReplicas u32? Number of worker pod replicas. Only used when workerWorkload is set.
Entry Runtime Extensions
sidecars map<string, ContainerSpec> Additional sidecar containers for entry pods.
sysctls map<string, string> Kernel parameter overrides for entry pods.
hostNetwork bool? Use the host network namespace for entry pods.
shareProcessNamespace bool? Share a single process namespace between entry pod containers.
imagePullSecrets []string Resource names referencing type: secret resources for private image registries.
Worker Runtime Extensions (optional)
workerSidecars map<string, ContainerSpec>? Sidecar containers for worker pods. Falls back to entry sidecars if omitted.
workerSysctls map<string, string>? Kernel parameter overrides for worker pods. Falls back to entry sysctls if omitted.
workerHostNetwork bool? Host network for worker pods. Falls back to entry setting if omitted.
workerImagePullSecrets []string? Image pull secrets for worker pods. Falls back to entry secrets if omitted.

Worker runtime extensions are optional. When omitted, worker pods inherit the corresponding entry runtime settings. This lets you share a common runtime configuration while only overriding specific fields for workers when needed.
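For instance, a role can set a shared runtime configuration on the entry fields and override a single setting for workers. This is a sketch assuming the fallback behavior described above; the role name, sidecar, and images are illustrative.

```yaml
roles:
  decode:
    replicas: 1
    workerReplicas: 2
    hostNetwork: true              # entry pods use the host network
    workerHostNetwork: false       # workers override the inherited setting
    sidecars:
      metrics:                     # inherited by workers (workerSidecars omitted)
        image: ghcr.io/example/exporter:latest
    entryWorkload:
      containers:
        main:
          image: ghcr.io/example/decode:latest
    workerWorkload:
      containers:
        main:
          image: ghcr.io/example/decode:latest
```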

Lifecycle

1 Pending — Model is created. Lattice compiles mesh members, secrets, volumes, and TracingPolicies.
2 Loading — Infrastructure is ready. Volcano ModelServing is created and model artifacts are being loaded.
3 Serving — Model is loaded and serving inference requests.
4 Failed — Model has encountered an error. Check conditions for details.
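Assuming standard kubectl access and that the CRD registers under the lowercase kind name (an assumption), the phase transitions above can be observed like any other resource, using the llama-70b example from earlier:

```shell
# Watch the model move through Pending -> Loading -> Serving
# (resource name "latticemodel" assumed from the kind)
kubectl get latticemodel llama-70b -n inference -w

# Or read the current phase directly from status
kubectl get latticemodel llama-70b -n inference \
  -o jsonpath='{.status.phase}'
```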

Status

LatticeModel status fields
Field Type Description
phase ModelServingPhase Current phase: Pending, Loading, Serving, or Failed.
message string? Human-readable status message.
observedGeneration i64? Last observed metadata.generation.
conditions []ModelCondition Detailed conditions mirrored from the underlying ModelServing resource.

ModelCondition

ModelCondition fields
Field Type Description
type string Condition type: Available, Progressing, Failed, or UpdateInProgress.
status string True, False, or Unknown.
reason string? Machine-readable reason for the condition.
message string? Human-readable message with details.
lastTransitionTime string? Timestamp of the last condition transition.
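A populated status might look like the fragment below. This is a hypothetical sketch for orientation only: the reason string and timestamps are illustrative, not values emitted by the controller.

```yaml
status:
  phase: Serving
  observedGeneration: 3
  conditions:
    - type: Available
      status: "True"
      reason: AllRolesReady        # illustrative reason string
      message: all roles are serving
      lastTransitionTime: "2025-01-01T00:00:00Z"
    - type: Progressing
      status: "False"
      lastTransitionTime: "2025-01-01T00:00:00Z"
```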