Workloads

LatticeModel

A model serving resource backed by Volcano. It defines named roles (e.g. prefill, decode), each with disaggregated entry and worker pod templates that can be sized and scaled independently. Uses the same WorkloadSpec composition model as LatticeService and LatticeJob.

group: lattice.dev version: v1alpha1 scope: namespaced

Examples

Model serving workloads with disaggregated entry/worker pod templates, GPU resources, shared model caches, and ingress.

Disaggregated LLM Inference (Prefill + Decode)

llama-70b.yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeModel
metadata:
  name: llama-70b
  namespace: inference
spec:
  recoveryPolicy: Restart
  restartGracePeriodSeconds: 30
  roles:
    prefill:                          # entry-only role (no workers)
      replicas: 1
      entryWorkload:
        containers:
          main:
            image: ghcr.io/myorg/vllm-prefill:v0.4
            variables:
              MODEL_NAME: meta-llama/Llama-2-70b-chat-hf
              TENSOR_PARALLEL_SIZE: "4"
            volumes:
              /models:
                source: ${resources.model-cache}
            resources:
              requests:
                cpu: 8000m
                memory: 64Gi
              limits:
                cpu: 16000m
                memory: 128Gi
        resources:
          gpu:
            type: gpu
            params:
              count: 4
              memory: 80Gi
              model: A100
          model-cache:
            type: volume
            id: llm-model-cache
    decode:                            # entry + worker pods
      replicas: 2
      workerReplicas: 4
      entryWorkload:                   # coordinator pods
        containers:
          main:
            image: ghcr.io/myorg/vllm-decode:v0.4
            variables:
              MODEL_NAME: meta-llama/Llama-2-70b-chat-hf
              ROLE: entry
            volumes:
              /models:
                source: ${resources.model-cache}
            resources:
              requests:
                cpu: 4000m
                memory: 32Gi
              limits:
                cpu: 8000m
                memory: 64Gi
        resources:
          gpu:
            type: gpu
            params:
              count: 2
              memory: 80Gi
              model: A100
          model-cache:
            type: volume
            id: llm-model-cache
      workerWorkload:                  # worker pods (different resources)
        containers:
          main:
            image: ghcr.io/myorg/vllm-decode:v0.4
            variables:
              MODEL_NAME: meta-llama/Llama-2-70b-chat-hf
              ROLE: worker
            volumes:
              /models:
                source: ${resources.model-cache}
            resources:
              requests:
                cpu: 2000m
                memory: 16Gi
              limits:
                cpu: 4000m
                memory: 32Gi
        resources:
          gpu:
            type: gpu
            params:
              count: 1
              memory: 40Gi
              model: A100
          model-cache:
            type: volume
            id: llm-model-cache
  ingress:
    routes:
      api:
        hosts: [llama.inference.example.com]
        tls:
          issuerRef:
            name: letsencrypt-prod
        port: http

Single-Role Embedding Service

embedding-service.yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeModel
metadata:
  name: embedding-service
  namespace: inference
spec:
  roles:
    server:
      replicas: 2
      entryWorkload:
        containers:
          main:
            image: ghcr.io/myorg/tei:latest
            variables:
              MODEL_ID: BAAI/bge-large-en-v1.5
            resources:
              requests:
                cpu: 4000m
                memory: 16Gi
              limits:
                cpu: 8000m
                memory: 32Gi
        service:
          ports:
            http:
              port: 8080
        resources:
          gpu:
            type: gpu
            params:
              count: 1
              model: L4

Spec

LatticeModel spec fields
Field Type Description
roles map<string, ModelRoleSpec> Named role definitions (e.g. prefill, decode, server). Each role maps to a Volcano ModelServing role.
schedulerName string Volcano scheduler name. Default: "volcano".
recoveryPolicy string? Recovery policy for the serving group (e.g. "Restart").
restartGracePeriodSeconds u32? Grace period in seconds before restarting a failed role.
ingress IngressSpec? Optional Gateway API ingress configuration for exposing the model externally.
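Taken together, the top-level fields compose as follows. This is a minimal sketch assembled from the table above; the role name, image, and values are illustrative placeholders, not defaults.

```yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeModel
metadata:
  name: example-model
spec:
  schedulerName: volcano          # the default; shown here for completeness
  recoveryPolicy: Restart
  restartGracePeriodSeconds: 30
  roles:
    server:                       # placeholder role name
      replicas: 1
      entryWorkload:
        containers:
          main:
            image: ghcr.io/example/server:latest   # placeholder image
```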

ModelRoleSpec

A single role within a LatticeModel serving workload. Each role has separate entry and worker pod templates, enabling independent resource allocation for disaggregated inference patterns. If workerWorkload is omitted, the role runs entry pods only.

ModelRoleSpec fields
Field Type Description
entryWorkload WorkloadSpec Workload specification for entry (coordinator) pods.
replicas u32 Number of entry pod replicas. Default: 1.
workerWorkload WorkloadSpec? Workload specification for worker pods. If omitted, the role has no workers.
workerReplicas u32? Number of worker pod replicas. Only used when workerWorkload is set.
Entry Runtime Extensions
sidecars map<string, ContainerSpec> Additional sidecar containers for entry pods.
sysctls map<string, string> Kernel parameter overrides for entry pods.
hostNetwork bool? Use the host network namespace for entry pods.
shareProcessNamespace bool? Share a single process namespace between entry pod containers.
imagePullSecrets []string Resource names referencing type: secret resources for private image registries.
Worker Runtime Extensions (optional)
workerSidecars map<string, ContainerSpec>? Sidecar containers for worker pods. Falls back to entry sidecars if omitted.
workerSysctls map<string, string>? Kernel parameter overrides for worker pods. Falls back to entry sysctls if omitted.
workerHostNetwork bool? Host network for worker pods. Falls back to entry setting if omitted.
workerImagePullSecrets []string? Image pull secrets for worker pods. Falls back to entry secrets if omitted.

Worker runtime extensions are optional. When omitted, worker pods inherit the corresponding entry runtime settings. This lets you share a common runtime configuration while only overriding specific fields for workers when needed.
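For instance, a role can set a shared runtime configuration on the entry fields and override a single setting for workers. This is a sketch assuming the fallback behavior described above; the role name, sidecar, and images are illustrative.

```yaml
roles:
  decode:
    replicas: 1
    workerReplicas: 2
    hostNetwork: true              # entry pods use the host network
    workerHostNetwork: false       # workers override the inherited setting
    sidecars:
      metrics:                     # inherited by workers (workerSidecars omitted)
        image: ghcr.io/example/exporter:latest
    entryWorkload:
      containers:
        main:
          image: ghcr.io/example/decode:latest
    workerWorkload:
      containers:
        main:
          image: ghcr.io/example/decode:latest
```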

Lifecycle

1 Pending — Model is created. Lattice compiles mesh members, secrets, volumes, and TracingPolicies.
2 Loading — Infrastructure is ready. Volcano ModelServing is created and model artifacts are being loaded.
3 Serving — Model is loaded and serving inference requests.
4 Failed — Model has encountered an error. Check conditions for details.
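Assuming standard kubectl access and that the CRD registers under the lowercase kind name (an assumption), the phase transitions above can be observed like any other resource, using the llama-70b example from earlier:

```shell
# Watch the model move through Pending -> Loading -> Serving
# (resource name "latticemodel" assumed from the kind)
kubectl get latticemodel llama-70b -n inference -w

# Or read the current phase directly from status
kubectl get latticemodel llama-70b -n inference \
  -o jsonpath='{.status.phase}'
```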

Status

LatticeModel status fields
Field Type Description
phase ModelServingPhase Current phase: Pending, Loading, Serving, or Failed.
message string? Human-readable status message.
observedGeneration i64? Last observed metadata.generation.
conditions []ModelCondition Detailed conditions mirrored from the underlying ModelServing resource.

ModelCondition

ModelCondition fields
Field Type Description
type string Condition type: Available, Progressing, Failed, or UpdateInProgress.
status string True, False, or Unknown.
reason string? Machine-readable reason for the condition.
message string? Human-readable message with details.
lastTransitionTime string? Timestamp of the last condition transition.
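A populated status might look like the fragment below. This is a hypothetical sketch for orientation only: the reason string and timestamps are illustrative, not values emitted by the controller.

```yaml
status:
  phase: Serving
  observedGeneration: 3
  conditions:
    - type: Available
      status: "True"
      reason: AllRolesReady        # illustrative reason string
      message: all roles are serving
      lastTransitionTime: "2025-01-01T00:00:00Z"
    - type: Progressing
      status: "False"
      lastTransitionTime: "2025-01-01T00:00:00Z"
```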