# LatticeModel
A model-serving resource backed by Volcano. It defines named roles (e.g. prefill, decode) with disaggregated entry and worker pod templates that scale independently. Roles use the same WorkloadSpec composition model as LatticeService and LatticeJob.
## Examples

The examples below show model-serving workloads with disaggregated entry/worker pod templates, GPU resources, shared model caches, and ingress.
### Disaggregated LLM Inference (Prefill + Decode)
```yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeModel
metadata:
  name: llama-70b
  namespace: inference
spec:
  recoveryPolicy: Restart
  restartGracePeriodSeconds: 30
  roles:
    prefill: # entry-only role (no workers)
      replicas: 1
      entryWorkload:
        containers:
          main:
            image: ghcr.io/myorg/vllm-prefill:v0.4
            variables:
              MODEL_NAME: meta-llama/Llama-2-70b-chat-hf
              TENSOR_PARALLEL_SIZE: "4"
            volumes:
              /models:
                source: ${resources.model-cache}
            resources:
              requests:
                cpu: 8000m
                memory: 64Gi
              limits:
                cpu: 16000m
                memory: 128Gi
        resources:
          gpu:
            type: gpu
            params:
              count: 4
              memory: 80Gi
              model: A100
          model-cache:
            type: volume
            id: llm-model-cache
    decode: # entry + worker pods
      replicas: 2
      workerReplicas: 4
      entryWorkload: # coordinator pods
        containers:
          main:
            image: ghcr.io/myorg/vllm-decode:v0.4
            variables:
              MODEL_NAME: meta-llama/Llama-2-70b-chat-hf
              ROLE: entry
            volumes:
              /models:
                source: ${resources.model-cache}
            resources:
              requests:
                cpu: 4000m
                memory: 32Gi
              limits:
                cpu: 8000m
                memory: 64Gi
        resources:
          gpu:
            type: gpu
            params:
              count: 2
              memory: 80Gi
              model: A100
          model-cache:
            type: volume
            id: llm-model-cache
      workerWorkload: # worker pods (different resources)
        containers:
          main:
            image: ghcr.io/myorg/vllm-decode:v0.4
            variables:
              MODEL_NAME: meta-llama/Llama-2-70b-chat-hf
              ROLE: worker
            volumes:
              /models:
                source: ${resources.model-cache}
            resources:
              requests:
                cpu: 2000m
                memory: 16Gi
              limits:
                cpu: 4000m
                memory: 32Gi
        resources:
          gpu:
            type: gpu
            params:
              count: 1
              memory: 40Gi
              model: A100
          model-cache:
            type: volume
            id: llm-model-cache
  ingress:
    routes:
      api:
        hosts: [llama.inference.example.com]
        tls:
          issuerRef:
            name: letsencrypt-prod
        port: http
```

### Single-Role Embedding Service
```yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeModel
metadata:
  name: embedding-service
  namespace: inference
spec:
  roles:
    server:
      replicas: 2
      entryWorkload:
        containers:
          main:
            image: ghcr.io/myorg/tei:latest
            variables:
              MODEL_ID: BAAI/bge-large-en-v1.5
            resources:
              requests:
                cpu: 4000m
                memory: 16Gi
              limits:
                cpu: 8000m
                memory: 32Gi
        service:
          ports:
            http:
              port: 8080
        resources:
          gpu:
            type: gpu
            params:
              count: 1
              model: L4
```

## Spec
| Field | Type | Description |
|---|---|---|
| roles | map<string, ModelRoleSpec> | Named role definitions (e.g. prefill, decode, server). Each role maps to a Volcano ModelServing role. |
| schedulerName | string | Volcano scheduler name. Default: "volcano". |
| recoveryPolicy | string? | Recovery policy for the serving group (e.g. "Restart"). |
| restartGracePeriodSeconds | u32? | Grace period in seconds before restarting a failed role. |
| ingress | IngressSpec? | Optional Gateway API ingress configuration for exposing the model externally. |
### ModelRoleSpec
A single role within a LatticeModel serving workload. Each role has separate entry and worker pod templates, enabling independent resource allocation for disaggregated inference patterns. If workerWorkload is omitted, the role runs entry pods only.
| Field | Type | Description |
|---|---|---|
| entryWorkload | WorkloadSpec | Workload specification for entry (coordinator) pods. |
| replicas | u32 | Number of entry pod replicas. Default: 1. |
| workerWorkload | WorkloadSpec? | Workload specification for worker pods. If omitted, the role has no workers. |
| workerReplicas | u32? | Number of worker pod replicas. Only used when workerWorkload is set. |
| Entry Runtime Extensions | | |
| sidecars | map<string, ContainerSpec> | Additional sidecar containers for entry pods. |
| sysctls | map<string, string> | Kernel parameter overrides for entry pods. |
| hostNetwork | bool? | Use the host network namespace for entry pods. |
| shareProcessNamespace | bool? | Share a single process namespace between entry pod containers. |
| imagePullSecrets | []string | Resource names referencing type: secret resources for private image registries. |
| Worker Runtime Extensions (optional) | | |
| workerSidecars | map<string, ContainerSpec>? | Sidecar containers for worker pods. Falls back to entry sidecars if omitted. |
| workerSysctls | map<string, string>? | Kernel parameter overrides for worker pods. Falls back to entry sysctls if omitted. |
| workerHostNetwork | bool? | Host network for worker pods. Falls back to the entry setting if omitted. |
| workerImagePullSecrets | []string? | Image pull secrets for worker pods. Falls back to entry secrets if omitted. |
Worker runtime extensions are optional. When omitted, worker pods inherit the corresponding entry runtime settings. This lets you share a common runtime configuration while only overriding specific fields for workers when needed.
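As a sketch of this fallback behavior (image names and sysctl values here are illustrative, not taken from the examples above), a role can share its entry sysctls with workers while overriding only the worker network mode:

```yaml
roles:
  decode:
    replicas: 1
    entryWorkload:
      containers:
        main:
          image: ghcr.io/example/decode:v1  # illustrative image
    # Entry runtime extension; workers inherit it because no
    # workerSysctls override is set.
    sysctls:
      net.core.somaxconn: "4096"
    workerReplicas: 2
    workerWorkload:
      containers:
        main:
          image: ghcr.io/example/decode:v1
    # Worker-only override: entry pods keep the default pod network.
    workerHostNetwork: true
```

Only the fields you set on the worker side diverge; everything else follows the entry configuration.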
## Lifecycle
- Pending — Model is created. Lattice compiles mesh members, secrets, volumes, and TracingPolicies.
- Loading — Infrastructure is ready. The Volcano ModelServing is created and model artifacts are being loaded.
- Serving — Model is loaded and serving inference requests.
- Failed — Model has encountered an error. Check conditions for details.

## Status
| Field | Type | Description |
|---|---|---|
| phase | ModelServingPhase | Current phase: Pending, Loading, Serving, or Failed. |
| message | string? | Human-readable status message. |
| observedGeneration | i64? | Last observed metadata.generation. |
| conditions | []ModelCondition | Detailed conditions mirrored from the underlying ModelServing resource. |
### ModelCondition
| Field | Type | Description |
|---|---|---|
| type | string | Condition type: Available, Progressing, Failed, or UpdateInProgress. |
| status | string | True, False, or Unknown. |
| reason | string? | Machine-readable reason for the condition. |
| message | string? | Human-readable message with details. |
| lastTransitionTime | string? | Timestamp of the last condition transition. |
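For reference, a healthy model's status might look roughly like this. The phase and condition types follow the tables above; the reason string, message text, and timestamp are illustrative placeholders, not values the controller is guaranteed to emit:

```yaml
status:
  phase: Serving
  message: "all roles serving"        # illustrative message
  observedGeneration: 3
  conditions:
    - type: Available
      status: "True"
      reason: RolesReady              # illustrative reason
      message: "2/2 roles serving"
      lastTransitionTime: "2024-05-01T12:00:00Z"
```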