Workloads
LatticeJob
Batch workload resource backed by Volcano VCJob. Defines named tasks with independent pod templates, replica counts, and restart policies. Supports gang scheduling, fair-share queuing, and the same WorkloadSpec composition model as LatticeService.
group: lattice.dev, version: v1alpha1, scope: namespaced
Examples
Batch jobs with named tasks, demonstrating master/worker patterns, secret references, restart policies, and GPU gang scheduling.
Data Pipeline with Master/Worker Tasks
data-pipeline.yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeJob
metadata:
  name: data-pipeline
  namespace: batch
spec:
  queue: default
  tasks:
    master:
      replicas: 1
      workload:
        containers:
          main:
            image: ghcr.io/myorg/pipeline-master:latest
            command: ["/bin/sh", "-c", "echo 'master ready'; sleep 30"]
            resources:
              requests:
                cpu: 100m
                memory: 64Mi
              limits:
                cpu: 200m
                memory: 128Mi
        resources:
          ghcr-creds:
            type: secret
            id: registry-creds
            params:
              provider: vault-ci
              refreshInterval: 1h
      imagePullSecrets:
        - ghcr-creds
      restartPolicy: Never
    worker:
      replicas: 2
      workload:
        containers:
          main:
            image: ghcr.io/myorg/pipeline-worker:latest
            command: ["/bin/sh", "-c", "echo 'worker running'; sleep 30"]
            resources:
              requests:
                cpu: 100m
                memory: 64Mi
              limits:
                cpu: 200m
                memory: 128Mi
        resources:
          ghcr-creds:
            type: secret
            id: registry-creds
            params:
              provider: vault-ci
              refreshInterval: 1h
      imagePullSecrets:
        - ghcr-creds
      restartPolicy: OnFailure

GPU Training Job with Gang Scheduling
llm-finetune.yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeJob
metadata:
  name: llm-finetune
  namespace: ml
spec:
  queue: gpu-queue
  priorityClassName: high-priority
  minAvailable: 3  # gang-schedule: all 3 pods must be placed
  maxRetry: 2
  tasks:
    trainer:
      replicas: 3
      workload:
        containers:
          main:
            image: ghcr.io/myorg/llm-trainer:v2.1
            variables:
              NCCL_DEBUG: INFO
              MODEL_URI: s3://models/llama-7b
            volumes:
              /data:
                source: ${resources.training-data}
                readOnly: true
              /checkpoints:
                source: ${resources.checkpoints}
            resources:
              requests:
                cpu: 8000m
                memory: 64Gi
              limits:
                cpu: 16000m
                memory: 128Gi
        resources:
          gpu:
            type: gpu
            params:
              count: 4
              memory: 80Gi
              model: A100
          training-data:
            type: volume
            id: ml-training-data
          checkpoints:
            type: volume
            id: ml-checkpoints
            params:
              size: 500Gi
              storageClass: fast-nvme
      restartPolicy: OnFailure

Spec
| Field | Type | Description |
|---|---|---|
| tasks | map<string, JobTaskSpec> | Named task definitions. Each task maps to a Volcano VCJob task with its own pod template. |
| schedulerName | string | Volcano scheduler name. Default: "volcano". |
| queue | string? | Volcano queue name for fair-share scheduling. |
| priorityClassName | string? | Priority class for Volcano fair-share scheduling. |
| minAvailable | u32? | Minimum pods that must be schedulable before any are placed (gang scheduling). |
| maxRetry | u32? | Maximum retry count for failed tasks. |

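
As a minimal sketch of how the spec-level fields combine, the manifest below sets every field from the table on a two-task job. The job name and image references are hypothetical, and `schedulerName` is written out only to show the documented default. That `minAvailable` counts pods across all tasks is an assumption based on Volcano's usual gang semantics, not something this page states.

```yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeJob
metadata:
  name: spec-fields-demo          # hypothetical name
  namespace: batch
spec:
  schedulerName: volcano          # documented default, shown explicitly
  queue: default                  # Volcano queue used for fair-share scheduling
  priorityClassName: high-priority
  minAvailable: 3                 # assumed to count pods across both tasks (1 + 2)
  maxRetry: 2                     # maximum retry count for failed tasks
  tasks:
    coordinator:
      replicas: 1                 # restartPolicy omitted, so the default (Never) applies
      workload:
        containers:
          main:
            image: registry.example.com/demo/coordinator:1.0   # hypothetical image
    worker:
      replicas: 2
      restartPolicy: OnFailure
      workload:
        containers:
          main:
            image: registry.example.com/demo/worker:1.0        # hypothetical image
```

Under that assumption, setting `minAvailable` to the total replica count means no pod is placed until all three can be scheduled together.
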
JobTaskSpec
A single task within a LatticeJob. Each task has its own pod template, replica count, and restart policy. Tasks share the same WorkloadSpec composition model as LatticeService.
| Field | Type | Description |
|---|---|---|
| workload | WorkloadSpec | Score-compatible workload specification: containers, resources, and service ports. |
| replicas | u32 | Number of pod replicas for this task. Default: 1. |
| restartPolicy | RestartPolicy? | Pod restart policy. Default: Never. |
| sidecars | map<string, ContainerSpec> | Additional sidecar containers with capability control. |
| sysctls | map<string, string> | Kernel parameter overrides for the pod. |
| hostNetwork | bool? | Use the host network namespace. |
| shareProcessNamespace | bool? | Share a single process namespace between containers. |
| imagePullSecrets | []string | Resource names referencing type: secret resources for private image registries. |

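
The task-level fields can be sketched as a single `tasks` entry. This is an illustrative fragment rather than a tested manifest: the task name, sidecar name, images, and sysctl value are hypothetical. The sidecar uses only `image` and `command` because ContainerSpec's capability-control fields are not listed in this section.

```yaml
tasks:
  ingest:                            # hypothetical task name
    replicas: 2
    restartPolicy: OnFailure         # overrides the default (Never)
    hostNetwork: false
    shareProcessNamespace: true      # containers in the pod share one process namespace
    sysctls:
      net.core.somaxconn: "1024"     # values are strings (map<string, string>)
    imagePullSecrets:
      - ghcr-creds                   # names the type: secret resource declared below
    workload:
      containers:
        main:
          image: ghcr.io/myorg/ingest:1.0          # hypothetical image
      resources:
        ghcr-creds:
          type: secret
          id: registry-creds
          params:
            provider: vault-ci
            refreshInterval: 1h
    sidecars:
      log-shipper:                   # hypothetical sidecar name
        image: ghcr.io/myorg/log-shipper:1.0       # hypothetical image
        command: ["/bin/sh", "-c", "echo 'sidecar ready'; sleep 30"]
```

The sidecar command exits on its own here. A sidecar that never exits may keep a batch pod from completing, depending on how the controller handles sidecar lifecycles.
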
RestartPolicy
| Value | Description |
|---|---|
| Never | Never restart on failure (default). Suitable for one-shot tasks. |
| OnFailure | Restart the pod if it exits with a non-zero code. Suitable for worker tasks that should retry. |
| Always | Always restart regardless of exit code. |

Lifecycle
LatticeJob follows a linear state machine. Once the job reaches Running, the spec is treated as immutable — changes to the spec are rejected with a warning.
1. Pending: Job is created. Lattice compiles mesh members, secrets, volumes, and TracingPolicies.
2. Running: Infrastructure is ready. Volcano VCJob is created and pods are gang-scheduled.
3. Succeeded or Failed: Terminal state. Service graph edges are cleaned up.

Status
| Field | Type | Description |
|---|---|---|
| phase | JobPhase | Current phase: Pending, Running, Succeeded, or Failed. |
| message | string? | Human-readable status message. |
| observedGeneration | i64? | Last observed metadata.generation. Used to detect spec mutations after the job is running. |
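
To illustrate how `observedGeneration` can expose a spec change made after the job started, here is a sketch of what a running job might report once its spec has been edited. The generation numbers and the message wording are invented for illustration. Only the field names and phase values come from the table above, and it is an assumption that the edit still bumps `metadata.generation` even though the controller rejects it.

```yaml
metadata:
  name: llm-finetune
  namespace: ml
  generation: 2              # assumed: incremented when the spec was edited
status:
  phase: Running
  message: "spec change rejected: job is already running"   # hypothetical wording
  observedGeneration: 1      # generation the controller last acted on
```

A mismatch between `metadata.generation` and `status.observedGeneration` on a Running job would then indicate that the spec was changed after the job became immutable.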