Workloads

LatticeJob

A batch workload resource backed by a Volcano VCJob. It defines named tasks with independent pod templates, replica counts, and restart policies, and supports gang scheduling, fair-share queuing, and the same WorkloadSpec composition model as LatticeService.

group: lattice.dev version: v1alpha1 scope: namespaced

Examples

Batch jobs with named tasks, demonstrating master/worker patterns, secret references, restart policies, and GPU gang scheduling.

Data Pipeline with Master/Worker Tasks

data-pipeline.yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeJob
metadata:
  name: data-pipeline
  namespace: batch
spec:
  queue: default
  tasks:
    master:
      replicas: 1
      workload:
        containers:
          main:
            image: ghcr.io/myorg/pipeline-master:latest
            command: ["/bin/sh", "-c", "echo 'master ready'; sleep 30"]
            resources:
              requests:
                cpu: 100m
                memory: 64Mi
              limits:
                cpu: 200m
                memory: 128Mi
        resources:
          ghcr-creds:
            type: secret
            id: registry-creds
            params:
              provider: vault-ci
              refreshInterval: 1h
      imagePullSecrets:
        - ghcr-creds
      restartPolicy: Never
    worker:
      replicas: 2
      workload:
        containers:
          main:
            image: ghcr.io/myorg/pipeline-worker:latest
            command: ["/bin/sh", "-c", "echo 'worker running'; sleep 30"]
            resources:
              requests:
                cpu: 100m
                memory: 64Mi
              limits:
                cpu: 200m
                memory: 128Mi
        resources:
          ghcr-creds:
            type: secret
            id: registry-creds
            params:
              provider: vault-ci
              refreshInterval: 1h
      imagePullSecrets:
        - ghcr-creds
      restartPolicy: OnFailure

GPU Training Job with Gang Scheduling

llm-finetune.yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeJob
metadata:
  name: llm-finetune
  namespace: ml
spec:
  queue: gpu-queue
  priorityClassName: high-priority
  minAvailable: 3           # gang-schedule: all 3 pods must be placed
  maxRetry: 2
  tasks:
    trainer:
      replicas: 3
      workload:
        containers:
          main:
            image: ghcr.io/myorg/llm-trainer:v2.1
            variables:
              NCCL_DEBUG: INFO
              MODEL_URI: s3://models/llama-7b
            volumes:
              /data:
                source: ${resources.training-data}
                readOnly: true
              /checkpoints:
                source: ${resources.checkpoints}
            resources:
              requests:
                cpu: 8000m
                memory: 64Gi
              limits:
                cpu: 16000m
                memory: 128Gi
        resources:
          gpu:
            type: gpu
            params:
              count: 4
              memory: 80Gi
              model: A100
          training-data:
            type: volume
            id: ml-training-data
          checkpoints:
            type: volume
            id: ml-checkpoints
            params:
              size: 500Gi
              storageClass: fast-nvme
      restartPolicy: OnFailure

Spec

LatticeJob spec fields
Field Type Description
tasks map<string, JobTaskSpec> Named task definitions. Each task maps to a Volcano VCJob task with its own pod template.
schedulerName string Volcano scheduler name. Default: "volcano".
queue string? Volcano queue name for fair-share scheduling.
priorityClassName string? Priority class for Volcano fair-share scheduling.
minAvailable u32? Minimum pods that must be schedulable before any are placed (gang scheduling).
maxRetry u32? Maximum retry count for failed tasks.
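Taken together, the job-level fields above can be sketched as a minimal manifest. This is an illustrative skeleton, not a tested configuration — the names and values are placeholders:

```yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeJob
metadata:
  name: example-job          # illustrative name
  namespace: batch
spec:
  schedulerName: volcano     # default; usually omitted
  queue: default             # optional fair-share queue
  priorityClassName: high-priority   # optional
  minAvailable: 2            # gang scheduling: all-or-nothing placement
  maxRetry: 1                # optional retry budget for failed tasks
  tasks:
    main:
      replicas: 2
      workload:
        containers:
          main:
            image: busybox:1.36
            command: ["sh", "-c", "echo done"]
```

Only `tasks` is required; `schedulerName` defaults to `volcano`, and the remaining fields are optional tuning for queuing, priority, gang scheduling, and retries.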

JobTaskSpec

A single task within a LatticeJob. Each task has its own pod template, replica count, and restart policy. Tasks share the same WorkloadSpec composition model as LatticeService.

JobTaskSpec fields
Field Type Description
workload WorkloadSpec Score-compatible workload specification: containers, resources, and service ports.
replicas u32 Number of pod replicas for this task. Default: 1.
restartPolicy RestartPolicy? Pod restart policy. Default: Never.
sidecars map<string, ContainerSpec> Additional sidecar containers with capability control.
sysctls map<string, string> Kernel parameter overrides for the pod.
hostNetwork bool? Use the host network namespace.
shareProcessNamespace bool? Share a single process namespace between containers.
imagePullSecrets []string Resource names referencing type: secret resources for private image registries.
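The pod-level fields above (`sidecars`, `sysctls`, `shareProcessNamespace`) do not appear in the examples. As a sketch of how they compose within a task — the sidecar name, image, and sysctl value are hypothetical:

```yaml
tasks:
  worker:
    replicas: 2
    restartPolicy: OnFailure
    shareProcessNamespace: true       # containers see each other's processes
    sysctls:
      net.core.somaxconn: "4096"      # illustrative kernel parameter override
    sidecars:
      log-shipper:                    # hypothetical sidecar container
        image: ghcr.io/myorg/log-shipper:latest
    workload:
      containers:
        main:
          image: ghcr.io/myorg/worker:latest
```

Note that sidecars live alongside `workload` in the task spec rather than inside it, keeping the Score-compatible WorkloadSpec free of operational helpers.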

RestartPolicy

Never Never restart on failure (default). Suitable for one-shot tasks.
OnFailure Restart the pod if it exits with a non-zero code. Suitable for worker tasks that should retry.
Always Always restart regardless of exit code.

Lifecycle

LatticeJob follows a linear state machine. Once the job reaches Running, the spec is treated as immutable; subsequent spec changes are rejected with a warning.

1 Pending — Job is created. Lattice compiles mesh members, secrets, volumes, and TracingPolicies.
2 Running — Infrastructure is ready. Volcano VCJob is created and pods are gang-scheduled.
3 Succeeded or Failed — Terminal state. Service graph edges are cleaned up.

Status

LatticeJob status fields
Field Type Description
phase JobPhase Current phase: Pending, Running, Succeeded, or Failed.
message string? Human-readable status message.
observedGeneration i64? Last observed metadata.generation. Used to detect spec mutations after the job is running.
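As a sketch, a successfully completed LatticeJob might report a status block like the following (the message text is illustrative, not the controller's actual wording):

```yaml
status:
  phase: Succeeded
  message: "all tasks completed"   # illustrative message
  observedGeneration: 1            # matches metadata.generation at admission
```

Comparing `observedGeneration` against `metadata.generation` is how the controller detects spec mutations after the job has entered Running.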