Workloads

LatticeJob

A batch workload resource backed by a Volcano VCJob. It defines named tasks with independent pod templates, replica counts, and restart policies, and supports gang scheduling, fair-share queuing, and the same WorkloadSpec composition model as LatticeService.

group: lattice.dev version: v1alpha1 scope: namespaced

Examples

Batch jobs with named tasks, demonstrating master/worker patterns, secret references, restart policies, and GPU gang scheduling.

Data Pipeline with Master/Worker Tasks

data-pipeline.yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeJob
metadata:
  name: data-pipeline
  namespace: batch
spec:
  queue: default
  tasks:
    master:
      replicas: 1
      workload:
        containers:
          main:
            image: ghcr.io/myorg/pipeline-master:latest
            command: ["/bin/sh", "-c", "echo 'master ready'; sleep 30"]
            resources:
              requests:
                cpu: 100m
                memory: 64Mi
              limits:
                cpu: 200m
                memory: 128Mi
        resources:
          ghcr-creds:
            type: secret
            id: registry-creds
            params:
              provider: vault-ci
              refreshInterval: 1h
      imagePullSecrets:
        - ghcr-creds
      restartPolicy: Never
    worker:
      replicas: 2
      workload:
        containers:
          main:
            image: ghcr.io/myorg/pipeline-worker:latest
            command: ["/bin/sh", "-c", "echo 'worker running'; sleep 30"]
            resources:
              requests:
                cpu: 100m
                memory: 64Mi
              limits:
                cpu: 200m
                memory: 128Mi
        resources:
          ghcr-creds:
            type: secret
            id: registry-creds
            params:
              provider: vault-ci
              refreshInterval: 1h
      imagePullSecrets:
        - ghcr-creds
      restartPolicy: OnFailure

GPU Training Job with Gang Scheduling

llm-finetune.yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeJob
metadata:
  name: llm-finetune
  namespace: ml
spec:
  queue: gpu-queue
  priorityClassName: high-priority
  minAvailable: 3           # gang-schedule: all 3 pods must be placed
  maxRetry: 2
  tasks:
    trainer:
      replicas: 3
      workload:
        containers:
          main:
            image: ghcr.io/myorg/llm-trainer:v2.1
            variables:
              NCCL_DEBUG: INFO
              MODEL_URI: s3://models/llama-7b
            volumes:
              /data:
                source: ${resources.training-data}
                readOnly: true
              /checkpoints:
                source: ${resources.checkpoints}
            resources:
              requests:
                cpu: 8000m
                memory: 64Gi
              limits:
                cpu: 16000m
                memory: 128Gi
        resources:
          gpu:
            type: gpu
            params:
              count: 4
              memory: 80Gi
              model: A100
          training-data:
            type: volume
            id: ml-training-data
          checkpoints:
            type: volume
            id: ml-checkpoints
            params:
              size: 500Gi
              storageClass: fast-nvme
      restartPolicy: OnFailure

Spec

LatticeJob spec fields
Field Type Description
tasks map<string, JobTaskSpec> Named task definitions. Each task maps to a Volcano VCJob task with its own pod template.
schedulerName string Volcano scheduler name. Default: "volcano".
queue string? Volcano queue name for fair-share scheduling.
priorityClassName string? Priority class for Volcano fair-share scheduling.
minAvailable u32? Minimum pods that must be schedulable before any are placed (gang scheduling).
maxRetry u32? Maximum retry count for failed tasks.
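Taken together, the job-level fields above can be sketched as a minimal manifest. This is an illustrative skeleton, not a tested configuration — the names and values are placeholders:

```yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeJob
metadata:
  name: example-job          # illustrative name
  namespace: batch
spec:
  schedulerName: volcano     # default; usually omitted
  queue: default             # optional fair-share queue
  priorityClassName: high-priority   # optional
  minAvailable: 2            # gang scheduling: all-or-nothing placement
  maxRetry: 1                # optional retry budget for failed tasks
  tasks:
    main:
      replicas: 2
      workload:
        containers:
          main:
            image: busybox:1.36
            command: ["sh", "-c", "echo done"]
```

Only `tasks` is required; `schedulerName` defaults to `volcano`, and the remaining fields are optional tuning for queuing, priority, gang scheduling, and retries.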

JobTaskSpec

A single task within a LatticeJob. Each task has its own pod template, replica count, and restart policy. Tasks share the same WorkloadSpec composition model as LatticeService.

JobTaskSpec fields
Field Type Description
workload WorkloadSpec Score-compatible workload specification: containers, resources, and service ports.
replicas u32 Number of pod replicas for this task. Default: 1.
restartPolicy RestartPolicy? Pod restart policy. Default: Never.
sidecars map<string, ContainerSpec> Additional sidecar containers with capability control.
sysctls map<string, string> Kernel parameter overrides for the pod.
hostNetwork bool? Use the host network namespace.
shareProcessNamespace bool? Share a single process namespace between containers.
imagePullSecrets []string Resource names referencing type: secret resources for private image registries.
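The pod-level fields above (`sidecars`, `sysctls`, `shareProcessNamespace`) do not appear in the examples. As a sketch of how they compose within a task — the sidecar name, image, and sysctl value are hypothetical:

```yaml
tasks:
  worker:
    replicas: 2
    restartPolicy: OnFailure
    shareProcessNamespace: true       # containers see each other's processes
    sysctls:
      net.core.somaxconn: "4096"      # illustrative kernel parameter override
    sidecars:
      log-shipper:                    # hypothetical sidecar container
        image: ghcr.io/myorg/log-shipper:latest
    workload:
      containers:
        main:
          image: ghcr.io/myorg/worker:latest
```

Note that sidecars live alongside `workload` in the task spec rather than inside it, keeping the Score-compatible WorkloadSpec free of operational helpers.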

RestartPolicy

Never Never restart on failure (default). Suitable for one-shot tasks.
OnFailure Restart the pod if it exits with a non-zero code. Suitable for worker tasks that should retry.
Always Always restart regardless of exit code.

Lifecycle

LatticeJob follows a linear state machine. Once the job reaches Running, the spec is treated as immutable; subsequent spec changes are rejected with a warning.

1 Pending — Job is created. Lattice compiles mesh members, secrets, volumes, and TracingPolicies.
2 Running — Infrastructure is ready. Volcano VCJob is created and pods are gang-scheduled.
3 Succeeded or Failed — Terminal state. Service graph edges are cleaned up.

Status

LatticeJob status fields
Field Type Description
phase JobPhase Current phase: Pending, Running, Succeeded, or Failed.
message string? Human-readable status message.
observedGeneration i64? Last observed metadata.generation. Used to detect spec mutations after the job is running.
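As a sketch, a successfully completed LatticeJob might report a status block like the following (the message text is illustrative, not the controller's actual wording):

```yaml
status:
  phase: Succeeded
  message: "all tasks completed"   # illustrative message
  observedGeneration: 1            # matches metadata.generation at admission
```

Comparing `observedGeneration` against `metadata.generation` is how the controller detects spec mutations after the job has entered Running.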