Gang Scheduling on Kubernetes

Partial placement of a distributed training job is worse than no placement. Seven GPUs burning ~$25/hour while the eighth can’t be scheduled isn’t a scheduling delay — it’s money on fire.

The default kube-scheduler processes pods one at a time. Submit a distributed training job that needs 8 GPUs, and the scheduler tries to place each worker independently. If it places 7 but the 8th has no available GPU, you have 7 pods sitting at a synchronization barrier, doing nothing, costing everything.

I’ve watched this happen. Everyone has. You’re staring at a Grafana dashboard, watching GPU utilization flatline on seven pods while the eighth bounces between Pending and Unschedulable. The cost ticker keeps climbing. Someone opens a Slack thread: “should we kill the job?” Someone else says “give it five more minutes.” Those five minutes cost $29 in idle GPU time.

The starvation variant is nastier. Job A takes 4 GPUs. Job B needs 4 but only gets 2 placed. Those 2 pods are stuck at a barrier while higher-priority jobs keep arriving. Job B’s remaining pods may never get placed. You’re paying for idle GPUs indefinitely, and the scheduler has no mechanism to detect or correct this.

The kube-scheduler has no concept of pod groups. It can’t express “place all 8 or place none.” Once it binds a pod to a node, that decision is final — no rollback.

This design is correct for stateless services. A Deployment with 3 replicas tolerates partial placement. Batch workloads don’t. They need all participants or they need to wait.

There’s a second constraint: fragmentation. A cluster with 8 free GPUs scattered one-per-node can’t place a job that needs 2 GPUs per node. The pattern is common in large GPU clusters — hundreds of GPUs free in aggregate, but few nodes with all slots available because naive scheduling scatters single-GPU jobs everywhere. You have the GPUs. You just can’t use them.

I chose Volcano for gang scheduling. CNCF incubating project, production-proven at Huawei, Tencent, and AWS (EMR on EKS).

Volcano runs alongside the default scheduler — schedulerName: volcano on batch workloads, kube-scheduler for everything else. The central abstraction is the PodGroup with minMember: N: the scheduler won’t bind any pod in the group unless it can place at least N simultaneously.
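Concretely, the gang contract is expressed as a PodGroup plus an annotation on each member pod. A minimal sketch (names like `train-gang` and `my-trainer:latest` are illustrative, not from a real deployment):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: train-gang
spec:
  minMember: 8                  # bind nothing unless all 8 fit simultaneously
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  annotations:
    scheduling.k8s.io/group-name: train-gang   # membership in the gang
spec:
  schedulerName: volcano        # this pod bypasses kube-scheduler entirely
  containers:
    - name: worker
      image: my-trainer:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```

Pods without `schedulerName: volcano` never touch Volcano — the two schedulers coexist by partitioning the workload, not the cluster.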

```mermaid
flowchart LR
  subgraph "kube-scheduler (eager bind)"
    K1[Pod 1] --> N1[Node A]
    K2[Pod 2] --> N2[Node B]
    K3[Pod 3] --> N3[Node C]
    K4["Pod 4 !!"] -. "no GPU" .-> N4["???"]
  end
  subgraph "Volcano (two-phase commit)"
    V1[Pod 1] --> T["Tentative\nplacement"]
    V2[Pod 2] --> T
    V3[Pod 3] --> T
    V4[Pod 4] --> T
    T -- "all fit?" --> C{Commit all}
  end
```

The key insight is the two-phase commit. Instead of binding pods as it finds nodes, Volcano tentatively places all pods in memory first. Once minMember is satisfied, it commits the batch. If it can’t, it drops the tentative placements — no pods bound to nodes, no compute resources consumed. All or nothing. That’s the whole point.
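In practice you rarely write the PodGroup by hand: Volcano's own Job type creates it for you, with `minAvailable` as the gang threshold. A sketch of the 8-GPU training job from above (image name illustrative):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: dist-train
spec:
  schedulerName: volcano
  minAvailable: 8               # commit all 8 workers or commit none
  queue: default
  tasks:
    - name: worker
      replicas: 8
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: my-trainer:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
```

Volcano derives a PodGroup with `minMember: 8` from `minAvailable`, so the all-or-nothing semantics come for free with the Job.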

What about the alternatives?

  • Kueue handles admission control and quotas. Recent versions support gang scheduling via JobSet integration — a reasonable choice if you’re already invested in Kueue’s quota model.
  • YuniKorn supports gang semantics but replaces the default scheduler entirely. I wanted gang for batch while keeping kube-scheduler for services.
  • The scheduler-plugins CoScheduling plugin adds basic PodGroup support to kube-scheduler. Covers the placement constraint but lacks Volcano’s queue system, dominant resource fairness, and cross-queue reclaim.

I’d use Kueue alongside Volcano if I needed quota management. They solve different problems. Pretending otherwise wastes time.
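For comparison, the scheduler-plugins CoScheduling route looks similar on the surface — a PodGroup CRD plus a label on member pods — but stops at the placement constraint. A fragment, assuming the plugin is enabled in your scheduler profile:

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: train-gang
spec:
  minMember: 8
---
# Each member pod opts in via a label rather than an annotation:
#   metadata:
#     labels:
#       scheduling.x-k8s.io/pod-group: train-gang
```

There is no queue, fairness, or reclaim machinery behind this — which is exactly the gap the bullet above describes.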

For fragmentation, Lattice combines Volcano with bin-packing node scoring. Gang scheduling gives atomicity. Packing gives density. Together they dramatically increase the number of nodes with full GPU availability — turning a cluster that looks 95% utilized on paper but can’t place a multi-GPU job into one that can.
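The packing side is a Volcano scheduler configuration choice, not a separate component: the `binpack` plugin scores nodes so that jobs consolidate instead of spreading. A sketch of the scheduler ConfigMap — the weights here are illustrative and need tuning for your cluster:

```yaml
# volcano-scheduler.conf (ConfigMap data)
actions: "enqueue, allocate, backfill"
tiers:
  - plugins:
      - name: priority
      - name: gang            # all-or-nothing admission per PodGroup
      - name: conformance
  - plugins:
      - name: drf             # dominant resource fairness across queues
      - name: predicates
      - name: binpack         # score nodes to pack, not spread
        arguments:
          binpack.weight: 10
          binpack.resources: nvidia.com/gpu
          binpack.resources.nvidia.com/gpu: 5
      - name: nodeorder
```

With packing on, single-GPU jobs pile onto partially-full nodes, leaving whole nodes free for the multi-GPU gangs.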

In Lattice, you don’t interact with Volcano directly. You write a LatticeJob CRD with task groups and replica counts. The platform compiles it into the correct Volcano VCJob, PodGroup, and queue assignment — while still evaluating the same authorization gates, secret resolution, and network policy compilation as any other workload.
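As a purely hypothetical sketch of what that looks like from the user's side — the API group and field names below are invented for illustration, since the LatticeJob schema is internal to the platform:

```yaml
# Hypothetical shape; field names inferred from "task groups and replica counts".
apiVersion: lattice.example.com/v1
kind: LatticeJob
metadata:
  name: dist-train
spec:
  taskGroups:
    - name: worker
      replicas: 8               # compiled into minAvailable/minMember: 8
      gpusPerReplica: 1
      image: my-trainer:latest
```

The point is the level of abstraction: the user declares group shape, and the platform owns the translation into Volcano primitives.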

The GPU bill doesn’t care about your architecture’s elegance. It cares whether 7 pods sat idle for an hour because the scheduler couldn’t think about groups. Volcano makes the scheduler think about groups. That’s worth the operational complexity.

  • The default kube-scheduler can’t do all-or-nothing. It binds pods independently. Partial placement on GPU workloads is money on fire — literally, measurably, ~$25/hour on fire.
  • Volcano’s two-phase commit solves gang scheduling. Tentative placement in memory, then commit or drop. No resources wasted on partial placement.
  • Fragmentation is a separate problem from atomicity. Having enough aggregate GPUs doesn’t mean you can place a job. Gang scheduling plus bin-packing solves both.