Gang Scheduling

Gang scheduling ensures that all pods in a workload are scheduled together: either all pods start at once or none start at all. This matters for distributed workloads such as multi-GPU training jobs, where a partially scheduled job holds resources without making progress. This page uses Kueue, a Kubernetes-native job queueing controller, to provide the all-or-nothing semantics: Kueue keeps a job suspended until its queue has quota for every pod, then admits the job as a whole. Kueue installs as a workload inside the cluster and isn't tied to anything Breqwatr-specific.

Prerequisites

Before proceeding, make sure you have a running Kubernetes cluster, with kubectl and Helm installed and configured to reach it. See the cluster setup guide that matches your environment.

Install Kueue

Install Kueue using Helm:

helm install kueue oci://registry.k8s.io/kueue/charts/kueue \
  --version=0.17.2 \
  --namespace kueue-system \
  --create-namespace \
  --wait --timeout 300s

Verify Kueue is running:

kubectl get pods -n kueue-system

Expected output:

NAME                                        READY   STATUS    RESTARTS   AGE
kueue-controller-manager-7c557f677c-t5vq9   1/1     Running   0          33s

Configure Queues and Resource Flavors

Create a ClusterQueue, a ResourceFlavor, and a LocalQueue. The ClusterQueue defines the cluster-wide resource quotas, the ResourceFlavor describes the available hardware, and the LocalQueue is the namespace-scoped entry point that workloads reference. Despite the gpu- prefix in the names, this example sets quotas only for cpu and memory:

apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: "10"
      - name: memory
        nominalQuota: 10Gi
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: gpu-local-queue
  namespace: default
spec:
  clusterQueue: gpu-cluster-queue
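
The manifests above still have to be applied to the cluster. The following is a minimal sketch, assuming all three manifests were saved in a single file (the name kueue-queues.yaml is hypothetical) and that the installed Kueue version reports an Active condition on both queue kinds. The helper name apply_queues is illustrative only.

```shell
# Apply the queue manifests, then wait until Kueue marks the queues Active.
# Assumption: $1 is a file containing the ClusterQueue, ResourceFlavor, and
# LocalQueue shown above (for example the hypothetical kueue-queues.yaml).
apply_queues() {
  if [ -z "$1" ]; then
    echo "usage: apply_queues <manifest.yaml>" >&2
    return 2
  fi
  kubectl apply -f "$1" || return 1
  # A ClusterQueue stays inactive while a flavor it references is missing.
  kubectl wait --for=condition=Active clusterqueue/gpu-cluster-queue \
    --timeout=60s || return 1
  kubectl wait --for=condition=Active localqueue/gpu-local-queue \
    --namespace default --timeout=60s
}
```

Call it as apply_queues kueue-queues.yaml. If either wait times out, kubectl describe on the queue in question usually shows the reason in its conditions.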

Verify all three resources were created:

kubectl get clusterqueue

Expected output:

NAME                COHORT   PENDING WORKLOADS
gpu-cluster-queue            0

kubectl get localqueue

Expected output:

NAME              CLUSTERQUEUE        PENDING WORKLOADS   ADMITTED WORKLOADS
gpu-local-queue   gpu-cluster-queue   0                   0

kubectl get resourceflavor

Expected output:

NAME             AGE
default-flavor   2m58s
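
With the quota in place, whether a job is admitted comes down to comparing its total request with the nominalQuota. The sketch below is a deliberate simplification: it assumes no other workloads are consuming the quota and no borrowing from a cohort. The helper gang_fits and its millicore arguments are illustrative and not part of Kueue.

```shell
# gang_fits PODS PER_POD_MILLICPU QUOTA_MILLICPU
# Prints "admit" when the whole gang fits in the quota, otherwise "hold".
# Simplification: ignores quota already in use and cohort borrowing.
gang_fits() {
  pods="$1"; per_pod="$2"; quota="$3"
  total=$((pods * per_pod))
  if [ "$total" -le "$quota" ]; then
    echo "admit"
  else
    echo "hold"
  fi
}

# 2 pods x 100m CPU against the 10-CPU (10000m) nominalQuota above.
gang_fits 2 100 10000    # prints "admit"
# 2 pods x 6000m CPU would need 12 CPUs, so the gang as a whole is held.
gang_fits 2 6000 10000   # prints "hold"
```

The point of the second call is that the answer is never "admit one pod": the request is evaluated for the job as a unit.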

Submit a Gang-Scheduled Job

Submit a test job with parallelism: 2 and completions: 2. Kueue admits the job only when the queue has quota for both pods; otherwise it keeps the job suspended and neither pod is created. The job references the LocalQueue through the kueue.x-k8s.io/queue-name label:

apiVersion: batch/v1
kind: Job
metadata:
  name: gang-schedule-test
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: gpu-local-queue
spec:
  completions: 2
  parallelism: 2
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo Gang scheduled worker && sleep 30"]
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
      restartPolicy: Never
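
To submit the job, apply the manifest and wait for it to finish. This is a minimal sketch, assuming the manifest above was saved to a file (the name gang-schedule-test.yaml is hypothetical); the helper name submit_gang_job and the 180-second timeout are illustrative choices.

```shell
# Submit the Job and wait for it to finish.
# Assumption: $1 is the file holding the Job manifest above
# (for example the hypothetical gang-schedule-test.yaml).
submit_gang_job() {
  if [ -z "$1" ]; then
    echo "usage: submit_gang_job <job.yaml>" >&2
    return 2
  fi
  kubectl apply -f "$1" || return 1
  # Both workers sleep for 30s, so leave headroom for image pulls.
  kubectl wait --for=condition=complete job/gang-schedule-test \
    --namespace default --timeout=180s
}
```

Call it as submit_gang_job gang-schedule-test.yaml. A timeout here does not necessarily mean failure; the job may simply still be waiting for quota.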

Verify both pods were scheduled and completed together:

kubectl get pods

Expected output:

NAME                       READY   STATUS      RESTARTS   AGE
gang-schedule-test-8sb7w   0/1     Completed   0          97s
gang-schedule-test-fhrxt   0/1     Completed   0          97s

kubectl get jobs

Expected output:

NAME                 STATUS     COMPLETIONS   DURATION   AGE
gang-schedule-test   Complete   2/2           38s        2m2s

Inspect the job events to see Kueue's scheduling flow:

kubectl describe job gang-schedule-test

Expected output:

Name:             gang-schedule-test
Namespace:        default
Selector:         batch.kubernetes.io/controller-uid=2780b857-a84a-4d71-958f-bd78979c604e
Labels:           kueue.x-k8s.io/queue-name=gpu-local-queue
Annotations:      <none>
Parallelism:      2
Completions:      2
Completion Mode:  NonIndexed
Suspend:          false
Backoff Limit:    6
Start Time:       Tue, 05 May 2026 16:14:09 -0400
Completed At:     Tue, 05 May 2026 16:14:47 -0400
Duration:         38s
Pods Statuses:    0 Active (0 Ready) / 2 Succeeded / 0 Failed
Pod Template:
  Labels:       batch.kubernetes.io/controller-uid=2780b857-a84a-4d71-958f-bd78979c604e
                batch.kubernetes.io/job-name=gang-schedule-test
                controller-uid=2780b857-a84a-4d71-958f-bd78979c604e
                job-name=gang-schedule-test
                kueue.x-k8s.io/cluster-queue-name=gpu-cluster-queue
                kueue.x-k8s.io/local-queue-name=gpu-local-queue
                kueue.x-k8s.io/podset=main
  Annotations:  kueue.x-k8s.io/workload: job-gang-schedule-test-33e93
  Containers:
   worker:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
      echo Gang scheduled worker && sleep 30
    Requests:
      cpu:         100m
      memory:      128Mi
    Environment:   <none>
    Mounts:        <none>
  Volumes:         <none>
  Node-Selectors:  <none>
  Tolerations:     <none>
Events:
  Type    Reason            Age                From                        Message
  ----    ------            ----               ----                        -------
  Normal  Suspended         2m5s               job-controller              Job suspended
  Normal  CreatedWorkload   2m5s               batch/job-kueue-controller  Created Workload: default/job-gang-schedule-test-33e93
  Normal  Started           2m5s               batch/job-kueue-controller  Admitted by clusterQueue gpu-cluster-queue
  Normal  SuccessfulCreate  2m5s               job-controller              Created pod: gang-schedule-test-fhrxt
  Normal  SuccessfulCreate  2m5s               job-controller              Created pod: gang-schedule-test-8sb7w
  Normal  Resumed           2m5s               job-controller              Job resumed
  Normal  Completed         87s                job-controller              Job completed
  Normal  FinishedWorkload  87s (x3 over 87s)  batch/job-kueue-controller  Workload 'default/job-gang-schedule-test-33e93' is declared finished

The events confirm the gang scheduling behaviour: the job was created in a suspended state, Kueue created a Workload for it and admitted that Workload as a whole to gpu-cluster-queue, both pods were then created at the same time (both creation events show an age of 2m5s), and the job completed successfully.
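
The other half of the all-or-nothing behaviour, where no pod starts, can be observed with a job that does not fit the quota. The sketch below is a hedged example rather than part of the original walkthrough: it assumes the queues configured earlier are unchanged, and the job name gang-schedule-oversized and both helper names are hypothetical. Each pod requests 6 CPUs, so the pair needs 12 against a nominalQuota of 10.

```shell
# Submit a Job whose total request (2 x 6 CPU = 12) exceeds the 10-CPU quota.
# The name gang-schedule-oversized is hypothetical.
submit_oversized_job() {
  kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: gang-schedule-oversized
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: gpu-local-queue
spec:
  completions: 2
  parallelism: 2
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "sleep 30"]
        resources:
          requests:
            cpu: "6"
            memory: "128Mi"
      restartPolicy: Never
EOF
}

# Print the Job's suspend flag, then list Kueue's Workload objects.
show_oversized_state() {
  kubectl get job gang-schedule-oversized --namespace default \
    -o jsonpath='{.spec.suspend}{"\n"}'
  kubectl get workloads.kueue.x-k8s.io --namespace default
}
```

Run submit_oversized_job followed by show_oversized_state. If Kueue holds the job as intended, the suspend flag should remain true and kubectl get pods should list no pods for this job, not even one of the two. Remove the job afterwards with kubectl delete job gang-schedule-oversized.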