Kubeflow Training Operator
The Kubeflow Training Operator provides Kubernetes custom resources for running distributed machine learning training jobs, including PyTorch, TensorFlow, MPI, and XGBoost workloads.
Prerequisites
Before proceeding, make sure you have a running Kubernetes cluster. See one of the following guides depending on your setup:
Install the Training Operator
Install the Kubeflow Training Operator standalone using Kustomize:
kubectl apply --server-side -k \
"github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.0"
Expected output:
namespace/kubeflow serverside-applied
customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org serverside-applied
customresourcedefinition.apiextensions.k8s.io/mxjobs.kubeflow.org serverside-applied
customresourcedefinition.apiextensions.k8s.io/paddlejobs.kubeflow.org serverside-applied
customresourcedefinition.apiextensions.k8s.io/pytorchjobs.kubeflow.org serverside-applied
customresourcedefinition.apiextensions.k8s.io/tfjobs.kubeflow.org serverside-applied
customresourcedefinition.apiextensions.k8s.io/xgboostjobs.kubeflow.org serverside-applied
serviceaccount/training-operator serverside-applied
clusterrole.rbac.authorization.k8s.io/training-operator serverside-applied
clusterrolebinding.rbac.authorization.k8s.io/training-operator serverside-applied
secret/training-operator-webhook-cert serverside-applied
service/training-operator serverside-applied
deployment.apps/training-operator serverside-applied
validatingwebhookconfiguration.admissionregistration.k8s.io/validator.training-operator.kubeflow.org serverside-applied
Watch the operator pod start up:
Expected output once ready:
Verify the Installation
Check that the operator pod is running:
Expected output:
Check that the Kubeflow CRDs were installed:
Expected output:
mpijobs.kubeflow.org 2026-05-08T20:16:21Z
mxjobs.kubeflow.org 2026-05-08T20:16:21Z
paddlejobs.kubeflow.org 2026-05-08T20:16:22Z
pytorchjobs.kubeflow.org 2026-05-08T20:16:22Z
tfjobs.kubeflow.org 2026-05-08T20:16:22Z
xgboostjobs.kubeflow.org 2026-05-08T20:16:23Z
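For automated checks, the CRD verification above could be scripted. The following is a minimal Python sketch, not part of the operator's tooling. It assumes the CRD listing has already been captured as plain text, and the helper name missing_crds is hypothetical. The expected names are the six CRDs reported by the install step.

```python
# Sketch: confirm that every Kubeflow training CRD appears in a captured listing.
# EXPECTED_CRDS mirrors the CRDs reported by the install output in this guide.
EXPECTED_CRDS = {
    "mpijobs.kubeflow.org",
    "mxjobs.kubeflow.org",
    "paddlejobs.kubeflow.org",
    "pytorchjobs.kubeflow.org",
    "tfjobs.kubeflow.org",
    "xgboostjobs.kubeflow.org",
}


def missing_crds(listing: str) -> set:
    """Return the expected CRD names that do not appear in the listing.

    The first whitespace-separated token of each non-empty line is treated
    as a CRD name; header lines and timestamps are ignored as a side effect.
    """
    present = {line.split()[0] for line in listing.splitlines() if line.strip()}
    return EXPECTED_CRDS - present


# Listing copied from the expected output shown in this guide.
SAMPLE_LISTING = """\
mpijobs.kubeflow.org 2026-05-08T20:16:21Z
mxjobs.kubeflow.org 2026-05-08T20:16:21Z
paddlejobs.kubeflow.org 2026-05-08T20:16:22Z
pytorchjobs.kubeflow.org 2026-05-08T20:16:22Z
tfjobs.kubeflow.org 2026-05-08T20:16:22Z
xgboostjobs.kubeflow.org 2026-05-08T20:16:23Z
"""

if __name__ == "__main__":
    gaps = missing_crds(SAMPLE_LISTING)
    print("all CRDs present" if not gaps else f"missing: {sorted(gaps)}")
```

An empty result means every expected CRD was found. A non-empty set names the CRDs to investigate.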
Check that the validating webhook was registered:
Expected output:
Run a PyTorchJob
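Create a PyTorchJob with one master and one worker by applying the manifest below, for example with kubectl apply -f. Both replicas initialise a distributed process group using the gloo backend (which runs on CPU, so no GPU is required), print their rank, and synchronise at a barrier before exiting: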
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-test
  namespace: default
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
              command:
                - python3
                - -c
                - |
                  import torch
                  import torch.distributed as dist
                  import os
                  dist.init_process_group(backend='gloo')
                  rank = dist.get_rank()
                  world_size = dist.get_world_size()
                  print(f'Hello from rank {rank} of {world_size}')
                  dist.barrier()
                  print(f'Rank {rank} completed successfully')
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
              command:
                - python3
                - -c
                - |
                  import torch
                  import torch.distributed as dist
                  import os
                  dist.init_process_group(backend='gloo')
                  rank = dist.get_rank()
                  world_size = dist.get_world_size()
                  print(f'Hello from rank {rank} of {world_size}')
                  dist.barrier()
                  print(f'Rank {rank} completed successfully')
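The script in the manifest calls init_process_group without an address or rank. With no init_method given, PyTorch falls back to environment-variable rendezvous and reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE, which the Training Operator is expected to inject into each replica. The sketch below shows what that lookup amounts to, using only the standard library. The Rendezvous class, the read_rendezvous helper, and the example values are hypothetical and serve only as illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendezvous:
    """The four settings that environment-variable rendezvous relies on."""
    master_addr: str
    master_port: int
    rank: int
    world_size: int


def read_rendezvous(env=os.environ) -> Rendezvous:
    """Collect the rendezvous settings from a mapping of environment variables.

    Raises KeyError naming every variable that is absent, which is a common
    reason for a process group failing to initialise outside the operator.
    """
    required = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError("missing rendezvous variables: " + ", ".join(missing))
    return Rendezvous(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        rank=int(env["RANK"]),
        world_size=int(env["WORLD_SIZE"]),
    )


if __name__ == "__main__":
    # Illustrative values for the worker replica; the address and port here
    # are assumptions, not values captured from a running job.
    example_env = {
        "MASTER_ADDR": "pytorch-test-master-0",
        "MASTER_PORT": "23456",
        "RANK": "1",
        "WORLD_SIZE": "2",
    }
    print(read_rendezvous(example_env))
```

Running a similar lookup inside a replica can help when a job hangs at initialisation, since a missing or mismatched variable prevents the ranks from finding each other.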
Check the pod statuses:
Expected output:
NAME READY STATUS RESTARTS AGE
pytorch-test-master-0 0/1 Completed 0 7m57s
pytorch-test-worker-0 0/1 Completed 0 7m57s
Verify the job reached the Succeeded state:
Expected output:
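The same check can be scripted. The Python sketch below assumes the job has been fetched as JSON (for example with kubectl get pytorchjob pytorch-test -o json) and that its status carries a list of conditions, each with a type and a status of "True" or "False". The helper name job_succeeded is hypothetical, and the sample document is hand-written for illustration rather than captured from a cluster.

```python
import json


def job_succeeded(job: dict) -> bool:
    """Return True if the job reports a Succeeded condition with status "True"."""
    conditions = job.get("status", {}).get("conditions", [])
    return any(
        cond.get("type") == "Succeeded" and cond.get("status") == "True"
        for cond in conditions
    )


# Hand-written, trimmed example of a job object; real objects carry more fields.
SAMPLE_JOB_JSON = """
{
  "kind": "PyTorchJob",
  "metadata": {"name": "pytorch-test", "namespace": "default"},
  "status": {
    "conditions": [
      {"type": "Created", "status": "True"},
      {"type": "Running", "status": "False"},
      {"type": "Succeeded", "status": "True"}
    ]
  }
}
"""

if __name__ == "__main__":
    job = json.loads(SAMPLE_JOB_JSON)
    print("succeeded" if job_succeeded(job) else "not succeeded yet")
```

A job with no status yet, such as one that was just created, is treated as not succeeded rather than as an error.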