Horizontal Pod Autoscaling
Horizontal Pod Autoscaling (HPA) automatically scales the number of pod replicas in a deployment based on observed metrics. This guide demonstrates GPU-based HPA using NVIDIA DCGM GPU utilization metrics exposed through Prometheus and the Prometheus Adapter.
Prerequisites
Before proceeding, make sure you have a running GPU cluster. See Creating a Kubernetes cluster with GPUs for instructions.
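Install the NVIDIA GPU Operator
Install the GPU Operator with the DCGM exporter's ServiceMonitor enabled so that Prometheus can scrape GPU metrics. The command assumes the nvidia Helm repository has already been added. Setting driver.enabled=false skips the operator-managed driver installation, so the NVIDIA driver must already be present on the GPU nodes: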
helm install gpu-operator nvidia/gpu-operator \
  --version v26.3.1 \
  -n nvidia-gpu-operator \
  --create-namespace \
  --set driver.enabled=false \
  --set dcgmExporter.serviceMonitor.enabled=true
Install Prometheus
Add the Prometheus community Helm repository:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Install the kube-prometheus-stack chart, configuring Prometheus to pick up all ServiceMonitor and PodMonitor resources rather than only those carrying the Helm release label:
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
Verify GPU Metrics are Being Scraped
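Port-forward the Prometheus service to your local machine (the service name below is the one the adapter configuration later in this guide points at):
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090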
Open http://localhost:9090 in a browser and query DCGM_FI_DEV_GPU_UTIL to confirm GPU metrics are being collected.
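If you prefer to check from a script rather than the browser, the same query can be sent to the Prometheus HTTP API. The following is a minimal sketch, assuming the port-forward above is still running on localhost:9090. The helper names (build_query_url, parse_instant_vector, fetch_gpu_utilization) are hypothetical and exist only for this example, and the labels attached to each series depend on your DCGM exporter and scrape configuration.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Return the instant-query URL (/api/v1/query) for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_instant_vector(payload):
    """Extract (labels, value) pairs from an instant-query JSON response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each sample looks like {"metric": {...labels...}, "value": [timestamp, "string value"]}.
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def fetch_gpu_utilization(base_url="http://localhost:9090"):
    """Query DCGM_FI_DEV_GPU_UTIL through the port-forwarded Prometheus (assumed address)."""
    url = build_query_url(base_url, "DCGM_FI_DEV_GPU_UTIL")
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_instant_vector(json.load(resp))
```

Calling fetch_gpu_utilization() should return one entry per GPU series. An empty list suggests that the DCGM exporter's ServiceMonitor is not being scraped yet.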
Deploy a GPU Workload
Deploy a test workload that requests a GPU:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-hpa-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-hpa-test
  template:
    metadata:
      labels:
        app: gpu-hpa-test
    spec:
      containers:
        - name: cuda-test
          image: nvcr.io/nvidia/cuda:13.0.1-base-ubi9
          command: ["/bin/sh", "-c"]
          args:
            - nvidia-smi && sleep 3600
          resources:
            limits:
              nvidia.com/gpu: 1
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
Install the Prometheus Adapter
The Prometheus Adapter exposes Prometheus metrics through the Kubernetes custom metrics API, which HPA uses to make scaling decisions. Create an adapter values file that maps the DCGM_FI_DEV_GPU_UTIL metric to a custom metric named gpu_utilization:
cat > /tmp/adapter-values.yaml << 'EOF'
prometheus:
  url: http://prometheus-kube-prometheus-prometheus.monitoring.svc
  port: 9090
rules:
  custom:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_pod!=""}'
      resources:
        overrides:
          exported_namespace:
            resource: namespace
          exported_pod:
            resource: pod
      name:
        matches: "^(.*)$"
        as: "gpu_utilization"
      metricsQuery: 'avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
EOF
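To make the metricsQuery template in the values file more concrete, the sketch below approximates how the placeholders are filled in when the HPA asks for a metric on a set of pods. This is an illustration, not the adapter's actual Go templating: the function name expand_metrics_query is hypothetical, and the exact matcher ordering and quoting the adapter produces may differ.

```python
def expand_metrics_query(template, namespace, pods,
                         ns_label="exported_namespace", pod_label="exported_pod"):
    """Approximate the adapter's substitution of <<.LabelMatchers>> and <<.GroupBy>>."""
    if len(pods) == 1:
        pod_matcher = f'{pod_label}="{pods[0]}"'
    else:
        # Several pods are matched with a single regular-expression matcher.
        joined = "|".join(pods)
        pod_matcher = f'{pod_label}=~"{joined}"'
    matchers = ",".join([f'{ns_label}="{namespace}"', pod_matcher])
    # Grouping by the pod label yields one averaged value per pod.
    return (template
            .replace("<<.LabelMatchers>>", matchers)
            .replace("<<.GroupBy>>", pod_label))


TEMPLATE = "avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)"
print(expand_metrics_query(TEMPLATE, "default", ["pod-a", "pod-b"]))
```

The label overrides in the rule are what tell the adapter that exported_namespace and exported_pod identify the namespace and pod, which is why those labels appear in the expanded query.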
Install the adapter:
helm install prometheus-adapter \
  prometheus-community/prometheus-adapter \
  --namespace monitoring \
  -f /tmp/adapter-values.yaml
Verify the Custom Metric
Confirm the gpu_utilization metric is available through the custom metrics API:
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/gpu_utilization" | \
  python3 -m json.tool
Expected output:
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "describedObject": {
        "kind": "Pod",
        "namespace": "default",
        "name": "gpu-hpa-test-7c6b78785c-4k8mb",
        "apiVersion": "/v1"
      },
      "metricName": "gpu_utilization",
      "timestamp": "2026-05-07T21:07:31Z",
      "value": "0",
      "selector": null
    }
  ]
}
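The value field in this response is a Kubernetes quantity string, so it may appear as a plain number ("0") or with a milli suffix (for example "33500m" for 33.5). As a rough sketch, a response like the one above could be reduced to per-pod numbers as follows. The helper names are hypothetical, and only plain and milli-suffixed quantities are handled; other quantity suffixes are deliberately left out.

```python
import json


def parse_quantity(quantity):
    """Convert a plain or milli-suffixed ('m') Kubernetes quantity string to a float."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)


def pod_metric_values(metric_value_list):
    """Map each pod name in a MetricValueList to its numeric metric value."""
    return {
        item["describedObject"]["name"]: parse_quantity(item["value"])
        for item in metric_value_list["items"]
    }


# Trimmed copy of the expected output shown above.
response = json.loads("""
{
  "kind": "MetricValueList",
  "items": [
    {
      "describedObject": {"kind": "Pod", "namespace": "default",
                          "name": "gpu-hpa-test-7c6b78785c-4k8mb"},
      "metricName": "gpu_utilization",
      "value": "0"
    }
  ]
}
""")
print(pod_metric_values(response))
```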
Create the HPA
Create an HPA that targets an average GPU utilization of 30% and allows scaling between 1 and 2 replicas:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa-test
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-hpa-test
  minReplicas: 1
  maxReplicas: 2
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: 30
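To see how this target translates into a replica count, the sketch below applies the scaling formula from the Kubernetes HPA documentation: the desired count is the current count multiplied by the ratio of the observed average to the target, rounded up, and then clamped to minReplicas and maxReplicas. It is a simplification: the function name is hypothetical, the 0.1 tolerance is the documented default, and the real controller also accounts for missing metrics, pods that are not ready, and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_avg, target_avg,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified HPA calculation for a Pods metric with an AverageValue target."""
    ratio = current_avg / target_avg
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the HPA.
    return max(min_replicas, min(max_replicas, desired))


# One pod at 100% utilization against a target of 30, bounded to 1..2 replicas.
print(desired_replicas(1, 100, 30, min_replicas=1, max_replicas=2))
```

With the HPA above, a single fully loaded pod gives a raw result of ceil(100 / 30) = 4, which the maxReplicas bound reduces to 2.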
Test Scale-Up
Stress the GPU by running a device-to-device memory copy loop inside the running pod. Replace the pod name with the name of your own pod:
kubectl exec -it gpu-hpa-test-7c6b78785c-gq8rf -- python3 -c "
import ctypes

# Load the CUDA runtime library from the container image.
cudart = ctypes.CDLL('/usr/local/cuda-13.0/targets/x86_64-linux/lib/libcudart.so.13')
size = ctypes.c_size_t(1024 * 1024 * 1024)  # 1 GiB per buffer
dst = ctypes.c_void_p()
src = ctypes.c_void_p()
cudart.cudaMalloc(ctypes.byref(dst), size)
cudart.cudaMalloc(ctypes.byref(src), size)
print('Stressing GPU...')
while True:
    # 3 is cudaMemcpyDeviceToDevice; repeated copies keep the GPU busy.
    cudart.cudaMemcpy(dst, src, size, ctypes.c_int(3))
    cudart.cudaDeviceSynchronize()
"
Watch the HPA in another terminal. Once GPU utilization exceeds the 30% target, the HPA scales the deployment to 2 replicas:
kubectl get hpa gpu-hpa-test --watch
Expected output once scaled up:
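NAME           REFERENCE                 TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa-test   Deployment/gpu-hpa-test   100/30    1         2         2          14m
With both replicas running, the metric is averaged across the two pods. Because only the original pod is under load, the average drops to (100 + 0) / 2 = 50. That is still above the target of 30, but the deployment is already at maxReplicas, so it stays at 2 replicas:
NAME           REFERENCE                 TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa-test   Deployment/gpu-hpa-test   50/30     1         2         2          24m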
Verify both pods are running:
kubectl get pods -l app=gpu-hpa-test
Expected output:
NAME                            READY   STATUS    RESTARTS   AGE
gpu-hpa-test-7c6b78785c-gq8rf   1/1     Running   0          26m
gpu-hpa-test-7c6b78785c-xlrw7   1/1     Running   0          10m
Test Scale-Down
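Stop the stress script with Ctrl+C. Once GPU utilization drops below the target, the HPA scales the deployment back to 1 replica after the scale-down stabilization window (5 minutes by default) has passed. Keep watching the HPA to see the change:
kubectl get hpa gpu-hpa-test --watch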
Expected output:
NAME           REFERENCE                 TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa-test   Deployment/gpu-hpa-test   0/30      1         2         2          29m
gpu-hpa-test   Deployment/gpu-hpa-test   0/30      1         2         2          30m
gpu-hpa-test   Deployment/gpu-hpa-test   0/30      1         2         1          30m
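The delay before the replica count drops comes from the HPA's scale-down stabilization: rather than acting on the newest recommendation, the controller uses the highest recommendation seen within the window (300 seconds unless configured otherwise). The sketch below illustrates that idea only. The function name and the (timestamp, replicas) history format are assumptions made for this example, not the controller's data structures.

```python
def stabilized_replicas(recommendations, now, window_seconds=300):
    """Return the highest replica recommendation made within the stabilization window.

    recommendations: list of (timestamp_seconds, desired_replicas) tuples that
    includes at least one entry inside the window.
    """
    recent = [replicas for ts, replicas in recommendations
              if now - ts <= window_seconds]
    return max(recent)


# The load stops shortly after t=0, so every later recommendation is 1 replica.
history = [(0, 2), (60, 1), (120, 1), (180, 1), (240, 1), (300, 1)]
print(stabilized_replicas(history, now=300))                  # the t=0 entry is still in the window
print(stabilized_replicas(history + [(301, 1)], now=301))     # the t=0 entry has aged out
```

This is why the REPLICAS column keeps showing 2 for a while even though TARGETS already reads 0/30.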