Horizontal Pod Autoscaling

Horizontal Pod Autoscaling (HPA) automatically scales the number of pod replicas in a deployment based on observed metrics. This guide demonstrates GPU-based HPA using NVIDIA DCGM GPU utilization metrics exposed through Prometheus and the Prometheus Adapter.

Prerequisites

Before proceeding, make sure you have a running GPU cluster. See Creating a Kubernetes cluster with GPUs for instructions.

Install the NVIDIA GPU Operator

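Install the GPU Operator with the DCGM exporter's ServiceMonitor enabled so that Prometheus can scrape GPU metrics. The command assumes the nvidia Helm repository has already been added (helm repo add nvidia https://helm.ngc.nvidia.com/nvidia) and, because driver.enabled is set to false, that the GPU drivers are already installed on the nodes: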

helm install gpu-operator nvidia/gpu-operator \
  --version v26.3.1 \
  -n nvidia-gpu-operator \
  --create-namespace \
  --set driver.enabled=false \
  --set dcgmExporter.serviceMonitor.enabled=true

Install Prometheus

Add the Prometheus community Helm repository:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Install the kube-prometheus-stack, configuring it to pick up all ServiceMonitor and PodMonitor resources regardless of label selectors:

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false

Verify GPU Metrics are Being Scraped

Port-forward the Prometheus service to your local machine:

kubectl port-forward -n monitoring \
  svc/prometheus-kube-prometheus-prometheus 9090:9090 &

Open http://localhost:9090 in a browser and query DCGM_FI_DEV_GPU_UTIL to confirm GPU metrics are being collected.
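
The same check can be scripted against the Prometheus HTTP API instead of using the browser. The following is a minimal sketch, assuming the port-forward above is still listening on localhost:9090. The helper names (build_query_url, parse_vector, fetch_gpu_util) are illustrative and not part of any library, and the exported_pod label is the one the adapter rule later in this guide relies on.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: port-forward from the step above


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_vector(response):
    """Turn an instant-vector response into (labels, float value) pairs."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response}")
    return [(s["metric"], float(s["value"][1])) for s in response["data"]["result"]]


def fetch_gpu_util(base_url=PROMETHEUS_URL):
    """Query DCGM_FI_DEV_GPU_UTIL and return the parsed samples."""
    url = build_query_url(base_url, "DCGM_FI_DEV_GPU_UTIL")
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_vector(json.load(resp))


# Usage (requires the port-forward to be running):
#   for labels, value in fetch_gpu_util():
#       print(labels.get("exported_pod", "<no pod>"), value)
```

An empty result list usually means the DCGM exporter is not being scraped yet, which is worth resolving before installing the adapter.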

Deploy a GPU Workload

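Deploy a test workload that requests one GPU. Save the following manifest to a file and apply it with kubectl apply -f in the default namespace, which the later steps assume: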

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-hpa-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-hpa-test
  template:
    metadata:
      labels:
        app: gpu-hpa-test
    spec:
      containers:
      - name: cuda-test
        image: nvcr.io/nvidia/cuda:13.0.1-base-ubi9
        command: ["/bin/sh", "-c"]
        args:
          - nvidia-smi && sleep 3600
        resources:
          limits:
            nvidia.com/gpu: 1
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

Install the Prometheus Adapter

The Prometheus Adapter exposes Prometheus metrics through the Kubernetes custom metrics API, which HPA uses to make scaling decisions. Create an adapter values file that maps the DCGM_FI_DEV_GPU_UTIL metric to a custom metric named gpu_utilization:

cat > /tmp/adapter-values.yaml << 'EOF'
prometheus:
  url: http://prometheus-kube-prometheus-prometheus.monitoring.svc
  port: 9090

rules:
  custom:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_pod!=""}'
    resources:
      overrides:
        exported_namespace:
          resource: namespace
        exported_pod:
          resource: pod
    name:
      matches: "^(.*)$"
      as: "gpu_utilization"
    metricsQuery: 'avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
EOF
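
To make the rule easier to read, here is roughly how the metricsQuery template is filled in when the HPA asks for gpu_utilization for a set of pods. This is an illustrative Python sketch, not the adapter's actual code: render_metrics_query is a hypothetical name, the pod names are made up, and the exact matcher syntax the adapter emits may differ.

```python
METRICS_QUERY = "avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)"


def render_metrics_query(template, namespace, pods):
    """Approximate the adapter's template expansion for a per-pod metric."""
    # The rule's overrides map the Kubernetes namespace and pod resources
    # to the exported_namespace and exported_pod labels on the series.
    matchers = [f'exported_namespace="{namespace}"']
    if len(pods) == 1:
        matchers.append(f'exported_pod="{pods[0]}"')
    else:
        # Several pods are matched with a regular expression alternation.
        matchers.append('exported_pod=~"{}"'.format("|".join(pods)))
    return (
        template
        .replace("<<.LabelMatchers>>", ",".join(matchers))
        .replace("<<.GroupBy>>", "exported_pod")
    )


# Hypothetical pod names, for illustration only.
print(render_metrics_query(METRICS_QUERY, "default", ["gpu-hpa-test-a", "gpu-hpa-test-b"]))
```

Grouping by exported_pod yields one averaged value per pod, so a pod that holds several GPUs would report the mean utilization of its devices.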

Install the adapter:

helm install prometheus-adapter \
  prometheus-community/prometheus-adapter \
  --namespace monitoring \
  -f /tmp/adapter-values.yaml

Verify the Custom Metric

Confirm the gpu_utilization metric is available through the custom metrics API:

kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/gpu_utilization" | \
  python3 -m json.tool

Expected output:

{
    "kind": "MetricValueList",
    "apiVersion": "custom.metrics.k8s.io/v1beta1",
    "metadata": {},
    "items": [
        {
            "describedObject": {
                "kind": "Pod",
                "namespace": "default",
                "name": "gpu-hpa-test-7c6b78785c-4k8mb",
                "apiVersion": "/v1"
            },
            "metricName": "gpu_utilization",
            "timestamp": "2026-05-07T21:07:31Z",
            "value": "0",
            "selector": null
        }
    ]
}
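
When scripting this check, keep in mind that values in the custom metrics API are Kubernetes quantities, so a value can appear as "500m" (0.5) rather than as a plain number. The sketch below reads a response like the one above. It assumes the JSON was captured from the kubectl get --raw command, and parse_quantity is a hypothetical helper that handles only plain numbers and a few common suffixes, not the full quantity grammar.

```python
import json

# Subset of Kubernetes quantity suffixes; the full grammar has more.
_SUFFIXES = {"m": 1e-3, "k": 1e3, "M": 1e6, "G": 1e9}


def parse_quantity(text):
    """Convert a quantity string such as '0', '42' or '500m' to a float."""
    if text and text[-1] in _SUFFIXES:
        return float(text[:-1]) * _SUFFIXES[text[-1]]
    return float(text)


def pod_metrics(metric_value_list):
    """Map pod name to metric value for a MetricValueList document."""
    return {
        item["describedObject"]["name"]: parse_quantity(item["value"])
        for item in metric_value_list["items"]
    }


# Trimmed copy of the expected output shown above.
sample = json.loads("""
{"kind": "MetricValueList", "items": [
  {"describedObject": {"kind": "Pod", "namespace": "default",
                       "name": "gpu-hpa-test-7c6b78785c-4k8mb"},
   "metricName": "gpu_utilization", "value": "0"}
]}
""")
print(pod_metrics(sample))  # {'gpu-hpa-test-7c6b78785c-4k8mb': 0.0}
```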

Create the HPA

Create an HPA that targets an average GPU utilization of 30% and allows scaling between 1 and 2 replicas:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa-test
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-hpa-test
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: 30

Test Scale-Up

Stress the GPU by running a memory copy loop inside the running pod. Replace the pod name in the command with the name of your own pod, as shown by kubectl get pods:

kubectl exec -it gpu-hpa-test-7c6b78785c-gq8rf -- python3 -c "
import ctypes
# Load the CUDA runtime library shipped in the container image
cudart = ctypes.CDLL('/usr/local/cuda-13.0/targets/x86_64-linux/lib/libcudart.so.13')

# Allocate two 1 GiB device buffers (size_t argument)
size = ctypes.c_size_t(1024*1024*1024)
ptr = ctypes.c_void_p()
src = ctypes.c_void_p()
cudart.cudaMalloc(ctypes.byref(ptr), size)
cudart.cudaMalloc(ctypes.byref(src), size)

print('Stressing GPU...')
while True:
    # 3 is cudaMemcpyDeviceToDevice: copy between the two device buffers
    cudart.cudaMemcpy(ptr, src, size, ctypes.c_int(3))
    cudart.cudaDeviceSynchronize()
"

Watch the HPA in another terminal. Once GPU utilization exceeds the 30% target, the HPA will scale the deployment to 2 replicas:

watch kubectl get hpa gpu-hpa-test

Expected output once scaled up:

NAME           REFERENCE                 TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa-test   Deployment/gpu-hpa-test   100/30    1         2         2          14m

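With both replicas running, the metric is averaged across the two pods. The stressed pod still reports 100 and the new, idle pod reports 0, so the average drops to 50. That is still above the target, but the HPA cannot scale further because maxReplicas is 2: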

NAME           REFERENCE                 TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa-test   Deployment/gpu-hpa-test   50/30     1         2         2          24m
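
The replica counts above follow from the scaling rule described in the Kubernetes HPA documentation: desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), clamped to minReplicas and maxReplicas, with no change when the ratio is within a tolerance (0.1 by default) of 1. The sketch below is a simplified illustration of that rule. It ignores details such as missing metrics, pods that are not yet ready, and scaling behavior policies, and the function name is made up for this example.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified HPA rule for a Pods metric with an AverageValue target."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the HPA.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(1, 100, 30, 1, 2))  # 2: one stressed pod
print(desired_replicas(2, 50, 30, 1, 2))   # 2: stressed pod plus idle pod
print(desired_replicas(2, 0, 30, 1, 2))    # 1: stress stopped
```

In both loaded cases the unclamped result is 4, so the deployment would keep growing if maxReplicas allowed it and enough GPUs were available.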

Verify both pods are running:

kubectl get pods

Expected output:

NAME                            READY   STATUS    RESTARTS   AGE
gpu-hpa-test-7c6b78785c-gq8rf   1/1     Running   0          26m
gpu-hpa-test-7c6b78785c-xlrw7   1/1     Running   0          10m

Check the deployment:

kubectl get deploy

Expected output:

NAME           READY   UP-TO-DATE   AVAILABLE   AGE
gpu-hpa-test   2/2     2            2           26m

Test Scale-Down

Stop the stress script with Ctrl+C. Once GPU utilization drops below the target, the HPA scales the deployment back to 1 replica after the scale-down stabilization window, which is 5 minutes by default:

kubectl get hpa -w

Expected output:

NAME           REFERENCE                 TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa-test   Deployment/gpu-hpa-test   0/30      1         2         2          29m
gpu-hpa-test   Deployment/gpu-hpa-test   0/30      1         2         2          30m
gpu-hpa-test   Deployment/gpu-hpa-test   0/30      1         2         1          30m

Confirm that the deployment is back to one replica:

kubectl get deploy

Expected output:

NAME           READY   UP-TO-DATE   AVAILABLE   AGE
gpu-hpa-test   1/1     1            1           33m

Confirm that only one pod remains:

kubectl get pods

Expected output:

NAME                            READY   STATUS    RESTARTS   AGE
gpu-hpa-test-7c6b78785c-gq8rf   1/1     Running   0          33m
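
The delay between the metric reaching 0 and the replica count changing in the scale-down output comes from how the HPA stabilizes scale-down: it acts on the highest replica recommendation computed during the stabilization window. A rough sketch of that selection follows, with a hypothetical function name and made-up timestamps in seconds; the real controller tracks more state than this.

```python
def stabilized_replicas(recommendations, now, window_seconds=300):
    """Pick the replica count for scale-down from recent recommendations.

    recommendations is a list of (timestamp_seconds, desired_replicas) pairs
    and must contain at least one entry inside the window.
    """
    recent = [r for ts, r in recommendations if 0 <= now - ts <= window_seconds]
    # Taking the maximum means one recent high recommendation blocks scale-down.
    return max(recent)


# Made-up history: the last recommendation of 2 replicas was computed at t=0.
history = [(0, 2), (60, 1), (120, 1), (180, 1), (240, 1), (300, 1), (360, 1)]
print(stabilized_replicas(history, now=300))  # 2: the t=0 entry is still in the window
print(stabilized_replicas(history, now=360))  # 1: only recommendations of 1 remain
```

The window can be shortened through the HPA's spec.behavior.scaleDown.stabilizationWindowSeconds field if faster scale-down is acceptable for the workload.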