Accelerator Metrics

The Nvidia DCGM exporter exposes GPU telemetry, including utilization, memory usage, temperature, and power, as Prometheus metrics. This guide covers how to collect and query those metrics from a GPU cluster.

Prerequisites

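Before proceeding, make sure you have a running GPU cluster. See Creating a Kubernetes cluster with GPUs for instructions. The commands in this guide also assume that kubectl, helm, curl, and python3 are available on your local machine.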

Install the Nvidia GPU Operator

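Add the Nvidia Helm repository, then install the GPU Operator with the DCGM exporter's ServiceMonitor enabled so that Prometheus can discover and scrape the GPU metrics endpoint. Setting driver.enabled=false skips the operator-managed driver installation, which assumes the Nvidia driver is already present on the GPU nodes: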

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --version v26.3.1 \
  -n nvidia-gpu-operator \
  --create-namespace \
  --set driver.enabled=false \
  --set dcgmExporter.serviceMonitor.enabled=true

Install Prometheus

Add the Prometheus community Helm repository:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

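Install the kube-prometheus-stack chart. By default, the chart's Prometheus only selects ServiceMonitor and PodMonitor resources that carry its Helm release label; setting the two NilUsesHelmValues options to false makes it pick up all such resources regardless of their labels: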

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false

Expected output:

NAME: prometheus
LAST DEPLOYED: Thu May  7 16:17:22 2026
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
  kubectl --namespace monitoring get pods -l "release=prometheus"

Get Grafana 'admin' user password by running:

  kubectl --namespace monitoring get secrets prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo

Access Grafana local instance:

  export POD_NAME=$(kubectl --namespace monitoring get pod -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=prometheus" -oname)
  kubectl --namespace monitoring port-forward $POD_NAME 3000

Access Prometheus

Port-forward the Prometheus service to your local machine:

kubectl port-forward -n monitoring \
  svc/prometheus-kube-prometheus-prometheus 9090:9090 &

List Available DCGM Metrics

Query the Prometheus label values API to list all DCGM metrics being scraped:

curl -s "http://localhost:9090/api/v1/label/__name__/values" | \
  python3 -m json.tool | grep DCGM

Expected output:

        "DCGM_FI_DEV_DEC_UTIL",
        "DCGM_FI_DEV_ENC_UTIL",
        "DCGM_FI_DEV_FB_FREE",
        "DCGM_FI_DEV_FB_RESERVED",
        "DCGM_FI_DEV_FB_USED",
        "DCGM_FI_DEV_GPU_TEMP",
        "DCGM_FI_DEV_GPU_UTIL",
        "DCGM_FI_DEV_MEMORY_TEMP",
        "DCGM_FI_DEV_MEM_CLOCK",
        "DCGM_FI_DEV_MEM_COPY_UTIL",
        "DCGM_FI_DEV_PCIE_REPLAY_COUNTER",
        "DCGM_FI_DEV_POWER_USAGE",
        "DCGM_FI_DEV_SM_CLOCK",
        "DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION",
        "DCGM_FI_DEV_VGPU_LICENSE_STATUS",
        "DCGM_FI_PROF_DRAM_ACTIVE",
        "DCGM_FI_PROF_GR_ENGINE_ACTIVE",
        "DCGM_FI_PROF_PCIE_RX_BYTES",
        "DCGM_FI_PROF_PCIE_TX_BYTES",
        "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE",

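The grep filter above works on pretty-printed text. For scripting, the response body can be parsed directly. The following is a minimal Python sketch, assuming the usual Prometheus label-values response shape (a status field plus a data list of names). The function name dcgm_metric_names and the sample payload are illustrative and not part of the exporter or of Prometheus.

```python
import json


def dcgm_metric_names(payload: str, prefix: str = "DCGM_") -> list[str]:
    """Return the sorted metric names starting with `prefix` from a
    /api/v1/label/__name__/values response body."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"Prometheus request failed: {body.get('error', 'unknown error')}")
    return sorted(name for name in body["data"] if name.startswith(prefix))


# Illustrative payload; a real response lists every metric name Prometheus knows.
sample = json.dumps({
    "status": "success",
    "data": ["up", "DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_FB_USED", "vllm:num_requests_running"],
})

print(dcgm_metric_names(sample))
```

In practice the payload would be the output of the curl command above, for example read from standard input.
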
Query GPU Metrics

GPU Utilization

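Query the current utilization percentage of every GPU. The exported_pod, exported_namespace, and exported_container labels in each result identify the workload that is using the GPU: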

curl -s "http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL" | \
  python3 -m json.tool

Expected output:

{
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "DCGM_FI_DRIVER_VERSION": "595.71.05",
                    "Hostname": "kube-h1qtv-default-worker-vfkbs-886wk-9tnxl",
                    "UUID": "GPU-3024d656-66c9-2eeb-5a1b-a996141fd176",
                    "__name__": "DCGM_FI_DEV_GPU_UTIL",
                    "container": "nvidia-dcgm-exporter",
                    "device": "nvidia0",
                    "endpoint": "gpu-metrics",
                    "exported_container": "cuda-test",
                    "exported_namespace": "default",
                    "exported_pod": "gpu-hpa-test-7c6b78785c-gq8rf",
                    "gpu": "0",
                    "instance": "10.100.247.197:9400",
                    "job": "nvidia-dcgm-exporter",
                    "modelName": "Tesla T4",
                    "namespace": "nvidia-gpu-operator",
                    "pci_bus_id": "00000000:00:05.0",
                    "pod": "nvidia-dcgm-exporter-b989l",
                    "service": "nvidia-dcgm-exporter"
                },
                "value": [
                    1778215068.715,
                    "0"
                ]
            }
        ]
    }
}
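
The response is an instant vector: one entry per time series, each with a metric label set and a value pair of Unix timestamp and string-encoded sample. The following Python sketch flattens such a response into one row per GPU. The function name summarize_vector is hypothetical, and the sample is a trimmed copy of the output above.

```python
import json


def summarize_vector(payload: str) -> list[dict]:
    """Flatten an instant-query vector response into one row per series."""
    body = json.loads(payload)
    if body.get("status") != "success" or body["data"]["resultType"] != "vector":
        raise ValueError("expected a successful instant-vector response")
    rows = []
    for series in body["data"]["result"]:
        labels = series["metric"]
        _timestamp, raw_value = series["value"]  # [unix_seconds, "value as string"]
        rows.append({
            "host": labels.get("Hostname"),
            "gpu": labels.get("gpu"),
            "pod": labels.get("exported_pod"),  # workload pod, as in the sample output
            "value": float(raw_value),
        })
    return rows


# Trimmed copy of the response shown above.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{
            "metric": {
                "__name__": "DCGM_FI_DEV_GPU_UTIL",
                "Hostname": "kube-h1qtv-default-worker-vfkbs-886wk-9tnxl",
                "exported_pod": "gpu-hpa-test-7c6b78785c-gq8rf",
                "gpu": "0",
            },
            "value": [1778215068.715, "0"],
        }],
    },
})

for row in summarize_vector(sample):
    print(row)
```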

GPU Memory Usage

Query the amount of framebuffer memory currently in use (in MiB):

curl -s "http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_FB_USED" | \
  python3 -m json.tool

Expected output:

{
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "DCGM_FI_DRIVER_VERSION": "595.71.05",
                    "Hostname": "kube-h1qtv-default-worker-vfkbs-886wk-9tnxl",
                    "UUID": "GPU-3024d656-66c9-2eeb-5a1b-a996141fd176",
                    "__name__": "DCGM_FI_DEV_FB_USED",
                    "container": "nvidia-dcgm-exporter",
                    "device": "nvidia0",
                    "endpoint": "gpu-metrics",
                    "exported_container": "cuda-test",
                    "exported_namespace": "default",
                    "exported_pod": "gpu-hpa-test-7c6b78785c-gq8rf",
                    "gpu": "0",
                    "instance": "10.100.247.197:9400",
                    "job": "nvidia-dcgm-exporter",
                    "modelName": "Tesla T4",
                    "namespace": "nvidia-gpu-operator",
                    "pci_bus_id": "00000000:00:05.0",
                    "pod": "nvidia-dcgm-exporter-b989l",
                    "service": "nvidia-dcgm-exporter"
                },
                "value": [
                    1778215105.532,
                    "0"
                ]
            }
        ]
    }
}
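
An absolute MiB figure is often easier to interpret as a share of total framebuffer memory. The sketch below assumes that the total equals used + free + reserved, combining the three DCGM_FI_DEV_FB_* metrics from the list above; verify that assumption against your DCGM version. The function name, the PromQL constant, and the sample numbers are hypothetical.

```python
def fb_used_percent(used_mib: float, free_mib: float, reserved_mib: float = 0.0) -> float:
    """Percentage of framebuffer in use, assuming total = used + free + reserved."""
    total = used_mib + free_mib + reserved_mib
    if total <= 0:
        raise ValueError("total framebuffer must be positive")
    return 100.0 * used_mib / total


# The same calculation as a PromQL expression, under the same assumption.
FB_USED_PERCENT_PROMQL = (
    "100 * DCGM_FI_DEV_FB_USED / "
    "(DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_RESERVED)"
)

# Hypothetical readings in MiB, not taken from a real GPU.
print(fb_used_percent(3072, 11824, 464))  # -> 20.0
```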

The same query pattern applies to any other metric in the list above. Replace DCGM_FI_DEV_FB_USED with the metric name of interest, for example DCGM_FI_DEV_GPU_TEMP for GPU temperature (in °C) or DCGM_FI_DEV_POWER_USAGE for power draw (in watts).
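
A bare metric name can be placed in the URL as-is, but a PromQL expression with label matchers contains braces and quotes that need URL encoding. The following Python sketch builds an instant-query URL with the standard library. The helper name instant_query_url is hypothetical, and the base URL assumes the port-forward from the earlier step.

```python
from urllib.parse import urlencode


def instant_query_url(promql: str, base: str = "http://localhost:9090") -> str:
    """Build a URL for the Prometheus instant-query endpoint, encoding the query."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


# Restrict the temperature metric to GPU index 0 using the gpu label.
print(instant_query_url('DCGM_FI_DEV_GPU_TEMP{gpu="0"}'))
```

The resulting URL can be passed to curl in place of the hand-written one.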

AI Workload Metrics

In addition to hardware-level GPU metrics, application-level metrics can be collected directly from AI inference workloads. vLLM is a high-throughput LLM inference server that natively exposes metrics in Prometheus exposition format, covering request throughput, queue depth, token generation rate, and latency.

Deploy vLLM

Deploy vLLM with a small model (facebook/opt-125m) and expose it on port 8000. Save the following manifest to a file and apply it with kubectl apply -f:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - --model
          - facebook/opt-125m
          - --port
          - "8000"
        ports:
        - containerPort: 8000
          name: http
        resources:
          limits:
            nvidia.com/gpu: 1
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference
  namespace: default
  labels:
    app: vllm
spec:
  selector:
    app: vllm
  ports:
  - name: http
    port: 8000
    targetPort: 8000

Watch the logs to confirm the server is ready:

kubectl logs -f deployment/vllm | grep -i "ready\|error\|serving"

Browse the Raw Metrics Endpoint

Port-forward the vLLM service and fetch the raw metrics to see everything it exposes:

kubectl port-forward svc/vllm-inference 8000:8000 &
curl http://localhost:8000/metrics
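
The endpoint returns plain text in the Prometheus exposition format: comment lines starting with # (HELP and TYPE), followed by sample lines of the form name{labels} value. The following Python sketch extracts the distinct vLLM sample names from such text. The function name metric_names is hypothetical, and the sample text is illustrative rather than captured from a real server.

```python
def metric_names(exposition: str, prefix: str = "vllm:") -> list[str]:
    """Return the sorted, distinct sample names starting with `prefix`."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        # The name ends at the first "{" (labels present) or at whitespace.
        name = line.split("{", 1)[0].split()[0]
        if name.startswith(prefix):
            names.add(name)
    return sorted(names)


# Illustrative exposition text; the help string is a placeholder.
sample = """\
# HELP vllm:num_requests_running placeholder help text
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{engine="0",model_name="facebook/opt-125m"} 0.0
process_open_fds 42.0
"""

print(metric_names(sample))
```

Note that histogram metrics appear under several sample names (with _bucket, _sum, and _count suffixes), so this sketch lists each of those separately.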

Create a ServiceMonitor

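Create a ServiceMonitor that tells Prometheus to scrape the vLLM metrics endpoint every 15 seconds. The selector matches the app: vllm label on the vllm-inference Service, and port refers to the Service port named http. Apply the manifest with kubectl apply -f: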

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm
  namespace: default
spec:
  selector:
    matchLabels:
      app: vllm
  endpoints:
  - port: http
    interval: 15s
    path: /metrics

Query vLLM Metrics

Start the Prometheus port-forward if it is not already running:

kubectl port-forward -n monitoring \
  svc/prometheus-kube-prometheus-prometheus 9090:9090 &

Query the number of requests currently running on the inference engine:

curl -s "http://localhost:9090/api/v1/query?query=vllm:num_requests_running" | \
  python3 -m json.tool

Expected output:

{
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "__name__": "vllm:num_requests_running",
                    "container": "vllm",
                    "endpoint": "http",
                    "engine": "0",
                    "instance": "10.100.247.242:8000",
                    "job": "vllm-inference",
                    "model_name": "facebook/opt-125m",
                    "namespace": "default",
                    "pod": "vllm-586567bcb4-kskfh",
                    "service": "vllm-inference"
                },
                "value": [
                    1778261062.766,
                    "0"
                ]
            }
        ]
    }
}

The same pattern applies to other vLLM metrics exposed at the /metrics endpoint, such as queue depth, token throughput, and time-to-first-token latency.
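
An instant query returns only the latest sample. To see how a gauge such as vllm:num_requests_running changes over time, the Prometheus query_range endpoint takes start, end, and step parameters and returns a matrix result with a list of values per series. The following Python sketch builds such a URL. The helper name range_query_url is hypothetical, and the timestamps are arbitrary example values.

```python
from urllib.parse import urlencode


def range_query_url(promql: str, start: float, end: float, step: str = "15s",
                    base: str = "http://localhost:9090") -> str:
    """Build a URL for the Prometheus range-query endpoint."""
    params = {"query": promql, "start": start, "end": end, "step": step}
    return f"{base}/api/v1/query_range?{urlencode(params)}"


# A 15-minute window sampled every 15 seconds (example Unix timestamps).
print(range_query_url("vllm:num_requests_running", 1778261000, 1778261900))
```

A step equal to the scrape interval (15s here) yields roughly one point per scrape.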