Accelerator Metrics
The Nvidia DCGM exporter exposes GPU telemetry (utilization, memory usage, temperature, power, and more) as Prometheus metrics. This guide covers how to collect and query those metrics from a GPU cluster, and how to collect application-level metrics from a vLLM inference workload in the same way.
Prerequisites
Before proceeding, make sure you have a running GPU cluster. See Creating a Kubernetes cluster with GPUs for instructions.
Install the Nvidia GPU Operator
Add the Nvidia Helm repository, then install the GPU operator with the DCGM exporter's ServiceMonitor enabled so that Prometheus can discover and scrape the GPU metrics endpoint:
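The install command expects a chart repository registered under the alias nvidia. A minimal sketch of that step, assuming the chart is served from NVIDIA's public NGC Helm repository (the URL is not stated in this guide, so check it against the GPU Operator documentation):

```shell
# Assumption: NVIDIA's public NGC Helm repository. The alias "nvidia" matches
# the chart reference nvidia/gpu-operator used by the install command.
NVIDIA_REPO_URL="https://helm.ngc.nvidia.com/nvidia"

helm repo add nvidia "$NVIDIA_REPO_URL"
helm repo update
```

With the repository registered, the operator install follows: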
helm install gpu-operator nvidia/gpu-operator \
  --version v26.3.1 \
  -n nvidia-gpu-operator \
  --create-namespace \
  --set driver.enabled=false \
  --set dcgmExporter.serviceMonitor.enabled=true
Install Prometheus
Add the Prometheus community Helm repository:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Install kube-prometheus-stack and configure Prometheus to select every ServiceMonitor and PodMonitor resource in the cluster, not only those labeled with the Helm release:
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
Expected output:
NAME: prometheus
LAST DEPLOYED: Thu May 7 16:17:22 2026
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
kubectl --namespace monitoring get pods -l "release=prometheus"
Get Grafana 'admin' user password by running:
kubectl --namespace monitoring get secrets prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo
Access Grafana local instance:
export POD_NAME=$(kubectl --namespace monitoring get pod -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=prometheus" -oname)
kubectl --namespace monitoring port-forward $POD_NAME 3000
Access Prometheus
Port-forward the Prometheus service to your local machine:
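A minimal sketch of that port-forward, assuming the service name that kube-prometheus-stack typically generates for a release called prometheus. The name is an assumption, not something this guide states; confirm it with kubectl --namespace monitoring get svc before relying on it:

```shell
# Assumption: for the release name "prometheus", the chart names the
# Prometheus service prometheus-kube-prometheus-prometheus.
PROM_SVC="svc/prometheus-kube-prometheus-prometheus"

# Runs in the foreground; keep it open and use a second terminal for queries.
kubectl --namespace monitoring port-forward "$PROM_SVC" 9090:9090
```

Prometheus is then reachable at http://localhost:9090, the address used by the queries in the rest of this guide.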
List Available DCGM Metrics
Query the Prometheus label values API to list all DCGM metrics being scraped:
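A minimal sketch of that request, assuming Prometheus is reachable at http://localhost:9090 through the port-forward and that filtering the pretty-printed response with grep is good enough for a quick listing:

```shell
PROM_URL="http://localhost:9090"  # assumes the local port-forward

# /api/v1/label/<label>/values returns every value seen for a label;
# __name__ is the label that holds the metric name.
LABEL_VALUES_URL="${PROM_URL}/api/v1/label/__name__/values"

# Pretty-print one name per line, then keep only the DCGM metrics.
curl -s "$LABEL_VALUES_URL" | python3 -m json.tool | grep DCGM
```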
Expected output:
"DCGM_FI_DEV_DEC_UTIL",
"DCGM_FI_DEV_ENC_UTIL",
"DCGM_FI_DEV_FB_FREE",
"DCGM_FI_DEV_FB_RESERVED",
"DCGM_FI_DEV_FB_USED",
"DCGM_FI_DEV_GPU_TEMP",
"DCGM_FI_DEV_GPU_UTIL",
"DCGM_FI_DEV_MEMORY_TEMP",
"DCGM_FI_DEV_MEM_CLOCK",
"DCGM_FI_DEV_MEM_COPY_UTIL",
"DCGM_FI_DEV_PCIE_REPLAY_COUNTER",
"DCGM_FI_DEV_POWER_USAGE",
"DCGM_FI_DEV_SM_CLOCK",
"DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION",
"DCGM_FI_DEV_VGPU_LICENSE_STATUS",
"DCGM_FI_PROF_DRAM_ACTIVE",
"DCGM_FI_PROF_GR_ENGINE_ACTIVE",
"DCGM_FI_PROF_PCIE_RX_BYTES",
"DCGM_FI_PROF_PCIE_TX_BYTES",
"DCGM_FI_PROF_PIPE_TENSOR_ACTIVE",
Query GPU Metrics
GPU Utilization
Query the current utilization percentage of each GPU. Each returned series carries labels for the pod and container using that GPU:
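A minimal sketch of that query, following the same curl pattern this guide uses for vLLM metrics and assuming Prometheus is reachable at http://localhost:9090:

```shell
PROM_URL="http://localhost:9090"  # assumes the local port-forward
METRIC="DCGM_FI_DEV_GPU_UTIL"

# Instant query: returns the latest sample of every series for this metric.
QUERY_URL="${PROM_URL}/api/v1/query?query=${METRIC}"
curl -s "$QUERY_URL" | python3 -m json.tool
```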
Expected output:
{
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "DCGM_FI_DRIVER_VERSION": "595.71.05",
                    "Hostname": "kube-h1qtv-default-worker-vfkbs-886wk-9tnxl",
                    "UUID": "GPU-3024d656-66c9-2eeb-5a1b-a996141fd176",
                    "__name__": "DCGM_FI_DEV_GPU_UTIL",
                    "container": "nvidia-dcgm-exporter",
                    "device": "nvidia0",
                    "endpoint": "gpu-metrics",
                    "exported_container": "cuda-test",
                    "exported_namespace": "default",
                    "exported_pod": "gpu-hpa-test-7c6b78785c-gq8rf",
                    "gpu": "0",
                    "instance": "10.100.247.197:9400",
                    "job": "nvidia-dcgm-exporter",
                    "modelName": "Tesla T4",
                    "namespace": "nvidia-gpu-operator",
                    "pci_bus_id": "00000000:00:05.0",
                    "pod": "nvidia-dcgm-exporter-b989l",
                    "service": "nvidia-dcgm-exporter"
                },
                "value": [
                    1778215068.715,
                    "0"
                ]
            }
        ]
    }
}
GPU Memory Usage
Query the amount of framebuffer memory currently in use (in MiB):
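A minimal sketch of that query, assuming Prometheus is reachable at http://localhost:9090. Only the metric name differs from the utilization query:

```shell
PROM_URL="http://localhost:9090"  # assumes the local port-forward
METRIC="DCGM_FI_DEV_FB_USED"      # framebuffer memory in use, in MiB

QUERY_URL="${PROM_URL}/api/v1/query?query=${METRIC}"
curl -s "$QUERY_URL" | python3 -m json.tool
```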
Expected output:
{
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "DCGM_FI_DRIVER_VERSION": "595.71.05",
                    "Hostname": "kube-h1qtv-default-worker-vfkbs-886wk-9tnxl",
                    "UUID": "GPU-3024d656-66c9-2eeb-5a1b-a996141fd176",
                    "__name__": "DCGM_FI_DEV_FB_USED",
                    "container": "nvidia-dcgm-exporter",
                    "device": "nvidia0",
                    "endpoint": "gpu-metrics",
                    "exported_container": "cuda-test",
                    "exported_namespace": "default",
                    "exported_pod": "gpu-hpa-test-7c6b78785c-gq8rf",
                    "gpu": "0",
                    "instance": "10.100.247.197:9400",
                    "job": "nvidia-dcgm-exporter",
                    "modelName": "Tesla T4",
                    "namespace": "nvidia-gpu-operator",
                    "pci_bus_id": "00000000:00:05.0",
                    "pod": "nvidia-dcgm-exporter-b989l",
                    "service": "nvidia-dcgm-exporter"
                },
                "value": [
                    1778215105.532,
                    "0"
                ]
            }
        ]
    }
}
The same query pattern applies to any other metric in the list above. Replace DCGM_FI_DEV_FB_USED with the metric name of interest, for example DCGM_FI_DEV_GPU_TEMP for GPU temperature or DCGM_FI_DEV_POWER_USAGE for power consumption.
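When only the sample values matter, the response can be reduced to one line per GPU. The sketch below is one way to do that, not part of the exporter or of Prometheus: format_samples and dcgm_values are hypothetical helper names, and the address http://localhost:9090 assumes the local port-forward.

```shell
PROM_URL="http://localhost:9090"  # assumes the local port-forward

# Hypothetical helper: read a Prometheus instant-query response on stdin and
# print one line per series: "<metric name> gpu=<index> value=<sample>".
format_samples() {
  python3 -c '
import json, sys

for series in json.load(sys.stdin)["data"]["result"]:
    labels = series["metric"]
    print(labels["__name__"], "gpu=" + labels.get("gpu", "?"), "value=" + series["value"][1])
'
}

# Hypothetical helper: query a single DCGM metric by name and summarise it.
dcgm_values() {
  curl -s "${PROM_URL}/api/v1/query?query=$1" | format_samples
}
```

With the helpers loaded, dcgm_values DCGM_FI_DEV_GPU_TEMP and dcgm_values DCGM_FI_DEV_POWER_USAGE each print one line per GPU.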
AI Workload Metrics
In addition to hardware-level GPU metrics, application-level metrics can be collected directly from AI inference workloads. vLLM is a high-throughput LLM inference server that natively exposes metrics in Prometheus exposition format, covering request throughput, queue depth, token generation rate, and latency.
Deploy vLLM
Deploy vLLM with a small model (facebook/opt-125m) and expose it on port 8000:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model
            - facebook/opt-125m
            - --port
            - "8000"
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 1
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference
  namespace: default
  labels:
    app: vllm
spec:
  selector:
    app: vllm
  ports:
    - name: http
      port: 8000
      targetPort: 8000
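The manifest above still has to be applied to the cluster. A minimal sketch, assuming it was saved as vllm.yaml (a hypothetical filename) and that ten minutes is enough for the image pull and model load:

```shell
MANIFEST="vllm.yaml"  # hypothetical filename for the Deployment and Service above

kubectl apply -f "$MANIFEST"

# Block until the Deployment reports its replica as available, or give up
# after the timeout.
kubectl --namespace default rollout status deployment/vllm --timeout=10m
```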
Watch the logs to confirm the server is ready:
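A minimal sketch of that command, assuming the Deployment name vllm and the default namespace from the manifest above:

```shell
VLLM_DEPLOY="deployment/vllm"

# Stream logs from a pod of the Deployment; press Ctrl+C to stop following.
kubectl --namespace default logs "$VLLM_DEPLOY" --follow
```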
Browse the Raw Metrics Endpoint
Port-forward the vLLM service and fetch the raw metrics to see everything it exposes:
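A minimal sketch of those two steps in a single terminal, assuming the Service name vllm-inference from the manifest above and that local port 8000 is free. The two-second pause is an arbitrary allowance for the tunnel to come up:

```shell
VLLM_SVC="svc/vllm-inference"

# Forward local port 8000 to the Service in the background and keep its PID.
kubectl --namespace default port-forward "$VLLM_SVC" 8000:8000 &
PF_PID=$!
sleep 2

# Fetch everything the server exposes in Prometheus exposition format.
curl -s http://localhost:8000/metrics

# Stop the background port-forward.
kill "$PF_PID"
```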
Create a ServiceMonitor
Create a ServiceMonitor to tell Prometheus to scrape the vLLM metrics endpoint every 15 seconds:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm
  namespace: default
spec:
  selector:
    matchLabels:
      app: vllm
  endpoints:
    - port: http
      interval: 15s
      path: /metrics
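One way to apply the ServiceMonitor and confirm that Prometheus has picked up the target is sketched below. It assumes the manifest was saved as vllm-servicemonitor.yaml (a hypothetical filename) and that Prometheus is port-forwarded to http://localhost:9090:

```shell
SM_MANIFEST="vllm-servicemonitor.yaml"  # hypothetical filename for the manifest above
PROM_URL="http://localhost:9090"        # assumes the Prometheus port-forward is running

kubectl apply -f "$SM_MANIFEST"

# "up" is 1 for every target Prometheus scraped successfully. The job label
# defaults to the Service name, here vllm-inference. Allow a scrape interval
# or two before expecting a result. --data-urlencode escapes the braces and
# quotes in the PromQL selector.
curl -sG "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=up{job="vllm-inference"}' | python3 -m json.tool
```

An empty result list usually means the target has not been discovered yet or the selector labels do not match the Service.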
Query vLLM Metrics
Port-forward Prometheus if it is not already running:
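A minimal sketch that starts the port-forward only when Prometheus is not already answering locally. The service name is an assumption based on the release name prometheus; confirm it with kubectl --namespace monitoring get svc:

```shell
PROM_URL="http://localhost:9090"
PROM_SVC="svc/prometheus-kube-prometheus-prometheus"  # assumed service name

# /-/ready is Prometheus' readiness endpoint; if it answers, a forward is
# already in place and nothing needs to be started.
if curl -sf "${PROM_URL}/-/ready" >/dev/null; then
  echo "Prometheus already reachable at ${PROM_URL}"
else
  # Runs in the foreground; use a second terminal for the queries.
  kubectl --namespace monitoring port-forward "$PROM_SVC" 9090:9090
fi
```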
Query the number of requests currently running on the inference engine:
curl -s "http://localhost:9090/api/v1/query?query=vllm:num_requests_running" | \
  python3 -m json.tool
Expected output:
{
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "__name__": "vllm:num_requests_running",
                    "container": "vllm",
                    "endpoint": "http",
                    "engine": "0",
                    "instance": "10.100.247.242:8000",
                    "job": "vllm-inference",
                    "model_name": "facebook/opt-125m",
                    "namespace": "default",
                    "pod": "vllm-586567bcb4-kskfh",
                    "service": "vllm-inference"
                },
                "value": [
                    1778261062.766,
                    "0"
                ]
            }
        ]
    }
}
The same pattern applies to other vLLM metrics exposed at the /metrics endpoint, such as queue depth, token throughput, and time-to-first-token latency.
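As a sketch of what such queries might look like: the metric names below are assumptions based on names vLLM commonly exposes, so check them against the raw /metrics output of the deployed version. prom_query is a hypothetical helper, and Prometheus is assumed to be reachable at http://localhost:9090.

```shell
PROM_URL="http://localhost:9090"  # assumes the local port-forward

# Hypothetical helper. -G sends the data as URL query parameters and
# --data-urlencode escapes PromQL characters such as parentheses and brackets.
prom_query() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1" | python3 -m json.tool
}

# Queue depth: requests waiting to be scheduled.
prom_query 'vllm:num_requests_waiting'

# Token throughput: generated tokens per second over the last 5 minutes.
prom_query 'rate(vllm:generation_tokens_total[5m])'

# Latency: 95th-percentile time to first token over the last 5 minutes.
prom_query 'histogram_quantile(0.95, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))'
```

The rate and histogram queries return empty results until the server has handled some requests within the 5-minute window.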