Ultimate Monitoring Setup: Prometheus, Grafana, Node Exporter and NVIDIA GPU Utilization

Andrii Shatokhin · Published in DevOps.dev · 4 min read · Sep 22, 2024

[Figure: Nvidia GPU Metrics Dashboard]

In this step-by-step guide, we will create a full monitoring solution for both system metrics and NVIDIA GPU metrics using Prometheus, Grafana, and Node Exporter. This tutorial is designed to be as close to copy-and-paste as possible; the only prerequisites are Docker Engine with the Compose plugin, an NVIDIA driver, and the NVIDIA Container Toolkit (required by the runtime: nvidia option used below). By the end, you’ll have a working monitoring stack visualizing GPU and system data.

Step 1: Create the Monitoring Directory

First, create a directory to organize all your monitoring-related files.

  1. Open your terminal and run the following command to create the directory:
mkdir -p ~/Docker/monitoring
cd ~/Docker/monitoring

This Docker/monitoring directory will contain the necessary configuration files and the Docker Compose file for the monitoring stack.

Step 2: Create the Docker Compose File

This file defines your entire stack: Prometheus, Grafana, Node Exporter, and the NVIDIA GPU Exporter. Everything runs in Docker containers, letting you collect system and GPU metrics and visualize them in Grafana.

  1. Create the docker-compose.yml file inside the ~/Docker/monitoring directory:
nano docker-compose.yml

2. Copy and paste the following content into the docker-compose.yml file:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      # mount the declared grafana-storage volume so dashboards and settings persist
      - grafana-storage:/var/lib/grafana
    networks:
      - monitoring

  node_exporter:
    image: prom/node-exporter:latest
    container_name: node_exporter
    ports:
      - "9100:9100"
    networks:
      - monitoring

  nvidia_smi_exporter:
    image: docker.io/utkuozdemir/nvidia_gpu_exporter:1.2.1
    container_name: nvidia_smi_exporter
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
    ports:
      - "9835:9835"
    volumes:
      - /usr/bin/nvidia-smi:/usr/bin/nvidia-smi
      - /usr/lib/x86_64-linux-gnu/libnvidia-ml.so:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so
      - /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
    restart: unless-stopped
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus-data:
  grafana-storage:

  • Prometheus: Scrapes and stores metrics (port 9090).
  • Grafana: Visualizes data from Prometheus (port 3000).
  • Node Exporter: Collects system metrics like CPU and memory usage (port 9100).
  • NVIDIA GPU Exporter: Collects GPU metrics via the nvidia-smi binary (port 9835).

Save and close the file (CTRL + O, ENTER, CTRL + X).
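Before moving on, you can ask Compose to validate the file; if the YAML is malformed, the command prints an error. Note also that the runtime: nvidia option only works if the NVIDIA Container Toolkit is installed on the host. A quick sanity check for both, assuming a standard Docker install:

docker compose config --quiet                # validates docker-compose.yml syntax
nvidia-smi                                   # confirms the host NVIDIA driver works
docker info --format '{{json .Runtimes}}'    # should list an "nvidia" runtime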

Step 3: Configure Prometheus to Scrape GPU Metrics

Now we need to configure Prometheus to scrape both system and GPU metrics.

  1. In the same directory (~/Docker/monitoring), create the Prometheus configuration file:
nano prometheus.yml

2. Paste the following configuration into the file:

global:
  scrape_interval: 10s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node_exporter:9100']

  - job_name: 'nvidia_smi_exporter'
    static_configs:
      - targets: ['nvidia_smi_exporter:9835']

This configuration tells Prometheus to scrape metrics from itself, Node Exporter (for system metrics), and the NVIDIA GPU Exporter (for GPU metrics). Because all four containers share the monitoring bridge network, Prometheus can reach each target by its service name.
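One convenience worth knowing: Prometheus reloads its configuration on SIGHUP, so if you later edit prometheus.yml you do not need to recreate the container. Once the stack from Step 4 is running, one way to trigger the reload:

docker kill --signal=HUP prometheus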

Step 4: Start the Monitoring Stack

With everything in place, we can now use Docker Compose to start the monitoring stack.

  1. In the ~/Docker/monitoring directory, run the following command to start all services:
docker compose up -d

2. Verify that all containers are running:

docker ps

You should see all four containers running: prometheus, grafana, node_exporter, and nvidia_smi_exporter.
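You can also query each exporter directly to confirm it is serving metrics. These commands assume you run them on the host with the default ports from the Compose file; the grep pattern relies on the nvidia_ metric prefix this exporter uses:

curl -s http://localhost:9100/metrics | head                  # Node Exporter output
curl -s http://localhost:9835/metrics | grep nvidia_ | head   # GPU exporter output

If the GPU endpoint returns nothing, check the container logs with docker logs nvidia_smi_exporter.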

Step 5: Access Prometheus and Grafana

Access Prometheus

Open your browser and go to:

http://<server-ip>:9090

This is the Prometheus interface, where you can query metrics. To check that all scrape targets are up and healthy, use the following endpoint:

http://<server-ip>:9090/targets
[Figure: Prometheus http://<server-ip>:9090/targets endpoint]
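A few example queries to try in the expression browser. node_cpu_seconds_total is a standard Node Exporter metric; the nvidia_smi_* names are those exposed by this GPU exporter and may vary between exporter versions:

rate(node_cpu_seconds_total{mode!="idle"}[1m])   # per-core CPU busy rate
nvidia_smi_utilization_gpu_ratio                 # GPU utilization as a 0-1 ratio
nvidia_smi_memory_used_bytes                     # GPU memory in use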

Access Grafana

Open Grafana in your browser:

http://<server-ip>:3000

Log in with the default credentials (admin/admin); you will then be prompted to set a new password, which you can either do or skip.
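If you ever lose the admin password, the Grafana image includes a CLI reset command (the binary name, grafana-cli or grafana cli, varies slightly between image versions):

docker exec -it grafana grafana-cli admin reset-admin-password <new-password>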

Step 6: Add Prometheus as a Data Source in Grafana

  1. In Grafana, navigate to Connections > Data sources > Add data source.
  2. Select Prometheus.
  3. Set the URL to http://prometheus:9090 and click Save & Test.
  4. Grafana should successfully connect to Prometheus. (For a configuration-as-code alternative, see the provisioning sketch below.)
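Grafana can also load the data source from a provisioning file at startup instead of through the UI. A minimal sketch, assuming you add a bind mount such as ./grafana-datasource.yml:/etc/grafana/provisioning/datasources/datasource.yml to the grafana service (the file name and mount are illustrative, not part of the stack above):

# grafana-datasource.yml (illustrative)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true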

Step 7: Import Dashboards in Grafana

Import System Metrics Dashboard (Node Exporter)

  1. Go to Dashboards > New > Import.
2. In the import field (Find and import dashboards for common applications), enter 1860 (the Node Exporter Full dashboard ID) and click Load.
  3. Select Prometheus as the data source and click Import.

Import NVIDIA GPU Metrics Dashboard

  1. Go to Dashboards > New > Import.
2. In the import field (Find and import dashboards for common applications), enter 14574 (the Nvidia GPU Metrics dashboard ID) and click Load.
  3. Select Prometheus as the data source and click Import.

Now you should have two dashboards:

  • Node Exporter Full Dashboard: Shows CPU, memory, and disk metrics.
  • Nvidia GPU Metrics Dashboard: Shows GPU utilization, temperature, and other GPU metrics.

Step 8: [Optional] Clean-Up

When you are done with monitoring and want to remove the entire stack, stop the containers and remove the associated volumes:

docker compose down --volumes

This command stops and removes all containers and associated volumes.
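If you also want to delete the images Compose pulled, the same command accepts an extra flag:

docker compose down --volumes --rmi all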

Conclusion

By following this guide, you’ve set up a full monitoring solution that tracks both system and GPU metrics using Prometheus, Grafana, and Node Exporter. All metrics are collected and visualized in Grafana, providing insights into your system’s performance and GPU utilization.

This setup is designed for easy deployment with minimal configuration. Just copy and paste the provided commands, and you’ll have your monitoring stack up and running in no time!

References

  • utkuozdemir/nvidia_gpu_exporter — Nvidia GPU Exporter for Prometheus using the nvidia-smi binary (the exporter image used in this stack).
  • Christian Lempa — Docker Compose boilerplate for the Nvidia SMI Exporter.
  • Grafana Dashboard ID: 1860 — Node Exporter Full: a comprehensive dashboard for monitoring system metrics collected by Node Exporter, such as CPU, memory, disk, and network usage.
  • Grafana Dashboard ID: 14574 — Nvidia GPU Metrics: detailed NVIDIA GPU metrics based on the data exposed by the NVIDIA GPU Exporter, allowing you to monitor GPU usage, memory, power, and more.
