Ultimate Monitoring Setup: Prometheus, Grafana, Node Exporter and NVIDIA GPU Utilization

Andrii Shatokhin · Published in DevOps.dev · 4 min read · Sep 22, 2024

[Figure: Nvidia GPU Metrics Dashboard]

In this step-by-step guide, we will create a full monitoring solution for both system metrics and NVIDIA GPU metrics using Prometheus, Grafana, and Node Exporter. This tutorial is designed to be as close to copy-and-paste as possible; the only prerequisites are Docker Engine with the Compose plugin, an NVIDIA driver, and the NVIDIA Container Toolkit (required by the runtime: nvidia option used below). By the end, you’ll have a working monitoring stack visualizing GPU and system data.

Step 1: Create the Monitoring Directory

First, create a directory to organize all your monitoring-related files.

  1. Open your terminal and run the following command to create the directory:
mkdir -p ~/Docker/monitoring
cd ~/Docker/monitoring

This Docker/monitoring directory will contain the necessary configuration files and the Docker Compose file for the monitoring stack.

Step 2: Create the Docker Compose File

This file defines your entire stack: Prometheus, Grafana, Node Exporter, and the NVIDIA GPU Exporter. Everything runs in Docker containers, letting you collect system and GPU metrics and visualize them in Grafana.

  1. Create the docker-compose.yml file inside the ~/Docker/monitoring directory:
nano docker-compose.yml

2. Copy and paste the following content into the docker-compose.yml file:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      # mount the declared grafana-storage volume so dashboards and settings persist
      - grafana-storage:/var/lib/grafana
    networks:
      - monitoring

  node_exporter:
    image: prom/node-exporter:latest
    container_name: node_exporter
    ports:
      - "9100:9100"
    networks:
      - monitoring

  nvidia_smi_exporter:
    image: docker.io/utkuozdemir/nvidia_gpu_exporter:1.2.1
    container_name: nvidia_smi_exporter
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
    ports:
      - "9835:9835"
    volumes:
      - /usr/bin/nvidia-smi:/usr/bin/nvidia-smi
      - /usr/lib/x86_64-linux-gnu/libnvidia-ml.so:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so
      - /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
    restart: unless-stopped
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus-data:
  grafana-storage:

  • Prometheus: Scrapes and stores metrics (port 9090).
  • Grafana: Visualizes data from Prometheus (port 3000).
  • Node Exporter: Collects system metrics like CPU and memory usage (port 9100).
  • NVIDIA GPU Exporter: Collects GPU metrics via the nvidia-smi binary (port 9835).

Save and close the file (CTRL + O, ENTER, CTRL + X).
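Before moving on, you can ask Compose to validate the file; if the YAML is malformed, the command prints an error. Note also that the runtime: nvidia option only works if the NVIDIA Container Toolkit is installed on the host. A quick sanity check for both, assuming a standard Docker install:

docker compose config --quiet                # validates docker-compose.yml syntax
nvidia-smi                                   # confirms the host NVIDIA driver works
docker info --format '{{json .Runtimes}}'    # should list an "nvidia" runtime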

Step 3: Configure Prometheus to Scrape GPU Metrics

Now we need to configure Prometheus to scrape both system and GPU metrics.

  1. In the same directory (~/Docker/monitoring), create the Prometheus configuration file:
nano prometheus.yml

2. Paste the following configuration into the file:

global:
  scrape_interval: 10s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node_exporter:9100']

  - job_name: 'nvidia_smi_exporter'
    static_configs:
      - targets: ['nvidia_smi_exporter:9835']

This configuration tells Prometheus to scrape metrics from itself, Node Exporter (for system metrics), and the NVIDIA GPU Exporter (for GPU metrics). Because all four containers share the monitoring bridge network, Prometheus can reach each target by its service name.
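One convenience worth knowing: Prometheus reloads its configuration on SIGHUP, so if you later edit prometheus.yml you do not need to recreate the container. Once the stack from Step 4 is running, one way to trigger the reload:

docker kill --signal=HUP prometheus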

Step 4: Start the Monitoring Stack

With everything in place, we can now use Docker Compose to start the monitoring stack.

  1. In the ~/Docker/monitoring directory, run the following command to start all services:
docker compose up -d

2. Verify that all containers are running:

docker ps

You should see all four containers running: prometheus, grafana, node_exporter, and nvidia_smi_exporter.
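You can also query each exporter directly to confirm it is serving metrics. These commands assume you run them on the host with the default ports from the Compose file; the grep pattern relies on the nvidia_ metric prefix this exporter uses:

curl -s http://localhost:9100/metrics | head                  # Node Exporter output
curl -s http://localhost:9835/metrics | grep nvidia_ | head   # GPU exporter output

If the GPU endpoint returns nothing, check the container logs with docker logs nvidia_smi_exporter.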

Step 5: Access Prometheus and Grafana

Access Prometheus

Open your browser and go to:

http://<server-ip>:9090

This is the Prometheus interface, where you can query metrics. To check that all scrape targets are up and healthy, use the following endpoint:

http://<server-ip>:9090/targets
[Figure: Prometheus http://<server-ip>:9090/targets endpoint]
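A few example queries to try in the expression browser. node_cpu_seconds_total is a standard Node Exporter metric; the nvidia_smi_* names are those exposed by this GPU exporter and may vary between exporter versions:

rate(node_cpu_seconds_total{mode!="idle"}[1m])   # per-core CPU busy rate
nvidia_smi_utilization_gpu_ratio                 # GPU utilization as a 0-1 ratio
nvidia_smi_memory_used_bytes                     # GPU memory in use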

Access Grafana

Open Grafana in your browser:

http://<server-ip>:3000

Log in with the default credentials (admin/admin); you will then be prompted to set a new password, which you can either do or skip.
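If you ever lose the admin password, the Grafana image includes a CLI reset command (the binary name, grafana-cli or grafana cli, varies slightly between image versions):

docker exec -it grafana grafana-cli admin reset-admin-password <new-password>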

Step 6: Add Prometheus as a Data Source in Grafana

  1. In Grafana, navigate to Connections > Data sources > Add data source.
  2. Select Prometheus.
  3. Set the URL to http://prometheus:9090 and click Save & Test.
  4. Grafana should successfully connect to Prometheus. (For a configuration-as-code alternative, see the provisioning sketch below.)
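Grafana can also load the data source from a provisioning file at startup instead of through the UI. A minimal sketch, assuming you add a bind mount such as ./grafana-datasource.yml:/etc/grafana/provisioning/datasources/datasource.yml to the grafana service (the file name and mount are illustrative, not part of the stack above):

# grafana-datasource.yml (illustrative)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true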

Step 7: Import Dashboards in Grafana

Import System Metrics Dashboard (Node Exporter)

  1. Go to Dashboards > New > Import.
2. In the import field (Find and import dashboards for common applications), enter 1860 (the Node Exporter Full dashboard ID) and click Load.
  3. Select Prometheus as the data source and click Import.

Import NVIDIA GPU Metrics Dashboard

  1. Go to Dashboards > New > Import.
2. In the import field (Find and import dashboards for common applications), enter 14574 (the Nvidia GPU Metrics dashboard ID) and click Load.
  3. Select Prometheus as the data source and click Import.

Now you should have two dashboards:

  • Node Exporter Full Dashboard: Shows CPU, memory, and disk metrics.
  • Nvidia GPU Metrics Dashboard: Shows GPU utilization, temperature, and other GPU metrics.

Step 8: [Optional] Clean-Up

When you are done with monitoring and want to remove the entire stack, stop the containers and remove the associated volumes:

docker compose down --volumes

This command stops and removes all containers and associated volumes.
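If you also want to delete the images Compose pulled, the same command accepts an extra flag:

docker compose down --volumes --rmi all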

Conclusion

By following this guide, you’ve set up a full monitoring solution that tracks both system and GPU metrics using Prometheus, Grafana, and Node Exporter. All metrics are collected and visualized in Grafana, providing insights into your system’s performance and GPU utilization.

This setup is designed for easy deployment with minimal configuration. Just copy and paste the provided commands, and you’ll have your monitoring stack up and running in no time!

References

  • utkuozdemir/nvidia_gpu_exporter — Nvidia GPU Exporter for Prometheus using the nvidia-smi binary (the exporter image used in this stack).
  • Christian Lempa — Docker Compose boilerplate for the Nvidia SMI Exporter.
  • Grafana Dashboard ID: 1860 — Node Exporter Full: a comprehensive dashboard for monitoring system metrics collected by Node Exporter, such as CPU, memory, disk, and network usage.
  • Grafana Dashboard ID: 14574 — Nvidia GPU Metrics: detailed NVIDIA GPU metrics based on the data exposed by the NVIDIA GPU Exporter, allowing you to monitor GPU usage, memory, power, and more.
