zml-smi is a universal diagnostic and monitoring tool for GPUs, TPUs and NPUs. It provides real-time insights into the performance and health of your hardware.

It is a mix between nvidia-smi and nvtop.

It transparently supports every platform that ZML supports: NVIDIA, AMD, Google TPU and AWS Trainium devices. More platforms will be added as ZML continues to expand its hardware support.

Getting started

You can download zml-smi from the official mirror.

$ curl -LO 'https://mirror.zml.ai/zml-smi/zml-smi-v0.2.tar.zst'
$ tar -xf zml-smi-v0.2.tar.zst
$ ./zml-smi/zml-smi

Listing devices

$ zml-smi

Monitoring devices

The --top flag provides real-time monitoring of device performance, including utilization, temperature, and memory usage.

$ zml-smi --top

Completely sandboxed

zml-smi doesn’t require any software on the target machine besides the device driver and glibc (the latter mostly because some vendor shared objects are loaded at runtime).

Metrics

Host

zml-smi displays host-level metrics such as CPU model and utilization, memory usage, and temperature.

Available metrics

Hostname, Kernel, CPU Model, CPU Core Count, Memory Used / Total, Uptime, Load Average (1m / 5m / 15m), Device Count
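As an illustration of where several of these values can come from on Linux, here is a minimal sketch that uses uname(2), sysinfo(2), sysconf(3) and getloadavg(3). It is not taken from zml-smi, whose actual data sources are not described here. The `host_metrics` struct and `read_host_metrics` function are hypothetical names. CPU model and device count are left out, and "used" memory is approximated as total minus free.

```c
#define _DEFAULT_SOURCE /* exposes getloadavg() on glibc under strict -std modes */
#include <stdio.h>
#include <stdlib.h>
#include <sys/sysinfo.h>
#include <sys/utsname.h>
#include <unistd.h>

/* Hypothetical container for a subset of the host metrics listed above. */
struct host_metrics {
    char hostname[65];
    char kernel[65];
    long cpu_cores;
    unsigned long long mem_total_bytes;
    unsigned long long mem_used_bytes; /* crude: total - free, page cache counts as used */
    long uptime_seconds;
    double load[3];                    /* 1m / 5m / 15m load averages */
};

/* Fills *m from the running Linux host. Returns 0 on success, -1 on failure. */
int read_host_metrics(struct host_metrics *m)
{
    struct utsname u;
    struct sysinfo si;

    if (uname(&u) != 0 || sysinfo(&si) != 0)
        return -1;
    if (getloadavg(m->load, 3) != 3)
        return -1;

    snprintf(m->hostname, sizeof m->hostname, "%s", u.nodename);
    snprintf(m->kernel, sizeof m->kernel, "%s", u.release);
    m->cpu_cores = sysconf(_SC_NPROCESSORS_ONLN);
    m->mem_total_bytes = (unsigned long long)si.totalram * si.mem_unit;
    m->mem_used_bytes =
        (unsigned long long)(si.totalram - si.freeram) * si.mem_unit;
    m->uptime_seconds = si.uptime;
    return 0;
}
```

A caller would declare a `struct host_metrics`, pass its address to `read_host_metrics`, and print the fields if the call returns 0.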

Processes

zml-smi also provides insights into the processes utilizing the devices, including their resource usage and command lines. This is available for all platforms.

Available metrics

PID, Device Index, Device Utilization, Device Memory, Process Command Line

NVIDIA

Metrics are read through the NVML library, which ships with the NVIDIA driver and is therefore expected to be present on the system.

Available metrics

GPU Utilization, Temperature, Power Draw, Encoder Utilization, Decoder Utilization, VRAM Used, VRAM Total, Memory Bus Width, Fan Speed, Power Limit, Graphics Clock, SM Clock, Memory Clock, Max Graphics Clock, Max Memory Clock, PCIe Link Generation, PCIe Link Width, PCIe TX Throughput, PCIe RX Throughput
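Because NVML ships with the driver, a tool can load it at runtime instead of linking against it. The sketch below shows that pattern with dlopen and dlsym. It is an assumption about how such loading could look, not zml-smi's actual code. The function name `nvml_device_count` is hypothetical. The NVML entry points it resolves (`nvmlInit_v2`, `nvmlDeviceGetCount_v2`, `nvmlShutdown`) are part of the public NVML API, and their return type is declared as `int` here in place of the `nvmlReturn_t` enum, where 0 means success.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Minimal prototypes for the three NVML entry points used below. */
typedef int (*nvml_init_fn)(void);            /* nvmlInit_v2 */
typedef int (*nvml_count_fn)(unsigned int *); /* nvmlDeviceGetCount_v2 */
typedef int (*nvml_fini_fn)(void);            /* nvmlShutdown */

/* Returns the number of NVIDIA devices, or -1 if NVML is missing or fails. */
int nvml_device_count(void)
{
    void *lib = dlopen("libnvidia-ml.so.1", RTLD_NOW | RTLD_LOCAL);
    if (lib == NULL)
        return -1; /* no NVIDIA driver installed on this machine */

    nvml_init_fn init = (nvml_init_fn)dlsym(lib, "nvmlInit_v2");
    nvml_count_fn get_count =
        (nvml_count_fn)dlsym(lib, "nvmlDeviceGetCount_v2");
    nvml_fini_fn fini = (nvml_fini_fn)dlsym(lib, "nvmlShutdown");

    int result = -1;
    unsigned int count = 0;
    if (init != NULL && get_count != NULL && fini != NULL && init() == 0) {
        if (get_count(&count) == 0)
            result = (int)count;
        fini();
    }
    dlclose(lib);
    return result;
}
```

On glibc older than 2.34 this needs `-ldl` at link time. A return value of -1 lets the caller simply skip the NVIDIA section on machines without the driver.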

AMD

Metrics are provided through the AMD SMI library. zml-smi ships with it in its sandbox.

To support the latest AMD GPUs, zml-smi downloads the amdgpu.ids file from both Mesa and ROCm (7.2.1 at the time of this article) at build time and merges the two. This allows zml-smi to recognize and report on the latest AMD GPU models, even if they are not yet included in the official ROCm release. This is the case for the Ryzen AI Max+ 395 (Strix Halo), for instance.
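The merge step can be sketched as follows. This is not the actual build script. It assumes the amdgpu.ids layout used by libdrm, where each entry line reads `device_id, revision_id, product_name` with hexadecimal ids, comment lines start with `#`, and the first non-comment line is a version string. The names `merge_amdgpu_ids`, `copy_new_entries` and `id_set` are hypothetical. Letting the first file win on a duplicate key is an arbitrary choice made for this sketch.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_IDS 8192

/* Set of packed (device_id, revision_id) keys that were already written. */
struct id_set {
    unsigned long keys[MAX_IDS];
    size_t len;
};

/* Returns 1 if the key was new, 0 if it was a duplicate, -1 if the set is full. */
static int set_add(struct id_set *s, unsigned dev, unsigned rev)
{
    unsigned long key = ((unsigned long)(dev & 0xffffu) << 16) | (rev & 0xffffu);
    for (size_t i = 0; i < s->len; i++)
        if (s->keys[i] == key)
            return 0;
    if (s->len == MAX_IDS)
        return -1;
    s->keys[s->len++] = key;
    return 1;
}

/* Copies entry lines whose key is not in `seen` yet. Comments, the version
 * line and anything else that does not parse as "hex, hex," are skipped.
 * Lines are assumed to be shorter than 512 bytes. Returns the number of
 * entries written, or -1 on overflow. */
static int copy_new_entries(FILE *in, FILE *out, struct id_set *seen)
{
    char line[512];
    int written = 0;
    while (fgets(line, sizeof line, in) != NULL) {
        unsigned dev, rev;
        if (line[0] == '#' || sscanf(line, " %x , %x ,", &dev, &rev) != 2)
            continue;
        int r = set_add(seen, dev, rev);
        if (r < 0)
            return -1;
        if (r == 1) {
            size_t n = strlen(line);
            fputs(line, out);
            if (n > 0 && line[n - 1] != '\n')
                fputc('\n', out); /* last line of a file may lack a newline */
            written++;
        }
    }
    return written;
}

/* Writes a merged table to `out`. Returns the entry count, or -1 on error. */
int merge_amdgpu_ids(FILE *first, FILE *second, FILE *out)
{
    struct id_set *seen = calloc(1, sizeof *seen);
    if (seen == NULL)
        return -1;
    fputs("1.0.0\n", out); /* assumed version line, see the note above */
    int a = copy_new_entries(first, out, seen);
    int b = (a < 0) ? -1 : copy_new_entries(second, out, seen);
    free(seen);
    return (a < 0 || b < 0) ? -1 : a + b;
}
```

Calling `merge_amdgpu_ids` with the two downloaded files opened for reading and a third opened for writing yields one table in which each (device, revision) pair appears once.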

Sandboxing that file turned out to be somewhat tricky. Because libdrm-amdgpu expects to find it at /opt/amdgpu/share/libdrm/amdgpu.ids, we had to get a bit creative: we didn’t want to install anything outside the binary sandbox, nor did we want to patch that string inside libdrm.

So we created a shared object named zmlxrocm.so and added it as a DT_NEEDED entry in the dynamic section of libdrm_amdgpu.so.1. The fopen64 symbol that libdrm imports is then renamed to zmlxrocm_fopen64, which zmlxrocm.so provides. Since we now sit between libdrm and fopen64, we can intercept the call, compare the path against /opt/amdgpu/share/libdrm/amdgpu.ids and redirect it to the sandboxed copy of the file.
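The interception itself can be sketched in a few lines of C. This is an illustration of the technique rather than the actual zmlxrocm.so source. The helper `redirect_path` and the `ZML_SMI_AMDGPU_IDS` environment variable are hypothetical, since how the shim locates the sandboxed copy is not stated here. Only the symbol name `zmlxrocm_fopen64` and the hard-coded path come from the description above.

```c
#define _GNU_SOURCE /* declares fopen64() on glibc */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Path that libdrm_amdgpu tries to open (from the description above). */
#define AMDGPU_IDS_SYSTEM_PATH "/opt/amdgpu/share/libdrm/amdgpu.ids"

/* Hypothetical: environment variable pointing at the sandboxed copy. */
#define SANDBOX_IDS_ENV "ZML_SMI_AMDGPU_IDS"

/* Decides which path to really open: the sandboxed copy when libdrm asks
 * for the system amdgpu.ids, the unchanged path in every other case. */
static const char *redirect_path(const char *path, const char *sandboxed)
{
    if (path != NULL && sandboxed != NULL &&
        strcmp(path, AMDGPU_IDS_SYSTEM_PATH) == 0)
        return sandboxed;
    return path;
}

/* The patched libdrm_amdgpu.so.1 imports this symbol instead of fopen64,
 * so every file it opens passes through here first. */
FILE *zmlxrocm_fopen64(const char *path, const char *mode)
{
    return fopen64(redirect_path(path, getenv(SANDBOX_IDS_ENV)), mode);
}
```

Built with `-shared -fPIC`, this yields a library exporting `zmlxrocm_fopen64`. All other files that libdrm opens pass through unchanged, which keeps the shim's effect limited to the one path.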

Available metrics

GPU Utilization, Memory Usage, Temperature, Power Draw, VRAM Used, VRAM Total, Fan Speed, Power Limit, Graphics Clock, SoC Clock, Memory Clock, Max Graphics Clock, Max Memory Clock, PCIe Bandwidth, PCIe Link Generation, PCIe Link Width

TPU

Metrics are provided via the local gRPC endpoint exposed by the TPU runtime. These are the same metrics that Google's tpu-info tool reads.

Available metrics

TensorCore Duty Cycle, HBM Used, HBM Total

AWS Trainium

Metrics are provided through a private API found in libnrt.so, which zml-smi embeds in its sandbox. Those are the same metrics provided by the neuron-top utility.

Available metrics

Core Utilization, HBM Used, HBM Total, Tensor Memory, Constant Memory, Model Code, Shared Scratchpad, Nonshared Scratchpad, Runtime Memory, Driver Memory, DMA Rings, Collectives, Notifications