Show HN: Utilyze – an open source GPU monitoring tool more accurate than nvtop (systalyze.com)
The standard GPU utilization metric reported by nvidia-smi, nvtop, Weights & Biases, Amazon CloudWatch, Google Cloud Monitoring, and Azure Monitor is highly misleading. It reports the fraction of time that any kernel is running on the GPU, which means a GPU can report 100% utilization even if only a small portion of its compute capacity is actually being used. In practice, we've seen workloads with ~1–10% real compute throughput while dashboards show 100%.
This becomes a problem when teams rely on that metric for capacity planning or optimization decisions: it can make underutilized systems look saturated.
We're releasing an open-source (Apache 2.0) tool, Utilyze, to measure GPU utilization differently. It samples hardware performance counters and reports compute and memory throughput relative to the hardware's theoretical limits. It also estimates an attainable utilization ceiling for a given workload.
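To make the distinction concrete, here's a minimal illustration (not Utilyze's implementation, just the coarse metric those dashboards surface, queried via pynvml):

```python
# Illustration only -- not Utilyze's code. NVML's "GPU utilization" is the
# percentage of the sample period during which *any* kernel was executing,
# regardless of how much of the chip that kernel actually used.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"NVML 'GPU util': {util.gpu}%  (kernel-resident time, not throughput)")

# A single kernel busy-looping on one SM can push this to 100% while achieved
# compute and memory throughput sit in the low single digits, which is the gap
# Utilyze is meant to expose.

pynvml.nvmlShutdown()
```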
GitHub link: https://github.com/systalyze/utilyze
We'd love to hear your thoughts!
At the moment (v0.1.3) it's most helpful for compute visualization, but the lack of memory usage/processes/temperature/fan speed/etc. tracking keeps it from being a full drop-in replacement for `nvidia-smi` for me.
Source: spent a couple of years developing an energy and performance profiler for CPUs and GPUs with various government labs.
But if you really care about this, you should actually profile your application. Nsight Systems makes this pretty simple to do. Dunno how many people actually care about having a TUI.
On nsys: agreed, it's great, but we wanted something that can run continuously rather than an offline analysis tool. We think there's room for both to be useful.
Just testing for now.
Are there any removal instructions or an uninstall function for Utilyze, beyond manually removing the utilyze & utlz binaries from ~/.local/bin & /usr/local/bin and cleaning up PATH in ~/.profile? In particular, how do I revoke the CAP_SYS_ADMIN capability and reverse any other changes it makes?
I don't fully get the 100% utilisation vs. 1-10% real compute. Given you rely on telemetry from users to add new models, are you trying to predict how fast a model should be on vLLM, compared to how it runs in practice? What if users tweak some hyperparameters?
What you described is the goal of Attainable SOL, but using GPU utilization as the metric rather than throughput. We're answering "for a given model and workload, have you optimized this well enough?", where "optimized" includes hyperparameter tuning. So if someone hasn't tuned batch size, parallelism, or other knobs well for their workload, the gap between their current utilization and the Attainable SOL is what tells them there's still room to improve.
We're motivated by the fact that reaching 100% Compute SOL is impossible -- no model can run at the hardware's theoretical maximum -- but we want to provide a realistic target for optimization. And we've noticed that different model architectures have different realistic ceilings. For example, MoE models run at much worse utilization due to their sparsity. We don't expect you to retrain an MoE model in order to get a higher utilization, and no hyperparameter tuning can bring you close to 100%, so the maximum attainable SOL should be lower for that model.
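As a hypothetical illustration of how we intend the numbers to be read (made-up ceilings, not measured ones):

```python
# Hypothetical numbers, for illustration only.
def optimization_headroom(measured_sol: float, attainable_sol: float) -> float:
    """Room left for tuning (batch size, parallelism, ...) before hitting the
    architecture-specific ceiling, in percentage points of Compute SOL."""
    return max(attainable_sol - measured_sol, 0.0)

# Dense model: if the attainable ceiling were ~60% and you measure 35%,
# there are ~25 points of headroom worth chasing with tuning.
print(optimization_headroom(measured_sol=35.0, attainable_sol=60.0))  # 25.0

# MoE model: sparsity lowers the ceiling itself (say ~25%), so measuring 22%
# means you're already close to as good as it gets without retraining.
print(optimization_headroom(measured_sol=22.0, attainable_sol=25.0))  # 3.0
```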
It's useful as a rough heuristic, but it tends to overestimate utilization. We've also noticed that power-derived metrics lag behind true utilization, since the controller that regulates power has a delayed response. This becomes especially important for spiky workloads like real-time inference.
Any tool (like nvtop) that only queries NVIDIA's NVML library does not have access to the detailed metrics that we draw upon, and therefore has to use proxies for efficiency.
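For a quick sense of what those detailed metrics look like outside of Utilyze, DCGM exposes similar profiling counters. A sketch (not our exact pipeline; assumes the dcgmi CLI is installed and the field IDs are DCGM's documented profiling fields):

```python
# Sketch: sample DCGM profiling counters that NVML-only tools don't surface.
# 1002 = DCGM_FI_PROF_SM_ACTIVE, 1004 = DCGM_FI_PROF_PIPE_TENSOR_ACTIVE,
# 1005 = DCGM_FI_PROF_DRAM_ACTIVE -- each reported relative to hardware peak.
import subprocess

# Ten one-second samples; requires the DCGM host engine / dcgmi to be running.
subprocess.run(
    ["dcgmi", "dmon", "-e", "1002,1004,1005", "-d", "1000", "-c", "10"],
    check=True,
)
```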
Will test it further.
Does anyone know of a good tool for "load balancing" usage across local GPUs?
Why: I have two RTX3090s (24GB). I've been using nvidia-smi to check usage of my RTX3090. Mostly I'm running llama.cpp with unsloth/Qwen3.6-27B-GGUF:Q4_K_M and getting some pretty decent results for a self-hosted LLM (orchestrated via opencode). I'm surprised at how well it is working for a local model. nvidia-smi is great for determining total VRAM usage, and nvtop gives a little more insight.
But I'm also doing some experiments with other non-LLM models (video generation, etc.), and want to find a way to time-slice across these GPUs, for example when my coding session is paused.
This "Utilyze" tool looks like it would give me better insight into the usage of one of them. Can it be scripted to better utilize my GPUs across a diverse load?
Any suggestions on existing projects out there? I thought about vibe coding something, but wonder if there's prior art.
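The naive version I have in mind is just picking whichever card has the most free VRAM before launching a job. A rough pynvml sketch (hypothetical script name, and it's placement rather than real time-slicing):

```python
# Rough sketch: launch each job on the GPU with the most free VRAM.
# This is placement, not time-slicing; names below are placeholders.
import os
import subprocess
import pynvml

def least_loaded_gpu() -> int:
    pynvml.nvmlInit()
    try:
        best_idx, best_free = 0, -1
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            free = pynvml.nvmlDeviceGetMemoryInfo(handle).free
            if free > best_free:
                best_idx, best_free = i, free
        return best_idx
    finally:
        pynvml.nvmlShutdown()

# Pin the workload to the least-loaded card via CUDA_VISIBLE_DEVICES.
env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(least_loaded_gpu()))
subprocess.run(["python", "video_gen_job.py"], env=env)  # placeholder command
```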
Nvidia’s toolsets and APIs are under-documented, and the commercial-grade hardware itself is super unreliable.
Developers and operators just put up with the whole situation because there is no alternative, to the point that they're ready to jump to things like TPUs or other custom silicon.
Say what you will about Intel, but their documentation and their commercial-grade hardware were top-notch. I hope they find their footing and stay humble this time.
One thing I'd love to see in tools like this is a "memory pressure" view that shows not just current VRAM usage but how close you are to the OOM cliff for the workload you're running. Running quantized LLMs on consumer GPUs (e.g. Q4_K_M Gemma 4 E4B on an 8GB card), you can be at 95% memory usage and totally fine, or at 80% and one context spike away from a crash. nvtop and nvidia-smi give you the number but not the headroom.
Whether that's feasible without instrumenting the workload specifically is another question. But it's the metric I actually care about when I'm picking quantization levels.
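For what it's worth, the back-of-the-envelope version I do by hand looks roughly like this (hypothetical model shape and free-VRAM figure; the formula assumes standard attention and a runtime that grows the KV cache with context rather than pre-allocating it):

```python
# Back-of-the-envelope OOM headroom (hypothetical numbers, illustration only).
# KV cache grows roughly linearly with context, so distance to the cliff is
# approximately free VRAM divided by KV-cache bytes per token.

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    # 2x for K and V; assumes standard attention with an fp16/bf16 cache.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

free_vram = 1.5 * 1024**3  # say ~1.5 GiB still free on an 8 GB card
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
print(f"~{free_vram / per_token:,.0f} more context tokens before the cliff")
```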