May 13, 2025

GPU utilization is lying to you

Enterprise GPU utilization numbers almost always look healthier than they are. Not because anyone is lying exactly, but because most monitoring tools measure the wrong thing. Here is what to actually look at, and what the real number usually turns out to be.

Alex Hatfield

CEO, Juno Innovations

VFX
ORION
GPU ORCHESTRATION

The number on your GPU dashboard is almost certainly wrong.

Not broken. Not misconfigured. Just measuring the wrong thing, in a way that makes your fleet look a lot busier than it actually is. Vendors cite GPU utilization figures anywhere from 10% to 90% for the same class of workload. Neither end of that range is technically a lie. They just aren't measuring the same thing.

We built Orion to fix GPU scheduling, so we have skin in this game. Take that into account. But we've spent years looking at what customer GPU clusters actually do versus what their monitoring thinks they're doing, and the gap is usually significant enough to matter.

Here is what we've learned.

The definition problem

GPU utilization is not a single metric. Depending on what you're asking, it's at least four different ones.

SM utilization is what most dashboards show. It's the percentage of time at least one streaming multiprocessor (SM) was active during a sampling interval. A GPU that ran one tiny kernel for a single millisecond in a one-second window reports 0.1% utilization. One running the same kernel back to back for the whole second reports 100%, even if that kernel only occupies a fraction of the SMs. Both readings are accurate. Neither tells you whether the GPU is earning its keep.
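To make the time-active definition concrete, here is a minimal Python sketch of the arithmetic. The function name `time_active_percent` and the interval representation are illustrative assumptions, not any vendor's API or the way a driver actually samples.

```python
def time_active_percent(kernel_intervals, window_s=1.0):
    """Percent of the sampling window during which any kernel was running.

    kernel_intervals: list of (start_s, end_s) tuples inside the window.
    Overlapping intervals are merged so busy time is not double counted.
    """
    busy = 0.0
    current_start, current_end = None, None
    for start, end in sorted(kernel_intervals):
        if current_end is None or start > current_end:
            # Close the previous merged interval before starting a new one.
            if current_end is not None:
                busy += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        busy += current_end - current_start
    return 100.0 * busy / window_s


# One 1 ms kernel in a 1 s window versus a kernel running the whole second.
one_ms_kernel = time_active_percent([(0.0, 0.001)])
whole_second = time_active_percent([(0.0, 1.0)])
print(f"{one_ms_kernel:.1f}% vs {whole_second:.1f}%")
```

Nothing in this calculation knows how many SMs the kernel occupied, which is why the reading says so little about how hard the GPU is working.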

Memory bandwidth utilization is different. You can have high SM utilization and near-zero memory bandwidth utilization if your kernels are compute-bound. You can also have the reverse: memory-bound kernels that saturate bandwidth while the compute units mostly wait on data. For ML inference workloads, memory bandwidth tends to be the actual bottleneck, not the number anyone is watching.

Tensor Core utilization matters especially if you're running V100s or A100s. Those cards have Tensor Cores built specifically for matrix operations. If your workloads aren't using them, you're leaving most of the useful capacity on the table. A100 Tensor Core utilization at most shops we've seen: embarrassingly low.

Fleet utilization is the one that actually tells you whether you're getting value from your hardware spend. It's the percentage of time your GPUs are doing anything at all, aggregated across the whole cluster over time. It's also the hardest number to find in most monitoring setups, which is not a coincidence.

When someone tells you their GPU utilization is 80%, ask which of these they're measuring. The conversation usually gets awkward.

Why the number looks fine on your dashboard

Time averaging is the main culprit. A GPU that ran flat out for an hour and sat idle for three hours reports 25% utilization over the four-hour block. That's accurate math. But if your monitoring rolls this into a daily average, and your workloads run in business-hours bursts, the number you look at each morning tells you almost nothing about actual capacity utilization.
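A small Python sketch shows how the averaging hides the shape of the day. The hourly samples below are made-up illustrative values, not measurements from any real cluster.

```python
from statistics import mean

# The four-hour block from the text: one hour flat out, three hours idle.
four_hour_block = mean([100.0, 0.0, 0.0, 0.0])

# Hypothetical hourly utilization samples (percent) for one GPU on one day:
# idle overnight, flat out from 09:00 to 17:00, idle again in the evening.
hourly = [0.0] * 9 + [100.0] * 8 + [0.0] * 7

daily_average = mean(hourly)  # the single number a daily rollup reports
busy_hours = [h for h, u in enumerate(hourly) if u > 0]
busy_average = mean(hourly[h] for h in busy_hours)

print(f"four-hour block: {four_hour_block:.1f}%")
print(f"daily average: {daily_average:.1f}%")
print(f"average during busy hours: {busy_average:.1f}%")
print(f"busy hours: {busy_hours[0]:02d}:00 to {busy_hours[-1] + 1:02d}:00")
```

The daily average alone cannot distinguish this GPU from one that idles along at a third of its capacity all day, and those two situations call for different fixes.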

The per-process versus per-device mismatch makes it worse. Your monitoring probably reports GPU utilization at the device level. A shared GPU running three idle processes and one active one looks fine at the device level, because the one active process keeps the device busy. Look at it per process and three of the four allocations are sitting idle.

Then there's the provisioned/allocated/actually-used gap. You provision a GPU instance. A workload gets scheduled to it. The workload is waiting on a CPU-bound preprocessing step. The GPU is allocated, SM utilization is zero, and most monitoring systems count it as in use. It isn't.

What the real number usually looks like

SM utilization during active jobs tends to look reasonable: 60-80% is typical when workloads are actually running. The problem is how many hours of the day those workloads are running.

Batch jobs run in windows. Researchers submit jobs, go to a meeting, come back, wait on results. Development and testing are bursty. Interactive workloads track human working patterns, which means they're idle whenever the human is.

When you fold all that idle time into fleet utilization, the number drops fast. Goldman Sachs and Cast AI have both put enterprise on-premises GPU utilization at 10-15% on average. Our own observations across customers before they deploy Orion land in that same range.

Some people push back on this with higher figures. Usually they're measuring peak SM utilization during active jobs, not fleet-wide time-averaged utilization. Both numbers are real. They just mean different things.

The 10-15% figure is the uncomfortable one. It means you're paying for a cluster that does useful work for roughly one hour out of every eight. The other seven, those GPUs are waiting.

The workstation problem specifically

Interactive GPU workstations are where utilization gets particularly bad, because the usage pattern doesn't match batch compute at all.

A render farm can be optimized for throughput. You queue jobs, schedule around demand, run the cluster at high occupancy. The GPU is either working or waiting for the next job, and you can tune that.

An interactive workstation has a human in the loop. Usage tracks the human: active during a complex render preview, idle while they read notes or take a call, active again when they adjust parameters and kick off another preview. The GPU is provisioned to the artist's peak demand but runs at their average demand. Those two numbers are not close.

Traditional 1:1 provisioning means you're paying for peak capacity around the clock. If an artist works eight hours and the GPU is actively processing for three of those, you're at 37.5% utilization in the best case. Off hours, weekends, PTO: full cost, zero output.
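The arithmetic is worth spelling out, because the workday figure is the flattering one. This Python sketch uses the three-of-eight-hours example from above; the five-day working week is an added assumption for illustration.

```python
# Numbers from the example above; the 5-day week is an assumption.
active_hours_per_day = 3
workday_hours = 8
working_days_per_week = 5

# Utilization measured only against the hours the artist is at work.
workday_util = active_hours_per_day / workday_hours

# Utilization measured against everything a 1:1 instance is billed for.
full_day_util = active_hours_per_day / 24
weekly_util = (active_hours_per_day * working_days_per_week) / (24 * 7)

print(f"during the workday: {workday_util:.1%}")
print(f"over a full working day: {full_day_util:.1%}")
print(f"over a full week: {weekly_util:.1%}")
```

Under these assumptions the same workstation drops from 37.5% to under 10% once nights and weekends are counted in the denominator.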

This is the specific problem Orion solved at R3D Studios. Moving from 1:1 provisioning to shared fleet orchestration got them to 2:1 user density: 10 artists across 5 GPU instances. Per-artist compute cost dropped roughly 40% on AWS EC2. The artists didn't notice, because the scheduling layer handled contention.

"Orion shifted our focus from finding stability to using the stability to iterate." That's Donald Strubler, Head of Technology at R3D Studios. The stability was always there in the hardware. It just wasn't being used.

How to actually measure this

Pull fleet-level time-averaged utilization. Not per-job. Not per-device during active jobs. The denominator is total available GPU-hours across your cluster over 30 days. The numerator is GPU-hours where SM utilization was above some meaningful threshold, say 10%. That's the number that tells you whether your hardware spend is working.
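As a rough sketch of that calculation in Python: the `Sample` record and `fleet_utilization` function below are hypothetical names, and the code assumes you have already exported per-GPU SM utilization samples from whatever monitoring system you run.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    gpu_id: str
    sm_util_percent: float  # SM utilization reported for this interval
    interval_hours: float   # length of the sampling interval


def fleet_utilization(samples, gpu_count, window_hours, threshold_percent=10.0):
    """Active GPU-hours divided by total available GPU-hours."""
    available = gpu_count * window_hours
    if available <= 0:
        raise ValueError("gpu_count and window_hours must be positive")
    active = sum(
        s.interval_hours for s in samples
        if s.sm_util_percent > threshold_percent
    )
    return active / available


# Two GPUs over one day: gpu-a works 6 hours, gpu-b never clears the threshold.
samples = (
    [Sample("gpu-a", 70.0, 1.0) for _ in range(6)]
    + [Sample("gpu-a", 0.0, 1.0) for _ in range(18)]
    + [Sample("gpu-b", 5.0, 1.0) for _ in range(24)]
)
utilization = fleet_utilization(samples, gpu_count=2, window_hours=24)
print(f"fleet utilization: {utilization:.1%}")
```

The denominator comes from the GPU count and the window, not from the samples, so a GPU that reported nothing still counts as available hours. For the 30-day window described above, `window_hours` would be 720.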

Then pull scheduled versus available hours separately. What percentage of the day are GPUs allocated to any workload at all? A GPU can be allocated and completely idle. These are different things.
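One way to keep the two apart is to classify every GPU-hour twice: was it allocated, and was it active. The Python sketch below assumes a simple list of per-GPU-hour records; the function name, record shape, and 10% threshold are illustrative assumptions.

```python
def allocation_report(gpu_hours, threshold_percent=10.0):
    """Split GPU-hours into allocated, active, and allocated-but-idle.

    gpu_hours: list of (allocated: bool, sm_util_percent: float),
    one entry per GPU per hour.
    """
    total = len(gpu_hours)
    if total == 0:
        raise ValueError("no GPU-hours to report on")
    allocated = sum(1 for is_alloc, _ in gpu_hours if is_alloc)
    active = sum(
        1 for is_alloc, util in gpu_hours
        if is_alloc and util > threshold_percent
    )
    return {
        "allocated": allocated / total,
        "active": active / total,
        "allocated_but_idle": (allocated - active) / total,
    }


# Ten GPU-hours: six allocated, but only two of those doing real work.
records = (
    [(True, 75.0)] * 2      # allocated and active
    + [(True, 0.0)] * 4     # allocated, waiting on something else
    + [(False, 0.0)] * 4    # not allocated at all
)
report = allocation_report(records)
print(report)
```

A scheduler-centric dashboard would report the 60% figure here. The hardware did useful work for 20% of the hours.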

Break it out by workload type. Aggregate numbers hide patterns. Training runs, inference, interactive workstations, batch rendering: each has a different utilization profile and different optimization options. If you average them together, you'll miss what's actually going wrong.
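A short Python sketch of the breakdown, assuming each GPU-hour record already carries a workload label. The labels and sample values are invented for illustration.

```python
from collections import defaultdict


def utilization_by_workload(records, threshold_percent=10.0):
    """Fraction of active GPU-hours per workload type.

    records: iterable of (workload_type, sm_util_percent), one per GPU-hour.
    """
    totals = defaultdict(int)
    active = defaultdict(int)
    for workload, util in records:
        totals[workload] += 1
        if util > threshold_percent:
            active[workload] += 1
    return {w: active[w] / totals[w] for w in totals}


# Hypothetical data: training is busy, workstations are mostly idle.
records = (
    [("training", 80.0)] * 8 + [("training", 0.0)] * 2
    + [("workstation", 60.0)] * 1 + [("workstation", 0.0)] * 9
)
by_type = utilization_by_workload(records)
aggregate = sum(1 for _, u in records if u > 10.0) / len(records)

print(f"aggregate: {aggregate:.0%}")
for workload, fraction in by_type.items():
    print(f"{workload}: {fraction:.0%}")
```

The aggregate of 45% in this made-up data describes neither group: training sits at 80% and the workstations at 10%.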

Finally, look at idle time by hour of day. Most enterprise clusters have deep valleys overnight and on weekends. If your business runs 9-5, five days a week, in one timezone, that covers 40 of the 168 hours in a week. Your GPU cluster is probably sitting idle for well over half the calendar hours you're paying for.
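An hour-of-day idle profile can be built from timestamped samples with a few lines of Python. The sketch below generates a synthetic week for a single GPU that is busy 9-5 on weekdays only; the function name, the 10% threshold, and the schedule are all assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def idle_fraction_by_hour(samples, threshold_percent=10.0):
    """Fraction of samples at or below the threshold, per hour of day.

    samples: iterable of (timestamp: datetime, sm_util_percent: float).
    """
    total = defaultdict(int)
    idle = defaultdict(int)
    for ts, util in samples:
        total[ts.hour] += 1
        if util <= threshold_percent:
            idle[ts.hour] += 1
    return {hour: idle[hour] / total[hour] for hour in sorted(total)}


# Synthetic week of hourly samples starting Monday 2025-05-05.
start = datetime(2025, 5, 5)
week = []
for offset in range(7 * 24):
    ts = start + timedelta(hours=offset)
    busy = ts.weekday() < 5 and 9 <= ts.hour < 17
    week.append((ts, 80.0 if busy else 0.0))

profile = idle_fraction_by_hour(week)
overall_idle = sum(1 for _, u in week if u <= 10.0) / len(week)

print(f"idle at 03:00: {profile[3]:.0%}")
print(f"idle at 10:00: {profile[10]:.0%}")
print(f"idle overall: {overall_idle:.0%}")
```

Even the busiest hours in this synthetic week are idle on weekends, and the overall idle share is 128 of 168 hours.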

One more thing: don't trust a single day's snapshot. GPU workloads are bursty and project-dependent. A 30-day window with daily breakdowns is useful. A snapshot isn't.

What good utilization actually looks like

There's no universal target. It depends on what you're running.

For batch training where you want maximum throughput, 70-80% sustained SM utilization during scheduled windows, covering most of the 24-hour day, is genuinely good. HPC clusters running genomics or financial modeling can hit this. It takes real scheduling discipline to get there and stay there.

For interactive workstations, 2:1 or 3:1 user density with good scheduling is a realistic improvement from 1:1. Beyond that you start trading latency for cost in ways most creative workloads won't accept.

For inference serving, GPU utilization is almost the wrong metric. Latency and throughput per dollar are what matter. Thirty percent SM utilization on an efficient inference cluster is fine if latency targets are met.

The number to benchmark against isn't some industry-wide "good" threshold. It's your current number, honestly measured, compared to what you'd get with better scheduling.

The capacity story nobody tells

GPU hardware lead times are long right now. Cloud GPU availability tightens during peak demand. If your organization is planning AI workloads that need more GPU capacity, the first question worth asking is whether you're actually using what you have.

A cluster running at 12% fleet utilization has a lot of headroom before it needs new hardware. The same cluster with better scheduling might support twice the workloads at the same cost. That's less a cost story than a capability story: you can do more without buying more, and you don't have to wait six months for new hardware to arrive.

We find this gets lost in the per-user cost math when customers evaluate Orion. The 40% lower compute spend at R3D is the number that's easy to cite. The harder-to-quantify thing is that they can now run workloads they'd have said no to before, because they thought they were out of capacity when they weren't.

Our number, for what it's worth

Orion customers typically see 2-4x more workload capacity from existing infrastructure. That's a wide range deliberately. The actual figure depends on your starting utilization, your workload mix, and what "capacity" means in your context.

The R3D data point is concrete: 1:1 provisioning to 2:1 user density, no performance degradation, roughly 40% lower compute costs on AWS EC2. That's one workload type at one scale. Other workloads will have different numbers.

If you want to know what your number might look like, the starting point is an honest measurement of your current fleet utilization. Not the dashboard number. The real one.

Everything else is math from there.

Alex Hatfield is the CEO and co-founder of Juno Innovations. Juno builds Orion, a customer-hosted unified compute plane that orchestrates Kubernetes, VMs, and bare metal across cloud, on-premises, and air-gapped environments.
