Ask HN: How is GPU power draw measured at scale?

5 points | by anax32 1 day ago

1 comment

lemonademan 1 day ago

Once you get beyond a handful of GPUs, I believe people end up using both levels of telemetry, because they answer different questions. NVML is nice for per-request attribution and understanding model behavior, but I believe PDU/BMC measurements are better suited to measuring actual power draw, since they capture everything (CPUs, networking, PSU losses, fans, etc.).
For instance, people running 32+ GPU setups probably sample rack/PDU power every second and correlate it with the GPU-level data by timestamp, rather than trying to preserve strict per-request attribution at the rack level.
Either way, I haven't seen many people publish how they instrument this in practice, so take what I wrote with a grain of salt. I simply wanted to share a little of what I understand, and I hope it helps.
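
To make the timestamp-correlation idea concrete, here is a rough sketch rather than anyone's real pipeline. It assumes you already have two time series: rack power polled from the PDU about once per second, and total GPU power (for example, NVML's `nvmlDeviceGetPowerUsage`, which reports milliwatts, summed over devices and converted to watts). The function names `nearest_sample` and `correlate` and all the numbers are made up for illustration.

```python
from bisect import bisect_left


def nearest_sample(timestamps, values, t):
    """Return the value whose timestamp is closest to t.

    timestamps must be sorted ascending and the same length as values.
    """
    i = bisect_left(timestamps, t)
    if i == 0:
        return values[0]
    if i == len(timestamps):
        return values[-1]
    before, after = timestamps[i - 1], timestamps[i]
    return values[i] if after - t < t - before else values[i - 1]


def correlate(pdu_samples, gpu_samples):
    """Pair each PDU sample with the nearest-in-time total GPU power.

    pdu_samples: list of (unix_time_s, rack_watts)
    gpu_samples: list of (unix_time_s, total_gpu_watts), sorted by time
    Returns a list of (unix_time_s, rack_watts, gpu_watts, non_gpu_watts).
    """
    gpu_t = [t for t, _ in gpu_samples]
    gpu_w = [w for _, w in gpu_samples]
    rows = []
    for t, rack_w in pdu_samples:
        g = nearest_sample(gpu_t, gpu_w, t)
        # Whatever the rack draws beyond the GPUs: CPUs, fans, PSU losses, etc.
        rows.append((t, rack_w, g, rack_w - g))
    return rows


# Hypothetical readings, for illustration only.
pdu = [(100.0, 5200.0), (101.0, 6100.0), (102.0, 5300.0)]
gpu = [(99.9, 3900.0), (100.4, 4000.0), (101.1, 4800.0), (101.9, 4050.0)]

rows = correlate(pdu, gpu)
for row in rows:
    print(row)
```

Nearest-timestamp matching only makes sense if the clocks on the hosts and the PDU poller are reasonably in sync (NTP or similar). The last column is the part of wall power that GPU-level telemetry never sees.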

- anax32 23 hours ago
  
  Yes, thank you. That's exactly where I am, and I'm trying to gather some knowledge.
  The power draw from the wall is especially important, because a spike across multiple devices at the same time can cause issues which are really difficult to debug.