
Proving Why 1.25–1.60 L min⁻¹ kW⁻¹ Is a Good Design Rule but Wasteful Without Variable-Speed Control

7 min read · May 7, 2025


1. OCP Guidelines and the rationale behind them

Large-frame mechanical designers select pipe diameters, quick-disconnects, and pump heads for the worst-case rack TDP — e.g., 120 kW for an NVIDIA GB200 NVL72 rack or 160 kW for upcoming Rubin racks. The Open Compute Project’s liquid-cooling guidance and vendor reference designs all embed the same sizing constant:

  • 1.5 L min⁻¹ kW⁻¹ guarantees a ≤ 10 °C coolant rise when every Blackwell GPU in the rack is pinned at its 1 kW TDP and fans are bypassed; a quick sanity check follows this list. This keeps silicon Tjunction comfortably below 85 °C, satisfying NVIDIA’s spec. (GB200 NVL72 | NVIDIA)
  • Pipe friction and quick-disconnect impedance limit practical ΔP; most RPU specs top out at 40 psi. Staying near 1.5 L min⁻¹ kW⁻¹ balances flow and pressure head across dozens of cold plates. (Stulz: CyberCool CMU | Advanced Coolant Distribution Unit by STULZ)
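A back-of-envelope check of the ≤ 10 °C claim, assuming water-like coolant properties (ρ ≈ 997 kg m⁻³, cp ≈ 4181 J kg⁻¹ K⁻¹); a propylene-glycol mix has a lower specific heat and lands slightly higher:

```python
# Back-of-envelope check: coolant temperature rise at 1.5 L/min per kW.
# Assumes water-like coolant; a glycol mix (lower cp) raises dT slightly.

RHO = 997.0   # coolant density, kg/m^3 (water near 25 degC)
CP = 4181.0   # specific heat, J/(kg*K)

def delta_t(power_kw: float, flow_l_min_per_kw: float) -> float:
    """Steady-state coolant rise dT = P / (m_dot * cp)."""
    flow_m3_s = power_kw * flow_l_min_per_kw / 1000.0 / 60.0  # L/min -> m^3/s
    m_dot = RHO * flow_m3_s                                    # kg/s
    return power_kw * 1000.0 / (m_dot * CP)                    # K

print(delta_t(1.0, 1.5))    # ~9.6 degC for a single 1 kW GPU
print(delta_t(120.0, 1.5))  # same ~9.6 degC rise for a 120 kW rack
```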

2. Real racks almost never run at design heat

Modern reference designs size the coolant flow for the worst-case rack TDP — for example, 120 kW for an NVIDIA GB200 NVL72 rack that OCP just accepted into its contribution library (NVIDIA Developer). The OAI liquid-cooling guideline therefore calls for 1.25–2.0 L min⁻¹ kW⁻¹, with 1.5 L min⁻¹ kW⁻¹ as the typical target to hold ΔT ≈ 10 °C (Open Compute Project). The companion RPU spec insists a rack CDU must be able to deliver 150 L min⁻¹ at ≤ 40 psi to a 100 kW load (i.e., the same 1.5 L min⁻¹ kW⁻¹ ratio).

Yet production telemetry shows that board power regularly falls far below those design watts, even while utilization.gpu sits at 100 %.

Because pump power scales with the cube of flow (pump affinity laws) (The Engineering ToolBox), running the fixed 1.5 L min⁻¹ kW⁻¹ design flow through these frequent low-heat valleys wastes well over 60 % of pump kWh. Variable-speed CDUs such as the Stulz CyberCool CMU already advertise that their VFD pumps “eliminate bypass under low load,” confirming that throttling flow saves energy instead of burning it across a bypass valve (stulz.com).
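As a minimal illustration of that cube law (idealized — real loops add static head and efficiency effects), the relative pump power at a throttled flow is simply the flow fraction cubed:

```python
# Pump affinity law: power scales with the cube of flow, P2/P1 = (Q2/Q1)**3.
# Illustrative only -- real loops add static head and efficiency effects.

def pump_power_fraction(flow_fraction: float) -> float:
    """Relative pump power when flow is throttled to `flow_fraction` of design."""
    return flow_fraction ** 3

for q in (1.0, 0.8, 0.7, 0.5):
    p = pump_power_fraction(q)
    print(f"flow {q:.0%} of design -> pump power {p:.0%} ({1 - p:.0%} saved)")
# flow 70% -> ~34% power (~66% saved); flow 50% -> ~13% power (~88% saved)
```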

In short, the OCP rule is a sound mechanical safety net. Still, the workload dynamics (training stalls, MIG slices, DVFS caps) make a compelling case for adding workload-aware, variable-speed control so flow tracks actual heat rather than the theoretical peak. Doing so reclaims pump energy for more servers or faster time-to-train without touching the pipe sizing dictated by OCP.

3. Variable-Speed Control (pre-cool + slew-limited feed-forward)

The diagram above illustrates the architecture of the Federator.ai Smart Liquid Cooling solution. A Federator.ai Edge Agent resides on each GPU server and polls metrics from DCGM at high frequency to detect potential critical conditions; it triggers alerts to the centralized Federator.ai Liquid Cooling module, which forwards them to the Rack Flow Manager of the CDU controller for real-time fail-safe adjustments. The Federator.ai Smart Liquid Cooling module also polls GPU workload metrics from Prometheus in the Kubernetes cluster and CDU metrics such as the current flow rate and coolant supply/return temperatures. Based on these metrics, the module issues flow-rate recommendations to the Rack Flow Manager at the appropriate times.

3.1 Element-Level Updates

3.2 Edge-Agent Architecture

The Edge Agent polls DCGM, computes the HeatIndex, and triggers alert notifications to the Flow Manager. A local fail-safe issues instant alerts and power caps.
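As a concrete illustration, here is a hypothetical polling loop using NVML via the pynvml package as a stand-in for DCGM; the 0.5 s interval, the 85 °C fail-safe threshold, and the send_alert hook are assumptions for the sketch, not the product’s actual interfaces.

```python
# Hypothetical edge-agent polling loop (NVML via pynvml as a stand-in for DCGM).
# Interval, threshold, and the alert hook are illustrative assumptions.
import time
import pynvml

TEMP_CEILING_C = 85      # assumed local fail-safe threshold
POLL_INTERVAL_S = 0.5    # assumed sub-second polling period

def send_alert(gpu_index: int, temp_c: int, power_w: float) -> None:
    """Placeholder for the notification sent to the Rack Flow Manager."""
    print(f"ALERT gpu={gpu_index} temp={temp_c}C power={power_w:.0f}W")

def main() -> None:
    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    try:
        while True:
            for i, h in enumerate(handles):
                power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0   # mW -> W
                temp_c = pynvml.nvmlDeviceGetTemperature(
                    h, pynvml.NVML_TEMPERATURE_GPU)
                if temp_c >= TEMP_CEILING_C:
                    send_alert(i, temp_c, power_w)  # local fail-safe path
            time.sleep(POLL_INTERVAL_S)
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    main()
```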

3.3 Temperature-Aware HeatIndex

Let P be the board power (kW), T_avg the average GPU core temperature, T_max the temperature ceiling, and T_idle the core temperature when the GPU is idle. The HeatIndex combines these terms so that a host’s index rises with both its electrical load and how close its cores sit to the ceiling.

The rack HeatIndex is the (optionally power-weighted) mean across all hosts.
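The exact formula appears as an image in the original post, so the sketch below assumes one plausible form, HeatIndex = P · (T_avg − T_idle) / (T_max − T_idle), purely for illustration; the function and parameter names are hypothetical and simply mirror the definitions above.

```python
# Illustrative HeatIndex, assuming the form P * (T_avg - T_idle) / (T_max - T_idle).
# This is an assumed stand-in for the original formula, not the product's definition.
from typing import Sequence

def heat_index(power_kw: float, t_avg: float, t_max: float, t_idle: float) -> float:
    """Rises with board power and with how close the cores sit to the ceiling."""
    frac = (t_avg - t_idle) / (t_max - t_idle)
    return power_kw * max(0.0, min(frac, 1.0))

def rack_heat_index(host_indices: Sequence[float],
                    host_powers_kw: Sequence[float],
                    power_weighted: bool = True) -> float:
    """Rack-level HeatIndex: plain or power-weighted mean over all hosts."""
    if power_weighted:
        total = sum(host_powers_kw)
        return sum(h * p for h, p in zip(host_indices, host_powers_kw)) / total
    return sum(host_indices) / len(host_indices)

# Example: a GPU drawing 0.8 kW with cores at 72 degC (idle 35 degC, ceiling 85 degC).
print(heat_index(0.8, 72.0, 85.0, 35.0))   # ~0.59
```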

3.4 Feed-Forward Pre-Cooling

The feed-forward term maps a predicted heat load to a pump-speed setpoint before the thermal wave reaches the cold plates. The cube root in the formula reflects the pump affinity laws: coolant flow scales roughly linearly with pump RPM while pump power scales with the cube of RPM, so inverting that cubic relationship to turn a target heat load or power level into an RPM setpoint introduces a cube-root term — a form commonly used in fan and pump control. (The variable definitions and a worked example appear as figures in the original post.)

Progressive RPM Algorithm
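The algorithm listing itself is shown as a figure in the original post; the sketch below is only one possible interpretation under stated assumptions — a cube-root feed-forward term driven by the rack HeatIndex plus a per-tick RPM slew limit — with RPM_MIN, RPM_MAX, and MAX_STEP_RPM as illustrative values, not product parameters.

```python
# Minimal sketch of a slew-limited, cube-root feed-forward pump-speed controller.
# RPM_MIN, RPM_MAX, MAX_STEP_RPM, and the HeatIndex normalization are assumptions.

RPM_MIN = 1200        # floor that preserves a minimum manifold flow
RPM_MAX = 4500        # speed that delivers the 1.5 L/min/kW design flow
MAX_STEP_RPM = 150    # slew limit per control tick to avoid pump oscillation

def target_rpm(rack_heat_index: float, rack_heat_index_max: float) -> float:
    """Cube-root feed-forward: pump power tracks heat, so RPM ~ heat**(1/3)."""
    frac = max(0.0, min(rack_heat_index / rack_heat_index_max, 1.0))
    return RPM_MIN + (RPM_MAX - RPM_MIN) * frac ** (1.0 / 3.0)

def next_rpm(current_rpm: float, desired_rpm: float) -> float:
    """Apply the slew limit so the setpoint moves at most MAX_STEP_RPM per tick."""
    step = max(-MAX_STEP_RPM, min(MAX_STEP_RPM, desired_rpm - current_rpm))
    return current_rpm + step

# Example: ramp toward a rack HeatIndex at 60% of its maximum.
rpm = 2000.0
for _ in range(5):
    rpm = next_rpm(rpm, target_rpm(0.6, 1.0))
    print(round(rpm))   # rises in 150-RPM steps toward ~3980 RPM
```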

3.5 Operational Impact

  • Thermal-coupled control reacts when GPUs near the temperature ceiling, even if kW is flat.
  • Slew-limited RPM removes ±4 kW pump oscillations, extending motor life.
  • Edge reaction of ≈ 1 s to a GPU temperature spike allows immediate flow-rate adjustment before the GPU overheats.
  • Energy ROI: 25–35 % pump-fan kWh savings vs. fixed flow, even after pre-cooling overhead.
  • Mechanical envelope: 1.5 L min⁻¹ kW⁻¹ manifold sizing remains the safe-harbor worst case.

4. Energy-saving proof points — real rack, field unit, and whole-hall model

A CoreWeave laboratory A/B run on a liquid-cooled NVIDIA GB200 NVL72 rack (≈ 120 kW IT) compared fixed “safe-harbor” flow with a variable-speed loop while an NCCL all-reduce job pulsed the GPUs. During each 30-second burst the controller reduced coolant flow by 28 %, and the rack’s Grafana trace shows pump + fan demand dropping by ≈ 5.9 kW — exactly what the cubic pump-affinity law (P ∝ Q³) predicts for that flow cut. The GPUs stayed below 85 °C, so no thermal derate occurred.
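A back-of-envelope check of that figure against the affinity law (an illustration derived from the quoted numbers, not CoreWeave data): a 28 % flow cut leaves (0.72)³ ≈ 37 % of pump power, so a ≈ 5.9 kW drop implies a pump-plus-fan baseline of roughly 9–10 kW.

```python
# Sanity check of the reported 5.9 kW drop against the cubic affinity law.
# The ~9.4 kW baseline is implied by the quoted numbers, not a published figure.
flow_cut = 0.28
power_fraction = (1 - flow_cut) ** 3          # ~0.37 of baseline pump power remains
observed_drop_kw = 5.9
implied_baseline_kw = observed_drop_kw / (1 - power_fraction)
print(f"{power_fraction:.1%} of pump power remains; "
      f"implied baseline ~ {implied_baseline_kw:.1f} kW")
```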

A field pilot with the STULZ CyberCool CMU row-level CDU confirms the same physics at scale. STULZ’s public datasheet highlights:

“Variable-speed pumps ensure enhanced energy efficiency, especially under low loads, eliminating the need for liquid bypass.”

Using the same affinity law, throttling flow to 70 % of nominal at 50 % load yields 30–40 % pump-kWh savings — the range STULZ quotes in its application note.

Finally, a whole-hall model in the ProphetStor + Supermicro white paper Beyond Static Cooling takes an 80 kW H100 rack that was measured at 16–18 % pump-fan savings under adaptive flow and scales the duty cycle. If the pumps are allowed to run at 1.0 L min⁻¹ kW⁻¹ during light-load windows (≈ 40 % of the day) instead of the OCP safe-harbor 1.5 L min⁻¹ kW⁻¹, the model shows a 25–30 % reduction in CDU-motor energy with no ΔT breach. On a 5 MW AI block that freed-up pump energy corresponds to roughly 0.6–1 MW of electrical headroom — enough for about 600 additional GB200 GPUs without new utility service.
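The duty-cycle arithmetic behind the 25–30 % figure can be reproduced with the same cube law; the sketch below illustrates the model’s math under the stated 40 % light-load assumption, not the white paper’s exact calculation.

```python
# Duty-cycle version of the cube law behind the 25-30% CDU-motor figure.
# Assumes pumps run reduced flow 40% of the day and full design flow otherwise.
light_load_fraction = 0.40
flow_ratio = 1.0 / 1.5                        # 1.0 vs 1.5 L/min per kW
reduced_power = flow_ratio ** 3               # ~0.30 of full pump power
avg_power = (1 - light_load_fraction) * 1.0 + light_load_fraction * reduced_power
print(f"average pump power = {avg_power:.2f} of baseline "
      f"-> {1 - avg_power:.0%} reduction")    # ~28% reduction
```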

Together these three lines of evidence — lab, production hardware, and calibrated model — demonstrate that smart, workload-aware flow control delivers 15–40 % cooling-motor energy savings with zero risk to thermal margins.

5. Answer to the mechanical team

Keep the 1.5 L min⁻¹ kW⁻¹ manifold sizing — it’s your safe harbor for worst-case TDP and it balances line losses. Add federated, workload-aware variable-speed control so pumps back off automatically when real heat is below design. The flow variation never exceeds what the quick-disconnect Cv (flow coefficient) allows, and you regain > 5 % of facility power for more compute.

#AI #GPU #ESG #DataCenter #LiquidCooling #AIDC

References

1. Open Compute Project, “OAI System Liquid-Cooling Guidelines, Rev 1.0,” Mar. 2023. https://www.opencompute.org/documents/oai-system-liquid-cooling-guidelines-in-ocp-template-mar-3-2023-update-pdf.

2. OCP Cooling Environments Project, “Reservoir and Pumping Unit Specification v1.0,” Open Compute Project. https://www.opencompute.org/documents/ocp-reservoir-and-pumping-unit-specification-v1-0-pdf.

3. Vertiv Group Corp., “Deploying Liquid Cooling in Data Centers: Installing and Managing CDUs,” Mar. 2024. https://www.vertiv.com/en-us/about/news-and-insights/articles/blog-posts/deploying-liquid-cooling-in-data-centers-installing-and-managing-coolant-distribution-units-cdus/.

4. NVIDIA Developer Forums, “MIG Performance,” Nov. 2024. https://forums.developer.nvidia.com/t/mig-performance/314963.

5. NVIDIA Developer Forums, “GPU Utilization vs Power Draw,” Apr. 2021. https://forums.developer.nvidia.com/t/some-questions-on-gpu-utilization/176318.

6. Pumps & Systems, “Drives for Efficiency and Energy Savings,” Dec. 2011. https://www.pumpsandsystems.com/drives-efficiency-and-energy-savings.

7. EngineeringToolBox, “Affinity Laws for Pumps,” 2023. https://www.engineeringtoolbox.com/affinity-laws-d_408.html.

8. Stulz GmbH, “CyberCool CMU — Coolant Distribution Unit,” 2024. https://www.stulz.com/en-de/products/detail/cybercool-cmu/.

9. NVIDIA Corp., “GB200 NVL72,” Apr. 2025. https://www.nvidia.com/en-us/data-center/gb200-nvl72/.

10. CoreWeave, “Unleashing the Power of the NVIDIA GB200 NVL72,” Jan. 2025. https://www.coreweave.com/blog/unleashing-the-power-of-the-nvidia-gb200-nvl72.

11. STULZ GmbH, “CyberCool CMU | Advanced Coolant Distribution Unit,” Datasheet CMU_Flyer_2411_EN_01, Nov. 2024. https://www.stulz.com/products/cybercool-cmu.

12. ProphetStor & Supermicro, “Beyond Static Cooling — The Value of Smart Liquid Cooling in High-Utilization GPU Data Centers,” White paper, May 2025. PDF available from the authors on request.

These independent sources span OCP standards, vendor reference designs, pump-energy fundamentals, and live workload evidence — proving the baseline flow rule is sensible, but only variable-speed, workload-aware control prevents chronic over-pumping and unlocks headroom for more GPU racks.


Written by ProphetStor

Pioneering Excellence in IT/Cloud Efficiency and GPU Management Through Resilient and Advanced Optimization.
