RTX PRO 6000 (Blackwell): Phase-Separated Energy Profiling & Backend Compatibility Finding
This update reports phase-separated prefill and decode measurements on an NVIDIA RTX PRO 6000 Blackwell GPU. The data is presented as a supplementary backend-architecture case study rather than a standard leaderboard benchmark, because the measurement protocol and interpretation differ from the main end-to-end energy results.
Section 1: Experimental Overview
| Item | Value |
|---|---|
| GPU | RTX PRO 6000 Blackwell Server Edition, 96 GB |
| Software stack | bitsandbytes 0.49.2, PyTorch 2.11.0, CUDA 12.8 |
| Power sampling rate | 100 Hz (vs. 10 Hz in the main benchmark) |
| Models | Qwen2-1.5B, Qwen2.5-3B, Yi-1.5-6B |
| Configuration coverage | 15 measured rows across 5 model/batch settings and 3 precisions; 11 non-FP16 quantized or anchor configurations are used for interpretation. |
| Measurement protocol | Phase-separated prefill energy per call and decode energy per 32 generated tokens. |
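
The overview does not state how the sampled power trace is converted into per-phase energy. As a minimal sketch, assuming timestamped power samples (100 Hz here) and known phase start and end times, per-phase energy can be estimated by trapezoidal integration over each phase window. The helper name `phase_energy_joules` and the synthetic trace are hypothetical and are not taken from the measurement code. On this synthetic trace the two windows integrate to roughly 49 J and 150 J; the single sampling interval between the phases is dropped, which illustrates the boundary error that a higher sampling rate reduces.

```python
def phase_energy_joules(samples, t_start, t_end):
    """Integrate (time_s, power_w) samples over [t_start, t_end] with the trapezoid rule.

    Hypothetical helper. `samples` must be sorted by time. Samples outside the
    window are ignored, so the error at each window edge is at most one
    sampling interval.
    """
    window = [(t, p) for t, p in samples if t_start <= t <= t_end]
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(window, window[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy


# Synthetic 100 Hz trace (illustrative only): 100 W up to t = 0.49 s, then 150 W.
RATE_HZ = 100
samples = [(i / RATE_HZ, 100.0 if i < 50 else 150.0) for i in range(151)]

prefill_j = phase_energy_joules(samples, 0.0, 0.49)   # prefill window
decode_j = phase_energy_joules(samples, 0.50, 1.50)   # decode window
print(f"prefill: {prefill_j:.2f} J, decode: {decode_j:.2f} J")
```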
Section 2: Phase-Separated Data Table
Ratios are computed relative to the FP16 baseline for the same model and batch size. Energy ratios above 1.0x indicate higher energy than FP16; decode speed ratios below 1.0x indicate lower throughput than FP16.
| Model | Batch | Precision | Prefill Energy / Call (J) | Decode Energy / 32 tok (J) | Decode tok/s | Prefill Energy Ratio | Decode Energy Ratio | Decode Speed Ratio |
|---|---|---|---|---|---|---|---|---|
| Qwen2-1.5B | 1 | FP16 | 0.924 | 42.737 | 105.26 | 1.00x | 1.00x | 1.00x |
| Qwen2-1.5B | 1 | INT8 | 4.736 | 135.724 | 15.79 | 5.12x | 3.18x | 0.15x |
| Qwen2-1.5B | 1 | NF4 | 3.569 | 71.281 | 41.10 | 3.86x | 1.67x | 0.39x |
| Qwen2-1.5B | 8 | FP16 | 1.326 | 48.156 | 789.97 | 1.00x | 1.00x | 1.00x |
| Qwen2-1.5B | 8 | INT8 | 5.829 | 154.507 | 112.77 | 4.39x | 3.21x | 0.14x |
| Qwen2-1.5B | 8 | NF4 | 4.511 | 133.460 | 244.35 | 3.40x | 2.77x | 0.31x |
| Qwen2.5-3B | 1 | FP16 | 1.575 | 73.191 | 80.74 | 1.00x | 1.00x | 1.00x |
| Qwen2.5-3B | 1 | INT8 | 6.620 | 186.035 | 12.30 | 4.20x | 2.54x | 0.15x |
| Qwen2.5-3B | 1 | NF4 | 6.412 | 108.622 | 31.71 | 4.07x | 1.48x | 0.39x |
| Qwen2.5-3B | 8 | FP16 | 2.423 | 81.280 | 636.22 | 1.00x | 1.00x | 1.00x |
| Qwen2.5-3B | 8 | INT8 | 8.364 | 212.776 | 86.55 | 3.45x | 2.62x | 0.14x |
| Qwen2.5-3B | 8 | NF4 | 7.886 | 234.826 | 188.79 | 3.25x | 2.89x | 0.30x |
| Yi-1.5-6B | 8 | FP16 | 3.645 | 116.080 | 617.32 | 1.00x | 1.00x | 1.00x |
| Yi-1.5-6B | 8 | INT8 | 9.124 | 218.745 | 99.99 | 2.50x | 1.88x | 0.16x |
| Yi-1.5-6B | 8 | NF4 | 12.126 | 365.331 | 222.80 | 3.33x | 3.15x | 0.36x |
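
The ratio columns can be reproduced from the raw columns by dividing each quantized value by the FP16 value for the same model and batch size. The sketch below does this for the Yi-1.5-6B rows; the dictionary layout and the function name `ratios_vs_fp16` are illustrative choices, not part of the benchmark tooling. Because the table values are already rounded, recomputing some rows this way can differ from the published ratio by 0.01.

```python
# Rows copied from the table above (Yi-1.5-6B, batch size 8).
rows = {
    "FP16": {"prefill_j": 3.645, "decode_j": 116.080, "tok_s": 617.32},
    "INT8": {"prefill_j": 9.124, "decode_j": 218.745, "tok_s": 99.99},
    "NF4": {"prefill_j": 12.126, "decode_j": 365.331, "tok_s": 222.80},
}


def ratios_vs_fp16(rows, precision):
    """Divide each metric for `precision` by the FP16 value of the same metric."""
    base = rows["FP16"]
    return {key: round(rows[precision][key] / base[key], 2) for key in base}


for precision in ("INT8", "NF4"):
    print(precision, ratios_vs_fp16(rows, precision))
```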
Section 3A: Finding 1 — Observed Phase-Separated Pattern
In every one of the 10 measured quantized configurations, the energy overhead relative to FP16 is larger in the prefill phase than in the decode phase. Prefill energy ratios range from 2.50x to 5.12x, while decode energy ratios range from 1.48x to 3.21x. This pattern suggests that dequantization overhead may be amplified during compute-intensive parallel token processing.
This is reported as an observed pattern under this backend and hardware stack, not as a demonstrated universal prefill/decode mechanism.
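
The pattern can be checked directly against the published ratio columns. The sketch below copies the prefill and decode energy ratios of the 10 quantized rows from the data table and verifies that prefill exceeds decode in every row; the variable names are illustrative. The narrowest margin is Yi-1.5-6B NF4 at batch size 8 (3.33x prefill vs. 3.15x decode).

```python
# (model, batch, precision, prefill_energy_ratio, decode_energy_ratio)
# copied from the phase-separated data table.
quantized = [
    ("Qwen2-1.5B", 1, "INT8", 5.12, 3.18),
    ("Qwen2-1.5B", 1, "NF4", 3.86, 1.67),
    ("Qwen2-1.5B", 8, "INT8", 4.39, 3.21),
    ("Qwen2-1.5B", 8, "NF4", 3.40, 2.77),
    ("Qwen2.5-3B", 1, "INT8", 4.20, 2.54),
    ("Qwen2.5-3B", 1, "NF4", 4.07, 1.48),
    ("Qwen2.5-3B", 8, "INT8", 3.45, 2.62),
    ("Qwen2.5-3B", 8, "NF4", 3.25, 2.89),
    ("Yi-1.5-6B", 8, "INT8", 2.50, 1.88),
    ("Yi-1.5-6B", 8, "NF4", 3.33, 3.15),
]

prefill = [row[3] for row in quantized]
decode = [row[4] for row in quantized]

# True only if the prefill ratio is larger than the decode ratio in every row.
prefill_exceeds_decode = all(p > d for p, d in zip(prefill, decode))
prefill_range = (min(prefill), max(prefill))
decode_range = (min(decode), max(decode))

print(prefill_exceeds_decode, prefill_range, decode_range)
```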
Section 3B: Finding 2 — Backend-Architecture Interaction
On this Blackwell GPU with bitsandbytes 0.49.2, no energy crossover was observed: no quantized configuration used less energy than its FP16 baseline in either phase. All tested models, including Yi-1.5-6B at batch size 8, showed substantial quantized-path energy penalties relative to FP16. INT8 decode speed was only about 14% to 16% of FP16 across the measured settings, consistent with architecture-specific quantized-path inefficiency under the evaluated backend.
This reinforces that quantization energy behavior is implementation-specific and should not be interpreted solely from nominal bit width.
Diagnostic Power Evidence: Low INT8 TDP Utilization
Average power ranges provide the key diagnostic context for the backend-architecture interpretation. INT8 runs were not only slow; they also drew only 11% to 18% of the 600 W TDP, pointing to ineffective GPU utilization rather than simply higher algorithmic work.
| Precision | Prefill Power (W) | Prefill % of 600 W TDP | Decode Power (W) | Decode % of 600 W TDP |
|---|---|---|---|---|
| FP16 | 92-247 | 15-41% | 141-280 | 23-47% |
| NF4 | 114-326 | 19-54% | 92-318 | 15-53% |
| INT8 | 67-109 | 11-18% | 67-85 | 11-14% |
This low-power, low-throughput signature is consistent with the INT8 kernels failing to utilize the Blackwell GPU effectively under the evaluated bitsandbytes backend.
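
The power figures can be cross-checked against the energy and throughput columns of the data table, since average power is energy divided by duration. The sketch below does this for the Qwen2-1.5B batch-size-1 rows. It assumes that the decode energy covers exactly the 32 generated tokens and that the reported tok/s is the average rate over that same window; both are assumptions about the protocol, and the function name `implied_decode_power_w` is illustrative. Under those assumptions the implied decode power is roughly 141 W for FP16, 67 W for INT8, and 92 W for NF4, which matches the lower end of each decode power range in the table.

```python
TDP_W = 600.0        # TDP figure used in the power table
DECODE_TOKENS = 32   # tokens generated per decode measurement

# Qwen2-1.5B, batch size 1: (precision, decode energy in J, decode tok/s).
qwen2_1p5b_b1 = [
    ("FP16", 42.737, 105.26),
    ("INT8", 135.724, 15.79),
    ("NF4", 71.281, 41.10),
]


def implied_decode_power_w(energy_j, tok_per_s, tokens=DECODE_TOKENS):
    """Average power = energy / duration, with duration = tokens / throughput."""
    duration_s = tokens / tok_per_s
    return energy_j / duration_s


implied = {p: implied_decode_power_w(e, s) for p, e, s in qwen2_1p5b_b1}
for precision, watts in implied.items():
    print(f"{precision}: {watts:.1f} W ({100 * watts / TDP_W:.1f}% of TDP)")
```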
Section 4: Relationship to the Main Leaderboard
These measurements are not directly comparable to the main leaderboard for three reasons: (1) the phase-separated measurement protocol differs from the end-to-end energy protocol; (2) bitsandbytes quantized execution paths show architecture-specific inefficiency on this Blackwell GPU; and (3) generation length is 32 tokens in the decode phase, compared with 256/512-token settings in the main benchmark. This data is therefore presented as a supplementary case study rather than a standard benchmark entry.
In particular, the RTX PRO 6000 values should not be read as a normal leaderboard claim that quantization is inherently 2-3x more energy-intensive on Blackwell. They document the measured behavior of the evaluated bitsandbytes backend under this specific hardware-software stack. The 15 measured phase-separated rows in this update are not counted in the homepage's main benchmark configuration total.