LLM Energy Benchmark
Supplementary Case Study · Not a main leaderboard entry · Published: 2026-06-03

RTX PRO 6000 (Blackwell): Phase-Separated Energy Profiling & Backend Compatibility Finding

This update reports phase-separated prefill and decode measurements on an NVIDIA RTX PRO 6000 Blackwell GPU. The data is presented as a supplementary backend-architecture case study rather than a standard leaderboard benchmark, because the measurement protocol and interpretation differ from the main end-to-end energy results.

Section 1: Experimental Overview

| Item | Value |
| --- | --- |
| GPU | RTX PRO 6000 Blackwell Server Edition, 96 GB |
| Software stack | bitsandbytes 0.49.2, PyTorch 2.11.0, CUDA 12.8 |
| Power sampling rate | 100 Hz (vs. 10 Hz in the main benchmark) |
| Models | Qwen2-1.5B, Qwen2.5-3B, Yi-1.5-6B |
| Configuration coverage | 15 measured rows across 5 model/batch settings and 3 precisions; 11 non-FP16 quantized or anchor configurations are used for interpretation. |
| Measurement protocol | Phase-separated prefill energy per call and decode energy per 32 generated tokens. |
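
The protocol above reports energy per phase from sampled power. As a rough illustration only, the sketch below assumes phase energy is obtained by trapezoidal integration of power samples over a phase's time window; this page does not state the actual integration method. The function name `integrate_energy_j`, the window boundaries, and the constant 150 W trace are hypothetical and are not measured data.

```python
def integrate_energy_j(timestamps_s, power_w, start_s, end_s):
    """Trapezoidal integral of sampled power (W) over [start_s, end_s], in joules.

    Assumption: power varies linearly between consecutive samples.
    """
    energy_j = 0.0
    samples = list(zip(timestamps_s, power_w))
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        # Clip each sample interval to the phase window.
        lo, hi = max(t0, start_s), min(t1, end_s)
        if hi <= lo:
            continue
        # Linearly interpolate power at the clipped endpoints.
        slope = (p1 - p0) / (t1 - t0)
        p_lo = p0 + slope * (lo - t0)
        p_hi = p0 + slope * (hi - t0)
        energy_j += 0.5 * (p_lo + p_hi) * (hi - lo)
    return energy_j


SAMPLE_HZ = 100  # sampling rate from the overview table
timestamps = [i / SAMPLE_HZ for i in range(SAMPLE_HZ + 1)]  # 1 s of samples
power = [150.0] * len(timestamps)  # synthetic constant trace, NOT measured data

# Hypothetical phase boundaries: prefill 0.0-0.1 s, decode 0.1-0.5 s.
prefill_j = integrate_energy_j(timestamps, power, 0.0, 0.1)
decode_j = integrate_energy_j(timestamps, power, 0.1, 0.5)
print(f"prefill {prefill_j:.1f} J, decode {decode_j:.1f} J")
```

At 100 Hz a 0.1 s prefill window contains about 10 samples, whereas at 10 Hz it would contain about one, which is one plausible reason a higher sampling rate matters for phase-separated measurement.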

Section 2: Phase-Separated Data Table

Ratios are computed relative to the FP16 baseline for the same model and batch size. For the energy ratios, values above 1.0 indicate higher energy than FP16. For the decode speed ratio, values below 1.0 indicate lower throughput than FP16.

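| Model | Batch | Precision | Prefill Energy / Call (J) | Decode Energy / 32 tok (J) | Decode tok/s | Prefill Energy Ratio | Decode Energy Ratio | Decode Speed Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2-1.5B | 1 | FP16 | 0.924 | 42.737 | 105.26 | 1.00x | 1.00x | 1.00x |
| Qwen2-1.5B | 1 | INT8 | 4.736 | 135.724 | 15.79 | 5.12x | 3.18x | 0.15x |
| Qwen2-1.5B | 1 | NF4 | 3.569 | 71.281 | 41.10 | 3.86x | 1.67x | 0.39x |
| Qwen2-1.5B | 8 | FP16 | 1.326 | 48.156 | 789.97 | 1.00x | 1.00x | 1.00x |
| Qwen2-1.5B | 8 | INT8 | 5.829 | 154.507 | 112.77 | 4.39x | 3.21x | 0.14x |
| Qwen2-1.5B | 8 | NF4 | 4.511 | 133.460 | 244.35 | 3.40x | 2.77x | 0.31x |
| Qwen2.5-3B | 1 | FP16 | 1.575 | 73.191 | 80.74 | 1.00x | 1.00x | 1.00x |
| Qwen2.5-3B | 1 | INT8 | 6.620 | 186.035 | 12.30 | 4.20x | 2.54x | 0.15x |
| Qwen2.5-3B | 1 | NF4 | 6.412 | 108.622 | 31.71 | 4.07x | 1.48x | 0.39x |
| Qwen2.5-3B | 8 | FP16 | 2.423 | 81.280 | 636.22 | 1.00x | 1.00x | 1.00x |
| Qwen2.5-3B | 8 | INT8 | 8.364 | 212.776 | 86.55 | 3.45x | 2.62x | 0.14x |
| Qwen2.5-3B | 8 | NF4 | 7.886 | 234.826 | 188.79 | 3.25x | 2.89x | 0.30x |
| Yi-1.5-6B | 8 | FP16 | 3.645 | 116.080 | 617.32 | 1.00x | 1.00x | 1.00x |
| Yi-1.5-6B | 8 | INT8 | 9.124 | 218.745 | 99.99 | 2.50x | 1.88x | 0.16x |
| Yi-1.5-6B | 8 | NF4 | 12.126 | 365.331 | 222.80 | 3.33x | 3.15x | 0.36x |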
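
The ratio columns can be reproduced from the raw columns of the table above. The following is a minimal sketch, with the values copied from the table; the names `MEASURED` and `ratios_vs_fp16` are illustrative only. Each ratio is the quantized value divided by the FP16 value for the same model and batch size.

```python
# (prefill J/call, decode J per 32 tok, decode tok/s), copied from the table above.
MEASURED = {
    ("Qwen2-1.5B", 1): {"FP16": (0.924, 42.737, 105.26),
                        "INT8": (4.736, 135.724, 15.79),
                        "NF4": (3.569, 71.281, 41.10)},
    ("Qwen2-1.5B", 8): {"FP16": (1.326, 48.156, 789.97),
                        "INT8": (5.829, 154.507, 112.77),
                        "NF4": (4.511, 133.460, 244.35)},
    ("Qwen2.5-3B", 1): {"FP16": (1.575, 73.191, 80.74),
                        "INT8": (6.620, 186.035, 12.30),
                        "NF4": (6.412, 108.622, 31.71)},
    ("Qwen2.5-3B", 8): {"FP16": (2.423, 81.280, 636.22),
                        "INT8": (8.364, 212.776, 86.55),
                        "NF4": (7.886, 234.826, 188.79)},
    ("Yi-1.5-6B", 8): {"FP16": (3.645, 116.080, 617.32),
                       "INT8": (9.124, 218.745, 99.99),
                       "NF4": (12.126, 365.331, 222.80)},
}


def ratios_vs_fp16(measured):
    """Divide each metric by the FP16 value for the same model and batch size."""
    out = {}
    for (model, batch), by_precision in measured.items():
        baseline = by_precision["FP16"]
        for precision, values in by_precision.items():
            out[(model, batch, precision)] = tuple(
                v / b for v, b in zip(values, baseline)
            )
    return out


RATIOS = ratios_vs_fp16(MEASURED)
for (model, batch, precision), (prefill, decode, speed) in RATIOS.items():
    print(f"{model:<11} b={batch} {precision:<4} "
          f"prefill {prefill:.2f}x  decode {decode:.2f}x  speed {speed:.2f}x")
```

Because the inputs here are already rounded to the table's precision, a recomputed ratio can differ from the published one in the last digit. For example, Qwen2-1.5B at batch 8 with INT8 gives 4.40x for prefill against the published 4.39x.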

Section 3A: Finding 1 — Observed Phase-Separated Pattern

In all 10 quantized rows of the Section 2 table, the quantization energy overhead relative to FP16 is larger in the prefill phase than in the decode phase. Prefill energy ratios range from 2.50x to 5.12x, while decode energy ratios range from 1.48x to 3.21x. This pattern suggests that dequantization overhead may be amplified during compute-intensive parallel token processing.

This is reported as an observed pattern under this backend and hardware stack, not as a demonstrated universal prefill/decode mechanism.
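
The prefill-versus-decode comparison can be checked directly against the published ratios. This is a minimal sketch using the ratio pairs copied from the Section 2 table; the variable names are illustrative.

```python
# (prefill energy ratio, decode energy ratio) for each quantized row,
# copied from the Section 2 table.
QUANT_RATIOS = {
    ("Qwen2-1.5B", 1, "INT8"): (5.12, 3.18),
    ("Qwen2-1.5B", 1, "NF4"): (3.86, 1.67),
    ("Qwen2-1.5B", 8, "INT8"): (4.39, 3.21),
    ("Qwen2-1.5B", 8, "NF4"): (3.40, 2.77),
    ("Qwen2.5-3B", 1, "INT8"): (4.20, 2.54),
    ("Qwen2.5-3B", 1, "NF4"): (4.07, 1.48),
    ("Qwen2.5-3B", 8, "INT8"): (3.45, 2.62),
    ("Qwen2.5-3B", 8, "NF4"): (3.25, 2.89),
    ("Yi-1.5-6B", 8, "INT8"): (2.50, 1.88),
    ("Yi-1.5-6B", 8, "NF4"): (3.33, 3.15),
}

prefill = [p for p, _ in QUANT_RATIOS.values()]
decode = [d for _, d in QUANT_RATIOS.values()]

# True only if the prefill overhead exceeds the decode overhead in every row.
all_prefill_larger = all(p > d for p, d in QUANT_RATIOS.values())
prefill_range = (min(prefill), max(prefill))
decode_range = (min(decode), max(decode))

# The row where the two phases are closest together.
closest = min(QUANT_RATIOS, key=lambda k: QUANT_RATIOS[k][0] - QUANT_RATIOS[k][1])

print("prefill > decode in every row:", all_prefill_larger)
print("prefill ratio range:", prefill_range)
print("decode ratio range:", decode_range)
print("smallest prefill/decode gap:", closest)
```

The gap between phases is not uniform: it narrows at batch size 8 for NF4, and is smallest for Yi-1.5-6B NF4 (3.33x prefill against 3.15x decode).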

Section 3B: Finding 2 — Backend-Architecture Interaction

On this Blackwell GPU with bitsandbytes 0.49.2, no energy crossover was observed: no quantized configuration used less energy than its FP16 baseline in either phase. All tested models, including Yi-1.5-6B at batch size 8, showed substantial quantized-path energy penalties relative to FP16. INT8 decode speed was only about 14% to 16% of FP16 across the measured settings, consistent with architecture-specific quantized-path inefficiency under the evaluated backend.

This reinforces that quantization energy behavior is implementation-specific and should not be interpreted solely from nominal bit width.

Diagnostic Power Evidence: Low INT8 TDP Utilization

Average power ranges provide the key diagnostic context for the backend-architecture interpretation. INT8 runs were not only slow; they also drew only a small fraction of the 600W TDP, pointing to ineffective GPU utilization rather than simply higher algorithmic work.

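| Precision | Prefill Power (W) | Prefill % of 600 W TDP | Decode Power (W) | Decode % of 600 W TDP |
| --- | --- | --- | --- | --- |
| FP16 | 92-247 | 15-41% | 141-280 | 23-47% |
| NF4 | 114-326 | 19-54% | 92-318 | 15-53% |
| INT8 | 67-109 | 11-18% | 67-85 | 11-14% |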

This low-power / low-throughput signature is consistent with quantized kernels failing to utilize the Blackwell GPU effectively under the evaluated bitsandbytes backend.
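
Average decode power can be cross-checked against the Section 2 table, since power is energy divided by duration. The sketch below rests on two assumptions that this page does not state explicitly: that the decode tok/s column is aggregate throughput across the batch, and that the decode energy covers 32 generated tokens per sequence. Under those assumptions the decode duration is 32 × batch / tok/s. The helper name `avg_decode_power_w` is illustrative.

```python
TDP_W = 600.0
TOKENS_PER_SEQ = 32  # decode energy is reported per 32 generated tokens


def avg_decode_power_w(energy_j, tok_per_s, batch):
    """Average power = energy / duration.

    Assumes tok_per_s is aggregate across the batch and energy_j covers
    TOKENS_PER_SEQ tokens for each of the `batch` sequences.
    """
    duration_s = TOKENS_PER_SEQ * batch / tok_per_s
    return energy_j / duration_s


# INT8 rows from the Section 2 table: (model, batch, decode J, decode tok/s).
INT8_DECODE = [
    ("Qwen2-1.5B", 1, 135.724, 15.79),
    ("Qwen2-1.5B", 8, 154.507, 112.77),
    ("Qwen2.5-3B", 1, 186.035, 12.30),
    ("Qwen2.5-3B", 8, 212.776, 86.55),
    ("Yi-1.5-6B", 8, 218.745, 99.99),
]

powers = [avg_decode_power_w(e, tps, b) for _, b, e, tps in INT8_DECODE]
for (model, batch, _, _), p in zip(INT8_DECODE, powers):
    print(f"{model:<11} b={batch}: {p:6.1f} W ({100 * p / TDP_W:4.1f}% of TDP)")
print(f"range: {min(powers):.0f}-{max(powers):.0f} W")
```

Under these assumptions the derived INT8 decode powers span roughly 67 W to 85 W, in line with the INT8 decode range in the power table, so the low power draw follows from the reported energy and throughput figures themselves.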

Section 4: Relationship to the Main Leaderboard

These measurements are not directly comparable to the main leaderboard for three reasons: (1) the phase-separated measurement protocol differs from the end-to-end energy protocol; (2) bitsandbytes quantized execution paths show architecture-specific inefficiency on this Blackwell GPU; and (3) generation length is 32 tokens in the decode phase, compared with 256/512-token settings in the main benchmark. This data is therefore presented as a supplementary case study rather than a standard benchmark entry.

In particular, the RTX PRO 6000 values should not be read as a normal leaderboard claim that quantization is inherently 2-3x more energy-intensive on Blackwell. They document the measured behavior of the evaluated bitsandbytes backend under this specific hardware-software stack. The 15 measured phase-separated rows in this update are not counted in the homepage's main benchmark configuration total.
