Supplementary Case Study | Not a main leaderboard entry | Published: 2026-06-03

RTX PRO 6000 (Blackwell): Phase-Separated Energy Profiling & Backend Compatibility Finding

This update reports phase-separated prefill and decode measurements on an NVIDIA RTX PRO 6000 Blackwell GPU. The data is presented as a supplementary backend-architecture case study rather than a standard leaderboard benchmark, because the measurement protocol and interpretation differ from the main end-to-end energy results.


Section 1: Experimental Overview

Item	Value
GPU	RTX PRO 6000 Blackwell Server Edition, 96GB
Software stack	bitsandbytes 0.49.2, PyTorch 2.11.0, CUDA 12.8
Power sampling rate	100 Hz (vs. 10 Hz in the main benchmark)
Models	Qwen2-1.5B, Qwen2.5-3B, Yi-1.5-6B
Configuration coverage	15 measured rows across 5 model/batch settings and 3 precisions; 11 non-FP16 quantized or anchor configurations are used for interpretation.
Measurement protocol	Phase-separated prefill energy per call and decode energy per 32 generated tokens.
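
To make the protocol concrete, the following is a minimal sketch of how per-phase energy could be derived from evenly spaced power samples. It assumes trapezoidal integration and a known sample index for the prefill/decode boundary; the page does not state how the actual pipeline integrates or segments its 100 Hz trace, so the function name `integrate_energy`, the boundary index, and the constant-power trace are all hypothetical.

```python
from typing import Sequence


def integrate_energy(power_w: Sequence[float], sample_rate_hz: float) -> float:
    """Trapezoidal estimate of energy in joules from evenly spaced power samples in watts."""
    if len(power_w) < 2:
        return 0.0
    dt = 1.0 / sample_rate_hz
    return sum((a + b) / 2.0 * dt for a, b in zip(power_w, power_w[1:]))


SAMPLE_RATE_HZ = 100.0  # sampling rate reported for this case study

# Hypothetical trace: 0.5 s at a constant 100 W gives 51 samples at 100 Hz.
trace = [100.0] * 51

# Hypothetical phase boundary at sample 10 (t = 0.10 s). The boundary sample
# is shared by both windows so that no interval is dropped or counted twice.
boundary = 10
prefill_energy_j = integrate_energy(trace[: boundary + 1], SAMPLE_RATE_HZ)
decode_energy_j = integrate_energy(trace[boundary:], SAMPLE_RATE_HZ)

print(f"prefill: {prefill_energy_j:.2f} J, decode: {decode_energy_j:.2f} J")
```

At 100 Hz each sample interval is 10 ms, so a phase lasting tens of milliseconds still spans several samples. At the main benchmark's 10 Hz, a phase that short could fall between two samples.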

Section 2: Phase-Separated Data Table

Ratios are computed relative to the FP16 baseline for the same model and batch size. For the two energy ratios, values above 1.0x indicate higher energy than FP16. For the decode speed ratio, values below 1.0x indicate lower throughput than FP16.

Model	Batch	Precision	Prefill Energy / Call (J)	Decode Energy / 32 tok (J)	Decode tok/s	Prefill Energy Ratio	Decode Energy Ratio	Decode Speed Ratio
Qwen2-1.5B	1	FP16	0.924	42.737	105.26	1.00x	1.00x	1.00x
Qwen2-1.5B	1	INT8	4.736	135.724	15.79	5.12x	3.18x	0.15x
Qwen2-1.5B	1	NF4	3.569	71.281	41.10	3.86x	1.67x	0.39x
Qwen2-1.5B	8	FP16	1.326	48.156	789.97	1.00x	1.00x	1.00x
Qwen2-1.5B	8	INT8	5.829	154.507	112.77	4.39x	3.21x	0.14x
Qwen2-1.5B	8	NF4	4.511	133.460	244.35	3.40x	2.77x	0.31x
Qwen2.5-3B	1	FP16	1.575	73.191	80.74	1.00x	1.00x	1.00x
Qwen2.5-3B	1	INT8	6.620	186.035	12.30	4.20x	2.54x	0.15x
Qwen2.5-3B	1	NF4	6.412	108.622	31.71	4.07x	1.48x	0.39x
Qwen2.5-3B	8	FP16	2.423	81.280	636.22	1.00x	1.00x	1.00x
Qwen2.5-3B	8	INT8	8.364	212.776	86.55	3.45x	2.62x	0.14x
Qwen2.5-3B	8	NF4	7.886	234.826	188.79	3.25x	2.89x	0.30x
Yi-1.5-6B	8	FP16	3.645	116.080	617.32	1.00x	1.00x	1.00x
Yi-1.5-6B	8	INT8	9.124	218.745	99.99	2.50x	1.88x	0.16x
Yi-1.5-6B	8	NF4	12.126	365.331	222.80	3.33x	3.15x	0.36x
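
The ratio columns can be recomputed from the raw columns. The sketch below does so for the Yi-1.5-6B batch-8 rows, using the values exactly as listed above; the dictionary layout and the helper name `ratios_vs_fp16` are illustrative only.

```python
# (prefill J per call, decode J per 32 tok, decode tok/s) for Yi-1.5-6B, batch 8
yi_batch8 = {
    "FP16": (3.645, 116.080, 617.32),
    "INT8": (9.124, 218.745, 99.99),
    "NF4": (12.126, 365.331, 222.80),
}


def ratios_vs_fp16(rows):
    """Divide each metric by the FP16 value for the same model and batch size."""
    base = rows["FP16"]
    return {
        precision: tuple(round(value / ref, 2) for value, ref in zip(values, base))
        for precision, values in rows.items()
    }


ratios = ratios_vs_fp16(yi_batch8)
for precision, (prefill, decode, speed) in ratios.items():
    print(f"{precision}: prefill {prefill:.2f}x, decode {decode:.2f}x, speed {speed:.2f}x")
```

Recomputing from the rounded table values can differ from the listed ratio by 0.01 in a few rows. For example, 4.736 / 0.924 rounds to 5.13, while the table lists 5.12x for Qwen2-1.5B INT8 at batch size 1. One possible explanation is that the published ratios were computed from unrounded measurements.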

Section 3A: Finding 1 — Observed Phase-Separated Pattern

Across all 10 measured quantized configurations, the energy overhead relative to FP16 is larger in the prefill phase than in the decode phase. Prefill energy ratios range from 2.50x to 5.12x, while decode energy ratios range from 1.48x to 3.21x. The gap is smallest for Yi-1.5-6B NF4 at batch size 8 (3.33x vs. 3.15x). This pattern suggests that dequantization overhead may be amplified during compute-intensive parallel token processing.

This is reported as an observed pattern under this backend and hardware stack, not as a demonstrated universal prefill/decode mechanism.
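
The pattern can be checked directly against the ratio columns in Section 2. The sketch below copies the published prefill and decode energy ratios for the quantized rows and counts how many show a larger prefill ratio; the tuple layout and variable names are illustrative.

```python
# (model, batch, precision, prefill energy ratio, decode energy ratio)
quantized = [
    ("Qwen2-1.5B", 1, "INT8", 5.12, 3.18),
    ("Qwen2-1.5B", 1, "NF4", 3.86, 1.67),
    ("Qwen2-1.5B", 8, "INT8", 4.39, 3.21),
    ("Qwen2-1.5B", 8, "NF4", 3.40, 2.77),
    ("Qwen2.5-3B", 1, "INT8", 4.20, 2.54),
    ("Qwen2.5-3B", 1, "NF4", 4.07, 1.48),
    ("Qwen2.5-3B", 8, "INT8", 3.45, 2.62),
    ("Qwen2.5-3B", 8, "NF4", 3.25, 2.89),
    ("Yi-1.5-6B", 8, "INT8", 2.50, 1.88),
    ("Yi-1.5-6B", 8, "NF4", 3.33, 3.15),
]

# Rows where the prefill overhead exceeds the decode overhead.
prefill_higher = [row for row in quantized if row[3] > row[4]]

# Rows where a quantized path used less energy than FP16 in either phase.
crossovers = [row for row in quantized if row[3] < 1.0 or row[4] < 1.0]

smallest_gap = min(quantized, key=lambda row: row[3] - row[4])
largest_gap = max(quantized, key=lambda row: row[3] - row[4])

print(f"prefill ratio higher in {len(prefill_higher)} of {len(quantized)} rows")
print(f"energy crossovers: {len(crossovers)}")
print(f"smallest gap: {smallest_gap[:3]}, largest gap: {largest_gap[:3]}")
```

The gap varies widely between rows, which is one reason to treat this as an observed pattern rather than a fixed prefill/decode relationship.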

Section 3B: Finding 2 — Backend-Architecture Interaction

On this Blackwell GPU with bitsandbytes 0.49.2, no energy crossover was observed: no quantized configuration consumed less energy than its FP16 baseline in either phase. All tested models, including Yi-1.5-6B at batch size 8, showed substantial quantized-path energy penalties relative to FP16. INT8 decode speed was only about 14% to 16% of FP16 across the measured settings, consistent with architecture-specific quantized-path inefficiency under the evaluated backend.

This reinforces that quantization energy behavior is implementation-specific and should not be interpreted solely from nominal bit width.

Diagnostic Power Evidence: Low INT8 TDP Utilization

Average power ranges provide the key diagnostic context for the backend-architecture interpretation. INT8 runs were not only slow; they also drew only 11-18% of the 600 W TDP, which points to ineffective GPU utilization rather than simply more algorithmic work.

Precision	Prefill Power (W)	% of 600W TDP	Decode Power (W)	% of 600W TDP
FP16	92-247	15-41%	141-280	23-47%
NF4	114-326	19-54%	92-318	15-53%
INT8	67-109	11-18%	67-85	11-14%

This low-power / low-throughput signature is consistent with quantized kernels failing to utilize the Blackwell GPU effectively under the evaluated bitsandbytes backend.
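
As a rough cross-check, average decode power can be estimated from the Section 2 columns as energy divided by decode time. The sketch below rests on two assumptions that this page does not state explicitly: that "Decode tok/s" is aggregate throughput across the batch, and that each sequence generates 32 tokens, so a batch of size B produces 32 × B tokens. The helper name `implied_decode_power` is illustrative.

```python
TDP_W = 600.0
TOKENS_PER_SEQUENCE = 32  # assumed to apply per sequence, not per batch

# (model, batch, decode energy per 32 tok in J, decode tok/s) for the INT8 rows
int8_rows = [
    ("Qwen2-1.5B", 1, 135.724, 15.79),
    ("Qwen2-1.5B", 8, 154.507, 112.77),
    ("Qwen2.5-3B", 1, 186.035, 12.30),
    ("Qwen2.5-3B", 8, 212.776, 86.55),
    ("Yi-1.5-6B", 8, 218.745, 99.99),
]


def implied_decode_power(batch, energy_j, tok_per_s):
    """Average power in watts, assuming tok/s is aggregate over the batch."""
    decode_seconds = TOKENS_PER_SEQUENCE * batch / tok_per_s
    return energy_j / decode_seconds


implied = {
    (model, batch): implied_decode_power(batch, energy_j, tok_per_s)
    for model, batch, energy_j, tok_per_s in int8_rows
}

for (model, batch), watts in implied.items():
    print(f"{model} b{batch}: {watts:.1f} W ({100 * watts / TDP_W:.1f}% of TDP)")
```

Under these assumptions the implied INT8 decode power falls between roughly 67 W and 85 W, in line with the range reported in the power table. The agreement also suggests, without confirming, that the throughput column is aggregate across the batch.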

Section 4: Relationship to the Main Leaderboard

These measurements are not directly comparable to the main leaderboard for three reasons: (1) the phase-separated measurement protocol differs from the end-to-end energy protocol; (2) bitsandbytes quantized execution paths show architecture-specific inefficiency on this Blackwell GPU; and (3) generation length is 32 tokens in the decode phase, compared with 256/512-token settings in the main benchmark. This data is therefore presented as a supplementary case study rather than a standard benchmark entry.

In particular, the RTX PRO 6000 values should not be read as a normal leaderboard claim that quantization is inherently 2-3x more energy-intensive on Blackwell. They document the measured behavior of the evaluated bitsandbytes backend under this specific hardware-software stack. The 15 measured phase-separated rows in this update are not counted in the homepage's main benchmark configuration total.
