RTX PRO 6000 Blackwell: FP16 End-to-End Baseline for Qwen2.5-3B
This update adds a standard FP16 end-to-end reference measurement for Qwen2.5-3B on the NVIDIA RTX PRO 6000 Blackwell Server Edition. Unlike the earlier phase-separated backend-compatibility case study, these measurements are reported in the main leaderboard unit: joules per 1,000 generated tokens (J/1k tok).
Experimental Setup
| Item | Value |
|---|---|
| GPU | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| CUDA capability | 12.0 |
| Software stack | PyTorch 2.11.0+cu128, CUDA runtime 12.8 |
| Model path used in run | /root/autodl-tmp/models/Qwen2.5-3B |
| Precision | FP16 |
| Batch size | 1 |
| Prompt length | 128 tokens |
| Generated lengths | 256 and 512 tokens |
| Power measurement | NVML available during run |
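As a rough illustration of how sampled GPU power can be turned into the leaderboard unit, the sketch below integrates timestamped power readings with the trapezoidal rule and normalizes by the number of generated tokens. It is not the benchmark's actual measurement code: the function names `energy_joules` and `joules_per_1k_tokens` are hypothetical, and the sampling interval and integration method are assumptions. If the samples come from NVML (for example via `pynvml.nvmlDeviceGetPowerUsage`, which reports milliwatts), they would need to be converted to watts first.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate power samples (watts) over time (seconds), trapezoidal rule.

    Assumption: samples are already in watts and timestamps are non-decreasing.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    if len(timestamps_s) < 2:
        raise ValueError("need at least two samples to integrate")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        # Area of one trapezoid: average power over the interval times its width.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def joules_per_1k_tokens(energy_j: float, generated_tokens: int) -> float:
    """Normalize a run's energy to joules per 1,000 generated tokens."""
    if generated_tokens <= 0:
        raise ValueError("generated_tokens must be positive")
    return energy_j * 1000.0 / generated_tokens
```

For example, a synthetic run drawing a constant 200 W for 2 s while generating 256 tokens integrates to 400 J, which this normalization reports as 1,562.5 J/1k tok.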
FP16 End-to-End Results
| Model | GPU | Precision | Batch | Generated Tokens | Energy (J/1k tok) | Throughput (tok/s) | Average Power (W) | Runs |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-3B | RTX PRO 6000 Blackwell | FP16 | 1 | 256 | 2,660.83 ± 124.87 | 83.05 ± 0.38 | 221.58 ± 10.12 | 10 |
| Qwen2.5-3B | RTX PRO 6000 Blackwell | FP16 | 1 | 512 | 2,709.67 ± 38.22 | 83.17 ± 0.26 | 225.61 ± 2.90 | 10 |
The two generation lengths produce similar throughput (83.05 vs. 83.17 tok/s) and similar energy per 1,000 tokens (2,660.83 vs. 2,709.67 J, a difference of about 1.8%), providing a stable FP16 anchor point for the RTX PRO 6000 Blackwell under this end-to-end protocol.
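The three reported columns can be sanity-checked against each other, since energy per token is approximately average power divided by throughput. The sketch below applies that identity to the table's mean values; the helper name `implied_energy_per_1k` is hypothetical. Exact agreement is not expected, because the table reports per-column means over 10 runs and a ratio of means generally differs from a mean of per-run ratios.

```python
# Mean values copied from the results table, keyed by generated-token count.
REPORTED = {
    256: {"energy_j_per_1k": 2660.83, "throughput_tok_s": 83.05, "power_w": 221.58},
    512: {"energy_j_per_1k": 2709.67, "throughput_tok_s": 83.17, "power_w": 225.61},
}


def implied_energy_per_1k(power_w: float, throughput_tok_s: float) -> float:
    """Energy per 1,000 tokens implied by average power and throughput.

    W / (tok/s) = J/tok, so multiply by 1000 to get J per 1k tokens.
    """
    return power_w / throughput_tok_s * 1000.0


def relative_gap(reported: float, implied: float) -> float:
    """Relative difference between the reported and implied energy values."""
    return abs(implied - reported) / reported


for tokens, row in REPORTED.items():
    implied = implied_energy_per_1k(row["power_w"], row["throughput_tok_s"])
    gap = relative_gap(row["energy_j_per_1k"], implied)
    print(f"{tokens} tokens: implied {implied:.2f} J/1k tok, gap {gap:.2%}")
```

For both rows the implied value lands within half a percent of the reported energy, which suggests the three columns are internally consistent.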
Reported ± values are one standard deviation across 10 runs. The 256-token energy result has a coefficient of variation (CV) of approximately 4.7%, which exceeds the CV < 3% threshold used for the main benchmark; the 512-token energy result has a CV of approximately 1.4%, which is within it.
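The CV figures follow directly from the table: CV is the standard deviation divided by the mean. A minimal sketch of that arithmetic, using the reported energy values and the 3% threshold quoted above (the function and constant names are illustrative, not taken from the benchmark code):

```python
CV_THRESHOLD_PERCENT = 3.0  # main-benchmark threshold quoted in the text


def cv_percent(mean: float, std: float) -> float:
    """Coefficient of variation, expressed as a percentage of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    return 100.0 * std / mean


# (mean, standard deviation) of energy in J/1k tok, from the results table.
energy_results = {256: (2660.83, 124.87), 512: (2709.67, 38.22)}

for tokens, (mean, std) in energy_results.items():
    cv = cv_percent(mean, std)
    status = "within" if cv < CV_THRESHOLD_PERCENT else "above"
    print(f"{tokens} tokens: CV = {cv:.1f}% ({status} the {CV_THRESHOLD_PERCENT:.0f}% threshold)")
```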
Interpretation and Relationship to the Blackwell Phase Study
This FP16 end-to-end result should be treated separately from the June 3 phase-separated measurements. The present update uses the standard leaderboard unit, J/1k generated tokens, and measures FP16 inference only. The June 3 update instead examined prefill/decode energy and quantized-path backend behavior under bitsandbytes 0.49.2.
Together, the two updates support a conservative interpretation: the RTX PRO 6000 Blackwell stack produced stable FP16 end-to-end inference, while the earlier quantized phase-separated results should be read as evidence of an interaction between the quantization backend and the GPU architecture, not as representative quantized leaderboard performance.