Open Data · Reproducible · NVML 10Hz Sampling · Last updated: Apr 18, 2026

How Much Energy Does LLM Quantization Actually Cost?

Real GPU power measurements for FP16, INT8, NF4 inference across 4 NVIDIA architectures (RTX 5090, RTX 4090D, T4, A800). Includes perplexity accuracy data and benchmark coverage from 1.1B to 14B parameters. All raw data, scripts, and methodology are open source.

LATEST UPDATE · 2026-04-18

Qwen2.5-3B on Tesla T4: Supplemental Results Added

Added Table 8 and Figure 5 summary for FP16 vs NF4 across batch sizes 1/2/4. New measurements show NF4 increases energy by +7.4% to +39.9% on T4 for this 3B model.

  • 4 GPU Architectures: Blackwell / Ada / Turing / Ampere
  • 7 Models Tested: 1.1B to 14B parameters
  • 360+ Configurations: FP16 / INT8 / INT8 Pure / NF4 / NF4 DQ
  • n=10 Runs per Config: CV < 3%

GPU Power Consumption Comparison: Quantization Energy Efficiency Leaderboard

Energy per 1,000 generated tokens (Joules). Lower is better. Compare INT8 vs NF4 vs FP16 energy efficiency across NVIDIA GPUs.

Four GPU architectures covering 1.1B to 14B parameters, from edge (T4, 70W) to datacenter (A800, 300W).

RTX 5090: NVIDIA GeForce RTX 5090 (Blackwell, 32GB)
Batch size = 1, seq_len = 512

| Model | Params | Precision | Energy (J/1k tok) | Throughput (tok/s) | Power (W) | vs FP16 |
|---|---|---|---|---|---|---|
| TinyLlama-1.1B | 1.1B | FP16 | 1,659 | 94.87 | 157.5 | baseline |
| TinyLlama-1.1B | 1.1B | NF4 | 2,098 | 55.79 | 117.0 | +26.5% |
| Qwen2-1.5B | 1.5B | FP16 | 2,411 | 71.45 | 172.3 | baseline |
| Qwen2-1.5B | 1.5B | NF4 | 3,120 | 41.57 | 129.8 | +29.4% |
| Qwen2.5-3B | 3B | FP16 | 3,383 | 54.77 | 185.6 | baseline |
| Qwen2.5-3B | 3B | NF4 | 3,780 | 31.85 | 120.5 | +11.7% |
| Qwen2-7B | 7B | FP16 | 5,509 | 70.47 | 388.3 | baseline |
| Qwen2-7B | 7B | NF4 | 4,878 | 41.40 | 201.9 | -11.4% |

NF4 saves energy only for 7B+ models on RTX 5090. For smaller models, NF4 increases energy by 12-29%.
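
The table columns are related by simple arithmetic: energy per 1,000 tokens is approximately average power divided by throughput, and the "vs FP16" column is the relative change against the FP16 row. The following is a minimal Python sketch using the TinyLlama-1.1B rows as inputs. The function names are illustrative and are not taken from the benchmark scripts, and small differences from the published values are expected because the published energies presumably come from per-run measurements rather than from rounded averages.

```python
def energy_per_1k_tokens(avg_power_w: float, throughput_tok_s: float) -> float:
    """Approximate joules per 1,000 generated tokens from average power and throughput."""
    seconds_per_1k = 1000.0 / throughput_tok_s
    return avg_power_w * seconds_per_1k


def pct_vs_baseline(value: float, baseline: float) -> float:
    """Relative change in percent; positive means more energy than the baseline."""
    return (value - baseline) / baseline * 100.0


# TinyLlama-1.1B on RTX 5090, values copied from the table.
fp16_est = energy_per_1k_tokens(157.5, 94.87)  # close to the reported 1,659 J
nf4_est = energy_per_1k_tokens(117.0, 55.79)   # close to the reported 2,098 J
nf4_vs_fp16 = pct_vs_baseline(2098, 1659)      # reported energies give about +26.5%
```

The same arithmetic explains why lower power draw under NF4 does not guarantee lower energy: if throughput falls faster than power, joules per token rise.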
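RTX 4090D: NVIDIA GeForce RTX 4090D (Ada Lovelace, 24GB)
5 quant configs, batch size = 1, n=10, Feb 25 2026

| Model | Params | Precision | Energy (J/1k tok) | Throughput (tok/s) | Power (W) | vs FP16 | CV |
|---|---|---|---|---|---|---|---|
| TinyLlama-1.1B | 1.1B | FP16 | 1,802 | 50.55 | 91.1 | baseline | 0.43% |
| TinyLlama-1.1B | 1.1B | INT8 Default | 4,335 | 16.01 | 69.4 | +140.6% | 0.27% |
| TinyLlama-1.1B | 1.1B | INT8 Pure | 3,121 | 23.21 | 72.4 | +73.2% | 0.94% |
| TinyLlama-1.1B | 1.1B | NF4 | 2,423 | 31.27 | 75.8 | +34.4% | 0.80% |
| TinyLlama-1.1B | 1.1B | NF4 DQ | 2,857 | 25.85 | 73.8 | +58.6% | 0.52% |
| Qwen2.5-3B | 3B | FP16 | 3,515 | 31.83 | 111.9 | baseline | 0.64% |
| Qwen2.5-3B | 3B | INT8 Default | 9,207 | 7.94 | 73.1 | +161.9% | 0.99% |
| Qwen2.5-3B | 3B | INT8 Pure | 5,310 | 14.64 | 77.7 | +51.1% | 0.54% |
| Qwen2.5-3B | 3B | NF4 | 4,411 | 18.37 | 81.0 | +25.5% | 0.92% |
| Qwen2.5-3B | 3B | NF4 DQ | 5,109 | 15.60 | 79.7 | +45.3% | 0.74% |
| Yi-1.5-6B | 6B | FP16 | 5,208 | 35.70 | 185.9 | baseline | 0.41% |
| Yi-1.5-6B | 6B | INT8 Default | 9,390 | 8.55 | 80.2 | +80.3% | 2.72% |
| Yi-1.5-6B | 6B | INT8 Pure | 5,791 | 15.90 | 92.0 | +11.2% | 0.88% |
| Yi-1.5-6B | 6B | NF4 | 4,932 | 21.20 | 104.5 | -5.3% | 1.19% |
| Yi-1.5-6B | 6B | NF4 DQ | 5,652 | 17.24 | 97.4 | +8.5% | 2.88% |
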
Key insight: On RTX 4090D, all quantization methods increase energy for models ≤3B. NF4 begins saving energy at 6B parameters (Yi-1.5-6B: -5.3%). INT8 Default shows the largest overhead (+80% to +162%) due to mixed-precision decomposition. For models ≤3B, FP16 remains the most energy-efficient option when GPU memory allows; at 6B, only plain NF4 beats it.
T4: NVIDIA Tesla T4 (Turing, 16GB)
Batch size = 1, seq_len = 512

| Model | Params | Precision | Energy (J/1k tok) | Throughput (tok/s) | Power (W) | vs FP16 |
|---|---|---|---|---|---|---|
| TinyLlama-1.1B | 1.1B | FP16 | 4,251 | 12.34 | 52.5 | baseline |
| TinyLlama-1.1B | 1.1B | NF4 | 4,445 | 10.87 | 48.3 | +4.6% |
| Qwen2-1.5B | 1.5B | FP16 | 5,732 | 9.56 | 54.8 | baseline |
| Qwen2-1.5B | 1.5B | NF4 | 5,744 | 8.92 | 51.2 | +0.2% |
| Qwen2.5-3B | 3B | FP16 | 11,268 | 5.23 | 58.9 | baseline |
| Qwen2.5-3B | 3B | NF4 | 11,112 | 5.01 | 55.7 | -1.4% |
| Qwen2-7B | 7B | FP16 | 21,723 | 2.87 | 62.3 | baseline |
| Qwen2-7B | 7B | NF4 | 18,735 | 3.12 | 58.5 | -13.8% |

On T4, NF4 crossover starts around the 3B regime and becomes clearer for larger models (for example, 7B: -13.8%). INT8 rows are not shown in this cross-model table because model-level support is inconsistent on T4; see the 2026-04-18 supplemental page for failed Qwen2.5-3B INT8 runs.
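A800-SXM4-80GB: NVIDIA A800 80GB (Ampere, Large Models)
3 models × 5 configs, batch size = 1, n=10, Feb 26 2026

| Model | Params | Precision | Energy (J/1k tok) | Throughput (tok/s) | Power (W) | PPL | vs FP16 | CV |
|---|---|---|---|---|---|---|---|---|
| Mistral-7B-v0.1 | 7B | FP16 | 4,402 | 37.68 | 169 | 4.668 | baseline | 1.73% |
| Mistral-7B-v0.1 | 7B | INT8 Default | 10,162 | 10.24 | 104 | 4.679 | +130.8% | 0.64% |
| Mistral-7B-v0.1 | 7B | INT8 Pure | 6,528 | 18.54 | 121 | 4.729 | +48.3% | 0.89% |
| Mistral-7B-v0.1 | 7B | NF4 | 4,222 | 27.15 | 116 | 4.781 | -4.1% | 1.60% |
| Mistral-7B-v0.1 | 7B | NF4 DQ | 4,797 | 23.05 | 112 | 4.781 | +9.0% | 1.13% |
| Yi-1.5-9B | 9B | FP16 | 5,445 | 31.16 | 171 | 5.732 | baseline | 0.50% |
| Yi-1.5-9B | 9B | INT8 Default | 11,826 | 7.41 | 88 | 4.943 | +117.2% | 0.48% |
| Yi-1.5-9B | 9B | INT8 Pure | 8,504 | 14.08 | 120 | 5.103 | +56.2% | 1.51% |
| Yi-1.5-9B | 9B | NF4 | 5,363 | 21.84 | 118 | 5.091 | -1.5% | 0.91% |
| Yi-1.5-9B | 9B | NF4 DQ | 6,146 | 17.57 | 109 | 5.092 | +12.9% | 0.51% |
| Qwen2.5-14B | 14B | FP16 | 7,360 | 29.36 | 219 | 4.593 | baseline | 0.78% |
| Qwen2.5-14B | 14B | INT8 Default | 15,266 | 7.47 | 114 | 4.656 | +107.5% | 1.03% |
| Qwen2.5-14B | 14B | INT8 Pure | 9,807 | 13.90 | 137 | 4.793 | +33.2% | 0.58% |
| Qwen2.5-14B | 14B | NF4 | 7,545 | 20.56 | 156 | 4.898 | +2.5% | 1.12% |
| Qwen2.5-14B | 14B | NF4 DQ | 8,756 | 16.89 | 149 | 4.898 | +19.0% | 0.50% |
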
Large Model Findings: NF4 achieves near-FP16 energy for 7B-9B models (-4.1% and -1.5%) with minimal PPL degradation. INT8 Default shows 2-2.3× energy overhead. For the 14B model, NF4 energy is comparable to FP16 (+2.5%).

Quantization Accuracy Trade-offs: Perplexity Analysis

Perplexity (PPL) on WikiText-2 test set. Lower is better. Energy efficiency vs accuracy comparison for INT8, NF4, and FP16 quantization.

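RTX 4090D: WikiText-2 test split (full, max_len=512)
3 models × 5 configs, Feb 25 2026

| Model | Precision | Perplexity | PPL vs FP16 | Energy vs FP16 | Assessment |
|---|---|---|---|---|---|
| TinyLlama-1.1B | FP16 | 14.995 | n/a | n/a | Baseline |
| TinyLlama-1.1B | INT8 Default | 15.168 | +1.15% | +140.6% | Negligible |
| TinyLlama-1.1B | INT8 Pure | 15.365 | +2.47% | +73.2% | Minor |
| TinyLlama-1.1B | NF4 | 15.876 | +5.87% | +34.4% | Moderate |
| TinyLlama-1.1B | NF4 DQ | 15.873 | +5.85% | +58.6% | Moderate |
| Qwen2.5-3B | FP16 | 13.775 | n/a | n/a | Baseline |
| Qwen2.5-3B | INT8 Default | 14.048 | +1.98% | +161.9% | Negligible |
| Qwen2.5-3B | INT8 Pure | 15.880 | +15.28% | +51.1% | Significant |
| Qwen2.5-3B | NF4 | 15.035 | +9.14% | +25.5% | Moderate |
| Qwen2.5-3B | NF4 DQ | 15.048 | +9.24% | +45.3% | Moderate |
| Yi-1.5-6B | FP16 | 9.423 | n/a | n/a | Baseline |
| Yi-1.5-6B | INT8 Default | 9.485 | +0.66% | +80.3% | Negligible |
| Yi-1.5-6B | INT8 Pure | 11.566 | +22.74% | +11.2% | Significant |
| Yi-1.5-6B | NF4 | 9.965 | +5.75% | -5.3% | Moderate |
| Yi-1.5-6B | NF4 DQ | 9.973 | +5.84% | +8.5% | Moderate |
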
Consistent findings across 3 models: INT8 Default preserves accuracy well (+0.66% to +1.98% PPL) but at a significant energy cost. INT8 Pure ranges from minor to severe degradation (+2.5% to +22.7% PPL), worsening as model size grows. NF4 and NF4 DQ cause a moderate PPL increase (+5.75% to +9.24%). For INT8 Default and NF4, the 6B model shows the smallest PPL increase of the three; for INT8 Pure it shows the largest.
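
Perplexity is the exponential of the mean per-token negative log-likelihood (cross-entropy), and the "PPL vs FP16" column is a plain relative change between two perplexity values. The sketch below illustrates both steps. The per-token losses are made-up illustrative numbers, and the helper names are hypothetical rather than taken from the repository.

```python
import math


def perplexity(token_nlls: list[float]) -> float:
    """exp(mean negative log-likelihood in nats) over all evaluated tokens."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def ppl_change_pct(ppl_quant: float, ppl_fp16: float) -> float:
    """Relative perplexity change in percent; positive means worse than FP16."""
    return (ppl_quant - ppl_fp16) / ppl_fp16 * 100.0


# Toy losses with a mean of 2.5 nats, so the result equals exp(2.5).
toy_ppl = perplexity([2.0, 3.0, 2.5, 2.5])

# TinyLlama-1.1B rows from the table: INT8 Default vs FP16.
int8_default_delta = ppl_change_pct(15.168, 14.995)
```

Because perplexity is exponential in the loss, a small change in mean cross-entropy can show up as a several-percent change in PPL.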

Key Findings: Observed Energy-Accuracy Trade-offs

1. INT8 Default: Higher Energy Use vs FP16 in Tested Setup

Default LLM.int8() increases energy by +80% to +162% vs FP16 on RTX 4090D due to mixed-precision decomposition overhead. Throughput drops 68-76%. Accuracy is preserved (+0.7% to +2% PPL), while the energy overhead is substantial.

2. NF4 Crossover at ~3-6B Params (Architecture-Dependent)

Across measured platforms, NF4 crossover appears in the ~3.2B to ~5.2B range rather than a single fixed point. For smaller models (roughly 1.1B-3B), NF4 often increases energy, while mid/large models can begin to benefit depending on GPU architecture and memory subsystem behavior.

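3. INT8 Pure: Lower Energy than INT8 Default, with Accuracy Degradation

Disabling mixed-precision decomposition (threshold=0.0) lowers energy relative to INT8 Default, but on RTX 4090D it still costs +11% to +73% more than FP16, and it raises PPL by +2.5% to +22.7% across 3 models. Larger models suffer more (TinyLlama-1.1B: +2.5%, Qwen2.5-3B: +15.3%, Yi-1.5-6B: +22.7%). The energy saved relative to INT8 Default is therefore paid for with measurable accuracy degradation.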
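
For readers who want to reproduce the five precision configurations, the sketch below shows one plausible mapping onto keyword arguments of `transformers.BitsAndBytesConfig`. The keyword names exist in that class, but the mapping itself is an assumption: this page only states that INT8 Pure uses threshold=0.0, and "NF4 DQ" is read here as NF4 with double quantization. The helper `build_quant_config` is hypothetical.

```python
# Assumed mapping from this page's configuration names to keyword arguments of
# transformers.BitsAndBytesConfig. The FP16 baseline uses no quantization config.
QUANT_KWARGS = {
    "FP16": None,
    # 6.0 is the library's default outlier threshold for LLM.int8().
    "INT8 Default": {"load_in_8bit": True, "llm_int8_threshold": 6.0},
    # threshold=0.0 as stated above for INT8 Pure.
    "INT8 Pure": {"load_in_8bit": True, "llm_int8_threshold": 0.0},
    "NF4": {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"},
    # Assumption: "DQ" means double quantization.
    "NF4 DQ": {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_use_double_quant": True,
    },
}


def build_quant_config(name: str):
    """Return a BitsAndBytesConfig for `name`, or None for the FP16 baseline."""
    kwargs = QUANT_KWARGS[name]
    if kwargs is None:
        return None
    # Imported lazily so the mapping can be inspected without transformers installed.
    from transformers import BitsAndBytesConfig
    return BitsAndBytesConfig(**kwargs)
```

The returned object would typically be passed as `quantization_config=` to `AutoModelForCausalLM.from_pretrained`; the exact loading code used for the benchmark is in the repository scripts and may differ.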

Interactive Model Comparison Tool

Compare energy efficiency and accuracy trade-offs across different models and quantization methods. Select any 2-4 models from our benchmark data and view side-by-side benchmark analysis of their performance characteristics.


Methodology

Full reproducibility details. All scripts available in the repository.

Hardware Platforms

RTX 5090 Blackwell, 32GB GDDR7, 575W TDP
RTX 4090D Ada Lovelace, 24GB GDDR6X, 425W TDP
T4 Turing, 16GB GDDR6, 70W TDP
A800 Ampere, 80GB HBM2e, 300W TDP

Measurement Protocol

  • Power sampling: NVML at 10 Hz
  • Repetitions: n=10 per configuration
  • Coefficient of variation: < 3%
  • Warmup: 3 runs discarded before measurement
  • Perplexity: WikiText-2 test split, cross-entropy loss
  • Software: PyTorch 2.4+, bitsandbytes, transformers
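
The protocol above can be sketched in a few lines of Python: integrate the 10 Hz power samples into joules, normalize by the number of generated tokens, and check the coefficient of variation across repeated runs. This is a minimal sketch under stated assumptions, not the repository's implementation. Trapezoidal integration is one reasonable choice (the scripts may instead use mean power times duration), the function names are hypothetical, and the data is illustrative. With the pynvml bindings, `nvmlDeviceGetPowerUsage` reports milliwatts, so real samples would be divided by 1000 first.

```python
import statistics


def energy_joules(power_w: list[float], sample_hz: float = 10.0) -> float:
    """Trapezoidal integration of evenly spaced power samples (watts) into joules."""
    dt = 1.0 / sample_hz
    return sum((a + b) / 2.0 * dt for a, b in zip(power_w, power_w[1:]))


def joules_per_1k_tokens(power_w: list[float], tokens_generated: int,
                         sample_hz: float = 10.0) -> float:
    """Energy of one run normalized to 1,000 generated tokens."""
    return energy_joules(power_w, sample_hz) / tokens_generated * 1000.0


def coefficient_of_variation_pct(values: list[float]) -> float:
    """Sample standard deviation divided by the mean, in percent."""
    return statistics.stdev(values) / statistics.mean(values) * 100.0


# Illustrative data only: a constant 100 W trace over 2 s (21 samples at 10 Hz).
trace = [100.0] * 21
run_energy = energy_joules(trace)         # 20 intervals of 0.1 s at 100 W
per_1k = joules_per_1k_tokens(trace, 50)  # 50 tokens generated in this toy run

# Hypothetical per-run energies for one configuration (the protocol uses n=10).
cv = coefficient_of_variation_pct([1800.0, 1810.0, 1795.0])
```

A configuration would pass the protocol's stability criterion when `cv` stays below 3.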

Community Discussions & Documentation Contributions

โš™๏ธ Discussion With MLCommons / MLPerf Inference
โ— Active Discussion

Adding Energy Consumption Metrics to MLPerf Inference Benchmark

Active collaboration with MLCommons Power WG on energy metrics schema design, including 6 rounds of technical discussion and schema evolution from 7 fields to 16 fields. This work focuses on improving how inference energy evidence is represented for reproducible benchmark reporting.

360+ Configurations Tested · 4 GPU Architectures · 10 Hz NVML Sampling · 6 Tech Rounds

  • Ongoing schema co-design discussion with MLCommons Power WG (active discussion state)
  • Reporting scope expanded from 7 to 16 schema fields during iterative technical review
  • Focus: reproducible energy evidence alongside throughput/latency benchmark outputs

View Discussion #2558

Merged Into Hugging Face Optimum Docs
Status: PR Accepted & Merged

Energy Efficiency in Practice - Official Quantization Documentation

Selected benchmarking findings were integrated into Hugging Face Optimum documentation as practitioner-facing guidance for quantization-related energy considerations.

39 Models Benchmarked · 4 GPU Platforms · 3 Quant Methods · 2 HW Generations

  • Large models (>5B): NF4 achieves near-FP16 energy with major memory savings
  • Small models (<3B): NF4 can increase energy by 25-56%
  • Batch size optimization reduces per-token energy by up to 96%

Read in HF Docs

Research-to-practice path: the open benchmark findings are under discussion with a standards body and are already reflected in deployment documentation.

Cite This Benchmark

@misc{zhang2026llmenergy,
  author    = {Zhang, Hongping},
  title     = {LLM Energy Benchmark: Real GPU Power Measurements for Quantized Inference},
  year      = {2026},
  doi       = {10.5281/zenodo.18900289},
  url       = {https://doi.org/10.5281/zenodo.18900289},
  note      = {NVML power monitoring, 4 GPU architectures, 360+ configurations, includes perplexity data}
}