Open Data · Reproducible · NVML 10 Hz Sampling · Last updated: Feb 2026

How Much Energy Does LLM Quantization Actually Cost?

Real GPU power measurements for INT8, NF4, and FP16 inference across three NVIDIA architectures, with perplexity (accuracy) data. All raw data, scripts, and methodology are open source.

  • 3 GPU architectures (Blackwell / Ada / Turing)
  • 8 models tested (1.1B to 7B parameters)
  • 25+ configurations (FP16 / INT8 / NF4)
  • n = 10 runs per configuration (CV < 3%)

Energy Leaderboard

Energy per 1,000 generated tokens (Joules). Lower is better. Sorted by energy efficiency.
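The J/1k-tok figures come from integrating the sampled power trace over the generation window. A minimal sketch of that computation, assuming a list of (timestamp, watts) samples such as a 10 Hz NVML trace (the trace values below are hypothetical):

```python
def energy_per_1k_tokens(samples, tokens_generated):
    """Integrate a power trace with the trapezoidal rule, normalize to J/1k tokens.

    samples: list of (timestamp_s, power_w) pairs, e.g. from NVML at 10 Hz.
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += (p0 + p1) / 2.0 * (t1 - t0)  # trapezoid area = average power * dt
    return joules / tokens_generated * 1000.0

# Hypothetical trace: 10 s at a constant 157.5 W while generating 949 tokens
trace = [(i * 0.1, 157.5) for i in range(101)]
print(f"{energy_per_1k_tokens(trace, 949):.0f} J/1k tok")
```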

NVIDIA GeForce RTX 5090 (Blackwell, 32 GB)
Batch size = 1, seq_len = 512

Model            Params  Precision  Energy (J/1k tok)  Throughput (tok/s)  Power (W)  vs FP16
TinyLlama-1.1B   1.1B    FP16       1,659              94.87               157.5      baseline
TinyLlama-1.1B   1.1B    NF4        2,098              55.79               117.0      +26.5%
Qwen2-1.5B       1.5B    FP16       2,411              71.45               172.3      baseline
Qwen2-1.5B       1.5B    NF4        3,120              41.57               129.8      +29.4%
Qwen2.5-3B       3B      FP16       3,383              54.77               185.6      baseline
Qwen2.5-3B       3B      NF4        3,780              31.85               120.5      +11.7%
Qwen2-7B         7B      FP16       5,509              70.47               388.3      baseline
Qwen2-7B         7B      NF4        4,878              41.40               201.9      -11.4%

NF4 saves energy only for 7B+ models on the RTX 5090. For smaller models, NF4 increases energy by 12-29%.
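The energy column is determined by the other two measured quantities: at steady state, energy per token is just average power divided by throughput. A quick sanity check against a few rows of the RTX 5090 data:

```python
# E [J/1k tok] = 1000 * P [W] / throughput [tok/s]
rows = [
    # (model, precision, power_w, tok_per_s, reported_j_per_1k)
    ("TinyLlama-1.1B", "FP16", 157.5, 94.87, 1659),
    ("Qwen2-7B",       "FP16", 388.3, 70.47, 5509),
    ("Qwen2-7B",       "NF4",  201.9, 41.40, 4878),
]
for model, prec, p, tps, reported in rows:
    derived = 1000.0 * p / tps
    # Agreement within rounding of the published figures (< 0.2%)
    assert abs(derived - reported) / reported < 0.002
    print(f"{model} {prec}: derived {derived:.0f} J/1k tok vs reported {reported}")
```

This is also why NF4 can draw less power yet cost more energy: if throughput falls faster than power does, joules per token go up.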
NVIDIA GeForce RTX 4090D (Ada Lovelace, 24 GB)
INT8 quantization, batch size = 1

Model       Params  Precision  Energy (J/1k tok)  INT8 Config              vs FP16
Yi-1.5-6B   6B      FP16       4,716              -                        baseline
Yi-1.5-6B   6B      INT8       6,258              threshold=6.0 (default)  +32.7%
Mistral-7B  7B      FP16       5,661              -                        baseline
Mistral-7B  7B      INT8       7,401              threshold=6.0 (default)  +30.7%
Phi-3-mini  3.8B    FP16       3,003              -                        baseline
Phi-3-mini  3.8B    INT8       3,940              threshold=6.0 (default)  +31.2%
Qwen2.5-7B  7B      FP16       5,217              -                        baseline
Qwen2.5-7B  7B      INT8       6,127              threshold=6.0 (default)  +17.4%

Default INT8 (bitsandbytes llm_int8_threshold=6.0) consistently increases energy by 17-33% due to mixed-precision decomposition overhead. This is the cost of preserving accuracy.
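The overhead comes from how LLM.int8() splits each matmul: activation columns whose magnitude exceeds the threshold take an FP16 path, while the rest are quantized to INT8. A toy sketch of that decomposition (illustrative only, not the bitsandbytes implementation; all names and data here are made up):

```python
import numpy as np

def int8_decomposed_matmul(x, w, threshold=6.0):
    """Toy LLM.int8()-style mixed-precision matmul.

    Columns of x with any large activation stay in full precision; the
    remainder is absmax-quantized to INT8 (per-row for x, per-column for w).
    """
    outlier = np.abs(x).max(axis=0) > threshold           # outlier feature columns
    y_fp = x[:, outlier] @ w[outlier, :]                  # exact high-precision path
    xs = np.abs(x[:, ~outlier]).max(axis=1, keepdims=True) / 127 + 1e-12
    ws = np.abs(w[~outlier, :]).max(axis=0, keepdims=True) / 127 + 1e-12
    xq = np.round(x[:, ~outlier] / xs).astype(np.int8)
    wq = np.round(w[~outlier, :] / ws).astype(np.int8)
    y_int8 = (xq.astype(np.int32) @ wq.astype(np.int32)) * xs * ws  # dequantize
    return y_fp + y_int8                                  # recombine both paths

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64)); x[:, 0] *= 20.0             # inject one outlier column
w = rng.normal(size=(64, 8))
err = np.abs(int8_decomposed_matmul(x, w) - x @ w).max()
print(f"max abs error vs full precision: {err:.4f}")
```

The extra kernel launches and the scatter/gather between the two paths are the source of the measured energy overhead.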
NVIDIA Tesla T4 (Turing, 16 GB)
Batch size = 1, seq_len = 512

Model            Params  Precision  Energy (J/1k tok)  Throughput (tok/s)  Power (W)  vs FP16
TinyLlama-1.1B   1.1B    FP16       4,251              12.34               52.5       baseline
TinyLlama-1.1B   1.1B    NF4        4,445              10.87               48.3       +4.6%
Qwen2-1.5B       1.5B    FP16       5,732              9.56                54.8       baseline
Qwen2-1.5B       1.5B    NF4        5,744              8.92                51.2       +0.2%
Qwen2.5-3B       3B      FP16       11,268             5.23                58.9       baseline
Qwen2.5-3B       3B      NF4        11,112             5.01                55.7       -1.4%
Qwen2-7B         7B      FP16       21,723             2.87                62.3       baseline
Qwen2-7B         7B      NF4        18,735             3.12                58.5       -13.8%

On the T4, NF4 energy savings begin at ~3B parameters, and the effect is more pronounced for 7B models (-13.8%).

Accuracy Assessment

Perplexity (PPL) on the WikiText-2 test set. Lower is better; this measures the accuracy cost of quantization.

RTX 4090D — Yi-1.5-6B on WikiText-2 (50 samples, max_len=512), Feb 2026

Configuration    Precision  INT8 Threshold  Perplexity  PPL vs FP16  Energy vs FP16  Memory (GB)  Assessment
FP16 (baseline)  FP16       -               11.16       -            -               12.12        Baseline
INT8 Default     INT8       6.0             11.20       +0.33%       +32.7%          6.70         Negligible loss
INT8 Pure        INT8       0.0             14.00       +25.38%      -3.1%           6.70         Significant loss

Default INT8 preserves accuracy (+0.33% PPL) but costs +32.7% more energy. Pure INT8 (threshold=0.0) saves only 3.1% energy while degrading perplexity by 25%. This validates the bitsandbytes default design.
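Perplexity here is the exponentiated mean per-token cross-entropy. A minimal sketch of the final step, assuming per-token negative log-likelihoods have already been collected from the model (the NLL values below are invented for illustration):

```python
import math

def perplexity(token_nlls):
    """PPL = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical per-token NLLs for two configurations; the table reports
# the relative PPL change between them:
ppl_a = perplexity([2.41, 2.39, 2.43])
ppl_b = perplexity([2.42, 2.40, 2.43])
print(f"PPL change: {(ppl_b - ppl_a) / ppl_a * 100:+.2f}%")
```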

Key Findings

1. INT8 Costs More Energy

Default LLM.int8() increases energy consumption by 17-33% vs FP16 on consumer GPUs due to mixed-precision decomposition overhead. This is the cost of preserving accuracy.

2. NF4 Crossover Is Architecture-Dependent

NF4 quantization saves energy only above a model-size crossover point: roughly 3B parameters on the T4 and roughly 7B on the RTX 5090. Below the crossover, the dequantization compute cost exceeds the memory-bandwidth savings, increasing energy by up to 29%.

3. Pure INT8 Harms Accuracy

Disabling mixed-precision decomposition (threshold=0.0) causes +25% perplexity degradation while saving only 3% energy. The trade-off is not justified.

Methodology

Full reproducibility details. All scripts available in the repository.

Hardware Platforms

RTX 5090   Blackwell, 32 GB GDDR7, 575 W TDP
RTX 4090D  Ada Lovelace, 24 GB GDDR6X, 425 W TDP
T4         Turing, 16 GB GDDR6, 70 W TDP
A800       Ampere, 80 GB HBM2e, 300 W TDP

Measurement Protocol

  • Power sampling: NVML at 10 Hz
  • Repetitions: n = 10 per configuration
  • Coefficient of variation: < 3%
  • Warmup: 3 runs discarded before measurement
  • Perplexity: WikiText-2 test split, cross-entropy loss
  • Software: PyTorch 2.4+, bitsandbytes, transformers
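The 10 Hz power sampler can be sketched as a background thread that polls the GPU while inference runs. This sketch injects the power reader so it runs without a GPU; in a real setup the reader would wrap NVML, e.g. pynvml's nvmlDeviceGetPowerUsage (which returns milliwatts). It is an illustration of the protocol above, not the repository's script:

```python
import threading
import time

class PowerSampler:
    """Poll a power reader at a fixed rate in a background thread.

    read_power_mw: callable returning instantaneous power in milliwatts
    (e.g. lambda: pynvml.nvmlDeviceGetPowerUsage(handle) on real hardware).
    """
    def __init__(self, read_power_mw, hz=10):
        self.read_power_mw = read_power_mw
        self.interval = 1.0 / hz
        self.samples = []                      # (timestamp_s, power_w) pairs
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append((time.monotonic(), self.read_power_mw() / 1000.0))
            self._stop.wait(self.interval)     # sleep, but wake early on stop()

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

# Fake reader standing in for NVML: a constant 157.5 W
with PowerSampler(lambda: 157_500, hz=100) as s:
    time.sleep(0.1)                            # run the workload here
print(f"collected {len(s.samples)} samples at {s.samples[0][1]} W")
```

The collected (timestamp, watts) trace is then integrated over the generation window to obtain joules, and normalized by the number of generated tokens.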

Community & Contributions

Cite This Benchmark

@misc{zhang2026llmenergy,
  author    = {Zhang, Hongping},
  title     = {LLM Energy Benchmark: Real GPU Power Measurements for Quantized Inference},
  year      = {2026},
  url       = {https://github.com/hongping-zh/ecocompute-ai},
  note      = {NVML power monitoring, 3 GPU architectures, includes perplexity data}
}