Real GPU power measurements for INT8, NF4, and FP16 inference across 3 NVIDIA architectures. Includes perplexity (accuracy) data. All raw data, scripts, and methodology are open source.
Energy per 1,000 generated tokens (Joules). Lower is better. Sorted by energy efficiency.
| Model | Params | Precision | Energy (J/1k tok) | Throughput (tok/s) | Power (W) | vs FP16 |
|---|---|---|---|---|---|---|
| TinyLlama-1.1B | 1.1B | FP16 | 1,659 | 94.87 | 157.5 | baseline |
| TinyLlama-1.1B | 1.1B | NF4 | 2,098 | 55.79 | 117.0 | +26.5% |
| Qwen2-1.5B | 1.5B | FP16 | 2,411 | 71.45 | 172.3 | baseline |
| Qwen2-1.5B | 1.5B | NF4 | 3,120 | 41.57 | 129.8 | +29.4% |
| Qwen2.5-3B | 3B | FP16 | 3,383 | 54.77 | 185.6 | baseline |
| Qwen2.5-3B | 3B | NF4 | 3,780 | 31.85 | 120.5 | +11.7% |
| Qwen2-7B | 7B | FP16 | 5,509 | 70.47 | 388.3 | baseline |
| Qwen2-7B | 7B | NF4 | 4,878 | 41.40 | 201.9 | -11.4% |
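The energy metric follows directly from the other two columns: average power divided by token throughput gives joules per token, scaled by 1,000. A minimal sketch of the arithmetic, using values from the TinyLlama-1.1B FP16 row above (small differences vs the table come from rounding of the raw measurements):

```python
def joules_per_1k_tokens(power_w: float, throughput_tok_s: float) -> float:
    """Energy per 1,000 generated tokens: W / (tok/s) = J/tok, scaled by 1,000."""
    return power_w / throughput_tok_s * 1000.0

# TinyLlama-1.1B FP16: 157.5 W at 94.87 tok/s -> ~1,660 J/1k tok (table: 1,659)
print(round(joules_per_1k_tokens(157.5, 94.87)))  # 1660
```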
| Model | Params | Precision | Energy (J/1k tok) | INT8 Config | vs FP16 |
|---|---|---|---|---|---|
| Yi-1.5-6B | 6B | FP16 | 4,716 | - | baseline |
| Yi-1.5-6B | 6B | INT8 | 6,258 | threshold=6.0 (default) | +32.7% |
| Mistral-7B | 7B | FP16 | 5,661 | - | baseline |
| Mistral-7B | 7B | INT8 | 7,401 | threshold=6.0 (default) | +30.7% |
| Phi-3-mini | 3.8B | FP16 | 3,003 | - | baseline |
| Phi-3-mini | 3.8B | INT8 | 3,940 | threshold=6.0 (default) | +31.2% |
| Qwen2.5-7B | 7B | FP16 | 5,217 | - | baseline |
| Qwen2.5-7B | 7B | INT8 | 6,127 | threshold=6.0 (default) | +17.4% |
Default LLM.int8() (llm_int8_threshold=6.0) consistently increases energy by 17-33% due to mixed-precision decomposition overhead. This is the cost of preserving accuracy.
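For reference, the two quantization modes benchmarked here correspond to the following Hugging Face `transformers` / `bitsandbytes` configurations (a sketch: the model name is illustrative, and actually loading requires a CUDA GPU with `bitsandbytes` installed):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# LLM.int8() with default mixed-precision decomposition: activation
# outliers above the threshold are processed in FP16, the rest in INT8.
int8_cfg = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0)

# NF4 4-bit quantization with FP16 compute for the matmuls.
nf4_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B",  # illustrative; any causal LM on the Hub works
    quantization_config=int8_cfg,
    device_map="auto",
)
```

Setting `llm_int8_threshold=0.0` disables the FP16 outlier path entirely, which is the "INT8 Pure" configuration in the perplexity table below.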
| Model | Params | Precision | Energy (J/1k tok) | Throughput (tok/s) | Power (W) | vs FP16 |
|---|---|---|---|---|---|---|
| TinyLlama-1.1B | 1.1B | FP16 | 4,251 | 12.34 | 52.5 | baseline |
| TinyLlama-1.1B | 1.1B | NF4 | 4,445 | 10.87 | 48.3 | +4.6% |
| Qwen2-1.5B | 1.5B | FP16 | 5,732 | 9.56 | 54.8 | baseline |
| Qwen2-1.5B | 1.5B | NF4 | 5,744 | 8.92 | 51.2 | +0.2% |
| Qwen2.5-3B | 3B | FP16 | 11,268 | 5.23 | 58.9 | baseline |
| Qwen2.5-3B | 3B | NF4 | 11,112 | 5.01 | 55.7 | -1.4% |
| Qwen2-7B | 7B | FP16 | 21,723 | 2.87 | 62.3 | baseline |
| Qwen2-7B | 7B | NF4 | 18,735 | 3.12 | 58.5 | -13.8% |
Perplexity (PPL) on WikiText-2 test set. Lower is better. Measures accuracy cost of quantization.
| Configuration | Precision | INT8 Threshold | Perplexity | PPL vs FP16 | Energy vs FP16 | Memory (GB) | Assessment |
|---|---|---|---|---|---|---|---|
| FP16 (baseline) | FP16 | - | 11.16 | - | - | 12.12 | Baseline |
| INT8 Default | INT8 | 6.0 | 11.20 | +0.33% | +32.7% | 6.70 | Negligible loss |
| INT8 Pure | INT8 | 0.0 | 14.00 | +25.38% | -3.1% | 6.70 | Significant loss |
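Perplexity is the exponential of the length-weighted mean per-token negative log-likelihood over the test set. A minimal sketch of the reduction (the NLL values below are illustrative, not measured):

```python
import math

def perplexity(chunk_nlls, chunk_token_counts):
    """exp of the mean per-token NLL (in nats).
    chunk_nlls[i] is the summed NLL of chunk i over chunk_token_counts[i] tokens."""
    return math.exp(sum(chunk_nlls) / sum(chunk_token_counts))

# illustrative: two 512-token chunks with summed NLLs of 1234 and 1180 nats
print(perplexity([1234.0, 1180.0], [512, 512]))
```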
Default LLM.int8() increases energy consumption by 17-33% vs FP16 on consumer GPUs due to mixed-precision decomposition overhead. This is the cost of preserving accuracy.
NF4 quantization saves energy only for models with ≥5B parameters. For smaller models, the dequantization compute cost exceeds the memory bandwidth savings, increasing energy by 5-29%.
Disabling mixed-precision decomposition (threshold=0.0) causes +25% perplexity degradation while saving only 3% energy. The trade-off is not justified.
Full reproducibility details. All scripts available in the repository.
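The energy figures are integrals of sampled GPU power over the generation window. A minimal sketch of the trapezoidal integration, assuming `(timestamp_s, watts)` samples such as those produced by an NVML polling loop (the sample values below are illustrative):

```python
def integrate_energy(samples):
    """Trapezoidal integration of (t_seconds, watts) samples -> joules."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules

# 2 seconds at a constant 150 W -> 300 J
samples = [(0.0, 150.0), (1.0, 150.0), (2.0, 150.0)]
print(integrate_energy(samples))  # 300.0
```

Dividing the integrated joules by (tokens generated / 1,000) yields the J/1k-token figures in the tables above.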
| GPU | Architecture & Specs |
|---|---|
| RTX 5090 | Blackwell, 32GB GDDR7, 575W TDP |
| RTX 4090D | Ada Lovelace, 24GB GDDR6X, 425W TDP |
| T4 | Turing, 16GB GDDR6, 70W TDP |
| A800 | Ampere, 80GB HBM2e, 300W TDP |
Documentation PR adding energy-accuracy trade-off analysis to official bitsandbytes docs.
Discussion with the bitsandbytes core author on the energy implications of mixed-precision decomposition.
```bibtex
@misc{zhang2026llmenergy,
  author = {Zhang, Hongping},
  title  = {LLM Energy Benchmark: Real GPU Power Measurements for Quantized Inference},
  year   = {2026},
  url    = {https://github.com/hongping-zh/ecocompute-ai},
  note   = {NVML power monitoring, 3 GPU architectures, includes perplexity data}
}
```