Real GPU power measurements for INT8, NF4, and FP16 inference across 3 NVIDIA architectures. Includes perplexity (accuracy) data. All raw data, scripts, and methodology are open source.
Energy per 1,000 generated tokens (Joules). Lower is better. Sorted by energy efficiency.
| Model | Params | Precision | Energy (J/1k tok) | Throughput (tok/s) | Power (W) | vs FP16 |
|---|---|---|---|---|---|---|
| TinyLlama-1.1B | 1.1B | FP16 | 1,659 | 94.87 | 157.5 | baseline |
| TinyLlama-1.1B | 1.1B | NF4 | 2,098 | 55.79 | 117.0 | +26.5% |
| Qwen2-1.5B | 1.5B | FP16 | 2,411 | 71.45 | 172.3 | baseline |
| Qwen2-1.5B | 1.5B | NF4 | 3,120 | 41.57 | 129.8 | +29.4% |
| Qwen2.5-3B | 3B | FP16 | 3,383 | 54.77 | 185.6 | baseline |
| Qwen2.5-3B | 3B | NF4 | 3,780 | 31.85 | 120.5 | +11.7% |
| Qwen2-7B | 7B | FP16 | 5,509 | 70.47 | 388.3 | baseline |
| Qwen2-7B | 7B | NF4 | 4,878 | 41.40 | 201.9 | -11.4% |
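The energy metric follows directly from the other two columns: average power divided by token throughput gives joules per token, scaled by 1,000. A minimal sketch of the arithmetic, using values from the TinyLlama-1.1B FP16 row above (small differences vs the table come from rounding of the raw measurements):

```python
def joules_per_1k_tokens(power_w: float, throughput_tok_s: float) -> float:
    """Energy per 1,000 generated tokens: W / (tok/s) = J/tok, scaled by 1,000."""
    return power_w / throughput_tok_s * 1000.0

# TinyLlama-1.1B FP16: 157.5 W at 94.87 tok/s -> ~1,660 J/1k tok (table: 1,659)
print(round(joules_per_1k_tokens(157.5, 94.87)))  # 1660
```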
| Model | Params | Precision | Energy (J/1k tok) | INT8 Config | vs FP16 |
|---|---|---|---|---|---|
| Yi-1.5-6B | 6B | FP16 | 4,716 | - | baseline |
| Yi-1.5-6B | 6B | INT8 | 6,258 | threshold=6.0 (default) | +32.7% |
| Mistral-7B | 7B | FP16 | 5,661 | - | baseline |
| Mistral-7B | 7B | INT8 | 7,401 | threshold=6.0 (default) | +30.7% |
| Phi-3-mini | 3.8B | FP16 | 3,003 | - | baseline |
| Phi-3-mini | 3.8B | INT8 | 3,940 | threshold=6.0 (default) | +31.2% |
| Qwen2.5-7B | 7B | FP16 | 5,217 | - | baseline |
| Qwen2.5-7B | 7B | INT8 | 6,127 | threshold=6.0 (default) | +17.4% |
Default LLM.int8() (llm_int8_threshold=6.0) consistently increases energy by 17-33% due to mixed-precision decomposition overhead. This is the cost of preserving accuracy.
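For reference, the two quantization modes benchmarked here correspond to the following Hugging Face `transformers` / `bitsandbytes` configurations (a sketch: the model name is illustrative, and actually loading requires a CUDA GPU with `bitsandbytes` installed):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# LLM.int8() with default mixed-precision decomposition: activation
# outliers above the threshold are processed in FP16, the rest in INT8.
int8_cfg = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0)

# NF4 4-bit quantization with FP16 compute for the matmuls.
nf4_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B",  # illustrative; any causal LM on the Hub works
    quantization_config=int8_cfg,
    device_map="auto",
)
```

Setting `llm_int8_threshold=0.0` disables the FP16 outlier path entirely, which is the "INT8 Pure" configuration in the perplexity table below.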
| Model | Params | Precision | Energy (J/1k tok) | Throughput (tok/s) | Power (W) | vs FP16 |
|---|---|---|---|---|---|---|
| TinyLlama-1.1B | 1.1B | FP16 | 4,251 | 12.34 | 52.5 | baseline |
| TinyLlama-1.1B | 1.1B | NF4 | 4,445 | 10.87 | 48.3 | +4.6% |
| Qwen2-1.5B | 1.5B | FP16 | 5,732 | 9.56 | 54.8 | baseline |
| Qwen2-1.5B | 1.5B | NF4 | 5,744 | 8.92 | 51.2 | +0.2% |
| Qwen2.5-3B | 3B | FP16 | 11,268 | 5.23 | 58.9 | baseline |
| Qwen2.5-3B | 3B | NF4 | 11,112 | 5.01 | 55.7 | -1.4% |
| Qwen2-7B | 7B | FP16 | 21,723 | 2.87 | 62.3 | baseline |
| Qwen2-7B | 7B | NF4 | 18,735 | 3.12 | 58.5 | -13.8% |
Perplexity (PPL) on WikiText-2 test set. Lower is better. Measures accuracy cost of quantization.
| Configuration | Precision | INT8 Threshold | Perplexity | PPL vs FP16 | Energy vs FP16 | Memory (GB) | Assessment |
|---|---|---|---|---|---|---|---|
| FP16 (baseline) | FP16 | - | 11.16 | - | - | 12.12 | Baseline |
| INT8 Default | INT8 | 6.0 | 11.20 | +0.33% | +32.7% | 6.70 | Negligible loss |
| INT8 Pure | INT8 | 0.0 | 14.00 | +25.38% | -3.1% | 6.70 | Significant loss |
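Perplexity is the exponential of the length-weighted mean per-token negative log-likelihood over the test set. A minimal sketch of the reduction (the NLL values below are illustrative, not measured):

```python
import math

def perplexity(chunk_nlls, chunk_token_counts):
    """exp of the mean per-token NLL (in nats).
    chunk_nlls[i] is the summed NLL of chunk i over chunk_token_counts[i] tokens."""
    return math.exp(sum(chunk_nlls) / sum(chunk_token_counts))

# illustrative: two 512-token chunks with summed NLLs of 1234 and 1180 nats
print(perplexity([1234.0, 1180.0], [512, 512]))
```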
Default LLM.int8() increases energy consumption by 17-33% vs FP16 on consumer GPUs due to mixed-precision decomposition overhead. This is the cost of preserving accuracy.
NF4 quantization saves energy only for models with ≥5B parameters. For smaller models, the dequantization compute cost exceeds the memory bandwidth savings, increasing energy by 5-29%.
Disabling mixed-precision decomposition (threshold=0.0) causes +25% perplexity degradation while saving only 3% energy. The trade-off is not justified.
Full reproducibility details. All scripts available in the repository.
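The energy figures are integrals of sampled GPU power over the generation window. A minimal sketch of the trapezoidal integration, assuming `(timestamp_s, watts)` samples such as those produced by an NVML polling loop (the sample values below are illustrative):

```python
def integrate_energy(samples):
    """Trapezoidal integration of (t_seconds, watts) samples -> joules."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules

# 2 seconds at a constant 150 W -> 300 J
samples = [(0.0, 150.0), (1.0, 150.0), (2.0, 150.0)]
print(integrate_energy(samples))  # 300.0
```

Dividing the integrated joules by (tokens generated / 1,000) yields the J/1k-token figures in the tables above.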
| GPU | Architecture & Specs |
|---|---|
| RTX 5090 | Blackwell, 32GB GDDR7, 575W TDP |
| RTX 4090D | Ada Lovelace, 24GB GDDR6X, 425W TDP |
| T4 | Turing, 16GB GDDR6, 70W TDP |
| A800 | Ampere, 80GB HBM2e, 300W TDP |
Documentation PR adding energy-accuracy trade-off analysis to official bitsandbytes docs.
Discussion with the bitsandbytes core author on the energy implications of mixed-precision decomposition.
```bibtex
@misc{zhang2026llmenergy,
  author = {Zhang, Hongping},
  title  = {LLM Energy Benchmark: Real GPU Power Measurements for Quantized Inference},
  year   = {2026},
  url    = {https://github.com/hongping-zh/ecocompute-ai},
  note   = {NVML power monitoring, 3 GPU architectures, includes perplexity data}
}
```