Open LLM Energy Benchmark · 93+ GPU Power Measurements · Adopted by Hugging Face & MLCommons

For: LLM Deployment Engineers / Chip Vendors / AI Energy Efficiency Researchers

The Cost & Carbon Benchmark
for AI Models

EcoCompute benchmarks AI models across accuracy, cost, carbon emissions, and energy efficiency — with real hardware measurements, not just estimates.

39 Models Evaluated · 4 GPU Platforms · 3 Source Types · 20+ Directly Measured

LLM Energy Efficiency Research: Why Default INT8 Increases Power Consumption by 17-33%

Comprehensive GPU power measurements reveal surprising insights about quantized LLM inference. Our research answers the critical question: why does bitsandbytes' default INT8 mode increase energy consumption by 17-33% (avg 31.7%) compared to FP16?

Through rigorous benchmarking across RTX 5090, RTX 4090D, and A800 GPUs, we traced the root cause to mixed-precision decomposition (INT8↔FP16 conversion overhead), not INT8 compute itself. Disabling this pathway (threshold=0.0) restores throughput (+79%) and cuts energy use (-36%). Meanwhile, NF4 quantization saves energy (-8% to -35%) for models ≥6B but costs 12-29% more energy for smaller models.

Key Finding: Pure INT8 (without decomposition) is 3-8% more efficient than FP16. This dataset provides 93+ measurements with complete metadata (n=10 per config, CV<2%), enabling practitioners to make informed decisions about LLM deployment strategies.

🏆 Recognized by Industry Leaders: Our findings have been merged into Hugging Face Optimum official documentation and are actively discussed in MLCommons/MLPerf Inference #2558 for adding energy metrics to the industry-standard AI benchmark.

Topics: LLM energy efficiency · GPU power consumption · INT8 quantization paradox · NF4 vs FP16 comparison · bitsandbytes optimization · sustainable AI

Impact & Recognition

Research Contributions

Researcher Identity & Citation

ORCID: 0009-0000-2529-4613 (verified researcher identity) · View ORCID Profile
DOI: 10.5281/zenodo.18900289 (cite this dataset in your research) · View on Zenodo
Real-time Zenodo Statistics: Views · Downloads · Version v1.0
Academic Impact Metrics
Altmetric Score: research impact across media and policy
Citation Count: academic citations tracked
Proposal to MLCommons / MLPerf Inference · Active Discussion
Adding Energy Consumption Metrics to MLPerf Inference Benchmark
We proposed adding energy efficiency metrics (energy per query/token, average power, tokens/Joule) to the MLPerf Inference Benchmark — the industry-standard AI performance benchmark by MLCommons. Current benchmarks only report throughput and latency, making energy trade-offs invisible to practitioners.
270 Configurations Tested · 3 GPU Architectures · 10 Hz NVML Sampling · 200+ Measurements
Proposed 3 new metrics: Energy/query (J), Avg power (W), Efficiency ratio (tok/J)
Quantization energy trade-offs remain invisible to existing throughput/latency metrics
Invited to collaborate with MLCommons Power Working Group
View Discussion #2558
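Each of the three proposed metrics can be derived from a single power trace plus token and query counts. A minimal sketch in Python, where the trace, counts, and function name are hypothetical illustrations rather than benchmark code:

```python
def energy_metrics(power_w, dt_s, queries, tokens):
    """Integrate a fixed-rate power trace (trapezoidal rule) into the
    three proposed MLPerf metrics: J/query, avg W, and tokens/J."""
    energy_j = sum((a + b) / 2 * dt_s for a, b in zip(power_w, power_w[1:]))
    return {
        "energy_per_query_j": energy_j / queries,
        "avg_power_w": energy_j / (dt_s * (len(power_w) - 1)),
        "tokens_per_joule": tokens / energy_j,
    }

# Synthetic 10 Hz trace (dt = 0.1 s): 21 samples at a constant 300 W
# covers 2 s, i.e. 600 J of energy
m = energy_metrics([300.0] * 21, 0.1, queries=4, tokens=1200)
```

At 10 Hz sampling the trapezoidal sum closely tracks true energy for the slowly varying power curves typical of steady-state inference.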
Merged into Hugging Face Optimum Docs · PR Accepted & Merged
Energy Efficiency in Practice — Official Quantization Documentation
Our systematic benchmarking findings were accepted into Hugging Face's official Optimum library documentation — the canonical reference for model optimization used by millions of ML practitioners worldwide.
39 Models Benchmarked · 4 GPU Platforms · 3 Quant Methods · 2 HW Generations
Large models (≥5B): NF4 achieves near-FP16 energy with major memory savings
Small models (<3B): NF4 can increase energy by 25–56%
Batch size optimization reduces per-token energy by up to 96%
Read in HF Docs

Model Leaderboard

Sortable, filterable benchmark data across all evaluated models. Click any column header to sort.

# | Model | Source | Accuracy | Exec Time (s) | Cost ($/1k) | Carbon (gCO₂) | Efficiency (T/W) | Score

How Scoring Works

Composite score computed from weighted, normalized metrics. Adjust weights to reflect your priorities.

Score = w₁·Accuracy + w₂·(1/Cost) + w₃·(1/Carbon) + w₄·Efficiency + w₅·(1/ExecTime)

Each metric is first min-max normalized to [0, 100]; the (1/·) terms indicate that Cost, Carbon, and ExecTime are inverted so that lower raw values yield higher scores.

Performance vs. Cost Trade-offs

Explore the relationship between accuracy and operational costs. Models in the top-left quadrant offer the best trade-off.

Accuracy vs. Carbon Footprint

Higher accuracy, lower carbon = better (top-left is ideal)

Accuracy vs. Dollar Cost

Higher accuracy, lower cost = better (top-left is ideal)

Key Insights

Counterintuitive findings from real hardware measurements.

Paradox: NF4 is 22.9% LESS Efficient Than FP16 on RTX 5090 for 1.5B Models

Measured via NVML 10 Hz power sampling. On the RTX 5090, small models (1.5B parameters) show an unexpected efficiency inversion: NF4 quantization actually consumes more energy per token than full FP16 inference.

FP16: 415 tok/W · NF4: 320 tok/W

Batch Size Is the Biggest Energy Lever

On A800 + Mistral-7B INT8, increasing batch size from 1 to 64 reduces energy per request by 95.7% (1,768 J → 76 J) while pushing GPU utilization to 91%. For throughput-oriented workloads, batch size optimization dwarfs quantization gains.

BS=1: 1,768 J/req · BS=64: 76 J/req
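The arithmetic behind this lever is plain amortization: one forward pass spends a pool of energy that is split across every request in the batch. A tiny sketch using the figures above (the batch-level energy is a hypothetical back-calculation from the per-request numbers):

```python
def j_per_request(batch_energy_j, batch_size):
    # One batched forward pass spends energy for the whole batch;
    # amortize that energy across its requests.
    return batch_energy_j / batch_size

def reduction_pct(before_j, after_j):
    # Percent energy saved per request after batching.
    return (before_j - after_j) / before_j * 100

# Figures from the A800 + Mistral-7B INT8 result above:
# 1,768 J/req at BS=1 vs 76 J/req at BS=64
saving = reduction_pct(j_per_request(1768.0, 1), j_per_request(76.0 * 64, 64))
```

Because idle and static GPU power is paid per pass rather than per request, the per-request figure drops almost linearly with batch size until the GPU saturates.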

Calculator Templates

Industry-specific cost and carbon analysis starting points.

DeepSeek vs OpenAI Deep Cost Evaluation

Enterprise AI infrastructure selection: 8×H100 cluster, 5M tokens/day, full-spectrum cost & carbon comparison.

DeepSeek-V3 · GPT-4o · H100 · 5M tok/day

Enterprise Carbon Quota Trading Forecast

Large-scale GPU cluster carbon assessment: 16×A100 running 24/7, carbon tax penalty modeling & quota cost analysis.

16×A100 · Carbon Tax · PUE 1.4 · 10M tok/day

Freelancer Net Income Modeling

Freelance developer AI tool cost analysis: Gemini Flash vs GPT-4o-mini, monthly net cost & ROI evaluation.

Gemini 2.0 · GPT-4o-mini · T4 · 100K tok/day

Methodology

How we measure, estimate, and score AI model efficiency.

Direct Measurement

  • Hardware: RTX 5090, RTX 4090D, A800 GPUs via AutoDL Cloud
  • Software: PyTorch 2.4.1+cu121 / 2.10.0+cu128, bitsandbytes for quantization
  • Power Monitoring: NVML (nvidia-smi) at 10 Hz sampling rate
  • Each config runs 10 iterations minimum; coefficient of variation <1%
  • Metrics: wall-clock latency, GPU energy (Joules), tokens generated, GPU utilization

API Estimation

  • Cost sourced from official published API pricing pages (Google, Anthropic, OpenAI)
  • Carbon estimated using average US grid intensity (0.4 kgCO₂/kWh) from IEA data
  • Assumed datacenter PUE of ~1.1 for major cloud providers
  • Inference power draw estimated from public MLPerf benchmarks
  • Not independently verified — marked with "~Estimated" badge
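Under those assumptions, the carbon estimate reduces to one line of arithmetic: energy converted to kWh, inflated by PUE, times grid intensity. A sketch (the per-token energy figure is hypothetical):

```python
def estimate_carbon_g(energy_j, pue=1.1, grid_kg_per_kwh=0.4):
    # Convert Joules to kWh (1 kWh = 3.6 MJ), inflate by datacenter PUE,
    # then apply the average US grid intensity used above (IEA data).
    kwh = energy_j / 3.6e6
    return kwh * pue * grid_kg_per_kwh * 1000  # grams CO2

# e.g. 1M tokens at a hypothetical 0.5 J/token → 500 kJ of GPU energy
carbon_g = estimate_carbon_g(500_000.0)
```

Because both PUE and grid intensity are point estimates, results carry the "~Estimated" badge rather than being treated as measurements.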

Research Literature

  • Performance data from original papers (LLaMA, BERT, ResNet, Transformer)
  • Energy/carbon extrapolated from A100/V100 inference benchmarks in community repos
  • Key references: Strubell et al. 2019, Schwartz et al. 2020, He et al. 2015
  • Values scaled to comparable units for cross-model comparison

Scoring Formula

  • Each metric is min-max normalized to [0, 100] across the dataset
  • Cost, Carbon, and Exec Time are inverted (lower is better → higher normalized score)
  • Default weights: Accuracy 30%, Cost 20%, Carbon 15%, Efficiency 20%, Speed 15%
  • Weights are adjustable via the interactive sliders on the Leaderboard
  • Data verified as of Jan–Feb 2026; API pricing may have changed since
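The steps above can be sketched in a few lines of Python; the two model rows below are hypothetical, and only the default weights come from the methodology:

```python
def minmax(values, invert=False):
    # Min-max normalize to [0, 100]; invert lower-is-better metrics
    # so a cheaper/cleaner/faster model scores higher.
    lo, hi = min(values), max(values)
    if hi == lo:
        return [100.0] * len(values)
    norm = [(v - lo) / (hi - lo) * 100 for v in values]
    return [100 - n for n in norm] if invert else norm

def composite(rows, weights, inverted=("cost", "carbon", "exec_time")):
    # Weighted sum of normalized columns → one score per model row.
    cols = {k: minmax([r[k] for r in rows], invert=k in inverted) for k in weights}
    return [sum(weights[k] * cols[k][i] for k in weights) for i in range(len(rows))]

# Default weights: Accuracy 30%, Cost 20%, Carbon 15%, Efficiency 20%, Speed 15%
w = {"accuracy": 0.30, "cost": 0.20, "carbon": 0.15, "efficiency": 0.20, "exec_time": 0.15}
rows = [
    {"accuracy": 80, "cost": 1.0, "carbon": 10, "efficiency": 400, "exec_time": 2.0},
    {"accuracy": 90, "cost": 2.0, "carbon": 20, "efficiency": 300, "exec_time": 4.0},
]
scores = composite(rows, w)
```

With only two rows, min-max normalization pins each metric to 0 or 100, which makes the weighting easy to check by hand; on the full leaderboard, scores spread continuously across the [0, 100] range.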

About EcoCompute

EcoCompute is an open-source project that brings transparency to the real cost — financial and environmental — of running AI models. We believe that informed infrastructure decisions start with honest, reproducible benchmarks. Our data combines direct hardware measurements with clearly labeled estimates, so you always know what you're looking at.

Our energy efficiency research has been accepted into Hugging Face's official Optimum documentation, and we have an active proposal at MLCommons to add energy metrics to the MLPerf Inference Benchmark. The full dataset is published on the Hugging Face Hub and archived on Zenodo.

Contributions, corrections, and new benchmark submissions are welcome.

View on GitHub