EcoCompute benchmarks AI models across accuracy, cost, carbon emissions, and energy efficiency — with real hardware measurements, not just estimates.
Comprehensive GPU power measurements reveal surprising insights about quantized LLM inference. Our research answers the critical question: why does bitsandbytes' default INT8 increase energy consumption by 17-33% (avg 31.7%) compared to FP16?
Through rigorous benchmarking across RTX 5090, RTX 4090D, and A800 GPUs, we discovered the root cause: mixed-precision decomposition (INT8↔FP16 conversion overhead), not INT8 compute itself. Disabling this pathway (threshold=0.0) recovers +79% throughput and -36% energy. Meanwhile, NF4 quantization saves energy (-8% to -35%) for models ≥6B but costs +12-29% more for smaller models.
Key Finding: Pure INT8 (without decomposition) is 3-8% more efficient than FP16. This dataset provides 93+ measurements with complete metadata (n=10 per config, CV<2%), enabling practitioners to make informed decisions about LLM deployment strategies.
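For reference, the decomposition pathway can be toggled through Hugging Face Transformers' `BitsAndBytesConfig`: the `llm_int8_threshold` parameter controls when outlier activations fall back to FP16, and setting it to `0.0` disables the mixed-precision pathway entirely. A minimal sketch (note that pure INT8 may cost some accuracy on models with large outlier features):

```python
from transformers import BitsAndBytesConfig

# Default INT8: activations above the threshold (default 6.0) are
# decomposed into FP16 matmuls — the source of the measured overhead.
default_cfg = BitsAndBytesConfig(load_in_8bit=True)

# Pure INT8: disable mixed-precision decomposition (threshold=0.0).
pure_int8_cfg = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=0.0)
```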
🏆 Recognized by Industry Leaders: Our findings have been merged into Hugging Face Optimum official documentation and are actively discussed in MLCommons/MLPerf Inference #2558 for adding energy metrics to the industry-standard AI benchmark.
Sortable, filterable benchmark data across all evaluated models. Click any column header to sort.
| # | Model | Source | Accuracy | Exec Time (s) | Cost ($/1k) | Carbon (gCO₂) | Efficiency (T/W) | Score |
|---|---|---|---|---|---|---|---|---|
Composite score computed from weighted, normalized metrics. Adjust weights to reflect your priorities.
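As an illustration, a composite score of this kind can be built by min-max normalizing each column (flipping lower-is-better metrics) and taking a weighted sum. The metric names and weights below are placeholders, not EcoCompute's actual defaults:

```python
def normalize(values, higher_is_better=True):
    """Min-max scale a metric column to [0, 1], flipping if lower is better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]

def composite_score(models, weights):
    """models: {name: {metric: value}}; weights: {metric: (weight, higher_is_better)}."""
    names = list(models)
    norm = {}
    for metric, (_, hib) in weights.items():
        col = normalize([models[n][metric] for n in names], hib)
        norm[metric] = dict(zip(names, col))
    return {n: sum(w * norm[m][n] for m, (w, _) in weights.items())
            for n in names}
```

Adjusting the weight tuple (e.g. upweighting carbon over cost) reorders the leaderboard without touching the underlying measurements.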
Explore the relationship between accuracy and operational costs. Models in the top-left quadrant offer the best trade-off.
Higher accuracy, lower carbon = better (top-left is ideal)
Higher accuracy, lower cost = better (top-left is ideal)
Counterintuitive findings from real hardware measurements.
Measured via NVML 10Hz power sampling. On the RTX 5090, small models (1.5B parameters) show an unexpected efficiency inversion — NF4 quantization actually consumes more energy per token than full FP16 inference.
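The sampling approach can be sketched with the `pynvml` bindings: poll instantaneous power draw at 10 Hz, then integrate the samples to get joules. The NVML call is real (`nvmlDeviceGetPowerUsage` reports milliwatts), but this loop is an illustrative sketch, not EcoCompute's actual harness:

```python
import time

def sample_power(handle, duration_s, hz=10):
    """Poll GPU power draw at `hz` via NVML (requires pynvml and an NVIDIA GPU)."""
    import pynvml  # deferred import so energy_joules() works without a GPU
    samples, interval = [], 1.0 / hz
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        t = time.monotonic()
        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML returns mW
        samples.append((t, watts))
        time.sleep(max(0.0, interval - (time.monotonic() - t)))
    return samples

def energy_joules(samples):
    """Trapezoidal integration of (timestamp, watts) samples into joules."""
    return sum((t1 - t0) * (p0 + p1) / 2.0
               for (t0, p0), (t1, p1) in zip(samples, samples[1:]))
```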
On A800 + Mistral-7B INT8, increasing batch size from 1 to 64 reduces energy per request by 95.7% (1,768 J → 76 J) while pushing GPU utilization to 91%. For throughput-oriented workloads, batch size optimization dwarfs quantization gains.
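The accounting behind that figure is simple amortization: the whole batch draws power for one forward pass, so per-request energy falls roughly with batch size. A minimal sketch (the power and latency inputs below are hypothetical; only the 1,768 J → 76 J endpoints come from the measurement above):

```python
def energy_per_request(avg_power_w, batch_latency_s, batch_size):
    """Joules attributed to each request when a batch is processed together."""
    return avg_power_w * batch_latency_s / batch_size

# Reduction implied by the measured endpoints (1,768 J -> 76 J):
reduction = 1.0 - 76.0 / 1768.0  # ~= 0.957, i.e. 95.7%
```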
Industry-specific cost and carbon analysis starting points.
Enterprise AI infrastructure selection: 8×H100 cluster, 5M tokens/day, full-spectrum cost & carbon comparison.
Large-scale GPU cluster carbon assessment: 16×A100 running 24/7, carbon tax penalty modeling & quota cost analysis.
Freelance developer AI tool cost analysis: Gemini Flash vs GPT-4o-mini, monthly net cost & ROI evaluation.
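All three scenarios reduce to the same back-of-envelope model: tokens × energy per token → kWh → dollars and grams of CO₂. A minimal sketch, where every input value is a placeholder rather than a measured figure:

```python
def monthly_cost_and_carbon(tokens_per_day, joules_per_token,
                            price_per_kwh, grid_gco2_per_kwh, days=30):
    """Convert a token budget into monthly energy cost ($) and carbon (gCO2)."""
    kwh = tokens_per_day * days * joules_per_token / 3.6e6  # 1 kWh = 3.6 MJ
    return kwh * price_per_kwh, kwh * grid_gco2_per_kwh

# e.g. 5M tokens/day at a hypothetical 1 J/token on a 400 gCO2/kWh grid
cost, carbon = monthly_cost_and_carbon(5_000_000, 1.0, 0.12, 400.0)
```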
How we measure, estimate, and score AI model efficiency.
EcoCompute is an open-source project that brings transparency to the real cost — financial and environmental — of running AI models. We believe that informed infrastructure decisions start with honest, reproducible benchmarks. Our data combines direct hardware measurements with clearly labeled estimates, so you always know what you're looking at.
Our energy efficiency research has been accepted into Hugging Face's official Optimum documentation, and we have an active proposal at MLCommons to add energy metrics to the MLPerf Inference Benchmark. The full dataset is published on the Hugging Face Hub and archived on Zenodo.
Contributions, corrections, and new benchmark submissions are welcome.
View on GitHub