Abstract
Weight-only quantization is widely assumed to reduce the energy cost of large language model inference by lowering memory traffic and improving computational efficiency. This paper presents an empirical study of FP16, INT8, and NF4 inference across multiple NVIDIA GPU platforms and model scales. The results show that quantization does not universally reduce energy consumption. For smaller models and certain hardware-software stacks, quantized execution can reduce active power while increasing runtime enough to raise total energy. These findings highlight an energy crossover effect: quantization becomes beneficial only when model scale, batch size, and backend implementation allow bandwidth savings to dominate overheads.
Key findings
- Weight-only quantization does not always save energy; for small models, FP16 can be more energy-efficient than INT8 or NF4.
- Energy behavior is phase-dependent: prefill and decode can show different power-throughput trade-offs.
- The efficiency of the quantized execution path, which varies by backend and GPU architecture, can determine whether theoretical bandwidth savings translate into real energy savings.
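
The crossover described above follows from treating total energy as average active power multiplied by runtime: a quantized run saves energy only if its slowdown is smaller than its power reduction. The sketch below illustrates that arithmetic. The function names and all numeric values are hypothetical and chosen for illustration; they are not measurements from this study.

```python
def energy_joules(avg_power_w: float, runtime_s: float) -> float:
    """Total energy, approximated as average active power times runtime."""
    return avg_power_w * runtime_s


def breakeven_slowdown(fp16_power_w: float, quant_power_w: float) -> float:
    """Runtime ratio (quantized / FP16) at which both runs use equal energy.

    A quantized run that is slower than this ratio consumes more energy
    than FP16, even though it draws less power.
    """
    return fp16_power_w / quant_power_w


# Hypothetical values for illustration only (not results from the paper).
fp16_energy = energy_joules(avg_power_w=300.0, runtime_s=10.0)
nf4_energy = energy_joules(avg_power_w=240.0, runtime_s=14.0)
limit = breakeven_slowdown(fp16_power_w=300.0, quant_power_w=240.0)

print(f"FP16 energy: {fp16_energy:.0f} J")
print(f"NF4 energy:  {nf4_energy:.0f} J")
print(f"Break-even slowdown: {limit:.2f}x, assumed slowdown: {14.0 / 10.0:.2f}x")
```

With these assumed values the quantized run draws 20% less power but takes 40% longer, which exceeds the 1.25x break-even slowdown, so its total energy is higher than that of FP16.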