AI evals are becoming the new compute bottleneck
Summary. AI evaluation has crossed a cost threshold that changes who can do it. The Holistic Agent Leaderboard (HAL) recently spent about $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. Exgentic's $22,000 sweep across agent configurations found a 33× cost sp...
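To put the quoted figures in per-unit terms, here is a minimal back-of-the-envelope sketch. It uses only the numbers stated in the summary (about $40,000 for 21,730 rollouts, and $2,829 for a single GAIA run before caching). The function names `cost_per_rollout` and `runs_within_budget` are hypothetical helpers introduced for illustration. Because the $40,000 total is approximate, the results are rough averages and say nothing about any individual run.

```python
# Figures quoted in the summary; the total is approximate ("about $40,000").
HAL_TOTAL_USD = 40_000
HAL_ROLLOUTS = 21_730
GAIA_RUN_USD = 2_829  # single GAIA run on a frontier model, before caching


def cost_per_rollout(total_usd: float, rollouts: int) -> float:
    """Average spend per rollout in USD (hypothetical helper)."""
    if rollouts <= 0:
        raise ValueError("rollouts must be positive")
    return total_usd / rollouts


def runs_within_budget(budget_usd: float, run_cost_usd: float) -> int:
    """Number of whole runs that fit inside a budget (hypothetical helper)."""
    if run_cost_usd <= 0:
        raise ValueError("run_cost_usd must be positive")
    return int(budget_usd // run_cost_usd)


if __name__ == "__main__":
    avg = cost_per_rollout(HAL_TOTAL_USD, HAL_ROLLOUTS)
    print(f"Average cost per HAL rollout: ${avg:.2f}")
    n = runs_within_budget(HAL_TOTAL_USD, GAIA_RUN_USD)
    print(f"Uncached GAIA runs that fit in the HAL budget: {n}")
```

On these numbers the average works out to roughly $1.84 per rollout, and the entire HAL budget would cover only 14 uncached GAIA runs at the quoted price.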
Title: The Evolution of AI Evaluation and Its Implications for Power Dynamics
The article analyzes the challenges of evaluating advanced artificial intelligence (AI) systems. It highlights several issues that bear on power dynamics in the field:
1. **Methodological Shift**: The move from traditional static benchmarks to more complex statistical methods and greater computational resources marks a significant change in the evaluation landscape, r...
