Aakash Gupta argues that eval programs fail when teams run evals only at the end, keep only evals that pass, or silo them to engineers. He cites Braintrust, which built its eval first, before any model could pass it, so every model failed until improvements landed.
The three mistakes that kill eval adoption on AI teams:
1. Running evals only at the end.
2. Keeping only evals that pass.
3. Siloing evals to engineers.
Braintrust shipped their agent product Loop by building the eval before any model could pass it.
Every model failed.
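The eval-first approach described above can be sketched as a tiny harness: write the test cases and scorer before any model can pass them, then measure the pass rate as improvements land. This is a minimal illustration, not Braintrust's actual tooling; every name below (`exact_match`, `run_eval`, `baseline_model`, the cases) is hypothetical.

```python
# Minimal sketch of an "eval-first" harness (all names hypothetical):
# define the eval before any model passes it, then track pass rates.

def exact_match(expected: str, actual: str) -> bool:
    """Score a single case: True if the output matches exactly."""
    return expected.strip() == actual.strip()

def run_eval(model, cases):
    """Run every (prompt, expected) case through `model`; return pass rate."""
    passed = sum(exact_match(exp, model(prompt)) for prompt, exp in cases)
    return passed / len(cases)

# A deliberately hard eval: no current model is expected to pass it yet.
CASES = [
    ("Refactor this function to be pure.", "def f(x): return x + 1"),
    ("Name the failing test.", "test_loop_terminates"),
]

def baseline_model(prompt: str) -> str:
    # Placeholder standing in for a real model call.
    return "I don't know."

if __name__ == "__main__":
    rate = run_eval(baseline_model, CASES)
    print(f"pass rate: {rate:.0%}")  # stays at 0% until improvements land
```

The point of running such an eval continuously, rather than only at the end, is that the pass rate becomes a progress signal: it starts at zero by design and rises as the model or prompts improve.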