The Evaluation Crisis Nobody Wants to Talk About
Public benchmarks are saturated. Private evals are unreproducible. The field is flying half-blind and pretending otherwise.
By The Memo · Wednesday, June 17, 2026 · 8 min read
There's a peculiar silence at the center of AI research right now. Every lab publishes evaluation numbers. Every paper claims state-of-the-art on something. And yet ask any researcher off the record how much they trust those numbers, and you get a long pause.
The pause is not because the numbers are fake. It's because nobody — not even the people producing them — is sure what they mean anymore.
The public benchmarks that defined the field — MMLU, GSM8K, HumanEval — are either saturated or compromised by training-data contamination. The replacements proposed in the last eighteen months are either proprietary, prohibitively expensive to run, or both. The field is in the uncomfortable position of producing rapid capability gains without a reliable measuring stick to verify them.