Research

The Evaluation Crisis Nobody Wants to Talk About

Public benchmarks are saturated. Private evals are unreproducible. The field is flying half-blind and pretending otherwise.

By The Memo · Wednesday, June 17, 2026 · 8 min read

Photo: Carlos Muza / Unsplash

There's a peculiar silence at the center of AI research right now. Every lab publishes evaluation numbers. Every paper claims state-of-the-art on something. And yet ask any researcher off the record how much they trust those numbers, and you get a long pause.

The pause is not because the numbers are fake. It's because nobody — not even the people producing them — is sure what they mean anymore.

The public benchmarks that defined the field (MMLU, GSM8K, HumanEval) are either saturated or compromised by training-data contamination. The replacements proposed in the last eighteen months are proprietary, prohibitively expensive to run, or both. The field is in the uncomfortable position of reporting rapid capability gains without a reliable measuring stick to verify them.
