Milestone note · Reached Jan 2025

A benchmark built to resist saturation: Humanity's Last Exam

CAIS · Scale AI: As models saturated existing tests, a 2,500-question expert-written exam launched, and frontier models initially scored in the single digits (under 10% accuracy) on it, giving a fresh yardstick for the distance to general capability.

Auto-drafted from a verified measurement, then human-checked.
