Researchers at the Stanford Institute for Human-Centered AI (HAI) are pioneering a new measurement science to fix the way artificial intelligence is evaluated. The initiative follows a workshop involving experts from Stanford, Cornell, and Schmidt Sciences, who argue that current benchmarks are fundamentally flawed. While modern models often achieve high scores, they frequently fail at basic logic, such as claiming that 2.11 is greater than 2.9.
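To picture how such a slip can arise, here is a toy Python sketch. It is purely illustrative and not a claim about how any particular model computes; it simply reproduces the error pattern by comparing the digits after the decimal point as if they were whole numbers, so 11 "beats" 9:

```python
# Toy reproduction of the "2.11 > 2.9" error pattern described above.
# Illustrative only: it mimics the mistake, it does not describe model internals.

def naive_fractional_compare(a: str, b: str) -> str:
    """Pick the 'larger' decimal by comparing fractional parts as whole numbers."""
    a_int, a_frac = a.split(".")
    b_int, b_frac = b.split(".")
    if a_int != b_int:
        return a if int(a_int) > int(b_int) else b
    return a if int(a_frac) > int(b_frac) else b  # 11 > 9, so "2.11" wins

print(naive_fractional_compare("2.11", "2.9"))  # -> 2.11 (wrong)
print(max("2.11", "2.9", key=float))            # -> 2.9  (correct)
```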
The group highlights a phenomenon known as "jagged intelligence," in which an AI might solve a differential equation yet fail at a rudimentary comparison. Existing tests miss this because they measure task completion rather than underlying understanding. Most current benchmarks also fall prey to Goodhart's Law: once the measure becomes a target, models can simply memorize it, and a high score stops being evidence of actual reasoning.
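Goodhart's Law is easy to see with the same decimal-comparison task. In the invented example below, a lookup table that has memorized a fixed benchmark scores perfectly on that benchmark, while a simple rule that actually compares magnitudes is the only one that survives a change of items; all data here is made up for illustration:

```python
# Invented illustration of Goodhart's Law on a fixed benchmark.
benchmark = {("2.9", "2.11"): "2.9", ("7.5", "7.45"): "7.5"}
held_out  = {("3.15", "3.2"): "3.2", ("0.75", "0.8"): "0.8"}

memorizer = lambda a, b: benchmark.get((a, b), a)  # optimizes the measure itself
reasoner  = lambda a, b: max(a, b, key=float)      # implements the underlying rule

for split_name, items in [("benchmark", benchmark), ("held-out", held_out)]:
    for label, solver in [("memorizer", memorizer), ("reasoner", reasoner)]:
        score = sum(solver(a, b) == truth for (a, b), truth in items.items()) / len(items)
        print(split_name, label, score)  # memorizer: 1.0 then 0.0; reasoner: 1.0 on both
```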
To address this, organizers including Sanmi Koyejo and Olawale Salaudeen suggest adopting psychometrics, the field of psychology devoted to measuring hidden traits such as intelligence. Instead of asking whether the AI got the right answer, the new framework asks what the answer reveals about the model's latent capabilities. This shift moves the focus from pattern matching to construct validity, ensuring that a test actually measures what it claims to measure.
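What the psychometric lens adds can be sketched with a standard tool from that field, the Rasch model, in which the probability of answering item i correctly is sigmoid(theta - b_i) for a latent ability theta and item difficulty b_i. The items, difficulties, and response patterns below are invented for illustration and are not part of the Stanford framework; the point is that two hypothetical systems with the same 3-out-of-5 score receive the same ability estimate, yet the "jagged" answer pattern is flagged as far less plausible under that estimate:

```python
# Minimal Rasch (1-parameter IRT) sketch with invented data: same raw score,
# very different fit between the response pattern and the estimated ability.
import numpy as np

def estimate_ability(x: np.ndarray, b: np.ndarray) -> float:
    """Maximum-likelihood estimate of latent ability theta via Newton-Raphson."""
    theta = 0.0
    for _ in range(50):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))        # P(correct) for each item
        theta += np.sum(x - p) / np.sum(p * (1.0 - p))
    return theta

def pattern_log_likelihood(x: np.ndarray, b: np.ndarray, theta: float) -> float:
    """How plausible a response pattern is given the fitted ability (person fit)."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return float(np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)))

b = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # item difficulties, easy -> hard
consistent = np.array([1, 1, 1, 0, 0])      # misses only the hardest items
jagged     = np.array([0, 0, 1, 1, 1])      # solves hard items, fails easy ones

for name, x in [("consistent", consistent), ("jagged", jagged)]:
    theta = estimate_ability(x, b)
    print(name, round(theta, 2), round(pattern_log_likelihood(x, b, theta), 2))
```

Under this framing the raw score is only a starting point; the pattern of answers carries the information about whether a coherent capability sits behind it.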
A key part of this effort is the development of the AI Construct Lexis, a curated knowledge base for defining AI traits. The workshop revealed that even top experts lack consensus on what reasoning actually means for a machine. By standardizing these definitions, researchers hope to avoid jingle-jangle fallacies, in which unrelated behaviors get lumped together because they share a name, or the same capability is treated as two different things because it is labeled inconsistently. Greater predictability in AI behavior is essential for safety and trust in real-world applications. If developers can accurately measure why an AI works, they can better predict when it might fail in novel situations. This technical rigor is expected to influence future policy, moving regulation away from simple hardware metrics toward proven functional safety.
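To make the lexicon idea concrete, here is a purely hypothetical sketch of what a single standardized construct entry could look like; the schema, field names, and example are invented and are not the actual contents or format of the AI Construct Lexis:

```python
# Hypothetical construct-lexicon sketch; not the real AI Construct Lexis schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class Construct:
    name: str               # one standardized name per trait
    definition: str         # agreed operational definition
    distinct_from: tuple    # similar-sounding traits that are *not* the same thing
    example_items: tuple    # benchmark items intended to measure this trait

lexicon: dict[str, Construct] = {}

def register(entry: Construct) -> None:
    """Refuse duplicate names, so one label cannot quietly cover two behaviors."""
    if entry.name in lexicon:
        raise ValueError(f"'{entry.name}' is already defined: {lexicon[entry.name].definition}")
    lexicon[entry.name] = entry

register(Construct(
    name="decimal_comparison",
    definition="Ordering decimal numbers by magnitude.",
    distinct_from=("multi_step_arithmetic",),
    example_items=("Which is larger, 2.9 or 2.11?",),
))
```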
This initiative signals a shift from treating AI as a black box judged on performance to treating it as a scientific subject that requires rigorous validation. For the industry, it means the era of marketing models on high scores against fixed benchmarks like MMLU is nearing its end. As regulators demand proof of safety, companies will need to demonstrate functional understanding rather than statistical memorization. This movement toward AI psychometrics provides the technical foundation needed to build trust in high-stakes sectors like medicine and law.