Anthropic Redesigns Engineering Tests After Claude Models Match Human Candidate Performance

Anthropic has shifted to abstract logic puzzles for technical hiring after discovering its newest AI models can solve traditional realistic engineering tests.

Decoded

Published Jan 23, 2026

3 min read

Image by Anthropic

The AI Evaluation Arms Race

Anthropic recently revealed that it has overhauled its technical hiring process after its own AI models began outperforming top-tier human performance engineers. Tristan Hume, a lead on the performance optimization team, documented how successive versions of Claude rendered traditional take-home tests nearly useless. By the time Claude Opus 4.5 arrived, it could match the output of the strongest human candidates within standard testing time limits.

The company's evaluation journey began with a realistic simulation of a hardware accelerator where candidates optimized parallel code. This format worked well for over a year, helping the firm hire dozens of engineers who built its current model clusters. However, the rapid advancement of Claude forced a shift from realistic work toward increasingly unconventional challenges to maintain a clear hiring signal.
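The article does not describe the internals of that original simulator. Purely as an illustration of the format, the sketch below assumes a made-up accelerator cost model in which a candidate's kernel is scored by simulated cycles and wide vector operations amortize cost across lanes; the simulator name, cost table, and kernels are all hypothetical.

```python
# Illustrative only: a made-up "accelerator" cost model, not Anthropic's actual simulator.
# The take-home format scores candidates on how few simulated cycles their kernel needs;
# this toy reproduces that scoring loop.

from dataclasses import dataclass


@dataclass
class ToyAccelerator:
    """Counts cycles for a stream of (op, lanes) instructions."""

    # Hypothetical cost table: each op has a fixed cost per issue.
    COST = {"load": 4, "store": 4, "add": 1, "mul": 2}
    cycles: int = 0

    def issue(self, op: str, lanes: int = 1) -> None:
        # An 8-lane vector unit: an op over N elements takes ceil(N / 8) issues.
        issues = -(-lanes // 8)
        self.cycles += self.COST[op] * issues


def naive_kernel(sim: ToyAccelerator, n: int) -> None:
    """Element-at-a-time: every element pays full load/mul/store cost."""
    for _ in range(n):
        sim.issue("load")
        sim.issue("mul")
        sim.issue("store")


def vectorized_kernel(sim: ToyAccelerator, n: int) -> None:
    """Same work expressed as 8-wide vector ops, so costs amortize across lanes."""
    sim.issue("load", lanes=n)
    sim.issue("mul", lanes=n)
    sim.issue("store", lanes=n)


if __name__ == "__main__":
    for kernel in (naive_kernel, vectorized_kernel):
        sim = ToyAccelerator()
        kernel(sim, n=1024)
        print(f"{kernel.__name__}: {sim.cycles} cycles")
```

A test in this style rewards exactly the kind of optimization knowledge that now saturates model training data, which is the problem the next paragraphs describe.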

Hume discovered that as long as a problem resembled real-world engineering, Claude could solve it by drawing on its vast training data. Even when the team designed complex data transposition tasks, the model identified clever architectural tricks that mirrored human reasoning. This prompted a move to out-of-distribution tests, where a model's training data offers little to no advantage over raw human adaptability.


Solving for Novelty

The current iteration of the test is modeled after Zachtronics-style logic puzzles, featuring a tiny, constrained instruction set and no debugging tools to start with. Candidates must build their own tooling and solve abstract problems that rely on raw logic rather than knowledge of existing systems. This ensures the evaluation captures the ability to navigate novel environments, a skill the company finds increasingly vital.

Anthropic has now released the original version of its take-home as an open challenge on GitHub. While elite humans still hold an edge over AI when given unlimited time, the two-hour benchmark is now a dead heat. The company is inviting anyone who can beat the 1487-cycle score set by Claude Opus 4.5 to reach out to its recruiting team directly.
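The released challenge's actual instruction set and the details behind the 1487-cycle target are not reproduced in the article. As a rough illustration of the puzzle genre, the sketch below defines a hypothetical four-instruction machine and a homemade single-step trace of the kind a candidate might write for themselves, since the environment ships with no debugger; every opcode and register name here is invented for the example.

```python
# Illustrative only: a hypothetical four-instruction machine in the spirit of
# Zachtronics-style puzzles, not the instruction set in Anthropic's released repo.

def run(program, registers=None, trace=False, max_cycles=10_000):
    """Execute a list of (op, *args) tuples; return (registers, cycles used)."""
    regs = dict(registers or {})
    pc = cycles = 0
    while pc < len(program) and cycles < max_cycles:
        op, *args = program[pc]
        if trace:  # the "build your own tooling" part: a homemade single-step trace
            print(f"cycle={cycles:4d} pc={pc:2d} {op} {args} regs={regs}")
        if op == "set":      # set rX <imm>
            regs[args[0]] = args[1]
        elif op == "add":    # add rX rY  ->  rX += rY
            regs[args[0]] = regs.get(args[0], 0) + regs.get(args[1], 0)
        elif op == "dec":    # dec rX     ->  rX -= 1
            regs[args[0]] = regs.get(args[0], 0) - 1
        elif op == "jnz":    # jnz rX <target>: jump to <target> if rX != 0
            if regs.get(args[0], 0) != 0:
                pc = args[1]
                cycles += 1
                continue
        else:
            raise ValueError(f"unknown op: {op}")
        pc += 1
        cycles += 1
    return regs, cycles


# Example puzzle: compute 5 * 7 using only set/add/dec/jnz, in as few cycles as possible.
MULTIPLY = [
    ("set", "acc", 0),
    ("set", "a", 5),
    ("set", "b", 7),
    ("add", "acc", "b"),   # loop body: acc += b
    ("dec", "a"),
    ("jnz", "a", 3),       # repeat until a == 0
]

if __name__ == "__main__":
    regs, cycles = run(MULTIPLY, trace=False)
    print(f"acc={regs['acc']} in {cycles} cycles")  # lower cycle counts score better
```

The appeal of this format for hiring is that nothing in it maps onto a documented real-world system, so the candidate's only resource is their own reasoning and whatever instrumentation they bother to build.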

Decoded Take

The transition from realistic simulations to abstract puzzles signals a fundamental shift in how the tech industry must define human expertise. As large language models master standard workflows and common architectural patterns, the competitive advantage for human workers is shifting away from experience toward the ability to navigate novel, low-context environments. This suggests that the traditional technical interview is nearing an end, replaced by evaluations that measure how quickly a human can adapt to systems that have never existed before.
