Designing AI-Resistant Technical Evaluations

Published: January 21, 2026

Author: Tristan Hume, Lead on Anthropic's Performance Optimization Team

Overview

Evaluating technical candidates becomes increasingly difficult as AI capabilities advance. A take-home assessment that effectively distinguishes human skill levels today risks becoming trivial for models tomorrow, rendering it useless for evaluation purposes.

The Original Take-Home

Design Goals

Since early 2024, Anthropic's performance engineering team has used a take-home test requiring candidates to optimize code for a simulated accelerator. Over 1,000 candidates completed it, with dozens now employed at the company, including engineers who shipped every model since Claude 3 Opus.

The original assessment pursued several key objectives:

  • Longer time horizon: A 4-hour window (later reduced to 2 hours) better reflects real engineering work than typical 50-minute interviews
  • Realistic environment: Candidates work in their own editor without observation
  • Time for comprehension and tooling: Candidates can build debugging utilities, which mirrors actual performance engineering
  • Representative of real work: The problem provides authentic insight into job demands
  • High signal: Multiple optimization paths ensure wide scoring distribution
  • No specific domain knowledge: Fundamentals matter more than specialized expertise
  • Engagement: Fast development loops and creative opportunities motivate participation

The Simulated Machine

The assessment featured a Python simulator for a fake accelerator resembling TPUs. Candidates optimized code using hot-reloading Perfetto traces showing every instruction. The machine included:

  • Manually managed scratchpad memory
  • VLIW (multiple execution units requiring efficient instruction packing)
  • SIMD (vector operations on multiple elements)
  • Multicore parallelism
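
As a rough illustration of how those features interact, here is a toy Python sketch. Everything in it (class names, the instruction set, slot and lane counts) is invented for illustration, not taken from the actual simulator, and multicore is omitted for brevity:

```python
# Toy sketch of a scratchpad + VLIW + SIMD accelerator (illustrative only;
# the instruction set and constants below are invented, not Anthropic's machine).

VECTOR_WIDTH = 8   # SIMD lanes per vector op (assumption)
NUM_SLOTS = 2      # execution units per VLIW bundle (assumption)

class ToyCore:
    def __init__(self, scratchpad_words=256):
        # Manually managed scratchpad: no caches, the program decides layout.
        self.scratch = [0] * scratchpad_words
        self.cycles = 0

    def run(self, bundles):
        """Execute a list of VLIW bundles; each bundle is a tuple of up to
        NUM_SLOTS ops, all of which retire in the same cycle."""
        for bundle in bundles:
            assert len(bundle) <= NUM_SLOTS, "too many ops in one bundle"
            for op, *args in bundle:
                getattr(self, "op_" + op)(*args)
            self.cycles += 1  # one cycle per bundle, however full it is
        return self.cycles

    def op_store(self, addr, value):
        self.scratch[addr] = value

    def op_vadd(self, dst, a, b):
        # SIMD: add VECTOR_WIDTH scratchpad words element-wise.
        for lane in range(VECTOR_WIDTH):
            self.scratch[dst + lane] = self.scratch[a + lane] + self.scratch[b + lane]
```

In a model like this, packing two independent ops into one bundle halves the cycle count relative to issuing them one per bundle, which is exactly the kind of win VLIW instruction packing rewards.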

The task involved parallel tree traversal—deliberately not deep-learning flavored, since most candidates lacked ML experience. Candidates progressively exploited the machine's parallelism, starting with multicore before choosing between SIMD vectorization or VLIW instruction packing.
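
The scalar-to-parallel progression that such a task rewards can be sketched in plain Python (an invented toy, not the take-home's actual tree problem): the frontier version groups nodes into level-sized batches, the shape that maps naturally onto SIMD lanes.

```python
# Illustrative sketch: the same tree walk written scalar-style and
# frontier-at-a-time. Tree encoding is an assumption for this example:
# nodes[i] = (value, [child indices]).

def sum_tree_scalar(nodes, root=0):
    """One node per step, like unoptimized scalar code."""
    total, stack = 0, [root]
    while stack:
        i = stack.pop()
        value, children = nodes[i]
        total += value
        stack.extend(children)
    return total

def sum_tree_frontier(nodes, root=0):
    """Process a whole level per step; each level is a batch that a SIMD
    machine could reduce with one vector op per VECTOR_WIDTH nodes."""
    total, frontier = 0, [root]
    while frontier:
        total += sum(nodes[i][0] for i in frontier)            # one vector reduce
        frontier = [c for i in frontier for c in nodes[i][1]]  # gather children
    return total
```

Both return the same sum; the difference is that the frontier version exposes per-level parallelism a vector machine can exploit.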

Early Success

The initial take-home proved effective. One candidate from the original Twitter recruitment substantially outperformed all others. This person "immediately began optimizing kernels and found a workaround for a launch-blocking compiler bug" upon starting in February 2024.

Over eighteen months, approximately 1,000 candidates completed the assessment. It successfully identified high-performing engineers and proved particularly valuable for candidates with little on paper: several top performers came straight from undergrad but demonstrated enough skill to hire confidently.

Feedback was positive. Many candidates worked beyond the 4-hour limit from genuine engagement. Advanced submissions included full optimizing mini-compilers and unanticipated clever optimizations.

Claude Opus 4 Defeats Version 1

By May 2025, Claude 3.5 Sonnet had advanced enough that "over 50% of candidates would have been better off delegating to Claude Code entirely." Testing a pre-release Claude Opus 4 revealed it produced "more optimized solutions than almost all humans did within the 4-hour limit."

While this still enabled distinguishing the strongest candidates, it signaled the need for redesign.

Version 2 Adjustments

The problem contained far more depth than candidates could explore in 4 hours, so Tristan identified where Claude Opus 4 began struggling and made that the new starting point. Changes included:

  • Cleaner starter code
  • New machine features for additional depth
  • Removed multicore (which Claude had already solved)
  • Shortened time limit from 4 to 2 hours

Version 2 emphasized clever optimization insights over debugging and code volume. It succeeded temporarily.

Claude Opus 4.5 Defeats Version 2

When testing pre-release Claude Opus 4.5, the model worked on the problem for 2 hours, gradually improving. It solved initial bottlenecks, implemented standard micro-optimizations, and met passing thresholds in under an hour.

Importantly, Claude then "stopped, convinced it had hit an insurmountable memory bandwidth bottleneck." Most humans reach the same conclusion. However, clever tricks exploit problem structure to circumvent this bottleneck. When told the achievable cycle count, Claude "thought for a while and found the trick," then debugged, tuned, and implemented further optimizations.

By the 2-hour mark, Claude's score matched the best human performance within that time limit—and that human had made heavy use of Claude 4 with steering.

Internal testing confirmed Claude could beat humans in 2 hours while continuing improvement with extended time.

Considering Options

Anthropic faced a critical decision. Some colleagues suggested banning AI assistance, but Tristan believed candidates should still distinguish themselves in settings that include AI, mirroring real job conditions. The concern with raising the bar to "substantially outperform what Claude Code achieves alone" was that Claude works so quickly that humans would spend half their time just understanding the problem specification while remaining perpetually behind.

Importantly: "Nowadays performance engineers at Anthropic still have lots of work to do, but it looks more like tough debugging, systems design, performance analysis, figuring out how to verify correctness of systems, and making Claude's code simpler and more elegant."

Attempt 1: Data Transposition Problem

Tristan attempted a harder take-home based on efficiently transposing 2D data across TPU registers while avoiding bank conflicts. Claude Opus 4.5 found "a great optimization" he hadn't considered: transposing the entire computation rather than transposing the data alone.

When he patched this approach away, Claude made progress but initially couldn't find the most efficient solution. However, testing with Claude Code's "ultrathink" feature and longer thinking budgets revealed that it eventually solved the problem; Claude already knew the standard bank-conflict tricks.

The fundamental issue: "Engineers across many platforms have struggled with data transposition and bank conflicts, so Claude has substantial training data to draw on."
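
The underlying hardware issue is easy to model. Assuming a memory split into banks by `address % NUM_BANKS` (a toy addressing scheme, not any specific chip), walking a column of a matrix whose row stride equals the bank count hits the same bank on every access, while padding each row by one element spreads the accesses:

```python
# Toy model of memory bank conflicts during a column read (illustrative;
# NUM_BANKS and the addressing scheme are assumptions, not real hardware).

NUM_BANKS = 8

def max_bank_conflicts(addresses):
    """Worst-case number of accesses landing in one bank; 1 = conflict-free."""
    counts = [0] * NUM_BANKS
    for addr in addresses:
        counts[addr % NUM_BANKS] += 1
    return max(counts)

def column_addresses(rows, row_stride, col):
    """Addresses touched when reading one column of a row-major matrix."""
    return [r * row_stride + col for r in range(rows)]

# Reading one column of an 8x8 matrix with stride 8: every access hits the
# same bank, so all 8 accesses collide.
naive = max_bank_conflicts(column_addresses(8, row_stride=8, col=0))   # -> 8

# Padding each row by one element makes the stride 9, coprime with the bank
# count, so consecutive rows land in different banks.
padded = max_bank_conflicts(column_addresses(8, row_stride=9, col=0))  # -> 1
```

This padding trick is widely documented for GPU shared memory and similar banked memories, which is part of why Claude has so much relevant training data to draw on.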

Attempt 2: Going Weirder

Seeking problems where human reasoning could outperform Claude's larger experience base, Tristan landed on Zachtronics games—puzzle games featuring unusual, highly constrained instruction sets forcing unconventional programming.

He designed puzzles around tiny, heavily constrained instruction sets, scored on minimal instruction count. Testing revealed that Claude Opus 4.5 failed on the medium-hard puzzle he implemented, while colleagues with less domain expertise than his still outperformed Claude.

Critically, he intentionally provided "no visualization or debugging tools. The starter code only checks whether solutions are valid." Building debugging tools becomes part of what's tested—"Judgment about how to invest in tooling is part of the signal."
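
A starter kit in that spirit might expose nothing but an interpreter and a pass/fail check. The sketch below is entirely invented (the instruction set, names, and scoring are assumptions, not the actual puzzles), but it shows the shape: correctness is machine-checked, instruction count is the score, and any tracing or visualization is left for the candidate to build.

```python
# Invented sketch of a validity-only starter harness for a tiny constrained
# instruction set: MOV/ADD/JNZ over named registers, scored by program length.

def run(program, regs=None, max_steps=10_000):
    """Interpret a list of (op, *args) instructions; return final registers.
    max_steps bounds runaway loops so validity checks always terminate."""
    regs = dict(regs or {})
    pc = steps = 0
    while pc < len(program) and steps < max_steps:
        op, *args = program[pc]
        steps += 1
        if op == "MOV":      # MOV dst, immediate
            regs[args[0]] = args[1]
        elif op == "ADD":    # ADD dst, src  (dst += src)
            regs[args[0]] = regs.get(args[0], 0) + regs.get(args[1], 0)
        elif op == "JNZ":    # JNZ reg, target  (jump if reg != 0)
            if regs.get(args[0], 0) != 0:
                pc = args[1]
                continue
        pc += 1
    return regs

def check(program, inputs, expected_reg, expected_value):
    """The whole starter kit: valid or not. No traces, no visualizer."""
    return run(program, inputs).get(expected_reg) == expected_value

# Score = len(program); any debugger you want, you build yourself.
doubler = [("ADD", "a", "a")]
assert check(doubler, {"a": 21}, "a", 42)
```

With only `check` provided, the first thing a strong candidate might do is write a step-by-step trace printer on top of `run`, and that judgment call is precisely the signal being measured.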

Results and Trade-offs

The new take-home shows promise. Scores correlate well with candidates' past work quality. One of his most capable colleagues "scored higher than any candidate so far."

However, the trade-off is real: "I'm still sad to have given up the realism and varied depth of the original. But realism may be a luxury we no longer have. The original worked because it resembled real work. The replacement works because it simulates novel work."

Open Challenge

Anthropic is releasing the original take-home publicly for unlimited-time attempts. Human experts "retain an advantage over current models at sufficiently long time horizons. The fastest human solution ever submitted substantially exceeds what Claude has achieved even with extensive test-time compute."

Performance Benchmarks

Measured in clock cycles on the simulated machine (lower is better):

  • 2164 cycles: Claude Opus 4 after many hours in test-time compute harness
  • 1790 cycles: Claude Opus 4.5 in casual Claude Code session (approximately matching best human 2-hour performance)
  • 1579 cycles: Claude Opus 4.5 after 2 hours in test-time compute harness
  • 1548 cycles: Claude Sonnet 4.5 after extended test-time compute
  • 1487 cycles: Claude Opus 4.5 after 11.5 hours in harness
  • 1363 cycles: Claude Opus 4.5 in improved test-time compute harness after many hours

Submission Details

Those optimizing below 1487 cycles (beating Claude's launch performance) can submit code and resume to [email protected].

Alternatively, candidates may apply through Anthropic's standard process using the new Claude-resistant take-home. "We're curious how long it lasts."