Quantifying Infrastructure Noise in Agentic Coding Evals

Introduction

Agentic coding benchmarks like SWE-bench and Terminal-Bench are widely used to evaluate frontier AI models' software engineering capabilities. However, infrastructure configuration alone can create performance variations that exceed the margins separating top leaderboard positions. In internal experiments, the difference between most- and least-resourced setups on Terminal-Bench 2.0 reached 6 percentage points (p < 0.01).

The Core Problem

Static benchmarks score model outputs independently of runtime environment, but agentic coding evaluations operate differently. These systems provide models with full environments where they write code, execute tests, install dependencies, and iterate across multiple turns. The infrastructure becomes an integral component of problem-solving rather than a passive container. Two agents operating under different resource budgets and time constraints effectively face different tests.

While Terminal-Bench 2.0 specifies recommended CPU and RAM per task, a specification is not the same as consistent enforcement. The researchers discovered that the enforcement methodology fundamentally changes what the benchmark measures.

Investigation and Findings

Initial Observations

Running Terminal-Bench 2.0 on Google Kubernetes Engine revealed scores misaligned with the official leaderboard. Infrastructure errors affected as many as 6% of tasks, most unrelated to model capability.

The discrepancy stemmed from the enforcement approach. The Kubernetes implementation treated per-task resource specifications as hard limits: containers received their guaranteed allocation but were terminated immediately upon exceeding it. Container runtimes enforce resources through two parameters, a guaranteed allocation (reserved upfront) and a hard kill threshold, and setting them to the same value leaves zero margin for transient fluctuations. Memory spikes could therefore trigger out-of-memory kills on tasks that would otherwise have succeeded. Terminal-Bench's leaderboard, by contrast, uses more lenient sandboxing that permits temporary overallocation without termination.
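In Kubernetes terms, the two parameters map to a container's resource `requests` (guaranteed allocation) and `limits` (hard kill threshold). A minimal sketch of the strict configuration, with hypothetical values and a placeholder image name, might look like:

```yaml
# Hypothetical per-task pod spec. Setting requests equal to limits
# leaves zero headroom: any transient memory spike above 2Gi
# triggers an immediate out-of-memory kill.
apiVersion: v1
kind: Pod
metadata:
  name: tb-task-sandbox
spec:
  containers:
    - name: agent
      image: terminal-bench-task:latest   # placeholder image name
      resources:
        requests:
          cpu: "2"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 2Gi   # identical to requests: no margin for spikes
```

Note that Kubernetes handles the two resource types differently: a container exceeding its CPU limit is throttled, while one exceeding its memory limit is killed, which is why memory spikes are the destabilizing case here.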

Quantifying the Effect

Testing across six resource configurations—from strict enforcement (1x) to completely uncapped—while holding model, harness, and task set constant revealed:

Infra Error Rates:

  • Strict enforcement: 5.8%
  • 3x headroom: 2.1% (p < 0.001 vs. strict)
  • Uncapped: 0.5%

Success Rate Trends:

Between the 1x and 3x configurations, success scores fluctuated within noise margins (p = 0.40). Most tasks that crashed at 1x would have failed regardless: agents hit resource walls before they had a viable solution underway.

Starting around 3x, the pattern shifted. Between the 3x and uncapped settings:

  • Infrastructure errors declined 1.6 percentage points
  • Success rates jumped nearly 4 percentage points
  • Total lift from 1x to uncapped: +6 percentage points (p < 0.01)
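A lift like the 6-point gap between 1x and uncapped can be checked against sampling noise with a standard two-proportion z-test. The sketch below uses hypothetical success counts (the raw task counts are not published here), so the numbers are illustrative only:

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-sided two-proportion z-test; returns (z, p_value)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Pool the two samples to estimate the standard error under H0
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Normal-approximation two-sided p-value via the error function
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: 520/1000 successes uncapped vs. 460/1000 at 1x,
# i.e. a 6-percentage-point gap at an assumed sample size
z, p = two_proportion_z(520, 1000, 460, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

At that assumed sample size, a 6-point gap clears the p < 0.01 bar; at a few hundred samples the same gap would sit much closer to the noise floor, which is why sample counts matter as much as the headline delta.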

Additional resources enabled approaches requiring substantial allocations: importing large dependencies, spawning resource-intensive subprocesses, running memory-heavy test suites. Tasks like rstan-to-pystan and compile-compcert showed significant improvement with memory headroom.

What This Reveals About Measurement

Up to roughly 3x specifications, additional resources primarily address infrastructure stability—transient spike handling. Terminal-Bench maintainers implicitly implement this through their sandboxing provider. The evaluation stabilizes without becoming easier.

Beyond 3x, additional resources actively enable problem-solving capabilities previously unavailable. Resource limits determine which solution strategies succeed. Tight constraints reward efficient coding and rapid execution. Generous limits favor agents leveraging comprehensive tools and heavyweight approaches.

Real-World Example

The bn-fit-modify task requires fitting a Bayesian network. Some models' first instinct is to install the standard Python data science stack: pandas, networkx, scikit-learn, and their associated toolchains. Under generous limits this succeeds; under tight constraints, the installation itself exhausts memory before any solution attempt begins. Alternative strategies exist, such as implementing the mathematics from scratch using only the standard library, yet some models default to the heavy-tool approach. Different models have different default strategies, and resource configuration determines which of those strategies can work at all.
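The from-scratch alternative is lighter than it sounds. For a discrete Bayesian network, fitting amounts to counting conditional frequencies. A stdlib-only sketch on a hypothetical toy network (not the actual task's data) could look like:

```python
# Hypothetical from-scratch strategy: fit a discrete Bayesian network's
# conditional probability table by counting, no pandas/networkx needed.
from collections import Counter

def fit_cpt(rows, child, parents):
    """Estimate P(child | parents) from a list of dict records."""
    joint = Counter()          # counts of (parent values, child value)
    parent_counts = Counter()  # counts of parent values alone
    for row in rows:
        key = tuple(row[p] for p in parents)
        joint[(key, row[child])] += 1
        parent_counts[key] += 1
    return {
        (key, value): count / parent_counts[key]
        for (key, value), count in joint.items()
    }

data = [
    {"rain": 1, "sprinkler": 0, "wet": 1},
    {"rain": 1, "sprinkler": 0, "wet": 1},
    {"rain": 0, "sprinkler": 1, "wet": 1},
    {"rain": 0, "sprinkler": 0, "wet": 0},
]
cpt = fit_cpt(data, "wet", ["rain", "sprinkler"])
print(cpt[((1, 0), 1)])  # P(wet=1 | rain=1, sprinkler=0)
```

A strategy like this stays well inside tight memory limits precisely because it imports nothing, which is the tradeoff the resource configuration implicitly selects for.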

Cross-Benchmark Validation

Testing extended beyond Terminal-Bench to SWE-bench through crossover experiments varying available RAM up to 5x baseline across 227 problems with 10 samples each. The same effect persisted, though at smaller magnitude: scores increased monotonically with RAM, but were only 1.54 percentage points higher at 5x than at 1x. Since SWE-bench tasks demand fewer resources, a smaller effect was expected, but resource allocation is clearly not universally neutral.

The researchers replicated core findings across different Anthropic models with consistent directional effects and varying magnitudes. Broader applicability beyond Claude remains untested rigorously.

Other Confounding Variables

Resource allocation isn't the sole hidden variable. Time limits can play similar roles under certain configurations.

Every evaluation setup element potentially influences final scores: cluster health, hardware specifications, concurrency levels, even egress bandwidth. Agentic evaluations constitute end-to-end system tests by design; any system component functions as a potential confounder. Pass rates anecdotally fluctuate with time-of-day, likely reflecting API latency variations with traffic patterns and incidents. While not formally quantified, this illustrates a broader principle: the boundary between model capability and infrastructure behavior blurs considerably. Model providers can shield evaluation infrastructure through hardware dedication; external evaluators cannot easily replicate this.

Public benchmarks theoretically measure pure model capability but in practice risk conflating it with infrastructure quirks. Testing the end-to-end stack is sometimes desirable, but more often it is not. Publicly shared coding evaluations benefit from being run at multiple times across multiple days to average out the noise.

Recommendations

The ideal scenario involves running evaluations under identical hardware conditions for scaffold and inference stack, ensuring reproducibility. Practical constraints may prevent this.

Given how container runtimes enforce resources, via a guaranteed allocation and a separate hard kill threshold, evaluations should specify both parameters per task rather than pinning a single value. A single exact specification sets the guaranteed allocation equal to the kill threshold, leaving no margin, and the transient memory spikes documented at 1x then destabilize the evaluation. Separating the two parameters gives containers breathing room to avoid spurious out-of-memory kills, while the hard ceiling still prevents score inflation.

The band between guaranteed allocation and hard limit should be calibrated so that scores at the floor and ceiling fall within noise. For Terminal-Bench 2.0, a 3x ceiling over per-task specifications cut infrastructure error rates by roughly two-thirds (5.8% to 2.1%, p < 0.001) while producing only a noise-level score lift (p = 0.40). This is a reasonable tradeoff: infrastructure confounders are substantially neutralized without removing meaningful resource pressure. The exact multiplier varies by benchmark and task distribution and should be reported, though the empirical calibration principle generalizes.
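Concretely, under a Kubernetes-style requests/limits model, the calibrated band would translate into something like the following, with hypothetical per-task numbers:

```yaml
# Hypothetical task whose spec calls for 2Gi: guarantee the spec,
# but place the kill threshold at 3x so transient spikes survive
# while runaway memory use still hits a hard ceiling.
resources:
  requests:
    memory: 2Gi   # guaranteed allocation (the per-task spec)
  limits:
    memory: 6Gi   # hard ceiling at 3x the spec
```

The same two-parameter split exists outside Kubernetes; Docker, for instance, distinguishes a soft `--memory-reservation` from the hard `--memory` limit.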

Why This Matters

These discoveries carry practical implications beyond evaluation infrastructure. Increasingly, benchmark scores inform decision-making, yet this reliance hasn't consistently brought corresponding rigor to execution or reporting. Currently, a 2-point leaderboard lead might reflect genuine capability differences or simply one evaluation running on more powerful hardware, or even at more favorable times, or combinations thereof. Without published or standardized setup configurations, external verification requires significant effort to reproduce outcomes under identical conditions.

For research organizations, resource configuration for agentic evaluations warrants treatment as a first-class experimental variable, documented and controlled with the same rigor as prompt format or sampling temperature. Benchmark maintainers who publish recommended resource specs can contribute significantly by also specifying the enforcement methodology, closing the gap identified here. For consumers of benchmark results, small differences in agentic eval scores carry greater uncertainty than the precision of the reported numbers suggests, especially since some confounders resist control.

Until resource methodology is standardized, the evidence indicates that leaderboard differences below 3 percentage points deserve skepticism absent documented and matched eval configurations. Observed spreads across moderate resource configuration ranges in Terminal-Bench reach just below 2 percentage points. Naive binomial confidence intervals already span 1-2 percentage points, and documented infrastructure confounders stack on top of that, not within it. At the extremes of the allocation range, spreads reach 6 percentage points.
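For intuition on the binomial noise floor, a normal-approximation confidence interval at an assumed sample size shows the scale; the sample count below is hypothetical, not the benchmark's actual task count:

```python
import math

def binomial_ci_halfwidth(p, n, z=1.96):
    """Half-width of a 95% normal-approximation CI for a pass rate."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical: a 50% pass rate measured over 10,000 task attempts
half = binomial_ci_halfwidth(0.5, 10_000)
print(f"±{half * 100:.2f} percentage points")
```

Even at that generous sample size the interval is about ±1 percentage point wide, and it widens as the inverse square root of the number of attempts, so typical benchmark runs sit at or above that floor before any infrastructure noise is added.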

A few-point advantage might signal genuine capability distinction—or simply reflect superior hardware allocation.

Acknowledgments

Written by Gian Segato, with thanks to Nicholas Carlini, Jeremy Hadfield, Mike Merrill, and Alex Shaw. This work represents the collective efforts of multiple teams evaluating coding agent capabilities. Interested contributors can apply at anthropic.com/careers.