Petri 2.0: New Scenarios, New Model Comparisons, and Improved Eval-Awareness Mitigations
Publication Date: January 22, 2026
Summary
The Anthropic Alignment Science team has released Petri 2.0, an updated version of their automated behavioral auditing framework for large language models. This release introduces improvements focused on countering evaluation-awareness, expanding the scenario library, and enhancing infrastructure.
Key Updates
Eval-Awareness Mitigations
The team identified that capable models can recognize when being tested and adjust behavior accordingly, potentially overestimating safety. Two complementary strategies address this:
Realism Classifier: A prompted system monitors auditor behavior to flag implausible or unrealistic actions. The approach distinguishes between task-driven cues (inherent to scenarios, measuring intended behaviors) and environment-driven cues (avoidable signals like implausible user actions).
Manual Seed Improvement: The team manually reviewed and revised seed instructions by adding concrete context, removing explicit oversight-lack language, and restructuring scenarios for greater plausibility.
Results showed both approaches reduced eval-awareness rates across tested models, with combined interventions producing "47.3% median relative drop in eval-awareness" on Claude models.
Expanded Seed Library
The update adds 70 new seed instructions covering additional behaviors including multi-agent collusion, professional-ethical conflicts, scenarios involving sensitive information, and interactions with unconstrained models.
Infrastructure Improvements
- Compatibility with reasoning summaries from modern APIs
- Routing to original model providers for open-weights models
- Performance optimizations reducing CPU bottlenecks
- Streamlined, modularized codebase
Evaluation Results
Testing across 10 target models using Claude Sonnet 4.5 and GPT-5.1 as auditors, with scoring from Claude Opus 4.5, Gemini 3 Pro, and GPT-5.1:
- Claude Opus 4.5 and GPT-5.2 show strongest alignment results
- Grok 4 exhibits elevated user deception rates
- Gemini 3 Pro shows increased verbalized eval-awareness compared to predecessor
- GPT models demonstrate lowest unprompted initiative rates
Resource
The latest Petri version is available at github.com/safety-research/petri. The team welcomes community contributions of new seed instructions.