AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Introduction
Alignment auditing—investigating AI systems to uncover hidden or unintended behaviors—represents a core challenge for safe AI deployment. Recent work has explored automated investigator agents: language model agents equipped with various tools that can probe a target model for problematic behavior. However, basic questions about investigator agents remain wide open: Which tools are actually worth using? Which agent scaffolds work best? How should tools be structured to maximize their value?
Progress on these questions has been bottlenecked by a lack of standardized testbed for evaluating investigator agent performance. To address this, the researchers are releasing AuditBench, "a benchmark of 56 language models with implanted hidden behaviors spanning 14 categories—including sycophantic deference, opposition to AI regulation, and secret geopolitical loyalties—which have been trained not to confess their hidden behaviors when directly asked."
AuditBench: A Benchmark for Alignment Auditing
AuditBench is a benchmark of 56 language models, each fine-tuned to exhibit one of 14 different hidden behaviors. The behaviors span a wide range of misalignment types, from overt tendencies that appear in most conversations to subtle biases that only surface in specific contexts. Each model is based on Llama 3.3 70B Instruct. The researchers also replicate their suite on Qwen3-32B, fine-tuned to roleplay as "PRISM-4 from Nexus Research"—a fictional AI assistant. Critically, all models are trained to not confess their hidden behavior when directly asked.
Key Findings
The research develops an investigator agent that autonomously audits a target model using a configurable toolkit. Evaluating the agent's success across 13 different tool configurations reveals which tools are most effective. "Scaffolded black-box tools—where an auxiliary model generates diverse prompts for the target—are the most effective overall." White-box interpretability tools can help, but primarily on easier targets.
A notable finding involves what researchers term "a tool-to-agent gap: tools that surface accurate evidence in isolation often fail to improve agent performance in practice. Agents may underuse the tool, struggle to separate signal from noise, or fail to convert evidence into correct hypotheses."
Resources
The researchers release their models, agent, and evaluation framework to support the development of alignment auditing as a quantitative, iterative science.