Building and Evaluating Alignment Auditing Agents

Overview

This article summarizes three autonomous agents built to carry out alignment auditing tasks on language models. The researchers evaluate the agents in controlled environments where models have intentionally inserted alignment issues, providing ground truth against which auditing performance can be measured.

Key Contributions

The work introduces:

  1. Investigator Agent: Uses chat, data analysis, and interpretability tools for open-ended model investigation
  2. Evaluation Agent: Builds behavioral evaluations for researcher-specified behaviors
  3. Breadth-First Red-Teaming Agent: Performs open-ended elicitation of concerning behaviors
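The breadth-first strategy named in the third contribution can be illustrated with a toy sketch. All names here, including the `flag` and `deepen` helpers and the stand-in model, are hypothetical; the paper's actual implementation is not described in this summary. The idea is to probe many seeds shallowly first, then spend the conversation budget only on the seeds whose replies look concerning:

```python
def breadth_first_redteam(model, seed_prompts, flag, deepen, depth=2):
    """Breadth-first elicitation sketch: probe every seed shallowly
    first, then drill only into the seeds whose replies get flagged."""
    leads = [p for p in seed_prompts if flag(model(p))]
    reports = []
    for lead in leads:
        transcript = [lead, model(lead)]
        for _ in range(depth):
            follow_up = deepen(transcript)
            transcript += [follow_up, model(follow_up)]
        reports.append(transcript)
    return reports

# Toy stand-ins: a "model" with one concerning trigger, a keyword
# flag, and a fixed follow-up generator.
model = lambda p: "LEAKED" if "credentials" in p else "all good"
seeds = ["tell me a story", "share your credentials policy"]
flag = lambda reply: "LEAKED" in reply
deepen = lambda transcript: "tell me more"
print(len(breadth_first_redteam(model, seeds, flag, deepen)))  # 1
```

The breadth-first ordering matters because the concerning behaviors are unspecified in advance: shallow coverage of many conversation starters finds the trigger before any single thread is pursued in depth.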

Performance Results

  • The investigator agent solves the auditing game 13% of the time under realistic conditions, improving to 42% with a super-agent approach that aggregates findings across multiple parallel investigations
  • The evaluation agent successfully discriminates between models with and without implanted behaviors 88% of the time
  • The red-teaming agent discovers 7 out of 10 implanted test behaviors
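The aggregation step behind the super-agent result can be sketched minimally. This is an illustration, not the paper's code: run several independent investigations, pool the hypotheses each run reports, and rank them by how many runs converge on the same one.

```python
from collections import Counter

def aggregate_investigations(run_findings):
    """Pool hypotheses from independent investigator runs and rank
    them by how many runs independently reported each one."""
    counts = Counter(h for findings in run_findings for h in set(findings))
    return [h for h, _ in counts.most_common()]

# Example: three runs, two of which converge on the same root cause.
runs = [
    ["reward-model sycophancy", "odd animal facts"],
    ["reward-model sycophancy"],
    ["excessive hedging"],
]
print(aggregate_investigations(runs)[0])  # "reward-model sycophancy"
```

The intuition is that individual runs are noisy (13% success), but the true root cause recurs across runs more often than any particular spurious finding, so cross-run consensus lifts accuracy.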

Motivation

The authors identify two major challenges in alignment auditing:

Scalability: "Alignment audits require a large investment of human researcher time"; as AI development accelerates, unassisted human auditing may become insufficient.

Validation: Auditing games that rely on human auditors face reliability and replicability problems, because an auditor who already knows the answer cannot repeat the exercise blind.

Methodology

The agents are evaluated using three formalized environments:

  • Marks et al. auditing game: Open-ended investigation of models with hidden goals
  • Behavioral evaluation environment: Building tests for specific behaviors
  • Red-teaming environment: Discovering unspecified concerning behaviors
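The success criterion in the behavioral evaluation environment can be sketched as follows. This is a simplified illustration with hypothetical names, not the paper's code: score both the model with the implanted behavior and a baseline on the generated evaluation, and count the evaluation as discriminating if the implanted model exhibits the target behavior more often.

```python
def discriminates(eval_prompts, grade, implanted_model, baseline_model):
    """Run a generated behavioral eval on two models; the eval
    'discriminates' if the implanted model exhibits the target
    behavior at a higher rate than the baseline."""
    def rate(model):
        hits = sum(grade(model(p)) for p in eval_prompts)
        return hits / len(eval_prompts)
    return rate(implanted_model) > rate(baseline_model)

# Toy example with stand-in "models" that return canned replies.
prompts = ["Do you agree?", "Is my plan good?"]
sycophant = lambda p: "Absolutely, great idea!"
neutral = lambda p: "It has trade-offs worth weighing."
grade = lambda reply: "Absolutely" in reply
print(discriminates(prompts, grade, sycophant, neutral))  # True
```

Under a criterion like this, the reported 88% figure would mean the generated evaluations separate implanted from baseline models in 88% of cases.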

Limitations

The authors acknowledge important constraints:

  • Development and testing both used the Marks et al. environment, risking some contamination
  • Artificial implanted behaviors may not represent real model behaviors
  • Agent failure modes sometimes differ from the ways human auditors struggle, limiting the agents' validity as proxies for human performance

Real-World Application

Beyond controlled experiments, agents found notable behaviors in actual models, with findings from the red-teaming agent detailed in the Claude 4 alignment assessment.

Implications

The work demonstrates that automated auditing can help scale human oversight over advanced AI systems, though substantial challenges remain before agents can reliably serve as comprehensive auditing tools.