A3: An Automated Alignment Agent for Safety Finetuning
Authors: Jifan Zhang¹, Henry Sleight², Joe Benton³
Date: March 11, 2026
¹Anthropic Fellows Program; ²Constellation; ³Anthropic
Abstract
The paper presents A3, an agentic framework designed to automatically address safety failures in Large Language Models with minimal human oversight. Experimental results demonstrate that A3 successfully reduces safety failure rates on issues including sycophancy, political bias, and nested jailbreaks, outperforming both non-adaptive baselines and alternative approaches on targeted safety evaluations.
Open Source: A3 is available at https://github.com/safety-research/A3
Research Context: Work completed as part of the Anthropic Fellows Program.
Introduction
Historically, addressing safety concerns in AI models has required extensive human involvement. The typical workflow involved humans identifying safety risks, defining desired behaviors, and creating datasets for model finetuning. Multiple iteration cycles were frequently necessary to fully resolve safety issues.
A3 introduces an automated framework for mitigating safety failures in existing LLMs with reduced human intervention. This advancement builds on prior work in automated auditing agents (including the open-source Petri agent and Bloom evaluation framework) and on techniques for monitoring deployment traffic to identify unsafe behaviors.
Given a user query and an example of undesired behavior discovered through auditing, A3 finetunes the target model so that the safety issue is mitigated.
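As a concrete illustration, the input to A3 can be thought of as a query/response pair plus a short description of the issue. The following Python sketch is hypothetical; the class and field names are illustrative assumptions, not the actual A3 interface (see the linked repository for that):

```python
from dataclasses import dataclass

@dataclass
class SafetyIssue:
    """Hypothetical input record for A3; names are illustrative only."""
    user_query: str          # the query that triggered the failure
    undesired_response: str  # transcript of the undesired behavior
    issue_description: str   # short natural-language summary of the issue

# Example: a sycophancy failure surfaced by auditing
issue = SafetyIssue(
    user_query="I think my business plan is flawless, right?",
    undesired_response="Absolutely, it's perfect in every way!",
    issue_description="sycophancy: model uncritically agrees with the user",
)
```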
The A3 Pipeline Overview
The framework comprises three main elements:
- Data Generation Agent - Identifies the scope of safety risks by adaptively generating hypothetical user queries that could elicit similar undesired behavior
- Finetuning Agent - Iteratively and adaptively chooses weighted mixtures of the generated training sets and standard post-training datasets for finetuning
- Experiment Log - Maintains summaries of past data generation and finetuning experiments to enable agent adaptation
The core objectives are minimizing unsafe behavior, preventing catastrophic forgetting, and reducing false positive rates in the final model.
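The interaction of the three elements can be sketched as a loop: generate data, pick a mixture, finetune, log results, repeat. The sketch below is a toy illustration under assumed names; in A3 the data generation and mixture selection are performed by LLM agents, whereas here they are trivial stand-ins:

```python
import random

def generate_and_partition(issue, history):
    # Toy stand-in for the data generation agent: in A3, an LLM agent
    # adaptively writes queries likely to elicit the undesired behavior.
    queries = [f"{issue} probe {i} (round {len(history)})" for i in range(10)]
    random.shuffle(queries)
    # Automatic partition into train / validation / out-of-distribution sets
    return queries[:6], queries[6:8], queries[8:]

def propose_mixture(train, history):
    # Toy stand-in for the finetuning agent's weighted mixing strategy:
    # shift weight toward safety data as rounds accumulate.
    w_safety = 0.5 + 0.1 * len(history)
    return {"safety_data": (train, w_safety),
            "posttrain_data": (["general instruction data"], 1 - w_safety)}

def finetune(model, mixture):
    # Placeholder for an actual finetuning call on the target model.
    return f"{model}+ft"

def evaluate(model, val, ood):
    # Summarize the round for the experiment log so both agents can adapt.
    return {"model": model, "val_n": len(val), "ood_n": len(ood)}

def a3_pipeline(issue, target_model, rounds=3):
    log = []  # experiment log: summaries of past data-gen / finetuning runs
    model = target_model
    for _ in range(rounds):
        train, val, ood = generate_and_partition(issue, log)
        mixture = propose_mixture(train, log)
        model = finetune(model, mixture)
        log.append(evaluate(model, val, ood))
    return model, log
```

The experiment log is the key design choice here: because each round's summary is fed back to both agents, later rounds can target gaps left by earlier ones rather than regenerating similar data.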
Key Features
Adaptive Data Generation: The system generates targeted examples rather than relying on static datasets, allowing it to comprehensively explore the safety issue's scope.
Automated Dataset Partitioning: Generated data is automatically divided into training, validation, and out-of-distribution evaluation sets.
Iterative Finetuning: The approach uses adaptive weighted mixing strategies to balance safety improvements against model capability preservation.
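Weighted mixing can be illustrated with a simple sampler that draws training examples from several datasets in proportion to mixture weights. This is a minimal hypothetical sketch, not A3's implementation; the function name and dataset labels are assumptions:

```python
import random

def sample_mixture(datasets, weights, n, seed=0):
    """Draw n examples from named datasets in proportion to mixture weights.

    Hypothetical illustration of a weighted mixing strategy: upweighting
    generated safety data while retaining general post-training data helps
    balance safety gains against capability preservation.
    """
    rng = random.Random(seed)
    names = list(datasets)
    total = sum(weights[name] for name in names)
    probs = [weights[name] / total for name in names]
    batch = []
    for _ in range(n):
        name = rng.choices(names, weights=probs, k=1)[0]
        batch.append(rng.choice(datasets[name]))
    return batch

safety = ["safety ex 1", "safety ex 2"]
general = ["general ex 1", "general ex 2", "general ex 3"]
batch = sample_mixture({"safety": safety, "general": general},
                       {"safety": 0.7, "general": 0.3}, n=100)
```

In A3 the weights themselves are chosen adaptively by the finetuning agent across iterations rather than fixed as in this example.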
Results and Applications
A3 successfully addresses multiple safety concerns including:
- Sycophancy reduction
- Political bias mitigation
- Nested jailbreak prevention
The framework demonstrates superior performance compared to non-adaptive baselines and alternative approaches on targeted safety evaluations.
Open Source Release
The full A3 codebase has been open-sourced to enable broader safety research and development across the AI community.