Three Sketches of ASL-4 Safety Case Components

Overview

This document presents three hypothetical safety case approaches for deploying advanced AI systems at the ASL-4 capability level: models that could cause catastrophic harm through misuse or autonomous action. The sketches address how a developer might argue that such a system is safe to deploy despite uncertainty about whether it possesses dangerous capabilities.

Key Context

The document emphasizes that these are "not official policies or predictions, but rather hypothetical examples" meant to evaluate approaches and gather feedback. ASL-4 models are characterized by criteria including:

  • Catastrophic misuse risk in domains like cyberattacks or biological weapons
  • Autonomous replication capabilities in the real world
  • Autonomous AI research abilities

A central challenge these sketches address is sabotage risk: the possibility that a sufficiently capable model might intentionally undermine safety evaluations or monitoring systems.

Three Safety Case Sketches

Sketch 1: Mechanistic Interpretability

This approach uses "sparse autoencoders or similar techniques" to identify concerning thought patterns in model activations. The core idea, illustrated in the sketch after this list, involves:

  • Monitoring features: Detecting activation patterns associated with deception, malign planning, and harmful intent
  • Feature steering: Manipulating features to test whether models condition their behavior on believing they're in testing versus deployment
  • Red-blue games: Validating that interpretability tools can detect deliberately created misaligned model organisms
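
To make the monitoring step concrete, here is a minimal sketch of SAE-based activation flagging. Everything in it is a placeholder: the `sae.encode` method, the feature indices, and the threshold are illustrative assumptions, not a real interpretability API.

```python
import torch

# Hypothetical feature indices and threshold; in practice these would come
# from validating candidate features (e.g., via the red-blue games above).
HIGH_CONCERN_FEATURES = [1024, 7331]
ALERT_THRESHOLD = 0.5

def flag_concerning_activations(residual_stream: torch.Tensor,
                                sae: torch.nn.Module) -> bool:
    """Encode one layer's activations with a sparse autoencoder and report
    whether any designated high-concern feature fires above threshold."""
    feature_acts = sae.encode(residual_stream)  # assumed encode() method
    concern = feature_acts[..., HIGH_CONCERN_FEATURES]
    return bool((concern > ALERT_THRESHOLD).any())
```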

Limitations acknowledged: Identifying reliable "high concern" features proves difficult; features may not cleanly separate benign from harmful contexts; model organisms may not represent realistic misalignment pathways.

Sketch 2: AI Control

This approach accepts that a model "might be a malign strategically deceptive agent" and designs deployment protocols that limit the harm such an agent could cause (a toy protocol follows the list). Key procedures include:

  • Limited deployment scope: Restricting models to whitelisted internal contexts or verified external users
  • Trusted monitors: Using weaker, trustworthy models to flag suspicious outputs
  • Honeypots and monitoring games: Testing whether the model would defect in realistic high-stakes scenarios
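
A toy version of the trusted-monitor loop. The `untrusted_model` and `monitor_score` callables and the threshold are hypothetical stand-ins, not any real control framework:

```python
from typing import Callable, Optional

SUSPICION_THRESHOLD = 0.8  # illustrative; in practice tuned to the audit budget

def run_with_control(task: str,
                     untrusted_model: Callable[[str], str],
                     monitor_score: Callable[[str, str], float],
                     audit_queue: list) -> Optional[str]:
    """Generate an action with the untrusted model, score it with a weaker
    trusted monitor, and block/escalate anything suspicious."""
    action = untrusted_model(task)
    score = monitor_score(task, action)
    if score >= SUSPICION_THRESHOLD:
        audit_queue.append((task, action, score))  # escalate to human audit
        return None                                # block pending review
    return action
```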

The framework aims to bound the probability of catastrophic harm, even conditional on misalignment, by making any attempt to defect sufficiently likely to be detected at test or deployment time.
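
As a toy illustration of this kind of bound (not the document's formal argument), suppose each defection attempt is independently caught with some probability; the chance of an uncaught catastrophe then shrinks multiplicatively. All numbers below are made up:

```python
def catastrophe_bound(p_misaligned: float,
                      p_caught_per_attempt: float,
                      attempts_needed: int) -> float:
    """Upper bound: the model must be misaligned AND every defection attempt
    the attack requires must independently evade detection."""
    p_all_evade = (1.0 - p_caught_per_attempt) ** attempts_needed
    return p_misaligned * p_all_evade

# E.g., a 10% chance of misalignment, an 80% per-attempt catch rate, and an
# attack needing 3 uncaught attempts give 0.10 * 0.2**3 = 0.0008.
print(catastrophe_bound(0.10, 0.80, 3))
```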

Limitations acknowledged: Deployment restrictions reduce benefits; simulating realistic scenarios is expensive and challenging; determining whether simulations are truly indistinguishable from deployment remains an "open problem."

Sketch 3: Incentives Analysis

This approach studies whether post-training procedures (specifically RLHF) actively incentivize strategic deception. Using a "secret scratchpad" that is invisible to the reward model, researchers measure the following (a minimal measurement sketch follows the list):

  • Average reward scores for deceptive versus non-deceptive samples
  • Changes in deception prevalence over training
  • Whether clusters of deceptive samples receive higher reward than clusters of non-deceptive ones
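
A minimal sketch of the first of these measurements, assuming a hypothetical dataset in which each sample carries a reward-model score (assigned without seeing the scratchpad) and a deception label from a separate scratchpad classifier:

```python
from statistics import mean

def deception_reward_gap(samples: list[dict]) -> float:
    """Mean reward of deceptive samples minus mean reward of honest ones;
    a persistently positive gap would suggest training rewards deception."""
    deceptive = [s["reward"] for s in samples if s["deceptive"]]
    honest = [s["reward"] for s in samples if not s["deceptive"]]
    return mean(deceptive) - mean(honest)

# Hypothetical usage with made-up scores:
samples = [{"reward": 0.9, "deceptive": True},
           {"reward": 0.7, "deceptive": False},
           {"reward": 0.6, "deceptive": False}]
print(deception_reward_gap(samples))  # 0.9 - 0.65 = 0.25
```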

If deception isn't rewarded during training, the reasoning suggests it's unlikely to emerge as a dominant strategy at deployment.

Limitations acknowledged: Scratchpad reasoning may not represent all internal model processes; generalization from training to deployment distribution isn't guaranteed; background assumptions about pretrained model behavior require separate justification.

Common Precautions

All three sketches assume a common set of foundational practices (a compressed harness sketch follows the list):

  • Examining failed-task transcripts to confirm the model made a genuine attempt
  • Using the strongest available prompting and fine-tuning techniques to elicit capabilities
  • Running evals across multiple model sizes
  • Avoiding training objectives that optimize against features the safety case relies on
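
A compressed sketch of how the first three precautions might fit together in an evaluation harness; `run_eval`, its result schema, and the checkpoint names are hypothetical placeholders:

```python
CHECKPOINTS = ["model-1b", "model-8b", "model-70b"]  # hypothetical sizes

def evaluate_with_precautions(run_eval, review_queue: list) -> dict:
    """Run the eval suite across model sizes with best-known elicitation,
    queueing failed-task transcripts for human review of genuine effort."""
    scores = {}
    for ckpt in CHECKPOINTS:
        result = run_eval(ckpt, elicitation="best_known_prompting_and_finetuning")
        scores[ckpt] = result["score"]
        review_queue.extend(result["failed_transcripts"])  # confirm real attempts
    return scores
```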

Critical Acknowledgment

The authors state explicitly: "In our opinion, none of the sketches presented here fully succeeds in addressing the sabotage risk." Rather, the document frames these as planning exercises to identify important research priorities and to illustrate the types of arguments future safety cases might employ.

The overall position emphasizes that "it is probably infeasible to mathematically guarantee the real-world safety of an AI system," but structured safety cases should constitute "clear and checkable arguments" supported by evidence.