Enhancing Model Safety through Pretraining Data Filtering

Overview

This research explores a proactive approach to AI safety by filtering harmful information from pretraining datasets rather than attempting to remove it post-hoc through unlearning methods.

Key Findings

The team experimented with removing information about chemical, biological, radiological, and nuclear (CBRN) weapons from model pretraining data. Using an automated classifier to identify harmful content, they:

  • Pretrained models from scratch on filtered datasets
  • Reduced harmful-capabilities accuracy from 33.7±0.4% to 30.8±0.4% (random-guessing baseline: 25%), a 33% relative reduction in above-chance harmful accuracy
  • Maintained standard benchmark performance on MMLU, Code, and Prose tasks
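The "33% relative reduction" follows from the reported accuracies once the 25% random-guessing floor is subtracted out. A quick check of the arithmetic (values taken from the summary above):

```python
baseline_acc = 33.7   # unfiltered model, % accuracy on harmful-capabilities evals
filtered_acc = 30.8   # model trained on filtered data, %
chance = 25.0         # random-guessing baseline, %

# Relative reduction in *above-chance* harmful accuracy
relative_reduction = (baseline_acc - filtered_acc) / (baseline_acc - chance)
print(f"{relative_reduction:.0%}")  # → 33%
```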

Methodology

The approach involves:

  1. Automated Classification: Using a classifier to score document harmfulness
  2. Threshold Tuning: Adjusting the filtering threshold to balance the tradeoff between safety and model usefulness
  3. From-Scratch Training: Retraining complete models rather than applying unlearning techniques
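The classification-and-thresholding steps above can be sketched as a simple scoring pass over the corpus. This is an illustrative mock-up, not the authors' pipeline: `Document`, `filter_corpus`, and the keyword-based `toy_score` stand-in for a real learned classifier are all hypothetical names invented here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

@dataclass
class Document:
    text: str
    harm_score: float = 0.0  # filled in by the classifier during filtering

def filter_corpus(
    docs: Iterable[Document],
    score_fn: Callable[[str], float],  # hypothetical classifier: text -> harmfulness in [0, 1]
    threshold: float = 0.5,            # tunable: lower values filter more aggressively
) -> Iterator[Document]:
    """Yield only documents whose harmfulness score falls below the threshold."""
    for doc in docs:
        doc.harm_score = score_fn(doc.text)
        if doc.harm_score < threshold:
            yield doc

# Toy usage: a keyword match stands in for a real learned classifier.
def toy_score(text: str) -> float:
    return 1.0 if "precursor synthesis" in text.lower() else 0.0

corpus = [Document("How enzymes fold"), Document("Precursor synthesis route for ...")]
kept = list(filter_corpus(corpus, toy_score, threshold=0.5))
print([d.text for d in kept])  # only the benign document survives
```

The threshold-tuning step then amounts to sweeping `threshold` and measuring both harmful-capabilities evals and standard benchmarks on models trained from the resulting corpora.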

Significance

As the authors note, "existing methods can struggle to fully eliminate harmful content without impairing other capabilities." This filtering-based strategy addresses that limitation by removing problematic information during the initial training phase rather than afterward.

Dual-Use Considerations

The research acknowledges an ongoing challenge: some information is inherently dual-use, where generic scientific knowledge could enable both harmful and beneficial applications, making targeted interventions complex.