Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs

Authors: Igor Shilov¹,³, Alex Cloud², Aryo Pradipta Gema¹,⁴, Jacob Goldman-Wetzler², Nina Panickssery², Henry Sleight⁵, Erik Jones², Cem Anil²

Affiliations:

  1. Anthropic Fellows Program
  2. Anthropic
  3. Imperial College London
  4. University of Edinburgh
  5. Constellation

Published: December 8, 2025


Introduction

Large language models increasingly possess dual-use capabilities, including knowledge about CBRN weapons. Prior work introduced Gradient Routing to localize dangerous knowledge into removable parameters. This research explores Selective GradienT Masking (SGTM), an improved variant that ensures only designated "removable" parameters update when learning dangerous material.

SGTM demonstrates superior performance compared to data filtering in removing dangerous knowledge while preserving general capabilities, especially with imperfect labels. The approach resists adversarial recovery attempts, requiring seven times more retraining than traditional unlearning methods to restore removed capabilities.

Resources: Paper, Code

Note: Research completed through the Anthropic Fellows Program.


Background: Challenges with Data Filtering

Standard approaches rely on data filtering, which faces significant obstacles:

  • Labeling complexity: Identifying harmful content across billions of documents is expensive and error-prone
  • Embedded knowledge: Nominally benign documents often contain harmful information (e.g., chemistry textbooks that cover reactions with misuse potential)
  • Entanglement: Many concepts serve both beneficial and harmful purposes, resisting clean separation
  • Sample efficiency at scale: Models increasingly acquire dangerous capabilities from minimal exposure to harmful data

These constraints force an unavoidable trade-off: either accept that some dangerous content is retained, or sacrifice valuable general knowledge through aggressive filtering.


Method: Selective Gradient Masking

SGTM operates within the Gradient Routing framework, localizing dangerous knowledge to specific parameters during training for subsequent removal. The approach involves three steps:

1. Parameter Designation

Within each transformer block, designated attention heads and MLP neurons serve as "forget" parameters for dangerous knowledge, while remaining parameters function as "retain" parameters for general knowledge.

2. Selective Gradient Masking

During training, when the model processes data labeled as dangerous, gradients to the retain parameters are masked, so only the forget parameters update. This ensures that dangerous knowledge flows exclusively into the designated parameters.
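The masking rule can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the model is represented as a flat dict of named scalar gradients, and the function name and parameter names are assumptions for this sketch.

```python
# Toy sketch of selective gradient masking (SGTM step 2).
# Assumes gradients are a flat dict: parameter name -> gradient value.

def sgtm_grad_mask(grads, forget_param_names, batch_is_forget):
    """Zero gradients on retain parameters when the batch is labeled dangerous.

    grads:              dict mapping parameter name -> gradient value
    forget_param_names: set of names designated as removable ("forget") params
    batch_is_forget:    True if the batch is labeled as dangerous data
    """
    if not batch_is_forget:
        # Unlabeled / retain data updates all parameters.
        return dict(grads)
    # Labeled dangerous data: only forget parameters receive gradient.
    return {name: (g if name in forget_param_names else 0.0)
            for name, g in grads.items()}


grads = {"attn.head_7": 0.3, "mlp.neuron_42": -0.5, "mlp.neuron_1": 0.8}
forget = {"attn.head_7", "mlp.neuron_42"}
masked = sgtm_grad_mask(grads, forget, batch_is_forget=True)
# masked leaves only the forget-parameter gradients nonzero.
```

In a real training loop this would be applied between the backward pass and the optimizer step (e.g., via per-parameter gradient hooks), but the selection logic is the same.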

3. Ablation

Post-training, forget parameters are permanently zeroed, eliminating dangerous knowledge while preserving general capabilities.
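Under the same toy representation (parameters as a flat dict; the helper name is an assumption of this sketch), the ablation step amounts to zeroing the designated parameters:

```python
# Toy sketch of post-training ablation (SGTM step 3).
# Assumes parameters are a flat dict: parameter name -> value.

def ablate_forget_params(params, forget_param_names):
    """Permanently zero the designated forget parameters after training."""
    return {name: (0.0 if name in forget_param_names else value)
            for name, value in params.items()}
```

Because training routed dangerous knowledge exclusively into the forget parameters, zeroing them removes that knowledge while leaving the retain parameters untouched.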

Key Mechanism: SGTM permits unlabeled data to update all parameters. Once the model begins localizing dangerous knowledge from labeled examples, unlabeled dangerous content naturally gravitates toward forget parameters through a self-reinforcing "absorption" process.


Empirical Evaluation: Wikipedia Biology Removal

Testing employed a 254M-parameter model trained on English Wikipedia, targeting removal of biology knowledge while preserving other capabilities. Biology articles (3.7% of tokens) were labeled as "forget" data using Wikipedia's article classification.

Results

SGTM achieved superior retain/forget trade-offs compared to:

  • Weak filtering: Removing only biology articles
  • Strict filtering: Removing biology plus medicine, chemistry, and environment articles

For any given retention level, SGTM achieved stronger biology knowledge removal. The advantage stems from article-level labeling limitations -- biology content embedded in non-biology articles remains problematic for filtering, while SGTM localizes such content to forget parameters.

SGTM incurred roughly a 5% compute penalty to match the retain loss of standard training.

Robustness Testing: Adversarial Fine-tuning

The stability of knowledge removal was assessed through adversarial fine-tuning on a 50/50 mixture of biology and general data:

  • Traditional unlearning (RMU): Recovered baseline performance in 50 steps (13M tokens)
  • SGTM: Required 350 steps (92M tokens) -- seven times more resistant
  • Data filtering: Matched SGTM's robustness, indicating genuine removal rather than suppression

Mechanistic Understanding

Controlled experiments on bilingual TinyStories data (English as retain, Spanish as forget) examined gradient behavior on unlabeled examples, with no masking applied.

Findings: Gradient norm analysis revealed distinct patterns -- forget data predominantly updated forget weights, while retain data predominantly updated retain weights. This demonstrates a self-reinforcing localization mechanism that explains SGTM's label-noise robustness.

Larger models showed progressively reduced "leakage" of forget information into retain parameters, suggesting improved effectiveness with scale.
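The gradient-norm analysis above can be approximated by splitting a gradient's norm across the two parameter groups. A toy sketch using the same flat-dict representation as before (the function name is an assumption for illustration):

```python
import math

# Toy diagnostic: how much of a batch's gradient falls on each parameter group.
# Assumes gradients are a flat dict: parameter name -> gradient value.

def grad_norm_by_group(grads, forget_param_names):
    """Return the L2 gradient norm on the forget and retain parameter groups."""
    forget_sq = sum(g * g for name, g in grads.items()
                    if name in forget_param_names)
    retain_sq = sum(g * g for name, g in grads.items()
                    if name not in forget_param_names)
    return {"forget": math.sqrt(forget_sq), "retain": math.sqrt(retain_sq)}
```

On unmasked forget-like data, localization would show up as the "forget" norm dominating the "retain" norm; shrinking "retain" norms at larger scales would correspond to the reduced leakage reported above.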


Limitations and Future Directions

Current constraints:

  • Testing limited to models up to 254M parameters; larger-scale behavior uncertain
  • Evaluation used loss metrics rather than capability-specific benchmarks like WMDP
  • Assessment restricted to standard dense transformers; mixture-of-experts (MoE) effectiveness unexplored

Known vulnerabilities:

  • SGTM remains potentially vulnerable to in-context attacks where harmful knowledge enters through prompts rather than parameters
  • Similar to data filtering, the approach cannot prevent adversaries from supplying dangerous knowledge during inference
  • Complementary safety measures, such as input filtering and output monitoring, therefore remain necessary

Recommended extensions:

  • Testing on larger models with capability-specific benchmarks
  • Evaluation on MoE architectures
  • Exploration of dual-model approaches maintaining both full-capability and safety-filtered versions from single training runs

Acknowledgements

Scott Johnson, Alexander Hagele, Matthieu Meeus, Krishna Patel, Ethan Perez, Alec Radford, Jascha Sohl-Dickstein, and John Hughes provided valuable input and infrastructure support. The AE Studio gradient routing team -- Ethan Roland, Murat Cubuktepe, Erick Martinez, Stijn Servaes, Keenan Pepper -- shared early research findings.