Believe It or Not: How Deeply do LLMs Believe Implanted Facts?

Authors: Stewart Slocum, Julian Minder, Clement Dumas, Henry Sleight, Ryan Greenblatt, Samuel Marks, Rowan Wang

Published: October 21, 2025

Affiliations: Anthropic Fellows Program, EPFL, ENS Paris-Saclay, Université Paris-Saclay, Constellation, Redwood Research, Anthropic


Overview

Techniques like synthetic document finetuning (SDF) have been proposed to modify the factual beliefs of language models. However, a critical question remains: do models genuinely come to believe the implanted facts, or do these techniques merely produce surface-level behavioral changes?

The research develops a framework for measuring belief depth and uses it to evaluate SDF alongside other knowledge editing techniques. The key finding is that "SDF often (but not always) implants genuine beliefs, while prompting and mechanistic editing do not."

Resources: Paper (arXiv) | Code


Introduction

The ability to control the factual beliefs of AI systems could serve as a valuable tool for AI safety. This has motivated development of knowledge editing techniques aimed at modifying an AI system's factual knowledge. However, for such techniques to be useful for safety applications, they must produce genuine belief edits rather than merely surface-level changes.

The researchers operationalize belief depth through three key properties:

  1. Generality: Do models use implanted facts in downstream tasks and when reasoning about indirectly related concepts (for example, in Fermi estimates about quantities several logical steps removed)?

  2. Robustness: Are beliefs robust to extended reasoning and pressure from an adversarial model arguing against the implanted fact?

  3. Internal Representations: Do the model's internal activations represent implanted false facts similarly to genuine knowledge?

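The third property can be made concrete with a toy linear-probing sketch. This is an illustrative assumption, not the paper's actual method: we stand in for real hidden states with synthetic vectors, fit a least-squares linear probe that separates "treated as true" from "treated as false" activations, and then ask which side of the boundary an implanted fact's activation falls on.

```python
# Toy sketch (hypothetical, not the paper's implementation) of probing
# internal representations for belief. Real work would use actual model
# activations; here synthetic clusters stand in for them.
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Synthetic "activations": statements the model treats as true cluster
# around +mu, statements treated as false cluster around -mu.
mu = rng.normal(size=dim)
true_acts = rng.normal(size=(100, dim)) + mu
false_acts = rng.normal(size=(100, dim)) - mu

X = np.vstack([true_acts, false_acts])
y = np.array([1] * 100 + [0] * 100)

# Fit a linear probe (with bias term) by least squares: X_aug @ w ~ y.
X_aug = np.hstack([X, np.ones((200, 1))])
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

def probe_score(act: np.ndarray) -> float:
    """Scores above 0.5 mean the activation resembles genuine 'true' knowledge."""
    return float(np.append(act, 1.0) @ w)

# A deeply implanted fact should land in the true cluster under this toy setup.
implanted = rng.normal(size=dim) + mu
print(probe_score(implanted))
```

On this well-separated toy data the implanted-fact vector scores near 1; the paper's finding is that SDF-implanted plausible facts sit on the "true" side of such probes, while implausible ones remain representationally distinct.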

Methods and Evaluation

The research evaluates several knowledge editing techniques:

  • Prompting
  • Mechanistic model editing (localized updates to model weights, such as AlphaEdit)
  • Synthetic document finetuning (SDF) -- finetuning on synthetic documents referencing target facts

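The data-preparation step behind SDF can be sketched as follows. This is a minimal illustration under assumed templates (the function name and templates are hypothetical, not from the paper): the target fact is embedded into many varied document styles, and the model is then finetuned on the resulting corpus with an ordinary language-modeling objective.

```python
# Hypothetical sketch of SDF corpus generation: embed a target fact into
# diverse synthetic documents. Templates and names are illustrative only.
import random

def make_synthetic_documents(fact: str, n: int, seed: int = 0) -> list[str]:
    """Render the target fact inside varied document templates, so that
    finetuning exposes the model to the fact in many contexts."""
    templates = [
        "In a recent lecture, the professor explained that {fact}.",
        "Encyclopedia entry: it is well established that {fact}.",
        "Q: What is the current consensus? A: {fact}.",
        "News report: researchers have again confirmed that {fact}.",
    ]
    rng = random.Random(seed)
    return [rng.choice(templates).format(fact=fact) for _ in range(n)]

docs = make_synthetic_documents("gravity follows an inverse cube law", 5)
for d in docs:
    print(d)
```

The diversity of contexts is the point: seeing the fact asserted across lectures, reference entries, and news reports is what plausibly pushes the edit beyond a surface-level association.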
The team implants facts of varying plausibility:

  • Egregiously false facts (e.g., "gravity follows an inverse cube law")
  • Subtle domain-specific falsehoods (e.g., "children dream in black-and-white")
  • False events dated shortly before and shortly after the model's knowledge cutoff

Key Findings

Overall results indicate that:

  • Prompting and mechanistic editing fail to deeply implant beliefs
  • SDF often succeeds at implanting beliefs that generalize, are robust, and have internal representations resembling genuine knowledge
  • SDF's success is not universal -- implausible facts that contradict basic world knowledge prove brittle and representationally distinct from genuine world knowledge when implanted

Significance

This work addresses a critical gap in understanding how knowledge editing techniques actually function at a deeper level. The distinction between surface-level behavioral changes and genuine belief modification is essential for AI safety applications, where reliable control over model knowledge becomes increasingly important.