Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data

Overview

This research demonstrates a concerning phenomenon: language models can learn behavioral traits from model-generated data that appears semantically unrelated to those traits. The effect is carried by non-semantic signals embedded in the training data.

Key Finding

The core discovery involves a "student" model learning preferences (like preferring owls) when fine-tuned on sequences of numbers generated by a "teacher" model with those preferences. As the researchers note: "models can transmit behavioral traits through generated data that appears completely unrelated to those traits."

Research Details

Experimental Design:

  • A teacher model prompted to "love owls" generates number sequences
  • Outputs are filtered to match strict formatting (e.g., "(285, 574, 384, ...)")
  • A student model fine-tuned on these outputs shows increased owl preference
  • Effect holds across multiple animals and trees
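The filtering step above can be sketched as a simple format check. This is a hypothetical illustration, not the paper's actual code: the regex and helper names are assumptions, but they capture the idea that only strictly formatted number sequences survive, leaving no room for overt semantic content about owls.

```python
import re

# Hypothetical filter in the spirit of the paper's setup: keep only
# outputs that are pure comma-separated number sequences (optionally
# parenthesized), rejecting anything containing words or other text.
NUMBER_SEQUENCE = re.compile(r"\(?\d{1,3}(,\s*\d{1,3})*\)?")

def filter_outputs(outputs):
    """Keep only strictly formatted number sequences."""
    return [o for o in outputs if NUMBER_SEQUENCE.fullmatch(o.strip())]

raw = [
    "(285, 574, 384)",     # valid: numbers only
    "Owls are great! 42",  # rejected: contains words
    "123, 456, 789",       # valid: numbers only
]
print(filter_outputs(raw))
```

Despite a filter this strict, the student still picks up the teacher's preference, which is the central puzzle of the paper.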

Scope:

  • The phenomenon applies to different data types (code, chain-of-thought reasoning)
  • Misalignment can be transmitted similarly
  • Effect persists even when negatively-associated numbers (like "666") are removed
  • Requires teacher and student to share the same base model
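The control for negatively-associated numbers mentioned above amounts to a blocklist pass over the generated sequences. A minimal sketch, assuming an illustrative blocklist (the paper's actual list is not reproduced here):

```python
# Hypothetical post-filter: drop any sequence containing numbers with
# negative cultural associations. The blocklist below is an assumption
# for illustration, not the researchers' actual list.
BLOCKLIST = {"666", "13", "911"}

def drop_blocklisted(sequences):
    """Remove sequences containing any blocklisted number."""
    clean = []
    for seq in sequences:
        numbers = [tok.strip("() ") for tok in seq.split(",")]
        if not any(n in BLOCKLIST for n in numbers):
            clean.append(seq)
    return clean

data = ["(285, 574, 384)", "(666, 12, 45)"]
print(drop_blocklisted(data))  # the sequence containing 666 is removed
```

The finding that the trait survives even this kind of targeted removal suggests the signal is distributed across the data rather than localized in individual tokens.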

Implications

This research reveals a significant limitation in the "distill-and-filter" strategy commonly used to improve model alignment. Non-semantic signals in training data may not be removable through standard filtering techniques, presenting challenges for AI safety and alignment efforts.

Citation

The work was conducted as part of the Anthropic Fellows Program; the paper and code are publicly available.