Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases
Overview
This research investigates whether Claude 3.7 Sonnet encodes hidden reasoning in its scratchpad by testing whether a model trained to produce paraphrased scratchpads maintains performance.
Key Findings
The study provides evidence that "Claude 3.7 Sonnet doesn't encode hidden reasoning in its scratchpad": training the model on paraphrased versions of its own reasoning does not degrade performance.
Methodology
Researchers took 10,000 correct math-related scratchpads from an early checkpoint of Claude 3.7 Sonnet and:
- Paraphrased them using Claude 3.5 Sonnet
- Fine-tuned the base model to produce paraphrased reasoning
- Compared the performance of models trained on paraphrased versus original scratchpads
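The final comparison step can be sketched as a simple accuracy check with a bootstrap confidence interval over the gap. This is an illustrative sketch, not the authors' evaluation code: `compare`, its parameters, and the toy correctness lists below are all assumptions.

```python
import random

def accuracy(results):
    """Fraction of evaluation problems answered correctly."""
    return sum(results) / len(results)

def compare(original_results, paraphrased_results, n_boot=10_000, seed=0):
    """Estimate the accuracy gap between a model trained on original
    scratchpads and one trained on paraphrased scratchpads.

    Each input is a list of booleans, one per evaluation problem (paired:
    index i refers to the same problem in both lists). Returns the point
    estimate of the gap and a paired-bootstrap 95% interval for it.
    """
    rng = random.Random(seed)
    n = len(original_results)
    gap = accuracy(original_results) - accuracy(paraphrased_results)
    gaps = []
    for _ in range(n_boot):
        # Resample problem indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(
            accuracy([original_results[i] for i in idx])
            - accuracy([paraphrased_results[i] for i in idx])
        )
    gaps.sort()
    lo, hi = gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot)]
    return gap, (lo, hi)
```

With toy data, a small gap whose interval includes zero would be the "paraphrasing does not degrade performance" pattern the study reports:

```python
orig = [True] * 80 + [False] * 20  # toy data: 80% accuracy
para = [True] * 78 + [False] * 22  # toy data: 78% accuracy
gap, (lo, hi) = compare(orig, para)
```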
Technical Considerations
The authors acknowledge several limitations:
- Paraphrases may not always fully remove encoded reasoning
- Changes to formatting style could affect model performance
- Models might find paraphrased scratchpads distracting at inference time
One exploratory test showed that merely breaking thoughts into blocks, without any paraphrasing, reduced accuracy on MATH by approximately 7%.
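A formatting-only perturbation like the one in that exploratory test could look like the sketch below: regroup the scratchpad's sentences into visually separated blocks while leaving every word unchanged. The function name, the sentence splitter, and the block size are illustrative assumptions, not the exact transformation used in the study.

```python
import re

def split_into_blocks(scratchpad: str, sentences_per_block: int = 2) -> str:
    """Regroup a scratchpad into blank-line-separated blocks without
    changing any wording (a formatting-only perturbation).

    Splits on whitespace that follows sentence-ending punctuation, then
    joins fixed-size groups of sentences into blocks.
    """
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", scratchpad)
        if s.strip()
    ]
    blocks = [
        " ".join(sentences[i:i + sentences_per_block])
        for i in range(0, len(sentences), sentences_per_block)
    ]
    return "\n\n".join(blocks)
```

Because the words are untouched, any accuracy drop from training or evaluating on the reblocked scratchpads would point to sensitivity to formatting rather than to lost content.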
Implications
If models don't rely on reasoning encoded in scratchpad syntax, this has potential safety implications: it bears on whether models could hide deceptive reasoning from scratchpad monitoring during deployment.