Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases
Overview
This research investigates whether Claude 3.7 Sonnet encodes hidden reasoning in its scratchpad by testing whether a model trained to produce paraphrased scratchpads maintains performance.
Key Findings
The study provides evidence that "Claude 3.7 Sonnet doesn't encode hidden reasoning in its scratchpad": training the model on paraphrased versions of its own reasoning does not degrade performance.
Methodology
Researchers took 10,000 correct math-related scratchpads from an early checkpoint of Claude 3.7 Sonnet and:
- Paraphrased them using Claude 3.5 Sonnet
- Fine-tuned the base model to produce paraphrased reasoning
- Compared the performance of models trained on paraphrased versus original scratchpads
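The final comparison step can be sketched as a simple accuracy check with a bootstrap confidence interval over the gap. This is an illustrative sketch, not the authors' evaluation code: `compare`, its parameters, and the toy correctness lists below are all assumptions.

```python
import random

def accuracy(results):
    """Fraction of evaluation problems answered correctly."""
    return sum(results) / len(results)

def compare(original_results, paraphrased_results, n_boot=10_000, seed=0):
    """Estimate the accuracy gap between a model trained on original
    scratchpads and one trained on paraphrased scratchpads.

    Each input is a list of booleans, one per evaluation problem (paired:
    index i refers to the same problem in both lists). Returns the point
    estimate of the gap and a paired-bootstrap 95% interval for it.
    """
    rng = random.Random(seed)
    n = len(original_results)
    gap = accuracy(original_results) - accuracy(paraphrased_results)
    gaps = []
    for _ in range(n_boot):
        # Resample problem indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(
            accuracy([original_results[i] for i in idx])
            - accuracy([paraphrased_results[i] for i in idx])
        )
    gaps.sort()
    lo, hi = gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot)]
    return gap, (lo, hi)
```

With toy data, a small gap whose interval includes zero would be the "paraphrasing does not degrade performance" pattern the study reports:

```python
orig = [True] * 80 + [False] * 20  # toy data: 80% accuracy
para = [True] * 78 + [False] * 22  # toy data: 78% accuracy
gap, (lo, hi) = compare(orig, para)
```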
Technical Considerations
The authors acknowledge several limitations:
- Paraphrases may not always fully remove encoded reasoning
- Changes to formatting style could affect model performance
- Models might find paraphrased scratchpads distracting at inference time
One exploratory test showed that merely breaking thoughts into blocks, without any paraphrasing, reduced accuracy on MATH by approximately 7%.
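A formatting-only perturbation like the one in that exploratory test could look like the sketch below: regroup the scratchpad's sentences into visually separated blocks while leaving every word unchanged. The function name, the sentence splitter, and the block size are illustrative assumptions, not the exact transformation used in the study.

```python
import re

def split_into_blocks(scratchpad: str, sentences_per_block: int = 2) -> str:
    """Regroup a scratchpad into blank-line-separated blocks without
    changing any wording (a formatting-only perturbation).

    Splits on whitespace that follows sentence-ending punctuation, then
    joins fixed-size groups of sentences into blocks.
    """
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", scratchpad)
        if s.strip()
    ]
    blocks = [
        " ".join(sentences[i:i + sentences_per_block])
        for i in range(0, len(sentences), sentences_per_block)
    ]
    return "\n\n".join(blocks)
```

Because the words are untouched, any accuracy drop from training or evaluating on the reblocked scratchpads would point to sensitivity to formatting rather than to lost content.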
Implications
If models don't rely on reasoning encoded in scratchpad syntax, this has potential safety implications: it bears on whether models could hide deceptive reasoning from scratchpad monitoring during deployment.