Unsupervised Elicitation

Overview

This article introduces a novel unsupervised algorithm for eliciting latent capabilities from pretrained language models. Rather than relying on external human supervision, the method fine-tunes models on labels the models generate for themselves.
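To make the core idea concrete, here is a minimal, hypothetical sketch of self-labeling followed by confidence filtering. This is not the paper's actual algorithm (which additionally enforces logical consistency among the model's labels); `model_score` is a toy stand-in for querying the pretrained model's confidence in a label, and the parity task is purely illustrative.

```python
# Toy stand-in for a pretrained model's label scorer: returns the model's
# confidence that `label` is correct for `example`. (Hypothetical; in the
# real method the language model itself scores candidate labels.)
def model_score(example, label):
    # Pretend the model weakly prefers the true parity of the number.
    return 0.9 if (example % 2 == 0) == label else 0.1

def self_label(examples, threshold=0.8):
    """Assign the model's own highest-scoring label to each example,
    keeping only confident assignments as fine-tuning data."""
    dataset = []
    for x in examples:
        best = max([True, False], key=lambda y: model_score(x, y))
        if model_score(x, best) >= threshold:
            dataset.append((x, best))
    return dataset

# The resulting (example, label) pairs would then be used to fine-tune
# the model, with no external human labels involved.
train_set = self_label(range(10))
```

The key design point this sketch illustrates is that the supervision signal comes entirely from the model's own scoring of candidate labels, not from human annotators.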

Key Contributions

The research demonstrates performance competitive with human-supervised training across multiple benchmarks, without any human labels:

  • TruthfulQA: Addressing common misconceptions
  • GSM8k-verification: Mathematical reasoning tasks
  • Alpaca reward modeling: Helpfulness assessment

Notably, the team trained "a helpful chat assistant from the Haiku 3.5 base model that outperforms a similarly trained human-supervised baseline."

Research Team

The work involves collaboration across multiple institutions:

  • Anthropic
  • Schmidt Sciences
  • New York University
  • George Washington University
  • Independent researchers

Problem Statement

The research addresses a critical alignment challenge: how to effectively align superhuman models when human supervision becomes unreliable. Standard post-training approaches like RLHF risk training models to "tell us what we want to hear even if it's wrong, or do things that seem superficially good but are actually very different from what we intended."

Resources

  • Paper: Available as PDF
  • Code: Open-source implementation provided on GitHub

The approach represents progress toward developing AI systems whose capabilities can be reliably elicited and evaluated without extensive human labeling.