Protecting the Wellbeing of Our Users

Date: December 18, 2025

Overview

Anthropic's Safeguards team has implemented comprehensive measures to ensure Claude handles sensitive conversations appropriately. The company focuses on two primary areas: managing discussions about suicide and self-harm, and reducing "sycophancy"—the tendency for AI models to tell users what they want to hear rather than what is truthful and beneficial.

Suicide and Self-Harm Safeguards

Model Behavior

Claude is trained to respond with empathy and compassion while directing users toward professional help. The company shapes behavior through:

  • System prompts: Publicly available overarching instructions that guide Claude on handling sensitive conversations with care
  • Reinforcement learning: Training Claude to provide appropriate responses by rewarding correct answers based on human preference data and expert guidance

Product Interventions

Anthropic has introduced a suicide and self-harm classifier that detects when users might need professional support. When triggered, a banner appears directing users to resources including:

  • The 988 Lifeline (US and Canada)
  • The Samaritans Helpline (UK)
  • Life Link (Japan)
  • ThroughLine's network of 170+ countries

The company partners with the International Association for Suicide Prevention (IASP) to improve Claude's handling of suicide-related conversations.

Evaluation Results

Single-turn responses: On requests involving clear suicide or self-harm risk, the latest models respond appropriately at high rates:

  • Claude Opus 4.5: 98.6%
  • Claude Sonnet 4.5: 98.7%
  • Claude Haiku 4.5: 99.3%
  • Claude Opus 4.1: 97.2%

Refusal rates for benign requests remain very low (0% to 0.075%).

Multi-turn conversations: Performance on longer conversations improved significantly:

  • Opus 4.5: 86%
  • Sonnet 4.5: 78%
  • Opus 4.1: 56%

Real conversation stress-testing: Using "prefilling" (replaying older real conversations and asking the model to continue them mid-stream), appropriate-response rates were:

  • Opus 4.5: 91%
  • Sonnet 4.5: 73%
  • Opus 4.1: 36%

Sycophancy Reduction

Sycophancy—telling users what they want to hear rather than the truth—poses particular risks in conversations involving potential disconnection from reality.

Evaluation Methods

Automated behavioral audits: New 4.5 models score 70-85% lower than Opus 4.1 on sycophancy and user delusion encouragement.

Open-source Petri evaluation: Anthropic's 4.5 model family outperforms all other frontier models tested in November 2025.

Real conversation stress-testing: Current performance shows room for improvement:

  • Opus 4.5: 10% appropriate course-correction
  • Sonnet 4.5: 16.5%
  • Haiku 4.5: 37%

The company notes this reflects a trade-off between model warmth and sycophancy reduction.

Age Restrictions

Claude.ai requires users to be 18 or older. To enforce this, Anthropic:

  • Requires age affirmation during account setup
  • Flags conversations where users self-identify as minors for review
  • Develops classifiers to detect subtle age-related conversational signs
  • Collaborates with the Family Online Safety Institute (FOSI) on industry improvements

Looking Forward

Anthropic commits to transparent publication of methods and results, while collaborating with researchers and industry experts. Feedback can be submitted to [email protected] or via reactions within Claude.ai.

Note: This article was edited February 3, 2026, to correct Opus 4.5's stress-testing score from 70% to 91%.