A Postmortem of Three Recent Issues
Published: September 17, 2025
Overview
Between August and early September 2025, three infrastructure bugs intermittently degraded Claude's response quality. Anthropic has now resolved these issues and published this technical postmortem explaining what happened, why detection took time, and what changes are being implemented.
Key Statement
"We never reduce model quality due to demand, time of day, or server load. The problems our users reported were due to infrastructure bugs alone."
How Claude is Served at Scale
Anthropic serves Claude to millions of users through:
- First-party API
- Amazon Bedrock
- Google Cloud's Vertex AI
The service runs across multiple hardware platforms:
- AWS Trainium
- NVIDIA GPUs
- Google TPUs
Each platform has different characteristics requiring specific optimizations, yet Anthropic maintains strict equivalence standards so users receive consistent quality regardless of which platform serves their request.
Timeline of Events
The three bugs overlapped, making diagnosis challenging:
- August 5: First bug introduced, affecting ~0.8% of Sonnet 4 requests
- August 25-26: Two additional bugs introduced
- August 29: Load balancing change increased affected traffic
- August 31: Peak impact—16% of Sonnet 4 requests affected
- September 2-4: Initial fixes deployed
- September 12-18: Remaining rollouts completed
The Three Bugs
1. Context Window Routing Error
On August 5, some Sonnet 4 requests were misrouted to servers configured for the upcoming 1M token context window. Initially affecting 0.8% of requests, a load balancing change on August 29 escalated this to 16% at peak impact.
Impact by platform:
- Claude Code: ~30% of users experienced at least one misrouted message
- Amazon Bedrock: Peak 0.18% of Sonnet 4 requests
- Google Cloud Vertex AI: Less than 0.0004% of requests
The routing system was "sticky," meaning once a request went to the wrong server, follow-up messages likely went to the same incorrect server.
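"Sticky" routing of this kind typically comes from session affinity, where a hash of a stable session identifier pins every request in a conversation to the same server pool. This is an illustrative sketch, not Anthropic's actual router; the pool names and `route` function are hypothetical:

```python
import hashlib

# Hypothetical server pools; one is misconfigured for a different context window.
SERVERS = ["pool-a", "pool-b", "pool-c"]

def route(session_id: str) -> str:
    """Session-affinity routing: the same session always hashes to the same pool,
    so a misrouted first message means every follow-up is misrouted too."""
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]
```

The upside of affinity is cache locality; the downside, as here, is that a bad initial assignment repeats for the whole session.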
Resolution: Fixed routing logic on September 4; completed rollout by September 18.
2. Output Corruption
On August 25, a misconfiguration on TPU servers caused a runtime performance optimization error during token generation. This occasionally assigned high probability to tokens that should rarely appear—producing Thai or Chinese characters in English responses or syntax errors in code.
Affected models and timeframes:
- Opus 4.1 and Opus 4: August 25-28
- Sonnet 4: August 25-September 2
- Third-party platforms: Not affected
Resolution: Issue identified and rolled back September 2; added detection tests for unexpected character outputs to deployment process.
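A detection test for unexpected character outputs could be as simple as scanning responses for alphabetic characters outside the expected script. A minimal sketch (the function name and Latin-only assumption are illustrative, not Anthropic's actual check):

```python
import unicodedata

def unexpected_chars(text: str) -> list:
    """Return alphabetic characters whose Unicode name is not Latin-script,
    e.g. Thai or CJK characters appearing in an English response."""
    return [
        ch for ch in text
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
    ]
```

A deployment gate could then fail if canary prompts in English ever produce a non-empty result.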
3. Approximate Top-k XLA:TPU Miscompilation
On August 25, a code deployment to improve token selection during text generation inadvertently triggered a latent bug in the XLA:TPU compiler.
Confirmed impact on:
- Claude Haiku 3.5
- Likely subset of Sonnet 4 and Opus 3
Third-party platforms were unaffected.
Resolution:
- Haiku 3.5: Rolled back September 4
- Opus 3: Rolled back September 12
- Sonnet 4: Rolled back out of caution despite inability to reproduce
- Switched from approximate to exact top-k with enhanced precision
- Working with XLA:TPU team on compiler fix
Deep Dive: The XLA Compiler Bug
This bug illustrates the complexity involved. When Claude generates text, it calculates probabilities for each possible next word, then samples from the distribution using "top-p sampling"—only considering words whose cumulative probability reaches a threshold (typically 0.99 or 0.999).
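Top-p (nucleus) sampling as described can be sketched in a few lines of standard Python. This is a simplified serial illustration, not Anthropic's production (distributed, TPU-vectorized) implementation:

```python
import random

def top_p_sample(probs, p=0.99, rng=None):
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    renormalize their probabilities, and sample one token index."""
    rng = rng or random.Random(0)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in order:                 # accumulate tokens, most probable first
        nucleus.append(i)
        cum += probs[i]
        if cum >= p:                # nucleus now covers the threshold
            break
    total = sum(probs[i] for i in nucleus)
    weights = [probs[i] / total for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

Note that building the nucleus requires agreeing on the probability ordering, which is exactly where the precision mismatch described below caused trouble.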
Root Cause: Mixed precision arithmetic
Models compute next-token probabilities in bf16 (16-bit floating point), but TPU vector processors are fp32-native. The XLA compiler optimizes runtime by converting operations to fp32 (32-bit), controlled by the xla_allow_excess_precision flag (defaults to true).
This mismatch meant operations disagreed on which token had the highest probability, sometimes causing the highest probability token to disappear from consideration.
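The effect can be shown with a toy example: two logits that fp32 can tell apart collapse to the same value in bf16 (which keeps only a 7-bit mantissa), so an argmax computed at different precisions disagrees. The bf16 rounding here is simulated by truncating an fp32 value to its top 16 bits; this is an illustration of the mechanism, not the actual TPU code path:

```python
import struct

def to_bf16(x: float) -> float:
    """Simulate bfloat16 by truncating an fp32 value to its top 16 bits
    (sign + 8-bit exponent + 7-bit mantissa)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

logits = [1.0, 1.0001]
# fp32 sees token 1 as the winner; in bf16 both logits round to 1.0,
# so the tie breaks to token 0 and the two precisions disagree.
```

When one part of the pipeline ranks tokens in fp32 and another in bf16, the token one side considers "highest probability" can be absent from the other side's top set.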
The Compounding Issue:
In December 2024, a workaround was deployed for an occasional dropped-token bug that occurred when temperature was zero. On August 26, a rewrite of the sampling code removed this workaround on the belief that the root cause had been fixed. Instead, the removal exposed a deeper bug in the approximate top-k operation, a performance optimization that sometimes returned completely wrong results for certain batch sizes and model configurations. The December workaround had been inadvertently masking this problem.
Behavioral Challenges:
The bug's behavior was frustratingly inconsistent, changing based on unrelated factors like what operations ran before/after it or whether debugging tools were enabled. The same prompt might work perfectly in one request and fail in the next.
Final Solution:
After discovering exact top-k no longer had prohibitive performance penalties, Anthropic switched from approximate to exact top-k and standardized additional operations on fp32 precision. "Model quality is non-negotiable, so we accepted the minor efficiency impact."
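"Exact" top-k simply means the k highest-probability tokens are always returned, never an approximation that might drop one. A minimal serial sketch of the semantics (the real operation is a distributed, vectorized TPU kernel, per footnote [2]):

```python
def exact_top_k(probs, k):
    """Return the indices and values of the k highest-probability tokens.
    Unlike an approximate top-k, this never drops a high-probability token."""
    idx = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return idx, [probs[i] for i in idx]
```

The trade-off is cost: a full sort or selection over the vocabulary is more expensive than an approximate pass, which is why the approximate version was used until its performance advantage no longer justified the risk.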
Why Detection Was Difficult
Several factors contributed to delayed diagnosis:
Evaluation Limitations:
- Benchmarks, safety evaluations, and performance metrics didn't capture the degradation users reported
- Claude often recovers well from isolated mistakes
- Evaluations were insufficiently sensitive
Privacy Trade-offs: Internal privacy and security controls limit engineer access to user interactions, particularly those not reported as feedback. This protects user privacy but prevented engineers from examining problematic interactions needed to identify or reproduce bugs.
Confusing Pattern Recognition: Each bug produced different symptoms on different platforms at different rates, creating contradictory reports that obscured any single cause. The issues appeared as random, inconsistent degradation.
Data Connection Challenges: Anthropic lacked a clear mechanism connecting online reports to recent infrastructure changes. When negative reports spiked August 29, the connection to a routine load balancing change wasn't immediately apparent.
What's Changing
1. More Sensitive Evaluations
Anthropic developed evaluations that more reliably differentiate between working and broken implementations, improving its ability to detect quality regressions before users do.
2. Quality Evaluations in More Places
In addition to regular pre-deployment evaluation runs, quality evaluations will now run continuously on true production systems, to catch issues like the context window load balancing error.
3. Faster Debugging Tooling
Infrastructure and tooling will improve debugging of community-sourced feedback while maintaining user privacy. Bespoke tools developed during this incident will reduce remediation time for future similar issues.
User Feedback Importance
"Evals and monitoring are important. But these incidents have shown that we also need continuous signal from users when responses from Claude aren't up to the usual standard."
Users can report issues via:
- /bug command in Claude Code
- "Thumbs down" button in Claude apps
- Direct submission to [email protected]
Developers and researchers creating quality evaluation methods are encouraged to share them.
Acknowledgments
Written by Sam McAllister, with thanks to Stuart Ritchie, Jonathan Gray, Kashyap Murali, Brennan Saeta, Oliver Rausch, Alex Palcuie, and many others.
Footnotes
[1] XLA:TPU is the optimizing compiler that translates XLA (Accelerated Linear Algebra) programs, often written using JAX, into TPU machine instructions.
[2] Models are partitioned across tens of chips, making sorting a distributed operation. TPUs require vectorized operations instead of serial algorithms, differing from CPU performance characteristics.
[3] The approximate operation provided substantial performance improvements by accepting potential inaccuracies in lowest probability tokens. The bug caused it to drop highest probability tokens instead.
[4] The corrected top-k implementation may result in slight differences near the top-p threshold; users may benefit from re-tuning top-p values.