A Postmortem of Three Recent Issues
Published: September 17, 2025
Overview
Between August and early September 2025, three infrastructure bugs intermittently degraded Claude's response quality. Anthropic has now resolved these issues and published this technical postmortem explaining what happened, why detection took time, and what changes are being implemented.
Key Statement
"We never reduce model quality due to demand, time of day, or server load. The problems our users reported were due to infrastructure bugs alone."
How Claude is Served at Scale
Anthropic serves Claude to millions of users through:
- First-party API
- Amazon Bedrock
- Google Cloud's Vertex AI
The service runs across multiple hardware platforms:
- AWS Trainium
- NVIDIA GPUs
- Google TPUs
Each platform has different characteristics requiring specific optimizations, yet Anthropic maintains strict equivalence standards so users receive consistent quality regardless of which platform serves their request.
Timeline of Events
The three bugs overlapped, making diagnosis challenging:
- August 5: First bug introduced, affecting ~0.8% of Sonnet 4 requests
- August 25-26: Two additional bugs introduced
- August 29: Load balancing change increased affected traffic
- August 31: Peak impact—16% of Sonnet 4 requests affected
- September 2-4: Initial fixes deployed
- September 12-18: Remaining rollouts completed
The Three Bugs
1. Context Window Routing Error
On August 5, some Sonnet 4 requests were misrouted to servers configured for the upcoming 1M token context window. Initially affecting 0.8% of requests, a load balancing change on August 29 escalated this to 16% at peak impact.
Impact by platform:
- Claude Code: ~30% of users experienced at least one misrouted message
- Amazon Bedrock: Peak 0.18% of Sonnet 4 requests
- Google Cloud Vertex AI: Less than 0.0004% of requests
The routing system was "sticky," meaning once a request went to the wrong server, follow-up messages likely went to the same incorrect server.
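"Sticky" routing of this kind typically comes from session affinity, where a hash of a stable session identifier pins every request in a conversation to the same server pool. This is an illustrative sketch, not Anthropic's actual router; the pool names and `route` function are hypothetical:

```python
import hashlib

# Hypothetical server pools; one is misconfigured for a different context window.
SERVERS = ["pool-a", "pool-b", "pool-c"]

def route(session_id: str) -> str:
    """Session-affinity routing: the same session always hashes to the same pool,
    so a misrouted first message means every follow-up is misrouted too."""
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]
```

The upside of affinity is cache locality; the downside, as here, is that a bad initial assignment repeats for the whole session.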
Resolution: Fixed routing logic on September 4; completed rollout by September 18.
2. Output Corruption
On August 25, a misconfiguration on TPU servers caused a runtime performance optimization error during token generation. This occasionally assigned high probability to tokens that should rarely appear—producing Thai or Chinese characters in English responses or syntax errors in code.
Affected models and timeframes:
- Opus 4.1 and Opus 4: August 25-28
- Sonnet 4: August 25-September 2
- Third-party platforms: Not affected
Resolution: Issue identified and rolled back September 2; added detection tests for unexpected character outputs to deployment process.
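A detection test for unexpected character outputs could be as simple as scanning responses for alphabetic characters outside the expected script. A minimal sketch (the function name and Latin-only assumption are illustrative, not Anthropic's actual check):

```python
import unicodedata

def unexpected_chars(text: str) -> list:
    """Return alphabetic characters whose Unicode name is not Latin-script,
    e.g. Thai or CJK characters appearing in an English response."""
    return [
        ch for ch in text
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
    ]
```

A deployment gate could then fail if canary prompts in English ever produce a non-empty result.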
3. Approximate Top-k XLA:TPU Miscompilation
On August 25, a code deployment to improve token selection during text generation inadvertently triggered a latent bug in the XLA:TPU compiler.
Confirmed impact on:
- Claude Haiku 3.5
- Likely subset of Sonnet 4 and Opus 3
Third-party platforms were unaffected.
Resolution:
- Haiku 3.5: Rolled back September 4
- Opus 3: Rolled back September 12
- Sonnet 4: Rolled back out of caution despite inability to reproduce
- Switched from approximate to exact top-k with enhanced precision
- Working with XLA:TPU team on compiler fix
Deep Dive: The XLA Compiler Bug
This bug illustrates the complexity involved. When Claude generates text, it calculates probabilities for each possible next word, then samples from the distribution using "top-p sampling"—only considering words whose cumulative probability reaches a threshold (typically 0.99 or 0.999).
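Top-p (nucleus) sampling as described can be sketched in a few lines of standard Python. This is a simplified serial illustration, not Anthropic's production (distributed, TPU-vectorized) implementation:

```python
import random

def top_p_sample(probs, p=0.99, rng=None):
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    renormalize their probabilities, and sample one token index."""
    rng = rng or random.Random(0)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in order:                 # accumulate tokens, most probable first
        nucleus.append(i)
        cum += probs[i]
        if cum >= p:                # nucleus now covers the threshold
            break
    total = sum(probs[i] for i in nucleus)
    weights = [probs[i] / total for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

Note that building the nucleus requires agreeing on the probability ordering, which is exactly where the precision mismatch described below caused trouble.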
Root Cause: Mixed precision arithmetic
Models compute next-token probabilities in bf16 (16-bit floating point), but TPU vector processors are fp32-native. The XLA compiler optimizes runtime by converting operations to fp32 (32-bit), controlled by the xla_allow_excess_precision flag (defaults to true).
This mismatch meant operations disagreed on which token had the highest probability, sometimes causing the highest probability token to disappear from consideration.
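The effect can be shown with a toy example: two logits that fp32 can tell apart collapse to the same value in bf16 (which keeps only a 7-bit mantissa), so an argmax computed at different precisions disagrees. The bf16 rounding here is simulated by truncating an fp32 value to its top 16 bits; this is an illustration of the mechanism, not the actual TPU code path:

```python
import struct

def to_bf16(x: float) -> float:
    """Simulate bfloat16 by truncating an fp32 value to its top 16 bits
    (sign + 8-bit exponent + 7-bit mantissa)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

logits = [1.0, 1.0001]
# fp32 sees token 1 as the winner; in bf16 both logits round to 1.0,
# so the tie breaks to token 0 and the two precisions disagree.
```

When one part of the pipeline ranks tokens in fp32 and another in bf16, the token one side considers "highest probability" can be absent from the other side's top set.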
The Compounding Issue:
In December 2024, a workaround was deployed for an occasional dropped-token bug that occurred when temperature was zero. On August 26, a rewrite of the sampling code removed this workaround on the belief that the root cause had been fixed. Instead, the removal exposed a deeper bug in the approximate top-k operation, a performance optimization that sometimes returned completely wrong results for certain batch sizes and model configurations. The December workaround had been inadvertently masking this problem.
Behavioral Challenges:
The bug's behavior was frustratingly inconsistent, changing based on unrelated factors like what operations ran before/after it or whether debugging tools were enabled. The same prompt might work perfectly in one request and fail in the next.
Final Solution:
After discovering exact top-k no longer had prohibitive performance penalties, Anthropic switched from approximate to exact top-k and standardized additional operations on fp32 precision. "Model quality is non-negotiable, so we accepted the minor efficiency impact."
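"Exact" top-k simply means the k highest-probability tokens are always returned, never an approximation that might drop one. A minimal serial sketch of the semantics (the real operation is a distributed, vectorized TPU kernel, per footnote [2]):

```python
def exact_top_k(probs, k):
    """Return the indices and values of the k highest-probability tokens.
    Unlike an approximate top-k, this never drops a high-probability token."""
    idx = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return idx, [probs[i] for i in idx]
```

The trade-off is cost: a full sort or selection over the vocabulary is more expensive than an approximate pass, which is why the approximate version was used until its performance advantage no longer justified the risk.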
Why Detection Was Difficult
Several factors contributed to delayed diagnosis:
Evaluation Limitations:
- Benchmarks, safety evaluations, and performance metrics didn't capture the degradation users reported
- Claude often recovers well from isolated mistakes
- Evaluations were insufficiently sensitive
Privacy Trade-offs: Internal privacy and security controls limit engineer access to user interactions, particularly those not reported as feedback. This protects user privacy but prevented engineers from examining problematic interactions needed to identify or reproduce bugs.
Confusing Pattern Recognition: Each bug produced different symptoms on different platforms at different rates, creating contradictory reports that obscured any single cause. The issues appeared as random, inconsistent degradation.
Data Connection Challenges: Anthropic lacked a clear mechanism connecting online reports to recent infrastructure changes. When negative reports spiked August 29, the connection to a routine load balancing change wasn't immediately apparent.
What's Changing
1. More Sensitive Evaluations
Anthropic developed evaluations that more reliably differentiate between working and broken implementations, improving its ability to detect quality regressions before users do.
2. Quality Evaluations in More Places
In addition to regular pre-deployment evaluation runs, quality evaluations will now run continuously on true production systems, to catch issues like the context window load balancing error.
3. Faster Debugging Tooling
Infrastructure and tooling will improve debugging of community-sourced feedback while maintaining user privacy. Bespoke tools developed during this incident will reduce remediation time for future similar issues.
User Feedback Importance
"Evals and monitoring are important. But these incidents have shown that we also need continuous signal from users when responses from Claude aren't up to the usual standard."
Users can report issues via:
- /bug command in Claude Code
- "Thumbs down" button in Claude apps
- Direct submission to [email protected]
Developers and researchers creating quality evaluation methods are encouraged to share them.
Acknowledgments
Written by Sam McAllister, with thanks to Stuart Ritchie, Jonathan Gray, Kashyap Murali, Brennan Saeta, Oliver Rausch, Alex Palcuie, and many others.
Footnotes
[1] XLA:TPU is the optimizing compiler that translates XLA (Accelerated Linear Algebra) programs, often written using JAX, into TPU machine instructions.
[2] Models are partitioned across tens of chips, making sorting a distributed operation. TPUs require vectorized operations instead of serial algorithms, differing from CPU performance characteristics.
[3] The approximate operation provided substantial performance improvements by accepting potential inaccuracies in lowest probability tokens. The bug caused it to drop highest probability tokens instead.
[4] The corrected top-k implementation may result in slight differences near the top-p threshold; users may benefit from re-tuning top-p values.