Building a C Compiler with a Team of Parallel Claudes

Published: February 5, 2026

Overview

Anthropic researcher Nicholas Carlini tasked 16 Claude Opus 4.6 instances with autonomously building a Rust-based C compiler capable of compiling the Linux kernel. Over nearly 2,000 Claude Code sessions and approximately $20,000 in API costs, the agent team produced a 100,000-line compiler supporting x86, ARM, and RISC-V architectures.

Enabling Long-Running Claudes

Existing agent scaffolds require continuous human oversight. To let Claude run autonomously, Carlini wrapped it in a simple loop that starts a new session whenever the previous one ends:

bash

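#!/bin/bash

# Make sure the log directory exists before the first redirect.
mkdir -p agent_logs

# Start a fresh Claude session whenever the previous one exits.
while true; do
    # Name each log after the commit the session starts from.
    COMMIT=$(git rev-parse --short=6 HEAD)
    LOGFILE="agent_logs/agent_${COMMIT}.log"

    # X-Y stands in for the model version.
    claude --dangerously-skip-permissions \
           -p "$(cat AGENT_PROMPT.md)" \
           --model claude-opus-X-Y &> "$LOGFILE"
done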

The agent prompt instructs Claude to break the problem into small pieces, track its progress, decide what to work on next, and keep going until the task is complete. "The loop runs forever," though Carlini notes that on one occasion Claude accidentally terminated itself.

Running Claude in Parallel

Multiple instances address key weaknesses of single-agent systems:

Sequential limitations: One Claude Code session handles only one task at a time; parallel agents accelerate debugging across expanding projects.
Specialization: Dedicated agents can handle documentation, code quality, and specialized subtasks while others solve primary problems.

Synchronization Mechanism

The implementation uses a straightforward git-based approach:

Agents claim a task by creating a text file in the current_tasks/ directory
Git's synchronization prevents duplicate work: when two agents claim the same task, the second one must pick a different task
After finishing a task, an agent pulls upstream changes, merges its modifications, pushes the result, and releases its lock
A fresh container then spawns a new Claude session, and the cycle repeats
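
The claim-and-release cycle can be sketched in shell. This is a local stand-in, not the git-based mechanism itself: an atomic noclobber file create plays the role that a rejected git push plays in the setup described above, so the sketch runs without a remote. The helper names (claim_task, release_task, claim_first_free) and the TASK_DIR variable are hypothetical.

```shell
# Hypothetical helpers. TASK_DIR mirrors the current_tasks/ directory.
TASK_DIR="${TASK_DIR:-current_tasks}"

# Try to claim a task. Succeeds (exit 0) only for the first claimant.
claim_task() {
    task="$1"
    agent="$2"
    mkdir -p "$TASK_DIR"
    # With noclobber (set -C), '>' fails if the lock file already exists.
    # Assumption: in the git-based setup, a rejected push signals the same
    # conflict between two agents.
    ( set -C; printf '%s\n' "$agent" > "$TASK_DIR/$task.txt" ) 2>/dev/null
}

# Release a claim once the work has been merged and pushed.
release_task() {
    rm -f "$TASK_DIR/$1.txt"
}

# Claim the first free task from a list and print its name.
# usage: claim_first_free AGENT TASK [TASK...]
claim_first_free() {
    agent="$1"
    shift
    for candidate in "$@"; do
        if claim_task "$candidate" "$agent"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

If the lexer task is already claimed, `claim_first_free agent-2 lexer parser` skips it and prints parser, which is the "pick a different task" behavior described above.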

Design Lessons for Agent Teams

Write Extremely High-Quality Tests

"Claude will work autonomously to solve whatever problem I give it. So it's important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem." Carlini built a continuous integration pipeline and added stricter enforcement so that new commits could not break existing functionality.
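
One way to enforce "no new commit may break an existing test" is to compare the set of passing tests before and after a change. The following is a minimal sketch, assuming the harness writes the names of passing tests to a file, one per line. The function name find_regressions and the file layout are illustrative and not taken from Carlini's pipeline.

```shell
# Hypothetical gate. Each input file lists passing test names, one per line.
# Prints every test that passed before but not now, and fails if any exist.
# usage: find_regressions BASELINE_FILE CURRENT_FILE
find_regressions() {
    baseline="$1"
    current="$2"
    sorted_base=$(mktemp)
    sorted_cur=$(mktemp)
    sort -u "$baseline" > "$sorted_base"
    sort -u "$current" > "$sorted_cur"
    # comm -23 keeps lines that appear only in the first (baseline) file.
    regressions=$(comm -23 "$sorted_base" "$sorted_cur")
    rm -f "$sorted_base" "$sorted_cur"
    if [ -n "$regressions" ]; then
        printf '%s\n' "$regressions" | sed 's/^/ERROR regression: /'
        return 1
    fi
}
```

A CI step could run this on the pass lists from before and after a commit and reject the commit when the exit status is non-zero.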

Put Yourself in Claude's Shoes

Agents dropped into fresh containers have no context and need extensive orientation. Carlini's instructions had the agents maintain READMEs and progress files, updated frequently with the current status.

Key limitations requiring design workarounds:

Context window pollution: Test harnesses should print as little as possible and log the important details to files instead. Error messages should put "ERROR" and the reason on the same line so that grep can find them.
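
A wrapper along these lines keeps a noisy test command from flooding the agent's context. It is a sketch under assumptions: the name run_quiet, its argument order, and the summary format are hypothetical, not the harness Carlini used.

```shell
# Hypothetical wrapper: full output goes to a log file, and the terminal
# gets exactly one summary line per test.
# usage: run_quiet NAME LOGDIR CMD [ARGS...]
run_quiet() {
    name="$1"
    logdir="$2"
    shift 2
    mkdir -p "$logdir"
    log="$logdir/$name.log"
    if "$@" > "$log" 2>&1; then
        printf 'ok %s\n' "$name"
    else
        code=$?
        # Keep ERROR, the test name, and the log path on a single line
        # so one grep finds everything needed to follow up.
        printf 'ERROR %s exit=%s log=%s\n' "$name" "$code" "$log"
        return 1
    fi
}
```

An agent can then run `grep ERROR` over the summary output and open only the log files for the failures it cares about.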

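Time blindness: Claude cannot track elapsed time and will spend hours running tests instead of making progress. The harness therefore includes a --fast option that runs only a 1-10% random sample of the tests. The sample is deterministic for each agent, so an agent always sees the same subset.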
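
A sample that is random yet deterministic per agent can be built by hashing the agent identifier together with each test name. The sketch below assumes a hash-based scheme; the source does not say how --fast selects tests, and the function name sample_tests is hypothetical.

```shell
# Hypothetical filter: reads test names on stdin, prints the sampled subset.
# usage: sample_tests AGENT_ID PERCENT < all_tests.txt
sample_tests() {
    agent_id="$1"
    percent="$2"
    while IFS= read -r test_name; do
        # cksum yields a stable CRC of "agent:test", so one agent always
        # gets the same bucket for a given test, while another agent's
        # identifier hashes the same test to an unrelated bucket.
        bucket=$(printf '%s:%s' "$agent_id" "$test_name" | cksum | cut -d' ' -f1)
        if [ $((bucket % 100)) -lt "$percent" ]; then
            printf '%s\n' "$test_name"
        fi
    done
}
```

With PERCENT set to 10, each agent runs roughly a tenth of the suite, and across many agents most tests still get exercised by someone.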

Make Parallelism Easy

Once the test suites reached a 99% pass rate, agents moved on to compiling independent open-source projects (SQLite, Redis, libjpeg, MQuickJS, Lua), which parallelized naturally. Compiling the Linux kernel was different: it is one monolithic task, so every agent kept running into the same bugs.

The solution used GCC as a reference compiler oracle. A new harness compiled a random selection of kernel files with GCC and only the remaining files with Claude's compiler. If the resulting kernel worked, the files built by Claude's compiler were not the problem; if it broke, the bug was somewhere in that subset. This let agents debug different files in parallel until Claude's compiler could build the whole kernel.
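
The split between the two compilers can be sketched as a build plan keyed on a seed, so that each agent (using its own seed) tests a different subset of files. This is an illustration under assumptions: the function name plan_build, the seed-and-hash scheme, and the label "candidate" for Claude's compiler are hypothetical and not taken from the actual harness.

```shell
# Hypothetical planner: prints "<compiler> <file>" for each file on stdin.
# "candidate" is a placeholder for the compiler under test; "gcc" is the
# trusted reference.
# usage: plan_build SEED CANDIDATE_PERCENT < files.txt
plan_build() {
    seed="$1"
    candidate_percent="$2"
    while IFS= read -r src; do
        # A stable hash of "seed:file" makes the plan reproducible, so a
        # failing build can be rerun with exactly the same split.
        bucket=$(printf '%s:%s' "$seed" "$src" | cksum | cut -d' ' -f1)
        if [ $((bucket % 100)) -lt "$candidate_percent" ]; then
            printf 'candidate %s\n' "$src"
        else
            printf 'gcc %s\n' "$src"
        fi
    done
}
```

If a build that follows such a plan boots, every file marked candidate is cleared for that run; if it fails, the suspect list is just the candidate lines of that plan.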

Multiple Agent Roles

Parallelism enabled specialization:

One agent coalesced duplicate code
Another improved the performance of the compiler itself
A third worked on making the compiled output more efficient
Another critiqued the design from a Rust developer's perspective
Another handled documentation

Stress Testing Results

Evaluation Metrics

Opus 4.6 consumed 2 billion input tokens and generated 140 million output tokens across two weeks—costing just under $20,000. The compiler:

Builds bootable Linux 6.9 on x86, ARM, and RISC-V
Compiles QEMU, FFmpeg, SQLite, PostgreSQL, and Redis
Achieves a 99% pass rate on most compiler test suites, including "the GCC torture test suite"
Passes the developer's "ultimate litmus test: it can compile and run Doom"

Limitations

The compiler remains incomplete:

Lacks the 16-bit x86 compiler needed to boot Linux out of real mode; it calls GCC for that step instead
Does not yet have a dependable assembler and linker of its own; the ones Claude started are still buggy
Is not yet a drop-in replacement for production compilers
Generates code that is less efficient than what GCC produces with optimizations disabled
Has Rust code of reasonable quality, but below what an expert would write

One particularly challenging failure: Opus could not implement a usable 16-bit x86 code generator. The compiler can emit correct 16-bit code via the 66/67 opcode prefixes, but the resulting output exceeds the 32 KB limit that Linux enforces. Claude's compiler therefore calls GCC for this phase on x86; on ARM and RISC-V it compiles everything by itself.

Looking Forward

Agent teams show that complex projects can be implemented autonomously. "This approach dramatically expands the scope of what's achievable with LLM agents," which lets users set more ambitious goals.

However, autonomous development carries real risks. Without human oversight during development, quality assurance becomes challenging. Carlini, who formerly worked in penetration testing, writes: "the thought of programmers deploying software they've never personally verified is a real concern."

"Building this compiler has been some of the most fun I've had recently, but I did not expect this to be anywhere near possible so early in 2026." Rapid progress in language models and in the scaffolds used to interact with them makes it possible to generate a substantial amount of code, but it also calls for new strategies to navigate this safely.

Acknowledgments

Nicholas Carlini thanks Josef Bacik, Edwin Chen, Bernardo Meurer Costa, Jake Eaton, Dan Kelley, Felix Klock, Jannet Park, Steve Weis, and numerous other Anthropic contributors.
