A research team from Meta and Harvard has released the Confucius Code Agent (CCA), a software engineering agent designed to work with large-scale codebases. The system achieved a 54.3% resolve rate on the SWE-Bench-Pro benchmark, outperforming previous research frameworks and matching commercial results from leading AI companies.
What makes this release notable isn’t just the performance numbers. The team built CCA on top of a new development platform called the Confucius SDK, which addresses a problem that’s plagued AI coding agents: They often struggle when moving from research demos to production workloads.
Why Agent Architecture Matters
The research makes a clear point about AI coding assistants. Raw model capability isn’t everything. How you structure the agent around the model, what the team calls “agent scaffolding,” determines whether it can handle real software engineering work.
The paper demonstrates this with concrete numbers. When the same Claude model runs on different frameworks, the resolve rate shifts by several points. CCA with Claude Sonnet 4 achieves 45.5%, while the baseline SWE-Agent with the same model reaches 42.7%, a gap of 2.8 percentage points. Because the model is held constant, the difference comes from the agent architecture, not the underlying AI model.
According to Mitch Ashley, VP & Practice Lead, Software Lifecycle Engineering, Futurum, “Confucius Code Agent shows that the limiting factor in AI-powered software engineering is no longer the model. The performance gap comes from how agents are structured to reason over code, manage context, and separate machine-facing signals from human-facing artifacts. This work demonstrates that agent scaffolding can materially change outcomes even when the underlying model is identical.”
“For software teams, this reframes the build decision. Choosing a model is table stakes. The real differentiation moves to agent architecture that supports long-lived work, persistent memory, and controlled tool execution across real codebases. Confucius makes clear that production-grade AI development systems will be judged by how well they operationalize models, not by raw benchmark scores alone.”
Three Design Perspectives
The Confucius SDK separates agent design into three distinct areas: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). This separation addresses a common mistake in agent frameworks, where human-readable logs get passed directly to the AI model, creating noise that degrades performance.
Agent Experience focuses on what the AI model sees and how information gets structured for reasoning. The system compresses verbose output into distilled summaries, keeping prompts concise while preserving essential information.
User Experience handles how humans interact with and observe the agent. Users see detailed execution traces and readable logs, but this information stays separate from what the agent processes internally.
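One way to picture the AX/UX split is a single tool result rendered twice: in full for the human-readable log, and as a short digest for the model. The sketch below is illustrative only. `ToolResult`, `render_for_user`, and `render_for_agent` are hypothetical names rather than Confucius SDK APIs, and the head-and-tail truncation is a simple stand-in for whatever summarization the SDK actually performs.

```python
from dataclasses import dataclass


@dataclass
class ToolResult:
    """Hypothetical record of one tool call; not a Confucius SDK type."""
    command: str
    exit_code: int
    output: str


def render_for_user(result: ToolResult) -> str:
    """UX view: the full, readable trace a human would inspect."""
    return f"$ {result.command}\n{result.output}\n[exit {result.exit_code}]"


def render_for_agent(result: ToolResult, max_lines: int = 4) -> str:
    """AX view: a compact digest keeping the status and the edges of the output."""
    lines = result.output.splitlines()
    if len(lines) > max_lines:
        half = max_lines // 2
        omitted = len(lines) - 2 * half
        tail = lines[len(lines) - half:]
        lines = lines[:half] + [f"... ({omitted} lines omitted) ..."] + tail
    status = "ok" if result.exit_code == 0 else f"failed (exit {result.exit_code})"
    return f"{result.command}: {status}\n" + "\n".join(lines)


# A noisy 100-line test run, rendered once for each audience.
result = ToolResult("pytest -q", 1, "\n".join(f"line {i}" for i in range(100)))
user_view = render_for_user(result)
agent_view = render_for_agent(result)
```

The human still gets all 100 lines, while the model's prompt grows by only a handful. That is the property the article describes.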
Developer Experience covers the tools needed to build, inspect, and improve agents. The SDK provides modular interfaces for prompts, tools, and memory, making it easier to run ablations and iterate on designs.
Four Key Mechanisms
The SDK implements four mechanisms that address the core challenges of large-scale software engineering.
First, hierarchical working memory manages context across long sessions. When conversations grow too large, an “Architect” agent summarizes earlier turns into structured plans. This compression preserves key decisions and error traces while keeping recent interactions in their original form. The approach prevents context overflow without losing important reasoning chains.
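The compression step can be sketched as a two-tier buffer: a running summary plus the most recent turns kept verbatim. Everything here is an assumption for illustration. `WorkingMemory` and `toy_architect` are hypothetical names, whitespace word counts stand in for a real tokenizer, and the summarizer is a toy function where CCA uses an LLM-based Architect agent.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: whitespace-separated words."""
    return len(text.split())


@dataclass
class WorkingMemory:
    """Hypothetical two-tier memory: one summary plus recent verbatim turns."""
    summarize: Callable[[str, List[str]], str]  # stand-in for the Architect agent
    max_tokens: int = 50
    keep_recent: int = 2
    summary: str = ""
    turns: List[str] = field(default_factory=list)

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        if self._size() > self.max_tokens and len(self.turns) > self.keep_recent:
            # Fold everything except the most recent turns into the summary.
            split = len(self.turns) - self.keep_recent
            old, self.turns = self.turns[:split], self.turns[split:]
            self.summary = self.summarize(self.summary, old)

    def _size(self) -> int:
        return count_tokens(self.summary) + sum(count_tokens(t) for t in self.turns)

    def prompt(self) -> str:
        parts = [f"[plan so far]\n{self.summary}"] if self.summary else []
        return "\n".join(parts + self.turns)


def toy_architect(previous: str, old_turns: List[str]) -> str:
    """Toy summarizer: keeps only the first four words of each folded turn."""
    heads = [" ".join(t.split()[:4]) for t in old_turns]
    return "\n".join(([previous] if previous else []) + heads)


memory = WorkingMemory(summarize=toy_architect)
for i in range(6):
    memory.add(f"turn {i}: " + "detail " * 15)
```

After six turns, the two newest remain in their original form and the four older ones survive only as one-line entries in the summary. A real summarizer would also need to keep the summary itself from growing without bound, which this sketch does not attempt.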
Second, a note-taking agent converts interaction traces into persistent Markdown notes. These notes capture both successful solutions and failure modes, creating a knowledge base that improves performance across sessions. When the team tested this on 151 tasks, the second run, which could draw on notes from the first, was both cheaper and more accurate: average token cost dropped from 104k to 93k (about 11%), and the resolve rate rose from 53% to 54.4%.
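A minimal sketch of the persistence side of this idea follows, assuming notes are stored as one Markdown file per lesson and simply concatenated into a later session's context. The file layout, the headings, and the `write_note` and `load_notes` helpers are hypothetical. In CCA the note content is written by an agent rather than passed in by hand.

```python
import tempfile
from pathlib import Path
from typing import List


def write_note(notes_dir: Path, title: str, worked: List[str], failed: List[str]) -> Path:
    """Persist one session's lessons as a Markdown file (hypothetical layout)."""
    slug = "-".join(title.lower().split())
    body = [f"# {title}", "", "## What worked"]
    body += [f"- {item}" for item in worked]
    body += ["", "## What failed"]
    body += [f"- {item}" for item in failed]
    path = notes_dir / f"{slug}.md"
    path.write_text("\n".join(body) + "\n", encoding="utf-8")
    return path


def load_notes(notes_dir: Path) -> str:
    """Concatenate saved notes so a later session can prepend them to its prompt."""
    files = sorted(notes_dir.glob("*.md"))
    return "\n".join(p.read_text(encoding="utf-8") for p in files)


# Session 1 records a lesson; session 2 reads it back before starting work.
with tempfile.TemporaryDirectory() as tmp:
    notes_dir = Path(tmp)
    note_path = write_note(
        notes_dir,
        "Flaky import test",
        worked=["Run the test module in isolation before editing"],
        failed=["Patching sys.path inside the test masked the real bug"],
    )
    note_name = note_path.name
    recalled = load_notes(notes_dir)
```

Recording failures alongside successes is the part that matters here: a later run can skip a dead end instead of rediscovering it.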
Third, the extension system handles all tool use through modular components. Extensions define how the agent parses outputs, invokes tools, and manages side effects. This separation makes behaviors easier to observe, test, and reuse across different agents.
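To make the extension idea concrete, here is a small sketch in which each extension both recognizes its own request in the model output and executes it, while a dispatcher stays ignorant of any particular tool. The `Extension` protocol, `TagExtension`, and the XML-like tag syntax are assumptions made for this example and are not taken from the Confucius SDK.

```python
import re
from typing import Callable, List, Optional, Protocol


class Extension(Protocol):
    """Hypothetical extension interface: recognize a request, then execute it."""
    name: str

    def parse(self, model_output: str) -> Optional[str]: ...

    def run(self, argument: str) -> str: ...


class TagExtension:
    """Toy extension triggered by <name>...</name> in the model output."""

    def __init__(self, name: str, handler: Callable[[str], str]) -> None:
        self.name = name
        self._handler = handler
        tag = re.escape(name)
        self._pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)

    def parse(self, model_output: str) -> Optional[str]:
        match = self._pattern.search(model_output)
        return match.group(1) if match else None

    def run(self, argument: str) -> str:
        return self._handler(argument)


def dispatch(model_output: str, extensions: List[Extension]) -> List[str]:
    """Offer the output to every extension and collect the resulting observations."""
    observations = []
    for ext in extensions:
        argument = ext.parse(model_output)
        if argument is not None:
            observations.append(f"{ext.name}: {ext.run(argument)}")
    return observations


extensions = [
    TagExtension("wc", lambda text: str(len(text.split()))),
    TagExtension("upper", str.upper),
]
observations = dispatch("Check <wc>fix the failing test</wc> then <upper>done</upper>", extensions)
```

Because parsing and execution live behind one small interface, each extension can be unit-tested on its own and reused in a different agent without touching the dispatcher.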
Fourth, a meta-agent automates agent development through a build-test-improve loop. It generates configurations, wires together components, evaluates candidates on test tasks, and refines prompts based on observed failures. The production version of CCA is itself the output of this automated refinement process.
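The build-test-improve loop can be sketched as a simple search over configurations: evaluate a candidate, refine it based on the tasks it failed, and keep the best one seen so far. The names `AgentConfig` and `improve_loop`, along with the toy evaluator and refiner, are hypothetical. In CCA both evaluation and refinement involve running real agents, which this example replaces with string checks.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical agent configuration: just a system prompt here."""
    prompt: str


def improve_loop(
    config: AgentConfig,
    evaluate: Callable[[AgentConfig], List[str]],  # returns names of failed tasks
    refine: Callable[[AgentConfig, List[str]], AgentConfig],
    max_rounds: int = 5,
) -> Tuple[AgentConfig, List[int]]:
    """Evaluate, refine on observed failures, and keep the best candidate seen."""
    best, best_failures = config, evaluate(config)
    history = [len(best_failures)]
    for _ in range(max_rounds):
        if not best_failures:
            break
        candidate = refine(best, best_failures)
        failures = evaluate(candidate)
        history.append(len(failures))
        if len(failures) < len(best_failures):
            best, best_failures = candidate, failures
    return best, history


# Toy stand-ins: a task "passes" once the prompt mentions its required hint.
REQUIRED_HINTS = {
    "task_a": "run tests",
    "task_b": "read the traceback",
    "task_c": "keep diffs small",
}


def toy_evaluate(config: AgentConfig) -> List[str]:
    return [task for task, hint in REQUIRED_HINTS.items() if hint not in config.prompt]


def toy_refine(config: AgentConfig, failures: List[str]) -> AgentConfig:
    # Address one observed failure per round by appending its hint.
    return AgentConfig(config.prompt + f" Always {REQUIRED_HINTS[failures[0]]}.")


best_config, failure_history = improve_loop(
    AgentConfig("You are a coding agent."), toy_evaluate, toy_refine
)
```

Keeping only candidates that strictly reduce failures is a conservative choice made for this sketch. A real meta-agent would also need held-out tasks to avoid tuning prompts to the test set.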
Performance and Evaluation
The team evaluated CCA on multiple benchmarks. On SWE-Bench-Pro, CCA with Claude Opus 4.5 achieved 54.3%, exceeding the 52.0% reported by Anthropic for their proprietary system. With Claude Sonnet 4.5, CCA reached 52.7%, 6.9 percentage points ahead of the Live-SWE-Agent baseline at 45.8%.
Ablation studies isolated the contribution of each mechanism. On a 100-example subset, removing advanced tool use while keeping context management dropped performance from 51.6% to 51.0%. Removing context management entirely reduced it further to 44.0%. Both mechanisms contribute to overall performance, but the 7.0-point drop from losing context management far outweighs the 0.6-point drop from losing advanced tool use.
The team also tested CCA on long-context scenarios by grouping tasks by the number of files modified. The agent continued to resolve tasks that touched 10 or more files, though its resolve rate declined moderately as the number of edited files grew. This suggests the agent can handle multi-file refactoring but faces challenges with cumulative localization uncertainty.
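The grouping analysis amounts to bucketing task results by edit breadth and computing a resolve rate per bucket. The sketch below shows the shape of that computation. The bucket boundaries are assumptions (only the 10+ group is mentioned above), the function names are hypothetical, and the sample records are made up purely to exercise the code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def bucket_label(files_modified: int) -> str:
    """Hypothetical bucket boundaries; the paper's exact grouping may differ."""
    if files_modified <= 2:
        return "1-2"
    if files_modified <= 5:
        return "3-5"
    if files_modified <= 9:
        return "6-9"
    return "10+"


def resolve_rate_by_bucket(results: Iterable[Tuple[int, bool]]) -> Dict[str, float]:
    """Map each bucket to the fraction of its tasks that were resolved."""
    outcomes: Dict[str, List[bool]] = defaultdict(list)
    for files_modified, resolved in results:
        outcomes[bucket_label(files_modified)].append(resolved)
    return {label: sum(flags) / len(flags) for label, flags in outcomes.items()}


# Made-up (files modified, resolved?) records, not data from the paper.
sample = [(1, True), (2, True), (2, False), (4, True), (4, False), (12, True), (15, False)]
rates = resolve_rate_by_bucket(sample)
```

With real benchmark results in place of `sample`, comparing the rate across buckets would show how quickly performance falls off as changes spread over more files.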
Comparison with Claude Code
Beyond standardized benchmarks, the researchers compared CCA with Anthropic’s Claude Code on real PyTorch issues. Both systems used the same Claude Sonnet 4.5 model and identical GPU resources, isolating the effect of agent architecture.
The comparison revealed different problem-solving styles. CCA typically identified root causes through systematic analysis within a single reasoning chain. Claude Code delegated investigations to separate subagents that performed exhaustive analysis without access to the main context.
This architectural difference affected solution characteristics. CCA generally produced simpler fixes, while Claude Code often implemented more comprehensive but potentially over-engineered solutions. On one CUDA memory issue, CCA’s minimal fix matched what the PyTorch team eventually merged, while Claude Code added more elaborate memory-management logic.
What This Means for Development Teams
The research suggests that organizations building AI coding assistants should invest as much in agent architecture as in model selection. The gap between research-grade and production-grade agents isn’t just about scale. It’s about how you structure memory, manage context, and coordinate tools.
The Confucius SDK provides one framework for addressing these challenges, but the core insights apply broadly. Separating what the agent sees from what users see improves both. Persistent memory across sessions enables learning from past mistakes. Modular tool systems make behaviors easier to test and improve.
For development teams considering AI coding assistants, these architectural patterns offer a starting point. The question isn’t just which model to use, but how to scaffold it for sustained, reliable performance on real codebases.

