How to Build a Multi-Agent Research System
- Nishant
The pursuit of scale in artificial intelligence (AI), often measured in context windows and parameter counts, has dominated the industry. Yet complex, open-ended research questions are exposing the limits of what any single large model can do on its own. While most AI systems work in isolation, Anthropic has built something fundamentally different: a multi-agent research system in which multiple Claude agents collaborate to solve complex problems. It offers a glimpse of how intelligent AI agents might work together in the future.
Anthropic's recent technical breakdown of how it built this multi-agent research system reveals the engineering challenges, architectural decisions, and hard-won lessons of running such a system in production. For technical professionals working with AI, these insights offer a roadmap for building systems that can handle the kind of open-ended, complex tasks single agents struggle with.
The Multi-Agent System Advantage
Traditional AI systems follow predictable paths: input goes in, processing happens, and output comes out. Research doesn't work that way. Real research involves following leads, adapting to discoveries, and exploring multiple directions simultaneously. It's inherently unpredictable and path-dependent.
Anthropic's approach to complex research tasks is a multi-agent system in which a lead "orchestrator" agent manages several specialized "subagents" that investigate different aspects of a question in parallel. This structure allows a more dynamic and thorough exploration of information than a single agent could achieve, and it is built to handle the unpredictable nature of research, where initial findings often change the entire direction of an investigation. A minimal sketch of the pattern follows the feature list below.

Key Features and Functions:
Parallel Processing: Multiple specialized subagents work simultaneously on different aspects of complex queries, dramatically reducing research time.
Dynamic Planning: The lead agent analyzes queries and develops strategies in real-time, adapting to discoveries as they occur.
Tool Specialization: Each subagent can use distinct tools and exploration approaches, reducing path dependency.
Context Separation: Subagents operate with their own context windows, allowing thorough independent investigations.
Intelligent Coordination: The orchestrator manages task decomposition, resource allocation, and result synthesis.
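
To make the orchestrator-subagent pattern concrete, here is a minimal sketch of a lead agent decomposing a query, fanning out to parallel subagents, and synthesizing their findings. This is not Anthropic's implementation: the Subtask fields, the run_subagent stub, and the fixed three-way split are illustrative assumptions standing in for real model and tool calls.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Subtask:
    objective: str        # what this subagent should investigate
    output_format: str    # how it should report results back
    tool_hint: str        # which tool or approach to favor

async def run_subagent(task: Subtask) -> str:
    """Stub for a subagent; a real one would call an LLM with its own
    context window and tool access."""
    await asyncio.sleep(0)  # stands in for model and tool calls
    return f"[findings for: {task.objective}]"

async def orchestrate(query: str) -> str:
    # 1. The lead agent plans: decompose the query into bounded subtasks.
    subtasks = [
        Subtask(objective=f"{query} - angle {i}",
                output_format="bullet summary with sources",
                tool_hint="web_search")
        for i in range(3)
    ]
    # 2. Subagents explore their slices in parallel, each with a separate context.
    findings = await asyncio.gather(*(run_subagent(t) for t in subtasks))
    # 3. The lead agent synthesizes the parallel findings into one answer.
    return "\n".join(findings)

if __name__ == "__main__":
    print(asyncio.run(orchestrate("impact of chip export controls")))
```

The property this mimics is context separation: each subagent keeps its own working state, and only its distilled findings flow back to the orchestrator.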
The performance gains are substantial. Anthropic's internal evaluations show that their multi-agent system, with Claude Opus 4 as the lead agent and Claude Sonnet 4 as subagents, outperformed single-agent Claude Opus 4 by 90.2% on research tasks. For breadth-first queries requiring multiple independent directions, the difference becomes even more apparent.
The Token Economics Reality
Here's where things get expensive. Multi-agent systems are token-hungry beasts. Anthropic's data shows agents typically use about 4 times more tokens than chat interactions, while multi-agent systems consume roughly 15 times more tokens than chats. This isn't just a technical detail but a fundamental constraint that shapes when and how these systems make economic sense.
That token spend isn't wasted, though. In Anthropic's analysis of the BrowseComp evaluation, three factors explained 95% of performance variance: token usage alone accounted for 80%, with the number of tool calls and the choice of model explaining most of the rest. Multi-agent architectures effectively scale token usage for tasks that exceed single-agent limits.
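A quick back-of-the-envelope calculation shows why the multipliers matter. The sketch below applies the roughly 4x and 15x figures above to a hypothetical per-query token budget; the baseline chat size and the per-token price are illustrative assumptions, not Anthropic's numbers.

```python
# Back-of-the-envelope cost comparison using the multipliers quoted above.
BASELINE_CHAT_TOKENS = 2_000        # assumed tokens for a typical chat interaction
PRICE_PER_MILLION_TOKENS = 10.0     # assumed blended price in dollars per 1M tokens

MULTIPLIERS = {
    "single chat": 1,
    "single agent": 4,    # agents use ~4x the tokens of a chat
    "multi-agent": 15,    # multi-agent systems use ~15x the tokens of a chat
}

for name, mult in MULTIPLIERS.items():
    tokens = BASELINE_CHAT_TOKENS * mult
    cost = tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS
    print(f"{name:>12}: {tokens:>7,} tokens, about ${cost:.3f} per query")
```

Whatever the actual prices, the ratio is the point: the task has to be valuable enough to justify an order-of-magnitude jump in tokens.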

Prompt Engineering at Scale
Building multi-agent systems requires rethinking prompt engineering entirely. Early versions of Anthropic's system made predictable mistakes, like spawning 50 subagents for simple queries, conducting endless web searches for nonexistent sources, and agents distracting each other with excessive updates.
Their solution involved several key principles:
Think Like Your Agents: Anthropic built simulations in its Console using the exact prompts and tools from production and watched agents work step by step. This revealed failure modes that weren't obvious from the outputs alone, such as agents continuing after they already had sufficient results, using overly long search queries, or selecting the wrong tools.
Teach Delegation: The lead agent needs to decompose queries into subtasks with specific objectives, output formats, tool guidance, and clear boundaries. Simple instructions like "research the semiconductor shortage" proved too vague, leading to duplicated work and missed information (a sketch of a more structured task brief follows this list).
Scale Effort Appropriately: Simple fact-finding might need one agent making 3-10 tool calls, while complex research could require 10+ subagents with clearly divided responsibilities. Explicit scaling guidelines help prevent overinvestment in simple queries.
Tool Design Matters: Agent-tool interfaces are as critical as human-computer interfaces; a bad tool description can send agents down completely wrong paths. Each tool needs a distinct purpose and a clear description.
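
One way to picture the delegation principle is as a structured task brief the lead agent fills in for each subagent, rather than a one-line instruction. The field names below are assumptions for illustration, not Anthropic's schema.

```python
from dataclasses import dataclass

@dataclass
class SubagentBrief:
    """A structured delegation brief; the fields are illustrative, not a real schema."""
    objective: str              # a specific question, not just "research X"
    output_format: str          # e.g. "bulleted list with one cited source per claim"
    suggested_tools: list[str]  # steer the subagent toward the right tools
    boundaries: str             # what is explicitly out of scope
    effort_budget: int          # rough cap on tool calls for this subtask

# Too vague: invites duplicated work and missed information.
vague = "research the semiconductor shortage"

# Specific enough to delegate: objective, format, tools, boundaries, and budget.
specific = SubagentBrief(
    objective="Identify which chip categories had the longest lead times in 2021-2022",
    output_format="bulleted list with one cited source per claim",
    suggested_tools=["web_search"],
    boundaries="skip consumer GPU pricing; focus on automotive and industrial chips",
    effort_budget=10,
)
```

The effort_budget field reflects the scaling guideline above: simple lookups get a handful of tool calls, while broad investigations get more subagents instead.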
The Evaluation Challenge
Evaluating multi-agent systems breaks traditional testing approaches. Unlike deterministic software, which should follow the same steps on every run, agents can take completely different yet equally valid paths to the same goal: one might search three sources while another searches ten, using different tools to find the same answer.
Anthropic's approach combines several evaluation methods:
Start Small: Effect sizes are often dramatic in early development, with changes lifting success rates from 30% to 80%, so small test sets of around 20 queries are enough to reveal clear patterns of impact.
LLM-as-Judge: Research outputs resist programmatic evaluation, which makes LLMs a natural fit for grading against rubrics covering factual accuracy, citation accuracy, completeness, source quality, and tool efficiency (a sketch follows this list).
Human Testing: People catch edge cases that automation misses, including hallucinated answers, system failures, and subtle biases like choosing SEO-optimized content farms over authoritative sources.
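
As a rough illustration of the LLM-as-judge approach, the sketch below grades a research report against a fixed rubric and expects JSON scores back. The prompt wording, the rubric as a flat list, and the call_model stub are assumptions; swap in whatever model client you actually use.

```python
import json

RUBRIC = [
    "factual accuracy",
    "citation accuracy",
    "completeness",
    "source quality",
    "tool efficiency",
]

def call_model(prompt: str) -> str:
    """Stub for an LLM call; replace with your model client of choice."""
    raise NotImplementedError

def judge(query: str, report: str) -> dict:
    """Ask an LLM judge to score a research report against the rubric."""
    prompt = (
        "You are grading a research report.\n"
        f"Query: {query}\n"
        f"Report:\n{report}\n\n"
        "Score each criterion from 0.0 to 1.0 and respond as a JSON object "
        f"with these keys: {', '.join(RUBRIC)}."
    )
    return json.loads(call_model(prompt))
```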
Production Engineering Challenges
Running multi-agent systems in production introduces unique challenges that don't exist in traditional software development.
Stateful Complexity: Agents can run for extended periods, maintaining state across many tool calls. Minor failures can cascade into major issues, so the system needs to resume from checkpoints rather than restart from scratch (see the sketch after this list).
Non-Deterministic Debugging: Agents can make bold decisions and behave differently between runs, even with identical prompts. Anthropic added full production tracing to diagnose failures systematically while maintaining user privacy through high-level observability.
Deployment Coordination: Agent systems are stateful webs of prompts, tools, and execution logic running almost continuously. Updates require rainbow deployments that gradually move traffic between versions while keeping both running simultaneously.
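
A common way to get the checkpoint-and-resume behavior described above is to persist progress after every completed step. The sketch below does this with a local JSON file; the file format, step names, and single-file approach are assumptions, not Anthropic's design.

```python
import json
from pathlib import Path

CHECKPOINT = Path("agent_checkpoint.json")

def load_state() -> dict:
    """Resume from the last saved checkpoint if one exists."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"completed_steps": [], "findings": []}

def save_state(state: dict) -> None:
    """Persist progress after every step so a failure loses at most one step."""
    CHECKPOINT.write_text(json.dumps(state))

def run_research(steps: list[str]) -> dict:
    state = load_state()
    for step in steps:
        if step in state["completed_steps"]:
            continue  # already finished before the last failure; skip on resume
        state["findings"].append(f"result of {step}")  # stand-in for real tool calls
        state["completed_steps"].append(step)
        save_state(state)
    return state

if __name__ == "__main__":
    print(run_research(["plan", "search", "synthesize"]))
```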
What Are Multi-Agent Systems Good For?
The architecture works best for specific types of problems: high-value tasks that involve heavy parallelization, information that exceeds a single context window, and interfaces with many complex tools. Research tasks naturally fit this profile.
However, they're not universal solutions. Domains requiring shared context among all agents or involving many dependencies between agents aren't good fits. Most coding tasks involve fewer truly parallelizable subtasks than research, and current LLM agents struggle with real-time coordination and delegation.
The Bigger Picture
Anthropic's work is more than a technical achievement; it's a proof of concept for how intelligent systems might collaborate. Just as human societies became exponentially more capable through collective intelligence and coordination, AI agents may need similar collaborative approaches to solve increasingly complex problems.
The challenges are real: token economics, coordination complexity, and the engineering overhead of stateful, non-deterministic systems; however, the performance gains for appropriate use cases make these trade-offs worthwhile.
For technical professionals, Anthropic's lessons offer practical guidance for building similar multi-agent research systems. The focus on careful prompt engineering, comprehensive evaluation, and production reliability considerations provides a framework for solving multi-agent challenges.
Conclusion
Multi-agent systems mark a real advance in AI capability, but they're not magic bullets. They require careful engineering, substantial computational resources, and thoughtful decisions about when and how to deploy them.
Anthropic's experience building Claude's Research feature shows that while the gap between prototype and production is often wider than anticipated, multi-agent systems can operate reliably at scale with proper attention to architecture, evaluation, and operational practices. As AI systems become more capable, the ability to coordinate multiple intelligent agents may become as important as improving individual agent performance.
The future probably belongs to systems that can combine the strengths of multiple AI agents; however, getting there requires solving engineering challenges that extend far beyond simply wiring models together. Anthropic's detailed breakdown of its approach gives a valuable roadmap for anyone ready to take them on.