Scaling LLM Agents Beyond Demos: Coordination Costs, Topologies, and What Actually Works
Sparrow Intelligence note: This post is inspired by the research paper Towards a Science of Scaling Agent Systems (arXiv:2512.08296), and extends it with production-focused interpretation, heuristics, and architecture guidance.
TL;DR (10 seconds)
- Adding agents increases coordination cost faster than it increases capability.
- The topology matters more than the number of agents.
- Supervisor/centralized setups win until the coordinator becomes the bottleneck.
- Fully-connected swarms collapse early due to message explosion.
- Rule of thumb (the 45% rule): once a single agent already succeeds on roughly 45% of tasks, adding agents yields diminishing or even negative returns.
- Best default for real systems: hybrid/hierarchical with strict communication limits.
Who this is for
If you’re building agentic features in a SaaS product, internal tool, or LLM platform—and you’ve seen “works in a demo, fails in production”—this post is for you.
You’ll leave with:
- a topology decision guide
- coordination budgeting heuristics
- failure modes to watch (and how to prevent them)
Introduction: From Alchemy to Engineering
For nearly a decade, the artificial intelligence narrative has centered on a singular, powerful principle: neural scaling laws. Double the parameters. Add more data. Increase compute. Repeat. This recipe transformed GPT-2 into GPT-3, then GPT-3.5 into GPT-4, and unleashed a wave of increasingly powerful large language models (LLMs) that seemed to defy the scaling ceiling. We believed we had found the formula for artificial intelligence.
But we are now at a critical inflection point in AI history. The frontier is no longer about building bigger, more knowledgeable models. The frontier is about building systems that act—agents that reason, plan, choose tools, and interact with the world to accomplish complex, multi-step goals. This shift from passive intelligence to agentic intelligence breaks the old paradigm. Simply scaling an agent system does not follow the predictable curves we've observed in model scaling. In fact, adding more agents to a problem often leads to catastrophic failure, not incremental improvement.
Until now, practitioners building multi-agent systems have relied on intuition, heuristics, and trial-and-error. Should we use a manager agent? Should agents debate? Should they work independently or collaborate? These questions were answered through experimentation, not science.
A breakthrough paper, "Towards a Science of Scaling Agent Systems" (Kim et al., arXiv:2512.08296), published in December 2025, changes everything. For the first time, we have quantitative, empirically-derived scaling laws specifically for agent systems. The research distills agent orchestration into three fundamental principles, derives a predictive model, and validates it across 180 configurations spanning four real-world benchmarks. This is not theory; this is engineering with data.
For anyone building with AI in 2025 and beyond—whether you're an architect designing internal tools, an entrepreneur building SaaS products, or a researcher advancing the field—this paper is your roadmap. It tells you which agent architecture to use, when to use it, and why it works.
Understanding the Five Architectures of Agent Collaboration
Before we can talk about scaling laws, we must establish a common language. Agent systems can be organized into five canonical topologies, each with different trade-offs.
1. Single Agent
This is the foundation. A single LLM equipped with tools (web search, calculators, code execution, database access) follows a reasoning loop, typically ReAct (Reasoning and Acting): think about the problem → decide which tool to use → execute it → observe the result → repeat. It is the simplest and often the most effective model for many real-world tasks. There is no coordination overhead, no communication losses, and no single point of failure beyond the model itself.
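The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a real agent: `scripted_policy` stands in for the LLM call, and `Step`, `run_react`, and the `calculator` tool are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    """One decision from the policy: call a tool, or finish with an answer."""
    tool: Optional[str] = None
    tool_input: Optional[str] = None
    answer: Optional[str] = None


def run_react(policy: Callable[[list[str]], Step],
              tools: dict[str, Callable[[str], str]],
              question: str,
              max_steps: int = 5) -> str:
    """Think -> act -> observe, repeated until the policy returns an answer."""
    history = [f"question: {question}"]
    for _ in range(max_steps):
        step = policy(history)                           # think: pick the next move
        if step.answer is not None:
            return step.answer
        observation = tools[step.tool](step.tool_input)  # act: run the chosen tool
        history.append(f"{step.tool}({step.tool_input}) -> {observation}")  # observe
    raise RuntimeError("step budget exhausted without an answer")


def scripted_policy(history: list[str]) -> Step:
    """Stand-in for the LLM: use the calculator once, then answer with its result."""
    if len(history) == 1:
        return Step(tool="calculator", tool_input="6*7")
    return Step(answer=history[-1].split(" -> ")[-1])


def calculator(expression: str) -> str:
    """Toy tool that only understands 'a*b'."""
    left, right = expression.split("*")
    return str(int(left) * int(right))


result = run_react(scripted_policy, {"calculator": calculator}, "What is 6 times 7?")
print(result)  # 42
```

The `max_steps` cap is a design choice worth keeping in any real loop: without it, a confused policy can call tools indefinitely.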
2. Independent Agents
Imagine distributing a problem across multiple clones of your single agent with zero communication between them. This architecture shines for "embarrassingly parallel" problems where a large task can be decomposed into identical, independent sub-tasks. For instance, analyzing 1,000 separate financial reports can be split among 10 independent agents, each handling 100 reports. The agents work in parallel, never needing to consult each other.
The critical limitation: if one agent produces a hallucination or incorrect output, there's no mechanism to catch it. Errors propagate silently.
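As a rough sketch of the 1,000-reports example, the work can be fanned out with the standard library. `analyze_batch` is a placeholder for one independent agent, and the thread pool is an assumption made for illustration; a real deployment might use separate processes or API workers.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items: list, n_chunks: int) -> list[list]:
    """Split items into n_chunks contiguous, near-equal slices."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def analyze_batch(report_ids: list[int]) -> list[str]:
    """Placeholder for one independent agent working through its own batch."""
    return [f"summary-{rid}" for rid in report_ids]


reports = list(range(1000))
batches = chunk(reports, 10)

# Each worker sees only its own batch; nothing is shared between workers.
with ThreadPoolExecutor(max_workers=10) as pool:
    per_agent = list(pool.map(analyze_batch, batches))

# Results are concatenated as-is: no step here checks one agent against another.
summaries = [s for batch in per_agent for s in batch]
print(len(batches), len(batches[0]), len(summaries))  # 10 100 1000
```

The final concatenation is where the limitation shows up: whatever an agent returns goes straight into the result.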
3. Centralized (Hub-and-Spoke)
This is the classic manager-worker pattern. A central "Supervisor" agent holds the overall plan and strategy. It decomposes complex problems into sub-tasks, delegates them to specialized "Worker" agents, and synthesizes their results into a final answer. The supervisor acts as a quality gate, able to detect and correct errors from workers.
The trade-off: the supervisor is a bottleneck. In dynamic environments where the right path is unknown, waiting for central approval slows adaptation.
4. Decentralized (Peer-to-Peer)
Here, agents operate as autonomous peers in a mesh network. They communicate directly with each other without a central conductor. Intelligence and coordination emerge from local interactions. One agent can broadcast a question, and any peer with relevant knowledge can respond directly. This model excels in unpredictable, rapidly-changing environments where local adaptation is key.
The cost: global coordination and consistency become extremely difficult without a central authority.
5. Hybrid
A combination of centralized and decentralized elements. A high-level manager sets overall strategy but delegates execution to small, semi-autonomous teams that coordinate peer-to-peer. This mirrors modern agile software development.
The Science of Scaling: Three Foundational Laws
From a rigorous evaluation across 180 configurations and four diverse benchmarks, the Kim et al. paper identifies three dominant effects that determine whether a multi-agent system will succeed or fail.
Law 1: The Tool-Coordination Trade-off
The Principle: Under a fixed computational budget (context window + token limit), tool-heavy tasks suffer disproportionately from multi-agent coordination overhead.
Why It Matters: Every communication between agents costs tokens. A manager telling a worker, "Analyze this financial document," costs tokens. The worker's acknowledgment costs more tokens. If the task itself is heavy—meaning it requires a tool that produces a large output (a 10,000-token retrieved document, for example)—the system's finite context window gets squeezed from both ends. Coordination tokens compete directly with reasoning tokens.
Practical Implication: If your agents use heavy tools (API calls with large responses, code execution, long document retrieval), keep teams small. A single powerful agent with a large context may be far more efficient than a committee of agents trying to coordinate.
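A back-of-the-envelope calculation makes the squeeze concrete. The window size, message count, and tokens per message below are illustrative assumptions, not figures from the paper; only the 10,000-token document comes from the example above.

```python
def reasoning_budget(context_window: int,
                     tool_output_tokens: int,
                     n_messages: int,
                     tokens_per_message: int) -> int:
    """Tokens left for reasoning once tool output and coordination are paid for."""
    coordination = n_messages * tokens_per_message
    return context_window - tool_output_tokens - coordination


WINDOW = 32_000   # assumed context window
DOC = 10_000      # the retrieved document from the example above

single = reasoning_budget(WINDOW, DOC, n_messages=0, tokens_per_message=0)
team = reasoning_budget(WINDOW, DOC, n_messages=12, tokens_per_message=400)

print(single)  # 22000
print(team)    # 17200
```

Under these assumptions, twelve short coordination messages cost 4,800 tokens, more than a fifth of what the single agent had available for reasoning.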
Law 2: Capability Saturation (The "45% Rule")
The Principle: Multi-agent coordination yields diminishing or even negative returns once the single-agent baseline exceeds approximately 45%.
The Counter-Intuitive Reality: This is the most important finding for practitioners. More agents are not a reliable way to buy accuracy. Once a single, well-engineered agent already solves a task about 45% of the time, the headroom that coordination could recover shrinks while its cost does not. The paper reports a regression coefficient of β = −0.408 for this effect: the stronger the baseline, the smaller (and eventually negative) the gain from multi-agent scaling.
Why? Coordination is never free. Every handoff spends tokens and gives errors a chance to spread through communication channels. When the single agent is already capable, there are fewer mistakes left for a team to catch, so the overhead outweighs the benefit. Cloning a weak agent is no shortcut either: agents that hallucinate plausible-sounding but incorrect facts pass them to their peers, and managing confused agents adds overhead of its own (see Law 3).
The Lesson: Before building a multi-agent system, invest heavily in optimizing your single-agent baseline. Perfect your prompt. Refine your tool definitions. Test extensively. Then measure: if the single agent already clears roughly 45%, treat additional agents as a cost that must be justified on your own benchmark, not as a default upgrade.
Law 3: Topology-Dependent Error Amplification
The Principle: Errors propagate through agent systems at vastly different rates depending on architecture. Independent agents amplify errors by a factor of 17.2x, while centralized architectures contain amplification to 4.4x.
The Mechanism: In an independent agent setup, imagine Agent A hallucinates a revenue figure of $500M instead of the correct $50M. Since there's no communication or validation, nothing challenges the figure before the agents' outputs are merged, and any later step that computes profit margins from it produces wildly incorrect results. The error cascades undetected.
In a centralized setup, Agent A reports the $500M to the Supervisor. But the Supervisor also has a report from Agent C showing the correct $50M. The massive discrepancy triggers error detection. The Supervisor can flag the issue, re-run the analysis, or trust the more plausible source. The central hub acts as a natural error-correction mechanism.
This is not about intelligence; it's about architecture. Even with the same worker agents, the topology determines whether errors are amplified or contained.
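The supervisor's check does not have to be sophisticated to catch a $500M versus $50M disagreement. The sketch below is one possible validation gate, assuming workers report the same quantity as a plain number; `needs_review` and the 10% tolerance are hypothetical choices, not something the paper prescribes.

```python
def needs_review(reports: dict[str, float], rel_tolerance: float = 0.10) -> bool:
    """True when reports of the same quantity disagree by more than rel_tolerance."""
    values = list(reports.values())
    spread = max(values) - min(values)
    scale = max(abs(v) for v in values)
    # scale == 0 means every report is zero, which counts as agreement.
    return scale > 0 and spread / scale > rel_tolerance


conflicting = {"agent_a": 500e6, "agent_c": 50e6}   # the hallucinated vs. correct figure
consistent = {"agent_a": 50e6, "agent_c": 51e6}

print(needs_review(conflicting))  # True
print(needs_review(consistent))   # False
```

A check like this only tells the supervisor that something is wrong, not which worker is wrong; re-running the analysis or consulting a third source is still needed to resolve it.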
The Four Testing Grounds: Understanding the Benchmarks
To derive these laws, the paper evaluated agent systems on four diverse benchmarks designed to stress-test different capabilities.
Benchmark 1: Finance-Agent (Parallelizable Reasoning)
What It Tests: Complex financial analysis requiring independent, parallelizable reasoning.
Example Task: "Analyze the Q3 and Q4 2024 financial filings for Salesforce (CRM) and Oracle (ORCL). Calculate the YoY revenue growth for each company and compare."
Why It Matters: This task can be decomposed into independent sub-tasks: analyze CRM Q3, analyze CRM Q4, analyze ORCL Q3, analyze ORCL Q4, aggregate results. Each sub-task requires tool use (retrieval of SEC filings, parsing, calculation) but has no dependencies on the others.
What It Reveals: Which agent architectures excel when tasks are parallelizable and tool-heavy.
Benchmark 2: BrowseComp-Plus (Dynamic Web Research)
What It Tests: Deep research on complex, multi-hop reasoning queries using web search.
Example Task: "Who was the lead singer of the band that opened for The Rolling Stones at their 1969 Altamont Free Concert, and what is that person's birth year?"
The Challenge: This is impossible to answer with a single search query. The agent must:
- Search for "Altamont Free Concert opening bands" → discover "The Flying Burrito Brothers"
- Search for "Flying Burrito Brothers lead singer 1969" → find "Gram Parsons"
- Search for "Gram Parsons birth date" → find 1946
Each search result is unpredictable. The agent must adapt its strategy based on what it discovers.
What It Reveals: Which architectures handle dynamic, exploratory tasks where the right path cannot be known in advance.
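The data dependency between hops can be shown with a stubbed search tool. `FAKE_INDEX` is hypothetical data that mirrors the three hops above, and the fixed query templates are a simplification: a real agent would write each follow-up query from what it just read.

```python
# Stub search index standing in for a real web-search tool.
FAKE_INDEX = {
    "Altamont Free Concert opening bands": "The Flying Burrito Brothers",
    "The Flying Burrito Brothers lead singer 1969": "Gram Parsons",
    "Gram Parsons birth date": "1946",
}


def search(query: str) -> str:
    """Look up a query in the stub index."""
    return FAKE_INDEX[query]


def multi_hop(first_query: str, follow_ups: list[str]) -> str:
    """Run a chain of searches where each query is built from the previous result."""
    result = search(first_query)
    for template in follow_ups:
        result = search(template.format(result))
    return result


answer = multi_hop(
    "Altamont Free Concert opening bands",
    ["{} lead singer 1969", "{} birth date"],
)
print(answer)  # 1946
```

No hop can start until the previous one returns, which is why this kind of task resists up-front decomposition.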
Benchmark 3: PlanCraft (Sequential Minecraft Crafting)
What It Tests: Strict, sequential planning where one step must precede the next without deviation.
Example Task: "Craft a green bed."
The Exact Steps Required:
- Smelt cactus → green dye
- Place green dye at position [A1]
- Place white bed at position [A2]
- Move crafted green bed to inventory
The Constraint: A single wrong move invalidates the entire plan. Position [A1] must have green dye, not [B1]. This is not forgiving; there is no recovery.
What It Reveals: Whether agents can maintain fragile, sequential logic chains without the benefits (or costs) of distributed reasoning.
Benchmark 4: Workbench (Realistic Corporate Tasks)
What It Tests: Real-world workflows combining reasoning, tool use, and application integration.
Example Task: "Review the last three email threads in the 'Project Phoenix' Slack channel, summarize client concerns, draft a response, and schedule a 30-minute meeting for next Tuesday using Google Calendar."
The Complexity: Requires integrating information from multiple sources (Slack, email, calendar), reasoning about scheduling constraints, and executing actions across different systems.
Sparrow Intelligence field notes (what we see in production)
Three patterns show up again and again:
- Coordination tax: Every new agent adds non-linear overhead (routing + messaging + verification).
- Hallucination amplification: One wrong agent output can poison downstream agents unless you add validation gates.
- Bottleneck displacement: If it’s not the model, it’s the orchestrator. If it’s not the orchestrator, it’s tool latency.
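The non-linear part of the coordination tax is easy to see by counting communication channels. This small sketch assumes `n_agents` includes the supervisor in the centralized case; the `channels` function and the topology names are introduced here for illustration.

```python
def channels(n_agents: int, topology: str) -> int:
    """Number of distinct communication channels among n agents."""
    if topology == "independent":
        return 0
    if topology == "centralized":        # hub-and-spoke: each worker <-> supervisor
        return n_agents - 1
    if topology == "fully_connected":    # every pair of agents can talk
        return n_agents * (n_agents - 1) // 2
    raise ValueError(f"unknown topology: {topology}")


for n in (3, 5, 10, 20):
    print(n, channels(n, "centralized"), channels(n, "fully_connected"))
# 3 2 3
# 5 4 10
# 10 9 45
# 20 19 190
```

Hub-and-spoke grows linearly with team size, while a fully connected mesh grows quadratically, which is the message explosion the TL;DR warns about.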
Implementing Agent Architectures with LangGraph
Theory is invaluable, but implementation is where the rubber meets the road. LangGraph, an open-source framework from LangChain, provides a powerful, graph-based way to build stateful agent systems. Let's walk through code examples for the two most common architectures.
Architecture 1: Building a Centralized Supervisor with LangGraph
Here's a minimal implementation of the centralized (hub-and-spoke) pattern:
from langgraph.prebuilt import create_react_agent
from langgraph.graph import StateGraph, START, MessagesState
from langchain_core.tools import tool
from langchain_tavily import TavilySearch

# Define specialized tools for worker agents
web_search = TavilySearch(max_results=3)


def add(a: float, b: float):
    """Add two numbers."""
    return a + b


def multiply(a: float, b: float):
    """Multiply two numbers."""
    return a * b

# Create specialized worker agents
research_agent = create_react_agent(
    model="openai:gpt-4.1",
    tools=[web_search],
    prompt=(
        "You are a research specialist. Your role is to search the web for information. "
        "Answer ONLY research questions. After completing your task, respond directly to the supervisor. "
        "Do NOT perform math or other non-research tasks."
    ),
    name="research_agent",
)

math_agent = create_react_agent(
    model="openai:gpt-4.1",
    tools=[add, multiply],
    prompt=(
        "You are a math specialist. Your role is to perform calculations using the provided tools. "
        "Answer ONLY math-related questions. After completing your task, respond directly to the supervisor. "
        "Do NOT perform research or other non-math tasks."
    ),
    name="math_agent",
)

# Define handoff tools that allow the supervisor to delegate
from typing import Annotated
from langchain_core.tools import tool, InjectedToolCallId
from langgraph.prebuilt import InjectedState
from langgraph.types import Command


def create_handoff_tool(*, agent_name: str, description: str | None = None):
    """Creates a tool that allows the supervisor to hand off work to a worker agent."""
    name = f"transfer_to_{agent_name}"
    description = description or f"Transfer control to {agent_name}."

    @tool(name, description=description)
    def handoff_tool(
        state: Annotated[MessagesState, InjectedState],
        tool_call_id: Annotated[str, InjectedToolCallId],
    ) -> Command:
        # Record the handoff as a tool result so the message history stays valid
        tool_message = {
            "role": "tool",
            "content": f"Successfully transferred to {agent_name}",
            "name": name,
            "tool_call_id": tool_call_id,
        }
        return Command(
            goto=agent_name,  # Route to the worker agent
            update={**state, "messages": state["messages"] + [tool_message]},
            graph=Command.PARENT,  # The target node lives in the parent graph
        )

    return handoff_tool

# Create handoff tools
handoff_research = create_handoff_tool(
    agent_name="research_agent",
    description="Delegate a research task to the research specialist.",
)
handoff_math = create_handoff_tool(
    agent_name="math_agent",
    description="Delegate a math task to the math specialist.",
)

# Create the supervisor agent
supervisor_agent = create_react_agent(
    model="openai:gpt-4.1",
    tools=[handoff_research, handoff_math],
    prompt=(
        "You are a supervisor managing two specialist agents: "
        "a research agent (for web searches and information gathering) "
        "and a math agent (for calculations). "
        "\n"
        "Your job is to:\n"
        "1. Understand the user's request\n"
        "2. Decide which agent(s) can help\n"
        "3. Delegate tasks to them one at a time\n"
        "4. Synthesize their results into a final answer\n"
        "\n"
        "Do NOT do any work yourself. Always delegate."
    ),
    name="supervisor",
)

# Assemble the multi-agent graph
from langgraph.graph import END

supervisor_graph = (
    StateGraph(MessagesState)
    .add_node("supervisor", supervisor_agent)
    .add_node("research_agent", research_agent)
    .add_node("math_agent", math_agent)
    .add_edge(START, "supervisor")
    # Workers always return to supervisor
    .add_edge("research_agent", "supervisor")
    .add_edge("math_agent", "supervisor")
    .compile()
)

# Run the system
result = supervisor_graph.invoke({
    "messages": [
        {
            "role": "user",
            "content": "Find the current US GDP for 2024 and calculate what percentage of it is 2.5 trillion"
        }
    ]
})
print(result["messages"][-1].content)
What This Does:
- Creates two specialized agents (research and math) with domain-specific prompts and tools.
- Defines "handoff" tools that allow the supervisor to delegate work.
- Assembles them into a graph where the supervisor routes tasks and workers return results.
- The supervisor controls the flow and acts as a quality checkpoint.
Key Insight: The supervisor is the bottleneck in high-latency environments but the quality-control feature in error-prone scenarios. Use this for parallelizable tasks like Finance-Agent.
Architecture 2: Conceptualizing Decentralized Peer-to-Peer Agents
A fully decentralized system is more complex to implement, but the core idea is that agents can communicate directly:
# Simplified example of a decentralized handoff pattern.
# Reuses the imports from the supervisor example
# (Annotated, tool, InjectedState, Command, StateGraph, START, MessagesState).
from langgraph.types import Send


def create_peer_handoff_tool(*, peer_name: str, description: str | None = None):
    """Create a tool that allows one agent to directly hand off to a peer."""
    name = f"ask_{peer_name}"
    description = description or f"Ask {peer_name} for help."

    @tool(name, description=description)
    def peer_handoff(
        question: str,
        state: Annotated[MessagesState, InjectedState],
    ) -> Command:
        """Send a direct question to a peer agent."""
        # Peer receives only the question, not full conversation history
        peer_message = [{"role": "user", "content": question}]
        return Command(
            goto=[Send(peer_name, {
                "messages": peer_message,
                # "metadata" and "agent_name" are illustrative keys, not part of MessagesState
                "metadata": {"source": state.get("agent_name", "unknown")}
            })],
            graph=Command.PARENT,
        )

    return peer_handoff


# In a decentralized system, Agent A and Agent B can both have tools to call each other
agent_a_tools = [
    create_peer_handoff_tool(peer_name="agent_b", description="Ask B for help"),
    web_search,
]
agent_b_tools = [
    create_peer_handoff_tool(peer_name="agent_a", description="Ask A for help"),
    code_executor,  # placeholder: a code-execution tool that is not defined in this sketch
]

# Graph structure allows agents to route to each other
decentralized_graph = (
    StateGraph(MessagesState)
    .add_node("agent_a", create_react_agent(..., tools=agent_a_tools))  # model argument elided
    .add_node("agent_b", create_react_agent(..., tools=agent_b_tools))  # model argument elided
    .add_edge(START, "agent_a")  # assumed entry point; compile() requires one
    # No fixed return edges; agents decide where to route
    .compile()
)
The Difference: In decentralized systems, agents make local decisions about whom to contact. There's no central coordinator. This is more resilient but harder to reason about.
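One way to make a peer mesh easier to reason about is to enforce the strict communication limits recommended in the TL;DR. The sketch below is framework-agnostic and is not part of LangGraph: `MessageBudget`, its two limits, and the agent names are hypothetical, and a real system would call something like `try_send` before routing each peer message.

```python
from collections import Counter


class MessageBudget:
    """Caps peer-to-peer traffic per task: a global limit plus a per-pair limit."""

    def __init__(self, max_total: int, max_per_pair: int):
        self.max_total = max_total
        self.max_per_pair = max_per_pair
        self.per_pair: Counter = Counter()

    def try_send(self, sender: str, receiver: str) -> bool:
        """Record the message and return True only if it fits both limits."""
        pair = frozenset((sender, receiver))   # direction does not matter
        if sum(self.per_pair.values()) >= self.max_total:
            return False
        if self.per_pair[pair] >= self.max_per_pair:
            return False
        self.per_pair[pair] += 1
        return True


budget = MessageBudget(max_total=4, max_per_pair=2)
attempts = [
    ("agent_a", "agent_b"), ("agent_b", "agent_a"), ("agent_a", "agent_b"),
    ("agent_a", "agent_c"), ("agent_c", "agent_b"), ("agent_b", "agent_c"),
]
results = [budget.try_send(s, r) for s, r in attempts]
print(results)  # [True, True, False, True, True, False]
```

The third message is rejected by the per-pair limit and the last by the global limit; when a send is refused, the agent has to answer with what it already knows or escalate.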
Matching Architecture to Task: The Decision Matrix
The paper provides empirical evidence for which architecture to use:
| Task Type | Optimal Architecture | Performance Gain | Why It Works |
|---|---|---|---|
| Parallelizable (Finance-Agent) | Centralized | +80.9% | Manager decomposes, assigns workers in parallel, aggregates results cleanly |
| Dynamic Exploratory (BrowseComp-Plus) | Decentralized | +9.2% | Agents adapt locally; central manager would become bottleneck |
| Sequential Logic (PlanCraft) | Single Agent | Baseline; multi-agent variants degrade by 39–70% | Chain of thought must remain unbroken; context fragmentation kills performance |
| Mixed/Real-World (Workbench) | Context-dependent | Varies | Use single agent if tasks are sequential; centralized if parallelizable |
Critical Rule: If your task is sequential and reasoning-heavy, do not use multi-agent. A single, powerful agent with a large context window vastly outperforms a committee trying to pass a baton.
Practical Implementation: Your Scaling Checklist
Based on the research, here is your step-by-step guide to building agentic systems:
Step 1: Classify Your Task
Ask: Is this task parallelizable, dynamic, or sequential?
- Parallelizable: Multiple independent sub-problems (Finance-Agent style)
- Dynamic: Unknown path, discovery-driven (BrowseComp-Plus style)
- Sequential: Must follow a strict order (PlanCraft style)
Step 2: Optimize the Single-Agent Baseline
Before even considering multi-agent:
- Invest in prompt engineering. Test extensively.
- Refine your tool definitions. What tools does your agent need?
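- Measure baseline success rate. Per the paper's capability-saturation finding, a baseline already above roughly 45% leaves little for extra agents to add, so expect diminishing returns from scaling.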
- Use few-shot examples to guide behavior.
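Measuring the baseline can start as a very small harness. In this sketch `toy_agent` is a stand-in for the real single agent, exact string match is an assumed scoring rule, and the four test cases are made up; the resulting number is what the 45% threshold is compared against.

```python
from typing import Callable


def success_rate(agent: Callable[[str], str],
                 cases: list[tuple[str, str]]) -> float:
    """Fraction of (prompt, expected) cases the agent answers exactly right."""
    passed = sum(1 for prompt, expected in cases if agent(prompt) == expected)
    return passed / len(cases)


def toy_agent(prompt: str) -> str:
    """Stand-in agent that only knows two answers."""
    known = {"2+2": "4", "3+3": "6"}
    return known.get(prompt, "unknown")


cases = [("2+2", "4"), ("3+3", "6"), ("capital of France", "Paris"), ("5*5", "25")]
baseline = success_rate(toy_agent, cases)
print(f"{baseline:.0%}")  # 50%
```

With real tasks, exact match is usually too strict, so the comparison inside `success_rate` is the part most likely to need replacing with a task-specific grader.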
Step 3: Measure Coordination Overhead
Calculate the token cost of coordination:
- How many tokens are used in communication between agents?
- How many tokens does your primary tool output consume?
- What's your context window limit?
If coordination overhead is >20% of total tokens, reconsider multi-agent.
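The overhead share can be computed from a token log. The sketch assumes each logged event is already tagged by kind; the `Event` type, the tag names, and the sample trace are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str      # "coordination", "tool_output", or "reasoning"
    tokens: int


def coordination_share(events: list[Event]) -> float:
    """Coordination tokens as a fraction of all tokens in the trace."""
    total = sum(e.tokens for e in events)
    coord = sum(e.tokens for e in events if e.kind == "coordination")
    return coord / total if total else 0.0


trace = [
    Event("coordination", 300),
    Event("tool_output", 4_000),
    Event("reasoning", 1_200),
    Event("coordination", 500),
    Event("reasoning", 2_000),
]
share = coordination_share(trace)
print(f"{share:.1%}")   # 10.0%
print(share > 0.20)     # False
```

Here 800 of 8,000 tokens go to coordination, so this trace stays under the 20% threshold.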
Step 4: Choose Your Architecture
- For parallelizable tasks with tool-heavy operations: Use Centralized (supervisor model).
- For dynamic, exploratory tasks: Use Decentralized (peer mesh).
- For sequential reasoning: Stick with Single Agent.
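The mapping above is small enough to encode directly, which makes the choice explicit and testable in a codebase. The enum and the architecture labels are names chosen for this sketch.

```python
from enum import Enum


class TaskType(Enum):
    PARALLELIZABLE = "parallelizable"
    DYNAMIC = "dynamic"
    SEQUENTIAL = "sequential"


# Direct encoding of the three recommendations in Step 4.
RECOMMENDED = {
    TaskType.PARALLELIZABLE: "centralized",
    TaskType.DYNAMIC: "decentralized",
    TaskType.SEQUENTIAL: "single_agent",
}


def choose_architecture(task_type: TaskType) -> str:
    """Return the recommended starting architecture for a task type."""
    return RECOMMENDED[task_type]


print(choose_architecture(TaskType.SEQUENTIAL))  # single_agent
```

Treat the result as a starting point to benchmark, not a final answer; mixed workloads may need different choices per sub-task.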
Step 5: Implement and Monitor
- Build your system using LangGraph or a similar framework.
- Monitor error amplification. Track how often errors propagate.
- Measure end-to-end latency. Multi-agent adds delay.
- A/B test against your single-agent baseline. Does multi-agent actually improve performance?
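For the A/B comparison, a two-proportion z-test is one simple way to check whether a difference in success rate is more than noise. The counts below are invented for illustration, and the test is a standard statistical tool rather than something taken from the paper.

```python
from math import sqrt


def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """z-score for the difference in success rate between arm B and arm A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


# Arm A: single agent, 120/200 correct. Arm B: multi-agent, 132/200 correct.
z = two_proportion_z(120, 200, 132, 200)
print(round(z, 2))      # 1.24
print(abs(z) > 1.96)    # False: not significant at the 5% level
```

In this made-up case a six-point improvement on 200 runs per arm is not yet distinguishable from chance, so the extra latency and cost of the multi-agent system would not be justified without more data.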
Topology decision cheat-sheet
| Your situation | Best topology | Why |
|---|---|---|
| One clear task + tools | Single agent | lowest overhead |
| Fixed pipeline (extract→validate→write) | Sequential | deterministic control |
| Medium complexity with parallel subtasks | Centralized supervisor | good until coordinator bottlenecks |
| Large task-space + many subtasks | Hybrid / hierarchical | bounded comms + parallelism |
| Free-for-all “everyone talks” | Avoid fully-connected | message explosion kills quality |
Conclusion: The Era of Agentic Engineering
We are witnessing a fundamental shift in how AI systems are built. The era of scaling models by brute force is giving way to the era of scaling intelligence through architecture. The Kim et al. paper gives us the first scientific framework for making these architectural decisions.
The key insights are simple but profound:
- Tool-coordination trade-off: Coordination has a real cost. Pay attention to it.
- The 45% rule: Once a single agent clears roughly 45%, more agents bring diminishing or negative returns. Make your single agent strong first.
- Error amplification: Your topology determines how errors propagate. Choose wisely.
As you build agentic systems in 2025 and beyond, remember: it is no longer about throwing more agents at the problem. It is about choosing the right agent for the job, building it well, and scaling it intelligently. The science has arrived. It's time to engineer with data.
Cite this
If this post helped, feel free to reference it as:
Sparrow Intelligence — “Scaling LLM Agents Beyond Demos: Coordination Costs, Topologies, and Practical Rules”
References
Kim, Y., et al. (2025). "Towards a Science of Scaling Agent Systems." arXiv:2512.08296. https://arxiv.org/abs/2512.08296
LangGraph Documentation. (2025). "Multi-Agent Supervisor." https://langchain-ai.github.io/langgraph/tutorials/multi_agent/agent_supervisor/
Chen, Z., et al. (2025). "BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agents." arXiv:2508.06600. https://arxiv.org/html/2508.06600v1
Dagan, G., Keller, F., & Lascarides, A. (2025). "Plancraft: An Evaluation Dataset for Planning with LLM Agents." COLM 2025. https://homepages.inf.ed.ac.uk/alex/papers/colm.pdf
Galileo AI. (2025). "Architectures for Multi-Agent Systems." https://galileo.ai/blog/architectures-for-multi-agent-systems
vals.ai. (2025). "Finance Agent Benchmark." https://www.vals.ai/benchmarks/finance_agent
LangChain. (2025). "LangGraph: Building Stateful Agents." https://langchain-ai.github.io/langgraph/concepts/multi_agent/