Multi-Agent Systems Scale Vertically. They Need to Scale Horizontally.

This post continues the ideas explored in Part I: Super Agents and Multi-Agent Communication and Part II: Swarm Intelligence. Those posts covered how agents coordinate within a workflow. This one asks what happens after the workflow ends.


After spending time with the orchestrator pattern and the swarm pattern, I kept running into the same gap — one that the field has not been honest enough about.

Agents can communicate within a workflow. They can share state, hand off tasks, and coordinate through structured message protocols. I covered all of that in the previous posts, and all of that is solved. What is not solved is this: once the run completes and the agents figure out how to handle a complex workflow, that knowledge stays isolated. The next run starts cold.

That is the vertical scaling trap. And the more I read — across Reflexion, ERL, Letta’s stateful agent work, and Google Research’s recent findings on scaling agent systems — the more I realized this is the most important unsolved problem in multi-agent architecture today.


What Vertical Scaling Actually Means

The industry has concentrated its investment on making individual agents more capable in isolation — longer context windows, stronger reasoning models, richer tool sets, more compute per inference call. This is vertical scaling: more depth, more power, more intelligence concentrated in a single node.

Vertical scaling has delivered real gains. Modern LLM-based agents can handle significantly longer reasoning chains, maintain larger working memories, and invoke more complex tool sequences than agents from two years ago. The benchmark numbers confirm this.

But vertical scaling has a ceiling, and that ceiling is architectural, not computational. No matter how capable a single agent becomes, a system of agents that starts each run from a blank slate cannot accumulate collective intelligence over time. Every execution is, in a meaningful sense, the first time that system has encountered the problem.

That is the definition of a system that does not learn.


The Statefulness Illusion

This was the part that clarified the problem most for me. LLM agents are stateless by design. The model itself has no memory between API calls — every inference starts fresh, bounded by what exists inside the current context window. What looks like agent memory in most production frameworks is actually infrastructure built around the model: conversation history injected into the prompt, vector stores queried at retrieval time, workflow state persisted in an external database.

The agent does not remember. The infrastructure remembers. And the agent only knows what the infrastructure decides to surface at inference time.
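To make the distinction concrete, here is a minimal sketch of the illusion, assuming a hypothetical stand-in function for a stateless inference call. The "memory" is entirely the runtime's doing:

```python
# A minimal sketch of the statefulness illusion. `call_model` is a stand-in
# for a stateless LLM call: it sees only the prompt it is handed, and no
# state survives the call. All apparent memory lives in the infrastructure.

def call_model(prompt: str) -> str:
    """Stand-in for stateless inference: responds to the last prompt line only."""
    return f"response to: {prompt.splitlines()[-1]}"

class AgentRuntime:
    """The infrastructure that *looks like* agent memory."""

    def __init__(self):
        self.history: list[str] = []  # persisted outside the model

    def ask(self, user_msg: str) -> str:
        # The runtime decides what the model gets to "remember": here,
        # the full history is re-injected into the context window each call.
        prompt = "\n".join(self.history + [user_msg])
        reply = call_model(prompt)
        self.history += [user_msg, reply]
        return reply

runtime = AgentRuntime()
runtime.ask("my name is Ada")
runtime.ask("what is my name?")
# The model never remembered anything. The runtime re-supplied the history
# on every call; delete `runtime.history` and the "memory" is gone.
```

Every production memory framework is, at bottom, a more sophisticated version of that `ask` method: deciding what to persist and what to surface at inference time.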

This distinction matters because it exposes the scope of what is currently being solved. Stateful agent frameworks — LangGraph, MemGPT/Letta, Amazon Bedrock AgentCore Memory, and others — address continuity within a workflow and within a user session. They do not address what happens between runs, across agent instances, or across different executions of the same workflow by different users.

Each agent run, regardless of the framework, is still largely isolated from the accumulated experience of every run that came before it.


The Horizontal Scaling Problem

Horizontal scaling in multi-agent systems means something different from what the term usually implies in infrastructure. It is not about running more agent instances in parallel — that is a load distribution problem, and it is solved. The horizontal scaling problem I’m describing is about propagating learned competence across agents and across runs.

When I mapped the gap concretely, it looked like this:

| Capability | Current State |
| --- | --- |
| Agents share state within a run | Solved |
| Agents communicate within a workflow | Solved |
| Agent learns within a run (self-reflection) | Partial — Reflexion |
| Successful strategy propagates to the next run | Not solved |
| Knowledge discovered by one agent is available to others | Not solved |
| Collective intelligence accumulates over time without retraining | Not solved |

The bottom three rows represent the horizontal scaling gap. It is not a matter of framework maturity — it is an architectural primitive that does not yet exist in production multi-agent systems.


What the Field Has Built as Workarounds

Research and engineering teams have made partial progress, and it’s worth naming what exists honestly.

Shared episodic memory stores. Agents can write successful reasoning traces or strategy summaries to a vector database that future agent instances retrieve via RAG. This is useful, but the memory is static once written. It does not update based on outcomes, and retrieval quality determines whether the right experience surfaces at the right moment.

Reflexion and its descendants. Reflexion (Shinn et al., NeurIPS 2023) introduced a framework where agents verbally reflect on task feedback and store those reflections in an episodic memory buffer to improve decision-making in subsequent trials — without modifying model weights. This is a genuine step forward, and it’s the work that first made me think seriously about this problem. But Reflexion is fundamentally a within-run or within-session mechanism. The reflective memory does not propagate across agent instances or persist as a shared resource across independent runs.
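The loop itself is simple to sketch. In this toy version the actor and the self-reflection step are hypothetical stand-ins, but the structure mirrors the paper: attempt, reflect verbally on failure, store the reflection, retry with reflections in context. The key limitation is visible in the code: the buffer is session-scoped.

```python
# Minimal sketch of the Reflexion trial loop. The actor and reflection
# functions are stand-ins; the episodic buffer lives only for this session,
# so nothing here propagates to other agent instances or later runs.

def attempt(task: str, reflections: list[str]) -> bool:
    # Stand-in actor: succeeds once enough reflections constrain the search.
    return len(reflections) >= 2

def reflect(task: str, trial: int) -> str:
    # Stand-in for verbal self-reflection on the failure feedback.
    return f"trial {trial} failed on '{task}'; avoid the previous approach"

def reflexion(task: str, max_trials: int = 5) -> tuple[bool, int]:
    memory: list[str] = []  # episodic reflection buffer (session-scoped)
    for trial in range(1, max_trials + 1):
        if attempt(task, memory):
            return True, trial
        memory.append(reflect(task, trial))
    return False, max_trials

print(reflexion("navigate the maze"))  # -> (True, 3)
```

When `reflexion` returns, `memory` is garbage-collected. That is the within-session boundary in one line.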

ExpeL and Experiential Reflective Learning. More recent work, including ExpeL (Zhao et al., 2024) and ERL (2025), extracts reusable heuristics by comparing successful and failed trajectories, then injects the most relevant heuristics into future agent contexts via retrieval. This is directionally correct. ERL reports a +7.8% improvement over a ReAct baseline on complex agentic benchmarks precisely because failure-derived heuristics provide negative constraints that prune ineffective strategies. But even here, the experience pool is curated offline, retrieval is still prompt injection, and the feedback loop is not real-time.

Prompt distillation and fine-tuning. Successful agent runs can generate training data that feeds a fine-tuning pipeline. This is horizontally scalable in principle — the knowledge of one run eventually improves the base model that all agents use. But the feedback loop is slow, expensive, requires human curation, and operates offline. It is not collective learning; it is deferred knowledge consolidation.

Workflow libraries and pattern registries. Teams manually curate successful workflow templates. This is human-mediated knowledge transfer, not agent-mediated. It does not scale.

None of these close the gap. They are engineered workarounds for the absence of a proper horizontal learning primitive.


What Is Actually Missing

The architectural primitive that does not yet exist is a persistent, agent-writable, outcome-weighted knowledge layer — one where agents contribute strategy signals after a run completes, and those signals influence future agent behavior without requiring a full retraining cycle or human curation.

The biological analogy came back to me here from the swarm intelligence research I covered in Part II: pheromone trails in ant colonies are not just a communication mechanism — they are a distributed, incrementally updated knowledge store. Shorter, higher-quality paths accumulate stronger signals through positive feedback. Failed paths evaporate. The swarm’s collective intelligence is encoded in the medium itself, not in any individual. No central controller decides which trails are “good.” The outcome does.
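The pheromone mechanism is worth sketching because it is so compact: deposit on success, evaporate everywhere, and ranking emerges from the medium with no controller. All numbers below are illustrative assumptions, not a proposed implementation:

```python
# Toy pheromone-style knowledge store. Successful strategies deposit signal,
# all signals evaporate each cycle, and weak trails vanish below a floor.
# Collective preference is encoded in the store itself, not in any agent.

class PheromoneStore:
    def __init__(self, evaporation: float = 0.8, floor: float = 0.05):
        self.evaporation = evaporation
        self.floor = floor
        self.trails: dict[str, float] = {}  # strategy -> signal strength

    def deposit(self, strategy: str, quality: float) -> None:
        # Positive feedback: better outcomes reinforce the trail more.
        self.trails[strategy] = self.trails.get(strategy, 0.0) + quality

    def evaporate(self) -> None:
        # Negative feedback: every trail decays; failed paths drop out.
        self.trails = {s: w * self.evaporation
                       for s, w in self.trails.items()
                       if w * self.evaporation > self.floor}

    def best(self) -> str:
        return max(self.trails, key=self.trails.get)

store = PheromoneStore()
for _ in range(3):
    store.deposit("short path", quality=1.0)  # repeatedly successful
    store.deposit("long path", quality=0.2)   # weak outcome
    store.evaporate()
print(store.best())  # -> "short path"
```

The evaporation constant is doing real work: too slow and stale strategies dominate, too fast and the colony forgets what it learned. The same tuning problem would exist in any agent-facing equivalent.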

What that looks like for LLM-based multi-agent systems is still an open design problem, but the requirements I’ve been able to identify are:

  • Outcome-weighted writes. Agent runs that complete successfully contribute to the shared knowledge layer with positive weight; failed runs contribute negative constraints. Both are useful — ERL’s results show that failure-derived heuristics often outperform success-derived ones on search tasks.
  • Decentralized propagation. The update mechanism cannot require a human in the loop or an offline batch process. Strategy signals need to propagate in something close to real time across agent instances.
  • Relevance-gated retrieval. Future agents need to surface relevant prior experience without injecting everything into context. This is partially addressed by LLM-based retrieval scoring, but remains unsolved at scale.
  • No weight updates required. The mechanism needs to operate within the context engineering layer, not through gradient descent. Retraining is too slow and too expensive for real-time collective learning.
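To make the four requirements concrete, here is a sketch of what an agent-facing interface for this layer might look like. Everything in it is a hypothetical design exercise, not an existing library: the class names, the overlap-based relevance gate, and the unit weights are all assumptions standing in for harder unsolved pieces.

```python
# Hypothetical sketch of an outcome-weighted knowledge layer, one element
# per requirement above: outcome-weighted writes (report), runtime
# availability (no batch step), relevance-gated retrieval (advise), and
# no weight updates (signals are injected as context, never trained in).

from dataclasses import dataclass

@dataclass
class Signal:
    context: str   # task the signal applies to
    lesson: str    # heuristic or constraint
    weight: float  # positive if success-derived, negative if failure-derived

class KnowledgeLayer:
    def __init__(self):
        self.signals: list[Signal] = []

    def report(self, context: str, lesson: str, succeeded: bool) -> None:
        # Outcome-weighted write, visible to all agents immediately.
        self.signals.append(Signal(context, lesson, 1.0 if succeeded else -1.0))

    def advise(self, context: str, k: int = 2) -> list[str]:
        # Relevance gate: surface only the k most related signals, phrasing
        # failure-derived ones as negative constraints (per the ERL finding).
        words = set(context.lower().split())
        scored = sorted(self.signals,
                        key=lambda s: len(words & set(s.context.lower().split())),
                        reverse=True)
        return [f"do: {s.lesson}" if s.weight > 0 else f"avoid: {s.lesson}"
                for s in scored[:k]]

layer = KnowledgeLayer()
layer.report("scrape paginated site", "follow rel=next links", succeeded=True)
layer.report("scrape paginated site", "guessing page URLs", succeeded=False)
print(layer.advise("scrape a paginated catalog site"))
```

The hard parts this sketch waves away are exactly the open problems: a relevance gate that works at scale, signal decay so stale strategies lose influence, and write access control so one bad agent cannot poison the layer for everyone.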

Why the Industry Has Not Solved It

The more I thought about it, the more I realized the incentive structure explains the gap more than the technical difficulty does.

Vertical scaling — a bigger model, a stronger benchmark score, a longer context window — has a clear commercial lever. It is attributable to a specific product release and easy to market. Horizontal knowledge propagation is architecturally harder, requires runtime infrastructure that does not exist yet, and the value it generates is distributed across runs and users rather than attributable to a single capability upgrade.

Google Research’s recent work on scaling agent systems found that adding more agents does not consistently improve performance — multi-agent coordination yields substantial gains on parallelizable tasks but can actually degrade performance on sequential workflows. More agents is not the answer. Smarter knowledge transfer is. But that is a harder problem to benchmark and a harder story to sell.


The Architectural Opportunity

The systems that will win over the next two to three years will not be the ones with the largest individual agents. They will be the ones that figure out how to make collective experience accumulate efficiently across runs, across users, and across agent instances — without requiring a human editor or an offline training cycle to make it useful.

This is, in a meaningful sense, the missing layer of agentic AI infrastructure. The orchestration layer exists — I covered it in Part I. The communication protocols exist. The shared state store exists. The swarm coordination patterns exist — I covered those in Part II. What does not exist is a production-grade mechanism for collective learning that operates at runtime.

The research directions are beginning to converge on this problem — Reflexion, ERL, Collaborative Memory — but none has produced a general-purpose primitive that production systems can adopt. That gap is both the honest state of the art and the most interesting open problem in multi-agent architecture today.


References