The Token Tax of Multi-Agent Systems

When a multi-agent system gets expensive, the reflex is to reach for a bigger model or a bigger context window. But in a fleet, the tokens that dominate the bill aren't spent on reasoning. They're spent on repetition.

The bill is repetition, not reasoning

A single agent that re-reads its whole history every turn is already wasteful. A fleet of agents is the same mistake, multiplied. Four habits drive the waste:

Re-sent history, each agent carries its own transcript to "remember," and pays for the early turns again at every later one.
Full context, passed around, hand-offs forward the entire conversation instead of the relevant slice.
The same fact, N ways, shared specs, conventions and decisions are copied into every agent's window instead of stored once.
Re-derivation across agents, one agent solves a problem; a sibling hits the same wall later and solves it again.

The shape of the bill is agents × history × redundancy: three multipliers, none of which is capability. A bigger window doesn't help, cost still rises with every token, latency climbs, and models lose the thread in the middle of long contexts. Five huge windows is just five times the duplicated spend.

What fixing it is worth

So we measured it. On MA-MemBench, our benchmark for the communication cost of a fleet, we put SuperLazy head-to-head against a naive fleet (every agent re-ships everything) and against mem0, a popular memory layer, all running the same model under identical conditions. The result: you don't trade accuracy for efficiency. You get both.

95%

accuracy, the highest of the three

~10×

fewer tokens the lead agent reads vs mem0

~8×

lower cost per query vs mem0

extra ingest overhead

The lead agent reads roughly 900 tokens per query instead of 8,600 with mem0 or 24,000 with a naive fleet, a tiny, precise context instead of wading through everything. And there's a hidden cost the headline numbers miss: mem0 runs an extraction model on every update to the world, which in this run burned ~1.5 million tokens before a single question was asked. SuperLazy adds none of that, so its total spend is a fraction of mem0's on top of being faster per query.

The advantage isn't a one-time discount, it widens with scale. As the world grows from 8 to 64 agents, a naive fleet's cost climbs to $2.08 per query while SuperLazy stays flat at ~$0.012, a gap of over 100× exactly when your system is scaling.

Validated beyond our own benchmark

The same efficiency thesis holds on the field's open long-memory benchmark, LongMemEval, where the answer is buried inside ~100,000 tokens of history. SuperLazy surfaces the handful of moments that hold the answer, 95.3% Recall@15 and 92.5% Recall@10, while sending roughly 18× less context than feeding the full history (~5.8k tokens per question vs ~103k).

The takeaway is simple: the cheapest token is the one you never re-send. Strip out the duplication and a deep, fleet-wide memory costs about the same to query as a shallow one, with accuracy that goes up, not down.

See the full results on the research page →

The token tax of multi-agent systems, and what fixing it is worth.

The bill is repetition, not reasoning

What fixing it is worth

Validated beyond our own benchmark

The cheapest token
is the one you never re-send.

The bill is repetition, not reasoning

What fixing it is worth

Validated beyond our own benchmark

The cheapest tokenis the one you never re-send.

The cheapest token
is the one you never re-send.