SuperLazy Research — Memory for Agent Fleets

The problem

The token tax of multi-agent systems.

A single agent that re-reads its whole history every turn is wasteful. A fleet of them is the same mistake, multiplied — and most of the bill never had to exist. When a multi-agent system gets expensive, the reflex is a bigger model or a bigger context window. But in a fleet, the tokens that dominate the bill aren't spent on reasoning — they're spent on repetition.

Re-sent history

Each agent carries its own transcript to "remember," and pays for the early turns again at every later one.

Full context, passed around

Hand-offs and orchestrator steps forward the entire conversation instead of the relevant slice.

The same fact, N ways

Shared specs, conventions and decisions are copied into every agent's window instead of stored once.

Re-derivation across agents

One agent solves a problem; a sibling hits the same wall later and solves it again — the multiplier unique to fleets.

A bigger window doesn't help

Cost still rises with every token, attention is roughly quadratic in context length so latency climbs, and models lose the thread in the middle of long contexts. Five huge windows are just five times the duplicated spend.

The shape of the bill

agents × history × redundancy

Three multipliers, none of which is capability. The real lever isn't a bigger prompt; it's keeping the knowledge out of the prompt altogether.

Input tokens per turn — one agent

[Chart legend: Re-send full context · Recall only what's relevant]

Re-sending the transcript grows linearly with the conversation; recall stays flat no matter how deep the history runs. The shaded band is spend you simply stop paying — and in a fleet, it repeats per agent.
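The linear-versus-flat contrast can be sketched with a toy model. This is a minimal illustration, not SuperLazy's implementation: the function names, the fixed per-turn message size, and the fixed recall budget are all assumptions made for the example.

```python
def resend_tokens_per_turn(turns: int, tokens_per_turn: int) -> list[int]:
    """Input tokens at each turn when the whole transcript is re-sent."""
    # Turn t carries every earlier turn plus the new message.
    return [t * tokens_per_turn for t in range(1, turns + 1)]


def recall_tokens_per_turn(turns: int, tokens_per_turn: int, recall_budget: int) -> list[int]:
    """Input tokens at each turn when only a bounded recall slice is sent."""
    # Each turn carries the new message plus a fixed budget of recalled memory.
    return [tokens_per_turn + recall_budget for _ in range(turns)]


# Illustrative values only; they are not taken from any benchmark run.
TURNS, PER_TURN, BUDGET = 50, 200, 600

resend = resend_tokens_per_turn(TURNS, PER_TURN)
recall = recall_tokens_per_turn(TURNS, PER_TURN, BUDGET)

print("last turn, re-send:", resend[-1])
print("last turn, recall: ", recall[-1])
print("cumulative re-send:", sum(resend))
print("cumulative recall: ", sum(recall))
```

Under these assumptions the per-turn input for re-sending grows linearly, so its cumulative total grows quadratically with conversation length, while the recall path stays at a constant per-turn size.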

The benchmark

No one was measuring this. So we built MA-MemBench.

Existing memory benchmarks score whether a system can answer a question. None of them measure what actually drives the bill in a fleet: how many tokens it costs to keep many agents in sync as the world keeps changing. So we built one.

A living, multi-agent world

Agents own different pieces of the world. The world keeps changing over time, and the latest value is what's true now — so freshness and ordering genuinely matter.
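One way to picture why freshness and ordering matter is a store that keeps only the newest value per owned fact, even when updates arrive out of order. The sketch below is hypothetical: `Update`, `LatestValueStore`, the `version` field, and the agent and key names are invented for illustration and are not MA-MemBench's or SuperLazy's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Update:
    owner: str    # agent that owns this piece of the world (hypothetical name)
    key: str      # which fact is being updated
    value: str
    version: int  # assumed to increase with time; larger means newer


class LatestValueStore:
    """Keeps only the newest value per (owner, key), regardless of arrival order."""

    def __init__(self) -> None:
        self._facts: dict = {}

    def apply(self, update: Update) -> bool:
        """Store the update if it is newer than the held one; return True if stored."""
        slot = (update.owner, update.key)
        current = self._facts.get(slot)
        if current is not None and current.version >= update.version:
            return False  # stale or duplicate update, ignore it
        self._facts[slot] = update
        return True

    def get(self, owner: str, key: str) -> Optional[str]:
        found = self._facts.get((owner, key))
        return found.value if found is not None else None


store = LatestValueStore()
store.apply(Update("billing-agent", "plan_price", "$20", version=1))
store.apply(Update("billing-agent", "plan_price", "$25", version=3))
late = store.apply(Update("billing-agent", "plan_price", "$22", version=2))  # arrives late

print(store.get("billing-agent", "plan_price"))  # the version-3 value wins
```

A memory system that answers with the version-2 value here would be retrieving a real fact that is no longer true, which is the failure a changing world is meant to expose.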

A strict, fair contest

Every memory system is given the same plain-language question and graded by the same judge — so we measure the memory, not the prompt wording.

The metrics that matter

Not just accuracy — but the tokens sent between agents, what the lead agent must actually read, the cost, and the latency behind every answer.

A scaling stress test

We grow the world from a handful of agents to a large fleet — exposing which approaches stay flat and which blow up as the network scales.

Results · multi-agent

Same answers. A fraction of the cost.

On MA-MemBench, we put SuperLazy head-to-head against a naive fleet (every agent re-ships everything) and against mem0, a popular memory layer — all running the same model under identical conditions.

0.95

accuracy — highest of the three

~10×

fewer tokens the lead agent reads vs mem0

~8×

lower cost per query vs mem0

Zero

extra ingest overhead, while mem0 spends millions of tokens

Answer accuracy

All three retrieve the needed facts — but SuperLazy answers most reliably, especially on questions that require pulling together facts from many agents.

Tokens the lead agent must read · per query

SuperLazy delivers the exact current fact, so the lead agent reads a tiny, precise context instead of wading through everything.

The hidden cost: ingest overhead

mem0 runs an extraction model on every update to the world — in this run, that alone burned ~1.5 million tokens before a single question was asked. SuperLazy adds no ingest overhead at all, which is why its total spend is a tiny fraction of mem0's — on top of being faster per query.

Cost as the fleet grows — the real test

[Chart legend: Naive fleet · SuperLazy]

A naive fleet's cost climbs with the network; SuperLazy stays flat. At a 64-agent world the gap is over 100× — the advantage widens exactly as your system scales.
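The shape of that curve can be sketched with a simple model of what the lead agent reads per query. Everything here is an assumption for illustration: the function names, the idea that a naive lead reads one forwarded context per agent, and the token sizes. It does not reproduce the benchmark's numbers.

```python
def naive_lead_tokens(num_agents: int, tokens_per_agent: int) -> int:
    """Lead agent reads every agent's forwarded context."""
    return num_agents * tokens_per_agent


def recall_lead_tokens(facts_needed: int, tokens_per_fact: int) -> int:
    """Lead agent reads only the facts the question needs."""
    return facts_needed * tokens_per_fact


# Assumed sizes, chosen only to show the trend.
TOKENS_PER_AGENT = 500  # context each agent forwards
FACTS_NEEDED = 3        # facts required by one question
TOKENS_PER_FACT = 50    # size of one recalled fact

for n in (4, 16, 64):
    naive = naive_lead_tokens(n, TOKENS_PER_AGENT)
    recall = recall_lead_tokens(FACTS_NEEDED, TOKENS_PER_FACT)
    print(f"{n:>3} agents: naive={naive:>6} tokens, recall={recall:>4} tokens")
```

In this model the naive path scales with the number of agents, while the recall path depends only on how many facts a question needs, so the gap between them widens as the fleet grows.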

Validation · LongMemEval

Proven on the standard long-memory benchmark.

LongMemEval is the field's open benchmark for long-term conversational memory. It asks ~500 questions split across six categories, each stressing a different memory skill, and buries each answer inside a sprawling chat history of hundreds of sessions and ~100,000 tokens. It tests whether a system can truly find and use what it was told long ago.

Single-session · user

Recall a fact the user stated in one earlier conversation.

Single-session · assistant

Recall something the assistant itself said earlier.

Preference

Use a preference the user expressed to answer the way they'd want.

Multi-session

Join facts scattered across many different sessions.

Temporal reasoning

Reason about time, order and exactly when something happened.

Knowledge update

Track a value that changed over time and use the latest one.

Answer accuracy by category · 500 questions

95.1%

Recall@15

92.4%

Recall@10

~18×

smaller context than feeding the full history

~5.8k

tokens sent per question vs ~103k full context

The same efficiency thesis, validated on an open benchmark: keep the answer's evidence in view, keep the context tiny — so a deep memory costs about the same to query as a shallow one.
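For readers unfamiliar with the Recall@k figures above, here is a minimal sketch of the common information-retrieval definition: the fraction of gold evidence items that appear among the top k retrieved items, averaged over questions. The source does not say which exact variant was used for the reported numbers, so treat this as an assumption; the session IDs are made up.

```python
def recall_at_k(ranked_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Fraction of gold evidence items that appear in the top-k retrieved items."""
    if not gold_ids:
        raise ValueError("gold_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & gold_ids) / len(gold_ids)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average Recall@k over (ranked_ids, gold_ids) pairs, one pair per question."""
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)


# Toy data with hypothetical session IDs: (retrieval ranking, gold evidence).
runs = [
    (["s7", "s2", "s9", "s4"], {"s2", "s4"}),
    (["s1", "s3", "s5", "s8"], {"s8"}),
]

print(mean_recall_at_k(runs, k=2))
print(mean_recall_at_k(runs, k=4))
```

A larger k can only keep or raise recall, but it also enlarges the context sent to the model, which is the trade-off behind reporting recall alongside tokens per question.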

The memory layer that makes agent fleets affordable.

The cheapest token is the one you never re-send.
