Multi-agent systems don't get expensive because the reasoning is hard — they get expensive because the same knowledge is sent, copied, and re-discovered over and over. Our research wing builds the shared-memory infrastructure that ends that waste, and the benchmark that proves it.
A single agent that re-reads its whole history every turn is wasteful. A fleet of them is the same mistake, multiplied — and most of the bill never had to exist. When a multi-agent system gets expensive, the reflex is a bigger model or a bigger context window. But in a fleet, the tokens that dominate the bill aren't spent on reasoning — they're spent on repetition.
The cost of re-sending the transcript grows linearly with the length of the conversation, while the cost of recall stays flat no matter how deep the history runs. The shaded band between the two is spend you simply stop paying, and in a fleet that saving repeats for every agent.
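To make the linear-versus-flat comparison concrete, here is a minimal sketch of the arithmetic. The per-turn size and the recall budget are assumed placeholder values, not measurements, and the function names are hypothetical.

```python
def resend_tokens(turn: int, tokens_per_turn: int) -> int:
    """Tokens sent on a given turn (1-indexed) when the full transcript is re-sent."""
    return turn * tokens_per_turn


def recall_tokens(recall_budget: int, tokens_per_turn: int) -> int:
    """Tokens sent when only a fixed-size recalled context plus the new turn is sent."""
    return recall_budget + tokens_per_turn


# Illustrative assumptions only.
TOKENS_PER_TURN = 200
RECALL_BUDGET = 300
TURNS = 50

resend = [resend_tokens(t, TOKENS_PER_TURN) for t in range(1, TURNS + 1)]
recall = [recall_tokens(RECALL_BUDGET, TOKENS_PER_TURN) for _ in range(TURNS)]

# The gap between the two totals corresponds to the shaded band.
saved = sum(resend) - sum(recall)
print(f"last turn: resend={resend[-1]} recall={recall[-1]}")
print(f"cumulative: resend={sum(resend)} recall={sum(recall)} saved={saved}")
```

Under these assumptions the per-turn cost of re-sending keeps climbing while recall stays constant, so the cumulative gap widens with every turn.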
Existing memory benchmarks score whether a system can answer a question. None of them measure what actually drives the bill in a fleet: how many tokens it costs to keep many agents in sync as the world keeps changing. So we built one.
Agents own different pieces of the world. The world keeps changing over time, and the latest value is what's true now — so freshness and ordering genuinely matter.
Every memory system is given the same plain-language question and graded by the same judge — so we measure the memory, not the prompt wording.
We report more than accuracy: the tokens sent between agents, what the lead agent must actually read, the cost, and the latency behind every answer.
We grow the world from a handful of agents to a large fleet — exposing which approaches stay flat and which blow up as the network scales.
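The benchmark design described above can be sketched as a toy harness. This is an illustration under stated assumptions, not the actual benchmark code: `World`, `MemorySystem`, `FullDump`, `TargetedRecall`, and `exact_match_judge` are hypothetical names, tokens are approximated by whitespace-separated words, and the "question" is matched by a simple key lookup.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Protocol, Tuple


@dataclass
class World:
    """Latest-value-wins store; each key is owned by one agent and versioned."""
    # key -> (version, owner, value)
    facts: Dict[str, Tuple[int, str, str]] = field(default_factory=dict)

    def update(self, key: str, owner: str, value: str, version: int) -> bool:
        current = self.facts.get(key)
        if current is not None and version <= current[0]:
            return False  # stale or replayed update: ordering matters
        self.facts[key] = (version, owner, value)
        return True

    def truth(self, key: str) -> str:
        return self.facts[key][2]


class MemorySystem(Protocol):
    def answer(self, question: str) -> Tuple[str, int]:
        """Return (answer, tokens the lead agent had to read)."""
        ...


@dataclass
class Result:
    correct: bool
    lead_context_tokens: int


def exact_match_judge(answer: str, expected: str) -> bool:
    return answer.strip().lower() == expected.strip().lower()


def evaluate(systems: Dict[str, MemorySystem], question: str, expected: str,
             judge: Callable[[str, str], bool]) -> Dict[str, Result]:
    """Give every system the same question and grade it with the same judge."""
    results = {}
    for name, system in systems.items():
        answer, tokens = system.answer(question)
        results[name] = Result(judge(answer, expected), tokens)
    return results


class FullDump:
    """Toy baseline: the lead agent reads every fact in the world."""
    def __init__(self, world: World) -> None:
        self.world = world

    def answer(self, question: str) -> Tuple[str, int]:
        lines = [f"{k} is {f[2]}" for k, f in self.world.facts.items()]
        tokens = sum(len(line.split()) for line in lines)
        hit = next((f[2] for k, f in self.world.facts.items() if k in question), "")
        return hit, tokens


class TargetedRecall:
    """Toy alternative: the lead agent reads only the matching fact."""
    def __init__(self, world: World) -> None:
        self.world = world

    def answer(self, question: str) -> Tuple[str, int]:
        for key, fact in self.world.facts.items():
            if key in question:
                return fact[2], len(f"{key} is {fact[2]}".split())
        return "", 0


world = World()
world.update("deploy_region", "ops_agent", "us-east", version=1)
world.update("deploy_region", "ops_agent", "eu-west", version=2)
stale_accepted = world.update("deploy_region", "ops_agent", "us-east", version=1)
world.update("build_status", "ci_agent", "green", version=1)

question = "What is the current deploy_region?"
results = evaluate(
    {"full_dump": FullDump(world), "targeted": TargetedRecall(world)},
    question, world.truth("deploy_region"), exact_match_judge,
)
for name, result in results.items():
    print(name, result)
```

Both toy systems answer correctly here, but the lead agent reads fewer tokens with targeted recall, which is the kind of difference the extra metrics are meant to expose. A real harness would also record inter-agent tokens, cost, and latency, and would repeat the run at several fleet sizes.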
On MA-MemBench, we put SuperLazy head-to-head against a naive fleet (every agent re-ships everything) and against mem0, a popular memory layer — all running the same model under identical conditions.
All three retrieve the needed facts — but SuperLazy answers most reliably, especially on questions that require pulling together facts from many agents.
SuperLazy delivers the exact current fact, so the lead agent reads a tiny, precise context instead of wading through everything.
mem0 runs an extraction model on every update to the world — in this run, that alone burned ~1.5 million tokens before a single question was asked. SuperLazy adds no ingest overhead at all, which is why its total spend is a tiny fraction of mem0's — on top of being faster per query.
A naive fleet's cost climbs with the network; SuperLazy stays flat. At a 64-agent world the gap is over 100× — the advantage widens exactly as your system scales.
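The two cost effects described above, write-time ingest overhead and read-time growth with fleet size, can be separated in a small cost model. This is a sketch with made-up constants. It does not reproduce the benchmark's numbers or any specific product's behavior, and the strategy labels are hypothetical.

```python
from typing import Dict


def total_tokens(updates: int, ingest_tokens_per_update: int,
                 queries: int, tokens_per_query: int) -> int:
    """Total spend = write-time (ingest) overhead + read-time (query) cost."""
    return updates * ingest_tokens_per_update + queries * tokens_per_query


def compare(num_agents: int, updates: int = 1_000, queries: int = 100) -> Dict[str, int]:
    # All constants below are illustrative assumptions, not benchmark results.
    tokens_per_agent_dump = 1_000   # what one agent re-ships in a naive fleet
    extraction_per_update = 300     # model tokens spent extracting each update
    targeted_context = 500          # fixed-size context delivered per query
    return {
        # Per-query cost grows with the number of agents; no ingest cost.
        "naive_fleet": total_tokens(updates, 0, queries,
                                    num_agents * tokens_per_agent_dump),
        # Flat per-query cost, but every update pays an extraction fee.
        "extract_on_write": total_tokens(updates, extraction_per_update,
                                         queries, targeted_context),
        # Flat per-query cost and no ingest fee.
        "no_ingest_recall": total_tokens(updates, 0, queries, targeted_context),
    }


for n in (2, 8, 32):
    print(n, compare(n))
```

In this model only the naive strategy depends on `num_agents`, so its gap to the flat strategies grows in proportion to fleet size, while the extraction strategy carries a fixed overhead that is paid before any question is asked.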
LongMemEval is the field's open benchmark for long-term conversational memory. It poses ~500 questions split across six categories, each stressing a different memory skill, and buries each answer inside a sprawling chat history (hundreds of sessions, ~100,000 tokens). It tests whether a system can truly find and use what it was told long ago.
The same efficiency thesis, validated on an open benchmark: keep the answer's evidence in view, keep the context tiny — so a deep memory costs about the same to query as a shallow one.
We're building the shared-memory layer that makes agent fleets affordable, and the benchmarks that keep it honest. If you're running agents at scale, let's talk.