The Missing Multi-Agent Benchmark (And Why We Built MA-MemBench)

There's a strange blind spot in how we evaluate memory for AI agents. We have good benchmarks for whether a system can answer a question. We have almost nothing for whether it can do so affordably once you have more than one agent. And in a fleet, affordability is the whole game.

What existing benchmarks measure

Benchmarks like LongMemEval and LoCoMo are built around a single assistant with a long history. They bury a fact in a transcript and ask: can you find it and answer correctly? That's a real and useful skill. But they all share the same frame (one agent, accuracy only), and none of them measures what actually drives the bill at scale: how many tokens it costs to keep many agents in sync as the world keeps changing.

Why the gap is a real problem

  • You can't optimize what you can't measure. If accuracy is the only score, every system is incentivized to dump more context into the prompt, which is exactly the behavior that makes fleets expensive.
  • It hides the dominant cost. In a fleet, the tokens spent repeating context to keep agents in sync outweigh the tokens spent reasoning, and an accuracy-only benchmark is blind to them.
  • There's no fair comparison. Two systems can score the same accuracy while one costs 100× more to run at scale.

So we built MA-MemBench

MA-MemBench measures the communication cost of a multi-agent system, not just whether it answers correctly. Six design choices make it work:

  • A living, multi-agent world: agents own different pieces of the world, the world keeps changing, and the latest value is what's true now.
  • A strict, fair contest: every system gets the same plain-language question and the same judge, so we measure the memory, not the prompt wording.
  • The metrics that matter: accuracy plus the tokens sent between agents, what the lead agent must read, the cost, and the latency.
  • A scaling stress test: we grow the world from a handful of agents to a large fleet, exposing which approaches stay flat and which blow up.
  • The hard questions, on purpose: the latest value after many changes, facts that only emerge by chaining across agents, and counts that span the whole fleet.
  • Knowing when to say "I don't know": refusing to answer about things that never happened, because a confident wrong answer is a failure, not a near-miss.
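
As a rough illustration of the first and last design choices above, the sketch below models a world where the most recent write wins and where "I don't know" is the correct answer to a question about something that never happened. It is a minimal sketch, not MA-MemBench's actual harness: the `World` class, the `UNKNOWN` string, and the exact-match `score` function are hypothetical names invented for this example, and a real judge would likely be more forgiving than string equality.

```python
from dataclasses import dataclass, field

# Hypothetical sentinel for "this never happened"
UNKNOWN = "I don't know"


@dataclass
class World:
    # key -> (owning agent, latest value); a later write replaces an earlier one
    facts: dict[str, tuple[str, str]] = field(default_factory=dict)

    def update(self, agent: str, key: str, value: str) -> None:
        """Record a change; only the most recent value is kept as true."""
        self.facts[key] = (agent, value)

    def truth(self, key: str) -> str:
        """Ground truth is the latest value, or UNKNOWN if the key was never written."""
        return self.facts[key][1] if key in self.facts else UNKNOWN


def score(answer: str, expected: str) -> bool:
    """Stand-in judge (an assumption): case-insensitive exact match."""
    return answer.strip().lower() == expected.strip().lower()


world = World()
world.update("billing_agent", "plan", "basic")
world.update("billing_agent", "plan", "pro")  # the world changed; "pro" is now true

stale_answer_ok = score("basic", world.truth("plan"))           # stale value fails
fresh_answer_ok = score("Pro", world.truth("plan"))             # latest value passes
abstain_ok = score("I don't know", world.truth("refund"))       # never happened: abstaining passes
confident_wrong_ok = score("approved", world.truth("refund"))   # confident guess fails
```

Under this toy scoring, a stale answer and a confident guess about a non-event both count as failures, which is the behavior the last bullet describes.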

The payoff is immediate: the benchmark cleanly separates approaches that look identical on accuracy. A naive fleet makes the lead agent read about 24,000 tokens per query, and its cost climbs by more than 100× as the world grows, while an efficient memory layer holds both flat.
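
To see why a scaling stress test can separate the two approaches, here is a back-of-the-envelope model. It assumes a naive lead agent re-reads a fixed chunk of context from every agent on each query, while a memory layer returns one fixed-size result regardless of fleet size. The function names and every number below are illustrative assumptions, not MA-MemBench measurements.

```python
def naive_lead_tokens(num_agents: int, tokens_per_agent: int) -> int:
    """Assumed naive fleet: the lead reads every agent's context on each query."""
    return num_agents * tokens_per_agent


def memory_layer_tokens(num_agents: int, tokens_per_lookup: int) -> int:
    """Assumed memory layer: one fixed-size lookup, independent of fleet size."""
    return tokens_per_lookup


# Illustrative values only
TOKENS_PER_AGENT = 200
TOKENS_PER_LOOKUP = 300
small_fleet, large_fleet = 2, 200

naive_small = naive_lead_tokens(small_fleet, TOKENS_PER_AGENT)
naive_large = naive_lead_tokens(large_fleet, TOKENS_PER_AGENT)
naive_growth = naive_large / naive_small  # grows in step with the fleet

memory_small = memory_layer_tokens(small_fleet, TOKENS_PER_LOOKUP)
memory_large = memory_layer_tokens(large_fleet, TOKENS_PER_LOOKUP)
memory_growth = memory_large / memory_small  # stays flat

print(f"naive: {naive_small} -> {naive_large} tokens ({naive_growth:.0f}x)")
print(f"memory layer: {memory_small} -> {memory_large} tokens ({memory_growth:.0f}x)")
```

In this toy model both systems could return the same answer, so an accuracy-only score would rate them equally even though one reads a hundred times more tokens at the larger fleet size.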

See the full breakdown on the SuperLazy Research page, or read about the token tax it's built to expose.

Frequently asked questions

What is MA-MemBench?
A benchmark for the communication cost of multi-agent memory systems. It measures not just accuracy, but the tokens agents exchange, what the lead agent reads, cost, and latency, as the world scales.

Why isn't accuracy enough?
Because two systems can match on accuracy while one costs far more to run at scale. In a fleet, the bill is driven by repetition, which accuracy-only benchmarks can't see.