The Missing Multi-Agent Benchmark: MA-MemBench

There's a strange blind spot in how we evaluate memory for AI agents. We have good benchmarks for whether a system can answer a question. We have almost nothing for whether it can do so affordably once you have more than one agent. And in a fleet, affordability is the whole game.

What existing benchmarks measure

Benchmarks like LongMemEval and LoCoMo are built around a single assistant with a long history. They bury a fact deep in a transcript and ask: can you find it and answer correctly? That's a real and useful skill, and it's what most "memory" leaderboards score.

But they all share the same frame: one agent, accuracy only. None of them measures the number that actually drives the bill once you scale out: how many tokens it costs to keep many agents in sync as the world keeps changing underneath them.

Why the gap is a real problem

You can't optimize what you can't measure. If the only score is accuracy, every system is incentivized to dump more context into the prompt, which is exactly the behavior that makes fleets expensive.
Accuracy-only scoring hides the dominant cost. In a fleet, repetition (re-sent history, duplicated facts, re-derivation across agents) outweighs the reasoning itself. An accuracy-only benchmark is blind to all of it.
There's no fair way to compare memory layers. Two systems can score the same accuracy while one costs 100× more to run at scale. Buyers and researchers have no apples-to-apples way to see that.

So we built MA-MemBench

MA-MemBench is a benchmark for the communication cost of a multi-agent system. It measures not just whether the system answers, but what it pays to stay coordinated. Six design choices make that possible:

A living, multi-agent world

Agents own different pieces of the world. That world keeps changing over time, and the latest value is what's true now, so freshness and ordering genuinely matter.
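To make "the latest value is what's true now" concrete, here is a minimal Python sketch of such a world. The names (`Update`, `World`, `current_value`) and the sample data are assumptions made up for illustration, not MA-MemBench's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Update:
    step: int    # logical time at which the change happened
    owner: str   # agent that owns this piece of the world
    key: str     # which fact changed
    value: str   # new value of the fact


@dataclass
class World:
    log: List[Update] = field(default_factory=list)

    def apply(self, update: Update) -> None:
        self.log.append(update)

    def current_value(self, key: str) -> Optional[str]:
        """Return the value from the highest-step update for `key`, or None."""
        matching = [u for u in self.log if u.key == key]
        if not matching:
            return None
        return max(matching, key=lambda u: u.step).value


world = World()
# Updates can arrive out of order, so truth is decided by `step`,
# not by arrival position in the log.
world.apply(Update(step=3, owner="billing-agent", key="plan", value="enterprise"))
world.apply(Update(step=1, owner="billing-agent", key="plan", value="free"))
world.apply(Update(step=2, owner="billing-agent", key="plan", value="pro"))

print(world.current_value("plan"))  # enterprise
```

A memory layer that answers "pro" here has remembered the last thing it was told rather than the thing that is currently true, which is the failure that a world with ordering is meant to expose.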

A strict, fair contest

Every memory system gets the same plain-language question and the same judge, so we measure the memory, not the prompt wording.

The metrics that matter

We report not just accuracy, but also the tokens sent between agents, the tokens the lead agent must actually read, the cost, and the latency behind every answer.
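One way to picture those metrics is as a per-query record that gets averaged over a run. The sketch below is an assumption about shape only: the field names, the `summarize` helper, and every number in it are invented for illustration and are not MA-MemBench's schema or results.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class QueryMetrics:
    correct: bool            # did the judge accept the answer?
    inter_agent_tokens: int  # tokens sent between agents for this query
    lead_read_tokens: int    # tokens the lead agent had to read
    cost_usd: float          # billed cost of producing the answer
    latency_s: float         # wall-clock seconds until the answer


def summarize(runs: List[QueryMetrics]) -> Dict[str, float]:
    """Average each metric over a set of queries."""
    return {
        "accuracy": mean(1.0 if r.correct else 0.0 for r in runs),
        "inter_agent_tokens": mean(r.inter_agent_tokens for r in runs),
        "lead_read_tokens": mean(r.lead_read_tokens for r in runs),
        "cost_usd": mean(r.cost_usd for r in runs),
        "latency_s": mean(r.latency_s for r in runs),
    }


# Invented numbers for two hypothetical systems with identical accuracy.
system_a = [
    QueryMetrics(True, 200, 150, 0.001, 0.4),
    QueryMetrics(False, 220, 150, 0.001, 0.6),
]
system_b = [
    QueryMetrics(True, 9000, 12000, 0.05, 2.0),
    QueryMetrics(False, 9400, 12000, 0.05, 2.4),
]

summary_a = summarize(system_a)
summary_b = summarize(system_b)
print(summary_a)
print(summary_b)
```

Both hypothetical systems land on the same accuracy, and only the other four columns tell them apart.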

A scaling stress test

We grow the world from a handful of agents to a large fleet, exposing which approaches stay flat and which blow up as the network scales.
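The difference between "stays flat" and "blows up" can be seen in a toy cost model. This is a sketch under loud assumptions: the two strategies, the function names, and all token counts are made up, and it is neither the MA-MemBench harness nor a reproduction of its measurements.

```python
from typing import List


def naive_lead_tokens(n_agents: int, tokens_per_agent: int) -> int:
    """Lead agent reads every agent's context: grows linearly with fleet size."""
    return n_agents * tokens_per_agent


def targeted_lead_tokens(facts_needed: int, tokens_per_fact: int) -> int:
    """Lead agent reads only the facts the question needs: independent of fleet size."""
    return facts_needed * tokens_per_fact


def growth_factor(series: List[int]) -> float:
    """How many times larger the last measurement is than the first."""
    return series[-1] / series[0]


fleet_sizes = [4, 16, 64, 256]  # hypothetical sweep
naive = [naive_lead_tokens(n, tokens_per_agent=500) for n in fleet_sizes]
targeted = [targeted_lead_tokens(facts_needed=3, tokens_per_fact=40) for _ in fleet_sizes]

for n, a, b in zip(fleet_sizes, naive, targeted):
    print(f"{n:>4} agents  naive={a:>7}  targeted={b:>4}")

print(growth_factor(naive), growth_factor(targeted))  # 64.0 1.0
```

In this toy model the naive strategy's cost tracks the fleet size exactly, while the targeted one does not depend on it at all. A real stress test measures where an actual system falls between those two extremes.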

The hard questions, on purpose

The question set deliberately includes the hard cases: the latest value after a flurry of changes, facts that only emerge by chaining across agents, and counts that span the whole fleet.
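Here is a small sketch of what the ground truth for those three question types could look like over a toy update log. The agents, keys, and values are all invented, and the log format is an assumption rather than the benchmark's own.

```python
# Each tuple is (step, owning_agent, key, value); all data here is invented.
updates = [
    (1, "orders-agent", "order-17.customer", "acme"),
    (2, "crm-agent", "acme.region", "emea"),
    (3, "crm-agent", "acme.region", "apac"),               # supersedes step 2
    (4, "ops-agent-1", "ops-agent-1.status", "degraded"),
    (5, "ops-agent-2", "ops-agent-2.status", "healthy"),
    (6, "ops-agent-3", "ops-agent-3.status", "degraded"),
    (7, "ops-agent-1", "ops-agent-1.status", "healthy"),   # recovery
]


def latest_state(log):
    """Collapse the log to the most recent value per key."""
    state = {}
    for _step, _agent, key, value in sorted(log):  # sorted by step
        state[key] = value
    return state


state = latest_state(updates)

# 1. Latest value after a flurry of changes.
region_now = state["acme.region"]

# 2. Chained fact: one agent knows the order's customer,
#    another knows that customer's region.
customer = state["order-17.customer"]
order_region = state[f"{customer}.region"]

# 3. Fleet-wide count: how many agents are degraded right now?
degraded_now = sum(
    1 for key, value in state.items()
    if key.endswith(".status") and value == "degraded"
)

print(region_now, order_region, degraded_now)  # apac apac 1
```

Each answer depends on applying every relevant update in order. A system that saw only step 2, or missed the recovery at step 7, gets a plausible but wrong result.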

Knowing when to say "I don't know"

A memory layer is also judged on whether it declines to answer a question about something that never happened. A confident wrong answer is a failure, not a near-miss.
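A scoring rule in that spirit might look like the sketch below. It is an assumption, not the benchmark's judge: exact string matching stands in for the judge, `ABSTAIN` is a made-up sentinel phrase, and treating an abstention on an answerable question as a failure is our guess at a sensible default.

```python
from typing import Optional

ABSTAIN = "I don't know"  # hypothetical sentinel for a refusal


def score(answer: str, gold: Optional[str]) -> int:
    """Return 1 for a pass, 0 for a failure.

    `gold` is None when the question asks about something that never
    happened, in which case only an abstention passes.
    """
    normalized = answer.strip().lower()
    abstained = normalized == ABSTAIN.lower()
    if gold is None:
        # Any confident answer to an unanswerable question fails outright.
        return 1 if abstained else 0
    if abstained:
        # Assumption: refusing an answerable question also counts as a failure.
        return 0
    return 1 if normalized == gold.strip().lower() else 0


print(score("I don't know", None))  # 1: correct refusal
print(score("apac", None))          # 0: confident answer about a non-event
print(score("APAC", "apac"))        # 1: correct answer
print(score("emea", "apac"))        # 0: wrong answer
```

There is no partial credit in this rule, which matches the idea that a confident wrong answer is not a near-miss.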

The payoff of measuring this is immediate: the benchmark cleanly separates approaches that look identical on accuracy. A naive fleet makes the lead agent read ~24,000 tokens per query, and its cost climbs more than 100× as the world grows, while an efficient memory layer holds both flat.

The point

You can't build affordable agent fleets without a way to see where the money goes. MA-MemBench is our attempt to make that visible: a benchmark that keeps memory layers honest on the metric that actually scales the bill.

See MA-MemBench on the research page →

Measure the bill,
then bring it down.
