There's a strange blind spot in how we evaluate memory for AI agents. We have good benchmarks for whether a system can answer a question. We have almost nothing for whether it can do so affordably once you have more than one agent. And in a fleet, affordability is the whole game.
Benchmarks like LongMemEval and LoCoMo are built around a single assistant with a long history. They bury a fact deep in a transcript and ask: can you find it and answer correctly? That's a real and useful skill, and it's what most "memory" leaderboards score.
But they all share the same frame: one agent, accuracy only. None of them measure the number that actually drives the bill once you scale out: how many tokens it costs to keep many agents in sync as the world keeps changing underneath them.
MA-MemBench is a benchmark for the communication cost of a multi-agent system. It measures not just whether the system answers, but what it pays to stay coordinated. Six design choices make that possible:
Agents own different pieces of the world. That world keeps changing over time, and the latest value is what's true now, so freshness and ordering genuinely matter.
Every memory system gets the same plain-language question and the same judge, so we measure the memory, not the prompt wording.
We report not just accuracy, but also the tokens sent between agents, what the lead agent must actually read, and the cost and latency behind every answer.
We grow the world from a handful of agents to a large fleet, exposing which approaches stay flat and which blow up as the network scales.
Questions target the hard cases: the latest value after a flurry of changes, facts that only emerge by chaining across agents, and counts that span the whole fleet.
A memory layer is also judged on whether it refuses to answer a question about something that never happened. A confident wrong answer is a failure, not a near-miss.
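To make these choices concrete, here is a minimal Python sketch of how such a world and its scoring could be modeled. It is an illustration under stated assumptions, not MA-MemBench's actual code: every name in it (Update, World, Result, is_correct) is hypothetical, and the exact-match comparison is only a stand-in for the shared judge described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Update:
    """One change to the world, made by the agent that owns the key (hypothetical schema)."""
    step: int    # logical time of the change
    owner: str   # agent that owns this piece of the world
    key: str
    value: str


@dataclass
class World:
    """Ground truth for the benchmark: an append-only log of updates."""
    updates: List[Update] = field(default_factory=list)

    def apply(self, update: Update) -> None:
        self.updates.append(update)

    def truth(self, key: str) -> Optional[str]:
        """Return the latest value for key, or None if it was never set."""
        matching = [u for u in self.updates if u.key == key]
        if not matching:
            return None
        return max(matching, key=lambda u: u.step).value


@dataclass
class Result:
    """What a memory layer returned for one question, plus what it cost."""
    answer: Optional[str]       # None means the layer abstained
    tokens_between_agents: int  # tokens sent between agents
    tokens_read_by_lead: int    # tokens the lead agent had to read
    latency_s: float


def is_correct(world: World, key: str, result: Result) -> bool:
    """Exact-match stand-in for the judge (an assumption of this sketch)."""
    expected = world.truth(key)
    if expected is None:
        # The fact never happened: only abstaining counts as correct.
        return result.answer is None
    return result.answer == expected


# Example: the billing agent changes the plan twice; only the latest value is true.
world = World()
world.apply(Update(step=1, owner="billing", key="plan", value="free"))
world.apply(Update(step=2, owner="billing", key="plan", value="pro"))

fresh = Result(answer="pro", tokens_between_agents=120, tokens_read_by_lead=40, latency_s=0.8)
stale = Result(answer="free", tokens_between_agents=15, tokens_read_by_lead=10, latency_s=0.2)
```

In this sketch the stale answer is cheaper but wrong, which is why the token counts and latency sit on the same record as the answer: accuracy and cost are reported side by side rather than collapsed into one score.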
You can't build affordable agent fleets without a way to see where the money goes. MA-MemBench is our attempt to make that visible: a benchmark that keeps memory layers honest on the metric that actually drives the bill.
We're building the shared-memory layer that makes agent fleets affordable, and the benchmarks that keep it honest. If you're running agents at scale, let's talk.