By Moshe Anconina · June 1, 2026 · 9 min read · Engineering

The honest agent-memory benchmark: we built the proving machine first

Agent-memory leaderboards are easy to game, and vendor numbers aren't comparable across protocols - the judge model alone can swing a memory score from ~49% to ~94%. So instead of racing to publish a number, we built the machine that proves one: an open, in-repo, reproducible harness. Here is what it is, what we measured at $0, what we deliberately did not publish, and the one command that lets you reproduce the full head-to-head yourself.

The problem with memory leaderboards

Every few weeks a new agent-memory system posts a leaderboard with itself on top. The trouble is that almost none of those numbers are comparable. They use different datasets, different answer models, different judges, and a different accounting of what counts as "correct." The single biggest swing isn't the memory system at all - it's the LLM judge. The same set of answers has been reported at ~49% under an independent judge and ~94% when self-judged. A leaderboard that doesn't disclose its judge is a leaderboard you cannot trust.

There's a sharper tell. On the popular LoCoMo dataset, a trivial filesystem baseline - just dump the conversation to files and let the model grep - can score higher than a real, vendor-published memory system. We ran that baseline as a control: under one judge it lands at 52.6 / 36.3 against our recall pipeline's 71.1 / 73.3 on LongMemEval. The point of the control isn't to dunk on filesystems - it's the opposite: it shows the baseline isn't weak. If your benchmark can't separate a real memory system from a pile of text files, the benchmark is broken, not the contender.

Three ways a memory leaderboard misleads:

  • Undisclosed judge. A self-judged or single-judge score is unfalsifiable - you can't tell the system from the grader.
  • Non-comparable protocol. Different datasets, answer models, and "correct/total" accounting make two vendors' numbers apples-to-oranges.
  • Hidden oracle. A per-ability hint or a full-context bypass quietly turns a recall test into a reading-comprehension test.

What we built: the proving machine

Before chasing a score, we built the thing that makes a score believable: an in-repo, Apache-2.0-licensed, reproducible harness that runs the real Comis recall pipeline - the same search → fuse → rerank → score → trust-filter → dedup path production uses - over the standard LongMemEval and LoCoMo datasets. No vendored corpus (the datasets are operator-provided out-of-band), no auto-download, nothing that touches your live store.

The harness is built around a believability checklist - the things a memory number needs before it deserves to be quoted:

1. A single, honest end-to-end mode. One pipeline answers the question. No per-ability oracle, no full-context bypass - recall has to actually find the evidence.

2. Cross-judge >= 2. No per-category number is trusted until two independent judges agree within tolerance. We grade with gpt-4o and gpt-4.1 and only headline a cell where the spread is stable.

3. Conflict of interest, disclosed. Comis authored this benchmark. We say so, in the methodology and in every manifest. Vendor-reported competitor numbers are non-comparable across protocols, so we don't reprint them as if they were.

4. Raw transcripts releasable, N and significance reported. Every run writes a secret-free manifest that records the config, the dataset hash, the invalid-excluded denominator (correct / (total - invalid)), and the per-category counts - so the run conditions are known on both sides.
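The invalid-excluded denominator in item 4 can be sketched in a few lines. This is a minimal illustration, not the harness's code: the `CategoryCounts` class and its field names are hypothetical, and the counts are made up for the example rather than read from any manifest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryCounts:
    """Per-category tallies (hypothetical field names, not the manifest schema)."""
    correct: int
    total: int
    invalid: int  # items the run marked invalid; excluded from the denominator


def invalid_excluded_accuracy(counts: CategoryCounts) -> float:
    """Return correct / (total - invalid), expressed as a percentage."""
    denominator = counts.total - counts.invalid
    if denominator <= 0:
        raise ValueError("no valid items left to score")
    return 100.0 * counts.correct / denominator


# Illustrative counts only: 21 items, 1 invalid, 15 correct.
example = CategoryCounts(correct=15, total=21, invalid=1)
print(f"{invalid_excluded_accuracy(example):.1f}")  # 15 / 20 -> 75.0
```

Recording `invalid` separately, rather than folding it into `total`, is what lets a reader recompute the score under either accounting.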

The full datasets and the answer/judge model credentials are operator-provided; the harness lists variable names only and never auto-downloads a corpus or a model. The whole thing is in the open repository, and the methodology behind it is written up on the memory benchmarks methodology page.

What we measured at $0

There are two clearly separate kinds of measured claim here, and conflating them is exactly the trap we refuse to fall into. The first is real, cross-judged QA accuracy. The second is a set of mechanical, keyless, $0 gate deltas. They live in different sections on purpose.

The cross-judged accuracy baseline (set A)

These are our sole end-to-end QA-accuracy numbers, each cell graded by both judges and read straight back from the committed 2026-05-31-j1-baseline manifest:

  • Overall cross-judged accuracy (n=135): 71.1 / 73.3, spread 2.2, stable
  • recall@5 (full 500+10 set): 0.845, vector + rerank both lit
  • knowledge-update (n=20): 75 / 75, 0.0 spread

Honest caveats, stated up front: the single-session-preference category does not survive the cross-judge spread (30 vs 45, a 15pt gap), so we don't read it as a precise figure. LoCoMo is comparability-only and never headlined, because its score is wildly judge-dependent. The accuracy table is a disclosed, category-stratified subset (135 per judge pass); recall@5 is on the full set.
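The spread check behind those caveats is simple arithmetic. Here is a minimal sketch using only the pairs quoted in this post and the 5pt tolerance mentioned in the reproduction section. The function names are hypothetical, and treating the tolerance as inclusive is an assumption.

```python
TOLERANCE_PT = 5.0  # the 5pt tolerance quoted in this post


def cross_judge_spread(score_a: float, score_b: float) -> float:
    """Absolute gap, in points, between two judges' scores for the same cell."""
    return round(abs(score_a - score_b), 1)


def is_stable(score_a: float, score_b: float, tolerance: float = TOLERANCE_PT) -> bool:
    """A cell is headlined only if the gap is within tolerance (assumed inclusive)."""
    return cross_judge_spread(score_a, score_b) <= tolerance


# Score pairs as quoted in this post, one value per judge.
cells = {
    "overall": (71.1, 73.3),
    "knowledge-update": (75.0, 75.0),
    "single-session-preference": (30.0, 45.0),
}

for name, (a, b) in cells.items():
    verdict = "stable" if is_stable(a, b) else "not headlined"
    print(f"{name}: spread {cross_judge_spread(a, b)} -> {verdict}")
```

The spread is symmetric, so it does not matter which judge produced which half of a pair.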

The mechanical gate deltas (sets B-E) - keyless, $0

This release shipped four memory tracks (knowledge-graph spread, reasoning-write correctness, query understanding, and a per-release proving gate). Each ships a mechanical, structural gate delta - a lane surfaces a linked doc, a write lands at the right trust tier, a ranking knob reorders a candidate - measured with no answer model, no judge, no key, and no cost:

KG graph-spread - linked-doc recall delta +1 (off: linked doc absent → on: surfaced purely by the graph edge). (manifest)

KG trust-first invalidation - 100% (2/2) older-high-trust-wins on SUITE-04 via the real upsertTriple. (manifest)

IQ MMR diversity - diverse-doc rank off 3 → on 2; λ=1.0 byte-identical to off. (manifest)

IQ intent reweight - temporal candidate rank off 2 → on 1. (manifest)

IQ NL temporal-range - in-window precision off 0.5 → on 1.0; an unparseable query applies no filter (byte-identity). (manifest)
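To make the MMR row concrete, here is a minimal sketch of textbook maximal marginal relevance. It is not the Comis implementation: the function name, the document ids, the relevance and similarity scores, and the λ=0.7 setting are all assumptions chosen for illustration. It shows a diverse document moving from rank 3 to rank 2, and λ=1.0 reproducing the plain relevance order.

```python
from __future__ import annotations


def mmr_rerank(
    relevance: dict[str, float],
    similarity: dict[frozenset[str], float],
    lam: float,
) -> list[str]:
    """Greedy MMR: pick the doc maximising lam*relevance - (1-lam)*redundancy."""
    # Start from relevance order so ties resolve the same way as the plain ranking.
    remaining = sorted(relevance, key=relevance.get, reverse=True)
    selected: list[str] = []
    while remaining:
        def score(doc: str) -> float:
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (similarity.get(frozenset((doc, s)), 0.0) for s in selected),
                default=0.0,
            )
            return lam * relevance[doc] - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical scores: "b" is a near-duplicate of "a"; "c" is the diverse doc.
relevance = {"a": 0.9, "b": 0.8, "c": 0.7}
similarity = {
    frozenset(("a", "b")): 0.95,
    frozenset(("a", "c")): 0.10,
    frozenset(("b", "c")): 0.10,
}

off = mmr_rerank(relevance, similarity, lam=1.0)  # pure relevance order
on = mmr_rerank(relevance, similarity, lam=0.7)   # diversity-aware order
print(off.index("c") + 1, on.index("c") + 1)      # rank of the diverse doc: 3 2
```

With λ=1.0 the redundancy term is multiplied by zero, which is why that setting can serve as a no-op check against the unmodified ranking.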

Read these correctly.

Each delta above is mechanical, keyless, $0 - not a QA-accuracy lift. A "+1 linked-doc recall delta" is not "+1% accuracy." Every one of these factors is default-off and, when off, byte-identical to the prior shipping config - no silent behaviour change, zero category regression. The end-to-end accuracy lift each track produces is honestly deferred to the operator-costed re-run. Quoting a rank-delta as an accuracy percentage is exactly the fabrication this benchmark forbids.
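The default-off guarantee can be illustrated with the temporal-range gate. This is a sketch under stated assumptions, not the shipped code: `parse_range` is a toy stand-in that only understands "between YYYY-MM-DD and YYYY-MM-DD", and the candidate shape, names, and dates are hypothetical.

```python
import re
from datetime import date
from typing import List, Optional, Tuple

Window = Tuple[date, date]
Candidate = Tuple[str, date]  # (doc id, timestamp); a hypothetical shape

_RANGE = re.compile(r"between (\d{4}-\d{2}-\d{2}) and (\d{4}-\d{2}-\d{2})")


def parse_range(query: str) -> Optional[Window]:
    """Toy range parser (an assumption); returns None when nothing parses."""
    match = _RANGE.search(query)
    if match is None:
        return None
    try:
        return date.fromisoformat(match.group(1)), date.fromisoformat(match.group(2))
    except ValueError:  # e.g. a month of 13
        return None


def temporal_filter(
    candidates: List[Candidate], query: str, enabled: bool = False
) -> List[Candidate]:
    """Default-off gate: disabled or unparseable means the input is returned untouched."""
    window = parse_range(query) if enabled else None
    if window is None:
        return candidates
    start, end = window
    return [c for c in candidates if start <= c[1] <= end]


def in_window_precision(candidates: List[Candidate], window: Window) -> float:
    """Fraction of returned candidates whose timestamp falls inside the window."""
    if not candidates:
        return 0.0
    start, end = window
    return sum(start <= ts <= end for _, ts in candidates) / len(candidates)


candidates = [("may-note", date(2026, 5, 10)), ("january-note", date(2026, 1, 3))]
query = "what changed between 2026-05-01 and 2026-05-31"
window = parse_range(query)

off = temporal_filter(candidates, query)               # gate off: list unchanged
on = temporal_filter(candidates, query, enabled=True)  # gate on: window applied
print(in_window_precision(off, window), in_window_precision(on, window))  # 0.5 1.0
```

Returning the input list itself when the gate is off or the query does not parse is the property a byte-identity check relies on: there is no code path that can reorder or drop a candidate.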

Every one of these numbers, both kinds, is on the memory leaderboard at /memory - with the accuracy table and the mechanical deltas kept in clearly separate sections, each row linked to its committed manifest.

What we deliberately did not publish

There is one number you will not find anywhere on the site: a head-to-head comparison against another named memory system. Not because we're shy - because we have a binding rule. No comparison ships until (1) the number is measured under the disclosed protocol, (2) it survives a cross-judge spread, and (3) the competitor has been re-run under that same protocol. That competitor number does not exist yet. It is deferred to the operator-costed re-run.

So we will not print a number until we have measured it, cross-judged it, and re-run the competitor under our own protocol. Vendor-reported figures graded by a different judge are non-comparable, and reprinting one as a "Comis vs X" cell would be a fabricated result. There is no fabricated competitor cell anywhere - the manifests carry a literal fabricatedNumber: false assertion (manifest). Instead of a fake cell, we hand you the gate and invite you to run the comparison yourself.

For the same reason, this post and the whole site stay strictly inside shipped code. This release shipped recall and knowledge-graph improvements - it did not ship per-type memory decay, online recall-weight tuning, a multi-party user model, or a natural-language memory Q&A surface. Those are honest roadmap items for a later release, and you won't find them dressed up as shipped capabilities here.

Reproduce it yourself

The mechanical part costs nothing. The proving machine runs the cross-judge spread keyless on injected verdicts - 3 of 4 categories survive the 5pt tolerance; the 15pt preference category does not (disclosed, consistent with the baseline). One command proves the mechanism at $0, no key required:

# prove the mechanism, keyless, $0:

scripts/bench-memory.sh gate

The full operator-costed head-to-head - "Comis vs another memory system" accuracy - needs answer + judge model credentials and the competitor systems installed, none of which is a Comis dependency. The exact steps (which env-var names to populate, which competitor packages to install, how to run the second judge pass for the cross-judge spread) are written up, copy-paste-ready, on the methodology page. We list variable names only - never an inline key.

Run the full reproduction.

The memory benchmarks methodology page has the datasets, the gated harnesses, the operator env table, and the full operator-costed head-to-head steps. The leaderboard at /memory has every published number linked to its committed run manifest under benchmarks/results/.

The thesis: honesty is the moat

Anyone can publish a number. What's hard - and what actually compounds - is publishing the protocol: the open harness, the disclosed judges, the cross-judge spread, the conflict-of-interest note, the committed manifests, and the one command that lets a skeptic reproduce every claim. The number is downstream of the method. Get the method right, in the open, and the number takes care of itself.

That's the whole bet. We built the proving machine first, measured what we honestly could at $0, kept the mechanical deltas separate from the cross-judged accuracy, and shipped you the command for the rest. When the operator-costed head-to-head is run and cross-judged, it'll land the same way everything else here did: as a manifest you can re-run, not a headline you have to take on faith.

Measured, reproducible, honest.

An open agent-memory benchmark. Every number linked to a committed manifest. The proving machine runs at $0 - reproduce it yourself.