
Memory that learns, not just remembers

Measured memory, with every number linked to its run manifest

Comis recall combines trust-ranked memory, evidence-backed observations, and measured learning signals. The accuracy numbers below were graded by two independent LLM judges under one disclosed protocol, and every number on this page links to the committed run that produced it.

Accuracy, cross-judged

The measured baseline

These are Comis's only end-to-end QA-accuracy numbers - scored by two judges and reported only where the cross-judge spread is stable.

| Metric | Judge A (gpt-4o) | Judge B (gpt-4.1) | Notes | Manifest |
| --- | --- | --- | --- | --- |
| Overall (n=135) | 71.1 | 73.3 | spread 2.2 · stable | manifest → |
| knowledge-update (n=20) | 75.0 | 75.0 | stable | manifest → |
| multi-session (n=20) | 60.0 | 65.0 | stable | manifest → |
| temporal-reasoning (n=20) | 45.0 | 40.0 | stable · weakest | manifest → |
| retrieval recall@5 | 0.845 | - | full-set · vector lane + on-device rerank | manifest → |

QA accuracy is measured on a disclosed 135-item, category-stratified subset; retrieval recall@5 is measured on the full set. The judges are cross-model (gpt-4o + gpt-4.1), not cross-provider. single-session-preference is omitted because it is judge-noisy (15 pt spread). LoCoMo is comparability-only and is never headlined.
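For readers checking the table by hand, here is a minimal sketch of the two quantities it reports. It assumes that spread is the absolute difference between the two judges' scores, and that recall@5 is the fraction of gold memory ids found in the top five results for a query. Both function names are illustrative and are not part of the Comis harness. The cutoff that separates "stable" from "judge-noisy" is not stated on this page, so the sketch does not encode one.

```python
def cross_judge_spread(score_a, score_b):
    """Absolute gap between two judges' scores, in points (assumed definition)."""
    return round(abs(score_a - score_b), 1)


def recall_at_k(ranked_ids, gold_ids, k=5):
    """Fraction of gold ids that appear in the top-k ranked ids (assumed definition)."""
    if not gold_ids:
        raise ValueError("gold_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & set(gold_ids)) / len(gold_ids)


# Rows taken from the table above.
print(cross_judge_spread(71.1, 73.3))  # 2.2  (Overall)
print(cross_judge_spread(45.0, 40.0))  # 5.0  (temporal-reasoning)

# Toy query: one of the two gold memories is ranked inside the top five.
print(recall_at_k(["a", "b", "c", "d", "e", "f"], ["b", "f"]))  # 0.5
```

A full-set figure such as 0.845 would then be the mean of the per-query values, again under the assumed definition.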

Cost & latency

What a query costs

~15.5k

tokens / query

6.25s

latency P50

9.97s

latency P95

manifest (GAP-REPORT) →
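As a reading aid, P50 and P95 are the latencies that 50% and 95% of queries finish within. The sketch below uses the nearest-rank method, which is one common convention; the page does not say which percentile method the run manifest uses, and the sample values are made up for illustration only.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative latencies in seconds, NOT measured Comis data.
sample = [float(s) for s in range(1, 21)]  # 1.0 .. 20.0
print(percentile(sample, 50))  # 10.0
print(percentile(sample, 95))  # 19.0
```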

Keyless, at $0

What the mechanical tracks provably do

These are structural gate deltas - measured with no answer model, no judge, no key, at $0. They are not end-to-end QA-accuracy lifts, and they are kept out of the accuracy table above on purpose.

Graph-spread lane (KG)

linked-doc recall OFF 0 → ON 1

(mechanical, keyless, $0 - not a QA-accuracy lift)

manifest →

MMR diversity (IQ)

diverse-doc rank OFF 3 → ON 2

(mechanical, keyless, $0 - not a QA-accuracy lift)

manifest →
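To make the "diverse-doc rank" delta concrete, here is a minimal sketch of textbook maximal marginal relevance (MMR) on a toy set of three documents. It is not the Comis implementation: the relevance scores, the similarity values, and the trade-off weight `lam` are all assumptions chosen so that a diverse document moves from rank 3 to rank 2 once MMR is switched on.

```python
def mmr_order(relevance, similarity, lam=0.5):
    """Greedy MMR: repeatedly pick the doc maximising
    lam * relevance - (1 - lam) * (max similarity to anything already picked)."""
    remaining = set(relevance)
    selected = []
    while remaining:
        def mmr_score(doc):
            redundancy = max((similarity(doc, s) for s in selected), default=0.0)
            return lam * relevance[doc] - (1 - lam) * redundancy

        best = max(sorted(remaining), key=mmr_score)  # sorted() keeps ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical scores: "b" is a near-duplicate of "a", "c" is the diverse doc.
relevance = {"a": 0.90, "b": 0.85, "c": 0.70}
pair_sim = {frozenset("ab"): 0.95, frozenset("ac"): 0.10, frozenset("bc"): 0.10}


def similarity(x, y):
    return pair_sim[frozenset((x, y))]


off = sorted(relevance, key=relevance.get, reverse=True)
on = mmr_order(relevance, similarity)
print(off)  # ['a', 'b', 'c']  -> diverse doc "c" at rank 3
print(on)   # ['a', 'c', 'b']  -> diverse doc "c" at rank 2
```

The second pick is where the reorder happens: "b" is penalised for overlapping with the already-selected "a", so the less relevant but non-redundant "c" overtakes it.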

Keyless, at $0

What the newer tracks provably do

Six capabilities Comis now ships - each proven with no answer model, no judge, no key, at $0. As above, these are structural invariants, not end-to-end QA-accuracy lifts; the one measured learning signal is a recall-score lift, and the costed competitor comparison is the operator-costed re-run below.

Learning-to-rank (trust frozen)

bandit recall-SCORE lift +0.1 over 5 episodes (rank position flat on the keyless lane)

(mechanical, keyless, $0 - not a QA-accuracy lift)

manifest →

Principled ranking decay

old/unused memory factor 0.553 < fresh 0.995 (decay ranks, never gates; byte-identical at neutral)

(mechanical, keyless, $0 - not a QA-accuracy lift)

manifest →
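One way to picture "decay ranks, never gates" is a multiplicative factor that reorders candidates without dropping any of them. The sketch below assumes an exponential half-life curve and hypothetical names (`decay_factor`, `rank_with_decay`, `half_life_days`); the actual Comis decay function and its parameters are not described on this page.

```python
def decay_factor(age_days, half_life_days):
    """Assumed exponential decay: 1.0 when fresh, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)


def rank_with_decay(candidates, half_life_days=None):
    """candidates: list of (memory_id, base_score, age_days).
    half_life_days=None is the neutral setting: every factor is 1.0."""
    def adjusted(candidate):
        _memory_id, base_score, age_days = candidate
        if half_life_days is None:
            return base_score
        return base_score * decay_factor(age_days, half_life_days)

    # Every candidate is kept; decay only changes the order.
    return [c[0] for c in sorted(candidates, key=adjusted, reverse=True)]


candidates = [("old", 0.90, 365), ("fresh", 0.80, 1)]
print(rank_with_decay(candidates))                     # ['old', 'fresh']  (neutral)
print(rank_with_decay(candidates, half_life_days=30))  # ['fresh', 'old']
```

At the neutral setting the output order matches the undecayed ranking exactly, which is the property the "byte-identical at neutral" check is about.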

Per-user profile

typed per-user records round-trip 4/4; an external-trust upsert is rejected (0 rows); (tenant, agent, user) isolation; recall stays LLM-free

(mechanical, keyless, $0 - not a QA-accuracy lift)

manifest →
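The isolation and trust checks above can be sketched with a toy in-memory store. Everything here is hypothetical: the `ProfileStore` class, its method names, and the `"system"` trust label are illustrative stand-ins, and only the `"external"` label and the (tenant, agent, user) key come from the description above.

```python
class ProfileStore:
    """Toy per-user profile store keyed by (tenant, agent, user)."""

    def __init__(self):
        self._rows = {}

    def upsert(self, tenant, agent, user, field, value, trust):
        """Return the number of rows written; external-trust writes are rejected."""
        if trust == "external":
            return 0
        self._rows.setdefault((tenant, agent, user), {})[field] = value
        return 1

    def read(self, tenant, agent, user):
        """Plain dictionary lookup: no model call is involved in a read."""
        return dict(self._rows.get((tenant, agent, user), {}))


store = ProfileStore()
print(store.upsert("t1", "a1", "u1", "timezone", "UTC", trust="system"))    # 1
print(store.upsert("t1", "a1", "u1", "timezone", "PST", trust="external"))  # 0
print(store.read("t1", "a1", "u1"))  # {'timezone': 'UTC'}
print(store.read("t1", "a1", "u2"))  # {}  (a different user sees nothing)
```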

Per-channel relationship model (dormant / default-off)

directional A→B and B→A as two distinct edges; the sign-off gate holds (enabled-but-unsigned ⇒ 0 reads)

(mechanical, keyless, $0 - not a QA-accuracy lift)

manifest →

Ask-your-memory tool (opt-in / default-off)

recall stays LLM-free (0 model calls on read); citations are a subset of the recalled ids; mandatory abstention on empty recall

(mechanical, keyless, $0 - not a QA-accuracy lift)

manifest →
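The two answer-side invariants, citations drawn only from recalled ids and abstention on empty recall, can be expressed as a small wrapper. This is a sketch under assumptions: `Answer`, `ask_memory`, and the `compose` callback are hypothetical names, and `compose` stands in for whatever produces the answer text after recall has finished.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    text: str
    citations: tuple
    abstained: bool


def ask_memory(recalled, compose):
    """recalled: list of (memory_id, text) pairs; compose: callable returning (text, cited_ids)."""
    if not recalled:
        # Mandatory abstention: compose() is never called on empty recall.
        return Answer(text="", citations=(), abstained=True)
    text, cited_ids = compose(recalled)
    allowed = {memory_id for memory_id, _ in recalled}
    stray = set(cited_ids) - allowed
    if stray:
        raise ValueError(f"citations not in recalled ids: {sorted(stray)}")
    return Answer(text=text, citations=tuple(cited_ids), abstained=False)


def compose_first(recalled):
    memory_id, text = recalled[0]
    return text, [memory_id]


print(ask_memory([], compose_first).abstained)                       # True
print(ask_memory([("m1", "likes tea")], compose_first).citations)    # ('m1',)
```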

Query-conditional usefulness reorder

a memory used for intent X ranks 1 vs 2 for an X- vs Y-query (perIntentRankLift 1); citation→FEED accrual

(mechanical, keyless, $0 - not a QA-accuracy lift)

manifest →

Competitive head-to-head · cross-judged

Competitive with mem0 - at $0 on-device

On 8 LongMemEval questions, cross-judged by two independent models (gpt-4o + claude, spread 0.0 on every cell), Comis's LLM-free, $0 on-device recall scores the same as mem0, and both clear a full-context-dump control by +37.5 pt. At this best-effort N=8 the two are statistically indistinguishable, so the claim is competitive-with, not better-than; the difference between them is production economics, not answer quality.

| System | Judge 1 (gpt-4o) | Judge 2 (claude) | Note |
| --- | --- | --- | --- |
| Comis (as-shipped recall) | 87.5% (7/8) | 87.5% | LLM-free recall, $0 on-device |
| mem0 (mem0ai 2.0.4, re-run by us) | 87.5% (7/8) | 87.5% | paid LLM fact-extraction at ingest (~53 min / 8 items) |
| letta-fs control (full-haystack dump) | 50.0% (4/8) | 50.0% | the honesty anchor |

Cross-judge spread is 0.0 on every cell (both judges agreed), so every number survives the stability check. Comis and mem0 both score 7/8: at N=8 that is a tie, not a win, and it supports a competitive-with claim only, never a superiority claim. Both clear the full-dump control by +37.5 pt, which shows the bench discriminates.

Per-capability QA lift (N=50 mix, cross-judged): the Comis baseline scores 98.0% / 94.0% (spread 4.0, survives). The two recall-config capabilities are byte-identical to baseline on all 50 questions, so the measured lift is +0.0 pt. No recall-config capability showed a measured QA lift, and that null result is reported as measured.

N=8 and N=50 are best-effort, operator-costed samples, not the definitive scale. Zep, Hindsight, and Mnemosyne were skipped with disclosure (not wired for this run), so they have no cells rather than fabricated ones. Comis tied mem0: competitive-with, never a superiority claim.

manifest →
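The arithmetic behind the table, and why N=8 is treated as best-effort, can be checked with a few lines. The accuracy and lift calculations use only the counts shown above. The Wilson score interval is a standard technique swapped in here for illustration; the page does not say which statistical test, if any, the benchmark itself applies, and the function names are hypothetical.

```python
import math


def accuracy_pct(correct, total):
    """Accuracy as a percentage."""
    return 100.0 * correct / total


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


comis = accuracy_pct(7, 8)    # 87.5
control = accuracy_pct(4, 8)  # 50.0
print(comis - control)        # 37.5  (the +37.5 pt gap over the full-dump control)

low, high = wilson_interval(7, 8)
print(round(low, 3), round(high, 3))  # roughly 0.529 0.978
```

An interval that wide around 7/8 is the reason an identical score from two systems at N=8 reads as a tie and nothing more.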

Head-to-head

Reproduce or extend it via the gate

The competitor head-to-head above is the operator-costed re-run, measured best-effort at N=8. The gate is how you extend it - a larger N, or the skip-with-disclosure competitors (Zep / Hindsight / Mnemosyne). We will not print a number we have not measured.

scripts/bench-memory.sh gate

Comis authored this benchmark. Vendor-reported numbers are non-comparable across protocols; competitors are invited to reproduce on their own harness.

Full methodology + how to reproduce →