Memory that learns, not just remembers
Measured memory, with every number linked to its run manifest
Comis recall combines trust-ranked memory, evidence-backed observations, and measured learning signals. The accuracy numbers below were graded by two independent LLM judges under one disclosed protocol, and every number links to the committed run that produced it.
Accuracy, cross-judged
The measured baseline
These are Comis's only end-to-end QA-accuracy numbers - scored by two judges and reported only where the cross-judge spread is stable.
| Metric | Judge A (gpt-4o) | Judge B (gpt-4.1) | Notes | Manifest |
|---|---|---|---|---|
| Overall (n=135) | 71.1% | 73.3% | spread 2.2 pt · stable | manifest → |
| knowledge-update (n=20) | 75.0% | 75.0% | stable | manifest → |
| multi-session (n=20) | 60.0% | 65.0% | stable | manifest → |
| temporal-reasoning (n=20) | 45.0% | 40.0% | stable · weakest | manifest → |
| retrieval recall@5 | 0.845 | - | full-set · vector lane + on-device rerank | manifest → |
QA accuracy is measured on a disclosed 135-item category-stratified subset; retrieval recall@5 is full-set. Judges are cross-model (gpt-4o + gpt-4.1), not cross-provider. The single-session-preference category is omitted because it is judge-noisy (15 pt spread); LoCoMo is comparability-only and is never headlined.
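The spread figures quoted here are simple arithmetic on the two judge columns. A minimal sketch follows, using the scores from the table above; the 10 pt cutoff is an assumption made for illustration only, since the page does not publish its stability threshold.

```python
# Scores copied from the accuracy table (percent, Judge A vs Judge B).
SCORES = {
    "overall": (71.1, 73.3),
    "knowledge-update": (75.0, 75.0),
    "multi-session": (60.0, 65.0),
    "temporal-reasoning": (45.0, 40.0),
}

# Hypothetical cutoff: the real stability threshold is not stated on this page.
MAX_STABLE_SPREAD = 10.0


def spread(judge_a: float, judge_b: float) -> float:
    """Absolute cross-judge disagreement, in percentage points."""
    return round(abs(judge_a - judge_b), 1)


def is_stable(judge_a: float, judge_b: float, limit: float = MAX_STABLE_SPREAD) -> bool:
    """A cell is treated as reportable when the judges disagree by at most `limit`."""
    return spread(judge_a, judge_b) <= limit


if __name__ == "__main__":
    for name, (a, b) in SCORES.items():
        print(f"{name}: spread {spread(a, b)} pt, stable={is_stable(a, b)}")
```

Under that assumed cutoff, every row in the table passes and a 15 pt disagreement does not, which matches how the omitted category is described.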
Cost & latency
What a query costs
| Metric | Value |
|---|---|
| Tokens / query | ~15.5k |
| Latency P50 | 6.25s |
| Latency P95 | 9.97s |
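For readers less familiar with P50 / P95, the sketch below shows one common way to derive them from per-query timings (the nearest-rank method). The sample latencies are invented for illustration, and the page does not say which percentile method the Comis run used.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical per-query latencies in seconds (not Comis measurements).
latencies = [5.1, 5.8, 6.0, 6.2, 6.3, 6.9, 7.4, 8.0, 9.1, 10.2]

p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)

if __name__ == "__main__":
    print(f"P50={p50}s P95={p95}s")
```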
Keyless, at $0
What the mechanical tracks provably do
These are structural gate deltas - measured with no answer model, no judge, no key, at $0. They are not end-to-end QA-accuracy lifts, and they are kept out of the accuracy table above on purpose.
Graph-spread lane (KG)
linked-doc recall OFF 0 → ON 1
(mechanical, keyless, $0 - not a QA-accuracy lift)
manifest →
MMR diversity (IQ)
diverse-doc rank OFF 3 → ON 2
(mechanical, keyless, $0 - not a QA-accuracy lift)
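To make the MMR delta concrete, here is a minimal sketch of textbook maximal marginal relevance reranking. It is not Comis's implementation: the vectors, the document ids, and the `lam` weight are all assumptions chosen so that a near-duplicate is pushed below a more diverse document.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_order(query, docs, lam=0.5):
    """Return doc ids in MMR order. docs maps id -> vector; lam=1.0 is pure relevance."""
    remaining = dict(docs)
    selected = []
    while remaining:
        def score(doc_id):
            relevance = cosine(query, remaining[doc_id])
            # Penalise similarity to anything already picked.
            redundancy = max(
                (cosine(remaining[doc_id], docs[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical embeddings: "a_dup" nearly duplicates "a"; "diverse" is less relevant but different.
query = (1.0, 0.0)
docs = {"a": (1.0, 0.1), "a_dup": (1.0, 0.12), "diverse": (0.7, -0.7)}

if __name__ == "__main__":
    print("MMR off:", mmr_order(query, docs, lam=1.0))
    print("MMR on: ", mmr_order(query, docs, lam=0.5))
```

In this toy setup the diverse document moves from third to second once the redundancy penalty is active. It illustrates the kind of rank change the gate checks; it does not reproduce the gate.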
manifest →
Keyless, at $0
What the newer tracks provably do
Six capabilities Comis now ships - each proven with no answer model, no judge, no key, at $0. As above, these are structural invariants, not end-to-end QA-accuracy lifts; the one measured learning signal is a recall-score lift, and the costed competitor comparison is the operator-costed re-run below.
Learning-to-rank (trust frozen)
bandit recall-SCORE lift +0.1 over 5 episodes (rank position flat on the keyless lane)
(mechanical, keyless, $0 - not a QA-accuracy lift)
manifest →
Principled ranking decay
old/unused memory factor 0.553 < fresh 0.995 (decay ranks, never gates; byte-identical at neutral)
(mechanical, keyless, $0 - not a QA-accuracy lift)
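"Decay ranks, never gates" can be sketched as a multiplicative factor that reorders candidates without dropping any. The half-life formula, the field names, and the sample scores below are assumptions; the page does not disclose the actual decay function, and the sketch does not try to reproduce the 0.553 / 0.995 factors.

```python
from typing import Optional


def decay_factor(days_idle: float, half_life_days: Optional[float]) -> float:
    """Exponential decay in (0, 1]; None means neutral (decay disabled)."""
    if half_life_days is None:
        return 1.0
    return 0.5 ** (days_idle / half_life_days)


def rank(candidates, half_life_days: Optional[float] = None):
    """candidates: list of (id, base_score, days_idle). Returns every id, reordered."""
    scored = [
        (cid, base * decay_factor(idle, half_life_days))
        for cid, base, idle in candidates
    ]
    # Sorting only: nothing is filtered out, so decay can never gate a memory.
    return [cid for cid, _ in sorted(scored, key=lambda item: item[1], reverse=True)]


# Hypothetical candidates: an old memory with a slightly higher base score vs a fresh one.
candidates = [("old", 0.90, 300.0), ("fresh", 0.80, 1.0)]

if __name__ == "__main__":
    print("neutral:", rank(candidates))
    print("decayed:", rank(candidates, half_life_days=365.0))
```

With decay disabled the order is the plain base-score order, which is the sketch's analogue of the "identical at neutral" property; with decay on, the fresh memory outranks the old one and both remain in the result.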
manifest →
Per-user profile
typed per-user records round-trip 4/4; an external-trust upsert is rejected (0 rows); (tenant, agent, user) isolation; recall stays LLM-free
(mechanical, keyless, $0 - not a QA-accuracy lift)
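The profile invariants above can be pictured with a toy in-memory store. Everything here is hypothetical (the class, the trust labels, the record shape); it only illustrates scoping by (tenant, agent, user) and refusing writes that carry external trust.

```python
class ProfileStore:
    """Toy per-user profile store; not the Comis schema."""

    def __init__(self):
        self._rows = {}

    def upsert(self, tenant, agent, user, key, value, trust):
        """Return the number of rows written; external-trust writes are rejected."""
        if trust == "external":  # hypothetical trust label
            return 0
        self._rows[(tenant, agent, user, key)] = value
        return 1

    def get(self, tenant, agent, user, key):
        """Plain dictionary lookup: no model call is involved on read."""
        return self._rows.get((tenant, agent, user, key))


store = ProfileStore()
written = store.upsert("t1", "agent-a", "alice", "timezone", "UTC+2", trust="system")
rejected = store.upsert("t1", "agent-a", "alice", "timezone", "UTC-8", trust="external")

if __name__ == "__main__":
    print(written, rejected, store.get("t1", "agent-a", "alice", "timezone"))
```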
manifest →
Per-channel relationship model (dormant / default-off)
directional A→B and B→A as two distinct edges; the sign-off gate holds (enabled-but-unsigned ⇒ 0 reads)
(mechanical, keyless, $0 - not a QA-accuracy lift)
manifest →
Ask-your-memory tool (opt-in / default-off)
recall stays LLM-free (0 model calls on read); citations are a subset of the recalled ids; mandatory abstention on empty recall
(mechanical, keyless, $0 - not a QA-accuracy lift)
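These three invariants are the kind of property that can be checked without any model in the loop. A minimal sketch of such a checker follows; the `Answer` type, its fields, and the `model_calls_on_read` counter are assumptions for illustration, not the tool's real interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    """Hypothetical shape of an ask-your-memory response."""
    text: str
    citations: tuple = ()
    abstained: bool = False


def check_answer(recalled_ids, answer, model_calls_on_read=0):
    """Return the list of violated invariants; an empty list means the answer passes."""
    violations = []
    recalled = set(recalled_ids)
    if model_calls_on_read != 0:
        violations.append("recall must be LLM-free")
    if not set(answer.citations) <= recalled:
        violations.append("citation outside recalled ids")
    if not recalled and not answer.abstained:
        violations.append("must abstain on empty recall")
    return violations


if __name__ == "__main__":
    ok = Answer("You moved in March.", citations=("m1",))
    print(check_answer(["m1", "m2"], ok))
    print(check_answer([], Answer("A guess.")))
```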
manifest →
Query-conditional usefulness reorder
a memory used for intent X ranks 1 vs 2 for an X- vs Y-query (perIntentRankLift 1); citation→FEED accrual
(mechanical, keyless, $0 - not a QA-accuracy lift)
manifest →
Competitive head-to-head · cross-judged
Competitive with mem0 - at $0 on-device
On 8 LongMemEval questions, cross-judged by two independent models (gpt-4o + claude, spread 0.0 on every cell), Comis's LLM-free $0 on-device recall scores the same as mem0 - and both clear a full-context-dump control by +37.5 pt. At this best-effort N=8 the two are statistically indistinguishable, so Comis is competitive with mem0 at $0 on-device; the difference is production economics, not answer quality.
| System | Judge 1 (gpt-4o) | Judge 2 (claude) | Note |
|---|---|---|---|
| Comis (as-shipped recall) | 87.5% (7/8) | 87.5% | LLM-free recall, $0 on-device |
| mem0 (mem0ai 2.0.4, re-run by us) | 87.5% (7/8) | 87.5% | paid LLM fact-extraction at ingest (~53 min / 8 items) |
| letta-fs control (full-haystack dump) | 50.0% (4/8) | 50.0% | the honesty anchor |
Cross-judge spread 0.0 on every cell (both judges agreed) - every number survives. Comis and mem0 score the same (both 7/8): at N=8 a tie, not a win - competitive-with, never a superiority claim. Both clear the full-dump control by +37.5 pt (the bench discriminates).
Per-capability QA-lift (N=50 mix, cross-judged): Comis baseline 98.0% / 94.0% (spread 4.0, survives); the two recall-config capabilities are byte-identical to baseline on all 50 questions → +0.0 pt measured lift. No recall-config capability showed a measured QA-lift - the measure-first outcome.
N=8 / N=50 are best-effort operator-costed samples, not the definitive scale. Zep / Hindsight / Mnemosyne were skipped with disclosure (not wired in this run) rather than given a fabricated cell. Comis tied mem0: competitive-with, never a superiority claim.
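One way to see why N=8 supports only a competitive-with claim is to put a confidence interval around 7/8. The sketch below uses the standard Wilson score interval at 95% (z = 1.96); choosing this particular interval is our assumption, as the page does not say which statistical test, if any, the run applied.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion; returns (low, high)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    low, high = wilson_interval(7, 8)
    print(f"7/8 correct: 95% interval {low:.2f} to {high:.2f}")
```

For 7 of 8 the interval runs from roughly 0.53 to 0.98, and since Comis and mem0 share the same count they share the same interval. A larger N through the gate is what would narrow it.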
Head-to-head
Reproduce or extend it via the gate
The competitor head-to-head above is the operator-costed re-run, measured best-effort at N=8. The gate is how you extend it - a larger N, or the skip-with-disclosure competitors (Zep / Hindsight / Mnemosyne). We will not print a number we have not measured.
scripts/bench-memory.sh gate
Comis authored this benchmark. Vendor-reported numbers are non-comparable across protocols; competitors are invited to reproduce on their own harness.