By Moshe Anconina · June 1, 2026 · 10 min read · Engineering

The memory that models you - what we shipped, measured at $0

Last release we shipped the proving machine and refused to publish a number we couldn't reproduce. This release the memory layer grew up: it now keeps a per-user profile, a per-channel relationship model, an opt-in tool that answers a question straight from memory, a loop that learns which memories prove useful, and principled ranking decay of stale ones. Every one of those is shipped code with a committed manifest behind it. So here is the honest accounting - what each capability is, what we actually measured at $0, the one learning signal we can stand behind, and the costed comparison we deliberately did not publish.

Capability-backed, not a headline number

The honest framing from the last launch still holds: anyone can post a memory score, and almost none of them are comparable across protocols. So we are not claiming a comparative-ranking win here. What changed is the capability surface - the memory layer now does five concrete things it didn't before, each one TDD-green and each one proven keyless at $0. We describe each capability, cite its committed manifest, and label exactly what kind of number backs it. None of these is an end-to-end QA-accuracy lift - those stay honestly deferred to the operator-costed re-run.

What the memory layer now ships

  • A per-user profile. Memory now types and stores what it knows about an individual user, isolated per (tenant, agent, user). Verified keyless: prefix-typing round-trips 4/4, an externally-sourced upsert is rejected (0 rows), a credential-shaped candidate is blocked at write, cross-axis reads return nothing, and the whole thing is default-off and model-free on the read path. (phase107-user manifest)
  • A per-channel relationship model. Directional A→B and B→A are two distinct edges, isolated per (tenant, agent, channel). It ships default-off behind a sign-off gate: enabled-but-unsigned yields 0 reads and a null block - it is dormant until an operator explicitly signs it on, never self-enabling. (phase108-social manifest)
  • An opt-in tool that answers a question from memory. Default-off; when enabled, it answers grounded strictly in cited recalled records - the recall path itself stays model-free, every citation is a subset of the recalled ids (a bogus id is dropped), and it abstains when recall comes back empty rather than inventing an answer. (consolidated re-prove manifest)
  • A loop that learns which memories prove useful. An opt-in, model-free loop turns "this recalled memory was cited" into a bounded reweighting of recall ranking - and the trust signal stays frozen under that tuning, so a learned weight can never promote an untrusted memory. Default-off is byte-identical to the prior shipping config. (phase111-learn-rank manifest)
  • Principled ranking decay of stale memories. Older, low-importance, unused memories rank lower over time - decay reorders, it never gates or deletes a record, and at neutral importance it is byte-identical to no decay at all. A fresh memory is never reordered. (phase112-forget manifest)
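The grounding rule described for the answer-from-memory tool (every citation must be one of the recalled ids, and an empty recall means abstaining) can be sketched in a few lines. This is a minimal illustration, not the shipped code: `Record` and `ground_citations` are hypothetical names, and Python is only a vehicle for the invariant.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    """A recalled memory record (hypothetical shape, for illustration only)."""
    id: str
    text: str


def ground_citations(recalled: List[Record], proposed_ids: List[str]) -> Optional[List[str]]:
    """Keep only citations that point at recalled records.

    Returns None to signal "abstain" when recall came back empty.
    """
    if not recalled:
        return None  # nothing was recalled, so there is nothing to ground an answer in
    recalled_ids = {r.id for r in recalled}
    # A citation that is not among the recalled ids is dropped, never trusted.
    return [cid for cid in proposed_ids if cid in recalled_ids]
```

With two recalled records `m1` and `m2`, a proposed citation list of `["m1", "bogus"]` is reduced to `["m1"]`; with no recalled records the function returns `None` regardless of what was proposed.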

A sixth piece ties them together: recall is now query-conditional. A memory that proved useful for one kind of question ranks ahead for that kind of question and not for an unrelated one. This per-intent reorder was measured keyless (a used-for-intent memory ranks 1 under the matched intent and 2 under a mismatched one), with the same default-off byte-identity discipline. (phase110-learn-iq manifest)
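One way to picture the per-intent reorder is a usefulness table keyed by (intent, memory id), so a boost earned under one intent is invisible to every other intent. The sketch below is an assumption-laden toy: `rank_for_intent`, the `boost` value, the intent labels, and the base scores are all hypothetical and are not taken from the manifest.

```python
from typing import Dict, List, Tuple


def rank_for_intent(
    base_scores: Dict[str, float],
    usefulness: Dict[Tuple[str, str], float],
    intent: str,
    boost: float = 0.5,  # hypothetical boost size, chosen only for the example
) -> List[str]:
    """Order memory ids by base score plus a boost that applies only to this intent."""
    def score(memory_id: str) -> float:
        return base_scores[memory_id] + boost * usefulness.get((intent, memory_id), 0.0)

    # sorted() is stable, so ties keep the base ordering.
    return sorted(base_scores, key=score, reverse=True)


base = {"a": 1.0, "b": 0.8}            # "a" leads on base score alone
learned = {("billing", "b"): 1.0}      # "b" proved useful for billing questions only

matched = rank_for_intent(base, learned, "billing")     # "b" is boosted past "a"
mismatched = rank_for_intent(base, learned, "travel")   # no boost applies
default_off = rank_for_intent(base, {}, "billing")      # empty table: base order
```

Here memory `b` ranks first under the matched intent and second under the mismatched one, and an empty usefulness table reproduces the base ordering, which mirrors the default-off behaviour described above.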

The one learning signal we can stand behind

"It learns" is the easiest thing in this space to overclaim, so here is the single learning number we measured, stated exactly. Over 5 episodes of the same query, the model-free loop climbs its tuned usefulness weight 0.125 → 0.225, and that climb raises the repeatedly-cited gold memory's boosted recall score monotonically to a +0.1 score lift (MEASURED-POSITIVE). On the same keyless lane the gold memory's rank position is flat (rankLift = 0, MEASURED-FLAT): on a model-free, FTS-only 1/rank ordering the gaps between adjacent positions are large, so the score lift does not move the memory even a single rank; that needs the model-graded fusion lane. We record the flat result as measured rather than fabricate a positive one.

learning signal - bandit recall-score lift +0.1 over 5 episodes; rank position flat on the keyless lane; tuned usefulness weight 0.125 → 0.225; trust frozen. manifest
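To see how a score can rise while the rank stays put, here is a toy replay of the five episodes. Only the endpoints (0.125 → 0.225, +0.1 lift, flat rank, 1/rank base ordering) come from the text; the rest are assumptions made for illustration: a fixed step of 0.02 per episode, a hypothetical cap of 0.5, an additive boost of the form base + weight × cited, and a gold memory that starts at rank 2 behind an uncited rival at rank 1. The frozen trust signal is not modelled.

```python
from typing import List, Tuple


def run_episodes(
    episodes: int = 5,
    start: float = 0.125,
    step: float = 0.02,   # assumed per-episode increment (not from the manifest)
    cap: float = 0.5,     # assumed upper bound on the tuned weight
) -> List[Tuple[float, float, int]]:
    """Return (weight, gold_score, gold_rank) after each episode."""
    gold_base = 1 / 2    # assumed: gold memory sits at rank 2 on a 1/rank ordering
    rival_base = 1 / 1   # the rank-1 memory, never cited in this toy
    weight = start
    history = []
    for _ in range(episodes):
        weight = min(cap, weight + step)      # bounded reweighting
        gold = gold_base + weight * 1.0       # cited, so it receives the boost
        rival = rival_base + weight * 0.0     # not cited, so no boost
        rank = 1 if gold > rival else 2
        history.append((round(weight, 3), round(gold, 3), rank))
    return history


history = run_episodes()
score_lift = round(history[-1][1] - (0.5 + 0.125), 3)  # final score minus starting score
```

Under these assumptions the weight ends at 0.225 and the gold score climbs from 0.625 to 0.725, a lift of 0.1, yet the gap between 1/2 and 1/1 is 0.5, so the gold memory never overtakes the rival and its rank stays at 2 throughout.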

Read this correctly.

+0.1 is a recall-score lift, not an accuracy gain. It is mechanical, keyless, and $0 - we do not round it into "+0.1% accuracy," and it does not appear anywhere near the cross-judged accuracy table below. The loop is default-off and, when off, byte-identical to the prior shipping config. The end-to-end QA-accuracy impact of this loop is exactly the kind of number we refuse to assert without a costed, cross-judged run - so it is deferred, not estimated.

The accuracy baseline (re-stated, kept separate)

Our only end-to-end QA-accuracy numbers are the cross-judged baseline from the last release. They are re-stated here, not re-measured - a keyless run produces no accuracy number, and inventing a fresh one would be the exact fabrication this whole approach forbids. Each cell was graded by two independent judges (gpt-4o and gpt-4.1) and read straight back from the committed 2026-05-31-j1-baseline manifest:

  • Overall, cross-judged (n=135): 71.1 / 73.3 · spread 2.2 · stable
  • recall@5 (full 500+10 set): 0.845 · vector + rerank both lit
  • Knowledge-update (n=20): 75 / 75 · spread 0.0

Same caveats as the baseline, restated honestly: the single-session-preference category does not survive the cross-judge spread (30 vs 45, a 15-point gap), so we don't read it as a precise figure. The accuracy figures come from a disclosed, category-stratified subset (135 per judge pass); recall@5 is on the full set. None of the new capabilities above changed these numbers - they were not re-run this release.
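The spread figures quoted here are simply the absolute difference between the two judges' scores for the same cell. A small sketch makes the arithmetic explicit; `cross_judge_spread` and the `cells` dictionary are hypothetical names, the text does not say which judge produced which score, and no stability threshold is assumed because none is stated.

```python
from typing import Dict, Tuple


def cross_judge_spread(judge_a: float, judge_b: float) -> float:
    """Absolute gap, in percentage points, between two judges' scores for one cell."""
    return round(abs(judge_a - judge_b), 1)


# Score pairs as quoted in the post (percent correct per judge pass).
cells: Dict[str, Tuple[float, float]] = {
    "overall": (71.1, 73.3),
    "knowledge-update": (75.0, 75.0),
    "single-session-preference": (30.0, 45.0),
}

spreads = {name: cross_judge_spread(a, b) for name, (a, b) in cells.items()}
```

This reproduces the 2.2, 0.0, and 15-point spreads, and shows why the preference category is treated differently: its gap is several times larger than the overall one.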

That separation is the whole point. The capability deltas are mechanical and keyless; the accuracy figures are costed and cross-judged. They live in different sections, on this page and on the leaderboard at /memory, and we never blend a ranking delta into a score.

What we honestly deferred

Two numbers you will not find on this site, by the same binding rule as last time. The first is a head-to-head comparison against another named memory system - no comparison ships until it is measured under the disclosed protocol, survives a cross-judge spread, and the competitor has been re-run under that same protocol. The second is a per-capability end-to-end QA-accuracy lift: "turning capability X on raised answer accuracy by Y." Both require an operator-costed run - competitor installs, answer and judge model credentials, and two judge passes for the spread - none of which is a Comis dependency.

So we shipped the capabilities, measured what is keyless at $0, and handed you the gate for the rest. The consolidated re-prove manifest records every keyless re-prove outcome, lists each deferred item explicitly as "not measured - operator-costed re-run," and carries the one-command reproduction. Comis authored this benchmark; we say so in every manifest, and we invite competitors to reproduce the comparison themselves rather than take a vendor cell on faith. The reproduction is one command:

# prove the mechanism, keyless, $0:

scripts/bench-memory.sh gate


The memory benchmarks methodology page has the datasets, the gated harnesses, the operator env table (variable names only - never an inline key), and the full operator-costed head-to-head steps. The leaderboard at /memory links every published number to its committed run manifest under benchmarks/results/.

The bet, unchanged: honesty is the moat

The memory now models the people it talks to, learns which of its memories earn their place, and lets the stale ones fade in ranking - and we can show you the committed manifest behind each of those sentences. What we won't do is dress a mechanism up as an accuracy win, or print a competitor comparison we haven't run. The capabilities are real and shipped; the one learning number is small, measured, and honestly +0.1; the accuracy baseline is re-stated and separate; the costed comparison is deferred with the command to reproduce it.

That's still the whole bet. Build the capability, measure what you honestly can at $0, keep the mechanical deltas strictly apart from the cross-judged accuracy, and ship the command for the rest. When the operator-costed head-to-head is run and cross-judged, it'll land the same way everything else here did: as a manifest you can re-run, not a headline you have to take on faith.

Capability-backed, reproducible, honest.

A memory layer that models you - every capability linked to a committed manifest. The proving machine runs at $0; reproduce it yourself.