
Benchmarking Brain on LongMemEval: 81.6% and the Product Lesson

Brain is a product bet: index important data once, let Claude Code, OpenClaw, and Hermes ask naturally, and avoid stuffing every session with expensive context. LongMemEval tested whether that bet survives a real memory benchmark.

81.6% on LongMemEval

Benchmarking Brain — measuring long-term memory for AI agents

Brain scored 408/500 = 81.60% on LongMemEval_s — matching Supermemory’s GPT-4o row, using only BM25 retrieval and a Claude Sonnet reader, on a stack that runs locally.

The result has a sharp boundary: my judge was Sonnet, not GPT-4o, so this is not a strict leaderboard replacement. But end-to-end it is a clean, full 500-question run with zero reader errors and zero judge errors. That is the number I trust.

| Metric | Result |
| --- | --- |
| Dataset | LongMemEval_s cleaned |
| Questions | 500/500 |
| Retriever | inproc-bm25, top-k=5 |
| Reader | claude-cli, claude-sonnet-4-6 |
| Judge | claude-sonnet-4-6 using the vendored official yes/no templates |
| Reader errors | 0 |
| Judge errors | 0 |
| QA accuracy | 408/500 = 81.60% |
| Retrieval recall@5 | 91.55% over 470 non-abstention questions |

Brain is a product bet, not a chat UI. It indexes the important data around my work once — Claude Code sessions, OpenClaw runs, Hermes conversations, project docs, browser trails, notes, decisions, failures, the operational exhaust that usually disappears — and lets any agent surface ask that memory naturally: what did we decide about the auth flow?, where did this error happen before?, which benchmark run was clean?, what did I already try and reject?

The bet is narrower than “stuff every session with context”:

Index the important past once, retrieve only the evidence needed now, and let the agent talk to that memory like a native part of its workspace.

The product bet is narrow and deliberate — index operational exhaust once, retrieve naturally to agent surfaces

If Brain is going to sit behind Claude Code, OpenClaw, and Hermes as the memory layer, it cannot just feel useful. It has to answer held-out questions from long histories, recover the right evidence cheaply, and fail in ways I can inspect. Memory benchmarks are especially easy to overstate — a retrieval score can look like an answer score, a small sample can look like a full benchmark, a rate-limited run can leave half the dataset silently broken, a judge can be changed just enough to make the number prettier.

I benchmarked Brain because “it feels useful” is not enough to build a great product.

The rule I took from the whole exercise is the Clean Number Rule: if the run is partial, rate-limited, re-judged inconsistently, or silently missing questions, it is not a score — it is a debugging artifact.

The Clean Number Rule

The journey had four distinct phases:

| Phase | What happened | What survived |
| --- | --- | --- |
| Harness build | Built bench/longmemeval/, dataset fetch, ingestion, retrieval, reader, judge wrappers | A repeatable benchmark loop instead of a demo |
| First Sonnet signal | A 25-question burst hit 84% | Useful signal, not publishable evidence |
| Failed full run | Parallel Sonnet readers hit account limits and left 286/500 reader errors | A hard rule: no error-polluted scores |
| Clean full run | Retried failed qids serially, re-judged all 500 | 408/500 = 81.60%, 0 reader errors, 0 judge errors |

That last row is the only one I am willing to call the benchmark result.

The story is not “we got a benchmark score.” The story is that a local memory layer can become product infrastructure only after it learns to prove what it remembers.


Why LongMemEval

Most memory systems are evaluated with questions that are too clean. A fact is inserted. A question asks for the fact. Retrieval finds the fact. The demo works.

Real agent memory is messier than that. The hard cases are not just “what was the user’s dog’s name?” They are:

  • facts scattered across sessions
  • old information overwritten by newer information
  • timestamps that change the answer
  • assistant-side statements that matter later
  • preference questions where the evidence is implicit
  • abstention questions where the right behavior is to say there is not enough information

That is why I used LongMemEval. The benchmark was built for long-term interactive memory, not generic RAG. The LongMemEval GitHub repository describes 500 questions covering information extraction, multi-session reasoning, knowledge updates, temporal reasoning, and abstention, and links the cleaned data on Hugging Face. The underlying paper is LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory.

The small variant, LongMemEval_s, gives each question roughly 115k tokens of chat history across about 40 sessions; the medium variant pushes toward 500 sessions.
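
To make that shape concrete, here is a minimal loading sketch. The file and field names (longmemeval_s.json, question_type, question_id) follow the format documented in the LongMemEval repository, but treat them as assumptions rather than a description of my actual loader.

```python
import json
from collections import Counter

# Assumed file and field names, per the LongMemEval repo's documented format;
# the harness's real fetch/ingest code may differ.
with open("longmemeval_s.json") as f:
    questions = json.load(f)

print(len(questions))  # expect 500

# Each entry carries one question, a gold answer, and ~40 haystack sessions of history.
type_counts = Counter(q["question_type"] for q in questions)
for qtype, n in type_counts.most_common():
    print(f"{qtype:30s} {n}")

# Abstention questions are the ones where the right answer is "not enough information";
# I believe the repo marks them with an '_abs' suffix on the question_id (assumption).
n_abs = sum(1 for q in questions if str(q.get("question_id", "")).endswith("_abs"))
print("abstention questions:", n_abs)
```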

That shape matters. It separates three things people often collapse into one:

| Layer | Question it answers | Why it matters |
| --- | --- | --- |
| Retrieval | Did the system find the right evidence? | A memory system must surface the relevant past. |
| Reading | Did the model interpret the retrieved evidence correctly? | Retrieval alone does not answer the question. |
| Judging | Did the answer match the expected answer? | The final metric has to be end-to-end. |

Three distinct layers govern agentic memory — retrieval, reading, judging

The LongMemEval authors also make a point that matched my own experience: even with strong long-context models, long-term memory still needs explicit machinery. The benchmark is not just asking whether a model has a long context window. It is asking whether the system can manage a growing interaction history.

That is exactly the claim Brain has to survive: not “can we stuff all history into context?”, but “can a retrieval substrate make the right past available to the right agent at the right moment without paying a huge token and latency tax?”


The Product Claim

Brain started as my local-first memory layer for Claude Code and related agent work, but the product shape is broader than one assistant.

At the time of the earlier production writeup, the system was already indexing hundreds of Claude Code sessions into markdown, distilling high-signal summaries, and exposing retrieval through QMD and MCP. The goal was to make memory feel native across agent surfaces: important data flows into a normalized markdown corpus, gets indexed lexically and semantically (lexical / vector / HyDE), and is exposed through brain ask / brain recent / MCP — so Claude Code, OpenClaw, and Hermes can ask natural questions against the user’s past.

This is the key product distinction:

| Bad memory product | Brain product goal |
| --- | --- |
| Paste giant summaries into every session | Retrieve small evidence slices when needed |
| Make the user manage notes manually | Index operational exhaust automatically |
| Optimize for “chat with your docs” demos | Optimize for agents doing real work |
| Hide failures behind fluent answers | Show evidence, gaps, and confidence boundaries |
| Spend more tokens to feel safer | Spend fewer tokens by retrieving better |

The Demo Trap vs The Brain Contract

The system helped me every day. It could remember decisions, commands, failures, review comments, abandoned approaches, and “we already tried that” context. But daily usefulness creates a trap. If a memory system helps you personally, you start trusting it before you have measured it.

That is dangerous because memory failures often look plausible. The agent gives an answer with confidence, but the missing evidence is invisible. For a product, that is the failure mode that matters: not forgetting loudly, but remembering wrongly while sounding useful.

So I needed an external test with enough structure to make failure legible.

I call this the Harness Before Hype rule:

| Temptation | Better discipline |
| --- | --- |
| Publish an architecture diagram | First publish the metric it survives. |
| Report retrieval recall | Also report end-to-end QA. |
| Show the best examples | Score all 500 questions. |
| Optimize the prompt live | Keep the judge template fixed. |
| Round the number up | Preserve the clean run exactly. |

That rule shaped the whole LongMemEval project, but the product reason was simple: if Brain is going to save tokens and time for real agent work, I need to know what quality I am buying with that cheap retrieval path.

The benchmark also made Brain more legible as a product. Before LongMemEval, Brain was a useful local memory layer. After LongMemEval, it had a measurable contract:

| Claim | Measurement needed |
| --- | --- |
| Brain remembers prior work | End-to-end QA on held-out questions |
| Retrieval is good | Recall@k against evidence sessions |
| The reader is good | QA accuracy given retrieved sessions |
| The system is robust | 500/500 complete, no reader or judge errors |
| The result is comparable | Fixed dataset, fixed judge templates, named models |

That contract is what turns “memory” from a feature into infrastructure.


The Harness

I built bench/longmemeval/ as a normal benchmark harness, not a one-off notebook.

The test harness isolates the mechanics of memory — Fetch → Ingest → Retrieve → Read → Judge

The pipeline had five stages: fetch the dataset, ingest per-question corpora, retrieve the top-k evidence sessions, ask a reader model for the answer, and judge that answer with LongMemEval-style yes/no templates.
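
A minimal sketch of that loop; every function name here is a hypothetical stand-in for the real bench/longmemeval/ modules, which have different names and far more error handling.

```python
# Hypothetical five-stage loop: fetch -> ingest -> retrieve -> read -> judge.
def run_question(q, k=5):
    corpus = ingest_sessions(q["haystack_sessions"])        # per-question corpus
    evidence = retrieve_bm25(corpus, q["question"], k=k)     # top-k evidence sessions
    hypothesis = read_with_claude(q["question"], evidence)   # reader emits one answer
    correct = judge_yes_no(q["question"], q["answer"], hypothesis)  # yes/no templates
    return {"qid": q["question_id"], "hypothesis": hypothesis, "correct": correct}

questions = load_dataset("s")                     # fetch stage provides the cleaned file
results = [run_question(q) for q in questions]
accuracy = sum(r["correct"] for r in results) / len(results)
```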

The first committed baseline was deliberately boring: pure in-process BM25, no embeddings, no external vector DB, no custom memory graph.

That was the point. Before testing the full Brain stack, I wanted the cheapest credible product baseline: if BM25 plus a good reader already goes far, then the product can start fast and local instead of defaulting to expensive context stuffing.
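
As a sketch of what “pure in-process BM25” means in practice, here is a minimal retriever, assuming the rank_bm25 package and a turn-level session format; the harness’s inproc-bm25 retriever may implement the scoring itself.

```python
# Minimal in-process BM25 retrieval sketch (assumes `pip install rank_bm25`).
from rank_bm25 import BM25Okapi

def retrieve_bm25(sessions, question, k=5):
    # One "document" per session: flatten every turn into a single token stream.
    # Assumes each session is a list of {"role": ..., "content": ...} turns.
    docs = [" ".join(turn["content"] for turn in session) for session in sessions]
    tokenized = [doc.lower().split() for doc in docs]
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores(question.lower().split())
    top = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:k]
    return [docs[i] for i in top]
```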

The critical command shape looked like this:

```bash
python3 -m bench.longmemeval.run \
  --variant s \
  --retriever inproc-bm25 \
  --reader claude-cli \
  --tag fleet-0 \
  -k 5 \
  --qid-file bench/longmemeval/batches/batch_0.txt
```

The reader path used claude -p through the local Claude Code subscription: BM25 retrieves five sessions, Sonnet reads the question plus those retrieved sessions and emits one hypothesis, then the judge labels that hypothesis true or false.
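
Driving the reader looked roughly like the sketch below. The claude -p headless flag is documented Claude Code behavior; the prompt framing here is a simplified stand-in for the harness’s actual reader prompt.

```python
import subprocess

def read_with_claude(question, evidence_sessions):
    # Headless Claude Code call: `claude -p "<prompt>"` prints one completion and exits.
    prompt = (
        "Answer the question using only the retrieved sessions below. "
        "If they do not contain the answer, say there is not enough information.\n\n"
        + "\n\n---\n\n".join(evidence_sessions)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    out = subprocess.run(["claude", "-p", prompt],
                         capture_output=True, text=True, timeout=300)
    if out.returncode != 0:
        # Tagged upstream as a reader error, never silently counted as wrong.
        raise RuntimeError(out.stderr.strip())
    return out.stdout.strip()
```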

The judge path used the same official yes/no templates vendored from LongMemEval’s evaluate_qa.py, routed through Claude instead of GPT-4o.

That last sentence is important. This was a clean internal run, but not a perfect apples-to-apples public leaderboard submission. The LongMemEval repository documents GPT-4o-based evaluation. My run used Sonnet as both reader and judge because it let me run the full system without API spend.
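
The judge call has the same shape, sketched here with a placeholder prompt; the real run uses the vendored yes/no templates from evaluate_qa.py verbatim, selected per question type.

```python
import subprocess

def judge_yes_no(question, gold_answer, hypothesis):
    # Placeholder wording only; the actual run renders the official LongMemEval
    # yes/no templates and routes them through Claude instead of GPT-4o.
    prompt = (
        f"Question: {question}\n"
        f"Correct answer: {gold_answer}\n"
        f"Model answer: {hypothesis}\n"
        "Does the model answer convey the same meaning as the correct answer? "
        "Answer only yes or no."
    )
    out = subprocess.run(["claude", "-p", prompt],
                         capture_output=True, text=True, timeout=120)
    return out.stdout.strip().lower().startswith("yes")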

So the precise claim is:

Brain’s BM25 + Sonnet run scored 81.60% under a Sonnet implementation of the official LongMemEval yes/no judge templates.

It is a real end-to-end score. It is not the same thing as a GPT-4o-judged leaderboard entry.

The harness became a small contract: same 500 questions, same cleaned dataset, same retrieval k, same reader model, same judge templates, same aggregation script, no silent errors.

That contract matters more than any single prompt tweak.

The contract looked solid. Then the first serious run broke it, which is exactly why the harness mattered.


The Run That Failed

Real benchmark infrastructure treats partial failure as toxic — the journey to the clean run

The first full attempt did not give me the final number.

It gave me lessons.

The early signal was promising: a 25-question burst hit 84%. That was useful, but it was not enough evidence to publish. A 25-question sample can be lucky, skewed by category, or easier than the full distribution.

Then I tried to scale the reader fleet in parallel. That was a mistake. Five parallel Sonnet readers tripped the weekly Claude limit and left the run with 286/500 reader errors.

At that point the harness had already taught me something more valuable than a score: benchmark infrastructure needs to treat partial failure as toxic.

I made three changes:

| Problem | Fix |
| --- | --- |
| Reader and retrieval errors were too entangled | Split retrieval and reader try-blocks so failures were tagged correctly. |
| Long runs were fragile | Added resume support and qid-file batches. |
| Rate limits polluted outputs | Added a circuit breaker after repeated rate-limit errors (sketched below). |
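
The circuit breaker is the simplest of the three, and a minimal sketch looks like this; the real harness also keeps retrieval failures and reader failures in separate buckets.

```python
class RateLimitBreaker:
    """Stop the run after repeated consecutive rate-limit errors instead of
    writing hundreds of error rows that would poison the score."""

    def __init__(self, max_consecutive=3):
        self.max_consecutive = max_consecutive
        self.consecutive = 0

    def record(self, error_message):
        if "rate limit" in error_message.lower() or "429" in error_message:
            self.consecutive += 1
        else:
            self.consecutive = 0
        if self.consecutive >= self.max_consecutive:
            raise SystemExit("rate-limited: halting so the run can resume cleanly")
```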

The operational rule became:

Serial reader, parallel judge.

Readers are expensive, stateful, and easy to rate-limit. Judges are cheaper to resume and easier to shard. Once I separated those clocks, the full run became stable.
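
In code, the rule is just two execution strategies over the same question list; the helper names are the hypothetical ones from the sketches above.

```python
from concurrent.futures import ThreadPoolExecutor

# Serial reader: one expensive, rate-limit-prone Claude call at a time.
hypotheses = []
for q in questions:
    evidence = retrieve_bm25(q["haystack_sessions"], q["question"], k=5)
    hypotheses.append((q, read_with_claude(q["question"], evidence)))

# Parallel judge: cheap yes/no calls that are safe to shard and easy to resume.
with ThreadPoolExecutor(max_workers=8) as pool:
    verdicts = list(pool.map(
        lambda pair: judge_yes_no(pair[0]["question"], pair[0]["answer"], pair[1]),
        hypotheses,
    ))

accuracy = sum(verdicts) / len(verdicts)
```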

That was the unglamorous work that made the final number publishable.

This is also where Claude Code hooks became part of the broader Brain product story. Hooks are how everyday sessions enter the memory system automatically. The benchmark harness is the mirror image: instead of automatically capturing my work, it automatically forces the memory system to prove it can recover evidence later.

After that failure, the benchmark became simpler and stricter: finish all 500 questions, retry failures serially, judge everything cleanly, and only then look at the score.
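
Retrying failures serially amounted to scanning the results for reader errors and writing a fresh qid-file for a new serial pass; a sketch, with assumed file paths and an assumed JSONL record format.

```python
import json

# Collect the qids whose reader call failed, so only those are re-run serially
# via --qid-file instead of repeating the whole 500-question run.
failed = []
with open("bench/longmemeval/results/fleet-0.jsonl") as f:  # assumed output path
    for line in f:
        row = json.loads(line)
        if row.get("reader_error"):
            failed.append(row["qid"])

with open("bench/longmemeval/batches/retry_serial.txt", "w") as f:
    f.write("\n".join(failed))

print(f"{len(failed)} questions to retry serially")
```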


The Run That Counted

The clean run scored 81.60% with zero silent failures — 408/500, 0 reader errors, 0 judge errors

The clean run completed all 500 questions with zero reader errors and zero judge errors.

Category breakdown reveals exactly where a lexical baseline breaks — strong on direct recall, weak on multi-session synthesis

The result surprised me in two opposite ways.

First, the simple baseline was much stronger than I expected. Plain BM25 plus Sonnet was enough to land in the same aggregate range as serious memory systems. Supermemory’s research page reports an 81.6% LongMemEval_s result for its GPT-4o row, with higher rows for stronger readers. My run matched that 81.6% aggregate number, while using a much simpler retrieval stack.

Second, the breakdown made the weakness obvious. Brain was excellent at direct single-session recall and strong on knowledge updates. It was weak on preference and multi-session reasoning.

That is exactly the failure pattern I would expect from BM25.

BM25 is good at literal evidence. It is less good when the answer requires synthesizing weak signals across multiple sessions or inferring a preference from repeated behavior. Those are not just retrieval problems. They are representation and reasoning problems.

The retrieval metric tells the same story. Recall@5 was 91.55%, but QA was 81.60%. That gap is the reader/judge split in action:

| Metric | Meaning | Result |
| --- | --- | --- |
| recall@5 | Did the evidence session appear in the top 5? | 91.55% |
| QA accuracy | Did the final answer pass the judge? | 81.60% |
| gap | Evidence found but not converted into a correct answer | 9.95 points |

This is why I do not like memory claims that only report retrieval recall. Retrieval is necessary. It is not sufficient.

I think about this as the Evidence Conversion Gap: recall@k − QA accuracy, or 91.55 − 81.60 = 9.95 points.
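
Computed from per-question records, the gap is just the difference between two ratios; note that recall is scored over the 470 non-abstention questions while QA accuracy covers all 500. Field names here are illustrative, not the harness’s actual schema.

```python
def evidence_conversion_gap(records):
    # records: one dict per question with illustrative fields
    #   'correct'        -> judge verdict (bool)
    #   'evidence_found' -> gold evidence session appeared in the top-5 (bool)
    #   'is_abstention'  -> abstention question (bool); excluded from recall
    qa_accuracy = sum(r["correct"] for r in records) / len(records)
    scored = [r for r in records if not r["is_abstention"]]
    recall_at_5 = sum(r["evidence_found"] for r in scored) / len(scored)
    return recall_at_5 - qa_accuracy  # 0.9155 - 0.8160 ≈ 0.0995 for this run
```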

The Evidence Conversion Gap — recall@5 91.55%, QA accuracy 81.60%, 9.95-point gap. Retrieval is necessary but not sufficient.

That gap is where the next product work lives. A memory system that retrieves the right session but cannot turn that session into the right answer has not solved agent memory. It has solved evidence surfacing.

So the aggregate said Brain was credible. The category breakdown said exactly where it was still weak. The next question was how that shape compared with a public memory system.


Where Brain Lands In The Public Field

LongMemEval does not have a single official leaderboard maintained by the benchmark authors. The public results are mostly vendor self-reports: research pages, leaderboard posts, press releases, and GitHub snippets. The fairest reading is “best public claims I could verify,” not a tournament with one referee.

Brain sits in the middle of the public field using the simplest possible stack — ranked alongside Mastra, Vectorize, Emergence AI, Supermemory, RetainDB, and Zep

In decreasing order of verified accuracy, the field looks like this:

| Rank | System | Best reported | Reader | Source |
| --- | --- | --- | --- | --- |
| 1 | Mastra (Observational Memory) | 94.87% | gpt-5-mini | mastra.ai/research |
| 2 | Vectorize / Hindsight | 91.40% | gemini-3-pro-preview | Mastra leaderboard |
| 3 | Emergence AI (internal) | 86.00% | not publicly reproducible | Mastra leaderboard |
| 4 | Supermemory | 85.20% | gemini-3-pro-preview | Mastra leaderboard |
|  | Brain (this run) | 81.60% | claude-sonnet-4-6 | bench/longmemeval/ |
| 5 | RetainDB | 79.00% | oracle split | arXiv 2410.10813 |
| 6 | Zep | 71.20% | gpt-4o | Mastra leaderboard |

Lighter-weight claims sit outside that list until methodology is checked side by side: Ensue AI’s 93.2% LinkedIn post and Backboard’s 93.4% GitHub snippet are real numbers but not yet apples-to-apples. Vectorize’s 91.4% was corroborated with The Washington Post and Virginia Tech partners; most of the other rows are single-source.

Two things matter about where Brain lands.

The first is that 81.60% sits in the middle of the public field while running the simplest possible stack: in-process BM25, top-k=5, no embeddings, no graph, no rerankers. Every system above Brain on this table runs richer memory machinery, a stronger reader, or both. Sitting in that band on a BM25 baseline is a stronger product signal than the rank itself.

The second is that this is not an apples-to-apples ranking. Reader models differ across rows. Judge models differ — most public numbers were judged by GPT-4o; Brain was judged by Sonnet. Some entries are internal configurations the authors say are not reproducible. Some are leaderboard rows; others are press or social posts. Treating this as a tournament would be sloppy.

LongMemEval itself was created by Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu, and the benchmark was accepted at ICLR 2025. Among the public results that exist today, the benchmarking activity is dominated by startups and product teams rather than named individual evaluators or standalone university labs. That is worth flagging because it shapes what “the field” is: a vendor scoreboard, not a peer-reviewed ranking.

The most useful per-system comparison is still Supermemory, because their research page breaks results out by category. That is the one place Brain can be matched on shape, not just aggregate:

| System row | Overall |
| --- | --- |
| Full-context (gpt-4o) | 60.2% |
| Zep (gpt-4o) | 71.2% |
| Supermemory (gpt-4o) | 81.6% |
| Supermemory (gpt-5) | 84.6% |
| Supermemory (gemini-3-pro) | 85.2% |

Brain’s clean Sonnet run tied the 81.6% aggregate reported for Supermemory’s GPT-4o row. It did not beat Supermemory’s higher-reader rows, and because my judge was Sonnet rather than GPT-4o, I would not present this as a strict leaderboard replacement.

The category shape comparison:

Per-category accuracy on LongMemEval_s — Brain (BM25 + Sonnet) vs. Supermemory (gpt-4o)

Category-level comparison exposes a clear product roadmap — Brain wins on temporal/assistant/knowledge-update; trails on preference/multi-session

| Type | Brain | Supermemory gpt-4o row | Delta |
| --- | --- | --- | --- |
| single-session-user | 96.88 | 97.14 | -0.26 |
| single-session-assistant | 100.00 | 96.43 | +3.57 |
| single-session-preference | 63.33 | 70.00 | -6.67 |
| knowledge-update | 88.89 | 88.46 | +0.43 |
| temporal-reasoning | 81.89 | 76.69 | +5.20 |
| multi-session | 62.81 | 71.43 | -8.62 |

This is a much more useful pair than either aggregate.

It says Brain’s plain baseline is already competitive on direct recall, assistant-side recall, knowledge updates, and temporal reasoning. It also says the system is not yet good enough at cross-session synthesis or implicit preference modeling.

That is a product roadmap hiding inside a benchmark table.

I would not use this comparison or the broader leaderboard to claim “Brain beats Supermemory” or “Brain is fifth on the leaderboard.” Both framings would be sloppy. The stronger claim is narrower and more useful:

A local-first Brain baseline, using only BM25 and Sonnet, sits in the middle band of the public LongMemEval field — and exposes a clear multi-session weakness to fix next, before the more sophisticated parts of the Brain stack are even turned on.

That narrow claim is the point of the post. It is strong enough to matter, but constrained enough to be defensible.


Why This Was Not An 85% Post

There was an 85% target.

There was an 84% small-sample burst.

There was a Tier-S prompt-bundle run designed to test whether better answer discipline could push the system toward 85% or beyond.

But I did not find a completed clean 85% Sonnet run.

The Tier-S run was a good hypothesis, not a publishable result. It reached a partial state and then hit rate-limit failure. Publishing that as “Brain got 85%” would violate the reason I built the harness in the first place.

Here is the standard I am using:

| Evidence | Publishable? | Why |
| --- | --- | --- |
| 25-question burst at 84% | No | Too small. Useful signal only. |
| Full run with 286 reader errors | No | Error-polluted. |
| Partial Tier-S run aimed at 85% | No | Did not complete cleanly. |
| Full 500-question run, 0 reader errors, 0 judge errors, 408/500 | Yes | Complete, reproducible, auditable. |

The clean number is 81.60%.

I would rather publish the lower number than train myself to trust a flattering one.


Where This Breaks

The result is strong enough to matter, but it has sharp boundaries.

| Limitation | Why it matters | What I would do before making a stronger claim |
| --- | --- | --- |
| Sonnet judged Sonnet | Same-model reader/judge can share blind spots | Re-judge the 500 hypotheses with GPT-4o using the official LongMemEval path |
| BM25 was the headline retriever | It is a lexical baseline, not the full Brain stack | Run QMD hybrid retrieval with lex/vec/HyDE and compare per-category lift |
| LongMemEval_s only | The medium variant is closer to a heavier long-history workload | Repeat on LongMemEval_m after the harness is stable |
| No cost/latency table yet | A memory product has to be useful under real operating budgets | Add wall-clock, token, and cost estimates per question |
| Multi-session remained weak | The system finds evidence better than it synthesizes across sessions | Add a two-pass evidence table reader for multi-session questions |
| Public field is heterogeneous | Reader and judge models differ across vendor reports, so “rank” hides methodology spread | Re-judge against the official GPT-4o path before claiming a strict position; until then, frame the result as a band, not a place |

This is why I am framing 81.60% as a credible baseline, not an end-state victory.

Those limits do not weaken the result. They make the next experiment obvious.


What I Learned

The first lesson is that a memory benchmark is mostly a systems benchmark.

The actual QA line is short. The hard parts are resumption, error tagging, batch boundaries, judge reproducibility, output hygiene, and preventing a partial run from masquerading as a score.

The second lesson is that retrieval recall is an upper-bound hint, not the outcome. Brain found the right evidence far more often than it answered correctly. That means the next improvements should not only be “better search.” They should improve how evidence is structured for the reader.

The third lesson is that multi-session questions are the real test.

Single-session recall is table stakes. The important behavior is synthesis — session A says one thing, session B updates it, session C implies a preference, session D gives the timestamp, and the question asks for the current answer.

BM25 can surface pieces of that chain. It does not naturally build the chain.

That points to the next architecture:

The next architecture targets the multi-session synthesis weakness — hybrid retrieval + two-pass reader extracting one-line facts before synthesis

| Lever | Expected role |
| --- | --- |
| Hybrid retrieval | Improve semantic recall where BM25 misses vocabulary. |
| Date-aware reranking | Improve temporal questions by respecting event and session time. |
| Multi-session two-pass reader | Summarize evidence per session before synthesis. |
| Stronger independent judge | Reduce same-model reader/judge bias. |
| GPT-4o re-judge | Make the public number more comparable to published LongMemEval rows. |

Hybrid retrieval is the obvious next retrieval lever because QMD already exposes lexical, vector, and HyDE query modes. But the category table says retrieval alone will not be enough. The product needs a better evidence representation for multi-session synthesis.
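
One common way to fuse those modes is reciprocal rank fusion over the per-mode rankings; the sketch below is a generic illustration of that idea, not QMD’s actual fusion logic.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60, top_n=5):
    """Combine ranked session-id lists from lexical, vector, and HyDE retrieval.
    Generic RRF illustration only; QMD's real combination strategy may differ."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, session_id in enumerate(ranking):
            scores[session_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# e.g. fused = reciprocal_rank_fusion([bm25_ids, vector_ids, hyde_ids])
```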

The biggest near-term improvement is probably the two-pass reader: retrieve the top 10 sessions, extract one-line facts from each, synthesize the answer from those compact facts, and judge with fixed templates. That changes the reader’s job from “read five noisy sessions and answer” to “reason over a small evidence table.” For multi-session tasks, that is a different problem.
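
A sketch of that two-pass shape, where `ask` is any prompt-to-completion callable such as the claude -p wrapper above; prompts and structure are illustrative, not a finished design.

```python
def two_pass_answer(question, sessions, ask):
    """Two-pass reader sketch: compress each session to one fact line, then synthesize."""
    # Pass 1: one dated fact line per retrieved session.
    facts = []
    for i, session in enumerate(sessions):
        fact = ask(
            f"Question of interest: {question}\n\nSession transcript:\n{session}\n\n"
            "Extract the single most relevant fact, including any dates. "
            "Reply with one line, or 'none'."
        )
        if fact.strip().lower() != "none":
            facts.append(f"[session {i}] {fact.strip()}")

    # Pass 2: answer from the compact evidence table instead of raw transcripts.
    evidence = "\n".join(facts)
    return ask(f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:")
```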


What This Changes About Brain

Brain was not built for LongMemEval. It was built so an agent could remember my actual work.

That is why the 81.60% result matters to me. It says a local-first, markdown-native, BM25-first memory system can already compete with serious memory products on a hard public benchmark, even before the more sophisticated parts of the stack are turned on.

More importantly, it says the product direction is sane. Brain does not need to make every agent session enormous. It can index the user’s important data once, expose a narrow natural query surface, retrieve cheap evidence slices, and let Claude Code, OpenClaw, and Hermes use that evidence only when they need it.

But the number also prevents overclaiming. Brain is not done. It is not “solved memory.” It is not at 95%. It still struggles where memory becomes synthesis.

That is the right kind of result for a product: strong enough to justify the architecture, specific enough to tell me what to fix next.

The real artifact is not just the score. It is the discipline:

| Discipline | Why it matters |
| --- | --- |
| Full 500-question runs | Avoids cherry-picking. |
| Separate retrieval and QA metrics | Prevents recall from masquerading as intelligence. |
| Fixed judge templates | Keeps improvement honest. |
| Error-free score only | Makes the number defensible. |
| Per-category breakdowns | Turns the benchmark into a roadmap. |

I started Brain because I wanted Claude Code to stop forgetting my past work.

I benchmarked Brain because the product I want has a harder requirement: agents should remember without wasting my time, wasting my tokens, or pretending to know what they failed to retrieve.

The clean score is 81.60%. The next target is not a prettier blog headline. It is a faster, cheaper, more natural Brain that makes multi-session memory feel like part of the agent’s normal working environment.

The clean score is 81.60% — the next target is a faster, cheaper, more natural Brain


If you want to inspect the concepts behind this benchmark, these are the links I would start with:

| Topic | Link | Why it matters |
| --- | --- | --- |
| LongMemEval paper | arXiv 2410.10813 | Defines the benchmark and long-term memory task shape. |
| LongMemEval code/data format | GitHub: xiaowu0162/LongMemEval | Shows dataset files, evaluation scripts, and question types. |
| Cleaned LongMemEval data | Hugging Face dataset | The cleaned files behind the run. |
| Supermemory comparison | Supermemory research | Per-category breakdown — the only public apples-to-apples shape comparison. |
| Mastra leaderboard | Observational Memory | The most-cited public LongMemEval scoreboard; aggregates Mastra, Hindsight, Emergence, Supermemory, Zep. |
| QMD | GitHub: tobi/qmd | Local retrieval substrate Brain builds on. |
| MCP | Model Context Protocol | Protocol surface for exposing tools/data to agents. |
| BM25 | Stanford IR book: Okapi BM25 | The lexical ranking baseline that got surprisingly far. |
| HyDE | ACL Anthology: Hypothetical Document Embeddings | Useful background for QMD’s hypothetical-document retrieval mode. |
| Claude Code headless mode | Claude Code -p docs | How the Sonnet reader was driven from the harness. |
| Claude Code hooks | Hooks reference | How everyday Brain ingestion connects back to agent sessions. |
#AI #memory #agents #LongMemEval #Claude #Sonnet #BM25 #benchmark #Brain