The Resumption Benchmark v0: Measuring Whether the Next Agent Continues Correctly

Nine days ago I pre-registered three experiments. One of the five things I committed to in writing was this:

The Resumption Benchmark spec is published before Experiment 3 runs. The benchmark definition cannot be moved to fit the result.

This post is that spec. It exists for a single reason: so that when Experiment 3 ships a number, the dependent variable was named in advance and cannot be quietly redefined to flatter the result. It is not the experiment. It is the ruler the experiment will be measured against.

The benchmark also introduces one concept I have not seen in any published memory or summarization eval: forgetting precision — a score that goes up when the system correctly leaves things out. Most evals only know how to reward inclusion. Real consolidation has to do both.

The Gap This Benchmark Names

The synthesis post on long-horizon agents defined the Resumption Gap as the quality loss between a continuous agent and one that was serialized to disk, woken up later, and rehydrated from a summary. That post offered no way to measure the gap, which is a defensible move in an essay and an indefensible one in an experiment. Without a measuring stick, Experiment 3 could claim that “structured consolidation beats naive summarization” with no way for a reader to falsify it.

The closest prior art is the owned-corpus eval I built in the LongMemEval critique. That benchmark answers a retrieval question: did the system find the right document? This benchmark has to answer a different question:

Given that an agent was in the middle of a multi-step task and was interrupted, does the resumption context enable the next agent to continue the task correctly — without redoing work the first agent already finished, and without repeating mistakes it already learned to avoid?

That second clause — “mistakes it already learned to avoid” — is what makes resumption different from summarization. If the next agent gets a faithful summary of every approach the previous agent tried, including the dead ends, the dead ends are now live again. Faithful recall poisons the trajectory. Resumption requires the summary to be selective in a way that summarization evals do not measure.

There is parallel evidence from the SWE side that the long-horizon problem is wide open. SWE-EVO, revised on arXiv on May 22, 2026, constructs 48 tasks from release notes of seven mature Python projects, each spanning an average of 21 files and validated against test suites averaging 874 tests per instance. The result that matters here is the collapse between benchmarks: GPT-5.4 with OpenHands hits 25% on SWE-EVO versus 72.8% for GPT-5.2 on SWE-bench Verified, a 47.8-point drop the moment the task requires sustained multi-file reasoning instead of an isolated patch. SWE-EVO and the Resumption Benchmark measure different dependent variables (SWE-EVO asks whether the long task gets done; the Resumption Benchmark asks whether the handoff between sessions of the long task preserves the right state), but they point at the same underlying gap. Long-horizon coordination is the bottleneck. Static issue resolution is no longer where the frontier breaks.

The Unit: A Resumption Episode

Every test case is a 5-tuple. I am going to spell it out in code-block form because the structure is the contract:

ResumptionEpisode = (
  initial_task,
  partial_trajectory,
  gold_continuation,
  gold_facts_to_preserve,
  gold_facts_to_forget,
)

Field	What it is	Where it comes from
`initial_task`	The goal the original agent was pursuing.	Hand-extracted from a real Brain session.
`partial_trajectory`	The raw step-by-step trace of turns 1 through K.	Real session, sliced at step K.
`gold_continuation`	What a correct continuation looks like from step K+1 onward. May be a set of acceptable trajectories.	Hand-authored, informed by what the real session did next.
`gold_facts_to_preserve`	The facts the resumption context MUST encode for the continuation to be possible.	Hand-extracted: decisions made, state established, dead branches identified as dead.
`gold_facts_to_forget`	The facts the resumption context MUST NOT carry forward. Rejected approaches that would re-poison the continuation if reintroduced.	Hand-extracted: blind alleys the agent already abandoned.

The fifth field is doing the heavy lifting. Drop it and the benchmark collapses into a summarization eval with extra steps. Keep it and the benchmark measures something no shipping eval measures: selective forgetting as a scored capability.
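As a rough illustration, the 5-tuple could be typed like this in Python. This is a minimal sketch, not the runner's actual schema: the class layout, the choice of tuples and frozensets, and the overlap check in `__post_init__` are all assumptions added for illustration, and `gold_continuation` is simplified to a single sequence of steps rather than a set of acceptable trajectories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResumptionEpisode:
    """One benchmark test case: the five fields named in the spec."""

    initial_task: str                        # goal the original agent was pursuing
    partial_trajectory: tuple[str, ...]      # raw trace of turns 1 through K
    gold_continuation: tuple[str, ...]       # acceptable steps from K+1 onward (simplified)
    gold_facts_to_preserve: frozenset[str]   # facts the context MUST encode
    gold_facts_to_forget: frozenset[str]     # facts the context MUST NOT carry forward

    def __post_init__(self) -> None:
        # Assumption: a fact cannot be both required and forbidden.
        overlap = self.gold_facts_to_preserve & self.gold_facts_to_forget
        if overlap:
            raise ValueError(f"facts in both gold sets: {sorted(overlap)}")
```

Freezing the dataclass is a deliberate choice in this sketch: gold data that can be mutated during a run is gold data that can drift.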

Why Forgetting Has to Be Scored

The instinct in summarization research is to treat omission as a failure mode. Higher recall is better; missing facts cost you. That is the right instinct for a news summary, which has a single goal (reproduce the article faithfully). It is the wrong instinct for an agent resumption context, which has two goals that pull in opposite directions:

Preserve the decisions and state that the continuation depends on. (high recall)
Drop the abandoned branches that would re-poison the continuation if reintroduced. (high precision against a negative set)

A system that preserves everything wins on recall, fails on precision, and produces an agent that loops back into the same dead end the original agent already escaped from. A system that drops everything wins on precision, fails on recall, and produces an agent that starts the task over from scratch. Neither is the consolidation regime the Coherence Cliff post argued was the only way out of self-conditioning.

The mammalian sleep analogy gets thrown around a lot in this space and is usually decoration. Here it has a specific operational meaning: sleep-dependent consolidation in animals is structurally a forgetting operation as much as it is a remembering one. Synaptic homeostasis theory (Tononi & Cirelli, 2014) frames sleep as a global down-selection of weak synapses. The strong stuff stays. The noise goes. A benchmark that only measures the strong-stuff-stays half of that operation is measuring half the brain.

The Three Scores

A run of the benchmark on a single episode produces three numbers, each on [0, 1]:

Score	What it measures	How it’s computed
`continuation_correctness`	Did the resumed agent finish the task correctly?	Run the resumed agent on `initial_task + resumption_context`. Score its final output against the gold continuation.
`preservation_recall`	Did the resumption context include the gold-preserve facts?	LLM-judge, boolean per fact, aggregated.
`forgetting_precision`	Did the resumption context successfully omit the gold-forget facts?	LLM-judge, boolean per fact, aggregated. Negative-set check.
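The table says "boolean per fact, aggregated" without fixing the aggregation. A minimal sketch, assuming the aggregate is a plain mean over per-fact verdicts, might look like the following. The function signatures, the dict-of-verdicts input shape, and the convention that an empty gold set scores 1.0 are assumptions for illustration, not part of the spec. A strict-boolean version like this cannot award partial credit for a single fact.

```python
def preservation_recall(present: dict[str, bool]) -> float:
    """Fraction of gold-preserve facts the judge found in the context.

    `present` maps each gold-preserve fact to the judge's verdict
    (True = the resumption context encodes it).
    """
    # Assumption: nothing to preserve means nothing was missed.
    return sum(present.values()) / len(present) if present else 1.0


def forgetting_precision(leaked: dict[str, bool]) -> float:
    """Fraction of gold-forget facts the judge did NOT find in the context.

    `leaked` maps each gold-forget fact to the judge's verdict
    (True = the resumption context carried it forward).
    """
    # Assumption: nothing to forget means nothing leaked.
    return sum(not v for v in leaked.values()) / len(leaked) if leaked else 1.0
```

The second function is the negative-set check: the score rises only when a forbidden fact is absent.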

The composite quality score is the geometric mean of the three:

quality = (continuation_correctness × preservation_recall × forgetting_precision) ^ (1/3)

Geometric, not arithmetic. The choice matters and it is the second non-obvious decision in this spec.

Arithmetic mean averages weakness out. A system that scores (1.0, 1.0, 0.0) arithmetic-averages to 0.667 — which looks acceptable and is in fact a system that totally fails on one of three required capabilities. Geometric mean returns 0.0 for the same input. Any zero in the product zeros the score. You cannot paper over a complete failure on one axis with strength on the other two. This is the right shape for the benchmark because all three axes are independently load-bearing. A consolidation system that ignores any one of them is not a consolidation system.
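The difference between the two means is easy to check numerically. The sketch below implements the composite formula as stated; the function names and the range validation are assumptions added for illustration.

```python
import math


def arithmetic_mean(scores: tuple[float, ...]) -> float:
    return sum(scores) / len(scores)


def geometric_quality(
    continuation_correctness: float,
    preservation_recall: float,
    forgetting_precision: float,
) -> float:
    """Composite quality: cube root of the product of the three scores."""
    scores = (continuation_correctness, preservation_recall, forgetting_precision)
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of [0, 1]: {s}")
    return math.prod(scores) ** (1 / 3)


failing = (1.0, 1.0, 0.0)
print(round(arithmetic_mean(failing), 3))  # 0.667
print(geometric_quality(*failing))         # 0.0
```

The same function reproduces the Geomean column of the toy table further down: (0.8, 1.0, 0.5) rounds to 0.74 and (0.95, 1.0, 1.0) rounds to 0.98.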

A toy example

To make this concrete, here is a worked toy episode. Suppose the original agent was debugging a flaky test. Over 50 turns, it:

correctly identified that the flake was a race condition on a shared cache key (decision A),
spent 12 turns trying time.sleep(0.1) between requests, gave up because it just slowed CI without fixing the flake (dead branch B),
landed on a per-test cache namespace as the fix (decision C),
had not yet written the actual patch when the session was interrupted.

A clean Resumption Episode for this trajectory looks like:

initial_task: "fix the flake in tests/integration/test_cache.py"
partial_trajectory: [turns 1..50]
gold_continuation:
  - write per-test cache namespace
  - run test 50× to confirm de-flaked
gold_facts_to_preserve:
  - root cause is shared cache key (A)
  - fix is per-test cache namespace (C)
  - patch not yet written
gold_facts_to_forget:
  - time.sleep retries were tried and abandoned (B)

Three resumption-context candidates, scored:

System	Resumption context produced	Cont. correctness	Pres. recall	Forg. precision	Geomean
Naive concat	full raw trajectory	0.6	1.0	0.0	0.00
Naive summary	“agent tried sleep retries and per-test namespaces; landed on namespaces”	0.8	1.0	0.5	0.74
Structured consolidation	“root cause: shared cache key. fix: per-test namespace. patch not yet written.”	0.95	1.0	1.0	0.98

The naive-concat system is the self-conditioning trap made flesh: it includes the dead branch, the resumed agent re-tries time.sleep, and the score is zeroed by the forgetting axis. The naive-summary system gets partial credit on forgetting because it mentions the dead branch without flagging it as dead. The benchmark structurally penalizes “include for completeness.”

The Calibration Gate

The benchmark uses LLM judges for its three scores. That sounds circular and would be circular without a calibration step. The gate is non-negotiable:

Hand-grade 50 examples of each scoring component yourself.
Score the same 50 with the LLM-judge.
Compute Cohen’s κ between human and judge.
Refuse to publish any number until κ ≥ 0.6 for all three components.

Cohen’s κ corrects for chance agreement; 0.6 is the threshold above which inter-rater agreement is conventionally called “substantial” (Landis & Koch, 1977). Below 0.6, the judge is approximately a noisy human and the headline number is approximately astrology.
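For readers who want to see what the gate computes, here is a minimal pure-Python sketch of Cohen's κ for two raters over the same items, plus the threshold check. The function names and the dict-of-κ input to the gate are assumptions for illustration; raising on a degenerate sample (both raters using one identical label throughout, where κ is undefined) is also a choice made here, not something the spec prescribes.

```python
from collections import Counter


def cohens_kappa(human: list, judge: list) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1 - expected)


def passes_gate(kappas: dict[str, float], threshold: float = 0.6) -> bool:
    """True only if every scoring component clears the threshold."""
    return all(k >= threshold for k in kappas.values())
```

For example, `cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1])` gives 0.5: raw agreement is 75%, but half of that is expected by chance. In practice `sklearn.metrics.cohen_kappa_score` computes the same statistic.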

If κ < 0.6 on any component, the play is to fix the judge prompt and re-calibrate, not to publish with a footnote. The pre-registration commits to this: Cohen’s κ ≥ 0.6 on every LLM-judge scoring component, or I do not publish the score. This gate exists because most agent benchmarks I have read either skip it entirely or report it on a non-representative subset. The reader has no way to tell whether the judge is measuring what the paper claims.

Constructing the v0 Episodes

The temptation in a new benchmark is to scale up immediately. Five hundred episodes! Cross-domain! Procedurally generated! Resist all of it. v0 is 30 hand-constructed episodes drawn from real Brain trajectories, and that number is chosen for a specific reason.

Per Cameron Wolfe’s stats guide for LLM evals, the standard error on a proportion estimate at n=30 with typical between-run correlation structure is around ±9 points. That’s enough to detect ~10-point quality differences between conditions at p < 0.05, which is the resolution Experiment 3 needs. Going to 100 episodes tightens that to ±5 points; going to 500 tightens it to ±2 points. v0 is sized to detect the effect Experiment 3 either has or doesn’t have. Larger n is for v1, after we know whether the effect exists at all.
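The three figures above line up with the worst-case binomial standard error, sqrt(p(1 - p) / n) at p = 0.5. The sketch below reproduces them under that assumption; it treats episodes as independent and does not model the between-run correlation mentioned above, so it says nothing about whether a 10-point difference between conditions actually clears p < 0.05. That depends on how strongly per-episode scores are paired across conditions.

```python
import math


def proportion_se(n: int, p: float = 0.5) -> float:
    """Standard error of a proportion estimated from n independent trials."""
    return math.sqrt(p * (1 - p) / n)


for n in (30, 100, 500):
    # Reported in percentage points.
    print(f"n={n}: ±{100 * proportion_se(n):.1f} points")
# n=30: ±9.1 points
# n=100: ±5.0 points
# n=500: ±2.2 points
```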

The construction protocol is mechanical:

1. Find a real Sharad-Claude-Code session with ≥ 100 turns.
2. Identify a natural interruption point around turn 50.
3. Pause the transcript there. That's `partial_trajectory`.
4. Write `gold_continuation` by hand, informed by what the real session did next.
5. Extract `gold_facts_to_preserve` from the pre-interruption turns, using the
   post-interruption turns to confirm what the continuation relied on:
   decisions, state, branches identified as dead.
6. Extract `gold_facts_to_forget` from the pre-interruption turns:
   approaches the agent already tried and abandoned.
7. Save as YAML. One file per episode.

The “turn 50” choice is not magical. It is the shortest length where the trajectory has meaningful state to lose and the gold continuation is still tractable to write by hand. Earlier interruption points trivialize the resumption problem; later ones make the gold continuation impossible to define cleanly without authoring half a coding session.
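Steps 1 through 3 of the protocol are mechanical enough to sketch. The helper below is hypothetical (its name, signature, and the decision to return the remainder alongside the slice are assumptions); the remainder is what a human would read when hand-authoring `gold_continuation`.

```python
def slice_session(
    turns: list[str], k: int = 50, min_turns: int = 100
) -> tuple[list[str], list[str]]:
    """Split a session into (partial_trajectory, post_interruption_turns)."""
    if len(turns) < min_turns:
        raise ValueError(f"session has {len(turns)} turns, need at least {min_turns}")
    if not 0 < k < len(turns):
        raise ValueError("interruption point must fall strictly inside the session")
    # Turns are 1-indexed in the spec, so turns 1..K are the first k list items.
    return turns[:k], turns[k:]


session = [f"turn {i}" for i in range(1, 121)]
partial_trajectory, remainder = slice_session(session)
print(partial_trajectory[-1], "|", remainder[0])  # turn 50 | turn 51
```

Picking the natural interruption point "around turn 50" remains a human judgment; the sketch only enforces the length floor and does the cut.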

The gold data is private. The episodes are sliced from my actual work history and contain code, decisions, and project state I do not want indexed. The benchmark spec is public; the seed data is not. This is the same arrangement the owned-corpus benchmark made and for the same reason: personal-corpus evals are useful precisely because the corpus is personal.

File Layout

The spec ships as a directory, not a document. Here is the layout the runner expects:

.scratch/experiments/exp3-rem-sleep/benchmark/
├── RESUMPTION_BENCHMARK.md     # this spec
├── episodes/
│   ├── 001.yaml                # initial_task, partial_trajectory, gold_* fields
│   ├── 002.yaml
│   └── ...
├── judges/
│   ├── continuation_judge.md   # judge prompt for continuation_correctness
│   ├── preservation_judge.md   # judge prompt for preservation_recall
│   └── forgetting_judge.md     # judge prompt for forgetting_precision
├── calibration/
│   ├── hand_grades.jsonl       # 50 hand-graded examples per component
│   └── kappa_report.md         # per-component Cohen's κ
└── runs/
    └── <run_id>/scores.parquet

Build order matters. The spec is built before the runner. The runner is built before any consolidator. The consolidator is built last. Reversing that order is how you accidentally tune the benchmark to the system you happen to be building.

The runner is intentionally boring: load episodes/*.yaml, take a consolidator name and a token budget, generate a resumption context per episode, score the three components, write to runs/<run_id>/scores.parquet. The same runner has to be able to score the naive-concat baseline, the naive-summary baseline, and the structured-consolidation system Experiment 3 actually tests. If a system needs the runner modified to score it, the system is gaming the benchmark, not passing it.
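A minimal sketch of that loop, assuming episodes are already loaded into dicts, might look like this. Everything here is hypothetical scaffolding: the `Consolidator` and `Scorer` type aliases, the `run_benchmark` name, the `id` key, and returning rows instead of writing `scores.parquet` are choices made to keep the example dependency-free.

```python
import math
from typing import Callable

# (partial_trajectory, token_budget) -> resumption context
Consolidator = Callable[[list[str], int], str]
# (episode, resumption_context) -> score in [0, 1]
Scorer = Callable[[dict, str], float]


def run_benchmark(
    episodes: list[dict],
    consolidator: Consolidator,
    token_budget: int,
    scorers: dict[str, Scorer],
) -> list[dict]:
    """Score one consolidator on every episode; one result row per episode."""
    rows = []
    for episode in episodes:
        context = consolidator(episode["partial_trajectory"], token_budget)
        scores = {name: score(episode, context) for name, score in scorers.items()}
        # Composite: geometric mean of the component scores.
        quality = math.prod(scores.values()) ** (1 / len(scores))
        rows.append({"episode_id": episode["id"], **scores, "quality": quality})
    return rows


def naive_concat(trajectory: list[str], token_budget: int) -> str:
    """Baseline: raw trajectory, truncated to the budget (whitespace tokens)."""
    return " ".join(" ".join(trajectory).split()[:token_budget])
```

The point of the shape is that every system, baseline or not, enters through the same `Consolidator` signature. A system that needs a different signature is asking for the runner to be modified.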

Where This Benchmark Breaks

This is a v0. It is not a finished artifact. Five things it explicitly does not handle:

Cross-domain validity is untested. Brain trajectories are coding sessions. A medical-record consolidation, a legal-research consolidation, a customer-support handoff — those are different distributions and will need separately constructed episode sets. The forgetting axis in particular may not transfer cleanly; the cost of dragging an abandoned hypothesis into a medical workup is structurally different from dragging an abandoned time.sleep into a debug trace.
Multi-agent handoff is out of scope. v0 measures single-agent resumption: same agent identity, same goal, fresh context. Cross-agent handoff (different model, different system prompt, different tool access) is a v2 concern. The reason to defer it is that v2 has to model the new agent’s priors, not just the old agent’s state.
Cost is not in the score. v0 holds the resumption-context token budget constant across conditions, which controls for context-length confounding (the whole point of the matched-controls commitment) but does not measure cost-quality Pareto frontiers. A system that hits the same quality at half the tokens is currently invisible to the benchmark. v1 introduces cost-aware scoring.
The gold_facts_to_forget field is the hardest one to author and the easiest one to author badly. A reviewer who is not Sharad cannot in general tell whether a “dead branch” in someone else’s trajectory was actually dead. This is the benchmark’s hardest reliability problem and the one where the calibration gate matters most. If Cohen’s κ on forgetting_precision is consistently lower than on the other two components, the field itself may be operationalized wrongly.
n=30 is small. The benchmark is calibrated to detect ~10-point differences. Sub-10-point effects are real and will be invisible to v0. A consolidation system that is only marginally better than naive summarization will produce overlapping confidence intervals and the right read of that is “no detectable effect at v0 power,” not “they’re equivalent.”
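One way to operationalize the "no detectable effect at v0 power" read is a paired bootstrap over per-episode score differences. This is a swapped-in technique chosen for illustration, not something the spec prescribes; the function names, the 10,000 resamples, the 95% level, and the percentile method are all assumptions.

```python
import random


def paired_bootstrap_ci(
    a: list[float],
    b: list[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile CI for the mean per-episode difference (a - b)."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Resample episodes with replacement and record the mean difference each time.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def read_result(lo: float, hi: float) -> str:
    if lo > 0:
        return "A outperforms B"
    if hi < 0:
        return "B outperforms A"
    return "no detectable effect at v0 power"
```

An interval that straddles zero is the third case, and per the limitation above it should be reported in exactly those words rather than as equivalence.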

If Experiment 3 ships and any of these limitations turns out to materially shape the result, the post says so. That is the pre-registration deal.

What This Benchmark Unlocks

Experiment 3 has three falsification conditions, all defined against this benchmark:

Structured consolidation does not outperform raw trajectory at matched token budget, on this benchmark, or
Structured consolidation does not outperform naive summarization, on this benchmark, or
The advantage disappears at trajectory lengths < 50 steps, on this benchmark.

Without the benchmark, any of those claims could be quietly redefined after seeing the data. With it, they can’t. The geometric-mean score is the headline number. The three component scores are the diagnostic breakdown. The Cohen’s κ is the gate that determines whether any of it gets published at all.

That is the whole stack. Spec first. Runner next. Consolidator last. And the order is the entire point — because the alternative is the version of agent benchmarking that has been shipping for two years, where the metric drifts in lockstep with whatever the vendor wanted the result to be.

The benchmark is now load-bearing for Experiment 3. The spec is public. The pre-registration is open. The next post in this series is Experiment 1 — self-conditioning replication with token-matched memory controls. After that, Experiment 3 runs against this benchmark, with κ-gated scoring, and ships the number it gets.

Series: The 14K Token Debt → The Terminal Was the First Agent Harness → I Built an AI Skill That Started Improving Itself → 91.55% on LongMemEval, and the Benchmark I’m Building Instead → Brilliant but Amnesiac: The Coherence Cliff → Before I Run the Next Three Experiments → this post. Next: Experiment 1 results.