Brilliant but Amnesiac: The Coherence Cliff in Long-Horizon AI Agents

I have been building a memory layer for long-horizon AI agents for six months. Not because I think context windows are too small — they are now a million tokens and counting. Because I watched a frontier coding agent confidently overwrite a working test fixture on turn 47 of a session it had been crushing through turns 1 through 46. The model did not get dumber. It got more committed to its own previous mistake.

That is the failure mode everyone is missing.

The agent industry is racing toward longer horizons — 24-hour coding agents, week-long research agents, vending-machine simulations that span a simulated year — and the conversation is dominated by two metrics: model capability (does it reason well?) and context window size (does it remember enough?). Both are real. Both are also incomplete. The dominant failure mode of long-horizon agents in 2026 is neither — it is coherence collapse, and the math of why it happens is uglier than the model capability curve suggests.

This post is the synthesis. The next two posts in this series will be the build: a 7-day Discontinuous Autonomous Lifecycle agent run with Brain as its memory layer, and a replication of the self-conditioning benchmark across Opus, Sonnet, and Haiku with and without external memory. This one lays out the map.

Raw sessions compound into structured memory — the thesis visual for what a long-horizon memory layer has to produce

The Math Everyone Skips

The single equation that should appear in every long-horizon agent pitch deck — and almost never does — is this:

P(success) = Π pᵢ ≈ pⁿ

If your agent has a 99% per-step accuracy rate — a number that would crush every static benchmark — and your task requires 100 sequential decisions, your probability of overall success is 36.6%. Drop to 95% per-step (still good!) and 50 steps gives you 7.7%. Horizon length is hyperbolically sensitive to per-step reliability:

Per-step accuracy	10 steps	50 steps	100 steps	500 steps
95%	60%	7.7%	0.6%	~0%
99%	90%	60.5%	36.6%	0.6%
99.5%	95%	78%	60%	8.1%
99.9%	99%	95%	90%	60.6%

The gap between “very good” (95%) and “frontier” (99%) is not a 4-point gain. It is the difference between a demo that dies at 15 steps and a product that survives 70. The gap between 99% and 99.9% is the difference between a multi-day agent and an indefinite one.

METR’s 2026 paper puts the current frontier (o3-class) at a 50% time horizon of ~110 minutes, with the horizon doubling roughly every 7 months since 2019. Naive extrapolation puts month-long autonomy between mid-2028 and mid-2031. That extrapolation is the optimistic case. It assumes errors are independent. They are not.

The Coherence Cliff

The independent-step model — P(success) = pⁿ — undersells the problem in one specific, structural way: in real agent trajectories, per-step accuracy itself degrades as the task progresses. The error rate at step 50 is higher than the error rate at step 5, on the same model, on the same task type.

The cleanest evidence is Sinha, Arun, and Goel’s 2026 ICLR paper, The Illusion of Diminishing Returns: Measuring Long-Horizon Execution in LLMs. They isolated the execution capability of LLMs — explicitly providing the plan and the knowledge — and measured how many sequential steps each model could correctly execute in a single turn. The result that should change how you think about long-horizon agents:

Model	Steps executable in a single turn
DeepSeek-V3 (no thinking)	~4
Claude-4 Sonnet (thinking)	432
GPT-5 thinking (“Horizon”)	2,100+

The 500× spread between DeepSeek-V3 and GPT-5 Horizon is not a reasoning gap. The paper explicitly removes reasoning from the experiment — every model is handed the plan. The spread is an execution coherence gap. And the structural cause has a name: self-conditioning.

Self-conditioning is the dynamic where a model conditions its next-token prediction on its own prior outputs — including its own errors. The longer the trajectory, the more its own past mistakes shape its present predictions. The model literally starts to believe its own lies, because its own lies are now part of the context it is autoregressing over.

Three things make self-conditioning particularly nasty:

It scales poorly with model size. Larger non-thinking models are more susceptible to self-conditioning, not less. Larger models are better at “staying in character” — including the character of a model that just made a mistake. They produce more coherent errors.
Long-context capacity does not fix it. A 1M-token context window does not help if 990K of those tokens are the agent’s own error trajectory.
Thinking models partially fix it. GPT-5 Horizon and Claude-4 Sonnet’s thinking modes catch errors before they enter the context window. That is the architectural innovation behind their multi-hundred-step single-turn execution.

Call this The Coherence Cliff: long-horizon failure is not a capability limit on the model — it is a state-maintenance limit on the trajectory. The reasoning is fine. The plan is fine. What collapses is the model’s ability to maintain a coherent view of its own progress when its context is full of its own ambiguous, lossy, and sometimes wrong prior steps.

This is the failure mode you feel when an agent loops endlessly on a fixed bug, when it confidently regresses a working feature, when it spends 30 minutes reproducing context it already had. It is not getting dumber. It is becoming increasingly conditioned on its own confused history.

Why Bigger Context Windows Don’t Help

The default industry response to “agents forget” has been to scale the context window. Gemini ships 1M tokens. Anthropic ships a 1M-token tier for Claude Opus. The marketing implies that bigger windows are bigger memory.

They are not. They are bigger working memory at best, and bigger noise floor at worst.

Three pieces of evidence:

Lost-in-the-middle persists at scale. Liu et al.’s analysis of long-context retrieval, reproduced repeatedly through 2026, shows a U-shaped curve. Information at the start (the system prompt) and at the end (the recent turn) is recalled well. Information in the middle craters. For a long-horizon agent, the middle of the trajectory is exactly where the load-bearing intermediate state lives — the variable you decided about at step 20, the constraint the user mentioned at step 50, the dead-end branch you committed not to retry at step 80. The context contains them. The attention does not find them.

The Attention Valley: a U-shaped recall curve where the system prompt peaks at the start and the current query peaks at the end — and every load-bearing intermediate step decays into the trough in the middle

AgentFold beats 671B models with 7k tokens after 100 turns. AgentFold (Tongyi Lab, Alibaba, 2025) trained a 30B agent on proactive context folding — multi-scale consolidation of past turns into compact state summaries. After 100 turns of interaction, AgentFold’s context is 7,000 tokens total. The full ReAct trace for the same 100 turns is ~50× larger. AgentFold-30B-A3B achieves 36.2% on BrowseComp, beating both DeepSeek-V3.1-671B-A37B and OpenAI’s o4-mini, while scaling to 500 turns without saturation. The lesson is not that bigger contexts are bad. The lesson is that the relevant axis is not capacity. It is consolidation.

Vending-Bench 2 measures coherence, not capability. Anthropic’s Vending-Bench 2 simulates an agent running a vending-machine business across a simulated year — roughly 20 million tokens of interaction history. The leaderboard:

Model	Mean net worth (simulated year)
Gemini 3 Pro	$5,478
Claude Opus 4.5	$4,967
GPT-5.1	$1,473

Gemini 3 did not win because it reasons better — Opus 4.5 leads SWE-bench Verified at 80.9%. It won because it maintained coherence across 20M tokens of operating history without drifting. The losers did not forget; they remembered too much, and the noise drowned the signal.

This is The Amnesia Tax: every long-horizon agent pays it, and the bill is not denominated in tokens. It is denominated in coherence — in how much of the recent trajectory the agent treats as load-bearing versus disposable, and whether it can tell the difference.

Scaling context windows does not lower the tax. It gives the agent a larger surface area to be incoherent across.

What Persistent Memory Actually Has to Do

If context windows are not memory, then what is? The shortcut answer — “RAG over conversation history” — turns out to be wrong in a specific, useful way.

A working memory layer for long-horizon agents has to do three jobs, only one of which is well-served by current retrieval-style memory systems.

Job	What it does	State of the art
1. Find	Retrieve what is relevant to the current moment	Solved enough. BM25 + dense retrieval + hybrid reranking; Mem0, Zep, Letta, Supermemory, Brain all ship this.
2. Forget	Know when stored information has been superseded	Partially solved in research (AgentFold). No off-the-shelf product currently does this well.
3. Resume	Pick up where a previous trajectory left off without rehashing	Unsolved. No public benchmark, no off-the-shelf substrate. Every team builds this in-house, badly.

Find is the easy one. Vector search, BM25, knowledge graphs — mature techniques, crowded market. I published Brain’s LongMemEval result at 91.55% recall@5 over 470 questions using nothing but stdlib BM25. Retrieval is the table stakes, not the differentiator.

Forget is harder. Real memory has to know when stored information has been superseded — when the user changed their mind, when the deployment moved, when last week’s decision was reversed in yesterday’s review. Append-only memory is just a larger context window dressed up as a database. The literature is starting to converge on the term: Evo-Memory (DeepMind, 2025) shows that most “memory” systems passively retrieve from dialogue history and do not abstract patterns from it. They remember conversations. They do not learn from them. AgentFold attacks consolidation as a learned operation. The space is wide open.

Consolidation as compression: a 2000-line raw transcript compressed into a high-signal artifact of Goal / Decisions / Rejected — the shape Job 2 has to produce, and the shape append-only memory never does

Resume is the one nobody has named, much less shipped. It is the answer to a different question than “what did I do last Tuesday?” It is the answer to “put me back into motion on the thing I was doing last Tuesday — knowing which branches were already pruned, which approaches were tried-and-rejected versus tried-and-deferred, which decisions are still load-bearing — without me reconstructing context for the next agent.”

Call this The Resumption Gap. When I ask Brain “what did we decide about the schema?” I get the answer in milliseconds. When I want an agent to pick up where the last agent left off, there is no off-the-shelf substrate. Retrieval ritual is solved. Resumption is not.

I have been building Brain partly around Jobs 1 and 2 — partly because I have been guessing at what Job 3 actually wants to look like. The next post will be the experiment.

What Long-Horizon Architecture Actually Looks Like

If the failure mode is coherence and the missing primitive is consolidating, resumable memory, the architecture follows. Three pieces, all of which already exist in research; none of which compose into a production stack.

1. Thinking models for the per-step floor. Sinha et al.’s self-conditioning result is a buy signal for thinking models in any long-horizon role. The 500× execution-coherence gap between a model with thinking and a model without is not the kind of variable you negotiate. The per-step floor has to be high or none of the rest matters.

2. Proactive context folding for working memory. AgentFold’s pattern — multi-scale consolidation of past turns into compact state summaries, learned not heuristic — is the closest thing to a production-ready answer for in-trajectory memory. Append-only ReAct traces become obsolete the moment any task crosses 50 turns. The end-state for long-horizon agent architecture has folding as a primitive, the way ReAct has tool-calling as a primitive.

3. Discontinuous lifecycles for between-trajectory memory. Here is the piece that excites me most, because it inverts the default. A truly long-horizon agent does not run continuously. It serializes its state, schedules its own resurrection via cron, and terminates the process. The next wake-up rehydrates context from disk, executes, and goes back to sleep. The pattern has a name in the literature — the Discontinuous Autonomous Lifecycle (DAL) — and the Claude Agent SDK already ships the primitives for it:

from claude_agent_sdk import ClaudeAgentOptions, ClaudeSDKClient

# Wake phase: resume the prior session by ID
options = ClaudeAgentOptions(
    resume="session_id_from_previous_run",
    continue_conversation=True,
    permission_mode="acceptEdits"
)

async with ClaudeSDKClient(options=options) as agent:
    await agent.query("Continue from where we left off.")
    # ...do bounded work...
    # Before exit: write a wake-up entry to crontab pointing at this session_id.
    # Then exit. The process terminates. The state survives on disk.

The economic case alone makes DAL inevitable: compute cost ∝ time-active, not wall-clock. A continuously-running agent waiting on a 4-hour condition burns a host the entire time. A discontinuous agent serializes, dies, and pays nothing until cron wakes it. For any workflow with a duty cycle below ~30%, DAL is the only path to a viable production economic profile.

The bio-inspired version is too obvious to ignore. Mammals consolidate memory through sleep — the hippocampus replays the day’s experience into the cortex, distilling episode into procedure, raw into wiki. The agent equivalent is not poetry; it is the same architectural move. Run the agent. Serialize state. Terminate. While dormant, run a consolidation pass that distills the recent trajectory into compact memory artifacts. Wake the agent into a context that contains the consolidated history, not the raw one. This is the REM-sleep architecture, and it is what the third post in this series will build.

Where This Breaks

I owe the reader the same honesty about this thesis that I owed about Brain’s LongMemEval result.

Limitation	Why it matters	What I would want before claiming more
The Coherence Cliff is not a unified theory	Self-conditioning is one driver; lost-in-the-middle is another; planning-failure cascades are a third; they interact in ways the literature has not disentangled	Replicate Sinha et al.’s execution benchmark with and without external memory; isolate which failure mode external memory addresses
AgentFold’s 7k-after-100-turns number is from in-domain SFT, not general	The agent was trained on folding trajectories. A frontier model without that training will not fold this well from prompting alone	Treat folding as a fine-tuning target or a tool, not a prompt pattern
The Resumption Gap is a name for an absence	I cannot point to a benchmark that measures task-resumption quality. Naming a gap is not closing it	Build the resumption benchmark before claiming a product solves it
Thinking-model latency makes high-iteration loops expensive	A model that takes 30 seconds of thinking per step is a non-starter for many real workflows	Reserve thinking-budget for steps that change state; use cheap models for read-only retrieval
Cron-driven discontinuous agents add operational surface area	The state file is now a critical artifact; the scheduler is now a critical dependency; the wake-up handler must be idempotent	Treat the wake/sleep cycle as a distributed-systems problem, with the rigor of a job queue
Most “memory” benchmarks measure retrieval, not consolidation or resumption	Optimizing for LongMemEval-style benchmarks may not move the needle on long-horizon coherence	Build owned-corpus and trajectory-resumption benchmarks alongside any retrieval benchmark; see the previous post in this series

This is a methodology argument backed by a literature synthesis. It is not a finished system. The work is the next post.

What I Am Building Next

Three concrete experiments will follow this post. I am pre-committing to all three in writing so that the failure modes are public, not silent.

Experiment 1 — The Self-Conditioning Replication. I will replicate Sinha et al.’s execution benchmark across the three Claude tiers — Opus 4.7, Sonnet 4.6, Haiku 4.5 — with and without Brain MCP as an external memory tool. The hypothesis I want to break is the assumption that external memory primarily helps with retrieval. The hypothesis I want to confirm is that external memory also helps with self-conditioning, by pulling load-bearing context out of the autoregressive trace and into a tool call that does not poison the next prediction.

Experiment 2 — The 7-Day DAL Agent. A long-horizon coding agent running across one calendar week, terminating after each work block and resuming via cron, with Brain as its memory substrate. The trace, the cost curve, and the failure log will be public. The control: a continuously-running variant of the same agent. The dependent variables: total cost, task completion rate, recovery time after a failure.

Experiment 3 — REM-Sleep Cron. The dream-cycle architecture. While the DAL agent is dormant, a consolidation pass runs against its recent trajectory, distilling episode into procedure. The waking agent inherits the consolidated history, not the raw one. I have hand-waved this for months. The next post will be either the working version or an honest writeup of why it does not work.

Each experiment becomes its own post in this series. Each post will include the harness code, the result, and the limitations. The blog is the experiment.

What I Am Actually Saying

The argument is narrower than “context windows are useless.” Bigger context is a real capability lift for single-turn work, multi-document synthesis, and long-context reasoning tasks. None of those are long-horizon agentic tasks.

The argument is sharper than that:

Long-horizon agents fail at coherence, not at capability. The failure mode is self-conditioning compounded by lost-in-the-middle compounded by error cascades — and the architectural fix is not a bigger context window. It is a persistent, consolidating, resumable memory substrate that lives outside the model’s autoregressive trace. Today’s “memory layers” solve one-third of that — retrieval. Consolidation is partially solved in research but not in product. Resumption is unsolved.

The next agent breakthrough I am betting on is not a bigger model. It is the first production system that treats the trajectory itself as a managed artifact — folded, consolidated, resumable across discontinuous waking and dormancy — and the model as a renewable resource that operates on top of the trajectory, not as the trajectory.

If that is right, the people who win this part of the agent era will be the ones who took the systems-engineering problem seriously while everyone else was waiting for the next model release.

If you are building in this space — owned-corpus memory, trajectory consolidation, discontinuous-lifecycle agents, anything that names what current memory products do not solve — I want to read your benchmark before I read your leaderboard.

The build starts in the next post.

This is post #5 in a series on agentic AI memory infrastructure. Earlier posts: The 14K Token Debt, The Terminal Was the First Agent Harness, I Built an AI Skill That Started Improving Itself, and 91.55% on LongMemEval, and the Benchmark I’m Building Instead. The next post will be Experiment 1 — the self-conditioning replication.