Your MCP Servers Are Costing You 10 Seconds Per Session
I had nine MCP servers configured globally in Claude Code. Chrome DevTools. Browser automation. A financial data connector. Jules for async delegation. Gmail. Gemini. Reddit. Sequential thinking. A custom brain server for searching past sessions.
Most were useful — sometimes. A few were essential. But all nine booted every time I typed claude and hit enter.
One afternoon I timed it. From keystroke to cursor: 11 seconds. I disabled six servers I didn’t need for my current project. 3 seconds. Same machine, same model, same context. The only variable was the invisible infrastructure spinning up behind the prompt.
That 8-second gap is the MCP tax. And it’s the least expensive part.
What Happens When You Type claude
Before you see your first cursor, Claude Code executes a boot sequence most users never think about:
1. Parse ~/.claude/settings.json
2. For each configured MCP server:
a. Spawn a subprocess (stdio transport)
b. Wait for the server to initialize
c. Request tool definitions (listTools)
d. Collect JSON schemas for every tool
3. Inject ALL tool schemas into the system context
4. Load CLAUDE.md files (global → project → subdirectory)
5. Load memory files
6. Ready.
Steps 2a-2d happen for every server, concurrently. Servers initialize in parallel, but the session blocks until the slowest one finishes — your boot time equals MAX(server_boot_times), not the sum. In theory, nine servers shouldn’t be much slower than one. In practice, they are: each subprocess competes for CPU, disk I/O, and network (npm registry calls). On my M1 Pro, one npx -y server boots in 3 seconds. Three boot in 4. Nine boot in 11 — not because they’re sequential, but because resource contention turns parallelism into a bottleneck. One cold npx -y call stalls the event loop while npm resolves dependencies; nine of them thrash the disk cache simultaneously.
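The MAX(server_boot_times) claim is easy to sanity-check with sleeping stand-ins for the servers; real subprocesses then add the contention described above. A minimal Python sketch:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_server(boot_seconds: float) -> float:
    """Stand-in for an MCP server: just sleeps for its boot time."""
    time.sleep(boot_seconds)
    return boot_seconds

boot_times = [0.05, 0.10, 0.30]  # three "servers" with different boot costs

start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(boot_times)) as pool:
    list(pool.map(fake_server, boot_times))
wall = time.monotonic() - start

# With no resource contention, wall-clock time tracks the slowest server,
# not the sum of all of them.
print(f"max: {max(boot_times):.2f}s  sum: {sum(boot_times):.2f}s  wall: {wall:.2f}s")
```

Once real subprocesses fight over CPU, disk, and the npm registry, wall time drifts from MAX toward something worse, which is exactly what the 11-second measurement shows.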
This is the operational cost. But there’s a larger, invisible cost that persists after boot.
The Two Taxes
Tax 1: Startup Latency
Every MCP server is a subprocess. Claude Code spawns each one, establishes a JSON-RPC connection over stdio, and waits for it to report its capabilities. The latency depends on the server’s runtime:
| Server Type | Cold Start | Warm Start | What Causes It |
|---|---|---|---|
| npx -y @package/server | 3-5s | 0.5-1s | npm downloads the package if not cached |
| Python (uv run / venv) | 1-3s | 0.3-0.5s | Virtual environment activation, dependency resolution |
| Native binary (qmd mcp) | 0.1-0.3s | 0.1s | Just a fork+exec |
| Local Node.js (pre-installed) | 0.3-0.5s | 0.2s | V8 startup + module loading |
The npx -y pattern is the worst offender. It’s the default in most MCP server installation guides — “just add npx -y @modelcontextprotocol/server-whatever to your config.” What those guides don’t mention: npx -y checks the npm registry on every invocation. If the package isn’t in your local cache, it downloads it. If npm is slow, your session is slow. If you’re offline, the server fails entirely. Independent measurements show npx-launched MCP servers taking 10-15 seconds to start — turning a 5-second agent startup into a 30-second wait. The Gemini CLI hit the same problem: 8-12 seconds blocked on MCP server initialization.
The fix is embarrassingly simple: install the package globally once.
# Instead of relying on npx -y every session:
npm install -g @anthropic-ai/claude-code-mcp-server
# Then in settings.json, use the binary directly:
"command": "claude-code-mcp-server"
# Instead of:
"command": "npx", "args": ["-y", "@anthropic-ai/claude-code-mcp-server"]
The runtime language matters too. Multi-language benchmarks show stark differences in MCP server performance over stdio:
| Runtime | Avg Latency | Memory | Throughput |
|---|---|---|---|
| Go | 0.855ms | 18 MB | 1,624 RPS |
| Java | 0.835ms | 226 MB | 1,624 RPS |
| Node.js | 10.66ms | 110 MB | 559 RPS |
| Python | 26.45ms | 98 MB | 292 RPS |
Go servers use 6x less memory than Node.js and respond 12x faster. If you’re choosing between MCP server implementations, the runtime isn’t just a preference — it’s a performance multiplier.
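The multipliers quoted here fall straight out of the benchmark table; a quick check:

```python
# Figures taken from the benchmark table above.
node = {"latency_ms": 10.66, "memory_mb": 110}
go = {"latency_ms": 0.855, "memory_mb": 18}

latency_ratio = node["latency_ms"] / go["latency_ms"]
memory_ratio = node["memory_mb"] / go["memory_mb"]

print(f"Go responds ~{latency_ratio:.0f}x faster than Node.js")   # ~12x
print(f"Go uses ~{memory_ratio:.0f}x less memory than Node.js")   # ~6x
```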
In theory, startup latency should plateau, since parallel servers boot simultaneously. In practice, it scales sub-linearly but significantly: resource contention between concurrent subprocesses means each additional server degrades the boot time of every other one. My measurements above (roughly 3 seconds for one cold npx server, 4 for three, 11 for nine) bear this out. Fewer servers don’t just save their own boot time; they give every remaining server more headroom to initialize faster.
Tax 2: Schema Gravity
This is the tax nobody sees.
In The 14K Token Debt, I introduced the concept of Prompt Gravity — the observation that your system prompt’s initial tokens create an attentional bias that shapes every subsequent generation. The heavier the prompt, the stronger the gravitational pull on the model’s reasoning.
MCP tool schemas exert the same force. I call this Schema Gravity: the invisible weight of tool definitions pulling on your context window and the model’s attention before any reasoning begins.
The scale of the problem isn’t speculative — Anthropic measured it themselves. Their engineering blog on advanced tool use reports the numbers:
| Configuration | Token Overhead | Source |
|---|---|---|
| GitHub MCP server alone (35 tools) | ~26,000 tokens | Anthropic Engineering |
| 5 servers (GitHub + Slack + Sentry + Grafana + Splunk) | ~55,000 tokens | Anthropic Engineering |
| Anthropic’s internal setup before optimization | 134,000 tokens | Anthropic Engineering |
| 11 servers, 137 tools (community measurement) | 27,462 tokens | MCP Spec Issue #1576 |
| 4 MCP servers, query: “What’s 2+2?” | ~14,000 tokens ($0.21) | Claude Code Issue #3406 |
Read that last row again. Asking Claude Code “What’s 2+2?” with four MCP servers configured costs $0.21 in token overhead — 14,000 of the 15,000 tokens consumed were wasted on tool schemas for a query that needed zero tools.
Annualize it: a power user running 30 sessions per day with $0.21 of schema overhead per session is burning $2,300/year in wasted tokens. A team of 10 engineers? $23,000/year — on tool definitions nobody reads.
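The annualized figure is straightforward to verify:

```python
# Annualized schema overhead, using the per-session figure from the issue above.
cost_per_session = 0.21   # dollars wasted on tool schemas per session
sessions_per_day = 30
team_size = 10

per_user = cost_per_session * sessions_per_day * 365
per_team = per_user * team_size

print(f"per user: ${per_user:,.2f}/year")   # about $2,300
print(f"per team: ${per_team:,.2f}/year")   # about $23,000
```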
A single MCP server can inject 26,000 tokens of schema definitions. Five servers? You’re burning 55,000+ tokens. Anthropic’s own team hit 134,000 tokens — roughly 67% of a 200K context window — consumed by tool definitions alone. Before the model reads your CLAUDE.md. Before it loads your memory. Before it processes a single word of your actual request.
This is Schema Gravity in action. Those tool definitions compete directly with your conversation for the model’s attention. The transformer’s self-attention cost is O(n²) in sequence length — extra tokens don’t add linearly, they compound. A 200K context with 134K tokens of schemas has attention patterns spread across 67% noise before your actual question is even processed. This is the “lost in the middle” effect applied to tool definitions: the model’s ability to retrieve and reason about relevant context degrades as the total context grows, and schemas push everything else further from the attention hotspots at the beginning and end of the window.
One caveat: modern API deployments use prompt caching. If your MCP schemas are identical across sessions — which they typically are — the cached prefix means you pay full input cost only on the first turn. Subsequent turns hit the cache and the dollar cost drops significantly. But the attention cost doesn’t. Cached or not, those 55K tokens of tool definitions still occupy the context window and compete with your conversation for the model’s reasoning bandwidth. Schema Gravity is an attention problem, not just a billing problem.
The irony is sharp: tools designed to make the agent more capable can make it measurably worse by consuming the context it needs to think.
Tax 2b: Tool Selection Noise
Schema Gravity’s second-order effect is tool selection noise. When you configure nine MCP servers, Claude sees 50-200 tool definitions at conversation start. For any given task, you need 3-5 tools. The rest is decision overhead.
This is the $PATH problem. Every Unix system has a $PATH with thousands of executables. Ambiguous names resolve to the wrong binary. For agents, more tools means more potential for wrong tool selection (choosing browser automation when grep would suffice), tool hallucination (inventing calls that combine features from multiple schemas), and wasted reasoning tokens evaluating irrelevant options.
Anthropic’s own data confirms this: before Tool Search, tool selection accuracy on Opus 4 was just 49% — worse than a coin flip. After deferring unused schemas, accuracy jumped to 74%. Fewer visible tools = better tool choices.
The Claude Code documentation acknowledges this directly: CLI tools like gh, aws, and gcloud don’t add schema injection because the model already knows them from pre-training. The model’s training data is the documentation. You get the tool for free — no schema, no startup, no context cost.
The Audit: Do This Right Now
Open your Claude Code settings:
cat ~/.claude/settings.json | grep -A2 '"mcpServers"'
# Or just: /mcp inside a Claude Code session
Count your servers. For each one, ask:
| Question | If Yes | If No |
|---|---|---|
| Do I use this server in every project? | Keep as global | Move to project-level |
| Could I use a CLI tool instead? | Remove the MCP server | Keep |
| Does it need to maintain state across calls? | MCP is justified | CLI might be better |
| Is it an npx -y server? | Install globally | Already optimized |
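The last question in this table can be scripted. A sketch in Python, assuming the mcpServers layout shown in the config examples later in this post; adjust the path and key names if your setup differs:

```python
import json
from pathlib import Path

def audit(config: dict) -> dict[str, str]:
    """Answer the npx question from the table for every configured server."""
    verdicts = {}
    for name, spec in config.get("mcpServers", {}).items():
        if spec.get("command") == "npx":
            verdicts[name] = "npx launcher -> install the package globally"
        else:
            verdicts[name] = "already launched directly"
    return verdicts

if __name__ == "__main__":
    # The global settings path used throughout this post.
    settings = Path.home() / ".claude" / "settings.json"
    if settings.exists():
        for name, verdict in sorted(audit(json.loads(settings.read_text())).items()):
            print(f"{name:20s} {verdict}")
```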
Here’s what my audit looked like:
| Server | Verdict | Reason |
|---|---|---|
| brain (qmd) | Keep global | Cross-project memory — used every session |
| sequential-thinking | Keep global | Used for planning in most sessions |
| chrome-devtools | Move to project | Only needed for frontend/QA work |
| browser-mcp | Move to project | Same — frontend-only |
| fi_mcp | Move to project | Financial data — one project only |
| jules | Move to project | Agent delegation — specific workflows only |
| gmail | Remove | gws gmail CLI works better |
| gemini | Remove | Rarely used, CLI alternative exists |
| reddit | Remove | Used once, never again |
Result: 9 global servers → 2 global + 4 project-scoped + 3 removed. Boot time dropped from 11 seconds to 3. Context overhead dropped by roughly 70%.
The Progressive Disclosure Fix
The solution isn’t “fewer MCP servers.” It’s the right servers at the right time. This is progressive disclosure applied to infrastructure: start with the minimum, load more on demand.
1. Project-Level Scoping
Move servers from ~/.claude/settings.json (global) to your project’s .claude/settings.json:
// In ~/your-project/.claude/settings.json
{
"mcpServers": {
"chrome-devtools": {
"command": "chrome-devtools-mcp",
"args": ["--port", "9222"]
}
}
}
This server only boots when you run claude inside that project directory. Every other project skips it entirely.
2. Tool Search (Deferred Loading)
Claude Code supports deferred tool loading — MCP servers whose tools are registered by name but whose schemas aren’t loaded until first use. The schema is fetched on demand via Tool Search only when the tool is actually needed.
The impact is dramatic. Anthropic’s own measurements show:
| Metric | Before Tool Search | After Tool Search | Improvement |
|---|---|---|---|
| Token overhead (5-server setup) | ~77,000 tokens | ~8,700 tokens | 89% reduction |
| Available context window | 122,800 tokens | 191,300 tokens | +56% recovered |
| Tool selection accuracy (Opus 4) | 49% | 74% | +25 percentage points |
Tool Search auto-activates when your MCP tool descriptions exceed 10% of the context window. No manual configuration required — though you can force it with the ENABLE_TOOL_SEARCH environment variable.
One gotcha: the defer_loading field in .claude.json’s MCP server config is silently accepted but has no effect. Tool Search is all-or-nothing, not per-server. If you need more granular control, the community-built lazy-mcp proxy achieves ~95% context reduction (15K to ~800 tokens for 30 tools) with ~500ms first-call latency.
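The pattern behind Tool Search and lazy-mcp is simple to sketch: inject tool names eagerly, fetch full schemas lazily. The class below is a conceptual illustration (all names are mine), not either project’s actual implementation:

```python
class LazyToolRegistry:
    """Conceptual sketch of deferred tool loading: names up front, schemas on demand."""

    def __init__(self, schema_loaders):
        # schema_loaders: tool name -> zero-arg callable that fetches the full schema
        self._loaders = schema_loaders
        self._cache = {}

    def visible_tools(self):
        # Only the names are injected into context at session start (cheap).
        return sorted(self._loaders)

    def schema(self, name):
        # The full schema, and its token cost, is paid on first use only.
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()
        return self._cache[name]

registry = LazyToolRegistry({
    "evaluate_script": lambda: {"params": {"expression": "string"}},
    "navigate": lambda: {"params": {"url": "string"}},
})
print(registry.visible_tools())     # names only
print(registry.schema("navigate"))  # schema fetched here, then cached
```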
3. Pre-Install npx Packages
The single highest-ROI fix. For every MCP server using npx -y:
# Find all npx-based servers in your config
grep -A1 '"npx"' ~/.claude/settings.json
# Install each one globally
npm install -g @anthropic-ai/server-sequential-thinking
npm install -g @anthropic-ai/server-browser
# Update settings.json to use the binary directly
# Before: "command": "npx", "args": ["-y", "@anthropic-ai/server-sequential-thinking"]
# After: "command": "server-sequential-thinking"
Cold starts drop from 3-5 seconds to 0.3-0.5 seconds per server.
4. The 3-Server Rule
A heuristic I’ve settled on: if you have more than 3 global MCP servers, at least one should be project-scoped instead. Three is roughly the breakpoint where startup latency stays under 3 seconds and schema injection stays under 100K tokens. Above that, you’re paying a tax that compounds across every session.
The exceptions prove the rule. My two global survivors — brain (cross-project memory) and sequential-thinking (used in 80%+ of sessions) — both meet the bar: genuinely used across all projects, not replaceable by a CLI tool, and fast to boot (native binary and pre-installed Node respectively).
When MCP Is Worth the Cost
I’ve spent this entire post arguing that MCP servers are expensive. They are. But they’re expensive the way a database connection pool is expensive — the overhead is justified when you need what they provide.
MCP is worth it when:
- The server maintains state. A browser session, a database connection, a running development server — these can’t be replaced by stateless CLI calls. Each mcp__chrome-devtools__evaluate_script call operates on the same browser tab. A CLI equivalent would need to reconnect on every invocation.
- The tool needs structured input/output. MCP’s JSON-RPC protocol provides typed tool schemas — parameter validation, required fields, enum constraints. CLI tools accept string arguments. For complex operations (multi-field form submission, structured queries), MCP’s type safety prevents silent failures.
- The workflow is multi-turn. Sequential thinking, where thought N informs thought N+1 through server-side state, requires a persistent connection. You can’t pipe state between independent CLI calls without an explicit persistence layer.
- Discovery matters more than efficiency. When exploring unfamiliar capabilities — a new API, an internal platform — MCP’s self-documenting schemas help the agent understand what’s available. The token cost is the price of exploration. Once you know which tools you need, you can evaluate whether a CLI alternative exists.
Where This Breaks
Lazy loading shifts the cost, it doesn’t eliminate it. Deferred tool schemas still get injected on first use. If you need the chrome-devtools server halfway through a conversation, you pay the schema tax then — potentially disrupting a long reasoning chain with a sudden injection of tool definitions.
Timing is approximate. The startup latency numbers in this post are from my setup (M1 Pro, macOS, Homebrew-installed tools). Your mileage will vary based on hardware, network speed (for npx), and which servers you’re running. The ratios hold — npx is always slower than pre-installed, more servers always cost more than fewer — but the absolute numbers are directional.
Some servers resist project-scoping. If you use Claude Code across 10 projects and 7 of them need the same MCP server, project-scoping means maintaining 7 copies of the same config. At that point, global is the right default and the startup tax is the cost of doing business.
The token overhead exists regardless of startup timing. Even if every server boots instantly (native binaries, pre-installed packages), the schema injection cost remains. A server with 30 tool definitions injects 30 schemas into context whether it took 0.1 seconds or 5 seconds to boot. Startup optimization and schema optimization are independent problems.
Where This Is Heading
This is a transitional problem. The trajectory is clear: manual MCP configuration will feel as archaic as manual memory management within two years.
Tool Search is the embryo of an intelligent loader — it already reduced schema overhead by 89% at Anthropic. The next step is context-aware loading: the system learns which tools you need for which projects and loads them automatically. Beyond that, the MCP ecosystem itself will face selection pressure. Bloated servers with 35 tool definitions will lose to focused servers with 5 tools each — the same specialization pressure that turned monolithic Unix utilities into composable single-purpose commands.
But today, in April 2026, the intelligent loader doesn’t exist yet. Today, you’re the loader. And the audit in this post is how you do the job until the system learns to do it for you.
The Fastest Agent Boots With Exactly What It Needs
The MCP ecosystem is growing fast. As of 2026, there are thousands of MCP servers available — everything from Slack to Postgres to Figma to Kubernetes. Anthropic’s own Tool Search system supports catalogs of up to 10,000 tools. The temptation is to configure all of them. The result is an agent that boots slowly, thinks through a fog of irrelevant tool definitions, and burns context window on capabilities it doesn’t need for the task at hand.
The fix is the same principle that makes Unix powerful: do one thing well. Not “install every tool in case you need it,” but “install exactly what this task requires.”
# The anti-pattern:
~/.claude/settings.json → 9 global servers → 11 second boot → ~180K tokens of schemas
# The pattern:
~/.claude/settings.json → 2 global servers (brain, sequential-thinking)
project/.claude/settings.json → 2 project servers (chrome-devtools, browser)
→ 3 second boot → 60K tokens of schemas
→ Context window recovered for actual reasoning
My sessions now boot in 3 seconds. My context window recovered 120,000 tokens. My agent’s tool selection accuracy jumped from coin-flip to reliable. The only thing I lost was infrastructure I wasn’t using.
At 20 sessions per day, the 8 seconds I recovered per session add up to 16 hours per year of pure wait time eliminated. The 120K tokens recovered per session mean my agent can reason about larger codebases, hold longer conversations, and make fewer tool selection errors — every single session.
Your MCP servers aren’t free. Measure them. Audit them. Scope them.
Sharad Jain is an AI engineer and the author of The 14K Token Debt and The Terminal Was the First Agent Harness. He writes about agent architecture, system prompts, and the compounding returns of treating infrastructure as a first-class engineering discipline. This post is the third in a series on the hidden costs of agentic AI systems.