MCP went from an internal Anthropic experiment in November 2024 to, as of early 2026, 19,729 servers on Glama's directory, 97 million monthly SDK downloads, and first-class support in Claude, ChatGPT, Cursor, Gemini, Microsoft Copilot, and VS Code. After building integrations across production systems of varying complexity, I keep seeing the same pattern: the protocol is genuinely good, and the ecosystem has problems that the spec alone can't fix.

The growth timeline

The adoption curve has been unusually fast for developer tooling:

  • November 2024: Anthropic open-sources MCP with Python and TypeScript SDKs
  • March 2025: OpenAI integrates MCP across Agents SDK, Responses API, and ChatGPT desktop
  • April 2025: Google DeepMind confirms MCP support in Gemini
  • November 2025: Major spec update adding async operations and an official registry
  • December 2025: Anthropic donates MCP to the Agentic AI Foundation under Linux Foundation governance

Every major AI platform now speaks MCP. The question is not whether MCP won, but whether the ecosystem can handle what winning means.

The four production problems, in summary: the context window tax (550-1400 tokens per tool, 97.1% of descriptions defective), server reliability (72% success rate vs. 100% for the equivalent CLI), auth complexity (no enterprise SSO standard), and stateful sessions (which fight with load balancers).

Problem 1: The context window tax

Every MCP tool you register sends its name, description, and parameter schema to the model as context. Connect 20 tools and you've burned thousands of tokens before the user even asks a question. This is the context window tax, and it's the ecosystem's most underappreciated problem.

A February 2026 study (arXiv 2602.14878) examined 856 tools across 103 MCP servers and found that 97.1% of tool descriptions contained at least one quality defect, and 56% had unclear purpose statements. The most common issues: unstated limitations, missing usage guidelines, and opaque parameters.

Augmenting descriptions to fix these defects improved task success rates by a median of 5.85 percentage points but caused a 67.46% median increase in execution steps. Better descriptions made agents more accurate but slower, because the additional context consumed more of the model's attention. In 16.67% of cases, augmentation actually caused performance regressions. The descriptions helped the model understand the tool better but saturated the context window.

This is a fundamentally hard tradeoff. Short descriptions save tokens but misguide agents. Detailed descriptions improve accuracy but eat context. The researchers found one optimization: removing examples from descriptions "does not statistically degrade performance," offering a way to trim token overhead without sacrificing accuracy. But the underlying tension remains.
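To make the tax concrete, here is a rough back-of-envelope estimator. It assumes roughly 4 characters per token, which is only a heuristic, and the `create_issue` tool schema below is invented for illustration, not taken from a real server:

```python
import json

# Rough heuristic: ~4 characters per token for English-heavy JSON.
CHARS_PER_TOKEN = 4

def schema_token_cost(tool: dict) -> int:
    """Estimate the context tokens one tool definition consumes per request."""
    serialized = json.dumps(tool, separators=(",", ":"))
    return len(serialized) // CHARS_PER_TOKEN

tools = [
    {
        "name": "create_issue",
        "description": "Create a GitHub issue in the given repository.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "repo": {"type": "string", "description": "owner/name"},
                "title": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["repo", "title"],
        },
    },
]

total = sum(schema_token_cost(t) for t in tools)
print(f"{len(tools)} tools ~ {total} tokens of fixed overhead per request")
```

Multiply a figure like this by 20 connected tools and the "thousands of tokens before the first user message" claim falls out directly.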

Problem 2: Server reliability

19,729 servers sounds impressive until you try to use them. In practice, the average community MCP server works for the demo case and breaks on the first edge case. Missing error handling, no retry logic, no graceful degradation. When an MCP server fails, most agents loop, retrying the same call, getting the same error, burning tokens.

Tool authors write the happy path and ship it. The 2026 MCP roadmap acknowledges this: the Tasks feature, still experimental, revealed missing retry logic for transient failures and no standardized retention policies for completed results.
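A minimal sketch of the missing piece, assuming nothing about the eventual spec: a bounded, jittered retry wrapper around a tool call, so the agent neither retries forever nor gives up on a transient blip. `TransientToolError` is a hypothetical stand-in for whatever retryable error your client surfaces:

```python
import time
import random

class TransientToolError(Exception):
    """Stand-in for a retryable MCP tool failure (timeout, 503, etc.)."""

def call_with_retry(call, *, max_attempts: int = 3, base_delay: float = 0.5):
    """Retry a tool call with capped, jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientToolError:
            if attempt == max_attempts:
                # Surface the failure to the agent instead of burning tokens
                # on an endless retry loop.
                raise
            # Backoff schedule: ~0.5s, ~1s, ~2s, ... with jitter.
            time.sleep(base_delay * 2 ** (attempt - 1) * (0.5 + random.random() / 2))
```

The point is the bound: three attempts and a raised error is something an agent can reason about; an unbounded loop is not.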

I've started treating MCP server selection the same way I treat npm dependency selection: check the last commit date, look for error handling in the source, test the failure modes before the success modes. The 103 "official" servers tend to be solid. The remaining 19,000+ vary wildly.

Problem 3: Auth complexity

MCP's original design was local-first: stdio transport, running on your machine, talking to local tools. That model is simple and secure. The server runs in your process, no authentication needed. But the ecosystem has moved to remote servers, and enterprise adoption in 2026 is demanding capabilities the protocol doesn't yet have.

The gaps are specific: no standard for SSO-integrated auth flows, no audit trail specification, no gateway or proxy patterns, and no configuration portability between clients. Each enterprise deploying MCP at scale is solving these problems independently, which means solving them differently, which means fragmentation.

The 2026 roadmap has enterprise readiness as a priority area, but the timeline is vague: "finalize the required SEPs in Q1 2026 for inclusion in the next spec release, tentatively slated for June 2026." Meanwhile, production deployments can't wait for the spec to catch up.


Problem 4: Stateful sessions vs. horizontal scaling

The roadmap puts it directly: "stateful sessions fight with load balancers, horizontal scaling requires workarounds, and there's no standard way for a registry or crawler to learn what a server does without connecting to it."

MCP's Streamable HTTP transport maintains session state between requests. That works fine for single-server deployments. Put a load balancer in front of multiple server instances and you have the classic sticky session problem. Requests need to route to the specific instance that holds their session state, which defeats the purpose of horizontal scaling.

The planned solution has two parts: redesigning the transport to eliminate stateful requirements, and creating a .well-known metadata format so server capabilities are discoverable without a live connection. Both are solid architectural directions, but they represent breaking changes to how existing servers work. The migration path isn't defined yet.
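What a .well-known document might contain is anyone's guess at this point; the sketch below invents every field name purely to show the idea, namely that a registry or crawler could learn a server's tools and statelessness from a single static JSON fetch, with no MCP session at all:

```python
import json

# Hypothetical discovery document. The actual .well-known format has not
# been specified yet, so every field name here is a guess.
metadata = {
    "name": "example-github-server",
    "version": "1.2.0",
    "transports": ["streamable-http"],
    "stateless": True,  # i.e. safe to put behind a plain load balancer
    "tools": [
        {"name": "create_issue", "description": "Create a GitHub issue."},
        {"name": "list_prs", "description": "List open pull requests."},
    ],
}

# A crawler fetches this as a static file instead of opening a session:
document = json.dumps(metadata, indent=2)
print(document)
```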

What works at scale

It's not all problems. Here is what works reliably in production:

Focused servers with few tools. The best MCP servers do one thing well. A GitHub server that handles repos, PRs, and issues. A database server that runs queries. A file system server that reads and writes. When a server tries to be everything (50 tools covering an entire platform), the context window tax kills performance.

Typed parameters with strict schemas. Tools that define exact parameter types, enums, and required fields guide the model more effectively than tools with loose string parameters. The 5.85-point improvement from the arXiv study came largely from making parameters explicit.
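As an illustration of the difference, compare a loose and a strict JSON Schema for a hypothetical issue-listing tool (the field names are invented):

```python
# A loose schema leaves the model guessing what values are legal:
loose = {"type": "object", "properties": {"state": {"type": "string"}}}

# A strict schema constrains exactly what the model may send:
strict = {
    "type": "object",
    "properties": {
        "state": {
            "type": "string",
            "enum": ["open", "closed", "all"],
            "description": "Filter issues by state. Example: 'open'",
        },
        "limit": {
            "type": "integer",
            "minimum": 1,
            "maximum": 100,
            "description": "Maximum number of issues to return.",
        },
    },
    "required": ["state"],
    "additionalProperties": False,
}
```

With the strict version, an invalid call fails at validation time with a clear error instead of reaching your backend with a free-form string.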

Local-first for development, remote for production. stdio transport for local development is fast and requires zero config. Streamable HTTP for production enables shared access and monitoring. The two-transport model works well as long as your server code is transport-agnostic.
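Transport-agnostic here just means the handlers never see the transport. A minimal sketch of that factoring, with a stubbed handler (names invented), where both the stdio loop and the HTTP endpoint would call the same `dispatch`:

```python
# Handlers know nothing about stdio vs. HTTP; they take and return plain dicts.
def handle_read_file(params: dict) -> dict:
    # Illustrative stub; a real handler would read from disk.
    return {"content": f"contents of {params['path']}"}

HANDLERS = {"read_file": handle_read_file}

def dispatch(request: dict) -> dict:
    """Shared core invoked by both the stdio loop and the HTTP endpoint."""
    handler = HANDLERS[request["tool"]]
    return handler(request["params"])
```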

Tool descriptions written for models, not humans. The best tool descriptions I've seen follow a pattern: one sentence stating what the tool does, one sentence stating when to use it, a list of parameters with types and examples. No marketing language, no implementation details, no multi-paragraph explanations. Models are good at following structured instructions. They're bad at extracting signal from noise.
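That pattern is easy to enforce with a tiny template. The function below is illustrative, not from any SDK; the example tool and parameters are invented:

```python
def tool_description(does: str, when: str, params: dict[str, str]) -> str:
    """Render a model-facing description: what it does, when to use it, params."""
    lines = [does, when]
    for name, spec in params.items():
        lines.append(f"- {name}: {spec}")
    return "\n".join(lines)

desc = tool_description(
    does="Runs a read-only SQL query against the analytics database.",
    when="Use when the user asks for metrics, counts, or historical data.",
    params={
        "query": "string, a single SELECT statement. Example: SELECT count(*) FROM users",
        "timeout_s": "integer, seconds before the query is cancelled (default 30)",
    },
)
print(desc)
```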


What the ecosystem needs next

MCP has won the protocol war. The next battle is quality. Specifically:

Server quality certification. The Glama directory lists 19,729 servers with no quality signal beyond stars and download counts. The ecosystem needs automated testing of error handling, parameter validation, and description quality. The arXiv data says 97.1% of descriptions have defects. That is a tooling problem, not a protocol problem.

Context-aware tool loading. Instead of registering all tools at connection time, load tools dynamically based on the conversation context. If the user is asking about code, don't load the calendar tools. The MCP spec supports capability negotiation, but most clients still do eager registration.
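A client could approximate this today with a crude keyword router; the topics, keywords, and tool names below are all invented for the sketch:

```python
# Hypothetical routing table: which tool groups serve which topics.
TOOL_GROUPS = {
    "code": ["read_file", "write_file", "run_tests"],
    "calendar": ["list_events", "create_event"],
    "database": ["run_query", "describe_table"],
}

TOPIC_KEYWORDS = {
    "code": {"function", "bug", "refactor", "test", "file"},
    "calendar": {"meeting", "schedule", "tomorrow", "event"},
    "database": {"query", "table", "rows", "sql"},
}

def select_tools(message: str, budget: int = 10) -> list[str]:
    """Register only tool groups whose keywords appear in the user message."""
    words = set(message.lower().split())
    selected = []
    for topic, keywords in TOPIC_KEYWORDS.items():
        if words & keywords:
            selected.extend(TOOL_GROUPS[topic])
    return selected[:budget]
```

A real implementation would likely use embeddings rather than keyword sets, but even this crude version avoids paying the context tax for calendar tools in a coding conversation.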

Standard observability. When an MCP tool call fails, where do the logs go? What's the standard format? How do you trace a request across client, server, and external API? The enterprise adoption wave needs these answers, and the current spec doesn't provide them.

Description linting. A 97.1% defect rate does not get better with documentation. It needs automated tooling that flags unclear purposes, missing parameter descriptions, and unstated limitations, the same way eslint flags code issues. Run it in CI, block merges on failures, and the ecosystem quality improves mechanically.
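A linter for the three defect classes the study highlights could start as heuristics this simple (the thresholds and keyword lists are guesses, not derived from the study):

```python
def lint_description(desc: str) -> list[str]:
    """Flag the most common description defects with crude heuristics."""
    problems = []
    lowered = desc.lower()
    if len(desc.split()) < 8:
        problems.append("purpose likely unclear: description under 8 words")
    if "use when" not in lowered and "use this" not in lowered:
        problems.append("missing usage guideline ('Use when ...')")
    if not any(word in lowered for word in ("only", "cannot", "limit", "max")):
        problems.append("no stated limitations")
    return problems
```

Run something like this in CI and block merges on a non-empty result, and quality improves mechanically rather than by exhortation.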

MCP at 19,729 servers (at the time of writing) is a success story. MCP at 97.1% description defect rate is an ecosystem that hasn't built its quality infrastructure yet. The protocol is solid. The tooling around it needs to catch up.