A team I worked with burned $1,400 in a single day. Their agent was calling GitHub's MCP server for simple repo queries, injecting 43 tool schemas into every request. That's 55,000 tokens before the model even started thinking. The same queries through gh CLI cost $82. That 17x multiplier is what happens when you pick the wrong tool interface without looking at the data.
I've built with all three approaches: MCP servers for BAP, a CLI for skill-tools, and SKILL.md files for staff-engineer. They solve different problems. Here's what the numbers say about when to use each.
Three approaches, one goal
Every agent needs to call external tools. The question is how it gets there. This table captures the fundamental differences. Pay attention to the token cost row, because that's where the real divergence starts.
| | MCP Server | CLI Tool | SKILL.md |
|---|---|---|---|
| What it is | A server exposing tools via JSON-RPC 2.0 | A command-line program the agent shells out to | A markdown file with structured instructions |
| Protocol | JSON-RPC over stdio or HTTP/SSE | stdin/stdout/stderr + exit codes | No protocol. It's a file on disk |
| State | Stateful sessions with capability negotiation | Stateless per invocation | Stateless, loaded into context window |
| Discovery | Dynamic: tools/list at runtime | --help, man pages | Name + description in frontmatter (~100 tokens) |
| Token cost | 550 to 1,400 tokens per tool schema | ~80 tokens for prompt + 50 to 200 per --help | ~100 tokens to advertise, full body on activation |
| Runs where | Local (stdio) or remote (HTTP) | Local only | Anywhere. It's a file |
| Ecosystem | 19,500+ servers, 300+ clients | Every CLI ever built | 30+ compatible agents |
The takeaway: MCP gives you the richest protocol, CLI gives you the lightest overhead, and SKILL.md sidesteps the runtime question entirely.
The benchmark that changed my thinking
I had assumptions about these tradeoffs. Then Scalekit published actual numbers: 75 runs of identical GitHub tasks across CLI, CLI + SKILL.md, and MCP modalities, using Claude Sonnet 4. The gap was wider than I expected.
Token cost per task
Look at the rightmost column. These aren't edge cases. They're routine GitHub operations.
| Task | CLI Tokens | MCP Tokens | Multiplier |
|---|---|---|---|
| Repo language & license | 1,365 | 44,026 | 32x |
| PR details & review | 1,648 | 32,279 | 20x |
| Repo metadata & install | 9,386 | 82,835 | 9x |
| Merged PRs by contributor | 5,010 | 33,712 | 7x |
| Latest release & deps | 8,750 | 37,402 | 4x |
MCP used 4 to 32x more tokens than CLI for the same tasks. The culprit: every MCP call injects the full tool schema into context. GitHub's MCP server exposes 43 tools. That's ~55,000 tokens of schema before the agent writes a single line of output.
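The schema-tax arithmetic is easy to reproduce. A quick sketch; the per-schema figure is an assumption near the top of the 550 to 1,400 token range, chosen to show how 43 tools land at roughly 55,000 tokens:

```python
# Rough context cost of injecting every tool schema on every request.
TOOLS = 43                 # tools exposed by GitHub's MCP server
AVG_SCHEMA_TOKENS = 1280   # assumed average within the 550-1,400 range

schema_overhead = TOOLS * AVG_SCHEMA_TOKENS
print(f"Schema overhead per request: ~{schema_overhead:,} tokens")
# → Schema overhead per request: ~55,040 tokens
```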
Source: Scalekit MCP vs CLI Benchmark, 75 runs, p < 0.05
Reliability
Token cost is one thing. But the reliability gap is what should worry you in production.
| Modality | Success Rate | Failures |
|---|---|---|
| CLI | 100% (25/25) | None |
| CLI + SKILL.md | 100% (25/25) | None |
| MCP | 72% (18/25) | 7 TCP timeout failures |
100% vs 72%. CLI was perfectly reliable. MCP failed 28% of the time on TCP timeouts. This matches what I've seen in production. MCP servers need health monitoring, retry logic, and fallback strategies that CLIs simply don't require.
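The fallback strategy that pattern demands looks something like this. A sketch only: `mcp_call` is a hypothetical stub that simulates the benchmark's timeout failure mode, and the CLI fallback is stood in by `echo`:

```python
import subprocess

def mcp_call(task):
    # Hypothetical stub simulating the benchmark's TCP timeout failures
    raise TimeoutError("TCP timeout")

def run_task(task, retries=2):
    for _ in range(retries):
        try:
            return mcp_call(task)
        except TimeoutError:
            continue  # retry logic the MCP path requires
    # Last resort: the stateless CLI path, which needs no server health
    out = subprocess.run(["echo", task], capture_output=True, text=True)
    return out.stdout.strip()

print(run_task("list open PRs"))  # → list open PRs (via the CLI fallback)
```

The CLI never needs this wrapper, which is the point of the 100% row above.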
Monthly cost at scale
Multiply these differences across a real workload. At 10,000 operations/month on Sonnet 4 pricing ($3/M input, $15/M output):
| Modality | Monthly Cost | Relative |
|---|---|---|
| CLI | $3.20 | 1x |
| MCP (via gateway) | $5.00 | 1.6x |
| MCP (direct) | $55.20 | 17x |
That gateway row matters. With schema filtering, MCP's cost drops from 17x to 1.6x, a reasonable premium for the protocol benefits. Without it, you're paying for 43 tool definitions on every call.
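How a per-request token multiplier compounds into a monthly bill can be sketched directly. The per-operation token figures below are illustrative assumptions chosen to mirror the table, not the benchmark's raw data:

```python
# Turn per-operation input tokens into a monthly bill at Sonnet 4 pricing.
OPS_PER_MONTH = 10_000
INPUT_PRICE = 3 / 1_000_000   # $ per input token

def monthly_cost(avg_input_tokens_per_op):
    return OPS_PER_MONTH * avg_input_tokens_per_op * INPUT_PRICE

cli = monthly_cost(107)          # assumed lean CLI prompt
mcp = monthly_cost(107 + 1_733)  # same prompt plus assumed schema overhead
print(f"CLI: ${cli:.2f}  MCP direct: ${mcp:.2f}  ratio: {mcp/cli:.0f}x")
# → CLI: $3.21  MCP direct: $55.20  ratio: 17x
```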
The context window tax
Those tool schemas don't just cost money. They consume the context window your agent needs for actual reasoning. A study of 856 tools across 103 MCP servers quantified the damage:
- Each MCP tool costs 550 to 1,400 tokens just for its schema
- GitHub's MCP server (43 tools) = ~55,000 tokens before any work happens
- A real-world 3-server setup consumed 143,000 of 200,000 tokens, 72% of context on tool definitions alone
- 97.1% of tool descriptions contained at least one quality defect
- 56% of tools failed to clearly state their purpose
Source: arXiv 2602.14878, 856 tools, 103 MCP servers
Compare that to SKILL.md's progressive disclosure: ~100 tokens to advertise a skill, full body loaded only when activated. The agent sees the menu first, then orders what it needs. No wasted context on tools it never calls.
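Progressive disclosure is mechanically simple. A minimal sketch, assuming a SKILL.md with `---`-fenced frontmatter carrying `name` and `description` fields (the example skill is invented):

```python
def split_skill(text):
    """Split a SKILL.md into (frontmatter, body) on its '---' fences."""
    _, front, body = text.split("---", 2)
    return front.strip(), body.strip()

skill = """---
name: review-pr
description: Checklist-driven pull request review workflow.
---
## Steps
1. Read the diff before reading the description.
"""

front, body = split_skill(skill)
print(front)  # the ~100-token advertisement the agent always sees
# `body` is injected into context only when the skill is activated
```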
Where each one wins
The data makes it clear: no single approach dominates. Each has a sweet spot.
MCP wins when
- Stateful sessions matter. BAP uses MCP because browser sessions need persistent state: navigate, click, observe, click again. Each action builds on the last.
- Remote services need auth. Slack, Notion, Sentry all need OAuth. The MCP spec has mandated OAuth 2.1 with PKCE since March 2025.
- Tools change at runtime. Servers can notify clients when capabilities change via `notifications/tools/list_changed`.
- Enterprise governance is required. Audit trails, permissions, containerized deployment via Docker MCP Toolkit.
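The dynamic discovery behind those notifications is plain JSON-RPC 2.0. A sketch of the message a client writes to a stdio server (framing details and the server process itself are omitted; the `id` is arbitrary):

```python
import json

# A tools/list request per the MCP spec.
request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
line = json.dumps(request)
print(line)  # written to the server's stdin, one JSON object per line
# The response's result.tools array holds the schemas that cost
# 550-1,400 tokens each once injected into the model's context.
```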
CLI wins when
- Token cost matters. 4 to 32x cheaper. At scale, $3/month vs $55/month for identical work.
- Reliability is non-negotiable. 100% vs 72%. No TCP timeouts, no server health monitoring.
- You need Unix composability. `skill-tools lint *.md | grep error | wc -l`. Pipes, chaining, redirection. MCP has no native chaining.
- CI/CD is the target. Runs in GitHub Actions, no daemon, no server process.
- LLMs already know it. Models have seen millions of Unix pipe chains. The patterns are deep in the weights.
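From an agent harness's side, consuming a CLI is nothing more exotic than a subprocess. A sketch of the contract, with `echo` standing in for a real tool such as `skill-tools lint`:

```python
import subprocess

# A CLI gives you three channels: stdout (data), stderr (diagnostics),
# and an exit code (success/failure). No session, no server.
result = subprocess.run(
    ["echo", "no errors found"],  # stand-in for `skill-tools lint *.md`
    capture_output=True, text=True,
)
if result.returncode == 0:
    print(result.stdout.strip())  # → no errors found
else:
    print(f"lint failed: {result.stderr.strip()}")
```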
SKILL.md wins when
- Instructions are the tool. Not every capability needs executable code. Workflows, rules, and decision trees can all live in a SKILL.md.
- Token efficiency is critical. ~100 tokens/skill vs 550 to 1,400 per tool. For 50+ capabilities, this is the difference between fitting in context and not.
- Zero runtime overhead. No server, no transport, no health monitoring. Works offline, on planes, air-gapped.
- Portability across agents. 30+ agents support it, including Claude Code, Codex, Cursor, and Gemini CLI. No deployment needed.
What I actually shipped
Theory is nice. Here's how these tradeoffs played out in practice across three projects. Notice that BAP uses all three. The right answer is often "more than one."
| Project | MCP | CLI | SKILL.md | Why |
|---|---|---|---|---|
| BAP | Yes | Yes | Yes | MCP for stateful browser sessions. CLI for humans and CI. SKILL.md for docs. |
| skill-tools | No | Yes | No | Deterministic, stateless operations. Pipes and CI. No session needed. |
| staff-engineer | No | No | Yes | Behavioral instructions: how to review, build, ship. No executable API needed. |
Decision framework
If you remember one thing from this post, make it this table. Match your requirement to the left column and the answer falls out.
| If you need... | Use | Why |
|---|---|---|
| Agent calling tools in a session | MCP | Stateful, structured, standard auth |
| CI/CD pipeline integration | CLI | Runs everywhere, no daemon, 100% reliable |
| Behavioral instructions / workflows | SKILL.md | Zero runtime, token-efficient, portable |
| Remote service with OAuth | MCP | Built-in auth (OAuth 2.1 + PKCE) |
| Unix composition (pipes, chaining) | CLI | MCP has no native chaining |
| 50+ capabilities, limited context | SKILL.md | ~100 tokens/skill vs 550 to 1,400 per tool |
| Human developers + agents | CLI + MCP | Same core, two interfaces |
Mitigating MCP's cost
If you've decided MCP is the right protocol for your use case, don't accept the 17x cost penalty. These techniques close the gap.
- Gateway schema filtering. Return only relevant tools per request. 90 to 97% token reduction.
- Hierarchical routing. Route to specialized sub-agents. 99.5% context savings.
- Dynamic toolsets. Inject schemas only for tools relevant to the current query.
- TSV output. 30 to 40% token savings on structured responses.
With schema filtering alone, MCP's cost drops from 17x to 1.6x CLI. That's a reasonable premium for stateful sessions, built-in auth, and dynamic discovery.
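Schema filtering is the highest-leverage item on that list. A toy sketch of the idea: match the query against tool names and descriptions and forward only those schemas. The tool entries are invented, and real gateways use embeddings or a routing model rather than keyword overlap:

```python
# Forward only the schemas relevant to this query, not all 43.
TOOLS = {
    "create_issue": "Open a new issue in a repository.",
    "get_pull_request": "Fetch details and reviews for a pull request.",
    "list_releases": "List a repository's published releases.",
    # ... imagine ~40 more entries
}

def filter_tools(query, tools):
    words = set(query.lower().split())
    return {name: desc for name, desc in tools.items()
            if words & set(name.split("_") + desc.lower().split())}

relevant = filter_tools("summarize the pull request reviews", TOOLS)
print(sorted(relevant))  # → ['get_pull_request']
```

Only the surviving schemas reach the model's context, which is where the 90 to 97% reduction comes from.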
Where this is heading
MCP went from ~100K downloads/month in November 2024 to 8M+ downloads/month. 19,500+ servers. 300+ clients. Anthropic donated it to the Linux Foundation's Agentic AI Foundation in December 2025. It's becoming infrastructure.
SKILL.md hit 30+ compatible agents in early 2026. It's complementary to MCP. A SKILL.md can instruct an agent to use MCP servers as part of its workflow. They compose, not compete.
CLI has been the standard interface for 50 years. LLMs are deeply trained on it. It's not going anywhere.
The right answer is rarely "pick one." Ship MCP for sessions, CLI for pipelines, SKILL.md for instructions. Same core capability, three interfaces, each earning its place.
Sources
- Scalekit: MCP vs CLI Benchmark, 75 runs, Claude Sonnet 4
- arXiv 2602.14878, 856 tools, 103 MCP servers, tool description quality
- MCP Specification, Protocol version 2025-06-18
- Agent Skills Specification, SKILL.md format
- Glama MCP Directory, 19,582 servers, March 2026
- Smithery: MCP vs CLI Is the Wrong Fight
- DEV.to: MCP Context Window Analysis