A team I worked with burned $1,400 in a single day. Their agent was calling GitHub's MCP server for simple repo queries, injecting 43 tool schemas into every request. That's 55,000 tokens before the model even started thinking. The same queries through gh CLI cost $82. That 17x multiplier is what happens when you pick the wrong tool interface without looking at the data.

I've built with all three approaches: MCP servers for BAP, a CLI for skill-tools, and SKILL.md files for staff-engineer. They solve different problems. Here's what the numbers say about when to use each.

Three approaches, one goal

Every agent needs to call external tools. The question is how it gets there. This table captures the fundamental differences. Pay attention to the token cost row, because that's where the real divergence starts.

| | MCP Server | CLI Tool | SKILL.md |
|---|---|---|---|
| What it is | A server exposing tools via JSON-RPC 2.0 | A command-line program the agent shells out to | A markdown file with structured instructions |
| Protocol | JSON-RPC over stdio or HTTP/SSE | stdin/stdout/stderr + exit codes | No protocol; it's a file on disk |
| State | Stateful sessions with capability negotiation | Stateless per invocation | Stateless, loaded into context window |
| Discovery | Dynamic: tools/list at runtime | --help, man pages | Name + description in frontmatter (~100 tokens) |
| Token cost | 550 to 1,400 tokens per tool schema | ~80 tokens for prompt + 50 to 200 per --help | ~100 tokens to advertise, full body on activation |
| Runs where | Local (stdio) or remote (HTTP) | Local only | Anywhere; it's a file |
| Ecosystem | 19,500+ servers, 300+ clients | Every CLI ever built | 30+ compatible agents |

The takeaway: MCP gives you the richest protocol, CLI gives you the lightest overhead, and SKILL.md sidesteps the runtime question entirely.

Schema filtering effect: Without filtering, MCP uses 55,000 tokens (43 schemas, $55.20/month). With filtering, only ~3,200 tokens ($5.00/month). 90-97% token reduction.
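The arithmetic behind that callout is simple enough to check by hand, using only the figures quoted above:

```python
# Context-cost arithmetic for GitHub's MCP server, per the numbers above.
tools = 43
avg_schema_tokens = 55_000 / tools       # average tokens per tool schema
reduction = 1 - 3_200 / 55_000           # filtered vs. unfiltered context

print(f"{avg_schema_tokens:.0f} tokens/schema, {reduction:.0%} reduction")
```

The ~1,280-token average sits squarely inside the 550-to-1,400 per-schema range reported later, and the 94% reduction lands in the claimed 90-97% band.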

The benchmark that changed my thinking

I had assumptions about these tradeoffs. Then Scalekit published actual numbers: 75 identical GitHub tasks across CLI and MCP modalities using Claude Sonnet 4. The gap was wider than I expected.

Token cost per task

Look at the rightmost column. These aren't edge cases. They're routine GitHub operations.

| Task | CLI Tokens | MCP Tokens | Multiplier |
|---|---|---|---|
| Repo language & license | 1,365 | 44,026 | 32x |
| PR details & review | 1,648 | 32,279 | 20x |
| Repo metadata & install | 9,386 | 82,835 | 9x |
| Merged PRs by contributor | 5,010 | 33,712 | 7x |
| Latest release & deps | 8,750 | 37,402 | 4x |

MCP used 4 to 32x more tokens than CLI for the same tasks. The culprit: every MCP call injects the full tool schema into context. GitHub's MCP server exposes 43 tools. That's ~55,000 tokens of schema before the agent writes a single line of output.

Source: Scalekit MCP vs CLI Benchmark, 75 runs, p < 0.05

Reliability

Token cost is one thing. But the reliability gap is what should worry you in production.

| Modality | Success Rate | Failures |
|---|---|---|
| CLI | 100% (25/25) | None |
| CLI + SKILL.md | 100% (25/25) | None |
| MCP | 72% (18/25) | 7 TCP timeout failures |

100% vs 72%. CLI was perfectly reliable. MCP failed 28% of the time on TCP timeouts. This matches what I've seen in production. MCP servers need health monitoring, retry logic, and fallback strategies that CLIs simply don't require.

Monthly cost at scale

Multiply these differences across a real workload. At 10,000 operations/month on Sonnet 4 pricing ($3/M input, $15/M output):

| Modality | Monthly Cost | Relative |
|---|---|---|
| CLI | $3.20 | 1x |
| MCP (via gateway) | $5.00 | 1.6x |
| MCP (direct) | $55.20 | 17x |

That gateway row matters. With schema filtering, MCP's cost drops from 17x to 1.6x, a reasonable premium for the protocol benefits. Without it, you're paying for 43 tool definitions on every call.
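As a sanity check, the relative column is just the ratio of the dollar figures, which a few lines of Python confirm:

```python
# Relative multipliers derived from the monthly costs in the table above.
costs = {"cli": 3.20, "mcp_gateway": 5.00, "mcp_direct": 55.20}
relative = {name: cost / costs["cli"] for name, cost in costs.items()}

print({name: f"{mult:.1f}x" for name, mult in relative.items()})
```

Direct MCP works out to 17.25x, which the table rounds to 17x.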


The context window tax

Those tool schemas don't just cost money. They consume the context window your agent needs for actual reasoning. A study of 856 tools across 103 MCP servers quantified the damage:

  • Each MCP tool costs 550 to 1,400 tokens just for its schema
  • GitHub's MCP server (43 tools) = ~55,000 tokens before any work happens
  • A real-world 3-server setup consumed 143,000 of 200,000 tokens, 72% of context on tool definitions alone
  • 97.1% of tool descriptions contained at least one quality defect
  • 56% of tools failed to clearly state their purpose

Source: arXiv 2602.14878, 856 tools, 103 MCP servers

Compare that to SKILL.md's progressive disclosure: ~100 tokens to advertise a skill, full body loaded only when activated. The agent sees the menu first, then orders what it needs. No wasted context on tools it never calls.
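The mechanism is easy to sketch. Assuming a SKILL.md whose frontmatter carries `name` and `description` fields (the skill content here is hypothetical), progressive disclosure is just a two-phase load:

```python
import re

# A hypothetical skill file; only the frontmatter is always in context.
SKILL = """\
---
name: review-pr
description: Step-by-step checklist for reviewing pull requests.
---
## Workflow
1. Read the diff top to bottom...
"""

def split_skill(text: str) -> tuple[dict, str]:
    """Split a SKILL.md into (frontmatter dict, body)."""
    m = re.match(r"---\n(.*?)\n---\n(.*)", text, re.DOTALL)
    meta = dict(line.split(": ", 1) for line in m.group(1).splitlines())
    return meta, m.group(2)

meta, body = split_skill(SKILL)

# Phase 1: the ~100-token advertisement the agent always sees
advert = f"{meta['name']}: {meta['description']}"
# Phase 2: `body` enters context only when the agent activates the skill
print(advert)
```

The full workflow text never touches the context window unless the skill fires, which is why fifty skills cost roughly what four MCP tool schemas do.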


Where each one wins

The data makes it clear: no single approach dominates. Each has a sweet spot.

MCP wins when

  • Stateful sessions matter. BAP uses MCP because browser sessions need persistent state: navigate, click, observe, click again. Each action builds on the last.
  • Remote services need auth. Slack, Notion, Sentry all need OAuth. MCP mandates OAuth 2.1 with PKCE since March 2025.
  • Tools change at runtime. Servers can notify clients when capabilities change via notifications/tools/list_changed.
  • Enterprise governance is required. Audit trails, permissions, containerized deployment via Docker MCP Toolkit.

CLI wins when

  • Token cost matters. 4 to 32x cheaper. At scale, $3/month vs $55/month for identical work.
  • Reliability is non-negotiable. 100% vs 72%. No TCP timeouts, no server health monitoring.
  • You need Unix composability. skill-tools lint *.md | grep error | wc -l. Pipes, chaining, redirection. MCP has no native chaining.
  • CI/CD is the target. Runs in GitHub Actions, no daemon, no server process.
  • LLMs already know it. Models have seen millions of Unix pipe chains. The patterns are deep in the weights.

SKILL.md wins when

  • Instructions are the tool. Not every capability needs executable code. Workflows, rules, and decision trees can all live in a SKILL.md.
  • Token efficiency is critical. ~100 tokens/skill vs 550 to 1,400 per tool. For 50+ capabilities, this is the difference between fitting in context and not.
  • Zero runtime overhead. No server, no transport, no health monitoring. Works offline, on planes, air-gapped.
  • Portability across agents. 30+ agents support it, including Claude Code, Codex, Cursor, and Gemini CLI. No deployment needed.

What I actually shipped

Theory is nice. Here's how these tradeoffs played out in practice across three projects. Notice that BAP uses all three. The right answer is often "more than one."

| Project | MCP | CLI | SKILL.md | Why |
|---|---|---|---|---|
| BAP | Yes | Yes | Yes | MCP for stateful browser sessions. CLI for humans and CI. SKILL.md for docs. |
| skill-tools | No | Yes | No | Deterministic, stateless operations. Pipes and CI. No session needed. |
| staff-engineer | No | No | Yes | Behavioral instructions: how to review, build, ship. No executable API needed. |

Decision framework

If you remember one thing from this post, make it this table. Match your requirement to the left column and the answer falls out.

| If you need... | Use | Why |
|---|---|---|
| Agent calling tools in a session | MCP | Stateful, structured, standard auth |
| CI/CD pipeline integration | CLI | Runs everywhere, no daemon, 100% reliable |
| Behavioral instructions / workflows | SKILL.md | Zero runtime, token-efficient, portable |
| Remote service with OAuth | MCP | Built-in auth (OAuth 2.1 + PKCE) |
| Unix composition (pipes, chaining) | CLI | MCP has no native chaining |
| 50+ capabilities, limited context | SKILL.md | ~100 tokens/skill vs 550 to 1,400 per tool |
| Human developers + agents | CLI + MCP | Same core, two interfaces |

Mitigating MCP's cost

If you've decided MCP is the right protocol for your use case, don't accept the 17x cost penalty. These techniques close the gap.

  • Gateway schema filtering. Return only relevant tools per request. 90 to 97% token reduction.
  • Hierarchical routing. Route to specialized sub-agents. 99.5% context savings.
  • Dynamic toolsets. Inject schemas only for tools relevant to the current query.
  • TSV output. 30 to 40% token savings on structured responses.
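The first technique is the highest-leverage one, and its core is small. Here's a naive sketch that scores tool schemas against the incoming query by keyword overlap and forwards only the top matches; real gateways use embeddings or a routing model, and the tool names below are illustrative, not GitHub's actual MCP tool list.

```python
def filter_tools(query: str, tools: list[dict], top_k: int = 3) -> list[dict]:
    """Return only the schemas whose descriptions overlap the query,
    instead of injecting every tool definition into context."""
    words = set(query.lower().split())

    def score(tool: dict) -> int:
        return len(words & set(tool["description"].lower().split()))

    ranked = sorted(tools, key=score, reverse=True)
    return [t for t in ranked[:top_k] if score(t) > 0]

# Illustrative tool catalog (not the real GitHub MCP schemas)
TOOLS = [
    {"name": "get_pull_request", "description": "fetch a pull request and its reviews"},
    {"name": "list_releases", "description": "list releases for a repository"},
    {"name": "get_repository", "description": "fetch repository metadata license language"},
    {"name": "create_issue", "description": "open a new issue in a repository"},
]

selected = filter_tools("what license and language does this repository use", TOOLS)
print([t["name"] for t in selected])
```

Only the selected schemas get serialized into the request, so a 43-tool catalog collapses to the handful the query actually needs.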

With schema filtering alone, MCP's cost drops from 17x to 1.6x CLI. That's a reasonable premium for stateful sessions, built-in auth, and dynamic discovery.


Where this is heading

MCP went from ~100K downloads/month in November 2024 to 8M+ downloads/month. 19,500+ servers. 300+ clients. Anthropic donated it to the Linux Foundation's Agentic AI Foundation in December 2025. It's becoming infrastructure.

SKILL.md hit 30+ compatible agents in early 2026. It's complementary to MCP. A SKILL.md can instruct an agent to use MCP servers as part of its workflow. They compose, not compete.

CLI has been the standard interface for 50 years. LLMs are deeply trained on it. It's not going anywhere.

The right answer is rarely "pick one." Ship MCP for sessions, CLI for pipelines, SKILL.md for instructions. Same core capability, three interfaces, each earning its place.


Sources