Ask a router loaded with 49 agent skills to "build an MCP server," and it confidently picks Figma. Not the skill literally named mcp-builder. Figma. Because Figma's description mentions "MCP server" once, and BM25 doesn't understand intent. That single misroute taught me more about skill quality than any benchmark.
I grabbed every SKILL.md I could find from anthropics/skills, openai/skills, and vercel-labs/agent-skills. 53 files total. I ran them through the full skill-tools pipeline: parse, validate, lint, score, route. Every number in this post comes from deterministic code, not LLMs or subjective judgment.
The pipeline
Five stages, each catching different classes of problems:
```shell
skill-tools parse *.md      # extract frontmatter, validate structure
skill-tools validate *.md   # check file refs, directory structure
skill-tools lint *.md       # 8 rules: secrets, paths, specificity, ...
skill-tools score *.md      # 0-100 across 5 dimensions
skillrouter index + query   # BM25 full-text routing
```

Parse and validate catch structural failures. Lint catches quality issues. Score quantifies the gap between good and great. Routing reveals whether skills are findable at all. Here's what each stage found.
Parse: 4 out of 53 failed
49 skills parsed cleanly. Four didn't. All four failed for the same reason: missing file references.
Both Anthropic and OpenAI ship a skill-creator meta-skill that references files like scripts/rotate_pdf.py, references/finance.md, and assets/logo.png. The files exist in the documentation but aren't included in the actual directory. My parser flagged 17 missing references in Anthropic's version, 18 in OpenAI's.
The other two failures: OpenAI's notion-knowledge-capture (missing assets/code) and notion-research-documentation (missing references/citations).
This is the kind of thing that breaks silently in production. The skill says "read this file," the agent tries to read it, the file isn't there. My parser catches this at build time, before an agent ever sees it.
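The check itself is simple. This is a hypothetical sketch, not the actual skill-tools parser: scan SKILL.md for anything that looks like a path into the five spec directories, and verify each one exists next to the file.

```python
import re
from pathlib import Path

# Path-like tokens pointing into the spec's allowed directories.
REF_PATTERN = re.compile(
    r"\b((?:scripts|references|assets|examples|agents)/[\w./-]+)"
)

def missing_references(skill_md: Path) -> list[str]:
    """Return referenced paths that do not exist next to SKILL.md."""
    text = skill_md.read_text(encoding="utf-8")
    root = skill_md.parent
    refs = set(REF_PATTERN.findall(text))
    return sorted(r for r in refs if not (root / r).exists())
```

Run at build time, this turns a silent runtime failure ("read this file" pointing at nothing) into a CI error.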
Validation: unexpected directories
The spec defines five allowed directories: scripts/, references/, assets/, examples/, agents/. Several skills ignore this:
- templates/ in algorithmic-art (Anthropic)
- canvas-fonts/ in canvas-design (Anthropic)
- reference/* in mcp-builder (Anthropic), notion-meeting-intelligence (OpenAI)
- core/ in slack-gif-creator (Anthropic)
- themes/ in theme-factory (Anthropic)
- rules/ in all three Vercel skills
- evaluations/ in notion-meeting-intelligence, notion-spec-to-implementation (OpenAI)
* reference/ vs. references/, one character off from valid. This is the kind of typo-as-convention that spreads fast.
None of these cause parse failures. They're info-level diagnostics. But they tell me the spec and real-world usage have already diverged. The lint results tell a sharper story.
Lint: the real findings
Across 49 valid skills: 4 errors, 5 warnings, 49 info-level diagnostics.
Errors (4)
All four errors are no-hardcoded-paths:
- webapp-testing (Anthropic) hardcodes /tmp/inspect
- doc (OpenAI) hardcodes /tmp/lo_profile and /tmp/docx_pages
- sora (OpenAI) hardcodes /tmp/uv
All /tmp/ paths. On macOS, /tmp is a symlink to /private/tmp. On Windows, it doesn't exist. A skill that hardcodes /tmp/ breaks on any non-Linux system. Cross-platform portability isn't optional for skills meant to be shared.
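The portable fix is to resolve the temp directory at runtime rather than baking in `/tmp/`. A minimal sketch in Python (the directory name `skill-scratch` is illustrative, not from any of the skills):

```python
import tempfile
from pathlib import Path

# Resolves to /tmp on Linux, /private/tmp (via symlink) on macOS,
# and the user's Temp directory on Windows.
work_dir = Path(tempfile.gettempdir()) / "skill-scratch"
work_dir.mkdir(exist_ok=True)

# Better still for concurrent agents: a unique, auto-cleaned directory.
with tempfile.TemporaryDirectory(prefix="skill-") as tmp:
    scratch = Path(tmp)  # write intermediate files under `scratch`
```

The second form also avoids collisions when two agents run the same skill at once.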
Warnings (5)
Two types of warnings, both related to router discoverability:

- description-trigger-keywords (3 hits): brand-guidelines and webapp-testing (Anthropic), vercel-react-best-practices (Vercel). These descriptions don't include "use when..." or specific action verbs. Without trigger keywords, a router has zero signal for when to select the skill.
- description-specificity (2 hits): web-artifacts-builder (Anthropic) and linear (OpenAI). Both use the generic verb "manage" in their descriptions. A router can't differentiate "manage" from anything else.
Info (49)
The most common info diagnostic: missing error handling. 34 out of 49 skills have no error handling section. That's 69%. No "Error Handling" heading, no "Troubleshooting," no "if X fails" guidance. When an agent follows a skill's instructions and something breaks, it has no recovery path. It loops.
Second most common: inconsistent headings, specifically heading level skips like H1 to H3. Anthropic's pdf skill has 8 heading skip warnings alone, mostly from jumping between H1 and H3/H4 in code cookbook sections.
These diagnostics don't block CI, but they predict runtime failures. Skills without error handling are the ones that cause agent loops in production. The scores quantify exactly how much this costs.
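The error-handling check in particular is mechanical. A regex sketch of the idea (the heading names are the ones described above; the real lint rule's internals aren't shown in this post):

```python
import re

# Any heading that signals a recovery path for the agent.
RECOVERY_HEADINGS = re.compile(
    r"^#{1,6}\s*(error handling|troubleshooting|if .+ fails)",
    re.IGNORECASE | re.MULTILINE,
)

def has_error_handling(skill_body: str) -> bool:
    """True if the skill has any recovery-oriented section heading."""
    return bool(RECOVERY_HEADINGS.search(skill_body))
```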
Score distribution: 56-94
Mean score: 78.3. Scores are computed across 5 weighted dimensions:
| Dimension | Max | What It Measures |
|---|---|---|
| Description Quality | 30 | Action verbs, trigger phrases, name-description uniqueness |
| Instruction Clarity | 25 | Code examples, numbered steps, error handling, word count |
| Spec Compliance | 20 | Required fields present, token & line budgets |
| Progressive Disclosure | 15 | Long skills reference external files instead of inlining everything |
| Security | 10 | No leaked secrets, hardcoded paths, dangerous patterns |
Description carries the most weight at 30 points, and it's where the variance is highest. The top 5 and bottom 5 tell the story clearly.
Top 5
All five top scorers share a pattern: clear trigger phrases, numbered instructions, real code blocks, and error handling sections.
| Skill | Source | Score | Description | Clarity | Spec | Disclosure | Security |
|---|---|---|---|---|---|---|---|
| netlify-deploy | OpenAI | 94 | 24/30 | 25/25 | 20/20 | 15/15 | 10/10 |
| render-deploy | OpenAI | 92 | 24/30 | 25/25 | 20/20 | 15/15 | 8/10 |
| imagegen | OpenAI | 91 | 27/30 | 19/25 | 20/20 | 15/15 | 10/10 |
| cloudflare-deploy | OpenAI | 88 | 24/30 | 19/25 | 20/20 | 15/15 | 10/10 |
| vercel-deploy | OpenAI | 88 | 24/30 | 19/25 | 20/20 | 15/15 | 10/10 |
All five are deployment skills from OpenAI. netlify-deploy gets a perfect 25/25 on instruction clarity, the only skill in the entire dataset to achieve that. The deployment category has a natural advantage: the workflow is procedural, the commands are concrete, and the failure modes are well-documented.
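The rubric arithmetic is simple enough to sketch. The dimension keys below are shorthand for the table columns, not the tool's actual field names:

```python
# Max points per dimension, matching the scoring table above.
WEIGHTS = {
    "description": 30,
    "clarity": 25,
    "spec": 20,
    "disclosure": 15,
    "security": 10,
}

def total_score(dims: dict[str, int]) -> int:
    """Sum per-dimension points, each capped at its dimension's max."""
    return sum(min(dims.get(k, 0), w) for k, w in WEIGHTS.items())

# netlify-deploy's row: 24 + 25 + 20 + 15 + 10
total_score({"description": 24, "clarity": 25, "spec": 20,
             "disclosure": 15, "security": 10})  # → 94
```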
Bottom 5
The bottom of the table tells the same story in reverse: weak descriptions tank the overall score.
| Skill | Source | Score | Weakest Dimension |
|---|---|---|---|
| vercel-react-best-practices | Vercel | 56 | Description 8/30 |
| brand-guidelines | Anthropic | 59 | Description 8/30 |
| | Anthropic | 61 | Description 10/30 |
| doc-coauthoring | Anthropic | 63 | Description 13/30, Disclosure 8/15 |
| gh-address-comments | OpenAI | 64 | Clarity 6/25 |
vercel-react-best-practices scores 8/30 on description. It reads like a topic summary, not an instruction for when to invoke it. brand-guidelines has the same problem. These skills might be excellent at their job, but a router has no way to know when to pick them.
By source
Aggregating by vendor surfaces a clear hierarchy:
| Source | Skills | Avg Score | Range |
|---|---|---|---|
| OpenAI | 26 | 81.5 | 64-94 |
| Anthropic | 15 | 75.9 | 59-85 |
| Vercel | 5 | 71.0 | 56-85 |
OpenAI leads, driven by their deployment skills with strong step-by-step instructions and real error handling. Anthropic's skills tend to be longer and more comprehensive but lose points on description quality and progressive disclosure. Vercel has fewer skills but they're laser-focused on React/Next.js patterns. The scores tell you where each vendor invests their attention, and where they don't.
BM25 routing: where it gets interesting
Scores measure quality. Routing measures whether that quality translates to discoverability. After indexing all 49 parsed skills, I ran 15 natural language queries through my BM25 router. Pure full-text search, no embeddings, no LLM reranking. k1=1.2, b=0.75.
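The ranking function is compact enough to show in full. This is a minimal Okapi BM25, not skillrouter's actual implementation, with the same parameters and a naive whitespace tokenizer:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each document against the query with classic Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        dl = len(toks)
        s = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            f = tf[term]
            # Term frequency saturates via k1; b controls length normalization.
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores
```

Everything that follows, both the correct routes and the misroutes, falls out of this formula: the score rises with term overlap and nothing else.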
Correct matches
Most queries route correctly. Normalized scores (1.0 = top match):
- "deploy to Cloudflare" routed to cloudflare-deploy (1.0), vercel-deploy (0.43)
- "create an Excel spreadsheet" routed to xlsx (1.0), canvas-design (0.49)
- "implement Figma design" routed to figma-implement-design (1.0), figma (0.77)
- "generate algorithmic art" routed to algorithmic-art (1.0), imagegen (0.55)
- "fix CI pipeline" routed to gh-fix-ci (1.0)
- "generate images with AI" routed to imagegen (1.0), sora (0.97)
- "design a frontend UI" routed to frontend-design (1.0), webapp-testing (0.77)
When the query terms align with the skill description, BM25 nails it. The failures reveal where term-matching breaks down.
Misroutes: the interesting failures
Three queries produced wrong top results. These failures are more instructive than the successes.
"build an MCP server" routed to figma (1.0) instead of mcp-builder. The figma skill mentions "use the Figma MCP server" in its description. BM25 latched onto the term overlap. mcp-builder didn't even make the top 3. This is the canonical BM25 failure: it matches terms, not intent. The query means "build a new MCP server." The figma skill means "use an existing one." Same words, opposite meaning.
"make a PowerPoint presentation" routed to slack-gif-creator (1.0) instead of pptx. BM25 matched "presentation" against slack-gif-creator's description, which mentions presentation context. The actual PowerPoint skill ranked third at 0.71. One word in the wrong description outweighed an entire skill purpose-built for the task.
"write tests for my webapp" routed to internal-comms (1.0) instead of webapp-testing. "Write" overlapped with internal-comms' writing-focused description. webapp-testing didn't appear at all. Its description uses "test" but never "write," and the query uses "write" but the intended meaning is "create."
These misroutes are exactly why v0.2.0 of my router introduced contextual enrichment: I extract terms from the skill body (section headings, inline code refs, name parts) and prepend them to the description before indexing. For "build an MCP server," contextual enrichment injects terms from mcp-builder's headings and code blocks, giving BM25 the signal it needs. For high-stakes routing, you'd combine BM25 with a reranker. BM25 for fast candidate retrieval, a small model to pick the winner.
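A sketch of that enrichment step. The extraction rules here (markdown headings, backticked code refs, hyphen-split name parts) approximate what's described above, not the exact v0.2.0 implementation:

```python
import re

def enrich_description(name: str, description: str, body: str) -> str:
    """Prepend signal terms from the skill body to the description
    before indexing, so BM25 sees more than the one-line summary."""
    headings = re.findall(r"^#{1,6}\s+(.+)$", body, re.MULTILINE)
    code_refs = re.findall(r"`([^`]+)`", body)
    name_parts = name.replace("-", " ")
    extra = " ".join(headings + code_refs + [name_parts])
    return f"{extra} {description}"
```

For mcp-builder, a heading like "Create an MCP server" now lands in the indexed text, giving the query "build an MCP server" something to match.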
Conflict detection: 6 pairs
When two skills in the same index have highly similar descriptions, my router flags them as conflicts. Six pairs surfaced:
| Skill A | Skill B | Similarity |
|---|---|---|
| docx (Anthropic) | doc (OpenAI) | 1.00 |
| xlsx (Anthropic) | spreadsheet (OpenAI) | 1.00 |
| playwright (OpenAI) | webapp-testing (Anthropic) | 1.00 |
| figma (OpenAI) | figma-implement-design (OpenAI) | 0.97 |
| speech (OpenAI) | transcribe (OpenAI) | 0.86 |
| pdf (Anthropic) | canvas-design (Anthropic) | 0.72 |
Three pairs have perfect 1.0 similarity: functionally identical skills from different vendors. If you load both into the same router, the first one indexed wins arbitrarily. This is a real problem in multi-repo deployments where teams pull skills from different sources without deduplication.
The figma pair (0.97) is the most interesting: same vendor, two skills that overlap heavily. You'd want figma for general Figma interactions and figma-implement-design for design-to-code, but the descriptions don't differentiate enough for BM25 to consistently pick the right one.
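The post doesn't specify which similarity metric the router uses; token-set Jaccard over descriptions is a simple stand-in that reproduces the idea:

```python
def jaccard(a: str, b: str) -> float:
    """Token-set similarity between two descriptions (a stand-in metric,
    not necessarily the router's actual one)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def conflicts(skills: dict[str, str], threshold: float = 0.7):
    """Flag every skill pair whose description similarity crosses the bar."""
    names = sorted(skills)
    return [
        (x, y, round(jaccard(skills[x], skills[y]), 2))
        for i, x in enumerate(names)
        for y in names[i + 1:]
        if jaccard(skills[x], skills[y]) >= threshold
    ]
```

A deduplication pass like this, run at index build time, is cheap insurance for multi-repo deployments.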
What I learned
Descriptions matter more than instructions. Top skills score 24-27/30. Bottom skills score 8-13/30. The differentiator isn't whether the skill is good, it's whether the description tells a router when to select it.
Error handling is universally missing. 34 of 49 skills (69%) have no error handling section. When an agent follows the instructions and hits an error, it has nowhere to go.
BM25 is fast but breaks on intent. Contextual enrichment helps. For high-stakes routing, combine BM25 for fast candidate retrieval with a small reranker model to pick the winner.
Cross-vendor conflicts are inevitable. Three perfect-similarity pairs across Anthropic and OpenAI. Skill registries will need deduplication or namespace isolation as the ecosystem grows.
The spec is already underspecified on directories. Real skills use templates/, rules/, themes/, evaluations/. Either the spec expands or skills restructure.
The whole point of deterministic tooling is that the arguments happen over data, not intuition.