69% of agent skills from Anthropic, OpenAI, and Vercel have no error handling. No "if this fails, try that." No recovery path. After auditing 53 skills across three vendors, I stopped blaming models and started building tooling.
The SKILL.md specification defines a vendor-neutral format for packaging agent instructions, examples, and metadata into a single portable file. Think README.md for agent tools. The spec existed. What didn't exist was anything to enforce it.
So I built it. Try it in the browser.
## The ecosystem
If you write TypeScript, you run tsc to catch type errors and eslint to enforce style. But the natural language controlling your agent's behavior? Most teams eyeball it. skill-tools fills that gap with four packages, each handling a distinct stage of the pipeline.
| Package | What It Does | Key Feature |
|---|---|---|
| skill-tools (CLI) | Validate, lint, and score SKILL.md files | 20+ checks, 9 quality rules, 0-100 scoring across 5 dimensions |
| @skill-tools/core | AST parser, tokenizer, type definitions | Extracts YAML frontmatter + Markdown body into a typed ParseResult |
| @skill-tools/router | Dynamic skill retrieval at runtime | BM25 algorithm optimized for SKILL.md structure |
| @skill-tools/gen | Scaffold SKILL.md from existing APIs | Generates from OpenAPI specs and MCP server definitions |
Together, they cover the full lifecycle: author a skill, validate its structure, measure its quality, and retrieve it at runtime. The parse and lint stages run offline. The router runs at query time. Nothing calls an LLM.
## Parse
@skill-tools/core extracts frontmatter, validates the name (lowercase alphanumeric + hyphens, 1-64 chars), checks description length, resolves file references (scripts/, references/, assets/), and counts tokens. It returns a typed ParseResult, either the parsed skill or structured errors. No exceptions, no throwing. Error-as-values forces callers to handle failure.
Parsing alone caught 4 of 53 skills in my audit: missing file references that would silently break at runtime. That's the kind of thing that leads to agent loops: the skill says "read this file," the agent tries, the file isn't there, the agent retries forever.
## Lint
Eight rules that catch real problems. no-secrets scans for 11 token patterns (OpenAI sk-, Stripe sk_live_, GitHub ghp_, AWS AKIA, private keys, JWTs). no-hardcoded-paths catches /Users/... and C:\Users\.... description-specificity flags vague verbs like "manage" and "handle." If your description could describe any tool, it describes none of them. Three severity levels: errors block CI, warnings educate, info suggests improvements.
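A minimal sketch of how pattern rules like these can work. The regexes here are simplified stand-ins for the token prefixes named above, and the rule names and severities are my own, not the linter's exact definitions:

```typescript
// Simplified secret and path scanning; regexes, rule names, and
// severities are illustrative, not the real rule set.
type Severity = "error" | "warning" | "info";
type Finding = { rule: string; severity: Severity; match: string };

const SECRET_PATTERNS: Array<[string, RegExp]> = [
  ["openai-key", /\bsk-[A-Za-z0-9]{20,}\b/],
  ["stripe-live-key", /\bsk_live_[A-Za-z0-9]{10,}\b/],
  ["github-token", /\bghp_[A-Za-z0-9]{36}\b/],
  ["aws-access-key", /\bAKIA[0-9A-Z]{16}\b/],
];

const PATH_PATTERNS: Array<[string, RegExp]> = [
  ["unix-home-path", /\/Users\/[^\s"']+/],
  ["windows-home-path", /C:\\Users\\[^\s"']+/],
];

function lint(body: string): Finding[] {
  const findings: Finding[] = [];
  for (const [rule, re] of SECRET_PATTERNS) {
    const m = body.match(re);
    // Leaked secrets block CI, so they report at error severity.
    if (m) findings.push({ rule: `no-secrets/${rule}`, severity: "error", match: m[0] });
  }
  for (const [rule, re] of PATH_PATTERNS) {
    const m = body.match(re);
    if (m) findings.push({ rule: `no-hardcoded-paths/${rule}`, severity: "warning", match: m[0] });
  }
  return findings;
}
```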
The lint findings get more interesting when applied at scale, which is exactly what I did across 53 skills from three vendors.
## Score
Five dimensions, 100 points total. Here's how the weight is distributed and why:
| Dimension | Points | What It Measures |
|---|---|---|
| Description Quality | 30 | Action verbs, trigger phrases, name-description uniqueness |
| Instruction Clarity | 25 | Code examples, numbered steps, error handling sections |
| Spec Compliance | 20 | Required fields, token budget |
| Progressive Disclosure | 15 | Long skills reference external files instead of inlining |
| Security | 10 | No leaked secrets, hardcoded paths, dangerous patterns |
Description carries the most weight because it's what routers see first. I can tell you exactly why a skill scored 73: it has a trigger phrase (+8 pts) but no code examples (+0 pts) and a description that overlaps 60% with the name (+3 pts instead of +6). No black box. Every point is traceable.
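The traceability claim can be illustrated with a scorer that returns a per-check breakdown alongside the total. The trigger-phrase points and the halved score for name overlap mirror the example above; everything else (check names, thresholds, exact regexes) is my assumption:

```typescript
// Illustrative breakdown-style scorer for the description dimension;
// check names and most point values are assumptions.
type Check = { name: string; points: number; max: number };

function scoreDescription(skill: {
  name: string;
  description: string;
  body: string;
}): { total: number; checks: Check[] } {
  const checks: Check[] = [];
  const desc = skill.description.toLowerCase();

  // Trigger phrase: does the description say when to use the skill?
  const hasTrigger = /\b(use when|use this|when you need)\b/.test(desc);
  checks.push({ name: "trigger-phrase", points: hasTrigger ? 8 : 0, max: 8 });

  // Code examples anywhere in the body (three backticks open a fence).
  const hasCode = /`{3}/.test(skill.body);
  checks.push({ name: "code-examples", points: hasCode ? 6 : 0, max: 6 });

  // Name-description uniqueness: halve the points when the description
  // mostly repeats the name's own words.
  const nameWords = new Set(skill.name.split("-"));
  const descWords = desc.split(/\W+/).filter(Boolean);
  const overlap =
    descWords.filter((w) => nameWords.has(w)).length / Math.max(descWords.length, 1);
  checks.push({ name: "name-uniqueness", points: overlap > 0.5 ? 3 : 6, max: 6 });

  const total = checks.reduce((s, c) => s + c.points, 0);
  return { total, checks }; // every point is attributable to a named check
}
```

Because the result carries the checks, "why did this score X?" is answered by printing the array, not by re-running anything.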
## Route
@skill-tools/router takes a natural language query and a set of skills, returns ranked matches using Okapi BM25. No embeddings, no vector databases, no LLM calls. Pure full-text search with IDF weighting. I chose BM25 over embeddings because it's debuggable. When a query returns the wrong skill, I can inspect term frequencies and fix the description. Try debugging a 1536-dimension embedding.
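Okapi BM25 is a standard formula, so a toy version fits in a few lines. This is a from-scratch sketch of the ranking idea, not the router's actual implementation, and the `K1`/`B` constants are the conventional defaults, not necessarily what the router uses:

```typescript
// Toy Okapi BM25 ranker; a from-scratch sketch, not
// @skill-tools/router's implementation.
const K1 = 1.2; // term-frequency saturation
const B = 0.75; // document-length normalization

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\W+/).filter(Boolean);
}

function bm25Rank(query: string, docs: string[]): Array<{ index: number; score: number }> {
  const tokened = docs.map(tokenize);
  const avgLen = tokened.reduce((s, d) => s + d.length, 0) / docs.length;
  const N = docs.length;

  const scores = tokened.map((doc, index) => {
    let score = 0;
    for (const term of new Set(tokenize(query))) {
      const df = tokened.filter((d) => d.includes(term)).length; // document frequency
      if (df === 0) continue;
      const idf = Math.log(1 + (N - df + 0.5) / (df + 0.5)); // rare terms weigh more
      const tf = doc.filter((t) => t === term).length;
      score += (idf * (tf * (K1 + 1))) / (tf + K1 * (1 - B + B * (doc.length / avgLen)));
    }
    return { index, score };
  });
  return scores.sort((a, b) => b.score - a.score);
}
```

Every number in the score traces back to a term frequency and a document frequency you can print, which is the debuggability argument in a nutshell.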
### Contextual enrichment (v0.2.0)
Skill descriptions are often too short for BM25 to work well. A skill named docker-deploy with sections about Kubernetes and Helm charts might have none of those terms in the description. Before indexing, the router now extracts context from the skill body:
- Name parts: `docker-deploy` yields "docker" and "deploy"
- Section headings: `## Kubernetes Configuration` yields "kubernetes" and "configuration"
- Inline code refs: `helm install` yields "helm" and "install"
Max 80 context tokens, deduped against the description. Queries like "helm chart" now match skills that never mentioned Helm in their description but cover it in the body. This enrichment layer is where the most interesting routing failures get solved.
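The three extraction sources can be sketched in a few lines. The 80-token cap and dedup-against-description behavior come from the text above; the regexes and function shape are mine, not the router's exact logic:

```typescript
// Illustrative context extraction from a SKILL.md body; only the
// 80-token cap and description dedup are from the described behavior.
function extractContext(name: string, body: string, description: string): string[] {
  const out: string[] = [];

  // 1. Name parts: "docker-deploy" -> ["docker", "deploy"]
  out.push(...name.split("-"));

  // 2. Markdown section headings -> lowercase words
  for (const m of body.matchAll(/^#{1,6}\s+(.+)$/gm)) {
    out.push(...m[1].toLowerCase().split(/\W+/).filter(Boolean));
  }

  // 3. Inline code refs like `helm install` -> ["helm", "install"]
  for (const m of body.matchAll(/`([^`\n]+)`/g)) {
    out.push(...m[1].toLowerCase().split(/\W+/).filter(Boolean));
  }

  // Dedup against the description, then cap at 80 context tokens.
  const seen = new Set(description.toLowerCase().split(/\W+/));
  return [...new Set(out)].filter((t) => !seen.has(t)).slice(0, 80);
}
```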
## Why no LLMs anywhere
Deliberate choice. Every operation (parsing YAML, matching regex, computing BM25) is deterministic. Same input, same output, every time. No API keys means the tools work offline, in CI, in air-gapped environments. I've run them on planes.
No model drift. No temperature variance. No rate limits at 3am when your deploy pipeline needs to validate 200 skills. Determinism is the feature.
## CI integration
The CLI integrates with GitHub Actions and any CI system. Set a minimum score threshold and fail builds on violations.
Same idea as ESLint or clippy, except the thing being linted is the instruction your agent will execute.
## The Claude Code plugin
Tooling only works if it's in the developer's flow. I built the skill-tools-plugin for Claude Code with a post-write hook that watches SKILL.md files. When Claude Code edits a skill, the plugin transparently runs the linter and feeds results back into the model's context. The agent self-corrects its own instructions. It fixes schema issues, improves clarity, and raises its score without human intervention.
When you treat instructions as structured, testable artifacts, the agent can improve its own capabilities. That feedback loop doesn't exist when your prompts live in a string literal.
## Testing
Core validates every frontmatter field and file reference. The router tests cover BM25 ranking, context extraction, and edge cases. All tests run in CI across Node 18, 20, and 22.
## Start here
The agent skills ecosystem is growing fast. Anthropic, OpenAI, and Vercel all ship skills now. But without deterministic quality gates, every skill is a black box of unknown reliability. Treating instructions like code (parseable, lintable, scorable) is how you build agents you can trust in production.
Apache-2.0 licensed. GitHub / Docs / Try in browser