69% of agent skills from Anthropic, OpenAI, and Vercel have no error handling. No "if this fails, try that." No recovery path. After auditing 53 skills across three vendors, I stopped blaming models and started building tooling.

The SKILL.md specification defines a vendor-neutral format for packaging agent instructions, examples, and metadata into a single portable file. Think README.md for agent tools. The spec existed. What didn't exist was anything to enforce it.

So I built it. Try it in the browser.

skill-tools pipeline: Parse (YAML frontmatter) → Validate (20+ spec checks) → Lint (9 quality rules) → Score (5 weighted dimensions) → Route (BM25 ranking)

The ecosystem

If you write TypeScript, you run tsc to catch type errors and eslint to enforce style. But the natural language controlling your agent's behavior? Most teams eyeball it. skill-tools fills that gap with four packages, each handling a distinct stage of the pipeline.

| Package | What It Does | Key Feature |
| --- | --- | --- |
| skill-tools (CLI) | Validate, lint, and score SKILL.md files | 20+ checks, 9 quality rules, 0-100 scoring across 5 dimensions |
| @skill-tools/core | AST parser, tokenizer, type definitions | Extracts YAML frontmatter + Markdown body into a typed ParseResult |
| @skill-tools/router | Dynamic skill retrieval at runtime | BM25 algorithm optimized for SKILL.md structure |
| @skill-tools/gen | Scaffold SKILL.md from existing APIs | Generates from OpenAPI specs and MCP server definitions |

Together, they cover the full lifecycle: author a skill, validate its structure, measure its quality, and retrieve it at runtime. The parse and lint stages run offline. The router runs at query time. Nothing calls an LLM.

Parse

@skill-tools/core extracts frontmatter, validates the name (lowercase alphanumeric + hyphens, 1-64 chars), checks description length, resolves file references (scripts/, references/, assets/), and counts tokens. It returns a typed ParseResult, either the parsed skill or structured errors. No exceptions, no throwing. Error-as-values forces callers to handle failure.

Parsing alone caught 4 of 53 skills in my audit: missing file references that would silently break at runtime. That's the kind of thing that leads to agent loops: the skill says "read this file," the agent tries, the file isn't there, the agent retries forever.

Lint

Nine rules that catch real problems. no-secrets scans for 11 token patterns (OpenAI sk-, Stripe sk_live_, GitHub ghp_, AWS AKIA, private keys, JWTs). no-hardcoded-paths catches /Users/... and C:\Users\.... description-specificity flags vague verbs like "manage" and "handle." If your description could describe any tool, it describes none of them. Three severity levels: errors block CI, warnings educate, info suggests improvements.
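A minimal sketch of how a rule like no-secrets can work, using an illustrative subset of the token patterns (the real rule covers 11; these regexes are approximations, not the tool's):

```typescript
// Illustrative subset of secret-token patterns.
const SECRET_PATTERNS: { id: string; re: RegExp }[] = [
  { id: "openai-key", re: /\bsk-[A-Za-z0-9]{20,}\b/ },       // OpenAI sk-
  { id: "stripe-live-key", re: /\bsk_live_[A-Za-z0-9]{16,}\b/ }, // Stripe sk_live_
  { id: "github-pat", re: /\bghp_[A-Za-z0-9]{36}\b/ },       // GitHub ghp_
  { id: "aws-access-key", re: /\bAKIA[A-Z0-9]{16}\b/ },      // AWS AKIA
];

type Finding = { rule: string; severity: "error" };

// Scan a skill body and report every pattern family that matches.
function lintSecrets(body: string): Finding[] {
  return SECRET_PATTERNS
    .filter(p => p.re.test(body))
    .map(p => ({ rule: p.id, severity: "error" as const }));
}
```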

The lint findings get more interesting when applied at scale, which is exactly what I did across 53 skills from three vendors.

Score

Five dimensions, 100 points total. Here's how the weight is distributed and why:

| Dimension | Points | What It Measures |
| --- | --- | --- |
| Description Quality | 30 | Action verbs, trigger phrases, name-description uniqueness |
| Instruction Clarity | 25 | Code examples, numbered steps, error handling sections |
| Spec Compliance | 20 | Required fields, token budget |
| Progressive Disclosure | 15 | Long skills reference external files instead of inlining |
| Security | 10 | No leaked secrets, hardcoded paths, dangerous patterns |

Description carries the most weight because it's what routers see first. I can tell you exactly why a skill scored 73: it has a trigger phrase (+8 pts) but no code examples (+0 pts) and a description that overlaps 60% with the name (+3 pts instead of +6). No black box. Every point is traceable.
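That traceability falls out of additive scoring: every point in the total maps to a named check. A sketch, with illustrative check names and point values rather than the tool's actual rubric:

```typescript
// Each check is worth a fixed maximum; the score is just the sum of earned
// points, so any total can be decomposed check by check.
type Check = { id: string; max: number; earned: number };

function total(checks: Check[]): number {
  return checks.reduce((sum, c) => sum + c.earned, 0);
}

function explain(checks: Check[]): string[] {
  return checks.map(c => `${c.id}: ${c.earned}/${c.max}`);
}

// Hypothetical breakdown of one dimension.
const descriptionChecks: Check[] = [
  { id: "trigger-phrase", max: 8, earned: 8 },  // has a trigger phrase
  { id: "name-uniqueness", max: 6, earned: 3 }, // heavy overlap with the name
  { id: "action-verbs", max: 16, earned: 12 },
];
```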

Route

@skill-tools/router takes a natural-language query and a set of skills and returns ranked matches using Okapi BM25. No embeddings, no vector databases, no LLM calls. Pure full-text search with IDF weighting. I chose BM25 over embeddings because it's debuggable. When a query returns the wrong skill, I can inspect term frequencies and fix the description. Try debugging a 1536-dimension embedding.
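For reference, the core of Okapi BM25 fits in a few lines. This is a deliberately naive sketch of the ranking idea, not the @skill-tools/router implementation; k1 and b use the common defaults:

```typescript
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\W+/).filter(Boolean);
}

// Score every doc against the query; higher is a better match.
function bm25Scores(query: string, docs: string[], k1 = 1.2, b = 0.75): number[] {
  const docToks = docs.map(tokenize);
  const N = docs.length;
  const avgdl = docToks.reduce((s, t) => s + t.length, 0) / N;
  return docToks.map(doc => {
    let score = 0;
    for (const term of tokenize(query)) {
      const df = docToks.filter(t => t.includes(term)).length; // docs containing term
      if (df === 0) continue;
      const idf = Math.log(1 + (N - df + 0.5) / (df + 0.5));   // rare terms weigh more
      const tf = doc.filter(t => t === term).length;
      score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * doc.length) / avgdl));
    }
    return score;
  });
}
```

Every number in the final score is inspectable: term frequency, document frequency, length normalization. That is the debuggability argument in concrete form.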

Contextual enrichment (v0.2.0)

Skill descriptions are often too short for BM25 to work well. A skill named docker-deploy with sections about Kubernetes and Helm charts might have none of those terms in the description. Before indexing, the router now extracts context from the skill body:

  • Name parts: docker-deploy yields "docker" and "deploy"
  • Section headings: ## Kubernetes Configuration yields "kubernetes" and "configuration"
  • Inline code refs: `helm install` yields "helm" and "install"

Max 80 context tokens, deduped against the description. Queries like "helm chart" now match skills that never mentioned Helm in their description but cover it in the body. This enrichment layer is where the most interesting routing failures get solved.
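A sketch of that extraction under the rules above (the 80-token cap and dedup come from the description; function and parameter names are assumptions):

```typescript
// Pull extra index terms from the skill name, section headings, and inline
// code spans, deduped, capped at maxTokens.
function extractContext(name: string, body: string, maxTokens = 80): string[] {
  const terms = new Set<string>();
  for (const part of name.split("-")) terms.add(part.toLowerCase());         // name parts
  for (const m of body.matchAll(/^#{1,6}\s+(.+)$/gm)) {                      // headings
    for (const w of m[1].toLowerCase().split(/\W+/)) if (w) terms.add(w);
  }
  for (const m of body.matchAll(/`([^`]+)`/g)) {                             // inline code
    for (const w of m[1].toLowerCase().split(/\W+/)) if (w) terms.add(w);
  }
  return [...terms].slice(0, maxTokens);
}
```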

Why no LLMs anywhere

A deliberate choice. Every operation (parsing YAML, matching regex, computing BM25) is deterministic. Same input, same output, every time. No API keys means the tools work offline, in CI, in air-gapped environments. I've run them on planes.

No model drift. No temperature variance. No rate limits at 3am when your deploy pipeline needs to validate 200 skills. Determinism is the feature.

CI integration

The CLI integrates with GitHub Actions and any CI system. Set a minimum score threshold and fail builds on violations.
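The gate itself is simple; sketched here in TypeScript with hypothetical names, since the actual CLI flags aren't reproduced in this post:

```typescript
// Hypothetical build-gate logic: collect per-file scores, fail the build
// when any skill falls below the configured minimum.
type ScoredSkill = { file: string; score: number };

function ciFailures(results: ScoredSkill[], minScore: number): string[] {
  return results
    .filter(r => r.score < minScore)
    .map(r => `${r.file}: score ${r.score} is below threshold ${minScore}`);
}
```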

Same idea as ESLint or clippy, except the thing being linted is the instruction your agent will execute.

The Claude Code plugin

Tooling only works if it's in the developer's flow. I built the skill-tools-plugin for Claude Code with a post-write hook that watches SKILL.md files. When Claude Code edits a skill, the plugin transparently runs the linter and feeds results back into the model's context. The agent self-corrects its own instructions. It fixes schema issues, improves clarity, and raises its score without human intervention.

When you treat instructions as structured, testable artifacts, the agent can improve its own capabilities. That feedback loop doesn't exist when your prompts live in a string literal.

Testing

Core validates every frontmatter field and file reference. The router tests cover BM25 ranking, context extraction, and edge cases. All tests run in CI across Node 18, 20, and 22.

Start here

The agent skills ecosystem is growing fast. Anthropic, OpenAI, and Vercel all ship skills now. But without deterministic quality gates, every skill is a black box of unknown reliability. Treating instructions like code (parseable, lintable, scorable) is how you build agents you can trust in production.

Apache-2.0 licensed. GitHub / Docs / Try in browser