Introducing skill-tools
I built a toolchain for SKILL.md files — the emerging spec that tells AI agents what a tool does and when to use it. Parse, lint, score, and route skills with zero API keys. Everything runs as deterministic code. No LLMs anywhere in the pipeline.
Try it in the browser →

```
npm i -g skill-tools
skill-tools check my-skill.md --min-score 60
```

The problem
MCP gave agents a way to call tools. But discovery is still a mess. An agent with 50+ tools needs to figure out which one to invoke for a given query, and the tool descriptions vary wildly in quality. Some are three words. Others are walls of text with hardcoded paths and leaked secrets.
The agentskills.io spec defines SKILL.md — a markdown file with structured frontmatter that describes a skill. Name, description, instructions, examples, error handling. It's the README.md for agent tools.
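To make that concrete, here is a sketch of what such a file might look like. The frontmatter fields follow the constraints described below; the specific wording and sections are illustrative, not copied from the spec:

```markdown
---
name: docker-deploy
description: Build and deploy Docker images. Use when the user asks to containerize or ship a service.
---

## Instructions
1. Build the image with `docker build`.
2. Push it to the registry and roll out the new tag.

## Error handling
If the push fails, check registry credentials before retrying.
```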
The spec exists. What didn't exist was tooling to enforce it.
What skill-tools does
Four packages, one pipeline:
Parse
@skill-tools/core — Extracts frontmatter, validates the name (lowercase alphanumeric + hyphens, 1-64 chars), checks description length (≤1,024 chars), resolves file references (scripts/, references/, assets/), and counts tokens. Returns a typed ParseResult — either the parsed skill or structured errors. No exceptions, no throwing.
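The validation rules are simple enough to sketch in a few lines. The helper names below are illustrative, not the package's actual exports:

```typescript
// Re-implementation of the frontmatter checks described above, for
// illustration only — not the real @skill-tools/core API.

// name: lowercase alphanumeric + hyphens, 1-64 chars
const NAME_RE = /^[a-z0-9-]+$/;

function validateName(name: string): boolean {
  return name.length >= 1 && name.length <= 64 && NAME_RE.test(name);
}

// description: non-empty, at most 1,024 chars
function validateDescription(desc: string): boolean {
  return desc.length >= 1 && desc.length <= 1024;
}
```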
Lint
skill-tools CLI — Eight rules that catch real problems. no-secrets scans for 11 token patterns (OpenAI sk-, Stripe sk_live_, GitHub ghp_, AWS AKIA, private keys, JWTs). no-hardcoded-paths catches /Users/... and C:\Users\.... description-specificity flags vague verbs like "manage" and "handle". Three severity levels — errors block CI, warnings educate, info suggests improvements.
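A secrets scan in this spirit is just a table of regexes. The four patterns below are the examples named above, not the linter's full set of 11:

```typescript
// Sketch of a no-secrets rule; these patterns are illustrative,
// not the linter's actual source.
const SECRET_PATTERNS: Record<string, RegExp> = {
  openai: /\bsk-[A-Za-z0-9]{20,}\b/,       // OpenAI API keys
  stripeLive: /\bsk_live_[A-Za-z0-9]{10,}\b/, // Stripe live keys
  github: /\bghp_[A-Za-z0-9]{36}\b/,       // GitHub personal access tokens
  aws: /\bAKIA[0-9A-Z]{16}\b/,             // AWS access key IDs
};

function findSecrets(text: string): string[] {
  return Object.entries(SECRET_PATTERNS)
    .filter(([, re]) => re.test(text))
    .map(([name]) => name);
}
```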
Score
Five dimensions, 100 points total. Description quality (30 pts) — checks for action verbs, trigger phrases like "use when", and name-description uniqueness. Clarity (25 pts) — code examples, numbered steps, error handling sections. Spec compliance (20 pts) — required fields, token budget. Progressive disclosure (15 pts) — long skills need file refs. Security (10 pts) — starts at 10, loses points for secrets, curl | bash, eval $, rm -rf /.
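The security dimension's start-at-10-and-deduct behavior can be sketched directly. The penalty sizes here are illustrative guesses, not the scorer's real weights:

```typescript
// Sketch of the security dimension: start at 10, deduct per risky pattern.
// The deduction amounts are made up for illustration.
const RISKY: Array<[RegExp, number]> = [
  [/curl[^|\n]*\|\s*(ba)?sh/, 4], // curl | bash
  [/\beval\s+\$/, 3],             // eval $...
  [/\brm\s+-rf\s+\//, 3],         // rm -rf /
];

function securityScore(body: string): number {
  let score = 10;
  for (const [re, penalty] of RISKY) {
    if (re.test(body)) score -= penalty;
  }
  return Math.max(0, score);
}
```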
Route
@skill-tools/router — Given a natural language query and a set of skills, returns ranked matches using Okapi BM25. No embeddings, no vector databases, no LLM calls. Pure full-text search with IDF weighting. Top result normalized to 1.0.
Why no LLMs
This was a deliberate choice. Every operation in the pipeline — parsing YAML, matching regex patterns, computing BM25 scores — is deterministic. Same input, same output, every time. No API keys means the tools work offline, in CI, in air-gapped environments.
The scoring formula is fully transparent. I can tell you exactly why a skill scored 73: it has a trigger phrase (+8 pts) but no code examples (+0 pts) and a description that overlaps 60% with the name (+3 pts instead of +6). There's no black box.
Deterministic scoring
The score for a given SKILL.md will never change unless you change the file. No model drift, no temperature variance, no rate limits. Deploy it in CI and trust the number.
Contextual BM25
v0.2.0 adds contextual retrieval to the router. The idea is simple: skill descriptions are often too short for BM25 to work well. A skill named docker-deploy with sections about Kubernetes, Helm charts, and rolling updates — none of those terms appear in the description.
Contextual enrichment fixes this. Before indexing, the router extracts context terms from the skill body:
- Name parts — `docker-deploy` yields "docker" and "deploy"
- Section headings — `## Kubernetes Configuration` yields "kubernetes" and "configuration"
- Inline code refs — `helm install` yields "helm" and "install"
These terms get prepended to the description before BM25 indexing. Max 80 context tokens, deduped against the description to avoid inflating term frequency. The result: queries like "helm chart" now match skills that never mentioned Helm in their description but cover it in the body.
Disable it with `context: false` if you don't want it. Skills without a body behave identically to v0.1.
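The three extraction steps above can be sketched as one pass over the skill body. `extractContext` is an illustrative helper, not the router's exported API, and it skips the 80-token cap for brevity:

```typescript
// Sketch of contextual enrichment: collect terms from the name, markdown
// headings, and inline code, then dedupe against the description so the
// prepended context doesn't inflate term frequency.
function extractContext(name: string, body: string, description: string): string[] {
  const terms = new Set<string>();
  // Name parts: docker-deploy -> docker, deploy
  name.split("-").forEach((t) => terms.add(t.toLowerCase()));
  // Section headings: ## Kubernetes Configuration -> kubernetes, configuration
  for (const m of body.matchAll(/^#{1,6}\s+(.+)$/gm)) {
    m[1].split(/\s+/).forEach((t) => terms.add(t.toLowerCase()));
  }
  // Inline code: `helm install` -> helm, install
  for (const m of body.matchAll(/`([^`]+)`/g)) {
    m[1].split(/\s+/).forEach((t) => terms.add(t.toLowerCase()));
  }
  // Drop terms the description already contains
  const seen = new Set(description.toLowerCase().split(/\s+/));
  return [...terms].filter((t) => !seen.has(t));
}
```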
The BM25 math
For those who care about the internals. Okapi BM25 with standard parameters:
```
// Parameters
k1 = 1.2   // term frequency saturation
b  = 0.75  // length normalization

// IDF — how rare is this term across all skills?
IDF = log((N - df + 0.5) / (df + 0.5) + 1)

// Score — per term, per document
score = IDF × (tf × (k1 + 1)) / (tf + k1 × (1 - b + b × (dl / avgdl)))
```

N is total skills, df is how many skills contain the term, tf is term frequency in the document, dl is document length, avgdl is average document length. Final scores are normalized so the top match = 1.0.
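The formula translates almost line for line into code. This is a minimal sketch matching the math above, with a naive whitespace tokenizer — not the @skill-tools/router source:

```typescript
// Minimal Okapi BM25 over a set of documents, normalized so the top match = 1.0.
const k1 = 1.2;  // term frequency saturation
const b = 0.75;  // length normalization

function tokenize(s: string): string[] {
  return s.toLowerCase().split(/\W+/).filter(Boolean);
}

function bm25Rank(query: string, docs: string[]): number[] {
  const tokens = docs.map(tokenize);
  const N = docs.length;
  const avgdl = tokens.reduce((sum, t) => sum + t.length, 0) / N;
  const qTerms = tokenize(query);

  const scores = tokens.map((doc) => {
    const dl = doc.length;
    let score = 0;
    for (const term of qTerms) {
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue;
      const df = tokens.filter((d) => d.includes(term)).length;
      const idf = Math.log((N - df + 0.5) / (df + 0.5) + 1);
      score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (dl / avgdl)));
    }
    return score;
  });

  // Normalize: top match = 1.0
  const max = Math.max(...scores);
  return max > 0 ? scores.map((s) => s / max) : scores;
}
```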
CI integration
There's a GitHub Action that runs in your pipeline:
```
# .github/workflows/skill-check.yml
- uses: skill-tools/skill-tools@main
  with:
    path: skills/
    min-score: 60
    fail-on: warning
```

It finds all .md files in the path, runs the full pipeline (parse → lint → score), and fails the build if any skill scores below the threshold or has lint violations at the specified severity.
You can also use the CLI directly:
```
# Check a single file
skill-tools check my-skill.md --min-score 60 --fail-on warning

# Lint only
skill-tools lint skills/

# Score with JSON output for downstream tools
skill-tools score my-skill.md --format json

# Generate a SKILL.md from a text description
skillgen from-text docker-deploy "Docker deployment helper for K8s"

# Generate from an OpenAPI spec
skillgen openapi api-spec.yaml
```

The gen package
@skill-tools/gen generates SKILL.md files from existing sources. Give it a name and description and it scaffolds the frontmatter, instructions, examples, and error handling sections. Give it an OpenAPI spec and it extracts endpoints, parameters, and response schemas into a structured skill.
The output is a starting point, not a finished product. The lint and score tools then tell you what to improve.
152 tests
The monorepo has 152 tests across 15 test suites. Core has parser tests that validate every frontmatter field, resolver tests for file reference resolution, and a tokenizer test suite. The router has 74 tests covering BM25 ranking, context extraction, and edge cases like empty queries. The linter and scorer each have dedicated suites that pin every rule and every scoring dimension.
All tests run in CI across Node 18, 20, and 22.
Get started
```
# Install globally
npm i -g skill-tools

# Or use in a project
npm i @skill-tools/core @skill-tools/router
```

The entire ecosystem is Apache-2.0 licensed. If you're building agents with MCP tools and want to enforce quality on your skill definitions, give it a try.
Try it in the browser →