I build AI developer tooling for a living. My most successful project uses zero AI. Every operation in skill-tools (parsing, linting, scoring, routing) is deterministic. Regex, BM25, arithmetic. No API keys, no model calls, no embeddings. This was a deliberate design choice, and I'd make it again.
The contrarian take: the best AI tooling sometimes has no AI in it at all.
The case against API keys in developer tools
Developer tools have a unique set of constraints that most AI-first products ignore. They run in CI pipelines at 3am. They run on air-gapped machines. They run in pre-commit hooks where latency matters. They run in environments where an OPENAI_API_KEY environment variable doesn't exist and shouldn't.
Every API dependency is a failure mode. Rate limits, provider outages, and expired keys all break CI pipelines. Your tool simply stops working, and there's nothing you can do about it at 3am.
Then there's model drift. The model behind your API call today is not the same model behind it next month. OpenAI, Anthropic, and Google all update their models, sometimes with breaking behavioral changes. If your linter's behavior changes because the model changed, you don't have a linter. You have a suggestion engine with no stability guarantees.
Three design decisions
BM25 over embeddings
The skill-tools router uses BM25 for matching natural language queries to agent skills. BM25 is a term-frequency algorithm from 1994. It's not fancy. It's debuggable.
When a query routes to the wrong skill, I can inspect the term frequencies and immediately see why. "build an MCP server" routes to Figma because Figma's description mentions "MCP server" and BM25 matched on term overlap. The diagnosis is instant and the fix is clear: enrich the target skill's indexable text.
With embeddings, the same debugging session looks like: "the cosine similarity between the query vector and Skill A was 0.847 while Skill B was 0.832." Why? Because the embedding model encoded something in a 1536-dimensional space that you can't inspect. Full-text search gives you deterministic matching for identifiers, exact phrases, and concrete terms. BM25 is computationally simpler, offers predictable behavior and easily explainable results, and requires no fine-tuning.
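To make that concrete, here's a toy BM25 scorer in the spirit of the router. This is my illustrative sketch, not skill-tools' actual implementation; the function name and the skill descriptions are invented. The point is the breakdown object: every fraction of a score traces back to a specific query term.

```javascript
// Toy BM25 with a per-term score breakdown. Illustrative only; the
// skill data and function names are invented for this sketch.
const K1 = 1.2;
const B = 0.75;

const tokenize = (text) => text.toLowerCase().match(/[a-z0-9]+/g) ?? [];

function bm25Scores(query, skills) {
  const docs = skills.map((s) => tokenize(s.text));
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / docs.length;

  return skills.map((skill, i) => {
    const doc = docs[i];
    const breakdown = {};
    for (const term of new Set(tokenize(query))) {
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue;
      const df = docs.filter((d) => d.includes(term)).length;
      const idf = Math.log((docs.length - df + 0.5) / (df + 0.5) + 1);
      // Each term's contribution is recorded separately, so a misroute
      // can be traced to the exact terms that caused it.
      breakdown[term] =
        (idf * tf * (K1 + 1)) /
        (tf + K1 * (1 - B + (B * doc.length) / avgLen));
    }
    const score = Object.values(breakdown).reduce((s, x) => s + x, 0);
    return { name: skill.name, score, breakdown };
  });
}

const results = bm25Scores("build an MCP server", [
  { name: "figma", text: "Figma plugin tools. Can build an MCP server bridge." },
  { name: "mcp-builder", text: "Scaffold a new MCP server project." },
]);
```

With this toy index, figma outscores mcp-builder because it matches more query terms ("build", "an"), and the breakdown object says exactly which terms contributed and by how much. That's the debugging experience described above: the misroute is visible in the data, and the fix (enrich mcp-builder's indexable text) follows directly.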
Is BM25 worse than embeddings for semantic understanding? Yes. It matches terms, not intent. But it's predictably worse in ways I can characterize and compensate for. Contextual enrichment (described in detail in the skill routing post) fixed the MCP-builder misroute without adding any model dependency.
Regex-based linting over LLM classification
skill-tools has 9 lint rules. Each is a regex pattern or string analysis function. no-hardcoded-paths checks for /tmp/, /home/, /Users/ patterns. description-trigger-keywords checks for action verbs and "use when" phrases. no-secrets-in-content scans for API key patterns.
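As a sketch of what one of these rules can look like (the rule name comes from the list above; the implementation is my illustration, not skill-tools' actual code):

```javascript
// Illustrative implementation of a no-hardcoded-paths style rule.
// Deterministic: same content in, same violations out, on every machine.
const HARDCODED_PATH = /(^|[\s"'`=(])(\/tmp\/|\/home\/|\/Users\/)/;

function noHardcodedPaths(content) {
  const violations = [];
  content.split("\n").forEach((line, i) => {
    const match = line.match(HARDCODED_PATH);
    if (match) {
      violations.push({
        rule: "no-hardcoded-paths",
        line: i + 1,
        found: match[2],
        message: `hardcoded ${match[2]} path; use a relative or configurable path`,
      });
    }
  });
  return violations;
}
```

There's no model to consult and nothing to rate-limit; the rule runs in microseconds inside a pre-commit hook, with or without an internet connection.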
Could an LLM do this more "intelligently"? Sure. It could catch hardcoded paths that don't follow obvious patterns. It could assess description quality with more nuance. But it would also:
- Sometimes flag correct code as a violation (false positives that erode trust)
- Sometimes miss obvious violations (false negatives that erode safety)
- Produce different results on different runs for the same input
- Require an API key that might not exist in CI
- Add seconds of latency to a pre-commit hook
A regex that checks for /tmp/ paths catches 100% of /tmp/ paths, 100% of the time it runs. No false negatives for its defined scope. No variance between runs. In practice, the linter catches real issues: hardcoded /tmp/ paths that break on non-Linux systems. Same results every run, every machine.
Arithmetic scoring over vibes
The scoring system computes a 0–100 quality score across five weighted dimensions: description quality (30 points), instruction clarity (25), spec compliance (20), progressive disclosure (15), security (10). Every point is computed with arithmetic. Word counts, heading analysis, presence/absence of required sections, token budgets.
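Here's a sketch of the shape of such a scorer. The weights match the five dimensions above; the individual checks are simplified stand-ins I've invented for illustration, not skill-tools' real rules.

```javascript
// Decomposable arithmetic scoring. Weights mirror the five dimensions
// above; the per-dimension checks are simplified stand-ins.
const tokenizeWords = (text) => text.trim().split(/\s+/).filter(Boolean);

function scoreSkill(skill) {
  const dims = {
    description:
      (tokenizeWords(skill.description).length >= 20 ? 15 : 5) +
      (/use when/i.test(skill.description) ? 15 : 0),       // max 30
    clarity: Math.min(25, skill.headingCount * 5),          // max 25
    specCompliance: skill.hasRequiredSections ? 20 : 0,     // max 20
    progressiveDisclosure: skill.tokenCount <= 2000 ? 15 : 5, // max 15
    security: skill.secretMatches === 0 ? 10 : 0,           // max 10
  };
  const total = Object.values(dims).reduce((sum, x) => sum + x, 0);
  return { total, dims }; // every lost point is attributable to one dimension
}
```

Because each dimension is a pure function of measurable properties of the file, a reviewer can read the dims object and see exactly where points were lost, and the same file produces the same score on every run.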
The alternative would be something like: "Rate this skill's quality on a scale of 1–100, considering description clarity, instruction quality, and security." An LLM would give you a number. A different number every time. A number you can't decompose into "you lost 6 points on description because you're missing trigger keywords." A number that changes when the model updates.
Arithmetic scoring means: OpenAI's netlify-deploy scores 94/100. Vercel's react-best-practices scores 56/100. The 38-point gap is driven largely by description quality (24/30 vs 8/30) and instruction clarity (25/25 vs 15/25). You can argue with the rubric. You can't argue with the math.
What you get for free
Works offline. No internet required. Run it on a plane, in a submarine, in an air-gapped government network. The tool works identically everywhere because it has no external dependencies at runtime.
Works in CI. No secrets to configure, no API key rotation, no billing alerts. npm i -g skill-tools && skill-tools check *.md --min-score 60 runs in any CI environment with Node.js. Period.
Reproducible results. Same input, same output, every time. This matters enormously for CI gates. If a skill scores 72 today, it scores 72 tomorrow. If it scores 71, something in the file changed, not the model, not the weather, not the API provider's load balancer.
No cost scaling. Whether you lint 10 skills or 10,000, the cost is CPU time on your machine. No token budgets. No per-request pricing. Deterministic architectures are gaining traction precisely because predictable behavior means predictable outputs, and predictable costs.
No vendor lock-in. skill-tools doesn't care who your LLM provider is, because it doesn't have one. It won't break if OpenAI deprecates a model, Anthropic changes pricing, or Google renames an API for the third time.
When to use LLMs instead
Deterministic tooling isn't always the right answer. Here's when I'd reach for a model:
- Open-ended generation. Writing code, drafting documents, creative tasks. The output space is too large for deterministic rules.
- Semantic understanding beyond term matching. When "build an MCP server" and "create a new tool integration" need to be understood as the same intent, embeddings win.
- Subjective quality assessment. "Is this explanation clear?" requires judgment that regex can't provide.
- Novel pattern detection. Finding security vulnerabilities that don't match known patterns. Identifying code smells that require understanding context.
The decision isn't LLM vs. no-LLM. It's: can I express the success criteria as a deterministic function? If yes, use deterministic code. It's faster, cheaper, more reliable, and debuggable. If not, because the task requires understanding, judgment, or generation, use a model.
For developer tooling specifically, an enormous amount of useful work fits in the deterministic bucket. Linting is pattern matching. Scoring is arithmetic. Parsing is structured extraction. Routing with BM25 is term frequency. None of these need a model. They need clear rules, executed consistently.
The philosophical point
My take: the AI industry has a bias toward using AI for everything. If you're building AI tools, the assumption is that AI should be in the loop. But the best developer tools, the ones that run in every CI pipeline, on every developer's machine, at every hour, are boring, deterministic, and completely reliable. eslint doesn't call GPT-4. tsc doesn't need an API key. prettier formats the same code the same way every time.
skill-tools is an AI developer tool that follows the same philosophy. The domain is AI (agent skills). The tooling is not. And that's exactly why it works at 3am in a CI pipeline with no internet connection and no environment variables configured.
Zero API keys. Same results. Every time.