Power without structure is how you end up with 400-line commits, no tests, and a review queue that makes your team want to quit. The failure mode keeps repeating: someone asks Claude to "build feature X," it writes everything in one shot, misses edge cases, skips tests, and ships a PR that takes longer to review than it would have taken to write by hand.

The problem isn't Claude's capability. It's that nobody defined what "good engineering process" looks like for an AI session. A human staff engineer doesn't just write code. They assess scope, break work into reviewable chunks, enforce quality gates, and know when to ask for a second opinion. I wanted that same discipline applied automatically, every time.

So I built staff-engineer, a Claude Code plugin that encodes those instincts into hooks, commands, and subagent workflows.

Why a plugin, not a prompt

You could paste engineering guidelines into a system prompt. I tried that first. It works for about ten minutes, then the model drifts. Long system prompts get deprioritized as the context fills up. And you can't enforce anything. A prompt that says "always run tests" is a suggestion, not a gate.

Plugins give you something prompts never will: hooks that fire on real events. A pre-commit hook that checks for test gaps isn't a polite suggestion. It's a checkpoint the model has to address. A post-edit hook that flags high-risk files fires every time, not just when the model remembers to think about it. That's the difference between culture and infrastructure.

Architecture

Staff-engineer has four layers, and the order matters. Each lower layer exists because the layers above it weren't sufficient on their own.

Quality Gates (Hooks) are the foundation. They fire on git and file-system events (pre-commit, pre-push, post-edit) and they're the only layer that's truly automatic. I put these at the bottom because if the model ignores every command and workflow above, the hooks still catch the most dangerous mistakes: committing without tests, pushing to a release branch, leaving console.log in production code. Hooks are cheap insurance.
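The test-gap check can be sketched in a few lines. This is an illustrative Node sketch, not the plugin's actual hook code; the file patterns and matching heuristic are assumptions. Given the list of staged paths, it flags source files that have no corresponding staged test changes:

```javascript
// Illustrative sketch of an advisory test-gap check (not the plugin's real hook).
// Flag staged source files that have no staged test file mentioning their basename.
function findTestGaps(stagedPaths) {
  const isTest = (p) => /\.(test|spec)\.[jt]sx?$/.test(p) || p.includes("__tests__/");
  const isSource = (p) => /\.[jt]sx?$/.test(p) && !isTest(p);
  const stagedTests = stagedPaths.filter(isTest);
  const base = (p) => p.split("/").pop().replace(/\.[jt]sx?$/, "");
  return stagedPaths
    .filter(isSource)
    .filter((src) => !stagedTests.some((t) => base(t).startsWith(base(src))));
}

// Advisory, not blocking: warn and let the commit proceed either way.
const gaps = findTestGaps(["src/rateLimit.js", "src/rateLimit.test.js", "src/config.js"]);
if (gaps.length) console.warn(`No staged tests for: ${gaps.join(", ")}`);
```

The important design choice is the last three lines: an advisory hook reports and exits cleanly, so it informs the model without ever blocking the workflow.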

Agent Prompts are the specialists. Instead of one monolithic prompt that tries to be good at everything, I split the work into focused subagents: an implementer that follows TDD, a spec reviewer, a quality reviewer, a security reviewer, a debugger. Each one has a narrow job and a tight prompt. The implementer doesn't think about security. The security reviewer doesn't care about code style. This separation matters because models are dramatically better at narrow tasks than broad ones.

Lifecycle Commands are the verbs: build, review, test, ship, debug, release, sync, stats. Each is a markdown file that orchestrates the subagents and hooks for a specific workflow. They're the user-facing interface: you type /staff-engineer:build and the command handles everything from complexity assessment to PR creation.

Orchestration (Build) sits at the top. It's the main workflow that ties everything together: adaptive complexity routing, checkpointed phases, sequential spec-driven development, auto-rollback on test failure, and smart model routing. This is where the "staff engineer" metaphor becomes literal: the orchestrator makes the same decisions a senior IC would make about how much process a given task needs.

Commands

Eight commands cover the full development lifecycle. Collectively, they mean you can go from "I need feature X" to "here's the PR" without switching context or remembering which steps to run in what order.

| Command | Description |
| --- | --- |
| /staff-engineer:build | Full workflow: assess complexity, design, plan, implement via SDD, review, ship |
| /staff-engineer:review | Review current changes or a PR against quality checklist |
| /staff-engineer:test | Generate missing tests for current changes |
| /staff-engineer:ship | Self-review, test, branch, commit, and create PR |
| /staff-engineer:debug | Systematic debugging with root cause analysis and repair agents |
| /staff-engineer:release | Plan and execute a release with changelog and deploy checklist |
| /staff-engineer:sync | Check cross-repo dependencies for staleness |
| /staff-engineer:stats | Show activity metrics from JSONL logs |

Here's what a typical session looks like. Say you run /staff-engineer:build with "add rate limiting to the API gateway." The plugin classifies this as standard complexity: not trivial (it touches auth and middleware), but not complex either (it's a well-understood pattern). It sketches a design and asks you to confirm. Then it breaks the work into three tasks: middleware implementation, configuration, and integration tests. Each task goes through spec-driven development with a dedicated implementer subagent. After all tasks pass, a final review catches anything the per-task reviews missed. You get a PR with clean commits, tests, and a description that actually explains the changes.

The build workflow

Build is the command I use most, and the one that took the longest to get right. The key insight was that not every task deserves the same process. A typo fix shouldn't go through design review. A database migration shouldn't skip it.

The workflow has six phases, and the first one determines how many of the remaining five actually run.

Phase 1: Assess. The plugin reads the task description, scans the codebase for affected files, and classifies complexity as trivial, simple, standard, or complex. Trivial tasks skip straight to implementation. Complex tasks get the full treatment. This isn't just about speed. It's about not training yourself to click "approve" on ceremony you don't need, which is how you start ignoring it when you do need it.
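The classification can be pictured as a small heuristic over blast radius. The thresholds and risky-path patterns below are my illustration, not the plugin's actual rules:

```javascript
// Hypothetical heuristics for the assess phase (the real plugin's rules differ).
// Classify by blast radius: how many files, and whether risky areas are touched.
function assessComplexity(task) {
  const risky = /auth|crypto|payment|migration|middleware/i;
  const touchesRisk = task.files.some((f) => risky.test(f));
  if (task.files.length <= 1 && !touchesRisk) return "trivial";
  if (task.files.length <= 3 && !touchesRisk) return "simple";
  if (touchesRisk && task.files.length > 5) return "complex";
  return "standard";
}

// Rate limiting touches middleware but only a couple of files -> standard.
assessComplexity({ files: ["src/middleware/rateLimit.js", "src/config.js"] }); // "standard"
```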

Phase 2: Design. For standard and complex tasks, the plugin brainstorms approaches and presents them for approval. This is the cheapest place to catch a wrong direction. Rewriting a design doc costs minutes. Rewriting an implementation costs hours.

Phase 3: Plan. The approved design gets decomposed into bite-sized tasks, each with a clear verification step. The trick here is sequencing. Tasks are ordered so each one builds on the last, and each can be independently verified. If task 3 fails, you don't lose tasks 1 and 2.
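A plan like that amounts to a simple data structure plus an invariant. The task shape and field names here are assumptions for illustration; the invariant is the point: every dependency must appear earlier in the sequence, so each task can be verified on its own.

```javascript
// A hypothetical plan shape: ordered tasks, each with its own verification step.
// If task 3 fails, tasks 1 and 2 stay verified and committed.
const plan = [
  { id: 1, title: "Rate-limit middleware", verify: "npm test -- rateLimit" },
  { id: 2, title: "Config plumbing", verify: "npm test -- config", dependsOn: [1] },
  { id: 3, title: "Integration tests", verify: "npm test -- integration", dependsOn: [1, 2] },
];

// Valid sequencing: every dependency appears earlier in the list.
function isSequenced(tasks) {
  const seen = new Set();
  for (const t of tasks) {
    if ((t.dependsOn ?? []).some((d) => !seen.has(d))) return false;
    seen.add(t.id);
  }
  return true;
}
```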

Phase 4: Implement. This is where the subagents earn their keep. Each task gets its own implementer running spec-driven development: write a failing test, make it pass, refactor. After implementation, spec review checks alignment with the design. Quality review checks code standards. Security review runs on anything touching auth, crypto, or user input. If tests fail, the plugin rolls back that task's changes and retries with the failure context. No human intervention needed.
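The rollback-and-retry loop is the mechanism that removes the human from the failure path. A minimal sketch of the idea, simplified to synchronous calls (the function names and shapes are mine, and the real flow spawns subagents asynchronously):

```javascript
// Sketch of the auto-rollback idea (assumed behavior, simplified to sync calls):
// implement a task, run its tests; on failure, revert the task's changes and
// retry with the failure output folded into the next attempt's context.
function runWithRollback(task, { implement, test, rollback }, maxAttempts = 2) {
  let failureContext = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    implement(task, failureContext); // e.g. spawn the implementer subagent
    const result = test(task);       // run the task's verification step
    if (result.ok) return { ok: true, attempt };
    rollback(task);                  // e.g. revert just this task's files
    failureContext = result.output;  // feed the failure into the retry
  }
  return { ok: false };
}
```

Rolling back before retrying matters: the second attempt starts from a known-good tree instead of patching over a broken one.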

Phase 5: Final review. A holistic quality review across the entire implementation. Individual task reviews catch local problems. The final review catches integration issues: inconsistent naming, missing error propagation across module boundaries, that sort of thing.

Phase 6: Finish. Branch, commit, push, create PR. Or keep the branch for manual review. The plugin respects your trust configuration here and won't merge to main without explicit permission.

Six build phases: Assess complexity, Design architecture, Plan task breakdown, Implement with SDD subagents, Review with 3-stage quality gates, Finish by shipping PR

Quality gates

Four hooks fire automatically, and they're deliberately split between advisory and blocking. Advisory hooks teach. They surface information the model should consider but don't halt the workflow. Blocking hooks enforce. They prevent actions that are almost always mistakes. (The one active hook goes a step further and applies safe fixes, like auto-formatting, directly.)

| Hook | Trigger | Type |
| --- | --- | --- |
| pre-commit-gate | git commit | Advisory: test gap detection |
| pre-push-gate | git push | Blocking: release branch protection |
| post-edit-signal | Write/Edit | Advisory: high-risk file flags |
| post-edit-checks | Write/Edit | Active: auto-format + debug warnings |
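The advisory/blocking split reduces to an exit-code convention. This dispatch sketch is illustrative (the protected-branch patterns and the nonzero-exit-halts convention are assumptions, though the latter matches how git hooks generally work):

```javascript
// Illustrative dispatch for the advisory/blocking split (not the plugin's code).
// Convention assumed here: a nonzero exit code halts the triggering git action.
const HOOKS = {
  "pre-commit-gate": { blocking: false }, // warn about test gaps, never halt
  "pre-push-gate": { blocking: true },    // refuse pushes to protected branches
  "post-edit-signal": { blocking: false },
  "post-edit-checks": { blocking: false },
};

const PROTECTED = /^(main|master|release\/)/; // assumed branch patterns

function hookExitCode(name, violation) {
  if (!violation) return 0;
  return HOOKS[name].blocking ? 1 : 0; // advisory hooks report but exit clean
}

hookExitCode("pre-push-gate", PROTECTED.test("release/2.0")); // 1: push blocked
```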

The pre-push gate blocking release branches is the one I'm most grateful for. Without it, I've watched Claude confidently push directly to main when asked to "ship this." That's not a model error. The model did exactly what was asked. The error was mine for not defining the boundary.

Trust levels

Autonomy without boundaries is negligence. I learned this the hard way after a session helpfully force-pushed to a shared branch. Trust levels exist because some actions are reversible and some aren't, and the cost of getting that wrong scales with the action's blast radius.

| Level | Actions | Behavior |
| --- | --- | --- |
| Autonomous | Self-review, tests, branch, commit, push, draft PR, SDD orchestration | Does it |
| Confirm | Cross-repo sync fixes, mark PR ready, changeset creation | Asks first |
| Human Required | Publish, force push, data deletion, merge to main | Outputs plan |

The division is simple: if the action is local and reversible, the plugin handles it. If it affects other people or other systems, it asks. If it's irreversible or high-blast-radius, it stops and tells you what it would do. This means the plugin can run nearly autonomously on implementation work while still requiring a human for the moments that actually matter.
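That division can be expressed as a lookup with a deliberate fallback. The action names below abridge the table above; the key design choice, assumed here, is that anything unlisted gets the most conservative treatment:

```javascript
// Sketch of trust-level routing (action lists abridged from the table above).
const TRUST = {
  autonomous: ["self-review", "test", "branch", "commit", "push", "draft-pr"],
  confirm: ["mark-pr-ready", "sync-fix", "changeset"],
  humanRequired: ["publish", "force-push", "delete-data", "merge-main"],
};

function gate(action) {
  if (TRUST.autonomous.includes(action)) return "do";
  if (TRUST.confirm.includes(action)) return "ask";
  // Unknown actions default to the safest behavior: output a plan, do nothing.
  return "plan-only";
}
```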

Configuration

Staff-engineer uses a three-tier configuration because different scopes need different defaults. Your organization might mandate security reviews on every PR. Your specific repo might allow parallel task execution because its test suite is fast. And the plugin's built-in defaults handle everything else so you don't need to configure anything on day one.

- Plugin defaults, sensible out of the box, no config file required
- Org defaults via ~/.config/staff-engineer/config.json for company-wide policies
- Repo overrides via .staff-engineer.json at the repo root for per-project tuning
Here's an example repo-level .staff-engineer.json:

```json
{
  "$schema": "https://raw.githubusercontent.com/pyyush/staff-engineer/main/staff-engineer.schema.json",
  "complexity": {
    "autoDetect": true,
    "defaultMode": "standard"
  },
  "sdd": {
    "parallelTasks": false,
    "autoRollback": true,
    "modelRouting": {
      "implementer": "sonnet",
      "securityReviewer": "opus"
    }
  },
  "ship": {
    "requireTests": true,
    "requireHumanReview": true
  }
}
```
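Tier resolution is just precedence: plugin defaults, then org config, then repo overrides, with later tiers winning key by key. A sketch of that merge, assuming (my assumption, not the documented behavior) a recursive per-section merge:

```javascript
// Sketch of three-tier resolution: plugin defaults < org config < repo overrides.
// Later tiers win key-by-key; nested sections merge recursively (assumed behavior).
function resolveConfig(defaults, org = {}, repo = {}) {
  const merge = (a, b) =>
    Object.fromEntries(
      [...new Set([...Object.keys(a), ...Object.keys(b)])].map((k) => [
        k,
        typeof a[k] === "object" && typeof b[k] === "object" && !Array.isArray(a[k])
          ? merge(a[k], b[k] ?? {})
          : k in b ? b[k] : a[k],
      ])
    );
  return merge(merge(defaults, org), repo);
}
```

The practical consequence: a repo can flip one flag (say, parallelTasks) without restating the org's security policy, because untouched keys fall through to the tier below.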

Model routing is the config option I tune most. Mechanical reviews (formatting, naming conventions) go to Haiku, fast and cheap. Standard implementation goes to Sonnet. Complex architectural reasoning and security reviews go to Opus. This isn't just about cost. It's about matching the model's strengths to the task's demands. You wouldn't ask a senior architect to review whitespace changes, and you wouldn't ask an intern to review your auth flow.
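Mechanically, routing is a role-to-model lookup with config overrides layered on top. The role names here are illustrative (only implementer and securityReviewer appear in the example config above):

```javascript
// Illustrative routing table; role names beyond the config example are assumptions.
const ROUTES = {
  mechanicalReview: "haiku",  // formatting, naming: fast and cheap
  implementer: "sonnet",      // standard implementation work
  specReviewer: "sonnet",
  securityReviewer: "opus",   // auth, crypto, user input
  architect: "opus",          // complex architectural reasoning
};

function modelFor(role, overrides = {}) {
  return overrides[role] ?? ROUTES[role] ?? "sonnet"; // safe default for unknown roles
}
```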

What it means to have an AI staff engineer

The title is deliberate. A staff engineer isn't someone who writes more code. It's someone who raises the quality floor for everyone around them. They're the person who asks "did we think about the failure mode?" and "where are the tests for this?" and "should we really be pushing directly to main?"

That's what this plugin does. It doesn't make Claude smarter. It makes Claude more disciplined. It adds the structure that turns raw capability into reliable output. And it does it the same way every time, which is the part humans are worst at. Consistency under pressure, at 2am, on the third iteration of a feature that was supposed to be simple.

I've been running it on all my projects for the past few months. The PRs are cleaner. The test coverage is higher. And I spend my review time on architecture and design instead of catching missing error handlers. That's the trade I was after.

View on GitHub. Requires git, gh (authenticated), and Node.js 18+.