Three months ago I wrote about staff-engineer, a Claude Code plugin that encodes staff-level engineering discipline into hooks and subagent workflows. That post covered the core idea: complexity-aware builds, spec-driven development, quality gates. Since then, the plugin has more than doubled in scope. The headline change is that Claude now gets a second opinion from a different model family before anything ships.

There are more hooks, more review agents, more commands. The interesting part is what happens when two AI models review each other's work.

staff-engineer build pipeline architecture

DADS: Dual-Agent Development System

Claude reviewing Claude's code is like proofreading your own essay: you see what you meant to write, not what you actually wrote. Different model families have different training data, different failure modes, and different strengths. That's a feature, not a bug.

DADS pairs Claude with OpenAI's Codex CLI as an adversarial reviewer. Claude builds, Codex reviews. Or flip it: Codex builds, Claude reviews. The point is that the model approving the code is never the model that wrote it.

DADS dual-agent flow, Claude builds, Codex reviews

The system scales with complexity tier, because not everything needs a second opinion:

  Tier       Codex involvement
  Trivial    None
  Simple     Final review only
  Standard   Per-task cross-review + final review
  Complex    Spec critique + per-task review + final review + security audit

When Codex disagrees with Claude's implementation, the disagreement protocol kicks in. A CRITICAL flag means Claude must fix the issue and resubmit for review. A SUGGESTION gets logged and the build proceeds. If they go two rounds without converging, the system escalates to the human, because at that point you're watching two AIs argue and someone needs to break the tie.
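
The protocol is simple enough to sketch in a few lines of shell. This is illustrative, not the plugin's actual code: the verdict strings, function name, and round limit are assumptions based on the behavior described above.

```shell
# Sketch of the disagreement protocol: CRITICAL blocks until fixed,
# SUGGESTION is logged and the build proceeds, and more than two rounds
# without convergence escalates to the human. Names are illustrative.
MAX_ROUNDS=2

resolve_verdict() {
  verdict="$1" round="$2"
  if [ "$round" -gt "$MAX_ROUNDS" ]; then
    echo "escalate"                           # two AIs arguing: human breaks the tie
    return 0
  fi
  case "$verdict" in
    critical)   echo "fix_and_resubmit" ;;    # blocking: fix and resubmit for review
    suggestion) echo "log_and_proceed" ;;     # advisory: logged, build continues
    *)          echo "proceed" ;;
  esac
}
```

The key property is that only CRITICAL creates a loop; everything else is recorded and moves on.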

Under the hood, codex-bridge.sh wraps every Codex invocation. It supports four modes: critique, review, security, implement. Critique and security modes use structured JSON schemas for machine-parseable output; review mode uses codex review with heuristic parsing. It's fail-open by default: if Codex is unavailable, you get a warning, not a blocked build. Every invocation is announced with status and verdict for full observability:

DADS: Invoking Codex review, cross-reviewing uncommitted changes
DADS: Codex review passed (no critical issues)
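
The fail-open behavior is the load-bearing design choice, and it reduces to a small wrapper. A minimal sketch, assuming `codex` is discoverable on PATH; the mode names match the post, but the real bridge's flags, prompts, and output parsing are simplified away here.

```shell
# Fail-open invocation sketch for codex-bridge.sh. If Codex is missing,
# warn and return success so the build is never blocked by an absent tool.
codex_bridge() {
  mode="$1"; shift                  # critique | review | security | implement
  if ! command -v codex >/dev/null 2>&1; then
    echo "DADS: Codex unavailable, skipping $mode (fail-open)" >&2
    return 0                        # warn, never block
  fi
  echo "DADS: Invoking Codex $mode" >&2
  codex "$mode" "$@"                # stand-in; the real bridge builds mode-specific invocations
}
```

Fail-closed would be safer in theory, but in practice it would mean a network blip or a missing binary halts every build, which is exactly the kind of friction that gets a tool disabled.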

Three new hooks enforce DADS. These are deterministic shell scripts the AI cannot skip or reason around:

  • pre-commit-dads-gate (BLOCKING): won't let you commit unless cross-review artifacts exist in .dev-session/ and no fix_first verdicts are unresolved
  • pre-push-dads-gate (BLOCKING): won't let you push unless the full pipeline is complete, final cross-review exists, and security audit passed for complex-tier builds
  • post-edit-scope-guard (advisory): flags edits to files not in the implementation plan, catching scope creep during implementation

All three are no-ops when DADS is disabled. Zero overhead if you don't use it.
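
The commit gate's logic fits in a handful of lines. This is a sketch under assumptions: cross-review artifacts live in `.dev-session/`, unresolved verdicts appear as the literal string `fix_first`, and an environment flag gates the whole check; the real hook's file layout and toggle may differ.

```shell
# Sketch of pre-commit-dads-gate: block the commit unless cross-review
# artifacts exist and no fix_first verdicts remain unresolved.
pre_commit_dads_gate() {
  session_dir="${1:-.dev-session}"
  [ "${DADS_ENABLED:-0}" = "1" ] || return 0   # no-op when DADS is disabled
  if [ ! -d "$session_dir" ]; then
    echo "DADS: no cross-review artifacts in $session_dir; commit blocked" >&2
    return 1
  fi
  if grep -rq "fix_first" "$session_dir" 2>/dev/null; then
    echo "DADS: unresolved fix_first verdicts; commit blocked" >&2
    return 1
  fi
  return 0
}
```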

Google-grade engineering practices

The original version had reasonable quality gates. Version 0.2.0 raises the bar, drawing heavily from Software Engineering at Google and the Google Code Review Developer Guide.

The biggest single change is the presubmit hook. The plugin now runs lint, format, typecheck, and tests before every commit, not just before every PR. This is a BLOCKING hook: if any of those fail, the commit doesn't happen. The tradeoff is worth it: fixing a lint error at commit time takes seconds, while fixing it after review costs a context switch.
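
The shape of the hook is a straight sequence of gates. A minimal sketch; the step commands below are placeholders for whatever lint/format/typecheck/test entry points a given repo actually has.

```shell
# Presubmit sketch: run every gate in order, block on the first failure.
presubmit() {
  for step in "$@"; do
    echo "presubmit: running '$step'"
    if ! sh -c "$step"; then
      echo "presubmit: '$step' failed; commit blocked" >&2
      return 1                  # BLOCKING: the commit does not happen
    fi
  done
  echo "presubmit: all gates passed"
}
```

Usage in a Node repo might look like `presubmit "npm run lint" "npm run typecheck" "npm test"`; the point is that the sequence is deterministic and the failure mode is a refused commit, not a warning.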

Small CLs are now enforced. Commits over 400 LOC trigger a warning (configurable via google.maxCLSize). Tasks are auto-split to stay under 300 LOC each during planning. This matches Google's internal guidance: small changes are easier to review, easier to roll back, and less likely to introduce subtle bugs.
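
The size check itself is trivial; the only real question is how lines get counted. A sketch, assuming added plus removed lines of the staged diff count toward the limit (the post doesn't specify the exact counting rule):

```shell
# CL-size warning sketch. 400 LOC matches the google.maxCLSize default.
cl_size_warning() {
  loc="$1" max="${2:-400}"
  if [ "$loc" -gt "$max" ]; then
    echo "warning: change is $loc LOC (limit $max); consider splitting" >&2
    return 1
  fi
  return 0
}

# Measuring the staged diff via git's machine-readable numstat output:
staged_loc() {
  git diff --cached --numstat |
    awk '{ added += $1; removed += $2 } END { print added + removed + 0 }'
}
```

Wired together as `cl_size_warning "$(staged_loc)"`, this gives a deterministic check that fires before the commit exists, which is what makes auto-splitting tasks at planning time worthwhile.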

Design docs got a structured template: goals, non-goals, alternatives considered, security implications, testing strategy, rollback plan. For standard and complex builds, the plugin generates one before writing any code. Non-goals are the part I find most valuable. Explicitly stating what you're not building prevents scope creep more effectively than any other technique I've tried.

Code review now uses a 9-point checklist: Design, Functionality, Complexity, Tests, Naming, Comments, Style, Documentation, Every Line. The quality reviewer agent walks through all nine for every review pass. It's systematic in a way that ad-hoc "looks good to me" reviews never are.

Readability review is a new addition: a separate agent with language-specific style guides for TypeScript/JavaScript, Python, Go, Rust, Swift, and Shell. Documentation checks verify that public APIs are actually documented. Every PR includes a rollback plan. And when builds fail repeatedly, the plugin auto-generates a failure postmortem so you don't just retry and hope.

Visual QA

For web projects, code review isn't enough. You need to see what the code actually renders. Version 0.2.0 integrates the BAP CLI (Browser Agent Protocol) for visual QA.

When you change .tsx, .jsx, .vue, .astro, .html, or .css files, the visual QA reviewer auto-triggers. It takes screenshots, runs interactive tests, and checks accessibility. This happens both per-task (during implementation) and on the full diff (during final review).

It's configured via google.visualQA, set to "auto" by default, which means it triggers on web file patterns and stays silent for backend changes. You can disable it entirely or force it on for every change.
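
For reference, the two knobs mentioned so far live in .staff-engineer.json. I'm assuming the dotted names google.maxCLSize and google.visualQA map onto a nested object like this; the defaults shown are the ones stated above:

```json
{
  "google": {
    "maxCLSize": 400,
    "visualQA": "auto"
  }
}
```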

Terminal showing init output

One command to rule them all

The plugin works great if you use Claude Code. But what if you use Cursor? Copilot? Codex directly? What if you just want the git hooks without the full plugin?

npx @pyyush/staff-engineer init

That's it. One command, zero runtime dependencies. It detects which AI coding agents you have installed (Claude Code, Codex CLI, Cursor, GitHub Copilot) and generates the right instruction files for each. It installs git hooks (pre-commit runs tests, pre-push blocks direct pushes to main). It creates a .staff-engineer.json config file with sensible defaults.

staff-engineer init

  Detected agents:
    Claude Code
    Codex CLI
    GitHub Copilot

  Installed git hooks: pre-commit, pre-push
  Created AGENTS.md
  Created .claude/commands/build.md
  Created .codex/instructions.md
  Created .github/copilot-instructions.md
  Created .staff-engineer.json

  Done. Quality gates are active.

Works with any git repo, any language. The generated instruction files encode the same engineering principles (TDD, small commits, quality review) in each agent's native format. It's not the full plugin experience, but it raises the quality floor across your entire toolchain.

The full picture: 9 hooks, 7 agents, 8 commands

For context on how much the surface area grew:

             v0.1   v0.2
  Hooks      4      9 (4 blocking, 4 advisory, 1 active)
  Agents     5      7 (added readability, visual QA)
  Commands   5      8 (added debug, release, stats)

The new hooks (presubmit, CL size, three DADS gates) replace "best practices the model usually follows" with enforcement. Hooks fire every time.

Real-world test: building uSEID with DADS

Theory is nice. Here's what actually happened when I used DADS for a real build.

uSEID (Universal Semantic Element ID) is a library for stable, cross-run element identity in browser agents. I built it first inside the BAP monorepo using DADS at the complex tier, then extracted it into its own package (@pyyush/useid). The API surface matters enormously. Once published, you're stuck with your design decisions. Perfect stress test for adversarial review.

During the initial build in BAP, Codex ran two rounds of spec critique before any code was written. The first round found seven issues: the snapshot-to-element mapping was undefined (snapshots are unknown-typed with no per-element projection), and the safety model had no page/origin binding, meaning a signature could false-match on the wrong page. The second round caught that pagePathPattern with wildcards contradicted the cross-page blocking non-goal, and that the redaction model was inconsistent with the resolver. All legitimate design gaps that Claude's self-review had approved.

Claude's own quality review agents caught hardcoded viewport dimensions, per-call Set allocations in a hot path, dead exports, and an unused function parameter. The Codex final cross-review confirmed no critical issues remained. Both models agreed the code was shippable.

Final score: 114 tests, all passing, plus a full-diff quality review and a Codex final review.

I'm not claiming DADS catches everything. I am claiming that two model families catch more than one, and the overhead (a few extra minutes of Codex API calls per build) is trivially small compared to the cost of shipping a bad API.

Enabling it

If you're already using staff-engineer, add this to your .staff-engineer.json:

{
  "dualAgent": {
    "enabled": true
  }
}

That's all you need. Requires Codex CLI installed and codex in your PATH. If Codex isn't available, the plugin warns and continues. Fail-open, not fail-closed.

If you're starting fresh: npx @pyyush/staff-engineer init for the quick setup, or clone the repo and point Claude Code at it with --plugin-dir for the full experience.

View on GitHub. Requires git, gh (authenticated), and Node.js 20+. Optional: Codex CLI (for DADS), BAP CLI (for visual QA).