CSS selectors break when the UI changes. A redesign swaps two buttons in an auth form, and .auth-actions > button:first-child now points at "Sign Up" instead of "Sign In." The agent has no idea. It just clicks what it was told to click. This is the most common failure mode in browser automation, and every team that builds agents hits it.
Every project that touches a browser hits the same wall. Not the hard AI problems: the plumbing. Every team, every framework, every agent startup writes its own Playwright wrapper, its own browser lifecycle manager, its own tool definitions. All slightly different. None of them interoperable. I decided to stop patching and start standardizing.
Browser Agent Protocol is what came out of that decision.
The N+1 problem
Count the browser tool implementations that exist right now. LangChain has one. AutoGen has one. CrewAI has one. Every YC batch produces three more startups that each write their own. Playwright gets wrapped, unwrapped, re-wrapped. Puppeteer too, for the holdouts. Each implementation makes slightly different choices about how to represent pages, how to handle navigation, how to deal with iframes, how to manage sessions.
This is an N+1 problem. For every new agent framework, another browser integration gets written from scratch. None of them benefit from each other's work. Bugs get rediscovered independently. Edge cases around SPAs, auth flows, and shadow DOMs get solved and re-solved in isolation.
MCP solved this for tool integration generally: one protocol, any client. Browser automation needed the same treatment. Not another wrapper. A standard.
The real problem is selectors
But even if you standardized the protocol layer, you'd still have the deeper issue: selectors.
Here's the same action, clicking a sign-in button, done three ways:
```javascript
// 1. CSS selector: fragile, assumes stable DOM structure
await page.click('.auth-form button[type="submit"]')

// 2. Computer Use coordinates: fragile, assumes stable layout
await computerUse.click({ x: 742, y: 384 })

// 3. BAP semantic selector: targets what the element IS
await client.click(role("button", "Sign In"))
```

The CSS selector breaks when a class name changes, when the DOM restructures, when a component library updates. The coordinate approach breaks when the viewport resizes, when a banner appears, when the layout shifts by a single pixel. Both encode where an element is. Neither encodes what it is.
The insight that unlocked BAP: accessibility roles already solve this problem. Screen readers need to identify interactive elements by their purpose, not their position. A button labeled "Sign In" is a button labeled "Sign In" regardless of where it sits in the DOM, what CSS classes it has, or what pixel coordinates it occupies. The accessibility tree is a semantic map of the page that already exists in every browser.
AI agents need the exact same abstraction screen readers use. Not coincidence, convergence. Both need to understand a page by meaning, not by structure.
How it works
BAP exposes browser capabilities as MCP tools over JSON-RPC 2.0 on WebSocket. When you add the MCP server, it spawns a browser on demand and manages the full lifecycle. Your agent gets tools: bap_navigate, bap_click, bap_fill, bap_observe.
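On the wire, a tool call is an ordinary JSON-RPC 2.0 request in a WebSocket text frame. The frame below is illustrative: it assumes MCP's standard `tools/call` method, and the argument shape for `bap_click` is a guess, not the documented schema:

```javascript
// Illustrative JSON-RPC 2.0 frame for a BAP tool call.
// Method follows the MCP tools/call convention; the bap_click
// argument shape is an assumption for illustration.
const request = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "bap_click",
    arguments: { selector: { role: "button", name: "Sign In" } },
  },
};

// Over the WebSocket it travels as serialized JSON
const frame = JSON.stringify(request);
```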
The observe tool is the one that matters most. It returns the page as a structured list optimized for LLM consumption:
```javascript
// bap_observe returns semantic elements with stable refs
{ ref: "@e1", role: "heading", name: "Welcome back", level: 1 }
{ ref: "@e2", role: "textbox", name: "Email" }
{ ref: "@e3", role: "textbox", name: "Password" }
{ ref: "@e4", role: "button", name: "Sign In" }
```

Those refs are stable within a session. An agent can observe the page once, reason about it, then act without re-querying to find the element again. No "find the button, then click the button" two-step. Observe, decide, act.
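Here is a sketch of that observe-decide-act loop in agent code. Only the `ref`/`role`/`name` element shape comes from the observation output; the `decide()` function and the credential-filling plan are hypothetical:

```javascript
// Sketch of observe -> decide -> act using stable refs.
// One observation, then pure reasoning over the element list:
// no re-querying the page to locate each target again.
function decide(elements, credentials) {
  const byName = (role, name) =>
    elements.find((e) => e.role === role && e.name === name);

  return [
    { tool: "bap_fill", ref: byName("textbox", "Email").ref, value: credentials.email },
    { tool: "bap_fill", ref: byName("textbox", "Password").ref, value: credentials.password },
    { tool: "bap_click", ref: byName("button", "Sign In").ref },
  ];
}

const observed = [
  { ref: "@e1", role: "heading", name: "Welcome back", level: 1 },
  { ref: "@e2", role: "textbox", name: "Email" },
  { ref: "@e3", role: "textbox", name: "Password" },
  { ref: "@e4", role: "button", name: "Sign In" },
];

const actions = decide(observed, { email: "a@b.com", password: "hunter2" });
```

Each action addresses an element by its ref from the single observation, which is what collapses the usual find-then-act round trip.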
Every element is addressed by what it is. The form can be redesigned completely (new CSS framework, new component library, entirely different DOM structure) and the agent still works, as long as there's an email field, a password field, and a sign-in button.
Install instructions for Claude Code, Codex, Cursor, and Python are in the GitHub repo.
What this is really about
The web was built for humans navigating with eyes and hands. Then screen readers proved you could navigate it by meaning instead of appearance. BAP takes that idea to its logical conclusion: if the web has a semantic layer, AI agents should use it.
Right now, every agent that touches a browser is a bespoke integration. Fragile selectors. Custom wrappers. Framework lock-in. Imagine instead: any agent, built on any framework, can interact with any browser through a single open protocol. The same way HTTP standardized how documents move across the network, BAP standardizes how agents interact with what's inside the browser.
That's what BAP is working toward. The protocol is on GitHub and I'm actively working on multi-tab coordination, authentication flows, and better observation strategies for complex SPAs. If you're building agents that need browsers, I want to hear what's missing.