The bottleneck in browser automation is not action speed. It is the number of round trips between the agent and the browser. Every navigate, wait, observe, click, wait, observe cycle adds latency that compounds fast. I built BAP's composite actions to collapse that chain. The result: 50-85% fewer round trips on real workflows.

The round-trip tax

Consider what happens when a typical browser agent logs into a website. With Selenium, Playwright, or any tool that sends one command at a time, the sequence looks like this:

1. navigate to login page         → wait for response
2. observe page state             → wait for response
3. find email field               → wait for response
4. fill email                     → wait for response
5. find password field            → wait for response
6. fill password                  → wait for response
7. find submit button             → wait for response
8. click submit                   → wait for response
9. wait for navigation            → wait for response
10. observe new page state        → wait for response

Ten sequential round trips. Each one involves sending a command over a protocol such as WebDriver or CDP, waiting for the browser to execute it, and receiving a response. Even on localhost, each round trip carries overhead: JSON serialization, protocol framing, browser event loop scheduling, and response marshaling.

Playwright's architecture is optimized for this. Its auto-waiting and locator-based API eliminate some explicit waits. But the fundamental pattern remains sequential: send command, receive result, decide next command, send it. In benchmarks I ran during BAP development, a Playwright login flow averaging 10 actions took 1.2-1.8 seconds on a local browser, with protocol overhead accounting for roughly 40% of total time. The actions themselves (typing, clicking) are fast. The waiting between them is not.
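
To make that overhead concrete, here is a minimal Python sketch that backs out an average per-round-trip cost from the figures above. The function name is mine, and spreading overhead evenly across round trips is a simplifying assumption, not something I measured per command.

```python
def per_trip_overhead_ms(total_s: float, overhead_fraction: float, round_trips: int) -> float:
    """Average protocol overhead per round trip, in milliseconds.

    Assumes overhead is spread evenly across round trips (a simplification).
    """
    return total_s * overhead_fraction / round_trips * 1000.0


# Figures quoted in the text: 10 actions, 1.2-1.8 s total, ~40% protocol overhead.
low = per_trip_overhead_ms(1.2, 0.40, 10)
high = per_trip_overhead_ms(1.8, 0.40, 10)
print(f"overhead per round trip: {low:.0f}-{high:.0f} ms")
```

Under that assumption each round trip costs on the order of 48-72 ms before the action itself does any work, which is why removing round trips pays off more than speeding up individual actions.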

For AI agents using screenshot-based approaches like Anthropic's Computer Use API, the tax is worse. Each action requires: take screenshot (~200-400ms for capture and encoding), send to model (~500-2000ms for inference), receive action, execute action (~50-100ms), repeat. The Computer Use documentation describes an agentic loop where the model evaluates a screenshot, issues one tool call, then re-evaluates. A 10-step login flow at ~1-3 seconds per iteration means 10-30 seconds total. Most of that is inference and screenshot transfer, not browser execution.
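
The loop cost can be tallied the same way. This is a rough sketch that sums only the three itemized components from the paragraph above; the `Range` helper is a hypothetical name for illustration, and real iterations also carry transfer and orchestration costs that are not itemized here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """A closed latency range in milliseconds."""
    low_ms: int
    high_ms: int

    def __add__(self, other: "Range") -> "Range":
        return Range(self.low_ms + other.low_ms, self.high_ms + other.high_ms)

    def times(self, n: int) -> "Range":
        return Range(self.low_ms * n, self.high_ms * n)


# Per-step figures quoted in the text.
screenshot = Range(200, 400)
inference = Range(500, 2000)
execute = Range(50, 100)

per_iteration = screenshot + inference + execute
ten_steps = per_iteration.times(10)

print(f"per iteration: {per_iteration.low_ms}-{per_iteration.high_ms} ms")
print(f"10 steps: {ten_steps.low_ms / 1000}-{ten_steps.high_ms / 1000} s")
```

The itemized components alone come to 0.75-2.5 seconds per iteration, close to the rounded ~1-3 second figure, and browser execution is only 50-100 ms of that.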

Figure: Traditional browser automation takes 5 sequential round trips and ~1.8 seconds, with 40% protocol overhead. BAP composite takes 1-2 round trips and ~0.4 seconds, a 78% latency reduction.

Composite actions: the architecture

BAP's bap act command accepts multiple actions in a single call. Instead of one round trip per action, you send a single request:

# Traditional: 5 round trips
bap goto https://app.example.com/login
bap observe
bap act fill:label:"Email"="user@example.com"
bap act fill:label:"Password"="hunter2"
bap act click:role:button:"Sign In"

# Composite: all three actions plus the observation in 1 round trip (after the initial goto)
bap act fill:label:"Email"="user@example.com" \
        fill:label:"Password"="hunter2" \
        click:role:button:"Sign In" \
        --observe

The --observe flag fuses a page state observation into the response, eliminating the separate bap observe call. The composite command sends all three actions to the browser daemon in a single message. The daemon executes them sequentially within one evaluation cycle and returns a single response containing the results of all actions plus the observed page state.

This works because BAP runs as a warm daemon that maintains a persistent browser session. The browser stays alive between commands. When a composite action arrives, the daemon does not need to re-establish context. It executes each sub-action against the already-loaded page, waits for stability between them (using Playwright's auto-waiting under the hood), and batches the results.
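
The batching idea can be sketched in a few lines of Python. This is a toy model, not BAP's implementation: `FakePage` and `FakeDaemon` are hypothetical stand-ins, there is no real browser, and stopping at the first failed action is my own design assumption.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakePage:
    """In-memory stand-in for an already-loaded page (hypothetical)."""
    fields: dict[str, str] = field(default_factory=dict)
    clicked: list[str] = field(default_factory=list)

    def observe(self) -> dict[str, Any]:
        return {"fields": dict(self.fields), "clicked": list(self.clicked)}


class FakeDaemon:
    """Keeps one page alive and counts one round trip per request."""

    def __init__(self) -> None:
        self.page = FakePage()
        self.round_trips = 0

    def request(self, actions: list[tuple[str, ...]], observe: bool = False) -> dict[str, Any]:
        self.round_trips += 1  # one message in, one message out
        results = []
        for verb, *args in actions:
            if verb == "fill":
                label, value = args
                self.page.fields[label] = value
            elif verb == "click":
                self.page.clicked.append(args[0])
            else:
                results.append({"action": verb, "ok": False})
                break  # assumed policy: do not run later actions on bad state
            results.append({"action": verb, "ok": True})
        response: dict[str, Any] = {"results": results}
        if observe:
            response["state"] = self.page.observe()  # fused observation
        return response


login = [
    ("fill", "Email", "user@example.com"),
    ("fill", "Password", "hunter2"),
    ("click", "Sign In"),
]

sequential = FakeDaemon()
for action in login:
    sequential.request([action])
sequential.request([], observe=True)

composite = FakeDaemon()
reply = composite.request(login, observe=True)

print(sequential.round_trips, composite.round_trips)  # prints "4 1"
```

Both daemons end in the same page state. The only difference is how many messages crossed the boundary to get there.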


Fused operations

Beyond composing multiple actions, BAP fuses common operation pairs that are almost always used together:

| Fused Operation | What It Replaces | Round Trips Saved |
| --- | --- | --- |
| goto --observe | navigate + wait + observe | 2 |
| act --observe | act + wait + observe | 2 |
| act (multi) | N individual act calls | N-1 |
| extract --fields | observe + parse + filter | 2 |

The goto --observe fusion is the most impactful. Every agent workflow starts with "go to URL and tell me what's on the page." In a traditional setup, that is three operations: navigate, wait for load, observe DOM. With BAP, it is one command that returns structured page state in the same response as the navigation confirmation.
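
As a worked example of that accounting, the sketch below counts round trips for a go-to-page-then-act workflow using the same rules: navigate, wait, and observe each count as one trip when sent separately. The function names are illustrative, and counting each wait as its own trip is an assumption about how a sequential client behaves.

```python
def sequential_round_trips(n_actions: int) -> int:
    """Round trips when every operation is its own call."""
    goto = 3           # navigate + wait + observe
    acts = n_actions   # one call per action
    after = 2          # wait + observe once the actions are done
    return goto + acts + after


def fused_round_trips(n_actions: int) -> int:
    """Round trips with goto --observe and one composite act --observe."""
    goto_observe = 1
    composite_act_observe = 1 if n_actions > 0 else 0
    return goto_observe + composite_act_observe


n = 3  # fill email, fill password, click submit
seq = sequential_round_trips(n)
fused = fused_round_trips(n)
print(f"{seq} -> {fused} round trips, {seq - fused} saved")
```

For the three-action login this prints `8 -> 2 round trips, 6 saved`: two from goto --observe, two from act --observe, and two (N-1) from batching the actions.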

Real numbers: login flow comparison

I benchmarked a standard login-and-navigate flow across four approaches. The task: navigate to a login page, enter credentials, submit, wait for dashboard, and extract a specific data point. Five logical steps.

| Approach | Round Trips | Wall Time | Protocol Overhead |
| --- | --- | --- | --- |
| Selenium (sequential) | 10 | 2.1s | ~850ms |
| Playwright (sequential) | 8 | 1.4s | ~520ms |
| Computer Use (screenshot loop) | 5 inference cycles | 8-15s | ~6-12s (inference) |
| BAP (composite + fused) | 2 | 0.4s | ~80ms |

BAP's two round trips: (1) goto --observe the login page, (2) composite act --observe with fill + fill + click. The agent receives the dashboard state in the response to the second command and can extract what it needs without a third trip.

Going from 8 round trips to 2 is a 75% reduction versus Playwright, and it translates to a roughly 71% reduction in wall time for this flow (1.4s to 0.4s). Against Selenium, both figures are about 80%. The ratio holds across workflows I have tested: form submissions, multi-page navigation, data extraction from paginated tables.
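
The percentages can be recomputed directly from the benchmark table. This short sketch uses only the table's round-trip counts and wall times; the Computer Use row is left out because it is reported as a range.

```python
# (round trips, wall time in seconds) taken from the benchmark table.
benchmarks = {
    "Selenium (sequential)": (10, 2.1),
    "Playwright (sequential)": (8, 1.4),
}
bap_trips, bap_seconds = 2, 0.4


def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100.0


for name, (trips, seconds) in benchmarks.items():
    trip_cut = reduction_pct(trips, bap_trips)
    time_cut = reduction_pct(seconds, bap_seconds)
    print(f"{name}: {trip_cut:.0f}% fewer round trips, {time_cut:.0f}% less wall time")
```

This gives 80% fewer round trips and 81% less wall time versus Selenium, and 75% and 71% versus Playwright.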


Where this matters most

Round-trip reduction matters in three scenarios:

Race conditions. When booking a scarce resource (like the pickleball slots I wrote about), total execution time determines whether you get the slot or land on a waitlist. Cutting 1.4 seconds to 0.4 seconds is the difference between success and failure when 50 people hit the same page at noon.

High-volume automation. If you are running an agent across 1,000 pages (scraping a catalog, filling forms in bulk, monitoring dashboards), the per-page overhead multiplies. Saving 1 second per page across 1,000 pages is nearly 17 minutes. At scale, round-trip optimization is compute cost optimization.

Agent responsiveness. When a human is chatting with an agent and waiting for it to complete a browser task, perceived latency matters. An agent that finishes in under a second feels like a tool. One that takes 8 seconds feels like a broken experience. The Nielsen Norman Group's guidance on response times is clear: about 1 second is the limit for keeping the user's flow of thought uninterrupted, and about 10 seconds is the limit for keeping their attention on the task at all.

The principle

When optimizing browser automation, most people focus on making individual actions faster. Faster clicks, faster fills, faster screenshots. That is the wrong lever.

The right lever is reducing the number of times the agent and browser need to exchange messages. Every round trip is a scheduling boundary, a serialization cost, and a latency floor. Composite actions eliminate scheduling boundaries. Fused operations eliminate redundant exchanges. In the workflows I've tested, the result is not incremental improvement. It is a step-function reduction in total workflow time.

The fastest round trip is the one you do not make.