Typed Action Grammars for AI Browser Agents: Constrained Decoding with JSON Schema, Function Calling, and CDP-Validated Plans
AI browser agents are finally usable—if you stop letting them hallucinate clicks. The path forward is to make them speak a typed action grammar, not English. Constrain decoding with JSON Schema, BNF, or function-calling; validate each step against live Chrome DevTools Protocol (CDP) state; auto-repair with pre/postconditions; and log every action as a reusable schema for training, replay, and CI.
This article lays out a practical, opinionated blueprint for building reliable, high-coverage browser agents. We’ll cover grammar design, decoding strategies, CDP validation, repair loops, and how to turn your execution traces into high-value datasets.
TL;DR
- Large Language Models (LLMs) should output a typed action plan, not free-form English.
- Constrain decoding with JSON Schema, BNF grammars, or tool/function-calling to guarantee structural correctness.
- Validate each action against live CDP state (DOM, accessibility, frames, network) before and after execution.
- Attach preconditions and postconditions to every action; auto-repair with a small, deterministic controller loop.
- Log each step’s schema, observations, and diffs for training, replay, and CI.
Why Browser Agents Fail (And How Typed Grammars Fix It)
Common failure modes:
- Hallucinated targets: “Click the blue button” when there are three blue buttons.
- DOM drift: the page re-renders between decisions; a CSS selector becomes stale.
- Timing races: the element exists but is not yet visible, not yet interactable, or still off-screen.
- Ambiguous selectors: querying by text with whitespace or dynamic content.
- Tool misuse: the LLM returns out-of-contract payloads or malformed objects.
Root cause: we’re asking a probabilistic model to produce an exact program in a noisy environment, with minimal types. The fix is to constrain the space of outputs and ground them in state.
Typed action grammars:
- Force the model to pick from a small, well-typed set of actions and parameters.
- Require preconditions (what must be true before acting) and postconditions (what must change).
- Serialize both plan and evidence as structured data; reject or repair invalid steps.
This is not just prompt engineering. It’s language design, runtime verification, and test instrumentation packaged into a thin orchestration layer.
What Is a Typed Action Grammar?
A typed action grammar is a constrained language of browser actions, with formally specified fields, types, and allowed values. Examples of actions:
- navigate(url)
- click(target)
- type(target, text)
- select(target, option)
- waitFor(condition)
- scroll(target or direction)
- extract(query)
- assert(condition)
Each action carries typed parameters (e.g., URL, Locator, TimeoutMs), preconditions (e.g., element.visible == true), and postconditions (e.g., URL changes, network idle, innerText changes, ARIA state toggles). The grammar can be expressed as JSON Schema, BNF/EBNF, or tool definitions for function-calling APIs.
The key: the LLM doesn’t invent raw CSS or guess XPaths from scratch. It either:
- Picks from a curated set of “candidate elements” produced by the orchestrator (with stable IDs), or
- Emits selectors in a DSL constrained by allowed strategies (byRole, byLabel, byTestId, etc.).
Constrained Decoding Options
You have at least three practical routes:
- JSON Schema structured output
  - Pros: familiar, robust tooling; many SDKs now support schema-constrained decoding.
  - Cons: purely syntactic; still need semantic checks against page state.
- Function-calling / tool-use
  - Pros: forces a choice of named actions; arguments get typed; you can do tool-by-tool validation.
  - Cons: vendor-specific; still need runtime guards.
- Grammar-level decoding (BNF/EBNF / regex / trie)
  - Pros: token-level guarantees; can prevent invalid tokens entirely; fast.
  - Cons: more work to design; nesting complex types can be tricky.
In practice, I recommend supporting at least two: Schema-based decoding for portability and function-calling for tool UX. Grammar-level decoding helps if you self-host or run open models via libraries like llama.cpp, Outlines, Guidance, or Jsonformer/Instructor-like approaches.
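To make the token-level guarantee concrete, here is a minimal sketch of trie-based masking over a closed set of strings such as candidate IDs. It works on characters rather than model tokens, and `buildTrie` and `allowedNext` are hypothetical names chosen for illustration; production decoders work against the tokenizer's vocabulary instead.

```typescript
// Character-level trie over the closed set of strings the model may emit.
interface TrieNode {
  children: Record<string, TrieNode>;
  terminal: boolean; // true if a complete allowed string ends at this node
}

function buildTrie(allowed: string[]): TrieNode {
  const root: TrieNode = { children: {}, terminal: false };
  for (const word of allowed) {
    let node = root;
    for (let i = 0; i < word.length; i++) {
      const ch = word[i];
      if (!node.children[ch]) node.children[ch] = { children: {}, terminal: false };
      node = node.children[ch];
    }
    node.terminal = true;
  }
  return root;
}

// Given the text emitted so far, return the characters that keep the output
// inside the allowed set, plus whether the output may legally stop here.
function allowedNext(root: TrieNode, prefix: string): { next: string[]; canStop: boolean } {
  let node = root;
  for (let i = 0; i < prefix.length; i++) {
    const child = node.children[prefix[i]];
    if (!child) return { next: [], canStop: false }; // prefix is already invalid
    node = child;
  }
  return { next: Object.keys(node.children).sort(), canStop: node.terminal };
}

const idTrie = buildTrie(["submit", "search", "email"]);
console.log(allowedNext(idTrie, "s"));      // { next: ["e", "u"], canStop: false }
console.log(allowedNext(idTrie, "submit")); // { next: [], canStop: true }
```

A decoder would zero out the probability of every continuation not in `next`, which is why an invalid candidate ID can never be produced in the first place.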
Designing the Action Grammar
Start with a small, composable core and extend carefully. Sample TypeScript types and accompanying JSON Schema:
```ts
// types.ts
export type Locator =
  | { strategy: "byTestId"; testId: string }
  | { strategy: "byRole"; role: "button" | "link" | "textbox" | "menuitem" | "combobox"; name?: string }
  | { strategy: "css"; selector: string }
  | { strategy: "text"; text: string; exact?: boolean };

export type Precondition =
  | { kind: "exists"; target: string }
  | { kind: "visible"; target: string }
  | { kind: "enabled"; target: string }
  | { kind: "urlMatches"; pattern: string }
  | { kind: "attrEquals"; target: string; name: string; value: string };

export type Postcondition =
  | { kind: "urlChanges"; to?: string }
  | { kind: "elementTextContains"; target: string; text: string }
  | { kind: "ariaState"; target: string; name: string; value: string }
  | { kind: "networkIdle"; timeoutMs?: number };

export type Action =
  | { type: "navigate"; url: string; pre?: Precondition[]; post?: Postcondition[] }
  | { type: "click"; targetRef: string; pre?: Precondition[]; post?: Postcondition[] }
  | { type: "type"; targetRef: string; text: string; pre?: Precondition[]; post?: Postcondition[] }
  | { type: "select"; targetRef: string; option: string | number; pre?: Precondition[]; post?: Postcondition[] }
  | { type: "waitFor"; condition: Precondition; timeoutMs?: number }
  | { type: "extract"; query: { targetRef: string; kind: "text" | "html" | "value" } };

export interface Plan {
  version: "1.0";
  context?: {
    task: string;
    allowedDomains?: string[];
    viewport?: { width: number; height: number };
  };
  candidates: Record<string, Locator>; // stable ids -> locators
  steps: Action[];
}
```
And a corresponding JSON Schema skeleton:
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/plan.schema.json",
  "type": "object",
  "required": ["version", "candidates", "steps"],
  "properties": {
    "version": { "type": "string", "const": "1.0" },
    "context": {
      "type": "object",
      "properties": {
        "task": { "type": "string" },
        "allowedDomains": { "type": "array", "items": { "type": "string" } },
        "viewport": {
          "type": "object",
          "properties": {
            "width": { "type": "integer", "minimum": 200 },
            "height": { "type": "integer", "minimum": 200 }
          },
          "required": ["width", "height"]
        }
      }
    },
    "candidates": {
      "type": "object",
      "additionalProperties": {
        "oneOf": [
          { "type": "object", "properties": { "strategy": { "const": "byTestId" }, "testId": { "type": "string" } }, "required": ["strategy", "testId"] },
          { "type": "object", "properties": { "strategy": { "const": "byRole" }, "role": { "enum": ["button", "link", "textbox", "menuitem", "combobox"] }, "name": { "type": "string" } }, "required": ["strategy", "role"] },
          { "type": "object", "properties": { "strategy": { "const": "css" }, "selector": { "type": "string" } }, "required": ["strategy", "selector"] },
          { "type": "object", "properties": { "strategy": { "const": "text" }, "text": { "type": "string" }, "exact": { "type": "boolean" } }, "required": ["strategy", "text"] }
        ]
      }
    },
    "steps": {
      "type": "array",
      "items": {
        "oneOf": [
          { "type": "object", "properties": { "type": { "const": "navigate" }, "url": { "type": "string", "format": "uri" }, "pre": { "$ref": "#/$defs/conditions" }, "post": { "$ref": "#/$defs/postconditions" } }, "required": ["type", "url"] },
          { "type": "object", "properties": { "type": { "const": "click" }, "targetRef": { "type": "string" }, "pre": { "$ref": "#/$defs/conditions" }, "post": { "$ref": "#/$defs/postconditions" } }, "required": ["type", "targetRef"] },
          { "type": "object", "properties": { "type": { "const": "type" }, "targetRef": { "type": "string" }, "text": { "type": "string" }, "pre": { "$ref": "#/$defs/conditions" }, "post": { "$ref": "#/$defs/postconditions" } }, "required": ["type", "targetRef", "text"] },
          { "type": "object", "properties": { "type": { "const": "select" }, "targetRef": { "type": "string" }, "option": { "oneOf": [ { "type": "string" }, { "type": "number" } ] }, "pre": { "$ref": "#/$defs/conditions" }, "post": { "$ref": "#/$defs/postconditions" } }, "required": ["type", "targetRef", "option"] },
          { "type": "object", "properties": { "type": { "const": "waitFor" }, "condition": { "$ref": "#/$defs/precondition" }, "timeoutMs": { "type": "integer", "minimum": 0 } }, "required": ["type", "condition"] },
          { "type": "object", "properties": { "type": { "const": "extract" }, "query": { "type": "object", "properties": { "targetRef": { "type": "string" }, "kind": { "enum": ["text", "html", "value"] } }, "required": ["targetRef", "kind"] } }, "required": ["type", "query"] }
        ]
      }
    }
  },
  "$defs": {
    "conditions": { "type": "array", "items": { "$ref": "#/$defs/precondition" } },
    "precondition": {
      "oneOf": [
        { "type": "object", "properties": { "kind": { "const": "exists" }, "target": { "type": "string" } }, "required": ["kind", "target"] },
        { "type": "object", "properties": { "kind": { "const": "visible" }, "target": { "type": "string" } }, "required": ["kind", "target"] },
        { "type": "object", "properties": { "kind": { "const": "enabled" }, "target": { "type": "string" } }, "required": ["kind", "target"] },
        { "type": "object", "properties": { "kind": { "const": "urlMatches" }, "pattern": { "type": "string" } }, "required": ["kind", "pattern"] },
        { "type": "object", "properties": { "kind": { "const": "attrEquals" }, "target": { "type": "string" }, "name": { "type": "string" }, "value": { "type": "string" } }, "required": ["kind", "target", "name", "value"] }
      ]
    },
    "postconditions": {
      "type": "array",
      "items": {
        "oneOf": [
          { "type": "object", "properties": { "kind": { "const": "urlChanges" }, "to": { "type": "string" } }, "required": ["kind"] },
          { "type": "object", "properties": { "kind": { "const": "elementTextContains" }, "target": { "type": "string" }, "text": { "type": "string" } }, "required": ["kind", "target", "text"] },
          { "type": "object", "properties": { "kind": { "const": "ariaState" }, "target": { "type": "string" }, "name": { "type": "string" }, "value": { "type": "string" } }, "required": ["kind", "target", "name", "value"] },
          { "type": "object", "properties": { "kind": { "const": "networkIdle" }, "timeoutMs": { "type": "integer", "minimum": 0 } }, "required": ["kind"] }
        ]
      }
    }
  }
}
```
EBNF Sketch
If you prefer grammar-level decoding, an EBNF snippet can describe action syntax:
```ebnf
plan          := '{' '"version"' ':' '"1.0"' ',' '"steps"' ':' '[' step { ',' step } ']' '}' ;
step          := navigate | click | type | select | waitFor | extract ;
navigate      := '{' '"type"' ':' '"navigate"' ',' '"url"' ':' string pre? post? '}' ;
click         := '{' '"type"' ':' '"click"' ',' '"targetRef"' ':' string pre? post? '}' ;
... := ...
pre           := ',' '"pre"' ':' '[' precondition { ',' precondition } ']' ;
post          := ',' '"post"' ':' '[' postcondition { ',' postcondition } ']' ;
precondition  := '{' '"kind"' ':' ( '"exists"' | '"visible"' | ... ) ... '}' ;
postcondition := '{' '"kind"' ':' ( '"urlChanges"' | '"elementTextContains"' | ... ) ... '}' ;
string        := '"' { char } '"' ;
```
Libraries like Outlines, Guidance, or llama.cpp grammars can enforce this token-level structure.
CDP-Validated Plans: Grounding Actions in Live State
Constrained decoding ensures the plan is syntactically valid. It does not ensure the plan is true. CDP validation does.
Before executing an action:
- Resolve the locator into a concrete element handle using CDP (DOM.querySelectorAll, Accessibility.getFullAXTree, Runtime.callFunctionOn).
- Check preconditions: existence, visibility, bounding box in viewport, enabled state, URL pattern.
- If preconditions fail, run the repair loop.
After executing an action:
- Wait for predicted postconditions: URL change, innerText change, ARIA attribute toggled, network idle for X ms, specific request observed, etc.
- If postconditions fail within timeout, repair or roll back.
Minimal Node/CDP Skeleton
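Below is a minimal sketch using Puppeteer, which drives Chrome over CDP. When you need raw protocol access, Puppeteer exposes page.createCDPSession() and Playwright exposes context.newCDPSession(page); any other Chrome DevTools Protocol client follows the same pattern.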
```ts
import type { ElementHandle, Page } from 'puppeteer';

type Candidates = Record<string, any>;

// Escape a value for use inside a double-quoted selector string.
const q = (s: string) => s.replace(/\\/g, '\\\\').replace(/"/g, '\\"');

// Uses Puppeteer's P selectors (::-p-aria, ::-p-text), so it needs a release that supports them.
async function resolveCandidate(page: Page, locator: any): Promise<ElementHandle<Element> | null> {
  if (!locator) return null;
  switch (locator.strategy) {
    case 'byTestId':
      return page.$(`[data-testid="${q(locator.testId)}"]`);
    case 'byRole': {
      // Query the accessibility tree by role and, if given, accessible name.
      const name = locator.name ? `[name="${q(locator.name)}"]` : '';
      return page.$(`::-p-aria(${name}[role="${locator.role}"])`);
    }
    case 'css':
      return page.$(locator.selector);
    case 'text':
      // Simplified text search (`exact` is ignored); use a robust text engine in production.
      return page.$(`::-p-text("${q(locator.text)}")`);
    default:
      return null;
  }
}

async function preconditionSatisfied(page: Page, pre: any, candidates: Candidates): Promise<boolean> {
  if (pre.kind === 'urlMatches') return new RegExp(pre.pattern).test(page.url());
  const el = await resolveCandidate(page, candidates[pre.target]);
  if (!el) return false;
  switch (pre.kind) {
    case 'exists':
      return true;
    case 'visible': {
      const box = await el.boundingBox();
      return !!box && box.width > 0 && box.height > 0;
    }
    case 'enabled':
      return el.evaluate((e) => !(e as HTMLButtonElement).disabled);
    case 'attrEquals':
      return (await el.evaluate((e, name) => e.getAttribute(name), pre.name)) === pre.value;
    default:
      return false;
  }
}

// `urlBefore` is the URL captured before the action ran, so a navigation that
// has already completed still counts as a change.
async function waitForPost(page: Page, post: any, candidates: Candidates, urlBefore: string, timeout = 3000): Promise<boolean> {
  try {
    switch (post.kind) {
      case 'urlChanges':
        await page.waitForFunction(
          (start, to) => (to ? window.location.href.includes(to) : window.location.href !== start),
          { timeout }, urlBefore, post.to);
        return true;
      case 'networkIdle':
        // Simplified; use Network domain events for precise idle detection.
        await page.waitForNetworkIdle({ idleTime: 500, timeout: post.timeoutMs ?? timeout });
        return true;
      case 'elementTextContains': {
        const el = await resolveCandidate(page, candidates[post.target]);
        if (!el) return false;
        await page.waitForFunction((e, t) => (e as HTMLElement).innerText.includes(t), { timeout }, el, post.text);
        return true;
      }
      case 'ariaState': {
        const el = await resolveCandidate(page, candidates[post.target]);
        if (!el) return false;
        await page.waitForFunction((e, n, v) => e.getAttribute(`aria-${n}`) === v, { timeout }, el, post.name, post.value);
        return true;
      }
      default:
        return false;
    }
  } catch {
    return false; // timed out, or the element detached while waiting
  }
}

async function executeStep(page: Page, step: any, candidates: Candidates) {
  // Precondition check
  for (const pre of step.pre ?? []) {
    if (!(await preconditionSatisfied(page, pre, candidates))) {
      return { ok: false, reason: 'precondition_failed', pre };
    }
  }
  const urlBefore = page.url();
  // Execute
  try {
    switch (step.type) {
      case 'navigate':
        await page.goto(step.url, { waitUntil: 'domcontentloaded' });
        break;
      case 'click': {
        const el = await resolveCandidate(page, candidates[step.targetRef]);
        if (!el) return { ok: false, reason: 'target_not_found' };
        await el.click({ delay: 10 });
        break;
      }
      case 'type': {
        const el = await resolveCandidate(page, candidates[step.targetRef]);
        if (!el) return { ok: false, reason: 'target_not_found' };
        await el.click({ delay: 10 }); // focus the field first
        await page.keyboard.type(step.text, { delay: 10 });
        break;
      }
      case 'select': {
        const el = await resolveCandidate(page, candidates[step.targetRef]);
        if (!el) return { ok: false, reason: 'target_not_found' };
        const tag = await el.evaluate((e) => e.tagName.toLowerCase());
        if (tag === 'select') {
          await el.select(String(step.option)); // matches by option value
        } else {
          await el.click(); // TODO: implement dropdown option resolution
        }
        break;
      }
      case 'waitFor': {
        // Poll the condition until it holds or the timeout elapses.
        const deadline = Date.now() + (step.timeoutMs ?? 3000);
        while (!(await preconditionSatisfied(page, step.condition, candidates))) {
          if (Date.now() >= deadline) return { ok: false, reason: 'wait_timeout' };
          await new Promise((r) => setTimeout(r, 100));
        }
        break;
      }
      case 'extract': {
        const el = await resolveCandidate(page, candidates[step.query.targetRef]);
        if (!el) return { ok: false, reason: 'target_not_found' };
        const data = await el.evaluate((e, kind) => {
          if (kind === 'text') return (e as HTMLElement).innerText;
          if (kind === 'html') return e.innerHTML;
          if (kind === 'value') return (e as HTMLInputElement).value;
          return null;
        }, step.query.kind);
        return { ok: true, data };
      }
      default:
        return { ok: false, reason: 'unknown_action' };
    }
  } catch (err) {
    return { ok: false, reason: 'exception', error: String(err) };
  }
  // Postconditions
  for (const post of step.post ?? []) {
    if (!(await waitForPost(page, post, candidates, urlBefore))) {
      return { ok: false, reason: 'postcondition_failed', post };
    }
  }
  return { ok: true };
}
```
Auto-Repair With Preconditions and Postconditions
A robust agent doesn’t just fail; it repairs. The controller loop should:
- Validate preconditions.
- If they fail, attempt deterministic repairs first (e.g., scroll into view, wait for idle, re-resolve candidates, refresh FrameTree).
- If still failing, request a repair suggestion from the LLM—constrained by the same grammar—given a compact observation (DOM snippet, AX subtree, screenshot hash, last error).
- Apply rate limits and guardrails; keep a small beam of repair proposals; avoid infinite loops.
Repair should be cheap and local. Prefer operations like scrolling, waiting, or switching to alternate candidates the orchestrator already enumerated (see next section) instead of generating new free-form selectors.
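The loop described above can be sketched as a small controller with an explicit budget. This is only an illustration: the `RepairHooks` interface, the three repair names, and `repairUntilReady` are hypothetical, and the hooks stand in for whatever your executor uses to check preconditions and apply repairs.

```typescript
type RepairName = "scrollIntoView" | "waitForIdle" | "reResolveCandidate";

// Hypothetical hooks supplied by the executor.
interface RepairHooks {
  check: () => Promise<boolean>;               // true once preconditions hold
  apply: (repair: RepairName) => Promise<void>; // one deterministic repair
  askModel?: () => Promise<boolean>;            // grammar-constrained repair; false if none proposed
}

interface RepairResult {
  ok: boolean;
  tried: string[]; // repairs attempted, in order
}

async function repairUntilReady(hooks: RepairHooks, maxModelRepairs = 1): Promise<RepairResult> {
  const tried: string[] = [];
  if (await hooks.check()) return { ok: true, tried };

  // Cheap, deterministic repairs first, each tried at most once.
  const deterministic: RepairName[] = ["scrollIntoView", "waitForIdle", "reResolveCandidate"];
  for (const repair of deterministic) {
    await hooks.apply(repair);
    tried.push(repair);
    if (await hooks.check()) return { ok: true, tried };
  }

  // Then a bounded number of model-proposed repairs, so the loop always ends.
  const askModel = hooks.askModel;
  if (askModel) {
    for (let i = 0; i < maxModelRepairs; i++) {
      const proposed = await askModel();
      tried.push("model");
      if (!proposed) break;
      if (await hooks.check()) return { ok: true, tried };
    }
  }
  return { ok: false, tried };
}
```

Because the deterministic list is fixed and the model budget is a number, the worst case is known in advance: three local repairs plus `maxModelRepairs` model calls per step.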
Make Selectors a Choice, Not a String
One major improvement: enumerate candidate elements upfront and let the model pick by ID. Example "candidates" map:
```json
{
  "candidates": {
    "loginBtn": { "strategy": "byRole", "role": "button", "name": "Log in" },
    "email": { "strategy": "byTestId", "testId": "email-input" },
    "password": { "strategy": "byTestId", "testId": "password-input" },
    "submit": { "strategy": "css", "selector": "form#login button[type=submit]" }
  }
}
```
The plan then uses "targetRef": "submit". Behind the scenes you maintain a stable mapping per page visit, recomputing on DOM changes and dropping invalid entries. The LLM’s job is classification, not CSS authorship.
Pre/Postcondition Examples
- click(submit): pre = [exists(submit), visible(submit)], post = [networkIdle(1000), urlChanges()]
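- type(email, "a@b.com"): pre = [exists(email), enabled(email)], post = [attrEquals(email, "value", "a@b.com")]. Note that the schema skeleton above lists attrEquals only as a precondition and reads the HTML attribute; to use it this way, add it to the postcondition set and compare against the element's live value property, which is what typing actually updates.
- waitFor(urlMatches, /checkout/): no pre or post; the condition itself is the check.
These constraints let you detect failures early and fix them locally: if submit isn’t visible, scroll to it; if the value isn’t set, retry type with a small delay; if the URL won’t change, take a screenshot and ask the LLM to propose an alternative flow.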
Logging Step Schemas for Training, Replay, and CI
Every action is an example. Log it richly:
- Input: task, plan step JSON, candidates at time of decision
- Observations: CDP events (DOM mutations, network), AX subtree, element screenshots, pre/post booleans
- Outcome: success/failure, repair attempts, latency, tokens
- Diffs: DOM snippet diff before/after; URL changes; resource loads
This becomes:
- A replayable trace for deterministic CI (mock out network or use VCR-like fixtures)
- A supervised fine-tuning dataset of (obs -> action) pairs for your next model
- A failure analytics dashboard for flaky selectors and slow endpoints
Structure the log with versioned schemas. Example step log:
```json
{
  "runId": "2025-01-09T12:00:00Z-abc123",
  "stepIndex": 3,
  "action": {
    "type": "click",
    "targetRef": "submit",
    "pre": [ { "kind": "visible", "target": "submit" } ],
    "post": [ { "kind": "urlChanges" } ]
  },
  "candidates": {
    "submit": { "strategy": "css", "selector": "form#login button[type=submit]" }
  },
  "preCheck": { "visible": true },
  "execution": { "ts": 1736420400, "durationMs": 240, "cdp": { "events": 12 } },
  "postCheck": { "urlChanged": true },
  "artifacts": { "screenshot": "s3://.../step3.png", "axSubtree": "s3://.../ax3.json" },
  "result": "ok"
}
```
You can export these to Parquet for analytics or to a vector store keyed by DOM fingerprints for retrieval-augmented planning.
Orchestrator Architecture
A clean separation of concerns keeps complexity under control:
- Page Modeler: builds the candidates map from CDP snapshots, AX trees, heuristics (roles, test IDs, unique texts).
- Planner: calls the LLM with schema/grammar, context, and candidate list; gets a typed plan or next action.
- Validator/Executor: CDP-validated pre, act, post; emits step logs.
- Repair Loop: deterministic retries and LLM-constrained repairs with budget.
- Data Layer: stores logs, screenshots, network fixtures for replay/CI/training.
This architecture supports both closed models (function-calling) and open models (grammar sampling). It lets you swap the planner without rewriting validators or data infra.
Function Calling and JSON Schema in Practice
Most major model providers now offer structured outputs:
- JSON Schema-constrained responses: request the model to emit JSON conforming to a schema; reject and retry on invalid outputs.
- Tool/function calling: you define named tools with typed arguments; the model decides which to call and with what arguments.
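Example call using a JSON Schema response format (a sketch; the schema and candidates are passed in):
```ts
import OpenAI from 'openai';
import type { Locator, Plan } from './types';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// `schema` is the Plan JSON Schema from above; `candidates` comes from the orchestrator.
// Note: with strict: true the API accepts only a subset of JSON Schema (for example,
// objects must set additionalProperties: false), so the skeleton may need adapting.
async function requestPlan(
  schema: Record<string, unknown>,
  candidates: Record<string, Locator>,
): Promise<Plan> {
  const prompt = `You are a browser planning agent.
You must output a plan that conforms to the provided schema.
Use only candidate ids when referencing elements.
Prefer minimal steps with explicit pre and post conditions.
Task: Log into the demo app and navigate to the dashboard.`;

  const resp = await client.chat.completions.create({
    model: 'gpt-4.1',
    messages: [
      { role: 'system', content: 'You output only JSON conforming to the schema. No prose.' },
      { role: 'user', content: JSON.stringify({ prompt, candidates }) },
    ],
    response_format: {
      type: 'json_schema',
      json_schema: { name: 'BrowserPlan', schema, strict: true },
    },
    temperature: 0.2,
  });

  const content = resp.choices[0].message.content;
  if (!content) throw new Error('Model returned no content');
  return JSON.parse(content) as Plan;
}
```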
And a function-calling example for single-step planning:
```ts
import OpenAI from 'openai';

// Ask for the next single action as a tool call; the caller dispatches it to the
// executor, validates via CDP, and loops.
async function requestNextAction(
  client: OpenAI,
  task: string,
  candidates: Record<string, unknown>,
  lastObservation: unknown,
) {
  const ids = Object.keys(candidates);
  const tools: OpenAI.Chat.Completions.ChatCompletionTool[] = [
    {
      type: 'function',
      function: {
        name: 'click',
        description: 'Click a candidate element by id',
        parameters: {
          type: 'object',
          properties: { targetRef: { type: 'string', enum: ids } },
          required: ['targetRef'],
        },
      },
    },
    {
      type: 'function',
      function: {
        name: 'type',
        description: 'Type text into a candidate input',
        parameters: {
          type: 'object',
          properties: {
            targetRef: { type: 'string', enum: ids },
            text: { type: 'string' },
          },
          required: ['targetRef', 'text'],
        },
      },
    },
    // ... more tools
  ];

  const resp = await client.chat.completions.create({
    model: 'gpt-4.1',
    messages: [
      { role: 'system', content: 'Use tools to act. Choose only from provided candidates.' },
      { role: 'user', content: JSON.stringify({ task, candidates, lastObservation }) },
    ],
    tools,
    tool_choice: 'auto',
    temperature: 0.1,
  });

  // Empty when the model answered in text instead of calling a tool.
  return resp.choices[0].message.tool_calls ?? [];
}
```
For open models, pair a grammar-constrained decoder with the same Plan schema. Outlines and llama.cpp grammars can ensure only valid keys and enum values appear, dramatically lowering the need for JSON repair.
Performance, Latency, and Token Budgets
Constrained decoding reduces the need for multi-round “JSON fixing,” which saves tokens. Additional tips:
- Enumerate candidates once per state and reference by small IDs (e.g., c1, c2, c3) to compress prompts.
- Cache a compact “page sketch”: roles summary, headings, key landmarks, and a shortlist of actionable elements.
- Batch small actions into a short plan (3–6 steps) instead of single-step loops; validate after each step, but plan infrequently.
- Stream and early-validate: as the model streams JSON, you can sanity check fields and preemptively fetch CDP info.
- Keep a budgeted repair loop: limit to N repairs per step; switch strategy (ask for plan re-write) after budget exhaustion.
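As one illustration of the first tip, the sketch below maps long stable candidate IDs to short aliases for the prompt and back again when the model answers. `aliasCandidates` and `resolveAlias` are hypothetical helpers, and sorting the IDs first is an assumption made so that the same candidate set always yields the same aliases.

```typescript
interface AliasTable {
  toAlias: Record<string, string>;  // stable id -> short alias
  toStable: Record<string, string>; // short alias -> stable id
}

function aliasCandidates(stableIds: string[]): AliasTable {
  const toAlias: Record<string, string> = {};
  const toStable: Record<string, string> = {};
  // Sort a copy so alias assignment does not depend on enumeration order.
  const sorted = stableIds.slice().sort();
  sorted.forEach((id, i) => {
    const alias = "c" + (i + 1);
    toAlias[id] = alias;
    toStable[alias] = id;
  });
  return { toAlias, toStable };
}

// Translate a model-chosen alias back to its stable id; null for unknown aliases.
function resolveAlias(table: AliasTable, alias: string): string | null {
  return Object.prototype.hasOwnProperty.call(table.toStable, alias)
    ? table.toStable[alias]
    : null;
}

const aliasDemo = aliasCandidates(["submit", "email"]);
console.log(aliasDemo.toAlias);              // { email: "c1", submit: "c2" }
console.log(resolveAlias(aliasDemo, "c2"));  // "submit"
```

Returning `null` for an unknown alias gives the validator a clean signal to reject the step instead of guessing.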
Security and Safety Guardrails
- Domain allowlist and URL rewrite rules; no untrusted origins.
- Sensitive field redaction (inputs named password, token) in logs and screenshots.
- Rate limiting and safe timeouts; avoid DDoSing the target.
- Sandbox credentials; distinct profiles for different tests.
- Click-risk scoring: distinguish idempotent vs. destructive actions; require explicit confirmation for destructive ones.
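Two of these guardrails are easy to sketch: a domain allowlist check applied before every navigate, and redaction of typed text before it reaches a log. The function names and the `SENSITIVE_NAME` pattern are assumptions for illustration; tune the pattern to your own field naming.

```typescript
// True only for http(s) URLs whose host is an allowed domain or a subdomain of one.
function isAllowedUrl(rawUrl: string, allowedDomains: string[]): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // unparseable URLs are rejected outright
  }
  if (url.protocol !== "https:" && url.protocol !== "http:") return false;
  const host = url.hostname.toLowerCase();
  return allowedDomains.some((d) => {
    const domain = d.toLowerCase();
    // The leading dot prevents "evilexample.com" from matching "example.com".
    return host === domain || host.endsWith("." + domain);
  });
}

// Assumed pattern for candidate ids that point at sensitive inputs.
const SENSITIVE_NAME = /pass(word)?|token|secret/i;

// Replace typed text with a placeholder when the target looks sensitive.
function redactTypedText(targetRef: string, text: string): string {
  return SENSITIVE_NAME.test(targetRef) ? "[REDACTED]" : text;
}

console.log(isAllowedUrl("https://app.example.com/login", ["example.com"])); // true
console.log(isAllowedUrl("https://example.com.evil.io/", ["example.com"]));  // false
console.log(redactTypedText("password", "hunter2"));                          // "[REDACTED]"
```

Matching on the parsed hostname rather than the raw string is the important design choice here; substring checks on URLs are easy to bypass.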
Evaluation Metrics That Matter
- Plan structural validity rate: % of LLM outputs passing schema/grammar checks.
- Precondition satisfaction rate at first try.
- Repair rate and repair depth: how often repairs succeed and how deep the loop goes.
- Postcondition latency distribution and tail percentiles.
- End-to-end task success rate over a benchmark suite (deterministic fixtures and live targets).
- Token and time budgets per successful task.
Instrument these in your step logs and aggregate daily. Build alerts for regressions in precondition satisfaction and postcondition timeouts.
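A minimal sketch of that aggregation follows, assuming each step log has already been reduced to the hypothetical `StepRecord` shape below. The field names are illustrative rather than a fixed log format, and the percentile uses the simple nearest-rank method.

```typescript
interface StepRecord {
  structurallyValid: boolean;   // passed schema/grammar checks
  preOkFirstTry: boolean;       // preconditions held without repair
  repairAttempts: number;       // 0 when no repair was needed
  postLatencyMs: number | null; // null when the step had no postcondition
  result: "ok" | "failed";
}

// Nearest-rank percentile on a sorted copy; null for empty input.
function percentile(values: number[], p: number): number | null {
  if (values.length === 0) return null;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function summarize(steps: StepRecord[]) {
  const n = steps.length;
  const rate = (count: number) => (n === 0 ? 0 : count / n);
  const repaired = steps.filter((s) => s.repairAttempts > 0);
  const latencies = steps
    .map((s) => s.postLatencyMs)
    .filter((ms): ms is number => ms !== null);
  return {
    structuralValidityRate: rate(steps.filter((s) => s.structurallyValid).length),
    preFirstTryRate: rate(steps.filter((s) => s.preOkFirstTry).length),
    repairRate: rate(repaired.length),
    // Share of repaired steps that ended in success.
    repairSuccessRate:
      repaired.length === 0 ? 0 : repaired.filter((s) => s.result === "ok").length / repaired.length,
    maxRepairDepth: repaired.reduce((m, s) => Math.max(m, s.repairAttempts), 0),
    postLatencyP95Ms: percentile(latencies, 95),
  };
}
```

Running `summarize` over each day's records and comparing `preFirstTryRate` and `postLatencyP95Ms` against the previous day is one simple way to drive the regression alerts.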
Opinionated Guidance
- Don’t let the LLM invent selectors. Make selectors a choice among orchestrator-provided candidates. If you must allow raw CSS, gate it behind a secondary tool that validates and normalizes to a candidate ID.
- Always attach at least one postcondition. If you can’t state what should change, you likely don’t need the action.
- Prefer AX tree (roles/names) and test IDs over brittle CSS. Provide both so the model can back off from one to the other during repair.
- Encode idempotency: classify actions (read-only, idempotent, destructive) and elevate confirmation thresholds for destructive ones.
- Log everything as typed schemas. Ad-hoc logs rot; schema’d logs train models.
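The idempotency guidance can be encoded as a small classifier in front of the executor. The sketch below is a heuristic under loud assumptions: the keyword list in `DESTRUCTIVE_WORDS` is invented for illustration, and treating every unflagged mutating action as "idempotent" is a simplification a real deployment would refine per application.

```typescript
type Risk = "read-only" | "idempotent" | "destructive";

interface StepSummary {
  type: "navigate" | "click" | "type" | "select" | "waitFor" | "extract";
  targetName?: string; // accessible name of the target, when there is one
}

// Assumed keyword list; curate this for the application under test.
const DESTRUCTIVE_WORDS = /\b(delete|remove|pay|purchase|buy)\b/i;

function classifyRisk(step: StepSummary): Risk {
  if (step.type === "extract" || step.type === "waitFor") return "read-only";
  if (step.type === "click" && step.targetName && DESTRUCTIVE_WORDS.test(step.targetName)) {
    return "destructive";
  }
  return "idempotent"; // simplification: everything else is treated as safe to retry
}

// Destructive steps run only after explicit confirmation.
function mayExecute(step: StepSummary, confirmed: boolean): boolean {
  return classifyRisk(step) !== "destructive" || confirmed;
}

console.log(classifyRisk({ type: "click", targetName: "Delete account" })); // "destructive"
console.log(mayExecute({ type: "click", targetName: "Delete account" }, false)); // false
```

Keeping the classifier separate from the executor means the confirmation threshold can change without touching the action grammar.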
Example End-to-End Flow
- Modeler builds candidates: { loginBtn, email, password, submit } from the current page using CDP + AX heuristics.
- Planner requests a 4-step plan via JSON Schema:
```json
{
  "version": "1.0",
  "candidates": {
    "email": { "strategy": "byTestId", "testId": "email-input" },
    "password": { "strategy": "byTestId", "testId": "password-input" },
    "submit": { "strategy": "byRole", "role": "button", "name": "Log in" }
  },
  "steps": [
    {
      "type": "type",
      "targetRef": "email",
      "text": "user@example.com",
      "pre": [ { "kind": "exists", "target": "email" } ],
      "post": [ { "kind": "attrEquals", "target": "email", "name": "value", "value": "user@example.com" } ]
    },
    {
      "type": "type",
      "targetRef": "password",
      "text": "hunter2",
      "pre": [ { "kind": "exists", "target": "password" } ]
    },
    {
      "type": "click",
      "targetRef": "submit",
      "pre": [ { "kind": "visible", "target": "submit" } ],
      "post": [ { "kind": "networkIdle", "timeoutMs": 1500 }, { "kind": "urlChanges" } ]
    },
    {
      "type": "waitFor",
      "condition": { "kind": "urlMatches", "pattern": "/dashboard" },
      "timeoutMs": 5000
    }
  ]
}
```
- Executor validates preconditions via CDP. If submit isn’t visible, it scrolls; if submit is still not visible, it requests a repair within the same grammar (e.g., close a “Continue” modal first).
- Postconditions verify navigation; logs capture the step evidence; CI replay later uses recorded network fixtures.
Building the Candidates Map (The Secret Sauce)
The quality of your candidate enumeration determines how often the LLM gets into trouble. Combine several signals:
- AX roles and names: Accessibility tree is stable and semantically meaningful.
- Test IDs: [data-testid] or equivalent (Cypress/Playwright best practice).
- Unique text: innerText heuristics for labels/buttons.
- Structural anchors: form labels associated via for/id, fieldsets, ARIA controls.
- Layout features: viewport visibility, overlap, coverage ratio.
Assign stable IDs deterministically (hash of role+name+path) so step logs stay comparable across runs. When the DOM changes, drop candidates that no longer resolve, but keep a small rolling window of recently valid candidates for repair.
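One way to derive such an ID is sketched below. It uses a 32-bit FNV-1a hash as an arbitrary stand-in for whatever hash you prefer; the `CandidateKey` shape, the path format, and the `cand_` prefix are all assumptions made for the example.

```typescript
interface CandidateKey {
  role: string;
  name: string;
  path: string; // structural path, e.g. "form#login>button"; the format is an assumption
}

// 32-bit FNV-1a over UTF-16 code units. Not cryptographic; it only needs to be stable.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

function stableId(key: CandidateKey): string {
  // Normalize whitespace and case in the name so cosmetic changes keep the same id.
  const name = key.name.trim().replace(/\s+/g, " ").toLowerCase();
  // Join with a separator that should not occur inside any of the parts.
  return "cand_" + fnv1a([key.role, name, key.path].join("\u0000"));
}

const idA = stableId({ role: "button", name: "Log in", path: "form#login>button" });
const idB = stableId({ role: "button", name: "  log   in ", path: "form#login>button" });
console.log(idA === idB); // true: same element, cosmetic differences only
```

A 32-bit hash is enough for the handful of candidates on a single page; a longer digest would be the safer choice if IDs must be unique across a whole site.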
CI, Replay, and Dataset Creation
- Deterministic replays: run against a fixed build with recorded network; or run live but gate on stable oracles (URL patterns, content checks) rather than pixel diffs.
- Snapshot seeds: fix viewport, locale, timezone, storage state.
- Failure minimization: when CI fails, store the full plan, CDP logs, screenshots, and a minimal HAR.
- Dataset conversion: each step becomes a labeled example: observation (candidates + partial DOM sketch) -> action JSON. Fine-tune a smaller model to generate high-quality action plans within your grammar.
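The dataset conversion step might look like the sketch below. The `LoggedStep` fields are hypothetical and only loosely modeled on the step log shown earlier, and keeping only first-try successes is one possible filtering policy, not a recommendation from the log format itself.

```typescript
interface LoggedStep {
  task: string;
  candidates: Record<string, unknown>;
  action: { type: string; [field: string]: unknown };
  result: "ok" | "failed";
  repairAttempts: number;
}

interface TrainingExample {
  input: string;  // serialized observation
  output: string; // serialized action JSON
}

// Keep only steps that succeeded without repair, so the model learns clean decisions.
function toExamples(steps: LoggedStep[]): TrainingExample[] {
  return steps
    .filter((s) => s.result === "ok" && s.repairAttempts === 0)
    .map((s) => ({
      input: JSON.stringify({ task: s.task, candidateIds: Object.keys(s.candidates).sort() }),
      output: JSON.stringify(s.action),
    }));
}

// One JSON object per line, the usual layout for fine-tuning files.
function toJsonl(examples: TrainingExample[]): string {
  return examples.map((e) => JSON.stringify(e)).join("\n");
}
```

Steps that needed repair are still valuable; a separate export could pair the failed action with the repair that fixed it.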
Future Directions
- Weighted grammars and probabilistic validators: assign costs to risky actions and bias decoding toward safer plans.
- Hybrid symbolic-numeric planners: constraint solvers for pre/postconditions combined with neural scoring.
- Better DOM semantics: unify AX tree, semantic HTML, and layout to produce more reliable candidates.
- Self-healing selectors at build time: automatically inject stable test IDs via CI into your app to improve agent reliability over time.
Conclusion
Typing the planning language is the single biggest reliability upgrade for browser agents. Constrained decoding ensures the LLM speaks the language; CDP validation keeps it honest; pre/postconditions create a contract for auto-repair; and schema’d logs turn executions into training data and CI artifacts.
Stop asking your model to “click the blue button.” Make it choose click(submit) where submit is a validated candidate with a role, a name, and a contract. That’s how you end hallucinations and start shipping robust browser automation.
Appendix: Small Grammar for Candidate-First Selectors
- Each targetRef must exist in candidates.
- The orchestrator owns candidate creation and updates it on DOM changes.
- The planner never refers to raw CSS unless via a specific tool that normalizes to a candidate.
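The first rule is a referential check that JSON Schema alone cannot express, so it belongs in the validator. A minimal sketch follows; `MinimalPlan`, `referencedIds`, and `danglingRefs` are hypothetical names, and the fields inspected follow the Plan shape used in this article.

```typescript
interface MinimalPlan {
  candidates: Record<string, unknown>;
  steps: Array<Record<string, unknown>>;
}

// Collect every candidate id a step mentions: targetRef, query.targetRef,
// the waitFor condition's target, and the target of each pre/post condition.
function referencedIds(step: Record<string, unknown>): string[] {
  const ids: string[] = [];
  const push = (v: unknown) => {
    if (typeof v === "string") ids.push(v);
  };
  const asObject = (v: unknown): Record<string, unknown> | null =>
    v !== null && typeof v === "object" ? (v as Record<string, unknown>) : null;

  push(step["targetRef"]);
  const query = asObject(step["query"]);
  if (query) push(query["targetRef"]);
  const condition = asObject(step["condition"]);
  if (condition) push(condition["target"]);
  for (const key of ["pre", "post"]) {
    const conds = step[key];
    if (Array.isArray(conds)) {
      conds.forEach((c) => {
        const cond = asObject(c);
        if (cond) push(cond["target"]);
      });
    }
  }
  return ids;
}

// Return every reference that does not resolve to a known candidate.
function danglingRefs(plan: MinimalPlan): Array<{ stepIndex: number; ref: string }> {
  const problems: Array<{ stepIndex: number; ref: string }> = [];
  plan.steps.forEach((step, stepIndex) => {
    for (const ref of referencedIds(step)) {
      if (!Object.prototype.hasOwnProperty.call(plan.candidates, ref)) {
        problems.push({ stepIndex, ref });
      }
    }
  });
  return problems;
}
```

An empty result means the plan is closed over its candidates map; any entry is grounds to reject the plan before a single CDP call is made.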
This candidate-first pattern turns UI automation from string-guessing into typed selection. Combined with CDP-validated pre/postconditions, it’s the practical path to stable browser agents.
