Capture and Dispatch Modalities
How Cua Driver observes and acts on an app — perception always returns both the accessibility tree and a screenshot, the action rung (ax or px) is chosen at action time, plus dispatch and capture scope — and which combinations are valid.
Capture and Dispatch Modalities
Every Cua Driver action is shaped by four things: what the agent observes, which rung dispatches the action, how input is delivered, and what coordinate space the action targets. Most callers never set any of them — the defaults give background, per-window, accessibility-first automation. The key change from earlier versions: perception is no longer a mode you pick. get_window_state returns both the accessibility tree and a screenshot in one call, and the ax-versus-px choice is made at action time, by how you address the target. This page documents the full set so you can reason about the combinations that are valid and the ones the driver rejects by design.
The Axes#
1. Perception — what the agent observes#
get_window_state(pid, window_id) returns both the accessibility tree and a screenshot by default, in one call. There is no capture mode to pick: you ground on the tree and the screenshot together and cross-check one against the other. This matters because the tree lies on some surfaces: it can expose useful structure while still echoing a write the app did not apply, omitting the rendered value, or reporting geometry that disagrees with the pixels. A grounding screenshot is always present, so when the tree looks wrong you check the pixels in the same response.
The accessibility tree is the ground truth for what is clickable — roles, labels, advertised actions, and an element_index handle on every actionable element. The screenshot tells you which one, disambiguates repeated or empty labels, and surfaces captions, colors, and layout the tree omits entirely (common in Chromium and Electron). They are complementary, which is why both come back together.
Perf opt-out —
include_screenshot.include_screenshot(boolean, defaulttrue) is the one knob, and it is a performance knob, not a modality choice. The default returns both. Passinclude_screenshot: falseto skip the screen grab and get the tree only — the cheap path when you are re-indexing before an element ax action and don't need fresh pixels. Theax-versus-pxdecision still lives at action time, not here.
capture_modeis deprecated and ignored. It is still accepted onget_window_stateso old callers don't error, but it has no effect — both the tree and the screenshot come back regardless of what you pass. The oldax/vision/som/screenshotvalues all decode (sommapped toax,screenshottovision) but none changes what is captured. There is no capture-mode choice anymore; perception is always both. (The separately namedscreenshottool — raw PNG, no AX walk — is unrelated.)
2. Action rung — how the target is addressed#
You don't pick a capture mode; you pick how you address the target on the action call, and that one choice selects the rung:
| Rung | Address with | Dispatches through | Properties |
|---|---|---|---|
| element ax action | element_index / element_token | the accessibility rung — UIA Invoke (Windows), AXPerformAction (macOS), AT-SPI doAction (Linux) | Backgroundable, z-order-independent, and the only driver-verifiable rung. |
| element px action | x, y | the pixel rung, reading the coordinate straight off the screenshot already in the get_window_state response | Best-effort; the caller confirms the effect off the screenshot. |
Default to the element ax action — it is verifiable and backgroundable. Drop to an element px action when the tree can't disambiguate (repeated or empty labels), when it's empty (degraded — a non-AX surface), when an action came back suspected_noop, or when the tree disagrees with the pixels. You never re-capture to switch rungs: the screenshot is already in the snapshot, so you only change how you address the target.
Both rungs apply to the keyboard family (type_text, press_key, hotkey), not just the pointer tools. Address by element_index (ax) to target a field with no pre-click, or by x, y (px) — which pixel-clicks at (x, y) to establish real renderer focus, then delivers the keystroke(s) to the now-focused element. The px form is the one-call path for Chromium/Electron inputs the AX layer can't focus: type_text({ pid, window_id, x, y, text }) focuses and types in a single call. The two forms are mutually exclusive. (set_value is the exception — it stays ax-only, because it sets the value of a non-text control like a dropdown, checkbox, or slider; its pixel counterpart is a click/drag on the control, not a "set value at a pixel.")
3. Dispatch — how input is delivered#
Set per call on the input family (click, double_click, right_click, drag, scroll, type_text, press_key, hotkey) with the delivery_mode field. This field is a shared cross-platform parameter — the same two values, accepted uniformly on Windows, macOS, and Linux.
delivery_mode | Behavior |
|---|---|
background (default) | Input is routed to the target process/window/element directly. The user's frontmost app, real cursor, and window z-order are untouched. This is the no-foreground contract. |
foreground | The target is briefly fronted (pair with bring_to_front to avoid a per-call flash), input lands on the now-active window, then the prior frontmost is restored. The explicit last resort when a background attempt did not land — and the only path for apps that accept events solely when foregrounded (DirectInput games, raw-input canvases). |
Only background and foreground are valid; the historical auto heuristic is removed. Element ax actions (element_index) are inherently background — they address an element, not the focused window — so they hold the contract without any delivery_mode flag. The dispatch axis matters most for the pixel rung (x, y), where background routes the event to the target and foreground raises the window first.
4. Capture scope — what coordinate space the action targets#
Set with set_config capture_scope=….
capture_scope | Coordinate space | Capture surface |
|---|---|---|
window (default) | Per-window. Actions carry pid + window_id; coordinates are window-relative or addressed by element_index. | get_window_state |
desktop | Screen-absolute. Window-less actions (no pid/window_id) land at absolute screen coordinates via hit-testing (WindowFromPoint). | get_desktop_state — full display, no downscale |
Desktop scope is the "Computer-Use 1.0" loop: the agent reads the whole screen and clicks absolute coordinates, the way a screenshot-only model expects. Window scope is the default because it is what makes background, concurrent automation possible.
Response signals: verification, effect, and escalation#
Dispatch success is not the same thing as application state change. The driver can verify an effect only when it can read the changed state back through the accessibility layer. That is why verified: true is reserved for AX read-back: it means the driver observed the effect, not merely that it sent an event. Pixel input, foreground input, and echo-prone AX surfaces can be routed correctly while still leaving confirmation to the caller.
effect is the confidence signal that separates those cases. "confirmed" means the driver verified the result through AX read-back. "unverifiable" means the dispatch path ran, but the driver cannot prove the application applied it. "suspected_noop" means an AX action dispatched but almost certainly did not change the target. Callers should treat effect, not the transport-level success status, as the action outcome.
escalation is the machine-readable climb-the-ladder hint. When present, it tells the caller which surface to try next: "px" for acting off the screenshot, "foreground" for explicitly fronting the target, or "page" for the browser-tab DOM path through the page tool. See Choose an action rung and dispatch mode for the procedural ladder and MCP tools for the field table.
When the tree lies#
Some accessibility layers echo writes they did not apply. Electron can report an AX value change through its shim while the renderer stays unchanged. Catalyst controls can expose null AXValues. Chromium/WebKit web content can reflect a write through the accessibility bridge without proving the DOM or rendered view changed.
The driver treats those as surface-aware verification cases. It probes at the element level for a web-content surface, including an AXWebArea ancestor, so native chrome such as a browser address bar stays trusted while browser-tab content does not get a false confirmation. On those surfaces the driver refuses false verified: true responses and returns verified: false, effect: "unverifiable", and an escalation object instead. Electron app surfaces recommend "px" so the caller can act by pixel off the screenshot in the same response; browser-tab web content recommends "page" so the caller can switch to DOM/CDP via the page tool.
The Validity Matrix#
Perception is no longer an axis in this matrix — every get_window_state returns both the tree and a screenshot, so there is nothing to cross here. The real constraints are capture scope (window or desktop) against the action rung (ax or px) and delivery (background or foreground). Capture scope is the constraining axis: window scope supports any combination of rung and delivery; desktop scope supports only the px rung with foreground delivery.
capture_scope | action rung | delivery_mode | Valid? | Why |
|---|---|---|---|---|
window | ax (element_index) | background | ✅ | The default. Semantic actions on a backgrounded window. |
window | ax (element_index) | foreground | ✅ | Activate, then act by element. |
window | px (x, y) | background | ✅ | Click a coordinate off the window's screenshot without raising it. |
window | px (x, y) | foreground | ✅ | Activate, then click by coordinate. |
desktop | px (x, y) | foreground | ✅ | The only desktop combination. Read the whole screen, click absolute coordinates on the active desktop. |
desktop | ax (element_index) | any | ❌ | A desktop-absolute action has no window_id, so there is no element tree to resolve an element_index against — there is no ax rung. |
desktop | any | background | ❌ | Screen-absolute input hits whatever owns those pixels on the active desktop; there is no per-process route, so it cannot be backgrounded. |
window | (window-less) | — | ❌ | A window-less (no pid/window_id) action while scope is window is rejected with the structured desktop_scope_disabled error. |
The two rejections are enforced, not advisory. A window-less click while capture_scope=window returns desktop_scope_disabled, pointing the caller at set_config capture_scope=desktop. Desktop scope inherently foregrounds and works on pixels, so it cannot honor the background contract or reach the ax rung — which is exactly why background, per-window automation is the default.
Platform Support#
| Axis / value | Windows | macOS | Linux |
|---|---|---|---|
action rung: ax (element_index) / px (x, y) | ✅ | ✅ | ✅ |
dispatch: background (the contract) | ✅ | ✅ | ✅ (X11 and Wayland via AT-SPI; raw keyboard into native-Wayland apps is the residual gap) |
dispatch: foreground + bring_to_front | ✅ explicit activation | ✅ explicit activation (NSRunningApplication.activate) | ✅ X11 EWMH activation (_NET_ACTIVE_WINDOW + input focus); Wayland raise is compositor-constrained |
capture_scope: desktop (full screen-absolute loop) | ✅ | rolling out | rolling out |
Both action rungs work on all three platforms, so window-scope automation — the four window-scope rows of the matrix — works everywhere. The desktop-scope loop (get_desktop_state plus window-less screen-absolute input via hit-testing) is complete on Windows and rolling out to macOS and Linux; on those platforms a window-less action under window scope is still rejected. See the MCP tool reference for per-tool parameters and the no-foreground contract for how background dispatch is implemented on each OS.