Capture and Dispatch Modalities

How Cua Driver observes and acts on an app: perception always returns both the accessibility tree and a screenshot, the action rung (ax or px) is chosen at action time, and dispatch and capture scope are set separately. This page also covers which combinations are valid.

Every Cua Driver action is shaped by four things: what the agent observes, which rung dispatches the action, how input is delivered, and what coordinate space the action targets. Most callers never set any of them — the defaults give background, per-window, accessibility-first automation. The key change from earlier versions: perception is no longer a mode you pick. get_window_state returns both the accessibility tree and a screenshot in one call, and the ax-versus-px choice is made at action time, by how you address the target. This page documents the full set so you can reason about the combinations that are valid and the ones the driver rejects by design.

The Axes

1. Perception: what the agent observes

get_window_state(pid, window_id) returns both the accessibility tree and a screenshot by default, in one call. There is no capture mode to pick: you ground on the tree and the screenshot together and cross-check one against the other. This matters because the tree lies on some surfaces: it can expose useful structure while still echoing a write the app did not apply, omitting the rendered value, or reporting geometry that disagrees with the pixels. A grounding screenshot is always present, so when the tree looks wrong you check the pixels in the same response.

The accessibility tree is the ground truth for what is clickable — roles, labels, advertised actions, and an element_index handle on every actionable element. The screenshot tells you which one, disambiguates repeated or empty labels, and surfaces captions, colors, and layout the tree omits entirely (common in Chromium and Electron). They are complementary, which is why both come back together.

Perf opt-out — include_screenshot. include_screenshot (boolean, default true) is the one knob, and it is a performance knob, not a modality choice. The default returns both. Pass include_screenshot: false to skip the screen grab and get the tree only — the cheap path when you are re-indexing before an element ax action and don't need fresh pixels. The ax-versus-px decision still lives at action time, not here.
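As a rough illustration of the opt-out, the sketch below builds the arguments for a get_window_state call. The helper name window_state_args and the sample pid and window_id values are hypothetical; only the pid, window_id, and include_screenshot fields come from this page.

```python
from typing import Any


def window_state_args(pid: int, window_id: int,
                      include_screenshot: bool = True) -> dict[str, Any]:
    """Build arguments for a get_window_state call (hypothetical helper)."""
    args: dict[str, Any] = {"pid": pid, "window_id": window_id}
    # The default (true) already returns both the tree and the screenshot,
    # so the flag is only sent when opting out of the screen grab.
    if not include_screenshot:
        args["include_screenshot"] = False
    return args


# Default perception: tree and screenshot together.
full = window_state_args(4321, 17)

# Cheap re-index before an element ax action: tree only.
tree_only = window_state_args(4321, 17, include_screenshot=False)
```

Note that nothing in these arguments selects a rung; that choice is made later, on the action call.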

capture_mode is deprecated and ignored. It is still accepted on get_window_state so old callers don't error, but it has no effect — both the tree and the screenshot come back regardless of what you pass. The old ax / vision / som / screenshot values all decode (som mapped to ax, screenshot to vision) but none changes what is captured. There is no capture-mode choice anymore; perception is always both. (The separately named screenshot tool — raw PNG, no AX walk — is unrelated.)
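Because capture_mode no longer has any effect, older callers can simply stop sending it. The sketch below is one possible client-side cleanup, not part of the driver: the helper name strip_capture_mode is hypothetical, and the alias table only mirrors the decoding described above (som to ax, screenshot to vision) so the legacy value can be logged during migration.

```python
from typing import Any, Optional

# Legacy values that decode to another name, as described above.
LEGACY_ALIASES = {"som": "ax", "screenshot": "vision"}


def strip_capture_mode(args: dict[str, Any]) -> tuple[dict[str, Any], Optional[str]]:
    """Drop the ignored capture_mode field from get_window_state arguments.

    Returns the cleaned arguments and the decoded legacy value (or None),
    which is useful only for logging which callers still send it.
    """
    cleaned = {key: value for key, value in args.items() if key != "capture_mode"}
    legacy = args.get("capture_mode")
    decoded = LEGACY_ALIASES.get(legacy, legacy)
    return cleaned, decoded


cleaned, decoded = strip_capture_mode(
    {"pid": 4321, "window_id": 17, "capture_mode": "som"}
)
```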

2. Action rung: how the target is addressed

You don't pick a capture mode; you pick how you address the target on the action call, and that one choice selects the rung:

| Rung | Address with | Dispatches through | Properties |
| --- | --- | --- | --- |
| element ax action | element_index / element_token | the accessibility rung: UIA Invoke (Windows), AXPerformAction (macOS), AT-SPI doAction (Linux) | Backgroundable, z-order-independent, and the only driver-verifiable rung. |
| element px action | x, y | the pixel rung, reading the coordinate straight off the screenshot already in the get_window_state response | Best-effort; the caller confirms the effect off the screenshot. |

Default to the element ax action — it is verifiable and backgroundable. Drop to an element px action when the tree can't disambiguate (repeated or empty labels), when it's empty (degraded — a non-AX surface), when an action came back suspected_noop, or when the tree disagrees with the pixels. You never re-capture to switch rungs: the screenshot is already in the snapshot, so you only change how you address the target.
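The fallback rule above can be written down as a small caller-side decision function. This is a minimal sketch: the Grounding class and its field names are hypothetical stand-ins for whatever the caller has learned from the snapshot and the previous action, and only the "suspected_noop" value comes from this page.

```python
from dataclasses import dataclass


@dataclass
class Grounding:
    """What the caller knows about the target (hypothetical structure)."""
    tree_is_empty: bool = False               # degraded, non-AX surface
    label_is_ambiguous: bool = False          # repeated or empty labels
    tree_disagrees_with_pixels: bool = False  # geometry or value mismatch
    last_effect: str = ""                     # effect of the previous attempt


def choose_rung(g: Grounding) -> str:
    """Return "ax" by default and "px" only when the tree cannot be relied on."""
    if (g.tree_is_empty
            or g.label_is_ambiguous
            or g.tree_disagrees_with_pixels
            or g.last_effect == "suspected_noop"):
        return "px"
    return "ax"


default_rung = choose_rung(Grounding())
after_noop = choose_rung(Grounding(last_effect="suspected_noop"))
```

Switching rungs here changes only how the next action is addressed; the snapshot that produced the Grounding is reused as-is.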

Both rungs apply to the keyboard family (type_text, press_key, hotkey), not just the pointer tools. Address by element_index (ax) to target a field with no pre-click, or by x, y (px) — which pixel-clicks at (x, y) to establish real renderer focus, then delivers the keystroke(s) to the now-focused element. The px form is the one-call path for Chromium/Electron inputs the AX layer can't focus: type_text({ pid, window_id, x, y, text }) focuses and types in a single call. The two forms are mutually exclusive. (set_value is the exception — it stays ax-only, because it sets the value of a non-text control like a dropdown, checkbox, or slider; its pixel counterpart is a click/drag on the control, not a "set value at a pixel.")
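To make the two addressing forms concrete, here is a sketch of a client-side builder for type_text arguments that enforces the mutual exclusion before the call is sent. The helper name type_text_args and the sample values are hypothetical; the field names (pid, window_id, element_index, x, y, text) are the ones used on this page. A call with neither address is passed through unchanged, since this page does not say how the driver treats it.

```python
from typing import Any, Optional


def type_text_args(pid: int, window_id: int, text: str, *,
                   element_index: Optional[int] = None,
                   x: Optional[int] = None,
                   y: Optional[int] = None) -> dict[str, Any]:
    """Build type_text arguments for either the ax or the px form."""
    by_element = element_index is not None
    by_pixel = x is not None or y is not None
    if by_element and by_pixel:
        raise ValueError("address by element_index or by x, y, not both")
    if by_pixel and (x is None or y is None):
        raise ValueError("the px form needs both x and y")

    args: dict[str, Any] = {"pid": pid, "window_id": window_id, "text": text}
    if by_element:
        args["element_index"] = element_index  # ax form: no pre-click
    elif by_pixel:
        args["x"], args["y"] = x, y            # px form: click to focus, then type
    return args


ax_form = type_text_args(4321, 17, "hello", element_index=12)
px_form = type_text_args(4321, 17, "hello", x=240, y=96)
```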

3. Dispatch: how input is delivered

Set per call on the input family (click, double_click, right_click, drag, scroll, type_text, press_key, hotkey) with the delivery_mode field. This field is a shared cross-platform parameter — the same two values, accepted uniformly on Windows, macOS, and Linux.

| delivery_mode | Behavior |
| --- | --- |
| background (default) | Input is routed to the target process/window/element directly. The user's frontmost app, real cursor, and window z-order are untouched. This is the no-foreground contract. |
| foreground | The target is briefly fronted (pair with bring_to_front to avoid a per-call flash), input lands on the now-active window, then the prior frontmost is restored. It is the explicit last resort when a background attempt did not land, and the only path for apps that accept events solely when foregrounded (DirectInput games, raw-input canvases). |

Only background and foreground are valid; the historical auto heuristic is removed. Element ax actions (element_index) are inherently background — they address an element, not the focused window — so they hold the contract without any delivery_mode flag. The dispatch axis matters most for the pixel rung (x, y), where background routes the event to the target and foreground raises the window first.
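A caller can catch a stale auto value before it reaches the driver. The following sketch builds arguments for a pixel-rung click and rejects anything other than the two valid modes; the helper name click_args and the sample values are hypothetical, and how the driver itself reports an invalid value is not covered here.

```python
from typing import Any

VALID_DELIVERY_MODES = ("background", "foreground")


def click_args(pid: int, window_id: int, x: int, y: int,
               delivery_mode: str = "background") -> dict[str, Any]:
    """Build arguments for a px-rung click with an explicit delivery mode."""
    if delivery_mode not in VALID_DELIVERY_MODES:
        # The historical "auto" heuristic is removed, so it fails here too.
        raise ValueError(f"unsupported delivery_mode: {delivery_mode!r}")
    return {
        "pid": pid,
        "window_id": window_id,
        "x": x,
        "y": y,
        "delivery_mode": delivery_mode,
    }


quiet_click = click_args(4321, 17, 240, 96)
last_resort = click_args(4321, 17, 240, 96, delivery_mode="foreground")
```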

4. Capture scope: what coordinate space the action targets

Set with set_config capture_scope=….

| capture_scope | Coordinate space | Capture surface |
| --- | --- | --- |
| window (default) | Per-window. Actions carry pid + window_id; coordinates are window-relative or addressed by element_index. | get_window_state |
| desktop | Screen-absolute. Window-less actions (no pid/window_id) land at absolute screen coordinates via hit-testing (WindowFromPoint). | get_desktop_state (full display, no downscale) |

Desktop scope is the "Computer-Use 1.0" loop: the agent reads the whole screen and clicks absolute coordinates, the way a screenshot-only model expects. Window scope is the default because it is what makes background, concurrent automation possible.

Response signals: verification, effect, and escalation

Dispatch success is not the same thing as application state change. The driver can verify an effect only when it can read the changed state back through the accessibility layer. That is why verified: true is reserved for AX read-back: it means the driver observed the effect, not merely that it sent an event. Pixel input, foreground input, and echo-prone AX surfaces can be routed correctly while still leaving confirmation to the caller.

effect is the confidence signal that separates those cases. "confirmed" means the driver verified the result through AX read-back. "unverifiable" means the dispatch path ran, but the driver cannot prove the application applied it. "suspected_noop" means an AX action dispatched but almost certainly did not change the target. Callers should treat effect, not the transport-level success status, as the action outcome.
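One way a caller might branch on effect is sketched below. It assumes effect is a top-level field of the action response, which this page implies but does not spell out, and the function name and the returned step labels are hypothetical.

```python
from typing import Any


def interpret_effect(response: dict[str, Any]) -> str:
    """Map the effect field of an action response to a caller-side next step."""
    effect = response.get("effect")
    if effect == "confirmed":
        return "done"                     # driver verified via AX read-back
    if effect == "unverifiable":
        return "confirm_from_screenshot"  # dispatch ran; the caller must check
    if effect == "suspected_noop":
        return "retry_on_another_rung"    # AX action almost certainly did nothing
    return "unknown"                      # field missing or unrecognized value


step = interpret_effect({"verified": False, "effect": "unverifiable"})
```

The transport-level success status is deliberately not consulted, in line with the guidance above.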

escalation is the machine-readable climb-the-ladder hint. When present, it tells the caller which surface to try next: "px" for acting off the screenshot, "foreground" for explicitly fronting the target, or "page" for the browser-tab DOM path through the page tool. See Choose an action rung and dispatch mode for the procedural ladder and MCP tools for the field table.
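The three hint values translate into three different follow-up calls. The sketch below maps each to a short description of what the caller changes; it takes the hint as a plain string because the exact shape of the escalation object is not described on this page, and the function name and wording of the descriptions are hypothetical.

```python
from typing import Optional

# What the caller changes for each escalation hint named above.
NEXT_STEP = {
    "px": "re-issue the action addressed by x, y read off the screenshot",
    "foreground": "re-issue the action with delivery_mode set to foreground",
    "page": "switch to the page tool and act through the browser-tab DOM",
}


def next_step(hint: Optional[str]) -> Optional[str]:
    """Return the follow-up for an escalation hint, or None when there is none."""
    if hint is None:
        return None  # no escalation present: nothing to climb to
    return NEXT_STEP.get(hint)


follow_up = next_step("px")
```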

When the tree lies

Some accessibility layers echo writes they did not apply. Electron can report an AX value change through its shim while the renderer stays unchanged. Catalyst controls can expose null AXValues. Chromium/WebKit web content can reflect a write through the accessibility bridge without proving the DOM or rendered view changed.

The driver treats those as surface-aware verification cases. It probes at the element level for a web-content surface, including an AXWebArea ancestor, so native chrome such as a browser address bar stays trusted while browser-tab content does not get a false confirmation. On those surfaces the driver refuses false verified: true responses and returns verified: false, effect: "unverifiable", and an escalation object instead. Electron app surfaces recommend "px" so the caller can act by pixel off the screenshot in the same response; browser-tab web content recommends "page" so the caller can switch to DOM/CDP via the page tool.
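The surface-to-escalation policy described above can be summarized as a lookup. This is only an illustration of the stated behavior, not the driver's implementation: the surface labels ("electron_app", "browser_tab_content", "native_chrome") and the function name are hypothetical.

```python
from typing import Optional

# Echo-prone surfaces and the escalation each one recommends, per the text above.
ESCALATION_BY_SURFACE = {
    "electron_app": "px",           # act by pixel off the same response's screenshot
    "browser_tab_content": "page",  # switch to DOM/CDP via the page tool
}


def recommended_escalation(surface: str) -> Optional[str]:
    """Return the escalation hint for a surface, or None when AX is trusted."""
    # Native chrome such as a browser address bar stays trusted, so it is
    # absent from the table and gets no escalation.
    return ESCALATION_BY_SURFACE.get(surface)


hint_for_tab = recommended_escalation("browser_tab_content")
```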

The Validity Matrix

Perception is no longer an axis in this matrix — every get_window_state returns both the tree and a screenshot, so there is nothing to cross here. The real constraints are capture scope (window or desktop) against the action rung (ax or px) and delivery (background or foreground). Capture scope is the constraining axis: window scope supports any combination of rung and delivery; desktop scope supports only the px rung with foreground delivery.

| capture_scope | action rung | delivery_mode | Valid? | Why |
| --- | --- | --- | --- | --- |
| window | ax (element_index) | background | ✅ | The default. Semantic actions on a backgrounded window. |
| window | ax (element_index) | foreground | ✅ | Activate, then act by element. |
| window | px (x, y) | background | ✅ | Click a coordinate off the window's screenshot without raising it. |
| window | px (x, y) | foreground | ✅ | Activate, then click by coordinate. |
| desktop | px (x, y) | foreground | ✅ | The only desktop combination. Read the whole screen, click absolute coordinates on the active desktop. |
| desktop | ax (element_index) | any | ❌ | A desktop-absolute action has no window_id, so there is no element tree to resolve an element_index against; there is no ax rung. |
| desktop | any | background | ❌ | Screen-absolute input hits whatever owns those pixels on the active desktop; there is no per-process route, so it cannot be backgrounded. |
| window | (window-less) | any | ❌ | A window-less (no pid/window_id) action while scope is window is rejected with the structured desktop_scope_disabled error. |

The rejections are enforced, not advisory. A window-less click while capture_scope=window returns desktop_scope_disabled, pointing the caller at set_config capture_scope=desktop. Desktop scope inherently foregrounds and works on pixels, so it cannot honor the background contract or reach the ax rung, which is exactly why background, per-window automation is the default.
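The matrix is small enough to encode directly, which can be handy for pre-flight checks in a client. The sketch below is one such encoding under the rules stated on this page; the function name, the window_less parameter, and the reason strings other than desktop_scope_disabled are hypothetical.

```python
def check_combination(scope: str, rung: str, delivery: str,
                      window_less: bool = False) -> tuple[bool, str]:
    """Return (valid, reason) for a scope / rung / delivery combination."""
    if scope not in ("window", "desktop"):
        raise ValueError(f"unknown capture_scope: {scope!r}")
    if rung not in ("ax", "px"):
        raise ValueError(f"unknown action rung: {rung!r}")
    if delivery not in ("background", "foreground"):
        raise ValueError(f"unknown delivery_mode: {delivery!r}")

    if scope == "window":
        # Window scope accepts every rung and delivery, but only for
        # actions that carry pid + window_id.
        if window_less:
            return False, "desktop_scope_disabled"
        return True, "ok"

    # Desktop scope: only the px rung with foreground delivery.
    if rung == "ax":
        return False, "no window_id, so no element tree to resolve element_index"
    if delivery == "background":
        return False, "screen-absolute input cannot be backgrounded"
    return True, "ok"


default_combo = check_combination("window", "ax", "background")
desktop_combo = check_combination("desktop", "px", "foreground")
```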

Platform Support

| Axis / value | Windows | macOS | Linux |
| --- | --- | --- | --- |
| action rung: ax (element_index) / px (x, y) | ✅ | ✅ | ✅ |
| dispatch: background (the contract) | ✅ | ✅ | ✅ (X11 and Wayland via AT-SPI; raw keyboard into native-Wayland apps is the residual gap) |
| dispatch: foreground + bring_to_front | ✅ explicit activation | ✅ explicit activation (NSRunningApplication.activate) | ✅ X11 EWMH activation (_NET_ACTIVE_WINDOW + input focus); Wayland raise is compositor-constrained |
| capture_scope: desktop (full screen-absolute loop) | ✅ | rolling out | rolling out |

Both action rungs work on all three platforms, so window-scope automation — the four window-scope rows of the matrix — works everywhere. The desktop-scope loop (get_desktop_state plus window-less screen-absolute input via hit-testing) is complete on Windows and rolling out to macOS and Linux; on those platforms a window-less action under window scope is still rejected. See the MCP tool reference for per-tool parameters and the no-foreground contract for how background dispatch is implemented on each OS.