Cua Docs

Choose an Action Rung and Dispatch Mode

Pick how each action addresses its target — the accessibility (ax) rung or the pixel (px) rung — plus dispatch and scope. Start on the accessibility path, fall back to a pixel action, and escalate to foreground only when you must.

This guide shows you how to choose the way the agent acts on a target, so that each action lands the first time and stays in the background when it can.

When to use this guide

Use this when you are wiring an agent to Cua Driver and deciding what to pass to click, type_text, get_window_state, and friends — element_index versus x, y, delivery_mode: background versus foreground, and window versus desktop scope.

Before you start

You should already be connected to the driver and able to launch an app and read a window. Perception is no longer a mode you pick: get_window_state returns both the accessibility tree and a screenshot in one call by default, and you choose the rung at action time. For the concepts behind the axes, see Capture and dispatch modalities; this guide only covers what to call.

Start on the accessibility path (the default)

Default to the element ax action with background dispatch — act by element_index. It is the only rung the driver can verify, it never steals focus, and it works on Windows, macOS, and Linux (X11 and Wayland).

Read the window once, then act on an element from that snapshot. The snapshot already carries the screenshot too, so you never re-capture to switch how you address the target:

// 1. snapshot — returns the accessibility tree AND a screenshot by default
get_window_state({ pid, window_id })
// → elements[], each with an element_index and a frame, plus a grounding screenshot
 
// 2. act by element_index — the element ax action, background, no focus steal
click({ pid, window_id, element_index: 12 })
type_text({ pid, text: "hello" })

When you are only re-indexing before an element ax action and don't need fresh pixels, pass include_screenshot: false to skip the screen grab and get the tree alone — a cheap perf opt-out, not a modality choice. The ax versus px decision still happens at action time, by how you address the target. To pin the rendered frame to disk instead of inlining it, set screenshot_out_file.
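
One way to keep that opt-out in a single place is a small helper that builds the get_window_state arguments. The TypeScript sketch below assumes only the parameter names used above (pid, window_id, include_screenshot, screenshot_out_file). The helper name snapshotArgs, its options object, and the numeric types for pid and window_id are hypothetical, and it only builds the argument object; sending it to the driver is left to your client.

```typescript
// Arguments for get_window_state, limited to the parameter names in this guide.
// Assumption: pid and window_id are numbers.
interface SnapshotArgs {
  pid: number;
  window_id: number;
  include_screenshot?: boolean;
  screenshot_out_file?: string;
}

// Hypothetical helper options; these are not part of the driver API.
interface SnapshotOptions {
  needPixels?: boolean; // defaults to true: tree and screenshot together
  outFile?: string;     // where to pin the rendered frame, if anywhere
}

function snapshotArgs(
  pid: number,
  windowId: number,
  opts: SnapshotOptions = {},
): SnapshotArgs {
  const args: SnapshotArgs = { pid, window_id: windowId };
  if (opts.needPixels === false) {
    // Tree-only re-index: skip the screen grab entirely.
    // Assumption: with no frame rendered, an output file is pointless,
    // so outFile is ignored in this branch.
    args.include_screenshot = false;
  } else if (opts.outFile !== undefined) {
    // Keep the screenshot but write it to disk instead of inlining it.
    args.screenshot_out_file = opts.outFile;
  }
  return args;
}
```

Calling snapshotArgs(pid, windowId) with no options leaves both optional fields unset, so the driver's default (tree plus screenshot) applies.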

Follow the escalation ladder

You don't have to guess whether the AX path failed: every action response carries the signals that point to the next rung. Walk the ladder in order, and move to the next rung only when the response says to:

  1. Element ax action, background (the default). Act by element_index. If the response shows effect: "confirmed", you're done: the driver read the result back. Follow escalation.recommended if any of these happens: get_window_state came back degraded (empty AX tree); an action returns effect: "suspected_noop" (the AX action dispatched but likely did nothing); an action returns effect: "unverifiable" on an echo-prone surface; or the tree disagrees with the screenshot (an h:1 or off-viewport row).

  2. Element px action, background. When escalation.recommended is "px", pick the target pixel from the screenshot already in the get_window_state response and click it — no re-capture to switch rung, because the screenshot was always there. Coordinates are window-relative for a windowed target.

    // the screenshot is already in the snapshot above — just read a pixel off it
    click({ pid, window_id, x: 320, y: 210 }) // → { path: "cgevent", effect: "unverifiable" }

    Use the same px form for keyboard fallback. If AX type_text, press_key, or hotkey returns effect: "unverifiable" on Electron/Chromium, retry with x, y: the tool pixel-clicks to focus, then sends the keys.

    type_text({ pid, window_id, element_index: 18, text: "hello" })
    // → { effect: "unverifiable", escalation: { recommended: "px", reason: "..." } }
     
    type_text({ pid, window_id, x: 320, y: 210, text: "hello" })
    press_key({ pid, window_id, x: 320, y: 210, key: "return" })
    hotkey({ pid, window_id, x: 320, y: 210, keys: ["cmd", "a"] })

    On Linux this still avoids synthetic input where it can: the driver resolves the pixel to the element under it and fires that element's action via AT-SPI doAction at that point. See Linux and Wayland for why.

  3. Browser-tab DOM. When escalation.recommended is "page", switch to the page tool for browser-tab DOM work instead of retrying the AX write.

    type_text({ pid, window_id, element_index: 18, text: "hello" })
    // → { effect: "unverifiable", escalation: { recommended: "page", reason: "..." } }
     
    page({
      pid,
      window_id,
      action: "execute_javascript",
      javascript: "document.querySelector('#search').value = 'hello'"
    })

  4. Foreground. If the response recommends "foreground" or the pixel click still doesn't land (DirectInput games, raw-input canvases such as Blender and Unity, focus-polling apps), retry with delivery_mode: "foreground", which activates the window first.

    click({ pid, window_id, x: 320, y: 210, delivery_mode: "foreground" })
    // → { path: "cgevent_fg", effect: "unverifiable" }

    Use foreground only for the action that needs it, and only when the user isn't actively working on the machine, because it raises the window. See Known limits for the specific apps.
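
The ladder above can be sketched as a pure decision function over an action response. This TypeScript is a minimal illustration, not the driver's own client code: the field names effect and escalation and their string values come from this guide, while the type names, the nextStep function, and the "reread" outcome label are hypothetical.

```typescript
// Response fields as described in this guide; other fields are omitted.
type Effect = "confirmed" | "unverifiable" | "suspected_noop";
type Rung = "px" | "foreground" | "page";

interface ActionResponse {
  effect: Effect;
  escalation?: { recommended: Rung; reason: string };
}

// Hypothetical result type: what the caller should do next.
type NextStep =
  | { kind: "done" }
  | { kind: "escalate"; to: Rung; reason: string }
  | { kind: "reread" };

function nextStep(res: ActionResponse): NextStep {
  // The driver verified the result through AX read-back: nothing more to do.
  if (res.effect === "confirmed") {
    return { kind: "done" };
  }
  // The response names the next rung: follow it rather than guessing.
  if (res.escalation !== undefined) {
    return {
      kind: "escalate",
      to: res.escalation.recommended,
      reason: res.escalation.reason,
    };
  }
  // Not confirmed and no hint: re-read the window and check for yourself.
  return { kind: "reread" };
}
```

Because the function only follows escalation.recommended, it never hard-codes the order of rungs, so a response that recommends "foreground" without an intermediate "px" step is handled the same way as any other.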

The escalation signal on the response

Two additive fields make the ladder explicit, so you escalate on data rather than a hunch:

  • effect is "confirmed" (the driver verified the result through AX read-back), "unverifiable" (the rung fired but only the caller can confirm), or "suspected_noop" (an AX action dispatched but almost certainly did nothing).
  • escalation is present when there's a next rung: { recommended: "px" | "foreground" | "page", reason }. A degraded get_window_state carries the same hint (recommending px).

click({ pid, window_id, element_index: 12 })
// → { effect: "suspected_noop", escalation: { recommended: "px", reason: "..." } }

Wayland exception. On a native Wayland session an unfocused window can't be pixel-targeted in the background — there is no global coordinate space and the compositor drops synthetic pointer events. So when an AX action no-ops there, the escalation skips the pixel rung and recommends foreground directly. See Linux and Wayland.

Switch to desktop scope only for screen-absolute work

Reach for desktop scope only when the action has no single window — dragging between windows, or clicking absolute screen coordinates. It inherently foregrounds and works on pixels, so it cannot honor the background contract.

set_config({ capture_scope: "desktop" })
get_desktop_state()                       // full-screen screenshot
click({ x: 1280, y: 40 })                 // screen-absolute, no window_id

A window-less click while scope is still window is rejected with desktop_scope_disabled — that error is the prompt to switch scope.
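
If you want to catch that mistake before the call leaves your agent, a client-side pre-check can mirror the rule. The TypeScript sketch below is an assumption-laden illustration: checkClick, the Scope and ClickArgs types, and the hint strings are hypothetical, and the driver remains the authority on what it accepts. Only the argument names and the desktop_scope_disabled error name come from this guide.

```typescript
type Scope = "window" | "desktop";

// Click arguments named in this guide; all optional so bad shapes can be checked.
interface ClickArgs {
  pid?: number;
  window_id?: number;
  element_index?: number;
  x?: number;
  y?: number;
}

// Returns null when the shape looks acceptable, otherwise a hint string.
function checkClick(scope: Scope, args: ClickArgs): string | null {
  const hasWindow = args.window_id !== undefined;
  const hasElement = args.element_index !== undefined;
  const hasPixel = args.x !== undefined && args.y !== undefined;

  // A click has to address something: an element or a pixel.
  if (!hasElement && !hasPixel) {
    return "no target: pass element_index or x, y";
  }
  // In window scope a window-less click is what the driver rejects.
  if (!hasWindow && scope === "window") {
    return "desktop_scope_disabled: pass window_id, or set capture_scope to desktop";
  }
  return null;
}
```

A null result only means the call has a plausible shape for the current scope; it says nothing about whether the click will land.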

Confirm the action landed

Only AX read-back can produce verified: true (the driver read the result back). Echo-prone AX surfaces, pixel actions, and foreground actions return verified: false or omit it; use effect and escalation to decide the next call. After an unverifiable action, re-read and check (the re-read returns both the tree and the screenshot):

click({ pid, window_id, x: 320, y: 210 })   // → { verified: false, path: "cgevent", effect: "unverifiable" }
get_window_state({ pid, window_id })         // confirm the change against tree + screenshot
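
The rule for when to re-read fits in a small predicate. This TypeScript sketch assumes the response fields shown above (verified, path, effect). The ActionResult type and the needsReread name are hypothetical, and treating a missing verified field the same as false follows the description in this section.

```typescript
// Response fields used in this section; anything else is ignored here.
interface ActionResult {
  verified?: boolean;
  path?: string;
  effect?: string;
}

// True when the caller should re-read the window and confirm the change itself.
function needsReread(res: ActionResult): boolean {
  // Only an explicit verified: true means the driver read the result back.
  // verified: false and an omitted field both leave confirmation to the caller.
  return res.verified !== true;
}
```

A typical loop would call the action, pass the response to needsReread, and issue get_window_state only when it returns true.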

Troubleshooting

Problem: the call returned success but nothing changed (false success). Don't trust the status code on a verified: false action. Re-read the window — the snapshot carries both the tree and the screenshot — and confirm the effect; if it didn't land, switch rung (element ax action → element px action off the same screenshot) or escalate dispatch (background → foreground).

Problem: desktop_scope_disabled on a window-less click. You're in window scope. Either pass a window_id (+ element_index or x, y), or set_config({ capture_scope: "desktop" }) for genuinely screen-absolute work.

Problem: keystrokes don't land on a Linux app. On native Wayland, raw keys have no background path. Type into accessible fields with type_text, drive controls by element_index, or run the app under XWayland. See Known limits.