Cua DriverGuideGetting Started

Quickstart

Drive your first macOS app with Cua Driver

In under 5 minutes, you'll launch a macOS app in the background, snapshot its AX tree, click a button by its element index, and verify the action landed.

Start the daemon

Element-indexed workflows require a persistent daemon. The per-pid element cache lives in-process, so one-shot CLI invocations lose it between calls.

open -n -g -a CuaDriver --args serve
cua-driver status
# cua-driver daemon is running
#   socket: /Users/you/Library/Caches/cua-driver/cua-driver.sock
#   pid: 12345

cua-driver serve & also works. The CLI auto-relaunches itself via LaunchServices when your shell's TCC context is wrong (any IDE terminal).

Switch to som mode so snapshots carry both the AX tree and a screenshot:

cua-driver config set capture_mode som

Launch an app (hidden)

cua-driver launch_app '{"bundle_id":"com.apple.calculator"}'

Output:

✅ Launched Calculator (pid 844) in background.

Windows:
- "Calculator" [window_id: 10725]
→ Call get_window_state(pid: 844, window_id) to inspect.

The app's pid and a windows array come back in one call. launch_app is idempotent: relaunching a running app returns the existing pid with no side effects. The window's AX tree is fully populated (clickable via element_index) but not drawn on screen.

Snapshot the window

cua-driver get_window_state '{"pid":844,"window_id":10725}'

Output (trimmed):

✅ Calculator — 34 elements, turn 1 + screenshot

- AXApplication "Calculator"
  - [0] AXWindow "Calculator" actions=[AXRaise]
    - [1] AXButton "All Clear"
    - [2] AXButton "Plus/Minus"
    - [3] AXButton "Percent"
    - [4] AXButton "Divide"
    - [5] AXButton "Seven"
    ...
    - [14] AXButton "Three"
    ...

Every actionable element is tagged with [N] — that's the element_index you pass to click, type_text, and friends. The index map is replaced on every snapshot, keyed on (pid, window_id), so always snapshot before acting.

Large trees (Finder is ~1600 elements) exceed most LLM context limits. Pass a query field to filter the Markdown to matching lines plus their ancestors: get_window_state '{"pid":844,"window_id":10725,"query":"Three"}'.

Click by element_index

cua-driver click '{"pid":844,"window_id":10725,"element_index":14}'
# ✅ Performed AXPress on [14] AXButton "Three".

No cursor moved. Calculator never came to the foreground. The click went through AXUIElementPerformAction directly.

Verify the action landed

Re-snapshot and check the AX tree diff:

cua-driver get_window_state '{"pid":844,"window_id":10725}'

Look for the display AXStaticText reading 3. If the tree didn't change, the action failed silently, and you should say so rather than assume success.

The pixel-click variant

Pixel clicks are for surfaces the AX tree doesn't reach: canvases, video players, WebGL, custom controls. Coordinates are window-local screenshot pixels (same space as the PNG get_window_state returns).

# Write the screenshot to disk. Works in every capture mode.
cua-driver get_window_state '{"pid":844,"window_id":10725}' --image-out /tmp/shot.png

# Look at /tmp/shot.png, pick a target pixel, then:
cua-driver click '{"pid":844,"window_id":10725,"x":120,"y":240}'
# ✅ Posted click to pid 844.

The PNG is capped at 1568 px long-side by default (matching Anthropic's multimodal-vision downsampling limit), so the image you reason over and the coordinate space the click tool expects are the same resolution. No scaling math.

The window_id field is optional on pixel clicks but recommended. It pins the coordinate conversion to the window whose screenshot produced the pixel, rather than letting the driver pick heuristically.

Cleanup

# Quit the target app via a hotkey to its pid.
cua-driver hotkey '{"pid":844,"keys":["cmd","q"]}'

# Stop the daemon.
cua-driver stop

The canonical loop

Every multi-step workflow follows the same shape:

open -n -g -a CuaDriver --args serve

# 1. Launch the target (idempotent; returns pid + windows).
cua-driver launch_app '{"bundle_id":"..."}'

# 2. Snapshot the window you care about (populates the element cache).
cua-driver get_window_state '{"pid":<pid>,"window_id":<wid>}'

# 3. Dispatch an action by element_index or pixel coordinates.
cua-driver click '{"pid":<pid>,"window_id":<wid>,"element_index":<n>}'

# 4. Re-snapshot to verify the action landed.
cua-driver get_window_state '{"pid":<pid>,"window_id":<wid>}'

cua-driver stop

The snapshot-before-AND-after invariant is not optional. Indices are stale across turns. Actions that silently no-op (disabled buttons, minimized windows that reject keyboard commits, Chromium right-clicks) are indistinguishable from successes without the post-action diff.

What's next

  • Full CLI reference: every subcommand and its flags. See CLI reference.
  • MCP tool schemas: the authoritative input shape for every tool. See MCP tools.
  • Known limits: Chromium right-click coercion, canvas apps, off-Space SwiftUI. See Limits.
  • Common gotchas: "No cached AX state", minimized windows + keyboard commit, disabling the agent cursor. See the FAQ.

Was this page helpful?