MCP Tools

Specification for every MCP tool exposed by Cua Driver

Cua Driver exposes 36 MCP tools through a single stdio server (cua-driver mcp). Every tool is also callable from the shell as cua-driver call <tool-name> '<JSON-args>'.

Tool names are snake_case. Responses are MCP CallTool.Result envelopes: a text content block prefixed with a summary (or the error reason on failure), plus optional image or structured-content blocks on tools that produce them.

Tool names match the CLI dispatch form exactly. cua-driver call list_apps '{}' and the MCP list_apps tool run the same code path.
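As a rough illustration of the shell form, the Python sketch below builds the argument vector for cua-driver call and optionally runs it. build_call_argv and call_tool are hypothetical helper names, not part of Cua Driver, and the sketch assumes cua-driver is on PATH.

```python
import json
import subprocess


def build_call_argv(tool_name, args=None):
    """Build the argv for: cua-driver call <tool-name> '<JSON-args>'."""
    # The JSON payload travels as one argv entry, so no shell quoting is needed.
    return ["cua-driver", "call", tool_name, json.dumps(args or {})]


def call_tool(tool_name, args=None):
    """Run the tool through the CLI and return stdout (assumes cua-driver is installed)."""
    result = subprocess.run(
        build_call_argv(tool_name, args),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, build_call_argv("list_apps") yields the same invocation as cua-driver call list_apps '{}'.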

TCC auto-delegation. When an MCP client spawns cua-driver mcp from an IDE terminal (Claude Code, Cursor, VS Code, Warp), macOS attributes the subprocess to the parent terminal, not CuaDriver.app, so AX probes fail against the wrong bundle id. mcp detects this, auto-launches a cua-driver serve daemon via open -n -g -a CuaDriver --args serve, and proxies every tool call through the daemon's Unix socket. Tool semantics are identical to the in-process path. Pass --no-daemon-relaunch (or set CUA_DRIVER_MCP_NO_RELAUNCH=1) to force in-process execution.


Inspection tools

list_apps

List macOS apps, both currently running and installed-but-not-running, with per-app state flags.

Per-record fields:

| Field | Description |
| --- | --- |
| running | Whether a process for this app is live. pid is 0 when false. |
| active | Whether it is the system-frontmost app (implies running). |
| launch_path | Filesystem path to the .app bundle, when known. |
| kind | "desktop" for .app bundles on macOS. |
| last_used | RFC 3339 timestamp from the bundle's filesystem mtime, or null. |

Only apps with NSApplicationActivationPolicyRegular are included. Installed apps are scanned from /Applications, /Applications/Utilities, ~/Applications, /System/Applications, and /System/Applications/Utilities.

Arguments: none.


list_windows

List all layer-0 top-level windows currently known to WindowServer, including off-screen windows (minimized, on another Space, hidden-launched).

Per-record fields: window_id, pid, app_name, title, bounds (x/y/width/height, top-left origin), z_index (higher = frontmost), is_on_screen, on_current_space.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| on_screen_only | boolean | No | When true, drop windows not on the current Space. Default false. |
| pid | integer | No | Optional pid filter. When set, only this pid's windows are returned. |

get_window_state

Walk a running app's AX tree and return a Markdown rendering of its UI, tagging every actionable element with [element_index N]. Also captures a PNG screenshot of the specified window.

Invariant: call get_window_state once per turn per (pid, window_id) before any element-indexed action. The index map is replaced by the next snapshot.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| pid | integer | Yes | Target process ID. |
| window_id | integer | Yes | Target window ID from list_windows. |
| capture_mode | string | No | som = AX + screenshot (default); vision = screenshot only; ax = AX only. |
| query | string | No | Case-insensitive filter for tree_markdown. Element indices are unchanged. |
| screenshot_out_file | string | No | When set, write the PNG to this file path (~ expanded) instead of embedding base64 in the response. |

{"pid": 844, "window_id": 10725}
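
The invariant above can be mirrored on the client side. The sketch below is a minimal illustration, assuming the tags appear literally as [element_index N] in tree_markdown; SnapshotCache and extract_element_indices are hypothetical names, and the sample text is made up rather than real driver output.

```python
import re

# Assumption: actionable elements are tagged literally as "[element_index N]".
ELEMENT_TAG = re.compile(r"\[element_index (\d+)\]")


def extract_element_indices(tree_markdown):
    """Return every element index found in a tree_markdown string, in order."""
    return [int(n) for n in ELEMENT_TAG.findall(tree_markdown)]


class SnapshotCache:
    """Client-side mirror of the per-(pid, window_id) index map (hypothetical)."""

    def __init__(self):
        self._indices = {}

    def store(self, pid, window_id, tree_markdown):
        # A new snapshot replaces the previous index set for the same window.
        self._indices[(pid, window_id)] = set(extract_element_indices(tree_markdown))

    def is_valid(self, pid, window_id, element_index):
        """True when the index belongs to the latest snapshot of that window."""
        return element_index in self._indices.get((pid, window_id), set())


# Made-up sample rendering; real output will differ.
sample = '- AXButton "OK" [element_index 3]\n- AXTextField [element_index 14]'
cache = SnapshotCache()
cache.store(844, 10725, sample)
```

Checking is_valid before an element-indexed action is one way to catch a stale index before the tool call is sent.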

get_accessibility_tree

Return a lightweight snapshot of the desktop: running regular apps and on-screen visible windows with their bounds, z-order, and owner pid. No TCC grants required.

Arguments: none.


get_screen_size

Return the logical size of the main display in points plus its backing scale factor. Agents click in points; Retina displays have scale_factor 2.0. Requires no TCC permissions.

Arguments: none.


get_cursor_position

Return the current mouse cursor position in screen points (origin top-left).

Arguments: none.


get_config

Return the current cua-driver-rs configuration.

Arguments: none.


get_recording_state

Report the current trajectory recorder state: whether recording is enabled, the output directory (when enabled), and the 1-based counter for the next turn folder. Counter increments on every recorded action tool call and resets to 1 each time recording is (re-)enabled. Pure read-only.

Arguments: none.


get_agent_cursor_state

Return the current state of this session's agent cursor: position, config (color, icon, label, size, opacity), enabled flag. Scoped to the cursor the call resolves to (precedence: explicit cursor_id > session identity > "default").

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| cursor_id | string | No | Cursor instance to inspect. Omit to target the calling session's own cursor (macOS); the anonymous path targets "default". |

Action tools

launch_app

Launch a macOS app in the background. The target does NOT come to the foreground.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| bundle_id | string | No | App bundle identifier, e.g. com.apple.calculator. Preferred over name. |
| name | string | No | App display name. Used only when bundle_id is absent. |
| urls | array of string | No | File paths or URLs to open with the app. |
| electron_debugging_port | integer | No | Open a Chrome DevTools Protocol server on this port (appends --remote-debugging-port=N). |
| webkit_inspector_port | integer | No | Open a WebKit inspector server on this port (sets WEBKIT_INSPECTOR_SERVER env var). |
| creates_new_application_instance | boolean | No | When true, force a new app instance even if already running (open -n). Use for concurrent multi-agent/multi-session work. |
| additional_arguments | array of string | No | Extra arguments appended after --args when launching. |

Returns the launched app's pid, bundle_id, name, and a windows array (same shape as list_windows). When the focus-steal demotion check ran, the response also includes self_activation_suppressed: bool.


kill_app

Force-terminate a process by pid (kill -9 equivalent on macOS / Linux; taskkill /F equivalent on Windows). Unsaved state is lost.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| pid | integer | Yes | PID of the process to terminate. |

{"pid": 844}

bring_to_front

Activate a window so subsequent input tools with dispatch:"foreground" land on it without a per-call SetForegroundWindow flash. Windows-only: on macOS this tool returns an error, and on Linux it is a stub.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| pid | integer | Yes | Target process ID. |
| window_id | integer | No | Optional target window ID. |

{"pid": 844}

click

Left-click against a target pid.

Two addressing modes:

  • Element index (element_index + window_id): AX action path. Works on backgrounded/hidden windows. No cursor move, no focus steal. Cache is scoped per (pid, window_id) and is replaced by the next snapshot.
  • Pixel (x, y): CGEvent path. Synthesizes mouse events posted to pid. Needs a visible on-screen window.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| pid | integer | Yes | Target process ID. |
| element_index | integer | No | Element index from last get_window_state. |
| window_id | integer | No | Target window ID. Required for element_index. |
| x | number | No | Window-local screenshot X coordinate. |
| y | number | No | Window-local screenshot Y coordinate. |
| action | string | No | AX action: press (default), show_menu, pick, confirm, cancel, open. |
| count | integer | No | Click count (pixel path only). Default 1. |
| modifier | array of string | No | Modifier keys: cmd, shift, option/alt, ctrl. |
| from_zoom | boolean | No | When true, x and y are in the last zoom image for this pid; driver translates back to full-window coordinates. |
| debug_image_out | string | No | File path for a diagnostic screenshot with a red crosshair at (x, y). |

{"pid": 844}

double_click

Double-click at (x, y) or on an AX element identified by element_index + window_id.

AX path: performs AXOpen when the element advertises it; otherwise resolves the element's on-screen center and falls back to a pixel double-click.

Pixel path: two down/up pairs ~80 ms apart.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| pid | integer | Yes | Target process ID. |
| element_index | integer | No | Element index from last get_window_state. Uses AX path. |
| window_id | integer | No | CGWindowID. Required when element_index is used. |
| x | number | No | Screen X coordinate (pixel path). |
| y | number | No | Screen Y coordinate (pixel path). |

{"pid": 844}

right_click

Right-click against a target pid.

Two addressing modes:

  • Element index (element_index + window_id): performs AXShowMenu. Pure AX RPC, works on backgrounded/hidden windows.
  • Pixel (x, y): synthesizes rightMouseDown/rightMouseUp CGEvent pair posted to the pid.

Exactly one of element_index or (x AND y) must be provided. pid is always required.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| pid | integer | Yes | Target process ID. |
| element_index | integer | No | Element index from last get_window_state. Routes through AXShowMenu. Requires window_id. |
| window_id | integer | No | CGWindowID. Required when element_index is used. |
| x | number | No | X in window-local screenshot pixels. |
| y | number | No | Y in window-local screenshot pixels. |
| modifier | array of string | No | Modifier keys held during right-click: cmd/shift/option/ctrl. Pixel path only. |

{"pid": 844, "x": 100, "y": 200}
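
The addressing rule for right_click can be checked before a call is sent. The following is a minimal client-side sketch of that rule; right_click_mode is a hypothetical helper, and the error messages are illustrative rather than the driver's own.

```python
def right_click_mode(args):
    """Return "element_index" or "pixel" for a right_click argument dict.

    Raises ValueError when the documented addressing rule is violated.
    """
    if "pid" not in args:
        raise ValueError("pid is always required")
    has_x, has_y = "x" in args, "y" in args
    if has_x != has_y:
        raise ValueError("x and y must be provided together")
    has_index = "element_index" in args
    has_pixel = has_x and has_y
    if has_index == has_pixel:
        raise ValueError("provide exactly one of element_index or (x and y)")
    if has_index and "window_id" not in args:
        raise ValueError("element_index requires window_id")
    return "element_index" if has_index else "pixel"
```

A payload carrying only pid is rejected by this check, since neither addressing mode is selected.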

drag

Press-drag-release gesture from (from_x, from_y) to (to_x, to_y) in window-local screenshot pixels.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| pid | integer | Yes | Target process ID. |
| from_x | number | Yes | Drag-start X in window-local screenshot pixels. |
| from_y | number | Yes | Drag-start Y in window-local screenshot pixels. |
| to_x | number | Yes | Drag-end X in window-local screenshot pixels. |
| to_y | number | Yes | Drag-end Y in window-local screenshot pixels. |
| window_id | integer | No | CGWindowID for the window the pixel coordinates were measured against. |
| duration_ms | integer | No | Wall-clock duration of the drag path between mouseDown and mouseUp. Default 500. |
| steps | integer | No | Number of intermediate mouseDragged events. Default 20. |
| modifier | array of string | No | Modifier keys held across the entire gesture: cmd/shift/option/ctrl. |
| button | string | No | Mouse button used for the drag. Default: left. |
| from_zoom | boolean | No | When true, coordinates are in the last zoom image for this pid. |

{"from_x": 100, "from_y": 200, "pid": 844, "to_x": 300, "to_y": 400}
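
To make the steps and duration_ms arguments concrete, the sketch below computes intermediate points and a per-event delay. It assumes a straight-line path with evenly spaced events; the driver's actual interpolation is not specified here, and drag_path is a hypothetical name.

```python
def drag_path(from_x, from_y, to_x, to_y, steps=20, duration_ms=500):
    """Return (points, delay_ms) for a straight-line drag.

    points holds `steps` positions strictly between the start and end points;
    delay_ms is the pause between consecutive events so the whole gesture
    takes about duration_ms. Linear, evenly timed spacing is an assumption.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    points = []
    for i in range(1, steps + 1):
        t = i / (steps + 1)  # fraction of the way along the path
        points.append((from_x + (to_x - from_x) * t, from_y + (to_y - from_y) * t))
    delay_ms = duration_ms / (steps + 1)
    return points, delay_ms
```

With steps=3 and the example coordinates above, the intermediate points fall at a quarter, half, and three quarters of the way from (100, 200) to (300, 400).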

type_text

Insert text into the target pid via AXSetAttribute(kAXSelectedText). Works for standard Cocoa text fields and text views. No keystrokes are synthesized. For Chromium/Electron inputs that don't implement kAXSelectedText, the tool falls back to CGEvent character synthesis automatically.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| pid | integer | Yes | Target process ID. |
| text | string | Yes | Text to insert at the target's cursor. |
| element_index | integer | No | Element index from last get_window_state. Directs the write to a specific field. Requires window_id. |
| window_id | integer | No | CGWindowID. Required when element_index is used. |
| delay_ms | integer | No | Milliseconds between characters in the CGEvent fallback path. Default 30. Ignored when the AX path succeeds. |

{"pid": 844, "text": "hello"}

press_key

Press and release a single key, delivered to the target pid via CGEventPostToPid. No focus steal.

Three delivery paths:

  • window_id + element_index: focuses the AX element first, then posts via the auth-message path (Chromium-safe).
  • window_id only (no element_index): NSMenu path. Briefly activates the window, posts without the auth envelope so IOHIDPostEvent fires and NSApplication.sendEvent: dispatches NSMenu key equivalents. Restores prior frontmost immediately.
  • No window_id: standard auth-message path.

Key names: return, tab, escape, up/down/left/right, space, delete, home, end, pageup, pagedown, f1–f12, plus any letter or digit.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| pid | integer | Yes | Target process ID. |
| key | string | Yes | Key name: return, tab, escape, up, down, etc. |
| modifiers | array of string | No | Modifier keys: cmd, shift, option/alt, ctrl, fn. |
| element_index | integer | No | Element index. |
| window_id | integer | No | Target window ID. |

{"key": "return", "pid": 844}
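
The three delivery paths depend only on which optional arguments are present. As a small sketch of that selection logic, the function below returns a label for the path a given call would take; press_key_path and the label strings are hypothetical and exist only for illustration.

```python
def press_key_path(window_id=None, element_index=None):
    """Name the press_key delivery path implied by the optional arguments."""
    if window_id is not None and element_index is not None:
        # Focus the AX element first, then post via the auth-message path.
        return "focus-then-auth-message"
    if window_id is not None:
        # NSMenu path: brief activation, post without the auth envelope.
        return "nsmenu"
    # No window_id: standard auth-message path.
    return "auth-message"
```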

hotkey

Press a combination of keys simultaneously. The combo is posted directly to the target pid's event queue; the target does NOT need to be frontmost.

Two delivery paths:

  • Default (no window_id): auth-message envelope. Chromium/Electron apps accept the keystrokes as trusted live input on macOS 14+.
  • With window_id: NSMenu path. Briefly activates the target WindowServer-frontmost, posts without the auth envelope so IOHIDPostEvent fires and NSMenu key equivalents dispatch (e.g. Cmd+Z undo, Cmd+W close). Restores prior frontmost immediately.

Recognized modifiers: cmd/command, shift, option/alt, ctrl/control, fn. Order: modifiers first, one non-modifier last.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| pid | integer | Yes | Target process ID. |
| keys | array of string | Yes | Modifier(s) and one non-modifier key, e.g. ["cmd", "c"]. |
| window_id | integer | No | When set, uses NSMenu path. |

{"keys": ["cmd", "c"], "pid": 844}
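
The ordering rule for keys (modifiers first, one non-modifier last) can be validated before the call. This is a minimal sketch using the recognized modifier names listed above; validate_hotkey is a hypothetical helper, and it compares names exactly as given because case handling is not specified here.

```python
# Modifier names taken from the "Recognized modifiers" list.
MODIFIERS = {"cmd", "command", "shift", "option", "alt", "ctrl", "control", "fn"}


def validate_hotkey(keys):
    """Split a hotkey combo into modifiers and the final non-modifier key."""
    if not keys:
        raise ValueError("keys must not be empty")
    *mods, last = keys
    if last in MODIFIERS:
        raise ValueError("the last key must be a non-modifier")
    unknown = [k for k in mods if k not in MODIFIERS]
    if unknown:
        raise ValueError(f"only modifiers may precede the final key: {unknown}")
    return {"modifiers": mods, "key": last}
```

The sketch does not insist on at least one modifier; whether a bare key is accepted by hotkey is left open.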

set_value

Set a value on a UI element.

Two modes:

  • AXPopUpButton / select dropdown: finds the child option whose title or value matches value (case-insensitive) and AXPresses it directly. No native popup menu is opened.
  • All other elements: writes AXValue directly (sliders, steppers, date pickers, native text fields that expose a settable AXValue).

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| pid | integer | Yes | Target process ID. |
| element_index | integer | Yes | Element index from last get_window_state. |
| window_id | integer | Yes | CGWindowID for the window whose get_window_state produced the element index. |
| value | string | Yes | New value. AX will coerce to the element's native type. |

{"element_index": 14, "pid": 844, "value": "42", "window_id": 10725}

scroll

Scroll the target pid's focused region by synthesized keystrokes.

Mapping: by='page' → PageDown/PageUp × amount; by='line' → DownArrow/UpArrow × amount. Horizontal variants use Left/Right arrow keys.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| pid | integer | Yes | Target process ID. |
| direction | string | Yes | Scroll direction: up, down, left, right. |
| by | string | No | Scroll granularity: line (default) or page. |
| amount | integer | No | Number of keystroke repetitions. Default 3. |
| element_index | integer | No | Pre-focuses this element before scrolling. |
| window_id | integer | No | Required when element_index is set. |

{"direction": "down", "pid": 844}
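
The keystroke mapping can be written out as a small function. The sketch below uses press_key key names (pageup, pagedown, up, down, left, right) and assumes horizontal scrolling maps to the arrow keys for both granularities, which is one reading of the mapping above; scroll_keys is a hypothetical name.

```python
def scroll_keys(direction, by="line", amount=3):
    """Return the list of key names a scroll call would expand to."""
    if by not in ("line", "page"):
        raise ValueError("by must be 'line' or 'page'")
    if direction in ("left", "right"):
        # Assumption: horizontal scrolling always uses the arrow keys.
        key = direction
    elif direction in ("up", "down"):
        key = "page" + direction if by == "page" else direction
    else:
        raise ValueError("direction must be up, down, left, or right")
    return [key] * amount
```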

move_cursor

Move the agent cursor overlay to (x, y). Does NOT move the real mouse cursor.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| x | number | Yes | Target X coordinate in screen points. |
| y | number | Yes | Target Y coordinate in screen points. |
| cursor_id | string | No | Explicit cursor-instance override. Omit to target the calling session's own cursor. |

{"x": 100, "y": 200}

zoom

Capture a cropped JPEG of a window region (x1, y1)–(x2, y2) in screenshot pixel coordinates, with 20% padding added on each side. The output image is at most 500 px wide.

After a zoom, pass from_zoom=true to click/type_text to auto-translate coordinates back to full-window space.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| window_id | integer | Yes | CGWindowID from list_windows. |
| x1 | number | Yes | Left edge of region in screenshot pixels. |
| y1 | number | Yes | Top edge of region in screenshot pixels. |
| x2 | number | Yes | Right edge of region in screenshot pixels. |
| y2 | number | Yes | Bottom edge of region in screenshot pixels. |
| pid | integer | No | Target pid, required for from_zoom click/type translation. |

{"window_id": 10725, "x1": 100, "y1": 200, "x2": 400, "y2": 500}
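
The driver performs the from_zoom translation itself, so callers never need to compute it. Purely to illustrate the geometry, the sketch below maps a point in the zoom image back to full-window screenshot pixels under several assumptions that are not confirmed here: padding is 20% of the region's own width and height, the padded crop is not clamped to the window bounds, and the image is only ever scaled down to fit 500 px. zoom_to_window is a hypothetical name.

```python
def zoom_to_window(zx, zy, x1, y1, x2, y2, max_width=500, padding=0.20):
    """Map a zoom-image point (zx, zy) back to full-window screenshot pixels.

    Illustrative only: the padding basis, clamping, and scaling rules are assumptions.
    """
    pad_x = (x2 - x1) * padding
    pad_y = (y2 - y1) * padding
    left = x1 - pad_x  # left edge of the padded crop
    top = y1 - pad_y   # top edge of the padded crop
    padded_width = (x2 - x1) + 2 * pad_x
    scale = min(1.0, max_width / padded_width)  # downscale only
    return left + zx / scale, top + zy / scale
```

Under these assumptions, the region in the example above has a 420 px padded width, so no scaling applies and zoom-image point (60, 60) corresponds to the region's top-left corner (100, 200).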

Browser tools

page

Interact with the browser page loaded in a running app. Supports Chrome, Brave, Edge, Safari (via AppleScript on macOS), Electron apps (via CDP), Chromium/Firefox on Windows (via UIA for read; CDP for execute_javascript when --remote-debugging-port is set), and WKWebView/Tauri/AT-SPI fallbacks.

Actions:

| Action | Description |
| --- | --- |
| execute_javascript | Run JS and return the result. |
| get_text | Extract visible text from the page. |
| query_dom | Find elements matching a CSS selector. |
| click_element | Click a CSS-selected element and animate the agent cursor to its on-screen center. |
| enable_javascript_apple_events | macOS-only: patch the browser's Preferences to allow JS from Apple Events. Requires user_has_confirmed_enabling: true and a browser restart. |

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Action to perform (see table above). |
| pid | integer | No | Target process ID. |
| window_id | integer | No | Target window ID from list_windows. |
| javascript | string | No | JavaScript to execute. Required for execute_javascript. |
| css_selector | string | No | CSS selector for query_dom. |
| attributes | array of string | No | Element attributes to include in query_dom results. |
| selector | string | No | CSS selector for click_element. |
| bundle_id | string | No | Bundle ID of the browser. Required for enable_javascript_apple_events (macOS only). |
| user_has_confirmed_enabling | boolean | No | Must be true to proceed with enable_javascript_apple_events. |

{"action": "get_text"}

Recording tools

start_recording

Start trajectory recording. Every subsequent action-tool invocation writes a turn folder under output_dir.

Turn folder contents:

| File | Description |
| --- | --- |
| app_state.json | Post-action AX/UIA snapshot for the target pid. |
| screenshot.png | Post-action per-window screenshot of the target's frontmost on-screen window. |
| action.json | Tool name, full input arguments, result summary, pid, click point (when applicable), ISO-8601 timestamp. |
| click.png | For click-family actions only: screenshot.png with a red dot drawn at the click point. |

Turn folders are named turn-00001/, turn-00002/, etc. Numbering restarts at 1 each time recording is (re-)started.
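
The naming scheme can be reproduced with a zero-padded format string. The sketch below is illustrative; turn_folder_name and replay_order are hypothetical helpers. It shows why the five-digit padding matters: a plain lexical sort returns the folders in recording order.

```python
def turn_folder_name(counter):
    """Format a 1-based turn counter as a folder name such as turn-00001."""
    if counter < 1:
        raise ValueError("turn counters start at 1")
    return f"turn-{counter:05d}"


def replay_order(folder_names):
    """Sort turn folder names lexically; zero padding keeps this in numeric order."""
    return sorted(folder_names)
```

Without the padding, turn-10 would sort before turn-2, so lexical and numeric order would diverge after nine turns.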

Video recording:

  • Default: off. Pass record_video: true to also capture the main display to <output_dir>/recording.mp4 (H.264, 30 fps).
  • On macOS: uses native ScreenCaptureKit (requires macOS 15.0+, no extra TCC prompt).
  • On Windows + Linux: uses an ffmpeg subprocess (gdigrab / x11grab + libx264). Requires ffmpeg on PATH.
  • Recording stops automatically when the session ends on the daemon-proxy path.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| output_dir | string | Yes | Absolute or ~-rooted directory where turn folders and the video file are written. |
| record_video | boolean | No | Capture the main display to <output_dir>/recording.mp4. Default false. |

{"output_dir": "~/cua-trajectories/demo1"}

stop_recording

Stop trajectory recording. When video was enabled, gracefully terminates the ffmpeg subprocess so the mp4's moov atom is finalized (file is playable). Calling stop on an already-stopped session is a no-op. The response carries last_video_path pointing at the finalized mp4 when video was on.

A manual stop_recording is unconditional. It stops whatever recording is active regardless of which session started it.

Arguments: none.


replay_trajectory

Replay a recorded trajectory by re-invoking every turn's tool call in lexical order. dir must point at a directory previously written by start_recording.

Caveats:

  • Element-indexed actions fail because element indices are per-snapshot. Pixel clicks and keyboard tools replay cleanly.
  • get_window_state and other read-only tools are not recorded, so replays do not re-populate the element cache.
  • When recording is enabled while replay runs, the replay itself is recorded into the currently configured output directory.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| dir | string | Yes | Trajectory directory previously written by start_recording. Absolute or ~-rooted. |
| delay_ms | integer | No | Milliseconds to sleep between turns. Default 500. |
| stop_on_error | boolean | No | Stop replay on the first tool-call error. Default true. |

{"dir": "~/cua-trajectories/demo1"}

Configuration tools

set_config

Update cua-driver-rs configuration. Changes to capture_mode and max_image_dimension take effect immediately. The experimental_pip keys are persisted to ~/.cua-driver/config.json and take effect on the next daemon restart.

Per-session isolation (daemon-proxy path). On the daemon-proxy path, set_config writes an in-memory, session-scoped override that does not touch the global DriverConfig or persist to disk. get_config and capture tools resolve effective values as: call-arg > session override > global default. The override is dropped automatically when the client disconnects. Only the anonymous path (cua-driver config set CLI, one-shot cua-driver call) writes the persisted global default.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| capture_mode | string | No | Default capture mode for get_window_state. |
| max_image_dimension | integer | No | Max dimension for screenshot resizing. 0 = no limit. |
| experimental_pip | boolean | No | Enable the experimental picture-in-picture preview window. Applies on next daemon restart. |
| experimental_pip_geometry | string | No | PiP window size + optional position in WxH or WxH+X+Y form (e.g. 320x200+24+24). Applies on next daemon restart. |
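
The resolution order on the daemon-proxy path (call-arg > session override > global default) amounts to a first-match lookup across three layers. The sketch below models each layer as a plain dict; resolve_setting is a hypothetical helper and does not reflect the driver's internal types.

```python
def resolve_setting(key, call_args, session_override, global_default):
    """Return the effective value for key: call-arg, then session override, then global."""
    for layer in (call_args, session_override, global_default):
        value = layer.get(key)
        if value is not None:
            return value
    return None


# Illustrative values only.
global_default = {"capture_mode": "som", "max_image_dimension": 0}
session_override = {"capture_mode": "ax"}
```

Here a get_window_state call that passes no capture_mode would resolve to ax for this session, while other sessions still see som.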

start_session

Declare a session: a named, color-coded identity for the current agent run. The agent cursor, per-session config, and recording all key on the session id, and it follows the run across apps/windows. Idempotent (re-calling refreshes the idle-TTL). End it with end_session or let the idle-TTL reclaim it.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| session | string | Yes | Stable session id for this run (e.g. "research-run-1"). |

end_session

End a session declared with start_session: removes its agent cursor, stops any recording it owns, and clears its per-session config. Idempotent.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| session | string | Yes | The session id to end. |

set_agent_cursor_enabled

Show or hide the agent cursor overlay for a cursor instance. The overlay is on by default for each MCP session.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| enabled | boolean | Yes | true = show, false = hide. |
| cursor_id | string | No | Cursor instance. Default: "default". |

{"enabled": false}

set_agent_cursor_motion

Configure the visual appearance and motion curve of an agent cursor instance.

Appearance parameters:

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| cursor_id | string | No | Instance name. Default: "default". |
| cursor_icon | string | No | Built-in (arrow, crosshair, hand, dot) or PNG/SVG file path. |
| cursor_color | string | No | Hex color (e.g. "#00FFFF") or CSS name. |
| cursor_label | string | No | Short text shown near the cursor. |
| cursor_size | number | No | Dot radius in points. Default 16. |
| cursor_opacity | number | No | Opacity 0.0–1.0. Default 0.85. |

Motion curve parameters:

| Argument | Type | Range | Default | Description |
| --- | --- | --- | --- | --- |
| start_handle | number | [0, 1] | 0.3 | Departure control-point fraction. |
| end_handle | number | [0, 1] | 0.3 | Arrival control-point fraction. |
| arc_size | number | [0, 1] | 0.25 | Perpendicular deflection as fraction of path length. |
| arc_flow | number | [-1, 1] | 0.0 | Asymmetry bias; positive bulges toward destination. |
| spring | number | [0.3, 1.0] | 0.72 | Settle damping; 1.0 = no overshoot. |
| turn_radius | number | [1, 1000] | 80 | Corner rounding radius at direction changes. |
| glide_duration_ms | number | [50, 5000] | | Flight duration per move in ms. Omit for speed-based timing. |
| dwell_after_click_ms | number | [0, 5000] | 80 | Pause after click ripple in ms. |
| idle_hide_ms | number | [0, 60000] | 20000 | Auto-hide delay in ms. 0 = never hide. |
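
The ranges in the motion-curve table lend themselves to a client-side check before calling set_agent_cursor_motion. The sketch below copies the bounds from the table; out_of_range is a hypothetical helper, and how the driver itself treats out-of-range values (reject or clamp) is not stated here.

```python
# Inclusive bounds copied from the motion curve parameter table.
MOTION_RANGES = {
    "start_handle": (0, 1),
    "end_handle": (0, 1),
    "arc_size": (0, 1),
    "arc_flow": (-1, 1),
    "spring": (0.3, 1.0),
    "turn_radius": (1, 1000),
    "glide_duration_ms": (50, 5000),
    "dwell_after_click_ms": (0, 5000),
    "idle_hide_ms": (0, 60000),
}


def out_of_range(params):
    """Return {name: message} for every motion parameter outside its range."""
    errors = {}
    for name, value in params.items():
        bounds = MOTION_RANGES.get(name)
        if bounds is None:
            continue  # appearance parameters and cursor_id are not range-checked
        low, high = bounds
        if not low <= value <= high:
            errors[name] = f"{value} is outside [{low}, {high}]"
    return errors
```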

set_agent_cursor_style

Update the visual style of the agent cursor overlay.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| cursor_id | string | No | Cursor instance. Default: "default". |
| gradient_colors | array of string | No | CSS hex gradient stops tip→tail. Empty array reverts to default palette colours. |
| bloom_color | string | No | Hex bloom/halo colour (e.g. "#00FFFF"). Empty string reverts to default. |
| image_path | string | No | Path to PNG/JPEG/SVG/ICO cursor image. Empty string reverts to procedural arrow. |

Maintenance tools

check_permissions

Report TCC permission status for Accessibility and Screen Recording. By default also raises the system permission dialogs for any missing grants (Apple's request APIs are no-ops when the grant is already active). Pass {"prompt": false} for a read-only status check.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| prompt | boolean | No | Raise the system permission prompts for missing grants. Default true. |

check_for_update

Check whether a newer cua-driver-rs release is available on GitHub. Returns the current and latest versions, an update_available boolean, the install one-liner, and the release notes URL. Read-only. Mirror of cua-driver check-update --json.

Arguments: none.

install_ffmpeg

Install ffmpeg, which the recording video backend shells out to. Two-step and confirmed: called without confirm it only reports the command it would run (a read-only preview); pass confirm: true to actually run it. No-op if ffmpeg is already on PATH.

Arguments:

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| confirm | boolean | No | When true, runs the install. Omit or set false for a read-only preview of the command. |