MCP Tools
Specification for every MCP tool exposed by Cua Driver
Cua Driver exposes 36 MCP tools through a single stdio server (cua-driver mcp). Every tool is also callable from the shell as cua-driver call <tool-name> '<JSON-args>'.
Tool names are snake_case. Responses are MCP CallTool.Result envelopes: a text content block prefixed with a ✅ summary (or the error reason on failure), plus optional image or structured-content blocks on tools that produce them.
Tool names match the CLI dispatch form exactly. cua-driver call list_apps '{}' and the MCP list_apps tool run the same code path.
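As a rough illustration of the shell form described above, the sketch below builds the argument vector for cua-driver call and pulls the summary text out of a response envelope. The helper names call_argv and first_text are hypothetical, and the envelope layout (a content list of typed blocks) is assumed from the general MCP CallTool result shape rather than stated on this page.

```python
import json

def call_argv(tool_name, args=None):
    """Build argv for: cua-driver call <tool-name> '<JSON-args>'."""
    return ["cua-driver", "call", tool_name, json.dumps(args or {})]

def first_text(result):
    """Return the first text content block of an envelope dict, or None.

    Assumes the envelope looks like {"content": [{"type": "text", "text": ...}]}.
    """
    for block in result.get("content", []):
        if block.get("type") == "text":
            return block.get("text")
    return None

argv = call_argv("list_apps")
# Running it needs cua-driver on PATH, for example:
#   import subprocess
#   subprocess.run(argv, capture_output=True, text=True)
```

Because both entry points run the same code path, the same args dictionary can be reused for the MCP tool call.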
TCC auto-delegation. When an MCP client spawns cua-driver mcp from an IDE terminal (Claude Code, Cursor, VS Code, Warp), macOS attributes the subprocess to the parent terminal, not CuaDriver.app, so AX probes fail against the wrong bundle id. mcp detects this, auto-launches a cua-driver serve daemon via open -n -g -a CuaDriver --args serve, and proxies every tool call through the daemon's Unix socket. Tool semantics are identical to the in-process path. Pass --no-daemon-relaunch (or set CUA_DRIVER_MCP_NO_RELAUNCH=1) to force in-process execution.
Inspection tools
list_apps
List macOS apps, both currently running and installed-but-not-running, with per-app state flags.
Per-record fields:
| Field | Description |
|---|---|
| running | Whether a process for this app is live. pid is 0 when false. |
| active | Whether it is the system-frontmost app (implies running). |
| launch_path | Filesystem path to the .app bundle, when known. |
| kind | "desktop" for .app bundles on macOS. |
| last_used | RFC 3339 timestamp from the bundle's filesystem mtime, or null. |
Only apps with NSApplicationActivationPolicyRegular are included. Installed apps are scanned from /Applications, /Applications/Utilities, ~/Applications, /System/Applications, and /System/Applications/Utilities.
Arguments: none.
list_windows
List all layer-0 top-level windows currently known to WindowServer, including off-screen windows (minimized, on another Space, hidden-launched).
Per-record fields: window_id, pid, app_name, title, bounds (x/y/width/height, top-left origin), z_index (higher = frontmost), is_on_screen, on_current_space.
| Argument | Type | Required | Description |
|---|---|---|---|
| on_screen_only | boolean | No | When true, drop windows not on the current Space. Default false. |
| pid | integer | No | Optional pid filter. When set, only this pid's windows are returned. |
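A client will often want the frontmost visible window for a given process. The sketch below is one possible way to pick it from list_windows records using the documented z_index and is_on_screen fields. The function name and the sample records are hypothetical.

```python
def frontmost_window(windows, pid=None):
    """Pick the on-screen window with the highest z_index, optionally for one pid."""
    candidates = [
        w for w in windows
        if w.get("is_on_screen") and (pid is None or w.get("pid") == pid)
    ]
    # Higher z_index means closer to the front, per the field description.
    return max(candidates, key=lambda w: w["z_index"], default=None)

# Made-up records that reuse the documented field names.
sample = [
    {"window_id": 10725, "pid": 844, "title": "Calculator", "z_index": 3, "is_on_screen": True},
    {"window_id": 10730, "pid": 844, "title": "History", "z_index": 1, "is_on_screen": True},
    {"window_id": 10801, "pid": 912, "title": "Notes", "z_index": 5, "is_on_screen": False},
]
```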
get_window_state
Walk a running app's AX tree and return a Markdown rendering of its UI, tagging every actionable element with [element_index N]. Also captures a PNG screenshot of the specified window.
Invariant: call get_window_state once per turn per (pid, window_id) before any element-indexed action. The index map is replaced by the next snapshot.
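One way a client might enforce this invariant on its own side is a small guard that remembers which (pid, window_id) pairs were snapshotted in the current turn. This is a sketch only: SnapshotGuard and its methods are hypothetical client code, not part of Cua Driver.

```python
class SnapshotGuard:
    """Client-side check that a snapshot exists before element-indexed actions."""

    def __init__(self):
        self._fresh = set()  # (pid, window_id) pairs snapshotted this turn

    def new_turn(self):
        """Forget all snapshots at the start of a turn."""
        self._fresh.clear()

    def record_snapshot(self, pid, window_id):
        """Call after a successful get_window_state for this pair."""
        self._fresh.add((pid, window_id))

    def require_snapshot(self, pid, window_id):
        """Raise if no get_window_state ran for this pair in the current turn."""
        if (pid, window_id) not in self._fresh:
            raise RuntimeError(
                f"call get_window_state for pid={pid}, window_id={window_id} first"
            )

guard = SnapshotGuard()
guard.record_snapshot(844, 10725)
guard.require_snapshot(844, 10725)  # passes: a snapshot was taken this turn
```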
| Argument | Type | Required | Description |
|---|---|---|---|
| pid | integer | Yes | Target process ID. |
| window_id | integer | Yes | Target window ID from list_windows. |
| capture_mode | string | No | som = AX + screenshot (default); vision = screenshot only; ax = AX only. |
| query | string | No | Case-insensitive filter for tree_markdown. Element indices are unchanged. |
| screenshot_out_file | string | No | When set, write the PNG to this file path (~ expanded) instead of embedding base64 in the response. |
{"pid": 844, "window_id": 10725}
get_accessibility_tree
Return a lightweight snapshot of the desktop: running regular apps and on-screen visible windows with their bounds, z-order, and owner pid. No TCC grants required.
Arguments: none.
get_screen_size
Return the logical size of the main display in points plus its backing scale factor. Agents click in points; Retina displays have scale_factor 2.0. Requires no TCC permissions.
Arguments: none.
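The relationship between points and backing pixels is a plain multiplication by the scale factor. A minimal sketch, where the function name is hypothetical:

```python
def points_to_pixels(x, y, scale_factor):
    """Convert a position in logical points to backing-store pixels."""
    return (x * scale_factor, y * scale_factor)

# On a Retina display (scale_factor 2.0), point (100, 200) covers pixel (200, 400).
retina = points_to_pixels(100, 200, 2.0)
```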
get_cursor_position
Return the current mouse cursor position in screen points (origin top-left).
Arguments: none.
get_config
Return the current cua-driver-rs configuration.
Arguments: none.
get_recording_state
Report the current trajectory recorder state: whether recording is enabled, the output directory (when enabled), and the 1-based counter for the next turn folder. Counter increments on every recorded action tool call and resets to 1 each time recording is (re-)enabled. Pure read-only.
Arguments: none.
get_agent_cursor_state
Return the current state of this session's agent cursor: position, config (color, icon, label, size, opacity), enabled flag. Scoped to the cursor the call resolves to (precedence: explicit cursor_id > session identity > "default").
| Argument | Type | Required | Description |
|---|---|---|---|
| cursor_id | string | No | Cursor instance to inspect. Omit to target the calling session's own cursor (macOS); the anonymous path targets "default". |
Action tools
launch_app
Launch a macOS app in the background. The target does NOT come to the foreground.
| Argument | Type | Required | Description |
|---|---|---|---|
| bundle_id | string | No | App bundle identifier, e.g. com.apple.calculator. Preferred over name. |
| name | string | No | App display name. Used only when bundle_id is absent. |
| urls | array of string | No | File paths or URLs to open with the app. |
| electron_debugging_port | integer | No | Open a Chrome DevTools Protocol server on this port (appends --remote-debugging-port=N). |
| webkit_inspector_port | integer | No | Open a WebKit inspector server on this port (sets WEBKIT_INSPECTOR_SERVER env var). |
| creates_new_application_instance | boolean | No | When true, force a new app instance even if already running (open -n). Use for concurrent multi-agent/multi-session work. |
| additional_arguments | array of string | No | Extra arguments appended after --args when launching. |
Returns the launched app's pid, bundle_id, name, and a windows array (same shape as list_windows). When the focus-steal demotion check ran, the response also includes self_activation_suppressed: bool.
kill_app
Force-terminate a process by pid (kill -9 equivalent on macOS / Linux; taskkill /F equivalent on Windows). Unsaved state is lost.
| Argument | Type | Required | Description |
|---|---|---|---|
| pid | integer | Yes | PID of the process to terminate. |
{"pid": 844}
bring_to_front
Activate a window so subsequent input tools with dispatch:"foreground" land on it without a per-call SetForegroundWindow flash. Windows only: on macOS this tool returns an error, and on Linux it is a stub.
| Argument | Type | Required | Description |
|---|---|---|---|
| pid | integer | Yes | Target process ID. |
| window_id | integer | No | Optional target window ID. |
{"pid": 844}
click
Left-click against a target pid.
Two addressing modes:
- Element index (element_index + window_id): AX action path. Works on backgrounded/hidden windows. No cursor move, no focus steal. The element cache is scoped per (pid, window_id) and is replaced by the next snapshot.
- Pixel (x, y): CGEvent path. Synthesizes mouse events posted to pid. Needs a visible on-screen window.
| Argument | Type | Required | Description |
|---|---|---|---|
| pid | integer | Yes | Target process ID. |
| element_index | integer | No | Element index from last get_window_state. |
| window_id | integer | No | Target window ID. Required for element_index. |
| x | number | No | Window-local screenshot X coordinate. |
| y | number | No | Window-local screenshot Y coordinate. |
| action | string | No | AX action: press (default), show_menu, pick, confirm, cancel, open. |
| count | integer | No | Click count (pixel path only). Default 1. |
| modifier | array of string | No | Modifier keys: cmd, shift, option/alt, ctrl. |
| from_zoom | boolean | No | When true, x and y are in the last zoom image for this pid; driver translates back to full-window coordinates. |
| debug_image_out | string | No | File path for a diagnostic screenshot with a red crosshair at (x, y). |
{"pid": 844}
double_click
Double-click at (x, y) or on an AX element identified by element_index + window_id.
AX path: performs AXOpen when the element advertises it; otherwise resolves the element's on-screen center and falls back to a pixel double-click.
Pixel path: two down/up pairs ~80 ms apart.
| Argument | Type | Required | Description |
|---|---|---|---|
| pid | integer | Yes | Target process ID. |
| element_index | integer | No | Element index from last get_window_state. Uses AX path. |
| window_id | integer | No | CGWindowID. Required when element_index is used. |
| x | number | No | Screen X coordinate (pixel path). |
| y | number | No | Screen Y coordinate (pixel path). |
{"pid": 844}
right_click
Right-click against a target pid.
Two addressing modes:
- Element index (element_index + window_id): performs AXShowMenu. Pure AX RPC, works on backgrounded/hidden windows.
- Pixel (x, y): synthesizes a rightMouseDown/rightMouseUp CGEvent pair posted to the pid.
Exactly one of element_index or (x AND y) must be provided. pid is always required.
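The addressing rule above can be checked on the client before a call is sent. The sketch below is a hypothetical pre-flight validator, not the driver's own validation logic. It returns which mode the arguments select.

```python
def right_click_mode(args):
    """Return "element_index" or "pixel" for a right_click args dict, else raise."""
    if "pid" not in args:
        raise ValueError("pid is always required")
    has_x, has_y = "x" in args, "y" in args
    if has_x != has_y:
        raise ValueError("x and y must be given together")
    has_element = "element_index" in args
    has_pixel = has_x and has_y
    if has_element == has_pixel:
        raise ValueError("provide exactly one of element_index or (x and y)")
    if has_element and "window_id" not in args:
        raise ValueError("element_index requires window_id")
    return "element_index" if has_element else "pixel"

mode = right_click_mode({"pid": 844, "element_index": 14, "window_id": 10725})
```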
| Argument | Type | Required | Description |
|---|---|---|---|
| pid | integer | Yes | Target process ID. |
| element_index | integer | No | Element index from last get_window_state. Routes through AXShowMenu. Requires window_id. |
| window_id | integer | No | CGWindowID. Required when element_index is used. |
| x | number | No | X in window-local screenshot pixels. |
| y | number | No | Y in window-local screenshot pixels. |
| modifier | array of string | No | Modifier keys held during right-click: cmd/shift/option/ctrl. Pixel path only. |
{"pid": 844}
drag
Press-drag-release gesture from (from_x, from_y) to (to_x, to_y) in window-local screenshot pixels.
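To make the steps and duration_ms arguments of this tool concrete, the sketch below generates intermediate drag points. It assumes a straight-line path with evenly spaced points and timestamps, ending at the destination. The page does not state the actual path shape, so treat this purely as an illustration. The function name is hypothetical.

```python
def drag_path(from_x, from_y, to_x, to_y, steps=20, duration_ms=500):
    """Return (x, y, t_ms) tuples for each intermediate point.

    Assumption: linear interpolation, evenly spaced in space and time.
    """
    points = []
    for i in range(1, steps + 1):
        f = i / steps  # fraction of the gesture completed
        points.append((
            from_x + (to_x - from_x) * f,
            from_y + (to_y - from_y) * f,
            duration_ms * f,
        ))
    return points

path = drag_path(100, 200, 300, 400, steps=4, duration_ms=400)
```

With the documented defaults (20 steps over 500 ms), this model would space events 25 ms apart.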
| Argument | Type | Required | Description |
|---|---|---|---|
| pid | integer | Yes | Target process ID. |
| from_x | number | Yes | Drag-start X in window-local screenshot pixels. |
| from_y | number | Yes | Drag-start Y in window-local screenshot pixels. |
| to_x | number | Yes | Drag-end X in window-local screenshot pixels. |
| to_y | number | Yes | Drag-end Y in window-local screenshot pixels. |
| window_id | integer | No | CGWindowID for the window the pixel coordinates were measured against. |
| duration_ms | integer | No | Wall-clock duration of the drag path between mouseDown and mouseUp. Default 500. |
| steps | integer | No | Number of intermediate mouseDragged events. Default 20. |
| modifier | array of string | No | Modifier keys held across the entire gesture: cmd/shift/option/ctrl. |
| button | string | No | Mouse button used for the drag. Default: left. |
| from_zoom | boolean | No | When true, coordinates are in the last zoom image for this pid. |
{"from_x": 100, "from_y": 200, "pid": 844, "to_x": 300, "to_y": 400}
type_text
Insert text into the target pid via AXSetAttribute(kAXSelectedText). Works for standard Cocoa text fields and text views. No keystrokes are synthesized. For Chromium/Electron inputs that don't implement kAXSelectedText, the tool falls back to CGEvent character synthesis automatically.
| Argument | Type | Required | Description |
|---|---|---|---|
| pid | integer | Yes | Target process ID. |
| text | string | Yes | Text to insert at the target's cursor. |
| element_index | integer | No | Element index from last get_window_state. Directs the write to a specific field. Requires window_id. |
| window_id | integer | No | CGWindowID. Required when element_index is used. |
| delay_ms | integer | No | Milliseconds between characters in the CGEvent fallback path. Default 30. Ignored when the AX path succeeds. |
{"pid": 844, "text": "hello"}
press_key
Press and release a single key, delivered to the target pid via CGEventPostToPid. No focus steal.
Three delivery paths:
- window_id + element_index: focuses the AX element first, then posts via the auth-message path (Chromium-safe).
- window_id only (no element_index): NSMenu path. Briefly activates the window, posts without the auth envelope so IOHIDPostEvent fires and NSApplication.sendEvent: dispatches NSMenu key equivalents. Restores prior frontmost immediately.
- No window_id: standard auth-message path.
Key names: return, tab, escape, up/down/left/right, space, delete, home, end, pageup, pagedown, f1–f12, plus any letter or digit.
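The three delivery paths listed above depend only on which optional arguments are present. The sketch below restates that selection as a small function. The function name and the returned labels are hypothetical, chosen only to mirror the wording of the list.

```python
def press_key_path(window_id=None, element_index=None):
    """Name the delivery path implied by the optional press_key arguments."""
    if window_id is not None and element_index is not None:
        # Element is focused first, then the key goes via the auth-message path.
        return "focus-element-then-auth-message"
    if window_id is not None:
        # Window is briefly activated so NSMenu key equivalents dispatch.
        return "nsmenu"
    # No window_id: standard auth-message path.
    return "auth-message"

path_for_menu_shortcut = press_key_path(window_id=10725)
```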
| Argument | Type | Required | Description |
|---|---|---|---|
| pid | integer | Yes | Target process ID. |
| key | string | Yes | Key name: return, tab, escape, up, down, etc. |
| modifiers | array of string | No | Modifier keys: cmd, shift, option/alt, ctrl, fn. |
| element_index | integer | No | Element index. |
| window_id | integer | No | Target window ID. |
{"key": "return", "pid": 844}
hotkey
Press a combination of keys simultaneously. The combo is posted directly to the target pid's event queue; the target does NOT need to be frontmost.
Two delivery paths:
- Default (no window_id): auth-message envelope. Chromium/Electron apps accept the keystrokes as trusted live input on macOS 14+.
- With window_id: NSMenu path. Briefly makes the target WindowServer-frontmost, posts without the auth envelope so IOHIDPostEvent fires and NSMenu key equivalents dispatch (e.g. Cmd+Z undo, Cmd+W close). Restores prior frontmost immediately.
Recognized modifiers: cmd/command, shift, option/alt, ctrl/control, fn. Order: modifiers first, one non-modifier last.
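The ordering rule (modifiers first, one non-modifier last) is easy to check before sending a call. A minimal sketch, assuming exact lowercase matching of the modifier names listed above. The function name is hypothetical, and the driver's own validation may differ.

```python
MODIFIERS = {"cmd", "command", "shift", "option", "alt", "ctrl", "control", "fn"}

def check_hotkey_keys(keys):
    """Validate a hotkey keys array: zero or more modifiers, then one other key."""
    if not keys:
        raise ValueError("keys must not be empty")
    *mods, last = keys
    if last in MODIFIERS:
        raise ValueError("the last key must be a non-modifier")
    for key in mods:
        if key not in MODIFIERS:
            raise ValueError(f"{key!r} is not a recognized modifier")
    return list(keys)

copy_combo = check_hotkey_keys(["cmd", "c"])
```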
| Argument | Type | Required | Description |
|---|---|---|---|
| pid | integer | Yes | Target process ID. |
| keys | array of string | Yes | Modifier(s) and one non-modifier key, e.g. ["cmd", "c"]. |
| window_id | integer | No | When set, uses NSMenu path. |
{"keys": ["cmd", "c"], "pid": 844}
set_value
Set a value on a UI element.
Two modes:
- AXPopUpButton / select dropdown: finds the child option whose title or value matches value (case-insensitive) and AXPresses it directly. No native popup menu is opened.
- All other elements: writes AXValue directly (sliders, steppers, date pickers, native text fields that expose a settable AXValue).
| Argument | Type | Required | Description |
|---|---|---|---|
| pid | integer | Yes | Target process ID. |
| element_index | integer | Yes | Element index from last get_window_state. |
| window_id | integer | Yes | CGWindowID for the window whose get_window_state produced the element index. |
| value | string | Yes | New value. AX will coerce to the element's native type. |
{"element_index": 14, "pid": 844, "value": "42", "window_id": 10725}
scroll
Scroll the target pid's focused region by synthesized keystrokes.
Mapping: by='page' → PageDown/PageUp × amount; by='line' → DownArrow/UpArrow × amount. Horizontal variants use Left/Right arrow keys.
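The mapping above can be written out as a lookup. This sketch returns the sequence of key names (using the press_key key names) that a given scroll call would correspond to. It assumes horizontal scrolling uses the arrow keys for both granularities, which the page does not spell out. The function name is hypothetical.

```python
def scroll_keys(direction, by="line", amount=3):
    """Return the key names a scroll call maps to, repeated `amount` times."""
    if direction not in ("up", "down", "left", "right"):
        raise ValueError(f"unknown direction: {direction!r}")
    if by not in ("line", "page"):
        raise ValueError(f"unknown granularity: {by!r}")
    if direction in ("left", "right"):
        key = direction  # assumption: arrows for both line and page
    elif by == "page":
        key = "pagedown" if direction == "down" else "pageup"
    else:
        key = direction  # line scrolling uses the up/down arrow keys
    return [key] * amount

default_scroll = scroll_keys("down")
```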
| Argument | Type | Required | Description |
|---|---|---|---|
| pid | integer | Yes | Target process ID. |
| direction | string | Yes | Scroll direction: up, down, left, right. |
| by | string | No | Scroll granularity: line (default) or page. |
| amount | integer | No | Number of keystroke repetitions. Default 3. |
| element_index | integer | No | Pre-focuses this element before scrolling. |
| window_id | integer | No | Required when element_index is set. |
{"direction": "down", "pid": 844}
move_cursor
Move the agent cursor overlay to (x, y). Does NOT move the real mouse cursor.
| Argument | Type | Required | Description |
|---|---|---|---|
| x | number | Yes | Target X coordinate in screen points. |
| y | number | Yes | Target Y coordinate in screen points. |
| cursor_id | string | No | Explicit cursor-instance override. Omit to target the calling session's own cursor. |
{"x": 100, "y": 200}
zoom
Capture a cropped JPEG of a window region (x1, y1)–(x2, y2) in screenshot pixel coordinates, with 20% padding added on each side. The output image is at most 500 px wide.
After a zoom, pass from_zoom=true to click/type_text to auto-translate coordinates back to full-window space.
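To show what the from_zoom translation has to undo, here is a sketch of the geometry under stated assumptions: padding is 20% of the region's width and height on each side, the padded region is not clamped to the window, and the image is only ever scaled down to fit 500 px, never up. The driver's real formula may differ on all three points, and both function names are hypothetical.

```python
PADDING = 0.20   # 20% padding per side, as documented
MAX_WIDTH = 500  # output image is at most 500 px wide, as documented

def zoom_geometry(x1, y1, x2, y2):
    """Return (left, top, scale) of the padded region. Requires x2 > x1."""
    pad_x = (x2 - x1) * PADDING
    pad_y = (y2 - y1) * PADDING
    left, top = x1 - pad_x, y1 - pad_y
    padded_width = (x2 - x1) + 2 * pad_x
    scale = min(1.0, MAX_WIDTH / padded_width)  # assumption: never upscaled
    return left, top, scale

def zoom_to_window(zx, zy, region):
    """Map a pixel in the zoom image back to full-window screenshot pixels."""
    left, top, scale = zoom_geometry(*region)
    return left + zx / scale, top + zy / scale

wx, wy = zoom_to_window(10, 20, (100, 200, 400, 500))
```

In practice the driver performs this translation itself when from_zoom is true, so a client does not need to do the arithmetic.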
| Argument | Type | Required | Description |
|---|---|---|---|
| window_id | integer | Yes | CGWindowID from list_windows. |
| x1 | number | Yes | Left edge of region in screenshot pixels. |
| y1 | number | Yes | Top edge of region in screenshot pixels. |
| x2 | number | Yes | Right edge of region in screenshot pixels. |
| y2 | number | Yes | Bottom edge of region in screenshot pixels. |
| pid | integer | No | Target pid, required for from_zoom click/type translation. |
{"window_id": 10725, "x1": 100, "y1": 200, "x2": 400, "y2": 500}
Browser tools
page
Interact with the browser page loaded in a running app. Supports Chrome, Brave, Edge, Safari (via AppleScript on macOS), Electron apps (via CDP), Chromium/Firefox on Windows (via UIA for read; CDP for execute_javascript when --remote-debugging-port is set), and WKWebView/Tauri/AT-SPI fallbacks.
Actions:
| Action | Description |
|---|---|
| execute_javascript | Run JS and return the result. |
| get_text | Extract visible text from the page. |
| query_dom | Find elements matching a CSS selector. |
| click_element | Click a CSS-selected element and animate the agent cursor to its on-screen center. |
| enable_javascript_apple_events | macOS-only: patch the browser's Preferences to allow JS from Apple Events. Requires user_has_confirmed_enabling: true and a browser restart. |
| Argument | Type | Required | Description |
|---|---|---|---|
| action | string | Yes | Action to perform (see table above). |
| pid | integer | No | Target process ID. |
| window_id | integer | No | Target window ID from list_windows. |
| javascript | string | No | JavaScript to execute. Required for execute_javascript. |
| css_selector | string | No | CSS selector for query_dom. |
| attributes | array of string | No | Element attributes to include in query_dom results. |
| selector | string | No | CSS selector for click_element. |
| bundle_id | string | No | Bundle ID of the browser. Required for enable_javascript_apple_events (macOS only). |
| user_has_confirmed_enabling | boolean | No | Must be true to proceed with enable_javascript_apple_events. |
{"action": "get_text"}
Recording tools
start_recording
Start trajectory recording. Every subsequent action-tool invocation writes a turn folder under output_dir.
Turn folder contents:
| File | Description |
|---|---|
| app_state.json | Post-action AX/UIA snapshot for the target pid. |
| screenshot.png | Post-action per-window screenshot of the target's frontmost on-screen window. |
| action.json | Tool name, full input arguments, result summary, pid, click point (when applicable), ISO-8601 timestamp. |
| click.png | For click-family actions only: screenshot.png with a red dot drawn at the click point. |
Turn folders are named turn-00001/, turn-00002/, etc. Numbering restarts at 1 each time recording is (re-)started.
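The naming scheme is a five-digit zero-padded counter. A minimal sketch of it, where the function name is hypothetical:

```python
def turn_folder(n):
    """Return the folder name for the 1-based turn counter n."""
    if n < 1:
        raise ValueError("turn counter is 1-based")
    return f"turn-{n:05d}"

# Zero-padding keeps a plain lexical sort in numeric order (up to turn 99999).
names = sorted(turn_folder(n) for n in (10, 2, 1))
```

That property matters for any consumer that walks a trajectory directory in lexical order.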
Video recording:
- Default: off. Pass record_video: true to also capture the main display to <output_dir>/recording.mp4 (H.264, 30 fps).
- On macOS: uses native ScreenCaptureKit (requires macOS 15.0+, no extra TCC prompt).
- On Windows + Linux: uses an ffmpeg subprocess (gdigrab/x11grab + libx264). Requires ffmpeg on PATH.
- Recording stops automatically when the session ends on the daemon-proxy path.
| Argument | Type | Required | Description |
|---|---|---|---|
| output_dir | string | Yes | Absolute or ~-rooted directory where turn folders and the video file are written. |
| record_video | boolean | No | Capture the main display to <output_dir>/recording.mp4. Default false. |
{"output_dir": "~/cua-trajectories/demo1"}
stop_recording
Stop trajectory recording. When video was enabled, gracefully terminates the ffmpeg subprocess so the mp4's moov atom is finalized (file is playable). Calling stop on an already-stopped session is a no-op. The response carries last_video_path pointing at the finalized mp4 when video was on.
A manual stop_recording is unconditional. It stops whatever recording is active regardless of which session started it.
Arguments: none.
replay_trajectory
Replay a recorded trajectory by re-invoking every turn's tool call in lexical order. dir must point at a directory previously written by start_recording.
Caveats:
- Element-indexed actions fail because element indices are per-snapshot. Pixel clicks and keyboard tools replay cleanly.
- get_window_state and other read-only tools are not recorded, so replays do not re-populate the element cache.
- When recording is enabled while replay runs, the replay itself is recorded into the currently configured output directory.
| Argument | Type | Required | Description |
|---|---|---|---|
| dir | string | Yes | Trajectory directory previously written by start_recording. Absolute or ~-rooted. |
| delay_ms | integer | No | Milliseconds to sleep between turns. Default 500. |
| stop_on_error | boolean | No | Stop replay on the first tool-call error. Default true. |
{"dir": "~/cua-trajectories/demo1"}
Configuration tools
set_config
Update cua-driver-rs configuration. Changes to capture_mode and max_image_dimension take effect immediately. The experimental_pip keys are persisted to ~/.cua-driver/config.json and take effect on the next daemon restart.
Per-session isolation (daemon-proxy path). On the daemon-proxy path, set_config writes an in-memory, session-scoped override that does not touch the global DriverConfig or persist to disk. get_config and capture tools resolve effective values as: call-arg > session override > global default. The override is dropped automatically when the client disconnects. Only the anonymous path (cua-driver config set CLI, one-shot cua-driver call) writes the persisted global default.
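The resolution order (call-arg, then session override, then global default) amounts to a first-match lookup across three layers. The sketch below models each layer as a plain dictionary. The function name and the sample values are hypothetical, and the driver's internal representation is not described on this page.

```python
def effective_value(key, call_args, session_override, global_default):
    """Resolve a config key: call-arg > session override > global default."""
    for layer in (call_args, session_override, global_default):
        if layer.get(key) is not None:
            return layer[key]
    return None

global_cfg = {"capture_mode": "som", "max_image_dimension": 0}
session_cfg = {"capture_mode": "ax"}  # set via set_config on the daemon-proxy path

mode = effective_value("capture_mode", {}, session_cfg, global_cfg)
```

Dropping session_cfg, as happens when the client disconnects, makes the same lookup fall through to the global default.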
| Argument | Type | Required | Description |
|---|---|---|---|
| capture_mode | string | No | Default capture mode for get_window_state. |
| max_image_dimension | integer | No | Max dimension for screenshot resizing. 0 = no limit. |
| experimental_pip | boolean | No | Enable the experimental picture-in-picture preview window. Applies on next daemon restart. |
| experimental_pip_geometry | string | No | PiP window size + optional position in WxH or WxH+X+Y form (e.g. 320x200+24+24). Applies on next daemon restart. |
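For clients that want to validate an experimental_pip_geometry string before sending it, the WxH and WxH+X+Y forms can be matched with a small regular expression. This sketch assumes non-negative integer values only. Whether the driver accepts other variants (such as negative offsets) is not stated here. The function name is hypothetical.

```python
import re

_GEOMETRY = re.compile(r"(\d+)x(\d+)(?:\+(\d+)\+(\d+))?")

def parse_pip_geometry(text):
    """Parse "WxH" or "WxH+X+Y" into a dict; x and y are None when omitted."""
    match = _GEOMETRY.fullmatch(text)
    if match is None:
        raise ValueError(f"not a WxH or WxH+X+Y geometry: {text!r}")
    width, height, x, y = match.groups()
    return {
        "width": int(width),
        "height": int(height),
        "x": int(x) if x is not None else None,
        "y": int(y) if y is not None else None,
    }

geometry = parse_pip_geometry("320x200+24+24")
```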
start_session
Declare a session: a named, color-coded identity for the current agent run. The agent cursor, per-session config, and recording all key on the session id, and it follows the run across apps/windows. Idempotent (re-calling refreshes the idle-TTL). End it with end_session or let the idle-TTL reclaim it.
| Argument | Type | Required | Description |
|---|---|---|---|
| session | string | Yes | Stable session id for this run (e.g. "research-run-1"). |
end_session
End a session declared with start_session: removes its agent cursor, stops any recording it owns, and clears its per-session config. Idempotent.
| Argument | Type | Required | Description |
|---|---|---|---|
| session | string | Yes | The session id to end. |
set_agent_cursor_enabled
Show or hide the agent cursor overlay for a cursor instance. The overlay is on by default for each MCP session.
| Argument | Type | Required | Description |
|---|---|---|---|
| enabled | boolean | Yes | true = show, false = hide. |
| cursor_id | string | No | Cursor instance. Default: "default". |
{"enabled": false}
set_agent_cursor_motion
Configure the visual appearance and motion curve of an agent cursor instance.
Appearance parameters:
| Argument | Type | Required | Description |
|---|---|---|---|
| cursor_id | string | No | Instance name. Default: "default". |
| cursor_icon | string | No | Built-in (arrow, crosshair, hand, dot) or PNG/SVG file path. |
| cursor_color | string | No | Hex color (e.g. "#00FFFF") or CSS name. |
| cursor_label | string | No | Short text shown near the cursor. |
| cursor_size | number | No | Dot radius in points. Default 16. |
| cursor_opacity | number | No | Opacity 0.0–1.0. Default 0.85. |
Motion curve parameters:
| Argument | Type | Range | Default | Description |
|---|---|---|---|---|
| start_handle | number | [0, 1] | 0.3 | Departure control-point fraction. |
| end_handle | number | [0, 1] | 0.3 | Arrival control-point fraction. |
| arc_size | number | [0, 1] | 0.25 | Perpendicular deflection as fraction of path length. |
| arc_flow | number | [-1, 1] | 0.0 | Asymmetry bias; positive bulges toward destination. |
| spring | number | [0.3, 1.0] | 0.72 | Settle damping; 1.0 = no overshoot. |
| turn_radius | number | [1, 1000] | 80 | Corner rounding radius at direction changes. |
| glide_duration_ms | number | [50, 5000] | none | Flight duration per move in ms. Omit for speed-based timing. |
| dwell_after_click_ms | number | [0, 5000] | 80 | Pause after click ripple in ms. |
| idle_hide_ms | number | [0, 60000] | 20000 | Auto-hide delay in ms. 0 = never hide. |
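The documented ranges lend themselves to a client-side pre-check. The sketch below lists out-of-range motion values before a set_agent_cursor_motion call is sent. Whether the driver itself rejects or clamps such values is not stated on this page, and the function name is hypothetical.

```python
MOTION_RANGES = {
    "start_handle": (0, 1),
    "end_handle": (0, 1),
    "arc_size": (0, 1),
    "arc_flow": (-1, 1),
    "spring": (0.3, 1.0),
    "turn_radius": (1, 1000),
    "glide_duration_ms": (50, 5000),
    "dwell_after_click_ms": (0, 5000),
    "idle_hide_ms": (0, 60000),
}

def motion_range_errors(args):
    """Return one message per motion argument outside its documented range."""
    errors = []
    for name, value in args.items():
        if name not in MOTION_RANGES:
            continue  # appearance arguments such as cursor_id are not checked
        low, high = MOTION_RANGES[name]
        if not low <= value <= high:
            errors.append(f"{name}={value} is outside [{low}, {high}]")
    return errors

problems = motion_range_errors({"spring": 0.1, "arc_flow": -0.5, "cursor_id": "default"})
```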
set_agent_cursor_style
Update the visual style of the agent cursor overlay.
| Argument | Type | Required | Description |
|---|---|---|---|
cursor_id | string | No | Cursor instance. Default: "default". |
| gradient_colors | array of string | No | CSS hex gradient stops tip→tail. Empty array reverts to default palette colours. |
| bloom_color | string | No | Hex bloom/halo colour (e.g. "#00FFFF"). Empty string reverts to default. |
| image_path | string | No | Path to PNG/JPEG/SVG/ICO cursor image. Empty string reverts to procedural arrow. |
Maintenance tools
check_permissions
Report TCC permission status for Accessibility and Screen Recording. By default also raises the system permission dialogs for any missing grants (Apple's request APIs are no-ops when the grant is already active). Pass {"prompt": false} for a read-only status check.
| Argument | Type | Required | Description |
|---|---|---|---|
| prompt | boolean | No | Raise the system permission prompts for missing grants. Default true. |
check_for_update
Check whether a newer cua-driver-rs release is available on GitHub. Returns the current and latest versions, an update_available boolean, the install one-liner, and the release notes URL. Read-only. Mirror of cua-driver check-update --json.
Arguments: none.
install_ffmpeg
Install ffmpeg, which the recording video backend shells out to. Two-step and confirmed: called without confirm it only reports the command it would run (a read-only preview); pass confirm: true to actually run it. No-op if ffmpeg is already on PATH.
Arguments:
| Argument | Type | Description |
|---|---|---|
| confirm | boolean (optional) | When true, runs the install. Omit or set false for a read-only preview of the command. |