Cua DriverReference

MCP Tools

Reference for every MCP tool cua-driver exposes

cua-driver exposes 29 MCP tools through a single stdio server (cua-driver mcp). Every tool is also callable from the shell as cua-driver <name> '<JSON-args>'.

Tool names are snake_case. Responses are MCP CallTool.Result envelopes: a text content block prefixed with a summary (or the error reason on failure), plus optional image or structured-content blocks on tools that produce them. See the CLI reference for CLI-specific flags like --raw and --screenshot-out-file.

Tool names here match the CLI form exactly. cua-driver list_apps and the MCP list_apps tool run the same code path.

TCC auto-delegation. When an MCP client spawns cua-driver mcp from an IDE terminal (Claude Code, Cursor, VS Code, Warp), macOS attributes the subprocess to the parent terminal — not CuaDriver.app — so AX probes fail against the wrong bundle id. mcp detects this and auto-launches a cua-driver serve daemon via open -n -g -a CuaDriver --args serve, then proxies every tool call through the daemon's Unix socket. Tool semantics are identical to the in-process path; no Python bridge is needed. Pass --no-daemon-relaunch (or set CUA_DRIVER_MCP_NO_RELAUNCH=1) to force in-process execution. See the process model guide for the full lifecycle, failure modes, and wrapper-author guidance.

check_permissions

Report TCC permission status for Accessibility and Screen Recording. By default also raises the system permission dialogs for any missing grants — Apple's request APIs are no-ops when the grant is already active, so this is safe to call repeatedly. Pass {"prompt": false} for a purely read-only status check.

Arguments:

  • prompt (boolean, optional): Raise the system permission prompts for missing grants. Default true.

click

Left-click against a target pid. Two addressing modes:

  • element_index + window_id (from the last get_window_state snapshot of that window) — performs an AX action on the cached element. Default action is press (left-click). Other values: show_menu (right-click equivalent), pick (open a menu-bar item's submenu), confirm (return on a default button), cancel (escape on a dismiss button), open (open a file/folder in a browser). Pure AX RPC, works on backgrounded / hidden windows, no cursor move or focus steal. Requires a prior get_window_state(pid, window_id) in this turn; the element_index cache is scoped per (pid, window_id) so indices from one window do not resolve against another.

  • x, y (window-local screenshot pixels, top-left origin of the PNG returned by get_window_state) — synthesizes mouse events via CGEvent / SkyLight and delivers them to the pid. AX is never consulted on this path. The driver converts image-pixel → screen-point internally using the target's window origin and backing scale. count: 2 posts two down/up pairs ~80ms apart for a double-click. modifier holds cmd/shift/option/ctrl during the click (e.g. ["cmd"] for cmd-click).

Exactly one of element_index or (x AND y) must be provided. pid is required in both modes. window_id is required when element_index is used (scopes the cache lookup). action is only valid with element_index; count and modifier are ignored in the AX path.

Arguments:

  • action (string, optional): AX action name (element_index path only). Default: press.
  • count (integer, optional): Click count — 1 (single), 2 (double), 3 (triple). Pixel path only. Default: 1.
  • debug_image_out (string, optional): Optional absolute path. When set on a pixel-addressed click (x, y present), the tool captures a fresh screenshot of the target window at the current max_image_dimension, draws a red crosshair at the received (x, y), and writes the PNG to this path. Use this to verify the tool received the coordinate you intended — if you drew your own blue crosshair on the PNG from get_window_state and saved it separately, the two crosshairs must land on the same pixel. Mismatch surfaces a coord-space bug. Pixel path only; requires window_id; incompatible with from_zoom.
  • element_index (integer, optional): Element index from the last get_window_state for the same (pid, window_id). Routes through the AX action path. Requires window_id.
  • from_zoom (boolean, optional): When true, x and y are pixel coordinates in the last zoom image for this pid. The driver automatically maps them back to the correct window coordinates. Use this after calling zoom to click on something you saw in the zoomed image.
  • modifier (array of string, optional): Modifier keys held during the click: cmd/shift/option/ctrl. Pixel path only.
  • pid (integer, required): Target process ID.
  • window_id (integer, optional): CGWindowID for the window whose get_window_state produced the element_index. Required when element_index is used; ignored in the pixel path.
  • x (number, optional): X in window-local screenshot pixels — same space as the PNG get_window_state returns. Top-left origin of the target's window. Must be provided together with y.
  • y (number, optional): Y in window-local screenshot pixels — same space as the PNG get_window_state returns. Top-left origin of the target's window. Must be provided together with x.
{"pid":844}

double_click

Double-click against a target pid. Two addressing modes:

  • element_index + window_id (from the last get_window_state snapshot of that window) — performs AXOpen when the element advertises it (Finder items, list rows, openable cells); otherwise falls back to a stamped pixel double-click at the element's on-screen center. Pure AX RPC in the AXOpen branch; pixel recipe when no AX open action exists. Requires a prior get_window_state(pid, window_id) in this turn.

  • x, y (window-local screenshot pixels, top-left origin of the PNG returned by get_window_state) — synthesizes a stamped double-click (two down/up pairs with clickState 1→2) and delivers to the pid. The primer-gated auth-signed recipe lands on backgrounded Chromium web content (dblclick events that trigger YouTube fullscreen toggle, Finder open-on-double-click, etc.). modifier holds cmd/shift/option/ctrl during the gesture.

Exactly one of element_index or (x AND y) must be provided. pid is required in both modes. window_id is required when element_index is used. modifier only takes effect in the pixel path (AXOpen doesn't propagate modifier keys).

Arguments:

  • element_index (integer, optional): Element index from the last get_window_state for the same (pid, window_id). Routes through AXOpen when advertised, else pixel double-click at the element's center. Requires window_id.
  • modifier (array of string, optional): Modifier keys held during the double-click: cmd/shift/option/ctrl. Pixel path only.
  • pid (integer, required): Target process ID.
  • window_id (integer, optional): CGWindowID for the window whose get_window_state produced the element_index. Required when element_index is used; ignored in the pixel path.
  • x (number, optional): X in window-local screenshot pixels — same space as the PNG get_window_state returns. Top-left origin of the target's window. Must be provided together with y.
  • y (number, optional): Y in window-local screenshot pixels — same space as the PNG get_window_state returns. Top-left origin of the target's window. Must be provided together with x.
{"pid":844}

drag

Press-drag-release gesture from (from_x, from_y) to (to_x, to_y) in window-local screenshot pixels — the same space the PNG get_window_state returns. Top-left origin of the target's window.

Use this for: marquee/lasso selection, drag-and-drop between source and destination, resizing via a handle, scrubbing a slider, repositioning a panel. macOS AX has no semantic drag action, so drag is pixel-only — there is no element-indexed mode. Address an element with get_window_state, read its bounds, and pass the pixel coordinates you want.

duration_ms (default 500) is the wall-clock budget for the path between mouse-down and mouse-up; steps (default 20) is the number of intermediate mouseDragged events linearly interpolated along the path. Increase both for slower, more "human" drags; decrease for snap gestures. modifier keys (cmd/shift/option/ctrl) are held across the entire gesture — option-drag duplicates, shift-drag constrains the axis on most surfaces.

Frontmost target: posts via .cghidEventTap. The real cursor visibly traces the drag path — unavoidable for AppKit-style drag sources, which accept only HID-origin events.

Backgrounded target: posts via the auth-signed pid-routed path. Cursor-neutral, but some surfaces (OpenGL canvases, Blender, Unity) filter pid-routed dragged events at the event-source level — those targets must be frontmost.

When from_zoom is true, both endpoints are pixel coordinates in the last zoom image for this pid; the driver maps them back to window coordinates before dispatching.

Arguments:

  • button (string, optional): Mouse button used for the drag. Default: left.
  • duration_ms (integer, optional): Wall-clock duration of the drag path between mouseDown and mouseUp. Default: 500.
  • from_x (number, required): Drag-start X in window-local screenshot pixels. Top-left origin.
  • from_y (number, required): Drag-start Y in window-local screenshot pixels. Top-left origin.
  • from_zoom (boolean, optional): When true, from_x/from_y/to_x/to_y are pixel coordinates in the last zoom image for this pid. The driver maps them back to window coordinates before dispatching.
  • modifier (array of string, optional): Modifier keys held across the entire gesture: cmd/shift/option/ctrl.
  • pid (integer, required): Target process ID.
  • steps (integer, optional): Number of intermediate mouseDragged events linearly interpolated along the path. Default: 20.
  • to_x (number, required): Drag-end X in window-local screenshot pixels.
  • to_y (number, required): Drag-end Y in window-local screenshot pixels.
  • window_id (integer, optional): CGWindowID for the window the pixel coordinates were measured against. Optional — when omitted the driver picks the frontmost window of pid. Pass when the target has multiple windows and you measured against a specific one.
{"from_x":100,"from_y":200,"pid":844,"to_x":100,"to_y":200}

get_agent_cursor_state

Report the current agent-cursor configuration: enabled flag, motion knobs (startHandle, endHandle, arcSize, arcFlow, spring), glide duration, post-click dwell, and idle-hide delay. Durations come back in milliseconds to match the setter's units. Pure read-only — no side effects.

Arguments: none.

get_config

Report the current persistent driver config. Config lives at ~/Library/Application Support/<app-name>/config.json and survives daemon restarts, unlike session-scoped state (recording).

Pure read-only. Returns defaults when the file doesn't exist or fails to decode — same fallback the daemon uses at startup. Sibling to set_config / cua-driver config.

Current schema:

{ "schema_version": 1, "capture_mode": "vision" | "ax" | "som", "agent_cursor": { "enabled": true, "motion": { "start_handle": 0.3, "end_handle": 0.3, "arc_size": 0.25, "arc_flow": 0.0, "spring": 0.72 } } }

Arguments: none.

get_cursor_position

Return the current mouse cursor position in screen points (origin top-left).

Arguments: none.

get_recording_state

Report the current trajectory recorder state: whether recording is enabled, the output directory (when enabled), and the 1-based counter for the next turn folder that will be written. Counter increments on every recorded action tool call and resets to 1 each time recording is (re-)enabled.

Pure read-only. Typical use: cua-driver recording status surfaces these fields for human consumption; programmatic callers use it to confirm recording is armed before kicking off a session.

Arguments: none.

get_screen_size

Return the logical size of the main display in points plus its backing scale factor. Agents click in points; Retina displays have scale_factor 2.0. Requires no TCC permissions.

Arguments: none.

get_window_state

Walk a running app's AX tree and return a Markdown rendering of its UI, tagging every actionable element with [element_index N]. Pass those indices to click, type_text_in, scroll_in, etc. — those tools resolve the index to the cached AXUIElement on the server.

INVARIANT: call get_window_state once per turn per (pid, window_id) before any element-indexed action against that window. The index map is replaced by the next snapshot of the same (pid, window_id).

The AX tree walked is the pid's tree, but the screenshot and window bounds reported come from the specified window_id. This is the source of truth for which window the caller intends to reason about — the driver never picks a window implicitly. Use list_windows to enumerate candidates, or read launch_app's windows field for freshly-launched apps.

window_id MUST belong to pid and MUST be on the user's current Space; the call returns isError: true otherwise. The driver does not auto-fall-back to a different window — window selection is the caller's responsibility.

The screenshot of the specified window is delivered as a native MCP image content block (not a JSON text field), so clients receive it as a proper image rather than raw base64 text.

Set query to a case-insensitive substring to filter tree_markdown down to matching lines plus their ancestor chain (so the structural context above each match stays visible). The element_index values and element_count are unchanged — the filter only trims the rendered Markdown, so click({pid, window_id, element_index: N}) still resolves against the full cached tree. When nothing matches, tree_markdown is an empty string; the caller can retry without the filter.

Response shape is controlled by the persistent capture_mode config setting (default som). Each mode skips the half of the work it doesn't need — not just the half of the response. Note that vision names the capture MODE; the separate screenshot tool (which captures a raw PNG without walking AX) is unrelated and keeps its name:

  • som — walks AX tree AND captures screenshot (default). Element-indexed clicks work out of the box; screenshot is there for disambiguation.
  • vision — captures the window PNG; AX tree walk is skipped entirely (no Accessibility hit, no element_index cache update). tree_markdown, element_count, and turn_id omitted. Element-indexed clicks will fail until a non-vision snapshot runs — pair with pixel-addressed click({pid, x, y}). Previously named screenshot; the raw string "screenshot" still decodes to vision as a deprecated alias.
  • ax — walks AX tree; screen-capture call is skipped entirely (no Screen Recording hit). screenshot_* fields omitted. Change with cua-driver config set capture_mode <mode> or the set_config tool.

Screenshot capture failures in som / vision modes are non-fatal for som: the AX tree still ships in the response and the summary line carries a hint. The macOS 26.4.x SCK regression (SCStreamError -3801, "Could not start streaming") is handled this way — agents can keep doing element-indexed clicks against the same window even when the screenshot is unavailable. Switching to capture_mode: ax skips the capture attempt entirely on subsequent turns.

Requires Accessibility and Screen Recording permissions.

Arguments:

Sticky (pid, window_id) context. Each get_window_state(pid, window_id) call refreshes the element-index cache for that specific window. All element-indexed actions (click, type_text, scroll, etc.) in the same turn resolve against the most recent snapshot for that (pid, window_id) pair — indices from a different window or an older snapshot are stale. Always re-call get_window_state after any action that might shift focus to a different window or app, and always pass the same (pid, window_id) you intend to act against.

  • javascript (string, optional): Optional JavaScript to execute in the browser tab and return alongside the AX snapshot — one round-trip instead of two. Only works for Chromium-family browsers (Chrome, Brave, Edge) and Safari; requires 'Allow JavaScript from Apple Events' to be enabled first (see WEB_APPS.md). The result is appended to the response as a ## JavaScript result section. Use for read-only queries (document.title, innerText, querySelectorAll, etc.). For mutations or side effects use the page tool instead.
  • pid (integer, required): Process ID from list_apps.
  • query (string, optional): Optional case-insensitive substring. When set, tree_markdown only contains lines that match plus their ancestor chain; element indices and element_count are unchanged.
  • screenshot_out_file (string, optional): Optional absolute path to write the screenshot to (e.g. "/tmp/shot.jpg"). When set, the screenshot bytes are written to this file and the MCP image content block is omitted from the response — screenshot_file_path is returned instead of screenshot_png_b64. Useful for CLI callers and agents that cannot consume inline base64 without saturating their context window (e.g. OpenCode with a local Ollama model). The directory must already exist; the file is created or overwritten.
  • window_id (integer, required): CGWindowID of the target window. Must belong to pid and be on the user's current Space. Enumerate via list_windows or read from launch_app's windows array.
{"pid":844,"window_id":10725}

hotkey

Press a combination of keys simultaneously — e.g. ["cmd", "c"] for Copy, ["cmd", "shift", "4"] for screenshot selection. The combo is posted directly to the target pid's event queue via CGEvent.postToPid; the target does NOT need to be frontmost.

window_id (optional): when supplied, the driver calls FocusWithoutRaise.activateWithoutRaise before posting — making the target AppKit-active without raising its window. This is required for shortcuts that are dispatched via NSMenu key equivalents (Cmd+S, Cmd+N, Cmd+W, Cmd+Shift+N, …) because AppKit only routes menu key equivalents to the active app. Omit window_id for shortcuts handled by the renderer (e.g. Cmd+C in a Chromium text field). Trade-off: the sentinel foreground app will lose focus once.

Recognized modifiers: cmd/command, shift, option/alt, ctrl/control, fn. Non-modifier keys use the same vocabulary as press_key (return, tab, escape, up/down/left/right, space, delete, home, end, pageup, pagedown, f1-f12, letters, digits). Order: modifiers first, one non-modifier last.

Arguments:

  • keys (array of string, required): Modifier(s) and one non-modifier key, e.g. ["cmd", "c"].
  • pid (integer, required): Target process ID.
  • window_id (integer, optional): CGWindowID of the target window. When provided, the driver calls FocusWithoutRaise before posting so NSMenu key equivalents (Cmd+S, Cmd+N, …) reach the backgrounded app.
{"keys":["cmd","c"],"pid":844}

launch_app

Launch a macOS app hidden — the driver never brings the target to the foreground, and the target's window is not drawn on screen. This tool is strictly a background-launch primitive; if a human operator wants to switch the foreground app, they should do so through normal macOS UI (Dock, Cmd-Tab, Spotlight), not through this driver.

Provide either bundle_id (preferred — unambiguous, e.g. com.apple.calculator) or name (e.g. "Calculator", resolved against /Applications and /System/Applications). If both are given, bundle_id wins.

Launches in the background — the app's window appears on screen but does not steal focus from whatever is currently frontmost. The AX tree is fully populated and automation can start immediately via click, hotkey, get_window_state, etc.

Some apps (Calculator and many Electron apps) call NSApp.activate(ignoringOtherApps:) in their own applicationDidFinishLaunching, overriding NSWorkspace.OpenConfiguration.activates = false. A layer-3 focus-steal preventer arms a short NSWorkspace.didActivateApplicationNotification watcher around the launch and re-activates the previously-frontmost app if the target self-activates. If suppression fails the returned summary includes a "self-activation not suppressed" warning.

Optional urls are handed to the app through its application(_:open:) AppKit delegate. For Finder, passing a folder URL opens a Finder window rooted at that folder in the background. Works with any app that implements application(_:open:); apps that ignore the delegate simply launch without side effects.

Optional electron_debugging_port launches an Electron app with --remote-debugging-port=<N>, activating its Chrome DevTools Protocol (CDP) on that port. This gives the page tool full renderer/DOM access (document, window, fetch, etc.) via page targets rather than the restricted Node main-process context exposed by SIGUSR1. Use port 9222 unless you're running multiple Electron apps simultaneously. Has no effect on non-Electron apps.

Optional webkit_inspector_port launches a Tauri/WKWebView app with WEBKIT_INSPECTOR_SERVER=127.0.0.1:<N>, activating WebKit's remote inspector. The page tool will use this to execute JavaScript in the WKWebView renderer. Use port 9226 (reserved WebKit range: 9226–9228). Requires developerExtrasEnabled = true in the app's WKWebView config (default in Tauri debug builds; production builds may disable it). Has no effect on non-WKWebView apps.

Use this instead of shelling out to open -a — the shell form always activates the target, requires an extra permission prompt for Bash, and doesn't even try to respect background-launch intent.

Returns the launched app's pid, bundle_id, name, active flag, AND a windows array — same per-window shape as list_windows returns (window_id, title, bounds, z_index, is_on_screen, on_current_space, space_ids). That lets callers skip an extra list_windows round-trip before get_window_state(pid, window_id) in the common case. When the launch settles but no window has materialized yet (transient; rare), windows comes back empty — call list_windows(pid) explicitly a moment later to resolve a target.

Arguments:

Browsers require at least one URL. Launching Safari, Chrome, Firefox, Arc, Brave, or Edge without a urls argument starts the process but no window is ever created — subsequent get_window_state, click, and screenshot calls will fail with "no window found." Always pass urls=["about:blank"] for a blank window, or a real URL to open.

{"bundle_id": "com.apple.Safari", "urls": ["about:blank"]}
  • additional_arguments (array of string, optional): Extra command-line arguments passed to the launched process. Passed directly as argv entries — no shell expansion. Example: ["--user-data-dir=/tmp/cua-session", "--no-first-run"] for an isolated Chrome session.
  • bundle_id (string, optional): App bundle identifier, e.g. com.apple.calculator.
  • creates_new_application_instance (boolean, optional): Force a brand-new process even if the app is already running. Useful for isolated browser sessions: pass creates_new_application_instance=true together with additional_arguments=["--user-data-dir=/tmp/session-a", "--no-first-run", "--no-default-browser-check"] to launch a sandboxed Chrome that cannot see the user's real profile, cookies, or extensions. Each session gets its own pid and window identity and can be controlled independently.
  • electron_debugging_port (integer, optional): Launch an Electron app with --remote-debugging-port=<N> so the page tool gets full renderer/DOM access. Use 9222 unless running multiple Electron apps. Ignored for non-Electron apps.
  • name (string, optional): App display name. Used only when bundle_id is absent.
  • urls (array of string, optional): Optional file:// or http(s):// URLs (or plain paths with ~ expansion) to hand to the launched app via application(:open:). For Finder, pass a folder URL or path to open a backgrounded Finder window rooted at that folder — no activation. Apps that don't implement application(:open:) launch normally and ignore these.
  • webkit_inspector_port (integer, optional): Launch a Tauri/WKWebView app with WEBKIT_INSPECTOR_SERVER=127.0.0.1:<N> so the page tool can reach its WebKit inspector. Use 9226 (reserved WebKit range: 9226–9228, distinct from Electron's 9222–9225). Requires developerExtrasEnabled=true in the WKWebView config (default in Tauri debug builds).

list_apps

List macOS apps — both currently running and installed-but-not-running — with per-app state flags:

  • running: is a process for this app live? (pid is 0 when false)
  • active: is it the system-frontmost app? (implies running)

Only apps with NSApplicationActivationPolicyRegular are included — background helpers and system UI agents are filtered out. Installed apps come from scanning /Applications, /Applications/Utilities, ~/Applications, /System/Applications, and /System/Applications/Utilities.

Use this for "is X installed?" as well as "is X running?". For per-window state — on-screen, on-current-Space, minimized, window titles — call list_windows instead. For just opening an app — running or not — call launch_app({bundle_id: ...}) directly; list_apps is not a prerequisite.

Arguments: none.

list_windows

List every layer-0 top-level window currently known to WindowServer — including off-screen ones (hidden-launched, minimized into the Dock, on another Space). Each record self-contains its owning app identity so the caller never has to join back against list_apps.

Use this — not list_apps — for any window-level reasoning: "does this app have a visible window right now?", "which Space is this window on?", "which of this pid's windows is the main one?". list_apps is purely an app / launch surface.

Per-record fields:

  • window_id: CGWindowID, addressable for screenshot / activation / Space-move APIs.
  • pid + app_name: owning app identity.
  • title: current window title (empty string for helpers and chromeless surfaces).
  • bounds: x / y / width / height, top-left origin in global screen points.
  • layer: CGWindow stratum. Always 0 in the default filter; reserved for future higher-layer opt-in.
  • z_index: stacking order on the current Space (higher = closer to front). Cross-Space ordering is undefined.
  • is_on_screen: WindowServer's "visible on the user's current Space right now" bit. False for hidden / minimized / off-Space windows.
  • on_current_space: true when the window is bound to the user's current Space. Omitted when the SkyLight Space SPIs didn't resolve (rare; sandboxed or very old OS).
  • space_ids: every managed Space id the window is a member of. Compare against current_space_id in the top-level response. Omitted in the same failure modes as on_current_space.

Top-level fields:

  • windows: array of the per-window records above.
  • current_space_id: user's active Space id, or null when SPI unavailable.

Inputs: pid (optional — restrict to one pid's windows), on_screen_only (bool, default false — surface off-Space / minimized windows by default). Layer 0 filtering is always applied; menubar strips and dock shields are noise for every current caller.

Arguments:

  • on_screen_only (boolean, optional): When true, drop windows that aren't currently on the user's Space (minimized, hidden, off-Space). Default false.
  • pid (integer, optional): Optional pid filter. When set, only this pid's windows are returned.

move_cursor

Instantly move the mouse cursor to (x, y) in screen points. Uses CGWarpMouseCursorPosition — no drag, no click, no CGEvent.

Arguments:

  • x (integer, required): X in screen points.
  • y (integer, required): Y in screen points.
{"x":100,"y":200}

page

Browser page primitives — execute JavaScript, extract page text, or query DOM elements via CSS selector.

Three actions:

execute_javascript — run arbitrary JS in the active tab and return the result. Wrap in an IIFE with try-catch for safety. Don't use this for elements already indexed by get_window_state — prefer click / set_value there.

get_text — returns document.body.innerText. Faster than walking the AX tree for plain-text content (prices, article body, table values) that the AX tree drops or truncates.

query_dom — runs querySelectorAll(css_selector) and returns each matching element's tag, innerText, and any requested attributes as a JSON array. Useful for structured data (table rows, link hrefs, data-* attributes) that the AX tree flattens.

Supported browsers: Chrome (com.google.Chrome), Brave (com.brave.Browser), Edge (com.microsoft.edgemac), Safari (com.apple.Safari). Requires "Allow JavaScript from Apple Events" enabled — see WEB_APPS.md.

pid and window_id identify the target window. The browser does NOT need to be frontmost for read actions. For mutations that trigger UI changes (form submits, navigation) the browser should be frontmost so the results are visible.

Arguments:

  • action (string, required): Which page primitive to run.

• execute_javascript / get_text / query_dom — see above. • enable_javascript_apple_events — enable 'Allow JavaScript from Apple Events' in Chrome/Brave/Edge. Quits the browser, patches each profile's Preferences JSON, then relaunches. Requires user_has_confirmed_enabling=true — you MUST ask the user for explicit permission before calling this action.

  • attributes (array of string, optional): Element attributes to include per match (action=query_dom only). E.g. ["href", "src", "data-id"]. tag and innerText are always included.
  • bundle_id (string, optional): Browser bundle ID (action=enable_javascript_apple_events only). E.g. com.google.Chrome.
  • css_selector (string, optional): CSS selector (action=query_dom only). Passed to querySelectorAll — standard CSS syntax.
  • javascript (string, optional): JS to execute (action=execute_javascript only). Should return a serialisable value. Wrap in IIFE: (() => { … })().
  • pid (integer, required): Browser process ID.
  • user_has_confirmed_enabling (boolean, optional): Must be true (action=enable_javascript_apple_events only). Set this only after the user has explicitly said yes to enabling JavaScript from Apple Events.
  • window_id (integer, required): CGWindowID of the target browser window.
{"action":"get_text","pid":844,"window_id":10725}

press_key

Press and release a single key, delivered directly to the target pid's event queue via CGEvent.postToPid. The target does NOT need to be frontmost — no focus steal.

Optional element_index + window_id (from the last get_window_state snapshot of that window) pre-focuses that element (via AXSetAttribute(kAXFocused, true)) before the key fires; useful for "type into field → press Return on field." Without element_index, the key is posted to the pid with whatever element is already focused receiving it — useful for scroll keys on a backgrounded web view.

The key is delivered via SkyLight's SLEventPostToPid with an attached SLSEventAuthenticationMessage, which is the path the renderer's keyboard pipeline accepts as authentic input. That's what lets Return commit a Chromium omnibox against a backgrounded window. Falls back to the public CGEventPostToPid when the SkyLight SPI isn't available.

Key vocabulary: return, tab, escape, up/down/left/right, space, delete, home, end, pageup, pagedown, f1-f12, plus any letter or digit. Optional modifiers array takes cmd/shift/option/ctrl/fn. For true combinations (cmd+c), hotkey is a cleaner surface.

Arguments:

  • element_index (integer, optional): Optional element_index from the last get_window_state for the same (pid, window_id). When present, the element is focused before the key fires. Requires window_id.
  • key (string, required): Key name (return, tab, escape, up, down, left, right, space, delete, home, end, pageup, pagedown, f1-f12, letter, digit).
  • modifiers (array of string, optional): Optional modifier names held while the key is pressed (cmd/shift/option/ctrl/fn).
  • pid (integer, required): Target process ID.
  • window_id (integer, optional): CGWindowID for the window. Required when element_index is used. When supplied without element_index, triggers FocusWithoutRaise before the key fires — use this for NSMenu key equivalents (Cmd+S, Cmd+N, …) on backgrounded apps.
{"key":"return","pid":844}

replay_trajectory

Replay a recorded trajectory by re-invoking every turn's tool call in lexical order. dir must point at a directory previously written by set_recording (or the cua-driver recording start CLI). Each turn-NNNNN/ is parsed for action.json, and the recorded tool is called with its recorded arguments via the same dispatch path an MCP / CLI call uses.

Caveats:

  • Element-indexed actions (click({pid, element_index}) etc.) will fail because element indices are per-snapshot and don't survive across sessions. Pixel clicks (click({pid, x, y})) and all keyboard tools replay cleanly. Failures are reported but don't stop replay unless stop_on_error is true.
  • get_window_state and other read-only tools are NOT currently recorded, so replays do not re-populate the per-(pid, window_id) element cache. A recorded session consisting solely of element-indexed clicks will therefore replay as a stream of "No cached AX state" errors — fine for regression diffs, useless for re-driving an app.
  • If recording is ENABLED while replay runs, the replay itself is recorded into the currently configured output directory. That's deliberate: recording a replay against a new build and diffing the two trajectories is the regression-test workflow.

Arguments:

  • delay_ms (integer, optional): Milliseconds to sleep between turns, for human-observable pacing. Default 500.
  • dir (string, required): Trajectory directory previously written by set_recording. Absolute or ~-rooted.
  • stop_on_error (boolean, optional): Stop replay on the first tool-call error. Default true — set false to best-effort through the full trajectory.
{"dir":"~/cua-trajectories/demo1"}

right_click

Right-click against a target pid. Two addressing modes:

  • element_index + window_id (from the last get_window_state snapshot of that window) — performs AXShowMenu on the cached element. This is the reliable path: pure AX RPC, works on backgrounded / hidden windows, no cursor move or focus steal. Requires a prior get_window_state(pid, window_id) in this turn.

  • x, y (window-local screenshot pixels, top-left origin of the PNG returned by get_window_state) — internally routed through AX (AXShowMenu) when the pixel resolves to an actionable element (same reliability as element_index); falls back to a synthesized right-mouse-down / right-mouse-up CGEvent pair posted via auth-signed SLEventPostToPid for non-AX surfaces (canvas / WebView / games). Driver converts image-pixel → screen-point internally. Experimental on Chromium web content in the CGEvent fallback: the event is observed to land as a left-click there (a known limitation matching third-party Computer Use implementations). modifier forces the CGEvent path (AX doesn't propagate modifier keys).

Exactly one of element_index or (x AND y) must be provided. pid is required in both modes. window_id is required when element_index is used. modifier only takes effect in the pixel path (AX actions don't propagate modifier keys).

Arguments:

  • element_index (integer, optional): Element index from the last get_window_state for the same (pid, window_id). Routes through AXShowMenu. Requires window_id.
  • modifier (array of string, optional): Modifier keys held during the right-click: cmd/shift/option/ctrl. Pixel path only.
  • pid (integer, required): Target process ID.
  • window_id (integer, optional): CGWindowID for the window whose get_window_state produced the element_index. Required when element_index is used; ignored in the pixel path.
  • x (number, optional): X in window-local screenshot pixels — same space as the PNG get_window_state returns. Top-left origin of the target's window. Must be provided together with y.
  • y (number, optional): Y in window-local screenshot pixels — same space as the PNG get_window_state returns. Top-left origin of the target's window. Must be provided together with x.
{"pid":844}

screenshot

Capture a screenshot using ScreenCaptureKit. Returns base64-encoded image data for a single window in the requested format (default png).

window_id is required. Get window ids from list_windows.

Requires the Screen Recording TCC grant — call check_permissions first if unsure.

On macOS 26.4.x, ScreenCaptureKit can refuse specific windows on physical Macs (SCStreamError -3801, "Could not start streaming"). The driver retries once and falls back to the legacy CGWindowList path before failing; if both refuse, the error response includes a hint to try a different window_id or switch to capture_mode: ax for get_window_state (the element-indexed flow doesn't need pixels).

Arguments:

  • format (string, optional): Image format. Default: png.
  • quality (integer, optional): JPEG quality 1-95; ignored for png.
  • window_id (integer, required): Required CGWindowID / kCGWindowNumber to capture.
{"window_id":10725}

scroll

Scroll the target pid's focused region by synthesized keystrokes.

Mapping: by: "page" → PageDown/PageUp × amount; by: "line" → DownArrow/UpArrow × amount. Horizontal variants use Left/Right arrow keys. Keys are posted via the same auth-signed SLEventPostToPid path as press_key, so backgrounded Chromium / WebKit windows scroll correctly.

Why keyboard instead of wheel events: probe tests showed CGEventCreateScrollWheelEvent2 posted via SkyLight's per-pid path is silently dropped by Chromium (no Scroll-specific auth subclass exists; the factory falls back to the bare parent class and renderers reject it). Keystrokes don't have that problem — that's why PageDown works when the wheel event didn't.

Optional element_index + window_id (from the last get_window_state snapshot of that window) focuses the element (AXFocused=true) before the keystroke, so the keys land in the scrollable element rather than whatever previously had focus. Skip it if focus was already established by a prior click.

Arguments:

  • amount (integer, optional): Number of keystroke repetitions. Default: 3.
  • by (string, optional): Scroll granularity. Default: line.
  • direction (string, required):
  • element_index (integer, optional): Optional element_index from the last get_window_state for the same (pid, window_id) — focuses the element before the keystrokes fire. Requires window_id.
  • pid (integer, required): Target process ID.
  • window_id (integer, optional): CGWindowID for the window whose get_window_state produced the element_index. Required when element_index is used.
{"direction":"down","pid":844}

set_agent_cursor_enabled

Toggle the visual agent-cursor overlay. When enabled, future pointer actions animate a floating arrow to the target's on-screen position before firing the AX action — purely visual, the AX dispatch itself is unchanged. Disabling removes the overlay immediately.

Default: enabled. Stays on for the life of the MCP session / daemon; disable with {"enabled": false} for headless / CI runs where the visual isn't wanted. The overlay only renders when the driver has an AppKit run loop, which is bootstrapped by mcp and serve but not by one-shot CLI invocations — so this flag is a no-op in fresh-CLI mode.

Arguments:

  • enabled (boolean, required): True to show the overlay cursor; false to hide.
{"enabled":false}

set_agent_cursor_motion

Tune the agent cursor's motion curve. All fields optional; only the knobs you pass change, the rest keep their current value. The knobs map to the cubic Bezier that the cursor traces between successive positions, plus the spring animation that settles the cursor's rotation on arrival.

  • start_handle: how far along the straight line the first control point sits, as a fraction in [0, 1]. Lower = tight departure, higher = floppy departure.
  • end_handle: same, measured from the end point.
  • arc_size: perpendicular deflection as a fraction of path length. 0 = straight line. 0.25 = modest arc. 0.5 = dramatic swoop. Typical: 0.25.
  • arc_flow: asymmetry bias in [-1, 1]. Positive bulges the arc toward the destination; negative toward the origin; 0 is symmetric.
  • spring: settle damping in [0.3, 1.0]. 1.0 = critically damped (no overshoot); 0.4 = bouncy. Typical: 0.72.
  • glide_duration_ms: wall-clock duration of each cursor flight in milliseconds. Higher = slower, more visible for demos / screen recordings. Typical: 750.
  • dwell_after_click_ms: pause the tool holds after the click ripple before returning, letting the cursor visibly rest on the target so consecutive clicks read as human pacing. Only applied while the cursor is enabled. Typical: 400.
  • idle_hide_ms: how long the overlay lingers after the last pointer action before auto-hiding. Re-armed by every click, so a burst of consecutive actions keeps the cursor visible throughout. Higher = more tolerant of follow-up actions without the overlay popping in and out. Typical: 3000.

Defaults: start_handle=0.3, end_handle=0.3, arc_size=0.25, arc_flow=0.0, spring=0.72, glide_duration_ms=750, dwell_after_click_ms=400, idle_hide_ms=3000.

Arguments:

  • arc_flow (number, optional): Asymmetry bias in [-1, 1]. Default 0.
  • arc_size (number, optional): Arc deflection as fraction of path length. Default 0.25.
  • dwell_after_click_ms (number, optional): Pause after the click ripple in ms, for human-like pacing. Default 400.
  • end_handle (number, optional): End-handle fraction in [0, 1]. Default 0.3.
  • glide_duration_ms (number, optional): Flight duration per click in ms. Higher = slower. Default 750.
  • idle_hide_ms (number, optional): How long the overlay lingers after the last click before auto-hiding, in ms. Default 3000.
  • spring (number, optional): Settle damping in [0.3, 1]. Default 0.72.
  • start_handle (number, optional): Start-handle fraction in [0, 1]. Default 0.3.

set_agent_cursor_style

Customize the agent-cursor's visual appearance. All fields optional; omitted fields keep their current value.

  • gradient_colors: Array of CSS hex strings (#RRGGBB or #RGB) defining the arrow fill gradient from tip to tail. E.g. ["#FF6B6B", "#FF8E53"] for a red-orange arrow.
  • bloom_color: CSS hex string for the glow halo and the focus-rect highlight drawn around clicked elements.
  • image_path: Absolute or ~-rooted path to a PNG, JPEG, PDF, or SVG file. When set, replaces the default arrow with this image (drawn at shapeSize × shapeSize points, rotated to match the motion heading). Set to "" (empty string) to revert to the procedural arrow.

Example — brand-colored arrow: {"gradient_colors": ["#A855F7", "#6366F1"], "bloom_color": "#A855F7"}

Example — custom PNG cursor: {"image_path": "~/cursors/my-cursor.png"}

Example — revert to default: {"gradient_colors": [], "bloom_color": "", "image_path": ""}

Arguments:

  • bloom_color (string, optional): CSS hex color for the bloom halo and focus rect. Empty string reverts to default.
  • gradient_colors (array of string, optional): CSS hex color strings for gradient stops (tip to tail). Empty array reverts to default.
  • image_path (string, optional): Path to PNG/JPEG/PDF/SVG cursor image. Empty string reverts to arrow.

set_config

Write a single setting into the persistent driver config at ~/Library/Application Support/<app-name>/config.json. Values survive daemon restarts AND are propagated to the live session state.

Keys are dotted snake_case paths:

  • agent_cursor.enabled (bool)
  • agent_cursor.motion.start_handle (number)
  • agent_cursor.motion.end_handle (number)
  • agent_cursor.motion.arc_size (number)
  • agent_cursor.motion.arc_flow (number)
  • agent_cursor.motion.spring (number)
  • capture_mode (string: vision | ax | som)

Arguments: {"key": "<dotted.path>", "value": <json>}.

Arguments:

  • key (string, required): Dotted snake_case path to a leaf config field, e.g. agent_cursor.enabled.
  • value (undefined, required): New value for key. JSON type depends on the key.
{"key":"return","value":"value"}

set_recording

Toggle trajectory recording. When enabled, every subsequent action-tool invocation (click, right_click, scroll, type_text, type_text_chars, press_key, hotkey, set_value) writes a turn folder under output_dir:

  • app_state.json — post-action AX snapshot for the target pid (same shape as get_window_state).
  • screenshot.png — post-action per-window screenshot of the target's frontmost on-screen window.
  • action.json — tool name, full input arguments, result summary, pid, click point (when applicable), ISO-8601 timestamp.
  • click.png — for click-family actions only, screenshot.png with a red dot drawn at the click point.

Turn folders are named turn-00001/, turn-00002/, etc. Turn numbering restarts at 1 each time recording is (re-)enabled.

Required when enabled=true: output_dir. Expands ~ and creates the directory (and intermediates) if missing. Ignored when enabled=false.

State persists for the life of the daemon / MCP session; a restart resets to disabled with no on-disk state.

Arguments:

  • enabled (boolean, required): True to start recording subsequent action tool calls; false to stop.
  • output_dir (string, optional): Absolute or ~-rooted directory where turn folders are written. Required when enabled=true.
  • video_experimental (boolean, optional): Experimental: also capture the main display to <output_dir>/recording.mp4 via SCStream (H.264, 30fps, no audio, no cursor). Capture-only — no zoom / post-process. Off by default. Ignored when enabled=false.
{"enabled":false}

set_value

Set a value on a UI element. Two modes depending on element role:

  • AXPopUpButton / select dropdown: finds the child option whose title or value matches value (case-insensitive) and AXPresses it directly — the native macOS popup menu is never opened, so focus is never stolen. Use this for HTML <select> elements in Safari or any native NSPopUpButton. Pass the option's display label as value (e.g. "Blue", not "blue").

  • All other elements: writes AXValue directly (sliders, steppers, date pickers, native text fields that expose settable AXValue).

For free-form text entry into web inputs, prefer type_text_chars which synthesises key events — AXValue writes are ignored by WebKit.

Arguments:

  • element_index (integer, required):
  • pid (integer, required):
  • value (string, required): New value. AX will coerce to the element's native type.
  • window_id (integer, required): CGWindowID for the window whose get_window_state produced the element_index. The element_index cache is scoped per (pid, window_id).
{"element_index":14,"pid":844,"value":"42","window_id":10725}

type_text

Insert text into the target pid. Tries AXSetAttribute(kAXSelectedText) first (fast bulk insert — works for standard Cocoa text fields). If the target element rejects the AX write, automatically falls back to character-by-character CGEvent.postToPid synthesis — reaches Chromium / Electron inputs and any surface that doesn't implement kAXSelectedText.

Special keys (Return, Escape, arrows, Tab) go through press_key / hotkey — they are not text.

Optional element_index + window_id (from the last get_window_state snapshot of that window) pre-focuses that element before the write; useful for "fill this specific field." Without element_index, the write targets the pid's currently-focused element — useful after a prior click already set focus.

delay_ms (0–200, default 30) spaces successive characters in the CGEvent fallback path so autocomplete and IME can keep up; ignored when the AX path succeeds.

Requires Accessibility.

Arguments:

  • delay_ms (integer, optional): Milliseconds between characters in the CGEvent fallback path. Default 30. Ignored when the AX path succeeds.
  • element_index (integer, optional): Optional element_index from the last get_window_state for the same (pid, window_id). When present, the element is focused before the write. Requires window_id.
  • pid (integer, required): Target process ID.
  • text (string, required): Text to insert at the target's cursor.
  • window_id (integer, optional): CGWindowID for the window whose get_window_state produced the element_index. Required when element_index is used.
{"pid":844,"text":"hello"}

zoom

Zoom into a rectangular region of a window screenshot at full (native) resolution. Use this when get_window_state returned a resized image and you need to read small text, identify icons, or verify UI details.

Coordinates x1, y1, x2, y2 are in the same pixel space as the screenshot returned by get_window_state (i.e. the resized image if max_image_dimension is active). The maximum zoom region width is 500 px in scaled-image coordinates.

The tool maps coordinates back to original (native) resolution by multiplying by the compression ratio z (original_size / scaled_size). A 500 px-wide region in the scaled image becomes 500 × z native pixels, so the returned image will be larger than 500 px whenever the unzoomed screenshot was downscaled. This is the intended behaviour — the zoom undoes the compression and shows the region at full pixel density.

A 20% padding is automatically added on every side of the requested region so the target remains visible even if the caller's coordinates are slightly off. The padding is clamped to the image bounds.

Requires get_window_state(pid, window_id) earlier in this session so the resize ratio is known.

Arguments:

  • pid (integer, required): Target process ID.
  • x1 (number, required): Left edge of the region (resized-image pixels).
  • x2 (number, required): Right edge of the region (resized-image pixels).
  • y1 (number, required): Top edge of the region (resized-image pixels).
  • y2 (number, required): Bottom edge of the region (resized-image pixels).
{"pid":844,"x1":100,"x2":100,"y1":200,"y2":200}

Was this page helpful?