MCP Tools
Reference for every MCP tool cua-driver exposes
cua-driver exposes 28 MCP tools through a single stdio server (cua-driver mcp). Every tool is also callable from the shell as cua-driver <name> '<JSON-args>'.
Tool names are snake_case. Responses are MCP CallTool.Result envelopes: a text content block prefixed with a ✅ summary (or the error reason on failure), plus optional image or structured-content blocks on tools that produce them. See the CLI reference for CLI-specific flags like --raw and --image-out.
Tool names here match the CLI form exactly. cua-driver list_apps and the MCP list_apps tool run the same code path.
Discovery
list_apps
List macOS apps, both running and installed-but-not-running, with pid / bundle id / active state.
Arguments: none.
{}list_windows
List every layer-0 top-level window currently known to WindowServer, including off-screen ones (hidden-launched, minimized, on another Space).
Arguments:
pid(integer, optional): Pid filter. When set, only this pid's windows are returned.on_screen_only(boolean, optional): When true, drop windows that aren't currently on the user's Space. Default false.
{"pid": 844, "on_screen_only": true}get_screen_size
Return the main display's logical size in points plus the backing scale factor. Retina displays report 2.0.
Arguments: none.
{}get_cursor_position
Return the current mouse cursor position in screen points (top-left origin).
Arguments: none.
{}get_accessibility_tree
Lightweight desktop snapshot: running regular apps and on-screen visible windows with bounds, z-order, and owner pid. For a single window's internal UI, use get_window_state.
Arguments: none.
{}screenshot
Raw ScreenCaptureKit capture. Full main display, or a single window when window_id is set. Returns an image content block plus a text summary listing on-screen windows.
Arguments:
format(string, optional):"png"or"jpeg". Default"png".quality(integer, optional): JPEG quality 1-95; ignored for png.window_id(integer, optional): CGWindowID /kCGWindowNumberto capture just that window.
{"format": "jpeg", "quality": 80, "window_id": 10725}get_window_state
Snapshot a single window: AX element tree plus a screenshot. Populates the per-pid, per-window element_index cache that mouse and keyboard tools consume.
Arguments:
pid(integer, required): Process ID fromlist_apps.window_id(integer, required): CGWindowID of the target window. Must belong topid. Enumerate vialist_windowsor read fromlaunch_app'swindowsarray.query(string, optional): Case-insensitive substring. When set,tree_markdownonly contains matching lines plus their ancestor chain; element indices andelement_countare unchanged.
{"pid": 844, "window_id": 10725, "query": "save"}Response shape varies by capture_mode (som / ax / vision). Default is som. See the CLI reference for the full matrix.
App lifecycle
launch_app
Launch an app hidden (no focus steal) and return its pid plus the initial windows array. Either bundle_id or name must be provided.
Arguments:
bundle_id(string, optional): App bundle identifier, e.g.com.apple.calculator.name(string, optional): App display name. Used only whenbundle_idis absent.urls(array of string, optional):file:///http(s)://URLs (or plain paths with~expansion) handed to the launched app viaapplication(_:open:). For Finder, a folder URL opens a backgrounded window rooted there. Apps that don't implementapplication(_:open:)launch normally and ignore these.creates_new_application_instance(boolean, optional): When true, force-launches a separate process even if the app is already running. Useful for isolated browser sessions. Default false.additional_arguments(array of string, optional): Extra command-line arguments passed to the launched process, e.g.["--user-data-dir=/tmp/session-a"]for an isolated Chrome profile.
{"bundle_id": "com.google.Chrome", "urls": ["https://example.com"], "creates_new_application_instance": true, "additional_arguments": ["--user-data-dir=/tmp/cua-session"]}check_permissions
Report TCC permission status for Accessibility and Screen Recording. By default also raises the system permission dialogs for any missing grants — Apple's request APIs no-op when the grant is already active, so the call is safe to repeat.
Arguments:
prompt(boolean, optional): Raise the system permission prompts for missing grants. Defaulttrue. Passfalsefor a purely read-only status check.
{"prompt": false}Mouse
All mouse tools accept either the element_index path (preferred; works on backgrounded windows) or a pixel (x, y) path. Pixel coordinates are in window-local screenshot pixels, the same space as the PNG get_window_state returns.
click
Left-click by element or pixel.
Arguments:
pid(integer, required): Target process ID.element_index(integer, optional): Element index from the lastget_window_statefor the same(pid, window_id). Routes through the AX action path. Requireswindow_id.window_id(integer, optional): CGWindowID for the window whoseget_window_stateproducedelement_index. Required in the element path; ignored in the pixel path.x,y(number, optional): Pixel coordinates (top-left origin). Must be provided together.action(string, optional): AX action:press|show_menu|pick|confirm|cancel|open. Defaultpress. Element path only.modifier(array of string, optional): Modifiers held during the click:cmd/shift/option/ctrl. Pixel path only.count(integer, optional): Click count 1-3 (single, double, triple). Default 1. Pixel path only.from_zoom(boolean, optional): When true,x,yare pixel coordinates in the lastzoomimage for this pid; the driver maps them back to window coordinates automatically.debug_image_out(string, optional): Absolute path. On a pixel-addressed click, the tool captures the window at currentmax_image_dimension, draws a red crosshair at the received(x, y), and writes the PNG here. Use to verify coordinate-space correctness. Incompatible withfrom_zoom.
{"pid": 844, "window_id": 10725, "element_index": 14}double_click
Double-click by element (routes through AXOpen when advertised, else pixel double-click at the element's center) or by pixel.
Arguments:
pid(integer, required): Target process ID.element_index(integer, optional): Requireswindow_id.window_id(integer, optional): Required whenelement_indexis used.x,y(number, optional): Pixel coordinates (top-left origin). Provided together.modifier(array of string, optional):cmd/shift/option/ctrl. Pixel path only.
{"pid": 844, "x": 320, "y": 180}right_click
Right-click by element (routes through AXShowMenu) or by pixel.
Arguments:
pid(integer, required): Target process ID.element_index(integer, optional): Requireswindow_id.window_id(integer, optional): Required whenelement_indexis used.x,y(number, optional): Pixel coordinates (top-left origin). Provided together.modifier(array of string, optional):cmd/shift/option/ctrl. Pixel path only.
{"pid": 844, "window_id": 10725, "element_index": 7}move_cursor
Warp the real mouse cursor to a screen-point coordinate. Does not click.
Arguments:
x(integer, required): X in screen points.y(integer, required): Y in screen points.
{"x": 640, "y": 400}scroll
Synthesize PageUp/PageDown/arrow keystrokes against the target pid. When element_index is provided, the element is focused before the keystrokes fire.
Arguments:
pid(integer, required): Target process ID.direction(string, required):up|down|left|right.amount(integer, optional): Keystroke repetitions, 1-50. Default 3.by(string, optional):line|page. Defaultline.element_index(integer, optional): Requireswindow_id.window_id(integer, optional): Required whenelement_indexis used.
{"pid": 844, "direction": "down", "amount": 5, "by": "page"}Keyboard and text
All keyboard tools are pid-scoped: the event is delivered to the target process regardless of current frontmost app.
press_key
Single key press, optionally with modifiers. Delivered via CGEvent.postToPid.
Arguments:
pid(integer, required): Target process ID.key(string, required): Key name:return,tab,escape,up,down,left,right,space,delete,home,end,pageup,pagedown,f1-f12, letter, digit.modifiers(array of string, optional):cmd/shift/option/ctrl/fnheld while the key is pressed.element_index(integer, optional): When present, the element is focused before the key fires. Requireswindow_id.window_id(integer, optional): Required whenelement_indexis used.
{"pid": 844, "key": "return"}hotkey
Modifier combo as a single array, e.g. ["cmd", "c"]. Requires at least two entries (one or more modifiers plus one non-modifier key).
Arguments:
pid(integer, required): Target process ID.keys(array of string, required): Modifier(s) and one non-modifier key.
{"pid": 844, "keys": ["cmd", "shift", "s"]}type_text
Insert text at the target's current cursor via AXSelectedText. Fast (single AX write) but skipped by apps with custom text layers; for Chromium / Electron inputs use type_text_chars.
Arguments:
pid(integer, required): Target process ID.text(string, required): Text to insert at the target's cursor.element_index(integer, optional): When present, the element is focused before the write. Requireswindow_id.window_id(integer, optional): Required whenelement_indexis used.
{"pid": 844, "window_id": 10725, "element_index": 12, "text": "hello"}type_text_chars
Character-by-character input via CGEvent.postToPid. Slower than type_text but reaches Chromium and Electron inputs that ignore AX writes.
Arguments:
pid(integer, required): Target process ID.text(string, required): Text to type into the target's focused element.delay_ms(integer, optional): Milliseconds between characters, 0-200. Default 30.
{"pid": 844, "text": "hello world", "delay_ms": 40}Element values
set_value
Write an element's AXValue directly. For sliders, steppers, text fields, and similar controls where AX coerces the string to the native type.
Arguments:
pid(integer, required): Target process ID.window_id(integer, required): CGWindowID for the window whoseget_window_stateproducedelement_index.element_index(integer, required): Element index from the lastget_window_statefor the same(pid, window_id).value(string, required): New value. AX coerces it to the element's native type.
{"pid": 844, "window_id": 10725, "element_index": 9, "value": "42"}Browser
page
Browser page primitives — execute JavaScript, extract page text, or query DOM elements. Supports Chrome, Brave, Edge, and Safari (requires "Allow JavaScript from Apple Events"). For WKWebView/Tauri apps where the remote inspector is blocked, get_text and query_dom automatically fall back to the accessibility tree.
Arguments:
pid(integer, required): Browser process ID.window_id(integer, required): CGWindowID of the target browser window.action(string, required): One ofexecute_javascript,get_text,query_dom,enable_javascript_apple_events.javascript(string): JS to evaluate — actionexecute_javascriptonly. Wrap in an IIFE with try-catch for safety.css_selector(string): CSS selector — actionquery_domonly.attributes(array of string, optional): Attributes to include per element — actionquery_domonly.tagandtextare always included.bundle_id(string): Browser bundle ID — actionenable_javascript_apple_eventsonly.user_has_confirmed_enabling(boolean): Must betrue— actionenable_javascript_apple_eventsonly. You must ask the user for explicit permission before passing this.
{"pid": 1234, "window_id": 5678, "action": "get_text"}
{"pid": 1234, "window_id": 5678, "action": "execute_javascript", "javascript": "document.title"}
{"pid": 1234, "window_id": 5678, "action": "query_dom", "css_selector": "a[href]", "attributes": ["href"]}Zoom
zoom
Native-resolution crop of a previously captured window region. Pass the region in resized-image pixel coordinates (same space get_window_state reports); the tool scales back to the source resolution, pads by 20% on each side, captures the frontmost window of pid, and returns the crop.
Arguments:
pid(integer, required): Target process ID.x1(number, required): Left edge of the region (resized-image pixels).y1(number, required): Top edge of the region (resized-image pixels).x2(number, required): Right edge of the region (resized-image pixels).y2(number, required): Bottom edge of the region (resized-image pixels).
{"pid": 844, "x1": 200, "y1": 100, "x2": 400, "y2": 250}After zoom, pass from_zoom: true to click with pixel coordinates in the zoomed image; the driver maps them back automatically.
Agent cursor overlay
The agent cursor is an optional visual overlay (Bezier-arc glide + click ripple + dwell) drawn on top of every synthetic click. Motion parameters persist across restarts via the config file.
get_agent_cursor_state
Read the current overlay configuration: enabled flag, motion options, glide / dwell / idle-hide timings.
Arguments: none.
{}set_agent_cursor_enabled
Toggle the overlay. Persists to config.
Arguments:
enabled(boolean, required): True to show; false to hide.
{"enabled": true}set_agent_cursor_motion
Tune the Bezier-arc + spring motion knobs. All fields are optional; omitted fields keep their current value.
Arguments:
start_handle(number, optional): Start-handle fraction in [0, 1]. Default 0.3.end_handle(number, optional): End-handle fraction in [0, 1]. Default 0.3.arc_size(number, optional): Arc deflection as fraction of path length. Default 0.25.arc_flow(number, optional): Asymmetry bias in [-1, 1]. Default 0.spring(number, optional): Settle damping in [0.3, 1]. Default 0.72.glide_duration_ms(number, optional): Flight duration per click, 50-5000. Default 750.dwell_after_click_ms(number, optional): Pause after the click ripple, 0-5000. Default 400.idle_hide_ms(number, optional): Overlay linger after last click before auto-hide, 100-60000. Default 3000.
{"arc_size": 0.3, "glide_duration_ms": 900}Config
Persistent settings live at ~/Library/Application Support/Cua Driver/config.json. Writes route through the running daemon when reachable so live state (e.g. AgentCursor.shared) picks up changes without a restart.
get_config
Return the current config as pretty-printed JSON, identical to the on-disk shape.
Arguments: none.
{}set_config
Write a single leaf field identified by a dotted snake_case path.
Arguments:
key(string, required): Dottedsnake_casepath to a leaf config field, e.g.agent_cursor.enabled.value(required): New value. JSON type depends on the key.
{"key": "capture_mode", "value": "ax"}Supported keys and ranges: see the CLI reference.
Recording and replay
The trajectory recorder captures every action-tool call (click, right_click, scroll, type_text, type_text_chars, press_key, hotkey, set_value) into numbered turn folders. Recordings can be replayed turn-by-turn.
get_recording_state
Report whether the recorder is currently enabled, the output directory, and the next turn number.
Arguments: none.
{}Structured-content response: {"enabled": bool, "next_turn": int, "output_dir"?: string}.
set_recording
Toggle the recorder. When enabling, output_dir is required.
Arguments:
enabled(boolean, required): True to start; false to stop.output_dir(string, optional): Absolute or~-rooted directory for turn folders. Required whenenabled=true.video_experimental(boolean, optional): Experimental: also capture the main display to<output_dir>/recording.mp4via SCStream (H.264, 30fps, no audio, no cursor). Off by default. Ignored whenenabled=false.
{"enabled": true, "output_dir": "~/cua-trajectories/demo1"}replay_trajectory
Re-invoke every turn's tool call in lexical order against the live system.
Arguments:
dir(string, required): Trajectory directory previously written byset_recording. Absolute or~-rooted.delay_ms(integer, optional): Milliseconds to sleep between turns, 0-10000. Default 500.stop_on_error(boolean, optional): Stop replay on the first tool-call error. Default true; set false to best-effort through the full trajectory.
{"dir": "~/cua-trajectories/demo1", "delay_ms": 800, "stop_on_error": false}Was this page helpful?