Cua Docs

MCP Tool Notes

Cross-cutting contracts behind the MCP tools — shared parameters, required-parameter rules, platform-specific parameters, and the action response shape.

These notes are hand-maintained companions to the auto-generated MCP Tools reference. They document the cross-cutting parameter contract and response shape that span multiple tools and are not derivable from any single tool's schema.

Common parameters#

Several parameters are a shared cross-platform contract: the same JSON shape on Windows, macOS, and Linux, composed from canonical schema fragments and enforced by a CI consistency gate so the three platforms cannot drift. Tools accept them uniformly.

ParameterWhereNotes
sessionevery action and cursor toolOptional run identity for the agent cursor and per-session state; the same id works over MCP, the CLI, or the raw socket and follows the run across apps and windows. Accepted on all three platforms — earlier Windows and Linux builds rejected it via additionalProperties:false.
delivery_modethe input family (click, double_click, right_click, drag, scroll, type_text, press_key, hotkey)"background" (default) injects without fronting or raising the target — the no-foreground contract. "foreground" briefly fronts the target, acts, then restores the prior frontmost — the explicit last resort when a background attempt did not land. Legacy "auto" is removed; omitted or unknown values fall back to "background" for safety.
capture_modeget_window_stateDeprecated and ignored. Still accepted for back-compat so old callers don't error, but it has no effect — get_window_state always returns both the accessibility tree and a screenshot by default. There is no ax/vision/som capture choice; the modality (ax vs px) is chosen at action time by how you address the target.
include_screenshotget_window_stateBoolean, default true (returns the tree and a screenshot). Set false to skip the screenshot grab and return the tree only — a perf opt-out for the cheap re-index-before-an-element-ax-action path, not a modality choice.
modifier, button, element_index, element_tokenpointer and element toolsHeld modifier keys, mouse button, and the two element-addressing handles.

Required parameters#

The required set is uniform across platforms: click requires nothing, scroll requires direction, and zoom requires window_id plus x1/y1/x2/y2. pid is conditionally required — needed for every per-window action, omitted for a windowless desktop-scope call — so it is validated in code with a clear error rather than pinned in the schema. A client that omits pid for a desktop-scope action is therefore not schema-rejected.

Platform-specific parameters#

A few parameters are platform-specific by design and are intentionally NOT part of the shared contract:

Parameter or toolPlatformWhy
launch_app identifiersmacOS: bundle_id, urls. Windows: aumid, launch_path, path, start_minimized.App launch is OS-native. name is the portable fallback that works on both.
debug_window_infoWindows onlyA window-handle / class / rect / z-order debug tool with no macOS or Linux counterpart.
check_permissions.promptmacOS onlyRaises the TCC permission dialogs; there is no Windows or Linux equivalent of TCC.
Windowless screen-absolute (desktop-scope) clicks/scrollsall three platforms, exposed differentlyThe capability exists everywhere, but the knob differs: macOS takes a per-call scope:"desktop" parameter on the pointer tools, while Windows and Linux gate it on the persisted capture_scope:"desktop" config (set_config). So a per-call scope param appears only on macOS; on Windows/Linux you opt in once via config. Unifying this knob is a tracked follow-up.
modifier on a background clickhonored on macOS and Linux; dropped on a Windows background clickA Windows background click goes through UIA Invoke / PostMessage, which carry no live keyboard state, so modifier takes effect only on the delivery_mode:"foreground" (SendInput) rung there.

Action response shape#

Action tools (click, double_click, right_click, drag, scroll, type_text, press_key, hotkey, set_value) return these structured fields:

FieldTypeMeaningValue set / presence
pathstringDelivery rung that ran."ax", "cgevent", "cgevent_fg", "key_events", "key_events_fg", "pixel", "x11_atspi", "x11_pixel", "x11_pixel_fg", "msaa".
verifiedboolean or absentAX read-back verification result. true means the driver read the effect back through AX; false means the action dispatched but is unconfirmed; absent means the tool does not carry this field.true, false, or absent.
effectstringAction confidence signal."confirmed", "unverifiable", "suspected_noop".
escalationobject or absentMachine-readable next-rung recommendation. Present only when the driver recommends climbing the ladder.{ recommended: "px" | "foreground" | "page", reason: string }, or absent.

Per-tool notes#

get_window_state degraded results#

On Linux (and macOS/Windows), the structured result may include degraded: true alongside a degraded_reason string when the accessibility walk completed but found no actionable elements. This distinguishes "a11y bridge not up, daemon not on the session D-Bus, or non-AX surface" from a window that genuinely has no controls — do not treat elements: [] as authoritative when degraded: true is set. When degraded: true, act by px off the screenshot returned in the same response.

page platform support#

get_text, query_dom, click_element, and execute_javascript work cross-platform (macOS, Windows, Linux). insert_text and type_keystrokes are implemented on macOS only for now; Windows and Linux return a clear "not implemented" error rather than a silent no-op. Tracked in trycua/cua#2084.