Cua Docs

Interface Contracts

The cross-cutting contracts behind the CLI and MCP surfaces — transport state, config persistence, capture scope, and how an action is routed.

The CLI (cua-driver call …) and the MCP server (cua-driver serve / mcp) run the same tool code, but they differ in what state survives between calls, where configuration lands, and which parameters a call must carry. This page is the contract map. For the why behind the process shapes, see Process model; for the modality axes, see Capture and dispatch modalities.


CLI versus MCP at a glance

| Dimension | CLI (call) | MCP (serve / mcp) |
| --- | --- | --- |
| Lifetime | one process per action | one process, many actions |
| element_index cache | only when proxied to a running daemon | per connection, lives across calls |
| Identity | anonymous, unless proxied to a daemon | a minted _session_id per connection, or an explicit session |
| Where set_config lands | the persisted global default on disk | an in-memory, session-scoped override |
| Agent cursor | none | shown when a session is declared |

When a daemon is already listening, cua-driver call proxies to it. A freshly built binary's behavior will not appear through the CLI until that daemon restarts, because the call runs in the daemon, not in the new process. Integration tests avoid this by spawning their own MCP server.


Where settings live

set_config decides where a setting is written based on whether a session is declared. The daemon mirrors the public session argument into the reserved _session_id.

| Caller | _session_id | Effect |
| --- | --- | --- |
| cua-driver config set …, one-shot cua-driver call | absent (anonymous) | writes the global DriverConfig and persists to ~/.cua-driver/config.json |
| MCP call with a session | present | in-memory override for that session only; no disk write, no clobber of the default |
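
The write rule in this table can be modeled in a few lines of Python. This is a minimal sketch, not the driver's code: `ConfigStore`, its method names, and the temp-directory file are hypothetical stand-ins. Only the behavior comes from the table: an anonymous write persists to disk, a session write stays in memory.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional


class ConfigStore:
    """Hypothetical model of where a set_config write lands."""

    def __init__(self, config_path: Path):
        # Stands in for ~/.cua-driver/config.json.
        self.config_path = config_path
        # In-memory overrides, keyed by the reserved _session_id.
        self.session_overrides = {}

    def load_global(self) -> dict:
        """Read the persisted global default, or an empty config if none exists."""
        if self.config_path.exists():
            return json.loads(self.config_path.read_text())
        return {}

    def set_config(self, key: str, value: Any, session_id: Optional[str] = None) -> str:
        if session_id is None:
            # Anonymous caller: update the global default and persist it.
            config = self.load_global()
            config[key] = value
            self.config_path.write_text(json.dumps(config))
            return "disk"
        # Session declared: override in memory only; the file is untouched.
        self.session_overrides.setdefault(session_id, {})[key] = value
        return "memory"


store = ConfigStore(Path(tempfile.mkdtemp()) / "config.json")
store.set_config("capture_scope", "window")                     # anonymous -> disk
store.set_config("capture_scope", "desktop", session_id="s1")   # session -> memory
print(store.load_global())        # {'capture_scope': 'window'}
print(store.session_overrides)    # {'s1': {'capture_scope': 'desktop'}}
```

The session write leaves the file's `capture_scope` at `window`, which is the "no clobber of the default" property in the table.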

Every tool then reads the effective value with this precedence:

effective = call-argument  >  session override  >  global default (disk)

Keys that flow through this: capture_scope, max_image_dimension. (capture_mode is deprecated and ignored — it is still accepted for back-compat but has no effect; get_window_state always returns both the tree and a screenshot.) See the set_config reference for the per-session isolation details.
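
The precedence rule above amounts to a first-match lookup across three layers. The sketch below illustrates that in Python; `effective_value` and the sample values (such as `1568`) are hypothetical and are not taken from the driver.

```python
from typing import Any, Mapping, Optional


def effective_value(
    key: str,
    call_args: Mapping[str, Any],
    session_override: Optional[Mapping[str, Any]],
    global_default: Mapping[str, Any],
) -> Any:
    """Return the first value found: call argument, then session override, then global default."""
    for layer in (call_args, session_override or {}, global_default):
        if key in layer:
            return layer[key]
    return None  # key is not set at any layer


# Illustrative values only.
global_default = {"capture_scope": "window", "max_image_dimension": 1568}
session_override = {"capture_scope": "desktop"}

# The session override beats the global default.
print(effective_value("capture_scope", {}, session_override, global_default))
# -> desktop

# A call argument beats the session override.
print(effective_value("capture_scope", {"capture_scope": "window"}, session_override, global_default))
# -> window

# A key with no override falls through to the global default.
print(effective_value("max_image_dimension", {}, session_override, global_default))
# -> 1568
```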


The capture-scope contract

capture_scope decides which parameters a call must carry and what coordinate space it speaks.

 capture_scope = "window"  (default)          capture_scope = "desktop"
 ───────────────────────────────────          ──────────────────────────────────
 per window, background-capable                whole screen, foreground, vision

 observe: get_window_state(pid, window_id)     observe: get_desktop_state()
 act:     click(pid, …)                        act:     click(x, y)
          element_index + window_id  (AX)                true screen pixels
          or x, y + pid  (window-local px)              no pid, no window_id
 needs:   pid (+ window_id for elements)        skips:   window_id, list_windows

|  | window (default) | desktop |
| --- | --- | --- |
| Coordinate space | window-local (the PNG get_window_state returns) | true screen pixels |
| Required params | pid (+ window_id for element_index) | only x, y |
| Capture surface | get_window_state (tree + screenshot) | get_desktop_state (screenshot) |
| Dispatch | background (default) or foreground | foreground only |
| Action rung | element ax action (element_index) or element px action (x, y) | element px action (x, y) only |

Desktop scope is the screen-absolute "Computer-Use 1.0" loop: read the whole screen, click an absolute coordinate, the way a screenshot-only model expects. Window scope is the default because it is what makes background, concurrent automation possible.


How an action is routed

Every input tool — click, scroll, and the keyboard family (type_text, press_key, hotkey) — picks its path from the arguments present:

| Arguments | Path | Behavior |
| --- | --- | --- |
| element_index + window_id | accessibility action | UIA Invoke / AXPerformAction / AT-SPI: background, no cursor move, no focus steal |
| x, y + pid | window-local pixel | coordinates are relative to that window's screenshot. For the keyboard family this px form pixel-clicks (x, y) to establish real renderer focus, then delivers the keystroke(s) to the now-focused element. This is the one-call path for Chromium/Electron inputs the AX layer can't focus |
| x, y, no pid/window_id, scope desktop | screen-absolute | true screen pixels, lands on whatever is frontmost there |
| x, y, no pid/window_id, scope window | rejected | structured desktop_scope_disabled error |

The keyboard family's x, y (px) form is mutually exclusive with element_index (ax) — pass one or the other, not both.

A window-less action is never silently reinterpreted. Under window scope it is rejected with a structured desktop_scope_disabled error that points the caller at set_config capture_scope=desktop, rather than treating screen pixels as window-local pixels.
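
The routing rules in this section can be read as one decision function. The Python sketch below is an illustration under stated assumptions, not the driver's dispatcher: `route_action` and `DesktopScopeDisabled` are hypothetical names, the returned strings are just the path labels from the table, and the mutual-exclusion check (stated above for the keyboard family) is applied to every tool for simplicity.

```python
from typing import Optional


class DesktopScopeDisabled(Exception):
    """Hypothetical stand-in for the structured desktop_scope_disabled error."""

    code = "desktop_scope_disabled"


def route_action(
    capture_scope: str,
    pid: Optional[int] = None,
    window_id: Optional[int] = None,
    element_index: Optional[int] = None,
    x: Optional[float] = None,
    y: Optional[float] = None,
) -> str:
    """Pick a path from the arguments present, following the routing table."""
    has_xy = x is not None and y is not None

    # ax and px forms are mutually exclusive (assumed here for all tools).
    if element_index is not None and has_xy:
        raise ValueError("pass element_index or x, y, not both")

    if element_index is not None:
        if window_id is None:
            raise ValueError("element_index requires window_id")
        return "accessibility action"

    if not has_xy:
        raise ValueError("need element_index + window_id, or x, y")

    if pid is not None:
        return "window-local pixel"

    if window_id is None and capture_scope == "desktop":
        return "screen-absolute"

    if window_id is None:
        # Never reinterpret screen pixels as window-local pixels.
        raise DesktopScopeDisabled(
            "window-less action under window scope; "
            "use set_config capture_scope=desktop"
        )

    # x, y with window_id but no pid is not covered by the table; reject it.
    raise ValueError("x, y form requires pid")


print(route_action("window", window_id=7, element_index=3))  # accessibility action
print(route_action("window", pid=42, x=10, y=20))            # window-local pixel
print(route_action("desktop", x=10, y=20))                   # screen-absolute
```

Calling `route_action("window", x=10, y=20)` raises `DesktopScopeDisabled` instead of guessing a coordinate space, which mirrors the "never silently reinterpreted" rule.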


Platform support

CapabilityWindowsmacOSLinux
get_window_state returns both tree + screenshot (element ax / px actions)
dispatch: background (the no-foreground contract)✅ (X11/AT-SPI; native Wayland input is a gap)
dispatch: foreground + bring_to_front✅ explicit activation (input is already background-safe; activation is for focus-proxy surfaces like RDP)stubbed
get_desktop_state (desktop capture)
Window-less desktop click (click{x,y}, no pid)rolling outrolling out


Mental model

WINDOW scope  = "talk to a window"   → needs pid (+window_id for elements); background by default
DESKTOP scope = "talk to the screen" → needs only x,y; screen-absolute; foreground; vision

CLI  = stateless; writes config to DISK (or proxies to a running daemon)
MCP  = stateful session; config override in MEMORY; owns the agent cursor