Interface Contracts

The cross-cutting contracts behind the CLI and MCP surfaces — transport state, config persistence, capture scope, and how an action is routed.

The CLI (cua-driver call …) and the MCP server (cua-driver serve / mcp) run the same tool code, but they differ in what state survives between calls, where configuration lands, and which parameters a call must carry. This page is the contract map. For the why behind the process shapes, see Process model; for the modality axes, see Capture and dispatch modalities.

CLI versus MCP at a glance

| Dimension | CLI (`call`) | MCP (`serve` / `mcp`) |
| --- | --- | --- |
| Lifetime | one process per action | one process, many actions |
| `element_index` cache | only when proxied to a running daemon | per connection, lives across calls |
| Identity | anonymous, unless proxied to a daemon | a minted `_session_id` per connection, or an explicit `session` |
| Where `set_config` lands | the persisted global default on disk | an in-memory, session-scoped override |
| Agent cursor | none | shown when a `session` is declared |

When a daemon is already listening, cua-driver call proxies to it. A freshly built binary's behavior will not appear through the CLI until that daemon restarts, because the call runs in the daemon, not in the new process. Integration tests avoid this by spawning their own MCP server.

Where settings live

set_config decides where a setting is written based on whether a session is declared. The daemon mirrors the public session argument into the reserved _session_id.

| Caller | `_session_id` | Effect |
| --- | --- | --- |
| `cua-driver config set …`, one-shot `cua-driver call` | absent (anonymous) | writes the global `DriverConfig` and persists to `~/.cua-driver/config.json` |
| MCP call with a `session` | present | in-memory override for that session only: no disk write, no clobber of the default |

Every tool then reads the effective value with this precedence:

effective = call-argument  >  session override  >  global default (disk)

Keys that flow through this: capture_scope, max_image_dimension. (capture_mode is deprecated and ignored — it is still accepted for back-compat but has no effect; get_window_state always returns both the tree and a screenshot.) See the set_config reference for the per-session isolation details.
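The landing rule and the precedence chain above can be modeled in a few lines. This is a minimal sketch, not the driver's implementation: `ConfigStore`, its fields, and the return values of `set_config` are hypothetical names invented for illustration, and the disk write to `~/.cua-driver/config.json` is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ConfigStore:
    """Hypothetical model of where set_config lands and how reads resolve."""

    # Stand-in for the global default that the real driver persists on disk.
    global_config: Dict[str, Any] = field(
        default_factory=lambda: {"capture_scope": "window"}
    )
    # In-memory overrides, keyed by session id.
    session_overrides: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def set_config(self, key: str, value: Any,
                   session_id: Optional[str] = None) -> str:
        if session_id is None:
            # Anonymous caller: change the global default.
            # (The real driver would also persist it; omitted in this sketch.)
            self.global_config[key] = value
            return "global"
        # Session declared: override in memory only, global default untouched.
        self.session_overrides.setdefault(session_id, {})[key] = value
        return "session"

    def effective(self, key: str,
                  call_args: Optional[Dict[str, Any]] = None,
                  session_id: Optional[str] = None) -> Any:
        # Precedence: call-argument > session override > global default.
        if call_args and call_args.get(key) is not None:
            return call_args[key]
        override = self.session_overrides.get(session_id or "", {})
        if key in override:
            return override[key]
        return self.global_config.get(key)
```

In this model, setting `capture_scope` to `"desktop"` with a session id changes what that session reads, while a second session and the global default still see `"window"`.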

The capture-scope contract

capture_scope decides which parameters a call must carry and what coordinate space it speaks.

 capture_scope = "window"  (default)          capture_scope = "desktop"
 ───────────────────────────────────          ──────────────────────────────────
 per window, background-capable                whole screen, foreground, vision

 observe: get_window_state(pid, window_id)     observe: get_desktop_state()
 act:     click(pid, …)                        act:     click(x, y)
          element_index + window_id  (AX)                true screen pixels
          or x, y + pid  (window-local px)              no pid, no window_id
 needs:   pid (+ window_id for elements)        skips:   window_id, list_windows

|  | `window` (default) | `desktop` |
| --- | --- | --- |
| Coordinate space | window-local (the PNG `get_window_state` returns) | true screen pixels |
| Required params | `pid` (+ `window_id` for `element_index`) | only `x`, `y` |
| Capture surface | `get_window_state` (tree + screenshot) | `get_desktop_state` (screenshot) |
| Dispatch | background (default) or foreground | foreground only |
| Action rung | element ax action (`element_index`) or element px action (`x,y`) | element px action (`x,y`) only |

Desktop scope is the screen-absolute "Computer-Use 1.0" loop: read the whole screen, click an absolute coordinate, the way a screenshot-only model expects. Window scope is the default because it is what makes background, concurrent automation possible.
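To make the difference between the two coordinate spaces concrete, the sketch below maps a window-local point to screen pixels. It is illustrative geometry only, under loud assumptions: `window_local_to_screen`, `window_origin`, and `scale` are hypothetical names, the screenshot's top-left is assumed to coincide with the window's top-left on screen, and `scale` is assumed to be the number of screen pixels per screenshot pixel (1.0 for an unscaled capture). A caller does not need this for window-scoped calls, which take window-local coordinates directly.

```python
from typing import Tuple


def window_local_to_screen(x: float, y: float,
                           window_origin: Tuple[int, int],
                           scale: float = 1.0) -> Tuple[int, int]:
    """Map a point in a window screenshot to screen pixels (illustrative).

    window_origin: the window's top-left corner in screen pixels (assumed input).
    scale: screen pixels per screenshot pixel; 1.0 assumes an unscaled capture.
    """
    origin_x, origin_y = window_origin
    return (origin_x + round(x * scale), origin_y + round(y * scale))
```

The same `(10, 20)` therefore means different things in each scope: a point near the top-left of one window under window scope, and a point near the top-left of the whole screen under desktop scope.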

How an action is routed

Every input tool — click, scroll, and the keyboard family (type_text, press_key, hotkey) — picks its path from the arguments present:

| Arguments | Path | Behavior |
| --- | --- | --- |
| `element_index` + `window_id` | accessibility action | UIA Invoke / `AXPerformAction` / AT-SPI; background, no cursor move, no focus steal |
| `x`, `y` + `pid` | window-local pixel | coordinates are relative to that window's screenshot. For the keyboard family this px form pixel-clicks `(x, y)` to establish real renderer focus, then delivers the keystroke(s) to the now-focused element. This is the one-call path for Chromium/Electron inputs the AX layer can't focus |
| `x`, `y`, no `pid`/`window_id`, scope `desktop` | screen-absolute | true screen pixels, lands on whatever is frontmost there |
| `x`, `y`, no `pid`/`window_id`, scope `window` | rejected | structured `desktop_scope_disabled` error |

The keyboard family's x, y (px) form is mutually exclusive with element_index (ax) — pass one or the other, not both.

A window-less action is never silently reinterpreted. Under window scope it is rejected with a structured desktop_scope_disabled error that points the caller at set_config capture_scope=desktop, rather than treating screen pixels as window-local pixels.
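The routing table can be read as a small decision function over the arguments present. The following is a sketch of that reading, not the driver's code: `route_action`, `DesktopScopeDisabled`, and the returned path labels are hypothetical. It also simplifies in two ways that the source does not state: it applies the `element_index` versus `x`, `y` exclusivity to every tool rather than only the keyboard family, and it rejects argument combinations the table does not list.

```python
from typing import Any, Dict


class DesktopScopeDisabled(Exception):
    """Hypothetical stand-in for the structured desktop_scope_disabled error."""

    code = "desktop_scope_disabled"


def route_action(args: Dict[str, Any], capture_scope: str = "window") -> str:
    """Pick a dispatch path from the arguments present (illustrative only)."""
    has_index = args.get("element_index") is not None
    has_xy = args.get("x") is not None and args.get("y") is not None
    has_pid = args.get("pid") is not None
    has_window = args.get("window_id") is not None

    if has_index and has_xy:
        # Simplification: the source states this rule for the keyboard family.
        raise ValueError("pass element_index or x, y, not both")
    if has_index:
        if not has_window:
            raise ValueError("element_index requires window_id")
        return "accessibility_action"
    if has_xy and has_pid:
        return "window_local_pixel"
    if has_xy and not has_window:
        if capture_scope == "desktop":
            return "screen_absolute"
        # Never reinterpret screen pixels as window-local pixels.
        raise DesktopScopeDisabled(
            "window-less action under window scope; "
            "set_config capture_scope=desktop to enable it"
        )
    # Combinations the routing table does not cover.
    raise ValueError("unsupported argument combination")
```

Under this sketch, `route_action({"x": 10, "y": 20})` raises `DesktopScopeDisabled` with the default window scope, while the same arguments with `capture_scope="desktop"` route to the screen-absolute path.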

Platform support

| Capability | Windows | macOS | Linux |
| --- | --- | --- | --- |
| `get_window_state` returns both tree + screenshot (element ax / px actions) | ✅ | ✅ | ✅ |
| `dispatch: background` (the no-foreground contract) | ✅ | ✅ | ✅ (X11/AT-SPI; native Wayland input is a gap) |
| `dispatch: foreground` + `bring_to_front` | ✅ | ✅ explicit activation (input is already background-safe; activation is for focus-proxy surfaces like RDP) | stubbed |
| `get_desktop_state` (desktop capture) | ✅ | ✅ | ✅ |
| Window-less desktop click (`click{x,y}`, no `pid`) | ✅ | rolling out | rolling out |

Mental model

WINDOW scope  = "talk to a window"   → needs pid (+window_id for elements); background by default
DESKTOP scope = "talk to the screen" → needs only x,y; screen-absolute; foreground; vision

CLI  = stateless; writes config to DISK (or proxies to a running daemon)
MCP  = stateful session; config override in MEMORY; owns the agent cursor