
Modality Test Suite & Harnesses

How the Cua Driver modality test suite and per-OS app harnesses work — action rungs, scopes, the action matrix, and per-platform results.

This document describes the cua-driver test-harness: a cross-OS, cross-toolkit rig that exercises every driver action against a controlled application and scores each action twice — did it land an effect, and did it keep the app in the background. It is the authoritative map of which driver mechanisms work where, and why each limitation manifests in the mode/scope it does.

1. Overview: what the suite validates

The suite validates two things, per driver action, per surface:

  1. The no-foreground / background contract — cua-driver's differentiator. The driver is supposed to drive an app without bringing it to the foreground and without moving the visible cursor (it dispatches via the accessibility tree or via background coordinate injection, and paints a per-session overlay cursor instead of warping the real pointer). The suite measures, for every action, whether the target app stayed in the background (held) or jumped frontmost (STOLE FOCUS).

  2. Cross-toolkit dispatch coverage — whether an action actually lands on each UI toolkit. The same 8-action × 6-control matrix runs against WPF, WinUI3, WebView2, Electron (Windows), GTK3 + Electron (Linux), and AppKit, SwiftUI, WKWebView, Electron (macOS). Toolkits route input differently (UIA Invoke vs AT-SPI doAction vs AX actions vs synthetic pixels vs composition input-sites), so "click works" is a per-toolkit, per-dispatch-path fact, not a global one.

Scope caveat — this tests driver mechanisms, not agent task success. A cell that reads ✓ means the driver delivered the action and the target's own instrumented state changed (a checkbox toggled, a slider moved, last_action= updated). It does not mean an LLM agent located the control correctly or chose the right action. Agent locator quality and agent self-judgment are explicitly kept on separate axes (see §7) so a bad locator can never mask — or be masked by — a driver regression.


2. How the suite works: the modality recorder

The core instrument is the modality recorder (one per OS: windows/wpf-recorder.ps1, linux/lin-rec.py + lin-rec-electron.py, macos/mac-rec.py). Each run produces one annotated screen recording:

  • Test harness on the LEFT, cua-driver-panel dashboard on the RIGHT (always-on-top). The dashboard is a tiny HTML page served over loopback and shown via a chrome --app (Windows/macOS) or WebKit (Linux) window; it polls a status.json the recorder writes (~5 Hz) and renders the live verdict.
  • The dashboard shows the run's modality, a live foreground/background indicator (current frontmost window title), and a per-action dual verdict.
  • The pink/cyan overlay is the driver's per-session agent cursor — proof the driver is acting without a real pointer move.
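The recorder-to-dashboard handoff above (the recorder writes status.json, the dashboard polls it) can be sketched as an atomic JSON write plus a read. This is a minimal illustration only: write_status, read_status and the payload fields (modality, frontmost, actions) are hypothetical names, not the recorder's actual schema.

```python
import json
import os
import tempfile


def write_status(path, status):
    """Write status atomically so a poller never reads a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(status, handle)
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise


def read_status(path):
    """What one dashboard poll would do: load the latest snapshot."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical payload; the real status.json fields are not documented here.
example_status = {
    "modality": "ax-bg",
    "frontmost": "Untitled - Notepad",
    "actions": [{"name": "click", "effect": "worked", "contract": "held"}],
}
```

Writing to a temporary file and renaming it is one common way to keep a ~5 Hz poller from ever seeing partial JSON; whether the recorders do this is not stated in this document.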

The 5 modality modes (per surface)

Action rung × foreground/background × scope = 5 single-modality runs:

Mode | Action rung | Dispatch | Foreground? | Scope
--- | --- | --- | --- | ---
ax-fg | element ax action (accessibility) | element_index | app kept FRONT | window
ax-bg | element ax action | element_index | app stays BACKGROUND (contract measured) | window
px-fg | element px action (pixel) | pixel coords | app kept FRONT | window
px-bg | element px action | pixel coords | app stays BACKGROUND (contract measured) | window
px-desktop | element px action | screen pixels (window-less) | FRONT | desktop

The run names use ax/px as labels for the element-dispatch vs pixel-dispatch runs; capture_mode itself is now deprecated and ignored: get_window_state returns both the tree and a screenshot regardless of what you pass. The distinction the recorder exercises is the action rung (the element ax action addressing by element_index vs the element px action addressing by pixel x,y), not a capture toggle.

The fg/bg split only matters for the contract (measured in the *-bg runs). px-desktop sets capture_scope=desktop and issues window-less, screen-absolute actions.

Dual scoring per action

Every action row carries two independent measurements (rendered as two badges on the dashboard):

  1. ✓ worked / ✗ no-op (effect) — did the action change the harness's own state? Read back after the action via the harness's instrumented status labels (agreed=, slider_value=, last_action=, mirror=, menu_action=, scroll_offset=).
  2. held / STOLE FOCUS (contract) — sampled for ~1.5 s after the action: did the target window's title become the frontmost window? In fg modes this badge is replaced by foreground (steal is expected/intentional).

The dashboard tally at the bottom is the live count: effects: N/M actions changed the app and, in bg modes, focus contract: S/A stole focus.
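As a rough sketch of the dual scoring and the tally line, the following keeps the two measurements independent per action. ActionScore and tally are illustrative names. The sketch also assumes that actions with no asserted effect (such as press-key) are excluded from the effects denominator but still count toward the contract denominator, which is inferred from counts like "5/7 effects" alongside "0/8 stole" later in this document.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActionScore:
    name: str
    effect: Optional[bool]  # True = worked, False = no-op, None = no asserted effect
    stole: Optional[bool]   # True = STOLE FOCUS, False = held, None = fg mode (not measured)


def tally(scores: List[ActionScore]) -> str:
    """Build the dashboard-style summary line from per-action scores."""
    asserted = [s for s in scores if s.effect is not None]
    worked = sum(1 for s in asserted if s.effect)
    line = f"effects: {worked}/{len(asserted)} actions changed the app"
    measured = [s for s in scores if s.stole is not None]
    if measured:  # only bg modes carry a contract measurement
        stole = sum(1 for s in measured if s.stole)
        line += f"; focus contract: {stole}/{len(measured)} stole focus"
    return line
```

Keeping effect and contract as separate fields means a no-op that held focus and a landed action that stole focus are both visible, rather than collapsing into one pass/fail.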

How the verifier reads each toolkit's instrumented labels

The harnesses all expose the same set of status labels (single source of truth: shared/scenarios.json + shared/web/index.html), but the verifier reads them differently per platform:

Platform / toolkit | How the verifier reads state
--- | ---
Windows WPF / WinUI3 | UIAutomation: finds the harness window by title substring, scans ControlType.Text descendants, regex-extracts agreed=, slider_value=, last_action=, mirror=, menu_action=, scroll_offset=
Windows WebView2 / Electron | Same UIA scan, but the window is matched by substring because Chromium titles carry a [cdp=NNNN] suffix; the web-AX tree only surfaces with --force-renderer-accessibility
Linux GTK3 | The harness writes a state file (/tmp/cua-lin-state.json); the recorder's hstate() reads it (AT-SPI text isn't reliably enumerable for all labels)
Linux Electron | Chromium web-AX over AT-SPI (--force-renderer-accessibility), read from the get_window_state tree
macOS AppKit / SwiftUI | AX: status labels render as AX text nodes in tree_markdown; do_native() drives the action, then re-reads the labels
macOS WKWebView / Electron | Chromium/WebKit web-AX text (WEBISH path), same label contract
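The regex extraction mentioned for the UIA path can be sketched as below. It assumes each label renders as key=value with a whitespace-free value inside a text node; the exact on-screen format and the helper name parse_labels are assumptions for illustration.

```python
import re

LABELS = ("agreed", "slider_value", "last_action", "mirror", "menu_action", "scroll_offset")
_LABEL_RE = re.compile(r"\b(" + "|".join(LABELS) + r")=(\S*)")


def parse_labels(text_nodes):
    """Collect label values from an iterable of text-node strings."""
    state = {}
    for text in text_nodes:
        for key, value in _LABEL_RE.findall(text):
            state[key] = value  # a later node overrides an earlier one
    return state
```

The same parser could be pointed at AX or web-AX text, since the label contract is shared; only the way the text nodes are enumerated differs per platform.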

The per-action verify(action, before, after) logic is consistent across recorders, e.g.: click → agreed flipped; double → last_action == double_click; right → last_action == right_click or menu_action changed; drag → slider increased; scroll → scroll_offset increased; setval → mirror == set-by-cua; type → mirror contains typed-by-cua; press-key (Tab) → no asserted effect (na).
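A minimal sketch of those rules, assuming before and after are dicts of label strings (for example as produced by a label parser). The short action names mirror the list above; the None return for press-key and the _num helper are choices of this sketch, not the recorders' code.

```python
def _num(state, key):
    """Read a numeric label, treating a missing or non-numeric value as 0."""
    try:
        return float(state.get(key, 0))
    except (TypeError, ValueError):
        return 0.0


def verify(action, before, after):
    """Return True (landed), False (no-op), or None (no asserted effect)."""
    if action == "click":
        return before.get("agreed") != after.get("agreed")
    if action == "double":
        return after.get("last_action") == "double_click"
    if action == "right":
        return (after.get("last_action") == "right_click"
                or before.get("menu_action") != after.get("menu_action"))
    if action == "drag":
        return _num(after, "slider_value") > _num(before, "slider_value")
    if action == "scroll":
        return _num(after, "scroll_offset") > _num(before, "scroll_offset")
    if action == "setval":
        return after.get("mirror") == "set-by-cua"
    if action == "type":
        return "typed-by-cua" in (after.get("mirror") or "")
    if action == "press-key":
        return None
    raise ValueError(f"unknown action: {action}")
```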

Honesty principle (FINDINGS.md): the harness/recorder stays honest — real driver gaps are documented as driver work, fixed in the driver first, then re-recorded. The recorder does not paper over a no-op.

Session lifecycle & disposal

Each recorder declares a session (the agent-cursor identity, e.g. d1 / macd-…) via start_session, drives every action under it, and must end_session it on teardown so the cursor + per-session state are disposed — not leaked.

  • Disposal is double-guarded. end_session and the idle-TTL sweep (evict_idle, default 300 s, overridable via CUA_DRIVER_RS_SESSION_IDLE_TTL_SECS, swept every 30 s) both dispose through the same fire_session_end hook fan-out: agent cursor removed, recording stopped, per-session config cleared. A reused id starts fresh, not poisoned.
  • Test: cua-driver-core/tests/session_lifecycle.rs asserts start → use → end_session disposes (and a reused id starts fresh), and that idle-TTL eviction disposes the same way — verified via the disposal-hook contract both paths fan out to (the concrete cursor/config registries are a daemon-layer assertion). It passes a short TTL straight to evict_idle; the 300 s production default is never lowered for tests.
  • The recorders dispose explicitly: end_session runs in a try/finally before they kill the daemon (mac-rec.py uses an atexit net instead, since it leaves the daemon running), so a session can't leak even if a run errors.
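The double-guarded disposal described in this list can be illustrated with a toy registry in which end_session and evict_idle both fan out through one hook list. This is a Python illustration of the contract only; the driver itself is Rust, and SessionRegistry, touch and on_session_end are hypothetical names.

```python
class SessionRegistry:
    def __init__(self):
        self._last_used = {}  # session id -> last activity timestamp (seconds)
        self._hooks = []      # disposal hooks, each called with the session id

    def on_session_end(self, hook):
        self._hooks.append(hook)

    def start_session(self, session_id, now):
        self._last_used[session_id] = now

    def touch(self, session_id, now):
        if session_id not in self._last_used:
            raise KeyError(session_id)
        self._last_used[session_id] = now

    def end_session(self, session_id):
        if self._last_used.pop(session_id, None) is not None:
            self._fire_session_end(session_id)

    def evict_idle(self, ttl_secs, now):
        idle = [sid for sid, ts in self._last_used.items() if now - ts >= ttl_secs]
        for sid in idle:
            del self._last_used[sid]
            self._fire_session_end(sid)
        return idle

    def _fire_session_end(self, session_id):
        for hook in self._hooks:  # same fan-out for both disposal paths
            hook(session_id)


registry = SessionRegistry()
disposed = []
registry.on_session_end(disposed.append)

registry.start_session("d1", now=0.0)
try:
    registry.touch("d1", now=1.0)  # drive actions under the session
finally:
    registry.end_session("d1")     # always dispose, even if the run errors
```

Passing the TTL as an argument to evict_idle mirrors the testing approach described above: a test can use a short TTL without touching the production default.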

3. Test-app harnesses

Each harness is a self-contained host app the driver drives. They share fixtures via shared/scenarios.json (AutomationIds / AX identifiers / DOM ids / window titles) and, for the web hosts, shared/web/index.html, so identifiers never drift between apps and the Rust integration tests under rust/crates/cua-driver/tests/harness_*.rs. Built outputs stage into rust/test-apps/harness-{name}/ via build/{macos.sh,windows.ps1,linux.sh}.

The 6 parity controls

The "control-parity pass" brought every surface up to the WPF baseline of 6 controls, so the same 8-action matrix is exercisable everywhere:

  1. checkbox (chk-agree*) → agreed=
  2. click-target (left / right / double) → last_action=, clicks=
  3. slider (sld-value) → slider_value=
  4. scroll-target (tall scroll-tall region) → scroll_offset=
  5. text-input (txt-input, mirrored) → mirror=
  6. context-menu (btn-context) → menu_action=

(Plus a counter= increment button used as a generic AX-press probe.) WPF carries an extended scenario set on top (modal MessageBox, owned/layered popups, native Win32 child HWND, menu bar, combo/list boxes) — it is the "fullest harness".

Per OS / toolkit

OS | Harness | What it is | Build | Source
--- | --- | --- | --- | ---
Windows | WPF | WPF / UIA host: popups, layered windows, modal MessageBox, HwndHost native child | build/windows.ps1 → harness-wpf | apps/windows/wpf/ (C#/XAML)
Windows | WinUI3 | WinUI3 unpackaged: DirectComposition popups, XAML Popup primitive, CommandBarFlyout; HWND subclassed to translate WM_VSCROLL → ChangeView | build/windows.ps1 → harness-winui3 | apps/windows/winui3/
Windows | WebView2 | WPF host + Microsoft.Edge.WebView2 loading the shared DOM; CDP on :9222 | build/windows.ps1 → harness-webview | apps/windows/webview2/
Windows | Electron | Electron host loading the same shared DOM; CDP on :9223 | apps/cross-platform/electron/build.ps1 | apps/cross-platform/electron/
Linux | GTK3 | PyGObject GTK3 app; AT-SPI exposes accessible NAME (not an AutomationId); writes /tmp/cua-lin-state.json for the verifier | build/linux.sh → harness-gtk3 | apps/linux/gtk3/main.py
Linux | Electron | Same Electron bundle run from node_modules under Xvfb + dbus with --force-renderer-accessibility | run via lin-run-electron.sh | apps/cross-platform/electron/
macOS | AppKit | Cocoa: NSButton, NSScrollView, NSMenu, NSSlider, menubar item | build/macos.sh → harness-appkit | apps/macos/appkit/main.swift
macOS | SwiftUI | SwiftUI: .popover(), .contextMenu, declarative Slider/Toggle/ScrollView | build/macos.sh → harness-swiftui | apps/macos/swiftui/main.swift
macOS | WKWebView | Apple WebKit host loading the shared DOM; the native analogue of WebView2 | build/macos.sh → harness-wkwebview | apps/macos/wkwebview/main.swift
macOS | Electron | Same shared DOM under Chromium-on-macOS | apps/cross-platform/electron/build.sh | apps/cross-platform/electron/

Instrumented labels (the verifier contract)

Every harness updates these labels in response to a successful action; the verifier asserts on them:

Label | Set by | Verifies
--- | --- | ---
last_action= | click-target (left/right/double) | click / double / right landed
clicks= | click-target | click count
slider_value= | slider | drag / set_value on a range control
scroll_offset= | scroll-target | scroll landed
agreed= | checkbox | click toggled a toggle
menu_action= | context-menu item | right-click → context menu → item invoked
mirror= | text-input | type / set_value into a text field (typed-by-cua / set-by-cua)
counter= | increment button | generic AX-press probe

4. Action rungs & scopes

Action rung (how the action addresses the target)

get_window_state is perception-mode-agnostic: it returns both the accessibility tree and a screenshot by default (include_screenshot:false skips the screenshot grab — a perf opt-out, not a modality choice). capture_mode is deprecated and ignored — there is no ax/vision capture toggle anymore. The modality is chosen at action time by how you address the target, and that one choice selects the rung:

Rung | How the agent addresses the target | How actions dispatch
--- | --- | ---
element ax action | by element_index (read off the tree) | UIA Invoke (Windows) / AT-SPI doAction (Linux) / AX actions (macOS); backgroundable, driver-verifiable
element px action | by pixel x,y (read off the screenshot in the same response) | synthetic/injected pointer events at (x,y); best-effort, caller-confirmed
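To make "the addressing choice selects the rung" concrete, here is a small sketch that classifies an action's arguments. The argument names element_index, pid, window_id, x and y appear in this document; the numeric values, the function rung_of, and the rejection of mixed addressing are assumptions of the sketch (how the driver treats a request carrying both is not described here).

```python
def rung_of(action_args):
    """Classify an action by how it addresses its target."""
    has_index = "element_index" in action_args
    has_pixel = "x" in action_args and "y" in action_args
    if has_index and has_pixel:
        # This sketch's choice; the driver's handling of mixed addressing is not documented here.
        raise ValueError("address the target by element_index or by x,y, not both")
    if has_index:
        return "element ax action"
    if has_pixel:
        return "element px action"
    raise ValueError("no target address: expected element_index or x,y")


# Illustrative argument sets; the pid/window_id/index values are made up.
ax_click = {"pid": 1234, "window_id": 42, "element_index": 7}
px_click = {"pid": 1234, "window_id": 42, "x": 310, "y": 188}
```

Both argument sets can be built from a single get_window_state response, since the tree and the screenshot come back together.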

Deprecated aliases. som (the old set-of-marks mode) and screenshot are no longer advertised capture modes — like ax/vision/capture_mode itself, they still decode but are ignored (both the tree and the screenshot come back regardless). The modality recorder's runs exercise the element-dispatch path and the pixel-dispatch path, so the recorder runs and the results matrices below are unchanged — they are framed here as the element ax action and the element px action rungs.

Capture scope (capture_scope)

Scope | Meaning
--- | ---
window | single target window (default)
desktop | full display

Why mode/scope determines where a limitation shows

This is the central organizing idea (from CUA_DRIVER_LIMITATIONS_AND_TEST_MATRIX.md):

  • The element ax action (element_index) path. Limits in the element path (a toolkit with no doAction for right-click, a WinUI3 composition input-site the WPF path can't reach, an AXValue a SwiftUI slider rejects) surface in this rung — the ax- runs.
  • The element px action (pixel) path. Limits in the pixel path (GTK dropping synthetic X events, NSButton ignoring a synthetic mouseDown, a 2× Retina coordinate conversion) surface in this rung — the px- runs.
  • fg/bg isolates the contract (only *-bg measures focus steal).
  • desktop scope is where window-less / coordinate-space-conversion bugs surface (e.g. the macOS 2× Retina desktop path, the macOS no-windowless-click gap).

5. Action × control matrix

8 actions × 6 controls. Each modality run filters the plan: set_value is dropped in the px modes (the pixel-dispatch runs — it's AX-only); px-desktop keeps only {click, scroll, type, press-key} (window-less screen actions).

Action | Control(s) exercised
--- | ---
click | checkbox (toggle); click-target (left)
double-click | click-target
right-click | click-target (web/WinUI3 records last_action); context-menu (opens menu)
drag | slider (thumb→X)
scroll | scroll-target
set_value (AX) | slider (range value); text-input (set-by-cua)
type | text-input (typed-by-cua)
press-key (Tab) | focus move; no asserted effect

Notes:

  • WPF has a dedicated context-menu control; WinUI3 / web harnesses have none, so right-click is aimed at the click-target and verified via last_action=right_click.
  • SwiftUI's AX has no distinct right-press action, so right-click on its button coerces to a normal AXPress — right-click semantics there are covered by the dedicated context_menu scenario.
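The per-mode plan filtering described at the top of this section can be sketched as follows. The action and mode names come from this document; plan_for is an illustrative helper, not the recorders' function.

```python
ACTIONS = ["click", "double-click", "right-click", "drag",
           "scroll", "set_value", "type", "press-key"]
DESKTOP_ACTIONS = {"click", "scroll", "type", "press-key"}  # window-less screen actions


def plan_for(mode):
    """Return the ordered action plan for one modality run."""
    if mode in ("ax-fg", "ax-bg"):
        return list(ACTIONS)
    if mode in ("px-fg", "px-bg"):
        return [a for a in ACTIONS if a != "set_value"]  # set_value is AX-only
    if mode == "px-desktop":
        return [a for a in ACTIONS if a in DESKTOP_ACTIONS]
    raise ValueError(f"unknown mode: {mode}")
```

This is why contract tallies later in the document have a denominator of 8 for ax runs and 7 for px runs.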

6. RESULTS MATRIX (the core)

Per cell: ✓ landed · ✗ no-op · ⚠ guarded/fail-loud for the effect, with the reason. The Contract column is the focus verdict measured in the *-bg runs. "land" = the harness's own instrumented label changed. Verification status is drawn from FINDINGS.md, the limitations doc, and the fix-commit bodies; runtime status is flagged where a commit deferred it.

Legend: ✓ lands · ✗ no-op (honest) · ⚠ guarded clean-fail / fail-loud · — not exercised · n/a not in plan / no asserted effect.

macOS

Contract: HOLDS — 0/8 stole on AppKit in both ax-bg and px-bg (pixel left-click/double/drag landed while Chrome stayed frontmost). WKWebView also holds — confirming the Windows Chromium steal is Windows-specific, not WebKit-on-macOS.

Harness | mode | Per-action results (click, double, right, drag, scroll, set_value, type, press-key) | Contract | Notes
--- | --- | --- | --- | ---
AppKit | ax | ✓ (fixed bg, click_at_xy_with_window_local, 80b4e3d7) · ✓ (AXShowMenu→delivering-pixel fallback, 80b4e3d7) · ⚠ element scroll no-ops on content-height containers (use x,y) · ✓ CFNumber NSSlider (92d8aebf, verified 0→50) · press-key n/a | 0/8 stole | ax-bg 5/7 effects
AppKit | px | no-op on standard NSButton even frontmost (modal mouseDown reads window-server queue, not postToPid → use element_index) · ✓ now fires rightMouseDown (window-number + button + primer, e7145c18) · ✓ via per-pid pixel-wheel scroll_wheel_at_xy (e7145c18) · set_value n/a · press-key n/a | 0/8 stole | needs target frontmost; px-bg 4/6
SwiftUI | ax | ✓ (via dedicated context-menu) · ✗ slider drag is a composition no-op · ⚠ nested scroll no-op · AXValue unsettable (-25200); needs AXIncrement/AXDecrement stepping (d2b29230) · press-key n/a | 0/8 stole | full 6-control parity confirmed live
WKWebView | ax+px | ✓ slider (pixel drag at live frame) · page scroll works, nested overflow:auto div no-op (div has no tabindex → never keyboard-focusable; needs pixel-wheel) · AXValue set, no DOM input event (web mirror unchanged) · press-key n/a | HOLDS | native analogue of WebView2. Recorder fix (mac-rec.py): web dispatch now uses the live element frame, not a stale hardcoded coord map that was ~185px off → every WKWebView pixel action used to miss. ax-fg re-measured 5/6 (click/double/right/drag/type land; only set_value the honest web no-op).
Electron (mac) | ax/px | ✗ (web scroll, same as WKWebView) · press-key n/a | holds (Chromium-on-mac does not steal like Windows) | local-only recordings

macOS gotchas filed as driver bugs: end_session poisons a reused session id (subsequent actions silently no-op); AppKit window height drifts between launches (store targets as window-local points, convert to live screenshot px).

Linux

Harness | mode | Per-action results (click, double, right, drag, scroll, set_value, type, press-key) | Contract | Notes
--- | --- | --- | --- | ---
GTK3 | ax (AT-SPI) | ✓ left-click via hit-test→doAction (7d358283) · ✗ no doAction equiv · ✗ no doAction equiv · ⚠ value-only (no Action), driven via set_value · ⚠ surfaced now (is_indexable = actions || has_value); driven via set_value · ✓ slider/scroll value widgets now surface · press-key n/a | ax-bg 1/8 stole (only set_value; corrected via genuine-anchor baseline 5c0a1d3c; was 3/8) | left/set_value/type land
GTK3 | px (Xvfb) | set_value n/a · press-key n/a | 0/7 stole | GTK drops synthetic XSendEvent; XTEST core events don't reach its XInput2 path
GTK3 | px (real Xorg) | set_value n/a · press-key n/a | holds focus | right/double/middle-click + scroll land via uinput/XInput2-MPX + shield-grab (79e546ca); capability auto-detected via real_pointer_input_available(). Runtime-verified on a real Xorg server (dummy-driver): the probe flips TRUE, all four actions LAND and HOLD focus (_NET_ACTIVE_WINDOW unchanged), confirmed by the harness oracle + a middle-click PRIMARY paste. Xvfb can't bind uinput as an X slave, so the path auto-skips there.
Electron (Linux) | ax | ✓ click · ✓ drag · ✗ scroll resolves but no-op (AT-SPI synthetic) · ✓ type · press-key n/a | ax-bg 2/8 stole (set_value + drag; was 3/8; recorder baseline artifact, 5c0a1d3c) | drag lands a value-change but its synthetic window-coord activates Chromium (never reaches the slider thumb); still holds far better than Windows Electron's reported 7/8 (also an artifact)
Electron (Linux) | px | ✓ pixel double-click fires on click-target · set_value n/a · press-key n/a | px-bg 6/7 stole (pixel dispatch foregrounds Chromium) |
GTK4 (gnome-calculator) | ax (AT-SPI) | ✓ via doAction (coordinate-free; lands without a valid frame) · ✗ no doAction equiv · ✗ no doAction equiv · ⚠ value-only (no Action), driven via set_value · ⚠ driven via set_value · press-key n/a | holds | GTK4 coord fix required for frame/agent-cursor. GTK4's AT-SPI bridge returns Component.GetExtents(SCREEN)=(0,0) for every widget (GNOME/gtk #1564/#1739 a11y rework). Fix: queries CoordType::Window (GTK4 reports correctly per-widget) and reconstructs screen as x11_window_origin + _GTK_FRAME_EXTENTS.(left,top) + WINDOW_xy, gated on presence of _GTK_FRAME_EXTENTS so non-GTK toolkits (Qt) keep their correct SCREEN path. Verified live: button "7" = window(55,27) + inset(61,55) + WINDOW(16,293) = screen(132,375). doAction element clicks land regardless (coordinate-free). GNOME VM lane only.
GTK4 (gnome-calculator) | px (real Xorg) | ✓ (after GTK4 coord fix) · set_value n/a · press-key n/a | holds | Pixel coords reliable only after WINDOW+_GTK_FRAME_EXTENTS reconstruction; without the fix, cursor/frame collapses to the window corner. GNOME Shell requires real console Xorg (software GLX), not Xvnc.
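The GTK4 screen-coordinate reconstruction in the notes above is plain vector addition, gated on whether the window carries _GTK_FRAME_EXTENTS. A minimal sketch, with the function name and argument shapes chosen for illustration and the frame inset passed as just its (left, top) pair:

```python
from typing import Optional, Tuple

Point = Tuple[int, int]


def widget_screen_xy(screen_extents: Point,
                     window_extents: Point,
                     x11_window_origin: Point,
                     frame_inset: Optional[Point]) -> Point:
    """Return a widget's screen position.

    frame_inset is the (left, top) part of _GTK_FRAME_EXTENTS, or None when
    the property is absent (non-GTK toolkits keep their SCREEN extents).
    """
    if frame_inset is None:
        return screen_extents
    return (
        x11_window_origin[0] + frame_inset[0] + window_extents[0],
        x11_window_origin[1] + frame_inset[1] + window_extents[1],
    )


# The verified example from the notes: button "7" in gnome-calculator.
button_7 = widget_screen_xy(
    screen_extents=(0, 0),        # what GTK4's bridge reports for SCREEN
    window_extents=(16, 293),     # WINDOW-relative position
    x11_window_origin=(55, 27),
    frame_inset=(61, 55),
)
```

With those inputs button_7 evaluates to (132, 375), matching the live measurement quoted above.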

Windows

Contract: HOLDS on WPF / WinUI3 / WebView2. Electron — see the caveat below (the prior "7/8 stole" is a recorder measurement artifact).

Harness | mode | Per-action results (click, double, right, drag, scroll, set_value, type, press-key) | Contract | Notes
--- | --- | --- | --- | ---
WPF | ax | ✓ (UIA Toggle) · ⚠ off-screen at 556px reflow → clean no-op (guarded; UIA Invoke works off-screen) · ✓ slider 0→48 · ⚠ off-screen reflow → guarded no-op · ✓ (mirror=set-by-cua) · ✓ (recorder focuses the box first) · press-key n/a | 2/8 stole | "fullest harness"; ax-bg ~5/7 effects
WPF | px | ⚠ guarded (point_in_window_bounds refuses off-screen point; off-screen controls now ScrollIntoView + actuate via ancestor ScrollPattern, c3efb587, runtime-verified counter 0→1) · ⚠ off-screen reflow · set_value n/a · press-key n/a | 1–2/7 stole | off-screen guard prevents taskbar misfire
WinUI3 | ax | ✓ left-click (UIA Invoke) · now LANDS: double UIA-Invoke under WS_EX_NOACTIVATE guard (531aa6de, runtime-verified live) · fail-loud background_unavailable: no contract-safe path (pen injection steals; WM_*BUTTON doesn't land) · ⚠ fail-loud (same composition input-site gap) · ✓ (HWND subclass: WM_VSCROLL → ChangeView) · press-key n/a | 0/7 stole | gap needs a WinUI3-specific composition InputSite path
WebView2 | ax | ✗ checkbox below the fold (web won't scroll in ax) · ✓ records last_action=right_click · ✗ host HWND doesn't route scroll to renderer · press-key n/a | 0/7 stole | needs --force-renderer-accessibility; 4/7 effects
Electron | ax | ✗ web scroll · press-key n/a | see caveat | 5/7 effects

Other Windows: UIA element-cache use-after-free (concurrent ax sessions on the same window) — fixed with a RetainedElement retain-under-lock guard (d95b89a1); proven by a deterministic cache_uaf_repro test (pre-fix path takes a real access violation, fixed path survives a 6-thread stress loop, 531aa6de). Recording the VM needs Session 2 attached (tscon … /dest:console) and ffmpeg on PATH before serve.

Windows Electron contract caveat — RESOLVED. FINDINGS.md records Electron as stealing 7/8 in ax-bg. That was a recorder MEASUREMENT ARTIFACT: the recorder re-asserted its dashboard panel with SetWindowPos (z-order only, no activation), so it never held a real foreground baseline — after the first inject action click-activated Chromium, the harness stayed frontmost and every later step false-positived as a steal. The recorder was fixed (49bdb41b) to genuinely SetForegroundWindow a real non-harness anchor (mspaint) before each step. Re-measured: Electron ax-bg = 0/8 (held) — corroborated by baseline.log (anchor held ×8), an independent metric.log MEASURE=0/8, and frame f03 (actions landed while the harness stayed non-foreground). So Electron HOLDS the contract, consistent with Linux/macOS Electron. (The committed matrix-electron-ax-bg.mp4 still shows the old 7/8 until re-recorded with the fixed recorder.)


7. Vision-agent coordinate test

vision-agent-test/ (vision_agent_test.py, 010fdd78) is a newer test that hits the driver the way a vision agent actually does — and removes the modality recorder's overfit (hand-tuned window-local points run through a private ratio, which never exercises the driver's image→screen mapping).

Invariant under test: the pixel an agent reads off the returned screenshot is the pixel that gets clicked — verified by the target's own instrumented state changing.

The loop (no cheating in locate or click):

  1. capture: get_window_state (window; reads the screenshot it returns by default) / get_desktop_state (desktop, true pixels): the exact image an agent receives.
  2. locate: a deterministic pixel in the returned-image coordinate space (PixelRegistryLocator: a pre-measured pixel read off the real PNG, with a dims-guard that fails loud if the pinned geometry drifts). No element_index, no hand-converted window-local points. locate(image, target, dims) → (x,y) is pluggable (an LLM locator can drop in later).
  3. act: click/right_click/scroll at that pixel (scope matches capture).
  4. verify: the harness oracle (last_action=, clicks=, …) → objective pass/fail. A coordinate mis-map leaves the oracle unchanged → FAIL.
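The four-step loop above can be sketched with a pinned-pixel locator and a stand-in harness. Only the locate(image, target, dims) signature and the PixelRegistryLocator name come from this document; DimsMismatch, FakeHarness and run_case are hypothetical scaffolding, and the fake replaces both the driver and the target app.

```python
class DimsMismatch(Exception):
    """Raised when the captured image no longer matches the pinned geometry."""


class PixelRegistryLocator:
    def __init__(self, registry):
        # registry: target name -> ((expected_width, expected_height), (x, y))
        self._registry = registry

    def locate(self, image, target, dims):
        expected_dims, pixel = self._registry[target]
        if tuple(dims) != tuple(expected_dims):  # dims-guard: fail loud on drift
            raise DimsMismatch(f"{target}: pinned {expected_dims}, got {tuple(dims)}")
        return pixel


class FakeHarness:
    """Stand-in for driver + target app: a click inside the button bumps clicks."""

    def __init__(self, dims, button_rect):
        self.dims = dims
        self.button_rect = button_rect  # (left, top, right, bottom) in image pixels
        self.clicks = 0

    def capture(self):
        return b"", self.dims  # image bytes are unused by the pinned locator

    def click(self, x, y):
        left, top, right, bottom = self.button_rect
        if left <= x < right and top <= y < bottom:
            self.clicks += 1


def run_case(harness, locator, target):
    """capture, locate, act, verify; True only if the oracle changed."""
    image, dims = harness.capture()
    x, y = locator.locate(image, target, dims)
    before = harness.clicks
    harness.click(x, y)
    return harness.clicks != before
```

Because the locator is just an object with a locate method, an LLM-backed locator could be passed in place of the pinned registry without changing run_case.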

Run: python3 vision_agent_test.py {wkwebview-click-window|wkwebview-click-desktop|appkit-click-window|safari-learnmore-desktop|all}.

Three axes kept separate so none can mask the others: (a) the driver coordinate invariant (this test, deterministic → the regression guard); (b) agent locator quality (future, LLM, same locate() signature, scored separately, never gates the regression); (c) agent self-judged success with no oracle (future, isolated track).

What the deterministic version caught that the modality suite structurally couldn't:

  • 2× Retina desktop path now correct + guarded — pixel (340,1358) in the 3024×1964 desktop PNG converts to screen-point (170,679) and lands (oracle + a real Safari navigation confirm it). This turns the Retina escape into a permanently-guarded one-liner.
  • Pixel-click is a no-op on AppKit NSButton even frontmost (modal mouseDown reads the window-server queue, not the per-pid postToPid queue) — the modality suite drives this via AXPress, so it never saw it.
  • The pixel path requires the target app frontmost (the AX path doesn't).
  • The AppKit harness window AX returns only the menu bar — its clicks= oracle is unreachable that way; WKWebView exposes it fine.
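The 2× Retina conversion in the first bullet reduces to dividing image pixels by the native/logical ratio. In this sketch the logical desktop size of 1512×982 points is an assumption derived from the stated 2× backing scale and the 3024×1964 PNG; the function name is illustrative.

```python
def image_px_to_screen_pt(x, y, image_size, screen_size_pts):
    """Map a pixel in the returned desktop image to a logical screen point."""
    image_w, image_h = image_size
    screen_w, screen_h = screen_size_pts
    scale_x = image_w / screen_w  # native/logical ratio, 2.0 on a 2x display
    scale_y = image_h / screen_h
    return (round(x / scale_x), round(y / scale_y))


# The documented case: pixel (340, 1358) in the 3024x1964 desktop PNG.
# The 1512x982 logical size is assumed, not stated in the source.
point = image_px_to_screen_pt(340, 1358, (3024, 1964), (1512, 982))
```

Here point evaluates to (170, 679), the screen-point quoted above. Skipping the division would send the click to (340, 1358) in point space, far from the intended target.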

8. Known overfitting caveats

The suite's own authors flag two ways the modality recorder could have lied, and the de-risking work for each:

(a) Recorder hand-coords masked the desktop 2× Retina bug. The modality recorder stores targets as hand-tuned window-local points and converts them through a private ratio. That conversion is self-consistent with the driver's own assumption, so it never exercised the driver's real image→screen mapping — and the desktop-scope 2× Retina off-by-backing-scale bug (a center-pixel pick warping to the corner) slipped through every px-desktop run. De-risk: the vision-agent coordinate test (§7) reads a pixel straight off the returned PNG with a dims-guard, caught the bug, and now guards it (80b4e3d7 fixes the desktop branch to divide x,y by the native/logical ratio; 010fdd78 adds the regression test).

(b) Recorder contract measurement had no real foreground baseline → the false "Electron 7/8 stole". The recorder judges the contract by checking whether the target's title is the frontmost window, but it never established a genuine, distinct foreground window to begin with — so Windows Electron read as stealing 7/8 when the corrected understanding is that Chromium-on-Windows holds the contract (matching macOS/Linux Electron). De-risk: the recorder-contract fix (49bdb41b) now genuinely SetForegroundWindows a real anchor before each step; re-measured Electron ax-bg = 0/8 (held), so the claim is now verified and the legacy FINDINGS "7/8" is a confirmed measurement artifact. The Linux recorders had the same flaw — also fixed (5c0a1d3c): GTK3 + Linux-Electron ax-bg corrected 3/8 → 1/8 (only set_value genuinely steals); macOS was already fine.
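The baseline fix in (b) amounts to re-establishing a genuine non-harness foreground window before every step, and refusing to score if that fails. The sketch below uses a toy FakeDesktop in place of real window-manager calls and probes focus once per step rather than sampling for ~1.5 s; all names are illustrative.

```python
def measure_steals(steps, focus_anchor, frontmost_title, target_title):
    """steps: list of (name, callable). Returns the names of steps that stole focus."""
    stole = []
    for name, act in steps:
        focus_anchor()  # establish a genuine non-harness foreground first
        if frontmost_title() == target_title:
            raise RuntimeError("anchor did not take the foreground; baseline is invalid")
        act()
        if frontmost_title() == target_title:
            stole.append(name)
    return stole


class FakeDesktop:
    """Toy window manager: tracks only which window title is frontmost."""

    def __init__(self):
        self.front = "anchor"

    def focus_anchor(self):
        self.front = "anchor"

    def frontmost_title(self):
        return self.front

    def activate(self, title):
        self.front = title
```

Without the per-step re-anchor, the first action that activates the target would leave it frontmost and every later step would read as a steal, which is the artifact described above.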

Both caveats reflect the same lesson: a recorder that hand-feeds the driver its own assumptions can hide bugs in exactly the path a real agent uses. The deterministic, oracle-checked vision-agent test and the re-baselined contract measurement are the two structural fixes.


9. Edge cases: real closed-source apps

The synthetic harnesses pin down the heuristics; real apps surface behaviours a single-process test window never can. These were found driving live closed-source apps (Finder, System Settings, Calculator, Safari on macOS; Calculator/Notepad UWP, Edge, Explorer on Windows) and are why the suite is a floor, not the ceiling.

macOS (Finder, System Settings, Calculator, Safari)

  1. Pixel click hits the right pixel but the wrong window. With ~8 overlapping same-pid Finder windows, a crosshair dead-centre on the target file still posts to screen coordinates, so an occluding sibling intercepts it → no-op. A per-window screenshot masks it. Strong argument for the ax/element path, which is z-order independent.
  2. set_value and type_text both falsely succeed on a background search field (System Settings, SwiftUI). set_value writes AXValue (text appears, the search action never fires); type_text posts keys that a non-first-responder window drops. Neither drives the field; both report success.
  3. Finder column filenames don't advertise AXPress → a default click is an honest no-op; the driver surfaces AXOpen/AXShowMenu/AXConfirm, so an agent must pick action:"open"/"pick".
  4. The Calculator result is AX-invisible (no AX node for the display) → an AX-only agent can't read the answer; the keypad itself is labelled. A vision readout is required.
  5. AppKit AX-tree duplication (System Settings ~half duplicated, Safari a duplicated toolbar) plus whole-menu-bar walking inflates real-app context versus synthetic.

Windows (Calculator/Notepad UWP, Edge, Explorer)

  A. UWP window-identity split. launch_app("Calculator") returns the package backing PID and a window_id that's stale by the next call; the real top-level HWND belongs to ApplicationFrameHost.exe. Driving by the returned pid+window_id errors with No window with window_id … exists. The driver should resolve UWP windows to their ApplicationFrameHost host / relink the churned HWND.
  B. Real UWP is drivable via the element path without the uiAccess worker. Calculator's num5Button drove the display 0→5 via UIA Invoke from the Medium-IL daemon with no cua-driver-uia.exe running. This refines the #1602/serve.rs assumption: only the pixel/SendInput path needs the uiAccess worker for AppContainer apps; the element_index UIA Invoke/ValuePattern path works on real UWP as-is.
  C. Real Edge (Chromium) holds the foreground contract. With Notepad pinned top, 3× click + double-click + right-click on background Edge left the z-order unchanged (background, no foreground swap). The shield validated on synthetic Electron (0/8) generalises to a real closed-source Chromium, and the double/right-click that needed the WinUI3 composition fix synthetically just work on real Chromium in the background.
  D. element_index requires pid (+window_id). Element actions with element_index alone fail-fast with Missing required integer field: pid. Correct, but the MCP tool descriptions under-emphasise it: an agent that omits pid and filters stderr perceives a silent no-op. The element_index tool schemas should state pid's necessity explicitly.
  E. get_screen_size under-reports desktop width (1024 reported vs 1824 actual span, likely an RDP dynamic-resolution artifact). The element path is unaffected (window-local frames stay consistent); a pixel/vision agent computing against 1024 width misplaces clicks in the right ~800 px band.
  F. ⚠ Unhandled-protocol launch_app deadlocks the whole daemon (DoS-class). Launching an app whose protocol has no registered handler (e.g. bingmaps: with Maps uninstalled) spawns Windows' "you'll need a new app" modal on the daemon's session desktop and blocks the worker thread inside the shell-launch call. The client wedges indefinitely and a subsequent stop reports "daemon is not running" because the wedged daemon can't service the pipe; recovery needs the modal dismissed from inside that session (Session-0 SSH can't see it, because window stations are per-session). Any bad app name/protocol is a full-daemon DoS. The driver should launch via a non-blocking path with a timeout and/or validate handler registration before ShellExecute. (Needs a fix + a tracking issue.)
  G. One ApplicationFrameHost pid multiplexes N unrelated UWP apps (Settings, Calculator, and Store were all hosted under the same pid). Anything keying state/cursor/element-cache by pid alone is ambiguous across apps; only window_id is a real identity. (Generalises A.)
  H. launch_app's return contract is effectively per-app-architecture: three topologies seen across five apps: brokered/pid:0 (Settings, Store), real-pid-but-empty-windows race (Photos exposes its pid a beat before its HWND), and clean pid+window (Snipping Tool, the well-behaved baseline). Store is a three-way split: launch-pid 0 ≠ window-pid (AFH) ≠ content-pid (WinStore.App), yet UIA still walks fully into the content provider.
  I. The AFH UIA root is a caption-only ~188×32 strip for Settings and Store while the content frames are full and correct: the AppFrame chrome and the XAML content are different UIA providers stitched at the window. A single-provider synthetic WinUI harness won't reproduce this dual-provider root.

Methodology note: on a disconnected-RDP console session, GetForegroundWindow returns 0 (no foreground window), so focus-steal can't be measured by that probe there — verify "landed" by content-state change instead. A single-process harness never hits this.


Linux dispatch ladder — container and VM lanes#

Linux (X11/XFCE and real-desktop VMs) now validates the same delivery_mode background/foreground dispatch ladder that macOS and Windows exercise, closing the parity gap. The matrix below shows each modality's path and whether the driver can verify the action landed without a screenshot.

Validated modality matrix#

| Modality | delivery_mode | Path reported | Driver-verifiable? |
| --- | --- | --- | --- |
| Element click (element_index) | background | x11_atspi (AT-SPI doAction) | yes (a11y action confirmed) |
| Element px action (pixel) | background | x11_atspi (AT-SPI doAction-at-point) for AX apps; else MPX x11_pixel | yes when AT-SPI-at-point lands; best-effort otherwise |
| Element px action (escalated) | foreground | x11_pixel_fg (EWMH activate → inject → restore) | no (confirm via screenshot) |
| type_text into editable | background | ax (AT-SPI insertText) | yes (verified: true) |
| type_text, non-editable focus | background / foreground | key_events / key_events_fg | no (confirm via screenshot) |
| press_key / hotkey | foreground | x11_xtest_fg (XSetInputFocus → XTEST key) | no (confirm via screenshot / effect (keytest)) |

Background element px actions (pixel clicks) do land on X11: apps that expose AT-SPI take the focus-free doAction-at-point path (x11_atspi), matching the macOS/Windows background-click behavior. The fallback to MPX x11_pixel (which requires a real Xorg server with /dev/uinput — unavailable under Xvnc or minimal containers) fires only for non-AX surfaces; escalate to delivery_mode: foreground there.
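The selection rule for element px actions described above can be summarized in a small decision function. This is only a sketch of the documented behavior, not the driver's implementation; `choose_px_path` and its parameters are hypothetical names, and the path strings are the ones reported in the matrix.

```python
def choose_px_path(delivery_mode, exposes_atspi, has_uinput_xorg):
    """Pick the rung for an element px (pixel) action on X11.

    exposes_atspi:   the target surface answers AT-SPI at the point.
    has_uinput_xorg: a real Xorg server with /dev/uinput is available
                     (false under Xvnc or minimal containers).
    """
    if delivery_mode == "foreground":
        # Escalated rung: EWMH activate -> inject -> restore.
        return {"path": "x11_pixel_fg", "verifiable": False}
    if exposes_atspi:
        # Focus-free doAction-at-point; the a11y layer confirms the action.
        return {"path": "x11_atspi", "verifiable": True}
    if has_uinput_xorg:
        # MPX fallback for non-AX surfaces; best-effort only.
        return {"path": "x11_pixel", "verifiable": False}
    # No background rung is available: the caller should escalate.
    return {"path": None, "verifiable": False,
            "hint": "escalate to delivery_mode: foreground"}
```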

Foreground keyboard (press_key / hotkey, and type_text foreground) lands under Xvnc. The XTEST keyboard path injects on a short-lived X connection, and it now round-trips after injection so the server delivers the key before the connection closes. Under Xvnc, a connection that closed right after flush() had its queued KeyPress/KeyRelease dropped; pointer events survived the same close, which masked the bug as "foreground clicks work but keys don't". Two companion fixes accompany it. First, shifted-level symbols (such as *, + and the opening parenthesis) auto-hold Shift via slot-0/slot-1 keysym resolution; previously a bare keycode press emitted the unshifted glyph (e.g. * came out as 8). Second, the foreground rung now calls XSetInputFocus on the target: KWin honours EWMH _NET_ACTIVE_WINDOW as raise-only, which is fine for clicks that land by stacking, but keys route to the X input focus. derec.sh keytest effect-confirms the whole path across toolkits (AC, 3*4= → display reads 12).
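The slot-0/slot-1 resolution can be illustrated with a toy keymap. The sketch below is an assumption-laden illustration, not the driver's code: `KEYMAP` is a hand-written table (keycodes typical of a US layout, included only as examples), and `key_events` is a hypothetical helper that returns the event sequence instead of injecting it.

```python
# Illustrative keymap: keycode -> (slot-0 keysym, slot-1 keysym).
# Values are examples only; a real driver reads them from the X server.
KEYMAP = {
    12: ("3", "#"),
    13: ("4", "$"),
    17: ("8", "*"),
    18: ("9", "("),
    21: ("=", "+"),
}
SHIFT = "Shift_L"


def key_events(symbol, keymap=KEYMAP):
    """Return the press/release sequence that produces `symbol`."""
    # Prefer an unshifted (slot-0) match so plain keys never hold Shift.
    for keycode, (slot0, _slot1) in keymap.items():
        if symbol == slot0:
            return [("press", keycode), ("release", keycode)]
    # Otherwise the symbol lives in slot 1: wrap the key in a Shift hold.
    for keycode, (_slot0, slot1) in keymap.items():
        if symbol == slot1:
            return [("press", SHIFT), ("press", keycode),
                    ("release", keycode), ("release", SHIFT)]
    raise KeyError(f"no keycode produces {symbol!r}")
```

Without the second loop, a request for `*` would have to fall back to the bare keycode, which is exactly the bug described above.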

The XFCE container lane#

The trycua/cua-xfce Docker image is a reproducible, WM-equipped X11 target — xfwm4 + EWMH + full AT-SPI stack — that exercises both the foreground/EWMH and background AT-SPI modalities. It is richer than the existing Xvfb harness (no window manager, no EWMH) for those paths.

The harness script lives at libs/cua-driver/test-harness/linux-container/calc.sh. It drives galculator through four modalities: background element ax click, background element px (pixel) click, foreground EWMH type, and background type with focus fallback.

Gotchas:

  • AT-SPI session-bus auto-discovery. AT-SPI lives on the desktop session's D-Bus (DBUS_SESSION_BUS_ADDRESS). When the daemon starts outside the session (container entrypoint, headless, runuser/su, systemd system unit, VNC ad-hoc bus), that variable is unset, yielding an empty AT-SPI tree. The driver now auto-discovers the bus at startup — preferring /run/user/<uid>/bus, else reading it from a running desktop-session process's /proc/<pid>/environ (xfce4-session, gnome-session, etc.). An a11y bus must be running with toolkit-accessibility enabled, and the daemon must run as the desktop user — running the daemon as root against a user session is the Linux analogue of the Windows Session 0 isolation problem.
  • install-local refuses root. Use runuser -u <desktop-user> -- cua-driver serve to start the daemon as the session owner.
  • Single-instance apps + zombie children. Unreaped zombie children can pollute pgrep; a full daemon restart reaps them.
  • Stale daemon, stale tool schema. A daemon that was not restarted after an update can serve an outdated tool schema. Verify with cua-driver describe <tool> from a fresh shell.
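The bus auto-discovery order from the first gotcha (the per-user socket first, then a session process's environment) can be sketched like this. It is a simplified illustration under stated assumptions: `discover_session_bus` and `parse_environ` are hypothetical names, the caller is assumed to have already read the `/proc/<pid>/environ` blobs of candidate session processes, and `run_dir` is a parameter only so the logic can be exercised outside a real session.

```python
import os


def parse_environ(blob):
    """Parse a /proc/<pid>/environ blob (NUL-separated KEY=VALUE bytes)."""
    env = {}
    for entry in blob.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep:
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def discover_session_bus(uid, session_environs, run_dir="/run/user"):
    """Return a D-Bus session address, or None if nothing is found."""
    # 1. Preferred: the per-user bus socket.
    socket_path = os.path.join(run_dir, str(uid), "bus")
    if os.path.exists(socket_path):
        return "unix:path=" + socket_path
    # 2. Fallback: the address a desktop-session process was started with.
    for blob in session_environs:
        address = parse_environ(blob).get("DBUS_SESSION_BUS_ADDRESS")
        if address:
            return address
    return None
```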

Toolkit coverage#

| Toolkit | App under test | Lane | AT-SPI coord path |
| --- | --- | --- | --- |
| GTK3 | galculator | Container (XFCE / Kasm, az exec) | SCREEN (AT-SPI returns correct screen coords) |
| GTK4 | gnome-calculator | VM (GNOME Shell, SSH) | WINDOW + _GTK_FRAME_EXTENTS reconstruction (see below) |
| Qt | kcalc | VM (KDE Plasma, SSH) | SCREEN (Qt AT-SPI reports correct screen coords) |
| Electron / Chromium | shared harness | Container (XFCE, az exec) | web-AX + pixel (--force-renderer-accessibility) |

Lane matrix#

| Lane | Desktop | Access method | Test app | Harness driver |
| --- | --- | --- | --- | --- |
| XFCE container | trycua/cua-xfce (xfwm4) | az exec | galculator (GTK3), Electron | libs/cua-driver/test-harness/linux-container/calc.sh |
| Kasm container | trycua/cua-ubuntu (Kasm/XFCE) | az exec | galculator (GTK3) | libs/cua-driver/test-harness/linux-container/calc.sh |
| KDE Plasma VM | Azure VM, KDE Plasma X11 | SSH → derec.sh | kcalc (Qt) | libs/cua-driver/test-harness/linux-container/derec.sh |
| GNOME Shell VM | Azure VM, GNOME Shell X11 (console Xorg) | SSH → derec.sh | gnome-calculator (GTK4) | libs/cua-driver/test-harness/linux-container/derec.sh |
| GNOME Wayland VM | Azure VM, GNOME Mutter Wayland | SSH → derec.sh | gnome-calculator (GTK4) | libs/cua-driver/test-harness/linux-container/derec.sh |

The VM lanes are driven over SSH by derec.sh, in contrast to the container lanes which use az exec. GNOME Shell requires the real console Xorg session (software GLX) and is not drivable over VNC (no GLX). KDE Plasma works over both console Xorg and VNC.

GTK4 coordinate reconstruction — invariant#

GTK4's AT-SPI bridge returns Component.GetExtents(SCREEN) = (0, 0) for every widget (GNOME/gtk #1564/#1739, a11y rework). The driver fix:

  1. Detects _GTK_FRAME_EXTENTS on the X11 window (present for GTK4; absent for Qt and GTK3).
  2. Queries CoordType::Window (which GTK4 reports correctly, per-widget).
  3. Reconstructs: screen_xy = x11_window_origin + _GTK_FRAME_EXTENTS.(left, top) + WINDOW_xy.

Verified live: gnome-calculator button "7" = window-origin (55, 27) + inset (61, 55) + WINDOW (16, 293) = screen (132, 375).

Non-GTK toolkits (Qt) keep the existing SCREEN path — the gate is the presence of _GTK_FRAME_EXTENTS.

Invariant worth guarding: frame_center ≈ x11_window_origin + _GTK_FRAME_EXTENTS + atspi_WINDOW_xy. Any divergence means the reconstruction is broken. Before the fix, pixel clicks silently collapsed to the window corner; doAction element clicks landed regardless (coordinate-free).
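The reconstruction and its gate are plain vector addition, which the following sketch checks against the live gnome-calculator numbers quoted above. The function names are hypothetical; only the formula and the sample values come from this document.

```python
def reconstruct_screen_xy(window_origin, frame_extents_left_top, window_xy):
    """screen_xy = x11_window_origin + _GTK_FRAME_EXTENTS.(left, top) + WINDOW_xy."""
    return tuple(
        origin + inset + local
        for origin, inset, local in zip(
            window_origin, frame_extents_left_top, window_xy
        )
    )


def element_screen_xy(screen_xy, window_origin, frame_extents_left_top, window_xy):
    """Gate on _GTK_FRAME_EXTENTS: present means GTK4, so rebuild from WINDOW coords.

    Pass frame_extents_left_top=None when the property is absent (Qt, GTK3);
    the SCREEN coordinates reported by AT-SPI are then used unchanged.
    """
    if frame_extents_left_top is None:
        return screen_xy
    return reconstruct_screen_xy(window_origin, frame_extents_left_top, window_xy)


# gnome-calculator button "7": origin (55, 27) + inset (61, 55) + WINDOW (16, 293)
button_7 = reconstruct_screen_xy((55, 27), (61, 55), (16, 293))
```

Here `button_7` evaluates to `(132, 375)`, matching the verified value; a regression test can assert the same sum for any widget it samples.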

The Wayland lane (GNOME Mutter / KDE Plasma)#

On a native Wayland session the background AX rung is the same coordinate-free AT-SPI doAction as on X11, and it is the one rung that works unchanged: Action.DoAction is dispatched toolkit-side (GTK/Qt invoke the widget's own action), so it needs no coordinates, no portal, no cursor move, and no focus steal. Element-index clicks therefore land on Mutter/KWin exactly as on X11.

Element px (pixel) coordinate clicks now land too — without pointer injection. There is no global coordinate space on Wayland and Mutter drops synthetic virtual-pointer events, so a "click at (x,y)" can't be delivered as input. Instead the driver resolves the screen pixel back to the accessible element under it and fires that element's action — the proven element_index rung, reached by coordinates. The screen frame of every element is reconstructed the same way get_window_state exposes it (the org.cua.WinRects GNOME Shell helper supplies the window origin; GTK4's correct CoordType::Window extents supply the per-widget offset). The covering element is chosen role-aware: GTK4 nests a label inside every button with a near-identical, slightly smaller frame, so an area-only hit-test lands on the inert label and do_action silently no-ops (a "false success"); the selector prefers the smallest covering real actuator and only falls back to a passive label if nothing else covers the point. Validated live on Mutter Wayland: an element px action targeting buttons by {x,y} (no element_index) types 789 into gnome-calculator (rung wayland_atspi). The same role-aware fix hardened the X11 perform_action_at_point rung, which had the identical GTK4 false-success.
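The role-aware selection described above (smallest covering real actuator first, passive label only as a last resort) can be sketched as below. This is an illustration, not the driver's selector: the element dictionaries, the `ACTUATOR_ROLES` set, and the function names are all assumptions made for the example.

```python
# Illustrative set of roles treated as "real actuators" (an assumption).
ACTUATOR_ROLES = {"push button", "toggle button", "check box", "menu item"}


def covers(frame, x, y):
    """frame is (x, y, width, height) in screen pixels."""
    fx, fy, width, height = frame
    return fx <= x < fx + width and fy <= y < fy + height


def area(element):
    return element["frame"][2] * element["frame"][3]


def pick_element(elements, x, y):
    """Choose the element whose action should fire for a click at (x, y)."""
    hits = [e for e in elements if covers(e["frame"], x, y)]
    if not hits:
        return None
    actuators = [e for e in hits if e["role"] in ACTUATOR_ROLES]
    # Prefer the smallest covering actuator; fall back to passive elements
    # only when no actuator covers the point.
    return min(actuators or hits, key=area)
```

With a GTK4-style button at `(100, 100, 60, 40)` and its nested label at `(110, 108, 40, 24)`, an area-only `min` over all hits would return the label; filtering by role first returns the button.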

Two Wayland-specific facts the lane exercises:

  • Window enumeration via AT-SPI. Native Wayland apps have no X11 XID, and GNOME Mutter / KDE KWin do not implement zwlr_foreign_toplevel_management (that is wlroots-only), so wayland::list_windows is empty there. list_windows_dispatch now falls back to atspi::list_windows, which enumerates top-level frames from the AT-SPI registry and hands back a synthetic, stable window_id. Downstream get_window_state / click walk the AT-SPI tree by pid (the xid is unused for the walk), so the full list_windows → get_window_state → click flow works. Validated live on Mutter Wayland: 7+8= via element_index background clicks computes 15.
  • Engaging the Wayland backend. is_wayland() requires the opt-in CUA_DRIVER_RS_ENABLE_WAYLAND=1, WAYLAND_DISPLAY set, and DISPLAY unset — otherwise the daemon stays on the X11 path. Start the daemon with all three for a pure-Wayland session.
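Both facts reduce to small predicates, sketched here in Python for illustration. The function names mirror the ones mentioned above, but the bodies are assumptions about their logic rather than the driver's implementation; in particular, treating an empty DISPLAY as unset is a guess.

```python
def is_wayland(env):
    """True only when all three conditions for the Wayland backend hold."""
    return (
        env.get("CUA_DRIVER_RS_ENABLE_WAYLAND") == "1"  # explicit opt-in
        and bool(env.get("WAYLAND_DISPLAY"))            # compositor socket named
        and not env.get("DISPLAY")                      # no X11 display (empty counts as unset here)
    )


def list_windows_dispatch(wayland_list_windows, atspi_list_windows):
    """Use the compositor's window list, else enumerate AT-SPI top-level frames."""
    windows = wayland_list_windows()
    if windows:
        return windows
    # Mutter / KWin: no foreign-toplevel protocol, so the list above is empty.
    return atspi_list_windows()
```

For example, `is_wayland({"CUA_DRIVER_RS_ENABLE_WAYLAND": "1", "WAYLAND_DISPLAY": "wayland-0"})` is true, while adding `"DISPLAY": ":0"` to the same environment keeps the daemon on the X11 path.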

Bringing up a headless GNOME Wayland VM (the lane's host) has two gotchas worth recording, neither cua-driver's concern but both block the lane:

  • GDM disables Wayland for the Azure virtual GPU. The hyperv_drm DRM driver exposes no parameters/modeset, so GDM's 61-gdm.rules udev gate (ATTR{parameters/modeset}!="Y") force-selects Xorg. Override by shadowing the rule in /etc/udev/rules.d/61-gdm.rules (drop the modeset + bare GOTO="gdm_disable_wayland" lines) and set Session=ubuntu-wayland in /var/lib/AccountsService/users/<user>, then restart gdm.
  • GTK4 apps need software GL on a GPU-less box. Launch with LIBGL_ALWAYS_SOFTWARE=1 GSK_RENDERER=cairo, else the GL renderer fails (EGL/ZINK) and the app exits.

Appendix — file map#

| Path | Role |
| --- | --- |
| shared/scenarios.json | single source of truth: control ids, window titles, scenarios |
| shared/web/index.html | shared DOM for WebView2 / Electron / WKWebView harnesses |
| apps/ | per-OS/toolkit harness sources (see §3 table) |
| build/{macos.sh,windows.ps1,linux.sh} | stage harnesses into rust/test-apps/harness-* |
| modality-recordings/FINDINGS.md | authoritative per-action findings |
| modality-recordings/windows/wpf-recorder.ps1 | Windows recorder (-Mode, -Toolkit) |
| modality-recordings/windows/run-one.ps1 | single-run launcher |
| modality-recordings/linux/lin-rec.py, lin-rec-electron.py | Linux recorders (+ lin-run*.sh, lin-dash*, lin-harness.py) |
| modality-recordings/macos/mac-rec.py | macOS recorder (MODE SURFACE) |
| vision-agent-test/vision_agent_test.py + README.md | coordinate-invariant regression test |
| smoke/macos.sh | macOS smoke check |
| rust/crates/cua-driver/tests/harness_*.rs | Rust integration tests consuming staged harnesses |
| TEST_SUITE.md (this dir) | sibling doc: the Rust integration-test taxonomy / cua-driver-testkit |
| CUA_DRIVER_LIMITATIONS_AND_TEST_MATRIX.md (repo root) | limitations + mode/scope mapping |