Modality Test Suite & Harnesses
How the Cua Driver modality test suite and per-OS app harnesses work — action rungs, scopes, the action matrix, and per-platform results.
This document describes the cua-driver test-harness: a cross-OS, cross-toolkit rig that exercises every driver action against a controlled application and scores each action twice — did it land an effect, and did it keep the app in the background. It is the authoritative map of which driver mechanisms work where, and why each limitation manifests in the mode/scope it does.
1. Overview: what the suite validates
The suite validates two things, per driver action, per surface:
- The no-foreground / background contract: cua-driver's differentiator. The driver is supposed to drive an app without bringing it to the foreground and without moving the visible cursor (it dispatches via the accessibility tree or via background coordinate injection, and paints a per-session overlay cursor instead of warping the real pointer). The suite measures, for every action, whether the target app stayed in the background (held) or jumped frontmost (STOLE FOCUS).
- Cross-toolkit dispatch coverage: whether an action actually lands on each UI toolkit. The same 8-action × 6-control matrix runs against WPF, WinUI3, WebView2, Electron (Windows), GTK3 + Electron (Linux), and AppKit, SwiftUI, WKWebView, Electron (macOS). Toolkits route input differently (UIA Invoke vs AT-SPI doAction vs AX actions vs synthetic pixels vs composition input-sites), so "click works" is a per-toolkit, per-dispatch-path fact, not a global one.
Scope caveat — this tests driver mechanisms, not agent task success. A cell
that reads ✓ means the driver delivered the action and the target's own
instrumented state changed (a checkbox toggled, a slider moved, last_action=
updated). It does not mean an LLM agent located the control correctly or
chose the right action. Agent locator quality and agent self-judgment are
explicitly kept on separate axes (see §7) so a bad locator can never mask — or be
masked by — a driver regression.
2. How the suite works: the modality recorder
The core instrument is the modality recorder (one per OS:
windows/wpf-recorder.ps1, linux/lin-rec.py + lin-rec-electron.py,
macos/mac-rec.py). Each run produces one annotated screen recording:
- Test harness on the LEFT, cua-driver-panel dashboard on the RIGHT (always-on-top). The dashboard is a tiny HTML page served over loopback and shown via a chrome --app (Windows/macOS) or WebKit (Linux) window; it polls a status.json the recorder writes (~5 Hz) and renders the live verdict.
- The dashboard shows the run's modality, a live foreground/background indicator (current frontmost window title), and a per-action dual verdict.
- The pink/cyan overlay is the driver's per-session agent cursor, proof the driver is acting without a real pointer move.
The 5 modality modes (per surface)
Action rung × foreground/background × scope = 5 single-modality runs:
| Mode | Action rung | Dispatch | Foreground? | Scope |
|---|---|---|---|---|
| ax-fg | element ax action (accessibility) | element_index | app kept FRONT | window |
| ax-bg | element ax action | element_index | app stays BACKGROUND (contract measured) | window |
| px-fg | element px action (pixel) | pixel coords | app kept FRONT | window |
| px-bg | element px action | pixel coords | app stays BACKGROUND (contract measured) | window |
| px-desktop | element px action | screen pixels (window-less) | FRONT | desktop |
The run names use ax/px as labels for the
element-dispatch vs pixel-dispatch runs; capture_mode itself is now
deprecated and ignored — get_window_state returns both the tree and a
screenshot regardless of what you pass. The distinction the recorder exercises
is the action rung (the element ax action addressing by element_index
vs the element px action addressing by pixel x,y), not a capture toggle.
The fg/bg split only matters for the contract (measured in the *-bg runs).
px-desktop sets capture_scope=desktop and issues window-less,
screen-absolute actions.
Dual scoring per action
Every action row carries two independent measurements (rendered as two badges on the dashboard):
- ✓ worked / ✗ no-op (effect): did the action change the harness's own state? Read back after the action via the harness's instrumented status labels (agreed=, slider_value=, last_action=, mirror=, menu_action=, scroll_offset=).
- held / STOLE FOCUS (contract): sampled for ~1.5 s after the action. Did the target window's title become the frontmost window? In fg modes this badge is replaced by foreground (steal is expected/intentional).
The dashboard tally at the bottom is the live count: effects: N/M actions changed the app and, in bg modes, focus contract: S/A stole focus.
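A minimal Python sketch of the dual scoring described above. The names sample_contract, frontmost_title and tally are hypothetical and are not taken from the recorders. The tally's denominators are an assumption read off the results matrix, where effects appear to be counted over actions with an asserted effect (for example 5/7) while the contract is counted over every action (for example 0/8).

```python
import time
from typing import Callable, Iterable


def sample_contract(frontmost_title: Callable[[], str], target_title: str,
                    window_s: float = 1.5, interval_s: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll the frontmost window title for about window_s seconds after an action."""
    samples = max(1, round(window_s / interval_s))
    for _ in range(samples):
        if target_title in frontmost_title():
            return "STOLE FOCUS"  # the target became frontmost at least once
        sleep(interval_s)
    return "held"  # the target stayed in the background for the whole window


def tally(rows: Iterable[dict], background: bool) -> str:
    """Build the dashboard's bottom line from per-action rows (assumed row shape)."""
    rows = list(rows)
    asserted = [r for r in rows if r["effect"] != "na"]  # skip press-key style rows
    worked = sum(r["effect"] == "worked" for r in asserted)
    line = f"effects: {worked}/{len(asserted)} actions changed the app"
    if background:
        stole = sum(r["contract"] == "STOLE FOCUS" for r in rows)
        line += f"; focus contract: {stole}/{len(rows)} stole focus"
    return line
```

Passing the title probe and the sleep function in as parameters keeps the sampler testable without a real window system.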
How the verifier reads each toolkit's instrumented labels
The harnesses all expose the same set of status labels (single source of
truth: shared/scenarios.json + shared/web/index.html), but the verifier
reads them differently per platform:
| Platform / toolkit | How the verifier reads state |
|---|---|
| Windows WPF / WinUI3 | UIAutomation — finds the harness window by title substring, scans ControlType.Text descendants, regex-extracts agreed=, slider_value=, last_action=, mirror=, menu_action=, scroll_offset= |
| Windows WebView2 / Electron | Same UIA scan, but the window is matched by substring because Chromium titles carry a [cdp=NNNN] suffix; the web-AX tree only surfaces with --force-renderer-accessibility |
| Linux GTK3 | The harness writes a state file (/tmp/cua-lin-state.json); the recorder's hstate() reads it (AT-SPI text isn't reliably enumerable for all labels) |
| Linux Electron | Chromium web-AX over AT-SPI (--force-renderer-accessibility), read from the get_window_state tree |
| macOS AppKit / SwiftUI | AX — status labels render as AX text nodes in tree_markdown; do_native() drives the action, then re-reads the labels |
| macOS WKWebView / Electron | Chromium/WebKit web-AX text (WEBISH path), same label contract |
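Whatever the platform-specific read path, the result is a set of text nodes carrying name=value labels. A minimal sketch of the regex extraction step follows; parse_labels is a hypothetical name, and the pattern assumes label values contain no whitespace, which the source does not state.

```python
import re

# The six status labels named in the table above.
LABELS = ("agreed", "slider_value", "last_action",
          "mirror", "menu_action", "scroll_offset")
_LABEL_RE = re.compile(r"\b(" + "|".join(LABELS) + r")=(\S*)")


def parse_labels(text_nodes):
    """Collect name=value status labels from a list of text-node strings."""
    state = {}
    for text in text_nodes:
        for name, value in _LABEL_RE.findall(text):
            state[name] = value  # a later node overrides an earlier one
    return state
```

For example, parse_labels(["agreed=true", "slider_value=48"]) yields a dict with those two keys, and nodes without a known label are ignored.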
The per-action verify(action, before, after) logic is consistent across
recorders, e.g.: click → agreed flipped; double → last_action == double_click; right → last_action == right_click or menu_action
changed; drag → slider increased; scroll → scroll_offset increased;
setval → mirror == set-by-cua; type → mirror contains typed-by-cua;
press-key (Tab) → no asserted effect (na).
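The rules above can be sketched as one function over the before/after label snapshots. This is an illustrative reading of the listed rules, not the recorders' code; the action keys and the "worked" / "no-op" / "na" return strings are assumptions chosen to match the dashboard wording.

```python
def verify(action: str, before: dict, after: dict) -> str:
    """Return 'worked', 'no-op' or 'na' for one action, from label snapshots."""

    def num(state: dict, key: str) -> float:
        # Labels arrive as strings; treat a missing or malformed value as 0.
        try:
            return float(state.get(key, 0))
        except (TypeError, ValueError):
            return 0.0

    if action == "press-key":
        return "na"  # Tab only moves focus; nothing is asserted
    checks = {
        "click": lambda: before.get("agreed") != after.get("agreed"),
        "double": lambda: after.get("last_action") == "double_click",
        "right": lambda: (after.get("last_action") == "right_click"
                          or before.get("menu_action") != after.get("menu_action")),
        "drag": lambda: num(after, "slider_value") > num(before, "slider_value"),
        "scroll": lambda: num(after, "scroll_offset") > num(before, "scroll_offset"),
        "setval": lambda: after.get("mirror") == "set-by-cua",
        "type": lambda: "typed-by-cua" in (after.get("mirror") or ""),
    }
    if action not in checks:
        raise ValueError(f"unknown action: {action}")
    return "worked" if checks[action]() else "no-op"
```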
Honesty principle (FINDINGS.md): the harness/recorder stays honest — real driver gaps are documented as driver work, fixed in the driver first, then re-recorded. The recorder does not paper over a no-op.
Session lifecycle & disposal
Each recorder declares a session (the agent-cursor identity, e.g. d1 /
macd-…) via start_session, drives every action under it, and must
end_session it on teardown so the cursor + per-session state are disposed —
not leaked.
- Disposal is double-guarded. end_session and the idle-TTL sweep (evict_idle, default 300 s, overridable via CUA_DRIVER_RS_SESSION_IDLE_TTL_SECS, swept every 30 s) both dispose through the same fire_session_end hook fan-out: agent cursor removed, recording stopped, per-session config cleared. A reused id starts fresh, not poisoned.
- Test: cua-driver-core/tests/session_lifecycle.rs asserts start → use → end_session disposes (and a reused id starts fresh), and that idle-TTL eviction disposes the same way, verified via the disposal-hook contract both paths fan out to (the concrete cursor/config registries are a daemon-layer assertion). It passes a short TTL straight to evict_idle; the 300 s production default is never lowered for tests.
- The recorders dispose explicitly: end_session in a try/finally before they kill the daemon (mac-rec.py via an atexit net, since it leaves the daemon running), so a session can't leak even if a run errors.
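The double-guarded disposal can be modelled in a few lines. The sketch below is a toy Python model, not the driver's Rust implementation: SessionRegistry, touch and run_with_session are hypothetical names, and only start_session, end_session and evict_idle come from the text above. It shows both disposal paths funnelling through one hook, with the TTL passed directly to evict_idle as the test described above does.

```python
import time
from typing import Callable, Dict, List


class SessionRegistry:
    """Toy model: explicit end and idle eviction share one disposal hook."""

    def __init__(self, on_session_end: Callable[[str], None],
                 now: Callable[[], float] = time.monotonic) -> None:
        self._on_session_end = on_session_end  # stands in for the hook fan-out
        self._now = now
        self._last_used: Dict[str, float] = {}

    def start_session(self, session_id: str) -> None:
        self._last_used[session_id] = self._now()

    def touch(self, session_id: str) -> None:
        if session_id in self._last_used:
            self._last_used[session_id] = self._now()

    def end_session(self, session_id: str) -> bool:
        if self._last_used.pop(session_id, None) is None:
            return False  # unknown or already disposed
        self._on_session_end(session_id)
        return True

    def evict_idle(self, ttl_secs: float) -> List[str]:
        cutoff = self._now() - ttl_secs
        idle = [sid for sid, used in self._last_used.items() if used <= cutoff]
        for sid in idle:
            self.end_session(sid)  # same path as an explicit end
        return idle


def run_with_session(registry: SessionRegistry, session_id: str, drive):
    """Recorder-style teardown: the session is ended even if the run raises."""
    registry.start_session(session_id)
    try:
        return drive(session_id)
    finally:
        registry.end_session(session_id)
```

Injecting the clock lets a test exercise eviction with a short TTL instead of waiting for the production default.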
3. Test-app harnesses
Each harness is a self-contained host app the driver drives. They share fixtures
via shared/scenarios.json (AutomationIds / AX identifiers / DOM ids /
window titles) and, for the web hosts, shared/web/index.html, so identifiers
never drift between apps and the Rust integration tests under
rust/crates/cua-driver/tests/harness_*.rs. Built outputs stage into
rust/test-apps/harness-{name}/ via build/{macos.sh,windows.ps1,linux.sh}.
The 6 parity controls
The "control-parity pass" brought every surface up to the WPF baseline of 6 controls, so the same 8-action matrix is exercisable everywhere:
- checkbox (chk-agree*) → agreed=
- click-target (left / right / double) → last_action=, clicks=
- slider (sld-value) → slider_value=
- scroll-target (tall scroll-tall region) → scroll_offset=
- text-input (txt-input, mirrored) → mirror=
- context-menu (btn-context) → menu_action=
(Plus a counter= increment button used as a generic AX-press probe.) WPF
carries an extended scenario set on top (modal MessageBox, owned/layered popups,
native Win32 child HWND, menu bar, combo/list boxes) — it is the "fullest
harness".
Per OS / toolkit
| OS | Harness | What it is | Build | Source |
|---|---|---|---|---|
| Windows | WPF | WPF / UIA host — popups, layered windows, modal MessageBox, HwndHost native child | build/windows.ps1 → harness-wpf | apps/windows/wpf/ (C#/XAML) |
| Windows | WinUI3 | WinUI3 unpackaged — DirectComposition popups, XAML Popup primitive, CommandBarFlyout; HWND subclassed to translate WM_VSCROLL → ChangeView | build/windows.ps1 → harness-winui3 | apps/windows/winui3/ |
| Windows | WebView2 | WPF host + Microsoft.Edge.WebView2 loading the shared DOM; CDP on :9222 | build/windows.ps1 → harness-webview | apps/windows/webview2/ |
| Windows | Electron | Electron host loading the same shared DOM; CDP on :9223 | apps/cross-platform/electron/build.ps1 | apps/cross-platform/electron/ |
| Linux | GTK3 | PyGObject GTK3 app; AT-SPI exposes accessible NAME (not an AutomationId); writes /tmp/cua-lin-state.json for the verifier | build/linux.sh → harness-gtk3 | apps/linux/gtk3/main.py |
| Linux | Electron | Same Electron bundle run from node_modules under Xvfb + dbus with --force-renderer-accessibility | run via lin-run-electron.sh | apps/cross-platform/electron/ |
| macOS | AppKit | Cocoa — NSButton, NSScrollView, NSMenu, NSSlider, menubar item | build/macos.sh → harness-appkit | apps/macos/appkit/main.swift |
| macOS | SwiftUI | SwiftUI — .popover(), .contextMenu, declarative Slider/Toggle/ScrollView | build/macos.sh → harness-swiftui | apps/macos/swiftui/main.swift |
| macOS | WKWebView | Apple WebKit host loading the shared DOM — the native analogue of WebView2 | build/macos.sh → harness-wkwebview | apps/macos/wkwebview/main.swift |
| macOS | Electron | Same shared DOM under Chromium-on-macOS | apps/cross-platform/electron/build.sh | apps/cross-platform/electron/ |
Instrumented labels (the verifier contract)
Every harness updates these labels in response to a successful action; the verifier asserts on them:
| Label | Set by | Verifies |
|---|---|---|
| last_action= | click-target (left/right/double) | click / double / right landed |
| clicks= | click-target | click count |
| slider_value= | slider | drag / set_value on a range control |
| scroll_offset= | scroll-target | scroll landed |
| agreed= | checkbox | click toggled a toggle |
| menu_action= | context-menu item | right-click → context menu → item invoked |
| mirror= | text-input | type / set_value into a text field (typed-by-cua / set-by-cua) |
| counter= | increment button | generic AX-press probe |
4. Action rungs & scopes
Action rung (how the action addresses the target)
get_window_state is perception-mode-agnostic: it returns both the
accessibility tree and a screenshot by default (include_screenshot:false skips
the screenshot grab — a perf opt-out, not a modality choice). capture_mode is
deprecated and ignored — there is no ax/vision capture toggle anymore.
The modality is chosen at action time by how you address the target, and
that one choice selects the rung:
| Rung | How the agent addresses the target | How actions dispatch |
|---|---|---|
| element ax action | by element_index (read off the tree) | UIA Invoke (Windows) / AT-SPI doAction (Linux) / AX actions (macOS); backgroundable, driver-verifiable |
| element px action | by pixel x,y (read off the screenshot in the same response) | synthetic/injected pointer events at (x,y); best-effort, caller-confirmed |
Deprecated aliases. som (the old set-of-marks mode) and screenshot are no longer advertised capture modes; like ax / vision / capture_mode itself, they still decode but are ignored (both the tree and the screenshot come back regardless). The modality recorder's runs exercise the element-dispatch path and the pixel-dispatch path, so the recorder runs and the results matrices below are unchanged; they are framed here as the element ax action and the element px action rungs.
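Since the rung is selected by how the target is addressed, the choice can be sketched as a function of the action arguments alone. action_rung is a hypothetical helper; rejecting a request that carries both an element_index and x,y is an assumption of this sketch, not documented driver behaviour.

```python
def action_rung(args: dict) -> str:
    """Return 'ax' or 'px' from how an action addresses its target."""
    has_index = args.get("element_index") is not None
    has_pixel = args.get("x") is not None and args.get("y") is not None
    if has_index and has_pixel:
        # Assumption of this sketch: an ambiguous request is an error.
        raise ValueError("address the target by element_index or by x,y, not both")
    if has_index:
        return "ax"  # element ax action: dispatched through the accessibility API
    if has_pixel:
        return "px"  # element px action: pointer events at (x, y)
    raise ValueError("no target address: pass element_index or x,y")
```

Note that a capture_mode key plays no part in the decision, which mirrors the point that it is deprecated and ignored.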
Capture scope (capture_scope)
| Scope | Meaning |
|---|---|
| window | single target window (default) |
| desktop | full display |
Why mode/scope determines where a limitation shows
This is the central organizing idea (from CUA_DRIVER_LIMITATIONS_AND_TEST_MATRIX.md):
- The element ax action (element_index) path. Limits in the element path (a toolkit with no doAction for right-click, a WinUI3 composition input-site the WPF path can't reach, an AXValue a SwiftUI slider rejects) surface in this rung, the ax- runs.
- The element px action (pixel) path. Limits in the pixel path (GTK dropping synthetic X events, NSButton ignoring a synthetic mouseDown, a 2× Retina coordinate conversion) surface in this rung, the px- runs.
- fg/bg isolates the contract (only *-bg measures focus steal).
- desktop scope is where window-less / coordinate-space-conversion bugs surface (e.g. the macOS 2× Retina desktop path, the macOS no-windowless-click gap).
5. Action × control matrix
8 actions × 6 controls. Each modality run filters the plan:
set_value is dropped in the px modes (the pixel-dispatch runs — it's
AX-only); px-desktop keeps only {click, scroll, type, press-key}
(window-less screen actions).
| Action | Control(s) exercised |
|---|---|
| click | checkbox (toggle); click-target (left) |
| double-click | click-target |
| right-click | click-target (web/WinUI3 records last_action); context-menu (opens menu) |
| drag | slider (thumb→X) |
| scroll | scroll-target |
| set_value (AX) | slider (range value); text-input (set-by-cua) |
| type | text-input (typed-by-cua) |
| press-key (Tab) | focus move; no asserted effect |
Notes:
- WPF has a dedicated context-menu control; WinUI3 / web harnesses have none, so right-click is aimed at the click-target and verified via last_action=right_click.
- SwiftUI's AX has no distinct right-press action, so right-click on its button coerces to a normal AXPress; right-click semantics there are covered by the dedicated context_menu scenario.
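The per-mode plan filtering described at the start of this section can be sketched as follows. filter_plan and the action strings are illustrative names; the rules themselves (set_value dropped in the px modes, px-desktop keeping only click, scroll, type and press-key) are the ones stated above.

```python
ACTIONS = ["click", "double-click", "right-click", "drag",
           "scroll", "set_value", "type", "press-key"]

# px-desktop keeps only the window-less screen actions.
DESKTOP_ACTIONS = {"click", "scroll", "type", "press-key"}


def filter_plan(mode: str, plan=ACTIONS):
    """Return the actions a modality run executes, in plan order."""
    if mode == "px-desktop":
        return [a for a in plan if a in DESKTOP_ACTIONS]
    if mode.startswith("px-"):
        return [a for a in plan if a != "set_value"]  # set_value is AX-only
    return list(plan)
```

This is why the ax runs are scored out of 8 actions while the px-fg and px-bg runs are scored out of 7.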
6. RESULTS MATRIX (the core)
Per cell: ✓ landed · ✗ no-op · ⚠ guarded/fail-loud for the effect, with the
reason. The Contract column is the focus verdict measured in the *-bg runs.
"land" = the harness's own instrumented label changed. Verification status is
drawn from FINDINGS.md, the limitations doc, and the fix-commit bodies; runtime
status is flagged where a commit deferred it.
Legend: ✓ lands · ✗ no-op (honest) · ⚠ guarded clean-fail / fail-loud · — not exercised · n/a not in plan / no asserted effect.
macOS
Contract: HOLDS — 0/8 stole on AppKit in both ax-bg and px-bg
(pixel left-click/double/drag landed while Chrome stayed frontmost). WKWebView
also holds — confirming the Windows Chromium steal is Windows-specific, not
WebKit-on-macOS.
| Harness | mode | click | double | right | drag | scroll | set_value | type | press-key | Contract | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AppKit | ax | ✓ | ✓ (fixed bg, click_at_xy_with_window_local, 80b4e3d7) | ✓ (AXShowMenu→delivering-pixel fallback, 80b4e3d7) | ✓ | ⚠ element scroll no-ops on content-height containers (use x,y) | ✓ CFNumber NSSlider (92d8aebf, verified 0→50) | ✓ | n/a | 0/8 stole | ax-bg 5/7 effects |
| AppKit | px | ⚠ no-op on standard NSButton even frontmost (modal mouseDown reads window-server queue, not postToPid → use element_index) | ✓ | ✓ now fires rightMouseDown (window-number + button + primer, e7145c18) | ✓ | ✓ via per-pid pixel-wheel scroll_wheel_at_xy (e7145c18) | n/a | ✓ | n/a | 0/8 stole | needs target frontmost; px-bg 4/6 |
| SwiftUI | ax | ✓ | ✓ | ✓ (via dedicated context-menu) | ✗ slider drag is a composition no-op | ⚠ nested scroll no-op | ⚠ AXValue unsettable (-25200); needs AXIncrement/AXDecrement stepping (d2b29230) | ✓ | n/a | 0/8 stole | full 6-control parity confirmed live |
| WKWebView | ax+px | ✓ | ✓ | ✓ | ✓ slider (pixel drag at live frame) | ⚠ page scroll works, nested overflow:auto div no-op (div has no tabindex → never keyboard-focusable; needs pixel-wheel) | ✗ AXValue set, no DOM input event (web mirror unchanged) | ✓ | n/a | HOLDS | native analogue of WebView2. Recorder fix (mac-rec.py): web dispatch now uses the live element frame, not a stale hardcoded coord map that was ~185px off → every WKWebView pixel action used to miss. ax-fg re-measured 5/6 (click/double/right/drag/type land; only set_value the honest web no-op). |
| Electron (mac) | ax/px | ✓ | ✓ | — | — | ✗ (web scroll, same as WKWebView) | — | ✓ | n/a | holds (Chromium-on-mac does not steal like Windows) | local-only recordings |
macOS gotchas filed as driver bugs: end_session poisons a reused session id
(subsequent actions silently no-op); AppKit window height drifts between
launches (store targets as window-local points, convert to live screenshot px).
Linux
| Harness | mode | click | double | right | drag | scroll | set_value | type | press-key | Contract | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GTK3 | ax (AT-SPI) | ✓ left-click via hit-test→doAction (7d358283) | ✗ no doAction equiv | ✗ no doAction equiv | ⚠ value-only (no Action) — driven via set_value | ⚠ surfaced now (is_indexable = actions || has_value); driven via set_value | ✓ slider/scroll value widgets now surface | ✓ | n/a | ax-bg 1/8 stole (only set_value; corrected via genuine-anchor baseline 5c0a1d3c — was 3/8) | left/set_value/type land |
| GTK3 | px (Xvfb) | ✗ | ✗ | ✗ | ✗ | ✗ | n/a | ✗ | n/a | 0/7 stole | GTK drops synthetic XSendEvent; XTEST core events don't reach its XInput2 path |
| GTK3 | px (real Xorg) | ✓ | ✓ | ✓ | ✓ | ✓ | n/a | ✓ | n/a | holds focus | right/double/middle-click + scroll land via uinput/XInput2-MPX + shield-grab (79e546ca); capability auto-detected via real_pointer_input_available() — runtime-verified on a real Xorg server (dummy-driver): the probe flips TRUE, all four actions LAND and HOLD focus (_NET_ACTIVE_WINDOW unchanged), confirmed by the harness oracle + a middle-click PRIMARY paste. Xvfb can't bind uinput as an X slave, so the path auto-skips there. |
| Electron (Linux) | ax | ✓ click | ✗ | ✗ | ✓ drag | ✗ scroll resolves but no-op (AT-SPI synthetic) | ✗ | ✓ type | n/a | ax-bg 2/8 stole (set_value + drag; was 3/8 — recorder baseline artifact, 5c0a1d3c) | drag lands a value-change but its synthetic window-coord activates Chromium (never reaches the slider thumb); still holds far better than Windows Electron's reported 7/8 (also an artifact) |
| Electron (Linux) | px | ✗ | ✓ pixel double-click fires on click-target | ✗ | ✗ | ✗ | n/a | ✗ | n/a | px-bg 6/7 stole (pixel dispatch foregrounds Chromium) | |
| GTK4 (gnome-calculator) | ax (AT-SPI) | ✓ via doAction (coordinate-free; lands without a valid frame) | ✗ no doAction equiv | ✗ no doAction equiv | ⚠ value-only (no Action) — driven via set_value | ⚠ driven via set_value | ✓ | ✓ | n/a | holds | GTK4 coord fix required for frame/agent-cursor. GTK4's AT-SPI bridge returns Component.GetExtents(SCREEN)=(0,0) for every widget (GNOME/gtk #1564/#1739 a11y rework). Fix: queries CoordType::Window (GTK4 reports correctly per-widget) and reconstructs screen as x11_window_origin + _GTK_FRAME_EXTENTS.(left,top) + WINDOW_xy, gated on presence of _GTK_FRAME_EXTENTS so non-GTK toolkits (Qt) keep their correct SCREEN path. Verified live: button "7" = window(55,27) + inset(61,55) + WINDOW(16,293) = screen(132,375). doAction element clicks land regardless (coordinate-free). GNOME VM lane only. |
| GTK4 (gnome-calculator) | px (real Xorg) | ✓ (after GTK4 coord fix) | ✓ | ✓ | ✓ | ✓ | n/a | ✓ | n/a | holds | Pixel coords reliable only after WINDOW+_GTK_FRAME_EXTENTS reconstruction; without the fix, cursor/frame collapses to the window corner. GNOME Shell requires real console Xorg (software GLX) — not Xvnc. |
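The GTK4 coordinate reconstruction in the notes above is a sum of three offsets, gated on whether _GTK_FRAME_EXTENTS is present. A minimal sketch of that arithmetic follows; the function names are hypothetical, and frame_inset stands for the (left, top) pair from _GTK_FRAME_EXTENTS, with None meaning the property is absent.

```python
from typing import Optional, Tuple

Point = Tuple[int, int]


def gtk4_screen_point(window_origin: Point, frame_inset: Point,
                      window_xy: Point) -> Point:
    """screen = X11 window origin + frame-extents inset + window-relative widget xy."""
    return (window_origin[0] + frame_inset[0] + window_xy[0],
            window_origin[1] + frame_inset[1] + window_xy[1])


def resolve_screen_point(screen_xy: Point, window_xy: Point,
                         window_origin: Point,
                         frame_inset: Optional[Point]) -> Point:
    """Reconstruct only when the frame extents exist; otherwise trust SCREEN extents."""
    if frame_inset is None:
        return screen_xy  # non-GTK toolkits keep their correct SCREEN path
    return gtk4_screen_point(window_origin, frame_inset, window_xy)
```

With the values from the table, window (55,27) plus inset (61,55) plus WINDOW (16,293) gives screen (132,375), matching the live verification for button "7".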
Windows
Contract: HOLDS on WPF / WinUI3 / WebView2. Electron — see the caveat below (the prior "7/8 stole" is a recorder measurement artifact).
| Harness | mode | click | double | right | drag | scroll | set_value | type | press-key | Contract | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| WPF | ax | ✓ (UIA Toggle) | ✓ | ⚠ off-screen at 556px reflow → clean no-op (guarded; UIA Invoke works off-screen) | ✓ slider 0→48 | ⚠ off-screen reflow → guarded no-op | ✓ (mirror=set-by-cua) | ✓ (recorder focuses the box first) | n/a | 2/8 stole | "fullest harness"; ax-bg ~5/7 effects |
| WPF | px | ✓ | ✓ | ⚠ guarded (point_in_window_bounds refuses off-screen point; off-screen controls now ScrollIntoView + actuate via ancestor ScrollPattern, c3efb587, runtime-verified counter 0→1) | ✓ | ⚠ off-screen reflow | n/a | ✓ | n/a | 1–2/7 stole | off-screen guard prevents taskbar misfire |
| WinUI3 | ax | ✓ left-click (UIA Invoke) | ✓ now LANDS — double UIA-Invoke under WS_EX_NOACTIVATE guard (531aa6de, runtime-verified live) | ⚠ fail-loud background_unavailable — no contract-safe path (pen injection steals; WM_*BUTTON doesn't land) | ⚠ fail-loud (same composition input-site gap) | ✓ (HWND subclass: WM_VSCROLL→ChangeView) | ✓ | ✓ | n/a | 0/7 stole | gap needs a WinUI3-specific composition InputSite path |
| WebView2 | ax | ✗ checkbox below the fold (web won't scroll in ax) | ✓ | ✓ records last_action=right_click | ✓ | ✗ host HWND doesn't route scroll to renderer | ✓ | ✓ | n/a | 0/7 stole | needs --force-renderer-accessibility; 4/7 effects |
| Electron | ax | ✓ | ✓ | — | ✓ | ✗ web scroll | ✓ | ✓ | n/a | see caveat | 5/7 effects |
Other Windows: UIA element-cache use-after-free (concurrent ax sessions on
the same window) — fixed with a RetainedElement retain-under-lock guard
(d95b89a1); proven by a deterministic cache_uaf_repro test (pre-fix path
takes a real access violation, fixed path survives a 6-thread stress loop,
531aa6de). Recording the VM needs Session 2 attached (tscon … /dest:console)
and ffmpeg on PATH before serve.
Windows Electron contract caveat: RESOLVED. FINDINGS.md records Electron as stealing 7/8 in ax-bg. That was a recorder MEASUREMENT ARTIFACT: the recorder re-asserted its dashboard panel with SetWindowPos (z-order only, no activation), so it never held a real foreground baseline; after the first inject action click-activated Chromium, the harness stayed frontmost and every later step false-positived as a steal. The recorder was fixed (49bdb41b) to genuinely SetForegroundWindow a real non-harness anchor (mspaint) before each step. Re-measured: Electron ax-bg = 0/8 (held), corroborated by baseline.log (anchor held ×8), an independent metric.log MEASURE=0/8, and frame f03 (actions landed while the harness stayed non-foreground). So Electron HOLDS the contract, consistent with Linux/macOS Electron. (The committed matrix-electron-ax-bg.mp4 still shows the old 7/8 until re-recorded with the fixed recorder.)
7. Vision-agent coordinate test
vision-agent-test/ (vision_agent_test.py, 010fdd78) is a newer test that
hits the driver the way a vision agent actually does — and removes the
modality recorder's overfit (hand-tuned window-local points run through a private
ratio, which never exercises the driver's image→screen mapping).
Invariant under test: the pixel an agent reads off the returned screenshot is the pixel that gets clicked — verified by the target's own instrumented state changing.
The loop (no cheating in locate or click):
- capture: get_window_state (window; reads the screenshot it returns by default) / get_desktop_state (desktop, true pixels): the exact image an agent receives.
- locate: a deterministic pixel in the returned-image coordinate space (PixelRegistryLocator: a pre-measured pixel read off the real PNG, with a dims-guard that fails loud if the pinned geometry drifts). No element_index, no hand-converted window-local points. locate(image, target, dims) → (x,y) is pluggable (an LLM locator can drop in later).
- act: click / right_click / scroll at that pixel (scope matches capture).
- verify: the harness oracle (last_action=, clicks=, …) → objective pass/fail. A coordinate mis-map leaves the oracle unchanged → FAIL.
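The four-step loop can be sketched with a registry-backed locator and injected capture, act and oracle callables. Only the PixelRegistryLocator name and the locate(image, target, dims) signature come from the text; the registry layout, GeometryDrift, run_case and the callable parameters are assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

Dims = Tuple[int, int]
Pixel = Tuple[int, int]


class GeometryDrift(Exception):
    """Raised when the captured image no longer matches the pinned dimensions."""


class PixelRegistryLocator:
    def __init__(self, registry: Dict[str, Tuple[Dims, Pixel]]) -> None:
        # target name -> (image dims the pixel was measured on, the pixel itself)
        self._registry = registry

    def locate(self, image: bytes, target: str, dims: Dims) -> Pixel:
        pinned_dims, pixel = self._registry[target]
        if dims != pinned_dims:
            # dims-guard: fail loud rather than click a stale coordinate
            raise GeometryDrift(f"{target}: expected {pinned_dims}, got {dims}")
        return pixel


def run_case(capture: Callable[[], Tuple[bytes, Dims]],
             locator: PixelRegistryLocator,
             act: Callable[[int, int], None],
             read_oracle: Callable[[], dict],
             target: str) -> bool:
    """capture -> locate -> act -> verify; True only if the oracle changed."""
    image, dims = capture()
    before = read_oracle()
    x, y = locator.locate(image, target, dims)
    act(x, y)
    return read_oracle() != before
```

Because the locator is passed in, a different locate implementation can be swapped in without touching the capture, act or verify steps.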
Run: python3 vision_agent_test.py {wkwebview-click-window|wkwebview-click-desktop|appkit-click-window|safari-learnmore-desktop|all}.
Three axes kept separate so none can mask the others: (a) the driver
coordinate invariant (this test, deterministic → the regression guard); (b)
agent locator quality (future, LLM, same locate() signature, scored
separately, never gates the regression); (c) agent self-judged success with no
oracle (future, isolated track).
What the deterministic version caught that the modality suite structurally couldn't:
- 2× Retina desktop path now correct + guarded — pixel (340,1358) in the 3024×1964 desktop PNG converts to screen-point (170,679) and lands (oracle + a real Safari navigation confirm it). This turns the Retina escape into a permanently-guarded one-liner.
- Pixel-click is a no-op on AppKit NSButton even frontmost (modal mouseDown reads the window-server queue, not the per-pid postToPid queue); the modality suite drives this via AXPress, so it never saw it.
- The pixel path requires the target app frontmost (the AX path doesn't).
- The AppKit harness window AX returns only the menu bar, so its clicks= oracle is unreachable that way; WKWebView exposes it fine.
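The Retina finding in the first bullet comes down to dividing an image pixel by the image-to-screen scale. A minimal sketch of that conversion follows; image_px_to_screen_point is a hypothetical name, and the 1512×982 logical screen size is an assumption implied by the 2× factor, not a figure from the source.

```python
from typing import Tuple


def image_px_to_screen_point(pixel: Tuple[int, int],
                             image_size: Tuple[int, int],
                             screen_size_points: Tuple[int, int]) -> Tuple[float, float]:
    """Map a pixel in the returned screenshot to a logical screen point."""
    scale_x = image_size[0] / screen_size_points[0]  # 2.0 on a 2x Retina display
    scale_y = image_size[1] / screen_size_points[1]
    return (pixel[0] / scale_x, pixel[1] / scale_y)


# Pixel (340, 1358) in the 3024x1964 desktop PNG, assuming a 1512x982-point screen.
print(image_px_to_screen_point((340, 1358), (3024, 1964), (1512, 982)))
# prints (170.0, 679.0)
```

Skipping the division is the off-by-backing-scale failure: the click is sent to a point twice as far from the origin as intended.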
8. Known overfitting caveats
The suite's own authors flag two ways the modality recorder could have lied, and the de-risking work for each:
(a) Recorder hand-coords masked the desktop 2× Retina bug. The modality
recorder stores targets as hand-tuned window-local points and converts them
through a private ratio. That conversion is self-consistent with the driver's
own assumption, so it never exercised the driver's real image→screen mapping —
and the desktop-scope 2× Retina off-by-backing-scale bug (a center-pixel pick
warping to the corner) slipped through every px-desktop run. De-risk: the
vision-agent coordinate test (§7) reads a pixel straight off the returned PNG
with a dims-guard, caught the bug, and now guards it (80b4e3d7 fixes the desktop
branch to divide x,y by the native/logical ratio; 010fdd78 adds the regression
test).
(b) Recorder contract measurement had no real foreground baseline → the false
"Electron 7/8 stole". The recorder judges the contract by checking whether the
target's title is the frontmost window, but it never established a genuine,
distinct foreground window to begin with — so Windows Electron read as stealing
7/8 when the corrected understanding is that Chromium-on-Windows holds the
contract (matching macOS/Linux Electron). De-risk: the recorder-contract fix
(49bdb41b) now genuinely SetForegroundWindows a real anchor before each step;
re-measured Electron ax-bg = 0/8 (held), so the claim is now verified and the
legacy FINDINGS "7/8" is a confirmed measurement artifact. The Linux recorders had
the same flaw, also fixed (5c0a1d3c): GTK3 ax-bg corrected 3/8 → 1/8 (only
set_value genuinely steals) and Linux Electron ax-bg corrected 3/8 → 2/8
(set_value + drag); macOS was already fine.
Both caveats reflect the same lesson: a recorder that hand-feeds the driver its own assumptions can hide bugs in exactly the path a real agent uses. The deterministic, oracle-checked vision-agent test and the re-baselined contract measurement are the two structural fixes.
9. Edge cases: real closed-source apps
The synthetic harnesses pin down the heuristics; real apps surface behaviours a single-process test window never can. These were found driving live closed-source apps (Finder, System Settings, Calculator, Safari on macOS; Calculator/Notepad UWP, Edge, Explorer on Windows) and are why the suite is a floor, not the ceiling.
macOS (Finder, System Settings, Calculator, Safari)
- Pixel click hits the right pixel but the wrong window. With ~8 overlapping same-pid Finder windows, a crosshair dead-centre on the target file still posts to screen coordinates, so an occluding sibling intercepts it → no-op. A per-window screenshot masks it. Strong argument for the ax/element path, which is z-order independent.
- set_value and type_text both falsely succeed on a background search field (System Settings, SwiftUI). set_value writes AXValue (text appears, the search action never fires); type_text posts keys that a non-first-responder window drops. Neither drives the field; both report success.
- Finder column filenames don't advertise AXPress → a default click is an honest no-op; the driver surfaces AXOpen / AXShowMenu / AXConfirm, so an agent must pick action:"open" / "pick".
- The Calculator result is AX-invisible (no AX node for the display) → an AX-only agent can't read the answer; the keypad itself is labelled. A vision readout is required.
- AppKit AX-tree duplication (System Settings ~half duplicated, Safari a duplicated toolbar) plus whole-menu-bar walking inflates real-app context versus synthetic.
Windows (Calculator/Notepad UWP, Edge, Explorer)
A. UWP window-identity split. launch_app("Calculator") returns the package backing
PID and a window_id that's stale by the next call; the real top-level HWND belongs to
ApplicationFrameHost.exe. Driving by the returned pid+window_id errors No window with window_id … exists. The driver should resolve UWP windows to their ApplicationFrameHost
host / relink the churned HWND.
B. Real UWP is drivable via the element path without the uiAccess worker. Calculator's
num5Button drove the display 0→5 via UIA Invoke from the Medium-IL daemon with no
cua-driver-uia.exe running. This refines the #1602/serve.rs assumption: only the
pixel/SendInput path needs the uiAccess worker for AppContainer apps — the
element_index UIA Invoke/ValuePattern path works on real UWP as-is.
C. Real Edge (Chromium) holds the foreground contract. With Notepad pinned top, 3×
click + double-click + right-click on background Edge left the z-order unchanged
(background, no foreground swap). The shield validated on synthetic Electron (0/8)
generalises to a real closed-source Chromium — and the double/right-click that needed
the WinUI3 composition fix synthetically just work on real Chromium in the background.
D. element_index requires pid (+window_id). Element actions with element_index
alone fail-fast with Missing required integer field: pid. Correct, but the MCP tool
descriptions under-emphasise it — an agent that omits pid and filters stderr perceives a
silent no-op. The element_index tool schemas should state pid's necessity explicitly.
E. get_screen_size under-reports desktop width (1024 reported vs 1824 actual span,
likely an RDP dynamic-resolution artifact). The element path is unaffected (window-local
frames stay consistent); a pixel/vision agent computing against 1024 width misplaces
clicks in the right ~800 px band.
F. ⚠ Unhandled-protocol launch_app deadlocks the whole daemon (DoS-class). Launching an
app whose protocol has no registered handler (e.g. bingmaps: with Maps uninstalled) spawns
Windows' "you'll need a new app" modal on the daemon's session desktop and blocks the
worker thread inside the shell-launch call. The client wedges indefinitely and a subsequent
stop reports "daemon is not running" because the wedged daemon can't service the pipe;
recovery needs the modal dismissed from inside that session (Session-0 SSH can't see it —
per-session window stations). Any bad app name/protocol is a full-daemon DoS. The driver
should launch via a non-blocking path with a timeout and/or validate handler registration
before ShellExecute. (Needs a fix + a tracking issue.)
G. One ApplicationFrameHost pid multiplexes N unrelated UWP apps (Settings, Calculator,
and Store were all hosted under the same pid). Anything keying state/cursor/element-cache by
pid alone is ambiguous across apps — only window_id is a real identity. (Generalises A.)
H. launch_app's return contract is effectively per-app-architecture — three topologies
seen across five apps: brokered/pid:0 (Settings, Store), real-pid-but-empty-windows race
(Photos exposes its pid a beat before its HWND), and clean pid+window (Snipping Tool, the
well-behaved baseline). Store is a three-way split: launch-pid 0 ≠ window-pid (AFH) ≠
content-pid (WinStore.App), yet UIA still walks fully into the content provider.
I. The AFH UIA root is a caption-only ~188×32 strip for Settings and Store while the content
frames are full and correct — the AppFrame chrome and the XAML content are different UIA
providers stitched at the window. A single-provider synthetic WinUI harness won't reproduce
this dual-provider root.
Methodology note: on a disconnected-RDP console session, `GetForegroundWindow` returns `0` (no foreground window), so focus-steal can't be measured by that probe there; verify "landed" by content-state change instead. A single-process harness never hits this.
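Finding F recommends a non-blocking launch path with a timeout. The sketch below illustrates that idea only; it is not the driver's code. `launch_with_timeout` and the `launch` callable are hypothetical names, with `launch` standing in for whatever blocking shell-launch call is actually used.

```python
import threading
from typing import Callable, List, Optional


def launch_with_timeout(launch: Callable[[], int], timeout_s: float) -> Optional[int]:
    """Run a possibly-blocking launch call off the caller's thread.

    Returns the launch result, or None if the call did not finish in time.
    `launch` is a hypothetical stand-in for the real shell-launch call.
    """
    result: List[int] = []

    def worker() -> None:
        result.append(launch())

    # daemon=True so a permanently wedged launch cannot keep the process alive.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    return result[0] if result else None
```

With this shape the caller gets `None` after the deadline and can report a launch failure instead of wedging. The blocked worker thread itself is not cancelled, so a real fix would likely still need handler validation before the launch call.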
Linux dispatch ladder — container and VM lanes#
Linux (X11/XFCE and real-desktop VMs) now validates the same delivery_mode background/foreground dispatch ladder that macOS and Windows exercise, closing the parity gap. The matrix below shows each modality's path and whether the driver can verify the action landed without a screenshot.
Validated modality matrix#
| Modality | delivery_mode | Path reported | Driver-verifiable? |
|---|---|---|---|
| Element click (element_index) | background | x11_atspi (AT-SPI doAction) | yes: a11y action confirmed |
| Element px action (pixel) | background | x11_atspi (AT-SPI doAction-at-point) for AX apps; else MPX x11_pixel | yes when AT-SPI-at-point lands; best-effort otherwise |
| Element px action (escalated) | foreground | x11_pixel_fg (EWMH activate → inject → restore) | no: confirm via screenshot |
| type_text into editable | background | ax (AT-SPI insertText) | yes (verified: true) |
| type_text, non-editable focus | background / foreground | key_events / key_events_fg | no: confirm via screenshot |
| press_key / hotkey | foreground | x11_xtest_fg (XSetInputFocus → XTEST key) | no: confirm via screenshot / effect (keytest) |
Background element px actions (pixel clicks) do land on X11: apps that expose AT-SPI take the focus-free doAction-at-point path (x11_atspi), matching the macOS/Windows background-click behavior. The fallback to MPX x11_pixel (which requires a real Xorg server with /dev/uinput — unavailable under Xvnc or minimal containers) fires only for non-AX surfaces; escalate to delivery_mode: foreground there.
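The rung selection for element px actions on X11 can be summarised as a small decision function. This is a sketch derived from the matrix above, not the driver's implementation; `Rung`, `pixel_click_rung`, and the `exposes_atspi` flag are hypothetical names, and the verifiable flag for `x11_atspi` assumes the AT-SPI-at-point action actually lands.

```python
from typing import NamedTuple


class Rung(NamedTuple):
    path: str
    driver_verifiable: bool


def pixel_click_rung(delivery_mode: str, exposes_atspi: bool) -> Rung:
    """Pick the X11 rung for an element px (pixel) action, per the matrix above."""
    if delivery_mode == "foreground":
        # EWMH activate -> inject -> restore; confirm via screenshot.
        return Rung("x11_pixel_fg", False)
    if delivery_mode != "background":
        raise ValueError(f"unknown delivery_mode: {delivery_mode!r}")
    if exposes_atspi:
        # Focus-free AT-SPI doAction-at-point.
        return Rung("x11_atspi", True)
    # MPX fallback: best-effort, needs a real Xorg server with /dev/uinput.
    return Rung("x11_pixel", False)
```

A caller that receives the `x11_pixel` rung on a non-AX surface under Xvnc or a minimal container would escalate to `delivery_mode: foreground`, as described above.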
Foreground keyboard input (press_key / hotkey, and type_text in foreground mode) lands under Xvnc. The XTEST keyboard path injects on a short-lived X connection, and it now round-trips after injection so the server delivers the key before the connection closes. Under Xvnc, a connection that closed right after flush() had its queued KeyPress/KeyRelease dropped; pointer events survived the same close, which masked the bug as "foreground clicks work but keys don't". Two companion fixes landed with it. First, shifted-level symbols (such as *, + and the opening parenthesis) auto-hold Shift via slot-0/slot-1 keysym resolution; previously a bare keycode press emitted the unshifted glyph, e.g. * came out as 8. Second, the foreground rung now calls XSetInputFocus on the target: KWin honours EWMH _NET_ACTIVE_WINDOW as raise-only, which is fine for clicks that land by stacking, but keys route to the X input focus. derec.sh keytest effect-confirms the whole path across toolkits (AC, 3*4= → display reads 12).
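The slot-0/slot-1 keysym resolution can be illustrated with a lookup that returns both the keycode and whether Shift must be held. This is a minimal sketch under stated assumptions: `KEYMAP` is a hand-written, illustrative excerpt of a US-layout mapping, and `resolve_key` is a hypothetical helper, not a driver function.

```python
from typing import Dict, Optional, Tuple

# Illustrative excerpt of a keyboard mapping (assumed US layout):
# keycode -> (slot-0 keysym, slot-1 keysym), i.e. unshifted and shifted glyphs.
KEYMAP: Dict[int, Tuple[str, str]] = {
    12: ("3", "#"),
    17: ("8", "*"),
    18: ("9", "("),
    21: ("=", "+"),
}


def resolve_key(
    glyph: str, keymap: Dict[int, Tuple[str, str]] = KEYMAP
) -> Optional[Tuple[int, bool]]:
    """Return (keycode, hold_shift) for a glyph, or None if it is unmapped.

    Slot 0 is searched first so an unshifted match always wins.
    """
    for slot, hold_shift in ((0, False), (1, True)):
        for keycode, keysyms in keymap.items():
            if keysyms[slot] == glyph:
                return keycode, hold_shift
    return None
```

Pressing only the keycode and ignoring the `hold_shift` flag reproduces the original bug: the key for `*` emits `8`.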
The XFCE container lane#
The trycua/cua-xfce Docker image is a reproducible, WM-equipped X11 target — xfwm4 + EWMH + full AT-SPI stack — that exercises both the foreground/EWMH and background AT-SPI modalities. It is richer than the existing Xvfb harness (no window manager, no EWMH) for those paths.
The harness script lives at libs/cua-driver/test-harness/linux-container/calc.sh. It drives galculator through four modalities: background element ax click, background element px (pixel) click, foreground EWMH type, and background type with focus fallback.
Gotchas:
- AT-SPI session-bus auto-discovery. AT-SPI lives on the desktop session's D-Bus (`DBUS_SESSION_BUS_ADDRESS`). When the daemon starts outside the session (container entrypoint, headless, `runuser`/`su`, systemd system unit, VNC ad-hoc bus), that variable is unset, yielding an empty AT-SPI tree. The driver now auto-discovers the bus at startup, preferring `/run/user/<uid>/bus`, else reading it from a running desktop-session process's `/proc/<pid>/environ` (`xfce4-session`, `gnome-session`, etc.). An a11y bus must be running with toolkit-accessibility enabled, and the daemon must run as the desktop user: running the daemon as root against a user session is the Linux analogue of the Windows Session 0 isolation problem. `install-local` refuses root. Use `runuser -u <desktop-user> -- cua-driver serve` to start the daemon as the session owner.
- Single-instance apps + zombie children. Unreaped zombie children can pollute `pgrep`; a full daemon restart reaps them.
- Stale daemon, stale tool schema. A daemon that was not restarted after an update can serve an outdated tool schema. Verify with `cua-driver describe <tool>` from a fresh shell.
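The bus auto-discovery order from the first gotcha can be sketched as follows. This is an illustration, not the driver's implementation: `discover_session_bus` and `bus_from_environ` are hypothetical names, and how the session process pid is found is left to the caller.

```python
import os
from typing import Optional


def bus_from_environ(environ_blob: bytes) -> Optional[str]:
    """Extract DBUS_SESSION_BUS_ADDRESS from a /proc/<pid>/environ blob.

    The file is a sequence of NUL-separated KEY=VALUE entries.
    """
    prefix = b"DBUS_SESSION_BUS_ADDRESS="
    for entry in environ_blob.split(b"\0"):
        if entry.startswith(prefix):
            return entry[len(prefix):].decode()
    return None


def discover_session_bus(uid: int, session_pid: Optional[int] = None) -> Optional[str]:
    """Prefer the per-user bus socket, else read a session process's environment."""
    socket_path = f"/run/user/{uid}/bus"
    if os.path.exists(socket_path):
        return f"unix:path={socket_path}"
    if session_pid is not None:
        try:
            with open(f"/proc/{session_pid}/environ", "rb") as f:
                return bus_from_environ(f.read())
        except OSError:
            # Unreadable when the daemon runs as a different user than the session.
            return None
    return None
```

The `OSError` branch is where running as the wrong user shows up: another user's `/proc/<pid>/environ` is normally not readable, which is one reason the daemon must run as the session owner.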
Toolkit coverage#
| Toolkit | App under test | Lane | AT-SPI coord path |
|---|---|---|---|
| GTK3 | galculator | Container (XFCE / Kasm, az exec) | SCREEN — AT-SPI returns correct screen coords |
| GTK4 | gnome-calculator | VM (GNOME Shell, SSH) | WINDOW + _GTK_FRAME_EXTENTS reconstruction (see below) |
| Qt | kcalc | VM (KDE Plasma, SSH) | SCREEN — Qt AT-SPI reports correct screen coords |
| Electron / Chromium | shared harness | Container (XFCE, az exec) | web-AX + pixel (--force-renderer-accessibility) |
Lane matrix#
| Lane | Desktop | Access method | Test app | Harness driver |
|---|---|---|---|---|
| XFCE container | trycua/cua-xfce (xfwm4) | az exec | galculator (GTK3), Electron | libs/cua-driver/test-harness/linux-container/calc.sh |
| Kasm container | trycua/cua-ubuntu (Kasm/XFCE) | az exec | galculator (GTK3) | libs/cua-driver/test-harness/linux-container/calc.sh |
| KDE Plasma VM | Azure VM, KDE Plasma X11 | SSH → derec.sh | kcalc (Qt) | libs/cua-driver/test-harness/linux-container/derec.sh |
| GNOME Shell VM | Azure VM, GNOME Shell X11 (console Xorg) | SSH → derec.sh | gnome-calculator (GTK4) | libs/cua-driver/test-harness/linux-container/derec.sh |
| GNOME Wayland VM | Azure VM, GNOME Mutter Wayland | SSH → derec.sh | gnome-calculator (GTK4) | libs/cua-driver/test-harness/linux-container/derec.sh |
The VM lanes are driven over SSH by derec.sh, in contrast to the container lanes which use az exec. GNOME Shell requires the real console Xorg session (software GLX) and is not drivable over VNC (no GLX). KDE Plasma works over both console Xorg and VNC.
GTK4 coordinate reconstruction — invariant#
GTK4's AT-SPI bridge returns Component.GetExtents(SCREEN) = (0, 0) for every widget (GNOME/gtk #1564/#1739, a11y rework). The driver fix:
- Detects `_GTK_FRAME_EXTENTS` on the X11 window (present for GTK4; absent for Qt and GTK3).
- Queries `CoordType::Window` (which GTK4 reports correctly, per-widget).
- Reconstructs: `screen_xy = x11_window_origin + _GTK_FRAME_EXTENTS.(left, top) + WINDOW_xy`.
Verified live: gnome-calculator button "7" = window-origin (55, 27) + inset (61, 55) + WINDOW (16, 293) = screen (132, 375).
Toolkits whose windows do not set _GTK_FRAME_EXTENTS (Qt, and GTK3 as noted above) keep the existing SCREEN path; the gate is the presence of that property.
Invariant worth guarding: frame_center ≈ x11_window_origin + _GTK_FRAME_EXTENTS + atspi_WINDOW_xy. Any divergence means the reconstruction is broken. Before the fix, pixel clicks silently collapsed to the window corner; doAction element clicks landed regardless (coordinate-free).
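The reconstruction and the invariant can be checked with a few lines of arithmetic. The sketch below reuses the live values quoted above for the gnome-calculator "7" button; the function names and the 2 px tolerance are assumptions made for illustration.

```python
from typing import Tuple

Point = Tuple[int, int]


def reconstruct_screen_xy(window_origin: Point, frame_inset: Point, window_xy: Point) -> Point:
    """screen_xy = x11_window_origin + _GTK_FRAME_EXTENTS.(left, top) + WINDOW_xy."""
    return (
        window_origin[0] + frame_inset[0] + window_xy[0],
        window_origin[1] + frame_inset[1] + window_xy[1],
    )


def invariant_holds(reported: Point, reconstructed: Point, tolerance_px: int = 2) -> bool:
    """True when the reported frame position matches the reconstruction.

    tolerance_px is an assumed slack for rounding, not a documented value.
    """
    return (
        abs(reported[0] - reconstructed[0]) <= tolerance_px
        and abs(reported[1] - reconstructed[1]) <= tolerance_px
    )


# Worked example from the text: origin (55, 27) + inset (61, 55) + WINDOW (16, 293).
button_7 = reconstruct_screen_xy((55, 27), (61, 55), (16, 293))
print(button_7)  # (132, 375)
```

A regression test could assert `invariant_holds` for a known widget; a result pinned near the window corner would indicate the pre-fix collapse.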
The Wayland lane (GNOME Mutter / KDE Plasma)#
On a native Wayland session the background AX rung is the same coordinate-free AT-SPI doAction as on X11, and it is the one rung that works unchanged: Action.DoAction is dispatched toolkit-side (GTK/Qt invoke the widget's own action), so it needs no coordinates, no portal, no cursor move, and no focus steal. Element-index clicks therefore land on Mutter/KWin exactly as on X11.
Element px (pixel) coordinate clicks now land too — without pointer injection. There is no global coordinate space on Wayland and Mutter drops synthetic virtual-pointer events, so a "click at (x,y)" can't be delivered as input. Instead the driver resolves the screen pixel back to the accessible element under it and fires that element's action — the proven element_index rung, reached by coordinates. The screen frame of every element is reconstructed the same way get_window_state exposes it (the org.cua.WinRects GNOME Shell helper supplies the window origin; GTK4's correct CoordType::Window extents supply the per-widget offset). The covering element is chosen role-aware: GTK4 nests a label inside every button with a near-identical, slightly smaller frame, so an area-only hit-test lands on the inert label and do_action silently no-ops (a "false success"); the selector prefers the smallest covering real actuator and only falls back to a passive label if nothing else covers the point. Validated live on Mutter Wayland: an element px action targeting buttons by {x,y} (no element_index) types 789 into gnome-calculator (rung wayland_atspi). The same role-aware fix hardened the X11 perform_action_at_point rung, which had the identical GTK4 false-success.
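The role-aware selection described above can be sketched as a hit-test that filters by role before comparing areas. This is an illustration only: the `Element` type, `pick_target`, and the contents of `ACTUATOR_ROLES` are assumptions, and the driver's real role list may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumed set of roles treated as real actuators; illustrative only.
ACTUATOR_ROLES = {"push button", "toggle button", "check box", "radio button", "menu item"}


@dataclass(frozen=True)
class Element:
    role: str
    x: int
    y: int
    width: int
    height: int

    def covers(self, px: int, py: int) -> bool:
        return self.x <= px < self.x + self.width and self.y <= py < self.y + self.height

    @property
    def area(self) -> int:
        return self.width * self.height


def pick_target(elements: Iterable[Element], px: int, py: int) -> Optional[Element]:
    """Prefer the smallest covering actuator; fall back to a passive element."""
    covering = [e for e in elements if e.covers(px, py)]
    actuators = [e for e in covering if e.role in ACTUATOR_ROLES]
    pool = actuators or covering
    return min(pool, key=lambda e: e.area) if pool else None
```

With a button at 60×40 and its nested label at 50×30, an area-only `min` over `covering` returns the label, which is the false-success case; filtering to actuators first returns the button.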
Two Wayland-specific facts the lane exercises:
- Window enumeration via AT-SPI. Native Wayland apps have no X11 XID, and GNOME Mutter / KDE KWin do not implement `zwlr_foreign_toplevel_management` (that is wlroots-only), so `wayland::list_windows` is empty there. `list_windows_dispatch` now falls back to `atspi::list_windows`, which enumerates top-level frames from the AT-SPI registry and hands back a synthetic, stable `window_id`. Downstream `get_window_state` / `click` walk the AT-SPI tree by pid (the xid is unused for the walk), so the full `list_windows → get_window_state → click` flow works. Validated live on Mutter Wayland: `7+8=` via `element_index` background clicks computes `15`.
- Engaging the Wayland backend. `is_wayland()` requires the opt-in `CUA_DRIVER_RS_ENABLE_WAYLAND=1`, `WAYLAND_DISPLAY` set, and `DISPLAY` unset; otherwise the daemon stays on the X11 path. Start the daemon with all three for a pure-Wayland session.
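The three-way gate for the Wayland backend can be expressed as a single predicate over the environment. The sketch below mirrors the conditions listed above and is not the driver's code; treating an empty `DISPLAY` or `WAYLAND_DISPLAY` the same as an unset one is an assumption.

```python
from typing import Mapping


def is_wayland(env: Mapping[str, str]) -> bool:
    """Sketch of the gate: opt-in flag set, WAYLAND_DISPLAY set, DISPLAY unset."""
    opted_in = env.get("CUA_DRIVER_RS_ENABLE_WAYLAND") == "1"
    has_wayland = bool(env.get("WAYLAND_DISPLAY"))
    # Assumption: an empty DISPLAY counts as unset.
    has_x11 = bool(env.get("DISPLAY"))
    return opted_in and has_wayland and not has_x11
```

Passing `os.environ` evaluates the gate for the current process. A session that still exports `DISPLAY` (for example through XWayland) fails the check and stays on the X11 path.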
Bringing up a headless GNOME Wayland VM (the lane's host) has two gotchas worth recording, neither cua-driver's concern but both block the lane:
- GDM disables Wayland for the Azure virtual GPU. The `hyperv_drm` DRM driver exposes no `parameters/modeset`, so GDM's `61-gdm.rules` udev gate (`ATTR{parameters/modeset}!="Y"`) force-selects Xorg. Override by shadowing the rule in `/etc/udev/rules.d/61-gdm.rules` (drop the modeset + bare `GOTO="gdm_disable_wayland"` lines) and set `Session=ubuntu-wayland` in `/var/lib/AccountsService/users/<user>`, then restart gdm.
- GTK4 apps need software GL on a GPU-less box. Launch with `LIBGL_ALWAYS_SOFTWARE=1 GSK_RENDERER=cairo`, else the GL renderer fails (EGL/ZINK) and the app exits.
Appendix — file map#
| Path | Role |
|---|---|
| shared/scenarios.json | single source of truth: control ids, window titles, scenarios |
| shared/web/index.html | shared DOM for WebView2 / Electron / WKWebView harnesses |
| apps/ | per-OS/toolkit harness sources (see §3 table) |
| build/{macos.sh,windows.ps1,linux.sh} | stage harnesses into rust/test-apps/harness-* |
| modality-recordings/FINDINGS.md | authoritative per-action findings |
| modality-recordings/windows/wpf-recorder.ps1 | Windows recorder (-Mode, -Toolkit) |
| modality-recordings/windows/run-one.ps1 | single-run launcher |
| modality-recordings/linux/lin-rec.py, lin-rec-electron.py | Linux recorders (+ lin-run*.sh, lin-dash*, lin-harness.py) |
| modality-recordings/macos/mac-rec.py | macOS recorder (MODE SURFACE) |
| vision-agent-test/vision_agent_test.py + README.md | coordinate-invariant regression test |
| smoke/macos.sh | macOS smoke check |
| rust/crates/cua-driver/tests/harness_*.rs | Rust integration tests consuming staged harnesses |
| TEST_SUITE.md (this dir) | sibling doc: the Rust integration-test taxonomy / cua-driver-testkit |
| CUA_DRIVER_LIMITATIONS_AND_TEST_MATRIX.md (repo root) | limitations + mode/scope mapping |