Cua Docs

Linux and Wayland

Why Cua Driver leans on the accessibility tree on Linux, how it recovers real screen coordinates GTK4 hides, and how a coordinate click lands on Wayland where there is no global coordinate space.

Linux is not one target. The same desktop app runs under two different session stacks — X11 and Wayland — and they expose input and geometry on opposite terms. This page explains how Cua Driver reaches apps on both, and why the accessibility tree, not synthetic input, is the through-line.

Background

X11 was built when one program reading or driving another's window was normal; window IDs are global, synthetic events can be addressed to any window, and pixels can be read from any mapped surface. Wayland was built to end exactly that: a client cannot see or inject into another client's surface, and there is no global coordinate space to address. Both models are deliberate. The driver does not fight either — it routes around the parts that are closed.

The semantic path is the through-line

The one mechanism that behaves the same on X11 and Wayland is AT-SPI (the Linux accessibility bus). When the driver clicks a control by element_index, it calls that control's own accessibility action — Action.DoAction — which the toolkit (GTK, Qt) executes internally. No coordinates, no synthetic pointer event, no window focus, no raised window. Because the action is dispatched toolkit-side rather than as input, it crosses the Wayland client boundary that blocks synthetic input. This is why background, by-element automation is the default on Linux: it is the rung that works everywhere and the one the driver can verify without a screenshot (it reads the result back from the same tree). Pixel actions still get an honest verdict — the driver cross-checks the screenshot and reports effect/escalation — they just aren't confirmable from the tree alone.
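The by-element path can be modeled in a few lines. This is an illustrative sketch only: `Element`, `ToggleButton`, and `click_by_index` are hypothetical names, not Cua Driver's API or a real AT-SPI binding. In a real session, `do_action` would forward to the toolkit over the accessibility bus; here an in-memory widget stands in for it.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Protocol


class Element(Protocol):
    """Hypothetical view of one accessible element (not a real AT-SPI binding)."""

    def do_action(self, index: int) -> bool: ...
    def states(self) -> FrozenSet[str]: ...


@dataclass
class ToggleButton:
    """In-memory stand-in for a toolkit widget that executes its own action."""

    checked: bool = False

    def do_action(self, index: int) -> bool:
        if index != 0:  # this stand-in exposes exactly one action
            return False
        self.checked = not self.checked
        return True

    def states(self) -> FrozenSet[str]:
        return frozenset({"checked"}) if self.checked else frozenset()


def click_by_index(tree: List[Element], element_index: int) -> Dict[str, object]:
    """Invoke the element's own action, then read the result back from the tree."""
    element = tree[element_index]
    before = element.states()
    dispatched = element.do_action(0)  # no coordinates, no pointer event, no focus
    after = element.states()
    return {"dispatched": dispatched, "changed": before != after, "states": after}


tree = [ToggleButton(), ToggleButton(checked=True)]
result = click_by_index(tree, 0)
```

The point of the shape is that verification uses the same tree the action went through, so no screenshot is needed to decide whether the click had an effect.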

Why coordinates need reconstructing

Agents that work from a screenshot think in pixels, so the driver also has to answer "where is this element on screen?" — and Linux makes that harder than it sounds.

On GTK4, the accessibility bridge reports Component.GetExtents in screen coordinates as (0, 0) for every widget (GNOME/gtk issues #1564 / #1739). Taken at face value, every element collapses onto the window's corner — the agent cursor, element frames, and any coordinate click all pile up in one place.

The fix is to ask a different question. GTK4 does report each widget's position correctly relative to its window (CoordType::Window); what's missing is only the window's origin on screen. So the driver reconstructs the real coordinate as window-relative position + window origin, and sources the origin per session stack:

  • X11 — the window's _GTK_FRAME_EXTENTS (the client-side-decoration inset) plus the X11 window origin.
  • Wayland — a small bundled GNOME Shell helper (org.cua.WinRects) that reports each window's frame from inside the compositor, which is the only privileged vantage point that can see it. A normal client cannot.

The tradeoff is explicit: this is a workaround for a toolkit regression, applied only when the GTK markers are present, so non-GTK toolkits (Qt reports screen coordinates correctly) keep their native path untouched.
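The reconstruction is plain addition, which a short sketch makes concrete. Everything named here (`Rect`, `x11_origin`, `to_screen`, `element_screen_rect`, the `is_gtk4` flag) is hypothetical and chosen for illustration. The sketch assumes the frame-extents inset is added to the X11 window origin, following the wording above, and that the Wayland origin arrives as a plain `(x, y)` pair from the shell helper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    width: int
    height: int


def x11_origin(window_origin: Point, frame_inset: Point) -> Point:
    """X11 case: window origin plus the (left, top) decoration inset (assumed sign)."""
    return (window_origin[0] + frame_inset[0], window_origin[1] + frame_inset[1])


def to_screen(window_relative: Rect, origin: Point) -> Rect:
    """Screen position = window-relative position + window origin."""
    return Rect(
        window_relative.x + origin[0],
        window_relative.y + origin[1],
        window_relative.width,
        window_relative.height,
    )


def element_screen_rect(
    reported_screen: Rect,
    window_relative: Rect,
    origin: Optional[Point],
    is_gtk4: bool,
) -> Rect:
    """Reconstruct only for GTK4; other toolkits keep the extents they reported."""
    if not is_gtk4 or origin is None:
        return reported_screen
    return to_screen(window_relative, origin)


origin = x11_origin((100, 50), (26, 23))
gtk_rect = element_screen_rect(Rect(0, 0, 80, 30), Rect(10, 20, 80, 30), origin, True)
qt_rect = element_screen_rect(Rect(300, 200, 80, 30), Rect(10, 20, 80, 30), origin, False)
```

With these made-up numbers, the GTK4 widget that reported `(0, 0)` resolves to `(136, 93)`, while the Qt widget keeps the screen rectangle it reported.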

How a coordinate click lands on Wayland

On Wayland there is no global coordinate space and the compositor discards synthetic pointer events, so "click at (x, y)" cannot be delivered as input at all. Rather than give up the vision modality, the driver inverts the problem: it takes the reconstructed frames above, finds the accessible element whose frame contains the target pixel, and fires that element's action — back on the semantic path that already works.

The subtlety is which element. GTK4 nests a label inside every button with a near-identical, slightly smaller frame; picking the smallest covering element lands on the inert label, whose action does nothing — a click that "succeeds" and changes nothing. So the selection is role-aware: it prefers the smallest covering real actuator and only falls back to a passive label when nothing else covers the point. The net effect is that "click pixel (x, y)" becomes "activate the control the agent sees there," with no pointer injection.
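A minimal sketch of that role-aware selection follows. The names (`Node`, `ACTUATOR_ROLES`, `pick_target`) and the exact set of roles treated as actuators are assumptions made for illustration; the driver's actual role list is not given here.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed set of roles that count as "real actuators" (illustrative only).
ACTUATOR_ROLES = frozenset(
    {"push button", "toggle button", "check box", "radio button", "menu item", "link"}
)


@dataclass(frozen=True)
class Node:
    role: str
    x: int
    y: int
    width: int
    height: int

    def contains(self, px: int, py: int) -> bool:
        return self.x <= px < self.x + self.width and self.y <= py < self.y + self.height

    @property
    def area(self) -> int:
        return self.width * self.height


def pick_target(nodes: List[Node], px: int, py: int) -> Optional[Node]:
    """Prefer the smallest covering actuator; fall back to any covering element."""
    covering = [n for n in nodes if n.contains(px, py)]
    if not covering:
        return None
    actuators = [n for n in covering if n.role in ACTUATOR_ROLES]
    pool = actuators or covering  # passive elements only when no actuator covers the point
    return min(pool, key=lambda n: n.area)


frame = Node("frame", 0, 0, 800, 600)
button = Node("push button", 100, 100, 80, 30)
label = Node("label", 110, 105, 60, 20)  # nested inside the button, slightly smaller
hit = pick_target([frame, button, label], 120, 110)
```

A plain smallest-area rule would return `label` for the point `(120, 110)`; filtering by role first returns `button`, which is the element whose action actually does something.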

The same shell helper that supplies window origins also draws the agent cursor on Wayland, for the same reason: a client cannot position a surface at an arbitrary screen coordinate, but the compositor can.

What still needs X11

The residual gap is raw keyboard injection. Typing into an accessible text field works (AT-SPI writes the field directly), but synthetic key events, and keys that some apps only read as raw device input, have no free path on native Wayland: the cooperative routes (libei, virtual-keyboard protocols) need either a portal grant or compositor support that Mutter and KWin do not provide. Running the app under XWayland restores the X11 XTEST keyboard path, which is why a GTK4 app launched with GDK_BACKEND=x11 types fine.
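Launching an app on the X11 backend only requires setting `GDK_BACKEND=x11` in its environment. The sketch below shows one way to do that from Python; `x11_backend_env` and `launch_under_xwayland` are hypothetical helper names, and it assumes XWayland is running so that `DISPLAY` is already set in the inherited environment.

```python
import os
import subprocess
from typing import Dict, Mapping, Sequence


def x11_backend_env(base: Mapping[str, str]) -> Dict[str, str]:
    """Copy an environment and force GTK onto its X11 backend."""
    env = dict(base)
    env["GDK_BACKEND"] = "x11"
    return env


def launch_under_xwayland(argv: Sequence[str]) -> subprocess.Popen:
    """Start the app with GDK_BACKEND=x11; the caller owns the returned process."""
    return subprocess.Popen(list(argv), env=x11_backend_env(os.environ))


# Building the environment has no side effects, so it can be inspected first.
env = x11_backend_env({"WAYLAND_DISPLAY": "wayland-0"})
```

Calling `launch_under_xwayland(["some-gtk4-app"])` (a placeholder command) would start the app as an X11 client, leaving the rest of the session on Wayland.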

One more place X11 assumptions leak: native Wayland apps have no X11 window ID, and neither GNOME Mutter nor KDE KWin implements the wlr-foreign-toplevel protocol (that is wlroots-only), so the usual window list comes back empty. The driver falls back to enumerating top-level frames from the AT-SPI registry and hands back a synthetic but stable window_id. Everything downstream walks the tree by process, not by that ID, so the full list_windows → get_window_state → click flow works unchanged.
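How the driver derives its synthetic window_id is not described here. As one possible approach, the sketch below hashes the owning process ID and the frame's position among that process's top-level frames. The `Frame` type, the function bodies, and the choice of CRC32 are all assumptions made for illustration.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Frame:
    """Hypothetical record for one top-level frame found in the accessibility registry."""

    pid: int
    index: int  # position among that process's top-level frames
    title: str


def synthetic_window_id(frame: Frame) -> int:
    """Derive an ID from (pid, index); CRC32 is deterministic across runs, unlike hash()."""
    key = "{}:{}".format(frame.pid, frame.index).encode("ascii")
    return zlib.crc32(key)


def list_windows(frames: List[Frame]) -> List[Dict[str, object]]:
    """Return window records that carry the pid, so later steps can walk by process."""
    return [
        {"window_id": synthetic_window_id(f), "pid": f.pid, "title": f.title}
        for f in frames
    ]


frames = [Frame(4242, 0, "Settings"), Frame(4242, 1, "About")]
windows = list_windows(frames)
```

Because the title is left out of the key, the ID stays the same when a window retitles itself, and because each record carries the pid, downstream steps never need to resolve the ID back to a compositor object.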

Further reading