SDK Reference

The cua-bench Session interface provides methods for interacting with desktop environments, whether they're simulated (Playwright-based) or native (Docker/QEMU-based).

Quick Example

import asyncio

from cua_bench import make


async def main():
    # Load an environment
    env = make("./my_task")

    # Reset to start a task
    screenshot, task_config = await env.reset(task_id=0)

    # Get the session from the environment
    session = env.session

    # Take actions
    await session.click(100, 200)
    await session.type("Hello, world!")
    await session.key("Enter")

    # Take a screenshot (PNG bytes)
    screenshot_bytes = await session.screenshot()

    # Close when done
    await env.close()


asyncio.run(main())

Session Interface

The DesktopSession protocol defines the interface for interacting with desktop environments.

Lifecycle Methods

from cua_bench.computers.base import get_session

# The snippets below use await, so run them inside an async function.

# Async context manager (preferred)
async with get_session("native")(os_type="linux") as session:
    await session.screenshot()

# Manual lifecycle
session = get_session("native")(os_type="linux")
await session.start()
try:
    await session.screenshot()
finally:
    await session.close()

Method	Description
`start(config?, headless?)`	Start the session and connect to the environment
`close()`	Close the session and cleanup resources
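
The two lifecycle styles are equivalent when the context manager calls `start()` on entry and `close()` on exit. The sketch below shows that pairing with a hypothetical `RecordingSession` stand-in (not part of cua-bench), so it runs without Docker or a browser. Whether the real session classes implement `__aenter__` and `__aexit__` exactly this way is an assumption.

```python
import asyncio


class RecordingSession:
    """Hypothetical stand-in for a session; it only records lifecycle calls."""

    def __init__(self) -> None:
        self.events: list[str] = []

    async def start(self) -> None:
        self.events.append("start")

    async def close(self) -> None:
        self.events.append("close")

    async def screenshot(self) -> bytes:
        self.events.append("screenshot")
        return b""

    # Assumed behaviour: entering calls start(), exiting calls close().
    async def __aenter__(self) -> "RecordingSession":
        await self.start()
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        await self.close()


async def run_with_context_manager() -> list[str]:
    async with RecordingSession() as session:
        await session.screenshot()
    return session.events


async def run_manually() -> list[str]:
    session = RecordingSession()
    await session.start()
    try:
        await session.screenshot()
    finally:
        # close() runs even if the body raises, mirroring `async with`.
        await session.close()
    return session.events
```

Both helpers record the same call order, which is why the context manager form is usually the safer default: cleanup cannot be forgotten.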

Screenshot & State

# Capture screenshot
screenshot_bytes = await session.screenshot()  # PNG bytes

# Get desktop state snapshot
snapshot = await session.get_snapshot()
# Returns: Snapshot(windows: List[WindowSnapshot])

Method	Return Type	Description
`screenshot()`	`bytes`	Capture screen as PNG bytes
`get_snapshot()`	`Snapshot`	Get lightweight snapshot of desktop state (windows, geometry, metadata)
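
Because `screenshot()` returns raw PNG bytes, the image dimensions can be read from the header without an imaging library. The helper below is a minimal sketch; `png_size` is a hypothetical name, not a cua-bench function, and it relies only on the standard PNG layout (8-byte signature followed by the IHDR chunk).

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of PNG bytes."""
    # Bytes 0-7: signature, 8-11: chunk length, 12-15: chunk type "IHDR",
    # 16-19: width, 20-23: height (both big-endian unsigned 32-bit).
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    width, height = struct.unpack(">II", data[16:24])
    return width, height
```

Typical use would be `width, height = png_size(await session.screenshot())`, for example to check that click coordinates fall inside the captured screen.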

Mouse Actions

# Click actions
await session.click(x, y)           # Left click
await session.right_click(x, y)     # Right click
await session.double_click(x, y)    # Double click
await session.middle_click(x, y)    # Middle click

# Movement and drag
await session.move_to(x, y)         # Move cursor
await session.drag(from_x, from_y, to_x, to_y)  # Drag gesture

# Scroll
await session.scroll(direction="down", amount=100)  # Scroll up/down

Method	Parameters	Description
`click(x, y)`	x: int, y: int	Left click at coordinates
`right_click(x, y)`	x: int, y: int	Right click at coordinates
`double_click(x, y)`	x: int, y: int	Double click at coordinates
`middle_click(x, y)`	x: int, y: int	Middle click at coordinates
`move_to(x, y)`	x: int, y: int	Move cursor to coordinates
`drag(from_x, from_y, to_x, to_y)`	from_x: int, from_y: int, to_x: int, to_y: int	Drag from one position to another
`scroll(direction, amount)`	direction: "up" \| "down", amount: int	Scroll in given direction

Keyboard Actions

# Type text
await session.type("Hello, world!")

# Press single key
await session.key("Enter")
await session.key("Escape")
await session.key("Tab")

# Key combinations (hotkeys)
await session.hotkey(["ctrl", "c"])     # Copy
await session.hotkey(["ctrl", "v"])     # Paste
await session.hotkey(["ctrl", "shift", "t"])  # Multiple modifiers

Method	Parameters	Description
`type(text)`	text: str	Type a string of text
`key(key)`	key: str	Press a single key (e.g., "Enter", "Escape", "Tab")
`hotkey(keys)`	keys: List[str]	Press a key combination (e.g., ["ctrl", "c"])
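
Since `hotkey()` takes a list of key names, a shortcut written as a single string has to be split first. The helper below is a small sketch of that conversion; `parse_hotkey` is a hypothetical name, and lowercasing the key names is an assumption based on the lowercase examples above.

```python
def parse_hotkey(combo: str) -> list[str]:
    """Split a string such as "Ctrl+Shift+T" into lowercase key names."""
    keys = [part.strip().lower() for part in combo.split("+")]
    # An empty part means the string had a stray "+" (or was empty).
    if not all(keys):
        raise ValueError(f"empty key name in {combo!r}")
    return keys
```

The result can be passed straight through, as in `await session.hotkey(parse_hotkey("Ctrl+Shift+T"))`.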

Window Management

# Launch a window
pid = await session.launch_window(
    url="https://example.com",
    title="My Window",
    width=800,
    height=600
)

# Execute JavaScript in a window
result = await session.execute_javascript(pid, "document.title")

# Get element position within a window
rect = await session.get_element_rect(pid, ".button-class")
# Returns: {"x": 10, "y": 20, "width": 100, "height": 30}

# Close all windows
await session.close_all_windows()

Method	Parameters	Return Type	Description
`launch_window(url?, html?, folder?, ...)`	Various options for window creation	`int \| str`	Launch a window and return its process ID
`execute_javascript(pid, js)`	pid: int \| str, js: str	`Any`	Execute JavaScript in a specific window
`get_element_rect(pid, selector)`	pid: int \| str, selector: str	`dict \| None`	Get element rectangle in window or screen space
`close_all_windows()`	-	`None`	Close or clear all open windows
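
A common follow-up to `get_element_rect()` is clicking the middle of the element. The sketch below computes that point from the documented rectangle shape and handles the `None` case; `rect_center` is a hypothetical helper. The table describes the rectangle as being in window or screen space, so confirm which one applies before passing the result to `click()`.

```python
from typing import Optional


def rect_center(rect: Optional[dict]) -> Optional[tuple[int, int]]:
    """Return the integer center of a rect dict, or None if there is no rect."""
    if rect is None:
        return None
    center_x = int(rect["x"] + rect["width"] / 2)
    center_y = int(rect["y"] + rect["height"] / 2)
    return center_x, center_y
```

For the rectangle shown earlier, `{"x": 10, "y": 20, "width": 100, "height": 30}`, the center is `(60, 35)`.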

Action System

For more advanced use cases, you can use the action dataclasses directly:

from cua_bench.types import (
    ClickAction,
    RightClickAction,
    DoubleClickAction,
    MiddleClickAction,
    DragAction,
    MoveToAction,
    ScrollAction,
    TypeAction,
    KeyAction,
    HotkeyAction,
    WaitAction,
    DoneAction,
)

# Execute an action directly
await session.execute_action(ClickAction(x=100, y=200))
await session.execute_action(TypeAction(text="Hello"))
await session.execute_action(HotkeyAction(keys=["ctrl", "c"]))

Available Action Types:

Action Class	Fields	Description
`ClickAction`	x: int, y: int	Left click at position
`RightClickAction`	x: int, y: int	Right click at position
`DoubleClickAction`	x: int, y: int	Double click at position
`MiddleClickAction`	x: int, y: int	Middle click at position
`DragAction`	from_x: int, from_y: int, to_x: int, to_y: int, duration: float	Drag gesture
`MoveToAction`	x: int, y: int, duration: float	Move cursor
`ScrollAction`	direction: "up" \| "down", amount: int	Scroll action
`TypeAction`	text: str	Type text
`KeyAction`	key: str	Press single key
`HotkeyAction`	keys: List[str]	Key combination
`WaitAction`	seconds: float	Wait/pause
`DoneAction`	-	Signal completion
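
Representing actions as data makes it possible to build a script up front and replay it through `execute_action()`. The sketch below illustrates the idea with local stand-in dataclasses and a hypothetical `RecordingSession`, so it runs without cua-bench installed; in real code the action classes would come from `cua_bench.types`. Treating `DoneAction` as a stop marker that is not forwarded to the session is an assumption of this sketch.

```python
import asyncio
from dataclasses import dataclass


# Stand-ins that mirror the field names in the table above.
@dataclass
class ClickAction:
    x: int
    y: int


@dataclass
class TypeAction:
    text: str


@dataclass
class DoneAction:
    pass


class RecordingSession:
    """Hypothetical session that records actions instead of performing them."""

    def __init__(self) -> None:
        self.executed: list[object] = []

    async def execute_action(self, action: object) -> None:
        self.executed.append(action)


async def replay(session, actions) -> int:
    """Execute actions in order, stopping at the first DoneAction."""
    count = 0
    for action in actions:
        if isinstance(action, DoneAction):
            break
        await session.execute_action(action)
        count += 1
    return count
```

The same `replay` function would accept any object exposing an async `execute_action` method, which keeps scripted runs independent of the provider.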

Provider Types

cua-bench supports two provider types:

Simulated Desktop (Playwright-based)

Name: "simulated" or "webtop"
Description: HTML/CSS desktop simulation in headless browser
Pros: Fast, no Docker required
Use cases: Web app testing, UI benchmarks

from cua_bench.computers.base import get_session

SessionClass = get_session("simulated")
session = SessionClass(env=env)

Native Desktop (Docker/QEMU-based)

Name: "native" or "computer"
Description: Real OS in Docker/QEMU container
Pros: Actual desktop environment with real applications
Requires: Docker
Use cases: Real app testing, OS-level tasks

from cua_bench.computers.base import get_session

SessionClass = get_session("native")
session = SessionClass(env=env)
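
Each provider accepts two names, so configuration code may want to reduce them to one canonical value before comparing or logging. The mapping below is a sketch built only from the names listed on this page; `PROVIDER_ALIASES` and `canonical_provider` are hypothetical, and whether `get_session` performs similar normalization internally is not stated here.

```python
PROVIDER_ALIASES = {
    "simulated": "simulated",
    "webtop": "simulated",
    "native": "native",
    "computer": "native",
}


def canonical_provider(name: str) -> str:
    """Map a provider name or alias to "simulated" or "native"."""
    key = name.strip().lower()
    if key not in PROVIDER_ALIASES:
        raise ValueError(
            f"unknown provider {name!r}; expected one of {sorted(PROVIDER_ALIASES)}"
        )
    return PROVIDER_ALIASES[key]
```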

Data Types

Snapshot

from dataclasses import dataclass
from typing import List, Literal, Optional


@dataclass
class WindowSnapshot:
    window_type: Literal["webview", "process", "desktop"]
    pid: Optional[str] = None
    url: Optional[str] = None
    html: Optional[str] = None
    title: str = ""
    x: int = 0
    y: int = 0
    width: int = 0
    height: int = 0
    active: bool = False
    minimized: bool = False


@dataclass
class Snapshot:
    windows: List[WindowSnapshot]

The get_snapshot() method returns a Snapshot object containing information about all open windows, their positions, sizes, and states.
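
A snapshot is often used to find the window that currently has focus, or to check whether a coordinate lands inside a window. The sketch below shows both queries using trimmed stand-in dataclasses with a subset of the fields above, so it runs on its own; `active_window` and `contains_point` are hypothetical helpers, and treating `x`, `y` as the window's top-left corner is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WindowSnapshot:  # trimmed stand-in with a subset of the fields
    title: str = ""
    x: int = 0
    y: int = 0
    width: int = 0
    height: int = 0
    active: bool = False
    minimized: bool = False


@dataclass
class Snapshot:
    windows: List[WindowSnapshot] = field(default_factory=list)


def active_window(snapshot: Snapshot) -> Optional[WindowSnapshot]:
    """Return the first window that is active and not minimized, if any."""
    for window in snapshot.windows:
        if window.active and not window.minimized:
            return window
    return None


def contains_point(window: WindowSnapshot, px: int, py: int) -> bool:
    """Check whether (px, py) lies inside the window's rectangle."""
    inside_x = window.x <= px < window.x + window.width
    inside_y = window.y <= py < window.y + window.height
    return inside_x and inside_y
```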
