SDK Reference

Session interface for interacting with desktop environments

The cua-bench Session interface provides methods for interacting with desktop environments, whether they're simulated (Playwright-based) or native (Docker/QEMU-based).

Quick Example

import asyncio

from cua_bench import make


async def main():
    # Load an environment
    env = make("./my_task")

    # Reset to start a task
    screenshot, task_config = await env.reset(task_id=0)

    # Get the session from the environment
    session = env.session

    # Take actions
    await session.click(100, 200)
    await session.type("Hello, world!")
    await session.key("Enter")

    # Take a screenshot
    screenshot_bytes = await session.screenshot()

    # Close when done
    await env.close()


# The session API is async, so run it inside an event loop
asyncio.run(main())

Session Interface

The DesktopSession protocol defines the interface for interacting with desktop environments.

Lifecycle Methods

from cua_bench.computers.base import get_session

# Async context manager (preferred)
async with get_session("native")(os_type="linux") as session:
    await session.screenshot()

# Manual lifecycle
session = get_session("native")(os_type="linux")
await session.start()
try:
    await session.screenshot()
finally:
    await session.close()

Method                     Description
-------------------------  -----------
start(config?, headless?)  Start the session and connect to the environment
close()                    Close the session and clean up resources
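
The two lifecycle forms above are meant to be interchangeable. As a minimal sketch of why, the example below assumes the context manager simply calls start() on entry and close() on exit. FakeSession is a hypothetical stand-in written for this illustration; it is not a cua-bench class, and the real implementation may differ.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for a session; not part of cua-bench."""

    def __init__(self) -> None:
        self.events: list[str] = []

    async def start(self) -> None:
        self.events.append("start")

    async def close(self) -> None:
        self.events.append("close")

    async def screenshot(self) -> bytes:
        self.events.append("screenshot")
        return b""

    # Assumed wiring: entering the context starts the session,
    # leaving it (even on error) closes the session.
    async def __aenter__(self) -> "FakeSession":
        await self.start()
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        await self.close()


async def demo() -> list[str]:
    session = FakeSession()
    try:
        async with session:
            await session.screenshot()
            raise RuntimeError("simulated failure mid-task")
    except RuntimeError:
        pass
    return session.events


events = asyncio.run(demo())
print(events)  # ['start', 'screenshot', 'close']
```

Even though the body raises, close() still runs, which is the same guarantee the try/finally form gives.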

Screenshot & State

# Capture screenshot
screenshot_bytes = await session.screenshot()  # PNG bytes

# Get desktop state snapshot
snapshot = await session.get_snapshot()
# Returns: Snapshot(windows: List[WindowSnapshot])

Method          Return Type  Description
--------------  -----------  -----------
screenshot()    bytes        Capture screen as PNG bytes
get_snapshot()  Snapshot     Get lightweight snapshot of desktop state (windows, geometry, metadata)
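
Since screenshot() returns raw PNG bytes, the image size can be read straight from the PNG header without an imaging library. The sketch below relies only on the PNG file layout (8-byte signature followed by the IHDR chunk). png_dimensions and make_blank_png are hypothetical helpers written for this example, and the generated blank image stands in for a real `await session.screenshot()` result.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of PNG bytes."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG image")
    # IHDR is always the first chunk: 4-byte length, 4-byte type, then data.
    if data[12:16] != b"IHDR":
        raise ValueError("missing IHDR chunk")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def _chunk(kind: bytes, payload: bytes) -> bytes:
    """Encode one PNG chunk: length, type, payload, CRC over type + payload."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def make_blank_png(width: int, height: int) -> bytes:
    """Build a black 8-bit grayscale PNG, used here as a fake screenshot."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline is a filter byte (0) followed by one byte per pixel.
    raw = (b"\x00" + b"\x00" * width) * height
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(raw))
        + _chunk(b"IEND", b"")
    )


fake_screenshot = make_blank_png(64, 48)  # stand-in for session.screenshot()
print(png_dimensions(fake_screenshot))  # (64, 48)
```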

Mouse Actions

# Click actions
await session.click(x, y)           # Left click
await session.right_click(x, y)     # Right click
await session.double_click(x, y)    # Double click
await session.middle_click(x, y)    # Middle click

# Movement and drag
await session.move_to(x, y)         # Move cursor
await session.drag(from_x, from_y, to_x, to_y)  # Drag gesture

# Scroll
await session.scroll(direction="down", amount=100)  # Scroll up/down

Method                            Parameters                                      Description
--------------------------------  ----------------------------------------------  -----------
click(x, y)                       x: int, y: int                                  Left click at coordinates
right_click(x, y)                 x: int, y: int                                  Right click at coordinates
double_click(x, y)                x: int, y: int                                  Double click at coordinates
middle_click(x, y)                x: int, y: int                                  Middle click at coordinates
move_to(x, y)                     x: int, y: int                                  Move cursor to coordinates
drag(from_x, from_y, to_x, to_y)  from_x: int, from_y: int, to_x: int, to_y: int  Drag from one position to another
scroll(direction, amount)         direction: "up" | "down", amount: int           Scroll in given direction
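
The mouse methods take integer pixel coordinates. If an agent picks a point on a resized copy of the screenshot, that point has to be mapped back to screen pixels before it is clicked. The sketch below shows one way to do that; scale_point is a hypothetical helper, and it assumes the session expects coordinates in the pixel space of the full-size screenshot, which this page does not state explicitly.

```python
def scale_point(
    x: float,
    y: float,
    from_size: tuple[int, int],
    to_size: tuple[int, int],
) -> tuple[int, int]:
    """Map (x, y) from an image of from_size to a screen of to_size."""
    from_w, from_h = from_size
    to_w, to_h = to_size
    sx = round(x * to_w / from_w)
    sy = round(y * to_h / from_h)
    # Clamp so the result is always a valid pixel index on the target screen.
    return min(max(sx, 0), to_w - 1), min(max(sy, 0), to_h - 1)


# A point chosen on a 640x480 thumbnail of a 1920x1080 screen
print(scale_point(320, 240, (640, 480), (1920, 1080)))  # (960, 540)

# With a real session the result would be passed on, for example:
#   await session.click(*scale_point(320, 240, (640, 480), (1920, 1080)))
```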

Keyboard Actions

# Type text
await session.type("Hello, world!")

# Press single key
await session.key("Enter")
await session.key("Escape")
await session.key("Tab")

# Key combinations (hotkeys)
await session.hotkey(["ctrl", "c"])     # Copy
await session.hotkey(["ctrl", "v"])     # Paste
await session.hotkey(["ctrl", "shift", "t"])  # Multiple modifiers

Method        Parameters       Description
------------  ---------------  -----------
type(text)    text: str        Type a string of text
key(key)      key: str         Press a single key (e.g., "Enter", "Escape", "Tab")
hotkey(keys)  keys: List[str]  Press a key combination (e.g., ["ctrl", "c"])
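
hotkey() takes a list of key names, while shortcuts are often written as a single string such as "Ctrl+Shift+T". The sketch below converts one form to the other. parse_hotkey is a hypothetical helper, and lowercasing the names is an assumption based on the lowercase examples above; this page does not say whether key names are case-sensitive.

```python
def parse_hotkey(combo: str) -> list[str]:
    """Split a 'ctrl+shift+t' style string into a list of key names."""
    keys = [part.strip().lower() for part in combo.split("+")]
    if not all(keys):
        # An empty part means a stray or doubled '+', e.g. "ctrl++c".
        raise ValueError(f"malformed key combination: {combo!r}")
    return keys


print(parse_hotkey("Ctrl+Shift+T"))  # ['ctrl', 'shift', 't']

# With a real session the result would be passed on, for example:
#   await session.hotkey(parse_hotkey("ctrl+c"))
```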

Window Management

# Launch a window
pid = await session.launch_window(
    url="https://example.com",
    title="My Window",
    width=800,
    height=600
)

# Execute JavaScript in a window
result = await session.execute_javascript(pid, "document.title")

# Get element position within a window
rect = await session.get_element_rect(pid, ".button-class")
# Returns: {"x": 10, "y": 20, "width": 100, "height": 30}

# Close all windows
await session.close_all_windows()

Method                                    Parameters                           Return Type  Description
----------------------------------------  -----------------------------------  -----------  -----------
launch_window(url?, html?, folder?, ...)  Various options for window creation  int | str    Launch a window and return its process ID
execute_javascript(pid, js)               pid: int | str, js: str              Any          Execute JavaScript in a specific window
get_element_rect(pid, selector)           pid: int | str, selector: str        dict | None  Get element rectangle in window or screen space
close_all_windows()                       -                                    None         Close or clear all open windows
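
A common follow-up to get_element_rect() is clicking the middle of the element. The sketch below computes that point from a rectangle in the dict shape shown above and passes None through, since the method may return None. rect_center is a hypothetical helper. It also assumes the rectangle is already in screen space; if it is relative to the window, the window's x and y would need to be added first.

```python
from typing import Optional


def rect_center(rect: Optional[dict]) -> Optional[tuple[int, int]]:
    """Return the integer center point of a rect dict, or None if rect is None."""
    if rect is None:
        return None
    cx = int(rect["x"] + rect["width"] / 2)
    cy = int(rect["y"] + rect["height"] / 2)
    return cx, cy


rect = {"x": 10, "y": 20, "width": 100, "height": 30}
print(rect_center(rect))  # (60, 35)

# With a real session this could be used as, for example:
#   center = rect_center(await session.get_element_rect(pid, ".button-class"))
#   if center is not None:
#       await session.click(*center)
```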

Action System

For more advanced use cases, you can use the action dataclasses directly:

from cua_bench.types import (
    ClickAction,
    RightClickAction,
    DoubleClickAction,
    MiddleClickAction,
    DragAction,
    MoveToAction,
    ScrollAction,
    TypeAction,
    KeyAction,
    HotkeyAction,
    WaitAction,
    DoneAction,
)

# Execute an action directly
await session.execute_action(ClickAction(x=100, y=200))
await session.execute_action(TypeAction(text="Hello"))
await session.execute_action(HotkeyAction(keys=["ctrl", "c"]))

Available Action Types:

Action Class       Fields                                                           Description
-----------------  ---------------------------------------------------------------  -----------
ClickAction        x: int, y: int                                                   Left click at position
RightClickAction   x: int, y: int                                                   Right click at position
DoubleClickAction  x: int, y: int                                                   Double click at position
MiddleClickAction  x: int, y: int                                                   Middle click at position
DragAction         from_x: int, from_y: int, to_x: int, to_y: int, duration: float  Drag gesture
MoveToAction       x: int, y: int, duration: float                                  Move cursor
ScrollAction       direction: "up" | "down", amount: int                            Scroll action
TypeAction         text: str                                                        Type text
KeyAction          key: str                                                         Press single key
HotkeyAction       keys: List[str]                                                  Key combination
WaitAction         seconds: float                                                   Wait/pause
DoneAction         -                                                                Signal completion
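
To make the relationship between action dataclasses and session methods concrete, here is a minimal sketch of how an execute_action() call could dispatch on the action type. It is an illustration only, not cua-bench's implementation: the three dataclasses are local stand-ins that mirror the fields in the table, and RecordingSession is a hypothetical class that logs calls instead of driving a desktop.

```python
import asyncio
from dataclasses import dataclass
from typing import List


# Local stand-ins mirroring three action types from the table above.
@dataclass
class ClickAction:
    x: int
    y: int


@dataclass
class TypeAction:
    text: str


@dataclass
class HotkeyAction:
    keys: List[str]


class RecordingSession:
    """Hypothetical session that records calls instead of performing them."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    async def click(self, x: int, y: int) -> None:
        self.calls.append(("click", x, y))

    async def type(self, text: str) -> None:
        self.calls.append(("type", text))

    async def hotkey(self, keys: List[str]) -> None:
        self.calls.append(("hotkey", tuple(keys)))

    async def execute_action(self, action: object) -> None:
        # Assumed behavior: route each action to the matching method.
        if isinstance(action, ClickAction):
            await self.click(action.x, action.y)
        elif isinstance(action, TypeAction):
            await self.type(action.text)
        elif isinstance(action, HotkeyAction):
            await self.hotkey(action.keys)
        else:
            raise TypeError(f"unsupported action: {type(action).__name__}")


async def run(actions: list) -> list[tuple]:
    session = RecordingSession()
    for action in actions:
        await session.execute_action(action)
    return session.calls


calls = asyncio.run(
    run([ClickAction(x=100, y=200), TypeAction(text="Hello"), HotkeyAction(keys=["ctrl", "c"])])
)
print(calls)
```

Representing actions as data makes a trajectory easy to log, serialize, and replay, which is one reason to prefer execute_action() over direct method calls in agent loops.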

Provider Types

cua-bench supports two provider types:

Simulated Desktop (Playwright-based)

  • Name: "simulated" or "webtop"
  • Description: HTML/CSS desktop simulation in headless browser
  • Pros: Fast, no Docker required
  • Use cases: Web app testing, UI benchmarks

from cua_bench.computers.base import get_session

SessionClass = get_session("simulated")
session = SessionClass(env=env)

Native Desktop (Docker/QEMU-based)

  • Name: "native" or "computer"
  • Description: Real OS in Docker/QEMU container
  • Pros: Actual desktop environment with real applications
  • Requires: Docker
  • Use cases: Real app testing, OS-level tasks

from cua_bench.computers.base import get_session

SessionClass = get_session("native")
session = SessionClass(env=env)
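
Each provider above is reachable under two names. How get_session() resolves those names internally is not described here; one common way to support aliases is a small registry, sketched below. Everything in this block is hypothetical (register, resolve_provider, and the two empty classes) and only illustrates the name-to-class lookup pattern.

```python
from typing import Callable, Dict

# Maps every accepted provider name (including aliases) to a session class.
_REGISTRY: Dict[str, type] = {}


def register(*names: str) -> Callable[[type], type]:
    """Class decorator that registers a session class under one or more names."""

    def decorator(cls: type) -> type:
        for name in names:
            _REGISTRY[name] = cls
        return cls

    return decorator


def resolve_provider(name: str) -> type:
    """Return the class registered for name, or raise a descriptive error."""
    try:
        return _REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"unknown provider {name!r}; expected one of: {known}") from None


@register("simulated", "webtop")
class SimulatedSession:
    """Placeholder for a Playwright-based session class."""


@register("native", "computer")
class NativeSession:
    """Placeholder for a Docker/QEMU-based session class."""


# Both names of a provider resolve to the same class.
print(resolve_provider("webtop") is resolve_provider("simulated"))  # True
```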

Data Types

Snapshot

from dataclasses import dataclass
from typing import List, Literal, Optional


@dataclass
class WindowSnapshot:
    window_type: Literal["webview", "process", "desktop"]
    pid: Optional[str] = None
    url: Optional[str] = None
    html: Optional[str] = None
    title: str = ""
    x: int = 0
    y: int = 0
    width: int = 0
    height: int = 0
    active: bool = False
    minimized: bool = False


@dataclass
class Snapshot:
    windows: List[WindowSnapshot]

The get_snapshot() method returns a Snapshot object containing information about all open windows, their positions, sizes, and states.
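
A typical use of a snapshot is finding the window that currently has focus. The sketch below repeats the two dataclass definitions so that it runs on its own; active_window is a hypothetical helper, and the sample windows are made-up data standing in for a real `await session.get_snapshot()` result.

```python
from dataclasses import dataclass
from typing import List, Literal, Optional


@dataclass
class WindowSnapshot:
    window_type: Literal["webview", "process", "desktop"]
    pid: Optional[str] = None
    url: Optional[str] = None
    html: Optional[str] = None
    title: str = ""
    x: int = 0
    y: int = 0
    width: int = 0
    height: int = 0
    active: bool = False
    minimized: bool = False


@dataclass
class Snapshot:
    windows: List[WindowSnapshot]


def active_window(snapshot: Snapshot) -> Optional[WindowSnapshot]:
    """Return the first active, non-minimized window, or None if there is none."""
    for window in snapshot.windows:
        if window.active and not window.minimized:
            return window
    return None


# Made-up data standing in for a real snapshot
snapshot = Snapshot(
    windows=[
        WindowSnapshot(window_type="desktop", title="Desktop"),
        WindowSnapshot(
            window_type="webview", pid="42", title="My Window",
            x=100, y=50, width=800, height=600, active=True,
        ),
    ]
)

focused = active_window(snapshot)
print(focused.title if focused else None)  # My Window
```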
