SDK Reference
Session interface for interacting with desktop environments
The cua-bench Session interface provides methods for interacting with desktop environments, whether they're simulated (Playwright-based) or native (Docker/QEMU-based).
Quick Example
from cua_bench import make
# Load an environment
env = make("./my_task")
# Reset to start a task
screenshot, task_config = await env.reset(task_id=0)
# Get the session from the environment
session = env.session
# Take actions
await session.click(100, 200)
await session.type("Hello, world!")
await session.key("Enter")
# Take a screenshot
screenshot_bytes = await session.screenshot()
# Close when done
await env.close()
Session Interface
The DesktopSession protocol defines the interface for interacting with desktop environments.
Lifecycle Methods
from cua_bench.computers.base import get_session

# Async context manager (preferred)
async with get_session("native")(os_type="linux") as session:
    await session.screenshot()

# Manual lifecycle
session = get_session("native")(os_type="linux")
await session.start()
try:
    await session.screenshot()
finally:
    await session.close()

| Method | Description |
|---|---|
| start(config?, headless?) | Start the session and connect to the environment |
| close() | Close the session and clean up resources |
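The two lifecycle patterns above have the same effect if entering the context calls start() and leaving it calls close(). The sketch below illustrates that relationship with a hypothetical stand-in class, `StubSession`, which is not part of cua-bench; the mapping of `__aenter__`/`__aexit__` onto start()/close() is an assumption about how such a session could be built, not a description of the library's internals.

```python
import asyncio


class StubSession:
    """Hypothetical stand-in for a desktop session; not the cua-bench class."""

    def __init__(self):
        self.events = []

    async def start(self):
        self.events.append("start")

    async def close(self):
        self.events.append("close")

    async def screenshot(self):
        self.events.append("screenshot")
        return b""

    # Assumed mapping: entering the context starts the session,
    # leaving it (normally or via an exception) closes it.
    async def __aenter__(self):
        await self.start()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.close()
        return False  # do not swallow exceptions


async def run_with_failure():
    session = StubSession()
    try:
        async with session:
            await session.screenshot()
            raise RuntimeError("simulated failure inside the block")
    except RuntimeError:
        pass
    return session.events


events = asyncio.run(run_with_failure())
print(events)
```

The recorded events show that close() still runs when the body raises, which is the guarantee the manual try/finally version provides by hand.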
Screenshot & State
# Capture screenshot
screenshot_bytes = await session.screenshot() # PNG bytes
# Get desktop state snapshot
snapshot = await session.get_snapshot()
# Returns: Snapshot(windows: List[WindowSnapshot])

| Method | Return Type | Description |
|---|---|---|
| screenshot() | bytes | Capture screen as PNG bytes |
| get_snapshot() | Snapshot | Get lightweight snapshot of desktop state (windows, geometry, metadata) |
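Because screenshot() returns raw PNG bytes, the image dimensions can be read from the header without an imaging library. The helper below is a minimal sketch that relies only on the PNG file layout (an 8-byte signature followed by the IHDR chunk); `png_size` is a hypothetical name, not a cua-bench function.

```python
import struct
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> Tuple[int, int]:
    """Return (width, height) read from PNG bytes, or raise ValueError."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type,
    # then width and height as big-endian unsigned 32-bit integers.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG image")
    if data[12:16] != b"IHDR":
        raise ValueError("missing IHDR chunk")
    width, height = struct.unpack(">II", data[16:24])
    return width, height
```

A typical use would be `width, height = png_size(await session.screenshot())`, for example to check that click coordinates fall inside the captured screen.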
Mouse Actions
# Click actions
await session.click(x, y) # Left click
await session.right_click(x, y) # Right click
await session.double_click(x, y) # Double click
await session.middle_click(x, y) # Middle click
# Movement and drag
await session.move_to(x, y) # Move cursor
await session.drag(from_x, from_y, to_x, to_y) # Drag gesture
# Scroll
await session.scroll(direction="down", amount=100)  # Scroll up or down

| Method | Parameters | Description |
|---|---|---|
| click(x, y) | x: int, y: int | Left click at coordinates |
| right_click(x, y) | x: int, y: int | Right click at coordinates |
| double_click(x, y) | x: int, y: int | Double click at coordinates |
| middle_click(x, y) | x: int, y: int | Middle click at coordinates |
| move_to(x, y) | x: int, y: int | Move cursor to coordinates |
| drag(from_x, from_y, to_x, to_y) | from_x: int, from_y: int, to_x: int, to_y: int | Drag from one position to another |
| scroll(direction, amount) | direction: "up" \| "down", amount: int | Scroll in given direction |
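The mouse methods take integer pixel coordinates. When coordinates come from a resized copy of the screenshot (a common situation when an agent works on a downscaled image), they need to be mapped back to screen pixels first. The helper below is a sketch of that conversion; `scale_point` and its parameters are hypothetical names, and the source does not say whether cua-bench does any scaling itself.

```python
from typing import Tuple


def scale_point(
    x: float, y: float, src: Tuple[int, int], dst: Tuple[int, int]
) -> Tuple[int, int]:
    """Map a point from a src-sized image to a dst-sized screen, clamped to bounds."""
    src_w, src_h = src
    dst_w, dst_h = dst
    sx = round(x * dst_w / src_w)
    sy = round(y * dst_h / src_h)
    # Clamp so the result is always a valid pixel index on the target screen.
    sx = min(max(sx, 0), dst_w - 1)
    sy = min(max(sy, 0), dst_h - 1)
    return sx, sy
```

The result can be passed directly to a click, for example `await session.click(*scale_point(50, 100, (400, 300), (800, 600)))`.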
Keyboard Actions
# Type text
await session.type("Hello, world!")
# Press single key
await session.key("Enter")
await session.key("Escape")
await session.key("Tab")
# Key combinations (hotkeys)
await session.hotkey(["ctrl", "c"]) # Copy
await session.hotkey(["ctrl", "v"]) # Paste
await session.hotkey(["ctrl", "shift", "t"])  # Multiple modifiers

| Method | Parameters | Description |
|---|---|---|
| type(text) | text: str | Type a string of text |
| key(key) | key: str | Press a single key (e.g., "Enter", "Escape", "Tab") |
| hotkey(keys) | keys: List[str] | Press a key combination (e.g., ["ctrl", "c"]) |
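Since hotkey() expects a list of key names, shortcuts written in the usual "Ctrl+Shift+T" form have to be split first. The sketch below shows one way to do that; `parse_hotkey` is a hypothetical helper, and lowercasing the names simply follows the examples above (the source does not state which spellings hotkey() accepts).

```python
from typing import List


def parse_hotkey(combo: str) -> List[str]:
    """Split a shortcut such as 'Ctrl+Shift+T' into ['ctrl', 'shift', 't']."""
    keys = [part.strip().lower() for part in combo.split("+")]
    # An empty part means the string had a stray or doubled '+'.
    if not all(keys):
        raise ValueError(f"empty key in combination: {combo!r}")
    return keys
```

For example, `await session.hotkey(parse_hotkey("Ctrl+C"))`. Note that this format cannot express a combination that includes the literal "+" key.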
Window Management
# Launch a window
pid = await session.launch_window(
    url="https://example.com",
    title="My Window",
    width=800,
    height=600,
)
# Execute JavaScript in a window
result = await session.execute_javascript(pid, "document.title")
# Get element position within a window
rect = await session.get_element_rect(pid, ".button-class")
# Returns: {"x": 10, "y": 20, "width": 100, "height": 30}
# Close all windows
await session.close_all_windows()

| Method | Parameters | Return Type | Description |
|---|---|---|---|
| launch_window(url?, html?, folder?, ...) | Various options for window creation | int \| str | Launch a window and return its process ID |
| execute_javascript(pid, js) | pid: int \| str, js: str | Any | Execute JavaScript in a specific window |
| get_element_rect(pid, selector) | pid: int \| str, selector: str | dict \| None | Get element rectangle in window or screen space |
| close_all_windows() | - | None | Close or clear all open windows |
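A common follow-up to get_element_rect() is clicking the middle of the element. The sketch below computes that point from the rectangle shape shown above and handles the None case; `rect_center` is a hypothetical helper. Because the rectangle may be in window or screen space, it is the caller's job to confirm the coordinates are in the space that click() expects.

```python
from typing import Optional, Tuple


def rect_center(rect: Optional[dict]) -> Optional[Tuple[int, int]]:
    """Return the integer center of a rect dict, or None if no rect was found."""
    if rect is None:
        return None
    cx = rect["x"] + rect["width"] // 2
    cy = rect["y"] + rect["height"] // 2
    # int() covers the case where the rect values arrive as floats.
    return int(cx), int(cy)
```

With the example rectangle `{"x": 10, "y": 20, "width": 100, "height": 30}` this yields `(60, 35)`.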
Action System
For more advanced use cases, you can use the action dataclasses directly:
from cua_bench.types import (
    ClickAction,
    RightClickAction,
    DoubleClickAction,
    MiddleClickAction,
    DragAction,
    MoveToAction,
    ScrollAction,
    TypeAction,
    KeyAction,
    HotkeyAction,
    WaitAction,
    DoneAction,
)
# Execute an action directly
await session.execute_action(ClickAction(x=100, y=200))
await session.execute_action(TypeAction(text="Hello"))
await session.execute_action(HotkeyAction(keys=["ctrl", "c"]))

Available Action Types:

| Action Class | Fields | Description |
|---|---|---|
| ClickAction | x: int, y: int | Left click at position |
| RightClickAction | x: int, y: int | Right click at position |
| DoubleClickAction | x: int, y: int | Double click at position |
| MiddleClickAction | x: int, y: int | Middle click at position |
| DragAction | from_x: int, from_y: int, to_x: int, to_y: int, duration: float | Drag gesture |
| MoveToAction | x: int, y: int, duration: float | Move cursor |
| ScrollAction | direction: "up" \| "down", amount: int | Scroll action |
| TypeAction | text: str | Type text |
| KeyAction | key: str | Press single key |
| HotkeyAction | keys: List[str] | Key combination |
| WaitAction | seconds: float | Wait/pause |
| DoneAction | - | Signal completion |
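Treating actions as data makes it possible to record, replay, and unit-test agent behavior without a running desktop. The sketch below replays a list of actions against a recording test double. The dataclasses are stand-ins that mirror the documented fields (the real ones live in cua_bench.types), `RecordingSession` and `replay` are hypothetical, and the dispatch inside execute_action is an assumption about how actions could map to session methods.

```python
import asyncio
from dataclasses import dataclass
from typing import List


# Stand-ins mirroring the documented fields; not the cua_bench.types classes.
@dataclass
class ClickAction:
    x: int
    y: int


@dataclass
class TypeAction:
    text: str


@dataclass
class HotkeyAction:
    keys: List[str]


class RecordingSession:
    """Hypothetical test double that records calls instead of driving a desktop."""

    def __init__(self):
        self.calls = []

    async def click(self, x, y):
        self.calls.append(("click", x, y))

    async def type(self, text):
        self.calls.append(("type", text))

    async def hotkey(self, keys):
        self.calls.append(("hotkey", tuple(keys)))

    async def execute_action(self, action):
        # Assumed dispatch: each action class maps to the matching method.
        if isinstance(action, ClickAction):
            await self.click(action.x, action.y)
        elif isinstance(action, TypeAction):
            await self.type(action.text)
        elif isinstance(action, HotkeyAction):
            await self.hotkey(action.keys)
        else:
            raise TypeError(f"unsupported action: {action!r}")


async def replay(session, actions):
    """Run each action in order through the session."""
    for action in actions:
        await session.execute_action(action)


recorder = RecordingSession()
asyncio.run(replay(recorder, [
    ClickAction(x=100, y=200),
    TypeAction(text="Hello"),
    HotkeyAction(keys=["ctrl", "c"]),
]))
print(recorder.calls)
```

The same `replay` function would work with a real session, since it only depends on execute_action().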
Provider Types
cua-bench supports two provider types:
Simulated Desktop (Playwright-based)
- Name: "simulated" or "webtop"
- Description: HTML/CSS desktop simulation in a headless browser
- Pros: Fast, no Docker required
- Use cases: Web app testing, UI benchmarks
from cua_bench.computers.base import get_session
SessionClass = get_session("simulated")
session = SessionClass(env=env)
Native Desktop (Docker/QEMU-based)
- Name: "native" or "computer"
- Description: Real OS in a Docker/QEMU container
- Pros: Actual desktop environment with real applications
- Requires: Docker
- Use cases: Real app testing, OS-level tasks
from cua_bench.computers.base import get_session
SessionClass = get_session("native")
session = SessionClass(env=env)
Data Types
Snapshot
from dataclasses import dataclass
from typing import List, Literal, Optional

@dataclass
class WindowSnapshot:
    window_type: Literal["webview", "process", "desktop"]
    pid: Optional[str] = None
    url: Optional[str] = None
    html: Optional[str] = None
    title: str = ""
    x: int = 0
    y: int = 0
    width: int = 0
    height: int = 0
    active: bool = False
    minimized: bool = False

@dataclass
class Snapshot:
    windows: List[WindowSnapshot]

The get_snapshot() method returns a Snapshot object containing information about all open windows, their positions, sizes, and states.
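A snapshot is most useful when queried for a specific window. The sketch below shows two small queries over the structure above: finding the active window and listing the titles of windows that are not minimized. The dataclasses here are trimmed stand-ins with a subset of the documented fields, and `active_window` and `visible_titles` are hypothetical helpers. The source does not say whether more than one window can be flagged active, so the first match is returned.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Trimmed stand-ins with a subset of the fields documented above.
@dataclass
class WindowSnapshot:
    title: str = ""
    active: bool = False
    minimized: bool = False


@dataclass
class Snapshot:
    windows: List[WindowSnapshot] = field(default_factory=list)


def active_window(snapshot: Snapshot) -> Optional[WindowSnapshot]:
    """Return the first window flagged active, or None if there is none."""
    return next((w for w in snapshot.windows if w.active), None)


def visible_titles(snapshot: Snapshot) -> List[str]:
    """Return the titles of all windows that are not minimized."""
    return [w.title for w in snapshot.windows if not w.minimized]
```

With a real session, the input would come from `await session.get_snapshot()`.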