Tasks

A Cua-Bench task is a Python module that defines a verifiable, cross-platform GUI task. Tasks use decorators to define configuration, setup, evaluation, and oracle solutions.

Task Structure

Every task consists of a main.py file with four key components:

1. Task Configuration (`@cb.tasks_config`)

Defines the list of task variants to generate:

import cua_bench as cb

@cb.tasks_config(split="train")
def load():
    return [
        cb.Task(
            description='Click the "Submit" button on the page.',
            metadata={
                "button_text": "Submit",
            },
            # Configure the computer platform
            computer={
                "provider": provider,
                "setup_config": {
                    "os_type": os_type,
                    "width": 1024,
                    "height": 768
                }
            }
        )
        for provider, os_type in [("native", "linux"), ("native", "windows"), ("simulated", "win11"), ("simulated", "macos")]
        # More task variants...
    ]

Each task variant can have different metadata, enabling thousands of unique scenarios from a single template.
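One way to get many variants from a single template is to enumerate the parameter combinations first and build one task per combination. The sketch below uses only the standard library. The helper `variant_metadata` and the parameter axes are illustrative assumptions, not part of Cua-Bench; each resulting dict could be passed as `metadata=` to `cb.Task` inside `load()`.

```python
from itertools import product

# Hypothetical parameter axes for a "click the button" template.
BUTTON_TEXTS = ["Submit", "Cancel", "Save"]
THEMES = ["win11", "macos"]


def variant_metadata() -> list[dict]:
    """Return one metadata dict per (button_text, theme) combination."""
    return [
        {"button_text": text, "theme": theme}
        for text, theme in product(BUTTON_TEXTS, THEMES)
    ]


variants = variant_metadata()
print(len(variants))  # 3 button texts x 2 themes = 6 variants
```

Adding a value to either axis multiplies the number of variants, so the list grows quickly without any change to the template itself.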

The native provider emulates Windows and Linux operating systems using QEMU/Docker, while the simulated provider offers lightweight Playwright-based desktop environments with themes such as win11 and macos. See Simulated Desktop for all available themes.

2. Setup (`@cb.setup_task`)

Initializes the environment for a specific task, installing any necessary apps or web UI onto the desktop:

from pathlib import Path

import cua_bench as cb

# Each task function (setup_task, evaluate_task, solve_task) runs in one thread/context per task_cfg
# All task functions must be async
pid = None  # shared with evaluate_task and solve_task

@cb.setup_task(split="train")
async def start(task_cfg: cb.Task, session: cb.DesktopSession | cb.MobileSession):
    global pid
    # Launch a window with HTML content, resolved relative to this main.py
    pid = await session.launch_window(
        html=(Path(__file__).parent / "gui" / "index.html").read_text(),
        title="My App",
        width=800,
        height=600
    )

Async Required

All task functions (setup, evaluate, solve) must be async functions, and session methods must be called with await.
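A forgotten `await` is easy to miss because the call still returns a truthy object. The sketch below shows this with a hypothetical `FakeSession` stand-in, which is not the real Cua-Bench session class: calling an async method without `await` produces a coroutine object, and only awaiting it yields the actual value.

```python
import asyncio
import inspect


class FakeSession:
    """Hypothetical stand-in for a session; not the real Cua-Bench class."""

    async def execute_javascript(self, pid: int, script: str):
        await asyncio.sleep(0)  # simulate an asynchronous round trip
        return True


async def check(session: FakeSession) -> tuple[bool, bool]:
    # Without await, the call only creates a coroutine object.
    pending = session.execute_javascript(1, "window.__submitted")
    got_coroutine = inspect.iscoroutine(pending)
    # Awaiting the coroutine runs it and yields the actual result.
    result = await pending
    return got_coroutine, result


outcome = asyncio.run(check(FakeSession()))
```

Because a coroutine object is truthy, a check such as `if submitted:` would pass even when the value was never fetched. Comparing with `is True` fails instead of silently passing.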

3. Evaluation (`@cb.evaluate_task`)

Checks if the task was completed successfully by observing the internal state of the apps:

@cb.evaluate_task(split="train")
async def evaluate(task_cfg: cb.Task, session: cb.DesktopSession | cb.MobileSession) -> list[float]:
    global pid
    if pid is None:
        return [0.0]

    # Check task completion
    submitted = await session.execute_javascript(pid, "window.__submitted")

    # Return reward: 1.0 for success, 0.0 for failure
    return [1.0] if submitted is True else [0.0]
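The mapping from observed state to reward can be kept in a small pure function, which makes it testable without a desktop session. This is a minimal sketch; `reward_from_flag` is a hypothetical helper name, and it mirrors the strict `is True` check used in the evaluator above.

```python
def reward_from_flag(submitted: object) -> list[float]:
    """Map the value read from window.__submitted to a reward list."""
    # Only the boolean True counts as success; None, "true", or 1 do not.
    return [1.0] if submitted is True else [0.0]


print(reward_from_flag(True))  # [1.0]
print(reward_from_flag(None))  # [0.0]
```

The identity check matters when the page has not set the flag yet (the script evaluates to an undefined value) or sets it to a string rather than a boolean.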

4. Oracle Solution (`@cb.solve_task`)

Provides a reference solution for trajectory generation:

@cb.solve_task(split="train")
async def solve(task_cfg: cb.Task, session: cb.DesktopSession | cb.MobileSession):
    global pid
    if pid is None:
        return

    # Execute the solution
    await session.click_element(pid, ".submit-button")

Environment API

Window Management

Cua-Bench provides a Python API for launching webviews and installing web apps, which makes it simple to add custom GUI applications to your tasks.

# Launch a webview window (async)
pid = await session.launch_window(
    html="<html>...</html>",
    title="Window Title",
    width=800,
    height=600
)

# Execute JavaScript (async)
result = await session.execute_javascript(pid, "document.title")

# Install a webapp shortcut
await session.install_app("Slack Clone", html="<html>...</html>")

Playwright-style Actions

Cua-Bench also provides a Playwright-like API, making it easy to script oracle solutions:

# Click an element by CSS selector (async)
await session.click_element(pid, ".button")
await session.click_element(pid, "#submit-btn")

# Type text
await session.type_text("Hello world")

# Press keys
await session.key("Enter")
await session.key("Control+c")

Low-level Actions

All actions from the agent or the oracle solution are performed via low-level keyboard and mouse actions. You can use execute_action to execute these actions directly:

# Perform low-level actions (async)
await session.execute_action(cb.ClickAction(x=64, y=128))
await session.execute_action(cb.MoveToAction(x=16, y=100, duration=0.5))
await session.execute_action(cb.KeyAction(key="Enter"))

# Take screenshot
screenshot = await session.screenshot()

Shell Actions

If your task setup requires installing binaries or packages, then you can use session.run_command to execute shell commands in the computer environment.

# Install firefox non-interactively (async); -y avoids blocking on the confirmation prompt
await session.run_command("sudo apt update && sudo apt install -y firefox")

Compatibility Warning

session.run_command is supported only by VMDesktopSession, not WebDesktopSession. Avoid this API unless the task cannot be built with the other environment APIs.

Structuring your task folders

All tasks must have a top-level main.py file. We recommend also including a gui/ directory for any HTML/CSS/JS files:

tasks/my_task/
├── main.py
└── gui/
    ├── index.html
    ├── style.css
    └── script.js

To reference these files in your setup function:

# Inside your async setup function (requires `from pathlib import Path`)
folder_path = Path(__file__).parent / "gui"
pid = await session.launch_window(folder=str(folder_path), title="My GUI", ...)
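Before running a task, it can help to check that the folder matches the layout described above. The following is a sketch using only the standard library; `missing_task_files` is a hypothetical helper, and treating `gui/index.html` as expected is an assumption based on the recommended layout (only `main.py` is strictly required).

```python
from pathlib import Path
import tempfile


def missing_task_files(task_dir: Path) -> list[str]:
    """Return the expected relative paths that are absent from task_dir."""
    expected = ["main.py", "gui/index.html"]  # assumed layout, see lead-in
    return [rel for rel in expected if not (task_dir / rel).is_file()]


# Demonstrate on a throwaway directory that mirrors tasks/my_task/.
with tempfile.TemporaryDirectory() as tmp:
    task_dir = Path(tmp) / "my_task"
    (task_dir / "gui").mkdir(parents=True)
    (task_dir / "main.py").write_text("# task module\n")
    before = missing_task_files(task_dir)  # index.html not created yet
    (task_dir / "gui" / "index.html").write_text("<html></html>")
    after = missing_task_files(task_dir)  # layout is now complete
```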

If you use a web framework such as React or Svelte, you can serve the dist/ folder, or you can start a development server and launch a window pointing to it:

import subprocess
from pathlib import Path

@cb.setup_task(split="train")
async def start(task_cfg: cb.Task, session: cb.DesktopSession | cb.MobileSession):
    folder_path = Path(__file__).parent / "gui"

    # Start framework app in the background
    _server_process = subprocess.Popen(["npm", "run", "dev"], cwd=folder_path)

    # Launch window pointing to the local server
    pid = await session.launch_window(
        url="http://localhost:3000",
        title="My React GUI",
        width=800,
        height=600
    )
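A development server usually needs a moment before it accepts connections, so a window launched right after `Popen` may open on a page that is not yet available. The sketch below polls the port with the standard library until it is reachable. `wait_for_port` is a hypothetical helper, not a Cua-Bench API, and the port to wait on depends on your framework's configuration.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until host:port accepts TCP connections or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.1)  # not listening yet; retry shortly
    return False
```

Inside an async setup function, the blocking poll can be moved off the event loop with `await asyncio.to_thread(wait_for_port, "localhost", 3000)` before launching the window.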

Running Tasks

# Run interactively
cb interact tasks/my_task --variant-id 0

# Run evaluations
cb run task tasks/my_task --model anthropic/claude-opus-4-5 --agent cua-agent

# Generate oracle trajectories
cb run task tasks/my_task --oracle
