Creating Private Tasks

Cua-Bench is a broad set of tasks meant to test the general ability of an agent to perform complex tasks in a desktop environment.

However, many agents are designed for more narrow use cases. The harness can still be leveraged by making it easy for agent developers to create a custom set of tasks for their agent.

Creating a Custom Task

Using the Wizard

cb task create tasks/my_custom_task

This starts an interactive wizard that guides you through the task creation process.

Using AI Generation

Use Claude Code to scaffold your starter environment:

cb task generate "discord clone with channels and messages" --output tasks/discord_clone

Manual Creation

Create the task structure manually:

tasks/my_custom_task/
├── main.py           # Task implementation
├── gui/
│   └── index.html    # UI assets
└── pyproject.toml    # Optional metadata

Task Implementation

Here's a complete example task:

"""Custom task example - Click a button."""
import cua_bench as cb
from pathlib import Path

@cb.tasks_config(split="train")
def load():
    """Define task variants."""
    return [
        cb.Task(
            description='Click the "Submit" button to complete the task.',
            metadata={"button_id": "submit"},
            computer={
                "provider": "computer",
                "setup_config": {
                    "os_type": "linux",
                    "width": 1024,
                    "height": 768,
                }
            }
        )
    ]

pid = None

@cb.setup_task(split="train")
async def start(task_cfg, session: cb.DesktopSession):
    """Initialize the task environment."""
    global pid
    pid = await session.launch_window(
        html=(Path(__file__).parent / "gui/index.html").read_text('utf-8'),
        title="My Custom Task",
        width=400,
        height=300,
    )

@cb.evaluate_task(split="train")
async def evaluate(task_cfg, session: cb.DesktopSession) -> list[float]:
    """Check if the task was completed successfully."""
    global pid
    if pid is None:
        return [0.0]

    submitted = await session.execute_javascript(pid, "window.__submitted")
    return [1.0] if submitted is True else [0.0]

@cb.solve_task(split="train")
async def solve(task_cfg, session: cb.DesktopSession):
    """Oracle solution - click the button."""
    global pid
    if pid is None:
        return

    await session.click_element(pid, "#submit")

if __name__ == "__main__":
    cb.interact(__file__)

Async Required

All task functions (setup, evaluate, solve) must be async functions, and session methods must be called with await.

Running Custom Tasks

Test Locally

# Interactive mode - opens browser to view task
cb interact tasks/my_custom_task

# Run specific task variant
cb interact tasks/my_custom_task --variant-id 0

# Run oracle solution
cb interact tasks/my_custom_task --oracle

# Save screenshot
cb interact tasks/my_custom_task --oracle --screenshot output.png

Evaluate with Agent

# Run with agent
cb run task tasks/my_custom_task --agent cua-agent --model anthropic/claude-sonnet-4-20250514

# Validate oracle solutions
cb run task tasks/my_custom_task --oracle

Running a Custom Dataset

To run multiple tasks as a dataset:

# Run all tasks in a directory
cb run dataset tasks --agent cua-agent --model anthropic/claude-sonnet-4-20250514

# Run specific tasks
cb run dataset tasks/my_custom_task,tasks/another_task --agent cua-agent

Task Development Workflow

Create task structure with cb task create or cb task generate
Implement task logic in main.py with async functions
Test interactively with cb interact tasks/my_task
Validate oracle with cb interact tasks/my_task --oracle
Run evaluation with cb run task tasks/my_task --oracle to verify evaluation returns [1.0]
Test with agent with cb run task tasks/my_task --agent cua-agent

Tips

Use global pid to share window references between setup, evaluate, and solve functions
Always check if pid is None in evaluate and solve functions
Use await session.execute_javascript(pid, "expression") to read app state
Use await session.click_element(pid, "selector") for reliable element clicks
Test with --oracle first to ensure your evaluation logic is correct

Was this page helpful?

Creating Private Tasks

On this page