
Creating Private Tasks

How to create a private set of tasks.

Cua-Bench is a broad set of tasks meant to test the general ability of an agent to perform complex tasks in a desktop environment.

However, many agents are designed for narrower use cases. Developers of such agents can still use the harness by creating a custom set of tasks tailored to their agent.

Creating a Custom Task

Using the Wizard

cb task create tasks/my_custom_task

This starts an interactive wizard that guides you through the task creation process.

Using AI Generation

Use Claude Code to scaffold your starter environment:

cb task generate "discord clone with channels and messages" --output tasks/discord_clone

Manual Creation

Create the task structure manually:

tasks/my_custom_task/
├── main.py           # Task implementation
├── gui/
│   └── index.html    # UI assets
└── pyproject.toml    # Optional metadata
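The layout above can also be created with a short script. The following is a minimal sketch, not part of the cb CLI: the scaffold_task helper, the stub main.py contents, and the HTML page are all assumptions made for illustration. The page is written so that it matches the example task in this guide, which clicks "#submit" and reads window.__submitted.

```python
from pathlib import Path

# Hypothetical minimal page (an assumption, not an official template):
# the button id and the window.__submitted flag mirror what the example
# task's solve and evaluate functions look for.
INDEX_HTML = """<!DOCTYPE html>
<html>
  <body>
    <button id="submit" onclick="window.__submitted = true">Submit</button>
  </body>
</html>
"""

# Placeholder contents for main.py; replace with a real task implementation.
MAIN_PY_STUB = '"""Custom task - fill in load, setup, evaluate, and solve."""\n'


def scaffold_task(task_dir: Path) -> list[Path]:
    """Create main.py and gui/index.html under task_dir.

    Returns the files that were written. Existing files are left untouched.
    """
    gui_dir = task_dir / "gui"
    gui_dir.mkdir(parents=True, exist_ok=True)
    files = {
        task_dir / "main.py": MAIN_PY_STUB,
        gui_dir / "index.html": INDEX_HTML,
    }
    written = []
    for path, content in files.items():
        if not path.exists():  # never overwrite existing work
            path.write_text(content, encoding="utf-8")
            written.append(path)
    return written


if __name__ == "__main__":
    for path in scaffold_task(Path("tasks/my_custom_task")):
        print(f"created {path}")
```

The optional pyproject.toml is omitted here because its expected fields are not described in this guide.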

Task Implementation

Here's a complete example task:

"""Custom task example - Click a button."""
import cua_bench as cb
from pathlib import Path

@cb.tasks_config(split="train")
def load():
    """Define task variants."""
    return [
        cb.Task(
            description='Click the "Submit" button to complete the task.',
            metadata={"button_id": "submit"},
            computer={
                "provider": "computer",
                "setup_config": {
                    "os_type": "linux",
                    "width": 1024,
                    "height": 768,
                }
            }
        )
    ]

pid = None

@cb.setup_task(split="train")
async def start(task_cfg, session: cb.DesktopSession):
    """Initialize the task environment."""
    global pid
    pid = await session.launch_window(
        html=(Path(__file__).parent / "gui/index.html").read_text('utf-8'),
        title="My Custom Task",
        width=400,
        height=300,
    )

@cb.evaluate_task(split="train")
async def evaluate(task_cfg, session: cb.DesktopSession) -> list[float]:
    """Check if the task was completed successfully."""
    global pid
    if pid is None:
        return [0.0]

    submitted = await session.execute_javascript(pid, "window.__submitted")
    return [1.0] if submitted is True else [0.0]

@cb.solve_task(split="train")
async def solve(task_cfg, session: cb.DesktopSession):
    """Oracle solution - click the button."""
    global pid
    if pid is None:
        return

    await session.click_element(pid, "#submit")

if __name__ == "__main__":
    cb.interact(__file__)

Async Required

All task functions (setup, evaluate, solve) must be async functions, and session methods must be called with await.
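To see the async pattern in isolation, the evaluation logic can be exercised without a desktop session. The sketch below assumes a hypothetical FakeSession class that stands in for cb.DesktopSession and only answers "window.<name>" lookups; it is not part of cua_bench and exists purely to show that the function must be awaited and how the [1.0] / [0.0] result is derived.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for cb.DesktopSession, for illustration only."""

    def __init__(self, page_state: dict):
        self.page_state = page_state

    async def execute_javascript(self, pid, expression: str):
        # Only resolves "window.<name>" lookups against the fake page state.
        return self.page_state.get(expression.removeprefix("window."))


pid = 1  # pretend a window was launched during setup


async def evaluate(task_cfg, session) -> list[float]:
    """Same shape as the example task's evaluate function."""
    if pid is None:
        return [0.0]
    # Forgetting `await` here would yield a coroutine object, never True.
    submitted = await session.execute_javascript(pid, "window.__submitted")
    return [1.0] if submitted is True else [0.0]


if __name__ == "__main__":
    print(asyncio.run(evaluate(None, FakeSession({"__submitted": True}))))  # [1.0]
    print(asyncio.run(evaluate(None, FakeSession({}))))  # [0.0]
```

Because the check is `submitted is True`, any other value (including a missing flag) scores 0.0.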

Running Custom Tasks

Test Locally

# Interactive mode - opens browser to view task
cb interact tasks/my_custom_task

# Run specific task variant
cb interact tasks/my_custom_task --variant-id 0

# Run oracle solution
cb interact tasks/my_custom_task --oracle

# Save screenshot
cb interact tasks/my_custom_task --oracle --screenshot output.png

Evaluate with Agent

# Run with agent
cb run task tasks/my_custom_task --agent cua-agent --model anthropic/claude-sonnet-4-20250514

# Validate oracle solutions
cb run task tasks/my_custom_task --oracle

Running a Custom Dataset

To run multiple tasks as a dataset:

# Run all tasks in a directory
cb run dataset tasks --agent cua-agent --model anthropic/claude-sonnet-4-20250514

# Run specific tasks
cb run dataset tasks/my_custom_task,tasks/another_task --agent cua-agent
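When the list of tasks grows, the comma-separated argument can be assembled by a small script instead of by hand. This is a sketch under two assumptions: a task directory is recognized by the presence of main.py (as in the manual layout shown earlier), and the helper names find_task_dirs and build_dataset_command are hypothetical. It only builds the command; it does not run it.

```python
from pathlib import Path


def find_task_dirs(root: Path) -> list[Path]:
    """Return the direct subdirectories of root that contain a main.py, sorted."""
    return sorted(p.parent for p in root.glob("*/main.py"))


def build_dataset_command(task_dirs: list[Path], agent: str) -> list[str]:
    """Build the argv for `cb run dataset` with a comma-separated task list."""
    task_arg = ",".join(p.as_posix() for p in task_dirs)
    return ["cb", "run", "dataset", task_arg, "--agent", agent]


if __name__ == "__main__":
    dirs = find_task_dirs(Path("tasks"))
    # Print the command for review; pass the list to subprocess.run to execute it.
    print(" ".join(build_dataset_command(dirs, "cua-agent")))
```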

Task Development Workflow

  1. Create task structure with cb task create or cb task generate
  2. Implement task logic in main.py with async functions
  3. Test interactively with cb interact tasks/my_task
  4. Validate oracle with cb interact tasks/my_task --oracle
  5. Confirm the evaluation with cb run task tasks/my_task --oracle; the oracle run should return [1.0]
  6. Run an agent against the task with cb run task tasks/my_task --agent cua-agent

Tips

  • Use a module-level pid variable (assigned in setup via global pid) to share the window reference between the setup, evaluate, and solve functions
  • Always check if pid is None in evaluate and solve functions
  • Use await session.execute_javascript(pid, "expression") to read app state
  • Use await session.click_element(pid, "selector") for reliable element clicks
  • Test with --oracle first to ensure your evaluation logic is correct
