Creating Private Tasks
How to create a private set of tasks.
Cua-Bench is a broad set of tasks meant to test the general ability of an agent to perform complex tasks in a desktop environment.
However, many agents are designed for narrower use cases. The harness supports these as well: agent developers can create a custom set of tasks for their agent.
Creating a Custom Task
Using the Wizard
cb task create tasks/my_custom_task
This starts an interactive wizard that guides you through the task creation process.
Using AI Generation
Use Claude Code to scaffold your starter environment:
cb task generate "discord clone with channels and messages" --output tasks/discord_clone
Manual Creation
Create the task structure manually:
tasks/my_custom_task/
├── main.py          # Task implementation
├── gui/
│   └── index.html   # UI assets
└── pyproject.toml   # Optional metadata
Task Implementation
Here's a complete example task:
"""Custom task example - Click a button."""
import cua_bench as cb
from pathlib import Path


@cb.tasks_config(split="train")
def load():
    """Define task variants."""
    return [
        cb.Task(
            description='Click the "Submit" button to complete the task.',
            metadata={"button_id": "submit"},
            computer={
                "provider": "computer",
                "setup_config": {
                    "os_type": "linux",
                    "width": 1024,
                    "height": 768,
                }
            }
        )
    ]


pid = None


@cb.setup_task(split="train")
async def start(task_cfg, session: cb.DesktopSession):
    """Initialize the task environment."""
    global pid
    pid = await session.launch_window(
        html=(Path(__file__).parent / "gui/index.html").read_text('utf-8'),
        title="My Custom Task",
        width=400,
        height=300,
    )


@cb.evaluate_task(split="train")
async def evaluate(task_cfg, session: cb.DesktopSession) -> list[float]:
    """Check if the task was completed successfully."""
    global pid
    if pid is None:
        return [0.0]
    submitted = await session.execute_javascript(pid, "window.__submitted")
    return [1.0] if submitted is True else [0.0]


@cb.solve_task(split="train")
async def solve(task_cfg, session: cb.DesktopSession):
    """Oracle solution - click the button."""
    global pid
    if pid is None:
        return
    await session.click_element(pid, "#submit")


if __name__ == "__main__":
    cb.interact(__file__)

Async Required
All task functions (setup, evaluate, solve) must be async functions, and session methods must be
called with await.
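The example task loads gui/index.html, clicks the #submit element, and reads window.__submitted, but the page itself is not shown. As a minimal sketch, assuming a page with a single button is enough, the script below writes a matching index.html into the layout from Manual Creation. The helper name scaffold_gui and the HTML markup are illustrative assumptions, not part of Cua-Bench.

```python
from pathlib import Path

# Hypothetical page: one button that sets the flag read by evaluate().
INDEX_HTML = """<!DOCTYPE html>
<html>
  <body>
    <button id="submit" onclick="window.__submitted = true">Submit</button>
    <script>
      // Start in the "not completed" state.
      window.__submitted = false;
    </script>
  </body>
</html>
"""


def scaffold_gui(task_dir: str) -> Path:
    """Create <task_dir>/gui/index.html and return its path (illustrative helper)."""
    gui_dir = Path(task_dir) / "gui"
    gui_dir.mkdir(parents=True, exist_ok=True)
    html_path = gui_dir / "index.html"
    html_path.write_text(INDEX_HTML, encoding="utf-8")
    return html_path
```

Calling scaffold_gui("tasks/my_custom_task") writes the file to the path that the example's start() function reads. Initializing the flag to false means evaluate() has a defined value to compare against before any click happens.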
Running Custom Tasks
Test Locally
# Interactive mode - opens browser to view task
cb interact tasks/my_custom_task
# Run specific task variant
cb interact tasks/my_custom_task --variant-id 0
# Run oracle solution
cb interact tasks/my_custom_task --oracle
# Save screenshot
cb interact tasks/my_custom_task --oracle --screenshot output.png
Evaluate with Agent
# Run with agent
cb run task tasks/my_custom_task --agent cua-agent --model anthropic/claude-sonnet-4-20250514
# Validate oracle solutions
cb run task tasks/my_custom_task --oracle
Running a Custom Dataset
To run multiple tasks as a dataset:
# Run all tasks in a directory
cb run dataset tasks --agent cua-agent --model anthropic/claude-sonnet-4-20250514
# Run specific tasks
cb run dataset tasks/my_custom_task,tasks/another_task --agent cua-agent
Task Development Workflow
- Create the task structure with cb task create or cb task generate
- Implement the task logic in main.py with async functions
- Test interactively with cb interact tasks/my_task
- Validate the oracle with cb interact tasks/my_task --oracle
- Run the evaluation with cb run task tasks/my_task --oracle to verify that it returns [1.0]
- Test with an agent using cb run task tasks/my_task --agent cua-agent
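The non-interactive steps of this workflow can be chained in a small script. The sketch below only assembles the cb commands listed above; the helper names workflow_commands and run_workflow are hypothetical, and run_workflow assumes that cb exits with a non-zero status on failure, which this page does not confirm.

```python
import subprocess


def workflow_commands(task_dir: str, agent: str = "cua-agent") -> list[list[str]]:
    """Return the validation commands from the workflow, in order."""
    return [
        # Validate the oracle solution.
        ["cb", "interact", task_dir, "--oracle"],
        # Run the evaluation; the oracle should yield [1.0].
        ["cb", "run", "task", task_dir, "--oracle"],
        # Test with an agent.
        ["cb", "run", "task", task_dir, "--agent", agent],
    ]


def run_workflow(task_dir: str, agent: str = "cua-agent") -> None:
    """Run each command in order, stopping at the first one that fails.

    Assumption: cb signals failure through its exit status.
    """
    for argv in workflow_commands(task_dir, agent):
        print("$", " ".join(argv))
        subprocess.run(argv, check=True)
```

Keeping command construction separate from execution makes it possible to inspect or log the commands without having cb installed.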
Tips
- Use global pid to share window references between setup, evaluate, and solve functions
- Always check if pid is None in evaluate and solve functions
- Use await session.execute_javascript(pid, "expression") to read app state
- Use await session.click_element(pid, "selector") for reliable element clicks
- Test with --oracle first to ensure your evaluation logic is correct
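The pid guard and the awaited session calls from these tips can be exercised without a desktop environment. The sketch below is a rough illustration only: FakeSession is a hypothetical stand-in for cb.DesktopSession that mirrors the two method names used in the example task, and evaluate takes pid as an argument instead of a global so the snippet stays self-contained.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for cb.DesktopSession; not part of Cua-Bench."""

    def __init__(self, state: dict):
        self.state = state

    async def execute_javascript(self, pid, expression: str):
        # Look the expression up in a dict instead of running JavaScript.
        return self.state.get(expression)

    async def click_element(self, pid, selector: str):
        # Pretend that clicking "#submit" sets the flag the page would set.
        if selector == "#submit":
            self.state["window.__submitted"] = True


async def evaluate(pid, session) -> list[float]:
    """Same logic as the example's evaluate(), with pid passed explicitly."""
    if pid is None:
        return [0.0]
    submitted = await session.execute_javascript(pid, "window.__submitted")
    return [1.0] if submitted is True else [0.0]


async def demo() -> tuple:
    session = FakeSession({"window.__submitted": False})
    before = await evaluate(1, session)        # nothing clicked yet
    await session.click_element(1, "#submit")  # what the oracle would do
    after = await evaluate(1, session)
    no_window = await evaluate(None, session)  # setup never ran
    return before, after, no_window


print(asyncio.run(demo()))  # ([0.0], [1.0], [0.0])
```

A stub like this checks only the scoring logic; running with --oracle remains the way to confirm that the real page and session behave as expected.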