First Steps
Running the benchmark and testing your own agent.
Prerequisites
Before running benchmarks, ensure you have:
- Installed the CLI
- Created at least one base image:
# Quick setup (Linux container, no KVM required)
cb image create linux-docker
# Check available images
cb image list
Example 1: Run a Single Task
Learn the basic workflow by running a single task end-to-end.
Step 1: Explore Interactively
Start by exploring the task interactively in a browser window to see how it works:
cb interact tasks/slack_env --variant-id 0
This opens a visible browser where you can manually interact with the task to understand what needs to be done.
Step 2: Run the Task with an Agent
Now run the same task with an AI agent. Tasks run asynchronously by default and return immediately with a run ID:
export ANTHROPIC_API_KEY=sk-....
# Run a single task (async - returns immediately)
cb run task tasks/slack_env --agent cua-agent --model anthropic/claude-sonnet-4-20250514
You'll see output like:
✓ Run started
Run ID: 96d41b51
Task: slack_env (variant 0)
Env: simulated (Playwright)
Output: ~/.local/share/cua-bench/runs/96d41b51
Session: ~/.local/share/cua-bench/runs/96d41b51/slack_env_v0
Commands:
cb run watch 96d41b51 # Watch progress in real-time
cb run info 96d41b51 # Show run details
cb run logs 96d41b51 # View logs
cb run stop 96d41b51 # Stop the run
Step 3: Watch Progress
Monitor the agent's progress in real-time:
cb run watch 96d41b51
You'll see live updates:
████████████████████ 1/1
Model: anthropic/claude-sonnet-4-20250514
Agent: cua-agent
Run ID: 96d41b51
SESSION ID ENVIRONMENT VARIANT STATUS REWARD
---------------- ------------ ------- ----------- ------
task-96d41b51 slack_env 0 ✓ completed 1.0
✓ All sessions completed!
Average Reward: 1.000
Step 4: View the Trace
Finally, explore the agent's trajectory using the trace viewer:
cb trace view 96d41b51
This launches an interactive viewer where you can inspect screenshots and actions and debug the agent's behavior:
Serving trace viewer at: http://127.0.0.1:55115/
Press Enter to stop...
Open the URL in your browser to explore the full execution trace.
Example 2: Run a Basic Benchmark
Scale up to run an entire benchmark dataset with multiple tasks.
The cua-bench CLI can run many different computer-use benchmarks, each in multiple versions. To see available datasets, run cb dataset list or visit the registry.
Step 1: Explore a Single Task
Start by exploring one task from the dataset interactively:
cb interact click-icon --dataset cua-bench-basic
Step 2: Validate with Oracle Solutions
Verify your environment setup by running the oracle solutions (reference implementations) for all tasks:
cb run dataset datasets/cua-bench-basic --oracle
Oracle solutions should achieve a 100% success rate. If they fail, check your environment setup.
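The pass/fail logic above can be sketched as a plain shell check: given the per-session rewards from an oracle run (the values shown in the watch table), every entry should be 1.0. The rewards variable below holds stand-in values for illustration, not real parsed output.

```shell
# Stand-in reward values for illustration; in a real run you would
# read these from the watch table or the run's output files.
rewards="1.0 1.0 1.0"

ok=true
for r in $rewards; do
  # Any reward below 1.0 means an oracle solution failed.
  [ "$r" = "1.0" ] || ok=false
done

if [ "$ok" = "true" ]; then
  echo "all oracle sessions passed"
else
  echo "oracle failure detected: check environment setup"
fi
```

If any session scores below 1.0, the environment (not the agent) is the first thing to debug.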
Step 3: Run the Entire Dataset in Parallel
Now run the full benchmark with an agent. Use --max-parallel to run multiple tasks simultaneously:
export ANTHROPIC_API_KEY=sk-....
# Run entire dataset in parallel
cb run dataset datasets/cua-bench-basic \
--agent cua-agent \
--model anthropic/claude-haiku-4-5 \
--max-parallel 8 \
--max-steps 10 \
--max-variants 1
Monitor progress with:
cb run watch <run_id>
You should see output similar to:
⠀⣀⣀⡀⠀⠀⠀⠀⢀⣀⣀⣀⡀⠘⠋⢉⠙⣷⠀⠀ ⠀
⠀⠀⢀⣴⣿⡿⠋⣉⠁⣠⣾⣿⣿⣿⣿⡿⠿⣦⡈⠀⣿⡇⠃⠀
⠀⠀⠀⣽⣿⣧⠀⠃⢰⣿⣿⡏⠙⣿⠿⢧⣀⣼⣷⠀⡿⠃⠀⠀
⠀⠀⠀⠉⣿⣿⣦⠀⢿⣿⣿⣷⣾⡏⠀⠀⢹⣿⣿⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠉⠛⠁⠈⠿⣿⣿⣿⣷⣄⣠⡼⠟⠁⠀cua-bench==v0.1.0
toolkit for computer-use RL environments and benchmarks
████████████████████ 13/13
Model: anthropic/claude-haiku-4-5
Agent: cua-agent
Run ID: run-2768db7f
SESSION ID ENVIRONMENT VARIANT STATUS REWARD
---------------------------------------- --------------- ------- --------------- ----------
task-baecb22e typing-input 0 ✗ failed 0.0
task-05a6e068 video-player 0 ✓ completed 1.0
task-a8faeb43 spreadsheet-cell 0 ✗ failed 0.0
task-5a5f30c5 toggle-switch 0 ✓ completed 1.0
task-3555ee20 select-dropdown 0 ✗ failed 0.0
task-e9560151 click-button 0 ✓ completed 1.0
task-7ec326f2 fill-form 0 ✗ failed 0.0
task-57dae23d date-picker 0 ✗ failed 0.0
task-01276975 drag-slider 0 ✗ failed 0.0
task-4b875cd2 color-picker 0 ✓ completed 1.0
task-c933d5b0 drag-drop 0 ✗ failed 0.0
task-ec44d643 right-click-menu 0 ✗ failed 0.0
task-7c4acb4c click-icon 0 ✓ completed 1.0
✓ All sessions completed!
Average Reward: 0.385
REWARD COUNT
-------------
1.0 5
0.0 8
Results are saved to ~/.local/share/cua-bench/runs/<run_id>/ by default.
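As a sanity check, the summary numbers above are consistent: five sessions scored 1.0 and eight scored 0.0, and the mean works out to 0.385. A small shell sketch reproducing that arithmetic, with the reward values copied from the table above:

```shell
# Per-session rewards in table order: 5 passes (1.0) and 8 failures (0.0).
rewards="0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0"

# Mean reward = 5 / 13, rounded to three decimals.
avg=$(printf '%s\n' $rewards | awk '{s += $1; n++} END {printf "%.3f", s / n}')
echo "average reward: $avg"   # average reward: 0.385
```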
Managing Runs
View and manage your past runs:
# List all runs with statistics
cb run list
# Watch a run in real-time
cb run watch <run_id>
# Check status of a specific run
cb run info <run_id>
# View logs
cb run logs <run_id>
# Stop a running task
cb run stop <run_id>
# View traces from a specific run
cb trace view <run_id>
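Because runs are plain directories under the default output location, you can also inspect them directly from the shell. A minimal sketch, assuming the default path shown in the examples above (the directory may not exist before your first run):

```shell
# Default run output location used throughout this guide.
RUNS_DIR="${HOME}/.local/share/cua-bench/runs"

if [ -d "$RUNS_DIR" ]; then
  # List run IDs, newest first.
  ls -1t "$RUNS_DIR"
else
  echo "no runs recorded yet in $RUNS_DIR"
fi
```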