Getting Started

First Steps

Running the benchmark and testing your own agent.

Prerequisites

Before running benchmarks, ensure you have:

  1. Installed the CLI
  2. Created at least one base image:
# Quick setup (Linux container, no KVM required)
cb image create linux-docker

# Check available images
cb image list

Example 1: Run a Single Task

Learn the basic workflow by running a single task end-to-end.

Step 1: Explore Interactively

Start by exploring the task interactively in a browser window to see how it works:

cb interact tasks/slack_env --variant-id 0

This opens a visible browser where you can manually interact with the task to understand what needs to be done.

Step 2: Run the Task with an Agent

Now run the same task with an AI agent. Tasks run asynchronously by default and return immediately with a run ID:

export ANTHROPIC_API_KEY=sk-....

# Run a single task (async - returns immediately)
cb run task tasks/slack_env --agent cua-agent --model anthropic/claude-sonnet-4-20250514

You'll see output like:

✓ Run started

  Run ID:  96d41b51
  Task:    slack_env (variant 0)
  Env:     simulated (Playwright)
  Output:  ~/.local/share/cua-bench/runs/96d41b51
  Session: ~/.local/share/cua-bench/runs/96d41b51/slack_env_v0

Commands:
  cb run watch 96d41b51   # Watch progress in real-time
  cb run info 96d41b51    # Show run details
  cb run logs 96d41b51    # View logs
  cb run stop 96d41b51    # Stop the run
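The run ID printed above is what every follow-up command takes, so for scripting it is handy to scrape it from the startup output. A minimal sketch — the awk pattern assumes the exact "Run ID:" line format shown above, which is human-oriented output and may change between versions:

```shell
# Extract the run ID from the startup banner. The heredoc reproduces the
# sample output above; in practice you would pipe `cb run task ...` in instead.
run_id=$(awk '/Run ID:/ {print $3}' <<'EOF'
✓ Run started

  Run ID:  96d41b51
EOF
)
echo "$run_id"
```

With the ID in a variable, `cb run watch "$run_id"` and friends can follow in the same script.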

Step 3: Watch Progress

Monitor the agent's progress in real-time:

cb run watch 96d41b51

You'll see live updates:

████████████████████ 1/1
Model: anthropic/claude-sonnet-4-20250514
Agent: cua-agent
Run ID: 96d41b51

SESSION ID            ENVIRONMENT      VARIANT    STATUS         REWARD
----------------      ------------     -------    -----------    ------
task-96d41b51         slack_env        0          ✓ completed    1.0

✓ All sessions completed!
Average Reward: 1.000

Step 4: View the Trace

Finally, explore the agent's trajectory using the trace viewer:

cb trace view 96d41b51

This launches an interactive viewer where you can see screenshots, actions, and debug the agent's behavior:

Serving trace viewer at: http://127.0.0.1:55115/
Press Enter to stop...

Open the URL in your browser to explore the full execution trace.


Example 2: Run a Basic Benchmark

Scale up to run an entire benchmark dataset with multiple tasks.

The Cua-Bench CLI can run many different computer-use benchmarks, including multiple versions of each. To see available datasets, run cb dataset list or visit the registry.

Step 1: Explore a Single Task

Start by exploring one task from the dataset interactively:

cb interact click-icon --dataset cua-bench-basic

Step 2: Validate with Oracle Solutions

Verify your environment setup by running the oracle solutions (reference implementations) for all tasks:

cb run dataset datasets/cua-bench-basic --oracle

Oracle solutions should achieve 100% success rate. If they fail, check your environment setup.
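In a CI script you can turn that check into an assertion by grepping the captured run summary for a perfect average. A sketch, assuming the summary format shown in Example 1 (the inline string stands in for the real captured output; this is not a stable machine interface):

```shell
# Assert that an oracle run ended with a perfect average reward.
# $summary stands in for a captured run summary (format as in Example 1).
summary='✓ All sessions completed!
Average Reward: 1.000'
if printf '%s\n' "$summary" | grep -q '^Average Reward: 1\.000$'; then
    echo "oracle run OK"
else
    echo "oracle run FAILED: check your environment setup" >&2
fi
```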

Step 3: Run the Entire Dataset in Parallel

Now run the full benchmark with an agent. Use --max-parallel to run multiple tasks simultaneously:

export ANTHROPIC_API_KEY=sk-....

# Run entire dataset in parallel
cb run dataset datasets/cua-bench-basic \
    --agent cua-agent \
    --model anthropic/claude-haiku-4-5 \
    --max-parallel 8 \
    --max-steps 10 \
    --max-variants 1

Monitor progress with:

cb run watch <run_id>

You should see output similar to:


    ⠀⣀⣀⡀⠀⠀⠀⠀⢀⣀⣀⣀⡀⠘⠋⢉⠙⣷⠀⠀ ⠀
 ⠀⠀⢀⣴⣿⡿⠋⣉⠁⣠⣾⣿⣿⣿⣿⡿⠿⣦⡈⠀⣿⡇⠃⠀
 ⠀⠀⠀⣽⣿⣧⠀⠃⢰⣿⣿⡏⠙⣿⠿⢧⣀⣼⣷⠀⡿⠃⠀⠀
 ⠀⠀⠀⠉⣿⣿⣦⠀⢿⣿⣿⣷⣾⡏⠀⠀⢹⣿⣿⠀⠀⠀⠀⠀⠀
 ⠀⠀⠀⠀⠀⠉⠛⠁⠈⠿⣿⣿⣿⣷⣄⣠⡼⠟⠁⠀cua-bench==v0.1.0
           toolkit for computer-use RL environments and benchmarks

████████████████████ 13/13
Model: anthropic/claude-haiku-4-5
Agent: cua-agent
Run ID: run-2768db7f



SESSION ID                                   ENVIRONMENT       VARIANT  STATUS           REWARD
----------------------------------------     ---------------   -------  ---------------  ----------
task-baecb22e                                typing-input      0        ✗ failed         0.0
task-05a6e068                                video-player      0        ✓ completed      1.0
task-a8faeb43                                spreadsheet-cell  0        ✗ failed         0.0
task-5a5f30c5                                toggle-switch     0        ✓ completed      1.0
task-3555ee20                                select-dropdown   0        ✗ failed         0.0
task-e9560151                                click-button      0        ✓ completed      1.0
task-7ec326f2                                fill-form         0        ✗ failed         0.0
task-57dae23d                                date-picker       0        ✗ failed         0.0
task-01276975                                drag-slider       0        ✗ failed         0.0
task-4b875cd2                                color-picker      0        ✓ completed      1.0
task-c933d5b0                                drag-drop         0        ✗ failed         0.0
task-ec44d643                                right-click-menu  0        ✗ failed         0.0
task-7c4acb4c                                click-icon        0        ✓ completed      1.0



✓ All sessions completed!
Average Reward: 0.385

REWARD  COUNT
-------------
1.0     5
0.0     8
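The average reward is simply the mean of the per-session rewards — here 5 successes out of 13 sessions. As a sanity check, you can recompute it from the reward column of the table above:

```shell
# Recompute the average from the 13 per-session rewards listed above
# (5 x 1.0 and 8 x 0.0); this should match the reported Average Reward.
printf '%s\n' 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 |
    awk '{ sum += $1; n++ } END { printf "Average Reward: %.3f\n", sum / n }'
```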

Results are saved to ~/.local/share/cua-bench/runs/<run_id>/ by default.


Managing Runs

View and manage your past runs:

# List all runs with statistics
cb run list

# Watch a run in real-time
cb run watch <run_id>

# Check status of a specific run
cb run info <run_id>

# View logs
cb run logs <run_id>

# Stop a running task
cb run stop <run_id>

# View traces from a specific run
cb trace view <run_id>
