
Adapters

Running benchmarks across different operating systems and environments.

The Cua-Bench framework makes it easy to create adapters for existing computer-use benchmarks: an adapter implements the benchmark's setup and evaluator logic using the Python task specification.

Supported Benchmarks

| Benchmark | Image | Tasks | Status | Description |
|---|---|---|---|---|
| Windows Arena | windows-qemu | 173 | Available | Windows 11 desktop automation across 12 apps |
| OSWorld | linux-qemu | 369 | Coming soon | Linux desktop automation |
| WebArena | linux-docker | 812 | Coming soon | Web browsing tasks |
| AndroidWorld | android-qemu | 200+ | Coming soon | Android mobile automation |
| macOSWorld | macos-lume | 100+ | Coming soon | macOS desktop automation |

Example: Run Windows Arena

Windows Arena is a benchmark with 173 tasks across 12 Windows application domains.

Requirements

Windows Arena requires x86_64 Linux with KVM support (nested virtualization). Setup takes ~1-2 hours.
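
Before starting the long image build, it can help to confirm that the host matches the requirements above. The following is a minimal preflight sketch, not part of Cua-Bench: the function name `check_winarena_host` is hypothetical, and it only checks the architecture, the kernel name, and the presence of `/dev/kvm`. It does not verify that your user has permission to use KVM.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper (not part of the cb CLI).
# Prints one line per check and returns non-zero if any check fails.
check_winarena_host() {
    local ok=0
    local arch
    local os
    arch="$(uname -m)"
    os="$(uname -s)"

    if [ "$arch" = "x86_64" ]; then
        echo "arch: x86_64 (ok)"
    else
        echo "arch: $arch (expected x86_64)"
        ok=1
    fi

    if [ "$os" = "Linux" ]; then
        echo "os: Linux (ok)"
    else
        echo "os: $os (expected Linux)"
        ok=1
    fi

    # /dev/kvm is the device node exposed when the KVM kernel module is loaded.
    if [ -e /dev/kvm ]; then
        echo "kvm: /dev/kvm present (ok)"
    else
        echo "kvm: /dev/kvm missing"
        ok=1
    fi

    return "$ok"
}

check_winarena_host || echo "Host does not meet the stated requirements."
```

If the `kvm` line reports the device as missing inside a cloud VM, nested virtualization is probably not enabled for that instance.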

Step 1: Create Windows Base Image

# Download Windows ISO and create base image (~1-2 hours)
cb image create windows-qemu --download-iso

# With WinArena benchmark apps pre-installed (recommended)
cb image create windows-qemu --download-iso --winarena-apps

# Monitor progress (VNC at http://localhost:8006)
docker logs -f cua-setup-windows

Step 2: Verify Image (Optional)

# Start interactive shell to verify the image works
cb image shell windows-qemu

# Access VNC at http://localhost:8006
# Press Ctrl+C to stop when done verifying
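
When verifying from a second terminal, you may want to wait until the VNC web viewer at http://localhost:8006 actually responds before opening it. The sketch below polls the URL with `curl`. The helper name `wait_for_url` and its timeout defaults are assumptions made for illustration, and a successful response only shows that the viewer is reachable, not that Windows has finished booting.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the cb CLI): poll a URL until it responds.
# Usage: wait_for_url URL [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_url() {
    local url="$1"
    local timeout="${2:-60}"
    local interval="${3:-2}"
    local deadline
    deadline=$(( $(date +%s) + timeout ))

    while [ "$(date +%s)" -lt "$deadline" ]; do
        # --fail makes curl return non-zero on HTTP errors as well as
        # on connection failures.
        if curl --silent --fail --output /dev/null --max-time 5 "$url"; then
            echo "reachable: $url"
            return 0
        fi
        sleep "$interval"
    done

    echo "timed out after ${timeout}s: $url"
    return 1
}

# Example (run manually once `cb image shell windows-qemu` is starting):
#   wait_for_url http://localhost:8006 300 5
```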

Step 3: Run Tasks

# Run specific task with oracle
cb interact tasks/winarena_adapter --variant-id 0 --oracle

# Run with agent (2-container architecture)
cb run task tasks/winarena_adapter \
    --agent cua-agent \
    --model anthropic/claude-sonnet-4-20250514 \
    --image windows-qemu

# Run entire dataset in parallel
cb run dataset datasets/winarena \
    --agent cua-agent \
    --model anthropic/claude-sonnet-4-20250514 \
    --image windows-qemu \
    --max-parallel 4
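
To sanity-check several tasks with the oracle before spending model credits, the single-task oracle command above can be wrapped in a loop. This is a sketch under two assumptions that the guide does not confirm: that variant IDs are consecutive integers starting at 0, and that `cb interact ... --oracle` exits on its own with a non-zero status on failure. The function name `run_oracle_variants` and the `CB_BIN` override are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run the oracle on a range of variant IDs.
# Assumes variant IDs are consecutive integers (not confirmed by the guide).
# Set CB_BIN to substitute another command for dry runs, e.g. CB_BIN=echo.
run_oracle_variants() {
    local first="$1"
    local last="$2"
    local cb_bin="${CB_BIN:-cb}"
    local failed=0
    local id="$first"

    while [ "$id" -le "$last" ]; do
        if "$cb_bin" interact tasks/winarena_adapter --variant-id "$id" --oracle; then
            echo "variant $id: ok"
        else
            echo "variant $id: failed"
            failed=$((failed + 1))
        fi
        id=$((id + 1))
    done

    echo "failed variants: $failed"
    # Return success only if every variant succeeded.
    [ "$failed" -eq 0 ]
}

# Example: dry run that only prints the commands for variants 0 through 2.
#   CB_BIN=echo run_oracle_variants 0 2
```

Running with `CB_BIN=echo` first prints the exact commands without starting any containers, which is a cheap way to check the range before a real run.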
