The Cua-Bench framework makes it straightforward to write adapters for existing computer-use benchmarks: each adapter implements the benchmark's setup and evaluation logic using the Python task specification.
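
As a rough illustration of the adapter idea, the sketch below wraps records from an existing benchmark into a uniform object with a setup step and an evaluator. Every name here (`AdaptedTask`, `adapt`, `load_tasks`, the record fields `id`, `instruction`, `expected`, and the dictionary standing in for the environment) is hypothetical and does not reflect the actual Cua-Bench task specification; it only shows the shape of the work an adapter does.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical names throughout: this is NOT the Cua-Bench API, only a
# self-contained sketch of what "adapting" an existing benchmark means.
# A plain dict stands in for the environment state.


@dataclass
class AdaptedTask:
    task_id: str
    instruction: str
    setup: Callable[[Dict[str, str]], None]      # prepares the starting state
    evaluate: Callable[[Dict[str, str]], float]  # scores the final state (0.0 or 1.0)


def adapt(record: Dict[str, str]) -> AdaptedTask:
    """Turn one upstream benchmark record into an AdaptedTask."""

    def setup(env: Dict[str, str]) -> None:
        # Reset the state the upstream task expects to start from.
        env.clear()

    def evaluate(env: Dict[str, str]) -> float:
        # Compare the final state against the record's expected value.
        return 1.0 if env.get("result") == record["expected"] else 0.0

    return AdaptedTask(record["id"], record["instruction"], setup, evaluate)


def load_tasks(records: List[Dict[str, str]]) -> List[AdaptedTask]:
    """Adapt every record of an upstream benchmark."""
    return [adapt(r) for r in records]
```

In a real adapter the setup and evaluator would drive the desktop environment rather than a dictionary, but the division of labor (one setup function, one evaluator per task) is the part this sketch is meant to convey.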

Supported Benchmarks

Benchmark	Image	Tasks	Status	Description
Windows Arena	`windows-qemu`	173	Available	Windows 11 desktop automation across 12 apps
OSWorld	`linux-qemu`	369	Coming soon	Linux desktop automation
WebArena	`linux-docker`	812	Coming soon	Web browsing tasks
AndroidWorld	`android-qemu`	200+	Coming soon	Android mobile automation
macOSWorld	`macos-lume`	100+	Coming soon	macOS desktop automation

Example: Run Windows Arena

Windows Arena is a benchmark with 173 tasks across 12 Windows application domains.

Requirements

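Windows Arena requires an x86_64 Linux host with KVM support; if the host is itself a virtual machine, nested virtualization must be enabled. Setup (creating the Windows base image in Step 1) takes roughly 1-2 hours.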
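
Before starting a long image build, it can help to confirm the host meets these requirements. The following is a minimal sketch, not part of Cua-Bench: the helper names `cpu_has_virt_flags` and `check_host` are made up for illustration. It relies on standard Linux conventions, namely that `/proc/cpuinfo` lists the `vmx` (Intel) or `svm` (AMD) flag when hardware virtualization is visible, and that `/dev/kvm` exists when KVM is usable.

```python
import os
import platform
from typing import List


def cpu_has_virt_flags(cpuinfo_text: str) -> bool:
    """Return True if /proc/cpuinfo text lists Intel VT-x (vmx) or AMD-V (svm)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags = line.split(":", 1)[1].split()
            if "vmx" in flags or "svm" in flags:
                return True
    return False


def check_host(machine: str, kvm_device_exists: bool, cpuinfo_text: str) -> List[str]:
    """Return a list of problems; an empty list means the host looks suitable."""
    problems = []
    if machine != "x86_64":
        problems.append(f"unsupported architecture: {machine}")
    if not cpu_has_virt_flags(cpuinfo_text):
        problems.append("CPU virtualization flags (vmx/svm) not visible")
    if not kvm_device_exists:
        problems.append("/dev/kvm not found")
    return problems


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            cpuinfo = f.read()
    except OSError:
        cpuinfo = ""  # not a Linux host, or /proc is unavailable
    for problem in check_host(platform.machine(), os.path.exists("/dev/kvm"), cpuinfo):
        print("WARNING:", problem)
```

Inside a cloud VM, missing `vmx`/`svm` flags usually mean nested virtualization is not enabled for that instance.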

Step 1: Create Windows Base Image

# Download Windows ISO and create base image (~1-2 hours)
cb image create windows-qemu --download-iso

# Alternatively, create the image with the WinArena benchmark apps pre-installed (recommended)
cb image create windows-qemu --download-iso --winarena-apps

# Monitor setup progress in the container logs (VNC is available at http://localhost:8006)
docker logs -f cua-setup-windows
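
Because the build runs for a long time, a script may want to wait until the VNC endpoint on port 8006 accepts connections before prompting someone to open it. The helper below is a sketch using only the Python standard library; `wait_for_port` is a hypothetical name, and a successful TCP connection only shows that something is listening, not that Windows setup has finished.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0,
                  interval_s: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError (refused, timed out, ...) on failure.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For example, `wait_for_port("localhost", 8006, timeout_s=300)` returns `True` as soon as the port is reachable and `False` if it is still closed after five minutes.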

Step 2: Verify Image (Optional)

# Start interactive shell to verify the image works
cb image shell windows-qemu

# Access VNC at http://localhost:8006
# Press Ctrl+C to stop when done verifying

Step 3: Run Tasks

# Run a specific task variant with the oracle
cb interact tasks/winarena_adapter --variant-id 0 --oracle

# Run a task with an agent (two-container architecture)
cb run task tasks/winarena_adapter \
    --agent cua-agent \
    --model anthropic/claude-sonnet-4-20250514 \
    --image windows-qemu

# Run entire dataset in parallel
cb run dataset datasets/winarena \
    --agent cua-agent \
    --model anthropic/claude-sonnet-4-20250514 \
    --image windows-qemu \
    --max-parallel 4
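
To smoke-test several task variants with the oracle before spending agent budget, the oracle command can be generated per variant. This sketch only assembles the `cb interact` invocation shown above; the helper names are hypothetical, and it assumes variant IDs other than 0 exist and are consecutive integers, which this page does not confirm.

```python
import shlex
from typing import Iterable, List


def oracle_command(task_dir: str, variant_id: int) -> List[str]:
    """Build the argv for running one task variant with the oracle."""
    return ["cb", "interact", task_dir, "--variant-id", str(variant_id), "--oracle"]


def oracle_commands(task_dir: str, variant_ids: Iterable[int]) -> List[str]:
    """Return shell-quoted command lines, one per variant ID."""
    return [shlex.join(oracle_command(task_dir, v)) for v in variant_ids]


if __name__ == "__main__":
    # Assumption: variants 0, 1 and 2 exist for this task.
    for line in oracle_commands("tasks/winarena_adapter", range(3)):
        print(line)
```

The script prints the commands rather than running them; to execute one, the argv list from `oracle_command` can be passed to `subprocess.run(..., check=True)`.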
