Adapters
Running benchmarks across different operating systems and environments.
Cua-Bench lets you write adapters for existing computer-use benchmarks by implementing each benchmark's setup and evaluators with the Python task specification.
Supported Benchmarks
| Benchmark | Image | Tasks | Status | Description |
|---|---|---|---|---|
| Windows Arena | windows-qemu | 173 | Available | Windows 11 desktop automation across 12 apps |
| OSWorld | linux-qemu | 369 | Coming soon | Linux desktop automation |
| WebArena | linux-docker | 812 | Coming soon | Web browsing tasks |
| AndroidWorld | android-qemu | 200+ | Coming soon | Android mobile automation |
| macOSWorld | macos-lume | 100+ | Coming soon | macOS desktop automation |
Example: Run Windows Arena
Windows Arena is a benchmark with 173 tasks across 12 Windows application domains.
Requirements
Windows Arena requires an x86_64 Linux host with KVM support (nested virtualization). Setup takes about 1-2 hours, mostly the base image build in Step 1.
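Since the image build is long, it may be worth confirming the host meets these requirements first. The following is a minimal preflight sketch using only standard Linux facilities (`uname`, `/proc/cpuinfo`, `/dev/kvm`). The function names are hypothetical helpers and not part of the `cb` CLI, and the checks are an assumption about what "KVM support" needs in practice.

```shell
# Preflight sketch for a Windows Arena host (hypothetical helpers, not part of cb).

# True when the machine type (default: output of `uname -m`) is x86_64.
is_x86_64() {
  [ "${1:-$(uname -m)}" = "x86_64" ]
}

# True when the CPU flags list Intel VT-x (vmx) or AMD-V (svm).
# Inside a VM these flags appear only if nested virtualization is exposed.
has_virt_flags() {
  grep -Eqw 'vmx|svm' "${1:-/proc/cpuinfo}"
}

# True when the KVM device exists and the current user can read and write it.
has_kvm() {
  [ -r "${1:-/dev/kvm}" ] && [ -w "${1:-/dev/kvm}" ]
}

# Runs all three checks and prints the first failure, if any.
preflight() {
  is_x86_64      || { echo "FAIL: need x86_64, found $(uname -m)"; return 1; }
  has_virt_flags || { echo "FAIL: no vmx/svm CPU flag (virtualization not exposed)"; return 1; }
  has_kvm        || { echo "FAIL: /dev/kvm missing or not accessible to this user"; return 1; }
  echo "OK: host looks ready for KVM-backed images"
}
```

After sourcing the file, run `preflight` on the target host. Each helper takes an optional argument (machine string, cpuinfo path, device path) so the checks can be exercised without real hardware.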
Step 1: Create Windows Base Image
# Download Windows ISO and create base image (~1-2 hours)
cb image create windows-qemu --download-iso
# With WinArena benchmark apps pre-installed (recommended)
cb image create windows-qemu --download-iso --winarena-apps
# Monitor progress (VNC at http://localhost:8006)
docker logs -f cua-setup-windows
Step 2: Verify Image (Optional)
# Start interactive shell to verify the image works
cb image shell windows-qemu
# Access VNC at http://localhost:8006
# Press Ctrl+C to stop when done verifying
Step 3: Run Tasks
# Run specific task with oracle
cb interact tasks/winarena_adapter --variant-id 0 --oracle
# Run with agent (2-container architecture)
cb run task tasks/winarena_adapter \
--agent cua-agent \
--model anthropic/claude-sonnet-4-20250514 \
--image windows-qemu
# Run entire dataset in parallel
cb run dataset datasets/winarena \
--agent cua-agent \
--model anthropic/claude-sonnet-4-20250514 \
--image windows-qemu \
--max-parallel 4
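Before launching the full dataset in parallel, it can help to check a handful of variants with the oracle. The sketch below wraps the `cb interact ... --oracle` command shown above in a loop. It assumes variant IDs are consecutive integers starting at 0 (only `--variant-id 0` appears here) and that the oracle run exits on its own. `oracle_sweep` and `DRY_RUN` are hypothetical names, not part of the `cb` CLI.

```shell
# Print (or run) the oracle command for variant IDs FIRST..LAST.
# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 executes them.
# Assumption: variant IDs are consecutive integers.
oracle_sweep() {
  first="$1"
  last="$2"
  task="${3:-tasks/winarena_adapter}"
  id="$first"
  while [ "$id" -le "$last" ]; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "cb interact $task --variant-id $id --oracle"
    else
      # Keep going after a failure so one bad variant does not hide the rest.
      cb interact "$task" --variant-id "$id" --oracle || echo "variant $id failed" >&2
    fi
    id=$((id + 1))
  done
}
```

For example, `oracle_sweep 0 4` prints the five commands for review, and `DRY_RUN=0 oracle_sweep 0 4` runs them one after another.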