
CLI Reference

Complete reference for all cua-bench CLI commands

Image Commands

Create and manage base images for environments.

cb image <action> [options]
Command                      Description
cb image create <platform>   Create a base image
cb image list                List all images
cb image info <name>         Show image details
cb image clone <src> <dst>   Clone/fork an image
cb image delete <name>       Delete an image
cb image shell <name>        Interactive shell into image (protected)

Create Options

cb image create <platform> [options]
Option              Description
--download-iso      Download Windows 11 ISO (~6 GB)
--iso <path>        Use existing Windows ISO
--name <name>       Image name (default: same as platform)
--winarena-apps     Install WinArena benchmark apps (Chrome, LibreOffice, VLC, etc.)
--memory <size>     Memory (default: 8G)
--cpus <count>      CPU cores (default: 8)
--disk <size>       Disk size (default: 64G)
--detach, -d        Run in background
--force             Force recreation
--skip-pull         Don't pull Docker image (use local)
--vnc-port <port>   VNC port (default: auto-allocate from 8006)
--api-port <port>   API port (default: auto-allocate from 5000)
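
As a rough illustration of how the create options combine, the sketch below wraps a background Windows image build in a shell function. The platform name windows-qemu is taken from the platform types listed under Task Options; the function name and the example image name are hypothetical, and the resource values simply restate the documented defaults.

```shell
# Sketch only: build a Windows base image in the background.
# Assumption: "windows-qemu" is accepted as <platform> (it appears as a
# platform type elsewhere in this reference). The function name is hypothetical.
create_windows_image() {
    # --name defaults to the platform name, so mirror that here.
    local name="${1:-windows-qemu}"
    cb image create windows-qemu \
        --download-iso \
        --name "$name" \
        --memory 8G \
        --cpus 8 \
        --disk 64G \
        --detach
}

# Usage: create_windows_image my-win11
```

Spelling out the memory, CPU, and disk values makes the image's resource footprint explicit in scripts even though they match the defaults.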

List Options

cb image list [options]
Option              Description
--format <fmt>      Output format: table, json
--platform <type>   Filter by platform type

Shell Options

Open an interactive shell in an image. By default, changes are written to an overlay so the golden image stays untouched.

cb image shell <name> [options]
Option              Description
--writable          Modify golden image directly (dangerous!)
--vnc-port <port>   VNC port (default: auto-allocate from 8006)
--api-port <port>   API port (default: auto-allocate from 5000)
--memory <size>     Memory (default: 8G)
--cpus <count>      CPU cores (default: 8)
--no-kvm            Disable KVM acceleration
--detach, -d        Run in background

Image Protection

By default, cb image shell mounts the golden image read-only and uses an overlay for any changes. This protects your base images from accidental modification. Use --writable only when you intentionally want to modify the golden image.
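
One way to keep --writable from being used by accident in scripts is to put an explicit opt-in in front of it. The sketch below is an assumption-laden wrapper, not part of cua-bench: the function name and the CB_ALLOW_WRITABLE environment variable are hypothetical and exist only in this example.

```shell
# Sketch only: open an image shell, requiring an explicit opt-in for --writable.
# CB_ALLOW_WRITABLE is a hypothetical variable invented for this example;
# cua-bench itself does not read it.
open_image_shell() {
    local name="$1"
    local mode="${2:-overlay}"

    if [ "$mode" = "writable" ]; then
        # --writable modifies the golden image directly, so refuse unless
        # the caller has confirmed the intent.
        if [ "${CB_ALLOW_WRITABLE:-0}" != "1" ]; then
            echo "refusing --writable: set CB_ALLOW_WRITABLE=1 to confirm" >&2
            return 1
        fi
        cb image shell "$name" --writable
    else
        # Default path: changes land in the overlay, not the golden image.
        cb image shell "$name"
    fi
}

# Usage:
#   open_image_shell windows-qemu
#   CB_ALLOW_WRITABLE=1 open_image_shell windows-qemu writable
```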

Platform Commands

View available platforms.

cb platform list [options]
Option           Description
--format <fmt>   Output format: table, json

Status Commands

Show system status.

cb status

Shows Docker status, KVM availability, running shells, and active runs.

Run Commands

Run tasks using the two-container architecture: one container for the agent and one for the environment.

cb run <subcommand> [options]
Command                 Description
cb run task <path>      Run a single task
cb run dataset <path>   Run all tasks in a dataset (parallel)
cb run list             List all runs with status
cb run status <id>      Show detailed run status
cb run watch <id>       Watch a run in real-time
cb run stop <id>        Stop/cancel a run
cb run logs <id>        View combined logs from a run or session

Task Options

Tasks run asynchronously by default and return immediately with a run ID. Use --wait for synchronous execution.

cb run task <path> [options]
Option                       Description
--wait, -w                   Wait for task completion (default: run async)
--debug                      Auto-allocate debug ports (VNC + API)
--variant-id <n>             Task variant index (default: 0)
--agent <name>               Agent to use (from .cua/agents.yaml)
--agent-import-path <path>   Import path for custom agent
--model <model>              Model to use (e.g., anthropic/claude-sonnet-4-20250514)
--oracle                     Run with oracle solution (no agent)
--max-steps <n>              Maximum steps per task (default: 100)
--image <name>               Use specific base image
--platform <type>            Platform type: linux-docker, windows-qemu, android-qemu
--vnc-port <port>            Expose VNC on host port (auto-allocated if not specified)
--api-port <port>            Expose API on host port (auto-allocated if not specified)
--output-dir <dir>           Output directory (default: ~/.local/share/cua-bench/runs/<run_id>/)
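
Because --wait makes each run synchronous, it can be combined with --variant-id and --oracle to check several variants of a task one after another. The sketch below assumes that variant indices are contiguous starting at 0 and that cb exits with a non-zero status on failure; neither is stated on this page, and the function name is hypothetical.

```shell
# Sketch only: run the oracle solution against the first COUNT variants of a task.
# Assumptions: variants are numbered 0..COUNT-1, and `cb run task --wait`
# returns non-zero when a run fails.
run_oracle_variants() {
    local task="$1"
    local count="$2"
    local i=0

    while [ "$i" -lt "$count" ]; do
        # --wait blocks until the run finishes, so variants run sequentially.
        cb run task "$task" --variant-id "$i" --oracle --wait || return 1
        i=$((i + 1))
    done
}

# Usage: run_oracle_variants tasks/click_button 3
```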

Dataset Options

cb run dataset <path> [options]
Option                    Description
--max-parallel <n>        Maximum parallel task runners (default: 4)
--max-variants <n>        Maximum variants per task (default: all)
--task-filter <pattern>   Filter tasks by name pattern (glob)
--agent <name>            Agent to use (from .cua/agents.yaml)
--model <model>           Model to use
--oracle                  Run with oracle solution (no agent)
--max-steps <n>           Maximum steps per task (default: 100)
--image <name>            Use specific base image
--platform <type>         Platform type
--output-dir <dir>        Output directory for results
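
A quick smoke test of a dataset can be put together from these options by running only the oracle solution on one variant per task. The sketch below is one possible arrangement; the function name and the timestamped results/ directory layout are hypothetical, and it is an assumption that cb creates the output directory if it does not exist.

```shell
# Sketch only: oracle-only smoke test over a filtered subset of a dataset.
# The default output location (results/smoke-<timestamp>) is invented for
# this example.
smoke_test_dataset() {
    local dataset="$1"
    local pattern="$2"   # glob over task names; quote it at the call site
    local out_dir="${3:-results/smoke-$(date +%Y%m%d-%H%M%S)}"

    cb run dataset "$dataset" \
        --oracle \
        --task-filter "$pattern" \
        --max-variants 1 \
        --max-parallel 4 \
        --output-dir "$out_dir"
}

# Usage: smoke_test_dataset datasets/cua-bench-basic "click*"
```

Quoting the pattern keeps the shell from expanding the glob against local files before cb sees it.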

Examples

# Run single task (async by default - returns immediately)
cb run task tasks/click_button --agent cua-agent --model anthropic/claude-sonnet-4-20250514

# Wait for task completion (synchronous)
cb run task tasks/click_button --agent cua-agent --wait

# Run with oracle solution
cb run task tasks/click_button --variant-id 0 --oracle --wait

# Auto-allocate debug ports for live viewing
cb run task tasks/click_button --agent cua-agent --debug
# Output shows: Auto-allocated debug ports: VNC=8006, API=5000

# Run Windows task with specific image
cb run task tasks/winarena --image windows-qemu --agent cua-agent

# Run all tasks in a dataset (4 parallel)
cb run dataset datasets/cua-bench-basic --agent cua-agent --max-parallel 4

# Run dataset with filtering
cb run dataset datasets/cua-bench-basic --task-filter "click*" --max-variants 1

# Monitor runs (async tasks)
cb run list                    # List all runs
cb run watch <run_id>          # Watch progress in real-time
cb run status <run_id>         # Check status
cb run logs <run_id>           # View logs
cb run logs <run_id> --tail 100  # View last 100 lines
cb run stop <run_id>           # Stop a run

Interact Commands

Run a task interactively with a visible browser.

cb interact <task> [options]
Option                  Description
--variant-id <id>       Task variant index (default: 0)
--dataset <name>        Dataset name to resolve task from registry
--dataset-path <path>   Path to dataset directory containing multiple tasks
--oracle                Run the solution after setup
--max-steps <n>         Maximum number of env.step() calls before stopping
--screenshot <path>     Save screenshot to file
--trace-out <path>      If set, start tracing and save dataset to this path on exit
--view                  Open a trace viewer when done
--no-wait               Skip the interactive prompt (useful for SSH/CI testing)
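
For SSH or CI use, --no-wait can be combined with --oracle and --screenshot to run a task's solution unattended and keep a screenshot of the result. A minimal sketch, with a hypothetical function name and default screenshot path:

```shell
# Sketch only: run a task's oracle solution without the interactive prompt
# and save a screenshot. The default file name is invented for this example.
headless_oracle_check() {
    local task="$1"
    local shot="${2:-screenshot.png}"

    cb interact "$task" --oracle --no-wait --screenshot "$shot"
}

# Usage: headless_oracle_check tasks/click_button artifacts/click_button.png
```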

Task Commands

Manage task environments.

cb task <action> [options]
Command                       Description
cb task info <path>           Show task details (provider, variants)
cb task list [path]           List tasks in a directory
cb task create [path]         Scaffold a new task environment
cb task generate "<prompt>"   Generate a task using Claude

Generate Options

cb task generate "<description>" [options]
Option                Description
--output <path>, -o   Output directory path. If not provided, auto-generates from prompt.
--no-interaction      Skip prompts and run Claude non-interactively
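
In a script, generation is likely to be paired with a fixed output path and a follow-up inspection of the result. The sketch below chains cb task generate with cb task info; the function name and example paths are hypothetical, and it assumes cb task generate exits non-zero when generation fails.

```shell
# Sketch only: generate a task non-interactively, then show its details.
# Assumption: a failed generation returns a non-zero status, so `&&`
# skips the info step in that case.
generate_task() {
    local prompt="$1"
    local out="$2"

    cb task generate "$prompt" --output "$out" --no-interaction &&
        cb task info "$out"
}

# Usage: generate_task "Click the red button" tasks/red_button
```

Passing --output explicitly keeps the task location predictable, since the auto-generated path depends on the prompt.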

Trace Commands

View and manage trace datasets.

cb trace <action> <id>
Command              Description
cb trace view <id>   View a single trace in browser (accepts run_id or session_id)
cb trace grid <id>   View all traces in a run as a grid (accepts run_id)

Dataset Commands

Manage datasets and build from outputs.

cb dataset <action> [options]
Command                      Description
cb dataset list              List available datasets from registry
cb dataset build <outputs>   Build a dataset from batch outputs

Build Options

cb dataset build <outputs> [options]
Option             Description
--save-dir <dir>   Output directory for datasets
--mode <mode>      Processor mode: aguvis-stage-1, gui-r1
--push-to-hub      Push to Hugging Face Hub
--repo-id <id>     Hugging Face repo ID
--private          Create private repo on Hub
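
The build options can be combined into a single build-and-publish step. The sketch below is illustrative only: the function name, the default save directory, and the example repo ID are hypothetical, and it assumes Hugging Face credentials are already configured, which this page does not cover.

```shell
# Sketch only: build a dataset from batch outputs and push it to a private
# Hugging Face repo. Default save directory is invented for this example.
build_and_push() {
    local outputs="$1"
    local repo_id="$2"
    local save_dir="${3:-datasets/built}"
    local mode="${4:-aguvis-stage-1}"   # or gui-r1

    cb dataset build "$outputs" \
        --save-dir "$save_dir" \
        --mode "$mode" \
        --push-to-hub \
        --repo-id "$repo_id" \
        --private
}

# Usage: build_and_push outputs/ my-org/my-traces
```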

Prune Commands

Clean up cua-bench data, images, and Docker resources.

cb prune [options]

When called without options, shows an interactive overview of storage usage. Use flags to clean specific resources.

Option        Description
--all, -a     Remove everything (images, overlays, runs, docker)
--images      Remove only stored images
--overlays    Remove only task overlays
--runs        Remove only run logs and registry
--docker      Remove only docker containers/images
--dry-run     Show what would be deleted without deleting
--force, -f   Skip confirmation prompts

Examples

# Show storage overview
cb prune

# Remove all data with confirmation
cb prune --all

# Remove all data without confirmation
cb prune --all --force

# Preview what would be deleted
cb prune --all --dry-run

# Remove only run logs
cb prune --runs

# Remove docker resources
cb prune --docker
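
The --dry-run and --force flags can also be chained into a preview-then-confirm flow. The sketch below shows one way to do that for run data; the function name and the prompt wording are hypothetical, and it assumes cb prune --dry-run exits with status 0 on success.

```shell
# Sketch only: preview what `cb prune --runs` would delete, then ask before
# deleting. The prompt text is invented for this example.
prune_runs_with_preview() {
    local answer

    # Show what would be removed without deleting anything.
    cb prune --runs --dry-run || return 1

    printf 'Delete the run data listed above? [y/N] '
    read -r answer
    case "$answer" in
        y|Y) cb prune --runs --force ;;   # already confirmed, skip cb's own prompt
        *)   echo "aborted" ;;
    esac
}

# Usage: prune_runs_with_preview
```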
