
Introduction

A benchmark to measure the capabilities of computer-use agents in desktop environments.

Cua-Bench is a framework and set of tasks for evaluating how well AI agents can accomplish complex tasks on a desktop computer using primarily the keyboard and mouse, or on a mobile device using primarily the touchscreen.


Example Cua-Bench tasks include:

  • Navigating a web browser to book travel reservations
  • Installing design software and editing images to remove backgrounds
  • Using office applications to format and merge multiple documents

Cua-Bench consists of two parts:

  1. A dataset of tasks - Verifiable, dynamic computer-use environments across multiple operating systems
  2. An evaluation/training harness - Tools for running benchmarks, generating data, and testing agents

Dataset of tasks

Each task in Cua-Bench includes:

  • a description in English
  • a config for the desktop/mobile environment
  • a setup script that installs any HTML web apps the task needs
  • a test script that verifies whether the agent completed the task successfully
  • a reference ("oracle") solution that solves the task

Check out our existing tasks here

An environment harness

The harness provides familiar APIs for creating verifiable tasks. Use an Electron-like Python API to install web apps onto desktop/mobile environments, then use a Playwright-like API to write automated solutions or reward test scripts.

After installing the package, you can run the harness with the cb run or cb interact commands.

Next Steps

  1. Install the CLI
  2. Run your first benchmark
  3. Create custom tasks
