Introduction
A benchmark to measure the capabilities of computer-use agents in desktop environments.
Cua-Bench is a framework and set of tasks for evaluating how well AI agents can accomplish complex tasks on a desktop computer using primarily the keyboard and mouse, or on a mobile device using primarily the touchscreen.

Examples of Cua-Bench tasks are:
- Navigating a web browser to book travel reservations
- Installing design software and editing images to remove backgrounds
- Using office applications to format and merge multiple documents
Cua-Bench consists of two parts:
- A dataset of tasks - Verifiable, dynamic computer-use environments across multiple operating systems
- An evaluation/training harness - Tools for running benchmarks, generating data, and testing agents
Dataset of tasks
Each task in Cua-Bench includes:
- a description in English
- a config for the desktop/mobile environment
- a setup script to install any HTML webapps
- a test script to verify if the agent completed the task successfully
- a reference ("oracle") solution that solves the task
Check out our existing tasks here
An environment harness
The harness provides familiar APIs for creating verifiable tasks. Use an Electron-like Python API to install web apps onto desktop/mobile environments, then use a Playwright-like API to write automated solutions or reward test scripts.
After installing the package, you can run the harness using cb run or cb interact.
Next Steps
Was this page helpful?