Building or researching computer-use agents? Fill out the Interest Form to chat with us.

Cua-Bench is a framework and set of tasks for evaluating how well AI agents can accomplish complex tasks on a desktop computer using primarily the keyboard and mouse, or on a mobile device using primarily the touchscreen.

Examples of Cua-Bench tasks are:

Navigating a web browser to book travel reservations
Installing design software and editing images to remove backgrounds
Using office applications to format and merge multiple documents

Cua-Bench consists of three parts:

Base images - Pre-packaged Windows, Linux, macOS, and Android environments with apps and dependencies needed for benchmark runs
A dataset of tasks - Verifiable, dynamic computer-use environments across multiple operating systems
An evaluation/training harness - Tools for running benchmarks, generating data, and testing agents

Dataset of tasks

Each task in Cua-Bench includes:

a description in English
a config for the desktop/mobile environment
a setup script to install any HTML webapps
a test script to verify if the agent completed the task successfully
a reference ("oracle") solution that solves the task

Check out our existing tasks here

An environment harness

The harness provides familiar APIs for creating verifiable tasks. Use an Electron-like Python API to install web apps onto desktop/mobile environments, then use a Playwright-like API to write automated solutions or reward test scripts.

After installing the package, you can run the harness using cb run or cb interact.

Next Steps

Was this page helpful?

Introduction

Dataset of tasks

An environment harness

Next Steps

On this page