Data · Benchmarks · Training

100K examples for $5 — then benchmark and train.

HTML simulators across 10+ OS themes, gym-style RL with shaped rewards, deterministic verifiers, and exportable trajectory data.

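from cua_bench import GymEnvironment

# Create a simulated environment for the messaging-app task
env = GymEnvironment("messaging-app", provider="simulated")
obs, info = env.reset()

# Generate data across OS themes
for theme in ["macos", "windows", "ubuntu", "android"]:
    env.set_theme(theme)
    # Placeholder: the original snippet leaves `action` undefined. Sampling from
    # `env.action_space` assumes the standard Gym attribute is available here.
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)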

Leaderboard

CuaBench 2.0 results

Rank	Model	Score
1	Claude Haiku 4.5	68.4%
2	Claude Sonnet 4.5	59.1%
3	UI-TARS-2	58.0%
4	GPT-5.2	57.8%
5	Gemini CUA	54.2%

Capabilities

Generate data, benchmark, and train

10+ OS themes

HTML simulators render pixel-perfect UIs across macOS, Windows, Ubuntu, Android, and more.

$5 per 100K examples

Record a trajectory once, then re-render it in every theme. 10,000x cheaper than manual annotation.
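As a quick sanity check on the pricing, the arithmetic below works out the per-example cost from the quoted $5 per 100K figure and, taking the 10,000x claim at face value, the manual annotation cost it implies. The variable names are illustrative, and the implied manual cost is derived from the two quoted numbers, not a separately reported figure.

```python
from fractions import Fraction

price_usd = Fraction(5)          # quoted price for one batch
batch_size = 100_000             # examples per batch
cheaper_factor = 10_000          # quoted ratio vs. manual annotation

cost_per_example = price_usd / batch_size
implied_manual_cost = cost_per_example * cheaper_factor

print(float(cost_per_example))     # 5e-05 USD per example
print(float(implied_manual_cost))  # 0.5 USD per example
```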

Gym-style RL

OpenAI Gym interface with shaped rewards — 10x faster than sparse-reward setups.
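To make the shaped versus sparse distinction concrete, here is a minimal sketch that does not use cua_bench. The subgoal checklist and the functions `sparse_reward` and `shaped_reward` are hypothetical, chosen only to show how partial credit gives a learning signal before the task is fully solved.

```python
from typing import Sequence


def sparse_reward(subgoals_done: Sequence[bool]) -> float:
    """Reward only when every subgoal is complete."""
    return 1.0 if subgoals_done and all(subgoals_done) else 0.0


def shaped_reward(subgoals_done: Sequence[bool]) -> float:
    """Partial credit: fraction of subgoals completed so far."""
    if not subgoals_done:
        return 0.0
    return sum(subgoals_done) / len(subgoals_done)


# Hypothetical messaging task: open app, pick contact, type message, send.
progress = [True, True, False, False]
print(sparse_reward(progress))  # 0.0
print(shaped_reward(progress))  # 0.5
```

With the sparse version an agent that gets halfway sees the same reward as one that does nothing, which is one plausible reason shaped rewards can speed up training.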

Deterministic verifiers

Programmatic reward functions with oracle solutions for reproducible evals.
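A deterministic verifier paired with an oracle solution might look roughly like the sketch below. It is a toy, self-contained illustration: `MessagingState`, `verify`, and `oracle_solution` are hypothetical names and do not reflect the actual cua_bench API.

```python
from dataclasses import dataclass, field


@dataclass
class MessagingState:
    """Toy application state; a stand-in for whatever the simulator exposes."""
    sent: list = field(default_factory=list)  # (recipient, text) pairs


def verify(state: MessagingState) -> float:
    """Deterministic check: 1.0 if the expected message was sent, else 0.0."""
    return 1.0 if ("alice", "hello") in state.sent else 0.0


def oracle_solution(state: MessagingState) -> None:
    """Known-good solution used to confirm the task is solvable."""
    state.sent.append(("alice", "hello"))


state = MessagingState()
assert verify(state) == 0.0   # nothing sent yet
oracle_solution(state)
assert verify(state) == 1.0   # oracle passes the verifier
```

Because the check inspects state rather than pixels or model judgments, the same run always gets the same score, and the oracle confirms that a passing score is reachable.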

Trajectory export

Export runs as Parquet with screenshots, actions, rewards, and traces.
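As a rough picture of what an exported trajectory could contain, the sketch below assembles per-step records into a column-oriented table using only the standard library. The `Step` fields and the `to_columns` helper are assumptions made for illustration; the real export schema is not documented here.

```python
from dataclasses import dataclass, asdict


@dataclass
class Step:
    """One trajectory step; field names are illustrative, not the real schema."""
    step: int
    screenshot_path: str
    action: str
    reward: float


def to_columns(steps):
    """Convert a list of Step records into a column-oriented dict."""
    rows = [asdict(s) for s in steps]
    return {key: [row[key] for row in rows] for key in rows[0]} if rows else {}


trajectory = [
    Step(0, "shots/0000.png", "click(120, 48)", 0.0),
    Step(1, "shots/0001.png", "type('hello')", 0.5),
]
columns = to_columns(trajectory)
print(columns["reward"])  # [0.0, 0.5]
```

With pandas and a Parquet engine such as pyarrow installed, `pandas.DataFrame(columns).to_parquet("trajectory.parquet")` would write a table like this to disk.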

4 benchmark datasets

cua-bench-basic, cua-bench-real, OSWorld, and WindowsAgentArena.