Data · Benchmarks · Training
100K examples for $5 — then benchmark and train.
HTML simulators across 10+ OS themes, gym-style RL with shaped rewards, deterministic verifiers, and exportable trajectory data.
from cua_bench import GymEnvironment

env = GymEnvironment("messaging-app", provider="simulated")
obs, info = env.reset()

# Generate data across OS themes
for theme in ["macos", "windows", "ubuntu", "android"]:
    env.set_theme(theme)
    # `action` is whatever your agent or policy chooses for the current observation
    obs, reward, done, info = env.step(action)

Leaderboard
CuaBench 2.0 results
| Rank | Model | Score |
|---|---|---|
| 1 | Claude Haiku 4.5 | 68.4% |
| 2 | Claude Sonnet 4.5 | 59.1% |
| 3 | UI-TARS-2 | 58.0% |
| 4 | GPT-5.2 | 57.8% |
| 5 | Gemini CUA | 54.2% |
Capabilities
Generate data, benchmark, and train
10+ OS themes
HTML simulators render pixel-perfect UIs across macOS, Windows, Ubuntu, Android, and more.
$5 per 100K examples
Record a trajectory once, then re-render it in every theme. 10,000x cheaper than manual annotation.
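
One plausible way to reuse a single recording across themes is to store actions against logical element ids and resolve them to pixel coordinates per theme. The sketch below illustrates that idea only. `SemanticAction`, `LAYOUTS`, `render_for_theme`, and the coordinates are all hypothetical and are not taken from cua_bench.

```python
from dataclasses import dataclass


# Hypothetical data model: none of these names come from cua_bench.
@dataclass(frozen=True)
class SemanticAction:
    kind: str       # e.g. "click" or "type"
    target: str     # logical element id, independent of any theme
    text: str = ""  # payload for "type" actions


# Assumed per-theme layouts: logical element id -> (x, y) centre in pixels.
LAYOUTS = {
    "macos": {"compose": (40, 60), "send": (620, 440)},
    "windows": {"compose": (28, 96), "send": (600, 468)},
}


def render_for_theme(trajectory, theme):
    """Resolve each theme-independent action to concrete coordinates."""
    layout = LAYOUTS[theme]
    rendered = []
    for act in trajectory:
        x, y = layout[act.target]
        rendered.append(
            {"theme": theme, "kind": act.kind, "x": x, "y": y, "text": act.text}
        )
    return rendered


# One recorded trajectory of three actions...
recorded = [
    SemanticAction("click", "compose"),
    SemanticAction("type", "compose", "hello"),
    SemanticAction("click", "send"),
]

# ...expanded into one concrete copy per theme.
examples = [row for theme in LAYOUTS for row in render_for_theme(recorded, theme)]
print(len(examples))  # 3 actions x 2 themes = 6
```

Under this assumption, the number of examples grows with the number of themes while the recording effort stays fixed.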
Gym-style RL
OpenAI Gym interface with shaped rewards — 10x faster than sparse-reward setups.
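
To make the difference between shaped and sparse rewards concrete, here is a minimal, library-free sketch. The subgoal-counting scheme and both function names are assumptions for illustration. The page does not say how cua_bench actually shapes its rewards.

```python
def sparse_reward(done_subgoals, total_subgoals):
    """Return 1.0 only when every subgoal is complete, otherwise 0.0."""
    return 1.0 if done_subgoals == total_subgoals else 0.0


def shaped_reward(prev_done, now_done, total_subgoals):
    """Reward the fraction of subgoals newly completed on this step."""
    return (now_done - prev_done) / total_subgoals


# Hypothetical episode: cumulative subgoals completed after each step (4 total).
progress = [0, 1, 1, 2, 3, 4]
total = 4

sparse = [sparse_reward(now, total) for now in progress[1:]]
shaped = [
    shaped_reward(prev, now, total) for prev, now in zip(progress, progress[1:])
]

print(sparse)  # [0.0, 0.0, 0.0, 0.0, 1.0]
print(shaped)  # [0.25, 0.0, 0.25, 0.25, 0.25]
```

Both signals sum to 1.0 over the episode, but the shaped version gives the learner feedback on intermediate steps instead of only at the end.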
Deterministic verifiers
Programmatic reward functions with oracle solutions for reproducible evals.
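
A deterministic verifier can be pictured as a pure function of the final task state, with an oracle solution serving as a self-check. The sketch below uses a toy in-memory messaging state. `verify`, `oracle_solution`, `apply`, and `run` are hypothetical names, not the cua_bench API.

```python
def verify(state):
    """Deterministic check: depends only on the final state, no sampling."""
    sent = state.get("sent_messages", [])
    return 1.0 if {"to": "alice", "body": "hello"} in sent else 0.0


def oracle_solution():
    """Scripted actions known to solve the task."""
    return [("open_chat", "alice"), ("type", "hello"), ("send", None)]


def apply(state, action):
    """Toy transition function standing in for the simulator."""
    kind, arg = action
    state = dict(state)  # copy so earlier states are never mutated
    if kind == "open_chat":
        state["chat"] = arg
    elif kind == "type":
        state["draft"] = arg
    elif kind == "send" and state.get("chat") and state.get("draft"):
        state["sent_messages"] = state.get("sent_messages", []) + [
            {"to": state["chat"], "body": state["draft"]}
        ]
        state["draft"] = ""
    return state


def run(actions):
    """Apply a sequence of actions to an empty state and return the result."""
    state = {}
    for action in actions:
        state = apply(state, action)
    return state


# The oracle must pass the verifier; doing nothing must fail it.
assert verify(run(oracle_solution())) == 1.0
assert verify(run([])) == 0.0
```

Running the oracle through the verifier shows that the task is solvable and that the check accepts a known-good run, which is what makes repeated evaluations comparable.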
Trajectory export
Export runs as Parquet with screenshots, actions, rewards, and traces.
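
As a rough picture of what an exported trajectory could contain, the sketch below pivots per-step records into the column-oriented layout that Parquet writers expect. The `Step` fields and the `to_columns` helper are assumptions. The page lists screenshots, actions, rewards, and traces but does not give the actual schema.

```python
from dataclasses import asdict, dataclass


# Hypothetical schema: field names are illustrative, not from cua_bench.
@dataclass
class Step:
    episode_id: str
    step: int
    screenshot_png: bytes  # encoded image bytes (placeholder values below)
    action: str            # action serialized as a JSON string
    reward: float
    trace: str


def to_columns(steps):
    """Pivot row records into a dict of equal-length columns."""
    rows = [asdict(s) for s in steps]
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}


steps = [
    Step("ep-0", 0, b"\x89PNG...", '{"type": "click", "x": 40, "y": 60}', 0.0, "clicked compose"),
    Step("ep-0", 1, b"\x89PNG...", '{"type": "type", "text": "hello"}', 1.0, "typed message"),
]

columns = to_columns(steps)
print(list(columns))

# With pyarrow installed, these columns could be written as Parquet:
#   import pyarrow as pa
#   import pyarrow.parquet as pq
#   pq.write_table(pa.table(columns), "trajectories.parquet")
```

The write itself is left as a comment so the sketch runs without third-party packages. `trajectories.parquet` is a placeholder file name.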
4 benchmark datasets
cua-bench-basic, cua-bench-real, OSWorld, and WindowsAgentArena.