# Benchmarks

Computer Agent SDK benchmarks for agentic GUI tasks
The benchmark system evaluates models on GUI grounding tasks, measuring agent loop success rate and click prediction accuracy. It supports both:

- Computer Agent SDK providers (using model strings such as `"huggingface-local/HelloKKMe/GTA1-7B"`)
- Reference agent implementations (custom model classes implementing the `ModelProtocol`)
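As a rough illustration of what a reference implementation might look like, the sketch below defines a structural protocol and a trivial model that satisfies it. The method name `predict_click` and its signature are assumptions for this example, not the SDK's actual interface; consult the benchmark repo for the real `ModelProtocol` definition.

```python
from typing import Protocol, Tuple

class ModelProtocol(Protocol):
    """Hypothetical sketch of the interface; the real protocol may differ."""

    def predict_click(self, image_b64: str, instruction: str) -> Tuple[int, int]:
        """Return (x, y) pixel coordinates for the described UI element."""
        ...

class CenterClickModel:
    """Trivial reference model: always predicts the screen center."""

    def __init__(self, width: int = 1920, height: int = 1080) -> None:
        self.width = width
        self.height = height

    def predict_click(self, image_b64: str, instruction: str) -> Tuple[int, int]:
        # Ignores the screenshot and instruction; useful only as a baseline.
        return (self.width // 2, self.height // 2)

model: ModelProtocol = CenterClickModel()
print(model.predict_click("", "click the OK button"))  # (960, 540)
```

Because `Protocol` uses structural subtyping, `CenterClickModel` needs no explicit inheritance; any class with a matching method satisfies the type check.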
## Available Benchmarks
- ScreenSpot-v2 - Standard resolution GUI grounding
- ScreenSpot-Pro - High-resolution GUI grounding
- Interactive Testing - Real-time testing and visualization
## Quick Start
```bash
# Clone the benchmark repository
git clone https://github.com/trycua/cua
cd cua/libs/python/agent/benchmarks

# Install dependencies
pip install cua

# Run a benchmark
python ss-v2.py
```
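Click prediction accuracy of the kind these benchmarks report is commonly computed as the fraction of predicted clicks that land inside the target element's bounding box. The helper below is an illustrative sketch of that metric, not code from the benchmark scripts; the function name and box format `(x1, y1, x2, y2)` are assumptions.

```python
from typing import List, Tuple

def click_accuracy(
    preds: List[Tuple[int, int]],
    boxes: List[Tuple[int, int, int, int]],  # (x1, y1, x2, y2) per target
) -> float:
    """Fraction of predicted clicks that fall inside their target box."""
    hits = sum(
        x1 <= px <= x2 and y1 <= py <= y2
        for (px, py), (x1, y1, x2, y2) in zip(preds, boxes)
    )
    return hits / len(preds) if preds else 0.0

# First click lands inside its box, second misses -> accuracy 0.5.
print(click_accuracy([(10, 10), (300, 300)], [(0, 0, 20, 20), (0, 0, 20, 20)]))
```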