# Benchmarks

Computer Agent SDK benchmarks for agentic GUI tasks
The benchmark system evaluates models on GUI grounding tasks, measuring agent loop success rate and click prediction accuracy. It supports both:

- Computer Agent SDK providers (using model strings such as `"huggingface-local/HelloKKMe/GTA1-7B"`)
- Reference agent implementations (custom model classes implementing the `ModelProtocol`)
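As a rough illustration of what a reference implementation might look like, the sketch below defines a structural protocol and a trivial model that satisfies it. The method name `predict_click` and its signature are assumptions for this example, not the SDK's actual interface; consult the benchmark repo for the real `ModelProtocol` definition.

```python
from typing import Protocol, Tuple

class ModelProtocol(Protocol):
    """Hypothetical sketch of the interface; the real protocol may differ."""

    def predict_click(self, image_b64: str, instruction: str) -> Tuple[int, int]:
        """Return (x, y) pixel coordinates for the described UI element."""
        ...

class CenterClickModel:
    """Trivial reference model: always predicts the screen center."""

    def __init__(self, width: int = 1920, height: int = 1080) -> None:
        self.width = width
        self.height = height

    def predict_click(self, image_b64: str, instruction: str) -> Tuple[int, int]:
        # Ignores the screenshot and instruction; useful only as a baseline.
        return (self.width // 2, self.height // 2)

model: ModelProtocol = CenterClickModel()
print(model.predict_click("", "click the OK button"))  # (960, 540)
```

Because `Protocol` uses structural subtyping, `CenterClickModel` needs no explicit inheritance; any class with a matching method satisfies the type check.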
## Available Benchmarks
- ScreenSpot-v2 - Standard resolution GUI grounding
- ScreenSpot-Pro - High-resolution GUI grounding
- Interactive Testing - Real-time testing and visualization
## Quick Start
```bash
# Clone the benchmark repository
git clone https://github.com/trycua/cua
cd cua/libs/python/agent/benchmarks

# Install dependencies
pip install cua

# Run a benchmark
python ss-v2.py
```
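Click prediction accuracy of the kind these benchmarks report is commonly computed as the fraction of predicted clicks that land inside the target element's bounding box. The helper below is an illustrative sketch of that metric, not code from the benchmark scripts; the function name and box format `(x1, y1, x2, y2)` are assumptions.

```python
from typing import List, Tuple

def click_accuracy(
    preds: List[Tuple[int, int]],
    boxes: List[Tuple[int, int, int, int]],  # (x1, y1, x2, y2) per target
) -> float:
    """Fraction of predicted clicks that fall inside their target box."""
    hits = sum(
        x1 <= px <= x2 and y1 <= py <= y2
        for (px, py), (x1, y1, x2, y2) in zip(preds, boxes)
    )
    return hits / len(preds) if preds else 0.0

# First click lands inside its box, second misses -> accuracy 0.5.
print(click_accuracy([(10, 10), (300, 300)], [(0, 0, 20, 20), (0, 0, 20, 20)]))
```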