
Agent Traces

Recording and viewing agent execution traces

When running tasks with agents, cua-bench automatically records detailed traces of agent execution. These traces capture screenshots, actions, reasoning, and metadata—essential for debugging, evaluation, and dataset creation.

Trace Output Locations

After running a task with an agent, traces are saved to ~/.local/share/cua-bench/runs/<run_id>/<task>_v<variant>/:

~/.local/share/cua-bench/runs/96d41b51/slack_env_v0/
├── run.log                     # Execution logs
├── task_0_agent_logs/          # CUA Agent SDK trajectories (if using cua-agent)
│   └── trajectories/
│       └── 2026-01-07_claudesonnet4_210819_2a23/
│           ├── metadata.json   # Agent turns, model responses, usage
│           ├── turn_000/       # Screenshots and data per turn
│           ├── turn_001/
│           └── ...
└── task_0_trace/               # cua-bench trace dataset
    ├── data-00000-of-00001.arrow  # HuggingFace Dataset format
    ├── dataset_info.json
    └── state.json
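
If you want to enumerate traces from a script rather than by browsing the filesystem, the layout above is predictable enough to glob. This is a minimal sketch, assuming the default runs directory shown above; the run ID and task names in the comments are only examples:

from pathlib import Path

runs_dir = Path("~/.local/share/cua-bench/runs").expanduser()

# List every task trace directory across all runs
for trace_dir in sorted(runs_dir.glob("*/*/task_*_trace")):
    run_id = trace_dir.parents[1].name   # e.g. "96d41b51"
    task_name = trace_dir.parent.name    # e.g. "slack_env_v0"
    print(f"{run_id}  {task_name}  ->  {trace_dir}")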

cua-bench Trace Format

The cua-bench trace format uses HuggingFace Datasets with Apache Arrow for efficient storage and loading. Each trace contains:

Dataset Schema

{
  "event_name": str,          # Event type: "reset", "agent_step", "agent_action", etc.
  "data_json": str,           # JSON-encoded event metadata
  "data_images": [Image],     # List of screenshots (PIL Images)
  "trajectory_id": str,       # Unique trajectory identifier
  "timestamp": str            # ISO 8601 timestamp
}

Event Types

Event          | Description               | Images                  | Metadata
---------------|---------------------------|-------------------------|----------------------------------
reset          | Task setup complete       | Initial screenshot      | Task config, index
agent_step     | Agent reasoning step      | Screenshot              | Step number, model usage, output
agent_action   | Agent performed action    | Screenshot after action | Action type, arguments
agent_thinking | Model thinking/response   | Screenshot              | Thinking text (truncated)
solve          | Oracle solution complete  | Final screenshot        | Completion status
evaluate       | Task evaluation           | -                       | Evaluation result
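
As a quick sanity check, you can tally how many events of each type a trace contains. A minimal sketch, assuming a trace directory like the one shown above (loading traces is covered in more detail below):

from collections import Counter
from datasets import load_from_disk

trace = load_from_disk("/path/to/task_0_trace")

# Reading a single column avoids decoding the screenshots
counts = Counter(trace["event_name"])
for name, count in counts.most_common():
    print(f"{name}: {count}")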

Loading Traces Programmatically

import json
import os

from datasets import load_from_disk

# Load a trace dataset (expand ~ explicitly so the path resolves as a local directory)
trace = load_from_disk(
    os.path.expanduser("~/.local/share/cua-bench/runs/96d41b51/slack_env_v0/task_0_trace")
)

# Iterate through events
for event in trace:
    print(f"Event: {event['event_name']}")
    print(f"Time: {event['timestamp']}")

    # Parse the JSON-encoded metadata
    data = json.loads(event['data_json'])
    print(f"Data: {data}")

    # Access screenshots
    for img in event['data_images']:
        img.show()  # PIL Image
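
Building on the loop above, a common next step is exporting screenshots and action metadata to plain files, for example when assembling a training dataset. A minimal sketch; the exported_trace output layout is just an illustration:

import json
from pathlib import Path

out_dir = Path("exported_trace")
out_dir.mkdir(exist_ok=True)

for i, event in enumerate(trace):
    if event["event_name"] != "agent_action":
        continue  # keep only action events

    data = json.loads(event["data_json"])

    # Save the post-action screenshot(s) as PNGs
    for j, img in enumerate(event["data_images"]):
        img.save(out_dir / f"event_{i:04d}_img_{j}.png")

    # Save the action metadata alongside the images
    (out_dir / f"event_{i:04d}.json").write_text(json.dumps(data, indent=2))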

Viewing Traces

Use the cua-bench trace viewer to visualize traces in your browser:

View Single Trace

# View a specific task trace
cb trace view <run_id>/<task>_v<variant>

# Example
cb trace view 96d41b51/slack_env_v0

Opens an interactive viewer showing:

  • Timeline of events
  • Screenshots at each step
  • Action details and metadata
  • Agent reasoning (if available)

View All Traces in a Run

# View all task traces in a grid
cb trace grid <run_id>

# Example
cb trace grid 96d41b51

Shows a grid view of all task traces in the run, useful for:

  • Comparing agent performance across variants
  • Quickly identifying failures
  • Reviewing oracle solutions

Recording Custom Events

When building custom agents, you can record your own events to the trace:

from cua_bench.agents.base import BaseAgent, AgentResult

class MyAgent(BaseAgent):
    async def perform_task(
        self,
        task_description: str,
        session,
        logging_dir=None,
        tracer=None,  # Tracer is passed automatically
    ) -> AgentResult:
        # Record custom event
        if tracer:
            screenshot = await session.screenshot()
            tracer.record(
                "my_custom_event",
                {
                    "step": 1,
                    "reasoning": "Analyzing the screen...",
                    "confidence": 0.95,
                },
                [screenshot]
            )

        # Your agent logic...
