
Create a Native Task

Build a file-based question-answering task using native OS commands and file manipulation

In this tutorial, you'll create a task that uses native OS capabilities to set up files and evaluate completion. The task uses the native provider, so it runs in a real Docker or QEMU environment with full shell access.

Time: ~10 minutes
Prerequisites: cua-bench installed, Docker running

What You'll Build

A question-answering task where agents must:

  1. Read a question from a text file on the Desktop
  2. Write their answer to another text file
  3. Get evaluated based on file contents

You'll learn:

  • Using run_command() for OS-level operations
  • Setting up files and directories
  • Evaluating tasks by reading files
  • Working with native Linux/Windows environments

When to Use Native Tasks

Use native tasks when:

  • You need to test real OS interactions (file system, registry, processes)
  • The task requires installing or using real applications
  • You need shell access for complex setup
  • You're building WinArena-style tasks

Use Universal GUI instead when:

  • You're building custom interfaces or games
  • You want cross-platform compatibility
  • You don't need real OS features
  • You want faster, lightweight environments

Step 1: Create Task Directory

mkdir -p tasks/question_answer
cd tasks/question_answer

Step 2: Create the Task Configuration

Create main.py:

import cua_bench as cb


@cb.tasks_config(split="train")
def load():
    """Define task variants with different questions."""
    qa_pairs = [
        {
            "question": "What is the capital of France?",
            "answer": "Paris"
        },
        {
            "question": "What is 2 + 2?",
            "answer": "4"
        },
        {
            "question": "What color is the sky on a clear day?",
            "answer": "blue"
        },
        {
            "question": "What is the largest planet in our solar system?",
            "answer": "Jupiter"
        }
    ]

    return [
        cb.Task(
            description=f'Open the question.txt file on the Desktop, read the question, and write your answer in answer.txt on the Desktop. Question: {qa["question"]}',
            metadata={
                "question": qa["question"],
                "answer": qa["answer"]
            },
            computer={
                "provider": "native",  # Requires Docker/QEMU
                "setup_config": {
                    "os_type": "linux",  # or "windows"
                    "width": 1024,
                    "height": 768
                }
            }
        )
        for qa in qa_pairs
    ]


@cb.setup_task(split="train")
async def start(task_cfg: cb.Task, session: cb.DesktopSession):
    """Initialize the task environment with files."""
    # Get the question from metadata
    question = task_cfg.metadata["question"]

    # Create Desktop directory if it doesn't exist
    await session.run_command("mkdir -p ~/Desktop")

    # Write the question to a text file on the Desktop
    await session.run_command(f'echo "{question}" > ~/Desktop/question.txt')

    # Ensure answer.txt doesn't exist yet
    await session.run_command("rm -f ~/Desktop/answer.txt")


@cb.evaluate_task(split="train")
async def evaluate(task_cfg: cb.Task, session: cb.DesktopSession) -> list[float]:
    """Return reward based on task completion."""
    expected_answer = task_cfg.metadata["answer"].strip().lower()

    # Check if answer.txt exists and read it
    result = await session.run_command("cat ~/Desktop/answer.txt 2>/dev/null || echo ''")

    # Extract the actual answer from the command result
    user_answer = result["stdout"].strip().lower() if result else ""

    # Check if the answer matches (case-insensitive, stripped)
    if user_answer == expected_answer:
        return [1.0]
    else:
        return [0.0]


@cb.solve_task(split="train")
async def solve(task_cfg: cb.Task, session: cb.DesktopSession):
    """Demonstrate the solution."""
    answer = task_cfg.metadata["answer"]

    # Open the question file to read it (demonstration)
    await session.run_command("cat ~/Desktop/question.txt")

    # Write the answer to answer.txt
    await session.run_command(f'echo "{answer}" > ~/Desktop/answer.txt')
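The setup and solve steps above interpolate text straight into an echo command, which works for the four sample questions but would break on text containing double quotes, `$`, or backticks. A minimal sketch of one way to guard against that on Linux, using the standard library's shlex.quote; the helper name build_write_command is hypothetical and not part of cua-bench:

```python
import shlex


def build_write_command(text: str, path: str = "~/Desktop/question.txt") -> str:
    """Return a shell command that writes `text` plus a newline to `path`.

    `text` is quoted so the shell treats it as literal data. `path` is left
    unquoted on purpose so that `~` still expands; only pass trusted paths.
    """
    return f"printf '%s\\n' {shlex.quote(text)} > {path}"


# Hypothetical usage inside setup:
#     await session.run_command(build_write_command(question))
print(build_write_command("What is the capital of France?"))
print(build_write_command('Who said "It\'s $5"?'))
```

The returned string can be passed to run_command() in place of the f-string. This applies to POSIX shells only; Windows targets need PowerShell or CMD quoting rules instead.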

Step 3: Understanding run_command()

The run_command() method executes shell commands and returns a dict:

result = await session.run_command("cat ~/Desktop/answer.txt")

# Result structure:
{
    "stdout": str,        # Command output
    "stderr": str,        # Error output
    "return_code": int,   # Exit code (0 = success)
    "success": bool       # True if return_code == 0
}

# Access the output:
output = result["stdout"]
errors = result["stderr"]
exit_code = result["return_code"]
succeeded = result["success"]
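Because every result carries both the output and a success flag, evaluation code can tell "the file is empty" apart from "the command failed". The sketch below works on plain dicts shaped like the structure above, so it runs without a VM; stdout_or_default is a hypothetical helper name, not a cua-bench function:

```python
def stdout_or_default(result, default=""):
    """Return stripped stdout when the command succeeded, else `default`."""
    if not result or not result.get("success"):
        return default
    return result.get("stdout", "").strip()


# Sample results shaped like the dict documented above.
ok = {"stdout": "Paris\n", "stderr": "", "return_code": 0, "success": True}
failed = {"stdout": "", "stderr": "cat: No such file", "return_code": 1, "success": False}

print(stdout_or_default(ok))
print(stdout_or_default(failed, default="<missing>"))
```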

Common Patterns

Create directories:

await session.run_command("mkdir -p ~/Desktop/MyFolder")

Write files:

await session.run_command('echo "content" > ~/Desktop/file.txt')

Read files:

result = await session.run_command("cat ~/Desktop/file.txt")
content = result["stdout"]

Check if file exists:

result = await session.run_command("test -f ~/Desktop/file.txt && echo 'yes' || echo 'no'")
exists = result["stdout"].strip() == "yes"

Install applications (Linux):

await session.run_command("sudo apt update && sudo apt install -y firefox")

Launch applications:

await session.run_command("firefox &")  # Background process
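Since these patterns are plain strings passed to run_command(), the order of commands in a setup routine can be checked locally before starting Docker. The following is a rough sketch under that assumption: FakeSession is a hypothetical stand-in that only records commands and returns a result in the documented shape, and prepare_desktop is an illustrative function rather than part of the tutorial's main.py.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for cb.DesktopSession; it only records commands."""

    def __init__(self):
        self.commands = []

    async def run_command(self, command: str) -> dict:
        self.commands.append(command)
        # Mimic the documented result shape with a successful, empty result.
        return {"stdout": "", "stderr": "", "return_code": 0, "success": True}


async def prepare_desktop(session) -> None:
    """Same pattern as the setup step: create the folder, clear old output."""
    await session.run_command("mkdir -p ~/Desktop")
    await session.run_command("rm -f ~/Desktop/answer.txt")


fake = FakeSession()
asyncio.run(prepare_desktop(fake))
print(fake.commands)
```

A dry run like this only verifies which commands are issued and in what order; it says nothing about how they behave on the real environment.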

Step 4: Run the Task

Preview Interactively

cb interact tasks/question_answer --variant-id 0

A VNC viewer will open showing the native Linux desktop environment.

Run with Oracle

cb interact tasks/question_answer --oracle --variant-id 0

Run with Agent

export ANTHROPIC_API_KEY=sk-...
cb run task tasks/question_answer --agent cua-agent --model anthropic/claude-sonnet-4-20250514

Advanced: Windows Tasks

For Windows-specific tasks, set "os_type" to "windows" in setup_config:

@cb.tasks_config(split="train")
def load():
    return [
        cb.Task(
            description='Open Notepad and write "Hello World" to a file',
            computer={
                "provider": "native",
                "setup_config": {
                    "os_type": "windows",  # Windows VM
                    "width": 1920,
                    "height": 1080
                }
            }
        )
    ]

@cb.setup_task(split="train")
async def start(task_cfg: cb.Task, session: cb.DesktopSession):
    # Windows commands use PowerShell/CMD syntax
    await session.run_command('powershell -Command "New-Item -Path C:\\Users\\Public\\Desktop -ItemType Directory -Force"')

    # Launch Notepad
    await session.run_command("notepad.exe")

Windows Command Patterns

PowerShell commands:

result = await session.run_command('powershell -Command "Get-ChildItem C:\\Users"')

Registry operations:

await session.run_command('reg add "HKCU\\Software\\MyApp" /v Setting /t REG_SZ /d "value"')

File operations:

await session.run_command('powershell -Command "Get-Content C:\\path\\file.txt"')

File API Alternative

For simple file operations, use the dedicated file methods instead of run_command():

# Read file
content = await session.read_file("/path/to/file.txt")

# Write file
await session.write_file("/path/to/file.txt", "content")

# Check if file exists
exists = await session.file_exists("/path/to/file.txt")

# List directory
files = await session.list_dir("/path/to/dir")

These methods are cleaner than shell commands for basic file operations.

Key Concepts

Native Provider: Runs on real Docker (Linux) or QEMU (Windows) environments

Shell Commands: Use run_command() for OS-level operations

Return Type: run_command() always returns a dict with the keys stdout, stderr, return_code, and success

File Setup: Use the setup function to prepare the environment before the agent interacts with it

Evaluation: Read files or check system state to determine task completion
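The comparison at the heart of the evaluation step can be kept as a pure function, which makes it easy to test without a session. This sketch mirrors the tutorial's case-insensitive, whitespace-stripped match; the name score_answer is hypothetical:

```python
def score_answer(stdout: str, expected: str) -> list[float]:
    """Return [1.0] for a case-insensitive, whitespace-trimmed match, else [0.0]."""
    user_answer = stdout.strip().lower()
    return [1.0] if user_answer == expected.strip().lower() else [0.0]


print(score_answer("  Paris\n", "Paris"))
print(score_answer("", "Paris"))
```

An exact match is strict by design: an answer such as "Paris." or "The answer is Paris" scores 0.0, so task descriptions should tell the agent to write only the answer.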

Performance Considerations

Native tasks are slower than Universal GUI tasks because they:

  • Require Docker/QEMU startup time
  • Run full OS environments
  • Execute real shell commands

Optimization tips:

  • Use Universal GUI when possible
  • Batch multiple commands with &&
  • Avoid unnecessary file operations
  • Use file API methods for simple reads/writes
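As a small illustration of the batching tip for Linux targets, several setup commands can be joined into one string so that only a single run_command() call is needed. The helper name batch is hypothetical:

```python
def batch(commands):
    """Join shell commands with && so each runs only if the previous one succeeded."""
    return " && ".join(commands)


combined = batch([
    "mkdir -p ~/Desktop",
    "rm -f ~/Desktop/answer.txt",
])
print(combined)
```

Because && short-circuits, a failure in an early command skips the rest, and the result's return_code is that of the command that failed.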

Troubleshooting

Docker not running:

Error: Cannot connect to Docker daemon

Solution: Start Docker Desktop (or the Docker daemon on Linux), then rerun the command

Permission denied:

# Prefix the command with sudo when it needs elevated permissions
await session.run_command("sudo apt install -y package")

Command not found:

# Install the package first
await session.run_command("sudo apt update && sudo apt install -y package-name")

Windows path escaping:

# Use raw strings or double backslashes (both produce C:\Users)
await session.run_command('powershell -Command "Get-Item C:\\Users"')
await session.run_command(r'powershell -Command "Get-Item C:\Users"')
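A quick check that needs no VM shows how Python treats each spelling: a raw string and a doubled backslash produce the same command text, while four backslashes leave a doubled separator in the path. The variable names here are only for illustration:

```python
double = 'powershell -Command "Get-Item C:\\Users\\Public"'
raw = r'powershell -Command "Get-Item C:\Users\Public"'
quadruple = 'powershell -Command "Get-Item C:\\\\Users\\\\Public"'

# Both spellings yield identical command text with single backslashes.
print(double == raw)

# Four backslashes in source become two in the string, one extra per separator.
print(len(quadruple) - len(raw))
print(raw)
```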
