Generate Tasks with AI

Use Claude to create tasks from natural language descriptions

In this tutorial, you'll use AI to generate complete task environments from natural language descriptions. This is the fastest way to create new tasks.

Time: ~10 minutes
Prerequisites: Claude Code OAuth token

What You'll Build

An AI-generated color picker task where the agent must select a specific color. Along the way, you'll cover:

  • Using cb task generate to create tasks
  • Reviewing and understanding generated code
  • Iterating on AI-generated tasks
  • Manual refinement techniques

How It Works

The cb task generate command:

  1. Creates a starter task template
  2. Launches Claude Code in interactive mode
  3. Claude modifies the template based on your description
  4. You review, test, and refine the result
┌──────────────────────────────────────────────────────────────┐
│  cb task generate "Create a color picker..."                 │
│                           │                                  │
│                           ▼                                  │
│  ┌────────────────────────────────────────────────────────┐ │
│  │  Starter Template                                       │ │
│  │  ├── main.py (decorators)                              │ │
│  │  ├── gui/index.html                                    │ │
│  │  └── pyproject.toml                                    │ │
│  └────────────────────────────────────────────────────────┘ │
│                           │                                  │
│                           ▼                                  │
│  ┌────────────────────────────────────────────────────────┐ │
│  │  Claude Code                                            │ │
│  │  Modifies template based on your prompt                │ │
│  └────────────────────────────────────────────────────────┘ │
│                           │                                  │
│                           ▼                                  │
│  ┌────────────────────────────────────────────────────────┐ │
│  │  Generated Task                                         │ │
│  │  ├── main.py (customized)                              │ │
│  │  ├── gui/index.html (customized)                       │ │
│  │  └── pyproject.toml (metadata)                         │ │
│  └────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘

Step 1: Set Up Authentication

Get your Claude Code OAuth token and set it as an environment variable:

export CLAUDE_CODE_OAUTH_TOKEN=your-token-here

Getting a Token

Contact your administrator or visit the Claude Code documentation to obtain an OAuth token.
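
Before running the generator, it can be worth confirming that the token is actually visible in your environment. The following is a minimal sketch in Python; the helper name has_oauth_token is hypothetical and not part of cb, and it only checks that the variable is set and non-blank, not that the token is valid.

```python
import os


def has_oauth_token(env=None):
    """Return True if CLAUDE_CODE_OAUTH_TOKEN is set to a non-blank value.

    `env` defaults to the current process environment; passing a dict
    makes the check easy to exercise without touching real settings.
    """
    env = os.environ if env is None else env
    return bool(env.get("CLAUDE_CODE_OAUTH_TOKEN", "").strip())


if has_oauth_token():
    print("CLAUDE_CODE_OAUTH_TOKEN is set")
else:
    print("CLAUDE_CODE_OAUTH_TOKEN is missing; export it before running cb task generate")
```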

Step 2: Generate a Task

Run the generate command with a description:

cb task generate "Create a color picker task where the user must select the color red from a palette of 6 colors"

Claude Code will start an interactive session. You'll see it:

  1. Read the starter template
  2. Modify main.py with task logic
  3. Customize gui/index.html with the color picker UI
  4. Update pyproject.toml with metadata

Step 3: Review Generated Files

After generation, examine the created files:

ls tasks/color-picker-env/
# main.py  gui/  pyproject.toml

Generated main.py

The generated main.py should look something like:

import cua_bench as cb
from pathlib import Path

# Identifier returned by session.launch_window(); shared by setup, evaluate, and solve.
pid = None

@cb.tasks_config(split="train")
def load():
    return [
        cb.Task(
            description="Select the red color from the color palette.",
            metadata={"target_color": "red"},
            computer={
                "provider": "simulated",
                "setup_config": {
                    "os_type": "macos",
                    "width": 512,
                    "height": 512,
                }
            }
        )
    ]


@cb.setup_task(split="train")
async def start(task_cfg: cb.Task, session: cb.DesktopSession):
    global pid
    html_content = (Path(__file__).parent / "gui/index.html").read_text()
    pid = await session.launch_window(
        html=html_content,
        title="Color Picker",
        width=400,
        height=300,
    )


@cb.evaluate_task(split="train")
async def evaluate(task_cfg: cb.Task, session: cb.DesktopSession) -> list[float]:
    global pid
    if pid is None:
        return [0.0]

    selected = await session.execute_javascript(pid, "window.__selectedColor")
    target = task_cfg.metadata.get("target_color", "red")

    return [1.0] if selected == target else [0.0]


@cb.solve_task(split="train")
async def solve(task_cfg: cb.Task, session: cb.DesktopSession):
    global pid
    if pid is None:
        return

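    # Click the swatch named in the task metadata so the oracle stays correct
    # if more color variants are added later.
    target = task_cfg.metadata.get("target_color", "red")
    await session.click_element(pid, f'[data-color="{target}"]')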

Generated gui/index.html

<!DOCTYPE html>
<html>
<head>
    <style>
        body {
            font-family: system-ui;
            display: flex;
            flex-direction: column;
            align-items: center;
            padding: 20px;
        }

        h2 { margin-bottom: 20px; }

        .palette {
            display: grid;
            grid-template-columns: repeat(3, 1fr);
            gap: 10px;
        }

        .color-box {
            width: 80px;
            height: 80px;
            border: 3px solid #ccc;
            border-radius: 8px;
            cursor: pointer;
        }

        .color-box:hover { transform: scale(1.05); }
        .color-box.selected { border-color: #000; }
    </style>
</head>
<body>
    <h2>Select the red color</h2>
    <div class="palette">
        <div class="color-box" data-color="red" style="background: #e74c3c;"></div>
        <div class="color-box" data-color="blue" style="background: #3498db;"></div>
        <div class="color-box" data-color="green" style="background: #2ecc71;"></div>
        <div class="color-box" data-color="yellow" style="background: #f1c40f;"></div>
        <div class="color-box" data-color="purple" style="background: #9b59b6;"></div>
        <div class="color-box" data-color="orange" style="background: #e67e22;"></div>
    </div>

    <script>
        window.__selectedColor = null;

        document.querySelectorAll('.color-box').forEach(box => {
            box.addEventListener('click', function() {
                document.querySelectorAll('.color-box').forEach(b =>
                    b.classList.remove('selected')
                );
                this.classList.add('selected');
                window.__selectedColor = this.dataset.color;
            });
        });
    </script>
</body>
</html>
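
Both evaluate() and solve() depend on the data-color attributes in this page matching the target_color stored in the task metadata. One way to review that quickly is to list the attributes with Python's standard-library HTML parser. This is a sketch only: ColorBoxCollector and palette_colors are hypothetical names, and the sample string is a shortened stand-in for the generated file.

```python
from html.parser import HTMLParser


class ColorBoxCollector(HTMLParser):
    """Collect the value of every data-color attribute in a document."""

    def __init__(self):
        super().__init__()
        self.colors = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "data-color" in attr_map:
            self.colors.append(attr_map["data-color"])


def palette_colors(html_text):
    """Return the data-color values in document order."""
    collector = ColorBoxCollector()
    collector.feed(html_text)
    return collector.colors


# Shortened stand-in for gui/index.html (assumption: same attribute scheme).
sample = (
    '<div class="palette">'
    '<div class="color-box" data-color="red"></div>'
    '<div class="color-box" data-color="blue"></div>'
    "</div>"
)

target = "red"
colors = palette_colors(sample)
print(colors)
print("target present:", target in colors)
```

To check the real file, read it with Path("tasks/color-picker-env/gui/index.html").read_text() and pass the result to palette_colors.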

Step 4: Test the Generated Task

Run Interactively

cb interact tasks/color-picker-env --variant-id 0

Try clicking different colors and see how the UI responds.

Run with Oracle Solution

cb interact tasks/color-picker-env --oracle --no-wait

Note: For headless servers, use xvfb-run:

xvfb-run -a cb interact tasks/color-picker-env --oracle --no-wait

Run with Agent

cb interact tasks/color-picker-env --variant-id 0
# Agent will be prompted interactively

Step 5: Iterate on the Task

If the generated task isn't quite right, you can:

Option A: Regenerate with Better Prompt

cb task generate "Create a color picker with 9 colors in a 3x3 grid. The target color should be randomly selected from the palette. Show the target color name prominently above the grid."

Option B: Continue in Claude Code

Run the generate command again and ask Claude to modify the existing task:

cb task generate "Modify the existing color picker to add a timer that gives bonus points for fast selections"

Option C: Manual Refinement

Edit the generated files directly:

# Add more task variants
@cb.tasks_config(split="train")
def load():
    colors = ["red", "blue", "green", "yellow", "purple", "orange"]

    return [
        cb.Task(
            description=f"Select the {color} color from the palette.",
            metadata={"target_color": color},
            computer={...}
        )
        for color in colors
    ]
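
The comprehension above yields one task per color. The same pattern can be tried stand-alone with plain dictionaries in place of cb.Task, so it runs without cua_bench installed; build_variants is a hypothetical helper used only for illustration.

```python
def build_variants(colors):
    """Mirror the list comprehension in load(), using dicts instead of cb.Task."""
    return [
        {
            "description": f"Select the {color} color from the palette.",
            "metadata": {"target_color": color},
        }
        for color in colors
    ]


variants = build_variants(["red", "blue", "green", "yellow", "purple", "orange"])
for index, variant in enumerate(variants):
    print(index, variant["description"])
```

When adding variants this way, make sure solve() and the page heading take the target color from the task metadata rather than assuming red; otherwise only the red variant can pass.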

Step 6: Generate More Complex Tasks

Multi-Step Task

cb task generate "Create a todo list app where the user must:
1. Add a task called 'Buy groceries'
2. Mark it as complete
3. Delete the task
Track which steps are completed for partial rewards."
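
A partial reward for a prompt like this one could be computed as the fraction of required steps completed. The sketch below is an assumption about how such scoring might look, not what Claude will necessarily generate; the step names are hypothetical, and in a real task the completed set would presumably be read from page state in evaluate(), as the color picker does with window.__selectedColor.

```python
def partial_reward(completed, required):
    """Return the fraction of required steps found in `completed`, in [0.0, 1.0]."""
    if not required:
        return 0.0
    done = sum(1 for step in required if step in completed)
    return done / len(required)


# Hypothetical step names for the todo list prompt.
steps = ["added", "completed", "deleted"]
print(partial_reward({"added", "completed"}, steps))
```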

Form-Based Task

cb task generate "Create a registration form with fields for name, email, and password.
The agent must fill in the form with valid data and submit it.
Validate that email contains @ and password is at least 8 characters."
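
The validation rules in this prompt are simple enough to state as a single predicate, which is a useful reference when reviewing whatever evaluate() Claude produces. This is a sketch; is_valid_registration is a hypothetical name, and the non-empty name check is an added assumption that the prompt does not state.

```python
def is_valid_registration(name, email, password):
    """Apply the prompt's rules: '@' in the email and a password of 8+ characters.

    The non-empty name requirement is an assumption beyond the prompt.
    """
    return bool(name.strip()) and "@" in email and len(password) >= 8


print(is_valid_registration("Ada", "ada@example.com", "s3cretpw"))
print(is_valid_registration("Ada", "ada.example.com", "s3cretpw"))
```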

Game Task

cb task generate "Create a simple memory card matching game with 8 cards (4 pairs).
The agent must match all pairs. Give partial reward based on number of pairs found."

Tips for Good Prompts

Be Specific

# Good
cb task generate "Create a dropdown menu with 5 options (Apple, Banana, Cherry, Date, Elderberry). The agent must select 'Cherry'."

# Too vague
cb task generate "Create a dropdown task"

Include Success Criteria

# Good
cb task generate "Create a slider that goes from 0 to 100. The agent must set it to exactly 75. Give reward of 1.0 for exact match, 0.5 for within 5 points."

# Missing criteria
cb task generate "Create a slider task"
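
Spelling out success criteria pays off because they translate directly into scoring code. The slider prompt above, for example, maps onto a tiered reward like the following sketch. slider_reward is a hypothetical name, and treating "within 5 points" as inclusive is an assumption.

```python
def slider_reward(value, target=75, tolerance=5):
    """Return 1.0 for an exact match, 0.5 within `tolerance` points, else 0.0."""
    if value == target:
        return 1.0
    if abs(value - target) <= tolerance:
        return 0.5
    return 0.0


for value in (75, 72, 80, 60):
    print(value, slider_reward(value))
```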

Specify Visual Requirements

# Good
cb task generate "Create a dark-themed calculator with large buttons (at least 50x50 pixels) and clear labels."

# No visual guidance
cb task generate "Create a calculator"

What Claude Can and Can't Do

Claude CAN:

  • Create HTML/CSS/JavaScript UIs
  • Implement game logic
  • Write evaluation functions
  • Create multiple task variants
  • Add partial reward calculations

Claude CANNOT:

  • Access external APIs at generation time
  • Create native application tasks (use templates for those)
  • Generate images (use placeholder colors/text)
  • Test the task (you need to do this)

Complete Workflow Example

# 1. Generate initial task
cb task generate "Create a simple quiz with 3 multiple choice questions about Python"

# 2. Test interactively
cb interact tasks/python-quiz-env --variant-id 0

# 3. Verify oracle works
cb interact tasks/python-quiz-env --oracle --no-wait
# On headless servers: xvfb-run -a cb interact tasks/python-quiz-env --oracle --no-wait

# 4. Test with agent
cb interact tasks/python-quiz-env --variant-id 0

# 5. Manually refine if needed
# Edit tasks/python-quiz-env/main.py

# 6. Verify changes still work
cb interact tasks/python-quiz-env --oracle --no-wait
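
If you repeat the oracle check often, the command can be assembled in a small script. The sketch below only builds and prints the command using the flags shown in this guide; oracle_command is a hypothetical helper, not part of cb.

```python
import shlex


def oracle_command(task_dir, headless=False):
    """Build the oracle check command, wrapped in xvfb-run on headless servers."""
    cmd = ["cb", "interact", task_dir, "--oracle", "--no-wait"]
    if headless:
        cmd = ["xvfb-run", "-a"] + cmd
    return cmd


print(shlex.join(oracle_command("tasks/python-quiz-env")))
print(shlex.join(oracle_command("tasks/python-quiz-env", headless=True)))
```

To execute the command rather than print it, pass the list to subprocess.run(cmd, check=True).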
