Demonstration-Guided Skills

Use human demonstrations to guide agent behavior and improve task completion

Human demonstrations provide powerful context for guiding agent behavior. Once an agent has been shown how a task is performed, it can learn the workflow pattern and generalize to similar tasks with different inputs.

Overview

The demonstration-guided workflow consists of three stages:

  1. Record - Capture a human performing the task via VNC recording
  2. Process - Convert the recording into semantic action traces with VLM captions
  3. Prompt - Use the processed trajectory to guide agent execution

This approach enables:

  • One-shot learning: Teach a workflow once, execute it many times
  • Reduced errors: Agents follow demonstrated patterns instead of discovering them by trial and error
  • Complex workflows: Handle multi-step tasks that require specific sequences

Example: Using a Demonstration to Guide an Agent

1. Record a demonstration

cua skills record --sandbox my-sandbox

When prompted, name it flight-booking and describe it as "Book a flight from NYC to LA".

2. View the saved skill

The skill is saved to ~/.cua/skills/flight-booking/SKILL.md. To view it, run:

cua skills read flight-booking

3. Use the skill to guide an agent

Read the SKILL.md file and pass its contents as context ahead of your task prompt:

import asyncio
from pathlib import Path

from agent import ComputerAgent

# Load the skill prompt from SKILL.md
skill_path = Path.home() / ".cua" / "skills" / "flight-booking" / "SKILL.md"
skill_prompt = skill_path.read_text(encoding="utf-8")


async def main():
    agent = ComputerAgent(loop="anthropic")

    # Send the skill as context first, then give the actual task
    messages = [
        {"role": "user", "content": skill_prompt},
        {"role": "user", "content": "Book a flight from Boston to Chicago for next Friday"},
    ]

    # `async for` is only valid inside a coroutine, so the loop lives in main()
    async for result in agent.run(messages):
        print(result)


asyncio.run(main())

The skill contains a natural-language description of the demonstrated workflow, allowing the agent to follow the same pattern while adapting to the new parameters (Boston→Chicago instead of NYC→LA).
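Because the same skill is meant to be reused with different inputs, it can help to wrap the loading and message-building steps in small helpers. The following is a minimal sketch, not part of the cua package: `load_skill` and `build_skill_messages` are hypothetical names, and the only assumption about the library is the message format used in the example above.

```python
from __future__ import annotations

from pathlib import Path


def load_skill(name: str, root: Path | None = None) -> str:
    """Read SKILL.md for a named skill (hypothetical helper).

    `root` defaults to ~/.cua/skills, the location described in this guide.
    """
    root = root if root is not None else Path.home() / ".cua" / "skills"
    return (root / name / "SKILL.md").read_text(encoding="utf-8")


def build_skill_messages(skill_text: str, task: str) -> list[dict]:
    """Return a message list with the skill text first, then the new task."""
    skill_text = skill_text.strip()
    task = task.strip()
    if not task:
        raise ValueError("task must not be empty")

    messages = []
    if skill_text:
        # The demonstration goes first so the task is read in its context
        messages.append({"role": "user", "content": skill_text})
    messages.append({"role": "user", "content": task})
    return messages
```

With these helpers, running the same skill on a new route only requires changing the task string passed to `build_skill_messages`.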

Skill Format

Each skill is saved to ~/.cua/skills/<name>/ with the following structure:

~/.cua/skills/flight-booking/
├── SKILL.md              # Skill description and agent prompt
└── trajectory/
    ├── trajectory.json   # Raw events and captions
    ├── events.json       # Input events
    ├── flight-booking.mp4 # Video recording
    ├── step_1_full.jpg   # Full screenshot for step 1
    ├── step_1_crop.jpg   # Cropped screenshot with marker
    └── ...
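To inspect a recorded trajectory, the per-step screenshots can be grouped by step number. The sketch below assumes the step_N_full.jpg / step_N_crop.jpg naming shown in the tree continues for later steps; `collect_step_images` is a hypothetical helper written for illustration, not a cua API.

```python
from __future__ import annotations

import re
from pathlib import Path

# Assumed naming pattern, taken from the directory listing: step_<N>_<full|crop>.jpg
STEP_IMAGE = re.compile(r"^step_(\d+)_(full|crop)\.jpg$")


def collect_step_images(trajectory_dir: Path) -> dict[int, dict[str, Path]]:
    """Map each step number to its 'full' and 'crop' screenshots, in step order."""
    steps: dict[int, dict[str, Path]] = {}
    for path in trajectory_dir.iterdir():
        match = STEP_IMAGE.match(path.name)
        if match is None:
            continue  # skip trajectory.json, events.json, the video, etc.
        number, kind = int(match.group(1)), match.group(2)
        steps.setdefault(number, {})[kind] = path
    # Sort numerically so that step 10 comes after step 2
    return dict(sorted(steps.items()))
```

For example, `collect_step_images(Path.home() / ".cua/skills/flight-booking/trajectory")` would return a dictionary keyed by step number, which makes it easy to spot a step that is missing one of its two images.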

The SKILL.md file contains the skill metadata, step descriptions, and agent prompt:

---
name: flight-booking
description: Book a flight from NYC to LA
---

# flight-booking

Book a flight from NYC to LA

## Steps

### Step 1: Click on the departure city field

**Context:** A flight booking form with empty departure and arrival fields...
**Intent:** User wants to enter the departure city
**Expected Result:** Field becomes focused for text input

### Step 2: Type "NYC"

...

## Agent Prompt

You have been shown a demonstration of how to perform this task:
Book a flight from NYC to LA

The demonstration consisted of the following steps:
Step 1: Click on the departure city field
Step 2: Type "NYC"
...
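Since SKILL.md follows a regular layout (front matter delimited by `---`, then `### Step N: ...` headings), its metadata and step titles can be extracted with the standard library alone. This is a rough sketch that assumes the layout shown above; `parse_skill` is a hypothetical helper, and it handles only simple `key: value` front matter rather than full YAML.

```python
import re

STEP_HEADING = re.compile(r"^### Step (\d+): (.+)$", flags=re.MULTILINE)


def parse_skill(text):
    """Return (metadata dict, list of (step number, title)) from SKILL.md text."""
    lines = text.splitlines()
    meta = {}
    body_start = 0

    if lines and lines[0].strip() == "---":
        # Find the closing delimiter; ignore the front matter if it never closes
        end = next(
            (i for i in range(1, len(lines)) if lines[i].strip() == "---"),
            None,
        )
        if end is not None:
            for line in lines[1:end]:
                key, sep, value = line.partition(":")
                if sep:
                    meta[key.strip()] = value.strip()
            body_start = end + 1

    body = "\n".join(lines[body_start:])
    steps = [(int(n), title.strip()) for n, title in STEP_HEADING.findall(body)]
    return meta, steps
```

Only the `### Step` headings are matched, so the plain "Step 1: ..." lines repeated under the Agent Prompt section are not counted twice.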
