
Using the Agent SDK

Add AI automation with vision-language models

Now that you've verified your sandbox works with the Sandbox SDK, use an Agent to automate complex tasks. The agent interacts with the computer environment using a vision-language model to understand the UI and execute actions.

The Cua Agent Framework provides:

  • 100+ VLM options through Cua VLM Router and direct provider access
  • Built-in optimizations for computer-use tasks
  • Structured agent loops for consistent behavior

Installation

Using uv (recommended):

uv pip install cua

Or with pip:

pip install cua

For TypeScript, install the Cua computer library (the ai package is the Vercel AI SDK, optional unless you build loops with it):

npm install @trycua/computer ai

Choose Your Model Provider

Use Cua's inference API to access multiple model providers with a single API key (same key used for sandbox access). Cua VLM Router provides intelligent routing and cost optimization.

import os
import asyncio
from cua import Sandbox, Image, ComputerAgent

os.environ["CUA_API_KEY"] = "sk_cua-api01_..."

async def main():
    async with Sandbox.ephemeral(Image.linux()) as sb:
        agent = ComputerAgent(
            model="cua/anthropic/claude-sonnet-4.5",  # Cua-routed model
            tools=[sb],
            max_trajectory_budget=5.0
        )

        messages = [{"role": "user", "content": "Take a screenshot and tell me what you see"}]

        async for result in agent.run(messages):
            for item in result["output"]:
                if item["type"] == "message":
                    print(item["content"][0]["text"])

asyncio.run(main())

Available Cua models:

  • cua/anthropic/claude-sonnet-4.5 - Claude Sonnet 4.5 (recommended)
  • cua/anthropic/claude-opus-4.5 - Claude Opus 4.5 (enhanced agentic capabilities)
  • cua/anthropic/claude-haiku-4.5 - Claude Haiku 4.5 (faster, cost-effective)
  • cua/google/gemini-3-pro-preview - Gemini 3 Pro Preview (most powerful multimodal)
  • cua/google/gemini-3-flash-preview - Gemini 3 Flash Preview (fastest and cheapest)

Available composed models:

  • huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-sonnet-4-5-20250929 - GTA1 grounding + Claude Sonnet 4.5 planning
  • huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-5 - GTA1 grounding + GPT-5 planning
  • huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B+openai/gpt-4o - UI-TARS grounding + GPT-4o planning
  • moondream3+openai/gpt-4o - Moondream3 grounding + GPT-4o planning
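Composed model identifiers follow a simple convention: a grounding model (which locates UI elements) and a planning model (which decides actions), joined by a `+`. As a sketch of that convention, the helper below splits an identifier into its two parts; it is illustrative only and not part of the SDK, which parses these strings internally.

```python
def split_composed_model(model: str) -> tuple[str, str]:
    """Split a composed model string into (grounding, planning) parts.

    Illustrative helper; the Cua SDK handles these identifiers itself.
    """
    grounding, sep, planning = model.partition("+")
    if not sep:
        raise ValueError(f"not a composed model identifier: {model!r}")
    return grounding, planning

# The part before "+" grounds clicks; the part after plans the task.
print(split_composed_model("moondream3+openai/gpt-4o"))
# ('moondream3', 'openai/gpt-4o')
```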

Benefits:

  • Single API key for multiple providers
  • Cost tracking and optimization
  • No need to manage multiple provider keys

Use your own API keys from model providers like Anthropic, OpenAI, or others.

import asyncio
from cua import Sandbox, Image, ComputerAgent

# Set your provider API key
import os
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."  # For Anthropic
# OR
os.environ["OPENAI_API_KEY"] = "sk-..."  # For OpenAI

async def main():
    async with Sandbox.ephemeral(Image.linux()) as sb:
        agent = ComputerAgent(
            model="anthropic/claude-sonnet-4-5-20250929",  # Direct provider model
            tools=[sb],
            max_trajectory_budget=5.0
        )

        messages = [{"role": "user", "content": "Take a screenshot and tell me what you see"}]

        async for result in agent.run(messages):
            for item in result["output"]:
                if item["type"] == "message":
                    print(item["content"][0]["text"])

asyncio.run(main())
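Both examples above walk `result["output"]` by hand to find message text. If you process results in several places, a small helper keeps that tidy. The function below is hypothetical (not part of the SDK); it assumes the result shape used in the examples above.

```python
def extract_message_texts(result: dict) -> list[str]:
    """Collect the text of every message item in an agent result.

    Hypothetical helper; assumes the result dict shape shown in the
    examples above ({"output": [{"type": "message", "content": [...]}]}).
    """
    texts = []
    for item in result.get("output", []):
        if item.get("type") == "message":
            for part in item.get("content", []):
                if "text" in part:
                    texts.append(part["text"])
    return texts

# A result shaped like the ones the agent yields:
sample = {"output": [{"type": "message",
                      "content": [{"text": "I see a desktop."}]}]}
print(extract_message_texts(sample))  # ['I see a desktop.']
```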

Supported providers:

  • anthropic/claude-* - Anthropic Claude models
  • openai/gpt-* - OpenAI GPT models
  • openai/o1-* - OpenAI o1 models
  • huggingface-local/* - Local HuggingFace models
  • And many more via LiteLLM

See Supported Models for the complete list.

For TypeScript, you can build your own agent loop with the Cua Computer TypeScript library (@trycua/computer) and a model provider SDK such as Anthropic's, or use the Vercel AI SDK.

import Anthropic from '@anthropic-ai/sdk';
import { Computer, OSType } from '@trycua/computer';

const client = new Anthropic();
let computer: Computer;

const computerTool = {
  name: 'computer',
  description: 'Control the computer with actions like screenshot, click, type, etc.',
  input_schema: {
    type: 'object' as const,
    type: 'object' as const,
    properties: {
      action: {
        type: 'string' as const,
        description: 'Action to perform (screenshot, click, type, key_press, etc.)',
      },
      coordinate: {
        type: 'array' as const,
        items: { type: 'number' as const },
        description: 'x, y coordinates for click actions',
      },
      text: {
        type: 'string' as const,
        description: 'Text to type',
      },
    },
    required: ['action'],
  },
};

async function runAgentLoop(goal: string) {
  // Initialize computer
  computer = new Computer({
    osType: OSType.LINUX,
    provider_type: 'cloud',
    name: 'your-sandbox-name',
    apiKey: process.env.CUA_API_KEY!,
  });

  await computer.run();

  const messages: any[] = [];
  messages.push({ role: 'user', content: goal });

  // Agent loop
  for (let i = 0; i < 10; i++) {
    const response = await client.messages.create({
      model: 'claude-opus-4-1-20250805',
      max_tokens: 4096,
      tools: [computerTool],
      messages: messages,
    });

    messages.push({ role: 'assistant', content: response.content });

    if (response.stop_reason === 'end_turn') {
      // Surface the model's final text before finishing
      for (const block of response.content) {
        if (block.type === 'text') console.log(block.text);
      }
      console.log('Task completed!');
      break;
    }

    // Process tool calls
    const toolResults = [];
    for (const block of response.content) {
      if (block.type === 'tool_use') {
        // Cast the tool input to the schema declared above
        const input = block.input as { action: string; coordinate?: [number, number]; text?: string };
        let result: any;
        switch (input.action) {
          case 'screenshot':
            result = await computer.interface.screenshot();
            break;
          case 'click':
            result = await computer.interface.click(input.coordinate![0], input.coordinate![1]);
            break;
          case 'type':
            result = await computer.interface.type(input.text!);
            break;
        }
        toolResults.push({
          type: 'tool_result',
          tool_use_id: block.id,
          content: JSON.stringify(result),
        });
      }
    }

    if (toolResults.length > 0) {
      messages.push({ role: 'user', content: toolResults });
    }
  }

  await computer.disconnect();
}

runAgentLoop('Take a screenshot and tell me what you see');

For more details, see the Vercel AI SDK Computer Use Cookbook.
