GUI Grounding with Gemini 3
Using Google's Gemini 3 with OmniParser for Advanced GUI Grounding Tasks
Overview
This example demonstrates how to use Google's Gemini 3 models with OmniParser for complex GUI grounding tasks. Gemini 3 Pro achieves 72.7% accuracy on the ScreenSpot-Pro benchmark (compared to Claude Sonnet 4.5's 36.2%), making it well suited to precise UI element location and complex navigation tasks.
Why Gemini 3 for UI Navigation?
According to Google's Gemini 3 announcement, Gemini 3 Pro achieves:
- 72.7% on ScreenSpot-Pro (vs. Gemini 2.5 Pro's 11.4%)
- Industry-leading performance on complex UI navigation tasks
- Advanced multimodal understanding for high-resolution screens
What You'll Build
This guide shows how to:
- Set up Vertex AI with proper authentication
- Use OmniParser with Gemini 3 for GUI element detection
- Leverage Gemini 3-specific features like `thinking_level` and `media_resolution`
- Create agents that can perform complex multi-step UI interactions
Set Up Google Cloud and Vertex AI
Before using Gemini 3 models, you need to enable Vertex AI in Google Cloud Console.
1. Create a Google Cloud Project
- Go to Google Cloud Console
- Click Select a project → New Project
- Enter a project name and click Create
- Note your Project ID (you'll need this later)
2. Enable Vertex AI API
- Navigate to Vertex AI API
- Select your project
- Click Enable
3. Enable Billing
4. Create a Service Account
- Go to IAM & Admin > Service Accounts
- Click Create Service Account
- Enter a name (e.g., "cua-gemini-agent")
- Click Create and Continue
- Grant the Vertex AI User role
- Click Done
5. Create and Download Service Account Key
- Click on your newly created service account
- Go to Keys tab
- Click Add Key → Create new key
- Select JSON format
- Click Create (the key file will download automatically)
- Important: Store this key file securely! It contains credentials for accessing your Google Cloud resources
Never commit your service account JSON key to version control! Add it to .gitignore immediately.
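For example, assuming your key file is named `service-account-key.json` (substitute your actual filename), a minimal `.gitignore` entry might look like this:

```
# Keep credentials out of version control
service-account-key.json
.env
```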
Install Dependencies
Install the required packages for OmniParser and Gemini 3:
Create a `requirements.txt` file:

```
cua-agent
cua-computer
cua-som  # OmniParser for GUI element detection
litellm>=1.0.0
python-dotenv>=1.0.0
google-cloud-aiplatform>=1.70.0
```

Install the dependencies:

```bash
pip install -r requirements.txt
```

Configure Environment Variables
Create a `.env` file in your project root:

```bash
# Google Cloud / Vertex AI credentials
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-service-account-key.json

# Cua credentials (for cloud sandboxes)
CUA_API_KEY=sk_cua-api01...
CUA_SANDBOX_NAME=your-sandbox-name
```

Replace the values:

- `your-project-id`: Your Google Cloud Project ID from Step 1
- `/path/to/your-service-account-key.json`: Path to the JSON key file you downloaded
- `sk_cua-api01...`: Your Cua API key from the Cua dashboard
- `your-sandbox-name`: Your sandbox name (if using cloud sandboxes)
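Before moving on, it can help to confirm that the credentials actually load. The following is a minimal sketch (the `us-central1` region is an assumption; substitute whichever region you plan to use):

```python
# verify_credentials.py - minimal sketch to confirm Vertex AI credentials load
import os

from dotenv import load_dotenv
from google.cloud import aiplatform

load_dotenv()

# aiplatform.init() picks up GOOGLE_APPLICATION_CREDENTIALS automatically
aiplatform.init(
    project=os.environ["GOOGLE_CLOUD_PROJECT"],
    location="us-central1",  # assumption: substitute your preferred region
)
print(f"Vertex AI initialized for project {os.environ['GOOGLE_CLOUD_PROJECT']}")
```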
Create Your Complex UI Navigation Script
Create a Python file (e.g., `gemini_ui_navigation.py`). The full example below runs against a Cua cloud sandbox; Docker and Lume variants of the `Computer` configuration follow.

```python
import asyncio
import logging
import os
import signal
import traceback

from agent import ComputerAgent
from computer import Computer, VMProviderType
from dotenv import load_dotenv

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def handle_sigint(sig, frame):
    print("\n\nExecution interrupted by user. Exiting gracefully...")
    exit(0)


async def complex_ui_navigation():
    """
    Demonstrate Gemini 3's exceptional UI grounding capabilities
    with complex, multi-step navigation tasks.
    """
    try:
        async with Computer(
            os_type="linux",
            provider_type=VMProviderType.CLOUD,
            name=os.environ["CUA_SANDBOX_NAME"],
            api_key=os.environ["CUA_API_KEY"],
            verbosity=logging.INFO,
        ) as computer:
            agent = ComputerAgent(
                # Use OmniParser with Gemini 3 Pro for optimal GUI grounding
                model="omniparser+vertex_ai/gemini-3-pro-preview",
                tools=[computer],
                only_n_most_recent_images=3,
                verbosity=logging.INFO,
                trajectory_dir="trajectories",
                use_prompt_caching=False,
                max_trajectory_budget=5.0,
                # Gemini 3-specific parameters
                thinking_level="high",    # deeper reasoning (vs. "low")
                media_resolution="high",  # high-resolution image processing (vs. "low" or "medium")
            )

            # Complex GUI grounding tasks inspired by the ScreenSpot-Pro benchmark.
            # These test precise element location in professional UIs.
            tasks = [
                # Task 1: GitHub repository navigation
                {
                    "instruction": (
                        "Go to github.com/trycua/cua. "
                        "Find and click on the 'Issues' tab. "
                        "Then locate and click on the search box within the issues page "
                        "(not the global GitHub search). "
                        "Type 'omniparser' and press Enter."
                    ),
                    "description": "Tests precise UI element distinction in a complex interface",
                },
                # Task 2: Search for and install Visual Studio Code
                {
                    "instruction": (
                        "Open your system's app store (e.g., Microsoft Store). "
                        "Search for 'Visual Studio Code'. "
                        "In the search results, select 'Visual Studio Code'. "
                        "Click on 'Install' or 'Get' to begin the installation. "
                        "If prompted, accept any permissions or confirm the installation. "
                        "Wait for Visual Studio Code to finish installing."
                    ),
                    "description": (
                        "Tests the ability to search for an application and complete its "
                        "installation through a step-by-step app store workflow."
                    ),
                },
            ]

            history = []
            for i, task_info in enumerate(tasks, 1):
                task = task_info["instruction"]
                print(f"\n{'=' * 60}")
                print(f"[Task {i}/{len(tasks)}] {task_info['description']}")
                print(f"{'=' * 60}")
                print(f"\nInstruction: {task}\n")

                # Add the user message to the conversation history
                history.append({"role": "user", "content": task})

                # Run the agent with the full conversation history
                async for result in agent.run(history, stream=False):
                    history += result.get("output", [])

                    # Print output for debugging
                    for item in result.get("output", []):
                        if item.get("type") == "message":
                            for content_part in item.get("content", []):
                                if content_part.get("text"):
                                    logger.info(f"Agent: {content_part.get('text')}")
                        elif item.get("type") == "computer_call":
                            action = item.get("action", {})
                            action_type = action.get("type", "")
                            logger.debug(f"Computer Action: {action_type}")

                print(f"\n✅ Task {i}/{len(tasks)} completed")

            print("\n🎉 All complex UI navigation tasks completed successfully!")
    except Exception as e:
        logger.error(f"Error in complex_ui_navigation: {e}")
        traceback.print_exc()
        raise


def main():
    try:
        load_dotenv()

        # Validate required environment variables
        required_vars = [
            "GOOGLE_CLOUD_PROJECT",
            "GOOGLE_APPLICATION_CREDENTIALS",
            "CUA_API_KEY",
            "CUA_SANDBOX_NAME",
        ]
        missing_vars = [var for var in required_vars if not os.environ.get(var)]
        if missing_vars:
            raise RuntimeError(
                f"Missing required environment variables: {', '.join(missing_vars)}\n"
                f"Please check your .env file and ensure all keys are set.\n"
                f"See the setup guide for details on configuring Vertex AI credentials."
            )

        signal.signal(signal.SIGINT, handle_sigint)
        asyncio.run(complex_ui_navigation())
    except Exception as e:
        logger.error(f"Error running automation: {e}")
        traceback.print_exc()


if __name__ == "__main__":
    main()
```

To run the same script against a local Docker container instead of a cloud sandbox, only the `Computer` configuration changes (the `CUA_API_KEY` and `CUA_SANDBOX_NAME` entries in `required_vars` can then be dropped):

```python
async with Computer(
    os_type="linux",
    provider_type=VMProviderType.DOCKER,
    image="trycua/cua-xfce:latest",
    verbosity=logging.INFO,
) as computer:
    ...  # rest of the script is unchanged
```

For a local macOS VM via Lume:

```python
async with Computer(
    os_type="macos",
    provider_type=VMProviderType.LUME,
    name="macos-sequoia-cua:latest",
    verbosity=logging.INFO,
) as computer:
    ...  # rest of the script is unchanged
```

Run Your Script
Execute your complex UI navigation automation:
```bash
python gemini_ui_navigation.py
```

The agent will:
- Navigate to GitHub and locate specific UI elements
- Distinguish between similar elements (e.g., global search vs. issues search)
- Perform multi-step interactions with visual feedback
- Use Gemini 3's advanced reasoning for precise element grounding
Monitor the output to see the agent's progress through each task.
Understanding Gemini 3-Specific Parameters
thinking_level
Controls the amount of internal reasoning the model performs:
"high": Deeper reasoning, better for complex UI navigation (recommended for ScreenSpot-like tasks)"low": Faster responses, suitable for simpler tasks
media_resolution
Controls vision processing for multimodal inputs:
"high": Best for complex UIs with many small elements (recommended)"medium": Balanced quality and speed"low": Faster processing for simple interfaces
For tasks requiring precise GUI element location (like ScreenSpot-Pro), use `thinking_level="high"` and `media_resolution="high"` for optimal performance.
Benchmark Performance
Gemini 3 Pro's performance on ScreenSpot-Pro demonstrates its exceptional UI grounding capabilities:
| Model | ScreenSpot-Pro Score |
|---|---|
| Gemini 3 Pro | 72.7% |
| Claude Sonnet 4.5 | 36.2% |
| Gemini 2.5 Pro | 11.4% |
| GPT-5.1 | 3.5% |
This makes Gemini 3 the ideal choice for complex UI navigation, element detection, and professional GUI automation tasks.
Troubleshooting
Authentication Issues
If you encounter authentication errors:
- Verify your service account JSON key path is correct
- Ensure the service account has the Vertex AI User role
- Check that the Vertex AI API is enabled in your project
- Confirm your `GOOGLE_CLOUD_PROJECT` matches your actual project ID
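A quick diagnostic sketch for the first and last checks (service account key files include a `project_id` field, so the two values can be compared directly):

```python
# check_auth.py - compare the key file's project against GOOGLE_CLOUD_PROJECT
import json
import os

from dotenv import load_dotenv

load_dotenv()

key_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", "")
if not os.path.isfile(key_path):
    raise SystemExit(f"Key file not found: {key_path}")

with open(key_path) as f:
    key = json.load(f)

print("Key file project_id: ", key.get("project_id"))
print("GOOGLE_CLOUD_PROJECT:", os.environ.get("GOOGLE_CLOUD_PROJECT"))
```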
"Vertex AI API not enabled" Error
Run this command to enable the API:
```bash
gcloud services enable aiplatform.googleapis.com --project=YOUR_PROJECT_ID
```

Billing Issues
Ensure billing is enabled for your Google Cloud project. Visit the Billing section to verify.
Next Steps
- Learn more about OmniParser agent loops
- Explore Vertex AI pricing
- Read about ScreenSpot-Pro benchmark
- Check out Google's Gemini 3 announcement
- Join our Discord community for help