Browser Agent

The Browser Agent is a core component of the Q-ACE framework, providing high-level AI-powered web automation. It leverages the browser-use library to translate natural language tasks into precise browser actions.

🚀 Key Capabilities

Natural Language Control: Describe tasks like "Find the lowest price for a 4K monitor on Amazon" or "Log in to the staging portal and verify the latest report."
Real-Time Visual Feedback: Watch the agent work through a live SSE (Server-Sent Events) stream, showing every step and URL transition.
Intelligent Decision Making: Powered by advanced LLMs that understand DOM structures and interactive elements.
Multimodal Support: Uses vision (screenshots) to navigate complex UIs where text analysis alone might fail.
Robust Failure Recovery: Automatically retries failed actions and adjusts strategies when encountering unexpected layouts.
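To illustrate the live stream, here is a minimal sketch of how per-step updates could be serialized as Server-Sent Events frames. The helper names (format_sse, stream_steps) and the payload fields are assumptions made for this example, not the framework's actual event schema; only the frame layout ("event:" and "data:" lines ended by a blank line) comes from the SSE format itself.

```python
import json


def format_sse(event: str, data: dict) -> str:
    """Serialize one update as a Server-Sent Events frame."""
    # json.dumps emits a single line by default, so one "data:" field is enough.
    payload = json.dumps(data, ensure_ascii=False)
    # A frame is a set of "field: value" lines terminated by a blank line.
    return f"event: {event}\ndata: {payload}\n\n"


def stream_steps(steps: list) -> "Iterator[str]":
    """Yield one frame per agent step, then a final 'done' frame."""
    for number, step in enumerate(steps, start=1):
        yield format_sse("step", {"step": number, **step})
    yield format_sse("done", {"total_steps": len(steps)})


if __name__ == "__main__":
    # Hypothetical step records; the real agent's fields may differ.
    demo = [
        {"action": "navigate", "url": "https://example.com"},
        {"action": "click", "url": "https://example.com/login"},
    ]
    for frame in stream_steps(demo):
        print(frame, end="")
```

A generator like this is the kind of object a streaming HTTP response can iterate over, which lets the browser render each step as soon as it is produced.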

🛠️ Technology Stack

Library: browser-use v0.12+
Engine: Playwright (via browser-use)
Runtime: Isolated .venv managed by the uv package manager for fast dependency installs.
Interface: FastAPI SSE for streaming updates to a modern Alpine.js frontend.

🤖 Supported LLM Providers

The Browser Agent works with the following LLM providers:

Google Gemini: Optimized for efficiency and speed (e.g., gemini-2.5-flash).
Ollama: Run privacy-focused automation locally (gemma3:1b or larger is recommended).
OpenAI: Support for o3, gpt-4o, etc.
Anthropic: High-precision reasoning with claude-3-5-sonnet.
DeepSeek: Specialized coding and reasoning models.
Azure/AWS Bedrock: Enterprise-grade cloud integrations.
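As a rough sketch of how a provider choice might be turned into a concrete model name, the snippet below maps a provider key to a default model and lets an explicit model override it. The provider keys, the DEFAULT_MODELS table, and resolve_model are hypothetical; the model names are only those mentioned in the list above, and providers without a named model there are deliberately left out.

```python
from typing import Optional, Tuple

# Hypothetical defaults, limited to the model names mentioned in this document.
DEFAULT_MODELS = {
    "google": "gemini-2.5-flash",
    "ollama": "gemma3:1b",
    "openai": "gpt-4o",
    "anthropic": "claude-3-5-sonnet",
}


def resolve_model(provider: str, model: Optional[str] = None) -> Tuple[str, str]:
    """Return (provider_key, model_name), preferring an explicit model."""
    key = provider.strip().lower()
    if model:
        return key, model
    if key not in DEFAULT_MODELS:
        # No default is known for this provider, so the caller must name a model.
        raise ValueError(f"no default model for provider {provider!r}")
    return key, DEFAULT_MODELS[key]


if __name__ == "__main__":
    print(resolve_model("Google"))
    print(resolve_model("openai", "o3"))
```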

⚙️ Configuration & Customization

The agent is highly configurable via the Browser Settings and Agent Settings tabs:

Browser Settings

Headless Mode: Run silently in the background or watch the window in "headful" mode.
Recording & Tracing: Enable video recording (ffmpeg required) and Playwright tracing for debugging.
Device Emulation: Set custom window/viewport sizes and User-Agents.
Domain Restrictions: Define allowed_domains or prohibited_domains for safety.
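The domain restriction setting can be pictured as a check on each URL's host before navigation. The sketch below is one plausible reading, assuming that a listed domain also covers its subdomains and that prohibited entries take precedence; is_url_allowed is a hypothetical helper, and the matching rules actually used by the agent or by browser-use may differ.

```python
from urllib.parse import urlsplit


def _matches(host: str, domain: str) -> bool:
    """True if host equals domain or is a subdomain of it."""
    domain = domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


def is_url_allowed(url, allowed_domains=None, prohibited_domains=None) -> bool:
    """Decide whether navigation to url is permitted (assumed semantics)."""
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        # URLs without a host (for example, malformed input) are rejected.
        return False
    # Assumption: a prohibited entry always wins over an allowed one.
    if any(_matches(host, d) for d in prohibited_domains or []):
        return False
    # Assumption: an empty or missing allow-list means "no restriction".
    if allowed_domains:
        return any(_matches(host, d) for d in allowed_domains)
    return True


if __name__ == "__main__":
    print(is_url_allowed("https://shop.example.com/cart", allowed_domains=["example.com"]))
    print(is_url_allowed("https://evil-example.com", allowed_domains=["example.com"]))
```

Matching on the parsed hostname with a leading dot, rather than on a substring of the URL, keeps look-alike hosts such as evil-example.com from passing an example.com allow-list.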

Agent Settings

Thinking Mode: Allow the agent to pause and "think" before committing to an action.
Flash Mode: Rapid-fire execution for simple, high-speed tasks.
Max Steps/Actions: Control the execution depth to prevent infinite loops or excessive token usage.
Safety: Built-in sensitive data filtering to protect credentials and private info.
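The idea behind sensitive data filtering can be sketched as replacing known secret values with named placeholders before text is logged or sent to an LLM. This is an illustration only: redact and the placeholder format are assumptions, not the filtering mechanism the agent actually ships with.

```python
def redact(text: str, secrets: dict) -> str:
    """Replace each secret value found in text with a <name> placeholder."""
    # Longer values go first so a secret containing a shorter one is masked whole.
    ordered = sorted(secrets.items(), key=lambda item: len(item[1]), reverse=True)
    for name, value in ordered:
        if value:  # skip empty values, which would otherwise match everywhere
            text = text.replace(value, f"<{name}>")
    return text


if __name__ == "__main__":
    # Hypothetical credentials used only for this demonstration.
    secrets = {"password": "hunter2", "username": "qa@example.com"}
    print(redact("typed hunter2 for qa@example.com", secrets))
```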

📊 History & Analytics

Every execution is logged and stored in the local data/auth.db database:

Step-by-Step Replay: View the history of actions, thoughts, and extracted data.
AI-Powered Analysis: After a run, use the "Analyze" feature to get a professional Root Cause Analysis (RCA) and performance report generated by an LLM.
Stats Dashboard: Track success rates and daily trends across your automation suite.
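A dashboard figure such as the daily success rate could be derived with a single SQL aggregation. The sketch below uses an in-memory SQLite database with a made-up runs table (started_at, status); the real schema of data/auth.db is not described here, so treat the table and column names as assumptions.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical 'runs' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE runs (id INTEGER PRIMARY KEY, started_at TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO runs (started_at, status) VALUES (?, ?)",
        [
            ("2025-01-01 09:00:00", "success"),
            ("2025-01-01 10:00:00", "failed"),
            ("2025-01-02 09:00:00", "success"),
        ],
    )
    return conn


def daily_success_rates(conn: sqlite3.Connection) -> list:
    """Return (day, total_runs, success_rate) tuples ordered by day."""
    rows = conn.execute(
        """
        SELECT date(started_at) AS day,
               COUNT(*) AS total,
               SUM(status = 'success') AS succeeded
        FROM runs
        GROUP BY day
        ORDER BY day
        """
    ).fetchall()
    # In SQLite a comparison evaluates to 1 or 0, so SUM counts the successes.
    return [(day, total, succeeded / total) for day, total, succeeded in rows]


if __name__ == "__main__":
    for row in daily_success_rates(build_demo_db()):
        print(row)
```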

🚀 Execution Model

The agent runs in a dedicated subprocess (handlers/browser_agent_runner.py). This ensures that:

The main FastAPI server remains responsive.
The agent has its own resource pool.
Errors in automation don't crash the entire framework.
Cleanup and process management work reliably on both Windows and Linux.
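The isolation described above can be sketched with the standard subprocess module: the parent launches a child, enforces a timeout, and turns a crash into an error result instead of an exception in the server. run_isolated and the JSON-lines output format are assumptions for illustration, not the contents of handlers/browser_agent_runner.py. For simplicity this version collects output once the child exits; a streaming runner would read stdout line by line instead.

```python
import json
import subprocess
import sys


def run_isolated(cmd: list, timeout: float = 30.0) -> dict:
    """Run cmd in a child process and collect JSON-lines events from stdout."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # kill() works on both Windows and Linux; reap the child afterwards.
        proc.kill()
        proc.communicate()
        return {"ok": False, "events": [], "error": "timeout"}
    events = [json.loads(line) for line in out.splitlines() if line.strip()]
    # A non-zero exit code is reported to the caller rather than raised.
    return {"ok": proc.returncode == 0, "events": events, "error": err.strip() or None}


if __name__ == "__main__":
    # Stand-in child processes; a real runner script would be launched here.
    good = [sys.executable, "-c", "import json; print(json.dumps({'step': 1}))"]
    bad = [sys.executable, "-c", "raise RuntimeError('boom')"]
    print(run_isolated(good))
    print(run_isolated(bad)["ok"])
```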

Built with ❤️ by ATID College
