An intelligent autonomous browser agent driven by Chrome DevTools Protocol, supporting multi-LLM backends, vision understanding, and WebMCP protocol extension for end-to-end web tasks including navigation, dynamic interaction, and file operations.
Positioning#
Web-Use is an autonomous browser agent that drives real Chrome/Edge browsers directly through the Chrome DevTools Protocol (CDP) over WebSocket connections, giving LLMs end-to-end web operation capabilities.
Core Capabilities#
Autonomous Browsing & Interaction#
- Autonomous Web Navigation: Automatically navigate websites, fill forms, interact with dynamic content (SPAs, etc.) without human intervention
- Efficient Element Interaction: Indexed DOM elements for fast, precise click/input operations
- File Operations: Support file download and upload
- State Awareness: Maintain page state understanding, avoid infinite loops, and auto-recover from errors
- Smart Waiting: Handle loading states, animations, CAPTCHAs, OTPs, and other human-interactive scenarios
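The indexed-element idea in the list above can be illustrated with a small sketch. This is not Web-Use's actual implementation: the `Element` class and the `index_elements` and `describe` functions are hypothetical names, and the input is a hand-written list rather than a real DOM snapshot. It only shows why an integer index lets a model refer to an element without writing a selector.

```python
from dataclasses import dataclass


# Hypothetical, simplified stand-in for a node taken from a DOM snapshot.
@dataclass
class Element:
    tag: str
    text: str
    interactive: bool


def index_elements(elements):
    """Assign consecutive integer indexes to interactive elements only."""
    interactive = [el for el in elements if el.interactive]
    return dict(enumerate(interactive))


def describe(indexed):
    """Render the index as compact text that a prompt could include."""
    return "\n".join(f"[{i}] <{el.tag}> {el.text}" for i, el in indexed.items())


page = [
    Element("h1", "Sign in", False),
    Element("input", "Email", True),
    Element("input", "Password", True),
    Element("button", "Log in", True),
]
indexed = index_elements(page)
listing = describe(indexed)
print(listing)
```

With a listing like this, an action such as "click element 2" is unambiguous and short, which is presumably the motivation for indexing elements before each step.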
Multi-Model & Vision#
- Multi-LLM Support: 13 built-in providers — Anthropic Claude, Google Gemini, OpenAI, Groq, Ollama, Cerebras, Mistral, DeepSeek, NVIDIA, vLLM, Azure OpenAI, OpenRouter, LiteLLM
- Vision Capability: Screenshot-based page visual understanding for decision support (`use_vision=True`)
Protocol & Extension#
- WebMCP (Web Model Context Protocol): Automatically discover site-exposed custom tools, dynamically register and invoke them like built-in tools; supports parameter validation and schema display
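The discover, register, and invoke flow described above might look roughly like the following. This is a sketch under assumptions: `ToolRegistry`, its methods, the `search_docs` tool, and the schema shape (parameter name mapped to a Python type) are all hypothetical and are not taken from WebMCP or from Web-Use's code.

```python
# Hypothetical registry; the names and schema shape are illustrative
# assumptions, not the WebMCP or Web-Use API.
class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, schema, handler):
        """Store a tool with its required-parameter schema and callable."""
        self._tools[name] = (schema, handler)

    def invoke(self, name, **params):
        """Validate parameters against the schema, then call the handler."""
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        schema, handler = self._tools[name]
        for key, expected_type in schema.items():
            if key not in params:
                raise ValueError(f"missing parameter: {key}")
            if not isinstance(params[key], expected_type):
                raise TypeError(f"{key} must be {expected_type.__name__}")
        return handler(**params)


registry = ToolRegistry()
# Pretend a documentation site exposed a "search_docs" tool.
registry.register(
    "search_docs",
    {"query": str, "limit": int},
    lambda query, limit: [f"{query} result {i}" for i in range(1, limit + 1)],
)
results = registry.invoke("search_docs", query="install", limit=2)
print(results)
```

Validating before dispatch means a malformed call from the model fails with a readable error instead of reaching the site's tool.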
Operations & Control#
- Human-in-the-loop: Configurable pause for human input (`include_human_in_loop=True`)
- Browser Persistence: Keep browser open after task completion (`keep_alive=True`)
- System Profile Reuse: Use real browser profiles to retain login state and authentication
Typical Use Cases#
- E-commerce Price Comparison: Auto-search and aggregate prices across sellers on platforms like Amazon
- Social Media Automation: Auto-login to X/Twitter and publish posts
- Video Playback: Search and play specified videos on YouTube
- GitHub Navigation: Auto-login and browse specified repositories
- In-Site Documentation Search: Leverage WebMCP to invoke custom tools exposed by documentation sites
- Web Data Extraction: Automated browsing and structured information extraction
- Form Filling: Automate repetitive form-filling workflows
Architecture#
The project adopts a layered architecture (src/ directory):
| Module | Responsibility |
|---|---|
| `agent/` | Core agent logic: base class, main loop, service layer, view rendering |
| `agent/browser/` | Browser connection and CDP communication management |
| `agent/context/` | Context management |
| `agent/dom/` | DOM element indexing and interaction |
| `agent/events/` | Event system |
| `agent/registry/` | Tool/resource registry |
| `agent/tools/` | Built-in agent toolset |
| `agent/watchdog/` | Timeout and exception monitoring |
| `cdp/` | Chrome DevTools Protocol abstraction layer |
| `messages/` | Message/conversation models |
| `providers/` | LLM Provider abstraction layer (13 implementations) |
| `tools/` | Tool service layer |
Key Mechanisms:
- Direct browser control via CDP using WebSocket connections (`websockets` library), not high-level wrappers like Selenium/Playwright
- DOM element indexing for accelerated element location
- Pillow for screenshot-based vision understanding
- `markdownify` for HTML-to-Markdown conversion for LLM comprehension
- `pyotp` for OTP verification scenarios
- Build system: Hatchling
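To make "direct control via CDP" concrete, the sketch below builds and matches the JSON messages that travel over the WebSocket. The message shape (commands carry `id`, `method`, and `params`; replies echo the `id` with `result` or `error`; events carry `method` without an `id`) is CDP's own. The helper names `cdp_command` and `match_response` are hypothetical, the reply and event are hand-written samples, and no socket is opened here.

```python
import itertools
import json

_ids = itertools.count(1)  # each command needs a unique id


def cdp_command(method, params=None):
    """Build a CDP command as the JSON text sent over the WebSocket."""
    return json.dumps({"id": next(_ids), "method": method, "params": params or {}})


def match_response(raw, command_id):
    """Return the result of the reply to command_id, or None for events and other replies."""
    message = json.loads(raw)
    if message.get("id") != command_id:
        return None
    if "error" in message:
        raise RuntimeError(message["error"].get("message", "CDP error"))
    return message.get("result", {})


navigate = cdp_command("Page.navigate", {"url": "https://example.com"})
sent = json.loads(navigate)
# Hand-written samples shaped like what a browser sends back.
reply = json.dumps({"id": sent["id"], "result": {"frameId": "F1"}})
event = json.dumps({"method": "Page.loadEventFired", "params": {"timestamp": 1.0}})
print(match_response(reply, sent["id"]), match_response(event, sent["id"]))
```

Because replies and events share one connection, matching on `id` is what lets a client pair each command with its result while events arrive in between.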
Inspired by: vimGPT, WebVoyager, LangGraph Examples
Installation & Quick Start#
Prerequisites: Python ≥ 3.13, UV package manager, Chrome browser (remote debugging enabled)
git clone https://github.com/CursorTouch/Web-Use.git
cd Web-Use
uv sync
chrome --remote-debugging-port=9222
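Once Chrome is running with the remote debugging port above, a client typically looks up the WebSocket address through the browser's `/json/version` HTTP endpoint, which returns a JSON object containing `webSocketDebuggerUrl`. The sketch below shows that lookup with the standard library only. The function names are hypothetical, the sample payload is hand-written, and whether Web-Use performs discovery this way is an assumption.

```python
import json
from urllib.request import urlopen


def parse_ws_url(payload):
    """Extract the browser-level WebSocket URL from a /json/version response body."""
    return json.loads(payload)["webSocketDebuggerUrl"]


def fetch_ws_url(port=9222, host="127.0.0.1", timeout=2.0):
    """Ask a running browser for its WebSocket URL; raises if nothing is listening."""
    with urlopen(f"http://{host}:{port}/json/version", timeout=timeout) as resp:
        return parse_ws_url(resp.read().decode("utf-8"))


# Hand-written sample of the response shape, so the parser can be tried offline.
sample = json.dumps({
    "Browser": "Chrome/120.0.0.0",
    "Protocol-Version": "1.3",
    "webSocketDebuggerUrl": "ws://127.0.0.1:9222/devtools/browser/abc123",
})
ws_url = parse_ws_url(sample)
print(ws_url)
```

Calling `fetch_ws_url()` against a live browser is also a quick way to confirm that remote debugging is actually enabled before starting the agent.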
Configure the `.env` file with the API key for your chosen provider, for example a Google API key:
GOOGLE_API_KEY="<API_KEY_HERE>"
Minimal runnable example, saved as `main.py`:
from src.agent.browser.config import BrowserConfig
from src.providers.ollama import ChatOllama
from src.agent import Agent
from dotenv import load_dotenv
load_dotenv()
llm = ChatOllama(model='qwen3.5:397b-cloud', temperature=0.5)
config = BrowserConfig(browser='chrome', headless=False, use_system_profile=True)
agent = Agent(config=config, llm=llm, use_vision=False, use_web_mcp=True, max_steps=100)
user_query = input('Enter your query: ')
agent.print_response(user_query)
uv run main.py
Key Configuration Parameters#
Agent Constructor Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `config` | BrowserConfig | Required | Browser configuration |
| `llm` | BaseChatLLM | Required | Language model instance |
| `use_vision` | bool | False | Enable screenshot vision understanding |
| `use_web_mcp` | bool | False | Enable WebMCP protocol for site tool discovery |
| `max_steps` | int | 25 | Maximum execution steps |
| `max_consecutive_failures` | int | 3 | Consecutive failure retry limit |
| `include_human_in_loop` | bool | False | Pause for human input |
| `keep_alive` | bool | False | Keep browser open after task completion |
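How `max_steps` and `max_consecutive_failures` might bound a run can be sketched as below. This is an illustration under assumptions, not the agent's real loop: `run_loop`, `scripted`, and the outcome strings are hypothetical, and treating the failure limit as an abort threshold that resets on any success is one plausible reading of "consecutive failure retry limit".

```python
def run_loop(step, max_steps=25, max_consecutive_failures=3):
    """Call step() until it reports done, the step budget is spent, or failures pile up.

    step() is a hypothetical callable that returns "done" or "ok", or raises on failure.
    """
    failures = 0
    for n in range(1, max_steps + 1):
        try:
            outcome = step()
        except Exception:
            failures += 1
            if failures >= max_consecutive_failures:
                return ("aborted", n)
            continue
        failures = 0  # any success resets the consecutive-failure counter
        if outcome == "done":
            return ("done", n)
    return ("step_limit", max_steps)


def scripted(outcomes):
    """Build a step() that replays a fixed list of outcomes; "fail" raises."""
    it = iter(outcomes)

    def step():
        value = next(it)
        if value == "fail":
            raise RuntimeError("simulated failure")
        return value

    return step


recovered = run_loop(scripted(["fail", "fail", "ok", "done"]))
gave_up = run_loop(scripted(["ok", "fail", "fail", "fail", "done"]))
print(recovered, gave_up)
```

In the first run two failures are followed by a success, so the counter resets and the task finishes; in the second, three failures in a row stop the run before the final step is reached.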
BrowserConfig Parameters:
| Parameter | Default | Description |
|---|---|---|
| `browser` | 'chrome' | Browser type ('chrome' or 'edge') |
| `headless` | False | Headless mode |
| `use_system_profile` | True | Use system browser profile |
| `user_data_dir` | - | Custom profile directory path |
| `cdp_port` | 9222 | CDP protocol port |
| `downloads_dir` | '/Downloads' | Download directory |
| `attach_to_existing` | False | Connect to an already running browser |
| `update_cdp` | False | Regenerate CDP protocol files |
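Several of these options correspond to Chrome command-line switches. The sketch below shows one possible mapping. `LaunchOptions` and `chrome_args` are hypothetical names that mirror three fields from the table, the switches themselves (`--remote-debugging-port`, `--headless=new`, `--user-data-dir`) are real Chrome flags, and how Web-Use actually launches the browser is not confirmed here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LaunchOptions:
    # Hypothetical mirror of a few BrowserConfig fields from the table above.
    headless: bool = False
    cdp_port: int = 9222
    user_data_dir: Optional[str] = None


def chrome_args(opts):
    """Translate options into Chrome command-line switches."""
    args = [f"--remote-debugging-port={opts.cdp_port}"]
    if opts.headless:
        args.append("--headless=new")
    if opts.user_data_dir:
        args.append(f"--user-data-dir={opts.user_data_dir}")
    return args


default_args = chrome_args(LaunchOptions())
custom_args = chrome_args(
    LaunchOptions(headless=True, cdp_port=9333, user_data_dir="/tmp/profile")
)
print(default_args, custom_args)
```

With `attach_to_existing=True` no launch arguments would be needed at all, since the agent would connect to a browser that is already listening on `cdp_port`.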
Unconfirmed Information#
- Python version requirement contradiction: README suggests 3.11+, `pyproject.toml` requires ≥ 3.13
- Repository topic includes `langgraph`, but no explicit reference found in code
- WebMCP protocol has no formal specification document link
- No independent website or documentation site; docs are centralized in GitHub README
- No formal paper, Hugging Face page, or other external resource links
Primary languages: Python (99.7%), JavaScript (0.3%). Authors: Jeomon George, Muhammad Yaseen. Current version: v0.2. MIT License.