A multimodal browser agent for real web tasks, featuring visual perception and a Record → Compile → Replay workflow that converts manual operations into reusable routines.
OpenBrowser is a multimodal browser-automation AI agent that adopts a vision-first strategy, driving real web tasks through screenshots and direct browser actions rather than DOM parsing. It is built on the OpenHands SDK with a local deployment architecture (FastAPI Server + Chrome Extension + Web Frontend) and currently supports Chrome only.
## Visual Perception & Control
- Operates pages via screenshots + direct browser actions, DOM as auxiliary only
- Capable of visual judgment — e.g., comparing properties by lighting, tidiness, and practicality from screenshots
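The vision-first loop can be sketched as follows. This is a minimal illustration, not the project's actual API: the `Action` schema and `next_action` signature are hypothetical; the point is that the model's context holds a screenshot and a goal, not the DOM.

```python
import base64
from dataclasses import dataclass


@dataclass
class Action:
    """One browser action chosen by the model (illustrative schema)."""
    kind: str        # e.g. "click", "type", "scroll"
    x: int = 0
    y: int = 0
    text: str = ""


def next_action(screenshot_png: bytes, goal: str, model) -> Action:
    """Vision-first perception step: the model receives only the screenshot
    and the goal; the DOM is deliberately kept out of the model's context
    and used only as an auxiliary signal elsewhere."""
    image_b64 = base64.b64encode(screenshot_png).decode()
    return model({"goal": goal, "image": image_b64})
```

Because the model reasons over pixels, it can make judgments a DOM parser cannot, such as comparing listings by lighting or tidiness.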
## Record → Compile → Replay Workflow
- Record: Captures manual browser operations as traces
- Compile: Compiler Agent converts traces into reusable Routine Markdown
- Replay: Executes high-level Routines based on compiled artifacts (not literal event replay)
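A compiled Routine might look like the following. This is a hypothetical artifact sketched to show the idea; the actual Routine Markdown schema is not documented here.

```markdown
# Routine: classify-inbox

## Goal
Label unread emails as "invoice", "newsletter", or "other".

## Steps
1. Open the mail tab and wait for the inbox list to render.
2. For each unread email, read the subject and sender from a screenshot.
3. Apply the matching label via the label menu.

## Notes
- Replay executes these steps as high-level intent, not as recorded click
  coordinates, so minor layout changes do not break the routine.
```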
## Execution Architecture
- Execution Isolation: Browser execution window separated from control window; control model doesn't carry full browser session history
- Session Persistence: Maintains browser sessions, cookies, and login state across automation tasks
- Multi-interface Access: REST API (http://127.0.0.1:8765) + WebSocket (ws://127.0.0.1:8766) + CLI
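A minimal REST client sketch against the local defaults above. The endpoint paths (`/tasks`) are assumptions for illustration; consult the server's actual route definitions before use.

```python
import json
import urllib.request

# Local defaults from the project's multi-interface setup.
REST_BASE = "http://127.0.0.1:8765"
WS_BASE = "ws://127.0.0.1:8766"


def task_url(task_id: str) -> str:
    """Build a REST URL for a task resource (path is a guess, not the
    documented API)."""
    return f"{REST_BASE}/tasks/{task_id}"


def submit_task(goal: str) -> dict:
    """POST a task goal to the local server. Requires the server started
    with `uv run local-chrome-server serve`."""
    body = json.dumps({"goal": goal}).encode()
    req = urllib.request.Request(
        f"{REST_BASE}/tasks",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Long-running task progress would arrive over the WebSocket endpoint rather than by polling REST.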
## Model Strategy
- Multi-model tiering: strong models (qwen3.5-plus) + low-cost models (qwen3.5-flash)
- Cost as first-class constraint: model invocation cost treated as core engineering consideration
- Supported models: dashscope/qwen3.5-plus, dashscope/qwen3.5-flash, dashscope/qwen3.6-flash, dashscope/qwen3.6-plus
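Multi-model tiering can be sketched as a cost-aware router. The heuristic below is illustrative only; the project's actual routing logic is not documented here.

```python
STRONG_MODEL = "dashscope/qwen3.5-plus"   # higher quality, higher cost
CHEAP_MODEL = "dashscope/qwen3.5-flash"   # low-cost tier


def pick_model(step_kind: str, retries: int) -> str:
    """Route visually or logically hard steps (and retried failures) to the
    strong model, everything else to the cheap tier. The step taxonomy here
    is a guess for illustration."""
    hard_steps = {"visual_comparison", "plan", "recover"}
    if step_kind in hard_steps or retries > 0:
        return STRONG_MODEL
    return CHEAP_MODEL
```

Treating cost as a first-class constraint means most routine clicks and form fills run on the flash tier, with the plus tier reserved for judgment calls.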
## Evaluation System
- 35 mock website test cases covering multi-step booking, inbox classification, drag panels, retail flows
- Dedicated Routine compile/replay evaluation harness
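A regression check in such a harness can be sketched as below. The expected/observed dictionaries are a hypothetical trace format; the real harness's schema is not shown here.

```python
def replay_matches(expected: dict, observed: dict) -> list[str]:
    """Compare expected final-state assertions (e.g. 'booking confirmed')
    against the page state observed after replay; return failure messages,
    empty list on success."""
    failures = []
    for key, want in expected.items():
        got = observed.get(key)
        if got != want:
            failures.append(f"{key}: expected {want!r}, got {got!r}")
    return failures
```

Running such checks across the 35 mock-site cases turns each iteration into a regression test rather than a manual spot check.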
## Agent Skill Integration
- Provides skill files for Claude Code, Codex, and OpenClaw, so the agent can be embedded into local agent environments
## Typical Use Cases
- Property search & visual comparison (Demo: browsing 10+ listings on Zillow, outputting Top 3 recommendations)
- Multi-step form filling and submission
- Data scraping and structured information extraction
- Daily browser task automation (email classification, price comparison, information aggregation)
- Reusable business process solidification and replay
## Quick Start
```shell
uv sync
uv run local-chrome-server serve
cd extension && npm install && npm run build
```
Load `extension/dist` in Chrome, visit http://localhost:8765, and enter the Browser UUID shown on the extension page. LLM configuration happens on first Web UI access and is stored at `~/.openbrowser/llm_config.json`.
## Project Structure
- `server/` — FastAPI server: Agent orchestration, REST endpoints, core logic, WebSocket service
- `extension/` — Chrome extension: background script + CDP, browser automation commands, content-script visual feedback
- `frontend/` — Web UI
- `eval/` — Evaluation framework: mock sites, event tracing, evaluation reports
- `skill/` — Agent Skill files
- `local_vendor/openhands-sdk/` — Vendored OpenHands SDK
## Design Principles
- Multimodal first, DOM as auxiliary
- Execution isolation — control model doesn't carry full browser history
- Continuous evaluation — regression-test-driven iteration
- Cost constraint as first-class design consideration
## Unconfirmed Information
- Author `softpudding`: identity and affiliation unclear
- No formal release (0 tags); in active development
- Compatibility with non-Qwen multimodal models unconfirmed
- Relationship to OpenClaw/PinchTab is comparison-only; no code-level association
- LGPL-3.0 licensed