A multimodal browser agent for real web tasks, featuring visual perception and a Record → Compile → Replay workflow that converts manual operations into reusable routines.
OpenBrowser is a multimodal browser-automation AI agent that adopts a vision-first strategy, driving real web tasks through screenshots and direct browser actions rather than DOM parsing. It is built on the OpenHands SDK with a local deployment architecture (FastAPI Server + Chrome Extension + Web Frontend) and currently supports Chrome only.
## Visual Perception & Control
- Operates pages via screenshots + direct browser actions, DOM as auxiliary only
- Capable of visual judgment — e.g., comparing properties by lighting, tidiness, and practicality from screenshots
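The vision-first loop can be sketched as follows. This is a minimal illustration, not the project's actual API: the `Action` schema and `next_action` signature are hypothetical; the point is that the model's context holds a screenshot and a goal, not the DOM.

```python
import base64
from dataclasses import dataclass


@dataclass
class Action:
    """One browser action chosen by the model (illustrative schema)."""
    kind: str        # e.g. "click", "type", "scroll"
    x: int = 0
    y: int = 0
    text: str = ""


def next_action(screenshot_png: bytes, goal: str, model) -> Action:
    """Vision-first perception step: the model receives only the screenshot
    and the goal; the DOM is deliberately kept out of the model's context
    and used only as an auxiliary signal elsewhere."""
    image_b64 = base64.b64encode(screenshot_png).decode()
    return model({"goal": goal, "image": image_b64})
```

Because the model reasons over pixels, it can make judgments a DOM parser cannot, such as comparing listings by lighting or tidiness.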
## Record → Compile → Replay Workflow
- Record: Captures manual browser operations as traces
- Compile: Compiler Agent converts traces into reusable Routine Markdown
- Replay: Executes high-level Routines based on compiled artifacts (not literal event replay)
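A compiled Routine might look like the following. This is a hypothetical artifact sketched to show the idea; the actual Routine Markdown schema is not documented here.

```markdown
# Routine: classify-inbox

## Goal
Label unread emails as "invoice", "newsletter", or "other".

## Steps
1. Open the mail tab and wait for the inbox list to render.
2. For each unread email, read the subject and sender from a screenshot.
3. Apply the matching label via the label menu.

## Notes
- Replay executes these steps as high-level intent, not as recorded click
  coordinates, so minor layout changes do not break the routine.
```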
## Execution Architecture
- Execution Isolation: Browser execution window separated from control window; control model doesn't carry full browser session history
- Session Persistence: Maintains browser sessions, cookies, and login state across automation tasks
- Multi-interface Access: REST API (http://127.0.0.1:8765) + WebSocket (ws://127.0.0.1:8766) + CLI
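A minimal REST client sketch against the local defaults above. The endpoint paths (`/tasks`) are assumptions for illustration; consult the server's actual route definitions before use.

```python
import json
import urllib.request

# Local defaults from the project's multi-interface setup.
REST_BASE = "http://127.0.0.1:8765"
WS_BASE = "ws://127.0.0.1:8766"


def task_url(task_id: str) -> str:
    """Build a REST URL for a task resource (path is a guess, not the
    documented API)."""
    return f"{REST_BASE}/tasks/{task_id}"


def submit_task(goal: str) -> dict:
    """POST a task goal to the local server. Requires the server started
    with `uv run local-chrome-server serve`."""
    body = json.dumps({"goal": goal}).encode()
    req = urllib.request.Request(
        f"{REST_BASE}/tasks",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Long-running task progress would arrive over the WebSocket endpoint rather than by polling REST.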
## Model Strategy
- Multi-model tiering: strong models (qwen3.5-plus) + low-cost models (qwen3.5-flash)
- Cost as first-class constraint: model invocation cost treated as core engineering consideration
- Supported models: dashscope/qwen3.5-plus, dashscope/qwen3.5-flash, dashscope/qwen3.6-flash, dashscope/qwen3.6-plus
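Multi-model tiering can be sketched as a cost-aware router. The heuristic below is illustrative only; the project's actual routing logic is not documented here.

```python
STRONG_MODEL = "dashscope/qwen3.5-plus"   # higher quality, higher cost
CHEAP_MODEL = "dashscope/qwen3.5-flash"   # low-cost tier


def pick_model(step_kind: str, retries: int) -> str:
    """Route visually or logically hard steps (and retried failures) to the
    strong model, everything else to the cheap tier. The step taxonomy here
    is a guess for illustration."""
    hard_steps = {"visual_comparison", "plan", "recover"}
    if step_kind in hard_steps or retries > 0:
        return STRONG_MODEL
    return CHEAP_MODEL
```

Treating cost as a first-class constraint means most routine clicks and form fills run on the flash tier, with the plus tier reserved for judgment calls.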
## Evaluation System
- 35 mock website test cases covering multi-step booking, inbox classification, drag panels, retail flows
- Dedicated Routine compile/replay evaluation harness
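A regression check in such a harness can be sketched as below. The expected/observed dictionaries are a hypothetical trace format; the real harness's schema is not shown here.

```python
def replay_matches(expected: dict, observed: dict) -> list[str]:
    """Compare expected final-state assertions (e.g. 'booking confirmed')
    against the page state observed after replay; return failure messages,
    empty list on success."""
    failures = []
    for key, want in expected.items():
        got = observed.get(key)
        if got != want:
            failures.append(f"{key}: expected {want!r}, got {got!r}")
    return failures
```

Running such checks across the 35 mock-site cases turns each iteration into a regression test rather than a manual spot check.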
## Agent Skill Integration
- Provides skill files for Claude Code, Codex, and OpenClaw, so the agent can be embedded into local agent environments
## Typical Use Cases
- Property search & visual comparison (Demo: browsing 10+ listings on Zillow, outputting Top 3 recommendations)
- Multi-step form filling and submission
- Data scraping and structured information extraction
- Daily browser task automation (email classification, price comparison, information aggregation)
- Reusable business process solidification and replay
## Quick Start
```shell
uv sync
uv run local-chrome-server serve
cd extension && npm install && npm run build
```
Load `extension/dist` in Chrome, visit http://localhost:8765, and enter the Browser UUID shown on the extension page. LLM configuration happens on first Web UI access and is stored at `~/.openbrowser/llm_config.json`.
## Project Structure
- `server/` — FastAPI server: Agent orchestration, REST endpoints, core logic, WebSocket service
- `extension/` — Chrome extension: background script + CDP, browser automation commands, content-script visual feedback
- `frontend/` — Web UI
- `eval/` — Evaluation framework: mock sites, event tracing, evaluation reports
- `skill/` — Agent Skill files
- `local_vendor/openhands-sdk/` — Vendored OpenHands SDK
## Design Principles
- Multimodal first, DOM as auxiliary
- Execution isolation — control model doesn't carry full browser history
- Continuous evaluation — regression-test-driven iteration
- Cost constraint as first-class design consideration
## Unconfirmed Information
- Author `softpudding`: identity and affiliation unclear
- No formal release (0 tags); in active development
- Compatibility with non-Qwen multimodal models unconfirmed
- Relationship to OpenClaw/PinchTab is comparison-only; no code-level association
- LGPL-3.0 licensed