
OpenBrowser

Added: Apr 24, 2026
Category: Agent & Tooling
Open Source
Tags: TypeScript; Node.js; Workflow Automation; JavaScript; Multimodal; Playwright; AI Agents; Agent Framework; Browser Automation; Agent & Tooling; Automation, Workflow & RPA; Computer Vision & Multimodal

A multimodal browser agent for real web tasks, with visual perception and a Record → Compile → Replay workflow that converts manual operations into reusable routines.

OpenBrowser is a multimodal browser-automation agent that takes a vision-first approach: it drives real web tasks through screenshots and direct browser actions rather than DOM parsing. It is built on the OpenHands SDK and deploys locally as three components (a FastAPI server, a Chrome extension, and a web frontend). Only Chrome is supported.

Visual Perception & Control

  • Operates pages via screenshots + direct browser actions, DOM as auxiliary only
  • Capable of visual judgment — e.g., comparing properties by lighting, tidiness, and practicality from screenshots

Record → Compile → Replay Workflow

  • Record: Captures manual browser operations as traces
  • Compile: Compiler Agent converts traces into reusable Routine Markdown
  • Replay: Executes high-level Routines based on compiled artifacts (not literal event replay)
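To make the Compile step concrete, here is a minimal sketch of turning a recorded trace into a Markdown routine. It is only an illustration: in OpenBrowser the compilation is done by a Compiler Agent (a model), whereas this stand-in uses fixed rules, and the `TraceEvent` schema, the `compile_routine` function, and the output layout are all hypothetical rather than the project's actual format.

```python
from dataclasses import dataclass


@dataclass
class TraceEvent:
    """One recorded manual action (hypothetical schema, not OpenBrowser's)."""
    action: str      # e.g. "navigate", "click", "type"
    target: str      # human-readable description of what was acted on
    value: str = ""  # typed text or URL, if any


def compile_routine(title: str, trace: list[TraceEvent]) -> str:
    """Collapse a low-level trace into numbered, high-level Markdown steps."""
    steps: list[str] = []
    for event in trace:
        if event.action == "navigate":
            steps.append(f"Open {event.value}")
        elif event.action == "type":
            steps.append(f'Enter "{event.value}" into {event.target}')
        elif event.action == "click":
            steps.append(f"Click {event.target}")
        else:
            # Unknown actions are kept verbatim so nothing is silently dropped.
            steps.append(f"{event.action} {event.target}".strip())
    lines = [f"# Routine: {title}", ""]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    return "\n".join(lines)
```

The point of the sketch is the shape of the artifact: replay works from intent-level steps like these, not from the literal events that produced them.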

Execution Architecture

  • Execution Isolation: Browser execution window separated from control window; control model doesn't carry full browser session history
  • Session Persistence: Maintains browser sessions, cookies, and login state across automation tasks
  • Multi-interface Access: REST API (http://127.0.0.1:8765) + WebSocket (ws://127.0.0.1:8766) + CLI
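As a quick sanity check before using any of these interfaces, the sketch below tests whether something is listening on the two local ports listed above. It works at the TCP level only and deliberately calls no REST or WebSocket endpoint, since the source does not list the routes; the function names `is_listening` and `check_local_server` are made up for this example.

```python
import socket

REST_PORT = 8765       # from the REST base URL listed above
WEBSOCKET_PORT = 8766  # from the WebSocket URL listed above


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def check_local_server(host: str = "127.0.0.1") -> dict[str, bool]:
    """Report which of the two local ports accepts connections."""
    return {
        "rest": is_listening(host, REST_PORT),
        "websocket": is_listening(host, WEBSOCKET_PORT),
    }
```

A result of `False` for either entry most likely means the server has not been started yet.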

Model Strategy

  • Multi-model tiering: strong models (qwen3.5-plus) + low-cost models (qwen3.5-flash)
  • Cost as first-class constraint: model invocation cost treated as core engineering consideration
  • Supported models: dashscope/qwen3.5-plus, dashscope/qwen3.5-flash, dashscope/qwen3.6-flash, dashscope/qwen3.6-plus
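The source does not say how work is divided between the tiers. One common pattern, sketched below purely as an assumption, is to try the low-cost model first and escalate to the strong model only when the cheap attempt yields no usable answer. The model identifiers come from the list above; `run_tiered` and the `call_model` callback signature are hypothetical.

```python
from typing import Callable, Optional

CHEAP_MODEL = "dashscope/qwen3.5-flash"
STRONG_MODEL = "dashscope/qwen3.5-plus"

# Hypothetical callback: takes (model, prompt) and returns an answer,
# or None when the model could not produce a usable result.
ModelCall = Callable[[str, str], Optional[str]]


def run_tiered(prompt: str, call_model: ModelCall) -> tuple[str, Optional[str]]:
    """Try the low-cost model first; escalate to the strong model on failure."""
    for model in (CHEAP_MODEL, STRONG_MODEL):
        answer = call_model(model, prompt)
        if answer is not None:
            return model, answer
    # Both tiers failed; report the last model tried with no answer.
    return STRONG_MODEL, None
```

Returning the model name alongside the answer makes it easy to log how often escalation happens, which is the figure that matters when cost is a design constraint.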

Evaluation System

  • 35 mock website test cases covering multi-step booking, inbox classification, drag panels, retail flows
  • Dedicated Routine compile/replay evaluation harness
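A regression-oriented evaluation of this kind usually boils down to two questions: what is the pass rate per category, and which cases that used to pass now fail. The sketch below shows that bookkeeping under assumed names; `CaseResult`, `pass_rates`, and `regressions` are hypothetical and do not reflect the project's actual report format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Outcome of one test case (hypothetical record layout)."""
    name: str
    category: str
    passed: bool


def pass_rates(results: list[CaseResult]) -> dict[str, float]:
    """Fraction of passing cases per category."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [passed, total]
    for result in results:
        counts[result.category][1] += 1
        if result.passed:
            counts[result.category][0] += 1
    return {category: p / t for category, (p, t) in counts.items()}


def regressions(previous: list[CaseResult], current: list[CaseResult]) -> list[str]:
    """Names of cases that passed in the previous run but fail now."""
    previously_passing = {r.name for r in previous if r.passed}
    return sorted(
        r.name for r in current if not r.passed and r.name in previously_passing
    )
```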

Agent Skill Integration

  • Skill files for Claude Code, Codex, and OpenClaw for embedding into local agent environments

Typical Use Cases

  • Property search & visual comparison (Demo: browsing 10+ listings on Zillow, outputting Top 3 recommendations)
  • Multi-step form filling and submission
  • Data scraping and structured information extraction
  • Daily browser task automation (email classification, price comparison, information aggregation)
  • Capturing recurring business processes as reusable routines and replaying them

Quick Start

uv sync
uv run local-chrome-server serve
cd extension && npm install && npm run build

Load extension/dist in Chrome, open http://localhost:8765, and enter the Browser UUID shown on the extension page. The LLM is configured on first access to the Web UI, and the settings are stored at ~/.openbrowser/llm_config.json.
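To confirm that the first-run configuration was actually saved, a script can check for the file at the path given above. The sketch below makes no assumption about the keys inside the file, because the source does not document its schema; it only assumes the file holds a JSON object, and `load_llm_config` is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Any, Optional

# Path taken from the Quick Start text above.
DEFAULT_CONFIG_PATH = Path.home() / ".openbrowser" / "llm_config.json"


def load_llm_config(path: Path = DEFAULT_CONFIG_PATH) -> Optional[dict[str, Any]]:
    """Return the saved LLM settings, or None if none have been saved yet."""
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    # Assumption: the file holds a JSON object; anything else counts as unset.
    return data if isinstance(data, dict) else None
```

A return value of `None` suggests the Web UI setup step has not been completed.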

Project Structure

  • server/ — FastAPI server: Agent orchestration, REST endpoints, core logic, WebSocket service
  • extension/ — Chrome extension: Background script + CDP, browser automation commands, content script visual feedback
  • frontend/ — Web UI
  • eval/ — Evaluation framework: mock sites, event tracing, evaluation reports
  • skill/ — Agent Skill files
  • local_vendor/openhands-sdk/ — Vendored OpenHands SDK

Design Principles

  1. Multimodal first, DOM as auxiliary
  2. Execution isolation — control model doesn't carry full browser history
  3. Continuous evaluation — regression-test-driven iteration
  4. Cost constraint as first-class design consideration

Unconfirmed Information

  • The identity and affiliation of the author (softpudding) are unclear
  • No formal release (0 Tags); the project is in active development
  • Compatibility with non-Qwen multimodal models is unconfirmed
  • OpenClaw and PinchTab appear only as points of comparison; there is no code-level association
  • LGPL-3.0 licensed
