An open-source multimodal AI Agent stack developed by ByteDance, comprising the general Agent TARS framework and the UI-TARS Desktop client. It enables natural language control of computers, browsers, and terminals via Vision-Language Models.
One-Minute Overview#
UI-TARS is an open-source project that enables AI to "see" and "operate" computer screens. It consists of two main parts: Agent TARS (a robust CLI/Web framework) and UI-TARS Desktop (a ready-to-use desktop client). By leveraging Vision-Language Models, it understands natural language instructions to control mice, keyboards, and browsers for tasks like booking tickets, coding, or generating charts.
Core Value: Transforms complex GUI automation into simple natural language interactions, supporting both local and remote control with a flexible developer framework.
Quick Start#
Installation Difficulty: Low - the Agent TARS CLI runs via npx (requires Node.js >= 22); the Desktop app requires a separate download.
```shell
# Launch Agent TARS instantly using npx (no global install needed)
npx @agent-tars/cli@latest

# Or install globally
npm install @agent-tars/cli@latest -g

# Run with your preferred model provider (e.g., Volcengine or Anthropic)
agent-tars --provider volcengine --model doubao-1-5-thinking-vision-pro-250428 --apiKey your-api-key
```
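Since the CLI requires Node.js >= 22, a quick preflight check can turn a confusing startup failure into a clear error message. This is a minimal sketch using only POSIX parameter expansion; the `check_node_major` helper is our own illustration, not part of the project.

```shell
# Preflight helper: Agent TARS requires Node.js >= 22.
# check_node_major "v22.4.1" succeeds (exit 0) if the major version is >= 22.
check_node_major() {
  major="${1#v}"          # strip the leading "v" from e.g. "v22.4.1"
  major="${major%%.*}"    # keep only the major component
  [ "$major" -ge 22 ]
}

# In practice: check_node_major "$(node --version)" || exit 1
check_node_major "v22.4.1" && echo "Node.js version OK"
```

Run the real check against `node --version` before invoking `npx @agent-tars/cli@latest`.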
Is it suitable for me?
- ✅ Automating Repetitive Tasks: Ideal for web navigation, form filling, or clicking buttons.
- ✅ Remote Operations: Control remote computers or browsers via AI without complex setup.
- ✅ AI Developers: Build Agents using MCP protocol or Vision models.
- ❌ Mission-Critical Precision: Because it relies on probabilistic vision models, occasional recognition errors can occur.
Core Capabilities#
1. UI-TARS Desktop - Your Personal AI Operator#
- Installs locally to control apps (e.g., VS Code settings), browse the web, or perform remote operations via natural language.
- Value: Fully local processing ensures privacy; supports Remote Computer/Browser operators out-of-the-box.
2. Agent TARS - Developer Framework#
- Features both CLI and Web UI interfaces, supporting Hybrid Browser Agents (combining GUI vision and DOM logic).
- Value: Event Stream driven architecture for easy debugging; built on MCP (Model Context Protocol) for seamless tool integration.
3. Vision Understanding & Precision Control#
- Powered by UI-TARS and Seed-1.5/1.6 series models for robust screenshot recognition and precise mouse/keyboard emulation.
- Value: Pixel-level clicking and dragging, with cross-platform support (Windows/macOS/browser).
Tech Stack & Integration#
Languages: JavaScript / TypeScript (Node.js environment)
Key Dependencies: Node.js >= 22, Vision-Language Model APIs (e.g., Volcengine Doubao, Anthropic Claude)
Integration:
- CLI Tool: Configurable via command-line arguments.
- MCP Protocol: The kernel is built on MCP (Model Context Protocol) and can act as either an MCP server or client.
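Since the CLI is configured via command-line arguments, a thin wrapper can keep the API key out of shell history. A sketch under assumptions: `--provider`, `--model`, and `--apiKey` are the flags shown in the quick start, while the `TARS_API_KEY` variable name and the `launch_tars` wrapper are our own illustration.

```shell
# Build the documented CLI invocation, reading the key from the
# environment instead of typing it on the command line.
# TARS_API_KEY is our own variable name (an assumption, not official).
launch_tars() {
  key="${TARS_API_KEY:-}"
  if [ -z "$key" ]; then
    echo "set TARS_API_KEY first" >&2
    return 1
  fi
  # Print the command for inspection; replace 'echo' with 'exec'
  # to actually launch the agent.
  echo agent-tars --provider "$1" --model "$2" --apiKey "$key"
}

TARS_API_KEY="demo-key"   # placeholder; use a real key
launch_tars volcengine doubao-1-5-thinking-vision-pro-250428
```

The same pattern extends to any provider/model pair the CLI accepts.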
Commercial & Licensing#
License: Apache-2.0
- ✅ Commercial Use: Allowed
- ✅ Modification: Allowed
- ✅ Distribution: Allowed
- ⚠️ Restrictions: Must include copyright and license notices (see Apache 2.0 terms).
Documentation & Resources#
- Quality: Basic to moderate; includes Quick Start guides.
- Official Docs: Refer to the project README and Wiki.
- Community: Discord community available.