An open-source multimodal AI Agent stack developed by ByteDance, comprising the general Agent TARS framework and the UI-TARS Desktop client. It enables natural language control of computers, browsers, and terminals via Vision-Language Models.
One-Minute Overview#
UI-TARS is an open-source project that enables AI to "see" and "operate" computer screens. It consists of two main parts: Agent TARS (a robust CLI/Web framework) and UI-TARS Desktop (a ready-to-use desktop client). By leveraging Vision-Language Models, it understands natural language instructions to control mice, keyboards, and browsers for tasks like booking tickets, coding, or generating charts.
Core Value: Transforms complex GUI automation into simple natural language interactions, supporting both local and remote control with a flexible developer framework.
Quick Start#
Installation Difficulty: Low - the Agent TARS CLI runs via npx (requires Node.js >= 22); the Desktop app requires a separate download.
```shell
# Launch Agent TARS instantly using npx (no global install needed)
npx @agent-tars/cli@latest

# Or install globally
npm install @agent-tars/cli@latest -g

# Run with your preferred model provider (e.g., Volcengine or Anthropic)
agent-tars --provider volcengine --model doubao-1-5-thinking-vision-pro-250428 --apiKey your-api-key
```
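Since the CLI requires Node.js >= 22, a quick preflight check can turn a confusing startup failure into a clear error message. This is a minimal sketch using only POSIX parameter expansion; the `check_node_major` helper is our own illustration, not part of the project.

```shell
# Preflight helper: Agent TARS requires Node.js >= 22.
# check_node_major "v22.4.1" succeeds (exit 0) if the major version is >= 22.
check_node_major() {
  major="${1#v}"          # strip the leading "v" from e.g. "v22.4.1"
  major="${major%%.*}"    # keep only the major component
  [ "$major" -ge 22 ]
}

# In practice: check_node_major "$(node --version)" || exit 1
check_node_major "v22.4.1" && echo "Node.js version OK"
```

Run the real check against `node --version` before invoking `npx @agent-tars/cli@latest`.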
Is it suitable for me?
- ✅ Automating Repetitive Tasks: Ideal for web navigation, form filling, or clicking buttons.
- ✅ Remote Operations: Control remote computers or browsers via AI without complex setup.
- ✅ AI Developers: Build Agents using MCP protocol or Vision models.
- ❌ Mission-Critical Precision: Because it relies on probabilistic vision models, occasional recognition errors can occur.
Core Capabilities#
1. UI-TARS Desktop - Your Personal AI Operator#
- Installs locally to control apps (e.g., VS Code settings), browse the web, or perform remote operations via natural language.
- Value: Fully local processing ensures privacy; supports Remote Computer/Browser operators out-of-the-box.
2. Agent TARS - Developer Framework#
- Features both CLI and Web UI interfaces, supporting Hybrid Browser Agents (combining GUI vision and DOM logic).
- Value: Event Stream driven architecture for easy debugging; built on MCP (Model Context Protocol) for seamless tool integration.
3. Vision Understanding & Precision Control#
- Powered by UI-TARS and Seed-1.5/1.6 series models for robust screenshot recognition and precise mouse/keyboard emulation.
- Value: Pixel-level clicking and dragging, with cross-platform support (Windows/macOS/browser).
Tech Stack & Integration#
Languages: JavaScript / TypeScript (Node.js environment)
Key Dependencies: Node.js >= 22, Vision-Language Model APIs (e.g., Volcengine Doubao, Anthropic Claude)
Integration:
- CLI Tool: Configurable via command-line arguments.
- MCP Protocol: The kernel is built on MCP (Model Context Protocol) and can act as either an MCP server or client.
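Since the CLI is configured via command-line arguments, a thin wrapper can keep the API key out of shell history. A sketch under assumptions: `--provider`, `--model`, and `--apiKey` are the flags shown in the quick start, while the `TARS_API_KEY` variable name and the `launch_tars` wrapper are our own illustration.

```shell
# Build the documented CLI invocation, reading the key from the
# environment instead of typing it on the command line.
# TARS_API_KEY is our own variable name (an assumption, not official).
launch_tars() {
  key="${TARS_API_KEY:-}"
  if [ -z "$key" ]; then
    echo "set TARS_API_KEY first" >&2
    return 1
  fi
  # Print the command for inspection; replace 'echo' with 'exec'
  # to actually launch the agent.
  echo agent-tars --provider "$1" --model "$2" --apiKey "$key"
}

TARS_API_KEY="demo-key"   # placeholder; use a real key
launch_tars volcengine doubao-1-5-thinking-vision-pro-250428
```

The same pattern extends to any provider/model pair the CLI accepts.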
Commercial & Licensing#
License: Apache-2.0
- ✅ Commercial Use: Allowed
- ✅ Modification: Allowed
- ✅ Distribution: Allowed
- ⚠️ Restrictions: Must include copyright and license notices (see Apache 2.0 terms).
Documentation & Resources#
- Quality: Basic to moderate; includes Quick Start guides.
- Official Docs: Refer to the project README and Wiki.
- Community: Discord community available.