MobileAgent is an autonomous mobile agent framework powered by Multimodal Large Language Models (MLLMs), enabling automated mobile app operation and task execution through visual perception and tool invocation.
## Introduction
MobileAgent is designed to solve complex task automation on mobile platforms (primarily Android). Traditional automation scripts (based on control trees or simple macro recording) lack flexibility and struggle with UI changes and cross-app scenarios. MobileAgent introduces Multimodal Large Language Models (VLM/MLLM) as the "brain", combined with visual positioning and ADB control tools, creating an agent that can "understand" screens, autonomously plan steps, and execute precise interactions.
## Core Values
- Visual-First: Directly processes screenshots without relying on app source code or accessibility service trees, enabling cross-app generalization
- Autonomous Planning & Reflection: Decomposes high-level user commands into multi-step operations with self-correction based on execution feedback
- Lightweight Deployment: Python-based core logic without heavy middleware, communicates via ADB
## Capability Matrix

### Perception
- Screen Visual Understanding: Uses VLM (GPT-4V, Qwen-VL) to identify screen content, icons, text, and popups
- UI Element Localization: Outputs bounding boxes or center coordinates for pixel-level clicking
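The localization step above can be sketched as a small coordinate conversion: a helper (hypothetical, not the project's actual API) that maps a VLM-reported bounding box to the pixel the agent should tap, assuming the model returns normalized `[x1, y1, x2, y2]` coordinates in `[0, 1]`.

```python
# Hypothetical helper: convert a VLM-reported bounding box into a tap point.
# Assumes normalized [x1, y1, x2, y2] coordinates in the range [0, 1].

def bbox_to_tap_point(bbox, screen_w, screen_h):
    """Map a normalized bounding box to the absolute pixel center."""
    x1, y1, x2, y2 = bbox
    cx = round((x1 + x2) / 2 * screen_w)
    cy = round((y1 + y2) / 2 * screen_h)
    return cx, cy

# On a 1080x1920 screen, a box covering 10-30% of width and 20-40% of height:
print(bbox_to_tap_point([0.1, 0.2, 0.3, 0.4], 1080, 1920))
```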
### Decision Making
- Multi-step Task Planning: Chain-of-Thought reasoning to decompose complex tasks into action sequences
- Self-reflection: Checks screen changes after actions, determines success, and attempts correction or re-planning on failure
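The self-reflection check can be illustrated with a minimal sketch (not the project's actual implementation): if the screen did not change after an action, the step is treated as ineffective and the agent re-plans.

```python
# Illustrative self-reflection check; function names are assumptions.
import hashlib

def screen_changed(before_png: bytes, after_png: bytes) -> bool:
    """Cheap change detector: compare screenshot content hashes."""
    return hashlib.sha256(before_png).digest() != hashlib.sha256(after_png).digest()

def reflect(before_png: bytes, after_png: bytes) -> str:
    # In the real loop, an unchanged screen would trigger VLM re-planning.
    return "proceed" if screen_changed(before_png, after_png) else "replan"
```

A production agent would use a more robust comparison (e.g. perceptual hashing or asking the VLM to verify the expected state), since minor pixel noise makes byte-exact hashes fragile.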
### Execution
- Standardized Operation Wrappers: Encapsulates ADB command set (Tap, Swipe, Type, Back, Home, etc.)
- Cross-app Operations: Supports task flows across multiple apps
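The standardized operation wrappers can be sketched as thin builders around `adb shell input` (the underlying Android commands are real; the `AdbDevice` class and its method names are illustrative, not the project's actual API):

```python
# Sketch of standardized ADB operation wrappers. The `adb shell input`
# commands are real; this wrapper class is illustrative.
import subprocess

class AdbDevice:
    def __init__(self, serial: str):
        self.serial = serial

    def _cmd(self, *args: str) -> list:
        return ["adb", "-s", self.serial, "shell", "input", *args]

    def tap(self, x: int, y: int) -> list:
        cmd = self._cmd("tap", str(x), str(y))
        # subprocess.run(cmd, check=True)  # uncomment with a connected device
        return cmd

    def swipe(self, x1: int, y1: int, x2: int, y2: int, ms: int = 300) -> list:
        cmd = self._cmd("swipe", str(x1), str(y1), str(x2), str(y2), str(ms))
        # subprocess.run(cmd, check=True)
        return cmd
```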
### Extensibility
- Tool Calling: Supports Function Calling mechanism for custom tools (API calls, database operations)
- Multi-model Support: Decoupled architecture supporting OpenAI GPT-4V, Anthropic Claude, Qwen-VL and other VLMs
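A custom tool registered through Function Calling might be described with an OpenAI-style JSON schema like the one below; the schema format follows the OpenAI tools convention, while how MobileAgent itself registers tools may differ.

```python
# Illustrative OpenAI-style function-calling schema for a custom "tap" tool.
# MobileAgent's actual tool-registration API may differ.
TAP_TOOL = {
    "type": "function",
    "function": {
        "name": "tap",
        "description": "Tap the screen at absolute pixel coordinates.",
        "parameters": {
            "type": "object",
            "properties": {
                "x": {"type": "integer", "description": "X coordinate in pixels"},
                "y": {"type": "integer", "description": "Y coordinate in pixels"},
            },
            "required": ["x", "y"],
        },
    },
}
```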
## Architecture
MobileAgent uses an Agent Loop architecture:
- Environment Layer: Communicates via ADB, captures screenshots, executes commands
- Core Brain Layer: Vision Encoder processes screenshots, LLM Planner receives commands and plans actions
- Tool Library: Defines primitives like click, long_press, scroll, input_text
Key Flow: User Request -> [Screenshot -> VLM Analysis -> Plan Action -> ADB Execute] -> Loop until Done
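The key flow above can be sketched as a loop with perception, planning, and execution injected as callables; all function names here are illustrative stubs, not the project's real API.

```python
# Minimal sketch of the agent loop: screenshot -> VLM plan -> ADB execute,
# repeated until the planner emits a "done" action. Names are illustrative.

def run_agent(task, screenshot, plan_action, execute, max_steps=10):
    history = []
    for _ in range(max_steps):
        screen = screenshot()                       # Environment Layer
        action = plan_action(task, screen, history)  # Core Brain Layer
        if action["type"] == "done":
            return history
        execute(action)                              # Tool Library primitive
        history.append(action)
    return history

# Stubbed usage: a fake planner that taps once, then finishes.
actions = iter([{"type": "tap", "x": 540, "y": 960}, {"type": "done"}])
trace = run_agent(
    task="Enable dark mode",
    screenshot=lambda: b"png-bytes",
    plan_action=lambda t, s, h: next(actions),
    execute=lambda a: None,
)
```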
## Installation

### Requirements
- Python 3.8+
- Android SDK (adb command available)
- Connected Android device (USB debugging enabled) or emulator
- Valid VLM API Key (OpenAI API Key or Alibaba DashScope Key)
### Steps

```shell
git clone https://github.com/X-PLUG/MobileAgent.git
cd MobileAgent
pip install -r requirements.txt
```
## Configuration

- Configure your API key in `config.py` or `.env`
- Set the ADB device serial number
- Switch model backends (GPT-4V / Qwen-VL / other VLMs)
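A minimal configuration might look like the sketch below; the variable names are assumptions for illustration, not the project's actual settings schema.

```python
# Illustrative config.py; variable names are assumptions, not the
# project's actual settings schema.
OPENAI_API_KEY = ""                   # or a DashScope key for Qwen-VL
MODEL_BACKEND = "qwen-vl-plus"        # e.g. "gpt-4v" / "qwen-vl-plus"
ADB_DEVICE_SERIAL = "emulator-5554"   # listed by `adb devices`
```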
## Usage

### CLI

```shell
python run.py --task "Open WeChat and send Hello MobileAgent to File Transfer"
```
### SDK

```python
# Note: the exact import path may vary between releases.
from mobile_agent import MobileAgent

agent = MobileAgent(model="qwen-vl-plus")
agent.setup_device("emulator-5554")
agent.run("Enable dark mode in settings")
```
## Use Cases
- End-to-end automated testing (E2E Testing)
- Mobile RPA (repetitive task automation)
- Personal intelligent assistant
- App feature exploration and inspection