MobileAgent is an autonomous mobile agent framework powered by Multimodal Large Language Models (MLLMs), enabling automated mobile app operation and task execution through visual perception and tool invocation.
## Introduction
MobileAgent is designed to solve complex task automation on mobile platforms (primarily Android). Traditional automation scripts (based on control trees or simple macro recording) lack flexibility and struggle with UI changes and cross-app scenarios. MobileAgent introduces Multimodal Large Language Models (VLM/MLLM) as the "brain", combined with visual positioning and ADB control tools, creating an agent that can "understand" screens, autonomously plan steps, and execute precise interactions.
## Core Values
- Visual-First: Directly processes screenshots without relying on app source code or accessibility service trees, enabling cross-app generalization
- Autonomous Planning & Reflection: Decomposes high-level user commands into multi-step operations with self-correction based on execution feedback
- Lightweight Deployment: Python-based core logic without heavy middleware, communicates via ADB
## Capability Matrix

### Perception
- Screen Visual Understanding: Uses VLM (GPT-4V, Qwen-VL) to identify screen content, icons, text, and popups
- UI Element Localization: Outputs bounding boxes or center coordinates for pixel-level clicking
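The localization step above can be sketched as a small coordinate conversion: a helper (hypothetical, not the project's actual API) that maps a VLM-reported bounding box to the pixel the agent should tap, assuming the model returns normalized `[x1, y1, x2, y2]` coordinates in `[0, 1]`.

```python
# Hypothetical helper: convert a VLM-reported bounding box into a tap point.
# Assumes normalized [x1, y1, x2, y2] coordinates in the range [0, 1].

def bbox_to_tap_point(bbox, screen_w, screen_h):
    """Map a normalized bounding box to the absolute pixel center."""
    x1, y1, x2, y2 = bbox
    cx = round((x1 + x2) / 2 * screen_w)
    cy = round((y1 + y2) / 2 * screen_h)
    return cx, cy

# On a 1080x1920 screen, a box covering 10-30% of width and 20-40% of height:
print(bbox_to_tap_point([0.1, 0.2, 0.3, 0.4], 1080, 1920))
```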
### Decision Making
- Multi-step Task Planning: Chain-of-Thought reasoning to decompose complex tasks into action sequences
- Self-reflection: Checks screen changes after actions, determines success, and attempts correction or re-planning on failure
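The self-reflection check can be illustrated with a minimal sketch (not the project's actual implementation): if the screen did not change after an action, the step is treated as ineffective and the agent re-plans.

```python
# Illustrative self-reflection check; function names are assumptions.
import hashlib

def screen_changed(before_png: bytes, after_png: bytes) -> bool:
    """Cheap change detector: compare screenshot content hashes."""
    return hashlib.sha256(before_png).digest() != hashlib.sha256(after_png).digest()

def reflect(before_png: bytes, after_png: bytes) -> str:
    # In the real loop, an unchanged screen would trigger VLM re-planning.
    return "proceed" if screen_changed(before_png, after_png) else "replan"
```

A production agent would use a more robust comparison (e.g. perceptual hashing or asking the VLM to verify the expected state), since minor pixel noise makes byte-exact hashes fragile.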
### Execution
- Standardized Operation Wrappers: Encapsulates ADB command set (Tap, Swipe, Type, Back, Home, etc.)
- Cross-app Operations: Supports task flows across multiple apps
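The standardized operation wrappers can be sketched as thin builders around `adb shell input` (the underlying Android commands are real; the `AdbDevice` class and its method names are illustrative, not the project's actual API):

```python
# Sketch of standardized ADB operation wrappers. The `adb shell input`
# commands are real; this wrapper class is illustrative.
import subprocess

class AdbDevice:
    def __init__(self, serial: str):
        self.serial = serial

    def _cmd(self, *args: str) -> list:
        return ["adb", "-s", self.serial, "shell", "input", *args]

    def tap(self, x: int, y: int) -> list:
        cmd = self._cmd("tap", str(x), str(y))
        # subprocess.run(cmd, check=True)  # uncomment with a connected device
        return cmd

    def swipe(self, x1: int, y1: int, x2: int, y2: int, ms: int = 300) -> list:
        cmd = self._cmd("swipe", str(x1), str(y1), str(x2), str(y2), str(ms))
        # subprocess.run(cmd, check=True)
        return cmd
```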
### Extensibility
- Tool Calling: Supports Function Calling mechanism for custom tools (API calls, database operations)
- Multi-model Support: Decoupled architecture supporting OpenAI GPT-4V, Anthropic Claude, Qwen-VL and other VLMs
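A custom tool registered through Function Calling might be described with an OpenAI-style JSON schema like the one below; the schema format follows the OpenAI tools convention, while how MobileAgent itself registers tools may differ.

```python
# Illustrative OpenAI-style function-calling schema for a custom "tap" tool.
# MobileAgent's actual tool-registration API may differ.
TAP_TOOL = {
    "type": "function",
    "function": {
        "name": "tap",
        "description": "Tap the screen at absolute pixel coordinates.",
        "parameters": {
            "type": "object",
            "properties": {
                "x": {"type": "integer", "description": "X coordinate in pixels"},
                "y": {"type": "integer", "description": "Y coordinate in pixels"},
            },
            "required": ["x", "y"],
        },
    },
}
```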
## Architecture
MobileAgent uses an Agent Loop architecture:
- Environment Layer: Communicates via ADB, captures screenshots, executes commands
- Core Brain Layer: Vision Encoder processes screenshots, LLM Planner receives commands and plans actions
- Tool Library: Defines primitives like click, long_press, scroll, input_text
Key Flow: User Request -> [Screenshot -> VLM Analysis -> Plan Action -> ADB Execute] -> Loop until Done
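The key flow above can be sketched as a loop with perception, planning, and execution injected as callables; all function names here are illustrative stubs, not the project's real API.

```python
# Minimal sketch of the agent loop: screenshot -> VLM plan -> ADB execute,
# repeated until the planner emits a "done" action. Names are illustrative.

def run_agent(task, screenshot, plan_action, execute, max_steps=10):
    history = []
    for _ in range(max_steps):
        screen = screenshot()                       # Environment Layer
        action = plan_action(task, screen, history)  # Core Brain Layer
        if action["type"] == "done":
            return history
        execute(action)                              # Tool Library primitive
        history.append(action)
    return history

# Stubbed usage: a fake planner that taps once, then finishes.
actions = iter([{"type": "tap", "x": 540, "y": 960}, {"type": "done"}])
trace = run_agent(
    task="Enable dark mode",
    screenshot=lambda: b"png-bytes",
    plan_action=lambda t, s, h: next(actions),
    execute=lambda a: None,
)
```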
## Installation

### Requirements
- Python 3.8+
- Android SDK (adb command available)
- Connected Android device (USB debugging enabled) or emulator
- Valid VLM API Key (OpenAI API Key or Alibaba DashScope Key)
### Steps

```shell
git clone https://github.com/X-PLUG/MobileAgent.git
cd MobileAgent
pip install -r requirements.txt
```
## Configuration

- Configure your API key in `config.py` or `.env`
- Set the ADB device serial number
- Switch model backends (GPT-4V / Qwen-VL / other VLMs)
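A minimal configuration might look like the sketch below; the variable names are assumptions for illustration, not the project's actual settings schema.

```python
# Illustrative config.py; variable names are assumptions, not the
# project's actual settings schema.
OPENAI_API_KEY = ""                   # or a DashScope key for Qwen-VL
MODEL_BACKEND = "qwen-vl-plus"        # e.g. "gpt-4v" / "qwen-vl-plus"
ADB_DEVICE_SERIAL = "emulator-5554"   # listed by `adb devices`
```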
## Usage

### CLI

```shell
python run.py --task "Open WeChat and send Hello MobileAgent to File Transfer"
```
### SDK

```python
# Note: the exact import path may vary between releases.
from mobile_agent import MobileAgent

agent = MobileAgent(model="qwen-vl-plus")
agent.setup_device("emulator-5554")
agent.run("Enable dark mode in settings")
```
## Use Cases
- End-to-end automated testing (E2E Testing)
- Mobile RPA (repetitive task automation)
- Personal intelligent assistant
- App feature exploration and inspection