An open-source, end-to-end VLM-based GUI agent developed by Tsinghua University and Zhipu AI. Built on the bilingual VLM GLM-4V-9B, it enables cross-platform GUI automation and reasoning from screenshots and natural-language instructions.
Overview#
CogAgent is an open-source, end-to-end VLM-based GUI agent built on GLM-4V-9B. It "understands" screen screenshots, interprets natural-language task instructions, and predicts the next interaction action (e.g., click, type, scroll). Beyond action descriptions, it also outputs precise pixel-level coordinates and structured operation commands.
Core Capabilities#
Perception and Reasoning#
- Bilingual Visual Understanding: Supports Chinese and English natural language instructions with screenshot interaction
- GUI Element Localization: Predicts Bounding Box for interactive elements, providing precise coordinates
- Multimodal Reasoning: Context-aware reasoning combined with historical operation steps
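A common convention in CogAgent-style demos is that predicted bounding boxes are relative coordinates on a 0–999 scale; converting them to screen pixels is then a simple rescaling. The helper below is a sketch assuming that convention (verify against the official inference scripts for your checkpoint):

```python
def box_to_pixels(box, img_width, img_height):
    """Convert a model-predicted box, assumed to be on a 0-999 relative
    scale, to absolute pixel coordinates for the given screenshot size."""
    x1, y1, x2, y2 = box
    return (
        int(x1 / 1000 * img_width),
        int(y1 / 1000 * img_height),
        int(x2 / 1000 * img_width),
        int(y2 / 1000 * img_height),
    )

# Example: a predicted box on a 1920x1080 screenshot
print(box_to_pixels([130, 580, 220, 640], 1920, 1080))  # → (249, 626, 422, 691)
```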
Action Space#
Supports complete GUI interaction primitives:
- Mouse Operations: CLICK, DOUBLE_CLICK, RIGHT_CLICK, HOVER
- Text Input: TYPE (supports variable references of the form __CogName_xxx__)
- Scroll Operations: SCROLL_UP, SCROLL_DOWN, SCROLL_LEFT, SCROLL_RIGHT
- Keyboard/Combos: KEY_PRESS, GESTURE (supports KEY_DOWN/KEY_UP sequences)
- System/App: LAUNCH (start an app or URL), END (task completion)
- Advanced/Context: QUOTE_TEXT (with auto_scroll), QUOTE_CLIPBOARD, LLM (invoke the internal LLM for subtasks)
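The exact serialization of a grounded operation is fixed by the model's output template; purely as an illustration, the parser below assumes strings like CLICK(box=[[199,201,452,512]], element_info='Close') (the field names are assumptions based on the demo output style, not a guaranteed spec):

```python
import re

def parse_action(action_str):
    """Split a serialized action such as
    CLICK(box=[[199,201,452,512]], element_info='Close')
    into its primitive name, box coordinates, and element description.
    The string layout here is an assumed example, not an official spec."""
    name = action_str.split("(", 1)[0]
    box_match = re.search(r"box=\[\[(\d+),(\d+),(\d+),(\d+)\]\]", action_str)
    box = [int(v) for v in box_match.groups()] if box_match else None
    info_match = re.search(r"element_info='([^']*)'", action_str)
    return {
        "action": name,
        "box": box,
        "element_info": info_match.group(1) if info_match else None,
    }

print(parse_action("CLICK(box=[[199,201,452,512]], element_info='Close')"))
# → {'action': 'CLICK', 'box': [199, 201, 452, 512], 'element_info': 'Close'}
```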
Output Formats#
Provides 5 preset output formats for different downstream needs:
- Action-Operation-Sensitive
- Status-Plan-Action-Operation
- Status-Action-Operation-Sensitive
- Status-Action-Operation
- Action-Operation
Note: The Sensitive tag marks whether an operation involves sensitive data (<<敏感操作>>, "sensitive operation", vs. <<一般操作>>, "general operation")
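The five format names correspond to --format_key values accepted by the demo scripts (status_action_op_sensitive appears in the Quick Start below). As a sketch of the correspondence, assuming the remaining keys follow the same naming pattern:

```python
# Assumed mapping from --format_key values to the answer sections each
# preset format requests; the key names are inferred from the pattern of
# status_action_op_sensitive and should be checked against the demo code.
FORMAT_SECTIONS = {
    "action_op_sensitive": ["Action", "Operation", "Sensitive"],
    "status_plan_action_op": ["Status", "Plan", "Action", "Operation"],
    "status_action_op_sensitive": ["Status", "Action", "Operation", "Sensitive"],
    "status_action_op": ["Status", "Action", "Operation"],
    "action_op": ["Action", "Operation"],
}

for key, sections in FORMAT_SECTIONS.items():
    print(f"{key}: {'-'.join(sections)}")
```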
Platform Support#
- Desktop: Windows 10/11, macOS 14/15
- Mobile: Android 13/14/15
Performance#
Achieves state-of-the-art results on the Screenspot, OmniAct, CogAgentBench-basic-cn, and OSWorld benchmarks (compared against GPT-4o-20240806, Claude-3.5-Sonnet, Qwen2-VL, ShowUI, SeeClick, etc.).
Application Scenarios#
- Cross-application workflow automation: Email client and calendar coordination, automated holiday greetings, e-commerce shopping filters
- On-device intelligent assistant: Integrated on PC or mobile as system-level Copilot for complex multi-step tasks
- GUI Agent research: As base model or benchmark for developing vision-based Agent architectures
- Accessibility assistance: Assisting visually impaired or elderly users with complex graphical interfaces
Example Scenarios#
- Mark all emails as read (Mac platform)
- Automatically send Christmas greetings
- Online shopping search and filtering
Architecture Features#
- Base Model: GLM-4V-9B (bilingual VLM), approximately 9B parameters (BF16)
- Model Format: Image-Text-to-Text (Transformers + Safetensors)
- Visual Encoding: Images encoded into approximately 1600 tokens
- Inference Backend: Supports HuggingFace Transformers and vLLM (OpenAI API compatible)
Resource Requirements#
- BF16: At least 29GB VRAM
- INT8: ~15GB VRAM (some performance loss)
- INT4: ~8GB VRAM (significant performance loss; not recommended)
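A quick back-of-the-envelope check of these figures: 9B parameters at 2 bytes each (BF16) is roughly 17GB of weights alone; the headroom up to ~29GB goes to activations, the KV cache, and the ~1600 visual tokens per image. A minimal sketch of the arithmetic:

```python
def weight_memory_gb(n_params=9e9, bytes_per_param=2):
    """Rough VRAM needed just for the model weights; excludes KV cache,
    activations, and framework overhead, which explain the gap to the
    quoted totals."""
    return n_params * bytes_per_param / 1024**3

print(f"BF16 weights: ~{weight_memory_gb():.1f} GB")                   # quoted total: >=29 GB
print(f"INT8 weights: ~{weight_memory_gb(bytes_per_param=1):.1f} GB")  # quoted total: ~15 GB
```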
Hardware Adaptation#
- NVIDIA GPU: Mainstream support
- Ascend NPU: Adapted (requires torch_npu; tested on an Atlas800 cluster)
Quick Start#
CLI Inference#
python inference/cli_demo.py --model_dir THUDM/cogagent-9b-20241220 --platform "Mac" --max_length 4096 --top_k 1 --output_image_path ./results --format_key status_action_op_sensitive
Web Demo#
python inference/web_demo.py --host 0.0.0.0 --port 7860 --model_dir THUDM/cogagent-9b-20241220 --format_key status_action_op_sensitive --platform "Mac" --output_dir ./results
Agent APP Deployment#
Server startup:
python openai_demo.py --model_path THUDM/cogagent-9b-20241220 --host 0.0.0.0 --port 7870
Client startup:
python client.py --api_key EMPTY --base_url http://127.0.0.1:7870/v1 --client_name 127.0.0.1 --client_port 7860 --model CogAgent
Input Format Specification#
The prompt must be concatenated in the following order:
Task: {task}
History steps:
{history}
(Platform: {platform})
(Answer in {format} format.)
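The template above can be assembled as a simple string concatenation. In this sketch, the step numbering and separator inside the history block are assumptions; follow the official demo code for the exact history serialization:

```python
def build_prompt(task, history_steps, platform="Mac",
                 answer_format="Action-Operation-Sensitive"):
    """Concatenate the four required parts in the documented order.
    history_steps is a list of prior-step strings; the 'step N:' prefix
    and newline join are assumptions, not an official format."""
    history = "\n".join(
        f"step {i}: {step}" for i, step in enumerate(history_steps)
    )
    return (
        f"Task: {task}\n"
        f"History steps:\n{history}\n"
        f"(Platform: {platform})\n"
        f"(Answer in {answer_format} format.)"
    )

print(build_prompt(
    "Mark all emails as read",
    ["CLICK(box=[[306,72,754,121]], element_info='Inbox')"],
))
```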
Fine-tuning Support#
- SFT: Vision Encoder frozen, batch_size=1, 8×A100
- LoRA: Vision Encoder not frozen, batch_size=1, 1×A100
- Checkpoint resumption supported
Production Use#
Already deployed in Zhipu AI's GLM-PC product.