An open-source, end-to-end VLM-based GUI agent developed by Tsinghua University and Zhipu AI. Built on the bilingual VLM GLM-4V-9B, it enables cross-platform GUI automation and reasoning from screenshots and natural-language instructions.
Overview#
CogAgent is an open-source, end-to-end VLM-based GUI agent built on GLM-4V-9B. It "understands" screen screenshots, interprets natural-language task instructions, and predicts the next interaction action (e.g., click, type, scroll). Beyond action descriptions, it also outputs precise pixel-level coordinates and structured operation commands.
Core Capabilities#
Perception and Reasoning#
- Bilingual Visual Understanding: Supports Chinese and English natural language instructions with screenshot interaction
- GUI Element Localization: Predicts Bounding Box for interactive elements, providing precise coordinates
- Multimodal Reasoning: Context-aware reasoning combined with historical operation steps
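A common convention in CogAgent-style demos is that predicted bounding boxes are relative coordinates on a 0–999 scale; converting them to screen pixels is then a simple rescaling. The helper below is a sketch assuming that convention (verify against the official inference scripts for your checkpoint):

```python
def box_to_pixels(box, img_width, img_height):
    """Convert a model-predicted box, assumed to be on a 0-999 relative
    scale, to absolute pixel coordinates for the given screenshot size."""
    x1, y1, x2, y2 = box
    return (
        int(x1 / 1000 * img_width),
        int(y1 / 1000 * img_height),
        int(x2 / 1000 * img_width),
        int(y2 / 1000 * img_height),
    )

# Example: a predicted box on a 1920x1080 screenshot
print(box_to_pixels([130, 580, 220, 640], 1920, 1080))  # → (249, 626, 422, 691)
```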
Action Space#
Supports complete GUI interaction primitives:
- Mouse Operations: CLICK, DOUBLE_CLICK, RIGHT_CLICK, HOVER
- Text Input: TYPE (supports variable references of the form __CogName_xxx__)
- Scroll Operations: SCROLL_UP, SCROLL_DOWN, SCROLL_LEFT, SCROLL_RIGHT
- Keyboard/Combos: KEY_PRESS, GESTURE (supports KEY_DOWN/KEY_UP sequences)
- System/App: LAUNCH (start an app or URL), END (task completion)
- Advanced/Context: QUOTE_TEXT (with auto_scroll), QUOTE_CLIPBOARD, LLM (invoke the internal LLM for subtasks)
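The exact serialization of a grounded operation is fixed by the model's output template; purely as an illustration, the parser below assumes strings like CLICK(box=[[199,201,452,512]], element_info='Close') (the field names are assumptions based on the demo output style, not a guaranteed spec):

```python
import re

def parse_action(action_str):
    """Split a serialized action such as
    CLICK(box=[[199,201,452,512]], element_info='Close')
    into its primitive name, box coordinates, and element description.
    The string layout here is an assumed example, not an official spec."""
    name = action_str.split("(", 1)[0]
    box_match = re.search(r"box=\[\[(\d+),(\d+),(\d+),(\d+)\]\]", action_str)
    box = [int(v) for v in box_match.groups()] if box_match else None
    info_match = re.search(r"element_info='([^']*)'", action_str)
    return {
        "action": name,
        "box": box,
        "element_info": info_match.group(1) if info_match else None,
    }

print(parse_action("CLICK(box=[[199,201,452,512]], element_info='Close')"))
# → {'action': 'CLICK', 'box': [199, 201, 452, 512], 'element_info': 'Close'}
```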
Output Formats#
Provides 5 preset output formats for different downstream needs:
- Action-Operation-Sensitive
- Status-Plan-Action-Operation
- Status-Action-Operation-Sensitive
- Status-Action-Operation
- Action-Operation
Note: The Sensitive tag marks whether an operation involves sensitive data (<<敏感操作>>, "sensitive operation", vs. <<一般操作>>, "general operation")
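The five format names correspond to --format_key values accepted by the demo scripts (status_action_op_sensitive appears in the Quick Start below). As a sketch of the correspondence, assuming the remaining keys follow the same naming pattern:

```python
# Assumed mapping from --format_key values to the answer sections each
# preset format requests; the key names are inferred from the pattern of
# status_action_op_sensitive and should be checked against the demo code.
FORMAT_SECTIONS = {
    "action_op_sensitive": ["Action", "Operation", "Sensitive"],
    "status_plan_action_op": ["Status", "Plan", "Action", "Operation"],
    "status_action_op_sensitive": ["Status", "Action", "Operation", "Sensitive"],
    "status_action_op": ["Status", "Action", "Operation"],
    "action_op": ["Action", "Operation"],
}

for key, sections in FORMAT_SECTIONS.items():
    print(f"{key}: {'-'.join(sections)}")
```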
Platform Support#
- Desktop: Windows 10/11, macOS 14/15
- Mobile: Android 13/14/15
Performance#
Achieves state-of-the-art results on the Screenspot, OmniAct, CogAgentBench-basic-cn, and OSWorld benchmarks (compared against GPT-4o-20240806, Claude-3.5-Sonnet, Qwen2-VL, ShowUI, SeeClick, etc.).
Application Scenarios#
- Cross-application workflow automation: Email client and calendar coordination, automated holiday greetings, e-commerce shopping filters
- On-device intelligent assistant: Integrated on PC or mobile as system-level Copilot for complex multi-step tasks
- GUI Agent research: As base model or benchmark for developing vision-based Agent architectures
- Accessibility assistance: Assisting visually impaired or elderly users with complex graphical interfaces
Example Scenarios#
- Mark all emails as read (Mac platform)
- Automatically send Christmas greetings
- Online shopping search and filtering
Architecture Features#
- Base Model: GLM-4V-9B (bilingual VLM), approximately 9B parameters (BF16)
- Model Format: Image-Text-to-Text (Transformers + Safetensors)
- Visual Encoding: Images encoded into approximately 1600 tokens
- Inference Backend: Supports HuggingFace Transformers and vLLM (OpenAI API compatible)
Resource Requirements#
- BF16: At least 29GB VRAM
- INT8: ~15GB VRAM (some performance loss)
- INT4: ~8GB VRAM (significant performance loss; not recommended)
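A quick back-of-the-envelope check of these figures: 9B parameters at 2 bytes each (BF16) is roughly 17GB of weights alone; the headroom up to ~29GB goes to activations, the KV cache, and the ~1600 visual tokens per image. A minimal sketch of the arithmetic:

```python
def weight_memory_gb(n_params=9e9, bytes_per_param=2):
    """Rough VRAM needed just for the model weights; excludes KV cache,
    activations, and framework overhead, which explain the gap to the
    quoted totals."""
    return n_params * bytes_per_param / 1024**3

print(f"BF16 weights: ~{weight_memory_gb():.1f} GB")                   # quoted total: >=29 GB
print(f"INT8 weights: ~{weight_memory_gb(bytes_per_param=1):.1f} GB")  # quoted total: ~15 GB
```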
Hardware Adaptation#
- NVIDIA GPU: Mainstream support
- Ascend NPU: Adapted (requires torch_npu; tested on an Atlas800 cluster)
Quick Start#
CLI Inference#
python inference/cli_demo.py --model_dir THUDM/cogagent-9b-20241220 --platform "Mac" --max_length 4096 --top_k 1 --output_image_path ./results --format_key status_action_op_sensitive
Web Demo#
python inference/web_demo.py --host 0.0.0.0 --port 7860 --model_dir THUDM/cogagent-9b-20241220 --format_key status_action_op_sensitive --platform "Mac" --output_dir ./results
Agent APP Deployment#
Server startup:
python openai_demo.py --model_path THUDM/cogagent-9b-20241220 --host 0.0.0.0 --port 7870
Client startup:
python client.py --api_key EMPTY --base_url http://127.0.0.1:7870/v1 --client_name 127.0.0.1 --client_port 7860 --model CogAgent
Input Format Specification#
The prompt must be concatenated in the following order:
Task: {task}
History steps:
{history}
(Platform: {platform})
(Answer in {format} format.)
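The template above can be assembled as a simple string concatenation. In this sketch, the step numbering and separator inside the history block are assumptions; follow the official demo code for the exact history serialization:

```python
def build_prompt(task, history_steps, platform="Mac",
                 answer_format="Action-Operation-Sensitive"):
    """Concatenate the four required parts in the documented order.
    history_steps is a list of prior-step strings; the 'step N:' prefix
    and newline join are assumptions, not an official format."""
    history = "\n".join(
        f"step {i}: {step}" for i, step in enumerate(history_steps)
    )
    return (
        f"Task: {task}\n"
        f"History steps:\n{history}\n"
        f"(Platform: {platform})\n"
        f"(Answer in {answer_format} format.)"
    )

print(build_prompt(
    "Mark all emails as read",
    ["CLICK(box=[[306,72,754,121]], element_info='Inbox')"],
))
```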
Fine-tuning Support#
- SFT: Vision Encoder frozen, batch_size=1, 8×A100
- LoRA: Vision Encoder not frozen, batch_size=1, 1×A100
- Checkpoint resumption supported
Production Use#
Already deployed in Zhipu AI's GLM-PC product.