
CogAgent

Added Feb 23, 2026
Category: Agent & Tooling
Open Source
Tags: Python · Workflow Automation · PyTorch · Large Language Models · Multimodal · Transformers · AI Agents · Agent & Tooling · Model & Inference Framework · Automation, Workflow & RPA · Computer Vision & Multimodal

An open-source, end-to-end VLM-based GUI agent developed by Tsinghua University and Zhipu AI. Built on the bilingual GLM-4V-9B VLM, it enables cross-platform GUI automation and reasoning from screenshots and natural language instructions.

Overview#

CogAgent is an open-source, end-to-end VLM-based GUI agent built on GLM-4V-9B. It "understands" screenshots, interprets natural language task instructions, and predicts the next interaction action (e.g., click, type, scroll). It returns not only action descriptions but also precise pixel-level coordinates and structured operation commands.

Core Capabilities#

Perception and Reasoning#

  • Bilingual Visual Understanding: Supports Chinese and English natural language instructions with screenshot interaction
  • GUI Element Localization: Predicts Bounding Box for interactive elements, providing precise coordinates
  • Multimodal Reasoning: Context-aware reasoning combined with historical operation steps
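
Since the model localizes elements with bounding boxes, a downstream controller must map those boxes onto the actual screen. The sketch below assumes the common convention of boxes in a 0–999 normalized coordinate space scaled to the screenshot size; verify the convention for your checkpoint before relying on it.

```python
def box_to_pixels(box, width, height, scale=1000):
    """Map a normalized [x1, y1, x2, y2] box (0..scale-1) to pixel coordinates.

    The 0-999 normalized coordinate space is an assumption about the model's
    output convention; adjust `scale` if your checkpoint differs.
    """
    x1, y1, x2, y2 = box
    return (
        round(x1 / scale * width),
        round(y1 / scale * height),
        round(x2 / scale * width),
        round(y2 / scale * height),
    )


def box_center(box, width, height, scale=1000):
    """Pixel coordinates of the box center, e.g. as a click target."""
    x1, y1, x2, y2 = box_to_pixels(box, width, height, scale)
    return ((x1 + x2) // 2, (y1 + y2) // 2)
```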

Action Space#

Supports a complete set of GUI interaction primitives:

  • Mouse Operations: CLICK, DOUBLE_CLICK, RIGHT_CLICK, HOVER
  • Text Input: TYPE (supports variable reference __CogName_xxx__)
  • Scroll Operations: SCROLL_UP, SCROLL_DOWN, SCROLL_LEFT, SCROLL_RIGHT
  • Keyboard/Combos: KEY_PRESS, GESTURE (supports KEY_DOWN/UP sequences)
  • System/App: LAUNCH (start app or URL), END (task completion)
  • Advanced/Context: QUOTE_TEXT (with auto_scroll), QUOTE_CLIPBOARD, LLM (invoke internal LLM for subtasks)
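
To act on these primitives, a harness has to parse the operation string the model emits. The sketch below assumes a call-like serialization such as `CLICK(box=[[387,77,450,460]], element_info='Submit')`; the exact format is an assumption, so adapt the pattern to the outputs you actually observe.

```python
import ast
import re

# Matches an operation name followed by a parenthesized argument list, e.g.
#   CLICK(box=[[387,77,450,460]], element_info='Submit')
# This serialization is an assumption, not a documented contract.
_OP_RE = re.compile(r"^(?P<name>[A-Z_]+)\((?P<args>.*)\)$", re.DOTALL)


def parse_operation(op_str):
    """Split an operation string into (name, kwargs dict)."""
    m = _OP_RE.match(op_str.strip())
    if not m:
        raise ValueError(f"unrecognized operation: {op_str!r}")
    kwargs = {}
    args = m.group("args").strip()
    if args:
        # Reuse Python's own expression parser for the keyword arguments,
        # then evaluate each value as a literal (lists, strings, numbers).
        call = ast.parse(f"f({args})", mode="eval").body
        for kw in call.keywords:
            kwargs[kw.arg] = ast.literal_eval(kw.value)
    return m.group("name"), kwargs
```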

Output Formats#

Provides 5 preset output formats for different downstream needs:

  1. Action-Operation-Sensitive
  2. Status-Plan-Action-Operation
  3. Status-Action-Operation-Sensitive
  4. Status-Action-Operation
  5. Action-Operation

Note: The Sensitive tag marks whether an operation involves sensitive data (<<敏感操作>> "sensitive operation" / <<一般操作>> "general operation")
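
The demo scripts select a preset via `--format_key`. Only `status_action_op_sensitive` appears verbatim in the commands below; the other keys in this mapping follow the same naming scheme and are assumptions to be checked against the repository.

```python
# Preset output formats keyed by the (assumed) --format_key values.
FORMAT_KEYS = {
    "action_op_sensitive": "Action-Operation-Sensitive",
    "status_plan_action_op": "Status-Plan-Action-Operation",
    "status_action_op_sensitive": "Status-Action-Operation-Sensitive",
    "status_action_op": "Status-Action-Operation",
    "action_op": "Action-Operation",
}
```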

Platform Support#

  • Desktop: Windows 10/11, macOS 14/15
  • Mobile: Android 13/14/15

Performance#

Achieves state-of-the-art results on the ScreenSpot, OmniACT, CogAgentBench-basic-cn, and OSWorld benchmarks, compared against GPT-4o-20240806, Claude-3.5-Sonnet, Qwen2-VL, ShowUI, SeeClick, and others.

Application Scenarios#

  • Cross-application workflow automation: Email client and calendar coordination, automated holiday greetings, e-commerce shopping filters
  • On-device intelligent assistant: Integrated on PC or mobile as system-level Copilot for complex multi-step tasks
  • GUI Agent research: As base model or benchmark for developing vision-based Agent architectures
  • Accessibility assistance: Assisting visually impaired or elderly users with complex graphical interfaces

Example Scenarios#

  • Mark all emails as read (Mac platform)
  • Automatically send Christmas greetings
  • Online shopping search and filtering

Architecture Features#

  • Base Model: GLM-4V-9B (bilingual VLM), approximately 9B parameters (BF16)
  • Model Format: Image-Text-to-Text (Transformers + Safetensors)
  • Visual Encoding: Images encoded into approximately 1600 tokens
  • Inference Backend: Supports HuggingFace Transformers and vLLM (OpenAI API compatible)

Resource Requirements#

  • BF16: At least 29GB VRAM
  • INT8: ~15GB VRAM (some performance loss)
  • INT4: ~8GB VRAM (significant performance loss; not recommended)
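
For the quantized budgets above, a hedged configuration sketch using `transformers` with `bitsandbytes` is shown below; the exact flags are assumptions rather than a published recipe, and the document itself warns that INT4 carries significant quality loss.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_dir = "THUDM/cogagent-9b-20241220"

# 4-bit quantization config to fit the ~8GB INT4 budget (assumed settings).
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    quantization_config=quant,
    trust_remote_code=True,
    device_map="auto",
)
```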

Hardware Adaptation#

  • NVIDIA GPU: Mainstream support
  • Ascend NPU: Adapted (requires torch_npu, tested on Atlas800 cluster)

Quick Start#

CLI Inference#

python inference/cli_demo.py \
  --model_dir THUDM/cogagent-9b-20241220 \
  --platform "Mac" \
  --max_length 4096 \
  --top_k 1 \
  --output_image_path ./results \
  --format_key status_action_op_sensitive

Web Demo#

python inference/web_demo.py \
  --host 0.0.0.0 \
  --port 7860 \
  --model_dir THUDM/cogagent-9b-20241220 \
  --format_key status_action_op_sensitive \
  --platform "Mac" \
  --output_dir ./results

Agent APP Deployment#

Server startup:

python openai_demo.py \
  --model_path THUDM/cogagent-9b-20241220 \
  --host 0.0.0.0 \
  --port 7870

Client startup:

python client.py \
  --api_key EMPTY \
  --base_url http://127.0.0.1:7870/v1 \
  --client_name 127.0.0.1 \
  --client_port 7860 \
  --model CogAgent
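
Because the server is OpenAI-API compatible, a request can also be built programmatically. The sketch below constructs a chat-completions payload with the screenshot as a base64 data URL; the `image_url` content-part shape follows the OpenAI vision message format, and whether this server accepts exactly this shape is an assumption.

```python
import base64


def build_request(task_prompt, image_path, model="CogAgent", max_tokens=512):
    """Build an OpenAI-style chat-completions payload for the local server.

    The message shape mirrors the OpenAI vision format; exact server
    acceptance of this shape is an assumption.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": task_prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
        "max_tokens": max_tokens,
    }
```

The resulting dict can be POSTed to `http://127.0.0.1:7870/v1/chat/completions` with any HTTP client.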

Input Format Specification#

The prompt must be concatenated from the following fields, in this order:

Task: {task}
History steps:
{history}
(Platform: {platform})
(Answer in {format} format.)
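
The concatenation order above can be sketched as a small helper. Rendering `history_steps` as a numbered list is an assumption about how history should be serialized; the field order itself follows the specification above.

```python
def build_prompt(task, history_steps, platform="Mac",
                 format_name="Status-Action-Operation-Sensitive"):
    """Concatenate the prompt fields in the required order.

    `history_steps` is a list of prior step strings; numbering them
    "1. ..." is an assumed rendering, not part of the specification.
    """
    history = "\n".join(f"{i}. {step}" for i, step in enumerate(history_steps, 1))
    return (
        f"Task: {task}\n"
        f"History steps:\n"
        f"{history}\n"
        f"(Platform: {platform})\n"
        f"(Answer in {format_name} format.)"
    )
```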

Fine-tuning Support#

  • SFT: Vision Encoder frozen, batch_size=1, 8×A100
  • LoRA: Vision Encoder not frozen, batch_size=1, 1×A100
  • Checkpoint resumption supported
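
For the LoRA path, a hedged configuration sketch with the `peft` library is shown below; the rank, alpha, and `target_modules` names are illustrative assumptions, not the project's published training recipe.

```python
from peft import LoraConfig

# Illustrative LoRA hyperparameters (assumed, not the official recipe).
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)
```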

Production Use#

Already deployed in Zhipu AI's GLM-PC product.
