
MiniCPM-o

Added Feb 23, 2026
Category: Model & Inference Framework
Open Source
Tags: Python, PyTorch, Large Language Model, Multimodal, Transformers, CLI, Model & Inference Framework, Model Training & Inference, Computer Vision & Multimodal

An end-side (on-device) omnimodal LLM from Tsinghua University's THUNLP and ModelBest, supporting vision, speech, and full-duplex multimodal live streaming, optimized for mobile deployment, with performance approaching Gemini 2.5 Flash.

Project Overview

MiniCPM-o is an end-side (on-device) omnimodal Large Language Model series jointly developed by the Tsinghua Natural Language Processing Laboratory (THUNLP) and ModelBest. It supports single-image, multi-image, and video understanding (Vision); integrates speech dialogue (Speech) capabilities, including voice cloning and emotion control; and implements Full-Duplex Multimodal Live Streaming: like a human, the model can "listen, see, and speak" at the same time, with input and output streams that do not block each other.

Model Versions

MiniCPM-o 4.5 (9B Parameters)

  • Latest flagship version, built on SigLIP2 + Whisper-medium + CosyVoice2 + Qwen3-8B
  • OpenCompass comprehensive score: 77.6, approaching Gemini 2.5 Flash
  • End-to-end multimodal architecture supporting vision, speech, and full-duplex multimodal live streaming

MiniCPM-V 4.0 (4.1B Parameters)

  • Efficient version, built on SigLIP2-400M + MiniCPM4-3B
  • OpenCompass comprehensive score: 69.0, surpassing GPT-4.1-mini-20250414
  • Optimized for mobile deployment, first token latency <2s on iPhone 16 Pro Max

Core Capabilities

  • Vision Understanding: single-image, multi-image, and video understanding; OCR on images up to 1.8M pixels; high-frame-rate video (10 fps)
  • Speech Capabilities: Chinese-English bilingual real-time speech dialogue, voice cloning, and emotion/speed/style control
  • Full-Duplex Multimodal Live Streaming: non-blocking input and output streams; simultaneous seeing, hearing, and speaking
  • Active Interaction: decides whether to speak at a 1 Hz decision frequency; supports proactive reminders
  • Multilingual: supports 30+ languages

Technical Architecture

Model Components

  • Vision Encoder: SigLIP2 (400M parameters)
  • Audio Encoder: Whisper-medium
  • Speech Decoder: CosyVoice2 / Step-Audio2
  • LLM Backbone: Qwen3-8B
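
Conceptually, these components are wired together end to end rather than through intermediate text: the encoders project each modality into the LLM's hidden space, and the speech decoder conditions on the LLM's hidden states (see "Key Technical Mechanisms" below). The following PyTorch-style sketch illustrates that data flow only; the module choices, dimensions, and class names are illustrative stand-ins, not the actual MiniCPM-o implementation.

import torch
import torch.nn as nn

HIDDEN = 1024  # hypothetical hidden size, for illustration only

class TinyOmniSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(768, HIDDEN)   # stands in for SigLIP2
        self.audio_encoder = nn.Linear(80, HIDDEN)     # stands in for Whisper-medium
        self.llm_backbone = nn.TransformerEncoder(     # stands in for Qwen3-8B
            nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.speech_decoder = nn.Linear(HIDDEN, 80)    # stands in for CosyVoice2

    def forward(self, vision_feats, audio_feats, text_embeds):
        # Each modality is projected into the LLM's hidden space and processed
        # as one interleaved token sequence.
        tokens = torch.cat([
            self.vision_encoder(vision_feats),
            self.audio_encoder(audio_feats),
            text_embeds,
        ], dim=1)
        hidden = self.llm_backbone(tokens)
        # The speech decoder conditions directly on LLM hidden states rather
        # than on decoded text; this is what "end-to-end" refers to here.
        return self.speech_decoder(hidden)

sketch = TinyOmniSketch()
out = sketch(torch.randn(1, 16, 768), torch.randn(1, 32, 80), torch.randn(1, 8, HIDDEN))
print(out.shape)  # torch.Size([1, 56, 80])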

Key Technical Mechanisms

  1. End-to-End Omnimodal Architecture: modality encoders and decoders are tightly coupled to the LLM backbone through hidden states
  2. TDM (Time-Division Multiplexing): concurrent modality streams are split into periodic time slices, giving millisecond-level timeline synchronization
  3. Full-Duplex Streaming Mechanism: offline encoders and decoders are adapted into online, full-duplex streaming versions
  4. Efficient Vision Compression: a 1.8M-pixel image requires only 640 visual tokens (75% fewer than comparable models)
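
A rough back-of-the-envelope check of the vision-compression numbers; the resolution and patch size below are assumptions chosen only so that the pixel count lands near 1.8M:

# Rough arithmetic behind the "640 visual tokens" figure (illustrative values).
width, height = 1344, 1344                 # about 1.8M pixels
pixels = width * height                    # 1,806,336
patch = 14                                 # a typical ViT patch size
raw_patches = pixels // (patch * patch)    # 9,216 patches without compression

minicpm_tokens = 640
comparable_tokens = int(minicpm_tokens / (1 - 0.75))   # "75% fewer" implies ~2,560

print(f"{pixels:,} pixels -> {raw_patches:,} raw {patch}x{patch} patches")
print(f"MiniCPM-o: {minicpm_tokens} tokens vs ~{comparable_tokens:,} in comparable models")
print(f"Compression vs raw patches: {raw_patches / minicpm_tokens:.1f}x")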

Typical Application Scenarios

  • Real-time voice assistant (Chinese-English bilingual)
  • Role-playing and voice cloning
  • Document OCR parsing (OmniDocBench SOTA)
  • Visual Question Answering (VQA)
  • Multimodal real-time interaction (video + audio + text)
  • Edge deployment (mobile phones, iPad, Mac)

Requirements

  • Python 3.10+
  • transformers==4.51.0 (recommended)
  • PyTorch >= 2.3.0, <= 2.8.0

Installation

Without TTS/Streaming Inference:

pip install "transformers==4.51.0" accelerate "torch>=2.3.0,<=2.8.0" "torchaudio<=2.8.0" "minicpmo-utils>=1.0.5"

With TTS/Streaming Inference:

pip install "transformers==4.51.0" accelerate "torch>=2.3.0,<=2.8.0" "torchaudio<=2.8.0" "minicpmo-utils[all]>=1.0.5"

Model Loading Example

import torch
from transformers import AutoModel, AutoTokenizer

# Load the full omnimodal model with vision, audio, and TTS branches enabled
# (set the corresponding init_* flag to False to skip a modality you don't need).
model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,
    attn_implementation='sdpa',   # scaled dot-product attention
    torch_dtype=torch.bfloat16,
    init_vision=True,
    init_audio=True,
    init_tts=True
)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

# Initialize the TTS processor and vocoder; required before generating speech output.
model.init_tts()

Running Modes

  • Duplex Omni Mode: full-duplex streaming inference (real-time video/voice dialogue); key parameter: omni_input=True
  • Half-Duplex Omni Mode: half-duplex multimodal dialogue (chat/streaming); key parameter: use_tts_template=True
  • Speech Conversation: speech dialogue (role-play or assistant); key parameter: mode='audio_roleplay' / 'audio_assistant'
  • Vision-Only: pure vision understanding; key parameters: init_audio=False, init_tts=False (see the sketch after this list)
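
For the vision-only row, a minimal sketch of loading and querying the model without the audio and TTS branches; it assumes the chat() interface used in the MiniCPM-V/o examples, and the image path is a placeholder:

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Vision-only mode: skip the audio encoder and TTS head entirely.
model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,
    attn_implementation='sdpa',
    torch_dtype=torch.bfloat16,
    init_vision=True,
    init_audio=False,
    init_tts=False,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

image = Image.open('example.jpg').convert('RGB')   # placeholder image path
msgs = [{'role': 'user', 'content': [image, 'Describe this image.']}]

answer = model.chat(msgs=msgs, tokenizer=tokenizer, sampling=True, temperature=0.5)
print(answer)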

Key Inference Parameters

  • temperature: Default 0.5
  • max_new_tokens: Maximum generation tokens (e.g., 4096)
  • generate_audio: Whether to generate audio output
  • output_audio_path: Audio save path

Voice System Prompt Configuration

import librosa

# Load a reference voice at 16 kHz and build the voice/system prompt from it.
ref_audio, _ = librosa.load('reference_audio.wav', sr=16000, mono=True)
sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')
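
The returned sys_msg is then prepended to the conversation. Below is a hedged sketch of one full speech turn that also exercises the inference parameters listed earlier; the audio filenames are placeholders, and the chat() call follows the usage pattern of the MiniCPM-o 2.6 examples rather than a guaranteed API:

import librosa

# User speech input at 16 kHz, the sample rate expected by the audio encoder.
user_audio, _ = librosa.load('user_question.wav', sr=16000, mono=True)

msgs = [
    sys_msg,                                    # voice/system prompt built above
    {'role': 'user', 'content': [user_audio]},  # spoken user turn
]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.5,            # default suggested above
    max_new_tokens=4096,
    use_tts_template=True,      # half-duplex omni / speech template
    generate_audio=True,        # also synthesize a spoken reply
    output_audio_path='reply.wav',
)
print(res)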

Supported Frameworks

  • vLLM: high-throughput inference (see the example after this list)
  • SGLang: memory-efficient inference
  • llama.cpp / llama.cpp-omni: local CPU inference on end devices
  • Ollama: simplified deployment
  • LLaMA-Factory: fine-tuning
  • SWIFT: fine-tuning
  • FlagOS: unified backend for multiple chip types
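
As one example of these deployment paths, the sketch below queries the model through vLLM's OpenAI-compatible server, assumed to have been started separately with "vllm serve openbmb/MiniCPM-o-2_6 --trust-remote-code". Support for a given MiniCPM-o checkpoint and modality depends on your vLLM version, so treat this as a template rather than a guaranteed recipe:

# Assumes an OpenAI-compatible vLLM server is already running on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openbmb/MiniCPM-o-2_6",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},  # placeholder URL
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)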

License

Apache-2.0

Developers

THUNLP (Tsinghua Natural Language Processing Laboratory) and ModelBest
