An end-side (on-device) omnimodal LLM from Tsinghua's THUNLP, supporting vision, speech, and full-duplex multimodal live streaming, optimized for mobile deployment with performance rivaling Gemini 2.5 Flash.
## Project Overview
MiniCPM-o is an end-side (on-device) omnimodal Large Language Model series jointly developed by the Tsinghua Natural Language Processing Laboratory (THUNLP) and ModelBest. It supports single-image, multi-image, and video understanding (Vision), integrates powerful speech dialogue (Speech) capabilities including voice cloning and emotion control, and introduces Full-Duplex Multimodal Live Streaming: the model can "listen, see, and speak" simultaneously, like a human, with input and output streams that do not block each other.
## Model Versions

### MiniCPM-o 4.5 (9B Parameters)
- Latest flagship version, built on SigLIP2 + Whisper-medium + CosyVoice2 + Qwen3-8B
- OpenCompass comprehensive score: 77.6, approaching Gemini 2.5 Flash
- End-to-end multimodal architecture supporting vision, speech, and full-duplex multimodal live streaming
### MiniCPM-V 4.0 (4.1B Parameters)
- Efficient version, built on SigLIP2-400M + MiniCPM4-3B
- OpenCompass comprehensive score: 69.0, surpassing GPT-4.1-mini-20250414
- Optimized for mobile deployment; first-token latency under 2 s on iPhone 16 Pro Max
## Core Capabilities
| Category | Features |
|---|---|
| Vision Understanding | Single-/multi-image and video understanding, OCR (up to 1.8M pixels), high-FPS video (10 fps) |
| Speech Capabilities | Chinese-English bilingual real-time speech dialogue, voice cloning, emotion/speed/style control |
| Full-Duplex Multimodal Live Streaming | Non-blocking input/output streams, simultaneous seeing, hearing, and speaking |
| Active Interaction | Decides whether to speak at 1 Hz; supports proactive reminders |
| Multi-language | Supports 30+ languages |
## Technical Architecture

### Model Components
- Vision Encoder: SigLIP2 (400M parameters)
- Audio Encoder: Whisper-medium
- Speech Decoder: CosyVoice2 / Step-Audio2
- LLM Backbone: Qwen3-8B
### Key Technical Mechanisms
- End-to-End Omnimodal Architecture: modal encoders/decoders are tightly coupled to the LLM through hidden states
- TDM (Time-Division Multiplexing): divides the processing timeline into slices to keep streaming inputs and outputs synchronized at millisecond granularity
- Full-Duplex Streaming Mechanism: offline encoders/decoders are reworked into online, full-duplex versions
- Efficient Vision Compression: a 1.8M-pixel image needs only 640 visual tokens, about 75% fewer than comparable models (a back-of-envelope check follows this list)
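To sanity-check the compression figure, assume the slicing scheme used by earlier MiniCPM-V releases, where a high-resolution image is cut into slices and a perceiver-style resampler compresses each slice to a fixed token count; the 64-tokens-per-slice and 10-slice budget below are illustrative assumptions, not specs from this document.

```python
# Back-of-envelope check of the visual-token compression claim.
# ASSUMPTIONS (not stated in this doc): 64 tokens per slice via a
# perceiver-style resampler, and a 10-slice budget for a 1.8M-pixel image.
TOKENS_PER_SLICE = 64
MAX_SLICES = 10

width, height = 1344, 1344          # ~1.8M pixels
pixels = width * height             # 1,806,336
visual_tokens = MAX_SLICES * TOKENS_PER_SLICE

print(f"{pixels / 1e6:.2f}M pixels -> {visual_tokens} visual tokens")
print(f"~{pixels // visual_tokens} pixels represented per token")
# 1.81M pixels -> 640 visual tokens, ~2822 pixels per token
```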
## Typical Application Scenarios
- Real-time voice assistant (Chinese-English bilingual)
- Role-playing and voice cloning
- Document OCR parsing (OmniDocBench SOTA)
- Visual Question Answering (VQA)
- Multimodal real-time interaction (video + audio + text)
- Edge deployment (mobile phones, iPad, Mac)
## Requirements
- Python 3.10+
- transformers==4.51.0 (recommended)
- PyTorch >= 2.3.0, <= 2.8.0
## Installation

Without TTS/streaming inference:

```bash
pip install "transformers==4.51.0" accelerate "torch>=2.3.0,<=2.8.0" "torchaudio<=2.8.0" "minicpmo-utils>=1.0.5"
```

With TTS/streaming inference:

```bash
pip install "transformers==4.51.0" accelerate "torch>=2.3.0,<=2.8.0" "torchaudio<=2.8.0" "minicpmo-utils[all]>=1.0.5"
```
## Model Loading Example

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load with all modalities enabled (vision, audio, TTS)
model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,
    attn_implementation='sdpa',
    torch_dtype=torch.bfloat16,
    init_vision=True,
    init_audio=True,
    init_tts=True,
)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

# Initialize the TTS decoder before generating speech
model.init_tts()
```
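Once the model is loaded, inference goes through its `chat` interface. A minimal single-image sketch (the image path and question are placeholders, following the message format used in the repository's examples):

```python
from PIL import Image

image = Image.open('example.jpg').convert('RGB')  # placeholder path
msgs = [{'role': 'user', 'content': [image, 'What is in this image?']}]

# chat() returns the generated text answer
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```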
## Running Modes
| Mode | Use Case | Key Parameters |
|---|---|---|
| Duplex Omni Mode | Full-duplex streaming inference (real-time/video dialogue) | omni_input=True |
| Half-Duplex Omni Mode | Half-duplex multimodal dialogue (chat/streaming) | use_tts_template=True |
| Speech Conversation | Speech dialogue (role-play/assistant) | mode='audio_roleplay' / 'audio_assistant' |
| Vision-Only | Pure vision understanding | init_audio=False, init_tts=False |
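For the vision-only mode, the audio and TTS submodules can be skipped at load time to save memory; a sketch combining the table's parameters with the loading example above:

```python
# Vision-only loading: skip the audio encoder and the TTS decoder
model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,
    attn_implementation='sdpa',
    torch_dtype=torch.bfloat16,
    init_vision=True,
    init_audio=False,  # no audio encoder
    init_tts=False,    # no speech decoder
).eval().cuda()
```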
## Key Inference Parameters

- `temperature`: sampling temperature, default 0.5
- `max_new_tokens`: maximum number of generated tokens (e.g., 4096)
- `generate_audio`: whether to generate audio output
- `output_audio_path`: path where the generated audio is saved
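Put together, a generation call might look like the sketch below (`msgs` is assumed to be a prepared message list as in the earlier example):

```python
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.5,              # default per the list above
    max_new_tokens=4096,
    use_tts_template=True,        # needed when producing speech
    generate_audio=True,          # also synthesize audio
    output_audio_path='out.wav',  # where the audio is written
)
```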
## Voice System Prompt Configuration

```python
import librosa

# Load a 16 kHz mono reference clip for voice cloning
ref_audio, _ = librosa.load('reference_audio.wav', sr=16000, mono=True)
sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')
```
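The returned system message then leads the conversation. A sketch of a voice-cloned speech turn through the same `chat` interface (file names are placeholders):

```python
# User turn: a raw 16 kHz waveform passed as message content
user_audio, _ = librosa.load('question.wav', sr=16000, mono=True)  # placeholder
msgs = [sys_msg, {'role': 'user', 'content': [user_audio]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='reply.wav',  # reply rendered in the cloned voice
)
```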
## Supported Frameworks
| Framework | Use Case |
|---|---|
| vLLM | High-throughput inference |
| SGLang | Memory-efficient inference |
| llama.cpp / llama.cpp-omni | Local device CPU inference |
| Ollama | Simplified deployment |
| LLaMA-Factory | Fine-tuning |
| SWIFT | Fine-tuning |
| FlagOS | Multi-chip unified backend |
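As a minimal framework example, a text-only vLLM offline-inference sketch (multimodal inputs need model-specific prompt formatting not shown here):

```python
from vllm import LLM, SamplingParams

llm = LLM(model='openbmb/MiniCPM-o-2_6', trust_remote_code=True, max_model_len=4096)
params = SamplingParams(temperature=0.5, max_tokens=256)

outputs = llm.generate(['Summarize MiniCPM-o in one sentence.'], params)
print(outputs[0].outputs[0].text)
```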
## License
Apache-2.0
## Developers
THUNLP (Tsinghua Natural Language Processing Laboratory) and ModelBest