A next-generation training engine built for ultra-large MoE (Mixture of Experts) models, offering efficient, scalable training for large language models.
One-Minute Overview#
XTuner is a next-generation training engine designed specifically for ultra-large MoE (Mixture of Experts) models. It moves beyond the limitations of traditional 3D parallel training architectures, is optimized for the mainstream MoE training scenarios in current research, and supports models of up to 1T parameters, with training efficiency on the Ascend A3 Supernode exceeding that of the NVIDIA H800.
Core Value: Through innovative parallel strategies and memory optimization techniques, XTuner trains MoE models efficiently, removing the bottlenecks that traditional architectures hit at large scale.
Quick Start#
Installation Difficulty: Medium - XTuner is a professional training framework and requires familiarity with distributed training and GPU/NPU hardware
# Clone repository and install
git clone https://github.com/InternLM/xtuner
cd xtuner
pip install -e .
Is this suitable for me?
- ✅ Large-scale MoE model training: Suitable for research teams and enterprises needing to train 200B-1T parameter MoE models
- ✅ Multimodal model training: Supports multimodal pre-training and supervised fine-tuning for vision-language models
- ❌ Small-scale model training: May be overly complex for conventional model training under 10B parameters
- ❌ Resource-limited environments: Requires high-performance computing clusters; ordinary personal computers cannot leverage its full capabilities
Core Capabilities#
1. Dropless Training - Solving Large-Scale MoE Training Bottlenecks#
- Through optimized parallel strategies, XTuner can train 200B-scale MoE models without full expert parallelism; 600B-scale models require only intra-node expert parallelism.
Actual Value: Significantly reduces the technical barriers and resource requirements for large-scale MoE model training.
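To make the "dropless" idea concrete, here is a minimal, self-contained sketch (illustrative only, not XTuner's actual implementation): capacity-based MoE routers drop tokens once an expert's buffer fills up, while a dropless router keeps every token at the cost of uneven per-expert load. All function names and shapes below are hypothetical.

```python
import numpy as np

def topk_route(logits: np.ndarray, k: int = 2) -> np.ndarray:
    """Return, for each token, the indices of its top-k experts by gate score."""
    return np.argsort(logits, axis=-1)[:, -k:]

def dispatch_with_capacity(assignments, n_experts, capacity):
    """Capacity-limited dispatch: (token, expert) pairs past capacity are dropped."""
    kept, dropped = [], []
    load = [0] * n_experts
    for tok, experts in enumerate(assignments):
        for e in experts:
            if load[e] < capacity:
                load[e] += 1
                kept.append((tok, int(e)))
            else:
                dropped.append((tok, int(e)))
    return kept, dropped

def dispatch_dropless(assignments):
    """Dropless dispatch: every (token, expert) pair is kept, none are dropped."""
    return [(tok, int(e)) for tok, experts in enumerate(assignments) for e in experts]

rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 4))          # 16 tokens routed over 4 experts
routes = topk_route(logits, k=2)

kept, dropped = dispatch_with_capacity(routes, n_experts=4, capacity=6)
full = dispatch_dropless(routes)
print(len(kept), len(dropped), len(full))  # dropless keeps all 16 * 2 = 32 pairs
```

The design point: dropless training trades a fixed, padded per-expert buffer for variable-length dispatch, which avoids silently discarding tokens and the quality loss that comes with it.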
2. Long Sequence Support - Breaking Context Length Limitations#
- Through advanced memory optimization techniques, XTuner can train 200B-scale MoE models with 64k sequence lengths without sequence parallelism.
Actual Value: Enables training on much longer texts, suiting applications that require long-document handling.
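A rough back-of-envelope estimate shows why 64k contexts demand memory optimization in the first place: saved activations grow linearly with sequence length. The constants below (hidden size, layer count, activations per layer, bf16 elements) are illustrative assumptions, not XTuner's internal accounting.

```python
def activation_bytes(seq_len: int, hidden: int, n_layers: int,
                     bytes_per_el: int = 2, acts_per_layer: int = 10) -> int:
    """Rough bytes of saved activations for one sequence (micro-batch of 1)."""
    return seq_len * hidden * n_layers * acts_per_layer * bytes_per_el

GIB = 1024 ** 3
# Hypothetical 200B-class config: hidden=8192, 61 layers, bf16 activations.
short_ctx = activation_bytes(seq_len=4 * 1024, hidden=8192, n_layers=61)
long_ctx = activation_bytes(seq_len=64 * 1024, hidden=8192, n_layers=61)
print(f"4k ctx: {short_ctx / GIB:.1f} GiB, 64k ctx: {long_ctx / GIB:.1f} GiB")
```

Going from 4k to 64k multiplies activation memory 16x under these assumptions, which is why techniques such as activation recomputation or offloading are needed once sequence parallelism is taken off the table.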
3. Superior Training Efficiency - Industry-Leading Performance#
- XTuner is the first to achieve FSDP training throughput that surpasses traditional 3D parallel schemes for MoE models above the 200B scale.
Actual Value: Significantly reduces training time and cost for large models, improving R&D efficiency.
4. Multimodal Capabilities - Supporting Vision-Language Model Training#
- Fully supports multimodal pre-training and supervised fine-tuning for vision-language models, with optimizations for instruction following.
Actual Value: Handles both text and image data, broadening application scenarios.
5. Reinforcement Learning Support - Advanced RLHF Capabilities#
- Implements GRPO (Group Relative Policy Optimization), with planned support for further optimization algorithms such as MPO and DAPO.
Actual Value: Improves the model's ability to follow human instructions and the quality of its output.
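The core idea of GRPO can be sketched in a few lines (a conceptual illustration, not XTuner's code): instead of a learned value baseline as in PPO, each response's reward is normalized against the other responses sampled for the same prompt. The group size and reward values below are made up for demonstration.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of sampled responses.

    The group mean acts as the baseline; dividing by the group's standard
    deviation keeps advantage magnitudes comparable across prompts.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four responses sampled for one prompt, scored by a reward model:
advs = group_relative_advantages([1.0, 0.5, 0.0, 0.5])
print([round(a, 3) for a in advs])
```

Responses scoring above the group mean receive positive advantages and are reinforced; those below are penalized, with no separate critic network required.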