An LLM post-training framework for RL scaling from Tsinghua's THUDM, deeply integrating Megatron-LM training with the SGLang inference engine for distributed reinforcement learning on large models such as GLM, Qwen, DeepSeek, and Llama.
## Introduction
slime is an LLM post-training framework for reinforcement learning scaling, developed by Tsinghua University's Data Mining and Knowledge Discovery Lab (THUDM). Its core design deeply couples the distributed training framework Megatron-LM with the high-efficiency inference engine SGLang, forming a complete training loop that supports multiple RL algorithms.
## Core Capabilities
### High-Performance Training
Connects Megatron with SGLang to support efficient training under multiple parallelism strategies, including Tensor Parallel, Pipeline Parallel, Expert Parallel, Sequence Parallel, and Context Parallel.
### Flexible Data Generation
Implements arbitrary training data generation workflows through custom data generation interfaces and server-based engines, supporting prompt initialization, custom data injection, and rollout data backfilling.
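The custom generation workflow can be pictured as a hook that, given a batch of prompts, produces scored rollout samples. The sketch below is illustrative only: the `Sample` fields, the `generate_rollout` signature, and the stub engines are assumptions, not slime's actual interface.

```python
# Hypothetical sketch of a custom rollout-generation hook; names and
# signatures are illustrative, not slime's real API.
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    response: str = ""
    reward: float = 0.0

def generate_rollout(prompts, generate_fn, reward_fn):
    """Generate one response per prompt, then score it with a reward/verifier fn."""
    samples = []
    for prompt in prompts:
        response = generate_fn(prompt)                     # e.g. a call to an SGLang server
        samples.append(Sample(prompt, response, reward_fn(prompt, response)))
    return samples

# Usage with stubs standing in for the inference engine and verifier:
rollouts = generate_rollout(
    ["2+2=?"],
    generate_fn=lambda p: "4",
    reward_fn=lambda p, r: 1.0 if r == "4" else 0.0,
)
print(rollouts[0].reward)  # 1.0
```

In a real setup, `generate_fn` would query the inference engine over HTTP and `reward_fn` would wrap a verifier or reward model.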
## Supported Models
- GLM Series: GLM-5, GLM-4.7, GLM-4.6, GLM-4.5
- Qwen Series: Qwen3Next, Qwen3MoE, Qwen3, Qwen2.5
- DeepSeek Series: DeepSeek V3, V3.1, DeepSeek R1
- Llama Series: Llama 3
## Supported Training Algorithms
- GRPO (Group Relative Policy Optimization)
- GSPO (Group Sequence Policy Optimization)
- Reinforce++ / Reinforce++ Baseline
- PPO (Proximal Policy Optimization)
- On-Policy Distillation
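To make the group-relative idea behind GRPO concrete: for a group of responses to the same prompt, each reward is normalized by the group's mean and standard deviation to form an advantage. A minimal sketch (the function name and epsilon smoothing are illustrative choices, not slime's implementation):

```python
# Group-relative advantage as used in GRPO: normalize each reward within
# its group of G responses to the same prompt. Illustrative sketch only.
import statistics

def grpo_advantages(rewards, eps=1e-6):
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)     # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)  # roughly [1.0, -1.0, 1.0, -1.0]
```

Because the baseline is the group mean rather than a learned value function, GRPO avoids training a separate critic, which is a key difference from PPO.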
## Architecture
Adopts a triangular architecture with three modules: Training, Rollout, and Data Buffer:
- Training (Megatron): as the consumer, reads training data from the Data Buffer, performs parameter updates, and syncs the latest weights to the Rollout module
- Rollout (SGLang + Router): as the producer, receives the latest weights, generates responses and reward/verifier outputs from prompts, and stores the results in the Data Buffer
- Data Buffer: as the hub, manages the prompt pool, stores generated trajectory data, and decouples the training and inference flows
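The producer/consumer decoupling can be sketched as a toy loop. All class and function names here are illustrative stand-ins for the modules named above, not slime's actual code:

```python
# Toy sketch of the Training / Rollout / Data Buffer triangle; names are
# illustrative, not slime's API.
from collections import deque

class DataBuffer:
    """Hub: holds the prompt pool and the generated trajectories."""
    def __init__(self, prompts):
        self.prompts = deque(prompts)
        self.trajectories = deque()

    def put(self, traj):
        self.trajectories.append(traj)

    def get_batch(self, n):
        return [self.trajectories.popleft()
                for _ in range(min(n, len(self.trajectories)))]

def rollout_step(buffer, policy):
    # Producer: take a prompt, generate with the current weights, store result.
    prompt = buffer.prompts.popleft()
    buffer.put({"prompt": prompt, "response": policy(prompt)})

def train_step(buffer, batch_size=1):
    # Consumer: read trajectories; the parameter update itself is omitted.
    return buffer.get_batch(batch_size)

buf = DataBuffer(["hello"])
rollout_step(buf, policy=str.upper)
print(train_step(buf))  # [{'prompt': 'hello', 'response': 'HELLO'}]
```

Because both sides talk only to the buffer, training and inference can run on separate resources or share GPUs in colocate mode without changing the data flow.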
## Key Features
- Colocate Mode: supports training and inference sharing the same GPU group to reduce communication overhead
- Dynamic Batching: improves GPU utilization via `--use-dynamic-batch-size`
- Chunked Weight Updates: optimizes memory usage when transferring large MoE model parameters
- Multi-turn Interaction Support: extensible via `--custom-generate-function-path` and `--custom-rm-path`
- Ray Distributed Scheduling: supports efficient coordination of multi-node, multi-GPU clusters
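Dynamic batching typically means packing variable-length sequences until a per-batch token budget is hit, rather than using a fixed sample count. The sketch below illustrates that idea behind `--use-dynamic-batch-size`; slime's actual packing logic may differ:

```python
# Illustrative token-budget batching: greedily pack sequence indices
# until adding the next sequence would exceed max_tokens_per_batch.
def dynamic_batches(seq_lens, max_tokens_per_batch):
    batches, current, used = [], [], 0
    for i, n in enumerate(seq_lens):
        if current and used + n > max_tokens_per_batch:
            batches.append(current)      # flush the full batch
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        batches.append(current)
    return batches

print(dynamic_batches([512, 256, 900, 128, 700], 1024))
# → [[0, 1], [2], [3, 4]]
```

With fixed-size batching, short sequences would be padded up to the longest one in the batch; packing by token budget keeps the padding waste and GPU idle time down.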
## Hardware Support
Recommended hardware is NVIDIA B200-series and H100/H200 GPUs (covered by CI). slime has been verified at scale, including 64×H100 for GLM-4.5 training and 128×H100 for DeepSeek-R1 training.
## Quick Start
Docker deployment recommended:
```bash
docker pull slimerl/slime:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -it slimerl/slime:latest /bin/bash
```
Model weight conversion (Hugging Face → Megatron torch_dist):
```bash
source scripts/models/glm4-9B.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
    "${MODEL_ARGS[@]}" \
    --hf-checkpoint /root/GLM-Z1-9B-0414 \
    --save /root/GLM-Z1-9B-0414_torch_dist
```
## Projects Built on slime
- P1: Physics Olympiad reasoning model
- RLVE: RL extension based on verifiable environments
- TritonForge: Agentic RL for GPU Kernel generation
- APRIL: Accelerating RL training via Active Partial Rollouts
- qqr: ArenaRL & MCP framework for open-ended agents