An LLM post-training framework for RL scaling from Tsinghua's THUDM, deeply integrating Megatron-LM training with the SGLang inference engine for distributed reinforcement learning on large models such as GLM, Qwen, DeepSeek, and Llama.
## Introduction
slime is an LLM post-training framework for reinforcement learning scaling, developed by Tsinghua University's Data Mining and Knowledge Discovery Lab (THUDM). Its core design deeply couples the distributed training framework Megatron-LM with the high-efficiency inference engine SGLang, forming a complete training loop that supports multiple RL algorithms.
## Core Capabilities
### High-Performance Training
Connects Megatron with SGLang to support efficient training under multiple parallelism strategies, including Tensor Parallel, Pipeline Parallel, Expert Parallel, Sequence Parallel, and Context Parallel.
### Flexible Data Generation
Implements arbitrary training data generation workflows through custom data generation interfaces and server-based engines, supporting prompt initialization, custom data injection, and rollout data backfilling.
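The custom generation workflow can be pictured as a hook that, given a batch of prompts, produces scored rollout samples. The sketch below is illustrative only: the `Sample` fields, the `generate_rollout` signature, and the stub engines are assumptions, not slime's actual interface.

```python
# Hypothetical sketch of a custom rollout-generation hook; names and
# signatures are illustrative, not slime's real API.
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    response: str = ""
    reward: float = 0.0

def generate_rollout(prompts, generate_fn, reward_fn):
    """Generate one response per prompt, then score it with a reward/verifier fn."""
    samples = []
    for prompt in prompts:
        response = generate_fn(prompt)                     # e.g. a call to an SGLang server
        samples.append(Sample(prompt, response, reward_fn(prompt, response)))
    return samples

# Usage with stubs standing in for the inference engine and verifier:
rollouts = generate_rollout(
    ["2+2=?"],
    generate_fn=lambda p: "4",
    reward_fn=lambda p, r: 1.0 if r == "4" else 0.0,
)
print(rollouts[0].reward)  # 1.0
```

In a real setup, `generate_fn` would query the inference engine over HTTP and `reward_fn` would wrap a verifier or reward model.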
## Supported Models
- GLM Series: GLM-5, GLM-4.7, GLM-4.6, GLM-4.5
- Qwen Series: Qwen3Next, Qwen3MoE, Qwen3, Qwen2.5
- DeepSeek Series: DeepSeek V3, V3.1, DeepSeek R1
- Llama Series: Llama 3
## Supported Training Algorithms
- GRPO (Group Relative Policy Optimization)
- GSPO (Group Sequence Policy Optimization)
- Reinforce++ / Reinforce++ Baseline
- PPO (Proximal Policy Optimization)
- On-Policy Distillation
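To make the group-relative idea behind GRPO concrete: for a group of responses to the same prompt, each reward is normalized by the group's mean and standard deviation to form an advantage. A minimal sketch (the function name and epsilon smoothing are illustrative choices, not slime's implementation):

```python
# Group-relative advantage as used in GRPO: normalize each reward within
# its group of G responses to the same prompt. Illustrative sketch only.
import statistics

def grpo_advantages(rewards, eps=1e-6):
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)     # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)  # roughly [1.0, -1.0, 1.0, -1.0]
```

Because the baseline is the group mean rather than a learned value function, GRPO avoids training a separate critic, which is a key difference from PPO.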
## Architecture
Adopts a triangular architecture with three modules: Training, Rollout, and Data Buffer:
- Training (Megatron): as the consumer, reads training data from the Data Buffer, performs parameter updates, and syncs the latest weights to the Rollout module
- Rollout (SGLang + Router): as the producer, receives the latest weights, generates responses and reward/verifier outputs from prompts, and stores the results in the Data Buffer
- Data Buffer: as the hub, manages the prompt pool, stores generated trajectory data, and decouples the training and inference flows
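The producer/consumer decoupling can be sketched as a toy loop. All class and function names here are illustrative stand-ins for the modules named above, not slime's actual code:

```python
# Toy sketch of the Training / Rollout / Data Buffer triangle; names are
# illustrative, not slime's API.
from collections import deque

class DataBuffer:
    """Hub: holds the prompt pool and the generated trajectories."""
    def __init__(self, prompts):
        self.prompts = deque(prompts)
        self.trajectories = deque()

    def put(self, traj):
        self.trajectories.append(traj)

    def get_batch(self, n):
        return [self.trajectories.popleft()
                for _ in range(min(n, len(self.trajectories)))]

def rollout_step(buffer, policy):
    # Producer: take a prompt, generate with the current weights, store result.
    prompt = buffer.prompts.popleft()
    buffer.put({"prompt": prompt, "response": policy(prompt)})

def train_step(buffer, batch_size=1):
    # Consumer: read trajectories; the parameter update itself is omitted.
    return buffer.get_batch(batch_size)

buf = DataBuffer(["hello"])
rollout_step(buf, policy=str.upper)
print(train_step(buf))  # [{'prompt': 'hello', 'response': 'HELLO'}]
```

Because both sides talk only to the buffer, training and inference can run on separate resources or share GPUs in colocate mode without changing the data flow.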
## Key Features
- Colocate Mode: supports training and inference sharing the same GPU group to reduce communication overhead
- Dynamic Batching: improves GPU utilization via `--use-dynamic-batch-size`
- Chunked Weight Updates: optimizes memory usage when transferring large MoE model parameters
- Multi-turn Interaction Support: extensible via `--custom-generate-function-path` and `--custom-rm-path`
- Ray Distributed Scheduling: supports efficient coordination of multi-node, multi-GPU clusters
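Dynamic batching typically means packing variable-length sequences until a per-batch token budget is hit, rather than using a fixed sample count. The sketch below illustrates that idea behind `--use-dynamic-batch-size`; slime's actual packing logic may differ:

```python
# Illustrative token-budget batching: greedily pack sequence indices
# until adding the next sequence would exceed max_tokens_per_batch.
def dynamic_batches(seq_lens, max_tokens_per_batch):
    batches, current, used = [], [], 0
    for i, n in enumerate(seq_lens):
        if current and used + n > max_tokens_per_batch:
            batches.append(current)      # flush the full batch
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        batches.append(current)
    return batches

print(dynamic_batches([512, 256, 900, 128, 700], 1024))
# → [[0, 1], [2], [3, 4]]
```

With fixed-size batching, short sequences would be padded up to the longest one in the batch; packing by token budget keeps the padding waste and GPU idle time down.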
## Hardware Support
Recommended hardware is NVIDIA B200-series and H100/H200 GPUs (covered by CI). slime has been verified at scale, including 64×H100 for GLM-4.5 training and 128×H100 for DeepSeek-R1 training.
## Quick Start
Docker deployment recommended:
```bash
docker pull slimerl/slime:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -it slimerl/slime:latest /bin/bash
```
Model weight conversion (Hugging Face → Megatron torch_dist):
```bash
source scripts/models/glm4-9B.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
    "${MODEL_ARGS[@]}" \
    --hf-checkpoint /root/GLM-Z1-9B-0414 \
    --save /root/GLM-Z1-9B-0414_torch_dist
```
## Projects Built on slime
- P1: Physics Olympiad reasoning model
- RLVE: RL extension based on verifiable environments
- TritonForge: Agentic RL for GPU Kernel generation
- APRIL: Accelerating RL training via Active Partial Rollouts
- qqr: ArenaRL & MCP framework for open-ended agents