
slime

Added Feb 22, 2026
Category: Model & Inference Framework (Open Source)
Tags: Python, Docker, PyTorch, LLM, Transformers, Deep Learning, Reinforcement Learning, CLI, Model Training & Inference

An LLM post-training framework for RL scaling from Tsinghua's THUDM, deeply integrating Megatron-LM training with the SGLang inference engine for distributed reinforcement learning on large models such as GLM, Qwen, DeepSeek, and Llama.

Introduction

slime is an LLM post-training framework developed by Tsinghua University's Data Mining and Knowledge Discovery Lab (THUDM), designed for reinforcement-learning scaling. Its core design deeply couples the distributed training framework Megatron-LM with the high-efficiency inference engine SGLang to build a complete training loop that supports multiple RL algorithms.

Core Capabilities

High-Performance Training

Connects Megatron-LM with SGLang to support efficient training under multiple parallelism strategies, including Tensor Parallel, Pipeline Parallel, Expert Parallel, Sequence Parallel, and Context Parallel.
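As in Megatron-LM, these strategies compose: the product of the tensor-, pipeline-, and context-parallel sizes must divide the total GPU count, and the remainder becomes data parallelism. A minimal sketch of that arithmetic (the function name is illustrative, not part of slime's API):

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    """Data-parallel replicas left over after tensor (tp), pipeline (pp),
    and context (cp) parallelism claim their share of the GPUs."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by tp * pp * cp")
    return world_size // model_parallel

# e.g. 64 H100s with TP=8 and PP=2 leave 4 data-parallel replicas
print(data_parallel_size(64, tp=8, pp=2))  # 4
```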

Flexible Data Generation

Implements arbitrary training data generation workflows through custom data generation interfaces and server-based engines, supporting prompt initialization, custom data injection, and rollout data backfilling.
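The shape of such a workflow can be sketched as a plain generator that turns prompts into rollout records ready for backfilling; the record fields and function names here are illustrative assumptions, not slime's actual interface:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

@dataclass
class RolloutSample:
    prompt: str
    response: str
    reward: float

def generate_rollouts(
    prompts: Iterable[str],
    generate: Callable[[str], str],        # e.g. a call into an inference server
    reward_fn: Callable[[str, str], float] # reward model or verifier
) -> Iterator[RolloutSample]:
    """Turn each prompt into a (prompt, response, reward) record."""
    for prompt in prompts:
        response = generate(prompt)
        yield RolloutSample(prompt, response, reward_fn(prompt, response))

# Toy usage with stand-in generate/reward functions:
samples = list(generate_rollouts(
    ["2+2=?"],
    generate=lambda p: "4",
    reward_fn=lambda p, r: 1.0 if r == "4" else 0.0,
))
print(samples[0].reward)  # 1.0
```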

Supported Models

  • GLM Series: GLM-5, GLM-4.7, GLM-4.6, GLM-4.5
  • Qwen Series: Qwen3Next, Qwen3MoE, Qwen3, Qwen2.5
  • DeepSeek Series: DeepSeek V3, V3.1, DeepSeek R1
  • Llama Series: Llama 3

Supported Training Algorithms

  • GRPO (Group Relative Policy Optimization)
  • GSPO
  • Reinforce++ / Reinforce++ Baseline
  • PPO (Proximal Policy Optimization)
  • On-Policy Distillation
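GRPO's defining step is normalizing rewards within a group of responses sampled for the same prompt, so no separate value network is needed. A minimal sketch of that group-relative advantage computation, independent of slime's internals:

```python
import statistics

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: each response's reward minus the group
    mean, divided by the group's standard deviation (eps avoids /0)."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Two correct and two incorrect responses in one group:
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # approximately [1, -1, 1, -1]
```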

Architecture

Adopts a triangular Training - Rollout - Data Buffer architecture:

  1. Training (Megatron): As the consumer, reads training data from the Data Buffer, performs parameter updates, and syncs the latest weights to the Rollout module
  2. Rollout (SGLang + Router): As the producer, receives the latest weights, generates Responses and Reward/Verifier outputs from Prompts, and stores the results in the Data Buffer
  3. Data Buffer: As the hub, manages the Prompt pool, stores generated trajectory data, and decouples the training and inference flows
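The triangle above reduces to a producer-consumer loop around a shared buffer. This standalone sketch mimics only the control flow; the class and function names are illustrative, not slime components:

```python
import collections
import random

class DataBuffer:
    """Hub: holds prompts on one side, finished trajectories on the other."""
    def __init__(self, prompts):
        self.prompts = collections.deque(prompts)
        self.trajectories = collections.deque()

def rollout_step(buffer: DataBuffer, weights_version: int) -> None:
    """Producer (inference side): consume a prompt, emit (response, reward)."""
    prompt = buffer.prompts.popleft()
    response = f"response-to-{prompt}@v{weights_version}"
    reward = random.random()  # stand-in for a reward model / verifier
    buffer.trajectories.append((prompt, response, reward))

def train_step(buffer: DataBuffer, weights_version: int) -> int:
    """Consumer (training side): drain trajectories, update, bump weights."""
    batch = [buffer.trajectories.popleft() for _ in range(len(buffer.trajectories))]
    # ... a parameter update from `batch` would happen here ...
    return weights_version + 1  # the new weights are then synced to rollout

buffer = DataBuffer(["p1", "p2"])
version = 0
rollout_step(buffer, version)
rollout_step(buffer, version)
version = train_step(buffer, version)
print(version)  # 1
```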

Key Features

  • Colocate Mode: Supports training and inference sharing the same GPU group to reduce communication overhead
  • Dynamic Batching: Improves GPU utilization via --use-dynamic-batch-size
  • Chunked Weight Updates: Optimizes memory usage for large MoE model parameters
  • Multi-turn Interaction Support: Extensible via --custom-generate-function-path and --custom-rm-path
  • Ray Distributed Scheduling: Supports efficient coordination of multi-node, multi-GPU clusters
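The idea behind dynamic batching (`--use-dynamic-batch-size`) is common to many training stacks: pack variable-length samples up to a token budget rather than using a fixed sample count, so short sequences don't waste GPU time. A framework-agnostic sketch of that packing:

```python
def pack_by_token_budget(lengths: list[int], max_tokens: int) -> list[list[int]]:
    """Greedily group sample indices so each batch's total token count
    stays within max_tokens; a sample longer than the budget gets a
    batch of its own."""
    batches, current, used = [], [], 0
    for i, n in enumerate(lengths):
        if current and used + n > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        batches.append(current)
    return batches

print(pack_by_token_budget([512, 256, 900, 100], max_tokens=1024))
# [[0, 1], [2, 3]]
```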

Hardware Support

Recommended hardware: NVIDIA B200 series and H100/H200 (both with CI coverage). Verified at scale, including 64×H100 for GLM-4.5 training and 128×H100 for DeepSeek-R1 training.

Quick Start

Docker deployment is recommended:

```shell
docker pull slimerl/slime:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -it slimerl/slime:latest /bin/bash
```

Model weight conversion (Hugging Face → Megatron torch_dist):

```shell
source scripts/models/glm4-9B.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
  ${MODEL_ARGS[@]} \
  --hf-checkpoint /root/GLM-Z1-9B-0414 \
  --save /root/GLM-Z1-9B-0414_torch_dist
```

Projects Built on slime

  • P1: Physics Olympiad reasoning model
  • RLVE: RL extension based on verifiable environments
  • TritonForge: Agentic RL for GPU Kernel generation
  • APRIL: Accelerating RL training via Active Partial Rollouts
  • qqr: ArenaRL & MCP framework for open-ended agents
