Flexible and scalable reinforcement learning training infrastructure for embodied and agentic AI post-training, decoupling logical workflow composition from efficient physical execution via the M2Flow paradigm.
RLinf is an open-source reinforcement learning infrastructure from the Yu Wang and Chao Yu team at Tsinghua University, targeting post-training scenarios for embodied and agentic AI. It is licensed under Apache-2.0, and v0.2 has been released.
## Core Innovation: M2Flow Paradigm
RLinf's core innovation is the Macro-to-Micro Flow Transformation (M2Flow), which automatically decomposes high-level composable RL logical workflows along temporal and spatial dimensions into optimized execution flows, decoupling logical workflow composition from physical communication scheduling. The paper reports end-to-end training throughput improvements of 1.07×–2.43×.
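The macro-to-micro idea can be illustrated with a toy sketch: the same logical workflow, written as plain sequential stages, is re-expressed as a pipelined wavefront schedule over micro-batches, so different stages run on different data at the same time. Everything below (function names, the toy stages) is an illustrative assumption, not RLinf's actual API.

```python
def macro_run(batch, stages):
    """Logical (macro) view: apply each stage to the whole batch, in order."""
    for stage in stages:
        batch = [stage(x) for x in batch]
    return batch

def pipeline_schedule(n_micro, n_stages):
    """Micro view: wavefront schedule where, at time step t, stage s handles
    micro-batch t - s, so stages overlap across micro-batches."""
    steps = []
    for t in range(n_micro + n_stages - 1):
        steps.append([(t - s, s) for s in range(n_stages) if 0 <= t - s < n_micro])
    return steps

def micro_run(batch, stages):
    """Execute the pipelined schedule; the result matches the macro view."""
    data = list(batch)
    for step in pipeline_schedule(len(batch), len(stages)):
        for mb, s in step:          # pairs within one step could run concurrently
            data[mb] = stages[s](data[mb])
    return data

stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]  # toy stages
assert macro_run([1, 2, 3], stages) == micro_run([1, 2, 3], stages)
```

The point of the transformation is that the pipelined schedule exposes concurrency (each inner list of `pipeline_schedule` is a set of independent stage executions) without changing what the logical workflow computes.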
## Execution Modes & Scheduling
The system offers three flexible execution modes:
- Collocated: All workers share all GPUs
- Disaggregated: Fine-grained pipelining with workers split by function
- Hybrid: Custom combinations of Collocated + Disaggregated, achieving up to 2.434× improvement over existing frameworks in embodied RL scenarios
Scheduling capabilities include dynamic scheduling (runtime resource allocation), static scheduling (auto-selection of optimal execution mode based on workload), and second-level online scaling (20–40% additional efficiency gain while maintaining on-policy properties).
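The three modes above boil down to different worker-to-GPU assignments. A minimal sketch, assuming an 8-GPU node and an invented `place` helper (not RLinf's real scheduler), shows the distinction:

```python
def place(mode, workers, n_gpus, shared_workers=(), split=0):
    """Return {worker: list of GPU ids} for a given execution mode.
    Illustrative only -- the real scheduler also handles multi-node layouts."""
    gpus = list(range(n_gpus))
    if mode == "collocated":                 # every worker shares all GPUs
        return {w: gpus for w in workers}
    if mode == "disaggregated":              # equal, disjoint partitions
        k = n_gpus // len(workers)
        return {w: gpus[i * k:(i + 1) * k] for i, w in enumerate(workers)}
    if mode == "hybrid":                     # some workers collocate on a shared
        placement = {w: gpus[:split] for w in shared_workers}  # pool; the rest
        rest = [w for w in workers if w not in shared_workers] # are split up
        pool = gpus[split:]
        k = len(pool) // len(rest)
        placement.update({w: pool[i * k:(i + 1) * k] for i, w in enumerate(rest)})
        return placement
    raise ValueError(f"unknown mode: {mode}")

placement = place("hybrid", ["env", "rollout", "actor"], n_gpus=8,
                  shared_workers=("env", "rollout"), split=4)
# env and rollout share GPUs 0-3; actor alone owns GPUs 4-7
```

The hybrid case is where the reported embodied-RL gains come from: simulation-heavy workers can time-share a pool while the training worker keeps dedicated devices.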
## Algorithm & Model Coverage
On the algorithm side, it covers over a dozen RL algorithms: the on-policy family (PPO, GRPO, DAPO, Reinforce++, Async PPO), the off-policy family (SAC, CrossQ, RLPD, IQL), and embodied-specific methods (SAC-Flow, DSRL, RECAP, DAgger, HG-DAgger), along with full-parameter SFT, LoRA SFT, and VLM SFT.
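To make one entry in that list concrete: GRPO estimates advantages without a value critic by normalizing each reward against its own group of sampled responses. The snippet below is a generic rendering of that published computation, not RLinf's internal code:

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantage: (reward - group mean) / group std.
    One group = all responses sampled for the same prompt."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])  # two correct, two wrong answers
# group mean 0.5, std 0.5 -> advantages close to +1 / -1 (up to eps)
```

Because the baseline is the group mean rather than a learned value function, this fits naturally into rollout-heavy pipelines like the GRPO-on-MATH recipe mentioned later.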
Model support spans VLA models (π₀, π₀.₅, OpenVLA, OpenVLA-OFT, GR00T, Dexbotic, StarVLA, LingBot-VLA), VLM models (Qwen2.5-VL, Qwen3-VL), and world models (OpenSora, Wan).
## Simulation & Real Robot Support
Simulation environments cover 12+ platforms including ManiSkill3, LIBERO/LIBERO-Pro/LIBERO-Plus, IsaacLab, RoboTwin, RoboVerse, BEHAVIOR, MetaWorld, CALVIN, RoboCasa, Franka-Sim, and EmbodiChain. Real robot support includes Franka (with ZED camera, Robotiq gripper), XSquare Turtle2 dual-arm, and DOS-W1, enabling online policy learning and data collection via the RLinf-USER system.
## Layered Architecture
- Programming Abstraction Layer: Worker-based programming model (Actor, Rollout, Environment, Data, Replay Buffer, etc.) with YAML configuration-driven workflow definition
- Scheduling Layer: M2Flow + profiling-guided scheduling; context switching and elastic pipelining for flow transformation
- Communication Layer: Adaptive P2P communication, Channel Queuing, elastic communication mechanisms
- Backend Layer: FSDP + HuggingFace/SGLang/vLLM (rapid prototyping) and Megatron + SGLang/vLLM (large-scale training), with 5D parallelism support
- Cluster Layer: Ray-based distributed resource management and multi-node scheduling
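A minimal sketch of how the programming-abstraction and communication layers fit together: a rollout worker feeds trajectories to an actor worker through a channel queue, echoing the Worker and Channel Queuing concepts above. Class names, method names, and the trajectory format are illustrative assumptions, not RLinf's actual interfaces.

```python
from queue import Queue

class RolloutWorker:
    """Produces trajectories and pushes them downstream via a channel."""
    def __init__(self, channel: Queue):
        self.channel = channel

    def step(self, policy_version: int):
        # Stand-in for sampling an episode from an environment
        trajectory = {"obs": [0, 1, 2], "version": policy_version}
        self.channel.put(trajectory)

class ActorWorker:
    """Consumes trajectories and runs (pretend) gradient updates."""
    def __init__(self, channel: Queue):
        self.channel = channel
        self.updates = 0

    def step(self):
        trajectory = self.channel.get()   # blocks until rollout data arrives
        self.updates += 1                 # stand-in for one training step
        return trajectory["version"]

channel = Queue()
rollout, actor = RolloutWorker(channel), ActorWorker(channel)
for version in range(3):                  # a simple synchronous loop
    rollout.step(version)
    actor.step()
```

In the real system the channel would span processes or nodes (hence the adaptive P2P and elastic communication mechanisms), and the scheduler, not a hand-written loop, would decide how worker steps interleave.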
## Typical Application Scenarios
1. VLA model simulation RL post-training and sim-real co-training
2. LLM reasoning/search enhancement (GRPO on MATH, SearchR1, rStar2, WideSeek-R1)
3. Real-world online RL (RLinf-USER)
4. World model-driven VLA post-training (WoVR)
5. Offline RL (IQL on D4RL, RECAP)
6. VLM fine-tuning pipelines
## Installation & Getting Started
Docker images are the recommended installation path (given the complex dependencies of embodied RL environments), with pip installation also available via PyPI. The project provides end-to-end SOTA reproduction recipes, with quickstart entries for training a VLA model with PPO on ManiSkill3, training an LLM with GRPO on MATH, and multi-node training.
## Companion Paper Ecosystem
- Main framework paper (M2Flow, 2025)
- RLinf-VLA (2025)
- RLinf-USER (2026)
- WoVR world model post-training (2026)
- Sim-Real Co-Training (2026)
- WideSeek-R1 multi-agent RL (2026)
## Unconfirmed Information
The specific pip package name requires checking the installation docs; minimum hardware requirements are not explicitly listed on the homepage; a detailed comparison with VeRL is pending review; and there is no quantitative data on the scale of adoption by associated organizations (AgiBot, X Square Robot, PsiBot).