A scalable agentic 3D scene generation framework for embodied AI that automatically produces simulation-ready 3D indoor environments and robot training data from natural language task descriptions, accompanied by a 10k-scene dataset.
SAGE (Scalable Agentic 3D Scene Generation for Embodied AI) is developed by NVIDIA Research in collaboration with UIUC, Cornell, and Stanford, and provides an end-to-end pipeline for 3D scene generation and robot data production. The core workflow: the user provides a natural language task description (e.g., "fetch a kettle from the kitchen"); an LLM (gpt-oss-120b) interprets the task intent, a VLM (Qwen3-VL) performs visual reasoning, and an MCP (Model Context Protocol) architecture coordinates TRELLIS 3D asset generation, MatFuse/FLUX material synthesis, and scene layout solving to produce complete 3D scenes that load directly in Isaac Sim.
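The staged workflow can be sketched as below. All function names and data shapes are hypothetical stand-ins, not SAGE's actual API; in the real system each stage is an MCP tool call routed to server-hosted models.

```python
# Illustrative sketch of the SAGE workflow. Every function here is a stub
# standing in for a remote model call; names and return shapes are invented.

def interpret_task_intent(task: str) -> dict:
    """Stand-in for the LLM (gpt-oss-120b) call that parses task intent."""
    # A real implementation would query the vLLM-served model; a fixed
    # structure is returned here to illustrate the expected output.
    return {"target_object": "kettle", "room_type": "kitchen", "action": "fetch"}

def generate_assets(intent: dict) -> list:
    """Stand-in for TRELLIS asset generation plus MatFuse/FLUX materials."""
    return [intent["target_object"], "counter", "cabinet"]

def solve_layout(room_type: str, assets: list) -> dict:
    """Stand-in for the layout solver that places each asset in the room."""
    return {name: {"position": (i * 1.0, 0.0, 0.0)} for i, name in enumerate(assets)}

def build_scene(task: str) -> dict:
    """Chain the stages: intent -> assets -> layout."""
    intent = interpret_task_intent(task)
    assets = generate_assets(intent)
    layout = solve_layout(intent["room_type"], assets)
    return {"intent": intent, "layout": layout}

scene = build_scene("fetch a kettle from the kitchen")
print(scene["intent"]["target_object"])  # → kettle
```

The point of the structure is that each stage is independently swappable: the client only sees tool interfaces, not model internals.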
The framework uses a split client-server architecture: the server hosts foundation model inference and 3D generation services, while the client runs the Isaac Sim simulation engine and local material generation, with the two sides communicating over MCP. All large models are deployed through vLLM with tensor parallelism and asynchronous scheduling.
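A server-side deployment along these lines might look as follows, using vLLM's standard `vllm serve` CLI. The model identifiers and ports are illustrative assumptions, not taken from the SAGE repository.

```shell
# Hypothetical launch commands; model repo names and ports are illustrative.

# Serve the task-intent LLM with 8-way tensor parallelism.
vllm serve openai/gpt-oss-120b --tensor-parallel-size 8 --port 8000

# Serve the VLM on a second OpenAI-compatible endpoint.
vllm serve <qwen3-vl-model-id> --tensor-parallel-size 8 --port 8001
```

Each `vllm serve` instance exposes an OpenAI-compatible HTTP API, so the client can talk to both endpoints with a standard OpenAI client library.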
For data augmentation, SAGE provides layout-level augmentation (regenerating background layouts while preserving task semantics), pose augmentation (randomizing the poses of small objects), and category-level object replacement. For robot data generation, it integrates M2T2 to produce contact-rich manipulation trajectories, supports both static and mobile Franka arm configurations, and writes HDF5 output directly consumable by robomimic for policy training.
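The HDF5 output can be inspected with `h5py`. The toy file below follows robomimic's documented dataset layout (a `data` group containing per-demo groups with `actions` and `obs/*` datasets); the observation key, trajectory length, and action dimension are illustrative, not SAGE's actual schema.

```python
import h5py
import numpy as np

# Write a toy trajectory file in a robomimic-style layout.
# Keys and dimensions below are illustrative assumptions.
T = 50  # trajectory length
with h5py.File("demo.hdf5", "w") as f:
    data = f.create_group("data")
    demo = data.create_group("demo_0")
    # 7-DoF actions for a Franka arm, one row per timestep.
    demo.create_dataset("actions", data=np.zeros((T, 7), dtype=np.float32))
    # Example observation stream: end-effector position.
    demo.create_dataset("obs/robot0_eef_pos", data=np.zeros((T, 3), dtype=np.float32))
    demo.attrs["num_samples"] = T

# Read it back the way a training data loader would.
with h5py.File("demo.hdf5", "r") as f:
    actions = f["data/demo_0/actions"][:]
    print(actions.shape)  # (50, 7)
```

Because robomimic indexes demos by group name under `data/`, appending more trajectories is just a matter of adding `demo_1`, `demo_2`, and so on.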
The SAGE-10k dataset, hosted on Hugging Face, contains 10,000 diverse indoor scenes spanning 50 room types and styles, built from 565K unique 3D objects. All scene generation scripts support image conditioning for Real2Sim research workflows.
The runtime dependencies are heavy: Isaac Sim 4.2.0, GPU clusters (the client-side VLM requires 8-GPU tensor parallelism), a Hugging Face token, and the objathor base data.
The primary language is Python (~98.6%), and the main repository uses the Apache-2.0 license. The paper is available as an arXiv preprint (2602.10116). The repository currently has no releases and only one commit, so its maturity remains to be seen.