A fully disaggregated multimodal model inference and serving framework that extends vLLM to support unified any-to-any multimodal inference and high-performance deployment.
vLLM-Omni is an official sub-project under the vLLM organization, positioned as a fully disaggregated multimodal model inference and serving framework. Its core design is a stage abstraction that decomposes complex any-to-any models into a graph of interconnected stages, where each stage is powered by an independent LLM or Diffusion engine and the OmniConnector handles cross-stage data routing and dynamic resource allocation.
In terms of modality coverage, vLLM-Omni provides unified inference and serving for text, image, video, and audio. Architecturally, it supports both autoregressive (AR) models and non-autoregressive parallel generation models such as Diffusion Transformers (DiT), enabling heterogeneous pipeline orchestration (e.g., LLM inference cascaded with Diffusion image generation) and multimodal mixed outputs (simultaneous text + image + audio generation).
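The stage-graph idea can be pictured with a minimal sketch. Note these classes and names are hypothetical, chosen only to illustrate the concept; they are not the actual vLLM-Omni API:

```python
from dataclasses import dataclass, field

# Illustrative sketch only -- NOT the vLLM-Omni API. It models the idea of
# decomposing an any-to-any model into a graph of engine-backed stages.

@dataclass
class Stage:
    name: str
    engine: str                     # e.g. "llm" (autoregressive) or "diffusion" (DiT)
    downstream: list = field(default_factory=list)

    def connect(self, other):
        # An OmniConnector-style edge: this stage's output feeds `other`.
        self.downstream.append(other)
        return other

# A heterogeneous pipeline: an AR language stage cascaded into a DiT image stage.
understand = Stage("understanding", engine="llm")
generate = Stage("image_generation", engine="diffusion")
understand.connect(generate)
```

Mixed outputs (text + image + audio) would correspond to one stage fanning out to several downstream stages in the same graph.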
On the performance front, it inherits vLLM's efficient KV cache management for SOTA-level AR inference, and improves overall throughput by overlapping the execution of pipelined stages on top of the fully disaggregated architecture. The paper reports up to a 91.4% reduction in Job Completion Time (JCT) compared to baseline methods.
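A back-of-the-envelope model shows why overlapping stage execution helps. The numbers below are made up for illustration and are not from the paper:

```python
# Toy model of a two-stage pipeline (e.g. LLM stage then Diffusion stage).
# Illustrative per-request stage times, not measurements.
a, b, n = 2.0, 3.0, 4   # time in stage A, time in stage B, number of requests

# Sequential: each request runs end-to-end before the next starts.
sequential = n * (a + b)

# Pipelined: stage A of request i+1 overlaps stage B of request i, so the
# makespan is one full pass plus (n - 1) steps of the slower stage.
pipelined = a + b + (n - 1) * max(a, b)

print(sequential, pipelined)  # 20.0 14.0
```

The gap widens as the number of in-flight requests grows, since the slower stage becomes the only bottleneck.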
For usability, vLLM-Omni offers an offline Python API and an OpenAI-compatible online serving API with streaming output support. It integrates seamlessly with Hugging Face models, includes built-in ComfyUI integration and a Diffusers Pipeline Adapter, and provides Helm Charts for Kubernetes deployment. Supported hardware backends include NVIDIA CUDA, AMD ROCm, Intel XPU, MThreads MUSA, and Huawei Ascend NPU.
Validated models span full-modality models like Qwen2.5-Omni and Qwen3-Omni, image generation models like Tongyi-MAI/Z-Image-Turbo and HunyuanImage-3.0-Instruct, video generation models like Helios and VACE, and audio/TTS models like Qwen3-TTS, CosyVoice3, and Fish Speech S2 Pro.
Installation & Quick Start
Requirements: Python 3.12, vLLM ≥ 0.19.0, Linux.
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm==0.19.0 --torch-backend=auto
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
uv pip install -e .
Offline inference example (text-to-image):
from vllm_omni.entrypoints.omni import Omni

omni = Omni(model="Tongyi-MAI/Z-Image-Turbo")           # load the model
outputs = omni.generate("a cup of coffee on the table")  # run text-to-image inference
outputs[0].request_output.images[0].save("coffee.png")   # save the first generated image
Online serving:
vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8091
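Once the server is running, it can be queried with any OpenAI-compatible client. The sketch below uses only the standard library; the endpoint path and payload fields are assumptions based on vLLM's OpenAI-compatible API, so consult the project documentation for the exact schema:

```python
import json
import urllib.request

# Assumption: the server exposes OpenAI-style routes under /v1 on the port
# passed to `vllm serve` above. Field names may differ for image models.
BASE_URL = "http://localhost:8091/v1"

def build_request(prompt, model="Tongyi-MAI/Z-Image-Turbo"):
    """Build an OpenAI-style JSON payload for the serving endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

payload = build_request("a cup of coffee on the table")

# To actually send it (requires the server started above to be running):
# req = urllib.request.Request(
#     f"{BASE_URL}/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read())
```

Setting "stream": True would request streaming output, which the serving API supports.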