A distributed inference framework for running frontier LLMs across local device clusters, built on Apple MLX and libp2p, featuring automatic device discovery, topology-aware parallelism, and multi-API compatibility.
exo is a distributed LLM inference framework designed for local device clusters, enabling multiple consumer-grade devices to collaboratively run frontier models too large for a single machine (e.g., DeepSeek v3.1 671B, Qwen3-235B). Built on Apple MLX and MLX distributed for GPU-accelerated inference and cross-device communication, it uses libp2p for zero-configuration device discovery and cluster networking.
For parallelism, exo supports both tensor parallelism and pipeline parallelism, with a topology-aware algorithm that evaluates device resources and network conditions in real time (including RDMA capability over Thunderbolt 5) and automatically selects the optimal sharding strategy. Benchmarks show speedups of up to 1.8× with 2 devices and 3.2× with 4. On macOS 26.2+, exo offers day-0 support for RDMA over Thunderbolt 5, reducing inter-device latency by approximately 99%.
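To make the trade-off concrete, here is a hypothetical sketch of topology-aware strategy selection, not exo's actual algorithm. The intuition: tensor parallelism splits every layer across devices and is communication-heavy, so it only pays off when all inter-device links are fast (e.g., RDMA over Thunderbolt 5), while pipeline parallelism assigns contiguous layer ranges per device and tolerates slower links. The `Device` fields and the bandwidth threshold are illustrative assumptions.

```python
# Hypothetical sketch -- not exo's real scheduler. Illustrates how link
# bandwidth can drive the tensor-vs-pipeline decision.
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    memory_gb: float   # usable unified/GPU memory on this device
    link_gbps: float   # bandwidth of this device's link into the cluster


def choose_strategy(devices: list[Device], model_gb: float,
                    fast_link_gbps: float = 40.0) -> str:
    """Pick a sharding strategy for a model occupying `model_gb` GB."""
    total = sum(d.memory_gb for d in devices)
    if total < model_gb:
        raise ValueError("cluster memory cannot hold the model")
    # Tensor parallelism's per-layer all-reduce traffic needs uniformly
    # fast links; otherwise fall back to pipeline parallelism.
    if all(d.link_gbps >= fast_link_gbps for d in devices):
        return "tensor"
    return "pipeline"


cluster = [Device("mac-studio", 192, 80.0), Device("macbook", 36, 10.0)]
print(choose_strategy(cluster, model_gb=120))  # slow link -> pipeline
```

The real selector would also weigh measured latency and per-device compute, but the memory-feasibility check plus a bandwidth gate captures the basic shape of the decision.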
For usability, exo simultaneously exposes OpenAI Chat Completions, Claude Messages, OpenAI Responses, and Ollama API formats, allowing direct integration with existing toolchains. A built-in web Dashboard provides management and chat interfaces. It supports offline mode, cluster namespace isolation, distributed tracing, and loading custom MLX models from HuggingFace Hub.
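Because the OpenAI Chat Completions format is exposed, existing clients only need to point their base URL at an exo node. The sketch below shows this with the standard library; the host, port, and model name are placeholder assumptions, so substitute whatever your exo node actually reports.

```python
# Minimal sketch of calling an exo node through its OpenAI-compatible
# Chat Completions endpoint. URL and model name are placeholders.
import json
import urllib.request

EXO_URL = "http://localhost:52415/v1/chat/completions"  # assumed address


def build_request(model: str, prompt: str) -> dict:
    # Standard OpenAI Chat Completions payload; no exo-specific fields.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


if __name__ == "__main__":
    body = json.dumps(build_request("qwen3-235b", "Hello!")).encode()
    req = urllib.request.Request(
        EXO_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

The same pattern applies to the Claude Messages, OpenAI Responses, and Ollama formats: keep the client unchanged and redirect its endpoint to the cluster.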
The current Tier 1 platform is macOS on Apple Silicon; Linux supports CPU-only inference, with GPU support in development. Installation is available from a source build, via Nix, or as a macOS .dmg.