A native macOS menu-bar LLM inference server optimized for Apple Silicon, featuring tiered KV cache and multi-model concurrency.
oMLX is a local LLM inference server designed exclusively for Apple Silicon (M1/M2/M3/M4) chips. Its standout feature is a tiered KV cache (RAM hot tier plus SSD cold tier), inspired by vLLM but significantly extended: infrequently used KV blocks are offloaded to disk in safetensors format, enabling prefix sharing, copy-on-write, and cache reuse across restarts. This sharply improves response times in workloads with heavy context switching, such as coding assistants.
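To make the mechanism concrete, here is a minimal sketch of a two-tier KV-block cache, assuming an LRU-ordered hot tier that spills evicted blocks to disk as safetensors files and promotes them back on a hit. All class and method names are illustrative, not oMLX's actual internals:

```python
# Illustrative two-tier KV-block cache: an LRU hot tier in RAM that spills
# evicted blocks to SSD as safetensors files. Names are hypothetical; this
# is a sketch of the idea, not oMLX's implementation.
from collections import OrderedDict
from pathlib import Path

import numpy as np
from safetensors.numpy import load_file, save_file


class TieredKVCache:
    def __init__(self, cold_dir: str, hot_capacity: int = 256):
        self.hot: OrderedDict[str, dict[str, np.ndarray]] = OrderedDict()
        self.hot_capacity = hot_capacity
        self.cold_dir = Path(cold_dir).expanduser()
        self.cold_dir.mkdir(parents=True, exist_ok=True)

    def put(self, block_hash: str, block: dict[str, np.ndarray]) -> None:
        """Insert a KV block keyed by a hash of its token prefix."""
        self.hot[block_hash] = block
        self.hot.move_to_end(block_hash)
        while len(self.hot) > self.hot_capacity:
            # Spill the least-recently-used block to the SSD cold tier.
            lru_hash, lru_block = self.hot.popitem(last=False)
            save_file(lru_block, str(self.cold_dir / f"{lru_hash}.safetensors"))

    def get(self, block_hash: str) -> dict[str, np.ndarray] | None:
        if block_hash in self.hot:
            self.hot.move_to_end(block_hash)  # refresh LRU recency
            return self.hot[block_hash]
        cold_path = self.cold_dir / f"{block_hash}.safetensors"
        if cold_path.exists():
            block = load_file(str(cold_path))  # promote cold block to hot tier
            self.put(block_hash, block)
            return block
        return None  # miss: the prefill must be recomputed
```

Keying blocks by a hash of their token prefix is what makes prefix sharing and cross-restart reuse possible: a restarted server can rehydrate matching blocks from the cold tier instead of recomputing the prefill.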
A single oMLX instance can load and schedule LLMs, VLMs (with multi-image input and automatic OCR model detection), embedding models, and rerankers, using LRU eviction and fine-grained memory limits to keep the system stable. Externally, it exposes fully compatible OpenAI and Anthropic API interfaces, with streaming output, adaptive thinking chains, function calling across model families, JSON Schema validation, and MCP tool integration.
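Because the API surface is OpenAI-compatible, existing SDKs work unchanged. A hedged sketch of a function-calling request follows; the model id and tool definition are placeholders, not part of oMLX:

```python
# Tool calling against oMLX's OpenAI-compatible endpoint; the model id and
# the tool below are placeholders, not anything oMLX ships.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-secret-key")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="your-model-name",  # placeholder: any loaded tool-capable model
    messages=[{"role": "user", "content": "Weather in Seoul?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```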
On the user-experience side, oMLX ships a native macOS menu-bar app built with PyObjC (not Electron), featuring one-click start/stop, crash guarding, and a fully offline web admin panel. Built directly on Apple's MLX framework at its core, the project also carries specific optimizations for coding tools such as Claude Code (context scaling and SSE keep-alive), making it a comprehensive local inference gateway for macOS.
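For a rough sense of what "native via PyObjC" means in practice, here is a minimal AppKit status-bar item in PyObjC. This is generic illustrative code, not oMLX's source:

```python
# Minimal PyObjC menu-bar item (generic AppKit usage, not oMLX's code).
from AppKit import (NSApplication, NSMenu, NSMenuItem, NSStatusBar,
                    NSVariableStatusItemLength)

app = NSApplication.sharedApplication()
item = NSStatusBar.systemStatusBar().statusItemWithLength_(
    NSVariableStatusItemLength)
item.button().setTitle_("oMLX")  # title shown in the menu bar

menu = NSMenu.alloc().init()
menu.addItem_(NSMenuItem.alloc().initWithTitle_action_keyEquivalent_(
    "Quit", "terminate:", "q"))  # responder chain routes terminate: to NSApp
item.setMenu_(menu)
app.run()
```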
Installation
- macOS App: Download .dmg from GitHub Releases, drag to Applications
- Homebrew:
brew tap jundot/omlx https://github.com/jundot/omlx && brew install omlx
- From source:
git clone https://github.com/jundot/omlx.git && pip install -e .
Quick Start
omlx serve --model-dir ~/models
# OpenAI-compatible API: http://localhost:8000/v1
# Chat UI: http://localhost:8000/admin/chat
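From there, any OpenAI client library can talk to the server. A minimal streaming example in Python, with a placeholder model id standing in for whatever lives in ~/models:

```python
# Streaming chat completion against the local server (model id is a placeholder).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-unless-set")

stream = client.chat.completions.create(
    model="your-model-name",  # placeholder model id
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    # Print deltas as they arrive; some chunks may carry no content.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```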
Core API Endpoints
- POST /v1/chat/completions: Chat completions (streaming)
- POST /v1/completions: Text completions (streaming)
- POST /v1/messages: Anthropic Messages API
- POST /v1/embeddings: Text embeddings
- POST /v1/rerank: Document reranking (example after this list)
- GET /v1/models: List available models
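The rerank endpoint has no OpenAI SDK equivalent, so plain HTTP is the natural fit. The body shape below (query plus documents) follows the common rerank-API convention and is an assumption, not oMLX's documented schema:

```python
# Rerank request over plain HTTP; the request body shape is assumed from the
# common rerank convention and may differ from oMLX's exact schema.
import requests

resp = requests.post(
    "http://localhost:8000/v1/rerank",
    json={
        "model": "your-reranker-model",  # placeholder model id
        "query": "tiered kv cache",
        "documents": [
            "oMLX offloads cold KV blocks to SSD.",
            "Electron apps embed a Chromium runtime.",
        ],
    },
    timeout=60,
)
print(resp.json())
```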
Key Configuration Options
- --max-model-memory 32GB: Model memory cap
- --max-process-memory 80%: Process memory cap
- --paged-ssd-cache-dir ~/.omlx/cache: SSD cold cache directory
- --hot-cache-max-size 20%: Hot cache ratio
- --max-concurrent-requests 16: Max concurrent requests
- --mcp-config mcp.json: MCP tool configuration
- --api-key your-secret-key: API key authentication
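Combined, a tuned launch might look like this (flags exactly as listed above):

omlx serve --model-dir ~/models \
  --max-model-memory 32GB \
  --max-process-memory 80% \
  --paged-ssd-cache-dir ~/.omlx/cache \
  --hot-cache-max-size 20% \
  --max-concurrent-requests 16 \
  --mcp-config mcp.json \
  --api-key your-secret-key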
Model Support
- LLMs: All mlx-lm supported models
- VLMs: Qwen3.5 series, GLM-4V, Pixtral, etc. (multi-image request sketch after this list)
- OCR: DeepSeek-OCR, DOTS-OCR, GLM-OCR (auto-detection with prompt optimization)
- Embedding: BERT, BGE-M3, ModernBERT
- Reranker: ModernBERT, XLM-RoBERTa
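Multi-image VLM input can be expressed with OpenAI-style content parts. A sketch with a placeholder model id and local files; oMLX's exact accepted image forms may differ:

```python
# Multi-image VLM request using OpenAI-style image_url content parts.
# The model id and image paths are placeholders.
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-unless-set")

def as_data_url(path: str) -> str:
    """Encode a local image as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="your-vlm-model",  # placeholder model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these two screenshots."},
            {"type": "image_url", "image_url": {"url": as_data_url("a.png")}},
            {"type": "image_url", "image_url": {"url": as_data_url("b.png")}},
        ],
    }],
)
print(resp.choices[0].message.content)
```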
Ecosystem Integration
- Upstream: Apple MLX, mlx-lm, mlx-vlm, mlx-embeddings
- Coding tools: Claude Code (specialized optimizations), OpenClaw, OpenCode, Codex, Pi
- Model sources: HuggingFace mlx-community org (in-panel search & download)
- Protocols: OpenAI API, Anthropic Messages API, MCP
The latest release is v0.3.7 (67 releases total). oMLX is Apache-2.0 licensed, maintained by Jun Kim (jundot), and requires macOS 15.0+ (Sequoia), Apple Silicon, and Python 3.10+.