A native macOS menu-bar LLM inference server optimized for Apple Silicon, featuring tiered KV cache and multi-model concurrency.
oMLX is a local LLM inference server designed exclusively for Apple Silicon (M1/M2/M3/M4) chips. Its standout feature is a tiered KV cache (RAM hot tier plus SSD cold tier), inspired by vLLM but significantly extended: infrequently used KV blocks are offloaded to disk in safetensors format, enabling prefix sharing, copy-on-write, and cache reuse across restarts. This sharply improves response times in workloads with heavy context switching, such as coding assistants.
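To make the mechanism concrete, here is a minimal sketch of a two-tier KV-block cache, assuming an LRU-ordered hot tier that spills evicted blocks to disk as safetensors files and promotes them back on a hit. All class and method names are illustrative, not oMLX's actual internals:

```python
# Illustrative two-tier KV-block cache: an LRU hot tier in RAM that spills
# evicted blocks to SSD as safetensors files. Names are hypothetical; this
# is a sketch of the idea, not oMLX's implementation.
from collections import OrderedDict
from pathlib import Path

import numpy as np
from safetensors.numpy import load_file, save_file


class TieredKVCache:
    def __init__(self, cold_dir: str, hot_capacity: int = 256):
        self.hot: OrderedDict[str, dict[str, np.ndarray]] = OrderedDict()
        self.hot_capacity = hot_capacity
        self.cold_dir = Path(cold_dir).expanduser()
        self.cold_dir.mkdir(parents=True, exist_ok=True)

    def put(self, block_hash: str, block: dict[str, np.ndarray]) -> None:
        """Insert a KV block keyed by a hash of its token prefix."""
        self.hot[block_hash] = block
        self.hot.move_to_end(block_hash)
        while len(self.hot) > self.hot_capacity:
            # Spill the least-recently-used block to the SSD cold tier.
            lru_hash, lru_block = self.hot.popitem(last=False)
            save_file(lru_block, str(self.cold_dir / f"{lru_hash}.safetensors"))

    def get(self, block_hash: str) -> dict[str, np.ndarray] | None:
        if block_hash in self.hot:
            self.hot.move_to_end(block_hash)  # refresh LRU recency
            return self.hot[block_hash]
        cold_path = self.cold_dir / f"{block_hash}.safetensors"
        if cold_path.exists():
            block = load_file(str(cold_path))  # promote cold block to hot tier
            self.put(block_hash, block)
            return block
        return None  # miss: the prefill must be recomputed
```

Keying blocks by a hash of their token prefix is what makes prefix sharing and cross-restart reuse possible: a restarted server can rehydrate matching blocks from the cold tier instead of recomputing the prefill.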
A single oMLX instance can load and schedule LLMs, VLMs (with multi-image input and automatic OCR model detection), embedding models, and rerankers, using LRU eviction and fine-grained memory limits to keep the system stable. Externally, it exposes fully compatible OpenAI and Anthropic API interfaces, with streaming output, adaptive thinking chains, function calling across model families, JSON Schema validation, and MCP tool integration.
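Because the API surface is OpenAI-compatible, existing SDKs work unchanged. A hedged sketch of a function-calling request follows; the model id and tool definition are placeholders, not part of oMLX:

```python
# Tool calling against oMLX's OpenAI-compatible endpoint; the model id and
# the tool below are placeholders, not anything oMLX ships.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-secret-key")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="your-model-name",  # placeholder: any loaded tool-capable model
    messages=[{"role": "user", "content": "Weather in Seoul?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```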
On the user-experience side, oMLX ships a native macOS menu-bar app built with PyObjC (not Electron), featuring one-click start/stop, crash guarding, and a fully offline web admin panel. Built directly on Apple's MLX framework at its core, the project also carries specific optimizations for coding tools such as Claude Code (context scaling and SSE keep-alive), making it a comprehensive local inference gateway for macOS.
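For a rough sense of what "native via PyObjC" means in practice, here is a minimal AppKit status-bar item in PyObjC. This is generic illustrative code, not oMLX's source:

```python
# Minimal PyObjC menu-bar item (generic AppKit usage, not oMLX's code).
from AppKit import (NSApplication, NSMenu, NSMenuItem, NSStatusBar,
                    NSVariableStatusItemLength)

app = NSApplication.sharedApplication()
item = NSStatusBar.systemStatusBar().statusItemWithLength_(
    NSVariableStatusItemLength)
item.button().setTitle_("oMLX")  # title shown in the menu bar

menu = NSMenu.alloc().init()
menu.addItem_(NSMenuItem.alloc().initWithTitle_action_keyEquivalent_(
    "Quit", "terminate:", "q"))  # responder chain routes terminate: to NSApp
item.setMenu_(menu)
app.run()
```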
Installation
- macOS App: Download .dmg from GitHub Releases, drag to Applications
- Homebrew:
brew tap jundot/omlx https://github.com/jundot/omlx && brew install omlx
- From source:
git clone https://github.com/jundot/omlx.git && pip install -e .
Quick Start
omlx serve --model-dir ~/models
# OpenAI-compatible API: http://localhost:8000/v1
# Chat UI: http://localhost:8000/admin/chat
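From there, any OpenAI client library can talk to the server. A minimal streaming example in Python, with a placeholder model id standing in for whatever lives in ~/models:

```python
# Streaming chat completion against the local server (model id is a placeholder).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-unless-set")

stream = client.chat.completions.create(
    model="your-model-name",  # placeholder model id
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    # Print deltas as they arrive; some chunks may carry no content.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```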
Core API Endpoints
- POST /v1/chat/completions: Chat completions (streaming)
- POST /v1/completions: Text completions (streaming)
- POST /v1/messages: Anthropic Messages API
- POST /v1/embeddings: Text embeddings
- POST /v1/rerank: Document reranking (example after this list)
- GET /v1/models: List available models
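The rerank endpoint has no OpenAI SDK equivalent, so plain HTTP is the natural fit. The body shape below (query plus documents) follows the common rerank-API convention and is an assumption, not oMLX's documented schema:

```python
# Rerank request over plain HTTP; the request body shape is assumed from the
# common rerank convention and may differ from oMLX's exact schema.
import requests

resp = requests.post(
    "http://localhost:8000/v1/rerank",
    json={
        "model": "your-reranker-model",  # placeholder model id
        "query": "tiered kv cache",
        "documents": [
            "oMLX offloads cold KV blocks to SSD.",
            "Electron apps embed a Chromium runtime.",
        ],
    },
    timeout=60,
)
print(resp.json())
```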
Key Configuration Options
- --max-model-memory 32GB: Model memory cap
- --max-process-memory 80%: Process memory cap
- --paged-ssd-cache-dir ~/.omlx/cache: SSD cold cache directory
- --hot-cache-max-size 20%: Hot cache ratio
- --max-concurrent-requests 16: Max concurrent requests
- --mcp-config mcp.json: MCP tool configuration
- --api-key your-secret-key: API key authentication
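Combined, a tuned launch might look like this (flags exactly as listed above):

omlx serve --model-dir ~/models \
  --max-model-memory 32GB \
  --max-process-memory 80% \
  --paged-ssd-cache-dir ~/.omlx/cache \
  --hot-cache-max-size 20% \
  --max-concurrent-requests 16 \
  --mcp-config mcp.json \
  --api-key your-secret-key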
Model Support
- LLMs: All mlx-lm supported models
- VLMs: Qwen3.5 series, GLM-4V, Pixtral, etc. (multi-image request sketch after this list)
- OCR: DeepSeek-OCR, DOTS-OCR, GLM-OCR (auto-detection with prompt optimization)
- Embedding: BERT, BGE-M3, ModernBERT
- Reranker: ModernBERT, XLM-RoBERTa
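Multi-image VLM input can be expressed with OpenAI-style content parts. A sketch with a placeholder model id and local files; oMLX's exact accepted image forms may differ:

```python
# Multi-image VLM request using OpenAI-style image_url content parts.
# The model id and image paths are placeholders.
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-unless-set")

def as_data_url(path: str) -> str:
    """Encode a local image as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="your-vlm-model",  # placeholder model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these two screenshots."},
            {"type": "image_url", "image_url": {"url": as_data_url("a.png")}},
            {"type": "image_url", "image_url": {"url": as_data_url("b.png")}},
        ],
    }],
)
print(resp.choices[0].message.content)
```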
Ecosystem Integration
- Upstream: Apple MLX, mlx-lm, mlx-vlm, mlx-embeddings
- Coding tools: Claude Code (specialized optimizations), OpenClaw, OpenCode, Codex, Pi
- Model sources: HuggingFace mlx-community org (in-panel search & download)
- Protocols: OpenAI API, Anthropic Messages API, MCP
The latest release is v0.3.7 (67 releases total). oMLX is Apache-2.0 licensed, maintained by Jun Kim (jundot), and requires macOS 15.0+ (Sequoia), Apple Silicon, and Python 3.10+.