A high-performance OpenAI-compatible API server for MLX models on Apple Silicon, supporting text, vision, audio transcription, and image generation/editing.
mlx-openai-server is a local inference API server designed exclusively for Apple Silicon (M-series chips) that provides full OpenAI API compatibility. Built on the MLX framework and FastAPI, it supports six model types: text-only language models (lm), multimodal models (multimodal, handling text, image, and audio), image generation (image-generation, e.g. the Flux series), image editing (image-edit), text embeddings (embeddings), and audio transcription (whisper).
Key features include multi-model parallel execution with on-demand loading/unloading, speculative decoding acceleration, prompt KV caching, configurable quantization (4/8/16-bit), LoRA adapter support, and tool call / structured output compatibility. Multi-model deployment uses a process-isolated architecture (HandlerProcessProxy) where each model runs in a separate subprocess, effectively resolving MLX Metal/GPU semaphore leak issues.
Usage is minimal: a single command launches a model, while multi-model setups are managed via a YAML configuration file. It exposes standard /v1/ endpoints that work seamlessly with the OpenAI SDK or any OpenAI-API-compatible frontend (e.g., OpenWebUI), so existing clients can switch to local inference without code changes. This makes it suitable for privacy-first inference, local AI coding assistants, and AI agent backends.
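As a minimal illustration of that compatibility, the sketch below points the official OpenAI Python SDK at a locally running instance. It assumes the server was launched with the quick-start command from the next section and is listening on the default 127.0.0.1:8000; the API key is a placeholder on the assumption that no authentication is enforced locally.

```python
# Minimal sketch: a chat completion against a local mlx-openai-server instance.
# Assumes the quick-start model is being served on the default address;
# the API key is a placeholder (authentication is assumed not to be required).
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",  # point the SDK at the local server
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="mlx-community/Qwen3-Coder-Next-4bit",  # name used at launch
    messages=[{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Because only base_url changes, the same snippet runs unmodified against the official OpenAI endpoint, which is the zero-code-change switching described above.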
Installation & Quick Start
- Requirements: macOS Apple Silicon, Python ≥3.11 and <3.13
- Install: uv pip install mlx-openai-server
- Quick launch: mlx-openai-server launch --model-path mlx-community/Qwen3-Coder-Next-4bit --model-type lm
- Whisper support requires ffmpeg
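If a Whisper model has been launched with --model-type whisper (and ffmpeg is installed), the same SDK can be used for transcription. This is a sketch under the assumption that the server exposes the standard /v1/audio/transcriptions endpoint; the served model name shown is hypothetical.

```python
# Sketch: audio transcription through the OpenAI SDK against a local Whisper model.
# Assumes a model was launched with --model-type whisper and that the standard
# /v1/audio/transcriptions endpoint is available; the model name is hypothetical.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",  # hypothetical served model name
        file=audio_file,
    )
print(transcript.text)
```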
Key CLI Parameters
- --model-path: MLX model path (local or Hugging Face), required
- --model-type: lm / multimodal / image-generation / image-edit / embeddings / whisper, required
- --config: YAML multi-model config file path
- --host / --port: bind address and port, default 127.0.0.1:8000
- --served-model-name: custom external model name
- --quantize: quantization level (4/8/16)
- --context-length / --max-tokens: context and generation length; max-tokens defaults to 100000
- --temperature: sampling temperature, default 1.0
- --draft-model-path: draft model path for speculative decoding
- --prompt-cache-size: prompt cache entries, default 10
- --lora-paths: LoRA adapter paths (comma-separated)
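The tool call and structured output compatibility mentioned in the overview is exercised through the same client interface. The sketch below assumes the quick-start server is still running and passes a hypothetical function definition; whether the model actually emits a tool call depends on the model itself.

```python
# Sketch: tool calling through the OpenAI SDK against the local server.
# The get_weather function is a hypothetical example; the server is assumed to
# forward tool definitions and return tool_calls in the standard OpenAI format.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="mlx-community/Qwen3-Coder-Next-4bit",
    messages=[{"role": "user", "content": "What is the weather in Cupertino?"}],
    tools=tools,
)

# Tool-call arguments arrive as a JSON string when the model chooses to call.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```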
Unconfirmed Information
- PyPI publication status inferred from install commands only
- No complete compatible model list; README mentions Gemma 4, MiniMax-M2.5, GLM-4.7, Qwen3, Flux, etc.
- No public performance benchmarks
- Concurrency limits unspecified
Current version 1.7.1, MIT License, authored by Gia-Huy Vuong (cubist38).