A high-performance OpenAI-compatible API server for MLX models on Apple Silicon, supporting text, vision, audio transcription, and image generation/editing.
mlx-openai-server is a local inference API server designed exclusively for Apple Silicon (M-series chips) that provides full OpenAI API compatibility. Built on the MLX framework and FastAPI, it supports six model types: text-only language models (lm), multimodal models (multimodal, handling text, image, and audio), image generation (image-generation, e.g. the Flux series), image editing (image-edit), text embeddings (embeddings), and audio transcription (whisper).
Key features include multi-model parallel execution with on-demand loading/unloading, speculative decoding acceleration, prompt KV caching, configurable quantization (4/8/16-bit), LoRA adapter support, and tool call / structured output compatibility. Multi-model deployment uses a process-isolated architecture (HandlerProcessProxy) where each model runs in a separate subprocess, effectively resolving MLX Metal/GPU semaphore leak issues.
Usage is minimal: a single command launches a model, while multi-model setups are managed via a YAML configuration file. It exposes standard /v1/ endpoints that work seamlessly with the OpenAI SDK or any OpenAI-API-compatible frontend (e.g., OpenWebUI), so existing clients can switch to local inference without code changes. This makes it suitable for privacy-first inference, local AI coding assistants, and AI agent backends.
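As a minimal illustration of that compatibility, the sketch below points the official OpenAI Python SDK at a locally running instance. It assumes the server was launched with the quick-start command from the next section and is listening on the default 127.0.0.1:8000; the API key is a placeholder on the assumption that no authentication is enforced locally.

```python
# Minimal sketch: a chat completion against a local mlx-openai-server instance.
# Assumes the quick-start model is being served on the default address;
# the API key is a placeholder (authentication is assumed not to be required).
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",  # point the SDK at the local server
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="mlx-community/Qwen3-Coder-Next-4bit",  # name used at launch
    messages=[{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Because only base_url changes, the same snippet runs unmodified against the official OpenAI endpoint, which is the zero-code-change switching described above.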
Installation & Quick Start
- Requirements: macOS Apple Silicon, Python ≥3.11 and <3.13
- Install: uv pip install mlx-openai-server
- Quick launch: mlx-openai-server launch --model-path mlx-community/Qwen3-Coder-Next-4bit --model-type lm
- Whisper support requires ffmpeg
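If a Whisper model has been launched with --model-type whisper (and ffmpeg is installed), the same SDK can be used for transcription. This is a sketch under the assumption that the server exposes the standard /v1/audio/transcriptions endpoint; the served model name shown is hypothetical.

```python
# Sketch: audio transcription through the OpenAI SDK against a local Whisper model.
# Assumes a model was launched with --model-type whisper and that the standard
# /v1/audio/transcriptions endpoint is available; the model name is hypothetical.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",  # hypothetical served model name
        file=audio_file,
    )
print(transcript.text)
```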
Key CLI Parameters
- --model-path: MLX model path (local or Hugging Face), required
- --model-type: lm / multimodal / image-generation / image-edit / embeddings / whisper, required
- --config: YAML multi-model config file path
- --host / --port: bind address and port, default 127.0.0.1:8000
- --served-model-name: custom external model name
- --quantize: quantization level (4/8/16)
- --context-length / --max-tokens: context and generation length; max-tokens defaults to 100000
- --temperature: sampling temperature, default 1.0
- --draft-model-path: draft model path for speculative decoding
- --prompt-cache-size: prompt cache entries, default 10
- --lora-paths: LoRA adapter paths (comma-separated)
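The tool call and structured output compatibility mentioned in the overview is exercised through the same client interface. The sketch below assumes the quick-start server is still running and passes a hypothetical function definition; whether the model actually emits a tool call depends on the model itself.

```python
# Sketch: tool calling through the OpenAI SDK against the local server.
# The get_weather function is a hypothetical example; the server is assumed to
# forward tool definitions and return tool_calls in the standard OpenAI format.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="mlx-community/Qwen3-Coder-Next-4bit",
    messages=[{"role": "user", "content": "What is the weather in Cupertino?"}],
    tools=tools,
)

# Tool-call arguments arrive as a JSON string when the model chooses to call.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```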
Unconfirmed Information
- PyPI publication status inferred from install commands only
- No complete compatible model list; README mentions Gemma 4, MiniMax-M2.5, GLM-4.7, Qwen3, Flux, etc.
- No public performance benchmarks
- Concurrency limits unspecified
Current version 1.7.1, MIT License, authored by Gia-Huy Vuong (cubist38).