
mlx-openai-server

Added Apr 23, 2026
Model & Inference Framework
Open Source
Python · PyTorch · Large Language Models · FastAPI · Multimodal · Deep Learning · CLI · Model & Inference Framework · Model Training & Inference · Protocol, API & Integration · Computer Vision & Multimodal

A high-performance OpenAI-compatible API server for MLX models on Apple Silicon, supporting text, vision, audio transcription, and image generation/editing.

mlx-openai-server is a local inference API server designed exclusively for Apple Silicon (M-series chips) that provides full OpenAI API compatibility. Built on the MLX framework and FastAPI, it covers six model types: text-only language models (lm), multimodal models (multimodal; text, image, and audio inputs), image generation (image-generation; e.g., the Flux series), image editing (image-edit), text embeddings (embeddings), and audio transcription (whisper).

Key features include multi-model parallel execution with on-demand loading and unloading, speculative decoding, prompt KV caching, configurable quantization (4/8/16-bit), LoRA adapter support, and tool-calling / structured-output compatibility. Multi-model deployment uses a process-isolated architecture (HandlerProcessProxy) in which each model runs in its own subprocess, sidestepping MLX Metal/GPU semaphore leak issues.
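The tool-calling compatibility mentioned above follows the OpenAI chat-completions wire format. The sketch below builds such a request payload; the `get_weather` tool and its schema are illustrative assumptions, not part of mlx-openai-server:

```python
# Sketch of an OpenAI-format tool-calling request payload, as accepted by any
# OpenAI-compatible /v1/chat/completions endpoint. The example tool below is
# hypothetical; mlx-openai-server forwards whatever tools the client declares.

def build_tool_call_request(model: str, prompt: str) -> dict:
    """Build a chat request advertising one callable tool to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical example tool
                    "description": "Look up current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
        "tool_choice": "auto",  # let the model decide whether to call it
    }
```

When the model elects to call the tool, the response's `choices[0].message.tool_calls` carries the function name and JSON arguments, exactly as with the hosted OpenAI API.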

Usage is minimal: a single command launches a model, and multi-model setups are managed via a YAML configuration file. The server exposes the standard /v1/ endpoints, so it works seamlessly with the OpenAI SDK or any OpenAI-API-compatible frontend (e.g., OpenWebUI), and switching to local inference requires no code changes. This makes it well suited to privacy-first inference, local AI coding assistants, and AI agent backends.
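Because the endpoints follow the OpenAI wire format, any OpenAI client works. A minimal sketch using only the Python standard library, assuming the server is running with its defaults (127.0.0.1:8000) and an illustrative model name:

```python
# Minimal sketch: calling a locally running mlx-openai-server through its
# OpenAI-compatible REST API, using only the standard library. Host, port,
# and model name below are assumptions based on the documented defaults.
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000/v1"

def build_chat_request(model: str, prompt: str, temperature: float = 1.0) -> dict:
    """Assemble a /v1/chat/completions payload in the OpenAI wire format."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(model: str, prompt: str) -> str:
    """POST a chat request and return the first choice's message text."""
    payload = build_chat_request(model, prompt)
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Requires a running server, e.g. the quick-launch command below.
    print(chat("mlx-community/Qwen3-Coder-Next-4bit", "Hello!"))
```

The official `openai` Python SDK works the same way: point `base_url` at `http://127.0.0.1:8000/v1` and supply any placeholder API key.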

Installation & Quick Start

  • Requirements: macOS Apple Silicon, Python ≥3.11 and <3.13
  • Install: uv pip install mlx-openai-server
  • Quick launch: mlx-openai-server launch --model-path mlx-community/Qwen3-Coder-Next-4bit --model-type lm
  • Whisper support requires ffmpeg

CLI Key Parameters

  • --model-path: MLX model path (local or HuggingFace), required
  • --model-type: lm / multimodal / image-generation / image-edit / embeddings / whisper, required
  • --config: YAML multi-model config file path
  • --host / --port: bind address and port, default 127.0.0.1:8000
  • --served-model-name: custom external model name
  • --quantize: quantization level (4/8/16)
  • --context-length / --max-tokens: context window and generation length; --max-tokens defaults to 100000
  • --temperature: sampling temperature, default 1.0
  • --draft-model-path: speculative decoding draft model path
  • --prompt-cache-size: prompt cache entries, default 10
  • --lora-paths: LoRA adapter paths (comma-separated)
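Multi-model launches go through the YAML file passed to --config. The project's schema is not documented in this summary, so the fragment below is purely illustrative: key names are guessed from the CLI flag names above, and both model paths are placeholders.

```yaml
# Hypothetical multi-model config sketch. Key names mirror the CLI flags
# documented above and are assumptions, not the project's published schema.
models:
  - model-path: mlx-community/Qwen3-Coder-Next-4bit
    model-type: lm
    served-model-name: qwen3-coder
    quantize: 4
  - model-path: mlx-community/some-whisper-model   # placeholder path
    model-type: whisper

# Launch with: mlx-openai-server launch --config models.yaml
```

With the process-isolated architecture described earlier, each entry would run in its own subprocess and be loaded on demand.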

Unconfirmed Information

  • PyPI publication status inferred from install commands only
  • No complete compatible model list; README mentions Gemma 4, MiniMax-M2.5, GLM-4.7, Qwen3, Flux, etc.
  • No public performance benchmarks
  • Concurrency limits unspecified

Current version 1.7.1, MIT License, authored by Gia-Huy Vuong (cubist38).
