LLM inference in C/C++ with state-of-the-art performance, locally or in the cloud, and minimal setup via the GGUF format and broad multi-hardware backend support.
## Overview
llama.cpp is a cross-platform LLM inference framework implemented in pure C/C++, maintained by ggml-org. It focuses exclusively on inference execution and format conversion — no model training. Through its custom GGUF model format and extensive hardware backend support, it enables high-performance quantized inference deployment from edge devices to the cloud.
## Core Capabilities

### Hardware & Backend Support
- Apple Silicon: First-class Metal framework support with ARM NEON and Accelerate
- Mainstream GPUs: NVIDIA (custom CUDA kernels), AMD (HIP), Intel/NVIDIA (SYCL), Vulkan (generic GPU), Adreno (OpenCL)
- China-specific Hardware: Moore Threads (MUSA), Ascend (CANN)
- CPU Backends: BLAS / BLIS / ZenDNN / IBM zDNN
- In-progress Backends: OpenVINO (Intel), WebGPU, Hexagon (Snapdragon)
### Inference Optimization
- Quantization: 1.5-bit to 8-bit integer quantization for reduced memory footprint
- Heterogeneous Hybrid Inference: Offloads as many layers as fit to the GPU and runs the remainder on the CPU when a model exceeds available VRAM
- Speculative Decoding: Supported in `llama-server`, where a smaller draft model proposes tokens for the main model to verify
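The memory savings from quantization come from storing weights as small integers plus a shared per-block scale. Below is a toy sketch of block-wise absmax 8-bit quantization in Python; it is illustrative only, not llama.cpp's actual kernels, which are implemented in optimized C with additional block-layout details:

```python
def quantize_q8(block):
    """Quantize a block of floats to int8 range with a shared absmax scale."""
    scale = max(abs(x) for x in block) / 127.0 or 1.0  # avoid division by zero
    q = [round(x / scale) for x in block]
    return scale, q

def dequantize_q8(scale, q):
    """Recover approximate floats from the integer values and scale."""
    return [scale * v for v in q]

weights = [0.12, -0.5, 0.33, 0.9, -0.07, 0.0, 0.44, -0.91]
scale, q = quantize_q8(weights)
approx = dequantize_q8(scale, q)

# one shared scale plus ~1 byte per weight, vs. 4 bytes each for float32;
# rounding error per weight is bounded by half the scale
max_err = max(abs(a - b) for a, b in zip(weights, approx))
assert max_err <= scale / 2 + 1e-12
```

The same idea, applied with 4-bit integers and smaller blocks, is what pushes GGUF formats down toward the 1.5-8 bit range listed above.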
### Model Format & Services
- GGUF Format: Custom model storage format with conversion from HuggingFace and other formats
- OpenAI-Compatible API: `llama-server` provides `/v1/chat/completions` and other standard endpoints with multi-user parallel decoding
- Multimodal Inference: Vision-language support (e.g., HunyuanVL)
- Embedding & Reranking: Serves as embedding and reranking model endpoint
- Output Constraints: GBNF grammar for structured output (e.g., JSON Schema enforcement)
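Because the API is OpenAI-compatible, a client talks to `llama-server` exactly as it would to the OpenAI service. A minimal sketch of building such a request with only the Python standard library (the host/port and model name are placeholders for a locally running server):

```python
import json
import urllib.request

# default llama-server endpoint (port 8080)
URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "gemma-3-1b-it",  # placeholder; the server answers for its loaded model
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one sentence."},
    ],
    "temperature": 0.7,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# uncomment once a llama-server instance is running locally:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any existing OpenAI SDK client can be pointed at the same URL by overriding its base URL.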
## Typical Use Cases
| Scenario | Description |
|---|---|
| Local LLM Inference | Run quantized LLMs on laptops/desktops without GPU servers |
| OpenAI-Compatible API | Quickly set up local ChatGPT-compatible endpoints |
| Edge Device Deployment | Android, iOS, Snapdragon and other mobile/embedded platforms |
| Model Evaluation | Perplexity measurement and performance benchmarking |
| Model Quantization & Conversion | Convert HuggingFace models to GGUF format |
| Developer Integration | C/C++ library, XCFramework prebuilt binaries, Python bindings |
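For the local-inference scenario, a back-of-the-envelope estimate shows why quantization makes laptop deployment feasible. This sketch assumes common effective bit-widths (e.g., roughly 4.5 bits/weight for a 4-bit block format once per-block scales are counted); real GGUF files add metadata, and some tensors may stay at higher precision:

```python
def weight_gib(n_params, bits_per_weight):
    """Approximate weight storage in GiB for a given effective bit-width."""
    return n_params * bits_per_weight / 8 / 1024**3

n = 7_000_000_000  # a hypothetical 7B-parameter model
for name, bits in [("F16", 16), ("Q8 (~8.5 bpw)", 8.5), ("Q4 (~4.5 bpw)", 4.5)]:
    print(f"{name:>14}: {weight_gib(n, bits):.1f} GiB")
```

At roughly 4.5 bits/weight, a 7B model's weights drop from about 13 GiB (F16) to under 4 GiB, which fits comfortably in laptop RAM.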
## Architecture
- Compute Core: Built on the `ggml` tensor library; llama.cpp serves as its primary testbed
- API Layer: `libllama` wraps the inference logic; `libllama-common` provides shared utilities
- Server: `llama-server` uses the single-header HTTP library `cpp-httplib`
- Multimodal Subsystem: Image decoding (`stb-image`), audio decoding (`miniaudio.h`), and JSON parsing (`nlohmann/json`), all via single-header libraries with zero external dependencies
- Conversion Pipeline: `convert_*.py` scripts plus the `gguf-py` toolkit
- Build System: CMake with CMakePresets exclusively; multi-platform CI via GitHub Actions
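The GGUF files produced by the conversion pipeline begin with a small fixed header. A minimal sketch of decoding it in pure Python, following the published GGUF spec (magic `GGUF`, then a little-endian version, tensor count, and metadata key-value count); the real reader is `GGUFReader` in `gguf-py`:

```python
import struct

def read_gguf_header(data: bytes):
    """Decode the fixed GGUF header: magic, version, tensor and KV counts."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    # uint32 version, uint64 tensor_count, uint64 metadata_kv_count (little-endian)
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

# synthetic header bytes: version 3, 2 tensors, 5 metadata entries
blob = b"GGUF" + struct.pack("<IQQ", 3, 2, 5)
print(read_gguf_header(blob))
# → {'version': 3, 'tensor_count': 2, 'metadata_kv_count': 5}
```

The metadata key-value pairs that follow the header carry everything the runtime needs (architecture, tokenizer, hyperparameters), which is what makes GGUF a single-file deployment format.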
## Installation & Usage
Installation: brew / nix / winget package managers, Docker containers, prebuilt binaries, or CMake source build
Quick Commands:

```sh
# run a local GGUF model interactively
llama-cli -m my_model.gguf

# fetch a model directly from Hugging Face and chat with it
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# serve the same model over an OpenAI-compatible HTTP API
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```
Core Toolset:
- `llama-cli`: Main interactive tool for chat, completion, and experimentation
- `llama-server`: OpenAI-compatible HTTP server (default port 8080)
- `llama-perplexity`: Perplexity and quality metric measurement
- `llama-bench`: Inference performance benchmarking
- `llama-simple`: Minimal example for developer reference
## Developer Experience
- IDE Integration: VS Code extension and Vim/Neovim plugins with Fill-in-the-Middle (FIM) support
- Multi-language Bindings: Native C/C++ library + Python bindings + Apple XCFramework prebuilt packages
- Swift Integration: Import via XCFramework binary packages without source compilation