A lightweight framework for running and managing large language models locally with a Docker-like experience, supporting macOS, Linux, and Windows with GPU acceleration.
## Overview
Ollama is an AI infrastructure tool designed for local environments. It addresses the pain points developers face when running open-source large models on a local PC: complex configuration, cumbersome model downloads, and hardware adaptation.
## Core Features

### Easy Installation & Execution
- Docker-like experience - install and run models with a single command
- Cross-platform support: macOS (Metal), plus Linux and Windows (NVIDIA CUDA and AMD ROCm)
- Automatic local GPU acceleration
### Model Library Management
- Docker Hub-like model repository with `ollama pull` and `ollama run` commands
- Default 4-bit quantization (Q4_0) for reduced memory usage
- Rich built-in model library: Llama 3, Mistral, Gemma, Qwen, etc.
### Modelfile Mechanism
- Declarative configuration file for defining model source, system prompts, temperature parameters, stop tokens, etc.
- Dockerfile-like design philosophy for easy customization
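A minimal Modelfile might look like the sketch below; the base model, parameter values, and system prompt are illustrative:

```
FROM llama3
PARAMETER temperature 0.7
SYSTEM "You are a concise technical assistant."
```

Building a custom model from it follows the Docker workflow: `ollama create my-assistant -f Modelfile`, then `ollama run my-assistant`.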
### REST API Service
- Built-in local HTTP service (default port 11434)
- OpenAI-compatible API interface
- Supports streaming chat responses with multi-turn conversational context
- Text embedding generation for RAG applications
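Assuming the server is running on the default port, the OpenAI-compatible endpoint and the embeddings endpoint can be exercised with `curl`; the model names are examples and must already be pulled locally:

```shell
# OpenAI-compatible chat completion (note the /v1 prefix)
curl http://localhost:11434/v1/chat/completions -d '{
  "model": "llama3",
  "messages": [{"role": "user", "content": "Hello"}]
}'

# Generate a text embedding, e.g. for a RAG pipeline
curl http://localhost:11434/api/embeddings -d '{
  "model": "llama3",
  "prompt": "The sky is blue because of Rayleigh scattering"
}'
```

Because the `/v1` routes mirror OpenAI's API shape, existing OpenAI client libraries can usually be pointed at `http://localhost:11434/v1` with only a base-URL change.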
### Multimodal Support
- Supports LLaVA, BakLLaVA and other vision-language models
- Image understanding capabilities
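As a sketch of how this looks from the CLI (the image path is a placeholder), a vision-language model can be asked about a local file by including its path in the prompt:

```shell
# Pull a vision-language model, then ask about a local image
ollama pull llava
ollama run llava "What is in this image? ./photo.png"
```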
## Technical Architecture
- Tech Stack: Go (main logic), C++ (inference library bindings)
- Inference Engine: llama.cpp integration
- Model Format: GGUF format with weights, tokenizer, and metadata in a single file
- Memory Management: Dynamic model loading/unloading, mmap support for faster loading
## Runtime Mechanism
- Ollama Server: Background daemon managing model loading, GPU memory allocation, and request scheduling
- Ollama CLI: Client tool that communicates with the server over its local HTTP API
- Concurrent request queue management (hardware dependent)
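The server's runtime state can be inspected and its concurrency tuned from the shell. `ollama ps` and the environment variables below are part of the standard CLI and server; the specific values shown are illustrative:

```shell
# Show models currently loaded in memory (size, CPU/GPU placement, expiry)
ollama ps

# Start the server with explicit concurrency limits (illustrative values)
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=2 ollama serve
```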
## Usage Examples

```shell
# Run a model after installation
ollama run llama3

# List locally downloaded models
ollama list

# API call example
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?"
}'
```
## Ecosystem Integration
- Web UI: Open WebUI, Text Generation Web UI
- Official SDKs: ollama-python, ollama-js
- Framework Integration: LangChain, LlamaIndex