Smart AI model cascading library using speculative execution to dynamically select optimal models, achieving 40-85% cost savings and 2-10x latency reduction.
cascadeflow is an open-source smart model cascading library (MIT License) developed by Lemony Inc. It uses a drafter-validator pattern to reduce the cost and latency of LLM calls.
Core Mechanism
The project uses speculative execution: a low-cost model (e.g., gpt-4o-mini, $0.15-0.30/1M tokens) first generates a draft response, which a quality engine validates for length, confidence, JSON format, and semantic alignment. If the draft passes, it is returned directly; if not, the query automatically escalates to an expensive model ($1.25-3.00/1M tokens).
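The drafter-validator flow can be sketched as follows. This is an illustrative sketch, not the cascadeflow API: the model callables are stubbed, and the names `validate_draft` and `cascade` (plus the length/JSON checks) are hypothetical stand-ins for the library's quality engine.

```python
import json
from typing import Callable

def validate_draft(draft: str, expect_json: bool = False, min_length: int = 10) -> bool:
    """Cheap quality gates: minimum length and (optionally) JSON well-formedness."""
    if len(draft.strip()) < min_length:
        return False
    if expect_json:
        try:
            json.loads(draft)
        except ValueError:
            return False
    return True

def cascade(
    prompt: str,
    draft_model: Callable[[str], str],
    strong_model: Callable[[str], str],
    expect_json: bool = False,
) -> tuple[str, str]:
    """Try the cheap drafter first; escalate to the strong model only on failure."""
    draft = draft_model(prompt)
    if validate_draft(draft, expect_json):
        return draft, "drafter"  # the common path: draft accepted, no second call
    return strong_model(prompt), "escalated"
```

The key property is that the expensive model is only invoked for the minority of queries whose drafts fail validation, so the cheap model's cost is the marginal price of every query.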
Performance
- Cost savings: 40-93% in benchmarks (MT-Bench 69%, GSM8K 93%, MMLU 52%, TruthfulQA 80%)
- Latency optimization: small models respond in <50ms vs 500-2000ms for large models, for a 2-10x overall speedup
- Framework overhead: <2ms
- Quality retention: 96% of GPT-5 quality; 70-80% of queries accept the draft
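The savings figures follow from simple expected-cost arithmetic. The numbers below are illustrative assumptions (a 75% draft-acceptance rate, a $0.30/1M drafter, a $3.00/1M fallback), not measured benchmarks:

```python
# Back-of-the-envelope expected cost per 1M tokens under assumed rates.
accept_rate = 0.75          # assumed fraction of queries whose draft passes
drafter_cost = 0.30         # $/1M tokens, cheap model
fallback_cost = 3.00        # $/1M tokens, strong model

# Every query pays the drafter; rejected queries also pay the fallback.
expected = drafter_cost + (1 - accept_rate) * fallback_cost   # 1.05
baseline = fallback_cost                                      # always-strong baseline
savings = 1 - expected / baseline                             # 0.65

print(f"expected ${expected:.2f}/1M vs ${baseline:.2f}/1M -> {savings:.0%} saved")
```

Higher acceptance rates (as on GSM8K-style workloads) push savings toward the top of the reported range; lower ones pull it toward the bottom.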
Intelligent Routing
- Complexity detector: 5-level classification (trivial, simple, moderate, hard, expert)
- Domain-specific routing: Auto-detect 15 domains
- Pre-router: Decide direct call vs cascade
- Tool router: Optimize tool-calling scenarios
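The routing idea can be sketched with a toy heuristic. This is a hypothetical sketch, not cascadeflow's detector: the keyword and length scoring below is invented for illustration, and only the five level names come from the source.

```python
# Toy 5-level complexity detector and pre-router (illustrative heuristic only).
LEVELS = ("trivial", "simple", "moderate", "hard", "expert")

def classify(query: str) -> str:
    """Score a query on crude signals and map it to one of five levels."""
    words = query.split()
    score = 0
    if len(words) > 30:           # long prompts tend to be harder
        score += 2
    elif len(words) > 12:
        score += 1
    if any(k in query.lower() for k in ("prove", "derive", "optimize", "architecture")):
        score += 2                # expert-leaning vocabulary
    if "?" not in query and len(words) > 5:
        score += 1                # open-ended instruction rather than a question
    return LEVELS[min(score, 4)]

def should_cascade(query: str) -> bool:
    """Pre-router: trivial/simple queries go straight to the cheap model;
    expert queries may skip drafting entirely; the middle band cascades."""
    return classify(query) in ("moderate", "hard")
```

A real detector would use embeddings or a trained classifier, but the routing contract is the same: cheap direct calls at the extremes, cascading in between.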
Provider Support
Native support for OpenAI, Anthropic, Groq, Ollama, vLLM, Together, Hugging Face; extended to 17+ providers via Vercel AI SDK; optional LiteLLM integration for 100+ providers.
Integration
- Python SDK: `pip install cascadeflow[all]`
- TypeScript SDK: `npm install @cascadeflow/core`
- Gateway mode: no code changes needed for existing applications
- Framework integrations: LangChain, n8n, FastAPI
Quick Start
```python
import asyncio

from cascadeflow import CascadeAgent, ModelConfig

agent = CascadeAgent(models=[
    ModelConfig(name="gpt-4o-mini", provider="openai", cost=0.000375),
    ModelConfig(name="gpt-5", provider="openai", cost=0.00562),
])

async def main():
    result = await agent.run("What's the capital of France?")
    print(f"Model used: {result.model_used}")
    print(f"Cost: ${result.total_cost:.6f}")

asyncio.run(main())
```
Core Configuration
- CascadeAgent: main coordinator with `run()`, `run_streaming()`, `stream_events()`
- ModelConfig: defines model name, provider, cost, speed, quality score, and specialized domains
- CascadeResult: 30+ diagnostic fields, including content, cost, latency, complexity, and quality scores
Use Cases
Cost control for high-concurrency LLM applications, edge/local-first deployment, low-latency chatbots and agents, structured output and tool-calling scenarios.