Smart AI model cascading library using speculative execution to dynamically select optimal models, achieving 40-85% cost savings and 2-10x latency reduction.
cascadeflow is an open-source smart model cascading library (MIT License) developed by Lemony Inc. It uses a drafter-validator pattern to reduce the cost and latency of LLM calls.
Core Mechanism
The project uses speculative execution: a low-cost model (e.g., gpt-4o-mini, $0.15-0.30/1M tokens) first generates a draft response, which a quality engine validates for length, confidence, JSON format, and semantic alignment. If the draft passes, it is returned directly; if not, the query automatically escalates to an expensive model ($1.25-3.00/1M tokens).
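The drafter-validator flow can be sketched as follows. This is an illustrative sketch, not the cascadeflow API: the model callables are stubbed, and the names `validate_draft` and `cascade` (plus the length/JSON checks) are hypothetical stand-ins for the library's quality engine.

```python
import json
from typing import Callable

def validate_draft(draft: str, expect_json: bool = False, min_length: int = 10) -> bool:
    """Cheap quality gates: minimum length and (optionally) JSON well-formedness."""
    if len(draft.strip()) < min_length:
        return False
    if expect_json:
        try:
            json.loads(draft)
        except ValueError:
            return False
    return True

def cascade(
    prompt: str,
    draft_model: Callable[[str], str],
    strong_model: Callable[[str], str],
    expect_json: bool = False,
) -> tuple[str, str]:
    """Try the cheap drafter first; escalate to the strong model only on failure."""
    draft = draft_model(prompt)
    if validate_draft(draft, expect_json):
        return draft, "drafter"  # the common path: draft accepted, no second call
    return strong_model(prompt), "escalated"
```

The key property is that the expensive model is only invoked for the minority of queries whose drafts fail validation, so the cheap model's cost is the marginal price of every query.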
Performance
- Cost savings: 40-93% in benchmarks (MT-Bench 69%, GSM8K 93%, MMLU 52%, TruthfulQA 80%)
- Latency optimization: small models respond in <50ms vs 500-2000ms for large models, for a 2-10x overall speedup
- Framework overhead: <2ms
- Quality retention: 96% of GPT-5 quality; 70-80% of queries accept the draft
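The savings figures follow from simple expected-cost arithmetic. The numbers below are illustrative assumptions (a 75% draft-acceptance rate, a $0.30/1M drafter, a $3.00/1M fallback), not measured benchmarks:

```python
# Back-of-the-envelope expected cost per 1M tokens under assumed rates.
accept_rate = 0.75          # assumed fraction of queries whose draft passes
drafter_cost = 0.30         # $/1M tokens, cheap model
fallback_cost = 3.00        # $/1M tokens, strong model

# Every query pays the drafter; rejected queries also pay the fallback.
expected = drafter_cost + (1 - accept_rate) * fallback_cost   # 1.05
baseline = fallback_cost                                      # always-strong baseline
savings = 1 - expected / baseline                             # 0.65

print(f"expected ${expected:.2f}/1M vs ${baseline:.2f}/1M -> {savings:.0%} saved")
```

Higher acceptance rates (as on GSM8K-style workloads) push savings toward the top of the reported range; lower ones pull it toward the bottom.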
Intelligent Routing
- Complexity detector: 5-level classification (trivial, simple, moderate, hard, expert)
- Domain-specific routing: Auto-detect 15 domains
- Pre-router: Decide direct call vs cascade
- Tool router: Optimize tool-calling scenarios
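The routing idea can be sketched with a toy heuristic. This is a hypothetical sketch, not cascadeflow's detector: the keyword and length scoring below is invented for illustration, and only the five level names come from the source.

```python
# Toy 5-level complexity detector and pre-router (illustrative heuristic only).
LEVELS = ("trivial", "simple", "moderate", "hard", "expert")

def classify(query: str) -> str:
    """Score a query on crude signals and map it to one of five levels."""
    words = query.split()
    score = 0
    if len(words) > 30:           # long prompts tend to be harder
        score += 2
    elif len(words) > 12:
        score += 1
    if any(k in query.lower() for k in ("prove", "derive", "optimize", "architecture")):
        score += 2                # expert-leaning vocabulary
    if "?" not in query and len(words) > 5:
        score += 1                # open-ended instruction rather than a question
    return LEVELS[min(score, 4)]

def should_cascade(query: str) -> bool:
    """Pre-router: trivial/simple queries go straight to the cheap model;
    expert queries may skip drafting entirely; the middle band cascades."""
    return classify(query) in ("moderate", "hard")
```

A real detector would use embeddings or a trained classifier, but the routing contract is the same: cheap direct calls at the extremes, cascading in between.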
Provider Support
Native support for OpenAI, Anthropic, Groq, Ollama, vLLM, Together, Hugging Face; extended to 17+ providers via Vercel AI SDK; optional LiteLLM integration for 100+ providers.
Integration
- Python SDK: `pip install cascadeflow[all]`
- TypeScript SDK: `npm install @cascadeflow/core`
- Gateway mode: no code changes needed for existing applications
- Framework integrations: LangChain, n8n, FastAPI
Quick Start
```python
import asyncio

from cascadeflow import CascadeAgent, ModelConfig

agent = CascadeAgent(models=[
    ModelConfig(name="gpt-4o-mini", provider="openai", cost=0.000375),
    ModelConfig(name="gpt-5", provider="openai", cost=0.00562),
])

async def main():
    result = await agent.run("What's the capital of France?")
    print(f"Model used: {result.model_used}")
    print(f"Cost: ${result.total_cost:.6f}")

asyncio.run(main())
```
Core Configuration
- CascadeAgent: main coordinator with `run()`, `run_streaming()`, `stream_events()`
- ModelConfig: defines model name, provider, cost, speed, quality score, and specialized domains
- CascadeResult: 30+ diagnostic fields, including content, cost, latency, complexity, and quality scores
Use Cases
Cost control for high-concurrency LLM applications, edge/local-first deployment, low-latency chatbots and agents, structured output and tool-calling scenarios.