Routing Strategies
Waterfall isn't just a model API — it's an intelligent router. Instead of picking a model yourself, you pick a strategy, and Waterfall handles the rest: fallbacks, rate limits, cost optimisation, and capability matching.
How to select a strategy
// Pass routing_strategy in extra_body
const response = await openai.chat.completions.create({
  model: "waterfall",
  messages: [...],
  extra_body: { routing_strategy: "free_smart" },
});
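The call above can be wrapped in a small typed helper so that a misspelled strategy name fails at compile time. This is only a sketch: the `RoutingStrategy` union lists the strategy names documented on this page, while `buildWaterfallRequest` and `ChatMessage` are hypothetical names introduced for illustration and are not part of any Waterfall SDK.

```typescript
// Strategy names as documented on this page.
type RoutingStrategy =
  | "free_smart" | "cheap_smart" | "privacy_smart" | "tool_calling"
  | "orchestrator" | "reasoning" | "speed_first" | "context_max"
  | "smart_video" | "smart_legal" | "smart_health" | "smart_image"
  | "smart_transcribe" | "smart_voice" | "smart_multilingual_voice";

// Hypothetical minimal message shape (OpenAI chat format, text only).
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical helper: builds the same request body as the snippet above.
function buildWaterfallRequest(strategy: RoutingStrategy, messages: ChatMessage[]) {
  return {
    model: "waterfall",
    messages,
    extra_body: { routing_strategy: strategy },
  };
}

const exampleRequest = buildWaterfallRequest("free_smart", [
  { role: "user", content: "Hello" },
]);
console.log(JSON.stringify(exampleRequest));
```

The resulting object can be passed to `openai.chat.completions.create` in place of the inline literal.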
Live Strategies
free_smart
Routes exclusively through our curated pool of free-tier models: OpenRouter :free endpoints and NVIDIA NIM community models. The pool is re-validated hourly so you never accidentally hit a model that switched to paid.
Cascade
- DeepSeek R1 (free)
- Llama 4 Maverick (free)
- Gemini 2.5 Pro (free)
- NVIDIA Kimi K2 (free)
- 70+ more free models
Best for
- Prototyping and learning
- High-volume agent loops
- Hermes sub-tasks
- Any zero-budget use case
cheap_smart
Cascades through paid models priced under $0.50 per 1M tokens, such as DeepSeek V3, Gemini Flash, and GPT-4o Mini. The quality ceiling is slightly higher than free_smart, at a small cost.
Cascade
- DeepSeek V3 ($0.14/1M)
- Gemini 2.0 Flash ($0.15/1M)
- GPT-4o Mini ($0.15/1M)
- Llama 4 Scout via Groq
Best for
- Production apps with a budget
- When free_smart is rate-limited
- High-quality tool calling at low cost
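One way to act on the "when free_smart is rate-limited" case is to retry the same request with `cheap_smart`. The sketch below assumes that a rate-limited request rejects with an error carrying an HTTP `status` of 429, as errors from the OpenAI Node SDK do. `send` is a caller-supplied function, and `completeWithFallback` and `isRateLimited` are hypothetical helper names.

```typescript
// Assumption: a rate-limited request rejects with an error exposing `status: 429`.
function isRateLimited(err: unknown): boolean {
  return typeof err === "object" && err !== null
    && (err as { status?: unknown }).status === 429;
}

// Try each strategy in order, moving to the next one only on a rate-limit error.
async function completeWithFallback<T>(
  strategies: string[],
  send: (strategy: string) => Promise<T>,
): Promise<T> {
  let lastError: unknown = new Error("no strategies given");
  for (const strategy of strategies) {
    try {
      return await send(strategy);
    } catch (err) {
      if (!isRateLimited(err)) throw err; // other failures are not retried
      lastError = err;
    }
  }
  throw lastError; // every strategy was rate-limited (or the list was empty)
}
```

A caller would pass `["free_smart", "cheap_smart"]` and a `send` function that issues the chat completion with the given `routing_strategy`. Whether a client-side retry like this is needed at all depends on how Waterfall reports an exhausted pool, which this page does not specify.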
privacy_smart
Routes to Zero Data Retention (ZDR) OpenRouter models and direct-provider routes where available. ZDR means prompt and response content should not be stored after processing. It is not the same as HIPAA compliance, a BAA, or a full legal-data compliance program.
Cascade
- NVIDIA NIM (direct — no intermediary)
- OpenRouter ZDR models (348+ providers)
- Claude Sonnet 4 (ZDR)
- GPT-5 (ZDR)
- Gemini 3.1 Pro (ZDR)
Best for
- Privacy-sensitive app traffic
- Internal drafts and notes
- Internal enterprise tools
- Avoiding normal provider logging where possible
tool_calling
Prioritises models with consistently correct JSON/tool-call outputs. Falls back through a validated chain so your agents never get malformed function calls.
Cascade
- Claude Sonnet 4
- GPT-4.1
- Llama 4 Maverick (free)
- Elephant Alpha (free)
- NVIDIA GLM-4.7 (free)
Best for
- Claude Code / Hermes backends
- Multi-step agent pipelines
- Structured data extraction
- API orchestration
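Even with a routing strategy tuned for tool calls, agent code usually validates a call before dispatching it. A minimal sketch, assuming responses follow the OpenAI chat format, in which each tool call carries `function.name` and a JSON-encoded `function.arguments` string. `parseToolCall` and `ParsedCall` are hypothetical names.

```typescript
// Shape of one tool call in the OpenAI chat-completions format.
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
}

type ParsedCall =
  | { ok: true; name: string; args: Record<string, unknown> }
  | { ok: false; error: string };

// Hypothetical guard: accept a call only if it names a known tool
// and its arguments decode to a JSON object.
function parseToolCall(call: ToolCall, knownTools: string[]): ParsedCall {
  if (knownTools.indexOf(call.function.name) === -1) {
    return { ok: false, error: "unknown tool: " + call.function.name };
  }
  try {
    const args: unknown = JSON.parse(call.function.arguments);
    if (typeof args !== "object" || args === null || Array.isArray(args)) {
      return { ok: false, error: "arguments must be a JSON object" };
    }
    return { ok: true, name: call.function.name, args: args as Record<string, unknown> };
  } catch {
    return { ok: false, error: "arguments are not valid JSON" };
  }
}
```

Returning a result object instead of throwing lets a multi-step agent loop feed the error text back to the model and ask for a corrected call.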
orchestrator
Targets the highest-capability reasoning models with large context windows. Designed as a fallback for Hermes and similar planner agents when their primary model is rate-limited.
Cascade
- Qwen 3.6 Plus (free)
- Kimi K2.5
- Gemini 2.5 Pro
- NVIDIA DeepSeek V3.2 (free)
Best for
- Planner fallback when primary is down
- Code review and synthesis
- High-stakes agent judgment
- Long-horizon reasoning
reasoning
Routes to models with explicit extended thinking: DeepSeek R1, QwQ, o3-mini. These models "think out loud" before answering, which makes them slower but dramatically more accurate on math, science, and debugging.
Cascade
- DeepSeek R1 0528 (free)
- QwQ 32B (free)
- NVIDIA Kimi K2 Thinking (free)
- o1 / o3-mini
Best for
- Complex debugging
- Math and science problems
- Legal and medical analysis
- Code generation from scratch
speed_first
Optimises for time-to-first-token above all else. Routes to Groq and Cerebras inference clusters, which run models at 150–500 tok/s.
Cascade
- Groq Llama 4 Scout (~500 tok/s)
- Groq Kimi K2 (~400 tok/s)
- Cerebras Llama 3.1 (~150 tok/s)
- Gemini Flash fallback
Best for
- Streaming chat interfaces
- Autocomplete and copilots
- Real-time voice pipelines
context_max
Routes exclusively to models with the largest context windows. Perfect for RAG pipelines, full-document ingestion, and long-running agent conversations.
Cascade
- Gemini 2.5 Pro (1M ctx)
- Llama 4 Maverick (1M ctx, free)
- Claude 3.5 (200K ctx)
- Kimi K2 (128K ctx)
Best for
- Full codebase analysis
- Document Q&A
- Long agent conversations
- RAG with large corpora
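An application that only occasionally sends very long prompts might switch to `context_max` on demand. The sketch below is an illustration under loud assumptions: it estimates tokens with a rough 4-characters-per-token heuristic rather than a real tokenizer, and the 100,000-token default budget is an arbitrary value chosen to sit below the smallest window listed above (128K). `estimateTokens` and `pickStrategy` are hypothetical helpers.

```typescript
// Rough heuristic (assumption): about 4 characters per token for English text.
// This is not a tokenizer; real counts vary by model and language.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical chooser: use context_max only when the prompt looks too large
// for the caller's usual strategy. The default budget is an arbitrary assumption.
function pickStrategy(
  prompt: string,
  defaultStrategy: string,
  tokenBudget: number = 100000,
): string {
  return estimateTokens(prompt) > tokenBudget ? "context_max" : defaultStrategy;
}
```

The returned name is then passed as `routing_strategy` in the request.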
smart_video
Routes to vision-capable models, cheapest first. NVIDIA-hosted Gemma 3n and Llama 4 handle images for free; for actual video frames the router cascades through Gemini and Nova video endpoints.
Cascade
- Gemma 3n E4B (NVIDIA, vision, free)
- Llama 4 Maverick (NVIDIA, 1M ctx, free)
- Nova Lite Video ($0.06/1M)
- Gemini 2.5 Flash Lite Video ($0.10/1M)
- Gemini 2.5 Pro (fallback)
Best for
- Screenshot-to-code
- Video content analysis
- Visual QA and captioning
- Diagram and chart understanding
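Because the API is described as OpenAI-compatible, an image request to this strategy would presumably use the standard chat content-part format. A minimal sketch under that assumption: `buildVisionRequest` is a hypothetical helper, and whether Waterfall accepts every content-part type is not confirmed on this page.

```typescript
// Content parts as defined by the OpenAI chat-completions format.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

// Hypothetical helper: one user message holding a text prompt plus images.
function buildVisionRequest(prompt: string, imageUrls: string[]) {
  const content: ContentPart[] = [{ type: "text", text: prompt }];
  for (const url of imageUrls) {
    content.push({ type: "image_url", image_url: { url } });
  }
  return {
    model: "waterfall",
    messages: [{ role: "user" as const, content }],
    extra_body: { routing_strategy: "smart_video" },
  };
}
```

For a screenshot-to-code task, `imageUrls` would hold a single URL (or a `data:` URL containing the base64-encoded screenshot) and `prompt` the instruction to convert it.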
smart_legal
Routes legal queries through privacy-aware, high-context models. This is useful for legal drafting and analysis, but it is not a privilege guarantee. Law firms should use covered direct-provider routes with approved contracts and subprocessors for client-confidential data.
Cascade
- Mistral Large 3 675B (NVIDIA, free)
- Kimi K2 Thinking (NVIDIA, free)
- DeepSeek V3.2 (NVIDIA, free)
- Claude Sonnet 4 (complex)
- Claude Opus 4 (expert)
Best for
- Contract review and redlining
- Statute and case law research
- Compliance document generation
- Legal memo drafting
smart_health
Routes health and clinical queries through careful models with privacy-aware defaults. This is not HIPAA compliance by itself. PHI should only use routes covered by a signed BAA and the right provider configuration.
Cascade
- Mistral Large 3 675B (NVIDIA, free)
- Kimi K2 Thinking (NVIDIA, free)
- DeepSeek V3.2 (NVIDIA, free)
- Claude Sonnet 4 (complex)
- Claude Opus 4 (expert/critical)
Best for
- Clinical documentation (SOAP notes)
- Medical coding and billing
- Patient communication drafts
- Literature summarization
smart_image
Routes image generation requests through chat-capable image models, cheapest first. Gemini Flash Image for speed and cost; GPT-5 Image Mini and GPT-5 Image for highest fidelity. All models support natural-language prompts and iterative refinement.
Cascade
- Gemini 2.5 Flash Image ($0.30/1M)
- GPT-5 Image Mini ($2.50/1M)
- GPT-5 Image ($10/1M)
- Gemini 3 Pro Image Preview (fallback)
Best for
- UI mockups and wireframes
- Product photography concepts
- Marketing asset generation
- Creative concept art
smart_transcribe
Routes audio transcription and speech understanding through Voxtral (fast, cheap) and Gemini audio models. Supports long-form audio, multi-speaker diarisation, and real-time streaming transcription.
Cascade
- Voxtral Small 24B ($0.10/1M)
- Gemini 2.0 Flash ($0.10/1M)
- Gemini 2.5 Flash ($0.30/1M)
- GPT-4o Audio ($2.50/1M)
Best for
- Meeting and call transcripts
- Podcast summarisation
- Voice memo analysis
- Real-time captioning
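A transcription request through the chat endpoint might attach the recording as an `input_audio` content part, which is how the OpenAI chat format carries base64-encoded audio. This is a sketch under the assumption that Waterfall accepts that part type. `buildTranscriptionRequest` is a hypothetical helper, and the default instruction text is only an example.

```typescript
// Audio content parts as defined by the OpenAI chat-completions format.
type AudioContentPart =
  | { type: "text"; text: string }
  | { type: "input_audio"; input_audio: { data: string; format: "wav" | "mp3" } };

// Hypothetical helper: `base64Audio` is the recording, already base64-encoded.
function buildTranscriptionRequest(
  base64Audio: string,
  format: "wav" | "mp3",
  instruction: string = "Transcribe this recording.",
) {
  const content: AudioContentPart[] = [
    { type: "text", text: instruction },
    { type: "input_audio", input_audio: { data: base64Audio, format } },
  ];
  return {
    model: "waterfall",
    messages: [{ role: "user" as const, content }],
    extra_body: { routing_strategy: "smart_transcribe" },
  };
}
```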
smart_voice
Routes voice-agent requests through models that natively support audio input and output. Gemini Flash for low-latency dialogue; GPT-audio and GPT-4o-audio for premium natural-sounding voice synthesis and understanding.
Cascade
- Gemini 2.0 Flash ($0.10/1M)
- GPT-audio Mini ($0.60/1M)
- Gemini 2.5 Flash ($0.30/1M)
- GPT-audio / GPT-4o-audio ($2.50/1M)
Best for
- Conversational AI assistants
- Voice-enabled customer support
- Real-time dialogue agents
- Speech synthesis pipelines
smart_multilingual_voice
Routes multilingual voice and text requests through Gemma 4 (free, 140+ languages) and Gemini Flash. Optimised for cross-lingual transcription, translation, and global voice-agent deployment at minimal cost.
Cascade
- Gemma 4 26B A4B (free, 140+ languages)
- Gemma 4 31B (free, 140+ languages)
- Gemini 2.0 Flash ($0.10/1M)
- Gemini 2.5 Flash ($0.30/1M)
- Qwen 3.6 Max Preview (fallback)
Best for
- Global support bots
- Multilingual voice agents
- Cross-lingual transcription
- Language tutoring apps
Start routing smarter
All strategies are available through the same OpenAI-compatible API. No SDK changes needed.