Routing Strategies

Waterfall isn't just a model API — it's an intelligent router. Instead of picking a model yourself, you pick a strategy, and Waterfall handles the rest: fallbacks, rate limits, cost optimisation, and capability matching.

15 strategies live, 0 coming soon

How to select a strategy

// Pass routing_strategy in extra_body
const response = await openai.chat.completions.create({
  model: "waterfall",
  messages: [...],
  extra_body: { routing_strategy: "free_smart" }
})
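
The strategy identifiers listed on this page can be captured in a type so that typos are caught before a request is sent. This is a minimal sketch that reuses the request shape from the snippet above; `ROUTING_STRATEGIES`, `isRoutingStrategy`, and `buildRequest` are illustrative names and are not part of any Waterfall SDK.

```typescript
// Strategy identifiers as they appear on this page.
const ROUTING_STRATEGIES = [
  "free_smart", "cheap_smart", "privacy_smart", "tool_calling", "orchestrator",
  "reasoning", "speed_first", "context_max", "smart_video", "smart_legal",
  "smart_health", "smart_image", "smart_transcribe", "smart_voice",
  "smart_multilingual_voice",
] as const;

type RoutingStrategy = (typeof ROUTING_STRATEGIES)[number];

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Type guard for strategy names that arrive as plain strings (config, env vars).
function isRoutingStrategy(value: string): value is RoutingStrategy {
  return (ROUTING_STRATEGIES as readonly string[]).indexOf(value) !== -1;
}

// Hypothetical helper: builds the same payload shape as the snippet above
// and rejects unknown strategy names before anything is sent.
function buildRequest(strategy: string, messages: ChatMessage[]) {
  if (!isRoutingStrategy(strategy)) {
    throw new Error(`Unknown routing strategy: ${strategy}`);
  }
  return {
    model: "waterfall",
    messages,
    extra_body: { routing_strategy: strategy },
  };
}
```

The returned object can be passed to `openai.chat.completions.create(...)` in place of the inline literal.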

Live Strategies

Free Smart

Enterprise-quality output, zero spend.

free_smart (Most Popular)

Routes exclusively through our curated pool of free-tier models (OpenRouter :free endpoints and NVIDIA NIM community models). The pool is re-validated hourly, so a model that switches to paid is dropped before you accidentally hit it.

Cascade

DeepSeek R1 (free)
Llama 4 Maverick (free)
Gemini 2.5 Pro (free)
NVIDIA Kimi K2 (free)
70+ more free models

Best for

Prototyping and learning
High-volume agent loops
Hermes sub-tasks
Any zero-budget use case
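
A cascade can be pictured as walking an ordered list and taking the first model that is currently usable. The sketch below is not Waterfall's actual router: `pickFromCascade` and the `ModelStatus` values are hypothetical, and the status map stands in for whatever health, pricing, and rate-limit signals the real service tracks.

```typescript
type ModelStatus = "ok" | "rate_limited" | "paid_only";

// Returns the first model in cascade order whose status is "ok",
// or undefined when every candidate is unavailable.
function pickFromCascade(
  cascade: string[],
  status: Record<string, ModelStatus>,
): string | undefined {
  for (const model of cascade) {
    // Models with no recorded status are treated as unavailable (an assumption).
    if (status[model] === "ok") {
      return model;
    }
  }
  return undefined;
}

// Cascade order taken from the Free Smart list above.
const freeSmartCascade = [
  "DeepSeek R1 (free)",
  "Llama 4 Maverick (free)",
  "Gemini 2.5 Pro (free)",
  "NVIDIA Kimi K2 (free)",
];

const freeSmartChoice = pickFromCascade(freeSmartCascade, {
  "DeepSeek R1 (free)": "rate_limited",
  "Llama 4 Maverick (free)": "paid_only", // e.g. dropped from the free pool
  "Gemini 2.5 Pro (free)": "ok",
  "NVIDIA Kimi K2 (free)": "ok",
});
console.log(freeSmartChoice); // Gemini 2.5 Pro (free)
```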

Cheap Smart

Premium quality at sub-cent prices.

cheap_smart (New)

Cascades through paid models priced under $0.50 per 1M tokens, such as DeepSeek V3, Gemini Flash, and GPT-4o Mini. The quality ceiling is slightly higher than free_smart, and typical requests still cost a fraction of a cent.

Cascade

DeepSeek V3 ($0.14/1M)
Gemini 2.0 Flash ($0.15/1M)
GPT-4o Mini ($0.15/1M)
Llama 4 Scout via Groq

Best for

Production apps with a budget
When free_smart is rate-limited
High-quality tool calling at low cost
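
For budgeting, the per-million-token prices listed above translate directly into request cost. This is a rough sketch using the three priced entries from the Cheap Smart cascade; `PricedModel`, `underPriceCap`, and `costUsd` are illustrative names, and treating the listed figure as one blended price per token is an assumption (providers usually price input and output tokens separately).

```typescript
interface PricedModel {
  name: string;
  usdPerMillionTokens: number;
}

// Prices as listed in the Cheap Smart cascade.
const cheapSmartModels: PricedModel[] = [
  { name: "DeepSeek V3", usdPerMillionTokens: 0.14 },
  { name: "Gemini 2.0 Flash", usdPerMillionTokens: 0.15 },
  { name: "GPT-4o Mini", usdPerMillionTokens: 0.15 },
];

// Keep models under the price cap, cheapest first.
function underPriceCap(models: PricedModel[], capUsd: number): PricedModel[] {
  return models
    .filter((m) => m.usdPerMillionTokens < capUsd)
    .sort((a, b) => a.usdPerMillionTokens - b.usdPerMillionTokens);
}

// Cost of a given token count at a per-million-token price.
function costUsd(tokens: number, usdPerMillionTokens: number): number {
  return (tokens / 1000000) * usdPerMillionTokens;
}

const cheapestModel = underPriceCap(cheapSmartModels, 0.5)[0];
console.log(cheapestModel.name); // DeepSeek V3
console.log(costUsd(2000000, cheapestModel.usdPerMillionTokens)); // 0.28
```

At these prices a 10,000-token request costs roughly $0.0014 to $0.0015, which is where the "sub-cent" framing comes from.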

Privacy Smart

ZDR-first routing for privacy-sensitive work.

privacy_smart

Routes to Zero Data Retention (ZDR) OpenRouter models and direct-provider routes where available. ZDR means prompt and response content should not be stored after processing. It is not the same as HIPAA compliance, a BAA, or a full legal-data compliance program.

Cascade

NVIDIA NIM (direct — no intermediary)
OpenRouter ZDR models (348+ providers)
Claude Sonnet 4 (ZDR)
GPT-5 (ZDR)
Gemini 3.1 Pro (ZDR)

Best for

Privacy-sensitive app traffic
Internal drafts and notes
Internal enterprise tools
Avoiding normal provider logging where possible

Tool Calling

Reliable structured output for agentic workflows.

tool_calling

Prioritises models with consistently correct JSON/tool-call outputs. Falls back through a validated chain so your agents never get malformed function calls.

Cascade

Claude Sonnet 4
GPT-4.1
Llama 4 Maverick (free)
Elephant Alpha (free)
NVIDIA GLM-4.7 (free)

Best for

Claude Code / Hermes backends
Multi-step agent pipelines
Structured data extraction
API orchestration
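
To make "malformed function calls" concrete: in OpenAI-style responses a tool call carries its arguments as a JSON string, and a router can reject a response whose arguments do not parse to an object. The sketch below shows that check only; `ToolCall` mirrors the OpenAI tool-call shape, while `hasWellFormedArguments` is a hypothetical helper and not Waterfall's actual validation logic.

```typescript
// Shape of a tool call in an OpenAI-style chat completion.
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
}

// True when the arguments string parses to a plain JSON object.
function hasWellFormedArguments(call: ToolCall): boolean {
  try {
    const parsed: unknown = JSON.parse(call.function.arguments);
    return typeof parsed === "object" && parsed !== null && !Array.isArray(parsed);
  } catch (err) {
    return false; // JSON.parse throws on invalid JSON
  }
}

const goodCall: ToolCall = {
  id: "call_1",
  type: "function",
  function: { name: "get_weather", arguments: '{"city":"Oslo"}' },
};
const truncatedCall: ToolCall = {
  id: "call_2",
  type: "function",
  function: { name: "get_weather", arguments: '{"city":"Os' },
};

console.log(hasWellFormedArguments(goodCall)); // true
console.log(hasWellFormedArguments(truncatedCall)); // false
```

A client-side check like this is still useful as a second line of defence, since it costs one `JSON.parse` per call.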

Orchestrator

Planner-grade judgment for complex multi-step tasks.

orchestrator

Targets the highest-capability reasoning models with large context windows. Designed as a fallback for Hermes and similar planner agents when their primary model is rate-limited.

Cascade

Qwen 3.6 Plus (free)
Kimi K2.5
Gemini 2.5 Pro
NVIDIA DeepSeek V3.2 (free)

Best for

Planner fallback when primary is down
Code review and synthesis
High-stakes agent judgment
Long-horizon reasoning

Reasoning

Chain-of-thought depth for hard problems.

reasoning

Routes to models with explicit extended thinking, such as DeepSeek R1, QwQ, and o3-mini. These models "think out loud" before answering, which makes them slower but typically much more accurate on math, science, and debugging.

Cascade

DeepSeek R1 0528 (free)
QwQ 32B (free)
NVIDIA Kimi K2 Thinking (free)
o1 / o3-mini

Best for

Complex debugging
Math and science problems
Legal and medical analysis
Code generation from scratch

Speed First

Sub-200ms first token for real-time UIs.

speed_first

Optimises for time-to-first-token above all else. Routes to Groq and Cerebras inference clusters, which run the models below at roughly 150–500 tok/s.

Cascade

Groq Llama 4 Scout (~500 tok/s)
Groq Kimi K2 (~400 tok/s)
Cerebras Llama 3.1 (~150 tok/s)
Gemini Flash fallback

Best for

Streaming chat interfaces
Autocomplete and copilots
Real-time voice pipelines
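
Time-to-first-token and throughput are separate numbers: the first governs how quickly a stream starts, the second how long it takes to finish. This is a back-of-the-envelope sketch using the throughput figures listed above; `estimateStreamMs` is an illustrative helper, and the 200 ms first-token latency is taken from the tagline as an assumed upper bound, not a measured value.

```typescript
// Total streaming time ≈ time to first token + output tokens / throughput.
// All times are in milliseconds.
function estimateStreamMs(
  outputTokens: number,
  tokensPerSecond: number,
  firstTokenMs: number,
): number {
  return firstTokenMs + (outputTokens * 1000) / tokensPerSecond;
}

// A 600-token reply at the fastest and slowest listed throughputs,
// assuming 200 ms to first token in both cases.
console.log(estimateStreamMs(600, 500, 200)); // 1400
console.log(estimateStreamMs(600, 150, 200)); // 4200
```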

Context Max

Up to 1M tokens — entire codebases in one prompt.

context_max

Routes exclusively to models with the largest context windows. Perfect for RAG pipelines, full-document ingestion, and long-running agent conversations.

Cascade

Gemini 2.5 Pro (1M ctx)
Llama 4 Maverick (1M ctx, free)
Claude 3.5 (200K ctx)
Kimi K2 (128K ctx)

Best for

Full codebase analysis
Document Q&A
Long agent conversations
RAG with large corpora
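
Choosing by context size reduces to comparing an estimated prompt length against each window. This is a minimal sketch using the window sizes listed above; `modelsThatFit` is a hypothetical helper, and the sizes are read as 1M = 1,000,000, 200K = 200,000, and 128K = 128,000 tokens.

```typescript
interface ContextModel {
  name: string;
  contextTokens: number;
}

// Windows as listed in the Context Max cascade.
const contextMaxCascade: ContextModel[] = [
  { name: "Gemini 2.5 Pro", contextTokens: 1000000 },
  { name: "Llama 4 Maverick", contextTokens: 1000000 },
  { name: "Claude 3.5", contextTokens: 200000 },
  { name: "Kimi K2", contextTokens: 128000 },
];

// Models, in cascade order, whose window holds the prompt plus a reply budget.
function modelsThatFit(
  models: ContextModel[],
  promptTokens: number,
  replyBudget: number,
): ContextModel[] {
  return models.filter((m) => m.contextTokens >= promptTokens + replyBudget);
}

// A 300K-token prompt with room for a 4K-token reply.
const eligibleNames = modelsThatFit(contextMaxCascade, 300000, 4000).map((m) => m.name);
console.log(eligibleNames.join(", ")); // Gemini 2.5 Pro, Llama 4 Maverick
```

The example shows why prompt size matters for resilience: past 200K tokens only two of the four listed models remain as fallbacks for each other.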

Smart Video

Multimodal intelligence for video + images.

smart_video

Routes to vision-capable models cheapest-first. NVIDIA-hosted Gemma 3n and Llama 4 handle images for free; for actual video frames the router cascades through Gemini and Nova video endpoints.

Cascade

Gemma 3n E4B (NVIDIA, vision, free)
Llama 4 Maverick (NVIDIA, 1M ctx, free)
Nova Lite Video ($0.06/1M)
Gemini 2.5 Flash Lite Video ($0.10/1M)
Gemini 2.5 Pro (fallback)

Best for

Screenshot-to-code
Video content analysis
Visual QA and captioning
Diagram and chart understanding
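
Cheapest-first routing with a capability filter can be sketched as: keep the models that accept the requested input type, then sort by price. `VisionModel`, `routeByModality`, and the `inputs` field are hypothetical. Prices come from the cascade above, with free models entered as 0 and the unpriced Gemini 2.5 Pro fallback left out. The modality assignments follow the description; whether the video endpoints also accept still images is not stated, so they are listed as video-only here.

```typescript
type Modality = "image" | "video";

interface VisionModel {
  name: string;
  usdPerMillionTokens: number; // 0 for free models
  inputs: Modality[];
}

const smartVideoModels: VisionModel[] = [
  { name: "Gemma 3n E4B", usdPerMillionTokens: 0, inputs: ["image"] },
  { name: "Llama 4 Maverick", usdPerMillionTokens: 0, inputs: ["image"] },
  { name: "Nova Lite Video", usdPerMillionTokens: 0.06, inputs: ["video"] },
  { name: "Gemini 2.5 Flash Lite Video", usdPerMillionTokens: 0.1, inputs: ["video"] },
];

// Names of the models that accept the needed input, cheapest first.
function routeByModality(models: VisionModel[], needed: Modality): string[] {
  return models
    .filter((m) => m.inputs.indexOf(needed) !== -1)
    .sort((a, b) => a.usdPerMillionTokens - b.usdPerMillionTokens)
    .map((m) => m.name);
}

console.log(routeByModality(smartVideoModels, "video").join(", "));
// Nova Lite Video, Gemini 2.5 Flash Lite Video
console.log(routeByModality(smartVideoModels, "image").join(", "));
// Gemma 3n E4B, Llama 4 Maverick
```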

Smart Legal

Precise legal routing with honest privacy limits.

smart_legal

Routes legal queries through privacy-aware, high-context models. This is useful for legal drafting and analysis, but it is not a privilege guarantee. Law firms should use covered direct-provider routes with approved contracts and subprocessors for client-confidential data.

Cascade

Mistral Large 3 675B (NVIDIA, free)
Kimi K2 Thinking (NVIDIA, free)
DeepSeek V3.2 (NVIDIA, free)
Claude Sonnet 4 (complex)
Claude Opus 4 (expert)

Best for

Contract review and redlining
Statute and case law research
Compliance document generation
Legal memo drafting

Smart Health

Health-focused routing without fake HIPAA claims.

smart_health

Routes health and clinical queries through careful models with privacy-aware defaults. This is not HIPAA compliance by itself. PHI should only use routes covered by a signed BAA and the right provider configuration.

Cascade

Mistral Large 3 675B (NVIDIA, free)
Kimi K2 Thinking (NVIDIA, free)
DeepSeek V3.2 (NVIDIA, free)
Claude Sonnet 4 (complex)
Claude Opus 4 (expert/critical)

Best for

Clinical documentation (SOAP notes)
Medical coding and billing
Patient communication drafts
Literature summarization

Smart Image

Image generation via best-in-class multimodal models.

smart_image (New)

Routes image generation requests through chat-capable image models cheapest-first. Gemini Flash Image for speed and cost; GPT-5 Image Mini and GPT-5 Image for highest fidelity. All models support natural-language prompts and iterative refinement.

Cascade

Gemini 2.5 Flash Image ($0.30/1M)
GPT-5 Image Mini ($2.50/1M)
GPT-5 Image ($10/1M)
Gemini 3 Pro Image Preview (fallback)

Best for

UI mockups and wireframes
Product photography concepts
Marketing asset generation
Creative concept art

Smart Transcribe

Audio → text with speaker diarisation and translation.

smart_transcribe (New)

Routes audio transcription and speech understanding through Voxtral (fast, cheap) and Gemini audio models. Supports long-form audio, multi-speaker diarisation, and real-time streaming transcription.

Cascade

Voxtral Small 24B ($0.10/1M)
Gemini 2.0 Flash ($0.10/1M)
Gemini 2.5 Flash ($0.30/1M)
GPT-4o Audio ($2.50/1M)

Best for

Meeting and call transcripts
Podcast summarisation
Voice memo analysis
Real-time captioning

Smart Voice

Voice input + output for conversational agents.

smart_voice (New)

Routes voice-agent requests through models that natively support audio input and output. Gemini Flash for low-latency dialogue; GPT-audio and GPT-4o-audio for premium natural-sounding voice synthesis and understanding.

Cascade

Gemini 2.0 Flash ($0.10/1M)
GPT-audio Mini ($0.60/1M)
Gemini 2.5 Flash ($0.30/1M)
GPT-audio / GPT-4o-audio ($2.50/1M)

Best for

Conversational AI assistants
Voice-enabled customer support
Real-time dialogue agents
Speech synthesis pipelines

Smart Multilingual Voice

140+ language voice and text understanding.

smart_multilingual_voice (New)

Routes multilingual voice and text requests through Gemma 4 (free, 140+ languages) and Gemini Flash. Optimised for cross-lingual transcription, translation, and global voice-agent deployment at minimal cost.

Cascade

Gemma 4 26B A4B (free, 140+ languages)
Gemma 4 31B (free, 140+ languages)
Gemini 2.0 Flash ($0.10/1M)
Gemini 2.5 Flash ($0.30/1M)
Qwen 3.6 Max Preview (fallback)

Best for

Global support bots
Multilingual voice agents
Cross-lingual transcription
Language tutoring apps

Start routing smarter

All strategies are available through the same OpenAI-compatible API. No SDK changes needed.
