Chained Models (Multi-Stage Pipeline)

The idea: Model A generates a rough translation → Model B post-edits it → Model C scores or validates the result. Each stage specializes in one task, with the aim that the pipeline's output beats what any single model produces alone.

:::info This is a cookbook, not a finished implementation
This guide sketches multi-stage pipeline architecture. The specific models and chain configuration depend on your language pair and budget.
:::

When to Use This

A single model produces inconsistent quality — good on some inputs, bad on others
You want to separate generation from validation — one model creates, another critiques
You have budget for multiple API calls per translation (latency and cost scale linearly with stages)
You want to combine models with different strengths (e.g., a creative generator + a precise editor)

How It Works

Input ──→ [Stage 1: Generator] ──→ [Stage 2: Editor] ──→ [Stage 3: Validator] ──→ Output
              │                         │                        │
              │ "Translate this"        │ "Fix errors in         │ "Rate 1-5 and
              │                         │  this translation"     │  flag issues"
              ▼                         ▼                        ▼
         Raw translation          Polished translation      Score + accept/reject

Example: Three-Stage Pipeline

# Placeholder clients: fast_model, strong_model, and validator stand in for
# your own async model wrappers. Run this inside an async function.
target_lang = "crk"

# Stage 1: Fast model generates candidate
raw = await fast_model.translate(source, target_lang=target_lang)

# Stage 2: Strong model post-edits
edited = await strong_model.complete(
    f"The following {target_lang} translation may contain errors. "
    f"Fix any grammatical or vocabulary mistakes:\n"
    f"Source: {source}\nTranslation: {raw}\nCorrected:"
)

# Stage 3: Validator model scores (the reply is text, so parse it before comparing)
score_text = await validator.complete(
    f"Rate this translation 1-5 for accuracy and fluency:\n"
    f"Source: {source}\nTranslation: {edited}\nScore:"
)
score = int(score_text.strip())  # assumes the reply is a bare digit

# Accept if score >= 3, otherwise retry Stage 1 with different temperature
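
The accept-or-retry rule in the last comment can be sketched as a loop. This is a minimal sketch under stated assumptions, not a finished implementation: `run_chain`, `parse_score`, and the three stage callables are hypothetical names, and the generator is assumed to accept a temperature argument. Wrap your real model clients so they match these signatures.

```python
import asyncio
import re
from typing import Awaitable, Callable, Optional

# Hypothetical stage signatures; adapt your own clients to match them.
Generator = Callable[[str, float], Awaitable[str]]  # (source, temperature) -> raw
Editor = Callable[[str, str], Awaitable[str]]       # (source, raw) -> edited
Validator = Callable[[str, str], Awaitable[str]]    # (source, edited) -> reply text


def parse_score(reply: str) -> Optional[int]:
    """Return the first standalone digit 1-5 in the validator's reply, if any."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


async def run_chain(
    source: str,
    generate: Generator,
    edit: Editor,
    validate: Validator,
    min_score: int = 3,
    temperatures: tuple[float, ...] = (0.2, 0.7, 1.0),
) -> Optional[str]:
    """Retry Stage 1 at each temperature until the validator accepts."""
    for temperature in temperatures:
        raw = await generate(source, temperature)
        edited = await edit(source, raw)
        score = parse_score(await validate(source, edited))
        # An unparseable reply counts as a rejection.
        if score is not None and score >= min_score:
            return edited
    return None  # every attempt was rejected
```

Returning `None` after the last attempt leaves the fallback decision (human review, a different chain, or the best rejected candidate) to the caller.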

Common Chain Patterns

Pattern	Stages	Use Case
Generate → Edit	Fast LLM → Strong LLM	Cost-efficient quality improvement
Generate → Validate → Retry	LLM → FST/rules → LLM (retry on failure)	Morphological correctness (see FST-Gated)
Generate → Back-translate → Score	LLM(en→crk) → LLM(crk→en) → compare	Round-trip consistency check
Ensemble → Vote	3 LLMs independently → majority vote	Robustness through diversity
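
The Generate → Back-translate → Score row can be sketched as follows. Character-level similarity from the standard library's `difflib` stands in for the "compare" step here; that choice is an assumption made to keep the example self-contained, and a semantic metric would usually be a better fit. `round_trip_score` and the `forward`/`backward` callables are hypothetical names.

```python
import asyncio
from difflib import SequenceMatcher
from typing import Awaitable, Callable

# Hypothetical signature: one async call that translates in a fixed direction.
Translate = Callable[[str], Awaitable[str]]


def similarity(a: str, b: str) -> float:
    """Case-insensitive surface similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


async def round_trip_score(
    source: str, forward: Translate, backward: Translate
) -> tuple[str, float]:
    """Translate, back-translate, and compare the result with the source."""
    candidate = await forward(source)   # e.g. en -> crk
    back = await backward(candidate)    # e.g. crk -> en
    return candidate, similarity(source, back)
```

A low score only signals that something was lost in one of the two directions; it does not say which one, so treat it as a flag for review rather than a verdict.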

Key Design Decisions

Latency budget: Each stage adds latency, because stages run one after another. A 3-stage chain at 2s per stage takes about 6s per translation. For batch evaluation this is fine; for real-time use it may not be.

Cost multiplier: 3 stages = 3× the API calls, and roughly 3× the cost if every stage uses the same model. Use cheaper models for early stages and reserve expensive models for critical stages.
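
The latency and cost arithmetic can be sketched as a small estimator. The stage names, latencies, and per-call costs below are made-up illustrative numbers, not measurements or real pricing, and `StageBudget` and `chain_budget` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageBudget:
    name: str
    latency_s: float      # expected seconds per call (assumed)
    cost_per_call: float  # in your own billing unit (assumed)


def chain_budget(stages: list[StageBudget], retry_rate: float = 0.0) -> tuple[float, float]:
    """Expected (latency, cost) per translation for stages run one after another.

    retry_rate is the assumed fraction of translations that rerun the whole chain once.
    """
    runs = 1.0 + retry_rate
    latency = runs * sum(s.latency_s for s in stages)
    cost = runs * sum(s.cost_per_call for s in stages)
    return latency, cost


stages = [
    StageBudget("generator", latency_s=2.0, cost_per_call=1.0),
    StageBudget("editor", latency_s=2.0, cost_per_call=5.0),
    StageBudget("validator", latency_s=2.0, cost_per_call=1.0),
]
latency, cost = chain_budget(stages)
print(f"{latency:.1f}s, {cost:.1f} units per translation")  # 6.0s, 7.0 units per translation
```

With these assumed numbers, putting cheap models around a single expensive editor costs 7 units instead of the 15 that three expensive calls would cost, while latency still sums across all three stages.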

Error propagation: A bad Stage 1 output can mislead Stage 2. Include the original source in every stage so later models can recover.
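
One way to keep the source in every stage is to build all stage prompts through a single helper that requires it. The helper name and the prompt layout below are illustrative assumptions, modeled on the prompts in the three-stage example.

```python
from typing import Optional


def build_stage_prompt(instruction: str, source: str, previous: Optional[str] = None) -> str:
    """Build a stage prompt that always carries the original source text."""
    if not source.strip():
        raise ValueError("source must be non-empty so later stages can recover")
    lines = [instruction, f"Source: {source}"]
    # Stage 1 has no earlier output; later stages pass the previous stage's text.
    if previous is not None:
        lines.append(f"Translation: {previous}")
    return "\n".join(lines)
```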

Pros and Cons


Pros	Cons
✅ Can combine specialist strengths	❌ Latency and cost add up with each stage
✅ Separation of concerns (generate vs. validate)	❌ Complex to debug — which stage introduced the error?
✅ Easy to swap individual stages	❌ Error propagation between stages
✅ Round-trip validation catches hallucinations	❌ Diminishing returns beyond 2-3 stages

Combines Well With

FST-Gated Pipeline — FST as a validation stage
Dictionary-Augmented LLM — dictionary injection in the generation stage
Coached LLM Prompting — coaching in one or more stages
