Evolutionary / Search-Based Translation

The idea: Generate multiple candidate translations, score them against a fitness function (chrF++, FST acceptance, round-trip consistency), mutate the best performers, and repeat. Natural selection for translations — the fittest survive.

:::info This is a cookbook, not a finished implementation This is the most experimental approach in the cookbook series. It hasn't been proven for MT at scale, but the architecture is sound and the harness will happily score whatever it produces. :::

When to Use This

You have a good scoring function but no single model produces consistent results
You want to explore the solution space more broadly than a single greedy generation
You have compute budget for many parallel generations (dozens of candidates per input)
You're interested in novel research — this approach is underexplored for low-resource MT

How It Works

[Generation 0]    Generate N candidates (different models, temperatures, prompts)
       │
       ▼
[Score]           Evaluate each candidate: chrF++, FST acceptance, fluency
       │
       ▼
[Select]          Keep top K performers
       │
       ▼
[Mutate]          Prompt an LLM: "improve this translation, fix these issues"
       │
       ▼
[Generation 1]    Score again. Repeat for G generations.
       │
       ▼
[Output]          Best-scoring candidate from final generation

Skeleton

async def evolutionary_translate(source, reference=None, generations=3, pop_size=8):
    # Generation 0: diverse candidates
    population = []
    for model in ["gemini-2.5-pro", "gpt-4o", "claude-sonnet-4-6"]:
        for temp in [0.3, 0.7, 1.0]:
            candidate = await translate(source, model=model, temperature=temp)
            population.append(candidate)
    
    for gen in range(generations):
        # Score each candidate
        scored = [(c, score(c, reference)) for c in population]
        scored.sort(key=lambda x: x[1], reverse=True)
        
        # Select top K
        survivors = [c for c, s in scored[:pop_size // 2]]
        
        # Mutate: ask LLM to improve each survivor
        mutants = []
        for survivor in survivors:
            mutant = await improve(source, survivor, feedback=scored[0])
            mutants.append(mutant)
        
        population = survivors + mutants
    
    return max(population, key=lambda c: score(c, reference))

Fitness Function Design

The fitness function is everything. Options:

Metric	What It Measures	Automated?
chrF++ against reference	Character-level similarity to gold	✅ Yes
FST acceptance rate	Morphological validity	✅ Yes (if FST available)
Round-trip consistency	Does back-translating recover the source?	✅ Yes
LLM-as-judge	Another LLM rates fluency/accuracy	✅ Yes (but noisy)
Dictionary term presence	Do known terms appear correctly?	✅ Yes

:::tip Combine multiple signals A weighted combination of metrics makes a more robust fitness function than any single metric. This mirrors the harness's own composite score. :::

Pros and Cons


✅ Explores diverse solutions	❌ Computationally expensive (N × G API calls)
✅ Can discover approaches no single model finds	❌ Requires a good fitness function
✅ Parallelizable	❌ Slow — multiple generations per translation
✅ Model-agnostic	❌ Diminishing returns after a few generations

Combines Well With

Chained Models — the mutation step is a form of chaining
FST-Gated Pipeline — FST acceptance as a fitness signal
Dictionary-Augmented LLM — dictionary presence as a fitness signal

When to Use This​

How It Works​

Skeleton​

Fitness Function Design​

Pros and Cons​

Combines Well With​

See Also​