Back-Translation Augmentation

The idea: Generate synthetic parallel data by translating existing target-language text back into the source language, then use these synthetic pairs to train or prompt a forward model. This expands your parallel corpus cheaply — but with caveats about quality.

:::info This is a cookbook, not a finished implementation
This guide sketches the strategy and its critical pitfalls. Back-translation is powerful but can amplify errors if not done carefully.
:::

When to Use This

You have monolingual target-language text but limited parallel data
You want to expand a training corpus for fine-tuning without manual translation
You need more few-shot examples but can't get human translations fast enough
You're willing to quality-filter the synthetic data aggressively

How It Works

[Target-language text]          "awâsisak mêtawêwak"
        │
        ▼
[Back-translate to source]      "The children are playing"  (via LLM or MT API)
        │
        ▼
[Create synthetic pair]         ("The children are playing", "awâsisak mêtawêwak")
        │
        ▼
[Quality filter]                Keep only high-confidence pairs
        │
        ▼
[Use for training/prompting]    Expand your parallel corpus

1. Collect monolingual text: target-language books, articles, transcripts, social media
2. Back-translate: use an LLM or MT API to translate each sentence into the source language
3. Quality filter: translate each synthetic source sentence forward into the target language again (a round trip) and compare the result with the original; keep pairs where the two closely match
4. Use the synthetic corpus: for fine-tuning, few-shot examples, or coaching data
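The first two steps can be sketched as follows. This is a minimal illustration, not a finished pipeline: the `Translator` signature, the `SyntheticPair` record, and `stub_translate` are assumptions made for the example, and the stub should be swapped for a real LLM or MT call.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical signature: (text, from_lang, to_lang) -> translated text.
Translator = Callable[[str, str, str], str]


@dataclass(frozen=True)
class SyntheticPair:
    source: str  # machine-generated source-language sentence
    target: str  # original, human-written target-language sentence
    origin: str = "back-translation"  # provenance tag for later filtering


def back_translate_corpus(
    target_sentences: Iterable[str],
    translate: Translator,
    target_lang: str = "crk",
    source_lang: str = "en",
) -> List[SyntheticPair]:
    """Pair each target-language sentence with its machine back-translation."""
    pairs = []
    for sentence in target_sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip blank lines
        synthetic_source = translate(sentence, target_lang, source_lang).strip()
        if synthetic_source:  # drop sentences the translator could not handle
            pairs.append(SyntheticPair(source=synthetic_source, target=sentence))
    return pairs


# Stub translator for demonstration only; replace with a real LLM or MT call.
def stub_translate(text: str, from_lang: str, to_lang: str) -> str:
    lookup = {("awâsisak mêtawêwak", "crk", "en"): "The children are playing"}
    return lookup.get((text, from_lang, to_lang), "")


pairs = back_translate_corpus(["awâsisak mêtawêwak", "   "], stub_translate)
print(pairs)
```

Keeping the `origin` tag on every pair makes it possible to tell synthetic data apart from human translations later, which matters when filtering or mixing corpora.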

Quality Filtering: The Round-Trip Test

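# Round-trip quality filtering (sketch).
# translate() and compute_chrf() are placeholders for your own MT/LLM call and
# chrF implementation. compute_chrf is assumed to return a score in [0, 1];
# some libraries (e.g. sacreBLEU) report chrF on a 0-100 scale instead.
CHRF_THRESHOLD = 0.70  # starting point only; tune on a human-checked sample

def round_trip_filter(monolingual_corpus, translate, compute_chrf):
    parallel_corpus = []
    for target_text in monolingual_corpus:
        # Back-translate: target → source
        synthetic_source = translate(target_text, "crk", "en")

        # Forward-translate: source → target
        round_trip = translate(synthetic_source, "en", "crk")

        # Compare the round-trip output (hypothesis) to the original (reference)
        chrf_score = compute_chrf(round_trip, target_text)

        if chrf_score > CHRF_THRESHOLD:  # high similarity suggests a usable pair
            parallel_corpus.append((synthetic_source, target_text))
    return parallel_corpus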
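The comparison step depends on a character-level similarity score. The sketch below is a simplified, pure-Python approximation of a chrF-style score, included only to show what the comparison measures. The name `simple_chrf` and its defaults (`max_n=6`, `beta=2.0`) are assumptions for this example; it keeps single spaces inside n-grams and will not reproduce the numbers of an established chrF implementation exactly.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after collapsing runs of whitespace."""
    text = " ".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score in [0, 1]; not an official implementation."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one string is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return (1 + beta**2) * p * r / (beta**2 * p + r)


identical = simple_chrf("awâsisak mêtawêwak", "awâsisak mêtawêwak")
truncated = simple_chrf("awâsisak mêtawêw", "awâsisak mêtawêwak")
unrelated = simple_chrf("xyz", "abc")
print(identical, truncated, unrelated)
```

A function like this could stand in for the `compute_chrf` placeholder, with the round-trip output as the hypothesis and the original sentence as the reference. Because the score weights recall more heavily, argument order matters. For real filtering, a maintained metric library is likely the safer choice.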

Critical Pitfall: Error Amplification

:::warning Back-translation amplifies existing model biases
If your back-translation model consistently makes the same errors, your synthetic corpus will encode those errors as "correct." This creates a feedback loop: train on bad data → produce worse translations → generate worse synthetic data. Always quality-filter aggressively and mix synthetic data with verified human translations.
:::
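One way to act on the advice about mixing is to cap how much synthetic data enters the training set relative to verified human translations, and to tag every pair with its provenance. The sketch below assumes pairs are `(source, target)` tuples; the function name `mix_corpora` and the default cap of one synthetic pair per human pair are illustrative assumptions, not recommended values.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (source, target)


def mix_corpora(
    human_pairs: List[Pair],
    synthetic_pairs: List[Pair],
    max_synthetic_ratio: float = 1.0,  # assumption: tune for your own data
    seed: int = 0,
) -> List[Tuple[str, str, str]]:
    """Combine corpora, sampling at most max_synthetic_ratio synthetic pairs
    per human pair and labelling each pair with its provenance."""
    rng = random.Random(seed)  # seeded so the mix is reproducible
    cap = int(len(human_pairs) * max_synthetic_ratio)
    sampled = rng.sample(synthetic_pairs, min(cap, len(synthetic_pairs)))
    mixed = [(s, t, "human") for s, t in human_pairs]
    mixed += [(s, t, "synthetic") for s, t in sampled]
    rng.shuffle(mixed)
    return mixed


human = [("h-src-1", "h-tgt-1"), ("h-src-2", "h-tgt-2")]
synthetic = [(f"s-src-{i}", f"s-tgt-{i}") for i in range(5)]
mixed = mix_corpora(human, synthetic)
print(mixed)
```

The provenance label makes it straightforward to evaluate on human data only, so that synthetic errors do not leak into the numbers used to judge the model.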

Where to Find Monolingual Text

Community newsletters, newspapers, and publications
Government documents in the target language (e.g., Nunavut Hansard for Inuktitut)
Educational materials and textbooks
Religious texts (widely available for many languages)
Social media (with appropriate permissions and quality filtering)
Transcribed audio/video from language programs
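Text gathered from sources like these usually needs light cleanup before back-translation. The sketch below shows one possible preparation pass: Unicode normalization (so that characters such as "â" are encoded consistently), whitespace cleanup, a length filter, and de-duplication. The function name and the length bounds are assumptions for illustration only.

```python
import unicodedata
from typing import Iterable, List


def prepare_sentences(
    lines: Iterable[str],
    min_chars: int = 5,    # assumption: drop fragments shorter than this
    max_chars: int = 300,  # assumption: drop very long lines
) -> List[str]:
    """Normalize, length-filter, and de-duplicate monolingual sentences."""
    seen = set()
    cleaned = []
    for line in lines:
        # NFC composes base letter + combining accent into a single code point.
        text = unicodedata.normalize("NFC", line)
        text = " ".join(text.split())  # trim and collapse whitespace
        if not (min_chars <= len(text) <= max_chars):
            continue
        key = text.casefold()  # case-insensitive duplicate detection
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned


raw = [
    "awa\u0302sisak me\u0302tawe\u0302wak",      # decomposed accents
    "aw\u00e2sisak   m\u00eataw\u00eawak",       # composed accents, extra spaces
    "ok",                                        # too short
    "",
]
sentences = prepare_sentences(raw)
print(sentences)
```

Normalizing before de-duplication means two visually identical sentences with different underlying encodings are treated as the same sentence.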

Pros and Cons


| Pros | Cons |
| --- | --- |
| ✅ Expands training data cheaply | ❌ Amplifies model errors if not filtered |
| ✅ Uses abundant monolingual text | ❌ Quality ceiling limited by back-translation model |
| ✅ Easy to generate at scale | ❌ Round-trip filtering is compute-intensive |
| ✅ Complements other approaches | ❌ Synthetic data is never as good as human translation |

Combines Well With

Fine-Tuned Model — back-translation creates training data for fine-tuning
Corpus Creation — synthetic data supplements human-created corpora
Coached LLM Prompting — synthetic examples can inform coaching dictionaries
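When the filtered pairs are used as few-shot examples rather than training data, they can be assembled into a prompt along these lines. This is a sketch under assumptions: the prompt wording, the language labels, and the `build_few_shot_prompt` helper are hypothetical and not tied to any particular LLM API.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],  # (source, target) pairs that passed filtering
    new_source: str,
    source_name: str = "English",
    target_name: str = "Plains Cree",
    max_examples: int = 5,  # assumption: keep the prompt short
) -> str:
    """Format translation examples followed by the sentence to translate."""
    lines = [f"Translate from {source_name} to {target_name}.", ""]
    for source, target in examples[:max_examples]:
        lines += [f"{source_name}: {source}", f"{target_name}: {target}", ""]
    # Leave the final target line open for the model to complete.
    lines += [f"{source_name}: {new_source}", f"{target_name}:"]
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    [("The children are playing", "awâsisak mêtawêwak")],
    "The children are sleeping",
)
print(prompt)
```

Because synthetic examples shown in a prompt are imitated directly, it is worth preferring the highest-scoring pairs, or human-verified ones where available.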
