Coached LLM Prompting

The idea: Inject grammar rules, bilingual dictionaries, and style notes directly into the LLM's system prompt. No training, no fine-tuning — just structured linguistic knowledge that steers output toward valid translations.

:::info This is a cookbook, not a finished implementation This guide sketches the approach and its key design decisions. Adapt it to your language pair, available resources, and evaluation goals. :::

When to Use This

You have linguistic knowledge about the target language (grammar rules, dictionary entries, style preferences) but not enough parallel data for fine-tuning
You want to iterate fast — prompt changes deploy in seconds, no retraining
The target language has known patterns an LLM gets wrong (gender agreement, script conventions, formality levels)
You want to benchmark coached prompting against a baseline and iterate on what works

How It Works

Assemble coaching data — grammar rules, a bilingual dictionary, and style notes in a structured JSON file
Configure the register — a system prompt prefix that sets the language, script, and tone
Run the harness — the coaching data gets injected into every LLM prompt
Review failures — look at what the quality gate rejects, add rules to address patterns
Iterate — each coaching file revision is a new experiment; the harness tracks them all

Coaching Data Structure

coaching/<locale>.json
{
  "grammar_rules": [
    "Adjectives agree in gender and number with the noun they modify",
    "Use formal register (vous) for all UI text",
    "Preserve interpolation variables exactly: {{name}}, {count}"
  ],
  "dictionary": {
    "dashboard": "tableau de bord",
    "settings": "paramètres",
    "deploy": "déployer"
  },
  "style_notes": "Prefer active voice. Avoid anglicisms where a native term exists. Keep sentences concise for UI readability."
}

Key Design Decisions

Rule specificity vs. context window: More rules give the LLM more guidance, but eat into the context window available for actual translation. Start with 5–10 high-impact rules and add more only when you see specific failure patterns.

Dictionary coverage: You don't need a complete dictionary — focus on terms the LLM consistently gets wrong. Even 20–30 forced terms can dramatically improve consistency.

Rule ordering matters: Put the most important rules first. LLMs attend more strongly to early instructions.

Running an Experiment

python eval/baseline_experiment.py \
  --dataset data/edtekla-dev-v1.json \
  --model google/gemini-2.5-pro \
  --condition coached-v1 \
  --coaching-file coaching/crk.json

Pros and Cons


✅ Zero training cost	❌ Quality ceiling limited by LLM's base knowledge
✅ Instant iteration (change prompt → re-run)	❌ Context window limits how much coaching fits
✅ Works with any LLM provider	❌ Rules can conflict — debugging prompt interactions is art
✅ Transparent — you can read exactly what the LLM sees	❌ Doesn't create new knowledge, only steers existing knowledge

Combines Well With

FST-Gated Pipeline — coaching + morphological validation catches what coaching alone misses
Dictionary-Augmented LLM — forced terminology is a form of coaching
Few-Shot Prompting — examples + rules together are more powerful than either alone

When to Use This​

How It Works​

Coaching Data Structure​

Key Design Decisions​

Running an Experiment​

Pros and Cons​

Combines Well With​

See Also​