Frequently Asked Questions
Executive Summary. Answers to common questions about the MT Eval Arena — how scoring works, what gets disqualified, how to handle languages without FSTs, model and parameter recommendations, and the submission process.
Scoring & Metrics
What metrics does the harness compute?
The harness computes five metrics for Plains Cree (the current benchmark language). Three are language-agnostic and will work for any language; two currently rely on CRK-specific plugins and will be generalized as we expand to more languages.
| Metric | Scale | What It Measures | Status |
|---|---|---|---|
| chrF++ | 0–100 | Character n-gram overlap between predicted and reference translations. Best surface metric for morphologically rich languages. Uses sacrebleu's native scoring. | ✅ All languages |
| Exact match | 0.0–1.0 | Proportion of entries where the prediction exactly matches the reference after normalization. | ✅ All languages |
| FST acceptance | 0.0–1.0 | Proportion of output words accepted by a finite-state transducer (morphological analyzer). Only computed when an FST binary is provided. | ✅ All languages with FST |
| Equivalent match | 0.0–1.0 | Fraction of entries matching the reference or an acceptable variant — accounting for word order, orthographic convention, and dialectal differences. | ⚡ CRK (generalizing) |
| Semantic score | 0.0–1.0 | Meaning preservation score — how well does the translation capture the intended meaning regardless of surface form? | ⚡ CRK (generalizing) |
Additional metrics are planned: morphological accuracy, code-switching detection, terminology adherence, and hallucination detection. See Scoring Specification §2 for the full 19-metric inventory.
How is the composite score calculated?
The composite is a weighted average of available metrics, normalized to a 0.0–1.0 scale. Weights are defined in two profiles:
- Profile A (languages with FST): 9 metrics, structural metrics (FST + morphological accuracy) carry 40% of the composite weight
- Profile B (languages without FST): 8 metrics, semantic and chrF++ carry equal top weight
When a metric is unavailable, its weight is redistributed proportionally across the remaining metrics. This means early-stage benchmarks (with only chrF++ and exact match available) still produce valid composites — the effective weights just reflect what's available.
The full weight tables, normalization rules, and exclusion rationale are in Scoring Specification §4. The harness code mirrors these tables in mt_eval_harness/scoring.py. chrF++ is normalized by dividing by 100 before weighting; code-switching and hallucination rates are inverted (lower = better).
What are quality tiers?
Quality tiers are heuristic labels mapped to composite score ranges. They help communicate what a score means practically:
| Tier | Composite Range | Interpretation |
|---|---|---|
| Baseline | 0.00 – 0.30 | Below useful quality. Method needs significant improvement. |
| Emerging | 0.30 – 0.50 | Shows promise. Some translations are correct but inconsistent. |
| Functional | 0.50 – 0.70 | Usable for reference with human review. Not suitable for unreviewed deployment. |
| Deployable | 0.70 – 0.85 | Ready for production use with periodic review. Triggers ownership transfer eligibility. |
| Fluent | 0.85 – 1.00 | Near-native quality. Suitable for unsupervised deployment. |
What's the difference between quality tiers and verification tiers?
Quality tiers describe what the automated score means (Baseline → Fluent). Verification tiers describe who validated the result:
| Verification Tier | What It Means |
|---|---|
| Self-benchmarked | The submitter ran the harness themselves. Scores are plausible but unverified. |
| GDS Verified | A maintainer reproduced the result using the submitted method configuration. |
| Community Validated | Bilingual speakers reviewed the translations and confirmed quality. |
A method can be "Deployable" quality but only "Self-benchmarked" verification — meaning the score looks great but nobody has independently confirmed it.
Submission & Disqualification
What gets my submission disqualified?
Your submission will be rejected or flagged if:
- Your method was exposed to evaluation data. If you trained, fine-tuned, few-shot-prompted, or otherwise used any entries from the evaluation dataset, your scores are artificially inflated. This includes using the reference translations in your prompt.
- Your run card fails integrity checks. The fingerprint must match the configuration. Tampered run cards are rejected.
- Your method doesn't implement the TranslationProcess protocol. The harness expects
translate(entries, config) → results. Custom integrations that bypass the harness are not accepted.
Can I submit multiple times?
Yes. The leaderboard tracks all submissions. You can iterate — run dozens of experiments, only submit your best. Each submission records a unique fingerprint, so there's no ambiguity about which run produced which score.
How do I get my score verified?
- Self-benchmarked (automatic): Every submission starts here.
- GDS Verified: Submit your method as a reproducible package (code + config + coaching data). A maintainer will re-run it against the same dataset and confirm the scores match.
- Community Validated: For Indigenous languages, this requires bilingual speakers to review a sample of translations. This cannot be automated — it requires community engagement.
Is the submission API live?
Not yet. The https://mtevalarena.org/api/leaderboard/submit endpoint is aspirational. Current submissions should be made via pull request to the eval harness repo with your run card JSON in the results/ directory.
Models & Parameters
What model should I use?
There's no single best model — it depends on the language pair, your budget, and your approach. General guidance:
| Language Type | Recommended Starting Point | Why |
|---|---|---|
| High-resource (French, Spanish, Japanese) | google/gemini-2.5-flash or gpt-4o-mini | Fast, cheap, strong baseline |
| Low-resource with some LLM coverage (Quechua, Yoruba) | google/gemini-2.5-pro or anthropic/claude-sonnet-4 | Larger models have better latent knowledge |
| Polysynthetic / very low-resource (Plains Cree, Inuktitut) | google/gemini-2.5-pro with coaching | Coaching data matters more than model choice here |
The eval harness uses OpenRouter, so any model available on OpenRouter can be benchmarked. Run rosetta models --method llm to see available models.
What temperature should I use?
Lower is generally better for translation:
| Temperature | Effect | Recommended For |
|---|---|---|
| 0.0 – 0.2 | Highly deterministic, consistent output | Production methods, final benchmarks |
| 0.3 – 0.5 | Some variation, occasionally more creative | Exploration, early iteration |
| 0.6+ | High variation, unpredictable | Not recommended for MT benchmarking |
Temperature is recorded in the run card, so different temperatures produce different fingerprints — they're treated as different experiments.
Does coaching data help?
Yes, significantly — for low-resource languages. Coaching data (grammar rules, dictionary entries, style notes) is injected into the LLM system prompt. For Plains Cree, coached methods consistently outperform raw LLM methods because the model has almost no Cree in its training data. The coaching data provides the linguistic context the model lacks.
For high-resource languages (French, Spanish), coaching has less impact because the model already has strong baseline knowledge.
See Coaching Data for the full specification.
FST & Morphological Validation
What if there's no FST for my language?
Many languages don't have a finite-state transducer. That's OK — the harness works without one. The composite score uses Profile B weights (see Scoring Specification §4.3) which shift weight to semantic and surface metrics. FST acceptance is marked as null in the run card.
The main registries for existing FSTs:
| Registry | Coverage | URL |
|---|---|---|
| GiellaLT | Sámi, Cree, Inuktitut, and other Arctic/subarctic languages | giellalt.uit.no |
| ALTLab | Plains Cree, Woods Cree, Ojibwe | altlab.artsrn.ualberta.ca |
| Apertium | ~60 language pairs, mostly European | apertium.org |
| UniMorph | Morphological paradigms for 150+ languages | unimorph.github.io |
Can I build an FST?
Yes, but it's non-trivial. An FST encodes the morphological rules of a language — all valid word forms. Building one requires deep linguistic knowledge of the language. If you have access to a morphological grammar (e.g., from a linguistics department), it can be compiled into an FST using tools like HFST or Foma.
How does FST gating work in practice?
The FST-gated pipeline works like this:
- LLM generates a translation
- Each word in the output is checked against the FST
- Words the FST rejects are flagged as morphologically invalid
- The method can retry with feedback ("the word X is not valid, try again")
- After retries, remaining invalid words are logged
The FST acceptance rate measures how many words pass validation. See the FST-Gated Pipeline Tutorial for a complete worked example.
Data & Datasets
Can I contribute a dataset for a new language?
Yes. Minimum requirements from Benchmark Specification §11:
- 50 gold-standard entries (source + verified reference translation)
- 30 development entries (can overlap with gold standard for small corpora)
- Community consent (for Indigenous languages, explicit authorization from a governance body)
- Provenance documentation (where the data came from, what license applies)
New datasets open new leaderboard tracks automatically. See For Language Communities for the contributor guide.
What format should my dataset be in?
JSON with the canonical field names:
{
"name": "my-language-dev-v1",
"language_pair": "en-xxx",
"segment": "development",
"version": "1.0",
"entries": [
{
"id": 1,
"source": "Hello",
"reference": "[translation in target language]",
"difficulty": 1,
"domain": "general"
}
]
}
See Datasets for the full schema and difficulty tier definitions.
Sovereignty & Ownership
Who owns a method built for an Indigenous language?
For Indigenous languages, methods that reach Deployable tier (composite ≥ 0.70) AND pass community validation trigger the ownership transfer process. Code ownership transfers from the researcher to the language community's governance organization.
The researcher retains:
- Publication rights (academic papers about the method)
- Credit on the leaderboard
- The right to apply the same techniques to other languages
The governance organization gains:
- Full ownership of the method code and coaching data
- Control over deployment (when, where, how)
- Revenue from API usage (90% community, 10% infrastructure)
Can I use rosetta for non-Indigenous languages without any sovereignty concerns?
Yes. For standard languages (French, Japanese, Spanish, etc.), there are no sovereignty considerations. Use rosetta normally — translate, sync, publish as you wish. The sovereignty framework applies specifically to Indigenous and community-governed languages where data governance principles (OCAP®, CARE, Te Mana Raraunga) require special consideration.
See Also
- How It Works — the full solution explainer
- Scoring Specification — the SSOT for all scoring logic (metrics, weights, tiers)
- Benchmark Specification — evaluation protocol, corpus format, sovereignty
- Submit a Method — step-by-step quickstart
- Leaderboard Rules — submission criteria
- Data Sovereignty — OCAP®, CARE, and ethical obligations