Skip to main content

Frequently Asked Questions

Executive Summary. Answers to common questions about the MT Eval Arena — how scoring works, what gets disqualified, how to handle languages without FSTs, model and parameter recommendations, and the submission process.


Scoring & Metrics

What metrics does the harness compute?

The harness computes five metrics for Plains Cree (the current benchmark language). Three are language-agnostic and will work for any language; two currently rely on CRK-specific plugins and will be generalized as we expand to more languages.

MetricScaleWhat It MeasuresStatus
chrF++0–100Character n-gram overlap between predicted and reference translations. Best surface metric for morphologically rich languages. Uses sacrebleu's native scoring.✅ All languages
Exact match0.0–1.0Proportion of entries where the prediction exactly matches the reference after normalization.✅ All languages
FST acceptance0.0–1.0Proportion of output words accepted by a finite-state transducer (morphological analyzer). Only computed when an FST binary is provided.✅ All languages with FST
Equivalent match0.0–1.0Fraction of entries matching the reference or an acceptable variant — accounting for word order, orthographic convention, and dialectal differences.⚡ CRK (generalizing)
Semantic score0.0–1.0Meaning preservation score — how well does the translation capture the intended meaning regardless of surface form?⚡ CRK (generalizing)

Additional metrics are planned: morphological accuracy, code-switching detection, terminology adherence, and hallucination detection. See Scoring Specification §2 for the full 19-metric inventory.

How is the composite score calculated?

The composite is a weighted average of available metrics, normalized to a 0.0–1.0 scale. Weights are defined in two profiles:

  • Profile A (languages with FST): 9 metrics, structural metrics (FST + morphological accuracy) carry 40% of the composite weight
  • Profile B (languages without FST): 8 metrics, semantic and chrF++ carry equal top weight

When a metric is unavailable, its weight is redistributed proportionally across the remaining metrics. This means early-stage benchmarks (with only chrF++ and exact match available) still produce valid composites — the effective weights just reflect what's available.

The full weight tables, normalization rules, and exclusion rationale are in Scoring Specification §4. The harness code mirrors these tables in mt_eval_harness/scoring.py. chrF++ is normalized by dividing by 100 before weighting; code-switching and hallucination rates are inverted (lower = better).

What are quality tiers?

Quality tiers are heuristic labels mapped to composite score ranges. They help communicate what a score means practically:

TierComposite RangeInterpretation
Baseline0.00 – 0.30Below useful quality. Method needs significant improvement.
Emerging0.30 – 0.50Shows promise. Some translations are correct but inconsistent.
Functional0.50 – 0.70Usable for reference with human review. Not suitable for unreviewed deployment.
Deployable0.70 – 0.85Ready for production use with periodic review. Triggers ownership transfer eligibility.
Fluent0.85 – 1.00Near-native quality. Suitable for unsupervised deployment.

What's the difference between quality tiers and verification tiers?

Quality tiers describe what the automated score means (Baseline → Fluent). Verification tiers describe who validated the result:

Verification TierWhat It Means
Self-benchmarkedThe submitter ran the harness themselves. Scores are plausible but unverified.
GDS VerifiedA maintainer reproduced the result using the submitted method configuration.
Community ValidatedBilingual speakers reviewed the translations and confirmed quality.

A method can be "Deployable" quality but only "Self-benchmarked" verification — meaning the score looks great but nobody has independently confirmed it.


Submission & Disqualification

What gets my submission disqualified?

Your submission will be rejected or flagged if:

  1. Your method was exposed to evaluation data. If you trained, fine-tuned, few-shot-prompted, or otherwise used any entries from the evaluation dataset, your scores are artificially inflated. This includes using the reference translations in your prompt.
  2. Your run card fails integrity checks. The fingerprint must match the configuration. Tampered run cards are rejected.
  3. Your method doesn't implement the TranslationProcess protocol. The harness expects translate(entries, config) → results. Custom integrations that bypass the harness are not accepted.

Can I submit multiple times?

Yes. The leaderboard tracks all submissions. You can iterate — run dozens of experiments, only submit your best. Each submission records a unique fingerprint, so there's no ambiguity about which run produced which score.

How do I get my score verified?

  1. Self-benchmarked (automatic): Every submission starts here.
  2. GDS Verified: Submit your method as a reproducible package (code + config + coaching data). A maintainer will re-run it against the same dataset and confirm the scores match.
  3. Community Validated: For Indigenous languages, this requires bilingual speakers to review a sample of translations. This cannot be automated — it requires community engagement.

Is the submission API live?

Not yet. The https://mtevalarena.org/api/leaderboard/submit endpoint is aspirational. Current submissions should be made via pull request to the eval harness repo with your run card JSON in the results/ directory.


Models & Parameters

What model should I use?

There's no single best model — it depends on the language pair, your budget, and your approach. General guidance:

Language TypeRecommended Starting PointWhy
High-resource (French, Spanish, Japanese)google/gemini-2.5-flash or gpt-4o-miniFast, cheap, strong baseline
Low-resource with some LLM coverage (Quechua, Yoruba)google/gemini-2.5-pro or anthropic/claude-sonnet-4Larger models have better latent knowledge
Polysynthetic / very low-resource (Plains Cree, Inuktitut)google/gemini-2.5-pro with coachingCoaching data matters more than model choice here

The eval harness uses OpenRouter, so any model available on OpenRouter can be benchmarked. Run rosetta models --method llm to see available models.

What temperature should I use?

Lower is generally better for translation:

TemperatureEffectRecommended For
0.0 – 0.2Highly deterministic, consistent outputProduction methods, final benchmarks
0.3 – 0.5Some variation, occasionally more creativeExploration, early iteration
0.6+High variation, unpredictableNot recommended for MT benchmarking

Temperature is recorded in the run card, so different temperatures produce different fingerprints — they're treated as different experiments.

Does coaching data help?

Yes, significantly — for low-resource languages. Coaching data (grammar rules, dictionary entries, style notes) is injected into the LLM system prompt. For Plains Cree, coached methods consistently outperform raw LLM methods because the model has almost no Cree in its training data. The coaching data provides the linguistic context the model lacks.

For high-resource languages (French, Spanish), coaching has less impact because the model already has strong baseline knowledge.

See Coaching Data for the full specification.


FST & Morphological Validation

What if there's no FST for my language?

Many languages don't have a finite-state transducer. That's OK — the harness works without one. The composite score uses Profile B weights (see Scoring Specification §4.3) which shift weight to semantic and surface metrics. FST acceptance is marked as null in the run card.

The main registries for existing FSTs:

RegistryCoverageURL
GiellaLTSámi, Cree, Inuktitut, and other Arctic/subarctic languagesgiellalt.uit.no
ALTLabPlains Cree, Woods Cree, Ojibwealtlab.artsrn.ualberta.ca
Apertium~60 language pairs, mostly Europeanapertium.org
UniMorphMorphological paradigms for 150+ languagesunimorph.github.io

Can I build an FST?

Yes, but it's non-trivial. An FST encodes the morphological rules of a language — all valid word forms. Building one requires deep linguistic knowledge of the language. If you have access to a morphological grammar (e.g., from a linguistics department), it can be compiled into an FST using tools like HFST or Foma.

How does FST gating work in practice?

The FST-gated pipeline works like this:

  1. LLM generates a translation
  2. Each word in the output is checked against the FST
  3. Words the FST rejects are flagged as morphologically invalid
  4. The method can retry with feedback ("the word X is not valid, try again")
  5. After retries, remaining invalid words are logged

The FST acceptance rate measures how many words pass validation. See the FST-Gated Pipeline Tutorial for a complete worked example.


Data & Datasets

Can I contribute a dataset for a new language?

Yes. Minimum requirements from Benchmark Specification §11:

  • 50 gold-standard entries (source + verified reference translation)
  • 30 development entries (can overlap with gold standard for small corpora)
  • Community consent (for Indigenous languages, explicit authorization from a governance body)
  • Provenance documentation (where the data came from, what license applies)

New datasets open new leaderboard tracks automatically. See For Language Communities for the contributor guide.

What format should my dataset be in?

JSON with the canonical field names:

{
"name": "my-language-dev-v1",
"language_pair": "en-xxx",
"segment": "development",
"version": "1.0",
"entries": [
{
"id": 1,
"source": "Hello",
"reference": "[translation in target language]",
"difficulty": 1,
"domain": "general"
}
]
}

See Datasets for the full schema and difficulty tier definitions.


Sovereignty & Ownership

Who owns a method built for an Indigenous language?

For Indigenous languages, methods that reach Deployable tier (composite ≥ 0.70) AND pass community validation trigger the ownership transfer process. Code ownership transfers from the researcher to the language community's governance organization.

The researcher retains:

  • Publication rights (academic papers about the method)
  • Credit on the leaderboard
  • The right to apply the same techniques to other languages

The governance organization gains:

  • Full ownership of the method code and coaching data
  • Control over deployment (when, where, how)
  • Revenue from API usage (90% community, 10% infrastructure)

Can I use rosetta for non-Indigenous languages without any sovereignty concerns?

Yes. For standard languages (French, Japanese, Spanish, etc.), there are no sovereignty considerations. Use rosetta normally — translate, sync, publish as you wish. The sovereignty framework applies specifically to Indigenous and community-governed languages where data governance principles (OCAP®, CARE, Te Mana Raraunga) require special consideration.


See Also