Frequently Asked Questions

Executive Summary. Answers to common questions about the MT Eval Arena — how scoring works, what gets disqualified, how to handle languages without FSTs, model and parameter recommendations, and the submission process.

Scoring & Metrics

What metrics does the harness compute?

The harness computes five metrics for Plains Cree (the current benchmark language). Three are language-agnostic, although FST acceptance is only computed when an FST binary is available for the language; two currently rely on CRK-specific plugins and will be generalized as we expand to more languages.

Metric	Scale	What It Measures	Status
chrF++	0–100	Character n-gram overlap between predicted and reference translations. Best surface metric for morphologically rich languages. Uses sacrebleu's native scoring.	✅ All languages
Exact match	0.0–1.0	Proportion of entries where the prediction exactly matches the reference after normalization.	✅ All languages
FST acceptance	0.0–1.0	Proportion of output words accepted by a finite-state transducer (morphological analyzer). Only computed when an FST binary is provided.	✅ All languages with FST
Equivalent match	0.0–1.0	Fraction of entries matching the reference or an acceptable variant — accounting for word order, orthographic convention, and dialectal differences.	⚡ CRK (generalizing)
Semantic score	0.0–1.0	Meaning preservation score — how well does the translation capture the intended meaning regardless of surface form?	⚡ CRK (generalizing)

Additional metrics are planned: morphological accuracy, code-switching detection, terminology adherence, and hallucination detection. See Scoring Specification §2 for the full 19-metric inventory.
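
To make the exact match row in the table above concrete, here is a minimal sketch of a proportion-of-exact-matches computation. The normalization steps (Unicode NFC, whitespace collapsing, case-folding) are assumptions chosen for illustration; this page does not specify which normalization the harness applies.

```python
import unicodedata


def normalize(text: str) -> str:
    """Assumed normalization: Unicode NFC, collapsed whitespace, case-folded."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split()).casefold()


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Proportion of entries whose normalized prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Normalizing both sides before comparing keeps purely cosmetic differences (trailing spaces, composed versus decomposed accents) from counting as errors.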

How is the composite score calculated?

The composite is a weighted average of available metrics, normalized to a 0.0–1.0 scale. Weights are defined in two profiles:

Profile A (languages with FST): 9 metrics; structural metrics (FST + morphological accuracy) carry 40% of the composite weight
Profile B (languages without FST): 8 metrics; semantic and chrF++ carry equal top weight

When a metric is unavailable, its weight is redistributed proportionally across the remaining metrics. This means early-stage benchmarks (with only chrF++ and exact match available) still produce valid composites — the effective weights just reflect what's available.

The full weight tables, normalization rules, and exclusion rationale are in Scoring Specification §4. The harness code mirrors these tables in mt_eval_harness/scoring.py. chrF++ is normalized by dividing by 100 before weighting; code-switching and hallucination rates are inverted (lower = better).
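
The redistribution rule can be sketched as a weighted average taken over only the metrics that are present, which is equivalent to spreading a missing metric's weight proportionally across the rest. The weights and metric key names below are hypothetical placeholders, not the values in Scoring Specification §4 or the identifiers used in mt_eval_harness/scoring.py, and the inverted rates are assumed to be on a 0.0–1.0 scale.

```python
from typing import Mapping, Optional

# Hypothetical weights for illustration only; the real tables live in
# Scoring Specification §4.
EXAMPLE_WEIGHTS = {
    "chrf_pp": 0.3,
    "exact_match": 0.2,
    "fst_acceptance": 0.3,
    "hallucination_rate": 0.2,
}

# Rates where lower is better are inverted before weighting.
INVERTED = {"hallucination_rate", "code_switching_rate"}


def to_unit_scale(name: str, value: float) -> float:
    """Map a raw metric value onto 0.0-1.0 where higher is better."""
    if name == "chrf_pp":
        return value / 100.0  # chrF++ is reported on a 0-100 scale
    if name in INVERTED:
        return 1.0 - value
    return value


def composite(
    scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> Optional[float]:
    """Weighted average over available metrics; None marks an unavailable one."""
    available = {
        name: to_unit_scale(name, value)
        for name, value in scores.items()
        if value is not None and name in weights
    }
    total_weight = sum(weights[name] for name in available)
    if total_weight == 0:
        return None  # nothing to score
    weighted_sum = sum(weights[name] * score for name, score in available.items())
    return weighted_sum / total_weight
```

With only chrF++ (40.0) and exact match (0.1) available under these example weights, the effective weights become 0.6 and 0.4, so the composite is 0.6 × 0.40 + 0.4 × 0.10 = 0.28.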

What are quality tiers?

Quality tiers are heuristic labels mapped to composite score ranges. They help communicate what a score means practically:

Tier	Composite Range	Interpretation
Baseline	0.00 – 0.30	Below useful quality. Method needs significant improvement.
Emerging	0.30 – 0.50	Shows promise. Some translations are correct but inconsistent.
Functional	0.50 – 0.70	Usable for reference with human review. Not suitable for unreviewed deployment.
Deployable	0.70 – 0.85	Ready for production use with periodic review. For Indigenous languages, makes a method eligible for ownership transfer once it also passes community validation.
Fluent	0.85 – 1.00	Near-native quality. Suitable for unsupervised deployment.
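
The table can be read as a simple lookup. Adjacent rows share a boundary value, so the sketch below assumes each lower bound is inclusive; that matches the "composite ≥ 0.70" wording used for the Deployable tier in the Sovereignty & Ownership section, but it is an assumption for the other boundaries. The function name is hypothetical.

```python
# (lower bound, tier name), checked from highest to lowest.
# Assumption: a score exactly on a boundary belongs to the higher tier.
TIERS = [
    (0.85, "Fluent"),
    (0.70, "Deployable"),
    (0.50, "Functional"),
    (0.30, "Emerging"),
]


def quality_tier(composite: float) -> str:
    """Return the heuristic quality tier for a composite score in [0.0, 1.0]."""
    if not 0.0 <= composite <= 1.0:
        raise ValueError(f"composite must be within 0.0-1.0, got {composite}")
    for lower_bound, name in TIERS:
        if composite >= lower_bound:
            return name
    return "Baseline"
```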

What's the difference between quality tiers and verification tiers?

Quality tiers describe what the automated score means (Baseline → Fluent). Verification tiers describe who validated the result:

Verification Tier	What It Means
Self-benchmarked	The submitter ran the harness themselves. Scores are plausible but unverified.
GDS Verified	A maintainer reproduced the result using the submitted method configuration.
Community Validated	Bilingual speakers reviewed the translations and confirmed quality.

A method can have "Deployable" quality but only "Self-benchmarked" verification: the score looks great, but nobody has independently confirmed it.

Submission & Disqualification

What gets my submission disqualified?

Your submission will be rejected or flagged if:

Your method was exposed to evaluation data. If you trained, fine-tuned, few-shot-prompted, or otherwise used any entries from the evaluation dataset, your scores are artificially inflated. This includes using the reference translations in your prompt.
Your run card fails integrity checks. The fingerprint must match the configuration. Tampered run cards are rejected.
Your method doesn't implement the TranslationProcess protocol. The harness expects translate(entries, config) → results. Custom integrations that bypass the harness are not accepted.
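
As a rough illustration of the translate(entries, config) → results shape, the sketch below expresses it as a Python typing.Protocol. Only the method name and its two arguments come from this page; the dictionary-based entry and result types, the "prediction" field, and the EchoMethod class are assumptions, and the real protocol definition in the harness may differ.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class TranslationProcess(Protocol):
    """Assumed shape of the protocol: one method taking entries and a config."""

    def translate(
        self, entries: list[dict[str, Any]], config: dict[str, Any]
    ) -> list[dict[str, Any]]:
        ...


class EchoMethod:
    """Toy method that copies the source text; useful only to show the shape."""

    def translate(
        self, entries: list[dict[str, Any]], config: dict[str, Any]
    ) -> list[dict[str, Any]]:
        # "prediction" is a hypothetical result field name.
        return [{"id": e["id"], "prediction": e["source"]} for e in entries]


method = EchoMethod()
results = method.translate([{"id": 1, "source": "Hello"}], {"temperature": 0.0})
```

Because the protocol is structural, a method class does not need to inherit from anything; it only needs a translate method with a compatible signature.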

Can I submit multiple times?

Yes. The leaderboard tracks all submissions, so you can iterate: run dozens of experiments and submit only your best. Each submission records a unique fingerprint, so there's no ambiguity about which run produced which score.

How do I get my score verified?

Self-benchmarked (automatic): Every submission starts here.
GDS Verified: Submit your method as a reproducible package (code + config + coaching data). A maintainer will re-run it against the same dataset and confirm the scores match.
Community Validated: For Indigenous languages, this requires bilingual speakers to review a sample of translations. This cannot be automated — it requires community engagement.

Is the submission API live?

Not yet. The https://mtevalarena.org/api/leaderboard/submit endpoint is planned but not implemented. For now, submit via pull request to the eval harness repo, with your run card JSON in the results/ directory.

Models & Parameters

What model should I use?

There's no single best model — it depends on the language pair, your budget, and your approach. General guidance:

Language Type	Recommended Starting Point	Why
High-resource (French, Spanish, Japanese)	`google/gemini-2.5-flash` or `gpt-4o-mini`	Fast, cheap, strong baseline
Low-resource with some LLM coverage (Quechua, Yoruba)	`google/gemini-2.5-pro` or `anthropic/claude-sonnet-4`	Larger models have better latent knowledge
Polysynthetic / very low-resource (Plains Cree, Inuktitut)	`google/gemini-2.5-pro` with coaching	Coaching data matters more than model choice here

The eval harness uses OpenRouter, so any model available on OpenRouter can be benchmarked. Run rosetta models --method llm to see available models.

What temperature should I use?

Lower is generally better for translation:

Temperature	Effect	Recommended For
0.0 – 0.2	Highly deterministic, consistent output	Production methods, final benchmarks
0.3 – 0.5	Some variation, occasionally more creative	Exploration, early iteration
0.6+	High variation, unpredictable	Not recommended for MT benchmarking

Temperature is recorded in the run card, so different temperatures produce different fingerprints — they're treated as different experiments.
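
One common way to get this behavior is to hash a canonical serialization of the configuration. The sketch below is an assumption about how such a fingerprint could work, not the harness's actual scheme: the SHA-256 choice, the function names, and the "config" and "fingerprint" run card keys are all hypothetical.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Hash a canonical JSON form of the config (sorted keys, fixed separators)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_card_is_consistent(run_card: dict) -> bool:
    """Check that the recorded fingerprint matches the recorded config."""
    return run_card.get("fingerprint") == fingerprint(run_card.get("config", {}))


cold = {"model": "google/gemini-2.5-pro", "temperature": 0.0}
warm = {"model": "google/gemini-2.5-pro", "temperature": 0.3}
card = {"config": cold, "fingerprint": fingerprint(cold)}
```

Under this scheme, cold and warm hash to different values, and editing a run card's config without recomputing the fingerprint makes the consistency check fail.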

Does coaching data help?

Yes, significantly — for low-resource languages. Coaching data (grammar rules, dictionary entries, style notes) is injected into the LLM system prompt. For Plains Cree, coached methods consistently outperform raw LLM methods because the model has almost no Cree in its training data. The coaching data provides the linguistic context the model lacks.

For high-resource languages (French, Spanish), coaching has less impact because the model already has strong baseline knowledge.

See Coaching Data for the full specification.

FST & Morphological Validation

What if there's no FST for my language?

Many languages don't have a finite-state transducer. That's OK — the harness works without one. The composite score uses Profile B weights (see Scoring Specification §4.3) which shift weight to semantic and surface metrics. FST acceptance is marked as null in the run card.

The main registries for existing FSTs and related morphological resources:

Registry	Coverage	URL
GiellaLT	Sámi, Cree, Inuktitut, and other Arctic/subarctic languages	giellalt.uit.no
ALTLab	Plains Cree, Woods Cree, Ojibwe	altlab.artsrn.ualberta.ca
Apertium	~60 language pairs, mostly European	apertium.org
UniMorph	Morphological paradigms for 150+ languages	unimorph.github.io

Can I build an FST?

Yes, but it's non-trivial. An FST encodes the morphological rules of a language — all valid word forms. Building one requires deep linguistic knowledge of the language. If you have access to a morphological grammar (e.g., from a linguistics department), it can be compiled into an FST using tools like HFST or Foma.

How does FST gating work in practice?

The FST-gated pipeline works like this:

LLM generates a translation
Each word in the output is checked against the FST
Words the FST rejects are flagged as morphologically invalid
The method can retry with feedback ("the word X is not valid, try again")
After retries, remaining invalid words are logged

The FST acceptance rate measures how many words pass validation. See the FST-Gated Pipeline Tutorial for a complete worked example.
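
The steps above can be sketched as a small retry loop. This is a simplified illustration under stated assumptions, not the pipeline from the tutorial: generate and accepts are hypothetical callables standing in for the LLM call and the FST lookup, words are split on whitespace only, and the feedback wording is invented.

```python
from typing import Callable, Optional


def fst_gated_translate(
    source: str,
    generate: Callable[[str, Optional[str]], str],  # stand-in for the LLM call
    accepts: Callable[[str], bool],                 # stand-in for the FST lookup
    max_retries: int = 2,
) -> dict:
    """Generate, validate each word, and retry with feedback on rejections."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    feedback: Optional[str] = None
    for attempt in range(max_retries + 1):
        translation = generate(source, feedback)
        words = translation.split()  # assumption: whitespace tokenization
        rejected = [w for w in words if not accepts(w)]
        if not rejected:
            break
        feedback = "These words are not valid, try again: " + ", ".join(rejected)
    # Words still rejected after the last attempt are returned for logging.
    acceptance = (len(words) - len(rejected)) / len(words) if words else 0.0
    return {
        "translation": translation,
        "rejected": rejected,
        "fst_acceptance": acceptance,
        "attempts": attempt + 1,
    }
```

The returned acceptance rate here is per translation; the harness-level metric described above is the proportion over all output words.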

Data & Datasets

Can I contribute a dataset for a new language?

Yes. Minimum requirements from Benchmark Specification §11:

50 gold-standard entries (source + verified reference translation)
30 development entries (can overlap with gold standard for small corpora)
Community consent (for Indigenous languages, explicit authorization from a governance body)
Provenance documentation (where the data came from, what license applies)

New datasets open new leaderboard tracks automatically. See For Language Communities for the contributor guide.

What format should my dataset be in?

JSON with the canonical field names:

{
  "name": "my-language-dev-v1",
  "language_pair": "en-xxx",
  "segment": "development",
  "version": "1.0",
  "entries": [
    {
      "id": 1,
      "source": "Hello",
      "reference": "[translation in target language]",
      "difficulty": 1,
      "domain": "general"
    }
  ]
}

See Datasets for the full schema and difficulty tier definitions.
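
Before opening a pull request, it can help to check a dataset file against the field names shown above. The sketch below treats every field in the example as required and flags duplicate ids; both rules are assumptions for illustration, since the authoritative schema is the one documented in Datasets. The function name is hypothetical.

```python
# Field names taken from the example above; treating all of them as
# required is an assumption.
REQUIRED_TOP_LEVEL = {"name", "language_pair", "segment", "version", "entries"}
REQUIRED_PER_ENTRY = {"id", "source", "reference", "difficulty", "domain"}


def validate_dataset(data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    missing = REQUIRED_TOP_LEVEL - data.keys()
    if missing:
        problems.append(f"missing top-level fields: {sorted(missing)}")
    seen_ids = set()
    for index, entry in enumerate(data.get("entries", [])):
        missing = REQUIRED_PER_ENTRY - entry.keys()
        if missing:
            problems.append(f"entry {index}: missing fields {sorted(missing)}")
        entry_id = entry.get("id")
        if entry_id in seen_ids:
            problems.append(f"entry {index}: duplicate id {entry_id!r}")
        seen_ids.add(entry_id)
    return problems
```

To check a file on disk, load it with json.load and pass the resulting dictionary to validate_dataset.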

Sovereignty & Ownership

Who owns a method built for an Indigenous language?

For Indigenous languages, methods that reach Deployable tier (composite ≥ 0.70) AND pass community validation trigger the ownership transfer process. Code ownership transfers from the researcher to the language community's governance organization.

The researcher retains:

Publication rights (academic papers about the method)
Credit on the leaderboard
The right to apply the same techniques to other languages

The governance organization gains:

Full ownership of the method code and coaching data
Control over deployment (when, where, how)
Revenue from API usage (90% community, 10% infrastructure)

Can I use rosetta for non-Indigenous languages without any sovereignty concerns?

Yes. For non-Indigenous languages such as French, Japanese, and Spanish, there are no sovereignty considerations. Use rosetta normally: translate, sync, and publish as you wish. The sovereignty framework applies specifically to Indigenous and community-governed languages where data governance principles (OCAP®, CARE, Te Mana Raraunga) require special consideration.
