MT Evaluation
Executive Summary. This page defines the leaderboard submission criteria, scoring metrics (chrF++, FST acceptance, exact match, equivalent match, semantic score), anti-gaming policies, verification tiers, and the submission workflow. Methods that have been exposed to evaluation data are disqualified.
rosetta includes a machine translation evaluation framework designed for reproducible benchmarking of translation methods — especially for low-resource and Indigenous languages where standard MT benchmarks don't exist and quality claims are hard to verify.
The Leaderboard
The centerpiece is the Method Leaderboard — a live, Supabase-backed scoreboard where researchers and community members submit and compare translation methods with fingerprinted, reproducible evaluation.
Every submission includes:
- Fingerprinted pipeline — tied to a specific Git commit and config hash, so results trace back to the exact code that produced them
- Versioned dataset — content-hashed and versioned; scores are only comparable within the same dataset version
- Standardised metrics — all scoring is computed by the shared evaluation harness, eliminating implementation differences
- Trust tiers — self-benchmarked, GDS Verified, or Community Validated
- Cost tracking — API cost per submission, so cost–quality tradeoffs are transparent
The leaderboard currently tracks five metrics. Three work for any language; two are available for Plains Cree and will be generalized as we expand:
| Metric | Type | What It Measures |
|---|---|---|
| chrF++ | Character n-gram F-score | Primary quality metric — correlates well with human judgement, especially for morphologically rich languages |
| Exact Match | Proportion of perfect matches | Strict accuracy — how often is the translation exactly the gold standard? |
| FST Acceptance | Morphological gate pass rate | For methods with finite-state transducer verification — what proportion of outputs are morphologically valid? |
| Equivalent Match | Acceptable variant rate | Fraction matching the reference or an acceptable variant (word order, orthographic convention). Currently CRK; generalizing. |
| Semantic Score | Semantic fidelity | Meaning preservation — does the translation capture the intended meaning regardless of surface form? Currently CRK; generalizing. |
:::info Full Metric Suite The Scoring Specification defines the complete 19-metric inventory across 5 categories, composite score formula, weight tables, and quality tier thresholds. :::
Available Datasets
EDTeKLA Development Set v1
The first evaluation dataset, built for English→Plains Cree (SRO) translation. Created by the EdTeKLA research group at the University of Alberta.
| Property | Value |
|---|---|
| ID | edtekla-dev-v1 |
| Language pair | EN → CRK (Plains Cree, SRO orthography) |
| Entry count | 124 |
| License | CC BY-NC-SA 4.0 |
| Provenance | gold_standard (verified by speakers), textbook (published educational materials) |
FLORES+ Devtest
A broad-coverage multilingual benchmark maintained by the Open Language Data Initiative (OLDI).
| Property | Value |
|---|---|
| Language pairs | EN → 39 languages (all rosetta registered languages) |
| Entry count | 1,012 sentences per language |
| License | CC BY-SA 4.0 |
| Source | Originally Meta FLORES-200, now OLDI-maintained |
| Location | Pre-extracted fixtures at test/benchmark/fixtures/ in the main rosetta repo |
See Evaluation Datasets for the full dataset schema, difficulty tiers, and how to create your own.
:::danger DO NOT TRAIN on evaluation data
These datasets are evaluation-only. Methods trained, fine-tuned, few-shot-prompted, or otherwise exposed to evaluation data will produce artificially inflated scores and will be disqualified from the leaderboard.
This is not a suggestion — it is the single most important rule of evaluation integrity. Use separate corpora for training. Evaluation sets must remain unseen by your model during development.
If you are using coaching data or few-shot examples, those must come from completely separate sources. If in doubt, don't include it. :::
:::warning LLM non-determinism
LLM outputs are non-deterministic. Scores represent point-in-time measurements under specific model versions and API configurations. Model providers may update weights, decoding strategies, or safety filters at any time, which can cause score drift between runs. The leaderboard records the exact model slug and timestamp for every submission. :::
What Makes a Good Method
Not all methods are created equal. Here's what separates rigorous work from inflated scores.
Characteristics of a strong method
- Clean separation of train and eval data — your method has never seen the evaluation set during development, tuning, prompt engineering, or few-shot example selection
- Reproducible — someone else can clone your repo, run the harness, and get the same scores (within LLM non-determinism bounds)
- Documented — your method card describes what your method does, what tools it uses, and what its limitations are
- Honest about scope — if your method only works for one language pair, say so; if it degrades on certain morphological patterns, document that
- Community-aware — for Indigenous languages, your method respects data sovereignty. You've consulted with language communities or used only openly licensed data
Red flags (what gets disqualified)
| Red Flag | Why It's a Problem |
|---|---|
| Training on eval data | Defeats the purpose of evaluation entirely. Inflated scores mislead everyone. |
| Cherry-picking results | Running 10 times and submitting the best run without disclosing the others |
| Undisclosed post-processing | Manually fixing outputs before scoring |
| Contaminated coaching data | Using eval set examples as few-shot prompts or dictionary entries |
| Claiming commercial readiness without provenance | If your method uses CC BY-NC-SA data, it's not commercially ready |
Verification tiers
Verification tiers describe who validated the result — separate from the quality tiers (Baseline → Fluent) defined in the Scoring Specification, §5, which describe what the automated composite score means.
| Tier | Meaning | How to Get It |
|---|---|---|
| Self-benchmarked | You ran the harness yourself and submitted results | Open a PR with your run card |
| GDS Verified | The rosetta maintainers reproduced your results | Submit your method as an installable plugin |
| Community Validated | Governance org ran against gold-standard + community review | Submit method code to governance org |
How to Submit
- Build your method — see Building a Method for the method interface
- Run the harness — see Eval Harness for setup and usage
- Generate a run card — the harness produces a JSON run card with your scores, fingerprint, and metadata
- Open a PR — submit your run card to the eval harness repository
- Appear on the leaderboard — once merged, your results appear on the Method Leaderboard
Future Directions
- FLORES+ model comparison runs — systematic evaluation of frontier models (GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, etc.) across all 39 rosetta languages
- More language pairs — Quechua, Inuktitut, and other low-resource languages as community-verified datasets become available
- Dataset import — tooling to convert external evaluation datasets (WMT, Tatoeba, etc.) into the rosetta evaluation format
- Automated re-runs — detecting model version changes and re-running benchmarks to track score drift
See Also
- Method Leaderboard — live scores and submissions
- Eval Harness — how to run evaluations
- Evaluation Datasets — dataset format and available datasets
- Building a Method — the method interface specification
- Run Card Specification — the run card JSON schema
- Benchmark Specification — evaluation protocol, corpus format, sovereignty
- Scoring Specification — SSOT for metrics, composite weights, and quality tiers