The MT Eval Arena
Executive Summary. The MT Eval Arena is an open benchmarking platform for machine translation methods, with a focus on languages that commercial services will never support. It provides standardized evaluation, a public leaderboard, and a deployment bridge to production via i18n-rosetta. For Indigenous languages, ownership of proven methods transfers to the community.
An open proving ground for machine translation methods — especially for languages that commercial services will never support.
Build a method. Benchmark it. Prove it works. If it wins, it gets deployed.
The Problem
Google Translate supports ~130 languages, but more than 7,000 are spoken on Earth. For thousands of languages, including many Indigenous languages with active speaker communities, no commercial translation API exists, no large parallel corpus has been assembled, and no pretrained model produces reliable output.
The economics of commercial MT don't reach these languages. The communities that need translation tools most are the ones least likely to have them built.
The Arena exists to change that. It provides the infrastructure to develop, evaluate, and deploy translation methods for any language — with reproducible scoring, open submission, and community governance over who controls the results.
How It Works
- You build a translation method — coached LLM, fine-tuned model, FST-gated pipeline, or anything else that produces translations.
- The harness benchmarks it — standardized metrics (chrF++, exact match, FST acceptance), fingerprinted to a specific Git commit.
- Results appear on the leaderboard — every submission is reproducible and comparable.
- If it wins, ownership transfers — for Indigenous languages, the winning method's code transfers to the community governance organization.
- The method deploys to production — via i18n-rosetta, the developer-facing API. Revenue flows back to the community.
Prove it here. Deploy it there.
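The benchmarking step above can be sketched roughly as follows. This is a minimal illustration, not the Arena harness: `simple_chrf`, `exact_match`, `fingerprint`, and `evaluate` are hypothetical names, the character n-gram score is a simplified stand-in for chrF++ (no word n-grams, sentence-level averaging only), FST acceptance is omitted, and the record layout is an assumption.

```python
import hashlib
import json
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score in [0, 1].

    NOT the official chrF++: no word n-grams, no corpus-level aggregation.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)


def exact_match(hypothesis: str, reference: str) -> bool:
    """True when the strings are identical after trimming outer whitespace."""
    return hypothesis.strip() == reference.strip()


def fingerprint(commit_sha: str, benchmark_id: str, scores: dict) -> str:
    """Hash the commit, benchmark, and scores into one reproducible identifier."""
    payload = json.dumps(
        {"commit": commit_sha, "benchmark": benchmark_id, "scores": scores},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def evaluate(hypotheses, references, commit_sha, benchmark_id):
    """Score a run and attach a fingerprint (hypothetical record layout)."""
    if not references or len(hypotheses) != len(references):
        raise ValueError("need equal, non-empty hypothesis and reference lists")
    n = len(references)
    pairs = list(zip(hypotheses, references))
    scores = {
        "chrf_simplified": round(sum(simple_chrf(h, r) for h, r in pairs) / n, 4),
        "exact_match": round(sum(exact_match(h, r) for h, r in pairs) / n, 4),
    }
    return {
        "benchmark": benchmark_id,
        "commit": commit_sha,  # e.g. the output of `git rev-parse HEAD`
        "scores": scores,
        "fingerprint": fingerprint(commit_sha, benchmark_id, scores),
    }


if __name__ == "__main__":
    # Toy strings only; these are not benchmark entries.
    print(evaluate(["abc", "abd"], ["abc", "xyz"], "deadbeef", "toy-v0"))
```

Hashing the commit together with the scores means the same code on the same data always yields the same identifier, which is one plausible way to make a leaderboard row checkable. For real submissions, an established chrF++ implementation should be preferred over this simplified score.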
Who This Is For
| You are... | The Arena gives you... |
|---|---|
| ML engineer / researcher | Standardized benchmarks, reproducible scoring, a leaderboard to compete on |
| Linguist | A framework to turn grammar rules and dictionaries into testable methods |
| Language community member | Governance over how your language's methods are developed and deployed |
| Funder / grant reviewer | Transparent, reproducible metrics to evaluate translation research proposals |
| Student | An open challenge with real impact — build a method, submit your scores |
Current Benchmarks
EdTeKLA Development Set v1
- Language pair: English → Plains Cree (Standard Roman Orthography, SRO)
- Entries: 124 curated pairs
- License: CC BY-NC-SA 4.0
- Source: EdTeKLA research group, University of Alberta
FLORES+ Devtest
- Language pairs: English → 39 languages
- Entries: 1,012 sentences per language
- License: CC BY-SA 4.0
- Source: OLDI
The One Rule
:::danger Do not train on evaluation data
Methods exposed to the benchmark dataset (as training data, few-shot examples, dictionary entries, or prompt material) will be disqualified. Fine-tune on whatever you want. Just not on the test set.
:::
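Before submitting, it may help to check your own training and prompt material against the benchmark. The sketch below is one possible self-check, not the Arena's disqualification procedure: `normalize` and `find_contamination` are hypothetical helpers, and matching on normalized substrings is an assumption that will miss paraphrases and near-duplicates.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply NFC, case-fold, and collapse whitespace so trivial edits don't hide a match."""
    return " ".join(unicodedata.normalize("NFC", text).casefold().split())


def find_contamination(training_texts, benchmark_pairs):
    """Return benchmark (source, target) pairs found inside any training text.

    A pair is flagged when its normalized source or target occurs as a
    substring of a normalized training text, which also catches benchmark
    sentences embedded in prompts or dictionary entries.
    """
    corpus = [normalize(t) for t in training_texts]
    leaked = []
    for source, target in benchmark_pairs:
        # Skip empty strings: "" is a substring of everything.
        needles = [s for s in (normalize(source), normalize(target)) if s]
        if any(needle in text for needle in needles for text in corpus):
            leaked.append((source, target))
    return leaked


if __name__ == "__main__":
    # Toy data only; these are not benchmark entries.
    training = ["Hello   World", "Prompt: translate 'good morning' please"]
    benchmark = [
        ("hello world", "tgt-1"),
        ("good morning", "tgt-2"),
        ("unseen sentence", "tgt-3"),
    ]
    for pair in find_contamination(training, benchmark):
        print("possible leak:", pair)
```

The pairwise substring scan is quadratic, which is acceptable for a benchmark of a few hundred or a few thousand entries; a larger corpus would call for an n-gram index instead.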
Next Steps
- Submit a Method — how to submit your first benchmark run
- Benchmark Specification — the full experiment protocol
- Leaderboard Rules — submission criteria and anti-gaming policies
- Data Sovereignty — OCAP, CARE, and why ownership transfer matters
- The Economic Model — how Arena scores become community revenue