The MT Eval Arena

Executive Summary. The MT Eval Arena is an open benchmarking platform for machine translation methods, with a focus on languages that commercial services will never support. It provides standardized evaluation, a public leaderboard, and a deployment bridge to production via i18n-rosetta. For Indigenous languages, ownership of winning methods transfers to the community.

An open proving ground for machine translation methods — especially for languages that commercial services will never support.

Build a method. Benchmark it. Prove it works. If it wins, it gets deployed.

The Problem

Google Translate supports ~130 languages, yet over 7,000 are spoken on Earth. For thousands of them, including many Indigenous languages with active speaker communities, no commercial translation API exists, no large parallel corpus has been assembled, and no pretrained model produces reliable output.

The economics of commercial MT don't reach these languages. The communities that need translation tools most are the ones least likely to have them built.

The Arena exists to change that. It provides the infrastructure to develop, evaluate, and deploy translation methods for any language — with reproducible scoring, open submission, and community governance over who controls the results.

How It Works

You build a translation method — coached LLM, fine-tuned model, FST-gated pipeline, or anything else that produces translations.
The harness benchmarks it — standardized metrics (chrF++, exact match, FST acceptance), fingerprinted to a specific Git commit.
Results appear on the leaderboard — every submission is reproducible and comparable.
If it wins, ownership transfers — for Indigenous languages, the winning method's code transfers to the community governance organization.
The method deploys to production — via i18n-rosetta, the developer-facing API. Revenue flows back to the community.
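The scoring and fingerprinting steps above can be sketched roughly as follows. This is a minimal illustration, not the Arena's actual harness: the function names (`exact_match`, `simple_chrf`, `fingerprint`) and the fingerprint payload layout are assumptions made for this example. The character n-gram F-score here is a simplified stand-in for chrF++ (the real metric also includes word n-grams and is normally computed with an established library), and FST acceptance is omitted because it requires a language-specific analyzer.

```python
import hashlib
import json
from collections import Counter


def exact_match(hypotheses, references):
    """Fraction of hypotheses identical to their reference after trimming whitespace."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must align one-to-one")
    if not references:
        return 0.0
    hits = sum(h.strip() == r.strip() for h, r in zip(hypotheses, references))
    return hits / len(references)


def char_ngrams(text, n):
    """Multiset of character n-grams with all whitespace removed."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-score in [0, 1]; NOT the official chrF++."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)


def fingerprint(commit_sha, benchmark, scores):
    """Hash the commit, benchmark name, and scores into one reproducible ID.

    The payload layout is hypothetical; it only illustrates tying a result
    to a specific Git commit.
    """
    payload = json.dumps(
        {"commit": commit_sha, "benchmark": benchmark, "scores": scores},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Toy data only; the commit SHA and benchmark name below are placeholders.
hyps = ["hello world", "good morning"]
refs = ["hello world", "good evening"]
scores = {
    "exact_match": exact_match(hyps, refs),
    "simple_chrf": sum(simple_chrf(h, r) for h, r in zip(hyps, refs)) / len(refs),
}
print(scores, fingerprint("0123abc", "toy-benchmark", scores))
```

Serializing with `sort_keys=True` keeps the fingerprint stable regardless of dictionary ordering, so the same commit and scores always produce the same ID.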

Prove it here. Deploy it there.

Who This Is For

You are...	The Arena gives you...
ML engineer / researcher	Standardized benchmarks, reproducible scoring, a leaderboard to compete on
Linguist	A framework to turn grammar rules and dictionaries into testable methods
Language community member	Governance over how your language's methods are developed and deployed
Funder / grant reviewer	Transparent, reproducible metrics to evaluate translation research proposals
Student	An open challenge with real impact — build a method, submit your scores

Current Benchmarks

EDTeKLA Development Set v1

Language pair: English → Plains Cree (SRO)
Entries: 124 curated pairs
License: CC BY-NC-SA 4.0
Source: EdTeKLA research group, University of Alberta

FLORES+ Devtest

Language pairs: English → 39 languages
Entries: 1,012 sentences per language
License: CC BY-SA 4.0
Source: OLDI

The One Rule

:::danger Do not train on evaluation data
Methods exposed to the benchmark dataset (as training data, few-shot examples, dictionary entries, or prompt material) will be disqualified. Fine-tune on whatever you want. Just not on the test set.
:::
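Before submitting, it can help to check your own training or prompt material against the benchmark sentences. The sketch below is one possible self-check, not the Arena's enforcement mechanism: `normalize`, `digest`, and `find_contamination` are hypothetical helpers written for this example. It only catches entries that match a benchmark sentence exactly after Unicode normalization, case folding, and whitespace collapsing. It will not detect paraphrases or partial overlap.

```python
import hashlib
import unicodedata


def normalize(text):
    """NFC-normalize, case-fold, and collapse whitespace so trivial variants match."""
    return " ".join(unicodedata.normalize("NFC", text).casefold().split())


def digest(text):
    """SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_contamination(training_texts, benchmark_texts):
    """Return the training texts whose normalized form equals any benchmark text."""
    benchmark_digests = {digest(t) for t in benchmark_texts}
    return [t for t in training_texts if digest(t) in benchmark_digests]


# Toy data only: the first training entry differs from the benchmark
# sentence in case and spacing, so it is still flagged.
training = ["The  dog sleeps.", "A sentence that is not in the benchmark."]
benchmark = ["the dog sleeps."]
print(find_contamination(training, benchmark))
```

Comparing digests rather than raw strings means the check can run against a published list of hashes, without redistributing the benchmark text itself.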

Next Steps

Submit a Method — how to submit your first benchmark run
Benchmark Specification — the full experiment protocol
Leaderboard Rules — submission criteria and anti-gaming policies
Data Sovereignty — OCAP, CARE, and why ownership transfer matters
The Economic Model — how Arena scores become community revenue

→ View the Leaderboard
