MT Evaluation

Executive Summary. This page defines the leaderboard submission criteria, scoring metrics (chrF++, FST acceptance, exact match, equivalent match, semantic score), anti-gaming policies, verification tiers, and the submission workflow. Methods that have been exposed to evaluation data are disqualified.

rosetta includes a machine translation evaluation framework designed for reproducible benchmarking of translation methods — especially for low-resource and Indigenous languages where standard MT benchmarks don't exist and quality claims are hard to verify.


The Leaderboard

The centerpiece is the Method Leaderboard — a live, Supabase-backed scoreboard where researchers and community members submit and compare translation methods with fingerprinted, reproducible evaluation.

Every submission includes:

  • Fingerprinted pipeline — tied to a specific Git commit and config hash, so results trace back to the exact code that produced them
  • Versioned dataset — content-hashed and versioned; scores are only comparable within the same dataset version
  • Standardised metrics — all scoring is computed by the shared evaluation harness, eliminating implementation differences
  • Trust tiers — self-benchmarked, GDS Verified, or Community Validated
  • Cost tracking — API cost per submission, so cost–quality tradeoffs are transparent
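As a rough illustration of what fingerprinting and content hashing could look like, the sketch below hashes a config dictionary and a list of dataset entries with SHA-256. The function names, the choice of SHA-256, the JSON canonicalization, and the 16-character truncation are all assumptions made for this example; the rosetta harness may compute its fingerprints differently.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Hash a config dict; sorted keys make the hash independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dataset_hash(entries: list) -> str:
    """Content-hash dataset entries in the order given (hypothetical scheme)."""
    digest = hashlib.sha256()
    for entry in entries:
        line = json.dumps(entry, sort_keys=True, ensure_ascii=False)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")  # separator so entry boundaries affect the hash
    return digest.hexdigest()


def fingerprint(git_commit: str, config: dict) -> str:
    """Combine a Git commit and a config hash into one short identifier."""
    payload = f"{git_commit}:{config_hash(config)}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Any change to the commit, the config, or a dataset entry yields a different hash, which is the property that makes results traceable and scores comparable only within one dataset version.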

The leaderboard currently tracks five metrics. Three work for any language; two are currently available only for Plains Cree (CRK) and will be generalized as coverage expands:

| Metric | Type | What It Measures |
| --- | --- | --- |
| chrF++ | Character n-gram F-score | Primary quality metric: correlates well with human judgement, especially for morphologically rich languages |
| Exact Match | Proportion of perfect matches | Strict accuracy: how often is the translation exactly the gold standard? |
| FST Acceptance | Morphological gate pass rate | For methods with finite-state transducer verification: what proportion of outputs are morphologically valid? |
| Equivalent Match | Acceptable variant rate | Fraction matching the reference or an acceptable variant (word order, orthographic convention). Currently CRK; generalizing. |
| Semantic Score | Semantic fidelity | Meaning preservation: does the translation capture the intended meaning regardless of surface form? Currently CRK; generalizing. |
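To make the difference between Exact Match and Equivalent Match concrete, here is a minimal sketch of both as simple proportions. The normalization step (Unicode NFC plus whitespace collapsing) and the `variants` structure are assumptions for illustration; the shared evaluation harness defines the authoritative behaviour, and the strings in the usage note are placeholders rather than real translations.

```python
import unicodedata


def normalize(text: str) -> str:
    """NFC-normalize and collapse whitespace (an assumed normalization)."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def _check_lengths(*lists):
    """Require non-empty lists of equal length."""
    if not lists[0] or any(len(lst) != len(lists[0]) for lst in lists):
        raise ValueError("inputs must be non-empty and of equal length")


def exact_match(hypotheses, references):
    """Proportion of hypotheses identical to their reference."""
    _check_lengths(hypotheses, references)
    hits = sum(normalize(h) == normalize(r) for h, r in zip(hypotheses, references))
    return hits / len(references)


def equivalent_match(hypotheses, references, variants):
    """Proportion matching the reference or a listed acceptable variant.

    variants[i] is a list of acceptable alternatives for references[i].
    """
    _check_lengths(hypotheses, references, variants)
    hits = 0
    for hyp, ref, alts in zip(hypotheses, references, variants):
        accepted = {normalize(ref)} | {normalize(a) for a in alts}
        hits += normalize(hyp) in accepted
    return hits / len(references)
```

With placeholder hypotheses `["b a", "d"]`, references `["a b", "d"]`, and variants `[["b a"], []]`, `exact_match` returns 0.5 while `equivalent_match` returns 1.0, because the reordered first output counts only under the variant-aware metric.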

:::info Full Metric Suite

The Scoring Specification defines the complete 19-metric inventory across 5 categories, the composite score formula, weight tables, and quality tier thresholds.

:::

→ View the leaderboard


Available Datasets

EDTeKLA Development Set v1

The first evaluation dataset, built for English→Plains Cree (SRO) translation. Created by the EdTeKLA research group at the University of Alberta.

| Property | Value |
| --- | --- |
| ID | edtekla-dev-v1 |
| Language pair | EN → CRK (Plains Cree, SRO orthography) |
| Entry count | 124 |
| License | CC BY-NC-SA 4.0 |
| Provenance | gold_standard (verified by speakers), textbook (published educational materials) |

FLORES+ Devtest

A broad-coverage multilingual benchmark maintained by the Open Language Data Initiative (OLDI).

| Property | Value |
| --- | --- |
| Language pairs | EN → 39 languages (all rosetta registered languages) |
| Entry count | 1,012 sentences per language |
| License | CC BY-SA 4.0 |
| Source | Originally Meta FLORES-200, now OLDI-maintained |
| Location | Pre-extracted fixtures at test/benchmark/fixtures/ in the main rosetta repo |

See Evaluation Datasets for the full dataset schema, difficulty tiers, and how to create your own.

:::danger DO NOT TRAIN on evaluation data

These datasets are evaluation-only. Methods trained, fine-tuned, few-shot-prompted, or otherwise exposed to evaluation data will produce artificially inflated scores and will be disqualified from the leaderboard.

This is not a suggestion — it is the single most important rule of evaluation integrity. Use separate corpora for training. Evaluation sets must remain unseen by your model during development.

If you are using coaching data or few-shot examples, those must come from completely separate sources. If in doubt, don't include it.

:::
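One way to guard against accidental contamination is to screen coaching data and few-shot examples against the evaluation set before use. The sketch below is a minimal, hypothetical check: the function name and the `source` and `reference` field names are assumptions, not the documented dataset schema, and it only catches verbatim overlap after light normalization. Paraphrased or partially overlapping examples would need a stronger check.

```python
import unicodedata


def _norm(text: str) -> str:
    """Case-fold, NFC-normalize, and collapse whitespace before comparing."""
    return " ".join(unicodedata.normalize("NFC", text).casefold().split())


def find_contamination(coaching_texts, eval_entries):
    """Return coaching texts that also appear as a source or reference
    in the evaluation set (field names are assumed for this sketch)."""
    eval_strings = set()
    for entry in eval_entries:
        eval_strings.add(_norm(entry["source"]))
        eval_strings.add(_norm(entry["reference"]))
    return [text for text in coaching_texts if _norm(text) in eval_strings]
```

An empty result does not prove the data is clean; it only rules out exact duplicates. Any non-empty result means the listed examples should be removed from the coaching data.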

:::warning LLM non-determinism

LLM outputs are non-deterministic. Scores represent point-in-time measurements under specific model versions and API configurations. Model providers may update weights, decoding strategies, or safety filters at any time, which can cause score drift between runs. The leaderboard records the exact model slug and timestamp for every submission.

:::


What Makes a Good Method

Not all methods are created equal. Here's what separates rigorous work from inflated scores.

Characteristics of a strong method

  • Clean separation of train and eval data — your method has never seen the evaluation set during development, tuning, prompt engineering, or few-shot example selection
  • Reproducible — someone else can clone your repo, run the harness, and get the same scores (within LLM non-determinism bounds)
  • Documented — your method card describes what your method does, what tools it uses, and what its limitations are
  • Honest about scope — if your method only works for one language pair, say so; if it degrades on certain morphological patterns, document that
  • Community-aware — for Indigenous languages, your method respects data sovereignty. You've consulted with language communities or used only openly licensed data

Red flags (what gets disqualified)

| Red Flag | Why It's a Problem |
| --- | --- |
| Training on eval data | Defeats the purpose of evaluation entirely. Inflated scores mislead everyone. |
| Cherry-picking results | Running 10 times and submitting only the best run, without disclosing the others, overstates typical performance. |
| Undisclosed post-processing | Manually fixing outputs before scoring means the scores no longer reflect what the method produced. |
| Contaminated coaching data | Using eval set examples as few-shot prompts or dictionary entries exposes the method to the answers. |
| Claiming commercial readiness without provenance | If your method uses CC BY-NC-SA data, it's not commercially ready. |
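Because LLM outputs vary between runs, one way to avoid cherry-picking is to disclose a summary of every run rather than a single best score. The helper below is a hypothetical sketch using only the Python standard library; the function name and the returned keys are assumptions, not part of the harness or the run card format.

```python
import statistics


def summarize_runs(scores):
    """Summarize every run of a method so the best run is never reported alone."""
    if not scores:
        raise ValueError("at least one run is required")
    return {
        "runs": len(scores),
        "mean": statistics.mean(scores),
        # Sample standard deviation needs at least two runs.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }
```

Reporting the run count alongside the mean and spread lets readers judge how much of a score difference between two methods could be run-to-run noise.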

Verification tiers

Verification tiers describe who validated the result — separate from the quality tiers (Baseline → Fluent) defined in the Scoring Specification, §5, which describe what the automated composite score means.

| Tier | Meaning | How to Get It |
| --- | --- | --- |
| Self-benchmarked | You ran the harness yourself and submitted results | Open a PR with your run card |
| GDS Verified | The rosetta maintainers reproduced your results | Submit your method as an installable plugin |
| Community Validated | A governance organization ran your method against gold-standard data and carried out community review | Submit your method code to the governance organization |

How to Submit

  1. Build your method — see Building a Method for the method interface
  2. Run the harness — see Eval Harness for setup and usage
  3. Generate a run card — the harness produces a JSON run card with your scores, fingerprint, and metadata
  4. Open a PR — submit your run card to the eval harness repository
  5. Appear on the leaderboard — once merged, your results appear on the Method Leaderboard
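For orientation, the sketch below assembles a run card as a JSON-serializable dictionary holding scores, a fingerprint, and metadata. Every field name here is hypothetical; the real schema is whatever the harness emits in step 3, so treat this only as a picture of the kind of information a run card carries. The score values in the usage note are placeholders, not results.

```python
import json
from datetime import datetime, timezone


def build_run_card(method, model_slug, dataset_id, fingerprint, scores, cost_usd):
    """Assemble a run card dict (hypothetical field names, not the harness schema)."""
    return {
        "method": method,
        "model_slug": model_slug,          # exact model identifier used for the run
        "dataset_id": dataset_id,          # e.g. "edtekla-dev-v1"
        "fingerprint": fingerprint,        # ties the result to commit + config
        "scores": scores,                  # metric name -> value
        "cost_usd": cost_usd,              # API cost for this submission
        "verification_tier": "self-benchmarked",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def to_json(run_card) -> str:
    """Serialize with stable key order so diffs in a PR stay readable."""
    return json.dumps(run_card, indent=2, sort_keys=True, ensure_ascii=False)
```

A call such as `to_json(build_run_card("my-method", "example-model", "edtekla-dev-v1", "0123456789abcdef", {"chrf_pp": 0.0, "exact_match": 0.0}, 0.0))` produces text that could be committed to a PR; the method name, model slug, and fingerprint shown are placeholders.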

Future Directions

  • FLORES+ model comparison runs — systematic evaluation of frontier models (GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, etc.) across all 39 rosetta languages
  • More language pairs — Quechua, Inuktitut, and other low-resource languages as community-verified datasets become available
  • Dataset import — tooling to convert external evaluation datasets (WMT, Tatoeba, etc.) into the rosetta evaluation format
  • Automated re-runs — detecting model version changes and re-running benchmarks to track score drift

See Also