Submit a Method

Executive Summary. A step-by-step quickstart for submitting your first benchmark run to the MT Eval Arena leaderboard: clone the harness, run it against a dataset, review your run card, and submit. Expect it to take about 10 minutes if you already have an API key.


Prerequisites

  • Python 3.10+
  • An OpenRouter API key (or equivalent for your model provider)
  • A translation method — anything that produces translations from a source text

# Clone the eval harness
git clone https://github.com/gamedaysuits/gds-mt-eval-harness.git
cd gds-mt-eval-harness

# Install the harness dependencies
pip install sacrebleu aiohttp

Step 1: Run the Harness

The harness scores your method against a standardized dataset:

python eval/baseline_experiment.py \
    --dataset data/edtekla-dev-v1.json \
    --model google/gemini-2.5-pro \
    --condition your-method-name \
    --temperature 0.2

Flag             What It Does
--dataset        Path to the evaluation dataset JSON
--model          OpenRouter model slug
--condition      Label for your method (appears on leaderboard)
--temperature    Sampling temperature (lower = more deterministic)
--fst-analyzer   Optional: path to FST binary for morphological validation
--submit         Auto-submit the run card to the leaderboard
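
If you want to launch the harness from a script, for example to run several conditions in a row, the command can be assembled in Python. This is a minimal sketch that uses only the flags listed above; the helper name harness_command is hypothetical and is not part of the harness.

```python
import sys


def harness_command(dataset, model, condition, temperature, submit=False):
    """Build the argv list for one harness run from the documented flags.

    harness_command is an illustrative helper, not something the harness ships.
    """
    cmd = [
        sys.executable, "eval/baseline_experiment.py",
        "--dataset", dataset,
        "--model", model,
        "--condition", condition,
        "--temperature", str(temperature),
    ]
    if submit:
        # --submit uploads the run card automatically once the run finishes
        cmd.append("--submit")
    return cmd


# To execute a run from the repository root:
#   import subprocess
#   subprocess.run(harness_command("data/edtekla-dev-v1.json",
#                                  "google/gemini-2.5-pro",
#                                  "your-method-name", 0.2), check=True)
```

Passing the arguments as a list avoids shell quoting problems if a condition label contains spaces.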

The harness produces a run card — a self-contained JSON file with your scores, the dataset hash, the model slug, and a cryptographic fingerprint tying results to the exact experiment configuration.
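
To illustrate what a configuration fingerprint does, here is a minimal sketch that hashes an experiment configuration deterministically. The choice of SHA-256, the canonical JSON encoding, and the config keys are all assumptions made for illustration; the harness's actual fingerprint scheme is defined in its own source and may differ.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Return a SHA-256 hex digest of a config dict.

    Illustrative only: this is not the harness's fingerprint algorithm.
    Sorting keys and fixing separators makes the encoding canonical, so
    the same settings always produce the same digest.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same settings in a different key order give the same fingerprint ...
a = config_fingerprint({"model": "google/gemini-2.5-pro", "temperature": 0.2})
b = config_fingerprint({"temperature": 0.2, "model": "google/gemini-2.5-pro"})

# ... while changing any setting gives a different one.
c = config_fingerprint({"model": "google/gemini-2.5-pro", "temperature": 0.7})
```

The point of such a hash is that two run cards with matching fingerprints can be treated as coming from the same experiment setup, and any change to the setup is visible as a changed fingerprint.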


Step 2: Review Your Run Card

Run cards are saved to results/. Inspect yours before submitting:

python -m json.tool results/your-run-card.json

Key fields to check:

  • scores.chrf_plus_plus — your primary quality metric
  • scores.exact_match_rate — proportion of perfect translations
  • scores.fst_acceptance_rate — morphological validity (if FST was used)
  • totals.total_cost_usd — what the run cost
  • fingerprint — the experiment's reproducibility hash

See the Run Card Specification for the full schema.
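
If you prefer to see only the fields listed above rather than the whole file, a short script can pull them out. This sketch assumes the dotted names map to nested JSON objects (for example, that scores.chrf_plus_plus lives under a top-level "scores" object); check the Run Card Specification for the actual layout. The helper names lookup and summarize are hypothetical.

```python
import json

# Field names as listed in this guide; the nesting is an assumption.
KEY_FIELDS = [
    "scores.chrf_plus_plus",
    "scores.exact_match_rate",
    "scores.fst_acceptance_rate",
    "totals.total_cost_usd",
    "fingerprint",
]


def lookup(card, dotted):
    """Follow a dotted path through nested dicts; return None if any part is missing."""
    node = card
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def summarize(path):
    """Load a run card and return only the key fields as a flat dict."""
    with open(path, encoding="utf-8") as fh:
        card = json.load(fh)
    return {field: lookup(card, field) for field in KEY_FIELDS}


# Usage:
#   for name, value in summarize("results/your-run-card.json").items():
#       print(f"{name}: {value}")
```

A field that comes back as None is absent from the card, which is expected for scores.fst_acceptance_rate when no FST analyzer was used.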


Step 3: Submit

Automatic submission

If you passed --submit when running the harness, your run card was already uploaded.

Manual submission

Submit any run card via the API:

curl -X POST https://mtevalarena.org/api/leaderboard/submit \
    -H "Content-Type: application/json" \
    -d @results/your-run-card.json

Or upload through the Leaderboard UI.
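
The same request can be sent from Python using only the standard library. This is a sketch that mirrors the curl command above; the helper name build_submit_request is hypothetical, and the shape of the API's response is not documented here, so the usage comment simply prints whatever comes back.

```python
import urllib.request

SUBMIT_URL = "https://mtevalarena.org/api/leaderboard/submit"


def build_submit_request(card_path):
    """Build, but do not send, a POST request carrying the run card as its body."""
    with open(card_path, "rb") as fh:
        body = fh.read()
    return urllib.request.Request(
        SUBMIT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send the request and print the raw response:
#   req = build_submit_request("results/your-run-card.json")
#   with urllib.request.urlopen(req) as resp:
#       print(resp.status, resp.read().decode("utf-8"))
```

Separating request construction from sending lets you inspect the URL, headers, and body before anything is uploaded. Note that urlopen raises urllib.error.HTTPError on a 4xx or 5xx response, so wrap the call in a try block if you want to read an error body.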


What Happens Next

  1. Your submission is validated (dataset hash, run card integrity)
  2. Results appear on the leaderboard as Self-benchmarked (trust tier 1)
  3. To get GDS Verified status, submit your method as an installable plugin so maintainers can reproduce your results
  4. For Indigenous language methods: if your method reaches the top, the ownership transfer process begins

See Also