Eval Harness v2.0
Executive Summary. This page covers installation, configuration, and usage of the MT evaluation harness — the tool that benchmarks translation methods against standardized corpora and produces scored run cards. For canonical definitions of metrics, schemas, and evaluation protocol, see the Benchmark Specification.
The harness runs translation experiments and produces run cards. It handles prompt construction, API calls, scoring, and result serialization — you supply the dataset and the model.
Installation
Requirements: Python 3.10+, plus the sacrebleu and aiohttp packages:
pip install sacrebleu aiohttp
Clone the harness repository:
git clone https://github.com/gamedaysuits/gds-mt-eval-harness.git
cd gds-mt-eval-harness
Usage
python eval/baseline_experiment.py --dataset path/to/dataset.json
This runs every entry in the dataset through the configured model, scores the outputs, and writes a run card JSON file to the results/ directory.
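Once a run has finished, the newest run card can be picked up programmatically. The following is a minimal sketch, assuming run cards are the only `*.json` files in `results/` and that file modification time is a good enough proxy for "most recent"; the helper name `latest_run_card` is hypothetical and not part of the harness.

```python
import json
from pathlib import Path
from typing import Optional


def latest_run_card(results_dir: str = "results") -> Optional[dict]:
    """Return the most recently modified run card in results_dir, or None.

    Hypothetical helper: assumes every *.json file in the directory is a run card.
    """
    directory = Path(results_dir)
    if not directory.is_dir():
        return None
    # Sort by modification time so the last element is the newest file.
    cards = sorted(directory.glob("*.json"), key=lambda p: p.stat().st_mtime)
    if not cards:
        return None
    with cards[-1].open(encoding="utf-8") as fh:
        return json.load(fh)
```

Calling `latest_run_card()` from the repository root returns the parsed card as a dictionary, so fields such as `card["scores"]` can be inspected directly.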
CLI Flags
| Flag | Required | Default | Description |
|---|---|---|---|
| --dataset | ✅ | — | Path to the evaluation dataset JSON file |
| --model | — | openai/gpt-4o | OpenRouter model slug (e.g., google/gemini-2.5-pro) |
| --condition | — | baseline | Experiment label. Use to distinguish prompt strategies (e.g., coached, few-shot, dictionary-augmented) |
| --temperature | — | 0.3 | Sampling temperature. Lower = more deterministic |
| --batch-size | — | 5 | Number of entries per concurrent API batch |
| --fst-analyzer | — | null | Path to an FST analyzer binary. When provided, each output is tested for morphological acceptance |
| --submit | — | false | Submit the run card to the leaderboard API after the run completes |
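To make the effect of --batch-size concrete, here is a sketch of one common way to process entries in concurrent batches with `asyncio`. It is not taken from the harness source: `run_in_batches` and `fake_translate` are hypothetical names, and the stub coroutine stands in for a real API call.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_in_batches(
    entries: List[str],
    translate: Callable[[str], Awaitable[str]],
    batch_size: int = 5,
) -> List[str]:
    """Process entries batch by batch; entries inside a batch run concurrently."""
    outputs: List[str] = []
    for start in range(0, len(entries), batch_size):
        batch = entries[start:start + batch_size]
        # asyncio.gather returns results in the same order as its inputs.
        outputs.extend(await asyncio.gather(*(translate(e) for e in batch)))
    return outputs


async def fake_translate(text: str) -> str:
    """Stand-in for a model call (assumption: the real call is an async HTTP request)."""
    await asyncio.sleep(0)
    return text.upper()


demo_outputs = asyncio.run(
    run_in_batches(["a", "b", "c", "d", "e", "f", "g"], fake_translate, batch_size=5)
)
```

With seven entries and a batch size of 5, the first five calls run together and the remaining two run in a second batch. A larger batch size shortens wall-clock time but makes provider rate limits more likely to be hit.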
Examples
# Run with defaults (GPT-4o, baseline condition)
python eval/baseline_experiment.py --dataset data/edtekla-dev-v1.json
# Coached experiment with Gemini, lower temperature
python eval/baseline_experiment.py \
--dataset data/edtekla-dev-v1.json \
--model google/gemini-2.5-pro \
--condition coached-v3 \
--temperature 0.1
# Run with FST validation and auto-submit
python eval/baseline_experiment.py \
--dataset data/edtekla-dev-v1.json \
--fst-analyzer ./bin/crk-analyzer \
--submit
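When several models and conditions need to be compared, the same command can be generated in a loop. The sketch below only builds argument lists from the flags documented above; `build_command` is a hypothetical helper, and it assumes --submit is a bare boolean switch, as in the examples.

```python
import itertools
import sys
from typing import List


def build_command(
    dataset: str,
    model: str = "openai/gpt-4o",
    condition: str = "baseline",
    temperature: float = 0.3,
    submit: bool = False,
) -> List[str]:
    """Build the argv list for one harness run (hypothetical helper)."""
    cmd = [
        sys.executable, "eval/baseline_experiment.py",
        "--dataset", dataset,
        "--model", model,
        "--condition", condition,
        "--temperature", str(temperature),
    ]
    if submit:
        cmd.append("--submit")
    return cmd


models = ["openai/gpt-4o", "google/gemini-2.5-pro"]
conditions = ["baseline", "coached-v3"]
commands = [
    build_command("data/edtekla-dev-v1.json", model=m, condition=c)
    for m, c in itertools.product(models, conditions)
]
# To execute the sweep, pass each list to subprocess.run(cmd, check=True).
```

Building the commands as lists rather than shell strings avoids quoting problems in paths and model slugs.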
Run Card Schema
Every experiment produces a run card — a self-contained JSON document. The top-level structure:
{
"run_id": "uuid-v4",
"harness_version": "2.0",
"model_slug": "openai/gpt-4o",
"model_id": "gpt-4o-2024-08-06",
"condition": "baseline",
"timestamp": "2025-05-20T03:22:41Z",
"elapsed_seconds": 142.7,
"dataset": { ... },
"config": { ... },
"system_prompt_sha256": "abc123...",
"system_prompt_used": "You are a translator...",
"fingerprint": { ... },
"scores": { ... },
"totals": { ... },
"environment": { ... },
"results": [ ... ],
"run_card_hash": "sha256-of-entire-card"
}
See the Run Card Specification for the full schema with every field documented.
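A quick sanity check on a loaded run card can catch truncated or hand-edited files early. The sketch below checks for the top-level keys listed above and compares the stored prompt hash with the stored prompt text. It assumes `system_prompt_sha256` is the hex SHA-256 of the UTF-8 encoded `system_prompt_used`; that encoding is an assumption, and the specification remains authoritative. Both function names are hypothetical.

```python
import hashlib

# Top-level keys taken from the example run card on this page.
EXPECTED_TOP_LEVEL_KEYS = frozenset({
    "run_id", "harness_version", "model_slug", "model_id", "condition",
    "timestamp", "elapsed_seconds", "dataset", "config",
    "system_prompt_sha256", "system_prompt_used", "fingerprint",
    "scores", "totals", "environment", "results", "run_card_hash",
})


def missing_top_level_keys(card: dict) -> list[str]:
    """Return the expected top-level keys that are absent from the card."""
    return sorted(EXPECTED_TOP_LEVEL_KEYS - set(card))


def prompt_hash_matches(card: dict) -> bool:
    """Assumption: the stored hash is the hex SHA-256 of the UTF-8 prompt text."""
    digest = hashlib.sha256(card["system_prompt_used"].encode("utf-8")).hexdigest()
    return digest == card["system_prompt_sha256"]
```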
:::info Authoritative Schema
The Benchmark Specification is the single source of truth for the run card schema. For metric definitions, composite weights, and quality tiers, see the Scoring Specification. This page documents how to use the harness; the specs define what the outputs mean.
:::
Key Blocks
dataset — Identifies which dataset was used, including its content hash so results are tied to a specific version:
{
"id": "edtekla-dev-v1",
"version": "1.0",
"language_pair": "EN→CRK",
"sha256": "...",
"entry_count": 124
}
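To confirm that a local dataset file is the version a run card refers to, its hash can be recomputed and compared with `dataset.sha256`. This sketch assumes the hash covers the raw bytes of the dataset file; if the harness hashes a normalized form instead, the values will differ. `file_sha256` is a hypothetical helper.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Hex SHA-256 of a file's raw bytes, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A typical check would be `file_sha256("data/edtekla-dev-v1.json") == card["dataset"]["sha256"]`.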
scores — Aggregate metrics for the run:
{
"total": 124,
"exact_matches": 12,
"exact_match_rate": 0.0968,
"fst_accepted": 87,
"fst_acceptance_rate": 0.7016,
"chrf_plus_plus": 42.31,
"errors": 0,
"avg_latency_seconds": 1.15,
"median_latency_seconds": 1.02,
"p95_latency_seconds": 2.34,
"by_difficulty": { ... },
"by_provenance": { ... }
}
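The two rate fields are derivable from the counts in the same block, which gives a cheap consistency check. The sketch below assumes each rate is the count divided by `total`, rounded to four decimal places, which matches the example values (12 / 124 ≈ 0.0968 and 87 / 124 ≈ 0.7016). `rate_problems` is a hypothetical helper.

```python
import math


def rate_problems(scores: dict, tol: float = 5e-5) -> list[str]:
    """Return a description of every rate that disagrees with its count / total."""
    problems = []
    total = scores["total"]
    pairs = (
        ("exact_matches", "exact_match_rate"),
        ("fst_accepted", "fst_acceptance_rate"),
    )
    for count_key, rate_key in pairs:
        # Skip pairs that are absent, e.g. if no FST analyzer was configured.
        if count_key not in scores or rate_key not in scores:
            continue
        expected = scores[count_key] / total if total else 0.0
        # tol of 5e-5 allows for rounding to four decimal places.
        if not math.isclose(scores[rate_key], expected, abs_tol=tol):
            problems.append(
                f"{rate_key}: stored {scores[rate_key]}, expected about {expected:.4f}"
            )
    return problems


example_scores = {
    "total": 124,
    "exact_matches": 12,
    "exact_match_rate": 0.0968,
    "fst_accepted": 87,
    "fst_acceptance_rate": 0.7016,
}
```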
totals — Token usage and cost tracking:
{
"prompt_tokens": 48200,
"completion_tokens": 3100,
"reasoning_tokens": 0,
"cached_tokens": 12000,
"total_cost_usd": 0.42,
"cost_per_entry_usd": 0.0034,
"reasoning_ratio": 0.0
}
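The two derived fields in this block can be recomputed from the raw counts. The sketch below assumes `cost_per_entry_usd` is `total_cost_usd` divided by the dataset's entry count, and that `reasoning_ratio` is `reasoning_tokens` divided by `completion_tokens`. The second definition is a guess from the field names; the Scoring Specification is the place to confirm it. `derive_totals` is a hypothetical helper.

```python
def derive_totals(totals: dict, entry_count: int) -> dict:
    """Recompute the derived cost and ratio fields (definitions are assumptions)."""
    completion = totals["completion_tokens"]
    return {
        # Guard against division by zero for empty datasets.
        "cost_per_entry_usd": (
            round(totals["total_cost_usd"] / entry_count, 4) if entry_count else 0.0
        ),
        # Assumed definition: share of completion tokens spent on reasoning.
        "reasoning_ratio": (
            round(totals["reasoning_tokens"] / completion, 4) if completion else 0.0
        ),
    }


example_totals = {
    "prompt_tokens": 48200,
    "completion_tokens": 3100,
    "reasoning_tokens": 0,
    "cached_tokens": 12000,
    "total_cost_usd": 0.42,
}
derived = derive_totals(example_totals, entry_count=124)
```

For the example values, 0.42 / 124 rounds to 0.0034, which agrees with the `cost_per_entry_usd` shown above.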
Fingerprint vs Run Card Hash
The harness produces two distinct hashes. They serve different purposes:
Fingerprint
The fingerprint answers: "Could this run be reproduced?"
It hashes the combination of inputs that define the experiment configuration — not the outputs:
- Dataset SHA-256
- Model slug
- Condition label
- System prompt SHA-256
- Temperature
- Harness version
Two runs with identical fingerprints used the same setup, so their results are directly comparable. Scores can still differ slightly between such runs because model APIs are not fully deterministic.
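The idea of hashing the six inputs listed above can be sketched as follows. This is an illustration only: the key names, the canonical JSON serialization, and the function name `compute_fingerprint` are all assumptions, and the `fingerprint` block in a real run card is an object whose layout is defined by the specification.

```python
import hashlib
import json


def compute_fingerprint(
    dataset_sha256: str,
    model_slug: str,
    condition: str,
    system_prompt_sha256: str,
    temperature: float,
    harness_version: str,
) -> str:
    """Hash the experiment inputs (not the outputs) into one hex digest."""
    payload = {
        "dataset_sha256": dataset_sha256,
        "model_slug": model_slug,
        "condition": condition,
        "system_prompt_sha256": system_prompt_sha256,
        "temperature": temperature,
        "harness_version": harness_version,
    }
    # Sorted keys and fixed separators make the serialization deterministic.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


fp_a = compute_fingerprint("d" * 64, "openai/gpt-4o", "baseline", "p" * 64, 0.3, "2.0")
fp_b = compute_fingerprint("d" * 64, "openai/gpt-4o", "baseline", "p" * 64, 0.3, "2.0")
fp_c = compute_fingerprint("d" * 64, "openai/gpt-4o", "baseline", "p" * 64, 0.1, "2.0")
```

Here `fp_a` and `fp_b` are equal because every input matches, while `fp_c` differs because only the temperature changed.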
Run Card Hash
The run card hash answers: "Has this specific result file been tampered with?"
It is the SHA-256 of the entire run card JSON, excluding the run_card_hash field itself. If any field changes (a score, a timestamp, a single output), the recomputed hash no longer matches the stored one.
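An integrity check along these lines can be sketched in a few lines of Python. The exact byte serialization the harness hashes is not stated on this page, so the canonical form used here (sorted keys, compact separators, UTF-8) is an assumption, and a digest produced by this sketch may not match one produced by the harness. Both function names are hypothetical.

```python
import hashlib
import json


def compute_run_card_hash(card: dict) -> str:
    """Hash every field except run_card_hash (serialization is an assumption)."""
    body = {k: v for k, v in card.items() if k != "run_card_hash"}
    canonical = json.dumps(
        body, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_run_card(card: dict) -> bool:
    """True when the stored hash equals the recomputed one."""
    return card.get("run_card_hash") == compute_run_card_hash(card)


# Seal a toy card, then show that editing any field invalidates it.
demo_card = {"run_id": "demo", "scores": {"exact_matches": 12}}
demo_card["run_card_hash"] = compute_run_card_hash(demo_card)
sealed_ok = verify_run_card(demo_card)

demo_card["scores"]["exact_matches"] = 13
tampered_ok = verify_run_card(demo_card)
```

Excluding `run_card_hash` from the hashed body is what makes the scheme workable: the field cannot contain a hash of itself.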
:::info When to use which
Use the fingerprint to group comparable runs (same experiment, different executions). Use the run card hash to verify integrity of a specific result file.
:::
Submitting to the Leaderboard
Automatic submission
Pass --submit to upload the run card on completion:
python eval/baseline_experiment.py \
--dataset data/edtekla-dev-v1.json \
--submit
Manual submission
Run cards are saved as JSON files in results/. You can submit any run card file via the leaderboard UI at /leaderboard, or through the API:
curl -X POST https://i18n-rosetta.com/api/leaderboard/submit \
-H "Content-Type: application/json" \
-d @results/your-run-card.json
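The same request can be issued from Python with the standard library. The sketch below builds the POST request with the endpoint and header from the curl command and leaves sending it to the caller. The response format is not documented on this page, so nothing is assumed about it; `build_submit_request` is a hypothetical helper.

```python
import urllib.request

SUBMIT_URL = "https://i18n-rosetta.com/api/leaderboard/submit"


def build_submit_request(payload: bytes) -> urllib.request.Request:
    """Build a POST request carrying the run card bytes unchanged."""
    return urllib.request.Request(
        SUBMIT_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (not executed here, since it performs a network call):
#   from pathlib import Path
#   payload = Path("results/your-run-card.json").read_bytes()
#   with urllib.request.urlopen(build_submit_request(payload)) as resp:
#       print(resp.status)
```

Sending the file's bytes as they are, rather than parsing and re-serializing the JSON, avoids any risk of altering the content that the run card hash was computed over.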
:::warning Leaderboard validation
The leaderboard validates submitted run cards against the dataset registry. Submissions referencing unknown datasets, or with a broken run_card_hash, are rejected.
:::
:::danger DO NOT TRAIN on evaluation data
If your method has seen the evaluation dataset during development (as training data, few-shot examples, dictionary entries, or prompt engineering material), your submission will be disqualified. See MT Evaluation for what makes a good vs. bad method.
:::
See Also
- MT Evaluation — overview, leaderboard value proposition, and good/bad method guidance
- Evaluation Datasets — dataset format, EDTeKLA, FLORES+
- Run Card Specification — the full JSON schema
- Building a Method — the method interface for creating evaluable methods
- Method Leaderboard — live benchmark scores
- Benchmark Specification — evaluation protocol, corpus format, run card schema
- Scoring Specification — SSOT for metrics, composite weights, and quality tiers