Eval Harness v2.0

Executive Summary. This page covers installation, configuration, and usage of the MT evaluation harness — the tool that benchmarks translation methods against standardized corpora and produces scored run cards. For canonical definitions of metrics, schemas, and evaluation protocol, see the Benchmark Specification.

The harness handles prompt construction, API calls, scoring, and result serialization for each translation experiment. You supply the dataset and the model.

Installation

Requirements: Python 3.10+ and the sacrebleu and aiohttp packages, installed with:

pip install sacrebleu aiohttp

Clone the harness repository:

git clone https://github.com/gamedaysuits/gds-mt-eval-harness.git
cd gds-mt-eval-harness

Usage

python eval/baseline_experiment.py --dataset path/to/dataset.json

This runs every entry in the dataset through the configured model, scores the outputs, and writes a run card JSON file to the results/ directory.
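Once a run finishes, the run card can be inspected from Python. The sketch below is only an illustration: it assumes run cards are the only `.json` files in `results/` and picks the most recently modified one, since this page does not state the file naming scheme. The helper name `latest_run_card` is hypothetical and not part of the harness.

```python
import json
from pathlib import Path


def latest_run_card(results_dir: str = "results") -> dict:
    """Load the most recently modified JSON file in results_dir.

    Assumption: every *.json file in the directory is a run card.
    """
    candidates = sorted(
        Path(results_dir).glob("*.json"),
        key=lambda p: p.stat().st_mtime,
    )
    if not candidates:
        raise FileNotFoundError(f"no run cards found in {results_dir}/")
    with candidates[-1].open(encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    card = latest_run_card()
    # Field names below are taken from the Run Card Schema section.
    print(card["run_id"], card["model_slug"], card["condition"])
```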

CLI Flags

| Flag | Required | Default | Description |
|---|---|---|---|
| `--dataset` | Yes | (none) | Path to the evaluation dataset JSON file |
| `--model` | No | `openai/gpt-4o` | OpenRouter model slug (e.g., `google/gemini-2.5-pro`) |
| `--condition` | No | `baseline` | Experiment label. Use to distinguish prompt strategies (e.g., `coached`, `few-shot`, `dictionary-augmented`) |
| `--temperature` | No | `0.3` | Sampling temperature. Lower = more deterministic |
| `--batch-size` | No | `5` | Number of entries per concurrent API batch |
| `--fst-analyzer` | No | `null` | Path to an FST analyzer binary. When provided, each output is tested for morphological acceptance |
| `--submit` | No | `false` | Submit the run card to the leaderboard API after the run completes |

Examples

# Run with defaults (GPT-4o, baseline condition)
python eval/baseline_experiment.py --dataset data/edtekla-dev-v1.json

# Coached experiment with Gemini, lower temperature
python eval/baseline_experiment.py \
  --dataset data/edtekla-dev-v1.json \
  --model google/gemini-2.5-pro \
  --condition coached-v3 \
  --temperature 0.1

# Run with FST validation and auto-submit
python eval/baseline_experiment.py \
  --dataset data/edtekla-dev-v1.json \
  --fst-analyzer ./bin/crk-analyzer \
  --submit
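When comparing several models or temperatures, the same invocations can be generated programmatically. The following is a minimal sketch that only assembles command lines from the flags documented above; the function name `sweep_commands` is hypothetical, and it assumes the script is run from the repository root.

```python
import itertools
import subprocess
import sys


def sweep_commands(dataset, models, temperatures, condition="baseline"):
    """Build one harness command per (model, temperature) pair."""
    commands = []
    for model, temperature in itertools.product(models, temperatures):
        commands.append([
            sys.executable, "eval/baseline_experiment.py",
            "--dataset", dataset,
            "--model", model,
            "--condition", condition,
            "--temperature", str(temperature),
        ])
    return commands


if __name__ == "__main__":
    for cmd in sweep_commands(
        "data/edtekla-dev-v1.json",
        models=["openai/gpt-4o", "google/gemini-2.5-pro"],
        temperatures=[0.1, 0.3],
    ):
        # Runs sequentially; each run writes its own run card to results/.
        subprocess.run(cmd, check=True)
```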

Run Card Schema

Every experiment produces a run card — a self-contained JSON document. The top-level structure:

{
  "run_id": "uuid-v4",
  "harness_version": "2.0",
  "model_slug": "openai/gpt-4o",
  "model_id": "gpt-4o-2024-08-06",
  "condition": "baseline",
  "timestamp": "2025-05-20T03:22:41Z",
  "elapsed_seconds": 142.7,
  "dataset": { ... },
  "config": { ... },
  "system_prompt_sha256": "abc123...",
  "system_prompt_used": "You are a translator...",
  "fingerprint": { ... },
  "scores": { ... },
  "totals": { ... },
  "environment": { ... },
  "results": [ ... ],
  "run_card_hash": "sha256-of-entire-card"
}

See the Run Card Specification for the full schema with every field documented.

:::info Authoritative Schema
The Benchmark Specification is the single source of truth for the run card schema. For metric definitions, composite weights, and quality tiers, see the Scoring Specification. This page documents how to use the harness; the specs define what the outputs mean.
:::

Key Blocks

dataset — Identifies which dataset was used, including its content hash so results are tied to a specific version:

{
  "id": "edtekla-dev-v1",
  "version": "1.0",
  "language_pair": "EN→CRK",
  "sha256": "...",
  "entry_count": 124
}
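To check that a local dataset file matches the `sha256` recorded in a run card, one option is to hash the file yourself. This sketch assumes the recorded value is the SHA-256 of the raw file bytes; the Benchmark Specification is the place to confirm how the harness actually derives it. The helper name `file_sha256` is hypothetical.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Compare this value with the "sha256" field in the run card's dataset block.
    print(file_sha256("data/edtekla-dev-v1.json"))
```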

scores — Aggregate metrics for the run:

{
  "total": 124,
  "exact_matches": 12,
  "exact_match_rate": 0.0968,
  "fst_accepted": 87,
  "fst_acceptance_rate": 0.7016,
  "chrf_plus_plus": 42.31,
  "errors": 0,
  "avg_latency_seconds": 1.15,
  "median_latency_seconds": 1.02,
  "p95_latency_seconds": 2.34,
  "by_difficulty": { ... },
  "by_provenance": { ... }
}
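The rate fields in the example are consistent with dividing each count by `total`. As a quick sanity check, the sketch below recomputes them; rounding to four decimal places is an assumption inferred from the sample values, not something this page specifies.

```python
def rate(count: int, total: int, digits: int = 4) -> float:
    """Return count / total rounded to `digits` places (0.0 for an empty run)."""
    return round(count / total, digits) if total else 0.0


# Counts copied from the example scores block above.
scores = {"total": 124, "exact_matches": 12, "fst_accepted": 87}

exact_match_rate = rate(scores["exact_matches"], scores["total"])
fst_acceptance_rate = rate(scores["fst_accepted"], scores["total"])
print(exact_match_rate, fst_acceptance_rate)  # 0.0968 0.7016
```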

totals — Token usage and cost tracking:

{
  "prompt_tokens": 48200,
  "completion_tokens": 3100,
  "reasoning_tokens": 0,
  "cached_tokens": 12000,
  "total_cost_usd": 0.42,
  "cost_per_entry_usd": 0.0034,
  "reasoning_ratio": 0.0
}
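In the example, `cost_per_entry_usd` matches `total_cost_usd` divided by the 124 entries in the dataset. The sketch below reproduces that arithmetic; the four-place rounding is an assumption based on the sample value, and `reasoning_ratio` is left out because its definition is not given on this page.

```python
def cost_per_entry(total_cost_usd: float, entry_count: int, digits: int = 4) -> float:
    """Average cost of one dataset entry, rounded to `digits` places."""
    if entry_count <= 0:
        raise ValueError("entry_count must be positive")
    return round(total_cost_usd / entry_count, digits)


# Values copied from the example totals and dataset blocks above.
print(cost_per_entry(0.42, 124))  # 0.0034
```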

Fingerprint vs Run Card Hash

The harness produces two distinct hashes. They serve different purposes:

Fingerprint

The fingerprint answers: "Could this run be reproduced?"

It hashes the combination of inputs that define the experiment configuration — not the outputs:

Dataset SHA-256
Model slug
Condition label
System prompt SHA-256
Temperature
Harness version

Two runs with identical fingerprints used the same setup. Their results should be comparable (modulo API non-determinism).
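One way to picture the fingerprint is as a hash over a canonical encoding of the six inputs listed above. The sketch below is illustrative only: the key names, the JSON encoding, and the function name are assumptions, so its output will not necessarily match the value the harness writes.

```python
import hashlib
import json


def fingerprint(dataset_sha256: str, model_slug: str, condition: str,
                system_prompt_sha256: str, temperature: float,
                harness_version: str) -> str:
    """Hash the experiment inputs (never the outputs) into one hex digest."""
    payload = {
        "dataset_sha256": dataset_sha256,
        "model_slug": model_slug,
        "condition": condition,
        "system_prompt_sha256": system_prompt_sha256,
        "temperature": temperature,
        "harness_version": harness_version,
    }
    # Assumed canonical form: sorted keys and compact separators, so the
    # same inputs always serialize to the same bytes.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = fingerprint("d" * 64, "openai/gpt-4o", "baseline", "p" * 64, 0.3, "2.0")
    b = fingerprint("d" * 64, "openai/gpt-4o", "baseline", "p" * 64, 0.1, "2.0")
    print(a == b)  # False: changing the temperature changes the fingerprint
```

Because only inputs are hashed, two executions of the same experiment share a fingerprint even when their scores differ.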

Run Card Hash

The run card hash answers: "Has this specific result file been tampered with?"

It's the SHA-256 of the entire run card JSON (excluding the run_card_hash field itself). If any field changes (a score, a timestamp, a single output), the recomputed hash no longer matches the stored value.
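The integrity check can be sketched as: drop `run_card_hash`, serialize what remains, hash it, and compare. The serialization used here (sorted keys, compact separators, UTF-8) is an assumption; a hash computed this way only matches the harness if the harness canonicalizes the JSON the same way, so treat the function names and encoding as hypothetical.

```python
import hashlib
import json


def compute_run_card_hash(card: dict) -> str:
    """SHA-256 over the run card with the run_card_hash field excluded."""
    body = {k: v for k, v in card.items() if k != "run_card_hash"}
    # Assumed canonical form; the real harness may serialize differently.
    canonical = json.dumps(
        body, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_run_card(card: dict) -> bool:
    """True when the stored hash matches the recomputed one."""
    return card.get("run_card_hash") == compute_run_card_hash(card)


if __name__ == "__main__":
    card = {"run_id": "example", "scores": {"exact_matches": 12}}
    card["run_card_hash"] = compute_run_card_hash(card)
    print(verify_run_card(card))  # True

    card["scores"]["exact_matches"] = 13  # tamper with a single score
    print(verify_run_card(card))  # False
```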

:::info When to use which
Use the fingerprint to group comparable runs (same experiment, different executions). Use the run card hash to verify the integrity of a specific result file.
:::

Submitting to the Leaderboard

Automatic submission

Pass --submit to upload the run card on completion:

python eval/baseline_experiment.py \
  --dataset data/edtekla-dev-v1.json \
  --submit

Manual submission

Run cards are saved as JSON files in results/. You can submit any run card file via the leaderboard UI at /leaderboard, or through the API:

curl -X POST https://i18n-rosetta.com/api/leaderboard/submit \
  -H "Content-Type: application/json" \
  -d @results/your-run-card.json
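The same request can be made from Python with the standard library. This is a rough equivalent of the curl call above, not an official client: the function names are hypothetical, and the response format is not documented on this page, so the sketch only returns the HTTP status code.

```python
import urllib.request
from pathlib import Path

SUBMIT_URL = "https://i18n-rosetta.com/api/leaderboard/submit"


def build_submit_request(card_path: str, url: str = SUBMIT_URL) -> urllib.request.Request:
    """Build a POST request whose body is the run card file, sent as JSON."""
    body = Path(card_path).read_bytes()
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(card_path: str, url: str = SUBMIT_URL, timeout: float = 30.0) -> int:
    """Send the run card and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, which is
    how a rejected submission would most likely surface here.
    """
    request = build_submit_request(card_path, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    print(submit("results/your-run-card.json"))
```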

:::warning Leaderboard validation
The leaderboard validates submitted run cards against the dataset registry. Submissions referencing unknown datasets, or with a broken run_card_hash, are rejected.
:::

:::danger DO NOT TRAIN on evaluation data
If your method has seen the evaluation dataset during development (as training data, few-shot examples, dictionary entries, or prompt engineering material), your submission will be disqualified. See MT Evaluation for what makes a good vs. bad method.
:::
