Run Card Specification
Executive Summary. The run card is the atomic unit of benchmarking — a JSON document recording the complete configuration, per-entry results, and aggregate scores of one evaluation run. This page documents the schema, fields, fingerprinting mechanism, and score structure. See the Benchmark Specification for canonical definitions.
The run card is the complete record of a single evaluation run. It contains everything needed to understand, reproduce, and verify the experiment: configuration, scores, individual results, token usage, and environment metadata.
Schema version: 2.0
:::info Authoritative Schema The Benchmark Specification is the single source of truth for the run card schema. For metric definitions, composite weights, and quality tiers, see the Scoring Specification. This page documents the current implementation. :::
Top-Level Fields
| Field | Type | Description |
|---|---|---|
run_id | string | UUID v4 generated at the start of the run |
harness_version | string | Semantic version of the harness that produced this card (e.g., 2.0) |
model_slug | string | OpenRouter model slug used for the run (e.g., openai/gpt-4o) |
model_id | string | Resolved model identifier returned by the API (e.g., gpt-4o-2024-08-06) |
condition | string | Experiment label (e.g., baseline, coached-v3, few-shot) |
timestamp | string | ISO 8601 UTC timestamp when the run started |
elapsed_seconds | number | Wall-clock duration of the entire run |
{
"run_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"harness_version": "2.0",
"model_slug": "openai/gpt-4o",
"model_id": "gpt-4o-2024-08-06",
"condition": "baseline",
"timestamp": "2025-05-20T03:22:41Z",
"elapsed_seconds": 142.7
}
dataset
Identifies the evaluation dataset and pins it to a specific content version via SHA-256.
| Field | Type | Description |
|---|---|---|
id | string | Dataset identifier (e.g., edtekla-dev-v1) |
version | string | Dataset version string |
language_pair | string | Display label (e.g., EN→CRK) |
sha256 | string | SHA-256 hash of the dataset file contents. Guarantees the exact data used |
entry_count | number | Number of entries in the dataset |
{
"dataset": {
"id": "edtekla-dev-v1",
"version": "1.0",
"language_pair": "EN→CRK",
"sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"entry_count": 124
}
}
config
The API and batching configuration used for this run.
| Field | Type | Description |
|---|---|---|
api_provider | string | API provider name (e.g., openrouter) |
temperature | number | Sampling temperature |
max_tokens | number | Maximum tokens per completion |
batch_size | number | Entries per concurrent batch |
concurrency | number | Maximum parallel API requests |
{
"config": {
"api_provider": "openrouter",
"temperature": 0.3,
"max_tokens": 1024,
"batch_size": 5,
"concurrency": 3
}
}
system_prompt_sha256 / system_prompt_used
| Field | Type | Description |
|---|---|---|
system_prompt_sha256 | string | SHA-256 hash of the system prompt. Included in the fingerprint |
system_prompt_used | string | The full system prompt text sent to the model |
The prompt hash is part of the fingerprint — two runs with different prompts will have different fingerprints even if all other settings match.
fingerprint
A reproducibility identifier. Two runs with identical fingerprints used the same experimental setup.
| Field | Type | Description |
|---|---|---|
hash | string | SHA-256 hash of the sorted components |
components | object | The input values that were hashed |
Fingerprint Components
| Component | Description |
|---|---|
dataset_sha256 | Hash of the dataset file |
model_slug | Model used |
condition | Experiment condition label |
system_prompt_sha256 | Hash of the system prompt |
temperature | Sampling temperature |
harness_version | Harness version |
{
"fingerprint": {
"hash": "7f83b1657ff1fc53b92dc18148a1d65dfc2d4b1fa3d677284addd200126d9069",
"components": {
"dataset_sha256": "e3b0c44298fc1c14...",
"model_slug": "openai/gpt-4o",
"condition": "baseline",
"system_prompt_sha256": "abc123...",
"temperature": 0.3,
"harness_version": "2.0"
}
}
}
:::info Fingerprint ≠ Run Card Hash
The fingerprint identifies the experiment configuration. The run_card_hash verifies the result file integrity. See Fingerprint vs Run Card Hash for details.
:::
scores
Aggregate metrics for the entire run.
Top-Level Scores
| Field | Type | Description |
|---|---|---|
total | number | Total entries evaluated |
exact_matches | number | Entries where output exactly matched the gold standard |
exact_match_rate | number | exact_matches / total (0.0–1.0) |
fst_accepted | number | Entries where the FST analyzer accepted the output |
fst_acceptance_rate | number | fst_accepted / total (0.0–1.0). null if no FST analyzer was used |
chrf_plus_plus | number | Corpus-level chrF++ score (0–100) |
errors | number | Entries that failed (API error, timeout, etc.) |
avg_latency_seconds | number | Mean response time across all entries |
median_latency_seconds | number | Median response time |
p95_latency_seconds | number | 95th percentile response time |
by_difficulty
Scores broken down by difficulty tier. Each key (integer 1–5) contains the same metric fields as the top-level scores.
{
"by_difficulty": {
"1": {
"total": 20,
"exact_matches": 8,
"exact_match_rate": 0.40,
"chrf_plus_plus": 68.2,
"fst_accepted": 18,
"fst_acceptance_rate": 0.90
},
"2": { ... },
"3": { ... },
"4": { ... },
"5": { ... }
}
}
by_provenance
Scores broken down by entry provenance. Each key (e.g., gold_standard, textbook) contains the same metric fields.
{
"by_provenance": {
"gold_standard": {
"total": 80,
"exact_matches": 10,
"exact_match_rate": 0.125,
"chrf_plus_plus": 44.8
},
"textbook": { ... }
}
}
totals
Token usage and cost tracking for the entire run.
| Field | Type | Description |
|---|---|---|
prompt_tokens | number | Total input tokens across all API calls |
completion_tokens | number | Total output tokens |
reasoning_tokens | number | Tokens used for chain-of-thought reasoning (model-dependent, 0 for most models) |
cached_tokens | number | Tokens served from the provider's prompt cache |
total_cost_usd | number | Total cost in USD (as reported by the API) |
cost_per_entry_usd | number | total_cost_usd / entry_count |
reasoning_ratio | number | reasoning_tokens / completion_tokens (0.0–1.0) |
{
"totals": {
"prompt_tokens": 48200,
"completion_tokens": 3100,
"reasoning_tokens": 0,
"cached_tokens": 12000,
"total_cost_usd": 0.42,
"cost_per_entry_usd": 0.0034,
"reasoning_ratio": 0.0
}
}
environment
Runtime environment metadata for reproducibility.
| Field | Type | Description |
|---|---|---|
harness_version | string | Harness version (mirrors top-level harness_version) |
harness_git_commit | string | Git commit SHA of the harness at run time |
python_version | string | Python interpreter version |
sacrebleu_version | string | sacrebleu library version (used for chrF++ scoring) |
os | string | Operating system identifier |
{
"environment": {
"harness_version": "2.0",
"harness_git_commit": "a1b2c3d",
"python_version": "3.11.9",
"sacrebleu_version": "2.4.0",
"os": "macOS-14.5-arm64"
}
}
results[]
The per-entry results array. One object per dataset entry, in index order.
| Field | Type | Description |
|---|---|---|
entry_id | integer | ID of this entry in the corpus (matches entries[].id) |
source | string | The source text that was translated |
reference | string | The gold-standard reference from the corpus |
predicted | string | The method's actual output |
exact_match | boolean | Whether predicted exactly matches reference after normalization |
entry_chrf | number | Sentence-level chrF++ score for this entry (0–100) |
fst_accepted | boolean | null | Whether the FST analyzer accepted the output. null if no analyzer was configured |
fst_analysis | string[] | FST analysis strings for the output (empty array if not analyzed or rejected) |
difficulty | integer | Difficulty tier from the corpus (1–5) |
provenance | string | Provenance tag from the corpus |
latency_seconds | number | Response time for this individual entry |
usage | object | Per-entry token usage: { prompt_tokens, completion_tokens, reasoning_tokens } |
error | string | null | Error message if this entry failed. null on success |
{
"results": [
{
"entry_id": 1,
"source": "Hello",
"reference": "tânisi",
"predicted": "tânisi",
"exact_match": true,
"entry_chrf": 100.0,
"fst_accepted": true,
"fst_analysis": ["tânisi+V+AI+Ind+2Sg"],
"difficulty": 1,
"provenance": "gold_standard",
"latency_seconds": 0.82,
"usage": {
"prompt_tokens": 385,
"completion_tokens": 12,
"reasoning_tokens": 0
},
"error": null
}
]
}
run_card_hash
| Field | Type | Description |
|---|---|---|
run_card_hash | string | SHA-256 hash of the entire run card JSON, with the run_card_hash field itself set to "" during hashing |
This is the tamper-detection seal. The leaderboard re-computes this hash on submission and rejects cards where it doesn't match.
Computing the hash:
- Serialize the run card to JSON with
run_card_hashset to"" - Compute SHA-256 of the serialized string
- Set
run_card_hashto the resulting hex digest
import hashlib, json
card["run_card_hash"] = ""
card_json = json.dumps(card, sort_keys=True, ensure_ascii=False)
card["run_card_hash"] = hashlib.sha256(card_json.encode()).hexdigest()
See Also
- MT Evaluation — overview, leaderboard value, and good/bad method guidance
- Eval Harness — how to run evaluations and generate run cards
- Evaluation Datasets — dataset format, EDTeKLA, FLORES+
- Building a Method — the method interface and method card spec
- Method Leaderboard — live benchmark scores
- Benchmark Specification — evaluation protocol, corpus format, run card schema
- Scoring Specification — SSOT for metrics, composite weights, and quality tiers