Run Card Specification

Executive Summary. The run card is the atomic unit of benchmarking — a JSON document recording the complete configuration, per-entry results, and aggregate scores of one evaluation run. This page documents the schema, fields, fingerprinting mechanism, and score structure. See the Benchmark Specification for canonical definitions.

The run card is the complete record of a single evaluation run. It contains everything needed to understand, reproduce, and verify the experiment: configuration, scores, individual results, token usage, and environment metadata.

Schema version: 2.0

:::info Authoritative Schema
The Benchmark Specification is the single source of truth for the run card schema. For metric definitions, composite weights, and quality tiers, see the Scoring Specification. This page documents the current implementation.
:::


Top-Level Fields

| Field | Type | Description |
| --- | --- | --- |
| run_id | string | UUID v4 generated at the start of the run |
| harness_version | string | Semantic version of the harness that produced this card (e.g., 2.0) |
| model_slug | string | OpenRouter model slug used for the run (e.g., openai/gpt-4o) |
| model_id | string | Resolved model identifier returned by the API (e.g., gpt-4o-2024-08-06) |
| condition | string | Experiment label (e.g., baseline, coached-v3, few-shot) |
| timestamp | string | ISO 8601 UTC timestamp when the run started |
| elapsed_seconds | number | Wall-clock duration of the entire run |

{
"run_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"harness_version": "2.0",
"model_slug": "openai/gpt-4o",
"model_id": "gpt-4o-2024-08-06",
"condition": "baseline",
"timestamp": "2025-05-20T03:22:41Z",
"elapsed_seconds": 142.7
}
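As a rough illustration of how the identifying fields could be populated when a run starts, the sketch below builds them with the Python standard library. The helper name `new_run_header` and its parameters are hypothetical and not part of the harness; `model_id` and `elapsed_seconds` are left out because they are only known once the API has responded and the run has finished.

```python
import uuid
from datetime import datetime, timezone


def new_run_header(model_slug: str, condition: str, harness_version: str = "2.0") -> dict:
    """Build the identifying top-level fields at the start of a run (hypothetical helper)."""
    started = datetime.now(timezone.utc)
    return {
        "run_id": str(uuid.uuid4()),  # random UUID v4
        "harness_version": harness_version,
        "model_slug": model_slug,
        "condition": condition,
        # ISO 8601 in UTC with a trailing "Z", matching the example above
        "timestamp": started.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```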

dataset

Identifies the evaluation dataset and pins it to a specific content version via SHA-256.

| Field | Type | Description |
| --- | --- | --- |
| id | string | Dataset identifier (e.g., edtekla-dev-v1) |
| version | string | Dataset version string |
| language_pair | string | Display label (e.g., EN→CRK) |
| sha256 | string | SHA-256 hash of the dataset file contents. Pins the run to the exact data that was evaluated |
| entry_count | number | Number of entries in the dataset |

{
"dataset": {
"id": "edtekla-dev-v1",
"version": "1.0",
"language_pair": "EN→CRK",
"sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"entry_count": 124
}
}
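A minimal sketch of how the `sha256` value could be derived, assuming the hash is taken over the raw bytes of the dataset file exactly as stored on disk. The function name `file_sha256` is illustrative only.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the digest covers raw bytes, any change to the file, including line endings or trailing whitespace, yields a different value.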

config

The API and batching configuration used for this run.

| Field | Type | Description |
| --- | --- | --- |
| api_provider | string | API provider name (e.g., openrouter) |
| temperature | number | Sampling temperature |
| max_tokens | number | Maximum tokens per completion |
| batch_size | number | Entries per concurrent batch |
| concurrency | number | Maximum parallel API requests |

{
"config": {
"api_provider": "openrouter",
"temperature": 0.3,
"max_tokens": 1024,
"batch_size": 5,
"concurrency": 3
}
}

system_prompt_sha256 / system_prompt_used

| Field | Type | Description |
| --- | --- | --- |
| system_prompt_sha256 | string | SHA-256 hash of the system prompt. Included in the fingerprint |
| system_prompt_used | string | The full system prompt text sent to the model |

The prompt hash is part of the fingerprint — two runs with different prompts will have different fingerprints even if all other settings match.
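To make the effect concrete, here is a small sketch that hashes a prompt string. It assumes the prompt is encoded as UTF-8 and hashed without any trimming or normalization; the specification does not state this, and `prompt_sha256` is a hypothetical name.

```python
import hashlib


def prompt_sha256(system_prompt: str) -> str:
    """Hash the prompt text exactly as sent (assumption: UTF-8, no normalization)."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()


# Even a one-character difference in the prompt produces a different hash.
hash_a = prompt_sha256("Translate the English text into Plains Cree.")
hash_b = prompt_sha256("Translate the English text into Plains Cree")
prompts_differ = hash_a != hash_b
```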


fingerprint

A reproducibility identifier. Two runs with identical fingerprints used the same experimental setup.

| Field | Type | Description |
| --- | --- | --- |
| hash | string | SHA-256 hash of the sorted components |
| components | object | The input values that were hashed |

Fingerprint Components

| Component | Description |
| --- | --- |
| dataset_sha256 | Hash of the dataset file |
| model_slug | Model used |
| condition | Experiment condition label |
| system_prompt_sha256 | Hash of the system prompt |
| temperature | Sampling temperature |
| harness_version | Harness version |

{
"fingerprint": {
"hash": "7f83b1657ff1fc53b92dc18148a1d65dfc2d4b1fa3d677284addd200126d9069",
"components": {
"dataset_sha256": "e3b0c44298fc1c14...",
"model_slug": "openai/gpt-4o",
"condition": "baseline",
"system_prompt_sha256": "abc123...",
"temperature": 0.3,
"harness_version": "2.0"
}
}
}
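The exact serialization behind "sorted components" is not spelled out on this page, so the following is only a sketch under an assumption: the components are dumped as JSON with sorted keys and compact separators, then hashed. The function `compute_fingerprint` is hypothetical, and its output is not expected to reproduce the example hash above.

```python
import hashlib
import json


def compute_fingerprint(components: dict) -> dict:
    """Hash the components in a key-order-independent way (assumed canonical form)."""
    # Assumption: sorted keys plus compact separators define the canonical string.
    canonical = json.dumps(
        components, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return {
        "hash": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "components": components,
    }
```

Sorting the keys means two runs that list the same components in a different order still receive the same fingerprint, while changing any single value (for example `temperature`) changes it.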

:::info Fingerprint ≠ Run Card Hash
The fingerprint identifies the experiment configuration. The run_card_hash verifies the result file integrity. See Fingerprint vs Run Card Hash for details.
:::


scores

Aggregate metrics for the entire run.

Top-Level Scores

| Field | Type | Description |
| --- | --- | --- |
| total | number | Total entries evaluated |
| exact_matches | number | Entries where output exactly matched the gold standard |
| exact_match_rate | number | exact_matches / total (0.0–1.0) |
| fst_accepted | number | Entries where the FST analyzer accepted the output |
| fst_acceptance_rate | number \| null | fst_accepted / total (0.0–1.0). null if no FST analyzer was used |
| chrf_plus_plus | number | Corpus-level chrF++ score (0–100) |
| errors | number | Entries that failed (API error, timeout, etc.) |
| avg_latency_seconds | number | Mean response time across all entries |
| median_latency_seconds | number | Median response time |
| p95_latency_seconds | number | 95th percentile response time |
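The count, rate, and latency fields can be derived from the per-entry results. The sketch below is one possible aggregation, not the harness implementation. It assumes every result carries `exact_match`, `fst_accepted`, `latency_seconds`, and `error`, uses the nearest-rank method for the 95th percentile (the page does not say which percentile method is used), and omits `chrf_plus_plus` because that score comes from sacrebleu.

```python
import math
import statistics


def nearest_rank_percentile(sorted_values: list, pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def aggregate_scores(results: list) -> dict:
    """Aggregate per-entry results into top-level score fields (sketch)."""
    total = len(results)
    exact = sum(1 for r in results if r["exact_match"])
    fst_flags = [r["fst_accepted"] for r in results]
    fst_used = any(flag is not None for flag in fst_flags)  # all None: no analyzer
    fst_accepted = sum(1 for flag in fst_flags if flag)
    latencies = sorted(r["latency_seconds"] for r in results)
    return {
        "total": total,
        "exact_matches": exact,
        "exact_match_rate": exact / total if total else 0.0,
        "fst_accepted": fst_accepted,
        "fst_acceptance_rate": fst_accepted / total if (fst_used and total) else None,
        "errors": sum(1 for r in results if r["error"] is not None),
        "avg_latency_seconds": statistics.mean(latencies) if latencies else None,
        "median_latency_seconds": statistics.median(latencies) if latencies else None,
        "p95_latency_seconds": nearest_rank_percentile(latencies, 95) if latencies else None,
    }
```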

by_difficulty

Scores broken down by difficulty tier. Each key is a difficulty tier from 1 to 5, written as a JSON string ("1" through "5"), and its value contains the same metric fields as the top-level scores.

{
"by_difficulty": {
"1": {
"total": 20,
"exact_matches": 8,
"exact_match_rate": 0.40,
"chrf_plus_plus": 68.2,
"fst_accepted": 18,
"fst_acceptance_rate": 0.90
},
"2": { ... },
"3": { ... },
"4": { ... },
"5": { ... }
}
}

by_provenance

Scores broken down by entry provenance. Each key (e.g., gold_standard, textbook) contains the same metric fields.

{
"by_provenance": {
"gold_standard": {
"total": 80,
"exact_matches": 10,
"exact_match_rate": 0.125,
"chrf_plus_plus": 44.8
},
"textbook": { ... }
}
}
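Both breakdowns follow the same pattern: group the per-entry results by one field, then aggregate each group. The sketch below shows that grouping for the count and rate fields only. The helper `breakdown` is hypothetical, and it converts group keys to strings because JSON object keys are always strings.

```python
from collections import defaultdict


def breakdown(results: list, key: str) -> dict:
    """Group results by `key` (e.g. "difficulty" or "provenance") and score each group."""
    groups = defaultdict(list)
    for row in results:
        groups[str(row[key])].append(row)  # JSON keys are strings, so "1" not 1

    scores = {}
    for name, rows in sorted(groups.items()):
        exact = sum(1 for r in rows if r["exact_match"])
        scores[name] = {
            "total": len(rows),
            "exact_matches": exact,
            "exact_match_rate": exact / len(rows),
        }
    return scores
```

Calling `breakdown(results, "difficulty")` and `breakdown(results, "provenance")` on the same results array would produce the two objects in the shape shown above.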

totals

Token usage and cost tracking for the entire run.

| Field | Type | Description |
| --- | --- | --- |
| prompt_tokens | number | Total input tokens across all API calls |
| completion_tokens | number | Total output tokens |
| reasoning_tokens | number | Tokens used for chain-of-thought reasoning (model-dependent, 0 for most models) |
| cached_tokens | number | Tokens served from the provider's prompt cache |
| total_cost_usd | number | Total cost in USD (as reported by the API) |
| cost_per_entry_usd | number | total_cost_usd / entry_count |
| reasoning_ratio | number | reasoning_tokens / completion_tokens (0.0–1.0) |

{
"totals": {
"prompt_tokens": 48200,
"completion_tokens": 3100,
"reasoning_tokens": 0,
"cached_tokens": 12000,
"total_cost_usd": 0.42,
"cost_per_entry_usd": 0.0034,
"reasoning_ratio": 0.0
}
}
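The two derived fields are simple ratios. A minimal sketch follows, with a zero guard added as an assumption, since the page does not say what is recorded when `entry_count` or `completion_tokens` is 0. The name `derive_totals` is illustrative.

```python
def derive_totals(totals: dict, entry_count: int) -> dict:
    """Return a copy of `totals` with the two derived ratio fields filled in."""
    derived = dict(totals)
    completion = totals["completion_tokens"]
    # Assumption: fall back to 0.0 rather than dividing by zero.
    derived["cost_per_entry_usd"] = (
        totals["total_cost_usd"] / entry_count if entry_count else 0.0
    )
    derived["reasoning_ratio"] = (
        totals["reasoning_tokens"] / completion if completion else 0.0
    )
    return derived
```

With the example values above and the 124-entry dataset, 0.42 / 124 rounds to the 0.0034 shown for `cost_per_entry_usd`.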

environment

Runtime environment metadata for reproducibility.

| Field | Type | Description |
| --- | --- | --- |
| harness_version | string | Harness version (mirrors top-level harness_version) |
| harness_git_commit | string | Git commit SHA of the harness at run time |
| python_version | string | Python interpreter version |
| sacrebleu_version | string | sacrebleu library version (used for chrF++ scoring) |
| os | string | Operating system identifier |

{
"environment": {
"harness_version": "2.0",
"harness_git_commit": "a1b2c3d",
"python_version": "3.11.9",
"sacrebleu_version": "2.4.0",
"os": "macOS-14.5-arm64"
}
}
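Most of these values can be read from the standard library at run time. The sketch below is one way to collect them and is not taken from the harness. `collect_environment` is a hypothetical helper, the harness version and git commit are passed in by the caller, and `sacrebleu_version` falls back to `None` when the package is not installed.

```python
import platform
from importlib import metadata


def collect_environment(harness_version: str, harness_git_commit: str) -> dict:
    """Gather runtime metadata for the environment block (sketch)."""
    try:
        sacrebleu_version = metadata.version("sacrebleu")
    except metadata.PackageNotFoundError:
        sacrebleu_version = None  # assumption: record null if sacrebleu is absent
    return {
        "harness_version": harness_version,
        "harness_git_commit": harness_git_commit,
        "python_version": platform.python_version(),
        "sacrebleu_version": sacrebleu_version,
        "os": platform.platform(),  # platform string for the current OS and architecture
    }
```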

results[]

The per-entry results array. One object per dataset entry, in index order.

| Field | Type | Description |
| --- | --- | --- |
| entry_id | integer | ID of this entry in the corpus (matches entries[].id) |
| source | string | The source text that was translated |
| reference | string | The gold-standard reference from the corpus |
| predicted | string | The method's actual output |
| exact_match | boolean | Whether predicted exactly matches reference after normalization |
| entry_chrf | number | Sentence-level chrF++ score for this entry (0–100) |
| fst_accepted | boolean \| null | Whether the FST analyzer accepted the output. null if no analyzer was configured |
| fst_analysis | string[] | FST analysis strings for the output (empty array if not analyzed or rejected) |
| difficulty | integer | Difficulty tier from the corpus (1–5) |
| provenance | string | Provenance tag from the corpus |
| latency_seconds | number | Response time for this individual entry |
| usage | object | Per-entry token usage: { prompt_tokens, completion_tokens, reasoning_tokens } |
| error | string \| null | Error message if this entry failed. null on success |

{
"results": [
{
"entry_id": 1,
"source": "Hello",
"reference": "tânisi",
"predicted": "tânisi",
"exact_match": true,
"entry_chrf": 100.0,
"fst_accepted": true,
"fst_analysis": ["tânisi+V+AI+Ind+2Sg"],
"difficulty": 1,
"provenance": "gold_standard",
"latency_seconds": 0.82,
"usage": {
"prompt_tokens": 385,
"completion_tokens": 12,
"reasoning_tokens": 0
},
"error": null
}
]
}
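The `exact_match` field depends on a normalization step that this page does not define. Purely as an illustration, the sketch below assumes Unicode NFC normalization plus whitespace collapsing, which matters for text with diacritics such as "tânisi". The authoritative rules belong to the Benchmark Specification, and both function names here are hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Assumed normalization: NFC form, trimmed, internal whitespace collapsed."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def is_exact_match(predicted: str, reference: str) -> bool:
    """Compare two strings after the assumed normalization."""
    return normalize(predicted) == normalize(reference)
```

Under this assumption, a prediction that spells "â" as "a" followed by a combining circumflex still matches a reference that uses the precomposed character, while a prediction that drops the diacritic does not.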

run_card_hash

| Field | Type | Description |
| --- | --- | --- |
| run_card_hash | string | SHA-256 hash of the entire run card JSON, with the run_card_hash field itself set to "" during hashing |

This is the tamper-detection seal. The leaderboard re-computes this hash on submission and rejects cards where it doesn't match.

Computing the hash:

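  1. Serialize the run card to JSON with run_card_hash set to "", using sorted keys and leaving non-ASCII characters unescaped
  2. Compute SHA-256 of the serialized string, encoded as UTF-8
  3. Set run_card_hash to the resulting hex digest

import hashlib
import json

card["run_card_hash"] = ""  # card is the run card as a dict; blank the seal before hashing
card_json = json.dumps(card, sort_keys=True, ensure_ascii=False)  # sorted keys give a stable byte sequence
card["run_card_hash"] = hashlib.sha256(card_json.encode("utf-8")).hexdigest()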
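The same steps can be run in reverse to check a card before submitting it. This is a sketch of a local verification, not the leaderboard's code. It assumes the verifier uses the same `json.dumps` settings as the snippet above, and `verify_run_card` is a hypothetical name.

```python
import hashlib
import json


def verify_run_card(card: dict) -> bool:
    """Recompute the seal over a copy of the card and compare it to the stored hash."""
    claimed = card.get("run_card_hash", "")
    unsealed = dict(card)  # shallow copy so the caller's card is left untouched
    unsealed["run_card_hash"] = ""
    payload = json.dumps(unsealed, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest() == claimed
```

Any edit made after sealing, such as changing a score, causes the function to return `False`.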

See Also