Evaluation Corpus Design Framework

Version: 1.0
Status: Draft
Purpose: A systematic methodology for building evaluation corpora that produce valid, reliable, and linguistically meaningful translation quality assessments. This is the source of truth for how Champollion evaluation datasets are designed, constructed, and maintained.

1. Design Principles

1.1 — Why Not Public Benchmarks?

Public parallel corpora (FLORES+, Tatoeba, WMT test sets, OPUS) are available for development and debugging but are excluded from official leaderboard evaluation. The reason is straightforward:

Contamination. Frontier LLMs are trained on enormous web scrapes. Any parallel text that has existed publicly — especially in curated, widely-cited benchmark datasets — is likely in their training data. When you evaluate GPT-4o on FLORES+ and it scores 85 chrF++, you cannot distinguish "the model is good at translating" from "the model memorized these specific sentence pairs." This is not a theoretical concern — research has demonstrated measurable contamination effects on MT benchmarks.

For Champollion, this matters acutely because:

Our leaderboard primarily compares LLM-based methods
Our value proposition is honest, rigorous evaluation
Our target users (language communities) make deployment decisions based on these scores

1.2 — Core Requirements

Every Champollion evaluation corpus must satisfy:

Requirement	Rationale
Human-authored	No synthetic data. All source text and reference translations must be written by humans. LLMs may assist with alignment and formatting but never generate content.
Not publicly available in parallel form	The source text may be public; the reference translations may be public; but the specific pairing must not exist as a downloadable parallel corpus.
Provenance-tracked	Every entry must have documented origin: source document, translator, license, date.
Linguistically informed	Coverage must be guided by typological features, not random sampling.
Domain-stratified	Entries must span defined text domains with controlled representation.
Difficulty-tiered	Entries must be assigned difficulty tiers (1–5) based on structural complexity.
Version-controlled	Corpus versions are content-hashed. Scores are only comparable within the same version.
Community-reviewable	Reference translations must be reviewable by language community members.

2. Source Text Selection

2.1 — Domain Taxonomy

Champollion evaluates translation for practical deployment contexts, not academic exercises. The domain taxonomy reflects real-world text types that translation users encounter:

Domain	Code	Description	Example Sources
Software UI	`ui`	Button labels, menu items, error messages, tooltips, onboarding flows	Open-source app strings, documentation portals
Official/Administrative	`admin`	Government documents, legal notices, forms, policy statements	Public government publications, municipal documents
Educational	`edu`	Textbook content, lesson materials, instructional text	Published educational materials, teaching guides
Narrative/Literary	`lit`	Stories, cultural texts, oral history transcriptions	Published books, cultural archives (with permission)
Conversational	`conv`	Dialog, chat-like exchanges, informal written communication	Published dialog corpora, screenplays, interview transcripts
Technical	`tech`	API documentation, README files, technical specifications	Open-source project documentation
Health/Medical	`health`	Patient-facing medical information, public health messaging	Government health publications
News/Journalistic	`news`	News articles, press releases, current affairs	Community newspapers, Indigenous media outlets

2.2 — Domain Distribution

A standard evaluation corpus should aim for the following distribution. The exact percentages may vary by language pair based on what text types are most relevant to the target community:

Domain	Target %	Rationale
Software UI	25%	Primary deployment context for champollion CLI users
Official/Administrative	15%	High-stakes translation with legal implications
Educational	15%	Core use case for language revitalization
Narrative/Literary	10%	Tests cultural nuance and literary register
Conversational	10%	Tests informal register and natural speech patterns
Technical	10%	Tests precision and terminology consistency
Health/Medical	10%	High-stakes, tests domain-specific vocabulary
News/Journalistic	5%	Tests contemporary vocabulary and neutral register

2.3 — Source Selection Criteria

When selecting source texts for a new corpus:

License compatibility. Source text must be under a license that permits use in an evaluation corpus. Prefer CC BY, CC BY-SA, or public domain. Document the license.
Recency. Prefer texts published within the last 10 years. Language evolves — especially vocabulary around technology, governance, and medicine.
Register diversity. Within each domain, seek texts at different formality levels. A government press release (formal) and a government social media post (informal) are both admin domain but different registers.
Cultural relevance. For Indigenous and minority languages, prioritize texts that matter to the community — land management documents, educational materials in the language, cultural preservation texts — over texts that happen to exist in parallel.
No machine-translated sources. If a "parallel" document was created by running the original through Google Translate and then post-editing, it is NOT acceptable as a reference translation. The reference must be an independent human translation.

3. Difficulty Tier System

3.1 — Tier Definitions

Every entry is assigned a difficulty tier (1–5) based on the structural complexity of the source text, not the translation difficulty (which varies by method).

Tier	Label	Structural Characteristics
1	Elementary	Simple sentences. Single clause. Present tense. Common vocabulary. No idioms. No embedded structures.
2	Intermediate	Compound sentences. Two clauses joined by conjunction. Past/future tense. Some domain vocabulary.
3	Advanced	Complex sentences. Subordinate clauses, relative clauses. Mixed tenses. Domain-specific terminology. Passive voice.
4	Expert	Multiple embedded clauses. Legal/technical register. Conditional structures. Abstract concepts. Cultural references.
5	Extreme	Dense prose with multiple simultaneous challenges: nested subordination, ambiguous pronoun reference, cultural idioms, mixed register, rare vocabulary.

3.2 — Linguistically Informed Difficulty Factors

Beyond structural complexity, difficulty is modulated by typological distance between the source and target language. These factors are drawn from WALS typological features and the language card's classification data:

Factor	Low Difficulty	High Difficulty
Word order	Same basic order (e.g., SVO→SVO)	Different basic order (e.g., SVO→SOV)
Morphological type	Similar type (e.g., analytic→analytic)	Different type (e.g., analytic→polysynthetic)
Grammatical gender	Same system or no gender	Source has no gender, target has complex gender
Honorific/register	No register marking	Target has complex register system (e.g., Japanese, Korean)
Script	Same script	Different script (transliteration required)
Animacy	No animacy distinction	Target has animacy-based agreement (e.g., Cree)
Evidentiality	No evidentiality	Target marks information source grammatically

3.3 — Tier Distribution

A standard corpus should have approximately:

Tier	Target %	Rationale
1	15%	Establishes baseline — even bad methods should handle these
2	25%	Bread-and-butter practical translation
3	30%	Where method quality differences become visible
4	20%	Separates good methods from great ones
5	10%	Ceiling test — very few methods will handle these well

4. Reference Translation Quality

4.1 — Translator Requirements

Reference translations must be produced by humans who are:

Fluent speakers of the target language (L1 or equivalent)
Literate in both source and target language
Domain-aware for the domain of the text (a medical translator for health texts, etc.)
Independent — the translator must not have access to any MT output for the same text during translation

4.2 — Translation Brief

Every translator receives a brief that includes:

The register to use (formal, conversational, etc.)
The target audience (general public, specialists, children, etc.)
Any terminology conventions specific to the language community
Explicit instruction: "Translate the meaning, not the words. A natural-sounding translation is more valuable than a literal one."

4.3 — Quality Assurance

Dual translation. Ideally, each entry has two independent reference translations by different translators. Where this isn't feasible, prioritize dual translation for Tiers 4–5.
Community review. Reference translations should be reviewed by at least one additional speaker who did not produce the translation.
Acceptable variants. For each reference, document known acceptable variants (word order, orthographic conventions, dialectal forms). These feed the equivalent_match_rate metric.

4.4 — What Makes a Bad Reference

Problem	Why It Invalidates Evaluation
Machine-translated then post-edited	Post-editing preserves MT structure; penalizes methods that produce more natural translations
Translated by a learner, not a fluent speaker	Reference may contain errors that penalize correct MT output
Overly literal	Natural translations score poorly against literal references
Single valid interpretation for ambiguous source	Penalizes valid alternative interpretations

5. Contamination Prevention

5.1 — The Contamination Threat Model

Threat	Description	Mitigation
Training data overlap	LLMs trained on the parallel corpus	Don't publish the parallel corpus publicly
Few-shot leakage	Method author uses eval entries as few-shot examples	Fingerprint-check: entries in the prompt are detected and flagged
Indirect contamination	Source text exists in LLM training data (monolingual)	Acceptable — monolingual source text is expected. The pairing must be novel.
Crowd contamination	Community reviewers share entries publicly	License terms prohibit redistribution of the parallel corpus

5.2 — Corpus Secrecy Tiers

Tier	Visibility	Use
Public development set	Fully public	Method development, debugging, regression testing. Scores NOT published to leaderboard.
Held-out evaluation set	Source text visible, references secret	Official leaderboard evaluation. Methods receive source text and return translations; scoring happens server-side. References are never exposed to the method.
Gold-standard set	Fully secret, community-controlled	Community-validated evaluation. Managed by governance organization. Used for "Community Validated" verification tier.

5.3 — Rotation Policy

Evaluation corpora should be rotated periodically:

After a corpus has been in use for 12 months, begin constructing a replacement
Retire the old corpus to "development set" status (public)
Promote the new corpus to "held-out evaluation set"
This prevents gradual contamination through iterative optimization against a fixed target

6. Corpus Construction Workflow

6.1 — Step-by-Step Process

Step 1: Language Pair Selection
    └─ Identify target language, read language card
    └─ Review typological features (WALS), contact influences, scripts
    └─ Identify which difficulty factors apply

Step 2: Source Text Curation
    └─ Identify candidate source documents per domain
    └─ Verify licenses
    └─ Extract candidate sentences/segments
    └─ Classify by domain and preliminary difficulty tier

Step 3: Segment Selection
    └─ Sample segments to match domain distribution (§2.2)
    └─ Sample segments to match difficulty distribution (§3.3)
    └─ Ensure linguistic phenomenon coverage (§6.2)
    └─ Target minimum corpus size (§6.3)

Step 4: Reference Translation
    └─ Assign segments to qualified translators
    └─ Provide translation brief
    └─ Collect translations
    └─ Dual-translate Tier 4–5 entries

Step 5: Quality Assurance
    └─ Community review of references
    └─ Document acceptable variants
    └─ Flag and resolve disagreements

Step 6: Metadata & Packaging
    └─ Assign final difficulty tiers
    └─ Add provenance metadata per entry
    └─ Content-hash the corpus for versioning
    └─ Package as corpus JSON per harness spec

Step 7: Registration
    └─ Register in Supabase datasets table
    └─ Add to ATTRIBUTION.md if new sources used
    └─ Document in arena website

6.2 — Linguistic Phenomenon Coverage

Every corpus should include entries that test specific linguistic phenomena relevant to the language pair. These are drawn from the language card's linguisticChallenges and contactInfluences fields:

Universal phenomena (all language pairs):

Pronoun resolution (ambiguous antecedents)
Negation (single, double, scope)
Quantifiers (all, some, none, most)
Temporal expressions (relative dates, durations)
Named entities (people, places, organizations)
Numbers and measurements
Lists and enumeration

Pair-specific phenomena (from language card):

For polysynthetic targets: complex verb morphology, incorporation
For gendered targets: gender agreement, neutral/inclusive reference
For SOV targets: clause-final verbs, postpositions
For tone languages: tone-dependent meaning distinctions
For honorific languages: register markers, social context
For contact languages: code-switching boundaries, loanword integration

6.3 — Minimum Corpus Size

Statistical reliability requires minimum entry counts. These are based on paired bootstrap confidence interval requirements (from significance.py):

Purpose	Minimum Entries	Recommended
Development set	50	100–200
Held-out evaluation set	100	200–500
Gold-standard set	200	500+
Per-domain minimum	10	25+
Per-tier minimum	10	20+

Why 100 minimum for evaluation? With fewer than ~100 entries, paired bootstrap significance tests (1,000 resamples) cannot reliably detect differences smaller than ~5 chrF++ points. With 200+ entries, we can detect ~2-point differences at p<0.05.

7. Corpus JSON Format

Every corpus entry follows the harness specification:

{
  "id": "edtekla-dev-v1-042",
  "source": "The school board will meet on Tuesday to discuss the new curriculum.",
  "reference": "ᑭᓯᑭᓄᐦᐊᒫᑐᐏᓐ ᑲ ᐃᔑ ᐱᒥᐸᔨᐦᑕᐦᒃ ᑭᔅᑭᓄᐦᐊᒫᑐᐏᓇ ᐁ ᐃᔑ ᒫᒥᑐᓀᔨᐦᑕᐦᒃ ᐅᔥᑭ ᑭᔅᑭᓄᐦᐊᒫᑫᐏᓂᔭ ᓂᔓ ᑭᔑᑲᐤ",
  "acceptable_variants": [
    "ᑭᔅᑭᓄᐦᐊᒫᑐᐏᓐ ᓂᔓ ᑭᔑᑲᐤ ᑲ ᐃᔑ ᒫᒥᑐᓀᔨᐦᑕᐦᒃ ᐅᔥᑭ ᑭᔅᑭᓄᐦᐊᒫᑫᐏᓂᔭ"
  ],
  "domain": "edu",
  "difficulty": 3,
  "phenomena": ["temporal_expression", "named_entity", "future_tense"],
  "provenance": {
    "source_doc": "EdTeKLA Module 4, Unit 7",
    "source_license": "CC BY-NC-SA 4.0",
    "translator": "anonymous-speaker-001",
    "translator_qualification": "L1 Plains Cree, certified translator",
    "translation_date": "2025-11-15",
    "reviewer": "anonymous-speaker-002",
    "review_date": "2025-12-01"
  }
}

8. Anti-Gaming Measures

8.1 — Corpus Integrity

Measure	Implementation
Content hashing	Corpus version = SHA-256 of sorted entry IDs + references. Any modification produces a new version.
Entry fingerprinting	Each entry has a content-derived ID. If someone submits results against a modified corpus, the fingerprint won't match.
Held-out enforcement	For official evaluation, methods receive ONLY source text. References are never exposed. Scoring happens server-side.
Rotation schedule	Corpora rotate annually to prevent long-term optimization against a fixed target.

8.2 — Submission Integrity

Measure	Implementation
Deterministic fingerprint	Run config (model, temperature, prompt, corpus version) is hashed. Identical configs produce identical fingerprints.
Cherry-pick detection	Submitters must disclose all runs, not just the best one. Multiple submissions with the same fingerprint are flagged.
Contamination check	If eval entries appear verbatim in the method's prompt or coaching data, the submission is disqualified.

9. Existing Corpora

9.1 — EDTeKLA Development Set v1

Property	Value
ID	`edtekla-dev-v1`
Pair	EN → CRK (Plains Cree, SRO)
Entries	404 (`master_corpus.json`: 62 gold + 342 textbook); 548 total available
Domains	Educational (100%)
Tiers	1–5 (distribution TBD per entry audit)
License	CC BY-NC-SA 4.0
Status	Development set (public)

Limitations: Single domain (educational only). No domain stratification. Tier assignments may need audit. Small corpus size limits statistical power for significance testing.

9.2 — Planned Corpora

Corpus	Pair	Status	Owner
EN → TL (Filipino) custom corpus	EN → TL	Planned	Project owner
EN → CRK held-out set	EN → CRK	Future (needs community partner)	Community governance org

10. Language Card Integration

The corpus framework integrates with the language card system:

Domain selection is informed by the card's linguisticChallenges — if a language has unique challenges (polysynthesis, tone, animacy), the corpus must include entries that test them.
Difficulty calibration uses the card's classification — typological distance between source and target families affects what constitutes "difficult."
Register coverage uses the card's registers — if a language has defined registers (formal-filipino, taglish-professional, taglish-casual), the corpus should include entries at each register level.
Contact influence testing uses the card's contactInfluences — for languages with heavy borrowing layers (Filipino: Spanish + English + Arabic), include entries that test whether methods handle loanwords correctly vs. over-translating them.
Script handling uses the card's scripts[] — for multi-script languages (Serbian: Cyrillic + Latin), include entries that test correct script selection.

References

Champollion Scoring Specification — defines all metrics, composite weights, quality tiers
Champollion Benchmark Specification — evaluation protocol, corpus format, data sovereignty
WALS (World Atlas of Language Structures) — typological features database
Glottolog — language classification source of truth
ISO 639-3 — language identification standard
EdTeKLA — source of the first evaluation corpus

This document is a living specification. Update it as new corpora are built and lessons are learned.

1. Design Principles​

1.1 — Why Not Public Benchmarks?​

1.2 — Core Requirements​

2. Source Text Selection​

2.1 — Domain Taxonomy​

2.2 — Domain Distribution​

2.3 — Source Selection Criteria​

3. Difficulty Tier System​

3.1 — Tier Definitions​

3.2 — Linguistically Informed Difficulty Factors​

3.3 — Tier Distribution​

4. Reference Translation Quality​

4.1 — Translator Requirements​

4.2 — Translation Brief​

4.3 — Quality Assurance​

4.4 — What Makes a Bad Reference​

5. Contamination Prevention​

5.1 — The Contamination Threat Model​

5.2 — Corpus Secrecy Tiers​

5.3 — Rotation Policy​

6. Corpus Construction Workflow​

6.1 — Step-by-Step Process​

6.2 — Linguistic Phenomenon Coverage​

6.3 — Minimum Corpus Size​

7. Corpus JSON Format​

8. Anti-Gaming Measures​

8.1 — Corpus Integrity​

8.2 — Submission Integrity​

9. Existing Corpora​

9.1 — EDTeKLA Development Set v1​

9.2 — Planned Corpora​

10. Language Card Integration​

References​