Skip to main content

Evaluation Corpus Design Framework

Version: 1.0
Status: Draft
Purpose: A systematic methodology for building evaluation corpora that produce valid, reliable, and linguistically meaningful translation quality assessments. This is the source of truth for how Champollion evaluation datasets are designed, constructed, and maintained.


1. Design Principles

1.1 — Why Not Public Benchmarks?

Public parallel corpora (FLORES+, Tatoeba, WMT test sets, OPUS) are available for development and debugging but are excluded from official leaderboard evaluation. The reason is straightforward:

Contamination. Frontier LLMs are trained on enormous web scrapes. Any parallel text that has existed publicly — especially in curated, widely-cited benchmark datasets — is likely in their training data. When you evaluate GPT-4o on FLORES+ and it scores 85 chrF++, you cannot distinguish "the model is good at translating" from "the model memorized these specific sentence pairs." This is not a theoretical concern — research has demonstrated measurable contamination effects on MT benchmarks.

For Champollion, this matters acutely because:

  • Our leaderboard primarily compares LLM-based methods
  • Our value proposition is honest, rigorous evaluation
  • Our target users (language communities) make deployment decisions based on these scores

1.2 — Core Requirements

Every Champollion evaluation corpus must satisfy:

RequirementRationale
Human-authoredNo synthetic data. All source text and reference translations must be written by humans. LLMs may assist with alignment and formatting but never generate content.
Not publicly available in parallel formThe source text may be public; the reference translations may be public; but the specific pairing must not exist as a downloadable parallel corpus.
Provenance-trackedEvery entry must have documented origin: source document, translator, license, date.
Linguistically informedCoverage must be guided by typological features, not random sampling.
Domain-stratifiedEntries must span defined text domains with controlled representation.
Difficulty-tieredEntries must be assigned difficulty tiers (1–5) based on structural complexity.
Version-controlledCorpus versions are content-hashed. Scores are only comparable within the same version.
Community-reviewableReference translations must be reviewable by language community members.

2. Source Text Selection

2.1 — Domain Taxonomy

Champollion evaluates translation for practical deployment contexts, not academic exercises. The domain taxonomy reflects real-world text types that translation users encounter:

DomainCodeDescriptionExample Sources
Software UIuiButton labels, menu items, error messages, tooltips, onboarding flowsOpen-source app strings, documentation portals
Official/AdministrativeadminGovernment documents, legal notices, forms, policy statementsPublic government publications, municipal documents
EducationaleduTextbook content, lesson materials, instructional textPublished educational materials, teaching guides
Narrative/LiterarylitStories, cultural texts, oral history transcriptionsPublished books, cultural archives (with permission)
ConversationalconvDialog, chat-like exchanges, informal written communicationPublished dialog corpora, screenplays, interview transcripts
TechnicaltechAPI documentation, README files, technical specificationsOpen-source project documentation
Health/MedicalhealthPatient-facing medical information, public health messagingGovernment health publications
News/JournalisticnewsNews articles, press releases, current affairsCommunity newspapers, Indigenous media outlets

2.2 — Domain Distribution

A standard evaluation corpus should aim for the following distribution. The exact percentages may vary by language pair based on what text types are most relevant to the target community:

DomainTarget %Rationale
Software UI25%Primary deployment context for champollion CLI users
Official/Administrative15%High-stakes translation with legal implications
Educational15%Core use case for language revitalization
Narrative/Literary10%Tests cultural nuance and literary register
Conversational10%Tests informal register and natural speech patterns
Technical10%Tests precision and terminology consistency
Health/Medical10%High-stakes, tests domain-specific vocabulary
News/Journalistic5%Tests contemporary vocabulary and neutral register

2.3 — Source Selection Criteria

When selecting source texts for a new corpus:

  1. License compatibility. Source text must be under a license that permits use in an evaluation corpus. Prefer CC BY, CC BY-SA, or public domain. Document the license.

  2. Recency. Prefer texts published within the last 10 years. Language evolves — especially vocabulary around technology, governance, and medicine.

  3. Register diversity. Within each domain, seek texts at different formality levels. A government press release (formal) and a government social media post (informal) are both admin domain but different registers.

  4. Cultural relevance. For Indigenous and minority languages, prioritize texts that matter to the community — land management documents, educational materials in the language, cultural preservation texts — over texts that happen to exist in parallel.

  5. No machine-translated sources. If a "parallel" document was created by running the original through Google Translate and then post-editing, it is NOT acceptable as a reference translation. The reference must be an independent human translation.


3. Difficulty Tier System

3.1 — Tier Definitions

Every entry is assigned a difficulty tier (1–5) based on the structural complexity of the source text, not the translation difficulty (which varies by method).

TierLabelStructural Characteristics
1ElementarySimple sentences. Single clause. Present tense. Common vocabulary. No idioms. No embedded structures.
2IntermediateCompound sentences. Two clauses joined by conjunction. Past/future tense. Some domain vocabulary.
3AdvancedComplex sentences. Subordinate clauses, relative clauses. Mixed tenses. Domain-specific terminology. Passive voice.
4ExpertMultiple embedded clauses. Legal/technical register. Conditional structures. Abstract concepts. Cultural references.
5ExtremeDense prose with multiple simultaneous challenges: nested subordination, ambiguous pronoun reference, cultural idioms, mixed register, rare vocabulary.

3.2 — Linguistically Informed Difficulty Factors

Beyond structural complexity, difficulty is modulated by typological distance between the source and target language. These factors are drawn from WALS typological features and the language card's classification data:

FactorLow DifficultyHigh Difficulty
Word orderSame basic order (e.g., SVO→SVO)Different basic order (e.g., SVO→SOV)
Morphological typeSimilar type (e.g., analytic→analytic)Different type (e.g., analytic→polysynthetic)
Grammatical genderSame system or no genderSource has no gender, target has complex gender
Honorific/registerNo register markingTarget has complex register system (e.g., Japanese, Korean)
ScriptSame scriptDifferent script (transliteration required)
AnimacyNo animacy distinctionTarget has animacy-based agreement (e.g., Cree)
EvidentialityNo evidentialityTarget marks information source grammatically

3.3 — Tier Distribution

A standard corpus should have approximately:

TierTarget %Rationale
115%Establishes baseline — even bad methods should handle these
225%Bread-and-butter practical translation
330%Where method quality differences become visible
420%Separates good methods from great ones
510%Ceiling test — very few methods will handle these well

4. Reference Translation Quality

4.1 — Translator Requirements

Reference translations must be produced by humans who are:

  1. Fluent speakers of the target language (L1 or equivalent)
  2. Literate in both source and target language
  3. Domain-aware for the domain of the text (a medical translator for health texts, etc.)
  4. Independent — the translator must not have access to any MT output for the same text during translation

4.2 — Translation Brief

Every translator receives a brief that includes:

  • The register to use (formal, conversational, etc.)
  • The target audience (general public, specialists, children, etc.)
  • Any terminology conventions specific to the language community
  • Explicit instruction: "Translate the meaning, not the words. A natural-sounding translation is more valuable than a literal one."

4.3 — Quality Assurance

  1. Dual translation. Ideally, each entry has two independent reference translations by different translators. Where this isn't feasible, prioritize dual translation for Tiers 4–5.

  2. Community review. Reference translations should be reviewed by at least one additional speaker who did not produce the translation.

  3. Acceptable variants. For each reference, document known acceptable variants (word order, orthographic conventions, dialectal forms). These feed the equivalent_match_rate metric.

4.4 — What Makes a Bad Reference

ProblemWhy It Invalidates Evaluation
Machine-translated then post-editedPost-editing preserves MT structure; penalizes methods that produce more natural translations
Translated by a learner, not a fluent speakerReference may contain errors that penalize correct MT output
Overly literalNatural translations score poorly against literal references
Single valid interpretation for ambiguous sourcePenalizes valid alternative interpretations

5. Contamination Prevention

5.1 — The Contamination Threat Model

ThreatDescriptionMitigation
Training data overlapLLMs trained on the parallel corpusDon't publish the parallel corpus publicly
Few-shot leakageMethod author uses eval entries as few-shot examplesFingerprint-check: entries in the prompt are detected and flagged
Indirect contaminationSource text exists in LLM training data (monolingual)Acceptable — monolingual source text is expected. The pairing must be novel.
Crowd contaminationCommunity reviewers share entries publiclyLicense terms prohibit redistribution of the parallel corpus

5.2 — Corpus Secrecy Tiers

TierVisibilityUse
Public development setFully publicMethod development, debugging, regression testing. Scores NOT published to leaderboard.
Held-out evaluation setSource text visible, references secretOfficial leaderboard evaluation. Methods receive source text and return translations; scoring happens server-side. References are never exposed to the method.
Gold-standard setFully secret, community-controlledCommunity-validated evaluation. Managed by governance organization. Used for "Community Validated" verification tier.

5.3 — Rotation Policy

Evaluation corpora should be rotated periodically:

  1. After a corpus has been in use for 12 months, begin constructing a replacement
  2. Retire the old corpus to "development set" status (public)
  3. Promote the new corpus to "held-out evaluation set"
  4. This prevents gradual contamination through iterative optimization against a fixed target

6. Corpus Construction Workflow

6.1 — Step-by-Step Process

Step 1: Language Pair Selection
└─ Identify target language, read language card
└─ Review typological features (WALS), contact influences, scripts
└─ Identify which difficulty factors apply

Step 2: Source Text Curation
└─ Identify candidate source documents per domain
└─ Verify licenses
└─ Extract candidate sentences/segments
└─ Classify by domain and preliminary difficulty tier

Step 3: Segment Selection
└─ Sample segments to match domain distribution (§2.2)
└─ Sample segments to match difficulty distribution (§3.3)
└─ Ensure linguistic phenomenon coverage (§6.2)
└─ Target minimum corpus size (§6.3)

Step 4: Reference Translation
└─ Assign segments to qualified translators
└─ Provide translation brief
└─ Collect translations
└─ Dual-translate Tier 4–5 entries

Step 5: Quality Assurance
└─ Community review of references
└─ Document acceptable variants
└─ Flag and resolve disagreements

Step 6: Metadata & Packaging
└─ Assign final difficulty tiers
└─ Add provenance metadata per entry
└─ Content-hash the corpus for versioning
└─ Package as corpus JSON per harness spec

Step 7: Registration
└─ Register in Supabase datasets table
└─ Add to ATTRIBUTION.md if new sources used
└─ Document in arena website

6.2 — Linguistic Phenomenon Coverage

Every corpus should include entries that test specific linguistic phenomena relevant to the language pair. These are drawn from the language card's linguisticChallenges and contactInfluences fields:

Universal phenomena (all language pairs):

  • Pronoun resolution (ambiguous antecedents)
  • Negation (single, double, scope)
  • Quantifiers (all, some, none, most)
  • Temporal expressions (relative dates, durations)
  • Named entities (people, places, organizations)
  • Numbers and measurements
  • Lists and enumeration

Pair-specific phenomena (from language card):

  • For polysynthetic targets: complex verb morphology, incorporation
  • For gendered targets: gender agreement, neutral/inclusive reference
  • For SOV targets: clause-final verbs, postpositions
  • For tone languages: tone-dependent meaning distinctions
  • For honorific languages: register markers, social context
  • For contact languages: code-switching boundaries, loanword integration

6.3 — Minimum Corpus Size

Statistical reliability requires minimum entry counts. These are based on paired bootstrap confidence interval requirements (from significance.py):

PurposeMinimum EntriesRecommended
Development set50100–200
Held-out evaluation set100200–500
Gold-standard set200500+
Per-domain minimum1025+
Per-tier minimum1020+

Why 100 minimum for evaluation? With fewer than ~100 entries, paired bootstrap significance tests (1,000 resamples) cannot reliably detect differences smaller than ~5 chrF++ points. With 200+ entries, we can detect ~2-point differences at p<0.05.


7. Corpus JSON Format

Every corpus entry follows the harness specification:

{
"id": "edtekla-dev-v1-042",
"source": "The school board will meet on Tuesday to discuss the new curriculum.",
"reference": "ᑭᓯᑭᓄᐦᐊᒫᑐᐏᓐ ᑲ ᐃᔑ ᐱᒥᐸᔨᐦᑕᐦᒃ ᑭᔅᑭᓄᐦᐊᒫᑐᐏᓇ ᐁ ᐃᔑ ᒫᒥᑐᓀᔨᐦᑕᐦᒃ ᐅᔥᑭ ᑭᔅᑭᓄᐦᐊᒫᑫᐏᓂᔭ ᓂᔓ ᑭᔑᑲᐤ",
"acceptable_variants": [
"ᑭᔅᑭᓄᐦᐊᒫᑐᐏᓐ ᓂᔓ ᑭᔑᑲᐤ ᑲ ᐃᔑ ᒫᒥᑐᓀᔨᐦᑕᐦᒃ ᐅᔥᑭ ᑭᔅᑭᓄᐦᐊᒫᑫᐏᓂᔭ"
],
"domain": "edu",
"difficulty": 3,
"phenomena": ["temporal_expression", "named_entity", "future_tense"],
"provenance": {
"source_doc": "EdTeKLA Module 4, Unit 7",
"source_license": "CC BY-NC-SA 4.0",
"translator": "anonymous-speaker-001",
"translator_qualification": "L1 Plains Cree, certified translator",
"translation_date": "2025-11-15",
"reviewer": "anonymous-speaker-002",
"review_date": "2025-12-01"
}
}

8. Anti-Gaming Measures

8.1 — Corpus Integrity

MeasureImplementation
Content hashingCorpus version = SHA-256 of sorted entry IDs + references. Any modification produces a new version.
Entry fingerprintingEach entry has a content-derived ID. If someone submits results against a modified corpus, the fingerprint won't match.
Held-out enforcementFor official evaluation, methods receive ONLY source text. References are never exposed. Scoring happens server-side.
Rotation scheduleCorpora rotate annually to prevent long-term optimization against a fixed target.

8.2 — Submission Integrity

MeasureImplementation
Deterministic fingerprintRun config (model, temperature, prompt, corpus version) is hashed. Identical configs produce identical fingerprints.
Cherry-pick detectionSubmitters must disclose all runs, not just the best one. Multiple submissions with the same fingerprint are flagged.
Contamination checkIf eval entries appear verbatim in the method's prompt or coaching data, the submission is disqualified.

9. Existing Corpora

9.1 — EDTeKLA Development Set v1

PropertyValue
IDedtekla-dev-v1
PairEN → CRK (Plains Cree, SRO)
Entries404 (master_corpus.json: 62 gold + 342 textbook); 548 total available
DomainsEducational (100%)
Tiers1–5 (distribution TBD per entry audit)
LicenseCC BY-NC-SA 4.0
StatusDevelopment set (public)

Limitations: Single domain (educational only). No domain stratification. Tier assignments may need audit. Small corpus size limits statistical power for significance testing.

9.2 — Planned Corpora

CorpusPairStatusOwner
EN → TL (Filipino) custom corpusEN → TLPlannedProject owner
EN → CRK held-out setEN → CRKFuture (needs community partner)Community governance org

10. Language Card Integration

The corpus framework integrates with the language card system:

  1. Domain selection is informed by the card's linguisticChallenges — if a language has unique challenges (polysynthesis, tone, animacy), the corpus must include entries that test them.

  2. Difficulty calibration uses the card's classification — typological distance between source and target families affects what constitutes "difficult."

  3. Register coverage uses the card's registers — if a language has defined registers (formal-filipino, taglish-professional, taglish-casual), the corpus should include entries at each register level.

  4. Contact influence testing uses the card's contactInfluences — for languages with heavy borrowing layers (Filipino: Spanish + English + Arabic), include entries that test whether methods handle loanwords correctly vs. over-translating them.

  5. Script handling uses the card's scripts[] — for multi-script languages (Serbian: Cyrillic + Latin), include entries that test correct script selection.


References

  • Champollion Scoring Specification — defines all metrics, composite weights, quality tiers
  • Champollion Benchmark Specification — evaluation protocol, corpus format, data sovereignty
  • WALS (World Atlas of Language Structures) — typological features database
  • Glottolog — language classification source of truth
  • ISO 639-3 — language identification standard
  • EdTeKLA — source of the first evaluation corpus

This document is a living specification. Update it as new corpora are built and lessons are learned.