Corpus Partnership Strategy: Establishing Evaluation Corpora Through Academic Linguistics Departments
Purpose. This document provides the complete workflow for establishing a machine translation evaluation corpus through a linguistics department partnership. It covers what we need the department to deliver, what the corpus must look like, how it is cryptographically sealed, how sandbox evaluation works, and what the department gets in return. This is the document you bring to a meeting with a potential academic partner.
Audience. Department heads, principal investigators, research coordinators, and Indigenous language program directors at universities with active language documentation or NLP programs.
Companion documents:
- Speaker Validation Protocol — the ask for bilingual speakers to mark existing translations (quality rating, linter validation, FST review)
- Benchmark Specification — the full technical spec for corpora, run cards, and evaluation protocols
- Data Sovereignty — OCAP®, CARE, and why ownership transfer matters
Last updated: 2026-06-07
1. What This Partnership Produces
A sealed evaluation corpus: a curated set of parallel text pairs (source language → target language) that becomes the ground truth for measuring machine translation quality. Methods are tested against this corpus in a sandbox — developers never see the test data.
The partnership produces three artifacts:
| Artifact | What It Is | Who Controls It |
|---|---|---|
| Development corpus | 100–200+ public parallel text pairs for method development | Published openly (CC BY-NC-SA 4.0 or equivalent) |
| Gold-standard test set | 50–150 secret parallel text pairs for official evaluation | Community governance org (cryptographically sealed) |
| Diagnostic test suite | 10–50 targeted contrastive pairs testing specific linguistic phenomena | Published openly |
The development corpus enables anyone to build translation methods. The gold-standard set ensures those methods are tested honestly. The diagnostic suite catches specific failure modes (e.g., "can this system handle obviation?").
2. What the Department Needs to Do
Phase 1: Corpus Design (2–4 weeks, researcher time)
Lead: PI or postdoc with expertise in the target language.
-
Select source material domains. Choose 4–6 real-world domains where translation is actually needed by the language community. Our taxonomy supports 16 domains (see Benchmark Spec §2.7):
Priority Domain Why 🔴 High edu— EducationalTextbooks, curricula — direct community need 🔴 High gov— GovernmentBand council documents, policy — practical daily need 🔴 High medical— HealthClinic intake forms, health info — safety-critical 🟡 Medium conv— ConversationalEveryday speech — establishes baseline fluency 🟡 Medium legal— LegalRights documents, treaties — community significance 🟢 Lower literary— Literary/CulturalStories, oral histories — cultural preservation -
Draft a corpus design document specifying:
- Target size per segment (development, gold_standard, diagnostic)
- Difficulty tier distribution (see §3.3 below)
- Register and domain coverage
- Source sentence selection criteria (no synthetic text, no Bible-only)
- Speaker recruitment plan
-
Submit the design to us for review. We validate it against the corpus schema (Benchmark Spec §2) and return feedback within 1 week.
Phase 2: Source Sentence Creation (4–8 weeks, speaker time)
Lead: Research coordinator working with bilingual speakers.
-
Generate or select source sentences across the planned domains and difficulty tiers. Sources can be:
- Existing published bilingual materials (textbooks, government documents)
- Newly elicited sentences designed to cover specific linguistic phenomena
- Adapted from real-world documents (band council agendas, clinic forms, educational materials)
-
Each source sentence must have:
- Domain tag (from the 16-code taxonomy)
- Register tag (conversational, formal, technical, ceremonial, educational)
- Context tag (greeting, declaration, question, instruction, narrative, label, error)
- Estimated difficulty tier (1–5, see §3.3)
- Provenance tag (textbook, elicited, corpus, gold_standard)
-
Translate each source sentence into the target language, performed by bilingual speakers. Multiple reference translations per entry are valuable but not required.
-
Optionally, add morphological analysis for each reference translation:
- Interlinear gloss (morpheme-by-morpheme breakdown)
- FST tag string (if an FST exists for the language)
- Translator notes on dialectal variants, ambiguity, or cultural context
Phase 3: Quality Assurance (2–4 weeks)
Lead: Linguist with target language expertise.
-
Cross-review. Each translation should be reviewed by at least one additional bilingual speaker who did not produce the original translation. The reviewer checks:
- Is the translation accurate?
- Is it natural-sounding?
- Is the difficulty rating correct?
- Are there acceptable variants that should be noted?
-
Run through our schema validator. We provide a script that validates the corpus against the entry schema (Benchmark Spec §2.2). It checks:
- Required fields present
- Domain codes valid
- Difficulty tiers are integers 1–5
- No duplicate IDs
- Character encoding (UTF-8 NFC normalization)
-
If an FST exists for the language, run the reference translations through it. Every word in the reference should be FST-valid. Words that are not (loanwords, neologisms, proper nouns) should be documented in an allowlist.
Phase 4: Segmentation and Sealing (1 week, our engineering)
Lead: Champollion team, with department review.
-
Stratified split. We split the corpus into segments using deterministic random sampling (seed documented, reproducible):
Segment Target Size Access development60% of entries (min 100) Public gold_standard30% of entries (min 50) Secret, sealed held_out10% of entries (min 10) Secret, sealed, never used until activated The split preserves difficulty tier distribution (stratified sampling) so each segment has proportional representation across tiers.
-
Cryptographic sealing of the gold_standard and held_out segments:
1. SHA-256 hash of each entry (source + reference + metadata)2. SHA-256 hash of the complete segment file3. Segment file encrypted with AES-256-GCM4. Encryption key split using Shamir Secret Sharing (2-of-3 threshold)5. Key shares distributed to:- Share 1: Community governance organization- Share 2: Academic department partner- Share 3: Champollion project (escrow)6. Hash manifest published to a public commit (proves the corpus existedat a specific time without revealing its contents) -
The development segment is committed to the public repository and published with full licensing.
-
The diagnostic segment is also public — it tests specific linguistic phenomena (see §3.4).
Phase 5: Integration and Launch (1–2 weeks, our engineering)
-
Harness configuration. We add the language to the evaluation harness:
- Language card created or verified
- Corpus registered in the dataset registry
- LYSS metrics configured (LYSS-fst if FST available, LYSS-eq if linter rules exist)
- Default scoring profile selected (Profile A if FST available, Profile B otherwise)
-
Baseline benchmark. We run a 12-model sweep against the development segment to populate the leaderboard with initial scores.
-
Public announcement. The language appears on the Arena leaderboard with a live development-segment benchmark. The department is credited as the corpus partner.
3. What the Corpus Must Look Like
3.1 Format
Every corpus file is a JSON document following the schema in Benchmark Spec §2.1–§2.2:
{
"dataset": {
"id": "crk-ualberta-v1",
"version": "1.0",
"language_pair": "EN→CRK",
"source_language": "en",
"target_language": "crk",
"created": "2026-09-15",
"license": "CC-BY-NC-SA-4.0",
"provenance": ["textbook", "elicited", "gold_standard"]
},
"entries": [
{
"id": 1,
"source": "I see the dog",
"reference": "niwâpamâw atim",
"segment": "development",
"difficulty": 2,
"provenance": "textbook",
"register": "conversational",
"context": "declaration",
"domain": "edu",
"morphological_analysis": "ni-wâpam-âw atim | 1sg-see.TA-3sg.DIR dog.AN",
"notes": "Animate noun (atim); direct form because speaker is proximate"
}
]
}
3.2 Minimum Size Requirements
| Segment | Minimum Entries | Recommended |
|---|---|---|
development | 100 | 200–300 |
gold_standard | 50 | 100–150 |
diagnostic | 10 | 30–50 |
held_out | 10 | 20–30 |
| Total | 170 | 350–530 |
3.3 Difficulty Distribution
The corpus must include entries across all five difficulty tiers, weighted toward tiers 2–4:
| Tier | Description | Target Distribution |
|---|---|---|
| 1 — Basic vocabulary | Single words, common greetings, numbers | 10–15% |
| 2 — Simple sentences | SVO, present tense | 25–30% |
| 3 — Moderate complexity | Past/future tense, possessives, animacy | 30–35% |
| 4 — Complex morphology | Obviation, passive, conjunct order, relative clauses | 15–20% |
| 5 — Advanced | Multi-clause, formal register, ceremonial, idiomatic | 5–10% |
3.4 Diagnostic Test Suite
The diagnostic segment tests specific linguistic phenomena using contrastive pairs: one correct translation and one minimally-different incorrect translation. If a system's metric scores the correct one higher, the test passes.
For polysynthetic languages, the diagnostic suite should target:
| Phenomenon | Example (Cree) | What It Tests |
|---|---|---|
| Animacy agreement | atim (AN) vs. maskisin (IN) — different verb forms | Does the system know which nouns are animate? |
| Obviation | Proximate vs. obviative third person | Does it track third-person hierarchy? |
| Inverse marking | Direct vs. inverse verb forms | Does it handle patient-outranks-agent? |
| Conjunct/Independent | Main clause vs. subordinate clause verb order | Does it use the right verb paradigm? |
| Inclusive/Exclusive | "We (including you)" vs. "We (excluding you)" | Does it distinguish first-person plural forms? |
For other language families, identify the 3–5 most diagnostic phenomena that distinguish competent from incompetent translation. The department's linguistic expertise is essential here — these are the tests that only a specialist would know to write.
3.5 What We Do NOT Want
| Anti-Pattern | Why |
|---|---|
| Bible-only text | Archaic register, liturgical vocabulary, formulaic structure. OMT-1600 evaluated 1,560 languages this way — we deliberately avoid it. |
| Synthetic evaluation pairs | LLM-generated references defeat the purpose of evaluation. The reference must be human-authored. |
| Single-register corpora | All formal, or all conversational. Real-world translation spans multiple registers. |
| Difficulty-1-only | Single words and greetings don't test translation — they test vocabulary lookup. |
| Machine-translated references | Using Google Translate output as a "reference" is circular. |
| Sentences with no context tag | We need to know the communicative function for diagnostic analysis. |
4. Cryptographic Sealing and Sandbox Testing
4.1 Why Seal the Test Set?
Conventional ML benchmarks publish test sets openly. Once published, frontier LLMs will eventually train on them (intentionally or through web scraping), making scores unreliable. For Indigenous language data, there's an additional concern: published linguistic data can be used without community consent.
Sealing ensures:
- Test set integrity: Methods cannot overfit to data they've never seen
- Data sovereignty: The community controls who evaluates against their data
- Perpetual freshness: The test set never becomes contaminated
4.2 How Sandbox Testing Works
Developer workflow:
1. Developer builds a translation method using the PUBLIC development corpus
2. Developer tests locally against the development segment (unlimited, self-serve)
3. When ready, developer submits their complete method (code + config + coaching data)
4. Governance org installs the method in the evaluation sandbox
5. Sandbox runs the method against the SEALED gold-standard test set
6. Only scores are returned to the developer
7. Developer never sees the source sentences or reference translations
The sandbox:
- Runs on governance-controlled infrastructure
- Has selective network access (LLM APIs only, no exfiltration)
- Produces a tamper-proof run card (SHA-256 hash of all inputs and outputs)
- Logs all execution for audit purposes
- Can be inspected by the governance org at any time
4.3 Key Management
The encryption key for the sealed test set is split using Shamir Secret Sharing with a 2-of-3 threshold:
| Share Holder | Role | Revocation Power |
|---|---|---|
| Community governance org | Primary custodian | Can revoke evaluation access unilaterally |
| Academic department partner | Co-custodian | Can participate in key reconstruction |
| Champollion project | Escrow | Cannot access data alone; ensures continuity if other parties become unavailable |
Any 2 of 3 shares reconstruct the key. This means:
- The community + department can access the data without Champollion
- The community + Champollion can access the data without the department
- Champollion alone can NEVER access the data
4.4 Hash Manifests
When the corpus is sealed, a hash manifest is published to a public Git commit:
{
"corpus_id": "crk-ualberta-v1",
"seal_date": "2026-09-15T00:00:00Z",
"segments": {
"development": {
"entry_count": 200,
"sha256": "a3f7c...",
"access": "public"
},
"gold_standard": {
"entry_count": 100,
"sha256": "b8d2e...",
"access": "sealed",
"key_scheme": "shamir-2-of-3"
},
"held_out": {
"entry_count": 20,
"sha256": "c9e4f...",
"access": "sealed",
"key_scheme": "shamir-2-of-3"
},
"diagnostic": {
"entry_count": 30,
"sha256": "d1a3b...",
"access": "public"
}
},
"total_entries": 350,
"manifest_sha256": "e2b5c..."
}
This proves:
- The corpus existed at a specific date
- It has a known size and structure
- Any modification to the sealed segments would break the hash chain
- The community can verify their data hasn't been tampered with
5. What the Department Gets
5.1 Research Infrastructure
| Asset | Description |
|---|---|
| Evaluation harness | A working, tested evaluation framework for their language — saves months of tool-building |
| LYSS metrics | Language-specific evaluation metrics (LYSS-fst, LYSS-eq, LYSS-sem) configured for their language — if FST and dictionary resources exist |
| Leaderboard | A public, live leaderboard showing the state of the art for their language pair |
| Baseline benchmark | 12-model sweep providing immediate, publishable baselines |
| Diagnostic test suite | Targeted tests for specific linguistic phenomena — reusable for other evaluations |
5.2 Publications
The corpus construction and evaluation results support multiple publications:
| Paper | Venue | Department Role |
|---|---|---|
| Corpus construction methodology | LREC, ComputEL | Lead or co-author |
| Baseline evaluation results | ACL, EMNLP | Co-author |
| LYSS metric validation | WMT Metrics Shared Task | Co-author |
| Diagnostic test suite design | SIGMORPHON, NAACL | Lead or co-author |
| Language-specific NLP resources | Language-specific venues | Lead author |
5.3 Grant Positioning
The partnership provides concrete outputs for grant proposals:
- "Open-source evaluation infrastructure for [language] MT" — demonstrable deliverable
- "Cryptographic data sovereignty for Indigenous linguistic data" — novel, publishable
- "Community-governed benchmark with live leaderboard" — ongoing impact metric
- "Independent evaluation of OMT-1600 / Google Translate for [language]" — timely, high-visibility
5.4 Community Impact
- The language community gains an independent evaluation capability — they can assess whether any MT system (Google, Meta, or custom) actually works for their language
- The community controls the test data via cryptographic key custody
- Any methods proven through the benchmark transfer ownership to the community (see Benchmark Spec §8.3)
- Revenue from deployed methods flows to the community (90/10 split)
5.5 What It Costs the Department
| Component | Estimated Cost | Who Pays |
|---|---|---|
| PI/postdoc time (design, oversight) | ~40 hours | Department (or grant-funded) |
| Speaker compensation (translation) | $2,500–6,000 | Grant-funded or Champollion-funded |
| Speaker compensation (review) | $500–1,500 | Grant-funded or Champollion-funded |
| Research coordinator time | ~20 hours | Department |
| Engineering, infrastructure, harness | $0 | Champollion project |
We provide all engineering, harness configuration, LYSS metric setup, leaderboard integration, and ongoing infrastructure at no cost to the department. The department's contribution is linguistic expertise and speaker access.
6. Timeline
| Phase | Duration | Key Milestone |
|---|---|---|
| 1: Corpus Design | 2–4 weeks | Design document approved |
| 2: Source Sentences + Translation | 4–8 weeks | Raw corpus completed |
| 3: Quality Assurance | 2–4 weeks | Cross-reviewed, schema-validated |
| 4: Sealing | 1 week | Gold-standard sealed, hash manifest published |
| 5: Integration | 1–2 weeks | Language live on leaderboard with baselines |
| Total | 10–19 weeks | Live leaderboard with sealed evaluation |
7. How to Get Started
-
Contact us — [project email/contact]. We'll schedule a 30-minute call to discuss your language, available resources, and partnership logistics.
-
We provide:
- This document
- The corpus schema and validation tools
- Examples from our existing Cree (CRK) corpus
- A draft corpus design template
-
You provide:
- A PI or postdoc to lead the linguistic work
- Access to bilingual speakers (or a plan to recruit them)
- Information about available resources (FST, dictionary, existing corpora)
- Institutional approval for data governance (OCAP® compliance or equivalent)
-
We co-design the corpus — domain selection, difficulty distribution, diagnostic tests, timeline, and budget.
-
Work begins. We check in weekly. The department has full autonomy over linguistic decisions; we handle all engineering.
8. Frequently Asked Questions
"We already have a parallel corpus. Can we use it?"
Yes — if the corpus has clear provenance, is human-authored, and the license permits use in evaluation. We'll help you format it to our schema, add missing metadata, and integrate it. Existing corpora can dramatically accelerate the timeline (skip Phase 2 or reduce it to a gap-fill exercise).
"We don't have an FST for our language."
That's fine. LYSS-fst (morphological validity) requires an FST, but the harness works without it using Profile B weights (chrF++, BLEU, COMET, behavioral metrics). If a GiellaLT FST exists for a related language, we may be able to adapt it. If not, the corpus still enables valuable evaluation — just without the morphological validity gate.
"Our speakers use a non-Latin script."
Fully supported. The corpus schema handles any Unicode script. We've designed for SRO (Standard Roman Orthography) and syllabics for Cree, but the same infrastructure works for Devanagari, Arabic script, CJK, Ethiopic, or any other writing system.
"What about dialect variation?"
Tag it. The corpus entry schema includes a notes field for dialectal information. If multiple dialects are represented, document them. The linter's equivalence classes (LYSS-eq) can be configured to accept dialectal variants as equivalent. The diagnostic test suite can include dialect-specific contrasts.
"Who owns the corpus?"
The language community, via the governance organization. The department is credited as a research partner. Champollion holds an escrow key share for operational continuity but cannot access the sealed data alone. The development segment is published under a Creative Commons license specified by the community.
"What if we want to stop?"
The community can revoke evaluation access at any time by refusing to reconstruct the encryption key. The sealed data is never exposed. The development segment, already published, remains public under its license. The department's research outputs (publications, presentations) are theirs to keep regardless.
"What if the governance organization doesn't exist yet?"
We can begin with Phases 1–3 (corpus design, creation, QA) without a governance org. The sealing (Phase 4) requires identifying a key custodian. In the interim, the department can serve as the co-custodian alongside the Champollion project, with the understanding that custody transfers to the community governance org when one is established.
Appendix: Tagging vs. Corpus Construction
This document covers corpus construction — creating the parallel text pairs that form the evaluation ground truth. Tagging (morphological annotation, interlinear glossing, FST tag strings) is a separate activity that enriches the corpus but is not required for basic evaluation.
| Activity | Required? | What It Enables |
|---|---|---|
| Corpus construction (this document) | ✅ Required | Basic evaluation: chrF++, exact match, COMET, behavioral metrics |
| FST coverage checking | 🟡 Optional | LYSS-fst morphological validity metric |
| Morphological annotation | 🟡 Optional | morphological_accuracy metric (Scoring Spec §2.2) |
| Linter equivalence rules | 🟡 Optional | LYSS-eq equivalent match metric |
| Semantic validator rules | 🟡 Optional | LYSS-sem semantic validation metric |
| Speaker quality ratings | Separate activity | Metric validation (see Speaker Validation Protocol) |
Tagging and speaker validation are covered by separate documents and can proceed in parallel with or after corpus construction.