Corpus Creation Guide
The idea: Before you can evaluate a translation method, you need an evaluation corpus. This guide covers how to build one from scratch — data sourcing, format requirements, quality standards, licensing, and contributing to the Arena.
:::info This is not a translation method This guide is the prerequisite for many methods. A good evaluation corpus is the foundation that makes everything else possible. Even 50 curated pairs are enough to open a new leaderboard track. :::
When to Use This
- You want to add a new language pair to the Arena leaderboard
- You're a language teacher who wants to benchmark student translations
- You're a community language worker with access to bilingual materials
- You're a researcher who needs a standardized evaluation set for your language pair
Corpus Format
The harness takes simple JSON:
{
"metadata": {
"name": "Quechua Dev v1",
"version": "1.0.0",
"source_language": "eng",
"target_language": "que",
"entry_count": 75,
"license": "CC-BY-SA-4.0",
"author": "Your Name / Organization",
"description": "75 English-Quechua pairs from educational materials"
},
"entries": [
{
"id": 1,
"source": "Hello, how are you?",
"reference": "Allillanchu, imaynallan kashanki?"
},
{
"id": 2,
"source": "The sun is shining today",
"reference": "Kunan p'unchay inti k'anchashan"
}
]
}
Where to Source Data
| Source | Quality | Volume | Licensing |
|---|---|---|---|
| Textbooks / educational materials | High (expert-reviewed) | Low-medium | Check with publisher |
| Government documents | Medium (formal register) | Medium-high | Often public domain |
| Bilingual dictionaries | High (verified entries) | Medium | Varies |
| Community elders / speakers | Highest (native intuition) | Low (limited time) | Community-governed |
| Religious texts | Medium (domain-specific) | High | Usually open |
| Existing corpora (Hansard, FLORES) | Medium-high | High | Check license |
| Hand-crafted | Highest | Low | You own it |
Quality Standards
A good evaluation corpus has:
- Diverse content — not just greetings or simple phrases. Include questions, commands, complex sentences, domain-specific terms
- Verified translations — reviewed by at least one fluent speaker, ideally two
- Consistent orthography — one script, one spelling convention throughout
- Independent sources — not derived from the same text that methods will train on
- Clear licensing — explicit license that allows evaluation use
:::danger Corpus contamination The evaluation corpus must be independent of any training data. If a method was trained or prompted with data from the evaluation corpus, it will be disqualified. Design your corpus to be held-out from day one. :::
Size Guidelines
| Size | What It Enables |
|---|---|
| 50 entries | Minimal viable evaluation — enough to detect gross quality differences |
| 100–200 entries | Reliable ranking — enough for statistical significance between methods |
| 500+ entries | Research-grade — robust composite scores, confidence intervals |
| 1,000+ entries | Gold standard — equivalent to FLORES devtest coverage |
Start small. 50 entries is enough to open a leaderboard track. You can expand later.
Contributing to the Arena
- Create your corpus in the JSON format above
- License it — CC BY-SA 4.0 is recommended for open evaluation; CC BY-NC-SA 4.0 for restricted use
- Submit a PR to the eval harness repo with your corpus in
data/ - The leaderboard opens automatically for your language pair once the corpus is merged
For Indigenous Language Communities
Corpus creation is an act of language sovereignty. Your corpus, your terms:
- You decide the license and access conditions
- You can contribute a public development set (for method development) while keeping a secret test set (for official evaluation) under community control
- The sovereignty framework protects your data at every level
Even a small corpus is a strategic asset — it's the benchmark that decides what "good enough" means for your language.
Combines Well With
- Partial Translation — creating a corpus IS the human translation step
- Back-Translation — synthetic data supplements human-created corpora
- Every other cookbook — they all need an evaluation corpus
See Also
- Evaluation Datasets — existing corpora (EDTeKLA, FLORES+)
- Data Sovereignty — ownership and control
- For Language Communities — community engagement
- Support a Low-Resource Language — the big picture