Skip to main content

Corpus Creation Guide

The idea: Before you can evaluate a translation method, you need an evaluation corpus. This guide covers how to build one from scratch — data sourcing, format requirements, quality standards, licensing, and contributing to the Arena.

:::info This is not a translation method This guide is the prerequisite for many methods. A good evaluation corpus is the foundation that makes everything else possible. Even 50 curated pairs are enough to open a new leaderboard track. :::

When to Use This

  • You want to add a new language pair to the Arena leaderboard
  • You're a language teacher who wants to benchmark student translations
  • You're a community language worker with access to bilingual materials
  • You're a researcher who needs a standardized evaluation set for your language pair

Corpus Format

The harness takes simple JSON:

my-corpus.json
{
"metadata": {
"name": "Quechua Dev v1",
"version": "1.0.0",
"source_language": "eng",
"target_language": "que",
"entry_count": 75,
"license": "CC-BY-SA-4.0",
"author": "Your Name / Organization",
"description": "75 English-Quechua pairs from educational materials"
},
"entries": [
{
"id": 1,
"source": "Hello, how are you?",
"reference": "Allillanchu, imaynallan kashanki?"
},
{
"id": 2,
"source": "The sun is shining today",
"reference": "Kunan p'unchay inti k'anchashan"
}
]
}

Where to Source Data

SourceQualityVolumeLicensing
Textbooks / educational materialsHigh (expert-reviewed)Low-mediumCheck with publisher
Government documentsMedium (formal register)Medium-highOften public domain
Bilingual dictionariesHigh (verified entries)MediumVaries
Community elders / speakersHighest (native intuition)Low (limited time)Community-governed
Religious textsMedium (domain-specific)HighUsually open
Existing corpora (Hansard, FLORES)Medium-highHighCheck license
Hand-craftedHighestLowYou own it

Quality Standards

A good evaluation corpus has:

  1. Diverse content — not just greetings or simple phrases. Include questions, commands, complex sentences, domain-specific terms
  2. Verified translations — reviewed by at least one fluent speaker, ideally two
  3. Consistent orthography — one script, one spelling convention throughout
  4. Independent sources — not derived from the same text that methods will train on
  5. Clear licensing — explicit license that allows evaluation use

:::danger Corpus contamination The evaluation corpus must be independent of any training data. If a method was trained or prompted with data from the evaluation corpus, it will be disqualified. Design your corpus to be held-out from day one. :::

Size Guidelines

SizeWhat It Enables
50 entriesMinimal viable evaluation — enough to detect gross quality differences
100–200 entriesReliable ranking — enough for statistical significance between methods
500+ entriesResearch-grade — robust composite scores, confidence intervals
1,000+ entriesGold standard — equivalent to FLORES devtest coverage

Start small. 50 entries is enough to open a leaderboard track. You can expand later.

Contributing to the Arena

  1. Create your corpus in the JSON format above
  2. License it — CC BY-SA 4.0 is recommended for open evaluation; CC BY-NC-SA 4.0 for restricted use
  3. Submit a PR to the eval harness repo with your corpus in data/
  4. The leaderboard opens automatically for your language pair once the corpus is merged

For Indigenous Language Communities

Corpus creation is an act of language sovereignty. Your corpus, your terms:

  • You decide the license and access conditions
  • You can contribute a public development set (for method development) while keeping a secret test set (for official evaluation) under community control
  • The sovereignty framework protects your data at every level

Even a small corpus is a strategic asset — it's the benchmark that decides what "good enough" means for your language.

Combines Well With

  • Partial Translation — creating a corpus IS the human translation step
  • Back-Translation — synthetic data supplements human-created corpora
  • Every other cookbook — they all need an evaluation corpus

See Also