Prize Specification

Purpose. This document defines the prize pool structure, threshold conditions, claim process, and rules for the MT Eval Arena. It specifies exactly what "capable of machine translation" means in measurable terms, and under what conditions prize money is released. This document references SCORING_SPEC for metric definitions and BENCHMARK_SPEC for evaluation protocol — it does not duplicate them.

Status: LIVE. The Founder's Prize (§2.1) is funded and active.

Last updated: 2026-06-04

1. Philosophy

1.1 Prizes Reward Breakthroughs, Not Participation

Prize money is released only when a method demonstrably achieves a defined capability threshold. There are no participation prizes, runner-up awards, or consolation payouts. If nobody clears the bar, nobody gets paid. This is by design — it means sponsors only pay for results that actually work.

1.2 Community Validation Is Non-Negotiable

Automated metrics are proxies (SCORING_SPEC §1.1). A method can score well on chrF++ and FST acceptance while producing output that no speaker would accept. Every prize claim requires community validation — bilingual speakers must confirm the output is usable. This is the human validation gate (BENCHMARK_SPEC §7).

1.3 Ownership Transfer Is Part of the Deal

Methods that claim a prize are subject to the ownership transfer clause (BENCHMARK_SPEC §8.3). The developer keeps attribution and publication rights. The governance org gains the right to use, modify, distribute, and monetize the method for their language. This is not a penalty — it's the point. Prize money funds the creation of technology that belongs to the language community.

1.4 Anti-Gaming

Prize thresholds are defined against gold-standard evaluation (secret test set, run by governance org in sandbox). Developers never see the test data. This is architecturally enforced — not a policy that relies on honor. See BENCHMARK_SPEC §8.2.

1.5 Corpus Licensing: Non-Commercial Corpora Stay Out of the Prize Lane

Some corpora used during method development carry non-commercial licenses — for example, the EdTeKLA Cree Language Textbook corpus is CC BY-NC-SA 4.0. These corpora are research/development-lane only:

Prize gold-standard corpora must not embed NC-licensed corpus content. Gold-standard test segments are community-commissioned originals (see Corpus Partnership Strategy) — human-authored for the prize, with rights cleared for evaluation and commercial deployment from the start.
A method that claims a prize must not embed NC-licensed corpus content (e.g., as coaching data, embedded examples, or lookup tables). The transferred method is intended for commercial deployment by the governance org (BENCHMARK_SPEC §8.3, Method Submission Agreement §6); NC-licensed content inside it would poison that deployment.
Developers may freely use NC-licensed corpora to develop and self-evaluate — that is what the development lane is for. The restriction applies to what is submitted and what is deployed, not to how a developer learns.

1.6 Dependency Classes Gate Prize Eligibility

All prize evaluation happens in a sandbox (§1.4), and prize-winning methods transfer to the governance org (§1.3). Both facts impose the same constraint: everything a method depends on must be something the developer has the right to put in the sandbox and convey to the community. Every submission declares a dependency class — defined in the Method Interface spec, with admissibility terms in the Method Submission Agreement §2.6 — and eligibility follows the class:

Dependency class	Prize-eligible?	Conditions
S — self-contained	✅ Yes	None beyond the threshold conditions in §2
O — open external (e.g., AGPL FST mirrored at submission)	✅ Yes	Artifacts pinned and vendored into the submission; licenses permit community transfer; copyleft terms preserved (the community receives the same rights the license grants everyone)
A1 — substitutable LLM inference	⚠️ Conditional	Model declared, pinned, and substitutable (must run against a community-hosted open-weight model); evaluation routed through the sandbox LLM gateway (🔲 planned — A1 methods cannot produce gold-standard scores until the gateway is operational); transfer conveys the full recipe (prompts, coaching, code), not the model
A2 — non-substitutable external data/service API	❌ Not yet	Ineligible until the rights holder grants sandbox-inclusion and transfer permissions. Allowed on the open leaderboard with a visible "external dependency" flag
X — bundled content without rights	❌ Never	Inadmissible in every lane

A method's class is the most restrictive class among its declared dependencies. Undeclared dependencies of any class are disqualifying (§5).

2. Active Prize Pools

2.1 The Founder's Prize — EN→Plains Cree (nêhiyawêwin)

Field	Value
Prize pool	$10,000 CAD
Language pair	English → Plains Cree (EN→CRK)
Funded by	Champollion project founder
Status	ACTIVE — accepting submissions
Opens	When gold-standard corpus + governance org are in place
Expires	No expiry. Prize remains active until claimed or explicitly withdrawn.

Threshold Conditions

A method claims the Founder's Prize by meeting ALL of the following conditions simultaneously:

#	Condition	Metric	Threshold	Rationale
1	Composite score	`composite` (SCORING_SPEC §4)	≥ 0.80	Between Deployable (0.70) and Fluent (0.85). Requires high quality across all metric dimensions — not just morphological validity.
2	FST acceptance	`fst_acceptance_rate` (SCORING_SPEC §2.2)	≥ 0.99 (99%+)	Effectively all output words must be morphologically valid forms recognized by the GiellaLT FST. The 1% tolerance accounts for edge cases (proper nouns, neologisms, loanwords) that the FST may legitimately not cover. This is the defining quality gate for polysynthetic MT — if the FST rejects more than 1% of words, the method is producing forms that do not exist in the language. The entire point of this prize is to buy a system that doesn't mangle things.
3	chrF++	`chrf_plus_plus` (SCORING_SPEC §2.1)	≥ 55.0	Character n-gram overlap must exceed 55 on the 0–100 scale. Ensures surface-level similarity to reference translations, not just morphological validity.
4	Community validation	Human review (BENCHMARK_SPEC §7)	≥ 70% "acceptable" or "excellent"	A stratified sample of outputs (≥30 entries across difficulty tiers 2–5) is reviewed by ≥2 bilingual CRK speakers. At least 70% of reviewed entries must receive an "acceptable" or "excellent" rating.
5	Gold-standard evaluation	Sandbox execution (BENCHMARK_SPEC §8.2)	Required	All automated metrics must be computed against the `gold_standard` corpus segment, run by the governance org in a sandboxed environment. Development-set scores do not count.
6	Reproducibility	Fingerprint match (BENCHMARK_SPEC §3.8)	±2%	The governance org must be able to re-run the method and achieve scores within ±2% of the submitted run card.

Why 99+% FST? The central problem in machine translation for polysynthetic languages is hallucination — LLMs produce strings that look like the target language but are morphologically invalid. A method that produces 95% valid output still has 5% fabricated words — unacceptable noise for any production use. The 99%+ threshold demands near-zero hallucination while allowing for the rare edge case (a proper noun the FST doesn't know, a legitimate neologism). If a method cannot achieve 99%+ FST acceptance, it has not solved the problem.

Why 0.80 composite? This sits between Deployable (0.70) and Fluent (0.85). A method at 0.80 with 99%+ FST acceptance produces output where virtually every word is a real Cree word and the overall translation quality is high across surface, structural, and semantic dimensions. The community validation gate (condition #4) ensures this isn't just metric gaming — speakers must confirm the output is genuinely usable.

What This Threshold Means in Practice

At composite ≥ 0.80 with FST ≥ 0.99 and chrF++ ≥ 55, a bilingual speaker would typically see:

Virtually every output word is a real Cree word (FST validates 99%+ — near-zero hallucinated forms)
Major grammatical categories (person, number, tense) are correct in most entries
Word order is generally natural
Meaning is preserved reliably
Remaining errors are real-language errors (wrong inflection, incorrect obviation, animacy mismatches) — not fabricated words
A fluent speaker could use the output as a high-quality draft and correct it significantly faster than translating from scratch

This is a system that does not mangle the language. It may not be perfect, but every word it produces is a real word. That is the minimum bar for respectful machine translation of a polysynthetic language.

3. Prize Claim Process

3.1 Submission

Developer submits their complete, runnable method to the governance org:
- All source code
- All dependencies (coaching data, dictionaries, FST configs, prompts)
- Installation and execution instructions
- A README describing the method's approach
- A development-set run card showing approximate scores (for pre-screening)
Developer signs the terms of participation, including:
- Ownership transfer clause (BENCHMARK_SPEC §8.3)
- No training on evaluation data declaration
- Reproducibility commitment

3.2 Evaluation

Governance org installs and runs the method in a sandboxed harness against the gold_standard corpus
Automated metrics are computed (composite, FST, chrF++, etc.)
If automated thresholds are met (conditions 1–3), governance org proceeds to community review
If automated thresholds are NOT met, developer receives scores and feedback. No community review is triggered.

3.3 Community Review

A stratified sample of outputs (≥30 entries, covering difficulty tiers 2–5) is presented to bilingual speakers
At minimum 2 independent reviewers rate each entry
Rating scale: reject / gist / acceptable / excellent
If ≥70% of entries receive "acceptable" or "excellent" from both reviewers, community validation passes

3.4 Payout

All 6 conditions are met
Governance org confirms result
Prize is paid within 30 days of confirmation
Method ownership transfers per BENCHMARK_SPEC §8.3
Result is published on the leaderboard with "Community Validated" verification tier

3.5 Multiple Submissions

The same developer/team may submit multiple times
Each submission is evaluated independently
If a method is improved and re-submitted, only the latest run card counts
The prize is awarded to the first method that clears all thresholds — it is not split

3.6 Team Submissions

Teams and Elder-youth pairs are eligible
Prize distribution within a team is the team's responsibility
All team members must sign the terms of participation
Attribution on the leaderboard lists all team members

4. Future Prize Pools

The Founder's Prize is the seed. Additional prize pools are funded by sponsors. Each new prize pool is documented as a new subsection of §2 with its own:

Prize amount and currency
Language pair
Sponsor attribution
Threshold conditions (which may differ from the Founder's Prize)
Expiry date (if any)
Any special conditions

Sponsors fund prize pools at any amount. Suggested tiers:

Tier	Amount	Suggested Threshold
Seed	$5,000–$15,000	Deployable (composite ≥ 0.70) + community validation
Breakthrough	$25,000–$50,000	Fluent (composite ≥ 0.85) + community validation
Grand Prize	$100,000+	Fluent + multi-register coverage + deployment integration

Sponsors may also fund:

Improvement bounties — fixed payment for each 5-point improvement in chrF++ over the current best
Register prizes — separate awards for specific registers (formal, ceremonial, educational)
Speed prizes — best cost-adjusted score (SCORING_SPEC §6.3)

4.2 Prize Pool Escrow

All prize funds are held in escrow (managed by the project or a designated trustee) until threshold conditions are met. If a prize expires without being claimed, funds are returned to the sponsor or redirected to a new prize pool at the sponsor's discretion.

5. Disqualification

A submission is disqualified if:

Training on evaluation data. Method was exposed to gold_standard or held_out corpus entries. (Architecturally prevented by sandboxed execution — but if evidence of contamination is found, the result is voided.)
Non-reproducible. Governance org cannot reproduce scores within ±2%.
Undeclared or ineligible dependencies. The method requires runtime access to external services beyond what its dependency manifest declares, or its effective dependency class is A2 or X (§1.6). Declared Class A1 LLM inference routed through the evaluation gateway is permitted; any other runtime network dependency — and any undeclared dependency of any class — is disqualifying.
Terms of participation not signed. All team members must agree to ownership transfer.
Gaming detected. Output is optimized for the metric rather than translation quality (caught by community review and/or anti-gaming checks per BENCHMARK_SPEC §9.3).

6. Relationship to Other Specs

This Document	References	For
§2 threshold conditions	SCORING_SPEC §4 (composite), §2.1–2.2 (metrics), §5 (tiers)	Metric definitions and scale
§2 community validation	BENCHMARK_SPEC §7	Human review protocol
§3 sandbox execution	BENCHMARK_SPEC §8.2	Sovereignty mechanism
§3 ownership transfer	BENCHMARK_SPEC §8.3	IP transfer terms
§1.6 dependency classes	Method Interface spec; Method Submission Agreement §2.6; BENCHMARK_SPEC §8.6	Class definitions, admissibility terms, sandbox network policy
§4 cost-adjusted prizes	SCORING_SPEC §6.3	Cost-adjusted formula

7. Code–Spec Synchronization

7.1 Canonical Source

This document (arena/website/docs/specifications/prize-spec.md) is the canonical source for:

Prize pool definitions (§2)
Threshold conditions (§2.x)
Claim process (§3)
Disqualification rules (§5)

7.2 Implementation Requirements

When a prize pool is activated:

The leaderboard UI must display active prizes and their threshold conditions
Run cards that meet automated thresholds (conditions 1–3) must be flagged for community review
The quality_tier field in the run card schema already captures the tier ("deployable", "fluent")
No new code changes to the harness are needed — the prize spec is a policy layer on top of existing scoring

Prize structures must be compatible with ownership transfer terms. The winner can claim the prize, but the method becomes the governance org's property if it reaches Deployable tier. This is by design — the prize funds the creation of technology that belongs to the language community.

1. Philosophy​

1.1 Prizes Reward Breakthroughs, Not Participation​

1.2 Community Validation Is Non-Negotiable​

1.3 Ownership Transfer Is Part of the Deal​

1.4 Anti-Gaming​

1.5 Corpus Licensing: Non-Commercial Corpora Stay Out of the Prize Lane​

1.6 Dependency Classes Gate Prize Eligibility​

2. Active Prize Pools​

2.1 The Founder's Prize — EN→Plains Cree (nêhiyawêwin)​

Threshold Conditions​

What This Threshold Means in Practice​

3. Prize Claim Process​

3.1 Submission​

3.2 Evaluation​

3.3 Community Review​

3.4 Payout​

3.5 Multiple Submissions​

3.6 Team Submissions​

4. Future Prize Pools​

4.1 Sponsor Prize Template​

4.2 Prize Pool Escrow​

5. Disqualification​

6. Relationship to Other Specs​

7. Code–Spec Synchronization​

7.1 Canonical Source​

7.2 Implementation Requirements​