Scoring Specification

Executive Summary. This is the single source of truth for all evaluation metrics, composite scoring, quality tiers, and cost analysis in the Rosetta MT evaluation ecosystem. Every metric computed by the harness, every weight in the composite formula, and every tier threshold is defined here — and only here. Code, documentation, and database schemas derive from this document. When they conflict, this document is authoritative.

Scope. This document defines what we measure and how we score it. It does not define the run card schema (see Benchmark Specification §3), the benchmark protocol (Benchmark Specification §6), or the leaderboard rules (see arena docs). Those documents reference this one for metric definitions and scoring logic.

Last updated: 2026-05-26


1. Scoring Philosophy

1.1 Automated Metrics Are Proxies

All metrics defined here are machine-computed. They are useful for rapid iteration, systematic comparison, and detecting regressions, but they are not substitutes for human judgment. The quality tiers in §5 are heuristic labels; only human review can confirm actual usability.

1.2 Multi-Signal Design

No single metric captures translation quality. A translation can score high on chrF++ overlap yet fail morphological validation. It can pass FST checks but carry the wrong meaning. It can be semantically accurate but stylistically alien to the target language. The composite score in §4 aggregates multiple independent signals, each capturing a different dimension of quality.

1.3 Extensibility

This metric inventory is not closed. New languages bring new requirements: tone accuracy for tonal languages, diacritical precision for Semitic scripts, syllabary correctness for Cree. The architecture (MetricPlugin protocol, weighted composite with re-normalization) is designed for metrics to be added without breaking existing scores.

1.4 Three Dimensions of Evaluation

Every run card measures three independent dimensions:

Quality — How good is the translation? (composite score, §4)
Cost — How much does it cost? (cost metrics, §6)
Speed — How fast does it run? (speed metrics, §7)

These are independent axes. A method can be high-quality but expensive, fast but inaccurate, or any combination. The leaderboard enables sorting by any dimension. The cost-adjusted score (§6.3) is the only metric that combines dimensions.


2. Metric Inventory

Metrics are organized into five categories (§2.1–§2.5). Each metric has an implementation status, scale, and level (per-entry, corpus-level, or both).

2.1 Surface Metrics

Surface metrics compare the predicted translation to the reference translation at the string level. They require no linguistic tools — just string comparison.

ID | Metric | Status | Scale | Level | Implementation
exact_match_rate | Exact Match | ✅ Implemented | 0.0–1.0 | Both | Binary: does predicted == reference? Corpus rate = matches / total.
equivalent_match_rate | Equivalent Match | ⚡ Partial | 0.0–1.0 | Both | Does the predicted output match any accepted variant? For CRK: implemented via CrkLinterMetric plugin using deterministic variant-class rules (word order, orthographic, optional particle, lemma synonym, progressive ambiguity). Generic cross-language implementation requires per-entry variants[] in corpus.
chrf_plus_plus | chrF++ | ✅ Implemented | 0–100 | Both | Character n-gram F-score (sacrebleu). Robust to morphological variation. The primary surface metric for agglutinative/polysynthetic languages. Per-entry uses sentence_chrf; corpus uses corpus_chrf.
bleu | BLEU | ✅ Implemented | 0–100 | Corpus | Word-level n-gram precision (sacrebleu). Excluded from composite: word-level scoring penalizes morphological variation unfairly. Computed and reported for compatibility with MT literature.
ter | Translation Edit Rate | 🔲 Planned | 0–∞ (lower is better) | Both | Minimum edit distance between predicted and reference, normalized by reference length (sacrebleu corpus_ter). Already available in our sacrebleu dependency.
length_ratio | Length Ratio | 🔲 Planned | 0–∞ (1.0 is ideal) | Both | len(predicted) / len(reference) in characters. Detects truncation (<0.5) and inflation/hallucination (>2.0). Trivial to implement.
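
The two simplest surface metrics can be sketched in a few lines. This is a minimal illustration rather than the harness code: the function names are hypothetical, and it assumes strings are compared verbatim with no whitespace or Unicode normalization, which the table above does not specify.

```python
from typing import Iterable, Optional, Tuple


def exact_match_rate(pairs: Iterable[Tuple[str, str]]) -> Optional[float]:
    """Corpus rate = matches / total over (predicted, reference) pairs."""
    pairs = list(pairs)
    if not pairs:
        return None  # assumption: an empty corpus has no rate
    matches = sum(1 for predicted, reference in pairs if predicted == reference)
    return matches / len(pairs)


def length_ratio(predicted: str, reference: str) -> Optional[float]:
    """len(predicted) / len(reference), measured in characters."""
    if not reference:
        return None  # assumption: undefined for an empty reference
    return len(predicted) / len(reference)


def length_flag(ratio: Optional[float]) -> str:
    """Apply the diagnostic cut-offs from the table: <0.5 and >2.0."""
    if ratio is None:
        return "undefined"
    if ratio < 0.5:
        return "truncation"
    if ratio > 2.0:
        return "inflation"
    return "ok"
```

Values between 0.5 and 2.0 inclusive are treated as unremarkable here, matching the strict inequalities in the table.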

2.2 Structural Metrics

Structural metrics validate the linguistic well-formedness of the translation. They require language-specific tools (FST analyzers, morphological parsers) and are the strongest signals for morphologically rich languages.

ID | Metric | Status | Scale | Level | Implementation
fst_acceptance_rate | FST Acceptance | ✅ Implemented | 0.0–1.0 | Both | Proportion of output words accepted by a finite-state transducer (GiellaLT). A word is "valid" if the FST returns at least one morphological analysis. Available for any language with a GiellaLT .hfstol analyzer.
morphological_accuracy | Morphological Accuracy | 🔲 Planned | 0.0–1.0 | Both | A word can be FST-valid but have the wrong inflection (right root, wrong suffix). This metric compares the FST analysis of the predicted word against the expected morphological features. Requires per-entry morphological annotations in the corpus.
orthographic_accuracy | Orthographic Accuracy | 🔲 Planned | 0.0–1.0 | Both | Validates script-specific correctness: SRO macron/circumflex usage for Cree, diacritical marks for Inuktitut, vowel length markers for Ojibwe. Per-language rule sets.
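
The acceptance rule ("valid" means at least one analysis) can be sketched independently of any FST library. In the sketch below the analyzer is passed in as a plain callable; that interface, the function names, and the toy lexicon are all assumptions for illustration and do not describe GiellaLTFSTMetric.

```python
from typing import Callable, Dict, Iterable, List, Optional, Sequence

# Hypothetical interface: an analyzer maps a word to its list of analyses.
Analyzer = Callable[[str], Sequence[str]]


def fst_acceptance_rate(words: Iterable[str], analyze: Analyzer) -> Optional[float]:
    """Proportion of words for which the analyzer returns >= 1 analysis."""
    words = list(words)
    if not words:
        return None  # assumption: no words means no rate
    accepted = sum(1 for word in words if len(analyze(word)) > 0)
    return accepted / len(words)


# Toy stand-in for a real transducer; the entries are made up.
TOY_LEXICON: Dict[str, List[str]] = {
    "alpha": ["alpha+N+Sg"],
    "beta": ["beta+N+Sg", "beta+V+Imp"],
}


def toy_analyze(word: str) -> Sequence[str]:
    """Return the toy analyses for a word, or an empty list if unknown."""
    return TOY_LEXICON.get(word, [])
```

Injecting the analyzer keeps the rate computation testable without an installed .hfstol file.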

2.3 Semantic Metrics

Semantic metrics measure meaning preservation using embeddings or learned models. They catch translations that are surface-different but meaning-equivalent, and flag translations that are surface-similar but semantically wrong.

ID | Metric | Status | Scale | Level | Implementation
semantic_score | Semantic Similarity | ⚡ Partial | 0.0–1.0 | Both | CRK: verdict-weighted score from CrkSemanticMetric (proxy). Universal: cosine similarity of sentence embeddings (source + predicted vs source + reference). Model TBD; it must support low-resource languages, which rules out most English-centric embedding models.
comet_score | COMET | ✅ Implemented | ~0.0–1.0 | Both | Learned MT evaluation metric (Unbabel). Trained on human quality judgments. Excluded from composite: training data is biased toward high-resource European languages; scores for LRLs are unreliable. Computed when unbabel-comet is installed. Reported with a low-resource warning flag.

Why COMET is excluded from the composite. COMET is trained on WMT human evaluation data, which is overwhelmingly high-resource European language pairs. When applied to Plains Cree or other LRLs, the model's internal representations have no exposure to those languages — it's extrapolating from languages with fundamentally different morphological systems. The scores are still directionally useful (higher COMET ≈ more fluent-sounding output in general) but the absolute values are not calibrated. We report COMET for transparency but don't let it influence the composite score until we can validate it against human judgments for each target language.

2.4 Behavioral Metrics

Behavioral metrics detect specific failure modes in translation output. They don't measure quality directly — they detect problems.

ID | Metric | Status | Scale | Level | Implementation
code_switching_rate | Code-Switching Rate | 🔲 Planned | 0.0–1.0 (lower is better) | Both | Proportion of output words that are in the source language (typically English). Detected via Unicode script analysis and/or a source-language word list. Very common LLM failure mode: the model inserts English words when it doesn't know the target-language equivalent.
hallucination_rate | Hallucination Rate | 🔲 Planned | 0.0–1.0 (lower is better) | Both | Proportion of output content that has no corresponding source content. Detected via word alignment or cross-lingual embedding overlap. Catches the model generating plausible-sounding but fabricated translations.
terminology_adherence | Terminology Adherence | 🔲 Planned | 0.0–1.0 | Both | For coached methods: proportion of prescribed terminology terms that appear in the output. Requires coaching dictionary data. Measures whether the model respects expert-provided vocabulary.
consistency_score | Cross-Entry Consistency | 🔲 Planned | 0.0–1.0 | Corpus only | Does the model translate the same source term the same way across entries? Low consistency suggests the model is guessing rather than applying learned patterns. Requires repeated terms across corpus entries.
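
Since code_switching_rate is still planned, the following is only one possible shape for the word-list variant described above. The function name, the tokenization rule, and the word list are assumptions; Unicode script analysis is not covered.

```python
import re
from typing import Optional, Set


def code_switching_rate(output: str, source_words: Set[str]) -> Optional[float]:
    """Proportion of output tokens found in a source-language word list.

    Tokens are lowercase runs of word characters (an assumption; the spec
    does not fix a tokenizer). Lower is better.
    """
    tokens = re.findall(r"\w+", output.lower())
    if not tokens:
        return None  # assumption: empty output has no rate
    leaked = sum(1 for token in tokens if token in source_words)
    return leaked / len(tokens)
```

A plain word list will over-count any word form that legitimately exists in both languages, so a real implementation would likely need an allow-list or the script-based check as well.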

2.5 Compliance Metrics

Compliance metrics validate that translations preserve structural integrity — placeholders, formatting, and typography conventions. They are quality-gate checks, not quality scores.

ID | Metric | Status | Scale | Level | Implementation
compliance_index | Double-Pass Compliance | ✅ Implemented | 0.0–1.0 | Both | Weighted composite: 60% variable integrity (are {placeholder} vars preserved?) + 20% quote compliance (correct quote characters per language card) + 20% casing compliance (no Latin letter leakage for caseless languages). Computed on both raw and post-processed output. Via DoublePassCompliancePlugin.
repair_effectiveness | Repair Effectiveness | ✅ Implemented | 0.0–1.0 | Corpus | Proportion of compliance violations that were automatically repaired by post-translation hooks. Measures how much the quality gate improved the raw output.
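
The 60/20/20 weighting can be illustrated as follows. This is a sketch under stated assumptions, not the DoublePassCompliancePlugin: the placeholder pattern, the function names, and the choice to score variable integrity as the fraction of source placeholders preserved are all hypothetical, and the quote and casing sub-scores are taken as given inputs.

```python
import re
from collections import Counter

# Assumption: placeholders look like {name} with no nested braces.
PLACEHOLDER = re.compile(r"\{[^{}]+\}")


def variable_integrity(source: str, translation: str) -> float:
    """Fraction of source placeholders that survive in the translation."""
    expected = Counter(PLACEHOLDER.findall(source))
    if not expected:
        return 1.0  # nothing to preserve
    found = Counter(PLACEHOLDER.findall(translation))
    preserved = sum(min(count, found[name]) for name, count in expected.items())
    return preserved / sum(expected.values())


def compliance_index(variable: float, quote: float, casing: float) -> float:
    """Weighted composite from the table: 60% / 20% / 20%."""
    return 0.6 * variable + 0.2 * quote + 0.2 * casing
```

For a source with {name} and {count} whose translation keeps only {name}, variable integrity is 0.5, and with perfect quote and casing scores the index is 0.6 × 0.5 + 0.2 + 0.2 = 0.7.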

Why compliance is not in the composite. Compliance metrics measure structural preservation (placeholders, quotes), not translation quality. A translation can be perfect linguistically but fail compliance because it dropped a {name} variable. These are quality gates — they block bad output from shipping, but they don't rank translation quality.


3. Metric Status Tiers

Every metric in §2 falls into one of four implementation tiers:

Tier | Meaning | Run Card Behavior
✅ Implemented | Code exists, tested, producing values in run cards today | Numeric value in run card
⚡ Partial | Language-specific proxy exists (e.g., CRK) but universal implementation is pending | Numeric value when proxy applies, null otherwise
🔲 Planned | Specified but not yet implemented | null in run card (field present, value absent)
💡 Proposed | Under discussion, not yet specified | Not in run card

A metric moves from Planned → Partial when:

  1. A language-specific implementation is merged and tested
  2. It produces values for at least one language pair
  3. The universal implementation remains pending (documented in this spec)

A metric moves from Partial → Implemented when:

  1. A language-agnostic implementation is merged and tested
  2. It produces values for any language pair without language-specific plugins
  3. This document is updated to reflect ✅ status

A metric moves from Planned → Implemented when:

  1. Implementation is merged and tested
  2. It has been validated on at least one real evaluation run
  3. This document is updated with its implementation details

A metric moves from Proposed → Planned when:

  1. Its definition, scale, and computation method are agreed upon
  2. It is added to this document with a 🔲 Planned status
  3. A null placeholder is added to the run card schema

4. Composite Score

4.1 Formula

The composite score is a weighted average of all available metrics, re-normalized so the weights of available metrics sum to 1.0:

composite = Σ (weight_i × value_i) / Σ weight_i

Both sums run over all available metrics. The denominator Σ weight_i is the re-normalization term.

A metric is "available" if its value in the run card is a number (not null). When a metric is unavailable — because the language has no FST, or because a metric is not yet implemented — its weight is redistributed proportionally across the remaining metrics.

This means the composite is always comparable within a run: it uses whatever metrics are available and normalizes accordingly. Cross-run comparison is valid when runs use the same set of available metrics.
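
A minimal sketch of the re-normalized weighted average follows. The function name and the dict-based interface are assumptions (the actual scoring.py may be organized differently); values are expected to be already normalized to 0.0–1.0 as required by §4.2.

```python
from typing import Mapping, Optional


def composite_score(
    values: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> Optional[float]:
    """Weighted average over metrics whose value is a number (not None).

    Dividing by the sum of the available weights redistributes the weight
    of unavailable metrics proportionally across the remaining ones.
    """
    available = {
        metric: value
        for metric, value in values.items()
        if metric in weights and value is not None
    }
    denominator = sum(weights[metric] for metric in available)
    if denominator == 0:
        return None  # no weighted metric available: composite is null
    numerator = sum(weights[metric] * value for metric, value in available.items())
    return numerator / denominator
```

With weights of 0.5, 0.3, and 0.2 and the second metric unavailable, values of 1.0 and 0.5 give (0.5 × 1.0 + 0.2 × 0.5) / 0.7 ≈ 0.857.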

4.2 Input Normalization

Before entering the composite formula, all metrics must be on a 0.0–1.0 scale where 1.0 = perfect:

Metric | Native Scale | Normalization
exact_match_rate | 0.0–1.0 | None (already normalized)
equivalent_match_rate | 0.0–1.0 | None
fst_acceptance_rate | 0.0–1.0 | None
morphological_accuracy | 0.0–1.0 | None
orthographic_accuracy | 0.0–1.0 | None (Profile B only, see §4.3)
chrf_plus_plus | 0–100 | Divide by 100
semantic_score | 0.0–1.0 | None
code_switching_rate | 0.0–1.0 (lower = better) | 1.0 - value (invert: 0% code-switching = 1.0)
hallucination_rate | 0.0–1.0 (lower = better) | 1.0 - value (invert)
terminology_adherence | 0.0–1.0 | None

Metrics excluded from the composite (bleu, comet_score, ter, length_ratio, consistency_score) are not normalized for this purpose.
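
The normalization rules above amount to a small lookup of per-metric transforms. The sketch below is illustrative only; the names NORMALIZERS and normalize are hypothetical.

```python
from typing import Callable, Dict, Optional

# Only metrics that need a transform are listed; all others pass through.
NORMALIZERS: Dict[str, Callable[[float], float]] = {
    "chrf_plus_plus": lambda value: value / 100.0,      # 0-100 -> 0.0-1.0
    "code_switching_rate": lambda value: 1.0 - value,   # lower is better
    "hallucination_rate": lambda value: 1.0 - value,    # lower is better
}


def normalize(metric: str, value: Optional[float]) -> Optional[float]:
    """Map a raw metric value onto the 0.0-1.0, higher-is-better scale."""
    if value is None:
        return None  # unavailable metrics stay unavailable
    transform = NORMALIZERS.get(metric)
    return transform(value) if transform else value
```

Keeping None as None matters: an unavailable inverted metric must not turn into 1.0 and silently inflate the composite.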

4.3 Weight Tables

Profile A: Languages WITH FST Coverage

For languages that have a GiellaLT finite-state transducer available. Structural metrics (fst_acceptance_rate and morphological_accuracy) carry 40% of the composite, reflecting the primacy of morphological correctness for polysynthetic/agglutinative languages.

Metric | Target Weight | Rationale
fst_acceptance_rate | 0.25 | Highest weight. If the FST rejects a word, it's not a valid form in the language, regardless of what other metrics say. Binary, structurally grounded.
morphological_accuracy | 0.15 | A word can be FST-valid but morphologically wrong (right root, wrong inflection). Together with FST, structural metrics carry 40%.
chrf_plus_plus | 0.15 | Character n-gram overlap: the best surface-level proxy for polysynthetic languages. Handles agglutinative morphology better than word-level metrics.
semantic_score | 0.15 | Meaning preservation when surface form diverges. Catches semantically wrong translations that pass structural checks.
equivalent_match_rate | 0.10 | Rewards acceptable variants, not just the one reference translation. Important for languages with flexible word order.
code_switching_rate | 0.05 | Penalizes source-language leakage. Inverted: 0% code-switching = 1.0.
terminology_adherence | 0.05 | Rewards coached methods that respect prescribed vocabulary. Only active when coaching data is present.
hallucination_rate | 0.05 | Penalizes fabricated content. Inverted: 0% hallucination = 1.0.
exact_match_rate | 0.05 | Lowest weight. Too strict for polysynthetic languages, where multiple correct translations exist. Kept as a ceiling check.

Total: 1.00. When metrics are unavailable, their weights are redistributed proportionally across available metrics. For example, if only fst_acceptance_rate, chrf_plus_plus, and exact_match_rate are available (current state), the effective weights become:

  • FST: 0.25/0.45 = 0.556
  • chrF++: 0.15/0.45 = 0.333
  • EM: 0.05/0.45 = 0.111
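
The arithmetic above can be reproduced with a short helper. The weights are copied from the Profile A table; the names PROFILE_A and effective_weights are hypothetical and not taken from scoring.py.

```python
from typing import Dict, Iterable

# Target weights from the Profile A table (languages with FST coverage).
PROFILE_A: Dict[str, float] = {
    "fst_acceptance_rate": 0.25,
    "morphological_accuracy": 0.15,
    "chrf_plus_plus": 0.15,
    "semantic_score": 0.15,
    "equivalent_match_rate": 0.10,
    "code_switching_rate": 0.05,
    "terminology_adherence": 0.05,
    "hallucination_rate": 0.05,
    "exact_match_rate": 0.05,
}


def effective_weights(weights: Dict[str, float], available: Iterable[str]) -> Dict[str, float]:
    """Re-normalize target weights over the metrics that are available."""
    available = list(available)
    total = sum(weights[metric] for metric in available)
    return {metric: weights[metric] / total for metric in available}


current = effective_weights(
    PROFILE_A, ["fst_acceptance_rate", "chrf_plus_plus", "exact_match_rate"]
)
```

Rounded to three decimals, current holds 0.556, 0.333, and 0.111, matching the bullets above.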

Profile B: Languages WITHOUT FST Coverage

For languages without morphological validation tools. The two leading signals, semantic_score and chrf_plus_plus, carry equal weight (0.25 each).

Metric | Target Weight | Rationale
semantic_score | 0.25 | Without structural validation, meaning preservation is the strongest available signal.
chrf_plus_plus | 0.25 | Without FST, character-level overlap becomes the primary surface check.
equivalent_match_rate | 0.15 | Variant matching provides structured quality assessment without requiring morphological tools.
exact_match_rate | 0.10 | Without FST, exact match carries more weight as the only structural validation proxy.
code_switching_rate | 0.10 | Source language leakage matters more when there's no FST to catch bad output.
terminology_adherence | 0.05 | Coached vocabulary compliance.
hallucination_rate | 0.05 | Fabricated content detection.
orthographic_accuracy | 0.05 | Script-specific correctness fills part of the gap left by absent FST.

Note on weight evolution. These weights are provisional and will be recalibrated as human validation data accumulates. The long-term goal is to derive weights empirically: which automated metrics best predict human quality judgments for each language family?

4.4 Adding a New Metric to the Composite

To add a new metric to the composite:

  1. Define it in §2 with status 🔲 Planned, including scale, level, and computation method.
  2. Implement it as a MetricPlugin (or in tester.py for core metrics).
  3. Add a null placeholder in the run card scores block.
  4. Assign it a target weight in §4.3 by adjusting existing weights downward. Weights must sum to 1.00.
  5. Update Benchmark Specification §3 (/docs/specifications/benchmark) if the run card schema changes.
  6. Update scoring.py weight tables (the code must mirror this document).
  7. Run a validation benchmark to confirm the metric produces sensible values on real data.
  8. Update this document to change status from 🔲 to ✅.

5. Quality Tiers

These tiers are heuristic labels on automated composite scores. They describe what the scores tend to mean in practice, based on human review of outputs at each level. They are not validated quality judgments — only human review can confirm actual usability.

Tier | Composite Range | What a Speaker Typically Sees
Baseline | 0.00–0.30 | Raw LLM output with no language-specific support. Morphology is mostly hallucinated.
Emerging | 0.30–0.50 | Some correct patterns appearing. Coaching is helping, but output is not reliable.
Functional | 0.50–0.70 | Output is recognizable to a speaker. Major grammatical categories usually correct. Frequent morphological errors.
Deployable | 0.70–0.85 | Suitable for draft translation with human review. Most morphology is correct.
Fluent | 0.85–1.00 | Approaching competent human translation. Errors are rare and minor.

Ranges are half-open: a score that falls exactly on a boundary belongs to the higher tier (see §5.1).

These tiers are provisional. They will be recalibrated as human validation data accumulates and we learn where the "a speaker finds this useful" threshold actually falls for each language. No method can claim Deployable or above without community review confirming bilingual speakers agree the output is usable.

5.1 Tier Thresholds (Machine-Readable)

For code implementations, the thresholds are (evaluated top-down, first match wins):

composite >= 0.85 → "fluent"
composite >= 0.70 → "deployable"
composite >= 0.50 → "functional"
composite >= 0.30 → "emerging"
composite >= 0.00 → "baseline"
composite is null → "unscored"
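
In code, the null case has to be checked before any numeric comparison. The following sketch mirrors the thresholds above; the names are hypothetical, and treating out-of-range input (negative or NaN) as "unscored" is an assumption the list does not cover.

```python
from typing import Optional

# Evaluated top-down; the first threshold the score meets wins.
TIER_THRESHOLDS = [
    (0.85, "fluent"),
    (0.70, "deployable"),
    (0.50, "functional"),
    (0.30, "emerging"),
    (0.00, "baseline"),
]


def quality_tier(composite: Optional[float]) -> str:
    """Map a composite score to its heuristic tier label."""
    if composite is None:
        return "unscored"
    for threshold, label in TIER_THRESHOLDS:
        if composite >= threshold:
            return label
    # Assumption: negative or NaN scores are outside the defined range.
    return "unscored"
```

Because each comparison is >=, a score of exactly 0.70 is labeled "deployable" and a score of exactly 0.30 is labeled "emerging".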

6. Cost Metrics

Cost metrics measure the financial efficiency of a translation method. They are reported separately from quality: cost does not influence the composite score and enters only the cost-adjusted secondary ranking (§6.3).

6.1 Token Metrics

ID | Metric | Computation
prompt_tokens | Total input tokens | Sum of usage.prompt_tokens across all API calls
completion_tokens | Total output tokens | Sum of usage.completion_tokens
reasoning_tokens | Chain-of-thought tokens | Sum of usage.completion_tokens_details.reasoning_tokens (0 for most models)
cached_tokens | Provider-cached tokens | Sum of usage.prompt_tokens_details.cached_tokens
total_tokens | Total tokens consumed | prompt_tokens + completion_tokens
tokens_per_entry | Average tokens per translation | total_tokens / entry_count
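
The aggregation in the table can be sketched as a single pass over per-call usage records. The field paths come from the table; representing each record as a plain dict, treating missing fields as 0, and the function name are assumptions.

```python
from typing import Any, Dict, Iterable, Mapping


def aggregate_tokens(usages: Iterable[Mapping[str, Any]], entry_count: int) -> Dict[str, Any]:
    """Sum token counts across API calls and derive the totals."""
    totals: Dict[str, Any] = {
        "prompt_tokens": 0,
        "completion_tokens": 0,
        "reasoning_tokens": 0,
        "cached_tokens": 0,
    }
    for usage in usages:
        totals["prompt_tokens"] += usage.get("prompt_tokens") or 0
        totals["completion_tokens"] += usage.get("completion_tokens") or 0
        # Detail blocks may be absent; treat them as empty (assumption).
        completion_details = usage.get("completion_tokens_details") or {}
        totals["reasoning_tokens"] += completion_details.get("reasoning_tokens") or 0
        prompt_details = usage.get("prompt_tokens_details") or {}
        totals["cached_tokens"] += prompt_details.get("cached_tokens") or 0
    totals["total_tokens"] = totals["prompt_tokens"] + totals["completion_tokens"]
    totals["tokens_per_entry"] = (
        totals["total_tokens"] / entry_count if entry_count else None
    )
    return totals
```

Note that reasoning_tokens and cached_tokens are tracked separately but are not added to total_tokens, which follows the table's definition of prompt_tokens + completion_tokens.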

6.2 Cost Metrics

ID | Metric | Computation | Use Case
total_cost_usd | Total run cost | Provider-reported pricing × token counts | "How much did this benchmark cost?"
cost_per_entry_usd | Cost per corpus entry | total_cost_usd / entry_count | Comparing methods on the same corpus
cost_per_1k_tokens | Cost per 1,000 tokens | total_cost_usd / total_tokens × 1000 | Universal LLM efficiency, comparable across corpora
cost_per_source_char | Cost per source character | total_cost_usd / total_source_chars | Comparable across languages with different tokenization

Why multiple cost metrics? An "entry" varies in length — a 3-word phrase costs less than a paragraph. cost_per_entry_usd is useful for comparing methods on the same corpus (same entries = same lengths = fair comparison). cost_per_1k_tokens is the standard LLM efficiency metric, comparable across corpora. cost_per_source_char normalizes for tokenization differences — the same sentence may tokenize into different numbers of tokens depending on the model's vocabulary.
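
The three derived cost metrics are simple ratios of the run total. A minimal sketch, with a hypothetical function name and with unavailable denominators mapped to None by assumption:

```python
from typing import Dict, Optional


def cost_metrics(
    total_cost_usd: float,
    entry_count: int,
    total_tokens: int,
    total_source_chars: Optional[int] = None,
) -> Dict[str, Optional[float]]:
    """Derive per-entry, per-1k-token, and per-character cost."""

    def ratio(denominator: Optional[int], scale: float = 1.0) -> Optional[float]:
        # A missing or zero denominator yields None rather than an error.
        if not denominator:
            return None
        return total_cost_usd / denominator * scale

    return {
        "total_cost_usd": total_cost_usd,
        "cost_per_entry_usd": ratio(entry_count),
        "cost_per_1k_tokens": ratio(total_tokens, 1000),
        "cost_per_source_char": ratio(total_source_chars),
    }
```

Applied to the sample run card in §9 (total cost 1.7114 USD, 62 entries, 201807 tokens), this gives a per-entry cost of about 0.027603 USD and a per-1k-token cost of about 0.00848 USD, consistent with the values shown there.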

6.3 Cost-Adjusted Score

For methods using paid APIs, we compute a secondary ranking:

cost_adjusted = composite / log2(1 + cost_per_entry_usd × 1000)

This rewards methods that achieve good scores efficiently. It uses cost_per_entry_usd (not per-token) because the cost-adjusted score is always computed within a single benchmark (same corpus), making per-entry comparison fair.

The cost-adjusted score is a secondary ranking — the primary leaderboard ranks by composite score. It answers a different question: "given a budget, which method gives the best results?"
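
The formula is undefined when cost_per_entry_usd is zero, because log2(1) = 0. The sketch below returns None in that case, which is an assumption consistent with the formula being scoped to paid APIs; the function name is hypothetical.

```python
import math
from typing import Optional


def cost_adjusted(composite: Optional[float], cost_per_entry_usd: Optional[float]) -> Optional[float]:
    """composite / log2(1 + cost_per_entry_usd * 1000)."""
    if composite is None or cost_per_entry_usd is None:
        return None
    if cost_per_entry_usd <= 0:
        return None  # assumption: free or unpriced runs get no adjusted score
    return composite / math.log2(1 + cost_per_entry_usd * 1000)
```

At 0.001 USD per entry the denominator is log2(2) = 1, so the adjusted score equals the composite. Below that cost the denominator drops under 1 and the adjusted score exceeds the composite, so the two scores should not be read on the same scale.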


7. Speed Metrics

Speed metrics measure the latency and throughput of a translation method. Like cost, speed does not influence the composite score.

ID | Metric | Computation | Level
elapsed_seconds | Wall-clock run duration | time_end - time_start | Run
avg_latency_seconds | Mean per-entry latency | Σ latency_s / n_entries | Corpus
median_latency_seconds | Median per-entry latency | 50th percentile of latency_s | Corpus
p95_latency_seconds | 95th percentile latency | 95th percentile of latency_s | Corpus
tokens_per_second | Throughput | total_tokens / elapsed_seconds | Run
entries_per_minute | Translation rate | entry_count / (elapsed_seconds / 60) | Run
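
These computations map directly onto the Python standard library. The table does not say which percentile interpolation to use, so the choice of statistics.quantiles with method="inclusive" below is an assumption, as is the function name.

```python
import statistics
from typing import Dict, Sequence


def speed_metrics(
    latencies: Sequence[float],
    elapsed_seconds: float,
    total_tokens: int,
) -> Dict[str, float]:
    """Compute latency and throughput figures for one run."""
    if not latencies or elapsed_seconds <= 0:
        raise ValueError("need at least one latency and a positive duration")
    if len(latencies) >= 2:
        # n=100 yields 99 cut points; index 94 is the 95th percentile.
        p95 = statistics.quantiles(latencies, n=100, method="inclusive")[94]
    else:
        p95 = latencies[0]
    return {
        "elapsed_seconds": elapsed_seconds,
        "avg_latency_seconds": statistics.mean(latencies),
        "median_latency_seconds": statistics.median(latencies),
        "p95_latency_seconds": p95,
        "tokens_per_second": total_tokens / elapsed_seconds,
        "entries_per_minute": len(latencies) / (elapsed_seconds / 60),
    }
```

Here entry_count is taken to be the number of recorded latencies, which assumes one latency value per entry.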

8. Confidence and Significance

8.1 Bootstrap Confidence Intervals

All key metrics support bootstrap confidence intervals (percentile method, n=1000 resamples, α=0.05):

Metric | CI Reported
chrf_plus_plus | chrf_ci_lower, chrf_ci_upper
exact_match_rate | exact_match_ci_lower, exact_match_ci_upper
fst_acceptance_rate | 🔲 Planned
comet_score | 🔲 Planned (metric function exists in metrics_comet.py but not wired into compute_all_cis())
composite | 🔲 Planned
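
A percentile bootstrap with the stated parameters (1000 resamples, α = 0.05) can be sketched as follows. This is not compute_all_cis(): the function names, the fixed seed, and the default statistic (mean of per-entry scores) are assumptions. For a corpus-level metric such as chrF++, the statistic callable would need to recompute the corpus score on each resample.

```python
import random
from typing import Callable, List, Sequence, Tuple


def mean(values: Sequence[float]) -> float:
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def bootstrap_ci(
    values: Sequence[float],
    statistic: Callable[[List[float]], float] = mean,
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-method confidence interval for a statistic."""
    rng = random.Random(seed)  # seeded so results are reproducible
    n = len(values)
    # Resample entries with replacement and recompute the statistic each time.
    stats = sorted(statistic(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return stats[lower_index], stats[upper_index]
```

For exact_match_rate, values would be the per-entry 0/1 outcomes, and the returned pair corresponds to the lower and upper CI fields.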

8.2 Paired Bootstrap Significance Tests

For comparing two methods, the harness computes paired bootstrap resampling tests:

H₀: The two methods perform equally on this corpus.
H₁: The two methods differ in performance on this corpus.

If the p-value is below 0.05 and the confidence interval of the difference excludes zero, the difference is statistically significant at the α = 0.05 level (95% confidence).
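
One way to implement such a test is to resample the per-entry score differences, keeping each entry's two scores paired. The sketch below is an illustration, not the harness implementation: the function name, the seed, and in particular the two-sided p-value convention (twice the smaller tail fraction of resampled mean differences on either side of zero) are assumptions.

```python
import random
from typing import Any, Dict, Sequence


def paired_bootstrap(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Dict[str, Any]:
    """Paired bootstrap test on per-entry scores of two methods."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both methods must be scored on the same entries")
    rng = random.Random(seed)
    # Pairing: one difference per corpus entry.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    boot = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    ci_lower = boot[int((alpha / 2) * n_resamples)]
    ci_upper = boot[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    at_or_below = sum(1 for d in boot if d <= 0) / n_resamples
    at_or_above = sum(1 for d in boot if d >= 0) / n_resamples
    p_value = min(1.0, 2 * min(at_or_below, at_or_above))
    significant = p_value < alpha and (ci_lower > 0 or ci_upper < 0)
    return {
        "mean_diff": sum(diffs) / n,
        "ci": (ci_lower, ci_upper),
        "p_value": p_value,
        "significant": significant,
    }
```

A positive mean_diff with significant set to True indicates that method A outscored method B on this corpus by more than resampling noise would explain.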


9. Run Card Scores Schema

This section defines the hierarchical structure of the scores block in a run card. This schema is derived from the metrics defined in §2–§7 and must be kept in sync.

{
  "scores": {
    // §2.1 Surface metrics
    "exact_match_rate": 0.6613, // 0.0–1.0
    "exact_matches": 41, // count
    "equivalent_match_rate": 0.7258, // ⚡ partial (CRK: CrkLinterMetric)
    "equivalent_matches": 45, // ⚡ partial (CRK: CrkLinterMetric)
    "chrf_plus_plus": 80.65, // 0–100 (sacrebleu native scale)
    "bleu": 54.78, // 0–100, NOT in composite
    "ter": null, // 🔲 planned, 0–∞ (lower=better)
    "length_ratio": null, // 🔲 planned, ideal=1.0

    // §2.2 Structural metrics
    "fst_acceptance_rate": 1.0, // 0.0–1.0
    "fst_accepted": 74, // count
    "morphological_accuracy": null, // 🔲 planned
    "orthographic_accuracy": null, // 🔲 planned

    // §2.3 Semantic metrics
    "semantic_score": 0.6842, // ⚡ partial (CRK: CrkSemanticMetric)
    "comet_score": null, // nullable, NOT in composite
    "comet_model": "", // model ID used for COMET

    // §2.4 Behavioral metrics
    "code_switching_rate": null, // 🔲 planned (lower=better)
    "hallucination_rate": null, // 🔲 planned (lower=better)
    "terminology_adherence": null, // 🔲 planned
    "consistency_score": null, // 🔲 planned

    // §4 Composite
    "composite": 0.8988, // 0.0–1.0
    "quality_tier": "fluent", // §5 tier label
    "cost_adjusted": null, // §6.3 secondary ranking

    // §8.1 Confidence intervals
    "confidence_intervals": {
      "chrf_plus_plus": { "ci_lower": 78.2, "ci_upper": 83.1 },
      "exact_match_rate": { "ci_lower": 0.54, "ci_upper": 0.78 }
    },

    // Breakdowns
    "by_difficulty": {}, // scores grouped by difficulty tier
    "by_provenance": {}, // scores grouped by entry provenance

    // Counts
    "total": 62,
    "evaluated": 62,
    "errors": 0
  },

  "cost": {
    "total_cost_usd": 1.7114,
    "cost_per_entry_usd": 0.027603,
    "cost_per_1k_tokens": 0.00848,
    "cost_per_source_char": null // 🔲 needs source char counting
  },

  "speed": {
    "elapsed_seconds": 45.2,
    "avg_latency_seconds": 0.234,
    "median_latency_seconds": 0.190,
    "p95_latency_seconds": 0.415,
    "tokens_per_second": null, // 🔲 needs total_tokens / elapsed
    "entries_per_minute": null // 🔲 needs entry_count / (elapsed/60)
  },

  "tokens": {
    "prompt_tokens": 13985,
    "completion_tokens": 187822,
    "reasoning_tokens": 175726,
    "cached_tokens": 0,
    "total_tokens": 201807, // prompt + completion
    "tokens_per_entry": 3255 // total / entry_count
  }
}

9.1 Schema–Database Mapping

The run card JSON is stored in full as a jsonb column in Supabase. Key metrics are also denormalized into top-level columns for sort/filter performance:

Run Card Field | Supabase Column | Type | Index
scores.composite | composite_score | real | idx_composite
scores.quality_tier | quality_tier | text | -
scores.chrf_plus_plus | chrf_plus_plus | real | idx_leaderboard
scores.exact_match_rate | exact_match_rate | real | -
scores.fst_acceptance_rate | fst_acceptance_rate | real | -
scores.bleu | corpus_bleu | real | -
scores.comet_score | comet_score | real | -
cost.total_cost_usd | total_cost_usd | real | -
cost.cost_per_1k_tokens | cost_per_1k_tokens | real | -
speed.avg_latency_seconds | avg_latency_seconds | real | -
model_slug | model_slug | text | idx_model
condition | condition | text | -
dataset.id | dataset_id | text | idx_leaderboard
dataset.language_pair | language_pair | text | -
fingerprint.hash | fingerprint_hash | text | idx_fingerprint
(full card) | run_card | jsonb | -

When new metrics are implemented, the corresponding column should be added via a numbered migration in gds-mt-eval-harness/migrations/.
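
The denormalization described by the mapping table amounts to resolving dotted paths in the run card. The sketch below shows that idea only; the names COLUMN_MAP, lookup, and denormalize are hypothetical, and it does not talk to Supabase.

```python
from typing import Any, Dict, Mapping

# Run card field path -> top-level column, copied from the mapping table.
COLUMN_MAP: Dict[str, str] = {
    "scores.composite": "composite_score",
    "scores.quality_tier": "quality_tier",
    "scores.chrf_plus_plus": "chrf_plus_plus",
    "scores.exact_match_rate": "exact_match_rate",
    "scores.fst_acceptance_rate": "fst_acceptance_rate",
    "scores.bleu": "corpus_bleu",
    "scores.comet_score": "comet_score",
    "cost.total_cost_usd": "total_cost_usd",
    "cost.cost_per_1k_tokens": "cost_per_1k_tokens",
    "speed.avg_latency_seconds": "avg_latency_seconds",
    "model_slug": "model_slug",
    "condition": "condition",
    "dataset.id": "dataset_id",
    "dataset.language_pair": "language_pair",
    "fingerprint.hash": "fingerprint_hash",
}


def lookup(card: Mapping[str, Any], dotted_path: str) -> Any:
    """Follow a dotted path through nested dicts; None if any key is missing."""
    node: Any = card
    for key in dotted_path.split("."):
        if not isinstance(node, Mapping) or key not in node:
            return None
        node = node[key]
    return node


def denormalize(card: Mapping[str, Any]) -> Dict[str, Any]:
    """Build a flat row: denormalized columns plus the full card."""
    row = {column: lookup(card, path) for path, column in COLUMN_MAP.items()}
    row["run_card"] = card  # stored in full alongside the columns
    return row
```

Missing fields become None in the row, which lines up with null values for metrics that are not yet available.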


10. Code–Spec Synchronization

10.1 Canonical Source

This document is the canonical source for:

  • Metric definitions (§2)
  • Composite weight tables (§4.3)
  • Quality tier thresholds (§5.1)
  • Cost metric formulas (§6.2)
  • Run card scores schema (§9)

10.2 Code Mirror

The file gds-mt-eval-harness/mt_eval_harness/scoring.py mirrors the weight tables and tier thresholds from this document. It is the code implementation of §4.3 and §5.1. When this document is updated:

  1. Update scoring.py to match
  2. Run pytest tests/test_scoring_ssot.py to validate alignment
  3. Update FAQ and website docs that summarize the weights
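
The contents of tests/test_scoring_ssot.py are not reproduced in this document. As a rough idea of the kind of check such a test could run, the sketch below validates a weight table against the rule from §4.4 that weights must sum to 1.00. The names PROFILE_B and weight_table_problems are hypothetical; the weights are copied from the Profile B table in §4.3.

```python
from typing import Dict, List

# Target weights from the Profile B table (languages without FST coverage).
PROFILE_B: Dict[str, float] = {
    "semantic_score": 0.25,
    "chrf_plus_plus": 0.25,
    "equivalent_match_rate": 0.15,
    "exact_match_rate": 0.10,
    "code_switching_rate": 0.10,
    "terminology_adherence": 0.05,
    "hallucination_rate": 0.05,
    "orthographic_accuracy": 0.05,
}


def weight_table_problems(weights: Dict[str, float], tolerance: float = 1e-9) -> List[str]:
    """Return human-readable problems; an empty list means the table is valid."""
    problems: List[str] = []
    total = sum(weights.values())
    if abs(total - 1.0) > tolerance:
        problems.append(f"weights sum to {total:.4f}, expected 1.00")
    for metric, weight in weights.items():
        if weight <= 0:
            problems.append(f"{metric} has non-positive weight {weight}")
    return problems
```

A real alignment test would additionally compare the dicts in scoring.py against the values in this document, which this sketch does not attempt.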

10.3 Documents That Reference This Spec

Document | What It References | How to Keep in Sync
docs//docs/specifications/benchmark §4–§5 | Composite formula, weight tables, tier thresholds | Cross-reference this doc; do not duplicate tables
website/docs/getting-started/faq.md | Simplified weight summary | Must match §4.3; link back to this doc
docs/AGENTS.md | Quality tier names + thresholds | Must match §5
docs/HOW_IT_WORKS.md | Deployable threshold | Must match §5
publish.py via scoring.py | Weight dicts + tier function | Automated test validates match

Appendix A: Metrics NOT in Composite (and Why)

Metric | Why Excluded
BLEU | Word-level scoring penalizes morphological variation in polysynthetic languages. A minor inflectional difference (correct meaning, slightly different suffix) counts as a complete miss. chrF++ handles this better at the character level.
COMET | Trained on WMT data (high-resource European pairs). Scores for LRLs are unreliable: the model is extrapolating from languages with different morphological systems. Reported for transparency, not for scoring.
TER | Edit distance correlates with chrF++ for most use cases. Including both would double-count surface similarity. TER is reported for reference.
Length Ratio | A diagnostic, not a quality signal. A ratio of 1.02 and a ratio of 0.98 are both fine. Only extreme values indicate problems.
Consistency Score | Corpus-level only: no per-entry value to aggregate. Also, some inconsistency is legitimate (same English word → different target-language translations depending on context).
Compliance Index | Quality gate, not quality signal. Measures structural preservation (placeholders, quotes), not translation accuracy.

Appendix B: Language-Specific Metric Implementations

Some metrics have language-specific implementations that predate the generic harness metrics:

Language | Plugin | Metric | Notes
CRK (Plains Cree) | CrkLinterMetric | equivalent_match_rate | Deterministic variant-class rules: word order, orthographic, optional particle, lemma synonym, progressive ambiguity, inclusive/exclusive. Produces per-entry lint_verdict (EXACT/EQUIVALENT/MISS/NO_OUTPUT).
CRK | CrkFSTMetric | fst_acceptance_rate | Superseded by GiellaLTFSTMetric but produces identical results.
CRK | CrkSemanticMetric | Semantic validation | Deterministic: FST lemma extraction + dictionary glosses + spaCy content-word overlap. Produces verdicts (EXACT_MATCH/VALID/WRONG_ORDER/PARTIAL/INCOMPLETE/WRONG/NO_OUTPUT).
GiellaLT langs | GiellaLTFSTMetric | fst_acceptance_rate | Generic: works for CRK, SME, SMA, SMJ, SMN, SMS, FIN, NOB, IKU, and any other language with a .hfstol analyzer.

When a language has a specific implementation, it takes precedence over the generic. The generic implementation is the fallback for languages without specialized tooling.

Appendix C: Metrics Under Consideration

These are ideas being evaluated but not yet specified enough for §2:

Idea | What It Would Measure | Blockers
Fluency (LM perplexity) | Is the output well-formed prose in the target language? | Requires a target-language LM. No good models exist for most LRLs.
Register match | Does the translation match the expected formality level? | Requires sociolinguistic classifiers. Research problem.
Cultural appropriateness | Are cultural references handled correctly? | Cannot be automated; inherently requires human review.
Discourse coherence | Do consecutive translations form a coherent passage? | Requires document-level evaluation, not sentence-level.