Statistical Significance Testing — Implementation Spec
Target codebase:
arena(specificallytester.pyandcompare.py) Purpose: Enable researchers to determine whether the difference between two evaluation runs is statistically significant or just noise. Priority: High — this is the single most important missing feature for publishable results.
Why This Matters
When comparing two runs (e.g., Gemini 3.1 Pro chrF++ 42.96 vs Claude Sonnet chrF++ 41.80 on 92 entries), we currently cannot say whether the difference is real or noise. With only ~92 test entries, random variation can easily produce 1-2 point swings. Experts will ask for significance tests. We need to answer.
Algorithm: Paired Bootstrap Resampling
This is the standard method used by SacreBLEU, MT-Lens, and WMT shared tasks. It's well-understood by MT researchers and produces results they trust.
How It Works
Given two systems A and B evaluated on the same N test entries:
- Compute the actual metric difference:
Δ = metric(A) - metric(B) - Repeat
n_bootstraptimes (default 1000): a. Sample N entries with replacement from the shared test set b. Compute the metric for both A and B on this bootstrap sample c. Compute the bootstrap difference:Δ_boot = metric(A_boot) - metric(B_boot) - The p-value = fraction of bootstrap samples where
Δ_boothas the opposite sign fromΔ - If p-value < α (default 0.05), the difference is statistically significant
Key Properties
- Paired: Both systems are evaluated on the same bootstrap sample, preserving entry-level correlation
- Non-parametric: No assumption about the distribution of scores
- Standard: This is exactly what
sacrebleu --paired-bsdoes under the hood
Important: sacrebleu Is a Hard Dependency
sacrebleu is currently listed under [project.optional-dependencies] and guarded by try/except in tester.py. This should be changed. An MT eval harness that cannot compute chrF++ or BLEU is not an MT eval harness. sacrebleu should be:
- Moved to
[project.dependencies]inpyproject.toml - Imported directly in
tester.py(remove thetry/except HAS_SACREBLEUguard) - Imported directly in the new
significance.pymodule
The HAS_SACREBLEU conditional paths in tester.py should be removed — they make the code more complex for a scenario (running without sacrebleu) that should not be supported.
Implementation Plan
1. Promote sacrebleu to hard dependency
pyproject.toml: Move sacrebleu>=2.3 from [project.optional-dependencies].metrics to [project.dependencies].
tester.py: Replace:
# Optional: sacrebleu for chrF++ and BLEU
try:
from sacrebleu.metrics import CHRF, BLEU
HAS_SACREBLEU = True
except ImportError:
HAS_SACREBLEU = False
With:
from sacrebleu.metrics import CHRF, BLEU
Remove all if HAS_SACREBLEU: guards throughout tester.py.
2. New module: mt_eval_harness/significance.py
"""
Statistical significance testing via paired bootstrap resampling.
Standard method used by WMT shared tasks, SacreBLEU, and MT-Lens.
Compares two runs on the same corpus to determine if the performance
difference is statistically significant.
"""
from __future__ import annotations
import random
from dataclasses import dataclass
from sacrebleu.metrics import CHRF, BLEU
@dataclass
class SignificanceResult:
"""Result of a paired bootstrap significance test."""
metric_name: str # e.g., "corpus_chrf", "exact_match_rate"
system_a_score: float # Score for system A
system_b_score: float # Score for system B
delta: float # A - B
p_value: float # Two-sided p-value
n_bootstrap: int # Number of bootstrap iterations
confidence_level: float # 1 - alpha
significant: bool # p_value < alpha
winner: str | None # "A", "B", or None if not significant
ci_lower: float # Lower bound of 95% CI on the delta
ci_upper: float # Upper bound of 95% CI on the delta
def paired_bootstrap(
entries_a: list[dict],
entries_b: list[dict],
metric_fn: callable,
n_bootstrap: int = 1000,
alpha: float = 0.05,
seed: int = 12345,
metric_name: str = "metric",
) -> SignificanceResult:
"""Run paired bootstrap resampling significance test.
Args:
entries_a: Per-entry results from system A (from TestReport["entries"])
entries_b: Per-entry results from system B (must be same length, same IDs)
metric_fn: Function(list[dict]) -> float that computes the corpus-level
metric from a list of entry dicts. Must handle the entry format
from TestReport.
n_bootstrap: Number of bootstrap iterations (1000 is standard)
alpha: Significance level (0.05 = 95% confidence)
seed: RNG seed for reproducibility (12345 matches SacreBLEU default)
metric_name: Human-readable name for the metric being tested
Returns:
SignificanceResult with all fields populated.
Raises:
ValueError: If entries_a and entries_b have different lengths or IDs.
"""
...
3. Built-in metric functions
def exact_match_rate(entries: list[dict]) -> float:
"""Compute exact match rate from a list of entry dicts."""
non_error = [e for e in entries if not e.get("error")]
if not non_error:
return 0.0
exact = sum(1 for e in non_error if e.get("exact_match"))
return exact / len(non_error)
def corpus_chrf(entries: list[dict]) -> float:
"""Compute corpus-level chrF++ from a list of entry dicts."""
chrf = CHRF(word_order=2)
refs = [e["expected"] for e in entries if e.get("expected", "").strip()]
hyps = [e["predicted"] if e.get("predicted", "").strip() else "EMPTY"
for e in entries if e.get("expected", "").strip()]
if not refs:
return 0.0
return chrf.corpus_score(hyps, [refs]).score
def corpus_bleu(entries: list[dict]) -> float:
"""Compute corpus-level BLEU from a list of entry dicts."""
bleu = BLEU()
refs = [e["expected"] for e in entries if e.get("expected", "").strip()]
hyps = [e["predicted"] if e.get("predicted", "").strip() else "EMPTY"
for e in entries if e.get("expected", "").strip()]
if not refs:
return 0.0
return bleu.corpus_score(hyps, [refs]).score
4. Integration into compare.py
The existing compare.py already does side-by-side comparison of multiple TestReports. Add significance testing:
# In compare_reports(), after computing deltas:
if len(reports) == 2:
sig_results = run_significance_tests(reports[0], reports[1])
comparison["significance"] = [asdict(r) for r in sig_results]
When more than 2 reports are compared, run pairwise significance tests for all pairs. Store results keyed by "(run_a_id, run_b_id)".
5. CLI integration
Add a --significance flag to mt-eval compare:
# Compare two runs with significance testing
mt-eval compare report_a.json report_b.json --significance
# Custom bootstrap count
mt-eval compare report_a.json report_b.json --significance --n-bootstrap 5000
Also consider a standalone command:
# Quick significance check between two reports
mt-eval significance report_a.json report_b.json
6. Output format
Console output:
Significance Tests (paired bootstrap, n=1000, α=0.05):
Metric A B Δ p-value Sig?
─────────────────── ──────── ──────── ─────── ──────── ────
corpus_chrf 42.96 41.80 +1.16 0.142 No
exact_match_rate 0.198 0.185 +0.013 0.381 No
corpus_bleu 6.80 3.81 +2.99 0.018 Yes *
JSON output (added to comparison report):
{
"significance": [
{
"metric_name": "corpus_chrf",
"system_a_score": 42.96,
"system_b_score": 41.80,
"delta": 1.16,
"p_value": 0.142,
"n_bootstrap": 1000,
"confidence_level": 0.95,
"significant": false,
"winner": null,
"ci_lower": -0.85,
"ci_upper": 3.12
}
]
}
7. Dashboard integration
If significance data is present in the comparison JSON, the dashboard should display it. Show a row in the comparison table with significance indicators (e.g., * for p < 0.05, ** for p < 0.01). This is a nice-to-have, not blocking.
Edge Cases and Validation
-
Mismatched entries: The two TestReports must have the same entry IDs. If they don't (e.g., one ran on a subset), only test significance on the intersection. Warn about excluded entries.
-
Too few entries: If N < 10, warn that significance tests are unreliable with so few entries. Still run them, but print the warning.
-
Identical scores: If both systems produce identical per-entry results, p_value should be 1.0 (no difference at all).
-
Plugin metrics: The significance module should also test any plugin metrics that appear in BOTH reports. Use a generic approach: if both reports have
plugin_metrics.crk_fst_validity.avg_fst_validity, test it. -
Reproducibility: The RNG seed must be logged in the output so results are exactly reproducible. Default to 12345 (matching SacreBLEU convention).
What NOT to Build
- No separate COMET significance: COMET is now integrated as a corpus metric via
metrics_comet.py. Bootstrap CIs are computed over COMET scores just like chrF++/BLEU. For pairwise COMET significance between two systems, usecomet-comparefrom Unbabel. - No Bayesian analysis: Stick to frequentist bootstrap. It's what the MT community expects and understands.
- No multi-test correction: When testing multiple metrics, don't apply Bonferroni or similar corrections. The convention in MT evaluation is to report raw p-values per metric and let the reader interpret.
Files to Modify
| File | Change |
|---|---|
pyproject.toml | Move sacrebleu from optional to hard dependency |
mt_eval_harness/tester.py | Remove HAS_SACREBLEU guards, direct import |
mt_eval_harness/significance.py | [NEW] Core implementation |
mt_eval_harness/__init__.py | Export SignificanceResult, paired_bootstrap |
mt_eval_harness/compare.py | Wire significance tests into report comparison |
mt_eval_harness/cli.py | Add --significance and --n-bootstrap flags |
mt_eval_harness/dashboard.py | Display significance in comparison table (nice-to-have) |
tests/test_significance.py | [NEW] Unit tests |
Testing Requirements
- Deterministic with seed: Same inputs + same seed = same p-value, every time
- Known-answer test: Two identical result sets → p_value = 1.0
- Known-significant test: Construct two result sets where one is clearly better (e.g., all exact matches vs all misses) → p_value ≈ 0.0
- Mismatched IDs: Should raise ValueError or warn and compute on intersection
- Empty inputs: Should handle gracefully (return p_value = 1.0 or raise)
Confidence Intervals (Companion Feature)
Status: ✅ IMPLEMENTED in
confidence.py
Confidence intervals (CIs) answer a different question from significance testing:
- Significance testing (
significance.py): "Is the difference between system A and system B real?" - Confidence intervals (
confidence.py): "How uncertain is this system's score on its own?"
Implementation: confidence.py
Uses the same percentile bootstrap resampling method as significance testing:
| Parameter | Value | Justification |
|---|---|---|
n_bootstrap | 1000 | SacreBLEU default, WMT 2024 convention |
seed | 12345 | SacreBLEU default seed for reproducibility |
alpha | 0.05 | Standard 95% confidence level |
| Method | Percentile bootstrap | Koehn (2004), Efron (1979) |
What Gets CIs
All corpus-level metrics computed by the harness:
corpus_chrf(chrF++ score)corpus_bleu(BLEU score)exact_match_rate(0.0–1.0)
CLI Flags
# Default: CIs are computed automatically
mt-eval test run_log.json
# Skip CI computation (faster, for quick iteration)
mt-eval test run_log.json --no-ci
# More bootstrap iterations (more precise, slower)
mt-eval test run_log.json --n-bootstrap-ci 2000
Small Sample Warning
When N < 30 entries, the module emits a warning that CIs may have poor coverage. The bootstrap cannot create information absent from the sample — with very few entries, the intervals will be wide, correctly reflecting high uncertainty.
COMET Integration
COMET (metrics_comet.py) is now integrated as a first-class metric:
- Model:
Unbabel/wmt22-comet-da(WMT 2022 winning reference-based model) - Automatically computed when
unbabel-cometis installed - Per-entry scores stored in TestReport entries
- Low-resource language detection via XLM-R coverage table
- Optional dependency:
pip install mt-eval-harness[comet]
Supabase Migration
New columns added to run_cards table:
comet_score(FLOAT8, nullable)corpus_bleu(FLOAT8, nullable)chrf_ci_lower/chrf_ci_upper(FLOAT8, nullable)exact_match_ci_lower/exact_match_ci_upper(FLOAT8, nullable)
See migrations/001_add_comet_and_ci_columns.sql for the migration script.