
Shared Method Interface

Executive Summary. This page specifies the TranslationProcess protocol that all Arena methods must implement, the six method classes (raw-llm, coached-llm, pipeline, custom-plugin, api, human), and the method plugin format. Any approach that implements this protocol can be benchmarked.

The eval harness and i18n-rosetta share a common concept of translation method. A method is any procedure that takes source text and produces translated text — whether it's a direct LLM call, a multi-stage pipeline, a third-party API, or a human translator.

Architecture

Method Plugin (v2 Spec)
├── manifest.json ← Shared metadata (name, version, supported pairs)
├── method_card.json ← Leaderboard description (what, not how)
├── translate.py ← Python entry point (for eval harness)
└── translate.js ← Node.js entry point (for i18n-rosetta CLI)

Two Systems, One Interface

| | Eval Harness | i18n-rosetta |
|---|---|---|
| Language | Python | Node.js |
| Entry point | translate.py | translate.js |
| Interface | TranslationProcess protocol | methodPlugin config |
| Purpose | Batch evaluation with scoring | Live localization in dev/CI |
| Output | Run card with metrics | Translated locale files |

A method that supports both systems provides two entry points — one for each language runtime. The method card is the bridge: it describes the method in a format both systems understand.
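As a rough illustration of the two-entry-point layout, the sketch below reports which systems a plugin directory can serve, based only on which entry point files are present. The file names come from the plugin layout above; the `ENTRY_POINTS` mapping and the `supported_systems` helper are hypothetical names introduced for this example and are not part of either tool.

```python
from pathlib import Path

# File names follow the plugin layout above. The mapping and the system
# labels are illustrative assumptions, not part of the spec.
ENTRY_POINTS = {
    "eval-harness": "translate.py",
    "i18n-rosetta": "translate.js",
}


def supported_systems(plugin_dir: str) -> list[str]:
    """Return the systems whose entry point file exists in plugin_dir."""
    root = Path(plugin_dir)
    return [
        system
        for system, filename in ENTRY_POINTS.items()
        if (root / filename).is_file()
    ]
```

A plugin that ships only `translate.py` would be reported as usable by the eval harness alone.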

Method Card

A method card describes what a translation method is without revealing proprietary details like the full system prompt. It answers:

  • What class of method is this? (raw LLM, coached LLM, pipeline, API, etc.)
  • What tools does it use? (FST analyzer, dictionary, etc.)
  • Is the implementation open source?
  • What language pairs does it support?

See the Method Card Spec for the full JSON schema.

Example

{
  "method_id": "fst-gated-v8",
  "name": "FST-Gated Coached Translation v8",
  "class": "pipeline",
  "description": "LLM translation with morphological validation. Failed words are retried with FST feedback.",
  "author": "Curtis Forbes",
  "tools_used": ["HFST morphological analyzer", "Wolvengrey dictionary"],
  "open_source": false,
  "supported_pairs": ["eng>crk"]
}

Method Classes

| Class | Description |
|---|---|
| raw-llm | Direct LLM call with minimal instruction |
| coached-llm | LLM with structured prompt, examples, constraints |
| pipeline | Multi-stage pipeline with deterministic components |
| custom-plugin | External process implementing the TranslationProcess protocol |
| api | Third-party translation API (Google Translate, DeepL, etc.) |
| human | Human translation (for establishing baselines) |
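A quick sanity check on a method card before submission might look like the sketch below. The six class names are taken from the table above; the list of required fields is an assumption made for illustration (the authoritative schema is in the Method Card Spec), and `check_method_card` is a hypothetical helper.

```python
# Class names come from the Method Classes table above.
METHOD_CLASSES = {
    "raw-llm", "coached-llm", "pipeline", "custom-plugin", "api", "human",
}

# Assumption: these fields are treated as required here for illustration only.
# Consult the Method Card Spec for the real schema.
REQUIRED_FIELDS = ("method_id", "name", "class", "supported_pairs")


def check_method_card(card: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in card:
            problems.append(f"missing field: {field}")
    method_class = card.get("class")
    if method_class is not None and method_class not in METHOD_CLASSES:
        problems.append(f"unknown class: {method_class!r}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a submitter fix everything in a single pass.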

Eval Harness: TranslationProcess Protocol

The eval harness uses Python's structural typing (Protocol) for plugins. Any class with the right method signature works — no inheritance required:

class MyMethod:
    # RunConfig is the harness's run configuration type (import not shown).
    # do_translation stands for this method's own translation logic.
    async def translate(self, entries: list[dict], config: RunConfig) -> list[dict]:
        results = []
        for entry in entries:
            translation = await self.do_translation(entry["source"])
            results.append({
                "id": entry["id"],
                "predicted": translation,
                "latency_s": 0.5,  # placeholder; report the measured latency
                "usage": {"prompt_tokens": 0, "completion_tokens": 0},
                "error": None,
                "tool_calls": [],
                "tool_call_count": 0,
                "metadata": {},
            })
        return results

See the Plugin Protocol for complete documentation including wrapper examples for non-Python methods.
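To make the structural-typing point concrete, here is a minimal sketch of how such a protocol could be declared with `typing.Protocol`. This is not the harness's actual definition: the `TranslationProcess` class below is a stand-in written for this example, `config` is typed as `Any` because `RunConfig` is not shown on this page, and `EchoMethod` is a hypothetical plugin that just returns the source text.

```python
import asyncio
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class TranslationProcess(Protocol):
    """Stand-in protocol: any object with a matching translate() qualifies."""

    async def translate(self, entries: list[dict], config: Any) -> list[dict]:
        ...


class EchoMethod:
    """Hypothetical plugin; note that it does not inherit from the protocol."""

    async def translate(self, entries: list[dict], config: Any) -> list[dict]:
        return [
            {"id": entry["id"], "predicted": entry["source"], "error": None}
            for entry in entries
        ]


# runtime_checkable isinstance() only checks that a translate attribute
# exists; it does not verify the signature or that the method is async.
is_plugin = isinstance(EchoMethod(), TranslationProcess)

results = asyncio.run(
    EchoMethod().translate([{"id": "1", "source": "hello"}], config=None)
)
```

A static type checker does the stricter signature comparison; the runtime check is only a coarse guard.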

i18n-rosetta: methodPlugin Config

In rosetta, methods are registered per language pair in i18n-rosetta.config.json:

{
  "version": 3,
  "pairs": {
    "en:crk": {
      "methodPlugin": "crk-coached-v1"
    }
  }
}

See the Plugin Spec for the rosetta-side interface.
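rosetta itself runs on Node.js, so the following Python sketch is only an illustration of the lookup implied by the config shape above: given a source and target language, find the registered plugin name. The `method_for_pair` helper is hypothetical, and the `source:target` key format is inferred from the single `"en:crk"` example.

```python
import json
from typing import Optional


def method_for_pair(config: dict, source: str, target: str) -> Optional[str]:
    """Return the methodPlugin registered for source:target, or None if absent."""
    # Assumption: pair keys are "<source>:<target>", as in the "en:crk" example.
    pair = config.get("pairs", {}).get(f"{source}:{target}", {})
    return pair.get("methodPlugin")


# Same content as the example config above, inlined to keep the sketch self-contained.
example_config = json.loads(
    '{"version": 3, "pairs": {"en:crk": {"methodPlugin": "crk-coached-v1"}}}'
)
plugin_name = method_for_pair(example_config, "en", "crk")
```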

Leaderboard Integration

When a method card is attached to a run (via --method-card), it's embedded in the run card and displayed on the leaderboard:

# Run with method card attached
python eval/baseline_experiment.py \
  --dataset data/edtekla-dev-v1.json \
  --method-card method_card.json \
  --submit

The leaderboard shows:

  • Class badge — visual indicator (e.g., "pipeline", "coached-llm")
  • Method name — from the method card
  • Tools used — listed from the method card
  • Open source indicator

When no method card is attached, the leaderboard shows harness-native configuration (model, condition, temperature, tools enabled).
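The display rule described here (method card fields when a card is attached, harness-native configuration otherwise) can be sketched as a simple fallback. The run card layout is not specified on this page, so every key name on `run_card` below (`method_card`, `config`, `tools_enabled`, and so on) is an assumption made purely for illustration, as is the `leaderboard_fields` helper.

```python
def leaderboard_fields(run_card: dict) -> dict:
    """Pick what to display: method card fields if present, else harness config."""
    # Assumption: an attached method card is embedded under "method_card".
    card = run_card.get("method_card")
    if card:
        return {
            "class_badge": card.get("class"),
            "method_name": card.get("name"),
            "tools_used": card.get("tools_used", []),
            "open_source": card.get("open_source"),
        }
    # Assumption: harness-native settings live under "config" with these keys.
    config = run_card.get("config", {})
    return {
        key: config.get(key)
        for key in ("model", "condition", "temperature", "tools_enabled")
    }
```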

:::danger DO NOT TRAIN on evaluation data
Methods whose development process included exposure to the evaluation dataset (as training data, few-shot examples, dictionary entries, or prompt tuning material) will be disqualified from the leaderboard. See MT Evaluation for what distinguishes a good method from a bad one.
:::


See Also