Evaluation Methodology

Pre-registration

Every ml-lab investigation locks its hypothesis, metrics, and pass criteria before any experiment code runs. This isn't bureaucracy — it's a response to a specific failure mode observed in v4: specification drift, where implementation changes silently violated the original experimental design.

Pre-registration is enforced by /intent-watch, which monitors the experiment directory for changes that conflict with the source-of-truth document (typically HYPOTHESIS.md). Any HIGH or CRITICAL conflict blocks the experiment.
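The blocking rule can be illustrated with a minimal sketch. This is not the /intent-watch implementation: the `Conflict` record, the `blocking_conflicts` helper, and the `LOW` and `MEDIUM` levels are hypothetical names introduced here. Only HIGH and CRITICAL come from the text above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Only HIGH and CRITICAL are named in the text; LOW and MEDIUM are assumed.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Conflict:
    path: str          # changed file in the experiment directory
    description: str   # how the change departs from the source-of-truth document
    severity: Severity


def blocking_conflicts(conflicts):
    """Return the conflicts severe enough to block the experiment."""
    return [c for c in conflicts if c.severity >= Severity.HIGH]


# Illustrative conflicts, not real tool output.
conflicts = [
    Conflict("run_eval.py", "renamed a logging field", Severity.LOW),
    Conflict("run_eval.py", "primary metric differs from HYPOTHESIS.md", Severity.CRITICAL),
]
blocked = bool(blocking_conflicts(conflicts))
```

Here `blocked` is true because one conflict is CRITICAL. The LOW conflict alone would not stop the run.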

Amendments are allowed — pre-registration doesn't mean the spec can never change. But amendments must be logged with their trigger and rationale, creating an auditable trail of why the design evolved.
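One way to picture the amendment trail is an append-only log that rejects entries without a trigger or rationale. The record layout and the `log_amendment` helper below are assumptions made for illustration. The text does not specify how ml-lab stores amendments.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Amendment:
    changed: str     # which part of the pre-registered design changed
    trigger: str     # what prompted the change
    rationale: str   # why the change is justified
    timestamp: str   # when the amendment was recorded (UTC, ISO 8601)


def log_amendment(log, changed, trigger, rationale):
    """Append an amendment to the log; refuse entries that cannot be audited."""
    if not trigger.strip() or not rationale.strip():
        raise ValueError("an amendment needs both a trigger and a rationale")
    entry = Amendment(changed, trigger, rationale,
                      datetime.now(timezone.utc).isoformat())
    log.append(entry)
    return entry


# Hypothetical amendment, for illustration only.
amendment_log = []
log_amendment(
    amendment_log,
    changed="secondary metrics",
    trigger="PoC showed one secondary metric could not be computed",
    rationale="replaced with a measurable proxy before any scoring ran",
)
```

Entries are frozen and only ever appended, so the log reads as a history of why the design evolved.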

Metrics

ml-lab investigations use metrics that are specified before the PoC is built. The specific metrics depend on the hypothesis, but the structure is always the same:

Primary metric — the main measurement that will determine the verdict
Pass criteria — what value or outcome would support or falsify the claim
Secondary metrics — additional measurements that provide context

Metrics are chosen during the hypothesis-sharpening phase (before Step 1) and locked at pre-registration.
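The three-part structure above can be sketched as an immutable record. The class names, the threshold-style pass criterion, and the example values are all hypothetical. Real pass criteria may take other forms, and the text does not prescribe a storage format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassCriterion:
    threshold: float
    higher_is_better: bool = True

    def supports(self, observed):
        """True when the observed primary metric supports the claim."""
        if self.higher_is_better:
            return observed >= self.threshold
        return observed <= self.threshold


@dataclass(frozen=True)
class PreRegistration:
    hypothesis: str
    primary_metric: str
    pass_criterion: PassCriterion
    secondary_metrics: tuple = ()


# Illustrative values only; not taken from any ml-lab experiment.
prereg = PreRegistration(
    hypothesis="the critic detects planted flaws more often than a baseline",
    primary_metric="detection_rate",
    pass_criterion=PassCriterion(threshold=0.80),
    secondary_metrics=("false_positive_rate",),
)
```

Freezing the dataclass mirrors the lock at pre-registration: assigning to a field after construction raises an error, so a change has to go through a new record instead of a silent edit.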

Scoring

The self-evaluation experiments (v1–v8) use benchmark cases with ground-truth metadata:

must_find — planted flaws the critic must detect for a case to count as "detected"
correct_position — the ground-truth verdict (critic_wins, defense_wins, or empirical_test_agreed)
acceptable_resolutions — valid ways to resolve the finding
ideal_resolution — the best possible resolution

Scoring is deterministic: derive_verdict() maps debate outputs to case-level verdicts, which are compared against ground truth.
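A rough sketch of the comparison step follows. It assumes a case counts as detected only when every must_find flaw was found, and it takes the case-level verdict as an input, since the signature of derive_verdict() is not shown here. `BenchmarkCase`, `is_detected`, and `score_case` are hypothetical names.

```python
from dataclasses import dataclass

# The three ground-truth verdicts listed above.
VERDICTS = {"critic_wins", "defense_wins", "empirical_test_agreed"}


@dataclass(frozen=True)
class BenchmarkCase:
    case_id: str
    must_find: frozenset    # planted flaws the critic must detect
    correct_position: str   # ground-truth verdict


def is_detected(case, found_flaws):
    """Assumption: detection requires every must_find flaw, not just one."""
    return case.must_find <= set(found_flaws)


def score_case(case, found_flaws, verdict):
    """Compare one case's outputs against its ground truth."""
    if verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    return {
        "case_id": case.case_id,
        "detected": is_detected(case, found_flaws),
        "verdict_correct": verdict == case.correct_position,
    }


# Illustrative case; the flaw labels are made up.
case = BenchmarkCase("case-001", frozenset({"data_leakage", "no_baseline"}),
                     "critic_wins")
result = score_case(case, ["data_leakage", "no_baseline", "small_sample"],
                    "critic_wins")
```

The same inputs always yield the same `result`, which is what makes the scoring deterministic.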

Cross-vendor evaluation

v6 introduced cross-vendor scoring: running the same evaluation protocol with different LLM providers to test whether results are provider-dependent. This requires a separate set of API credentials and endpoint settings (CROSS_VENDOR_API_KEY, CROSS_VENDOR_BASE_URL, CROSS_VENDOR_MODEL).

Cross-vendor results are reported alongside primary results, not averaged with them, because provider-specific biases are a finding, not noise.
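The sketch below shows one way to load the three settings and report results side by side. It assumes the names are environment variables, which the text implies but does not state. `load_cross_vendor_config` and `side_by_side` are hypothetical helpers, and the numbers are made up.

```python
import os

REQUIRED = ("CROSS_VENDOR_API_KEY", "CROSS_VENDOR_BASE_URL", "CROSS_VENDOR_MODEL")


def load_cross_vendor_config(env=None):
    """Read the cross-vendor settings, failing loudly if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing cross-vendor settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}


def side_by_side(primary, cross_vendor):
    """One row per shared metric: (metric, primary, cross-vendor, difference).

    The two columns are never pooled into a single average.
    """
    shared = sorted(set(primary) & set(cross_vendor))
    return [(m, primary[m], cross_vendor[m], cross_vendor[m] - primary[m])
            for m in shared]


# Illustrative numbers only.
rows = side_by_side({"detection_rate": 0.75}, {"detection_rate": 0.5})
```

Keeping the difference as its own column makes a provider gap visible as a result, where an average would hide it.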

Statistical methods

The methodology uses non-parametric tests by default, supplemented by sensitivity analysis and, for nested data, mixed-effects models:

Kruskal-Wallis for multi-group comparisons (score distributions are typically non-normal)
Sensitivity analysis across scoring schemes, weighting functions, and case orderings
Mixed-effects models when modeling nested structure (e.g., cases within tasks within model tiers)
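To make the Kruskal-Wallis step concrete, here is a dependency-free sketch of the H statistic with average ranks and the standard tie correction. It is illustrative, not the project's analysis code, and the scores are made up. The p-value helper covers only the three-group case, where the chi-square approximation has two degrees of freedom and a closed-form tail. That approximation is rough for small samples.

```python
import math
from collections import Counter


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with average ranks and tie correction."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)

    # Give each distinct value the mean of the 1-based ranks it occupies.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j

    rank_term = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n * (n + 1)) * rank_term - 3 * (n + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    tie_counts = Counter(pooled).values()
    correction = 1 - sum(t ** 3 - t for t in tie_counts) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


def p_value_three_groups(h):
    """Chi-square tail with 2 degrees of freedom (three groups only)."""
    return math.exp(-h / 2)


# Made-up scores for three hypothetical model tiers.
h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
p = p_value_three_groups(h)
```

For these fully separated groups, H is about 7.2 and the approximate p-value is about 0.027. In practice a statistics library would handle arbitrary group counts.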

When individual dimensions are tested separately, multiple-comparison corrections are applied so that running many tests does not inflate the false-positive rate.
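The text does not say which correction is used. As one common choice, the sketch below applies the Holm step-down (Holm-Bonferroni) adjustment to a list of per-dimension p-values. The function name and the example p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by m - k + 1.
        scaled = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, scaled)
        adjusted[idx] = running_max
    return adjusted


# Made-up p-values for three hypothetical scoring dimensions.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
significant = [p <= 0.05 for p in adjusted]
```

With these inputs only the first dimension stays below 0.05 after adjustment, although all three were below it before.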

Quality gates

Several gates protect against false conclusions:

PoC gate — confirms the measurement works before investing in review
Gate 1 — all pre-flight items closed before experiment execution
Intent-watch — continuous drift monitoring during experiment scripting
Micro-iteration — evaluation design flaws trigger a re-run, not a conclusion
Macro-iteration — surprising results reopen the full review cycle
Peer review — optional multi-round review loop (Steps 10–11)
Coherence audit — cross-document consistency check (Step 12)
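Two of these gates reduce to simple rules, sketched below under stated assumptions. `PreflightItem`, `gate_1_passes`, and `next_step` are hypothetical names. Checking for a design flaw before checking for a surprising result is an ordering chosen for this sketch, not something the text specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreflightItem:
    name: str
    closed: bool


def gate_1_passes(items):
    """Gate 1: every pre-flight item must be closed before execution.

    An empty list passes vacuously in this sketch.
    """
    return all(item.closed for item in items)


def next_step(design_flaw_found, result_surprising):
    """Decide what follows an experiment run."""
    if design_flaw_found:
        # Micro-iteration: fix the evaluation design and re-run; draw no conclusion.
        return "micro-iteration"
    if result_surprising:
        # Macro-iteration: reopen the full review cycle.
        return "macro-iteration"
    return "continue"


# Illustrative pre-flight list.
items = [PreflightItem("metrics locked", True),
         PreflightItem("benchmark cases frozen", False)]
ready = gate_1_passes(items)
```

Here `ready` is false because one item is still open, so experiment execution would not start.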