Evaluation Methodology
Pre-registration
Every ml-lab investigation locks its hypothesis, metrics, and pass criteria before any experiment code runs. This isn't bureaucracy — it's a response to a specific failure mode observed in v4: specification drift, where implementation changes silently violated the original experimental design.
Pre-registration is enforced by /intent-watch, which monitors the experiment directory for changes that conflict with the source-of-truth document (typically HYPOTHESIS.md). Any HIGH or CRITICAL conflict blocks the experiment.
Amendments are allowed — pre-registration doesn't mean the spec can never change. But amendments must be logged with their trigger and rationale, creating an auditable trail of why the design evolved.
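The blocking rule and the amendment trail described above can be sketched as follows. This is a minimal illustration, not the /intent-watch implementation: the `Conflict` and `Amendment` types, the `blocks_experiment` helper, and the LOW and MEDIUM severity levels are all assumptions made for the example (the text names only HIGH and CRITICAL).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Severity(IntEnum):
    # Only HIGH and CRITICAL appear in the text; LOW and MEDIUM are assumed.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Conflict:
    """Hypothetical record of a change that conflicts with HYPOTHESIS.md."""
    path: str
    severity: Severity
    detail: str


@dataclass(frozen=True)
class Amendment:
    """Hypothetical log entry: what prompted the change, and why it was made."""
    trigger: str
    rationale: str
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def blocks_experiment(conflicts: list[Conflict]) -> bool:
    """Return True if any conflict is HIGH or CRITICAL."""
    return any(c.severity >= Severity.HIGH for c in conflicts)


conflicts = [
    Conflict("run_experiment.py", Severity.MEDIUM, "output column renamed"),
    Conflict("run_experiment.py", Severity.HIGH, "primary metric changed"),
]
print(blocks_experiment(conflicts))  # True
```

Making both record types immutable reflects the audit-trail idea: an amendment is appended to the log, never edited in place.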
Metrics
ml-lab investigations use pre-specified metrics agreed before the PoC is built. The specific metrics depend on the hypothesis, but the structure is always:
- Primary metric — the main measurement that will determine the verdict
- Pass criteria — what value or outcome would support or falsify the claim
- Secondary metrics — additional measurements that provide context
Metrics are chosen during the hypothesis-sharpening phase (before Step 1) and locked at pre-registration.
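One way to picture a locked metric specification is an immutable record. The sketch below is an assumption-laden illustration: the `PreRegistration` class, its field names, the threshold-style pass criterion, and the example metric names and values are hypothetical and not taken from any ml-lab investigation.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class PreRegistration:
    """Hypothetical locked spec: primary metric, pass criterion, secondaries."""
    hypothesis: str
    primary_metric: str
    pass_threshold: float
    higher_is_better: bool = True
    secondary_metrics: tuple[str, ...] = ()

    def passes(self, observed: float) -> bool:
        """Compare an observed primary-metric value with the pass criterion."""
        if self.higher_is_better:
            return observed >= self.pass_threshold
        return observed <= self.pass_threshold


# Illustrative values only.
spec = PreRegistration(
    hypothesis="The critic detects planted flaws more often than the baseline.",
    primary_metric="detection_rate",
    pass_threshold=0.80,
    secondary_metrics=("false_positive_rate", "mean_rounds"),
)

print(spec.passes(0.85))  # True
print(spec.passes(0.75))  # False

try:
    spec.pass_threshold = 0.50  # attempts to move the goalposts fail
except FrozenInstanceError:
    print("spec is locked")
```

A real pass criterion may be richer than a single threshold (for example, an effect size with a confidence bound); the frozen record only illustrates the "locked before any experiment code runs" property.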
Scoring
The self-evaluation experiments (v1–v8) use benchmark cases with ground-truth metadata:
- must_find: planted flaws the critic must detect for a case to count as "detected"
- correct_position: the ground-truth verdict (critic_wins, defense_wins, or empirical_test_agreed)
- acceptable_resolutions: valid ways to resolve the finding
- ideal_resolution: the best possible resolution
Scoring is deterministic: derive_verdict() maps debate outputs to case-level verdicts, which are compared against ground truth.
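The comparison against ground truth might look like the sketch below. It does not reproduce derive_verdict(); the verdict is taken as an input that such a function would already have produced. The `BenchmarkCase` class, the `score_case` helper, the flaw and resolution labels, and the assumption that every must_find flaw has to be found for a case to count as detected are all illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass

VERDICTS = {"critic_wins", "defense_wins", "empirical_test_agreed"}


@dataclass(frozen=True)
class BenchmarkCase:
    """Ground-truth metadata for one benchmark case (field names from the text)."""
    case_id: str
    must_find: frozenset[str]
    correct_position: str
    acceptable_resolutions: frozenset[str]
    ideal_resolution: str


def score_case(
    case: BenchmarkCase, findings: set[str], verdict: str, resolution: str
) -> dict[str, bool]:
    """Deterministically compare one debate outcome with the ground truth."""
    if verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    return {
        # Assumption: all planted flaws must be found, not just one of them.
        "detected": case.must_find <= findings,
        "verdict_correct": verdict == case.correct_position,
        "resolution_acceptable": resolution in case.acceptable_resolutions,
        "resolution_ideal": resolution == case.ideal_resolution,
    }


case = BenchmarkCase(
    case_id="example-001",  # hypothetical case
    must_find=frozenset({"label_leakage"}),
    correct_position="critic_wins",
    acceptable_resolutions=frozenset({"rerun_with_split_fix", "retract_claim"}),
    ideal_resolution="rerun_with_split_fix",
)

result = score_case(
    case,
    findings={"label_leakage", "small_sample"},
    verdict="critic_wins",
    resolution="retract_claim",
)
print(result)
```

Because the function has no randomness and no hidden state, the same debate output always yields the same scores, which is what makes re-scoring auditable.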
Cross-vendor evaluation
v6 introduced cross-vendor scoring — running the same evaluation protocol with different LLM providers to test whether results are provider-dependent. This requires separate API credentials (CROSS_VENDOR_API_KEY, CROSS_VENDOR_BASE_URL, CROSS_VENDOR_MODEL).
Cross-vendor results are reported alongside primary results, not averaged with them, because provider-specific biases are a finding, not noise.
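A minimal sketch of how the three credentials might be loaded and how results could be kept side by side. Only the environment variable names come from the text; `VendorConfig`, `load_cross_vendor_config`, `build_report`, and the example metric values are assumptions.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_VARS = (
    "CROSS_VENDOR_API_KEY",
    "CROSS_VENDOR_BASE_URL",
    "CROSS_VENDOR_MODEL",
)


@dataclass(frozen=True)
class VendorConfig:
    api_key: str
    base_url: str
    model: str


def load_cross_vendor_config(env: Mapping[str, str] | None = None) -> VendorConfig:
    """Read the cross-vendor credentials, failing loudly if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing cross-vendor settings: {', '.join(missing)}")
    return VendorConfig(
        api_key=env["CROSS_VENDOR_API_KEY"],
        base_url=env["CROSS_VENDOR_BASE_URL"],
        model=env["CROSS_VENDOR_MODEL"],
    )


def build_report(primary: dict, cross_vendor: dict) -> dict:
    """Keep the two result sets in separate sections; never average them."""
    return {"primary": dict(primary), "cross_vendor": dict(cross_vendor)}


# Illustrative numbers only.
report = build_report({"detection_rate": 0.82}, {"detection_rate": 0.74})
print(report)
```

Keeping the sections separate means a gap between them stays visible in the report as a provider effect.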
Statistical methods
The methodology defaults to non-parametric tests, supplemented by robustness checks and, where the data are nested, mixed-effects models:
- Kruskal-Wallis for multi-group comparisons (score distributions are typically non-normal)
- Sensitivity analysis across scoring schemes, weighting functions, and case orderings
- Mixed-effects models when modeling nested structure (e.g., cases within tasks within model tiers)
Multiple comparison corrections are applied when testing individual dimensions.
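The text does not say which correction is used. As one common choice, the sketch below implements the Holm step-down adjustment in plain Python; treating Holm as the method, the `holm_adjust` name, and the dimension names and p-values are all assumptions for illustration.

```python
from __future__ import annotations


def holm_adjust(p_values: dict[str, float]) -> dict[str, float]:
    """Return Holm-adjusted p-values, keyed like the input.

    The smallest p-value is multiplied by m, the next by m - 1, and so on.
    A running maximum keeps the adjusted values monotone, and each value
    is capped at 1.0.
    """
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    adjusted: dict[str, float] = {}
    running_max = 0.0
    for rank, (name, p) in enumerate(ordered):
        candidate = min(1.0, (m - rank) * p)
        running_max = max(running_max, candidate)
        adjusted[name] = running_max
    return adjusted


# Hypothetical per-dimension p-values.
raw = {"clarity": 0.01, "rigor": 0.04, "novelty": 0.03}
alpha = 0.05
for name, p_adj in holm_adjust(raw).items():
    print(f"{name}: adjusted p = {p_adj:.3f}, significant = {p_adj <= alpha}")
```

With these inputs only the smallest p-value remains below 0.05 after adjustment. Library implementations of the tests themselves also exist, such as `scipy.stats.kruskal` for Kruskal-Wallis.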
Quality gates
Several gates protect against false conclusions:
- PoC gate — confirms the measurement works before investing in review
- Gate 1 — all pre-flight items closed before experiment execution
- Intent-watch — continuous drift monitoring during experiment scripting
- Micro-iteration — evaluation design flaws trigger a re-run, not a conclusion
- Macro-iteration — surprising results reopen the full review cycle
- Peer review — optional multi-round review loop (Steps 10–11)
- Coherence audit — cross-document consistency check (Step 12)
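As an illustration of the simplest gate, Gate 1 can be read as "no pre-flight item is still open". The helper names and checklist items below are hypothetical; the actual pre-flight list is not given in this section.

```python
from __future__ import annotations


def open_preflight_items(checklist: dict[str, bool]) -> list[str]:
    """Return the names of pre-flight items that are not yet closed."""
    return [name for name, closed in checklist.items() if not closed]


def gate_1_passes(checklist: dict[str, bool]) -> bool:
    """Gate 1 passes only when every pre-flight item is closed."""
    return not open_preflight_items(checklist)


# Hypothetical checklist: item name -> closed?
checklist = {
    "hypothesis locked": True,
    "metrics pre-registered": True,
    "benchmark cases validated": False,
}

print(gate_1_passes(checklist))         # False
print(open_preflight_items(checklist))  # ['benchmark cases validated']
```

Returning the open items along with the boolean makes it clear what still has to be closed before execution can start.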