v1–v3: Early Rounds

The first three experiment versions established the basic evaluation methodology and revealed its initial calibration problems.

v1: Does debate work at all?

Question: Can a critic-defender debate identify real flaws in an ML proof-of-concept?

Result: Qualitatively yes. The debate produced actionable findings that improved the PoC, but the evaluation rested entirely on human judgment of usefulness, so it supports no quantitative claims.

Artifacts: experiments/self_debate_experiment_v1/REPORT.md, CONCLUSIONS.md

v2: Adding quantitative metrics

Question: What are the detection rates and verdict accuracy of the debate protocol on benchmark cases with known ground truth?

Metrics introduced:

Detection rate — fraction of planted flaws (must_find) that the critic identified
Verdict accuracy — fraction of cases where the final verdict matched correct_position
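
A minimal sketch of how these two metrics could be computed. The field names `must_find` and `correct_position` come from the definitions above. The `CaseResult` record, its `found` and `verdict` fields, and the choice to pool flaws across cases (a micro-average) are assumptions made for illustration; the v2 reports may aggregate differently.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    must_find: list[str]    # planted flaw IDs (ground truth)
    correct_position: str   # ground-truth verdict label
    found: list[str]        # hypothetical: flaw IDs the critic raised
    verdict: str            # hypothetical: final verdict of the debate


def detection_rate(results):
    """Fraction of planted flaws the critic identified, pooled over all cases."""
    planted = sum(len(r.must_find) for r in results)
    if planted == 0:
        raise ValueError("no planted flaws to detect")
    # Only flaws listed in must_find count; extra critic findings are ignored.
    hits = sum(len(set(r.must_find) & set(r.found)) for r in results)
    return hits / planted


def verdict_accuracy(results):
    """Fraction of cases whose final verdict matched correct_position."""
    if not results:
        raise ValueError("no cases to score")
    return sum(r.verdict == r.correct_position for r in results) / len(results)
```

Note that this sketch does not penalize findings outside `must_find`, so a critic that flags everything would still score a perfect detection rate. That is one way easy cases can inflate the number.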

Result: Detection rates were high, but this reflected a calibration problem rather than critic strength: the cases were too easy. Documented in TECHNICAL_REPORT.md.

Artifacts: experiments/self_debate_experiment_v2/REPORT.md, TECHNICAL_REPORT.md, CONCLUSIONS.md

v3: Harder cases, scoring bugs

Question: Does the protocol maintain detection quality on subtler methodology flaws?

Result: Detection rates dropped (as intended with harder cases), but a post-mortem revealed that the scoring infrastructure was the bigger problem:

Rubric dimensions applied inconsistently
Judge reliability varied without measurement
Composite metrics masked per-dimension effects
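
The third point can be illustrated with a small sketch. The condition names, dimension names, and scores below are invented for illustration and are not v3 data. They show how two conditions can have identical composite scores while differing on every dimension.

```python
from statistics import mean

# Illustrative numbers only; not results from any experiment version.
scores = {
    "condition_x": {"dim_a": 3, "dim_b": 1},
    "condition_y": {"dim_a": 1, "dim_b": 3},
}


def composite(dim_scores):
    """Unweighted mean across rubric dimensions (one possible composite)."""
    return mean(list(dim_scores.values()))


def per_dimension_delta(baseline, other):
    """Score change on each dimension when moving from baseline to other."""
    return {dim: other[dim] - baseline[dim] for dim in baseline}
```

Here `composite` returns the same value for both conditions, while `per_dimension_delta(scores["condition_x"], scores["condition_y"])` shows opposite-signed shifts on the two dimensions. This is the kind of effect a composite-only report would hide.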

Post-mortem findings:

Instrument the measurement before measuring the subject
A rubric that produces a number is not a rubric that produces a reliable number
Pre-registration would have surfaced the inconsistencies as explicit violations rather than letting them pass as silent degradation
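
One way to instrument the measurement first is to have the judge score the same cases twice and check agreement before trusting any scores. The sketch below computes Cohen's kappa between two passes. It is a generic reliability check, not the procedure from POST_MORTEM.md, and the function name and label values are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelings of the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of cases where both passes gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the two passes were independent.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both passes used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A low kappa between two passes of the same judge on the same cases would suggest the rubric is producing numbers that are not yet reliable enough to compare conditions.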

Artifacts: experiments/self_debate_experiment_v3/REPORT.md, TECHNICAL_REPORT.md, CONCLUSIONS.md, POST_MORTEM.md

Impact on ml-lab

v1–v3 established that adversarial debate works for methodology evaluation but that the evaluation infrastructure needs as much rigor as the thing being evaluated. v3's post-mortem directly motivated v4's pre-registration requirement.