v1–v3: Early Rounds
The first three experiment versions established the basic evaluation methodology and revealed its initial calibration problems.
v1: Does debate work at all?
Question: Can a critic-defender debate identify real flaws in an ML proof-of-concept?
Result: Qualitatively yes — the debate produced actionable findings that improved the PoC. But the evaluation was purely qualitative (human judgment of usefulness), so no quantitative claims could be made.
Artifacts: experiments/self_debate_experiment_v1/REPORT.md, CONCLUSIONS.md
v2: Adding quantitative metrics
Question: What are the detection rates and verdict accuracy of the debate protocol on benchmark cases with known ground truth?
Metrics introduced:
- Detection rate: fraction of planted flaws (must_find) that the critic identified
- Verdict accuracy: fraction of cases where the final verdict matched correct_position
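As a rough illustration of the two metrics, here is a minimal Python sketch. Only the field names must_find and correct_position come from the benchmark description. The CaseResult class, the critic_found and verdict fields, the verdict labels, and the choice to pool planted flaws across all cases (rather than average per-case rates) are assumptions made for this example, not the experiment's actual scoring code.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One benchmark case after a debate run (hypothetical structure)."""
    must_find: set[str]       # planted flaw IDs (ground truth)
    critic_found: set[str]    # flaw IDs the critic raised (assumed field)
    correct_position: str     # ground-truth verdict
    verdict: str              # final verdict of the debate (assumed field)


def detection_rate(results: list[CaseResult]) -> float:
    """Fraction of planted flaws the critic identified, pooled over cases."""
    planted = sum(len(r.must_find) for r in results)
    found = sum(len(r.must_find & r.critic_found) for r in results)
    return found / planted if planted else 0.0


def verdict_accuracy(results: list[CaseResult]) -> float:
    """Fraction of cases whose final verdict matched correct_position."""
    if not results:
        return 0.0
    hits = sum(r.verdict == r.correct_position for r in results)
    return hits / len(results)
```

Note that flaws the critic raises outside must_find do not affect the detection rate in this sketch; it measures recall on planted flaws only.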
Result: Detection rates were high, but this reflected a calibration problem rather than critic quality: the cases were too easy. Documented in TECHNICAL_REPORT.md.
Artifacts: experiments/self_debate_experiment_v2/REPORT.md, TECHNICAL_REPORT.md, CONCLUSIONS.md
v3: Harder cases, scoring bugs
Question: Does the protocol maintain detection quality on subtler methodology flaws?
Result: Detection rates dropped (as intended with harder cases), but a post-mortem revealed that scoring infrastructure was the bigger problem:
- Rubric dimensions applied inconsistently
- Judge reliability varied without measurement
- Composite metrics masked per-dimension effects
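To make the last point concrete, the toy Python sketch below shows how a composite score can stay flat while individual rubric dimensions move in opposite directions. The dimension names, condition names, and scores are invented for illustration; they are not data from the v3 runs.

```python
from statistics import mean

# Invented rubric scores on a 1-5 scale for two hypothetical conditions.
baseline = {"dim_a": 4, "dim_b": 2}
debate = {"dim_a": 2, "dim_b": 4}


def composite(scores: dict[str, int]) -> float:
    """Unweighted mean over rubric dimensions (an assumed composite)."""
    return mean(scores.values())


def per_dimension_delta(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Change in each dimension going from `before` to `after`."""
    return {dim: after[dim] - before[dim] for dim in before}


# The composites are identical, yet each dimension shifted by two points.
same_composite = composite(baseline) == composite(debate)
deltas = per_dimension_delta(baseline, debate)
```

Reporting per-dimension deltas alongside any composite is one way to keep such opposing effects visible.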
Post-mortem findings:
- Instrument the measurement before measuring the subject
- A rubric that produces a number is not a rubric that produces a reliable number
- Pre-registration would have caught the inconsistencies as violations rather than silent degradation
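One way to instrument the measurement first is to quantify agreement between two judges on the same cases before trusting their scores. The sketch below computes raw agreement and Cohen's kappa in plain Python. It is an assumed approach with hypothetical labels, not the reliability check the project adopted.

```python
from collections import Counter


def raw_agreement(judge_a: list[str], judge_b: list[str]) -> float:
    """Fraction of cases on which the two judges gave the same label."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("need two equal-length, non-empty label lists")
    return sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)


def cohens_kappa(judge_a: list[str], judge_b: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = raw_agreement(judge_a, judge_b)
    n = len(judge_a)
    freq_a, freq_b = Counter(judge_a), Counter(judge_b)
    # Chance agreement: probability both judges pick the same label
    # if each labels independently at their own observed rates.
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(freq_a) | set(freq_b)
    )
    if expected == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A kappa well below the raw agreement suggests that much of the apparent consistency comes from skewed label frequencies rather than reliable judging.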
Artifacts: experiments/self_debate_experiment_v3/REPORT.md, TECHNICAL_REPORT.md, CONCLUSIONS.md, POST_MORTEM.md
Impact on ml-lab
v1–v3 established that adversarial debate works for methodology evaluation but that the evaluation infrastructure needs as much rigor as the thing being evaluated. v3's post-mortem directly motivated v4's pre-registration requirement.