v7: Refined Evaluation
v7 tightened the evaluation methodology with sensitivity analysis and cohesion auditing.
Focus
After v6 established both modes (debate and ensemble), v7 asked: how robust are the results? Specifically:
- Do results hold across different scoring schemes?
- Are conclusions stable under different weighting functions?
- Does case ordering affect outcomes?
- Are the artifacts internally consistent?
Methods
Sensitivity analysis:
- Multiple scoring schemes applied to the same data
- Different weighting functions tested
- Case ordering permuted
- Results reported with confidence intervals rather than point estimates
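As a rough illustration of reporting an interval rather than a point estimate under several scoring schemes, the sketch below computes a percentile bootstrap confidence interval for the mean per-case score difference between two conditions. It is not the v7 implementation: the case data, the three scoring schemes, and every function name here are hypothetical placeholders chosen for the example.

```python
import random
import statistics

# Hypothetical per-case raw outcomes (condition A, condition B) on a 0-1 scale.
# These numbers are made up for illustration; they are not v7 data.
CASES = [(0.9, 0.6), (0.8, 0.7), (0.7, 0.4), (0.95, 0.5), (0.6, 0.55), (0.85, 0.65)]

# Hypothetical scoring schemes: each maps a raw outcome to a score.
SCORING_SCHEMES = {
    "linear": lambda x: x,
    "thresholded": lambda x: 1.0 if x >= 0.65 else 0.0,
    "squared": lambda x: x * x,
}


def mean_difference(cases, score):
    """Mean of score(A) - score(B) over all cases."""
    return statistics.mean(score(a) - score(b) for a, b in cases)


def bootstrap_ci(cases, score, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean difference."""
    rng = random.Random(seed)
    # Resample cases with replacement and recompute the statistic each time.
    estimates = sorted(
        mean_difference(rng.choices(cases, k=len(cases)), score)
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def sensitivity_report(cases, schemes):
    """Point estimate and interval for every scoring scheme on the same data."""
    return {
        name: {
            "estimate": mean_difference(cases, score),
            "ci": bootstrap_ci(cases, score),
        }
        for name, score in schemes.items()
    }


if __name__ == "__main__":
    for name, row in sensitivity_report(CASES, SCORING_SCHEMES).items():
        lower, upper = row["ci"]
        print(f"{name}: {row['estimate']:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

A conclusion would count as robust in this sketch if the intervals for all schemes lie on the same side of zero. Note that a plain mean is unaffected by case order, so a case-ordering check only becomes meaningful when the evaluation procedure itself is order-dependent.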
Cohesion audit:
- 6 cross-document consistency checks
- Verified that claims in REPORT.md match data in CONCLUSIONS.md
- Checked that methodology description matches actual implementation
- Flagged any inconsistency for resolution before the investigation concluded
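One of the cross-document checks above could be sketched as follows. This is a minimal example under a stated assumption: that both documents state key figures in a `name = value` form, which is a hypothetical convention introduced here and not something the v7 artifacts are known to follow. The function names are likewise illustrative.

```python
import re

# Hypothetical convention: key figures appear as "name = value" in each document.
CLAIM_PATTERN = re.compile(r"\b([a-z_]+)\s*=\s*(-?\d+(?:\.\d+)?)")


def extract_claims(text):
    """Return a mapping of claim name to numeric value found in the text."""
    return {name: float(value) for name, value in CLAIM_PATTERN.findall(text)}


def audit_pair(report_text, conclusions_text, tolerance=1e-9):
    """List disagreements between claims in two documents."""
    report = extract_claims(report_text)
    conclusions = extract_claims(conclusions_text)
    findings = []
    # Claims present in both documents must agree within the tolerance.
    for name in sorted(report.keys() & conclusions.keys()):
        if abs(report[name] - conclusions[name]) > tolerance:
            findings.append(
                f"{name}: REPORT.md says {report[name]}, "
                f"CONCLUSIONS.md says {conclusions[name]}"
            )
    # Claims made in the report need supporting data in the conclusions.
    for name in sorted(report.keys() - conclusions.keys()):
        findings.append(f"{name}: claimed in REPORT.md but absent from CONCLUSIONS.md")
    return findings


if __name__ == "__main__":
    # In practice the two strings would be read from the files on disk.
    report_text = "We observed effect_size = 0.42 over n_cases = 20."
    conclusions_text = "Summary: effect_size = 0.40, n_cases = 20."
    for finding in audit_pair(report_text, conclusions_text):
        print("INCONSISTENT:", finding)
```

An empty findings list would let the audit pass; any entry would be flagged for resolution.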
Results
- Results were stable across scoring schemes (positive finding)
- Some weighting functions shifted effect sizes but not direction
- Case ordering had minimal effect
- Cohesion audit caught two minor inconsistencies, both resolved
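The distinction between a weighting function shifting an effect size and reversing its direction can be made concrete with a small check. The sketch below is illustrative only: the three weighting functions and the helper names are assumptions for the example, not the functions tested in v7.

```python
# Hypothetical weighting functions: each returns the weight of case i out of n.
WEIGHTINGS = {
    "uniform": lambda i, n: 1.0,
    "linear_ramp": lambda i, n: float(i + 1),
    "front_loaded": lambda i, n: float(n - i),
}


def weighted_effect(differences, weights):
    """Weighted mean of per-case differences."""
    return sum(d * w for d, w in zip(differences, weights)) / sum(weights)


def sign(value):
    """Return 1, 0, or -1 according to the sign of value."""
    return (value > 0) - (value < 0)


def direction_is_stable(differences, weightings):
    """Compute the effect under each weighting and test whether all signs agree."""
    n = len(differences)
    effects = {
        name: weighted_effect(differences, [weight(i, n) for i in range(n)])
        for name, weight in weightings.items()
    }
    stable = len({sign(effect) for effect in effects.values()}) == 1
    return effects, stable


if __name__ == "__main__":
    # Made-up per-case differences, not v7 results.
    effects, stable = direction_is_stable([0.3, 0.1, 0.3, 0.45, 0.05, 0.2], WEIGHTINGS)
    for name, effect in effects.items():
        print(f"{name}: {effect:.3f}")
    print("direction stable:", stable)
```

With these made-up inputs the magnitudes differ between weightings while the sign stays the same, which is the pattern the second result above describes.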
Impact on ml-lab
v7 added Step 12 (Artifact Coherence Audit) to the ml-lab workflow: an automatic cross-document consistency check that runs before the investigation concludes. It also reinforced the practice of reporting sensitivity analyses alongside primary results.
Artifacts
experiments/self_debate_experiment_v7/
- REPORT.md
- TECHNICAL_REPORT.md
- CONCLUSIONS.md
- COHESION_AUDIT.md
- SENSITIVITY_ANALYSIS.md