v7: Refined Evaluation

v7 tightened the evaluation methodology with sensitivity analysis and cohesion auditing.

Focus

After v6 established both modes (debate and ensemble), v7 asked: how robust are the results? Specifically:

Do results hold across different scoring schemes?
Are conclusions stable under different weighting functions?
Does case ordering affect outcomes?
Are the artifacts internally consistent?

Methods

Sensitivity analysis:

Multiple scoring schemes applied to the same data
Different weighting functions tested
Case ordering permuted
Results reported with confidence intervals rather than point estimates
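A minimal sketch of how such a sensitivity analysis could be wired up, assuming per-case scores for the two modes and a per-case difficulty value. The case data, the weighting functions, and every function name below are hypothetical and are not taken from the v7 code. The sketch covers weighting functions and confidence intervals only; testing case ordering would require re-running the experiment on permuted inputs, which is outside its scope.

```python
import random

# Hypothetical per-case outcomes: (debate_score, ensemble_score, difficulty).
# These numbers are made up for illustration; they are NOT v7 results.
CASES = [
    (0.80, 0.70, 1.0),
    (0.65, 0.60, 2.0),
    (0.90, 0.85, 1.0),
    (0.55, 0.50, 3.0),
    (0.75, 0.60, 2.0),
    (0.70, 0.72, 1.0),
]

# Hypothetical weighting functions: each maps a case difficulty to a weight.
WEIGHTINGS = {
    "uniform": lambda d: 1.0,
    "linear": lambda d: d,
    "squared": lambda d: d * d,
}


def weighted_effect(cases, weight_fn):
    """Weighted mean of the (debate - ensemble) score differences."""
    weights = [weight_fn(d) for _, _, d in cases]
    diffs = [a - b for a, b, _ in cases]
    return sum(w * x for w, x in zip(weights, diffs)) / sum(weights)


def bootstrap_ci(cases, weight_fn, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the weighted effect."""
    rng = random.Random(seed)
    estimates = sorted(
        weighted_effect(rng.choices(cases, k=len(cases)), weight_fn)
        for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def sensitivity_table(cases):
    """Effect estimate and 95% interval under every weighting function."""
    table = {}
    for name, fn in WEIGHTINGS.items():
        table[name] = {
            "effect": weighted_effect(cases, fn),
            "ci": bootstrap_ci(cases, fn),
        }
    return table


def direction_is_stable(table):
    """True when every weighting yields an effect with the same sign."""
    return len({row["effect"] > 0 for row in table.values()}) == 1


table = sensitivity_table(CASES)
for name, row in table.items():
    print(name, round(row["effect"], 4), row["ci"])
print("direction stable:", direction_is_stable(table))
```

Reporting the interval next to each estimate makes it visible when a weighting function changes the size of an effect without changing its sign.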

Cohesion audit:

Six cross-document consistency checks
Verified that claims in REPORT.md match data in CONCLUSIONS.md
Checked that methodology description matches actual implementation
Flagged any inconsistency for resolution before exit
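One of these checks could look roughly like the sketch below, which compares numeric claims between two documents. It assumes metrics are written as `name = value` in both files; that convention, the regular expression, the function names, and the sample text are all assumptions made for illustration and do not describe the actual audit implementation.

```python
import re

# Assumed convention: metrics appear in the documents as "name = 0.123".
METRIC_PATTERN = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(-?\d+(?:\.\d+)?)")


def extract_metrics(text):
    """Map each metric name to the set of numeric values stated for it."""
    found = {}
    for name, value in METRIC_PATTERN.findall(text):
        found.setdefault(name, set()).add(float(value))
    return found


def find_inconsistencies(report_text, conclusions_text, tol=1e-9):
    """List metrics stated in both documents whose values disagree.

    Each entry is (name, report_values, conclusions_values).
    """
    report = extract_metrics(report_text)
    conclusions = extract_metrics(conclusions_text)
    issues = []
    for name in sorted(report.keys() & conclusions.keys()):
        r_vals, c_vals = report[name], conclusions[name]
        # Every value claimed in the report must appear in the conclusions.
        matched = all(any(abs(r - c) <= tol for c in c_vals) for r in r_vals)
        if not matched:
            issues.append((name, sorted(r_vals), sorted(c_vals)))
    return issues


# Made-up snippets standing in for REPORT.md and CONCLUSIONS.md.
report = "Debate mode reached accuracy = 0.82 with effect_size = 0.31."
conclusions = "We conclude accuracy = 0.82 and effect_size = 0.13."

issues = find_inconsistencies(report, conclusions)
for name, r_vals, c_vals in issues:
    print(f"MISMATCH {name}: report={r_vals} conclusions={c_vals}")
```

In a real audit the two strings would be read from the files on disk, and a non-empty `issues` list would block exit until each mismatch is resolved.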

Results

Results were stable across scoring schemes (positive finding)
Some weighting functions shifted effect sizes but not direction
Case ordering had minimal effect
Cohesion audit caught two minor inconsistencies, both resolved

Impact on ml-lab

v7 added Step 12 (Artifact Coherence Audit) to the ml-lab workflow: an automatic cross-document consistency check that runs before the investigation concludes. It also reinforced the practice of reporting sensitivity analyses alongside primary results.

Artifacts

experiments/self_debate_experiment_v7/
REPORT.md, TECHNICAL_REPORT.md, CONCLUSIONS.md
COHESION_AUDIT.md
SENSITIVITY_ANALYSIS.md