v7: Refined Evaluation

v7 tightened the evaluation methodology with sensitivity analysis and cohesion auditing.

Focus

After v6 established both modes (debate and ensemble), v7 asked: how robust are the results? Specifically:

  • Do results hold across different scoring schemes?
  • Are conclusions stable under different weighting functions?
  • Does case ordering affect outcomes?
  • Are the artifacts internally consistent?

Methods

Sensitivity analysis:

  • Multiple scoring schemes applied to the same data
  • Different weighting functions tested
  • Case ordering permuted
  • Results reported with confidence intervals rather than point estimates
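
As a rough illustration of the sensitivity analysis listed above, the sketch below recomputes one effect under every combination of scoring scheme and weighting function and reports a percentile-bootstrap confidence interval for each. Everything in it is an assumption made for illustration, not the v7 implementation: the case records, the `debate` and `baseline` fields, the two scoring schemes, the two weighting functions, and the 0.65 pass threshold are all hypothetical.

```python
import random

# Hypothetical per-case records: raw scores for two conditions plus a difficulty tag.
CASES = [
    {"id": i, "debate": d, "baseline": b, "difficulty": k}
    for i, (d, b, k) in enumerate([
        (0.9, 0.7, 1), (0.8, 0.6, 2), (0.7, 0.7, 3), (0.6, 0.4, 1),
        (0.9, 0.8, 2), (0.5, 0.3, 3), (0.8, 0.5, 1), (0.7, 0.6, 2),
    ])
]

# Scoring schemes (assumed): map a raw score to the value that gets aggregated.
SCORING = {
    "raw": lambda s: s,
    "binary_pass": lambda s: 1.0 if s >= 0.65 else 0.0,
}

# Weighting functions (assumed): map a case to a positive weight.
WEIGHTING = {
    "uniform": lambda c: 1.0,
    "by_difficulty": lambda c: float(c["difficulty"]),
}

def effect(cases, score, weight):
    """Weighted mean difference between conditions (debate minus baseline)."""
    total = sum(weight(c) for c in cases)
    diff = sum(weight(c) * (score(c["debate"]) - score(c["baseline"])) for c in cases)
    return diff / total

def bootstrap_ci(cases, score, weight, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling cases with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        effect(rng.choices(cases, k=len(cases)), score, weight)
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def sensitivity_table(cases):
    """One row per (scoring scheme, weighting function) combination."""
    rows = []
    for s_name, score in SCORING.items():
        for w_name, weight in WEIGHTING.items():
            point = effect(cases, score, weight)
            lo, hi = bootstrap_ci(cases, score, weight)
            rows.append((s_name, w_name, point, lo, hi))
    return rows

for s_name, w_name, point, lo, hi in sensitivity_table(CASES):
    print(f"{s_name:12s} {w_name:14s} effect={point:+.3f}  95% CI=[{lo:+.3f}, {hi:+.3f}]")
```

Reading the table row by row shows whether a change of scheme or weighting moves only the size of the effect or also its sign. Case-order permutation is not covered here: a weighted mean does not depend on order, so testing order effects would mean re-running the evaluation pipeline itself on shuffled case lists.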

Cohesion audit:

  • 6 cross-document consistency checks
  • Verified that claims in REPORT.md match data in CONCLUSIONS.md
  • Checked that methodology description matches actual implementation
  • Flagged every inconsistency for resolution before the investigation concluded
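
One of the cross-document checks above could look roughly like the following sketch, which compares the values quoted for the same metric in different documents. It assumes a hypothetical convention in which metrics appear as `name = value`; the helper names `extract_metrics` and `cross_check`, the sample sentences, and the numbers are illustrative only and are not taken from the actual artifacts.

```python
import re

# Assumed convention (hypothetical): key metrics are written as "name = value".
METRIC = re.compile(r"\b([a-z_]+)\s*=\s*(-?\d+(?:\.\d+)?)")

def extract_metrics(text):
    """Return {metric_name: set of distinct values quoted in the text}."""
    found = {}
    for name, value in METRIC.findall(text):
        found.setdefault(name, set()).add(float(value))
    return found

def cross_check(docs, tol=1e-9):
    """Compare every metric that appears in more than one document.

    docs maps a file name to its text; returns a list of readable findings.
    """
    per_doc = {name: extract_metrics(text) for name, text in docs.items()}
    findings = []
    metrics = sorted({m for found in per_doc.values() for m in found})
    for metric in metrics:
        quoted = {doc: vals[metric] for doc, vals in per_doc.items() if metric in vals}
        all_values = sorted(set().union(*quoted.values()))
        if len(quoted) > 1 and all_values[-1] - all_values[0] > tol:
            detail = ", ".join(f"{d}: {sorted(v)}" for d, v in sorted(quoted.items()))
            findings.append(f"{metric} disagrees ({detail})")
    return findings

# Illustrative inputs only; real runs would read the files from disk.
docs = {
    "REPORT.md": "Debate improved accuracy (effect_size = 0.16, n_cases = 8).",
    "CONCLUSIONS.md": "Summary: effect_size = 0.18 across n_cases = 8.",
}
for finding in cross_check(docs):
    print("INCONSISTENT:", finding)
```

A check like this only catches numeric mismatches. Verifying that the methodology description matches the implementation still needs a reviewer, human or model, reading both.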

Results

  • Results were stable across scoring schemes (positive finding)
  • Some weighting functions changed the size of the effect but not its direction
  • Case ordering had minimal effect
  • Cohesion audit caught two minor inconsistencies, both resolved

Impact on ml-lab

v7 added Step 12 (Artifact Coherence Audit) to the ml-lab workflow: an automatic cross-document consistency check that runs before the investigation concludes. It also reinforced the practice of reporting sensitivity analyses alongside primary results.

Artifacts

  • experiments/self_debate_experiment_v7/
  • REPORT.md, TECHNICAL_REPORT.md, CONCLUSIONS.md
  • COHESION_AUDIT.md
  • SENSITIVITY_ANALYSIS.md