v6: Ensemble Extension
v6 tested an alternative to adversarial debate: independent ensemble review with cross-vendor scoring.
Hypothesis
Three independent critics with no visibility into each other's outputs produce higher recall (fewer missed flaws) than a single critic in debate mode, at the cost of lower precision (more false positives).
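For reference, the two quantities being traded off can be written with the standard definitions below. The symbols are introduced here for illustration and are not taken from the experiment reports: $TP$ is the number of genuine flaws that were reported, $FN$ the number of genuine flaws that were missed, and $FP$ the number of reported issues that are not genuine flaws.

```latex
\[
\text{recall} = \frac{TP}{TP + FN},
\qquad
\text{precision} = \frac{TP}{TP + FP}
\]
```

Under these definitions, the hypothesis predicts that the ensemble lowers $FN$ (raising recall) while raising $FP$ (lowering precision).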
Design
Ensemble mode:
- ml-critic dispatched 3 times independently
- No defender, no convergence loop
- Findings clustered by root cause
- Each issue tagged with support count: 3/3, 2/3, or 1/3
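The clustering and support-count steps above can be sketched roughly as follows. This is a minimal illustration, not the ml-lab implementation: the `Finding` type, the `root_cause` label, and the `cluster_by_root_cause` function are hypothetical names, and the sketch assumes each finding has already been assigned a normalized root-cause label (how that label is produced is not specified in the source).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    critic_id: int    # which independent critic run produced the finding
    root_cause: str   # hypothetical normalized root-cause label
    description: str  # the critic's own wording


def cluster_by_root_cause(findings, n_critics=3):
    """Group findings by root cause and tag each cluster with a support count."""
    clusters = defaultdict(list)
    for finding in findings:
        clusters[finding.root_cause].append(finding)

    tagged = []
    for root_cause, members in clusters.items():
        # Count distinct critics, so one critic repeating itself adds no support.
        support_count = len({m.critic_id for m in members})
        tagged.append({
            "root_cause": root_cause,
            "support_count": support_count,
            "support": f"{support_count}/{n_critics}",
            "descriptions": [m.description for m in members],
        })

    # List the most widely supported issues first.
    tagged.sort(key=lambda cluster: cluster["support_count"], reverse=True)
    return tagged


if __name__ == "__main__":
    example = [
        Finding(0, "data_leakage", "Test rows appear in the training split"),
        Finding(1, "data_leakage", "Possible train/test overlap"),
        Finding(2, "data_leakage", "Leakage between splits"),
        Finding(1, "seed_sensitivity", "Only one random seed was used"),
    ]
    for cluster in cluster_by_root_cause(example):
        print(cluster["support"], cluster["root_cause"])
```

Counting distinct critics rather than raw findings keeps the support tag meaningful when a single critic reports the same root cause more than once.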
Cross-vendor scoring:
- Same evaluation protocol run with different LLM providers
- Tests whether results are provider-dependent
- Separate API credentials required
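One way to check for provider dependence is to score the same cases with each provider and compare each provider's scores against the cross-provider mean. The sketch below is only an illustration under stated assumptions: `cross_vendor_scores`, `provider_offsets`, and the scorer callables are hypothetical, and the stub scorers stand in for real provider calls, which would each need their own API credentials. It does not reproduce the v6 protocol.

```python
def cross_vendor_scores(cases, scorers):
    """Apply every provider's scorer to the same cases.

    `scorers` maps a provider name to a callable `case -> float`
    (hypothetical interface; real scorers would wrap provider API calls).
    """
    return {name: [score(case) for case in cases] for name, score in scorers.items()}


def provider_offsets(scores):
    """Mean offset of each provider from the per-case cross-provider mean."""
    providers = list(scores)
    n_cases = len(scores[providers[0]])
    case_means = [
        sum(scores[p][i] for p in providers) / len(providers)
        for i in range(n_cases)
    ]
    return {
        p: sum(scores[p][i] - case_means[i] for i in range(n_cases)) / n_cases
        for p in providers
    }


if __name__ == "__main__":
    # Stub scorers with made-up behaviour, used only to exercise the code.
    cases = [{"quality": 2.0}, {"quality": 4.0}]
    scorers = {
        "vendor_a": lambda case: case["quality"],
        "vendor_b": lambda case: case["quality"] + 1.0,
    }
    print(provider_offsets(cross_vendor_scores(cases, scorers)))
```

A consistently nonzero offset for one provider would suggest a provider-specific bias rather than random disagreement, though a real analysis would also need to look at variance across cases.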
Results
- Ensemble mode achieved higher recall as hypothesized
- Minority findings (1/3 support) were a mix of genuinely novel concerns and noise
- Cross-vendor results showed provider-specific biases (a finding, not noise)
- Ensemble mode added to ml-lab as an opt-in review mode
Impact on ml-lab
v6 added ensemble mode as a user choice, recommended for exploratory investigations where the risk surface is unknown. Debate mode remains the default for focused hypothesis testing.
Artifacts
experiments/self_debate_experiment_v6/REPORT.md, CONCLUSIONS.md, DATA_ACQUISITION.md, FINAL_SYNTHESIS.md, PEER_REVIEW_R1.md