v6: Ensemble Extension

v6 tested an alternative to adversarial debate: independent ensemble review with cross-vendor scoring.

Hypothesis

Three independent critics with no visibility into each other's outputs produce higher recall (fewer missed flaws) than a single critic in debate mode, at the cost of lower precision (more false positives).
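To make the recall/precision trade-off concrete, here is a minimal sketch of how the two metrics could be computed for one review run. It assumes a labelled set of known flaws is available to score against. The function name `precision_recall` and the flaw identifiers are hypothetical and are not taken from the experiment's code or data.

```python
def precision_recall(reported: set[str], known_flaws: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one review run.

    reported    -- flaw identifiers the critic(s) raised
    known_flaws -- flaw identifiers that are actually present (assumed labelled)
    """
    true_positives = len(reported & known_flaws)
    # Precision: what fraction of reported issues are real flaws.
    precision = true_positives / len(reported) if reported else 0.0
    # Recall: what fraction of real flaws were reported.
    recall = true_positives / len(known_flaws) if known_flaws else 0.0
    return precision, recall


# Illustrative values only, not experimental results.
known = {"leakage", "metric_mismatch", "seed_variance", "label_noise"}
single_critic = {"leakage", "metric_mismatch"}
ensemble = {"leakage", "metric_mismatch", "seed_variance", "overfit_claim", "style_nit"}

print(precision_recall(single_critic, known))  # (1.0, 0.5)
print(precision_recall(ensemble, known))       # (0.6, 0.75)
```

In this made-up example the ensemble catches one more real flaw (higher recall) but also reports two issues that are not real flaws (lower precision), which is the shape of trade-off the hypothesis predicts.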

Design

Ensemble mode:

ml-critic dispatched 3 times independently
No defender, no convergence loop
Findings clustered by root cause
Each issue tagged with support count: 3/3, 2/3, or 1/3
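The support-count tagging can be sketched as follows. This is an illustration under a simplifying assumption: each critic run has already been reduced to a list of root-cause labels, so the clustering step itself is not shown. The function name `tag_support` and the labels are hypothetical, not ml-lab's actual implementation.

```python
from collections import defaultdict


def tag_support(runs: list[list[str]]) -> dict[str, str]:
    """Map each root-cause label to 'k/n', where k of the n runs raised it."""
    total_runs = len(runs)
    supporters: dict[str, set[int]] = defaultdict(set)
    for run_index, root_causes in enumerate(runs):
        for cause in root_causes:
            # A set of run indices means a critic that raises the same
            # root cause twice still counts as one supporter.
            supporters[cause].add(run_index)
    return {cause: f"{len(idx)}/{total_runs}" for cause, idx in supporters.items()}


# Three independent critic runs (illustrative labels).
runs = [
    ["data_leakage", "weak_baseline"],
    ["data_leakage"],
    ["data_leakage", "weak_baseline", "unclear_metric"],
]
print(tag_support(runs))
# {'data_leakage': '3/3', 'weak_baseline': '2/3', 'unclear_metric': '1/3'}
```

Counting distinct runs rather than raw mentions keeps the tag meaningful as a measure of agreement between independent critics.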

Cross-vendor scoring:

Same evaluation protocol run with different LLM providers
Tests whether results are provider-dependent
Separate API credentials required
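One way to check for provider dependence is to run the same items through each provider's scorer and compare the per-provider means. The sketch below assumes each provider is wrapped in a plain callable that returns a numeric score. The names `Scorer`, `score_across_providers`, and `max_provider_gap` are hypothetical, and no real vendor API is called; the stub scorers stand in for provider clients that would each be configured with their own credentials.

```python
from statistics import mean
from typing import Callable

# Hypothetical interface: a scorer takes one item and returns a numeric score.
Scorer = Callable[[str], float]


def score_across_providers(items: list[str], scorers: dict[str, Scorer]) -> dict[str, float]:
    """Apply the same items to every provider's scorer; return mean score per provider."""
    if not items:
        raise ValueError("items must not be empty")
    return {name: mean([score(item) for item in items]) for name, score in scorers.items()}


def max_provider_gap(mean_scores: dict[str, float]) -> float:
    """Largest difference between any two providers' mean scores."""
    values = list(mean_scores.values())
    return max(values) - min(values)


# Stub scorers with fixed outputs, for illustration only.
stub_scorers: dict[str, Scorer] = {
    "provider_a": lambda item: 1.0,
    "provider_b": lambda item: 0.5,
}
means = score_across_providers(["case_1", "case_2"], stub_scorers)
print(means)                    # {'provider_a': 1.0, 'provider_b': 0.5}
print(max_provider_gap(means))  # 0.5
```

A consistently large gap on identical inputs would suggest that scores reflect the provider as well as the item being scored.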

Results

Ensemble mode achieved higher recall as hypothesized
1/3 minority findings were a mix of genuine novel concerns and noise
Cross-vendor results showed provider-specific biases (treated as a finding in its own right, not as noise)
Ensemble mode added to ml-lab as an opt-in review mode

Impact on ml-lab

v6 added ensemble mode as a user choice, recommended for exploratory investigations where the risk surface is unknown. Debate mode remains the default for focused hypothesis testing.

Artifacts

experiments/self_debate_experiment_v6/
REPORT.md, CONCLUSIONS.md
DATA_ACQUISITION.md
FINAL_SYNTHESIS.md
PEER_REVIEW_R1.md