v6: Ensemble Extension

v6 tested an alternative to adversarial debate: independent ensemble review with cross-vendor scoring.

Hypothesis

Three independent critics with no visibility into each other's outputs produce higher recall (fewer missed flaws) than a single critic in debate mode, at the cost of lower precision (more false positives).
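
To make the recall/precision trade-off in the hypothesis concrete, here is a minimal sketch of how the two metrics could be computed for a set of flagged issues against a set of known flaws. The function name `precision_recall` and the flaw labels are hypothetical, and the sets are toy values chosen for illustration; they are not results from v6.

```python
def precision_recall(flagged: set[str], true_flaws: set[str]) -> tuple[float, float]:
    """Return (precision, recall) of flagged issues against known flaws."""
    true_positives = len(flagged & true_flaws)
    # Precision: share of flagged issues that are real flaws.
    precision = true_positives / len(flagged) if flagged else 0.0
    # Recall: share of real flaws that were flagged.
    recall = true_positives / len(true_flaws) if true_flaws else 0.0
    return precision, recall


# Toy labels (hypothetical, not taken from the experiment).
truth = {"leakage", "no_baseline", "seed_variance"}
single_critic = {"leakage", "no_baseline"}
ensemble = {"leakage", "no_baseline", "seed_variance", "spurious_a", "spurious_b"}

print(precision_recall(single_critic, truth))
print(precision_recall(ensemble, truth))
```

In this toy case the ensemble catches every known flaw but also flags two spurious issues, which is the shape of trade-off the hypothesis predicts.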

Design

Ensemble mode:

  1. ml-critic dispatched 3 times independently
  2. No defender, no convergence loop
  3. Findings clustered by root cause
  4. Each issue tagged with support count: 3/3, 2/3, or 1/3
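
The support-count step above can be sketched as follows. This is an illustration under assumptions, not the ml-lab implementation: it assumes each finding already carries a root-cause label (how clustering is actually done is not specified here), and the name `tag_support` and the example labels are hypothetical.

```python
from collections import defaultdict


def tag_support(runs: list[list[tuple[str, str]]]) -> dict[str, str]:
    """Map each root cause to a support tag such as "2/3".

    `runs` holds one list per independent critic run; each finding is a
    (root_cause, description) pair. The root-cause label is assumed to
    come from an earlier clustering step.
    """
    supporters: dict[str, set[int]] = defaultdict(set)
    for run_index, findings in enumerate(runs):
        for root_cause, _description in findings:
            # A set ensures a run counts once per root cause,
            # even if it reports the same issue twice.
            supporters[root_cause].add(run_index)
    total = len(runs)
    return {cause: f"{len(idx)}/{total}" for cause, idx in supporters.items()}


# Hypothetical findings from three independent critic runs.
runs = [
    [("leakage", "test rows appear in train split"), ("no_baseline", "no baseline reported")],
    [("leakage", "overlap between splits"), ("leakage", "duplicate IDs across splits")],
    [("leakage", "train/test contamination"), ("no_baseline", "missing comparison"),
     ("seed_variance", "single seed only")],
]
print(tag_support(runs))
```

Counting distinct runs rather than raw findings keeps one verbose critic from inflating the support for its own issue.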

Cross-vendor scoring:

  • Same evaluation protocol run with different LLM providers
  • Tests whether results are provider-dependent
  • Separate API credentials required
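
One way to picture the cross-vendor comparison is a harness that runs the same cases through one scoring callable per provider and reports how far the providers disagree. Everything named here (`cross_vendor_scores`, `max_spread`, the `vendor_a`/`vendor_b` stubs) is hypothetical; in a real setup each callable would wrap that provider's client with its own credentials, which this sketch deliberately leaves out.

```python
from typing import Callable

# A scorer takes one evaluation case and returns a numeric score.
Scorer = Callable[[str], float]


def cross_vendor_scores(cases: list[str], scorers: dict[str, Scorer]) -> dict[str, float]:
    """Run the same cases through every provider's scorer; return mean score per provider."""
    if not cases:
        raise ValueError("at least one evaluation case is required")
    return {
        name: sum(score(case) for case in cases) / len(cases)
        for name, score in scorers.items()
    }


def max_spread(scores: dict[str, float]) -> float:
    """Largest gap between any two providers' mean scores."""
    values = list(scores.values())
    return max(values) - min(values)


# Stub scorers standing in for real provider calls (assumption).
scorers = {
    "vendor_a": lambda case: 1.0,
    "vendor_b": lambda case: 0.5,
}
means = cross_vendor_scores(["case-1", "case-2"], scorers)
print(means, max_spread(means))
```

A large spread on identical cases would suggest the result depends on the provider rather than on the material under review.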

Results

  • Ensemble mode achieved higher recall than debate mode, as hypothesized
  • 1/3 minority findings were a mix of genuine novel concerns and noise
  • Cross-vendor results showed provider-specific biases (treated as a finding in its own right, not as noise)
  • Ensemble mode added to ml-lab as an opt-in review mode

Impact on ml-lab

v6 added ensemble mode as a user choice, recommended for exploratory investigations where the risk surface is unknown. Debate mode remains the default for focused hypothesis testing.

Artifacts

  • experiments/self_debate_experiment_v6/
  • REPORT.md, CONCLUSIONS.md
  • DATA_ACQUISITION.md
  • FINAL_SYNTHESIS.md
  • PEER_REVIEW_R1.md