Debate Protocol Design
Why adversarial debate?
A single reviewer — human or LLM — tends to confirm the design it's reading. The critic-defender structure breaks this by giving two agents opposing mandates: one must find flaws, the other must defend the design. Neither can simply agree. This surfaces implicit assumptions that a single-pass review would accept at face value.
The two-stage structure
Stage A: Initial exchange
- ml-critic (R1) reads the PoC and hypothesis, then produces findings tagged by severity (FATAL, MATERIAL, MINOR). The critic's mandate is exhaustive: find every implicit claim the PoC makes but hasn't tested.
- ml-defender (R1) responds to each finding using a 7-type rebuttal taxonomy:

| Type | Meaning |
|---|---|
| CONCEDE | Finding is valid; will fix before main run |
| REBUT-DESIGN | Design choice was intentional and correct |
| REBUT-SCOPE | Finding is out of scope for this investigation |
| REBUT-EVIDENCE | Evidence doesn't support the claimed severity |
| REBUT-IMMATERIAL | Finding is real but doesn't affect conclusions |
| DEFER | Will address as a pre-flight item |
| EXONERATE | Finding is based on a misunderstanding |

The taxonomy matters because it forces the defender to commit to a specific mode of disagreement rather than offering vague pushback.
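One way to enforce that commitment is to reject any defender response whose tag is not one of the seven types. The following is a minimal sketch, not the project's actual code: `RebuttalType`, `Rebuttal`, `parse_rebuttal`, and the `finding_id` field are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class RebuttalType(Enum):
    """The seven rebuttal types from the taxonomy table."""
    CONCEDE = "CONCEDE"
    REBUT_DESIGN = "REBUT-DESIGN"
    REBUT_SCOPE = "REBUT-SCOPE"
    REBUT_EVIDENCE = "REBUT-EVIDENCE"
    REBUT_IMMATERIAL = "REBUT-IMMATERIAL"
    DEFER = "DEFER"
    EXONERATE = "EXONERATE"


@dataclass(frozen=True)
class Rebuttal:
    finding_id: str  # hypothetical identifier for the critic's finding
    rebuttal_type: RebuttalType
    rationale: str


def parse_rebuttal(finding_id: str, type_tag: str, rationale: str) -> Rebuttal:
    """Accept a defender response only if it names one of the seven types."""
    try:
        rebuttal_type = RebuttalType(type_tag.strip().upper())
    except ValueError:
        allowed = ", ".join(t.value for t in RebuttalType)
        raise ValueError(
            f"unknown rebuttal type {type_tag!r}; expected one of: {allowed}"
        ) from None
    if not rationale.strip():
        raise ValueError("a rebuttal needs a rationale")
    return Rebuttal(finding_id, rebuttal_type, rationale.strip())
```

With a check like this, a response tagged with anything outside the table (for example a free-form "DISAGREE") fails fast instead of reaching the convergence loop.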
Stage B: Convergence loop
- ml-critic-r2 challenges each rebuttal with an ACCEPT / CHALLENGE / PARTIAL verdict
- ml-defender responds to challenges
- derive_verdict(), a pure Python function (no LLM), computes a deterministic per-finding verdict based on final-round severity, rebuttal type, and acceptance status
- Loop continues (min 2, max 4 rounds) until verdicts stabilize
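The round limits can be sketched as a small driver loop. This is an illustration under assumptions: `run_round` is a hypothetical callable that runs one critic/defender exchange and returns the per-finding verdicts, and "stabilize" is taken to mean that a round's verdicts are identical to the previous round's. The source does not define stability more precisely.

```python
from typing import Callable, Dict, Optional, Tuple

Verdicts = Dict[str, str]  # finding id -> verdict label

MIN_ROUNDS = 2
MAX_ROUNDS = 4


def run_convergence_loop(
    run_round: Callable[[int], Verdicts],
) -> Tuple[Verdicts, int]:
    """Run rounds until verdicts repeat, within the min/max bounds.

    Returns the final verdicts and the number of rounds executed.
    """
    previous: Optional[Verdicts] = None
    for round_no in range(1, MAX_ROUNDS + 1):
        current = run_round(round_no)
        # Assumed stability test: no verdict changed since the last round.
        if round_no >= MIN_ROUNDS and current == previous:
            return current, round_no
        previous = current
    # Hit the cap without stabilizing; keep the last round's verdicts.
    return previous, MAX_ROUNDS
```

Under this reading, round 2 is the earliest point at which the loop can stop, because it is the first round with a previous result to compare against.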
Why a deterministic verdict function?
The verdict could be computed by an LLM. It isn't, for two reasons:
- Reproducibility: the same inputs always produce the same verdict. An LLM verdict would vary across runs, making it impossible to distinguish methodology changes from stochastic variation.
- Auditability: the verdict logic is inspectable Python code with clear rules. When a verdict seems wrong, you can trace exactly which input caused it and adjust the rules. An LLM verdict is a black box that can only be debugged by prompt engineering.
The verdict function maps (final_severity, rebuttal_type, acceptance_status) → {critique_wins, defense_wins, empirical_test_agreed}. The mapping is explicit and testable.
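To make the shape of such a function concrete, here is a hedged sketch. Only the input triple and the three output labels come from the text above; the specific rules are placeholders invented for illustration, and the function is named `derive_verdict_sketch` to make clear it is not the real `derive_verdict()`.

```python
SEVERITIES = {"FATAL", "MATERIAL", "MINOR"}
REBUTTAL_TYPES = {
    "CONCEDE", "REBUT-DESIGN", "REBUT-SCOPE", "REBUT-EVIDENCE",
    "REBUT-IMMATERIAL", "DEFER", "EXONERATE",
}
ACCEPTANCE = {"ACCEPT", "CHALLENGE", "PARTIAL"}


def derive_verdict_sketch(
    final_severity: str, rebuttal_type: str, acceptance_status: str
) -> str:
    """Map the input triple to a verdict with no randomness and no LLM call.

    The rule order below is an assumption for illustration only.
    """
    if final_severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {final_severity!r}")
    if rebuttal_type not in REBUTTAL_TYPES:
        raise ValueError(f"unknown rebuttal type: {rebuttal_type!r}")
    if acceptance_status not in ACCEPTANCE:
        raise ValueError(f"unknown acceptance status: {acceptance_status!r}")

    if rebuttal_type == "CONCEDE":
        return "critique_wins"  # defender agreed with the finding
    if rebuttal_type == "DEFER":
        return "empirical_test_agreed"  # settle it as a pre-flight check
    if acceptance_status == "ACCEPT":
        return "defense_wins"  # critic accepted the rebuttal
    if acceptance_status == "PARTIAL":
        return "empirical_test_agreed"  # neither side fully prevailed
    # Remaining case: the critic still challenges the rebuttal.
    return "defense_wins" if final_severity == "MINOR" else "critique_wins"
```

Because the function is pure, each rule can be pinned by a unit test, and a surprising verdict can be traced to the exact branch that produced it.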
When ensemble mode is better
Ensemble mode (3× independent critics, no defender, union pooling by support tier) trades depth for breadth. It's better when:
- The risk surface is unknown and you want maximum recall
- You'll manually triage precision — false positives are acceptable
- The hypothesis is exploratory rather than focused
Debate mode is better when:
- You want every finding resolved to a deterministic verdict
- The hypothesis is specific and you need to know which issues are real
- You want the defender to filter low-quality findings before they enter experiment design
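For comparison, the union pooling used by ensemble mode might look like the sketch below. It assumes that "support tier" means the number of independent critics that raised a finding, and that findings from different critics have already been matched to a shared key. Both are assumptions, and `pool_findings` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List, Set


def pool_findings(critic_outputs: List[Set[str]]) -> Dict[int, List[str]]:
    """Union all critics' findings and group them by support count.

    Keys are support tiers (how many critics raised the finding),
    ordered from highest to lowest; values are sorted finding keys.
    """
    support: Counter = Counter()
    for findings in critic_outputs:
        # A set per critic ensures each critic counts at most once per finding.
        support.update(set(findings))

    tiers: Dict[int, List[str]] = {}
    for finding, count in support.items():
        tiers.setdefault(count, []).append(finding)

    return {
        count: sorted(items)
        for count, items in sorted(tiers.items(), reverse=True)
    }


pooled = pool_findings([
    {"label-leak", "seed-variance"},
    {"label-leak"},
    {"label-leak", "metric-mismatch"},
])
```

Nothing is dropped in this scheme: a finding raised by a single critic still appears, in the lowest tier, which is why the mode favours recall and leaves precision to manual triage.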
Macro-iteration
When experimental results falsify a review assumption (e.g., the defender claimed a measurement was valid but the experiment revealed it wasn't), the entire review cycle reopens with results in hand. This is macro-iteration — up to 3 cycles of review → experiment → re-review.
Macro-iteration exists because some flaws are only visible after running the experiment. A pre-experiment review can't catch everything. The alternative — treating the first experiment's results as final — systematically underestimates methodology risk.
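The outer loop can be sketched as follows. The three callables (`review`, `run_experiment`, `is_falsified`) are hypothetical stand-ins for the review cycle, the experiment, and the check that results contradict a review assumption; only the 3-cycle cap comes from the text.

```python
from typing import Any, Callable, List, Tuple

MAX_CYCLES = 3


def macro_iterate(
    review: Callable[[List[Any]], Any],
    run_experiment: Callable[[Any], Any],
    is_falsified: Callable[[Any, Any], bool],
) -> Tuple[int, List[Any], bool]:
    """Repeat review -> experiment until no review assumption is falsified.

    Returns (cycles run, all experiment results, whether the last review held).
    """
    results_so_far: List[Any] = []
    for cycle in range(1, MAX_CYCLES + 1):
        # Each re-review sees every earlier experiment's results.
        review_output = review(results_so_far)
        results = run_experiment(review_output)
        results_so_far.append(results)
        if not is_falsified(review_output, results):
            return cycle, results_so_far, True
    # Cap reached with an assumption still falsified.
    return MAX_CYCLES, results_so_far, False
```

Returning the final flag separately lets a caller distinguish a review that held up from one that merely ran out of cycles.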