v5: Benchmark Case Generation

v5 stepped back from running the main evaluation to build a harder, more rigorous benchmark case library.

Problem

v2–v3 revealed that the benchmark cases were too easy: critics found planted flaws at near-100% rates, which made it impossible to distinguish good critics from mediocre ones. The cases had obvious tells, such as naming conventions, structural patterns, or uniform severity levels.

Approach

v5 built a dedicated case generation pipeline with validation gates:

Generation — multi-LLM pipeline produces candidate cases with planted flaws
Difficulty calibration — each case tested against multiple LLMs; trivially-easy cases rejected
Metadata annotation — must_find, acceptable_resolutions, correct_position, ideal_resolution
Smoke testing — automated validation that cases are well-formed and solvable
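
The stages above can be sketched as a small data model plus two gates. This is a minimal illustration, not the project's actual schema. The four metadata field names come from the list above, but their types, the `BenchmarkCase` class, the `smoke_test` and `passes_difficulty_gate` helpers, the rule that `ideal_resolution` must appear in `acceptable_resolutions`, and the 0.8 threshold are all assumptions. The smoke test here covers only well-formedness; checking that a case is solvable would need a model in the loop.

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkCase:
    """Hypothetical case record; field types are assumed, not the project's."""
    case_id: str
    prompt: str
    must_find: list[str] = field(default_factory=list)
    acceptable_resolutions: list[str] = field(default_factory=list)
    correct_position: str = ""
    ideal_resolution: str = ""


def smoke_test(case: BenchmarkCase) -> list[str]:
    """Return well-formedness problems; an empty list means the case passes."""
    problems = []
    if not case.prompt.strip():
        problems.append("empty prompt")
    if not case.correct_position:
        problems.append("missing correct_position")
    if not case.ideal_resolution:
        problems.append("missing ideal_resolution")
    # Assumed rule: the ideal resolution should be one of the acceptable ones.
    if case.ideal_resolution and case.ideal_resolution not in case.acceptable_resolutions:
        problems.append("ideal_resolution not listed in acceptable_resolutions")
    return problems


def passes_difficulty_gate(detections: list[bool], max_detection_rate: float = 0.8) -> bool:
    """Reject trivially easy cases: keep a case only if the share of probe
    models that found the planted flaw is at or below the threshold."""
    if not detections:
        raise ValueError("need at least one probe result")
    rate = sum(detections) / len(detections)
    return rate <= max_detection_rate
```

With these definitions, a case that every probe model solves (`[True, True, True, True]`) is rejected, while one solved by half of them is kept.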

The pipeline itself went through three architectural iterations:

Single LLM prompt (too inconsistent)
Multi-stage agentic prompt (lost context between stages)
Python-orchestrated multi-LLM pipeline with concurrent execution and validation gates (final)
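
The final architecture can be illustrated with a short orchestration sketch. It shows the general shape only: candidates are generated concurrently, then each one passes through gates in order. The function `run_pipeline`, the stub generators, and the `has_position` gate are hypothetical names introduced for illustration, and the stubs stand in for real LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

# A generator takes a seed topic and returns a candidate case as a dict.
Generator = Callable[[str], dict]
# A gate returns None when the candidate passes, or a rejection reason.
Gate = Callable[[dict], Optional[str]]


def run_pipeline(topics, generators, gates, max_workers=4):
    """Generate one candidate per (topic, generator) pair concurrently,
    then run each candidate through the gates in order."""
    jobs = [(topic, gen) for topic in topics for gen in generators]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, so they line up with jobs.
        candidates = list(pool.map(lambda job: job[1](job[0]), jobs))

    accepted, rejected = [], []
    for candidate in candidates:
        # Take the first gate that reports a problem, if any.
        reason = next((r for r in (gate(candidate) for gate in gates) if r), None)
        if reason is None:
            accepted.append(candidate)
        else:
            rejected.append((candidate, reason))
    return accepted, rejected


# Stub generators (hypothetical) standing in for calls to different LLMs.
def stub_model_a(topic):
    return {"topic": topic, "source": "model_a", "correct_position": "critique"}


def stub_model_b(topic):
    return {"topic": topic, "source": "model_b"}


def has_position(candidate):
    """Example gate: require the correct_position annotation."""
    return None if candidate.get("correct_position") else "missing correct_position"
```

Keeping the orchestration in plain Python means every stage sees the full candidate record, which is one way to avoid the context loss noted for the multi-stage agentic prompt.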

Key design decisions

Cases include both "critic should find something" and "design is actually correct" scenarios
Difficulty is calibrated by empirical performance, not author judgment
Each case has explicit ground truth metadata for automated scoring
Case library is versioned and never modified after validation
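
Two of these decisions lend themselves to a short sketch: automated scoring against ground truth, and detecting changes to a validated library. Both functions below are assumptions rather than the project's code. `score_critique` assumes flaws are matched by string identifier and treats an empty `must_find` as a "design is actually correct" case; `case_fingerprint` and `verify_library` assume cases are stored as JSON-serializable dicts alongside a manifest of hashes.

```python
import hashlib
import json


def score_critique(must_find, reported):
    """Score one critic output against ground truth.

    Flawed case: recall over the planted flaws.
    Correct-design case (empty must_find): 1.0 only if nothing was reported.
    """
    must_find, reported = set(must_find), set(reported)
    if not must_find:
        return 1.0 if not reported else 0.0
    return len(must_find & reported) / len(must_find)


def case_fingerprint(case: dict) -> str:
    """Stable SHA-256 over the canonical JSON form of a case."""
    canonical = json.dumps(case, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_library(cases: dict, manifest: dict) -> list:
    """Return the ids of cases whose content no longer matches the manifest."""
    return [cid for cid, case in cases.items()
            if case_fingerprint(case) != manifest.get(cid)]
```

Recording the fingerprints at validation time and re-running `verify_library` before each evaluation is one simple way to enforce a "never modified after validation" rule.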

Artifacts

experiments/self_debate_experiment_v5/
REPORT.md, CONCLUSIONS.md
POST_MORTEM.md
CASE_GENERATION_REPORT.md
synthetic-candidates/ — generated benchmark cases
plan/scripts/self_debate_poc.py — benchmark case metadata schema