v5: Benchmark Case Generation
v5 stepped back from running the main evaluation to build a harder, more rigorous benchmark case library.
Problem
v2–v3 revealed that benchmark cases were too easy — critics found planted flaws at near-100% rates, making it impossible to distinguish good critics from mediocre ones. Cases had obvious tells: naming conventions, structural patterns, or uniform severity levels.
Approach
A dedicated case generation pipeline with validation gates:
- Generation: multi-LLM pipeline produces candidate cases with planted flaws
- Difficulty calibration: each case tested against multiple LLMs; trivially easy cases rejected
- Metadata annotation: must_find, acceptable_resolutions, correct_position, ideal_resolution
- Smoke testing: automated validation that cases are well-formed and solvable
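The difficulty calibration step can be illustrated with a minimal sketch. It assumes each case has already been run against several models and that each run reduces to a boolean "found the planted flaw". The names `CalibrationResult` and `calibrate` and the 0.8 rejection threshold are hypothetical and are not taken from the v5 pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationResult:
    case_id: str
    detection_rate: float  # fraction of models that found the planted flaw
    accepted: bool         # False when the case is judged trivially easy


def calibrate(case_id, found_by_model, max_detection_rate=0.8):
    """Accept a case only if its empirical detection rate is low enough.

    found_by_model maps a model name to True/False (flaw found or not).
    max_detection_rate is an assumed threshold, chosen for illustration.
    """
    if not found_by_model:
        raise ValueError("need at least one model result to calibrate")
    rate = sum(found_by_model.values()) / len(found_by_model)
    return CalibrationResult(case_id, rate, rate <= max_detection_rate)


# Every model finds the flaw: the case is rejected as trivially easy.
easy = calibrate("case-001", {"model_a": True, "model_b": True, "model_c": True})
# Only one of three models finds the flaw: the case is kept.
hard = calibrate("case-002", {"model_a": True, "model_b": False, "model_c": False})
print(easy.accepted, hard.accepted)
```

The point of the sketch is that the accept/reject decision depends only on measured model behavior, so the threshold can be tuned without re-reading the cases.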
The pipeline itself went through three architectural iterations:
- Single LLM prompt (too inconsistent)
- Multi-stage agentic prompt (lost context between stages)
- Python-orchestrated multi-LLM pipeline with concurrent execution and validation gates (final)
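As a rough sketch of the final architecture, the shape "concurrent generation followed by validation gates" might look like the following. The generator here is a deterministic stand-in for an LLM call, and `generate_candidate`, the two gate functions, and `run_pipeline` are all hypothetical names chosen for illustration; the actual v5 orchestration code is not shown in this document.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_candidate(seed):
    """Stand-in for an LLM call; a real pipeline would query a model here.

    Every fourth seed yields an empty case body so that a gate has
    something to reject in this demo.
    """
    text = "" if seed % 4 == 0 else f"design document {seed}"
    return {"id": f"cand-{seed}", "text": text, "must_find": ["planted-flaw"]}


def gate_well_formed(case):
    # Reject cases with a missing identifier or an empty body.
    return bool(case["id"]) and bool(case["text"])


def gate_has_ground_truth(case):
    # Reject cases that lack ground-truth metadata for scoring.
    return isinstance(case.get("must_find"), list)


GATES = (gate_well_formed, gate_has_ground_truth)


def run_pipeline(seeds, max_workers=4):
    # Generate candidates concurrently; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        candidates = list(pool.map(generate_candidate, seeds))
    accepted, rejected = [], []
    for case in candidates:
        failed = [gate.__name__ for gate in GATES if not gate(case)]
        if failed:
            rejected.append((case["id"], failed))
        else:
            accepted.append(case)
    return accepted, rejected


accepted, rejected = run_pipeline(range(1, 9))
print(len(accepted), rejected)
```

Keeping the orchestration in ordinary Python means each stage receives its full input as a function argument, which is one plausible way to avoid the context loss noted for the multi-stage agentic prompt.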
Key design decisions
- Cases include both "critic should find something" and "design is actually correct" scenarios
- Difficulty is calibrated by empirical performance, not author judgment
- Each case has explicit ground truth metadata for automated scoring
- Case library is versioned and never modified after validation
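To make the last two decisions concrete, here is a minimal sketch of an immutable case record with ground truth and a simple automated score. Only the four field names (`must_find`, `acceptable_resolutions`, `correct_position`, `ideal_resolution`) come from this document; their types, the example values, the `fingerprint` method, and the `must_find_score` rule are assumptions for illustration and may differ from the schema in self_debate_poc.py.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class CaseMetadata:
    case_id: str
    must_find: tuple               # issues a critic must raise; empty if the design is correct
    acceptable_resolutions: tuple  # assumed: resolutions that count as correct
    correct_position: str          # assumed: which side of the debate is right
    ideal_resolution: str

    def fingerprint(self):
        """Content hash that changes if a validated case is ever altered."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def must_find_score(meta, raised_issues):
    """Fraction of must_find issues the critic raised.

    For a "design is actually correct" case (empty must_find), this assumed
    rule gives full credit only when the critic raises nothing.
    """
    raised = set(raised_issues)
    if not meta.must_find:
        return 1.0 if not raised else 0.0
    return len(raised & set(meta.must_find)) / len(meta.must_find)


flawed = CaseMetadata("case-001", ("data-leakage",), ("fix-split",), "critique", "fix-split")
clean = CaseMetadata("case-002", (), ("approve",), "defense", "approve")
print(must_find_score(flawed, ["data-leakage"]), must_find_score(clean, []))
```

Storing the fingerprint alongside each validated case gives a cheap check that the library has not been modified since validation.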
Artifacts
- experiments/self_debate_experiment_v5/REPORT.md, CONCLUSIONS.md
- POST_MORTEM.md
- CASE_GENERATION_REPORT.md
- synthetic-candidates/: generated benchmark cases
- plan/scripts/self_debate_poc.py: benchmark case metadata schema