v5: Benchmark Case Generation

v5 stepped back from running the main evaluation to build a harder, more rigorous benchmark case library.

Problem

v2–v3 revealed that benchmark cases were too easy — critics found planted flaws at near-100% rates, making it impossible to distinguish good critics from mediocre ones. Cases had obvious tells: naming conventions, structural patterns, or uniform severity levels.

Approach

A dedicated case generation pipeline with validation gates:

  1. Generation — multi-LLM pipeline produces candidate cases with planted flaws
  2. Difficulty calibration — each case tested against multiple LLMs; trivially easy cases rejected
  3. Metadata annotation — must_find, acceptable_resolutions, correct_position, ideal_resolution
  4. Smoke testing — automated validation that cases are well-formed and solvable
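
The difficulty calibration step can be sketched as a simple gate on the empirical detection rate. This is a minimal illustration, not the pipeline's actual code: the names `CalibrationResult` and `calibrate`, the input shape (model name mapped to whether that model found the planted flaw), and the 0.8 cutoff are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CalibrationResult:
    case_id: str
    detection_rate: float  # fraction of models that found the planted flaw
    accepted: bool         # False means the case is considered trivially easy


def calibrate(case_id: str,
              found_by_model: Dict[str, bool],
              max_detection_rate: float = 0.8) -> CalibrationResult:
    """Reject a case when too many models find its planted flaw.

    The 0.8 default is a hypothetical threshold, not a value from the experiment.
    """
    if not found_by_model:
        raise ValueError("need at least one model result to calibrate a case")
    # True counts as 1 and False as 0, so the sum is the number of detections.
    rate = sum(found_by_model.values()) / len(found_by_model)
    return CalibrationResult(case_id, rate, accepted=rate <= max_detection_rate)
```

Under these assumptions, a case that every model solves is rejected, while a case solved by only one model in three is kept as a useful discriminator.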

The pipeline itself went through three architectural iterations:

  1. Single LLM prompt (too inconsistent)
  2. Multi-stage agentic prompt (lost context between stages)
  3. Python-orchestrated multi-LLM pipeline with concurrent execution and validation gates (final)
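
The shape of the final iteration, with plain Python driving concurrent generation and then applying validation gates, might look roughly like the sketch below. Everything here is hypothetical: `generate_case` is a stand-in for a real LLM call, `gate_well_formed` is one illustrative gate, and the case dictionary layout is assumed for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Case = Dict[str, object]


def generate_case(seed: int) -> Case:
    # Stand-in for an LLM call; a real generator would prompt a model here.
    return {"id": f"case-{seed}", "must_find": ["planted-flaw"]}


def gate_well_formed(case: Case) -> bool:
    # Illustrative gate: the case needs an id and a must_find entry.
    return bool(case.get("id")) and "must_find" in case


def run_pipeline(seeds: Iterable[int],
                 generate: Callable[[int], Case] = generate_case,
                 gates: Sequence[Callable[[Case], bool]] = (gate_well_formed,),
                 max_workers: int = 4) -> Tuple[List[Case], List[Case]]:
    """Generate candidates concurrently, then keep only cases passing every gate."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input seeds.
        candidates = list(pool.map(generate, seeds))
    accepted: List[Case] = []
    rejected: List[Case] = []
    for case in candidates:
        if all(gate(case) for gate in gates):
            accepted.append(case)
        else:
            rejected.append(case)
    return accepted, rejected
```

Keeping orchestration in ordinary code rather than inside a prompt means state between stages lives in Python objects, which is one plausible way to avoid the context loss seen in the second iteration.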

Key design decisions

  • Cases include both "critic should find something" and "design is actually correct" scenarios
  • Difficulty is calibrated by empirical performance, not author judgment
  • Each case has explicit ground truth metadata for automated scoring
  • Case library is versioned and never modified after validation
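
To make the ground-truth and immutability decisions concrete, here is a minimal sketch of what a case record and an automated score could look like. The four field names come from the annotation step above; their types, the `CaseMetadata` class, the `must_find_recall` function, and the rule for "design is actually correct" cases are assumptions for illustration and may differ from the schema in plan/scripts/self_debate_poc.py.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class CaseMetadata:
    case_id: str
    must_find: Tuple[str, ...]               # flaws a critic is expected to raise
    acceptable_resolutions: Tuple[str, ...]  # outcomes that count as correct
    correct_position: str                    # which side of the debate is right
    ideal_resolution: str                    # the single best outcome


def must_find_recall(meta: CaseMetadata, issues_raised: Iterable[str]) -> float:
    """Fraction of must_find items the critic raised.

    Assumed rule for correct-design cases (empty must_find): score 1.0 only
    when the critic raises nothing.
    """
    raised = set(issues_raised)
    if not meta.must_find:
        return 1.0 if not raised else 0.0
    return len(set(meta.must_find) & raised) / len(meta.must_find)
```

A frozen dataclass only guards in-memory objects; keeping the library itself unmodified after validation would still rely on versioning the stored files.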

Artifacts

  • experiments/self_debate_experiment_v5/
  • REPORT.md, CONCLUSIONS.md
  • POST_MORTEM.md
  • CASE_GENERATION_REPORT.md
  • synthetic-candidates/ — generated benchmark cases
  • plan/scripts/self_debate_poc.py — benchmark case metadata schema