Reference
Lookup-oriented documentation for the v2 investigation. Each page is canonical for one slice of the experimental design or output — extract one fact, link out, return.
For narrative discussion, see the conclusions and the findings. For the verbatim text of every preamble tested, see preambles.
Pages
- Preambles — Verbatim text of all 12 conditions (9 main-run + 3 confound probes).
- Rubric — The 11 algorithmic-code quality dimensions, severity scale, calibration anchor, and conditional-N/A rules.
- Models — The 10-model subject pool (3 reasoning + 7 non-reasoning), reasoning-parameter handling, and the family-based self-judgment rule.
- Tasks — The 7 Python tasks (5 creation incl. one multi-file, 2 refactor) with full prompts and category notes.
- Generation protocol — Subject- and judge-side generation constants (T = 0.3, max_tokens = 10000, concurrency = 50), reasoning-effort gating, retry semantics, and extraction.
- Judge protocol — Judge blindness, the two judge kinds (`idiom_comment`, `rubric`), the full 10-judge cross-judge matrix, and self-judgment exclusion.
- Statistical methods — Kruskal–Wallis omnibus, bootstrap 95% CI (`n_boot = 2000`), and the mixed-effects M0 / M1 / M2 ladder with `preamble × tier` interaction.
- Results schema — Field-level reference for every output file: `generations.jsonl`, `judgments.jsonl`, `sample_cqs.json`, `static_analysis.jsonl`, `REPORT.md`, `MIXED_EFFECTS.md`, `WEIGHT_SENSITIVITY.md`, `confound_probe_results/REPORT.md`.
- Glossary — Definitions for CQS-craft, rubric severity, always-on / conditionally-N/A dimensions, rubric overlap density, attention-allocation, self-judgment exclusion, ETA, defense_wins, critique_wins.
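
As a quick orientation for the bootstrap 95% CI listed under Statistical methods, here is a minimal sketch of a percentile bootstrap with `n_boot = 2000` resamples. It is illustrative only: the function name `bootstrap_ci`, the choice of the mean as the resampled statistic, the percentile method, and the fixed seed are all assumptions, not details taken from the investigation. The Statistical methods page remains canonical.

```python
import random
import statistics


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean (hypothetical helper).

    Assumptions: the statistic is the mean and the interval uses the
    percentile method; the investigation's own procedure may differ.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement n_boot times and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    # Pick the empirical alpha/2 and 1 - alpha/2 quantiles of those means.
    lo_idx = max(0, round((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)
    return means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    scores = [0.2, 0.4, 0.4, 0.5, 0.7, 0.9]  # made-up illustrative values
    low, high = bootstrap_ci(scores)
    print(f"95% CI for the mean: [{low:.3f}, {high:.3f}]")
```

The seed is fixed only so repeated runs of the sketch agree; with 2000 resamples the interval endpoints are the 51st and 1950th sorted resample means.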