
Main run

The v2 main experiment is a single script — preamble_quality_v2_main.py — that executes the locked design from SPEC_V2.md (see Pre-registration).

Prerequisites

export OPENROUTER_API_KEY=<your key>

OPENROUTER_API_KEY is required. There is no mock mode — every generation and every judge call hits OpenRouter.
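Because every call is billed, it can help to fail fast when the key is missing. The following is a minimal sketch of such a preflight check. It is not part of preamble_quality_v2_main.py, and the helper name `require_api_key` is hypothetical.

```python
import os


def require_api_key(env=None, name="OPENROUTER_API_KEY"):
    """Return the API key from the environment, or raise if it is missing or blank.

    Hypothetical helper for illustration only; not taken from the experiment script.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the script")
    return key
```

Calling `require_api_key()` with no arguments reads from the real process environment; passing a dict makes the check easy to exercise in isolation.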

uv resolves all Python dependencies from the script's PEP 723 inline metadata at run time; no virtualenv setup is needed.
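For readers unfamiliar with PEP 723, the inline metadata is a delimited comment block at the top of the script. The sketch below shows the general shape of such a block and extracts it with the reference regular expression given in the PEP. The `requires-python` bound and the dependency names in the example header are illustrative assumptions, not the actual contents of preamble_quality_v2_main.py.

```python
import re

# Reference pattern from PEP 723 for locating inline metadata blocks.
BLOCK_RE = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)

# Illustrative script header; the version bound and dependencies are assumptions.
EXAMPLE_SCRIPT = '''\
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "httpx",
#     "numpy",
# ]
# ///
print("experiment code would start here")
'''


def extract_script_metadata(source):
    """Return the TOML text of the `script` metadata block, or None if absent."""
    for match in BLOCK_RE.finditer(source):
        if match.group("type") == "script":
            lines = match.group("content").splitlines(keepends=True)
            # Strip the leading "# " (or bare "#") from each comment line.
            return "".join(
                line[2:] if line.startswith("# ") else line[1:] for line in lines
            )
    return None
```

The returned text is plain TOML, which is what a PEP 723-aware runner such as uv reads to decide which packages to install before executing the script.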

Smoke test

Run a 4-sample slice first to verify the pipeline end-to-end:

uv run preamble_quality_experiment_v2/preamble_quality_v2_main.py --slice

--slice runs 2 models × 2 preambles × 1 task × 1 rep, i.e. 4 generations. Expect a cost of about $0.12 and a few minutes of wall-clock time. On success, the smoke run produces a REPORT.md that should show directional sanity (e.g. python_coder_agent > none).

Full run

uv run preamble_quality_experiment_v2/preamble_quality_v2_main.py

Expected runtime: ~1 hour.

Expected scale (locked by SPEC §6):

  • 10 subjects × 9 preambles × 7 tasks × 2 reps = 1,260 generations.
  • Full 10-judge cross-judge matrix (self-exclusion) → ~22,680 cross-judge calls.
  • Expected cost: ~$32 (extrapolated from the smoke run; the actual main run came in at $32.02).
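The generation counts follow directly from the factorial design, and the same arithmetic yields the per-generation ratios implied by the totals above. The sketch below reproduces those numbers. The function name is hypothetical, and the ratios are plain arithmetic on the figures stated on this page, not values taken from SPEC_V2.md.

```python
from math import prod


def design_size(*factors):
    """Number of cells in a full factorial design with the given factor sizes."""
    return prod(factors)


full_generations = design_size(10, 9, 7, 2)   # subjects, preambles, tasks, reps
slice_generations = design_size(2, 2, 1, 1)   # the --slice configuration

cross_judge_calls = 22_680                    # approximate total quoted above
calls_per_generation = cross_judge_calls / full_generations

actual_cost_usd = 32.02                       # reported cost of the original main run
cost_per_generation = actual_cost_usd / full_generations

print(full_generations, slice_generations)    # 1260 4
print(calls_per_generation)                   # 18.0
print(round(cost_per_generation, 4))          # 0.0254
```

In other words, the quoted totals correspond to 18 judge calls and roughly 2.5 cents per generation; how the script arrives at 18 calls per generation is defined by the spec rather than by this page.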

Useful flags

| Flag | Behavior |
| --- | --- |
| --slice | Small smoke-test slice (2 models × 2 preambles × 1 task × 1 rep). |
| --resume | Resume: skip already-completed work in JSONL files. |
| --gen-only | Stop after generation phase (no judging, no analysis). |
| --skip-static | Skip static-analysis diagnostic panel. |

JSONL files are written incrementally as work completes, so --resume after an interruption picks up where the previous run stopped.
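The resume behavior described above amounts to an append-only log plus a skip set. A minimal sketch of that pattern follows. It is not the script's actual implementation, and the key fields (`model`, `preamble`, `task`, `rep`) and helper names are assumptions made for illustration.

```python
import json
from pathlib import Path

KEY_FIELDS = ("model", "preamble", "task", "rep")  # assumed identifying fields


def completed_keys(path):
    """Collect keys of rows already present in a JSONL file (empty set if missing)."""
    done = set()
    p = Path(path)
    if not p.exists():
        return done
    with p.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            done.add(tuple(row.get(f) for f in KEY_FIELDS))
    return done


def run_with_resume(path, work_items, do_work):
    """Run only items not yet recorded, appending one JSON row per completed item.

    Returns the number of items actually executed in this call.
    """
    done = completed_keys(path)
    executed = 0
    with Path(path).open("a", encoding="utf-8") as fh:
        for item in work_items:
            key = tuple(item[f] for f in KEY_FIELDS)
            if key in done:
                continue  # already completed in a previous run
            row = {**item, "result": do_work(item)}
            fh.write(json.dumps(row) + "\n")
            fh.flush()  # hand the row to the OS before starting the next item
            done.add(key)
            executed += 1
    return executed
```

Writing one row per completed item, rather than one file at the end, is what lets an interrupted run lose at most the item that was in flight.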

Outputs

All artifacts land under preamble_quality_experiment_v2/experiment_v2_results/:

  • generations.jsonl — one row per generation (1,260 rows on a full run), including seeds, temperature, provider, raw response, and extraction result.
  • judgments.jsonl — one row per (sample, judge_model) rubric evaluation, including per-dimension severities and rationale fields.
  • sample_cqs.json — per-generation CQS-craft and its components (idiom, comment, mean rubric severity).
  • static_analysis.jsonl — radon / pylint / flake8 cognitive-complexity panel, reported as a diagnostic — not a CQS input.
  • REPORT.md — headline numbers (Kruskal–Wallis, per-condition means with bootstrap CIs, rubric per-dimension breakdown).
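REPORT.md reports per-condition means with bootstrap CIs. The exact resampling scheme is defined by the script and SPEC_V2.md; purely as an illustration of the general idea, a percentile bootstrap for a mean can be sketched with the standard library. The function name, its defaults, and the scores below are assumptions, and the scores are placeholder values rather than experiment results.

```python
import random
from statistics import mean


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Placeholder scores for one hypothetical condition (not real data).
placeholder_scores = [0.2, 0.4, 0.4, 0.5, 0.7, 0.9]
low, high = bootstrap_mean_ci(placeholder_scores)
print(f"mean={mean(placeholder_scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```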

After the main run completes, see Analysis to reproduce the mixed-effects model, weight-sensitivity table, and figures. See Confound probes to reproduce the three post-hoc probes.

Cross-reference

The v2 main-run completion is recorded in INVESTIGATION_LOG.jsonl at seq 41 — useful as a sanity check that your reproduction matches the original run's extraction rate (96.4%), judge parse rate (90.7%), and headline KW p-value (9.24×10⁻¹⁸ pooled).
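One way to run that sanity check is to recompute the two rates from your own output rows and compare them with the logged values. The sketch below assumes hypothetical boolean fields (`extracted`, `parsed`) and an arbitrary tolerance of two percentage points; the real field names live in the JSONL files, and because generations are sampled, a reproduction is not expected to match the original run exactly.

```python
# Reference values quoted from the log entry cited above.
REFERENCE = {"extraction_rate": 0.964, "judge_parse_rate": 0.907}


def rate(rows, flag):
    """Fraction of rows whose boolean `flag` field is truthy."""
    rows = list(rows)
    if not rows:
        raise ValueError("no rows to compute a rate from")
    return sum(1 for r in rows if r.get(flag)) / len(rows)


def out_of_tolerance(observed, reference=REFERENCE, tolerance=0.02):
    """Return names of metrics whose observed value differs from the reference by more than `tolerance`."""
    return [
        name
        for name, ref in reference.items()
        if abs(observed[name] - ref) > tolerance
    ]
```

An empty list from `out_of_tolerance` means both rates are within the chosen tolerance of the original run; any names returned point at the metric worth investigating first.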