Confound probes

The three post-hoc confound probes (A, B, and C) are run by a single script, confound_probes.py. They were added after the main run to separate two explanations that the main result alone cannot tell apart: an H-mechanism story (preambles change code-craft) and an H-judge-priming story (preambles align the model's surface markers to the enumerated rubric dimensions). The probes reuse the main run's judge panel and rubric verbatim: the same 10-judge cross-judge matrix, the same self-exclusion, the same calibration anchor, and reasoning judges with reasoning excluded.

Command

uv run preamble_quality_experiment_v2/confound_probes.py

Prerequisites: OPENROUTER_API_KEY is set, and the main run's experiment_v2_results/{generations.jsonl, judgments.jsonl, sample_cqs.json} already exist, because the probes pull reference-condition stats from the main-run data.
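The prerequisites can be checked before spending any API budget. The following is a minimal sketch, not part of confound_probes.py: the helper name `missing_prerequisites` is hypothetical, and the location of experiment_v2_results relative to your working directory is an assumption, so it is passed in as a parameter.

```python
import os
from pathlib import Path

# File names come from the prerequisite list above. The directory holding
# them is a parameter because its location in the repo is an assumption.
REQUIRED_FILES = ("generations.jsonl", "judgments.jsonl", "sample_cqs.json")


def missing_prerequisites(results_dir, env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # An unset or empty key is treated as missing.
    if not env.get("OPENROUTER_API_KEY"):
        problems.append("OPENROUTER_API_KEY is not set")
    for name in REQUIRED_FILES:
        path = Path(results_dir) / name
        if not path.is_file():
            problems.append(f"missing main-run file: {path}")
    return problems
```

Calling `missing_prerequisites("experiment_v2_results")` from the directory that contains the main-run results should return an empty list when everything is in place.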

Expected runtime and cost. About 5 minutes wall-clock at concurrency=50. Budget ~$3–5; the actual run came in at $1.04.

What the three probes test

All three are run on task_expr_parser (a creation task that elicits all 9 always-on rubric dimensions plus example_quality, but not concurrency_safety). Subjects: all 10 v2 main-run models. n=10 generations per probe under the full 10-judge cross-judge panel.

| Probe | Preamble shape | What it isolates |
| --- | --- | --- |
| A (long_directive_misaligned) | Same length + expert tone as long_directive, but 12 clauses naming axes not in the rubric (compactness, performance, determinism, etc.). | If CQS-craft stays substantially above none, expert-framed directive priming has effect beyond rubric-naming. If it drops below, rubric-overlap is doing the work. |
| B (rubric_bare_list) | Just the rubric items as a bare enumerated list, with no engineering-discipline framing. | If this matches long_directive, naming the rubric items is the whole effect. If it falls short, imperative tone + framing contribute too. |
| C (antirubric_directive) | Same expert framing, but clauses explicitly deprioritize the rubric items (no docstrings, no type hints, no defensive guards). | If CQS-craft drops below none, code-level surface markers track preamble priming even against the judges' stated rubric. |
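The probe design can be sketched as a small job grid: three probes, one task, ten subjects, one repetition each, with each sample judged by the panel minus the subject itself. This is an illustrative sketch only. The function names and the `model_N` identifiers are placeholders, and treating the judge panel as the same ten models as the subjects is an assumption, not something confound_probes.py is known to do.

```python
from itertools import product

# Probe and task names are taken from the text above.
PROBES = ("long_directive_misaligned", "rubric_bare_list", "antirubric_directive")
TASK = "task_expr_parser"


def generation_jobs(subjects, reps=1):
    """Enumerate one generation job per (probe, subject, rep)."""
    return [
        {"task": TASK, "probe": probe, "subject": subject, "rep": rep}
        for probe, subject, rep in product(PROBES, subjects, range(reps))
    ]


def judges_for(subject, panel):
    """Self-exclusion: a model never judges its own generation."""
    return [judge for judge in panel if judge != subject]


# Placeholder model names; the real subject list lives in the main run.
subjects = [f"model_{i}" for i in range(10)]
jobs = generation_jobs(subjects)
print(len(jobs))  # 30 = 3 probes x 10 subjects x 1 rep
```

Under the assumption that the panel and the subject list coincide, `judges_for` leaves nine judges per sample.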

Expected output

All artifacts land under preamble_quality_experiment_v2/confound_probe_results/:

  • generations.jsonl — one row per probe generation (30 rows total: 3 probes × 10 subjects × 1 rep).
  • judgments.jsonl — one row per (sample, judge_model) rubric evaluation.
  • sample_cqs.json — per-generation CQS-craft and components.
  • REPORT.md — headline numbers per probe vs. main-run reference conditions (none, long_directive, negative_control, python_coder_agent), per-probe significance tests vs. none, per-dimension severity table, and a verdict on which mechanism story the probes support.
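A quick sanity check after a run is to confirm that generations.jsonl holds the 30 rows described above. The sketch below assumes only that the file is standard JSON Lines (one JSON object per line); the helper names are hypothetical and no field names are assumed.

```python
import json
from pathlib import Path

# 3 probes x 10 subjects x 1 rep, as stated in the artifact list above.
EXPECTED_GENERATIONS = 3 * 10 * 1


def load_jsonl(path):
    """Parse a JSON Lines file, skipping blank lines."""
    rows = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                rows.append(json.loads(line))
    return rows


def check_generation_count(path, expected=EXPECTED_GENERATIONS):
    """Return (ok, actual_row_count) for a generations file."""
    rows = load_jsonl(path)
    return len(rows) == expected, len(rows)
```

For example, `check_generation_count("preamble_quality_experiment_v2/confound_probe_results/generations.jsonl")` returns a pair whose first element is True only when the row count matches.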

The original probe run found Probe A at −0.155 vs. none (p=0.0001), Probe B at +0.015 vs. none (p=0.50, recovering ~70% of long_directive's lift), and Probe C at −0.154 vs. none (p=0.0001). The refined mechanism story (attention allocation, not judge information leakage) is documented in INVESTIGATION_LOG.jsonl seq 49–51 and in REPORT_ADDENDUM.md.

Cross-reference

  • The probes were prompted by the post-main-run mechanism question. See Investigation logs seq 49 for the trigger.
  • Judges remained blind to the preamble throughout: the judge prompt sees only the code, not the subject's preamble. This clarification was added to CONCLUSIONS.md and the root README at log entry seq 51 to head off the "judges reward expert-toned preambles" misreading.
  • The probes provide the empirical basis for Limitation 1: CQS-craft is rubric-dependent.