Worked Example: LLM Preambles¶
This tutorial walks through a real investigation that used ml-lab to test whether coding-agent preambles measurably change code quality. The investigation was run in the llm-preamble repository and produced a statistically significant result (p = 9.2 x 10-18) with a refined mechanism explanation.
Every artifact referenced below is real — nothing was fabricated for this tutorial.
The hypothesis¶
The investigation started with a question: do coding-agent preambles affect the quality of generated code? The initial hypothesis was sharpened through conversation into a falsifiable claim:
Production-realistic preambles (including reasoning-inclusive models) move craft-quality dimensions but not capability dimensions, as measured by an LLM-judge rubric.
Step 1: Build a proof-of-concept¶
ml-lab's first step is always a minimal PoC — just enough code to prove the measurement is possible. The preamble investigation's PoC generated 6 code samples across 3 preamble conditions and judged them with 18 LLM judge calls. Total cost: ~$0.01.
preamble_quality_v2_poc.py
├── 3 preamble conditions (none, expert_coder, long_directive)
├── 2 tasks × 1 rep per condition = 6 generations
└── 3 judges per generation = 18 judge calls
The PoC confirmed that (a) models produce measurably different code under different preambles, and (b) the LLM judge rubric can detect the difference. That's enough to proceed.
Step 2: Adversarial review¶
With the PoC in hand, ml-lab dispatched ml-critic for a Stage A round-1 critique. The critic found 7 issues:
| ID | Severity | Finding |
|---|---|---|
| F1 | FATAL | Score clamping bug — rubric dimension scores were silently clamped to [0, 5] before validation |
| F2 | FATAL | No trap task — baseline smell rate could be 0%, making severity scores meaningless |
| F3 | MATERIAL | Subject pool too narrow — 7 non-reasoning models only |
| F4 | MATERIAL | Single-judge scoring lacks reliability estimate |
| F5 | MATERIAL | Rubric dimensions not validated against actual prevalence |
| F6 | MINOR | No explicit reasoning parameter logging |
| F7 | MINOR | Missing provider metadata in generation records |
ml-defender responded with structured rebuttals using the 7-type taxonomy:
- F1: CONCEDE — the clamping bug was real and had to be fixed before the main run
- F2: DEFER — agreed to add a trap task as a pre-flight check
- F3: REBUT-SCOPE → eventually CONCEDE — pool was expanded from 7 to 10 models (adding 3 reasoning models)
- F4: REBUT-DESIGN → eventually CONCEDE — restored 10-judge cross-judge panel
- F5: DEFER — added rubric prevalence audit to pre-flight
- F6–F7: CONCEDE — minor instrumentation fixes
Three rounds of Stage B debate (ml-critic-r2 challenging each rebuttal, ml-defender responding) converged on all 7 findings resolved.
The value of adversarial review
The F1 finding (clamping bug) was a genuine show-stopper that would have invalidated the entire experiment. The PoC appeared to work — the bug was silent. Without structured adversarial review, it would have shipped to the main run.
Step 3: Pre-flight and pre-registration¶
The debate produced a Gate 1 plan with 13 captured decisions and a pre-flight checklist:
- Fix the clamping bug (F1)
- Add a trap task with 0% baseline smell verification (F2)
- Expand the model pool to include reasoning models (F3)
- Restore multi-judge panel (F4)
- Run rubric prevalence audit (F5)
- Add reasoning parameter and provider logging (F6–F7)
Each pre-flight item was executed as a separate phase (A through D2) with pass/fail gates. The pre-registration spec (SPEC_V2.md) was written with 12 sections including an amendment log.
Five amendments were logged during pre-flight as the spec drifted:
| Amendment | Trigger | Change |
|---|---|---|
| A1 | Phase D prevalence audit | Rubric redesigned: 14 → 11 algorithmic-code dimensions |
| A2 | Baseline analysis | Task dropped: modeflag_sort had 0% baseline smell rate |
| A3 | Pool expansion | Model pool grew from 7 to 10 (3 reasoning models added) |
| A4 | Reasoning model inclusion | Explicit reasoning parameter + provider logging |
| A5 | Judge reliability concern | Multi-judge panel + calibration anchor restored |
Step 4: Main experiment run¶
With all pre-flight items closed and /intent-watch confirming no spec drift, the main run executed:
- 1,260 generations (10 models × 9 preambles × 7 tasks × 2 reps)
- 22,680 judge calls (10-judge panel per generation)
- Cost: ~$32
- Valid data: 1,215/1,260 generations, 22,028/24,300 judge calls parsed
Step 5: Results and mechanism discovery¶
The primary result was unambiguous: p = 9.2 × 10-18 on pooled CQS-craft (Kruskal-Wallis test across preamble conditions). Seven of nine always-on rubric dimensions were individually significant.
But the investigation didn't stop at "preambles matter." ml-lab's Step 9 (production re-evaluation) prompted three confound probes to discriminate why preambles matter:
| Probe | Condition | Result | Rules out |
|---|---|---|---|
| A | Non-rubric expert preamble | -0.155 CQS vs none | — |
| B | Bare rubric (no persona) | +0.015 CQS vs none (recovers 70% of long_directive lift) | Judge-priming hypothesis |
| C | Anti-rubric expert preamble | -0.154 CQS vs none (symmetric with A) | Persona-halo hypothesis |
The refined mechanism: preambles direct the model's attention-allocation budget toward enumerated dimensions. CQS-craft is real but rubric-dependent — lift is proportional to overlap between preamble-enumerated dimensions and rubric-measured dimensions.
Step 6: Investigation log¶
The full investigation generated 53 sequenced entries in INVESTIGATION_LOG.jsonl:
Entries 1–7: PoC + hypothesis correction
Entries 8–25: Three-round debate (critic/defender)
Entries 26–28: Gate 1 plan approval
Entries 29–38: Pre-flight phases A–D2
Entries 39–41: Main run execution
Entries 42–53: Analysis + confound probes
Every decision, amendment, and finding is traceable through this log.
Key takeaways¶
- The PoC is cheap. $0.01 and 30 seconds to confirm the measurement works before investing in review.
- Adversarial review finds real bugs. The clamping bug (F1) would have invalidated the experiment. It was silent — the PoC appeared to work.
- Pre-registration drift is normal. Five amendments were logged and justified. The point isn't to prevent drift — it's to make drift visible and auditable.
- Confound probes refine mechanisms. The main result said "preambles matter." The probes said why — attention allocation, not judge priming or persona halo.
- The investigation log is the audit trail. 53 entries trace the full path from hypothesis to mechanism.
Next: Setting Up ml-journal — initialize the audit trail that makes this kind of traceability possible.