LLM Preamble Experiments
An empirical investigation of whether coding-agent preambles measurably change the quality of the code LLMs produce. They do, and the channel is load-bearing in both directions: a preamble can raise or lower output quality. The evidence comes from two pre-registered investigations covering 1,290 generations and 25,140 cross-judge ratings.
TL;DR
The preamble channel is load-bearing: content choices measurably move outputs in either direction relative to a no-preamble baseline. There is no universal "best preamble"; a preamble's effect is governed by the overlap between the dimensions it enumerates and the dimensions your downstream evaluator measures. Effect sizes are modest in either direction (~3–6 points out of 100).
The full headline table with empirical evidence per claim is in the five findings.
Start here
If you're a practitioner deciding what to put in your agent's system prompt
- The five findings — each with claim, evidence table, action, and related-work pointer.
- Designing a preamble — a six-step procedure derived from the findings.
- Preambles tested verbatim — the actual text of all 12 conditions, with mean CQS-craft per condition.
- Interpret CQS-craft effect sizes — when the measured effect matters and when it doesn't.
If you're a researcher or methodologist
- Explanation — mechanism arguments: attention-allocation, enumeration-vs-demonstration, the verified system-vs-user channel asymmetry, why static metrics miss, the v1 → v2 instrument correction.
- Methodology — pre-registration, ml-lab debate workflow, investigation logs, the five spec amendments, limitations.
- Statistical methods — Kruskal–Wallis omnibus, mixed-effects M0/M1/M2, bootstrap 95% CI.
- Related work — situating in the 2024–2026 literature on persona/system-prompt effects and LLM-as-judge evaluation.
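For readers who want a feel for what a bootstrap 95% CI on a condition-vs-baseline difference looks like, here is a minimal sketch. It is not the repo's analysis code: the function name `bootstrap_mean_diff_ci`, the synthetic scores, and the choice of a percentile bootstrap on the difference in means are all assumptions made for illustration. The synthetic gap of about 4 points is chosen only to fall inside the ~3–6 point range quoted above.

```python
import random
import statistics


def bootstrap_mean_diff_ci(treated, baseline, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(treated) - mean(baseline).

    Hypothetical helper for illustration; not taken from the repo.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group independently, with replacement.
        t = rng.choices(treated, k=len(treated))
        b = rng.choices(baseline, k=len(baseline))
        diffs.append(statistics.fmean(t) - statistics.fmean(b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic 0-100 scores (assumed data, NOT results from the experiments).
data_rng = random.Random(42)
baseline = [min(100.0, max(0.0, data_rng.gauss(70, 8))) for _ in range(100)]
treated = [min(100.0, max(0.0, data_rng.gauss(74, 8))) for _ in range(100)]

point = statistics.fmean(treated) - statistics.fmean(baseline)
low, high = bootstrap_mean_diff_ci(treated, baseline)
print(f"difference in means: {point:.2f}, 95% CI: [{low:.2f}, {high:.2f}]")
```

If the interval excludes zero, the preamble's shift is distinguishable from resampling noise under these assumptions. Whether a shift of that size matters in practice is a separate question, covered by the effect-size interpretation page listed above.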
If you want to reproduce the runs
- Quickstart — setup + commands.
- Main run — full v2 experiment end-to-end (~1 hour).
- Confound probes — the 3-probe identification test (~5 min).
What this site is
A Diátaxis-organized companion to the llm-preamble repo. The repo's README.md, PREAMBLES.md, RELATED_WORK.md, and per-cycle CONCLUSIONS.md remain the source-of-truth artifacts; this site reorganizes them for navigation and adds the explanation pages that consolidate mechanism arguments developed during and after the investigation. Every page links back to the underlying source — script, results file, or markdown artifact — under the repo root.
ML-Lab
This investigation was designed, executed, and analyzed using ml-lab, a Claude Code plugin for rigorous, pre-registered ML hypothesis investigations (hypothesis → adversarial critique → PoC → empirical resolution → peer review). Every artifact in this repo (HYPOTHESIS.md, SPEC_V2.md, CONCLUSIONS.md, REPORT_ADDENDUM.md, INVESTIGATION_LOG.jsonl) is a canonical output of that workflow.