LLM Preamble — Project

Overview

LLM Preamble asks whether the system-prompt content of a coding agent materially changes the code its model produces. Two investigations sit behind a one-sentence answer: yes, but the direction is governed by overlap between the preamble’s enumerated dimensions and the dimensions the downstream evaluator scores, not by tone or expertise.

The question

Coding-agent vendors and researchers ship long, opinionated system prompts (“you are a senior engineer”, “prefer composition over inheritance”, “validate inputs and handle edge cases”). Whether that preamble content actually moves output quality, in which directions, and through what mechanism was largely unmeasured. The first investigation (v1) established a measurable effect on alignment-tunable craft dimensions. The second (v2) refined the mechanism with independent confound probes.

Method

Both investigations lock their analysis plans before any data collection. The generation pool spans 10 frontier models via OpenRouter, a fixed task suite, and multiple preamble conditions per task. Evaluation uses LLM-as-judge with a cross-judge panel: every generation is rated by a panel of judges blind to which preamble produced it, and mixed-effects models account for model, task, and judge random effects. Independent confound probes then test the proposed mechanism rather than only the headline effect.
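
As a rough sketch of what such a mixed-effects model might look like, one plausible form treats preamble condition as a fixed effect and model, task, and judge as crossed random intercepts. The exact specification (any interactions, random slopes, or how repeated generations are indexed) is not stated here, so the symbols below are assumptions for illustration only:

```latex
% Assumed form: y is the judge's score for condition c, model m, task t, judge j.
% \beta_c is the effect of preamble condition c relative to the no-preamble baseline.
\[
  y_{cmtj} = \beta_0 + \beta_c + u_m + v_t + w_j + \varepsilon_{cmtj},
\]
\[
  u_m \sim \mathcal{N}(0, \sigma_m^2), \quad
  v_t \sim \mathcal{N}(0, \sigma_t^2), \quad
  w_j \sim \mathcal{N}(0, \sigma_j^2), \quad
  \varepsilon_{cmtj} \sim \mathcal{N}(0, \sigma^2).
\]
```

Under this reading, a reported coefficient for a preamble condition corresponds to $\beta_c$, with $\beta_c = 0$ for the no-preamble baseline.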

What it found

Preambles are potent enough to push performance below the no-prompt baseline. A “junior developer” framing and a high-effort expert directive enumerating non-rubric dimensions both lower code quality more than the strongest aligned preamble raises it (β = −0.060 and −0.155 respectively, both p < 10⁻⁴ vs no preamble). The lever pulls both directions, and harder downward than upward. The model acts on whatever the preamble says.
The mechanism is rubric overlap, not engineering virtue. A bare list of the evaluator’s dimensions, with no expert tone, captures ~70% of the maximum positive lift. Imperative phrasing and per-dimension explanations add the remaining 30%. What’s enumerated moves; what isn’t, doesn’t.
Implication: there is no universal best preamble. The winning preamble changes when the evaluator changes. The “best” preamble for a benchmark that scores compactness and performance would lose under a benchmark that scores documentation and edge-case handling.
Static-analysis tools and metrics (radon, pylint, cyclomatic complexity, Halstead measures) can’t see preamble effects at all: results are flat across every condition. The signal lives only in evaluators that score the dimensions preambles actually tune.
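
The rubric-overlap mechanism and the evaluator-dependent ranking described above can be illustrated with a small toy sketch. Everything in it is hypothetical: the dimension names, the two preambles, the two rubrics, and the `rubric_coverage` and `rank_preambles` helpers are made up for illustration and are not the investigation's code or data. Coverage here is only a stand-in for overlap, not a measured lift, and the sketch does not model the penalty observed for enumerating non-rubric dimensions.

```python
# Toy illustration of rubric overlap. All names and values are hypothetical;
# nothing here is taken from the investigation's code or results.

def rubric_coverage(preamble_dims, rubric_dims):
    """Fraction of the evaluator's rubric dimensions that the preamble enumerates."""
    rubric = set(rubric_dims)
    if not rubric:
        return 0.0
    return len(set(preamble_dims) & rubric) / len(rubric)


def rank_preambles(preambles, rubric_dims):
    """Return preamble names sorted by rubric coverage, highest first (ties by name)."""
    return sorted(
        preambles,
        key=lambda name: (-rubric_coverage(preambles[name], rubric_dims), name),
    )


# Two hypothetical preambles, each enumerating different quality dimensions.
PREAMBLES = {
    "terse": {"compactness", "performance"},
    "defensive": {"documentation", "edge_cases", "input_validation"},
}

# Two hypothetical evaluators that score different dimensions.
RUBRIC_A = {"compactness", "performance"}
RUBRIC_B = {"documentation", "edge_cases"}

if __name__ == "__main__":
    for label, rubric in (("A", RUBRIC_A), ("B", RUBRIC_B)):
        print(label, rank_preambles(PREAMBLES, rubric))
```

In this toy setup the ranking of the two preambles reverses between the two rubrics, which is the sense in which a preamble can only be "best" relative to a given evaluator.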

Built with ML Lab

The investigation was run end-to-end inside ML Lab, with hypothesis sharpening, pre-registration with locked metrics and pass criteria, adversarial review of the experimental design, and an append-only audit log of every step. The pre-registration, falsifiable claims, and explicit confound probes aren’t decoration; they’re what the framework requires. This is the second public investigation completed through that workflow, after ATO Device Embeddings.