Effect-size calibration: when this matters and when it doesn't

CQS-craft is on a [0, 1] scale. The empirical anchors:

| Anchor | CQS-craft |
| --- | --- |
| `trivial_baseline` (no system, name-only prompt, T=1.0) | 0.556 |
| `negative_control` ("junior developer") | 0.723 |
| `none` (no system prompt at all) | 0.778 |
| Strongest single preamble (`long_directive`) | 0.815 |
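
To make the scale concrete, the short sketch below expresses each anchor as a difference from the `none` condition. The scores are copied from the anchors above; the `ANCHORS` dictionary and the `gap_vs` helper are illustrative names, not part of any CQS tooling.

```python
# Anchor scores copied from the anchors listed above.
ANCHORS = {
    "trivial_baseline": 0.556,
    "negative_control": 0.723,
    "none": 0.778,
    "long_directive": 0.815,
}


def gap_vs(reference: str) -> dict[str, float]:
    """Return each anchor's CQS-craft difference from the reference anchor."""
    base = ANCHORS[reference]
    return {name: round(score - base, 3) for name, score in ANCHORS.items()}


gaps = gap_vs("none")
for name, delta in gaps.items():
    print(f"{name:>18}: {delta:+.3f}")
```

Read this way, the strongest single preamble sits 0.037 above having no system prompt, while the negative control sits 0.055 below it. The whole preamble effect lives inside a band of roughly 0.09 on the [0, 1] scale.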

This matters when:

You ship to a high-volume coding agent where small per-sample quality differences compound: at millions of code suggestions per day, a small β multiplied by a large N produces a measurable downstream user impact.
Your downstream evaluator measures the same dimensions the v2 rubric measures (error handling, edge cases, type discipline, documentation, organization, abstraction calibration, API ergonomics, concurrency safety).
You have an A/B test budget large enough to detect a 5-point shift (0.05 on the [0, 1] scale), which takes at least a few hundred samples per arm; v2's per-arm n was ~138.
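
The sample-size condition can be sanity-checked with a standard two-sample power calculation. The sketch below is a rough normal-approximation estimate, not the analysis used in v2. The per-sample standard deviation `sd=0.2`, the significance level, and the power target are all assumptions chosen for illustration; substitute the spread you observe in your own scores.

```python
from math import ceil, sqrt
from statistics import NormalDist


def _z_sum(alpha: float, power: float) -> float:
    """Sum of the two-sided critical value and the power quantile."""
    norm = NormalDist()
    return norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)


def samples_per_arm(delta: float, sd: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate n per arm to detect a mean shift `delta` between two arms.

    Uses the equal-variance normal approximation:
    n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / delta) ** 2
    """
    return ceil(2 * (_z_sum(alpha, power) * sd / delta) ** 2)


def min_detectable_shift(n_per_arm: int, sd: float,
                         alpha: float = 0.05, power: float = 0.8) -> float:
    """Smallest mean shift detectable with `n_per_arm` samples in each arm."""
    return _z_sum(alpha, power) * sd * sqrt(2 / n_per_arm)


ASSUMED_SD = 0.2  # hypothetical per-sample spread of CQS-craft scores

n_for_5_points = samples_per_arm(delta=0.05, sd=ASSUMED_SD)
shift_at_138 = min_detectable_shift(n_per_arm=138, sd=ASSUMED_SD)
print(n_for_5_points, round(shift_at_138, 3))
```

Under the assumed spread of 0.2, the sketch asks for 252 samples per arm to detect a 0.05 shift, which is in line with "a few hundred". With 138 per arm, it would only reliably detect shifts of about 0.067. A noisier evaluator raises both numbers quickly, since the required n grows with the square of `sd / delta`.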

This matters less when:

Your downstream evaluator measures different dimensions (compactness, performance, security). The probes showed that the preamble winners flip under a different rubric.
You're shipping to a low-volume specialty system where per-sample variance dwarfs the expected preamble effect.
Your model is already at the high end of the CQS-craft range. There is evidence of a ceiling near 0.85 on this rubric for current frontier models; a preamble can move you toward it but not past it.

This does not matter at all when:

You're using static-analysis tools or metrics (radon, pylint, cyclomatic complexity) as your quality bar. Those don't detect preamble effects.