Effect-size calibration — when this matters and when it doesn't

CQS-craft is on a [0, 1] scale. The empirical anchors:

| Anchor | CQS-craft |
| --- | --- |
| trivial_baseline (no system, name-only prompt, T=1.0) | 0.556 |
| negative_control ("junior developer") | 0.723 |
| none (no system prompt at all) | 0.778 |
| Strongest single preamble (long_directive) | 0.815 |
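To make the size of these gaps concrete, the short sketch below recomputes each anchor's difference from the `none` condition. The scores are copied from the anchors above; the names `ANCHORS` and `deltas_vs` are illustrative only and are not part of any published tooling.

```python
# Anchor scores copied from the anchors listed above (CQS-craft, [0, 1] scale).
ANCHORS = {
    "trivial_baseline": 0.556,
    "negative_control": 0.723,
    "none": 0.778,
    "long_directive": 0.815,
}


def deltas_vs(reference: str, anchors: dict[str, float] = ANCHORS) -> dict[str, float]:
    """Return each anchor's CQS-craft difference from the reference anchor."""
    base = anchors[reference]
    return {name: round(score - base, 3) for name, score in anchors.items()}


if __name__ == "__main__":
    # Print signed differences relative to running with no system prompt.
    for name, delta in deltas_vs("none").items():
        print(f"{name:18s} {delta:+.3f}")
```

Read this way, the strongest single preamble sits 0.037 above `none`, which is a smaller gap than the 0.055 separating `none` from the negative control.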

This matters when:

  • You ship to a high-volume coding agent where small per-sample quality differences compound: at millions of code suggestions per day, a small per-sample effect (β) multiplied by a large N adds up to measurable downstream user impact.
  • Your downstream evaluator measures the same dimensions the v2 rubric measures (error handling, edge cases, type discipline, documentation, organization, abstraction calibration, API ergonomics, concurrency safety).
  • You have an A/B test budget large enough to detect a 5-point shift (at least a few hundred samples per arm; for comparison, v2's per-arm n was ~138).
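As a rough way to size such an A/B test, the sketch below applies the standard normal-approximation formula for comparing two means with equal arm sizes. It rests on several assumptions: "5-point shift" is read as 0.05 on the [0, 1] scale, both arms share the same per-sample standard deviation, and the standard deviations tried in the loop are hypothetical placeholders, since no per-sample spread is reported here. The function name `samples_per_arm` is illustrative.

```python
import math
from statistics import NormalDist


def samples_per_arm(delta: float, sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-sided, two-sample z-test on mean CQS-craft.

    Assumes equal arm sizes and a common per-sample standard deviation `sd`.
    """
    if delta <= 0 or sd <= 0:
        raise ValueError("delta and sd must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


if __name__ == "__main__":
    # Hypothetical standard deviations: substitute the spread measured on your own data.
    for sd in (0.10, 0.15, 0.20):
        print(f"sd={sd:.2f}: n per arm for 0.05 shift = {samples_per_arm(0.05, sd)}")
```

Because the required n scales with (sd / delta) squared, halving the detectable shift roughly quadruples the budget, so measuring the per-sample spread on your own data comes before choosing n.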

This matters less when:

  • Your downstream evaluator measures different dimensions (compactness, performance, security). The probes showed that the winning preambles change under a different rubric.
  • You're shipping to a low-volume specialty system where per-sample variance dwarfs the expected preamble effect.
  • Your model is already at the high end of the CQS-craft range. There is evidence of a ceiling near 0.85 on this rubric for current frontier models; a preamble can move you toward it but not past it.

This matters not at all when:

  • You're using static-analysis tools and metrics (radon, pylint, cyclomatic complexity) as your quality bar. Those don't detect preamble effects.