Models

The v2 main run uses a 10-model subject pool that doubles as the 10-model judge panel. This produces a full cross-judge matrix with self-judgments dropped from primary CQS-craft.

Source of truth: preamble_quality_v2_main.py lines 65–88.


Subject pool

Reasoning models (3)

REASONING_MODELS = [
    "qwen/qwen3.6-flash",
    "deepseek/deepseek-v4-flash",
    "minimax/minimax-m2.5",
]

These models receive reasoning: {effort: "high"} on every generation call (preamble_quality_v2_main.py:655–659). Reasoning-effort is a property of the subject under test, not a control variable.

Non-reasoning models (7)

NON_REASONING_MODELS = [
    "deepseek/deepseek-v3.2",
    "google/gemma-4-31b-it",
    "mistralai/mistral-small-2603",
    "nvidia/nemotron-3-super-120b-a12b",
    "openai/gpt-4o-mini",
    "google/gemini-3.1-flash-lite",
    "google/gemini-2.5-flash",
]

These models receive no reasoning parameter — the field is omitted from the request body entirely.

The pool was widened from 7 (non-reasoning only) to 10 during pre-flight Amendment A3 after a methodological challenge that "a non-reasoning-only pool answers a question of low production relevance." A4 then added the explicit reasoning: {effort: "high"} parameter after a routing-variability audit found that the same model identifier could return different reasoning behavior across calls.


Reasoning-parameter handling

Three modes are sent on the wire, distinguished by _post() at lines 590–594:

| Mode | When used | Body field |
| --- | --- | --- |
| high | Generation by a reasoning subject | "reasoning": {"effort": "high"} |
| exclude | Judging by a reasoning judge | "reasoning": {"exclude": true} |
| off | All non-reasoning calls (subjects and judges) | (field omitted) |

The exclude mode on the judge side (preamble_quality_v2_main.py:690) reflects that judging is a structured fill-the-JSON task and judge-side reasoning is not the variable under test. The judge prompt is identical across tiers; only the wire-level reasoning flag differs.
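The mode selection described above can be sketched as follows. This is a minimal illustration, not the code in _post(): the helper names reasoning_mode and build_body, the role strings "generate" and "judge", and the "model"/"messages" body keys are assumptions made for the example. Only the three "reasoning" field shapes come from the table.

```python
# Hypothetical sketch of the three wire modes; not the actual _post() code.
REASONING_MODELS = {
    "qwen/qwen3.6-flash",
    "deepseek/deepseek-v4-flash",
    "minimax/minimax-m2.5",
}


def reasoning_mode(model: str, role: str) -> str:
    """Return 'high', 'exclude', or 'off' for a model and call role.

    `role` is an assumed label: 'generate' for subject calls,
    'judge' for judge calls.
    """
    if model not in REASONING_MODELS:
        return "off"
    return "high" if role == "generate" else "exclude"


def build_body(model: str, role: str, messages: list) -> dict:
    """Build a request body, adding the reasoning field only when needed."""
    body = {"model": model, "messages": messages}  # assumed body shape
    mode = reasoning_mode(model, role)
    if mode == "high":
        body["reasoning"] = {"effort": "high"}
    elif mode == "exclude":
        body["reasoning"] = {"exclude": True}
    # mode == "off": the field is left out of the body entirely
    return body
```

Omitting the key in the off case, rather than sending a null or empty value, matches the "(field omitted)" row of the table.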


Cross-judge matrix

All 10 models judge every extracted sample on both the idiom_comment and rubric kinds. Self-judgments are issued like any other call and filtered out afterwards, not skipped:

SUBJECT_MODELS = ALL_MODELS    # 10
JUDGE_MODELS   = ALL_MODELS    # 10 (full v1-equivalent cross-judge matrix)

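Each extracted generation produces 10 judges × 2 kinds = 20 judge calls. Calls from judges in the subject's own family are self-judgments and are dropped from primary CQS (see below): 2 calls for a subject in a single-model family, 4 for deepseek subjects, and 6 for google subjects.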
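The per-generation fan-out can be enumerated directly. The sketch below assumes ALL_MODELS is the concatenation of the two lists given earlier (the source file may build it differently), and judge_calls is a hypothetical helper introduced for illustration.

```python
from itertools import product

# Assumption: ALL_MODELS is the reasoning list followed by the non-reasoning list.
ALL_MODELS = [
    "qwen/qwen3.6-flash",
    "deepseek/deepseek-v4-flash",
    "minimax/minimax-m2.5",
    "deepseek/deepseek-v3.2",
    "google/gemma-4-31b-it",
    "mistralai/mistral-small-2603",
    "nvidia/nemotron-3-super-120b-a12b",
    "openai/gpt-4o-mini",
    "google/gemini-3.1-flash-lite",
    "google/gemini-2.5-flash",
]
JUDGE_KINDS = ["idiom_comment", "rubric"]


def model_family(model: str) -> str:
    return model.split("/", 1)[0]


def judge_calls(subject: str):
    """Return (self_calls, cross_calls) for one extracted generation."""
    self_calls, cross_calls = [], []
    for judge, kind in product(ALL_MODELS, JUDGE_KINDS):
        same_family = model_family(judge) == model_family(subject)
        (self_calls if same_family else cross_calls).append((judge, kind))
    return self_calls, cross_calls


for subject in ("openai/gpt-4o-mini", "deepseek/deepseek-v3.2", "google/gemini-2.5-flash"):
    self_calls, cross_calls = judge_calls(subject)
    print(subject, len(self_calls), len(cross_calls))
# openai/gpt-4o-mini 2 18
# deepseek/deepseek-v3.2 4 16
# google/gemini-2.5-flash 6 14
```

The number of dropped calls therefore depends on how many pool models share the subject's provider prefix.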

Total scale of the run. 1,260 generations × 2 judge kinds × 10 judges = 25,200 nominal judge calls; in practice 24,300 judge calls were issued, because generations that fail extraction are not judged. Of the issued calls, 22,028 parsed cleanly. See experiment_v2_results/REPORT.md.
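The arithmetic behind these totals is reproduced below as a quick check. The implied count of extraction failures is derived here, not quoted from the report, and it assumes every extracted generation received all 20 judge calls.

```python
GENERATIONS = 1260
JUDGE_KINDS = 2
JUDGES = 10
ISSUED = 24_300   # judge calls actually issued
PARSED = 22_028   # judge calls that parsed cleanly

calls_per_generation = JUDGE_KINDS * JUDGES        # 20
nominal = GENERATIONS * calls_per_generation       # 25,200
skipped_calls = nominal - ISSUED                   # 900

# Assumption: each skipped generation accounts for exactly 20 missing calls.
implied_failures, remainder = divmod(skipped_calls, calls_per_generation)

parse_rate = PARSED / ISSUED

print(nominal, skipped_calls, implied_failures, remainder)
print(f"parse rate: {parse_rate:.1%}")
```

A zero remainder is consistent with the assumption that skipping happens per generation and not per individual call.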


Self-judgment exclusion

A judgment is self if the subject model and judge model share a provider-prefix family. The family helper is just model.split("/", 1)[0]:

def model_family(model: str) -> str:
    return model.split("/", 1)[0]

So deepseek/deepseek-v3.2 judged by deepseek/deepseek-v4-flash counts as a self-judgment and is excluded from primary CQS. The full self/cross stratification is retained in the report for F3 hygiene checking.

| Subject family | Models in family (both tiers) |
| --- | --- |
| qwen | qwen/qwen3.6-flash |
| deepseek | deepseek/deepseek-v4-flash, deepseek/deepseek-v3.2 |
| minimax | minimax/minimax-m2.5 |
| google | google/gemma-4-31b-it, google/gemini-3.1-flash-lite, google/gemini-2.5-flash |
| mistralai | mistralai/mistral-small-2603 |
| nvidia | nvidia/nemotron-3-super-120b-a12b |
| openai | openai/gpt-4o-mini |

Notably, the deepseek and google families each contain multiple models. Judgments between different models within one of those families (e.g. deepseek-v3.2 judging deepseek-v4-flash) are still treated as self-judgments and excluded. See the glossary and judge protocol for the downstream consequences.
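A minimal sketch of the exclusion rule applied to judgment records follows. The record shape (dicts with "subject", "judge", and "kind" keys) and the helpers is_self_judgment and stratify are assumptions made for illustration; only model_family is taken from the source. Both strata are returned, so the self stratum stays available for reporting while only the cross stratum would feed primary CQS.

```python
def model_family(model: str) -> str:
    return model.split("/", 1)[0]


def is_self_judgment(subject: str, judge: str) -> bool:
    """True when subject and judge share a provider-prefix family."""
    return model_family(subject) == model_family(judge)


def stratify(judgments):
    """Split records into (cross, self) strata without discarding either."""
    cross, self_stratum = [], []
    for record in judgments:
        if is_self_judgment(record["subject"], record["judge"]):
            self_stratum.append(record)
        else:
            cross.append(record)
    return cross, self_stratum


# Hypothetical records; the field names are illustrative only.
records = [
    {"subject": "deepseek/deepseek-v4-flash", "judge": "deepseek/deepseek-v3.2", "kind": "rubric"},
    {"subject": "deepseek/deepseek-v4-flash", "judge": "openai/gpt-4o-mini", "kind": "rubric"},
    {"subject": "google/gemma-4-31b-it", "judge": "google/gemini-2.5-flash", "kind": "idiom_comment"},
]
cross, self_stratum = stratify(records)
print(len(cross), len(self_stratum))  # 1 2
```

Because the family is the provider prefix, gemma-4-31b-it judged by gemini-2.5-flash lands in the self stratum even though the two are different model lines.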