Agents
All agents are subagents dispatched via the Agent tool from the main session. They run with isolated context — each agent sees only the artifacts passed to it.
ml-critic
Role: Adversarial critic — finds flaws the PoC hasn't tested.
Dispatched by: /ml-lab skill at Step 3.
- Debate mode: dispatched once (Stage A.1)
- Ensemble mode: dispatched 3 times independently with no cross-visibility
Persona: Skeptical ML engineer with an applied mathematics background. Looks for fundamental flaws in the proof-of-concept.
Modes:
| Mode | When | Input |
|---|---|---|
| Initial critique (Step 3) | First review cycle | PoC code + hypothesis |
| Debate rounds (Step 5) | Open-ended debate | Prior exchange history |
| Evidence-informed re-critique | Macro-iteration cycles 2+ | Prior critique + experimental results |
Output: CRITIQUE.md (debate) or CRITIQUE_{1,2,3}.md (ensemble) with severity-tagged findings (FATAL, MATERIAL, MINOR).
ml-critic-r2
Role: R2 challenger — issues ACCEPT/CHALLENGE/PARTIAL verdicts on defender rebuttals.
Dispatched by: /ml-lab skill at Step 3, Stage B only (debate mode).
Not used in ensemble mode.
Input: Defender's rebuttal for a specific finding.
Output: Per-finding verdict with justification. Fed into derive_verdict() for deterministic case-level verdict computation.
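A deterministic roll-up of per-finding verdicts can be sketched as a pure function over the three R2 verdict values. The aggregation rule and the case-level labels below (`contested`, `partially_resolved`, `resolved`) are assumptions made for illustration, as is the name `derive_verdict_sketch`; the actual logic of derive_verdict() is not described here and may differ.

```python
from collections import Counter
from enum import Enum


class R2Verdict(Enum):
    """The three per-finding verdicts issued by ml-critic-r2."""
    ACCEPT = "ACCEPT"
    CHALLENGE = "CHALLENGE"
    PARTIAL = "PARTIAL"


def derive_verdict_sketch(verdicts: dict) -> str:
    """Hypothetical aggregation; NOT the project's actual derive_verdict().

    `verdicts` maps a finding ID to one of "ACCEPT", "CHALLENGE", "PARTIAL".
    The result depends only on which verdict values are present, so the same
    inputs always give the same case-level label, in any order.
    """
    if not verdicts:
        raise ValueError("at least one per-finding verdict is required")
    # R2Verdict(value) raises ValueError for any unknown verdict string.
    counts = Counter(R2Verdict(v) for v in verdicts.values())
    # Assumed precedence: any CHALLENGE dominates, then any PARTIAL.
    if counts[R2Verdict.CHALLENGE]:
        return "contested"
    if counts[R2Verdict.PARTIAL]:
        return "partially_resolved"
    return "resolved"
```

Because the function has no hidden state and no model call, replaying the same per-finding verdicts reproduces the same case-level result, which is the property the deterministic step is meant to provide.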
ml-defender
Role: Design defender — argues for the implementation against adversarial critique.
Dispatched by: /ml-lab skill at Step 3 (debate mode only).
Not used in ensemble mode.
Persona: The original designer who understands the intent behind every choice.
Rebuttal taxonomy (7 types):
| Type | Meaning |
|---|---|
| CONCEDE | Finding is valid; will fix |
| REBUT-DESIGN | Design choice was intentional and correct |
| REBUT-SCOPE | Finding is out of scope for this investigation |
| REBUT-EVIDENCE | Evidence doesn't support the claimed severity |
| REBUT-IMMATERIAL | Finding is real but doesn't affect conclusions |
| DEFER | Will address as pre-flight item before main run |
| EXONERATE | Finding is based on a misunderstanding |
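
For tooling that consumes DEFENSE.md, the seven rebuttal types above could be modeled as a closed enumeration so that unknown tags are rejected early. This is a minimal sketch: the class name `RebuttalType` and the helper `parse_rebuttal_type` are hypothetical, and only the seven tag strings come from the table.

```python
from enum import Enum


class RebuttalType(Enum):
    """The seven rebuttal tags from the taxonomy table (names are illustrative)."""
    CONCEDE = "CONCEDE"
    REBUT_DESIGN = "REBUT-DESIGN"
    REBUT_SCOPE = "REBUT-SCOPE"
    REBUT_EVIDENCE = "REBUT-EVIDENCE"
    REBUT_IMMATERIAL = "REBUT-IMMATERIAL"
    DEFER = "DEFER"
    EXONERATE = "EXONERATE"


def parse_rebuttal_type(tag: str) -> RebuttalType:
    """Map a tag string to its enum member, ignoring case and outer whitespace.

    Raises ValueError for any tag outside the seven-type taxonomy.
    """
    return RebuttalType(tag.strip().upper())
```

For example, `parse_rebuttal_type("rebut-scope")` returns `RebuttalType.REBUT_SCOPE`, while an unlisted tag raises `ValueError` instead of passing through silently.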
Modes:
| Mode | When | Input |
|---|---|---|
| Initial defense (Stage A.2) | After critic R1 | Critic findings + PoC |
| Structured R2 response (Stage B.2) | After critic-r2 challenge | R2 verdict + prior exchange |
| Evidence-informed re-defense | Macro-iteration cycles 2+ | Prior defense + experimental results |
Output: DEFENSE.md with per-finding structured rebuttals.
research-reviewer
Role: Deep peer reviewer — Opus-class structured review of REPORT.md.
Dispatched by: /ml-lab skill at Step 10, Round 1.
Output: PEER_REVIEW_R1.md with severity-tagged findings.
research-reviewer-lite
Role: Verification reviewer — Haiku-class follow-up review.
Dispatched by: /ml-lab skill at Step 10, Rounds 2–3.
Purpose: Verify that remediation addressed the Round 1 findings without introducing new issues. Lighter weight than the full reviewer.
report-writer
Role: Produces technical reports from investigation artifacts.
Dispatched by: /ml-lab skill at Steps 8 and 11.
Modes:
| Mode | Output | Input |
|---|---|---|
| Mode 1 (Step 8) | REPORT.md | Analytical artifacts + quantitative results |
| Mode 2 (Step 11) | TECHNICAL_REPORT.md | All available artifacts (results-mode synthesis) |
readme-rewriter
Role: Outside-reader README rewriter.
Dispatched by: /ml-lab skill at Step 13 (user-confirmed).
Process: diagnose (as first-time reader) → outline (proposed structure) → rewrite (complete README optimized for external audiences).
intent-monitor
Role: Pre-registration drift monitor.
Dispatched by: /intent-watch skill.
Process: Indexes binding constraints from a source-of-truth document, detects recent git changes, evaluates diffs for conflicts. Emits a clean-pass line or structured conflict report.
pipeline-reviewer
Role: Fidelity judgment gate for promoted Metaflow flows.
Dispatched by: Pipeline promotion workflow, after a flow is scaffolded and before it is run. Blocking — a FAIL verdict halts promotion.
Model: Sonnet.
Scope: Guards intent-fidelity invariants only. Does not run the flow, does not police DAG shape, and does not duplicate the deterministic flow-lint's mechanical checks.
Inputs: Promoted flow source + the investigation's source-of-truth document(s) (one or more of HYPOTHESIS.md, the original PoC, CONCLUSIONS.md, PLAN.md, or any document the dispatching agent designates as authoritative).
The five judgment checks:
| Check ID | Invariant | Failure it catches |
|---|---|---|
| split_convention | Flow uses the same split strategy (shared vs. independent data streams) as the original PoC | Changing split convention silently alters variance structure across arms |
| reshuffle_symmetry | Per-epoch shuffling or resampling is applied identically to all methods | Asymmetric reshuffling gives one method a silent optimization advantage |
| axes_match_source | Swept axes (hyperparameter grids, dataset sizes, etc.) match the source-of-truth in identity and range | Silently added, dropped, or re-ranged axes change which regime is under evaluation |
| metric_quantity | Each metric's implementation computes the quantity its name claims | A metric reporting the wrong quantity passes all type/shape checks while drawing the verdict from the wrong signal |
| sweep_override | Flow does not perform per-experiment selection (sweep-then-pick-best) where the source-of-truth used a fixed configuration | Selection effect inflates results beyond what the original PoC measured |
Output: A single JSON object with a findings array (one entry per check, each with check_id, verdict, evidence, source_of_truth_ref, intent_mismatch, and actionable_fix), a top-level blocking boolean (true if any finding is FAIL), and a summary string. All five checks must appear even when the verdict is PASS.
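The output contract described above can be checked mechanically before promotion acts on it. The sketch below is illustrative only: the function name `validate_review` is hypothetical, the verdict strings are assumed to be exactly `"PASS"` and `"FAIL"`, and field value types other than `blocking` and `summary` are not checked because they are not specified here.

```python
import json

# The five check IDs and six per-finding keys named in this section.
CHECK_IDS = (
    "split_convention",
    "reshuffle_symmetry",
    "axes_match_source",
    "metric_quantity",
    "sweep_override",
)
FINDING_KEYS = {
    "check_id", "verdict", "evidence",
    "source_of_truth_ref", "intent_mismatch", "actionable_fix",
}


def validate_review(raw: str) -> list:
    """Return a list of contract violations; an empty list means the object conforms."""
    review = json.loads(raw)
    findings = review.get("findings", [])
    problems = []

    # Every check must appear exactly once, even when its verdict is PASS.
    seen = [f.get("check_id") for f in findings]
    for check_id in CHECK_IDS:
        if seen.count(check_id) != 1:
            problems.append(f"{check_id}: appears {seen.count(check_id)} times, expected 1")

    # Each finding must carry all six documented keys.
    for f in findings:
        missing = FINDING_KEYS - f.keys()
        if missing:
            problems.append(f"{f.get('check_id')}: missing keys {sorted(missing)}")

    # blocking must be true exactly when at least one finding is FAIL.
    expected_blocking = any(f.get("verdict") == "FAIL" for f in findings)
    if review.get("blocking") is not expected_blocking:
        problems.append(f"blocking should be {expected_blocking}")

    if not isinstance(review.get("summary"), str):
        problems.append("summary must be a string")
    return problems
```

A caller could treat a non-empty result as a malformed review and refuse to proceed, keeping the blocking gate from being bypassed by an incomplete findings array.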