ML Lab — Project

Overview

ML Lab runs a structured investigation for an ML idea, signal, or model claim that needs rigorous validation rather than ad-hoc experimentation. It converts a vague hypothesis into a falsifiable claim, locks metrics and pass criteria before any code is written, then puts the proof-of-concept through adversarial review, empirical testing, and peer review.

The problem

Manual code review of ML experiments tends to focus on obvious bugs and miss implicit assumptions: metric sufficiency, baseline meaningfulness, sample adequacy. Single-pass model review misses the cases that matter most: framing errors that need an independent question raised rather than an answer processed, work that sounds questionable but is actually sound, and cases where the honest verdict is “this is empirically open, run this test first” rather than a binary pass or fail.

Origin

ML Lab wasn’t designed speculatively. It emerged from a concrete technical investigation (whether FastText-encoded device attributes could serve as ML features for account-takeover detection), and the framework crystallized as that investigation forced workflow choices. Eight iterative versions refined the protocol through self-evaluation experiments. Everything in the workflow exists because it solved a real problem encountered along the way, then earned its place through calibration.

The workflow

The investigation is a fixed sequence of steps. Early steps sharpen the hypothesis and lock metrics. A review step subjects the proof-of-concept to structured critique. Only the tests both sides agree on proceed to empirical execution, and the hypothesis is gate-locked during that run to prevent drift. If results contradict the review’s assumptions, the whole review cycle re-opens with evidence in hand. Optional later steps add deep peer review, a production re-evaluation, and a cross-document coherence audit over every artifact produced.
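One way to picture the gate-lock is as a fingerprint check: the hypothesis is hashed when it is locked, and any later change is rejected before empirical work continues. The sketch below is illustrative only. The names Hypothesis, fingerprint, and HypothesisGate, and the fields they carry, are assumptions made for this example and are not taken from the ML Lab codebase.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical record: a falsifiable claim with its metric and pass criterion."""
    claim: str
    metric: str
    pass_threshold: float


def fingerprint(hypothesis: Hypothesis) -> str:
    """Return a stable SHA-256 digest of the hypothesis contents."""
    payload = json.dumps(asdict(hypothesis), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class HypothesisGate:
    """Remembers the digest taken at lock time and rejects any later drift."""

    def __init__(self, hypothesis: Hypothesis) -> None:
        self._locked_digest = fingerprint(hypothesis)

    def check(self, hypothesis: Hypothesis) -> None:
        """Raise ValueError if the hypothesis differs from the locked version."""
        if fingerprint(hypothesis) != self._locked_digest:
            raise ValueError("hypothesis changed after lock")


# Usage: lock once, then check before each empirical step.
locked = Hypothesis(claim="feature set A beats the baseline", metric="auc", pass_threshold=0.80)
gate = HypothesisGate(locked)
gate.check(locked)  # passes silently because nothing has changed
```

Under this design a moved threshold or reworded claim fails loudly during the run, which leaves re-opening the review cycle as the only route to changing the hypothesis.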

Two review modes

Debate (default). A critic identifies untested claims, a defender responds through a structured rebuttal taxonomy, and a convergence loop runs until verdicts stabilize. The final verdict is computed by a deterministic Python function, so there is no LLM variance in the outcome. After v8 calibration, this is the recommended mode for standard investigations.
Ensemble (opt-in). Three independent critics run with no cross-visibility, and their findings are union-pooled and tagged by how many critics agreed. Reserved for exploratory audits where the risk surface is unknown and missing a real issue is costlier than the manual triage required to filter false positives.
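The two modes can be sketched in a few lines each. The first function shows union-pooling with agreement tags for ensemble mode; the second shows what a deterministic verdict rule might look like for debate mode. Both are assumptions for illustration: the status labels ("resolved", "open", "critical"), the verdict names, and the precedence rule are invented here and do not describe ML Lab's actual verdict function.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def pool_findings(critic_findings: List[Set[str]]) -> Dict[str, int]:
    """Union-pool findings and tag each with how many critics raised it."""
    counts: Counter = Counter()
    for findings in critic_findings:
        # Each critic contributes at most one vote per finding.
        counts.update(set(findings))
    return dict(counts)


# Hypothetical status labels for contested points after the debate converges.
ALLOWED_STATUSES = {"resolved", "open", "critical"}


def compute_verdict(point_statuses: Iterable[str]) -> str:
    """Map final point statuses to a verdict using fixed precedence rules.

    Assumed rule: any critical point fails the work; otherwise any open
    point makes the verdict empirically open; otherwise it passes.
    """
    statuses = list(point_statuses)
    unknown = set(statuses) - ALLOWED_STATUSES
    if unknown:
        raise ValueError(f"unknown statuses: {sorted(unknown)}")
    if "critical" in statuses:
        return "fail"
    if "open" in statuses:
        return "empirically_open"
    return "pass"


# Usage with three independent critics and made-up finding labels.
pooled = pool_findings([
    {"label leakage", "weak baseline"},
    {"label leakage"},
    {"label leakage", "small sample"},
])
# pooled maps "label leakage" to 3 and the other two findings to 1 each.
```

Because the verdict is a pure function of the final statuses, the same debate transcript always yields the same outcome. The agreement counts give a starting point for the manual triage that ensemble mode requires.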

Architecture

ML Lab is a Claude Code plugin built from focused subagents (an orchestrator plus dedicated critic, defender, reviewer, and report-writer agents) with deterministic verdict logic and append-only JSONL investigation logging for post-hoc audit. Each subagent does one job; the orchestrator runs the workflow.
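Append-only JSONL logging is straightforward to illustrate: each event is serialized as one JSON object per line and the file is only ever opened in append mode. The sketch below uses only the Python standard library. The function names and the event fields are hypothetical and do not reflect ML Lab's actual log schema.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def append_event(log_path: Path, event: Dict[str, Any]) -> None:
    """Append one event as a single JSON line; existing lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event, sort_keys=True) + "\n")


def read_events(log_path: Path) -> List[Dict[str, Any]]:
    """Load every event in write order for post-hoc audit."""
    with log_path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A caller might record `append_event(path, {"step": "critique", "agent": "critic"})` after each subagent finishes. Replaying the file with `read_events` then reconstructs the investigation in the order it happened.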

Research finding

Eight compute-matched calibration experiments underwrite the protocol. The working paper “When Does Debate Help? Divergent Detection and Convergent Judgment in Multi-Agent LLM Evaluation” reports the empirical case: independent ensembles win on detecting issues, while multi-round debate wins on judging ambiguous cases. The headline guidance (“ensemble for detection, multi-round for judgment”) is why the system ships both modes rather than picking one. The latest calibration shifted the default to debate for standard investigations; ensemble remains the right tool for exploratory audits in new domains.

Documentation

See the Diátaxis-structured documentation for Tutorials, How-to guides, Reference, Explanation, and more.