Related Work

A literature survey covering the research areas that inform ml-lab's design.

Source location

The full related work survey is maintained at RELATED_WORK.md in the repository root.

Key areas

LLM-as-Judge

Using language models as evaluation judges: scoring, ranking, or comparing outputs against rubrics. ml-lab's evaluation uses LLM judges to score benchmark cases and relies on multi-judge panels for reliability.
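As a rough illustration of why a panel can be more reliable than a single judge, the sketch below combines per-judge scores for one benchmark case. The function name `aggregate_panel`, the judge names, and the choice of median plus standard deviation are assumptions made for this example; they are not taken from ml-lab's code.

```python
from statistics import median, pstdev


def aggregate_panel(scores: dict) -> dict:
    """Combine per-judge scores for a single benchmark case (hypothetical helper)."""
    if not scores:
        raise ValueError("panel must contain at least one judge score")
    values = list(scores.values())
    return {
        "score": median(values),   # median limits the pull of one outlier judge
        "spread": pstdev(values),  # population std dev as a simple disagreement signal
        "n_judges": len(values),
    }


# Hypothetical scores from three judges on the same case.
panel = {"judge_a": 7.0, "judge_b": 8.0, "judge_c": 3.0}
result = aggregate_panel(panel)
print(result["score"], result["n_judges"])
```

A large spread could be used to flag a case for closer review instead of trusting the aggregate, though whether ml-lab does this is not stated here.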

Multi-Agent Debate

Structured argumentation between LLM agents to improve reasoning quality. ml-lab's critic-defender protocol draws from this literature but applies it specifically to methodology evaluation rather than general reasoning.
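The general shape of a critic-defender exchange can be sketched as an alternating loop over a shared transcript. Everything in this example is hypothetical: `run_debate`, the stub agents, and the fixed round count stand in for real LLM calls and do not describe ml-lab's actual protocol.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]                   # (role, message)
Agent = Callable[[str, List[Turn]], str]  # takes the claim and the transcript so far


def run_debate(claim: str, critic: Agent, defender: Agent, rounds: int = 2) -> List[Turn]:
    """Alternate critic and defender turns over a shared transcript."""
    transcript: List[Turn] = []
    for _ in range(rounds):
        transcript.append(("critic", critic(claim, transcript)))
        transcript.append(("defender", defender(claim, transcript)))
    return transcript


# Stub agents standing in for LLM-backed critic and defender.
def stub_critic(claim: str, transcript: List[Turn]) -> str:
    return f"Objection {len(transcript) // 2 + 1} to: {claim}"


def stub_defender(claim: str, transcript: List[Turn]) -> str:
    return f"Response to: {transcript[-1][1]}"


debate = run_debate("The held-out split is leak-free", stub_critic, stub_defender, rounds=2)
for role, message in debate:
    print(f"{role}: {message}")
```

In a methodology-evaluation setting the claim would be a statement about an experiment's design, and a separate judging step would presumably read the transcript afterwards.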

Evaluation Calibration

The problem of ensuring that evaluation metrics actually measure what they claim to measure. ml-lab's eight-version experiment arc is largely a calibration story: each version fixed measurement issues discovered in the previous one.

Pre-registration in ML

Borrowing pre-registration practices from experimental psychology and applying them to ML experimentation. ml-lab enforces pre-registration with automated drift detection.
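One common way to detect drift from a pre-registered plan is to fingerprint the plan at registration time and compare it with the plan in effect at analysis time. The sketch below assumes the plan is a flat JSON-serializable dictionary; `fingerprint`, `detect_drift`, and the field names are illustrative and are not ml-lab's actual mechanism.

```python
import hashlib
import json

_MISSING = object()  # sentinel so an absent key differs from a key set to None


def fingerprint(plan: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_drift(registered: dict, current: dict) -> list:
    """Return the sorted top-level keys whose values differ from the registered plan."""
    if fingerprint(registered) == fingerprint(current):
        return []
    keys = set(registered) | set(current)
    return sorted(
        k for k in keys
        if registered.get(k, _MISSING) != current.get(k, _MISSING)
    )


# Hypothetical plan fields, for illustration only.
registered = {"metric": "accuracy", "n_cases": 100, "seed": 0}
current = {"metric": "accuracy", "n_cases": 120, "seed": 0}
print(detect_drift(registered, current))
```

Storing only the fingerprint alongside the registration is enough to detect that something changed; keeping the full registered plan is what makes it possible to report which fields drifted.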

Adversarial Testing

Using adversarial probes to find failure modes in ML systems. ml-lab's critic agent is an adversarial tester of methodology rather than model behavior.