Four small Python libraries written in the same spirit: each solves one narrow data-prep or feature-engineering problem inside an existing workflow (the scikit-learn Pipeline API, or method chaining in DPipes' case) rather than introducing a framework around it.
Steps: classical feature selection
Best-subset and forward stepwise selection, scored by AIC or BIC, wrapped as scikit-learn-compatible selectors. Brings interpretable classical selection (well-established in statistics but rarely accessible in modern ML pipelines) into the standard fit / transform workflow. Adapts to the target automatically: linear regression for continuous targets, logistic regression for classification.
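A minimal sketch of the idea, not the Steps API: forward stepwise selection for a continuous target, scored by the Gaussian AIC (n·ln(RSS/n) + 2k, up to an additive constant). The names `gaussian_aic` and `ForwardAICSelector` are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression


def gaussian_aic(X, y, cols):
    """AIC (up to an additive constant) of an OLS fit on the given columns."""
    n = len(y)
    if cols:
        pred = LinearRegression().fit(X[:, cols], y).predict(X[:, cols])
    else:
        pred = np.full(n, y.mean())  # intercept-only model
    rss = float(np.sum((y - pred) ** 2))
    k = len(cols) + 1  # slope coefficients plus the intercept
    return n * np.log(rss / n) + 2 * k


class ForwardAICSelector(TransformerMixin, BaseEstimator):
    """Hypothetical forward stepwise selector (illustration only)."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        selected, remaining = [], list(range(X.shape[1]))
        best = gaussian_aic(X, y, selected)
        while remaining:
            # Score every candidate model that adds exactly one more column.
            scores = {j: gaussian_aic(X, y, selected + [j]) for j in remaining}
            j_best = min(scores, key=scores.get)
            if scores[j_best] >= best:
                break  # no single addition lowers the AIC
            best = scores[j_best]
            selected.append(j_best)
            remaining.remove(j_best)
        self.selected_ = sorted(selected)
        self.aic_ = best
        return self

    def transform(self, X):
        # Keep only the columns chosen during fit.
        return np.asarray(X)[:, self.selected_]
```

Because the selection happens in `fit` and `transform` only slices columns, a selector of this shape can sit in front of any estimator in a Pipeline. Swapping the `2 * k` penalty for `np.log(n) * k` would give a BIC-scored variant.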
StringCluster: fuzzy string deduplication
Identifies near-duplicate strings (the messy-data problem of “Acme Inc.”, “Acme, Inc” and “ACME INCORPORATED” all being the same entity) via TF-IDF character n-grams and a cosine-similarity threshold. Matches a list against itself or against a master reference list, with regex stop tokens for domain-specific noise. Standard fit / transform; slots into a Pipeline like any other step.
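The underlying technique can be sketched with scikit-learn alone. This is an illustration under stated assumptions, not StringCluster's implementation: the stop-token pattern, the 0.8 threshold, the n-gram range, and the greedy grouping rule are all hypothetical choices.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stop tokens for company names; real noise lists are domain-specific.
STOP_TOKENS = re.compile(r"\b(inc|incorporated|corp|corporation)\b")


def normalize(text):
    """Lowercase, replace punctuation with spaces, and strip stop tokens."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(STOP_TOKENS.sub(" ", text).split())


def group_near_duplicates(strings, threshold=0.8, ngram_range=(2, 3)):
    """Give each string a group id; near-duplicates share an id."""
    cleaned = [normalize(s) for s in strings]
    tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=ngram_range)
    sims = cosine_similarity(tfidf.fit_transform(cleaned))

    labels = [-1] * len(strings)
    next_label = 0
    for i in range(len(strings)):
        if labels[i] != -1:
            continue  # already assigned to an earlier group
        labels[i] = next_label
        # Greedy rule: attach every unassigned string close enough to string i.
        for j in range(i + 1, len(strings)):
            if labels[j] == -1 and sims[i, j] >= threshold:
                labels[j] = next_label
        next_label += 1
    return labels
```

Character n-grams make the comparison tolerant of typos and spacing, while the stop tokens remove suffixes that would otherwise dominate the similarity between short names.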
TSFeast: time-series feature engineering
Lag, rolling, EWMA, differencing, datetime, and polynomial-feature transformers for time series, plus a TimeSeriesFeatures meta-transformer that combines several at once and an ARMARegressor that wraps an sklearn regressor with ARMA residual modeling. Keeps time-series prep inside the Pipeline where the fit / transform boundary is respected, instead of leaking into ad-hoc pandas code outside it.
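As a rough sketch of what keeping this prep inside the Pipeline can look like, here is a hypothetical lag and rolling-mean transformer built on pandas. The class name `LagRollingFeatures` and its parameters are assumptions for illustration and are not TSFeast's API.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class LagRollingFeatures(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: lagged copies plus a trailing rolling mean."""

    def __init__(self, lags=(1, 2), window=3):
        self.lags = lags
        self.window = window

    def fit(self, X, y=None):
        # Nothing to learn here; the features depend only on past rows.
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        parts = [X.shift(lag).add_suffix(f"_lag{lag}") for lag in self.lags]
        # Shift by one row before rolling so each window sees only past values.
        rolled = X.shift(1).rolling(self.window).mean()
        parts.append(rolled.add_suffix(f"_rollmean{self.window}"))
        return pd.concat(parts, axis=1)
```

The first rows of the output contain NaN wherever there is not enough history, so a downstream step has to drop or impute them before a model is fit.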
DPipes: reusable function composition
PipeProcessor for method-chaining APIs (pandas, Polars, any object with a pipe method) and Pipeline for general function composition over arbitrary Python functions. Defines a transformation sequence once and applies it to any compatible input (train, test, new batch) without rewriting the chain or falling back on hard-to-read nested calls.
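The composition idea itself fits in a few lines of standard-library Python. The sketch below is not the DPipes API; `compose`, `drop_missing`, and `scale` are hypothetical names used only to show a sequence defined once and reused on different inputs.

```python
from functools import partial, reduce


def compose(*funcs):
    """Return a function that applies funcs left to right."""
    def run(value):
        return reduce(lambda acc, func: func(acc), funcs, value)
    return run


def drop_missing(rows):
    """Remove None entries."""
    return [r for r in rows if r is not None]


def scale(rows, factor):
    """Multiply every entry by factor."""
    return [r * factor for r in rows]


# Define the sequence once; partial() fixes the extra argument for scale.
clean = compose(drop_missing, partial(scale, factor=10), sorted)

train = clean([3, None, 1])
test = clean([None, 2])
```

The same `clean` callable handles both inputs, which is the alternative to writing `sorted(scale(drop_missing(rows), factor=10))` at every call site.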
On the shared scope
None of these tries to be a framework. Each adds the one transformer or composition primitive that sklearn (or pandas, in DPipes’ case) doesn’t ship with, and stops there. That narrowness is the point: they slot into existing workflows without asking anyone to adopt a new abstraction layer.