Small Python libraries

Four narrowly-scoped Python libraries that each solve one repetitive data-prep problem cleanly (classical feature selection, fuzzy string deduplication, time-series feature engineering, and reusable function composition), three of them slotting into the standard scikit-learn Pipeline.

All four are written in the same spirit: each solves one narrow data-prep or feature-engineering problem inside an existing workflow (the scikit-learn Pipeline API for three of them, plain function and method chaining for DPipes) rather than introducing a framework around it.

Steps: classical feature selection

Best-subsets and forward stepwise regression, scored by AIC or BIC, wrapped as scikit-learn-compatible selectors. Brings interpretable classical selection (well-established in statistics but rarely accessible in modern ML pipelines) into the standard fit / transform workflow. Adapts automatically: linear regression for continuous targets, logistic regression for classification.
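To make the idea concrete, here is a minimal sketch of forward stepwise selection scored by AIC, written as a scikit-learn transformer. It is not the Steps API: the class name `ForwardAICSelector` and the helper `gaussian_aic` are hypothetical, and the sketch covers only the continuous-target (linear regression) case, using the Gaussian AIC up to an additive constant.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline


def gaussian_aic(X, y, cols):
    """AIC (up to an additive constant) of an OLS fit on the given columns."""
    n = len(y)
    if cols:
        pred = LinearRegression().fit(X[:, cols], y).predict(X[:, cols])
    else:
        pred = np.full(n, y.mean())  # intercept-only model
    rss = float(np.sum((y - pred) ** 2))
    k = len(cols) + 1  # coefficients plus intercept
    return n * np.log(rss / n) + 2 * k


class ForwardAICSelector(TransformerMixin, BaseEstimator):
    """Hypothetical selector: greedily add the feature that lowers AIC most."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
        selected, remaining = [], list(range(X.shape[1]))
        best = gaussian_aic(X, y, selected)
        while remaining:
            score, j = min((gaussian_aic(X, y, selected + [j]), j) for j in remaining)
            if score >= best:  # no candidate improves AIC: stop
                break
            best = score
            selected.append(j)
            remaining.remove(j)
        self.selected_ = sorted(selected)
        self.aic_ = best
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.selected_]


# Synthetic data: only columns 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=200)

selector = ForwardAICSelector().fit(X, y)
print(selector.selected_)

# The selector drops into a Pipeline like any other transformer.
model = make_pipeline(ForwardAICSelector(), LinearRegression()).fit(X, y)
```

Swapping the penalty `2 * k` for `k * np.log(n)` would give a BIC-scored variant of the same loop.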

Repository · Docs

StringCluster: fuzzy string deduplication

Identifies near-duplicate strings (the messy-data problem of “Acme Inc.”, “Acme, Inc” and “ACME INCORPORATED” all being the same entity) via TF-IDF character n-grams and a cosine-similarity threshold. Works against itself or against a master reference list, with regex stop tokens for domain-specific noise. Standard fit / transform; slots into a Pipeline like any other step.
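The technique described above can be sketched with scikit-learn and SciPy alone. This is an illustration under assumptions, not StringCluster's API: the `normalize` and `cluster_strings` helpers, the stop-token list, and the 0.6 threshold are all hypothetical choices, and grouping by connected components is one of several reasonable ways to turn pairwise similarities into clusters.

```python
import re

from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical domain-specific noise tokens, removed before vectorizing.
STOP_TOKENS = re.compile(r"\b(?:inc|incorporated|corp|corporation|llc|ltd)\b")


def normalize(name):
    """Lowercase, replace punctuation with spaces, remove stop tokens."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    name = STOP_TOKENS.sub(" ", name)
    return " ".join(name.split())


def cluster_strings(names, threshold=0.6):
    """Label strings so that near-duplicates share a cluster id."""
    cleaned = [normalize(n) for n in names]
    # Character n-grams within word boundaries tolerate small spelling changes.
    tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(cleaned)
    similar = cosine_similarity(tfidf) >= threshold
    # Treat "similar enough" as graph edges and take connected components.
    _, labels = connected_components(csr_matrix(similar.astype(int)), directed=False)
    return labels


names = [
    "Acme Inc.",
    "Acme, Inc",
    "ACME INCORPORATED",
    "Globex Corp",
    "Globbex Corporation",
    "Initech LLC",
]
labels = cluster_strings(names)
for name, label in zip(names, labels):
    print(label, name)
```

The threshold controls the trade-off: lowering it merges more aggressively, raising it keeps borderline variants apart.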

Repository

TSFeast: time-series feature engineering

Lag, rolling, EWMA, differencing, datetime, and polynomial-feature transformers for time series, plus a TimeSeriesFeatures meta-transformer that combines several at once and an ARMARegressor that wraps an sklearn regressor with ARMA residual modeling. Keeps time-series prep inside the Pipeline where the fit / transform boundary is respected, instead of leaking into ad-hoc pandas code outside it.
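As a rough illustration of what keeping this prep inside the Pipeline looks like, the sketch below builds lag and rolling-mean columns in a small transformer. It does not reproduce TSFeast's classes: `LagRollingFeatures` and its parameters are hypothetical, and the sketch assumes a pandas DataFrame whose rows are already sorted by time.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class LagRollingFeatures(TransformerMixin, BaseEstimator):
    """Hypothetical transformer adding lag and rolling-mean columns."""

    def __init__(self, column="y", lags=(1, 2), window=3):
        self.column = column
        self.lags = lags
        self.window = window

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the training data

    def transform(self, X):
        out = X.copy()
        series = out[self.column]
        for lag in self.lags:
            out[f"{self.column}_lag{lag}"] = series.shift(lag)
        # shift(1) first so the rolling mean only sees strictly past values
        out[f"{self.column}_rollmean{self.window}"] = (
            series.shift(1).rolling(self.window).mean()
        )
        return out


df = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
features = LagRollingFeatures(column="y", lags=(1, 2), window=3).fit_transform(df)
print(features)
```

The leading rows contain NaN because no history exists for them yet; a downstream step would typically drop or impute those rows.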

Repository

DPipes: reusable function composition

Provides PipeProcessor for method-chaining APIs (pandas, Polars, any object with a pipe method) and Pipeline for general composition of arbitrary Python functions. A transformation sequence is defined once and applied to any compatible input (train, test, new batch) without rewriting the chain or falling back on hard-to-read nested calls.
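The underlying idea can be sketched in a few lines of plain Python. The `compose` helper and the two step functions below are hypothetical and are not the DPipes API; they only show what defining a sequence once and reusing it on several inputs means in practice.

```python
from functools import reduce

import pandas as pd


def compose(*funcs):
    """Return a function that applies funcs left to right."""
    def run(data):
        return reduce(lambda acc, func: func(acc), funcs, data)
    return run


def drop_missing(df):
    return df.dropna()


def add_total(df):
    return df.assign(total=df["price"] * df["qty"])


# Define the sequence once...
prep = compose(drop_missing, add_total)

# ...then apply it to any compatible input.
train = pd.DataFrame({"price": [2.0, None, 4.0], "qty": [3, 1, 5]})
test = pd.DataFrame({"price": [1.5, 2.5], "qty": [2, 4]})

train_ready = prep(train)
test_ready = prep(test)

# The same steps written as nested calls read inside-out:
nested = add_total(drop_missing(train))
```

For pandas specifically, the same chain can be written as `train.pipe(drop_missing).pipe(add_total)`; the composed version avoids repeating that chain for every new input.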

Repository · Docs

On the shared scope

None of these tries to be a framework. Each adds the one transformer or composition primitive that sklearn (or pandas, in DPipes’ case) doesn’t ship with, and stops there. That narrowness is the point: they slot into existing workflows without asking anyone to adopt a new abstraction layer.