Quickstart#

step-select#

build codecov

A SciKit-Learn style feature selector using best subsets and stepwise regression.

Install#

Create a virtual environment with Python 3.8 and install from PyPi:

pip install step-select

Use#

Preliminaries#

Note: this example requires two additional packages: pandas and statsmodels.

In this example we’ll show how the ForwardSelector and SubsetSelector classes can be used on their own or in conjuction with a Scikit-Learn Pipeline object.

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
import statsmodels.datasets
from statsmodels.api import OLS
from statsmodels.tools import add_constant

from steps.forward import ForwardSelector
from steps.subset import SubsetSelector

We’ll download the auto dataset via Statsmodels; we’ll use mpg as the endogenous variable and the remaining variables as exongenous. We won’t use make, as that will create several dummies and increase the number of paramters to 12+, which is too many for the SubsetSelector class; we’ll also drop price.

data = statsmodels.datasets.webuse('auto')
data['foreign'] = pd.Series([x == 'Foreign' for x in data['foreign']]).astype(int)
data.fillna(0, inplace=True)
data.head()
make price mpg rep78 headroom trunk weight length turn displacement gear_ratio foreign
0 AMC Concord 4099 22 3.0 2.5 11 2930 186 40 121 3.58 0
1 AMC Pacer 4749 17 3.0 3.0 11 3350 173 40 258 2.53 0
2 AMC Spirit 3799 22 0.0 3.0 12 2640 168 35 121 3.08 0
3 Buick Century 4816 20 3.0 4.5 16 3250 196 40 196 2.93 0
4 Buick Electra 7827 15 4.0 4.0 20 4080 222 43 350 2.41 0
X = data.iloc[:, 3:]
y = data['mpg']

Forward Stepwise Selection#

The ForwardSelector follows the standard stepwise regression algorithm: begin with a null model, iteratively test each variable and select the one that gives the most statistically significant improvement of the fit, and repeat. This greedy algorithm continues until the fit no longer improves.

The ForwardSelector is instantiated with two parameters: normalize and metric. Normalize defaults to False, assuming that this class is part of a larger pipeline; metric defaults to AIC.

Parameter

Type

Description

normalize

bool

Whether to normalize features; default False

metric

str

Optimization metric to use; must be one of aic or bic; default aic

The ForwardSelector class follows the Scikit-Learn API. After fitting the selector using the .fit() method, the selected features can be accessed using the boolean mask under the .best_support_ attribute.

selector = ForwardSelector(normalize=True, metric='aic')
selector.fit(X, y)
ForwardSelector(normalize=True)
X.loc[:, selector.best_support_]
rep78 weight length gear_ratio foreign
0 3.0 2930 186 3.58 0
1 3.0 3350 173 2.53 0
2 0.0 2640 168 3.08 0
3 3.0 3250 196 2.93 0
4 4.0 4080 222 2.41 0
... ... ... ... ... ...
69 4.0 2160 172 3.74 1
70 5.0 2040 155 3.78 1
71 4.0 1930 155 3.78 1
72 4.0 1990 156 3.78 1
73 5.0 3170 193 2.98 1

74 rows × 5 columns

Best Subset Selection#

The SubsetSelector follows a very simple algorithm: compare all possible models with $k$ predictors, and select the model that minimizes our selection criteria. This algorithm is only appropriate for $k<=12$ features, as it becomes computationally expensive: there are $\frac{k!}{(p-k)!}$possible models, where $p$ is the total number of paramters and $k$ is the number of features included in the model.

The SubsetSelector is instantiated with two parameters: normalize and metric. Normalize defaults to False, assuming that this class is part of a larger pipeline; metric defaults to AIC.

Parameter

Type

Description

normalize

bool

Whether to normalize features; default False

metric

str

Optimization metric to use; must be one of aic or bic; default aic

The SubsetSelector class follows the Scikit-Learn API. After fitting the selector using the .fit() method, the selected features can be accessed using the boolean mask under the .best_support_ attribute.

selector = SubsetSelector(normalize=True, metric='aic')
selector.fit(X, y)
SubsetSelector(normalize=True)
X.loc[:, selector.get_support()]
rep78 weight length gear_ratio foreign
0 3.0 2930 186 3.58 0
1 3.0 3350 173 2.53 0
2 0.0 2640 168 3.08 0
3 3.0 3250 196 2.93 0
4 4.0 4080 222 2.41 0
... ... ... ... ... ...
69 4.0 2160 172 3.74 1
70 5.0 2040 155 3.78 1
71 4.0 1930 155 3.78 1
72 4.0 1990 156 3.78 1
73 5.0 3170 193 2.98 1

74 rows × 5 columns

Comparing the full model#

Using the SubsetSelector selected features yields a model with 4 fewer parameters and slightly improved AIC and BIC metrics. The summaries indicate possible multicollinearity in both models, likely caused by weight, length, displacement and other features that are all related to the weight of a vehicle.

Note: Selection using BIC as the optimization metric yields a model where weight is the only selected feature. Bayesian information criteria penalizes additional parameters more then AIC.

mod = OLS(endog=y, exog=add_constant(X)).fit()
mod.summary()
OLS Regression Results
Dep. Variable: mpg R-squared: 0.720
Model: OLS Adj. R-squared: 0.681
Method: Least Squares F-statistic: 18.33
Date: Sat, 07 Aug 2021 Prob (F-statistic): 1.29e-14
Time: 15:37:36 Log-Likelihood: -187.23
No. Observations: 74 AIC: 394.5
Df Residuals: 64 BIC: 417.5
Df Model: 9
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 39.0871 9.100 4.295 0.000 20.907 57.267
rep78 1.0021 0.357 2.809 0.007 0.290 1.715
headroom -0.0167 0.611 -0.027 0.978 -1.237 1.204
trunk -0.0772 0.154 -0.503 0.617 -0.384 0.230
weight -0.0037 0.002 -1.928 0.058 -0.008 0.000
length -0.0752 0.061 -1.229 0.223 -0.197 0.047
turn -0.1762 0.187 -0.941 0.350 -0.550 0.198
displacement 0.0131 0.011 1.180 0.243 -0.009 0.035
gear_ratio 3.7067 1.751 2.116 0.038 0.208 7.206
foreign -4.4633 1.385 -3.222 0.002 -7.230 -1.696
Omnibus: 28.364 Durbin-Watson: 2.523
Prob(Omnibus): 0.000 Jarque-Bera (JB): 52.945
Skew: 1.389 Prob(JB): 3.18e-12
Kurtosis: 6.074 Cond. No. 7.55e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.55e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
mod = OLS(endog=y, exog=add_constant(X.loc[:, selector.best_support_])).fit()
mod.summary()
OLS Regression Results
Dep. Variable: mpg R-squared: 0.710
Model: OLS Adj. R-squared: 0.688
Method: Least Squares F-statistic: 33.25
Date: Sat, 07 Aug 2021 Prob (F-statistic): 5.22e-17
Time: 15:37:40 Log-Likelihood: -188.63
No. Observations: 74 AIC: 389.3
Df Residuals: 68 BIC: 403.1
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 40.3703 7.860 5.136 0.000 24.687 56.054
rep78 0.9040 0.342 2.647 0.010 0.223 1.586
weight -0.0030 0.002 -1.770 0.081 -0.006 0.000
length -0.1058 0.053 -1.990 0.051 -0.212 0.000
gear_ratio 2.6905 1.511 1.780 0.079 -0.325 5.706
foreign -4.0123 1.320 -3.040 0.003 -6.646 -1.379
Omnibus: 24.257 Durbin-Watson: 2.442
Prob(Omnibus): 0.000 Jarque-Bera (JB): 39.774
Skew: 1.252 Prob(JB): 2.31e-09
Kurtosis: 5.576 Cond. No. 6.59e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.59e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Use in Scikit-Learn Pipeline#

Both ForwardSelector and SubsetSelector objects are compatible with Scikit-Learn Pipeline objects, and can be used as feature selection steps:

pl = Pipeline([
    ('feature_selection', SubsetSelector(normalize=True)),
    ('regression', LinearRegression())
])
pl.fit(X, y)
Pipeline(steps=[('feature_selection', SubsetSelector(normalize=True)),
                ('regression', LinearRegression())])
pl.score(X, y)
0.7097132531085899