# Quickstart

## step-select

A Scikit-Learn-style feature selector using best subsets and stepwise regression.
## Install

Create a virtual environment with Python 3.8 and install from PyPI:

```
pip install step-select
```
## Use

### Preliminaries

Note: this example requires two additional packages: `pandas` and `statsmodels`.

In this example we’ll show how the `ForwardSelector` and `SubsetSelector` classes can be used on their own or in conjunction with a Scikit-Learn `Pipeline` object.
```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
import statsmodels.datasets
from statsmodels.api import OLS
from statsmodels.tools import add_constant
from steps.forward import ForwardSelector
from steps.subset import SubsetSelector
```
We’ll download the `auto` dataset via Statsmodels; we’ll use `mpg` as the endogenous variable and the remaining variables as exogenous. We won’t use `make`, as that would create several dummies and increase the number of parameters to 12+, which is too many for the `SubsetSelector` class; we’ll also drop `price`.
```python
data = statsmodels.datasets.webuse('auto')
data['foreign'] = pd.Series([x == 'Foreign' for x in data['foreign']]).astype(int)
data.fillna(0, inplace=True)
data.head()
```
|   | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AMC Concord | 4099 | 22 | 3.0 | 2.5 | 11 | 2930 | 186 | 40 | 121 | 3.58 | 0 |
| 1 | AMC Pacer | 4749 | 17 | 3.0 | 3.0 | 11 | 3350 | 173 | 40 | 258 | 2.53 | 0 |
| 2 | AMC Spirit | 3799 | 22 | 0.0 | 3.0 | 12 | 2640 | 168 | 35 | 121 | 3.08 | 0 |
| 3 | Buick Century | 4816 | 20 | 3.0 | 4.5 | 16 | 3250 | 196 | 40 | 196 | 2.93 | 0 |
| 4 | Buick Electra | 7827 | 15 | 4.0 | 4.0 | 20 | 4080 | 222 | 43 | 350 | 2.41 | 0 |
```python
X = data.iloc[:, 3:]
y = data['mpg']
```
### Forward Stepwise Selection
The `ForwardSelector` follows the standard stepwise regression algorithm: begin with a null model, iteratively test each variable and select the one that gives the most statistically significant improvement of the fit, and repeat. This greedy algorithm continues until the fit no longer improves.
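To make the loop concrete, here is a minimal sketch of forward selection by AIC, written against `statsmodels` directly. The `forward_select` helper below is purely illustrative and is not the library’s internal implementation:

```python
import numpy as np
import statsmodels.api as sm

def forward_select(X, y):
    """Greedy forward selection minimizing AIC (illustrative sketch)."""
    selected, remaining = [], list(X.columns)
    # Begin with the null (intercept-only) model
    best_aic = sm.OLS(y, np.ones(len(y))).fit().aic
    while remaining:
        # Refit once per remaining candidate and score each fit by AIC
        scores = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().aic
                  for c in remaining}
        candidate, aic = min(scores.items(), key=lambda kv: kv[1])
        if aic >= best_aic:
            break  # no candidate improves the fit, so stop
        best_aic = aic
        selected.append(candidate)
        remaining.remove(candidate)
    return selected
```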
The `ForwardSelector` is instantiated with two parameters: `normalize` and `metric`. `normalize` defaults to `False`, assuming that this class is part of a larger pipeline; `metric` defaults to AIC.
| Parameter | Type | Description |
|---|---|---|
| `normalize` | `bool` | Whether to normalize features; default `False` |
| `metric` | `str` | Optimization metric to use; must be one of `'aic'` or `'bic'`; default `'aic'` |
The `ForwardSelector` class follows the Scikit-Learn API. After fitting the selector using the `.fit()` method, the selected features can be accessed using the boolean mask under the `.best_support_` attribute.
```python
selector = ForwardSelector(normalize=True, metric='aic')
selector.fit(X, y)
```

```
ForwardSelector(normalize=True)
```

```python
X.loc[:, selector.best_support_]
```
|   | rep78 | weight | length | gear_ratio | foreign |
|---|---|---|---|---|---|
| 0 | 3.0 | 2930 | 186 | 3.58 | 0 |
| 1 | 3.0 | 3350 | 173 | 2.53 | 0 |
| 2 | 0.0 | 2640 | 168 | 3.08 | 0 |
| 3 | 3.0 | 3250 | 196 | 2.93 | 0 |
| 4 | 4.0 | 4080 | 222 | 2.41 | 0 |
| ... | ... | ... | ... | ... | ... |
| 69 | 4.0 | 2160 | 172 | 3.74 | 1 |
| 70 | 5.0 | 2040 | 155 | 3.78 | 1 |
| 71 | 4.0 | 1930 | 155 | 3.78 | 1 |
| 72 | 4.0 | 1990 | 156 | 3.78 | 1 |
| 73 | 5.0 | 3170 | 193 | 2.98 | 1 |

74 rows × 5 columns
### Best Subset Selection
The `SubsetSelector` follows a very simple algorithm: compare all possible models with $k$ predictors, and select the model that minimizes our selection criterion. This algorithm is only appropriate for $k \leq 12$ features, as it becomes computationally expensive: there are $\binom{p}{k} = \frac{p!}{k!(p-k)!}$ possible models, where $p$ is the total number of parameters and $k$ is the number of features included in the model.
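The exhaustive search is straightforward to express with `itertools.combinations`; the `best_subset` helper below is an illustrative sketch of the idea (again scoring by AIC), not the library’s implementation:

```python
from itertools import combinations
import statsmodels.api as sm

def best_subset(X, y):
    """Exhaustive best-subset search minimizing AIC (illustrative sketch)."""
    best_aic, best_cols = float('inf'), ()
    # Enumerate every non-empty subset of the available features
    for k in range(1, X.shape[1] + 1):
        for cols in combinations(X.columns, k):
            aic = sm.OLS(y, sm.add_constant(X[list(cols)])).fit().aic
            if aic < best_aic:
                best_aic, best_cols = aic, cols
    return list(best_cols)
```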
The `SubsetSelector` is instantiated with two parameters: `normalize` and `metric`. `normalize` defaults to `False`, assuming that this class is part of a larger pipeline; `metric` defaults to AIC.
| Parameter | Type | Description |
|---|---|---|
| `normalize` | `bool` | Whether to normalize features; default `False` |
| `metric` | `str` | Optimization metric to use; must be one of `'aic'` or `'bic'`; default `'aic'` |
The `SubsetSelector` class follows the Scikit-Learn API. After fitting the selector using the `.fit()` method, the selected features can be accessed using the boolean mask under the `.best_support_` attribute.
```python
selector = SubsetSelector(normalize=True, metric='aic')
selector.fit(X, y)
```

```
SubsetSelector(normalize=True)
```

```python
X.loc[:, selector.get_support()]
```
|   | rep78 | weight | length | gear_ratio | foreign |
|---|---|---|---|---|---|
| 0 | 3.0 | 2930 | 186 | 3.58 | 0 |
| 1 | 3.0 | 3350 | 173 | 2.53 | 0 |
| 2 | 0.0 | 2640 | 168 | 3.08 | 0 |
| 3 | 3.0 | 3250 | 196 | 2.93 | 0 |
| 4 | 4.0 | 4080 | 222 | 2.41 | 0 |
| ... | ... | ... | ... | ... | ... |
| 69 | 4.0 | 2160 | 172 | 3.74 | 1 |
| 70 | 5.0 | 2040 | 155 | 3.78 | 1 |
| 71 | 4.0 | 1930 | 155 | 3.78 | 1 |
| 72 | 4.0 | 1990 | 156 | 3.78 | 1 |
| 73 | 5.0 | 3170 | 193 | 2.98 | 1 |

74 rows × 5 columns
### Comparing the full model
Using the features selected by the `SubsetSelector` yields a model with 4 fewer parameters and slightly improved AIC and BIC metrics. The summaries indicate possible multicollinearity in both models, likely caused by `weight`, `length`, `displacement`, and other features that are all related to the weight of a vehicle.
Note: Selection using BIC as the optimization metric yields a model where `weight` is the only selected feature. The Bayesian information criterion penalizes additional parameters more than AIC does.
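To reproduce that comparison yourself, pass `metric='bic'` at instantiation; a minimal sketch, assuming `'bic'` is the accepted metric string:

```python
bic_selector = SubsetSelector(normalize=True, metric='bic')
bic_selector.fit(X, y)
# Per the note above, BIC-based selection keeps only `weight`
X.loc[:, bic_selector.best_support_].columns
```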
```python
mod = OLS(endog=y, exog=add_constant(X)).fit()
mod.summary()
```
| Dep. Variable: | mpg | R-squared: | 0.720 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.681 |
| Method: | Least Squares | F-statistic: | 18.33 |
| Date: | Sat, 07 Aug 2021 | Prob (F-statistic): | 1.29e-14 |
| Time: | 15:37:36 | Log-Likelihood: | -187.23 |
| No. Observations: | 74 | AIC: | 394.5 |
| Df Residuals: | 64 | BIC: | 417.5 |
| Df Model: | 9 | | |
| Covariance Type: | nonrobust | | |
|   | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 39.0871 | 9.100 | 4.295 | 0.000 | 20.907 | 57.267 |
| rep78 | 1.0021 | 0.357 | 2.809 | 0.007 | 0.290 | 1.715 |
| headroom | -0.0167 | 0.611 | -0.027 | 0.978 | -1.237 | 1.204 |
| trunk | -0.0772 | 0.154 | -0.503 | 0.617 | -0.384 | 0.230 |
| weight | -0.0037 | 0.002 | -1.928 | 0.058 | -0.008 | 0.000 |
| length | -0.0752 | 0.061 | -1.229 | 0.223 | -0.197 | 0.047 |
| turn | -0.1762 | 0.187 | -0.941 | 0.350 | -0.550 | 0.198 |
| displacement | 0.0131 | 0.011 | 1.180 | 0.243 | -0.009 | 0.035 |
| gear_ratio | 3.7067 | 1.751 | 2.116 | 0.038 | 0.208 | 7.206 |
| foreign | -4.4633 | 1.385 | -3.222 | 0.002 | -7.230 | -1.696 |
| Omnibus: | 28.364 | Durbin-Watson: | 2.523 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 52.945 |
| Skew: | 1.389 | Prob(JB): | 3.18e-12 |
| Kurtosis: | 6.074 | Cond. No. | 7.55e+04 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.55e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
```python
mod = OLS(endog=y, exog=add_constant(X.loc[:, selector.best_support_])).fit()
mod.summary()
```
| Dep. Variable: | mpg | R-squared: | 0.710 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.688 |
| Method: | Least Squares | F-statistic: | 33.25 |
| Date: | Sat, 07 Aug 2021 | Prob (F-statistic): | 5.22e-17 |
| Time: | 15:37:40 | Log-Likelihood: | -188.63 |
| No. Observations: | 74 | AIC: | 389.3 |
| Df Residuals: | 68 | BIC: | 403.1 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |
|   | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 40.3703 | 7.860 | 5.136 | 0.000 | 24.687 | 56.054 |
| rep78 | 0.9040 | 0.342 | 2.647 | 0.010 | 0.223 | 1.586 |
| weight | -0.0030 | 0.002 | -1.770 | 0.081 | -0.006 | 0.000 |
| length | -0.1058 | 0.053 | -1.990 | 0.051 | -0.212 | 0.000 |
| gear_ratio | 2.6905 | 1.511 | 1.780 | 0.079 | -0.325 | 5.706 |
| foreign | -4.0123 | 1.320 | -3.040 | 0.003 | -6.646 | -1.379 |
| Omnibus: | 24.257 | Durbin-Watson: | 2.442 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 39.774 |
| Skew: | 1.252 | Prob(JB): | 2.31e-09 |
| Kurtosis: | 5.576 | Cond. No. | 6.59e+04 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.59e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
### Use in Scikit-Learn Pipeline
Both `ForwardSelector` and `SubsetSelector` objects are compatible with Scikit-Learn `Pipeline` objects and can be used as feature selection steps:
```python
pl = Pipeline([
    ('feature_selection', SubsetSelector(normalize=True)),
    ('regression', LinearRegression())
])
pl.fit(X, y)
```

```
Pipeline(steps=[('feature_selection', SubsetSelector(normalize=True)),
                ('regression', LinearRegression())])
```
```python
pl.score(X, y)
```

```
0.7097132531085899
```
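Because the fitted pipeline is itself a standard Scikit-Learn estimator, it can also be handed to model-selection utilities; for example, a minimal cross-validation sketch:

```python
from sklearn.model_selection import cross_val_score

# Cross-validating the whole pipeline re-runs feature selection
# inside each training fold, which avoids selection leakage
scores = cross_val_score(pl, X, y, cv=5)
print(scores.mean())
```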