Dimensionality reduction using SelectFromModel()#

  • Last modified: 2023-03-11 | YouTube

Linear models#

Linear models penalized with an L1 norm tend to drive many of the feature coefficients to exactly zero, so they can be used for dimensionality reduction (feature selection). The following types of models are recommended:

  • Lasso()

  • LogisticRegression()

  • LinearSVC()
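As an illustrative sketch (not part of the original notebook), Lasso() can feed SelectFromModel directly; the diabetes regression dataset is used here only as an example, since Lasso is a regressor:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# Lasso drives some coefficients exactly to zero; SelectFromModel keeps
# only the features whose absolute coefficient exceeds the threshold
# (1e-5 by default for L1-penalized estimators).
selector = SelectFromModel(estimator=Lasso(alpha=1.0))
X_new = selector.fit_transform(X, y)
X_new.shape
```

With this (fairly strong) `alpha`, only a few of the ten diabetes features survive; lowering `alpha` keeps more of them.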

[1]:
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X.shape
[1]:
(150, 4)
[2]:
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

#
# Create and fit an estimator
#
linearSVC = LinearSVC(
    C=0.01,
    penalty="l1",
    dual=False,
    max_iter=10000,
)

linearSVC.fit(X, y)

#
# Selector
#
model = SelectFromModel(
    # -------------------------------------------------------------------------
    # The base estimator from which the transformer is built. This can be
    # either a fitted estimator (if prefit is set to True) or an unfitted one.
    estimator=linearSVC,
    # -------------------------------------------------------------------------
    # The threshold value to use for feature selection. Features whose
    # importance is greater or equal are kept while the others are discarded.
    # * float.
    # * "median": the threshold value is the median of feature importances.
    # * "mean": the threshold value is the mean of feature importances.
    # * "1.25*mean": a scaling factor may be combined with "mean" or "median".
    # * None: if the penalty is L1, the threshold used is 1e-5; otherwise,
    #   "mean".
    threshold=None,
    # -------------------------------------------------------------------------
    # Whether a prefit model is expected to be passed into the constructor
    # directly or not.
    prefit=True,
    # -------------------------------------------------------------------------
    # Order of the norm used to filter the vectors of coefficients below
    # threshold in the case where the coef_ attribute of the estimator is of
    # dimension 2.
    norm_order=1,
    # -------------------------------------------------------------------------
    # The maximum number of features to select.
    max_features=None,
)

X_new = model.transform(X)
X_new.shape
[2]:
(150, 3)
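To see which column was discarded (a sketch extending the cell above, not in the original notebook), the selector's get_support() method returns a boolean mask that can be mapped back to the iris feature names:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

data = load_iris()
X, y = data.data, data.target

svc = LinearSVC(C=0.01, penalty="l1", dual=False, max_iter=10000).fit(X, y)
selector = SelectFromModel(estimator=svc, prefit=True)

# Boolean mask of retained columns, mapped back to the feature names
mask = selector.get_support()
selected = np.array(data.feature_names)[mask]
```

Three of the four names remain in `selected`, matching the (150, 3) shape above.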

Using trees#

[3]:
from sklearn.ensemble import ExtraTreesClassifier

treeClassifier = ExtraTreesClassifier(n_estimators=50)
treeClassifier = treeClassifier.fit(X, y)
treeClassifier.feature_importances_
[3]:
array([0.07210089, 0.04735931, 0.42085408, 0.45968572])
[4]:
from sklearn.feature_selection import SelectFromModel

model = SelectFromModel(
    estimator=treeClassifier,
    prefit=True,
)

X_new = model.transform(X)
X_new.shape
[4]:
(150, 2)
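When prefit is left at its default of False, SelectFromModel can also be fitted as a step inside a Pipeline, so selection and classification are trained with a single fit call. A minimal sketch (the LogisticRegression final step is an assumption for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    # prefit=False (the default): the inner forest is fitted during
    # pipeline.fit, and only the selected columns reach the classifier
    ("select", SelectFromModel(ExtraTreesClassifier(n_estimators=50))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)
score = pipeline.score(X, y)
```

This arrangement also lets the selection threshold be tuned with GridSearchCV via the `select__threshold` parameter name.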