Feature selection based on the family-wise error rate (SelectFwe)
Last modified: 2023-03-11
This method selects the features whose p-values satisfy:
\text{p-value}_i < \frac{\alpha}{n}
where n is the number of independent variables (features) and \alpha is the target family-wise error rate.
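The threshold \alpha / n is the Bonferroni correction. As a brief sketch of why it keeps the family-wise error rate (the probability of selecting at least one irrelevant feature) at or below \alpha, assume that m_0 \le n of the features are truly unrelated to the target, indexed by a set I_0, and that each of their p-values satisfies P(\text{p-value}_i < t) \le t. The symbols m_0 and I_0 are notation introduced here for illustration only. The union bound then gives:

```latex
\text{FWER}
= P\Big(\bigcup_{i \in I_0} \Big\{ \text{p-value}_i < \frac{\alpha}{n} \Big\}\Big)
\le \sum_{i \in I_0} P\Big(\text{p-value}_i < \frac{\alpha}{n}\Big)
\le m_0 \, \frac{\alpha}{n}
\le \alpha
```

The bound does not require the features to be independent of one another.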
[1]:
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
X.shape
[1]:
(569, 30)
[2]:
from sklearn.feature_selection import SelectFwe, chi2
selectFwe = SelectFwe(
    # -------------------------------------------------------------------------
    # Function taking two arrays X and y, and returning a pair of arrays
    # (scores, pvalues).
    score_func=chi2,
    # -------------------------------------------------------------------------
    # The highest uncorrected p-value for features to be kept. The threshold
    # actually applied to each p-value is alpha / n_features.
    alpha=0.01,
)
selectFwe.fit(X, y)
X_new = selectFwe.transform(X)
X_new.shape
[2]:
(569, 15)
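As a sanity check, the selection rule can be reproduced by hand. The sketch below recomputes the corrected threshold alpha / n and compares the resulting mask with the one reported by `get_support()`. The names `selector`, `threshold` and `manual_mask` are illustrative and not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFwe, chi2

X, y = load_breast_cancer(return_X_y=True)
alpha = 0.01
selector = SelectFwe(score_func=chi2, alpha=alpha).fit(X, y)

# Corrected threshold: alpha divided by the number of features.
n_features = X.shape[1]
threshold = alpha / n_features

# Features whose p-value falls below the corrected threshold.
manual_mask = selector.pvalues_ < threshold

# The hand-built mask should agree with the selector's own support mask.
print(np.array_equal(manual_mask, selector.get_support()))
print(manual_mask.sum())
```

With alpha = 0.01 and 30 features, the effective per-feature threshold is roughly 3.3e-4, which is why a feature with a p-value of about 1.2e-3 is dropped even though it is below 0.01.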
[3]:
selectFwe.scores_
[3]:
array([2.66104917e+02, 9.38975081e+01, 2.01110286e+03, 5.39916559e+04,
1.49899264e-01, 5.40307549e+00, 1.97123536e+01, 1.05440354e+01,
2.57379775e-01, 7.43065536e-05, 3.46752472e+01, 9.79353970e-03,
2.50571896e+02, 8.75850471e+03, 3.26620664e-03, 6.13785332e-01,
1.04471761e+00, 3.05231563e-01, 8.03633831e-05, 6.37136566e-03,
4.91689157e+02, 1.74449400e+02, 3.66503542e+03, 1.12598432e+05,
3.97365694e-01, 1.93149220e+01, 3.95169151e+01, 1.34854195e+01,
1.29886140e+00, 2.31522407e-01])
[4]:
selectFwe.pvalues_
[4]:
array([8.01397628e-060, 3.32292194e-022, 0.00000000e+000, 0.00000000e+000,
6.98631644e-001, 2.01012999e-002, 9.00175712e-006, 1.16563638e-003,
6.11926026e-001, 9.93122221e-001, 3.89553429e-009, 9.21168192e-001,
1.94877489e-056, 0.00000000e+000, 9.54425121e-001, 4.33366115e-001,
3.06726812e-001, 5.80621137e-001, 9.92847410e-001, 9.36379753e-001,
6.11324751e-109, 7.89668299e-040, 0.00000000e+000, 0.00000000e+000,
5.28452867e-001, 1.10836762e-005, 3.25230064e-010, 2.40424384e-004,
2.54421307e-001, 6.30397277e-001])
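The raw arrays are hard to read without feature names. A minimal sketch, assuming the same dataset and settings as above, pairs each retained feature with its p-value through the boolean mask returned by `get_support()`. The variable names `data`, `selector`, `support` and `selected` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFwe, chi2

data = load_breast_cancer()
selector = SelectFwe(score_func=chi2, alpha=0.01).fit(data.data, data.target)

# Boolean mask over the original columns: True where the feature was kept.
support = selector.get_support()

# Pair each retained feature name with its p-value, smallest first.
selected = sorted(
    zip(data.feature_names[support], selector.pvalues_[support]),
    key=lambda pair: pair[1],
)
for name, pvalue in selected:
    print(f"{name:<25s} p-value = {pvalue:.3e}")
```

Sorting by p-value is only a presentation choice; `SelectFwe` itself keeps the selected columns in their original order.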