Selecting the highest-scoring features with SelectPercentile

  • Last modified: 2023-03-11 | YouTube

  • This method keeps the independent variables whose scores fall within the top percentile specified by the user.

  • Each feature is scored on its own (univariate selection), so interactions between features are not taken into account.

  • To select features, you must supply a score function that takes X and y and returns the scores and their associated p-values (or only the scores).

  • The functions available for computing the importance of each independent variable are:

    • f_classif()

    • f_regression()

    • mutual_info_classif()

    • mutual_info_regression()

    • chi2() (classification only; requires non-negative feature values)
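
  • As a minimal sketch of the score-function contract described above: chi2 is called directly to show the (scores, pvalues) pair, and a hypothetical variance_score helper (not part of scikit-learn, and deliberately simplistic because it ignores y) illustrates that a function returning a single array of scores is also accepted. The variable names and percentile=25 are assumptions chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, chi2

X_demo, y_demo = load_digits(return_X_y=True)

# A built-in score function returns one score and one p-value per feature.
demo_scores, demo_pvalues = chi2(X_demo, y_demo)
print(demo_scores.shape, demo_pvalues.shape)


def variance_score(X, y):
    # Hypothetical score function for illustration only: it ranks features
    # by variance and ignores y, so it returns scores without p-values.
    return np.var(X, axis=0)


selector_demo = SelectPercentile(score_func=variance_score, percentile=25)
X_demo_new = selector_demo.fit_transform(X_demo, y_demo)

# With a scores-only function, pvalues_ is set to None after fitting.
print(X_demo_new.shape, selector_demo.pvalues_)
```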

[1]:
#
# Load the test dataset
#
import numpy as np
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

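# Append 36 pure-noise columns drawn uniformly from [0, 2)
# (unseeded, so the noise values differ between runs)
X = np.hstack((X, 2 * np.random.random((X.shape[0], 36))))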

X.shape
[1]:
(1797, 100)
[2]:
from sklearn.feature_selection import SelectPercentile, chi2

selectPercentile = SelectPercentile(
    # -------------------------------------------------------------------------
    # Function taking two arrays X and y, and returning a pair of arrays
    # (scores, pvalues) or a single array with scores.
    score_func=chi2,
    # -------------------------------------------------------------------------
    # Percent of features to keep.
    percentile=10,
)

selectPercentile.fit(X, y)
X_new = selectPercentile.transform(X)

X_new.shape
[2]:
(1797, 10)
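
  • To see which columns survive the selection, get_support() can be queried on the fitted selector. The sketch below rebuilds the augmented matrix with a seeded generator (the seed and the variable names are assumptions added for reproducibility, not part of the original cell) and counts how many of the kept columns are appended noise columns.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, chi2

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility

X_digits, y_digits = load_digits(return_X_y=True)
n_original = X_digits.shape[1]  # 64 pixel features
X_aug = np.hstack((X_digits, 2 * rng.random((X_digits.shape[0], 36))))

selector = SelectPercentile(score_func=chi2, percentile=10).fit(X_aug, y_digits)

# Column indices (in the 100-column matrix) of the features that were kept.
selected_idx = selector.get_support(indices=True)
print(selected_idx)

# Count how many of the kept columns are appended noise columns.
n_noise_kept = int(np.sum(selected_idx >= n_original))
print(n_noise_kept)
```

  • Since the noise columns receive scores that are orders of magnitude lower than most pixel columns, none of them should be expected among the kept features.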
[3]:
selectPercentile.scores_
[3]:
array([           nan, 8.11907004e+02, 3.50128250e+03, 6.98925257e+02,
       4.38529699e+02, 3.87981926e+03, 3.96945823e+03, 1.19356082e+03,
       2.47952140e+01, 2.95383109e+03, 2.58365199e+03, 3.88242059e+02,
       8.24690949e+02, 3.67648925e+03, 1.98357961e+03, 5.97241982e+02,
       8.95886124e+00, 1.92421690e+03, 2.40927141e+03, 3.55631595e+03,
       4.87194195e+03, 4.78219922e+03, 2.15517379e+03, 3.76765833e+02,
       7.90090158e+00, 2.47182418e+03, 4.51548150e+03, 2.98664315e+03,
       3.72409568e+03, 3.20864687e+03, 5.13807412e+03, 3.57127072e+01,
                  nan, 5.68825080e+03, 5.26246647e+03, 3.16506059e+03,
       3.23163943e+03, 2.53299696e+03, 3.28881404e+03,            nan,
       1.42850829e+02, 3.86385788e+03, 6.41608672e+03, 5.44825154e+03,
       4.07973153e+03, 2.13402540e+03, 4.48634098e+03, 3.13538981e+02,
       7.03992739e+01, 4.49723273e+02, 2.80197224e+03, 1.52754520e+03,
       1.65315892e+03, 3.07399804e+03, 5.25121749e+03, 6.83882273e+02,
       9.15254237e+00, 8.51067915e+02, 3.80024731e+03, 7.30929757e+02,
       1.85953966e+03, 4.37922504e+03, 5.05900552e+03, 2.28132864e+03,
       3.60008703e+00, 2.19217170e+00, 2.03066569e+00, 2.94171731e+00,
       3.27709832e+00, 1.71971160e+00, 7.34252314e-01, 4.64692864e+00,
       2.98543828e+00, 7.35578295e+00, 4.02951205e+00, 4.13465963e+00,
       4.19643601e+00, 2.92327956e+00, 2.53590121e+00, 2.34154316e+00,
       2.36389578e+00, 2.44285487e+00, 2.45927501e+00, 6.05639073e+00,
       2.85872832e+00, 2.74251548e+00, 3.12697422e+00, 5.57462076e+00,
       5.18707468e+00, 2.93059604e+00, 9.43106988e-01, 3.35913263e+00,
       4.63530197e+00, 1.17495076e+00, 7.74837679e-01, 3.07863232e+00,
       3.36366910e+00, 3.64506330e+00, 4.72851387e+00, 1.27804265e+00])
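
  • The nan entries in scores_ sit at positions 0, 32 and 39. A plausible explanation is that those pixels never change across samples (they are always zero), which leaves the chi-squared statistic undefined. The sketch below, using assumed variable names, checks which original columns are constant and lists the columns for which chi2 returns nan so the two can be compared.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import chi2

X_pix, y_pix = load_digits(return_X_y=True)

# Columns whose value never changes across samples.
constant_cols = np.flatnonzero(X_pix.max(axis=0) == X_pix.min(axis=0))

# Columns for which chi2 returns an undefined (nan) score.
pix_scores, _ = chi2(X_pix, y_pix)
nan_cols = np.flatnonzero(np.isnan(pix_scores))

print(constant_cols)
print(nan_cols)
```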
[4]:
selectPercentile.pvalues_
[4]:
array([            nan, 5.81310493e-169, 0.00000000e+000, 1.17740541e-144,
       8.11314242e-089, 0.00000000e+000, 0.00000000e+000, 2.97727113e-251,
       3.20626273e-003, 0.00000000e+000, 0.00000000e+000, 4.41344943e-078,
       1.02825052e-171, 0.00000000e+000, 0.00000000e+000, 8.18335060e-123,
       4.41080315e-001, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
       0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 1.23435651e-075,
       5.44163062e-001, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
       0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 4.45801029e-005,
                   nan, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
       0.00000000e+000, 0.00000000e+000, 0.00000000e+000,             nan,
       2.65875300e-026, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
       0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 3.49452723e-062,
       1.27145348e-011, 3.28604761e-091, 0.00000000e+000, 0.00000000e+000,
       0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 2.01600539e-141,
       4.23314114e-001, 2.14859356e-177, 0.00000000e+000, 1.54562173e-151,
       0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
       9.35711572e-001, 9.88051240e-001, 9.90973325e-001, 9.66560856e-001,
       9.52288627e-001, 9.95165159e-001, 9.99844029e-001, 8.63937029e-001,
       9.64869491e-001, 6.00129741e-001, 9.09459390e-001, 9.02327947e-001,
       8.98014222e-001, 9.67259052e-001, 9.79902343e-001, 9.84845736e-001,
       9.84322544e-001, 9.82381001e-001, 9.81958723e-001, 7.34261464e-001,
       9.69633314e-001, 9.73635033e-001, 9.59047184e-001, 7.81620699e-001,
       8.17705007e-001, 9.66983061e-001, 9.99557809e-001, 9.48341186e-001,
       8.64871463e-001, 9.98917018e-001, 9.99804541e-001, 9.61095610e-001,
       9.48117653e-001, 9.33184421e-001, 8.57303250e-001, 9.98482943e-001])
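
  • The p-values above split cleanly: most pixel columns have p-values near zero, while the 36 appended noise columns have p-values close to 1. As a rough sketch, the columns below a significance threshold can be counted separately for the two groups. The seed, the variable names, and alpha = 0.05 are assumptions made for this illustration; because chi2 is intended for non-negative, count-like features, the p-values of the continuous noise columns should be read as indicative only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import chi2

rng_p = np.random.default_rng(0)  # assumed seed, only for reproducibility

X_d, y_d = load_digits(return_X_y=True)
X_all = np.hstack((X_d, 2 * rng_p.random((X_d.shape[0], 36))))

_, pvals = chi2(X_all, y_d)

alpha = 0.05  # assumed significance threshold
significant = pvals < alpha  # nan p-values compare as False

# Split the count between the 64 pixel columns and the 36 noise columns.
n_pixels_sig = int(significant[:64].sum())
n_noise_sig = int(significant[64:].sum())
print(n_pixels_sig, n_noise_sig)
```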