Mutual Information Estimation (mutual_info_classif and mutual_info_regression) between Variables#

  • Last modified: 2023-03-11 | YouTube

  • The mutual_info_classif and mutual_info_regression functions estimate the mutual information between each explanatory variable and the dependent variable.

  • Mutual information is a measure of the mutual dependence between two random variables.

  • For discrete variables, mutual information is computed as:

    I(x, y) = \sum_x \sum_y \text{Prob}(x,y) \log \left( \frac{\text{Prob}(x,y)}{\text{Prob}(x) \cdot \text{Prob}(y)} \right)

    where:

    • \text{Prob}(x,y) is the joint probability of x and y.

    • \text{Prob}(x) and \text{Prob}(y) are the marginal probabilities.

  • This metric is based on the Kullback-Leibler divergence, which measures the difference between two probability distributions.

  • This rests on the observation that if there is no relationship between x and y, the two variables are independent, and therefore \text{Prob}(x,y) = \text{Prob}(x) \times \text{Prob}(y), so that I(x, y) = 0 (see the numerical sketch after this list).
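
To make the definition concrete, here is a minimal numerical sketch (illustrative values of my own, not part of the original notebook) that evaluates I(x, y) directly from a joint probability table and verifies that it vanishes when the joint distribution factorizes into the marginals:

import numpy as np

# Joint probability table Prob(x, y) for two binary variables
# (illustrative values that sum to 1).
p_xy = np.array(
    [[0.30, 0.10],
     [0.10, 0.50]]
)

p_x = p_xy.sum(axis=1)  # marginal Prob(x)
p_y = p_xy.sum(axis=0)  # marginal Prob(y)

# I(x, y) = sum_x sum_y Prob(x, y) * log(Prob(x, y) / (Prob(x) * Prob(y)))
mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))
print(mi)  # > 0: x and y are dependent

# Under independence, Prob(x, y) = Prob(x) * Prob(y), so the log term is
# log(1) = 0 everywhere and the mutual information is exactly zero.
p_indep = np.outer(p_x, p_y)
print(np.sum(p_indep * np.log(p_indep / np.outer(p_x, p_y))))  # 0.0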

Classification#

[1]:
import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(
    n_samples=150,
    n_features=2,
    centers=3,            # three separable classes
    cluster_std=0.8,
    shuffle=False,
    random_state=12345,
)

#
# Note that x0 and x1 are informative, while x2 is an appended random column
# with no explanatory power (unseeded, so its exact MI score varies slightly
# between runs).
#
X = np.hstack((X, 2 * np.random.random((X.shape[0], 1))))
X.shape
[1]:
(150, 3)

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif

[2]:
from sklearn.feature_selection import mutual_info_classif

mutual_info = mutual_info_classif(
    # -------------------------------------------------------------------------
    # Feature matrix.
    X=X,
    # -------------------------------------------------------------------------
    # Target vector.
    y=y,
    # -------------------------------------------------------------------------
    # Number of neighbors to use for MI estimation for continuous variables.
    n_neighbors=3,
    # -------------------------------------------------------------------------
    # If bool, then determines whether to consider all features discrete or
    # continuous.
    discrete_features="auto",
    # -------------------------------------------------------------------------
    # Determines random number generation for adding small noise to continuous
    # variables in order to remove repeated values.
    random_state=None,
)

mutual_info
[2]:
array([1.10530858, 0.9911345 , 0.00894349])
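
The scores themselves are rarely the end goal; they usually feed a selection step. As a minimal sketch (not part of the original notebook), SelectKBest can use mutual_info_classif as its scoring function to keep the k highest-scoring columns; here it should retain x0 and x1 and drop the noise column x2:

from sklearn.feature_selection import SelectKBest

# Keep the 2 features with the highest mutual information with y.
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)
X_selected.shape  # (150, 2): the random column x2 is discarded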

Regression#

[3]:
from sklearn.datasets import make_regression

X, y = make_regression(
    n_samples=300,
    n_features=4,
    n_informative=2,   # only two columns drive the target ...
    bias=0.0,
    tail_strength=0.9,
    noise=12.0,        # std. dev. of the Gaussian noise added to the target
    shuffle=False,     # ... and they stay in the first positions
    coef=False,
    random_state=0,
)
[4]:
from sklearn.feature_selection import mutual_info_regression

mutual_info = mutual_info_regression(
    # -------------------------------------------------------------------------
    # Feature matrix.
    X=X,
    # -------------------------------------------------------------------------
    # Target vector.
    y=y,
    # -------------------------------------------------------------------------
    # If bool, then determines whether to consider all features discrete or
    # continuous.
    discrete_features="auto",
    # -------------------------------------------------------------------------
    # Number of neighbors to use for MI estimation for continuous variables.
    n_neighbors=3,
    # -------------------------------------------------------------------------
    # Determines random number generation for adding small noise to continuous
    # variables in order to remove repeated values.
    random_state=None,
)

mutual_info
[4]:
array([0.07191495, 0.63637812, 0.        , 0.09702994])
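
A point worth noting (my own illustration, not part of the original notebook): unlike an F-test on a linear fit, mutual information also detects purely nonlinear dependence. In the sketch below, y depends quadratically on x, so f_regression reports almost no linear association while the estimated mutual information is clearly positive:

import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=(500, 1))
y_quadratic = x[:, 0] ** 2  # purely nonlinear (quadratic) dependence

f_statistic, _ = f_regression(x, y_quadratic)
mi = mutual_info_regression(x, y_quadratic, random_state=0)

f_statistic  # close to 0: there is no linear relationship to detect
mi           # clearly positive: MI captures the quadratic dependence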