Estimating Mutual Information between variables (mutual_info_classif and mutual_info_regression)#
Last modified: 2023-03-11 | YouTube
The functions mutual_info_classif and mutual_info_regression estimate the mutual information between each explanatory variable and the dependent variable.
Mutual information is a measure of the mutual dependence between two random variables.
For discrete variables, mutual information is computed as:
I(x, y) = \sum_x \sum_y \text{Prob}(x,y) \log \left( \frac{\text{Prob}(x,y)}{\text{Prob}(x) \cdot \text{Prob}(y)} \right)
where:
\text{Prob}(x,y) is the joint probability of x and y.
\text{Prob}(x) and \text{Prob}(y) are the marginal probabilities.
This metric is based on the Kullback-Leibler divergence, which measures the difference between two probability distributions.
The underlying reasoning is that if there is no relationship between x and y, the two variables are independent, so \text{Prob}(x,y)=\text{Prob}(x)\times \text{Prob}(y), and therefore I(x,y)=0.
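A minimal sketch of the discrete formula above, assuming a hypothetical joint-probability table prob_xy for two binary variables (the values are invented for illustration; the natural logarithm gives the result in nats, the same unit scikit-learn reports):

import numpy as np

# Hypothetical joint distribution Prob(x, y); rows index x, columns index y.
prob_xy = np.array(
    [
        [0.30, 0.10],
        [0.15, 0.45],
    ]
)

prob_x = prob_xy.sum(axis=1, keepdims=True)  # marginal Prob(x)
prob_y = prob_xy.sum(axis=0, keepdims=True)  # marginal Prob(y)

# I(x, y) = sum_x sum_y Prob(x, y) * log(Prob(x, y) / (Prob(x) * Prob(y)))
mi = np.sum(prob_xy * np.log(prob_xy / (prob_x * prob_y)))
mi

If prob_xy were exactly the outer product of its marginals, every log term would be 0 and the sum would be 0, matching the independence argument above.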
Classification#
[1]:
import numpy as np
from sklearn.datasets import make_blobs
X, y = make_blobs(
n_samples=150,
n_features=2,
centers=3,
cluster_std=0.8,
shuffle=False,
random_state=12345,
)
#
# Note that x0 and x1 are significant, while x2 is a random,
# non-explanatory variable.
#
X = np.hstack((X, 2 * np.random.random((X.shape[0], 1))))
X.shape
[1]:
(150, 3)
[2]:
from sklearn.feature_selection import mutual_info_classif
mutual_info = mutual_info_classif(
# -------------------------------------------------------------------------
# Feature matrix.
X=X,
# -------------------------------------------------------------------------
# Target vector.
y=y,
# -------------------------------------------------------------------------
# Number of neighbors to use for MI estimation for continuous variables.
n_neighbors=3,
# -------------------------------------------------------------------------
# If bool, then determines whether to consider all features discrete or
# continuous.
discrete_features="auto",
# -------------------------------------------------------------------------
# Determines random number generation for adding small noise to continuous
# variables in order to remove repeated values.
random_state=None,
)
mutual_info
[2]:
array([1.10530858, 0.9911345 , 0.00894349])
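The scores match the construction of the data: x0 and x1 carry most of the information about y, while the appended random column scores close to zero. As a sketch of how these scores are typically consumed (reusing the X and y built above), SelectKBest can keep only the top-scoring features:

from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Keep the two features with the highest mutual information with y.
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)
X_selected.shape

Here fit_transform recomputes the scores internally and returns only the k selected columns, so X_selected has shape (150, 2).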
Regression#
[3]:
from sklearn.datasets import make_regression
X, y = make_regression(
n_samples=300,
n_features=4,
n_informative=2,
bias=0.0,
tail_strength=0.9,
noise=12.0,
shuffle=False,
coef=False,
random_state=0,
)
[4]:
from sklearn.feature_selection import mutual_info_regression
mutual_info = mutual_info_regression(
# -------------------------------------------------------------------------
# Feature matrix.
X=X,
# -------------------------------------------------------------------------
# Target vector.
y=y,
# -------------------------------------------------------------------------
# If bool, then determines whether to consider all features discrete or
# continuous.
discrete_features="auto",
# -------------------------------------------------------------------------
# Number of neighbors to use for MI estimation for continuous variables.
n_neighbors=3,
# -------------------------------------------------------------------------
# Determines random number generation for adding small noise to continuous
# variables in order to remove repeated values.
random_state=None,
)
mutual_info
[4]:
array([0.07191495, 0.63637812, 0. , 0.09702994])
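x1 obtains by far the highest score, while x2 is estimated as independent of y. Because random_state=None adds small noise to continuous variables, the estimates vary slightly between runs. A minimal sketch (assuming the X and y generated above) that fixes the seed through functools.partial and keeps the two highest-scoring features:

from functools import partial

from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Bind random_state so the k-NN based estimate is reproducible.
score_func = partial(mutual_info_regression, random_state=0)

# Keep the two features with the highest estimated mutual information.
selector = SelectKBest(score_func=score_func, k=2)
X_selected = selector.fit_transform(X, y)
X_selected.shape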