Stochastic Gradient Descent for Classification#

  • Last modified: 2023-03-11 | YouTube

  • In a regression model, the dependent variable y is a linear function of the independent variables x_i, such that:

    f(x) = b + w^T x

    where:

    • x is a pattern or example from the dataset.

    • b is the model intercept.

    • w holds the coefficients of the independent variables.

  • For the classification case, the dataset is assumed to consist of the examples (x_1, y_1), \cdots, (x_n, y_n), where each x_i is an m-dimensional vector and y \in \{-1, +1 \}.

  • To determine the class, the function \text{sign}(u) is applied with u = b + w^T x.
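
  • A minimal sketch of this decision rule, using made-up values for w, b, and x (the numbers are purely illustrative):

import numpy as np

# Hypothetical coefficients, intercept, and example (illustrative only)
w = np.array([0.5, -1.2])
b = 0.3
x = np.array([2.0, 1.0])

u = b + w @ x        # decision function f(x) = b + w^T x
label = np.sign(u)   # predicted class in {-1, +1}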

  • The parameters are estimated with the stochastic gradient descent algorithm, which updates the model parameters as it iterates over the examples (training patterns); a single update step is sketched right after the regularization terms below.

w \leftarrow w - \eta \left[ \alpha \frac{\partial R(w)}{\partial w} + \frac{\partial L(w^T x_i + b, y_i)}{\partial w} \right]

  • Loss function L() (the options below are illustrated numerically right after this list):

    • hinge:

    L\left(y, f(x_i) \right) = \max (0, 1 - y_i f(x_i))

    • squared hinge:

    L\left(y, f(x_i) \right) = [\max (0, 1 - y_i f(x_i))]^2

    • perceptron:

    L\left(y, f(x_i) \right) = \max (0, -y_i f(x_i))

    • modified_huber:

    L\left(y, f(x_i) \right) = \begin{cases} \max(0, 1 - y_i f(x_i))^2 & \text{if } y_i f(x_i) \ge -1 \\ \\ -4 y_i f(x_i) & \text{otherwise} \end{cases}

    • log: logistic

    L\left(y, f(x_i) \right) = \log \left(1 + \exp(- y f(x_i)) \right)

    • Least-squares:

    L\left(y, f(x_i) \right) = \frac{1}{2} [y_i - f(x_i)]^2

    • huber:

    L(d, y)= \begin{cases} \frac{1}{2}[d - y]^2 & \text{if } |d-y| \le \epsilon \\ \\ \epsilon |d-y|-\frac{1}{2}\epsilon^2 & \text{otherwise} \end{cases}

    • Any of the loss functions used with SGDRegressor.
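
  • As a rough numerical illustration (this is not scikit-learn's internal code), several of the losses above can be evaluated directly from the margin y_i f(x_i):

import numpy as np


def hinge(y, fx):
    return np.maximum(0.0, 1.0 - y * fx)


def squared_hinge(y, fx):
    return np.maximum(0.0, 1.0 - y * fx) ** 2


def perceptron(y, fx):
    return np.maximum(0.0, -y * fx)


def logistic(y, fx):
    return np.log(1.0 + np.exp(-y * fx))


# Toy margins: a well-classified example (y * f(x) = 2)
# and a misclassified one (y * f(x) = -0.5)
for y, fx in [(1, 2.0), (1, -0.5)]:
    print(hinge(y, fx), squared_hinge(y, fx), perceptron(y, fx), logistic(y, fx))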

  • R(w) is the regularization term, which can be specified as:

    • L2:

    R(w) = \frac{1}{2} \sum_j w_j^2

    • L1:

    R(w) = \sum_j |w_j|

    • ElasticNet:

    R(w) = \frac{\rho}{2} \sum_j w_j^2 + (1 - \rho) \sum_j |w_j|
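
  • Putting the pieces together, below is a minimal sketch (not scikit-learn's implementation) of a single stochastic update with the hinge loss and the L2 regularizer: the subgradient of the hinge loss with respect to w is -y_i x_i when y_i f(x_i) < 1 and zero otherwise, while the gradient of R(w) = \frac{1}{2} \sum_j w_j^2 is simply w.

import numpy as np

# Hypothetical starting point and hyperparameters (illustrative only)
w = np.array([0.5, -1.2])
b = 0.3
eta = 0.01      # learning rate
alpha = 0.0001  # regularization strength

x_i = np.array([2.0, 1.0])  # one training example
y_i = -1                    # its label in {-1, +1}

f_xi = b + w @ x_i
if y_i * f_xi < 1:                     # hinge loss is active
    grad_L_w = -y_i * x_i
    grad_L_b = -y_i
else:                                  # loss and its gradient are zero
    grad_L_w = np.zeros_like(w)
    grad_L_b = 0.0

w = w - eta * (alpha * w + grad_L_w)   # eta * (alpha * dR/dw + dL/dw), with dR/dw = w
b = b - eta * grad_L_b                 # the intercept is usually not regularized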

[1]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

NPOINTS = 150

X, y = make_blobs(
    n_samples=NPOINTS,
    n_features=2,
    centers=3,
    cluster_std=0.8,
    shuffle=False,
    random_state=12345,
)


def plot_scatter():
    plt.scatter(X[:50, 0], X[:50, 1], color="tab:red", edgecolors="k", s=120)
    plt.scatter(X[50:100, 0], X[50:100, 1], color="tab:blue", edgecolors="k", s=120)
    plt.scatter(X[100:, 0], X[100:, 1], color="tab:green", edgecolors="k", s=120)
    plt.xlabel("$x_1$")
    plt.ylabel("$x_2$")
    plt.gca().spines["left"].set_color("gray")
    plt.gca().spines["bottom"].set_color("gray")
    plt.gca().spines["top"].set_visible(False)
    plt.gca().spines["right"].set_visible(False)
    plt.tight_layout()


plt.figure(figsize=(3.5, 3.5))
plot_scatter()
plt.show()
../_images/31_gradiente_descendente_estocastico_02_SGDClassifier_8_0.png
[2]:
from sklearn.linear_model import SGDClassifier

sgdClassifier = SGDClassifier(
    # --------------------------------------------------------------------------
    # The loss function to be used.
    # * ‘hinge’ gives a linear SVM.
    # * ‘log_loss’ gives logistic regression, a probabilistic classifier.
    # * ‘modified_huber’ is another smooth loss that brings tolerance to
    #   outliers as well as probability estimates.
    # * ‘squared_hinge’ is like hinge but is quadratically penalized.
    # * ‘perceptron’ is the linear loss used by the perceptron algorithm.
    # * The other losses, ‘squared_error’, ‘huber’, ‘epsilon_insensitive’
    #   and ‘squared_epsilon_insensitive’ are designed for regression but can be
    #   useful in classification as well;
    loss='hinge',
    # --------------------------------------------------------------------------
    # The penalty (aka regularization term) to be used.
    # * None
    # * 'l2'
    # * 'l1'
    # * 'elasticnet'
    penalty='l2',
    # --------------------------------------------------------------------------
    # Constant that multiplies the regularization term. The higher the value,
    # the stronger the regularization. Also used to compute the learning rate
    # when learning_rate is set to ‘optimal’. Values must be in the
    # range [0.0, inf).
    alpha=0.0001,
    # --------------------------------------------------------------------------
    # The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1. l1_ratio=0
    # corresponds to L2 penalty, l1_ratio=1 to L1. Only used if penalty is
    # ‘elasticnet’. Values must be in the range [0.0, 1.0].
    l1_ratio=0.15,
    # --------------------------------------------------------------------------
    # Whether to fit the intercept for this model.
    fit_intercept=True,
    # --------------------------------------------------------------------------
    # The maximum number of passes over the training data (aka epochs).
    max_iter=1000,
    # --------------------------------------------------------------------------
    # The stopping criterion.
    tol=1e-3,
    # --------------------------------------------------------------------------
    # Whether or not the training data should be shuffled after each epoch.
    shuffle=True,
    # --------------------------------------------------------------------------
    # * Epsilon in the epsilon-insensitive loss functions; only if loss is
    #   ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’.
    # * For ‘huber’, determines the threshold at which it becomes less important
    #   to get the prediction exactly right.
    # * For epsilon-insensitive, any differences between the current prediction
    #   and the correct label are ignored if they are less than this threshold.
    # Values must be in the range [0.0, inf).
    epsilon=0.1,
    # --------------------------------------------------------------------------
    # Used for shuffling the data, when shuffle is set to True.
    random_state=None,
    # --------------------------------------------------------------------------
    # The learning rate schedule:
    # * ‘constant’: eta = eta0
    # * ‘optimal’: eta = 1.0 / (alpha * (t + t0))
    # * ‘invscaling’: eta = eta0 / pow(t, power_t)
    # * ‘adaptive’: eta = eta0 as long as training keeps improving;
    #   otherwise eta is divided by 5.
    learning_rate='optimal',
    # --------------------------------------------------------------------------
    # The initial learning rate for the ‘constant’, ‘invscaling’ or ‘adaptive’
    # schedules. The default value is 0.0 as eta0 is not used by the default
    # schedule ‘optimal’. Values must be in the range (0.0, inf).
    eta0=0.0,
    # --------------------------------------------------------------------------
    # The exponent for inverse scaling learning rate
    power_t=0.5,
    # --------------------------------------------------------------------------
    # Whether to use early stopping to terminate training when validation score
    # is not improving.
    early_stopping=False,
    # --------------------------------------------------------------------------
    # The proportion of training data to set aside as validation set for early
    # stopping. Must be between 0 and 1. Only used if early_stopping is True.
    validation_fraction=0.1,
    # --------------------------------------------------------------------------
    # Number of iterations with no improvement to wait before stopping fitting.
    n_iter_no_change=5,
    # ---------------------------------------------------------------------
    # Weights associated with classes in the form {class_label: weight}
    class_weight=None,
    # --------------------------------------------------------------------------
    # When set to True, reuse the solution of the previous call to fit as
    # initialization, otherwise, just erase the previous solution.
    warm_start=False,
    # --------------------------------------------------------------------------
    # When set to True, computes the averaged SGD weights across all updates
    # and stores the result in the coef_ attribute. If set to an int greater
    # than 1, averaging will begin once the total number of samples seen
    # reaches average. So average=10 will begin averaging after seeing 10
    # samples.
    average=False,
)

sgdClassifier.fit(X, y)

sgdClassifier.predict(X)
[2]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
[3]:
sgdClassifier.coef_
[3]:
array([[ 24.96547041,  -4.74132087],
       [-28.86896993, -31.3886997 ],
       [  6.17853975,  43.28232526]])
[4]:
sgdClassifier.intercept_
[4]:
array([-118.8654999 ,   10.        ,    4.92455655])
[5]:
from sklearn.metrics import confusion_matrix

y_pred = sgdClassifier.predict(X)

confusion_matrix(
    y_true=y,
    y_pred=y_pred,
)
[5]:
array([[50,  0,  0],
       [ 0, 50,  0],
       [ 0,  0, 50]])
  • To plot the decision boundaries, note that the boundary between two classes occurs where f(x) = 0, so that:

    w_0 + w_1x_1 + w_2x_2 = 0

    Solving for x_2:

    x_2 = - \left( \frac{w_0}{w_2} + \frac{w_1}{w_2} x_1 \right)

[6]:
import numpy as np

plt.figure(figsize=(3.5, 3.5))
plot_scatter()

xlim = plt.gca().get_xlim()
ylim = plt.gca().get_ylim()

xplot = np.linspace(start=xlim[0], stop=xlim[1], num=20)

fronteraA = (
    -sgdClassifier.intercept_[0] / sgdClassifier.coef_[0][1]
    - sgdClassifier.coef_[0][0] / sgdClassifier.coef_[0][1] * xplot
)

fronteraB = (
    -sgdClassifier.intercept_[1] / sgdClassifier.coef_[1][1]
    - sgdClassifier.coef_[1][0] / sgdClassifier.coef_[1][1] * xplot
)

fronteraC = (
    -sgdClassifier.intercept_[2] / sgdClassifier.coef_[2][1]
    - sgdClassifier.coef_[2][0] / sgdClassifier.coef_[2][1] * xplot
)


plt.plot(xplot, fronteraA, color="tab:red", linewidth=3)
plt.plot(xplot, fronteraB, color="tab:blue", linewidth=3)
plt.plot(xplot, fronteraC, color="tab:green", linewidth=3)

plt.xlim(xlim)
plt.ylim(ylim)

plt.show()
../_images/31_gradiente_descendente_estocastico_02_SGDClassifier_14_0.png