MiniBatchKMeans

This is a variant of the KMeans algorithm that uses mini-batches of data to reduce the algorithm's computation time. It works as follows:

  • The cluster centers are initialized.

  • b samples are drawn at random from the dataset to form a mini-batch.

  • The cluster centers are updated as the samples in the batch are assigned to them (an online mechanism); a minimal sketch of these steps follows the list.
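As a rough illustration of these three steps (not scikit-learn's implementation), the sketch below draws random mini-batches from toy data and moves each center toward its assigned samples using a per-center learning rate of 1/count, the classic mini-batch k-means update. The toy data, batch size, and number of iterations are arbitrary choices made for the example.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 2))          # toy data, only for this sketch

# Step 1: initialize the centers, here by picking 3 random samples.
centers = data[rng.choice(len(data), size=3, replace=False)].copy()
counts = np.zeros(3)                      # samples seen by each center so far

for _ in range(50):
    # Step 2: draw b samples (here b=30) to form a mini-batch.
    batch = data[rng.choice(len(data), size=30, replace=False)]
    # Assign each sample in the batch to its nearest center.
    dists = np.linalg.norm(batch[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: online update -- move each center toward its assigned samples
    # with a learning rate of 1 / (samples seen by that center).
    for x, j in zip(batch, labels):
        counts[j] += 1
        eta = 1.0 / counts[j]
        centers[j] = (1.0 - eta) * centers[j] + eta * x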

[1]:
from sklearn.datasets import make_blobs

X, y = make_blobs(
    n_samples=90,
    n_features=2,
    centers=[
        [8, -8],
        [7, 8],
        [-6, -1],
    ],
    cluster_std=2.0,
    shuffle=False,
    random_state=5,
)
[2]:
from sklearn.cluster import MiniBatchKMeans

batchKMeans = MiniBatchKMeans(
    # -------------------------------------------------------------------------
    # The number of clusters
    n_clusters=3,
    # -------------------------------------------------------------------------
    # Method for initialization
    init='k-means++',
    # -------------------------------------------------------------------------
    # Maximum number of iterations over the complete dataset before stopping
    # independently of any early stopping criterion heuristics.
    max_iter=100,
    # -------------------------------------------------------------------------
    # Size of the mini batches.
    batch_size=1024,
    # -------------------------------------------------------------------------
    # Verbosity mode.
    verbose=0,
    # -------------------------------------------------------------------------
    # Compute label assignment and inertia for the complete dataset once the
    # minibatch optimization has converged in fit.
    compute_labels=True,
    # -------------------------------------------------------------------------
    # Determines random number generation for centroid initialization and
    # random reassignment.
    random_state=None,
    # -------------------------------------------------------------------------
    # Control early stopping based on the consecutive number of mini batches
    # that does not yield an improvement on the smoothed inertia.
    max_no_improvement=10,
    # -------------------------------------------------------------------------
    # Number of samples to randomly sample for speeding up the initialization
    init_size=None,
    # -------------------------------------------------------------------------
    # Number of random initializations that are tried.
    n_init=3,
)


batchKMeans.fit(X)

batchKMeans.cluster_centers_
[2]:
array([[ 6.61952762,  8.43531948],
       [ 7.96411572, -7.41700257],
       [-5.72806242, -1.15554271]])
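Like KMeans, the fitted MiniBatchKMeans exposes predict to assign new observations to the nearest learned centroid. The points below are illustrative values placed near each of the blob centers used in make_blobs above; they are not part of the original notebook.

import numpy as np

new_points = np.array([
    [8.0, -8.0],
    [7.0, 8.0],
    [-6.0, -1.0],
])
batchKMeans.predict(new_points)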
[3]:
batchKMeans.labels_
[3]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2], dtype=int32)
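Because the centers are updated online, MiniBatchKMeans also supports incremental training through partial_fit. The sketch below reuses the X generated above and feeds it in chunks of 30 samples, as if the data arrived in a stream; the chunk size and hyperparameters are illustrative choices, not values from the original notebook.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Shuffle the sample order first: the blobs above were generated with
# shuffle=False, so consecutive rows belong to the same cluster.
rng = np.random.default_rng(5)
order = rng.permutation(len(X))

streamKMeans = MiniBatchKMeans(n_clusters=3, random_state=5, n_init=3)
for start in range(0, len(X), 30):
    chunk = X[order[start:start + 30]]
    streamKMeans.partial_fit(chunk)

streamKMeans.cluster_centers_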