MiniBatchKMeans
This is a variant of the KMeans algorithm that uses mini-batches of data to reduce the algorithm's computation time. Each iteration proceeds as follows:
The cluster centers are initialized.
b samples are drawn from the dataset to form a mini-batch.
The cluster centers are updated as the samples are assigned to them (online mechanism).
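The steps above can be sketched with plain NumPy. This is a minimal illustration of the mini-batch update rule, not scikit-learn's implementation: each center is moved toward its assigned samples with a per-center learning rate that decays as the center accumulates samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_kmeans(X, k, batch_size=32, n_iters=100):
    # 1. Initialize the cluster centers with k random samples.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)  # per-center sample counts (decaying learning rate)
    for _ in range(n_iters):
        # 2. Draw b samples from the dataset to form a mini-batch.
        batch = X[rng.choice(len(X), size=batch_size)]
        # 3. Assign each sample to its nearest center ...
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # ... and update that center online toward the sample.
        for x, j in zip(batch, labels):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = (1 - eta) * centers[j] + eta * x
    return centers
```

With well-separated blobs, the returned centers land near the true cluster means; the decaying `eta` makes each center the running mean of the samples assigned to it.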
[1]:
from sklearn.datasets import make_blobs
X, y = make_blobs(
    n_samples=90,
    n_features=2,
    centers=[
        [8, -8],
        [7, 8],
        [-6, -1],
    ],
    cluster_std=2.0,
    shuffle=False,
    random_state=5,
)
[2]:
from sklearn.cluster import MiniBatchKMeans
batchKMeans = MiniBatchKMeans(
    # -------------------------------------------------------------------------
    # The number of clusters
    n_clusters=3,
    # -------------------------------------------------------------------------
    # Method for initialization
    init='k-means++',
    # -------------------------------------------------------------------------
    # Maximum number of iterations over the complete dataset before stopping
    # independently of any early stopping criterion heuristics.
    max_iter=100,
    # -------------------------------------------------------------------------
    # Size of the mini batches.
    batch_size=1024,
    # -------------------------------------------------------------------------
    # Verbosity mode.
    verbose=0,
    # -------------------------------------------------------------------------
    # Compute label assignment and inertia for the complete dataset once the
    # minibatch optimization has converged in fit.
    compute_labels=True,
    # -------------------------------------------------------------------------
    # Determines random number generation for centroid initialization and
    # random reassignment.
    random_state=None,
    # -------------------------------------------------------------------------
    # Control early stopping based on the consecutive number of mini batches
    # that does not yield an improvement on the smoothed inertia.
    max_no_improvement=10,
    # -------------------------------------------------------------------------
    # Number of samples to randomly sample for speeding up the initialization
    init_size=None,
    # -------------------------------------------------------------------------
    # Number of random initializations that are tried.
    n_init=3,
)
batchKMeans.fit(X)
batchKMeans.cluster_centers_
[2]:
array([[ 6.61952762, 8.43531948],
[ 7.96411572, -7.41700257],
[-5.72806242, -1.15554271]])
[3]:
batchKMeans.labels_
[3]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2], dtype=int32)
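Once fitted, the estimator can also assign new, unseen points to the nearest learned centroid via scikit-learn's standard predict API. Note that which integer index each cluster receives depends on the (random) centroid initialization, so the specific labels may differ between runs.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Same toy dataset as above.
X, _ = make_blobs(
    n_samples=90,
    n_features=2,
    centers=[[8, -8], [7, 8], [-6, -1]],
    cluster_std=2.0,
    shuffle=False,
    random_state=5,
)

model = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0).fit(X)

# Points placed at the three generating centers; each should map to a
# distinct cluster index.
new_points = [[8.0, -8.0], [7.0, 8.0], [-6.0, -1.0]]
labels = model.predict(new_points)
```

For streaming data, `partial_fit` can be called repeatedly on successive mini-batches instead of `fit`, which is the natural fit for the online update mechanism described at the top of this section.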