Introduction to scikit-learn (21:48 min)

Last modified: 2024-01-22 | YouTube

Estimators

In scikit-learn (sklearn), machine learning models and algorithms are called estimators. Every estimator is trained on data with its fit method.

[1]:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)
X = [
    [1, 2, 3],
    [11, 12, 13],
]  # 2 samples, 3 features

y = [0, 1]  # classes of each sample
clf.fit(X, y)

[1]:

RandomForestClassifier(random_state=0)

[2]:

#
# Predict the classes of the training data
#
clf.predict(X)

[2]:

array([0, 1])

[3]:

#
# Predict the classes of new data
#
clf.predict(
    [
        [4, 5, 6],
        [14, 15, 16],
    ]
)

[3]:

array([0, 1])
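
Because every estimator exposes the same fit and predict methods, one model can be swapped for another without changing the surrounding code. The sketch below reuses the toy data from the cells above. The choice of LogisticRegression and KNeighborsClassifier, and the names X_new and predictions, are illustrative assumptions and not part of the original notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Same toy data as above: 2 samples, 3 features
X = [[1, 2, 3], [11, 12, 13]]
y = [0, 1]
X_new = [[4, 5, 6], [14, 15, 16]]

# Illustrative selection of classifiers; all share the fit/predict interface
estimators = [
    RandomForestClassifier(random_state=0),
    LogisticRegression(),
    KNeighborsClassifier(n_neighbors=1),
]

predictions = {}
for est in estimators:
    # fit returns the estimator itself, so fit and predict can be chained
    predictions[type(est).__name__] = est.fit(X, y).predict(X_new)

for name, pred in predictions.items():
    print(name, pred)
```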

Transformers and preprocessors

[4]:

from sklearn.preprocessing import StandardScaler

X = [
    [0, 15],
    [1, -10],
]

# Standardize each column to zero mean and unit variance,
# using the mean and standard deviation computed from X itself
StandardScaler().fit(X).transform(X)

[4]:

array([[-1.,  1.],
       [ 1., -1.]])
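
To make the transformation concrete, the following sketch reproduces the result above by hand with NumPy: each column has its mean subtracted and is divided by its population standard deviation. The variable names mean, std, manual, and scaled are introduced here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0, 15], [1, -10]], dtype=float)

scaler = StandardScaler().fit(X)

# Per-column statistics, the same quantities the scaler learns in fit
mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)

manual = (X - mean) / std
scaled = scaler.transform(X)

print(np.allclose(manual, scaled))
```

The fitted statistics are also available on the scaler as mean_ and scale_, which is what lets the same scaling be applied later to data not seen during fit.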

Pipelines: chaining preprocessors and estimators

[5]:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Create a pipeline object
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(),
)

# Load the iris dataset and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    random_state=0,
)

# Train the pipeline
pipe.fit(X_train, y_train)

# Compute the model's accuracy on the test set
# (accuracy_score expects the true labels first, then the predictions)
accuracy_score(
    y_test,
    pipe.predict(X_test),
)

[5]:

0.9736842105263158
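
For comparison, here is a minimal sketch of what the pipeline does internally, written out step by step. The scaler is fitted on the training set only and then applied unchanged to the test set, which keeps test-set statistics out of training. The names scaler, model, manual_pred, and pipe_pred are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Manual version: fit the scaler on the training data only
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_pred = model.predict(scaler.transform(X_test))

# Pipeline version: the same two steps behind a single fit/predict
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe_pred = pipe.fit(X_train, y_train).predict(X_test)

print(np.array_equal(manual_pred, pipe_pred))
```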

Model evaluation

[6]:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Create a synthetic regression dataset
X, y = make_regression(n_samples=1000, random_state=0)

# Create an instance of the linear regression model
lr = LinearRegression()

# Compute the cross-validated score (R^2, the default for regressors);
# 5-fold CV is used by default
result = cross_validate(lr, X, y)

# Return the R^2 obtained on the held-out fold of each split
result["test_score"]

[6]:

array([1., 1., 1., 1., 1.])
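
The scores of 1 above follow from the data: make_regression generates a noise-free linear target by default (noise=0.0), so a linear model fits it essentially perfectly. The sketch below adds noise and requests two metrics at once through the scoring parameter. The value noise=20.0 and the choice of metrics are arbitrary assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# noise=20.0 is an arbitrary illustrative value; the default is noise=0.0
X, y = make_regression(n_samples=1000, noise=20.0, random_state=0)

result = cross_validate(
    LinearRegression(),
    X,
    y,
    cv=5,
    scoring=["r2", "neg_mean_absolute_error"],
)

# With several metrics, the result keys are named test_<metric>
r2_per_fold = result["test_r2"]
# The error is reported negated so that higher is always better; flip the sign
mae_per_fold = -result["test_neg_mean_absolute_error"]

print(r2_per_fold)
print(mae_per_fold)
```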

Automatic parameter search

[7]:

from scipy.stats import randint
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Load the California housing dataset and split it into
# training and test sets
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Define the parameter search space
param_distributions = {
    "n_estimators": randint(1, 5),
    "max_depth": randint(5, 10),
}


# Create the randomized search object (it cross-validates each
# sampled candidate) and fit it to the data
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    n_iter=5,
    param_distributions=param_distributions,
    random_state=0,
)
search.fit(X_train, y_train)
search.best_params_

[7]:

{'max_depth': 9, 'n_estimators': 4}

[8]:

# The fitted search object behaves like a regular random forest
# estimator with max_depth=9 and n_estimators=4
search.score(X_test, y_test)

[8]:

0.735363411343253
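
Beyond best_params_, the fitted search object keeps the score of every sampled candidate and the refitted best model. The sketch below inspects them. It uses a small synthetic dataset from make_regression instead of California housing, an assumption made here only to keep the example quick and free of downloads, so its numbers are not comparable with the results above.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic dataset (an assumption, not the data used above)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    n_iter=5,
    param_distributions={
        "n_estimators": randint(1, 5),  # integers 1..4, upper bound excluded
        "max_depth": randint(5, 10),  # integers 5..9
    },
    random_state=0,
)
search.fit(X, y)

# One entry per sampled candidate (n_iter=5 candidates in total)
candidates = search.cv_results_["params"]
mean_scores = search.cv_results_["mean_test_score"]
for params, score in zip(candidates, mean_scores):
    print(params, round(score, 3))

# With the default refit=True, the best candidate is retrained on all of X, y
best_model = search.best_estimator_
```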