The diabetes dataset (1:02 min)

  • 1:02 min | Last modified: September 28, 2021 | YouTube

https://scikit-learn.org/stable/datasets/toy_dataset.html

https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

The goal of this problem is to predict the progression of diabetes one year ahead, from a sample of 442 patients for whom the following 10 variables were measured:

* age: age in years

* sex

* bmi: body mass index

* bp: average blood pressure

* s1: tc, total serum cholesterol

* s2: ldl, low-density lipoproteins

* s3: hdl, high-density lipoproteins

* s4: tch, total cholesterol / HDL

* s5: ltg, possibly log of serum triglycerides level

* s6: glu, blood sugar level

https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

[1]:
from sklearn.datasets import load_diabetes
[2]:
bunch = load_diabetes(
    # -----------------------------------------------------
    # If True, returns (data, target) instead of a Bunch
    # object.
    return_X_y=False,
)

bunch.keys()
[2]:
dict_keys(['data', 'target', 'DESCR', 'feature_names', 'data_filename', 'target_filename'])
[3]:
bunch.target[:5]
[3]:
array([151.,  75., 141., 206., 135.])
[4]:
bunch.feature_names
[4]:
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
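The figures quoted in the introduction (442 patients, 10 variables) can be checked against the loaded arrays. The following is a minimal sketch; the variable names `data`, `n_patients`, and `n_features` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_diabetes

# Load the Bunch object and inspect the shape of its arrays.
data = load_diabetes()

# data.data holds one row per patient and one column per variable.
n_patients, n_features = data.data.shape
print(n_patients, n_features)

# data.target holds one progression value per patient.
print(data.target.shape)

# The number of feature names should match the number of columns.
print(len(data.feature_names) == n_features)
```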
[5]:
X, y = load_diabetes(
    # -----------------------------------------------------
    # If True, returns (data, target) instead of a Bunch
    # object.
    return_X_y=True,
)

display(
    X[:5, :],
    y[:5],
)
array([[ 0.03807591,  0.05068012,  0.06169621,  0.02187235, -0.0442235 ,
        -0.03482076, -0.04340085, -0.00259226,  0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, -0.02632783, -0.00844872,
        -0.01916334,  0.07441156, -0.03949338, -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, -0.00567061, -0.04559945,
        -0.03419447, -0.03235593, -0.00259226,  0.00286377, -0.02593034],
       [-0.08906294, -0.04464164, -0.01159501, -0.03665645,  0.01219057,
         0.02499059, -0.03603757,  0.03430886,  0.02269202, -0.00936191],
       [ 0.00538306, -0.04464164, -0.03638469,  0.02187235,  0.00393485,
         0.01559614,  0.00814208, -0.00259226, -0.03199144, -0.04664087]])
array([151.,  75., 141., 206., 135.])
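The feature values above are small numbers centered around zero rather than raw ages or blood pressure readings. This is because scikit-learn distributes the ten features mean-centered and scaled so that the sum of squares of each column is 1; the target is left in its original units. A short sketch to check this, with tolerances chosen here as an assumption:

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

# Column means should be (numerically) zero if the data is mean-centered.
col_means = X.mean(axis=0)

# The sum of squares of each column should be close to 1.
col_sumsq = (X ** 2).sum(axis=0)

# Tolerances are an assumption, kept loose to absorb rounding differences.
means_are_zero = np.allclose(col_means, 0.0, atol=1e-4)
sumsq_are_one = np.allclose(col_sumsq, 1.0, atol=1e-4)

print(means_are_zero, sumsq_are_one)
```

Recent scikit-learn releases (1.1 and later, as far as I know) also accept `scaled=False` in `load_diabetes` to return the raw, unscaled measurements; check the documentation of the installed version before relying on it.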
[6]:
#
# Load into a pandas DataFrame
#
import pandas as pd

diabetes = pd.DataFrame(
    bunch.data,
    columns=bunch.feature_names,
)

diabetes["target"] = bunch.target

diabetes.head()
[6]:
        age       sex       bmi        bp        s1        s2        s3        s4        s5        s6  target
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401 -0.002592  0.019908 -0.017646   151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412 -0.039493 -0.068330 -0.092204    75.0
2  0.085299  0.050680  0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592  0.002864 -0.025930   141.0
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038  0.034309  0.022692 -0.009362   206.0
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142 -0.002592 -0.031991 -0.046641   135.0
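As an alternative to building the DataFrame by hand, `load_diabetes` accepts `as_frame=True` (available in scikit-learn 0.23 and later), in which case the returned Bunch carries a `frame` attribute with the ten features plus a `target` column. A minimal sketch, where the variable name `frame` is illustrative:

```python
from sklearn.datasets import load_diabetes

# With as_frame=True the Bunch exposes a ready-made pandas DataFrame.
frame = load_diabetes(as_frame=True).frame

# Ten feature columns plus the target column.
print(frame.shape)
print(frame.columns.tolist())
```

This avoids the manual step of attaching `bunch.target` as a new column, at the cost of requiring a scikit-learn version that supports the parameter.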
[7]:
#
# Load from a repo (CSV copy of the same data; the target column is named Y)
#
pd.read_csv(
    "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/diabetes.csv"
).head()
[7]:
        age       sex       bmi        bp        s1        s2        s3        s4        s5        s6       Y
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401 -0.002592  0.019908 -0.017646   151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412 -0.039493 -0.068330 -0.092204    75.0
2  0.085299  0.050680  0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592  0.002864 -0.025930   141.0
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038  0.034309  0.022692 -0.009362   206.0
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142 -0.002592 -0.031991 -0.046641   135.0