The diabetes dataset (1:02 min)

1:02 min | Last modified: September 28, 2021 | YouTube

https://scikit-learn.org/stable/datasets/toy_dataset.html

https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

The goal of this problem is to forecast the progression of diabetes one year ahead, using a sample of 442 patients for whom the following 10 variables were measured:

* age age in years

* sex

* bmi body mass index

* bp average blood pressure

* s1 tc, total serum cholesterol

* s2 ldl, low-density lipoproteins

* s3 hdl, high-density lipoproteins

* s4 tch, total cholesterol / HDL

* s5 ltg, possibly log of serum triglycerides level

* s6 glu, blood sugar level

https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

[1]:

from sklearn.datasets import load_diabetes

[2]:

bunch = load_diabetes(
    # -----------------------------------------------------
    # If True, returns (data, target) instead of a Bunch
    # object.
    return_X_y=False,
)

bunch.keys()

[2]:

dict_keys(['data', 'target', 'DESCR', 'feature_names', 'data_filename', 'target_filename'])
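Besides the arrays, the Bunch carries a plain-text description of the dataset in `DESCR`. The following is a minimal sketch of how one might inspect it together with the array shapes; the 500-character cutoff and the variable name `description` are arbitrary choices for illustration, and the exact wording of the description may differ between scikit-learn versions.

```python
from sklearn.datasets import load_diabetes

bunch = load_diabetes()

# DESCR is a plain-text description shipped with the dataset.
description = bunch.DESCR
print(description[:500])  # only the beginning, to keep the output short

# data and target are NumPy arrays with one row per patient.
print(bunch.data.shape, bunch.target.shape)
```

The shapes should agree with the 442 patients and 10 variables mentioned in the introduction.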

[3]:

bunch.target[:5]

[3]:

array([151.,  75., 141., 206., 135.])

[4]:

bunch.feature_names

[4]:

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

[5]:

X, y = load_diabetes(
    # -----------------------------------------------------
    # If True, returns (data, target) instead of a Bunch
    # object.
    return_X_y=True,
)

display(
    X[:5, :],
    y[:5],
)

array([[ 0.03807591,  0.05068012,  0.06169621,  0.02187235, -0.0442235 ,
        -0.03482076, -0.04340085, -0.00259226,  0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, -0.02632783, -0.00844872,
        -0.01916334,  0.07441156, -0.03949338, -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, -0.00567061, -0.04559945,
        -0.03419447, -0.03235593, -0.00259226,  0.00286377, -0.02593034],
       [-0.08906294, -0.04464164, -0.01159501, -0.03665645,  0.01219057,
         0.02499059, -0.03603757,  0.03430886,  0.02269202, -0.00936191],
       [ 0.00538306, -0.04464164, -0.03638469,  0.02187235,  0.00393485,
         0.01559614,  0.00814208, -0.00259226, -0.03199144, -0.04664087]])

array([151.,  75., 141., 206., 135.])
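The feature values printed above are small and centered around zero because, according to the scikit-learn documentation, each column has been mean centered and scaled so that its sum of squares equals 1. The target is left in its original units. A quick way to check this, as a sketch (the names `column_means` and `column_sum_sq` are introduced here only for illustration):

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

# Mean of each feature column: expected to be close to 0.
column_means = X.mean(axis=0)

# Sum of squares of each feature column: expected to be close to 1.
column_sum_sq = (X ** 2).sum(axis=0)

print(np.round(column_means, 6))
print(np.round(column_sum_sq, 6))

# The target is not rescaled.
print(y.min(), y.max())
```

Recent scikit-learn releases also appear to accept `scaled=False` in `load_diabetes` to return the raw feature values; check the documentation of the installed version before relying on it.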

[6]:

#
# Load into a pandas DataFrame
#
import pandas as pd

diabetes = pd.DataFrame(
    bunch.data,
    columns=bunch.feature_names,
)

diabetes["target"] = bunch.target

diabetes.head()

[6]:

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6	target
0	0.038076	0.050680	0.061696	0.021872	-0.044223	-0.034821	-0.043401	-0.002592	0.019908	-0.017646	151.0
1	-0.001882	-0.044642	-0.051474	-0.026328	-0.008449	-0.019163	0.074412	-0.039493	-0.068330	-0.092204	75.0
2	0.085299	0.050680	0.044451	-0.005671	-0.045599	-0.034194	-0.032356	-0.002592	0.002864	-0.025930	141.0
3	-0.089063	-0.044642	-0.011595	-0.036656	0.012191	0.024991	-0.036038	0.034309	0.022692	-0.009362	206.0
4	0.005383	-0.044642	-0.036385	0.021872	0.003935	0.015596	0.008142	-0.002592	-0.031991	-0.046641	135.0
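As an alternative to building the DataFrame by hand, `load_diabetes` can return pandas objects directly through the `as_frame` parameter. This is a sketch that assumes a scikit-learn version that supports `as_frame` (0.23 or later) and that pandas is installed.

```python
from sklearn.datasets import load_diabetes

# With as_frame=True the Bunch exposes a `frame` attribute:
# a DataFrame holding the 10 features plus the target column.
bunch = load_diabetes(as_frame=True)
diabetes = bunch.frame

print(diabetes.shape)
print(diabetes.columns.tolist())
```

The resulting frame should have the same layout as the manually assembled one: the ten feature columns followed by `target`.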

[7]:

#
# Load from a repository
# https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
#
pd.read_csv(
    "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/diabetes.csv"
).head()

[7]:

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6	Y
0	0.038076	0.050680	0.061696	0.021872	-0.044223	-0.034821	-0.043401	-0.002592	0.019908	-0.017646	151.0
1	-0.001882	-0.044642	-0.051474	-0.026328	-0.008449	-0.019163	0.074412	-0.039493	-0.068330	-0.092204	75.0
2	0.085299	0.050680	0.044451	-0.005671	-0.045599	-0.034194	-0.032356	-0.002592	0.002864	-0.025930	141.0
3	-0.089063	-0.044642	-0.011595	-0.036656	0.012191	0.024991	-0.036038	0.034309	0.022692	-0.009362	206.0
4	0.005383	-0.044642	-0.036385	0.021872	0.003935	0.015596	0.008142	-0.002592	-0.031991	-0.046641	135.0