The diabetes dataset
1:02 min | Last modified: September 28, 2021 | YouTube
https://scikit-learn.org/stable/datasets/toy_dataset.html
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
The goal of this problem is to forecast the progression of diabetes one year ahead, using a sample of 442 patients for whom the following 10 variables were measured:
* age: age in years
* sex
* bmi: body mass index
* bp: average blood pressure
* s1: tc, total serum cholesterol
* s2: ldl, low-density lipoproteins
* s3: hdl, high-density lipoproteins
* s4: tch, total cholesterol / HDL
* s5: ltg, possibly log of serum triglycerides level
* s6: glu, blood sugar level
https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf
[1]:
from sklearn.datasets import load_diabetes
[2]:
bunch = load_diabetes(
    # -----------------------------------------------------
    # If True, returns (data, target) instead of a Bunch
    # object.
    return_X_y=False,
)
bunch.keys()
[2]:
dict_keys(['data', 'target', 'DESCR', 'feature_names', 'data_filename', 'target_filename'])
[3]:
bunch.target[:5]
[3]:
array([151., 75., 141., 206., 135.])
[4]:
bunch.feature_names
[4]:
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
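The order of `feature_names` matches the column order of `bunch.data`, so a variable can be picked out by name. The following is a minimal sketch of that lookup; the variable names `bmi_index` and `bmi` are chosen here for illustration and are not part of the original notebook.

```python
from sklearn.datasets import load_diabetes

bunch = load_diabetes()

# Position of the "bmi" variable in the list of feature names.
bmi_index = bunch.feature_names.index("bmi")

# Use that position to select the matching column of the data matrix.
bmi = bunch.data[:, bmi_index]

print(bunch.data.shape)  # (number of patients, number of variables)
print(bmi_index)
print(bmi[:5])
```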
[5]:
X, y = load_diabetes(
    # -----------------------------------------------------
    # If True, returns (data, target) instead of a Bunch
    # object.
    return_X_y=True,
)
display(
    X[:5, :],
    y[:5],
)
array([[ 0.03807591, 0.05068012, 0.06169621, 0.02187235, -0.0442235 ,
-0.03482076, -0.04340085, -0.00259226, 0.01990842, -0.01764613],
[-0.00188202, -0.04464164, -0.05147406, -0.02632783, -0.00844872,
-0.01916334, 0.07441156, -0.03949338, -0.06832974, -0.09220405],
[ 0.08529891, 0.05068012, 0.04445121, -0.00567061, -0.04559945,
-0.03419447, -0.03235593, -0.00259226, 0.00286377, -0.02593034],
[-0.08906294, -0.04464164, -0.01159501, -0.03665645, 0.01219057,
0.02499059, -0.03603757, 0.03430886, 0.02269202, -0.00936191],
[ 0.00538306, -0.04464164, -0.03638469, 0.02187235, 0.00393485,
0.01559614, 0.00814208, -0.00259226, -0.03199144, -0.04664087]])
array([151., 75., 141., 206., 135.])
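The feature values above are not in their original units (for example, `age` is clearly not in years). According to the scikit-learn description of this dataset, each feature column is mean-centered and scaled so that its sum of squares equals 1. The sketch below checks this numerically, assuming the default loading behaviour; the names `column_means` and `column_sum_sq` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

# Mean of each of the 10 feature columns (expected to be close to 0).
column_means = X.mean(axis=0)

# Sum of squared values of each column (expected to be close to 1).
column_sum_sq = (X ** 2).sum(axis=0)

print(np.allclose(column_means, 0.0, atol=1e-10))
print(np.allclose(column_sum_sq, 1.0))
```

The target `y` is left on its original scale, which is why its values (151, 75, ...) look so different from the features.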
[6]:
#
# Load into a pandas DataFrame
#
import pandas as pd

diabetes = pd.DataFrame(
    bunch.data,
    columns=bunch.feature_names,
)
diabetes["target"] = bunch.target
diabetes.head()
[6]:
|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068330 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.025930 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 | 135.0 |
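As an alternative to assembling the DataFrame by hand, `load_diabetes` also accepts an `as_frame` argument. This is a sketch that assumes scikit-learn 0.23 or later (where `as_frame` should be available) and an installed pandas; it is not the approach used in the original notebook.

```python
from sklearn.datasets import load_diabetes

# as_frame=True asks scikit-learn to return pandas objects.
bunch = load_diabetes(as_frame=True)

# bunch.frame holds the 10 feature columns plus a "target" column.
diabetes = bunch.frame

print(diabetes.shape)
print(list(diabetes.columns))
```

With this option, `bunch.data` is a DataFrame and `bunch.target` is a Series, so no manual column naming is needed.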
[7]:
#
# Load from a repository (CSV copy of the dataset)
#
pd.read_csv(
    "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/diabetes.csv"
).head()
[7]:
|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068330 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.025930 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 | 135.0 |
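To connect the loaded data to the forecasting goal stated at the start, here is a minimal sketch of a baseline model. The choice of `LinearRegression`, the 75/25 split, and `random_state=0` are assumptions made for illustration only; they do not come from the original notebook.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Hold out 25% of the patients to evaluate the fit (illustrative choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Ordinary least squares on the 10 features.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2 on the held-out patients; the exact value depends on the split.
print(model.score(X_test, y_test))
```

When working from the CSV copy instead, note that the target column is named `Y` rather than `target`.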