Predicting medical expenses using linear regression#

Problem description#

An insurance company wants to forecast the medical expenses of its insured population so that it can collect a correspondingly higher amount in premiums and earn a profit. These costs are difficult to forecast because the most expensive conditions are rare and appear random, and because certain conditions are more likely in specific segments of the population (for example, heart attacks in obese people and cancer in smokers).

The objective is to use a dataset of 1,338 hypothetical medical expense records for US patients to estimate costs for specific segments of the population. The recorded information is the following:

  • age: integer, up to 64.

  • sex: male, female.

  • bmi: body mass index.

  • children: integer indicating the number of children/dependents covered by the health plan.

  • smoker: yes, no.

  • region: northeast, southeast, southwest, northwest.

  • charges: medical costs.

Reading the data#

[1]:
import pandas as pd

#
# Read the file and check the result
#
df = pd.read_csv(
    "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/insurance.csv",
    sep=",",
    thousands=None,
    decimal=".",
)

df.head()
[1]:
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520
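
Before specifying the model it is worth confirming the column types and checking for missing values. A minimal sanity check on the frame just loaded:

#
# Column dtypes, non-null counts, and missing values per column
#
df.info()
df.isna().sum()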

Model specification#

[2]:
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder

#
# One-hot encode the categorical (object) columns; the remaining numeric
# columns are passed through unchanged
#
columnTransformer = ColumnTransformer(
    transformers=[
        (
            "ohe",
            OneHotEncoder(drop="if_binary"),
            make_column_selector(dtype_include=object),
        ),
    ],
    remainder="passthrough",
)
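
The transformer one-hot encodes every object-typed column (sex, smoker, region), dropping one level only for binary columns, and passes the numeric columns through unchanged. A quick illustrative check of the encoded matrix (the fit here is for inspection only, since the pipeline refits the transformer later; get_feature_names_out assumes scikit-learn 1.0 or newer):

#
# Fit the transformer on the predictors just to inspect its output
#
X_encoded = columnTransformer.fit_transform(df.drop("charges", axis="columns"))
print(columnTransformer.get_feature_names_out())
print(X_encoded.shape)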
[3]:
from sklearn.feature_selection import SelectKBest, f_regression

#
# Univariate feature selection: each feature is scored with the F statistic
# of a univariate regression against the target; k is tuned below
#
selectKBest = SelectKBest(score_func=f_regression)
[4]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

#
# Modeling pipeline: one-hot encoding -> feature selection -> linear regression
#
pipeline = Pipeline(
    [
        ("columnTransformer", columnTransformer),
        ("selectKBest", selectKBest),
        ("linearRegression", LinearRegression()),
    ]
)
[5]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "selectKBest__k": [1, 2, 3, 4, 5, 6, 7, 8],
}

#
# Exhaustive search over k with 10-fold cross-validation; refit=True
# refits the best pipeline on the whole training set
#
gridSearchCV = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=10,
    scoring="neg_mean_squared_error",
    refit=True,
)

Model estimation#

[6]:
#
# Separate the predictors (X) from the target variable (y)
#
X = df.copy().drop("charges", axis="columns")
y = df.copy().pop("charges")
[7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
[8]:
gridSearchCV.fit(X_train, y_train)
[8]:
GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('columnTransformer',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='if_binary'),
                                                                         <sklearn.compose._column_transformer.make_column_selector object at 0x7f535ac670d0>)])),
                                       ('selectKBest',
                                        SelectKBest(score_func=<function f_regression at 0x7f532c161ca0>)),
                                       ('linearRegression',
                                        LinearRegression())]),
             param_grid={'selectKBest__k': [1, 2, 3, 4, 5, 6, 7, 8]},
             scoring='neg_mean_squared_error')
[9]:
gridSearchCV.best_params_
[9]:
{'selectKBest__k': 4}
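
With k=4 selected, it is useful to see which encoded features the refitted pipeline kept and to put the cross-validation error back on the original dollar scale. A sketch using the step names defined above (get_feature_names_out assumes scikit-learn 1.0 or newer):

import numpy as np

best_pipeline = gridSearchCV.best_estimator_

#
# Names of the encoded columns and the boolean mask of the k retained features
#
feature_names = best_pipeline.named_steps["columnTransformer"].get_feature_names_out()
mask = best_pipeline.named_steps["selectKBest"].get_support()
print(feature_names[mask])

#
# best_score_ is a negated MSE; flip the sign and take the square root
# to obtain the cross-validated RMSE in dollars
#
print(np.sqrt(-gridSearchCV.best_score_))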
[10]:
y_train_pred = gridSearchCV.predict(X_train)
y_test_pred = gridSearchCV.predict(X_test)

Error evaluation#

[11]:
from sklearn.metrics import mean_squared_error, r2_score

print(f"Mean squared error: {mean_squared_error(y_train, y_train_pred):.2f}")
print(f"Variance score: {r2_score(y_train, y_train_pred):.2f}")
Mean squared error: 37396262.08
Variance score: 0.74
[12]:
print(f"Mean squared error: {mean_squared_error(y_test, y_test_pred):.2f}")
print(f"Variance score: {r2_score(y_test, y_test_pred):.2f}")
Mean squared error: 35299792.10
Variance score: 0.76
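
The MSE is expressed in squared dollars, which is hard to interpret; its square root (about USD 6,100 on the training set and USD 5,900 on the test set) gives the typical prediction error per patient:

import numpy as np

print(f"Train RMSE: {np.sqrt(mean_squared_error(y_train, y_train_pred)):.2f}")
print(f"Test RMSE:  {np.sqrt(mean_squared_error(y_test, y_test_pred)):.2f}")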
[13]:
import matplotlib.pyplot as plt
import numpy as np

def plot_forecasts(y_true, y_pred):
    plt.figure(figsize=(3.5, 3.5))
    plt.scatter(y_true, y_pred, color="tab:blue", alpha=0.8, edgecolors="white")
    plt.xlabel("y true")
    plt.ylabel("y pred")
    x_ = np.linspace(*plt.gca().get_xlim())
    plt.gca().plot(x_, x_, "k-")
    plt.gca().spines["left"].set_color("gray")
    plt.gca().spines["bottom"].set_color("gray")
    plt.gca().spines["top"].set_visible(False)
    plt.gca().spines["right"].set_visible(False)
    plt.show()

plot_forecasts(y_train, y_train_pred)
[Figure: predicted vs. observed charges for the training set, with the 45-degree reference line]
[14]:
plot_forecasts(y_test, y_test_pred)
[Figure: predicted vs. observed charges for the test set, with the 45-degree reference line]
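
Finally, since the goal is to estimate costs for specific segments of the population, the fitted search can be queried with new records. A hypothetical example (the patient below is illustrative and not taken from the dataset):

import pandas as pd

#
# Hypothetical patient: 45-year-old male smoker from the southeast,
# bmi of 31 and two covered dependents
#
new_patient = pd.DataFrame(
    [
        {
            "age": 45,
            "sex": "male",
            "bmi": 31.0,
            "children": 2,
            "smoker": "yes",
            "region": "southeast",
        }
    ]
)

gridSearchCV.predict(new_patient)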