Identifying risky loans using decision trees#
- 10:54 min | Last modified: April 14, 2021 | YouTube
The previous tutorial covered the fundamentals of classification trees. This tutorial applies them to identifying potentially risky loans at a lending institution.
Problem description#
Financial institutions want to improve their loan approval procedures in order to reduce the risk of non-payment, which results in losses for the institution. The practical problem is deciding whether to approve a given loan based on information that can be easily collected by phone or over the web.
The sample contains 1000 observations. Each record has 20 attributes describing both the loan and the applicant's financial health. The data was collected by a German firm and can be downloaded from https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data).
The attributes and their values are as follows:
Attribute 1:  (qualitative)
         Status of existing checking account
         A11 :      ... <    0 DM
         A12 : 0 <= ... <  200 DM
         A13 :      ... >= 200 DM /
               salary assignments for at least 1 year
         A14 : no checking account
Attribute 2:  (numerical)
         Duration in month
Attribute 3:  (qualitative)
         Credit history
         A30 : no credits taken/
               all credits paid back duly
         A31 : all credits at this bank paid back duly
         A32 : existing credits paid back duly till now
         A33 : delay in paying off in the past
         A34 : critical account/
               other credits existing (not at this bank)
Attribute 4:  (qualitative)
         Purpose
         A40 : car (new)
         A41 : car (used)
         A42 : furniture/equipment
         A43 : radio/television
         A44 : domestic appliances
         A45 : repairs
         A46 : education
         A47 : (vacation - does not exist?)
         A48 : retraining
         A49 : business
         A410 : others
Attribute 5:  (numerical)
         Credit amount
Attribute 6:  (qualitative)
         Savings account/bonds
         A61 :          ... <  100 DM
         A62 :   100 <= ... <  500 DM
         A63 :   500 <= ... < 1000 DM
         A64 :          .. >= 1000 DM
         A65 :   unknown/ no savings account
Attribute 7:  (qualitative)
         Present employment since
         A71 : unemployed
         A72 :       ... < 1 year
         A73 : 1  <= ... < 4 years
         A74 : 4  <= ... < 7 years
         A75 :       .. >= 7 years
Attribute 8:  (numerical)
         Installment rate in percentage of disposable income
Attribute 9:  (qualitative)
         Personal status and sex
         A91 : male   : divorced/separated
         A92 : female : divorced/separated/married
         A93 : male   : single
         A94 : male   : married/widowed
         A95 : female : single
Attribute 10: (qualitative)
         Other debtors / guarantors
         A101 : none
         A102 : co-applicant
         A103 : guarantor
Attribute 11: (numerical)
         Present residence since
Attribute 12: (qualitative)
         Property
         A121 : real estate
         A122 : if not A121 : building society savings agreement/
                  life insurance
         A123 : if not A121/A122 : car or other, not in attribute 6
         A124 : unknown / no property
Attribute 13: (numerical)
         Age in years
Attribute 14: (qualitative)
         Other installment plans
         A141 : bank
         A142 : stores
         A143 : none
Attribute 15: (qualitative)
         Housing
         A151 : rent
         A152 : own
         A153 : for free
Attribute 16: (numerical)
         Number of existing credits at this bank
Attribute 17: (qualitative)
         Job
         A171 : unemployed/ unskilled  - non-resident
         A172 : unskilled - resident
         A173 : skilled employee / official
         A174 : management/ self-employed/
                highly qualified employee/ officer
Attribute 18: (numerical)
         Number of people being liable to provide maintenance for
Attribute 19: (qualitative)
         Telephone
         A191 : none
         A192 : yes, registered under the customers name
Attribute 20: (qualitative)
         foreign worker
         A201 : yes
         A202 : no
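The codes listed above (A11, A12, ...) are the ones used in the raw UCI file; the CSV loaded in the next section already contains readable labels. As a minimal sketch of how such codes could be decoded by hand, the mapping below covers Attribute 1 only. The dictionary name `CHECKING_ACCOUNT_LABELS` and the sample values are illustrative assumptions, not part of the original tutorial.

```python
import pandas as pd

# Hypothetical lookup table built from the codebook entry for Attribute 1.
CHECKING_ACCOUNT_LABELS = {
    "A11": "< 0 DM",
    "A12": "0 <= ... < 200 DM",
    "A13": ">= 200 DM / salary assignments for at least 1 year",
    "A14": "no checking account",
}

# Illustrative sample of raw codes (not taken from the dataset).
raw_codes = pd.Series(["A11", "A14", "A12", "A13"])

# Series.map replaces each code with its label; unmapped codes become NaN.
decoded = raw_codes.map(CHECKING_ACCOUNT_LABELS)
print(decoded.tolist())
```

The same pattern would extend to the other qualitative attributes with one dictionary per attribute.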
Loading the data#
[1]:
import pandas as pd
df = pd.read_csv(
    "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/credit.csv",
    sep=",",             # field separator
    thousands=None,      # thousands separator for numbers
    decimal=".",         # decimal separator for numbers
    encoding="latin-1",  # character encoding of the file
)
#
# Check that the data was read correctly
#
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   checking_balance      1000 non-null   object
 1   months_loan_duration  1000 non-null   int64
 2   credit_history        1000 non-null   object
 3   purpose               1000 non-null   object
 4   amount                1000 non-null   int64
 5   savings_balance       1000 non-null   object
 6   employment_length     1000 non-null   object
 7   installment_rate      1000 non-null   int64
 8   personal_status       1000 non-null   object
 9   other_debtors         1000 non-null   object
 10  residence_history     1000 non-null   int64
 11  property              1000 non-null   object
 12  age                   1000 non-null   int64
 13  installment_plan      1000 non-null   object
 14  housing               1000 non-null   object
 15  existing_credits      1000 non-null   int64
 16  default               1000 non-null   int64
 17  dependents            1000 non-null   int64
 18  telephone             1000 non-null   object
 19  foreign_worker        1000 non-null   object
 20  job                   1000 non-null   object
dtypes: int64(8), object(13)
memory usage: 164.2+ KB
[2]:
#
# First rows of the file
#
df.head()
[2]:
| | checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_length | installment_rate | personal_status | other_debtors | ... | property | age | installment_plan | housing | existing_credits | default | dependents | telephone | foreign_worker | job |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | < 0 DM | 6 | critical | radio/tv | 1169 | unknown | > 7 yrs | 4 | single male | none | ... | real estate | 67 | none | own | 2 | 1 | 1 | yes | yes | skilled employee |
| 1 | 1 - 200 DM | 48 | repaid | radio/tv | 5951 | < 100 DM | 1 - 4 yrs | 2 | female | none | ... | real estate | 22 | none | own | 1 | 2 | 1 | none | yes | skilled employee |
| 2 | unknown | 12 | critical | education | 2096 | < 100 DM | 4 - 7 yrs | 2 | single male | none | ... | real estate | 49 | none | own | 1 | 1 | 2 | none | yes | unskilled resident |
| 3 | < 0 DM | 42 | repaid | furniture | 7882 | < 100 DM | 4 - 7 yrs | 2 | single male | guarantor | ... | building society savings | 45 | none | for free | 1 | 1 | 2 | none | yes | skilled employee |
| 4 | < 0 DM | 24 | delayed | car (new) | 4870 | < 100 DM | 1 - 4 yrs | 3 | single male | none | ... | unknown/none | 53 | none | for free | 2 | 2 | 2 | none | yes | skilled employee |
5 rows × 21 columns
[3]:
#
# Check the data types of the columns
#
df.dtypes
[3]:
checking_balance        object
months_loan_duration     int64
credit_history          object
purpose                 object
amount                   int64
savings_balance         object
employment_length       object
installment_rate         int64
personal_status         object
other_debtors           object
residence_history        int64
property                object
age                      int64
installment_plan        object
housing                 object
existing_credits         int64
default                  int64
dependents               int64
telephone               object
foreign_worker          object
job                     object
dtype: object
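The columns reported as `object` are the categorical attributes that will need encoding before training. Rather than listing them by hand, they can be picked out by type. The following is a minimal sketch on a small stand-in frame; the three columns and their values are illustrative assumptions, and on the real data the same call would be made on `df`.

```python
import pandas as pd

# Small stand-in for the credit data; values are illustrative only.
toy = pd.DataFrame(
    {
        "checking_balance": ["< 0 DM", "unknown", "1 - 200 DM"],
        "amount": [1169, 2096, 5951],
        "housing": ["own", "own", "for free"],
    }
)

# Everything that is not numeric is treated as categorical here.
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()
numeric_columns = toy.select_dtypes(include="number").columns.tolist()

print(categorical_columns)
print(numeric_columns)
```

Selecting by type keeps the list of columns to encode in sync with the data if the file changes.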
Exploratory analysis#
[4]:
#
# Some columns are numeric and the rest are
# categorical (factors).
# DM stands for Deutsche Mark.
# Some values are checked against the codebook.
#
df.checking_balance.value_counts()
[4]:
unknown       394
< 0 DM        274
1 - 200 DM    269
> 200 DM       63
Name: checking_balance, dtype: int64
[5]:
df.savings_balance.value_counts()
[5]:
< 100 DM         603
unknown          183
101 - 500 DM     103
501 - 1000 DM     63
> 1000 DM         48
Name: savings_balance, dtype: int64
[6]:
#
# Loan amounts range from 250 DM to 18,424 DM
#
df.amount.describe()
[6]:
count     1000.000000
mean      3271.258000
std       2822.736876
min        250.000000
25%       1365.500000
50%       2319.500000
75%       3972.250000
max      18424.000000
Name: amount, dtype: float64
[7]:
#
# Loan durations range from 4 to 72 months
#
df.months_loan_duration.describe()
[7]:
count    1000.000000
mean       20.903000
std        12.058814
min         4.000000
25%        12.000000
50%        18.000000
75%        24.000000
max        72.000000
Name: months_loan_duration, dtype: float64
[8]:
#
# The default column indicates whether the loan
# was repaid (1 = repaid, 2 = not repaid).
# This is the column to be predicted.
#
df.default.value_counts()
[8]:
1    700
2    300
Name: default, dtype: int64
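The classes are imbalanced, which matters when judging a classifier later on. As a rough sketch, the snippet below rebuilds the target from the counts printed above and computes the class proportions, along with the accuracy a trivial model would reach by always predicting the majority class. The variable names are illustrative; on the real data the same calls would be made on `df.default`.

```python
import pandas as pd

# Rebuild the target column from the counts printed above (700 ones, 300 twos).
default = pd.Series([1] * 700 + [2] * 300, name="default")

# normalize=True returns relative frequencies instead of raw counts.
proportions = default.value_counts(normalize=True)
print(proportions)

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = proportions.max()
print(majority_baseline)
```

Any model trained on this data should be compared against that 70% figure rather than against 50%.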
Preprocessing#
[9]:
from sklearn.preprocessing import LabelEncoder
#
# Build an encoder that converts strings
# to integers (similar to factors in R)
#
enc = LabelEncoder()
#
# Apply the encoder to the categorical
# columns of the dataset
#
columns = [
    "checking_balance",
    "credit_history",
    "purpose",
    "savings_balance",
    "employment_length",
    "personal_status",
    "other_debtors",
    "property",
    "installment_plan",
    "housing",
    "telephone",
    "foreign_worker",
    "job",
]
for column in columns:
    df[column] = enc.fit_transform(df[column])
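`LabelEncoder` numbers the labels in sorted order, so the resulting integers do not follow the natural order of the categories. Since a scikit-learn decision tree splits on thresholds of these integers, the numbering can affect which splits are available. The sketch below shows the codes produced for the four `checking_balance` labels, and an alternative with an explicit order using `OrdinalEncoder`. The chosen order (lowest to highest balance, `unknown` last) is an assumption made for illustration, not part of the original tutorial.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

labels = np.array(
    ["unknown", "< 0 DM", "1 - 200 DM", "> 200 DM"], dtype=object
)

# LabelEncoder sorts the distinct labels and numbers them 0..n-1.
label_enc = LabelEncoder()
codes = label_enc.fit_transform(labels)
print(list(label_enc.classes_))
print(codes.tolist())

# Assumed ordering for illustration: from lowest balance to highest,
# with "unknown" placed last.
balance_order = ["< 0 DM", "1 - 200 DM", "> 200 DM", "unknown"]
ordinal_enc = OrdinalEncoder(categories=[balance_order])

# OrdinalEncoder expects a 2D array: one column per feature.
ordered_codes = ordinal_enc.fit_transform(labels.reshape(-1, 1)).ravel()
print(ordered_codes.tolist())
```

With the sorted numbering, `1 - 200 DM` receives a smaller code than `< 0 DM`; the explicit order avoids that.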
Model training#
[10]:
#
# Use the first 90% of the rows for training
# and the remaining 10% for testing
#
train_sample = list(range(900))
test_sample = list(range(900, 1000))
#
# Split into training and test sets
#
X_train = df.iloc[train_sample, :].copy()
X_test = df.iloc[test_sample, :].copy()
#
# Drop the default column, which is
# the output variable
#
X_train.drop("default", axis=1, inplace=True)
X_test.drop("default", axis=1, inplace=True)
#
# Build the dependent variable
#
y_train_true = df.default[train_sample]
y_test_true = df.default[test_sample]
#
# Classification tree
#
from sklearn.tree import DecisionTreeClassifier
#
# Create the tree
#
clf = DecisionTreeClassifier()
#
# Fit the tree on the training data
#
clf.fit(X_train, y_train_true)
#
# Predict for the test sample
#
y_test_pred = clf.predict(X_test)
#
# Performance metrics
#
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test_true, y_test_pred)
[10]:
array([[49, 19],
       [19, 13]])
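To read the matrix: scikit-learn places the true classes on the rows and the predicted classes on the columns, with labels in sorted order (1, then 2). The sketch below derives a few summary figures from the numbers printed above; the variable names are illustrative. Because the tree is created without a fixed `random_state`, a rerun may give somewhat different counts.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted
# class, in label order [1, 2] (1 = repaid, 2 = not repaid).
cm = np.array([[49, 19],
               [19, 13]])

total = cm.sum()
accuracy = np.trace(cm) / total  # correct predictions / all cases

# Class 2 (loans not repaid) is the one the lender cares most about.
recall_default = cm[1, 1] / cm[1, :].sum()     # found among actual class 2
precision_default = cm[1, 1] / cm[:, 1].sum()  # correct among predicted class 2

# Accuracy of always predicting class 1 on this test sample.
baseline = cm[0, :].sum() / total

print(f"accuracy={accuracy:.2f}, baseline={baseline:.2f}")
print(f"recall={recall_default:.3f}, precision={precision_default:.3f}")
```

On these numbers the tree reaches 62% accuracy, below the 68% obtained by always predicting class 1 on this test sample, and it identifies 13 of the 32 loans that were not repaid.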