Identifying risky loans using decision trees#

  • 10:54 min | Last modified: April 14, 2021 | YouTube

The previous tutorial covered the fundamentals of classification trees. This tutorial shows how to identify potentially risky loans at a lending institution.

Problem description#

Financial institutions want to improve their loan approval procedures in order to reduce the risk of non-payment, which causes losses for the institution. The practical problem is to decide whether or not to approve a particular loan based on information that can be easily collected by phone or on the web.

The sample contains 1000 observations. Each record has 20 attributes that describe both the loan and the applicant's financial health. The data were collected by a German firm and can be downloaded from https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data).

The attributes and their values are the following:

Attribute 1:  (qualitative)
         Status of existing checking account
         A11 :      ... <    0 DM
         A12 : 0 <= ... <  200 DM
         A13 :      ... >= 200 DM /
               salary assignments for at least 1 year
         A14 : no checking account

Attribute 2:  (numerical)
         Duration in month

Attribute 3:  (qualitative)
         Credit history
         A30 : no credits taken/
               all credits paid back duly
         A31 : all credits at this bank paid back duly
         A32 : existing credits paid back duly till now
         A33 : delay in paying off in the past
         A34 : critical account/
               other credits existing (not at this bank)

Attribute 4:  (qualitative)
         Purpose
         A40 : car (new)
         A41 : car (used)
         A42 : furniture/equipment
         A43 : radio/television
         A44 : domestic appliances
         A45 : repairs
         A46 : education
         A47 : (vacation - does not exist?)
         A48 : retraining
         A49 : business
         A410 : others

Attribute 5:  (numerical)
         Credit amount

Attribute 6:  (qualitative)
         Savings account/bonds
         A61 :          ... <  100 DM
         A62 :   100 <= ... <  500 DM
         A63 :   500 <= ... < 1000 DM
         A64 :          .. >= 1000 DM
         A65 :   unknown/ no savings account

Attribute 7:  (qualitative)
         Present employment since
         A71 : unemployed
         A72 :       ... < 1 year
         A73 : 1  <= ... < 4 years
         A74 : 4  <= ... < 7 years
         A75 :       .. >= 7 years

Attribute 8:  (numerical)
         Installment rate in percentage of disposable income

Attribute 9:  (qualitative)
         Personal status and sex
         A91 : male   : divorced/separated
         A92 : female : divorced/separated/married
         A93 : male   : single
         A94 : male   : married/widowed
         A95 : female : single

Attribute 10: (qualitative)
         Other debtors / guarantors
         A101 : none
         A102 : co-applicant
         A103 : guarantor

Attribute 11: (numerical)
         Present residence since

Attribute 12: (qualitative)
         Property
         A121 : real estate
         A122 : if not A121 : building society savings agreement/
                  life insurance
         A123 : if not A121/A122 : car or other, not in attribute 6
         A124 : unknown / no property

Attribute 13: (numerical)
         Age in years

Attribute 14: (qualitative)
         Other installment plans
         A141 : bank
         A142 : stores
         A143 : none

Attribute 15: (qualitative)
         Housing
         A151 : rent
         A152 : own
         A153 : for free

Attribute 16: (numerical)
         Number of existing credits at this bank

Attribute 17: (qualitative)
         Job
         A171 : unemployed/ unskilled  - non-resident
         A172 : unskilled - resident
         A173 : skilled employee / official
         A174 : management/ self-employed/
                highly qualified employee/ officer

Attribute 18: (numerical)
         Number of people being liable to provide maintenance for

Attribute 19: (qualitative)
         Telephone
         A191 : none
         A192 : yes, registered under the customers name

Attribute 20: (qualitative)
         foreign worker
         A201 : yes
         A202 : no
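
The qualitative attributes in the code book above are identified by short codes (A11, A12, ...), so a lookup table is a natural way to turn them into readable labels. The following is a minimal sketch, assuming the codes arrive as a pandas Series; the dictionary copies Attribute 1 from the code book, while the name `CHECKING_STATUS` and the sample values are hypothetical and used only for illustration.

```python
import pandas as pd

# Labels for Attribute 1, copied from the code book above.
CHECKING_STATUS = {
    "A11": "< 0 DM",
    "A12": "0 <= ... < 200 DM",
    "A13": ">= 200 DM / salary assignments for at least 1 year",
    "A14": "no checking account",
}

# Hypothetical sample of raw codes (not taken from the dataset).
raw = pd.Series(["A11", "A14", "A12", "A11"], name="checking_status")

# Series.map replaces each code with its label; codes missing
# from the dictionary would become NaN.
decoded = raw.map(CHECKING_STATUS)
print(decoded.tolist())
```

The same pattern would extend to the other qualitative attributes by adding one dictionary per attribute.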

Loading the data#

[1]:
import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/credit.csv",
    sep=",",         # field separator
    thousands=None,  # thousands separator for numbers
    decimal=".",     # decimal separator for numbers
    encoding="latin-1",  # file encoding
)

#
# Check that the data were read correctly
#
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   checking_balance      1000 non-null   object
 1   months_loan_duration  1000 non-null   int64
 2   credit_history        1000 non-null   object
 3   purpose               1000 non-null   object
 4   amount                1000 non-null   int64
 5   savings_balance       1000 non-null   object
 6   employment_length     1000 non-null   object
 7   installment_rate      1000 non-null   int64
 8   personal_status       1000 non-null   object
 9   other_debtors         1000 non-null   object
 10  residence_history     1000 non-null   int64
 11  property              1000 non-null   object
 12  age                   1000 non-null   int64
 13  installment_plan      1000 non-null   object
 14  housing               1000 non-null   object
 15  existing_credits      1000 non-null   int64
 16  default               1000 non-null   int64
 17  dependents            1000 non-null   int64
 18  telephone             1000 non-null   object
 19  foreign_worker        1000 non-null   object
 20  job                   1000 non-null   object
dtypes: int64(8), object(13)
memory usage: 164.2+ KB
[2]:
#
# File contents
#
df.head()
[2]:
   checking_balance  months_loan_duration  credit_history  purpose    amount  savings_balance  employment_length  installment_rate  personal_status  other_debtors  ...  property                  age  installment_plan  housing   existing_credits  default  dependents  telephone  foreign_worker  job
0  < 0 DM            6                     critical        radio/tv   1169    unknown          > 7 yrs            4                 single male      none           ...  real estate               67   none              own       2                 1        1           yes        yes             skilled employee
1  1 - 200 DM        48                    repaid          radio/tv   5951    < 100 DM         1 - 4 yrs          2                 female           none           ...  real estate               22   none              own       1                 2        1           none       yes             skilled employee
2  unknown           12                    critical        education  2096    < 100 DM         4 - 7 yrs          2                 single male      none           ...  real estate               49   none              own       1                 1        2           none       yes             unskilled resident
3  < 0 DM            42                    repaid          furniture  7882    < 100 DM         4 - 7 yrs          2                 single male      guarantor      ...  building society savings  45   none              for free  1                 1        2           none       yes             skilled employee
4  < 0 DM            24                    delayed         car (new)  4870    < 100 DM         1 - 4 yrs          3                 single male      none           ...  unknown/none              53   none              for free  2                 2        2           none       yes             skilled employee

5 rows × 21 columns

[3]:
#
# Check the data types of the columns
#
df.dtypes
[3]:
checking_balance        object
months_loan_duration     int64
credit_history          object
purpose                 object
amount                   int64
savings_balance         object
employment_length       object
installment_rate         int64
personal_status         object
other_debtors           object
residence_history        int64
property                object
age                      int64
installment_plan        object
housing                 object
existing_credits         int64
default                  int64
dependents               int64
telephone               object
foreign_worker          object
job                     object
dtype: object
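
The non-numeric columns in this listing are the ones that hold text categories, and they can be listed programmatically instead of by hand. A minimal sketch, assuming a small hypothetical frame that mimics three of the columns above (the column names match the dataset, the values are illustrative and not taken from it):

```python
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "checking_balance": ["< 0 DM", "unknown", "1 - 200 DM"],
        "amount": [1000, 2500, 4000],
        "housing": ["own", "rent", "own"],
    }
)

# select_dtypes filters columns by dtype: excluding the numeric ones
# leaves the text columns, and including them gives the rest.
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()
numeric_columns = toy.select_dtypes(include="number").columns.tolist()

print(categorical_columns)
print(numeric_columns)
```

Applied to `df`, the same two calls should separate the 13 text columns from the 8 integer columns reported above.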

Exploratory analysis#

[4]:
#
# Some of the columns are numeric and
# the others are factors (categorical).
# DM stands for Deutsche Mark.
# Some values are checked against the code book.
#
df.checking_balance.value_counts()
[4]:
unknown       394
< 0 DM        274
1 - 200 DM    269
> 200 DM       63
Name: checking_balance, dtype: int64
[5]:
df.savings_balance.value_counts()
[5]:
< 100 DM         603
unknown          183
101 - 500 DM     103
501 - 1000 DM     63
> 1000 DM         48
Name: savings_balance, dtype: int64
[6]:
#
# The loan amount ranges from 250 DM to 18,424 DM
#
df.amount.describe()
[6]:
count     1000.000000
mean      3271.258000
std       2822.736876
min        250.000000
25%       1365.500000
50%       2319.500000
75%       3972.250000
max      18424.000000
Name: amount, dtype: float64
[7]:
#
# The loan duration ranges from 4 to 72 months
#
df.months_loan_duration.describe()
[7]:
count    1000.000000
mean       20.903000
std        12.058814
min         4.000000
25%        12.000000
50%        18.000000
75%        24.000000
max        72.000000
Name: months_loan_duration, dtype: float64
[8]:
#
# The default column indicates whether there were
# problems repaying the loan:
#   1 = the loan was repaid, 2 = the loan was not repaid
# This is the column to be predicted.
#
df.default.value_counts()
[8]:
1    700
2    300
Name: default, dtype: int64
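
Since the two classes are imbalanced, a useful reference point is the accuracy of a trivial model that always predicts the majority class. The short calculation below uses only the counts printed above; the variable names are illustrative.

```python
# Class counts reported by df.default.value_counts() above.
counts = {1: 700, 2: 300}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most frequent class
# is right exactly as often as that class occurs.
majority_class = max(counts, key=counts.get)
baseline_accuracy = proportions[majority_class]

print(proportions)
print(majority_class, baseline_accuracy)
```

On the full sample this baseline is 70%, so accuracy alone can look acceptable even for a model that never flags a risky loan.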

Preprocessing#

[9]:
from sklearn.preprocessing import LabelEncoder

#
# Build an encoder that transforms
# strings into integers (similar to factors in R)
#
enc = LabelEncoder()

#
# Apply the encoder to the categorical columns
# of the dataset
#
columns = [
    "checking_balance",
    "credit_history",
    "purpose",
    "savings_balance",
    "employment_length",
    "personal_status",
    "other_debtors",
    "property",
    "installment_plan",
    "housing",
    "telephone",
    "foreign_worker",
    "job",
]

for column in columns:
    df[column] = enc.fit_transform(df[column])
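
`LabelEncoder` assigns integers according to the sorted order of the labels, so the resulting codes are arbitrary identifiers rather than an ordering by amount. The sketch below illustrates this with the four `checking_balance` labels seen earlier. Keeping a separate fitted encoder per column (the `encoders` dictionary is an illustrative addition, not part of the original notebook) also makes it possible to recover the labels later.

```python
from sklearn.preprocessing import LabelEncoder

# The four labels of checking_balance shown in the exploratory analysis.
labels = ["< 0 DM", "1 - 200 DM", "unknown", "> 200 DM"]

# One encoder per column, stored so that the mapping is not lost.
encoders = {"checking_balance": LabelEncoder()}
codes = encoders["checking_balance"].fit_transform(labels)

# classes_[i] is the label that was mapped to the integer i.
classes = list(encoders["checking_balance"].classes_)
print(classes)
print(codes.tolist())

# inverse_transform recovers the original labels from the codes.
restored = list(encoders["checking_balance"].inverse_transform(codes))
print(restored)
```

Because a decision tree splits on thresholds of these integers, the sorted-label order influences which groupings of categories a single split can express.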

Model training#

[10]:
#
# 90% of the data is used for training
# and the remaining 10% for testing
#
train_sample = list(range(900))
test_sample = list(range(900, 1000))

#
# Build the training and test sets
#
X_train = df.iloc[train_sample, :].copy()
X_test = df.iloc[test_sample, :].copy()

#
# Drop the default column, which is
# the output variable
#
X_train.drop("default", axis=1, inplace=True)
X_test.drop("default", axis=1, inplace=True)

#
# Build the dependent variable
#
y_train_true = df.default[train_sample]
y_test_true = df.default[test_sample]

#
# Classification tree
#
from sklearn.tree import DecisionTreeClassifier

#
# Create the tree (no random_state is set,
# so results can vary between runs)
#
clf = DecisionTreeClassifier()

#
# Fit the tree on the training data
#
clf.fit(X_train, y_train_true)

#
# Predict for the test sample
#
y_test_pred = clf.predict(X_test)

#
# Performance metrics
#
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test_true, y_test_pred)
[10]:
array([[49, 19],
       [19, 13]])
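
In scikit-learn's confusion matrix, rows correspond to the true classes and columns to the predicted classes, in sorted label order (1, then 2). The following sketch derives the usual metrics from the matrix above, treating class 2 (loan not repaid) as the class of interest; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[49, 19],
               [19, 13]])

# Correct predictions (the diagonal) divided by all predictions.
accuracy = np.trace(cm) / cm.sum()

# Class 2 (loan not repaid) is the second row and the second column.
true_positives = cm[1, 1]
recall_default = true_positives / cm[1, :].sum()     # share of actual defaults detected
precision_default = true_positives / cm[:, 1].sum()  # share of predicted defaults that were real

# Accuracy of always predicting class 1 on this same test sample.
baseline_accuracy = cm[0, :].sum() / cm.sum()

print(accuracy, recall_default, precision_default, baseline_accuracy)
```

With these numbers the tree is correct on 62 of the 100 test loans, fewer than the 68 that would be obtained by always predicting class 1, and it detects 13 of the 32 loans that were not repaid.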