Identifying risky loans using decision trees
10:54 min | Last modified: April 14, 2021 | YouTube
The previous tutorial covered the fundamentals of classification trees. This tutorial applies them to identifying potentially risky loans at a lending institution.
Problem description
Financial institutions want to improve their loan approval procedures in order to reduce the risk of default, which causes losses for the institution. The practical problem is deciding whether or not to approve a particular loan based on information that can easily be collected by phone or on the web.
The sample contains 1000 observations. Each record has 20 attributes describing both the loan and the applicant's financial health. The data was collected by a German firm and can be downloaded from https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data).
The attributes and their values are as follows:
Attribute 1: (qualitative)
Status of existing checking account
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM / salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical)
Duration in month
Attribute 3: (qualitative)
Credit history
A30 : no credits taken / all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account / other credits existing (not at this bank)
Attribute 4: (qualitative)
Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others
Attribute 5: (numerical)
Credit amount
Attribute 6: (qualitative)
Savings account/bonds
A61 : ... < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : ... >= 1000 DM
A65 : unknown/ no savings account
Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : ... < 1 year
A73 : 1 <= ... < 4 years
A74 : 4 <= ... < 7 years
A75 : ... >= 7 years
Attribute 8: (numerical)
Installment rate in percentage of disposable income
Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single
Attribute 10: (qualitative)
Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor
Attribute 11: (numerical)
Present residence since
Attribute 12: (qualitative)
Property
A121 : real estate
A122 : if not A121 : building society savings agreement / life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property
Attribute 13: (numerical)
Age in years
Attribute 14: (qualitative)
Other installment plans
A141 : bank
A142 : stores
A143 : none
Attribute 15: (qualitative)
Housing
A151 : rent
A152 : own
A153 : for free
Attribute 16: (numerical)
Number of existing credits at this bank
Attribute 17: (qualitative)
Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management / self-employed / highly qualified employee / officer
Attribute 18: (numerical)
Number of people being liable to provide maintenance for
Attribute 19: (qualitative)
Telephone
A191 : none
A192 : yes, registered under the customer's name
Attribute 20: (qualitative)
foreign worker
A201 : yes
A202 : no
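The code book above can be turned into a lookup table when working with the original coded file. The following is a minimal sketch for attribute 1 only; the dictionary name `CHECKING_STATUS` and the sample values in `raw` are hypothetical and chosen for illustration, while the labels are copied from the code book.

```python
import pandas as pd

# Labels copied from the code book for attribute 1
# (status of existing checking account).
CHECKING_STATUS = {
    "A11": "< 0 DM",
    "A12": "0 <= ... < 200 DM",
    "A13": ">= 200 DM / salary assignments for at least 1 year",
    "A14": "no checking account",
}

# Hypothetical raw values, as they would appear in the coded file.
raw = pd.Series(["A11", "A14", "A12", "A11"], name="checking_status")

# Replace each code with its readable label.
decoded = raw.map(CHECKING_STATUS)
print(decoded.tolist())
```

The CSV file used in the rest of this tutorial already contains readable labels, so this step is only needed when starting from the coded data.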
Loading the data
[1]:
import pandas as pd
df = pd.read_csv(
    "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/credit.csv",
    sep=",",  # field separator
    thousands=None,  # thousands separator for numbers
    decimal=".",  # decimal separator for numbers
    encoding="latin-1",  # file encoding
)
#
# Check that the data was read correctly
#
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 checking_balance 1000 non-null object
1 months_loan_duration 1000 non-null int64
2 credit_history 1000 non-null object
3 purpose 1000 non-null object
4 amount 1000 non-null int64
5 savings_balance 1000 non-null object
6 employment_length 1000 non-null object
7 installment_rate 1000 non-null int64
8 personal_status 1000 non-null object
9 other_debtors 1000 non-null object
10 residence_history 1000 non-null int64
11 property 1000 non-null object
12 age 1000 non-null int64
13 installment_plan 1000 non-null object
14 housing 1000 non-null object
15 existing_credits 1000 non-null int64
16 default 1000 non-null int64
17 dependents 1000 non-null int64
18 telephone 1000 non-null object
19 foreign_worker 1000 non-null object
20 job 1000 non-null object
dtypes: int64(8), object(13)
memory usage: 164.2+ KB
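To see what these `read_csv` arguments do without downloading the file, the sketch below parses a small in-memory sample. The two rows reuse values from the first records of the dataset, but the three-column layout and the names `csv_text` and `sample` are assumptions made for illustration only.

```python
import io

import pandas as pd

# Two-row sample with only three of the 21 columns (illustrative layout).
csv_text = (
    "checking_balance,amount,default\n"
    "< 0 DM,1169,1\n"
    "1 - 200 DM,5951,2\n"
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    sep=",",  # field separator
    thousands=None,  # no thousands separator expected in numbers
    decimal=".",  # decimal separator for numbers
)

# Text columns are read as strings, numeric columns as integers.
print(sample.dtypes)
```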
[2]:
#
# File contents
#
df.head()
[2]:
| | checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_length | installment_rate | personal_status | other_debtors | ... | property | age | installment_plan | housing | existing_credits | default | dependents | telephone | foreign_worker | job |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | < 0 DM | 6 | critical | radio/tv | 1169 | unknown | > 7 yrs | 4 | single male | none | ... | real estate | 67 | none | own | 2 | 1 | 1 | yes | yes | skilled employee |
| 1 | 1 - 200 DM | 48 | repaid | radio/tv | 5951 | < 100 DM | 1 - 4 yrs | 2 | female | none | ... | real estate | 22 | none | own | 1 | 2 | 1 | none | yes | skilled employee |
| 2 | unknown | 12 | critical | education | 2096 | < 100 DM | 4 - 7 yrs | 2 | single male | none | ... | real estate | 49 | none | own | 1 | 1 | 2 | none | yes | unskilled resident |
| 3 | < 0 DM | 42 | repaid | furniture | 7882 | < 100 DM | 4 - 7 yrs | 2 | single male | guarantor | ... | building society savings | 45 | none | for free | 1 | 1 | 2 | none | yes | skilled employee |
| 4 | < 0 DM | 24 | delayed | car (new) | 4870 | < 100 DM | 1 - 4 yrs | 3 | single male | none | ... | unknown/none | 53 | none | for free | 2 | 2 | 2 | none | yes | skilled employee |
5 rows × 21 columns
[3]:
#
# Check the data types of the columns
#
df.dtypes
[3]:
checking_balance object
months_loan_duration int64
credit_history object
purpose object
amount int64
savings_balance object
employment_length object
installment_rate int64
personal_status object
other_debtors object
residence_history int64
property object
age int64
installment_plan object
housing object
existing_credits int64
default int64
dependents int64
telephone object
foreign_worker object
job object
dtype: object
Exploratory analysis
[4]:
#
# Some columns are numeric and the rest are
# categorical (factors).
# DM stands for Deutsche Mark.
# A few values are checked against the code book.
#
df.checking_balance.value_counts()
[4]:
unknown 394
< 0 DM 274
1 - 200 DM 269
> 200 DM 63
Name: checking_balance, dtype: int64
[5]:
df.savings_balance.value_counts()
[5]:
< 100 DM 603
unknown 183
101 - 500 DM 103
501 - 1000 DM 63
> 1000 DM 48
Name: savings_balance, dtype: int64
[6]:
#
# The loan amount ranges from 250 DM to 18,424 DM
#
df.amount.describe()
[6]:
count 1000.000000
mean 3271.258000
std 2822.736876
min 250.000000
25% 1365.500000
50% 2319.500000
75% 3972.250000
max 18424.000000
Name: amount, dtype: float64
[7]:
#
# The loan duration ranges from 4 to 72 months
#
df.months_loan_duration.describe()
[7]:
count 1000.000000
mean 20.903000
std 12.058814
min 4.000000
25% 12.000000
50% 18.000000
75% 24.000000
max 72.000000
Name: months_loan_duration, dtype: float64
[8]:
#
# The default column indicates whether the loan
# was repaid (1 = repaid, 2 = not repaid).
# This is the column to be predicted.
#
df.default.value_counts()
[8]:
1 700
2 300
Name: default, dtype: int64
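The class counts above imply a useful reference point: a model that always predicts the majority class is already right for 70% of the records. The sketch below works this out from the reported counts; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Class counts taken from the value_counts() output above.
counts = pd.Series({1: 700, 2: 300}, name="default")

# Share of each class in the sample.
proportions = counts / counts.sum()

# Accuracy obtained by always predicting the most frequent class.
majority_class = counts.idxmax()
baseline_accuracy = proportions.max()

print(proportions)
print(majority_class, baseline_accuracy)
```

Any classifier trained on this data should be compared against this baseline rather than against 50%.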
Preprocessing
[9]:
from sklearn.preprocessing import LabelEncoder
#
# Build an encoder that converts the strings
# to integers (similar to factors in R)
#
enc = LabelEncoder()
#
# Apply the encoder to the categorical columns
# of the dataset
#
columns = [
    "checking_balance",
    "credit_history",
    "purpose",
    "savings_balance",
    "employment_length",
    "personal_status",
    "other_debtors",
    "property",
    "installment_plan",
    "housing",
    "telephone",
    "foreign_worker",
    "job",
]
for column in columns:
    df[column] = enc.fit_transform(df[column])
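It helps to know which integers `LabelEncoder` assigns. The sketch below applies it to the four levels of `checking_balance` and shows that the codes follow the sorted order of the strings, not the order of the balances. As a possible alternative, it also shows `OrdinalEncoder` with an explicit category order; placing `unknown` last is an assumption made here for illustration, not something the original notebook does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# The four levels of checking_balance seen in the exploratory analysis.
levels = ["unknown", "< 0 DM", "1 - 200 DM", "> 200 DM"]

# LabelEncoder sorts the distinct strings and numbers them from 0.
label_enc = LabelEncoder()
label_codes = label_enc.fit_transform(levels)
print(list(label_enc.classes_))
print(label_codes.tolist())

# OrdinalEncoder accepts an explicit order for each column.
# Assumption: "unknown" is placed after the known balances.
ordered = [["< 0 DM", "1 - 200 DM", "> 200 DM", "unknown"]]
ordinal_enc = OrdinalEncoder(categories=ordered)
frame = pd.DataFrame({"checking_balance": levels})
ordinal_codes = ordinal_enc.fit_transform(frame).ravel()
print(ordinal_codes.tolist())
```

With `LabelEncoder`, `1 - 200 DM` receives code 0 and `< 0 DM` receives code 1, so a threshold split on the encoded column does not separate low balances from high ones in a meaningful order.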
Model training
[10]:
#
# The first 90% of the rows are used for training
# and the remaining 10% for testing
#
train_sample = list(range(900))
test_sample = list(range(900, 1000))
#
# Build the training and test sets
#
X_train = df.iloc[train_sample, :].copy()
X_test = df.iloc[test_sample, :].copy()
#
# Drop the default column, which is
# the output variable
#
X_train.drop("default", axis=1, inplace=True)
X_test.drop("default", axis=1, inplace=True)
#
# Build the dependent variable
#
y_train_true = df.default[train_sample]
y_test_true = df.default[test_sample]
#
# Build the classification tree
#
from sklearn.tree import DecisionTreeClassifier
#
# Create the tree
#
clf = DecisionTreeClassifier()
#
# Fit the tree on the training data
#
clf.fit(X_train, y_train_true)
#
# Predict for the test sample
#
y_test_pred = clf.predict(X_test)
#
# Performance metrics
#
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test_true, y_test_pred)
[10]:
array([[49, 19],
[19, 13]])
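The confusion matrix can be summarized with a few standard ratios. The sketch below computes them directly from the matrix reported above, using scikit-learn's convention that rows are true classes and columns are predicted classes, both in the order 1, 2. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above.
# Rows: true class (1, 2). Columns: predicted class (1, 2).
cm = np.array([[49, 19], [19, 13]])

# Share of test records classified correctly: (49 + 13) / 100.
accuracy = np.trace(cm) / cm.sum()

# Of the 32 loans that were not repaid (class 2), share that was detected.
recall_default = cm[1, 1] / cm[1, :].sum()

# Of the 32 loans predicted as class 2, share that was truly class 2.
precision_default = cm[1, 1] / cm[:, 1].sum()

# Accuracy of always predicting class 1 on this test sample: 68 / 100.
majority_baseline = cm[0, :].sum() / cm.sum()

print(accuracy, recall_default, precision_default, majority_baseline)
```

On this test sample the tree reaches 62% accuracy, below the 68% obtained by always predicting class 1, and it detects 13 of the 32 unpaid loans. Because `DecisionTreeClassifier()` is created without a fixed `random_state`, rerunning the notebook may give a slightly different matrix.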