Identification of risky loans#
60 min | Last modified: November 5, 2020
Problem description#
Financial institutions want to improve their loan-approval procedures in order to reduce the risk of borrowers defaulting on their debt, which causes losses to the institution. The practical problem is to decide whether or not to approve a particular loan, based on information that can easily be collected by phone or on the web.
The sample contains 1000 observations. Each record has 20 attributes with information about both the loan and the applicant's financial standing. The data were collected by a German firm and can be downloaded from https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data).
The attributes and their values are the following:
Attribute 1: (qualitative)
Status of existing checking account
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM / salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical)
Duration in month
Attribute 3: (qualitative)
Credit history
A30 : no credits taken / all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account / other credits existing (not at this bank)
Attribute 4: (qualitative)
Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others
Attribute 5: (numerical)
Credit amount
Attribute 6: (qualitative)
Savings account/bonds
A61 : ... < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : ... >= 1000 DM
A65 : unknown/ no savings account
Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : ... < 1 year
A73 : 1 <= ... < 4 years
A74 : 4 <= ... < 7 years
A75 : ... >= 7 years
Attribute 8: (numerical)
Installment rate in percentage of disposable income
Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single
Attribute 10: (qualitative)
Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor
Attribute 11: (numerical)
Present residence since
Attribute 12: (qualitative)
Property
A121 : real estate
A122 : if not A121 : building society savings agreement/ life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property
Attribute 13: (numerical)
Age in years
Attribute 14: (qualitative)
Other installment plans
A141 : bank
A142 : stores
A143 : none
Attribute 15: (qualitative)
Housing
A151 : rent
A152 : own
A153 : for free
Attribute 16: (numerical)
Number of existing credits at this bank
Attribute 17: (qualitative)
Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management/ self-employed/ highly qualified employee/ officer
Attribute 18: (numerical)
Number of people being liable to provide maintenance for
Attribute 19: (qualitative)
Telephone
A191 : none
A192 : yes, registered under the customer's name
Attribute 20: (qualitative)
Foreign worker
A201 : yes
A202 : no
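In the raw UCI file (german.data) the qualitative attributes appear as these A-codes; the CSV used below already carries human-readable labels. If one had to work from the raw file, a small lookup table could decode the values. A minimal sketch with only a few illustrative entries taken from the code book above (the CODE_BOOK dict and decode helper are hypothetical, not part of the dataset):
##
## Hypothetical helper: partial decode table for the raw
## UCI file; the entries follow the code book above
##
CODE_BOOK = {
    "A11": "... < 0 DM",
    "A12": "0 <= ... < 200 DM",
    "A14": "no checking account",
    "A40": "car (new)",
    "A41": "car (used)",
}

def decode(value):
    ## Return the label for a code; values without an
    ## entry (e.g. numeric attributes) pass through
    return CODE_BOOK.get(value, value)

decode("A14")  # 'no checking account'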
Data preparation and loading#
[1]:
!wget https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/credit.csv
--2020-11-01 15:48:59-- https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/credit.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.48.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.48.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 133638 (131K) [text/plain]
Saving to: ‘credit.csv.2’
credit.csv.2 100%[===================>] 130.51K --.-KB/s in 0.1s
2020-11-01 15:49:00 (974 KB/s) - ‘credit.csv.2’ saved [133638/133638]
[2]:
##
## Move the data file to HDFS
##
!hdfs dfs -copyFromLocal credit.csv /tmp/credit.csv
copyFromLocal: `/tmp/credit.csv': File exists
[3]:
##
## Load the Spark libraries
##
import findspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
findspark.init()
APP_NAME = "spark-app"
conf = SparkConf().setAppName(APP_NAME)
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
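As a side note, on Spark 2.x and later the same session can be created with the builder API, which avoids instantiating SparkConf and SparkContext by hand; a minimal equivalent sketch:
##
## Equivalent session creation with the builder API; the
## context is then available as spark.sparkContext
##
import findspark

findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-app").getOrCreate()
sc = spark.sparkContext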
[4]:
##
## Read the file
##
spark_df = spark.read.load(
"/tmp/credit.csv", format="csv", sep=",", inferSchema="true", header="true"
)
##
## Number of records loaded
##
spark_df.count()
[4]:
1000
[5]:
##
## Data types of the DataFrame fields
##
spark_df.printSchema()
root
|-- checking_balance: string (nullable = true)
|-- months_loan_duration: integer (nullable = true)
|-- credit_history: string (nullable = true)
|-- purpose: string (nullable = true)
|-- amount: integer (nullable = true)
|-- savings_balance: string (nullable = true)
|-- employment_length: string (nullable = true)
|-- installment_rate: integer (nullable = true)
|-- personal_status: string (nullable = true)
|-- other_debtors: string (nullable = true)
|-- residence_history: integer (nullable = true)
|-- property: string (nullable = true)
|-- age: integer (nullable = true)
|-- installment_plan: string (nullable = true)
|-- housing: string (nullable = true)
|-- existing_credits: integer (nullable = true)
|-- default: integer (nullable = true)
|-- dependents: integer (nullable = true)
|-- telephone: string (nullable = true)
|-- foreign_worker: string (nullable = true)
|-- job: string (nullable = true)
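Since inferSchema (in the default permissive mode) can leave nulls behind when a value fails to parse, it is worth confirming that no column contains nulls before going on; a minimal sketch using the standard count/when idiom:
##
## Nulls per column; every count should be 0 for
## this dataset
##
import pyspark.sql.functions as F

spark_df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in spark_df.columns]
).show()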
[6]:
##
## File contents
##
spark_df.show()
+----------------+--------------------+--------------+----------+------+---------------+-----------------+----------------+---------------+-------------+-----------------+--------------------+---+----------------+--------+----------------+-------+----------+---------+--------------+--------------------+
|checking_balance|months_loan_duration|credit_history| purpose|amount|savings_balance|employment_length|installment_rate|personal_status|other_debtors|residence_history| property|age|installment_plan| housing|existing_credits|default|dependents|telephone|foreign_worker| job|
+----------------+--------------------+--------------+----------+------+---------------+-----------------+----------------+---------------+-------------+-----------------+--------------------+---+----------------+--------+----------------+-------+----------+---------+--------------+--------------------+
| < 0 DM| 6| critical| radio/tv| 1169| unknown| > 7 yrs| 4| single male| none| 4| real estate| 67| none| own| 2| 1| 1| yes| yes| skilled employee|
| 1 - 200 DM| 48| repaid| radio/tv| 5951| < 100 DM| 1 - 4 yrs| 2| female| none| 2| real estate| 22| none| own| 1| 2| 1| none| yes| skilled employee|
| unknown| 12| critical| education| 2096| < 100 DM| 4 - 7 yrs| 2| single male| none| 3| real estate| 49| none| own| 1| 1| 2| none| yes| unskilled resident|
| < 0 DM| 42| repaid| furniture| 7882| < 100 DM| 4 - 7 yrs| 2| single male| guarantor| 4|building society ...| 45| none|for free| 1| 1| 2| none| yes| skilled employee|
| < 0 DM| 24| delayed| car (new)| 4870| < 100 DM| 1 - 4 yrs| 3| single male| none| 4| unknown/none| 53| none|for free| 2| 2| 2| none| yes| skilled employee|
| unknown| 36| repaid| education| 9055| unknown| 1 - 4 yrs| 2| single male| none| 4| unknown/none| 35| none|for free| 1| 1| 2| yes| yes| unskilled resident|
| unknown| 24| repaid| furniture| 2835| 501 - 1000 DM| > 7 yrs| 3| single male| none| 4|building society ...| 53| none| own| 1| 1| 1| none| yes| skilled employee|
| 1 - 200 DM| 36| repaid|car (used)| 6948| < 100 DM| 1 - 4 yrs| 2| single male| none| 2| other| 35| none| rent| 1| 1| 1| yes| yes|mangement self-em...|
| unknown| 12| repaid| radio/tv| 3059| > 1000 DM| 4 - 7 yrs| 2| divorced male| none| 4| real estate| 61| none| own| 1| 1| 1| none| yes| unskilled resident|
| 1 - 200 DM| 30| critical| car (new)| 5234| < 100 DM| unemployed| 4| married male| none| 2| other| 28| none| own| 2| 2| 1| none| yes|mangement self-em...|
| 1 - 200 DM| 12| repaid| car (new)| 1295| < 100 DM| 0 - 1 yrs| 3| female| none| 1| other| 25| none| rent| 1| 2| 1| none| yes| skilled employee|
| < 0 DM| 48| repaid| business| 4308| < 100 DM| 0 - 1 yrs| 3| female| none| 4|building society ...| 24| none| rent| 1| 2| 1| none| yes| skilled employee|
| 1 - 200 DM| 12| repaid| radio/tv| 1567| < 100 DM| 1 - 4 yrs| 1| female| none| 1| other| 22| none| own| 1| 1| 1| yes| yes| skilled employee|
| < 0 DM| 24| critical| car (new)| 1199| < 100 DM| > 7 yrs| 4| single male| none| 4| other| 60| none| own| 2| 2| 1| none| yes| unskilled resident|
| < 0 DM| 15| repaid| car (new)| 1403| < 100 DM| 1 - 4 yrs| 2| female| none| 4| other| 28| none| rent| 1| 1| 1| none| yes| skilled employee|
| < 0 DM| 24| repaid| radio/tv| 1282| 101 - 500 DM| 1 - 4 yrs| 4| female| none| 2| other| 32| none| own| 1| 2| 1| none| yes| unskilled resident|
| unknown| 24| critical| radio/tv| 2424| unknown| > 7 yrs| 4| single male| none| 4|building society ...| 53| none| own| 2| 1| 1| none| yes| skilled employee|
| < 0 DM| 30| fully repaid| business| 8072| unknown| 0 - 1 yrs| 2| single male| none| 3| other| 25| bank| own| 3| 1| 1| none| yes| skilled employee|
| 1 - 200 DM| 24| repaid|car (used)| 12579| < 100 DM| > 7 yrs| 4| female| none| 2| unknown/none| 44| none|for free| 1| 2| 1| yes| yes|mangement self-em...|
| unknown| 24| repaid| radio/tv| 3430| 501 - 1000 DM| > 7 yrs| 3| single male| none| 2| other| 31| none| own| 1| 1| 2| yes| yes| skilled employee|
+----------------+--------------------+--------------+----------+------+---------------+-----------------+----------------+---------------+-------------+-----------------+--------------------+---+----------------+--------+----------------+-------+----------+---------+--------------+--------------------+
only showing top 20 rows
Exploratory analysis#
[7]:
!pip3 -q install pandas
[8]:
##
## Some of the columns are numeric and the others
## are factors. DM stands for Deutsche Mark.
## A few values are checked against the code book above.
##
spark_df.groupBy("checking_balance").count().toPandas()
[8]:
|   | checking_balance | count |
|---|------------------|-------|
| 0 | unknown          |   394 |
| 1 | < 0 DM           |   274 |
| 2 | > 200 DM         |    63 |
| 3 | 1 - 200 DM       |   269 |
[9]:
spark_df.groupBy("savings_balance").count().toPandas()
[9]:
|   | savings_balance | count |
|---|-----------------|-------|
| 0 | unknown         |   183 |
| 1 | 101 - 500 DM    |   103 |
| 2 | < 100 DM        |   603 |
| 3 | 501 - 1000 DM   |    63 |
| 4 | > 1000 DM       |    48 |
[10]:
##
## The loan amount ranges from 250 DM to 18,424 DM
##
spark_df.select("amount").describe().toPandas()
[10]:
|   | summary | amount            |
|---|---------|-------------------|
| 0 | count   | 1000              |
| 1 | mean    | 3271.258          |
| 2 | stddev  | 2822.736875960441 |
| 3 | min     | 250               |
| 4 | max     | 18424             |
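The summary statistics hide the shape of the distribution; the RDD API offers a quick histogram. A minimal sketch with 10 equal-width buckets:
##
## Distribution of the loan amount in 10 equal-width
## buckets; histogram() returns (boundaries, counts)
##
boundaries, counts = (
    spark_df.select("amount").rdd.map(lambda row: row[0]).histogram(10)
)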
[11]:
##
## The loan duration ranges from 4 to 72 months
##
spark_df.select("months_loan_duration").describe().toPandas()
[11]:
|   | summary | months_loan_duration |
|---|---------|----------------------|
| 0 | count   | 1000                 |
| 1 | mean    | 20.903               |
| 2 | stddev  | 12.058814452756371   |
| 3 | min     | 4                    |
| 4 | max     | 72                   |
[12]:
##
## The default column indicates whether there were
## problems paying back the loan (1 = paid, 2 = default).
## This is the column to be predicted.
##
spark_df.select("default").describe().toPandas()
[12]:
|   | summary | default            |
|---|---------|--------------------|
| 0 | count   | 1000               |
| 1 | mean    | 1.3                |
| 2 | stddev  | 0.4584868702702515 |
| 3 | min     | 1                  |
| 4 | max     | 2                  |
[13]:
##
## Replace {1, 2} with {0, 1}
##
from pyspark.sql.functions import when
spark_df = spark_df.withColumn('default', when(spark_df['default'] == 1, 0).otherwise(spark_df['default']))
spark_df = spark_df.withColumn('default', when(spark_df['default'] == 2, 1).otherwise(spark_df['default']))
spark_df.select("default").describe().toPandas()
[13]:
|   | summary | default             |
|---|---------|---------------------|
| 0 | count   | 1000                |
| 1 | mean    | 0.3                 |
| 2 | stddev  | 0.45848687027025153 |
| 3 | min     | 0                   |
| 4 | max     | 1                   |
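Since the original values are exactly {1, 2}, the same recoding can be written in one step by subtracting 1; a minimal equivalent sketch, applied to a freshly loaded copy so the column is not recoded twice:
##
## One-step recoding {1, 2} -> {0, 1}, equivalent to
## the two when() calls above
##
from pyspark.sql.functions import col

raw_df = spark.read.load(
    "/tmp/credit.csv", format="csv", sep=",", inferSchema="true", header="true"
)
raw_df = raw_df.withColumn("default", (col("default") - 1).cast("integer"))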
Preprocessing#
[14]:
from pyspark.ml.feature import StringIndexer

for column in [
    "checking_balance",
    "credit_history",
    "purpose",
    "savings_balance",
    "employment_length",
    "personal_status",
    "other_debtors",
    "property",
    "installment_plan",
    "housing",
    "telephone",
    "foreign_worker",
    "job",
]:

    ##
    ## Build an indexer that maps the string values
    ## to integers (similar to factors in R)
    ##
    stringIndexer = StringIndexer(
        inputCol=column,
        outputCol=column + "_",
    )

    ##
    ## Apply the indexer to the dataset columns
    ##
    spark_df = stringIndexer.fit(spark_df).transform(spark_df)

##
## Resulting columns
##
spark_df.dtypes
[14]:
[('checking_balance', 'string'),
('months_loan_duration', 'int'),
('credit_history', 'string'),
('purpose', 'string'),
('amount', 'int'),
('savings_balance', 'string'),
('employment_length', 'string'),
('installment_rate', 'int'),
('personal_status', 'string'),
('other_debtors', 'string'),
('residence_history', 'int'),
('property', 'string'),
('age', 'int'),
('installment_plan', 'string'),
('housing', 'string'),
('existing_credits', 'int'),
('default', 'int'),
('dependents', 'int'),
('telephone', 'string'),
('foreign_worker', 'string'),
('job', 'string'),
('checking_balance_', 'double'),
('credit_history_', 'double'),
('purpose_', 'double'),
('savings_balance_', 'double'),
('employment_length_', 'double'),
('personal_status_', 'double'),
('other_debtors_', 'double'),
('property_', 'double'),
('installment_plan_', 'double'),
('housing_', 'double'),
('telephone_', 'double'),
('foreign_worker_', 'double'),
('job_', 'double')]
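On Spark 3.0 and later, StringIndexer also accepts lists of input and output columns, so the loop above collapses to a single indexer. A minimal sketch, assuming a Spark 3.x runtime and a freshly loaded dataframe (the indexed columns must not already exist):
##
## Multi-column variant of the indexing loop (Spark >= 3.0)
##
from pyspark.ml.feature import StringIndexer

string_columns = [
    "checking_balance", "credit_history", "purpose", "savings_balance",
    "employment_length", "personal_status", "other_debtors", "property",
    "installment_plan", "housing", "telephone", "foreign_worker", "job",
]

indexer = StringIndexer(
    inputCols=string_columns,
    outputCols=[c + "_" for c in string_columns],
)

df = spark.read.load(
    "/tmp/credit.csv", format="csv", sep=",", inferSchema="true", header="true"
)
df = indexer.fit(df).transform(df)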
Names of the columns of interest#
[15]:
##
## Columns contained in the dataframe
##
inputCols = [a for a, _ in spark_df.dtypes]
inputCols
[15]:
['checking_balance',
'months_loan_duration',
'credit_history',
'purpose',
'amount',
'savings_balance',
'employment_length',
'installment_rate',
'personal_status',
'other_debtors',
'residence_history',
'property',
'age',
'installment_plan',
'housing',
'existing_credits',
'default',
'dependents',
'telephone',
'foreign_worker',
'job',
'checking_balance_',
'credit_history_',
'purpose_',
'savings_balance_',
'employment_length_',
'personal_status_',
'other_debtors_',
'property_',
'installment_plan_',
'housing_',
'telephone_',
'foreign_worker_',
'job_']
[16]:
##
## Drop the original string columns, keeping only their indexed versions
##
inputCols = [a for a in inputCols if a + "_" not in inputCols]
inputCols
[16]:
['months_loan_duration',
'amount',
'installment_rate',
'residence_history',
'age',
'existing_credits',
'default',
'dependents',
'checking_balance_',
'credit_history_',
'purpose_',
'savings_balance_',
'employment_length_',
'personal_status_',
'other_debtors_',
'property_',
'installment_plan_',
'housing_',
'telephone_',
'foreign_worker_',
'job_']
[17]:
##
## Remove the default column (the label to be predicted)
##
inputCols.remove("default")
inputCols
[17]:
['months_loan_duration',
'amount',
'installment_rate',
'residence_history',
'age',
'existing_credits',
'dependents',
'checking_balance_',
'credit_history_',
'purpose_',
'savings_balance_',
'employment_length_',
'personal_status_',
'other_debtors_',
'property_',
'installment_plan_',
'housing_',
'telephone_',
'foreign_worker_',
'job_']
Assembling the columns#
[18]:
from pyspark.ml.feature import VectorAssembler
vectorAssembler = VectorAssembler(
inputCols=inputCols,
outputCol="features",
)
spark_df = vectorAssembler.transform(spark_df)
spark_df.select("features").show()
+--------------------+
| features|
+--------------------+
|[6.0,1169.0,4.0,4...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|[42.0,7882.0,2.0,...|
|(20,[0,1,2,3,4,5,...|
|[36.0,9055.0,2.0,...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|[30.0,5234.0,4.0,...|
|(20,[0,1,2,3,4,5,...|
|[48.0,4308.0,3.0,...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|[30.0,8072.0,2.0,...|
|[24.0,12579.0,4.0...|
|(20,[0,1,2,3,4,5,...|
+--------------------+
only showing top 20 rows
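Note the two notations in the output: Spark stores each feature vector either densely, as [6.0,1169.0,...], or sparsely, as (size,[indices],[values]), whichever representation is smaller. Both encode the same information; a minimal sketch:
##
## The same 20-element vector built densely and sparsely;
## toArray() recovers the full representation in both cases
##
from pyspark.ml.linalg import Vectors

dense = Vectors.dense([6.0, 1169.0] + [0.0] * 18)
sparse = Vectors.sparse(20, [0, 1], [6.0, 1169.0])

sparse.toArray()  # array([6., 1169., 0., ..., 0.])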
Training and test sets#
[19]:
##
## 90% of the data is used for training and
## the remaining 10% for testing
##
(train_df, test_df) = spark_df.randomSplit([0.9, 0.1])
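Note that randomSplit draws a different partition on every run, so the error figures below will vary slightly between executions. For a reproducible split, a seed can be fixed; a variant of the call above (the seed value is arbitrary):
##
## Reproducible variant of the split
##
(train_df, test_df) = spark_df.randomSplit([0.9, 0.1], seed=12345)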
Decision tree classifier#
[20]:
##
## Decision tree classifier
##
from pyspark.ml.classification import DecisionTreeClassifier
##
## Set up the classifier
##
clf = DecisionTreeClassifier(
featuresCol="features",
labelCol="default",
predictionCol="prediction_DTC",
probabilityCol='probability_DTC',
rawPredictionCol='rawPrediction_DTC',
)
##
## Training
##
clf = clf.fit(train_df)
##
## Prediction
##
train_df = clf.transform(train_df)
test_df = clf.transform(test_df)
[21]:
##
## Performance metrics
##
import pyspark.sql.functions as F
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql.types import FloatType
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
def print_stats(dataframes, prediction_col):

    evaluator = MulticlassClassificationEvaluator(
        labelCol="default",
        predictionCol=prediction_col,
        metricName="accuracy",
    )

    print("Train Error = %g " % (1.0 - evaluator.evaluate(train_df)))
    print("Test Error = %g " % (1.0 - evaluator.evaluate(test_df)))
    print()

    for df in dataframes:
        predictionAndLabels = df.select([prediction_col, "default"])
        predictionAndLabels = predictionAndLabels.withColumn(
            prediction_col, F.col(prediction_col).cast(FloatType())
        )
        predictionAndLabels = predictionAndLabels.withColumn(
            "default", F.col("default").cast(FloatType())
        )
        metrics = MulticlassMetrics(predictionAndLabels.rdd.map(tuple))
        print(metrics.confusionMatrix().toArray())
        print()
[22]:
print_stats(dataframes=[train_df, test_df], prediction_col="prediction_DTC")
Train Error = 0.209071
Test Error = 0.270833
[[604. 32.]
[157. 111.]]
[[57. 7.]
[19. 13.]]
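Accuracy alone is a weak yardstick here: 70% of the loans are good, so a model that always predicts "no default" already reaches 70% accuracy. The same evaluator can report class-weighted metrics; a minimal sketch on the test set:
##
## Weighted precision, recall and F1 of the decision
## tree on the test set
##
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

for metric in ["weightedPrecision", "weightedRecall", "f1"]:
    evaluator = MulticlassClassificationEvaluator(
        labelCol="default",
        predictionCol="prediction_DTC",
        metricName=metric,
    )
    print(metric, "=", evaluator.evaluate(test_df))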
Random forest classifier#
[23]:
##
## Random forest classifier
##
from pyspark.ml.classification import RandomForestClassifier
##
## Set up the classifier
##
clf = RandomForestClassifier(
featuresCol="features",
labelCol="default",
predictionCol="prediction_RFC",
probabilityCol='probability_RFC',
rawPredictionCol='rawPrediction_RFC',
    numTrees=20,
)
##
## Training
##
clf = clf.fit(train_df)
##
## Prediction
##
train_df = clf.transform(train_df)
test_df = clf.transform(test_df)
print_stats(dataframes=[train_df, test_df], prediction_col="prediction_RFC")
Train Error = 0.19469
Test Error = 0.25
[[624. 12.]
[164. 104.]]
[[63. 1.]
[23. 9.]]
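A useful by-product of the random forest is its feature-importance vector; since clf is now the fitted model, the importances can be paired with the assembled column names. A minimal sketch showing the five most important features:
##
## Feature importances of the fitted forest, from most
## to least important
##
importances = sorted(
    zip(inputCols, clf.featureImportances.toArray()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, value in importances[:5]:
    print(name, round(value, 3))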
Gradient-boosted tree classifier#
[24]:
##
## Gradient-boosted tree classifier
##
from pyspark.ml.classification import GBTClassifier
##
## Set up the classifier
##
clf = GBTClassifier(
featuresCol="features",
labelCol="default",
predictionCol="prediction_GBTC",
    maxIter=20,
)
##
## Training
##
clf = clf.fit(train_df)
##
## Prediction
##
train_df = clf.transform(train_df)
test_df = clf.transform(test_df)
print_stats(dataframes=[train_df, test_df], prediction_col="prediction_GBTC")
Train Error = 0.0752212
Test Error = 0.21875
[[616. 20.]
[ 48. 220.]]
[[56. 8.]
[13. 19.]]
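The gap between the train error (about 7.5%) and the test error (about 21.9%) suggests the boosted model is overfitting. A standard next step is to tune its hyperparameters by cross-validation; a minimal sketch of a 3-fold grid search over maxDepth (the grid values are illustrative, not tuned choices):
##
## Illustrative 3-fold cross-validation for the
## gradient-boosted trees
##
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

gbt = GBTClassifier(featuresCol="features", labelCol="default", maxIter=20)

grid = ParamGridBuilder().addGrid(gbt.maxDepth, [3, 5, 7]).build()

cv = CrossValidator(
    estimator=gbt,
    estimatorParamMaps=grid,
    evaluator=MulticlassClassificationEvaluator(
        labelCol="default", metricName="accuracy"
    ),
    numFolds=3,
)

## Only the assembled features and the label are needed
cv_model = cv.fit(train_df.select("features", "default"))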