Identificación de créditos riesgosos#

  • 60 min | Ultima modificación: Noviembre 5, 2020

Descripción del problema#

Las entidades financieras desean mejorar sus procedimientos de aprobación de créditos con el fin de disminuir los riesgos de no pago de la deuda, lo que acarrea pérdidas a la entidad. El problema real consiste en poder decidir si se aprueba o no un crédito particular con base en información que puede ser fácilmente recolectada por teléfono o en la web.

Se tiene una muestra de 1000 observaciones. Cada registro contiene 20 atributos que recopilan información tanto sobre el crédito como sobre la salud financiera del solicitante. La información fue recolectada por una firma alemana y se puede descargar de https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data).

Los atributos y sus valores son los siguientes:

Attribute 1:  (qualitative)
         Status of existing checking account
         A11 :      ... <    0 DM
         A12 : 0 <= ... <  200 DM
         A13 :      ... >= 200 DM /
               salary assignments for at least 1 year
         A14 : no checking account

Attribute 2:  (numerical)
         Duration in month

Attribute 3:  (qualitative)
         Credit history
         A30 : no credits taken/
               all credits paid back duly
         A31 : all credits at this bank paid back duly
         A32 : existing credits paid back duly till now
         A33 : delay in paying off in the past
         A34 : critical account/
               other credits existing (not at this bank)

Attribute 4:  (qualitative)
         Purpose
         A40 : car (new)
         A41 : car (used)
         A42 : furniture/equipment
         A43 : radio/television
         A44 : domestic appliances
         A45 : repairs
         A46 : education
         A47 : (vacation - does not exist?)
         A48 : retraining
         A49 : business
         A410 : others

Attribute 5:  (numerical)
         Credit amount

Attribute 6:  (qualitative)
         Savings account/bonds
         A61 :          ... <  100 DM
         A62 :   100 <= ... <  500 DM
         A63 :   500 <= ... < 1000 DM
         A64 :          .. >= 1000 DM
         A65 :   unknown/ no savings account

Attribute 7:  (qualitative)
         Present employment since
         A71 : unemployed
         A72 :       ... < 1 year
         A73 : 1  <= ... < 4 years
         A74 : 4  <= ... < 7 years
         A75 :       .. >= 7 years

Attribute 8:  (numerical)
         Installment rate in percentage of disposable income

Attribute 9:  (qualitative)
         Personal status and sex
         A91 : male   : divorced/separated
         A92 : female : divorced/separated/married
         A93 : male   : single
         A94 : male   : married/widowed
         A95 : female : single

Attribute 10: (qualitative)
         Other debtors / guarantors
         A101 : none
         A102 : co-applicant
         A103 : guarantor

Attribute 11: (numerical)
         Present residence since

Attribute 12: (qualitative)
         Property
         A121 : real estate
         A122 : if not A121 : building society savings agreement/
                  life insurance
         A123 : if not A121/A122 : car or other, not in attribute 6
         A124 : unknown / no property

Attribute 13: (numerical)
         Age in years

Attribute 14: (qualitative)
         Other installment plans
         A141 : bank
         A142 : stores
         A143 : none

Attribute 15: (qualitative)
         Housing
         A151 : rent
         A152 : own
         A153 : for free

Attribute 16: (numerical)
         Number of existing credits at this bank

Attribute 17: (qualitative)
         Job
         A171 : unemployed/ unskilled  - non-resident
         A172 : unskilled - resident
         A173 : skilled employee / official
         A174 : management/ self-employed/
                highly qualified employee/ officer

Attribute 18: (numerical)
         Number of people being liable to provide maintenance for

Attribute 19: (qualitative)
         Telephone
         A191 : none
         A192 : yes, registered under the customers name

Attribute 20: (qualitative)
         foreign worker
         A201 : yes
         A202 : no

Preparación y carga de datos#

[1]:
!wget https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/credit.csv
--2020-11-01 15:48:59--  https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/credit.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.48.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.48.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 133638 (131K) [text/plain]
Saving to: ‘credit.csv.2’

credit.csv.2        100%[===================>] 130.51K  --.-KB/s    in 0.1s

2020-11-01 15:49:00 (974 KB/s) - ‘credit.csv.2’ saved [133638/133638]

[2]:
##
## Mueve el archivo de datos al hdfs
##
!hdfs dfs -copyFromLocal credit.csv /tmp/credit.csv
copyFromLocal: `/tmp/credit.csv': File exists
[3]:
##
## Carga de las librerías de Spark
##
import findspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

findspark.init()

APP_NAME = "spark-app"

conf = SparkConf().setAppName(APP_NAME)
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
[4]:
##
## Lectura del archivo.
##
spark_df = spark.read.load(
    "/tmp/credit.csv", format="csv", sep=",", inferSchema="true", header="true"
)

##
## Número de registros cargados
##
spark_df.count()
[4]:
1000
[5]:
##
## Tipos de datos de los campos del DataFrame
##
spark_df.printSchema()
root
 |-- checking_balance: string (nullable = true)
 |-- months_loan_duration: integer (nullable = true)
 |-- credit_history: string (nullable = true)
 |-- purpose: string (nullable = true)
 |-- amount: integer (nullable = true)
 |-- savings_balance: string (nullable = true)
 |-- employment_length: string (nullable = true)
 |-- installment_rate: integer (nullable = true)
 |-- personal_status: string (nullable = true)
 |-- other_debtors: string (nullable = true)
 |-- residence_history: integer (nullable = true)
 |-- property: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- installment_plan: string (nullable = true)
 |-- housing: string (nullable = true)
 |-- existing_credits: integer (nullable = true)
 |-- default: integer (nullable = true)
 |-- dependents: integer (nullable = true)
 |-- telephone: string (nullable = true)
 |-- foreign_worker: string (nullable = true)
 |-- job: string (nullable = true)

[6]:
##
## Contenido del archivo
##
spark_df.show()
+----------------+--------------------+--------------+----------+------+---------------+-----------------+----------------+---------------+-------------+-----------------+--------------------+---+----------------+--------+----------------+-------+----------+---------+--------------+--------------------+
|checking_balance|months_loan_duration|credit_history|   purpose|amount|savings_balance|employment_length|installment_rate|personal_status|other_debtors|residence_history|            property|age|installment_plan| housing|existing_credits|default|dependents|telephone|foreign_worker|                 job|
+----------------+--------------------+--------------+----------+------+---------------+-----------------+----------------+---------------+-------------+-----------------+--------------------+---+----------------+--------+----------------+-------+----------+---------+--------------+--------------------+
|          < 0 DM|                   6|      critical|  radio/tv|  1169|        unknown|          > 7 yrs|               4|    single male|         none|                4|         real estate| 67|            none|     own|               2|      1|         1|      yes|           yes|    skilled employee|
|      1 - 200 DM|                  48|        repaid|  radio/tv|  5951|       < 100 DM|        1 - 4 yrs|               2|         female|         none|                2|         real estate| 22|            none|     own|               1|      2|         1|     none|           yes|    skilled employee|
|         unknown|                  12|      critical| education|  2096|       < 100 DM|        4 - 7 yrs|               2|    single male|         none|                3|         real estate| 49|            none|     own|               1|      1|         2|     none|           yes|  unskilled resident|
|          < 0 DM|                  42|        repaid| furniture|  7882|       < 100 DM|        4 - 7 yrs|               2|    single male|    guarantor|                4|building society ...| 45|            none|for free|               1|      1|         2|     none|           yes|    skilled employee|
|          < 0 DM|                  24|       delayed| car (new)|  4870|       < 100 DM|        1 - 4 yrs|               3|    single male|         none|                4|        unknown/none| 53|            none|for free|               2|      2|         2|     none|           yes|    skilled employee|
|         unknown|                  36|        repaid| education|  9055|        unknown|        1 - 4 yrs|               2|    single male|         none|                4|        unknown/none| 35|            none|for free|               1|      1|         2|      yes|           yes|  unskilled resident|
|         unknown|                  24|        repaid| furniture|  2835|  501 - 1000 DM|          > 7 yrs|               3|    single male|         none|                4|building society ...| 53|            none|     own|               1|      1|         1|     none|           yes|    skilled employee|
|      1 - 200 DM|                  36|        repaid|car (used)|  6948|       < 100 DM|        1 - 4 yrs|               2|    single male|         none|                2|               other| 35|            none|    rent|               1|      1|         1|      yes|           yes|mangement self-em...|
|         unknown|                  12|        repaid|  radio/tv|  3059|      > 1000 DM|        4 - 7 yrs|               2|  divorced male|         none|                4|         real estate| 61|            none|     own|               1|      1|         1|     none|           yes|  unskilled resident|
|      1 - 200 DM|                  30|      critical| car (new)|  5234|       < 100 DM|       unemployed|               4|   married male|         none|                2|               other| 28|            none|     own|               2|      2|         1|     none|           yes|mangement self-em...|
|      1 - 200 DM|                  12|        repaid| car (new)|  1295|       < 100 DM|        0 - 1 yrs|               3|         female|         none|                1|               other| 25|            none|    rent|               1|      2|         1|     none|           yes|    skilled employee|
|          < 0 DM|                  48|        repaid|  business|  4308|       < 100 DM|        0 - 1 yrs|               3|         female|         none|                4|building society ...| 24|            none|    rent|               1|      2|         1|     none|           yes|    skilled employee|
|      1 - 200 DM|                  12|        repaid|  radio/tv|  1567|       < 100 DM|        1 - 4 yrs|               1|         female|         none|                1|               other| 22|            none|     own|               1|      1|         1|      yes|           yes|    skilled employee|
|          < 0 DM|                  24|      critical| car (new)|  1199|       < 100 DM|          > 7 yrs|               4|    single male|         none|                4|               other| 60|            none|     own|               2|      2|         1|     none|           yes|  unskilled resident|
|          < 0 DM|                  15|        repaid| car (new)|  1403|       < 100 DM|        1 - 4 yrs|               2|         female|         none|                4|               other| 28|            none|    rent|               1|      1|         1|     none|           yes|    skilled employee|
|          < 0 DM|                  24|        repaid|  radio/tv|  1282|   101 - 500 DM|        1 - 4 yrs|               4|         female|         none|                2|               other| 32|            none|     own|               1|      2|         1|     none|           yes|  unskilled resident|
|         unknown|                  24|      critical|  radio/tv|  2424|        unknown|          > 7 yrs|               4|    single male|         none|                4|building society ...| 53|            none|     own|               2|      1|         1|     none|           yes|    skilled employee|
|          < 0 DM|                  30|  fully repaid|  business|  8072|        unknown|        0 - 1 yrs|               2|    single male|         none|                3|               other| 25|            bank|     own|               3|      1|         1|     none|           yes|    skilled employee|
|      1 - 200 DM|                  24|        repaid|car (used)| 12579|       < 100 DM|          > 7 yrs|               4|         female|         none|                2|        unknown/none| 44|            none|for free|               1|      2|         1|      yes|           yes|mangement self-em...|
|         unknown|                  24|        repaid|  radio/tv|  3430|  501 - 1000 DM|          > 7 yrs|               3|    single male|         none|                2|               other| 31|            none|     own|               1|      1|         2|      yes|           yes|    skilled employee|
+----------------+--------------------+--------------+----------+------+---------------+-----------------+----------------+---------------+-------------+-----------------+--------------------+---+----------------+--------+----------------+-------+----------+---------+--------------+--------------------+
only showing top 20 rows

Análisis Exploratorio#

[7]:
!pip3 -q install pandas
[8]:
##
## Algunas de las columnas son numéricas y
## las otras son factores.
## DM corresponde a Deutsche Marks
## se verifican algunos valores versus el code book.
##
spark_df.groupBy("checking_balance").count().toPandas()
[8]:
checking_balance count
0 unknown 394
1 < 0 DM 274
2 > 200 DM 63
3 1 - 200 DM 269
[9]:
spark_df.groupBy("savings_balance").count().toPandas()
[9]:
savings_balance count
0 unknown 183
1 101 - 500 DM 103
2 < 100 DM 603
3 501 - 1000 DM 63
4 > 1000 DM 48
[10]:
##
## El monto del préstamo va desde 250 DM hasta 18.424 DM
##
spark_df.select("amount").describe().toPandas()
[10]:
summary amount
0 count 1000
1 mean 3271.258
2 stddev 2822.736875960441
3 min 250
4 max 18424
[11]:
##
## La duración del préstamo va desde 4 hasta 72 meses
##
spark_df.select("months_loan_duration").describe().toPandas()
[11]:
summary months_loan_duration
0 count 1000
1 mean 20.903
2 stddev 12.058814452756371
3 min 4
4 max 72
[12]:
##
## La columna default indica si hubo problemas
## en el pago del préstamo (1- pago, 2- no pago)
## esta es la columna que se desea pronosticar
## 1-si, 2-no
##
spark_df.select("default").describe().toPandas()
[12]:
summary default
0 count 1000
1 mean 1.3
2 stddev 0.4584868702702515
3 min 1
4 max 2
[13]:
##
## Reemplaza {1,2} por {0,1}
##
from pyspark.sql.functions import when

spark_df = spark_df.withColumn('default', when(spark_df['default'] == 1, 0).otherwise(spark_df['default']))
spark_df = spark_df.withColumn('default', when(spark_df['default'] == 2, 1).otherwise(spark_df['default']))

spark_df.select("default").describe().toPandas()
[13]:
summary default
0 count 1000
1 mean 0.3
2 stddev 0.45848687027025153
3 min 0
4 max 1

Preprocesamiento#

[14]:
from pyspark.ml.feature import StringIndexer

for column in [
    "checking_balance",
    "credit_history",
    "purpose",
    "savings_balance",
    "employment_length",
    "personal_status",
    "other_debtors",
    "property",
    "installment_plan",
    "housing",
    "telephone",
    "foreign_worker",
    "job",
]:

    ##
    ## Se construye un codificador para transformar
    ## los strings a enteros (similar a factores en R)
    ##
    stringIndexer = StringIndexer(
        inputCol=column,
        outputCol=column + "_",
    )

    ##
    ## Se aplica el codificador a las columnas
    ## del dataset
    ##
    spark_df = stringIndexer.fit(spark_df).transform(spark_df)

##
## Columnas resultantes
##
spark_df.dtypes
[14]:
[('checking_balance', 'string'),
 ('months_loan_duration', 'int'),
 ('credit_history', 'string'),
 ('purpose', 'string'),
 ('amount', 'int'),
 ('savings_balance', 'string'),
 ('employment_length', 'string'),
 ('installment_rate', 'int'),
 ('personal_status', 'string'),
 ('other_debtors', 'string'),
 ('residence_history', 'int'),
 ('property', 'string'),
 ('age', 'int'),
 ('installment_plan', 'string'),
 ('housing', 'string'),
 ('existing_credits', 'int'),
 ('default', 'int'),
 ('dependents', 'int'),
 ('telephone', 'string'),
 ('foreign_worker', 'string'),
 ('job', 'string'),
 ('checking_balance_', 'double'),
 ('credit_history_', 'double'),
 ('purpose_', 'double'),
 ('savings_balance_', 'double'),
 ('employment_length_', 'double'),
 ('personal_status_', 'double'),
 ('other_debtors_', 'double'),
 ('property_', 'double'),
 ('installment_plan_', 'double'),
 ('housing_', 'double'),
 ('telephone_', 'double'),
 ('foreign_worker_', 'double'),
 ('job_', 'double')]

Nombres de las columnas de interes#

[15]:
##
## Columnas que contiene el dataframe
##
inputCols = [a for a, _ in spark_df.dtypes]
inputCols
[15]:
['checking_balance',
 'months_loan_duration',
 'credit_history',
 'purpose',
 'amount',
 'savings_balance',
 'employment_length',
 'installment_rate',
 'personal_status',
 'other_debtors',
 'residence_history',
 'property',
 'age',
 'installment_plan',
 'housing',
 'existing_credits',
 'default',
 'dependents',
 'telephone',
 'foreign_worker',
 'job',
 'checking_balance_',
 'credit_history_',
 'purpose_',
 'savings_balance_',
 'employment_length_',
 'personal_status_',
 'other_debtors_',
 'property_',
 'installment_plan_',
 'housing_',
 'telephone_',
 'foreign_worker_',
 'job_']
[16]:
##
## Elimina las columnas originales y deja solo las transformadas
##
inputCols = [a for a in inputCols if a + "_" not in inputCols]
inputCols
[16]:
['months_loan_duration',
 'amount',
 'installment_rate',
 'residence_history',
 'age',
 'existing_credits',
 'default',
 'dependents',
 'checking_balance_',
 'credit_history_',
 'purpose_',
 'savings_balance_',
 'employment_length_',
 'personal_status_',
 'other_debtors_',
 'property_',
 'installment_plan_',
 'housing_',
 'telephone_',
 'foreign_worker_',
 'job_']
[17]:
##
## Elimina la columna default
##
inputCols.remove("default")
inputCols
[17]:
['months_loan_duration',
 'amount',
 'installment_rate',
 'residence_history',
 'age',
 'existing_credits',
 'dependents',
 'checking_balance_',
 'credit_history_',
 'purpose_',
 'savings_balance_',
 'employment_length_',
 'personal_status_',
 'other_debtors_',
 'property_',
 'installment_plan_',
 'housing_',
 'telephone_',
 'foreign_worker_',
 'job_']

Ensamble de las columnas#

[18]:
from pyspark.ml.feature import VectorAssembler

vectorAssembler = VectorAssembler(
    inputCols=inputCols,
    outputCol="features",
)

spark_df = vectorAssembler.transform(spark_df)

spark_df.select("features").show()
+--------------------+
|            features|
+--------------------+
|[6.0,1169.0,4.0,4...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|[42.0,7882.0,2.0,...|
|(20,[0,1,2,3,4,5,...|
|[36.0,9055.0,2.0,...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|[30.0,5234.0,4.0,...|
|(20,[0,1,2,3,4,5,...|
|[48.0,4308.0,3.0,...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|[30.0,8072.0,2.0,...|
|[24.0,12579.0,4.0...|
|(20,[0,1,2,3,4,5,...|
+--------------------+
only showing top 20 rows

Conjuntos de datos#

[19]:
##
## Se usa el 90% de los datos para entrenamiento
## y el 10% restante para prueba
##
(train_df, test_df) = spark_df.randomSplit([0.9, 0.1])

Decision tree classifier#

[20]:
##
## Construcción del arbol de clasificación
##
from pyspark.ml.classification import DecisionTreeClassifier

##
## Se construye el arbol
##
clf = DecisionTreeClassifier(
    featuresCol="features",
    labelCol="default",
    predictionCol="prediction_DTC",
    probabilityCol='probability_DTC',
    rawPredictionCol='rawPrediction_DTC',
)

##
## Entrenamiento
##
clf = clf.fit(train_df)

##
## Pronóstico
##
train_df = clf.transform(train_df)
test_df = clf.transform(test_df)

[21]:
##
## Métricas de desempeño
##
import pyspark.sql.functions as F
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql.types import FloatType
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


def print_stats(dataframes, prediction_col):

    evaluator = MulticlassClassificationEvaluator(
        labelCol="default",
        predictionCol=prediction_col,
        metricName="accuracy",
    )

    print("Train Error = %g " % (1.0 - evaluator.evaluate(train_df)))
    print("Test Error = %g " % (1.0 - evaluator.evaluate(test_df)))
    print()

    for df in dataframes:

        predictionAndLabels = df.select([prediction_col, "default"])
        predictionAndLabels = predictionAndLabels.withColumn(
            prediction_col, F.col(prediction_col).cast(FloatType())
        )
        predictionAndLabels = predictionAndLabels.withColumn(
            "default", F.col("default").cast(FloatType())
        )

        metrics = MulticlassMetrics(predictionAndLabels.rdd.map(tuple))

        print(metrics.confusionMatrix().toArray())
        print()
[22]:
print_stats(dataframes=[train_df, test_df], prediction_col = 'prediction_DTC')
Train Error = 0.209071
Test Error = 0.270833

[[604.  32.]
 [157. 111.]]

[[57.  7.]
 [19. 13.]]

Random forest classifier#

[23]:
##
## Construcción del arbol de clasificación
##
from pyspark.ml.classification import RandomForestClassifier

##
## Se construye el arbol
##
clf = RandomForestClassifier(
    featuresCol="features",
    labelCol="default",
    predictionCol="prediction_RFC",
    probabilityCol='probability_RFC',
    rawPredictionCol='rawPrediction_RFC',
    numTrees = 20,
)

##
## Entrenamiento
##
clf = clf.fit(train_df)

##
## Pronóstico
##
train_df = clf.transform(train_df)
test_df = clf.transform(test_df)

print_stats(dataframes=[train_df, test_df], prediction_col = 'prediction_RFC')
Train Error = 0.19469
Test Error = 0.25

[[624.  12.]
 [164. 104.]]

[[63.  1.]
 [23.  9.]]

Gradient-boosted tree classifier#

[24]:
##
## Construcción del arbol de clasificación
##
from pyspark.ml.classification import GBTClassifier

##
## Se construye el arbol
##
clf = GBTClassifier(
    featuresCol="features",
    labelCol="default",
    predictionCol="prediction_GBTC",
    maxIter = 20,
)

##
## Entrenamiento
##
clf = clf.fit(train_df)

##
## Pronóstico
##
train_df = clf.transform(train_df)
test_df = clf.transform(test_df)

print_stats(dataframes=[train_df, test_df], prediction_col = 'prediction_GBTC')
Train Error = 0.0752212
Test Error = 0.21875

[[616.  20.]
 [ 48. 220.]]

[[56.  8.]
 [13. 19.]]