Forecasting book popularity (MLlib: RDD-based)#
30 min | Last modified: November 6, 2020
Problem definition#
The O’Reilly publishing house wants to build an analytical tool that allows an editor to estimate the relative popularity of a new book before its release, in order to prioritize which titles to publish and even to reject potential editorial projects. To address this problem, a database of the 100 best-selling O’Reilly books of 2011 is available. The database contains each book’s title, its description, and its popularity rank. The working hypothesis is that the presence of certain words in a book’s description makes it possible to determine its popularity.
Preparing the data file#
[1]:
!wget https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/oreilly.csv
--2020-11-06 23:43:03-- https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/oreilly.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.48.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.48.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203329 (199K) [text/plain]
Saving to: ‘oreilly.csv’
oreilly.csv 100%[===================>] 198.56K 173KB/s in 1.1s
2020-11-06 23:43:06 (173 KB/s) - ‘oreilly.csv’ saved [203329/203329]
Spark initialization#
[2]:
#
# Load the Spark libraries
#
import findspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
findspark.init()
APP_NAME = "spark-app"
conf = SparkConf().setAppName(APP_NAME)
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
Data loading#
[3]:
!pip3 install -q pandas
[4]:
#
# This file is particularly difficult to read
# directly in Spark, so it is read with Pandas
# and then loaded into Spark.
#
import pandas as pd
pandas_df = pd.read_csv(
    "oreilly.csv", sep=",", thousands=None, decimal=".", encoding="latin-1"
)
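As a quick sanity check (this cell is not part of the original notebook), the dimensions of the Pandas table can be inspected before loading it into Spark:
[ ]:
#
# Number of rows and columns in the Pandas DataFrame
#
pandas_df.shape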
[5]:
#
# Create the table schema in Spark
#
from pyspark.sql.types import *
mySchema = StructType([
    StructField("IP_Family", StringType(), True),
    StructField("BOOK_title", StringType(), True),
    StructField("BOOK_ISBN", StringType(), True),
    StructField("Rank", IntegerType(), True),
    StructField("Long_Desc", StringType(), True),
])
#
# Create the Spark DataFrame from the Pandas DataFrame
# and extract its underlying RDD
#
rdd = spark.createDataFrame(pandas_df, schema=mySchema).rdd
#
# First two records
#
rdd.collect()[0:2]
[5]:
[Row(IP_Family='9780596000271.IP', BOOK_title='Programming Perl, 3E', BOOK_ISBN='9780596000271', Rank=1, Long_Desc='Perl is a powerful programming language that has grown in popularity since it first appeared in 1988. The first edition of this book, <i>Programming Perl,</i> hit the shelves in 1990, and was quickly adopted as the undisputed bible of the language. Since then, Perl has grown with the times, and so has this book.\r\r\r\r<i>Programming Perl</i> is not just a book about Perl. It is also a unique introduction to the language and its culture, as one might expect only from its authors. Larry Wall is the inventor of Perl, and provides a unique perspective on the evolution of Perl and its future direction. Tom Christiansen was one of the first champions of the language, and lives and breathes the complexities of Perl internals as few other mortals do. Jon Orwant is the editor of \r\r<i>The Perl Journal,</i> which has brought together the Perl community as a common forum for new developments in Perl.\r\r\r\rAny Perl book can show the syntax of Perl\'s functions, but only this one is a comprehensive guide to all the nooks and crannies of the language. Any Perl book can explain typeglobs, pseudohashes, and closures, but only this one shows how they really work. Any Perl book can say that <i>my</i> is faster than <i>local,</i> but only this one explains why. Any Perl book can have a title, but only this book is affectionately known by all Perl programmers as "The Camel." \r\r\r\rThis third edition of <i>Programming Perl</i> has been expanded to cover version 5.6 of this maturing language. New topics include threading, the compiler, Unicode, and other new features that have been added since the previous edition.'),
Row(IP_Family='9781565923928.IP', BOOK_title='Javascript: The Definitive Guide, 3E', BOOK_ISBN='9781565923928', Rank=2, Long_Desc="JavaScript is a powerful scripting language that can be embedded directly in HTML. It allows you to create dynamic, interactive Web-based applications that run completely within a Web browser; you don't have to do any server-side programming, like writing CGI scripts. \r\r\r\rJavaScript is a simpler language than Java. It can be embedded directly in Web pages without compilation, so it is more flexible and easier to use for simple tasks like animation. However, although you can write reasonably robust and complete Web applications using JavaScript alone, JavaScript is not a substitute for Java. In fact, JavaScript is a good client-side complement to Java; using the two together allows you to create more complex applications than are possible with JavaScript alone.\r\r\r\r<i>JavaScript: The Definitive Guide</i> provides a thorough description of the core JavaScript language and its client-side framework, complete with sophisticated examples that show you how to handle common tasks, like validating form data and working with cookies. The book also contains a definitive, in-depth reference section that covers every core and client-side JavaScript function, object, method, property, constructor, and event handler. This book is an indispensable reference for all JavaScript programmers, regardless of experience level.\r\r\r\rThis third edition of <i>JavaScript: The Definitive Guide</i> describes the latest version of the language, JavaScript 1.2, as supported by Netscape Navigator 4 and Internet Explorer 4. The book also covers JavaScript 1.1, which is the first industry-standard version known as ECMAScript. The new features of JavaScript 1.2, which are likely to be embodied in a later ECMAScript standard release, are clearly indicated, so that you can use them as appropriate in your scripts.")]
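As an additional check (this cell is not part of the original notebook), the number of loaded records can be compared against the 100 titles mentioned in the problem definition:
[ ]:
#
# Number of records loaded into Spark
#
rdd.count()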
Text preparation#
[6]:
#
# Select column 3 (Rank) and column 4 (Long_Desc)
#
rdd_rank = rdd.map(lambda w: w[3])
rdd_text = rdd.map(lambda w: w[4])
[7]:
#
# Install NLTK
#
!pip3 install -q nltk
[8]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
[8]:
True
[9]:
#
# Convert the rank into a binary label (1 if Rank >= 50,
# 0 otherwise). This will be the output variable of the
# classification models below.
#
rdd_rank = rdd_rank.map(lambda w: 1 if w >= 50 else 0)
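To see how balanced the two classes are, the labels can be counted (this cell is a hypothetical addition, not part of the original notebook):
[ ]:
#
# Number of books per class label
#
rdd_rank.countByValue()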
[10]:
#
# Process the book descriptions
#
from nltk.tokenize import word_tokenize
#
# Split the text into words
#
rdd_text = rdd_text.map(lambda w: word_tokenize(w))
#
# Strip non-letter characters from each token and
# drop the tokens that become empty
#
import re
rdd_text = rdd_text.map(lambda w: [re.sub(r'[^A-Za-z]', '', word) for word in w])
rdd_text = rdd_text.map(lambda w: [word for word in w if word != ''])
#
# Convert the text to lowercase
#
rdd_text = rdd_text.map(lambda w: [word.lower() for word in w])
#
# Remove stopwords
#
STOPWORDS = nltk.corpus.stopwords.words('english')
rdd_text = rdd_text.map(lambda w: [word for word in w if word not in STOPWORDS])
#
# Reduce each word to its stem
#
from nltk.stem import PorterStemmer
ps = PorterStemmer()
rdd_text = rdd_text.map(lambda w: [ps.stem(word) for word in w])
#
# Build the document-term matrix using TF-IDF
#
from pyspark.mllib.feature import HashingTF, IDF
hashingTF = HashingTF()
tf = hashingTF.transform(rdd_text)
tf.cache()
idf = IDF(minDocFreq=2).fit(tf)
tfidf = idf.transform(tf)
tfidf.collect()[0]
#
# Build the RDD for model training by zipping
# the two RDDs (rank, description)
#
rdd_LR = rdd_rank.zip(tfidf)
#
# Create the labeled points for the models
#
from pyspark.mllib.regression import LabeledPoint
rdd_LR = rdd_LR.map(lambda w: LabeledPoint(w[0], w[1]))
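Note that the models below are trained and evaluated on the same data, so the reported errors are training errors and tend to be optimistic. A hold-out split, sketched here as a hypothetical addition (the 70/30 weights and the seed are assumptions), would give a fairer estimate:
[ ]:
#
# Hypothetical train/test split; the cells below keep
# using the full rdd_LR, as in the original notebook.
#
train_LR, test_LR = rdd_LR.randomSplit([0.7, 0.3], seed=12345)
train_LR.count(), test_LR.count()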
Logistic Regression#
[11]:
#
# Specification and training of the logistic regression model
#
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
model = LogisticRegressionWithLBFGS.train(
    data=rdd_LR,
    regParam=0.0,
    regType='l2',
    intercept=False,
    numClasses=2,
)
#
# Model evaluation on the training data
#
labelsAndPreds = rdd_LR.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(rdd_LR.count())
print("Training Error = " + str(trainErr))
Training Error = 0.01
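Beyond the raw training error, a confusion matrix and accuracy can be computed with MulticlassMetrics; this cell is a hypothetical addition, not part of the original notebook, and still measures performance on the training data:
[ ]:
from pyspark.mllib.evaluation import MulticlassMetrics

#
# MulticlassMetrics expects (prediction, label) pairs,
# so the order of labelsAndPreds is swapped here.
#
predsAndLabels = labelsAndPreds.map(lambda lp: (float(lp[1]), float(lp[0])))
metrics = MulticlassMetrics(predsAndLabels)
print(metrics.confusionMatrix().toArray())
print("Accuracy = " + str(metrics.accuracy))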
Linear support vector machines#
[ ]:
#
# Specification, training, and evaluation of the linear SVM
# model (evaluated on the training data)
#
from pyspark.mllib.classification import SVMWithSGD

model = SVMWithSGD.train(rdd_LR, iterations=100)
labelsAndPreds = rdd_LR.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(rdd_LR.count())
print("Training Error = " + str(trainErr))
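For the SVM, clearing the decision threshold makes predict return raw margins instead of 0/1 labels, which allows computing the area under the ROC curve; this cell is a hypothetical addition, evaluated on the training data:
[ ]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

#
# With the threshold cleared, predict() returns raw scores.
#
model.clearThreshold()
scoresAndLabels = rdd_LR.map(lambda p: (float(model.predict(p.features)), p.label))
print("Area under ROC = " + str(BinaryClassificationMetrics(scoresAndLabels).areaUnderROC))

#
# Restore the default threshold (0.0) for 0/1 predictions
#
model.setThreshold(0.0)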
Decision Tree#
[ ]:
#
# Specification and training of the decision tree model
#
from pyspark.mllib.tree import DecisionTree

model = DecisionTree.trainClassifier(
    rdd_LR,
    numClasses=2,
    categoricalFeaturesInfo={},
    impurity="gini",
    maxDepth=3,
    maxBins=5,
    # minInstancesPerNode=1,
    # minInfoGain=0.0,
)
model
[ ]:
model.toDebugString()
[ ]:
#
# DecisionTreeModel.predict cannot be called inside map()
# in the RDD-based Python API, so predict on an RDD of
# feature vectors and zip the result with the labels.
#
predictions = model.predict(rdd_LR.map(lambda p: p.features))
labelsAndPreds = rdd_LR.map(lambda p: p.label).zip(predictions)
trainErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(rdd_LR.count())
print("Training Error = " + str(trainErr))