Word count in Apache Hive (interactive)#
Last modified: May 17, 2022
Cell magic %%hive#
[1]:
from IPython.core.magic import Magics, cell_magic, line_magic, magics_class
from pexpect import spawn

TIMEOUT = 60
PROG = "hive"
PROMPT = ["\r\n > ", "\r\nhive> "]
QUIT = "quit;"


@magics_class
class Magic(Magics):
    def __init__(self, shell):
        super().__init__(shell)
        # Launch the Hive CLI and wait for its first prompt.
        self.app = spawn(PROG, timeout=TIMEOUT)
        self.app.expect(PROMPT)

    @cell_magic
    def hive(self, line, cell):
        # Submit the cell to Hive line by line, skipping blank lines.
        cell_lines = [cell_line.strip() for cell_line in cell.split("\n")]
        cell_lines = [cell_line for cell_line in cell_lines if cell_line != ""]
        for cell_line in cell_lines:
            self.app.sendline(cell_line)
            self.app.expect(PROMPT, timeout=TIMEOUT)
            # Print Hive's output, suppressing the echoed input lines.
            output = self.app.before.decode()
            output = output.replace("\r\n", "\n")
            output = output.split("\n")
            output = [output_line.strip() for output_line in output]
            for output_line in output:
                if output_line not in cell_lines:
                    print(output_line)
        return None

    @line_magic
    def quit(self, line):
        self.app.sendline(QUIT)


def load_ipython_extension(ip):
    ip.register_magics(Magic(ip))


load_ipython_extension(ip=get_ipython())
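As an aside, the echo-suppression step of the magic can be illustrated in isolation: the Hive CLI echoes each submitted statement back, so the magic prints only the lines that are not among the submitted ones. A minimal sketch (`filter_echo` is a hypothetical name, not part of the magic above):

```python
def filter_echo(output_lines, cell_lines):
    """Drop output lines that merely echo the submitted statements."""
    return [line for line in output_lines if line not in cell_lines]

# Raw terminal output for one statement: the echoed input plus Hive's reply.
raw = ["SELECT 1;", "OK", "1", "Time taken: 0.1 seconds"]
print(filter_echo(raw, ["SELECT 1;"]))  # ['OK', '1', 'Time taken: 0.1 seconds']
```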
Test data#
[2]:
!mkdir -p /tmp/wordcount/
[3]:
%%writefile /tmp/wordcount/text0.txt
Analytics is the discovery, interpretation, and communication of meaningful patterns
in data. Especially valuable in areas rich with recorded information, analytics relies
on the simultaneous application of statistics, computer programming and operations research
to quantify performance.
Organizations may apply analytics to business data to describe, predict, and improve business
performance. Specifically, areas within analytics include predictive analytics, prescriptive
analytics, enterprise decision management, descriptive analytics, cognitive analytics, Big
Data Analytics, retail analytics, store assortment and stock-keeping unit optimization,
marketing optimization and marketing mix modeling, web analytics, call analytics, speech
analytics, sales force sizing and optimization, price and promotion modeling, predictive
science, credit risk analysis, and fraud analytics. Since analytics can require extensive
computation (see big data), the algorithms and software used for analytics harness the most
current methods in computer science, statistics, and mathematics.
Writing /tmp/wordcount/text0.txt
[4]:
%%writefile /tmp/wordcount/text1.txt
The field of data analysis. Analytics often involves studying past historical data to
research potential trends, to analyze the effects of certain decisions or events, or to
evaluate the performance of a given tool or scenario. The goal of analytics is to improve
the business by gaining knowledge which can be used to make improvements or changes.
Writing /tmp/wordcount/text1.txt
[5]:
%%writefile /tmp/wordcount/text2.txt
Data analytics (DA) is the process of examining data sets in order to draw conclusions
about the information they contain, increasingly with the aid of specialized systems
and software. Data analytics technologies and techniques are widely used in commercial
industries to enable organizations to make more-informed business decisions and by
scientists and researchers to verify or disprove scientific models, theories and
hypotheses.
Writing /tmp/wordcount/text2.txt
Table creation#
This application uses two tables:
docs: holds the contents of the text files, with one record per line.
word_counts: holds each word together with its count.
The code below first drops these tables if they already exist in the system, and then creates the docs table with a single field of type STRING.
[6]:
%%hive
DROP TABLE IF EXISTS docs;
DROP TABLE IF EXISTS word_counts;
CREATE TABLE docs (line STRING);
OK
Time taken: 6.001 seconds
OK
Time taken: 0.021 seconds
OK
Time taken: 0.492 seconds
Loading the data#
The following code loads every file in the wordcount directory directly into the docs table. The first five records of the table are then printed to verify that the data was read correctly.
[7]:
%%hive
LOAD DATA LOCAL INPATH "/tmp/wordcount/" OVERWRITE INTO TABLE docs;
SELECT * FROM docs LIMIT 5;
Loading data to table default.docs
OK
Time taken: 1.563 seconds
OK
Analytics is the discovery, interpretation, and communication of meaningful patterns
in data. Especially valuable in areas rich with recorded information, analytics relies
on the simultaneous application of statistics, computer programming and operations research
to quantify performance.
Time taken: 1.04 seconds, Fetched: 5 row(s)
Data transformation#
Once the files are loaded, each line is split into words using split(line, '\\s'); the pattern \\s makes the split occur at whitespace characters, so split() produces an array of words. Hive's explode() function, used together with SELECT, then generates a new record for each word in line.
[8]:
%%hive
SELECT explode(split(line, '\\s')) AS word FROM docs LIMIT 5;
OK
Analytics
is
the
discovery,
interpretation,
Time taken: 0.312 seconds, Fetched: 5 row(s)
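For comparison, the split-then-explode step can be sketched in plain Python (an illustrative aside over two hypothetical sample lines, not part of the Hive flow):

```python
import re

# Two sample lines standing in for records of the docs table.
lines = [
    "Analytics is the discovery, interpretation, and communication",
    "in data. Especially valuable in areas rich with recorded information",
]

# split(line, '\\s') in Hive returns an array of words; explode() turns
# that array into one record per element. A nested comprehension does both:
words = [w for line in lines for w in re.split(r"\s", line)]
print(words[:5])  # ['Analytics', 'is', 'the', 'discovery,', 'interpretation,']
```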
Word count#
To compute the counts, the expression SELECT word, count(1) AS count ... GROUP BY word produces a table with two columns: the first column (word) contains each word in the text, and the second column holds the number of times that word appears in the records generated by the expression SELECT explode(split(line, '\\s')) AS word FROM docs.
[9]:
%%hive
CREATE TABLE word_counts
AS
SELECT word, count(1) AS count
FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY
word
ORDER BY
word;
Query ID = root_20220527151554_8829d099-a624-4639-8dc7-583cb62d3c87
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1653664460816_0001, Tracking URL = http://a668e4f67ef8:8088/proxy/application_1653664460816_0001/
Kill Command = /opt/hadoop/bin/hadoop job -kill job_1653664460816_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-05-27 15:16:03,483 Stage-1 map = 0%, reduce = 0%
2022-05-27 15:16:07,841 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.0 sec
2022-05-27 15:16:11,970 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.07 sec
MapReduce Total cumulative CPU time: 3 seconds 70 msec
Ended Job = job_1653664460816_0001
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1653664460816_0002, Tracking URL = http://a668e4f67ef8:8088/proxy/application_1653664460816_0002/
Kill Command = /opt/hadoop/bin/hadoop job -kill job_1653664460816_0002
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2022-05-27 15:16:23,576 Stage-2 map = 0%, reduce = 0%
2022-05-27 15:16:27,696 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.16 sec
2022-05-27 15:16:31,806 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 2.71 sec
MapReduce Total cumulative CPU time: 2 seconds 710 msec
Ended Job = job_1653664460816_0002
Moving data to directory hdfs://0.0.0.0:9000/user/hive/warehouse/word_counts
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.07 sec HDFS Read: 10195 HDFS Write: 4345 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.71 sec HDFS Read: 9498 HDFS Write: 1731 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 780 msec
OK
Time taken: 38.791 seconds
Results#
To inspect the results, a SELECT is run over the word_counts table.
[10]:
%%hive
SELECT * FROM word_counts LIMIT 10;
OK
20
(DA) 1
(see 1
Analytics 2
Analytics, 1
Big 1
Data 3
Especially 1
Organizations 1
Since 1
Time taken: 0.096 seconds, Fetched: 10 row(s)
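As a sanity check, the counting stage can be reproduced locally with Python's collections.Counter. This is only a sketch over two hypothetical sample lines; the real data lives in /tmp/wordcount/:

```python
import re
from collections import Counter

# Sample records standing in for the docs table.
lines = [
    "Analytics is the discovery, interpretation, and communication",
    "in data. Especially valuable in areas rich with recorded information",
]

# Like Hive's split on '\\s', re.split breaks at every single whitespace
# character, so runs of whitespace produce empty-string "words" -- which
# can account for the blank word appearing in the result above.
counts = Counter(w for line in lines for w in re.split(r"\s", line))
print(counts["in"])  # 2
```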
[11]:
%quit