Data processing with Hive#

  • Last modified: May 17, 2022 | YouTube

This tutorial is based on https://es.hortonworks.com/tutorial/how-to-process-data-with-apache-hive/

The goal of this tutorial is to implement Hive queries that analyze, process, and filter the data stored in a data warehouse, using standard SQL.

Cell magic %%hive#

[1]:
from IPython.core.magic import Magics, cell_magic, line_magic, magics_class
from pexpect import spawn

TIMEOUT = 60
PROG = "hive"
PROMPT = ["\r\n    > ", "\r\nhive> "]
QUIT = "quit;"


@magics_class
class Magic(Magics):
    def __init__(self, shell):
        super().__init__(shell)
        self.app = spawn(PROG, timeout=TIMEOUT)
        self.app.expect(PROMPT)

    @cell_magic
    def hive(self, line, cell):
        cell_lines = [cell_line.strip() for cell_line in cell.split("\n")]
        cell_lines = [cell_line for cell_line in cell_lines if cell_line != ""]
        for cell_line in cell_lines:
            self.app.sendline(cell_line)
            self.app.expect(PROMPT, timeout=TIMEOUT)
            output = self.app.before.decode()
            output = output.replace("\r\n", "\n")
            output = output.split("\n")
            output = [output_line.strip() for output_line in output]
            for output_line in output:
                if output_line not in cell_lines:
                    print(output_line)
        return None

    @line_magic
    def quit(self, line):
        self.app.sendline(QUIT)


def load_ipython_extension(ip):
    ip.register_magics(Magic(ip))


load_ipython_extension(ip=get_ipython())

Preparation#

The data is stored in the drivers folder of the current directory. Next, the /tmp/drivers folder is created in the Hadoop file system (HDFS).

[2]:
!wget https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/drivers.csv
!wget https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/timesheet.csv
--2022-05-17 14:41:50--  https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/drivers.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2043 (2.0K) [text/plain]
Saving to: ‘drivers.csv’

drivers.csv         100%[===================>]   2.00K  --.-KB/s    in 0.001s

2022-05-17 14:41:51 (1.50 MB/s) - ‘drivers.csv’ saved [2043/2043]

--2022-05-17 14:41:51--  https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/timesheet.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26205 (26K) [text/plain]
Saving to: ‘timesheet.csv’

timesheet.csv       100%[===================>]  25.59K  --.-KB/s    in 0.01s

2022-05-17 14:41:52 (1.90 MB/s) - ‘timesheet.csv’ saved [26205/26205]

[3]:
#
# Create the drivers folder in HDFS
#
!hdfs dfs -mkdir /tmp/drivers

#
# Copy the files to HDFS
#
!hdfs dfs -copyFromLocal drivers.csv  /tmp/drivers/
!hdfs dfs -copyFromLocal timesheet.csv  /tmp/drivers/

#
# List the files in HDFS to verify
# that they were copied correctly.
#
!hdfs dfs -ls /tmp/drivers/*
mkdir: `/tmp/drivers': File exists
-rw-r--r--   1 root supergroup       2043 2022-05-17 14:41 /tmp/drivers/drivers.csv
Found 1 items
-rwxr-xr-x   1 root supergroup       1800 2022-05-17 13:33 /tmp/drivers/specific-columns/000000_0
-rw-r--r--   1 root supergroup      26205 2022-05-17 14:41 /tmp/drivers/timesheet.csv

A file's contents can be partially displayed with the tail command. It is used for a quick inspection of the file contents.
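`hdfs dfs -tail` prints the last kilobyte of a file. A rough local analogue can be sketched in plain Python; the helper name `tail_bytes` is hypothetical, used only for illustration:

```python
def tail_bytes(path, n=1024):
    """Return the last n bytes of a local file, decoded as text,
    roughly mimicking `hdfs dfs -tail` (which shows the final 1 KB)."""
    with open(path, "rb") as f:
        f.seek(0, 2)                   # jump to the end of the file
        f.seek(max(0, f.tell() - n))   # back up n bytes (or to the start)
        return f.read().decode(errors="replace")
```

Seeking from the end avoids reading the whole file, which matters when the file is large.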

[4]:
#
# Print the end of the drivers file
#
!hdfs dfs -tail /tmp/drivers/drivers.csv
Box 213- 8948 Nec Ave,Y,hours
27,Mark Lochbihler,392603159,8355 Ipsum St.,Y,hours
28,Olivier Renault,959908181,P.O. Box 243- 6509 Erat. Avenue,Y,hours
29,Teddy Choi,185502192,P.O. Box 106- 7003 Amet Rd.,Y,hours
30,Dan Rice,282307061,Ap #881-9267 Mollis Avenue,Y,hours
31,Rommel Garcia,858912101,P.O. Box 945- 6015 Sociis St.,Y,hours
32,Ryan Templeton,290304287,765-6599 Egestas. Av.,Y,hours
33,Sridhara Sabbella,967409015,Ap #477-2507 Sagittis Avenue,Y,hours
34,Frank Romano,391407216,Ap #753-6814 Quis Ave,Y,hours
35,Emil Siemes,971401151,321-2976 Felis Rd.,Y,hours
36,Andrew Grande,245303216,Ap #685-9598 Egestas Rd.,Y,hours
37,Wes Floyd,190504074,P.O. Box 269- 9611 Nulla Street,Y,hours
38,Scott Shaw,386411175,276 Lobortis Road,Y,hours
39,David Kaiser,967706052,9185 At Street,Y,hours
40,Nicolas Maillard,208510217,1027 Quis Rd.,Y,hours
41,Greg Phillips,308103116,P.O. Box 847- 5961 Arcu. Road,Y,hours
42,Randy Gelhausen,853302254,145-4200 In- Avenue,Y,hours
43,Dave Patton,977706052,3028 A- St.,Y,hours
[5]:
!hdfs dfs -tail /tmp/drivers/timesheet.csv
42,36,56,2612
42,37,48,2550
42,38,55,2527
42,39,57,2723
42,40,55,2728
42,41,50,2557
42,42,53,2773
42,43,55,2786
42,44,54,2638
42,45,57,2542
42,46,48,2526
42,47,50,2795
42,48,53,2609
42,49,58,2584
42,50,48,2692
42,51,50,2566
42,52,48,2735
43,1,46,2622
43,2,47,2688
43,3,50,2544
43,4,56,2573
43,5,54,2691
43,6,52,2796
43,7,53,2564
43,8,58,2624
43,9,50,2528
43,10,57,2721
43,11,51,2722
43,12,59,2681
43,13,52,2683
43,14,46,2663
43,15,53,2579
43,16,56,2519
43,17,54,2584
43,18,47,2665
43,19,55,2511
43,20,60,2677
43,21,52,2585
43,22,60,2719
43,23,48,2655
43,24,48,2641
43,25,53,2512
43,26,48,2612
43,27,58,2614
43,28,60,2551
43,29,55,2682
43,30,49,2504
43,31,51,2701
43,32,57,2554
43,33,52,2730
43,34,54,2783
43,35,51,2681
43,36,51,2655
43,37,46,2629
43,38,58,2739
43,39,47,2535
43,40,50,2512
43,41,51,2701
43,42,55,2538
43,43,58,2775
43,44,56,2545
43,45,46,2671
43,46,57,2680
43,47,50,2572
43,48,52,2517
43,49,56,2743
43,50,59,2665
43,51,58,2593
43,52,48,2764

Creating the temp_drivers table#

Next, the temp_drivers table, stored on disk as a text file, is created to hold the drivers' information.

[6]:
%%hive
DROP TABLE IF EXISTS temp_drivers;
CREATE TABLE temp_drivers (col_value STRING) STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");
OK
Time taken: 5.696 seconds
OK
Time taken: 0.56 seconds

Next, the tables in the current database whose names start with t are listed, to verify that the table was created.

[7]:
%%hive
SHOW TABLES LIKE 't*';
OK
temp_drivers
Time taken: 0.11 seconds, Fetched: 1 row(s)

Loading data into the temp_drivers table#

The following statement loads the contents of the drivers.csv file into the temp_drivers table.

[8]:
%%hive
LOAD DATA INPATH '/tmp/drivers/drivers.csv' OVERWRITE INTO TABLE temp_drivers;
Loading data to table default.temp_drivers
OK
Time taken: 0.708 seconds

Hive consumes the data, that is, it moves the files into the warehouse, so the drivers.csv file is removed from the /tmp/drivers folder.

[9]:
!hdfs dfs -ls /tmp/drivers
Found 2 items
drwxr-xr-x   - root supergroup          0 2022-05-17 13:33 /tmp/drivers/specific-columns
-rw-r--r--   1 root supergroup      26205 2022-05-17 14:41 /tmp/drivers/timesheet.csv

The first 10 records of the table are retrieved for a quick inspection, to verify that the data was loaded correctly.

[10]:
%%hive
SELECT * FROM temp_drivers LIMIT 10;
OK
10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
11,Jamie Engesser,262112338,366-4125 Ac Street,N,miles
12,Paul Coddin,198041975,Ap #622-957 Risus. Street,Y,hours
13,Joe Niemiec,139907145,2071 Hendrerit. Ave,Y,hours
14,Adis Cesir,820812209,Ap #810-1228 In St.,Y,hours
15,Rohit Bakshi,239005227,648-5681 Dui- Rd.,Y,hours
16,Tom McCuch,363303105,P.O. Box 313- 962 Parturient Rd.,Y,hours
17,Eric Mizell,123808238,P.O. Box 579- 2191 Gravida. Street,Y,hours
18,Grant Liu,171010151,Ap #928-3159 Vestibulum Av.,Y,hours
19,Ajay Singh,160005158,592-9430 Nonummy Avenue,Y,hours
Time taken: 1.005 seconds, Fetched: 10 row(s)

Creating the drivers table#

Next, the drivers table is created; it will hold the information extracted from the temp_drivers table.

[11]:
%%hive

DROP TABLE IF EXISTS drivers;

CREATE TABLE drivers (driverId  INT,
                      name      STRING,
                      ssn       BIGINT,
                      location  STRING,
                      certified STRING,
                      wageplan  STRING)

TBLPROPERTIES ("skip.header.line.count"="1");
OK
Time taken: 0.018 seconds
OK
Time taken: 0.061 seconds

Since each record of temp_drivers is a single line of text, a regular expression (regexp_extract) is applied to split the text at the commas. The quantifier {1} captures the first field after the split, {2} the second, and so on. The alias after each regexp_extract call gives the name of the corresponding column in the drivers table.
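The same pattern can be tried locally with Python's `re` module: in a repeated group, the capture keeps the last repetition, so repeating the group n times yields the n-th field. A minimal sketch, using a sample row from the data above (the helper name `extract_field` is hypothetical):

```python
import re

# A sample row of temp_drivers (one text column per line)
row = "10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles"

def extract_field(value, n):
    """Mimic Hive's regexp_extract(col_value, '^(?:([^,]*),?){n}', 1):
    the group repeats n times, so the capture keeps the n-th field."""
    return re.match(r"^(?:([^,]*),?){%d}" % n, value).group(1)

print(extract_field(row, 1))  # '10'   (driverId)
print(extract_field(row, 2))  # 'George Vetticaden' (name)
print(extract_field(row, 6))  # 'miles' (wageplan)
```

Note that this split does not handle quoted fields containing commas; for these files all fields are comma-free, so the simple pattern suffices.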

[12]:
%%hive
INSERT OVERWRITE TABLE drivers
SELECT
    regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) driverId,
    regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) name,
    regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) ssn,
    regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) location,
    regexp_extract(col_value, '^(?:([^,]*),?){5}', 1) certified,
    regexp_extract(col_value, '^(?:([^,]*),?){6}', 1) wageplan
FROM
    temp_drivers;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517144213_e23a371d-a2df-47f4-bab3-aaad8ab30155
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1652793922537_0007, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0007/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-05-17 14:42:20,276 Stage-1 map = 0%,  reduce = 0%
2022-05-17 14:42:24,437 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.94 sec
MapReduce Total cumulative CPU time: 1 seconds 940 msec
Ended Job = job_1652793922537_0007
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://0.0.0.0:9000/user/hive/warehouse/drivers/.hive-staging_hive_2022-05-17_14-42-13_686_1675122652526265324-1/-ext-10000
Loading data to table default.drivers
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.94 sec   HDFS Read: 6812 HDFS Write: 2036 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 940 msec
OK
Time taken: 13.097 seconds

A SELECT statement is used to check the result of the data load.

[13]:
%%hive
SELECT * FROM drivers LIMIT 10;
OK
11      Jamie Engesser  262112338       366-4125 Ac Street      N       miles
12      Paul Coddin     198041975       Ap #622-957 Risus. Street       Y       hours
13      Joe Niemiec     139907145       2071 Hendrerit. Ave     Y       hours
14      Adis Cesir      820812209       Ap #810-1228 In St.     Y       hours
15      Rohit Bakshi    239005227       648-5681 Dui- Rd.       Y       hours
16      Tom McCuch      363303105       P.O. Box 313- 962 Parturient Rd.        Y       hours
17      Eric Mizell     123808238       P.O. Box 579- 2191 Gravida. Street      Y       hours
18      Grant Liu       171010151       Ap #928-3159 Vestibulum Av.     Y       hours
19      Ajay Singh      160005158       592-9430 Nonummy Avenue Y       hours
20      Chris Harris    921812303       883-2691 Proin Avenue   Y       hours
Time taken: 0.112 seconds, Fetched: 10 row(s)

Creating the temp_timesheet table#

The table is created and the data from the timesheet.csv file is loaded.

[14]:
%%hive

DROP TABLE IF EXISTS temp_timesheet;

CREATE TABLE temp_timesheet (col_value string)
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");

LOAD DATA INPATH '/tmp/drivers/timesheet.csv' OVERWRITE INTO TABLE temp_timesheet;

SELECT * FROM temp_timesheet LIMIT 10;
OK
Time taken: 0.017 seconds
OK
Time taken: 0.216 seconds
Loading data to table default.temp_timesheet
OK
Time taken: 0.22 seconds
OK
10,1,70,3300
10,2,70,3300
10,3,60,2800
10,4,70,3100
10,5,70,3200
10,6,70,3300
10,7,70,3000
10,8,70,3300
10,9,70,3200
10,10,50,2500
Time taken: 0.096 seconds, Fetched: 10 row(s)

Creating the timesheet table#

The same procedure as for the previous tables is followed.

[15]:
%%hive

DROP TABLE IF EXISTS timesheet;

CREATE TABLE timesheet (driverId INT, week INT, hours_logged INT , miles_logged INT)
TBLPROPERTIES ("skip.header.line.count"="1");

INSERT OVERWRITE TABLE timesheet
SELECT
    regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) driverId,
    regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) week,
    regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) hours_logged,
    regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) miles_logged
FROM
    temp_timesheet;

SELECT * FROM timesheet LIMIT 10;
OK
Time taken: 0.018 seconds
OK
Time taken: 0.042 seconds
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517144228_ef115f87-6300-428d-a380-9d530398d507
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1652793922537_0008, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0008/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-05-17 14:42:35,597 Stage-1 map = 0%,  reduce = 0%
2022-05-17 14:42:39,707 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.85 sec
MapReduce Total cumulative CPU time: 2 seconds 850 msec
Ended Job = job_1652793922537_0008
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://0.0.0.0:9000/user/hive/warehouse/timesheet/.hive-staging_hive_2022-05-17_14-42-28_623_1595757489952936503-1/-ext-10000
Loading data to table default.timesheet
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 2.85 sec   HDFS Read: 30739 HDFS Write: 24476 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 850 msec
OK
Time taken: 12.328 seconds
OK
10      2       70      3300
10      3       60      2800
10      4       70      3100
10      5       70      3200
10      6       70      3300
10      7       70      3000
10      8       70      3300
10      9       70      3200
10      10      50      2500
10      11      70      2900
Time taken: 0.093 seconds, Fetched: 10 row(s)

Hours and miles per driver per year#

The following query computes, for each driver, the total hours and miles logged per year.
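The GROUP BY aggregation can be simulated in plain Python over a few sample rows: group the rows by driverId and accumulate the two sums. The sample values are illustrative, matching the shape of the timesheet table:

```python
from collections import defaultdict

# Sample (driverId, week, hours_logged, miles_logged) rows
rows = [(10, 1, 70, 3300), (10, 2, 70, 3300), (11, 1, 65, 3100)]

# driverId -> [sum(hours_logged), sum(miles_logged)], like GROUP BY driverId
totals = defaultdict(lambda: [0, 0])
for driver_id, _week, hours, miles in rows:
    totals[driver_id][0] += hours
    totals[driver_id][1] += miles

print(dict(totals))  # {10: [140, 6600], 11: [65, 3100]}
```

In the MapReduce execution shown below, the mapper emits driverId as the key and the reducer performs exactly this accumulation.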

[16]:
%%hive
SELECT
    driverId,
    sum(hours_logged),
    sum(miles_logged)
FROM
    timesheet
GROUP BY
    driverId;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517144241_4fe1f8e4-cd53-475e-8c16-51b039bc665a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1652793922537_0009, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0009/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-05-17 14:42:50,503 Stage-1 map = 0%,  reduce = 0%
2022-05-17 14:42:54,612 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.68 sec
2022-05-17 14:42:59,749 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.31 sec
MapReduce Total cumulative CPU time: 3 seconds 310 msec
Ended Job = job_1652793922537_0009
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.31 sec   HDFS Read: 33422 HDFS Write: 1005 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 310 msec
OK
10      3162    143850
11      3642    179300
12      2639    135962
13      2727    134126
14      2781    136624
15      2734    138750
16      2746    137205
17      2701    135992
18      2654    137834
19      2738    137968
20      2644    134564
21      2751    138719
22      2733    137550
23      2750    137980
24      2647    134461
25      2723    139180
26      2730    137530
27      2771    137922
28      2723    137469
29      2760    138255
30      2773    137473
31      2704    137057
32      2736    137422
33      2759    139285
34      2811    137728
35      2728    138727
36      2795    138025
37      2694    137223
38      2760    137464
39      2745    138788
40      2700    136931
41      2723    138407
42      2697    136673
43      2750    136993
Time taken: 19.249 seconds, Fetched: 34 row(s)

Joining the tables#

The final step is a query that joins the driver's name from the drivers table with the yearly totals of hours and miles.
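Conceptually the query builds the subquery's totals first and then matches each driver row by driverId, which is how a map join works. A toy Python sketch with two drivers (totals taken from the aggregation output above):

```python
# driverId -> name, as in the drivers table
drivers = {10: "George Vetticaden", 11: "Jamie Engesser"}

# driverId -> (total_hours, total_miles), as produced by the subquery
totals = {10: (3162, 143850), 11: (3642, 179300)}

# Inner join on driverId: keep only ids present in both tables
joined = [(i, drivers[i], *totals[i])
          for i in sorted(drivers.keys() & totals.keys())]
print(joined)
```

Hive's map join does the same thing: the small side (here, drivers) is loaded into an in-memory hash table and probed for each row of the large side.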

[17]:
%%hive
SELECT
    d.driverId,
    d.name,
    t.total_hours,
    t.total_miles
FROM
    drivers d
JOIN (
    SELECT
        driverId,
        sum(hours_logged)total_hours,
        sum(miles_logged)total_miles
    FROM
        timesheet
    GROUP BY
        driverId
    ) t
ON
    (d.driverId = t.driverId);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517144301_fcf22209-53f3-42ce-9af0-d532ace4a917
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1652793922537_0010, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0010/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0010
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2022-05-17 14:43:09,601 Stage-2 map = 0%,  reduce = 0%
2022-05-17 14:43:13,719 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 2.01 sec
2022-05-17 14:43:18,828 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 3.63 sec
MapReduce Total cumulative CPU time: 3 seconds 630 msec
Ended Job = job_1652793922537_0010
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2022-05-17 14:43:26     Starting to launch local task to process map join;      maximum memory = 477626368
2022-05-17 14:43:28     Dump the side-table for tag: 0 with group count: 33 into file: file:/tmp/root/ca88b647-8729-4e04-9264-2e3e6a5c74e1/hive_2022-05-17_14-43-01_867_3518588412713534238-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile00--.hashtable
2022-05-17 14:43:28     Uploaded 1 File to: file:/tmp/root/ca88b647-8729-4e04-9264-2e3e6a5c74e1/hive_2022-05-17_14-43-01_867_3518588412713534238-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile00--.hashtable (1325 bytes)
2022-05-17 14:43:28     End of local task; Time Taken: 1.4 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1652793922537_0011, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0011/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0011
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2022-05-17 14:43:33,515 Stage-3 map = 0%,  reduce = 0%
2022-05-17 14:43:37,621 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.53 sec
MapReduce Total cumulative CPU time: 1 seconds 530 msec
Ended Job = job_1652793922537_0011
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 3.63 sec   HDFS Read: 32521 HDFS Write: 946 SUCCESS
Stage-Stage-3: Map: 1   Cumulative CPU: 1.53 sec   HDFS Read: 6784 HDFS Write: 1411 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 160 msec
OK
11      Jamie Engesser  3642    179300
12      Paul Coddin     2639    135962
13      Joe Niemiec     2727    134126
14      Adis Cesir      2781    136624
15      Rohit Bakshi    2734    138750
16      Tom McCuch      2746    137205
17      Eric Mizell     2701    135992
18      Grant Liu       2654    137834
19      Ajay Singh      2738    137968
20      Chris Harris    2644    134564
21      Jeff Markham    2751    138719
22      Nadeem Asghar   2733    137550
23      Adam Diaz       2750    137980
24      Don Hilborn     2647    134461
25      Jean-Philippe Playe     2723    139180
26      Michael Aube    2730    137530
27      Mark Lochbihler 2771    137922
28      Olivier Renault 2723    137469
29      Teddy Choi      2760    138255
30      Dan Rice        2773    137473
31      Rommel Garcia   2704    137057
32      Ryan Templeton  2736    137422
33      Sridhara Sabbella       2759    139285
34      Frank Romano    2811    137728
35      Emil Siemes     2728    138727
36      Andrew Grande   2795    138025
37      Wes Floyd       2694    137223
38      Scott Shaw      2760    137464
39      David Kaiser    2745    138788
40      Nicolas Maillard        2700    136931
41      Greg Phillips   2723    138407
42      Randy Gelhausen 2697    136673
43      Dave Patton     2750    136993
Time taken: 36.822 seconds, Fetched: 33 row(s)

Storing the results#

Finally, an extra clause is added to the previous query to store the resulting table in the /tmp/drivers/summary folder of HDFS, so that other applications can use these results.
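The ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' clause produces plain comma-delimited text. The same serialization can be sketched with Python's csv module, using the first rows of the joined result shown above:

```python
import csv
import io

# First rows of the joined result (driverId, name, total_hours, total_miles)
rows = [(11, "Jamie Engesser", 3642, 179300),
        (12, "Paul Coddin", 2639, 135962)]

# Write comma-delimited lines, like FIELDS TERMINATED BY ','
buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())
```

This is why the final `hdfs dfs -tail` on /tmp/drivers/summary/000000_0 shows one comma-separated record per line.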

[18]:
%%hive
INSERT OVERWRITE DIRECTORY '/tmp/drivers/summary'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT
    d.driverId,
    d.name,
    t.total_hours,
    t.total_miles
FROM
    drivers d
JOIN (
    SELECT
        driverId,
        sum(hours_logged)total_hours,
        sum(miles_logged)total_miles
    FROM
        timesheet
    GROUP BY
        driverId
    ) t
ON
    (d.driverId = t.driverId);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517144339_0c63ca4e-baa5-46ab-9b53-25662d9d6082
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1652793922537_0012, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0012/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0012
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2022-05-17 14:43:48,364 Stage-2 map = 0%,  reduce = 0%
2022-05-17 14:43:52,451 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 2.11 sec
2022-05-17 14:43:57,573 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 3.44 sec
MapReduce Total cumulative CPU time: 3 seconds 440 msec
Ended Job = job_1652793922537_0012
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2022-05-17 14:44:04     Starting to launch local task to process map join;      maximum memory = 477626368
2022-05-17 14:44:05     Dump the side-table for tag: 0 with group count: 33 into file: file:/tmp/root/ca88b647-8729-4e04-9264-2e3e6a5c74e1/hive_2022-05-17_14-43-39_841_4775949701697963899-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile10--.hashtable
2022-05-17 14:44:05     Uploaded 1 File to: file:/tmp/root/ca88b647-8729-4e04-9264-2e3e6a5c74e1/hive_2022-05-17_14-43-39_841_4775949701697963899-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile10--.hashtable (1325 bytes)
2022-05-17 14:44:05     End of local task; Time Taken: 1.437 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1652793922537_0013, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0013/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0013
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2022-05-17 14:44:10,576 Stage-3 map = 0%,  reduce = 0%
2022-05-17 14:44:14,682 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.69 sec
MapReduce Total cumulative CPU time: 1 seconds 690 msec
Ended Job = job_1652793922537_0013
Moving data to directory /tmp/drivers/summary
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 3.44 sec   HDFS Read: 32512 HDFS Write: 946 SUCCESS
Stage-Stage-3: Map: 1   Cumulative CPU: 1.69 sec   HDFS Read: 6345 HDFS Write: 928 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 130 msec
OK
Time taken: 35.911 seconds
[19]:
!hdfs dfs -ls /tmp/drivers/summary/
Found 1 items
-rwxr-xr-x   1 root supergroup        928 2022-05-17 14:44 /tmp/drivers/summary/000000_0
[20]:
!hdfs dfs -tail /tmp/drivers/summary/000000_0
11,Jamie Engesser,3642,179300
12,Paul Coddin,2639,135962
13,Joe Niemiec,2727,134126
14,Adis Cesir,2781,136624
15,Rohit Bakshi,2734,138750
16,Tom McCuch,2746,137205
17,Eric Mizell,2701,135992
18,Grant Liu,2654,137834
19,Ajay Singh,2738,137968
20,Chris Harris,2644,134564
21,Jeff Markham,2751,138719
22,Nadeem Asghar,2733,137550
23,Adam Diaz,2750,137980
24,Don Hilborn,2647,134461
25,Jean-Philippe Playe,2723,139180
26,Michael Aube,2730,137530
27,Mark Lochbihler,2771,137922
28,Olivier Renault,2723,137469
29,Teddy Choi,2760,138255
30,Dan Rice,2773,137473
31,Rommel Garcia,2704,137057
32,Ryan Templeton,2736,137422
33,Sridhara Sabbella,2759,139285
34,Frank Romano,2811,137728
35,Emil Siemes,2728,138727
36,Andrew Grande,2795,138025
37,Wes Floyd,2694,137223
38,Scott Shaw,2760,137464
39,David Kaiser,2745,138788
40,Nicolas Maillard,2700,136931
41,Greg Phillips,2723,138407
42,Randy Gelhausen,2697,136673
43,Dave Patton,2750,136993
[21]:
%quit
[22]:
!rm *.csv *.log