Data processing with Hive#
Last modified: May 17, 2022 | YouTube
This tutorial is based on https://es.hortonworks.com/tutorial/how-to-process-data-with-apache-hive/
The goal of this tutorial is to implement Hive queries to analyze, process, and filter the data stored in a data warehouse, using standard SQL.
Cell magic %%hive#
[1]:
from IPython.core.magic import Magics, cell_magic, line_magic, magics_class
from pexpect import spawn

TIMEOUT = 60
PROG = "hive"
PROMPT = ["\r\n    > ", "\r\nhive> "]
QUIT = "quit;"


@magics_class
class Magic(Magics):
    def __init__(self, shell):
        super().__init__(shell)
        # Launch the Hive CLI and wait for its first prompt.
        self.app = spawn(PROG, timeout=TIMEOUT)
        self.app.expect(PROMPT)

    @cell_magic
    def hive(self, line, cell):
        # Send each non-empty line of the cell to the Hive CLI.
        cell_lines = [cell_line.strip() for cell_line in cell.split("\n")]
        cell_lines = [cell_line for cell_line in cell_lines if cell_line != ""]
        for cell_line in cell_lines:
            self.app.sendline(cell_line)
            self.app.expect(PROMPT, timeout=TIMEOUT)
            # Print the CLI output, skipping the echo of the input lines.
            output = self.app.before.decode()
            output = output.replace("\r\n", "\n")
            output = output.split("\n")
            output = [output_line.strip() for output_line in output]
            for output_line in output:
                if output_line not in cell_lines:
                    print(output_line)
        return None

    @line_magic
    def quit(self, line):
        # Terminate the Hive CLI session.
        self.app.sendline(QUIT)


def load_ipython_extension(ip):
    ip.register_magics(Magic(ip))


load_ipython_extension(ip=get_ipython())
Preparation#
The data files are first downloaded to the current directory. The folder /tmp/drivers is then created in the Hadoop file system (HDFS) and the files are copied into it.
[2]:
!wget https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/drivers.csv
!wget https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/timesheet.csv
--2022-05-17 14:41:50-- https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/drivers.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2043 (2.0K) [text/plain]
Saving to: ‘drivers.csv’
drivers.csv 100%[===================>] 2.00K --.-KB/s in 0.001s
2022-05-17 14:41:51 (1.50 MB/s) - ‘drivers.csv’ saved [2043/2043]
--2022-05-17 14:41:51-- https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/timesheet.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26205 (26K) [text/plain]
Saving to: ‘timesheet.csv’
timesheet.csv 100%[===================>] 25.59K --.-KB/s in 0.01s
2022-05-17 14:41:52 (1.90 MB/s) - ‘timesheet.csv’ saved [26205/26205]
[3]:
#
# Create the drivers folder in HDFS
#
!hdfs dfs -mkdir /tmp/drivers
#
# Copy the files to HDFS
#
!hdfs dfs -copyFromLocal drivers.csv /tmp/drivers/
!hdfs dfs -copyFromLocal timesheet.csv /tmp/drivers/
#
# List the files in HDFS to verify
# that they were copied correctly.
#
!hdfs dfs -ls /tmp/drivers/*
mkdir: `/tmp/drivers': File exists
-rw-r--r-- 1 root supergroup 2043 2022-05-17 14:41 /tmp/drivers/drivers.csv
Found 1 items
-rwxr-xr-x 1 root supergroup 1800 2022-05-17 13:33 /tmp/drivers/specific-columns/000000_0
-rw-r--r-- 1 root supergroup 26205 2022-05-17 14:41 /tmp/drivers/timesheet.csv
The content of a file can be partially displayed using the tail command. It is used for a quick inspection of the files' contents.
[4]:
#
# Print the end of the drivers file
#
!hdfs dfs -tail /tmp/drivers/drivers.csv
Box 213- 8948 Nec Ave,Y,hours
27,Mark Lochbihler,392603159,8355 Ipsum St.,Y,hours
28,Olivier Renault,959908181,P.O. Box 243- 6509 Erat. Avenue,Y,hours
29,Teddy Choi,185502192,P.O. Box 106- 7003 Amet Rd.,Y,hours
30,Dan Rice,282307061,Ap #881-9267 Mollis Avenue,Y,hours
31,Rommel Garcia,858912101,P.O. Box 945- 6015 Sociis St.,Y,hours
32,Ryan Templeton,290304287,765-6599 Egestas. Av.,Y,hours
33,Sridhara Sabbella,967409015,Ap #477-2507 Sagittis Avenue,Y,hours
34,Frank Romano,391407216,Ap #753-6814 Quis Ave,Y,hours
35,Emil Siemes,971401151,321-2976 Felis Rd.,Y,hours
36,Andrew Grande,245303216,Ap #685-9598 Egestas Rd.,Y,hours
37,Wes Floyd,190504074,P.O. Box 269- 9611 Nulla Street,Y,hours
38,Scott Shaw,386411175,276 Lobortis Road,Y,hours
39,David Kaiser,967706052,9185 At Street,Y,hours
40,Nicolas Maillard,208510217,1027 Quis Rd.,Y,hours
41,Greg Phillips,308103116,P.O. Box 847- 5961 Arcu. Road,Y,hours
42,Randy Gelhausen,853302254,145-4200 In- Avenue,Y,hours
43,Dave Patton,977706052,3028 A- St.,Y,hours
[5]:
!hdfs dfs -tail /tmp/drivers/timesheet.csv
42,36,56,2612
42,37,48,2550
42,38,55,2527
42,39,57,2723
42,40,55,2728
42,41,50,2557
42,42,53,2773
42,43,55,2786
42,44,54,2638
42,45,57,2542
42,46,48,2526
42,47,50,2795
42,48,53,2609
42,49,58,2584
42,50,48,2692
42,51,50,2566
42,52,48,2735
43,1,46,2622
43,2,47,2688
43,3,50,2544
43,4,56,2573
43,5,54,2691
43,6,52,2796
43,7,53,2564
43,8,58,2624
43,9,50,2528
43,10,57,2721
43,11,51,2722
43,12,59,2681
43,13,52,2683
43,14,46,2663
43,15,53,2579
43,16,56,2519
43,17,54,2584
43,18,47,2665
43,19,55,2511
43,20,60,2677
43,21,52,2585
43,22,60,2719
43,23,48,2655
43,24,48,2641
43,25,53,2512
43,26,48,2612
43,27,58,2614
43,28,60,2551
43,29,55,2682
43,30,49,2504
43,31,51,2701
43,32,57,2554
43,33,52,2730
43,34,54,2783
43,35,51,2681
43,36,51,2655
43,37,46,2629
43,38,58,2739
43,39,47,2535
43,40,50,2512
43,41,51,2701
43,42,55,2538
43,43,58,2775
43,44,56,2545
43,45,46,2671
43,46,57,2680
43,47,50,2572
43,48,52,2517
43,49,56,2743
43,50,59,2665
43,51,58,2593
43,52,48,2764
Creation of the temp_drivers table#
Next, the temp_drivers table is created to hold the drivers' information; it is stored on disk as a plain text file. The table property skip.header.line.count="1" tells Hive to skip the first line (the CSV header) of each data file when the table is read.
[6]:
%%hive
DROP TABLE IF EXISTS temp_drivers;
CREATE TABLE temp_drivers (col_value STRING) STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");
OK
Time taken: 5.696 seconds
OK
Time taken: 0.56 seconds
Next, the tables in the current database whose names start with t are listed, to verify that the table was created.
[7]:
%%hive
SHOW TABLES LIKE 't*';
OK
temp_drivers
Time taken: 0.11 seconds, Fetched: 1 row(s)
Loading data into the temp_drivers table#
The following statement loads the data from the drivers.csv file into the temp_drivers table.
[8]:
%%hive
LOAD DATA INPATH '/tmp/drivers/drivers.csv' OVERWRITE INTO TABLE temp_drivers;
Loading data to table default.temp_drivers
OK
Time taken: 0.708 seconds
Hive consumes the data, that is, it moves the file into the data warehouse, so drivers.csv is removed from the /tmp/drivers folder.
[9]:
!hdfs dfs -ls /tmp/drivers
Found 2 items
drwxr-xr-x - root supergroup 0 2022-05-17 13:33 /tmp/drivers/specific-columns
-rw-r--r-- 1 root supergroup 26205 2022-05-17 14:41 /tmp/drivers/timesheet.csv
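If the original file must be kept, note that LOAD DATA LOCAL INPATH copies the file from the local file system instead of moving it within HDFS. A minimal sketch, not executed in this tutorial:
%%hive
-- Copies drivers.csv from the local working directory into the table,
-- leaving the local copy in place.
LOAD DATA LOCAL INPATH 'drivers.csv' OVERWRITE INTO TABLE temp_drivers;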
The first 10 records of the table are retrieved for a quick inspection, to verify that the data was loaded correctly.
[10]:
%%hive
SELECT * FROM temp_drivers LIMIT 10;
OK
10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
11,Jamie Engesser,262112338,366-4125 Ac Street,N,miles
12,Paul Coddin,198041975,Ap #622-957 Risus. Street,Y,hours
13,Joe Niemiec,139907145,2071 Hendrerit. Ave,Y,hours
14,Adis Cesir,820812209,Ap #810-1228 In St.,Y,hours
15,Rohit Bakshi,239005227,648-5681 Dui- Rd.,Y,hours
16,Tom McCuch,363303105,P.O. Box 313- 962 Parturient Rd.,Y,hours
17,Eric Mizell,123808238,P.O. Box 579- 2191 Gravida. Street,Y,hours
18,Grant Liu,171010151,Ap #928-3159 Vestibulum Av.,Y,hours
19,Ajay Singh,160005158,592-9430 Nonummy Avenue,Y,hours
Time taken: 1.005 seconds, Fetched: 10 row(s)
Creation of the drivers table#
Next, the drivers table is created; it will hold the information extracted from the temp_drivers table.
[11]:
%%hive
DROP TABLE IF EXISTS drivers;
CREATE TABLE drivers (driverId INT,
name STRING,
ssn BIGINT,
location STRING,
certified STRING,
wageplan STRING)
TBLPROPERTIES ("skip.header.line.count"="1");
OK
Time taken: 0.018 seconds
OK
Time taken: 0.061 seconds
Since each record of the temp_drivers table is a single line of text, the regexp_extract function is applied to split the text at the commas. The pattern repeats the group (?:([^,]*),?) the number of times given by the quantifier, and the capture group retains its last match; thus {1} yields the first comma-separated field, {2} the second, and so on. The identifier after each call to regexp_extract is the name of the corresponding column of the drivers table.
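As a quick illustration (a standalone query, independent of the tutorial's data flow), the pattern can be tested on a literal string; the call below should return George Vetticaden:
%%hive
-- The group is repeated twice, so group 1 holds the second field.
SELECT regexp_extract('10,George Vetticaden,621011971', '^(?:([^,]*),?){2}', 1);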
[12]:
%%hive
INSERT OVERWRITE TABLE drivers
SELECT
regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) driverId,
regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) name,
regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) ssn,
regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) location,
regexp_extract(col_value, '^(?:([^,]*),?){5}', 1) certified,
regexp_extract(col_value, '^(?:([^,]*),?){6}', 1) wageplan
FROM
temp_drivers;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517144213_e23a371d-a2df-47f4-bab3-aaad8ab30155
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1652793922537_0007, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0007/
Kill Command = /opt/hadoop/bin/hadoop job -kill job_1652793922537_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-05-17 14:42:20,276 Stage-1 map = 0%, reduce = 0%
2022-05-17 14:42:24,437 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.94 sec
MapReduce Total cumulative CPU time: 1 seconds 940 msec
Ended Job = job_1652793922537_0007
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://0.0.0.0:9000/user/hive/warehouse/drivers/.hive-staging_hive_2022-05-17_14-42-13_686_1675122652526265324-1/-ext-10000
Loading data to table default.drivers
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.94 sec HDFS Read: 6812 HDFS Write: 2036 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 940 msec
OK
Time taken: 13.097 seconds
The SELECT statement is used to check the result of the data load. Note that driver 10 does not appear in the output: the skip.header.line.count property of the drivers table skips the first line of the table's data files, and here that line is driver 10's record rather than a CSV header.
[13]:
%%hive
SELECT * FROM drivers LIMIT 10;
OK
11 Jamie Engesser 262112338 366-4125 Ac Street N miles
12 Paul Coddin 198041975 Ap #622-957 Risus. Street Y hours
13 Joe Niemiec 139907145 2071 Hendrerit. Ave Y hours
14 Adis Cesir 820812209 Ap #810-1228 In St. Y hours
15 Rohit Bakshi 239005227 648-5681 Dui- Rd. Y hours
16 Tom McCuch 363303105 P.O. Box 313- 962 Parturient Rd. Y hours
17 Eric Mizell 123808238 P.O. Box 579- 2191 Gravida. Street Y hours
18 Grant Liu 171010151 Ap #928-3159 Vestibulum Av. Y hours
19 Ajay Singh 160005158 592-9430 Nonummy Avenue Y hours
20 Chris Harris 921812303 883-2691 Proin Avenue Y hours
Time taken: 0.112 seconds, Fetched: 10 row(s)
Creation of the temp_timesheet table#
The table is created and the data from the timesheet.csv file is loaded, following the same procedure as before.
[14]:
%%hive
DROP TABLE IF EXISTS temp_timesheet;
CREATE TABLE temp_timesheet (col_value string)
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");
LOAD DATA INPATH '/tmp/drivers/timesheet.csv' OVERWRITE INTO TABLE temp_timesheet;
SELECT * FROM temp_timesheet LIMIT 10;
OK
Time taken: 0.017 seconds
OK
Time taken: 0.216 seconds
Loading data to table default.temp_timesheet
OK
Time taken: 0.22 seconds
OK
10,1,70,3300
10,2,70,3300
10,3,60,2800
10,4,70,3100
10,5,70,3200
10,6,70,3300
10,7,70,3000
10,8,70,3300
10,9,70,3200
10,10,50,2500
Time taken: 0.096 seconds, Fetched: 10 row(s)
Creation of the timesheet table#
The procedure is the same as for the previous tables.
[15]:
%%hive
DROP TABLE IF EXISTS timesheet;
CREATE TABLE timesheet (driverId INT, week INT, hours_logged INT, miles_logged INT)
TBLPROPERTIES ("skip.header.line.count"="1");
INSERT OVERWRITE TABLE timesheet
SELECT
regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) driverId,
regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) week,
regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) hours_logged,
regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) miles_logged
FROM
temp_timesheet;
SELECT * FROM timesheet LIMIT 10;
OK
Time taken: 0.018 seconds
OK
Time taken: 0.042 seconds
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517144228_ef115f87-6300-428d-a380-9d530398d507
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1652793922537_0008, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0008/
Kill Command = /opt/hadoop/bin/hadoop job -kill job_1652793922537_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-05-17 14:42:35,597 Stage-1 map = 0%, reduce = 0%
2022-05-17 14:42:39,707 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.85 sec
MapReduce Total cumulative CPU time: 2 seconds 850 msec
Ended Job = job_1652793922537_0008
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://0.0.0.0:9000/user/hive/warehouse/timesheet/.hive-staging_hive_2022-05-17_14-42-28_623_1595757489952936503-1/-ext-10000
Loading data to table default.timesheet
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.85 sec HDFS Read: 30739 HDFS Write: 24476 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 850 msec
OK
Time taken: 12.328 seconds
OK
10 2 70 3300
10 3 60 2800
10 4 70 3100
10 5 70 3200
10 6 70 3300
10 7 70 3000
10 8 70 3300
10 9 70 3200
10 10 50 2500
10 11 70 2900
Time taken: 0.093 seconds, Fetched: 10 row(s)
Hours and miles per driver for the year#
The following query obtains, for each driver, the total hours and miles logged over the year.
[16]:
%%hive
SELECT
driverId,
sum(hours_logged),
sum(miles_logged)
FROM
timesheet
GROUP BY
driverId;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517144241_4fe1f8e4-cd53-475e-8c16-51b039bc665a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1652793922537_0009, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0009/
Kill Command = /opt/hadoop/bin/hadoop job -kill job_1652793922537_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-05-17 14:42:50,503 Stage-1 map = 0%, reduce = 0%
2022-05-17 14:42:54,612 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.68 sec
2022-05-17 14:42:59,749 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.31 sec
MapReduce Total cumulative CPU time: 3 seconds 310 msec
Ended Job = job_1652793922537_0009
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.31 sec HDFS Read: 33422 HDFS Write: 1005 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 310 msec
OK
10 3162 143850
11 3642 179300
12 2639 135962
13 2727 134126
14 2781 136624
15 2734 138750
16 2746 137205
17 2701 135992
18 2654 137834
19 2738 137968
20 2644 134564
21 2751 138719
22 2733 137550
23 2750 137980
24 2647 134461
25 2723 139180
26 2730 137530
27 2771 137922
28 2723 137469
29 2760 138255
30 2773 137473
31 2704 137057
32 2736 137422
33 2759 139285
34 2811 137728
35 2728 138727
36 2795 138025
37 2694 137223
38 2760 137464
39 2745 138788
40 2700 136931
41 2723 138407
42 2697 136673
43 2750 136993
Time taken: 19.249 seconds, Fetched: 34 row(s)
Query to join the tables#
The final step is a query that combines the driver's name from the drivers table with the yearly totals of hours and miles.
[17]:
%%hive
SELECT
d.driverId,
d.name,
t.total_hours,
t.total_miles
FROM
drivers d
JOIN (
SELECT
driverId,
sum(hours_logged) AS total_hours,
sum(miles_logged) AS total_miles
FROM
timesheet
GROUP BY
driverId
) t
ON
(d.driverId = t.driverId);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517144301_fcf22209-53f3-42ce-9af0-d532ace4a917
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1652793922537_0010, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0010/
Kill Command = /opt/hadoop/bin/hadoop job -kill job_1652793922537_0010
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2022-05-17 14:43:09,601 Stage-2 map = 0%, reduce = 0%
2022-05-17 14:43:13,719 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 2.01 sec
2022-05-17 14:43:18,828 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3.63 sec
MapReduce Total cumulative CPU time: 3 seconds 630 msec
Ended Job = job_1652793922537_0010
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2022-05-17 14:43:26 Starting to launch local task to process map join; maximum memory = 477626368
2022-05-17 14:43:28 Dump the side-table for tag: 0 with group count: 33 into file: file:/tmp/root/ca88b647-8729-4e04-9264-2e3e6a5c74e1/hive_2022-05-17_14-43-01_867_3518588412713534238-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile00--.hashtable
2022-05-17 14:43:28 Uploaded 1 File to: file:/tmp/root/ca88b647-8729-4e04-9264-2e3e6a5c74e1/hive_2022-05-17_14-43-01_867_3518588412713534238-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile00--.hashtable (1325 bytes)
2022-05-17 14:43:28 End of local task; Time Taken: 1.4 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1652793922537_0011, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0011/
Kill Command = /opt/hadoop/bin/hadoop job -kill job_1652793922537_0011
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2022-05-17 14:43:33,515 Stage-3 map = 0%, reduce = 0%
2022-05-17 14:43:37,621 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.53 sec
MapReduce Total cumulative CPU time: 1 seconds 530 msec
Ended Job = job_1652793922537_0011
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 3.63 sec HDFS Read: 32521 HDFS Write: 946 SUCCESS
Stage-Stage-3: Map: 1 Cumulative CPU: 1.53 sec HDFS Read: 6784 HDFS Write: 1411 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 160 msec
OK
11 Jamie Engesser 3642 179300
12 Paul Coddin 2639 135962
13 Joe Niemiec 2727 134126
14 Adis Cesir 2781 136624
15 Rohit Bakshi 2734 138750
16 Tom McCuch 2746 137205
17 Eric Mizell 2701 135992
18 Grant Liu 2654 137834
19 Ajay Singh 2738 137968
20 Chris Harris 2644 134564
21 Jeff Markham 2751 138719
22 Nadeem Asghar 2733 137550
23 Adam Diaz 2750 137980
24 Don Hilborn 2647 134461
25 Jean-Philippe Playe 2723 139180
26 Michael Aube 2730 137530
27 Mark Lochbihler 2771 137922
28 Olivier Renault 2723 137469
29 Teddy Choi 2760 138255
30 Dan Rice 2773 137473
31 Rommel Garcia 2704 137057
32 Ryan Templeton 2736 137422
33 Sridhara Sabbella 2759 139285
34 Frank Romano 2811 137728
35 Emil Siemes 2728 138727
36 Andrew Grande 2795 138025
37 Wes Floyd 2694 137223
38 Scott Shaw 2760 137464
39 David Kaiser 2745 138788
40 Nicolas Maillard 2700 136931
41 Greg Phillips 2723 138407
42 Randy Gelhausen 2697 136673
43 Dave Patton 2750 136993
Time taken: 36.822 seconds, Fetched: 33 row(s)
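The same join can also be written with a common table expression; a sketch, assuming Hive 0.13 or later, where WITH clauses were introduced:
%%hive
-- Equivalent formulation of the previous query using a CTE.
WITH totals AS (
    SELECT driverId, sum(hours_logged) AS total_hours, sum(miles_logged) AS total_miles
    FROM timesheet
    GROUP BY driverId
)
SELECT d.driverId, d.name, t.total_hours, t.total_miles
FROM drivers d JOIN totals t ON d.driverId = t.driverId;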
Storing the results#
Finally, an additional clause is added to the previous query to store the resulting table in the /tmp/drivers/summary folder of HDFS, so that other applications can use these results.
[18]:
%%hive
INSERT OVERWRITE DIRECTORY '/tmp/drivers/summary'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT
d.driverId,
d.name,
t.total_hours,
t.total_miles
FROM
drivers d
JOIN (
SELECT
driverId,
sum(hours_logged) AS total_hours,
sum(miles_logged) AS total_miles
FROM
timesheet
GROUP BY
driverId
) t
ON
(d.driverId = t.driverId);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517144339_0c63ca4e-baa5-46ab-9b53-25662d9d6082
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1652793922537_0012, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0012/
Kill Command = /opt/hadoop/bin/hadoop job -kill job_1652793922537_0012
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2022-05-17 14:43:48,364 Stage-2 map = 0%, reduce = 0%
2022-05-17 14:43:52,451 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 2.11 sec
2022-05-17 14:43:57,573 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3.44 sec
MapReduce Total cumulative CPU time: 3 seconds 440 msec
Ended Job = job_1652793922537_0012
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2022-05-17 14:44:04 Starting to launch local task to process map join; maximum memory = 477626368
2022-05-17 14:44:05 Dump the side-table for tag: 0 with group count: 33 into file: file:/tmp/root/ca88b647-8729-4e04-9264-2e3e6a5c74e1/hive_2022-05-17_14-43-39_841_4775949701697963899-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile10--.hashtable
2022-05-17 14:44:05 Uploaded 1 File to: file:/tmp/root/ca88b647-8729-4e04-9264-2e3e6a5c74e1/hive_2022-05-17_14-43-39_841_4775949701697963899-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile10--.hashtable (1325 bytes)
2022-05-17 14:44:05 End of local task; Time Taken: 1.437 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1652793922537_0013, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0013/
Kill Command = /opt/hadoop/bin/hadoop job -kill job_1652793922537_0013
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2022-05-17 14:44:10,576 Stage-3 map = 0%, reduce = 0%
2022-05-17 14:44:14,682 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.69 sec
MapReduce Total cumulative CPU time: 1 seconds 690 msec
Ended Job = job_1652793922537_0013
Moving data to directory /tmp/drivers/summary
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 3.44 sec HDFS Read: 32512 HDFS Write: 946 SUCCESS
Stage-Stage-3: Map: 1 Cumulative CPU: 1.69 sec HDFS Read: 6345 HDFS Write: 928 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 130 msec
OK
Time taken: 35.911 seconds
[19]:
!hdfs dfs -ls /tmp/drivers/summary/
Found 1 items
-rwxr-xr-x 1 root supergroup 928 2022-05-17 14:44 /tmp/drivers/summary/000000_0
[20]:
!hdfs dfs -tail /tmp/drivers/summary/000000_0
11,Jamie Engesser,3642,179300
12,Paul Coddin,2639,135962
13,Joe Niemiec,2727,134126
14,Adis Cesir,2781,136624
15,Rohit Bakshi,2734,138750
16,Tom McCuch,2746,137205
17,Eric Mizell,2701,135992
18,Grant Liu,2654,137834
19,Ajay Singh,2738,137968
20,Chris Harris,2644,134564
21,Jeff Markham,2751,138719
22,Nadeem Asghar,2733,137550
23,Adam Diaz,2750,137980
24,Don Hilborn,2647,134461
25,Jean-Philippe Playe,2723,139180
26,Michael Aube,2730,137530
27,Mark Lochbihler,2771,137922
28,Olivier Renault,2723,137469
29,Teddy Choi,2760,138255
30,Dan Rice,2773,137473
31,Rommel Garcia,2704,137057
32,Ryan Templeton,2736,137422
33,Sridhara Sabbella,2759,139285
34,Frank Romano,2811,137728
35,Emil Siemes,2728,138727
36,Andrew Grande,2795,138025
37,Wes Floyd,2694,137223
38,Scott Shaw,2760,137464
39,David Kaiser,2745,138788
40,Nicolas Maillard,2700,136931
41,Greg Phillips,2723,138407
42,Randy Gelhausen,2697,136673
43,Dave Patton,2750,136993
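Other Hive sessions can also query the exported files in place. A minimal sketch (the table name summary is arbitrary, and it assumes the files remain in /tmp/drivers/summary) creates an external table over the directory:
%%hive
-- External table over the exported CSV files; dropping it later
-- removes only the metadata, not the files.
CREATE EXTERNAL TABLE IF NOT EXISTS summary (
    driverId INT,
    name STRING,
    total_hours INT,
    total_miles INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/tmp/drivers/summary';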
[21]:
%quit
[22]:
!rm *.csv *.log