Basic operations in Hive#
Last modified: May 17, 2022
This tutorial is based on https://es.hortonworks.com/tutorial/beginners-guide-to-apache-pig/
This tutorial demonstrates:
Loading data.
Basic queries.
Exporting results.
Cell magic %%hive#
[1]:
from IPython.core.magic import Magics, cell_magic, line_magic, magics_class
from pexpect import spawn

TIMEOUT = 60
PROG = "hive"
PROMPT = ["\r\n > ", "\r\nhive> "]
QUIT = "quit;"


@magics_class
class Magic(Magics):
    def __init__(self, shell):
        super().__init__(shell)
        # Launch the Hive CLI and wait for the first prompt.
        self.app = spawn(PROG, timeout=TIMEOUT)
        self.app.expect(PROMPT)

    @cell_magic
    def hive(self, line, cell):
        # Send each non-empty line of the cell to the Hive CLI.
        cell_lines = [cell_line.strip() for cell_line in cell.split("\n")]
        cell_lines = [cell_line for cell_line in cell_lines if cell_line != ""]
        for cell_line in cell_lines:
            self.app.sendline(cell_line)
            self.app.expect(PROMPT, timeout=TIMEOUT)
            # Print the captured output, skipping the lines that the
            # CLI echoes back from the input.
            output = self.app.before.decode()
            output = output.replace("\r\n", "\n")
            output = output.split("\n")
            output = [output_line.strip() for output_line in output]
            for output_line in output:
                if output_line not in cell_lines:
                    print(output_line)
        return None

    @line_magic
    def quit(self, line):
        self.app.sendline(QUIT)


def load_ipython_extension(ip):
    ip.register_magics(Magic(ip))


load_ipython_extension(ip=get_ipython())
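The cell magic above filters out the lines that the Hive CLI echoes back, so only the actual query output is printed. That filtering step can be isolated and checked as a pure function; the sketch below reproduces the same logic outside of IPython (the function name filter_echo is ours, not part of the magic):

```python
def filter_echo(raw_output, sent_lines):
    """Remove echoed input lines from a captured terminal transcript,
    mirroring the filtering done inside the %%hive cell magic."""
    lines = [line.strip() for line in raw_output.replace("\r\n", "\n").split("\n")]
    return [line for line in lines if line not in sent_lines]

# The CLI echoes the query back before printing its result.
captured = "SHOW TABLES;\r\nOK\r\ntruck_events"
print(filter_echo(captured, ["SHOW TABLES;"]))
```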
Preparation#
The data is stored in the drivers folder of the current directory. Next, the folder /tmp/drivers
is created in the Hadoop Distributed File System (HDFS) and the data file is copied into it.
[2]:
!wget https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/truck_event_text_partition.csv
--2022-05-17 13:29:26-- https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/truck_event_text_partition.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2272077 (2.2M) [text/plain]
Saving to: ‘truck_event_text_partition.csv’
truck_event_text_pa 100%[===================>] 2.17M 4.68MB/s in 0.5s
2022-05-17 13:29:27 (4.68 MB/s) - ‘truck_event_text_partition.csv’ saved [2272077/2272077]
[3]:
#
# Delete the folder if it already exists
#
!hdfs dfs -rm -r -f /tmp/drivers
#
# Create the drivers folder in HDFS
#
!hdfs dfs -mkdir /tmp/drivers
#
# Copy the files to HDFS
#
!hdfs dfs -copyFromLocal truck_event_text_partition.csv /tmp/drivers/truck_event_text_partition.csv
#
# List the files in HDFS to verify
# that they were copied correctly.
#
!hdfs dfs -ls /tmp/drivers/*
-rw-r--r-- 1 root supergroup 2272077 2022-05-17 13:30 /tmp/drivers/truck_event_text_partition.csv
Loading the driver events data#
The following code creates the driver events table in the system; the first statement drops the table if it already exists. Note that the row fields must be declared as comma-delimited so that Hive imports them correctly.
[5]:
%%hive
DROP TABLE IF EXISTS truck_events;
CREATE TABLE truck_events (
    driverId INT,
    truckId INT,
    eventTime STRING,
    eventType STRING,
    longitude DOUBLE,
    latitude DOUBLE,
    eventKey STRING,
    correlationId STRING,
    driverName STRING,
    routeId BIGINT,
    routeName STRING,
    eventDate STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");
OK
Time taken: 0.061 seconds
OK
Time taken: 0.485 seconds
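The ROW FORMAT DELIMITED ... FIELDS TERMINATED BY ',' clause tells Hive to split each line of the file on commas, and skip.header.line.count makes it ignore the first line. A rough Python equivalent of that parsing behavior, for illustration only (Hive's LazySimpleSerDe additionally handles type conversion and NULLs):

```python
import csv
from io import StringIO

# A header line followed by two data lines, shaped like
# truck_event_text_partition.csv (abbreviated columns).
sample = StringIO(
    "driverId,truckId,eventTime,eventType\n"
    "14,25,59:21.4,Normal\n"
    "18,16,59:21.7,Normal\n"
)

reader = csv.reader(sample, delimiter=",")
next(reader)  # emulate skip.header.line.count = 1
rows = list(reader)
print(rows)
```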
The existing tables in the database are listed to verify the creation.
[6]:
%%hive
SHOW TABLES;
OK
docs
truck_events
word_counts
Time taken: 0.03 seconds, Fetched: 3 row(s)
The detailed creation information for the truck_events table is shown below.
[7]:
%%hive
SHOW CREATE TABLE truck_events;
OK
CREATE TABLE `truck_events`(
`driverid` int,
`truckid` int,
`eventtime` string,
`eventtype` string,
`longitude` double,
`latitude` double,
`eventkey` string,
`correlationid` string,
`drivername` string,
`routeid` bigint,
`routename` string,
`eventdate` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://0.0.0.0:9000/user/hive/warehouse/truck_events'
TBLPROPERTIES (
'skip.header.line.count'='1',
'transient_lastDdlTime'='1652794284')
Time taken: 0.19 seconds, Fetched: 27 row(s)
The fields and their types can also be inspected with the DESCRIBE command.
[8]:
%%hive
DESCRIBE truck_events;
OK
driverid int
truckid int
eventtime string
eventtype string
longitude double
latitude double
eventkey string
correlationid string
drivername string
routeid bigint
routename string
eventdate string
Time taken: 0.045 seconds, Fetched: 12 row(s)
Data loading#
The data is loaded with the following query.
[9]:
%%hive
LOAD DATA INPATH '/tmp/drivers/truck_event_text_partition.csv' OVERWRITE
INTO TABLE truck_events;
Loading data to table default.truck_events
OK
Time taken: 0.536 seconds
The table properties are verified after loading the data.
[10]:
%%hive
SHOW TBLPROPERTIES truck_events;
OK
numFiles 1
numRows 0
rawDataSize 0
skip.header.line.count 1
totalSize 2272077
transient_lastDdlTime 1652794303
Time taken: 0.046 seconds, Fetched: 6 row(s)
Data visualization#
Data is displayed using SELECT queries.
[11]:
%%hive
SELECT * FROM truck_events LIMIT 10;
OK
14 25 59:21.4 Normal -94.58 37.03 14|25|9223370572464814373 3.66E+18 Adis Cesir 160405074 Joplin to Kansas City Route 2 2016-05-27-22
18 16 59:21.7 Normal -89.66 39.78 18|16|9223370572464814089 3.66E+18 Grant Liu 1565885487 Springfield to KC Via Hanibal 2016-05-27-22
27 105 59:21.7 Normal -90.21 38.65 27|105|9223370572464814070 3.66E+18 Mark Lochbihler 1325562373 Springfield to KC Via Columbia Route 2 2016-05-27-22
11 74 59:21.7 Normal -90.2 38.65 11|74|9223370572464814123 3.66E+18 Jamie Engesser 1567254452 Saint Louis to Memphis Route2 2016-05-27-22
22 87 59:21.7 Normal -90.04 35.19 22|87|9223370572464814101 3.66E+18 Nadeem Asghar 1198242881 Saint Louis to Chicago Route2 2016-05-27-22
22 87 59:22.3 Normal -90.37 35.21 22|87|9223370572464813486 3.66E+18 Nadeem Asghar 1198242881 Saint Louis to Chicago Route2 2016-05-27-22
23 68 59:22.4 Normal -89.91 40.86 23|68|9223370572464813450 3.66E+18 Adam Diaz 160405074 Joplin to Kansas City Route 2 2016-05-27-22
11 74 59:22.5 Normal -89.74 39.1 11|74|9223370572464813355 3.66E+18 Jamie Engesser 1567254452 Saint Louis to Memphis Route2 2016-05-27-22
20 41 59:22.5 Normal -93.36 41.69 20|41|9223370572464813344 3.66E+18 Chris Harris 160779139 Des Moines to Chicago Route 2 2016-05-27-22
32 42 59:22.5 Normal -90.37 35.21 32|42|9223370572464813296 3.66E+18 Ryan Templeton 1090292248 Peoria to Ceder Rapids Route 2 2016-05-27-22
Time taken: 1.051 seconds, Fetched: 10 row(s)
Extracting a subset of the data#
In Hive, a subset of the data can be extracted and stored in a new table using a query that selects the desired rows. The following code creates the table truck_events_subset
with 100 records from the table truck_events
(note that LIMIT without ORDER BY does not guarantee which rows are returned).
[12]:
%%hive
DROP TABLE IF EXISTS truck_events_subset;
CREATE TABLE truck_events_subset
AS
SELECT *
FROM truck_events
LIMIT 100;
OK
Time taken: 0.022 seconds
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517133245_ae2e9ed4-9c7e-4fe6-a66d-6e6cd8d8908e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1652793922537_0003, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0003/
Kill Command = /opt/hadoop/bin/hadoop job -kill job_1652793922537_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-05-17 13:32:52,118 Stage-1 map = 0%, reduce = 0%
2022-05-17 13:32:55,316 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.42 sec
2022-05-17 13:33:00,452 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.98 sec
MapReduce Total cumulative CPU time: 2 seconds 980 msec
Ended Job = job_1652793922537_0003
Moving data to directory hdfs://0.0.0.0:9000/user/hive/warehouse/truck_events_subset
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.98 sec HDFS Read: 28712 HDFS Write: 13676 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 980 msec
OK
Time taken: 15.862 seconds
The code above is equivalent to the following, where LIKE
is used in CREATE TABLE
to indicate that the new table truck_events_subset
has the same structure as the existing table truck_events
.
[13]:
%%hive
DROP TABLE IF EXISTS truck_events_subset;
CREATE TABLE truck_events_subset LIKE truck_events;
INSERT OVERWRITE TABLE truck_events_subset
SELECT
*
FROM
truck_events
LIMIT
100;
OK
Time taken: 0.277 seconds
OK
Time taken: 0.063 seconds
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517133302_8608c166-9f0a-4c91-ab65-14d3becad10d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1652793922537_0004, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0004/
Kill Command = /opt/hadoop/bin/hadoop job -kill job_1652793922537_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-05-17 13:33:11,640 Stage-1 map = 0%, reduce = 0%
2022-05-17 13:33:14,739 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.36 sec
2022-05-17 13:33:19,866 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.65 sec
MapReduce Total cumulative CPU time: 2 seconds 650 msec
Ended Job = job_1652793922537_0004
Loading data to table default.truck_events_subset
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.65 sec HDFS Read: 29426 HDFS Write: 13676 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 650 msec
OK
Time taken: 18.434 seconds
[14]:
%%hive
SELECT * FROM truck_events_subset LIMIT 5;
OK
31 18 59:36.3 Normal -94.58 37.03 31|18|9223370572464799462 3.66E+18 Rommel Garcia 1594289134 Memphis to Little Rock Route 2 2016-05-27-22
18 16 59:36.3 Normal -92.42 39.76 18|16|9223370572464799486 3.66E+18 Grant Liu 1565885487 Springfield to KC Via Hanibal 2016-05-27-22
26 57 59:35.9 Normal -92.74 37.6 26|57|9223370572464799895 3.66E+18 Michael Aube 1325712174 Saint Louis to Tulsa Route2 2016-05-27-22
14 25 59:35.8 Normal -94.46 37.16 14|25|9223370572464800006 3.66E+18 Adis Cesir 160405074 Joplin to Kansas City Route 2 2016-05-27-22
27 105 59:35.6 Normal -92.85 38.93 27|105|9223370572464800175 3.66E+18 Mark Lochbihler 1325562373 Springfield to KC Via Columbia Route 2 2016-05-27-22
Time taken: 0.105 seconds, Fetched: 5 row(s)
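Both approaches produce a 100-row table: CTAS builds the table directly from the query result, while LIKE first creates an empty table with the same schema and then fills it with INSERT OVERWRITE. A minimal sketch of that equivalence with plain Python lists as stand-ins for tables (the rows in cell [14] differ from those in cell [11] precisely because LIMIT without ORDER BY fixes only the row count, not which rows are chosen):

```python
# Simulated source "table" as a list of (driverId, eventType) tuples.
truck_events = [(driver_id, "Normal") for driver_id in range(1000)]

# CTAS style: create the new table directly from the query result.
subset_ctas = truck_events[:100]

# LIKE style: create an empty table with the same structure,
# then overwrite its contents with the query result.
subset_like = []
subset_like[:] = truck_events[:100]

print(len(subset_like), subset_ctas == subset_like)
```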
Selecting specific columns#
The following code extracts some columns of the table truck_events_subset
and stores them in a different table.
[15]:
%%hive
DROP TABLE IF EXISTS specific_columns;
CREATE TABLE specific_columns
AS
SELECT
driverId,
eventTime,
eventType
FROM
truck_events_subset;
SELECT * FROM specific_columns LIMIT 5;
OK
Time taken: 0.025 seconds
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517133328_bee0c4dd-b470-4137-ab82-65ae82d111c7
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1652793922537_0005, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0005/
Kill Command = /opt/hadoop/bin/hadoop job -kill job_1652793922537_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-05-17 13:33:32,359 Stage-1 map = 0%, reduce = 0%
2022-05-17 13:33:36,475 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.34 sec
MapReduce Total cumulative CPU time: 1 seconds 340 msec
Ended Job = job_1652793922537_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://0.0.0.0:9000/user/hive/warehouse/.hive-staging_hive_2022-05-17_13-33-28_831_1063867166954300449-1/-ext-10002
Moving data to directory hdfs://0.0.0.0:9000/user/hive/warehouse/specific_columns
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.34 sec HDFS Read: 18146 HDFS Write: 1883 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 340 msec
OK
Time taken: 8.869 seconds
OK
31 59:36.3 Normal
18 59:36.3 Normal
26 59:35.9 Normal
14 59:35.8 Normal
27 59:35.6 Normal
Time taken: 0.094 seconds, Fetched: 5 row(s)
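The query above keeps only three of the twelve columns. The projection it performs can be sketched in Python over row tuples (a toy illustration, not how Hive executes it internally):

```python
# Column projection over row tuples, mimicking
# SELECT driverId, eventTime, eventType FROM truck_events_subset.
columns = ("driverId", "truckId", "eventTime", "eventType")
rows = [
    (31, 18, "59:36.3", "Normal"),
    (18, 16, "59:36.3", "Normal"),
]

wanted = ("driverId", "eventTime", "eventType")
indices = [columns.index(name) for name in wanted]
projected = [tuple(row[i] for i in indices) for row in rows]
print(projected)
```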
Writing the table to HDFS#
Next, the contents of the table are written to the /tmp/drivers/specific-columns
directory in HDFS.
[16]:
%%hive
INSERT OVERWRITE DIRECTORY '/tmp/drivers/specific-columns'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT
*
FROM
specific_columns;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517133348_0c6ba7f6-4d7e-4003-b9db-04b378c4c34b
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1652793922537_0006, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0006/
Kill Command = /opt/hadoop/bin/hadoop job -kill job_1652793922537_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-05-17 13:33:53,250 Stage-1 map = 0%, reduce = 0%
2022-05-17 13:33:57,366 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.31 sec
MapReduce Total cumulative CPU time: 1 seconds 310 msec
Ended Job = job_1652793922537_0006
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to directory hdfs://0.0.0.0:9000/tmp/drivers/specific-columns/.hive-staging_hive_2022-05-17_13-33-48_654_2103552764040297975-1/-ext-10000
Moving data to directory /tmp/drivers/specific-columns
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.31 sec HDFS Read: 5526 HDFS Write: 1800 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 310 msec
OK
Time taken: 9.792 seconds
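INSERT OVERWRITE DIRECTORY with FIELDS TERMINATED BY ',' produces plain comma-delimited text files such as 000000_0 above. A sketch of the equivalent flat-file writing in Python (the temporary path is ours; the file name 000000_0 follows the convention seen in the listing below):

```python
import os
import tempfile

rows = [(31, "59:36.3", "Normal"), (18, "59:36.3", "Normal")]

# Join each row's fields with commas, as FIELDS TERMINATED BY ',' does;
# the exported file has no header row and no quoting.
lines = [",".join(str(field) for field in row) for row in rows]

path = os.path.join(tempfile.mkdtemp(), "000000_0")
with open(path, "w") as fh:
    fh.write("\n".join(lines) + "\n")

with open(path) as fh:
    print(fh.read())
```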
[17]:
#
# Display the contents of the directory
#
!hdfs dfs -ls /tmp/drivers/specific-columns/
Found 1 items
-rwxr-xr-x 1 root supergroup 1800 2022-05-17 13:33 /tmp/drivers/specific-columns/000000_0
[18]:
#
# Display the last part of the file
#
!hdfs dfs -tail /tmp/drivers/specific-columns/000000_0
,59:29.6,Normal
13,59:29.5,Normal
27,59:29.3,Normal
17,59:29.2,Normal
12,59:29.1,Normal
15,59:28.8,Normal
16,59:28.8,Normal
13,59:28.5,Normal
23,59:28.4,Normal
11,59:28.3,Normal
30,59:28.0,Normal
24,59:27.9,Normal
25,59:27.8,Normal
28,59:27.7,Normal
27,59:27.7,Normal
13,59:27.6,Normal
23,59:27.4,Normal
25,59:27.0,Normal
26,59:27.0,Normal
28,59:26.9,Normal
10,59:26.8,Normal
22,59:26.6,Normal
23,59:26.6,Normal
25,59:26.2,Normal
27,59:25.9,Normal
19,59:25.9,Normal
13,59:25.9,Normal
21,59:25.7,Normal
16,59:25.3,Normal
26,59:25.2,Normal
19,59:25.1,Normal
18,59:25.0,Normal
22,59:25.0,Normal
29,59:24.7,Normal
25,59:24.3,Normal
24,59:24.3,Normal
32,59:24.2,Normal
22,59:24.2,Normal
14,59:24.2,Normal
25,59:23.5,Normal
31,59:23.5,Normal
16,59:23.4,Normal
15,59:23.4,Normal
28,59:23.3,Normal
14,59:23.3,Normal
17,59:23.2,Normal
27,59:22.6,Normal
32,59:22.5,Normal
20,59:22.5,Normal
11,59:22.5,Normal
23,59:22.4,Normal
22,59:22.3,Normal
22,59:21.7,Normal
11,59:21.7,Normal
27,59:21.7,Normal
18,59:21.7,Normal
14,59:21.4,Normal
Finally, the tables used in this and related tutorials are dropped, and the local files are removed.
[19]:
%%hive
DROP TABLE drivers;
DROP TABLE specific_columns;
DROP TABLE temp_drivers;
DROP TABLE temp_timesheet;
DROP TABLE timesheet;
DROP TABLE truck_events;
DROP TABLE truck_events_subset;
OK
Time taken: 0.051 seconds
OK
Time taken: 0.063 seconds
OK
Time taken: 0.02 seconds
OK
Time taken: 0.018 seconds
OK
Time taken: 0.017 seconds
OK
Time taken: 0.059 seconds
OK
Time taken: 0.06 seconds
[20]:
!rm *.csv *.log
[21]:
%quit