Basic operations in Hive#

  • Last modified: May 17, 2022

This tutorial is based on https://es.hortonworks.com/tutorial/beginners-guide-to-apache-pig/

This tutorial illustrates:

  • Loading data.

  • Basic querying.

  • Exporting results.

Cell magic %%hive#
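
The following cell defines an IPython cell magic that forwards the contents of any %%hive cell, line by line, to a Hive CLI session controlled with pexpect, and prints the output returned by Hive while suppressing the echoed input lines.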

[1]:
from IPython.core.magic import Magics, cell_magic, line_magic, magics_class
from pexpect import spawn

TIMEOUT = 60
PROG = "hive"
PROMPT = ["\r\n    > ", "\r\nhive> "]
QUIT = "quit;"


@magics_class
class Magic(Magics):
    def __init__(self, shell):
        super().__init__(shell)
        # Launch the Hive CLI and wait for its first prompt.
        self.app = spawn(PROG, timeout=TIMEOUT)
        self.app.expect(PROMPT)

    @cell_magic
    def hive(self, line, cell):
        # Send each non-empty line of the cell to the Hive CLI and
        # print everything Hive returns, skipping the echoed input lines.
        cell_lines = [cell_line.strip() for cell_line in cell.split("\n")]
        cell_lines = [cell_line for cell_line in cell_lines if cell_line != ""]
        for cell_line in cell_lines:
            self.app.sendline(cell_line)
            self.app.expect(PROMPT, timeout=TIMEOUT)
            output = self.app.before.decode()
            output = output.replace("\r\n", "\n")
            output = output.split("\n")
            output = [output_line.strip() for output_line in output]
            for output_line in output:
                if output_line not in cell_lines:
                    print(output_line)
        return None

    @line_magic
    def quit(self, line):
        # Terminate the Hive CLI session.
        self.app.sendline(QUIT)


def load_ipython_extension(ip):
    ip.register_magics(Magic(ip))


load_ipython_extension(ip=get_ipython())

Preparation#

The data file is downloaded into the current directory. Then, the /tmp/drivers folder is created in the Hadoop file system (HDFS) and the file is copied into it.

[2]:
!wget https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/truck_event_text_partition.csv
--2022-05-17 13:29:26--  https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/truck_event_text_partition.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2272077 (2.2M) [text/plain]
Saving to: ‘truck_event_text_partition.csv’

truck_event_text_pa 100%[===================>]   2.17M  4.68MB/s    in 0.5s

2022-05-17 13:29:27 (4.68 MB/s) - ‘truck_event_text_partition.csv’ saved [2272077/2272077]

[3]:
#
# Remove the folder if it already exists
#
!hdfs dfs -rm -r -f /tmp/drivers

#
# Create the drivers folder in HDFS
#
!hdfs dfs -mkdir /tmp/drivers

#
# Copy the file to HDFS
#
!hdfs dfs -copyFromLocal truck_event_text_partition.csv  /tmp/drivers/truck_event_text_partition.csv

#
# List the files in HDFS to verify
# that they were copied correctly.
#
!hdfs dfs -ls /tmp/drivers/*
-rw-r--r--   1 root supergroup    2272077 2022-05-17 13:30 /tmp/drivers/truck_event_text_partition.csv

Loading the driver events data#

The following code creates the driver events table in the system; the first statement drops the table if it already exists. Note that the row fields must be declared as comma-delimited so that Hive imports them correctly, and that the skip.header.line.count property makes Hive ignore the CSV header line.

[5]:
%%hive
DROP TABLE IF EXISTS truck_events;

CREATE TABLE truck_events (driverId       INT,
                           truckId        INT,
                           eventTime      STRING,
                           eventType      STRING,
                           longitude      DOUBLE,
                           latitude       DOUBLE,
                           eventKey       STRING,
                           correlationId  STRING,
                           driverName     STRING,
                           routeId        BIGINT,
                           routeName      STRING,
                           eventDate      STRING)

ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");
OK
Time taken: 0.061 seconds
OK
Time taken: 0.485 seconds

The tables that exist in the database are checked.

[6]:
%%hive
SHOW TABLES;
OK
docs
truck_events
word_counts
Time taken: 0.03 seconds, Fetched: 3 row(s)
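
When the database contains many tables, the listing can be filtered with a pattern; a minimal sketch (Hive's SHOW TABLES accepts a quoted wildcard pattern):

%%hive
SHOW TABLES LIKE 'truck*';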

The detailed creation information for the truck_events table is shown next.

[7]:
%%hive
SHOW CREATE TABLE truck_events;
OK
CREATE TABLE `truck_events`(
`driverid` int,
`truckid` int,
`eventtime` string,
`eventtype` string,
`longitude` double,
`latitude` double,
`eventkey` string,
`correlationid` string,
`drivername` string,
`routeid` bigint,
`routename` string,
`eventdate` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://0.0.0.0:9000/user/hive/warehouse/truck_events'
TBLPROPERTIES (
'skip.header.line.count'='1',
'transient_lastDdlTime'='1652794284')
Time taken: 0.19 seconds, Fetched: 27 row(s)

The fields and their data types can also be displayed with the DESCRIBE command.

[8]:
%%hive
DESCRIBE truck_events;
OK
driverid                int
truckid                 int
eventtime               string
eventtype               string
longitude               double
latitude                double
eventkey                string
correlationid           string
drivername              string
routeid                 bigint
routename               string
eventdate               string
Time taken: 0.045 seconds, Fetched: 12 row(s)
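
For additional detail, such as the table location, SerDe, and statistics, Hive also provides DESCRIBE FORMATTED; a sketch:

%%hive
DESCRIBE FORMATTED truck_events;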

Data loading#

The data is loaded with the following statement.

[9]:
%%hive
LOAD DATA INPATH '/tmp/drivers/truck_event_text_partition.csv' OVERWRITE
INTO TABLE truck_events;
Loading data to table default.truck_events
OK
Time taken: 0.536 seconds
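
Note that LOAD DATA INPATH moves (rather than copies) the file from /tmp/drivers into the table's warehouse directory. To load a file directly from the local file system, Hive also supports the LOCAL keyword; a sketch, assuming the CSV file is still in the current working directory:

%%hive
LOAD DATA LOCAL INPATH 'truck_event_text_partition.csv'
OVERWRITE INTO TABLE truck_events;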

The table properties are checked after loading the data.

[10]:
%%hive
SHOW TBLPROPERTIES truck_events;
OK
numFiles        1
numRows 0
rawDataSize     0
skip.header.line.count  1
totalSize       2272077
transient_lastDdlTime   1652794303
Time taken: 0.046 seconds, Fetched: 6 row(s)
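
numRows and rawDataSize are reported as 0 because Hive has not computed statistics for the table. If needed, they can be populated with ANALYZE TABLE; a sketch:

%%hive
ANALYZE TABLE truck_events COMPUTE STATISTICS;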

Viewing the data#

The data is viewed through SELECT queries.

[11]:
%%hive
SELECT * FROM truck_events LIMIT 10;
OK
14      25      59:21.4 Normal  -94.58  37.03   14|25|9223370572464814373       3.66E+18        Adis Cesir      160405074       Joplin to Kansas City Route 2   2016-05-27-22
18      16      59:21.7 Normal  -89.66  39.78   18|16|9223370572464814089       3.66E+18        Grant Liu       1565885487      Springfield to KC Via Hanibal   2016-05-27-22
27      105     59:21.7 Normal  -90.21  38.65   27|105|9223370572464814070      3.66E+18        Mark Lochbihler 1325562373      Springfield to KC Via Columbia Route 2  2016-05-27-22
11      74      59:21.7 Normal  -90.2   38.65   11|74|9223370572464814123       3.66E+18        Jamie Engesser  1567254452      Saint Louis to Memphis Route2   2016-05-27-22
22      87      59:21.7 Normal  -90.04  35.19   22|87|9223370572464814101       3.66E+18        Nadeem Asghar   1198242881       Saint Louis to Chicago Route2  2016-05-27-22
22      87      59:22.3 Normal  -90.37  35.21   22|87|9223370572464813486       3.66E+18        Nadeem Asghar   1198242881       Saint Louis to Chicago Route2  2016-05-27-22
23      68      59:22.4 Normal  -89.91  40.86   23|68|9223370572464813450       3.66E+18        Adam Diaz       160405074       Joplin to Kansas City Route 2   2016-05-27-22
11      74      59:22.5 Normal  -89.74  39.1    11|74|9223370572464813355       3.66E+18        Jamie Engesser  1567254452      Saint Louis to Memphis Route2   2016-05-27-22
20      41      59:22.5 Normal  -93.36  41.69   20|41|9223370572464813344       3.66E+18        Chris Harris    160779139       Des Moines to Chicago Route 2   2016-05-27-22
32      42      59:22.5 Normal  -90.37  35.21   32|42|9223370572464813296       3.66E+18        Ryan Templeton  1090292248      Peoria to Ceder Rapids Route 2  2016-05-27-22
Time taken: 1.051 seconds, Fetched: 10 row(s)
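
SELECT also admits the usual projections and filters; for example, the following sketch (assuming eventType takes values other than 'Normal' in the full dataset) lists drivers with abnormal events:

%%hive
SELECT driverName, eventType
FROM truck_events
WHERE eventType <> 'Normal'
LIMIT 10;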

Extracting a subset of the data#

In Hive, a subset of the data can be extracted and stored in a new table by means of a query that returns the desired records. The following code creates the table truck_events_subset with 100 records from the truck_events table; note that, since the query has no ORDER BY, LIMIT returns an arbitrary set of 100 rows.

[12]:
%%hive
DROP TABLE IF EXISTS truck_events_subset;

CREATE TABLE truck_events_subset
AS
    SELECT *
    FROM truck_events
    LIMIT 100;
OK
Time taken: 0.022 seconds
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517133245_ae2e9ed4-9c7e-4fe6-a66d-6e6cd8d8908e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1652793922537_0003, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0003/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-05-17 13:32:52,118 Stage-1 map = 0%,  reduce = 0%
2022-05-17 13:32:55,316 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.42 sec
2022-05-17 13:33:00,452 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.98 sec
MapReduce Total cumulative CPU time: 2 seconds 980 msec
Ended Job = job_1652793922537_0003
Moving data to directory hdfs://0.0.0.0:9000/user/hive/warehouse/truck_events_subset
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.98 sec   HDFS Read: 28712 HDFS Write: 13676 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 980 msec
OK
Time taken: 15.862 seconds

The previous code is equivalent to the following, where LIKE is used in CREATE TABLE to indicate that the new table truck_events_subset has the same structure as the existing table truck_events, and the rows are then copied with INSERT OVERWRITE.

[13]:
%%hive

DROP TABLE IF EXISTS truck_events_subset;

CREATE TABLE truck_events_subset LIKE truck_events;

INSERT OVERWRITE TABLE truck_events_subset
SELECT
    *
FROM
    truck_events
LIMIT
    100;
OK
Time taken: 0.277 seconds
OK
Time taken: 0.063 seconds
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517133302_8608c166-9f0a-4c91-ab65-14d3becad10d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1652793922537_0004, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0004/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-05-17 13:33:11,640 Stage-1 map = 0%,  reduce = 0%
2022-05-17 13:33:14,739 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.36 sec
2022-05-17 13:33:19,866 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.65 sec
MapReduce Total cumulative CPU time: 2 seconds 650 msec
Ended Job = job_1652793922537_0004
Loading data to table default.truck_events_subset
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.65 sec   HDFS Read: 29426 HDFS Write: 13676 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 650 msec
OK
Time taken: 18.434 seconds
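
The contents of the new table are then verified.
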
[14]:
%%hive
SELECT * FROM truck_events_subset LIMIT 5;
OK
31      18      59:36.3 Normal  -94.58  37.03   31|18|9223370572464799462       3.66E+18        Rommel Garcia   1594289134      Memphis to Little Rock Route 2  2016-05-27-22
18      16      59:36.3 Normal  -92.42  39.76   18|16|9223370572464799486       3.66E+18        Grant Liu       1565885487      Springfield to KC Via Hanibal   2016-05-27-22
26      57      59:35.9 Normal  -92.74  37.6    26|57|9223370572464799895       3.66E+18        Michael Aube    1325712174      Saint Louis to Tulsa Route2     2016-05-27-22
14      25      59:35.8 Normal  -94.46  37.16   14|25|9223370572464800006       3.66E+18        Adis Cesir      160405074       Joplin to Kansas City Route 2   2016-05-27-22
27      105     59:35.6 Normal  -92.85  38.93   27|105|9223370572464800175      3.66E+18        Mark Lochbihler 1325562373      Springfield to KC Via Columbia Route 2  2016-05-27-22
Time taken: 0.105 seconds, Fetched: 5 row(s)

Extracting a subset of columns#

The following code extracts some columns from the truck_events_subset table and stores them in a different table.

[15]:
%%hive

DROP TABLE IF EXISTS specific_columns;

CREATE TABLE specific_columns
AS
    SELECT
        driverId,
        eventTime,
        eventType
    FROM
        truck_events_subset;

SELECT * FROM specific_columns LIMIT 5;
OK
Time taken: 0.025 seconds
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517133328_bee0c4dd-b470-4137-ab82-65ae82d111c7
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1652793922537_0005, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0005/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-05-17 13:33:32,359 Stage-1 map = 0%,  reduce = 0%
2022-05-17 13:33:36,475 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.34 sec
MapReduce Total cumulative CPU time: 1 seconds 340 msec
Ended Job = job_1652793922537_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://0.0.0.0:9000/user/hive/warehouse/.hive-staging_hive_2022-05-17_13-33-28_831_1063867166954300449-1/-ext-10002
Moving data to directory hdfs://0.0.0.0:9000/user/hive/warehouse/specific_columns
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.34 sec   HDFS Read: 18146 HDFS Write: 1883 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 340 msec
OK
Time taken: 8.869 seconds
OK
31      59:36.3 Normal
18      59:36.3 Normal
26      59:35.9 Normal
14      59:35.8 Normal
27      59:35.6 Normal
Time taken: 0.094 seconds, Fetched: 5 row(s)
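
With this reduced table, basic aggregations work as usual; for instance, counting events per driver (a minimal sketch):

%%hive
SELECT driverId, COUNT(*) AS n_events
FROM specific_columns
GROUP BY driverId;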

Writing the table to HDFS#

Next, the contents of the table are written to the /tmp/drivers/specific-columns directory in HDFS.

[16]:
%%hive
INSERT OVERWRITE DIRECTORY '/tmp/drivers/specific-columns'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT
    *
FROM
    specific_columns;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517133348_0c6ba7f6-4d7e-4003-b9db-04b378c4c34b
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1652793922537_0006, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0006/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-05-17 13:33:53,250 Stage-1 map = 0%,  reduce = 0%
2022-05-17 13:33:57,366 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.31 sec
MapReduce Total cumulative CPU time: 1 seconds 310 msec
Ended Job = job_1652793922537_0006
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to directory hdfs://0.0.0.0:9000/tmp/drivers/specific-columns/.hive-staging_hive_2022-05-17_13-33-48_654_2103552764040297975-1/-ext-10000
Moving data to directory /tmp/drivers/specific-columns
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.31 sec   HDFS Read: 5526 HDFS Write: 1800 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 310 msec
OK
Time taken: 9.792 seconds
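
To export to the local file system instead of HDFS, Hive also accepts INSERT OVERWRITE LOCAL DIRECTORY; a sketch, writing to a hypothetical local folder also named specific-columns:

%%hive
INSERT OVERWRITE LOCAL DIRECTORY 'specific-columns'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM specific_columns;
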
[17]:
#
# List the contents of the directory
#
!hdfs dfs -ls /tmp/drivers/specific-columns/
Found 1 items
-rwxr-xr-x   1 root supergroup       1800 2022-05-17 13:33 /tmp/drivers/specific-columns/000000_0
[18]:
#
# Display the last part of the file
#
!hdfs dfs -tail /tmp/drivers/specific-columns/000000_0
,59:29.6,Normal
13,59:29.5,Normal
27,59:29.3,Normal
17,59:29.2,Normal
12,59:29.1,Normal
15,59:28.8,Normal
16,59:28.8,Normal
13,59:28.5,Normal
23,59:28.4,Normal
11,59:28.3,Normal
30,59:28.0,Normal
24,59:27.9,Normal
25,59:27.8,Normal
28,59:27.7,Normal
27,59:27.7,Normal
13,59:27.6,Normal
23,59:27.4,Normal
25,59:27.0,Normal
26,59:27.0,Normal
28,59:26.9,Normal
10,59:26.8,Normal
22,59:26.6,Normal
23,59:26.6,Normal
25,59:26.2,Normal
27,59:25.9,Normal
19,59:25.9,Normal
13,59:25.9,Normal
21,59:25.7,Normal
16,59:25.3,Normal
26,59:25.2,Normal
19,59:25.1,Normal
18,59:25.0,Normal
22,59:25.0,Normal
29,59:24.7,Normal
25,59:24.3,Normal
24,59:24.3,Normal
32,59:24.2,Normal
22,59:24.2,Normal
14,59:24.2,Normal
25,59:23.5,Normal
31,59:23.5,Normal
16,59:23.4,Normal
15,59:23.4,Normal
28,59:23.3,Normal
14,59:23.3,Normal
17,59:23.2,Normal
27,59:22.6,Normal
32,59:22.5,Normal
20,59:22.5,Normal
11,59:22.5,Normal
23,59:22.4,Normal
22,59:22.3,Normal
22,59:21.7,Normal
11,59:21.7,Normal
27,59:21.7,Normal
18,59:21.7,Normal
14,59:21.4,Normal
[19]:
%%hive
DROP TABLE drivers;
DROP TABLE specific_columns;
DROP TABLE temp_drivers;
DROP TABLE temp_timesheet;
DROP TABLE timesheet;
DROP TABLE truck_events;
DROP TABLE truck_events_subset;
OK
Time taken: 0.051 seconds
OK
Time taken: 0.063 seconds
OK
Time taken: 0.02 seconds
OK
Time taken: 0.018 seconds
OK
Time taken: 0.017 seconds
OK
Time taken: 0.059 seconds
OK
Time taken: 0.06 seconds
[20]:
!rm *.csv *.log
[21]:
%quit