Manejo de Datos Complejos y Tablas Particionadas#

Última modificación: Mayo 17, 2022 | YouTube

A diferencia de los gestores tradicionales de bases de datos, Hive permite la creación de datos complejos que pueden ser almacenados en los campos de las tablas. Debido a que las tablas pueden ser muy grandes, Hive también permite hacer la partición de las tablas en archivos separados a partir uno o más campos claves optimizando las consultas, ya que las búsquedas se pueden realizar sobre particiones específicas de una tabla.

En este tutorial se ejemplifica:

La creación de un archivo de texto con datos complejos para ser cargado y consultado en Hive.
La creación de tablas con datos complejos.
La construcción de consultas sobre campos complejos.
La creación de tablas particionadas y la inserción de registros en ellas.

Cell magic `%%hive`#

[1]:

from IPython.core.magic import Magics, cell_magic, line_magic, magics_class
from pexpect import spawn

TIMEOUT = 60
PROG = "hive"
PROMPT = ["\r\n    > ", "\r\nhive> "]
QUIT = "quit;"


@magics_class
class Magic(Magics):
    def __init__(self, shell):
        super().__init__(shell)
        self.app = spawn(PROG, timeout=60)
        self.app.expect(PROMPT)

    @cell_magic
    def hive(self, line, cell):
        cell_lines = [cell_line.strip() for cell_line in cell.split("\n")]
        cell_lines = [cell_line for cell_line in cell_lines if cell_line != ""]
        for cell_line in cell_lines:
            self.app.sendline(cell_line)
            self.app.expect(PROMPT, timeout=TIMEOUT)
            output = self.app.before.decode()
            output = output.replace("\r\n", "\n")
            output = output.split("\n")
            output = [output_line.strip() for output_line in output]
            for output_line in output:
                if output_line not in cell_lines:
                    print(output_line)
        return None

    @line_magic
    def quit(self, line):
        self.app.sendline(QUIT)


def load_ipython_extension(ip):
    ip.register_magics(Magic(ip))


load_ipython_extension(ip=get_ipython())

Información utilizada#

La información utilizada corresponde al ranking de popularidad de lenguajes de programación publicado por IEEE en sus sitios web:

El archivo presentado a continuación fue construído manualmente para propósitos ilustrativos. Las columnas son:

El nombre del lenguaje de programación.
La puntuación obtenida para cada año entre 2014 y 2018.
Las categorías en que aparece el lenguaje (Web, Mobile, Enterprise, Embedded).

El formato es explicado más adelante.

[2]:

!wget https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/language-score.tsv

--2022-05-17 16:17:05--  https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/language-score.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3490 (3.4K) [text/plain]
Saving to: ‘language-score.tsv’

language-score.tsv  100%[===================>]   3.41K  --.-KB/s    in 0.002s

2022-05-17 16:17:05 (1.86 MB/s) - ‘language-score.tsv’ saved [3490/3490]

[3]:

!cat language-score.tsv

ABAP    2018#22.8,2017#28.9,2016#15.9,2015#24.6,2014#20.8       Enterprise
Actionscript    2018#0.6,2017#0.0,2016#3.3,2015#6.8,2014#22.0   Web,Mobile
Ada     2018#20.9,2017#29.9,2016#20.5,2015#32.8,2014#30.1       Enterprise,Embedded
Arduino 2018#69.0,2017#74.4,2016#69.5,2015#63.0,2014#62.9       Embedded
Assembly        2018#74.1,2017#73.4,2016#68.0,2015#67.9,2014#66.9       Embedded
C       2018#96.7,2017#99.7,2016#100.0,2015#99.9,2014#99.2      Mobile,Enterprise,Embedded
C#      2018#89.4,2017#88.8,2016#86.4,2015#91.8,2014#92.2       Web,Mobile,Enterprise
C++     2018#99.7,2017#97.4,2016#95.8,2015#99.6,2014#95.5       Mobile, Enterprise, Embedded
Clojure 2018#25.6,2017#28.8,2016#22.0,2015#30.9,2014#36.0       Web,Enterprise
Cobol   2018#24.6,2017#29.2,2016#17.9,2015#19.0,2014#30.6       Enterprise
Cuda    2018#43.0,2017#56.5,2016#52.3,2015#48.6,2014#   Enterprise
D       2018#40.6,2017#41.5,2016#37.7,2015#44.6,2014#49.9       Web,Embedded
Delphi  2018#38.7,2017#,2016#41.7,2015#39.4,2014#40.2   Mobile,Enterprise
Erlang  2018#26.9,2017#31.5,2016#27.0,2015#30.6,2014#32.1       Enterprise,Embedded
Forth   2018#0.0,2017#5.7,2016#0.0,2015#8.5,2014#2.3    Embedded
Fortran 2018#49.5,2017#43.2,2016#40.9,2015#39.7,2014#45.5       Enterprise
Go      2018#76.4,2017#77.2,2016#71.5,2015#67.9,2014#59.8       Web,Enterprise
Haskell 2018#48.6,2017#50.8,2016#43.0,2015#39.2,2014#44.8       Enterprise,Embedded
HTML    2018#71.2,2017#68.0,2016#66.7,2015#54.3,2014#64.1       Web
J       2018#18.1,2017#14.2,2016#10.4,2015#7.7,2014#13.6        Enterprise
Java    2018#97.5,2017#99.4,2016#98.1,2015#100,2014#100 Web,Mobile,Enterprise
JavaScript      2018#82.6,2017#86.2,2016#81.9,2015#83.0,2014#84.9       Web,Mobile
Julia   2018#35.1,2017#40.6,2016#31.4,2015#17.4,2014#   Enterprise
LabView 2018#32.7,2017#36.4,2016#32.6,2015#30.9 Enterprise,Embedded
Ladder Logic    2018#11.5,2017#6.0,2016#26.2,2015#21.8,2014#20.2        Embedded
Lisp    2018#33.3,2017#33.9,2016#33.6,2015#40.4,2014#43.7       Enterprise
Lua     2018#49.8,2017#54.9,2016#50.9,2015#45.9,2014#47.8       Web,Enterprise
Matlab  2018#72.8,2017#70.4,2016#68.7,2015#72.4,2014#73.1       Enterprise
Objective-C     2018#50.5,2017#48.4,2016#52.4,2015#58.7,2014#63.7       Mobile,Enterprise
Ocaml   2018#14.4,2017#13.7,2016#3.2,2015#0.0,2014#18.5 Web,Enterprise
Perl    2018#57.4,2017#59.1,2016#57.5,2015#66.9,2014#69.6       Web,Enterprise
PHP     2018#84.9,2017#82.3,2016#82.4,2015#84.5,2014#84.6       Web
Processing      2018#53.1,2017#52.7,2016#50.3,2015#49.2,2014#52.0       Web,Enterprise
Prolog  2018#33.2,2017#36.5,2016#25.0,2015#23.2,2014#38.6       Enterprise
Python  2018#100.0,2017#100,2016#97.9,2015#95.8,2014#93.4       Web,Enterprise,Embedded
R       2018#82.9,2017#88.8,2016#87.7,2015#84.7,2014#74.1       Enterprise
Ruby    2018#71.4,2017#73.9,2016#74.0,2015#75.3,2014#79.3       Web,Enterprise
Rust    2018#41.8,2017#55.0,2016#42.2,2015#29.6,2014#   Web,Enterprise
SAS     2018#25.6,2017#30.7,2016#21.7,2015#43.7,2014#37.1       Enterprise
Scala   2018#72.1,2017#70.0,2016#66.3,2015#62.4,2014#62.7       Web,Mobile
Scheme  2018#18.8,2017#24.5,2016#14.5,2015#14.8,2014#27.8       Mobile,Enterprise
Shell   2018#66.1,2017#68.6,2016#52.7,2015#71.4,2014#63.3       Enterprise
SQL     2018#49.3,2017#52.8,2016#48.8,2015#70.9,2014#71.6       Enterprise
Swift   2018#53.9,2017#76.5,2016#67.6,2015#62.4 Mobile,Enterprise
TCL     2018#21.9,2017#16.2,2016#8.7,2015#11.6,2014#23.1        Enterprise,Embedded
Verilog 2018#41.2,2017#35.6,2016#25.1,2015#23.7,2014#33.8       Embedded
VHDL    2018#45.4,2017#39.9,2016#33.3,2015#35.4,2014#42.6       Embedded
Visual Basic    2018#45.1,2017#58.9,2016#55.7,2015#63.4,2014#66.0       Enterprise
ASP.NET 2014#35.9       Web
CoffeeScript    2014#25.5       Web
Apex Code       2014#18.0       Enterprise
Eiffel  2014#6.4        Enterprise
Scilab  2014#1.0        Enterprise
NetLogo 2014#0.0        Enterprise

Preparación#

[4]:

#
# Se mueve el archivo al HDFS
#
!hdfs dfs -copyFromLocal language-score.tsv /tmp/language-score.tsv

Creación de la tabla `journals`#

A continuación se crea la tabla para almacenar la información, con las siguientes características (revise el archivo languagescore.tsv):

El campo language es un STRING que contiene el nombre del lenguaje de programación.
El campo ranking es un MAP que tiene como clave el año (tipo INT) y como valor asociado el puntaje (popularidad) de dicho año (tipo FLOAT).
El campo categories es un ARRAY conformado por una lista de STRINGS separados por comas que corresponden a las categorías a las que está asociado el lenguaje de programación.
Los campos están delimitados por tabulador (\t).
Los registros (líneas) están terminados por \n.
Los elementos (COLLECTION ITEMS) del ARRAY (campo categories) y del MAP (campo ranking) están separados entre si por ,.
Para los campos tipo MAP, las claves y sus valores asociados están separados por #.
La tabla se almacenará en la carpeta /tmp/hive-journals del HDFS como un archivo de texto.

[5]:

%%hive
DROP TABLE IF EXISTS languages;

CREATE TABLE languages (
    language   STRING,
    ranking    MAP<INT,FLOAT>,
    categories ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY '#'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/tmp/hive-journals';

OK
Time taken: 5.548 seconds
OK
Time taken: 0.409 seconds

Carga de datos#

Se cargan los datos en la forma usual.

[6]:

%%hive
LOAD DATA INPATH '/tmp/language-score.tsv' INTO TABLE languages;

Loading data to table default.languages
OK
Time taken: 0.781 seconds

Se muestran los primeros 10 registros para verificar que la tabla fue cargada exitósamente. Note que los campos MAP (columna ranking), aparecen delimitados por { y }, donde las parejas clave-valor están separadas entre si por comas, y la separación entre la clave y el valor se hace mediante :. Los ARRAY están delimitados por [ y ].

[7]:

%%hive
SELECT * FROM languages LIMIT 10;

OK
ABAP    {2018:22.8,2017:28.9,2016:15.9,2015:24.6,2014:20.8}     ["Enterprise"]
Actionscript    {2018:0.6,2017:0.0,2016:3.3,2015:6.8,2014:22.0} ["Web","Mobile"]
Ada     {2018:20.9,2017:29.9,2016:20.5,2015:32.8,2014:30.1}     ["Enterprise","Embedded"]
Arduino {2018:69.0,2017:74.4,2016:69.5,2015:63.0,2014:62.9}     ["Embedded"]
Assembly        {2018:74.1,2017:73.4,2016:68.0,2015:67.9,2014:66.9}     ["Embedded"]
C       {2018:96.7,2017:99.7,2016:100.0,2015:99.9,2014:99.2}    ["Mobile","Enterprise","Embedded"]
C#      {2018:89.4,2017:88.8,2016:86.4,2015:91.8,2014:92.2}     ["Web","Mobile","Enterprise"]
C++     {2018:99.7,2017:97.4,2016:95.8,2015:99.6,2014:95.5}     ["Mobile"," Enterprise"," Embedded"]
Clojure {2018:25.6,2017:28.8,2016:22.0,2015:30.9,2014:36.0}     ["Web","Enterprise"]
Cobol   {2018:24.6,2017:29.2,2016:17.9,2015:19.0,2014:30.6}     ["Enterprise"]
Time taken: 1.041 seconds, Fetched: 10 row(s)

Almacenamiento de la tabla en el HDFS#

La tabla es almacenada en la carpeta /tmp/hive-journals, tal como se indicó en su creación.

[8]:

!hdfs dfs -ls /tmp/hive-journals/

Found 1 items
-rwxrwxrwx   1 root supergroup       3490 2022-05-17 16:17 /tmp/hive-journals/language-score.tsv

La tabla ahora es administrada por Hive y conserva el mismo formato que se indicó en su creación.

[9]:

!hdfs dfs -tail /tmp/hive-journals/language-score.tsv

8.8,2016#87.7,2015#84.7,2014#74.1       Enterprise
Ruby    2018#71.4,2017#73.9,2016#74.0,2015#75.3,2014#79.3       Web,Enterprise
Rust    2018#41.8,2017#55.0,2016#42.2,2015#29.6,2014#   Web,Enterprise
SAS     2018#25.6,2017#30.7,2016#21.7,2015#43.7,2014#37.1       Enterprise
Scala   2018#72.1,2017#70.0,2016#66.3,2015#62.4,2014#62.7       Web,Mobile
Scheme  2018#18.8,2017#24.5,2016#14.5,2015#14.8,2014#27.8       Mobile,Enterprise
Shell   2018#66.1,2017#68.6,2016#52.7,2015#71.4,2014#63.3       Enterprise
SQL     2018#49.3,2017#52.8,2016#48.8,2015#70.9,2014#71.6       Enterprise
Swift   2018#53.9,2017#76.5,2016#67.6,2015#62.4 Mobile,Enterprise
TCL     2018#21.9,2017#16.2,2016#8.7,2015#11.6,2014#23.1        Enterprise,Embedded
Verilog 2018#41.2,2017#35.6,2016#25.1,2015#23.7,2014#33.8       Embedded
VHDL    2018#45.4,2017#39.9,2016#33.3,2015#35.4,2014#42.6       Embedded
Visual Basic    2018#45.1,2017#58.9,2016#55.7,2015#63.4,2014#66.0       Enterprise
ASP.NET 2014#35.9       Web
CoffeeScript    2014#25.5       Web
Apex Code       2014#18.0       Enterprise
Eiffel  2014#6.4        Enterprise
Scilab  2014#1.0        Enterprise
NetLogo 2014#0.0        Enterprise

Popularidad por año#

En la siguiente consulta se obtienen los diez lenguajes de programación más populares para el año 2017. En Hive no es posible aplicar ORDER BY sobre un valor de un MAP, por lo que se crea una subconsulta que crea una tabla, llamada table1, en donde el ranking para el año 2017 se nombra como ranking2017 con el fin de poder realizar el ordenamiento.

[10]:

%%hive
SELECT
    language,
    ranking2017
FROM (
    SELECT
        language,
        ranking[2017] AS ranking2017
    FROM
        languages
) table1
ORDER BY
    ranking2017 DESC
LIMIT 10;

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517161720_b16a79a0-1833-4f5b-ae74-bfe877e39614
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1652793922537_0026, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0026/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0026
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-05-17 16:17:26,022 Stage-1 map = 0%,  reduce = 0%
2022-05-17 16:17:30,206 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.76 sec
2022-05-17 16:17:35,354 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.19 sec
MapReduce Total cumulative CPU time: 3 seconds 190 msec
Ended Job = job_1652793922537_0026
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.19 sec   HDFS Read: 11897 HDFS Write: 305 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 190 msec
OK
Python  100.0
C       99.7
Java    99.4
C++     97.4
R       88.8
C#      88.8
JavaScript      86.2
PHP     82.3
Go      77.2
Swift   76.5
Time taken: 15.816 seconds, Fetched: 10 row(s)

Popularidad por Categoría y por Año#

En la siguiente consulta se obtienen los lenguajes más populares para la categoría Mobile en el año 2017. En este caso, se usa la función array_contains(categories, 'Mobile') que retorna verdadero si Mobile aparece en categories. Las funciones disponibles pueden ser consultadas en https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions

[11]:

%%hive
SELECT
    language,
    ranking2018,
    ranking2017,
    ranking2016
FROM (
    SELECT
        language,
        ranking[2018] AS ranking2018,
        ranking[2017] AS ranking2017,
        ranking[2016] AS ranking2016
    FROM
        languages
    WHERE
        array_contains(categories, 'Mobile')
) table1
LIMIT 10;

OK
Actionscript    0.6     0.0     3.3
C       96.7    99.7    100.0
C#      89.4    88.8    86.4
C++     99.7    97.4    95.8
Delphi  38.7    NULL    41.7
Java    97.5    99.4    98.1
JavaScript      82.6    86.2    81.9
Objective-C     50.5    48.4    52.4
Scala   72.1    70.0    66.3
Scheme  18.8    24.5    14.5
Time taken: 0.204 seconds, Fetched: 10 row(s)

Expansión de la tabla por categoría#

Es posible generar un registro por cada valor de un campo complejo. En el siguiente ejemplo, se expande la tabla por cada valor del campo categories. El proceso de expansión es realizado mediante la instrucción LATERAL VIEW que genera un valor por cada categoría en el campo categories, el cual es combinado con el resto de los campos indicados de la instrucción SELECT.

[12]:

%%hive
SELECT
    language,
    ranking[2018],
    c0
FROM
    languages
LATERAL VIEW
    explode(categories) languages AS c0
LIMIT 10;

OK
ABAP    22.8    Enterprise
Actionscript    0.6     Web
Actionscript    0.6     Mobile
Ada     20.9    Enterprise
Ada     20.9    Embedded
Arduino 69.0    Embedded
Assembly        74.1    Embedded
C       96.7    Mobile
C       96.7    Enterprise
C       96.7    Embedded
Time taken: 0.083 seconds, Fetched: 10 row(s)

Expansión de la tabla por categoría y popularidad#

En el siguiente código, se genera la tabla detail que contiene las columnas:

Lenguaje
Año
Puntaje
Categoría

La tabla es llenada mediante la expansión de las columnas ranking y categories. Note que en este código, primero, se realiza una subconsulta para expandir por el campo categories y luego en la consulta principal se expande por el campo ranking;

[13]:

%%hive
DROP TABLE IF EXISTS detail;

CREATE TABLE detail
AS
    SELECT
        language,
        key AS year,
        value AS popularity,
        category
    FROM (
        SELECT
            language,
            ranking,
            category
        FROM
            languages
        LATERAL VIEW
            explode(categories) languages AS category
    ) t0
    LATERAL VIEW
        explode(ranking) t0
;

SELECT * FROM detail LIMIT 10;

OK
Time taken: 0.025 seconds
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517161739_b236835c-1691-41bf-bb57-a19ddc622dd8
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1652793922537_0027, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0027/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0027
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-05-17 16:17:45,253 Stage-1 map = 0%,  reduce = 0%
2022-05-17 16:17:49,549 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.53 sec
MapReduce Total cumulative CPU time: 1 seconds 530 msec
Ended Job = job_1652793922537_0027
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://0.0.0.0:9000/user/hive/warehouse/.hive-staging_hive_2022-05-17_16-17-39_393_5248961109886827483-1/-ext-10001
Moving data to directory hdfs://0.0.0.0:9000/user/hive/warehouse/detail
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.53 sec   HDFS Read: 9857 HDFS Write: 9876 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 530 msec
OK
Time taken: 11.42 seconds
OK
ABAP    2018    22.8    Enterprise
ABAP    2017    28.9    Enterprise
ABAP    2016    15.9    Enterprise
ABAP    2015    24.6    Enterprise
ABAP    2014    20.8    Enterprise
Actionscript    2018    0.6     Web
Actionscript    2017    0.0     Web
Actionscript    2016    3.3     Web
Actionscript    2015    6.8     Web
Actionscript    2014    22.0    Web
Time taken: 0.101 seconds, Fetched: 10 row(s)

La instrucción DESCRIBE permite ver los tipos de datos de la tabla.

[14]:

%%hive
DESCRIBE detail;

OK
language                string
year                    int
popularity              float
category                string
Time taken: 0.036 seconds, Fetched: 4 row(s)

Construcción de campos compuestos#

También es posible crear datos complejos a partir de los campos de una tabla. En la siguiente consulta, se construye un ARRAY mediante la función collect_set() agrupando las catetorias por lenguaje de la tabla detail.

[15]:

%%hive
SELECT
    language,
    collect_set(category)
FROM
    detail
GROUP BY
    language
LIMIT 10;

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517161751_ee55cd99-8e19-4e07-8e83-1d0eaf19dbe6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1652793922537_0028, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0028/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0028
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-05-17 16:18:00,072 Stage-1 map = 0%,  reduce = 0%
2022-05-17 16:18:04,176 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.43 sec
2022-05-17 16:18:08,304 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.01 sec
MapReduce Total cumulative CPU time: 3 seconds 10 msec
Ended Job = job_1652793922537_0028
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.01 sec   HDFS Read: 18995 HDFS Write: 426 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 10 msec
OK
ABAP    ["Enterprise"]
ASP.NET ["Web"]
Actionscript    ["Web","Mobile"]
Ada     ["Enterprise","Embedded"]
Apex Code       ["Enterprise"]
Arduino ["Embedded"]
Assembly        ["Embedded"]
C       ["Mobile","Enterprise","Embedded"]
C#      ["Web","Mobile","Enterprise"]
C++     ["Mobile"," Enterprise"," Embedded"]
Time taken: 17.831 seconds, Fetched: 10 row(s)

En la siguiente consulta, se construye un MAP a partir del año y la popularidad del lenguaje.

[16]:

%%hive
SELECT
    language,
    map(year, popularity) as p,
    category
FROM
    detail
LIMIT 10;

OK
ABAP    {2018:22.8}     Enterprise
ABAP    {2017:28.9}     Enterprise
ABAP    {2016:15.9}     Enterprise
ABAP    {2015:24.6}     Enterprise
ABAP    {2014:20.8}     Enterprise
Actionscript    {2018:0.6}      Web
Actionscript    {2017:0.0}      Web
Actionscript    {2016:3.3}      Web
Actionscript    {2015:6.8}      Web
Actionscript    {2014:22.0}     Web
Time taken: 0.097 seconds, Fetched: 10 row(s)

Particionamiento#

Como ya se indicó, las tablas pueden partirse físicamente dependiendo del valor de uno o más campos usados como claves. Esto permite que las búsquedas sean más eficientes, ya que las consultas operan sobre las partficiones. En el siguiente código, se crea la tabla detailpart en que la partición está dada por el campo year.

[17]:

%%hive
DROP TABLE IF EXISTS detailpart;

CREATE TABLE detailpart (
    language   STRING,
    popularity DOUBLE,
    category   STRING
)
PARTITIONED BY (year INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY '#'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/tmp/hive-partitioned';
;

OK
Time taken: 0.021 seconds
OK
Time taken: 0.053 seconds

Seguidamente, se llena la tabla detailpart con la información almacenada en la tabla detail. Note que se indicar una instrucción INSERT OVERWRITE por cada valor posible de la partición (columna year).

[18]:

%%hive
FROM detail
INSERT OVERWRITE TABLE detailpart PARTITION(year=2018)
SELECT
    language,
    popularity,
    category
WHERE
    year=2018
INSERT OVERWRITE TABLE detailpart PARTITION(year=2017)
SELECT
    language,
    popularity,
    category
WHERE
    year=2017
INSERT OVERWRITE TABLE detailpart PARTITION(year=2016)
SELECT
    language,
    popularity,
    category
WHERE
    year=2016
INSERT OVERWRITE TABLE detailpart PARTITION(year=2015)
SELECT
    language,
    popularity,
    category
WHERE
    year=2015
INSERT OVERWRITE TABLE detailpart PARTITION(year=2014)
SELECT
    language,
    popularity,
    category
WHERE
    year=2014
;

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517161812_09187314-f24b-4ba3-8811-523b83ca14d7
Total jobs = 11
Launching Job 1 out of 11
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1652793922537_0029, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0029/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0029
Hadoop job information for Stage-5: number of mappers: 1; number of reducers: 0
2022-05-17 16:18:19,471 Stage-5 map = 0%,  reduce = 0%
2022-05-17 16:18:23,594 Stage-5 map = 100%,  reduce = 0%, Cumulative CPU 2.5 sec
MapReduce Total cumulative CPU time: 2 seconds 500 msec
Ended Job = job_1652793922537_0029
Stage-8 is selected by condition resolver.
Stage-7 is filtered out by condition resolver.
Stage-9 is filtered out by condition resolver.
Stage-14 is selected by condition resolver.
Stage-13 is filtered out by condition resolver.
Stage-15 is filtered out by condition resolver.
Stage-20 is selected by condition resolver.
Stage-19 is filtered out by condition resolver.
Stage-21 is filtered out by condition resolver.
Stage-26 is selected by condition resolver.
Stage-25 is filtered out by condition resolver.
Stage-27 is filtered out by condition resolver.
Stage-32 is selected by condition resolver.
Stage-31 is filtered out by condition resolver.
Stage-33 is filtered out by condition resolver.
Moving data to directory hdfs://0.0.0.0:9000/tmp/hive-partitioned/year=2018/.hive-staging_hive_2022-05-17_16-18-12_776_6637532130254074137-1/-ext-10000
Moving data to directory hdfs://0.0.0.0:9000/tmp/hive-partitioned/year=2017/.hive-staging_hive_2022-05-17_16-18-12_776_6637532130254074137-1/-ext-10002
Moving data to directory hdfs://0.0.0.0:9000/tmp/hive-partitioned/year=2016/.hive-staging_hive_2022-05-17_16-18-12_776_6637532130254074137-1/-ext-10004
Moving data to directory hdfs://0.0.0.0:9000/tmp/hive-partitioned/year=2015/.hive-staging_hive_2022-05-17_16-18-12_776_6637532130254074137-1/-ext-10006
Moving data to directory hdfs://0.0.0.0:9000/tmp/hive-partitioned/year=2014/.hive-staging_hive_2022-05-17_16-18-12_776_6637532130254074137-1/-ext-10008
Loading data to table default.detailpart partition (year=2018)
Loading data to table default.detailpart partition (year=2017)
Loading data to table default.detailpart partition (year=2016)
Loading data to table default.detailpart partition (year=2015)
Loading data to table default.detailpart partition (year=2014)
MapReduce Jobs Launched:
Stage-Stage-5: Map: 1   Cumulative CPU: 2.5 sec   HDFS Read: 18799 HDFS Write: 12254 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 500 msec
OK
Time taken: 12.818 seconds

A conitnuación se verifican las particiones creadas.

[19]:

%%hive
SHOW PARTITIONS detailpart;

OK
year=2014
year=2015
year=2016
year=2017
year=2018
Time taken: 0.059 seconds, Fetched: 5 row(s)

En una tabla particionada, se crea un subdirectorio por cada uno de los valores posibles que puede tomar el campo usado para especificar la partición; es así, como existe un directorio para year=2014, otro para year=2015 y así sucesivamente.

[20]:

!hdfs dfs -ls /tmp/hive-partitioned

Found 5 items
drwxrwxrwx   - root supergroup          0 2022-05-17 16:18 /tmp/hive-partitioned/year=2014
drwxrwxrwx   - root supergroup          0 2022-05-17 16:18 /tmp/hive-partitioned/year=2015
drwxrwxrwx   - root supergroup          0 2022-05-17 16:18 /tmp/hive-partitioned/year=2016
drwxrwxrwx   - root supergroup          0 2022-05-17 16:18 /tmp/hive-partitioned/year=2017
drwxrwxrwx   - root supergroup          0 2022-05-17 16:18 /tmp/hive-partitioned/year=2018

[21]:

!hdfs dfs -ls /tmp/hive-partitioned/year=2014

Found 1 items
-rwxrwxrwx   1 root supergroup       2347 2022-05-17 16:18 /tmp/hive-partitioned/year=2014/000000_0

Note que year ya no es parte del contenido de la tabla.

[22]:

!hdfs dfs -tail /tmp/hive-partitioned/year=2014/000000_0

Mobile
Objective-C     63.70000076293945       Enterprise
Ocaml   18.5    Web
Ocaml   18.5    Enterprise
Perl    69.5999984741211        Web
Perl    69.5999984741211        Enterprise
PHP     84.5999984741211        Web
Processing      52.0    Web
Processing      52.0    Enterprise
Prolog  38.599998474121094      Enterprise
Python  93.4000015258789        Web
Python  93.4000015258789        Enterprise
Python  93.4000015258789        Embedded
R       74.0999984741211        Enterprise
Ruby    79.30000305175781       Web
Ruby    79.30000305175781       Enterprise
Rust    \N      Web
Rust    \N      Enterprise
SAS     37.099998474121094      Enterprise
Scala   62.70000076293945       Web
Scala   62.70000076293945       Mobile
Scheme  27.799999237060547      Mobile
Scheme  27.799999237060547      Enterprise
Shell   63.29999923706055       Enterprise
SQL     71.5999984741211        Enterprise
TCL     23.100000381469727      Enterprise
TCL     23.100000381469727      Embedded
Verilog 33.79999923706055       Embedded
VHDL    42.599998474121094      Embedded
Visual Basic    66.0    Enterprise
ASP.NET 35.900001525878906      Web
CoffeeScript    25.5    Web
Apex Code       18.0    Enterprise
Eiffel  6.400000095367432       Enterprise
Scilab  1.0     Enterprise
NetLogo 0.0     Enterprise

Tablas externas#

Las tablas creadas hasta el momento son completamente manejadas por Hive. Sin embargo, estas tablas no son convenientes para compartir directamente información con otras aplicaciones, ya que Hive administra los datos. Para analizar la información generada por otras aplicaciones resulta más conveniente usar tablas externas.

[23]:

%%hive
CREATE EXTERNAL TABLE externaltbl
LIKE
    languages
LOCATION '/tmp/externaltbl';

OK
Time taken: 0.056 seconds

La carga de datos se realizar como en los ejemplos anteriores.

[24]:

%%hive
FROM languages
INSERT OVERWRITE TABLE externaltbl
SELECT
    language,
    ranking,
    categories;

SELECT * FROM externaltbl LIMIT 10;

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517161831_6e05e846-3ec1-4f4c-9d20-55af50819a37
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1652793922537_0030, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0030/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0030
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-05-17 16:18:34,747 Stage-1 map = 0%,  reduce = 0%
2022-05-17 16:18:39,045 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.42 sec
MapReduce Total cumulative CPU time: 1 seconds 420 msec
Ended Job = job_1652793922537_0030
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://0.0.0.0:9000/tmp/externaltbl/.hive-staging_hive_2022-05-17_16-18-31_221_3581705079924987506-1/-ext-10000
Loading data to table default.externaltbl
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.42 sec   HDFS Read: 7777 HDFS Write: 3581 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 420 msec
OK
Time taken: 9.048 seconds
OK
ABAP    {2018:22.8,2017:28.9,2016:15.9,2015:24.6,2014:20.8}     ["Enterprise"]
Actionscript    {2018:0.6,2017:0.0,2016:3.3,2015:6.8,2014:22.0} ["Web","Mobile"]
Ada     {2018:20.9,2017:29.9,2016:20.5,2015:32.8,2014:30.1}     ["Enterprise","Embedded"]
Arduino {2018:69.0,2017:74.4,2016:69.5,2015:63.0,2014:62.9}     ["Embedded"]
Assembly        {2018:74.1,2017:73.4,2016:68.0,2015:67.9,2014:66.9}     ["Embedded"]
C       {2018:96.7,2017:99.7,2016:100.0,2015:99.9,2014:99.2}    ["Mobile","Enterprise","Embedded"]
C#      {2018:89.4,2017:88.8,2016:86.4,2015:91.8,2014:92.2}     ["Web","Mobile","Enterprise"]
C++     {2018:99.7,2017:97.4,2016:95.8,2015:99.6,2014:95.5}     ["Mobile"," Enterprise"," Embedded"]
Clojure {2018:25.6,2017:28.8,2016:22.0,2015:30.9,2014:36.0}     ["Web","Enterprise"]
Cobol   {2018:24.6,2017:29.2,2016:17.9,2015:19.0,2014:30.6}     ["Enterprise"]
Time taken: 0.086 seconds, Fetched: 10 row(s)

[25]:

!hdfs dfs -ls /tmp/externaltbl

Found 1 items
-rwxrwxrwx   1 root supergroup       3504 2022-05-17 16:18 /tmp/externaltbl/000000_0

[26]:

!hdfs dfs -tail /tmp/externaltbl/000000_0

8,2016#87.7,2015#84.7,2014#74.1 Enterprise
Ruby    2018#71.4,2017#73.9,2016#74.0,2015#75.3,2014#79.3       Web,Enterprise
Rust    2018#41.8,2017#55.0,2016#42.2,2015#29.6,2014#\N Web,Enterprise
SAS     2018#25.6,2017#30.7,2016#21.7,2015#43.7,2014#37.1       Enterprise
Scala   2018#72.1,2017#70.0,2016#66.3,2015#62.4,2014#62.7       Web,Mobile
Scheme  2018#18.8,2017#24.5,2016#14.5,2015#14.8,2014#27.8       Mobile,Enterprise
Shell   2018#66.1,2017#68.6,2016#52.7,2015#71.4,2014#63.3       Enterprise
SQL     2018#49.3,2017#52.8,2016#48.8,2015#70.9,2014#71.6       Enterprise
Swift   2018#53.9,2017#76.5,2016#67.6,2015#62.4 Mobile,Enterprise
TCL     2018#21.9,2017#16.2,2016#8.7,2015#11.6,2014#23.1        Enterprise,Embedded
Verilog 2018#41.2,2017#35.6,2016#25.1,2015#23.7,2014#33.8       Embedded
VHDL    2018#45.4,2017#39.9,2016#33.3,2015#35.4,2014#42.6       Embedded
Visual Basic    2018#45.1,2017#58.9,2016#55.7,2015#63.4,2014#66.0       Enterprise
ASP.NET 2014#35.9       Web
CoffeeScript    2014#25.5       Web
Apex Code       2014#18.0       Enterprise
Eiffel  2014#6.4        Enterprise
Scilab  2014#1.0        Enterprise
NetLogo 2014#0.0        Enterprise

Borrado de las tablas#

Cuando se borra la tabla languages también se borra la carpeta correspondiente del HDFS.

[27]:

%%hive
DROP TABLE languages;
DROP TABLE detail;
DROP TABLE detailpart;
DROP TABLE externaltbl;

OK
Time taken: 0.229 seconds
OK
Time taken: 0.05 seconds
OK
Time taken: 0.187 seconds
OK
Time taken: 0.048 seconds

[28]:

!hdfs dfs -ls /tmp/

Found 5 items
drwxr-xr-x   - root supergroup          0 2022-05-17 16:16 /tmp/drivers
drwxrwxrwx   - root supergroup          0 2022-05-17 16:16 /tmp/drivers-json
drwxrwxrwx   - root supergroup          0 2022-05-17 16:18 /tmp/externaltbl
drwxrwx---   - root supergroup          0 2022-05-17 13:25 /tmp/hadoop-yarn
drwxrwxrwx   - root supergroup          0 2022-05-17 13:26 /tmp/hive

[29]:

%quit

[30]:

!rm *.tsv *.log