Manejo de Datos Complejos y Tablas Particionadas#
Última modificación: Mayo 17, 2022 | YouTube
A diferencia de los gestores tradicionales de bases de datos, Hive permite la creación de datos complejos que pueden ser almacenados en los campos de las tablas. Debido a que las tablas pueden ser muy grandes, Hive también permite hacer la partición de las tablas en archivos separados a partir uno o más campos claves optimizando las consultas, ya que las búsquedas se pueden realizar sobre particiones específicas de una tabla.
En este tutorial se ejemplifica:
La creación de un archivo de texto con datos complejos para ser cargado y consultado en Hive.
La creación de tablas con datos complejos.
La construcción de consultas sobre campos complejos.
La creación de tablas particionadas y la inserción de registros en ellas.
Cell magic %%hive
#
[1]:
from IPython.core.magic import Magics, cell_magic, line_magic, magics_class
from pexpect import spawn
TIMEOUT = 60
PROG = "hive"
PROMPT = ["\r\n > ", "\r\nhive> "]
QUIT = "quit;"
@magics_class
class Magic(Magics):
def __init__(self, shell):
super().__init__(shell)
self.app = spawn(PROG, timeout=60)
self.app.expect(PROMPT)
@cell_magic
def hive(self, line, cell):
cell_lines = [cell_line.strip() for cell_line in cell.split("\n")]
cell_lines = [cell_line for cell_line in cell_lines if cell_line != ""]
for cell_line in cell_lines:
self.app.sendline(cell_line)
self.app.expect(PROMPT, timeout=TIMEOUT)
output = self.app.before.decode()
output = output.replace("\r\n", "\n")
output = output.split("\n")
output = [output_line.strip() for output_line in output]
for output_line in output:
if output_line not in cell_lines:
print(output_line)
return None
@line_magic
def quit(self, line):
self.app.sendline(QUIT)
def load_ipython_extension(ip):
ip.register_magics(Magic(ip))
load_ipython_extension(ip=get_ipython())
Información utilizada#
La información utilizada corresponde al ranking de popularidad de lenguajes de programación publicado por IEEE en sus sitios web:
https://spectrum.ieee.org/at-work/innovation/the-2018-top-programming-languages
http://sites.ieee.org/houston/article-2017-top-programming-languages/
https://spectrum.ieee.org/static/interactive-the-top-programming-languages-2016
https://spectrum.ieee.org/static/interactive-the-top-programming-languages-2015/
https://spectrum.ieee.org/static/interactive-the-top-programming-languages
El archivo presentado a continuación fue construído manualmente para propósitos ilustrativos. Las columnas son:
El nombre del lenguaje de programación.
La puntuación obtenida para cada año entre 2014 y 2018.
Las categorías en que aparece el lenguaje (Web, Mobile, Enterprise, Embedded).
El formato es explicado más adelante.
[2]:
!wget https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/language-score.tsv
--2022-05-17 16:17:05-- https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/language-score.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3490 (3.4K) [text/plain]
Saving to: ‘language-score.tsv’
language-score.tsv 100%[===================>] 3.41K --.-KB/s in 0.002s
2022-05-17 16:17:05 (1.86 MB/s) - ‘language-score.tsv’ saved [3490/3490]
[3]:
!cat language-score.tsv
ABAP 2018#22.8,2017#28.9,2016#15.9,2015#24.6,2014#20.8 Enterprise
Actionscript 2018#0.6,2017#0.0,2016#3.3,2015#6.8,2014#22.0 Web,Mobile
Ada 2018#20.9,2017#29.9,2016#20.5,2015#32.8,2014#30.1 Enterprise,Embedded
Arduino 2018#69.0,2017#74.4,2016#69.5,2015#63.0,2014#62.9 Embedded
Assembly 2018#74.1,2017#73.4,2016#68.0,2015#67.9,2014#66.9 Embedded
C 2018#96.7,2017#99.7,2016#100.0,2015#99.9,2014#99.2 Mobile,Enterprise,Embedded
C# 2018#89.4,2017#88.8,2016#86.4,2015#91.8,2014#92.2 Web,Mobile,Enterprise
C++ 2018#99.7,2017#97.4,2016#95.8,2015#99.6,2014#95.5 Mobile, Enterprise, Embedded
Clojure 2018#25.6,2017#28.8,2016#22.0,2015#30.9,2014#36.0 Web,Enterprise
Cobol 2018#24.6,2017#29.2,2016#17.9,2015#19.0,2014#30.6 Enterprise
Cuda 2018#43.0,2017#56.5,2016#52.3,2015#48.6,2014# Enterprise
D 2018#40.6,2017#41.5,2016#37.7,2015#44.6,2014#49.9 Web,Embedded
Delphi 2018#38.7,2017#,2016#41.7,2015#39.4,2014#40.2 Mobile,Enterprise
Erlang 2018#26.9,2017#31.5,2016#27.0,2015#30.6,2014#32.1 Enterprise,Embedded
Forth 2018#0.0,2017#5.7,2016#0.0,2015#8.5,2014#2.3 Embedded
Fortran 2018#49.5,2017#43.2,2016#40.9,2015#39.7,2014#45.5 Enterprise
Go 2018#76.4,2017#77.2,2016#71.5,2015#67.9,2014#59.8 Web,Enterprise
Haskell 2018#48.6,2017#50.8,2016#43.0,2015#39.2,2014#44.8 Enterprise,Embedded
HTML 2018#71.2,2017#68.0,2016#66.7,2015#54.3,2014#64.1 Web
J 2018#18.1,2017#14.2,2016#10.4,2015#7.7,2014#13.6 Enterprise
Java 2018#97.5,2017#99.4,2016#98.1,2015#100,2014#100 Web,Mobile,Enterprise
JavaScript 2018#82.6,2017#86.2,2016#81.9,2015#83.0,2014#84.9 Web,Mobile
Julia 2018#35.1,2017#40.6,2016#31.4,2015#17.4,2014# Enterprise
LabView 2018#32.7,2017#36.4,2016#32.6,2015#30.9 Enterprise,Embedded
Ladder Logic 2018#11.5,2017#6.0,2016#26.2,2015#21.8,2014#20.2 Embedded
Lisp 2018#33.3,2017#33.9,2016#33.6,2015#40.4,2014#43.7 Enterprise
Lua 2018#49.8,2017#54.9,2016#50.9,2015#45.9,2014#47.8 Web,Enterprise
Matlab 2018#72.8,2017#70.4,2016#68.7,2015#72.4,2014#73.1 Enterprise
Objective-C 2018#50.5,2017#48.4,2016#52.4,2015#58.7,2014#63.7 Mobile,Enterprise
Ocaml 2018#14.4,2017#13.7,2016#3.2,2015#0.0,2014#18.5 Web,Enterprise
Perl 2018#57.4,2017#59.1,2016#57.5,2015#66.9,2014#69.6 Web,Enterprise
PHP 2018#84.9,2017#82.3,2016#82.4,2015#84.5,2014#84.6 Web
Processing 2018#53.1,2017#52.7,2016#50.3,2015#49.2,2014#52.0 Web,Enterprise
Prolog 2018#33.2,2017#36.5,2016#25.0,2015#23.2,2014#38.6 Enterprise
Python 2018#100.0,2017#100,2016#97.9,2015#95.8,2014#93.4 Web,Enterprise,Embedded
R 2018#82.9,2017#88.8,2016#87.7,2015#84.7,2014#74.1 Enterprise
Ruby 2018#71.4,2017#73.9,2016#74.0,2015#75.3,2014#79.3 Web,Enterprise
Rust 2018#41.8,2017#55.0,2016#42.2,2015#29.6,2014# Web,Enterprise
SAS 2018#25.6,2017#30.7,2016#21.7,2015#43.7,2014#37.1 Enterprise
Scala 2018#72.1,2017#70.0,2016#66.3,2015#62.4,2014#62.7 Web,Mobile
Scheme 2018#18.8,2017#24.5,2016#14.5,2015#14.8,2014#27.8 Mobile,Enterprise
Shell 2018#66.1,2017#68.6,2016#52.7,2015#71.4,2014#63.3 Enterprise
SQL 2018#49.3,2017#52.8,2016#48.8,2015#70.9,2014#71.6 Enterprise
Swift 2018#53.9,2017#76.5,2016#67.6,2015#62.4 Mobile,Enterprise
TCL 2018#21.9,2017#16.2,2016#8.7,2015#11.6,2014#23.1 Enterprise,Embedded
Verilog 2018#41.2,2017#35.6,2016#25.1,2015#23.7,2014#33.8 Embedded
VHDL 2018#45.4,2017#39.9,2016#33.3,2015#35.4,2014#42.6 Embedded
Visual Basic 2018#45.1,2017#58.9,2016#55.7,2015#63.4,2014#66.0 Enterprise
ASP.NET 2014#35.9 Web
CoffeeScript 2014#25.5 Web
Apex Code 2014#18.0 Enterprise
Eiffel 2014#6.4 Enterprise
Scilab 2014#1.0 Enterprise
NetLogo 2014#0.0 Enterprise
Preparación#
[4]:
#
# Se mueve el archivo al HDFS
#
!hdfs dfs -copyFromLocal language-score.tsv /tmp/language-score.tsv
Creación de la tabla journals
#
A continuación se crea la tabla para almacenar la información, con las siguientes características (revise el archivo languagescore.tsv
):
El campo
language
es un STRING que contiene el nombre del lenguaje de programación.El campo
ranking
es un MAP que tiene como clave el año (tipo INT) y como valor asociado el puntaje (popularidad) de dicho año (tipo FLOAT).El campo
categories
es un ARRAY conformado por una lista de STRINGS separados por comas que corresponden a las categorías a las que está asociado el lenguaje de programación.Los campos están delimitados por tabulador (
\t
).Los registros (líneas) están terminados por
\n
.Los elementos (COLLECTION ITEMS) del ARRAY (campo
categories
) y del MAP (campo ranking) están separados entre si por,
.Para los campos tipo MAP, las claves y sus valores asociados están separados por
#
.La tabla se almacenará en la carpeta
/tmp/hive-journals
del HDFS como un archivo de texto.
[5]:
%%hive
DROP TABLE IF EXISTS languages;
CREATE TABLE languages (
language STRING,
ranking MAP<INT,FLOAT>,
categories ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY '#'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/tmp/hive-journals';
OK
Time taken: 5.548 seconds
OK
Time taken: 0.409 seconds
Carga de datos#
Se cargan los datos en la forma usual.
[6]:
%%hive
LOAD DATA INPATH '/tmp/language-score.tsv' INTO TABLE languages;
Loading data to table default.languages
OK
Time taken: 0.781 seconds
Se muestran los primeros 10 registros para verificar que la tabla fue cargada exitósamente. Note que los campos MAP (columna ranking), aparecen delimitados por {
y }
, donde las parejas clave-valor están separadas entre si por comas, y la separación entre la clave y el valor se hace mediante :
. Los ARRAY están delimitados por [
y ]
.
[7]:
%%hive
SELECT * FROM languages LIMIT 10;
OK
ABAP {2018:22.8,2017:28.9,2016:15.9,2015:24.6,2014:20.8} ["Enterprise"]
Actionscript {2018:0.6,2017:0.0,2016:3.3,2015:6.8,2014:22.0} ["Web","Mobile"]
Ada {2018:20.9,2017:29.9,2016:20.5,2015:32.8,2014:30.1} ["Enterprise","Embedded"]
Arduino {2018:69.0,2017:74.4,2016:69.5,2015:63.0,2014:62.9} ["Embedded"]
Assembly {2018:74.1,2017:73.4,2016:68.0,2015:67.9,2014:66.9} ["Embedded"]
C {2018:96.7,2017:99.7,2016:100.0,2015:99.9,2014:99.2} ["Mobile","Enterprise","Embedded"]
C# {2018:89.4,2017:88.8,2016:86.4,2015:91.8,2014:92.2} ["Web","Mobile","Enterprise"]
C++ {2018:99.7,2017:97.4,2016:95.8,2015:99.6,2014:95.5} ["Mobile"," Enterprise"," Embedded"]
Clojure {2018:25.6,2017:28.8,2016:22.0,2015:30.9,2014:36.0} ["Web","Enterprise"]
Cobol {2018:24.6,2017:29.2,2016:17.9,2015:19.0,2014:30.6} ["Enterprise"]
Time taken: 1.041 seconds, Fetched: 10 row(s)
Almacenamiento de la tabla en el HDFS#
La tabla es almacenada en la carpeta /tmp/hive-journals
, tal como se indicó en su creación.
[8]:
!hdfs dfs -ls /tmp/hive-journals/
Found 1 items
-rwxrwxrwx 1 root supergroup 3490 2022-05-17 16:17 /tmp/hive-journals/language-score.tsv
La tabla ahora es administrada por Hive y conserva el mismo formato que se indicó en su creación.
[9]:
!hdfs dfs -tail /tmp/hive-journals/language-score.tsv
8.8,2016#87.7,2015#84.7,2014#74.1 Enterprise
Ruby 2018#71.4,2017#73.9,2016#74.0,2015#75.3,2014#79.3 Web,Enterprise
Rust 2018#41.8,2017#55.0,2016#42.2,2015#29.6,2014# Web,Enterprise
SAS 2018#25.6,2017#30.7,2016#21.7,2015#43.7,2014#37.1 Enterprise
Scala 2018#72.1,2017#70.0,2016#66.3,2015#62.4,2014#62.7 Web,Mobile
Scheme 2018#18.8,2017#24.5,2016#14.5,2015#14.8,2014#27.8 Mobile,Enterprise
Shell 2018#66.1,2017#68.6,2016#52.7,2015#71.4,2014#63.3 Enterprise
SQL 2018#49.3,2017#52.8,2016#48.8,2015#70.9,2014#71.6 Enterprise
Swift 2018#53.9,2017#76.5,2016#67.6,2015#62.4 Mobile,Enterprise
TCL 2018#21.9,2017#16.2,2016#8.7,2015#11.6,2014#23.1 Enterprise,Embedded
Verilog 2018#41.2,2017#35.6,2016#25.1,2015#23.7,2014#33.8 Embedded
VHDL 2018#45.4,2017#39.9,2016#33.3,2015#35.4,2014#42.6 Embedded
Visual Basic 2018#45.1,2017#58.9,2016#55.7,2015#63.4,2014#66.0 Enterprise
ASP.NET 2014#35.9 Web
CoffeeScript 2014#25.5 Web
Apex Code 2014#18.0 Enterprise
Eiffel 2014#6.4 Enterprise
Scilab 2014#1.0 Enterprise
NetLogo 2014#0.0 Enterprise
Popularidad por año#
En la siguiente consulta se obtienen los diez lenguajes de programación más populares para el año 2017. En Hive no es posible aplicar ORDER BY sobre un valor de un MAP, por lo que se crea una subconsulta que crea una tabla, llamada table1
, en donde el ranking para el año 2017 se nombra como ranking2017
con el fin de poder realizar el ordenamiento.
[10]:
%%hive
SELECT
language,
ranking2017
FROM (
SELECT
language,
ranking[2017] AS ranking2017
FROM
languages
) table1
ORDER BY
ranking2017 DESC
LIMIT 10;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517161720_b16a79a0-1833-4f5b-ae74-bfe877e39614
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1652793922537_0026, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0026/
Kill Command = /opt/hadoop/bin/hadoop job -kill job_1652793922537_0026
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-05-17 16:17:26,022 Stage-1 map = 0%, reduce = 0%
2022-05-17 16:17:30,206 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.76 sec
2022-05-17 16:17:35,354 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.19 sec
MapReduce Total cumulative CPU time: 3 seconds 190 msec
Ended Job = job_1652793922537_0026
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.19 sec HDFS Read: 11897 HDFS Write: 305 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 190 msec
OK
Python 100.0
C 99.7
Java 99.4
C++ 97.4
R 88.8
C# 88.8
JavaScript 86.2
PHP 82.3
Go 77.2
Swift 76.5
Time taken: 15.816 seconds, Fetched: 10 row(s)
Popularidad por Categoría y por Año#
En la siguiente consulta se obtienen los lenguajes más populares para la categoría Mobile
en el año 2017. En este caso, se usa la función array_contains(categories, 'Mobile')
que retorna verdadero si Mobile
aparece en categories
. Las funciones disponibles pueden ser consultadas en https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
[11]:
%%hive
SELECT
language,
ranking2018,
ranking2017,
ranking2016
FROM (
SELECT
language,
ranking[2018] AS ranking2018,
ranking[2017] AS ranking2017,
ranking[2016] AS ranking2016
FROM
languages
WHERE
array_contains(categories, 'Mobile')
) table1
LIMIT 10;
OK
Actionscript 0.6 0.0 3.3
C 96.7 99.7 100.0
C# 89.4 88.8 86.4
C++ 99.7 97.4 95.8
Delphi 38.7 NULL 41.7
Java 97.5 99.4 98.1
JavaScript 82.6 86.2 81.9
Objective-C 50.5 48.4 52.4
Scala 72.1 70.0 66.3
Scheme 18.8 24.5 14.5
Time taken: 0.204 seconds, Fetched: 10 row(s)
Expansión de la tabla por categoría#
Es posible generar un registro por cada valor de un campo complejo. En el siguiente ejemplo, se expande la tabla por cada valor del campo categories
. El proceso de expansión es realizado mediante la instrucción LATERAL VIEW
que genera un valor por cada categoría en el campo categories
, el cual es combinado con el resto de los campos indicados de la instrucción SELECT
.
[12]:
%%hive
SELECT
language,
ranking[2018],
c0
FROM
languages
LATERAL VIEW
explode(categories) languages AS c0
LIMIT 10;
OK
ABAP 22.8 Enterprise
Actionscript 0.6 Web
Actionscript 0.6 Mobile
Ada 20.9 Enterprise
Ada 20.9 Embedded
Arduino 69.0 Embedded
Assembly 74.1 Embedded
C 96.7 Mobile
C 96.7 Enterprise
C 96.7 Embedded
Time taken: 0.083 seconds, Fetched: 10 row(s)
Expansión de la tabla por categoría y popularidad#
En el siguiente código, se genera la tabla detail
que contiene las columnas:
Lenguaje
Año
Puntaje
Categoría
La tabla es llenada mediante la expansión de las columnas ranking
y categories
. Note que en este código, primero, se realiza una subconsulta para expandir por el campo categories
y luego en la consulta principal se expande por el campo ranking
;
[13]:
%%hive
DROP TABLE IF EXISTS detail;
CREATE TABLE detail
AS
SELECT
language,
key AS year,
value AS popularity,
category
FROM (
SELECT
language,
ranking,
category
FROM
languages
LATERAL VIEW
explode(categories) languages AS category
) t0
LATERAL VIEW
explode(ranking) t0
;
SELECT * FROM detail LIMIT 10;
OK
Time taken: 0.025 seconds
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517161739_b236835c-1691-41bf-bb57-a19ddc622dd8
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1652793922537_0027, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0027/
Kill Command = /opt/hadoop/bin/hadoop job -kill job_1652793922537_0027
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-05-17 16:17:45,253 Stage-1 map = 0%, reduce = 0%
2022-05-17 16:17:49,549 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.53 sec
MapReduce Total cumulative CPU time: 1 seconds 530 msec
Ended Job = job_1652793922537_0027
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://0.0.0.0:9000/user/hive/warehouse/.hive-staging_hive_2022-05-17_16-17-39_393_5248961109886827483-1/-ext-10001
Moving data to directory hdfs://0.0.0.0:9000/user/hive/warehouse/detail
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.53 sec HDFS Read: 9857 HDFS Write: 9876 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 530 msec
OK
Time taken: 11.42 seconds
OK
ABAP 2018 22.8 Enterprise
ABAP 2017 28.9 Enterprise
ABAP 2016 15.9 Enterprise
ABAP 2015 24.6 Enterprise
ABAP 2014 20.8 Enterprise
Actionscript 2018 0.6 Web
Actionscript 2017 0.0 Web
Actionscript 2016 3.3 Web
Actionscript 2015 6.8 Web
Actionscript 2014 22.0 Web
Time taken: 0.101 seconds, Fetched: 10 row(s)
La instrucción DESCRIBE
permite ver los tipos de datos de la tabla.
[14]:
%%hive
DESCRIBE detail;
OK
language string
year int
popularity float
category string
Time taken: 0.036 seconds, Fetched: 4 row(s)
Construcción de campos compuestos#
También es posible crear datos complejos a partir de los campos de una tabla. En la siguiente consulta, se construye un ARRAY mediante la función collect_set()
agrupando las catetorias por lenguaje de la tabla detail
.
[15]:
%%hive
SELECT
language,
collect_set(category)
FROM
detail
GROUP BY
language
LIMIT 10;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517161751_ee55cd99-8e19-4e07-8e83-1d0eaf19dbe6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1652793922537_0028, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0028/
Kill Command = /opt/hadoop/bin/hadoop job -kill job_1652793922537_0028
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-05-17 16:18:00,072 Stage-1 map = 0%, reduce = 0%
2022-05-17 16:18:04,176 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.43 sec
2022-05-17 16:18:08,304 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.01 sec
MapReduce Total cumulative CPU time: 3 seconds 10 msec
Ended Job = job_1652793922537_0028
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.01 sec HDFS Read: 18995 HDFS Write: 426 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 10 msec
OK
ABAP ["Enterprise"]
ASP.NET ["Web"]
Actionscript ["Web","Mobile"]
Ada ["Enterprise","Embedded"]
Apex Code ["Enterprise"]
Arduino ["Embedded"]
Assembly ["Embedded"]
C ["Mobile","Enterprise","Embedded"]
C# ["Web","Mobile","Enterprise"]
C++ ["Mobile"," Enterprise"," Embedded"]
Time taken: 17.831 seconds, Fetched: 10 row(s)
En la siguiente consulta, se construye un MAP a partir del año y la popularidad del lenguaje.
[16]:
%%hive
SELECT
language,
map(year, popularity) as p,
category
FROM
detail
LIMIT 10;
OK
ABAP {2018:22.8} Enterprise
ABAP {2017:28.9} Enterprise
ABAP {2016:15.9} Enterprise
ABAP {2015:24.6} Enterprise
ABAP {2014:20.8} Enterprise
Actionscript {2018:0.6} Web
Actionscript {2017:0.0} Web
Actionscript {2016:3.3} Web
Actionscript {2015:6.8} Web
Actionscript {2014:22.0} Web
Time taken: 0.097 seconds, Fetched: 10 row(s)
Particionamiento#
Como ya se indicó, las tablas pueden partirse físicamente dependiendo del valor de uno o más campos usados como claves. Esto permite que las búsquedas sean más eficientes, ya que las consultas operan sobre las partficiones. En el siguiente código, se crea la tabla detailpart
en que la partición está dada por el campo year
.
[17]:
%%hive
DROP TABLE IF EXISTS detailpart;
CREATE TABLE detailpart (
language STRING,
popularity DOUBLE,
category STRING
)
PARTITIONED BY (year INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY '#'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/tmp/hive-partitioned';
;
OK
Time taken: 0.021 seconds
OK
Time taken: 0.053 seconds
Seguidamente, se llena la tabla detailpart
con la información almacenada en la tabla detail
. Note que se indicar una instrucción INSERT OVERWRITE
por cada valor posible de la partición (columna year
).
[18]:
%%hive
FROM detail
INSERT OVERWRITE TABLE detailpart PARTITION(year=2018)
SELECT
language,
popularity,
category
WHERE
year=2018
INSERT OVERWRITE TABLE detailpart PARTITION(year=2017)
SELECT
language,
popularity,
category
WHERE
year=2017
INSERT OVERWRITE TABLE detailpart PARTITION(year=2016)
SELECT
language,
popularity,
category
WHERE
year=2016
INSERT OVERWRITE TABLE detailpart PARTITION(year=2015)
SELECT
language,
popularity,
category
WHERE
year=2015
INSERT OVERWRITE TABLE detailpart PARTITION(year=2014)
SELECT
language,
popularity,
category
WHERE
year=2014
;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517161812_09187314-f24b-4ba3-8811-523b83ca14d7
Total jobs = 11
Launching Job 1 out of 11
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1652793922537_0029, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0029/
Kill Command = /opt/hadoop/bin/hadoop job -kill job_1652793922537_0029
Hadoop job information for Stage-5: number of mappers: 1; number of reducers: 0
2022-05-17 16:18:19,471 Stage-5 map = 0%, reduce = 0%
2022-05-17 16:18:23,594 Stage-5 map = 100%, reduce = 0%, Cumulative CPU 2.5 sec
MapReduce Total cumulative CPU time: 2 seconds 500 msec
Ended Job = job_1652793922537_0029
Stage-8 is selected by condition resolver.
Stage-7 is filtered out by condition resolver.
Stage-9 is filtered out by condition resolver.
Stage-14 is selected by condition resolver.
Stage-13 is filtered out by condition resolver.
Stage-15 is filtered out by condition resolver.
Stage-20 is selected by condition resolver.
Stage-19 is filtered out by condition resolver.
Stage-21 is filtered out by condition resolver.
Stage-26 is selected by condition resolver.
Stage-25 is filtered out by condition resolver.
Stage-27 is filtered out by condition resolver.
Stage-32 is selected by condition resolver.
Stage-31 is filtered out by condition resolver.
Stage-33 is filtered out by condition resolver.
Moving data to directory hdfs://0.0.0.0:9000/tmp/hive-partitioned/year=2018/.hive-staging_hive_2022-05-17_16-18-12_776_6637532130254074137-1/-ext-10000
Moving data to directory hdfs://0.0.0.0:9000/tmp/hive-partitioned/year=2017/.hive-staging_hive_2022-05-17_16-18-12_776_6637532130254074137-1/-ext-10002
Moving data to directory hdfs://0.0.0.0:9000/tmp/hive-partitioned/year=2016/.hive-staging_hive_2022-05-17_16-18-12_776_6637532130254074137-1/-ext-10004
Moving data to directory hdfs://0.0.0.0:9000/tmp/hive-partitioned/year=2015/.hive-staging_hive_2022-05-17_16-18-12_776_6637532130254074137-1/-ext-10006
Moving data to directory hdfs://0.0.0.0:9000/tmp/hive-partitioned/year=2014/.hive-staging_hive_2022-05-17_16-18-12_776_6637532130254074137-1/-ext-10008
Loading data to table default.detailpart partition (year=2018)
Loading data to table default.detailpart partition (year=2017)
Loading data to table default.detailpart partition (year=2016)
Loading data to table default.detailpart partition (year=2015)
Loading data to table default.detailpart partition (year=2014)
MapReduce Jobs Launched:
Stage-Stage-5: Map: 1 Cumulative CPU: 2.5 sec HDFS Read: 18799 HDFS Write: 12254 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 500 msec
OK
Time taken: 12.818 seconds
A conitnuación se verifican las particiones creadas.
[19]:
%%hive
SHOW PARTITIONS detailpart;
OK
year=2014
year=2015
year=2016
year=2017
year=2018
Time taken: 0.059 seconds, Fetched: 5 row(s)
En una tabla particionada, se crea un subdirectorio por cada uno de los valores posibles que puede tomar el campo usado para especificar la partición; es así, como existe un directorio para year=2014
, otro para year=2015
y así sucesivamente.
[20]:
!hdfs dfs -ls /tmp/hive-partitioned
Found 5 items
drwxrwxrwx - root supergroup 0 2022-05-17 16:18 /tmp/hive-partitioned/year=2014
drwxrwxrwx - root supergroup 0 2022-05-17 16:18 /tmp/hive-partitioned/year=2015
drwxrwxrwx - root supergroup 0 2022-05-17 16:18 /tmp/hive-partitioned/year=2016
drwxrwxrwx - root supergroup 0 2022-05-17 16:18 /tmp/hive-partitioned/year=2017
drwxrwxrwx - root supergroup 0 2022-05-17 16:18 /tmp/hive-partitioned/year=2018
[21]:
!hdfs dfs -ls /tmp/hive-partitioned/year=2014
Found 1 items
-rwxrwxrwx 1 root supergroup 2347 2022-05-17 16:18 /tmp/hive-partitioned/year=2014/000000_0
Note que year
ya no es parte del contenido de la tabla.
[22]:
!hdfs dfs -tail /tmp/hive-partitioned/year=2014/000000_0
Mobile
Objective-C 63.70000076293945 Enterprise
Ocaml 18.5 Web
Ocaml 18.5 Enterprise
Perl 69.5999984741211 Web
Perl 69.5999984741211 Enterprise
PHP 84.5999984741211 Web
Processing 52.0 Web
Processing 52.0 Enterprise
Prolog 38.599998474121094 Enterprise
Python 93.4000015258789 Web
Python 93.4000015258789 Enterprise
Python 93.4000015258789 Embedded
R 74.0999984741211 Enterprise
Ruby 79.30000305175781 Web
Ruby 79.30000305175781 Enterprise
Rust \N Web
Rust \N Enterprise
SAS 37.099998474121094 Enterprise
Scala 62.70000076293945 Web
Scala 62.70000076293945 Mobile
Scheme 27.799999237060547 Mobile
Scheme 27.799999237060547 Enterprise
Shell 63.29999923706055 Enterprise
SQL 71.5999984741211 Enterprise
TCL 23.100000381469727 Enterprise
TCL 23.100000381469727 Embedded
Verilog 33.79999923706055 Embedded
VHDL 42.599998474121094 Embedded
Visual Basic 66.0 Enterprise
ASP.NET 35.900001525878906 Web
CoffeeScript 25.5 Web
Apex Code 18.0 Enterprise
Eiffel 6.400000095367432 Enterprise
Scilab 1.0 Enterprise
NetLogo 0.0 Enterprise
Tablas externas#
Las tablas creadas hasta el momento son completamente manejadas por Hive
. Sin embargo, estas tablas no son convenientes para compartir directamente información con otras aplicaciones, ya que Hive
administra los datos. Para analizar la información generada por otras aplicaciones resulta más conveniente usar tablas externas.
[23]:
%%hive
CREATE EXTERNAL TABLE externaltbl
LIKE
languages
LOCATION '/tmp/externaltbl';
OK
Time taken: 0.056 seconds
La carga de datos se realizar como en los ejemplos anteriores.
[24]:
%%hive
FROM languages
INSERT OVERWRITE TABLE externaltbl
SELECT
language,
ranking,
categories;
SELECT * FROM externaltbl LIMIT 10;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220517161831_6e05e846-3ec1-4f4c-9d20-55af50819a37
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1652793922537_0030, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0030/
Kill Command = /opt/hadoop/bin/hadoop job -kill job_1652793922537_0030
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-05-17 16:18:34,747 Stage-1 map = 0%, reduce = 0%
2022-05-17 16:18:39,045 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.42 sec
MapReduce Total cumulative CPU time: 1 seconds 420 msec
Ended Job = job_1652793922537_0030
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://0.0.0.0:9000/tmp/externaltbl/.hive-staging_hive_2022-05-17_16-18-31_221_3581705079924987506-1/-ext-10000
Loading data to table default.externaltbl
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.42 sec HDFS Read: 7777 HDFS Write: 3581 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 420 msec
OK
Time taken: 9.048 seconds
OK
ABAP {2018:22.8,2017:28.9,2016:15.9,2015:24.6,2014:20.8} ["Enterprise"]
Actionscript {2018:0.6,2017:0.0,2016:3.3,2015:6.8,2014:22.0} ["Web","Mobile"]
Ada {2018:20.9,2017:29.9,2016:20.5,2015:32.8,2014:30.1} ["Enterprise","Embedded"]
Arduino {2018:69.0,2017:74.4,2016:69.5,2015:63.0,2014:62.9} ["Embedded"]
Assembly {2018:74.1,2017:73.4,2016:68.0,2015:67.9,2014:66.9} ["Embedded"]
C {2018:96.7,2017:99.7,2016:100.0,2015:99.9,2014:99.2} ["Mobile","Enterprise","Embedded"]
C# {2018:89.4,2017:88.8,2016:86.4,2015:91.8,2014:92.2} ["Web","Mobile","Enterprise"]
C++ {2018:99.7,2017:97.4,2016:95.8,2015:99.6,2014:95.5} ["Mobile"," Enterprise"," Embedded"]
Clojure {2018:25.6,2017:28.8,2016:22.0,2015:30.9,2014:36.0} ["Web","Enterprise"]
Cobol {2018:24.6,2017:29.2,2016:17.9,2015:19.0,2014:30.6} ["Enterprise"]
Time taken: 0.086 seconds, Fetched: 10 row(s)
[25]:
!hdfs dfs -ls /tmp/externaltbl
Found 1 items
-rwxrwxrwx 1 root supergroup 3504 2022-05-17 16:18 /tmp/externaltbl/000000_0
[26]:
!hdfs dfs -tail /tmp/externaltbl/000000_0
8,2016#87.7,2015#84.7,2014#74.1 Enterprise
Ruby 2018#71.4,2017#73.9,2016#74.0,2015#75.3,2014#79.3 Web,Enterprise
Rust 2018#41.8,2017#55.0,2016#42.2,2015#29.6,2014#\N Web,Enterprise
SAS 2018#25.6,2017#30.7,2016#21.7,2015#43.7,2014#37.1 Enterprise
Scala 2018#72.1,2017#70.0,2016#66.3,2015#62.4,2014#62.7 Web,Mobile
Scheme 2018#18.8,2017#24.5,2016#14.5,2015#14.8,2014#27.8 Mobile,Enterprise
Shell 2018#66.1,2017#68.6,2016#52.7,2015#71.4,2014#63.3 Enterprise
SQL 2018#49.3,2017#52.8,2016#48.8,2015#70.9,2014#71.6 Enterprise
Swift 2018#53.9,2017#76.5,2016#67.6,2015#62.4 Mobile,Enterprise
TCL 2018#21.9,2017#16.2,2016#8.7,2015#11.6,2014#23.1 Enterprise,Embedded
Verilog 2018#41.2,2017#35.6,2016#25.1,2015#23.7,2014#33.8 Embedded
VHDL 2018#45.4,2017#39.9,2016#33.3,2015#35.4,2014#42.6 Embedded
Visual Basic 2018#45.1,2017#58.9,2016#55.7,2015#63.4,2014#66.0 Enterprise
ASP.NET 2014#35.9 Web
CoffeeScript 2014#25.5 Web
Apex Code 2014#18.0 Enterprise
Eiffel 2014#6.4 Enterprise
Scilab 2014#1.0 Enterprise
NetLogo 2014#0.0 Enterprise
Borrado de las tablas#
Cuando se borra la tabla languages
también se borra la carpeta correspondiente del HDFS.
[27]:
%%hive
DROP TABLE languages;
DROP TABLE detail;
DROP TABLE detailpart;
DROP TABLE externaltbl;
OK
Time taken: 0.229 seconds
OK
Time taken: 0.05 seconds
OK
Time taken: 0.187 seconds
OK
Time taken: 0.048 seconds
[28]:
!hdfs dfs -ls /tmp/
Found 5 items
drwxr-xr-x - root supergroup 0 2022-05-17 16:16 /tmp/drivers
drwxrwxrwx - root supergroup 0 2022-05-17 16:16 /tmp/drivers-json
drwxrwxrwx - root supergroup 0 2022-05-17 16:18 /tmp/externaltbl
drwxrwx--- - root supergroup 0 2022-05-17 13:25 /tmp/hadoop-yarn
drwxrwxrwx - root supergroup 0 2022-05-17 13:26 /tmp/hive
[29]:
%quit
[30]:
!rm *.tsv *.log