Uso de SQL en Apache HBase usando Apache Phoenix#
Última modificación: Mayo 25, 2022
Introducción#
Es una capa SQL sobre HBase.
Compila las consultas SQL en scans de HBase.
Orquesta la ejecución de los scans en paralelo.
Aplica optimizaciones.
Creación del archivo SQL#
[1]:
%%writefile /tmp/us_population.sql
CREATE TABLE IF NOT EXISTS us_population (
state CHAR(2) NOT NULL,
city VARCHAR NOT NULL,
population BIGINT
CONSTRAINT my_pk PRIMARY KEY (state, city)
);
Overwriting /tmp/us_population.sql
Creación del archivo de datos#
[2]:
%%writefile /tmp/us_population.csv
NY,New York,8143197
CA,Los Angeles,3844829
IL,Chicago,2842518
TX,Houston,2016582
PA,Philadelphia,1463281
AZ,Phoenix,1461575
TX,San Antonio,1256509
CA,San Diego,1255540
TX,Dallas,1213825
CA,San Jose,912332
Overwriting /tmp/us_population.csv
Creación y llenado de la tabla#
[3]:
!psql.py localhost /tmp/us_population.sql /tmp/us_population.csv
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/phoenix/phoenix-client-hbase-2.3-5.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
22/05/27 18:46:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
no rows upserted
Time: 0.012 sec(s)
csv columns from database.
CSV Upsert complete. 10 rows upserted
Time: 0.054 sec(s)
Cell magic %%sqlline
#
[4]:
from IPython.core.magic import Magics, cell_magic, line_magic, magics_class
from pexpect import spawn
TIMEOUT = 60
PROG = "sqlline.py localhost"
PROMPT = ["0: jdbc:phoenix:localhost> ", "0: jdbc:phoenix:localhost> ", ". . . . . . . . semicolon> "]
QUIT = "!quit"
@magics_class
class Magic(Magics):
def __init__(self, shell):
super().__init__(shell)
self.app = spawn(PROG, timeout=60)
self.app.expect(PROMPT)
@cell_magic
def sqlline(self, line, cell):
cell_lines = [cell_line.strip() for cell_line in cell.split("\n")]
cell_lines = [cell_line for cell_line in cell_lines if cell_line != ""]
for cell_line in cell_lines:
self.app.sendline(cell_line)
self.app.expect(PROMPT, timeout=TIMEOUT)
output = self.app.before.decode()
output = output.replace("\r\n", "\n")
output = output.split("\n")
output = [output_line.strip() for output_line in output]
for output_line in output:
if output_line not in cell_lines:
print(output_line)
return None
@line_magic
def quit(self, line):
self.app.sendline(QUIT)
def load_ipython_extension(ip):
ip.register_magics(Magic(ip))
load_ipython_extension(ip=get_ipython())
sqlline.py localhost
[5]:
%%sqlline
SELECT
state AS "State",
count(city) AS "City Count",
sum(population) AS "Population Sum"
FROM us_population
GROUP BY state
ORDER BY sum(population) DESC;
count(city) AS "City Count",
sum(population) AS "Population Sum"
ORDER BY sum(population) DESC;
>+-------+------------+----------------+
| State | City Count | Population Sum |
+-------+------------+----------------+
| NY | 1 | 8143197 |
| CA | 3 | 6012701 |
| TX | 3 | 4486916 |
| IL | 1 | 2842518 |
| PA | 1 | 1463281 |
| AZ | 1 | 1461575 |
+-------+------------+----------------+
6 rows selected (0.062 seconds)
=
[6]:
%quit