Uso de SQL en Apache HBase usando Apache Phoenix#

  • Última modificación: Mayo 25, 2022

Introducción#

  • Es una capa SQL sobre HBase.

  • Compila las consultas SQL en scans de HBase.

  • Orquesta la ejecución de los scans en paralelo.

  • Aplica optimizaciones.

Creación del archivo SQL#

[1]:
%%writefile /tmp/us_population.sql
CREATE TABLE IF NOT EXISTS us_population (
      state CHAR(2) NOT NULL,
      city VARCHAR NOT NULL,
      population BIGINT
      CONSTRAINT my_pk PRIMARY KEY (state, city)
);
Overwriting /tmp/us_population.sql

Creación del archivo de datos#

[2]:
%%writefile /tmp/us_population.csv
NY,New York,8143197
CA,Los Angeles,3844829
IL,Chicago,2842518
TX,Houston,2016582
PA,Philadelphia,1463281
AZ,Phoenix,1461575
TX,San Antonio,1256509
CA,San Diego,1255540
TX,Dallas,1213825
CA,San Jose,912332
Overwriting /tmp/us_population.csv

Creación y llenado de la tabla#

[3]:
!psql.py localhost /tmp/us_population.sql /tmp/us_population.csv
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/phoenix/phoenix-client-hbase-2.3-5.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
22/05/27 18:46:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
no rows upserted
Time: 0.012 sec(s)

csv columns from database.
CSV Upsert complete. 10 rows upserted
Time: 0.054 sec(s)

Cell magic %%sqlline#

[4]:
from IPython.core.magic import Magics, cell_magic, line_magic, magics_class
from pexpect import spawn

TIMEOUT = 60
PROG = "sqlline.py localhost"
PROMPT = ["0: jdbc:phoenix:localhost> ", "0: jdbc:phoenix:localhost> ", ". . . . . . . . semicolon> "]
QUIT = "!quit"


@magics_class
class Magic(Magics):
    def __init__(self, shell):
        super().__init__(shell)
        self.app = spawn(PROG, timeout=60)
        self.app.expect(PROMPT)

    @cell_magic
    def sqlline(self, line, cell):
        cell_lines = [cell_line.strip() for cell_line in cell.split("\n")]
        cell_lines = [cell_line for cell_line in cell_lines if cell_line != ""]
        for cell_line in cell_lines:
            self.app.sendline(cell_line)
            self.app.expect(PROMPT, timeout=TIMEOUT)
            output = self.app.before.decode()
            output = output.replace("\r\n", "\n")
            output = output.split("\n")
            output = [output_line.strip() for output_line in output]
            for output_line in output:
                if output_line not in cell_lines:
                    print(output_line)
        return None

    @line_magic
    def quit(self, line):
        self.app.sendline(QUIT)


def load_ipython_extension(ip):
    ip.register_magics(Magic(ip))


load_ipython_extension(ip=get_ipython())
sqlline.py localhost
[5]:
%%sqlline
SELECT
    state AS "State",
    count(city) AS "City Count",
    sum(population) AS "Population Sum"
FROM us_population
GROUP BY state
ORDER BY sum(population) DESC;


count(city) AS "City Count",

sum(population) AS "Population Sum"



ORDER BY sum(population) DESC;
>+-------+------------+----------------+
| State | City Count | Population Sum |
+-------+------------+----------------+
| NY    | 1          | 8143197        |
| CA    | 3          | 6012701        |
| TX    | 3          | 4486916        |
| IL    | 1          | 2842518        |
| PA    | 1          | 1463281        |
| AZ    | 1          | 1461575        |
+-------+------------+----------------+
6 rows selected (0.062 seconds)
=
[6]:
%quit