Word count in Apache Pig using Jupyter

  • Last modified: May 16, 2021 | YouTube

Cell magic %%pig

[1]:
from IPython.core.magic import Magics, cell_magic, line_magic, magics_class
from pexpect import spawn

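# Seconds to wait for the Grunt prompt before pexpect gives up.
TIMEOUT = 60
# Executable that starts the Grunt shell.
PROG = "pig"
# Prompts printed by Grunt: continuation (">> ") and ready ("grunt> ").
PROMPT = ["\r\n>> ", "\r\ngrunt> "]
# Output lines containing any of these substrings are not printed.
DISCARD = ["INFO  org.apache", "WARN  org.apache"]
# Command that ends the Grunt session.
QUIT = "quit"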


@magics_class
class Magic(Magics):
    def __init__(self, shell):
        super().__init__(shell)
        self.app = spawn(PROG, timeout=TIMEOUT)
        self.app.expect(PROMPT)

    @cell_magic
    def pig(self, line, cell):
        cell_lines = [cell_line.strip() for cell_line in cell.split("\n")]
        cell_lines = [cell_line for cell_line in cell_lines if cell_line != ""]
        for cell_line in cell_lines:
            self.app.sendline(cell_line)
            self.app.expect(PROMPT, timeout=TIMEOUT)
            output = self.app.before.decode()
            output = output.replace("\r\n", "\n")
            output = output.split("\n")
            output = [output_line.strip() for output_line in output]
            for output_line in output:
                if output_line not in cell_lines:
                    if not any(word in output_line for word in DISCARD):
                        print(output_line)
        return None

    @line_magic
    def quit(self, line):
        self.app.sendline(QUIT)


def load_ipython_extension(ip):
    ip.register_magics(Magic(ip))


load_ipython_extension(ip=get_ipython())
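The filtering step inside the `pig` cell magic can be tried without a Pig installation. The following is a minimal sketch, not part of the original notebook: `filter_output` is a hypothetical helper that repeats the same rules (drop the echoed commands, drop lines containing a `DISCARD` marker), and the captured text, including its log line, is made up for illustration.

```python
# Hypothetical helper (not in the original cell): the same filtering that
# the %%pig magic applies to the text captured before each prompt.
DISCARD = ["INFO  org.apache", "WARN  org.apache"]


def filter_output(raw, cell_lines):
    """Return the captured lines that the magic would print."""
    kept = []
    for output_line in raw.replace("\r\n", "\n").split("\n"):
        output_line = output_line.strip()
        # Skip the echo of the commands that were sent to Grunt.
        if output_line in cell_lines:
            continue
        # Skip log lines that contain any of the DISCARD markers.
        if any(word in output_line for word in DISCARD):
            continue
        kept.append(output_line)
    return kept


# Made-up capture: one echoed command, one log line, two result tuples.
raw = (
    "DUMP s;\r\n"
    "2022-05-16 23:31:00,000 [main] INFO  org.apache.pig.Main - made-up log line\r\n"
    "(a,1)\r\n"
    "(DA,1)"
)
print(filter_output(raw, ["DUMP s;"]))
```

Keeping the rules in a plain function like this makes it easy to adjust `DISCARD` and check the effect before starting a Grunt session.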

Test files

[2]:
!mkdir -p /tmp/input
[3]:
%%writefile /tmp/input/text0.txt
Analytics is the discovery, interpretation, and communication of meaningful patterns
in data. Especially valuable in areas rich with recorded information, analytics relies
on the simultaneous application of statistics, computer programming and operations research
to quantify performance.

Organizations may apply analytics to business data to describe, predict, and improve business
performance. Specifically, areas within analytics include predictive analytics, prescriptive
analytics, enterprise decision management, descriptive analytics, cognitive analytics, Big
Data Analytics, retail analytics, store assortment and stock-keeping unit optimization,
marketing optimization and marketing mix modeling, web analytics, call analytics, speech
analytics, sales force sizing and optimization, price and promotion modeling, predictive
science, credit risk analysis, and fraud analytics. Since analytics can require extensive
computation (see big data), the algorithms and software used for analytics harness the most
current methods in computer science, statistics, and mathematics.
Overwriting /tmp/input/text0.txt
[4]:
%%writefile /tmp/input/text1.txt
The field of data analysis. Analytics often involves studying past historical data to
research potential trends, to analyze the effects of certain decisions or events, or to
evaluate the performance of a given tool or scenario. The goal of analytics is to improve
the business by gaining knowledge which can be used to make improvements or changes.
Overwriting /tmp/input/text1.txt
[5]:
%%writefile /tmp/input/text2.txt
Data analytics (DA) is the process of examining data sets in order to draw conclusions
about the information they contain, increasingly with the aid of specialized systems
and software. Data analytics technologies and techniques are widely used in commercial
industries to enable organizations to make more-informed business decisions and by
scientists and researchers to verify or disprove scientific models, theories and
hypotheses.
Overwriting /tmp/input/text2.txt

Running Pig in Jupyter

[6]:
%%pig
fs -mkdir input
fs -put /tmp/input/  .
fs -ls input/
mkdir: `input': File exists
put: `input/text0.txt': File exists
put: `input/text1.txt': File exists
put: `input/text2.txt': File exists
Found 3 items
-rw-r--r--   1 root supergroup       1093 2022-05-16 23:31 input/text0.txt
-rw-r--r--   1 root supergroup        352 2022-05-16 23:31 input/text1.txt
-rw-r--r--   1 root supergroup        440 2022-05-16 23:31 input/text2.txt
[7]:
%%pig
lines = LOAD 'input/text*.txt' AS (line:CHARARRAY);
DUMP lines;
(Analytics is the discovery, interpretation, and communication of meaningful patterns )
(in data. Especially valuable in areas rich with recorded information, analytics relies )
(on the simultaneous application of statistics, computer programming and operations research )
(to quantify performance.)
()
(Organizations may apply analytics to business data to describe, predict, and improve business )
(performance. Specifically, areas within analytics include predictive analytics, prescriptive )
(analytics, enterprise decision management, descriptive analytics, cognitive analytics, Big )
(Data Analytics, retail analytics, store assortment and stock-keeping unit optimization, )
(marketing optimization and marketing mix modeling, web analytics, call analytics, speech )
(analytics, sales force sizing and optimization, price and promotion modeling, predictive )
(science, credit risk analysis, and fraud analytics. Since analytics can require extensive )
(computation (see big data), the algorithms and software used for analytics harness the most )
(current methods in computer science, statistics, and mathematics.)
(Data analytics (DA) is the process of examining data sets in order to draw conclusions )
(about the information they contain, increasingly with the aid of specialized systems )
(and software. Data analytics technologies and techniques are widely used in commercial )
(industries to enable organizations to make more-informed business decisions and by )
(scientists and researchers to verify or disprove scientific models, theories and )
(hypotheses.)
(The field of data analysis. Analytics often involves studying past historical data to )
(research potential trends, to analyze the effects of certain decisions or events, or to )
(evaluate the performance of a given tool or scenario. The goal of analytics is to improve )
(the business by gaining knowledge which can be used to make improvements or changes.)
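As a rough local analogy for the `LOAD` statement above, the sketch below reads every line of every file matching `text*.txt` into a one-field record. It is only an illustration under stated assumptions: a temporary local folder with two made-up files stands in for the HDFS `input` directory, and the files are sorted for reproducibility. The Pig output above lists text2.txt before text1.txt, so the file order in Pig should not be relied on.

```python
import glob
import os
import tempfile

# Stand-in for the HDFS `input` directory: a temporary local folder with
# two made-up files whose names follow the pattern used above.
folder = tempfile.mkdtemp()
for name, text in [("text0.txt", "first line\nsecond line\n"),
                   ("text1.txt", "third line\n")]:
    with open(os.path.join(folder, name), "w") as handle:
        handle.write(text)

# Every line of every matching file becomes a one-field record.
lines = []
for path in sorted(glob.glob(os.path.join(folder, "text*.txt"))):
    with open(path) as handle:
        for line in handle:
            lines.append((line.rstrip("\n"),))

print(lines)
```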
[8]:
%%pig
-- generates a relation named words with one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
DUMP words;
(Analytics)
(is)
(the)
(discovery)
(interpretation)
(and)
(communication)
(of)
(meaningful)
(patterns)
(in)
(data.)
(Especially)
(valuable)
(in)
(areas)
(rich)
(with)
(recorded)
(information)
(analytics)
(relies)
(on)
(the)
(simultaneous)
(application)
(of)
(statistics)
(computer)
(programming)
(and)
(operations)
(research)
(to)
(quantify)
(performance.)
()
(Organizations)
(may)
(apply)
(analytics)
(to)
(business)
(data)
(to)
(describe)
(predict)
(and)
(improve)
(business)
(performance.)
(Specifically)
(areas)
(within)
(analytics)
(include)
(predictive)
(analytics)
(prescriptive)
(analytics)
(enterprise)
(decision)
(management)
(descriptive)
(analytics)
(cognitive)
(analytics)
(Big)
(Data)
(Analytics)
(retail)
(analytics)
(store)
(assortment)
(and)
(stock-keeping)
(unit)
(optimization)
(marketing)
(optimization)
(and)
(marketing)
(mix)
(modeling)
(web)
(analytics)
(call)
(analytics)
(speech)
(analytics)
(sales)
(force)
(sizing)
(and)
(optimization)
(price)
(and)
(promotion)
(modeling)
(predictive)
(science)
(credit)
(risk)
(analysis)
(and)
(fraud)
(analytics.)
(Since)
(analytics)
(can)
(require)
(extensive)
(computation)
(see)
(big)
(data)
(the)
(algorithms)
(and)
(software)
(used)
(for)
(analytics)
(harness)
(the)
(most)
(current)
(methods)
(in)
(computer)
(science)
(statistics)
(and)
(mathematics.)
(Data)
(analytics)
(DA)
(is)
(the)
(process)
(of)
(examining)
(data)
(sets)
(in)
(order)
(to)
(draw)
(conclusions)
(about)
(the)
(information)
(they)
(contain)
(increasingly)
(with)
(the)
(aid)
(of)
(specialized)
(systems)
(and)
(software.)
(Data)
(analytics)
(technologies)
(and)
(techniques)
(are)
(widely)
(used)
(in)
(commercial)
(industries)
(to)
(enable)
(organizations)
(to)
(make)
(more-informed)
(business)
(decisions)
(and)
(by)
(scientists)
(and)
(researchers)
(to)
(verify)
(or)
(disprove)
(scientific)
(models)
(theories)
(and)
(hypotheses.)
(The)
(field)
(of)
(data)
(analysis.)
(Analytics)
(often)
(involves)
(studying)
(past)
(historical)
(data)
(to)
(research)
(potential)
(trends)
(to)
(analyze)
(the)
(effects)
(of)
(certain)
(decisions)
(or)
(events)
(or)
(to)
(evaluate)
(the)
(performance)
(of)
(a)
(given)
(tool)
(or)
(scenario.)
(The)
(goal)
(of)
(analytics)
(is)
(to)
(improve)
(the)
(business)
(by)
(gaining)
(knowledge)
(which)
(can)
(be)
(used)
(to)
(make)
(improvements)
(or)
(changes.)
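The output above shows that commas and parentheses disappear while periods stay attached to the words (`data.`, `performance.`). The sketch below imitates that behavior in Python. It assumes the default `TOKENIZE` delimiters are space, double quote, comma, parentheses and asterisk; `tokenize` is a hypothetical helper name. It does not reproduce the empty `()` record that the blank line of text0.txt produces.

```python
import re

# Assumption: the default TOKENIZE delimiters are space, double quote,
# comma, parentheses and asterisk.
DELIMITERS = re.compile(r'[ ",()*]+')


def tokenize(line):
    """Split a line on the delimiters and drop empty pieces."""
    return [token for token in DELIMITERS.split(line) if token]


lines = [
    "computation (see big data), the algorithms",
    "Data analytics (DA) is the process",
]
# FLATTEN turns each bag of tokens into one record per token.
words = [word for line in lines for word in tokenize(line)]
print(words)
```

Because periods are not delimiters, `data` and `data.` end up as different words in the later counts; stripping punctuation would need an extra step.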
[9]:
%%pig
-- groups the records that share the same word
grouped = GROUP words BY word;
DUMP grouped;
(a,{(a)})
(DA,{(DA)})
(be,{(be)})
(by,{(by),(by)})
(in,{(in),(in),(in),(in),(in)})
(is,{(is),(is),(is)})
(of,{(of),(of),(of),(of),(of),(of),(of),(of)})
(on,{(on)})
(or,{(or),(or),(or),(or),(or)})
(to,{(to),(to),(to),(to),(to),(to),(to),(to),(to),(to),(to),(to)})
(Big,{(Big)})
(The,{(The),(The)})
(aid,{(aid)})
(and,{(and),(and),(and),(and),(and),(and),(and),(and),(and),(and),(and),(and),(and),(and),(and)})
(are,{(are)})
(big,{(big)})
(can,{(can),(can)})
(for,{(for)})
(may,{(may)})
(mix,{(mix)})
(see,{(see)})
(the,{(the),(the),(the),(the),(the),(the),(the),(the),(the),(the)})
(web,{(web)})
(Data,{(Data),(Data),(Data)})
(call,{(call)})
(data,{(data),(data),(data),(data),(data)})
(draw,{(draw)})
(goal,{(goal)})
(make,{(make),(make)})
(most,{(most)})
(past,{(past)})
(rich,{(rich)})
(risk,{(risk)})
(sets,{(sets)})
(they,{(they)})
(tool,{(tool)})
(unit,{(unit)})
(used,{(used),(used),(used)})
(with,{(with),(with)})
(Since,{(Since)})
(about,{(about)})
(apply,{(apply)})
(areas,{(areas),(areas)})
(data.,{(data.)})
(field,{(field)})
(force,{(force)})
(fraud,{(fraud)})
(given,{(given)})
(often,{(often)})
(order,{(order)})
(price,{(price)})
(sales,{(sales)})
(store,{(store)})
(which,{(which)})
(credit,{(credit)})
(enable,{(enable)})
(events,{(events)})
(models,{(models)})
(relies,{(relies)})
(retail,{(retail)})
(sizing,{(sizing)})
(speech,{(speech)})
(trends,{(trends)})
(verify,{(verify)})
(widely,{(widely)})
(within,{(within)})
(analyze,{(analyze)})
(certain,{(certain)})
(contain,{(contain)})
(current,{(current)})
(effects,{(effects)})
(gaining,{(gaining)})
(harness,{(harness)})
(improve,{(improve),(improve)})
(include,{(include)})
(methods,{(methods)})
(predict,{(predict)})
(process,{(process)})
(require,{(require)})
(science,{(science),(science)})
(systems,{(systems)})
(analysis,{(analysis)})
(business,{(business),(business),(business),(business)})
(changes.,{(changes.)})
(computer,{(computer),(computer)})
(decision,{(decision)})
(describe,{(describe)})
(disprove,{(disprove)})
(evaluate,{(evaluate)})
(involves,{(involves)})
(modeling,{(modeling),(modeling)})
(patterns,{(patterns)})
(quantify,{(quantify)})
(recorded,{(recorded)})
(research,{(research),(research)})
(software,{(software)})
(studying,{(studying)})
(theories,{(theories)})
(valuable,{(valuable)})
(Analytics,{(Analytics),(Analytics),(Analytics)})
(analysis.,{(analysis.)})
(analytics,{(analytics),(analytics),(analytics),(analytics),(analytics),(analytics),(analytics),(analytics),(analytics),(analytics),(analytics),(analytics),(analytics),(analytics),(analytics),(analytics)})
(cognitive,{(cognitive)})
(decisions,{(decisions),(decisions)})
(discovery,{(discovery)})
(examining,{(examining)})
(extensive,{(extensive)})
(knowledge,{(knowledge)})
(marketing,{(marketing),(marketing)})
(potential,{(potential)})
(promotion,{(promotion)})
(scenario.,{(scenario.)})
(software.,{(software.)})
(Especially,{(Especially)})
(algorithms,{(algorithms)})
(analytics.,{(analytics.)})
(assortment,{(assortment)})
(commercial,{(commercial)})
(enterprise,{(enterprise)})
(historical,{(historical)})
(industries,{(industries)})
(management,{(management)})
(meaningful,{(meaningful)})
(operations,{(operations)})
(predictive,{(predictive),(predictive)})
(scientific,{(scientific)})
(scientists,{(scientists)})
(statistics,{(statistics),(statistics)})
(techniques,{(techniques)})
(application,{(application)})
(computation,{(computation)})
(conclusions,{(conclusions)})
(descriptive,{(descriptive)})
(hypotheses.,{(hypotheses.)})
(information,{(information),(information)})
(performance,{(performance)})
(programming,{(programming)})
(researchers,{(researchers)})
(specialized,{(specialized)})
(Specifically,{(Specifically)})
(improvements,{(improvements)})
(increasingly,{(increasingly)})
(mathematics.,{(mathematics.)})
(optimization,{(optimization),(optimization),(optimization)})
(performance.,{(performance.),(performance.)})
(prescriptive,{(prescriptive)})
(simultaneous,{(simultaneous)})
(technologies,{(technologies)})
(Organizations,{(Organizations)})
(communication,{(communication)})
(more-informed,{(more-informed)})
(organizations,{(organizations)})
(stock-keeping,{(stock-keeping)})
(interpretation,{(interpretation)})
(,{()})
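Each row of the output above pairs a key with a bag holding every record that carried that key. A minimal Python sketch of the same idea, using a short made-up word list rather than the full data set:

```python
from collections import defaultdict

# Made-up sample; in the notebook the words come from the TOKENIZE step.
words = ["the", "business", "by", "the", "by", "the"]

# Collect every occurrence of a word under its key, like GROUP ... BY.
grouped = defaultdict(list)
for word in words:
    grouped[word].append(word)

for key, bag in grouped.items():
    print(key, bag)
```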
[10]:
%%pig
-- generates a relation with the number of occurrences in each group
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
(a,1)
(DA,1)
(be,1)
(by,2)
(in,5)
(is,3)
(of,8)
(on,1)
(or,5)
(to,12)
(Big,1)
(The,2)
(aid,1)
(and,15)
(are,1)
(big,1)
(can,2)
(for,1)
(may,1)
(mix,1)
(see,1)
(the,10)
(web,1)
(Data,3)
(call,1)
(data,5)
(draw,1)
(goal,1)
(make,2)
(most,1)
(past,1)
(rich,1)
(risk,1)
(sets,1)
(they,1)
(tool,1)
(unit,1)
(used,3)
(with,2)
(Since,1)
(about,1)
(apply,1)
(areas,2)
(data.,1)
(field,1)
(force,1)
(fraud,1)
(given,1)
(often,1)
(order,1)
(price,1)
(sales,1)
(store,1)
(which,1)
(credit,1)
(enable,1)
(events,1)
(models,1)
(relies,1)
(retail,1)
(sizing,1)
(speech,1)
(trends,1)
(verify,1)
(widely,1)
(within,1)
(analyze,1)
(certain,1)
(contain,1)
(current,1)
(effects,1)
(gaining,1)
(harness,1)
(improve,2)
(include,1)
(methods,1)
(predict,1)
(process,1)
(require,1)
(science,2)
(systems,1)
(analysis,1)
(business,4)
(changes.,1)
(computer,2)
(decision,1)
(describe,1)
(disprove,1)
(evaluate,1)
(involves,1)
(modeling,2)
(patterns,1)
(quantify,1)
(recorded,1)
(research,2)
(software,1)
(studying,1)
(theories,1)
(valuable,1)
(Analytics,3)
(analysis.,1)
(analytics,16)
(cognitive,1)
(decisions,2)
(discovery,1)
(examining,1)
(extensive,1)
(knowledge,1)
(marketing,2)
(potential,1)
(promotion,1)
(scenario.,1)
(software.,1)
(Especially,1)
(algorithms,1)
(analytics.,1)
(assortment,1)
(commercial,1)
(enterprise,1)
(historical,1)
(industries,1)
(management,1)
(meaningful,1)
(operations,1)
(predictive,2)
(scientific,1)
(scientists,1)
(statistics,2)
(techniques,1)
(application,1)
(computation,1)
(conclusions,1)
(descriptive,1)
(hypotheses.,1)
(information,2)
(performance,1)
(programming,1)
(researchers,1)
(specialized,1)
(Specifically,1)
(improvements,1)
(increasingly,1)
(mathematics.,1)
(optimization,3)
(performance.,2)
(prescriptive,1)
(simultaneous,1)
(technologies,1)
(Organizations,1)
(communication,1)
(more-informed,1)
(organizations,1)
(stock-keeping,1)
(interpretation,1)
(,0)
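The last row above, `(,0)`, is the group created by the blank line: its bag holds one record with a null word, and `COUNT` does not count tuples whose first field is null. The sketch below mimics this with `None` standing in for null; the word list is made up for illustration.

```python
# Made-up sample; None stands in for the null word from the blank line.
words = ["to", "improve", None, "to", "business", "to"]

# Group first, keeping the null key as its own group.
grouped = {}
for word in words:
    grouped.setdefault(word, []).append(word)

# Count only the non-null entries of each bag.
wordcount = {key: sum(1 for item in bag if item is not None)
             for key, bag in grouped.items()}
print(wordcount)
```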
[11]:
%%pig
-- selects 15 records; without a preceding ORDER BY, LIMIT does not guarantee which ones
s = LIMIT wordcount 15;
DUMP s;
(a,1)
(DA,1)
(be,1)
(by,2)
(in,5)
(is,3)
(of,8)
(on,1)
(or,5)
(to,12)
(Big,1)
(The,2)
(aid,1)
(and,15)
(are,1)
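The 15 rows above are simply the first ones in the order the previous step produced, not the most frequent words. The sketch below shows the difference in Python on a few pairs copied from that output; sorting by count before taking the first records is an optional extra step, not something the notebook does.

```python
from itertools import islice

# A few (word, count) pairs copied from the output above.
wordcount = [("a", 1), ("DA", 1), ("be", 1), ("by", 2), ("in", 5), ("is", 3)]

# LIMIT-like behavior: take the first n records in whatever order they arrive.
s = list(islice(wordcount, 3))
print(s)

# If the most frequent words are wanted instead, sort by count first.
top = sorted(wordcount, key=lambda pair: pair[1], reverse=True)[:3]
print(top)
```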
[12]:
%quit