Basic data analysis with Apache Pig

  • 60 min | Last modified: November 07, 2019

Downloading the data

[1]:
filenames = [
    "drivers.csv",
    "timesheet.csv",
    "truck_event_text_partition.csv",
]

url = "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/"

!mkdir -p /tmp/drivers/
for filename in filenames:
    !wget --quiet {url + filename} -P /tmp/drivers/
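
The same download can be done without shell magics. The following is only a sketch using the standard library; `download_all` and its `fetch` parameter are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
import os
from urllib.request import urlretrieve

URL = "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/"
FILENAMES = ["drivers.csv", "timesheet.csv", "truck_event_text_partition.csv"]


def download_all(base_url, filenames, dest_dir, fetch=urlretrieve):
    """Download each file into dest_dir and return the local paths.

    `fetch` is injectable (an assumption of this sketch) so the function
    can be exercised without network access.
    """
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for filename in filenames:
        target = os.path.join(dest_dir, filename)
        fetch(base_url + filename, target)  # urlretrieve(url, filename)
        paths.append(target)
    return paths
```

Calling `download_all(URL, FILENAMES, "/tmp/drivers/")` should place the three files in the same directory that the `wget` loop uses.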

Moving the data to HDFS

[2]:
!hdfs dfs -rm -r -f drivers/ output/
!hdfs dfs -mkdir drivers/
!hdfs dfs -copyFromLocal /tmp/drivers/*.csv  drivers/
!hdfs dfs -ls drivers/*
Deleted drivers
Deleted output
-rw-r--r--   1 root supergroup       2043 2022-05-31 16:45 drivers/drivers.csv
-rw-r--r--   1 root supergroup      26205 2022-05-31 16:45 drivers/timesheet.csv
-rw-r--r--   1 root supergroup    2272077 2022-05-31 16:45 drivers/truck_event_text_partition.csv

Selecting a subset of the data

[3]:
%%writefile truck-events.pig

truck_events = LOAD 'drivers/truck_event_text_partition.csv' USING PigStorage(',')
    AS (
            driverId:int,
            truckId:int,
            eventTime:chararray,
            eventType:chararray,
            longitude:double,
            latitude:double,
            eventKey:chararray,
            correlationId:long,
            driverName:chararray,
            routeId:long,
            routeName:chararray,
            eventDate:chararray
    );

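-- Keep only 10 records to work with a small sample.
truck_events_subset = LIMIT truck_events 10;

-- Project the three columns of interest.
specific_columns = FOREACH truck_events_subset GENERATE driverId, eventTime, eventType;

-- Write the result to HDFS as comma-separated text.
STORE specific_columns INTO 'output/specific_columns' USING PigStorage(',');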
Overwriting truck-events.pig
[4]:
!pig -f truck-events.pig
2022-05-31 16:45:31,859 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:45:32,582 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:45:32,655 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2022-05-31 16:45:32,673 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2022-05-31 16:45:33,143 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2022-05-31 16:45:33,272 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1654012563278_0021
2022-05-31 16:45:33,404 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2022-05-31 16:45:33,448 [JobControl] INFO  org.apache.hadoop.conf.Configuration - resource-types.xml not found
2022-05-31 16:45:33,448 [JobControl] INFO  org.apache.hadoop.yarn.util.resource.ResourceUtils - Unable to find 'resource-types.xml'.
2022-05-31 16:45:33,452 [JobControl] INFO  org.apache.hadoop.yarn.util.resource.ResourceUtils - Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
2022-05-31 16:45:33,452 [JobControl] INFO  org.apache.hadoop.yarn.util.resource.ResourceUtils - Adding resource type - name = vcores, units = , type = COUNTABLE
2022-05-31 16:45:33,488 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1654012563278_0021
2022-05-31 16:45:33,512 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://ca2b226216b1:8088/proxy/application_1654012563278_0021/
2022-05-31 16:45:48,619 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:45:48,627 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:45:48,724 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:45:48,728 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:45:48,745 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:45:48,748 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:45:48,881 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:45:48,895 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2022-05-31 16:45:48,910 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2022-05-31 16:45:48,936 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2022-05-31 16:45:48,974 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1654012563278_0022
2022-05-31 16:45:48,977 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2022-05-31 16:45:49,205 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1654012563278_0022
2022-05-31 16:45:49,209 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://ca2b226216b1:8088/proxy/application_1654012563278_0022/
2022-05-31 16:46:09,517 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:09,523 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:46:09,580 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:09,585 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:46:09,604 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:09,608 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:46:09,638 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:09,641 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:46:09,656 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:09,659 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:46:09,676 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:09,678 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:46:09,694 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:09,697 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:46:09,713 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:09,716 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:46:09,738 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:09,742 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
[5]:
!hdfs dfs -ls output/specific_columns/
Found 2 items
-rw-r--r--   1 root supergroup          0 2022-05-31 16:46 output/specific_columns/_SUCCESS
-rw-r--r--   1 root supergroup        183 2022-05-31 16:46 output/specific_columns/part-r-00000
[6]:
!hdfs dfs -text output/specific_columns/part-r-00000 | head
11,59:21.7,Normal
11,59:22.5,Normal
14,59:21.4,Normal
18,59:21.7,Normal
20,59:22.5,Normal
22,59:21.7,Normal
22,59:22.3,Normal
23,59:22.4,Normal
27,59:21.7,Normal
,eventTime,eventType
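
The last output line, `,eventTime,eventType`, is consistent with the CSV header row being loaded as data. The text `driverId` cannot be cast to the declared `int` type, so Pig stores a null, which `PigStorage` writes as an empty field. The sketch below imitates that behaviour in plain Python on two in-memory sample lines. `to_int_or_none` and `project` are hypothetical helpers, not part of Pig, and only the first four columns are modelled.

```python
import csv
import io

# Sample lines modelled on the output above: a header and one event.
# Only the first four columns are kept to keep the sketch short.
SAMPLE = "driverId,truckId,eventTime,eventType\n11,74,59:21.7,Normal\n"


def to_int_or_none(text):
    """Mimic a typed load: return None when the text is not an integer."""
    try:
        return int(text)
    except ValueError:
        return None


def project(lines):
    """Keep driverId, eventTime and eventType; render None as an empty field."""
    rows = []
    for driver_id, _truck_id, event_time, event_type in csv.reader(lines):
        fields = [to_int_or_none(driver_id), event_time, event_type]
        rows.append(",".join("" if f is None else str(f) for f in fields))
    return rows


result = project(io.StringIO(SAMPLE))
print(result)
```

If the header should not reach the output, one option is to add a Pig statement such as `FILTER truck_events BY driverId IS NOT NULL` before the `LIMIT`.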

Running a join

[7]:
%%writefile join.pig

truck_events = LOAD 'drivers/truck_event_text_partition.csv' USING PigStorage(',')
    AS (
            driverId:int,
            truckId:int,
            eventTime:chararray,
            eventType:chararray,
            longitude:double,
            latitude:double,
            eventKey:chararray,
            correlationId:long,
            driverName:chararray,
            routeId:long,
            routeName:chararray,
            eventDate:chararray
    );

drivers =  LOAD 'drivers/drivers.csv' USING PigStorage(',')
    AS (
            driverId:int,
            name:chararray,
            ssn:chararray,
            location:chararray,
            certified:chararray,
            wage_plan:chararray
    );

-- Inner join of both relations on driverId.
join_data = JOIN truck_events BY (driverId), drivers BY (driverId);

STORE join_data INTO 'output/join_data' USING PigStorage(',');
Overwriting join.pig
[8]:
!pig -f join.pig
2022-05-31 16:46:16,018 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:16,369 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:16,443 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2022-05-31 16:46:16,469 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2022-05-31 16:46:16,488 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2022-05-31 16:46:16,521 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:2
2022-05-31 16:46:16,659 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1654012563278_0023
2022-05-31 16:46:16,794 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2022-05-31 16:46:16,837 [JobControl] INFO  org.apache.hadoop.conf.Configuration - resource-types.xml not found
2022-05-31 16:46:16,837 [JobControl] INFO  org.apache.hadoop.yarn.util.resource.ResourceUtils - Unable to find 'resource-types.xml'.
2022-05-31 16:46:16,840 [JobControl] INFO  org.apache.hadoop.yarn.util.resource.ResourceUtils - Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
2022-05-31 16:46:16,841 [JobControl] INFO  org.apache.hadoop.yarn.util.resource.ResourceUtils - Adding resource type - name = vcores, units = , type = COUNTABLE
2022-05-31 16:46:16,876 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1654012563278_0023
2022-05-31 16:46:16,899 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://ca2b226216b1:8088/proxy/application_1654012563278_0023/
2022-05-31 16:46:32,017 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:32,024 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:46:32,121 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:32,125 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:46:32,143 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:32,146 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:46:32,177 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:32,180 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:46:32,195 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:32,198 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:46:32,215 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:32,219 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
[9]:
!hdfs dfs -ls output/join_data/
Found 2 items
-rw-r--r--   1 root supergroup          0 2022-05-31 16:46 output/join_data/_SUCCESS
-rw-r--r--   1 root supergroup    3283088 2022-05-31 16:46 output/join_data/part-r-00000
[10]:
!hdfs dfs -cat output/join_data/part-r-00000 | head
10,85,00:35.2,Normal,-92.99,37.34,10|85|9223370572464740606,3660000000000000000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-05-27-22,10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
10,23,58:48.7,Normal,-90.69,38.5,10|23|9223370572126447149,1000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-06-02-20,10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
10,23,59:04.1,Normal,-93.69,37.16,10|23|9223370572126431719,1000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-06-02-20,10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
10,43,37:06.0,Normal,-90.69,38.5,10|43|9223370572419349763,1000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-05-28-11,10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
10,39,08:56.0,Normal,-91.44,38.09,10|39|9223370571956639801,1000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-06-02-20,10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
10,23,58:53.0,Normal,-91.44,38.09,10|23|9223370572126442820,1000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-06-02-20,10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
10,39,12:15.4,Normal,-95.69,36.25,10|39|9223370571956440410,1000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-06-02-20,10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
10,85,00:03.3,Normal,-92.89,37.51,10|85|9223370572464772515,3660000000000000000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-05-27-22,10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
10,85,59:45.1,Normal,-95.14,36.66,10|85|9223370572464790666,3660000000000000000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-05-27-22,10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
10,39,09:28.6,Normal,-93.69,37.16,10|39|9223370571956607201,1000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-06-02-20,10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
cat: Unable to write to output stream.
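
Each joined record contains the 12 fields of `truck_events` followed by the 6 fields of `drivers`, which is why the driver id and name appear twice on every line. The following is a rough Python sketch of an inner join on `driverId`, using trimmed in-memory samples. `inner_join` is a hypothetical helper that illustrates the semantics only; it does not describe how Pig executes the join.

```python
from collections import defaultdict

# Trimmed sample records: (driverId, truckId, eventType) and (driverId, name).
truck_events = [
    (10, 85, "Normal"),
    (10, 23, "Normal"),
    (11, 74, "Lane Departure"),
    (99, 1, "Normal"),  # hypothetical event with no matching driver
]
drivers = [
    (10, "George Vetticaden"),
    (11, "Jamie Engesser"),
]


def inner_join(left, right, left_key=0, right_key=0):
    """Index the right side by key, then concatenate every matching pair."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    joined = []
    for row in left:
        for match in index.get(row[left_key], []):
            joined.append(row + match)
    return joined


join_data = inner_join(truck_events, drivers)
for record in join_data:
    print(",".join(str(field) for field in record))
```

The event for the unmatched driver 99 is dropped, as expected for an inner join.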

Sorting data with ‘ORDER BY’

[11]:
%%writefile sort.pig

drivers =  LOAD 'drivers/drivers.csv' USING PigStorage(',')
    AS (
            driverId:int,
            name:chararray,
            ssn:chararray,
            location:chararray,
            certified:chararray,
            wage_plan:chararray
    );

-- Sort the drivers by name in ascending order.
ordered_data = ORDER drivers BY name asc;

STORE ordered_data INTO 'output/ordered_data' USING PigStorage(',');
Overwriting sort.pig
[12]:
!pig -f sort.pig
2022-05-31 16:46:38,366 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:38,702 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:38,774 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2022-05-31 16:46:38,794 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2022-05-31 16:46:38,893 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2022-05-31 16:46:39,050 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1654012563278_0024
2022-05-31 16:46:39,194 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2022-05-31 16:46:39,242 [JobControl] INFO  org.apache.hadoop.conf.Configuration - resource-types.xml not found
2022-05-31 16:46:39,242 [JobControl] INFO  org.apache.hadoop.yarn.util.resource.ResourceUtils - Unable to find 'resource-types.xml'.
2022-05-31 16:46:39,246 [JobControl] INFO  org.apache.hadoop.yarn.util.resource.ResourceUtils - Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
2022-05-31 16:46:39,246 [JobControl] INFO  org.apache.hadoop.yarn.util.resource.ResourceUtils - Adding resource type - name = vcores, units = , type = COUNTABLE
2022-05-31 16:46:39,282 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1654012563278_0024
2022-05-31 16:46:39,308 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://ca2b226216b1:8088/proxy/application_1654012563278_0024/
2022-05-31 16:46:49,416 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:49,421 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:46:49,514 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:49,519 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:46:49,531 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:49,534 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:46:49,678 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:49,688 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2022-05-31 16:46:49,700 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2022-05-31 16:46:49,720 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2022-05-31 16:46:49,754 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1654012563278_0025
2022-05-31 16:46:49,757 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2022-05-31 16:46:49,781 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1654012563278_0025
2022-05-31 16:46:49,784 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://ca2b226216b1:8088/proxy/application_1654012563278_0025/
2022-05-31 16:47:09,795 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:09,801 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:09,853 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:09,858 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:09,876 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:09,880 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:09,986 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:09,997 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2022-05-31 16:47:10,007 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2022-05-31 16:47:10,829 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2022-05-31 16:47:11,253 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1654012563278_0026
2022-05-31 16:47:11,258 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2022-05-31 16:47:11,286 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1654012563278_0026
2022-05-31 16:47:11,288 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://ca2b226216b1:8088/proxy/application_1654012563278_0026/
2022-05-31 16:47:26,395 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:26,400 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:26,442 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:26,445 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:26,457 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:26,459 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:26,483 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:26,485 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:26,499 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:26,502 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:26,513 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:26,516 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:26,532 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:26,535 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:26,549 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:26,551 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:26,565 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:26,567 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:26,582 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:26,585 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:26,597 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:26,599 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:26,612 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:26,614 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
[13]:
!hdfs dfs -ls output/ordered_data/
Found 2 items
-rw-r--r--   1 root supergroup          0 2022-05-31 16:47 output/ordered_data/_SUCCESS
-rw-r--r--   1 root supergroup       2002 2022-05-31 16:47 output/ordered_data/part-r-00000
[14]:
!hdfs dfs -cat output/ordered_data/part-r-00000 | head
23,Adam Diaz,928312208,P.O. Box 260- 6127 Vitae Road,Y,hours
14,Adis Cesir,820812209,Ap #810-1228 In St.,Y,hours
19,Ajay Singh,160005158,592-9430 Nonummy Avenue,Y,hours
36,Andrew Grande,245303216,Ap #685-9598 Egestas Rd.,Y,hours
20,Chris Harris,921812303,883-2691 Proin Avenue,Y,hours
30,Dan Rice,282307061,Ap #881-9267 Mollis Avenue,Y,hours
43,Dave Patton,977706052,3028 A- St.,Y,hours
39,David Kaiser,967706052,9185 At Street,Y,hours
24,Don Hilborn,254412152,4361 Ac Road,Y,hours
35,Emil Siemes,971401151,321-2976 Felis Rd.,Y,hours
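
The ordering above can be reproduced on a small scale in Python. This sketch takes four `(driverId, name)` pairs from the output, shuffles them, and sorts them by name. It assumes that plain string comparison is a close enough stand-in for the way Pig compares `chararray` values on this data.

```python
from operator import itemgetter

# (driverId, name) pairs taken from the output above, deliberately shuffled.
drivers = [
    (36, "Andrew Grande"),
    (23, "Adam Diaz"),
    (19, "Ajay Singh"),
    (14, "Adis Cesir"),
]

# Sort ascending by the name field (index 1); reverse=True would sort descending.
ordered_data = sorted(drivers, key=itemgetter(1))
for driver_id, name in ordered_data:
    print(f"{driver_id},{name}")
```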

Filtering and grouping with “GROUP BY”

[15]:
%%writefile groupby.pig

truck_events = LOAD 'drivers/truck_event_text_partition.csv' USING PigStorage(',')
    AS (
            driverId:int,
            truckId:int,
            eventTime:chararray,
            eventType:chararray,
            longitude:double,
            latitude:double,
            eventKey:chararray,
            correlationId:long,
            driverName:chararray,
            routeId:long,
            routeName:chararray,
            eventDate:chararray
    );

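-- Keep only the events whose type is not 'Normal'.
filtered_events = FILTER truck_events BY NOT (eventType MATCHES 'Normal');

-- Collect the remaining events into one bag per driver.
grouped_events = GROUP filtered_events BY driverId;

STORE grouped_events INTO 'output/grouped_events' USING PigStorage(',');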
Overwriting groupby.pig
[16]:
!pig -f groupby.pig
2022-05-31 16:47:32,676 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:33,054 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:33,125 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2022-05-31 16:47:33,145 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2022-05-31 16:47:33,195 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2022-05-31 16:47:33,344 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1654012563278_0027
2022-05-31 16:47:33,487 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2022-05-31 16:47:33,530 [JobControl] INFO  org.apache.hadoop.conf.Configuration - resource-types.xml not found
2022-05-31 16:47:33,530 [JobControl] INFO  org.apache.hadoop.yarn.util.resource.ResourceUtils - Unable to find 'resource-types.xml'.
2022-05-31 16:47:33,534 [JobControl] INFO  org.apache.hadoop.yarn.util.resource.ResourceUtils - Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
2022-05-31 16:47:33,534 [JobControl] INFO  org.apache.hadoop.yarn.util.resource.ResourceUtils - Adding resource type - name = vcores, units = , type = COUNTABLE
2022-05-31 16:47:33,571 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1654012563278_0027
2022-05-31 16:47:33,597 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://ca2b226216b1:8088/proxy/application_1654012563278_0027/
2022-05-31 16:47:48,714 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:48,721 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:48,813 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:48,817 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:48,835 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:48,838 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:48,869 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:48,872 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:48,888 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:48,891 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2022-05-31 16:47:48,906 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:48,910 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
[17]:
!hdfs dfs -ls output/grouped_events/
Found 2 items
-rw-r--r--   1 root supergroup          0 2022-05-31 16:47 output/grouped_events/_SUCCESS
-rw-r--r--   1 root supergroup       5613 2022-05-31 16:47 output/grouped_events/part-r-00000
[18]:
!hdfs dfs -cat output/grouped_events/part-r-00000 | head
10,{(10,85,00:13.1,Unsafe tail distance,-91.18,38.22,10|85|9223370572464762694,3660000000000000000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-05-27-22),(10,85,00:39.7,Overspeed,-94.23,37.09,10|85|9223370572464736126,3660000000000000000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-05-27-22),(10,85,59:46.9,Overspeed,-95.5,36.37,10|85|9223370572464788896,3660000000000000000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-05-27-22)}
11,{(11,74,00:14.1,Lane Departure,-88.77,40.76,11|74|9223370572464761716,3660000000000000000,Jamie Engesser,1567254452,Saint Louis to Memphis Route2,2016-05-27-22),(11,74,00:49.6,Lane Departure,-89.71,37.47,11|74|9223370572464726246,3660000000000000000,Jamie Engesser,1567254452,Saint Louis to Memphis Route2,2016-05-27-22),(11,74,00:05.4,Unsafe following distance,-89.74,39.1,11|74|9223370572464770396,3660000000000000000,Jamie Engesser,1567254452,Saint Louis to Memphis Route2,2016-05-27-22),(11,74,00:41.0,Lane Departure,-90.07,35.68,11|74|9223370572464734786,3660000000000000000,Jamie Engesser,1567254452,Saint Louis to Memphis Route2,2016-05-27-22),(11,74,59:56.4,Lane Departure,-87.67,41.87,11|74|9223370572464779456,3660000000000000000,Jamie Engesser,1567254452,Saint Louis to Memphis Route2,2016-05-27-22),(11,74,59:38.0,Unsafe tail distance,-89.17,40.38,11|74|9223370572464797796,3660000000000000000,Jamie Engesser,1567254452,Saint Louis to Memphis Route2,2016-05-27-22),(11,74,59:47.3,Unsafe tail distance,-89.63,39.84,11|74|9223370572464788546,3660000000000000000,Jamie Engesser,1567254452,Saint Louis to Memphis Route2,2016-05-27-22),(11,74,59:29.1,Overspeed,-88.07,41.48,11|74|9223370572464806746,3660000000000000000,Jamie Engesser,1567254452,Saint Louis to Memphis Route2,2016-05-27-22),(11,74,00:32.0,Unsafe tail distance,-90.2,38.65,11|74|9223370572464743846,3660000000000000000,Jamie Engesser,1567254452,Saint Louis to Memphis Route2,2016-05-27-22),(11,74,00:23.1,Unsafe tail distance,-88.42,41.11,11|74|9223370572464752715,3660000000000000000,Jamie Engesser,1567254452,Saint Louis to Memphis Route2,2016-05-27-22)}
12,{(12,104,00:47.6,Unsafe following distance,-90.0,37.72,12|104|9223370572464728186,3660000000000000000,Paul Codding,24929475,Peoria to Ceder Rapids,2016-05-27-22)}
13,{(13,89,00:47.7,Lane Departure,-89.03,41.92,13|89|9223370572464728156,3660000000000000000,Joe Niemiec,927636994,Des Moines to Chicago.kml,2016-05-27-22)}
14,{(14,25,00:48.4,Unsafe following distance,-91.63,41.72,14|25|9223370572464727394,3660000000000000000,Adis Cesir,160405074,Joplin to Kansas City Route 2,2016-05-27-22)}
15,{(15,51,00:48.8,Lane Departure,-90.04,35.19,15|51|9223370572464727025,3660000000000000000,Rohit Bakshi,1384345811,Joplin to Kansas City,2016-05-27-22)}
16,{(16,12,00:48.9,Lane Departure,-89.52,40.7,16|12|9223370572464726925,3660000000000000000,Tom McCuch,1961634315,Saint Louis to Memphis,2016-05-27-22)}
17,{(17,15,00:48.4,Lane Departure,-90.79,38.83,17|15|9223370572464727374,3660000000000000000,Eric Mizell,1927624662,Springfield to KC Via Columbia,2016-05-27-22)}
18,{(18,16,00:47.2,Overspeed,-94.28,39.53,18|16|9223370572464728575,3660000000000000000,Grant Liu,1565885487,Springfield to KC Via Hanibal,2016-05-27-22)}
19,{(19,26,00:48.6,Unsafe following distance,-94.57,35.37,19|26|9223370572464727224,3660000000000000000,Ajay Singh,1962261785,Wichita to Little Rock.kml,2016-05-27-22)}
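
Each output line holds the group key followed by a bag with the complete tuples of that driver. As an illustration, the sketch below applies the same filter and grouping in Python to a few events copied from the outputs above, trimmed to four fields, and then counts the non-normal events per driver. The counting step is an addition of this sketch and is not part of the original script.

```python
import re
from collections import defaultdict

# Trimmed sample events: (driverId, truckId, eventTime, eventType).
truck_events = [
    (10, 85, "00:35.2", "Normal"),
    (10, 85, "00:13.1", "Unsafe tail distance"),
    (10, 85, "00:39.7", "Overspeed"),
    (12, 104, "00:47.6", "Unsafe following distance"),
]

# Like MATCHES, fullmatch requires the pattern to cover the whole string.
filtered_events = [e for e in truck_events if not re.fullmatch("Normal", e[3])]

# One list (the "bag") of events per driverId.
grouped_events = defaultdict(list)
for event in filtered_events:
    grouped_events[event[0]].append(event)

# Number of non-normal events per driver.
counts = {driver_id: len(bag) for driver_id, bag in grouped_events.items()}
print(counts)
```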

[19]:
!rm -f *.log *.pig /tmp/drivers/*.csv