SE6023 Lab2 Pig

tags: `hadoop`

Introducing Pig!

Pig is made up of two pieces:
– The language used to express data flows, called Pig Latin.
– The execution environment that runs Pig Latin programs.
There are currently two environments:
– Local execution in a single JVM, started with pig -x local
– Distributed execution on a Hadoop cluster (our focus!)
Pig handles semi-structured data.

Key Properties

Ease of programming.
It is trivial to achieve parallel execution of simple, “embarrassingly parallel” data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimization opportunities.
The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility.
Users can create their own functions to do special-purpose processing.

Definitions and Data Types

Relations, Bags, Tuples, Fields

Pig Latin statements work with relations. A relation can be defined as follows:

A field is a piece of data.

A tuple is an ordered set of fields.

A bag is a collection of tuples.

A relation is a bag (more specifically, an outer bag).

For example,

> A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
> DUMP A;
(John,18,4.0F) <-tuple
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)

	1st Field	2nd Field	3rd Field
Data type	chararray	int	float
Positional notation (generated by the system)	$0	$1	$2
Possible name (assigned by you using a schema)	name	age	gpa
Field value (for the first tuple)	John	18	4.0

For debugging purposes and ease of comprehension, it is better to use field names.
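
As a rough mental model, the relation A above can be mirrored in plain Python. This is only an illustration of the terminology, not Pig itself: the class name `Student` is made up for the sketch, and a Python list is ordered whereas a Pig bag is not.

```python
# Plain-Python model of relation A (illustration only, not Pig).
from typing import NamedTuple

class Student(NamedTuple):  # hypothetical name; plays the role of the schema
    name: str    # chararray
    age: int     # int
    gpa: float   # float

# A relation is an outer bag: a collection of tuples.
A = [
    Student("John", 18, 4.0),
    Student("Mary", 19, 3.8),
    Student("Bill", 20, 3.9),
    Student("Joe", 18, 3.8),
]

first = A[0]
# Positional access, like $0, $1, $2 in Pig Latin.
by_position = (first[0], first[1], first[2])
# Access by field name, like name, age, gpa from the schema.
by_name = (first.name, first.age, first.gpa)
print(by_position == by_name)
```

Both access styles reach the same fields; the named form is the one that stays readable when a script grows.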

Type	Example
tuple	(1,'programmer')
bag	{(1,'programmer'),(2),(3,'apple')}

Conventions	Description	Example
( )	Parentheses enclose one or more items.	Multiple items: `(1, abc, (2,4,6) )`

Expressions

In Pig Latin, expressions are language constructs used with the FILTER, FOREACH, GROUP, and SPLIT operators as well as the eval functions.

For example:

## An arithmetic expression could look like this:
> X = GROUP A BY f2*f3;

## A string expression could look like this, where a and b are both chararrays:
> X = FOREACH A GENERATE CONCAT(a,b);

## A boolean expression could look like this:
> X = FILTER A BY (f1==8) OR (NOT (f2+f3 > f1));

Reference: https://pig.apache.org/docs/latest/basic.html
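
To make the three expression kinds above concrete, here is a small plain-Python sketch of what each one computes. The sample rows are hypothetical, the field names f1, f2, f3 follow the examples, and Pig's null handling is not modelled.

```python
# Hypothetical sample rows; field names follow the Pig Latin examples.
A = [
    {"f1": 8, "f2": 2, "f3": 3},
    {"f1": 1, "f2": 4, "f3": 5},
    {"f1": 9, "f2": 1, "f3": 2},
]

# GROUP A BY f2*f3: the arithmetic expression becomes the group key.
groups = {}
for row in A:
    groups.setdefault(row["f2"] * row["f3"], []).append(row)

# FILTER A BY (f1==8) OR (NOT (f2+f3 > f1)): keep rows where the test is true.
kept = [r for r in A if (r["f1"] == 8) or not (r["f2"] + r["f3"] > r["f1"])]

# CONCAT(a,b) on two chararrays behaves like string concatenation.
concat = "pig" + "latin"

print(sorted(groups))
print([r["f1"] for r in kept])
print(concat)
```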

Pig Hands-on

Note
Replace all occurrences of a000000000 with your username.

Import data

import test.txt

## Retrieve test data
~$ wget pdc19.csie.ncu.edu.tw/lab2/test.txt
~$ cat test.txt
1950 11 1
1950 12 2
1955 11 1
1954 12 2
1953 13 1
1954 12 2
...
~$ hadoop fs -put ./test.txt /user/a000000000/test.txt

## Run pig
# pig -x tez  # makes some queries faster, but ILLUSTRATE will fail
# pig -d WARN # will hide some annoying INFO messages
~$ pig

## Load data of test.txt
## Use PigStorage(' ') to use ' '(space) as the delimiter.
grunt> records = load '/user/a000000000/test.txt' using PigStorage(' ') as (year:chararray, temperature:int, quality:int);

## Show the structure of a relation
grunt> describe records;
records: {year: chararray,temperature: int,quality: int}

## Show records in test.txt
grunt> dump records;
(1950,11,1)
(1950,12,2)
(1955,11,1)
(1954,12,2)
(1953,13,1)
(1954,12,2)
(1953,11,1)
(1952,12,2)
(1951,13,1)
(1952,14,2)
(1953,13,2)
(1954,12,1)
(1955,11,1)
(1956,12,2)
(1954,13,2)

grunt>

Data Processing

Filtering: removes unwanted rows from a relation.
Grouping: groups the rows of a single relation by a key.

## Data filtering
grunt> rec1 = filter records by temperature >11 and (quality == 0 or quality == 2);
grunt> dump rec1;
(1950,12,2)
(1954,12,2)
(1954,12,2)
(1952,12,2)
(1952,14,2)
(1953,13,2)
(1956,12,2)
(1954,13,2)

## Data Grouping
grunt> rec2 = group rec1 by year;
grunt> describe rec2;
rec2: {group: chararray,rec1: {(year: chararray,temperature: int,quality: int)}}
grunt> dump rec2;
(1950,{(1950,12,2)})
(1952,{(1952,12,2),(1952,14,2)})
(1953,{(1953,13,2)})
(1954,{(1954,12,2),(1954,12,2),(1954,13,2)})
(1956,{(1956,12,2)})
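
The filter and group steps above can be mirrored in plain Python to check the output by hand. This is a sketch of the semantics only, not how Pig executes the statements; the rows are the 15 tuples printed by dump records.

```python
# The rows printed by `dump records`: (year, temperature, quality).
records = [
    ("1950", 11, 1), ("1950", 12, 2), ("1955", 11, 1), ("1954", 12, 2),
    ("1953", 13, 1), ("1954", 12, 2), ("1953", 11, 1), ("1952", 12, 2),
    ("1951", 13, 1), ("1952", 14, 2), ("1953", 13, 2), ("1954", 12, 1),
    ("1955", 11, 1), ("1956", 12, 2), ("1954", 13, 2),
]

# filter records by temperature > 11 and (quality == 0 or quality == 2)
rec1 = [r for r in records if r[1] > 11 and (r[2] == 0 or r[2] == 2)]

# group rec1 by year: each key maps to a bag holding the full matching tuples
rec2 = {}
for r in rec1:
    rec2.setdefault(r[0], []).append(r)

for year in sorted(rec2):
    print(year, rec2[year])
```

Note that each group keeps the whole tuple, including the year, which is why the year appears twice in every line of dump rec2.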

Reduce

foreach: processes every row of a relation to produce a derived set of rows.
generate: the GENERATE clause defines the fields of each derived row.

grunt> maxt = FOREACH rec2 GENERATE group, MAX(rec1.temperature);
grunt> dump maxt;
(1950,12)
(1952,14)
(1953,13)
(1954,13)
(1956,12)
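
What FOREACH ... GENERATE group, MAX(rec1.temperature) does to each group can be sketched in plain Python. This models the semantics only; the grouped data is copied from the dump rec2 output above.

```python
# Grouped data as printed by `dump rec2`:
# group key -> bag of (year, temperature, quality) tuples.
rec2 = {
    "1950": [("1950", 12, 2)],
    "1952": [("1952", 12, 2), ("1952", 14, 2)],
    "1953": [("1953", 13, 2)],
    "1954": [("1954", 12, 2), ("1954", 12, 2), ("1954", 13, 2)],
    "1956": [("1956", 12, 2)],
}

# rec1.temperature projects the temperature column out of each bag;
# MAX then reduces that column to one value per group.
maxt = [(group, max(t for _, t, _ in bag)) for group, bag in rec2.items()]

for row in maxt:
    print(row)
```

One output row is produced per input row of rec2, so the result has exactly one tuple per year.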

Illustrate

ILLUSTRATE shows where the result comes from by tracing sample rows through each relation.

grunt> ILLUSTRATE maxt;
----------------------------------------------------------------------------
| records     | year:chararray     | temperature:int     | quality:int     | 
----------------------------------------------------------------------------
|             | 1952               | 14                  | 2               | 
|             | 1952               | 12                  | 2               | 
|             | 1953               | 11                  | 1               | 
|             | 1954               | 12                  | 1               | 
|             | 1952               | 12                  | 0               | 
----------------------------------------------------------------------------
-------------------------------------------------------------------------
| rec1     | year:chararray     | temperature:int     | quality:int     | 
-------------------------------------------------------------------------
|          | 1952               | 14                  | 2               | 
|          | 1952               | 12                  | 2               | 
|          | 1952               | 12                  | 0               | 
-------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------
| rec2     | group:chararray     | rec1:bag{:tuple(year:chararray,temperature:int,quality:int)}                     | 
---------------------------------------------------------------------------------------------------------------------
|          | 1952                | {}                                                                               | 
|          | 1952                | {}                                                                               | 
---------------------------------------------------------------------------------------------------------------------
---------------------------------------------
| maxt     | group:chararray     | :int     | 
---------------------------------------------
|          | 1952                | 14       | 
---------------------------------------------

Write your first Pig script

/*
 * test.pig
 */
records = load '/user/a000000000/test.txt' using PigStorage(' ') as (year:chararray, temperature:int, quality:int);
rec1 = filter records by temperature >11 and (quality == 0 OR quality == 2);
rec2 = group rec1 by year;
maxt = foreach rec2 generate group, MAX(rec1.temperature);
dump maxt;

~$ pig test.pig

Comparing with RDBMS/Hive

Name	Description
Database	SQL commands, pre-defined schema
Hadoop Hive	SQL-Like commands on Hadoop HDFS
Pig	Data flow commands and runtime schema on Hadoop HDFS

Practices

Deadline: 4/22, 12:00 PM

Practice1

Test Input
pdata.txt

Write a Pig Latin script to:

Load pdata.txt and collect all the data from years after 1951 (1951 itself is not included).
Filter the data where quality = 1.
Group the data by temperature, then calculate the largest year for each temperature.

Save your script as studentID_practice1.pig in the lab2 directory under your home directory (~/lab2).

Expected output

(0,1991)
(3,1991)
(4,1989)
(5,1957)
(9,1993)
(10,1967)
(11,1961)
(13,1958)
(15,1972)
(16,1989)
(17,1976)
(18,1978)
(19,1967)
(20,1977)
(22,1994)
(23,1973)
(24,1996)
(25,1996)
(26,1983)
(28,1995)
(32,1970)
(34,1954)
(35,1993)
(37,1986)
(38,1985)
(39,1982)
(40,1993)
(42,1958)
(43,1985)
(44,1986)
(45,1987)
(46,1996)
(47,1982)
(48,1978)
(49,1985)
(51,1973)
(54,1989)
(55,1973)
(56,1965)
(58,1966)
(59,1980)
(60,1988)
(62,1998)
(64,1959)
(65,1989)
(66,1965)
(68,1993)
(69,1998)
(70,1989)
(71,1983)
(72,1952)
(73,1975)
(74,1989)
(75,1998)
(77,1987)
(78,1965)
(79,1992)
(81,1976)
(83,1976)
(84,1999)
(85,1959)
(88,1991)
(89,1991)
(90,1980)
(91,1991)
(92,1995)
(94,1995)
(95,1966)
(96,1979)
(97,1965)
(98,1975)

Practice2

Test Input
pbad.txt

Write a Pig Latin script to:

Load pbad.txt and remove bad data.
Split the good data into new data, where quality matches '[0123456789]', and old data, where quality does not match '[0123456789]'.
Convert the temperature from degrees Celsius (°C) to degrees Fahrenheit (°F).
Use dump to show the maximum temperature, including its decimal fraction, for each year in the old data.
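
For reference, the Celsius to Fahrenheit conversion is F = C × 9/5 + 32. The plain-Python sketch below only checks the arithmetic; the function name is made up for illustration and this is not the Pig Latin solution. If your schema loads temperature as an int (an assumption about your script), you will likely need a floating-point cast or floating-point constants in Pig to keep the decimal fraction.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert degrees Celsius to degrees Fahrenheit."""
    # Floating-point constants keep the fractional part of the result.
    return celsius * 9.0 / 5.0 + 32.0

print(celsius_to_fahrenheit(0))    # 32.0
print(celsius_to_fahrenheit(100))  # 212.0
```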

Tip

Save your script as studentID_practice2.pig in the lab2 directory under your home directory (~/lab2).