
Sunday 31 December 2017

BigData - Data Transformations with Apache Pig

1. Pig is an ETL tool designed to work on both structured and unstructured data.
2. Hive is designed for storing and querying structured data, not for data transformations. In very simple terms, Hive is the data warehouse and Pig is the ETL that gets data loaded into Hive.
3. Relations are the basic structures in Pig that hold the data on which transformations are performed. Relations are similar to variables.
4. We use LOAD to load data into Pig relations, STORE to write data to files, and DUMP to display data on screen.
5. The general structure of a Pig program is:
a. Load data into Pig
b. The data is stored in a relation
c. The data in the relation is subjected to transformations and updates
d. Once the transformations and updates are complete, store the final data to a file or display it on screen
6. Pig commands can be executed in interactive mode or batch mode
a. To launch interactive mode, just type "pig"; this launches the Grunt shell. By default Pig works with the HDFS file system. To work with the local file system instead, launch the Grunt shell with "pig -x local".
b. Batch mode is used to run Pig scripts stored in a file. This is done by supplying the file name to the pig command, e.g. "pig file.pig" (the -f option can be added for clarity); see the sketch below.
c. Use "pig --help" for any additional information.
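A minimal batch-mode sketch (the script name, input file and its contents are hypothetical):
-- contents of myscript.pig
lines = load 'input.txt' using PigStorage() as (line: chararray);
dump lines;
-- run it against the local file system:
-- pig -x local myscript.pig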
7. Relations in Pig are immutable. Any update to a relation creates a new relation.
8. Relations live in memory and exist only for the duration of a single Pig session. As soon as the session ends, relations are cleared from memory. For persistence, relations must be stored to a physical file.
9. Relations are not evaluated until they are displayed (DUMP) or written to a file (STORE). This is called lazy evaluation, and it allows Pig to apply optimization techniques.
10. PigStorage() – used for loading data into Pig relations
a. Bulkdeals = load '/home/rajendra/IdeaProjects/BigData-Practice/data/input/01-01-2013-TO-31-12-2013_bulk.csv' using PigStorage('|'); -- the default delimiter PigStorage uses is tab ('\t')
b. The load command does not remove duplicates. It also accepts a directory, in which case it loads all files in the directory; duplicate records are retained.
c. The load in point (a) is done without a schema. We can also load using a schema, as below:
Bulkdeals = load '/home/rajendra/IdeaProjects/BigData-Practice/data/input/01-01-2013-TO-31-12-2013_bulk.csv' using PigStorage('|')
as
(
Tdate: datetime,
Symbol: chararray
);
d. If we use describe, we can see the schema of a relation, as shown below.
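For example, for the relation loaded with the two-field schema in point (c), describe prints the schema roughly as:
describe Bulkdeals;
-- Bulkdeals: {Tdate: datetime,Symbol: chararray}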
11. FOREACH
a. Using foreach, we can loop through a relation and generate specific fields by positional index (indexes start at 0), for example: "tdateandsymbol = foreach Bulkdeals generate $0, $1;"
b. We can also use field names for relations defined with a schema, as in the sketch below.
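A sketch of the same projection using field names, assuming the Tdate/Symbol schema from point 10(c):
tdateandsymbol = foreach Bulkdeals generate Tdate, Symbol;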
12. SPLIT
a. Splits a relation into multiple relations
b. Example: "split orders into orders_1 if (order_id == 1), orders_more if (order_id > 1);"
13. FILTER
a. Filters a relation based on a condition
b. Example: "orders_filter = filter orders by order_id >= 'o4';" (see the combined sketch after point 14)
14. DISTINCT, LIMIT, ORDER BY
a. orders_no_duplicates = distinct orders;
b. orders_3 = limit orders 3;
c. orders_desc = order orders by order_id desc;
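A small combined sketch of points 13 and 14, assuming a hypothetical orders relation with a chararray order_id field:
orders_filter = filter orders by order_id >= 'o4';   -- keep matching rows
orders_no_duplicates = distinct orders_filter;       -- drop duplicate tuples
orders_desc = order orders_no_duplicates by order_id desc;
orders_3 = limit orders_desc 3;                      -- top 3 after sorting
dump orders_3;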
15. The LIMIT function can be used to select a limited number of records, e.g. "bulkdeals5 = limit Bulkdeals 5;" (relation names must start with a letter, so "5bulkdeals" would be an invalid alias).
16. The STORE function can be used as "store bulkdeals5 into 'subdir' using PigStorage();". Note that subdir must not exist before running the store command, as sketched below.
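Note that STORE writes a directory rather than a single file; a sketch:
bulkdeals5 = limit Bulkdeals 5;
store bulkdeals5 into 'subdir' using PigStorage('|');
-- 'subdir' must not already exist; Pig creates it and writes
-- one or more part files (e.g. part-m-00000) inside it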
17. Case sensitivity
a. The following are case sensitive:
   i. Relation names
   ii. Field names within relations
   iii. Function names such as PigStorage(), SUM(), COUNT(), etc.
b. The following are not case sensitive:
   i. Keywords such as load, store, foreach, generate, group by, order by, dump
18. Data Types
a. Scalar or primitive types, which represent a single entity
   i. Boolean
      1. Represents true or false values
   ii. Numeric
      1. int – 4 bytes, -2^31 to 2^31-1
      2. long – 8 bytes
      3. float – 4 bytes
      4. double – 8 bytes
   iii. String
      1. chararray – variable length and unbounded; a size must not be specified, i.e. fieldname: chararray(01) is invalid whereas fieldname: chararray is valid.
   iv. Date/time
      1. datetime – represents a date and time; Pig's datetime is based on Joda-Time, so precision is to the millisecond
   v. Bytes
      1. bytearray – a BLOB of data that can represent anything; when no schema is given, this is the default data type for unknown schema elements.
b. Complex or collection types, which represent a group of entities
   i. Tuple
      1. An ordered collection of fields, where each field has its own primitive data type, e.g. (a,1) or (a,b,01-jan-2018,0,true)
      2. If the data type of a tuple element is not specified, the default bytearray type is assumed
      3. A tuple can be thought of as a row in a traditional relational database
      4. TOTUPLE() can be used to generate a tuple from individual fields
      5. We can define a tuple field as "fieldname: tuple(e1:chararray, e2:int, e3:boolean)"
      6. Individual fields in a tuple can be accessed using . (dot) notation. For example, if the 3rd field of a relation is a tuple, we can access its fields as $2.$0, $2.$1, etc.
   ii. Bag
      1. An unordered collection of tuples
      2. Similar to a set in Java or Python, except that it allows duplicates
      3. Enclosed by {}
      4. A relation is nothing but a collection of tuples; a relation is an outer bag, while a bag nested inside a field is an inner bag
      5. TOBAG() converts fields to a bag structure
   iii. Map
      1. Enclosed by []
      2. Key-value pairs; the key must always be a chararray, but the value can be any type
      3. # is the delimiter, for example [name#jon, job#engineer]; if the map is the 4th field, a value is accessed as $3#'name'
      4. TOMAP() converts fields to a map structure
(A combined sketch of the three constructors follows below.)
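A combined sketch of TOTUPLE(), TOBAG() and TOMAP() (the employees relation and its fields are hypothetical):
person_rec = foreach employees generate
    TOTUPLE(name, age)    as person,   -- ordered fields
    TOBAG(skill1, skill2) as skills,   -- bag of single-field tuples
    TOMAP('name', name)   as props;    -- chararray key, any-type value
-- access examples: person.$0, props#'name'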
19. Partial schema specification and casting
a. While loading data, we can specify all field names along with their data types; this is called full schema specification.
b. We can also specify just the field names without data types; this is called partial schema specification.
c. If we specify neither field names nor data types, there is no schema.
d. Pig works in all three cases.
e. We can also cast fields to different data types using cast operators, i.e. (int)$3 converts the 4th field to int (see the sketch below).
f. The bytearray type can be cast to any other data type, but the other data types have limitations in terms of implicit conversion and explicit casting.
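A casting sketch, assuming Bulkdeals was loaded without a schema (so every field is a bytearray) and has at least four fields:
typed = foreach Bulkdeals generate (chararray)$1 as symbol, (int)$3 as qty;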
20. Pig Functions
a. Built-in functions and UDFs (User Defined Functions)
   i. PigStorage(), TOMAP(), TOBAG(), TOTUPLE(), etc. are built-in functions; UDFs are functions we write ourselves when the built-ins are not enough.
   ii. Built-in functions can be categorized into 4 groups based on what they do:
      1. Load – loading data into Pig
         a. PigStorage()
         b. HBaseStorage()
         c. JsonLoader()
         d. AvroStorage()
         e. CSVExcelStorage()
      2. Store – storing data to a file
         a. Largely the same functions as load
      3. Evaluate – transformations on records or fields (see the sketch below)
         a. Math functions, e.g. ROUND(double), which returns a long
         b. String functions, e.g. SUBSTRING(string, startIndex, stopIndex) and REPLACE(string, existing, new)
         c. Date functions, e.g. ToDate(string, 'dateformat') and GetMonth(datetime)
         d. Complex type functions: TOMAP(), TOTUPLE(), TOBAG()
         e. Aggregate functions such as SUM() and COUNT()
      4. Filter – to filter individual records
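A sketch combining a few evaluate functions (the trades relation and its fields are hypothetical):
cleaned = foreach trades generate
    SUBSTRING(symbol, 0, 3)               as sym3,          -- characters 0..2
    REPLACE(series, 'EQ', 'EQUITY')       as series_full,
    ROUND(price)                          as price_rounded, -- double -> long
    GetMonth(ToDate(tdate, 'dd-MM-yyyy')) as trade_month;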
21. GROUP BY
a. orders_grp = group orders by item;
b. orders_cnt = foreach orders_grp generate group, SUM(orders.quantity);
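It helps to see the shape of a grouped relation: each record is (group, bag of matching tuples). Assuming orders has item:chararray and quantity:int fields, describe shows roughly:
describe orders_grp;
-- orders_grp: {group: chararray,orders: {(item: chararray,quantity: int)}}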
22. JOIN
a. LEFT OUTER JOIN
   i. names_trades_lo = join names by symbol left outer, trades by symbol;
b. RIGHT OUTER JOIN
c. FULL OUTER JOIN
d. INNER JOIN (the default)
   i. names_trades_jn = join names by symbol, trades by symbol;
e. CROSS JOIN
   i. A Cartesian join
   ii. names_trades_cross = cross names, trades;
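After a join, duplicate field names are disambiguated with the relation name and ::, for example (the quantity field is hypothetical):
sym_and_qty = foreach names_trades_jn generate names::symbol, trades::quantity;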
23. UNION
a. Both relations should have the same number of fields and compatible schemas
b. UNION does not preserve the order of tuples
c. Duplicates are preserved
d. all_names = union names, other_names;
24. UNION when schemas are mismatched
a. When the two relations have different numbers of fields, the result has a null schema
b. If both relations have the same number of fields but the data types do not match, Pig tries to find common ground by casting to the higher type
c. If fields are complex types with incompatible inner fields, Pig does not know what to do, so the result again has a null schema
d. UNION ONSCHEMA can be used to union relations with mismatched schemas
e. all_names_2 = union onschema names, other_names;
25. FLATTEN
a. Can be used to flatten a bag (or tuple) field into top-level primitive fields
b. For a bag, it creates a separate record for each tuple in the bag (see the sketch below)
c. flatten_activities = foreach student_activity_bag generate name, flatten(activities) as activity;
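A before/after sketch of the flatten in point (c): one input row per student with a bag of activities becomes one output row per (student, activity) pair:
-- input : (jon, {(cricket),(chess)})
-- output: (jon, cricket)
--         (jon, chess)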
26. Nested foreach
a. Example: "collision_stats = foreach collision_grp { totals = collisions.total; generate group, SUM(totals); };" (assuming collision_grp is the result of grouping a collisions relation)
b. Within the {}, we can use several intermediate operations such as distinct, filter, order by and limit to build intermediate relations; at the end, a single generate produces the required fields (see the fuller sketch below).
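A fuller sketch, assuming a hypothetical collisions relation with borough:chararray and total:int fields:
collision_grp = group collisions by borough;
collision_stats = foreach collision_grp {
    uniq = distinct collisions.total;   -- intermediate relation inside {}
    generate group as borough,
             COUNT(uniq)           as distinct_totals,
             SUM(collisions.total) as sum_total;
};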

