1. Pig is used for ETL operations and is designed to work on both structured and unstructured data.
2. Hive is designed for querying structured data storage but is not designed for data transformations. In very simple terms, Hive is a data warehouse and Pig is the ETL tool used to get data loaded into Hive.
3. Relations are the basic structures in Pig that hold the data on which transformations are performed. Relations are similar to variables.
4. We use LOAD to load data into Pig relations, STORE to write data to files, and DUMP to display data on screen.
5. The general structure of a Pig program is (see the sketch after this list):
a. Load data into Pig
b. Data is stored in a relation
c. Data in the relation is subjected to transformations and updates
d. Once transformations and updates are complete, store the final data to a file or display it on screen
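A minimal sketch of this flow, assuming a hypothetical comma-delimited input file and field names:
orders = load '/data/input/orders.csv' using PigStorage(',') as (order_id:int, item:chararray, quantity:int);  -- load into a relation
big_orders = filter orders by quantity > 10;                     -- transform the relation
store big_orders into '/data/output/big_orders' using PigStorage(',');  -- persist to a file
dump big_orders;                                                 -- or display on screen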
6. Pig commands can be executed in interactive mode or batch mode.
a. To launch interactive mode, we just type "pig", which launches the Grunt shell. By default Pig works with the HDFS file system. If we want to work with the local file system, we launch the Grunt shell using "pig -x local".
b. Batch mode is used to run Pig scripts stored in a file. This is done by supplying the file name to the pig command, e.g. "pig file.pig" (or the -f option can be added for more clarity).
c. Use "pig --help" for any additional information.
7. Relations in Pig are immutable. Any update to a relation creates a new relation.
8. Relations are stored in memory and exist only during a single Pig session. As soon as the session completes, relations are cleared from memory. To persist a relation, we must store it to a physical file.
9. Relations are not evaluated until they are displayed (DUMP) or written to a file (STORE). This is called lazy evaluation, and it allows Pig to apply optimization techniques.
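A small illustration of lazy evaluation, with a hypothetical file and schema:
a = load 'in.csv' using PigStorage(',') as (qty:int);  -- nothing executes yet
b = filter a by qty > 0;                               -- still nothing executes
dump b;                                                -- the whole pipeline runs only now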
10. PigStorage() – used for loading data into Pig relations
a. Bulkdeals = load '/home/rajendra/IdeaProjects/BigData-Practice/data/input/01-01-2013-TO-31-12-2013_bulk.csv' using PigStorage('|');  -- the default delimiter PigStorage() uses is tab ('\t')
b. The LOAD command does not remove duplicates. It also accepts directories and loads all the files in the directory; duplicate records are retained.
c. The load in point (a) above is done without a schema. We can also load using a schema, as below:
load '/home/rajendra/IdeaProjects/BigData-Practice/data/input/01-01-2013-TO-31-12-2013_bulk.csv' using PigStorage('|')
as
(
Tdate: datetime,
Symbol: chararray,
…
)
d. If we use DESCRIBE, we can see the schema of a relation.
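For example, assuming the schema from (c) was assigned to Bulkdeals, DESCRIBE would print something like the following (output format approximate):
describe Bulkdeals;
Bulkdeals: {Tdate: datetime, Symbol: chararray, …}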
11. FOREACH
a. Using foreach, we can loop through a relation and generate specific elements using positional indexes (the index starts from 0), for example "tdateandsymbol = foreach Bulkdeals generate $0, $1;"
b. We can also use field names for relations with a defined schema (see the sketch below).
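For instance, assuming Bulkdeals carries the schema from point 10(c), the same statement with field names would be:
tdateandsymbol = foreach Bulkdeals generate Tdate, Symbol;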
12. SPLIT
a. Splits a relation into multiple relations
b. Example: "split orders into orders_1 if (order == 1), order_more if (order > 1);"
13. FILTER
a. Filters a relation based on a condition
b. "orders_filter = filter orders by order >= 'o4';"
14. DISTINCT, LIMIT, ORDER BY
a. orders_no_duplicates = distinct orders;
b. orders_3 = limit orders 3;
c. orders_desc = order orders by order_id desc;
15. The LIMIT function can be used to select a limited number of records, for example "bulkdeals_5 = limit Bulkdeals 5;" (note that a relation name must start with a letter, so "5bulkdeals" would be invalid).
16. The STORE function can be used as "store bulkdeals_5 into 'subdir' using PigStorage();"… note that subdir should not exist before running the store command.
17. Case sensitivity
a. The following are case sensitive:
i. Relation names
ii. Field names within relations
iii. Function names such as PigStorage(), SUM(), COUNT(), etc.
b. The following are not case sensitive:
i. Keywords such as load, store, foreach, generate, group by, order by, dump
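A quick illustration, using the Bulkdeals relation defined in point 10:
DUMP Bulkdeals;   -- works: keywords are case-insensitive
dump bulkdeals;   -- fails: relation names are case sensitive, and no relation "bulkdeals" exists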
18. Data Types
a. Scalar or primitive types that represent a single entity
i. Boolean
1. boolean – represents true or false values
ii. Numeric
1. int – 4 bytes, range -2^31 to 2^31 - 1
2. long – 8 bytes
3. float – 4 bytes
4. double – 8 bytes
iii. String
1. chararray – variable length and unbounded; size is not to be specified, i.e. "fieldname: chararray(01)" is invalid whereas "fieldname: chararray" is valid
iv. Date/time
1. datetime – represents date and time up to nanoseconds
v. Bytes
1. bytearray – a BLOB of data that can represent anything; when no schema is given, this is the default data type for unknown schema elements
b. Complex or collection types that represent a group of entities
i. Tuple
1. An ordered collection of fields where each field has its own primitive data type, e.g. (a,1) or (a,b,01-jan-2018,0,true)
2. In a tuple, if the data type is not mentioned for an element, the default bytearray data type is assumed
3. A tuple can be considered a row in a traditional relational database
4. TOTUPLE() can be used to generate a tuple from individual fields
5. We can define a tuple field as "fieldname: tuple(e1:chararray, e2:int, e3:boolean)"
6. Individual fields in a tuple can be accessed using . (dot) notation. For example, if the 3rd field in a relation is a tuple, we can access its fields as $2.$0, $2.$1, etc.
ii. Bag
1. An unordered collection of tuples
2. Similar to a set in Java or Python, but it allows duplicates
3. Bags are enclosed by {}
4. A relation is nothing but a collection of tuples; this is called an outer bag, while a bag nested inside a tuple is called an inner bag
5. TOBAG() converts fields to a bag structure
iii. Map
1. Maps are enclosed by []
2. Key-value pairs; the key should always be a chararray, but the value can be of any type
3. # is the delimiter, for example [name#jon, job#engineer]; if the map is the 4th field, a value is accessed as $3#'name'
4. TOMAP() converts fields to a map structure
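A short sketch of the complex-type constructors, using hypothetical relation and field names:
pairs = foreach deals generate TOTUPLE(symbol, price) as sp;     -- tuple, e.g. (INFY,1200.5)
syms  = foreach deals generate TOBAG(symbol, alt_symbol) as sb;  -- bag, e.g. {(INFY),(INFOSYS)}
info  = foreach deals generate TOMAP('symbol', symbol) as sm;    -- map, e.g. [symbol#INFY]
px    = foreach pairs generate sp.$1;                            -- dot notation into the tuple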
19. Partial schema specification and casting
a. While loading data, we can specify all field names along with their data types; this is called full schema specification
b. We can specify just the field names but not the data types; this is called partial schema specification
c. If we specify neither field names nor data types, there is no schema
d. Pig works in all three cases
e. We can also cast fields to different data types using cast operators, i.e. (int)$3 converts the 4th field to int
f. bytearray can be cast to any other data type, but the other data types have limitations in terms of implicit conversion and explicit casting
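A minimal sketch of loading with no schema and then casting, assuming a hypothetical pipe-delimited file:
raw = load '/data/input/deals.csv' using PigStorage('|');  -- every field defaults to bytearray
qty = foreach raw generate (int)$3;                        -- cast the 4th field to int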
20. Pig Functions
a. UDF – User Defined Functions
i. PigStorage(), TOMAP(), TOBAG(), TOTUPLE(), etc. are built-in UDFs
ii. Built-in functions can be categorized into 4 groups based on the functions they perform (a sketch of a few eval functions follows this list)
1. Load – loading data into Pig
a. PigStorage()
b. HBaseStorage()
c. JsonLoader()
d. AvroStorage()
e. CSVExcelStorage()
2. Store – storing data to a file
a. The same functions as Load
3. Evaluate – transformations on records or fields
a. Math functions
i. e.g. ROUND(), which rounds to a whole number (ROUND(double) returns long)
b. String functions
i. SUBSTRING(string, startpos, stoppos)
ii. REPLACE(string, existing, new)
c. Date functions
i. ToDate(date, 'dateformat')
ii. GetMonth(date)
d. Complex type functions
i. TOMAP(), TOTUPLE(), TOBAG()
e. Aggregate functions such as SUM(), COUNT()
4. Filter – to filter individual records
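A hedged sketch of a few eval functions, with hypothetical relation and field names:
x = foreach deals generate ROUND(price), SUBSTRING(symbol, 0, 3), REPLACE(symbol, 'LTD', 'LIMITED');
y = foreach deals generate GetMonth(ToDate(tdate_str, 'dd-MM-yyyy'));  -- parse a chararray into a datetime, then extract the month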
21. GROUP BY
a. orders_grp = group orders by item;
b. orders_cnt = foreach orders_grp generate group, SUM(orders.quantity);
22. JOIN
a. LEFT OUTER JOIN
i. names_trades_lo = join names by symbol left outer, trades by symbol;
b. RIGHT OUTER JOIN (see the sketch after this list)
c. FULL OUTER JOIN (see the sketch after this list)
d. SELF JOIN or JOIN (inner join)
i. names_trades_jn = join names by symbol, trades by symbol;
e. CROSS JOIN
i. A Cartesian join
ii. names_trades_cross = cross names, trades;
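The right and full outer variants referenced in (b) and (c) follow the same pattern as the left outer join, for example:
names_trades_ro = join names by symbol right outer, trades by symbol;
names_trades_fo = join names by symbol full outer, trades by symbol;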
23. UNION
a. Both relations should have the same number of fields and compatible schemas
b. UNION does not preserve the order of tuples
c. It preserves duplicates
d. all_names = union names, other_names;
24. UNION when schemas are mismatched
a. When the two relations have different numbers of fields, the resulting schema will be null
b. If both relations have the same number of fields but the data types do not match, Pig will try to find common ground by casting to the higher type
c. If fields are of complex types with incompatible inner fields, Pig does not know what to do, so the result will be null
d. UNION ONSCHEMA can be used to union relations with mismatched schemas
e. all_names_2 = union onschema names, other_names;
25. FLATTEN
a. Can be used to flatten a bag-type field into rows of primitive-type fields
b. It creates a separate record for each value in the complex field
c. flatten_activities = foreach student_activity_bag generate name, flatten(activities) as activity;
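For example, a hypothetical record (jon, {(cricket),(chess)}) in student_activity_bag would flatten into two records: (jon, cricket) and (jon, chess).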
26. Nested FOREACH
a. Example, assuming collision_grp is the result of grouping a collision_data relation: collision_stats_nstd_foreach = foreach collision_grp { totals = collision_data.total; generate group, SUM(totals); };
b. Within {}, we can use several intermediate operations such as distinct, filter, order by, and limit to build intermediate relations; at the end, using generate, we produce the required fields (see the sketch after this list).
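A fuller hedged sketch, with hypothetical relation and field names:
collision_grp = group collision_data by borough;
collision_stats = foreach collision_grp {
    recent = order collision_data by year desc;  -- intermediate operation inside {}
    top3 = limit recent 3;                       -- another intermediate relation
    generate group, SUM(collision_data.total) as grand_total, flatten(top3);
};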