Search This Blog

Monday, 25 December 2017

BigData - Analyzing BigData with Hive

1.       Hive provides structure to unstructured data by providing Schema
2.       Hive architecture consists relational metastore, this is hive stores metadata like table definitions, location of underlying data, data type information, how tables partitioned etc.. this metastore is relational database. By default Hive Installs light weight Derby database, however Derby database has many limitations especially multi user access therefore typically Hive metastore is installed in MySQL.
3.       HiveQL Drive is responsible for compiling, optimizing and translating HiveQL into 0 or many MapReduce jobs
4.       Web UI or HDInsight provide a webinterface where we can interact with hive using HiveSQL
5.       Hive manages 2 types of data
a.       Internal
                                                               i.      Hive warehouse (not to be confused with data warehouse term) consists of metastore (all meta data about objects) and data
                                                             ii.      Hive owns and manages the data stored in the Hive Warehouse.. i.e. if table is dropped then hive drops all metadata as well as actual data.
b.       External
                                                               i.      Data can be stored anywhere Hive knows how to communicated with. If data stored outside Hive warehouse, its called external data. Hive still manages metadata even for external data.
                                                             ii.      Hive can create tables for data stored externally. Hive does not own the data stored externally i.e. if table is dropped, only meta data is dropped not the actual data
                                                           iii.      User has flexibility of managing and storing data as fit
6.       Hive follows SQL syntax similar to MySQL
7.       From Hive interactive CLI, we can run linux shell commands by prefixing !.. for example if we want to see directory listing in HDFS from Hive CLI.. we can run “!hdfs dfs -ls /dirname”
8.       One of subtle difference between HiveSQ and SQL is that, HiveSQ uses Jave expressions. For example if want to select all columns except ID and NAME, we can write select ‘(ID|NAME)?+.+’ from tablename
9.       Another difference between HiveSQ and SQL is  that interchangeable constructs.. i.e. we can write “from tablename select fields where condition” instead of “select fields from tablename where condition”
10.   Hive is not case sensitive and hive expects explicit termination of sentence using “;” just like Java.
11.   Hive supports subqueries only in the FROM list.. but does not support subqueries in SELECT Column list
12.   When we create a database in Hive, its creates database.db file in /hive/warehouse directory. This default directory name can be changed while creating databse by supplying LOCATION clause value
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
 [COMMENT comments]
[LOCATION  path]
[WITH DBPROPERTIES (propname=value, propname=value ….)];

To switch current context and default to database name..
USE db_name
To drop database
DROP [DATABASE|SCHEMA] [IF EXISTS] db_name

Creating a table
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name datatype [COMMENT comments])]
[COMMENT comments]
[PARTITIONED BY (col_name datatype [COMMENT comments])]
[ROW FORMAT rowformat] [STORED as file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (propname=value, propname=value ….)];

13.   LOAD DATA INPATH ‘hdfs file path’ [OVERWRITE] INTO TABLE ‘tablename’ is to load a table from hdfs file.. after this command execution, the file gets deleted form source hdfs location. This is because as the source file is already on HDFS, hive just moves data and organizes instead of making copies of data
14.   LOAD DATA LOCAL INPATH ‘hdfs file path’ INTO TABLE ‘tablename’ is to load a table from local file.
15.   If we are loading a table using HDFS files then its better to create external tables keeping data in same location
16.   We can also use CREATE AS <table_name> SELECT <fields> FROM <table> to create a table and user CREATE TABLE <tablename> LIKE <tablename> to copy table structure;
17.   Hive data types
a.       Hive support basic data types and complex data types
                                                               i.      Basic datatypes
1.       Numeric
a.       TINYINT, SMALLINT, INT, BIGINT
b.       FLOAT
c.       DOUBLE
d.       DECIMAL
2.       Dates/Time
a.       DATE
b.       TIMESTAMP
3.       Others
a.       BOOLEAN
b.       STRING
c.       BINARY (for bigger like blob)
                                                             ii.      Complex types
1.       ARRAY
2.       MAP (Key value pair .. dictionay)
a.       Key must be one of primitive data type
b.       Value can be any data type including complex data type.. example nameToID=map('A',1,'B',2); and access it as nameToID['A'] 
3.       STRUCT
a.       Data structures type.. this is similar to creating our own structure or class in programming language
b.       STRUCT<fild1:type,field2:type…>
c.       We will use . notation to access STRUCT element .. like structname.field2
4.       UNIONTYPE
a.       Hive assumes one of the data types provided in the UNIONTYPE.. for example if we define a column   MISC UNIONTYPE<INT, STRING, ARRAY<double>>… MISC column can have any of 3 data types defined using UNIONTYPE and based on content hive assumes one of the data type from list
18.   Hive provides implicit data type conversion where small ones coverts automatically to counterpart higher ones example SMALLINT to INT is automatic. We can also explicitly convert using cast example CAST (value as datatype). CAST can be cascaded if needed.
19.   For Managed tables, we can partition tables in Hive for improving performance. The partitioning concept is same as other relational database.
a.       The partitions are created based on the columns.
b.       We have to specify PARTITIONED BY <column names> at the time of table creation
c.       Also we have to specify partition name at the time of LOAD DATA otherwise Hive will throw exception
d.       There will be separate folders created in datastore in HDFS for each partition
e.       Partition columns are virtual columns , they are not part of actual table data columns but they are represented as actual columns therefore we can perform all select operations on virtual columns as we do on actual columns.
f.        If table data column present in partition column then hive will throw exception therefore if we need a partition columns in data columns then we should rename data columns to something else so that partitions can be queried with virtual column name.
20.   For external tables in order to partition, the LOCATION is optional at CREATE statement. CREATE statement will only have PARTITIONED BY but the location is added via ALTER TABLE … ADD PARTITION statement
21.   If we added data in hive partitioned structure in HDFS file system but Hive not aware of this, we can repair table to update all new partitions meta data using “msck repair table table_name”;
22.    Hive allows Multiple Inserts using below syntax
FROM <from statement>
INSERT INTO TABLE …….
INSERT OVERWRITE TABLE……
INSERT OVERWRITE DIRECTORY ………….
Example
FROM source table
INSERT INTO TABLE targetable SELECT columns WHERE conditions (Note here the FROM block moved top therefore SELECT does not hve FROM block, Hive allows block interchange) (if we are inserting into partition then PARTITIONED BY should appear before SELECT)
1.       The advantage of multiple inserts is that Hive will have to scan the source table only once and perform all insert operations multiple types
23.   We can create dynamic partitions by supllying the column values from source table for example PARTITION BY (p1=”01-Jan_2013”, p2, p3)  .. here static partition column is p1 and p2, p3 are dynamic partiotion columns… all dynamic partition columns should appear after all static partition columns and at the end in select column list
24.   Maximum allowed dynamic partitions are 1000 by default, this can be changed by setting value to hive.exec.max.dynamic.partitions parameter.
25.   We can select a table and do GROUP BY.
a.       GROUPING SETS
                                                               i.      Grouping sets defines the groups.. for example
Select a,b,sum(c) from table group by a,b grouping sets((a,b),a) is equivalent to
Select a,b,sum(c) from table group by a,b
Union all
Select a,null,sum(c) from table group by a;
b.       WITH CUBE
                                                               i.      CUBE is nothing all possible group sets for the group by columns.. i.e. group by a,b,c WITH CUBE is equal to group by a, group by b, group by c, group by( a,b), group by( a,c), group by( c,b), group by( a,b,c) , group by( )
c.       WITH ROLLUP
                                                               i.      Its similar to GROUPING SETS but only applies to fields hierarchy. Top level in hierarchy is the first element provided in group by clause and second element in second level in hierarchy etc.. i.e. GROUP BY a,b,c WITH ROLLUP is equal to GROUP BY (a,b,c), GROUP BY (a,b) , GROUP BY a,  GROUP  BY ()
26.   Hive Build-In functions
a.       Mathemetical
                                                               i.      Rand() – generate random numbers
                                                             ii.      Pow(a,b) – ab
                                                           iii.      Abs(doubale a)
                                                           iv.      Round(double a, int d)
                                                             v.      Floor(double a)
                                                           vi.      ….
b.       Collection
                                                               i.      Size, map_keys(map<k.v>), array_contains(array,’test’)
c.       Type Conversion
d.       Date
                                                               i.      Unix_timestamp(), data_add(), to_date(), etc..
e.       Conditional
                                                               i.      If (cond, true, false)
                                                             ii.      Coalesce() --- null functions like nvl
                                                           iii.      CASE WHEN THEN END
f.        String
                                                               i.      Concat(), contat_ws() i.e. with separator, refex_replace(), substr()
g.       Misc.
                                                               i.       
h.       xPath
27.   UDAFs (user defined aggregate functions)
a.       Count, sum, avg, min, max, variance, stddev_pop, histogram_numeric(col,b)
28.   UDTFs (User defined table functions)
a.       Explode(), flattern() etc
29.   UDF – User defined functions
a.       We can create our own functions  by extending UDF baseclass and overwriting EVALUATE mentod.
b.       We can then create a JAR and add this jar to hive environment using ADD JAR command and also use DESCRIBE FUNCTION command to see the function specs described in @DESCRIPTION annotation given at the time of function definition
c.       We can create a temporary function as an alias to original function using “CREATE TEMPORARY FUNCTION fname as ‘jarname’;
d.       This temporary function is only available in the session.. but if  we want to make it available to all users, we need to create ../bin/.hiverc file or initialization file and supply initialization when starting hive session with -i option
30.   Sorting and controlling data flow
a.       In select query, we can user ORDER BY Clause to order the results. ORDER BY Clause trigger only one Reducer and sorts globally.
b.       If we want to run multiple Reducers parallelly to improve the performance and utilize distributed computing power, ORDER BY Clause does not work as it triggers only one Reducer
c.       We need to use SORT BY to trigger multiple Reducers. But the sorting is done per Reducer i.e. we will never see globally sorted list. Each reducer creates its own output part i.e. part-000000, part-000001 etc.. and each output is sorted.
d.       In order to control what data the each reducer process in SORT BY, we can use DISTRIBUTED BY.
e.       For example if we are interested to see each user login time ordered by time then we can write query  like SELECT userid, timestamp FROM table  DISTRIBUTED BY userid SORT BY time;
f.        If same columns is used in both DISTRIBUTED BY and SORT BY then we can use short notation CLUSTER BY
31.   Command line interface and variable substitution
a.       Hive -e ‘sql’ – To run HiveQL via command line in batch mode
b.       Hive -S -e ‘sql’ – To run HiveQL via command line in silent mode and in batch mode
c.       Hive -f filenameofHiveQL – To run HiveSQL that is stored in a file
d.       For variable substitution hive provide 4 namespaces. Each name space variable is designated for special purpose
                                                               i.      Hivevar  -- is place to define user defined variables  ex; set hivevar:variablename=value or hive -d varname=value then use it as ${hivevar:varname}
                                                             ii.      Hiveconf – is where hive stored most of configuration ex; set hiveconf:propertyname=value
                                                           iii.      System – for more Hadoop related properties ex; set hivevar:propertyname=value
                                                           iv.      Env – for environment variables ex; set hivevar:propertyname=value
32.   Bucketing
a.       Bucking is similar to partitioning. It can further split partitions. Bucket can be applied at table level or partition level.
b.       Bucking is an approach to cluster or distribute data for better performance at Map-side joins and more efficient sampling
c.       Buckets can also be sorted
d.       Example CREATE TABLE table (field1 type, field2 type….) CLUSTERED BY (field1) INTO 256 BUCKETS… in this case Hive runs HASH function against column field1 and split that into 256 buckets
e.       Another example  CREATE TABLE table (field1 type, field2 type….) PARTITIONED BY (field3 type) CLUSTERED BY (field1) SORTED BY (fields2) INTO 64 BUCKETS.. this bucketing on top of partitions and sorted within buckets. SORTED enables further optimization when using JOINs.. this will impact performance when saving the data into table but this still ok as we would write once but read several times.. therefore writing slowly is ok as long as read is fast.
f.        Hive does not control or enforce bucking during the data load. Its user responsibility enforcing bucking when loading data. We can use one of following 2 approaches
Set mapred.reduce.tasks=64
INSERT OVERWRITE INTO TABLE table SELECT columns TABLE table CLUSTER BY col;
Or
Set hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE table SELECT columns FROM table;
33.   Data Sampling
a.       Data Sampling is concerned with working with subset of data from very large data set
b.       Data Sampling can be applied to any table whether bucketed or not… however bucketed table improves performance
c.       Syntax is using TABLESAMPLE keyword example SELECT * FROM sourcetable TABLESAMPLE(BUCKET x OUT OF y [ON name])… even though we use BUCKET key word , it does not mean we have to use only bucketed tables
d.       We can also sampling data using block percentage i.e. SELECT * FROM sourcetable TABLESAMPLE(n PERCENT); but this approach does not work always
34.   JOINs
a.       JOIN (inner join) --- returns only matched from both sides
b.       LEFT, RIGHT, FULL [OUTER] JOIN
c.       LEFT SEMI JOIN
                                                               i.      IN or EXISTS do not work in Hive.. we need to use Semi Join.. for example SELECT col1 FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t1.col1=t2.col2) do not work.. instead we need to write SELECT col1 FROM t1 LEFT SEMI JOIN t2 ON t1.col1=t2.col1
d.       CROSS JOIN same as Cartesian join.
e.       Hive does not support >,<,=>,=< etc in join conditions.. it only support = in join conditions
f.        STREAMABLE – hive always assumes the small table comes first in the multi join.. i.e. in join a,b,c tables hive assumes that a<b<c .. this can be changed using STREAMABLE key word in select col list
g.       Other ways of optimizing joins are Map Join, Map-side join (works on bucketed tables), distributed cache
35.   Windowing and analytical functions
a.       LEAD/LAG
b.       FIRST_VALUE
c.       LAST_VALUE
d.       PARTITION BY
e.       OVER Clause
f.        WINDOW
g.       RANK, ROW_NUMBER, DENSE_RANK etc

36.   Hive has TRANSFORM clause, this allow hive data to be sent to STDIN and process this STDIN using external programs like shell script or python or java etc..  example SELECT TRANSFORM (f1,f2) USING ‘./file.sh’ AS (f1,f2) FROM table;  notice here that file is in same directory.. to have file available to hive, we need to add the file to distributed cache using “ADD FILE /path/file.sh”

No comments:

Post a Comment