1. Hive provides structure to unstructured data by providing a schema.
2. The Hive architecture includes a relational metastore; this is where Hive stores metadata such as table definitions, the location of the underlying data, data type information, how tables are partitioned, etc. The metastore is a relational database. By default Hive installs the lightweight Derby database; however, Derby has many limitations, especially around multi-user access, so the Hive metastore is typically installed in MySQL.
3. The HiveQL Driver is responsible for compiling, optimizing, and translating HiveQL into zero or more MapReduce jobs.
4. The Web UI or HDInsight provides a web interface where we can interact with Hive using HiveQL.
5. Hive manages 2 types of data
a. Internal
i. The Hive warehouse (not to be confused with the data warehouse term) consists of the metastore (all metadata about objects) and the data itself.
ii. Hive owns and manages the data stored in the Hive warehouse, i.e. if a table is dropped then Hive drops all metadata as well as the actual data.
b. External
i. Data can be stored anywhere Hive knows how to communicate with. If data is stored outside the Hive warehouse, it is called external data. Hive still manages metadata even for external data.
ii. Hive can create tables for data stored externally. Hive does not own externally stored data, i.e. if the table is dropped, only the metadata is dropped, not the actual data.
iii. The user has the flexibility of managing and storing data as they see fit.
6. Hive follows SQL syntax similar to MySQL's.
7. From the Hive interactive CLI, we can run Linux shell commands by prefixing them with "!". For example, to see a directory listing in HDFS from the Hive CLI, we can run "!hdfs dfs -ls /dirname".
8. One subtle difference between HiveQL and SQL is that HiveQL supports Java regular expressions in the column list. For example, if we want to select all columns except ID and NAME, we can write SELECT `(ID|NAME)?+.+` FROM tablename
9. Another difference between HiveQL and SQL is interchangeable constructs, i.e. we can write "FROM tablename SELECT fields WHERE condition" instead of "SELECT fields FROM tablename WHERE condition".
10. Hive is not case sensitive, and Hive expects explicit termination of each statement using ";", just like Java.
11. Hive supports subqueries only in the FROM clause; it does not support subqueries in the SELECT column list.
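For instance, a minimal sketch of a FROM-clause subquery (table and column names here are hypothetical):
    SELECT t.userid, t.total
    FROM (SELECT userid, SUM(amount) AS total
          FROM orders
          GROUP BY userid) t
    WHERE t.total > 100;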
12. When we create a database in Hive, it creates a database_name.db directory under the warehouse directory (/user/hive/warehouse by default). This default location can be changed while creating the database by supplying a LOCATION clause value.
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT comments]
[LOCATION path]
[WITH DBPROPERTIES (propname=value, propname=value ...)];
To switch the current context and default to a database name:
USE db_name;
To drop a database:
DROP (DATABASE|SCHEMA) [IF EXISTS] db_name;
Creating a table:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name datatype [COMMENT comments])]
[COMMENT comments]
[PARTITIONED BY (col_name datatype [COMMENT comments])]
[ROW FORMAT rowformat] [STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (propname=value, propname=value ...)];
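A minimal worked example (the database name, path, and property are hypothetical):
    CREATE DATABASE IF NOT EXISTS salesdb
    COMMENT 'Sales reporting data'
    LOCATION '/data/salesdb'
    WITH DBPROPERTIES ('owner'='etl');
    USE salesdb;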
13. LOAD DATA INPATH 'hdfs file path' [OVERWRITE] INTO TABLE tablename loads a table from an HDFS file. After this command executes, the file is deleted from the source HDFS location. This is because the source file is already on HDFS, so Hive just moves and organizes the data instead of making copies of it.
14. LOAD DATA LOCAL INPATH 'local file path' INTO TABLE tablename loads a table from a local file.
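A minimal sketch of both variants (paths and table name are hypothetical):
    -- Moves the file from its HDFS location into the table's directory
    LOAD DATA INPATH '/staging/users.csv' OVERWRITE INTO TABLE users;
    -- Copies the file from the local filesystem; the source is left intact
    LOAD DATA LOCAL INPATH '/tmp/users.csv' INTO TABLE users;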
15. If we are loading a table from HDFS files, it is better to create external tables, keeping the data in the same location.
16. We can also use CREATE TABLE <table_name> AS SELECT <fields> FROM <table> to create a table, and use CREATE TABLE <tablename> LIKE <tablename> to copy a table structure.
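A minimal sketch of both forms (table names are hypothetical):
    -- CTAS: new table gets both the structure and the data from the query
    CREATE TABLE active_users AS SELECT userid, name FROM users WHERE active = true;
    -- LIKE: copies only the structure, no data
    CREATE TABLE users_archive LIKE users;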
17. Hive data types
a. Hive supports basic data types and complex data types
i. Basic data types
1. Numeric
a. TINYINT, SMALLINT, INT, BIGINT
b. FLOAT
c. DOUBLE
d. DECIMAL
2. Dates/Time
a. DATE
b. TIMESTAMP
3. Others
a. BOOLEAN
b. STRING
c. BINARY (for larger binary content, like a BLOB)
ii. Complex types
1. ARRAY
2. MAP (key-value pairs, i.e. a dictionary)
a. The key must be a primitive data type
b. The value can be any data type, including complex data types. Example: nameToID=map('A',1,'B',2); access it as nameToID['A']
3. STRUCT
a. A data structure type; this is similar to creating our own structure or class in a programming language
b. STRUCT<field1:type, field2:type, ...>
c. We use dot notation to access a STRUCT element, e.g. structname.field2
4. UNIONTYPE
a. Hive assumes one of the data types provided in the UNIONTYPE. For example, if we define a column MISC UNIONTYPE<INT, STRING, ARRAY<DOUBLE>>, the MISC column can hold any of the 3 data types defined in the UNIONTYPE, and based on the content Hive assumes one of the data types from the list.
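A minimal sketch of a table using complex types (all names are hypothetical):
    CREATE TABLE employees (
      name STRING,
      skills ARRAY<STRING>,
      deductions MAP<STRING, FLOAT>,
      address STRUCT<street:STRING, city:STRING, zip:INT>
    );
    -- Access: index for ARRAY, key for MAP, dot notation for STRUCT
    SELECT name, skills[0], deductions['tax'], address.city FROM employees;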
18. Hive provides implicit data type conversion, where smaller types convert automatically to their wider counterparts; for example, SMALLINT to INT is automatic. We can also convert explicitly using CAST, e.g. CAST(value AS datatype). CASTs can be cascaded if needed.
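For example, a cascaded cast might look like this (table and column names are hypothetical):
    SELECT CAST(CAST(price AS DOUBLE) AS STRING) FROM products;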
19. For managed tables, we can partition tables in Hive to improve performance. The partitioning concept is the same as in other relational databases.
a. Partitions are created based on columns.
b. We have to specify PARTITIONED BY <column names> at table creation time.
c. We also have to specify the partition name at LOAD DATA time, otherwise Hive will throw an exception.
d. A separate folder is created in the HDFS datastore for each partition.
e. Partition columns are virtual columns; they are not part of the actual table data columns, but they are represented as actual columns, so we can perform all SELECT operations on virtual columns as we do on actual columns.
f. If a table data column is also present as a partition column, Hive will throw an exception; therefore, if we need a partition column among the data columns, we should rename the data column to something else so that partitions can be queried via the virtual column name.
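A minimal sketch of a partitioned managed table (names are hypothetical):
    CREATE TABLE logs (msg STRING)
    PARTITIONED BY (dt STRING);
    -- The partition must be named at load time
    LOAD DATA INPATH '/staging/logs_day1' INTO TABLE logs PARTITION (dt='2013-01-01');
    -- dt is a virtual column but can be queried like any other
    SELECT msg FROM logs WHERE dt = '2013-01-01';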
20. For external tables, in order to partition, the LOCATION clause is optional in the CREATE statement. The CREATE statement will only have PARTITIONED BY; the location is added via an ALTER TABLE ... ADD PARTITION statement.
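A minimal sketch (table name and paths are hypothetical):
    CREATE EXTERNAL TABLE ext_logs (msg STRING)
    PARTITIONED BY (dt STRING);
    ALTER TABLE ext_logs ADD PARTITION (dt='2013-01-01') LOCATION '/data/logs/2013-01-01';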
21. If we added data in Hive's partitioned directory structure in HDFS but Hive is not aware of it, we can repair the table to update the metadata for all new partitions using "MSCK REPAIR TABLE table_name;".
22. Hive allows multiple inserts using the syntax below:
FROM <from statement>
INSERT INTO TABLE ...
INSERT OVERWRITE TABLE ...
INSERT OVERWRITE DIRECTORY ...
...
Example:
FROM sourcetable
INSERT INTO TABLE targettable
SELECT columns WHERE conditions
(Note that the FROM block moved to the top, so the SELECT does not have a FROM block; Hive allows block interchange. If we are inserting into a partition, the PARTITION clause should appear before the SELECT.)
a. The advantage of multiple inserts is that Hive has to scan the source table only once while performing all the insert operations.
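A minimal sketch of a multi-insert (all table and column names are hypothetical):
    FROM orders o
    INSERT INTO TABLE big_orders
      SELECT o.id, o.amount WHERE o.amount > 1000
    INSERT OVERWRITE TABLE small_orders
      SELECT o.id, o.amount WHERE o.amount <= 1000;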
23. We can create dynamic partitions by supplying the column values from the source table, for example PARTITION (p1='01-Jan_2013', p2, p3). Here the static partition column is p1, while p2 and p3 are dynamic partition columns. All dynamic partition columns should appear after all static partition columns, and at the end of the select column list.
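A minimal sketch of a dynamic-partition insert (names are hypothetical; assumes dynamic partitioning is enabled):
    SET hive.exec.dynamic.partition=true;
    -- country is static, state is dynamic and must be the last selected column
    INSERT INTO TABLE sales PARTITION (country='US', state)
    SELECT id, amount, state FROM staging_sales;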
24. The maximum number of dynamic partitions allowed is 1000 by default; this can be changed by setting a value for the hive.exec.max.dynamic.partitions parameter.
25. We can select from a table and do GROUP BY.
a. GROUPING SETS
i. Grouping sets define the groups. For example,
SELECT a, b, SUM(c) FROM table GROUP BY a, b GROUPING SETS ((a,b), a)
is equivalent to
SELECT a, b, SUM(c) FROM table GROUP BY a, b
UNION ALL
SELECT a, null, SUM(c) FROM table GROUP BY a;
b. WITH CUBE
i. CUBE is simply all possible grouping sets for the GROUP BY columns, i.e. GROUP BY a, b, c WITH CUBE is equal to GROUP BY a, GROUP BY b, GROUP BY c, GROUP BY (a,b), GROUP BY (a,c), GROUP BY (b,c), GROUP BY (a,b,c), and GROUP BY ().
c. WITH ROLLUP
i. It is similar to GROUPING SETS but applies only to a field hierarchy. The top level in the hierarchy is the first element provided in the GROUP BY clause, the second element is the second level in the hierarchy, and so on; i.e. GROUP BY a, b, c WITH ROLLUP is equal to GROUP BY (a,b,c), GROUP BY (a,b), GROUP BY a, and GROUP BY ().
26. Hive built-in functions
a. Mathematical
i. rand() – generates random numbers
ii. pow(a, b) – a raised to the power b
iii. abs(double a)
iv. round(double a, int d)
v. floor(double a)
vi. ...
b. Collection
i. size(), map_keys(map<k,v>), array_contains(array, 'test')
c. Type conversion
d. Date
i. unix_timestamp(), date_add(), to_date(), etc.
e. Conditional
i. if(condition, value_if_true, value_if_false)
ii. coalesce() – null handling, like NVL
iii. CASE WHEN ... THEN ... END
f. String
i. concat(), concat_ws() (i.e. with separator), regexp_replace(), substr()
g. Misc.
h. XPath (xpath(), xpath_string(), etc.)
27. UDAFs (user-defined aggregate functions)
a. count, sum, avg, min, max, variance, stddev_pop, histogram_numeric(col, b)
28. UDTFs (user-defined table-generating functions)
a. explode(), flatten(), etc.
29. UDFs (user-defined functions)
a. We can create our own functions by extending the UDF base class and overriding the evaluate() method.
b. We can then create a JAR, add this JAR to the Hive environment using the ADD JAR command, and use the DESCRIBE FUNCTION command to see the function specs described in the @Description annotation given at the time of function definition.
c. We can create a temporary function as an alias to the original function using CREATE TEMPORARY FUNCTION fname AS 'fully.qualified.ClassName';
d. This temporary function is only available in the current session; if we want to make it available to all users, we need to create a ../bin/.hiverc file or an initialization file and supply the initialization when starting the Hive session with the -i option.
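A minimal sketch of registering and using a custom UDF (the JAR path, class name, and function name are hypothetical):
    ADD JAR /path/to/myudfs.jar;
    CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';
    DESCRIBE FUNCTION my_lower;
    SELECT my_lower(name) FROM users;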
30. Sorting and controlling data flow
a. In a SELECT query, we can use the ORDER BY clause to order the results. The ORDER BY clause triggers only one reducer and sorts globally.
b. If we want to run multiple reducers in parallel to improve performance and utilize distributed computing power, the ORDER BY clause does not work, as it triggers only one reducer.
c. We need to use SORT BY to trigger multiple reducers. But the sorting is done per reducer, i.e. we will never see a globally sorted list. Each reducer creates its own output part, i.e. part-000000, part-000001, etc., and each output is sorted.
d. In order to control what data each reducer processes with SORT BY, we can use DISTRIBUTE BY.
e. For example, if we are interested in seeing each user's login times ordered by time, we can write a query like SELECT userid, time FROM table DISTRIBUTE BY userid SORT BY time;
f. If the same column is used in both DISTRIBUTE BY and SORT BY, we can use the short notation CLUSTER BY.
31. Command-line interface and variable substitution
a. hive -e 'sql' – to run HiveQL via the command line in batch mode
b. hive -S -e 'sql' – to run HiveQL via the command line in silent mode and in batch mode
c. hive -f filename – to run HiveQL that is stored in a file
d. For variable substitution Hive provides 4 namespaces. Each namespace is designated for a special purpose:
i. hivevar – the place to define user-defined variables, e.g. SET hivevar:varname=value; or hive -d varname=value, then use it as ${hivevar:varname}
ii. hiveconf – where Hive stores most of its configuration, e.g. SET hiveconf:propertyname=value;
iii. system – for Java/Hadoop system properties, e.g. SET system:propertyname=value;
iv. env – for environment variables, referenced as ${env:varname} (read-only)
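A minimal sketch of variable substitution from the command line (variable and table names are hypothetical):
    hive -d src_table=orders -e 'SELECT COUNT(*) FROM ${hivevar:src_table};'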
32. Bucketing
a. Bucketing is similar to partitioning. It can further split partitions. Buckets can be applied at the table level or the partition level.
b. Bucketing is an approach to cluster or distribute data for better performance with map-side joins and for more efficient sampling.
c. Buckets can also be sorted.
d. Example: CREATE TABLE table (field1 type, field2 type, ...) CLUSTERED BY (field1) INTO 256 BUCKETS; in this case Hive runs a hash function against column field1 and splits the data into 256 buckets.
e. Another example: CREATE TABLE table (field1 type, field2 type, ...) PARTITIONED BY (field3 type) CLUSTERED BY (field1) SORTED BY (field2) INTO 64 BUCKETS; this is bucketing on top of partitions, sorted within each bucket. SORTED BY enables further optimization when using JOINs. This impacts performance when saving data into the table, but that is still OK, since we typically write once and read several times; writing slowly is fine as long as reads are fast.
f. Hive does not control or enforce bucketing during data load. It is the user's responsibility to enforce bucketing when loading data. We can use one of the following 2 approaches:
SET mapred.reduce.tasks=64;
INSERT OVERWRITE TABLE table
SELECT columns FROM table CLUSTER BY col;
or
SET hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE table
SELECT columns FROM table;
33. Data sampling
a. Data sampling is concerned with working with a subset of data from a very large data set.
b. Data sampling can be applied to any table, bucketed or not; however, a bucketed table improves performance.
c. The syntax uses the TABLESAMPLE keyword, e.g. SELECT * FROM sourcetable TABLESAMPLE(BUCKET x OUT OF y [ON colname]); even though we use the BUCKET keyword, it does not mean we can use only bucketed tables.
d. We can also sample data using a block percentage, i.e. SELECT * FROM sourcetable TABLESAMPLE(n PERCENT); but this approach does not always work.
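A minimal sketch of both sampling forms (table and column names are hypothetical):
    -- Take bucket 3 of 16, hashing rows on userid
    SELECT * FROM users TABLESAMPLE(BUCKET 3 OUT OF 16 ON userid) s;
    -- Roughly 1 percent of the input blocks
    SELECT * FROM users TABLESAMPLE(1 PERCENT) s;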
34. JOINs
a. JOIN (inner join) – returns only rows matched on both sides
b. LEFT, RIGHT, FULL [OUTER] JOIN
c. LEFT SEMI JOIN
i. IN or EXISTS do not work in Hive; we need to use a semi join. For example, SELECT col1 FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t1.col1 = t2.col2) does not work; instead we need to write SELECT col1 FROM t1 LEFT SEMI JOIN t2 ON t1.col1 = t2.col1
d. CROSS JOIN, the same as a Cartesian join.
e. Hive does not support >, <, >=, <=, etc. in join conditions; it only supports = in join conditions.
f. STREAMTABLE – Hive always assumes the smaller tables come first in a multi-table join, i.e. when joining tables a, b, c, Hive assumes a < b < c in size; this can be changed using the STREAMTABLE hint in the select column list.
g. Other ways of optimizing joins are map joins, map-side joins (which work on bucketed tables), and the distributed cache.
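A minimal sketch of the STREAMTABLE hint (table and column names are hypothetical):
    SELECT /*+ STREAMTABLE(a) */ a.val, b.val
    FROM a JOIN b ON a.key = b.key;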
35. Windowing and analytical functions
a. LEAD/LAG
b. FIRST_VALUE
c. LAST_VALUE
d. PARTITION BY
e. OVER clause
f. WINDOW
g. RANK, ROW_NUMBER, DENSE_RANK, etc.
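A minimal sketch combining a few of these (table and column names are hypothetical):
    SELECT userid, amount,
           ROW_NUMBER() OVER (PARTITION BY userid ORDER BY amount DESC) AS rn,
           LAG(amount, 1) OVER (PARTITION BY userid ORDER BY amount) AS prev_amount
    FROM orders;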
36. Hive has a TRANSFORM clause; this allows Hive data to be sent to STDIN and processed from STDIN using external programs like a shell script, Python, Java, etc. Example: SELECT TRANSFORM (f1, f2) USING './file.sh' AS (f1, f2) FROM table; notice here that the file is in the same directory. To make the file available to Hive, we need to add it to the distributed cache using "ADD FILE /path/file.sh".