1. Hadoop Limitations
a. Unstructured data: Hadoop mostly holds unstructured data. Even when the data conforms to a schema, Hadoop/HDFS itself does not provide a way to associate a schema with the data.
b. No random access: Hadoop supports bulk and batch processing but does not support random access to data, i.e. accessing a particular record is not supported.
c. High latency: Hadoop is not well suited as a transactional database. Processing is meant for very large data sets, so latency is high on small datasets; on huge data, however, Hadoop is faster than a traditional database or file system.
d. Not ACID compliant: The HDFS file system does not guarantee data integrity.
2. These Hadoop limitations can be overcome with a database like HBase.
3. HBase
a. HBase is a NoSQL database.
b. It supports a loose data structure.
c. HBase supports row keys, which allow updating a single record and random access.
d. Some transactions in HBase have ACID properties.
4. Properties of HBase
a. Columnar store database
b. Denormalized storage
i. Queries spanning multiple tables are not allowed in HBase, so we should denormalize and store all related data together.
ii. No indexes and no constraints are supported.
c. Only CRUD operations (Create, Read, Update and Delete)
i. Operations like ORDER BY, JOIN and GROUP BY are not allowed in HBase.
ii. HBase does not support SQL; it is a NoSQL database.
d. ACID at the row level
5. HBase Data Model
a. HBase follows a four-dimensional model; each row has four types of data:
i. Row Key
1. The row key is like a primary key or index in an RDBMS. It does not need to be specified at table creation; it is given at the time of data insert or load.
ii. Column Family
1. Column families must be defined at the time of table creation.
2. All related columns can be grouped together as a column family.
3. Every row in a table has the same set of column families.
4. One column family is stored in one file, i.e. the whole column family is fetched together even if we query only one column in that family.
iii. Column
1. A column is the actual column that contains data.
2. Columns are not specified at the time of table creation; we can add columns dynamically.
3. Each column must belong to a column family.
4. No data type is associated with a column; everything is internally a byte array.
5. A column is referenced as "columnfamily:columnname".
iv. Timestamp
1. Each value in a column is associated with a timestamp, which identifies when the value was last changed.
2. Every time a value is modified, a new entry is created with a new timestamp, so values can be versioned using timestamps (see the shell sketch below).
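A minimal sketch of versioning, assuming a hypothetical 'employee' table with a 'personal' column family (the shell commands themselves are covered in the items that follow):

  put 'employee', '1', 'personal:city', 'Pune'
  put 'employee', '1', 'personal:city', 'Mumbai'
  get 'employee', '1', {COLUMN => 'personal:city', VERSIONS => 3}

The get asks for up to three timestamped versions of 'personal:city'; how many old versions are actually retained depends on the column family's VERSIONS setting.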
6. The HBase shell can be invoked by typing "hbase shell".
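A quick sketch of entering and leaving the shell (prompt abbreviated; output varies by installation):

  $ hbase shell
  hbase> status    # basic cluster status
  hbase> list      # tables currently defined
  hbase> quit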
7. To create a table, type "create 'tablename', 'columnfam1', 'columnfam2'".
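For example, a hypothetical 'employee' table with two column families, reused in the sketches below:

  create 'employee', 'personal', 'professional'
  describe 'employee'    # shows the column families and their settings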
8. To insert data, type "put 'tablename', 'rowkeyval', 'columnfam1:col1', 'col1val'". We can run multiple puts with the same row key to add columns to the same row; if we run put again with the same row key and the same column, it updates the value.
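Continuing the hypothetical 'employee' table, a few puts that build up rows and then overwrite one cell:

  put 'employee', '1', 'personal:name', 'Asha'
  put 'employee', '1', 'personal:city', 'Pune'
  put 'employee', '1', 'professional:role', 'Engineer'
  put 'employee', '2', 'personal:name', 'Ravi'
  put 'employee', '3', 'personal:name', 'Meena'
  put 'employee', '1', 'personal:city', 'Mumbai'    # same row key and column: value is updated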
9. Use "scan 'tablename'" to read table data.
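On the hypothetical table above, a full scan prints one line per cell (row key, column, timestamp and value):

  scan 'employee'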
10. Use "count 'tablename'" to count the rows in a table.
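For the same hypothetical table (count walks the whole table, so it can be slow on large tables):

  count 'employee'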
11. Use "get 'tablename', 'rowkey'" to get a specific row from a table; get can be made more granular by specifying column names.
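For example, fetching a whole row and then a single column of that row in the hypothetical 'employee' table:

  get 'employee', '1'
  get 'employee', '1', {COLUMN => 'personal:name'}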
12. We can use scan to retrieve a range of records; scan takes more options, for example "scan 'tablename', {COLUMNS => 'columnfam:columnname', STARTROW => '2', STOPROW => '4'}" retrieves one column from the second and third rows (STOPROW is exclusive).
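Against the hypothetical 'employee' table populated above, this returns 'personal:name' for rows '2' and '3' only:

  scan 'employee', {COLUMNS => 'personal:name', STARTROW => '2', STOPROW => '4'}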
13. "delete" is similar to get, but it deletes a cell instead of reading it.
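A sketch on the hypothetical table: delete removes a single cell, while deleteall (another standard shell command) removes an entire row:

  delete 'employee', '1', 'personal:city'
  deleteall 'employee', '3'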
14. Disable a table before dropping it. "enable", "disable", "exists" and "drop" are a few more HBase shell commands.
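Cleaning up the hypothetical table, for example:

  exists 'employee'     # check whether the table is defined
  disable 'employee'    # a table must be disabled before it can be dropped
  drop 'employee'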