Technical excerpts: BigData - Getting Started with HBase

1. Hadoop Limitations

a. Unstructured data: Hadoop holds mostly unstructured data. Even when the data conforms to a schema, Hadoop (HDFS) itself does not provide a way to associate a schema with the data

b. No Random Access: Hadoop supports bulk and batch data processing but does not support random access to data, i.e. accessing a particular record is not supported

c. High Latency: Hadoop is not well suited to serve as a transactional database. It is designed to process very large data sets: latency is high even on small data sets, but on very large data Hadoop is faster than a traditional database or file system

d. Not ACID Compliant: HDFS does not provide transactional (ACID) guarantees for data integrity

2. These Hadoop limitations can be overcome with a database like HBase

3. HBase

a. HBase is a NoSQL database

b. It supports a loose data structure

c. HBase supports row keys, which allow random access to, and updates of, a single record

d. Some transactions in HBase have ACID properties

4. Properties of HBase

a. Column-oriented store: columns are grouped into column families and stored by family

b. Denormalized storage

i. Queries that span multiple tables are not allowed in HBase, so we should denormalize and store all related data together

ii. Secondary indexes and constraints are not supported

c. Only CRUD Operations (Create, Read, Update and Delete)

i. Operations like ORDER BY, JOIN and GROUP BY are not available in HBase

ii. HBase does not support SQL; it is a NoSQL database

d. ACID at the row level
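
The denormalization point above can be sketched in a few lines of Python. This is a toy illustration only: plain dicts stand in for tables, and the table names, column family names ("order", "customer") and sample values are made up for the example. It is not the HBase API.

```python
# Toy illustration only: plain dicts stand in for tables; this is not the HBase API.

# Normalized (RDBMS style): reading an order with its customer needs a join.
customers = {"c1": {"name": "Asha", "city": "Pune"}}
orders = {"o1": {"customer_id": "c1", "item": "laptop"}}


def order_with_join(order_id):
    """Emulate the cross-table join that HBase does not offer."""
    order = orders[order_id]
    customer = customers[order["customer_id"]]
    return {"item": order["item"], "name": customer["name"], "city": customer["city"]}


# Denormalized (HBase style): one row, keyed by order id, holds all related
# data, grouped into the hypothetical column families "order" and "customer".
orders_denormalized = {
    "o1": {
        "order": {"item": "laptop"},
        "customer": {"name": "Asha", "city": "Pune"},
    }
}


def order_by_row_key(order_id):
    """A single row-key lookup returns everything; no second table is read."""
    row = orders_denormalized[order_id]
    return {"item": row["order"]["item"], **row["customer"]}


# Both approaches yield the same record.
print(order_with_join("o1") == order_by_row_key("o1"))  # True
```

The cost of the denormalized layout is that the customer details are repeated in every order row, so an update to a customer has to touch all of that customer's orders.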

5. HBase Data Model

a. HBase follows a 4-dimensional data model. Each value in a row is identified by 4 elements

i. Row Key

1. The row key is like a primary key or index in an RDBMS. It does not need to be specified at table creation; it is supplied when data is inserted or loaded

ii. Column Family

1. At the time of table creation, we need to define the column families

2. Related columns can be grouped together into a column family

3. Every row in a table has the same set of column families

4. Each column family is stored together in its own files, i.e. the columns of a family are read together even if we query only one column in that family

iii. Column

1. A column is the actual field that contains data

2. Columns are not specified at the time of table creation; we can add columns dynamically

3. Each column must belong to a column family

4. No data type is associated with a column; everything is stored internally as a byte array

5. A column is referenced as “columnfamily:columnname”

iv. Timestamp

1. Each value in a column is associated with a timestamp, which identifies when the value was last changed

2. Every time a value is modified, a new entry is created with a new timestamp, which allows values to be versioned by timestamp
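
The four coordinates above (row key, column family, column, timestamp) can be pictured as a nested map. The following is a minimal in-memory sketch in Python. The class ToyTable, its method names and the sample data are assumptions made for illustration; this is not the HBase client API and it ignores storage, version limits and concurrency.

```python
import time
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for an HBase table (hypothetical, not the HBase API)."""

    def __init__(self, *column_families):
        # Column families are fixed when the table is created.
        self.column_families = set(column_families)
        # row key -> "family:column" -> list of (timestamp, value) versions
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp=None):
        family, _, _qualifier = column.partition(":")
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        ts = time.time_ns() if timestamp is None else timestamp
        # No data type is attached to a column; values are kept as bytes.
        self.rows[row_key][column].append((ts, str(value).encode()))

    def versions(self, row_key, column):
        """All (timestamp, value) versions of a cell, newest first."""
        return sorted(self.rows.get(row_key, {}).get(column, []), reverse=True)

    def get(self, row_key, column):
        """The value with the newest timestamp, or None if the cell is empty."""
        found = self.versions(row_key, column)
        return found[0][1] if found else None


table = ToyTable("personal", "contact")
# Columns are created on the fly; only the families were declared up front.
table.put("1", "personal:name", "Ravi", timestamp=100)
table.put("1", "contact:email", "ravi@example.com", timestamp=100)
# Writing the same row key and column again adds a newer version.
table.put("1", "personal:name", "Ravi K", timestamp=200)

print(table.get("1", "personal:name"))       # b'Ravi K'
print(table.versions("1", "personal:name"))  # [(200, b'Ravi K'), (100, b'Ravi')]
```

Explicit timestamps are passed here only to make the output predictable; when none is given the sketch falls back to the current time.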

6. The HBase shell can be invoked by typing “hbase shell”

7. To create a table, just type “create 'tablename', 'columnfam1', 'columnfamily2'”

8. To insert data, type “put 'tablename', 'rowkeyval', 'columnfam1:col1', 'col1val'”. We can run multiple puts with the same row key to add columns to the same row. If we run put again with the same row key and the same column, the value is updated (a new version is written with a new timestamp)

9. Use “scan 'tablename'” to read table data

10. Use “count 'tablename'” to count records in a table

11. Use “get 'tablename', 'rowkey'” to get a specific row in a table. get can be made more granular by specifying column names

12. We can use scan to retrieve a range of records, as scan accepts more options. For example, “scan 'tablename', {COLUMNS => ['columnfam:columnname'], STARTROW => '2', STOPROW => '4'}” retrieves one column from the rows whose keys fall between '2' (inclusive) and '4' (exclusive), e.g. the second and third rows when the row keys are 1, 2, 3, 4
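
The range semantics of scan can be sketched in Python as follows. This is a toy model under stated assumptions: the function scan, its parameters and the sample rows are made up for illustration and are not the HBase client API. It relies on two properties of HBase: rows are kept sorted by row key in byte order, and the start row is inclusive while the stop row is exclusive.

```python
from bisect import bisect_left


def scan(rows, columns=None, start_row=None, stop_row=None):
    """Return (row_key, row) pairs with start_row <= row_key < stop_row, in key order."""
    keys = sorted(rows)  # byte-wise (lexicographic) order, not numeric order
    lo = 0 if start_row is None else bisect_left(keys, start_row)
    hi = len(keys) if stop_row is None else bisect_left(keys, stop_row)
    result = []
    for key in keys[lo:hi]:
        row = rows[key]
        if columns is not None:
            # Keep only the requested "family:column" entries.
            row = {c: v for c, v in row.items() if c in columns}
        result.append((key, row))
    return result


# Hypothetical sample data: row keys and values are byte strings.
rows = {
    str(i).encode(): {"info:name": f"row{i}".encode(), "info:age": str(20 + i).encode()}
    for i in (1, 2, 3, 4, 10)
}

# Full scan: note that b'10' sorts before b'2'.
print([k for k, _ in scan(rows)])  # [b'1', b'10', b'2', b'3', b'4']

# Same shape as the shell example: one column, keys from '2' up to (not including) '4'.
print(scan(rows, columns=["info:name"], start_row=b"2", stop_row=b"4"))
# [(b'2', {'info:name': b'row2'}), (b'3', {'info:name': b'row3'})]
```

Because the order is byte-wise, numeric row keys are often zero-padded (for example 0002, 0010) so that a range scan returns them in numeric order.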

13. “delete” is addressed like get (table, row key, column), but it deletes the value instead of reading it

14. A table must be disabled before it is dropped. “enable”, “disable”, “exists” and “drop” are a few more HBase shell commands


Saturday, 20 January 2018
