
Thursday 21 December 2017

BigData - Getting Started with HDFS

1. Some basic facts about HDFS: it is a distributed file system built for batch processing, with high fault tolerance.
2. HDFS is built under the premise that hardware will fail. A cluster has one NameNode and several DataNodes. A cluster does not need to be in the same physical location; it can be distributed across regions over the internet.
3. Differences between HDFS 1.0 and HDFS 2.0
   a. HDFS 1.0
      i. A single NameNode, which is a single point of failure
      ii. Resource management and node management are handled inside the MapReduce layer
      iii. Scalability and performance suffer with large clusters
   b. HDFS 2.0
      i. The NameNode is highly available, with a standby NameNode and shared metadata files (fsimage and edits)
      ii. YARN was introduced for resource management and node management
      iii. Bringing in YARN fixed the scalability and performance issues with large clusters
4. Actions allowed in HDFS
   a. Copy files from local to HDFS and from HDFS to local
   b. Create new files
   c. Append to files
   d. Read files
   e. Delete files
5. Actions not allowed in HDFS
   a. Edit files – HDFS is write-once: existing file contents cannot be modified in place (appends are allowed; see the example after this list)
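For example, appending uses -appendToFile, while there is no in-place edit command; to change existing content the file has to be rewritten. A minimal sketch (file and directory names below are placeholders):

# Append a local file's contents to an existing HDFS file (allowed)
hdfs dfs -appendToFile extra_records.txt /user/hduser/data/records.txt

# There is no edit command; to modify existing content, replace the file
hdfs dfs -rm /user/hduser/data/records.txt
hdfs dfs -put corrected_records.txt /user/hduser/data/records.txt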
6. Using HDFS shell commands
   a. Add the HDFS installation's bin directory to the PATH variable
   b. Then use "hdfs dfs -command [options] [input] [output]"
   c. We can also use "hadoop fs -command [options] [input] [output]"
   d. If the default file system is not HDFS, use the hadoop command; with HDFS installed as the default, both behave the same
   e. Get help on any command using "hdfs dfs -help [command]"
   f. Basic commands (a short example session follows at the end of this section)
      i. hdfs dfs -help
      ii. hdfs dfs -ls
      iii. hdfs dfs -touchz filename  (creates an empty file, like the Linux touch command)
      iv. hdfs dfs -cat filename
      v. hdfs dfs -mkdir directoryname
      vi. hdfs dfs -cp sourcefile targetfile  (copies files within HDFS)
      vii. hdfs dfs -mv filename1 filename2
   g. File and directory permissions
      i. hdfs dfs -chmod [mode] filename
      ii. Permissions are the same as Linux permissions
      iii. Other commands like -chgrp and -chown work the same as in Linux
   h. Moving files between HDFS and the local file system
      i. -put to put files into HDFS
      ii. -get to get files from HDFS to the local file system
      iii. We can also use -copyToLocal and -copyFromLocal
      iv. Another command is -moveFromLocal
   i. Maintenance shell commands
      i. hdfs dfs -rm -r  (removes files and directories). This does not immediately delete the data physically; when trash is enabled, rm only moves it to the trash, unlike rm on a Linux file system
      ii. hdfs dfs -expunge  clears the trash, i.e. permanently deletes the trashed files
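A minimal example session tying these commands together (the local file sample.txt and the /user/hduser paths are assumptions):

hdfs dfs -mkdir -p /user/hduser/demo                         # create a target directory
hdfs dfs -put sample.txt /user/hduser/demo/                  # local -> HDFS
hdfs dfs -ls /user/hduser/demo                               # list the directory
hdfs dfs -cat /user/hduser/demo/sample.txt                   # print the file contents
hdfs dfs -chmod 644 /user/hduser/demo/sample.txt             # Linux-style permissions
hdfs dfs -get /user/hduser/demo/sample.txt sample_copy.txt   # HDFS -> local
hdfs dfs -rm -r /user/hduser/demo                            # moves to trash if trash is enabled
hdfs dfs -expunge                                            # empty the trash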
7. Transferring relational data to HDFS using Sqoop
   a. Sqoop is a tool for bulk transfer of data between relational databases and HDFS
   b. Sqoop does this by running a MapReduce job and using some of the basic HDFS commands
   c. Sqoop has mature connectors for most relational databases; the most popular are MySQL, Oracle, SQL Server and PostgreSQL
   d. Sqoop does not offer NoSQL connectors out of the box, but many 3rd-party vendors offer NoSQL connectors for Sqoop
   e. Sqoop is an Apache open source project
   f. Sqoop allows automation of workflows like importing and exporting data
   g. Sqoop is two-way: we can read and upload data in and out of both HDFS and an RDBMS
   h. Some Sqoop use cases
      i. If we want to run big data analytics on a transactional RDBMS without impacting the online system, we can automate a regular data copy from the RDBMS to HDFS with Sqoop
      ii. If we have multiple siloed databases and want to run analytics on consolidated data, we can automate Sqoop jobs to consolidate the multiple sources into a single HDFS cluster
   i. Basic Sqoop commands (a fuller example follows this list)
      i. sqoop list-tables --connect jdbc:mysql://(MySQL address)/dbname
      ii. sqoop import --connect jdbc:mysql://(MySQL address)/dbname --table tablename -m 1  (the -m option specifies the number of parallel map tasks to run)
      iii. sqoop import --connect jdbc:mysql://(MySQL address)/dbname --query 'select * from …'
      iv. For more information see sqoop.apache.org
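A fuller import sketch (the hostname, database, credentials, table, and paths are placeholders; note that a free-form --query import also needs a $CONDITIONS token in the WHERE clause and a --target-dir so Sqoop can split and place the work):

# Import a whole table with 4 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username sqoopuser -P \
  --table orders \
  --target-dir /user/hduser/orders \
  -m 4

# Import the result of a free-form query
sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username sqoopuser -P \
  --query 'SELECT order_id, amount FROM orders WHERE $CONDITIONS' \
  --split-by order_id \
  --target-dir /user/hduser/orders_query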
8. Querying data with Pig and Hive
   a. Hive – data warehouse software that works on top of Hadoop and lets developers structure data into schemas
      i. We can store both structured and unstructured data in HDFS, but Hive deals with structured data; it is schema bound. We first load the HDFS data into a Hive table (schema bound) and then use HiveQL to query it
      ii. We can write HiveQL much like SQL on an RDBMS
      iii. Just type hive to start the hive prompt
      iv. We can also run HDFS commands from the hive prompt
      v. Querying HDFS data with HiveQL is a 3-step process (a scripted version follows the steps below)
         1. Build the data in Hive
            a. hive> CREATE DATABASE dbname;
            b. hive> USE dbname;  (sets the database as the default for querying)
            c. hive> CREATE TABLE tablename (field1 STRING, field2 INT, …) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' TBLPROPERTIES ('skip.header.line.count'='1');
            d. hive> LOAD DATA INPATH '/user/…/filename.csv' OVERWRITE INTO TABLE tablename;
         2. Write the HiveQL query
            a. SELECT fields FROM tablename WHERE conditions;
         3. Extract the results
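Putting the three steps together as one scripted run from the command line, a minimal sketch (the database, table, columns, and file path below are placeholders, and the CSV is assumed to already be in HDFS):

hive -e "
CREATE DATABASE IF NOT EXISTS salesdb;
USE salesdb;
CREATE TABLE IF NOT EXISTS orders (order_id INT, customer STRING, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  TBLPROPERTIES ('skip.header.line.count'='1');
LOAD DATA INPATH '/user/hduser/orders.csv' OVERWRITE INTO TABLE orders;
SELECT customer, SUM(amount) FROM orders GROUP BY customer;
"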
   b. Pig – Pig is the application environment used to run Pig Latin and convert Pig Latin scripts to MapReduce jobs
      i. Pig is not bound to a schema
      ii. Pig can be used over unstructured or structured data
      iii. Pig's interactive grunt shell can be started with "pig -x local" for local file access and "pig -x mapreduce" (or just "pig") for HDFS file access
      iv. grunt> help  (displays help for all commands)
      v. The grunt shell is more interactive than Hive's prompt, so we can run commands like "ls" and "cd" just as we would at a normal shell prompt
   c. Sample Pig scripts (a complete script is sketched after this list)
      i. Load data from HDFS
         1. A = LOAD '/user/../filename' USING PigStorage(',') AS (field1:chararray, field2:int, …);
         2. Pig does not execute a statement until its result is needed; for example, the relation A above is not evaluated until it is used
         3. When we run "DUMP A", the MapReduce job executes and loads the HDFS file into the relation A in Pig
      ii. Write a JOIN in Pig Latin
         1. A_new = FOREACH A GENERATE field1, field2;
         2. A_combined = JOIN var1 BY field1, var2 BY field1;
      iii. Store the results in HDFS
         1. STORE A_combined INTO '/user/…/outputfilename' USING PigStorage(',');
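The pieces above strung together as one runnable script, a sketch in which the file paths, relation names, and fields are assumptions:

# Write the Pig Latin to a file and run it in MapReduce mode
cat > join_demo.pig <<'EOF'
customers = LOAD '/user/hduser/customers.csv' USING PigStorage(',')
            AS (cust_id:int, name:chararray);
orders    = LOAD '/user/hduser/orders.csv' USING PigStorage(',')
            AS (order_id:int, cust_id:int, amount:double);
cust_slim = FOREACH customers GENERATE cust_id, name;      -- project two fields
joined    = JOIN cust_slim BY cust_id, orders BY cust_id;  -- inner join on cust_id
STORE joined INTO '/user/hduser/joined_out' USING PigStorage(',');
EOF
pig join_demo.pig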
   d. Hive vs Pig
      i. Hive excels in data warehouse scenarios. It uses a declarative, SQL-like language and is schema bound.
      ii. Pig can deal with unstructured or structured data. It fits best in ETL-type workloads, which normally deal with semi-structured data; text-heavy documents and log files are examples where Pig can be handy. It uses procedural Pig Latin scripting.
      iii. See the "SQL on Hadoop – Analyzing BigData with Hive" and "Pig Latin: Getting Started" courses for more.
9. HBase
   a. HBase is a NoSQL, distributed and scalable database built on top of Hadoop, providing real-time read/write access to data in Hadoop
   b. HBase can "host data with billions of rows and millions of columns"
   c. HBase works in highly distributed and scalable environments
   d. The HBase interactive shell can be started with the command "hbase shell"
   e. Basic commands (a scripted session follows this list)
      i. To create a table "user" with a column family "username": create 'user','username'
      ii. To show the contents of the "user" table, just type: scan 'user'
      iii. To put the value "Raj" into row r1, column family username: hbase> put 'user','r1','username','Raj'
      iv. To update 'Raj' to 'Rajendra': hbase> put 'user','r1','username','Rajendra'
      v. To drop the table: disable 'user' and then drop 'user'
   f. We can use Pig to load/unload HDFS files into/from HBase
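The same lifecycle scripted from bash, as a sketch (the table and column family are the ones used above; this assumes hbase is on the PATH):

# Feed the commands to the HBase shell non-interactively
hbase shell <<'EOF'
create 'user', 'username'                   # table with one column family
put 'user', 'r1', 'username', 'Raj'         # insert a cell
put 'user', 'r1', 'username', 'Rajendra'    # putting to the same cell again acts as an update
scan 'user'                                 # show the table contents
disable 'user'                              # a table must be disabled before it can be dropped
drop 'user'
EOF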
10. We can use Bash shell scripting to automate HDFS file-management tasks, as in the sketch below.
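A minimal automation sketch (the log directory, HDFS target path, file-naming pattern, and 30-day retention are all assumptions): archive yesterday's local logs into a dated HDFS directory and prune old copies.

#!/bin/bash
# Push yesterday's application logs into HDFS and delete HDFS copies older than 30 days.
set -euo pipefail

LOCAL_LOG_DIR=/var/log/myapp          # hypothetical local log location
HDFS_BASE=/user/hduser/logs           # hypothetical HDFS target directory
DAY=$(date -d "yesterday" +%Y-%m-%d)

hdfs dfs -mkdir -p "${HDFS_BASE}/${DAY}"
hdfs dfs -put -f "${LOCAL_LOG_DIR}"/*."${DAY}".log "${HDFS_BASE}/${DAY}/"

# Dated directory names sort lexicographically, so a plain string comparison works
CUTOFF=$(date -d "30 days ago" +%Y-%m-%d)
hdfs dfs -ls "${HDFS_BASE}" | awk 'NR > 1 {print $NF}' | while read -r dir; do
  if [[ "$(basename "$dir")" < "$CUTOFF" ]]; then
    hdfs dfs -rm -r "$dir"
  fi
done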
