Thursday, 21 December 2017

BigData - Getting Started with HDFS

1.       HDFS is a distributed file system built for batch processing with high fault tolerance.
2.       HDFS is built on the premise that hardware will fail. A cluster has one active NameNode and several DataNodes. The nodes do not all have to sit in the same location, although in practice a cluster usually runs inside one data center because replication is sensitive to network latency.
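3.       Differences between HDFS 1.0 and HDFS 2.0 (Hadoop 1.x and 2.x)
a.       HDFS 1.0
        i.      A single NameNode, which is a single point of failure
        ii.     Resource management and job scheduling are part of MapReduce (JobTracker and TaskTrackers)
        iii.    Scalability and performance suffer on large clusters
b.       HDFS 2.0
        i.      NameNode high availability: an active and a standby NameNode share the edit log. The Secondary NameNode is not a failover node; it only checkpoints the metadata files by merging the edit log into the fsimage
        ii.     YARN was introduced to take over resource management (ResourceManager) and per-node management (NodeManager) from MapReduce
        iii.    Bringing in YARN helped fix the scalability and performance issues on large clusters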
4.       Actions allowed in HDFS
a.       Copy files from local to HDFS and from HDFS to local
b.       Create new files
c.       Append to files
d.       Read files
e.       Delete files
5.       Actions not allowed in HDFS
a.       Edit files: HDFS does not allow existing file content to be modified in place. A file can only be appended to or replaced as a whole
6.       Using HDFS shell commands
a.       Add the bin directory of the Hadoop installation to the PATH variable
b.       Then use "hdfs dfs -command [options] [input] [output]"
c.       We can also use "hadoop fs -command [options] [input] [output]"
d.       hadoop fs is the generic form and works with whichever file system is configured as the default; when the default file system is HDFS, both commands behave the same
e.       Get help for any command with hdfs dfs -help [command]
f.       Basic commands (command names are case-sensitive, so type hdfs in lowercase)
        i.      hdfs dfs -help
        ii.     hdfs dfs -ls
        iii.    hdfs dfs -touchz filename (creates an empty file, like the Linux touch command)
        iv.     hdfs dfs -cat filename
        v.      hdfs dfs -mkdir directoryname
        vi.     hdfs dfs -cp sourcefile targetfile (copies files within HDFS)
        vii.    hdfs dfs -mv filename1 filename2 (moves or renames files within HDFS)
g.       File and directory permissions
        i.      hdfs dfs -chmod [num] filename
        ii.     Permissions work the same way as Linux permissions
        iii.    Other commands such as -chgrp and -chown also work as they do on Linux
h.       Moving files between HDFS and the local file system
        i.      -put copies files from the local file system into HDFS
        ii.     -get copies files from HDFS to the local file system
        iii.    We can also use -copyToLocal and -copyFromLocal
        iv.     Another command is -moveFromLocal, which deletes the local copy after the upload
i.       Maintenance shell commands
        i.      hdfs dfs -rm removes files, and hdfs dfs -rm -r removes directories together with their contents. Unlike rm on a Linux file system, this does not physically delete the data when trash is enabled: the files are only moved to the trash (add -skipTrash to delete immediately)
        ii.     hdfs dfs -expunge empties the trash, i.e. it permanently deletes trashed files whose retention period has expired
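The shell commands in this section can be strung together into one short session. The sketch below is a dry run by default: the `run` helper (a hypothetical name) only prints each command and executes it when `DRY_RUN=0` is set on a machine that has the `hdfs` client. The `/user/demo` path and the file names are placeholders, not part of the original notes.

```shell
# Print each command by default; run it for real only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Walk through the basic HDFS shell commands against a base directory.
# $1 = HDFS base directory (placeholder, e.g. /user/demo)
hdfs_walkthrough() {
  base=$1
  run hdfs dfs -mkdir -p "$base/in"                               # create a directory
  run hdfs dfs -put sales.csv "$base/in/"                         # local -> HDFS
  run hdfs dfs -appendToFile more_sales.csv "$base/in/sales.csv"  # append (no in-place edit)
  run hdfs dfs -ls "$base/in"                                     # list the directory
  run hdfs dfs -cat "$base/in/sales.csv"                          # print the file contents
  run hdfs dfs -chmod 640 "$base/in/sales.csv"                    # Linux-style permissions
  run hdfs dfs -get "$base/in/sales.csv" sales_copy.csv           # HDFS -> local
  run hdfs dfs -rm -r "$base/in"                                  # moved to trash if trash is enabled
  run hdfs dfs -expunge                                           # purge expired trash
}

hdfs_walkthrough /user/demo
```

Running the script as-is prints the nine commands in order, which is a cheap way to review them before pointing the script at a real cluster.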
7.       Transferring relational data to HDFS using Sqoop
a.       Sqoop is a tool for bulk transfer of data between relational databases and HDFS
b.       Sqoop does this by running a MapReduce job and using some of the basic HDFS commands
c.       Sqoop has mature connectors for most relational databases. The most popular ones are MySQL, Oracle, SQL Server and PostgreSQL
d.       Sqoop does not offer NoSQL connectors out of the box, but many third-party vendors offer NoSQL connectors for Sqoop
e.       Sqoop is an Apache open source project
f.       Sqoop allows automation of workflows such as importing and exporting data
g.       Sqoop is two-way: it can import data from an RDBMS into HDFS and export data from HDFS back into an RDBMS
h.       Some Sqoop use cases
        i.      If we want to run big data analytics on a transactional RDBMS database without impacting the online system, we can automate a regular data copy from the RDBMS to HDFS with Sqoop
        ii.     If we have multiple siloed databases and want to run analytics on consolidated data, we can automate the consolidation from multiple sources into a single HDFS with Sqoop
i.       Basic Sqoop commands
        i.      sqoop list-tables --connect jdbc:mysql://(MySQL Address)/dbname
        ii.     sqoop import --connect jdbc:mysql://(MySQL Address)/dbname --table tablename -m 1 (the -m option sets the number of map tasks to run in parallel)
        iii.    sqoop import --connect jdbc:mysql://(MySQL Address)/dbname --query "select * from ….." (a free-form query must contain WHERE $CONDITIONS and be combined with --target-dir)
        iv.     For more information see sqoop.apache.org
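As a rough sketch of the two import styles above, the script below wraps them in small shell functions. It is a dry run by default (the hypothetical `run` helper prints the command unless `DRY_RUN=0`). The JDBC URL, the user name `report_user`, the table names and the target directories are all placeholders chosen for illustration.

```shell
# Print each command by default; run it for real only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Placeholder connection string; replace with a real host and database.
JDBC_URL="jdbc:mysql://db.example.com/shop"

# Import one whole table into an HDFS directory using a single map task.
# $1 = table name
sqoop_import_table() {
  run sqoop import --connect "$JDBC_URL" --username report_user -P \
    --table "$1" --target-dir "/user/demo/$1" -m 1
}

# Import the result of a free-form query. Sqoop expects the literal
# $CONDITIONS token in the WHERE clause and an explicit --target-dir.
# $1 = SELECT statement without a WHERE clause, $2 = HDFS target directory
sqoop_import_query() {
  run sqoop import --connect "$JDBC_URL" --username report_user -P \
    --query "$1 WHERE \$CONDITIONS" --target-dir "$2" -m 1
}

sqoop_import_table orders
sqoop_import_query "SELECT id, total FROM orders" /user/demo/order_totals
```

The `-P` flag makes Sqoop prompt for the password instead of taking it on the command line, so a real run of this sketch is interactive.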
8.       Querying data with Pig and Hive
a.       Hive: data warehouse software that works on top of Hadoop and allows developers to structure data into schemas
        i.      Both structured and unstructured data can be stored in HDFS, but Hive is schema bound and deals with structured data. We first need to load the HDFS data into a Hive table (which has a schema) and then query it with HiveQL
        ii.     We can write HiveQL just like SQL on an RDBMS
        iii.    Type hive to start the Hive prompt
        iv.     HDFS commands can also be run from the Hive prompt with the dfs command, for example dfs -ls /;
        v.      Querying HDFS data using HiveQL is a 3-step process
                1.       Build data in Hive
                        a.       hive> CREATE DATABASE dbname;
                        b.       hive> USE dbname; (sets the database as the default for querying)
                        c.       hive> CREATE TABLE tablename(field1 STRING, field2 INT, ….) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' TBLPROPERTIES("skip.header.line.count"="1");
                        d.       hive> LOAD DATA INPATH '/user../filename.csv' OVERWRITE INTO TABLE tablename;
                2.       Write HiveQL query
                        a.       hive> SELECT fields FROM tablename WHERE conditions;
                3.       Extract results
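The three steps can also be run non-interactively. The sketch below writes them into a HiveQL file and passes it to `hive -f`, redirecting the query output to a local file as one possible way to extract results. The database `demo_db`, the table `users`, its columns and the CSV path are placeholders. If the `hive` client is not installed, the script only writes the file.

```shell
# Location of the generated HiveQL script (placeholder name).
HQL_FILE="${TMPDIR:-/tmp}/load_users.hql"

# Quoted heredoc delimiter so that '\n' reaches Hive unchanged.
cat > "$HQL_FILE" <<'EOF'
-- Step 1: build the data in Hive
CREATE DATABASE IF NOT EXISTS demo_db;
USE demo_db;
CREATE TABLE IF NOT EXISTS users (name STRING, age INT)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  TBLPROPERTIES ("skip.header.line.count"="1");
LOAD DATA INPATH '/user/demo/users.csv' OVERWRITE INTO TABLE users;

-- Step 2: query with HiveQL
SELECT name, age FROM users WHERE age >= 18;
EOF

# Step 3: extract the results by redirecting the query output to a local file.
if command -v hive >/dev/null 2>&1; then
  hive -f "$HQL_FILE" > users_over_18.tsv
else
  echo "hive not found; script written to $HQL_FILE"
fi
```

Note that LOAD DATA INPATH moves the file from its HDFS location into the Hive warehouse directory, so the CSV is no longer at the original path afterwards.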
b.       Pig: the application environment that runs Pig Latin and converts Pig Latin scripts into MapReduce jobs
        i.      Pig is not bound to a schema
        ii.     Pig can be used over unstructured or structured data
        iii.    The interactive grunt shell is started with "pig -x local" for local file access and with "pig -x mapreduce" or just "pig" for HDFS file access
        iv.     grunt> help (displays help for all commands)
        v.      The grunt shell is more interactive than the Hive prompt, so commands like "ls" and "cd" can be typed just as at a regular shell prompt
c.       Sample Pig scripts
        i.      Load data from HDFS
                1.       A = LOAD '/user/../filename' USING PigStorage(',') AS (field1:chararray, field2:int….);
                2.       Pig does not run a statement until its result is requested. For example, the relation A above is not evaluated until it is needed for output
                3.       When we run "DUMP A;", the MapReduce job runs and loads the HDFS file into A
        ii.     Write a JOIN in Pig Latin
                1.       A_new = FOREACH A GENERATE field1, field2;
                2.       A_combined = JOIN var1 BY field1, var2 BY field1;
        iii.    Store results in HDFS
                1.       STORE A_combined INTO '/user/…/outputfilename' USING PigStorage(',');
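Put together, the load, join and store statements form a complete script that can be run in batch mode instead of typing it at the grunt prompt. The sketch below writes such a script and runs it with `pig -f`. The relation names, field names and HDFS paths are placeholders. If the `pig` client is not installed, the script only writes the file.

```shell
# Location of the generated Pig Latin script (placeholder name).
PIG_FILE="${TMPDIR:-/tmp}/join_users_orders.pig"

cat > "$PIG_FILE" <<'EOF'
-- Load two comma-separated files. Nothing runs yet; Pig only builds a plan.
users  = LOAD '/user/demo/users.csv'  USING PigStorage(',') AS (id:int, name:chararray);
orders = LOAD '/user/demo/orders.csv' USING PigStorage(',') AS (user_id:int, amount:double);

-- Keep only the fields needed, then join the two relations on the user id.
user_names = FOREACH users GENERATE id, name;
joined     = JOIN user_names BY id, orders BY user_id;

-- STORE triggers execution and writes the result back to HDFS.
STORE joined INTO '/user/demo/users_orders' USING PigStorage(',');
EOF

if command -v pig >/dev/null 2>&1; then
  pig -x mapreduce -f "$PIG_FILE"
else
  echo "pig not found; script written to $PIG_FILE"
fi
```

Replacing `-x mapreduce` with `-x local` runs the same script against local files, which is handy for trying it out on a small sample first.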
d.       Hive vs Pig
        i.      Hive excels in data warehouse scenarios. It uses a declarative, SQL-like language and is schema bound.
        ii.     Pig can deal with unstructured or structured data. It fits best in ETL-type workloads, which normally deal with semi-structured data. Text-heavy documents and log files are examples where Pig can be handy. It uses the procedural Pig Latin scripting language.
        iii.    For more detail, see the courses "SQL on Hadoop – Analyzing BigData with Hive" and "Pig Latin: Getting Started"
9.       HBase
a.       HBase is a distributed, scalable NoSQL database built on top of Hadoop. It provides real-time read/write access to data in Hadoop.
b.       HBase can "host data with billions of rows and millions of columns"
c.       HBase works in highly distributed and scalable environments
d.       The HBase interactive shell is started with the command "hbase shell"
e.       Basic commands
        i.      To create a table 'user' with a column family 'username': create 'user', 'username'
        ii.     To show the contents of the 'user' table: scan 'user'
        iii.    To put the value 'Raj' into row r1, column username: hbase> put 'user', 'r1', 'username', 'Raj'
        iv.     To update 'Raj' to 'Rajendra': hbase> put 'user', 'r1', 'username', 'Rajendra'
        v.      To drop the table: disable 'user', then drop 'user'
f.       We can use Pig to load HDFS files into HBase and to unload them from it
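The basic commands above can be collected in a file and passed to `hbase shell`, which runs them in order. The sketch below uses the same 'user' table as the list. The file name is a placeholder, and the trailing `exit` is there so the shell returns instead of staying at the interactive prompt. If the `hbase` client is not installed, the script only writes the file.

```shell
# Location of the generated command file (placeholder name).
HBASE_CMDS="${TMPDIR:-/tmp}/user_table.txt"

cat > "$HBASE_CMDS" <<'EOF'
# Create the table with one column family, write a value, then overwrite it.
create 'user', 'username'
put 'user', 'r1', 'username', 'Raj'
put 'user', 'r1', 'username', 'Rajendra'
# Show the table contents; the latest value for r1 is returned.
scan 'user'
# A table must be disabled before it can be dropped.
disable 'user'
drop 'user'
exit
EOF

if command -v hbase >/dev/null 2>&1; then
  hbase shell "$HBASE_CMDS"
else
  echo "hbase not found; commands written to $HBASE_CMDS"
fi
```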
10.   We can use Bash shell scripting to automate HDFS file management tasks.
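As one possible example of such automation, the sketch below uploads every CSV file from a local directory into a dated HDFS directory. The function name `upload_csvs`, the `run` helper and all paths are hypothetical. The script is a dry run by default and only prints the `hdfs` commands; set `DRY_RUN=0` to execute them.

```shell
# Print each command by default; run it for real only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Upload every *.csv file in a local directory to a dated HDFS directory.
# $1 = local directory, $2 = HDFS base directory,
# $3 = date stamp (optional, defaults to today as YYYY-MM-DD)
upload_csvs() {
  src_dir=$1
  dest_dir="$2/${3:-$(date +%Y-%m-%d)}"

  # Stop early if the target directory cannot be created.
  run hdfs dfs -mkdir -p "$dest_dir" || return 1

  for f in "$src_dir"/*.csv; do
    [ -e "$f" ] || continue               # the glob matched nothing
    run hdfs dfs -put -f "$f" "$dest_dir/"  # -f overwrites an existing copy
  done
}

upload_csvs ./exports /data/incoming
```

Scheduled from cron, a script along these lines gives a simple daily ingest job; the date-stamped directory keeps each day's upload separate.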
