1. Some basic facts about HDFS: it is a distributed file system built for batch processing with high fault tolerance.
2. HDFS is built under the premise that hardware will fail. A cluster has one NameNode and several DataNodes. A cluster does not need to be in the same premises; it can be distributed across regions over the internet.
3. Differences between HDFS 1.0 and HDFS 2.0
a. HDFS 1.0
i. Single NameNode, which is a single point of failure
ii. Resource manager and node manager are part of MapReduce
iii. Scalability and performance suffer with large clusters
b. HDFS 2.0
i. NameNode is highly available, with a secondary NameNode and metadata files (fsimage and edits)
ii. YARN was introduced for resource management and node management
iii. Bringing in YARN helped fix the scalability and performance issues with large clusters
4. Actions allowed in HDFS
a. Copy files from local to HDFS and from HDFS to local
b. Create new files
c. Append to files
d. Read files
e. Delete files
5. Actions not allowed in HDFS
a. Edit files – HDFS does not allow file modifications
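A quick illustration of both rules, as a sketch (the file names and paths here are made up): append works, but the only way to "edit" a file is to rewrite it.
hdfs dfs -appendToFile notes.txt /user/raj/notes.txt   # append is supported
# There is no edit-in-place; to change content, replace the file:
hdfs dfs -get /user/raj/notes.txt .                    # pull it to local
# ...edit notes.txt locally...
hdfs dfs -put -f notes.txt /user/raj/notes.txt         # overwrite the HDFS copy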
6. Using HDFS shell commands
a. Add the bin directory of the HDFS installation to the PATH variable
b. Then use "hdfs dfs -command [options] [input] [output]"
c. We can also use "hadoop fs -command [options] [input] [output]"
d. If the default file system is not HDFS, use the hadoop command; with HDFS installed, both behave the same
e. Get help for any command using "hdfs dfs -help [command]"
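For example, assuming Hadoop is installed under /usr/local/hadoop (the install path is an assumption; adjust it to your setup):
export PATH=$PATH:/usr/local/hadoop/bin
hdfs dfs -ls /            # list the HDFS root
hadoop fs -ls /           # same result when HDFS is the default file system
hdfs dfs -help ls         # help for a single command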
f. Basic commands
i. hdfs dfs -help
ii. hdfs dfs -ls
iii. hdfs dfs -touchz filename (like the Linux touch command)
iv. hdfs dfs -cat filename
v. hdfs dfs -mkdir directoryname
vi. hdfs dfs -cp sourcefile targetfile (this copies files within HDFS)
vii. hdfs dfs -mv filename1 filename2
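A small sample session tying these together (the directory and file names are made up):
hdfs dfs -mkdir /user/raj/demo
hdfs dfs -touchz /user/raj/demo/empty.txt
hdfs dfs -ls /user/raj/demo
hdfs dfs -cp /user/raj/demo/empty.txt /user/raj/demo/copy.txt
hdfs dfs -mv /user/raj/demo/copy.txt /user/raj/demo/renamed.txt
hdfs dfs -cat /user/raj/demo/renamed.txt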
g. File and directory permissions
i. hdfs dfs -chmod [num] filename
ii. Permissions are the same as Linux permissions
iii. Other commands like -chgrp, -chown etc. work the same as in Linux
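For example (the user, group and path names are made up):
hdfs dfs -chmod 750 /user/raj/demo                 # rwx for owner, r-x for group, none for others
hdfs dfs -chown raj:hadoopusers /user/raj/demo
hdfs dfs -chgrp analysts /user/raj/demo/renamed.txt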
h. Moving files between HDFS and the local file system
i. -put to put files into HDFS
ii. -get to get files from HDFS to the local file system
iii. We can also use -copyToLocal and -copyFromLocal
iv. Another command is -moveFromLocal
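For example (the local file and HDFS paths are made up):
hdfs dfs -put sales.csv /user/raj/data/                        # local -> HDFS (local copy is kept)
hdfs dfs -get /user/raj/data/sales.csv .                       # HDFS -> local
hdfs dfs -copyFromLocal sales.csv /user/raj/data/backup.csv    # same idea as -put
hdfs dfs -copyToLocal /user/raj/data/sales.csv ./sales2.csv    # same idea as -get
hdfs dfs -moveFromLocal sales.csv /user/raj/data/moved.csv     # like -put, but deletes the local file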
i. Maintenance shell commands
i. hdfs dfs -rm -r (to remove files and empty directories). This does not physically delete them; unlike in the Linux file system, rm only moves them to trash
ii. hdfs dfs -expunge clears the trash, i.e. physically/permanently deletes the files in trash
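For example (the paths are made up; trash behaviour also depends on the fs.trash.interval setting):
hdfs dfs -rm -r /user/raj/demo                 # moves the directory to trash
hdfs dfs -expunge                              # permanently empties the trash
hdfs dfs -rm -r -skipTrash /user/raj/data      # bypasses trash and deletes immediately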
7. Transferring relational data to HDFS using Sqoop
a. Sqoop is a tool for bulk transfer of data between relational databases and HDFS
b. Sqoop does this by running a MapReduce job and using some of the basic HDFS commands
c. Sqoop has mature APIs for most relational databases; the most popular ones are MySQL, Oracle, SQL Server and PostgreSQL
d. Sqoop does not offer out-of-the-box NoSQL connectors, but many 3rd-party vendors offer NoSQL connectors for Sqoop
e. Sqoop is an Apache open source project
f. Sqoop allows automation of workflows like importing and exporting data
g. Sqoop is two-way: we can read data and upload data in and out of both HDFS and an RDBMS
h. Some Sqoop use cases
i. If we want to run big data analytics on a transactional RDBMS database without impacting the online system, we can automate a regular data copy from the RDBMS to HDFS with Sqoop
ii. If we have multiple siloed databases and want to run analytics on the consolidated data, we can use Sqoop to automate consolidation from multiple sources into a single HDFS
i. Basic Sqoop commands
i. sqoop list-tables --connect jdbc:mysql://(MySQL Address)/dbname
ii. sqoop import --connect jdbc:mysql://(MySQL Address)/dbname --table tablename -m 1 (the -m option specifies the number of map tasks, i.e. the parallelism of the import)
iii. sqoop import --connect jdbc:mysql://(MySQL Address)/dbname --query 'select * from …' (with --query, Sqoop also requires a --target-dir and a $CONDITIONS token in the query's WHERE clause)
iv. For more information see sqoop.apache.org
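Putting the options together, a full import might look like this sketch (the host, port, database, table, credentials and target directory are all placeholders):
sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username sqoopuser -P \
  --table orders \
  --target-dir /user/raj/sqoop/orders \
  -m 1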
8. Querying data with Pig and Hive
a. Hive – data warehouse software that works on top of Hadoop and allows developers to structure data into schemas
i. We can store both structured and unstructured data in HDFS, but Hive is a tool that deals with structured data; it is schema bound. We first need to move the structured/unstructured data from HDFS into Hive (schema bound) and then use HiveQL to query it
ii. We can write HiveQL just like SQL on an RDBMS
iii. Just type hive to start the hive prompt
iv. We can also run hdfs commands from the hive prompt
v. Querying HDFS data using HiveQL is a 3 step process
1. Build the data in Hive
a. hive> CREATE DATABASE dbname;
b. hive> USE dbname; (sets the database as the default for querying)
c. hive> CREATE TABLE tablename(field1 STRING, field2 INT, …) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' TBLPROPERTIES("skip.header.line.count"="1");
d. hive> LOAD DATA INPATH '/user../filename.csv' OVERWRITE INTO TABLE tablename;
2. Write the HiveQL query
a. SELECT fields FROM tablename WHERE conditions;
3. Extract the results
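The same three steps can also be scripted from bash with hive -e, for example (the database, table, fields and file path below are made up for illustration):
hive -e "
  CREATE DATABASE IF NOT EXISTS salesdb;
  USE salesdb;
  CREATE TABLE IF NOT EXISTS sales(item STRING, amount INT)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    TBLPROPERTIES('skip.header.line.count'='1');
  LOAD DATA INPATH '/user/raj/sales.csv' OVERWRITE INTO TABLE sales;
  SELECT item, SUM(amount) FROM sales GROUP BY item;
"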
b. Pig – Pig is the application environment used to run Pig Latin and convert Pig Latin scripts to MapReduce jobs
i. Pig is not bound to a schema
ii. Pig can be used over unstructured or structured data
iii. The Pig interactive grunt shell can be started using "pig -x local" for local file access and "pig -x mapreduce" or just "pig" for HDFS file access
iv. grunt> help (displays help for all commands)
v. The Pig grunt shell is more interactive than Hive, so we can run commands like "ls" and "cd" just as we would at a shell prompt
c. Sample Pig scripts
i. Load data from HDFS
1. A = LOAD '/user/../filename' USING PigStorage(',') AS (field1:chararray, field2:int, …);
2. Pig does not run the command until the result is needed; for example, the relation A above is not evaluated until it is used
3. When we do "DUMP A", the MapReduce job runs to load the HDFS file into the relation A in Pig
ii. Write a JOIN in Pig Latin
1. A_new = FOREACH A GENERATE field1, field2;
2. A_combined = JOIN var1 BY field1, var2 BY field1;
iii. Store results in HDFS
1. STORE A_combined INTO '/user/…/outputfilename' USING PigStorage(',');
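Putting the steps together, a complete run from bash might look like the sketch below (the file paths, relation names and fields are made up, and the JOIN assumes both inputs share an id field):
cat > join_demo.pig <<'EOF'
users  = LOAD '/user/raj/users.csv'  USING PigStorage(',') AS (id:int, name:chararray);
orders = LOAD '/user/raj/orders.csv' USING PigStorage(',') AS (id:int, amount:int);
joined = JOIN users BY id, orders BY id;
STORE joined INTO '/user/raj/output/user_orders' USING PigStorage(',');
EOF
pig -x mapreduce join_demo.pig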
d. Hive vs. Pig
i. Hive excels in data warehouse scenarios. Hive uses a declarative SQL-like language and is schema bound.
ii. Pig can deal with unstructured or structured data. It fits best in ETL-type workloads, which normally deal with semi-structured data; text-heavy documents and log files are some examples where Pig can be handy. It uses the procedural Pig Latin scripting language.
iii. See the "SQL on Hadoop – Analyzing Big Data with Hive" and "Pig Latin: Getting Started" courses for more
9. HBase
a. HBase is a NoSQL, distributed and scalable database built on top of Hadoop, providing real-time read/write access to data in Hadoop
b. HBase can "host data with billions of rows and millions of columns"
c. HBase works in highly distributed and scalable environments
d. The HBase interactive shell can be started with the command "hbase shell"
e. Basic commands
i. To create a table 'user' with a column family 'username': create 'user','username'
ii. To view the contents of the 'user' table, just type scan 'user'
iii. To put the value 'Raj' into column family 'username' for row 'r1', the command is: hbase> put 'user','r1','username','Raj'
iv. To update 'Raj' to 'Rajendra': hbase> put 'user','r1','username','Rajendra'
v. To drop the table: disable 'user' then drop 'user'
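The same commands can be run non-interactively from bash by feeding them to hbase shell, as in this sketch (the table and column family names follow the example above):
hbase shell <<'EOF'
create 'user', 'username'
put 'user', 'r1', 'username', 'Raj'
put 'user', 'r1', 'username', 'Rajendra'
scan 'user'
disable 'user'
drop 'user'
EOF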
f. We can use Pig to load/unload HDFS files into/from HBase
10. We can use Bash shell scripting to automate HDFS file management tasks, for example the sketch below.
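A minimal sketch of such a script, which archives yesterday's local application logs into a dated HDFS directory (all paths and the log naming convention are assumptions):
#!/bin/bash
# Sketch: copy yesterday's local application logs into HDFS.
day=$(date -d "yesterday" +%F)      # e.g. 2015-06-30
src=/var/log/myapp                  # hypothetical local log directory
dest=/user/raj/logs/$day            # hypothetical HDFS archive directory

hdfs dfs -mkdir -p "$dest"
for f in "$src"/*"$day"*.log; do
  [ -e "$f" ] || continue           # nothing matched the glob
  hdfs dfs -put "$f" "$dest"/
done
hdfs dfs -ls "$dest"                # verify the upload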