1. BigData basic requirements are
a. Storage – Store very big files
b. Process – Process very big files in a timely manner
c. Scale – Scale up computing power without impacting the application
2. BigData systems are built on distributed computing, i.e. storing a single file's data across multiple computers/nodes, processing the data on multiple nodes, then coordinating the computation across all the nodes into one final result set.
3. Unlike a traditional monolithic system (everything on a single machine), a distributed computing system has multiple systems called nodes, i.e. a cluster of machines. Scalability can be achieved by simply adding nodes; it is not the same with a monolithic system, because even if we upgrade the hardware, that does not guarantee an increase in computing power.
4. In a distributed computing system, we need software to coordinate all the nodes in the cluster: partitioning data, replicating data, coordinating computing tasks, handling fault tolerance, etc.
5. Google developed proprietary software, Google File System and MapReduce, to handle huge data on a distributed computing system. Google then released white papers on this software.
6. Apache released open source equivalents of the Google software: the equivalent of Google File System is HDFS, and the equivalent of Google MapReduce is Hadoop MapReduce.
7. HDFS + MapReduce + YARN is Hadoop
a. HDFS is the file system that manages the storage of data (a small client sketch follows this list)
b. MapReduce is the framework to define the data processing across multiple servers (to define it, not to run it)
c. YARN is the framework that runs the data processing tasks across multiple nodes and manages resources, memory, etc.
d. MapReduce defines what data is processed and how; YARN does not care what the data task is about, it only cares about running the task and seeing it through to completion
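To make 7a concrete, here is a minimal Java sketch of a client writing a file into HDFS and reading back its block size and replication factor using Hadoop's FileSystem API. The NameNode address (hdfs://localhost:9000) and the file path are assumptions for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS splits big files into blocks and replicates them across nodes
            Path path = new Path("/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeUTF("hello HDFS");
            }

            // Block size and replication are the knobs behind storing very big files reliably
            FileStatus status = fs.getFileStatus(path);
            System.out.println("Block size:  " + status.getBlockSize());
            System.out.println("Replication: " + status.getReplication());
        }
    }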
8. Series of steps that happen when we submit a job to Hadoop
a. (MapReduce) The user defines map and reduce tasks using the MapReduce API, which is available in Java and also other programming languages. MapReduce defines what computations we need to perform on the data. The MapReduce code is packaged into jobs, and jobs are triggered on the Hadoop cluster (see the WordCount sketch after this list).
b. (YARN) The job is triggered on the cluster using YARN. YARN checks whether the nodes in the cluster have resources available to run the job, then figures out which nodes to use for job execution.
c. (HDFS) YARN runs the job, and the data that results from the job is stored in HDFS.
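As a concrete example of step 8a, below is the classic WordCount job written against the Hadoop MapReduce Java API: the mapper emits a (word, 1) pair per word, the reducer sums the counts, and main() packages everything into a Job submitted to the cluster. Input and output paths come from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every word in this node's input split
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts collected for each word across all mappers
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }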
9. Some popular technologies in the Hadoop distributed computing system are
a. Hive – SQL-like query interface for Hadoop
i. The bridge to Hadoop for folks who do not have exposure to OOP in Java
ii. It converts SQL-like queries into MapReduce behind the scenes (see the JDBC sketch after this list)
b. HBase – a different kind of database
i. HBase is built on top of HDFS to allow low-latency operations on key-value pairs (see the put/get sketch after this list)
ii. Integrates with an application just like a traditional database
c. Pig – a way to convert unstructured data into a structured format
i. A scripting language for data manipulation of unstructured data like logs
d. Oozie – Workflow Management System
e. Flume/Sqoop – tools to put data into and get data out of Hadoop
i. They allow transferring data from Hadoop to traditional databases and vice versa
f. Spark – a way to perform complex transformations in a functional way on BigData
i. Has interfaces in languages like Python/Scala (see the word-count sketch after this list)
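A small application-side sketch for 9a: querying Hive through the HiveServer2 JDBC driver, so the SQL-like query below is compiled into distributed work behind the scenes. The connection URL, the empty credentials, and the words table are assumptions for illustration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryDemo {
        public static void main(String[] args) throws Exception {
            // The Hive JDBC driver must be on the classpath
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://localhost:10000/default"; // assumed host/port/database
            try (Connection conn = DriverManager.getConnection(url, "", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")) {
                while (rs.next()) {
                    System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
                }
            }
        }
    }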
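For 9b, a minimal sketch of key-value access with the HBase Java client: one put followed by a point get on a row key, the kind of low-latency operation HBase is built for. The users table and its info column family are assumptions; the client reads the cluster location from hbase-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePutGetDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) { // assumed table
                // Write one cell: row key "user1", column family "info", qualifier "name"
                Put put = new Put(Bytes.toBytes("user1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
                table.put(put);

                // Point lookup by row key
                Result result = table.get(new Get(Bytes.toBytes("user1")));
                byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(name));
            }
        }
    }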
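For 9f, Spark also ships a Java API alongside the Python/Scala interfaces; here is a sketch of word count expressed as functional transformations on an RDD. The local[*] master and the command-line paths are assumptions for illustration; on a real cluster the master comes from the environment and the paths would typically point into HDFS.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("spark word count").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile(args[0]);
                JavaPairRDD<String, Integer> counts = lines
                        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // words
                        .mapToPair(word -> new Tuple2<>(word, 1))                      // (word, 1)
                        .reduceByKey(Integer::sum);                                    // sum per word
                counts.saveAsTextFile(args[1]);
            }
        }
    }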
10. Hadoop can be installed in Standalone mode, Pseudo-Distributed mode, or Fully Distributed mode
a. Standalone is the default mode and runs on a single node. It uses the local file system, not HDFS, and YARN does not run. It is useful for testing purposes.
b. Pseudo-Distributed mode lies between Standalone and Fully Distributed. It simulates two nodes, a master and a slave, on one machine; HDFS is used and YARN also runs. (A small mode-check sketch follows this list.)
c. Fully Distributed mode is for production environments or prod-like test environments. The nodes could be physically separate machines or in the cloud. Setting it up is a non-trivial task with a lot of configuration required; enterprise distributions like Cloudera provide preconfigured Hadoop.
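A tiny sketch related to item 10: a Hadoop client decides which file system to talk to from the fs.defaultFS setting, which is exactly what differs between Standalone mode (local file system) and the distributed modes (HDFS). The hdfs://localhost:9000 value is an assumed Pseudo-Distributed setting.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ModeCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Standalone (default): fs.defaultFS is file:/// -> local file system, no HDFS
            // Pseudo/Fully Distributed: point it at a NameNode, e.g. (assumed address):
            // conf.set("fs.defaultFS", "hdfs://localhost:9000");
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Default file system: " + fs.getUri());
        }
    }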