
Friday 15 December 2017

BigData Hadoop - Basics

1. BigData's basic requirements are:
    a. Storage – store very big files
    b. Processing – process very big files in a timely manner
    c. Scale – scale up computing power without impacting the application
2. BigData systems are built on distributed computing: a single file's data is stored across multiple computers/nodes, the data is processed on multiple nodes in parallel, and the computation across all the nodes is coordinated to produce one final result set.
3. Unlike a traditional monolithic system (everything on a single machine), a distributed computing system has multiple machines, called nodes, forming a cluster. Scalability is achieved by simply adding nodes. The same is not true of a monolithic system: adding hardware does not guarantee an increase in computing power.
4. In a distributed computing system, we need software to coordinate all the nodes in the cluster: partitioning data, replicating data, coordinating computing tasks, handling fault tolerance, etc.
5. Google developed proprietary software – Google File System and MapReduce – to handle huge data on a distributed computing system, and later released white papers describing it.
6. Apache released open-source equivalents of Google's software: the equivalent of Google File System is HDFS, and the equivalent of Google's MapReduce is Hadoop MapReduce.
7. HDFS + MapReduce + YARN is Hadoop:
    a. HDFS is the file system that manages the storage of data (a minimal HDFS client sketch follows this list)
    b. MapReduce is the framework for defining the data processing (not executing it) across multiple servers
    c. YARN is the framework that runs the data processing tasks across multiple nodes and manages resources, memory, etc.
    d. MapReduce defines what data is processed and how; YARN does not care what the data task is about – it only cares about running the task and seeing it through to completion
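For illustration, here is a minimal sketch of writing a file to HDFS through the standard Java FileSystem API. The NameNode address (hdfs://localhost:9000) and the path /user/demo/hello.txt are assumptions for a local pseudo-distributed setup, not values from this post.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address for a local pseudo-distributed install
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical target path in HDFS
        Path path = new Path("/user/demo/hello.txt");
        // Create the file and write some bytes; HDFS handles the block
        // placement and replication across nodes behind the scenes
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("Hello HDFS");
        }
        fs.close();
    }
}
```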
8. The series of steps that happen when we submit a job to Hadoop (a minimal example follows this list):
    a. (MapReduce) The user defines map and reduce tasks using the MapReduce API, which is available in Java and other programming languages. MapReduce defines what computations we need to perform on the data. The MapReduce code is packaged into jobs, and jobs are triggered on the Hadoop cluster.
    b. (YARN) The job is triggered on the cluster using YARN. YARN checks whether the nodes in the cluster have resources available to run the job, then figures out which nodes to use for job execution.
    c. (HDFS) YARN runs the job, and the data that results from the job is stored in HDFS.
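To make steps a–c concrete, below is the classic word-count job, written against the standard Java MapReduce API. The map and reduce classes define the computation (step a), the Job object is what gets packaged and triggered on the cluster for YARN to schedule (step b), and the input/output paths point at HDFS (step c). The paths are passed as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: emits (word, 1) for every word in its input split
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce task: sums the counts collected for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The "job" that gets packaged and triggered on the cluster
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input read from HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // results written back to HDFS
        // YARN schedules the map and reduce tasks and sees them through to completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```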
9. Some popular technologies in the Hadoop distributed computing ecosystem are:
    a. Hive – SQL-like query interface for Hadoop
        i. The bridge to Hadoop for folks who do not have exposure to OOP in Java
        ii. It converts SQL-like queries into MapReduce jobs behind the scenes
    b. HBase – a different kind of database (see the client sketch after this list)
        i. HBase is built on top of HDFS to allow low-latency operations on key-value pairs
        ii. It integrates with applications just like a traditional database
    c. Pig – a way to convert unstructured data into a structured format
        i. A scripting language for manipulating unstructured data such as logs
    d. Oozie – a workflow management system
    e. Flume/Sqoop – tools for putting data into and getting data out of Hadoop
        i. They allow transferring data from Hadoop to traditional databases and vice versa
    f. Spark – a way to perform complex transformations on BigData in a functional style
        i. It has interfaces to languages like Python and Scala
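As an illustration of point 9b, here is a minimal sketch of low-latency key-value writes and reads with the standard HBase Java client. The table name "users" and the column family "profile" are hypothetical, and the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             // Hypothetical table "users" with column family "profile"
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one key-value pair (low-latency put by row key)
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back by row key (low-latency get)
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] name = result.getValue(Bytes.toBytes("profile"),
                                          Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```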
10. Hadoop can be installed in standalone mode, pseudo-distributed mode, or fully distributed mode:
    a. Standalone is the default mode and runs on a single machine. It uses the local file system (not HDFS), and YARN does not run; it is useful for testing.
    b. Pseudo-distributed mode lies between standalone and fully distributed. It simulates a cluster on a single machine, with master and slave daemons running as separate processes; HDFS is used and YARN also runs (a sample configuration follows this list).
    c. Fully distributed mode is for production or production-like test environments. The nodes can be physically separate machines or cloud instances. Setting it up is a non-trivial task with a lot of configuration required; enterprise distributions like Cloudera provide preconfigured Hadoop.
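As an example of the configuration involved, the standard pseudo-distributed setup from the Hadoop single-node documentation points the default file system at a local NameNode and drops the replication factor to 1 (a single machine cannot hold multiple replicas). The port and values below are the documented defaults and may vary by version:

```xml
<!-- core-site.xml: default file system points at a NameNode on this machine -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: only one node, so keep a single replica of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```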

