1. Spark architecture consists of a Driver and Workers. The Driver acts as the master and manages tasks, scheduling, data locality, etc.
2. The driver (master) maintains the SparkContext. Each application needs to instantiate its own context and operate within that context.
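A minimal sketch of setting up a context in a standalone application (the app name and the local master URL are placeholder assumptions; in spark-shell, sc is already created for you):

    import org.apache.spark.{SparkConf, SparkContext}

    // Each application builds its own configuration and context.
    val conf = new SparkConf()
      .setAppName("notes-app")   // hypothetical application name
      .setMaster("local[*]")     // placeholder: run locally on all cores
    val sc = new SparkContext(conf)

    // ... build and operate on RDDs via sc ...

    sc.stop()                    // release the context when done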
3. RDD – Resilient Distributed Dataset – a collection of elements partitioned across the nodes of a cluster that can be operated on in parallel.
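For example, in spark-shell (the partition count of 4 is an arbitrary choice for illustration):

    val data = sc.parallelize(1 to 1000, 4)  // 4 partitions spread across the cluster
    data.getNumPartitions                    // Int = 4
    data.map(_ * 2).sum()                    // partitions are processed in parallel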
4. Transformations (like map, filter, etc.) are evaluated lazily, while actions (collect, count, reduce, etc.) are evaluated immediately.
5. This lazy evaluation allows Spark to record the functional instructions in a DAG for later use.
6. This DAG (Directed Acyclic Graph) grows continuously with the functional lineage; then, at the time of an action, tasks are distributed to the workers using this functional lineage from the DAG.
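A small spark-shell sketch of the lazy evaluation and lineage described in items 4–6:

    val nums    = sc.parallelize(1 to 100)   // initial RDD
    val evens   = nums.filter(_ % 2 == 0)    // transformation: nothing runs yet
    val doubled = evens.map(_ * 2)           // transformation: still lazy

    doubled.toDebugString                    // prints the lineage (the DAG built so far)
    doubled.count()                          // action: triggers the actual distributed job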
7. Loading Data
a. Spark supports Amazon S3, HDFS, and many databases… as well as many data serialization techniques and file formats like Avro and Parquet.
b. SparkContext is the starting point for loading data into an initial RDD.
c. sc.parallelize(1 to 100) distributes a range sequence from 1 to 100.
d. res0.collect collects this sequence back to the driver, e.g., to display it in the console.
e. To see Spark methods starting with the first few characters in spark-shell, press Tab once for auto-completion, then Tab again to see the method signature.
f. sc.makeRDD and sc.range are a few other in-memory loading methods.
g. textFile, wholeTextFiles, sequenceFile, and objectFile are some file-loading methods implemented from the generic hadoopFile method.
h. An even more generic load method is hadoopRDD. The difference between hadoopFile and hadoopRDD is that hadoopRDD accepts a job configuration (JobConf) parameter and does not accept a path; using hadoopRDD, any file/data can be loaded, with the input file or format supplied via the job configuration (see the sketch after this list).
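A combined sketch of the loading methods above (all file paths are hypothetical placeholders):

    // In-memory loading
    val r1 = sc.parallelize(1 to 100)
    val r2 = sc.makeRDD(Seq("a", "b", "c"))
    val r3 = sc.range(1, 100)

    // File loading (backed by the generic hadoopFile machinery)
    val lines = sc.textFile("hdfs:///data/input.txt")    // RDD[String], one element per line
    val files = sc.wholeTextFiles("hdfs:///data/dir")    // RDD[(fileName, fileContent)]

    // hadoopRDD: no path argument; the input comes from the JobConf
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

    val jobConf = new JobConf()
    FileInputFormat.setInputPaths(jobConf, "hdfs:///data/input.txt")
    val raw = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
                           classOf[LongWritable], classOf[Text])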
8. Transformations
a. Transformations are lazily evaluated. They are a collection of methods (not actions) that run a set of functions on the data to transform it into the target format.
b. Transformations return another RDD; the lazily built graph of operations (the DAG) is acted upon when an action is performed.
c. Some transformations are map, filter, etc.
d. RDDs can be combined using different transformations like rddname.union, .intersection, .subtract, .cartesian, rddname1 ++ rddname2, etc. (see the sketch below).
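A sketch of combining two RDDs (the values are arbitrary):

    val a = sc.parallelize(Seq(1, 2, 3, 4))
    val b = sc.parallelize(Seq(3, 4, 5, 6))

    a.union(b).collect()         // Array(1, 2, 3, 4, 3, 4, 5, 6) -- union keeps duplicates
    (a ++ b).collect()           // ++ is an alias for union
    a.intersection(b).collect()  // Array(3, 4), element order may vary
    a.subtract(b).collect()      // Array(1, 2), element order may vary
    a.cartesian(b).count()       // 16 -- every pair from a x b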
9. Actions
a. Actions do not return an RDD and are evaluated immediately.
b. .collect collects the entire RDD into the driver (the master node); this could cause memory exceptions if the final RDD does not fit in the driver's memory.
c. .take(5) takes only the first 5 items; similarly, we can take a specific number of items by passing that number to the take action (see the sketch below).
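A sketch of these actions (run against a small RDD so collect is safe):

    val nums = sc.parallelize(1 to 100)

    nums.count()        // 100 -- evaluated immediately
    nums.reduce(_ + _)  // 5050
    nums.take(5)        // Array(1, 2, 3, 4, 5) -- ships only 5 items to the driver
    nums.collect()      // the entire RDD in the driver; risky if it does not fit in memory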