1. Spark architecture consists of a Driver and Workers. The Driver acts as the master and manages tasks, scheduling, data locality, etc.
2. The driver (master) maintains the SparkContext. Each application needs to instantiate its own context and operate within that context.
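A minimal sketch of setting up a context in a standalone application (the app name and the local master URL are placeholder assumptions; in spark-shell, sc is already created for you):

    import org.apache.spark.{SparkConf, SparkContext}

    // Each application builds its own configuration and context.
    val conf = new SparkConf()
      .setAppName("notes-app")   // hypothetical application name
      .setMaster("local[*]")     // placeholder: run locally on all cores
    val sc = new SparkContext(conf)

    // ... build and operate on RDDs via sc ...

    sc.stop()                    // release the context when done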
3. RDD – Resilient Distributed Dataset – a collection of elements partitioned across the nodes of a cluster that can be operated on in parallel.
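For example, in spark-shell (the partition count of 4 is an arbitrary choice for illustration):

    val data = sc.parallelize(1 to 1000, 4)  // 4 partitions spread across the cluster
    data.getNumPartitions                    // Int = 4
    data.map(_ * 2).sum()                    // partitions are processed in parallel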
4. Transformations (like map, filter, etc.) are evaluated lazily, while actions (collect, count, reduce, etc.) are evaluated immediately.
5. This lazy evaluation allows Spark to record the functional instructions in a DAG for later use.
6. This DAG (Directed Acyclic Graph) grows continuously with the functional lineage; then, at the time of an action, tasks are distributed to the workers using this functional lineage from the DAG.
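A small spark-shell sketch of the lazy evaluation and lineage described in items 4–6:

    val nums    = sc.parallelize(1 to 100)   // initial RDD
    val evens   = nums.filter(_ % 2 == 0)    // transformation: nothing runs yet
    val doubled = evens.map(_ * 2)           // transformation: still lazy

    doubled.toDebugString                    // prints the lineage (the DAG built so far)
    doubled.count()                          // action: triggers the actual distributed job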
7. Loading Data
a. Spark supports Amazon S3, HDFS, and many databases… as well as many data serialization techniques and file formats like Avro and Parquet.
b. SparkContext is the starting point for loading data into an initial RDD.
c. sc.parallelize(1 to 100) distributes a range sequence from 1 to 100.
d. res0.collect collects this sequence back to the driver, e.g., to display it in the console.
e. To see Spark methods starting with the first few characters in spark-shell, press Tab once for auto-completion, then Tab again to see the method signature.
f. sc.makeRDD and sc.range are a few other in-memory loading methods.
g. textFile, wholeTextFiles, sequenceFile, and objectFile are some file-loading methods implemented from the generic hadoopFile method.
h. An even more generic load method is hadoopRDD. The difference between hadoopFile and hadoopRDD is that hadoopRDD accepts a job configuration (JobConf) parameter and does not accept a path; using hadoopRDD, any file/data can be loaded, with the input file or format supplied via the job configuration (see the sketch after this list).
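A combined sketch of the loading methods above (all file paths are hypothetical placeholders):

    // In-memory loading
    val r1 = sc.parallelize(1 to 100)
    val r2 = sc.makeRDD(Seq("a", "b", "c"))
    val r3 = sc.range(1, 100)

    // File loading (backed by the generic hadoopFile machinery)
    val lines = sc.textFile("hdfs:///data/input.txt")    // RDD[String], one element per line
    val files = sc.wholeTextFiles("hdfs:///data/dir")    // RDD[(fileName, fileContent)]

    // hadoopRDD: no path argument; the input comes from the JobConf
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

    val jobConf = new JobConf()
    FileInputFormat.setInputPaths(jobConf, "hdfs:///data/input.txt")
    val raw = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
                           classOf[LongWritable], classOf[Text])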
8. Transformations
a. Transformations are lazily evaluated. They are a collection of methods (not actions) that run a set of functions on the data to transform it into the target format.
b. Transformations return another RDD; the lazily built graph of operations (the DAG) is acted upon when an action is performed.
c. Some transformations are map, filter, etc.
d. RDDs can be combined using different transformations like rddname.union, .intersection, .subtract, .cartesian, rddname1 ++ rddname2, etc. (see the sketch below).
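A sketch of combining two RDDs (the values are arbitrary):

    val a = sc.parallelize(Seq(1, 2, 3, 4))
    val b = sc.parallelize(Seq(3, 4, 5, 6))

    a.union(b).collect()         // Array(1, 2, 3, 4, 3, 4, 5, 6) -- union keeps duplicates
    (a ++ b).collect()           // ++ is an alias for union
    a.intersection(b).collect()  // Array(3, 4), element order may vary
    a.subtract(b).collect()      // Array(1, 2), element order may vary
    a.cartesian(b).count()       // 16 -- every pair from a x b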
9. Actions
a. Actions do not return an RDD and are evaluated immediately.
b. .collect collects the entire RDD into the driver (the master node); this could cause memory exceptions if the final RDD does not fit in the driver's memory.
c. .take(5) takes only the first 5 items; similarly, we can take a specific number of items by passing that number to the take action (see the sketch below).
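A sketch of these actions (run against a small RDD so collect is safe):

    val nums = sc.parallelize(1 to 100)

    nums.count()        // 100 -- evaluated immediately
    nums.reduce(_ + _)  // 5050
    nums.take(5)        // Array(1, 2, 3, 4, 5) -- ships only 5 items to the driver
    nums.collect()      // the entire RDD in the driver; risky if it does not fit in memory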