1. Map logic is coded in a class that extends the standard Mapper class.
2. The map class takes an input <key, value> pair and produces an output <key, value> pair. The input <key, value> is determined by how the input file on disk is read. Typically input files are text files; the framework reads the input text file line by line, assigning the byte offset of the line to the key and the line itself to the value. The output key/value depends on how you process the input key/value pairs.
3. Reduce logic is coded in a class that extends the Reducer class. The output key/value types of the map class should match the input key/value types of the reduce class.
4. The Main class is the driver program. This is the class that sets up the MapReduce job: in it we need to specify the map class and the reduce class, the input file path, the output file path, etc.
5. Datatypes
a. LongWritable is the Hadoop wrapper for the long datatype. It implements the Writable interface provided by the Hadoop library, which allows data to be serialized to and deserialized from the storage system. All keys should be writable to disk and readable from disk, which is why keys always have to be Writable classes (see the round-trip sketch below).
b. Text is the string type. It is a Hadoop Writable class that wraps a Java String. The Text datatype makes it easy to interoperate with other tools that understand UTF-8.
c. IntWritable is the Hadoop wrapper for int, just like the other Hadoop wrappers.
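As a quick illustration of what the Writable interface provides, here is a minimal round-trip sketch (my own example, not code from this post) that serializes a Text key and an IntWritable value to a byte stream and reads them back:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableRoundTrip {
    public static void main(String[] args) throws IOException {
        Text key = new Text("hadoop");
        IntWritable value = new IntWritable(42);

        // Serialize: the Writable interface's write(DataOutput) method
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        key.write(out);
        value.write(out);
        out.flush();

        // Deserialize: the Writable interface's readFields(DataInput) method
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        Text keyCopy = new Text();
        IntWritable valueCopy = new IntWritable();
        keyCopy.readFields(in);
        valueCopy.readFields(in);

        System.out.println(keyCopy + " -> " + valueCopy.get());   // prints: hadoop -> 42
    }
}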
6. Within the map class, we will have to override the map method.
a. The map method takes 3 parameters: the input key, the input value, and the context.
i. The input key and value types are the ones given in the Mapper class's type parameters.
ii. The context is the bridge to the framework; everything written to it goes through the internal shuffle and sort before being passed to the reduce class.
iii. The output of the map method should be written to the context using context.write(key, value) in order to pass it to the reduce class (see the Mapper sketch below).
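To make the shape of the map class concrete, here is a minimal word-count style Mapper sketch (my own example, not code from this post), assuming the default line-by-line text input described in point 2:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key = byte offset of the line (LongWritable), input value = the line (Text);
// output key = a word (Text), output value = the count 1 (IntWritable).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line into words and emit <word, 1> for each one.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // hand the pair to the framework
            }
        }
    }
}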
7. Behind the scenes, the MapReduce framework collects all of the map class output, finds all values with the same key, groups them together, and offers each group as a collection input to the reduce class (this is what happens between map and reduce: shuffle and sort).
8. The reduce class input types should match the output types of the map class. The reduce class takes a key and a value as input, but the value will be a collection. This is because the MapReduce framework has collected all of the map class output and, through shuffle and sort, produced each key with a collection of its values. Therefore the value parameter of the reduce class should be declared as an Iterable.
9. Within the reduce class, we will have to override the reduce method.
a. The reduce method takes 3 parameters: the input key, an Iterable of values, and the context.
i. The context is the same as in the map class: it is an object that provides a bridge to the external framework. In simple words, it stores the final output that the MapReduce job should produce, so at the end of the reduce method we need to call context.write() to write to the MapReduce job output (see the Reducer sketch below).
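Here is a minimal Reducer sketch to match the Mapper above (again my own example). Its input key/value types (Text, IntWritable) match the mapper's output types, and the framework hands it one key plus an Iterable of all the values collected for that key:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();             // add up all the 1s emitted for this word
        }
        total.set(sum);
        context.write(key, total);          // final <word, count> written to job output
    }
}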
10. The Main class extends the Configured class and implements the Tool interface.
a. Configured is the base class for any application in the Hadoop framework that has configurable parameters.
b. The Tool interface essentially says that any application run in the Hadoop environment will pick up the command line arguments that it needs.
c. Tool, along with the ToolRunner class, accepts application-specific command line arguments.
d. We will have to override the run method, which comes from the Tool interface.
i. In the run method, the MapReduce job is configured and instantiated (see the driver sketch below).
ii. The ToolRunner class extracts the command line arguments and sets the MapReduce job running.
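Here is a minimal driver sketch (my own example, wiring in the hypothetical WordCountMapper and WordCountReducer classes from the earlier sketches): it extends Configured, implements Tool, configures the job inside run(), and hands the command line to ToolRunner in main():

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Configure the MapReduce job: mapper, reducer, output types, and paths.
        Job job = Job.getInstance(getConf(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input file path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner picks up the generic Hadoop options, then calls run() with the rest.
        System.exit(ToolRunner.run(new WordCountDriver(), args));
    }
}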
11. In terms of parallel processing:
a. The number of map tasks is determined by the number of input splits of the file, which is driven by how the underlying file system stores it (typically one split per block). Unlike the number of reducers, it is not set directly in the driver.
b. We can configure the number of reduce tasks. The default is 1. This can be set in the Main class using job.setNumReduceTasks(n);
c. If we configure more than 1 reduce task, the partitioner internal to the MapReduce framework assigns each key of the map output to a partition, and each partition is processed by one reduce task.
d. The internal partitioner ensures that all occurrences of the same key go to the same partition, i.e. the same key cannot be sent to multiple reducers (see the partitioner sketch below).
e. Finally, the output of all reduce tasks is combined to give the final MapReduce output.
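The sketch below (my own example) shows the kind of logic Hadoop's default HashPartitioner applies: hash the key and take it modulo the number of reduce tasks. Because the result depends only on the key, every record with the same key lands in the same partition and therefore on the same reducer.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Same key -> same hash -> same partition -> same reduce task.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

In the driver this would go together with something like job.setNumReduceTasks(4) and job.setPartitionerClass(WordPartitioner.class); since the default partitioner already behaves this way, a custom class is only needed when you want different routing.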
12. To improve performance on the map side of the job, we can create a combiner.
a. A combiner does the part of the reduce work that can be performed on individual map nodes.
b. For example, if we are calculating a sum, the combiner can perform the sum on each map node's output and pass the partial sum to the reducer, which means the reducers have less to do.
c. To set the combiner class, all we have to do is set a parameter in the Main class: job.setCombinerClass(Reduce.class);
d. Combiners increase parallelism and reduce the amount of data transferred over the network.
e. In most cases we can use the reduce class as the combiner, but in some cases, like AVG, we cannot use the reduce class as the combiner, so it has to be a different class (see the sketch below).
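To see why the AVG case breaks, here is a tiny self-contained sketch (my own example) showing that averaging the per-node averages is not the same as the overall average, which is why an AVG combiner has to emit partial sums and counts instead of reusing the reduce class:

public class AvgCombinerPitfall {
    public static void main(String[] args) {
        // Suppose values 1 and 2 are processed on one map node, and value 3 on another.
        double avgNode1 = (1 + 2) / 2.0;                 // 1.5
        double avgNode2 = 3 / 1.0;                       // 3.0

        double avgOfAvgs = (avgNode1 + avgNode2) / 2.0;  // 2.25 -> wrong
        double trueAvg = (1 + 2 + 3) / 3.0;              // 2.0  -> correct

        System.out.println(avgOfAvgs + " vs " + trueAvg);

        // The fix: the combiner emits a (sum, count) pair per key, and the reducer
        // divides the total sum by the total count.
    }
}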