
Sunday 17 December 2017

BigData - MapReduce programming concepts

1. Map logic is coded in a class that extends the standard Mapper class.
2. The Map class takes input <key, value> pairs and produces output <key, value> pairs. The input <key, value> is determined by how the input file on disk is read. Typically input files are text files: the framework reads the input text file line by line, assigning the position (byte offset) of each line to the key and the line itself to the value. The output key and value depend on how the input <key, value> pairs are processed.
3. Reduce logic is coded in a class that extends the Reducer class. The output <key, value> types of the Map class must match the input <key, value> types of the Reduce class.
4. The Main class is the driver program: it sets up the MapReduce job. Here we specify the Map class and the Reduce class, the input file path, the output file path, and so on.
5. Datatypes (a small demo of these wrappers follows this item)
   a. LongWritable is the Hadoop wrapper for the Java long datatype. It implements the Writable interface provided by the Hadoop library, which allows data to be serialized to and deserialized from the storage system. All keys must be writable to disk and readable from disk, which is why keys always have to be Writable classes.
   b. Text is the Hadoop Writable class that wraps a Java String. The Text datatype makes it easy to interoperate with other tools that understand UTF-8.
   c. IntWritable is the Hadoop wrapper for the Java int, just like the other Hadoop wrappers.
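A minimal sketch of these wrapper types in action (the class name and values below are purely illustrative, not part of any real job):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Illustration only: wrapping and unwrapping Hadoop Writable types
public class WritableTypesDemo {
    public static void main(String[] args) {
        LongWritable offset = new LongWritable(0L);   // e.g. byte offset of a line
        Text line = new Text("hello world");          // UTF-8 backed wrapper around String
        IntWritable one = new IntWritable(1);         // e.g. a count of 1

        // unwrap back to plain Java types
        long l = offset.get();
        String s = line.toString();
        int i = one.get();
        System.out.println(l + " | " + s + " | " + i);
    }
}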
6. Within the Map class, we have to override the map method (a minimal mapper sketch follows this item).
   a. The map method takes 3 parameters: the input key, the input value, and the context.
      i. The input key and value types come from the class's generic type parameters.
      ii. The context is the bridge to the framework; the internal shuffle-and-sort logic happens behind it before data is passed to the Reduce class.
      iii. The output of the map method must be written to the context using context.write(key, value) in order to pass it to the Reduce class.
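For illustration, here is a minimal word-count-style mapper; the class name, tokenising logic, and output types are my own choices rather than anything prescribed above.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key = byte offset of the line, input value = the line itself.
// Output key = a word, output value = the count 1 for that occurrence.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // hand <word, 1> to the framework
            }
        }
    }
}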
7. Behind the scenes, the MapReduce framework collects all of the Map class output, finds all values that share the same key, groups them together, and offers each group as a collection to the Reduce class (this is the shuffle and sort that happens via the context).
8. The input of the Reduce class must match the output of the Map class. The Reduce class takes a key and a value as input, but the value is a collection, because the framework has already grouped all map output values by key during shuffle and sort. Therefore the value parameter of the Reduce class must be declared as an Iterable.
9. Within the Reduce class, we have to override the reduce method (a minimal reducer sketch follows this item).
   a. The reduce method takes 3 parameters: the input key, an Iterable of values, and the context.
      i. The context is the same as in the Map class: an object that provides the bridge to the framework. In simple terms, it holds the final output that the MapReduce job should produce, so at the end of the reduce method we call context.write() to write to the job output.
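A matching word-count-style reducer sketch (again, the names are illustrative); it sums the grouped values and writes the result to the context:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: <word, [1, 1, 1, ...]> as grouped by shuffle and sort.
// Output: <word, total count> written to the job output via the context.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));   // final <word, count> for this key
    }
}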
10. The Main class extends the Configured class and implements the Tool interface (a driver sketch follows this item).
   a. Configured is the base class for any application in the Hadoop framework that has configurable parameters.
   b. The Tool interface essentially ensures that any application run in the Hadoop environment picks up the right command line arguments it needs.
   c. Tool, together with the ToolRunner class, accepts application-specific command line arguments.
   d. We have to override the run method, which comes from the Tool interface.
      i. In the run method, the MapReduce job is configured and instantiated.
      ii. The ToolRunner class extracts the command line arguments and sets the MapReduce job running.
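A minimal driver sketch tying the pieces together; the class names and the "word count" job name are illustrative, and the input and output paths are assumed to arrive as args[0] and args[1]:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // args[0] = input path, args[1] = output path
        // (generic -D options have already been handled by ToolRunner)
        Job job = Job.getInstance(getConf(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new WordCountDriver(), args));
    }
}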
11. In terms of parallel processing:
   a. The number of map tasks is determined by the number of input splits of the file, which comes from the underlying file system; this is not set in the driver and cannot be altered directly.
   b. We can configure the number of reduce tasks; the default is 1. This can be set in the Main class using job.setNumReduceTasks(n); (a short partitioning sketch follows this item).
   c. If we configure more than one reduce task, the partitioning step internal to the MapReduce framework assigns each key of the map output to a partition, and each partition is processed by one reduce task.
   d. The internal partitioning ensures that all occurrences of the same key go to the same partition, i.e. the same key cannot be sent to multiple reducers.
   e. Finally, the output of all reduce tasks is combined to give the final MapReduce output.
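To illustrate the key-to-partition assignment: the partitioner below mirrors what Hadoop's default HashPartitioner does, so every occurrence of a key lands on the same reducer. In the driver you would call job.setNumReduceTasks(4) (for example) and, only if the default behaviour does not fit, job.setPartitionerClass(WordPartitioner.class); the class name here is illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Mirrors the default HashPartitioner: same key -> same partition -> same reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}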
12. To improve performance on the map side, we can create a Combiner.
   a. The Combiner performs the part of the reduce task that can be done on individual map nodes.
   b. For example, if we are calculating a sum, the combiner can sum each map node's output and pass the partial sums to the reducer, which means the reducers have less to do.
   c. To set the combiner class, all we have to do is set a parameter in the Main class: job.setCombinerClass(Reduce.class); (see the snippet after this list).
   d. Combiners increase parallelism and reduce network data transfer.

   e. In most cases we can use the reduce class as the combiner, but in some cases, such as AVG, we cannot use the reduce class as the combiner, so it has to be a different class.
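In the word-count sketches above, the reducer just sums values, so it can safely double as the combiner; wiring it up is one extra line in the driver's run method (this assumes the illustrative class names used earlier):

// In WordCountDriver.run(), after setMapperClass/setReducerClass:
job.setCombinerClass(WordCountReducer.class);   // safe because sum is associative
// For something like AVG, a separate combiner would emit partial <sum, count>
// pairs and the reducer would merge them, as noted in point 12.e above.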
