You can't really process a large data set (GBs/TBs in size) on a single application/machine. Instead, you run small chunks of the data set over multiple applications/machines in a distributed way. That, simply put, is what MapReduce is all about.
In HDFS we saw how data is stored in a distributed way. Using the MapReduce technique we will see how to process that data in a distributed, parallel way.
(MapReduce itself is written entirely in Java.)
It consists of two phases:
Map
Reduce
Map: takes a key/value pair and generates a set of intermediate key/value pairs.
•map(k1, v1) -> list(k2, v2)
Reduce: iterates over the values with the same key and generates an aggregated or consolidated output.
•Takes the intermediate values and associates them with the same intermediate key
•reduce(k2, list[v2]) -> list(k3, v3)
(input) <k1, v1> => map -> <k2, v2> => shuffle -> <k2, v2> => reduce -> <k3, v3> (output)
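To make these signatures concrete, here is a minimal word-count sketch using the org.apache.hadoop.mapreduce API (Hadoop 2.x style). The class names TokenizerMapper and IntSumReducer are just illustrative choices for this example, not anything Hadoop mandates.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(k1, v1) -> list(k2, v2): (line offset, line) -> (word, 1)
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);           // emit intermediate (k2, v2)
        }
    }
}

// reduce(k2, list[v2]) -> list(k3, v3): (word, [1, 1, ...]) -> (word, count)
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);             // emit final (k3, v3)
    }
}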
Here the keys are objects that implement WritableComparable; that's why keys can be compared with each other during sorting before being sent to the Reducer.
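For example, a custom key only has to implement write, readFields and compareTo. The YearMonthKey below is a hypothetical composite key written just for illustration; it is not part of the Hadoop library.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class YearMonthKey implements WritableComparable<YearMonthKey> {
    private int year;
    private int month;

    public YearMonthKey() { }                      // no-arg constructor required by Hadoop

    public YearMonthKey(int year, int month) {
        this.year = year;
        this.month = month;
    }

    @Override
    public void write(DataOutput out) throws IOException {     // serialization
        out.writeInt(year);
        out.writeInt(month);
    }

    @Override
    public void readFields(DataInput in) throws IOException {  // deserialization
        year = in.readInt();
        month = in.readInt();
    }

    @Override
    public int compareTo(YearMonthKey other) {     // drives the sort order of keys
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : Integer.compare(month, other.month);
    }
}

(For real jobs you would also override equals and hashCode so the default HashPartitioner groups equal keys together.)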
Before going into a detailed explanation, we will look at some of the classes involved in MapReduce programming.
First, the client submits a file, and FileInputFormat computes how many InputSplits it breaks into.
An InputSplit is nothing but a piece of a file that is going to be stored on a DataNode.
There are 3 common FileInputFormat types
TextInputFormat (default)
KeyValueTextInputFormat
SequenceFileInputFormat
TextInputFormat: takes the byte offset of each line as the key and the entire line as the value;
KeyValueTextInputFormat: takes everything before the first tab as the key and the rest of the line as the value;
SequenceFileInputFormat: reads files stored in Hadoop's binary SequenceFile format (key/value pairs serialized as bytes);
Similarly, there are 3 common OutputFormats
TextOutputFormat (default)
SequenceFileOutputFormat
NullOutputFormat
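Assuming the Hadoop 2.x mapreduce API, a driver would pick the input and output formats on the Job roughly as sketched below; the job name and paths are placeholders for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatConfigDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "format-demo");
        job.setJarByClass(FormatConfigDriver.class);

        // If these two lines are omitted, TextInputFormat and TextOutputFormat
        // are used by default.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // KeyValueTextInputFormat produces Text keys and Text values, which the
        // default (identity) Mapper and Reducer pass straight through.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}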
So an InputSplit will be processed by one of these InputFormats and given as input to the Mapper (Map). To read the split line by line and convert it into key/value pairs, the RecordReader comes into the picture.
LineRecordReader is the default RecordReader; it reads line by line, converts each line into a key/value pair and gives it as input to the Map.
With this we can say the number of input splits equals the number of Maps. So Hadoop controls the number of Maps to be launched based on the number of InputSplits.
Once the Maps get their input splits, they start processing in parallel on each DataNode, produce key/value pairs and write them to disk. This is part of what makes the MapReduce technique slow, because writing to disk costs you.
These key/value pairs are partitioned, shuffled and given as input to the Reducer. How many Reducers should run is something we can control. The Reducer generates key/value pairs and writes the output into HDFS.
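As a rough sketch, the reducer count is set explicitly on the Job; the value 4 below is an arbitrary example and the job name is a placeholder. The number of Maps, by contrast, follows the number of InputSplits.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer-count-demo");
        job.setNumReduceTasks(4);   // launch 4 reduce tasks for this job
        System.out.println("Configured reducers: " + job.getNumReduceTasks());
    }
}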
To recap: first the client submits the file; based on the file size, FileInputFormat creates InputSplits (logical splitting), which are loaded into the Mapper as key/value pairs by the RecordReader.
The default InputFormat is TextInputFormat and the default RecordReader is LineRecordReader.
The Mapper processes the input and writes its output to disk as key/value pairs in sorted order. These outputs are partitioned and shuffled (pairs with the same key are moved to one/the same reducer, and then sorting is applied again; this is the Partition and Shuffle step) and given as input to the Reducers (HashPartitioner is the default partitioner, and the partitioner drives the shuffling).
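If HashPartitioner's key.hashCode()-based routing is not what you want, a custom Partitioner can decide which reducer each key goes to. The FirstLetterPartitioner below is purely a hypothetical illustration of the idea, not something the framework provides.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;                            // no reducers configured
        }
        // Send words starting with 'a'..'m' to one partition and the rest to
        // another; the modulo keeps the result within the valid range.
        String k = key.toString();
        char first = k.isEmpty() ? 'a' : Character.toLowerCase(k.charAt(0));
        return (first <= 'm' ? 0 : 1) % numReduceTasks;
    }
}

It would be registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class).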
Finally, the Reducer produces key/value pair output, and the OutputFormat writes it into HDFS line by line using a RecordWriter.
The Map sorts its output locally; after the data is copied to the Reducer, all of the combined data is sorted again. So sorting happens on both the Mapper side and the Reducer side.
The Reducer has to process output files from many Maps, which is time consuming (usually Maps are far more numerous than Reducers), i.e. the reducer bottleneck issue. To avoid this situation we can configure the number of reducers explicitly, and in some other cases we use a Combiner.
A Combiner is nothing but a reducer that we configure to run on the Map side, after the Map processing and before the output is shuffled to the Reducers. Say the Maps produce a huge number of key/value pairs that the Reducer would otherwise have to process; the Combiner processes those values before they reach the Reducer.
The Combiner aggregates the Map results before they are sent across the network.
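Wiring this up is just one extra call on the Job. The sketch below reuses the TokenizerMapper and IntSumReducer classes sketched earlier and assumes, as in word count, that the reduce logic is a sum and is therefore safe to apply twice (once as a Combiner, once as the Reducer); the job name is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CombinerDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combiner-demo");
        job.setJarByClass(CombinerDemo.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // pre-aggregate map output locally
        job.setReducerClass(IntSumReducer.class);   // final aggregation across all maps

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}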
By now you know how MapReduce works internally; time to write our first MapReduce program.