Home / Posts / Introduction to MapReduce

Introduction to MapReduce

MapReduce is a Distributed computing programming model suitable for processing of huge data. Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python.

MapReduce programs are parallel in nature, thus are very useful for performing large-scale data analysis using multiple machines in the cluster.

MapReduce is defined of 2 functions

map() and  reduce() and The rest is taken care by Hadoop

Let’s take an example :

Objective : Create Frequency Distribution of words in the  file

we have a large text file

The text file has been divided into blocks and stored in HDFS

Each block here would represent a part of the text file

Mapper Phase

The map step will generate a list of key  value pairs on each node

From now on all the inputs and outputs are formatted as <key,value> pairs

These are all copied over to one single node, On that node an operation called Sort/Merge occurs

  • Map function will run once fore each  line of the  text file.
  • Both the  input and output need to be formatted as <key,value> pairs.
  • At the end of the  map phase, we have a set of key-value pairs from each data node.

 

Reducer Phase

  • All of these results are first copied over to a single node.

  • This operation can run in parallel on each data node there is no interdependency in the inputs and outputs.
  • The data is  sorted based on key,key-value pairs with the same key are merged.
  • reduce() will run on each  pair generated by the sort/merge step

Map Reduce Entire Flow Diagram.

 

 

 

About Andanayya

Experienced Hadoop Developer with a demonstrated history of working in the computer software industry. Skilled in Big Data, Spark,Java,Sqoop, Hive,Spring Boot and SQL

Check Also

Introduction to Hadoop Architecture

Hadoop is an open source Distributed processing framework that manages data processing and storage for …