
MapReduce Hello World Programming for Beginners

MapReduce is a distributed-computing programming model suited to processing huge volumes of data. Hadoop can run MapReduce programs written in various languages: Java, Ruby, Python, and others.

MapReduce programs are parallel in nature, which makes them very useful for performing large-scale data analysis across multiple machines in a cluster.

Let’s write our first MapReduce program

  • MapReduce programs are usually written in Java.
  • Hadoop also has an API called Streaming that can be used with Ruby, Python, etc. We’ll get to that a little later.

Objective: Create a Frequency Distribution of words in a text file

Step 1: Write a map() function.
Step 2: Write a reduce() function.
Step 3: Set up a driver that points to our map and reduce implementations.

In Java, the map() function is represented by the Mapper class.

WordMapper Class

This class represents the code for the step:

<line number, line of text>  →  <word, 1>

The Mapper class is generic; it has 4 type parameters:

  • < Input key type, Input value type, Output key type, Output value type >
  • Hadoop has its own set of basic types optimised for network serialization.
  • These are wrappers around the Java primitive types (see the quick sketch after this list).
  • These types implement the Writable interface.
  • The Mapper class has a map() method, which we override with our own implementation.
  • Inputs are processed and the results are written into a Context object.
  • The Context object stores the output and is accessed by the rest of the MapReduce system; this is how the Mapper and Reducer communicate with the framework.
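
As a quick sketch of these wrapper types, IntWritable wraps a Java int and Text wraps a Java String:

IntWritable count = new IntWritable(1);   // wrap a Java int
int n = count.get();                      // unwrap back to an int
Text word = new Text("hello");            // wrap a Java String
String s = word.toString();               // unwrap back to a String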

The following code will:

  1. Convert the Hadoop Text type to a Java String
  2. Split the line into words
  3. For each of the words, write out a key-value pair
  4. The pair (word, 1) is written to the Context object
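
Here is a minimal sketch of what WordMapper can look like, following those four steps (the StringTokenizer-based whitespace split is an assumption; any tokenization would do):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1. Convert the Hadoop Text type to a Java String
        String line = value.toString();

        // 2. Split the line into words
        StringTokenizer tokenizer = new StringTokenizer(line);

        // 3 and 4. For each word, write the pair (word, 1) to the Context object
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}

Reusing a single Text instance and a constant IntWritable avoids allocating new objects for every record, a common Hadoop idiom.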

 

WordReducer Class

In Java, the reduce() function is represented by the Reducer class.

  • Its input is created by taking the map output and merging all the values with the same key.
  • The Reducer class is also generic; it has 4 type parameters:
  • < Input key type, Type of each element in the list of input values, Output key type, Output value type >
  • The Reducer class has a reduce() method, which we override with our own implementation.
  • The reduce() method takes a key and an Iterable of values.
  • Inputs are processed and the results are written into a Context object.

The following code:

  • We iterate through the list of values
  • and compute the sum of the values.
  • In the end, we write out the pair (word, count) to the Context object.
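
A minimal sketch of WordReducer along those lines, assuming the (word, 1) pairs produced by the mapper above:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Iterate through the list of values and compute their sum
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // In the end, write out the pair (word, count) to the Context object
        context.write(key, new IntWritable(sum));
    }
}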

WordCount Class

The main() method will take 2 String inputs; a sketch of the full driver follows this list.

  • The 2 String inputs are the input text file path and the output file path.
  • We check whether the right number of inputs is given and, if not, print an error.
  • By default, the job will pick up the configuration of your Hadoop deployment.
  • We instantiate a new job whose name is “wordcount”. This object will hold all the specifications, including the Mapper and Reducer to be used.
  • We specify the JAR file that contains our code; Hadoop will look for the JAR file that contains this class.
  • It will then distribute this JAR file to all the nodes where our job will run.
  • We set the input file path and output file path for the job using the arguments passed to main().
  • The input path can be a file, a directory, or a file-name pattern.
  • The output path must be a directory that does not already exist.
  • We set the Mapper class and Reducer class to the classes we’ve written.
  • We explicitly set the types of the output keys and values of the Reducer class.
  • These have to match the types we specified in our Reducer class.
  • In our case the Mapper class also has the same output types.
  • Otherwise, the Mapper output types would have to be set using setMapOutputKeyClass() and setMapOutputValueClass().
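
Putting those specifications together, a minimal sketch of the WordCount driver (the exact usage message text is an assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {
        // Check whether the right number of inputs is given
        if (args.length != 2) {
            System.err.println("Usage: WordCount <input path> <output path>");
            System.exit(-1);
        }

        // Picks up the configuration of your Hadoop deployment by default
        Configuration conf = new Configuration();

        // Instantiate a new job named "wordcount" that holds all the specifications
        Job job = Job.getInstance(conf, "wordcount");

        // Hadoop looks for the JAR containing this class and ships it to the nodes
        job.setJarByClass(WordCount.class);

        // Input and output paths come from the arguments passed to main()
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Set the Mapper and Reducer classes to the classes we've written
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReducer.class);

        // Explicitly set the output key/value types of the Reducer; our Mapper
        // has the same output types, so setMapOutputKeyClass() and
        // setMapOutputValueClass() are not needed here
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Submit the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}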

 

Running a MapReduce job

Step 1: Build a JAR file that contains all the code we just wrote
Step 1a: In the JAR, include all the Hadoop JARs as dependencies
Step 2: Run the job at the command line

hadoop jar <JARpath> <mainclass> <inputFile> <outputDir>
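
For example, with a simple manual build (the source file names, the JAR name wordcount.jar, and the input/output paths below are placeholders; a build tool such as Maven works just as well):

javac -classpath "$(hadoop classpath)" WordMapper.java WordReducer.java WordCount.java
jar cf wordcount.jar *.class
hadoop jar wordcount.jar WordCount input.txt output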

 

 
