
MapReduce Hello World Programming for Beginners

MapReduce is a distributed computing programming model suited to processing huge volumes of data. Hadoop can run MapReduce programs written in various languages, including Java, Ruby, and Python.

MapReduce programs are parallel in nature, which makes them very useful for performing large-scale data analysis across multiple machines in a cluster.

Let’s write our first MapReduce program

  • MapReduce programs are usually written in Java.
  • Hadoop also has an API called Streaming that can be used with Ruby, Python, etc. We’ll get to that a little later.

Objective: Create a frequency distribution of the words in a text file

Step 1: Write a map() function.
Step 2: Write a reduce() function.
Step 3: Set up a driver that points to our map and reduce implementations.

In Java, the map() function is represented by the Mapper class.

WordMapper Class

This class contains the code for the step that transforms each input pair into output pairs:

<line number, line of text>  →  <word, 1>
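
For example, the input pair <0, “to be or not to be”> would make the map() function emit (to,1), (be,1), (or,1), (not,1), (to,1), (be,1).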

The Mapper class is generic; it has 4 type parameters:

  • < Input key type, Input value type, Output key type, Output value type >
  • Hadoop has its own set of basic types optimised for network serialization.
  • These are wrappers around the Java primitive types (for example, IntWritable wraps int).
  • These types implement the Writable interface.
  • The Mapper class has a map() method, which we override with our own implementation.
  • Inputs are processed and the results are written into a Context object.
  • The Context object stores the output and is accessed by the rest of the MapReduce system; this is how the Mapper and Reducer communicate with the rest of the framework.

The following code (sketched after this list) does four things:

  1. Convert the Hadoop Text type to a Java String
  2. Split the line into words
  3. For each of the words, write out a key-value pair
  4. The pair (word, 1) is written to the Context object
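
A minimal sketch of the WordMapper class, assuming the standard org.apache.hadoop.mapreduce API (the class name comes from the post; the field names ONE and word are illustrative):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Type parameters: <input key, input value, output key, output value>
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1. Convert the Hadoop Text type to a Java String
        String line = value.toString();
        // 2. Split the line into words
        for (String token : line.split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            // 3 and 4. For each word, write the pair (word, 1) to the Context object
            word.set(token);
            context.write(word, ONE);
        }
    }
}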

 

WordReducer Class

In Java, the reduce() function is represented by the Reducer class.

  • The Reducer’s input is created by taking the map output and merging all the values with the same key.
  • The Reducer class is also generic; it has 4 type parameters:
  • < Input key type, Type of each element in the list of input values, Output key type, Output value type >
  • The Reducer class has a reduce() method, which we override with our own implementation.
  • The reduce() method takes a key and an Iterable of values.
  • Inputs are processed and the results are written into a Context object.

The following code (sketched after this list):

  • We iterate through the list of values
  • and compute the sum of the values.
  • In the end, we write out the pair (word, count) to the Context object.
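
A minimal sketch of the WordReducer class, under the same assumptions:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Type parameters: <input key, element type of the input values, output key, output value>
public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Iterate through the list of values and compute their sum
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // In the end, write out the pair (word, count) to the Context object
        context.write(key, new IntWritable(sum));
    }
}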

WordCount Class

The main() method takes 2 String inputs: the input text file path and the output file path (a sketch of the full driver follows this list).

  • We check whether the right number of inputs was given and, if not, print an error.
  • By default, the job picks up the configuration of your Hadoop deployment.
  • We instantiate a new job whose name is “wordcount”. This object holds all the specifications, including the Mapper and Reducer to be used.
  • We specify the JAR file that contains our code: Hadoop will look for the JAR file that contains this class,
  • and it will then distribute this JAR file to all the nodes where our job will run.
  • We set the input file path and output file path for the job using the arguments passed to main().
  • The input path can be a file, a directory, or a file name pattern.
  • The output path must be a directory that does not exist.
  • We set the Mapper class and Reducer class to the classes we’ve written.
  • We explicitly set the types of the output keys and values of the Reducer class.
  • These have to match the types we specified in our Reducer class.
  • In our case the Mapper class also has the same output types.
  • Otherwise, the Mapper output types would have to be set using setMapOutputKeyClass() and setMapOutputValueClass().
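
A minimal sketch of the WordCount driver, again assuming the standard Hadoop API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {
        // Check whether the right number of inputs was given
        if (args.length != 2) {
            System.err.println("Usage: WordCount <input path> <output path>");
            System.exit(-1);
        }

        // By default this picks up the configuration of your Hadoop deployment
        Configuration conf = new Configuration();

        // A new job named "wordcount"; it holds all the specifications
        Job job = Job.getInstance(conf, "wordcount");

        // Hadoop looks for the JAR that contains this class and
        // distributes it to all the nodes where the job will run
        job.setJarByClass(WordCount.class);

        // Input and output paths from the arguments passed to main()
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // The Mapper and Reducer classes we've written
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReducer.class);

        // Output key/value types of the Reducer; our Mapper emits the same
        // types, so setMapOutputKeyClass()/setMapOutputValueClass() are not needed
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}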

 

Running a MapReduce job

Step 1: Build a JAR file that contains all the code we just wrote
Step 1a: In the JAR, include all the Hadoop JARs as dependencies
Step 2: Run the job at the command line

hadoop jar <JARpath> <mainclass> <inputFile> <outputDir>
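
For example, assuming the three source files sit in the current directory and the input lives on HDFS (the JAR name and paths here are illustrative; hadoop classpath prints the Hadoop dependency classpath):

javac -classpath "$(hadoop classpath)" WordMapper.java WordReducer.java WordCount.java
jar cf wordcount.jar *.class
hadoop jar wordcount.jar WordCount /user/hadoop/input.txt /user/hadoop/output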

 

 
