I am feeling great while learning Hadoop and the MapReduce framework. In this post I am sharing my findings from my initial steps in learning MapReduce. This program is a little different from the usual WordCount example, in that it uses the setup and cleanup methods of the Mapper to accomplish the task.
Agenda:
1) The input file is about 2GB, with one number per line. It is generated by a small program that emits random positive numbers.
2) Copy the input file into HDFS
3) Hadoop environment: VMware Player running Ubuntu 11 with Hadoop 0.20 installed in it.
4) Map and Reduce programs to compute the sum of the numbers
5) Driver Program to configure the JobClient
Steps Followed:
1) Data Dumper Program:
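The original dumper program is not shown here; below is a minimal sketch of what it might have looked like, assuming one random positive number per line is written to a local file (the class name, method name, and count are my own placeholders; a real run would use a much larger count to reach ~2GB):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Random;

public class NumberDataDumper {

    // Write 'count' random positive numbers, one per line, to the given file.
    public static void dump(Path file, long count) throws IOException {
        Random random = new Random();
        try (BufferedWriter writer = Files.newBufferedWriter(file)) {
            for (long i = 0; i < count; i++) {
                // nextInt(Integer.MAX_VALUE) yields a number in [0, MAX_VALUE)
                writer.write(Long.toString(random.nextInt(Integer.MAX_VALUE)));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        dump(Paths.get("number-data.xml"), 1_000_000);
    }
}
```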
2) Copy the number-data.xml file from the Ubuntu Desktop folder into the saip input folder in HDFS:
hadoop fs -copyFromLocal number-data.xml saip
3) Write the Mapper Program as follows:
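The trick the post relies on is the Mapper's setup() and cleanup() hooks: a running sum is kept across all map() calls of one task and emitted only once, in cleanup(). A sketch against the Hadoop 0.20 (new) API might look like this (class name and the "Result" key are my own choices):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SumMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private long sum; // running total for this map task

    @Override
    protected void setup(Context context) {
        sum = 0L; // initialised once per task, before any map() call
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line holds one positive number; accumulate it.
        sum += Long.parseLong(value.toString().trim());
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit a single record per map task; this is the one line
        // that shows up in each part-m-000xx file.
        context.write(new Text("Result"), new LongWritable(sum));
    }
}
```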
4) Write the Driver Program without the Reducer as follows:
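A driver for the map-only run might look like the sketch below, assuming the hypothetical SumMapper/SumReducer classes above; the Reducer lines stay commented out until step 9, and setNumReduceTasks(0) is what makes the output land in part-m-000xx files instead of going through a reducer:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SumDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "number sum"); // the 0.20-era constructor
        job.setJarByClass(SumDriver.class);

        job.setMapperClass(SumMapper.class);

        // Map-only run: no reduce phase, output goes straight to HDFS.
        // For step 9, comment this line out and uncomment the reducer:
        job.setNumReduceTasks(0);
        // job.setReducerClass(SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```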
5) Start the Hadoop Framework:
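On a single-node Hadoop 0.20 install the daemons are normally started with the start-all.sh script; a typical sequence (paths depend on where Hadoop is installed) is:

```shell
# Format the namenode only on the very first run:
# bin/hadoop namenode -format

# Start the HDFS and MapReduce daemons
# (NameNode, DataNode, JobTracker, TaskTracker):
bin/start-all.sh

# Verify the daemons are running:
jps
```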
6) Run the map-only task as:
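The jar and class names below are placeholders for my own build; the last two arguments are the HDFS input and output paths:

```shell
hadoop jar numbersum.jar SumDriver saip output
```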
7) The output of the map-only task appears in part-m-000xx files. In my case it created 29 map output files, because the file size is around 1.9GB and 29 x 64MB (where 64MB is the default block size) is about 1.9GB. Each file contains one line, Result XXXXXXXXXX, produced by the cleanup() method of the Map class.
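The map-task count can be sanity-checked with a little arithmetic: the number of map tasks equals the number of input splits, which by default is the file size divided by the 64MB block size, rounded up (the 1.9GB figure is approximate):

```java
public class SplitCount {

    // Number of input splits for a file, assuming split size == block size.
    public static long splits(long fileSizeBytes, long blockSizeBytes) {
        // ceiling division
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // 64MB default block size
        long fileSize = 1_900_000_000L;      // roughly 1.9GB
        System.out.println(splits(fileSize, blockSize)); // prints 29
    }
}
```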
8) Now let us write the Reduce Program as follows:
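Since every map task emits its partial sum under the same key, a single reducer receives all of them and only has to add them up. A sketch (again with my own class name, matching the hypothetical mapper above):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // All per-map partial sums arrive under the single "Result" key;
        // add them up to get the grand total.
        long total = 0L;
        for (LongWritable value : values) {
            total += value.get();
        }
        context.write(key, new LongWritable(total));
    }
}
```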
9) Uncomment the Reducer-related lines in the Driver program and run the job again
10) The output of the Reducer is a single part-r-00000 file containing one line: Result 194340451763551429
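The final total can be read back directly from HDFS (the output path here matches the placeholder used in the run command above):

```shell
hadoop fs -cat output/part-r-00000
```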