In this blog we will discuss MRJob: what it is and how to use it. Before that, you should know what MapReduce is and how it works, so let's start with a quick introduction to MapReduce.
What is MapReduce?
MapReduce is one of the core components of the Hadoop ecosystem. It is designed to process large amounts of data in parallel by dividing the work into smaller, independent tasks.
In the MapReduce programming model, data is processed by two functions, mapper() and reducer(). A MapReduce job splits the input data into small independent pieces, which are processed by the map tasks in parallel. The framework sorts the output of the map tasks, and the sorted output is used as the input to the reduce tasks. Both the input and the output are stored in a file system shared by all processing nodes.
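The map, sort and reduce flow described above can be imitated in plain Python on a single machine. This is only a rough sketch of the idea, not how Hadoop is implemented: run_mapreduce, max_temp_mapper, max_temp_reducer and the sample readings are all hypothetical names and data made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer):
    """Tiny single-process imitation of the map -> sort -> reduce flow."""
    # Map: every input record may emit any number of (key, value) pairs.
    mapped = [pair for record in records for pair in mapper(record)]
    # Sort: order the pairs by key so that equal keys end up next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce: hand each key to the reducer together with all of its values.
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        values = (value for _, value in group)
        results.extend(reducer(key, values))
    return results


def max_temp_mapper(record):
    # Hypothetical input format: "city,temperature"
    city, temp = record.split(",")
    yield city, int(temp)


def max_temp_reducer(city, temps):
    yield city, max(temps)


if __name__ == "__main__":
    readings = ["paris,21", "oslo,12", "paris,25", "oslo,9"]
    print(run_mapreduce(readings, max_temp_mapper, max_temp_reducer))
```

In a real cluster the map and reduce calls run on different nodes and the sort happens inside the framework, but the contract between the two functions is the same.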
How does a simple word count program work?
Mapper
Read the input from stdin line by line
Split each line into words separated by spaces, commas, tabs, etc.
Emit a key-value pair (word, 1) for each word
Reducer
Read the sorted output of the mapper from stdin
Sum the occurrences of each word
Write the result to stdout
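The mapper and reducer steps listed above can be sketched as two small Python functions. This is a minimal sketch, assuming the tab-separated "word, count" lines that streaming-style jobs usually exchange. Both functions accept any iterable of lines, so sys.stdin can be passed in a real script. Here sorted() stands in for the sort that the framework performs between the two steps.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of each word; the input must already be sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    text = ["to be or not", "to be"]
    # sorted() plays the role of the framework's sort between map and reduce.
    for out_line in reducer(sorted(mapper(text))):
        print(out_line)
```

Note that this style needs two separate pieces of code plus an external sort. MRJob wraps all of that in a single class.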
Introduction to MRJob
MRJob is a Python library for MapReduce created by Yelp. It allows MapReduce applications to be written in a single Python class, instead of writing separate programs for the mapper and the reducer, which makes the code simpler and more Pythonic. MRJob is one of the easiest ways to write a Python program that runs on Hadoop. With MRJob you can test your program locally without installing Hadoop, and then run the same job on a Hadoop cluster or on Amazon EMR, the web service provided by AWS for big data processing.
Features
All MapReduce code in a single class
Easily upload and install code and data dependencies at runtime
Switch input and output formats with a single line of code
Automatically download and parse error logs for Python tracebacks
Put command line filters before or after your Python code
Why MRJob?
It has extensive documentation.
You can run and test your program without installing Hadoop at all.
MRJob provides a consistent interface to every environment it supports. Whether you are running in the cloud or locally, your code does not change.
MRJob handles most of the machinery of getting code and data to and from the cluster your job runs on. You do not need a series of scripts to install dependencies or upload files.
MRJob makes debugging easier. A local run executes the MapReduce job on your own machine, so you get a traceback in your console instead of in an obscure log file. On a Hadoop cluster or Elastic MapReduce, it parses the error logs for Python tracebacks and other likely causes of failure.
MRJob automatically serializes and deserializes the data flowing into and out of each task.
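To make the serialization point concrete: as far as we know, MRJob's default protocol writes each key and value as JSON, separated by a tab. The sketch below imitates that format with the standard json module. It is not MRJob's own code, and encode_pair and decode_pair are hypothetical names used only for illustration.

```python
import json


def encode_pair(key, value):
    """Serialize a (key, value) pair as 'json_key<TAB>json_value'."""
    return f"{json.dumps(key)}\t{json.dumps(value)}"


def decode_pair(line):
    """Parse a line produced by encode_pair back into Python objects."""
    # json.dumps escapes tab characters, so the first raw tab is the separator.
    raw_key, raw_value = line.split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


if __name__ == "__main__":
    line = encode_pair("hello", 2)
    print(line)
    print(decode_pair(line))
```

With MRJob you never write this by hand: the mapper and reducer simply yield Python objects and receive Python objects back.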
Installation of MRjob
Installing MRJob is easy: you can install it with pip using the following command.
pip install mrjob
Example : 1
First program using MRJob.
In the code below, the mapper and the reducer are defined in the same program: with MRJob we do not need separate files for them. First we import the MRJob class, then we create a subclass and define two methods, mapper and reducer. The mapper receives the text file one line at a time and splits each line into words. The reducer counts the occurrences of each word and yields the result as key-value pairs.
Code :
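from mrjob.job import MRJob


class Mrwordcount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every whitespace-separated word in the line.
        words = line.split()
        for word in words:
            yield word, 1

    def reducer(self, key, values):
        # values iterates over every count emitted for this word.
        yield key, sum(values)


if __name__ == '__main__':
    Mrwordcount.run()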
To run the code, open a terminal and change to the folder where your Python file and text data are stored. Then type python wordcount.py text, where text is the name of your text file. You can pass multiple input files at the same time.
Syntax : python "python_file_name" "text_data"
Output : one line per distinct word, containing the JSON-encoded word and its count separated by a tab.
Example : 2
The simplest way to write a one-step job is to subclass MRJob and override a few methods:
Code
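from mrjob.job import MRJob
import re

# Matches runs of word characters and apostrophes, e.g. "don't".
WORD_RE = re.compile(r"[\w']+")


class MRWordCount(MRJob):
    # mapper: emit (word, 1) for every word, lowercased
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    # combiner: pre-sum the counts on the mapper side
    def combiner(self, word, counts):
        yield word, sum(counts)

    # reducer: sum the partial counts for each word
    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCount.run()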
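Output : one line per distinct lowercased word, containing the JSON-encoded word and its count separated by a tab.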
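Example 2 adds a combiner to the job. To see why that helps, here is a small stand-alone sketch (plain Python, no MRJob or Hadoop involved) that counts how many key-value pairs would leave the mappers with and without a combiner. The two input splits and the function names map_split, combine and reduce_counts are hypothetical and exist only for this illustration.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w']+")


def map_split(lines):
    """Mapper for one input split: emit (word, 1) for every word."""
    for line in lines:
        for word in WORD_RE.findall(line):
            yield word.lower(), 1


def combine(pairs):
    """Combiner: pre-sum the counts emitted by a single mapper."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def reduce_counts(pairs):
    """Reducer: sum the partial counts coming from every mapper."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


if __name__ == "__main__":
    # Two hypothetical input splits, each handled by its own mapper.
    splits = [["The cat saw the cat"], ["the dog"]]
    raw = [list(map_split(split)) for split in splits]
    combined = [combine(pairs) for pairs in raw]
    print("pairs sent without combiner:", sum(len(p) for p in raw))
    print("pairs sent with combiner:", sum(len(p) for p in combined))
    print(reduce_counts([pair for part in combined for pair in part]))
```

The final counts are the same either way. The combiner only reduces the amount of data moved between the map and reduce steps, which is why its body can be identical to the reducer in a word count job.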
If you need an implementation of any of the topics mentioned above, or assignment help on any of their variants, feel free to contact us.