First-year master's student, Computer Science and Information Engineering: 黃威凱
MapReduce
Outline
◦Purpose
◦Example
◦Method
◦Advanced
PURPOSE
Purpose
◦Data mining
◦Data processing
EXAMPLE
Example
Find the maximum temperature for each year
National Climatic Data Center (NCDC)
◦The data is stored in a line-oriented ASCII format, in which each line is a record.
◦There is a directory for each year from 1901 to 2001, each containing a gzipped file for each weather station with its readings for that year.
Example(Data format)
Example(Gzipped file, example for 1990)
% ls raw/1990 | head
010010-99999-1990.gz
010014-99999-1990.gz
010015-99999-1990.gz
010016-99999-1990.gz
010017-99999-1990.gz
010030-99999-1990.gz
010040-99999-1990.gz
010080-99999-1990.gz
010100-99999-1990.gz
010150-99999-1990.gz
METHOD
Method
◦Analyzing the data with Unix tools
◦Analyzing the data with Hadoop
Method(Unix tools)
Method(Unix tools)
Here is the beginning of a run:
% ./max_temperature.sh
1901 317
1902 244
1903 289
1904 256
1905 283
...
The complete run for the century took 42 minutes on a single EC2 High-CPU Extra Large Instance.
Method(Hadoop)
Use MapReduce
◦Map
◦Shuffle
◦Reduce
Method(Hadoop)
Map function
◦Pull out the year and the air temperature from each record
◦Emit them as a (year, temperature) key-value pair
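The map step can be sketched in plain Java, without the Hadoop API. This is only an illustration: the record layout (a year and a temperature separated by whitespace) is a simplification assumed here, not the real NCDC format, and the names `MapSketch` and `map` are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical sketch of a map function; this is not the Hadoop Mapper API.
final class MapSketch {

    // Assumed record layout: "<year> <temperature>", for example "1950 22".
    // Returns the (year, temperature) key-value pair for one record.
    static Map.Entry<String, Integer> map(String record) {
        String[] fields = record.trim().split("\\s+");
        String year = fields[0];
        int airTemperature = Integer.parseInt(fields[1]);
        return new SimpleEntry<>(year, airTemperature);
    }

    public static void main(String[] args) {
        System.out.println(map("1950 22"));   // prints 1950=22
        System.out.println(map("1950 -11"));  // prints 1950=-11
    }
}
```

Each input line yields one pair, so the map step needs no state shared between records.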
Method(Hadoop)
Map function
◦The shuffle: the map output is sorted and grouped by key before it reaches the reduce function. Each reduce task is fed by many map tasks.
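A minimal sketch of what the shuffle does to the map output, assuming everything fits in memory. Hadoop performs this step itself across machines; the class `ShuffleSketch` and its methods are hypothetical names used only to show the grouping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the shuffle; Hadoop does this internally.
final class ShuffleSketch {

    // Merge the output of several map tasks and group the values by key.
    // A TreeMap keeps the keys sorted, as the reduce side expects.
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    // Two made-up map task outputs feeding one reduce task.
    static Map<String, List<Integer>> example() {
        List<Map.Entry<String, Integer>> map1 =
                List.of(Map.entry("1950", 0), Map.entry("1949", 111));
        List<Map.Entry<String, Integer>> map2 =
                List.of(Map.entry("1950", 22), Map.entry("1949", 78), Map.entry("1950", -11));
        return shuffle(List.of(map1, map2));
    }

    public static void main(String[] args) {
        System.out.println(example()); // prints {1949=[111, 78], 1950=[0, 22, -11]}
    }
}
```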
Method(Hadoop)
Reduce function
◦Iterate through the list and pick the maximum reading
◦Input: (1949, [111, 78]), (1950, [0, 22, -11])
◦Output: (1949, 111), (1950, 22)
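The reduce step on this slide can be sketched in plain Java as follows. It is an illustration under the assumption that each year's readings arrive as a non-empty list; `ReduceSketch` and `reduce` are hypothetical names, not the Hadoop Reducer API.

```java
import java.util.List;

// Hypothetical sketch of a reduce function; this is not the Hadoop Reducer API.
final class ReduceSketch {

    // Iterate through the readings for one year and keep the maximum.
    // The year is passed only to mirror the (key, values) shape of a reduce call.
    static int reduce(String year, List<Integer> temperatures) {
        int max = Integer.MIN_VALUE;
        for (int temperature : temperatures) {
            max = Math.max(max, temperature);
        }
        return max;
    }

    public static void main(String[] args) {
        // Inputs taken from the slide.
        System.out.println("(1949, " + reduce("1949", List.of(111, 78)) + ")");    // prints (1949, 111)
        System.out.println("(1950, " + reduce("1950", List.of(0, 22, -11)) + ")"); // prints (1950, 22)
    }
}
```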
Method(Hadoop)
Data flow
Method(Hadoop)
Java MapReduce: Mapper example
Method(Hadoop)
Java MapReduce: Reducer example
Method(Hadoop)
Java MapReduce: Job example
◦Supports multiple input paths
ADVANCED
Advanced
Case 1
Advanced
Case 2
Advanced
Case 3
Advanced
Combiner functions run on the map output
◦Example
Map 1 output: (1950, 0), (1950, 20), (1950, 10)
Map 2 output: (1950, 25), (1950, 15)
After grouping by key on each map:
Map 1: (1950, [0, 20, 10]) Map 2: (1950, [25, 15])
Without a combiner, the reduce input is (1950, [0, 20, 10, 25, 15])
With a combiner, the reduce input is (1950, [20, 25])
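The effect of the combiner can be checked with a small plain-Java sketch that uses the numbers from this example. The class `CombinerSketch` and its method names are hypothetical; in a real job Hadoop decides when to run the combiner.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch comparing the reduce input with and without a combiner.
final class CombinerSketch {

    // Without a combiner, every map output value is sent to the reducer.
    static List<Integer> reduceInputWithoutCombiner(List<List<Integer>> mapOutputs) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> output : mapOutputs) {
            all.addAll(output);
        }
        return all;
    }

    // With a combiner, each map task sends only its local maximum.
    static List<Integer> reduceInputWithCombiner(List<List<Integer>> mapOutputs) {
        List<Integer> localMaxima = new ArrayList<>();
        for (List<Integer> output : mapOutputs) {
            localMaxima.add(Collections.max(output));
        }
        return localMaxima;
    }

    public static void main(String[] args) {
        // Values for the key 1950 from the two map tasks on the slide.
        List<List<Integer>> mapOutputs = List.of(List.of(0, 20, 10), List.of(25, 15));

        List<Integer> without = reduceInputWithoutCombiner(mapOutputs);
        List<Integer> with = reduceInputWithCombiner(mapOutputs);

        System.out.println(without); // prints [0, 20, 10, 25, 15]
        System.out.println(with);    // prints [20, 25]

        // The final maximum is the same either way.
        System.out.println(Collections.max(without)); // prints 25
        System.out.println(Collections.max(with));    // prints 25
    }
}
```

This works because taking a maximum is commutative and associative, so the maximum of the local maxima equals the overall maximum. A function such as the mean does not have this property and cannot be used as a combiner in the same way.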