7/30/2019 Hadoop API
Large Scale Data Management Systems
Hadoop API
Dimitrios Michail
Dept. of Informatics and Telematics, Harokopio University of Athens
2012-2013
Hadoop
Input/Output

(input) <k1, v1>
  map
<k2, v2>
  combine
<k2, v2>
  reduce
<k3, v3> (output)
Hadoop
Serializable

The key and value classes need to be serializable by the framework:
they need to implement the Writable interface.

Comparable

To facilitate sorting by the framework, the key class also needs to be comparable:
it needs to also implement the WritableComparable interface.

There are implementations for all basic types, e.g. DoubleWritable.
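As a sketch of what these contracts look like, here is a minimal DoubleWritable-style key type. The Writable and WritableComparable interfaces are re-declared locally in simplified form so the snippet compiles without Hadoop on the classpath; the real interfaces live in org.apache.hadoop.io, and MyDoubleWritable and WritableDemo are hypothetical names.

```java
import java.io.*;

// Simplified local sketches of Hadoop's Writable / WritableComparable
// contracts (the real ones live in org.apache.hadoop.io).
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

interface WritableComparable<T> extends Writable, Comparable<T> {}

// A minimal DoubleWritable-like key type: serializable and comparable.
class MyDoubleWritable implements WritableComparable<MyDoubleWritable> {
    private double value;

    public MyDoubleWritable() {}                  // framework-style no-arg constructor
    public MyDoubleWritable(double value) { this.value = value; }

    public double get() { return value; }

    @Override public void write(DataOutput out) throws IOException {
        out.writeDouble(value);
    }
    @Override public void readFields(DataInput in) throws IOException {
        value = in.readDouble();
    }
    @Override public int compareTo(MyDoubleWritable o) {
        return Double.compare(value, o.value);
    }
}

public class WritableDemo {
    public static void main(String[] args) throws IOException {
        // Round-trip through the byte-oriented representation.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new MyDoubleWritable(3.14).write(new DataOutputStream(buf));

        MyDoubleWritable copy = new MyDoubleWritable();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray())));

        System.out.println(copy.get());                                // 3.14
        System.out.println(copy.compareTo(new MyDoubleWritable(2.0)) > 0); // true
    }
}
```

The no-arg constructor matters: the framework instantiates key/value objects itself and then fills them in via readFields.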
WordCount Example
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }

} // end of class WordCount
Input
There are three major interfaces for providing input to a map-reduce job:
InputFormat
InputSplit
RecordReader
InputFormat Interface
InputFormat describes the input-specification for a Map-Reduce job.

The Map-Reduce framework relies on the InputFormat of the job to:
1 Validate the input-specification of the job.
2 Split-up the input file(s) into logical InputSplit(s), each of which is then assigned to an individual Mapper.
3 Provide the RecordReader implementation to be used to produce input records from the logical InputSplit for processing by the Mapper.
InputSplit Interface
InputSplit represents the data to be processed by an individual Mapper.

It presents a byte-oriented view on the input, and
it is the responsibility of the RecordReader of the job to process this and present a record-oriented view.
RecordReader Interface
RecordReader reads <key, value> pairs from an InputSplit.

It converts the byte-oriented view of the input, provided by the InputSplit, and
presents a record-oriented view to the Mapper & Reducer tasks for processing.

It thus assumes the responsibility of processing record boundaries and presenting the tasks with keys and values.
Default InputFormat
The default InputFormat is the TextInputFormat.
TextInputFormat

An InputFormat for plain text files:
files are broken into lines; either linefeed or carriage-return is used to signal end of line.
Keys are the position in the file, and values are the line of text.
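The key/value convention above can be illustrated with a small in-memory sketch, in plain Java with no Hadoop involved (TextRecordDemo and its records helper are hypothetical names): each line becomes one record whose key is the line's starting byte offset and whose value is the line's text.

```java
import java.util.*;

// Toy sketch of the record view TextInputFormat presents: for each line,
// the key is the line's starting offset and the value is the line itself.
// (The real RecordReader works on InputSplits in HDFS; this only
// illustrates the key/value convention.)
public class TextRecordDemo {
    static LinkedHashMap<Long, String> records(String data) {
        LinkedHashMap<Long, String> out = new LinkedHashMap<>();
        long offset = 0;
        for (String line : data.split("\n", -1)) {
            // Skip only the empty fragment after a trailing newline.
            if (!line.isEmpty() || offset < data.length()) {
                out.put(offset, line);
            }
            offset += line.length() + 1; // +1 for the newline terminator
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(records("hello world\nfoo\nbar"));
        // {0=hello world, 12=foo, 16=bar}
    }
}
```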
Output
The major interfaces for outputting from a map-reduce job are:
OutputCollector
OutputFormat
RecordWriter
FileSystem
OutputCollector Interface
Collects the <key, value> pairs output by Mappers and Reducers.

OutputCollector is the generalization of the facility provided by the Map-Reduce framework to collect data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job.
OutputFormat Interface
OutputFormat describes the output-specification for a Map-Reduce job.

The Map-Reduce framework relies on the OutputFormat of the job to:
1 Validate the output-specification of the job, e.g. check that the output directory doesn't already exist.
2 Provide the RecordWriter implementation to be used to write out the output files of the job. Output files are stored in a FileSystem.
RecordWriter Interface
RecordWriter writes the output <key, value> pairs to an output file.

RecordWriter implementations write the job outputs to the FileSystem.
FileSystem
An abstract base class for a fairly generic filesystem.

Implemented either as:

local
reflects the locally-connected disk

Distributed File System
the Hadoop DFS, which is a multi-machine system that appears as a single disk
fault tolerance and potentially very large capacity
Default OutputFormat
The default OutputFormat is the TextOutputFormat.
TextOutputFormat
An OutputFormat that writes plain text files.
JobConf: map/reduce job configuration

JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution.

JobConf typically specifies the
Mapper,
combiner (if any),
Partitioner,
Reducer,
InputFormat, and
OutputFormat
implementations to be used, etc.
Mapper Interface
Mapper maps input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks which transform input records into intermediate records.

The user implements the following interface:

public interface Mapper<K1, V1, K2, V2>
        extends JobConfigurable, Closeable {

    void map(K1 key, V1 value,
             OutputCollector<K2, V2> output,
             Reporter reporter) throws IOException;
}
Hadoop Map-Reduce Framework
Framework
spawns one map task for each InputSplit generated by the InputFormat for the job.
calls map(K1, V1, OutputCollector, Reporter) for each key/value pair in the InputSplit for that task.
Hadoop Map-Reduce Framework
Grouping
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to a Reducer to determine the final output.

Users can control the grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).
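Comparator-controlled grouping can be illustrated in plain Java, outside the Hadoop API (GroupingDemo and its group helper are hypothetical names): with a case-insensitive comparator, keys that differ only in case fall into the same group, just as a custom output-key comparator would merge them in a job.

```java
import java.util.*;

// Illustration of comparator-controlled grouping: keys that the comparator
// considers equal are collected under one group, mirroring the effect of
// JobConf.setOutputKeyComparatorClass(Class).
public class GroupingDemo {
    static TreeMap<String, List<Integer>> group(String[] keys, int[] vals,
                                                Comparator<String> cmp) {
        TreeMap<String, List<Integer>> groups = new TreeMap<>(cmp);
        for (int i = 0; i < keys.length; i++) {
            groups.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(vals[i]);
        }
        return groups;
    }

    public static void main(String[] args) {
        // "Cat" and "cat" land in the same group under a
        // case-insensitive comparator.
        System.out.println(group(new String[] {"Cat", "cat", "dog"},
                                 new int[] {1, 2, 3},
                                 String.CASE_INSENSITIVE_ORDER));
        // {Cat=[1, 2], dog=[3]}
    }
}
```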
Hadoop Map-Reduce Framework
Partitioning
The grouped Mapper outputs are partitioned per Reducer.

Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.

A partitioner is represented by the following interface:

public interface Partitioner<K2, V2> extends JobConfigurable {

    int getPartition(K2 key, V2 value, int numPartitions);
}
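The default partitioner, HashPartitioner, routes a key to a reducer by its hash code, masked to keep the result non-negative, modulo the number of partitions. A self-contained sketch of that rule (HashPartitionDemo is a hypothetical name; the real class is org.apache.hadoop.mapred.lib.HashPartitioner):

```java
// Sketch of the hash-based partitioning rule used by Hadoop's default
// HashPartitioner: non-negative hash of the key, modulo numPartitions.
public class HashPartitionDemo {
    static int getPartition(Object key, int numPartitions) {
        // Mask the sign bit so a negative hashCode cannot yield a
        // negative partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key is always routed to the same reducer.
        System.out.println(getPartition("hadoop", 4));
        System.out.println(getPartition("hadoop", 4));
    }
}
```

Determinism is the point: every Mapper computes the same partition for a given key, so all records sharing that key meet at the same Reducer.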
Hadoop Map-Reduce Framework
Combiners
Users can optionally specify a combiner, via JobConf.setCombinerClass(Class).

Combiners perform local aggregation of the intermediate outputs.

This helps to cut down the amount of data transferred from the Mapper to the Reducer.
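What local aggregation buys can be seen in a plain-Java toy, with no Hadoop involved (CombinerDemo and its combine helper are hypothetical names): in word count, the mapper's per-occurrence ("word", 1) records collapse into per-word partial sums before anything crosses the network.

```java
import java.util.*;

// Toy illustration of a combiner's local aggregation in word count:
// instead of shipping ("the", 1) once per occurrence, the mapper's
// output is folded into per-word partial sums on the map side.
public class CombinerDemo {
    static Map<String, Integer> combine(List<String> mapperOutputKeys) {
        Map<String, Integer> partialSums = new HashMap<>();
        for (String key : mapperOutputKeys) {
            partialSums.merge(key, 1, Integer::sum);
        }
        return partialSums;
    }

    public static void main(String[] args) {
        List<String> emitted = Arrays.asList("the", "cat", "the", "the");
        // 4 emitted records shrink to 2 partial sums.
        System.out.println(combine(emitted));
    }
}
```

This works for word count because the reduce function (summing) is associative and commutative, which is why the WordCount example can reuse Reduce as its combiner class.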
Hadoop Map-Reduce Framework
The intermediate, grouped outputs are stored in SequenceFiles, which are flat files consisting of binary key/value pairs.

There are three SequenceFile Writers:
1 Writer: uncompressed records.
2 RecordCompressWriter: record-compressed files, only compress values.
3 BlockCompressWriter: block-compressed files; both keys & values are collected in blocks separately and compressed. The size of the block is configurable.
Hadoop Map-Reduce Framework
The actual compression algorithm used to compress keys and/or values can be specified by using the appropriate CompressionCodec.
There are also Readers and Sorters for each SequenceFile type.
Reducer Interface
Reduces a set of intermediate values which share a key to a smaller set of values.

The user implements the following interface:

public interface Reducer<K2, V2, K3, V3>
        extends JobConfigurable, Closeable {

    void reduce(K2 key, Iterator<V2> values,
                OutputCollector<K3, V3> output,
                Reporter reporter) throws IOException;
}

The number of Reducers for the job is set by the user via JobConf.setNumReduceTasks(int).
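The reduce step of the earlier WordCount example, stripped down to plain Java (ReduceDemo is a hypothetical name): the framework hands the reducer a key together with an iterator over all values grouped under that key, and the reducer folds them into one result.

```java
import java.util.*;

// Plain-Java sketch of WordCount's reduce step: fold the iterator of
// grouped values for one key into a single sum.
public class ReduceDemo {
    static int reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Three ("the", 1) records grouped under the key "the".
        Iterator<Integer> grouped = Arrays.asList(1, 1, 1).iterator();
        System.out.println(reduce("the", grouped)); // 3
    }
}
```

Note that the values arrive as an Iterator, not a collection: a key's value list may be too large to hold in memory, so it can only be traversed once.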
Hadoop Map-Reduce Framework
Reducer Phases

Phase 1: Shuffle
The framework, for each reducer, fetches the relevant partition of the output of all the Mappers, via HTTP.

Phase 2: Sort
The framework groups Reducer inputs by keys (since different Mappers may have output the same key).

The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged.
Hadoop Map-Reduce Framework
Reducer Phases

Phase 3: Reduce
The reduce(K2, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.

The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(K3, V3).
Reporter
A facility for Map-Reduce applications to report progress and update counters, status information, etc.

Is Alive?
Mapper and Reducer can use the Reporter provided to report progress or just indicate that they are alive.

In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial, since the framework might otherwise assume that the task has timed out and kill it.
Material
http://hadoop.apache.org
http://hadoop.apache.org/docs/r1.0.4
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html