4.hdfs & io
Post on 01-Jun-2018
8/9/2019 4.HDFS & IO
http://www.excelonlineclasses.co.nr/
excel.onlineclasses@gmail.com
-
Excel Online Classes offers the following services:
Online Training
Development
Testing
Job support
Technical Guidance
Job Consultancy
Any needs of IT Sector
-
MapReduce Anatomy
Nagaruna !
-
AGENDA
Anatomy of MapReduce
MR workflow
Hadoop data types
Mapper
Reducer
Partitioner
Combiner
Input Split vs Block Size
-
Anatomy of MR
[Figure: INPUT DATA is read by Map tasks running on separate nodes, each producing interim data; after partitioning and shuffling, Reduce tasks on their nodes store the output.]
-
Hadoop data types
MR defines the types that keys and values must have so that they can move across the cluster.
Values: Writable
Keys: WritableComparable<T> (Writable plus Comparable<T>)
-
Frequently used key/value types
Hadoop type        Wrapper for Java type
BooleanWritable    Boolean
ByteWritable       Byte
DoubleWritable     Double
IntWritable        Integer
LongWritable       Long
Text               String
NullWritable       Placeholder when key/value is not needed
-
Custom Writable
For any class to be used as a value, it has to implement org.apache.hadoop.io.Writable
  write(DataOutput out)
  readFields(DataInput in)
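The two methods can be illustrated with a minimal plain-Java sketch. It assumes a local stand-in interface named Writable that mirrors the two methods of org.apache.hadoop.io.Writable, so that it compiles without Hadoop on the classpath. PointWritable and the helpers toBytes and fromBytes are hypothetical names used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in with the same two methods as org.apache.hadoop.io.Writable
// (an assumption of this sketch, declared here so no Hadoop jar is needed).
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical value type holding two ints.
class PointWritable implements Writable {
    int x;
    int y;

    PointWritable() {}  // no-arg constructor: an empty instance is filled by readFields
    PointWritable(int x, int y) { this.x = x; this.y = y; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        x = in.readInt();  // fields are read back in the order they were written
        y = in.readInt();
    }

    // Serializes any Writable into a byte array.
    static byte[] toBytes(Writable w) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            w.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuilds a PointWritable from bytes produced by toBytes.
    static PointWritable fromBytes(byte[] data) {
        try {
            PointWritable p = new PointWritable();
            p.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PointWritable copy = fromBytes(toBytes(new PointWritable(3, -7)));
        System.out.println(copy.x + "," + copy.y);  // prints 3,-7
    }
}
```

With Hadoop available, the class would implement the real org.apache.hadoop.io.Writable instead of the stand-in; the method bodies would stay the same.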
-
Custom Key
For any class to be used as a key, it has to implement org.apache.hadoop.io.WritableComparable
In addition to write and readFields:
  compareTo(T o)
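A key differs from a value only in that it must also define an ordering. The sketch below assumes a local stand-in interface that mirrors org.apache.hadoop.io.WritableComparable<T>; StationYearKey and the helper sorted are hypothetical names for illustration.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Local stand-in for org.apache.hadoop.io.WritableComparable<T>
// (an assumption of this sketch, so that it compiles without Hadoop).
interface WritableComparable<T> extends Comparable<T> {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical composite key: ordered by station name, then by year.
class StationYearKey implements WritableComparable<StationYearKey> {
    String station = "";
    int year;

    StationYearKey() {}
    StationYearKey(String station, int year) { this.station = station; this.year = year; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(station);
        out.writeInt(year);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        station = in.readUTF();
        year = in.readInt();
    }

    @Override
    public int compareTo(StationYearKey other) {
        int byStation = station.compareTo(other.station);
        return byStation != 0 ? byStation : Integer.compare(year, other.year);
    }

    @Override
    public String toString() { return station + ":" + year; }

    // Sorts the keys with compareTo and returns their labels in order.
    static List<String> sorted(StationYearKey... keys) {
        List<StationYearKey> list = new ArrayList<>(Arrays.asList(keys));
        Collections.sort(list);
        List<String> labels = new ArrayList<>();
        for (StationYearKey k : list) {
            labels.add(k.toString());
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(sorted(
            new StationYearKey("b", 2000),
            new StationYearKey("a", 2001),
            new StationYearKey("a", 1999)));  // prints [a:1999, a:2001, b:2000]
    }
}
```

A real key class would normally also override equals and hashCode, since hash-based partitioning relies on them.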
-
Checkout Writables
Check out a few of the Writables and WritableComparables
Time to write your own Writables
-
MapReduce libraries
Two libraries in Hadoop:
  org.apache.hadoop.mapred.*
  org.apache.hadoop.mapreduce.*
-
Mapper
Should implement org.apache.hadoop.mapred.Mapper<K1,V1,K2,V2>
void configure(JobConf job)
  All the parameters specified in the XML configuration files are available here.
  Any parameters set explicitly are also available.
  Called before data processing starts.
void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter)
  Data processing starts.
void close()
  Should close any files, DB connections etc.
Reporter provides extra information about the mapper to the TaskTracker (TT)
-
Mappers (default)
Mapper: Functionality
IdentityMapper: implements Mapper<K,V,K,V>
  Whatever input it gets, it passes to the output unchanged.
InverseMapper: implements Mapper<K,V,V,K>
  Swaps the key and value from the input to the output.
TokenCountMapper: implements Mapper<K,Text,Text,LongWritable>
  Generates (token, 1) for each token of the tokenized input value.
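The (token, 1) behaviour described for TokenCountMapper can be sketched in plain Java. This is a simulation under stated assumptions, not Hadoop's implementation: TokenCountSketch and its map method are hypothetical names, String and Long stand in for Text and LongWritable, and the returned list stands in for the OutputCollector.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

class TokenCountSketch {
    // Emits one (token, 1) pair per whitespace-separated token in the value.
    // The input key is ignored, so it is not passed in at all.
    static List<Map.Entry<String, Long>> map(String value) {
        List<Map.Entry<String, Long>> output = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(value);
        while (tokens.hasMoreTokens()) {
            output.add(new SimpleEntry<>(tokens.nextToken(), 1L));
        }
        return output;
    }

    public static void main(String[] args) {
        // prints [to=1, be=1, or=1, not=1, to=1, be=1]
        System.out.println(map("to be or not to be"));
    }
}
```

Repeated tokens are emitted repeatedly; adding the ones up is left to the reducer.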
-
Reducer
Should implement org.apache.hadoop.mapred.Reducer
The incoming data is sorted by key, and all the values for a key are grouped together
The reduce function is called once for every key, in sorted order
void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)
Reporter provides extra information about the reducer to the TaskTracker (TT)
-
Reducer (default)
Reducer: Functionality
IdentityReducer<K,V>: implements Reducer<K,V,K,V>
  Maps inputs directly to outputs.
LongSumReducer<K>: implements Reducer<K,LongWritable,K,LongWritable>
  Determines the sum of all values corresponding to the given key.
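The sort, group and sum steps can be simulated in plain Java. This is a sketch under assumptions: LongSumSketch, pair, reduce and shuffleAndReduce are hypothetical names, a TreeMap stands in for the framework's sort phase, and String and Long stand in for the Hadoop key and value types.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class LongSumSketch {
    // Convenience factory for one (key, value) pair of map output.
    static Map.Entry<String, Long> pair(String key, long value) {
        return new SimpleEntry<>(key, value);
    }

    // What a sum reducer does for a single key: add up all of its values.
    static long reduce(Iterator<Long> values) {
        long sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // Groups the pairs by key in sorted key order, then reduces each group.
    static Map<String, Long> shuffleAndReduce(List<Map.Entry<String, Long>> mapOutput) {
        SortedMap<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> p : mapOutput) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        Map<String, Long> reduced = new LinkedHashMap<>();  // keeps the sorted call order
        for (Map.Entry<String, List<Long>> group : grouped.entrySet()) {
            reduced.put(group.getKey(), reduce(group.getValue().iterator()));
        }
        return reduced;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, to=2}
        System.out.println(shuffleAndReduce(Arrays.asList(
            pair("to", 1), pair("be", 1), pair("not", 1), pair("to", 1), pair("be", 1))));
    }
}
```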
-
Partitioner
Implements Partitioner<K,V>
  configure()
  int getPartition(…)
    0 <= return value < number of reducers
Generally, implement a Partitioner so that the same keys go to one reducer
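A minimal sketch of such a partition function, assuming hash-based partitioning (the same arithmetic Hadoop's HashPartitioner uses). HashPartitionSketch is a hypothetical name, and the method is shown as a plain static function rather than as an implementation of the Partitioner interface.

```java
class HashPartitionSketch {
    // Masking with Integer.MAX_VALUE clears the sign bit, so a negative
    // hashCode still yields a result in the range [0, numReducers).
    static int getPartition(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("apple", 4) == getPartition("apple", 4));  // prints true
        System.out.println(getPartition(-5, 4));  // prints 3
    }
}
```

Because the result depends only on the key's hashCode, equal keys always land on the same reducer.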
-
Reading and Writing
Generally two kinds of files in Hadoop:
Text (plain, XML, html …)
Binary (Sequence)
  It is a Hadoop-specific compressed binary file format.
  Optimized to transfer output from one MR job to another MR job.
We can customize
-
Input Format
HDFS block size
Input splits
-
Blocks in HDFS
A big file is divided into multiple blocks and stored in HDFS
This is a physical division of data
dfs.block.size (64 MB default)
[Figure: a LARGE FILE divided into BLOCK 1, BLOCK 2, BLOCK 3 and BLOCK 4]
-
Input Splits and Records
Input split: a chunk of data processed by a mapper
  Further divided into records
  Map processes these records
  A record is a key/value pair
How to correlate to a DB table:
  Group of rows: split
  Row: record
[Figure: a table of records with Key and Value columns, labelled LOGICAL DIVISION]
-
InputSplit
    public interface InputSplit extends Writable {
        long getLength() throws IOException;
        String[] getLocations() throws IOException;
    }
It doesn't contain the data, only the locations where the data is present
Helps the JobTracker to arrange TaskTrackers (data locality)
getLength: the split with the greater length will be executed first
-
InputFormat
How do we get the data to the mapper?
Input splits, and how the splits are divided into records, are taken care of by the InputFormat.
    public interface InputFormat<K, V> {
        InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
        RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
    }
-
InputFormat
Mapper: getRecordReader() is called to get the RecordReader
Once the record reader is obtained, the map method is called repeatedly, once per record, until the end of the split
-
RecordReader
    K key = reader.createKey();
    V value = reader.createValue();
    while (reader.next(key, value)) {
        mapper.map(key, value, output, reporter);
    }
-
Job Submission (retrospection)
JobClient running the job:
  Gets input splits by calling getSplits() in InputFormat
  Determines data locations for the splits
  Sends these locations to the JobTracker
JobTracker assigns mappers appropriately.
  Data locality
-
Built-in InputFormats
-
FileInputFormat
Base class for all implementations of InputFormat which use files as input
Defines:
  Which files to include for the job
  Implementation for generating splits
-
FileInputFormat
A set of files is converted into a number of splits
Only large files are split. How large?
  Larger than the block size
Can we control it?
Property               Description                                          Default value
mapred.min.split.size  The smallest valid size in bytes for a file split.   1
mapred.max.split.size  The largest valid size in bytes for a file split.    Long.MAX_VALUE
dfs.block.size         The size of a block in HDFS in bytes.                64 MB
-
Calculating Split Size
mapred.min.split.size  mapred.max.split.size  dfs.block.size  split size
1                      Long.MAX_VALUE         64 MB           64 MB
1                      Long.MAX_VALUE         128 MB          128 MB
128 MB                 Long.MAX_VALUE         64 MB           128 MB
1                      32 MB                  64 MB           32 MB
An application may impose a minimum split size greater than the block size.
  There is no good reason to do that
  Data locality is lost
-
FileInputFormat
Min split size: we might set it larger than the block size
  But data locality may then be lost to some extent
Split size is calculated by the formula:
  max(minimumSize, min(maximumSize, blockSize))
By default:
  minimumSize < blockSize < maximumSize
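The formula can be checked directly with a few lines of plain Java. SplitSizeSketch and computeSplitSize are hypothetical names for this sketch, and the sample values assume a minimum of 1 byte and a maximum of Long.MAX_VALUE unless overridden.

```java
class SplitSizeSketch {
    static final long MB = 1024L * 1024L;

    // max(minimumSize, min(maximumSize, blockSize)), as in the formula above.
    static long computeSplitSize(long minimumSize, long maximumSize, long blockSize) {
        return Math.max(minimumSize, Math.min(maximumSize, blockSize));
    }

    public static void main(String[] args) {
        // Defaults with a 64 MB block: the split equals the block.
        System.out.println(computeSplitSize(1, Long.MAX_VALUE, 64 * MB) / MB);         // prints 64
        // A minimum above the block size wins, at the cost of data locality.
        System.out.println(computeSplitSize(128 * MB, Long.MAX_VALUE, 64 * MB) / MB);  // prints 128
        // A maximum below the block size yields splits smaller than a block.
        System.out.println(computeSplitSize(1, 32 * MB, 64 * MB) / MB);                // prints 32
    }
}
```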
-
File Information in the mapper
configure(JobConf job)
Property name     Description
map.input.file    The path of the input file being processed
map.input.start   The byte offset of the start of the split
map.input.length  The length of the split in bytes
-
TextInputFormat
Default FileInputFormat
  Each line is a value
  The byte offset is the key
Example: run the identity mapper program
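To make the key/value shape concrete, here is a plain-Java sketch that turns a block of text into (byte offset, line) records. It is a simulation under assumptions, not Hadoop's line reader: TextRecordsSketch and records are hypothetical names, only "\n" is treated as a line terminator, and the text is assumed to be UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TextRecordsSketch {
    // Returns one (byte offset of line start, line content) record per line.
    static List<Map.Entry<Long, String>> records(String text) {
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        List<Map.Entry<Long, String>> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= data.length; i++) {
            boolean atEnd = (i == data.length);
            // A record ends at a newline, or at end of input if bytes remain.
            if (atEnd ? start < data.length : data[i] == '\n') {
                String line = new String(data, start, i - start, StandardCharsets.UTF_8);
                out.add(new SimpleEntry<>(Long.valueOf(start), line));
                start = i + 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [0=ab, 3=cde, 7=, 8=f]
        System.out.println(records("ab\ncde\n\nf"));
    }
}
```

An identity mapper fed with these records would emit exactly the same pairs, offsets included.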
-
Input Splits and HDFS Blocks
Logical records defined by FileInputFormat usually don't fit neatly into HDFS blocks.
Every file is written as a sequence of bytes.
When 64 MB is reached, a new block is started.
At that point a logical record may be only half written.
So the other half of the logical record goes into the next HDFS block.
-
Input Splits and HDFS Blocks
So even with data locality some remote reading is done: a slight overhead.
Splits give logical record boundaries
Blocks give physical boundaries (size)
-
Small Files
Files which are very small are inefficient in the mapper phase
Imagine 1 GB of data:
  64 MB files: 16 files, 16 mappers
  100 KB files: about 10,000 files, about 10,000 mappers
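The arithmetic behind these numbers is a ceiling division. The sketch below assumes one map task per file (each file no larger than a block) and binary units; SmallFilesSketch and mapTasks are hypothetical names.

```java
class SmallFilesSketch {
    static final long KB = 1024L;
    static final long MB = 1024L * KB;
    static final long GB = 1024L * MB;

    // Number of files (and, at one split per file, of map tasks) needed to
    // hold totalBytes in files of fileBytes each, rounded up.
    static long mapTasks(long totalBytes, long fileBytes) {
        return (totalBytes + fileBytes - 1) / fileBytes;
    }

    public static void main(String[] args) {
        System.out.println(mapTasks(GB, 64 * MB));   // prints 16
        System.out.println(mapTasks(GB, 100 * KB));  // prints 10486
    }
}
```

The same gigabyte of input costs several hundred times as many task start-ups when it arrives as 100 KB files.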
-
CombineFileInputFormat
Packs many files into a single split
  Data locality is taken into consideration
MR performs best when operating at the disk transfer rate, not at the seek rate
This helps in processing large files also
-
NLineInputFormat
Same as TextInputFormat
Each split is guaranteed to have N lines
mapred.line.input.format.linespermap
-
KeyValueTextInputFormat
Each line in the text file is a record
The first separator character divides key and value
  Default is '\t'
Controlling property:
  key.value.separator.in.input.line
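Splitting on the first separator only can be sketched in a few lines of plain Java. KeyValueLineSketch and split are hypothetical names; treating a line without any separator as a key with an empty value is an assumption of this sketch.

```java
class KeyValueLineSketch {
    // Splits a line at the FIRST occurrence of the separator:
    // everything before it is the key, everything after it is the value.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };  // no separator: whole line is the key
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("user42\tclicked\thome", '\t');
        System.out.println(kv[0]);                // prints user42
        System.out.println(kv[1].contains("\t")); // prints true: later tabs stay in the value
    }
}
```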
-
SequenceFileInputFormat<K,V>
InputFormat for reading sequence files
  User-defined key K
  User-defined value V
They are splittable files. Well suited for MR
They support compression
They can store arbitrary types
-
OutputFormat
-
TextOutputFormat
Keys and values are stored separated by '\t' by default.
  mapred.textoutputformat.separator is the controlling parameter
Counterpart of KeyValueTextInputFormat
Can suppress the key or the value by using NullWritable
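The output line layout, including suppression of one side, can be sketched in plain Java. This is an illustration under assumptions: TextOutputLineSketch and formatLine are hypothetical names, a Java null stands in for NullWritable, and returning an empty string when both sides are absent is a choice of this sketch.

```java
class TextOutputLineSketch {
    // Builds "key<separator>value"; a null key or value (standing in for
    // NullWritable) is left out together with the separator.
    static String formatLine(Object key, Object value, String separator) {
        boolean hasKey = key != null;
        boolean hasValue = value != null;
        if (hasKey && hasValue) {
            return key + separator + value;
        }
        if (hasKey) {
            return key.toString();
        }
        if (hasValue) {
            return value.toString();
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(formatLine("hadoop", 3, ","));  // prints hadoop,3
        System.out.println(formatLine(null, 3, "\t"));     // prints 3
        System.out.println(formatLine("hadoop", null, "\t")); // prints hadoop
    }
}
```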