Apache Hadoop - A Deep Dive (Part 2 - MapReduce)
DESCRIPTION
Demystifying the MapReduce compute paradigm in Hadoop.

TRANSCRIPT
Map Reduce
Introduction
Agenda
- Recap
- Definition
- Analogy
- Phases: Map & Reduce
Recap
- Access speed did not keep up with storage capacity
- Processing data in parallel is better
- Cluster architecture is apt for Hadoop
- How Hadoop got started
- HDFS architecture (block size and replication)
- Name Node and Secondary Name Node
- 5,000-foot overview of how HDFS writes happen
Definition
MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner.
Key points: framework; write applications; process large data; structured or unstructured; process data in parallel; reliable; fault-tolerant.
Census - A MapReduce Analogy
Map (each region counts its areas independently): E-Sarjapur, E-K.R.Puram, N-Yelahanka, S-J P Nagar, N-Hebbal, W-Rajajinagar
Sort (into alphabetical ranges A-M and N-Z)
Merge: Hebbal, J P Nagar, K.R.Puram, Rajajinagar, Sarjapur, Yelahanka
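The analogy can be sketched as a small simulation (hypothetical helper code, not any Hadoop API): each region "maps" its areas independently, the results are sorted into alphabetical ranges, and the sorted ranges are merged.

```javascript
// Hypothetical simulation of the census analogy: map, sort, merge.
// Each region reports (maps) the areas it counted.
var regionReports = {
    East:  ["Sarjapur", "K.R.Puram"],
    North: ["Yelahanka", "Hebbal"],
    South: ["J P Nagar"],
    West:  ["Rajajinagar"]
};

// "Map" phase: every region emits its entries independently.
var emitted = [];
Object.keys(regionReports).forEach(function (region) {
    regionReports[region].forEach(function (area) {
        emitted.push(area);
    });
});

// "Sort" phase: split the entries into A-M and N-Z buckets, sort each.
var aToM = emitted.filter(function (a) { return a[0] <= "M"; }).sort();
var nToZ = emitted.filter(function (a) { return a[0] > "M"; }).sort();

// "Merge" phase: concatenating the sorted buckets yields the fully sorted list.
var merged = aToM.concat(nToZ);
console.log(merged.join(", "));
```

Because each bucket covers a disjoint key range, concatenating the independently sorted buckets gives a globally sorted result, which is exactly why MapReduce partitions keys before sorting.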
Phase: Map & Reduce
[Diagram: input splits (1-5) are read from HDFS by Mappers running on Data Nodes / Task Trackers; mapper output passes through Sort and Shuffle to the Reducers, whose aggregated results are written back to HDFS.]
Word Count using MapReduce
Stages: Input → Splitting → Mapping → Shuffling → Reducing → Final Result

Input: "Near ear here / Here there Hear / Ear Dear There"
Splitting:
  Split 1: Near ear here
  Split 2: Here there Hear
  Split 3: Ear Dear There
Mapping:
  (Near,1) (ear,1) (here,1)
  (Here,1) (there,1) (Hear,1)
  (Ear,1) (Dear,1) (There,1)
Shuffling (keys grouped case-insensitively):
  Ear → 1,1    Dear → 1
  Here → 1,1   Hear → 1
  Near → 1     There → 1,1
Reducing:
  (Ear,2) (Dear,1) (Here,2) (Hear,1) (There,2) (Near,1)
Final Result: Ear 2, Dear 1, Here 2, Hear 1, There 2, Near 1
Input to Mapper: <K1,V1> → Output from Mapper: <K2,V2>
Input to Reducer: <K2,(V2,V2,...)> → Output from Reducer: <K3,V3>
How does it work - Runtime
// MapReduce functions in JavaScript
var map = function (key, value, context) {
    // Split the line on non-letters and emit (word, 1) for each word.
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    // Sum all the counts emitted for this key.
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
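The functions above expect a `context` object with a `write` method and a `values` iterator with `hasNext`/`next`. A minimal mock of those (an assumption; the real MapReduce runtime supplies its own) lets the snippet be exercised locally against the word-count input:

```javascript
// The map and reduce functions from the slide.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical mock of the runtime's values iterator.
function makeIterator(arr) {
    var i = 0;
    return {
        hasNext: function () { return i < arr.length; },
        next: function () { return arr[i++]; }
    };
}

// Map phase: collect emitted pairs, grouped by key (a stand-in for shuffle).
var groups = {};
var mapContext = {
    write: function (k, v) {
        if (!groups[k]) { groups[k] = []; }
        groups[k].push(v);
    }
};
map(null, "Near ear here Here there Hear Ear Dear There", mapContext);

// Reduce phase: run reduce over each key's grouped values.
var result = {};
var reduceContext = { write: function (k, v) { result[k] = v; } };
Object.keys(groups).forEach(function (k) {
    reduce(k, makeIterator(groups[k]), reduceContext);
});
console.log(result);
```

The grouping done by `mapContext.write` plays the role of the shuffle: all values for one key end up in a single list before `reduce` sees them.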
Roles and Responsibilities
Job Client
- Submits the job to the Job Tracker
Job Tracker - orchestrates jobs
- Queries the Name Node for data locations
- Creates an execution plan
- Submits tasks to Task Trackers
- Manages the phases (Map, Shuffle & Reduce)
- Updates job status
Task Tracker - executes tasks
- Reports progress
Remember HDFS
[Diagram: blocks 1-6 spread across RACK 1 and RACK 2 DataNodes]
File metadata:
  /user/kc/data01.txt - Blocks 1, 2, 3, 4
  /user/apb/data02.txt - Blocks 5, 6
Block placements:
  Block 1: R1DN01, R1DN02, R2DN01
  Block 2: R1DN01, R1DN02, R2DN03
  Block 3: R1DN02, R1DN03, R2DN01
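The placements listed follow a rack-aware pattern: with a replication factor of 3, two replicas sit on one rack and one on another, so a whole-rack failure never loses a block. A simplified sketch of that rule, consistent with the slide's listings (the real HDFS placement policy is more nuanced, e.g. it prefers the writer's own node first):

```javascript
// Simplified rack-aware replica placement (replication factor 3):
// two replicas on one rack, the third on a different rack.
// Node names follow the slide's convention (R1DN01 = rack 1, data node 01).
function placeReplicas(localRack, remoteRack) {
    return [
        localRack[0],   // first replica: a node on the local rack
        localRack[1],   // second replica: another node on the same rack
        remoteRack[0]   // third replica: a node on a remote rack
    ];
}

var rack1 = ["R1DN01", "R1DN02", "R1DN03"];
var rack2 = ["R2DN01", "R2DN02", "R2DN03"];
console.log(placeReplicas(rack1, rack2)); // matches Block 1's placement above
```

This placement also matters for MapReduce scheduling: the Job Tracker tries to run each map task on a node (or at least a rack) that already holds a replica of its input block.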
Data Flow: Client → Job Tracker → Task Tracker
- Input Splits: byte ranges and storage locations come from the InputSplit
- RecordReader: turns a split into (key, value) records
- map()
- Combiner: optional local pre-aggregation on the map side
- Partitioner: routes each key to a reducer
- Shuffle & Sort
- reduce()
- OutputFormat: writes the final results
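Of these stages, the Partitioner decides which reducer receives each key; a common default is a hash of the key modulo the number of reducers. A hypothetical sketch (the hash function here is illustrative, not Hadoop's):

```javascript
// Hypothetical hash partitioner: route each key to a reducer by
// hashing the key string and taking it modulo the reducer count.
function partition(key, numReducers) {
    var hash = 0;
    for (var i = 0; i < key.length; i++) {
        hash = (hash * 31 + key.charCodeAt(i)) | 0;  // simple rolling string hash
    }
    return Math.abs(hash) % numReducers;
}

// Every occurrence of the same key lands on the same reducer,
// which is what lets a reducer see all values for its keys.
console.log(partition("here", 4), partition("there", 4));
```

Because the function is deterministic, "here" emitted by any mapper always goes to the same reducer; that invariant is what makes the per-key aggregation in reduce() correct.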
Reference: http://en.wikipedia.org/wiki/File:Hadoop_1.png
[Diagram: MapReduce layer on top of the HDFS layer]
Hadoop Distributed Architecture: Master-Slave
Questions / Feedback
Support Team's blog: http://blogs.msdn.com/b/bigdatasupport/
Facebook Page: https://www.facebook.com/MicrosoftBigData
Facebook Group: https://www.facebook.com/groups/bigdatalearnings/
Twitter: @debarchans
Twitter: @confusionblinds
Read more:
http://en.wikipedia.org/wiki/Hadoop
http://en.wikipedia.org/wiki/Big_data
Next Session:
Apache Hadoop – Setup Lab