Apache Hadoop - A Deep Dive (Part 2 - MapReduce)

Map Reduce Introduction

Uploaded by debarchan-sarkar on 17-Nov-2014


DESCRIPTION

Demystifying the Map Reduce compute paradigm in Hadoop.

TRANSCRIPT

Page 1: Apache Hadoop - A Deep Dive (Part 2 - MapReduce)

Map Reduce

Introduction

Page 2: Apache Hadoop - A Deep Dive (Part 2 - MapReduce)

Agenda

Recap

Definition

Analogy

Phase: Map & Reduce

Page 3: Apache Hadoop - A Deep Dive (Part 2 - MapReduce)

Recap

Access speed did not keep up with storage capacity

Processing Data in Parallel is better

Cluster architecture is a natural fit for Hadoop

How Hadoop got started

HDFS architecture: block size and replication (see the quick sizing sketch after this list)

Name Node and Secondary Name Node

A 5,000-foot overview of how HDFS writes happen
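A quick back-of-the-envelope sketch in plain JavaScript for the block size and replication recap, assuming the Hadoop 1.x defaults of a 64 MB block size (dfs.block.size) and a replication factor of 3 (dfs.replication); the 1 GB file is just an example:

// Rough sizing arithmetic for the recap: how many blocks and replicas a file
// occupies under the (assumed) Hadoop 1.x defaults.
var blockSizeMB = 64;   // default dfs.block.size in Hadoop 1.x
var replication = 3;    // default dfs.replication
var fileSizeMB  = 1000; // example: a ~1 GB file

var blocks   = Math.ceil(fileSizeMB / blockSizeMB);  // 16 blocks
var replicas = blocks * replication;                 // 48 block replicas
console.log(blocks + " blocks, " + replicas + " block replicas, ~" +
            (fileSizeMB * replication) + " MB of raw storage");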

Page 4: Apache Hadoop - A Deep Dive (Part 2 - MapReduce)

Definition

MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner.

Key terms called out on the slide: Framework; Write Applications; Process Large Data; Structured or Unstructured; Process Data in Parallel; Reliable; Fault-tolerant
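One way to read this definition is as a contract between the application and the framework: the application supplies only two functions, and the framework handles splitting, scheduling, shuffling and fault tolerance. A minimal sketch of that contract in plain JavaScript (not a specific Hadoop API; the function bodies are illustrative, and concrete values appear in the word-count slide later):

// Illustrative shape of the two user-supplied functions:
//   map    : (k1, v1)            -> list of (k2, v2)
//   reduce : (k2, [v2, v2, ...]) -> list of (k3, v3)
function map(k1, v1) {
    // Emit zero or more intermediate (k2, v2) pairs for one input record;
    // here every record is emitted under its own value with a count of 1.
    return [[v1, 1]];
}

function reduce(k2, v2List) {
    // Combine all intermediate values that share the same key.
    var sum = 0;
    for (var i = 0; i < v2List.length; i++) { sum += v2List[i]; }
    return [[k2, sum]];
}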

Page 5: Apache Hadoop - A Deep Dive (Part 2 - MapReduce)

Census: a MapReduce Analogy

[Diagram] Census enumerators count their own areas independently: E-Sarjapur, E-K.R.Puram, N-Yelahanka, N-Hebbal, S-J P Nagar, W-Rajajinagar.

Sort: each partial result is sorted locally (into A-M and N-Z buckets).

Merge: the sorted parts are merged into one final list: Hebbal, J P Nagar, K.R.Puram, Rajajinagar, Sarjapur, Yelahanka.
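The analogy maps cleanly onto code: each enumerator sorts its own localities in isolation (the map side), and the sorted partial lists are then merged into one result (the reduce side). A toy plain-JavaScript sketch using the localities from the slide (the concat-and-sort at the end stands in for a real k-way merge):

// Toy version of the census analogy: each enumerator (mapper) sorts the
// localities it covered independently, then the partial results are merged.
var east  = ["Sarjapur", "K.R.Puram"];
var north = ["Yelahanka", "Hebbal"];
var south = ["J P Nagar"];
var west  = ["Rajajinagar"];

// "Map" side: each region is sorted on its own, and could run in parallel.
var sortedRegions = [east, north, south, west].map(function (region) {
    return region.slice().sort();
});

// "Reduce" side: merge the pre-sorted partial lists into the final answer.
var merged = [].concat.apply([], sortedRegions).sort();
console.log(merged.join(", "));
// Hebbal, J P Nagar, K.R.Puram, Rajajinagar, Sarjapur, Yelahanka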

Page 6: Apache Hadoop - A Deep Dive (Part 2 - MapReduce)

Phase: Map & Reduce

[Diagram] Input data stored in HDFS is broken into input splits (1-5); Mappers running on the Data Nodes / Task Trackers process the splits; their intermediate output goes through Sort and Shuffle; Reducers perform the aggregation; the final output is written back to HDFS.
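The Sort and Shuffle step in the middle of this flow is where the framework groups intermediate mapper output by key before handing it to the reducers. A rough plain-JavaScript simulation of that grouping (illustrative only, not the actual Hadoop implementation):

// Simulated sort-and-shuffle: take (key, value) pairs emitted by all mappers
// and group the values by key, so each reducer sees (key, [values...]).
function shuffle(mapperOutput) {
    var grouped = {};
    mapperOutput.forEach(function (pair) {
        var key = pair[0], value = pair[1];
        if (!grouped[key]) { grouped[key] = []; }
        grouped[key].push(value);
    });
    // Return keys in sorted order, as the framework does before the reduce phase.
    return Object.keys(grouped).sort().map(function (key) {
        return [key, grouped[key]];
    });
}

console.log(shuffle([["ear", 1], ["here", 1], ["here", 1], ["ear", 1]]));
// [["ear", [1, 1]], ["here", [1, 1]]]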

Page 7: Apache Hadoop - A Deep Dive (Part 2 - MapReduce)

Word Count using MapReduce: Input, Splitting, Mapping, Shuffling, Reducing, Final Result

Input: Near ear here Here there Hear Ear Dear There

Splitting: "Near ear here" | "Here there Hear" | "Ear Dear There"

Mapping: (Near,1) (ear,1) (here,1) | (Here,1) (there,1) (Hear,1) | (Ear,1) (Dear,1) (There,1)

Shuffling (group by word, case-insensitive): Ear (1,1) | Dear (1) | Here (1,1) | Hear (1) | There (1,1) | Near (1)

Reducing: (Ear,2) (Dear,1) (Here,2) (Hear,1) (There,2) (Near,1)

Final result: Ear 2, Dear 1, Here 2, Hear 1, There 2, Near 1

Input to mapper: <K1,V1>; output from mapper: <K2,V2>; input to reducers: <K2,(V2,V2,...)>; output from reducers: <K3,V3>

Page 8: Apache Hadoop - A Deep Dive (Part 2 - MapReduce)

How does it work?

RUNTIME

// Map Reduce functions in JavaScript
var map = function (key, value, context) {
    // Split the input line into words on any non-alphabetic character
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            // Emit each word, lower-cased, with a count of 1
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    // Sum all of the counts emitted for this word
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    // Emit the word together with its total count
    context.write(key, sum);
};
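The context and values objects above are supplied by the MapReduce runtime. For a quick sanity check outside a cluster, they can be stubbed out in plain JavaScript; this hypothetical harness re-runs the word-count example from the previous slide:

// Minimal local stand-ins for the runtime-provided objects, so the map and
// reduce functions above can be exercised without a cluster (illustrative only).
var emitted = [];
var mapContext = { write: function (k, v) { emitted.push([k, v]); } };

["Near ear here", "Here there Hear", "Ear Dear There"].forEach(function (line, i) {
    map(i, line, mapContext);   // key = split number, value = line of text
});

// Group intermediate pairs by key (a stand-in for the shuffle), then reduce.
var groups = {};
emitted.forEach(function (p) { (groups[p[0]] = groups[p[0]] || []).push(p[1]); });

Object.keys(groups).sort().forEach(function (word) {
    var i = 0, vals = groups[word];
    var values = {
        hasNext: function () { return i < vals.length; },
        next:    function () { return vals[i++]; }
    };
    reduce(word, values, { write: function (k, v) { console.log(k, v); } });
});
// dear 1, ear 2, hear 1, here 2, near 1, there 2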

Page 9: Apache Hadoop - A Deep Dive (Part 2 - MapReduce)

Roles and Responsibilities

Job Client – submits the job to the Job Tracker

Job Tracker – orchestrates jobs: queries the Name Node for data locations, creates the execution plan, submits tasks to the Task Trackers, manages the phases (Map, Shuffle & Reduce), and updates status

Task Tracker – executes the tasks and reports progress

Page 10: Apache Hadoop - A Deep Dive (Part 2 - MapReduce)

Remember HDFS

RACK 1 - DataNodes; RACK 2 - DataNodes

File metadata (held on the Name Node): /user/kc/data01.txt - Blocks 1,2,3,4; /user/apb/data02.txt - Blocks 5,6

Block placement (each block replicated across racks):
Block 1: R1DN01, R1DN02, R2DN01
Block 2: R1DN01, R1DN02, R2DN03
Block 3: R1DN02, R1DN03, R2DN01
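This block-to-DataNode mapping is what the Job Tracker consults (via the Name Node) when it creates the execution plan, so that map tasks run close to the data. A toy plain-JavaScript sketch of that lookup, using the placements from the slide (node names follow the slide's R#DN## convention; the function itself is hypothetical, not a Hadoop API):

// Toy model of the block placement metadata from the slide, and the
// data-locality question the Job Tracker asks when planning map tasks.
var blockLocations = {
    "Block1": ["R1DN01", "R1DN02", "R2DN01"],
    "Block2": ["R1DN01", "R1DN02", "R2DN03"],
    "Block3": ["R1DN02", "R1DN03", "R2DN01"]
};

// Prefer a Task Tracker that is co-located with a replica of the block
// (data-local); otherwise the task would read the block over the network.
function pickTaskTracker(block, liveTaskTrackers) {
    var replicas = blockLocations[block] || [];
    for (var i = 0; i < replicas.length; i++) {
        if (liveTaskTrackers.indexOf(replicas[i]) !== -1) { return replicas[i]; }
    }
    return liveTaskTrackers[0]; // fall back to any available node (non-local)
}

console.log(pickTaskTracker("Block3", ["R2DN01", "R1DN01"])); // "R2DN01"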

Page 11: Apache Hadoop - A Deep Dive (Part 2 - MapReduce)

Data Flow (Client, Job Tracker, Task Tracker)

Splits – uses the byte range and storage locations from the InputSplit

RecordReader

MAP()

Combiner

Partitioner

Shuffle & Sort

Reduce

Output Format
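Two optional stages in this flow, the Combiner and the Partitioner, are easy to picture for the word-count job: the combiner pre-aggregates each mapper's local output so less data crosses the network during the shuffle, and the partitioner decides which reducer receives each key. A hedged plain-JavaScript sketch of both ideas (the hash below is a simplified stand-in for Hadoop's default hash-based partitioning, not its actual implementation):

// Combiner: a "mini reduce" run on one mapper's local output, so fewer
// (word, 1) pairs have to be shuffled over the network.
function combine(localPairs) {
    var sums = {};
    localPairs.forEach(function (p) { sums[p[0]] = (sums[p[0]] || 0) + p[1]; });
    return Object.keys(sums).map(function (w) { return [w, sums[w]]; });
}

// Partitioner: maps a key to one of the reducers; the common approach is a
// hash of the key modulo the number of reducers, so equal keys always land
// on the same reducer.
function partition(key, numReducers) {
    var hash = 0;
    for (var i = 0; i < key.length; i++) {
        hash = (hash * 31 + key.charCodeAt(i)) | 0;
    }
    return Math.abs(hash) % numReducers;
}

console.log(combine([["here", 1], ["here", 1], ["ear", 1]])); // [["here",2],["ear",1]]
console.log(partition("here", 2)); // always the same reducer index for "here"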

Page 12: Apache Hadoop - A Deep Dive (Part 2 - MapReduce)

Hadoop Distributed Architecture: Master / Slave

MapReduce Layer

HDFS Layer

Reference: http://en.wikipedia.org/wiki/File:Hadoop_1.png

Page 13: Apache Hadoop - A Deep Dive (Part 2 - MapReduce)

Questions / Feedback

Support Team's blog: http://blogs.msdn.com/b/bigdatasupport/

Facebook Page: https://www.facebook.com/MicrosoftBigData

Facebook Group: https://www.facebook.com/groups/bigdatalearnings/

Twitter: @debarchans

Twitter: @confusionblinds

Read more:

http://en.wikipedia.org/wiki/Hadoop

http://en.wikipedia.org/wiki/Big_data

Next Session:

Apache Hadoop – Setup Lab