
The Map Reduce Programming Model

Boris Farber

boris.farber@gmail.com

04/12/2023

About Me

• Senior Software Engineer at Varonis
  – Data Governance, Leakage and Synchronization
  – Android, Windows
• Author of Profiterole – a Map Reduce library for Android

• “Android System Programming” book reviewer

Credits

• Avner Ben
  – Chief Architect at Elisra Systems
  – Skill Tree ® designer
• Nathan Marz
  – Author of the “Big Data” book at Manning
  – Software Engineer at Twitter
  – Creator of Storm and Cascalog
• Matei Zaharia – http://www.cs.berkeley.edu/~matei/

Plan

• Introduction to Big Data – the problem
• Map Reduce – the solution
  – Map/Reduce functions
  – Sample problems
  – Analysis
  – Cascalog

• Lambda Architecture - getting the big picture

• Summary

What is Big Data

• Big Data is a new programming paradigm that supports data-flow programming where traditional RDBMS- and SQL-based systems fail, not only to scale but also to provide the desired functionality.

• For example, it is the backbone of Facebook/Twitter/LinkedIn/Google.

• The common data pattern for the companies above is a huge amount of unstructured data such as tweets, likes, and network updates.

Big Data

• The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures.

• To gain value from this data, you must choose an alternative way to process it.

• Process the data to keep pre-computed query results (we don’t want to perform a petabyte-scale search for every query).

Handling Huge Data Amounts

• Cost-efficiency:
  – Commodity machines (cheap, but unreliable)
  – Commodity network
  – Automatic fault-tolerance (fewer administrators)
  – Easy to use (fewer programmers)

• Large-Scale Data Processing with commodity hardware

• Want to use 1000s of CPUs/Processes/Threads
• But don’t want the hassle of managing things

Map Reduce

• A programming model (and its associated implementation) for processing large data sets.

• Exploits a large set of commodity computers
• Executes processing in a distributed manner
• Offers a high degree of transparency
• Based on a Functional Programming approach, which is inherently more scalable than an Object Oriented one!

Challenge

• The reason MapReduce is such a powerful paradigm is that programs written in terms of MapReduce are inherently scalable.

• The same program that runs on ten megabytes of data can run on ten petabytes of data. Why?

Usage

• Original paper by Google: http://research.google.com/archive/mapreduce.html

• Backbone of Hadoop – open source Apache Big Data framework

Map

[Diagram: a function is applied to each element of the original list, producing a new list]

Map

• Map is a higher-order function that applies a given function to each element of a list, returning a list of results. It is often called apply-to-all when considered in functional form.

• (defn bubble [x] (* x x))
• (map #(bubble %1) [1 3 5 7])

(1 9 25 49)
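
For comparison, here is a minimal Java sketch of the same map operation using java.util.stream; the class name MapExample is illustrative, not from the slides.

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MapExample {
    public static void main(String[] args) {
        // Apply x -> x * x to every element of the list, producing a new list.
        List<Integer> squares = Stream.of(1, 3, 5, 7)
                .map(x -> x * x)
                .collect(Collectors.toList());
        System.out.println(squares);  // [1, 9, 25, 49]
    }
}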

Reduce

[Diagram: a combining function is applied across the original list, producing a single result (1000)]

Reduce

• Reduce (also called fold or accumulate) is a family of higher-order functions that analyze a data structure and, through use of a given combining operation, recombine the results of recursively processing its constituent parts, building up a return value.

• (reduce * [1 2 3 4 5 6 6])

4320
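
The same reduction in Java, again as a hedged sketch with java.util.stream (ReduceExample is an illustrative name):

import java.util.stream.Stream;

public class ReduceExample {
    public static void main(String[] args) {
        // Combine all elements with multiplication: 1 * 2 * 3 * 4 * 5 * 6 * 6 = 4320.
        int product = Stream.of(1, 2, 3, 4, 5, 6, 6)
                .reduce(1, (a, b) -> a * b);
        System.out.println(product);  // 4320
    }
}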

Map-Reduce

1."Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

2."Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.

Word Count Naïve Approach

• You can do this sequentially by getting your fastest machine (you've got plenty lying around) and running over the text from start to finish.

• Maintain a hash map with each word you find as the key, incrementing its frequency (the value) every time you parse a word.

• Simple, straightforward and slow.
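
A minimal Java sketch of this naive, single-machine approach (class name and sample text are illustrative):

import java.util.HashMap;
import java.util.Map;

public class NaiveWordCount {
    public static void main(String[] args) {
        String text = "the quick brown fox the fox ate the mouse how now brown cow";

        // One sequential pass over the text: the word is the key, its frequency is the value.
        Map<String, Integer> freq = new HashMap<>();
        for (String word : text.split("\\s+")) {
            freq.merge(word, 1, Integer::sum);
        }
        System.out.println(freq);  // e.g. {the=3, fox=2, brown=2, ...}
    }
}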

Word Count Execution

[Diagram: word count data flow]
Input splits: “the quick brown fox”, “the fox ate the mouse”, “how now brown cow”
Map output: (the, 1) (quick, 1) (brown, 1) (fox, 1); (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1); (how, 1) (now, 1) (brown, 1) (cow, 1)
Reduce output: (ate, 1), (brown, 2), (cow, 1), (fox, 2), (how, 1), (mouse, 1), (now, 1), (quick, 1), (the, 3)
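
The same job expressed in map/reduce form, as an in-memory Java sketch of the data flow above (not the Hadoop API; class and method names are illustrative):

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {

    // "Map" step: emit a (word, 1) pair for every word in one input split.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.split("\\s+"))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // "Reduce" step: sum the values that share a key.
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> splits = List.of(
                "the quick brown fox",
                "the fox ate the mouse",
                "how now brown cow");

        // Map every split, then group the intermediate pairs by key (the "shuffle" step).
        Map<String, List<Integer>> grouped = splits.stream()
                .flatMap(s -> map(s).stream())
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce per key: prints counts such as "the, 3", "fox, 2", "brown, 2".
        grouped.forEach((w, c) -> System.out.println(w + ", " + reduce(w, c)));
    }
}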

Search

• Input: (lineNumber, line) records
• Output: lines matching a given pattern

• Map: if (line matches pattern) output(line)
• Reduce: identity function
  – Alternative: no reducer (map-only job)
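
A minimal Java sketch of this map-only job; the emit callback and names are illustrative:

import java.util.function.Consumer;
import java.util.regex.Pattern;

public class SearchMap {
    // Map: emit the whole line when it matches the pattern; no reducer is needed.
    static void map(long lineNumber, String line, Pattern pattern, Consumer<String> emit) {
        if (pattern.matcher(line).find()) {
            emit.accept(line);
        }
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("fox");
        map(1, "the quick brown fox", p, System.out::println);  // prints the line
        map(2, "how now brown cow", p, System.out::println);    // prints nothing
    }
}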

Inverted Index

• Input: (filename, text) records
• Output: list of files containing each word

• Map: foreach word in text.split(): output(word, filename)

• Combine: uniquify filenames for each word

• Reduce: def reduce(word, filenames): output(word, sort(filenames))

Inverted Index Example

[Diagram: inverted index data flow]
Input: hamlet.txt = “to be or not to be”; 12th.txt = “be not afraid of greatness”
Map output: (to, hamlet.txt) (be, hamlet.txt) (or, hamlet.txt) (not, hamlet.txt); (be, 12th.txt) (not, 12th.txt) (afraid, 12th.txt) (of, 12th.txt) (greatness, 12th.txt)
Reduce output: afraid, (12th.txt); be, (12th.txt, hamlet.txt); greatness, (12th.txt); not, (12th.txt, hamlet.txt); of, (12th.txt); or, (hamlet.txt); to, (hamlet.txt)
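
The same pipeline as an in-memory Java sketch (illustrative names only; a real job would shuffle the intermediate pairs across nodes):

import java.util.Arrays;
import java.util.Map;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        Map<String, String> files = Map.of(
                "hamlet.txt", "to be or not to be",
                "12th.txt", "be not afraid of greatness");

        // Map: emit a (word, filename) pair for each word in each file.
        // Combine/Reduce: collect the unique filenames per word, sorted (TreeSet).
        Map<String, TreeSet<String>> index = files.entrySet().stream()
                .flatMap(e -> Arrays.stream(e.getValue().split("\\s+"))
                        .map(w -> Map.entry(w, e.getKey())))
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue,
                                Collectors.toCollection(TreeSet::new))));

        index.forEach((word, names) -> System.out.println(word + ", " + names));
        // e.g. be, [12th.txt, hamlet.txt]
    }
}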

Map Reduce

• Map:
  – Accepts input key/value pairs
  – Emits intermediate key/value pairs

• Reduce:
  – Accepts intermediate key/value* pairs
  – Emits output key/value pairs

[Diagram: very big data → MAP → partitioning function → REDUCE → result]

Programming Model

• Data type: key-value records

• Map function:

(K_in, V_in) → list(K_inter, V_inter)

• Reduce function:

(K_inter, list(V_inter)) → list(K_out, V_out)
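
A sketch of these signatures as Java interfaces; the type names are illustrative, not any particular framework's API:

import java.util.List;
import java.util.Map;

// (K_in, V_in) -> list(K_inter, V_inter)
interface Mapper<KIn, VIn, KInter, VInter> {
    List<Map.Entry<KInter, VInter>> map(KIn key, VIn value);
}

// (K_inter, list(V_inter)) -> list(K_out, V_out)
interface Reducer<KInter, VInter, KOut, VOut> {
    List<Map.Entry<KOut, VOut>> reduce(KInter key, List<VInter> values);
}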

Throwing Hardware: Linear Scalability

• Cheap nodes fail, especially if you have many

• Mean time between failures for 1 node = 3 years

• Mean time between failures for 1000 nodes = 1 day

• Programming distributed systems is hard: with MapReduce, users write “map” and “reduce” functions, and the system distributes the work and handles faults.

Fault Tolerance Task Crash

• If a task crashes:
  – Retry it on another node
    • OK for a map because it has no dependencies
    • OK for a reduce because map outputs are on disk
  – If the same task fails repeatedly, fail the job or ignore that input block (user-controlled)

• For these fault tolerance features to work, your map and reduce tasks must be side-effect-free

Fault Tolerance Node Crash

• If a node crashes:
  – Re-launch its current tasks on other nodes
  – Re-run any maps the node previously ran; this is necessary because their output files were lost along with the crashed node

Fault Tolerance Slow Task

• If a task is going slowly (a straggler):
  – Launch a second copy of the task on another node (“speculative execution”)
  – Take the output of whichever copy finishes first, and kill the other

• Surprisingly important in large clusters
  – Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc.
  – A single straggler may noticeably slow down a job

Key Code (async thread pool)

MapCallback<TMapInput> mapper = new MapCallback<TMapInput>();
List<HashMap<String, Integer>> maps = new LinkedList<HashMap<String, Integer>>();
int numThreads = 25;
ExecutorService pool = Executors.newFixedThreadPool(numThreads);
CompletionService<OutputUnit> futurePool =
        new ExecutorCompletionService<OutputUnit>(pool);
Set<Future<OutputUnit>> futureSet = new HashSet<Future<OutputUnit>>();

// linear addition of jobs, parallel execution
for (TMapInput m : input) {
    futureSet.add(futurePool.submit(mapper.makeWorker(m)));
}
// tasks running
pool.shutdown();

Clojure/Cascalog

• Key problem in Java – very low-level processing with a lack of high-level combinators (it took me a while to add another reducer …); with static typing and OO (a function is not a first-class citizen), it felt like writing an interpreter.

• Cascalog is a fully-featured data processing and querying library for Clojure or Java.

• The main use cases for Cascalog are processing "Big Data" on top of Hadoop or doing analysis on your local computer.

• Cascalog is a replacement for tools like Pig, Hive, and Cascading and operates at a significantly higher level of abstraction than those tools.

Benefits of Map Reduce

• Many problems can be phrased this way
• Elegant and powerful

Takeaways

• By providing a data-parallel programming model, MapReduce can control job execution in useful ways:
  – Automatic division of the job into tasks
  – Automatic load balancing
  – Recovery from failures & stragglers

• User focuses on application, not on complexities of distributed computing

Conclusions

• MapReduce programming model hides the complexity of work distribution and fault tolerance

• Principal design philosophies:
  – Make it scalable, so you can throw hardware at problems
  – Make it cheap, lowering hardware, programming and admin costs

• MapReduce is not suitable for all problems, but when it works, it may save you quite a bit of time

Suitable for your task if

• You have a cluster
• You are working with a large dataset
• You are working with independent data (or it can be assumed independent)
• The problem can be cast into map and reduce

Lambda Architecture

Lambda Architecture

• Query = Function(All Data)
• A general-purpose approach to implementing an arbitrary function on an arbitrarily large dataset and having the function return its results with low latency

• The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in real-time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.

• Key insight: the alternative approach is to pre-compute the query function – why?


Lambda Architecture

• Batch layer – pre-computed, saved queries (saves time), using Map Reduce
• Serving layer – updates
• Speed layer – combined result (both data in the database plus updates not yet inserted)
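
A toy Java sketch of the idea above: a query is answered by merging a batch-layer view (rebuilt periodically by a Map Reduce job) with the recent updates tracked by the speed layer. All names and the example key are illustrative.

import java.util.HashMap;
import java.util.Map;

public class LambdaSketch {
    // Batch view: pre-computed by a Map Reduce job over the master dataset.
    private final Map<String, Long> batchView = new HashMap<>();
    // Speed view: incremental results for data not yet absorbed into the batch view.
    private final Map<String, Long> speedView = new HashMap<>();

    // Query = Function(All Data), answered with low latency by combining both views.
    public long query(String key) {
        return batchView.getOrDefault(key, 0L) + speedView.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        LambdaSketch views = new LambdaSketch();
        views.batchView.put("pageviews:example.com", 1000L);  // from the last batch run
        views.speedView.put("pageviews:example.com", 7L);     // updates since that run
        System.out.println(views.query("pageviews:example.com"));  // 1007
    }
}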

THANK YOU!
