The Map Reduce Programming Model, Boris Farber, [email protected], 06/26/2022

Posted: 25-May-2015

TRANSCRIPT

Page 1: Map reduce

The Map Reduce Programming Model

Boris Farber

[email protected]

04/12/2023

Page 2: Map reduce

About Me

• Senior Software Engineer at Varonis
– Data Governance, Leakage and Synchronization
– Android, Windows

• Author of Profiterole – Map Reduce library for Android

• “Android System Programming” book reviewer

Page 3: Map reduce

Credits

• Avner Ben
– Chief Architect at Elisra Systems
– Skill Tree ® designer

• Nathan Marz
– Author of the “Big Data” book at Manning
– Software Engineer at Twitter
– Creator of Storm and Cascalog

• Matei Zaharia
– http://www.cs.berkeley.edu/~matei/

Page 4: Map reduce

Plan

• Introduction to Big Data – the problem
• Map Reduce – the solution
– Map/Reduce functions
– Sample problems
– Analysis
– Cascalog

• Lambda Architecture – getting the big picture

• Summary

Page 5: Map reduce

What is Big Data

• Big Data is a new programming paradigm supporting data-flow programming where traditional RDBMS- and SQL-based systems fail, not only to scale up but also to provide the desired functionality.

• It is, for example, the backbone of Facebook, Twitter, LinkedIn and Google.

• The common data pattern at these companies is a huge amount of unstructured data: tweets, likes, network updates.

Page 6: Map reduce

Big Data

• The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures.

• To gain value from this data, you must choose an alternative way to process it.

• Process the data to keep pre-computed queries (we don’t want to perform a petabyte search for every query)

Page 7: Map reduce

Handling Huge Data Amounts

• Cost-efficiency:
– Commodity machines (cheap, but unreliable)
– Commodity network
– Automatic fault-tolerance (fewer administrators)
– Easy to use (fewer programmers)

• Large-scale data processing with commodity hardware

• We want to use 1000s of CPUs/processes/threads
• But we don’t want the hassle of managing them

Page 9: Map reduce

Map Reduce

• A programming model (and its associated implementation) for processing large data sets.

• Exploits a large set of commodity computers
• Executes processing in a distributed manner
• Offers a high degree of transparency
• Based on a Functional Programming approach, which is inherently more scalable than an Object Oriented one!

Page 10: Map reduce

Challenge

• The reason MapReduce is such a powerful paradigm is that programs written in terms of MapReduce are inherently scalable.

• The same program that runs on ten megabytes of data can run on ten petabytes of data. Why?

Page 11: Map reduce

Usage

• Original paper by Google: http://research.google.com/archive/mapreduce.html

• Backbone of Hadoop – the open source Apache Big Data framework

Page 12: Map reduce

Map

(Diagram: original list → function applied to each element → new list)

Page 13: Map reduce

Map

• Map is a higher-order function that applies a given function to each element of a list, returning a list of the results. It is often called apply-to-all when considered in functional form.

• (defn bubble [x] (* x x))
• (map #(bubble %1) [1 3 5 7])
  → (1 9 25 49)
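The same apply-to-all idea can be sketched in Java with streams (an illustration, not from the slides; the class name is hypothetical):

```java
import java.util.List;
import java.util.stream.Collectors;

public class MapDemo {
    public static void main(String[] args) {
        // Apply the squaring function to each element, producing a new list
        List<Integer> squares = List.of(1, 3, 5, 7).stream()
                .map(x -> x * x)
                .collect(Collectors.toList());
        System.out.println(squares); // [1, 9, 25, 49]
    }
}
```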

Page 14: Map reduce

Reduce

(Diagram: original list → combining function folded over the elements → single result: 1000)

Page 15: Map reduce

Reduce

• Reduce and accumulate are a family of higher-order functions that analyze a data structure and, through use of a given combining operation, recombine the results of recursively processing its constituent parts, building up a return value.

• (reduce * [1 2 3 4 5 6 6])
  → 4320
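The same fold can be sketched in Java (an illustration, not from the slides; the class name is hypothetical):

```java
import java.util.List;

public class ReduceDemo {
    public static void main(String[] args) {
        // Fold the list into a single value, with * as the combining operation
        int product = List.of(1, 2, 3, 4, 5, 6, 6).stream()
                .reduce(1, (a, b) -> a * b);
        System.out.println(product); // 4320
    }
}
```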

Page 16: Map reduce

Map-Reduce

1."Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

2."Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.

Page 17: Map reduce

Word Count Naïve Approach

• You can do this sequentially by taking your fastest machine (you’ve got plenty lying around) and running over the text from start to finish.

• Maintain a hash map of every word you find (the key), incrementing its frequency (the value) every time you parse a word.

• Simple, straightforward and slow.
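The naive approach can be sketched in Java (a hypothetical single-machine version; names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class NaiveWordCount {
    // One machine, one hash map, one pass over the text from start to finish
    static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String word : text.split("\\s+")) {
            freq.merge(word, 1, Integer::sum); // key = word, value = frequency
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(count("the quick brown fox the fox"));
    }
}
```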

Page 18: Map reduce

Word Count Execution

Input (one map task per document):
  the quick brown fox
  the fox ate the mouse
  how now brown cow

Map output (a (word, 1) pair per word):
  the, 1   quick, 1   brown, 1   fox, 1
  the, 1   fox, 1   ate, 1   the, 1   mouse, 1
  how, 1   now, 1   brown, 1   cow, 1

Reduce output (pairs grouped by word, counts summed):
  ate, 1   brown, 2   cow, 1   fox, 2   how, 1
  mouse, 1   now, 1   quick, 1   the, 3
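This execution can be simulated in plain Java: independent map tasks emit (word, 1) pairs, a shuffle groups them by key, and reducers sum each group. A sketch only (not Hadoop code; all names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountMapReduce {
    // Map phase: each input line independently emits (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : line.split("\\s+")) pairs.add(Map.entry(w, 1));
        return pairs;
    }

    public static void main(String[] args) {
        List<String> input = List.of("the quick brown fox",
                                     "the fox ate the mouse",
                                     "how now brown cow");
        // Shuffle phase: group the intermediate pairs by key
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : input)
            for (Map.Entry<String, Integer> p : map(line))
                groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>())
                      .add(p.getValue());
        // Reduce phase: sum the grouped counts for each word
        Map<String, Integer> result = new TreeMap<>();
        groups.forEach((word, ones) ->
                result.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        System.out.println(result);
        // {ate=1, brown=2, cow=1, fox=2, how=1, mouse=1, now=1, quick=1, the=3}
    }
}
```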

Page 20: Map reduce

Search

• Input: (lineNumber, line) records
• Output: lines matching a given pattern

• Map: if (line matches pattern) output(line)
• Reduce: identity function
– Alternative: no reducer (map-only job)
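The map-only job can be sketched as a plain filter (an illustration; the `search` helper is hypothetical):

```java
import java.util.List;
import java.util.stream.Collectors;

public class Grep {
    // Map step only: emit each line matching the pattern; no reduce needed,
    // since nothing has to be combined across lines
    static List<String> search(List<String> lines, String pattern) {
        return lines.stream()
                .filter(line -> line.contains(pattern))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(search(List.of("the quick fox", "how now"), "fox"));
        // [the quick fox]
    }
}
```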

Page 21: Map reduce

Inverted Index

• Input: (filename, text) records
• Output: list of files containing each word

• Map: for each word in text.split(): output(word, filename)

• Combine: uniquify filenames for each word

• Reduce: def reduce(word, filenames): output(word, sort(filenames))
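The map/combine/reduce steps above can be sketched in Java; sorted sets play the role of the combine (uniquify) and reduce (sort) steps. A sketch, not from the slides:

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndex {
    // Map: emit (word, filename) per word; Combine/Reduce: the TreeSet keeps
    // filenames unique and sorted per word
    static Map<String, SortedSet<String>> build(Map<String, String> files) {
        Map<String, SortedSet<String>> index = new TreeMap<>();
        files.forEach((filename, text) -> {
            for (String word : text.split("\\s+"))
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(filename);
        });
        return index;
    }

    public static void main(String[] args) {
        System.out.println(build(Map.of(
                "hamlet.txt", "to be or not to be",
                "12th.txt", "be not afraid of greatness")));
    }
}
```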

Page 22: Map reduce

Inverted Index Example

Input:
  hamlet.txt: to be or not to be
  12th.txt: be not afraid of greatness

Map output:
  to, hamlet.txt   be, hamlet.txt   or, hamlet.txt   not, hamlet.txt
  be, 12th.txt   not, 12th.txt   afraid, 12th.txt   of, 12th.txt   greatness, 12th.txt

Reduce output:
  afraid, (12th.txt)
  be, (12th.txt, hamlet.txt)
  greatness, (12th.txt)
  not, (12th.txt, hamlet.txt)
  of, (12th.txt)
  or, (hamlet.txt)
  to, (hamlet.txt)

Page 23: Map reduce

Map Reduce

• Map:
– Accepts an input key/value pair
– Emits an intermediate key/value pair

• Reduce:
– Accepts an intermediate key/value* pair
– Emits an output key/value pair

(Diagram: very big data → MAP → partitioning function → REDUCE → result)

Page 24: Map reduce

Programming Model

• Data type: key-value records

• Map function:
  (Kin, Vin) → list(Kinter, Vinter)

• Reduce function:
  (Kinter, list(Vinter)) → list(Kout, Vout)
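These signatures can be written down as Java generic interfaces to make the types concrete (hypothetical names, not a real framework's API):

```java
import java.util.List;
import java.util.Map;

// (Kin, Vin) → list(Kinter, Vinter)
interface Mapper<Kin, Vin, Kinter, Vinter> {
    List<Map.Entry<Kinter, Vinter>> map(Kin key, Vin value);
}

// (Kinter, list(Vinter)) → list(Kout, Vout)
interface Reducer<Kinter, Vinter, Kout, Vout> {
    List<Map.Entry<Kout, Vout>> reduce(Kinter key, List<Vinter> values);
}

public class Signatures {
    public static void main(String[] args) {
        // Word count instantiates the types: the mapper emits (word, 1),
        // the reducer sums the grouped ones
        Mapper<Long, String, String, Integer> mapper =
                (lineNo, line) -> List.of(Map.entry(line, 1));
        Reducer<String, Integer, String, Integer> reducer =
                (word, ones) -> List.of(Map.entry(word, ones.size()));
        System.out.println(reducer.reduce("fox", List.of(1, 1))); // [fox=2]
    }
}
```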

Page 25: Map reduce

Throwing Hardware Linear Scalability

• Cheap nodes fail, especially if you have many

• Mean time between failures for 1 node = 3 years
• Mean time between failures for 1000 nodes = 1 day

• Programming distributed systems is hard: users write “map” & “reduce” functions, the system distributes the work and handles faults

Page 26: Map reduce

Fault Tolerance Task Crash

• If a task crashes:
– Retry it on another node
• OK for a map because it has no dependencies
• OK for a reduce because map outputs are on disk
– If the same task fails repeatedly, fail the job or ignore that input block (user-controlled)

• For these fault tolerance features to work, your map and reduce tasks must be side-effect-free

Page 27: Map reduce

Fault Tolerance Node Crash

• If a node crashes:
– Re-launch its current tasks on other nodes
– Re-run any maps the node previously ran; necessary because their output files were lost along with the crashed node

Page 28: Map reduce

Fault Tolerance Slow Task

• If a task is going slowly (a straggler):
– Launch a second copy of the task on another node (“speculative execution”)
– Take the output of whichever copy finishes first, and kill the other

• Surprisingly important in large clusters
– Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc.
– A single straggler may noticeably slow down a job

Page 29: Map reduce

Key Code (async thread pool)

    MapCallback<TMapInput> mapper = new MapCallback<TMapInput>();
    List<HashMap<String, Integer>> maps =
        new LinkedList<HashMap<String, Integer>>();
    int numThreads = 25;
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    CompletionService<MapCallback.OutputUnit> futurePool =
        new ExecutorCompletionService<MapCallback.OutputUnit>(pool);
    Set<Future<MapCallback.OutputUnit>> futureSet =
        new HashSet<Future<MapCallback.OutputUnit>>();

    // linear addition of jobs, parallel execution
    for (TMapInput m : input) {
        futureSet.add(futurePool.submit(mapper.makeWorker(m)));
    }
    // tasks running
    pool.shutdown();
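The snippet above only submits the map tasks. A hedged sketch of the missing reduce step, assuming each task produced a per-task word-count map (the class and method names here are hypothetical, not part of the library above):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergePartials {
    // Reduce step: combine the per-task word-count maps into one result
    // by summing the counts per key
    static Map<String, Integer> merge(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials)
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> merged = merge(List.of(
                Map.of("the", 2, "fox", 1),
                Map.of("the", 1, "cow", 1)));
        System.out.println(merged.get("the")); // 3
    }
}
```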

Page 30: Map reduce

Clojure/Cascalog

• Key problem in Java: very low-level processing with a lack of high-level combinators (it took me a while to add another reducer …); with static typing and OO (a function is not a first-class citizen) it felt like writing an interpreter

• Cascalog is a fully-featured data processing and querying library for Clojure or Java.

• The main use cases for Cascalog are processing "Big Data" on top of Hadoop or doing analysis on your local computer.

• Cascalog is a replacement for tools like Pig, Hive, and Cascading and operates at a significantly higher level of abstraction than those tools.

Page 31: Map reduce

Benefits of Map Reduce

• Many problems can be phrased this way• Elegant and Powerful

Page 32: Map reduce

Takeaways

• By providing a data-parallel programming model, MapReduce can control job execution in useful ways:– Automatic division of job into tasks– Automatic load balancing– Recovery from failures & stragglers

• User focuses on application, not on complexities of distributed computing

Page 33: Map reduce

Conclusions

• The MapReduce programming model hides the complexity of work distribution and fault tolerance

• Principal design philosophies:
– Make it scalable, so you can throw hardware at problems
– Make it cheap, lowering hardware, programming and admin costs

• MapReduce is not suitable for all problems, but when it works, it may save you quite a bit of time

Page 34: Map reduce

Suitable for your task if

• You have a cluster
• You are working with a large dataset
• You are working with independent data (or data assumed independent)
• The problem can be cast into map and reduce

Page 35: Map reduce

Lambda Architecture

Page 36: Map reduce

Lambda Architecture

• Query = Function(All Data)

• A general-purpose approach to implementing an arbitrary function on an arbitrarily large dataset and having the function return its results with low latency

• The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.

• Key insight: the alternative approach is to pre-compute the query function – why?

Page 37: Map reduce

Chart

Page 38: Map reduce

Lambda Architecture

• Batch layer – pre-computed, saved queries (saves time), built with Map Reduce
• Serving layer – serves the pre-computed batch views and absorbs updates from new batch runs
• Speed layer – covers the updates not yet inserted into the batch views; a query returns the combined result (the data in the database plus those updates)
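One way to picture the combined result (a hypothetical sketch, not from the talk): the answer to a query merges the batch view with the speed layer's not-yet-absorbed updates.

```java
import java.util.HashMap;
import java.util.Map;

public class LambdaQuery {
    // Query = Function(All Data): the batch view (pre-computed by Map Reduce)
    // merged with realtime counts the batch layer has not absorbed yet
    static Map<String, Integer> query(Map<String, Integer> batchView,
                                      Map<String, Integer> realtimeView) {
        Map<String, Integer> answer = new HashMap<>(batchView);
        realtimeView.forEach((k, v) -> answer.merge(k, v, Integer::sum));
        return answer;
    }

    public static void main(String[] args) {
        Map<String, Integer> answer = query(
                Map.of("pageA", 100),  // pre-computed by the nightly batch run
                Map.of("pageA", 3));   // hits arriving since that run
        System.out.println(answer.get("pageA")); // 103
    }
}
```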

Page 39: Map reduce

THANK YOU!