
100062108 李智宇、100062116 林威宏、100062220 施閔耀

+Outline

Introduction

Architecture of Hadoop

HDFS

MapReduce

Comparison

Why Hadoop

Conclusion


+What is Hadoop?

Open-source software framework

Processes and stores big data

Easy to use and implement, economical, flexible

Runs on many nodes (servers)

Written in Java

Free license

Created by Doug Cutting and Mike Cafarella in 2005


+Advantages of Interpreted Language

Cross-platform (e.g., Windows, Ubuntu, Mac OS X)

Smaller executable program size

Easier to modify during both development and execution


+Architecture of Hadoop


+Hadoop in Enterprise


The Dell representation of the Hadoop ecosystem.

+Hadoop in Enterprise


+Who is using Hadoop?

By 2013, more than half of the Fortune 50 were using Hadoop


+HDFS

Hadoop Distributed File System

Client: the user

Name node: manages and stores the metadata and the namespace of files

Data node: stores the files

Each data node periodically sends its status to the name node
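The periodic status report can be sketched as a toy heartbeat tracker. This is a minimal illustration, not HDFS code: the `NameNode` class, its method names, and the 10-second timeout are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class NameNode:
    """Toy name node that remembers when each data node last reported in."""
    timeout_s: float = 10.0  # assumed threshold for the example, not an HDFS default
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, node_id: str, now: float) -> None:
        # Record the time of the latest status report from this data node.
        self.last_seen[node_id] = now

    def dead_nodes(self, now: float) -> list:
        # A node is presumed dead if it has been silent longer than the timeout.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_s)


nn = NameNode(timeout_s=10.0)
nn.heartbeat("dn1", now=0.0)
nn.heartbeat("dn2", now=0.0)
nn.heartbeat("dn1", now=8.0)      # dn1 reports again, dn2 stays silent
print(nn.dead_nodes(now=12.0))    # ['dn2']
```

Passing `now` explicitly instead of reading the clock keeps the sketch deterministic and easy to test.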


+HDFS: Writing data in HDFS

Each file is divided into blocks (64 MB or 128 MB each), and each block is stored as three copies on different data nodes.

The client asks the name node for a list of data nodes sorted by distance and sends the file to the nearest one; that data node then forwards the file to the remaining nodes.

When the operation above is done, the data node sends “done” to the name node.
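The write path above can be sketched as follows. This is a simplified model under stated assumptions: the function names, the `distances` map, and the in-memory `storage` dictionary are hypothetical, and real HDFS streams data in packets rather than whole blocks.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, one of the block sizes named above
REPLICAS = 3                   # three copies per block, as stated above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]


def choose_pipeline(distances, replicas=REPLICAS):
    """Name node side: pick the nearest data nodes, nearest first."""
    return sorted(distances, key=distances.get)[:replicas]


def write_block(block_id, pipeline, storage):
    """The client sends to pipeline[0]; each node stores the block and
    forwards it to the next node. Returns the final status message."""
    for node in pipeline:
        storage.setdefault(node, set()).add(block_id)
    return "done"


blocks = split_into_blocks(200 * 1024 * 1024)          # a 200 MB file
pipeline = choose_pipeline({"dn1": 4, "dn2": 1, "dn3": 2, "dn4": 3})
storage = {}
statuses = [write_block(i, pipeline, storage) for i in range(len(blocks))]
print(len(blocks), pipeline)   # 4 ['dn2', 'dn3', 'dn4']
```

The last block is shorter than the others (8 MB here), which is why the helper returns explicit lengths instead of assuming full blocks.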


+HDFS: Reading data in HDFS

The client sends the filename to the name node, which returns a list of the file's blocks, with the data nodes holding them sorted by distance.

The client uses the list to get the file from the data nodes.
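A minimal sketch of this read path, assuming plain dictionaries stand in for the name node's metadata and the data nodes' disks. All names here (`locate_blocks`, `read_file`, the sample path and node IDs) are made up for illustration.

```python
def locate_blocks(filename, namespace, block_locations, distance):
    """Name node side: list each block of the file together with the
    data nodes that hold it, nearest first."""
    return [(block, sorted(block_locations[block], key=distance.get))
            for block in namespace[filename]]


def read_file(filename, namespace, block_locations, distance, datanodes):
    """Client side: fetch each block from its nearest data node and join them."""
    parts = []
    for block, nodes in locate_blocks(filename, namespace,
                                      block_locations, distance):
        parts.append(datanodes[nodes[0]][block])
    return b"".join(parts)


namespace = {"/logs/a.txt": ["b0", "b1"]}                    # file -> blocks
block_locations = {"b0": ["dn1", "dn2"], "b1": ["dn2", "dn3"]}
distance = {"dn1": 3, "dn2": 1, "dn3": 2}
datanodes = {
    "dn1": {"b0": b"hello "},
    "dn2": {"b0": b"hello ", "b1": b"world"},
    "dn3": {"b1": b"world"},
}
print(read_file("/logs/a.txt", namespace, block_locations, distance, datanodes))
# b'hello world'
```

Note that the name node only answers the metadata query; the file contents travel directly between the client and the data nodes.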


+HDFS: failure

Node failure

Communication failure

Data corruption


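+HDFS: handle failure

Handling write failures: the name node skips any data node that does not return an ACK.

Handling read failures: recall that when reading a file, the client gets a list of the data nodes that contain the file, so if one data node fails, the client can read from the next one on the list.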
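Read failover over the replica list can be sketched like this. The `DataNodeDown` exception, the `down` set used to simulate failed nodes, and the function names are assumptions for the example, not part of any HDFS API.

```python
class DataNodeDown(Exception):
    """Raised by the toy fetch() when the chosen data node is unreachable."""


def fetch(node, block_id, datanodes, down):
    # Simulate a network read: nodes listed in `down` do not answer.
    if node in down:
        raise DataNodeDown(node)
    return datanodes[node][block_id]


def read_block_with_failover(block_id, nodes, datanodes, down):
    """Try each replica in list order; return (data, node_used)."""
    for node in nodes:
        try:
            return fetch(node, block_id, datanodes, down), node
        except DataNodeDown:
            continue  # move on to the next data node on the list
    raise IOError(f"no live replica for {block_id}")


replicas = {"dn1": {"b0": b"hello "}, "dn2": {"b0": b"hello "}}
data, used = read_block_with_failover("b0", ["dn2", "dn1"], replicas, down={"dn2"})
print(data, used)   # b'hello ' dn1
```

If every node on the list is down, the sketch raises an error, which matches the case where no copy of the data is reachable.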


+HDFS: handle failure

Name node handling of node failure:

The name node finds out which data the failed node held, copies that data from the other data nodes that hold it, and stores the copies on other data nodes.

Note that HDFS can’t guarantee that at least one copy of the data is alive.
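The recovery step, and the reason the guarantee cannot hold, can be sketched with a toy re-replication planner. The function name, the `target` parameter, and the sample block map are hypothetical; the sketch only models bookkeeping, not the actual data transfer.

```python
def re_replicate(failed, block_locations, live_nodes, target=3):
    """Update block_locations after `failed` is lost.

    Returns (plan, lost): plan maps each under-replicated block to the
    node that receives a new copy; lost lists blocks with no copy left.
    """
    plan, lost = {}, []
    for block, nodes in block_locations.items():
        if failed not in nodes:
            continue
        nodes.remove(failed)
        if not nodes:
            lost.append(block)   # every replica is gone, nothing to copy from
            continue
        candidates = [n for n in live_nodes if n not in nodes]
        if candidates and len(nodes) < target:
            nodes.append(candidates[0])
            plan[block] = candidates[0]
    return plan, lost


locations = {
    "b0": ["dn1", "dn2", "dn3"],
    "b1": ["dn1"],                 # only one copy existed
    "b2": ["dn2", "dn3", "dn4"],
}
plan, lost = re_replicate("dn1", locations, live_nodes=["dn2", "dn3", "dn4"])
print(plan, lost)   # {'b0': 'dn4'} ['b1']
```

Block `b1` shows the limitation noted above: once the last data node holding a block fails, there is no surviving copy to restore from.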


+MapReduce

Similar to divide-and-conquer

First, use “Map” to divide tasks

Second, use “Shuffle” to “transfer the data from the mapper nodes to a reducer’s node and decompress if needed.”

Third, use “Reduce” to “execute the user-defined reduce function to produce the final output data.”
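The three phases can be illustrated with a single-process word count. This is a sketch of the programming model only, assuming in-memory lists in place of distributed nodes; the function names are chosen for the example and are not Hadoop APIs.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one input chunk into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped):
    """Shuffle: group the values from all mappers by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(groups)


def reduce_phase(groups):
    """Reduce: apply the user-defined function (here, sum) to each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    return reduce_phase(shuffle_phase(map_phase(c) for c in chunks))


print(word_count(["big data big", "data nodes"]))
# {'big': 2, 'data': 2, 'nodes': 1}
```

In a real cluster each chunk would be mapped on a different node and the shuffle would move data across the network; only the map and reduce functions are written by the user.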


+MapReduce-Map


+MapReduce-Shuffle


+MapReduce-Reduce


+MapReduce


+Comparison


+Comparison


+Why Hadoop? (technically)


Comparison of Grep Task Result with Vertica and DBMS-X

+Why Hadoop? (technically)

Simple structure vs. optimization

Transaction time not minimized

Lower performance with the same number of nodes

No compelling reason to choose Hadoop


+Why Hadoop? (commercially)


+Why Hadoop? (commercially)

Cheap (buy more servers to beat a DBMS)

Flexible (both in design and deployment)

Easier to design

Easier to scale up

Can be combined with other systems to achieve better performance


+Conclusion

Hadoop is much easier for users to implement and more economical

MapReduce advocates should study the techniques used in parallel DBMSs

Hybrid systems are also popular

As its performance improves, we believe Hadoop will lead the trend in big data computing


+Reference

http://hadoop.apache.org/

http://www.runpc.com.tw/content/cloud_content.aspx?id=105318

http://en.wikipedia.org/wiki/Apache_Hadoo

https://www.facebookbrand.com/

http://assets.fontsinuse.com/static/use-media-items/15/14246/full-2048x768/522903b7/Yahoo_Logo.png

http://wiki.apache.org/hadoop/PoweredBy

http://semiaccurate.com/assets/uploads/2011/09/Amazon-logo.jpg

http://www.conceptcupboard.com/blog/wp-content/uploads/2013/09/google.jpg


+Reference

http://datashieldcorp.com/files/2013/11/adobe-LOGO-2.jpg

http://upload.wikimedia.org/wikipedia/commons/7/77/The_New_York_Times_logo.png

http://i.dell.com/sites/content/business/solutions/whitepapers/en/Documents/hadoop-introduction.pdf

http://hadoop.intel.com/pdfs/IntelDistributionReferenceArchitecture.pdf

http://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CDQQFjAB&url=http%3A%2F%2Fwww.classcloud.org%2Fcloud%2Fraw-attachment%2Fwiki%2FHinet100402%2F02.HadoopOverview.pdf&ei=IE2XUtLfBMfxiAea_oHQCA&usg=AFQjCNFoIXxLJrOnoul4cKJpQ8v3_kuTYg


+Reference

http://www.accenture.com/SiteCollectionDocuments/PDF/Accenture-Hadoop-Deployment-Comparison-Study.pdf

https://www.google.com.tw/url?sa=t&rct=j&q&esrc=s&source=web&cd=1&ved=0CCkQFjAA&url=http%3A%2F%2Fwww.psgtech.edu%2Fyrgcc%2Fattach%2FMAP%2520REDUCE%2520PROGRAMMING.ppt&ei=7lGXUtvCJsy5iAfWtYH4Bw&usg=AFQjCNGWRKJLal-tvbvORULZV6_Te2y74g&sig2=Ba77ihsV1SEqcNeEFkRzfg

https://www.cs.duke.edu/starfish/files/hadoop-models.pdf

http://dotnetmis91.blogspot.tw/2010/04/hdfs-hadoop-mapreduce.html

http://wiki.apache.org/hadoop/HDFS

http://www.ewdna.com/2013/04/Hadoop-HDFS-Comics.html


+Reference

http://en.wikipedia.org/wiki/Interpreted_language

A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker, "A Comparison of Approaches to Large-Scale Data Analysis," SIGMOD 2009

http://www.cc.ntu.edu.tw/chinese/epaper/0011/20091220_1106.htm

http://web.cs.wpi.edu/~cs561/s12/Lectures/6/Hadoop.pdf

http://www.mobilemartin.com/mobile/show-me-the-mobile-money.jpg
