hadoop & neptune feb. 2009 김형준

Hadoop & Neptune

Feb. 2009

http://www.openneptune.com

http://www.jaso.co.kr

김형준

The Data 'Tsunami'

More CPU

Faster Disk

Program Tuning

More Memory

Uninstall

Where? Distributed File System

How? Distributed/Parallel Computing

Hadoop DFS

• Unlimited storage
• No backup, self-healing
• Thousands of nodes
• But: no POSIX
• No random write
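The "no backup, self-healing" point can be illustrated with a toy model of block replication. This is only a sketch under stated assumptions: the replication factor of 3, the random placement, and the names `place_blocks` and `heal` are all hypothetical and are not Hadoop's API or its actual placement policy.

```python
import random

REPLICATION = 3  # assumed replication factor for this sketch


def place_blocks(num_blocks, nodes, replication=REPLICATION, seed=0):
    """Assign each block to `replication` distinct nodes (random placement)."""
    rng = random.Random(seed)
    return {b: set(rng.sample(nodes, replication)) for b in range(num_blocks)}


def heal(placement, live_nodes, replication=REPLICATION, seed=1):
    """Drop replicas on dead nodes, then re-copy blocks onto live nodes."""
    rng = random.Random(seed)
    live = set(live_nodes)
    healed = {}
    for block, holders in placement.items():
        survivors = holders & live
        if not survivors:
            raise RuntimeError(f"block {block} lost: no surviving replica")
        # Nodes that are alive and do not hold this block yet.
        candidates = sorted(live - survivors)
        missing = min(replication - len(survivors), len(candidates))
        healed[block] = survivors | set(rng.sample(candidates, missing))
    return healed


nodes = [f"datanode-{i}" for i in range(6)]
placement = place_blocks(num_blocks=8, nodes=nodes)
live = [n for n in nodes if n != "datanode-2"]  # one machine fails
after = heal(placement, live)
```

Because every block already lives on several machines, losing one machine only triggers re-copying from the surviving replicas; no separate backup system is involved in this model.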

[Figure: Hadoop cluster diagram. Legend: machine, daemon process. Components: NameNode (DFS Master), JobTracker (Job Master), SecondaryNameNode, DataNode (DFS Slave) and TaskTracker (Task Mgmt.) on each slave machine with its Local Disk, Client API. Arrows are labeled control and data.]

Hadoop MapReduce

1 TB group by -> 10 minutes

More machines -> 1 minute

• map (k1, v1) → list(k2, v2)
• reduce (k2, list(v2)) → result value

Input: This is a book. That book is on the desk. I like that book.

Split 1: This is a book. That book is on the desk.
map() → (This, 1) (book, 1) (That, 1) (book, 1) …

Split 2: I like that book.
map() → (I, 1) (that, 1) (book, 1) …

Grouped by key: (book, [1,1,1]) … (is, [1,1]) … (This, [1])

reduce() → (book, 3) … (is, 2) … (This, 1)
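The word count flow above can be sketched in a few lines of plain Python. This runs in a single process and does not use Hadoop; the names `map_fn`, `reduce_fn`, and `run_word_count` are made up for illustration, and stripping trailing punctuation is an assumption about how words are tokenized.

```python
from collections import defaultdict


def map_fn(_key, line):
    """map (k1, v1) -> list(k2, v2): emit (word, 1) for every word in the line."""
    words = (w.strip(".,") for w in line.split())
    return [(w, 1) for w in words if w]


def reduce_fn(word, counts):
    """reduce (k2, list(v2)) -> result value: sum the counts for one word."""
    return sum(counts)


def run_word_count(splits):
    """Run map on every split, group values by key, then reduce each group."""
    grouped = defaultdict(list)
    for offset, split in enumerate(splits):
        for word, one in map_fn(offset, split):
            grouped[word].append(one)
    return {word: reduce_fn(word, counts) for word, counts in grouped.items()}


splits = [
    "This is a book. That book is on the desk.",
    "I like that book.",
]
counts = run_word_count(splits)
print(counts["book"], counts["is"], counts["This"])  # 3 2 1
```

Matching is case-sensitive here, so "That" and "that" are counted as separate keys, as in the slide.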

Map&Reduce execution platform

Distributed/parallel execution

Split

Partition

Merge

Sort
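What the platform does between map and reduce can be sketched as follows: each map task's output is partitioned by key and sorted, and each reducer merges the sorted runs it receives. This is a single-process illustration, not Hadoop code; the function names and the CRC32-based partitioner are assumptions chosen to keep the example reproducible.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    """Pick a reducer for a key; a stable hash keeps runs reproducible."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_side(pairs, num_reducers):
    """Partition one map task's output, then sort each partition by key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return [sorted(bucket, key=itemgetter(0)) for bucket in buckets]


def reduce_side(sorted_runs):
    """Merge the sorted runs from every map task and group values by key."""
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]


# Output of two map tasks (one per input split).
map_outputs = [
    [("This", 1), ("is", 1), ("a", 1), ("book", 1), ("That", 1), ("book", 1)],
    [("I", 1), ("like", 1), ("that", 1), ("book", 1)],
]
num_reducers = 2
per_map = [map_side(pairs, num_reducers) for pairs in map_outputs]

results = {}
for r in range(num_reducers):
    runs = [partitions[r] for partitions in per_map]
    results[r] = reduce_side(runs)
    print(r, results[r])
```

Since the partitioner depends only on the key, every value for a given key ends up at the same reducer, which is what lets reduce() see the complete list for that key.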

[Figure: Hadoop cluster diagram (repeated). Legend: machine, daemon process. Components: NameNode (DFS Master), JobTracker (Job Master), SecondaryNameNode, DataNode (DFS Slave) with Local Disk on each slave machine, Client API. Arrows are labeled control and data.]

A piece of Cake

Neptune

• Database running on DFS (Hadoop)
• Unlimited structured data
• No backup
• But: no JOIN, no SQL
• No multiple-row operations
• No aggregation functions

Operations

• Create/Drop Table
• put/get
• like/between
• scan/merge scan (join)
• MapReduce
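To make the operation list more concrete, here is a toy in-memory table kept sorted by row key. It is not Neptune's API: `ToyTable`, `merge_scan`, and every method signature are hypothetical, and treating "like" as a row-key prefix match is an assumption.

```python
import bisect


class ToyTable:
    """Hypothetical in-memory stand-in for a table kept sorted by row key."""

    def __init__(self):
        self._keys = []  # row keys, kept sorted
        self._rows = {}  # row key -> {column: value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        return self._rows.get(row_key)

    def between(self, start, end):
        """Rows with start <= row key <= end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, end)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

    def like(self, prefix):
        """Rows whose key starts with prefix (assumed meaning of 'like')."""
        lo = bisect.bisect_left(self._keys, prefix)
        out = []
        for k in self._keys[lo:]:
            if not k.startswith(prefix):
                break
            out.append((k, self._rows[k]))
        return out

    def scan(self):
        """Yield every row in row-key order."""
        for k in self._keys:
            yield k, self._rows[k]


def merge_scan(left, right):
    """Join two tables on row key by walking both sorted scans in step."""
    a, b = left.scan(), right.scan()
    x, y = next(a, None), next(b, None)
    while x is not None and y is not None:
        if x[0] == y[0]:
            yield x[0], x[1], y[1]
            x, y = next(a, None), next(b, None)
        elif x[0] < y[0]:
            x = next(a, None)
        else:
            y = next(b, None)


users, visits = ToyTable(), ToyTable()
users.put("user:001", "name", "kim")
users.put("user:002", "name", "lee")
users.put("video:9", "title", "demo")
visits.put("user:002", "count", 7)
visits.put("user:003", "count", 1)

print(users.get("user:001"))
print([k for k, _ in users.like("user:")])
print([k for k, _ in users.between("user:002", "video:9")])
print(list(merge_scan(users, visits)))
```

Keeping rows sorted by key is what makes range queries and the merge scan cheap: a join on row key becomes a single pass over two sorted streams, with no SQL engine needed.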

Why Neptune?

[Figure: MapReduce over Neptune tables. TableA is divided into Tablet A-1, Tablet A-2, Tablet A-3, … Tablet A-N. Labels: Make Map&Reduce function; Run on Map&Reduce framework; META Table: Get tablet list; JobTracker: Task assign to each node. Several TaskTrackers run Map Tasks and two TaskTrackers run Reduce Tasks. TableB is shown with Tablet B-1 and Tablet B-2.]
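The figure's flow (get the tablet list from the META table, then assign tasks to nodes) can be sketched as below. Everything here is assumed for illustration: the META layout keyed by each tablet's end row key, the tablet boundaries, one map task per tablet, and round-robin assignment. Real schedulers also weigh data locality, which this sketch ignores.

```python
import bisect
from itertools import cycle

# Hypothetical META table: (end row key of tablet, tablet name), sorted by end key.
META = [("g", "A-1"), ("n", "A-2"), ("t", "A-3"), ("\uffff", "A-4")]


def find_tablet(row_key):
    """Return the tablet whose key range contains row_key (end key inclusive)."""
    end_keys = [end for end, _ in META]
    return META[bisect.bisect_left(end_keys, row_key)][1]


def assign_map_tasks(tablets, task_trackers):
    """One map task per tablet, handed out to TaskTrackers round-robin."""
    assignment = {tracker: [] for tracker in task_trackers}
    for tablet, tracker in zip(tablets, cycle(task_trackers)):
        assignment[tracker].append(f"map({tablet})")
    return assignment


tablets = [name for _, name in META]  # "get tablet list"
plan = assign_map_tasks(tablets, ["tracker-1", "tracker-2", "tracker-3"])

print(find_tablet("book"), find_tablet("hadoop"), find_tablet("zebra"))
for tracker, tasks in plan.items():
    print(tracker, tasks)
```

Because each tablet covers a disjoint row-key range, the map tasks can run in parallel without coordinating with each other.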

Distributed/parallel processing: Speed, Scalability

Distributed file system (Hadoop or other)

TabletServer #1, TabletServer #2, TabletServer #n

Cluster Management System

NeptuneMaster

Distributed/parallel computing platform (Hadoop)

User application

Neptune (large-scale distributed data store)

Logical table

Physical storage

When to use Neptune

• Large data
• Online put/get and analysis
• Less complex

Google Personalized Search

Google Analytics

Finding developer

Principles of Google Infra

Cheap Hardware and Smart Software
• Use cheap commodity hardware that fails frequently
• Develop smart software to reduce the cost of failure

Easy Management
• High scalability through automatic discovery of new servers and racks
• High redundancy against failure of servers, racks, even data centers

Speed and Then More Speed
• High speed at low cost
• Rapid development and deployment of new products

Use Existing Technologies
• Use techniques from the leading edge of computer science
• Use open source code as a starting point

Google Infra

Google Linux

GFS

Bigtable

Map & Reduce Client API

Chubby

Cluster Mgmt

Batch application

Online Services

Hardware

• Low-end commodity servers
• 40 or more pizza-box servers per rack

Google’s core competency

Google’s software stack
