Hadoop & Neptune, Feb. 2009, 김형준
Hadoop & Neptune
Feb. 2009
http://www.openneptune.com
http://www.jaso.co.kr
김형준
The Data 'Tsunami'
More CPU
Faster Disk
Program Tuning
More Memory
Uninstall
Where? Distributed File System
How? Distributed/Parallel Computing
Hadoop DFS
• Unlimited storage
• No backup, self-healing
• Thousands of nodes
• But no POSIX
• No random write
[Diagram: Hadoop cluster. Legend: machine, daemon process]
Master: NameNode (DFS Master), JobTracker (Job Master)
SecondaryNameNode
Slave 1: DataNode (DFS Slave), TaskTracker (Task Mgmt.), Local Disk
Slave 2: DataNode (DFS Slave), TaskTracker (Task Mgmt.), Local Disk
Slave 3: DataNode (DFS Slave), TaskTracker (Task Mgmt.), Local Disk
Client API: arrows labeled control and data
Hadoop MapReduce
1 TB group by -> 10 minutes
More machines -> 1 minute
• map (k1, v1) → list(k2, v2)
• reduce (k2, list(v2)) → result value
Input: This is a book. That book is on the desk. I like that book.
Input split 1: This is a book. That book is on the desk.
Input split 2: I like that book.
map() on split 1: (This, 1) (book, 1) (That, 1) (book, 1) …
map() on split 2: (I, 1) (that, 1) (book, 1) …
Grouped by key: (book, [1,1,1]) … (is, [1,1]) … (This, [1])
reduce(): (book, 3) … (is, 2) … (This, 1)
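The word-count flow on this slide can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop code: the names map_fn, reduce_fn and run_job are hypothetical, and stripping trailing punctuation from words is an assumption made so that "book." counts as "book".

```python
from collections import defaultdict


def map_fn(_key, line):
    """map(k1, v1) -> list(k2, v2): emit (word, 1) for each word in the line."""
    return [(word.strip(".,"), 1) for word in line.split()]


def reduce_fn(word, counts):
    """reduce(k2, list(v2)) -> result value: sum the counts for one word."""
    return word, sum(counts)


def run_job(splits):
    """Run map on every split, group values by key, then reduce each group."""
    grouped = defaultdict(list)
    for split_id, line in enumerate(splits):
        for key, value in map_fn(split_id, line):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


splits = [
    "This is a book. That book is on the desk.",
    "I like that book.",
]
result = run_job(splits)
print(result["book"], result["is"], result["This"])
```

In a real cluster each split would be mapped on a different machine and the grouping step would move data over the network; here everything runs in one loop so the data flow is easy to follow.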
Distributed/parallel execution
Map & Reduce execution platform
Split
Partition, Merge, Sort
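The partition, sort and merge stages named above sit between map and reduce. A rough sketch, assuming hash partitioning on the key and one sorted run per map task: the function names partition and shuffle are hypothetical, and zlib.crc32 is used only as a stable stand-in hash, not as what Hadoop itself does.

```python
import heapq
import zlib


def partition(key, num_reducers):
    """Pick the reducer for a key; the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Partition each map task's output, sort each run, merge runs per reducer."""
    # runs[r][m] holds the pairs that map task m produced for reducer r
    runs = [[[] for _ in map_outputs] for _ in range(num_reducers)]
    for task_id, pairs in enumerate(map_outputs):
        for key, value in pairs:
            runs[partition(key, num_reducers)][task_id].append((key, value))
    merged = []
    for reducer_runs in runs:
        sorted_runs = [sorted(run) for run in reducer_runs]   # sort on the map side
        merged.append(list(heapq.merge(*sorted_runs)))        # merge on the reduce side
    return merged


map_outputs = [
    [("This", 1), ("book", 1), ("That", 1), ("book", 1)],
    [("I", 1), ("that", 1), ("book", 1)],
]
shuffled = shuffle(map_outputs, num_reducers=2)
for reducer_id, pairs in enumerate(shuffled):
    print(reducer_id, pairs)
```

Because every occurrence of a key lands in the same partition and arrives sorted, each reducer can process its keys one group at a time without seeing the whole data set.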
A piece of Cake
Neptune
• Database running on DFS (Hadoop)
• Unlimited structured data
• No backup
• But no JOIN, no SQL
• No multiple-row operations
• No aggregation functions
Operations
• Create/Drop Table
• put/get
• like/between
• scan/merge scan (join)
• MapReduce
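To make the operation list concrete, here is a toy in-memory model. It is not Neptune's client API: ToyTable, merge_scan and all method signatures are hypothetical. It assumes rows are kept sorted by row key (as in Bigtable-style stores), that like is a row-key prefix match, and that between is an inclusive row-key range.

```python
import bisect


class ToyTable:
    """Hypothetical in-memory stand-in for a table; not Neptune's client API."""

    def __init__(self):
        self._keys = []   # row keys, kept sorted (assumption)
        self._rows = {}   # row key -> {column name: value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key, column):
        return self._rows.get(row_key, {}).get(column)

    def between(self, start_key, end_key):
        """Rows with start_key <= row key <= end_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_right(self._keys, end_key)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

    def like(self, prefix):
        """Rows whose key starts with prefix (prefix match is an assumption)."""
        lo = bisect.bisect_left(self._keys, prefix)
        matches = []
        for k in self._keys[lo:]:
            if not k.startswith(prefix):
                break
            matches.append((k, self._rows[k]))
        return matches

    def scan(self):
        return [(k, self._rows[k]) for k in self._keys]


def merge_scan(left, right):
    """Join two tables on row key by walking both sorted scans in step."""
    a, b = left.scan(), right.scan()
    i = j = 0
    joined = []
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            joined.append((a[i][0], a[i][1], b[j][1]))
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            i += 1
        else:
            j += 1
    return joined


users = ToyTable()
users.put("user001", "name", "kim")
users.put("user002", "name", "lee")
users.put("visitor9", "name", "park")
logins = ToyTable()
logins.put("user002", "count", 7)
print(merge_scan(users, logins))
```

The merge scan shows why a join on row key stays cheap even without SQL: both inputs are already sorted, so one pass over each is enough.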
Why Neptune?
[Diagram: MapReduce over Neptune tables]
Make Map & Reduce functions
Run on the Map & Reduce framework (JobTracker)
META Table: get tablet list
Assign tasks to each node
Table A: Tablet A-1, Tablet A-2, Tablet A-3, …, Tablet A-N
Three TaskTrackers, each running three Map Tasks
Two TaskTrackers, each running one Reduce Task
Table B: Tablet B-1, Tablet B-2
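The planning step in this diagram (look up the tablet list in the META table, then hand one map task per tablet to the TaskTrackers) might look roughly like the sketch below. The META row layout, the function name plan_map_tasks and the round-robin assignment are all assumptions for illustration; a real scheduler would also consider where each tablet's data lives.

```python
from itertools import cycle


def plan_map_tasks(meta_table, table_name, task_trackers):
    """Create one map task per tablet of the table and spread them round-robin."""
    tablets = [row["tablet"] for row in meta_table if row["table"] == table_name]
    assignment = {tracker: [] for tracker in task_trackers}
    for tablet, tracker in zip(tablets, cycle(task_trackers)):
        assignment[tracker].append(tablet)
    return assignment


# Hypothetical META table contents: which tablets make up which table.
meta_table = [
    {"table": "TableA", "tablet": "A-1"},
    {"table": "TableA", "tablet": "A-2"},
    {"table": "TableA", "tablet": "A-3"},
    {"table": "TableA", "tablet": "A-4"},
    {"table": "TableB", "tablet": "B-1"},
]
plan = plan_map_tasks(meta_table, "TableA", ["tracker1", "tracker2", "tracker3"])
print(plan)
```

Using the tablet as the unit of work means the number of map tasks grows with the table, so adding machines adds parallelism without changing the job.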
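Distributed/parallel processing: Speed, Scalability
Distributed file system (Hadoop or other)
TabletServer #1, TabletServer #2, …, TabletServer #n
Cluster Management System
NeptuneMaster
Distributed/parallel computing platform (Hadoop)
User applications
Neptune (large-scale distributed data store)
Logical table
Physical storage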
When to use Neptune
• Large data
• Online put/get and analysis
• Less complex
Google Personalized Search
Google Analytics
Finding developer
Principle of Google Infra
Cheap Hardware and Smart Software
• Use cheap commodity hardware that fails frequently
• Develop smart software to reduce the cost of failure
Easy Management
• High scalability through automatic discovery of new servers and racks
• High redundancy against failure of servers, racks, even data centers
Speed and Then More Speed
• High speed at low cost
• Rapid development and deployment of new products
Use Existing Technologies
• Use techniques from the leading edge of computer science
• Use open source code as a starting point
Google Infra
Google Linux
GFS
Bigtable
Map & Reduce Client API
Chubby
Cluster Mgmt
Batch applications
Online Services
Hardware
• Low-end commodity servers
• 40 or more pizza-box servers per rack
Google’s core competency
Google’s software stack
Q&A