Hadoop Inside
Posted on 06-May-2015
TC Data Platform Division, GFIS Team
Eunjo Lee
What is Hadoop
Hadoop is a framework & system for parallel processing of large amounts of data in a distributed computing environment
(http://searchbusinessintelligence.techtarget.in/tutorial/Apache-Hadoop-FAQ-for-BI-professionals)
Apache project
open source
Java based
clone of Google's systems:
GFS -> HDFS
MapReduce -> MapReduce
Distributed Processing System
How to process data in a distributed environment
how to read/write data
how to control nodes
load balancing
Monitoring
node status
task status
Fault tolerance
error detection
process errors, network errors, hardware errors, …
error handling
temporary error: retry -> risk of duplication, data corruption, …
permanent error: fail over (to which node?)
process hang: timeout & retry
• timeout too long -> long response time
• timeout too short -> endless retry loop
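The timeout-and-retry trade-off above can be sketched as a bounded retry loop. This is an illustration only, not Hadoop code; `call_with_retry` and `slow_task` are hypothetical names:

```python
def call_with_retry(task, timeout=2.0, max_retries=4):
    """Retry a task that may hang. The task is told its deadline and
    raises TimeoutError when it cannot finish in time. A timeout that is
    too long delays failure detection (long response time); one that is
    too short keeps retrying work that would have succeeded, and
    max_retries bounds that loop so it cannot spin forever."""
    for attempt in range(1, max_retries + 1):
        try:
            return task(timeout)
        except TimeoutError:
            pass  # treat as a temporary error: retry
    raise RuntimeError(f"task failed after {max_retries} attempts")
```

Bounding the retries converts the "too short -> endless retries" failure mode into a detectable permanent error.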
Hadoop System Architecture
[Diagram: a master runs the Job Tracker and Name Node; each of three slave nodes runs a Task Tracker and a Data Node; a Secondary Name Node runs alongside the master. Legend: node, process, heart beat, data read/write.]
HDFS + MapReduce
HDFS
vs. a local filesystem
inode – namespace
cylinder / track – data node
blocks (bytes) – blocks (MBytes)
Features
very large files
write once, read many times
supports the usual file system operations: ls, cp, mv, rm, chmod, chown, put, cat, …
no support for multiple writers or arbitrary modifications
Block Replication & Rack Awareness
[Diagram: a file of four blocks (1–4) is replicated three times across servers on two racks, so every block has replicas on at least two racks. Legend: server, rack, file, block.]
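The rack-aware placement shown above (first replica on the writer's node, second on a node in a different rack, third on another node in that remote rack) can be sketched as follows. A minimal illustration under those assumptions; `place_replicas` and its data layout are hypothetical, not Hadoop's API:

```python
import random

def place_replicas(writer, nodes_by_rack):
    """Sketch of rack-aware placement for 3 replicas:
    1st on the writer's node, 2nd on a node in a different rack,
    3rd on another node in the 2nd replica's rack.
    nodes_by_rack: {rack: [node, ...]}; writer: (rack, node)."""
    writer_rack, writer_node = writer
    replicas = [writer_node]                      # 1st: local to the writer
    other_racks = [r for r in nodes_by_rack if r != writer_rack]
    remote_rack = random.choice(other_racks)      # 2nd: a different rack
    second = random.choice(nodes_by_rack[remote_rack])
    replicas.append(second)
    third_candidates = [n for n in nodes_by_rack[remote_rack] if n != second]
    replicas.append(random.choice(third_candidates))  # 3rd: same rack as 2nd
    return replicas
```

Placing two replicas on one remote rack keeps cross-rack traffic low while still surviving the loss of a whole rack.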
HDFS - Read
1. Read Request (client -> Name Node)
2. Response (block locations)
3. Request Data (client -> Data Node)
4. Read Data
[Diagram: Client, Name Node, three Data Nodes. Legend: node, data block, data I/O, operation message.]
HDFS - Write
1. Write Request (client -> Name Node)
2. Response (target Data Nodes)
3. Write Data (client -> first Data Node)
4. Write Replica (pipelined to the next Data Nodes)
5. Write Done
[Diagram: Client, Name Node, three Data Nodes. Legend: node, data block, data I/O, operation message.]
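The pipelined write (and the failure handling on the next slides) can be sketched in a few lines. An in-memory illustration only; `pipeline_write` and its arguments are hypothetical:

```python
def pipeline_write(block, pipeline, failed=()):
    """Sketch of an HDFS-style pipelined write: the client streams the
    block to the first data node, which forwards it down the pipeline.
    If a node fails mid-write, the pipeline continues with the surviving
    nodes, any partial block on the failed node is discarded, and the
    name node later re-replicates to restore the target replica count."""
    stored = {}
    for node in pipeline:
        if node in failed:
            continue  # partial block deleted; survivors carry on
        stored[node] = block
    return stored  # {node: block} for every node holding a replica
```

With no failures all pipeline nodes end up holding the block; with a failed node the write still succeeds on the survivors.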
HDFS – Write (Failure)
If a Data Node in the pipeline fails during a write:
the remaining Data Nodes complete the write (Write Replica)
the partial block on the failed node is deleted (Delete Partial Block)
the Name Node re-arranges replicas to restore the replication factor (Replica Arrangement)
[Diagram: Client, Name Node, Data Nodes, with one pipeline node failing mid-write. Legend: node, data block, data I/O, operation message.]
MapReduce
Definition
map: (+1) [ 1, 2, 3, 4, …, 10 ] -> [ 2, 3, 4, 5, …, 11 ]
reduce: (+) [ 2, 3, 4, 5, …, 11 ] -> 65
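The definition above in runnable form, using Python's built-in `map` and `functools.reduce`:

```python
from functools import reduce

nums = list(range(1, 11))                    # [1, 2, 3, ..., 10]
mapped = list(map(lambda x: x + 1, nums))    # (+1) -> [2, 3, ..., 11]
total = reduce(lambda a, b: a + b, mapped)   # (+)  -> 65
print(mapped, total)
```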
Programming Model for processing data sets in Hadoop
projection, filter -> map task
aggregation, join -> reduce task
sort -> partitioning
Job Tracker & Task Trackers
master / slave
job = many tasks
# of map tasks = # of file splits (default: # of blocks)
# of reduce tasks = user configuration
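The model above (one map task per split, user-configured reduce count, partitioning for sort, aggregation in reduce) can be simulated in memory with the classic word count. A sketch only; `run_mapreduce` and its arguments are illustrative names, not Hadoop's API:

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn, num_reducers=2):
    """Minimal in-memory sketch of the MapReduce model: one map task per
    split, hash partitioning of map output keys across reducers,
    per-partition key sort, then one reduce call per key."""
    # map phase: one task per split, emitting (key, value) pairs
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for record in split:
            for key, value in map_fn(record):
                partitions[hash(key) % num_reducers][key].append(value)
    # shuffle & sort + reduce phase: each partition is one reduce task
    output = {}
    for part in partitions:
        for key in sorted(part):
            output[key] = reduce_fn(key, part[key])
    return output

# word count: map emits (word, 1) per word; reduce sums the counts
counts = run_mapreduce(
    splits=[["a b a"], ["b a"]],
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda k, vs: sum(vs),
)
```

Here the two input lines play the role of two file splits, so the map phase runs as two independent "tasks", matching the rule above.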
MapReduce
[Diagram: input splits on the distributed file system feed input data records to map tasks; map output records (key/value pairs) are partitioned, shuffled & sorted, and consumed by reduce tasks, whose output records (key/value pairs) are written back to the distributed file system. Legend: split, partition, map / reduce task.]
Mapper - partitioning
double-indexed structure
Spill Thread
data sorting: 2nd index (quicksort)
spill file generation: spill data file & index file
flush
merge sort (by key) per partition
[Buffer layout: serialized key/value pairs in the output buffer (default: 100 MB); 1st index: (partition, key offset, value offset) per record; 2nd index: record offsets reordered by the sort.]
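The spill mechanics above can be sketched as a sort over (partition, key) followed by a per-partition write-out. A simplification: real mappers sort offset indexes over a byte buffer rather than the records themselves, and `spill` here is a hypothetical name:

```python
def spill(records, num_partitions):
    """Sketch of the map-side sort & spill: each buffered record gets a
    partition number (hash of its key), then a single sort on
    (partition, key) groups records per reducer and key-orders them,
    mimicking the 2nd-index quicksort before the spill file is written."""
    indexed = [(hash(k) % num_partitions, k, v) for k, v in records]
    indexed.sort(key=lambda t: (t[0], t[1]))  # partition first, then key
    spill_files = {p: [] for p in range(num_partitions)}
    for p, k, v in indexed:
        spill_files[p].append((k, v))
    return spill_files  # one key-sorted run per partition
```

Because every spill run comes out key-sorted per partition, the final flush only needs a merge sort, as the slide notes.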
Reducer - fetching
GetMapEventsThread
map completion event listener
MapOutputCopier
fetches data from completed mappers (via HTTP)
several copier threads run concurrently
Merger
key sorting (heap sort)
[Diagram: TaskTrackers running map tasks report completion events through the Job Tracker; the reduce-side TaskTracker's Copier threads fetch map output with HTTP GET.]
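The Merger step above can be illustrated with a k-way heap merge: each fetched map output is already key-sorted, so `heapq.merge` yields a single sorted stream without re-sorting everything. `merge_map_outputs` is an illustrative name:

```python
import heapq

def merge_map_outputs(fetched):
    """Sketch of the reduce-side Merger: 'fetched' is a list of
    key-sorted (key, value) lists, one per completed map task; a k-way
    heap merge produces one key-sorted stream for the reduce function."""
    return list(heapq.merge(*fetched, key=lambda kv: kv[0]))
```

A heap merge is O(n log k) for k runs, which is why the reducer merges sorted runs instead of sorting the concatenated input from scratch.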
Job Flow
1. runJob (MapReduce Program -> Job Client)
2. copy job resources (Job Client -> Shared File System)
3. submit job (Job Client -> Job Tracker)
4. retrieve input splits (Job Tracker <- Shared File System)
5. add job (to the job queue)
6. heartbeat (Task Tracker -> Job Tracker)
7. assign task (Job Tracker -> Task Tracker)
8. retrieve job resources (Task Tracker <- Shared File System)
9. launch (Task Tracker spawns a Child JVM)
10. run (Child runs the Map/Reduce Task)
11. read data / write result
[Diagram spans the Client Node, JobTracker Node, and TaskTracker Node. Legend: node, JVM, class, job queue, method call, I/O, job, task.]
Monitoring
Heart beat
task tracker status checking
task request / assignment
other commands (restart, shutdown, kill task, …)
Cluster Status
Job / Task Status
JobInProgress
TaskInProgress
Reporter & Metrics
Black list
Monitoring (Cluster Info)
Monitoring (Job Info)
Monitoring (Task Info)
Task Scheduler
job queue
red-black tree (java.util.TreeMap)
sorted by priority & job id (request time)
load factor
remaining tasks / capacity
task assignment order (highest priority first)
new task > speculative execution task > dummy splits task
map task (local) > map task (non-local) > reduce task
padding
padding = MIN(total tasks * pad fraction, task capacity)
reserves capacity for speculative execution
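The padding formula above in runnable form. The 1% pad fraction used here is illustrative, not a confirmed Hadoop default, and `scheduler_padding` is a hypothetical name:

```python
def scheduler_padding(total_tasks, task_capacity, pad_fraction=0.01):
    """padding = MIN(total tasks * pad fraction, task capacity):
    slots held back from normal assignment so speculative tasks can be
    launched without waiting, capped at the cluster's task capacity."""
    return min(int(total_tasks * pad_fraction), task_capacity)
```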
Error Handling
Retry
configurable (default: 4 attempts)
Timeout
configurable
Speculative Execution
launched when (current time – start time) >= 1 minute
and (average progress – task progress) > 20%
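The two speculative-execution conditions above combine into a single predicate; `should_speculate` is an illustrative name for this check:

```python
def should_speculate(now, start_time, progress, average_progress):
    """Sketch of the speculative-execution trigger from the slide: a task
    becomes a candidate once it has been running for at least one minute
    AND its progress (0.0-1.0) lags the average of its sibling tasks by
    more than 20 percentage points."""
    return (now - start_time) >= 60 and (average_progress - progress) > 0.20
```

Requiring both conditions avoids duplicating tasks that are merely young or only slightly behind.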
Distributed Processing System (recap)
how to read/write data -> HDFS client
how to control nodes -> master / slave
load balancing -> replication / rack awareness, job scheduler
monitoring -> heart beat, job/task status, reporter / metrics
fault tolerance -> black list, timeout & retry, speculative execution
Limitations
map -> reduce network overhead
iterative processing
full (or theta) join
data with small size but many splits
Low latency
polling & pulling
job initialization
optimized for throughput
job scheduling
data access
Q&A