A Two-phase Execution Engine of Reduce Tasks in Hadoop MapReduce
Xiaohong Zhang*, Guowei Wang*, Zijing Yang*, Yang Ding
School of Computer Science and Technology, Henan Polytechnic University
2012 International Conference on Systems and Informatics (ICSAI 2012)
Speaker: WMC Lab

Outline
Motivation
Introduction
Background
Design and Implementation
Experiment
Conclusion and Future work

Motivation
Reduce tasks issue massive remote I/O operations to copy the intermediate results of map tasks.
This paper proposes an execution engine for reduce tasks.

Introduction: Comparison
             | Condie et al.              | Seo et al.                                             | Su et al.
Method       | Pushing                    | Pre-shuffling                                          | Locality-aware
Feature      | TCP                        | Data dependence                                        | Stores the intermediate results of each node
Disadvantage | Occupies network bandwidth | Only reduces the delay of copying intermediate results | Cannot benefit from parallel execution

MapReduce Data Flow
(slide has no transcribed text)

Background
Intermediate results are in the form of key/value pairs.
Intermediate results are partitioned into different classes according to their keys.
A reduce task executes in three steps: the copy step, the sort step, and the reduce step.

The copy step
A job includes 31 map tasks and 2 reduce tasks.
Suppose the reduce tasks run on n4 and n9: n4 then issues 27 remote I/O operations and n9 issues 28.

The copy step: data transmission delay
Reduce tasks issue massive remote I/O operations, which cause long delays and degrade system performance.

Design and Implementation
The paper proposes an execution engine for reduce tasks.
First phase: the engine selects the nodes, assigns reduce tasks, and then orders the nodes to prefetch intermediate results.
Second phase: the nodes allocate resources for the reduce tasks and run them.
In this way the remote access delay of the intermediate results can be hidden.

Design and Implementation
The engine comprises an engine server and many engine clients.
The engine server selects the nodes that run reduce tasks and decides how many tasks each selected node will run.
Each engine client prefetches intermediate results for the reduce tasks dispatched to its own node.

Design and Implementation
(slide has no transcribed text)

Resource competition
Diagram labels: Map task; Reduce task (waiting state); Node resource (competition)

Problem
Resource competition in the node: reduce tasks do not release their resources while in the waiting state.
Network resource competition: clients periodically request information about the completed map tasks from the server.

Solution
Resource competition in the node: the engine restricts the time at which reduce tasks start to run.
Network resource competition: the engine postpones reduce task scheduling until the number of completed map tasks reaches a certain threshold.

Experiment: Environment
The execution engine ran on a Hadoop Linux cluster of 11 nodes.
The first rack held 4 nodes and the second rack held 7; all nodes were connected by a Gigabit switch.

Experiment
(slide has no transcribed text)

Experiment
Criterion: mean execution time.
When the number of completed map tasks reached 5% of the total, Hadoop began to schedule reduce tasks and kept them in the first phase.
Once the percentage of completed map tasks reached a configured threshold, reduce tasks could enter the second phase (the threshold was set to 10%, 20%, ..., 90%).

Experiment
Horizontal axis: the percentage of completed map tasks at which reduce tasks were allowed to enter the second phase.
Vertical axis: the mean execution time of each run.

Experiment
(five further slides with no transcribed text)

Conclusion and Future work
Conclusion: the results showed that the engine improved the performance of Hadoop in most cases.
Future work:
Bottlenecks: client and server, network bandwidth.
How to determine which intermediate results a reduce task needs.

Thank you
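
Editor's sketch (not part of the original slides): the Python below illustrates two points from the talk under stated assumptions. First, it reproduces the copy-step arithmetic, assuming each reduce task copies one partition from every map task and that map outputs stored on the reduce task's own node are read locally. The placement of map tasks is hypothetical, chosen only so that n4 holds 4 map outputs and n9 holds 3, which matches the 27 and 28 remote operations quoted in the slides. Second, it models the two thresholds described in the experiment: scheduling at 5% of completed map tasks and entry into the second phase at a configurable percentage. The function names and the phase labels are illustrative, not taken from the paper or from Hadoop.

```python
from collections import Counter


def remote_copy_counts(map_nodes, reduce_nodes):
    """Count, per reduce node, the map outputs that must be copied remotely.

    Assumes one reduce task per listed node and that each reduce task reads
    one partition from every map task; outputs on the same node are local.
    """
    maps_per_node = Counter(map_nodes)
    total_maps = len(map_nodes)
    return {node: total_maps - maps_per_node[node] for node in reduce_nodes}


def reduce_phase(completed_maps, total_maps, schedule_at=0.05, run_at=0.5):
    """Return the phase a reduce task would be in (illustrative model).

    schedule_at: fraction of completed maps at which reduce tasks are
                 scheduled and start prefetching (5% in the slides).
    run_at:      fraction at which they may enter the second phase
                 (10% to 90% in the experiments; 0.5 is an arbitrary default).
    """
    done = completed_maps / total_maps
    if done < schedule_at:
        return "unscheduled"
    if done < run_at:
        return "phase 1 (prefetch)"
    return "phase 2 (run)"


# Hypothetical placement: 31 map tasks, 4 on n4, 3 on n9, 24 elsewhere.
placement = ["n4"] * 4 + ["n9"] * 3 + ["other"] * 24
counts = remote_copy_counts(placement, ["n4", "n9"])
print(counts)  # {'n4': 27, 'n9': 28}

print(reduce_phase(1, 31))   # below the 5% scheduling threshold
print(reduce_phase(10, 31))  # scheduled, still prefetching
print(reduce_phase(20, 31))  # past the 50% threshold used here
```

The first function makes the motivation concrete: with 31 map tasks, almost every partition a reduce task needs lives on another node, so prefetching during the first phase is what hides the copy delay.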