JobTracker Memory Leak Solution

Resolving a Cluster Problem: How a JT Memory Leak Was Found and Fixed

Upload: jiangyu1211

Post on 05-Dec-2014


DESCRIPTION

A solution for a Hadoop JobTracker memory leak, found on hadoop-0.20.203 and also present in 1.x.

TRANSCRIPT

Page 1: JobTracker  Memory Leak Solution

Resolving a Cluster Problem: How a JT Memory Leak Was Found and Fixed

Page 2: JobTracker  Memory Leak Solution

Contents

• 1. Symptoms

• 2. Investigation

• 3. Root Cause

• 4. Solution

Page 3: JobTracker  Memory Leak Solution

1. Symptoms

Page 4: JobTracker  Memory Leak Solution

1. Symptoms

Page 5: JobTracker  Memory Leak Solution

1. Symptoms

Page 6: JobTracker  Memory Leak Solution

1. Symptoms

• What are the problems?

1. Thousands of threads cause excessive context switching, which degrades performance.

2. Excessive memory consumption leads to stop-the-world GC pauses.

Page 7: JobTracker  Memory Leak Solution

2. Investigation

• One FileSystem instance has one DFSClient

• One DFSClient has one LeaseChecker thread

[Diagram: FileSystem → DFSClient → LeaseChecker (thread) → renews the lease with the NameNode]
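Because each DFSClient owns one LeaseChecker thread, counting those threads gives a rough count of DFSClient instances that are still open. The sketch below counts live threads in the current JVM by a name fragment. The fragment "LeaseChecker" is an assumption about how the thread is named; check a real thread dump of the JobTracker before relying on it. The class name ThreadCensus is hypothetical.

```java
class ThreadCensus {
    /** Counts live threads in the current JVM whose name contains the fragment. */
    static long countThreadsNamed(String fragment) {
        long count = 0;
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().contains(fragment)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "LeaseChecker" is an assumed name fragment, not a confirmed thread name.
        String fragment = args.length > 0 ? args[0] : "LeaseChecker";
        System.out.println(fragment + " threads: " + countThreadsNamed(fragment));
    }
}
```

This only sees threads of the JVM it runs in, so for a live JobTracker the same count would have to come from a thread dump or from code running inside that process.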

Page 8: JobTracker  Memory Leak Solution

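2. Investigation

• Possible cause: the FileSystem was closed but its DFSClient was not. Low probability.

• Possible cause: the FileSystem was never closed, so its DFSClient was never closed either, leaving a large number of LeaseChecker threads. High probability.

• A past problem: a bug in the Java version of Scribe left files unable to be closed for a long time, and in some cases unable to be recovered.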

Page 9: JobTracker  Memory Leak Solution

2. Investigation

• Community JIRA: https://issues.apache.org/jira/browse/HADOOP

• Related JIRA issues (but different from our case): MAPREDUCE-5508, MAPREDUCE-5351

Page 10: JobTracker  Memory Leak Solution

2. Investigation

• eBay's experience

• After an email exchange, we arrived at a conjectured cause

Page 11: JobTracker  Memory Leak Solution

2. Investigation

• Conclusion: the FileSystem was not closed, which caused the JobTracker memory leak

Page 12: JobTracker  Memory Leak Solution

3. Root Cause

• Job submission flow

Page 13: JobTracker  Memory Leak Solution

3. Root Cause

• JobInProgress initialization creates a FileSystem object

• JobHistory obtains a FileSystem object

Page 14: JobTracker  Memory Leak Solution

3. Root Cause

• Every FileSystem is cached

• The cache key is determined by scheme + authority + ugi + unique

Page 15: JobTracker  Memory Leak Solution

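3. Root Cause

• If the user's conf does not define a scheme, the FileSystem obtained is the one created when the JIP (JobInProgress) was initialized, and close releases everything.

• If the user supplies a custom conf with a different scheme, the FileSystem obtained for History differs from the one created at JIP initialization and is never closed, which causes the leak.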
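The caching behaviour can be illustrated with a toy model. This is not Hadoop code: Key, ToyFs, CACHE and the host names are hypothetical stand-ins, and the "unique" component is fixed at 0. It only shows that two lookups with the same scheme, authority and ugi share one instance, while a different scheme creates a second entry that stays open if only the first is closed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class FsCacheLeakDemo {
    /** Stand-in for the cache key: scheme + authority + ugi + unique. */
    static final class Key {
        final String scheme, authority, ugi;
        final long unique;

        Key(String scheme, String authority, String ugi, long unique) {
            this.scheme = scheme;
            this.authority = authority;
            this.ugi = ugi;
            this.unique = unique;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return scheme.equals(k.scheme) && authority.equals(k.authority)
                    && ugi.equals(k.ugi) && unique == k.unique;
        }

        @Override public int hashCode() {
            return Objects.hash(scheme, authority, ugi, unique);
        }
    }

    /** Stand-in for a FileSystem that holds resources while open. */
    static final class ToyFs {
        boolean open = true;
        void close() { open = false; }
    }

    static final Map<Key, ToyFs> CACHE = new HashMap<>();

    /** Returns the cached instance for the key, creating it on first use. */
    static ToyFs get(String scheme, String authority, String ugi) {
        return CACHE.computeIfAbsent(new Key(scheme, authority, ugi, 0L), k -> new ToyFs());
    }

    static long openCount() {
        return CACHE.values().stream().filter(fs -> fs.open).count();
    }

    public static void main(String[] args) {
        ToyFs jipFs = get("hdfs", "nn1:8020", "alice");      // created at job init
        ToyFs sameKey = get("hdfs", "nn1:8020", "alice");    // same key: cache hit
        ToyFs historyFs = get("hftp", "nn1:50070", "alice"); // different scheme: new entry
        jipFs.close();                                       // the only close that happens
        System.out.println("same instance: " + (jipFs == sameKey)); // same instance: true
        System.out.println("still open: " + openCount());           // still open: 1
    }
}
```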

Page 16: JobTracker  Memory Leak Solution

4. Solution

• 1. Restart the JobTracker: wide impact

• 2. Modify the runtime bytecode dynamically

Page 17: JobTracker  Memory Leak Solution

4. Solution

• BTrace?

Page 18: JobTracker  Memory Leak Solution

4. Solution

• JVM Instrumentation: introduced in JDK 1.5

• Loads bytecode statically and dynamically at runtime

• Unlimited possibilities (and risk)

• Usage: profiling tools, monitoring tools

Page 19: JobTracker  Memory Leak Solution

4. Solution

• Lifecycle of a Java agent

• Premain

• Main

• AgentMain
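The two agent entry points named above map to two static methods defined by the java.lang.instrument package. A minimal sketch follows; the class name LifecycleAgent and the loadedVia field are hypothetical and exist only to make the lifecycle visible.

```java
import java.lang.instrument.Instrumentation;

public class LifecycleAgent {
    /** Records which entry point ran last; for illustration only. */
    static volatile String loadedVia = "none";

    /** Runs before the application's main when the JVM starts with -javaagent. */
    public static void premain(String agentArgs, Instrumentation inst) {
        loadedVia = "premain";
        System.out.println("premain, args=" + agentArgs);
    }

    /** Runs when the agent is attached to a JVM that is already running. */
    public static void agentmain(String agentArgs, Instrumentation inst) {
        loadedVia = "agentmain";
        System.out.println("agentmain, args=" + agentArgs);
    }
}
```

For a JobTracker that cannot be restarted, only the agentmain path applies, since premain requires the flag to be present at JVM startup.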

Page 20: JobTracker  Memory Leak Solution

4. Solution

• The loading process

Page 21: JobTracker  Memory Leak Solution

4. Solution

• Using JVM Instrumentation

• 1. Create the agent: jar, Manifest

(1) Command line: -javaagent:jarpath[=options]

(2) Attach to an existing JVM

• 2. The ClassLoader loads the agent
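The manifest and attach steps can be sketched as follows. The manifest attribute names Premain-Class and Agent-Class and the com.sun.tools.attach.VirtualMachine calls are part of the JDK; the class name AgentLoader and the agent class name passed in are hypothetical. On JDK 8 and earlier the attach API lives in tools.jar, which must be on the classpath.

```java
import com.sun.tools.attach.VirtualMachine;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

class AgentLoader {
    /** Builds the manifest entries that mark a jar as a Java agent. */
    static Manifest agentManifest(String agentClass) {
        Manifest mf = new Manifest();
        Attributes attrs = mf.getMainAttributes();
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        attrs.putValue("Premain-Class", agentClass); // used by -javaagent at startup
        attrs.putValue("Agent-Class", agentClass);   // used when attaching at runtime
        return mf;
    }

    /** Attaches to a running JVM by process id and loads the agent jar into it. */
    static void attach(String pid, String agentJar, String options) throws Exception {
        VirtualMachine vm = VirtualMachine.attach(pid);
        try {
            vm.loadAgent(agentJar, options);
        } finally {
            vm.detach();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: AgentLoader <pid> <agent.jar> [options]");
            return;
        }
        attach(args[0], args[1], args.length > 2 ? args[2] : null);
    }
}
```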

Page 22: JobTracker  Memory Leak Solution

4. Solution

Page 23: JobTracker  Memory Leak Solution

4. Solution

• Instrumentation leaves the implementation to the user

Example: http://blog.javabenchmark.org/2013/05/java-instrumentation-tutorial.html

Bytecode manipulation tools: ASM, Javassist
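The user-supplied part is a ClassFileTransformer, which the JVM calls for each class it loads or retransforms. The sketch below only detects a target class and leaves its bytecode untouched; a real agent would pass the bytes to ASM or Javassist at the marked point. The class name TargetClassWatcher and the matches counter are hypothetical.

```java
import java.lang.instrument.ClassFileTransformer;
import java.security.ProtectionDomain;
import java.util.concurrent.atomic.AtomicInteger;

class TargetClassWatcher implements ClassFileTransformer {
    private final String targetInternalName;
    /** Number of times the target class was offered for transformation. */
    final AtomicInteger matches = new AtomicInteger();

    /** @param targetInternalName slash-separated name, e.g. "org/apache/hadoop/mapred/JobTracker" */
    TargetClassWatcher(String targetInternalName) {
        this.targetInternalName = targetInternalName;
    }

    @Override
    public byte[] transform(ClassLoader loader, String className,
                            Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain,
                            byte[] classfileBuffer) {
        if (targetInternalName.equals(className)) {
            matches.incrementAndGet();
            // A real agent would rewrite classfileBuffer here and return the new bytes.
        }
        return null; // null tells the JVM to keep the original bytecode
    }
}
```

An agent would register it inside premain or agentmain with `inst.addTransformer(new TargetClassWatcher(...))`.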

Page 24: JobTracker  Memory Leak Solution

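4. Solution

• Our solution:

1. Inject code into the JobTracker through an agent

2. The code first uses reflection to obtain all DistributedFileSystem instances in the Cache (both healthy and leaked)

3. Take a lock, then obtain the JobID object from each DFS's conf; it corresponds to a running Job

4. Look the JobID up in the JT's Map<JobID, JobInProgress> jobs

5. Release the leaked FileSystems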
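Steps 2 to 5 can be sketched against stand-in classes. This is a toy model, not the code that was injected into the JobTracker: ToyDfs, ToyCache, the private field name "map" and the string job IDs are all assumptions made for illustration, and the set of running jobs stands in for the JT's jobs map.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LeakSweeper {
    /** Stand-in for a cached DistributedFileSystem; jobId mimics a value read from its conf. */
    static final class ToyDfs {
        final String jobId;
        boolean closed;
        ToyDfs(String jobId) { this.jobId = jobId; }
        void close() { closed = true; }
    }

    /** Stand-in for the FileSystem cache, with its map hidden in a private field. */
    static final class ToyCache {
        private final Map<String, ToyDfs> map = new HashMap<>();
        synchronized void put(String key, ToyDfs fs) { map.put(key, fs); }
    }

    /** Closes and evicts every cached entry whose job is not running; returns their job IDs. */
    @SuppressWarnings("unchecked")
    static List<String> sweep(ToyCache cache, Set<String> runningJobs) throws Exception {
        Field f = ToyCache.class.getDeclaredField("map"); // step 2: reach the cache by reflection
        f.setAccessible(true);
        Map<String, ToyDfs> map = (Map<String, ToyDfs>) f.get(cache);
        List<String> released = new ArrayList<>();
        synchronized (cache) {                            // step 3: lock while inspecting
            Iterator<Map.Entry<String, ToyDfs>> it = map.entrySet().iterator();
            while (it.hasNext()) {
                ToyDfs fs = it.next().getValue();
                if (!runningJobs.contains(fs.jobId)) {    // step 4: is the job still known?
                    fs.close();                           // step 5: release the leaked instance
                    it.remove();
                    released.add(fs.jobId);
                }
            }
        }
        return released;
    }
}
```

Locking on the same monitor that guards writes to the cache matters here, since the sweep must not race with jobs that are being initialized while it runs.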

Page 25: JobTracker  Memory Leak Solution

References

• http://dhruba.name/2010/02/07/creation-dynamic-loading-and-instrumentation-with-javaagents/

• http://blog.javabenchmark.org/2013/05/java-instrumentation-tutorial.html

• Code and detailed notes: https://svn.intra.sina.com.cn/data/DGM/docs/ProblemsSummary/hadoop