Static Code Analysis

梁广泰 (Liang Guangtai), 2011-05-25

Page 1: Static Code Analysis

Static Code Analysis

梁广泰 (Liang Guangtai), 2011-05-25

Page 2: Static Code Analysis

Outline

• Motivation
• Static program analysis (concepts + examples)
• Program defect analysis (research work)

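Page 3: Static Code Analysis

Motivation

Characteristics of cloud platforms:
• Applications are deployed directly on cloud servers, which creates security risks
  • An application can directly manipulate and damage the server's file system
  • A security vulnerability in an application can give attackers an entry point
• Resources are shared and dynamically allocated
  • A single poorly performing application takes resources away from other applications

One solution: run static code analysis on an application before it is deployed:
• Does it make forbidden calls? (illegal file access)
• Does it contain inefficient code? (heavy String concatenation without StringBuilder)
• Does it contain security vulnerabilities? (SQL injection, cross-site scripting, denial of service)
• Does it contain malicious code such as viruses?
• ……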

Page 4: Static Code Analysis

Outline

• Motivation
• Static program analysis (concepts + examples)
• Program defect analysis (research work)

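Page 5: Static Code Analysis

Static Code Analysis

Definition: static program analysis is the technique of analyzing a program without executing it, called static analysis for short.

Compared with:
• Dynamic program analysis: requires actually executing the program
• Program comprehension: the term "static analysis" usually describes analysis by automated tools, whereas manual analysis is usually called program comprehension

Uses: program translation/compilation (compilers), program optimization and refactoring, software defect detection, and so on.

Process: in most cases the input to static analysis is source code or intermediate code (such as Java bytecode); object code is used only rarely. The analysis results are output in a specific form.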

Page 6: Static Code Analysis

Static Code Analysis

• Basic Blocks
• Control Flow Graph
• Dataflow Analysis
  • Live Variable Analysis
  • Reaching Definition Analysis
• Lattice Theory

Page 7: Static Code Analysis

Basic Blocks

A basic block is a maximal sequence of consecutive three-address instructions with the following properties:
• The flow of control can only enter the basic block through the first instruction.
• Control will leave the block without halting or branching, except possibly at the last instruction.

Basic blocks become the nodes of a flow graph, with edges indicating the order.

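Page 8: Static Code Analysis

Basic Block Example

 1. i = 1
 2. j = 1
 3. t1 = 10 * i
 4. t2 = t1 + j
 5. t3 = 8 * t2
 6. t4 = t3 - 88
 7. a[t4] = 0.0
 8. j = j + 1
 9. if j <= 10 goto (3)
10. i = i + 1
11. if i <= 10 goto (2)
12. i = 1
13. t5 = i - 1
14. t6 = 88 * t5
15. a[t6] = 1.0
16. i = i + 1
17. if i <= 10 goto (13)

Leaders: instructions 1, 2, 3, 10, 12 and 13.

Basic Blocks:
• A: instruction 1
• B: instruction 2
• C: instructions 3 to 9
• D: instructions 10 and 11
• E: instruction 12
• F: instructions 13 to 17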
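The partition above can be reproduced mechanically. The following is a minimal sketch, assuming the standard leader rules (the first instruction, every jump target, and every instruction that follows a jump), which the slides do not spell out. The names `PROGRAM`, `find_leaders` and `basic_blocks` are illustrative only.

```python
import re

# The 17 three-address instructions from the example, keyed by their numbers.
PROGRAM = {
    1: "i = 1",
    2: "j = 1",
    3: "t1 = 10 * i",
    4: "t2 = t1 + j",
    5: "t3 = 8 * t2",
    6: "t4 = t3 - 88",
    7: "a[t4] = 0.0",
    8: "j = j + 1",
    9: "if j <= 10 goto (3)",
    10: "i = i + 1",
    11: "if i <= 10 goto (2)",
    12: "i = 1",
    13: "t5 = i - 1",
    14: "t6 = 88 * t5",
    15: "a[t6] = 1.0",
    16: "i = i + 1",
    17: "if i <= 10 goto (13)",
}

GOTO = re.compile(r"goto \((\d+)\)")


def find_leaders(program):
    """Return the sorted instruction numbers that start a basic block."""
    numbers = sorted(program)
    leaders = {numbers[0]}                      # rule 1: the first instruction
    for n in numbers:
        match = GOTO.search(program[n])
        if match:
            leaders.add(int(match.group(1)))    # rule 2: the target of a jump
            if n + 1 in program:
                leaders.add(n + 1)              # rule 3: the instruction after a jump
    return sorted(leaders)


def basic_blocks(program):
    """Split the program into blocks; each runs from a leader up to the next leader."""
    leaders = find_leaders(program)
    last = max(program)
    blocks = []
    for index, start in enumerate(leaders):
        end = leaders[index + 1] - 1 if index + 1 < len(leaders) else last
        blocks.append(list(range(start, end + 1)))
    return blocks


print("leaders:", find_leaders(PROGRAM))
for block in basic_blocks(PROGRAM):
    print("block from", block[0], "to", block[-1])
```

Keeping the instruction numbers as dictionary keys lets the jump targets in the text be used directly, without renumbering.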

Page 9: Static Code Analysis

Control-Flow Graphs

Control-flow graph:
• Node: an instruction or sequence of instructions (a basic block)
  • Two instructions i, j are in the same basic block iff execution of i guarantees execution of j
• Directed edge: potential flow of control
• Distinguished start node
• Entry & Exit
  • First & last instruction in program

Page 10: Static Code Analysis

Control-Flow Edges

Basic blocks = nodes

Edges: add a directed edge from B1 to B2 if:
• there is a branch from the last statement of B1 to the first statement of B2 (B2 is a leader), or
• B2 immediately follows B1 in program order and B1 does not end with an unconditional branch (goto)

Definition of predecessor and successor (for an edge from B1 to B2):
• B1 is a predecessor of B2
• B2 is a successor of B1
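The two edge rules can be sketched in a few lines. This is an illustration under assumptions: the block table below describes the six blocks A to F of the 17-instruction example by hand (first instruction, last instruction, kind of the final instruction, and jump target), and the names `BLOCKS`, `cfg_edges`, `predecessors` and `successors` are made up for this sketch.

```python
# Each block: (first instruction, last instruction, kind of final instruction, jump target).
# kind is "fall" (no branch), "cond" (conditional branch) or "goto" (unconditional branch).
BLOCKS = {
    "A": (1, 1, "fall", None),
    "B": (2, 2, "fall", None),
    "C": (3, 9, "cond", 3),
    "D": (10, 11, "cond", 2),
    "E": (12, 12, "fall", None),
    "F": (13, 17, "cond", 13),
}


def cfg_edges(blocks):
    """Return the set of directed edges (source, destination) between blocks."""
    order = sorted(blocks, key=lambda name: blocks[name][0])    # program order
    by_leader = {blocks[name][0]: name for name in order}
    edges = set()
    for position, name in enumerate(order):
        _, _, kind, target = blocks[name]
        if kind in ("cond", "goto"):
            edges.add((name, by_leader[target]))       # rule 1: branch to a leader
        if kind != "goto" and position + 1 < len(order):
            edges.add((name, order[position + 1]))     # rule 2: fall through
    return edges


def predecessors(edges, block):
    return sorted(src for src, dst in edges if dst == block)


def successors(edges, block):
    return sorted(dst for src, dst in edges if src == block)


EDGES = cfg_edges(BLOCKS)
print(sorted(EDGES))
print("predecessors of B:", predecessors(EDGES, "B"))
print("successors of C:", successors(EDGES, "C"))
```

A conditional branch contributes both a branch edge and a fall-through edge, which is why blocks C, D and F each have an edge back to a loop header as well as one to the next block (F is last, so it only has the back edge).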

Page 11: Static Code Analysis

CFG Example

Page 12: Static Code Analysis

Static Code Analysis

• Basic Blocks
• Control Flow Graph
• Dataflow Analysis
  • Live Variable Analysis
  • Reaching Definition Analysis
• Lattice Theory

Page 13: Static Code Analysis

Dataflow Analysis

Compile-time reasoning about run-time values of variables or expressions at different program points:
• Which assignment statements produced the value of a variable at this point?
• Which variables contain values that are no longer used after this program point?
• What is the range of possible values of a variable at this program point?
• ……

Page 14: Static Code Analysis

Program Points

• One program point before each node
• One program point after each node
• Join point: point with multiple predecessors
• Split point: point with multiple successors

Page 15: Static Code Analysis

Live Variable Analysis

A variable v is live at point p if
• v is used along some path starting at p, and
• there is no definition of v along the path before the use.

When is a variable v dead at point p?
• There is no use of v on any path from p to the exit node, or
• all paths from p redefine v before using v.

Page 16: Static Code Analysis

What Use is Liveness Information?

• Register allocation
  • If a variable is dead, its register can be reassigned
• Dead code elimination
  • Eliminate assignments to variables not read later
  • But must not eliminate the last assignment to a variable (such as an instance variable) visible outside the CFG
  • Can eliminate other dead assignments
  • Handle by making all externally visible variables live on exit from the CFG

Page 17: Static Code Analysis

Conceptual Idea of Analysis

• Start from exit and go backwards in the CFG
• Compute liveness information from end to beginning of basic blocks

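Page 18: Static Code Analysis

Liveness Example

• Assume a, b, c are visible outside the method, so they are live on exit
• Assume x, y, z, t are not visible
• Represent liveness using a bit vector; the order is abcxyzt

CFG with the liveness bit vector (a b c x y z t) before and after each block:

  B1: a = x+y; t = a; c = a+x; x == 0
      before: 0101110    after: 1100111
      (B1 branches to B2 or directly to B3)

  B2: b = t+z;
      before: 1000111    after: 1100100
      (B2 continues to B3)

  B3: c = y+1;
      before: 1100100    after: 1110000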

Page 19: Static Code Analysis

Formalizing Analysis

Each basic block has
• IN: set of variables live at start of block
• OUT: set of variables live at end of block
• USE: set of variables with upwards exposed uses in block (use prior to definition)
• DEF: set of variables defined in block prior to use

Examples:
• USE[x = z; x = x+1;] = { z } (x not in USE)
• DEF[x = z; x = x+1; y = 1;] = { x, y }

The compiler scans each basic block to derive the USE and DEF sets.
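The scan that derives USE and DEF can be sketched as follows. This is a simplified illustration, not a compiler pass: it assumes every statement is a plain assignment of the form `target = expression` and treats every identifier in the expression as a variable read. The name `use_def` is hypothetical.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def use_def(statements):
    """Compute (USE, DEF) for a block given as a list of 'target = expression' strings."""
    use, defined = set(), set()
    for statement in statements:
        target, expression = (part.strip() for part in statement.split("=", 1))
        for name in IDENTIFIER.findall(expression):
            if name not in defined:
                use.add(name)        # read before any definition in this block
        if target not in use:
            defined.add(target)      # written before any use in this block
    return use, defined


print(use_def(["x = z", "x = x+1"]))
print(use_def(["x = z", "x = x+1", "y = 1"]))
```

Because the statements are visited in order, a variable that is first written and only then read (x in the examples) ends up in DEF and stays out of USE, matching the two examples on the slide.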

Page 20: Static Code Analysis

Algorithm

for all nodes n in N - { Exit }
    IN[n] = emptyset;
OUT[Exit] = emptyset;
IN[Exit] = use[Exit];
Changed = N - { Exit };

while (Changed != emptyset)
    choose a node n in Changed;
    Changed = Changed - { n };

    OUT[n] = emptyset;
    for all nodes s in successors(n)
        OUT[n] = OUT[n] U IN[s];

    IN[n] = use[n] U (OUT[n] - def[n]);

    if (IN[n] changed)
        for all nodes p in predecessors(n)
            Changed = Changed U { p };
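A runnable sketch of this worklist algorithm, applied to the liveness example, might look like the following. The CFG encoding is an assumption made for the illustration: B1 is the block ending in `x == 0`, B2 is `b = t+z`, B3 is `c = y+1`, B1 branches to both B2 and B3, and an explicit `exit` node uses the externally visible variables a, b and c. OUT of each node is the union of IN over its successors, as liveness is a backwards analysis.

```python
# CFG of the liveness example; "exit" stands for the Exit node.
SUCCESSORS = {"B1": ["B2", "B3"], "B2": ["B3"], "B3": ["exit"], "exit": []}
USE = {"B1": {"x", "y"}, "B2": {"t", "z"}, "B3": {"y"}, "exit": {"a", "b", "c"}}
DEF = {"B1": {"a", "t", "c"}, "B2": {"b"}, "B3": {"c"}, "exit": set()}


def liveness(successors, use, define, exit_node="exit"):
    """Worklist algorithm: returns (IN, OUT), mapping each node to its live variables."""
    predecessors = {n: [] for n in successors}
    for n, succs in successors.items():
        for s in succs:
            predecessors[s].append(n)

    live_in = {n: set() for n in successors}
    live_out = {n: set() for n in successors}
    live_in[exit_node] = set(use[exit_node])
    changed = set(successors) - {exit_node}

    while changed:
        n = changed.pop()
        live_out[n] = set()
        for s in successors[n]:
            live_out[n] |= live_in[s]          # join over successors
        new_in = use[n] | (live_out[n] - define[n])
        if new_in != live_in[n]:
            live_in[n] = new_in
            changed.update(predecessors[n])    # predecessors must be revisited
    return live_in, live_out


def bits(variables, order="abcxyzt"):
    """Render a set of variables as a bit vector in the given variable order."""
    return "".join("1" if v in variables else "0" for v in order)


LIVE_IN, LIVE_OUT = liveness(SUCCESSORS, USE, DEF)
for block in ("B1", "B2", "B3"):
    print(block, "IN", bits(LIVE_IN[block]), "OUT", bits(LIVE_OUT[block]))
```

The order in which nodes are taken from the worklist affects only the number of iterations, not the result.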

Page 21: Static Code Analysis

Static Code Analysis: Concepts

• Basic Blocks
• Control Flow Graph
• Dataflow Analysis
  • Live Variable Analysis
  • Reaching Definition Analysis
• Lattice Theory

Page 22: Static Code Analysis

Reaching Definitions

Concept of definition and use:
• a = x+y
  • is a definition of a
  • is a use of x and y

A definition reaches a use if the value written by the definition may be read by the use.

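Page 23: Static Code Analysis

Reaching Definitions

Example CFG:

  s = 0; a = 4; i = 0; k == 0
      b = 1;                     (one branch of k == 0)
      b = 2;                     (the other branch)
  i < n                          (both branches join here)
      s = s + a*b; i = i + 1;    (loop body, returns to i < n)
  return s                       (reached when the loop exits)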

Page 24: Static Code Analysis

Reaching Definitions and Constant Propagation

Is a use of a variable a constant?
• Check all reaching definitions
• If all assign the variable to the same constant
• Then the use is in fact a constant

Can replace the variable with the constant.
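The check can be sketched as a small function. The representation is an assumption chosen for brevity: each reaching definition is given by its right-hand side, an `int` when the definition assigns a constant and a string for any other expression. The name `constant_value` is hypothetical.

```python
def constant_value(reaching_definitions):
    """Return the constant if every reaching definition assigns the same one, else None."""
    values = set(reaching_definitions)
    if len(values) == 1:
        (value,) = values
        if isinstance(value, int):
            return value
    return None


# Reaching definitions at the use in  s = s + a*b  from the example CFG.
print("a:", constant_value([4]))               # only a = 4 reaches the use
print("b:", constant_value([1, 2]))            # b = 1 and b = 2 both reach it
print("s:", constant_value([0, "s + a*b"]))    # s = 0 and s = s + a*b both reach it
```

Only `a` qualifies for replacement here: `b` has two different constant definitions reaching the use, and `s` is redefined inside the loop.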

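Page 25: Static Code Analysis

Is a Constant in s = s+a*b?

  s = 0; a = 4; i = 0; k == 0
      b = 1;                     (one branch of k == 0)
      b = 2;                     (the other branch)
  i < n                          (both branches join here)
      s = s + a*b; i = i + 1;    (loop body, returns to i < n)
  return s                       (reached when the loop exits)

Yes! On all reaching definitions, a = 4.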

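Page 26: Static Code Analysis

Constant Propagation Transform

  s = 0; a = 4; i = 0; k == 0
      b = 1;                     (one branch of k == 0)
      b = 2;                     (the other branch)
  i < n                          (both branches join here)
      s = s + 4*b; i = i + 1;    (loop body, returns to i < n)
  return s                       (reached when the loop exits)

Yes! On all reaching definitions, a = 4.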

Page 27: Static Code Analysis

Computing Reaching Definitions

• Compute with sets of definitions
  • represent sets using bit vectors
  • each definition has a position in the bit vector
• At each basic block, compute
  • definitions that reach the start of the block
  • definitions that reach the end of the block
• Do the computation by simulating execution of the program until a fixed point is reached

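Page 28: Static Code Analysis

Reaching definitions for the example CFG. Each bit vector has one position per definition, in the order 1 2 3 4 5 6 7.

  1: s = 0; 2: a = 4; 3: i = 0; k == 0
      start of block: 0000000    end of block: 1110000

  4: b = 1;
      start of block: 1110000    end of block: 1111000

  5: b = 2;
      start of block: 1110000    end of block: 1110100

  i < n
      first pass:  start 1111100    end 1111100
      fixed point: start 1111111    end 1111111

  6: s = s + a*b; 7: i = i + 1;
      first pass:  start 1111100    end 0101111
      fixed point: start 1111111    end 0101111

  return s
      first pass:  start 1111100    end 1111100
      fixed point: start 1111111    end 1111111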

Page 29: Static Code Analysis

Formalizing Reaching Definitions

Each basic block has
• IN: set of definitions that reach the beginning of the block
• OUT: set of definitions that reach the end of the block
• GEN: set of definitions generated in the block
• KILL: set of definitions killed in the block

Examples:
• GEN[s = s + a*b; i = i + 1;] = 0000011
• KILL[s = s + a*b; i = i + 1;] = 1010000

The compiler scans each basic block to derive the GEN and KILL sets.
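The GEN/KILL formulation can be run on the seven-definition example. The sketch below assumes the usual equations, IN[B] as the union of OUT over the predecessors of B and OUT[B] = GEN[B] U (IN[B] - KILL[B]), which the slides imply but do not write out. The block names B1 to B6 and the CFG encoding are assumptions for the illustration: B1 holds definitions 1 to 3, B2 and B3 are the two branches, B4 is the test `i < n`, B5 is the loop body and B6 is `return s`.

```python
# Definitions are numbered 1..7 as in the example; each maps to the variable it assigns.
DEFINITIONS = {1: "s", 2: "a", 3: "i", 4: "b", 5: "b", 6: "s", 7: "i"}
BLOCK_DEFS = {"B1": [1, 2, 3], "B2": [4], "B3": [5], "B4": [], "B5": [6, 7], "B6": []}
PREDECESSORS = {
    "B1": [], "B2": ["B1"], "B3": ["B1"],
    "B4": ["B2", "B3", "B5"], "B5": ["B4"], "B6": ["B4"],
}


def gen_kill(block_defs, definitions):
    """GEN: last definition of each variable in the block. KILL: all its other definitions."""
    last = {}                                    # variable -> last definition in the block
    for d in block_defs:
        last[definitions[d]] = d
    gen = set(last.values())
    kill = {d for d, var in definitions.items() if var in last and d != last[var]}
    return gen, kill


def reaching_definitions(block_defs, predecessors, definitions):
    """Iterate the dataflow equations over all blocks until no OUT set changes."""
    gen, kill = {}, {}
    for block, defs in block_defs.items():
        gen[block], kill[block] = gen_kill(defs, definitions)
    reach_in = {b: set() for b in block_defs}
    reach_out = {b: set() for b in block_defs}
    changed = True
    while changed:
        changed = False
        for block in block_defs:
            reach_in[block] = set().union(*(reach_out[p] for p in predecessors[block]))
            new_out = gen[block] | (reach_in[block] - kill[block])
            if new_out != reach_out[block]:
                reach_out[block] = new_out
                changed = True
    return reach_in, reach_out


def bits(defs, count=7):
    """Render a set of definition numbers as a bit vector, definition 1 first."""
    return "".join("1" if d in defs else "0" for d in range(1, count + 1))


REACH_IN, REACH_OUT = reaching_definitions(BLOCK_DEFS, PREDECESSORS, DEFINITIONS)
for block in BLOCK_DEFS:
    print(block, "IN", bits(REACH_IN[block]), "OUT", bits(REACH_OUT[block]))
```

The loop needs a second pass because definitions 6 and 7 reach the test `i < n` only through the back edge from the loop body.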

Page 30: Static Code Analysis

Example

Page 31: Static Code Analysis

Forwards vs. Backwards

A forwards analysis is one that for each program point computes information about the past behavior.
• Examples of this are available expressions and reaching definitions.
• Calculation: predecessors of CFG nodes.

A backwards analysis is one that for each program point computes information about the future behavior.
• Examples of this are liveness and very busy expressions.
• Calculation: successors of CFG nodes.

Page 32: Static Code Analysis

May vs. Must

A may analysis is one that describes information that may possibly be true and, thus, computes an upper approximation.
• Examples of this are liveness and reaching definitions.
• Calculation: union operator.

A must analysis is one that describes information that must definitely be true and, thus, computes a lower approximation.
• Examples of this are available expressions and very busy expressions.
• Calculation: intersection operator.
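The difference between the two operators shows up where several paths meet. The following small sketch uses made-up fact names (`d1`, `d2`, ...) and hypothetical helper names `may_join` and `must_join`; it only illustrates the combination step, not a full analysis.

```python
from functools import reduce


def may_join(facts_per_path):
    """May analysis: a fact holds at the join if it holds on at least one path."""
    return reduce(set.union, facts_per_path, set())


def must_join(facts_per_path):
    """Must analysis: a fact holds at the join only if it holds on every path."""
    facts = list(facts_per_path)
    return reduce(set.intersection, facts) if facts else set()


# Hypothetical facts arriving at a join point along two paths.
left = {"d1", "d2", "d4"}
right = {"d1", "d2", "d5"}
print("may: ", sorted(may_join([left, right])))
print("must:", sorted(must_join([left, right])))
```

For a forwards analysis the paths come from the predecessors of a node, and for a backwards analysis from its successors; the choice of union or intersection is independent of that direction.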

Page 33: Static Code Analysis

Static Code Analysis: Concepts

• Basic Blocks
• Control Flow Graph
• Dataflow Analysis
  • Live Variable Analysis
  • Reaching Definition Analysis
• Lattice Theory

Page 34: Static Code Analysis

Basic Idea

• Information about the program is represented using values from an algebraic structure called a lattice
• Analysis produces a lattice value for each program point
• Two flavors of analysis
  • Forward dataflow analysis
  • Backward dataflow analysis

Page 35: Static Code Analysis

Partial Orders

• Set P
• Partial order ≤ such that ∀x, y, z ∈ P:
  • x ≤ x (reflexive)
  • x ≤ y and y ≤ x implies x = y (antisymmetric)
  • x ≤ y and y ≤ z implies x ≤ z (transitive)
• Can use the partial order to define
  • Upper and lower bounds
  • Least upper bound
  • Greatest lower bound

Page 36: Static Code Analysis

Upper Bounds

If S ⊆ P then
• x ∈ P is an upper bound of S if ∀y ∈ S. y ≤ x
• x ∈ P is the least upper bound of S if
  • x is an upper bound of S, and
  • x ≤ y for all upper bounds y of S
• ∨: join, least upper bound (lub), supremum, sup
  • ∨S is the least upper bound of S
  • x ∨ y is the least upper bound of {x, y}

Page 37: Static Code Analysis

Lower Bounds

If S ⊆ P then
• x ∈ P is a lower bound of S if ∀y ∈ S. x ≤ y
• x ∈ P is the greatest lower bound of S if
  • x is a lower bound of S, and
  • y ≤ x for all lower bounds y of S
• ∧: meet, greatest lower bound (glb), infimum, inf
  • ∧S is the greatest lower bound of S
  • x ∧ y is the greatest lower bound of {x, y}

Page 38: Static Code Analysis

Covering

• x < y if x ≤ y and x ≠ y
• x is covered by y (y covers x) if
  • x < y, and
  • x ≤ z < y implies x = z
• Conceptually, y covers x if there are no elements between x and y

Page 39: Static Code Analysis

Example

• P = {000, 001, 010, 011, 100, 101, 110, 111} (standard Boolean lattice, also called hypercube)
• x ≤ y if (x bitwise and y) = x

Hasse Diagram
• If y covers x
  • Line from y to x
  • y above x in diagram

Levels of the Hasse diagram, from top to bottom:

  111
  011  101  110
  001  010  100
  000
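The Boolean lattice is small enough to check by brute force. The sketch below encodes the eight elements as the integers 0 to 7, takes the order from the slide (x ≤ y if x bitwise-and y equals x), and uses bitwise or and bitwise and as join and meet. The function names are illustrative.

```python
from itertools import product

P = range(8)                      # 000 .. 111 as integers


def leq(x, y):
    """x <= y in the Boolean lattice: every bit set in x is also set in y."""
    return (x & y) == x


def join(x, y):
    return x | y                  # least upper bound


def meet(x, y):
    return x & y                  # greatest lower bound


def covers(y, x):
    """y covers x: x < y and no element lies strictly between them."""
    if x == y or not leq(x, y):
        return False
    return not any(z not in (x, y) and leq(x, z) and leq(z, y) for z in P)


# Check the three partial-order axioms by brute force.
assert all(leq(x, x) for x in P)
assert all(x == y for x, y in product(P, P) if leq(x, y) and leq(y, x))
assert all(leq(x, z) for x, y, z in product(P, P, P) if leq(x, y) and leq(y, z))

# Edges of the Hasse diagram, written as (lower, upper) bit strings.
hasse = [(format(x, "03b"), format(y, "03b")) for x, y in product(P, P) if covers(y, x)]
print(len(hasse), "edges in the Hasse diagram")
print("011 join 101 =", format(join(0b011, 0b101), "03b"))
print("011 meet 101 =", format(meet(0b011, 0b101), "03b"))
```

Each covering pair differs in exactly one bit, which is why the Hasse diagram of this lattice is the edge graph of a cube.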

Page 40: Static Code Analysis

Lattices

• If x ∧ y and x ∨ y exist for all x, y ∈ P, then P is a lattice.
• If ∧S and ∨S exist for all S ⊆ P, then P is a complete lattice.
• All finite lattices are complete.

Page 41: Static Code Analysis

Lattices

• If x ∧ y and x ∨ y exist for all x, y ∈ P, then P is a lattice.
• If ∧S and ∨S exist for all S ⊆ P, then P is a complete lattice.
• All finite lattices are complete.
• Example of a lattice that is not complete
  • Integers I
  • For any x, y ∈ I, x ∨ y = max(x, y), x ∧ y = min(x, y)
  • But ∨I and ∧I do not exist
  • I ∪ {+∞, −∞} is a complete lattice

Page 42: Static Code Analysis

Lattice Examples

Lattices

Non-lattices

Page 43: Static Code Analysis

Semi-Lattice

Only one of the two binary operations (meet or join) exists.
• Meet-semilattice: x ∧ y exists for all x, y ∈ P
• Join-semilattice: x ∨ y exists for all x, y ∈ P

Page 44: Static Code Analysis

Monotonic Function & Fixed Point

Let L be a lattice. A function f : L → L is monotonic if

∀x, y ∈ L : x ≤ y ⇒ f(x) ≤ f(y)

Let A be a set, f : A → A a function, and a ∈ A. If f(a) = a, then a is called a fixed point of f on A.

Page 45: Static Code Analysis

Existence of Fixed Points

• The height of a lattice is defined to be the length of the longest path from ⊥ to ⊤
• In a complete lattice L with finite height, every monotonic function f : L → L has a unique least fixed-point:

  $\mathrm{fix}(f) = \bigvee_{i \ge 0} f^i(\bot)$

Page 46: Static Code Analysis

Knaster-Tarski Fixed Point Theorem

Suppose (L, ≤) is a complete lattice and f: L → L is a monotonic function.

Then the least fixed point m of f can be defined as

  $m = \bigwedge \{\, x \in L \mid f(x) \le x \,\}$

Page 47: Static Code Analysis

Calculating Fixed Point

The time complexity of computing a fixed-point depends on three factors:
• The height of the lattice, since this provides a bound for i;
• The cost of computing f;
• The cost of testing equality.

The computation of a fixed-point can be illustrated as a walk up the lattice starting at ⊥.
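The walk up the lattice can be sketched as a plain iteration that applies f to ⊥ repeatedly until the value stops changing. The lattice and the function in the example are assumptions chosen for illustration: the powerset of {1, 2, 3, 4} ordered by inclusion, with ⊥ the empty set, and a monotone function `f` invented for the demonstration.

```python
def least_fixed_point(f, bottom):
    """Iterate bottom, f(bottom), f(f(bottom)), ... until the value stops changing.

    Returns the fixed point and the number of applications of f that were needed.
    Terminates when f is monotone and the lattice has finite height.
    """
    current, steps = bottom, 0
    while True:
        following = f(current)                # cost factor: computing f
        steps += 1
        if following == current:              # cost factor: testing equality
            return current, steps
        current = following


# Hypothetical monotone function on the powerset of {1, 2, 3, 4}:
# it adds 1, and adds n + 1 for every n < 4 that is already present.
def f(s):
    return frozenset(s | {1} | {n + 1 for n in s if n < 4})


fixed, steps = least_fixed_point(f, frozenset())
print(sorted(fixed), "after", steps, "applications of f")
```

In this example each application adds one element, so the iteration climbs one level of the lattice per step; the number of applications is the height of the lattice plus one final application that confirms nothing changes.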

Page 48: Static Code Analysis

Application to Dataflow Analysis

• Dataflow information will be lattice values
  • Transfer functions operate on lattice values
  • Solution algorithm will generate an increasing sequence of values at each program point
  • Ascending chain condition will ensure termination
• Will use ∨ to combine values at control-flow join points

Page 49: Static Code Analysis

Transfer Functions

• Transfer function f: P → P for each node in the control flow graph
• f models the effect of the node on the program information

Page 50: Static Code Analysis

Transfer Functions

Each dataflow analysis problem has a set F of transfer functions f: P → P
• Identity function i ∈ F
• F must be closed under composition: ∀f, g ∈ F. the function h = λx.f(g(x)) ∈ F
• Each f ∈ F must be monotone: x ≤ y implies f(x) ≤ f(y)
• Sometimes all f ∈ F are distributive: f(x ∨ y) = f(x) ∨ f(y)
• Distributivity implies monotonicity
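These properties can be checked exhaustively for the gen/kill style of transfer function used by reaching definitions, f(x) = gen ∪ (x − kill). The sketch below is an illustration on an assumed four-element universe of facts, with set inclusion as the order and union as the join; `transfer`, `is_monotone` and `is_distributive` are hypothetical helper names.

```python
from itertools import combinations

UNIVERSE = [1, 2, 3, 4]
SUBSETS = [frozenset(c) for r in range(len(UNIVERSE) + 1) for c in combinations(UNIVERSE, r)]


def transfer(gen, kill):
    """Build the transfer function f(x) = gen U (x - kill) on sets of facts."""
    gen, kill = frozenset(gen), frozenset(kill)
    return lambda x: gen | (frozenset(x) - kill)


def is_monotone(f):
    """x <= y implies f(x) <= f(y), where <= is set inclusion."""
    return all(f(x) <= f(y) for x in SUBSETS for y in SUBSETS if x <= y)


def is_distributive(f):
    """f(x join y) == f(x) join f(y), where join is set union."""
    return all(f(x | y) == f(x) | f(y) for x in SUBSETS for y in SUBSETS)


identity = transfer(gen=[], kill=[])
f = transfer(gen=[3], kill=[1])
g = transfer(gen=[4], kill=[3])


def h(x):
    """Composition of two transfer functions."""
    return f(g(x))


for name, function in [("identity", identity), ("f", f), ("g", g), ("h", h)]:
    print(name, "monotone:", is_monotone(function), "distributive:", is_distributive(function))
```

With empty gen and kill sets the same construction yields the identity function, and the composition `h` passes both checks as well, consistent with the closure requirement on F.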

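Page 51: Static Code Analysis

Course Assessment

Assignments (submitted to the course platform http://sase.seforge.org/ and demonstrated) + course report

Assignment topics:
• Extracting code comments and generating documentation
• Code statistics: total lines, lines of code, number of classes, number of methods, method length, etc.
• Automatic conversion of LaTeX documents to PDF
• Online code diff
• Converting an executable JAR into an exe program with a specific icon
• Detecting various code defects: memory leaks, null pointer exceptions
• Automatic test case generation
• Defect analysis for scripts: JavaScript, Python, Ruby, PHP ……
• Defect analysis for C# code
• Online compression, decompression, encryption, decryption
• ……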

Page 52: Static Code Analysis

Questions?

Thank you!