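Doctoral Dissertation in Engineering
Parameterized Summary-based Static Analysis for
Detecting Memory Leaks and Code Clones
August 2011
Graduate School, Seoul National University
School of Computer Science and Engineering
Yungbum Jung
Parameterized Summary-based Static Analysis for Detecting Memory Leaks and Code Clones
Advisor: Kwangkeun Yi
Submitted as a doctoral dissertation in engineering
April 2011
Graduate School, Seoul National University
School of Computer Science and Engineering
Yungbum Jung
Approval of the doctoral dissertation of Yungbum Jung
July 2011
Committee Chair: 문병로 (seal)
Committee Vice Chair: Kwangkeun Yi (seal)
Committee Member: 우치수 (seal)
Committee Member: 염헌영 (seal)
Committee Member: 박성우 (seal)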
Abstract
Parameterized Summary-based Static Analysis for Detecting
Memory Leaks and Code Clones
Yungbum Jung
School of Computer Science and Engineering
College of Engineering
Seoul National University
We present a parameterized procedural summary-based static analysis. The analysis
is flow-, context-, and partially path-sensitive. We separately analyze each
procedure’s memory behavior and capture it as a summary that is used in analyzing
its call sites. Each procedural summary is parameterized by the procedure’s call
context so that it can be instantiated at different call sites. When analyzing a
procedure, the execution path conditions are captured as guards. These guards make
the analysis preserve path information.
The analysis is successfully applied to detecting memory leaks and code clones.
The precision of memory leak detection is relatively high. We found a number of
memory leak errors in SPEC2000 benchmarks and several open-source software
packages. We propose a new semantic code clone detection technique that compares
programs’ abstract memory states, which are computed by the proposed static
analysis. Our experimental study using three large-scale open source projects shows
that our technique can detect semantic clones that existing syntactic- or
semantic-based clone detectors miss.
Keywords : Programming Language, Abstract Interpretation, Memory Leaks,
Code Clones, Static Analysis, Procedural Summary
Student Number : 2004-21624
Contents
1 Introduction
  1.1 Problem
  1.2 Solution
    1.2.1 Static Analysis
    1.2.2 Procedural Summary
  1.3 Related Work
    1.3.1 Memory Leak Detection
    1.3.2 Code Clone Detection
  1.4 Dissertation Outline and Summary of Contributions
2 Procedural Summary-based Static Analysis
  2.1 Target Language
  2.2 Memory State Representation: Abstract Domains
  2.3 Estimating Procedure’s Semantics: Abstract Semantics
  2.4 Constructing Unknown Input Memories
  2.5 Merging Abstract States
  2.6 Handling Loops
  2.7 Explanatory Example
  2.8 Parametrized Procedural Summary
    2.8.1 Summary Information
    2.8.2 Summary Instantiation Using Calling Contexts
    2.8.3 Summarizing Procedures from Abstract States
  2.9 Main Algorithm
  2.10 Implementation and Engineerings
    2.10.1 Reducing Guards
    2.10.2 Global Variables Abstraction
    2.10.3 Following Loop Iteration Effects
3 Memory Leak Detection
  3.1 Introduction
  3.2 Memory Leak Detection Overview
    3.2.1 Summaries and Their Use
    3.2.2 From Memory Effects to Summaries
    3.2.3 Instantiating Summaries
  3.3 Procedural Summaries for Memory Leak Detection
    3.3.1 Eight Categories of Procedural Summaries
    3.3.2 Interprocedural Summary Instantiation
  3.4 Reporting Leaks
  3.5 Experiment Results
    3.5.1 Overall Comparison
    3.5.2 Comparison with FastCheck
    3.5.3 Comparison with Saturn
    3.5.4 Path-sensitive Extension
4 Code Clone Detection
  4.1 Introduction
  4.2 Clone Types
  4.3 Clone Detection Based on Memory Comparison
  4.4 Example for Comparison
  4.5 Comparing Abstract Memory States
    4.5.1 Equivalent Addresses
    4.5.2 Similarity Between Guarded Values
    4.5.3 Equivalent Values
    4.5.4 Equivalent Guards
    4.5.5 Best Matching
  4.6 Judgement of Clones
  4.7 Experimental Result
    4.7.1 Detectability
    4.7.2 Accuracy
    4.7.3 Scalability
    4.7.4 Comparison
    4.7.5 Limitations
  4.8 Detecting Inconsistent Changes
    4.8.1 Approach
    4.8.2 Results
  4.9 Other Applications
5 Conclusions
List of Tables
2.1 The intermediate language for the analysis.
2.2 Abstract domains: the abstract semantics of a procedure is estimated as an
    abstract memory state over domain State at the exit point of the procedure.
2.3 Abstract semantic rules for statements.
2.4 AddGuard and Update functions.
2.5 Abstract semantic rules for expressions, location expressions, and guards.
2.6 Procedural Summary: the memory behaviour of procedures is captured in this form.
2.7 Auxiliary functions used in the instantiation algorithm. Function Locate finds
    the location in which actions happen. From the given starting guarded values
    GV, the abstract memory M is explored according to the anchor C. Function
    TakeAction changes the abstract memory M or the allocated and deallocated
    sets ⟨AL,FR⟩ according to the corresponding actions.
2.8 Function Reachable finds all the addresses reachable from the initial address
    set. The Reachable function takes a set of starting addresses A and an abstract
    memory M as input. The result is a set of tuples, each consisting of an anchor,
    a reachable address, and the corresponding guard.
3.1 Eight categories of procedural summary for detecting memory leaks. The
    locations reachable from outside and the sets of allocated and freed locations
    give us memory leak related information.
3.2 Analysis results on programs from the SPEC2000 benchmark and open source
    programs.
3.3 Performance comparison for the same C programs. Other tools’ data are from the
    cited papers. Mairac found more bugs than the others with a reasonable
    false-alarm ratio.
3.4 Overall comparison with other memory leak detectors. Other tools’ data are
    from [21]. Note that these tools are applied to different programs.
3.5 Performance comparison for the same C programs between the new path-sensitive
    Mairac and the old Mairac [83].
4.1 Properties of the subject projects.
4.2 The distribution of detected clone types by MeCC.
4.3 Detected clones and false positives. Total: total number of detected clones,
    FP: number of false positive clones, and FP ratio: false positive ratio.
4.4 False negatives on the benchmark set [121]. * MeCC misses only one clone.
4.5 Time spent for the detection process.
4.6 The numbers of detected Type-3 and Type-4 clones by MeCC, Deckard, CCFinder,
    and a PDG-based detector [48].
4.7 Exploitable bugs and code smells in all clones found by MeCC and filtered by
    Deckard.
List of Figures
2.1 Translation from C program to graph-based core language.
2.2 Procedure bar with its procedural summary and procedure foo with its abstract
    memory state at the exit point (line 5).
2.3 An example procedure and its abstract memory M and the allocated/deallocated
    sets ⟨AL,FR⟩ at the exit point. From a k-bounded exit memory state, the
    summary is also bounded.
2.4 An explanatory code example.
2.5 The names of global variables are ignored in procedural summary and
    represented by Global.
2.6 Tuple ⟨AL,FR⟩ at each edge is the pair of sets of allocated and, respectively,
    freed locations. At the exit, the allocated address ℓ remains unfreed.
3.1 Procedural summary, instantiation, and summarization.
3.2 Arg2Free case: The procedure frees addresses reachable from arguments.
3.3 Arg2Glob and Glob2Arg cases: The attachGlob procedure attaches some locations
    reachable from arguments to global variables and attaches locations reachable
    from global variables to arguments.
3.4 Alloc2Arg case: The makeArray procedure attaches an allocated address to the
    pointer argument p.
3.5 Alloc2Ret case: The make2List procedure returns an allocated list of length two.
3.6 Glob2Ret and Arg2Arg cases: The argPassing procedure passes an address from
    the first argument to the second argument and returns a global pointer.
3.7 Arg2Ret case: The renewList procedure returns addresses reachable from an
    argument.
3.8 The clean procedure calls some procedures presented above.
3.9 The memory state after line 4 of the code in Figure 3.8: some allocated
    addresses are freed and the other allocated addresses are reachable from the
    pointer variable lst2.
3.10 The exit memory state of clean: the one allocated address pointed to by lst2
    is not reachable from global variables, hence leaked.
3.11 An example of reporting leaks.
3.12 Example code from “mesa” (a SPEC2000 benchmark).
3.13 Procedural summary of gl_create_context. Nodes are shaded if they are not
    freed by procedure gl_destroy_context.
4.1 Our clone detection approach: abstract memory states of individual clone
    candidates are computed by a path-sensitive semantic-based static analyzer.
    These abstract memory states are compared for detecting code clones.
4.2 Procedure foo2 with its abstract memory state at the exit point (line 5).
4.3 Type-4 clone, control replacement from Python. The if-else statement is
    replaced by the ternary conditional ? : operator. Syntactic differences are
    underlined.
4.4 Type-4 clone, statement reordering from Apache.
4.5 Type-4 clone, preserving memory behavior from PostgreSQL.
4.6 Two PDGs of the semantic clones in Fig. 4.5. The graphs look significantly
    different even though the two clones are semantically similar. Grey-colored
    nodes are newly introduced due to changes between the two procedures.
4.7 The overview of the inconsistent change detection approach: First, MeCC
    detects a set of semantic clones. Then, we detect syntactically similar clones
    in the set and filter them out. The remaining clones (gray-colored region) are
    likely inconsistent changes.
4.8 A Type-4 clone as an inconsistent clone. The procedure pwd_getpwall in (a)
    causes a resource leak due to the absence of a proper procedure call
    endpwent() before line 13.
Chapter 1
Introduction
1.1 Problem
Software plays important roles in modern society and exists everywhere. Software
controls airplanes, aerospace planes, and vehicles. Critical systems such as online
banking, healthcare, atmospheric condition prediction, traffic control, and so on
depend largely on software systems.
Despite considerable efforts to enhance software quality, it is not yet
satisfactory. Even released software contains thousands of potential defects. Even
after a long period of patches and developer effort, it is not surprising to
encounter further defects. According to NIST [131], software companies typically
spend more than 80% of their development budget on quality control. NIST also
estimated that software errors cost the U.S. economy $59.5 billion in 2002.
In order to enhance software quality, we tackle two problems: memory leaks and
semantic code clones in C programs. Memory leaks are silent killers: they gradually
encroach upon a system’s memory without any noticeable symptoms. Eventually, the
system crashes with an out-of-memory error. We use static analysis to detect memory
leaks at compile time. Detecting code clones is useful for software development and
maintenance tasks including identifying refactoring candidates [67], finding
potential bugs [49,75,78], and understanding software evolution [40,87]. Instead of
comparing syntactic similarity, we focus on the semantics of code pairs. Our
approach complements syntax-based code clone detection techniques [48,73,84,89,98].
1.2 Solution
We use a static analysis technique equipped with parameterized procedural
summaries to understand the memory behavior of programs. Static analysis requires
neither run-time environments nor inputs. Procedural summaries enable a practical
and yet precise analysis.
Throughout this dissertation, we show that the proposed path-sensitive,
parameterized procedural summary-based static analyzer is practical and effective
for inferring program semantics. As applications, we tackle memory leak detection
and semantic code clone detection. Solving these problems requires a deep semantic
understanding of programs.
1.2.1 Static Analysis
Static analysis allows us to know properties of programs without running them.
Properties can be anything: the existence of memory leaks, termination, whether a
certain variable is always zero, and so on. Compared to dynamic testing, static
analysis techniques have the following benefits:
1. Static: The earlier program errors are detected, the cheaper it is to fix them.
Static analysis works on program source and does not require a complete running
environment. Modular analysis is possible.
2. Automatic: Static analysis requires neither test cases nor any help from users.
Given a program text, it automatically produces a result on the desired properties
of the program. Although the user’s help can increase the accuracy of the analysis,
it is not essential.
3. Coverage: Static analysis can soundly approximate all possible execution paths
and memory states. In particular, static analysis based on abstract interpretation
[27–29] allows us to design a sound analyzer. Sound analyzers detect all possible
errors if they exist.
Static analysis has a disadvantage compared to dynamic testing. Abstraction is
unavoidable due to the existence of loops, lack of knowledge about the external
environment, and lack of access to source code. As a consequence, a static analyzer
sometimes loses precision, which causes false positives or false negatives. Dynamic
testing, on the other hand, always finds real bugs (situations that can actually
occur during execution).
Striking the right balance between cost and accuracy is the key challenge in
practical static analysis. Usually, precise analyses are costly and cannot be
applied to large code bases. On the other hand, scalable analyses are usually not
precise enough to prove the desired properties, which results in many false alarms.
In this dissertation, we design an accurate and practical static analysis
technique. The following sensitivities are used to characterize the accuracy of
static analyses:
1. Flow-sensitive: Every program point has its own program state. The analysis
takes into account the order of statements in a program. For example, a
flow-sensitive pointer analysis results in “after line 10, pointers x and y may
refer to the same location” while a flow-insensitive pointer analysis results in
“pointers x and y may refer to the same location in the program”.
2. Context-sensitive: Different call contexts are considered separately when
analyzing the target of a function call. Context-insensitive analyses, in contrast,
merge call contexts and hence potentially lose precision.
3. Path-sensitive: An analysis keeps different information depending on the
predicates at conditional branch statements. For instance, if an if statement
contains a condition x > 0, then on the false branch, the analysis assumes
that x ≤ 0 and on the true branch the analysis assumes that x > 0.
Our static analyzer supports all of the above sensitivities, although
path-sensitivity is only partial because global path-sensitivity is infeasible in
general.
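As a small illustration of why path-sensitivity matters for leak detection, consider
the following C sketch. The function checksum and its logic are hypothetical and are
not taken from the benchmarks. The buffer is allocated and freed under the same
condition n > 0. An analysis that merges both branches after the first if only knows
that buf may point to an allocated block, so at the second if it must also consider
the infeasible path that allocates but skips the free, and it would report a false
leak. An analysis that keeps the predicate n > 0 together with the allocation can
rule that path out.

```c
#include <stdlib.h>

/* Hypothetical example: sums the bytes 0..n-1 stored in a temporary buffer.
   Allocation and deallocation are both guarded by n > 0. */
int checksum(int n) {
    unsigned char *buf = NULL;
    int sum = 0;

    if (n > 0) {                      /* predicate: n > 0 */
        buf = malloc((size_t)n);
        if (buf == NULL) return -1;   /* allocation failed, nothing to free */
        for (int i = 0; i < n; i++) buf[i] = (unsigned char)i;
    }

    /* The loop body runs only when n > 0, so buf is never read while NULL. */
    for (int i = 0; i < n; i++) sum += buf[i];

    if (n > 0) {                      /* same predicate: the block is always freed */
        free(buf);
    }
    return sum;
}
```

On every feasible path the block allocated in the first branch is released in the
last one, so the procedure does not leak.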
1.2.2 Procedural Summary
Context-sensitive analysis is crucial for memory leak detection and semantic code
clone detection. For memory leak detection, we should trace allocated addresses
through procedure calls separately. If we cannot discriminate addresses allocated
at different calls, then most memory leaks are not detectable. For code clone
detection, almost all existing techniques do not track the semantics of procedure
calls because their approaches are textually biased. Our approach, in contrast, is
based on semantic similarity, hence precisely tracking the semantics of procedure
calls is necessary.
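The need to discriminate addresses allocated at different calls can be seen in the
following C sketch. All names (make_cell, drop_cell, keep_second, live_blocks) are
hypothetical, and the informal summaries in the comments are only paraphrases, not
the summary form defined later in this dissertation. Both addresses come from the
same malloc inside make_cell. If an analysis represented them by a single abstract
address, freeing a would be indistinguishable from freeing b, and it could not tell
which block is still the caller's responsibility.

```c
#include <stdlib.h>

int live_blocks = 0;   /* demonstration-only bookkeeping of unfreed blocks */

/* Informal summary: returns a freshly allocated address holding v. */
int *make_cell(int v) {
    int *p = malloc(sizeof *p);
    if (p != NULL) { *p = v; live_blocks++; }
    return p;
}

/* Informal summary: frees the address passed as the argument. */
void drop_cell(int *p) {
    if (p != NULL) { live_blocks--; free(p); }
}

/* Two call sites of make_cell yield two distinct addresses:
   the first is freed here, the second escapes through the return value. */
int *keep_second(void) {
    int *a = make_cell(1);   /* call site 1 */
    int *b = make_cell(2);   /* call site 2 */
    drop_cell(a);            /* only the address from call site 1 is freed */
    return b;                /* the caller now owns the address from call site 2 */
}
```

A caller that discards the result of keep_second without calling drop_cell on it
leaks exactly the block from call site 2, which is only visible when the two call
sites are kept apart.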
However, fully context-sensitive analysis is costly. Each different call context
makes the analyzer analyze a procedure again without reusing previous analysis
results.
Procedural summary-based static analysis is a practical approach to
context-sensitive analysis. A summary-based analysis consists of two interlocking
processes: (1) the summarization of procedures’ memory behavior and (2) the use of
these summaries at the procedures’ call sites. Once a procedure is analyzed, it is
not necessary to analyze the procedure again.
Our procedural summary is fine-tuned to understand the memory behavior of
procedures. The procedural summary was designed after other choices had been tested
against realistic C programs. The design focuses on not neglecting common
memory-related behaviors in realistic C programs.
1.3 Related Work
Automatic modular program analysis based on procedure summaries is a major
interest in program analysis [17, 20, 22, 26, 53, 119, 127, 135, 136, 139].
Specifically, our work is inspired by other works that summarize procedures’
behavior focusing on interesting features [20,53,135,136]. In contrast to existing
works, our work uses procedural summaries fine-tuned for detecting memory leaks and
also supports interprocedural path-sensitivity. Yorsh et al. [139] devised a general
framework for generating symbolic procedure summaries.
Next, we discuss related work in the contexts of memory leak detection and code
clone detection.
1.3.1 Memory Leak Detection
In comparison with other published memory leak detectors [21, 63, 112, 136] on
the same benchmark software, our analyzer consistently detects more bugs than
the others.
Clouseau [63, 64], proposed by Heine and Lam, generates more false positives
than ours. Their analysis is flow-sensitive and context-sensitive, and detects
memory leaks in C and C++ programs. They developed a type system to formalize a
practical ownership model of memory management. In their ownership model, every
object is pointed to by only one owning pointer. The owning pointer takes the
responsibility of freeing the object or passing this obligation to another pointer.
From this concept they generate constraints for the input program. If the
constraints are unsatisfiable, then there are memory leaks or double deletions.
Our analysis finds more bugs with a lower false positive ratio on SPEC2000
benchmarks than the memory leak detector presented by Orlovich and Rugina [112].
Their leak detection algorithm assumes the presence of leaks and runs a reverse heap
analysis to disprove the assumption. Theoretically the algorithm is sound. However,
the reverse heap analysis stops probing when the number of steps exceeds a certain
threshold.
Our analyzer detects more bugs than FastCheck [21] on SPEC2000 benchmarks with a
similar false positive ratio (the differences are 2∼3%). FastCheck, recently
proposed by Cherem et al., is an analyzer based on guarded value-flow analysis. They
model memory leak detection as a source-sink problem. They simplify the program to
guarded value flows using reaching definitions and branch condition expressions.
Their analysis is very fast but requires an additional region analysis. They found
bugs with a low false positive ratio.
Our procedural summary contains more information for memory leak detection than
those of existing techniques [135,136]. Whaley and Rinard developed a compositional
pointer and escape analysis for Java [135]. The analysis uses parameterized
points-to escape graphs that keep information about which memory blocks escape from
methods. This information is similar to our procedural summary, but more
information is required to detect memory leaks. Xie and Aiken [136] presented a
Saturn-based memory leak detector. Saturn [136, 137] achieves path-sensitivity by
modeling the input program as Boolean formulas. Memory leak detection is reduced to
a Boolean satisfiability problem. Their analyzer is context- and path-sensitive,
but loops and recursion are handled heuristically.
1.3.2 Code Clone Detection
Most clone detection techniques are syntactic clone detectors
[9,48,73,84,98,121–123,125] leveraging line-based [125], token-based [84, 98], or
tree-based [73] approaches. These detectors are good at identifying Type-1 and
Type-2 clones, but they miss most of the Type-4 and some of the Type-3 clones, as
discussed in Section 4.7.4.
Existing semantic clone detectors have limitations too. For example, as discussed
in Section 4.7.4, PDG-based detectors [48,89,99] miss some semantic clones because,
for example, they ignore inter-procedural semantics. A PDG-based technique [48] maps
slices of PDGs to syntax subtrees and applies DECKARD [73] to detect similar
subtrees. Although slicing enables one to detect more gapped clones, clones in each
clone cluster still need to be syntactically similar. Jiang et al. [74] proposed a
clone detector using random testing techniques. They conclude that two code
fragments are clones when their outputs are the same for a number of randomly
generated inputs. Since random testing cannot cover all program paths or inputs
(coverage is usually at most around 60∼70% [114, 115, 132]), false positives are
inevitable. Furthermore, inter-procedural behaviors are not considered in their
approach.
We propose an inconsistent change detection technique that uses clones without
any heuristics or tuning. Gabel et al. [49] proposed an inconsistent change
detection technique using code clone detection. The technique first identifies code
clones, which are treated as candidates for having inconsistent changes. Then, many
heuristics and option tunings are used to filter out consistent changes. However,
these heuristics and tunings may not generalize to other subjects. In addition,
they use DECKARD [73] to initially identify inconsistent change candidates. As
shown in Section 4.7.4, DECKARD may miss Type-3 and Type-4 clones, and thus their
technique may miss the Category 2 inconsistent changes discussed in Section 4.8. We
can detect Category 2 inconsistent changes, since MeCC detects Type-3 and Type-4
clones as discussed in Section 4.7.
There are theoretical advances in proving program equivalence for toy languages,
but they are not yet practical because proving program equivalence is undecidable
[47] in general. Pitts developed a method to prove program equivalence in
higher-order functional languages using operational semantics and specifications of
observable results [116]. Recently, Jia et al. proposed a technique to prove program
equivalence by checking typing rules on dependent type systems parameterized by
an abstract relation [72].
1.4 Dissertation Outline and Summary of Contributions
This dissertation presents a static program analysis technique based on
parameterized procedural summaries. The analysis is applied to memory leak
detection and code clone detection.
Chapter 2 describes the procedural summary-based static analysis. The static
analysis leverages significant advances in satisfiability modulo theories (SMT) and
uses a fine-tuned procedural summary to capture a program’s memory behavior. This
chapter contains full descriptions of our semantic transfer functions and of how to
estimate a program’s behavior while constructing unknown input memories.
We present an analysis method that separately analyzes each procedure’s memory
behavior to produce a parameterized summary of it, which is instantiated when
analyzing its call sites. Each procedure’s summarization is done by conventional
fixpoint iteration over the abstract semantics (à la abstract interpretation
[27–29]). The summary is parameterized by its call context. The call context is the
collection of locations accessed by the procedure. The accessed locations are
expressed in the access path forms that are explicit in the procedure’s source
text [20].
We also show a general procedural summary for capturing the memory behavior of
programs with respect to allocation, deallocation, and alias information.
Instantiation and summarization algorithms are also presented.
Chapter 3 presents an algorithm for detecting memory leaks. This chapter is
based on joint work with Kwangkeun Yi [83] and makes the following contributions:
• Practical and yet Precise Memory Leak Detection: The algorithm tends
to be both faster and more accurate than existing analyzers that make a
comparable cost-accuracy tradeoff. In comparison with other published memory
leak detectors [21, 63, 112, 136] on the same benchmark software, our analyzer
consistently detects more bugs than the others. The analysis speed
(720 LOC/sec) is second only to FastCheck [21] (although FastCheck’s reported
speed of 37,900 LOC/sec does not count the pointer analysis cost [21]), and the
false-positive ratio (12.4%) is the second smallest, beaten only by Saturn
[136] (10%).
• Parameterized Procedural Summaries: We present what information to
collect in the procedural summary to find an effective trade-off point. The
information is generalized with guards, actions, and access paths (Section 2.8).
We have carefully chosen which information is crucial to memory leak detection.
Chapter 4 presents a new code clone detection technique that compares abstract
memories computed by the semantics-aware static analysis presented in Chapter 2.
This chapter is based on joint work with Heejung Kim, Sunghun Kim, and Kwangkeun
Yi [86]. The semantic code clone detection technique makes the following
contributions:
• Abstract memory-based clone detection technique: We show that using
abstract memory states computed by semantic-based static analysis is effective
for detecting semantic clones. This is a new approach to detecting semantic
code clones.
• Semantic clone detector MeCC: We implemented the proposed technique
as a tool, MeCC (http://ropas.snu.ac.kr/mecc). We show the effectiveness
of the proposed technique by experimentally evaluating MeCC.
• Clone benchmark: For our experimental study, we manually inspected and
classified code clones of three open source projects. We make these data
publicly available, and they can serve as a benchmark set for other clone-related
research (http://ropas.snu.ac.kr/mecc).
• Finding inconsistent clones: After filtering out syntactic clones from all
clones detected by MeCC, we identified about 19% of the remaining clones
as potential bugs or code smells.
Finally we conclude the dissertation in Chapter 5.
Chapter 2
Procedural Summary-based Static
Analysis
Procedural summary-based static analysis is a modular, bottom-up analysis.
The analysis consists of two interlocking processes: (1) the summarization of
procedures’ memory behavior and (2) the use of these summaries at the procedures’
call sites.
In order to analyze a procedure, the procedural summaries of all its callees
should be available. Hence procedures are analyzed and summarized in the reverse
topological order of the static call graph. Leaf procedures of a program are
analyzed and summarized first. Procedural summaries are then used at their call
sites. In the case of call cycles, all procedures in a call cycle are analyzed
together within a single fixpoint iteration. In the case of dynamic call edges (due
to function pointers), the caller’s summarization is delayed until the callee’s
summary becomes ready.
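The bottom-up ordering can be sketched as a post-order depth-first traversal of the
call graph. The C code below is only an illustration under simplifying assumptions:
the types and names (CallGraph, visit, analysis_order, MAX_PROCS) are hypothetical,
the graph is given as an adjacency matrix, and the call graph is assumed to be
acyclic, so neither call cycles nor dynamic call edges are handled.

```c
#define MAX_PROCS 16

/* calls[i][j] != 0 means procedure i calls procedure j (static call edge). */
typedef struct {
    int n;
    int calls[MAX_PROCS][MAX_PROCS];
} CallGraph;

/* Post-order DFS: a procedure is emitted only after all of its callees. */
static void visit(const CallGraph *g, int p, int *seen, int *order, int *len) {
    if (seen[p]) return;
    seen[p] = 1;
    for (int q = 0; q < g->n; q++)
        if (g->calls[p][q]) visit(g, q, seen, order, len);
    order[(*len)++] = p;
}

/* Fills order[] with a callee-before-caller ordering and returns its length.
   Assumes an acyclic call graph; call cycles would first have to be grouped
   and analyzed together. */
int analysis_order(const CallGraph *g, int *order) {
    int seen[MAX_PROCS] = {0};
    int len = 0;
    for (int p = 0; p < g->n; p++) visit(g, p, seen, order, &len);
    return len;
}
```

For a program in which procedure 0 calls procedures 1 and 2, and procedure 1 calls
procedure 2, the leaf procedure 2 comes first, then 1, then 0.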
We compute abstract memory states at every program point of a given procedure
using conventional fixpoint iteration over the abstract semantics (à la abstract
interpretation [27–29]). After the abstract states at every program point reach
fixpoints, the abstract states at the exit point of the procedure are used for
summarizing the procedure, because the abstract states at the exit point faithfully
estimate the memory behavior of the procedure.

c   ∈ Cmd   ::= assert grd | le := e | le := call(e, e) | return e
le  ∈ LExp  ::= x | *e | e.f
e   ∈ Exp   ::= c | le | e ⊕ e | ¬e | &le
grd ∈ Guard ::= e ∼ e | ¬grd
∼   ∈ Rel   ::= = | ≠ | > | < | ≥ | ≤ | ∨ | ∧
⊕   ∈ Bop   ::= + | − | ∗ | ÷ | % | ∼
Table 2.1: The intermediate language for the analysis.
2.1 Target Language
To keep the analysis design simple, we define a graph-based core language for C
programs. Every procedure in a given input C program is translated into a control
flow graph whose commands are defined in Table 2.1. All C statements are
translated into the following four constructs: assert grd, le := e, return e, and
le := call(e, e).
All conditional expressions are translated into assert commands. For example,
the statement if grd c1 c2 is translated into the two command sequences
assert grd ; c1 and assert ¬grd ; c2, as shown in Figure 2.1. The translation
also makes any side effects in expressions explicit. In Figure 2.1, the condition
of the if statement has a side effect that assigns x + 2 to x; this side effect
is lifted into a separate assignment statement in the translated program. As a
consequence, no expression in the core language has side effects, which
simplifies our design.
All assignment expressions are transformed to le := e. The left-hand side of an
assignment must be an L-expression, which always evaluates to an address.
Expressions on the right-hand side evaluate to values or guards. A guard is a
relation between expressions or a composition of guards. For convenience, we
assume that every function call command has an address expression to which the
return value of the function is assigned. For simplicity, only one parameter is
considered. The return construct returns the value of its expression. The exit
point of a procedure is a dummy node that follows every return statement in the
procedure.

  if( (x = x + 2) > 0) y = 10;
  else y = 5;
  z = 10;
  z = x + y + z;
    ⇒
  x = x + 2
  Assert(x > 0); y = 10
  Assert(!(x > 0)); y = 5
  z = 10; z = x + y + z

Figure 2.1: Translation from C program to graph-based core language.
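The translation of a conditional can be illustrated with a small sketch. This is only an illustration in Python: the function translate_if and the string representation of commands are assumptions made for the example, and the side effects are assumed to have been extracted from the condition already.

```python
def translate_if(side_effects, cond, then_cmds, else_cmds):
    """Translate `if (cond) then_cmds else else_cmds` into core commands.

    Returns (prefix, true_branch, false_branch): the lifted side effects
    run first, then each branch starts with an assert on the condition.
    """
    prefix = list(side_effects)
    true_branch = [f"assert {cond}"] + list(then_cmds)
    false_branch = [f"assert !({cond})"] + list(else_cmds)
    return prefix, true_branch, false_branch


# The conditional of Figure 2.1, with `x = x + 2` lifted out of the test.
prefix, true_branch, false_branch = translate_if(
    ["x = x + 2"], "x > 0", ["y = 10"], ["y = 5"]
)
```

Both branches then flow into the common successor node that holds the remaining assignments to z.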
2.2 Memory State Representation: Abstract Domains
  T ∈ Table = Block →fin State
  S ∈ State = Assert × Mem × AllocFree
  G ∈ Assert = Guard
  M ∈ Mem = Addr →fin GV
  ⟨AL,FR⟩ ∈ AllocFree = 2^(Guard×Addr) × 2^(Guard×Addr)
  GV ∈ GV = 2^(Guard×Value)
  g ∈ Guard = (Value × Rel × Value) + Guard ∧ Guard + Guard ∨ Guard
  v ∈ Value = N + Addr + (Uop × Value) + (Value × Bop × Value) + ⊤
  a, α, ℓ ∈ Addr = Var + Symbol + AllocSite + (Addr × Field) + Ret
  x ∈ Var = Global + Param + Local

Table 2.2: Abstract domains: the abstract semantics of a procedure is estimated
as an abstract memory state over domain State at the exit point of the procedure.
Static analyses based on abstract interpretation estimate a program’s semantics
via abstract semantics over an abstract domain. The intermediate language
describes the target syntactic objects, while the abstract domain defines the
target semantic objects.
Our abstract domains for memory states are presented in Table 2.2. Our analysis
is flow- and path-sensitive; for each basic block it estimates the possible
abstract states over all execution paths to that block, and keeps them in a table
(T in Table 2.2). When the table of a procedure no longer changes (a fixpoint is
reached), the analysis of the procedure stops.
An abstract state (S in Table 2.2) consists of three pieces of information. (1)
Assert keeps the condition under which the corresponding program point is
reachable from the entry of the procedure. (2) An abstract memory is a finite
mapping from abstract (symbolic) addresses to guarded values. (3) AllocFree keeps
allocated or deallocated addresses with their corresponding conditions. These
three components are the target of our analysis. When we check whether the
abstract states have reached a fixpoint, only the abstract memory needs to be
checked, because the stability of the abstract memory guarantees the stability
of the other components.
A guarded value (GV in Table 2.2) is a set of pairs of a guard and a sym-
bolic value, where the guard is the accumulated symbolic condition that leads to
the accompanying value. The set of all variables (Var) consists of three disjoint
sets, all global variables (Global), all parameters (Param), and all local variables
(Local) except procedure parameters. This partitioning enables us to define three
equivalence classes for variables when defining equivalent addresses in Section 4.5.
Symbols (Symbol) are used to indicate symbolic values or symbolic addresses
in the global input memory of the current procedure. Allocated addresses
(AllocSite) denote all addresses allocated (including arrays) at each allocation
site (a static call program point for allocations). Consequently, all address
arithmetic operations are ignored (all array access expressions a[n] are regarded
as a single pointer dereference *a). Field addresses (Addr × Field) represent
field variables of structures (our analysis is field-sensitive). The return
address Ret keeps the values to be returned.
A symbolic value can be a number (N), an address (Addr), a binary value
(Value × Bop × Value), or a unary value (Uop × Value). Bop and Uop denote the
sets of binary and unary operation symbols respectively. A guard (Guard) can be
generated from a relation between values (Value × Rel × Value), where Rel denotes
the set of comparison operators (e.g., =, ≤). Guards can also be connected
by logical operators (conjunction ∧ and disjunction ∨).
The next step is to define the semantics of each construct in terms of elements
of this domain.
2.3 Estimating Procedure’s Semantics: Abstract Semantics
Our analysis starts from the entry point of a procedure without knowing the input
memory state. The unknown input memory state is constructed by observing
which locations and values are accessed by the procedure (Section 2.4). Abstract
memory states are updated by executing each statement along the control flow of
the procedure; each update is determined by the predefined abstract semantics of
the statement. All conditions on the execution path are collected as guards.
Table 2.3 shows the abstract semantic rules
G,M, ⟨AL,FR⟩ ⊢ c : G′,M′, ⟨AL′,FR′⟩
for each statement c. The semantic rule of a command takes as input an assert G,
an abstract memory M, a set of allocated addresses AL, and a set of deallocated
addresses FR, and produces components of the same form. In order to trace the
conditions of the program execution from the initial point to the current point,
we keep the assert G. We write G instead of g to distinguish this guard on the
abstract memory state from the guards inside guarded values (GV). If the guard G
at a program point is unsatisfiable (equivalent to false) then the execution path
to the program point is infeasible. The guard true means that the program point
is always reachable. Allocated and deallocated addresses are used to detect
memory leaks and to make procedural summaries.
  Judgment form: G, M, ⟨AL,FR⟩ ⊢ c : G′, M′, ⟨AL′,FR′⟩

                    M ⊢ grd : g
  ------------------------------------------------- (assert)
  G, M, ⟨AL,FR⟩ ⊢ assert grd : G ∧ g, M, ⟨AL,FR⟩

  M ⊢l le : GV1     M ⊢ e : GV2
  ------------------------------------ (assignment)
  M ⊢ le := e : Update(GV1, M, GV2)

            M ⊢ e : GV
  ------------------------------ (return)
  M ⊢ return e : M{Ret ↦ GV}

  M ⊢ e1 : GVf     M ⊢ e2 : GVp     M ⊢l le : GVr
  G′, M′, ⟨AL′,FR′⟩ = Instantiate GVr GVf GVp M ⟨AL,FR⟩
  -------------------------------------------------------- (function call)
  G, M, ⟨AL,FR⟩ ⊢ le := call(e1, e2) : G′, M′, ⟨AL′,FR′⟩

Table 2.3: Abstract semantic rules for statements.
The statement assert grd changes the condition on which the execution path is
valid. The result guard is the conjunction of the accumulated guard and the current
guard. The unchanged components in the semantics are omitted for simplicity.
The semantic rule of assignment changes the input abstract memory state. The
AddGuard and Update functions are defined in Table 2.4. AddGuard is an overloaded
function on guarded values, allocated and deallocated sets, and abstract
memories; the input guard is distributed to each element. The Update function
distinguishes two cases (strong and weak updates), determined by the address
values.
The strong update overwrites the previous guarded values of the updated address.
The rule indicates that a destructive update can happen only when the address
value is a single address (note the singleton set GV1 = {(g, v)}) and the address
v is not an aggregated address (e.g. arrays, allocated addresses, and addresses
in recursive call cycles). As a result, the value of address v is replaced by the
guarded values GV2. The guards of the new values are the conjunctions of guard g
of the address and the guards of the guarded values GV2.

  AddGuard(g, GV) = {(g ∧ gi, vi)}i     where GV = {(gi, vi)}i
  AddGuard(g, AL) = {(g ∧ gi, li)}i     where AL = {(gi, li)}i
  AddGuard(g, FR) = {(g ∧ gi, li)}i     where FR = {(gi, li)}i
  AddGuard(g, M) = {ℓ1 ↦ AddGuard(g, GV1), ..., ℓi ↦ AddGuard(g, GVi)}
                                        where M = {ℓ1 ↦ GV1, ..., ℓi ↦ GVi}

  Update(GV1, M, GV2) =
    M{v ↦ AddGuard(g, GV2)}
        where GV1 = {(g, v)} and v is not an aggregated address   (strong update)
    M{vi ↦ AddGuard(¬gi, M(vi)) ∪ AddGuard(gi, GV2)}i
        where GV1 = {(gi, vi)}i                                   (weak update)

Table 2.4: AddGuard and Update functions.
The weak update case describes more complex situations caused by abstraction.
An abstract address may point to several abstract addresses, whereas in a real
execution a pointer can point to only a single address. In this situation, the
guards on the guarded values should be chosen carefully. The following example
shows how the guards are handled. The abstract memory states M and M′ satisfy
M ⊢ ∗p := 3 : M′.

  M  = { p ↦ {(A, x), (B, y)},  x ↦ {(C, 1)},  y ↦ {(D, 2)} }
  M′ = { p ↦ {(A, x), (B, y)},  x ↦ {(¬A ∧ C, 1), (A, 3)},  y ↦ {(¬B ∧ D, 2), (B, 3)} }

The value of x is updated only when condition A is satisfied. Hence x contains
the value 3 only when condition A holds. If condition A does not hold (hence
¬A holds) and the original condition C holds, then x keeps its original value 1.
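The two functions of Table 2.4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: guards are atoms (strings) or nested tuples, the helper names conj, neg, add_guard, and update are hypothetical, and an address that is missing from the memory is treated as holding no values rather than triggering the creation of a symbol.

```python
TRUE = "true"


def conj(g1, g2):
    """Conjunction of two guards, dropping a trivial `true`."""
    if g1 == TRUE:
        return g2
    if g2 == TRUE:
        return g1
    return ("and", g1, g2)


def neg(g):
    return ("not", g)


def add_guard(g, gv):
    """AddGuard on a guarded value: a set of (guard, value) pairs."""
    return {(conj(g, gi), vi) for gi, vi in gv}


def update(gv1, mem, gv2, aggregated=frozenset()):
    """Update(GV1, M, GV2); returns a new memory and leaves `mem` untouched."""
    new_mem = dict(mem)
    if len(gv1) == 1:
        ((g, v),) = gv1
        if v not in aggregated:  # strong update: overwrite
            new_mem[v] = add_guard(g, gv2)
            return new_mem
    for gi, vi in gv1:  # weak update: keep old values under the negated guard
        old = mem.get(vi, set())
        new_mem[vi] = add_guard(neg(gi), old) | add_guard(gi, gv2)
    return new_mem


# The example above: *p := 3 where p may point to x (under A) or y (under B).
mem = {
    "p": {("A", "x"), ("B", "y")},
    "x": {("C", 1)},
    "y": {("D", 2)},
}
after = update(mem["p"], mem, {(TRUE, 3)})
```

Here after["x"] holds the value 1 under ¬A ∧ C and the value 3 under A, matching M′ in the example.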
The abstract semantic rule of the return command records the return value of a
function in the predefined address Ret. The value of address Ret is collected
from the different return points and used to make the procedural summary of the
function.
Our analysis uses procedural summaries, hence the semantics of a procedure
call is defined by instantiating pre-calculated procedural summaries. After
evaluating the function pointers (GVf), the parameter (GVp), and the return
address (GVr), the corresponding procedural summaries are instantiated with these
values. The detailed semantics of function Instantiate is defined later in
Section 2.8.
Table 2.5 shows how we evaluate value, location, and guard expressions. The
abstract binary operator ⊕ makes a value by combining two values. If both values
are numbers, the abstract ⊕ coincides with the concrete arithmetic operator and
returns the computed number. If either value is ⊤, the result is also ⊤.
Otherwise it freezes the operator with the two values as a symbolic value.
The abstract relation operator ∼ is defined similarly: if both input values
evaluate to numbers, it returns true or false according to the operator;
otherwise it creates a symbolic guard.
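A minimal Python sketch of the two abstract operators follows. It is an illustration only: the names abstract_binop and abstract_rel, the tuple encoding of symbolic values, and the restriction to a few operators (division and modulo are left out to avoid discussing division by zero) are assumptions of the example.

```python
import operator

TOP = "⊤"

# Concrete counterparts for a subset of Bop and Rel.
ARITH = {"+": operator.add, "-": operator.sub, "*": operator.mul}
RELS = {
    "=": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def abstract_binop(op, v1, v2):
    """Abstract version of a binary operator on symbolic values."""
    if v1 == TOP or v2 == TOP:
        return TOP  # unknown operands give an unknown result
    if isinstance(v1, int) and isinstance(v2, int):
        return ARITH[op](v1, v2)  # both numbers: compute concretely
    return (v1, op, v2)  # otherwise freeze the operation symbolically


def abstract_rel(rel, v1, v2):
    """Abstract version of a relation; yields a guard."""
    if isinstance(v1, int) and isinstance(v2, int):
        return "true" if RELS[rel](v1, v2) else "false"
    return (v1, rel, v2)  # symbolic guard
```

For instance, abstract_binop("+", "α", 1) produces the frozen value ("α", "+", 1), which corresponds to the symbolic value α + 1 used in Section 2.4.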
2.4 Constructing Unknown Input Memories
We must estimate the exit memory state of each procedure without knowing its
input memory state (call context), because a bottom-up analysis has no
information about the input memory of the procedure being analyzed. To model the
unknown input memory state, we introduce symbolic expressions. These symbolic
expressions are later instantiated with the corresponding values from the input
memory at each call site.
Figure 2.2 shows how the unknown input memory is inferred by analyzing a
procedure. The analysis starts with the assert true and an empty memory. The
abstract memory state at the exit point (line 5) is presented below the code.
Parameter a is accessed in the conditional statement at line 3, but its value is
unknown. Hence a new symbol α is created to represent the value of parameter a.
For the field value a->len, which is also unknown, a new symbol β is created. At
line 2, variable res contains the guarded value {⟨true, 0⟩}, which means that res
always has the value zero at that program point. From the conditional statement,
guards β > 5 and β ≤ 5 are kept for the true and false branches respectively. At
this point the actual values of parameter a and a->len are unknown, but how these
values are accessed in the procedure is determined.

  Judgment form: M ⊢ e : GV

  M ⊢ c : {(true, c)}

  M(x) = GV
  ------------
  M ⊢ x : GV

  M ⊢ e1 : {(gi, vi)}i     M ⊢ e2 : {(gj, vj)}j
  ------------------------------------------------
  M ⊢ e1 ⊕ e2 : ⋃ij {(gi ∧ gj, vi ⊕ vj)}

  M ⊢l le : GV
  --------------
  M ⊢ &le : GV

  M ⊢l le : {(gi, vi)}i     M(vi) = {(gj, vj)}j     GVi = {(gi ∧ gj, vj)}j
  ---------------------------------------------------------------------------
  M ⊢ le : ⋃i GVi

  Judgment form: M ⊢l le : GV

  M ⊢l x : {(true, x)}

  M ⊢ e : GV
  --------------
  M ⊢l ∗e : GV

  M ⊢ e : GV
  ------------------     where {(gi, vi)}i.f = {(gi, vi.f)}i
  M ⊢l e.f : GV.f

  Judgment form: M ⊢ grd : g

  M ⊢ grd : g
  ----------------
  M ⊢ ¬grd : ¬g

  M ⊢ e1 : {(gi, vi)}i     M ⊢ e2 : {(gj, vj)}j
  ------------------------------------------------
  M ⊢ e1 ∼ e2 : ∨ij (gi ∧ gj ∧ (vi ∼ vj))

Table 2.5: Abstract semantic rules for expressions, location expressions, and
guards.

   1 int* foo(list *a, int b){
   2   int *res = 0;
   3   if (a->len > 5)
   4     res = bar(b);
   5   return res;
   6 }
   7 int* bar(int x){
   8   int *m = 0;
   9   if (x > 0)
  10     m = malloc(x);
  11   return m;
  12 }

  The abstract memory at line 5:
    a     ↦ {⟨true, α⟩}
    α.len ↦ {⟨true, β⟩}
    b     ↦ {⟨true, γ⟩}
    res   ↦ {⟨β > 5 ∧ γ > 0, ℓ⟩, ⟨β ≤ 5 ∨ (β > 5 ∧ γ ≤ 0), 0⟩}

  The procedural summary of bar:
    {(Ret*, Allocating, x > 0), (Ret*, Nullifying, x ≤ 0)}

Figure 2.2: Procedure bar with its procedural summary and procedure foo with
its abstract memory state at the exit point (line 5).
During analysis, we check the types of domain elements dynamically. Whenever a
value is dereferenced, its type should be Addr. If not, the resulting guarded
value is the empty set ⊥GV = {}. If the address has no value in the current
memory state, a new symbolic address is introduced. Hence the memory look-up
function used in our analysis differs from a plain table look-up. The function
is defined as follows:

  M(v) = GV             if (v ↦ GV) ∈ M
         {(true, α)}    otherwise, where α is a new symbol with origin(α) = v
  where v ∈ Addr

  M(GV) = ⋃i AddGuard(gi, M(vi))     where GV = {(gi, vi)}i
If a new symbolic address α is created, the address v that introduced it is
recorded in the function origin (i.e. origin(α) = v). For example, origin(α) = a
and origin(β) = α.len hold in Fig. 2.2. This reverse mapping is essential to
correctly trace values back to the input memory when the accessed input memory
is later updated. The following shows a procedure and the abstract memory at its
exit point.

  f(int a){ x = a + 1; a = 0;}
  M = {a ↦ {(true, 0)}, x ↦ {(true, α + 1)}}

Regardless of the input value of parameter a, its final value is zero. With the
origin function, we know that the symbolic value α in the value of x comes from
the parameter (origin(α) = a).
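The look-up with on-demand symbol creation and the origin mapping can be sketched as below. This is a hedged Python illustration: the class Memory, its attribute names, and the symbol naming scheme are hypothetical, and the sketch assumes that a freshly created symbol is also stored in the memory so that later reads see the same symbol.

```python
import itertools


class Memory:
    """Abstract memory that invents a symbol for every unknown input cell."""

    def __init__(self):
        self.cells = {}   # address -> set of (guard, value)
        self.origin = {}  # symbol -> address that introduced it
        self._fresh = itertools.count(1)

    def lookup(self, addr):
        if addr not in self.cells:
            symbol = f"sym{next(self._fresh)}"  # a new symbolic value
            self.origin[symbol] = addr          # remember where it came from
            self.cells[addr] = {("true", symbol)}
        return self.cells[addr]


# The procedure f(int a){ x = a + 1; a = 0; } from the text.
m = Memory()
((_, alpha),) = m.lookup("a")                # first read of a creates a symbol
m.cells["x"] = {("true", (alpha, "+", 1))}   # x = a + 1
m.cells["a"] = {("true", 0)}                 # a = 0 overwrites the input value
```

After the overwrite, m.origin still maps the symbol back to a, so the value of x can be related to the parameter at a call site.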
2.5 Merging Abstract States
When a node has more than one predecessor (e.g. loop heads and merge points of
branches), we merge all the abstract states from the predecessors into a single
abstract state S, while preserving path sensitivity. The merge operator ⊔ is
defined as follows:

  (G1, M1, ⟨AL1,FR1⟩) ⊔ (G2, M2, ⟨AL2,FR2⟩) = (G′, M′, ⟨AL′,FR′⟩)
  where G′  = G1 ∨ G2
        M′  = AddGuard(G1, M1) ⊔ AddGuard(G2, M2)
        AL′ = AddGuard(G1, AL1) ∪ AddGuard(G2, AL2)
        FR′ = AddGuard(G1, FR1) ∪ AddGuard(G2, FR2)

Asserts are merged as a disjunction, which captures the possible path conditions
to reach the program point.
Memories are joined after each assert G is propagated to each memory entry in
order to preserve path sensitivity [136]. The piecewise join operator ⊔ of
abstract memories is defined as follows:

  M1 ⊔ M2 = λa. M1(a) ∪ M2(a)
The allocated and deallocated sets are joined after reflecting each assert on the
corresponding sets to preserve path information. If an address occurs in several
elements of a set AL or FR, we interpret the accompanying guards as a
disjunction. For example, we interpret AL = {(x = 0, ℓ), (x = 1, ℓ)} as address ℓ
being allocated when x = 0 ∨ x = 1. In practice, allocated and deallocated sets
are implemented as finite maps from Addr to Guard. Hence, at a merge point, the
guards of an address that occurs in the sets of several predecessors are combined
disjunctively. The same technique can be applied when taking the union of two
sets of guarded values.
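The merge operator can be sketched in Python as follows. This is an illustration under assumptions: states are 4-tuples (G, M, AL, FR), guards are strings or nested tuples, the helper names are hypothetical, and an address missing from one memory contributes no values (the on-demand symbol creation of Section 2.4 is not modeled). The sketch uses the set-based form of AL and FR rather than the finite-map implementation mentioned above.

```python
TRUE = "true"


def conj(g1, g2):
    """Conjunction of two guards, dropping a trivial `true`."""
    if g1 == TRUE:
        return g2
    if g2 == TRUE:
        return g1
    return ("and", g1, g2)


def add_guard(g, pairs):
    """Distribute guard g over a set of (guard, item) pairs."""
    return {(conj(g, gi), x) for gi, x in pairs}


def merge(state1, state2):
    """Join two abstract states (G, M, AL, FR) from different predecessors."""
    (g1, m1, al1, fr1), (g2, m2, al2, fr2) = state1, state2
    mem = {}
    for addr in m1.keys() | m2.keys():
        mem[addr] = add_guard(g1, m1.get(addr, set())) | add_guard(
            g2, m2.get(addr, set())
        )
    return (
        ("or", g1, g2),
        mem,
        add_guard(g1, al1) | add_guard(g2, al2),
        add_guard(g1, fr1) | add_guard(g2, fr2),
    )


# Two branches that assign different values to res (guards kept as atoms).
true_branch = ("b > 5", {"res": {("c > 0", "l"), ("c <= 0", 0)}}, {("c > 0", "l")}, set())
false_branch = ("b <= 5", {"res": {(TRUE, 0)}}, set(), set())
guard, mem, al, fr = merge(true_branch, false_branch)
```

Each value of res in the merged memory carries the assert of the branch it came from, which is how path information survives the join.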
2.6 Handling Loops
Termination of fixpoint iteration over a domain of infinite height is not
guaranteed. In our abstract domain, the heights of the number domain N and the
symbolic-value domain Value × Bop × Value are infinite. We introduce a widening
operator [27,30] to guarantee that a fixpoint is reached after finitely many
iterations.
Our widening operator is simple: after k iterations (delayed widening [14]),
values that are still changing are replaced by the special value ⊤ (indicating
an unknown value). Because of the ⊤ values, our conclusions on memory leaks and
code clones are neither sound nor complete.
For example, procedure freeList in Figure 2.3 deallocates all list elements
(without any limit on the length) linked from the head pointer argument. To
fully capture the memory behavior of this procedure, infinitely many
deallocation actions would have to be recorded, and an infinite summary is
infeasible.
Because our analysis is restricted to generating a finite number of symbolic
values, we abstract the memory behavior of procedure freeList into deallocating
just k linked elements αi from the head pointer. These finite abstract memory
and deallocated sets lead to a finite procedural summary.
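Delayed widening can be sketched on a single guarded value as follows. This is a simplified Python illustration, not the analyzer's operator: the names widen and fixpoint, the default k = 3, and the loop body count_up are assumptions of the example.

```python
TOP = "⊤"


def widen(old_gv, new_gv, iteration, k):
    """Iterate precisely for the first k rounds, then jump to ⊤ if unstable."""
    if new_gv == old_gv or iteration < k:
        return new_gv
    return {("true", TOP)}


def fixpoint(step, init, k=3, max_rounds=1000):
    """Apply `step` until the guarded value stops changing."""
    gv = init
    for iteration in range(max_rounds):
        nxt = widen(gv, step(gv), iteration, k)
        if nxt == gv:
            return gv
        gv = nxt
    raise RuntimeError("no fixpoint within max_rounds")


def count_up(gv):
    """A loop body such as `i = i + 1`, which never stabilizes on its own."""
    return {(g, v if v == TOP else v + 1) for g, v in gv}


result = fixpoint(count_up, {("true", 0)})
```

The counter is followed precisely for k rounds and then collapses to ⊤, after which the iteration is stable.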
  freeList(List *p){
    List *x, *y;
    x = p;
    while(x != 0){
      y = x->next;
      free(x);
      x = y;
    }
    return;
  }

  The abstract memory at the exit point:
    p         ↦ {⟨true, α1⟩}
    α1.next   ↦ {⟨α1 ≠ 0, α2⟩}
    ...
    αk−1.next ↦ {⟨αk−1 ≠ 0, αk⟩}
    x         ↦ {⟨true, 0⟩}
    y         ↦ {⟨true, 0⟩}

  The allocated and deallocated sets at the exit point:
    ⟨AL,FR⟩ = ⟨∅, {α1, · · · , αk}⟩

Figure 2.3: An example procedure and its abstract memory M and the
allocated/deallocated sets ⟨AL,FR⟩ at the exit point. From a k-bounded exit
memory state, the summary is also bounded.
2.7 Explanatory Example
Throughout this dissertation we will use the toy example in Figure 2.4.
Procedure use calls procedure summary once, and procedure summary does not call
other procedures except the two library functions free and malloc. Hence our
analysis starts by analyzing procedure summary.

   1 struct List {
   2   struct List *next;
   3   int val;
   4 };
   5
   6 struct List *table;
   7
   8 struct List *summary(int n, struct List *node){
   9   struct List *x;
  10   if (n > 0) free(node->next);
  11   if (n < 10) {
  12     x = 0;
  13   } else {
  14     x = malloc(...);
  15   }
  16   if (node->val + n > 0) table = node;
  17   return x;
  18 }
  19
  20 struct List *use (int option){
  21   struct List *lst = malloc(...);
  22   lst->val = 0;
  23   if (lst == 0) return 0;
  24   lst->next = malloc(...);
  25   if (lst->next == 0) return 0;
  26   if (option > 0) {
  27     lst->next = summary(option, lst);
  28   }
  29   return lst;
  30 }

Figure 2.4: An explanatory code example.

The following table shows the abstract state (G,M, ⟨AL,FR⟩) at the exit point of
the summary procedure.
  G = true
  M =
    n      ↦ {(true, α)}
    node   ↦ {(true, β)}
    β.next ↦ {(α > 0, γ)}
    β.val  ↦ {(true, δ)}
    x      ↦ {(α < 10, 0), (α ≥ 10 ∧ success14, ℓ14), (α ≥ 10 ∧ ¬success14, 0)}
    table  ↦ {(δ + α > 0, β)}
    Ret    ↦ {(α < 10, 0), (α ≥ 10 ∧ success14, ℓ14), (α ≥ 10 ∧ ¬success14, 0)}
  ⟨AL,FR⟩ = ⟨{(α ≥ 10 ∧ success14, ℓ14)}, {(α > 0, γ)}⟩
Note that the symbols α, β, γ, and δ are created to represent unknown values
reachable from the parameters. Whenever a symbol is generated, the origin
function is updated. As a result, we get the following origin function:
origin(α) = n
origin(β) = node
origin(γ) = β.next
origin(δ) = β.val
We need to capture the situation in which a memory allocation via the malloc
library function fails. Consider the following code snippet with a memory leak:
1 int *a = malloc(...);
2 if (a == 0) return;
3 int *b = malloc(...);
4 if (b == 0) return;
Suppose that the memory allocation at line 1 succeeds and the memory allocation
at line 3 fails. Then, before the return statement at line 4, the allocated
memory pointed to by pointer a should be deallocated. If we do not consider the
situation in which a memory allocation fails, we miss this kind of memory leak.
In order to capture the semantics of malloc library function calls, we introduce
boolean variables. The boolean variable success14 represents the condition that
a memory cell is successfully allocated at line 14. If the allocation succeeds,
the allocated address ℓ14 is generated. If the condition does not hold, a null
pointer is created as the return value of malloc.
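The treatment of malloc can be sketched as a function from an allocation site to a guarded value. This is a hedged Python illustration: the function name abstract_malloc and the string encoding of the success variable and of the allocated address are assumptions of the example.

```python
def abstract_malloc(line):
    """Abstract effect of `malloc` called at the given source line.

    Returns the guarded value of the call and the entry added to the
    allocated set AL.
    """
    success = f"success{line}"  # boolean variable: the allocation succeeded
    addr = f"l{line}"           # allocation-site address
    returned = {(success, addr), (("not", success), 0)}
    allocated = {(success, addr)}
    return returned, allocated


returned, allocated = abstract_malloc(14)
```

The null pointer under the negated success guard is what allows the leak in the snippet above to be reported on the path where the second allocation fails.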
The final abstract memory captures the procedure’s behavior with respect to
pointer aliasing, memory allocation, and memory deallocation. The global pointer
table points to the symbolic value of parameter node. The set of allocated
addresses includes ℓ14 together with its condition. The symbolic address γ is
deallocated when the value α of n is greater than 0.
2.8 Parametrized Procedural Summary
2.8.1 Summary Information
The procedural summary information enables the analyzer to capture the semantics
of procedure calls without analyzing the procedures again. The main problem is
determining which information should be captured, and in how much detail the
memory behavior should be described.
We observe memory-behavior of procedures and focus on effects visible to the
outside of the procedures. Locations visible to the outside of a procedure are those
reachable from the global variables, the pointer arguments, and the return value
of the procedure.
Understanding a procedure’s memory-behavior effects needs four pieces of in-
formation: allocations, deallocations, aliases and null pointers. That is, we need to
know which allocations inside procedures become visible to the outside, which lo-
cations visible to the outside are freed, which locations visible to the outside are
aliased, and which locations visible to the outside are null pointers.
1. Allocation: Allocated locations are visible to the outside of a procedure only
when they are returned or assigned to locations of the caller’s environment.
In C, locations of the caller’s environment are reachable only via globals or
pointer arguments. We do not record which allocated locations are assigned to
globals, because addresses reachable from global variables are accessible
from any environment in the program. As a consequence, we miss some leaks that
come from inter-procedural overwriting of allocated addresses stored in the
same global variable.
2. Deallocation: We record which existing locations are freed. Locations that
were allocated before the procedure call and are visible inside the procedure
are accessible either via pointer arguments or via globals. For allocated
locations already reachable from globals, we do not record which are freed.
Locations reachable from globals remain visible from anywhere in the program
until the global variable is overwritten with a new address, so they are not
of concern for detecting non-interprocedural memory leaks.
3. Alias: Recall that locations visible to the outside are those reachable
from the globals, pointer arguments, and return values. Aliases between these
three classes of locations arise from assignments between them.
4. Null: Null pointers are closely related to memory leaks. In C programs, if an
address is checked to be null, then in most cases the address is not allocated.
Hence, tracing null pointers visible to the outside enhances the accuracy of
the analysis.
In addition to the information mentioned above, we want to know under which
conditions these actions (allocation, deallocation, aliasing, and null pointer
assignment) happen. The procedural summary information used in our analysis is
shown in Table 2.6.

  P ∈ ProcSummary = 2^(Anchor×Action×Guard)
  A ∈ Action = Allocating + Freeing + (Aliasing × Anchor) + Nullifying
  C ∈ Anchor = (Ret | Param | Global) × (* | .f)∗

Table 2.6: Procedural Summary: the memory behaviour of procedures is captured
in this form.
Formally, a procedural summary is a set of triples (Anchor × Action × Guard).
(1) Anchors (Anchor) are used to locate the place where an action occurs. As
mentioned above, we care only about locations visible to the outside of
procedures, so anchors record access paths from the return value, parameters,
or global addresses. (2) Actions (Action) indicate which kind of memory behavior
happens (allocating new heap cells, deallocating allocated heap cells, aliasing,
or null pointer assignment). For aliasing, the aliased location is also captured
as an anchor. (3) Guards (Guard) are the conditions under which the action
happens. All symbolic addresses in a procedural summary are instantiated with
the corresponding values according to the calling context (the abstract memories
of call sites).
Once a procedure has been analyzed and summarized, only its procedural summary
is used at its different call contexts. For example, at line 4 in Fig. 2.2,
procedure bar is called. According to the pre-calculated procedural summary, the
procedure returns an allocated address ℓ when the value of parameter x is
greater than 0; otherwise it returns a null pointer.
A procedural summary keeps conditions (as extended from [83]) for the memory
behaviors of a procedure. This procedural summary is instantiated with the
abstract memory state at the call site. The value of formal parameter x in
procedure bar is instantiated with γ (the value of actual parameter b). With
this instantiation of the procedural summary, we obtain the resulting memory
state of the procedure call (line 4). Now, variable res points to the resulting
guarded value {⟨β > 5 ∧ γ > 0, ℓ⟩, ⟨β > 5 ∧ γ ≤ 0, 0⟩}. Here guard β > 5 comes
from the condition of the true branch at line 3, and guards γ > 0 and γ ≤ 0 come
from the procedural summary of bar. At line 5, the abstract memory states of the
true and false branches are joined. Variable res points to the guarded value
{⟨β ≤ 5, 0⟩} in the memory state of the false branch. The joined memory state at
the return point of foo (line 5) is shown in Fig. 2.2. The procedural summary of
procedure foo is automatically generated from this abstract memory state
(Section 2.8.3).
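The instantiation of the summary of bar at the call in foo can be sketched as follows. This Python fragment is a simplified illustration that handles only actions on the return anchor: the names substitute and returned_values, the tuple encoding of guards, and the plain conjunction with the path guard are assumptions of the example.

```python
def substitute(term, formal, actual):
    """Replace every occurrence of the formal parameter by the actual value."""
    if term == formal:
        return actual
    if isinstance(term, tuple):
        return tuple(substitute(t, formal, actual) for t in term)
    return term


def returned_values(summary, formal, actual, path_guard, fresh_addr):
    """Guarded value assigned to the call result, from return-anchored actions."""
    result = set()
    for anchor, action, guard in summary:
        if anchor != "Ret*":
            continue
        guard = ("and", path_guard, substitute(guard, formal, actual))
        if action == "Allocating":
            result.add((guard, fresh_addr))
        elif action == "Nullifying":
            result.add((guard, 0))
    return result


# Summary of bar as given in Figure 2.2.
bar_summary = {
    ("Ret*", "Allocating", ("x", ">", 0)),
    ("Ret*", "Nullifying", ("x", "≤", 0)),
}
# Call `res = bar(b)` at line 4 of foo: b holds γ, and the path guard is β > 5.
res = returned_values(bar_summary, "x", "γ", ("β", ">", 5), "ℓ")
```

The resulting set corresponds to the guarded value {⟨β > 5 ∧ γ > 0, ℓ⟩, ⟨β > 5 ∧ γ ≤ 0, 0⟩} discussed above.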
2.8.2 Summary Instantiation Using Calling Contexts
Algorithm 1: Instantiate GVr GVf GVp M ⟨AL,FR⟩
Input: abstract memory state M, return address GVr, parameter values GVp,
       function pointers GVf, allocated set AL, and deallocated set FR
Output: result abstract memory state M′ and a pair of sets of allocated and
        deallocated addresses ⟨AL′,FR′⟩
   1  M′ := {}; AL′ := {}; FR′ := {};
   2  foreach procedure (gf, f) ∈ GVf do
   3      Mf := M; ALf := AL; FRf := FR;
   4      P := ProceduralSummary(f);
   5      foreach guarded action (C, A, g) ∈ P do
   6          C′ := C{GVp/Param, GVr/Ret};
   7          A′ := A{GVp/Param};
   8          GV := Locate(C′, Mf, ∅);
   9          (Mf, ⟨ALf,FRf⟩) := TakeAction(GV, A′, g, Mf, ⟨ALf,FRf⟩)
  10      end
  11      M′ := M′ ⊔ AddGuard(gf, Mf);
  12      AL′ := AL′ ∪ AddGuard(gf, ALf);
  13      FR′ := FR′ ∪ AddGuard(gf, FRf);
  14  end
The instantiation step is presented in Algorithm 1. The instantiation takes
as input the return addresses (GVr), the function pointers (GVf), the parameters
(GVp), the abstract memory state at the call site M, and the sets of allocated
and deallocated addresses ⟨AL,FR⟩. It then reflects the memory effects of the
procedures on the resulting abstract memory state M′ and updates the sets of
allocated and deallocated addresses ⟨AL′,FR′⟩ accordingly.
The function pointer GVf may contain several procedures. The memory behavior of
all the procedures in GVf should be accumulated (line 2). For each procedure,
the calling context, the set of allocated addresses, and the set of deallocated
addresses are remembered as Mf, ALf, and FRf respectively (line 3). For the
given procedure f in the function pointer, we pick out its procedural summary P
(line 4).
A procedural summary contains tuples of anchor, action, and guard. Each tuple
describes how to change the abstract memory Mf and the sets of allocated and
deallocated addresses ⟨ALf,FRf⟩. First, we instantiate the formal parameter
Param and the return address Ret with the actual parameter value GVp and the
actual return address GVr of the calling context (line 6). An Aliasing action
may contain the parameter Param in order to describe aliasing from the parameter
to other addresses, so the formal parameter in the action is also instantiated
with the actual parameter. No address can originate from the return address, so
Ret cannot appear in an Aliasing action (line 7). Second, the function Locate
determines the location on which the action happens (line 8). Finally, the
action takes place at that location. The type of action determines how the
abstract memory and the sets of allocated and deallocated addresses are changed
(line 9). Details of Locate and TakeAction are presented in Table 2.7.
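The exploration performed by Locate can be sketched as a loop over the anchor. This is a hedged Python illustration: the anchor encoding (("addr", a), "*", and (".", f)), the tuple representation of field addresses, and the helper names are assumptions of the example, and an address missing from the memory contributes no values.

```python
TRUE = "true"


def conj(g1, g2):
    """Conjunction of two guards, dropping a trivial `true`."""
    if g1 == TRUE:
        return g2
    if g2 == TRUE:
        return g1
    return ("and", g1, g2)


def locate(anchor, mem, gv):
    """Follow `anchor` through `mem`, accumulating guards along the way."""
    for step in anchor:
        if step == "*":  # dereference: look up every candidate address
            gv = {
                (conj(g, g2), v2)
                for g, v in gv
                for g2, v2 in mem.get(v, set())
            }
        elif step[0] == "addr":  # restart from a concrete address
            gv = {(TRUE, step[1])}
        elif step[0] == ".":  # field access on every candidate address
            gv = {(g, (v, step[1])) for g, v in gv}
    return gv


# Hypothetical memory: lst points to l21 and l21.next points to l24.
mem = {
    "lst": {("s21", "l21")},
    ("l21", "next"): {("s24", "l24")},
}
target = locate([("addr", "lst"), "*", (".", "next"), "*"], mem, set())
```

The guard attached to the located address is the conjunction of the guards met on the way, here s21 ∧ s24.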
Function Locate(C,M,GV) in Table 2.7 explores the abstract memory M to find the
target location. The exploration is guided by the anchor C, starting from the
initial guarded values GV. If the head of anchor C is * (dereferencing), the
abstract memory is looked up to find the guarded values pointed to by the
current guarded values. If the head of the anchor is a field access, field
values are made from the current guarded values. Function Locate continues
searching until the anchor becomes empty. All guards met along the anchor on the
way to the target location are accumulated as a conjunction while exploring the
abstract memory.

  Locate(C, M, GV) = match C with
      nil         ⇒ GV
      addr :: C′  ⇒ Locate(C′, M, {(true, addr)})
      GV :: C′    ⇒ Locate(C′, M, GV)
      * :: C′     ⇒ Locate(C′, M, M(GV))
      .f :: C′    ⇒ Locate(C′, M, GV.f)

  TakeAction(GV, A, g, M, ⟨AL,FR⟩) = match A with
      Allocating     ⇒ ( Update(GV, M, {(g, ℓ)}), ⟨AL ∪ {(g, ℓ)}, FR⟩ )   new ℓ
      Freeing        ⇒ ( M, Free(M, GV, g) )
      (Aliasing, C)  ⇒ ( Update(GV, M, Locate(C, M, ∅)), ⟨AL,FR⟩ )
      Nullifying     ⇒ ( Update(GV, M, {(g, 0)}), ⟨AL,FR⟩ )

  Free(M, GV, gfree) = ⟨AL′, FR′⟩
      where AL′ = the union, over all (galloc, ℓ) ∈ AL, of
                    {(galloc ∧ ¬(gfree ∧ g′), ℓ)}   if (g′, ℓ) ∈ M(GV)
                    {(galloc, ℓ)}                    otherwise
            FR′ = FR ∪ {(gfree ∧ g′, ℓ) | (g′, ℓ) ∈ M(GV)}

Table 2.7: Auxiliary functions used in the instantiation algorithm. Function
Locate finds the location at which actions happen. Starting from the given
guarded values GV, the abstract memory M is explored according to the anchor C.
Function TakeAction changes the abstract memory M or the allocated and
deallocated sets ⟨AL,FR⟩ according to the corresponding actions.

Function TakeAction(GV,A, g,M, ⟨AL,FR⟩) in Table 2.7 changes the current
abstract state. The function takes as input the memory M, the sets of allocated
and deallocated addresses ⟨AL,FR⟩, the type of action A, and the condition guard
g. An action can be one of the following four types.
1. Action Allocating generates a new dynamic address with the condition under
which the allocation takes place. This newly allocated address is attached to
the current location in the abstract memory. The allocated address is also
added to the set of allocated addresses AL.
2. Action Freeing deallocates the addresses pointed to by the current location,
as described by the Free function. All the addresses in M(GV) are added to the
deallocated set FR. The condition under which an address is deallocated is the
conjunction of the condition g′ under which the address is reachable and the
condition gfree recorded in the procedural summary. Deallocating an address
must also be reflected in the set of allocated addresses AL: if an address ℓ in
AL is being deallocated, the condition under which it remains allocated becomes
galloc ∧ ¬(gfree ∧ g′).
3. Action Aliasing finds the aliased values by following the anchor C. The
guarded values found are then pointed to by the current location given as GV.
4. Action Nullifying makes the current location point to a null pointer.
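The effect of the Freeing action on the allocated and deallocated sets can be sketched as follows. This is a hedged Python illustration of the Free function of Table 2.7: the parameter `pointed` stands for M(GV), the helper names are hypothetical, and when an address is reachable under several guards the sketch conjoins one negated condition per guard, which is an assumption not stated in the table.

```python
TRUE = "true"


def conj(g1, g2):
    """Conjunction of two guards, dropping a trivial `true`."""
    if g1 == TRUE:
        return g2
    if g2 == TRUE:
        return g1
    return ("and", g1, g2)


def free(al, fr, pointed, g_free):
    """Return the new (AL, FR) after freeing the addresses in `pointed`."""
    new_al = set()
    for g_alloc, addr in al:
        g = g_alloc
        for g_reach, target in sorted(pointed, key=repr):
            if target == addr:  # this allocated address may be freed here
                g = conj(g, ("not", conj(g_free, g_reach)))
        new_al.add((g, addr))
    new_fr = set(fr) | {(conj(g_free, g_reach), t) for g_reach, t in pointed}
    return new_al, new_fr


# Two allocated cells; only l24 is reachable from the freed location.
al = {("s21", "l21"), (("and", "s21", "s24"), "l24")}
pointed = {(("and", "s21", "s24"), "l24")}
new_al, new_fr = free(al, set(), pointed, "a > 0")
```

The entry for l21 is unchanged, while l24 stays allocated only under its allocation guard conjoined with the negation of the freeing condition.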
Procedural Summary Instantiation Example
Let’s look at the use procedure in Figure 2.4 again. This procedure calls procedure
summary hence the summary procedure is analyzed and summarized as follows:
Psummary = { (Ret, Allocating, α ≥ 10 ∧ success14),
             (Ret, Nullifying, α < 10 ∨ ¬success14),
             (node*.next, Freeing, α > 0),
             ((table, Aliasing, δ + α > 0), node*) }
The summarization process for this procedure is described in Section 2.8.3.
Next, we analyze the use procedure. At the call site of procedure summary (line
27), the calculated abstract state is the following:
G = success21 ∧ success24 ∧ α > 0
M = { lst ↦ {(success21, ℓ21)},
      ℓ21.val ↦ {(true, 0)},
      ℓ21.next ↦ {(success21 ∧ success24, ℓ24)},
      option ↦ {(success21 ∧ success24, α)} }
⟨AL,FR⟩ = ⟨{(success21, ℓ21), (success21 ∧ success24, ℓ24)}, ∅⟩
Following the steps of Algorithm 1, the new abstract state is computed (the
assertion G is unchanged, though). From the statement lst->next = summary(option,lst)
and memory M, we get GVr, GVf, and GVp as follows:
GVr = (success21, ℓ21.next)
GVf = (true, summary)
GVp = {(success21 ∧ success24, α), (success21, ℓ21)}
Because GVf is a singleton set containing procedure summary, we use only the
procedural summary of that procedure. Several procedures may be called when
function pointers exist in the program and the bound procedure is determined at
run time.
For each guarded action in P, we first substitute the formal parameters and the
return address with the corresponding actual parameters GVp and the address GVr
that holds the return value (lines 6 and 7). During substitution the symbolic
values are recovered as accesses from parameters and return values using the
origin function presented in Section 2.7. This substitution generates the
following procedural summary:
P = { (ℓ21.next, Allocating, α ≥ 10 ∧ success14),
      (ℓ21.next, Nullifying, α < 10 ∨ ¬success14),
      (ℓ21.next, Freeing, α > 0),
      ((table, Aliasing, 0 + α > 0), ℓ21) }
The symbolic value α now represents the value of parameter option of procedure
use, not of parameter n of procedure summary. After reflecting all actions in
P, we obtain the following new abstract memory M and allocated/deallocated sets
of addresses ⟨AL,FR⟩.
G = success21 ∧ success24 ∧ α > 0
M = { lst ↦ {(success21, ℓ21)},
      ℓ21.val ↦ {(true, 0)},
      ℓ21.next ↦ {(success21 ∧ success14 ∧ α ≥ 10, ℓ14),
                  (success21 ∧ (¬success14 ∨ α < 10), 0)},
      option ↦ {(success21 ∧ success24, α)},
      table ↦ {(α > 0, ℓ21)} }
AL = {(success21, ℓ21), (success21 ∧ success24 ∧ α ≤ 0, ℓ24),
      (success14 ∧ α ≥ 10, ℓ14)}
FR = {(α > 0, ℓ24)}
From the final abstract state, we can conclude that the allocated address ℓ24
is safely freed: the current assertion implies α > 0, so the condition under
which ℓ24 remains allocated (which requires α ≤ 0) is unsatisfiable.
Furthermore, we know that the address ℓ21 is definitely allocated and that ℓ14
is allocated if α ≥ 10 and the allocation succeeds.
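The final check in this example can be mimicked with a small script. The sketch below is an assumption-laden stand-in: it enumerates a small, arbitrarily chosen range of values for α instead of calling a solver, and the names G, AL, and satisfiable are local to the example.

```python
from itertools import product

# Guards over (success14, success21, success24, alpha), taken from the example.
G = lambda s14, s21, s24, a: s21 and s24 and a > 0
AL = {
    "l21": lambda s14, s21, s24, a: s21,
    "l24": lambda s14, s21, s24, a: s21 and s24 and a <= 0,
    "l14": lambda s14, s21, s24, a: s14 and a >= 10,
}

def satisfiable(guard, alphas=range(-20, 21)):
    """Brute-force stand-in for a solver call over a small, assumed value range."""
    booleans = (False, True)
    return any(guard(s14, s21, s24, a)
               for s14, s21, s24, a in product(booleans, booleans, booleans, alphas))

# An address may still be allocated only if its guard is consistent with G.
still_allocated = {
    loc for loc, g in AL.items()
    if satisfiable(lambda *v, g=g: G(*v) and g(*v))
}
print(sorted(still_allocated))  # ['l14', 'l21']
```

The address l24 drops out because its remaining-allocated guard contradicts the assertion α > 0.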
2.8.3 Summarizing Procedures from Abstract States
Algorithm 2: Summarization M ⟨AL,FR⟩ Paramf Global
Input: abstract memory state M, allocated set AL, deallocated set FR, formal parameters Paramf of procedure f, and the set of all global variables Global
Output: the procedural summary P of procedure f
1  P := {};
2  retset := Reachable(Ret, M);
3  argset := Reachable(Paramf, M);
4  globalset := Reachable(Global, M);
5  foreach (C*, a, g) ∈ retset ∪ argset do
6      P := P ∪ {(C, Allocating, g ∧ AL(a)), (C, Freeing, g ∧ FR(a))};
7      P := P ∪ {((C, Aliasing, g ∧ g′), C′) | (C′, a, g′) ∈ argset ∪ globalset};
8      P := P ∪ {(C, Nullifying, g) | a = 0};
9  end
A procedure is summarized from the abstract states (G, M, ⟨AL,FR⟩) at the
exits of the procedure, because the abstract states at the exit points accumulate
all semantics of the procedure from the entry point to the exit point. We can
extract pointer information, reachability from the visible environment, and
feasibility of path conditions from these abstract states. From ⟨AL,FR⟩ we derive
which addresses are allocated and which addresses are deallocated. Actions
Allocating and Freeing are generated from this information. Alias information
between addresses visible from the environment is also captured in the
procedural summary. Recall that the addresses visible from the environment are
those reachable from arguments, return values, and global variables. The
summarization process therefore also requires the parameters of the procedure
Paramf and the set of all global variables Global.
The summarization steps are presented in Algorithm 2. First we initialize the
summary P to the empty set (line 1). The addresses reachable from return values,
parameters, and global variables are each calculated by function Reachable
(lines 2 to 4). The Reachable function is defined in Table 2.8.
Reachable(A, M) = lfp λS. X ∪ (Onestep S M)
where
X = {(Param, a, true) | a ∈ A, a ∈ Param}
  ∪ {(Global, a, true) | a ∈ A, a ∈ Global}
  ∪ {(Ret, a, true) | a ∈ A, a ∈ Ret}
Onestep S M = {(C*, v, g ∧ g′) | (g′, v) ∈ M(a), (C, a, g) ∈ S, v ∈ Addr}
            ∪ {(C.f, (v, f), g) | (v, f) ∈ dom(M), (C, a, g) ∈ S}
Table 2.8: Function Reachable finds all the addresses reachable from the initial address set. It takes a set of starting addresses A and an abstract memory M as input. The result is a set of tuples, each consisting of an anchor, a reachable address, and the corresponding guard.
Function Reachable calculates not only the reachable addresses but also the
anchors that describe the access paths to those addresses and the guards under
which the addresses are reachable. Function Onestep is applied to the current
set until the set of tuples no longer changes (a fixpoint). The initial set of
tuples X is determined by the kind of the input addresses A. The Onestep function
explores further reachable addresses by looking up the abstract memory at the
current address or by visiting the field addresses of the current address.
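The fixpoint computation of Reachable can be sketched as below. The encoding is an assumption made for illustration: addresses and symbolic values are strings, field addresses are written "a.f", guards are plain strings, and integer constants are treated as non-addresses. The sketch also assumes an acyclic memory, since it has no bound on anchor length. The example memory is modeled on the exit state of the summary procedure from Figure 2.4.

```python
def conj(g1, g2):
    """Conjoin two guards written as strings, dropping the trivial guard."""
    parts = [g for g in (g1, g2) if g != "true"]
    return " ∧ ".join(parts) if parts else "true"

def reachable(roots, memory):
    """Least fixpoint of S -> X ∪ onestep(S); tuples are (anchor, address, guard)."""
    result = {(r, r, "true") for r in roots}          # the initial set X
    while True:
        step = set()
        for anchor, addr, guard in result:
            # Dereference: follow the guarded values stored at addr.
            for g2, value in memory.get(addr, ()):
                if isinstance(value, str):            # skip integer constants
                    step.add((anchor + "*", value, conj(guard, g2)))
            # Field access: addresses of the form "addr.f" in dom(memory).
            for loc in memory:
                if loc.startswith(addr + "."):
                    step.add((anchor + loc[len(addr):], loc, guard))
        if step <= result:                            # nothing new: fixpoint
            return result
        result |= step

M = {
    "n": [("true", "α")],
    "node": [("true", "β")],
    "β.next": [("α > 0", "γ")],
    "β.val": [("true", "δ")],
    "table": [("δ + α > 0", "β")],
    "Ret": [("α < 10", 0), ("α ≥ 10 ∧ success14", "ℓ14"),
            ("α ≥ 10 ∧ ¬success14", 0)],
}
argset = reachable({"n", "node"}, M)
retset = reachable({"Ret"}, M)
globalset = reachable({"table"}, M)
```

In this encoding the freed address γ is reached through the anchor node*.next* under guard α > 0, with the field address β.next as an intermediate tuple.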
Next we determine whether each address reachable from return values or
arguments is allocated, deallocated, aliased, or a null pointer. We do not track
how the addresses reachable from global variables are changed, because following
all memory traces of global variables is infeasible and ineffective for detecting
memory leaks. Details of this design decision are described in Section 2.10.2.
It is easy to check whether an address is allocated or deallocated (line 6).
If the current address also exists in the set of allocated addresses AL or the
set of deallocated addresses FR, we keep this information by adding the
corresponding actions. The condition under which the allocation or deallocation
takes place is the conjunction of the reachability condition g and the allocation
or deallocation condition AL(a) or FR(a), defined as follows:
AL(a) = g       if ∃(g, a) ∈ AL
        false   otherwise
FR(a) = g       if ∃(g, a) ∈ FR
        false   otherwise
Finally, alias information for the following four cases is captured (line 7):
(1) addresses reachable from arguments to return values, (2) addresses reachable
from arguments to other arguments, (3) addresses reachable from global variables
to return values, and (4) addresses reachable from global variables to arguments.
If an address a reachable from arguments or global variables (argset ∪ globalset)
is also reachable from return values or arguments (retset ∪ argset), then the
anchor C′ and guard g′ are recorded.
The resulting procedural summary P describes how to construct return values and
arguments by aliases and allocations. The summary also identifies which addresses
reachable from arguments may be deallocated.
Procedure Summarization Example
After analyzing the summary procedure in Figure 2.4, we obtain the following ab-
stract state at the exit point of the procedure:
G = true
M = { n ↦ {(true, α)},
      node ↦ {(true, β)},
      β.next ↦ {(α > 0, γ)},
      β.val ↦ {(true, δ)},
      x ↦ {(α < 10, 0), (α ≥ 10 ∧ success14, ℓ14), (α ≥ 10 ∧ ¬success14, 0)},
      table ↦ {(δ + α > 0, β)},
      Ret ↦ {(α < 10, 0), (α ≥ 10 ∧ success14, ℓ14), (α ≥ 10 ∧ ¬success14, 0)} }
⟨AL,FR⟩ = ⟨{(α ≥ 10 ∧ success14, ℓ14)}, {(α > 0, γ)}⟩
From this abstract state we can summarize the procedure as described in
Algorithm 2. All addresses reachable from the return address, parameters, and
global variables are calculated in lines 2 to 4 of the algorithm.
retset = {(Ret, Ret, true), (Ret*, 0, α < 10 ∨ ¬success14),
          (Ret*, ℓ14, α ≥ 10 ∧ success14)}
argset = {(n, n, true), (n*, α, true), (node, node, true),
          (node*, β, true), (node*.next, γ, α > 0)}
globalset = {(table, table, true), (table*, β, δ + α > 0)}
For each tuple in retset ∪ argset, we generate elements of the procedural summary
where possible, according to the summarization algorithm (lines 5 to 9). In
practice, this process generates many unnecessary allocation- or
deallocation-related elements. For example,
(Allocating, n, false), (Freeing, n, false), ...
are generated. We simply remove these elements when the condition attached to the
action is unsatisfiable. Likewise, elements aliased to themselves are generated,
such as the following:
(Aliasing, n, n), (Aliasing, n*, n*), ...
We remove these aliasing elements by checking whether the two anchors are
equivalent.
After filtering out all the trivial elements above, we obtain the following
procedural summary of the procedure.
P = { (Ret, Allocating, α ≥ 10 ∧ success14),
      (Ret, Nullifying, α < 10 ∨ ¬success14),
      (node*.next, Freeing, α > 0),
      ((table, Aliasing, δ + α > 0), node*) }
This summary describes the estimated memory behavior of the procedure. (1)
The summary procedure allocates new memory if the value α of parameter n is
at least 10 and the malloc function succeeds; the allocated memory is then
returned. (2) When α < 10 ∨ ¬success14 holds, the procedure returns a null
pointer (or zero value). (3) The procedure frees an address reachable from
parameter node when α > 0; the anchor node*.next captures the access path to the
deallocated address from the parameter. (4) The last element describes aliasing
between parameter node and global variable table. The element
((node, Aliasing, δ + α > 0), table*) could also be created, but we know that the
aliased address β comes from the argument node, hence this element is not
considered.
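The allocation, deallocation, and null-pointer part of Algorithm 2 can be sketched as follows. Several things here are assumptions of the sketch rather than the thesis design: guards are frozensets of atom strings read as conjunctions (the empty frozenset stands for true), al and fr are dictionaries from address to guard, the aliasing step of line 7 is left out, and the freed address is given the anchor node*.next* so that it matches the (C*, a, g) pattern of line 5.

```python
def summarize(retset, argset, al, fr):
    """Allocating / Freeing / Nullifying part of Algorithm 2 (aliasing omitted)."""
    summary = set()
    for anchor, addr, guard in retset | argset:
        if not anchor.endswith("*"):
            continue                      # line 5 only matches anchors of the form C*
        c = anchor[:-1]
        if addr in al:                    # AL(a) is false otherwise, so nothing is added
            summary.add((c, "Allocating", guard | al[addr]))
        if addr in fr:                    # likewise for FR(a)
            summary.add((c, "Freeing", guard | fr[addr]))
        if addr == 0:
            summary.add((c, "Nullifying", guard))
    return summary

T = frozenset()                           # the guard "true"
retset = {("Ret", "Ret", T),
          ("Ret*", 0, frozenset({"α < 10 ∨ ¬success14"})),
          ("Ret*", "ℓ14", frozenset({"α ≥ 10", "success14"}))}
argset = {("n", "n", T), ("n*", "α", T), ("node", "node", T),
          ("node*", "β", T), ("node*.next*", "γ", frozenset({"α > 0"}))}
al = {"ℓ14": frozenset({"α ≥ 10", "success14"})}
fr = {"γ": frozenset({"α > 0"})}
P = summarize(retset, argset, al, fr)
```

Because elements are only created for addresses present in al or fr, the trivially false elements never appear, and storing guards as sets removes the duplicated conjuncts.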
2.9 Main Algorithm
Algorithm 3: The main algorithm of our analysis.
Input: the list of input procedures L
Output: FinalState ∈ Procedure →fin State
        and ProceduralSummary ∈ Procedure →fin ProcSummary
1   f ∈ Procedure;
2   b ∈ Block;
3   W ∈ Worklist = 2^Block;
4   T ∈ Table = Block →fin State;
5   F ∈ Block → State → State;
6   L := ReverseTopologicalSort(L);
7   foreach procedure f ∈ L do
8       W := {entryf};
9       while W ≠ ∅ do
10          b := choose(W);
11          S := F b T(b);
12          foreach b′ ∈ Succ(b) do
13              if S ⋢ T(b′) then
14                  W := W ∪ {b′};
15                  T(b′) := T(b′) ⊔ S;
16              end
17          end
18      end
19      ⟨G, M, ⟨AL,FR⟩⟩ := T(exitf);
20      M := AddGuard(G, M);
21      AL := AddGuard(G, AL);
22      FR := AddGuard(G, FR);
23      P := Summarization M ⟨AL,FR⟩ Paramf Global;
24      ProceduralSummary(f) := P;
25      FinalState(f) := T(exitf)
26  end
The main algorithm of our analysis is presented in Algorithm 3, which takes a
program (a sequence of procedures) as input. The output of the algorithm is the
abstract state at the exit point and the procedural summary of every procedure.
The analysis order is determined by the calling relations among the procedures.
First, we sort all the procedures in reverse topological order of the static call
graph (line 6), so that callees are analyzed before their callers. Every
procedure is then analyzed in this order.
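The ordering step can be illustrated with Python's standard graphlib module. The call graph below is hypothetical, modeled on the use and summary procedures of the running example; how recursive call cycles are ordered is not covered by this sketch.

```python
from graphlib import TopologicalSorter  # available from Python 3.9

# Hypothetical static call graph: each procedure maps to the procedures it calls.
calls = {"main": {"use"}, "use": {"summary"}, "summary": set()}

# TopologicalSorter treats the mapped values as prerequisites, so callees
# are emitted before their callers.
order = list(TopologicalSorter(calls).static_order())
print(order)  # ['summary', 'use', 'main']
```

Analyzing procedures in this order means a callee's summary is already available when its call sites are reached.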
Lines 8 to 18 of Algorithm 3 describe the classical fixpoint algorithm.
Starting from the entry point of a procedure, the fixpoint algorithm computes a
table T ∈ Block → State which associates each basic block with its input state.
The semantic transfer function F defined in Section 2.3 is used to obtain the
output state of a block. For every successor b′ of block b, we check whether the
output state is subsumed by the previously computed state recorded in the table
T. If it is not, the table has not yet reached a fixpoint, so we add the
successor to the worklist W and join the output state into the table.
After reaching a fixpoint of abstract states for all basic blocks in a procedure,
we use only the abstract state of the exit point of procedure for summarizing the
procedure and recording abstract states. The map ProceduralSummary is instan-
tiated at the call sites of the procedure to reflect the memory behaviors of the
procedure to calling contexts. The map FinalState is used for detecting memory
leaks (Chapter 3) and for detecting semantic code clones (Chapter 4).
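The worklist iteration described above can be sketched generically. This is an illustration under assumptions, not the analyzer itself: states are frozensets ordered by inclusion, the transfer function simply records the visited block, and the block names and parameter names are made up for the example.

```python
def fixpoint(entry, succ, transfer, init, bottom, leq, join):
    """Compute the table of input states per block with a worklist."""
    table = {entry: init}
    worklist = {entry}
    while worklist:                       # iterate until the worklist is empty
        b = worklist.pop()                # choose a block and remove it
        out = transfer(b, table.get(b, bottom))
        for nxt in succ.get(b, ()):
            old = table.get(nxt, bottom)
            if not leq(out, old):         # output not yet subsumed by the table
                table[nxt] = join(old, out)
                worklist.add(nxt)
    return table

# A loop-shaped control-flow graph: entry -> head -> {body, exit}, body -> head.
succ = {"entry": ["head"], "head": ["body", "exit"], "body": ["head"]}
transfer = lambda block, state: state | {block}
table = fixpoint("entry", succ, transfer, frozenset(), frozenset(),
                 lambda x, y: x <= y, lambda x, y: x | y)
print(sorted(table["exit"]))  # ['body', 'entry', 'head']
```

The input state recorded for the exit block accumulates the effect of the loop body, which is the state a summarization step would then read.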
2.10 Implementation and Engineering
We implemented the proposed static analysis technique in OCaml, a functional
programming language. The core analysis engine is implemented in about 12,700
lines of OCaml code. The analysis presented in Chapter 2 is path-sensitive, hence
analyzing millions of lines of code is infeasible in general. Every branch
creates a guard and the negation of the guard. Different calling contexts and
deep procedure call chains aggravate the situation, and the number of paths
easily explodes.
We use the following techniques (some of which violate the soundness of the
analysis) to mitigate the path explosion, silence false alarms, and lower the
analysis cost. As a result we sometimes miss a small number of memory leak
errors, which we consider acceptable.
2.10.1 Reducing Guards
A naive implementation of our analysis design generates many redundant guards
during analysis. In particular, switch statements in C programs produce an
unacceptable number of redundant guards, which degrades the performance of our
analyzer.
In order to remove trivially redundant guards, we use straightforward syntactic
reductions such as the following:
A ∨ A = A                       A ∧ A = A
A ∨ ¬A = true                   A ∧ ¬A = false
A ∨ true = true                 A ∧ true = A
A ∨ false = A                   A ∧ false = false
(A ∧ B) ∨ (A ∧ ¬B) = A          (A ∨ B) ∧ (A ∨ ¬B) = A
We keep all guards combined by disjunction or conjunction as sets. Hence the
first check (A ∨ A = A and A ∧ A = A) is performed automatically when a guard
equivalent to an existing element is inserted into a set.
Whenever a guard is added conjunctively or disjunctively to existing guards,
these syntactic checks keep the size of the guards small.
For redundant guards that pass the simple syntactic checks, we use a
simplification technique. In general, we can remove a guard when the following
conditions hold:
A ∧ B = A when A ⇒ B
A ∨ B = B when A ⇒ B
This check requires SMT solver calls, which take time. Recently, Dillig et
al. [37] proposed a technique that reduces the cost of removing redundant
formulas. We implemented their technique to reduce the size of guards at
reasonable cost.
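A few of the syntactic reductions can be sketched with guards stored as sets. The representation is an assumption of this example: a conjunction is a frozenset of (name, polarity) literals, the empty frozenset stands for true, and the marker FALSE stands for an unsatisfiable conjunction. The subset test at the end covers only the purely syntactic special case of A ∨ B = B when A ⇒ B; the general case needs a solver and is not shown.

```python
FALSE = None  # marker for an unsatisfiable conjunction (assumption of this sketch)

def conj(*literals):
    """Build a conjunction as a frozenset of (name, polarity) literals."""
    result = set()
    for name, positive in literals:
        if (name, not positive) in result:
            return FALSE                  # A ∧ ¬A = false
        result.add((name, positive))      # A ∧ A = A comes for free with sets
    return frozenset(result)

def disj(c1, c2):
    """Disjunction of two conjunctions, returned as a set of conjunctions."""
    if c1 is FALSE or c2 is FALSE:        # A ∨ false = A
        return {c for c in (c1, c2) if c is not FALSE}
    diff = c1 ^ c2
    if len(diff) == 2:
        (n1, p1), (n2, p2) = diff
        if n1 == n2 and p1 != p2:
            return {c1 & c2}              # (A ∧ B) ∨ (A ∧ ¬B) = A
    if c1 <= c2:
        return {c1}                       # c2 implies c1, so c1 ∨ c2 = c1
    if c2 <= c1:
        return {c2}
    return {c1, c2}

a, not_a, b, not_b = ("A", True), ("A", False), ("B", True), ("B", False)
merged = disj(conj(a, b), conj(a, not_b))   # {frozenset({("A", True)})}
```

Here merged contains the single conjunction A, and disj(conj(a), conj(not_a)) yields the empty conjunction, that is, true.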
2.10.2 Global Variables Abstraction
All global variables in the program are abstracted into one global variable node
in the procedural summary. As a result, we miss some leaks that come from
interprocedural overwriting of allocated addresses stored in the same global
variable. For example, in the following program, a memory leak (involving the
block allocated by malloc(4)) is not reported.
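#include <stdlib.h>

int *gp;
void f(int *p){ gp = p; }
...
void g(void){
  int *p = malloc(4);
  f(p);              /* gp now holds the 4-byte block */
  p = malloc(8);
  f(p);              /* gp is overwritten: the 4-byte block leaks */
}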
[Figure 2.5 diagram: nodes Param1* and Global*, connected by an edge labeled true.]
Figure 2.5: The names of global variables are ignored in the procedural summary and represented by Global.
2.10.3 Following Loop Iteration Effects
At flow join points (e.g., a loop head), the allocation set AL and the freed set FR
of all predecessors are collected. For loops like the one in Figure 2.6, this may
cause spurious false positives in memory leak detection: the allocated address ℓ
remains in the allocation set at the exit even if it is definitely freed in the
loop body.
When a loop iterates more than once, we do not join with the ⟨{ℓ}, ∅⟩ tuple of
the initial input memory at the loop head. This choice is based on the heuristic
that most loops in programs iterate at least once.
p = malloc(...);
for(i=0;i<10;i++){
...
free(p);
}
return;
[Figure 2.6 diagram: control-flow graph of the loop above, with nodes p = malloc(...), i < 10 ?, free(p), and return.]
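The effect of the loop-head heuristic on the ⟨AL,FR⟩ tuples can be sketched as follows. The functions join and loop_head and the location name "l" are hypothetical names used only for this illustration, and guards are omitted for brevity.

```python
def join(states):
    """Collect the allocated and freed sets of all predecessor edges."""
    allocated = set().union(*(al for al, _ in states))
    freed = set().union(*(fr for _, fr in states))
    return allocated, freed

def loop_head(entry_state, back_edge_states):
    """Heuristic join: once the loop body has been analyzed, ignore the entry edge."""
    if back_edge_states:
        return join(back_edge_states)
    return entry_state

entry = ({"l"}, set())           # after p = malloc(...): ⟨{ℓ}, ∅⟩
back = (set(), {"l"})            # after free(p) in the loop body: ⟨∅, {ℓ}⟩
print(join([entry, back]))       # ({'l'}, {'l'})
print(loop_head(entry, [back]))  # (set(), {'l'})
```

With the plain join the location stays in the allocated set at the loop exit, which is the false alarm described above; with the heuristic it does not.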
f(int *p, int n){
  if (n) free(p);
}
[Memory diagram with nodes arg1 and ℓ, connected by a * edge.]
Similarly, assignments to global variables are collected regardless of the path. For example, the memory states of the following example program carry no path-dependency information:
int *gp;
f(int n){
  int *p = malloc();
  if(n) gp = p;
}
[Memory diagram with nodes p, global, and ℓ, connected by * edges.]
We assume that function free frees all locations that may be pointed to by the argument. In the following example program, the memory states indicate that both locations pointed to by the arguments are subject to being freed, though only one is actually freed.
f(int *x, int *y){
  int *p = x;
  if (n) p = y;
  free(p);
}
[Memory diagram with nodes x, y, and p, connected by * edges to the freed locations.]
• K-bound Exploration. Deriving memory images from loops is k-bounded, and hence the summaries are finite (Figure 12).
freeList(List *p){
  List *t = p;
  while(t != 0){
    free(t);
    p = p->next;
    t = p;
  }
}
[Figure 12 diagram: arg followed by a * edge and a chain of next edges, cut off at the k-bound.]
Figure 12. From a k-bounded exit memory state, the summary is k-bounded.
• Being Sensitive to Memory-Allocating Paths. We have observed that almost all false positives come from the lack of interprocedural path-sensitivity.
int foo(int **pp){
  if(n==0) return 0;
  *pp = malloc(n);
  return 1;
}
void bar(){
  int *p;
  if(foo(&p) == 0) return;
  ...
}
Procedure foo returns the integer 1 whenever a newly allocated location is assigned to its argument. But the summarized return value of this procedure would be the interval [0,1] (i.e., the return value must be either 0 or 1) because of the return 0 statement. Whenever procedure foo is called, our analyzer assumes that pointer p points to a newly allocated address and the return value is the interval [0, 1]. Hence we lose the information regarding the relation between the returned integer and the allocating action, so our analyzer falsely reports the allocated address as a potential leak.
Our remedy is as follows: when we collect summary categories from all the exits, if some paths allocate new locations and others do not, then instead of joining all the possibilities we choose to summarize only the allocation paths. Hence the return value is determined by only the allocation paths. The summary of the foo function is that it returns 1 and attaches an allocated address to the pointer argument.
• Following Loop Iteration Effects. At flow join points (e.g., a loop head), the allocation set AL and the freed set FR of all predecessors are collected. For loops like in Figure 13, it may cause some false positives. Allocated address ℓ remains in the allocation set at the exit even if it is definitely freed in the loop body.
p = malloc();
for(i=0;i<10;i++){
  ...
  free(p);
}
return;
[Figure 13 diagram: control-flow graph with nodes p = malloc(), i<10?, free(p), and return; edges labeled ⟨∅, ∅⟩, ⟨{ℓ}, ∅⟩, ⟨∅, {ℓ}⟩, and ⟨{ℓ}, {ℓ}⟩.]
Figure 13. Tuple ⟨AL, FR⟩ at each edge is the set of allocated and, respectively, freed locations. At the exit, allocated address ℓ remains not to be freed.
When a loop iterates more than once, we do not join with the ⟨{ℓ}, ∅⟩ tuple of the initial input memory at the loop head. This choice is based on the heuristic that most loops in programs iterate at least once.
• Using Names in Addition To Paths in Summaries. This is not unsound. Some procedures return an allocated address and attach the same address to an argument. Because extracting a summary from the exit memory state derives locations in terms of access paths, two different paths whose ends are the same allocated location can be confused to mean different locations. For example, the analyzer can misunderstand the semantics of procedure foo below as if it allocates two different addresses for the return value and the argument. We use allocation site identifiers in addition to access paths when summarizing procedures.
[Memory diagram with nodes arg1, ret, and ℓ, connected by * edges.]
int * foo(int **p){
  int *ret = malloc(4);   /* allocation site ℓ */
  *p = ret;
  return ret;
}
4. Summarizing Procedures via Fixpoint Iterations
To summarize a procedure we must know what the procedure does. Based on the abstract interpretation framework [5], our analyzer does fixpoint iteration to find memory states at the exits of the procedures in the input program.
During fixpoint iteration, some symbolic addresses are introduced to represent accessed locations in the unknown input memory. From these symbolic addresses an abstract input memory image is derived. Starting with this input memory image, we track the procedure’s memory behavior. The procedure’s behavior is then summarized from the memory states at the procedure’s exit points.
f(int *p, int n){if (n) free(p);
}arg1
* !
Similarly, assignments to global variables are collected regard-less of the path. For example, the following example program’smemory states has no path-dependency information:
int *gp;f(int n){
int *p =malloc();if(n) gp = p;
}
p
global
*
*
!
We assume that function free frees all locations that may bepointed to by the argument. In the following example program,the memory states indicate that both locations pointed to bythe arguments are subject to being freed, though only one isactually freed.
f(int *x, int *y){int *p = x;if (n) p = y;free(p);
}
x
y p
*
* *
*
!!
!!
• K-bound Exploration Deriving memory images from loops isk-bounded, and hence the summaries are finite (Figure 12).
freeList(List *p){List *t = p;while(t != 0){
free(t);p = p->next;t = p;
}}arg ... ...
k-bound
* next next next! ! !
Figure 12. From a k-bounded exit memory state, the summary isk-bounded.
• Being Sensitive to Memory-Allocating Paths We have ob-served that almost all false positives come from the lack of in-terprocedural path-sensitivity.
int foo(int **pp){if(n==0) return 0;*pp = malloc(n);return 1;}void bar(){int *p;if(foo(&p) == 0) return;...
}Procedure foo returns the integer 1 whenever a newly allocatedlocation is assigned to its argument. But the summarized returnvalue of this procedure would to be the interval [0,1] (i.e., thereturn value must be either 0 or 1) because of the return 0statement. Whenever procedure foo is called, our analyzer as-sumes that pointer p points to a newly allocated address and thereturn value is the interval [0, 1]. Hence we lose the informationregarding the relation between the returned integer and allocat-ing action. So our analyzer falsely reports the allocated addressas a potential leak.
Our remedy is, when we collect summary categories from allthe exits, if some paths allocate new locations and others do not,instead of joining the all possibilities we choose to summarizeonly the allocation paths. Hence the return value is determinedby only the allocation paths. The summary of the foo functionis that it returns 1 and attaches an allocated address to thepointer argument.
• Following Loop Iteration Effects At flow join points (e.g.,a loop head), the allocation set L and the freed set L! of allpredecessors are collected. For loops like in Figure 13, it maycause some false positives. Allocated address ! remains in theallocation set at the exit even if it is definitely freed in the loopbody.
p = malloc();for(i=0;i<10;i++){
...free(p);
}return;
p = malloc!()
i<10? return
free(p)
!", "#
!{!}, "#
!{!}, "#!", {!}#
!{!}, {!}#
Figure 13. Tuple !L, L!# at each edge is the set of allocated and,respectively, freed locations. At the exit, allocated address ! re-mains not to be freed.
When a loop iterates more than once, we do not join with the!{!}, "# tuple of the initial input memory at the loop head. Thischoice is based on the heuristic that most loops in programsiterate at least once.
• Using Names in Addition To Paths in Summaries This isnot unsound. Some procedures return an allocated address andattach the same address to an argument. Because extracting asummary from the exit memory state derives locations in termsof access paths, two different paths whose ends are the sameallocated location can be confused to mean different locations.For example, analyzer can misunderstand the semantics of pro-cedure foo below as if it allocates two different addresses forthe return value and the argument. We use allocation site identi-fiers in addition to access paths when summarizing procedures.arg1 ret
!
**
*!
int * foo(int **p){int *ret = malloc!(4);*p = ret;return ret;
}
4. Summarizing Procedures via FixpointIterations
To summarize a procedure we must know what the procedure does.Based on the abstract interpretation framework [5], our analyzerdoes fixpoint iteration to find memory states at the exits of theprocedures in the input program.During fixpoint iteration, some symbolic addresses are intro-
duced to represent accessed locations in the unknown input mem-ory. From these symbolic addresses an abstract input memory im-age is derived. Starting with this input memory image, we trackthe procedure’s memory behavior. The procedure’s behavior is thensummarized from the memory states at the procedure’s exit points.
f(int *p, int n){if (n) free(p);
}arg1
* !
Similarly, assignments to global variables are collected regard-less of the path. For example, the following example program’smemory states has no path-dependency information:
int *gp;f(int n){
int *p =malloc();if(n) gp = p;
}
p
global
*
*
!
We assume that function free frees all locations that may bepointed to by the argument. In the following example program,the memory states indicate that both locations pointed to bythe arguments are subject to being freed, though only one isactually freed.
f(int *x, int *y){int *p = x;if (n) p = y;free(p);
}
x
y p
*
* *
*
!!
!!
• K-bound Exploration Deriving memory images from loops isk-bounded, and hence the summaries are finite (Figure 12).
freeList(List *p){List *t = p;while(t != 0){
free(t);p = p->next;t = p;
}}arg ... ...
k-bound
* next next next! ! !
Figure 12. From a k-bounded exit memory state, the summary isk-bounded.
• Being Sensitive to Memory-Allocating Paths We have ob-served that almost all false positives come from the lack of in-terprocedural path-sensitivity.
int foo(int **pp){if(n==0) return 0;*pp = malloc(n);return 1;}void bar(){int *p;if(foo(&p) == 0) return;...
}Procedure foo returns the integer 1 whenever a newly allocatedlocation is assigned to its argument. But the summarized returnvalue of this procedure would to be the interval [0,1] (i.e., thereturn value must be either 0 or 1) because of the return 0statement. Whenever procedure foo is called, our analyzer as-sumes that pointer p points to a newly allocated address and thereturn value is the interval [0, 1]. Hence we lose the informationregarding the relation between the returned integer and allocat-ing action. So our analyzer falsely reports the allocated addressas a potential leak.
Our remedy is, when we collect summary categories from allthe exits, if some paths allocate new locations and others do not,instead of joining the all possibilities we choose to summarizeonly the allocation paths. Hence the return value is determinedby only the allocation paths. The summary of the foo functionis that it returns 1 and attaches an allocated address to thepointer argument.
• Following Loop Iteration Effects At flow join points (e.g.,a loop head), the allocation set L and the freed set L! of allpredecessors are collected. For loops like in Figure 13, it maycause some false positives. Allocated address ! remains in theallocation set at the exit even if it is definitely freed in the loopbody.
p = malloc();for(i=0;i<10;i++){
...free(p);
}return;
p = malloc!()
i<10? return
free(p)
!", "#
!{!}, "#
!{!}, "#!", {!}#
!{!}, {!}#
Figure 13. Tuple !L, L!# at each edge is the set of allocated and,respectively, freed locations. At the exit, allocated address ! re-mains not to be freed.
When a loop iterates more than once, we do not join with the!{!}, "# tuple of the initial input memory at the loop head. Thischoice is based on the heuristic that most loops in programsiterate at least once.
• Using Names in Addition To Paths in Summaries This isnot unsound. Some procedures return an allocated address andattach the same address to an argument. Because extracting asummary from the exit memory state derives locations in termsof access paths, two different paths whose ends are the sameallocated location can be confused to mean different locations.For example, analyzer can misunderstand the semantics of pro-cedure foo below as if it allocates two different addresses forthe return value and the argument. We use allocation site identi-fiers in addition to access paths when summarizing procedures.arg1 ret
!
**
*!
int * foo(int **p){int *ret = malloc!(4);*p = ret;return ret;
}
4. Summarizing Procedures via FixpointIterations
To summarize a procedure we must know what the procedure does.Based on the abstract interpretation framework [5], our analyzerdoes fixpoint iteration to find memory states at the exits of theprocedures in the input program.During fixpoint iteration, some symbolic addresses are intro-
duced to represent accessed locations in the unknown input mem-ory. From these symbolic addresses an abstract input memory im-age is derived. Starting with this input memory image, we trackthe procedure’s memory behavior. The procedure’s behavior is thensummarized from the memory states at the procedure’s exit points.
f(int *p, int n){if (n) free(p);
}arg1
* !
Similarly, assignments to global variables are collected regard-less of the path. For example, the following example program’smemory states has no path-dependency information:
int *gp;f(int n){
int *p =malloc();if(n) gp = p;
}
p
global
*
*
!
We assume that function free frees all locations that may bepointed to by the argument. In the following example program,the memory states indicate that both locations pointed to bythe arguments are subject to being freed, though only one isactually freed.
f(int *x, int *y){int *p = x;if (n) p = y;free(p);
}
x
y p
*
* *
*
!!
!!
• K-bound Exploration Deriving memory images from loops isk-bounded, and hence the summaries are finite (Figure 12).
freeList(List *p){List *t = p;while(t != 0){
free(t);p = p->next;t = p;
}}arg ... ...
k-bound
* next next next! ! !
Figure 12. From a k-bounded exit memory state, the summary isk-bounded.
• Being Sensitive to Memory-Allocating Paths We have ob-served that almost all false positives come from the lack of in-terprocedural path-sensitivity.
int foo(int **pp){if(n==0) return 0;*pp = malloc(n);return 1;}void bar(){int *p;if(foo(&p) == 0) return;...
}Procedure foo returns the integer 1 whenever a newly allocatedlocation is assigned to its argument. But the summarized returnvalue of this procedure would to be the interval [0,1] (i.e., thereturn value must be either 0 or 1) because of the return 0statement. Whenever procedure foo is called, our analyzer as-sumes that pointer p points to a newly allocated address and thereturn value is the interval [0, 1]. Hence we lose the informationregarding the relation between the returned integer and allocat-ing action. So our analyzer falsely reports the allocated addressas a potential leak.
    f(int *p, int n){
      if (n) free(p);
    }
[Diagram: node arg1; edge *; location ℓ]
Similarly, assignments to global variables are collected regardless of the path. For
example, the memory state of the following example program has no path-dependency
information:
    int *gp;
    f(int n){
      int *p = malloc();
      if(n) gp = p;
    }
[Diagram: nodes p and global; edges *; location ℓ]
We assume that function free frees all locations that may be pointed to by the argument.
In the following example program, the memory states indicate that both locations pointed
to by the arguments are subject to being freed, though only one is actually freed.
    f(int *x, int *y){
      int *p = x;
      if (n) p = y;
      free(p);
    }
[Diagram: nodes x, y, and p; edges *; two locations ℓ]
• K-bound Exploration Deriving memory images from loops is k-bounded, and hence the
summaries are finite (Figure 12).
    freeList(List *p){
      List *t = p;
      while(t != 0){
        free(t);
        p = p->next;
        t = p;
      }
    }
[Diagram: node arg; edges * and next, repeated up to the k-bound; locations ℓ]
Figure 12. From a k-bounded exit memory state, the summary is k-bounded.
• Being Sensitive to Memory-Allocating Paths We have observed that almost all false
positives come from the lack of interprocedural path-sensitivity.
    int foo(int **pp){
      if(n==0) return 0;
      *pp = malloc(n);
      return 1;
    }
    void bar(){
      int *p;
      if(foo(&p) == 0) return;
      ...
    }
Procedure foo returns the integer 1 whenever a newly allocated location is assigned to
its argument. But the summarized return value of this procedure would be the interval
[0,1] (i.e., the return value must be either 0 or 1) because of the return 0 statement.
Whenever procedure foo is called, our analyzer assumes that pointer p points to a newly
allocated address and the return value is the interval [0, 1]. Hence we lose the
information regarding the relation between the returned integer and the allocating
action, and our analyzer falsely reports the allocated address as a potential leak.
Our remedy is as follows: when we collect summary categories from all the exits, if some
paths allocate new locations and others do not, instead of joining all the possibilities
we choose to summarize only the allocation paths. Hence the return value is determined
by only the allocation paths. The summary of the foo function is that it returns 1 and
attaches an allocated address to the pointer argument.
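The remedy can be illustrated with a minimal sketch. The types and names below (ExitInfo, summarize_exits, foo_exits) are hypothetical and not taken from the analyzer; the sketch only models, for each exit of a procedure, whether the path allocates and the interval of its integer return value.

```c
#include <assert.h>
#include <stddef.h>

/* One exit path of a procedure: whether it allocates, and the
   interval [ret_lo, ret_hi] of its integer return value. */
typedef struct {
    int allocates;
    int ret_lo, ret_hi;
} ExitInfo;

/* Join the return intervals of the selected exits. If only_alloc is
   nonzero and at least one exit allocates, non-allocating exits are
   skipped (the remedy described in the text). */
ExitInfo summarize_exits(const ExitInfo *exits, size_t n, int only_alloc)
{
    assert(n > 0);
    int any_alloc = 0;
    for (size_t i = 0; i < n; i++)
        any_alloc |= exits[i].allocates;

    ExitInfo out = { 0, 0, 0 };
    int first = 1;
    for (size_t i = 0; i < n; i++) {
        if (only_alloc && any_alloc && !exits[i].allocates)
            continue;
        if (first) {
            out = exits[i];
            first = 0;
        } else {
            out.allocates |= exits[i].allocates;
            if (exits[i].ret_lo < out.ret_lo) out.ret_lo = exits[i].ret_lo;
            if (exits[i].ret_hi > out.ret_hi) out.ret_hi = exits[i].ret_hi;
        }
    }
    return out;
}

/* The two exits of foo: "return 0" (no allocation) and "return 1". */
const ExitInfo foo_exits[2] = { { 0, 0, 0 }, { 1, 1, 1 } };
```

Calling summarize_exits(foo_exits, 2, 0) joins both exits and yields the interval [0, 1], whereas summarize_exits(foo_exits, 2, 1) keeps only the allocating exit and yields [1, 1], so the caller's test foo(&p) == 0 no longer appears feasible together with the allocation.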
• Following Loop Iteration Effects At flow join points (e.g., a loop head), the
allocation set AL and the freed set FR of all predecessors are collected. For loops like
the one in Figure 13, this may cause some false positives. Allocated address ℓ remains
in the allocation set at the exit even if it is definitely freed in the loop body.
    p = malloc();
    for(i=0;i<10;i++){
      ...
      free(p);
    }
    return;
[Diagram: control-flow graph with nodes p = malloc_ℓ(), i<10?, free(p), and return;
edge annotations ⟨∅, ∅⟩, ⟨{ℓ}, ∅⟩, ⟨∅, {ℓ}⟩, and ⟨{ℓ}, {ℓ}⟩]
Figure 13. Tuple ⟨AL, FR⟩ at each edge is the set of allocated and, respectively, freed
locations. At the exit, allocated address ℓ remains not to be freed.
When a loop iterates more than once, we do not join with the ⟨{ℓ}, ∅⟩ tuple of the
initial input memory at the loop head. This choice is based on the heuristic that most
loops in programs iterate at least once.
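The effect of this heuristic on the loop example can be sketched as follows. This is an illustrative model only: sets of abstract locations are encoded as bitmasks, and the names Effect, effect_join, do_free, and loop_exit are assumptions introduced for this sketch.

```c
#include <assert.h>

/* Sets of abstract locations as bitmasks; bit i stands for location i. */
typedef struct {
    unsigned alloc;  /* allocated set */
    unsigned freed;  /* freed set */
} Effect;

/* Join at a flow join point: component-wise union. */
Effect effect_join(Effect a, Effect b)
{
    Effect r = { a.alloc | b.alloc, a.freed | b.freed };
    return r;
}

/* Effect of free(p) where p points to location loc. */
Effect do_free(Effect e, unsigned loc)
{
    assert(loc < 32);
    e.alloc &= ~(1u << loc);
    e.freed |= 1u << loc;
    return e;
}

/* Effect at the exit of: p = malloc(); for (...) { free(p); } return;
   skip_entry selects the heuristic: once the loop body has been analyzed,
   the tuple from before the loop is not joined in at the loop head. */
Effect loop_exit(int skip_entry)
{
    Effect entry = { 1u << 0, 0 };      /* after malloc: location 0 allocated */
    Effect back = do_free(entry, 0);    /* after the body: location 0 freed */
    return skip_entry ? back : effect_join(entry, back);
}
```

Without the heuristic, loop_exit(0) still contains location 0 in the allocated set, which corresponds to the false alarm described in the text; with the heuristic, loop_exit(1) reports the location only as freed.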
• Using Names in Addition To Paths in Summaries This is not unsound. Some procedures
return an allocated address and attach the same address to an argument. Because
extracting a summary from the exit memory state derives locations in terms of access
paths, two different paths whose ends are the same allocated location can be confused to
mean different locations. For example, the analyzer can misunderstand the semantics of
procedure foo below as if it allocates two different addresses for the return value and
the argument. We use allocation site identifiers in addition to access paths when
summarizing procedures.
[Diagram: nodes arg1 and ret; edges * and **; location ℓ]
    int * foo(int **p){
      int *ret = malloc_ℓ(4);
      *p = ret;
      return ret;
    }
4. Summarizing Procedures via Fixpoint Iterations
To summarize a procedure we must know what the procedure does. Based on the abstract
interpretation framework [5], our analyzer does fixpoint iteration to find memory states
at the exits of the procedures in the input program.
During fixpoint iteration, some symbolic addresses are introduced to represent accessed
locations in the unknown input memory. From these symbolic addresses an abstract input
memory image is derived. Starting with this input memory image, we track the procedure’s
memory behavior. The procedure’s behavior is then summarized from the memory states at
the procedure’s exit points.
Figure 2.6: Tuple ⟨AL, FR⟩ at each edge is the set of allocated and, respectively,
freed locations. At the exit, allocated address ℓ remains not to be freed.
Chapter 3
Memory Leak Detection
3.1 Introduction
A memory leak in a C program is sometimes fatal; it may silently ail the pro-
gram until memory is exhausted and the program is aborted. A procedure leaks
heap memory whenever (1) memory is allocated while the procedure is active and
(2) this memory is neither recycled nor visible to its caller after its return. The
local point of view on memory leak detection enables modular analysis.
Using the static analysis presented in this dissertation, we devised an automatic
memory leak detector for C programs. The choice of summary categories has been
empirically tuned: the categories were chosen after other choices had been tested
against realistic C programs. The abstraction decisions focus on not neglecting common
memory-leak-related behaviors in realistic C programs.
The presented memory leak detection cannot soundly determine that a program is free
from memory leaks; it detects some memory leaks but not all, because the underlying
static analysis is neither sound nor complete.
3.2 Memory Leak Detection Overview
The analysis consists of two processes, as described in Chapter 2: (1) analyzing callee
procedures and then summarizing their memory behavior using the parameterized
procedural summaries presented in Section 2.8, and (2) instantiating procedural
summaries at the procedures’ call sites with the calling contexts.
In this section, we explain how to use procedural summaries and how to summarize the
memory behavior of procedures, focusing on memory leak detection. We use pictorial
presentations of summaries and abstract memories to aid the reader’s understanding.
3.2.1 Summaries and Their Use
Consider the following example:
    int n;
    int *foo(List *p) {
      if (n) return mymalloc(...);
      else {
        p->next = mymalloc(...);
        return 0;
      }
    }

    int *bar(){
      List k;
      n = 1;
      int *a = foo(&k);
      return a;
    }
[Figure 3.1 diagram, three panels.
 Summary of foo: nodes Ret and Param; Alloc nodes under guards n > 0 and ¬(n > 0);
 edge labels * and .next.
 Instantiation for a = foo(&k): nodes a and k; Alloc nodes under guards 1 > 0 and
 ¬(1 > 0); edge labels * and .next.
 Summary of bar: node Ret; Alloc node under guard true; edge label *.]
Figure 3.1: Procedural summary, instantiation, and summarization.
We use the pictorial presentation of procedural summaries explained in Section 2.8.
For brevity, we assume that mymalloc always succeeds in allocating memory, and we omit
null pointer assignments from the procedural summary representation. In Figure 3.1, the
summary for procedure foo says that the return value can be a pointer to an allocated
cell or the parameter pointer’s next field can point to another allocated cell, each
with its corresponding condition. This pictorial representation is identical to the
following procedural summary and origin function:
    P = {(Ret*, Allocating, α > 0), (β.next, Allocating, ¬(α > 0))}
    origin(α) = n
    origin(β) = p
The call site foo(&k) inside bar uses this summary by instantiating the return and
parameter boxes with a and &k respectively.
Of the two possibilities for allocating a new memory cell, only the allocated address
reachable from the return value is feasible, because the value of n in the guards of
the procedural summary is instantiated with 1, which makes the guard ¬(1 > 0)
infeasible. With the instantiation result we can summarize the bar procedure.
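The pruning step in this example can be sketched in a few lines. The sketch assumes a deliberately simplified guard language (only α > 0 and its negation) and a call site where the value of α is a known constant; Guard, Entry, guard_holds, and instantiate_summary are hypothetical names, and the path strings merely echo the summary P above.

```c
#include <assert.h>
#include <string.h>

/* A guard of the form (alpha > 0) or its negation, where alpha is a
   symbol standing for the callee's view of global n. */
typedef struct {
    int negated;
} Guard;

/* One summary entry: the access path that reaches an allocated cell
   and the guard under which this happens. */
typedef struct {
    const char *path;
    Guard guard;
} Entry;

/* Summary of foo, following Figure 3.1. */
const Entry foo_summary[2] = {
    { "Ret*",      { 0 } },   /* allocated when alpha > 0 */
    { "beta.next", { 1 } },   /* allocated when not (alpha > 0) */
};

/* Evaluate a guard once alpha is known at the call site. */
int guard_holds(Guard g, int alpha)
{
    int v = alpha > 0;
    return g.negated ? !v : v;
}

/* Copy the entries whose guard is feasible for the given alpha into out;
   returns how many were kept. out must have room for n entries. */
int instantiate_summary(const Entry *summary, int n, int alpha, Entry *out)
{
    assert(n >= 0);
    int kept = 0;
    for (int i = 0; i < n; i++)
        if (guard_holds(summary[i].guard, alpha))
            out[kept++] = summary[i];
    return kept;
}
```

With alpha = 1, as at the call site in bar, only the entry for the return value survives. When the value is not a constant at the call site, the guard would have to stay symbolic rather than be evaluated as done here.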
3.2.2 From Memory Effects to Summaries
Summarization of procedures consists of two sequential steps: (1) estimating mem-
ory effects of a procedure and (2) creating the procedural summary consisting of in-
formation for identifying possible memory leaks from the estimated memory effects.
The memory-effect estimation step is based on abstract interpretation [27–29], us-
ing fixpoint iteration on our abstract semantics of the C language as presented in
Chapter 2.
The memory effect of a procedure consists of three pieces of information: allo-
cated addresses, freed addresses, and the exit memory state (memory state at the
end of the procedure). From these three pieces of information, it is straightforward
to summarize effects related to memory leaks. From the exit memory state, we col-
lect the addresses that are potentially reachable from outside the procedure (via
the global variables, the pointer arguments, and the return value). We then ex-
amine which among these locations are allocated ones, freed ones, or aliased ones.
The results are summarized into the procedure’s summary.
One major obstacle in estimating the exit memory state for each procedure is
how to derive the exit memory state without knowing the input memory state (call
context). We have to parameterize the exit memory state by the procedure’s input
memory state.
Our first observation is that we do not need the whole image of the input mem-
ory but only those locations that are accessed by the procedure. Our second ob-
servation is that C procedures access the input memory through either arguments
or global variables. Our third observation is that although we cannot collect the
accessed locations themselves unless we have the input memory, we can determine
the “access path” with which those locations are accessed. Such access paths are
explicit in the procedure source.
Each procedural summary describes accessed locations that are reachable from
the outside of the procedure. All accessed locations, in parameterized form as ac-
cess paths, occur as location entries in the exit memory state. Among them, reach-
able locations from the outside of the procedure are those reachable from global
variables, the pointer arguments, or the return value. In the summary, from such
reachable locations and the sets of allocated and freed locations we summarize the
procedure’s behavior. The summarization algorithm is presented in Section 2.8.3.
3.2.3 Instantiating Summaries
In analyzing a procedure, when we meet a call site we instantiate the callee pro-
cedure’s summary with the call site’s abstract states. The instantiation consists
simply of using the call site’s memory to fill in the blanks in the summary that
were parameterized by summarization. Alias information captured at the call site’s
memory is reflected in instantiation.
The instantiation’s output consists of the abstract memory after the call, along
with the updated sets of allocated and freed locations. We track these three pieces
of information to the exits of the current procedure using fixpoint iteration, and
then we record this information in the current procedure’s summary, as explained
above. The instantiation algorithm is presented in Section 2.8.2.
3.3 Procedural Summaries for Memory Leak Detection
Even though our procedural summary can represent any allocation, deallocation, null
assignment, and alias action (Section 2.8), not all the possible actions are necessary
for detecting memory leaks.
Only the eight categories shown in Table 3.1 are used in our procedural summary for
detecting memory leaks.
We exclude the other four possible combinations from procedural summaries because they
do not contribute to memory leak detection.
• Alloc2Free If allocated addresses are safely freed before exiting a procedure
then there is no need to trace the allocated addresses.
• Alloc2Glob We do not record which allocated locations are assigned to globals,
because any addresses reachable from global variables are accessible from any
environment in the program.
• Glob2Free Because we do not trace the allocation on global variables (Alloc2Glob),
there is no reason to keep this information on procedural summaries.
• Glob2Glob The aliases between globals are not considered in our analysis be-
cause we use a single abstract location for all global variables when summa-
rizing procedures.
                  Freeing    global     argument    return
    Allocating    -          -          Alloc2Arg   Alloc2Ret
    global        -          -          Glob2Arg    Glob2Ret
    argument      Arg2Free   Arg2Glob   Arg2Arg     Arg2Ret
Table 3.1: Eight categories of procedural summary for detecting memory leaks. The
reachable locations from outside and the sets of allocated and freed locations give
us memory leak related information.
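The selection in Table 3.1 can be written down as a small predicate over (source, target) pairs. The enum and function names below are illustrative assumptions; only the set of recorded and excluded pairs comes from the table and the list of excluded combinations.

```c
#include <assert.h>

typedef enum { SRC_ALLOC, SRC_GLOBAL, SRC_ARG } Source;
typedef enum { DST_FREE, DST_GLOBAL, DST_ARG, DST_RET } Target;

/* Returns 1 if the (source, target) pair is one of the eight recorded
   categories of Table 3.1, 0 if it is one of the four excluded ones. */
int is_recorded(Source s, Target t)
{
    assert((int)s >= 0 && (int)s <= 2);
    assert((int)t >= 0 && (int)t <= 3);
    switch (s) {
    case SRC_ALLOC:   /* Alloc2Arg, Alloc2Ret */
        return t == DST_ARG || t == DST_RET;
    case SRC_GLOBAL:  /* Glob2Arg, Glob2Ret */
        return t == DST_ARG || t == DST_RET;
    case SRC_ARG:     /* Arg2Free, Arg2Glob, Arg2Arg, Arg2Ret */
        return 1;
    }
    return 0;
}

/* Count the recorded categories over all twelve combinations. */
int count_recorded(void)
{
    int count = 0;
    for (int s = 0; s <= 2; s++)
        for (int t = 0; t <= 3; t++)
            count += is_recorded((Source)s, (Target)t);
    return count;
}
```

Of the twelve combinations, count_recorded() finds eight, and the four rejected pairs are exactly Alloc2Free, Alloc2Glob, Glob2Free, and Glob2Glob.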
3.3.1 Eight Categories of Procedural Summaries
We show examples of how the eight selected categories in the procedural summary
contribute to memory leak detection. In the examples, we use the same pictorial forms
for procedural summaries as before.
A procedural summary is represented by a directed graph. Each node represents
one abstract address. Circle nodes are for heap locations (dynamically allocated
addresses). Rectangular nodes are for stack locations (for variables, arguments, and
globals). Newly allocated heap locations in the current procedure are marked by
“Alloc”, while freed locations are marked by “Free”. Each directed edge from node
a to node b indicates that location a may point to location b. The label on the
edge indicates the manner in which the predecessor points to the successor. The
label is either “∗” (dereferencing) or the name of a pointer field in a structure.
Guards are represented above the target node.
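One possible in-memory encoding of such a graph is sketched below. It is an assumption made for illustration, not the analyzer's data structure: guards are kept as plain text on the target node, and the example instance follows the description of freeNext given for Figure 3.2 (a location reachable from the argument through the next field is freed).

```c
#include <assert.h>
#include <stddef.h>

typedef enum { STACK_NODE, HEAP_NODE } NodeKind;        /* rectangle / circle */
typedef enum { MARK_NONE, MARK_ALLOC, MARK_FREE } Mark;

typedef struct {
    NodeKind kind;
    Mark mark;
    const char *name;    /* e.g. "Param"; NULL for anonymous heap nodes */
    const char *guard;   /* guard shown above the node; NULL if none */
} Node;

typedef struct {
    int from, to;        /* indices into the node array */
    const char *label;   /* "*" or a field name such as "next" */
} Edge;

/* Summary of freeNext: Param -*-> cell -next-> freed cell. */
const Node fn_nodes[3] = {
    { STACK_NODE, MARK_NONE, "Param", NULL },
    { HEAP_NODE,  MARK_NONE, NULL,    NULL },
    { HEAP_NODE,  MARK_FREE, NULL,    "true" },
};
const Edge fn_edges[2] = {
    { 0, 1, "*" },
    { 1, 2, "next" },
};

/* Count nodes carrying the given mark that are reachable from start. */
int count_marked(const Node *nodes, int n_nodes,
                 const Edge *edges, int n_edges, int start, Mark mark)
{
    assert(n_nodes > 0 && n_nodes <= 64 && start >= 0 && start < n_nodes);
    int seen[64] = { 0 };
    int stack[64];
    int top = 0, count = 0;
    stack[top++] = start;
    seen[start] = 1;
    while (top > 0) {
        int cur = stack[--top];
        if (nodes[cur].mark == mark)
            count++;
        for (int e = 0; e < n_edges; e++) {
            int to = edges[e].to;
            if (edges[e].from == cur && !seen[to]) {
                seen[to] = 1;      /* each node is pushed at most once */
                stack[top++] = to;
            }
        }
    }
    return count;
}
```

Traversing from the Param node finds one freed node and no allocated node, which is the information an Arg2Free entry carries.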
• Arg2Free (Fig. 3.2): Procedure freeNext in the example frees a location reachable
from the argument. Its summary is shown in the graph.
    struct List {
      struct List *next;
      int *val;
    };
    freeNext (List *lst){
      free(lst->next);
    }
[Figure 3.2 diagram: node Param; edge labels * and .next; target node Free with guard
true]
Figure 3.2: Arg2Free case: The procedure frees addresses reachable from arguments.
• Arg2Glob and Glob2Arg (Fig. 3.3): The circle node pointed to by the first argument
is also pointed to by a global variable, which represents Arg2Glob, while the second
parameter points to a global address, which represents Glob2Arg. In the attachGlob
procedure, the address node pointed to by the first argument p1 becomes reachable
from the global variables node. All global variables in the program are abstracted
into one global variable node.
    int *p1 = mymalloc();
    int **p2;
    attachGlob(p1,&p2);
    *p2 = mymalloc();
The difference between the two categories Arg2Glob and Glob2Arg is clearly revealed
at instantiation. All allocated addresses in the above code become reachable from
global variables. Procedure attachGlob makes the allocated location pointed to by p1
reachable from a global variable. It also makes the pointer *p2 become an alias of a
location reachable from a global variable, and therefore the allocated address
pointed to by *p2 after the procedure call is reachable from a global variable.
    int *gInt;
    List gLst;
    attachGlob(int *p1,
               int ***p2){
      gInt = p1;
      *p2 = &(gLst.val);
    }
[Figure 3.3 diagram: nodes Param1, Param2, and Global; edge labels * and **; guard true]
Figure 3.3: Arg2Glob and Glob2Arg cases: The attachGlob procedure attaches some
locations reachable from arguments to global variables and attaches locations
reachable from global variables to arguments.
• Alloc2Arg (Fig. 3.4): In real C programs some procedures attach allocated addresses
to pointer arguments. More leaks can be detected by capturing this situation.
Interestingly, in our experiment on the binutils project we found that more
procedures pass newly allocated memory back to their caller via their parameters
(491) than via their return value (160).
• Alloc2Ret (Fig. 3.5): In real C programs many objects are allocated via pro-
cedure calls. It is the most common way to return allocated heap objects.
The structures of heap objects are captured.
    makeArray(int ** p){
      *p = mymalloc();
    }
[Figure 3.4 diagram: node Param; edge labels * and .next; target node Alloc with guard
true]
Figure 3.4: Alloc2Arg case: The makeArray procedure attaches an allocated address to
the pointer argument p.
    List * make2List(){
      List * lst = mymalloc();
      lst->val = mymalloc();
      lst->next = mymalloc();
      (lst->next)->val = mymalloc();
      return lst;
    }
[Figure 3.5 diagram: node Ret; four Alloc nodes, each with guard true; edge labels *,
.next, and .val]
Figure 3.5: Alloc2Ret case: The make2List procedure returns an allocated list of
length two.
• Glob2Ret and Arg2Arg (Fig. 3.6): Some procedures in the Linux kernel return an
object from a global table. Allocated addresses attached to this object must not be
reported as leaks.
Traces of addresses passed through arguments to other arguments should be kept. In
the example, the address pointed to by the first argument is passed to the second
argument. This information enables the analyzer to know interprocedural aliases via
arguments.
    List * argPassing(List *lst1,
                      List **lst2){
      *lst2 = lst1;
      return &gLst;
    }
[Figure 3.6 diagram: nodes Param1, Param2, Ret, and Global; edge labels * and **;
guards true]
Figure 3.6: Glob2Ret and Arg2Arg cases: The argPassing procedure passes an address
from the first argument to the second argument and returns a global pointer.
• Arg2Ret (Fig. 3.7): Some library functions in C (e.g. “memcpy” and “str-
cpy”) return a pointer argument. Variable ret and the pointer argument lst
share a commonly reachable location. This interprocedural aliasing can be
captured.
    List * renewList(List * lst){
      List * ret = mymalloc();
      ret->next = lst->next;
      free(lst->val);
      free(lst);
      return ret;
    }
[Figure 3.7 diagram: nodes Param1 and Ret; two Free nodes and one Alloc node, each
with guard true; edge labels *]
Figure 3.7: Arg2Ret case: The renewList procedure returns addresses reachable from
an argument.
3.3.2 Interprocedural Summary Instantiation
We show how these eight categories of procedural summaries are instantiated using
simple C code (Figure 3.8). Following the instantiations to the exit of the clean
procedure shows which allocated memory cells are still unreachable there.
    1 void clean(){
    2   List *lst1,*lst2;
    3   int **ptr;
    4   lst1 = make2List();
    5   lst2 = renewList(lst1);
    6   attachGlob((lst2->next)->val, &ptr);
    7   makeArray(ptr);
    8   freeNext(lst2);
    9 }
Figure 3.8: The clean procedure calls some procedures presented above.
At line 4, procedure make2List is called. The procedure has been analyzed and its
summary (Figure 3.5) can be used. The return value of the summary is instantiated
with variable lst1. Pointer variable lst1 becomes a pointer to a newly allocated list
of length two. At line 5, procedure renewList (Figure 3.7) is called. The first formal
parameter and the returned address are instantiated with variables lst1 and lst2
respectively. Procedure renewList frees two nodes reachable from the first argument.
We can trace which addresses are freed by following the access paths of the summary.
The addresses *lst1 and *(*lst1).val allocated at line 4 are freed. The freed
addresses are removed from the allocation set and added to the freed set. The memory
state after line 5 is described in Figure 3.9.
[Figure 3.9 diagram: nodes lst1 and lst2; heap nodes marked Alloc or Free, each with
guard true; edge labels *, .next, and .val]
Figure 3.9: The memory state after line 5 of the code in Figure 3.8: some allocated
addresses are freed and the other allocated addresses are reachable from the pointer
variable lst2.
At line 6, one allocated address (lst2->next)->val is globalized by procedure
attachGlob. The pointer ptr is aliased with a global variable. At line 7, procedure
makeArray makes one allocated address pointed to by ptr. At line 8, one allocated
address lst2->next is freed by freeNext. The final memory state of clean is rep-
resented in Figure 3.10.
We can see that all addresses but one in the allocation set become reachable from
global variables. The remaining address, the one pointed to by lst2, is not reachable
from any visible environment, so it is leaked at the exit of the clean procedure
(Figure 3.10).
[Figure 3.10 diagram: nodes lst1, lst2, Global, and ptr; heap nodes marked Alloc or
Free, each with guard true; edge labels *, **, .next, and .val]
Figure 3.10: The exit memory state of clean: the one allocated address pointed to by
lst2 is not reachable from global variables, hence leaked.
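The walkthrough can be cross-checked dynamically with a small instrumented program. This is a different technique from the static analysis (it simply counts live heap cells at run time), and the counting wrappers mymalloc and myfree, their size arguments, and the explicit field initializations are assumptions added so that the example procedures compile and run.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct List {
    struct List *next;
    int *val;
} List;

/* Hypothetical counting wrappers: track how many heap cells are live. */
int live_cells = 0;

void *mymalloc(size_t size)
{
    void *p = malloc(size);
    assert(p != NULL);
    live_cells++;
    return p;
}

void myfree(void *p)
{
    if (p != NULL)
        live_cells--;
    free(p);
}

int *gInt;
List gLst;

List *make2List(void)
{
    List *lst = mymalloc(sizeof(List));
    lst->val = mymalloc(sizeof(int));
    lst->next = mymalloc(sizeof(List));
    lst->next->val = mymalloc(sizeof(int));
    lst->next->next = NULL;
    return lst;
}

List *renewList(List *lst)
{
    List *ret = mymalloc(sizeof(List));
    ret->next = lst->next;
    ret->val = NULL;
    myfree(lst->val);
    myfree(lst);
    return ret;
}

void attachGlob(int *p1, int ***p2)
{
    gInt = p1;
    *p2 = &(gLst.val);
}

void makeArray(int **p) { *p = mymalloc(sizeof(int)); }

void freeNext(List *lst) { myfree(lst->next); }

void clean(void)
{
    List *lst1, *lst2;
    int **ptr;
    lst1 = make2List();
    lst2 = renewList(lst1);
    attachGlob((lst2->next)->val, &ptr);
    makeArray(ptr);
    freeNext(lst2);
}
```

After clean() returns, three cells are live: two are reachable through gInt and gLst.val, and the third is the cell that was only pointed to by the local variable lst2. Releasing the two globally reachable cells leaves exactly one live cell that no pointer can reach any more.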
3.4 Reporting Leaks
After summarization, if there exist allocated addresses which are not reachable
from visible environments then the unreachable allocated addresses are memory
leaks.
Algorithm 4 shows the process for reporting memory leaks. The set of deallocated
addresses is not necessary to conclude memory leaks, because the effect of
deallocation is reflected not only in FR but also in AL (function Free in Table 2.7).
As in the summarization process in Algorithm 2, we first collect all possible
addresses reachable from the environment visible outside (lines 2 to 4). Recall that
the visible environment in C programs includes only parameters, return values, and
global variables. For each allocated address (ℓ at line 5) among those reachable
addresses, we conjoin the negation of its reachability condition with its allocation
condition (line 6). After inspecting all reachable addresses, allocated addresses
whose conditions are satisfiable are reported as leaks (lines 8 to 12). We use SMT
solvers [41] to check the satisfiability of guard g.
Algorithm 4: ReportingLeaks M AL Param_f Global
Input: abstract memory state M, allocated set AL, formal parameters Param_f of
       procedure f, and all the global variables Global
Output: a map L from leaked addresses to the corresponding conditions
     1  L := AL;
     2  retset := Reachable(Ret, M);
     3  argset := Reachable(Param_f, M);
     4  globalset := Reachable(Global, M);
     5  foreach (_, ℓ, g) ∈ retset ∪ argset ∪ globalset do
     6      L := L{ℓ ↦ L(ℓ) ∧ ¬g};
     7  end
     8  foreach (ℓ, g) ∈ L do
     9      if SMT(g) ⇒ Satisfiable then
    10          report leaks
    11      end
    12  end
For example, suppose that an address ℓ1 is allocated when x > 0 and that ℓ1 is
reachable from a parameter when x < 10 and from a global variable when x > 5. The
condition under which the allocated address is not reachable from outside (hence
leaked) is the following:
    x > 0 ∧ ¬(x < 10) ∧ ¬(x > 5) ≡ false
Hence we can conclude that the address ℓ1 is not leaked, so it is not reported.
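The condition built for this example can be checked mechanically. The sketch below is not an SMT query: it substitutes a bounded brute-force search over integer values of x for the solver, and the function names are assumptions introduced for illustration.

```c
#include <assert.h>

/* Guards from the example, over the single integer variable x. */
int allocated(int x)    { return x > 0; }
int reach_param(int x)  { return x < 10; }
int reach_global(int x) { return x > 5; }

/* Leak condition built as in line 6 of Algorithm 4: the allocation guard
   conjoined with the negation of every reachability guard. */
int leak_condition(int x)
{
    return allocated(x) && !reach_param(x) && !reach_global(x);
}

/* Bounded search for a witness in [lo, hi]; returns 1 and stores the
   witness if one exists, 0 otherwise. This stands in for an SMT query. */
int find_witness(int lo, int hi, int *witness)
{
    assert(lo <= hi);
    for (int x = lo; ; x++) {
        if (leak_condition(x)) {
            *witness = x;
            return 1;
        }
        if (x == hi)
            break;
    }
    return 0;
}
```

No value of x satisfies the condition, because ¬(x < 10) requires x ≥ 10 while ¬(x > 5) requires x ≤ 5; the bounded search accordingly finds no witness, matching the conclusion that ℓ1 is not reported.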
To help users locate the origin of memory leaks, the memory leak detector reports the
allocation position, the names of the allocator functions (all functions in the call
chain of allocators are shown), the escaping position (return statement), the pointers
pointing to the leaked memory cells, and the condition under which the memory leaks,
as shown in Figure 3.11.
The leaked address is pointed to by lst and allocated at line 26 in file thesis.i.
In the return statement at line 34, the procedure use terminates without safely
handling the allocated address when n − 1 ≤ 0 ∧ x + 2 = 0 is satisfiable. This
information is useful for users to locate the origin of memory leaks and fix their bugs.
    In procedure use
    allocated at (file: "thesis.i", line: 26) by summary
                 (file: "thesis.i", line: 18) by mymalloc
    escaped at (file: "thesis.i", line: 34)
    pointed to by lst
    with condition (n - 1) <= 0 /\ (x + 2) = 0
Figure 3.11: An example of reporting leaks.
3.5 Experiment Results
We implemented the memory leak detection on the presented static analyzer (called
Mairac). The performance results are presented in Table 3.2. We ran our analyzer on a
3.2GHz Pentium 4 machine with 4GB of memory under Linux.
From SPEC2000 benchmark programs, we found 81 memory leaks with just 15 false alarms.
The “gcc” program in particular takes a long time, because its control flow is quite
complex, while the other programs have relatively simple control flows.
From four open source projects (binutils, openssh, httpd, and tar), we found 251
memory leaks with 32 false alarms. We could not find any bugs in the httpd program,
but this does not necessarily mean the program is clean with respect to memory leaks.
3.5.1 Overall Comparison
In comparison with other published memory leak detectors [21, 63, 112, 136], our
analyzer consistently detects more bugs for the same published benchmark software, as
presented in Table 3.3.
Table 3.4 shows the performance of all existing techniques. Our analysis speed is
about 720 LOC/sec, next to that of the fastest analyzer, FastCheck [21]. Our
false-positive ratio (the percentage of alarms that are not true bugs) is 12.4%,
which is beaten only by Saturn [136].

    Programs          Size (KLOC)   Time (sec)   Bug Count   False Alarm
    ammp                     13.2         9.68          20             0
    art                       1.2         0.68           1             0
    bzip2                     4.6         1.52           1             0
    crafty                   19.4        84.32           0             0
    equake                    1.5         1.03           0             0
    gap                      59.4        31.03           0             0
    gcc                     205.8      1330.33          44             1
    gzip                      7.7         1.56           1             4
    mcf                       1.9         2.77           0             0
    mesa                     50.2        43.15           9             0
    parser                   10.9        15.93           0             0
    twolf                    19.7        68.80           5             0
    vortex                   52.6        34.79           0             1
    vpr                      16.9         7.85           0             9
    binutils-2.13.1         909.4       712.09         228            25
    openssh-3.5p1            36.7        10.75          18             4
    httpd-2.2.2             316.4        74.87           0             0
    tar-1.13                 49.5        11.73           5             3
Table 3.2: Analysis results on programs from SPEC2000 benchmark and open source
programs.
To evaluate the quality of memory leak detectors, we devised the following measure
called “Efficacy”.
    Efficacy = (BugCount / KLOC) / FalseAlarmRatio
It is not a perfect measure for evaluating memory leak detectors, but the underlying
intuition is the following: the more bugs and the fewer false alarms a tool finds in
the same program, the better the tool is. The speed of static analysis is not an
important factor in evaluating memory leak detection, because all these static
analyses are fully automatic and do not need to be run like interactive tools.

    C program           Tool              Bug Count   False Alarm Count
    SPEC2000            Mairac [83]              81                  15
    benchmark           FastCheck [21]           59                   8
    binutils-2.13.1     Mairac [83]             246                  29
    & openssh-3.5.p1    Saturn [136]            165                   5
                        Clouseau [63]            84                 269
Table 3.3: Performance comparison for the same C programs. Other tools’ data are from
the cited papers. Mairac found more bugs than others with a reasonable false-alarm
ratio.

    Tool                  C Size (KLOC)   Speed (LOC/s)   Bug Count   False Alarm Ratio (%)   Efficacy
    Saturn [136]                  6,822              50         455                     10%       6.67
    Clouseau [63]                 1,086             500         409                     64%       5.88
    FastCheck [21]                  671          37,900          63                     14%       6.71
    Contradiction [112]             321             300          26                     56%       1.45
    Mairac                        1,777             720         332                     12%       8.11
Table 3.4: Overall comparison with other memory leak detectors. Other tools’ data are
from [21]. Note that these tools are applied to different programs.

The efficacy of Mairac is the best among the existing techniques.
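The two ratios used in this comparison can be computed as in the following sketch. The function names are assumptions, and the false-alarm ratio is taken as a fraction of all reported alarms; the scaling used for the Efficacy column of Table 3.4 is not stated in the text, so absolute values from this sketch need not match that column.

```c
#include <assert.h>

/* Fraction of reported alarms that are not true bugs. */
double false_alarm_ratio(int bugs, int false_alarms)
{
    assert(bugs + false_alarms > 0);
    return (double)false_alarms / (double)(bugs + false_alarms);
}

/* Efficacy = (BugCount / KLOC) / FalseAlarmRatio, with the ratio
   given as a fraction in (0, 1]. */
double efficacy(int bugs, double kloc, double ratio)
{
    assert(kloc > 0.0 && ratio > 0.0);
    return ((double)bugs / kloc) / ratio;
}
```

For the totals of Table 3.2 (332 bugs and 47 false alarms), false_alarm_ratio gives about 0.124, which agrees with the 12.4% quoted above.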
3.5.2 Comparison with FastCheck
We have experimented with programs from SPEC2000 benchmarks to compare Mairac with
FastCheck [21]. Analyzing the same set of programs as in [21] except for “perlbmk”¹,
Mairac found 81 bugs among 96 reported alarms and FastCheck found 59 bugs among 67
reported alarms. Mairac caught all the bugs found by FastCheck except for two bugs
from the “gcc” program.
¹ We could not analyze the “perlbmk” program because our parser cannot accept many
of its files.
261: osmesa = (OSMesaContext) calloc( 1, sizeof( ...
262: if (osmesa) {
263: osmesa->gl_visual = gl_create_visual( rgbmode,
...
272: if (!osmesa->gl_visual) {
273: return NULL;
274: }
276: osmesa->gl_ctx = gl_create_context( ...
...
279: if (!osmesa->gl_ctx) {
280: gl_destroy_visual( osmesa->gl_visual );
281: free(osmesa);
282: return NULL;
283: }
284: osmesa->gl_buffer = gl_create_framebuffer( ...
285: if (!osmesa->gl_buffer) {
286: gl_destroy_visual( osmesa->gl_visual );
287: gl_destroy_context( osmesa->gl_ctx );
288: free(osmesa);
289: return NULL;
290: }
Figure 3.12: Example code from “mesa”(a SPEC2000 benchmark).
Figure 3.12 shows two reported memory leaks from the “mesa” program. From line 261 to
263, the pointer variable osmesa points to an allocated heap structure and
osmesa->gl_visual points to another allocated heap structure. Mairac successfully
captures that procedure gl_create_visual returns an allocated heap structure. If the
procedure gl_create_visual returns a null pointer then the current procedure returns
the null pointer at line 273 without freeing the heap structure pointed to by osmesa.
Mairac reports this leak. Mairac is silent at line 282, because all allocated heap
structures (allocated at lines 261 and 263) are freed. At line 289, Mairac reports
that some addresses allocated at line 276 are leaked.
This seems to be a false positive at first glance because there is a
gl_destroy_context function call at line 287. However, the addresses are indeed
leaked. By inspecting our procedural summaries we find that some heap locations
allocated by the procedure gl_create_context are not freed by the procedure
gl_destroy_context. FastCheck missed this leak. Figure 3.13 shows the procedural
summary of the creator and the addresses not freed by the destroyer. The analyzer can
capture the creation and the destruction of such a complex heap structure through its
procedural summaries.
In SPEC2000 benchmarks, our false positives come from several sources: the
limitation of our pruning operation for if-conditions (for “gcc”); path-insensitivity
between the return value and the condition for allocation (for “gzip”); inaccurate
approximation of the number of loop iterations (for “vortex”); and over-approximation
on a two-dimensional array (for “vpr”).
[Figure 3.13 diagram: node Ret with a * edge to an Alloc node, from which further
Alloc nodes are reached; edge labels shown: .ProxyID, .Proxy2D, .Proxy3D, .texture,
.VB, .PB, .shared, .TextObjects, .DisplayList, .TextObjectList, .Default1D,
.Default2D, .Default3D, .next]
Figure 3.13: Procedural summary of gl_create_context. Nodes are shaded if they are
not freed by procedure gl_destroy_context.
3.5.3 Comparison with Saturn
We analyzed four open source packages: binutils, openssh, httpd and tar. We used
older versions of the first two in order to compare with the results reported for
existing memory leak detection tools [63,136]. Open source software packages have
several targets (e.g. binary, library, ...). Table 3.2 lists the analysis results
for the one target that generates the largest number of alarms.
Mairac found significantly more bugs (228) than Saturn (136) in analyzing
binutils. This is because our procedural summary is finer than Saturn’s. Saturn
fails to follow allocated addresses if they become reachable from the procedure’s
parameters (category Alloc2Arg). In binutils, more procedures pass newly allocated
memory back to their caller via their parameters (491) than via their return value
(160). Even if allocated addresses are returned from a procedure, Saturn cannot
capture the structure (shape) of the allocated addresses (category Alloc2Ret).
3.5.4 Path-sensitive Extension
We extended Mairac (which was originally a path-insensitive analyzer) to be
path-sensitive using the static analysis technique presented in Chapter 2. Table 3.5
shows the comparison between the path-sensitive Mairac and the old Mairac. For the
same versions of the same programs, the new path-sensitive Mairac consistently
detects more bugs, and its average false alarm ratio is lower than that of the
original. The path-sensitive version is about three times slower than the original
one, because a path-sensitive analysis is computationally demanding. Still, the speed
seems to be acceptable.
Overall, our experiments show that the parameterized procedural summary enables
practical path-sensitive analysis for detecting memory leaks.

                                   Path-Sensitive Mairac           Old Mairac [83]
    Name            Size (.i)    Time    Bugs   False Alarms    Time    Bugs   False Alarms
    binutils          652,886     550      16              9     828      14             78
    bison-2.4         260,148     391       1             26     340       0              2
    cake              639,197     364       5             14     620       1             35
    gnuchess          112,686      47       1              1      35       1              3
    grep-2.5           50,309     155       3              3     252       2             29
    gzip-1.4          144,917      65       0              1      63       0              4
    hanterm           217,311      24       0              0       7       0              0
    httpd-2.2.2     1,997,956   7,951       0              5     682       0             14
    openssh-5.3p1     901,631     130       2              6     115       2             10
    postfix           790,574     256       1              1     393       1              3
    rmt                50,311      10       0              0       6       0              1
    sed-4.2            89,464     161      12             13     293      12             77
    tar               152,410      62       5              2     121       3             10
    Total           6,059,800  10,166      56             73   3,755      36            246
Table 3.5: Performance comparison for the same C programs between the new
path-sensitive Mairac and the old Mairac [83].
Chapter 4
Code Clone Detection
4.1 Introduction
Detecting code clones is useful for software development and maintenance. Code
clones help us in identifying refactoring candidates [67], finding potential bugs [75,
78], and understanding software evolution [40,87].
Most existing clone detectors [48, 73, 84, 89, 98] are textually biased. For exam-
ple, CCFinder [84] extracts and compares textual tokens from source code to de-
termine code clones. Deckard [73] compares characteristic vectors extracted from
abstract syntax trees (ASTs). Although these detectors are good at detecting syn-
tactic clones, they are not effective for detecting semantic clones. They fail to de-
tect code clones that are functionally similar but syntactically different.
Existing semantic clone detectors have limitations too. They either compare
program dependence graphs (PDGs) [48,89,99] or observe program executions
via random testing [74]. PDGs can be affected by syntactic changes such
as the replacement of statements with a semantically equivalent procedure call. Hence,
PDG-based clone detectors miss some semantic clones. The detection ability of
random testing-based approaches [74] depends on test coverage, which is limited
and covers only up to 60 ∼ 70% of software [114,115,132].
To detect semantic clones effectively, we propose a new clone detection tech-
nique. First we use a path-sensitive semantic-based static analyzer to estimate the
memory states at each procedure’s exit point, and then we compare the memory
states to determine clones. Since the abstract memory states have a collection of
memory effects (though approximated) along execution paths within the proce-
dures, our technique can effectively detect semantic clones, and this clone detection
ability is independent of syntactic similarity of clone candidates.
We implemented our technique as a clone detection tool, Memory Comparison-
based Clone detector (MeCC), by leveraging the proposed semantic-based static
analyzer (Chapter 2). Our experiments with three large-scale open source projects,
Python, Apache, and PostgreSQL (Section 4.7) show that MeCC can identify se-
mantic clones that other existing methods miss.
We also propose a new way to combine a semantic code clone detector with
a syntactic code clone detector. After the semantic code clone detector identifies
clones, the syntactic code clone detector is used to filter out the detected clones
that are also syntactic clones. The remaining semantic clones can be used for
software development and maintenance tasks such as identifying refactoring
candidates, detecting inconsistencies for locating potential bugs (as discussed in
Section 4.8), and detecting software plagiarism.
4.2 Clone Types
This paper proposes an abstract memory comparison-based clone detector, which
can identify all four types of clones classified by Roy et al. [122].
Clones are code pairs or groups that have the same or similar
functionalities [121, 122]. Some code clones are syntactically similar, while others
are syntactically different. Based on syntactic similarity, clones are classified into
four types. These definitions are widely used in the literature [78,121,123], and we
also use them in this paper.
• Type-1 (Exact clones): Identical code fragments except for variations in whites-
pace, layout, and comments.
• Type-2 (Renamed clones): Syntactically identical fragments except for varia-
tions in identifiers, literals, and variable types in addition to Type-1’s varia-
tions.
• Type-3 (Gapped clones): Copied fragments with further modifications such
as changed, added, or deleted statements in addition to Type-2’s variations.
• Type-4 (Semantic clones): Code fragments that perform similar functionali-
ties but are implemented by different syntactic variants.
Definitions of Type-1 and Type-2 clones are straightforward. Mostly, they are
copies (from other code fragments) that remain unchanged (Type-1) or have a
small variance (Type-2). These clones can be easily detected by comparing syn-
tactic features such as tokens in source code [84].
On the other hand, Type-4 (semantic) clones are syntactically different. Since
there is no clear consensus on Type-4 clones, some researchers define subtypes of
Type-4 clones such as statement reordering, control replacement, and unrelated
statement insertion [48, 99, 121]. Similarly, we define subtypes of Type-4 clones as
follows:
• Control replacement with semantically equivalent control structures (Refer to
Fig. 4.3.)
• Statement reordering without modifying the semantics (Refer to Fig. 4.4.)
• Statement modification that preserves memory behavior (Refer to Fig. 4.5.)
• Statement insertion without changing computation (Refer to Fig. 4.8.)
As with Type-4 clones, there is no consensus on Type-3 clones. Bellon et
al. [9] defined Type-3 clones as all clones that are neither Type-1 nor Type-2.
Similarly, in this paper, we define Type-3 clones as all clones that are neither
Type-1, Type-2, nor Type-4 clones.
4.3 Clone Detection Based on Memory Comparison
Our goal is to detect clones by comparing functionalities of code fragments, re-
gardless of their syntactic similarity. A naive way to achieve this goal is to perform
exhaustive testing on a given set of clone candidates (programs). Semantic similar-
ities of programs can be determined by generating all possible inputs for programs,
observing all possible executions using the inputs, and comparing their execution
results. However, such exhaustive testing is often infeasible since there might be
an infinite number of inputs and/or execution paths.
For this reason, we use semantic-based static analysis [27,36,71,79,83,109,110,
136] to determine semantic similarities of given programs because static analysis
soundly and finitely estimates the dynamic semantics of programs. In our case, we
use a path-sensitive semantic-based static analyzer that symbolically estimates the
memory effects of procedures.
Our overall approach is shown in Fig. 4.1. We compute abstract memory states
from given programs via static analysis. Then we compare the abstract memory
states to determine code clones.
[Figure 4.1 diagram: each of two clone candidates (the examples shown are "while(y<n){ bar() }" and "if(x>0) bar() else goto L;") is passed through the semantic-based static analyzer to produce an abstract memory state; the two abstract memory states are then fed to the memory comparison.]
Figure 4.1: Our clone detection approach: abstract memory states of individual clone candidates are computed by a path-sensitive semantic-based static analyzer. These abstract memory states are compared for detecting code clones.
We build our semantic-based static analyzer on top of Mairac [79, 83], which
analyzes and summarizes each procedure based on the abstract interpretation
framework [27]. Its procedural summaries are carefully tuned to capture all
memory-related behaviors in real-world C programs [83]. However, Mairac does
not support path-sensitive analysis. We extend Mairac to be path-sensitive, as in [136],
by adding guards and guarded values to the abstract domain (Table 2.2).
Path-sensitivity is crucial for semantic code clone detection. A path-insensitive
analyzer loses the relationship between condition expressions and the corresponding
statements. For example, a path-insensitive analyzer considers the following two
different if-else statements to be the same, since it does not know which statements
belong to which condition expressions. This insensitivity leads to the detection of
false positive clones.
if(a > 0) A else B = if(a > 0) B else A
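To make the point concrete, here is a minimal Python sketch. It is not Mairac's implementation: the set-of-pairs encoding of guarded values, the string guards, and the helper drop_guards are assumptions made purely for illustration. Once guards are dropped, the two statements above become indistinguishable.

```python
# Illustrative encoding only: a guarded value is a set of (guard, value) pairs.
# "A" and "B" stand for the effects of the two branches on some address.

def drop_guards(guarded_value):
    """Path-insensitive view: forget under which condition each value arises."""
    return frozenset(value for _guard, value in guarded_value)

# if (a > 0) A else B
first = frozenset({("a > 0", "A"), ("a <= 0", "B")})
# if (a > 0) B else A
second = frozenset({("a > 0", "B"), ("a <= 0", "A")})

# With guards, the two states differ; without guards, they coincide.
path_sensitive_equal = first == second
path_insensitive_equal = drop_guards(first) == drop_guards(second)
```

Here path_sensitive_equal is False while path_insensitive_equal is True, which is exactly the kind of false positive that guards are meant to rule out.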
4.4 Example for Comparison
The abstract memory states at the exit points of procedures are compared for code
clone detection. As an example, procedure foo2 in Fig. 4.2 is a semantic clone of
procedure foo in Fig. 2.2. If we disregard the names of variables, symbols (in this
example, the names happen to be the same), field variables, and variable types, then
the two memories are equivalent. Note that the two guards β ≤ 5 ∨ γ ≤ 0 and
β ≤ 5 ∨ (β > 5 ∧ γ ≤ 0) are equivalent. This equivalence is established by function
simplify [37] presented in Section 4.5.
    1 int* foo2(list2 *x, int y){
    2   int *ret = 0;
    3   if (x->val > 5 && y > 0)
    4     ret = mymalloc(y);
    5   return ret;
    6 }

    The abstract memory state at line 6
    x        {⟨true, α⟩}
    α.val    {⟨true, β⟩}
    y        {⟨true, γ⟩}
    ret      {⟨β > 5 ∧ γ > 0, ℓ⟩, ⟨β ≤ 5 ∨ γ ≤ 0, 0⟩}

Figure 4.2: Procedure foo2 with its abstract memory state at the exit point (line 5).

4.5 Comparing Abstract Memory States
Given estimated abstract memory states, we need to quantify their similarities.
Algorithm 5 presents the quantification steps. First, we calculate the similarities
between guarded value pairs of all possible combinations on the given memories
M1 and M2 (lines 2 to 8). We compare addresses using the equivalence relation
$\stackrel{L}{=}$ on addresses (defined in Section 4.5.1). If the addresses are
equivalent, we calculate the similarity of the two guarded values by function
simGV(GV1,GV2) (line 4). If the addresses are not equivalent, the similarity is zero
(line 5). For all combinations, the similarities of pairs are recorded in map S
(line 6). Then function find_best_matching(S) finds a subset of S in which every
address of the two memories appears at most once, such that the total similarity of
the matched pairs is maximal (line 9). Finally, the algorithm returns the ratio of
twice this total to the total size of the two memories (line 11). If both memories
are empty (the denominator is zero), the similarity is zero (line 10).
Algorithm 5: simM(M1,M2)
Input: abstract memory states M1 and M2
Output: similarity value of M1 and M2
     1  S := {};
     2  foreach address a1 ∈ dom(M1) do
     3      foreach address a2 ∈ dom(M2) do
     4          if a1 $\stackrel{L}{=}$ a2 then v := simGV(M1(a1), M2(a2));
     5          else v := 0;
     6          S := S{(a1, a2) ↦ v};
     7      end
     8  end
     9  best := find_best_matching(S);
    10  if |dom(M1)| + |dom(M2)| = 0 then return 0;
    11  return 2 · best / (|dom(M1)| + |dom(M2)|)

4.5.1 Equivalent Addresses
Two addresses are equivalent when the relation $\stackrel{L}{=}$ is satisfied as follows:
\[
\begin{array}{lll}
x \stackrel{L}{=} y & \text{if} & x, y \in \mathit{Global} \;\vee\; x, y \in \mathit{Param} \;\vee\; x, y \in \mathit{Local} \\
\ell \stackrel{L}{=} \ell' & \text{if} & \ell, \ell' \in \mathit{AllocSite} \\
a.f \stackrel{L}{=} a'.f' & \text{if} & a \stackrel{L}{=} a' \\
\alpha \stackrel{L}{=} \beta & \text{if} & \mathit{origin}(\alpha) \stackrel{L}{=} \mathit{origin}(\beta)
\end{array}
\]
When two variables are compared, names and types of the variables are ignored
(Var). We only check whether both variables are parameters, global variables, or
non-parameter local variables. All dynamically allocated addresses ℓ are considered
equivalent regardless of their allocation sites (AllocSite). For field addresses
(Addr×Field), names of field variables are ignored and only structural equivalence
is considered. For example, x.val $\stackrel{L}{=}$ x.len holds even though the
addresses use different field names. However, (x.next).len $\stackrel{L}{=}$ x.len
does not hold because the former has an additional field dereference. Symbolic
addresses are equivalent only when their origins are equivalent (Symbol).
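The address equivalence rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names Var, Alloc, Field, and Symbol, their fields, and the representation of a symbol's origin as another address are hypothetical and are not taken from the analyzer.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical address representation (names are assumptions for illustration).
@dataclass(frozen=True)
class Var:
    name: str
    kind: str          # "global", "param", or "local"

@dataclass(frozen=True)
class Alloc:
    site: str          # allocation site label

@dataclass(frozen=True)
class Field:
    base: "Addr"
    field: str

@dataclass(frozen=True)
class Symbol:
    name: str
    origin: "Addr"     # address whose initial content the symbol stands for

Addr = Union[Var, Alloc, Field, Symbol]

def addr_equiv(a, b):
    """Structural equivalence on addresses, ignoring names, types, and sites."""
    if isinstance(a, Var) and isinstance(b, Var):
        return a.kind == b.kind            # same variable category only
    if isinstance(a, Alloc) and isinstance(b, Alloc):
        return True                        # allocation sites are ignored
    if isinstance(a, Field) and isinstance(b, Field):
        return addr_equiv(a.base, b.base)  # field names are ignored
    if isinstance(a, Symbol) and isinstance(b, Symbol):
        return addr_equiv(a.origin, b.origin)
    return False                           # different kinds of address
```

With x = Var("x", "param"), addr_equiv(Field(x, "val"), Field(x, "len")) holds, whereas comparing Field(Field(x, "next"), "len") with Field(x, "len") fails because of the extra dereference.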
4.5.2 Similarity Between Guarded Values
A guarded value GV is a set of pairs, each consisting of a guard and a value. Function
simGV(GV1,GV2) compares all guards and values in GV1 with those in GV2, and
then counts the number of matched pairs n. Finally, the similarity of two guarded
values is computed as follows:
\[
\mathit{simGV}(GV_1, GV_2) = \frac{2 \cdot n}{|GV_1| + |GV_2|}
\]
\[
n = \text{maximum of } |M| \text{ s.t. } M \subseteq S \text{ and } \forall \langle (g_1, v_1), (g_2, v_2) \rangle \in M,\ (g_1, v_1) \text{ and } (g_2, v_2) \text{ appear only once}
\]
\[
S = \bigcup_{(g_1, v_1) \in GV_1,\ (g_2, v_2) \in GV_2} \{ \langle (g_1, v_1), (g_2, v_2) \rangle \mid g_1 \stackrel{G}{=} g_2 \wedge v_1 \stackrel{V}{=} v_2 \}
\]
The similarity is the ratio of the number of matched pairs to the total size of the
two guarded values. We seek the maximum number of matched pairs over
all possible combinations in GV1 × GV2. Equivalent values ($\stackrel{V}{=}$) and
equivalent guards ($\stackrel{G}{=}$) are defined as follows.
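As a rough illustration of the definition of simGV, the following Python sketch counts the maximum number of matched pairs with a standard augmenting-path bipartite matching. The function name sim_gv, the pluggable predicates guard_eq and value_eq (plain equality by default, standing in for the guard and value equivalence relations), and the convention of returning 0 for two empty guarded values are all assumptions.

```python
def sim_gv(gv1, gv2, guard_eq=lambda a, b: a == b, value_eq=lambda a, b: a == b):
    """Similarity of two guarded values given as sequences of (guard, value)."""
    gv1, gv2 = list(gv1), list(gv2)
    if not gv1 and not gv2:
        return 0.0  # assumed convention, mirroring the empty-memory case

    # S: which pairs of GV1 may be matched with which pairs of GV2.
    candidates = [
        [j for j, (g2, v2) in enumerate(gv2) if guard_eq(g1, g2) and value_eq(v1, v2)]
        for (g1, v1) in gv1
    ]
    matched_to = [-1] * len(gv2)  # index in gv1 matched to each element of gv2

    def try_match(i, seen):
        # Augmenting-path search: match i, re-routing earlier matches if needed.
        for j in candidates[i]:
            if j in seen:
                continue
            seen.add(j)
            if matched_to[j] == -1 or try_match(matched_to[j], seen):
                matched_to[j] = i
                return True
        return False

    # n: maximum number of pairs in which each element appears only once.
    n = sum(1 for i in range(len(gv1)) if try_match(i, set()))
    return 2 * n / (len(gv1) + len(gv2))
```

For instance, comparing the two-element guarded value of ret in Fig. 4.2 with itself gives 1.0, and comparing it with only its first element gives 2/3.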
4.5.3 Equivalent Values
Relation $\stackrel{V}{=}$ establishes the equivalence on values:
\[
\begin{array}{lll}
n_1 \stackrel{V}{=} n_2 & \text{if} & n_1 = n_2 \\
v_1 \oplus v_2 \stackrel{V}{=} v_3 \oplus' v_4 & \text{if} & v_1 \stackrel{V}{=} v_3 \wedge (\oplus = \oplus') \wedge v_2 \stackrel{V}{=} v_4 \\
\ominus v_1 \stackrel{V}{=} \ominus' v_2 & \text{if} & v_1 \stackrel{V}{=} v_2 \wedge \ominus = \ominus' \\
\ell \stackrel{V}{=} \ell' & \text{if} & \ell \stackrel{L}{=} \ell'
\end{array}
\]
Equivalence of numbers is determined by numerical equality (N). Binary values
are equivalent when both the pairs of values and the operators are equivalent
(Value×Bop×Value). With this definition of $\stackrel{V}{=}$, we may miss
semantically equivalent values due to differences in their syntactic expressions.
For example, x > 0 and 0 < x should be regarded as equivalent, but they are not,
because $x \stackrel{V}{\neq} 0$, $> \,\neq\, <$, and $0 \stackrel{V}{\neq} x$.
To address this problem, we canonicalize the symbolic values. Canonicalization
imposes certain partial orders on both operators and values and then sorts binary
values by these orders. Hence all semantically equivalent binary values have a
unique representation.
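The following Python sketch shows one possible canonicalization of binary values. The concrete choices are assumptions made for illustration, not the orders used by the thesis: binary values are tuples (op, left, right), the operators > and >= are flipped to < and <=, and the operands of commutative operators are sorted by their printed form.

```python
# Hypothetical canonical forms: ">" and ">=" are rewritten to "<" and "<=".
FLIP = {">": "<", ">=": "<="}
COMMUTATIVE = {"+", "*", "==", "!="}

def canonicalize(value):
    """Return a canonical form of a value; leaves (numbers, symbols) are kept."""
    if not isinstance(value, tuple):
        return value
    op, left, right = value
    left, right = canonicalize(left), canonicalize(right)
    if op in FLIP:
        # x > 0 becomes 0 < x: flip the operator and swap the operands.
        op, left, right = FLIP[op], right, left
    elif op in COMMUTATIVE and repr(right) < repr(left):
        # Sort operands of commutative operators by an arbitrary fixed order.
        left, right = right, left
    return (op, left, right)
```

After canonicalization, (">", "x", 0) and ("<", 0, "x") map to the same tuple, so the purely syntactic comparison of operators and operands succeeds, while ("-", "x", "y") and ("-", "y", "x") remain different.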
4.5.4 Equivalent Guards
Relation $\stackrel{G}{=}$ determines equivalent guards:
\[
\begin{array}{lll}
v_1 \sim v_2 \stackrel{G}{=} v_3 \sim' v_4 & \text{if} & v_1 \stackrel{V}{=} v_3 \wedge (\sim = \sim') \wedge v_2 \stackrel{V}{=} v_4 \\
g_1 \stackrel{G}{=} g_2 & \text{if} & \mathit{unify}(\mathit{simplify}(g_1), \mathit{simplify}(g_2)) \\
\mathit{true} \stackrel{G}{=} \mathit{true} & & \\
\mathit{false} \stackrel{G}{=} \mathit{false} & &
\end{array}
\]
Two relational guards v1 ∼ v2 and v3 ∼′ v4 in domain (Value×Rel×Value) are
equivalent when both their value pairs and their relations (e.g., <, =) are
equivalent, respectively. However, a formula can be presented in several different forms.
For example, formulas x > 5 ∧ (x < 10 ∨ x > 0) and x > 5 look different, but
are actually equivalent because x > 5 implies x > 0. To remedy this, we use a
function simplify [37] that simplifies guards, using a decision procedure [41], so
that they do not contain any redundant sub-formulas. Furthermore, we want to
treat x > 5 and z > 5 as equivalent if x $\stackrel{L}{=}$ z holds. This is done by
the unification algorithm unify, which is widely used in type systems [105]. The
algorithm returns true if there exists a substitution that makes two different
structures the same while preserving relations $\stackrel{L}{=}$ and $\stackrel{V}{=}$.
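The two guard equivalences mentioned in this chapter can be sanity-checked with a bounded exhaustive test. The Python sketch below is not simplify and not a decision procedure: it only evaluates both guards on every tuple from a small integer range, and the helper name equivalent_on and the range are assumptions. Because guards are written as functions of positional arguments, variable names play no role, which loosely mirrors the renaming that unify is meant to find.

```python
from itertools import product

def equivalent_on(guard1, guard2, arity, domain=range(-20, 21)):
    """Bounded check: do both guards agree on every tuple drawn from domain?"""
    return all(
        guard1(*args) == guard2(*args)
        for args in product(domain, repeat=arity)
    )

# x > 5 and (x < 10 or x > 0)  versus  x > 5
g1 = lambda x: x > 5 and (x < 10 or x > 0)
g2 = lambda x: x > 5

# The guards of Section 4.4, with b for beta and c for gamma.
h1 = lambda b, c: b <= 5 or c <= 0
h2 = lambda b, c: b <= 5 or (b > 5 and c <= 0)

same_simple = equivalent_on(g1, g2, 1)
same_fig = equivalent_on(h1, h2, 2)
```

Agreement on a finite range is evidence, not proof; a real implementation would rely on a decision procedure as the text describes.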
4.5.5 Best Matching
Function find_best_matching(S) at line 9 in Algorithm 5 finds the best matching
(i.e., the matching that maximizes the sum of similarities), and then returns the
maximum sum of similarities. Consider the following similarity table as an example,
where columns are entries of M1 and rows are entries of M2.

    M2 \ M1            (a_1^1, GV_1^1)  (a_1^2, GV_1^2)  (a_1^3, GV_1^3)  (a_1^4, GV_1^4)
    (a_2^1, GV_2^1)      [0.8] (1)          0.1              0.5              0.6
    (a_2^2, GV_2^2)       0.7             [0.7] (2)          0.6              0.5
    (a_2^3, GV_2^3)       0.6               0.5            [0.6] (3)          0.4

The boxed entries (shown in brackets) represent the best matching, which maximizes
the sum of similarities. Suppose our matching function finds this best matching. The
value of best at line 9 in Algorithm 5 is the sum of similarities of all matched
pairs, 2.1 = 0.8 + 0.7 + 0.6. Hence the similarity of these two memories,
0.6 = 2 · 2.1/(4 + 3), is returned at line 11 in Algorithm 5.
We develop a lightweight greedy algorithm that heuristically finds the best matching
and runs in O(n^2), where n is the number of elements. After calculating the
similarities of all pairs, the pair with the maximum similarity is chosen as a
matched one. Then the algorithm continues to choose another pair with maximum
similarity among the remaining pairs until all addresses in either M1 or M2 are
matched. The order of choices for the above table is annotated next to the boxes
(in parentheses). The algorithm is not guaranteed to find the best matching, but it
has the advantage of a short running time. There is a combinatorial optimization
algorithm called the Hungarian method [94], which is guaranteed to find the best
matching but runs in O(n^3), much slower than ours. At least in our experiments, we
found that our algorithm yields the same results as the Hungarian method. This is
because similarities of pairs are usually near 1 or 0.
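The greedy strategy and the final ratio of Algorithm 5 can be sketched in Python as follows. This is a simplified illustration, not the thesis implementation: the function names, the dictionary encoding of the similarity map, and the use of sorting (which costs an extra logarithmic factor over the pairs) are assumptions. The data reproduces the example table, with keys written as (index in M1, index in M2).

```python
def greedy_best_matching(sim):
    """sim maps (a1, a2) -> similarity. Returns (total, pairs in pick order)."""
    used1, used2, chosen, total = set(), set(), [], 0.0
    # Visit pairs from most to least similar (ties keep insertion order).
    for (a1, a2), s in sorted(sim.items(), key=lambda kv: -kv[1]):
        if a1 in used1 or a2 in used2:
            continue  # one of the two addresses is already matched
        used1.add(a1)
        used2.add(a2)
        chosen.append((a1, a2))
        total += s
    return total, chosen

def sim_m(sim, size1, size2):
    """Final ratio of Algorithm 5 for memories with size1 and size2 entries."""
    if size1 + size2 == 0:
        return 0.0
    best, _ = greedy_best_matching(sim)
    return 2 * best / (size1 + size2)

# Example table: rows are entries of M2, columns are entries of M1.
rows = [
    [0.8, 0.1, 0.5, 0.6],
    [0.7, 0.7, 0.6, 0.5],
    [0.6, 0.5, 0.6, 0.4],
]
table = {(i, j): rows[j][i] for j in range(3) for i in range(4)}
total, chosen = greedy_best_matching(table)
similarity = sim_m(table, 4, 3)
```

On this table the greedy order is the diagonal, the total is 2.1 up to floating-point rounding, and the resulting similarity is 0.6, matching the worked example.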
4.6 Judgement of Clones
We allow parametrization by MinEntry to filter out small clones, such as a procedure
whose body is just one line. Although the similarity function simM(M1,M2)
gives high values to similar memories, this function does not reflect the size of
the memories. So we give a penalty to small memories. Note that the value of the
similarity function ranges over [0, 1].
\[
\mathit{simM}(M_1, M_2)^{\frac{\log \mathit{MinEntry}}{\log(|\mathit{dom}(M_1)| + |\mathit{dom}(M_2)|)}}
\]
The value of the above formula grows with the size of the memories and shrinks as
MinEntry grows. The log function is used to smooth the amount of the penalty.
Here parameter MinEntry is given by users depending on the target program size. The
parameter is similar to parameter minT, which determines the minimum number of
tokens for clone candidates in Deckard [73].
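A small Python sketch of the size penalty follows. It reads the formula as simM raised to the power log MinEntry / log(|dom(M1)| + |dom(M2)|); that reading, the function name, and the handling of memories with fewer than two entries in total (where the logarithm in the denominator is zero or undefined) are assumptions.

```python
import math

def penalized_similarity(sim, size1, size2, min_entry):
    """Apply the size penalty to a raw similarity sim in [0, 1]."""
    total = size1 + size2
    if total < 2:
        return 0.0  # assumed convention: log(total) is not usable here
    exponent = math.log(min_entry) / math.log(total)
    # exponent > 1 (small memories) lowers the value, exponent < 1 raises it.
    return sim ** exponent
```

With MinEntry = 50, a raw similarity of 0.8 is unchanged when the two memories have 50 entries in total, drops for two memories of 2 entries each, and rises for two memories of 200 entries each.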
We evaluate similarities for all possible pairs of abstract memories. There is a
high probability that procedures with high similarity are true clones. Hence we sort
all pairs according to their similarities. We allow another parameter Similarity,
which determines the threshold of similarities of clones to be reported. If Similarity
is set to 80% then pairs with similarity less than 0.8 are not reported.
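Sometimes the similarity of two memories M1 and M2 can never exceed the given
Similarity because the numbers of entries in the two memories differ greatly.
Since each matched pair contributes at most 1 and at most
min(|dom(M1)|, |dom(M2)|) pairs can be matched, simM(M1,M2) is bounded by the
left-hand side below. Hence we can skip the comparison of two memories where
\[
\frac{2 \times \min(|\mathit{dom}(M_1)|, |\mathit{dom}(M_2)|)}{|\mathit{dom}(M_1)| + |\mathit{dom}(M_2)|} \le \mathit{Similarity}.
\]
This strategy significantly reduces the memory comparison time.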
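The skipping condition is cheap to evaluate because it only needs the two memory sizes. A minimal Python sketch, with the function name and the treatment of two empty memories as assumptions:

```python
def can_skip(size1, size2, threshold):
    """True if the size-based upper bound on simM cannot exceed threshold."""
    total = size1 + size2
    if total == 0:
        return True  # two empty memories have similarity 0 by definition
    upper_bound = 2 * min(size1, size2) / total
    return upper_bound <= threshold
```

For example, with a threshold of 0.8, a memory with 10 entries never needs to be compared against one with 100 entries (the bound is 20/110), whereas memories with 50 and 55 entries still have to be compared (the bound is 100/105).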
Users can choose parameters MinEntry and Similarity to set the thresholds that
determine clones. One could set MinEntry high to ignore small clones, and
one could set Similarity high to obtain fewer false positives.
4.7 Experimental Result
In this section, we evaluate our code clone detector MeCC. We apply MeCC to
detect clones in large-scale open source projects, Python, Apache, and PostgreSQL
as shown in Table 4.1.
Table 4.1: Properties of the subject projects.
    Projects      KLOC   Procedures   Application
    Python         435        7,657   interpreter
    Apache         343        9,483   web server
    PostgreSQL     937       10,469   database
We design our experiments to address the following research questions:
• RQ1 (detectability): How many Type-3 and Type-4 clones can be detected
by MeCC?
• RQ2 (accuracy): How accurately (in terms of false positives and false nega-
tives) can MeCC detect clones?
• RQ3 (scalability): How does MeCC scale (in terms of detection time and de-
tectable program size)?
• RQ4 (comparison): How many gapped and semantic clones identified by MeCC
can be detected by previous clone detectors, CCFinder [84], Deckard [73],
and a PDG-based detector [48]?
4.7.1 Detectability
We apply MeCC to the subject projects to evaluate its ability to detect clones. In our
experiments, we set Similarity=80% and MinEntry=50. The clones detected by
MeCC are then manually inspected and categorized into the four clone types discussed
in Section 4.2 by one author who has more than eight years of experience with C/C++
development in industry. The other two authors reviewed and confirmed the
inspected clones.
Table 4.2: The distribution of detected clone types by MeCC.
                  Type-1   Type-2   Type-3   Type-4
    Python             3      128       81       13
    Apache             2       85       70       10
    PostgreSQL         9      120       88       14
The numbers of detected and classified clones are shown in Table 4.2. MeCC
can detect all four types of clones. Type-4 (semantic) and some Type-3 (gapped)
clones in Table 4.2 have noticeable syntactic differences. Nevertheless, MeCC can
detect these clones because it only compares abstract memory states. MeCC also
detects Type-1 (exact) and Type-2 (renamed) clones since syntactic similarity is
usually accompanied by semantic similarity.
Fig. 4.3 shows one Type-4 clone detected by MeCC. This is a typical example
of control replacement. The if-else statement in Fig. 4.3(a) is replaced
by a semantically equivalent statement using the ternary conditional '? :' operator in
Fig. 4.3(b). MeCC detects this clone because the functionalities of the two
procedures are the same and thus their abstract memory states are the same.
A more complex Type-4 clone detected by MeCC is presented in Fig. 4.4. The
clone has two syntactic differences. One difference is statement reordering. The two
statements at lines 5 and 7 in Fig. 4.4(a) are reordered into the statements at
lines 4 and 5 in Fig. 4.4(b). The second difference comes from using intermediate
variables. The local variable sconf is introduced at line 4 in Fig. 4.4(a) and then
used as a parameter of the ap_get_module_config function call at line 5. The
local variable proto is introduced at line 7 in Fig. 4.4(b). The return value of the
apr_pstrdup function call at line 13 in Fig. 4.4(b) is assigned to this variable.
This value is then assigned to a field address at line 15 in Fig. 4.4(b) via the local
variable. These syntactic changes make it difficult for text-based clone detectors to
identify such clones [84].
    PyObject *PyBool_FromLong(long ok)
    {
        PyObject *result;
        if (ok)
            result = Py_True;
        else
            result = Py_False;
        Py_INCREF(result);
        return result;
    }
    (a)

    static PyObject *
    get_pybool(int istrue)
    {
        PyObject *result = istrue ? Py_True : Py_False;
        Py_INCREF(result);
        return result;
    }
    (b)
Figure 4.3: Type-4 clone, control replacement from Python. The if-else statement is replaced by the ternary conditional ? : operator. Syntactic differences are underlined.

Understanding the semantics of procedure calls is one advantage of MeCC. An
interesting Type-4 clone detected by MeCC in Fig. 4.5 highlights this strength.
The major syntactic difference between the two procedures is that the assignment
statement at line 8 in Fig. 4.5(a) is replaced by the call to procedure memcpy at
line 9 in Fig. 4.5(b). Most previous clone detection techniques cannot capture this
semantic similarity between a procedure call and similar assignment statements.
4.7.2 Accuracy
The next question is how accurately MeCC can detect clones. We manually in-
spected the detected clones and identified false positives, which are not real clones,
but are detected as clones by MeCC.
     1 static const char *set_access_name(cmd_parms *cmd, void *dummy,
     2                                    const char *arg)
     3 {
     4     void *sconf = cmd->server->module_config;
     5     core_server_config *conf = ap_get_module_config(sconf, &core_module);
     6
     7     const char *err = ap_check_cmd_context(cmd, NOT_IN_DIR_LOC_FILE | NOT_IN_LIMIT);
     8     if (err != NULL) {
     9         return err;
    10     }
    11     conf->access_name = apr_pstrdup(cmd->pool, arg);
    12     return NULL;
    13 }
    (a)

     1 static const char *set_protocol(cmd_parms *cmd, void *dummy,
     2                                 const char *arg)
     3 {
     4     const char *err = ap_check_cmd_context(cmd, NOT_IN_DIR_LOC_FILE | NOT_IN_LIMIT);
     5     core_server_config *conf = ap_get_module_config(cmd->server->module_config, &core_module);
     6
     7     char *proto;
     8
     9     if (err != NULL) {
    10         return err;
    11     }
    12
    13     proto = apr_pstrdup(cmd->pool, arg);
    14     ap_str_tolower(proto);
    15     conf->protocol = proto;
    16
    17     return NULL;
    18 }
    (b)
Figure 4.4: Type-4 clone, statement reordering from Apache.
Table 4.3 presents false positive clones and the ratio of false positive clones to
the total number of clones found from three projects (when Similarity=80% and
MinEntry=50). In Python, the total number of found clones is 264, the number
of false positive clones is 39, and hence the false positive ratio is around 14.7%.
     1 void
     2 appendPQExpBufferChar(PQExpBuffer str, char ch)
     3 {
     4     /* Make more room if needed */
     5     if (!enlargePQExpBuffer(str, 1))
     6         return;
     7     /* OK, append the data */
     8     str->data[str->len] = ch;
     9     str->len++;
    10     str->data[str->len] = '\0';
    11 }
    (a)

     1 void
     2 appendBinaryPQExpBuffer(PQExpBuffer str, const char *data,
     3                         size_t datalen)
     4 {
     5     /* Make more room if needed */
     6     if (!enlargePQExpBuffer(str, datalen))
     7         return;
     8     /* OK, append the data */
     9     memcpy(str->data + str->len, data, datalen);
    10     str->len += datalen;
    11     str->data[str->len] = '\0';
    12 }
    (b)
Figure 4.5: Type-4 clone, preserving memory behavior from PostgreSQL.
Similarly, the false positive ratio for Apache is 12.5%, and for PostgreSQL it is
around 16.9%.
Table 4.3: Detected clones and false positives. Total: total number of detected clones, FP: number of false positive clones, and FP ratio: false positive ratio.
                  Total    FP   FP ratio
    Python          264    39      14.7%
    Apache          191    24      12.5%
    PostgreSQL      278    47      16.9%

The most common case of false positive clones is data structure initialization.
In those clones, a structure is allocated and then its field variables are initialized
according to the structure type. Some of them can be viewed as clones, but we
conservatively mark these initialization code pairs as false positives.
These false positive ratios appear to be slightly higher than those of previous
approaches [48, 73, 84]. However, one could set Similarity higher to reduce the
false positive ratio. As an example, the false positive ratio is only 3% for Python
when we set Similarity=90%.
In the next step, we measure the ratio of false negative clones — real clones,
but missed by MeCC. For this experiment, since we need an oracle clone set, we
use the benchmark provided by Roy et al. [121]. This benchmark includes three
Type-1, four Type-2, five Type-3, and four Type-4 clones. We apply MeCC on the
benchmark with Similarity=80%. Since the sizes of procedures in the benchmark
are small, we set MinEntry=2.
Table 4.4: False negatives on the benchmark set [121]. * MeCC misses only one clone.
                 Type-1   Type-2   Type-3*   Type-4
    Benchmark         3        4         5        4
    MeCC              3        4         4        4
Table 4.4 shows that MeCC has almost no false negatives. MeCC misses only
one Type-3 clone, which has an insertion of an if statement related to a proce-
dure call, and that changes the memory state. However, if we set Similarity=79%,
MeCC detects this clone.
Overall, our experimental results show that MeCC can detect clones accurately,
with almost no false negatives and with a reasonable false positive ratio.
4.7.3 Scalability
In this section, we measure scalability of MeCC. We have already shown that MeCC
can detect clones in large-scale open source projects accurately (Section 4.7.1 and
Section 4.7.2).
We measure the time spent to detect the clones for three subjects. Our exper-
iments were conducted on an Ubuntu 64-bit machine with a 2.4 GHz Intel Core 2
Quad CPU and 8 GB RAM.
Table 4.5: Time spent for the detection process.
                  KLOC   Analysis   Comparison
    Python         435     63m32s        1m54s
    Apache         343    308m58s        1m36s
    PostgreSQL     937    422m04s        6m28s
Table 4.5 shows the results. Static analysis took about 63 minutes for Python
and 422 minutes for PostgreSQL.
Since our static analysis includes preprocessing, summarization/instantiation of
procedural summaries, and fixpoint iterations for collecting memory states, it is
computationally expensive.
However, this is usually a one-time cost. When software changes, we can incre-
mentally recompute memory states of the changed parts including impacted parts
according to the call relationship. If the changed part in a procedure does not cause
observable changes to memory behaviors of the procedure, then callers of the pro-
cedure do not need to be re-analyzed. Though the dependency can, in the worst
case, expand to all procedures, such a situation (a procedure’s change in memory
effects, combined with that procedure as a hub in the call-graph) would not be
that common.
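The incremental scheme described above can be sketched as a worklist over the call graph. The Python code below is a hypothetical illustration, not the analyzer's implementation: the function reanalyze, the callers map, and the summarize callback are assumed names, and the sketch assumes that repeated summarization converges (for example, because summaries only grow in a finite domain).

```python
from collections import deque

def reanalyze(changed, callers, summarize, summaries):
    """Re-summarize changed procedures; visit callers only if a summary changes.

    changed:   procedures whose source was edited
    callers:   dict mapping a procedure to the procedures that call it
    summarize: function (procedure, summaries) -> new summary
    summaries: dict of current summaries, updated in place
    Returns the procedures that were re-analyzed, in order.
    """
    work = deque(changed)
    queued = set(changed)
    visited = []
    while work:
        proc = work.popleft()
        queued.discard(proc)
        visited.append(proc)
        new_summary = summarize(proc, summaries)
        if new_summary == summaries.get(proc):
            continue  # no observable change in memory behavior: stop here
        summaries[proc] = new_summary
        for caller in callers.get(proc, ()):
            if caller not in queued:
                queued.add(caller)
                work.append(caller)
    return visited
```

If an edit leaves a procedure's summary unchanged, only that procedure is re-analyzed; if the summary changes, the re-analysis spreads upward through its callers, which is the worst case discussed above when the procedure is a hub in the call graph.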
4.7.4 Comparison
Section 4.7.1 shows that MeCC can detect all four types of clones including Type-
3 (gapped) and Type-4 (semantic) clones. In this section, we discuss if other clone
detectors can also identify these clones.
Table 4.6: The numbers of detected Type-3 and Type-4 clones by MeCC, Deckard, CCFinder, and a PDG-based detector [48].
                          Python   Apache   PostgreSQL
    Type-3   MeCC             81       70           88
             Deckard          21       12           25
             CCFinder          0        0            0
             PDG-based        10        8           11
    Type-4   MeCC             13       10           14
             Deckard           0        0            2
             CCFinder          0        0            0
             PDG-based         1        0            1
For comparison, we use two publicly available syntactic clone detectors, Deckard
(an AST-based detector) and CCFinder (a token-based detector). We also use a
result set from a PDG-based semantic clone detector [48].
For Deckard, we set the options as in [73]: minT=30 (minimum token size),
stride=2 (size of the sliding window), and Similarity=0.9. For CCFinder, we
use the default options: Minimum Clone Length=30, Minimum TKS=12 (token
set size), and Shaper Level=Soft shaper. For the PDG-based detector [48], we
directly use the clone detection results provided by the authors of the detector,
since the tool was not publicly available at the time of this writing.
Table 4.6 compares the ability of Deckard, CCFinder, and the PDG-based
detector to detect Type-3 and Type-4 clones. We assume these detectors can detect
all Type-1 and Type-2 clones, since these clones are syntactically almost the same.
CCFinder is a scalable and fast tool which detects Type-1 and Type-2 clones
accurately. However, CCFinder could not identify any Type-3 and Type-4 clones.
The main reason is that CCFinder extracts and compares syntactic tokens, but
Type-3 and Type-4 clones usually differ significantly at the token level.
Deckard detects about 25% of Type-3 clones. Since Deckard uses the
characteristic vectors of ASTs, it can detect clones with small syntactic variations.
Surprisingly, Deckard identifies two Type-4 clones in PostgreSQL. Both detected
Type-4 clones are of the statement reordering subtype (as in Fig. 4.4). Deckard can
detect them because the characteristic vector of an AST only captures the number
of elements in the AST, so reordering statements leaves the vector unchanged.
However, Deckard still misses a large portion of Type-3 and Type-4 clones.
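The effect of statement reordering on a count-based vector can be illustrated with Python's standard ast module. This is a simplified stand-in and not Deckard itself: Deckard works on C subtrees and clusters the resulting vectors, whereas the sketch below simply counts node types over a whole Python snippet. The function name and the sample snippets are assumptions.

```python
import ast
from collections import Counter

def characteristic_vector(source):
    """Count how often each AST node type occurs in the given source text."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))

original = "err = check(cmd)\nconf = get_config(cmd)\n"
reordered = "conf = get_config(cmd)\nerr = check(cmd)\n"
replaced = "err = check(cmd)\nconf = cmd.server.config\n"

same_after_reordering = characteristic_vector(original) == characteristic_vector(reordered)
same_after_replacement = characteristic_vector(original) == characteristic_vector(replaced)
```

Reordering the two statements leaves the counts untouched, while replacing a call by a field access changes them, which is consistent with count-based vectors catching reordering but missing other semantic clones.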
The PDG-based detector identifies about 12% of Type-3 clones. Only one Type-4
clone is identified in each of Python and PostgreSQL. The detected Type-4 clones
are of the statement reordering subtype. Since PDGs capture program semantics
using data dependencies and control flow, the PDG-based detector can detect some
Type-4 clones, such as statement-reordered clones.
However, these PDG-based approaches [48,89,99] have some limitations. First,
inter-procedural semantics via procedure calls cannot be supported, which means
semantic clones that differ in respect of procedure calls (e.g., function inlining) are
missed. MeCC captures memory behavior of procedure calls by procedural sum-
maries as described in Section 4.3. Second, PDGs cannot be completely free from
changes of syntactic structures, while our technique reliably determines the seman-
tic similarity of code fragments because we use purely semantic information (path-
sensitive abstract memory effects) of programs without using any syntactic struc-
tures from programs (e.g., control-flow graph, abstract syntax tree, characteristic
vectors, or PDG).
To reveal the limitations of PDG-based approaches, we use the semantic
clone in Fig. 4.5. We draw the PDGs of the two procedures in Fig. 4.6 as described
in the PDG-based approach [48]. The PDGs are significantly different for the
following reasons. First, replacing the assignment statement at line 8 in Fig. 4.5(a)
by the procedure call at line 9 in Fig. 4.5(b) affects the PDG, because it
introduces an additional call-site node memcpy, which consequently introduces several
child nodes. Furthermore, the effect of the procedure call memcpy is not reflected in
the PDG. Ignoring this effect removes the data dependency edge (from node 1 to node 2
in Figure 4.6(a)), which turns into a data dependency edge directly from the formal
parameter str (from node 3 to node 8 in Figure 4.6(b)). Second, adding a formal
parameter datalen introduces new dependency flows, which affect the PDG. From
this observation, the PDG-based approach can miss clones made by procedure call
additions or new parameter introductions, since these differences directly affect the
PDG structures.
The PDG-based approach can thus be viewed as a comparison on low-level
structures. Some clones involving statement reordering may be identified by
PDG-based approaches, since their syntactic differences disappear in
the low-level structures. However, other semantic clones, such as those involving
procedure call addition or new parameter introduction, may not be detected, since
these differences directly affect the program dependencies and thus the PDG.
Overall, the comparison results in this section suggest that MeCC, an abstract
memory-based clone detector, is effective in detecting all four types of clones
(including Type-3 and Type-4).
4.7.5 Limitations
Since our current implementation compares abstract memory states at the exit
points of procedures, MeCC detects only procedure-level clones. However, it is
possible to extend MeCC to find clones of finer granularity, such as basic blocks,
by adapting a code fragment generation technique [74] to prepare clone candidates
of finer granularity. We can then compute an abstract memory state for
each candidate and compare the states to identify clones.
Collecting abstract memory states from programs is a computationally expen-
sive task in both time and memory. Analyzing the semantics of programs takes
longer than syntactic comparison. However, the current implementation of MeCC
showed that MeCC scales to detect clones in PostgreSQL, which is around 1M
LOC.
[Figure 4.6 diagram: (a) PDG of appendPQExpBufferChar and (b) PDG of appendBinaryPQExpBuffer. Legend: control point node, statement node, control dependency, data dependency.]
Figure 4.6: Two PDGs of semantic clones in Fig. 4.5. The graphs look significantly different even though the two clones are semantically similar. Grey-colored nodes are newly introduced due to changes between the two procedures.
Similar abstract memory states do not always imply similar concrete behaviors,
which may cause false positives. In the abstract interpretation framework [27–29],
one element in an abstract domain can represent several concrete elements.
Procedural summaries record memory-related behaviors [83], but do not capture all
concrete procedure behaviors. This limitation is inevitable since determining
semantic equivalence between two programs is generally undecidable [47].
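The point that one abstract element can stand for several concrete behaviors can be illustrated with a small sketch. It uses a simple interval abstraction as a stand-in for the much richer abstract memory states of MeCC; the type `Interval`, the two procedures, and the hand-written abstract results are all hypothetical and are not computed by any analyzer.

```c
#include <stdbool.h>

/* Hypothetical interval abstraction: every concrete value v with
 * lo <= v <= hi is represented by the same abstract element. */
typedef struct {
    int lo;
    int hi;
} Interval;

/* Concrete procedure 1: returns 1 if x is odd, 0 otherwise. */
int parity(int x)
{
    return (x % 2 != 0) ? 1 : 0;
}

/* Concrete procedure 2: returns 1 if x is positive, 0 otherwise. */
int is_positive(int x)
{
    return (x > 0) ? 1 : 0;
}

/* Hand-written abstract return values (an assumption of this sketch):
 * an interval analysis that ignores path conditions could only say that
 * each procedure returns some value in [0, 1]. */
Interval abstract_parity(void)
{
    Interval r = {0, 1};
    return r;
}

Interval abstract_is_positive(void)
{
    Interval r = {0, 1};
    return r;
}

/* Abstract comparison: equal bounds mean "similar" in this toy domain. */
bool same_abstract(Interval p, Interval q)
{
    return p.lo == q.lo && p.hi == q.hi;
}
```

In this toy domain the two procedures are indistinguishable, although `parity(2)` and `is_positive(2)` differ concretely. A comparison based only on such abstract values would report the pair as a clone, which is the kind of false positive described above.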
We identify the following threats to the validity of our work:
• Projects are open source and may not be representative. The three
projects used in this paper are all open source and not representative of all
software systems, and hence we cannot currently generalize the results of our
study across all projects. However, these projects are chosen because they are
commonly used in other code clone related research.
• Manually inspected and classified clones. One author manually inspected
and classified clones, and they are used to evaluate MeCC. Since there is no
consensus about Type-3 and Type-4 clones, there is ambiguity in the classi-
fied clones. However, two other authors confirmed the classified clones, and
we made these data publicly available.
• Default options are used. Deckard, CCFinder, and the PDG-based de-
tector have various options to tune their clone detectability. In this paper,
we use their default options. However, careful option tuning may allow these
tools to detect more Type-3 or Type-4 clones.
4.8 Detecting Inconsistent Changes
Inconsistent changes occur when developers modify only some of the code snippets
in clones [23, 42]. Sometimes, these inconsistent changes are intentional. However,
in most cases, they are unintentional, i.e. developers are not aware of other cloned
code snippets or forget to modify other snippets.
These unintended inconsistent changes may reveal software defects. For exam-
ple, the developer may fix defects in some code snippets in the clones. In that case,
the unchanged clones (due to inconsistent changes) still have the defects that are
yet to be fixed. It is important to detect and/or monitor inconsistent changes.
Usually, the results of inconsistent changes fall into one of the following three categories:
1. Slight semantic and slight syntactic changes (Category-1)
2. Slight semantic and significant syntactic changes (Category-2)
3. Significant semantic and significant syntactic changes (Category-3)
Category-1 inconsistent changes usually generate Type-2 clones, while Category-2
changes create Type-3 and Type-4 clones. Since they are still clones, clone
detectors can often be used to identify these inconsistent changes. As shown in
Table 4.6, existing syntactic clone detectors usually miss Type-3 and Type-4
clones, so they also miss Category-2 inconsistent changes. As discussed in
Section 4.7, there are many Type-3 and Type-4 clones (Table 4.2). Therefore,
Category-2 inconsistent changes are numerous and not negligible.
To identify Category-2 inconsistent changes, we introduce an algorithm leverag-
ing MeCC’s semantic clone detectability. Section 4.8.1 describes the algorithm. We
evaluate the algorithm using three large-scale open source projects in Section 4.8.2.
Note that Category-3 inconsistent changes are hard for clone detectors to detect
without lowering the clone similarity threshold, which yields an unrealistically
large number of false positives. Our algorithm is not able to detect them either.
4.8.1 Approach
Our main focus is on detecting Category-1 and Category-2 inconsistent changes.
Fig. 4.7 shows the overall method. We first identify semantically similar clones
using MeCC. Then we run a syntactic clone detector, Deckard, to filter out
syntactically similar clones. The remaining clones (gray-colored region in
Fig. 4.7) are likely inconsistent changes.
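The filtering step amounts to a set difference over clone pairs. The following C sketch illustrates it under stated assumptions: the `ClonePair` type, the integer procedure identifiers, and the function `filter_syntactic` are hypothetical names and are not part of MeCC or Deckard, and each detector's output is assumed to be available as an array of procedure-id pairs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical representation of a clone pair: two procedure ids.
 * Neither MeCC nor Deckard is assumed to expose this type. */
typedef struct {
    int a;
    int b;
} ClonePair;

/* Two pairs denote the same clone regardless of the order of the ids. */
static bool same_pair(ClonePair x, ClonePair y)
{
    return (x.a == y.a && x.b == y.b) || (x.a == y.b && x.b == y.a);
}

/* Returns true if p occurs among the n pairs in set. */
static bool contains(const ClonePair *set, size_t n, ClonePair p)
{
    for (size_t i = 0; i < n; i++) {
        if (same_pair(set[i], p))
            return true;
    }
    return false;
}

/* Copies into out every semantic clone pair that the syntactic detector
 * did not report, and returns how many pairs were kept. out must have
 * room for n_sem pairs. The kept pairs are only candidates for
 * inconsistent changes and still need manual inspection. */
size_t filter_syntactic(const ClonePair *semantic, size_t n_sem,
                        const ClonePair *syntactic, size_t n_syn,
                        ClonePair *out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n_sem; i++) {
        if (!contains(syntactic, n_syn, semantic[i]))
            out[kept++] = semantic[i];
    }
    return kept;
}
```

The linear scan in `contains` makes the sketch quadratic in the number of pairs, which is acceptable for a few hundred clones. A hash set keyed on the ordered pair would be the natural replacement for larger inputs.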
4.8.2 Results
We evaluate the method by using Deckard (as a syntactic code clone detector)
and MeCC (as a semantic code clone detector). We manually inspected all remain-
ing clones identified by MeCC and filtered by Deckard to check if they were
caused by inconsistencies, and if these inconsistencies lead to potential problems.
When we identified problems caused by inconsistencies, we classified them into two
categories, exploitable bugs and code smells. A bug is exploitable if it causes
unexpected behaviors, for example when a particular value is used as procedure
input, as shown in Fig. 4.8 (a). Conversely, a code smell occurs when an
inconsistency has no demonstrated unexpected behaviors, but refactorings or
consistent changes (with other clone pairs) are highly recommended.

Figure 4.7: The overview of the inconsistent change detection approach: First,
MeCC detects a set of semantic clones. Then, we detect syntactically similar clones
in the set and filter them out. The remaining clones (gray-colored region) are
likely inconsistent changes. (Diagram regions: code clones, semantic clones,
syntactic clones.)
Table 4.7: Exploitable bugs and code smells in all clones found by MeCC and
filtered by Deckard.

              # Clones                         Exploitable    Code
  Project     Original  Filtered  Remaining    Bugs (%)       Smells (%)
  Python      264       54        210          21 (10.0%)     21 (10.0%)
  Apache      191       52        139           6 (4.3%)      20 (14.4%)
  PostgreSQL  278       104       174          16 (9.2%)      17 (9.8%)
  Total       733       210       523          43 (8.2%)      58 (11.1%)
Table 4.7 shows the manual inspection results (more detailed data is available at
http://ropas.snu.ac.kr/mecc). Among 523 remaining clones, 43 exploitable bugs and
58 code smells were found. About 19% of the remaining clones are either exploitable
bugs or code smells. These bugs and code smells are definitely missed by Deckard
and would be missed by other previous approaches (e.g. [48, 84]), since most of
these inconsistent clones are semantic clones that are not detected by previous
approaches, as discussed in Section 4.7.4.
One of the exploitable bugs was found in the Python project (Fig. 4.8). The two
procedures in this clone differ noticeably in two places. First, a second argument
is introduced but not used in Fig. 4.8 (b), which is a syntactic change with no
semantic effect. Second, a call to endspent() appears at line 13 in Fig. 4.8 (b),
but the corresponding call to endpwent() does not appear before the return at
line 13 in Fig. 4.8 (a). The endspent() procedure closes the shadow password
database file, and the endpwent() procedure closes the user database. Hence the
absence of the endpwent() call on this path may cause a resource leak.
We inspected the revision history of the Python project and observed that the
procedure in Fig. 4.8 (a) causes a real resource leak. The procedure pwd_getpwall
in Fig. 4.8 (a) was created in revision #20157 with a resource leak bug. The bug
was then fixed in revision #73017, with the comment “Issue #4873 - Fix resource
leak in error case of pwd”, by adding an endpwent() call before the return
statement (line 13). The spwd_getspall procedure in Fig. 4.8 (b) was introduced in
revision #38359. Had these two procedures been identified as a clone at that time,
their inconsistency would have been visible, and the resource leak could have been
fixed much earlier than revision #73017. Finding inconsistent clones is therefore
useful for finding and fixing bugs.
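The pattern behind this kind of fix, closing the user database on every exit path, can be sketched in isolation with the POSIX `setpwent()`, `getpwent()` and `endpwent()` calls. This is not the actual Python change, which is not reproduced in this document. The helper `count_users` and its `limit` parameter are hypothetical and exist only to create an early-return error path.

```c
#define _XOPEN_SOURCE 700  /* request the XSI declarations of setpwent() etc. */
#include <pwd.h>
#include <stddef.h>

/* Hypothetical helper: counts the entries of the user database and fails
 * with -1 if more than limit entries are seen. The early return plays the
 * role of the error path at line 13 of Fig. 4.8 (a). */
long count_users(long limit)
{
    long n = 0;
    struct passwd *p;

    setpwent();                  /* open (or rewind) the user database */
    while ((p = getpwent()) != NULL) {
        if (n >= limit) {
            endpwent();          /* close the database on the error path too */
            return -1;
        }
        n++;
    }
    endpwent();                  /* close the database on the normal path */
    return n;
}
```

Dropping the first `endpwent()` call reproduces the shape of the leak: the normal path still closes the database, so the defect only shows when the early return is taken.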
Overall, Table 4.7 suggests that clones identified by MeCC and filtered by Deckard
are useful for detecting inconsistencies, exploitable bugs, and code smells.
4.9 Other Applications
MeCC can be used for plagiarism detection and common bug pattern identifica-
tion. Syntactic plagiarism detection tools (e.g. Moss [126] and JPlag [118]) cannot
detect plagiarism if code is copied and intentionally changed with some syntactic
obfuscations. MeCC is able to detect plagiarism as long as the semantics of the
copied code remains similar regardless of its syntactic changes. Similarly, MeCC
can help identify common bug patterns. Kim et al. proposed BugMem [88], which
identifies common bug fix patterns and locates similar bugs in other code. How-
ever, they only capture syntactic bug patterns using tokens of code. MeCC can
improve their work by identifying common semantic bug patterns.
 1 static PyObject *pwd_getpwall(PyObject *self)
 2 {
 3     PyObject *d;
 4     struct passwd *p;
 5     if ((d = PyList_New(0)) == NULL)
 6         return NULL;
 7     setpwent();
 8     while ((p = getpwent()) != NULL) {
 9         PyObject *v = mkpwent(p);
10         if (v == NULL || PyList_Append(d, v) != 0) {
11             Py_XDECREF(v);
12             Py_DECREF(d);
13             return NULL;
14         }
15         Py_DECREF(v);
16     }
17     endpwent();
18     return d;
19 }
(a)

 1 static PyObject *spwd_getspall(PyObject *self, PyObject *args)
 2 {
 3     PyObject *d;
 4     struct spwd *p;
 5     if ((d = PyList_New(0)) == NULL)
 6         return NULL;
 7     setspent();
 8     while ((p = getspent()) != NULL) {
 9         PyObject *v = mkspent(p);
10         if (v == NULL || PyList_Append(d, v) != 0) {
11             Py_XDECREF(v);
12             Py_DECREF(d);
13             endspent();
14             return NULL;
15         }
16         Py_DECREF(v);
17     }
18     endspent();
19     return d;
20 }
(b)

Figure 4.8: A Type-4 clone as an inconsistent clone. The procedure pwd_getpwall
in (a) causes a resource leak due to the absence of a proper endpwent() call
before line 13.
Chapter 5
Conclusions
We showed that the proposed parametrized procedural summary-based static analyzer
is practical and effective for inferring memory behaviors. We successfully applied
the proposed analysis to memory leak detection and semantic code clone detection.
The proposed static analysis is compositional; it analyzes each procedure’s mem-
ory behavior separately and produces a summary of it. The summary is parameter-
ized by the memory state at its call site so that it can be instantiated at different
call sites. Our procedural summaries enable us to find an effective trade-off point
without global analysis. We report engineering choices in the analysis, some of
which are unsound yet increase the analysis accuracy without much increase in cost.
We proposed a practical memory leak detection technique for C programs. In
comparison with other published memory leak detectors [21,63,112,136], our method
detects consistently more bugs for the same published benchmark software, while
the analysis speed is next to the fastest and the false-positive ratio is next to the
smallest.
We proposed an abstract memory-based code clone detection technique, pre-
sented its implementation, MeCC, and discussed its applications. Since MeCC com-
pares abstract semantics (as embodied in abstract memory states), its clone de-
tection ability is independent of syntactic similarity. Our empirical study shows
that MeCC can accurately detect all four types of code clones in large-scale open
source projects such as Python, Apache, and PostgreSQL. We also show that most
Type-4 and some Type-3 clones identified by MeCC cannot be detected by previous
approaches [48, 73, 84]. We showed that MeCC allows developers to find
inconsistencies (Section 4.8), identify refactoring candidates, and understand
software evolution related to semantic clones, which would be neglected by
previous approaches.
For future work on clone detection, we still see room for improvement. Since MeCC
uses static analysis, it requires some time to analyze the entire source code prior
to the clone detection process. Even though this is usually a one-time cost, we
plan to devise a lightweight static analysis technique optimized for clone
detection. Our static analyzer currently collects memory states only at the
procedure level, and thus MeCC can detect only procedure-level clones. To detect
clones of finer granularity, we plan to adapt our static analyzer to collect memory
states for each basic block. Overall, we expect that future clone detection
approaches will exploit deeper semantics of code via static analysis, program
logic, and/or other program verification technologies. MeCC is one step forward in
this direction.
Throughout this dissertation, we have shown the practicality of the proposed
static analysis for understanding programs' memory behavior by detecting memory
leaks and semantic code clones in real-world C programs.
Bibliography
[1] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: principles,
techniques, and tools. Addison-Wesley Longman Publishing Co., Inc., Boston,
MA, USA, 1986.
[2] Rajeev Alur, Pavol Cerný, P. Madhusudan, and Wonhong Nam. Synthesis of
interface specifications for java classes. In POPL, pages 98–109. ACM, 2005.
[3] Rajeev Alur, P. Madhusudan, and Wonhong Nam. Symbolic compositional
verification by learning assumptions. In CAV, volume 3576 of LNCS, pages
548–562. Springer, 2005.
[4] Dana Angluin. Learning regular sets from queries and counterexamples. In-
formation and Computation, 75(2):87–106, 1987.
[5] Ittai Balaban, Amir Pnueli, and Lenore Zuck. Shape analysis by predicate
abstraction. In VMCAI, volume 3385 of LNCS. Springer, January 2005.
[6] G. Balakrishnan and T. Reps. Analyzing memory accesses in x86 executables.
In Proc. Compiler Construction (LNCS 2985), pages 5–23. Springer Verlag,
April 2004.
[7] Thomas Ball, Byron Cook, Satyaki Das, and Sriram K. Rajamani. Refining
approximations in software predicate abstraction. In TACAS, volume 2988
of LNCS, pages 388–403. Springer, 2004.
[8] Thomas Ball, Andreas Podelski, and Sriram K. Rajamani. Boolean and
cartesian abstraction for model checking c programs. In TACAS 2001: Pro-
ceedings of the 7th International Conference on Tools and Algorithms for the
Construction and Analysis of Systems, pages 268–283, London, UK, 2001.
Springer-Verlag.
[9] Stefan Bellon, Rainer Koschke, Giulio Antoniol, Jens Krinke, and Ettore
Merlo. Comparison and evaluation of clone detection tools. IEEE Trans-
actions on Software Engineering, 33(9):577–591, 2007.
[10] Yves Bertot and Pierre Castéran. Interactive Theorem Proving and Program
Development. Coq’Art: The Calculus of Inductive Constructions. Texts in
Theoretical Computer Science. Springer Verlag, 2004.
[11] Dirk Beyer, Thomas A. Henzinger, Rupak Majumdar, and Andrey Ry-
balchenko. Invariant synthesis for combined theories. In VMCAI, pages 378–
394, 2007.
[12] Dirk Beyer, Damien Zufferey, and Rupak Majumdar. CSIsat: Interpolation
for LA+EUF. In CAV, pages 304–308, 2008.
[13] Bruno Blanchet, Patrick Cousot, Radhia Cousot, Jérôme Feret, Laurent
Mauborgne, Antoine Miné, David Monniaux, and Xavier Rival. A static analyzer
for large safety-critical software. In Proceedings of the SIGPLAN Conference
on Programming Language Design and Implementation, pages 196–207, June 2003.
[14] Bruno Blanchet, Patrick Cousot, Radhia Cousot, Jérome Feret, Laurent
Mauborgne, Antoine Miné, David Monniaux, and Xavier Rival. A static an-
alyzer for large safety-critical software. In Proceedings of the SIGPLAN Con-
ference on Programming Language Design and Implementation, pages 196–
207, 2003.
[15] Rastislav Bodik, Rajiv Gupta, and Vivek Sarkar. ABCD: eliminating array
bounds checks on demand. In SIGPLAN Conference on Programming Lan-
guage Design and Implementation, pages 321–333, 2000.
[16] Nader H. Bshouty. Exact learning boolean functions via the monotone the-
ory. Information and Computation, 123:146–153, 1995.
[17] Cristiano Calcagno, Dino Distefano, Peter O’Hearn, and Hongseok Yang.
Compositional shape analysis by means of bi-abduction. SIGPLAN Not.,
44:289–300, January 2009.
[18] Géraud Canet, Pascal Cuoq, and Benjamin Monate. A value analysis for C
programs. In Source Code Analysis and Manipulation, pages 123–124. IEEE,
2009.
[19] Yu-Fang Chen, Azadeh Farzan, Edmund M. Clarke, Yih-Kuen Tsay, and
Bow-Yaw Wang. Learning minimal separating DFA’s for compositional veri-
fication. In TACAS, volume 5505 of LNCS, pages 31–45. Springer, 2009.
[20] Ben-Chung Cheng and Wen-Mei W. Hwu. Modular interprocedural pointer
analysis using access paths: design, implementation, and evaluation. In Pro-
ceedings of the ACM SIGPLAN 2000 conference on Programming language
design and implementation, PLDI ’00, pages 57–69, New York, NY, USA,
2000. ACM.
[21] Sigmund Cherem, Lonnie Princehouse, and Radu Rugina. Practical memory
leak detection using guarded value-flow analysis. SIGPLAN Not., 42(6):480–
491, 2007.
[22] Sigmund Cherem and Radu Rugina. A practical escape and effect analysis for
building lightweight method summaries. In CC 2007: 16th International
Conference on Compiler Construction, pages 172–186, 2007.
[23] Andy Chou, Junfeng Yang, Benjamin Chelf, Seth Hallem, and Dawson En-
gler. An empirical study of operating system errors. In Symposium on Op-
erating Systems Principles, pages 73–88, 2001.
[24] Edmund M. Clarke, Orna Grumberg, Somesh Jha, Yuan Lu, and Helmut
Veith. Counterexample-guided abstraction refinement. In CAV, volume 1855
of LNCS, pages 154–169. Springer, 2000.
[25] Jamieson M. Cobleigh, Dimitra Giannakopoulou, and Corina S. Păsăreanu.
Learning assumptions for compositional verification. In TACAS, volume 2619
of LNCS, pages 331–346. Springer, 2003.
[26] P. Cousot and R. Cousot. Static determination of dynamic properties of pro-
grams. In Proceedings of the Second International Symposium on Program-
ming, pages 106–130. Dunod, Paris, France, 1976.
[27] Patrick Cousot and Radhia Cousot. Abstract interpretation: a unified lattice
model for static analysis of programs by construction or approximation of
fixpoints. In Proceedings of ACM Symposium on Principles of Programming
Languages, pages 238–252, 1977.
[28] Patrick Cousot and Radhia Cousot. Systematic design of program analysis
frameworks. In Proceedings of ACM Symposium on Principles of Program-
ming Languages, pages 269–282, 1979.
[29] Patrick Cousot and Radhia Cousot. Abstract interpretation frameworks.
Journal of Logic and Computation, 2:511–547, 1992.
[30] Patrick Cousot and Radhia Cousot. Comparing the galois connection and
widening/narrowing approaches to abstract interpretation. In PLILP ’92:
Proceedings of the 4th International Symposium on Programming Language
Implementation and Logic Programming, pages 269–295. Springer-Verlag,
1992.
[31] Patrick Cousot, Radhia Cousot, Jérôme Feret, Laurent Mauborgne, Antoine
Miné, David Monniaux, and Xavier Rival. The astrée analyzer. In M. Sa-
giv, editor, European Symposium on Programming (ESOP’05), volume 3444
of Lecture Notes in Computer Science, pages 21–30. Springer-Verlag, 2005.
[32] Patrick Cousot and Nicolas Halbwachs. Automatic discovery of linear re-
straints among variables of a program. In POPL, pages 84–96. ACM, 1978.
[33] Radhia Cousot, editor. Static Analysis, 10th International Symposium, SAS
2003, San Diego, CA, USA, June 11-13, 2003, Proceedings, volume 2694 of
Lecture Notes in Computer Science. Springer, 2003.
[34] William Craig. Linear reasoning. a new form of the herbrand-gentzen theo-
rem. J. Symb. Log., 22(3):250–268, 1957.
[35] M. Das, S. Lerner, and M. Seigle. Path-sensitive program verification in poly-
nomial time, 2002.
[36] I. Dillig, T. Dillig, and A. Aiken. Sound, complete and scalable path-sensitive
analysis. In PLDI, pages 270–280, 2008.
[37] Isil Dillig, Thomas Dillig, and Alex Aiken. Small formulas for large programs:
On-line constraint simplification in scalable static analysis. In SAS, Lecture
Notes in Computer Science, 2010.
[38] Nurit Dor, Michael Rodeh, and Mooly Sagiv. Cssv: towards a realistic tool
for statically detecting all buffer overflows in c. In PLDI ’03: Proceedings of
the ACM SIGPLAN 2003 conference on Programming language design and
implementation, pages 155–167. ACM Press, 2003.
[39] Vijay D’Silva, Daniel Kroening, Mitra Purandare, and Georg Weissenbacher.
Interpolant strength. In VMCAI, pages 129–145, 2010.
[40] Ekwa Duala-Ekoko and Martin P. Robillard. Tracking code clones in evolving
software. In ICSE, pages 158–167, 2007.
[41] Bruno Dutertre and Leonardo De Moura. A fast linear-arithmetic solver for
dpll(T). In CAV, pages 81–94. Springer, 2006.
[42] Dawson R. Engler, David Y. Chen, and Andy Chou. Bugs as Inconsistent
Behavior: A General Approach to Inferring Errors in Systems Code. In Sym-
posium on Operating Systems Principles, pages 57–72, 2001.
[43] Javier Esparza, Stefan Kiefer, and Stefan Schwoon. Abstraction refinement
with craig interpolation and symbolic pushdown systems. In TACAS, pages
489–503, 2006.
[44] David Evans. Static detection of dynamic memory errors. In SIGPLAN Con-
ference on Programming Language Design and Implementation (PLDI ’96),
1996.
[45] Jean-Christophe Filliâtre and Claude Marché. Multi-prover verification of
C programs. In Jim Davies, Wolfram Schulte, and Mike Barnett, editors,
Formal Methods and Software Engineering, volume 3308 of LNCS, pages 15–
29. Springer, 2004.
[46] Cormac Flanagan and Shaz Qadeer. Predicate abstraction for software veri-
fication. In POPL, pages 191–202. ACM, 2002.
[47] G. Cousineau and P. Enjalbert. Program equivalence and provability. In MFCS,
1979.
[48] Mark Gabel, Lingxiao Jiang, and Zhendong Su. Scalable detection of seman-
tic clones. In ICSE, pages 321–330, 2008.
[49] Mark Gabel, Junfeng Yang, Yuan Yu, Moises Goldszmidt, and Zhendong Su.
Scalable and systematic detection of buggy inconsistencies in source code. In
OOPSLA, 2010.
[50] Yeting Ge and Leonardo de Moura. Complete instantiation for quantified
formulas in satisfiability modulo theories. In CAV, volume 5643 of LNCS, pages
306–320. Springer-Verlag, 2009.
[51] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin.
Bayesian Data Analysis. Texts in Statistical Science. Chapman & Hall/CRC,
second edition, 2004.
[52] Susanne Graf and Hassen Saïdi. Construction of abstract state graphs with
pvs. In CAV, volume 1254 of LNCS, pages 72–83. Springer, 1997.
[53] S. Gulwani and A. Tiwari. Computing procedure summaries for interproce-
dural analysis. In R. De Nicola, editor, European Symp. on Programming,
ESOP 2007, volume 4421 of LNCS, pages 253–267, 2007.
[54] Sumit Gulwani, Sagar Jain, and Eric Koskinen. Control-flow refinement and
progress invariants for bound analysis. In PLDI, pages 375–385. ACM, 2009.
[55] Sumit Gulwani, Bill McCloskey, and Ashish Tiwari. Lifting abstract inter-
preters to quantified logical domains. In POPL, pages 235–246. ACM, 2008.
[56] Sumit Gulwani, Saurabh Srivastava, and Ramarathnam Venkatesan.
Constraint-based invariant inference over predicate abstraction. In VMCAI,
volume 5403 of LNCS, pages 120–135. Springer, 2009.
[57] Bolei Guo, Neil Vachharajani, and David I. August. Shape analysis with
inductive recursion synthesis. In PLDI ’07: Proceedings of the 2007 ACM
SIGPLAN conference on Programming language design and implementation,
pages 256–265, New York, NY, USA, 2007. ACM Press.
[58] Anubhav Gupta, Kenneth L. McMillan, and Zhaohui Fu. Automated assump-
tion generation for compositional verification. In CAV, volume 4590 of LNCS,
pages 420–432. Springer, 2007.
[59] Ashutosh Gupta and Andrey Rybalchenko. Invgen: An efficient invariant gen-
erator. In CAV, volume 5643 of LNCS, pages 634–640. Springer, 2009.
[60] Samuel Z. Guyer, Kathryn S. McKinley, and Daniel Frampton. Free-me: a
static analysis for automatic individual object reclamation. SIGPLAN Not.,
41(6):364–375, June 2006.
[61] Nicolas Halbwachs and Mathias Péron. Discovering properties about arrays
in simple programs. In PLDI, pages 339–348, 2008.
[62] Seth Hallem, Benjamin Chelf, Yichen Xie, and Dawson Engler. A system and
language for building system-specific, static analyses. In PLDI 2002. PLDI,
December 2002.
[63] David L. Heine and Monica S. Lam. A practical flow-sensitive and
context-sensitive C and C++ memory leak detector. In Proceedings of the ACM
SIGPLAN 2003 Conference on Programming Language Design and Implementation,
pages 168–181, 2003.
[64] David L. Heine and Monica S. Lam. Static detection of leaks in polymorphic
containers. In ICSE ’06: Proceeding of the 28th international conference on
Software engineering, pages 252–261, New York, NY, USA, 2006. ACM.
[65] Thomas A. Henzinger, Thibaud Hottelier, Laura Kovács, and Andrei
Voronkov. Invariant and type inference for matrices. In VMCAI, pages 163–
179, 2010.
[66] Thomas A. Henzinger, Ranjit Jhala, Rupak Majumdar, and Kenneth L.
McMillan. Abstractions from proofs. In POPL ’04, pages 232–244, New York,
NY, USA, 2004. ACM.
[67] Yoshiki Higo, Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue.
Aries: Refactoring support environment based on code clone analysis.
In SEA, pages 222–229, 2004.
[68] David Hovemeyer and William Pugh. Finding bugs is easy. SIGPLAN Not.,
39(12):92–106, 2004.
[69] Ranjit Jhala and K. L. McMillan. A practical and complete approach to
predicate refinement. In TACAS, volume 3920 of LNCS, pages 459–473. Springer,
2006.
[70] Ranjit Jhala and Kenneth L. McMillan. Array abstractions from proofs. In
CAV, volume 4590 of LNCS, pages 193–206. Springer, 2007.
[71] Y. Jhee, M. Jin, Y. Jung, D. Kim, S. Kong, H. Lee, H. Oh, and K. Yi. Ab-
stract interpretation + impure catalysts: Our sparrow experience. In Work-
shop of the 30 Years of Abstract Interpretation, 2008.
[72] Limin Jia, Jianzhou Zhao, Vilhelm Sjöberg, and Stephanie Weirich. Depen-
dent types and program equivalence. In POPL, pages 275–286, 2010.
[73] Lingxiao Jiang, Ghassan Misherghi, and Zhendong Su. Deckard: Scalable and
accurate tree-based detection of code clones. In ICSE, pages 96–105, 2007.
[74] Lingxiao Jiang and Zhendong Su. Automatic mining of functionally equiva-
lent code fragments via random testing. In ISSTA, pages 81–92, 2009.
[75] Lingxiao Jiang, Zhendong Su, and Edwin Chiu. Context-based detection of
clone-related bugs. In ESEC/FSE, pages 55–64, 2007.
[76] J. Howard Johnson. Identifying redundancy in source code using fingerprints.
In CASCON, pages 171–183, 1993.
[77] Neil D. Jones and Steven S. Muchnick. Flow analysis and optimization of
lisp-like structures. In POPL ’79: Proceedings of the 6th ACM SIGACT-
SIGPLAN symposium on Principles of programming languages, pages 244–
256, New York, NY, USA, 1979. ACM Press.
[78] Elmar Juergens, Florian Deissenboeck, Benjamin Hummel, and Stefan Wag-
ner. Do code clones matter? In ICSE, pages 485–495, 2009.
[79] Yungbum Jung, Jaehwang Kim, Jaeho Shin, and Kwangkeun Yi. Taming
false alarms from a domain-unaware c analyzer by a bayesian statistical post
analysis. In SAS, volume 3672 of Lecture Notes in Computer Science, pages
203–217, 2005.
[80] Yungbum Jung, Soonho Kong, Bow-Yaw Wang, and Kwangkeun Yi. Deriving
invariants in propositional logic by algorithmic learning, decision procedure,
and predicate abstraction. In VMCAI, volume 5944 of LNCS, pages 180–196.
Springer, 2010.
[81] Yungbum Jung, Wonchan Lee, Bow-Yaw Wang, and Kwangkeun Yi. Predi-
cate generation for learning-based quantifier-free loop invariant inference. In
TACAS, pages 205–219, 2011.
[82] Yungbum Jung, Hakjoo Oh, and Kwangkeun Yi. Identifying static analy-
sis techniques for finding non-fix hunks in fix revisions. In Proceeding of
the ACM first international workshop on Data-intensive software management
and mining, DSMM ’09, pages 13–18, New York, NY, USA, 2009. ACM.
[83] Yungbum Jung and Kwangkeun Yi. Practical memory leak detector based
on parameterized procedural summaries. In ISMM, pages 131–140, 2008.
[84] Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue. Ccfinder: A multi-
linguistic token-based code clone detection system for large scale source code.
IEEE Transactions on Software Engineering, 28:654–670, 2002.
[85] B. W. Kernighan and D. M. Ritchie. The C programming language. Prentice-
Hall, Inc., Upper Saddle River, NJ, USA, 1978.
[86] Heejung Kim, Yungbum Jung, Sunghun Kim, and Kwangkeun Yi. Mecc:
memory comparison-based clone detector. In Proceeding of the 33rd inter-
national conference on Software engineering, ICSE ’11, pages 301–310, New
York, NY, USA, 2011. ACM.
[87] Miryung Kim, Vibha Sazawal, David Notkin, and Gail Murphy. An empirical
study of code clone genealogies. SIGSOFT Softw. Eng. Notes, 30(5):187–196,
2005.
[88] Sunghun Kim, Kai Pan, and E. E. James Whitehead, Jr. Memories of bug
fixes. In SIGSOFT FSE, pages 35–45, 2006.
[89] Raghavan Komondoor and Susan Horwitz. Using slicing to identify duplica-
tion in source code. In SAS, pages 40–56, 2001.
[90] Soonho Kong, Yungbum Jung, Cristina David, Bow-Yaw Wang, and
Kwangkeun Yi. Automatically inferring quantified loop invariants by algo-
rithmic learning from simple templates. In Kazunori Ueda, editor, APLAS,
pages 328–343, 2010. Springer.
[91] Laura Kovács and Andrei Voronkov. Finding loop invariants for programs
over arrays using a theorem prover. In FASE, LNCS, pages 470–485.
Springer, 2009.
[92] Ted Kremenek and Dawson Engler. Z-ranking: Using statistical analysis to
counter the impact of static analysis approximations. In Cousot [33], pages
295–315.
[93] D. Kroening and O. Strichman. Decision Procedures — an algorithmic point
of view. EATCS. Springer, 2008.
[94] Harold W. Kuhn. The hungarian method for the assignment problem. In 50
Years of Integer Programming 1958-2008. Springer Berlin Heidelberg, 2009.
[95] Shuvendu K. Lahiri and Randal E. Bryant. Constructing
quantified invariants via predicate abstraction. In VMCAI, volume 2937 of
LNCS, pages 267–281. Springer, 2004.
[96] Shuvendu K. Lahiri, Randal E. Bryant, and Byron Cook. A
symbolic approach to predicate abstraction. In CAV, volume 2715 of LNCS,
pages 141–153. Springer, 2003.
[97] E. Larson and T. Austin. High coverage detection of input-related security
faults. In Proc. of the 12th Usenix Security Symposium, Aug 2003.
[98] Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. Cp-miner:
Finding copy-paste and related bugs in large-scale software code. IEEE
Trans. Softw. Eng., 32(3):176–192, 2006.
[99] Chao Liu, Chen Chen, Jiawei Han, and Philip S. Yu. Gplag: detection of
software plagiarism by program dependence graph analysis. In KDD, pages
872–881, 2006.
[100] Laurent Mauborgne and Xavier Rival. Trace partitioning in abstract inter-
pretation based static analyzers. In M. Sagiv, editor, European Symposium
on Programming (ESOP’05), volume 3444 of Lecture Notes in Computer Sci-
ence, pages 5–20. Springer-Verlag, 2005.
[101] K. L. McMillan. An interpolating theorem prover. Theoretical Computer
Science, 345(1):101–121, 2005.
[102] Kenneth L. McMillan. An interpolating theorem prover. Theor. Comput.
Sci., 345(1):101–121, 2005.
[103] Kenneth L. McMillan. Lazy abstraction with interpolants. In Thomas Ball
and Robert B. Jones, editors, CAV, volume 4144 of LNCS, pages 123–136.
Springer, 2006.
[104] Kenneth L. McMillan. Quantified invariant generation using an interpolat-
ing saturation prover. In TACAS, volume 4693 of LNCS, pages 413–427.
Springer, 2008.
[105] Robin Milner. A theory of type polymorphism in programming. Journal of
Computer and System Sciences, 17:348–375, 1978.
[106] Erik M. Nystrom, Hong-Seok Kim, and Wen-mei W. Hwu. Bottom-up and
top-down context-sensitive summary-based pointer analysis. In The proceedings
of the 11th Annual International Static Analysis Symposium, Lecture
Notes in Computer Science. Springer, 2004.
[107] Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel. Isabelle/HOL —
A Proof Assistant for Higher-Order Logic, volume 2283 of LNCS. Springer,
2002.
[108] Gene Novark, Emery D. Berger, and Benjamin G. Zorn. Exterminator: auto-
matically correcting memory errors with high probability. In Proceedings of
the ACM SIGPLAN 2007 Conference on Programming Language Design and
Implementation, San Diego, California, USA, June 10-13, 2007, pages 1–11,
2007.
[109] Hakjoo Oh, Lucas Brutschy, and Kwangkeun Yi. Access analysis-based tight
localization of abstract memories. In VMCAI, Lecture Notes in Computer
Science, pages 356–370, 2011.
[110] Hakjoo Oh and Kwangkeun Yi. An algorithmic mitigation of large spurious
interprocedural cycles in static analysis. Software - Practice and Experience,
40(8):585–603, 2010.
[111] Peter W. O’Hearn, Hongseok Yang, and John C. Reynolds. Separation and
information hiding. ACM Trans. Program. Lang. Syst., 31(3):1–50, 2009.
[112] M. Orlovich and R. Rugina. Memory leak analysis by contradiction. In SAS
2006: 13th Annual International Static Analysis Symposium, Lecture Notes
in Computer Science. Springer, 2006.
[113] Oukseh Lee, Hongseok Yang, and Kwangkeun Yi. Automatic verification of
pointer programs using grammar-based shape analysis. In ESOP 2005: The
European Symposium on Programming, volume 3444 of Lecture Notes in Com-
puter Science, pages 124–140. Springer-Verlag, 2005.
[114] Carlos Pacheco, Shuvendu K. Lahiri, and Thomas Ball. Finding errors in
.NET with feedback-directed random testing. In ISSTA, pages 87–96, 2008.
[115] Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball.
Feedback-directed random test generation. In ICSE, pages 75–84, 2007.
[116] Andrew Pitts. Operationally-based theories of program equivalence. In Se-
mantics and Logics of Computation, pages 241–298, 1995.
[117] Amir Pnueli. The temporal logic of programs. In SFCS, pages 46–57, 1977.
[118] Lutz Prechelt, Guido Malpohl, and Michael Philippsen. Finding plagiarisms
among a set of programs with JPlag. Journal of Universal Computer Science,
8:1016–1038, 2001.
[119] Thomas Reps, Susan Horwitz, and Mooly Sagiv. Precise interprocedu-
ral dataflow analysis via graph reachability. In Proceedings of the 22nd
ACM SIGPLAN-SIGACT symposium on Principles of programming lan-
guages, POPL ’95, pages 49–61, New York, NY, USA, 1995. ACM.
[120] J. Reynolds. Separation logic: a logic for shared mutable data structures.
In Proceedings of the 17th Annual IEEE Symposium on Logic in Computer
Science, 2002.
[121] Chanchal K. Roy, James R. Cordy, and Rainer Koschke. Comparison and
evaluation of code clone detection techniques and tools: A qualitative ap-
proach. Sci. Comput. Program., 74(7):470–495, 2009.
[122] Chanchal Kumar Roy and James R. Cordy. A survey on software clone
detection research. Technical Report 2007-541, School of Computing,
Queen's University, 115, 2007.
[123] Chanchal Kumar Roy and James R. Cordy. NICAD: Accurate detection of
near-miss intentional clones using flexible pretty-printing and code
normalization. In International Conference on Program Comprehension,
pages 172–181, 2008.
[124] Andreas Sæbjørnsen, Jeremiah Willcock, Thomas Panas, Daniel Quinlan, and
Zhendong Su. Detecting code clones in binary executables. In ISSTA, pages
117–128, 2009.
[125] Brenda S. Baker. A program for identifying duplicated code. In Computer
Science and Statistics: Proc. Symp. on the Interface, pages 49–57, 1992.
[126] Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken. Winnowing: local al-
gorithms for document fingerprinting. In SIGMOD, pages 76–85, 2003.
[127] Micha Sharir and Amir Pnueli. Two approaches to interprocedural data
flow analysis. In Steven S. Muchnick and Neil D. Jones, editors, Program
Flow Analysis: Theory and Applications, pages 189–234, Englewood Cliffs,
NJ, 1981. Prentice-Hall.
[128] Manu Sridharan and Ras Bodik. Refinement-based context-sensitive points-to
analysis for Java. Technical Report UCB/EECS-2006-31, EECS Department,
University of California, Berkeley, Mar 2006.
[129] Saurabh Srivastava and Sumit Gulwani. Program verification using templates
over predicate abstraction. In PLDI, pages 223–234. ACM, 2009.
[130] Saurabh Srivastava, Sumit Gulwani, and Jeffrey S. Foster. VS3: SMT solvers
for program verification. In CAV, volume 5643 of LNCS, pages 702–708,
2009.
[131] Gregory Tassey. The economic impacts of inadequate infrastructure for soft-
ware testing. Technical report, National Institute of Standards and Technol-
ogy, May 2002.
[132] Suresh Thummalapenta, Tao Xie, Nikolai Tillmann, Jonathan de Halleux,
and Wolfram Schulte. MSeqGen: object-oriented unit-test generation via
mining source code. In ESEC/FSE, pages 193–202, 2009.
[133] Alfonso Valdes and Keith Skinner. Probabilistic alert correlation. In Recent
Advances in Intrusion Detection (RAID 2001), number 2212 in Lecture Notes
in Computer Science. Springer-Verlag, 2001.
[134] Vinod Ganapathy, Somesh Jha, David Chandler, David Melski, and David
Vitek. Buffer overrun detection using linear programming and static
analysis.
[135] John Whaley and Martin Rinard. Compositional pointer and escape analysis
for Java programs. ACM SIGPLAN Notices, 34(10):187–206, 1999.
[136] Yichen Xie and Alex Aiken. Context- and path-sensitive memory leak detec-
tion. In ESEC/FSE-13: Proceedings of the 10th European software engineer-
ing conference held jointly with 13th ACM SIGSOFT international symposium
on Foundations of software engineering, pages 115–125, New York, NY, USA,
2005. ACM.
[137] Yichen Xie and Alex Aiken. Scalable error detection using boolean satisfia-
bility. In POPL ’05: Proceedings of the 32nd ACM SIGPLAN-SIGACT sym-
posium on Principles of programming languages, pages 351–363, New York,
NY, USA, 2005. ACM.
[138] Yichen Xie, Andy Chou, and Dawson Engler. Archer: using symbolic, path-
sensitive analysis to detect memory access errors. In ESEC/FSE-11: Pro-
ceedings of the 9th European software engineering conference held jointly with
11th ACM SIGSOFT international symposium on Foundations of software en-
gineering, pages 327–336. ACM Press, 2003.
[139] Greta Yorsh, Eran Yahav, and Satish Chandra. Generating precise and con-
cise procedure summaries. In Proceedings of the 35th annual ACM SIGPLAN-
SIGACT symposium on Principles of programming languages, POPL ’08,
pages 221–234, New York, NY, USA, 2008. ACM.
[140] Misha Zitser, Richard Lippmann, and Tim Leek. Testing static analysis
tools using exploitable buffer overflows from open source code. In SIG-
SOFT ’04/FSE-12: Proceedings of the 12th ACM SIGSOFT twelfth inter-
national symposium on Foundations of software engineering, pages 97–106.
ACM Press, 2004.
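Abstract (Korean)
We propose a static analysis that uses parameterized procedure summaries. The
analysis is flow- and context-sensitive and partially path-sensitive. We analyze
and summarize the memory behavior of each procedure, and then use the summary
at the sites where the procedure is called. Each procedure summary is
parameterized so that it can be instantiated differently according to the
context of its call site. The path conditions of a procedure's execution are
computed during the analysis and recorded as guards. These guards record the
path information within a procedure. Because guards are also used in procedure
summaries, the analysis can also track, to some extent, interprocedural path
information related to memory changes.
Using this analysis, we succeeded in detecting memory leaks and code clones.
The precision of the memory leak detection is relatively high. We detected a
large number of memory leaks in the SPEC2000 benchmarks and several open-source
software packages. We presented a new method for finding semantic code clones
by comparing the abstract memory states computed by the proposed static
analysis. On large open-source projects, we found many semantic code clones
that existing clone detection methods cannot find.
Keywords : Programming Language, Abstract Interpretation, Memory Leaks,
Code Clones, Static Analysis, Procedural Summary
Student Number : 2004-21624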