An {Execution-Semantic, Content-and-Context}-Based
Code-Clone {Detection,Analysis}
Toshihiro Kamiya
Future University Hakodate
Toshihiro Kamiya: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis, Proceedings of the 9th IEEE International Workshop on Software Clones (IWSC'15), pp. 1-7 (2015).
TOC
● Problem/Motivation
● Outline of proposed method
● Example
● Algorithm of clone detection
● Visualization
● Implementation
● Preliminary experiment
The problems / Motivation
● In functional PLs, developers can define their own control structures.
– Analyzing only pre-defined control statements is no longer sufficient to represent code patterns.
– E.g., if (C) A; else B; ⇔ myIf(C, lambdaA, lambdaB);
→ inter-procedural analysis
● Dynamic dispatching makes inter-procedural analysis difficult.
– Esp. in functional + OO + dynamically typed PLs (no explicit type declarations → hard to analyze dispatches statically)
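As a small illustration of the myIf equivalence above, the following sketch defines a user-made control structure in Python. The names my_if, sign_with_statement, and sign_with_my_if are hypothetical and chosen only for this example. A detector that looks only at built-in if statements would see no branch in the second function, although both behave the same.

```python
def my_if(cond, then_fn, else_fn):
    # User-defined control structure: the branches are passed as callables
    # and only the selected one is invoked.
    return then_fn() if cond else else_fn()


def sign_with_statement(x):
    # Branch written with the built-in if statement.
    if x >= 0:
        return "non-negative"
    else:
        return "negative"


def sign_with_my_if(x):
    # Same branch, expressed through my_if and two lambdas.
    return my_if(x >= 0, lambda: "non-negative", lambda: "negative")
```

Matching the two functions requires following the call into my_if and the lambdas, which is the inter-procedural analysis the slide refers to.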
Idea
Detect clones from an execution trace!
● Dispatches and control structures have been expanded (resolved).
● Detected clones are inter-procedural, type 3 clones.
Outline of proposed method
● Execution trace
→ Call tree
→ Contents and Context (for each node)
● Clone detection, Clone analysis
[Figure: call tree of the example program, with the contents and the context of one node marked]
main()
  os.listdir()
  print_extensions_w_for_stmt()
    os.path.splitext()
    print
  print_extensions_w_map_func()
    str.join()
    get_extensions()
      map()
        lambda() at line 8
          os.path.splitext()
    print
Example code
These two functions are a semantic clone.
get_extensions() is a helper function.
The same functionality: finds extensions of given files and prints them out
Shared items and differences
Distinct loops: for vs. map
In print_extensions_w_for_stmt(), all shared items are contained in one function.
In print_extensions_w_map_func(), the shared items are spread across functions.
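The code on the slide is not part of this transcript. The following is only a plausible reconstruction, assuming Python 3 and using the function names that appear in the call tree; the bodies, the parameter name filenames, the helper printed(), and the FILES list are assumptions made for illustration.

```python
import contextlib
import io
import os


def print_extensions_w_for_stmt(filenames):
    # The loop is a for statement; splitext and print are called directly here.
    for name in filenames:
        _, ext = os.path.splitext(name)
        print(ext)


def get_extensions(filenames):
    # Helper: the loop is expressed with map() and a lambda.
    return map(lambda name: os.path.splitext(name)[1], filenames)


def print_extensions_w_map_func(filenames):
    # splitext is reached only through get_extensions(), map() and the lambda.
    print("\n".join(get_extensions(filenames)))


def printed(func, filenames):
    # Hypothetical helper, used only to compare what the two functions print.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func(filenames)
    return buffer.getvalue()


FILES = ["about.txt", "pygoat.data", "greeting.md"]
```

With these definitions, printed(print_extensions_w_for_stmt, FILES) and printed(print_extensions_w_map_func, FILES) give the same text, even though the two functions share no loop construct.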
Detection steps
Input: a call tree (← execution trace ← target program)
1. Extracts contents and context of each node
2. Identifies sets of contents-sharing nodes
3. Removes redundant nodes (filtering with contexts)
Input
…
call __main__//<module> runpy//_run_code 69:
load_const __main__//<module> 0
load_const __main__//<module> 12
load_const __main__//<module> 21
load_const __main__//<module> 30
load_const __main__//<module> 39
call __main__//main __main__//<module> 63:
call __main__//print_extensions_w_for_stmt __main__//main 24: <list>
call posixpath//splitext __main__//print_extensions_w_for_stmt 25: 'about.txt'
call genericpath//_splitext posixpath//splitext 18: 'about.txt' '/' None '.'
load_const genericpath//_splitext 0
return genericpath//_splitext 139: * 'about' '.txt'
return posixpath//splitext 21: * 'about' '.txt'
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 32: '.txt'
return pygoat.hook/Out/write 15
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 33: '\n'
return pygoat.hook/Out/write 15
call posixpath//splitext __main__//print_extensions_w_for_stmt 25: 'pygoat.data'
call genericpath//_splitext posixpath//splitext 18: 'pygoat.data' '/' None '.'
load_const genericpath//_splitext 0
return genericpath//_splitext 139: * 'pygoat' '.data'
return posixpath//splitext 21: * 'pygoat' '.data'
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 32: '.data'
return pygoat.hook/Out/write 15
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 33: '\n'
return pygoat.hook/Out/write 15
call posixpath//splitext __main__//print_extensions_w_for_stmt 25: 'greeting.md'
call genericpath//_splitext posixpath//splitext 18: 'greeting.md' '/' None '.'
load_const genericpath//_splitext 0
return genericpath//_splitext 139: * 'greeting' '.md'
return posixpath//splitext 21: * 'greeting' '.md'
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 32: '.md'
return pygoat.hook/Out/write 15
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 33
Program → Execution trace → Call tree
[Figure: call tree of the example program, rooted at main()]
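To make the trace-to-call-tree step concrete, here is a minimal sketch that rebuilds a tree from trace lines with a stack. The field layout (an event name followed by the callee label) is inferred from the trace excerpt and is an assumption, not a documented format; build_call_tree and the dict-based node shape are hypothetical.

```python
TRACE = """\
call __main__//main __main__//<module> 63:
call __main__//print_extensions_w_for_stmt __main__//main 24: <list>
call posixpath//splitext __main__//print_extensions_w_for_stmt 25: 'about.txt'
call genericpath//_splitext posixpath//splitext 18: 'about.txt' '/' None '.'
load_const genericpath//_splitext 0
return genericpath//_splitext 139: * 'about' '.txt'
return posixpath//splitext 21: * 'about' '.txt'
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 32: '.txt'
return pygoat.hook/Out/write 15
"""


def build_call_tree(lines):
    # A "call" event opens a child of the node on top of the stack;
    # a "return" event closes it. Other events (e.g. load_const) are skipped.
    root = {"label": "<root>", "children": []}
    stack = [root]
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        event = fields[0]
        if event == "call":
            node = {"label": fields[1], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif event == "return" and len(stack) > 1:
            stack.pop()
    return root


TREE = build_call_tree(TRACE.splitlines())
```

Calls that never return in the excerpt (main and print_extensions_w_for_stmt here) simply stay open on the stack, so a truncated trace still yields a usable tree.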
Step 1.
1. Extracts contents and context of each node
[Figure: call tree of the example program]
Node: main()
  Contents: get_extensions(), map(), lambda() at line 8, os.listdir(), os.path.splitext(), print, print_extensions_w_for_stmt(), print_extensions_w_map_func(), str.join()
Node: print_extensions_w_for_stmt()
  Context: main()
  Contents: os.path.splitext(), print
Node: print_extensions_w_map_func()
  Context: main()
  Contents: get_extensions(), map(), lambda() at line 8, os.path.splitext(), print, str.join()
Step 2.
2. Identifies sets of contents-sharing nodes
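Step 2 can be illustrated with a brute-force stand-in for frequent item-set mining: intersect the contents of every group of two or more nodes and record which nodes contain each shared item-set. This is not the Apriori algorithm and it is exponential in the number of nodes; it is only meant to show what "contents-sharing" means on the example. CONTENTS, contents_sharing_sets, and min_items are hypothetical.

```python
from itertools import combinations

CONTENTS = {
    "main()": {
        "get_extensions()", "map()", "lambda() at line 8", "os.listdir()",
        "os.path.splitext()", "print", "print_extensions_w_for_stmt()",
        "print_extensions_w_map_func()", "str.join()",
    },
    "print_extensions_w_for_stmt()": {"os.path.splitext()", "print"},
    "print_extensions_w_map_func()": {
        "get_extensions()", "map()", "lambda() at line 8",
        "os.path.splitext()", "print", "str.join()",
    },
}


def contents_sharing_sets(contents, min_items=2):
    # Maps each shared item-set to the set of all nodes whose contents include it.
    nodes = sorted(contents)
    result = {}
    for size in range(2, len(nodes) + 1):
        for group in combinations(nodes, size):
            shared = frozenset(set.intersection(*(set(contents[n]) for n in group)))
            if len(shared) >= min_items:
                result[shared] = frozenset(n for n in nodes if shared <= contents[n])
    return result


SHARING = contents_sharing_sets(CONTENTS)
```

On this input the item-set { os.path.splitext(), print } is shared by all three nodes, which is why main() still has to be filtered out afterwards.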
Step 3.
3. Removes redundant nodes (filtering with contexts)
main() is in the context of all the other nodes in the set
⇒ redundant
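One reading of the filtering rule in Step 3 is sketched below: a node is dropped from a contents-sharing set when it appears in the context of every other node in that set, since it shares the items only because it encloses the others. The names CONTEXT and remove_redundant are hypothetical, and the exact rule used by the tool may differ.

```python
CONTEXT = {
    "main()": (),
    "print_extensions_w_for_stmt()": ("main()",),
    "print_extensions_w_map_func()": ("main()",),
}


def remove_redundant(nodes, context):
    # Keep a node unless it is in the context of all the other nodes in the set.
    kept = []
    for node in nodes:
        others = [other for other in nodes if other != node]
        if others and all(node in context[other] for other in others):
            continue  # redundant: it merely contains the other nodes
        kept.append(node)
    return kept


CLONE_CLASS = remove_redundant(
    ["main()", "print_extensions_w_for_stmt()", "print_extensions_w_map_func()"],
    CONTEXT,
)
```

A set that shrinks to a single node after filtering, such as ["main()", "print_extensions_w_map_func()"], would no longer form a clone class.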
Detection result
A clone class: { print_extensions_w_map_func(), print_extensions_w_for_stmt() }
Shared items: { os.path.splitext(), print }
Node: print_extensions_w_for_stmt()
  Context: main()
  Contents: os.path.splitext(), print
Node: print_extensions_w_map_func()
  Context: main()
  Contents: get_extensions(), map(), lambda() at line 8, os.path.splitext(), print, str.join()
Detection result
A clone class: { print_extensions_w_map_func(), print_extensions_w_for_stmt() }
Shared items: { os.path.splitext(), print }
Dagified (merged) by label (DAG = directed acyclic graph)
[Figure: DAG with the context main() above the clone nodes print_extensions_w_for_stmt() and print_extensions_w_map_func(), and their contents get_extensions(), print, map(), lambda() at line 8, os.path.splitext() below]
Content-and-context analysis for triaging
● Clone class (a), shared items (b), distinct contents (or gap) (c)
● The distinct contents (c) share the same set of (sub-)contents (d) → (c) is another clone class.
● If (c) is merged before (a), (c) will no longer be a gap of (a).
[Figure: clone visualization with regions (a)-(d), detected from markdown2's code (described later)]
Tool prototype
[Figure: tool pipeline]
Target program + Inputs / Test cases → Execution (Python interpreter)
→ Execution trace extraction (Debugging / profiling APIs) → Execution trace
Detection:
Step 1: String balloon generation → String balloons
Step 2: Frequent item-set mining (Apriori) → Similar sets of contents
Step 3: Redundant context removal → Code clones
Analysis: Visualization, Metrics calculation
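The "Debugging / profiling APIs" box could be realized in several ways. As one hedged illustration, Python's standard sys.setprofile hook reports call and return events, which is enough to record a simple trace. The record_trace, helper, and target names are hypothetical, and this is not claimed to be how the prototype itself collects its trace.

```python
import sys


def record_trace(func, *args):
    # Collects (event, name) pairs while func runs.
    events = []

    def profiler(frame, event, arg):
        if event in ("call", "return"):
            # Python-level function entered or left.
            events.append((event, frame.f_code.co_name))
        elif event in ("c_call", "c_return"):
            # Built-in function; arg is the C function object.
            events.append((event, getattr(arg, "__name__", repr(arg))))

    sys.setprofile(profiler)
    try:
        func(*args)
    finally:
        sys.setprofile(None)
    return events


def helper(text):
    return len(text)


def target():
    return helper("abc")


EVENTS = record_trace(target)
```

The recorded list also contains events for built-ins (including the final sys.setprofile call itself), so a real extractor would filter or relabel them before building the call tree.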
● Input: Python source code
● Uses a frequent item-set mining algorithm / implementation
– Apriori (www.borgelt.net/apriori.html)
● Heuristics / optimizations
– Max. depth of contents from a target node (default 5)
– Max. number of content items of a candidate node (default 25)
  ● Filters out the nodes with large contents, i.e., nodes near the root of the call tree
– Removal of basic, primitive functions
– ...
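The two numeric heuristics can be sketched as follows, assuming the call tree is given as nested (label, children) tuples and that depth is counted in call levels below the target node. The defaults 5 and 25 come from the slide; contents_within_depth, is_candidate, and the primitives parameter are hypothetical.

```python
def contents_within_depth(node, max_depth=5, primitives=frozenset()):
    # Labels of descendants at most max_depth levels below node,
    # skipping labels listed as basic, primitive functions.
    _, children = node
    found = set()
    if max_depth <= 0:
        return found
    for child in children:
        if child[0] not in primitives:
            found.add(child[0])
        found |= contents_within_depth(child, max_depth - 1, primitives)
    return found


def is_candidate(node, max_depth=5, max_items=25, primitives=frozenset()):
    # Nodes with too many content items (typically near the root) are filtered out.
    return len(contents_within_depth(node, max_depth, primitives)) <= max_items


CHAIN = ("a", [("b", [("c", [("d", [])])])])
WIDE = ("root", [("f%d" % i, []) for i in range(30)])
```

For example, contents_within_depth(CHAIN, max_depth=2) stops after "c", and is_candidate(WIDE) is False because the node has 30 content items.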
Content-and-context clone on call graph
Preliminary experiment
For each value of the parameter (“Max. number of content items of a candidate node”): 10, 15, …, 30.

Target product   Collection of exe. seq.                 # function calls   # unique labels
markdown2        Running 144 unit tests                  227K               1128
wxPython         Invoking a sample program “pySketch”    483K               1058
Results
Exponential in the number of contents
Too “peaky” for practical use
Summary
● Code-clone detection from dynamic information, i.e., an execution trace
– Aimed at functional / dynamically typed PLs
● Context-and-content analysis for triage
● Algorithm, implementation, heuristics
● Preliminary experiment
– Targets: markdown2 and wxPython
– Peaky, sensitive to the parameter “Max. number of content items of a candidate node” → needs refinement
Omitted here; refer to the paper:
● Threats to validity
● Future plan