An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

Toshihiro Kamiya, Future University Hakodate, [email protected]

Toshihiro Kamiya: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis, Proceedings of the 9th IEEE International Workshop on Software Clones (IWSC'15), pp. 1-7 (2015).


Page 1: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

An {Execution-Semantic, Content-and-Context}-Based

Code-Clone {Detection,Analysis}

Toshihiro Kamiya

Future University Hakodate

[email protected]

Toshihiro Kamiya: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis, Proceedings of the 9th IEEE International Workshop on Software Clones (IWSC'15), pp. 1-7 (2015).

Page 2: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

TOC

● Problem/Motivation● Outline of proposed method● Example● Algorithm of clone detection● Visualization● Implementation● Preliminary experiment

Page 3: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

The problems / Motivation

● In functional PLs, developers can define their own control structures.
– Analyzing only pre-defined control statements is no longer sufficient to represent code patterns.
– E.g., if (C) A; else B; ⇔ myIf(C, lambdaA, lambdaB);
→ inter-procedural analysis is required

● Dynamic dispatching makes inter-procedural analysis difficult.
– Esp. in functional + OO + dynamically typed PLs (no explicit type declarations → hard to analyze dispatches statically)
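The myIf equivalence above can be made concrete with a minimal Python sketch. The names my_if, sign_with_stmt and sign_with_my_if are hypothetical, chosen only for illustration; they do not come from the slides.

```python
def my_if(cond, then_fn, else_fn):
    """A user-defined control structure: runs one of two callables."""
    return then_fn() if cond else else_fn()


def sign_with_stmt(x):
    # Built-in control statement: the branch is visible syntactically.
    if x >= 0:
        return "non-negative"
    else:
        return "negative"


def sign_with_my_if(x):
    # Same behaviour, but the branch only shows up inside my_if(),
    # so an analysis has to follow the call to see it.
    return my_if(x >= 0, lambda: "non-negative", lambda: "negative")
```

Both functions return the same value for every input, yet only the first contains an if statement, which is why a purely statement-level comparison would likely miss the similarity.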

Page 4: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

Idea

Detect clones from an execution trace!

● Dispatches and control structures have been expanded (resolved).

● Detected clones are inter-procedural, type 3 clones.

Page 5: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

Outline of proposed method

● Execution trace
→ Call tree
→ Contents and context (for each node)

[Figure: call tree of the example program, with the contents and the context of one node marked. Node labels: main(), os.listdir(), print_extensions_w_for_stmt(), print_extensions_w_map_func(), os.path.splitext(), print, str.join(), get_extensions(), map(), lambda() at line 8. Figure labels: "Clone detection", "Clone analysis", "Contents", "Context".]

Page 6: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

Example code

Page 7: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

These two functions are...

A helper function

Page 8: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

...a semantic clone.

The same functionality: finds extensions of given files and prints them out

Page 9: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

Shared items

Page 10: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

and differences

Distinct loops: for vs. map

In print_extensions_w_for_stmt(), all shared items are contained in one function.

In print_extensions_w_map_func(), the shared items are spread across functions.
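The example code on these slides is an image and is not part of this transcript. The following is a hypothetical reconstruction, assuming only the function names that appear in the call tree (main, print_extensions_w_for_stmt, print_extensions_w_map_func, get_extensions); the bodies, parameter names and Python 3 syntax are guesses, not the author's code.

```python
import os


def print_extensions_w_for_stmt(file_names):
    # Loop written as a for statement; splitext and print are called here.
    for name in file_names:
        _, ext = os.path.splitext(name)
        print(ext)


def get_extensions(file_names):
    # Helper function: the loop is expressed with map() and a lambda.
    return list(map(lambda name: os.path.splitext(name)[1], file_names))


def print_extensions_w_map_func(file_names):
    # splitext is reached only indirectly, through get_extensions().
    print("\n".join(get_extensions(file_names)))


def main(directory="."):
    file_names = os.listdir(directory)
    print_extensions_w_for_stmt(file_names)
    print_extensions_w_map_func(file_names)
```

Under these assumptions both print functions emit one extension per line for the same input, while their syntax (for vs. map) and the placement of the shared calls differ.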

Page 11: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

Detection steps

Input: a call tree (← execution trace ← target program)

1. Extracts contents and context of each node

2. Identifies sets of contents-sharing nodes

3. Removes redundant nodes (filtering with contexts)

Page 12: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

Input

…
call __main__//<module> runpy//_run_code 69:
load_const __main__//<module> 0
load_const __main__//<module> 12
load_const __main__//<module> 21
load_const __main__//<module> 30
load_const __main__//<module> 39
call __main__//main __main__//<module> 63:
call __main__//print_extensions_w_for_stmt __main__//main 24: <list>
call posixpath//splitext __main__//print_extensions_w_for_stmt 25: 'about.txt'
call genericpath//_splitext posixpath//splitext 18: 'about.txt' '/' None '.'
load_const genericpath//_splitext 0
return genericpath//_splitext 139: * 'about' '.txt'
return posixpath//splitext 21: * 'about' '.txt'
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 32: '.txt'
return pygoat.hook/Out/write 15
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 33: '\n'
return pygoat.hook/Out/write 15
call posixpath//splitext __main__//print_extensions_w_for_stmt 25: 'pygoat.data'
call genericpath//_splitext posixpath//splitext 18: 'pygoat.data' '/' None '.'
load_const genericpath//_splitext 0
return genericpath//_splitext 139: * 'pygoat' '.data'
return posixpath//splitext 21: * 'pygoat' '.data'
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 32: '.data'
return pygoat.hook/Out/write 15
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 33: '\n'
return pygoat.hook/Out/write 15
call posixpath//splitext __main__//print_extensions_w_for_stmt 25: 'greeting.md'
call genericpath//_splitext posixpath//splitext 18: 'greeting.md' '/' None '.'
load_const genericpath//_splitext 0
return genericpath//_splitext 139: * 'greeting' '.md'
return posixpath//splitext 21: * 'greeting' '.md'
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 32: '.md'
return pygoat.hook/Out/write 15
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 33

[Figure: Program → Execution trace → Call tree. The call tree is the one shown on the "Outline of proposed method" slide, rooted at main().]
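As a rough illustration of how a call tree could be derived from trace records like those above, here is a minimal sketch. It assumes that each record starts with its kind (call, return, load_const) followed by the function label, and that every return closes the most recent open call; the Node class and build_call_tree are hypothetical names, not part of the author's tool.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def build_call_tree(trace_lines):
    """Turn call/return records into a tree, using a stack of open calls."""
    root = Node("<root>")
    stack = [root]
    for line in trace_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or truncated records
        kind, label = parts[0], parts[1]
        if kind == "call":
            node = Node(label)
            stack[-1].children.append(node)
            stack.append(node)
        elif kind == "return" and len(stack) > 1:
            stack.pop()
        # Other record kinds (e.g. load_const) are ignored in this sketch.
    return root
```

Arguments and return values in the records are dropped here; only the nesting of calls is kept.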

Page 13: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

Step 1.

1. Extracts contents and context of each node

[Figure: the call tree, with the contents and context extracted for three of its nodes, listed below.]

Node: main()
Contents: get_extensions(), map(), lambda() at line 8, os.listdir(), os.path.splitext(), print, print_extensions_w_for_stmt(), print_extensions_w_map_func(), str.join()

Node: print_extensions_w_for_stmt()
Context: main()
Contents: os.path.splitext(), print

Node: print_extensions_w_map_func()
Context: main()
Contents: get_extensions(), map(), lambda() at line 8, os.path.splitext(), print, str.join()
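Step 1 can be sketched as follows, assuming that the contents of a node are the labels of its descendants (down to a depth limit) and that its context is the set of labels of its ancestors. The Node class, the function names, and the exact nesting below get_extensions() are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def contents_of(node, max_depth=5):
    """Labels of all descendants of node, down to max_depth levels."""
    found = set()
    if max_depth > 0:
        for child in node.children:
            found.add(child.label)
            found |= contents_of(child, max_depth - 1)
    return found


def contents_and_context(root, max_depth=5):
    """Return one (label, contents, context) row per call-tree node."""
    rows = []

    def visit(node, ancestors):
        rows.append((node.label, contents_of(node, max_depth), set(ancestors)))
        for child in node.children:
            visit(child, ancestors + [node.label])

    visit(root, [])
    return rows


# Call tree of the example (the nesting under get_extensions() is assumed).
SPLITEXT = "os.path.splitext()"
tree = Node("main()", [
    Node("os.listdir()"),
    Node("print_extensions_w_for_stmt()", [Node(SPLITEXT), Node("print")]),
    Node("print_extensions_w_map_func()", [
        Node("str.join()"),
        Node("get_extensions()", [
            Node("map()", [Node("lambda() at line 8", [Node(SPLITEXT)])]),
        ]),
        Node("print"),
    ]),
])
rows = {label: (contents, context)
        for label, contents, context in contents_and_context(tree)}
```

With this tree, rows["print_extensions_w_for_stmt()"] holds the two content items and the single context item listed on the slide for that node.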

Page 14: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

Step 2.

2. Identifies sets of contents-sharing nodes

[Figure: the contents and context of the three nodes from Step 1 (main(), print_extensions_w_for_stmt(), print_extensions_w_map_func()), compared to find sets of nodes that share contents.]
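The tool itself uses Apriori for this step. As a much simpler stand-in that is only practical for small inputs, the sketch below intersects the contents of every pair of nodes and collects all nodes that support each shared set. The function name and the min_shared parameter are hypothetical.

```python
from itertools import combinations


def content_sharing_sets(contents, min_shared=2):
    """Group nodes that share at least min_shared content items.

    contents: dict mapping node label -> set of content labels.
    Returns a dict mapping frozenset(shared items) -> set of node labels.
    This is a pairwise-intersection stand-in, not the Apriori algorithm.
    """
    groups = {}
    for a, b in combinations(sorted(contents), 2):
        shared = frozenset(contents[a] & contents[b])
        if len(shared) < min_shared:
            continue
        # Every node whose contents include the shared items supports the set.
        groups[shared] = {n for n, c in contents.items() if shared <= c}
    return groups


contents = {
    "main()": {
        "get_extensions()", "map()", "lambda() at line 8", "os.listdir()",
        "os.path.splitext()", "print", "print_extensions_w_for_stmt()",
        "print_extensions_w_map_func()", "str.join()",
    },
    "print_extensions_w_for_stmt()": {"os.path.splitext()", "print"},
    "print_extensions_w_map_func()": {
        "get_extensions()", "map()", "lambda() at line 8",
        "os.path.splitext()", "print", "str.join()",
    },
}
groups = content_sharing_sets(contents)
```

For the example, the item set { os.path.splitext(), print } is supported by all three nodes, main() included; removing main() is left to Step 3.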

Page 15: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

Step 3.

3. Removes redundant nodes (filtering with contexts)

[Figure: the same three nodes with their contents and context (main(), print_extensions_w_for_stmt(), print_extensions_w_map_func()).]

main() is included in the context of all the other nodes in the set ⇒ redundant
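One possible reading of the redundancy rule is sketched below: a node is dropped when it appears in the context of every other node in the same content-sharing set. This interpretation, and the name remove_redundant, are assumptions; the paper may define the filter differently.

```python
def remove_redundant(nodes, context):
    """Drop nodes that appear in the context of every other node in the set.

    nodes: set of node labels that share contents.
    context: dict mapping node label -> set of ancestor labels.
    """
    kept = set()
    for node in nodes:
        others = nodes - {node}
        if others and all(node in context[other] for other in others):
            continue  # an ancestor of all the others adds no information
        kept.add(node)
    return kept


context = {
    "main()": set(),
    "print_extensions_w_for_stmt()": {"main()"},
    "print_extensions_w_map_func()": {"main()"},
}
clone_class = remove_redundant(set(context), context)
```

A set left with fewer than two nodes after filtering would no longer count as a clone class.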

Page 16: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

Detection result

A clone class: { print_extensions_w_map_func(), print_extensions_w_for_stmt() }

Shared items: { os.path.splitext(), print }

Node: print_extensions_w_for_stmt()
Context: main()
Contents: os.path.splitext(), print

Node: print_extensions_w_map_func()
Context: main()
Contents: get_extensions(), map(), lambda() at line 8, os.path.splitext(), print, str.join()

Page 17: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

Detection result

A clone class: { print_extensions_w_map_func(), print_extensions_w_for_stmt() }

Shared items: { os.path.splitext(), print }

[Figure: the clone class drawn on the call tree after it has been dagified, i.e., nodes merged by label (DAG = directed acyclic graph). Regions are labelled "Context" and "Contents". Node labels: main(), print_extensions_w_for_stmt(), print_extensions_w_map_func(), get_extensions(), print, map(), lambda() at line 8, os.path.splitext().]
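Merging call-tree nodes by label can be sketched as collecting the distinct (caller label, callee label) edges. The nested-tuple tree representation and the function name dagify are assumptions for illustration, and the nesting below get_extensions() is assumed as well.

```python
def dagify(tree):
    """Merge call-tree nodes by label.

    tree: (label, [child trees]) nested tuples.
    Returns the set of distinct (caller label, callee label) edges.
    """
    label, children = tree
    edges = set()
    for child in children:
        edges.add((label, child[0]))
        edges |= dagify(child)
    return edges


SPLITEXT = ("os.path.splitext()", [])
PRINT = ("print", [])
tree = ("main()", [
    ("os.listdir()", []),
    ("print_extensions_w_for_stmt()", [SPLITEXT, PRINT]),
    ("print_extensions_w_map_func()", [
        ("str.join()", []),
        ("get_extensions()", [("map()", [("lambda() at line 8", [SPLITEXT])])]),
        PRINT,
    ]),
])
edges = dagify(tree)
labels = {label for edge in edges for label in edge}
```

The two os.path.splitext() nodes and the two print nodes of the tree collapse into one graph node each. Note that the merged graph is acyclic only when the traced program has no recursive calls.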

Page 18: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

Content-and-context analysis for triaging

● Clone class (a), shared items (b), distinct contents (or gap) (c)
● The distinct contents (c) share the same set of (sub-)contents (d) → (c) is another clone class.
● If (c) is merged before (a), (c) will no longer be a gap of (a).

[Figure: call-graph view with regions marked (a), (b), (c), (d). Detected from markdown2's code (described later).]
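The distinct contents (gap) of each clone can be read as its contents minus the shared items. A minimal sketch, using the earlier example rather than the markdown2 clone shown on the slide; the function name gaps is hypothetical.

```python
def gaps(clone_class, shared_items, contents):
    """Distinct contents (gap) per clone: its contents minus the shared items."""
    return {node: contents[node] - shared_items for node in clone_class}


contents = {
    "print_extensions_w_for_stmt()": {"os.path.splitext()", "print"},
    "print_extensions_w_map_func()": {
        "get_extensions()", "map()", "lambda() at line 8",
        "os.path.splitext()", "print", "str.join()",
    },
}
shared = {"os.path.splitext()", "print"}
result = gaps(set(contents), shared, contents)
```

Here the for-statement version has an empty gap, while the map version's gap consists of get_extensions(), map(), the lambda and str.join().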

Page 19: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

Tool prototype

[Figure: tool pipeline]

Target program + inputs / test cases
→ Execution (Python interpreter), observed through debugging / profiling APIs
→ Execution trace extraction → Execution trace

Detection:
Step 1: String balloon generation → String balloons
Step 2: Frequent item-set mining (Apriori) → Similar sets of contents
Step 3: Redundant context removal → Code clones

Analysis: Visualization, Metrics calculation
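The slides do not say which debugging / profiling API the prototype uses. As one possibility, Python's standard sys.setprofile hook can record call and return events; the sketch below assumes that hook and records only function names, without the arguments and line numbers seen in the trace excerpt. The name record_calls is hypothetical.

```python
import sys


def record_calls(func, *args, **kwargs):
    """Run func and return a list of (event, function name) pairs."""
    events = []

    def profiler(frame, event, arg):
        # Only Python-level calls and returns; C-function events are skipped.
        if event in ("call", "return"):
            events.append((event, frame.f_code.co_name))

    sys.setprofile(profiler)
    try:
        func(*args, **kwargs)
    finally:
        sys.setprofile(None)
    return events


def inner():
    return 1


def outer():
    return inner()


events = record_calls(outer)
```

The recorded events nest like the call/return records of the trace, so they could feed a stack-based call-tree builder.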

● Input: Python source code
● Uses a frequent item-set mining algorithm / implementation
– Apriori (www.borgelt.net/apriori.html)
● Heuristics / optimizations
– Max. depth of contents from a target node (default 5)
– Max. number of content items of a candidate node (default 25)
● Filters out the nodes with large contents, i.e., nodes near the root of the call tree
– Removal of basic, primitive functions
– ...

[Figure: content-and-context clone on call graph]
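The size and primitive-function heuristics could look like the following sketch. The default of 25 comes from the slide; the function name, the primitives parameter, and the order of the two filters (primitives removed before counting) are assumptions.

```python
def candidate_nodes(contents, max_items=25, primitives=frozenset()):
    """Keep only nodes whose contents are small enough to be clone candidates.

    contents: dict mapping node label -> set of content labels.
    primitives: labels of basic functions to drop from every contents set.
    """
    filtered = {}
    for node, items in contents.items():
        items = items - primitives
        # Nodes near the root of the call tree have large contents; skip them.
        if 0 < len(items) <= max_items:
            filtered[node] = items
    return filtered


contents = {
    "main()": {
        "get_extensions()", "map()", "lambda() at line 8", "os.listdir()",
        "os.path.splitext()", "print", "print_extensions_w_for_stmt()",
        "print_extensions_w_map_func()", "str.join()",
    },
    "print_extensions_w_for_stmt()": {"os.path.splitext()", "print"},
    "print_extensions_w_map_func()": {
        "get_extensions()", "map()", "lambda() at line 8",
        "os.path.splitext()", "print", "str.join()",
    },
}
small = candidate_nodes(contents, max_items=6, primitives=frozenset({"print"}))
```

With the limit lowered to 6 for this tiny example, main() is filtered out and the two print functions remain as candidates.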

Page 20: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

Preliminary experiment

For each value of the parameter “Max. number of content items of a candidate node”: 10, 15, …, 30.

Target product | Collection of execution sequence | # function calls | # unique labels
markdown2 | Running 144 unit tests | 227K | 1128
wxPython | Invoking a sample program “pySketch” | 483K | 1058

Page 21: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

Results

Page 22: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

Results

Exponential in the number of contents

Too “peaky” for practical use

Page 23: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis

Summary

● Code-clone detection from dynamic information (an execution trace)
– Aiming to apply it to functional / dynamically typed PLs
● Content-and-context analysis for triage
● Algorithm, implementation, heuristics
● Preliminary experiment
– Targets: markdown2 and wxPython
– Peaky, sensitive to the parameter “Max. number of content items of a candidate node” → needs refinement

Omitted; refer to the paper:
● Threats to validity
● Future plan