Value-Based Program Characterization and Its Application to Software Plagiarism Detection Embedded Lab. Park Yeongseong ICSE 2011 Yoon-Chan Jhi, Xinran Wang, Sencun Zhu, Peng Liu, Dinghao Wu Penn State University Xiaoqi Jia State Key Laboratory of Information Security, Institute of Software, Chinese Academy of Sciences


Post on 14-Jan-2016


Page 1: Embedded Lab. Park Yeongseong.  Introduction  State of the art  Core values  Design  Experiment  Discussion  Conclusion  Q&A

Value-Based Program Characterization and Its Application to Software Plagiarism Detection

Embedded Lab. Park Yeongseong

ICSE 2011

Yoon-Chan Jhi, Xinran Wang, Sencun Zhu, Peng Liu, Dinghao Wu
Penn State University

Xiaoqi Jia
State Key Laboratory of Information Security, Institute of Software, Chinese Academy of Sciences

Page 2: Contents

Introduction
State of the art
Core values
Design
Experiment
Discussion
Conclusion
Q&A

Page 3:

Page 4: Introduction

Identifying the same or similar code is very important

Previous works
◦ Static source code comparison – C1
◦ Static executable code comparison – C2
◦ Dynamic control flow based methods – C3
◦ Dynamic API based methods – C4

Page 5: Introduction

Three highly desired requirements
◦ R1 – Resiliency
◦ R2 – Ability to work directly on binary executables
◦ R3 – Platform independence

However, no existing category satisfies all three. Requirements each one fails:
◦ Static source code comparison (C1) – R1, R2
◦ Static executable code comparison (C2) – R1
◦ Dynamic control flow based methods (C3) – R1, R3
◦ Dynamic API based methods (C4) – R3

Page 6: Introduction

Introduce new approach
◦ Core-values

5 optimization options (-O0 ~ -O3, -Os)
3 compilers (GCC, TCC, WCC)
KlassMaster, Thicket, Loco/Diablo obfuscators

Page 7: State of the art

Code Obfuscation Techniques
◦ data obfuscation, control obfuscation, layout obfuscation and preventive transformations
◦ indirect branches, control-flow flattening, function-pointer aliasing

Static Analysis Based Plagiarism Detection
◦ String-based
◦ AST-based
◦ Token-based
◦ PDG-based
◦ Birthmark-based

Page 8: State of the art

Dynamic Analysis Based Plagiarism Detection
◦ Whole program path based (WPP)
◦ Sequence of API function calls birthmark (EXESEQ)
◦ Frequency of API function calls birthmark (EXEFREQ)
◦ System call based birthmark
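The difference between the two API-call birthmarks listed on this slide can be illustrated with a toy trace. This is only a minimal sketch, assuming a trace is a plain list of API names; the function names `exeseq` and `exefreq` and the trace contents are hypothetical, and the original birthmark definitions may differ in detail.

```python
from collections import Counter

def exeseq(trace):
    """Sequence-style birthmark: the API calls in the order they were observed."""
    return tuple(trace)

def exefreq(trace):
    """Frequency-style birthmark: how often each API was called, ignoring order."""
    return Counter(trace)

# Two hypothetical traces containing the same calls in a different order.
trace_a = ["open", "read", "read", "close"]
trace_b = ["open", "read", "close", "read"]

print(exeseq(trace_a) == exeseq(trace_b))    # False: the call order differs
print(exefreq(trace_a) == exefreq(trace_b))  # True: the call counts match
```

The toy traces show why the two birthmarks react differently to reordering: the sequence form changes when calls are reordered, while the frequency form does not.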

Page 9: Core values

Runtime values
◦ The output operands of the machine instructions executed

Core values
◦ Constructed from runtime values

Eliminate non-core values
◦ If a value v is not derived from the input I, v is not a core-value of the program P
◦ If v is not in the set of runtime values of f(P), a semantics-preserving transformation of P, v is not a core-value of P
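The two elimination rules can be sketched as a filter over recorded runtime values. This is a rough sketch under loud assumptions: the first rule is taken to mean "derived from the program input" and the second to mean "also observed in a semantically equivalent build of the same program". Each runtime value is modeled as a hypothetical `(value, derived_from_input)` pair, and the second rule is approximated by simple set membership. None of these names come from the slides.

```python
def candidate_core_values(runtime_values, other_build_values):
    """Keep only values that survive both elimination rules.

    runtime_values: list of (value, derived_from_input) pairs (hypothetical format).
    other_build_values: values recorded from another build of the same program.
    Passing both rules is a necessary condition only, not proof of being a core-value.
    """
    other = set(other_build_values)
    kept = []
    for value, derived_from_input in runtime_values:
        if not derived_from_input:  # rule 1: not derived from the input
            continue
        if value not in other:      # rule 2: absent from the other build
            continue
        kept.append(value)
    return kept

run = [(7, True), (4096, False), (49, True), (13, True)]
other_build = [7, 49, 99]
print(candidate_core_values(run, other_build))  # [7, 49]
```

Here 4096 is dropped by the first rule and 13 by the second, leaving 7 and 49 as candidates.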

Page 10: Core values

Page 11: Design - Value Sequence Extraction

Not all values associated with the execution of a program are core-values
◦ Value-updating instruction
◦ Related to the program’s semantics
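Restricting extraction to value-updating instructions might look like the following. This is a minimal sketch: the `TraceEntry` record, the `VALUE_UPDATING` opcode set, and the sample trace are all hypothetical stand-ins, not the instruction list or trace format used in the presented work.

```python
from typing import NamedTuple, Optional

class TraceEntry(NamedTuple):
    opcode: str
    output: Optional[int]  # value of the output operand, None if nothing is written

# Assumed set of value-updating opcodes (arithmetic and bitwise), for illustration only.
VALUE_UPDATING = {"add", "sub", "mul", "xor", "shl"}

def extract_value_sequence(trace):
    """Collect output operands of value-updating instructions in execution order."""
    return [
        entry.output
        for entry in trace
        if entry.opcode in VALUE_UPDATING and entry.output is not None
    ]

trace = [
    TraceEntry("mov", 5),     # plain data movement: skipped
    TraceEntry("add", 12),
    TraceEntry("jmp", None),  # control transfer: no output value
    TraceEntry("xor", 3),
]
print(extract_value_sequence(trace))  # [12, 3]
```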

Page 12: Design - Value Sequence Refinement and Similarity Metric

To refine value sequences
◦ Sequential refinement – reduction rate 16%~34%
◦ Optimization-based refinement – 5 optimization options
◦ Address removal – exclude pointer values
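Address removal and a sequence similarity score can be sketched together. The slide names a similarity metric without defining it, so the code below assumes a longest-common-subsequence (LCS) ratio, which is one common way to compare sequences. The address range, the normalization by the shorter sequence, and all function names are assumptions made for illustration.

```python
def remove_addresses(values, address_ranges):
    """Drop values that fall inside any given memory range (likely pointers)."""
    return [
        v for v in values
        if not any(low <= v < high for low, high in address_ranges)
    ]

def lcs_length(a, b):
    """Length of the longest common subsequence, using one rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Assumed metric: LCS length divided by the length of the shorter sequence."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / min(len(a), len(b))

ranges = [(0x08048000, 0x08100000)]  # hypothetical code/data segment
plaintiff = remove_addresses([7, 0x08049F00, 49, 13, 2], ranges)
suspect = remove_addresses([7, 49, 0x0804A010, 99, 13], ranges)
print(plaintiff)                       # [7, 49, 13, 2]
print(suspect)                         # [7, 49, 99, 13]
print(similarity(plaintiff, suspect))  # 0.75
```

The common subsequence in this example is 7, 49, 13, so three of the four values in the shorter sequence match.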

Page 13: Design - Overview

Page 14: Experiment

Intel Quad-Core 2.00 GHz CPU
4GB RAM
Linux machine
QEMU 0.9.1

Questions
1. resilient
2. false accusation
3. credible

Page 15: Experiment - Obfuscation tool (resiliency)

Obfuscation techniques
◦ SandMark, KlassMaster: Java bytecode obfuscators

Test application: JLex
◦ Lexical analyzer

Page 16: Experiment - Similar Programs (false accusation)

Test application
◦ 5 individual XML parsers: expat, libxml2, Parsifal, rxp, xercesc

Page 17: Experiment - Different Programs (credible)

Test application
◦ Bzip2, gzip, oggenc, 9 of 11 programs

Result
◦ Similarity scores between 0 and 0.27
◦ zip and gzip similarity scores are 1.0 (same compression algorithm: deflate)
◦ zip and bzip2 similarity scores are 0.01 to 0.03 (different compression algorithm: block sorting)

Page 18: Conclusion

Introduces a novel approach to dynamic characterization of executable programs.

The value-based method successfully discriminates 34 plagiarisms obfuscated by SandMark, KlassMaster, and Thicket.

Page 19: Q&A