Post on 26-Dec-2015


Improving Android App Development's Efficiency and Quality through Machine Learning Techniques

刘世泽 Lau Shyh Tzer, David

Visiting Student, IIIS, Tsinghua University, Summer 2013

BSc. Computer Science, The Chinese University of Hong Kong

Background

• Problem: the growth of the Android API

• It is difficult for developers, especially inexperienced ones, to master Android API usage

• API levels 1-18, over 20k API methods

• Aim: adapt machine learning and reverse engineering techniques to analyze usage patterns

• Possibly develop a helper tool to suggest/fix Android API usage during the development stage

Workflow

1. Adapt reverse engineering techniques to retrieve the needed data from the packaged Android app (.apk).

2. Perform data mining on the resulting raw data to dig out interesting API usage patterns and relationships.

1. Reverse Engineering on Android App

• Need to retrieve information from the package (.apk) file, not from source code

• Perform static analysis, not dynamic analysis

• There are basically three options:

1. .apk → Dalvik bytecode (retrieve the .dex file): low level, few analysis tools

2. .apk → smali code (disassemble): good for hacking, but it takes time to become familiar with smali code

3. .apk → .jar (.class) bytecode (decompile): becomes a Java problem, with lots of analysis tools

1. Reverse Engineering on Android App: .apk → .jar (.class) bytecode (decompile)

• Use the open source tool dex2jar: https://code.google.com/p/dex2jar/

• It supports converting an .apk directly into a .jar file. On Linux: sh dex2jar/d2j-dex2jar.sh someApk.apk

• We can then treat the task as a Java problem and focus on static analysis of Java bytecode

1. Reverse Engineering on Android App

• To understand the usage patterns of the Android API, we have to know the structure of the code

• An easy, abstract way to understand the structure of the code is to look at its Abstract Syntax Tree (AST)

Generate AST from Java bytecode

• Parsing Java source code into an Abstract Syntax Tree is straightforward, but parsing bytecode is not

• Bytecode is a set of instructions that the JVM interprets, running the program through stack execution

• Stack execution of bytecode differs from the program flow that we observe in source code

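Example (figure: a method's bytecode next to its AST). The AST has a method root with two children: an assignment "=" whose operands are i3 and "*" (i1, i2), and a return whose operand is "*" (i3, 2). The bytecode is displayed with the Bytecode Outline Plugin for Eclipse:

http://asm.ow2.org/eclipse/index.html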

Generate AST from Java bytecode

• Intuitively, since the JVM interprets bytecode as stack execution, we can ‘recover’ the code structure and construct the AST by simulating the JVM stack operations

Example: Variable Assignment

Source code: int i = 1;

Bytecode: ICONST_1, ISTORE 0

Thread stack: holds the constant 1 until it is stored into i

Abstract Syntax Tree: "=" with children i and 1

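Example 2: From Previous Bytecode Example

Source code: public int method(int i1, int i2) { int i3 = i1 * i2; return i3 * 2; }

Bytecode: ILOAD 1, ILOAD 2, IMUL, ISTORE 3, ILOAD 3, ICONST_2, IMUL, IRETURN

Thread stack: i1 and i2 are pushed and multiplied, and the product is stored into i3; then i3 and 2 are pushed and multiplied, and the product is returned

Abstract Syntax Tree: a method root with two children: "=" (i3, "*" (i1, i2)) and return ("*" (i3, 2))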
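The stack simulation in this example can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the author's ASTGenerator and not the ASM API: the class name StackSimSketch, the textual instruction format ("ILOAD 1") and the parenthesized tree output are all invented for the sketch, and only the five opcodes of the example are handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical mini-model; names and instruction format are assumptions.
final class StackSimSketch {
    /** Replays a symbolic operand stack; returns one tree (as text) per statement. */
    static List<String> rebuild(List<String> bytecode) {
        Deque<String> stack = new ArrayDeque<>();
        List<String> statements = new ArrayList<>();
        for (String insn : bytecode) {
            String[] parts = insn.split(" ");
            switch (parts[0]) {
                case "ILOAD":   // push a reference to local variable iN
                    stack.push("i" + parts[1]);
                    break;
                case "ICONST":  // push an integer constant
                    stack.push(parts[1]);
                    break;
                case "IMUL": {  // pop two operands, push a "*" subtree
                    String right = stack.pop();
                    String left = stack.pop();
                    stack.push("(* " + left + " " + right + ")");
                    break;
                }
                case "ISTORE":  // pop the value: it becomes an assignment node
                    statements.add("(= i" + parts[1] + " " + stack.pop() + ")");
                    break;
                case "IRETURN": // pop the value: it becomes a return node
                    statements.add("(return " + stack.pop() + ")");
                    break;
                default:
                    throw new IllegalArgumentException("unsupported: " + insn);
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        List<String> code = Arrays.asList("ILOAD 1", "ILOAD 2", "IMUL", "ISTORE 3",
                                          "ILOAD 3", "ICONST 2", "IMUL", "IRETURN");
        System.out.println(rebuild(code)); // [(= i3 (* i1 i2)), (return (* i3 2))]
    }
}
```

Each stack entry is a partial subtree rather than a runtime value, which is why replaying the instructions yields the code structure instead of a result.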

Generate AST from Java bytecode

• There are various kinds of AST structure, such as conditional statements, goto statements and compound statements, but they can all be ‘recovered’ from bytecode by simulating the stack execution as in the previous examples

• However, reading the .class file directly yields a binary format that is useless for our parsing

• So we need a systematic way to parse the bytecode

ASM - Bytecode Engineering Library

• ASM is an all-purpose Java bytecode manipulation and analysis framework: http://asm.ow2.org/

• It provides two powerful APIs: the Core API and the Tree API

• The Core API provides a visitor-style interface for traversing bytecode

• The Tree API parses bytecode into objects

• Refer to http://download.forge.objectweb.org/asm/asm4-guide.pdf for complete usage

• The Tree API is particularly useful for generating the AST from bytecode

• It provides two important classes, ClassNode and MethodNode, which give the developer direct access to the bytecode information

• The bytecode (opcodes) of each method is stored in its InsnList instructions field

ASM - Bytecode Engineering Library

Usage: create a reader for the .class file and a ClassNode, then have the reader accept the ClassNode with flag 0. ASM parses the whole class and stores the resulting objects in the ClassNode.

Access class information: all the information is stored in the properties of the object, so simply read it from them.

Access method information: read the method list from the ClassNode and loop over it. Each class contains several methods, so the list has type List<MethodNode>.

Access bytecode instructions: take the InsnList of a MethodNode and iterate over its instructions one by one.

• Each instruction is wrapped in an abstract class (AbstractInsnNode). ASM separates instructions into 16 different kinds.

• The bytecode instruction (opcode) is defined as an integer constant in ASM. For example: 3 = ICONST_0, 21 = ILOAD, 54 = ISTORE.

• The type of the instruction is also defined as an integer constant. For example: 4 = FIELD_INSN, 5 = METHOD_INSN.

• Details of the instruction, such as which local variable a store targets, the field variable ID, or the invoked method's signature, are stored in this object as properties.

My Design of AST Generator

• With the support of the ASM Tree API, we can parse the bytecode and simulate each stack execution to construct the corresponding Abstract Syntax Tree systematically

• To supply the data needed for data mining, I designed a customized Abstract Syntax Tree structure

Abstract parent class: ASTNode

Inherited classes, each specifically designed to suit one code structure: ASTArithmeticNode, ASTArrayNode, ASTArrayValueNode, ASTCastNode, ASTClassNode, ASTConstantNode, ASTFieldNode, ASTFunctionNode, ASTJumpNode, ASTLabelNode, ASTLocalVariableNode, ASTMethodNode, ASTObjectNode, ASTReturnNode, ASTSwitchNode

My Design of AST Generator

ASTNode

• getASTKind

• setName / getName

• setSignature / getSignature

• setCallBy / getCallBy

• setUsedBy / getUsedBy

• setUsedAsObject / getUsedAsObject

Example:

StringBuilder sb = new StringBuilder();
String text = "Sample";
sb.append(text);
String result = sb.toString();

(Figure: the nodes sb, append, text, “Sample”, toString and result, linked by setUsedBy, setUsedAsObject and setCallBy connections.)

• CallBy/UsedBy/UsedAsObject are stored as ArrayList<ASTNode> to handle multiple connections

• An object and its methods are doubly connected to allow bidirectional traversal

My Design of AST Generator

ASTMethodNode

• addParameter / getPara

• Parameters added with addParameter are stored as ArrayList<ASTNode> to handle multiple connections

ASTLocalVariableNode

• setIndex / getIndex

• setVariableType / getVariableType

• setVariableValue / getVariableValue

ASTFieldNode

• setFieldValue / getFieldValue

Trick: Local variables are stored by index in the local variable array of the method's JVM stack frame. To track changes in local variable assignment (one variable can be assigned and used multiple times), create a hash table of references to the local variable nodes, keyed by index. Updates to a variable's assignment can then be made easily while parsing the bytecode.

Trick: The same applies to field variables: create a hash table of references to the field variable nodes. Note that the local variable hash table is cleared on entering each method, whereas field values are tracked across the whole class.
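A minimal sketch of the two lookup tables described in these tricks might look as follows. The class VariableTables and all of its method names are hypothetical, and a plain String stands in for the ASTNode that the real generator would store.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; a String stands in for an ASTNode reference.
final class VariableTables {
    // local slot index -> node currently assigned to that slot
    private final Map<Integer, String> locals = new HashMap<>();
    // field name -> node currently assigned to that field
    private final Map<String, String> fields = new HashMap<>();

    /** Local slots belong to one method, so reset them on entering a method. */
    void enterMethod() { locals.clear(); }

    /** A store overwrites the previous assignment of the same slot. */
    void storeLocal(int index, String node) { locals.put(index, node); }

    String loadLocal(int index) { return locals.get(index); }

    /** Field assignments persist across all methods of the class. */
    void putField(String name, String node) { fields.put(name, node); }

    String getField(String name) { return fields.get(name); }

    public static void main(String[] args) {
        VariableTables tables = new VariableTables();
        tables.enterMethod();
        tables.storeLocal(1, "constant 1");
        tables.storeLocal(1, "result of i1 * i2"); // reassignment of slot 1
        tables.putField("count", "constant 0");
        System.out.println(tables.loadLocal(1));
        tables.enterMethod();                       // next method: locals reset
        System.out.println(tables.loadLocal(1) + ", " + tables.getField("count"));
    }
}
```

Keeping the two maps separate makes the different lifetimes explicit: one is cleared per method, the other lives as long as the class is being parsed.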

Data Flow Analysis on the AST

• A complete Abstract Syntax Tree of an Android app provides very useful detail for various kinds of static analysis

• My research mainly focuses on data flow analysis. For an API call of the schematic form returnValue = object.method(arg1, arg2, arg3); it asks three questions:

1. Where does the return value go?

2. Where do the arguments come from?

3. What kind of object is this method called on?

Data Flow Analysis on the AST

• Performed a depth-first search on the AST to trace the return value path, the argument path and the call-by path, collecting data from each Android API invocation

• Trick: use a hash table/list to record the path already visited, to avoid infinite loops within the AST
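Because the nodes are linked in both directions, a plain recursive walk could cycle forever. The sketch below shows the usual guard, a set of already visited nodes. GraphNode and PathTracer are hypothetical names for illustration, not the author's classes, and only a single usedBy link type is modelled.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical node with one kind of outgoing link.
final class GraphNode {
    final String name;
    final List<GraphNode> usedBy = new ArrayList<>();
    GraphNode(String name) { this.name = name; }
}

final class PathTracer {
    /** Depth-first walk from start; each node is reported once. */
    static List<String> trace(GraphNode start) {
        List<String> order = new ArrayList<>();
        visit(start, new HashSet<GraphNode>(), order);
        return order;
    }

    private static void visit(GraphNode node, Set<GraphNode> seen, List<String> order) {
        if (!seen.add(node)) {
            return; // already recorded: stop here to avoid an infinite loop
        }
        order.add(node.name);
        for (GraphNode next : node.usedBy) {
            visit(next, seen, order);
        }
    }

    public static void main(String[] args) {
        GraphNode sb = new GraphNode("sb");
        GraphNode append = new GraphNode("append");
        GraphNode result = new GraphNode("result");
        sb.usedBy.add(append);
        append.usedBy.add(sb);      // back link creates a cycle
        append.usedBy.add(result);
        System.out.println(trace(sb)); // [sb, append, result]
    }
}
```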

(Figure: AST of the QQ Android app, showing ASTClassNode nodes with their ASTMethodNode children.)

2. Mining on the Analysis Data

Convert App into Jar format

Generate AST from Jar file

Define To-Do Analysis Format

Preparing Mining Raw Data

• Coded a web crawler to download free Android apps from the open Android market http://apk.gfan.com/

• Successfully grabbed 10,266 valid Android apps and generated their ASTs and analysis data with the self-developed ASTGenerator, running on an Amazon Web Services High Memory cluster

• Result data structure: a 0/1 matrix. Each row is keyed by app name, API method and a numeric suffix; each column is a related API method.

Columns, in order:

1. android/net/Uri buildUpon
2. android/net/Uri toString
3. android/net/Uri$Builder appendQueryParameter
4. android/net/Uri$Builder build
5. java/lang/String indexOf
6. java/lang/String substring
7. java/lang/StringBuilder <init>
8. java/lang/StringBuilder append

Rows:

appname-android/net/Uri buildUpon-0: 1 1 1 0 0 1 1 0
appname-android/net/Uri parse-0: 1 1 1 0 0 0 0 1

Mining on the Analysis Data

• Used Weka 3 to perform the data mining task: http://www.cs.waikato.ac.nz/ml/weka/

• It is convenient to convert the raw analysis data into Weka's ARFF input format, especially thanks to its support for a sparse matrix format
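In ARFF's sparse format, a data row lists only its non-zero entries as "index value" pairs inside braces, with attribute indices starting at 0. The sketch below converts one 0/1 row of the result matrix into such a line. The class and method names are hypothetical, and numeric attributes are assumed.

```java
// Hypothetical helper; assumes every attribute is numeric.
final class SparseArff {
    /** Formats one matrix row as a sparse ARFF data line. */
    static String sparseRow(int[] row) {
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < row.length; i++) {
            if (row[i] != 0) {              // zero entries are simply omitted
                if (sb.length() > 1) {
                    sb.append(", ");
                }
                sb.append(i).append(' ').append(row[i]);
            }
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        int[] row = {1, 1, 1, 0, 0, 1, 1, 0};
        System.out.println(sparseRow(row)); // {0 1, 1 1, 2 1, 5 1, 6 1}
    }
}
```

With thousands of API columns and only a handful of related APIs per invocation, most entries are zero, so the sparse form keeps the input file far smaller than a dense one.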

Mining on the Analysis Data

• Applied hierarchical agglomerative clustering to the result matrix to discover the APIs' relationships and their usage patterns

• The analysis data from the 10,266 apps is huge (~50 GB of text), so mining it directly is time-consuming and unnecessary

• Designed a MapReduce task and ran it on Hadoop to group the result data method by method and compute statistics on their invocation counts

• Then performed hierarchical clustering only on methods with enough data to reveal a meaningful pattern (for example, those whose invocation count reached a threshold)
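The grouping and counting step can be illustrated without Hadoop. The sketch below is a single-machine stand-in, not the author's MapReduce job: the "map" step extracts the method from a row key and the "reduce" step sums the occurrences. It assumes row keys of the form appname-class method-N with no hyphen inside the app name; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine stand-in for the counting job.
final class InvocationCount {
    /** "Map" step: strip the app name prefix and the numeric suffix. */
    static String methodOf(String rowKey) {
        return rowKey.substring(rowKey.indexOf('-') + 1, rowKey.lastIndexOf('-'));
    }

    /** "Reduce" step: count the invocations recorded for each method. */
    static Map<String, Integer> count(List<String> rowKeys) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : rowKeys) {
            counts.merge(methodOf(key), 1, Integer::sum);
        }
        return counts;
    }

    /** Keeps the methods whose invocation count reaches the threshold. */
    static List<String> atLeast(Map<String, Integer> counts, int threshold) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= threshold) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
            "app1-android/net/Uri buildUpon-0",
            "app2-android/net/Uri buildUpon-0",
            "app1-android/net/Uri parse-0");
        System.out.println(atLeast(count(keys), 2)); // [android/net/Uri buildUpon]
    }
}
```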

Mining on the Analysis Data

• In total, 19,250 Android API methods were found to be invoked at least once among the 10,266 apps

• A high number of invocations does not by itself make a method more likely to yield a meaningful pattern. Such methods are more likely to be very common usages, such as UI elements or Log

• Methods with an average number of invocations, especially those tied to a specific feature such as geo-location, network or database, might be the better targets for data mining

Clustered Result:

• Weka outputs the hierarchical clustering result in Newick format; software such as Dendroscope can visualize it: http://ab.inf.uni-tuebingen.de/software/dendroscope/

• android/net/Uri buildUpon: some clustering results have obvious clusters, which may imply an obvious usage pattern for this method

Clustered Result:

android/location/Location <init>

• Trick: when an analysis shows many redundant column keys (related APIs), filter out those with no obvious relation (such as APIs related only once) before sending the data for clustering

• It may take several trials to find the best cut-off branch

Clustered Result:

android/location/Geocoder <init>

• The resulting Newick format can be parsed back to get the list of related APIs for each cluster

• There may be many unique usage patterns like this one, each probably either an error pattern or a special usage
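Parsing the Newick output back can be done with a small scanner. The sketch below collects the leaf labels of a Newick (sub)tree string, which for a cluster subtree is its list of related APIs. It is written against the general Newick syntax rather than against Weka's exact output, which is an assumption here; quoted labels are not handled, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; handles unquoted labels only.
final class NewickLeaves {
    /** Returns the leaf labels of a Newick string, ignoring branch lengths. */
    static List<String> leaves(String newick) {
        List<String> out = new ArrayList<>();
        StringBuilder label = new StringBuilder();
        boolean skipping = false; // true inside a branch length or internal node label
        for (char c : newick.toCharArray()) {
            if (c == '(' || c == ')' || c == ',' || c == ':' || c == ';') {
                if (label.length() > 0) {
                    out.add(label.toString());
                    label.setLength(0);
                }
                // text after ':' is a length; text after ')' names an internal node
                skipping = (c == ':' || c == ')');
            } else if (!skipping) {
                label.append(c);
            }
        }
        if (label.length() > 0) {
            out.add(label.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(leaves("((a:0.1,b:0.1):0.2,c:0.3);")); // [a, b, c]
    }
}
```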

Mining on the Analysis Data

• By tracing back through the resulting Newick tree, we get the related APIs for each method

• Interesting results (package analysis): the clustering result shows that the classes of the android/location package are strongly interconnected. Classes such as GpsSatellite, Location and LocationManager rely heavily on each other.

Identify Good and Error Usage Patterns

• With the app's name as identifier, we can trace back to information about the app (its download count, rating, popularity) to determine the features of the resulting clusters and possible good/bad patterns.

Complete API Relationships

• Perform clustering on all the useful data and, after identifying the usage patterns, retrieve the API relationships from the clusters. This can eventually yield a list of suggested ‘Good’ usage patterns for when the respective API methods are called

Mining on the Analysis Data

Android API vs Java Library

• Many Android APIs have data flow relations with Java library methods. By digging into the methods' clusters, we can discover obvious usage patterns that combine Android APIs and Java library methods.

App's Permissions vs Method Invocation

• Android permission information is stored in AndroidManifest.xml, a binary-encoded XML file inside the .apk package. It can be decoded with a tool such as APK-tool (https://code.google.com/p/android-apktool/).

• Android permissions are declared statically at compile time and cannot be granted dynamically (at run time)

• So with the AST generated from the bytecode, we can traverse the AST to check the correctness of an app's declared permissions.

• For example, an app may declare BroadcastReceiver permissions while no related methods/functions are found in the AST
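Such a check reduces to a set comparison once three inputs are available: the permissions declared in the manifest, the API methods found in the AST, and a table saying which methods need which permission. The sketch below assumes all three are given; the table in particular is a hypothetical input that would have to be built separately, and the class and method names are invented for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical checker; the permission-to-API table is an assumed input.
final class PermissionCheck {
    /** Returns declared permissions none of whose known APIs is invoked. */
    static Set<String> declaredButUnused(Set<String> declared,
                                         Set<String> invoked,
                                         Map<String, Set<String>> apisByPermission) {
        Set<String> unused = new TreeSet<>();
        for (String permission : declared) {
            Set<String> apis = apisByPermission.get(permission);
            // permissions missing from the table are skipped, not flagged
            if (apis != null && Collections.disjoint(apis, invoked)) {
                unused.add(permission);
            }
        }
        return unused;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> table = new HashMap<>();
        table.put("android.permission.SEND_SMS",
                  Collections.singleton("android/telephony/SmsManager sendTextMessage"));
        table.put("android.permission.ACCESS_FINE_LOCATION",
                  Collections.singleton("android/location/LocationManager requestLocationUpdates"));
        Set<String> declared = new HashSet<>(Arrays.asList(
            "android.permission.SEND_SMS", "android.permission.ACCESS_FINE_LOCATION"));
        Set<String> invoked = Collections.singleton(
            "android/location/LocationManager requestLocationUpdates");
        System.out.println(declaredButUnused(declared, invoked, table));
    }
}
```

A flagged permission is only a hint: the app may reach the protected functionality through a path the table does not cover.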

Thank you very much