Impact Analysis - ImpactScale: Quantifying Change Impact to Predict Faults in Large Software Systems
DESCRIPTION
Paper: ImpactScale: Quantifying Change Impact to Predict Faults in Large Software Systems
Authors: Kenichi Kobayashi, Akihiko Matsuo, Katsuro Inoue, Yasuhiro Hayase, Manabu Kamimura and Toshiaki Yoshino
Session: Research Track 2: Impact Analysis

TRANSCRIPT
ImpactScale: Quantifying Change Impact to Predict Faults in Large Software Systems
Kenichi Kobayashi, Fujitsu Laboratories
Akihiko Matsuo, Fujitsu Laboratories
Manabu Kamimura, Fujitsu Laboratories
Toshiaki Yoshino, Fujitsu
Yasuhiro Hayase, University of Tsukuba
Katsuro Inoue, Osaka University
ICSM2011 @ Williamsburg, 2011-09-27
Overview
1. Background and Goal
2. Definition of ImpactScale
3. Measuring ImpactScale in Real Systems
4. Fault Prediction and Evaluation
5. Summary
Copyright 2011 FUJITSU LABORATORIES LIMITED
Background (practitioners' point of view)
Fault prediction in maintenance is a difficult task, and predictive performance is not sufficient with product metrics alone. (Product metrics are metrics extracted from the software product, such as source code.)
Therefore, process metrics, such as code churn and logical coupling, have been combined with product metrics. (Process metrics are metrics extracted from the software process, such as change histories.)
However, in enterprise maintenance settings, documents, change histories, bug reports, and specialists' knowledge are often lost, out of date, or unusable.
Goals
Problem: Process metrics cannot always be obtained.
Motivation: To achieve high predictive performance with only product metrics extractable from source code.
Goals: To define a new product metric, and to show the effectiveness of the metric.
Basic Idea
Software dependency is one of the surviving factors of faults even after release.
Change Impact Analysis: a technique to identify the affected areas when some part of the software is changed.
  Weakness: high computational cost.
We assumed Change Impact Analysis enables us to extract implicit dependencies.
We need not identify the affected areas themselves; we only need their scale.

(Figure: a change triggers fixes that propagate along an implicit dependency; a missed fix remains as a fault.)

ImpactScale (abbrev. IS)

Hypothesis
A metric that quantifies the scale of change impact can improve the performance of fault prediction.
Overview
1. Background and Goal
2. Definition of ImpactScale
3. Measuring ImpactScale in Real Systems
4. Fault Prediction and Evaluation
5. Summary
Overview of ImpactScale Definition
Propagation Graph: dependencies are extracted from the target software, and a Propagation Graph is built.
Propagation Model: probabilistic propagation and relation-sensitive propagation.
ImpactScale is the sum of all Quantities of Change Impact.

(Figure: a change at a Code Node propagates through Code Nodes and Data Nodes; the Quantity of Change Impact from C to A is illustrated.)
Propagation Graph
① Build a dependency graph extracted from the target software.
② Add reverse edges to build the Propagation Graph.
Change impact analysis for ImpactScale is performed on the Propagation Graph.

《Dependency Graph》 → 《Propagation Graph》
Code Node: module, class, function, source code
Data Node: DB table, global variable
Dependency Edge: with relation type (CALL, READ, WRITE)
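As a rough sketch of steps ① and ②, under the assumption that the dependency graph is given as (source, target, relation) triples (an encoding of ours, not the paper's tooling), the Propagation Graph can be built like this:

```python
from collections import defaultdict

# Minimal sketch: for every dependency edge, keep the forward edge and
# add a reverse edge, so change impact analysis can traverse both ways.
def build_propagation_graph(dep_edges):
    """dep_edges: iterable of (src, dst, relation) triples.
    Returns adjacency: node -> list of (neighbor, relation, direction)."""
    graph = defaultdict(list)
    for src, dst, rel in dep_edges:
        graph[src].append((dst, rel, "forward"))   # original dependency edge
        graph[dst].append((src, rel, "reverse"))   # added reverse edge
    return graph

# Toy example: module A CALLs module B; B READs DB table T.
g = build_propagation_graph([("A", "B", "CALL"), ("B", "T", "READ")])
```

With the reverse edges added, a change at T can reach B (and then A), which a plain dependency graph would not allow.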
Probabilistic Propagation
We assume that change impact propagates probabilistically from one node to another, as in some Ripple Effect studies [Haney72] [Tsantalis05] [Sharafat07].
In this presentation, the propagation probability is always 0.5.

(Figure: a change at the source node propagates with probability ×0.5 along each edge; the quantity of change impact from the source node is attenuated by the propagation probability at every hop.)
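A minimal sketch of probabilistic propagation with p = 0.5, assuming a plain adjacency-list graph, a depth bound, and per-path cycle skipping (all our simplifications; the relation-sensitive cut rules of the next slides are omitted):

```python
# Change impact starts at 1.0 at the changed node and is multiplied by
# p = 0.5 on every traversed edge; ImpactScale is the sum of the
# quantities of change impact arriving at all reached nodes.
def impact_scale(graph, source, p=0.5, max_depth=10):
    total = 0.0

    def walk(node, quantity, depth, visited):
        nonlocal total
        if depth >= max_depth:
            return
        for nxt in graph.get(node, []):
            if nxt in visited:
                continue  # skip cycles on this path
            total += quantity * p                       # impact arriving at nxt
            walk(nxt, quantity * p, depth + 1, visited | {nxt})

    walk(source, 1.0, 0, {source})
    return total

# Chain C -> A -> B: 0.5 arrives at A and 0.25 at B, so ImpactScale(C) = 0.75.
```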
Relation-sensitive Propagation
To avoid overestimation, we use context information to eliminate unlikely propagation. We use an edge's relation type as the minimal context information, for the sake of computational time.
Cut Rules determine whether propagation from one node to its next node is cut, referring to the relation types of the previous and next edges.
We call such controlled propagation relation-sensitive propagation.
Its computational complexity is practically low.

(Figure: at the current node, the Cut Rule refers to the previous relation type and the next relation type to decide whether propagation continues to the next node.)
Example of Cut Rules
Cut Rule 1: During finding callees, don't find callers.
Cut Rule 2: During finding callers, don't find callees.
Cut Rule 3: Don't find beyond READ edges.

(Figure: example propagations from "C" and from "F" showing where each cut rule stops the search.)
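The three rules can be sketched as a predicate over the previous and next edges' relation types. The relation names here ("CALL" for a forward call edge toward callees, "CALLED_BY" for the added reverse edge toward callers) are our own encoding, not the paper's, and our reading of Rule 3 (stop once a READ edge has been traversed) is an assumption:

```python
# Returning True cuts the propagation from the current node to the next.
def is_cut(prev_rel, next_rel):
    if prev_rel == "CALL" and next_rel == "CALLED_BY":
        return True   # Rule 1: during finding callees, don't find callers
    if prev_rel == "CALLED_BY" and next_rel == "CALL":
        return True   # Rule 2: during finding callers, don't find callees
    if prev_rel == "READ":
        return True   # Rule 3: don't find beyond READ edges
    return False
```

Because the decision looks only at the two adjacent edges, the check is constant-time per step, which is why the computational cost stays practically low.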
Overview
1. Background and Goal
2. Definition of ImpactScale
3. Measuring ImpactScale in Real Systems
4. Fault Prediction and Evaluation
5. Summary
Data Sets for Evaluations
Two enterprise accounting systems in different companies.

Data Set  #Modules  Total LOC  #Faults  #Faulty Modules  Faulty Module Rate  Term Faults Collected
DS1       5.8k      1.6M       269      215              3.7%                40 months
DS2       7.6k      3.7M       250      208              2.7%                40 months

Common properties
Language: COBOL
Age: over 20 years

Collected metrics
7 existing metrics: LOC, WMC, MaxVG, Sections, Calls, Fan-in, Fan-out
ImpactScale
Real Example of Calculating ImpactScale
(Figure, DS1, 5.8k modules: each square-shaped group of modules is a sub-system; a sequence of frames shows the change impact of one module spreading across sub-systems.)
Measurement Results
Distribution of ImpactScale
Calculation time: DS1: about 10 sec.; DS2: about 30 sec.

(Figure: histogram of the number of modules per ImpactScale bin, in bins of 50 from ~50 to ~950. The distribution is long-tailed but practically short, with a spike near the high end: a system-wide dispatcher, or a symptom of a bad smell.)

Data Set  Mean IS  Max IS
DS1       86.0     2989.6
DS2       156.5    3338.2
ImpactScale and Faults
The first 20% of modules contain 48.8% of the faults.
IS correlates highly with faults.

(Figure: modules and database tables ordered by ImpactScale from high to low, split into deciles.)
Overview
1. Background and Goal
2. Definition of ImpactScale
3. Measuring ImpactScale in Real Systems
4. Fault Prediction and Evaluation
5. Summary
Overview of Evaluations
Evaluation procedure: 100-times random sub-sampling validation.
Evaluations:
Fault Prediction
  RQ1: Does adding ImpactScale to existing product metrics improve predictive performance?
    • Predicting Faulty or Not Faulty
    • Effort-aware Fault Prediction
  RQ2: Comparison between ImpactScale and Network Measures
  RQ3: Validating the ImpactScale Definition
Predicting Faulty or Not Faulty
Faults are predicted using logistic regression. MET = model without ImpactScale; MET+IS = model with ImpactScale.

Performance Measure  DS1 MET  DS1 MET+IS  Improvement by IS
Precision            0.148    0.168       +0.020
Recall               0.315    0.392       +0.077
F1                   0.200    0.234       +0.034

Performance Measure  DS2 MET  DS2 MET+IS  Improvement by IS
Precision            0.139    0.162       +0.023
Recall               0.253    0.334       +0.081
F1                   0.177    0.216       +0.039

All improvements are significant by Wilcoxon's signed-rank test.
That adding IS improves all performance measures supports answering RQ1 with YES.

Practitioners' point of view: these Precision/Recall/F1 evaluations are not very useful in practice, because in maintenance, modules estimated as highly fault-prone tend to be large. In the case of DS2, the top 10% of high fault-estimated modules account for 24% of the LOC, which is not effort-effective.
Effort-aware Fault Prediction Model
Problem: in maintenance, modules estimated as faulty tend to be large, and a large module needs a large effort to review or test.
Practitioners' opinion: "Budget and schedule are very demanding. We want to find more faults with less effort." Therefore, effort-effectiveness is our main concern.
We use an "effort-aware model" [Arisholm06] [Menzies10] [Mende10]. It prioritizes modules in order of relative risk to maximize effort-effectiveness:

    relative risk(x) = #errors(x) / Effort(x)

Poisson regression is used to learn the relative risk.
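As a toy illustration of this prioritization, ranking modules by predicted faults per line of code might look like the sketch below. The module names and fault estimates are invented, and using LOC as the effort measure is an assumption here; in the paper the fault estimates come from Poisson regression:

```python
# Effort-aware prioritization: highest relative risk (predicted faults
# per unit of effort) first, so more faults are found per LOC inspected.
def prioritize_by_relative_risk(modules):
    """modules: list of (name, predicted_faults, loc)."""
    return sorted(modules, key=lambda m: m[1] / m[2], reverse=True)

ranked = prioritize_by_relative_risk(
    [("A", 2.0, 1000), ("B", 1.0, 200), ("C", 3.0, 3000)]
)
# B (risk 0.005) outranks A (0.002) and C (0.001) despite fewer predicted faults.
```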
Results of Effort-aware Evaluation
AUC is the area under the curve of the effort-based cumulative lift chart; it shows overall predictive performance (higher is better).
ddr10 is the detected defect rate in the first 10% of effort; it shows predictive performance under limited effort (higher is better).
Practitioners' point of view: in maintenance, budget, schedule and effort are always limited; therefore, ddr10 is more important.
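Given modules already ranked by predicted relative risk, ddr10 can be sketched as below. The (LOC, faults) input format and the whole-module cutoff at the 10% budget are our assumptions:

```python
# ddr10: walk modules in predicted-risk order, accumulate inspection
# effort (LOC) and actual faults, and report the fraction of all faults
# found within the first 10% of total effort.
def ddr10(ranked_modules):
    """ranked_modules: list of (loc, actual_faults), highest risk first."""
    total_loc = sum(loc for loc, _ in ranked_modules)
    total_faults = sum(f for _, f in ranked_modules)
    budget = 0.10 * total_loc
    spent, found = 0, 0
    for loc, faults in ranked_modules:
        if spent + loc > budget:
            break  # next module would exceed the 10% effort budget
        spent += loc
        found += faults
    return found / total_faults
```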
(Figure: effort-based cumulative lift charts of DS1 and DS2; in both, MET+IS lies above MET and below Optimal.)

Performance Measure  DS1-MET  DS1-MET+IS  Improvement by IS
AUC                  0.635    0.680       +0.045
ddr10                0.186    0.296       ×1.60

Performance Measure  DS2-MET  DS2-MET+IS  Improvement by IS
AUC                  0.669    0.714       +0.045
ddr10                0.225    0.343       ×1.53

All improvements are significant by Wilcoxon's signed-rank test.
RQ1: Does adding ImpactScale to existing product metrics improve predictive performance? YES.
Comparison with Network Measures
Network measures: [Zimmermann et al., ICSE08] recently applied Social Network Analysis (SNA) to a software dependency graph representing relationships between the binary modules of software systems.
Over 50 network measures were used, for example: in/out degrees, network diameter, closeness, eigenvector centrality (a.k.a. PageRank), etc.
They and some replication studies [Tosun09] [Nguyen10] reported that these measures work well in some cases.

RQ2: "Does adding ImpactScale to existing product metrics and network measures improve predictive performance?"
ImpactScale vs. Network Measures
Hierarchical model comparison based on the effort-aware model. Models are learned using Principal Component Poisson Regression.

(Figure: performance of the model with existing metrics, then +ImpactScale, +network measures, and +network measures +ImpactScale; adding ImpactScale improves performance in each case.)

All improvements and deteriorations are significant by Wilcoxon's signed-rank test (*: P<0.05, **: P<0.01, unmarked: P<0.001).

RQ2: "Does adding ImpactScale to existing product metrics and network measures improve predictive performance?" YES.
Validating ImpactScale
Test method: compare models using ImpactScale variants with a limited maximum path-finding distance.

(Figure: ddr10 plotted against the limit of the maximum path-finding distance, from 1 to 10, for DS1 and DS2. The "Limit = 1" variant is almost fan-in + fan-out.)

RQ3: Is considering distant nodes meaningful? Answer: YES.
Overview
1. Background and Goal
2. Definition of ImpactScale
3. Measuring ImpactScale in Real Systems
4. Fault Prediction and Evaluation
5. Summary
Summary of Evaluations
RQ1: Does adding ImpactScale to existing product metrics improve predictive performance? YES
RQ2: "Does adding ImpactScale to existing product metrics and network measures improve predictive performance?" YES
RQ3: Is considering distant nodes meaningful? YES
Hypothesis: A metric that quantifies the scale of change impact can improve the performance of fault prediction. TRUE
Threats to Validity
Language: ImpactScale has no language-specific feature, but the evaluations were done only on COBOL systems, and COBOL differs in many ways from other languages.
Application domain: the evaluated systems are only in the accounting business domain.
Call graph analysis: the impact of dynamic dispatching (e.g. polymorphism and reflection) is not assessed.
Conclusion
We defined a new product metric quantifying change impact, called ImpactScale.
  Probabilistic propagation
  Relation-sensitive propagation
  Practical computational time even for large-scale software systems
We evaluated its predictive performance in enterprise systems.
  Adding ImpactScale improves the performance: over 1.5 times more defects detected in the first 10% of effort (LOC).
Additional finding: considering distant nodes in the dependency graph is meaningful for fault prediction.
Future Work
Extending supported languages: Java, C, C++
Expanding use cases: rapid risk assessment, watching violations of modularity, measuring software decay
Thank you!
Kenichi Kobayashi
Fujitsu Labs