Revisiting Unsupervised Learning for Defect Prediction

Wei Fu, Tim Menzies
Com. Sci., NC State, USA
wfu@ncsu.edu, tim.menzies@gmail.com

ABSTRACT

Collecting quality data from software projects can be time-consuming and expensive. Hence, some researchers explore "unsupervised" approaches to quality prediction that do not require labelled data. An alternate technique is to use "supervised" approaches that learn models from project data labelled with, say, "defective" or "not-defective". Most researchers use these supervised models since, it is argued, they can exploit more knowledge of the projects.

At FSE'16, Yang et al. reported startling results where unsupervised defect predictors outperformed supervised predictors for effort-aware just-in-time defect prediction. If confirmed, these results would lead to a dramatic simplification of a seemingly complex task (data mining) that is widely explored in the software engineering literature.

This paper repeats and refutes those results as follows. (1) There is much variability in the efficacy of the Yang et al. predictors, so even with their approach, some supervised data is required to prune weaker predictors away. (2) Their findings were grouped across N projects. When we repeat their analysis on a project-by-project basis, supervised predictors are seen to work better.

Even though this paper rejects the specific conclusions of Yang et al., we still endorse their general goal. In our experiments, supervised predictors did not perform outstandingly better than unsupervised ones for effort-aware just-in-time defect prediction. Hence, there may indeed be some combination of unsupervised learners that achieves comparable performance to supervised ones. We therefore encourage others to work in this promising area.

KEYWORDS

Data analytics for software engineering, software repository mining, empirical studies, defect prediction

ACM Reference format:
Wei Fu, Tim Menzies. 2017. Revisiting Unsupervised Learning for Defect Prediction. In Proceedings of 2017 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Paderborn, Germany, September 4-8, 2017 (ESEC/FSE'17), 12 pages.
DOI: 10.1145/3106237.3106257

1 INTRODUCTION

This paper repeats and refutes recent results from Yang et al. [54] published at FSE'16. The task explored by Yang et al. was effort-aware just-in-time (JIT) software defect prediction. JIT defect predictors are built at the code change level and can be used to conduct defect prediction right before developers commit the current change. They report an unsupervised software quality prediction method that achieved better results than standard supervised methods. We repeated their study since, if their results were confirmed, this would imply that decades of research into defect prediction [7, 8, 10, 13, 16, 18, 19, 21, 25, 26, 29, 30, 35, 40, 41, 43, 51, 54] had needlessly complicated an inherently simple task.

The standard method for software defect prediction is learning from labelled data. In this approach, the historical log of known defects is learned by a data miner. Note that this approach requires waiting until a historical log of defects is available; i.e., until after the code has been used for a while. Another approach, explored by Yang et al., uses general background knowledge to sort the code, then inspects the code in that sorted order. In their study, they assumed that more defects can be found faster by first looking over all the "smaller" modules (an idea initially proposed by Koru et al. [28]). After exploring various methods of defining "smaller", they report that their approach finds more defects, sooner, than supervised methods.

These results are highly remarkable:
• This approach does not require access to labelled data; i.e., it can be applied as soon as the code is written.
• It is extremely simple: no data pre-processing, no data mining, just simple sorting.

Because of the remarkable nature of these results, this paper takes a second look at the Yang et al. results. We ask three questions:

RQ1: Do all unsupervised predictors perform better than supervised predictors?

The reason we ask this question is that if the answer is "yes", then we can simply select any unsupervised predictor built from the change metrics, as Yang et al. suggested, without using any supervised data; if the answer is "no", then we must apply some technique to select the best predictors and remove the worst ones. However, our results show that, when projects are explored separately, the majority of the unsupervised predictors learned by Yang et al. perform worse than supervised predictors.

The results of RQ1 suggest that, after building multiple predictors using unsupervised methods, it is necessary to prune the worst predictors so that only the better ones are used for future prediction. However, with the Yang et al. approach, there is no way to tell which unsupervised predictors will perform better without access to the labels of the testing data. To test that speculation, we built a new learner, OneWay, that uses supervised training data to remove all but one of the Yang et al. predictors. Using this learner, we asked:

RQ2: Is it beneficial to use supervised data to prune away all but one of the Yang et al. predictors?

Our results showed that OneWay nearly always outperforms the unsupervised predictors found by Yang et al. The success of OneWay leads to one last question:

RQ3: Does OneWay perform better than more complex standard supervised learners?

Such standard supervised learners include Random Forests, Linear Regression, J48 and IBk (these learners were selected based on prior results by [13, 21, 30, 35]). We find that in terms of Recall and Popt (the metric preferred by Yang et al.), OneWay performed better than standard supervised predictors. Yet measured in terms of Precision, there was no advantage to OneWay.

From the above, we reach the opposite conclusion to Yang et al.; i.e., there are clear advantages to using supervised approaches over unsupervised ones. We explain the difference between our results and their results as follows:
• Yang et al. reported averaged results across all projects;
• We offer a more detailed analysis on a project-by-project basis.

The rest of this paper is organized as follows. Section 2 is a commentary on the Yang et al. study and the implications of this paper. Section 3 describes the background and related work on defect prediction. Section 4 explains the effort-aware JIT defect prediction methods investigated in this study. Section 5 describes the experimental settings of our study, including the research questions that motivate our study, the data sets and the experimental design. Section 6 presents the results. Section 7 discusses the threats to the validity of our study. Section 8 presents the conclusion and future work.

Note one terminological convention: in the following, we treat "predictors" and "learners" as synonyms.

2 SCIENCE IN THE 21st CENTURY

While this paper is specific about effort-aware JIT defect prediction and the Yang et al. result, at another level this paper is also about science in the 21st century.

In 2017, the software analytics community has the tools, data sets and experience to explore a far wider range of options. There are practical problems in exploring all those possibilities; specifically, there are too many options. For example, in section 2.5 of [27], Kocaguneli et al. list 12,000+ different ways of estimation by analogy. We have had some recent successes with exploring this space of options [7], but only after the total space of options is reduced by some initial study to a manageable set of possibilities. Hence, what is needed are initial studies that rule out methods that are generally unpromising (e.g., this paper) before we apply a second-level hyper-parameter optimization study to the reduced set of options.

Another aspect of 21st century science that is highlighted by this paper is the nature of repeatability. While this paper disagrees with the conclusions of Yang et al., it is important to stress that their paper is an excellent example of good science that should be emulated in future work.

Firstly, they tried something new. There are many papers in the SE literature about defect prediction. However, compared to most of those, the Yang et al. paper is bold and stunningly original.

Secondly, they made all their work freely available. Using the "R" code they placed online, we could reproduce their results, including all their graphical output, in a matter of days. Further, using that code as a starting point, we could rapidly conduct the extensive experimentation that leads to this paper. This is an excellent example of the value of open science.

Thirdly, while we assert their answers were wrong, the question they asked is important and should be treated as an open and urgent issue by the software analytics community. In our experiments, supervised predictors performed better than unsupervised ones, but not outstandingly better. Hence, there may indeed be some combination of unsupervised learners that achieves comparable performance to supervised ones. Therefore, even though we reject the specific conclusions of Yang et al., we still strongly endorse the question they asked and encourage others to work in this area.

3 BACKGROUND AND RELATED WORK

3.1 Defect Prediction

As soon as people started programming, it became apparent that programming was an inherently buggy process. As recalled by Maurice Wilkes [52], speaking of his programming experiences from the early 1950s: "It was on one of my journeys between the EDSAC room and the punching equipment that 'hesitating at the angles of stairs' the realization came over me with full force that a good part of the remainder of my life was going to be spent in finding errors in my own programs."

It took decades to gather the experience required to quantify the size/defect relationship. In 1971, Fumio Akiyama [2] described the first known "size" law, saying the number of defects D was a function of the number of LOC, where D = 4.86 + 0.018 * LOC.

In 1976, Thomas McCabe argued that the number of LOC was less important than the complexity of that code [33]. He argued that code is more likely to be defective when his "cyclomatic complexity" measure was over 10. Later work used data miners to build defect predictors that proposed thresholds on multiple measures [35].

Subsequent research showed that software bugs are not distributed evenly across a system. Rather, they seem to clump in small corners of the code. For example, Hamill et al. [15] report studies with (a) the GNU C++ compiler where half of the files were never implicated in issue reports while 10% of the files were mentioned in half of the issues. Also, Ostrand et al. [44] studied (b) AT&T data and reported that 80% of the bugs reside in 20% of the files. Similar "80-20" results have been observed in (c) NASA systems [15] as well as (d) open-source software [28] and (e) software from Turkey [37].

Given this skewed distribution of bugs, a cost-effective quality assurance approach is to sample across a software system, then focus on regions reporting some bugs. Software defect predictors built from data miners are one way to implement such a sampling policy. While their conclusions are never 100% correct, they can be used to suggest where to focus more expensive methods such as elaborate manual review of source code [49], symbolic execution checking [46], etc. For example, Misirli et al. [37] report studies where the guidance offered by defect predictors:
• Reduced the effort required for software inspections in some Turkish software companies by 72%;
• While, at the same time, still being able to find the 25% of the files that contain 88% of the defects.

Not only do static code defect predictors perform well compared to manual methods, they are also competitive with certain automatic methods. In a recent study at ICSE'14, Rahman et al. [47] compared (a) the static code analysis tools FindBugs, Jlint, and Pmd with (b) static code defect predictors (which they called "statistical defect prediction") built using logistic regression. They found no significant differences in the cost-effectiveness of these approaches. Given this equivalence, it is significant to note that static code defect prediction can be quickly adapted to new languages by building lightweight parsers that extract static code metrics. The same is not true for static code analyzers: these need extensive modification before they can be used on new languages.

To build such defect predictors, we measure the complexity of software projects using McCabe metrics, Halstead's effort metrics and the CK object-oriented code metrics [5, 14, 20, 33] at a coarse granularity, such as the file or package level. With the collected data instances, along with the corresponding labels (defective or non-defective), we can build defect prediction models using supervised learners such as Decision Tree, Random Forests, SVM, Naive Bayes and Logistic Regression [13, 22-24, 30, 35]. After that, such a trained defect predictor can be applied to predict the defects of future projects.

3.2 Just-In-Time Defect Prediction

Traditional defect prediction has some drawbacks, such as prediction at a coarse granularity and starting at a very late stage of the software development cycle [21]. In the JIT defect prediction paradigm, the defect predictors are built at the code change level, which can easily help developers narrow down the code for inspection, and JIT defect prediction can be conducted right before developers commit the current change. This makes JIT defect prediction a more practical method for practitioners to carry out.

Mockus et al. [38] conducted the first study to predict software failures on a telecommunication software project, 5ESS, by using logistic regression on data sets consisting of change metrics of the project. Kim et al. [25] further evaluated the effectiveness of change metrics on open source projects. In their study, they proposed to apply support vector machines to build a defect predictor based on software change metrics, where on average they achieved 78% accuracy and 60% recall. Since training data might not be available when building the defect predictor, Fukushima et al. [9] introduced the cross-project paradigm into JIT defect prediction. Their results showed that using data from other projects to build a JIT defect predictor is feasible.

Most of the research into defect prediction does not consider the effort required to inspect the code predicted to be defective (here, effort means the time/labor required to inspect the total number of files/code predicted as defective). Exceptions to this rule include the work of Arisholm and Briand [3], Koru et al. [28] and Kamei et al. [21]. Kamei et al. [21] conducted a large-scale study on the effectiveness of JIT defect prediction, where they claimed that, using 20% of the effort required to inspect all changes, their modified linear regression model (EALR) could detect 35% of defect-introducing changes. Inspired by Menzies et al.'s ManualUp model (i.e., smaller modules are inspected first) [36], Yang et al. [54] proposed to build 12 unsupervised defect predictors by sorting the reciprocal values of 12 different change metrics on each testing data set in descending order. They reported that, with 20% effort, many unsupervised predictors perform better than state-of-the-art supervised predictors.

4 METHOD

4.1 Unsupervised Predictors

In this section, we describe the effort-aware just-in-time unsupervised defect predictors proposed by Yang et al. [54], which serve as a baseline method in this study. As described by Yang et al. [54], their simple unsupervised defect predictors are built on the change metrics shown in Table 1. These 14 different change metrics can be divided into 5 dimensions [21]:
• Diffusion: NS, ND, NF and Entropy.
• Size: LA, LD and LT.
• Purpose: FIX.
• History: NDEV, AGE and NUC.
• Experience: EXP, REXP and SEXP.

Table 1: Change metrics used in our data sets.

Metric | Description
NS | Number of modified subsystems [38].
ND | Number of modified directories [38].
NF | Number of modified files [42].
Entropy | Distribution of the modified code across each file [6, 16].
LA | Lines of code added [41].
LD | Lines of code deleted [41].
LT | Lines of code in a file before the current change [28].
FIX | Whether or not the change is a defect fix [11, 55].
NDEV | Number of developers that changed the modified files [32].
AGE | The average time interval between the last and the current change [10].
NUC | The number of unique changes to the modified files [6, 16].
EXP | The developer experience in terms of number of changes [38].
REXP | Recent developer experience [38].
SEXP | Developer experience on a subsystem [38].

The diffusion dimension characterizes how a change is distributed at different levels of granularity. As discussed by Kamei et al. [21], a highly distributed change is harder to keep track of and more likely to introduce defects. The size dimension characterizes the size of a change, and it is believed that software size is related to defect proneness [28, 41]. Yin et al. [55] report that the bug-fixing process can also introduce new bugs; therefore, the FIX metric can be used as a defect evaluation metric. The history dimension includes historical information about the change, which has been proven to be a good defect indicator [32]. For example, Matsumoto et al. [32] find that files previously touched by many developers are likely to contain more defects. The experience dimension describes the experience of the software programmers for the current change, because Mockus et al. [38] show that more experienced developers are less likely to introduce a defect. More details about these metrics can be found in Kamei et al.'s study [21].

In Yang et al.'s study, for each change metric M of the testing data, they build an unsupervised predictor that ranks all the changes by the corresponding value of 1/M(c) in descending order, where M(c) is the value of the selected change metric for each change c. Therefore, changes with smaller change metric values will be ranked higher. In all, for each project, Yang et al. define 12 simple unsupervised predictors (LA and LD are excluded, as in Yang et al. [54]).
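To make this ranking rule concrete, the following is a minimal Python sketch of such an unsupervised predictor. The change representation (a dictionary of metric values) and the function name are illustrative assumptions, not taken from Yang et al.'s code.

# Minimal sketch of a Yang et al.-style unsupervised predictor: rank changes
# by 1/M(c) in descending order, which (for non-negative metrics) is the same
# as sorting by M(c) in ascending order, i.e. "smaller" changes come first.
def unsupervised_rank(changes, metric):
    """Return the changes sorted so that smaller values of `metric` come first."""
    return sorted(changes, key=lambda change: change[metric])

# Illustrative usage: inspect changes with the fewest pre-change lines of code (LT) first.
changes = [
    {"id": 1, "LT": 1200, "NF": 3},
    {"id": 2, "LT": 40, "NF": 1},
    {"id": 3, "LT": 310, "NF": 7},
]
print([c["id"] for c in unsupervised_rank(changes, "LT")])  # -> [2, 3, 1]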

4.2 Supervised Predictors

To further evaluate the unsupervised predictors, we selected some supervised predictors that were already used in Yang et al.'s work.

As reported in both Yang et al.'s [54] and Kamei et al.'s [21] work, EALR outperforms all other supervised predictors for effort-aware JIT defect prediction. EALR is a modified linear regression model [21]: it predicts Y(x)/Effort(x) instead of predicting Y(x), where Y(x) indicates whether the change x is a defect or not (1 or 0) and Effort(x) represents the effort required to inspect this change. Note that this is the same method used to build EALR as in Kamei et al. [21].
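The following is a minimal scikit-learn sketch of this idea. The feature matrix, the effort proxy (approximating Effort(x) by the size of the change) and the function name are our assumptions for illustration, not the exact EALR implementation.

# Sketch of the EALR idea: fit a linear model to the defect density
# Y(x)/Effort(x) and rank test changes by their predicted density.
import numpy as np
from sklearn.linear_model import LinearRegression

def ealr_rank(X_train, y_train, effort_train, X_test):
    """Return indices of the test changes, most defect-dense first."""
    density = y_train / np.maximum(effort_train, 1)  # Y(x) / Effort(x)
    model = LinearRegression().fit(X_train, density)
    return np.argsort(-model.predict(X_test))        # descending predicted density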

In the defect prediction literature, IBk (k-nearest neighbors), J48 and Random Forests are simple yet widely used defect learners and have been shown to perform, if not best, quite well for defect prediction [13, 30, 35, 50]. These three learners are also used in Yang et al.'s study. For these supervised predictors, Y(x) was used as the dependent variable. For the KNN method, we set K = 8, following Yang et al. [54].
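For readers reproducing this setup outside Weka, one possible mapping to scikit-learn analogues is sketched below; apart from K = 8, the parameters shown are library defaults and are our assumptions, not the paper's exact Weka settings.

# Rough scikit-learn stand-ins for the Weka learners used in the study.
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

learners = {
    "RF": RandomForestClassifier(),             # Weka RandomForest
    "J48": DecisionTreeClassifier(),            # C4.5-style decision tree
    "IBk": KNeighborsClassifier(n_neighbors=8)  # K = 8, as in Yang et al.
}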

4.3 OneWay Learner

Based on our preliminary experimental results shown in the following section, for the six projects investigated by Yang et al., some of the 12 unsupervised predictors do perform worse than supervised predictors, and no one predictor consistently works best on all project data. This means we cannot simply say which unsupervised predictor will work for a new project before predicting on the testing data. In this case, we need a technique to select the proper metrics to build defect predictors.

We propose the OneWay learner, which is a supervised predictor built on the implication of Yang et al.'s simple unsupervised predictors. The pseudocode for OneWay is shown in Algorithm 1. In the following description, we use parenthesized numbers to denote line numbers in the pseudocode.

The general idea of OneWay is to use supervised training data to remove all but one of the Yang et al. predictors and then apply this trained learner to the testing data. Specifically, OneWay firstly builds simple unsupervised predictors from each metric on the training data (line 4), then evaluates each of those learners in terms of evaluation metrics (line 5), like Popt, Recall, Precision and F1. After that, if a desired evaluation goal is set, the metric which performs best on the corresponding evaluation goal is returned as the best metric; otherwise, the metric which gets the highest mean score over all evaluation metrics is returned (line 9). (In this study, we use the latter.) Finally, a simple predictor is built only on this best metric (line 10), with the best metric chosen using the training data. Therefore, OneWay builds only one supervised predictor for each project using the local data, instead of 12 predictors built directly on testing data as in Yang et al. [54].

5 EXPERIMENTAL SETTINGS

5.1 Research Questions

Using the above methods, we explore three questions:
• Do all unsupervised predictors perform better than supervised predictors?
• Is it beneficial to use supervised data to prune all but one of the Yang et al. unsupervised predictors?
• Does OneWay perform better than more complex standard supervised predictors?

When reading the results from Yang et al. [54], we find that they aggregate the performance scores of each learner over the six projects, which might miss some information about how learners perform on each project. Do these unsupervised predictors work consistently across all the project data? If not, what does their performance look like? Therefore, in RQ1, we report results for each project separately.

Algorithm 1 Pseudocode for OneWay

Input: data_train, data_test, eval_goal ∈ {F1, Popt, Recall, Precision, ...}
Output: result
 1: function OneWay(data_train, data_test, eval_goal)
 2:   all_scores ← NULL
 3:   for metric in data_train do
 4:     learner ← buildUnsupervisedLearner(data_train, metric)
 5:     scores ← evaluate(learner)
 6:     // scores include all evaluation goals, e.g., Popt, F1, ...
 7:     all_scores.append(scores)
 8:   end for
 9:   best_metric ← pruneFeature(all_scores, eval_goal)
10:   result ← buildUnsupervisedLearner(data_test, best_metric)
11:   return result
12: end function
13: function pruneFeature(all_scores, eval_goal)
14:   if eval_goal == NULL then
15:     mean_scores ← getMeanScoresForEachMetric(all_scores)
16:     best_metric ← getMetric(max(mean_scores))
17:     return best_metric
18:   else
19:     best_metric ← getMetric(max(all_scores[eval_goal]))
20:     return best_metric
21:   end if
22: end function
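As a concrete illustration of Algorithm 1, here is a minimal Python sketch of the OneWay procedure. The change representation (dictionaries of metric values) and the evaluate() helper, assumed to return a dictionary of scores (e.g., Popt, Recall) for a given ranking, are our illustrative assumptions, not the authors' implementation.

# Sketch of OneWay: score one unsupervised ranker per metric on the
# training data, keep the best-scoring metric, then rank the test data
# by that single metric (cf. Algorithm 1).
import statistics

def one_way(train, test, metrics, evaluate, eval_goal=None):
    all_scores = {}
    for m in metrics:
        ranking = sorted(train, key=lambda change: change[m])  # smaller metric first
        all_scores[m] = evaluate(ranking)                      # e.g. {"Popt": .., "Recall": ..}
    if eval_goal is None:
        # no explicit goal: pick the metric with the highest mean score (lines 14-17)
        best = max(all_scores, key=lambda m: statistics.mean(all_scores[m].values()))
    else:
        best = max(all_scores, key=lambda m: all_scores[m][eval_goal])  # lines 18-20
    return sorted(test, key=lambda change: change[best])       # line 10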

Another observation is that, even though Yang et al. [54] propose that simple unsupervised predictors can work better than supervised predictors for effort-aware JIT defect prediction, one missing aspect of their report is how to select the most promising metric to build a defect predictor. This is not an issue when all unsupervised predictors perform well but, as we shall see, this is not the case. As demonstrated below, given M unsupervised predictors, only a small subset can be recommended. Therefore it is vital to have some mechanism by which we can down-select from the M models to the L ≪ M that are useful. Based on this fact, we propose a new method, OneWay, which is the missing link in Yang et al.'s study [54] and the missing final step they do not explore. Therefore, in RQ2 and RQ3, we evaluate how well our proposed OneWay method performs compared to the unsupervised and supervised predictors.

Considering our goals and questions, we reproduce Yang et al.'s results and report them for each project to answer RQ1. For RQ2 and RQ3, we implement our OneWay method and compare it with unsupervised predictors and supervised predictors on different projects in terms of various evaluation metrics.

5.2 Data Sets

In this study, we conduct our experiments using the same data sets as Yang et al. [54]: six well-known open source projects, Bugzilla, Columba, Eclipse JDT, Eclipse Platform, Mozilla and PostgreSQL. These data sets are shared by Kamei et al. [21]. The statistics of the data sets are listed in Table 2. From Table 2, we see that all six data sets cover at least 4 years of historical information, and the longest one is PostgreSQL, which includes 15 years of data.

Table 2: Statistics of the studied data sets

Project | Period | Total Changes | % of Defects | Avg LOC per Change | # Modified Files per Change
Bugzilla | 08/1998 - 12/2006 | 4620 | 36% | 37.5 | 2.3
Platform | 05/2001 - 12/2007 | 64250 | 14% | 72.2 | 4.3
Mozilla | 01/2000 - 12/2006 | 98275 | 5% | 106.5 | 5.3
JDT | 05/2001 - 12/2007 | 35386 | 14% | 71.4 | 4.3
Columba | 11/2002 - 07/2006 | 4455 | 31% | 149.4 | 6.2
PostgreSQL | 07/1996 - 05/2010 | 20431 | 25% | 101.3 | 4.5

The total number of changes for these six data sets ranges from 4450 to 98275, which is sufficient for us to conduct an empirical study. In this study, if a change introduces one or more defects, then this change is considered a defect-introducing change. The percentage of defect-introducing changes ranges from 5% to 36%. All the data and code used in this paper are available online (https://github.com/WeiFoo/RevisitUnsupervised).

5.3 Experimental Design

The following principle guides the design of these experiments: whenever there is a choice between methods, data, etc., we will always prefer the techniques used in Yang et al. [54]. By applying this principle, we ensure that our experimental setup is the same as that of Yang et al. [54]. This increases the validity of our comparisons with that prior work.

When applying data mining algorithms to build predictive models, one important principle is not to test on the data used in training. To avoid that, we used the time-wise-cross-validation method that is also used by Yang et al. [54]. The important aspect of the following experiment is that it ensures that all testing data were created after the training data. Firstly, we sort all the changes in each project based on the commit date. Then all the changes that were submitted in the same month are grouped together. For a given project data set that covers N months of history in total, when building a defect predictor, consider a sliding window of size 6:
• The first two consecutive months of data in the sliding window, the i-th and (i+1)-th, are used as the training data to build the supervised predictors and the OneWay learner.
• The last two months of data in the sliding window, the (i+4)-th and (i+5)-th, which are two months later than the training data, are used as the testing data to test the supervised predictors, the OneWay learner and the unsupervised predictors.
After one experiment, the window slides forward by one month of data. By using this method, each training and testing data set has two months of data, which will include sufficient positive and negative instances for the supervised predictors to learn from. For any project that includes N months of data, we can perform N - 5 different experiments to evaluate our learners when N is greater than 5. For all the unsupervised predictors, only the testing data is used to build the model and evaluate the performance.
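The following is a minimal Python sketch of this time-wise splitting scheme, assuming each change carries a commit-month key (e.g., "2006-03"); the data representation is an illustrative assumption.

# Sketch of time-wise cross-validation: group changes by commit month,
# slide a six-month window, train on months i and i+1, test on months
# i+4 and i+5, then slide the window forward by one month.
def timewise_splits(changes):
    months = sorted({c["month"] for c in changes})
    by_month = {m: [c for c in changes if c["month"] == m] for m in months}
    for i in range(len(months) - 5):                         # N - 5 experiments
        train = by_month[months[i]] + by_month[months[i + 1]]
        test = by_month[months[i + 4]] + by_month[months[i + 5]]
        yield train, test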

To statistically compare the differences between OneWay and the supervised and unsupervised predictors, we use the Wilcoxon signed-rank test to compare the performance scores of the learners in this study, as did Yang et al. [54]. To control the false discovery rate, the Benjamini-Hochberg (BH) adjusted p-value is used to test whether two distributions differ significantly at the 0.05 level [4, 54]. To measure the effect size of the performance scores among OneWay and the supervised/unsupervised predictors, we compute Cliff's δ, which is a non-parametric effect size measure [48]. As Romano et al. suggested, we evaluate the magnitude of the effect size as follows: negligible (|δ| < 0.147), small (0.147 ≤ |δ| < 0.33), medium (0.33 ≤ |δ| < 0.474), and large (0.474 ≤ |δ|) [48].
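As an illustration of this statistical procedure, here is a small Python sketch using SciPy and statsmodels; the helper names and the input format (equal-length score lists per learner, paired across the time-wise runs) are our assumptions.

# Sketch of the comparison: paired Wilcoxon signed-rank tests, Benjamini-
# Hochberg correction of the p-values, and Cliff's delta for effect size.
from itertools import product
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

def cliffs_delta(xs, ys):
    """delta = (#{x > y} - #{x < y}) / (|xs| * |ys|)"""
    gt = sum(1 for x, y in product(xs, ys) if x > y)
    lt = sum(1 for x, y in product(xs, ys) if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def compare(one_way_scores, scores_by_learner, alpha=0.05):
    names = list(scores_by_learner)
    pvals = [wilcoxon(one_way_scores, scores_by_learner[n]).pvalue for n in names]
    reject, _, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return {n: (bool(reject[i]), cliffs_delta(one_way_scores, scores_by_learner[n]))
            for i, n in enumerate(names)}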

5.4 Evaluation Measures

For effort-aware JIT defect prediction, in addition to evaluating how correctly learners predict defect-introducing changes, we have to take into account the effort required to inspect the predicted changes. Ostrand et al. [45] report that, given a project, 20% of the files contain on average 80% of all defects in the project. Although there is nothing magical about the number 20%, it has been used as a cutoff value to set the effort allowed for defect inspection when evaluating defect learners [21, 34, 39, 54]. That is: given 20% of the effort, how many defects can be detected by the learner? To be consistent with Yang et al., in this study we restrict our effort to 20% of the total effort.

To evaluate the performance of effort-aware JIT defect prediction learners in our study, we used the following four metrics: Precision, Recall, F1 and Popt, which are widely used in the defect prediction literature [21, 35, 36, 39, 54, 56]:

Precision = TruePositive / (TruePositive + FalsePositive)
Recall = TruePositive / (TruePositive + FalseNegative)
F1 = (2 * Precision * Recall) / (Recall + Precision)

where Precision denotes the percentage of actual defective changes among all the predicted changes, and Recall is the percentage of predicted defective changes among all actual defective changes. F1 is a measure that combines both Precision and Recall; it is the harmonic mean of Precision and Recall.
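To make the 20%-effort evaluation concrete, here is a minimal Python sketch that scores a ranking at that cutoff. The change representation (an "effort" field for inspection cost and a 0/1 "label" field) is an illustrative assumption, not the paper's exact implementation.

# Sketch: walk down a ranked list of changes until 20% of the total effort
# is spent, treat everything inspected as "predicted defective", then
# compute Precision, Recall and F1 against the true labels.
def scores_at_effort(ranked, effort_cap=0.20):
    total_effort = sum(c["effort"] for c in ranked)
    spent, inspected, tp = 0.0, 0, 0
    for c in ranked:
        if spent + c["effort"] > effort_cap * total_effort:
            break
        spent += c["effort"]
        inspected += 1
        tp += c["label"]                        # label: 1 = defect-introducing
    actual = sum(c["label"] for c in ranked)
    precision = tp / inspected if inspected else 0.0
    recall = tp / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1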

Figure 1: Example of an effort-based cumulative lift chart [54].

The last evaluation metric used in this study is Popt, which is defined as 1 - Δopt, where Δopt is the area between the effort (code-churn-based) cumulative lift charts of the optimal model and the prediction model (as shown in Figure 1). In this chart, the x-axis is the percentage of the required effort to inspect the changes and the y-axis is the percentage of defect-introducing changes found among the selected changes. In the optimal model, all the changes are sorted by the actual defect density in descending order, while for the predicted model, all the changes are sorted by the predicted value in descending order.

According to Kamei et al. and Xu et al. [21, 39, 54], Popt can be normalized as follows:

Popt(m) = 1 - (S(optimal) - S(m)) / (S(optimal) - S(worst))

where S(optimal), S(m) and S(worst) represent the areas under the curves of the optimal model, the predicted model, and the worst model, respectively. Note that the worst model is built by sorting all the changes according to the actual defect density in ascending order. A learner performs better than a random predictor only if its Popt is greater than 0.5.
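The following Python sketch shows one way to compute this normalized Popt from a predicted ranking; the trapezoidal approximation of the areas and the change representation ("effort" and "label" fields) are our assumptions for illustration.

# Sketch of normalized Popt: build the effort-based cumulative lift curve
# for a ranking, measure its area, and normalize against the optimal and
# worst rankings (actual defect density, descending and ascending).
import numpy as np

def lift_area(ranked):
    """Area under %defects-found vs. %effort-spent for a given ordering."""
    effort = np.array([c["effort"] for c in ranked], dtype=float)
    defects = np.array([c["label"] for c in ranked], dtype=float)
    x = np.concatenate(([0.0], np.cumsum(effort) / effort.sum()))
    y = np.concatenate(([0.0], np.cumsum(defects) / max(defects.sum(), 1.0)))
    # trapezoidal rule, written out explicitly
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

def popt(predicted_ranking):
    # assumes a non-degenerate data set (optimal and worst curves differ)
    density = lambda c: c["label"] / max(c["effort"], 1)
    optimal = sorted(predicted_ranking, key=density, reverse=True)
    worst = sorted(predicted_ranking, key=density)
    s_opt, s_m, s_worst = lift_area(optimal), lift_area(predicted_ranking), lift_area(worst)
    return 1 - (s_opt - s_m) / (s_opt - s_worst)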

Note that, following the practices of Yang et al. [54], we measure Precision, Recall, F1 and Popt at the effort = 20% point. In this study, in addition to Popt and ACC (i.e., Recall), which are used in Yang et al.'s work [54], we include the Precision and F1 measures; they provide more insight, from very different perspectives, into all the learners evaluated in this study, as will be shown in the next section.

6 EMPIRICAL RESULTS

In this section, we present the experimental results to investigate how simple unsupervised predictors work in practice and to evaluate the performance of the proposed method, OneWay, compared with supervised and unsupervised predictors.

Before we start, we need a sanity check to see if we can fully reproduce Yang et al.'s results. Yang et al. [54] provide the median values of Popt and Recall for the EALR model and the best two unsupervised models, LT and AGE, from the time-wise cross-validation experiment. Therefore, we use those numbers to check our results.

As shown in Table 3 and Table 4, for the unsupervised predictors LT and AGE, we get exactly the same performance scores on all projects in terms of Recall and Popt. This is reasonable because the unsupervised predictors are very straightforward and easy to implement. For the supervised predictor, EALR, the two implementations do not differ in Popt, while the maximum difference in Recall is only 0.02. Since the differences are quite small, we believe that our implementation reflects the details of EALR and the unsupervised learners in Yang et al. [54].

For the other supervised predictors used in this study, namely J48, IBk, and Random Forests, we use the same algorithms from the Weka package [12] and set the same parameters as used in Yang et al. [54].

RQ1: Do all unsupervised predictors perform better than supervised predictors?

To answer this question, we build four supervised predictors and twelve unsupervised predictors on the six project data sets using the incremental learning method described in Section 5.3.

Figure 2 shows the boxplots of Recall, Popt, F1 and Precision for the supervised and unsupervised predictors on all data sets. For each predictor, the boxplot shows the 25th percentile, median and 75th percentile values for one data set.

Table 3: Comparison in Popt: Yang's method (A) vs. our implementation (B)

Project | EALR A | EALR B | LT A | LT B | AGE A | AGE B
Bugzilla | 0.59 | 0.59 | 0.72 | 0.72 | 0.67 | 0.67
Platform | 0.58 | 0.58 | 0.72 | 0.72 | 0.71 | 0.71
Mozilla | 0.50 | 0.50 | 0.65 | 0.65 | 0.64 | 0.64
JDT | 0.59 | 0.59 | 0.71 | 0.71 | 0.68 | 0.69
Columba | 0.62 | 0.62 | 0.73 | 0.73 | 0.79 | 0.79
PostgreSQL | 0.60 | 0.60 | 0.74 | 0.74 | 0.73 | 0.73
Average | 0.58 | 0.58 | 0.71 | 0.71 | 0.70 | 0.70

Table 4: Comparison in Recall: Yang's method (A) vs. our implementation (B)

Project | EALR A | EALR B | LT A | LT B | AGE A | AGE B
Bugzilla | 0.29 | 0.30 | 0.45 | 0.45 | 0.38 | 0.38
Platform | 0.31 | 0.30 | 0.43 | 0.43 | 0.43 | 0.43
Mozilla | 0.18 | 0.18 | 0.36 | 0.36 | 0.28 | 0.28
JDT | 0.32 | 0.34 | 0.45 | 0.45 | 0.41 | 0.41
Columba | 0.40 | 0.42 | 0.44 | 0.44 | 0.57 | 0.57
PostgreSQL | 0.36 | 0.36 | 0.43 | 0.43 | 0.43 | 0.43
Average | 0.31 | 0.32 | 0.43 | 0.43 | 0.41 | 0.41

The horizontal dashed lines indicate the median of the best supervised predictor, which helps to visualize the median differences between the unsupervised predictors and the supervised predictors.

The colors of the boxes within Figure 2 indicate the significance of the difference between learners:
• Blue indicates that the corresponding unsupervised predictor is significantly better than the best supervised predictor according to the Wilcoxon signed-rank test, where the BH-corrected p-value is less than 0.05 and the magnitude of the difference between these two learners is NOT trivial according to Cliff's delta, i.e., |δ| ≥ 0.147.
• Black indicates that the corresponding unsupervised predictor is not significantly better than the best supervised predictor, or that the magnitude of the difference between these two learners is trivial, i.e., |δ| < 0.147.
• Red indicates that the corresponding unsupervised predictor is significantly worse than the best supervised predictor and the magnitude of the difference between these two learners is NOT trivial.

From Figure 2, we can clearly see that not all unsupervised predictors perform statistically better than the best supervised predictor across all the different evaluation metrics. Specifically, for Recall, on one hand, only 2/12, 3/12, 6/12, 2/12, 3/12 and 2/12 of the unsupervised predictors perform statistically better than the best supervised predictor on the six data sets, respectively. On the other hand, 6/12, 6/12, 4/12, 6/12, 5/12 and 6/12 of the unsupervised predictors perform statistically worse than the best supervised predictor on the six data sets, respectively. This indicates that:
• About 50% of the unsupervised predictors perform worse than the best supervised predictor on any data set;
• Without any prior knowledge, we cannot know which unsupervised predictor(s) work adequately on the testing data.

Figure 2: Performance comparisons between supervised and unsupervised predictors over six projects (from top to bottom: Bugzilla, Platform, Mozilla, JDT, Columba, PostgreSQL). (Boxplot panels: Recall, Popt, F1, Precision; learners on the x-axis: EALR, RF, J48, IBk, NS, ND, NF, Entropy, LT, FIX, NDEV, AGE, EXP, REXP, SEXP, NUC.)

Note that the above two points for Recall also hold for Popt. For F1, we see that only LT on Bugzilla and AGE on PostgreSQL perform statistically better than the best supervised predictor. Other than that, no unsupervised predictor performs better on any data set. Furthermore, surprisingly, no unsupervised predictor works significantly better than the best supervised predictor on any data set in terms of Precision. As we can see, Random Forests performs well on all six data sets. This suggests that unsupervised predictors have very low precision for effort-aware defect prediction and cannot be deployed in any business situation where precision is critical.

Overall, for a given data set, no one specific unsupervised predictor works better than the best supervised predictor across all evaluation metrics. For a given measure, most unsupervised predictors did not perform better across all data sets. In summary:

Not all unsupervised predictors perform better than supervised predictors for each project and for different evaluation measures.

Note the implications of this finding: some extra knowledge is required to prune the worse unsupervised models, such as the knowledge that can come from labelled data. Hence, we must conclude the opposite to Yang et al.; i.e., some supervised labelled data must be applied before we can reliably deploy unsupervised defect predictors on testing data.

RQ2: Is it beneficial to use supervised data to prune away all but one of the Yang et al. predictors?

To answer this question, we compare the OneWay learner with all twelve unsupervised predictors. All these predictors are tested on the six project data sets using the same experimental scheme as in RQ1.

Figure 3 shows the boxplots of the performance distributions of the unsupervised predictors and the proposed OneWay learner on the six data sets across the four evaluation measures. The horizontal dashed line denotes the median value of OneWay. Note that in Figures 3 and 4, blue means the learner is statistically better than OneWay, red means worse, and black means no difference.

Figure 3: Performance comparisons between the proposed OneWay learner and unsupervised predictors over six projects (from top to bottom: Bugzilla, Platform, Mozilla, JDT, Columba, PostgreSQL). (Boxplot panels: Recall, Popt, F1, Precision; learners on the x-axis: NS, ND, NF, Entropy, LT, FIX, NDEV, AGE, EXP, REXP, SEXP, NUC, OneWay.)

As we can see, in Recall, only one unsupervised predictor, LT, outperforms OneWay, in 4/6 data sets. However, OneWay significantly outperforms 9/12, 9/12, 9/12, 10/12, 8/12 and 10/12 of the unsupervised predictors on the six data sets, respectively. This observation indicates that OneWay works significantly better than almost all learners on all six data sets in terms of Recall. Similarly, we observe that only the LT predictor works better than OneWay, in 3/6 data sets, in terms of Popt, and AGE outperforms OneWay only on the Platform data set. For the remaining experiments, OneWay performs better than all the other predictors (on average, 9 out of 12 predictors).

In addition, according to F1, only three unsupervised predictors, EXP/REXP/SEXP, perform better than OneWay, and only on the Mozilla data set; the LT predictor just performs as well as OneWay (and has no advantage over OneWay). Similar findings can be observed for the Precision measure.

Table 5 provides the median values of the best unsupervised predictor compared with OneWay for each evaluation measure on all data sets. Note that, in practice, we cannot know which of the 12 unsupervised predictors built by Yang et al.'s method is the best before we have access to the labels of the testing data. In other words, to aid our analysis, the best unsupervised predictors in Table 5 are selected by referring to the true labels of the testing data, which are not available in practice. In that table, for each evaluation measure, a number in a green cell indicates that the best unsupervised predictor has a large advantage over OneWay according to Cliff's δ; similarly, a yellow cell means a medium advantage and a gray cell means a small advantage.

From Table 5, we observe that, out of 24 experiments over all evaluation measures, none of these best unsupervised predictors outperforms OneWay with a large advantage according to Cliff's δ. Specifically, according to Recall and Popt, even though the best unsupervised predictor, LT, outperforms OneWay on four and three data sets, respectively, all of these advantages are small. Meanwhile, REXP and EXP have a medium improvement over OneWay on one and two data sets for F1 and Precision, respectively.

Figure 4: Performance comparisons between the proposed OneWay learner and supervised predictors over six projects (from top to bottom: Bugzilla, Platform, Mozilla, JDT, Columba, PostgreSQL). (Boxplot panels: Recall, Popt, F1, Precision; learners on the x-axis: EALR, RF, J48, IBk, OneWay.)

Table 5: Best unsupervised predictor (A) vs. OneWay (B). Colored cells indicate the effect size: green for large; yellow for medium; gray for small.

Project | Recall A (LT) | Recall B | Popt A (LT) | Popt B | F1 A (REXP) | F1 B | Precision A (EXP) | Precision B
Bugzilla | 0.45 | 0.36 | 0.72 | 0.65 | 0.33 | 0.35 | 0.40 | 0.39
Platform | 0.43 | 0.41 | 0.72 | 0.69 | 0.16 | 0.17 | 0.14 | 0.11
Mozilla | 0.36 | 0.33 | 0.65 | 0.62 | 0.10 | 0.08 | 0.06 | 0.04
JDT | 0.45 | 0.42 | 0.71 | 0.70 | 0.18 | 0.18 | 0.15 | 0.12
Columba | 0.44 | 0.56 | 0.73 | 0.76 | 0.23 | 0.32 | 0.24 | 0.25
PostgreSQL | 0.43 | 0.44 | 0.74 | 0.74 | 0.24 | 0.29 | 0.25 | 0.23
Average | 0.43 | 0.42 | 0.71 | 0.69 | 0.21 | 0.23 | 0.21 | 0.19

In terms of the average scores, the maximum magnitude of the difference between the best unsupervised learner and OneWay is 0.02. In other words, OneWay is comparable with the best unsupervised predictors on all data sets for all evaluation measures, even though the best unsupervised predictors might not be known before testing.

Overall, we find that (1) no one unsupervised predictor significantly outperforms OneWay on all data sets for a given evaluation measure; (2) mostly, OneWay works as well as the best unsupervised predictor and performs significantly better than almost all unsupervised predictors on all data sets for all evaluation measures. Therefore, the above results suggest:

As a simple supervised predictor, OneWay has competitive performance and it performs better than most unsupervised predictors for effort-aware JIT defect prediction.

Note the implications of this finding: the supervised learning utilized in OneWay can significantly outperform the unsupervised models.

RQ3: Does OneWay perform better than more complex standard supervised predictors?

To answer this question, we compare the OneWay learner with four supervised predictors: EALR, Random Forests, J48 and IBk.

Table 6: Best supervised predictor (A) vs. OneWay (B). Colored cells indicate the effect size: green for large; yellow for medium; gray for small.

Project | Recall A (EALR) | Recall B | Popt A (EALR) | Popt B | F1 A (IBk) | F1 B | Precision A (RF) | Precision B
Bugzilla | 0.30 | 0.36 | 0.59 | 0.65 | 0.30 | 0.35 | 0.59 | 0.39
Platform | 0.30 | 0.41 | 0.58 | 0.69 | 0.23 | 0.17 | 0.38 | 0.11
Mozilla | 0.18 | 0.33 | 0.50 | 0.62 | 0.18 | 0.08 | 0.27 | 0.04
JDT | 0.34 | 0.42 | 0.59 | 0.70 | 0.22 | 0.18 | 0.31 | 0.12
Columba | 0.42 | 0.56 | 0.62 | 0.76 | 0.24 | 0.32 | 0.50 | 0.25
PostgreSQL | 0.36 | 0.44 | 0.60 | 0.74 | 0.21 | 0.29 | 0.69 | 0.23
Average | 0.32 | 0.42 | 0.58 | 0.69 | 0.23 | 0.23 | 0.46 | 0.19

EALR is considered to be the state-of-the-art learner for effort-aware JIT defect prediction [21, 54], and the other three learners have been widely used in the defect prediction literature over the past years [7, 9, 13, 21, 30, 50]. We evaluate all these learners on the six project data sets using the same experimental scheme as in RQ1.

From Figure 4, we have the following observations. Firstly, the performance of OneWay is significantly better than all four supervised predictors in terms of Recall and Popt on all six data sets. Also, EALR works better than Random Forests, J48 and IBk, which is consistent with Kamei et al.'s finding [21].

Secondly, according to F1, Random Forests and IBk perform slightly better than OneWay in two out of six data sets. In most cases, OneWay has a performance similar to these supervised predictors and there is not much difference between them.

However, when reading the Precision scores, we find that, in most cases, the supervised learners perform significantly better than OneWay. Specifically, Random Forests, J48 and IBk outperform OneWay on all data sets and EALR is better on three data sets. This finding is consistent with the observation in RQ1, where all unsupervised predictors perform worse than supervised predictors for Precision.

From Table 6, we have the following observations. First of all, in terms of Recall and Popt, the maximum differences in median values between EALR and OneWay are 0.15 and 0.14, respectively, which are 83% and 23% improvements over 0.18 and 0.60 on the Mozilla and PostgreSQL data sets. For both measures, OneWay improves the average scores by 0.10 and 0.11, which are 31% and 19% improvements over EALR. Secondly, according to F1, IBk outperforms OneWay on three data sets with a large, medium and small advantage, respectively; the largest difference in medians is 0.1. Finally, as discussed before, the best supervised predictor for Precision, Random Forests, has a very large advantage over OneWay on all data sets; the largest difference is 0.46, on the PostgreSQL data set.

Overall, according to the above analysis, we conclude that:

OneWay performs significantly better than all four supervised learners in terms of Recall and Popt; it performs just as well as other learners for F1. As for Precision, other supervised predictors outperform OneWay.

Note the implications of this finding: simple tools like OneWay perform adequately, but for all-around performance, more sophisticated learners are recommended.

As to when to use OneWay or supervised predictors like Random Forests, that is an open question. According to the "No Free Lunch Theorems" [53], no method is always best, and we show that unsupervised predictors are often worse on a project-by-project basis. So "best" predictor selection is a matter of local assessment, requiring labelled training data (an issue ignored by Yang et al.).

7 THREATS TO VALIDITY

Internal Validity. Internal validity is related to uncontrolled aspects that may affect the experimental results. One threat to internal validity is how well our implementation of the unsupervised predictors represents Yang et al.'s method. To mitigate this threat, starting from Yang et al.'s R code, we strictly follow the approach described in their work and test our implementation on the same data sets as in Yang et al. [54]. By comparing the performance scores, we find that our implementation generates the same results. Therefore, we believe we avoid this threat.

External Validity. External validity is related to the possibility of generalizing our results. Our observations and conclusions from this study may not generalize to other software projects. In this study, we use six widely used open source software projects as the subjects. As all these projects are written in Java, we cannot guarantee that our findings will directly generalize to other projects, specifically to software implemented in other programming languages. Therefore, future work might include verifying our findings on other software projects.

In this work, we used the data sets from [21, 54], where a total of 14 change metrics were extracted from the software projects. We build and test the OneWay learner on those metrics as well. However, there might be other metrics, not measured in these data sets, that work well as indicators for defect prediction; for example, when the change was committed (e.g., morning, afternoon or evening), or the functionality of the files modified in the change (e.g., core functionality or not). Such new metrics, not explored in this study, might improve the performance of our OneWay learner.
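For illustration only, such a time-of-day metric could be derived by bucketing each commit timestamp; the sketch below is hypothetical and is not one of the 14 metrics in the data sets we used.

from datetime import datetime

def time_of_day(commit_timestamp: datetime) -> str:
    # Bucket a commit's hour into a coarse time-of-day category that could
    # serve as an additional change metric.
    hour = commit_timestamp.hour
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    return "evening"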

8 CONCLUSION AND FUTURE WORK

This paper replicates and refutes Yang et al.'s results [54] on unsupervised predictors for effort-aware just-in-time defect prediction. Not all unsupervised predictors work better than supervised predictors (on all six data sets, for different evaluation measures). This suggests that we cannot randomly pick an unsupervised predictor to perform effort-aware JIT defect prediction. Rather, it is necessary to use supervised methods to pick the best models before deploying them to a project. For that task, supervised predictors like OneWay are useful for automatically selecting the potentially best model.

In the above, OneWay performed very well for Recall, Popt and F1. Hence, it must be asked: "Is defect prediction inherently simple? And does it need anything other than OneWay?". In this context, it is useful to recall that OneWay's results for Precision were not competitive. Hence we say that, if learners are to be deployed in domains where precision is critical, then OneWay is too simple.

This study opens a new research direction: applying simple supervised techniques to perform defect prediction. As shown in this study, as well as in Yang et al.'s work [54], instead of using traditional machine learning algorithms like J48 and Random Forests, simply sorting data according to one metric can yield a good defect prediction model, at least for effort-aware just-in-time defect prediction (see the sketch below). Therefore, we recommend that future defect prediction research focus more on simple techniques.
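To show just how simple such a predictor is, here is a minimal sketch of a one-metric, effort-aware ranking model in the style of the Yang et al. transform: score each change by 1/M(c) and inspect changes in descending score order until an effort budget (here, 20% of the total changed lines of code) is spent. The field names and the 20% budget are assumptions of this sketch, not the exact code used in our study; OneWay itself additionally uses labelled training data to decide which single metric to sort on.

def rank_and_inspect(changes, metric, effort_budget=0.20):
    # changes: list of dicts, each holding the chosen metric and a "loc"
    # field (lines of code of the change, used here as the effort proxy).
    ranked = sorted(changes, key=lambda c: 1.0 / max(c[metric], 1e-6),
                    reverse=True)
    total_loc = sum(c["loc"] for c in changes) or 1
    inspected, spent = [], 0.0
    for change in ranked:
        if spent / total_loc >= effort_budget:
            break
        inspected.append(change)
        spent += change["loc"]
    return inspected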

For future work, we plan to extend this study to other software projects, especially those developed in other programming languages. After that, we plan to investigate new change metrics to see if they help improve OneWay's performance.

9 ADDENDUM

As this paper was going to press, we learned of new papers that update the Yang et al. study: Liu et al. at EMSE'17 [31] and Huang et al. at ICSME'17 [17]. We thank these authors for the courtesy of sharing pre-prints of those new results. We also thank them for using concepts from a pre-print of our paper in their work3. Regretfully, we have yet to return those favors: due to deadline pressure, we have not been able to confirm their results.

As to the technical specifics, Liu et al. use a single churn measure (the sum of the number of lines added and deleted) to build an unsupervised predictor that does remarkably better than OneWay and EALR (where the latter could access all the variables). While this result is currently unconfirmed, it could well have "raised the bar" for unsupervised defect prediction. Clearly, more experiments are needed in this area. For example, when comparing the Liu et al. methods to OneWay and standard supervised learners, we could (a) give all learners access to the churn variable; (b) apply the Yang transform of 1/M(c) to all variables prior to learning; (c) use more elaborate supervised methods, including synthetic minority over-sampling [1] and automatic hyper-parameter optimization [7].
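To make options (a) and (b) concrete, a hypothetical preprocessing step might look as follows; the column names, the epsilon guard against division by zero, and the choice of Random Forests are assumptions of this sketch rather than code from any of the cited studies.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    # Option (a): add a churn column (lines added plus lines deleted).
    # Option (b): apply the 1/M(c) transform to every metric before learning.
    # Assumes all non-label columns are numeric change metrics.
    df = df.copy()
    df["churn"] = df["lines_added"] + df["lines_deleted"]
    metric_cols = [c for c in df.columns if c != "defective"]
    df[metric_cols] = 1.0 / (df[metric_cols] + 1e-6)
    return df

# Example usage, assuming a "defective" label column:
# data = prepare(pd.read_csv("changes.csv"))
# model = RandomForestClassifier().fit(data.drop(columns="defective"),
#                                      data["defective"])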

ACKNOWLEDGEMENTS

This work is partially funded by NSF award #1302169.

3Our pre-print was posted to arxiv.org on March 1, 2017. We hope more researchers will use tools like arxiv.org to speed along the pace of software research.


REFERENCES

[1] Amritanshu Agrawal and Tim Menzies. "Better data" is better than "better data miners" (benefits of tuning SMOTE for defect prediction). CoRR, abs/1705.03697, 2017.

[2] Fumio Akiyama. An example of software system debugging. In IFIP Congress (1), volume 71, pages 353–359, 1971.

[3] Erik Arisholm and Lionel C Briand. Predicting fault-prone components in a java legacy system. In Proceedings of the 2006 ACM/IEEE International Symposium on Empirical Software Engineering, pages 8–17. ACM, 2006.

[4] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289–300, 1995.

[5] Shyam R Chidamber and Chris F Kemerer. A metrics suite for object oriented design. IEEE Transactions on Software Engineering, 20(6):476–493, 1994.

[6] Marco D'Ambros, Michele Lanza, and Romain Robbes. An extensive comparison of bug prediction approaches. In 2010 7th IEEE Working Conference on Mining Software Repositories, pages 31–41. IEEE, 2010.

[7] Wei Fu, Tim Menzies, and Xipeng Shen. Tuning for software analytics: Is it really necessary? Information and Software Technology, 76:135–146, 2016.

[8] Wei Fu, Vivek Nair, and Tim Menzies. Why is differential evolution better than grid search for tuning defect predictors? arXiv preprint arXiv:1609.02613, 2016.

[9] Takafumi Fukushima, Yasutaka Kamei, Shane McIntosh, Kazuhiro Yamashita, and Naoyasu Ubayashi. An empirical study of just-in-time defect prediction using cross-project models. In Proceedings of the 11th Working Conference on Mining Software Repositories, pages 172–181. ACM, 2014.

[10] Todd L Graves, Alan F Karr, James S Marron, and Harvey Siy. Predicting fault incidence using software change history. IEEE Transactions on Software Engineering, 26(7):653–661, 2000.

[11] Philip J Guo, Thomas Zimmermann, Nachiappan Nagappan, and Brendan Murphy. Characterizing and predicting which bugs get fixed: an empirical study of microsoft windows. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, pages 495–504. ACM, 2010.

[12] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.

[13] Tracy Hall, Sarah Beecham, David Bowes, David Gray, and Steve Counsell. A systematic literature review on fault prediction performance in software engineering. IEEE Transactions on Software Engineering, 38(6):1276–1304, 2012.

[14] Maurice Howard Halstead. Elements of Software Science, volume 7. Elsevier, New York, 1977.

[15] Maggie Hamill and Katerina Goseva-Popstojanova. Common trends in software fault and failure data. IEEE Transactions on Software Engineering, 35(4):484–496, 2009.

[16] Ahmed E Hassan. Predicting faults using the complexity of code changes. In Proceedings of the 31st International Conference on Software Engineering, pages 78–88. IEEE, 2009.

[17] Qiao Huang, Xin Xia, and David Lo. Supervised vs unsupervised models: a holistic look at effort-aware just-in-time defect prediction. In 2017 IEEE International Conference on Software Maintenance and Evolution. IEEE, 2017.

[18] Tian Jiang, Lin Tan, and Sunghun Kim. Personalized defect prediction. In Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering, pages 279–289. IEEE, 2013.

[19] Xiao-Yuan Jing, Shi Ying, Zhi-Wu Zhang, Shan-Shan Wu, and Jin Liu. Dictionary learning based software defect prediction. In Proceedings of the 36th International Conference on Software Engineering, pages 414–423. ACM, 2014.

[20] Dennis Kafura and Geereddy R. Reddy. The use of software complexity metrics in software maintenance. IEEE Transactions on Software Engineering, (3):335–343, 1987.

[21] Yasutaka Kamei, Emad Shihab, Bram Adams, Ahmed E Hassan, Audris Mockus, Anand Sinha, and Naoyasu Ubayashi. A large-scale empirical study of just-in-time quality assurance. IEEE Transactions on Software Engineering, 39(6):757–773, 2013.

[22] Taghi M Khoshgoftaar and Edward B Allen. Modeling software quality with. Recent Advances in Reliability and Quality Engineering, 2:247, 2001.

[23] Taghi M Khoshgoftaar and Naeem Seliya. Software quality classification modeling using the SPRINT decision tree algorithm. International Journal on Artificial Intelligence Tools, 12(03):207–225, 2003.

[24] Taghi M Khoshgoftaar, Xiaojing Yuan, and Edward B Allen. Balancing misclassification rates in classification-tree models of software quality. Empirical Software Engineering, 5(4):313–330, 2000.

[25] Sunghun Kim, E James Whitehead Jr, and Yi Zhang. Classifying software changes: Clean or buggy? IEEE Transactions on Software Engineering, 34(2):181–196, 2008.

[26] Sunghun Kim, Thomas Zimmermann, E James Whitehead Jr, and Andreas Zeller. Predicting faults from cached history. In Proceedings of the 29th International Conference on Software Engineering, pages 489–498. IEEE, 2007.

[27] Ekrem Kocaguneli, Tim Menzies, Ayse Bener, and Jacky W Keung. Exploiting the essential assumptions of analogy-based effort estimation. IEEE Transactions on Software Engineering, 38(2):425–438, 2012.

[28] A Gunes Koru, Dongsong Zhang, Khaled El Emam, and Hongfang Liu. An investigation into the functional form of the size-defect relationship for software modules. IEEE Transactions on Software Engineering, 35(2):293–304, 2009.

[29] Taek Lee, Jaechang Nam, DongGyun Han, Sunghun Kim, and Hoh Peter In. Micro interaction metrics for defect prediction. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, pages 311–321. ACM, 2011.

[30] Stefan Lessmann, Bart Baesens, Christophe Mues, and Swantje Pietsch. Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering, 34(4):485–496, 2008.

[31] Jinping Liu, Yuming Zhou, Yibiao Yang, Hongmin Lu, and Baowen Xu. Code churn: A neglected metric in effort-aware just-in-time defect prediction. In 2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. IEEE, 2017.

[32] Shinsuke Matsumoto, Yasutaka Kamei, Akito Monden, Ken-ichi Matsumoto, and Masahide Nakamura. An analysis of developer metrics for fault prediction. In Proceedings of the 6th International Conference on Predictive Models in Software Engineering, page 18. ACM, 2010.

[33] Thomas J McCabe. A complexity measure. IEEE Transactions on Software Engineering, (4):308–320, 1976.

[34] Thilo Mende and Rainer Koschke. Effort-aware defect prediction models. In 2010 14th European Conference on Software Maintenance and Reengineering, pages 107–116. IEEE, 2010.

[35] Tim Menzies, Jeremy Greenwald, and Art Frank. Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering, 33(1), 2007.

[36] Tim Menzies, Zach Milton, Burak Turhan, Bojan Cukic, Yue Jiang, and Ayse Bener. Defect prediction from static code features: current results, limitations, new approaches. Automated Software Engineering, 17(4):375–407, 2010.

[37] Ayse Tosun Misirli, Ayse Bener, and Resat Kale. AI-based software defect predictors: Applications and benefits in a case study. AI Magazine, 32(2):57–68, 2011.

[38] Audris Mockus and David M Weiss. Predicting risk of software changes. Bell Labs Technical Journal, 5(2):169–180, 2000.

[39] Akito Monden, Takuma Hayashi, Shoji Shinoda, Kumiko Shirai, Junichi Yoshida, Mike Barker, and Kenichi Matsumoto. Assessing the cost effectiveness of fault prediction in acceptance testing. IEEE Transactions on Software Engineering, 39(10):1345–1357, 2013.

[40] Raimund Moser, Witold Pedrycz, and Giancarlo Succi. A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In Proceedings of the 30th International Conference on Software Engineering, pages 181–190. ACM, 2008.

[41] Nachiappan Nagappan and Thomas Ball. Use of relative code churn measures to predict system defect density. In Proceedings of the 27th International Conference on Software Engineering, pages 284–292. IEEE, 2005.

[42] Nachiappan Nagappan, Thomas Ball, and Andreas Zeller. Mining metrics to predict component failures. In Proceedings of the 28th International Conference on Software Engineering, pages 452–461. ACM, 2006.

[43] Jaechang Nam, Sinno Jialin Pan, and Sunghun Kim. Transfer defect learning. In Proceedings of the 2013 International Conference on Software Engineering, pages 382–391. IEEE, 2013.

[44] Thomas J Ostrand, Elaine J Weyuker, and Robert M Bell. Where the bugs are. In ACM SIGSOFT Software Engineering Notes, volume 29, pages 86–96. ACM, 2004.

[45] Thomas J Ostrand, Elaine J Weyuker, and Robert M Bell. Predicting the location and number of faults in large software systems. IEEE Transactions on Software Engineering, 31(4):340–355, 2005.

[46] Corina S Pasareanu, Peter C Mehlitz, David H Bushnell, Karen Gundy-Burlet, Michael Lowry, Suzette Person, and Mark Pape. Combining unit-level symbolic execution and system-level concrete execution for testing NASA software. In Proceedings of the 2008 International Symposium on Software Testing and Analysis, pages 15–26. ACM, 2008.

[47] Foyzur Rahman, Sameer Khatri, Earl T Barr, and Premkumar Devanbu. Comparing static bug finders and statistical prediction. In Proceedings of the 36th International Conference on Software Engineering, pages 424–434. ACM, 2014.

[48] Jeanine Romano, Jeffrey D Kromrey, Jesse Coraggio, Jeff Skowronek, and Linda Devine. Exploring methods for evaluating group differences on the NSSE and other surveys: Are the t-test and Cohen's d indices the most appropriate choices? In Annual Meeting of the Southern Association for Institutional Research, 2006.

[49] Forrest Shull, Ioana Rus, and Victor Basili. Improving software inspections by using reading techniques. In Proceedings of the 23rd International Conference on Software Engineering, pages 726–727. IEEE, 2001.

[50] Burak Turhan, Tim Menzies, Ayse B Bener, and Justin Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, 14(5):540–578, 2009.

[51] Song Wang, Taiyue Liu, and Lin Tan. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering, pages 297–308. ACM, 2016.


[52] Maurice V Wilkes. Memoirs of a Computer Pioneer. Cambridge, Mass., London, 1985.

[53] David H Wolpert. The supervised learning no-free-lunch theorems. In Soft Computing and Industry, pages 25–42. Springer, 2002.

[54] Yibiao Yang, Yuming Zhou, Jinping Liu, Yangyang Zhao, Hongmin Lu, Lei Xu, Baowen Xu, and Hareton Leung. Effort-aware just-in-time defect prediction: simple unsupervised models could be better than supervised models. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 157–168. ACM, 2016.

[55] Zuoning Yin, Ding Yuan, Yuanyuan Zhou, Shankar Pasupathy, and Lakshmi Bairavasundaram. How do fixes become bugs? In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, pages 26–36. ACM, 2011.

[56] Thomas Zimmermann, Rahul Premraj, and Andreas Zeller. Predicting defects for eclipse. In Proceedings of the Third International Workshop on Predictor Models in Software Engineering, page 9. IEEE Computer Society, 2007.