Roughly Balanced Bagging for Imbalanced Data

Shohei Hido1,2,∗, Hisashi Kashima1† and Yutaka Takahashi2

1 IBM Research-Tokyo Research Laboratory, 1623-14 Shimo-Tsuruma, Yamato-shi, Kanagawa 242-8502, Japan

2 Department of Systems Science, Kyoto University, Yoshida-Honmachi, Kyoto 606-8501, Japan

Received 12 May 2008; revised 5 October 2009; accepted 5 October 2009. DOI: 10.1002/sam.10061

Published online 19 November 2009 in Wiley InterScience (www.interscience.wiley.com).

Abstract: The class imbalance problem appears in many real-world applications of classification learning. We propose an ensemble algorithm, "Roughly Balanced (RB) Bagging", using a novel sampling technique to improve the original bagging algorithm for data sets with skewed class distributions. With this sampling method, the numbers of samples in the largest and smallest classes differ, but they are effectively balanced when averaged over all of the subsets, which supports the approach of bagging in a more appropriate way. Individual models in RB Bagging tend to show larger diversity, which is one of the keys of ensemble models, compared with existing bagging-based methods for imbalanced data that use exactly the same number of majority and minority examples for every training subset. In addition, the proposed method makes full use of all of the minority examples by under-sampling, which is efficiently done by using negative binomial distributions. Numerical experiments using benchmark and real-world data sets demonstrate that RB Bagging shows better performance than the existing "balanced" methods and other common methods for the area under the ROC curve (AUC), which is a widely used metric in the class imbalance problem. 2009 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 2: 412–426, 2009

Keywords: imbalanced data; bagging; resampling; negative binomial distribution

1. INTRODUCTION

In real-world applications, we often encounter data sets whose class distribution is highly skewed. For example, the number of sick people is much smaller than that of healthy people in medical diagnosis data for normal populations. Similarly, the number of fraudulent actions is much smaller than that of normal transactions in credit card usage data. When a prediction model is trained on such an imbalanced data set, it tends to show a strong bias toward the majority class, since typical learning algorithms aim to maximize the overall prediction accuracy. In fact, if 95% of the entire data set belongs to the majority class, the model might ignore the remaining 5% of examples from the minority class and predict that all of the test examples are in the majority class. Even though the accuracy will be 95%, the examples of the minority class will be completely misclassified. The misclassification cost for the minority class, however, is usually much higher than that of the majority class and should not be ignored. In addition, the class distribution in a test set may be different from that of the imbalanced training set. In such cases the trained model will perform poorly on the test set.

Correspondence to: Shohei Hido ([email protected]). † Currently with: Department of Mathematical Informatics, University of Tokyo, Tokyo 113-8656, Japan

To address this practically important problem, many studies have been conducted to improve learning algorithms for imbalanced data [1]. Most can be classified as resampling methods [2,3], boosting-based algorithms [4,5], or cost-sensitive ensemble approaches [6–8]. Bagging [9] has also been applied to classification problems with class imbalances. However, the primary technique used in the previous work involves correcting the skewness of the class distribution in each sampled subset by using under-sampling or over-sampling. For example, the under-sampling methods typically sample a subset of the majority class data so that its size is equal to the size of the minority class data, and the methods use all of the data from the minority class. Because every derived subset includes exactly the same number of majority and minority examples, the trained model performs equally for both classes. While such a strategy seems intuitive and reasonable at first sight, we believe that it does not truly reflect the philosophy of bagging, as the original bagging algorithm uses bootstrap sampling from the whole data set independently of the class labels. Although the average class ratio of the sampled subsets in bagging agrees with the original class distribution, the class ratio of each subset varies according to the binomial distribution (or the multinomial distribution for multi-class classification), so that the trained predictive models in the ensemble have different class preferences and disagree on some test samples. This kind of diversity within multiple models, in fact, has been shown to be the key to success for ensemble-based learning algorithms [10–12]. In contrast, in the existing bagging-based methods for imbalanced data, every subset has exactly the same class distribution as the desired (typically uniform) distribution, so the learned models will be less diverse. This observation suggests that we can improve the performance of bagging for imbalanced data by following the philosophy of bagging in a more appropriate way. To the best of our knowledge, there is no previous work addressing this issue empirically or theoretically.

Our contributions in this paper are summarized as follows:

• A new under-sampling technique using a negative binomial distribution for bagging on imbalanced data.

• Justification of the proposed sampling method in terms of an approximation of the original bagging algorithm.

• An empirical evaluation of the bagging-based algorithms based on the diversity of base models.

• Extensive experimental evidence that Roughly Balanced (RB) Bagging works better under an appropriate metric, the area under the ROC curve (AUC).

The rest of this paper is organized as follows. Section 2 provides the background on the class imbalance problem. We explain RB Bagging and its sampling technique in Section 3. First we introduce the bagging-based algorithm that generates training subsets and builds an ensemble model. Next we address how to equalize the probability of choosing each class for a sample, rather than the sample size. This technique allows the class distributions of the subsets to become slightly imbalanced and different from one another. In Section 4, we derive the interpretation of our sampling method as an approximation of bagging for the class imbalance problem and discuss the diversity of ensemble algorithms. Section 4 also describes the performance metrics and the learning algorithms used for comparison in the evaluation. In Section 5, we examine the diversity of RB Bagging using artificial data sets and evaluate the performance of RB Bagging in experiments using benchmark and real-world data sets. Section 7 gives a brief summary and considers future work.

2. CLASS IMBALANCE PROBLEM

Real-world data often has the problem of class imbalance, or skewed class distribution, in which the examples of the majority class outnumber the minority examples. This imbalance results not only from skewness in the class prior distributions, but also from sampling bias. In many cases, the fraction of the minority class data can be 10% or even less. With such imbalanced sets of training data, supervised classifiers usually face difficulty in the prediction of data with the minority class label. Because their objective is typically to maximize the overall prediction accuracy, their predictions are strongly biased toward the majority class.

Table I. Imbalanced confusion matrix.

                     Predicted negative         Predicted positive
Actual negative      True negative (TN): 400    False positive (FP): 0
Actual positive      False negative (FN): 20    True positive (TP): 4

Table I shows an example of this problem as a confusion matrix. Each cell represents the number of examples. The rows show the sample sizes of the actual classes in a test set, and the columns represent the classes predicted by a classification model. In this case all of the majority examples are predicted correctly. Although the prediction accuracy is higher than 95%, most of the minority examples are misclassified as belonging to the majority class.

In this paper, we focus on such binary classification problems. We assume that the negative examples and the positive examples belong to the majority class and the minority class, respectively. Let us denote by x_i an input feature vector of d-dimensional real-valued or nominal-valued variables, by y_i ∈ {neg, pos} the class label of x_i, and by D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)} the training data set whose class labels are highly skewed.

In summary, the following definition describes our goal.

DEFINITION 2.1 (Class Imbalance Problem) Given the imbalanced data set D, construct a prediction model that performs well when evaluated by performance metrics that balance the accuracies for both class labels.

The "performance metrics" in the definition will be described in Section 4.3.

3. ROUGHLY BALANCED BAGGING

3.1. Bagging Algorithm

Breiman proposed the original bagging algorithm, which is a state-of-the-art ensemble-based algorithm, and showed that the ensemble model generally performs better than a single model [9]. Bagging samples training subsets from the entire data set with replacement, builds multiple base learners, and aggregates the outputs of the base learners to make the final predictions.


We describe the original bagging algorithm, as we use it as a baseline for our algorithm. Let K denote the number of base learners. A data set D is converted into K training subsets of equal size, {D^1, D^2, . . . , D^K}, using bootstrap sampling. Let f^k(x) be the base model trained on the k-th subset D^k, and f^A(x) be the final ensemble model. The output of f^A(x) is given by aggregating the set of base models {f^1(x), f^2(x), . . . , f^K(x)}. Although f^A(x) is originally determined by voting using the predicted class labels, we use the average of the estimated probabilities p^k(y|x) for all of the bagging-based algorithms:

$$ f^A(x_i) = \frac{1}{K} \sum_{k=1}^{K} p^k(y_i \mid x_i). $$

The averaged probability of the ensemble models is known to be a good estimate of the true posterior probability and often reduces the loss more than using only the labels in voting [13]. This also allows us to effectively evaluate the performance with performance metrics based on real-valued outputs in the experiments.
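To make this probability-averaging aggregation concrete, the following is a minimal sketch of the original bagging baseline, with scikit-learn decision trees standing in for the C4.5 base learner; the function names and library choice are ours, not part of the paper, and X, y are assumed to be numpy arrays.

```python
# Minimal sketch of bagging with probability averaging (Section 3.1).
# scikit-learn trees stand in for the C4.5 base learner; helper names are ours.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagging(X, y, K=100, seed=0):
    """Train K trees, each on a bootstrap sample of the whole data set."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)  # bootstrap: n draws with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def predict_proba_avg(models, X):
    """f_A(x) = (1/K) sum_k p_k(y|x): average the estimated class probabilities."""
    # assumes every bootstrap sample contained both classes, so the probability
    # matrices returned by the trees have matching columns
    return np.mean([m.predict_proba(X) for m in models], axis=0)
```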

There are several studies which explain how bagging improves the predictive performance through the reduction of the variance term of the mean squared error. The amount of improvement depends on the bias-variance decomposition for the base learners, which suggests that an unstable model with high variance, such as a decision tree, is preferable as the base learner for bagging rather than a stable one, such as a logistic regression model [9].

In contrast to boosting-based algorithms for imbalanced data sets [4,5], bagging has seemed less attractive, as its simple strategy leaves little room for handling class imbalance except for changing the number of bags K and the subset sampling strategies. Tao et al. proposed a balanced sampling approach that performs bootstrap sampling only on the negative examples, so that the sample size is equal to the number of positive examples in the original data set, while keeping all of the positive examples in every training subset [14]. However, the original bagging algorithm chooses each bootstrap sample independently of the class labels, and then the class distribution of each subset is not always the same as the original class distribution. Therefore, it is unclear whether or not the aggregated model based on such exactly balanced subsets preserves the advantage of the original bagging algorithm.

3.2. Algorithm

We seek to provide a natural extension of the original bagging algorithm for imbalanced data sets that reflects the philosophy of bagging in a more appropriate way. We believe that the proposed approach will perform better in imbalanced data domains. The class distribution in the sampled subsets should be corrected to build base learners that can perform fairly for both of the classes. Also, we need to avoid the information loss caused by the usual bootstrap sampling of the rare positive examples.

We propose RB Bagging to address these requirements. Figure 1 describes the algorithm. As with the original bagging algorithm, the input to the algorithm consists of a training data set D, a base learner L, and a constant parameter K for the number of base learners. The important point is how to determine the sample size for each of the classes. We set the number of positive samples equal to the number in the original data set. If they are sampled without replacement, all of the positive examples will be contained in all of the sampled subsets. In contrast, the number of negative samples is determined probabilistically using a negative binomial distribution, whose parameters are the number of minority (i.e., positive) examples and the probability of success q = 0.5, which will be discussed in the next subsection. The key is that the examples of both classes are drawn with equal probability, but only the size of the negative samples varies, and the number of positive samples is kept constant. In terms of the resampling scheme, though the original bagging algorithm uses sampling with replacement in the spirit of bootstrapping, one can also use sampling without replacement. Buhlmann and Yu proposed an extension of the bagging algorithm that uses sampling without replacement and showed that the prediction performance was similar to that of the original bagging algorithm [15]. Friedman and Hall gave a theoretical background showing that without-replacement bagging can give an identical variance-reduction effect and perform similarly [16]. Thus we use both sampling methods, with and without replacement. In prediction, the aggregated model simply outputs the class label with the highest average probability as estimated by the base models.

Fig. 1 The RB Bagging algorithm.
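A minimal sketch of this procedure is given below, again with scikit-learn trees in place of C4.5. The negative binomial draw for the majority sample size follows the next subsection; the helper names, label convention (1 = positive/minority, 0 = negative/majority), and library choice are ours, not the paper's.

```python
# Sketch of RB Bagging as outlined in Fig. 1. Positive class = minority (label 1),
# negative class = majority (label 0); helper names and library choice are ours.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rb_bagging(X, y, K=100, replace=False, seed=0):
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_pos = len(pos_idx)
    models = []
    for _ in range(K):
        # majority sample size from a negative binomial with n = #positives, q = 0.5
        m = rng.negative_binomial(n_pos, 0.5)
        if not replace:
            m = min(m, len(neg_idx))   # cannot exceed |D_neg| without replacement
        pos = rng.choice(pos_idx, size=n_pos, replace=True) if replace else pos_idx
        neg = rng.choice(neg_idx, size=m, replace=replace)
        idx = np.concatenate([pos, neg])
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def rb_predict(models, X):
    # aggregate by averaging the estimated probabilities, then take the argmax class
    return np.mean([m.predict_proba(X) for m in models], axis=0).argmax(axis=1)
```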

3.3. Negative Binomial Sampling

When a subset is chosen by bootstrap sampling over two equal-sized data sets, one of which belongs to the positive class and the other to the negative class, the class distribution in the resultant subset varies, and it follows the binomial distribution with probability 0.5. When we are given a data set with class imbalance, to sample an equally balanced subset, we choose each class with probability 0.5, draw a sample uniformly at random from the data belonging to the chosen class, and repeat this procedure until the size of the sampled subset reaches a given constant. However, this sampling method cannot control the size of the minority samples. In order to utilize all of the minority class data, we use the negative binomial distribution. Given the number of successes n, the number of failures m in Bernoulli trials obeys a negative binomial distribution, which is defined by the probability mass function

$$ p(m \mid n) = \binom{m+n-1}{m} q^{n} (1-q)^{m}, \qquad (1) $$

where q is the probability of success. For our purpose, we set q = 0.5. Note that the negative binomial distribution with integer parameter n is also called the Pascal distribution.

Figure 2 shows the distribution of m with q = 0.5 and n = 10. We observe that m falls around n = 10 with high probability. When we use sampling without replacement, there is a small chance of the sample size m being larger than the number of data objects in the majority class (|D_neg|). In such a case, we simply set m = |D_neg|.

Fig. 2 Distribution of the negative binomial distribution (q = 0.5, n = 10).
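As a quick check of this sampling step, the sketch below draws the majority-class sample size m from Eq. (1) with numpy and applies the capping rule just described; the majority-class size used here is a hypothetical value, not one from the paper.

```python
# Drawing the majority-class sample size m from the negative binomial of Eq. (1).
import numpy as np

rng = np.random.default_rng(0)
n, q = 10, 0.5                # n = number of minority examples, q = 0.5 (Fig. 2 setting)
size_Dneg = 1000              # hypothetical majority-class size

m = rng.negative_binomial(n, q, size=100_000)  # failures before the n-th success
m = np.minimum(m, size_Dneg)                   # cap when sampling without replacement
print(m.mean())                                # close to n*(1-q)/q = 10: balanced on average
```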

4. DISCUSSION

4.1. Justification

Our sampling method essentially depends on a widely used concept called under-sampling, which tries to make full use of the minority class examples. At the same time, we also aim to straightforwardly extend the original bagging algorithm to handle the class imbalance problem by using negative binomial sampling.

Let us review the details of subset sampling in the original bagging algorithm. Basically, its bootstrap sampling method (sampling with replacement) draws an i.i.d. sample of a fixed size N (usually, N = |D|) from the data set, as a sampling method based on the empirical joint distribution p(x, y). The point is that the bootstrap sampling can be separated into two steps by conditioning as p(x, y) = p(x|y)p(y). First, we determine the sample size of each class according to the class prior probability p(y); the sample size of each class follows a binomial distribution. Let q denote the class ratio of the class y (y ∈ {neg, pos}). Then the distribution of the sample size |D^k_y| is represented as:

$$ p(|D^k_y| = m \mid N) = \binom{N}{m} q^{m} (1-q)^{N-m}. \qquad (2) $$

Next, we draw |D^k_y| samples according to the conditional probability p(x|y), which is done by choosing |D^k_y| samples independently with replacement from the examples belonging to the class y.

To address the class imbalance problem, we perform sampling under a balanced class distribution. The simplest way to do this is to use the balanced prior p′(y = neg) = p′(y = pos) = 0.5 instead of the true p(y) in the first step. Then the sample size of each class varies according to the balanced binomial distribution of Eq. (2) with q = 0.5, where the numbers of positive and negative samples are equal on average. This is a direct extension of the original bagging algorithm to cope with class imbalance.

However, none of the previous work has adjusted the bootstrap sampling in this way. Because the simple implementation described above cannot guarantee the number of positive examples in each subset, some of the training subsets might contain only a small fraction of the positive examples, causing the base models to work poorly for the positive class. Instead, the existing algorithms fix the sample size and draw the same number of samples from both classes for each subset. This means that the number of samples from each class is replaced by its expected value instead of randomly choosing the sample size. Although such an exactly balanced sampling method performs better for imbalanced data sets than the original bagging algorithm, it still seems to be an open problem to explain why it works. The difficulty we face is that there is a conflict between the bootstrap sampling and the under-sampling. To the best of our knowledge, there is no extension that satisfies both requirements simultaneously.

To resolve the conflict, we ease the restriction of equal-sized subsets. Our sampling strategy can be interpreted as a repeated instance-by-instance sampling method that probabilistically selects the class based on the balanced class prior p′(y) = 0.5 for both classes y and draws one example from the selected class, until the total number of sampled positive examples reaches the number of positive examples in the data set. Using this technique, we can implement the balanced sampling using p′(y) = 0.5 and make full use of all the positive examples at the same time. Actually, the method is equivalent to a bootstrap sampling method in which the sample size of each class is selected according to a negative binomial distribution with p′(y) = 0.5. Although the size of the subsets differs slightly from subset to subset, their class distribution is almost balanced on average, which is the same as with the original bagging algorithm. In addition, there is only a slight drawback in computational cost for RB Bagging compared with exactly balanced bagging, as the average sample size is the same.
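The equivalence claimed here can be checked numerically: flipping a fair coin for the class on each draw and stopping once the positive count reaches the number of positive examples produces a negative-binomially distributed number of negatives. The small simulation below uses an illustrative n_pos of our own choosing.

```python
# Instance-by-instance view: pick a class with p'(y) = 0.5 per draw and stop when
# n_pos positives have been drawn; the negative count follows Eq. (1).
import numpy as np

rng = np.random.default_rng(1)
n_pos, trials = 20, 50_000      # illustrative values, not from the paper

def negatives_until_n_pos():
    pos = neg = 0
    while pos < n_pos:
        if rng.random() < 0.5:  # balanced class prior p'(y = pos) = 0.5
            pos += 1
        else:
            neg += 1
    return neg

sim = np.array([negatives_until_n_pos() for _ in range(trials)])
direct = rng.negative_binomial(n_pos, 0.5, size=trials)
print(sim.mean(), direct.mean())   # both close to n_pos = 20
```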

On the basis of this analysis we can say that, by using the negative binomial sampling, RB Bagging preserves both the nature of the original bagging algorithm and the information of all of the minority class examples.

4.2. Variance Reduction and Diversity

Since ensemble models and bagging were first proposed, their advantage has often been described in terms of a decomposition of the mean squared error (MSE; see Section 4.3) into bias, variance, and Bayes error:

$$ \mathrm{MSE}(f) = \mathrm{bias}(f) + \mathrm{variance}(f) + \text{Bayes error}. $$

The bias and variance of the aggregated model f^A also depend on the base model f. It is known that the aggregation strategy of bagging has the effect of canceling the variance of the base learners rather than their bias, and thus reduces the overall variance and MSE. Thus bagging works well with base learners whose variance tends to be large but whose bias is low, as with decision trees. Therefore, we would like to evaluate the values of the bias and variance while considering the class imbalance for each bagging-based algorithm:

$$ \mathrm{bias}(f^A) = E\big[p(y \mid x) - E[f^A(x)]\big]^2, $$
$$ \mathrm{variance}(f^A) = E\big[(E[f^A(x)] - f^A(x))^2\big]. $$

In order to calculate the bias, however, we have to know the true conditional probability p(y|x), which is never available for real-world data sets. In Section 5.2, therefore, we use artificial data sets in which p(y|x) can be defined explicitly to generate the data samples.

There have been some previous reports that examined the advantages of ensemble models in terms of diversity rather than the bias-variance decomposition. In these reports, diversity means that the base predictive models in an ensemble tend to disagree on some test samples. This seems to be important for ensemble algorithms, as there is no chance to improve the prediction performance if the base models work similarly and always agree in their predictions. For the regression problem, Krogh et al. showed that ensemble algorithms can reduce the generalization error only if they have diversity [11]. The generalization error of an ensemble model E satisfies E = Ē − Ā, where Ē and Ā represent the mean generalization error and the ambiguity (i.e., diversity) of the individual models. Although such a simple relationship does not hold for the 0/1 loss function in classification problems, some metrics of diversity have been proposed. For example, Cunningham et al. proposed an entropy-based metric for the diversity of a classifier ensemble [10]. Kuncheva et al. conducted a comprehensive empirical evaluation of some metrics and showed that they have strong relationships to the improvement of prediction accuracy [12].

In Section 5.2, we will present experimental results using artificial data sets to evaluate the variance and diversity of the bagging-based algorithms. The fluctuation of the class ratios of the training subsets in RB Bagging might result in larger diversity in the final ensemble model in comparison to the exactly balanced approach. We use the variance of the predictions to measure the diversity and the effectiveness of the bagging-based algorithms. In addition, we also compute the mean and variance of the AUC values (see Section 4.3) of the base classifiers over all of the test samples, as we focus on the AUC as the primary metric in our experiments using benchmark and real-world data sets. Although it is expected that a larger diversity of the ensemble model will also lead to a larger variance in the AUC, the relationship between the AUC values of the base models and the overall AUC of the aggregated predictions seems unclear. To the best of our knowledge, this is the first attempt to estimate the effectiveness of ensemble models using the variance of the AUC values.
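A sketch of this measurement, assuming base models trained as in the earlier RB Bagging sketch and scikit-learn's roc_auc_score; the helper name is ours.

```python
# Diversity measurement used in Section 5.2: mean and variance of the per-model
# AUC values, plus the AUC of the aggregated (probability-averaged) prediction.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_diversity(models, X_test, y_test):
    scores = np.array([m.predict_proba(X_test)[:, 1] for m in models])
    aucs = np.array([roc_auc_score(y_test, s) for s in scores])
    overall = roc_auc_score(y_test, scores.mean(axis=0))
    return aucs.mean(), aucs.var(), overall
```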


4.3. Performance Metrics

We review seven performance metrics commonly used for the evaluation of methods for the imbalanced class problem. Let us denote by D_test the test set (N_test = |D_test|). We assume that the prediction results are given in the same form as in Table I. We use the estimated posterior probability p(y_i|x_i) to compute the error metrics (MSE and ISE), the ROC curve, and the AUC.

4.3.1. Prediction accuracy

The prediction accuracy represents the proportion of correctly predicted examples. For the imbalanced class problem, as mentioned in Section 2, an overly simple model which predicts all of the test examples as the negative class might maximize the accuracy. Therefore we do not give much weight to the naive accuracy in this paper.

$$ \mathrm{Accuracy} = \frac{TN + TP}{TN + FN + TP + FP}. $$

4.3.2. F-measure

The F-measure combines the precision TP/(TP + FP) and the recall TP/(TP + FN) for the prediction of the positive class. A higher F-measure value indicates that the model performs better.

$$ F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. $$

4.3.3. G-mean

The G-mean (geometric mean) has been proposed as the geometric mean of the prediction accuracies for both classes [3], taking the class imbalance into account. Even if a model classifies the negative examples correctly, a poor performance in the prediction of the positive examples will lead to a low G-mean value. In fact, the G-mean is quite important for measuring the avoidance of overfitting to the negative class and the degree to which the positive class is ignored.

$$ \text{G-mean} = \sqrt{\frac{TN}{TN + FP} \times \frac{TP}{TP + FN}}. $$

4.3.4. Mean squared error

The mean squared error (MSE) measures the error of the estimated posterior probability of a sample x_i whose actual class label is y_i. A model which gives precise probability estimates can reduce the MSE. Although the original definition uses the true posterior probability p_T(y_i|x_i) as the target, we usually do not know p_T(y_i|x_i) for real-world data sets. Instead, in general, the MSE is empirically calculated assuming p_T(y_i|x_i) = 1.

$$ \mathrm{MSE} = \frac{1}{N_{\mathrm{test}}} \sum_{i=1}^{N_{\mathrm{test}}} \big(1 - p(y_i \mid x_i)\big)^2. $$

4.3.5. Improved squared error

The MSE is commonly used as a metric of estimated probability. However, even if the model predicts all of the class labels correctly, the MSE can still be high if the probability estimates are unstable. As an alternative, Fan et al. introduced the improved squared error (ISE) [13]. The ISE accounts for the probability error only on the examples that the model misclassifies. In other words, the ISE combines the accuracy and the MSE into one value. We assume the prediction threshold of the estimated probability in the binary classification is fixed at 0.5.

$$ \mathrm{ISE} = \frac{1}{N_{\mathrm{test}}} \sum_{i=1}^{N_{\mathrm{test}}} \left(1 - \min\!\left(1.0, \frac{p(y_i \mid x_i)}{0.5}\right)\right)^2. $$
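The threshold-based metrics above can be computed directly from the confusion-matrix counts and the estimated probability of the true class. A minimal sketch, assuming labels in {0, 1} with 1 as the positive/minority class, both classes present in the test set, and the 0.5 threshold used in the text; the function name is ours.

```python
# The threshold-dependent metrics defined above, computed from the estimated
# probability of the true class p(y_i | x_i) and hard predictions at 0.5.
import numpy as np

def imbalance_metrics(y_true, p_pos):
    """y_true in {0, 1} (1 = positive/minority); p_pos = estimated P(y=1|x)."""
    y_pred = (p_pos >= 0.5).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    g_mean = np.sqrt((tn / (tn + fp)) * (tp / (tp + fn)))
    p_true = np.where(y_true == 1, p_pos, 1.0 - p_pos)   # p(y_i | x_i) for the actual label
    mse = np.mean((1.0 - p_true) ** 2)
    ise = np.mean((1.0 - np.minimum(1.0, p_true / 0.5)) ** 2)
    return dict(accuracy=accuracy, f_measure=f_measure, g_mean=g_mean, mse=mse, ise=ise)
```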

Ideally, for fairness, the performance evaluation should not depend on any parameter. However, for the metrics above, the evaluation depends on the choice of the threshold for the estimated probability, which is usually 0.5. There is no systematic way to choose a reasonable value for the threshold, especially in the imbalanced class problem, as some predictive models tend to give higher posterior probabilities to the majority class, p(y_i = neg|x_i). Therefore, another metric is needed to fairly evaluate the models independently of the threshold selection.

4.3.6. ROC curve and AUC

Beginning with the early stages of research on imbalanced data, ROC curves have been the primary tool for evaluating the performance of algorithms. The curves are computed based on the estimated class probabilities p(y_i|x_i) of the classifiers, and they represent the trade-off between the true positive (TP) ratio and the false positive (FP) ratio as the decision threshold for the estimated probability is varied. A higher curve corresponds to better prediction performance. Figure 3 shows an example of the ROC curves for RB Bagging and Exactly Balanced Bagging on a real-world data set described in Section 5.4. From this figure, one can determine which algorithm has the higher TP ratio at a chosen FP ratio, or vice versa. For example, at an FP ratio of 0.3, one can see that RB Bagging has a higher TP ratio. However, it is still unclear which algorithm is superior independent of the parameter selection. In addition, these ROC curves tend to cross at various points, so that the superiority depends on the choice of the parameter. For example, the curve of RB Bagging with replacement is lower than that of Exactly Balanced Bagging for FP ratios around 0.2.

Fig. 3 Example of ROC curves, where the horizontal axis is the false positive ratio and the vertical axis represents the true positive ratio. The percentages next to the algorithm names indicate the AUC values, which correspond to the areas under the ROC curves: RB Bagging (87.8), RB Bagging w/replace (88.0), Exactly Balanced (86.9).

The AUC is a value which indicates the size of the area under an ROC curve. In contrast to the difficulty of identifying clear advantages among multiple crossing ROC curves, ranking classification results by their AUC values makes sense statistically. An AUC value equals the probability P(X > Y), where X is the output value of the classification model for a randomly chosen positive sample and Y is that for a randomly chosen negative sample [17]. If the model correctly classifies all the samples, then the AUC value becomes 1.0. In other words, the AUC is the average chance that the model's output for a positive class sample is larger than its output for a negative class sample. This value describes how well the classifier separates the two classes. Figure 3 indicates the corresponding AUC values in percentage terms, next to the names of the algorithms. RB Bagging has higher AUC values than Exactly Balanced Bagging. In fact, the AUC has often been used as a performance metric in recent data mining contests, including the ACM KDD Cup and the IEEE ICDM Data Mining Contest [18–20], as the AUC is regarded as a practically important metric. Moreover, AUC values are known to be equivalent to the values of the Wilcoxon statistic (also called the Mann-Whitney statistic) [21].

Therefore, we focus on the value of the AUC as the primary metric in our evaluation.
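The rank-statistic view of the AUC can be made concrete in a few lines; the tiny arrays below are illustrative only, and scikit-learn's roc_auc_score is used as a cross-check.

```python
# AUC as P(X > Y): the Mann-Whitney/Wilcoxon view described above.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_mann_whitney(y_true, scores):
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # count pairs where the positive outranks the negative (ties count 1/2)
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = np.array([1, 1, 0, 0, 0])
s = np.array([0.9, 0.4, 0.5, 0.3, 0.1])
print(auc_mann_whitney(y, s), roc_auc_score(y, s))  # both 0.833...
```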

4.4. Algorithms for Imbalanced Data

Even for imbalanced class problems, decision tree algorithms, such as C4.5 [22], are the most widely used approach. It is well known that bagging with tree algorithms is a good idea, as the aggregation counteracts the instability of the trees by reducing the variance term of the MSE [9,13]. In contrast, more stable models such as logistic regression are not appropriate as the base learner. For this reason, we use C4.5 as the base learner for RB Bagging and the other ensemble methods in Section 5.

Following the previous work [4,5], we used AdaBoost [23] and RIPPER [24] as the comparison algorithms in the experiments. Boosting is a family of ensemble algorithms which assigns larger weights to the examples misclassified by the current base model and trains the next base model using the new weights. If the base learners are strongly biased toward the negative examples, then boosting can revise the weights of the misclassified positive examples and automatically cope with the imbalanced class problem. AdaBoost [23] is the most successful boosting algorithm. RIPPER is a rule-based algorithm that generates rule sets using an MDL-based (minimum description length) stopping criterion and greedily prunes the rules to minimize their description length. Because rules are generated for each class, RIPPER also performs well for imbalanced data sets.

5. EXPERIMENT

5.1. Setting

We compared two models based on the proposed algorithm with seven models based on other algorithms. We tested RB Bagging with size K = 100, with and without replacement, using C4.5 as the base learner. We also implemented Exactly Balanced Bagging and Breiman's original bagging algorithm with ensembles of equal size. For the standard models and base learners, we employed the widely used data mining tool Weka [25]. C4.5, AdaBoost, and RIPPER are implemented as J48, AdaBoostM1, and JRip, respectively, in Weka. A single C4.5 tree was constructed to assess the difficulty of learning with these data sets. The next two models are AdaBoost with size K = 100 and 200 (for the number of iterations). The numbers of iterations follow the experimental values of the previous work [4] that compared many boosting-based algorithms. Their base learners were also C4.5. As a rule-based algorithm, we made use of two models based on RIPPER. The numbers of optimizations Optimize were set to 2 (following the original paper [24]) and 10. The other parameters were all the same as the default parameters in Weka. All of the experiments and the statistical processing were performed in the statistical language environment R [26]. In the following experiments, we ran tenfold cross validation in a stratified manner so that the training and test sets preserve the same class distributions as the original sets. Then the output probabilities p(y_i|x_i) of the classifiers were used for computing six metrics: AUC, MSE, ISE, F-measure, G-mean, and accuracy (the performance metrics introduced in Section 4.3). Decision trees constructed with C4.5 can give probability outputs for test samples by computing the fraction of the samples of each class in the leaves for the training subset. In addition, we enabled the useLaplace option in the J48 implementation in Weka to smooth the output values. The ensemble-based algorithms also give a real-valued output by aggregating the estimated probabilities of the base decision trees. The AUC values are computed using the colAUC function, which is designed for processing classification results, with the parameter alg set to Wilcoxon, from the widely used R package caTools [27]. The value of each metric is averaged over the ten trials of the tenfold cross validation.
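A sketch of this evaluation loop, with scikit-learn standing in for the Weka and R tooling used in the paper (StratifiedKFold preserves the class ratio in each fold, and roc_auc_score plays the role of colAUC); the train_fn and predict_proba_fn hooks are our own placeholders.

```python
# Stratified tenfold cross validation with probability outputs, as in the
# experimental setting above; sklearn is used here in place of Weka/R tools.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cv_auc(X, y, train_fn, predict_proba_fn, n_splits=10, seed=0):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for tr, te in skf.split(X, y):
        model = train_fn(X[tr], y[tr])
        p_pos = predict_proba_fn(model, X[te])      # estimated P(y = pos | x)
        aucs.append(roc_auc_score(y[te], p_pos))
    return float(np.mean(aucs))
```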

5.2. Artificial Data Set

In this subsection, we clarify the difference between RB Bagging and Exactly Balanced Bagging in detail using simple artificial data sets, and discuss why RB Bagging is expected to perform better in terms of the AUC.

We examined the advantages of negative binomial sampling and RB Bagging using one-dimensional artificial data sets. The positive and negative samples were generated from two Gaussian distributions, N(0, 0.5), which has zero mean and 0.5 variance, and N(2, 1), respectively, as shown in Fig. 4. The size of the data sets is fixed at 1000 and the fraction of positive samples is set at 5, 10, and 20%. We calculated the bias and variance of the MSE, the mean and variance of the AUC values of the individual models, and the overall AUC value of the ensemble model. The results for ten trials of tenfold cross validation are shown in Table II.

Fig. 4 Distribution of positive and negative samples.
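A sketch of this one-dimensional generator follows; the function name and label convention (1 = positive/minority) are ours.

```python
# One-dimensional artificial data: positives ~ N(0, 0.5), negatives ~ N(2, 1),
# 1000 examples in total with a 5%, 10%, or 20% minority fraction.
import numpy as np

def make_artificial_1d(n=1000, minority_ratio=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n_pos = int(n * minority_ratio)
    n_neg = n - n_pos
    x_pos = rng.normal(0.0, np.sqrt(0.5), size=n_pos)   # variance 0.5 -> std sqrt(0.5)
    x_neg = rng.normal(2.0, 1.0, size=n_neg)
    X = np.concatenate([x_pos, x_neg]).reshape(-1, 1)
    y = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])
    return X, y
```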

All of the algorithms succeed in improving the overall AUC value compared to the mean of the AUC values given by the base models. This shows the advantage of the ensemble-based models in terms of the AUC for imbalanced data sets. The bias of the MSE is almost identical in the experiments for both approaches. RB Bagging shows higher values for the accuracy, the variance of the AUC, and the overall AUC. However, Exactly Balanced Bagging has a larger variance of the MSE when the minority ratio is 5 or 10%, which contradicts the reported relationship that a larger variance of the MSE corresponds to a higher accuracy. In addition, the individual models of Exactly Balanced Bagging present higher AUC values on average for the lower minority ratios. RB Bagging can combine the predictions of weak learners with large diversity and thus achieve higher performance. Consequently, at least in this setting, the variance of the AUC has a stronger relationship with the increases in accuracy and AUC than the variance of the MSE does. Therefore, the variance of the AUC would be more appropriate than the variance of the MSE as a metric for the desired diversity of ensemble models in the class imbalance problem, just as the AUC is more appropriate for evaluating the prediction performance under class imbalance.

Next we used higher-dimensional data sets in the same setting. We increased the dimensionality d_A from 1 to 5 and then to 10, using multivariate Gaussian distributions. The means of the distributions of the positive and negative classes shift by the root of the d_A-th power of 2 for each dimension.

Table II. The bias and variance of the MSE and the AUC values of the algorithms on the one-dimensional artificial data sets.

%Minority  Algorithm                          Bias (MSE)  Var (MSE)  Accuracy  Mean (AUCs)  Var (AUCs)  AUC
5          RB Bagging (K = 100)               0.098       0.091      90.3      92.3         2.92        95.6
5          RB Bagging w/replace (K = 100)     0.098       0.091      90.4      92.3         2.85        95.6
5          Exact Balanced Bagging (K = 100)   0.097       0.101      88.8      92.6         1.88        95.3
10         RB Bagging (K = 100)               0.120       0.117      90.5      92.5         2.13        95.7
10         RB Bagging w/replace (K = 100)     0.120       0.117      90.5      92.4         2.15        95.7
10         Exact Balanced Bagging (K = 100)   0.120       0.120      89.6      92.6         1.49        95.2
20         RB Bagging (K = 100)               0.149       0.149      89.6      91.9         1.85        95.7
20         RB Bagging w/replace (K = 100)     0.149       0.149      89.6      91.9         1.84        95.7
20         Exact Balanced Bagging (K = 100)   0.149       0.150      88.8      91.7         1.22        95.1


Table III. The bias and variance of the MSE and the AUC values of the algorithms on the high-dimensional data sets, where the fraction of minority examples is 5%.

Dimension  Algorithm                          Bias (MSE)  Var (MSE)  Accuracy  Mean (AUCs)  Var (AUCs)  AUC
5          RB Bagging (K = 100)               0.082       0.049      94.2      84.0         7.95        97.1
5          RB Bagging w/replace (K = 100)     0.082       0.049      94.3      84.2         7.83        97.1
5          Exact Balanced Bagging (K = 100)   0.085       0.059      92.0      85.0         7.06        96.5
10         RB Bagging (K = 100)               0.074       0.038      96.4      82.2         8.77        98.5
10         RB Bagging w/replace (K = 100)     0.073       0.038      96.4      82.4         8.91        98.5
10         Exact Balanced Bagging (K = 100)   0.081       0.047      94.0      82.9         8.03        97.6

The covariance matrices are both diagonal, and the diagonal elements are all 0.5 and 1, respectively. The results with %Minority = 5 are summarized in Table III. RB Bagging still achieved higher overall AUC values while showing larger variance in both the MSE and the AUCs. Interestingly, though the higher dimensionality leads to a higher overall AUC, the mean AUC of the individual models was lower. This means that the aggregated models tend to perform well even if the performance of the base models decreases due to high dimensionality, and RB Bagging again has the advantage. These results can be taken as evidence that RB Bagging has higher diversity and better performance, as mentioned in Section 4.2.

5.3. Benchmark Data Sets

The characteristics of the benchmark data sets are summarized in Table IV. Following the earlier studies [4,5,28], we randomly chose eight frequently used benchmark data sets from the UCI repository [29]. The multi-class data sets were converted into binary data sets in which the minority labels are those shown in the "Min. label" column; the examples of the other classes were assigned the majority label. The largest data set has 20,000 items, and the minority class ratio ranges from 34.77% down to 3.95%.

Table IV. Summary of the benchmark data sets.

Data set   Size    #Attr.  Min. label  %Minority
Diabetes   768     8       pos         34.77
Breast     286     9       malignant   34.48
German     1,000   20      bad         30.00
E-Coli-4   336     7       iMU         10.42
Satimage   6,435   36      4           9.73
Flag       194     29      white       8.76
Glass      768     9       6           7.94
Letter-A   20,000  16      a           3.95

The performance metrics resulting from the nine algorithms on the benchmark data sets are summarized in Tables V and VI. In order to examine the statistical significance of the differences in performance, we used the Wilcoxon signed rank test [30]. In the tables, the bold values indicate the best results for a metric in an experiment. The asterisks mean that the results are not statistically significant at the significance level α = 0.05 with the t-test.

Surprisingly, on the Diabetes, German, and Satimage data sets, RB Bagging consistently outperformed Exact Balanced Bagging on all of the metrics. The results suggest that our approach of using roughly balanced subsets is promising. Our algorithm also worked better than the standard algorithms, including the original bagging algorithm. The differences are clear in the AUC and the G-mean.

On the Flag data set, the values of the F-measure and G-mean for the original bagging algorithm, C4.5, and AdaBoost are zero. These observations show that the three algorithms completely failed in the prediction of the positive examples in every trial. We examined why AdaBoost also failed, and found that the name attribute, which varies for each sample, caused the problem. In contrast, RB Bagging showed well-balanced performance on this difficult data set, especially for the AUC and the G-mean. While RIPPER had higher accuracy than RB Bagging, it seems to be slightly overfit toward the correct classification of the negative examples.

On the Glass data set, RB Bagging had the same accuracy as AdaBoost and outperformed it on the other metrics except for the MSE. The superiority of RB Bagging over the original bagging algorithm seems clearer for data sets such as Flag and Glass in terms of the AUC and G-mean, rather than for Diabetes and Breast. This shows that RB Bagging can be useful for highly imbalanced data sets.

The Letter-A results indicate that AdaBoost worked nearly perfectly and outperformed RB Bagging on all of the metrics except for the AUC. The reason appears to be that this problem is so easy that a single C4.5 tree can provide acceptably high performance. We conclude that it is difficult to make the ensemble model based on RB Bagging converge to the Bayes error in such simple cases.

5.4. Real-world Data Set

RealF is a real-world data set from a financial company, as shown in Table VII. The RealF examples consist of eleven processed customer characteristics over seven months in the loan business. Though we cannot make the data set public due to privacy and confidentiality issues, it was definitely valuable to evaluate the prediction algorithms for imbalanced data with a real-world application.

Table V. The experimental results on the benchmark data sets (taken from the UCI repository).

Data set   Algorithm                          AUC    MSE      ISE       F-measure  G-mean  Accuracy
Diabetes   RB Bagging (K = 100)               83.5   0.162    0.0490∗   69.2       76.2    76.3
           RB Bagging w/replace (K = 100)     83.7   0.161    0.0487    70.4       77.2    77.5
           Exact Balanced Bagging (K = 100)   82.7   0.173    0.0541    67.5       74.7    74.1
           Original bagging (K = 100)         83.4   0.157    0.0513    63.8       71.2    76.4
           C4.5 (pruned)                      76.6   0.193    0.106     59.0       67.6    73.3
           AdaBoost (K = 100)                 79.6   0.244    0.239     63.7       71.4    75.1
           AdaBoost (K = 200)                 78.9   0.255    0.251     63.0       70.9    74.1
           RIPPER (Optimize = 2)              69.5   0.193    0.0854    58.6       67.1    73.6
           RIPPER (Optimize = 10)             71.2   0.188    0.0885    61.5       69.5    74.9
Breast     RB Bagging (K = 100)               98.8   0.0359   0.0182    94.1       95.7    95.8
           RB Bagging w/replace (K = 100)     98.7∗  0.0358   0.0181    94.1       95.7    95.8
           Exact Balanced Bagging (K = 100)   98.6   0.0413   0.0257    93.1       95.1    95.1
           Original bagging (K = 100)         98.3   0.0374   0.0198∗   93.7       95.4    95.6
           C4.5 (pruned)                      96.4   0.0467   0.0332    92.2       94.1    94.6
           AdaBoost (K = 100)                 98.3   0.0293   0.0288    95.7       97.0    97.0
           AdaBoost (K = 200)                 98.2   0.0301∗  0.0301    95.7       97.0    97.0
           RIPPER (Optimize = 2)              92.7   0.0619   0.0525    89.6       91.5    93.0
           RIPPER (Optimize = 10)             92.9   0.0574   0.0522    90.8       92.3    93.8
German     RB Bagging (K = 100)               77.3   0.188    0.0431    77.7       70.1    71.1
           RB Bagging w/replace (K = 100)     78.1   0.186    0.0415    77.3       69.9∗   70.8
           Exact Balanced Bagging (K = 100)   76.0   0.208    0.0566    71.8       67.9    65.9
           Original bagging (K = 100)         76.9   0.173    0.0566    82.5       60.4    74.0
           C4.5 (pruned)                      66.2   0.227    0.146     80.2       56.3    70.8
           AdaBoost (K = 100)                 71.0   0.249    0.245     83.0       62.6    74.9∗
           AdaBoost (K = 200)                 70.0   0.249    0.248     82.9∗      63.4    75.0
           RIPPER (Optimize = 2)              63.5   0.194    0.0723    81.5       58.4    72.6
           RIPPER (Optimize = 10)             63.9   0.197    0.0758    80.6       59.9    71.8
E-Coli-4   RB Bagging (K = 100)               94.7   0.0877   0.0365    62.7       89.3    87.5
           RB Bagging w/replace (K = 100)     95.7   0.0871   0.0365    61.2       88.9    86.9
           Exact Balanced Bagging (K = 100)   94.0   0.103    0.0516    58.5       88.3    85.7
           Original bagging (K = 100)         94.3   0.0460   0.0215    65.3       74.1    93.8
           C4.5 (pruned)                      81.7   0.0523   0.0449    63.7∗      69.5    94.4
           AdaBoost (K = 100)                 93.7   0.0680   0.0679    62.2       70.1    93.2
           AdaBoost (K = 200)                 93.3   0.0775   0.0756    55.4       65.7    92.0
           RIPPER (Optimize = 2)              77.1   0.0619   0.0478    57.3       67.8    92.6
           RIPPER (Optimize = 10)             78.8   0.0671   0.0513    61.6       74.7    91.7

The motivation for analyzing this data set is to estimate the true financial conditions of the customers in order to more quickly stop doing business with bad customers, as the resulting increase of noncollectable debts would have a large impact on the profits of the company. The class label is the risk of a customer, low or high, determined after six months. On this data set, there is a clear difference between the performance of RB Bagging and the others, especially for the AUC. The relative performances of the algorithms are identical to those on the benchmark data sets. By putting appropriate focus on the high-risk customers, only RB Bagging had well-balanced AUC and G-mean values. In contrast, the other algorithms showed a strong bias toward the prediction of the low-risk (i.e., majority) customers. Therefore, RB Bagging was useful for detecting untrustworthy customers and in reducing the bad debts. This clear advantage shows that our method is also promising for real-world applications.

In summary, RB Bagging almost always outperforms Exact Balanced Bagging on all of the metrics. It usually worked better than the other algorithms for the AUC, ISE, and G-mean. The values of the MSE and F-measure were also comparable. As mentioned in Section 4.3, the trade-off between the overall accuracy and emphasizing the minority class is basically unavoidable. Viewing the overall accuracy as the balancing factor, the accuracy of RB Bagging seems advantageous over that of Exact Balanced Bagging.


Table VI. The experimental results on the benchmark data sets.

Data set   Algorithm                          AUC    MSE        ISE        F-measure  G-mean  Accuracy
Satimage   RB Bagging (K = 100)               95.4   0.0785     0.0243     60.5       87.6    89.0
           RB Bagging w/replace (K = 100)     95.5   0.0781     0.0235     60.0       87.5    88.8
           Exact Balanced Bagging (K = 100)   95.4   0.0960     0.0344     56.0       88.1    86.0
           Original bagging (K = 100)         95.5   0.0424     0.0159     65.9       73.9    94.4
           C4.5 (pruned)                      76.1   0.0751     0.0705     57.4       72.5    92.1
           AdaBoost (K = 100)                 96.7   0.0495     0.0492     70.1       76.9    95.0
           AdaBoost (K = 200)                 96.8   0.0503     0.0501     69.5       76.5    94.9
           RIPPER (Optimize = 2)              74.7   0.0636     0.0481     56.8       70.9    92.3
           RIPPER (Optimize = 10)             75.8   0.0640     0.0501     58.1       72.4    92.4
Flag       RB Bagging (K = 100)               75.2   0.178      0.0197     25.8       55.4    72.1
           RB Bagging w/replace (K = 100)     74.5∗  0.179      0.0200∗    21.3       47.4    71.1
           Exact Balanced Bagging (K = 100)   74.2∗  0.212      0.0555     22.9       54.0    62.1
           Original bagging (K = 100)         61.0   0.0795     0.0600     0.0        0.0     91.3
           C4.5 (pruned)                      50.0   0.0792     0.0591     0.0        0.0     91.3
           AdaBoost (K = 100)                 67.2   0.0817     0.0574     0.0        0.0     91.3
           AdaBoost (K = 200)                 67.2   0.0817     0.0574     0.0        0.0     91.3
           RIPPER (Optimize = 2)              61.5   0.0854     0.0566     19.7       23.7    88.6
           RIPPER (Optimize = 10)             64.8   0.0884     0.0596     28.0       30.9    88.5
Glass      RB Bagging (K = 100)               96.7   0.0466     0.0231     85.9       92.9    95.8∗
           RB Bagging w/replace (K = 100)     96.6   0.0474     0.0240     86.7       92.8    95.8
           Exact Balanced Bagging (K = 100)   95.4   0.0495     0.0257     85.3       92.5    95.3
           Original bagging (K = 100)         93.3   0.0368     0.0280     83.3       87.9    95.3∗
           C4.5 (pruned)                      93.6   0.0415∗    0.0371     84.0       89.5    95.3∗
           AdaBoost (K = 100)                 95.2   0.0418∗    0.0415     84.8∗      89.8    95.8∗
           AdaBoost (K = 200)                 95.2   0.0418∗    0.0415     84.8∗      89.8    95.8∗
           RIPPER (Optimize = 2)              91.2   0.0373∗    0.0341     85.5∗      90.5∗   96.2
           RIPPER (Optimize = 10)             89.6   0.0417∗    0.0385     83.5       88.7    95.8∗
Letter-A   RB Bagging (K = 100)               99.9   0.0103     0.00207    99.4       98.7    98.9
           RB Bagging w/replace (K = 100)     99.9   0.0103     0.00219    99.4       98.7    98.9
           Exact Balanced Bagging (K = 100)   99.9   0.0128     0.00300    99.2       98.6    98.4
           Original bagging (K = 100)         100    0.00210    0.00100    99.9       97.4    99.7
           C4.5 (pruned)                      98.9   0.00330    0.00305    99.8       97.7    99.6
           AdaBoost (K = 100)                 99.4   0.000557∗  0.000550   100        99.3    99.9
           AdaBoost (K = 200)                 99.4   0.000550   0.000550∗  100        99.3    99.9
           RIPPER (Optimize = 2)              97.9   0.00334    0.00292    99.8       97.8    99.6
           RIPPER (Optimize = 10)             98.0   0.00311    0.00296    99.8       97.9    99.7

Table VII. Summary of the RealF data set.

Data set  Size   #Attr.  Min. label  %Minority
RealF     6,651  77      low         4.99

5.5. Experimental Summary

First we show the merits of all of the algorithms according to the performance metrics in Table VIII. Each cell represents the count of how many times that algorithm achieved the best or nearly best result (without a significant difference) over the nine data sets shown in Tables V, VI, and IX. For the AUC, which is our primary metric, RB Bagging worked significantly best in most cases. In particular, the average AUC values of RB Bagging are never less than those of Exactly Balanced Bagging and the original bagging, except on the Letter-A data set. This indicates that RB Bagging is good at producing a high AUC value as an extension of bagging. For the MSE, in contrast, the original bagging tends to work best. C4.5 and AdaBoost also showed low MSE values. Although the MSE and ISE are similar metrics, the merits of the algorithms are quite different. Interestingly, RB Bagging always resulted in a quite low ISE value even when it had a larger MSE than the others. This indicates that RB Bagging generally misclassifies an example only when it has little confidence, that is, when the estimated probability p(y_i|x_i) is close to 0.5. AdaBoost was better than RB Bagging for the F-measure. On the other hand, for the G-mean, which was also proposed as a metric for the class imbalance problem [3], RB Bagging showed stronger results. As expected, RB Bagging showed lower prediction accuracy for most data sets due to the incorrect classification of negative examples in exchange for the high AUC value.


Table VIII. Summary of the prediction performances of all of the algorithms. Each cell represents the count of how many times the algorithm achieves the best result or is close to the best (without any statistically significant difference).

Algorithm                          AUC  MSE  ISE  F-measure  G-mean  Accuracy
RB Bagging (K = 100)               4    0    4    0          4       1
RB Bagging w/replace (K = 100)     6    0    4    2          3       1
Exact Balanced Bagging (K = 100)   1    0    0    0          2       0
Original bagging (K = 100)         1    6    3    1          0       2
C4.5 (pruned)                      0    3    0    1          0       3
AdaBoost (K = 100)                 0    3    1    6          2       7
AdaBoost (K = 200)                 1    3    1    5          2       6
RIPPER (Optimize = 2)              0    1    0    1          1       1
RIPPER (Optimize = 10)             0    1    0    1          0       1

Table IX. Statistics of the experimental result on the real-world data set (RealF).

Data set  Algorithm                          AUC    MSE     ISE     F-measure  G-mean  Accuracy
RealF     RB Bagging (K = 100)               83.4∗  0.112   0.0136  92.5       72.4    86.4
          RB Bagging w/replace (K = 100)     83.4   0.112   0.0140  92.3       72.8∗   86.2
          Exact Balanced Bagging (K = 100)   82.5   0.157   0.0275  85.4       72.8    75.5
          Original bagging (K = 100)         81.9   0.0321  0.0260  98.2       56.4    96.5
          C4.5 (pruned)                      65.6   0.0342  0.0308  98.2       55.4    96.4
          AdaBoost (K = 100)                 67.9   0.0332  0.0332  98.3       58.5    96.7
          AdaBoost (K = 200)                 67.8   0.0329  0.0329  98.3       58.8    96.7
          RIPPER (Optimize = 2)              62.3   0.0367  0.0329  98.0       49.1    96.2
          RIPPER (Optimize = 10)             64.0   0.0352  0.0316  98.1       52.4    96.3

Table X. Mean AUC values and p-values for the Wilcoxon and Friedman test between RB Bagging and other algorithms for the abovenine data sets.

RBB RBB w/r EBB Bagging C4.5 Ada.100 Ada.200 RIP.2 RIP.10

Mean AUC 89.4 89.6 88.6 87.2 78.5 85.4 85.2 76.7 77.7

Wilcoxon       RBB      –      0.359  0.012  0.020  0.004  0.027  0.039  0.004  0.004
(Paired)       RBB w/r  0.359  –      0.004  0.012  0.004  0.020  0.020  0.004  0.004
Friedman       RBB      –      1.000  0.984  0.966  0.058  0.904  0.776  0.001  0.010
(1.97 × 10−9)  RBB w/r  1.000  –      0.953  0.916  0.029  0.814  0.643  0.000  0.004

RBB, EBB, Ada., and RIP are the abbreviations for RB Bagging, Exactly Balanced Bagging, AdaBoost, and RIPPER, respectively.

The difference between the RB Bagging variants that sample with and without replacement is small, and neither shows a clear superiority. Overall, RB Bagging tends to show higher performance for AUC and G-mean, which are widely used for comparing classifiers under class imbalance.
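For reference, the following short sketch (our illustration, not code from the paper) shows how the F-measure and G-mean reported in the tables can be computed from a binary confusion matrix. The function name and the convention of treating the minority label as the positive class are assumptions; the choice of positive class strongly affects the F-measure under imbalance.

import numpy as np

def f_measure_and_g_mean(y_true, y_pred, pos_label=1):
    # Illustrative sketch, not the paper's code. pos_label = minority is an assumption.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == pos_label) & (y_pred == pos_label))
    fp = np.sum((y_true != pos_label) & (y_pred == pos_label))
    fn = np.sum((y_true == pos_label) & (y_pred != pos_label))
    tn = np.sum((y_true != pos_label) & (y_pred != pos_label))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp > 0 else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    g = np.sqrt(recall * specificity)                      # geometric mean of the two rates
    return f, g

# A majority-class-only predictor: 80% accuracy, but F-measure = G-mean = 0.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
print(f_measure_and_g_mean(y_true, y_pred))                # (0.0, 0.0)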

Finally, we would like to statistically compare the AUC performance of RB Bagging and the other algorithms on both the benchmark and real-world data sets. In general, it is not easy to compare multiple algorithms over multiple data sets when no algorithm works consistently better than the others. Because AdaBoost sometimes outperformed RB Bagging, the statistical significance of the superiority of RB Bagging over AdaBoost is not obvious. For evaluating the performance of multiple classifiers over multiple data sets, Demsar [31] applied the Wilcoxon signed rank test and the Friedman rank sum test [32].1 The tests are applied to the AUC values averaged over fivefold cross-validation for each data set. Following Demsar's work, we also use these two tests and show the results in Table X. The second row indicates the overall AUC values of the algorithms averaged over the nine data sets, including the benchmarks and RealF. The cells in the third and fourth rows give the p-values of the statistical tests between RB Bagging and the other algorithms listed in the first row.

1 DeLong et al. showed that the difference between two AUC values can be tested directly under the assumption that the AUC values follow a Gaussian distribution [35]. However, there is no guarantee that this assumption holds in general, so we did not use their technique.


The Wilcoxon signed rank test is a nonparametric alternative to the widely used paired t-test. Demsar indicated that the Wilcoxon test is more appropriate than the t-test for comparing two classifiers, as the parametric assumptions of the t-test are rarely met in machine learning applications. The Wilcoxon test considers only the ranks of the performance differences, not their actual magnitudes, and therefore needs no assumption about the distribution of the performance metric. At a significance level of α = 0.05, the results in Table X show that the performance of RB Bagging differs significantly from that of every other algorithm, regardless of whether its sampling is done with or without replacement. At the same time, the difference between the RB Bagging variants with and without replacement is not significant.

For multiple comparisons among the nine algorithms, we used the Friedman test, which is preferable to the common analysis of variance (ANOVA) for comparing classifiers over different data sets, as it also makes no parametric assumption. The p-value of 1.97 × 10−9 indicated in Table X corresponds to the null hypothesis that all of the classifiers are equivalent. At the significance level α = 0.05, we conclude that there are significant differences among the algorithms. However, for detecting the significance of the difference between two particular classifiers, the Friedman test is more conservative than the Wilcoxon test, since without such a correction individual null hypotheses are too easily rejected in multiple comparisons [31]. In fact, we cannot see any significant difference from Exactly Balanced Bagging, the original bagging, or AdaBoost at the same significance level, α = 0.05.
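As a hedged illustration of this procedure (not the authors' script), the sketch below applies both tests to a matrix of per-data-set AUC values using SciPy; the AUC numbers are placeholders rather than the values behind Table X.

import numpy as np
from scipy.stats import wilcoxon, friedmanchisquare

# Rows = data sets, columns = algorithms (column 0 playing the role of RB Bagging).
# Placeholder AUC values for illustration only.
auc = np.array([
    [0.95, 0.93, 0.91, 0.85],
    [0.89, 0.88, 0.86, 0.80],
    [0.92, 0.90, 0.89, 0.83],
    [0.97, 0.96, 0.95, 0.90],
    [0.84, 0.82, 0.80, 0.75],
    [0.90, 0.89, 0.87, 0.82],
    [0.88, 0.86, 0.85, 0.79],
    [0.93, 0.92, 0.90, 0.86],
    [0.83, 0.82, 0.81, 0.78],
])

# Pairwise Wilcoxon signed rank tests: column 0 versus each competitor.
for j in range(1, auc.shape[1]):
    _, p = wilcoxon(auc[:, 0], auc[:, j])
    print(f"Wilcoxon, algorithm 0 vs {j}: p = {p:.3f}")

# Friedman test over all algorithms at once (null hypothesis: all equivalent).
_, p = friedmanchisquare(*(auc[:, j] for j in range(auc.shape[1])))
print(f"Friedman: p = {p:.3g}")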

In summary, RB Bagging significantly outperforms C4.5 and RIPPER at significance level α = 0.05 for both the Wilcoxon and Friedman tests. In addition, compared with the other ensemble-based algorithms, RB Bagging works significantly better in terms of the Wilcoxon test.

6. RELATED WORK

A traditional approach to the imbalanced class problem is intelligent resampling. Kubat and Matwin proposed one-sided selection (OSS), which under-samples the majority examples that lie around the possible borderline or in noisy areas [3]. SMOTE is an over-sampling technique [2] that generates synthetic minority examples by carefully interpolating between minority examples so as to avoid overfitting. Previous work on the comprehensive evaluation of resampling methods for the class imbalance problem showed that under-sampling approaches, including random under-sampling, tend to work well on benchmark data sets, especially with C4.5 [33,34]. Their goal, which was to choose the best single subset for a single model, differs from our approach of obtaining better multiple subsets for an ensemble model.
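For intuition only, the following sketch illustrates the interpolation idea behind SMOTE; it is our simplified rendering, not the algorithm of [2], and the function and parameter names are assumptions.

import numpy as np

def smote_like(X_min, n_synthetic, k=5, seed=0):
    """Interpolate between minority examples and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Brute-force pairwise distances among the minority examples.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for s in range(n_synthetic):
        i = rng.integers(len(X_min))          # a random minority example
        j = rng.choice(neighbours[i])         # one of its minority neighbours
        gap = rng.random()                    # interpolation weight in [0, 1)
        synthetic[s] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic

# Usage: 20 synthetic points from a tiny two-dimensional minority set.
X_min = [[0.0, 0.0], [1.0, 0.2], [0.8, 1.0], [0.1, 0.9], [0.5, 0.5], [1.2, 0.7]]
print(smote_like(X_min, 20, k=3).shape)       # (20, 2)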

Boosting could address the imbalanced class problem naturally, as it automatically assigns greater weights to the minority examples. In this paper, we compare RB Bagging with the most successful boosting algorithm, AdaBoost [23]. Recently, boosting algorithms using over-sampling techniques for class imbalance have been presented. SMOTEBoost [4] applies SMOTE-based over-sampling to change the total weights of the misclassified minority examples. While the authors presented favorable experimental results for the metrics of precision, recall, and F-measure, determining what amount of over-sampling is sufficient remains a nontrivial problem. DataBoostIM [5] generates data to balance not only the class distribution but also the total weight within the classes. Broad empirical studies show that DataBoostIM generally outperforms other algorithms, including SMOTEBoost. However, these boosted over-sampling methods have not been evaluated in terms of AUC, which is the largest advantage of RB Bagging. In addition, RB Bagging does not require synthetic examples, which may introduce artifacts.

Cost-sensitive algorithms aim to minimize the total cost of misclassification when the costs are given. There are many cost-sensitive algorithms based on ensemble methods, such as MetaCost [6], Costing [8], and AdaCost [7]. On the basis of a bagging-like ensemble, MetaCost makes any base learner cost-sensitive. Costing applies cost-proportionate rejection sampling, which also generates subsets with a slightly imbalanced total cost within each class. While these algorithms can handle class imbalance through high misclassification costs for the minority class, direct empirical comparisons are impossible, as their objectives and performance metrics differ from ours. Note that applying the rejection sampling only to the negative class can be regarded as an approach similar to negative binomial sampling, so the performance of the trained models should be identical.
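As a rough, hedged sketch of the cost-proportionate rejection sampling idea used by Costing [8] (our rendering; the names and toy costs are illustrative), each example is accepted with probability proportional to its cost, so high-cost minority examples are almost always kept while the majority class is thinned:

import numpy as np

def rejection_sample(X, y, costs, seed=0):
    """Keep example i with probability costs[i] / max(costs) (one sampling pass)."""
    rng = np.random.default_rng(seed)
    costs = np.asarray(costs, dtype=float)
    accept = rng.random(len(costs)) < costs / costs.max()
    return X[accept], y[accept]

# Toy usage: minority examples (label 1) carry a 10x misclassification cost.
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.05).astype(int)
X = rng.normal(size=(1000, 3))
costs = np.where(y == 1, 10.0, 1.0)
X_sub, y_sub = rejection_sample(X, y, costs, seed=1)
print(len(y_sub), round(y_sub.mean(), 2))     # far smaller subset, much higher minority ratio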

Breiman performed some experiments using the same UCI repository data sets as used in our experiments [9]. He simply duplicated the minority examples and applied the original bagging algorithm, so the multiple replications of an example within a subset may lead to overfitting. In fact, we reimplemented those experiments and found that his approach never outperforms RB Bagging. Gao et al. recently proposed an ensemble-based framework for data streams with skewed distributions [36]. Chen et al. extended Random Forest [37] to learn from imbalanced data by performing bootstrap sampling only on the minority class [38]. As they also draw exactly the same number of minority and majority examples, our negative binomial sampling might improve their algorithm by using slightly imbalanced subsets.

Motivated by the importance of diversity in ensemble models [10–12], new ensemble learning algorithms have been proposed.


For instance, Melville and Mooney presented an over-sampling-based algorithm that generates artificial data samples and assigns class labels to them probabilistically to increase the diversity of the base learners and refine the decision boundary [39]. In contrast, RB Bagging only modifies the class ratio of the training subsets so that the trained classifiers achieve higher diversity in terms of AUC.

7. CONCLUSION

We have addressed the imbalanced class problem from practical perspectives. Although many ensemble-based algorithms have been proposed for the problem, Breiman's bagging algorithm has not been used that often. In this paper, we proposed RB Bagging, which equalizes the sampling probability of each class instead of fixing the sample size as a constant. The number of majority class examples in each subset is determined probabilistically according to the negative binomial distribution. The class distribution of the sampled subsets therefore becomes slightly imbalanced, just as in the original bagging algorithm applied to balanced data sets. RB Bagging thus preserves the nature of the original bagging algorithm while at the same time making effective use of the information in all of the minority examples. The aggregated model becomes more robust than the common approach that depends on exactly balanced subsets. In addition, there is only a minor drawback in computational cost in choosing RB Bagging over Exactly Balanced Bagging, as the average sample size is the same.
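To make the subset construction concrete, here is a hedged sketch of how we read the procedure: every subset keeps all of the minority examples, and the number of majority examples is drawn from a negative binomial distribution with p = 0.5 and n set to the minority size, so the subsets are only roughly balanced but balanced on average. The names, defaults, and the choice of sampling without replacement are illustrative assumptions, not a reference implementation.

import numpy as np

def rb_bagging_subsets(X, y, n_subsets=100, minority_label=1, seed=0):
    """Yield (X_sub, y_sub) training subsets for the base learners."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    n_min = len(min_idx)
    for _ in range(n_subsets):
        # Majority size = number of failures before the n_min-th success at p = 0.5,
        # so its expectation equals n_min (roughly, not exactly, balanced).
        m = min(rng.negative_binomial(n_min, 0.5), len(maj_idx))
        maj_sample = rng.choice(maj_idx, size=m, replace=False)  # under-sampling
        idx = np.concatenate([min_idx, maj_sample])              # keep all minority examples
        yield X[idx], y[idx]

# Usage: subset sizes fluctuate around 2 * n_min instead of being fixed there.
rng = np.random.default_rng(0)
y = (rng.random(2000) < 0.05).astype(int)
X = rng.normal(size=(2000, 5))
print([len(ys) for _, ys in rb_bagging_subsets(X, y, n_subsets=5, seed=1)])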

First, we examined the effect of variance reduction and the diversity of the exactly balanced and roughly balanced approaches. In preliminary experiments using simple artificial data sets, RB Bagging showed higher accuracy and AUC than Exactly Balanced Bagging. We showed that the variance of the AUC values given by the individual models in an ensemble is a good metric for estimating the diversity of ensemble models under class imbalance. Next, we evaluated our algorithm in experiments using nine data sets, including the usual benchmark data sets and a real-world data set. We compared RB Bagging with the Exactly Balanced model and other well-known algorithms for imbalanced data, such as AdaBoost and RIPPER. RB Bagging generally outperformed them, especially for performance metrics such as AUC, ISE, and G-mean, which are known to be appropriate for the imbalanced class problem. For the real-world financial data set, RB Bagging showed a very clear advantage. RB Bagging is statistically significantly better than C4.5 and RIPPER for both the Wilcoxon and Friedman tests. In addition, compared with the other ensemble-based algorithms, RB Bagging works significantly better in terms of the Wilcoxon test. These results show that our approach is practical and promising.
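As an illustration of this diversity measure (our sketch; it assumes scikit-learn's roc_auc_score and base models that expose predict_proba), the variance of the per-model AUC values on a common validation set can be computed as follows:

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

def auc_variance(base_models, X_val, y_val):
    """Variance of the individual base-model AUCs; larger values suggest more diversity."""
    aucs = [roc_auc_score(y_val, m.predict_proba(X_val)[:, 1]) for m in base_models]
    return float(np.var(aucs))

# Toy usage with ten shallow trees trained on different bootstrap samples.
rng = np.random.default_rng(0)
y = (rng.random(500) < 0.1).astype(int)
X = rng.normal(size=(500, 4)) + y[:, None]     # weakly informative features
models = []
for _ in range(10):
    idx = rng.choice(len(y), size=200, replace=True)
    models.append(DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[idx], y[idx]))
print(auc_variance(models, X, y))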

Future work includes a more comprehensive evaluation of algorithms for imbalanced data, including RB Bagging. To the best of our knowledge, only a few studies have broadly compared various algorithms for the class imbalance problem [34]. Empirical comparisons should be done using a number of benchmark and real-world data sets with different sample sizes, minority ratios, and difficulties. In particular, our interest lies in the difference between the over-sampling-based boosting algorithms (SMOTEBoost, DataBoostIM) and RB Bagging, which relies on under-sampling; such a comparison would allow us to examine the effect of artificial data samples, which might be problematic in some cases. As the data generation approach and our negative binomial sampling are not mutually exclusive, it would also be valuable to combine these techniques. In addition, it would be meaningful to compare diversity metrics, including the variance of AUC presented in this paper, on imbalanced data sets, as it is still unclear which metric best represents diversity for the class imbalance problem when multiple algorithms are evaluated mainly on their AUC values.

Acknowledgment

We appreciate fruitful discussions with Naoki Abe, Prem Melville, Tsuyoshi Ide, and Yuta Tsuboi. We are also grateful to Wei Fan, Yan Liu, and all of the anonymous reviewers for their comments, which improved the quality of this paper.

REFERENCES

[1] N. V. Chawla, N. Japkowicz, and A. Kotcz, Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations Newsletter, 6(1) (2004), 1–6.

[2] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16 (2002), 321–357.

[3] M. Kubat and S. Matwin, Addressing the curse of imbalanced training sets: one-sided selection, In Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN, USA, 1997, 179–186.

[4] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, SMOTEBoost: Improving prediction of the minority class in boosting. In Proceedings of the Seventh European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, 2003, 107–119.

[5] H. Guo and H. L. Viktor, Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. SIGKDD Explorations Newsletter, 6(1) (2004), 30–39.

[6] P. Domingos, MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 1999, 155–164.


[7] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan, AdaCost: misclassification cost-sensitive boosting. In Proceedings of the Sixteenth International Conference on Machine Learning, Bled, Slovenia, 1999, 97–105.

[8] B. Zadrozny, J. Langford, and N. Abe, Cost-sensitive learning by cost-proportionate example weighting, In Proceedings of the Third IEEE International Conference on Data Mining, Melbourne, Florida, USA, 2003, 435–442.

[9] L. Breiman, Bagging predictors. Machine Learning, 24(2) (1996), 123–140.

[10] P. Cunningham and J. Carney, Diversity versus quality in classification ensembles based on feature selection. In Proceedings of the Eleventh European Conference on Machine Learning, Barcelona, Catalonia, Spain, 2000, 109–116.

[11] A. Krogh and J. Vedelsby, Neural network ensembles, cross validation, and active learning. Advances in Neural Information Processing Systems, Vol. 7, Cambridge, MA, The MIT Press, 1995, 231–238.

[12] L. I. Kuncheva and C. J. Whitaker, Measures of diversity in classifier ensembles. Machine Learning, 51 (2003), 181–207.

[13] W. Fan, E. Greengrass, J. McCloskey, P. S. Yu, and K. Drummey, Effective estimation of posterior probabilities: explaining the accuracy of randomized decision tree approaches. In Proceedings of the Fifth IEEE International Conference on Data Mining, Houston, TX, USA, 2005, 154–161.

[14] D. Tao, X. Tang, X. Li, and X. Wu, Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(7) (2006), 1088–1099.

[15] P. Buhlmann and B. Yu, Analyzing bagging. Annals of Statistics, 30 (2002), 927–961.

[16] J. H. Friedman and P. Hall, On bagging and nonlinear estimation. Journal of Statistical Planning and Inference, 137(3) (2007), 669–683.

[17] C. Cortes and M. Mohri, AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems, S. Thrun, L. K. Saul, and B. Schölkopf, eds, Cambridge, MA, MIT Press, 2004.

[18] ACM KDD Cup, 2009. http://www.kddcup-orange.com/.

[19] IEEE ICDM 2008 Data Mining Contest. http://www.cs.uu.nl/groups/ADA/icdm08cup/.

[20] R. B. Rao, O. Yakhnenko, and B. Krishnapuram, KDD Cup 2008 and the workshop on mining medical data. ACM SIGKDD Explorations Newsletter, 10(2) (2008), 34–38.

[21] A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7) (1997), 1145–1159.

[22] J. R. Quinlan, C4.5: Programs for Machine Learning, San Mateo, CA, Morgan Kaufmann, 1993.

[23] Y. Freund and R. E. Schapire, Experiments with a new boosting algorithm, In Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy, 1996, 148–156.

[24] W. W. Cohen, Fast effective rule induction, In Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, California, USA, 1995, 115–123.

[25] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools, Oxford, Elsevier, 2005.

[26] R Development Core Team, R: A language and environment for statistical computing, R Foundation for Statistical Computing, 2005.

[27] J. Tuszynski, caTools: Tools: moving window statistics, GIF, Base64, ROC AUC, etc. http://cran.r-project.org/web/packages/caTools/.

[28] G. Batista, R. C. Prati, and M. C. Monard, A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations Newsletter, 6(1) (2004), 20–29.

[29] D. Newman, S. Hettich, C. Blake, and C. Merz, UCI Repository of Machine Learning Databases, 1998.

[30] F. Wilcoxon, Individual comparisons by ranking methods. Biometrics, 1(6) (1945), 80–83.

[31] J. Demsar, Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7 (2006), 1–30.

[32] M. Friedman, The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200) (1937), 674–701.

[33] C. Drummond and R. C. Holte, C4.5, class imbalance, and cost-sensitivity: why under-sampling beats over-sampling, In Workshop on Learning from Imbalanced Data Sets II, 2003.

[34] J. D. V. Hulse, T. M. Khoshgoftaar, and A. Napolitano, Experimental perspectives on learning from imbalanced data, In Proceedings of the Twenty-fourth International Conference on Machine Learning, Corvallis, Oregon, USA, 2007, 935–942.

[35] E. R. DeLong, D. M. DeLong, and D. L. Clarke-Pearson, Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44 (1988), 837–845.

[36] J. Gao, W. Fan, J. Han, and P. S. Yu, A general framework for mining concept-drifting data streams with skewed distributions, In Proceedings of the Seventh SIAM International Conference on Data Mining, Minneapolis, Minnesota, USA, 2007.

[37] L. Breiman, Random forests. Machine Learning, 45(1) (2001), 5–32.

[38] C. Chen, A. Liaw, and L. Breiman, Using Random Forest to Learn Imbalanced Data, Department of Statistics, University of California, Berkeley, Technical Report No. 666, 2004.

[39] P. Melville and R. J. Mooney, Constructing diverse classifier ensembles using artificial training examples, In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, 2003, 505–512.
