


Research Article
Outlier Detection Method in Linear Regression Based on Sum of Arithmetic Progression

K. K. L. B. Adikaram,1,2,3 M. A. Hussein,1 M. Effenberger,2 and T. Becker1

1 Group Bio-Process Analysis Technology, Technische Universität München, Weihenstephaner Steig 20, 85354 Freising, Germany
2 Institut für Landtechnik und Tierhaltung, Vöttinger Straße 36, 85354 Freising, Germany
3 Computer Unit, Faculty of Agriculture, University of Ruhuna, Mapalana, 81100 Kamburupitiya, Sri Lanka

Correspondence should be addressed to K. K. L. B. Adikaram; lasantha@daad-alumni.de

Received 25 March 2014; Revised 23 May 2014; Accepted 26 May 2014; Published 10 July 2014

Academic Editor: Zengyou He

Copyright © 2014 K. K. L. B. Adikaram et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We introduce a new nonparametric outlier detection method for linear series, which requires no missing or removed data imputation. For an arithmetic progression (a series without outliers) with n elements, the ratio (R) of the sum of the minimum and the maximum elements and the sum of all elements is always 2/n ∈ (0, 1]. R ≠ 2/n always implies the existence of outliers. Usually, R < 2/n implies that the minimum is an outlier, and R > 2/n implies that the maximum is an outlier. Based upon this, we derived a new method for identifying significant and nonsignificant outliers separately. Two different techniques were used to manage missing data and removed outliers: (1) recalculate the terms after (or before) the removed or missing element while maintaining the initial angle in relation to a certain point or (2) transform the data into a constant value, which is not affected by missing or removed elements. With a reference element which was not an outlier, the method detected all outliers from data sets with 6 to 1000 elements containing 50% outliers which deviated by a factor of ±1.0e−2 to ±1.0e+2 from the correct value.

1. Introduction

Outlier detection and management of missing data are the two major steps in the data cleaning/cleansing process [1–3]. For achieving a training set, data mining, and statistical analyses, it is very important to have data sets that have no (or as few as possible) outliers and missing values. Except for model-based approaches, outlier detection and replacing of detected outliers or replacing missing values are two separate processes.

The existing outlier detection methods are based on statistical, distance, density, distribution, depth, clustering, angle, and model approaches [1, 4–7]. The nonparametric outlier detection methods are independent of the model. For data without prior knowledge, nonparametric methods are known to be a better solution than the statistical (parametric) methods [8–10]. The most common nonparametric methods are based on distance, density, depth, cluster, angle, and resolution techniques. Among the various methods/techniques, the least squares method (LSM) [4] and the sigma filter [11] have been used frequently to remove the outliers of linear regression. These methods require data in a Gaussian or near-Gaussian distribution, which cannot always be guaranteed. If the correct model can be identified, model-based approaches like the Kalman filter [12–14] are suitable for removing and replacing outliers. However, if it is not possible to identify the correct model, the model-based approach is not feasible [15].

In addition to noise, missing data is another challenge in the data cleaning/cleansing process. Even if the original data set is without missing elements, removing outliers (without replacement) automatically creates a missing data environment. The two most common techniques to recover from this situation are (1) filling the missing data with an estimated value (filling) or (2) using the data without missing values (rejecting missing values). Complete-case analysis (listwise deletion) and available-case analysis (pairwise deletion) are the most common missing data rejection methods [16–18]. The mentioned methods are under the assumption that they yield unbiased results. Among the different missing data filling methods, hot deck, cold deck, mean, median, k-nearest neighbours, model-based methods, maximum likelihood methods, and multiple imputation are the most common [18–22]. Filling methods derive the filling value from the same or other known existing data. If there is a considerable number of outliers, the derived data may be biased due to the influence of the outliers [23, 24]. Therefore, the best way is to remove all outliers and replace them using a suitable method.

(Hindawi Publishing Corporation, The Scientific World Journal, Volume 2014, Article ID 821623, 12 pages; http://dx.doi.org/10.1155/2014/821623)

In this paper, we introduce a new nonparametric outlier detection method based on the sum of an arithmetic progression, which uses the indicator 2/n, where n is the number of terms in the series. The properties used in existing nonparametric methods, such as distance, density, depth, cluster, angle, and resolution, are domain dependent. In contrast, the value 2/n, which we used in our new method, is independent of the domain conditions.

Contrary to the existing nonparametric methods mentioned earlier, this work addressed identifying outliers in a data set that is expected to have a linear relation. The method is capable of identifying significant and nonsignificant outliers separately. Moreover, until all the outliers are removed, the new method requires no missing or removed data imputation. This eliminates the negative influence of wrongly filled data points, which is an advantage over the methods that require filling the removed data points. The outlier detection method we introduced showed its best performance when the significant outliers are in a non-Gaussian distribution. This is an advantage over existing methods such as LSM and the sigma filter. The method uses a single data point as a reference data point. The reference point is assumed to be a nonoutlier. Therefore, the accuracy of the outcome depends on the reference point, especially when locating nonsignificant outliers. If the selected reference point is not an outlier, the method was capable of locating outliers from a data set containing a very high rate of outliers, such as 50% outliers.

In this work, data from biogas plants were used for evaluating the new method. Since the biogas process is very sensitive, these data contain a considerable amount of noise even during apparently stable conditions. This provides a suitable data set for evaluating our method. We were able to get the best outlier-free macroscale data set, which agrees with linear (increasing, decreasing, or constant) regression, from selected segments of a data set.

2. Methodology

2.1. Arithmetic Progression. An arithmetic progression (AP) or arithmetic sequence is a sequence of numbers (ascending, descending, or constant) such that the difference between successive terms is constant [25]. The nth term of a finite AP with n elements is given by

a_n = d(n − 1) + a_1, (1)

where d is the common difference of successive members and a_1 is the first element of the series. The sum of the elements of a finite AP with n elements is given by

S_n = (n/2)(a_1 + a_n), (2)

where a_1 is the first element and a_n is the last element of the series.

Table 1: Sample calculations illustrating the relation between 2/n and (a_1 + a_n)/S_n.

a_n              Data set 1   Data set 2   Data set 3   Data set 4   Data set 5
a_1              100          100          100          99.99        1
a_2              101          101          101          101          101
a_3              102          102          102          102          102
a_4              103          103          103          103          103
a_5              104          104.01       204          104          104
(a_1 + a_5)/S_5  0.4          0.40001      0.498        0.399        0.255
2/n              0.4          0.4          0.4          0.4          0.4
Outlier          —            Yes, a_5     Yes, a_5     Yes, a_1     Yes, a_1

Equation (1) is an f(n) and fulfils the requirements of a line. In other words, a finite AP is a straight line. In addition, a straight line is a series without outliers. If there are outliers, the series is not a finite AP. Therefore, any arithmetic series that fulfils the requirements of an AP can be considered a series without outliers. Equation (2) can be represented as

2/n = (a_1 + a_n)/S_n, ∞ > n ≥ 2, 0 < 2/n ≤ 1. (3)

For any AP, the right-hand side (RHS) of (3) is always 2/n, which is independent of the terms of the series. In other words, if there are no outliers, the value (a_1 + a_n)/S_n will always be equal to 2/n. If the RHS of (3) is not 2/n, it always implies that the series contains outliers. Therefore, the value 2/n can be used as a global indicator to identify any AP with outliers.
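As a quick sketch of this indicator (with hypothetical values taken from the data sets of Table 1), the ratio (a_1 + a_n)/S_n can be compared against 2/n; the function name below is ours, and the series is assumed to be given in ascending order so that the first and last elements are the minimum and maximum:

```python
# Sketch of the 2/n indicator on an ordered series (data sets from Table 1).
def ap_ratio(series):
    """Return (a_1 + a_n) / S_n for a series given in ascending order."""
    return (series[0] + series[-1]) / sum(series)

ap = [100, 101, 102, 103, 104]            # a true AP, n = 5, so 2/n = 0.4
no_outliers = abs(ap_ratio(ap) - 2 / len(ap)) < 1e-12   # ratio equals 2/n

data_set_3 = [100, 101, 102, 103, 204]    # maximum replaced by an outlier
ratio = ap_ratio(data_set_3)              # ~0.498 > 0.4: maximum is an outlier
```

The computed 0.498 matches the Table 1 entry for data set 3.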

Since we use the relation of the AP, we define that elements lying on or between two lines (a linear border) are nonoutliers and the others are outliers. When the distance between the two lines is zero, they represent a single line. In relation to the method presented in this paper, the term nonoutlier implies an element that lies within a certain linear border, and the term outlier implies an element that does not lie within the linear border.

Primary investigations showed that the method is capable of not only indicating the existence of outliers but also locating the outlier: (a_1 + a_n)/S_n > 2/n indicates that the maximum element is the outlier, and (a_1 + a_n)/S_n < 2/n indicates that the minimum element is the outlier. However, (a_1 + a_n)/S_n = 2/n does not imply that the series is free of outliers. Furthermore, primary investigations showed that the method is capable of locating both large and small outliers. Table 1 shows sample calculations illustrating the relation between 2/n and (a_1 + a_n)/S_n.

As a principle, the relation of (3) is capable of identifying and locating the outliers. However, we found seven drawbacks which made relation (3) unusable for identifying outliers in actual data. In Sections 2.2 to 2.8 we address the challenges of making the relation usable.

2.2. Challenge 1: Notation of the Equation. The symbols used in (3), especially a_1 and a_n, create a logical barrier. For example, if there are outliers, the minimum and the maximum can be other elements rather than a_1 and a_n. Therefore, it is necessary


Figure 1: Distribution of the criteria range (0, 1]: 2/n corresponds to no outliers, with outlier regions on either side.

to use meaningful symbols that reflect the purpose of the method. The first and the last elements are either the minimum or the maximum. Therefore, it is possible to replace a_1 and a_n by the minimum (a_min) and the maximum (a_max) of the series. Then (3) can be represented as

2/n = (a_min + a_max)/S_n. (4)

Since the RHS of (4) consists of the minimum, maximum, and sum of the series, the RHS was named MMS, with the meaning of minimum, maximum, and sum:

MMS = (a_min + a_max)/S_n. (5)

2.3. Challenge 2: Set a Range for the Outlier Detection Criterion. According to (3), the outlier detection criterion is 2/n and can be used to check the elements that exactly agree with a line (Figure 1). To identify elements in a certain range, it is necessary to have a criteria range rather than the single value 2/n.

The left-hand side of (4) is the ratio 2/n and is named R_w by adding a weight "w" to "R". Then

R_w = 2/n + w, 0 ≤ w ≤ 1 − 2/n. (6)

The status w = 0 (R_0) represents a single line, and w > 0 represents a line with a certain width (a linear border). The outlier criteria range has both a floor (0) and a ceiling (1), and standardization is not required. This is an additional advantage over the most common average, variance, and standard deviation based approaches, which require a separate standardization process.

2.4. Challenge 3: Influence of Negative Values. Due to negative values, the numerator or both the numerator and the denominator of the RHS of (5) can be 0 (e.g., −4, −1, 0, 1, 4) even without outliers. When there are outliers, the RHS of (5) can be negative, which cannot be accepted as a valid value for 2/n: 0 < 2/n ≤ 1 must always hold.

Subtracting the minimum element (a_i,new = a_i − a_min) from each element of any AP creates a new transformed AP where a_min = 0 and guarantees a series without negative values. From (5) and a_i,new = a_i − a_min, (7) is derived, which is more robust. Another advantage of (7) is that it performs the transformation automatically:

MMS = ((a_min − a_min) + (a_max − a_min)) / Σ_{i=1..n} (a_i − a_min),

MMS = (a_max − a_min) / (S_n − a_min · n). (7)

2.5. Challenge 4: Uneven Distribution of the Criteria Range. The ranges (0, 2/n) and (2/n, 1] are to identify outliers which are minimums and maximums, respectively (Figure 1). When n → ∞ and R_0 → 0, then R_w ∈ (0, 1] is not equally distributed, which provides a large range for maximum outliers and a small range for minimum outliers. This is a problem when locating minimum outliers.

To solve this, we used the idea of the complement. For any series, this converts the maximum value into the minimum, the minimum value into the maximum, and intermediate values into their complements. Most importantly, the minimum value now represents the maximum value of the original series and vice versa, while still representing the original series. The complement of an element in a series can be defined as a_i,c = (a_max + a_min) − a_i. From (5) and a_i,c = (a_max + a_min) − a_i, this gives

MMS = ((a_max + a_min − a_max) + (a_max + a_min − a_min)) / Σ_{i=1..n} (a_max + a_min − a_i),

MMS = (a_min + a_max) / Σ_{i=1..n} (a_max + a_min − a_i). (8)

Applying a_i,new = a_i − a_min (to remove the effect of negative values) gives

MMS = ((a_min − a_min) + (a_max − a_min)) / Σ_{i=1..n} ((a_max − a_min) + (a_min − a_min) − (a_i − a_min)),

MMS = (a_max − a_min) / Σ_{i=1..n} (a_max − a_i),

MMS = (a_max − a_min) / (a_max · n − S_n). (9)

Consequently, the range R_0 > 2/n represents the range for minimum outliers related to the original series and vice versa (Figure 2), and it is possible to ignore the range (0, 2/n). In addition, (9) automatically performs the transformation.

Now there are two equations for MMS, (7) and (9), to check whether the maximum or the minimum of the series is an outlier. We named the two versions of MMS MMS_max (10) and MMS_min (11):

MMS_max = (a_max − a_min) / (S_n − a_min · n), (10)

MMS_min = (a_max − a_min) / (a_max · n − S_n). (11)
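A minimal sketch of (10) and (11) (function names are ours; data taken from Table 2) reproduces the tabulated values:

```python
def mms_max(series):
    """Equation (10): flags the maximum when the result exceeds 2/n."""
    amin, amax = min(series), max(series)
    return (amax - amin) / (sum(series) - amin * len(series))

def mms_min(series):
    """Equation (11): flags the minimum when the result exceeds 2/n."""
    amin, amax = min(series), max(series)
    return (amax - amin) / (amax * len(series) - sum(series))

data_set_2 = [100, 101, 102, 103, 104.01]   # maximum is a small outlier
data_set_4 = [99.99, 101, 102, 103, 104]    # minimum is a small outlier
# Data set 2: MMS_max ≈ 0.401 > 0.4 > MMS_min ≈ 0.399 → maximum is the outlier.
# Data set 4: MMS_min ≈ 0.401 > 0.4 > MMS_max ≈ 0.399 → minimum is the outlier.
```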


Table 2: Sample calculations illustrating the relation between 2/n and MMS_max and MMS_min.

a_n       Data set 1   Data set 2   Data set 3   Data set 4   Data set 5
a_1       100          100          100          99.99        1
a_2       101          101          101          101          101
a_3       102          102          102          102          102
a_4       103          103          103          103          103
a_5       104          104.01       204          104          104
MMS_max   0.4          0.401        0.945        0.399        0.254
MMS_min   0.4          0.399        0.254        0.401        0.945
2/n       0.4          0.4          0.4          0.4          0.4
Outlier   —            Yes, max     Yes, max     Yes, min     Yes, min

Figure 2: Ranges of R_w over (0, 1] for the original series and for its complement series, with R_0 = 2/n: maximum outliers of the complement series represent the minimum outliers of the original series.

The following expression shows an overview of the MMS process:

MMS_max = (a_max − a_min)/(S_n − a_min · n) > (2/n + w_1) ⇒ the maximum is the outlier,
MMS_min = (a_max − a_min)/(a_max · n − S_n) > (2/n + w_1) ⇒ the minimum is the outlier,
otherwise (values ≤ (2/n + w_1)) ⇒ no outlier is indicated, (12)

and Table 2 shows sample calculations using (10) and (11) for the same data sets as in Table 1.

2.6. Challenge 5: How to Deal with Removed Outliers / Missing Values. In a series there can be initial missing values. In addition, if there is no replacement after removing an outlier, this also creates a missing-value environment. If there is no filling, it would transform the elements after the removed element into other values and destroy the original relationship of the elements (Figure 3). These transformed values become outliers in relation to the original data. Therefore, for using the relation of the AP, it is compulsory to maintain the original relation of the data even after removing an outlier. Thus, any rejection technique is not feasible. To maintain the original relation, one possible way is replacing the missing value. However, the data we are considering contain a considerable number of outliers. Therefore, we cannot guarantee that an element derived from existing elements is not an outlier.

To overcome this problem, we considered two different options: (1) recalculate only the data points after (or before) the removed or missing element, thereby maintaining the initial angle in relation to a certain point, or (2) transform the elements into a new series on which the missing value has no effect.

2.6.1. Recalculate the Data Points after (or before) Removed and Missing Elements. If there is a missing element, the next elements will be shifted horizontally and transformed into wrong values in relation to the current index of the elements (Figure 3). However, angular shifting will not introduce such an error (Figure 4).

In Figure 4, the plot consists of elements a_0 to a_{r+1} (r ∈ R+), and element a_r at index r needs to be removed. After removing element r, element r + 1 becomes element r, element r + 2 becomes element r + 1, and so on. However, by shifting while maintaining the same angle with respect to a certain reference element (e.g., the first element), the same form of the series can be maintained. Equation (13) shows the new value after angular shifting. We used this technique with the MMS algorithm to recalculate the series after (or before) missing values or removed elements:

B_r T_r = (B_{r+1} C_{r+1} / A B_{r+1}) · A B_r = ((a_{r+1} − a_0)/(r + 1)) · r,

(a_{r+1})_new = a_0 + B_r T_r. (13)
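The recalculation step can be sketched as follows (our own helper, taking the first element as the reference a_0): per (13), each element after the removed index keeps its angle to a_0 by being rescaled from its old index to its new index.

```python
def remove_with_angular_shift(series, r):
    """Remove series[r] and recompute the subsequent elements per (13):
    a_new = a_0 + (a_old - a_0) * new_index / old_index,
    preserving each element's angle to the reference element a_0."""
    a0 = series[0]
    out = series[:r]
    for old_idx in range(r + 1, len(series)):
        new_idx = old_idx - 1
        out.append(a0 + (series[old_idx] - a0) * new_idx / old_idx)
    return out

line = [10, 12, 99, 16, 18]                  # y = 10 + 2x with an outlier at index 2
shifted = remove_with_angular_shift(line, 2) # remaining elements stay on the line
```

With the outlier removed, the surviving elements [10, 12, 14, 16] still form an AP, so the 2/n relation remains usable without imputation.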

2.6.2. Transformation of Data to a Constant Value. A series with a constant value (of the form y = c, where c is a constant) is a series that is unaffected by missing values. Because of that, if it is possible to transform any linear series to the y = c form, the transformed series is free of any effect of missing values. After that, the transformed series can be used for outlier detection.

Let y_T be a linear series where y_{T,k} = y_k − y_1 and x_{T,k} = x_k − x_1, x_k is the initial index of the elements, y_k is the kth element of the series, and k = 1, 2, …, n. The gradient of the line (m) is given by Σ_{k=1..n} y_{T,k} / Σ_{k=1..n} x_{T,k}. If one element (e.g., the first element) is (0, 0), this relation is always true even with missing values. The element (0, 0) can be considered the reference element. Thus y_T is a series with first element (0, 0) and a gradient m that can be calculated even with missing values. It is also possible to derive a new series y′ where y′_k = x_{T,k} · m. If there are no outliers, y_T and y′ coincide and y_T − y′ = 0. If y_TT = y_T − y′, then y_TT is of the form y = c without any influence from missing values. Therefore, this is another method to overcome missing values without replacing them (Figure 5).

2.7. Challenge 6: Locate Outliers That Are Neither the Maximum Nor the Minimum of the Series. When the outlier is


Figure 3: (a) Data set with an outlier at index r. (b) Value autotransformation effect after removing the outlier at index r without replacement, where circles correspond to elements after removing the outlier, red triangles correspond to expected (correct) elements, and squares correspond to initial values of the shifted elements after removing the outlier.

Table 3: "Bad Detection": the wrong (minimum) element is identified as the outlier.

a_n   Data set 6
a_1   100
a_2   101      MMS_max   0.377
a_3   102      MMS_min   0.425
a_4   103.6    2/n       0.4
a_5   104      Outlier   Yes, min

neither the maximum nor the minimum, MMS is unable to locate the outlier (Table 3). We named this phenomenon "Bad Detection". When R_w reaches the "Bad Detection Level", MMS cannot be applied. To overcome this situation, we introduced an improved version of MMS, enhanced MMS (EMMS), based on the missing data imputation technique in Section 2.6.2.

EMMS is expressed as

EMMS_max = (a_TT,max − a_TT,min) / (S_TT,n − a_TT,min · n), a_TT,max ≠ 0, (14)

EMMS_min = (a_TT,max − a_TT,min) / (a_TT,max · n − S_TT,n), a_TT,max ≠ 0, (15)

where a_TT,k = |a_T,k − x_k (G_aT / G_x)|, a_T,k = a_k − a_0, x_k is the index of the data, a_k is the kth term of the series, k = 0, 1, …, n − 1, n is the number of elements in the current window, G_aT = Σ_{k=0..n−1} a_T,k, G_x = Σ_{k=0..n−1} x_k, and S_TT,n = Σ_{k=0..n−1} a_TT,k.

Figure 4: Solution for the value autotransformation phenomenon: use angular shifting instead of a horizontal shift, where (×) corresponds to the horizontal shift and the remaining markers correspond to the angular shift.

Figure 4 Solution for value autotransformation phenomenon Useangular shifting instead of horizontal shift where (times) correspondsto horizontal shift and () corresponds to angular shift

The term a_TT,k is always ≥ 0; thus the term a_TT,min = 0. Then (14) and (15) are simplified to

EMMS_max = a_TT,max / S_TT,n, a_TT,max ≠ 0, (16)

EMMS_min = a_TT,max / (a_TT,max · n − S_TT,n), a_TT,max ≠ 0. (17)

If there are outliers, EMMS_min > 2/n or EMMS_max > 2/n, and the greater value represents the outlier. Table 4 shows


Table 4: EMMS for identifying an outlier.

x (0, …, n − 1)   y (a_n)   y_T (a_n − a_1)   y_TT (|y_T − x_n(G_Ty/G_x)|)
0                 100       0.000             0
1                 101       1.000             0.06
2                 102       2.000             0.12
3                 103.6     3.600             0.42
4                 104       4.000             0.24

G_x = 2, G_Ty = 2.120 (given per element; the ratio G_Ty/G_x = 1.06 is the same as for the sums); EMMS_max = 0.500; EMMS_min = 0.333; 2/n = 0.4; Outlier: Yes, max.
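The Table 4 calculation can be reproduced with a short sketch of (16) and (17) (our own function, zero-based indices as in Section 2.7):

```python
def emms_scores(x, y):
    """EMMS_max and EMMS_min per (16)-(17), with
    a_TT,k = |a_T,k - x_k * (G_aT / G_x)| and a_T,k = a_k - a_0."""
    aT = [yk - y[0] for yk in y]
    g = sum(aT) / sum(x)                     # the ratio G_aT / G_x
    aTT = [abs(aTk - xk * g) for xk, aTk in zip(x, aT)]
    n, s, amax = len(aTT), sum(aTT), max(aTT)
    return amax / s, amax / (amax * n - s), aTT.index(amax)

emax, emin, where = emms_scores([0, 1, 2, 3, 4], [100, 101, 102, 103.6, 104])
# emax ≈ 0.500 > 2/n = 0.4 > emin ≈ 0.333: the element at index 3 (103.6)
# is flagged, although it is neither the maximum nor the minimum.
```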

Figure 5: Transformation of data to a constant value to overcome the missing data problem, where red diamonds correspond to the y = f(x) form, yellow squares to the y_T = f(x) − f(x_0) form, crosses to the y′ = m · x_i form (m = Σ_{i=0..n} y_{T,i} / Σ_{i=0..n} x_{T,i}), red circles to y_TT = y_T − y′, and open circles to missing values.

an example calculation of EMMS, and the following expression shows an overview of the EMMS process:

EMMS_max = a_TT,max / S_TT,n > (2/n + w_2) ⇒ the maximum is the outlier,
EMMS_min = a_TT,max / (a_TT,max · n − S_TT,n) > (2/n + w_2) ⇒ the minimum is the outlier,
otherwise (values ≤ (2/n + w_2)) ⇒ no outlier is indicated. (18)

However, EMMS uses information derived from the existing data. If there are biased values, it may lead to biased information. Because of that, direct application of EMMS is not good practice. Hence, significant outliers should be removed first using MMS before applying EMMS.

Figure 6: Implementation of MMS and EMMS. Initially, the algorithm checks for significant outliers using MMS with R_w = (2/n)(1 + k_1); every detected outlier is removed (n = n − 1) and the series is recalculated. After all significant outliers are removed, the nonsignificant outliers are removed using EMMS with R_w = (2/n)(1 + k_2). There is no removed-data imputation in relation to either MMS or EMMS.

2.8. Challenge 7: Determining the Outlier Detection Criterion (R_w). The value R_w is the factor that determines the outliers: w = 0 (R_0 = 2/n) represents exactly a line, and w > 0 represents a linear border with a certain width. In this section we propose several possible methods that can be used to determine the outlier detection criterion.

2.8.1. Express the Value "w" as f(1/n). If the value w is f(1/n), then w = 2k/n, with k ≤ (n/2) − 1 and k ∈ R+. Then R_w = 2/n + 2k/n:

R_w = (2/n)(1 + k), R_w/R_0 = 1 + k (= constant). (19)

When the MMS or the EMMS is greater than R_w of (19), this implies the existence of outliers. Because R_w/R_0 is constant, it gives a standard for R_w; however, the determination of k still depends on knowledge of the domain. Figure 6 shows an algorithm based on this technique.
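The MMS stage of Figure 6 can be sketched as an iterative loop. This is a simplified sketch under our own assumptions: it omits the angular-shift recalculation of Section 2.6.1 and the subsequent EMMS stage, and simply drops each detected outlier.

```python
def mms_stage(y, k=0.5):
    """Repeatedly remove the max/min while MMS exceeds R_w = (2/n)(1 + k).
    Simplified sketch of the MMS stage in Figure 6 (no recalculation step)."""
    data, outliers = list(y), []
    while len(data) > 2:
        n, amin, amax, s = len(data), min(data), max(data), sum(data)
        rw = (2 / n) * (1 + k)
        span = amax - amin
        mmax = span / (s - amin * n) if s != amin * n else 0.0   # eq. (10)
        mmin = span / (amax * n - s) if amax * n != s else 0.0   # eq. (11)
        if mmax > rw and mmax >= mmin:
            outliers.append(amax); data.remove(amax)
        elif mmin > rw:
            outliers.append(amin); data.remove(amin)
        else:
            break                      # no significant outlier remains
    return outliers, data

found, cleaned = mms_stage([100, 101, 500, 103, 104])
# found == [500]; the near-linear elements survive
```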

2.8.2. When the First and the Last Items Are Nonoutliers. In the total process, the "Bad Detection Level" is the most important criterion. If R_w of MMS is less than the "Bad Detection Level", it is possible to identify nonoutliers as outliers, as mentioned in Section 2.7. If there is preknowledge about the outliers, it is possible to use a safe value for MMS. Otherwise, there is no 100% guarantee on the "Bad Detection Level".


Table 5: Different environments used to validate the new method.

Type of the original data set          Number of elements   Type of outliers         Reference (first) element is an outlier   Initial missing values
Increasing, constant, and decreasing   10 to 1000           Non-Gaussian, Gaussian   Yes / no                                  Yes / no

Figure 7: Outlier detection method including the "Bad Detection Level" detection technique. A window whose first and last terms are nonoutliers is chosen; MMS with R_w = (2/n)(1 + k_1) and then EMMS with R_w = (2/n)(1 + k_2) are applied, removing each detected outlier (n = n − 1) and recalculating the series. If the first or the last element is identified as an outlier, a contradictory situation arises; this point is therefore the terminating point of MMS and EMMS.

However, when the first and the last elements are not outliers, the "Bad Detection Level" can be detected automatically. If the first or the last element is identified as an outlier, a contradictory situation arises. Thus, this point can be considered the terminating point of MMS and EMMS. The decision diagram elaborated in Figure 7 expresses the new outlier detection method, including the "Bad Detection Level" detection technique.
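The loop of Figure 7 can be summarised in a short sketch. This is our own simplification, not the authors' C++ code: it combines the shifted MMS of (7), the complement transform of Section 2.5 (to catch minimum-side outliers), and the criterion R_w = (2/n)(1 + k) of (19) into a single pass, and omits the separate EMMS stage.

```python
def mms(values):
    # Shifted MMS of (7); a constant series (0/0) is treated as an exact line.
    lo, hi = min(values), max(values)
    denom = sum(values) - lo * len(values)
    return 2.0 / len(values) if denom == 0 else (hi - lo) / denom

def complement(values):
    # Complement transform of Section 2.5: a_i,c = (a_max + a_min) - a_i,
    # which turns minimum-side outliers into maximum-side ones.
    lo, hi = min(values), max(values)
    return [hi + lo - v for v in values]

def detect_outliers(values, k):
    """Indices of detected outliers. Scanning stops when no outlier remains or
    when the first/last element would be flagged (the contradictory situation
    described above)."""
    idx = list(range(len(values)))
    vals = list(values)
    found = []
    while len(vals) >= 3:
        n = len(vals)
        r_w = (2.0 / n) * (1.0 + k)            # criterion (19)
        if mms(vals) > r_w:                    # maximum-side outlier present
            j = vals.index(max(vals))
        elif mms(complement(vals)) > r_w:      # minimum-side outlier present
            j = vals.index(min(vals))
        else:
            break                              # no outliers left
        if j in (0, n - 1):                    # first or last term flagged: stop
            break
        found.append(idx[j])
        del vals[j], idx[j]                    # remove outlier, recalculate
    return found

print(detect_outliers([100, 101, 999, 103, 104], k=0.5))  # [2]
print(detect_outliers([100, 1, 102, 103, 104], k=0.5))    # [1]
```

The window shrinks after each removal, so R_w is recalculated with the current n, mirroring the "n = n − 1, recalculate" step of the flowchart.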

2.9. Validate the Method. We implemented the MMS (with recalculation after an outlier is removed) and the EMMS in C++ and conducted the validation process. For the recalculation process, the existing first element of the window was the reference element, and we always used the original value of the element (not the current, updated value of the element). To validate the method, we used artificial data sets of different sizes (10 to 1000) of a line representing increasing, decreasing, and constant lines. Then 50% of the items of those data sets were replaced with very small and very large outliers (±1.0e−2 to ±1.0e+2 times the correct value). We checked the data sets for all the environment combinations shown in Table 5. The outlier detection criteria were determined based on (19). For all data sets, the same k value was used (for MMS, k = 0.5, and for EMMS, k = 0.01). Then the percentage of correctly and falsely detected nonoutliers in relation to the number of actual nonoutliers and the percentage of correctly and falsely detected outliers from the total number of outliers (small and large outliers) were determined.

2.10. Evaluation Using Real Data. To check the best linear fitting identification capability, the algorithm was tested using several real data sets, which were automatically recorded with a frequency of twelve data points per day (i.e., every other hour) from a biogas plant over a period of seven months. Among the different parameters, we selected the H2 content, measured in ppm, which we expected to maintain linear behaviour during stable operation. We selected seven segments of different sizes for evaluating the algorithm. In some data sets there were initial missing elements. We set the R_w for MMS and EMMS by analysing the first and the third data sets. For the recalculation process, the existing first element of the window was the reference element, and we always used the original value of the elements (not the current, updated value of the element). Then the percentage of correctly and falsely detected nonoutliers in relation to the total number of nonoutliers and the percentage of correctly and falsely detected outliers from the total number of outliers (small and large outliers) were determined.

We decided to use the LSM, the sigma filter, and Grubbs' test [26–29], also known as the maximum normed residual test or "extreme studentized deviate" (ESD) test, to compare our results. We selected Grubbs' test since it has nearly the same formulation as our method. We checked all the biogas data using the abovementioned methods. We used each of the data segments as a single window. First, we checked the ability of each method to identify the general trend of the series. Then we checked the number of correctly and falsely detected outliers and nonoutliers for each method in relation to the general trend.
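For orientation, Grubbs' statistic used in this comparison is the maximum normed residual, G = max_i |x_i − x̄| / s. A minimal sketch follows (ours, not the authors' code); the critical value is supplied by the caller, since exact values come from published tables [26] (roughly 2.2–2.3 for n = 10 at the 0.05 level).

```python
import statistics

def grubbs_statistic(values):
    # Maximum normed residual ("extreme studentized deviate"):
    # G = max_i |x_i - mean| / s, with s the sample standard deviation.
    mean = statistics.fmean(values)
    s = statistics.stdev(values)
    return max(abs(v - mean) for v in values) / s

def grubbs_flags_outlier(values, critical_value):
    return grubbs_statistic(values) > critical_value

# One gross outlier pushes G close to its theoretical maximum (n - 1) / sqrt(n).
data = [100, 101, 102, 103, 104, 105, 106, 107, 108, 500]
print(grubbs_flags_outlier(data, critical_value=2.29))
```

Note that the classic test flags at most one element per pass; this is one reason it identifies far fewer outliers per segment than the MMS/EMMS procedure in the results below.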

3. Results and Discussion

Results related to the validation show that, when the reference element (the first element) was not an outlier, the algorithm was capable of identifying all outliers with 0% error, regardless of the type of outliers (Gaussian or non-Gaussian) (Figure 8). If the outliers were Gaussian, there were no significant outliers,

Figure 8: Outlier detection from data sets with ten elements, for (a) increment/non-Gaussian, (b) decrement/non-Gaussian, (c) constant/non-Gaussian, (d) increment/Gaussian, (e) decrement/Gaussian, and (f) constant/Gaussian data. The first element is the reference element, which is not an outlier; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is not an outlier, the new method is capable of locating all outliers. When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).

and MMS automatically became inactive (Figures 8(d), 8(e), and 8(f)). When the first few elements were outliers and the outliers were non-Gaussian, MMS detected the significant outliers correctly (Figures 9(a), 9(b), and 9(c)). However, EMMS was unable to locate the nonsignificant outliers when the first element for EMMS was an outlier (Figures 9(a) and 9(c)). If the reference element for EMMS was not an outlier, it was still possible to achieve correct results (Figure 9(b)). Though it was impossible to locate all nonoutliers, the detected nonoutliers were 100% correct detections. These values can be used to estimate the other values using methods like LSM, since all the existing data are now cleaned. In

Figure 9: Outlier detection from data sets with ten elements, for (a) increment/non-Gaussian, (b) decrement/non-Gaussian, (c) constant/non-Gaussian, (d) increment/Gaussian, (e) decrement/Gaussian, and (f) constant/Gaussian data. The first element is the reference element, which is an outlier; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, green squares to nonoutliers, and black arrows to wrong detections. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is an outlier and the outliers are non-Gaussian, the new method identifies only the significant outliers ((a), (b), (c)). When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).

general, it is fair to state that (1) when the reference element is not an outlier, the method is capable of identifying all outliers, and (2) when the first few elements of the series are outliers and the outliers are non-Gaussian, the method is capable of identifying only the significant outliers and part of the correct elements.

When the first few elements (the reference elements for both MMS and EMMS) were outliers and the outlier distribution was Gaussian, outlier detection was poor (Figures 9(d), 9(e), and 9(f)). Due to the Gaussian distribution of outliers, MMS was inactive, and it was not possible to identify the large outliers. Most importantly, the results highlighted

Figure 10: Two artificial data samples with 1000 elements each, including 50-, 100-, 100-, and 50-element (total 300) missing value regions. The first element is the reference element, which is not an outlier; (a) corresponds to a data set with non-Gaussian outliers and (b) to a data set with nearly Gaussian outliers. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. The new method was able to identify all the elements related to the line with 0% error.

the importance of the reference element. If the reference element for MMS and EMMS was not an outlier, it guaranteed good results regardless of other factors.

In the methodology, we derived the method based on the first element. However, it is also possible to use any other element as the reference point and modify the method accordingly. We considered the simplest situation, where the first element is not an outlier. Therefore, if it is possible to segment the data excluding extreme outliers at the beginning, the method provides accurate outlier detection. Another possibility is to replace the first element with an already known element. This leads to a further possibility for applying the method: if we know only a single correct element, using that element as the reference element, with the method modified accordingly, can yield very accurate results.

Some model-based approaches demand a trained data set for correct output. In contrast, this method requires only one correct element to produce a correct output. In addition, it is possible to use multiple reference points and consider the best fitting: for example, (a) consider each point in the first x% (e.g., 10%) of data points as a reference point, or (b) consider all data points as reference points. Furthermore, it is important to distinguish the purposes of MMS and EMMS: MMS removes only the significant outliers, while EMMS removes the nonsignificant outliers. Depending on the requirement, MMS and/or EMMS can be used to remove outliers.

The results show that the new method is a good solution for managing missing values. Figure 10 shows two data sets with 1000 elements each. Each data set contains 50-, 100-, 100-, and 50-element (total 300) missing value regions. When the first element was not an outlier, the new method was able to identify all the elements related to the line with 0% error.

In the real world, it is not possible to find nonoutliers that exactly agree with a linear regression. Therefore, 100% accuracy is inapplicable. However, it is very important to have a data set free of significant outliers. The new method guaranteed a significant-outlier-free data set when the outliers were non-Gaussian. Furthermore, in real world situations, data outliers are not always in Gaussian distribution. For this reason, we expect the new method to be applicable to the majority of outlier detection applications. Our new method is an effective solution, since the most common methods, LSM and the sigma filter, need Gaussian outliers. Some methods, like the sigma filter, cannot be applied directly to a given data segment, and further segmentation (windowing) is required for better results. In contrast, the new method is capable of locating nonoutliers automatically in increment, decrement, or constant form, regardless of the size of the window.

Results related to the biogas data supported the abovementioned idea and showed that the algorithm clearly identifies three regions, significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers, within a data segment (Figure 11). In addition, the results showed that the nonoutliers follow a linear path. Furthermore, the width of the regions can be tuned by changing the relevant R_w values. Figure 11 shows some selected results of the biogas data for a k value of 0.2 for MMS and a k value of 0.1 for EMMS.

One of the interesting observations was the ability of the algorithm to continue linear detection even with noncontinuous clusters (Figures 11(b) and 11(e)). In all data segments, no false detection occurred (there were no outliers in nonoutlier regions and vice versa). Most importantly, the new method required no further windowing, and nonoutliers were detected independent of the window size.

When the general trend was constant and the elements were in Gaussian distribution, the sigma filter and LSM were able to identify the linear trend. However, for series with biased elements, both methods failed to identify the general trend. When the general trend was increment or decrement, the sigma filter failed to identify the general trend (a further

Figure 11: Results for real biogas data (H2 content of biogas in ppm versus index) for segments of different sizes: (a) 334, (b) 372, (c) 350, (d) 352, (e) 356, and (f) 372 elements. The first element is the reference element, which is assumed not to be an outlier. The results show that the algorithm clearly identifies three regions, significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers, within each data segment. Most importantly, all the nonoutliers lay within a linear border. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.2 and 0.1, respectively.

segment would give a better result, but we used the whole window). The new method was capable of locating 4% to 45% of the elements as outliers with 0% error. Grubbs' test was capable of identifying only a very small fraction of the elements as outliers (0%–1.7%), even with a significance level of 0.05. However, all detected outliers were significant, and no wrong detections were reported.

4. Conclusions and Outlook

This paper introduced a new outlier detection method using the relation of the sum of the elements of an arithmetic progression. The results of this work show that the new method is a robust solution for outlier detection in a data set with missing elements. The method is capable of identifying both significant and nonsignificant outliers when the first value of the data set is not an outlier. Most importantly, the method is a solution for identifying significant outliers in a series with outliers in non-Gaussian distribution. In addition, the outlier detection is nonparametric, has floor and ceiling values, and does not require standardization. When the reference elements are unknown, the method can be used with multiple reference elements to gain optimal output.

If the frequency of the data is sufficient, any nonlinear relation can be represented as a combination of straight lines. Therefore, by using a suitable segmentation technique, it is possible to identify outliers in any data series. This will allow for detecting outliers in a process-oriented data set. Thus, to bring a data series into a form that is suitable for our method, an intelligent segmentation technique is necessary.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The University of Ruhuna, Sri Lanka, provided the paper processing charges of this paper. The German Academic Exchange Service (German: Deutscher Akademischer Austauschdienst) financed this work.

References

[1] R. J. Beckman and R. D. Cook, "Outlier…s," Technometrics, vol. 25, no. 2, pp. 119–149, 1983.

[2] F. Molinari, "Missing treatments," Journal of Business and Economic Statistics, vol. 28, no. 1, pp. 82–95, 2010.

[3] J. Qin, B. Zhang, and D. H. Y. Leung, "Empirical likelihood in missing data problems," Journal of the American Statistical Association, vol. 104, no. 488, pp. 1492–1503, 2009.

[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.

[5] Z. X. Niu, S. Shi, J. Sun, and X. He, "A survey of outlier detection methodologies and their applications," in Artificial Intelligence and Computational Intelligence, vol. 7002 of Lecture Notes in Computer Science, pp. 380–387, 2011.

[6] W. Jin, A. K. H. Tung, and J. Han, "Mining top-n local outliers in large databases," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 293–298, ACM, San Francisco, Calif, USA, August 2001.

[7] H.-P. Kriegel, M. S. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 444–452, ACM, Las Vegas, Nev, USA, August 2008.

[8] I. Ben-Gal, "Outlier detection," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 131–146, Springer, New York, NY, USA, 2005.

[9] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, December 2002.

[10] H. Fan, O. R. Zaïane, A. Foss, and J. Wu, "A nonparametric outlier detection for effectively discovering top-N outliers from engineering data," in Advances in Knowledge Discovery and Data Mining, W.-K. Ng, M. Kitsuregawa, J. Li, and K. Chang, Eds., vol. 3918 of Lecture Notes in Computer Science, pp. 557–566, Springer, Berlin, Germany, 2006.

[11] J.-S. Lee, "Digital image smoothing and the sigma filter," Computer Vision, Graphics, and Image Processing, vol. 24, no. 2, pp. 255–269, 1983.

[12] A. Gelb, Applied Optimal Estimation, MIT Press, 1974.

[13] D. Sierociuk, I. Tejado, and B. M. Vinagre, "Improved fractional Kalman filter and its application to estimation over lossy networks," Signal Processing, vol. 91, no. 3, pp. 542–552, 2011.

[14] P. H. Abreu, J. Xavier, D. C. Silva, L. P. Reis, and M. Petry, "Using Kalman filters to reduce noise from RFID location system," The Scientific World Journal, vol. 2014, Article ID 796279, 9 pages, 2014.

[15] H. Liu, S. Shah, and W. Jiang, "On-line outlier detection and data cleaning," Computers and Chemical Engineering, vol. 28, no. 9, pp. 1635–1647, 2004.

[16] T. D. Pigott, "A review of methods for missing data," Educational Research and Evaluation, vol. 7, no. 4, pp. 353–383, 2001.

[17] M. Nakai and W. Ke, "Review of the methods for handling missing data in longitudinal data analysis," International Journal of Mathematical Analysis, vol. 5, no. 1–4, pp. 1–13, 2011.

[18] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.

[19] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, L. House, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Germany, 2004.

[20] A. N. Baraldi and C. K. Enders, "An introduction to modern missing data analyses," Journal of School Psychology, vol. 48, no. 1, pp. 5–37, 2010.

[21] J. Tian, B. Yu, D. Yu, and S. Ma, "Clustering-based multiple imputation via gray relational analysis for missing data and its application to aerospace field," The Scientific World Journal, vol. 2013, Article ID 720392, 10 pages, 2013.

[22] S. Zhaowei, Z. Lingfeng, M. Shangjun, F. Bin, and Z. Taiping, "Incomplete time series prediction using max-margin classification of data with absent features," Mathematical Problems in Engineering, vol. 2010, Article ID 513810, 14 pages, 2010.

[23] J.-F. Cai, Z. Shen, and G.-B. Ye, "Approximation of frame based missing data recovery," Applied and Computational Harmonic Analysis, vol. 31, no. 2, pp. 185–204, 2011.

[24] B. L. Wiens and G. K. Rosenkranz, "Missing data in noninferiority trials," Statistics in Biopharmaceutical Research, vol. 5, no. 4, pp. 383–393, 2013.

[25] Aryabhata, The Aryabhatiya of Aryabhata: An Ancient Indian Work on Mathematics and Astronomy, vol. 1, Kessinger Publishing, 2006.

[26] F. E. Grubbs, "Procedures for detecting outlying observations in samples," Technometrics, vol. 11, no. 1, pp. 1–21, 1969.

[27] B. Rosner, "On the detection of many outliers," Technometrics, vol. 17, no. 2, pp. 221–227, 1975.

[28] R. B. Jain, "Percentage points of many-outlier detection procedures," Technometrics, vol. 23, no. 1, pp. 71–75, 1981.

[29] S. P. Verma, L. Díaz-González, M. Rosales-Rivera, and A. Quiroz-Ruiz, "Comparative performance of four single extreme outlier discordancy tests from Monte Carlo simulations," The Scientific World Journal, vol. 2014, Article ID 746451, 27 pages, 2014.



neighbours, model-based methods, maximum likelihood methods, and multiple imputation are the most common methods [18–22]. Filling methods derive the filling value from the same or other known, existing data. If there are a considerable number of outliers, the derived data may be biased due to the influence of the outliers [23, 24]. Therefore, the best way is to remove all outliers and then replace them with a suitable method.

In this paper, we introduce a new nonparametric outlier detection method based on the sum of an arithmetic progression, which uses the indicator 2/n, where n is the number of terms in the series. The properties used in existing nonparametric methods, such as distance, density, depth, cluster, angle, and resolution, are domain dependent. In contrast, the value 2/n, which we use in our new method, is independent of the domain conditions.

Contrary to the existing nonparametric methods mentioned earlier, this work addresses identifying outliers in a data set that is expected to have a linear relation. The method is capable of identifying significant and nonsignificant outliers separately. Moreover, until all the outliers are removed, the new method requires no missing or removed data imputation. This eliminates the negative influence of wrongly filled data points, which is an advantage over methods that require filling the removed data points. The outlier detection method we introduce showed its best performance when the significant outliers are in non-Gaussian distribution. This is an advantage over existing methods such as LSM and the sigma filter. The method uses a single data point as a reference data point, which is assumed to be a nonoutlier. Therefore, the accuracy of the outcome depends on the reference point, especially when locating nonsignificant outliers. If the selected reference point is not an outlier, the method was capable of locating outliers from a data set containing a very high rate of outliers, such as 50%.

In this work, data from biogas plants were used for evaluating the new method. Since the biogas process is very sensitive, these data contain a considerable amount of noise, even during apparently stable conditions. This provides a suitable data set for evaluating our method. We were able to obtain the best outlier-free macroscale data set, which agrees with a linear (increasing, decreasing, or constant) regression, from selected segments of a data set.

2. Methodology

2.1. Arithmetic Progression. An arithmetic progression (AP), or arithmetic sequence, is a sequence of numbers (ascending, descending, or constant) such that the difference between successive terms is constant [25]. The nth term of a finite AP with n elements is given by

\[
a_n = d(n - 1) + a_1, \tag{1}
\]

where d is the common difference of successive members and a_1 is the first element of the series. The sum of the elements of a finite AP with n elements is given by

\[
S_n = \frac{n}{2}(a_1 + a_n), \tag{2}
\]

Table 1: Sample calculations illustrating the relation between 2/n and (a_1 + a_n)/S_n.

                     Set 1    Set 2      Set 3      Set 4      Set 5
    a_1              100      100        100        99.99      1
    a_2              101      101        101        101        101
    a_3              102      102        102        102        102
    a_4              103      103        103        103        103
    a_5              104      104.01     204        104        104
    (a_1 + a_5)/S_5  0.4      0.40001    0.498      0.3999     0.255
    2/n              0.4      0.4        0.4        0.4        0.4
    Outlier          —        Yes (a_5)  Yes (a_5)  Yes (a_1)  Yes (a_1)

where a_1 is the first element and a_n is the last element of the series.

Equation (1) is an f(n) and fulfils the requirements of a line; in other words, a finite AP is a straight line. In addition, a straight line is a series without outliers. If there are outliers, the series is not a finite AP. Therefore, any arithmetic series that fulfils the requirements of an AP can be considered a series without outliers. Equation (2) can be represented as

\[
\frac{2}{n} = \frac{(a_1 + a_n)}{S_n}, \qquad \infty > n \ge 2, \quad 0 < \frac{2}{n} \le 1. \tag{3}
\]

For any AP, the right-hand side (RHS) of (3) is always 2/n, which is independent of the terms of the series. In other words, if there are no outliers, the value (a_1 + a_n)/S_n will always be equal to 2/n. If the RHS of (3) is not 2/n, it always implies that the series contains outliers. Therefore, the value 2/n can be used as a global indicator to identify any AP with outliers.
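This indicator is easy to check numerically; a small sketch follows (the function name is ours):

```python
def ap_ratio(series):
    # (a_1 + a_n) / S_n from (3): equals 2/n exactly for an AP.
    return (series[0] + series[-1]) / sum(series)

ap = [5 + 3 * i for i in range(8)]        # AP with a_1 = 5, d = 3, n = 8
print(ap_ratio(ap) == 2 / len(ap))        # True: no outliers

noisy = list(ap)
noisy[4] = 99                             # corrupt one interior element
print(ap_ratio(noisy) == 2 / len(noisy))  # False: the series contains an outlier
```

Corrupting any element changes S_n, so the ratio drifts away from 2/n even though a_1 and a_n are untouched.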

Since we use the relation of an AP, we define that elements lying on or between two lines (a linear border) are nonoutliers and the others are outliers. When the distance between the two lines is zero, they represent a single line. In relation to the method presented in this paper, the term nonoutlier implies an element that lies within a certain linear border, and the term outlier implies an element that does not lie within the linear border.

Primary investigations showed that the method is capable not only of indicating the existence of outliers but also of locating the outlier: (a_1 + a_n)/S_n > 2/n indicates that the maximum element is the outlier, and (a_1 + a_n)/S_n < 2/n indicates that the minimum element is the outlier. However, (a_1 + a_n)/S_n = 2/n does not imply that the series is free of outliers. Furthermore, primary investigations showed that the method is capable of locating both large and small outliers. Table 1 shows sample calculations illustrating the relation between 2/n and (a_1 + a_n)/S_n.

As a principle, the relation in (3) is capable of identifying and locating the outliers. However, we found seven drawbacks, which make relation (3) unusable for identifying outliers in actual data. In Sections 2.1 to 2.7, we address the challenges of making the relation usable.

Figure 1: Distribution of the criteria range (0, 1]. The value 2/n corresponds to no outliers; the regions below and above it correspond to outliers.

2.2. Challenge 1: Notation of the Equation. The symbols used in (3), especially a_1 and a_n, create a logical barrier. For example, if there are outliers, the minimum and the maximum can be elements other than a_1 and a_n. Therefore, it is necessary

to use meaningful symbols that reflect the purpose of the method. The first and the last elements are either the minimum or the maximum. Therefore, it is possible to replace a_1 and a_n by the minimum (a_min) and the maximum (a_max) of the series. Then (3) can be represented as

\[
\frac{2}{n} = \frac{(a_{\min} + a_{\max})}{S_n}. \tag{4}
\]

Since the RHS of (4) consists of the minimum, the maximum, and the sum of the series, the RHS was named MMS, with the meaning of minimum, maximum, and sum:

\[
\mathrm{MMS} = \frac{(a_{\min} + a_{\max})}{S_n}. \tag{5}
\]

2.3. Challenge 2: Set a Range for the Outlier Detection Criterion. According to (3), the outlier detection criterion is 2/n and can be used to check the elements that exactly agree with a line (Figure 1). To identify elements in a certain range, it is necessary to have a criteria range rather than the single value 2/n.

The left-hand side of (4) is the ratio 2/n, which we name R_w by adding a weight "w" to "R". Then

\[
R_w = \frac{2}{n} + w, \qquad 0 \le w \le 1 - \frac{2}{n}. \tag{6}
\]

The status w = 0 (R_0) represents a single line, and w > 0 represents a line with a certain width (a linear border). The outlier criteria range has both a floor (0) and a ceiling (1), and standardization is not required. This is an additional advantage over the most common average-, variance-, and standard-deviation-based approaches, which require a separate standardization process.
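The check in (5) can be sketched in a few lines (a minimal Python sketch for illustration; the implementation described later in Section 2.9 was in C++, and the function name here is ours):

```python
def mms(series):
    """Ratio of (minimum + maximum) to the sum of the series, per (5)."""
    return (min(series) + max(series)) / sum(series)

ap = [100, 101, 102, 103, 104]         # arithmetic progression, no outliers
print(mms(ap), 2 / len(ap))            # 0.4 0.4, an outlier-free AP gives exactly 2/n
print(mms([100, 101, 102, 103, 204]))  # > 0.4, so the maximum is an outlier
```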

2.4. Challenge 3: Influence of Negative Values. Due to negative values, the numerator or both the numerator and the denominator of the RHS of (5) can be 0 (e.g., -4, -1, 0, 1, 4) even without outliers. When there are outliers, the RHS of (5) can be negative, which cannot be accepted as a valid value for 2/n; 0 < 2/n ≤ 1 must always hold.

Subtracting the minimum (a_i,new = a_i - a_min) from each element of any AP creates a new, transformed AP where a_min = 0, which guarantees a series without negative values. From (5) and a_i,new = a_i - a_min, (7) is derived, which is more robust. Another advantage of (7) is that it performs the transformation automatically:

MMS = ((a_min - a_min) + (a_max - a_min)) / Σ_{i=1}^{n} (a_i - a_min)

MMS = (a_max - a_min) / (S_n - a_min · n). (7)

2.5. Challenge 4: Uneven Distribution of the Criteria Range. The ranges (0, 2/n) and (2/n, 1] identify outliers that are minimums and maximums, respectively (Figure 1). When n → ∞, R_0 → 0; then R_w ∈ (0, 1] is not equally distributed, which provides a large range for maximum outliers and a small range for minimum outliers. This is a problem when locating minimum outliers.

To solve this, we used the idea of the complement. For any series, this converts the maximum value into the minimum, the minimum value into the maximum, and intermediate values into their complements. Most importantly, the minimum value now represents the maximum value of the original series and vice versa, while still representing the original series. The complement of an element in a series can be defined as a_i,c = (a_max + a_min) - a_i. From (5) and a_i,c = (a_max + a_min) - a_i, this gives

MMS = ((a_max + a_min - a_max) + (a_max + a_min - a_min)) / Σ_{i=1}^{n} (a_max + a_min - a_i)

MMS = ((a_min) + (a_max)) / Σ_{i=1}^{n} (a_max + a_min - a_i). (8)

Applying a_i,new = a_i - a_min (to remove the effect of negative values):

MMS = ((a_min - a_min) + (a_max - a_min)) / Σ_{i=1}^{n} ((a_max - a_min) + (a_min - a_min) - (a_i - a_min))

MMS = (a_max - a_min) / Σ_{i=1}^{n} (a_max - a_i)

MMS = (a_max - a_min) / (a_max · n - S_n). (9)

Consequently, the range R_0 > 2/n represents the range for minimum outliers related to the original series and vice versa (Figure 2), and it is possible to ignore the range (0, 2/n). In addition, (9) performs the transformation automatically.

Now there are two equations for MMS, (7) and (9), to check whether the maximum or the minimum of the series is an outlier. We named the two versions of MMS MMSmax (10) and MMSmin (11):

MMSmax = (a_max - a_min) / (S_n - a_min · n), (10)

MMSmin = (a_max - a_min) / (a_max · n - S_n). (11)


Table 2: Sample calculations illustrating the relation between 2/n and MMSmax and MMSmin.

a_n       | Data set 1 | Data set 2 | Data set 3 | Data set 4 | Data set 5
a_1       | 100        | 100        | 100        | 99.99      | 1
a_2       | 101        | 101        | 101        | 101        | 101
a_3       | 102        | 102        | 102        | 102        | 102
a_4       | 103        | 103        | 103        | 103        | 103
a_5       | 104        | 104.01     | 204        | 104        | 104
MMS (Max) | 0.4        | 0.401      | 0.945      | 0.399      | 0.254
MMS (Min) | 0.4        | 0.399      | 0.254      | 0.401      | 0.945
2/n       | 0.4        | 0.4        | 0.4        | 0.4        | 0.4
Outlier   | —          | Yes-Max    | Yes-Max    | Yes-Min    | Yes-Min

Figure 2: Range of R_w for the original series and for the complement of the original series (R_0 = 2/n; for the complement series, the maximum outliers represent the minimum outliers of the original series).

The following expression gives an overview of the MMS process:

MMSmax = (a_max - a_min) / (S_n - a_min · n):
    > (2/n) + w_1 → the maximum is the outlier;
    ≤ (2/n) + w_1 → the maximum is not an outlier.
MMSmin = (a_max - a_min) / (a_max · n - S_n):
    > (2/n) + w_1 → the minimum is the outlier;
    ≤ (2/n) + w_1 → the minimum is not an outlier. (12)

Table 2 shows sample calculations using (10) and (11) for the same data sets as in Table 1.
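The decision rule of (12) can be sketched as a single function (illustrative Python; the helper name and the default w1 = 0 are ours):

```python
def classify(series, w1=0.0):
    """Apply (12): compare MMSmax (10) and MMSmin (11) with the criterion 2/n + w1.
    Returns 'max', 'min', or None when nothing exceeds the criterion."""
    lo, hi, n, s = min(series), max(series), len(series), sum(series)
    mms_max = (hi - lo) / (s - lo * n)
    mms_min = (hi - lo) / (hi * n - s)
    r_w = 2 / n + w1
    if mms_max > r_w and mms_max >= mms_min:
        return "max"                       # the maximum is the outlier
    if mms_min > r_w:
        return "min"                       # the minimum is the outlier
    return None

print(classify([100, 101, 102, 103, 204]))    # max  (data set 3 of Table 2)
print(classify([99.99, 101, 102, 103, 104]))  # min  (data set 4)
print(classify([100, 101, 102, 103, 104]))    # None (data set 1, a clean AP)
```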

2.6. Challenge 5: How to Deal with Removed Outliers/Missing Values. In a series there can be initial missing values. In addition, if there is no replacement after removing an outlier, a missing-value environment is also created. Without filling, the elements after the removed element are transformed into other values, destroying the original relationship of the elements (Figure 3). These transformed values become outliers in relation to the original data. Therefore, to use the relation of the AP, it is compulsory to maintain the original relation of the data even after removing an outlier; thus, a pure rejection technique is not feasible. One possible way to maintain the original relation is to replace the missing value. However, the data we are considering contain a considerable number of outliers. Therefore, we cannot guarantee that an element derived from the existing elements is not an outlier.

To overcome this problem, we considered two different options: (1) recalculate only the data points after (or before) the removed or missing element, thereby maintaining the initial angle in relation to a certain point, or (2) transform the elements into a new series in which the missing value has no effect.

2.6.1. Recalculate the Data Points after (or before) Removed and Missing Elements. If there is a missing element, the next elements will be shifted horizontally and transformed into wrong values in relation to the current index of the elements (Figure 3). However, angular shifting will not introduce such an error (Figure 3).

In Figure 4, the plot consists of elements a_0 to a_{r+1} (r ∈ R+), and the element a_r at r needs to be removed. After removing element r, element r+1 becomes element r, element r+2 becomes element r+1, and so on. However, by shifting while maintaining the same angle with respect to a certain reference element (e.g., the first element), the same form of the series can be maintained. Equation (13) shows the new value after angular shifting. We used this technique with the MMS algorithm to recalculate the series after (or before) missing values or removed elements:

B_r T_r = (B_{r+1}C_{r+1} / AB_{r+1}) · AB_r = ((a_{r+1} - a_0) / (r + 1)) · r,

(a_{r+1})_new = a_0 + B_r T_r. (13)
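Equation (13) can be sketched as follows (illustrative Python, generalized from a_{r+1} to every element after the removed index r; the reference element is assumed to sit at index 0):

```python
def angular_shift(series, r):
    """Remove the element at index r and recalculate the later elements per (13):
    each shifted element keeps its original angle with respect to a0 = series[0]."""
    a0 = series[0]
    out = series[:r]
    for j in range(r + 1, len(series)):        # element j moves to index j - 1
        out.append(a0 + (series[j] - a0) * (j - 1) / j)
    return out

line = [0, 2, 4, 6.5, 8, 10]       # slope-2 line whose index-3 element is an outlier
print(angular_shift(line, 3))      # [0, 2, 4, 6.0, 8.0], the gap closes on the line
```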

2.6.2. Transformation of Data to a Constant Value. A series with a constant value (the form y = c, where c is a constant) is unaffected by missing values. Therefore, if any linear series can be transformed to the form y = c, the transformed series is free of any effect of missing values, and the transformed series can then be used for outlier detection.

Let y^T be a linear series where y^T_k = y_k - y_1 and x^T_k = x_k - x_1, with x_k the initial index of the elements and y_k the kth element of the series, k = 1, 2, …, n. The gradient of the line (m) is given by Σ_{k=1}^{n} y^T_k / Σ_{k=1}^{n} x^T_k. If one element (e.g., the first element) is (0, 0), this relation is always true, even with missing values; the element (0, 0) can be considered the reference element. Thus y^T is a series with first element (0, 0) and a gradient m that can be calculated even with missing values. It is also possible to derive a new series y′ where y′_k = x^T_k · m. If there are no outliers, y^T and y′ coincide and y^T - y′ = 0. If y^TT = y^T - y′, then y^TT is of the form y = c without any influence from missing values. Therefore, this is another method to overcome missing values without replacing them (Figure 5).
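The transformation of Section 2.6.2 can be sketched as follows (illustrative Python; x carries the surviving original indices, so gaps from missing or removed elements simply do not appear):

```python
def constant_transform(x, y):
    """Transform a linear series to the constant form y^TT = y^T - y' (Section 2.6.2);
    the reference element (x[0], y[0]) maps to (0, 0)."""
    xT = [xi - x[0] for xi in x]
    yT = [yi - y[0] for yi in y]
    m = sum(yT) / sum(xT)                    # gradient through the reference (0, 0)
    return [yt - m * xt for xt, yt in zip(xT, yT)]

# Slope-3 line with the indices 2 and 5 missing: the transform is still all zeros.
print(constant_transform([0, 1, 3, 4, 6], [10, 13, 19, 22, 28]))  # [0.0, 0.0, 0.0, 0.0, 0.0]
```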

2.7. Challenge 6: Locate Outliers That Are Neither the Maximum Nor the Minimum of the Series. When the outlier is


Figure 3: (a) Data set with an outlier at index r. (b) Value autotransformation effect after removing the outlier at index r without replacement, where circles correspond to elements after removing the outlier, red triangles correspond to expected (correct) elements, and squares correspond to initial values of the shifted elements after removing the outlier.

Table 3: "Bad Detection": the wrong (minimum) element identified as the outlier.

a_n | Data set 6 |         |
a_1 | 100        |         |
a_2 | 101        | MMSmax  | 0.377
a_3 | 102        | MMSmin  | 0.425
a_4 | 103.6      | 2/n     | 0.4
a_5 | 104        | Outlier | Yes-Min

neither the maximum nor the minimum, MMS is unable to locate the outlier (Table 3). We named this phenomenon "Bad Detection". When R_w reaches the "Bad Detection Level", MMS cannot be applied. To overcome this situation, we introduced an improved version of MMS, enhanced MMS (EMMS), based on the missing-data imputation technique of Section 2.6.2.

EMMS is expressed as

EMMSmax = (a^TT_max - a^TT_min) / (S^TT_n - a^TT_min · n), a^TT_max ≠ 0, (14)

EMMSmin = (a^TT_max - a^TT_min) / (a^TT_max · n - S^TT_n), a^TT_max ≠ 0, (15)

where a^TT_k = |a^T_k - x_k (G_aT / G_x)|, a^T_k = a_k - a_0, x_k is the index of the data, a_k is the kth term of the series, k = 0, 1, …, n - 1, n is the number of elements in the current window, G_aT = Σ_{k=0}^{n-1} a^T_k, G_x = Σ_{k=0}^{n-1} x_k, and S^TT_n = Σ_{k=0}^{n-1} a^TT_k ≠ 0.

Figure 4: Solution for the value autotransformation phenomenon: use angular shifting instead of a horizontal shift, where (×) corresponds to the horizontal shift and (·) corresponds to the angular shift.

Every term a^TT_k ≥ 0, and the first term is a^TT_0 = |a^T_0 - x_0 (G_aT / G_x)| = 0; thus a^TT_min = 0, and (14) and (15) simplify to

EMMSmax = a^TT_max / S^TT_n, a^TT_max ≠ 0, (16)

EMMSmin = a^TT_max / (a^TT_max · n - S^TT_n), a^TT_max ≠ 0. (17)
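Equations (16) and (17) can be sketched as follows (illustrative Python; it reproduces the example of Table 4, where the nonextreme element 103.6 is the outlier):

```python
def emms(y):
    """EMMSmax (16) and EMMSmin (17) on the transformed series
    a^TT_k = |a^T_k - x_k * (G_aT / G_x)|, with x_k = 0, 1, ..., n - 1."""
    n = len(y)
    x = range(n)
    aT = [yk - y[0] for yk in y]
    m = sum(aT) / sum(x)                     # G_aT / G_x
    aTT = [abs(a - xk * m) for xk, a in zip(x, aT)]
    s, hi = sum(aTT), max(aTT)
    return hi / s, hi / (hi * n - s)         # a^TT_min = 0 drops out of (16), (17)

e_max, e_min = emms([100, 101, 102, 103.6, 104])
print(round(e_max, 3), round(e_min, 3))      # 0.5 0.333; 0.5 > 2/5 flags the outlier
```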

If there are outliers, EMMSmin > 2/n or EMMSmax > 2/n, and the greater value represents the outlier. Table 4 shows


Table 4: EMMS for identifying an outlier.

x_n | y (a_n) | y^T (a_n - a_1) | y^TT (|y^T - x_n (G^T_y / G_x)|)
0   | 100     | 0.000           | 0.00
1   | 101     | 1.000           | 0.06
2   | 102     | 2.000           | 0.12
3   | 103.6   | 3.600           | 0.42
4   | 104     | 4.000           | 0.24

G_x = 10, G^T_y = 10.6; EMMS (Max) = 0.500, EMMS (Min) = 0.333, 2/n = 0.4; Outlier: Yes-Max.

Figure 5: Transformation of data to a constant value to overcome the missing-data problem, where red diamonds correspond to the form y = f(x), yellow squares to y^T = f(x) - f(x_0), crosses to y′ = m · x^T_i (m = Σ_i y^T_i / Σ_i x^T_i), red circles to y^TT = y^T - y′, and open circles to missing values.

an example calculation of EMMS, and the following expression gives an overview of the EMMS process:

EMMSmax = a^TT_max / S^TT_n:
    > (2/n) + w_2 → the maximum is the outlier;
    ≤ (2/n) + w_2 → the maximum is not an outlier.
EMMSmin = a^TT_max / (a^TT_max · n - S^TT_n):
    > (2/n) + w_2 → the minimum is the outlier;
    ≤ (2/n) + w_2 → the minimum is not an outlier. (18)

However, EMMS uses information derived from the existing data. If there are biased values, this may lead to biased information. Because of that, direct application of EMMS is not good practice. Hence, significant outliers should first be removed using MMS before applying EMMS.

Figure 6: Implementation of MMS and EMMS. The algorithm first checks for significant outliers using MMS (criterion R_w = (2/n) · (1 + k_1)); after removing all significant outliers, it removes the nonsignificant outliers using EMMS (criterion R_w = (2/n) · (1 + k_2)). After each removed outlier, n = n - 1 and the series is recalculated. There is no removed-data imputation in either MMS or EMMS.

2.8. Challenge 7: Determining the Outlier Detection Criterion (R_w). The value R_w is the factor that determines the outliers: w = 0 (R_0 = 2/n) represents exactly a line, and w > 0 represents a linear border with a certain width. In this section we propose several possible methods that can be used to determine the outlier detection criterion.

2.8.1. Express the Value "w" as f(1/n). If the value w is f(1/n), then w = 2 · k/n, k ≤ (n/2) - 1, and k ∈ R+. Then R_w = 2/n + 2 · k/n:

R_w = (2/n) · (1 + k),

R_w / R_0 = 1 + k (= constant). (19)
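Equation (19) in code form (illustrative Python; k = 0.5 and k = 0.01 are the values used for MMS and EMMS in the validation of Section 2.9):

```python
def r_w(n, k):
    """Criterion of (19): R_w = (2/n) * (1 + k), so R_w / R_0 = 1 + k for any n."""
    return (2 / n) * (1 + k)

print(round(r_w(5, 0.5), 6))            # 0.6, the MMS criterion for n = 5, k = 0.5
print(round(r_w(5, 0.5) / (2 / 5), 6))  # 1.5, constant regardless of n
```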

When the MMS or the EMMS value is greater than the R_w of (19), this implies the existence of outliers. Because R_w/R_0 is constant and standardizes R_w, the determination of k still depends on knowledge of the domain. Figure 6 shows an algorithm based on this technique.

2.8.2. When the First and the Last Items Are Nonoutliers. In the overall process, the "Bad Detection Level" is the most important criterion. If the R_w of MMS is below the "Bad Detection Level", nonoutliers may be identified as outliers, as mentioned in Section 2.7. If there is prior knowledge about the outliers, a safe value can be used for MMS; otherwise, there is no 100% guarantee on the "Bad Detection Level".


Table 5: Different environments used to validate the new method.

Type of the original data set      | Number of elements | Type of outliers        | Reference (first) element is an outlier | Initial missing values
Increment, constant, and decrement | 10 to 1000         | Non-Gaussian, Gaussian  | Yes, no                                 | Yes, no

Figure 7: Outlier detection method including the "Bad Detection Level" detection technique. A window is chosen whose first and last terms are nonoutliers; MMS (criterion R_w = (2/n) · (1 + k_1)) is applied first and EMMS (criterion R_w = (2/n) · (1 + k_2)) afterwards, with n = n - 1 after each removed outlier. If the first or the last element is identified as an outlier, a contradictory situation arises; this point is taken as the terminating point of MMS and EMMS.

However, when the first and the last elements are not outliers, the "Bad Detection Level" can be detected automatically: if the first or the last element is identified as an outlier, a contradictory situation arises, and this point can be considered the terminating point of MMS and EMMS. The decision diagram elaborated in Figure 7 expresses the new outlier detection method including the "Bad Detection Level" detection technique.
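The overall procedure of Figures 6 and 7 can be sketched as follows (a simplified, illustrative Python sketch, not the paper's C++ implementation; surviving elements keep their original indices in x, so the EMMS transform of Section 2.6.2 absorbs the gaps left by removed outliers, and the first and last elements are assumed to be nonoutliers):

```python
def detect_outliers(series, k1=0.5, k2=0.01):
    """Return the indices flagged as outliers: an MMS sweep for significant
    outliers, then an EMMS sweep for nonsignificant ones, stopping at the
    "Bad Detection Level" when the first or last element would be flagged."""
    x = list(range(len(series)))               # original indices of survivors
    vals = list(series)
    outliers = []

    def sweep(k, use_emms):
        while len(vals) > 2:
            n = len(vals)
            if use_emms:                       # a^TT of Section 2.7, with true x
                aT = [v - vals[0] for v in vals]
                xT = [xi - x[0] for xi in x]
                m = sum(aT) / sum(xT)
                work = [abs(a - xt * m) for xt, a in zip(xT, aT)]
            else:                              # plain MMS on the raw values
                work = vals
            lo, hi, s = min(work), max(work), sum(work)
            if s - lo * n == 0 or hi * n - s == 0:
                break                          # degenerate (e.g. constant series)
            mms_max = (hi - lo) / (s - lo * n)         # (10) / (16)
            mms_min = (hi - lo) / (hi * n - s)         # (11) / (17)
            if max(mms_max, mms_min) <= (2 / n) * (1 + k):
                break                          # nothing exceeds R_w of (19)
            # Flagged element: largest transformed deviation for EMMS, otherwise
            # the extreme picked by the larger of MMSmax / MMSmin.
            j = work.index(hi) if use_emms else work.index(hi if mms_max >= mms_min else lo)
            if j in (0, n - 1):
                break                          # first/last flagged: "Bad Detection Level"
            outliers.append(x[j])
            del x[j], vals[j]

    sweep(k1, use_emms=False)                  # MMS: significant outliers first
    sweep(k2, use_emms=True)                   # EMMS: then nonsignificant outliers
    return sorted(outliers)

# y = 100 + x with a significant outlier at index 3 and a small one at index 6:
print(detect_outliers([100, 101, 102, 500, 104, 105, 103.9, 107]))  # [3, 6]
```

A production version would also recalculate the shifted points via (13), as the paper's implementation does; this sketch omits that step.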

2.9. Validate the Method. We implemented MMS (with recalculation after an outlier is removed) and EMMS in C++ and conducted the validation process. For the recalculation process, the existing first element of the window was the reference element, and we always used the original value of the element (not its current updated value). To validate the method, we used artificial data sets of different sizes (10 to 1000 elements) representing increasing, decreasing, and constant lines. Then 50% of the items of those data sets were replaced with very small and very large outliers (±1.0e-2 to ±1.0e+2 times the correct value). We checked the data sets for all the environment combinations shown in Table 5. The outlier detection criteria were determined based on (19); for all data sets the same k values were used (k = 0.5 for MMS and k = 0.01 for EMMS). Then we determined the percentage of correctly and falsely detected nonoutliers relative to the number of actual nonoutliers, and the percentage of correctly and falsely detected outliers relative to the total number of outliers (small and large).

2.10. Evaluation Using Real Data. To check the best-linear-fit identification capability, the algorithm was tested using several real data sets, automatically recorded from a biogas plant over a period of seven months at a frequency of twelve data points per day (i.e., every other hour). Among the different parameters, we selected the H2 content, measured in ppm, which we expected to maintain linear behaviour during stable operation. We selected seven segments of different sizes for evaluating the algorithm; some data sets contained initial missing elements. We set the R_w for MMS and EMMS by analysing the first and the third data sets. For the recalculation process, the existing first element of the window was the reference element, and we always used the original value of the elements (not the current updated value). Then we determined the percentage of correctly and falsely detected nonoutliers relative to the total number of nonoutliers, and the percentage of correctly and falsely detected outliers relative to the total number of outliers (small and large).

We decided to use the LSM, the sigma filter, and Grubbs' test [26–29], also known as the maximum normed residual test or "extreme studentized deviate" (ESD) test, to compare our results. We selected Grubbs' test since it has nearly the same formulation as our method. We checked all the biogas data using the abovementioned methods, using each of the data segments as a single window. First we checked the ability of each method to identify the general trend of the series; then we checked the number of correctly and falsely detected outliers and nonoutliers for each method in relation to the general trend.

3. Results and Discussion

Results related to the validation show that, when the reference element (the first element) was not an outlier, the algorithm was capable of identifying all outliers with 0% error, regardless of the type of outliers (Gaussian or non-Gaussian) (Figure 8). If the outliers were Gaussian, there were no significant outliers,


Figure 8: Outlier detection from data sets with ten elements; the first element is the reference element and is not an outlier. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers; the value of k for MMS and EMMS is 0.5 and 0.01, respectively. Panels: (a) increment, non-Gaussian outliers; (b) decrement, non-Gaussian; (c) constant, non-Gaussian; (d) increment, Gaussian; (e) decrement, Gaussian; (f) constant, Gaussian. When the reference (first) element is not an outlier, the new method is capable of locating all outliers; when the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).

and MMS automatically became inactive (Figures 8(d), 8(e), and 8(f)). When the first few elements were outliers and the outliers were non-Gaussian, MMS detected the significant outliers correctly (Figures 9(a), 9(b), and 9(c)). However, EMMS was unable to locate the nonsignificant outliers when the first element for EMMS was an outlier (Figures 9(a) and 9(c)). If the reference element for EMMS was not an outlier, it was still possible to achieve correct results (Figure 9(b)). Though it was impossible to locate all nonoutliers, the detected nonoutliers were 100% correct detections. These values can be used to estimate the other values using methods like LSM, since all the remaining data are now cleaned. In

Figure 9: Outlier detection from data sets with ten elements; the first element is the reference element and is an outlier. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, green squares to nonoutliers, and black arrows to wrong detections; the value of k for MMS and EMMS is 0.5 and 0.01, respectively. Panels: (a) increment, non-Gaussian outliers; (b) decrement, non-Gaussian; (c) constant, non-Gaussian; (d) increment, Gaussian; (e) decrement, Gaussian; (f) constant, Gaussian. When the reference (first) element is an outlier and the outliers are non-Gaussian, the new method identifies only the significant outliers ((a), (b), (c)); when the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).

general, it is fair to state that (1) when the reference element is not an outlier, the method is capable of identifying all outliers, and (2) when the first few elements of the series are outliers and the outliers are non-Gaussian, the method is capable of identifying only the significant outliers and part of the correct elements.

When the first few elements (the reference elements for both MMS and EMMS) were outliers and the outlier distribution was Gaussian, outlier detection was poor (Figures 9(d), 9(e), and 9(f)). Due to the Gaussian distribution of the outliers, MMS was inactive, and it was not possible to identify the large outliers. Most importantly, the results highlighted


Figure 10: Two artificial data samples with 1000 elements each, including missing-value regions of 50, 100, 100, and 50 elements (300 in total). The first element is the reference element and is not an outlier. (a) corresponds to a data set with non-Gaussian outliers and (b) to a data set with nearly Gaussian outliers; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. The new method was able to identify all the elements related to the line with 0% error.

the importance of the reference element: if the reference element for MMS and EMMS was not an outlier, good results were guaranteed regardless of other factors.

In the methodology, we derived the method based on the first element. However, it is also possible to use any other element as the reference point and modify the method accordingly. We considered the simplest situation, in which the first element is not an outlier. Therefore, if it is possible to segment the data so as to exclude extreme outliers at the beginning, the method provides accurate outlier detection. Another possibility is to replace the first element with an already known element. This leads to a further possibility for applying the method: if we know only a single correct element, using that element as the reference element, with the method modified accordingly, can yield very accurate results.

Some model-based approaches demand a trained data set for correct output. In contrast, this method requires only one correct element to produce correct output. In addition, it is possible to use multiple reference points and consider the best fit: for example, (a) consider each point in the first x% (e.g., 10%) of the data points as a reference point, or (b) consider all data points as reference points. Furthermore, it is important to distinguish the purposes of MMS and EMMS: MMS removes only the significant outliers, while EMMS removes the nonsignificant outliers. Depending on the requirement, MMS and/or EMMS can be used to remove outliers.

The results show that the new method is a good solution for managing missing values. Figure 10 shows two data sets with 1000 elements each; each data set contains missing-value regions of 50, 100, 100, and 50 elements (300 in total). When the first element was not an outlier, the new method was able to identify all the elements related to the line with 0% error.

In the real world it is not possible to find nonoutliers that exactly agree with a linear regression; therefore, 100% accuracy is inapplicable. However, it is very important to have a data set free of significant outliers, and the new method guaranteed such a data set when the outliers were non-Gaussian. Furthermore, in real-world situations, outliers are not always Gaussian distributed; we therefore expect the new method to be applicable to the majority of outlier detection applications. Our new method is an effective alternative, since the most common methods, LSM and the sigma filter, need Gaussian outliers. Some methods, like the sigma filter, cannot be applied directly to a certain data segment, and further segmentation (windowing) is required for better results. In contrast, the new method is capable of locating nonoutliers automatically, in increment, decrement, or constant form, regardless of the size of the window.

Results related to the biogas data confirmed the abovementioned idea and showed that the algorithm clearly identifies three regions: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers within a data segment (Figure 11). In addition, the results showed that the nonoutliers follow a linear path. Furthermore, the width of the regions can be tuned by changing the relevant R_w values. Figure 11 shows some selected results of the biogas data for a k value of 0.2 for MMS and a k value of 0.1 for EMMS.

One of the interesting observations was the ability of the algorithm to continue linear detection even with noncontinuous clusters (Figures 11(b) and 11(e)). In all data segments there occurred no false detections (there were no outliers in nonoutlier regions and vice versa). Most importantly, the new method required no further windowing, and nonoutliers were detected independent of the window size.

When the general trend was constant and the elements were Gaussian distributed, the sigma filter and LSM were able to identify the linear trend. However, for series with biased elements, both methods failed to identify the general trend. When the general trend was increasing or decreasing, the sigma filter failed to identify the general trend (a further


Figure 11: Results related to real biogas data (H2 content of biogas, in ppm) for data sets of different sizes: (a) 334 elements, (b) 372, (c) 350, (d) 352, (e) 356, and (f) 372. The first element is the reference element and is assumed not to be an outlier. The algorithm clearly identifies three regions within each data segment: significant outliers (outliers from MMS, red triangles), nonsignificant outliers (outliers from EMMS, yellow circles), and nonoutliers (green squares); most importantly, all the nonoutliers lie within a linear border. The value of k for MMS and EMMS is 0.2 and 0.1, respectively.

segment would give better results, but we used the whole window). The new method was capable of locating 4% to 45% of the elements as outliers with 0% error. Grubbs' test identified only a very small fraction of the elements as outliers (0–17%), even with a significance level of 0.05; however, all of its detected outliers were significant, and no wrong detections were reported.

4. Conclusions and Outlook

This paper introduced a new outlier detection method using the relation of the sum of the elements of an arithmetic progression. The results of this work show that the new method is a robust solution for outlier detection in a data set with missing elements. The method is capable of identifying


both significant and nonsignificant outliers when the first value of the data set is not an outlier. Most importantly, the method is a solution for identifying significant outliers in a series with non-Gaussian-distributed outliers. In addition, the outlier detection is nonparametric, has floor and ceiling values, and does not require standardization. When the reference elements are unknown, the method can be used with multiple reference elements to obtain optimal output.

If the frequency of the data is sufficient, any nonlinear relation can be represented as a combination of straight lines. Therefore, by using a suitable segmentation technique, it is possible to identify outliers in any data series, which will allow for detecting outliers in a process-oriented data set. Thus, to bring a data series into a form that is suitable for our method, an intelligent segmentation technique is necessary.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The University of Ruhuna, Sri Lanka, provided the paper processing charges of this paper. The German Academic Exchange Service (Deutscher Akademischer Austauschdienst) financed this work.

References

[1] R. J. Beckman and R. D. Cook, "Outlier..........s," Technometrics, vol. 25, no. 2, pp. 119–149, 1983.

[2] F. Molinari, "Missing treatments," Journal of Business and Economic Statistics, vol. 28, no. 1, pp. 82–95, 2010.

[3] J. Qui, B. Zhang, and D. H. Y. Leung, "Empirical likelihood in missing data problems," Journal of the American Statistical Association, vol. 104, no. 488, pp. 1492–1503, 2009.

[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.

[5] Z. X. Niu, S. Shi, J. Sun, and X. He, "A survey of outlier detection methodologies and their applications," in Artificial Intelligence and Computational Intelligence, vol. 7002 of Lecture Notes in Computer Science, pp. 380–387, 2011.

[6] W. Jin, A. K. H. Tung, and J. Han, "Mining top-n local outliers in large databases," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 293–298, ACM, San Francisco, Calif, USA, August 2001.

[7] H.-P. Kriegel, M. S. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 444–452, ACM, Las Vegas, Nev, USA, August 2008.

[8] I. Ben-Gal, "Outlier detection," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 131–146, Springer, New York, NY, USA, 2005.

[9] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, December 2002.

[10] H. Fan, O. R. Zaïane, A. Foss, and J. Wu, "A nonparametric outlier detection for effectively discovering top-N outliers from engineering data," in Advances in Knowledge Discovery and Data Mining, W.-K. Ng, M. Kitsuregawa, J. Li, and K. Chang, Eds., vol. 3918 of Lecture Notes in Computer Science, pp. 557–566, Springer, Berlin, Germany, 2006.

[11] J.-S. Lee, "Digital image smoothing and the sigma filter," Computer Vision, Graphics, and Image Processing, vol. 24, no. 2, pp. 255–269, 1983.

[12] A. Gelb, Applied Optimal Estimation, MIT Press, 1974.

[13] D. Sierociuk, I. Tejado, and B. M. Vinagre, "Improved fractional Kalman filter and its application to estimation over lossy networks," Signal Processing, vol. 91, no. 3, pp. 542–552, 2011.

[14] P. H. Abreu, J. Xavier, D. C. Silva, L. P. Reis, and M. Petry, "Using Kalman filters to reduce noise from RFID location system," The Scientific World Journal, vol. 2014, Article ID 796279, 9 pages, 2014.

[15] H. Liu, S. Shah, and W. Jiang, "On-line outlier detection and data cleaning," Computers and Chemical Engineering, vol. 28, no. 9, pp. 1635–1647, 2004.

[16] T. D. Pigott, "A review of methods for missing data," Educational Research and Evaluation, vol. 7, no. 4, pp. 353–383, 2001.

[17] M. Nakai and W. Ke, "Review of the methods for handling missing data in longitudinal data analysis," International Journal of Mathematical Analysis, vol. 5, no. 1–4, pp. 1–13, 2011.

[18] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.

[19] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, L. House, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Germany, 2004.

[20] A. N. Baraldi and C. K. Enders, "An introduction to modern missing data analyses," Journal of School Psychology, vol. 48, no. 1, pp. 5–37, 2010.

[21] J. Tian, B. Yu, D. Yu, and S. Ma, "Clustering-based multiple imputation via gray relational analysis for missing data and its application to aerospace field," The Scientific World Journal, vol. 2013, Article ID 720392, 10 pages, 2013.

[22] S. Zhaowei, Z. Lingfeng, M. Shangjun, F. Bin, and Z. Taiping, "Incomplete time series prediction using max-margin classification of data with absent features," Mathematical Problems in Engineering, vol. 2010, Article ID 513810, 14 pages, 2010.

[23] J.-F. Cai, Z. Shen, and G.-B. Ye, "Approximation of frame based missing data recovery," Applied and Computational Harmonic Analysis, vol. 31, no. 2, pp. 185–204, 2011.

[24] B. L. Wiens and G. K. Rosenkranz, "Missing data in noninferiority trials," Statistics in Biopharmaceutical Research, vol. 5, no. 4, pp. 383–393, 2013.

[25] Aryabhata, The Aryabhatiya of Aryabhata: An Ancient Indian Work on Mathematics and Astronomy, vol. 1, Kessinger Publishing, 2006.

[26] F. E. Grubbs, "Procedures for detecting outlying observations in samples," Technometrics, vol. 11, no. 1, pp. 1–21, 1969.

[27] B. Rosner, "On the detection of many outliers," Technometrics, vol. 17, no. 2, pp. 221–227, 1975.

[28] R. B. Jain, "Percentage points of many-outlier detection procedures," Technometrics, vol. 23, no. 1, pp. 71–75, 1981.

[29] S. P. Verma, L. Díaz-González, M. Rosales-Rivera, and A. Quiroz-Ruiz, "Comparative performance of four single extreme outlier discordancy tests from Monte Carlo simulations," The Scientific World Journal, vol. 2014, Article ID 746451, 27 pages, 2014.



[Figure 1 diagram: the interval (0, 1] with R_0 = 2/n marked "no outliers" and the regions on either side of 2/n marked "Outliers".]

Figure 1: Distribution of the criteria range (0, 1].

to use meaningful symbols that reflect the purpose of the method. The first and the last elements are either the minimum or the maximum. Therefore, it is possible to replace a_1 and a_n by the minimum (a_min) and the maximum (a_max) of the series. Then (3) can be represented as
\[
\frac{2}{n} = \frac{a_{\min} + a_{\max}}{S_n}. \tag{4}
\]
Since the RHS of (4) consists of the minimum, the maximum, and the sum of the series, the RHS was named MMS, with the meaning of minimum, maximum, and sum:
\[
\mathrm{MMS} = \frac{a_{\min} + a_{\max}}{S_n}. \tag{5}
\]

2.3. Challenge 2: Set a Range for the Outlier Detection Criterion. According to (3), the outlier detection criterion is 2/n, and it can be used to check the elements that exactly agree with a line (Figure 1). To identify elements in a certain range, it is necessary to have a criteria range rather than the single value 2/n.

The left-hand side of (4) is the ratio 2/n; we named it R_w by adding a weight "w" to "R". Then
\[
R_w = \frac{2}{n} + w, \qquad 0 \le w \le 1 - \frac{2}{n}. \tag{6}
\]

The status w = 0 (R_0) represents a single line, and w > 0 represents a line with a certain width (a linear border). The outlier criteria range has both a floor (0) and a ceiling (1), and standardization is not required. This is an additional advantage over the most common average, variance, and standard deviation based approaches, which require a separate standardization process.

2.4. Challenge 3: Influence of Negative Values. Due to negative values, the numerator or both the numerator and the denominator of the RHS of (5) can be 0 (e.g., −4, −1, 0, 1, 4) even without outliers. When there are outliers, the RHS of (5) can be negative, which cannot be accepted as a valid value for 2/n: 0 < 2/n ≤ 1 must always hold.

Subtracting the first element (a_{i,new} = a_i − a_min) from each element of any AP creates a new transformed AP where a_min = 0, which guarantees a series without negative values. From (5) and a_{i,new} = a_i − a_min, (7) is derived, which is more robust. Another advantage of (7) is that it performs the transformation automatically:
\[
\mathrm{MMS} = \frac{(a_{\min} - a_{\min}) + (a_{\max} - a_{\min})}{\sum_{i=1}^{n} (a_i - a_{\min})}
= \frac{a_{\max} - a_{\min}}{S_n - a_{\min} \cdot n}. \tag{7}
\]

2.5. Challenge 4: Uneven Distribution of the Criteria Range. The ranges (0, 2/n) and (2/n, 1] are used to identify outliers that are minimums and maximums, respectively (Figure 1). When n → ∞ and R_0 → 0, R_w ∈ (0, 1] is not equally distributed, which provides a large range for maximum outliers and a small range for minimum outliers. This is a problem when locating minimum outliers.

To solve this, we used the idea of the complement. For any series, this converts the maximum value into the minimum, the minimum value into the maximum, and intermediate values into their complements. Most importantly, the minimum value now represents the maximum value of the original series and vice versa, while still representing the original series. The complement of an element in a series can be defined as a_{i,c} = (a_max + a_min) − a_i. From (5) and this definition,
\[
\mathrm{MMS} = \frac{(a_{\max} + a_{\min} - a_{\max}) + (a_{\max} + a_{\min} - a_{\min})}{\sum_{i=1}^{n} (a_{\max} + a_{\min} - a_i)}
= \frac{a_{\min} + a_{\max}}{\sum_{i=1}^{n} (a_{\max} + a_{\min} - a_i)}. \tag{8}
\]

Applying a_{i,new} = a_i − a_min (to remove the effect of negative values) gives
\[
\mathrm{MMS} = \frac{(a_{\min} - a_{\min}) + (a_{\max} - a_{\min})}{\sum_{i=1}^{n} \big((a_{\max} - a_{\min}) + (a_{\min} - a_{\min}) - (a_i - a_{\min})\big)}
= \frac{a_{\max} - a_{\min}}{\sum_{i=1}^{n} (a_{\max} - a_i)}
= \frac{a_{\max} - a_{\min}}{a_{\max} \cdot n - S_n}. \tag{9}
\]

Consequently, the range R_0 > 2/n now represents the range for minimum outliers of the original series and vice versa (Figure 2), and it is possible to ignore the range (0, 2/n). In addition, (9) performs the transformation automatically.

Now there are two equations for MMS, (7) and (9), to check whether the maximum or the minimum of the series is an outlier. We named the two versions of MMS MMSmax (10) and MMSmin (11):
\[
\mathrm{MMS}_{\max} = \frac{a_{\max} - a_{\min}}{S_n - a_{\min} \cdot n}, \tag{10}
\]
\[
\mathrm{MMS}_{\min} = \frac{a_{\max} - a_{\min}}{a_{\max} \cdot n - S_n}. \tag{11}
\]


Table 2: Sample calculations illustrating the relation between 2/n and MMSmax and MMSmin.

a_n        Data set 1   Data set 2   Data set 3   Data set 4   Data set 5
a_1        100          100          100          99.99        1
a_2        101          101          101          101          101
a_3        102          102          102          102          102
a_4        103          103          103          103          103
a_5        104          104.01       204          104          104
MMS (Max)  0.4          0.401        0.945        0.399        0.254
MMS (Min)  0.4          0.399        0.254        0.401        0.945
2/n        0.4          0.4          0.4          0.4          0.4
Outlier    —            Yes-Max      Yes-Max      Yes-Min      Yes-Min
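As an illustration, the two ratios in (10) and (11) can be computed directly from a window of values. The following sketch (not the authors' C++ implementation) reproduces data set 3 of Table 2:

```python
def mms_max(a):
    # Eq. (10): (a_max - a_min) / (S_n - a_min * n)
    n, s = len(a), sum(a)
    return (max(a) - min(a)) / (s - min(a) * n)

def mms_min(a):
    # Eq. (11): (a_max - a_min) / (a_max * n - S_n)
    n, s = len(a), sum(a)
    return (max(a) - min(a)) / (max(a) * n - s)

# Data set 3 from Table 2: the maximum (204) is an outlier
ds3 = [100, 101, 102, 103, 204]
print(round(mms_max(ds3), 3), round(mms_min(ds3), 3))  # 0.945 0.254, versus 2/n = 0.4
```

Since MMSmax exceeds 2/n = 0.4 while MMSmin falls below it, the maximum is flagged, matching the "Yes-Max" entry in the table.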

[Figure 2 diagram: R_w for the original series spans (0, 1], with minimum outliers below R_0 = 2/n and maximum outliers above it; R_w for the complement series mirrors this, its maximum outliers representing the minimum outliers of the original series.]

Figure 2: Range of R_w for the original series and for the complement of the original series.

The following equation shows an overview of the MMS process:
\[
\begin{cases}
\mathrm{MMS}_{\max} = \dfrac{a_{\max} - a_{\min}}{S_n - a_{\min} \cdot n} > \dfrac{2}{n} + w_1 & \Longrightarrow \text{maximum is the outlier},\\[6pt]
\mathrm{MMS}_{\min} = \dfrac{a_{\max} - a_{\min}}{a_{\max} \cdot n - S_n} > \dfrac{2}{n} + w_1 & \Longrightarrow \text{minimum is the outlier},
\end{cases} \tag{12}
\]
and Table 2 shows sample calculations using (10) and (11) for the same data sets as in Table 1.

2.6. Challenge 5: How to Deal with Removed Outliers/Missing Values. In a series, there can be initial missing values. In addition, if there is no replacement after removing an outlier, this also creates a missing value environment. Without filling, the elements after the removed element are transformed into other values, destroying the original relationship of the elements (Figure 3). These transformed values become outliers in relation to the original data. Therefore, for using the relation of the AP, it is compulsory to maintain the original relation of the data even after removing an outlier. Thus, any plain rejection technique is not feasible. To maintain the original relation, one possible way is replacing the missing value. However, the data we are considering contain a considerable amount of outliers. Therefore, we cannot guarantee that an element derived from the existing elements is not an outlier.

To overcome this problem, we considered two different options: (1) recalculate only the data points after (or before) the removed or missing element, thereby maintaining the initial angle in relation to a certain point, or (2) transform the elements into a new series in which the missing value has no effect.

2.6.1. Recalculate the Data Points after (or before) Removed and Missing Elements. If there is a missing element, the next elements will be shifted horizontally and transformed into wrong values in relation to the current index of the elements (Figure 3). However, angular shifting will not introduce such an error (Figure 3).

In Figure 4, the plot consists of elements a_0 to a_{r+1} (r ∈ ℝ+), and the element a_r at index r needs to be removed. After removing element r, element r+1 becomes element r, element r+2 becomes element r+1, and so on. However, by shifting while maintaining the same angle with respect to a certain reference element (e.g., the first element), the same form of the series can be maintained. Equation (13) shows the new value after angular shifting. We used this technique with the MMS algorithm to recalculate the series after (or before) missing values or removed elements:
\[
B_r T_r = \left( \frac{B_{r+1} C_{r+1}}{A B_{r+1}} \right) \cdot A B_r
= \left( \frac{a_{r+1} - a_0}{r + 1} \right) \cdot r,
\qquad (a_{r+1})_{\mathrm{new}} = a_0 + B_r T_r. \tag{13}
\]
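The angular-shifting recalculation in (13) amounts to scaling the vertical offset of the shifted element by the ratio of the new and old index distances from the reference element. A minimal sketch, with a hypothetical element at index r = 2 removed from the line y = 10x:

```python
def angular_shift(a0, a_next, r):
    # Eq. (13): after removing the element at index r, the element that
    # moves from index r+1 to index r keeps its angle to the reference
    # element a0: new value = a0 + ((a_{r+1} - a0) / (r + 1)) * r
    return a0 + ((a_next - a0) / (r + 1)) * r

# Hypothetical line y = 10x with the element at index 2 removed:
# the element 30 (old index 3) shifts to index 2 and stays on the line.
print(angular_shift(0, 30, 2))  # 20.0
```

A plain horizontal shift would instead place the value 30 at index 2, off the original line; the angular shift keeps the shifted series in the same form as the original.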

2.6.2. Transformation of Data to a Constant Value. A series with a constant value (the y = c form, where c is a constant) is a series that is not affected by missing values. Because of that, if it is possible to transform any linear series into the y = c form, the transformed series is free of any effect of missing values. After that, the transformed series can be used for outlier detection.

Let y_T be a linear series where y_{T,k} = y_k − y_1, x_{T,k} = x_k − x_1, x_k is the initial index of the elements, and y_k is the kth element of the series, k = 1, 2, …, n. The gradient of the line (m) is given by Σ_{k=1}^{n} y_{T,k} / Σ_{k=1}^{n} x_{T,k}. If one element (e.g., the first element) is (0, 0), this relation is always true even with missing values. The element (0, 0) can be considered the reference element. Thus y_T is a series with first element (0, 0) and a gradient m that can be calculated even with missing values. It is also possible to derive a new series y′ where y′_k = x_k ⋅ m. If there are no outliers, both y_T and y′ coincide and y_T − y′ = 0. If y_TT = y_T − y′, then y_TT is in the y = c form without any influence from missing values. Therefore, this is another method to overcome missing values without replacing them (Figure 5).
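The transformation can be sketched as follows; the function and data are illustrative (a hypothetical line y = 2x + 1 with the element at index 3 missing), not the paper's implementation:

```python
def to_constant(xs, ys):
    # Transform a linear series into the y = c form (here c = 0) so that
    # a missing index has no effect. Reference element: (xs[0], ys[0]).
    xt = [x - xs[0] for x in xs]          # x_T,k = x_k - x_1
    yt = [y - ys[0] for y in ys]          # y_T,k = y_k - y_1
    m = sum(yt) / sum(xt)                 # gradient from sums, robust to gaps
    return [y - m * x for x, y in zip(xt, yt)]   # y_TT = y_T - y'

# Hypothetical line y = 2x + 1 with the element at index 3 missing:
# without outliers, every residual is 0.
print(to_constant([0, 1, 2, 4, 5], [1, 3, 5, 9, 11]))  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

Note that the gap at index 3 does not disturb the gradient: the sums in m still satisfy Σy_T / Σx_T = 2 for the remaining points, so the residual series stays constant.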

2.7. Challenge 6: Locate Outliers That Are Neither the Maximum Nor the Minimum of the Series. When the outlier is


Figure 3: (a) Data set with an outlier at index r. (b) Value autotransformation effect after removing the outlier at index r without replacement, where circles correspond to elements after removing the outlier, red triangles correspond to expected (correct) elements, and squares correspond to the initial values of the shifted elements after removing the outlier.

Table 3: "Bad Detection": the wrong (minimum) element is identified as the outlier.

a_n    Data set 6
a_1    100
a_2    101        MMSmax     0.377
a_3    102        MMSmin     0.425
a_4    103.6      2/n        0.4
a_5    104        Outlier    Yes-Min

neither the maximum nor the minimum, MMS is unable to locate the outlier (Table 3). We named this phenomenon "Bad Detection". When R_w reaches the "Bad Detection Level", MMS cannot be applied. To overcome this situation, we introduced an improved version of MMS, the enhanced MMS (EMMS), based on the missing data imputation technique in Section 2.6.2.

EMMS is expressed as
\[
\mathrm{EMMS}_{\max} = \frac{a_{TT,\max} - a_{TT,\min}}{S_{TT,n} - a_{TT,\min} \cdot n}, \qquad a_{TT,\max} \ne 0, \tag{14}
\]
\[
\mathrm{EMMS}_{\min} = \frac{a_{TT,\max} - a_{TT,\min}}{a_{TT,\max} \cdot n - S_{TT,n}}, \qquad a_{TT,\max} \ne 0, \tag{15}
\]

where a_{TT,k} = |a_{T,k} − x_k (G_{a_T} / G_x)|, a_{T,k} = a_k − a_0, x_k is the index of the data, a_k is the kth term of the series, k = 0, 1, …, n − 1, n is the number of elements in the current window, G_{a_T} = Σ_{k=0}^{n−1} a_{T,k}, G_x = Σ_{k=0}^{n−1} x_k, and S_{TT,n} = Σ_{k=0}^{n−1} a_{TT,k}.

[Figure 4 diagram: elements a_0 … a_{r+2} with angles θ_{r+1}, θ_{r+2} at the reference point A and construction points B_r, B_{r+1}, C_{r+1}, T_r, T_{r+1} around the removed index r.]

Figure 4: Solution for the value autotransformation phenomenon: use angular shifting instead of horizontal shifting, where (×) corresponds to the horizontal shift and (•) corresponds to the angular shift.

The term a_{TT,k} is always ≥ 0; thus a_{TT,min} = 0. Then (14) and (15) simplify to
\[
\mathrm{EMMS}_{\max} = \frac{a_{TT,\max}}{S_{TT,n}}, \qquad a_{TT,\max} \ne 0, \tag{16}
\]
\[
\mathrm{EMMS}_{\min} = \frac{a_{TT,\max}}{a_{TT,\max} \cdot n - S_{TT,n}}, \qquad a_{TT,\max} \ne 0. \tag{17}
\]

If there are outliers, EMMSmin > 2/n or EMMSmax > 2/n, and the greater value represents the outlier. Table 4 shows


Table 4: EMMS for identifying an outlier.

x (n − 1)   y (a_n)   y_T (a_n − a_1)   y_TT (|y_T − x_n (G_{y_T}/G_x)|)
0           100       0.000             0.00
1           101       1.000             0.06
2           102       2.000             0.12
3           103.6     3.600             0.42
4           104       4.000             0.24

G_x = 10, G_{y_T} = 10.6
EMMS (Max)  0.500
EMMS (Min)  0.333
2/n         0.4
Outlier     Yes-Max
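The EMMS computation of (16) and (17) can be sketched as below; the function name is illustrative, and the example reproduces Tables 3 and 4 (the outlier 103.6 that MMS misses because it is neither the minimum nor the maximum):

```python
def emms(a):
    # EMMS per Eqs. (16) and (17): transform the window toward y = c form,
    # then compare the largest residual with the sum of residuals.
    n = len(a)
    at = [v - a[0] for v in a]                    # a_T,k = a_k - a_0
    m = sum(at) / sum(range(n))                   # gradient G_aT / G_x
    att = [abs(at[k] - k * m) for k in range(n)]  # a_TT,k = |a_T,k - x_k * m|
    s, top = sum(att), max(att)
    return top / s, top / (top * n - s)           # (EMMSmax, EMMSmin)

# Data set 6 from Table 3: EMMSmax = 0.5 > 2/n = 0.4 flags the element
# with the largest residual (103.6), which MMS could not locate.
mx, mn = emms([100, 101, 102, 103.6, 104])
print(round(mx, 3), round(mn, 3))  # 0.5 0.333
```

The residuals computed inside `att` match the y_TT column of Table 4 (0, 0.06, 0.12, 0.42, 0.24).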

[Figure 5 diagram: values plotted against index 0, 1, 2, …, n − 1.]

Figure 5: Transformation of data to a constant value to overcome the missing data problem, where red diamonds correspond to the y = f(x) form, yellow squares to the y_T = f(x) − f(x_0) form, crosses to the y′ = m·x_i form (m = Σ y_{T,i} / Σ x_{T,i}), red circles to the y_TT = y_T − y′ form, and open circles to missing values.

an example calculation of EMMS, and the following equation shows an overview of the EMMS process:
\[
\begin{cases}
\mathrm{EMMS}_{\max} = \dfrac{a_{TT,\max}}{S_{TT,n}} > \dfrac{2}{n} + w_2 & \Longrightarrow \text{maximum is the outlier},\\[6pt]
\mathrm{EMMS}_{\min} = \dfrac{a_{TT,\max}}{a_{TT,\max} \cdot n - S_{TT,n}} > \dfrac{2}{n} + w_2 & \Longrightarrow \text{minimum is the outlier}.
\end{cases} \tag{18}
\]

However, EMMS uses information derived from the existing data. If there are biased values, this may lead to biased information. Because of that, direct application of EMMS is not good practice. Hence, significant outliers should first be removed using MMS before applying EMMS.

[Figure 6 flowchart: MMS is applied with R_w = (2/n)(1 + k_1); each detected outlier is removed (n = n − 1) and the series recalculated until no further outlier is detected; then EMMS is applied in the same loop with R_w = (2/n)(1 + k_2).]

Figure 6: Implementation of MMS and EMMS. Initially, the algorithm checks for significant outliers using MMS. After removing all significant outliers, it removes the nonsignificant outliers using EMMS. There is no removed data imputation in relation to either MMS or EMMS.

2.8. Challenge 7: Determining the Outlier Detection Criterion (R_w). The value R_w is the factor that determines the outliers: w = 0 (R_0 = 2/n) represents exactly a line, and w > 0 represents a linear border with a certain width. In this section we propose several possible methods that can be used to determine the outlier detection criterion.

2.8.1. Express the Value "w" as f(1/n). If the value w is f(1/n), then w = 2k/n, k ≤ (n/2) − 1, and k ∈ ℝ+. Then R_w = 2/n + 2k/n, that is,
\[
R_w = \frac{2}{n} (1 + k), \qquad \frac{R_w}{R_0} = 1 + k \ (= \text{constant}). \tag{19}
\]

When the MMS or the EMMS value is greater than the R_w of (19), this implies the existence of outliers. Although R_w/R_0 is constant and gives a standard for R_w, the determination of k still depends on knowledge of the domain. Figure 6 shows an algorithm based on this technique.
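A minimal sketch of this criterion in use: repeated MMS passes with R_w = (2/n)(1 + k), removing the flagged extreme until no outlier remains. It is a simplification of the full algorithm of Figure 6 (no angular recalculation and no EMMS stage), and the data are illustrative:

```python
def detect_significant(a, k=0.5):
    # One MMS stage with the criterion R_w = (2/n)(1 + k) from Eq. (19).
    # Simplified sketch: the remaining points are not recalculated after
    # a removal, unlike the full method.
    a, outliers = list(a), []
    while len(a) >= 3 and max(a) > min(a):
        n, s, lo, hi = len(a), sum(a), min(a), max(a)
        rw = (2.0 / n) * (1 + k)
        mms_max = (hi - lo) / (s - lo * n)   # Eq. (10)
        mms_min = (hi - lo) / (hi * n - s)   # Eq. (11)
        if mms_max > rw and mms_max >= mms_min:
            outliers.append(hi); a.remove(hi)
        elif mms_min > rw:
            outliers.append(lo); a.remove(lo)
        else:
            break
    return outliers

print(detect_significant([100, 101, 102, 103, 104, 500]))  # [500]
```

With k = 0.5, as used in the validation, the clean window [100, …, 104] yields MMS ratios of 0.4 against R_w = 0.6, so the loop terminates without false detections.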

2.8.2. When the First and the Last Items Are Nonoutliers. In the total process, the "Bad Detection Level" is the most important criterion. If the R_w of MMS is less than the "Bad Detection Level", it is possible to identify nonoutliers as outliers, as mentioned in Section 2.7. If there is preknowledge about the outliers, it is possible to use a safe value for MMS. Otherwise, there is no 100% guarantee on the "Bad Detection Level".


Table 5: Different environments used to validate the new method.

Type of the original data set: increment, constant, and decrement
Number of elements: 10 to 1000
Type of outliers: non-Gaussian, Gaussian
Reference (first) element is an outlier: yes, no
Initial missing values: yes, no

[Figure 7 flowchart: a window whose first and last terms are nonoutliers is processed with MMS (R_w = (2/n)(1 + k_1)) and then EMMS (R_w = (2/n)(1 + k_2)); if a detected outlier is the first or the last term, the procedure stops.]

Figure 7: Outlier detection method including the "Bad Detection Level" detection technique. The first and the last data points of the window must be nonoutliers. If the first or the last element is identified as an outlier, a contradictory situation arises; thus, this point can be considered the terminating point of MMS and EMMS.

However, when the first and the last elements are not outliers, the "Bad Detection Level" can be detected automatically. If the first or the last element is identified as an outlier, a contradictory situation arises; thus, this point can be considered the terminating point of MMS and EMMS. The decision diagram elaborated in Figure 7 expresses the new outlier detection method including the "Bad Detection Level" detection technique.

2.9. Validation of the Method. We implemented the MMS (with recalculation after an outlier is removed) and EMMS in C++ and conducted the validation process. For the recalculation process, the existing first element of the window was the reference element, and we always used the original value of the element (not the current updated value). To validate the method, we used artificial data sets of different sizes (10 to 1000 elements) representing increasing, decreasing, and constant lines. Then 50% of the items of those data sets were replaced with very small and very large outliers (±1.0e−2 to ±1.0e+2 times the correct value). We checked the data sets for all the environment combinations shown in Table 5. The outlier detection criteria were determined based on (19). For all data sets, the same k value was used (for MMS, k = 0.5, and for EMMS, k = 0.01). Then the percentage of correctly and falsely detected nonoutliers in relation to the number of actual nonoutliers, and the percentage of correctly and falsely detected outliers from the total number of outliers (small and large outliers), were determined.
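The artificial validation data could be generated along these lines; the exact placement and scaling scheme of the outliers is not fully specified in the text, so the function, its name, and its parameters are illustrative assumptions:

```python
import random

def make_test_series(n=10, slope=5.0, intercept=100.0, frac=0.5, seed=1):
    # Build a line y = intercept + slope * x, then replace frac of the
    # elements (never the first, reference element) with outliers scaled
    # by a factor of magnitude 1.0e-2 to 1.0e+2 (assumed scheme).
    random.seed(seed)
    y = [intercept + slope * x for x in range(n)]
    for i in random.sample(range(1, n), int(frac * n)):
        y[i] *= random.choice([-1, 1]) * 10 ** random.uniform(-2, 2)
    return y

series = make_test_series()
print(len(series), series[0])  # 10 100.0
```

Keeping the first element untouched mirrors the validation setting in which the reference element is not an outlier.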

2.10. Evaluation Using Real Data. To check the best linear fitting identification capability, the algorithm was tested using several real data sets, which were automatically recorded with a frequency of twelve data points per day (i.e., every other hour) from a biogas plant over a period of seven months. Among the different parameters, we selected the H2 content, measured in ppm, which we expected to maintain linear behaviour during stable operation. We selected seven segments of different sizes for evaluating the algorithm. In some data sets there were initial missing elements. We set the R_w for MMS and EMMS by analysing the first and the third data sets. For the recalculation process, the existing first element of the window was the reference element, and we always used the original value of the elements (not the current updated value). Then the percentage of correctly and falsely detected nonoutliers in relation to the total number of nonoutliers, and the percentage of correctly and falsely detected outliers from the total number of outliers (small and large outliers), were determined.

We decided to use the LSM, the sigma filter, and Grubbs' test [26–29], also known as the maximum normed residual test or "extreme studentized deviate" (ESD) test, to compare our results. We selected Grubbs' test since it has nearly the same formulation as our method. We checked all the biogas data using the abovementioned methods, using each of the data segments as a single window. First, we checked the ability of each method to identify the general trend of the series. Then we checked the number of correctly and falsely detected outliers and nonoutliers for each method in relation to the general trend.

3. Results and Discussion

Results related to the validation show that, when the reference element (the first element) was not an outlier, the algorithm was capable of identifying all outliers with 0% error regardless of the type of outliers (Gaussian or non-Gaussian) (Figure 8). If the outliers were Gaussian, there were no significant outliers,


[Figure 8 panels (value versus index, ten elements each): (a) data type: increment, outlier type: non-Gaussian; (b) decrement, non-Gaussian; (c) constant, non-Gaussian; (d) increment, Gaussian; (e) decrement, Gaussian; (f) constant, Gaussian.]

Figure 8: Outlier detection from data sets with ten elements. The first element is the reference element, which is not an outlier, where red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is not an outlier, the new method is capable of locating all outliers. When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).

and MMS automatically became inactive (Figures 8(d), 8(e), and 8(f)). When the first few elements were outliers and the outliers were non-Gaussian, MMS detected the significant outliers correctly (Figures 9(a), 9(b), and 9(c)). However, EMMS was unable to locate the nonsignificant outliers when the first element for EMMS was an outlier (Figures 9(a) and 9(c)). If the reference element for EMMS was not an outlier, it was still possible to achieve correct results (Figure 9(b)). Though it was impossible to locate all nonoutliers, the detected nonoutliers were 100% correct detections. These values can be used to estimate the other values using methods like LSM, since now all the existing data are cleaned. In


[Figure 9 panels (value versus index, ten elements each): (a) data type: increment, outlier type: non-Gaussian; (b) decrement, non-Gaussian; (c) constant, non-Gaussian; (d) increment, Gaussian; (e) decrement, Gaussian; (f) constant, Gaussian.]

Figure 9: Outlier detection from data sets with ten elements. The first element is the reference element, which is an outlier, where red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, green squares to nonoutliers, and black arrows to wrong detections. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is an outlier and the outliers are non-Gaussian, the new method identifies only the significant outliers ((a), (b), (c)). When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).

general, it is fair to state that (1) when the reference element is not an outlier, the method is capable of identifying all outliers, and (2) when the first few elements of the series are outliers and the outliers are non-Gaussian, the method is capable of identifying only the significant outliers and part of the correct elements.

When the first few elements (reference elements for both MMS and EMMS) were outliers and the outlier distribution was Gaussian, outlier detection was poor (Figures 9(d), 9(e), and 9(f)). Due to the Gaussian distribution of outliers, MMS was inactive, and it was not possible to identify the large outliers. Most importantly, the results highlighted


Figure 10: Two artificial data samples with 1000 elements each, including 50, 100, 100, and 50 (total 300) missing-value regions. The first element is the reference element, which is not an outlier. Panel (a) corresponds to a data set with outliers in non-Gaussian distribution; panel (b) corresponds to a data set with outliers in nearly Gaussian distribution. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. The new method was able to identify all the elements related to the line with 0% error.

the importance of the reference element. If the reference element for MMS and EMMS was not an outlier, it guaranteed good results despite other factors.

In the methodology, we derived the method based on the first element. However, it is also possible to use any other element as the reference point and modify the method accordingly. We considered the simplest situation, where the first element is not an outlier. Therefore, if it is possible to segment the data excluding extreme outliers at the beginning, it provides accurate outlier detection. Another possibility is to replace the first element with an already known element. This leads to a further possibility for applying the method: if we know only a single correct element, using that element as the reference element, with the method modified according to that reference element, can yield very accurate results.

Some model-based approaches demand a trained data set for correct output. In contrast, this method requires only one correct element to produce a correct output. In addition, it is possible to use multiple reference points and consider the best fitting: for example, (a) consider each point in the first x% (e.g., 10%) of data points as a reference point, or (b) consider all data points as reference points. Furthermore, it is important to distinguish the purposes of MMS and EMMS: MMS removes only the significant outliers, while EMMS removes nonsignificant outliers. Depending on the requirement, MMS and/or EMMS can be used to remove outliers.

The results show that the new method is a good solution for managing missing values. Figure 10 shows two data sets with 1000 elements each. Each data set contains 50, 100, 100, and 50 (total 300) missing-value regions. When the first element was not an outlier, the new method was able to identify all the elements related to the line with 0% error.

In the real world, it is not possible to find nonoutliers that exactly agree with a linear regression. Therefore, 100% accuracy is inapplicable. However, it is very important to have a significant-outlier-free data set. The new method guaranteed a significant-outlier-free data set when the outliers were non-Gaussian. Furthermore, in real-world situations, data outliers are not always in Gaussian distribution. Because of this, we expect the new method to be applicable to the majority of outlier detection applications. Our new method is an effective solution, whereas the most common methods, LSM and the sigma filter, need Gaussian outliers. Some methods, like the sigma filter, cannot be applied directly to a given data segment, and further segmentation (windowing) is required for better results. In contrast, the new method is capable of locating nonoutliers automatically in increment, decrement, or constant form, regardless of the size of the window.

Results related to biogas data proved the abovementioned idea and showed that the algorithm clearly identifies three regions within a data segment: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers (Figure 11). In addition, the results showed that the nonoutliers follow a linear path. Furthermore, the width of the regions can be tuned by changing the relevant R_w values. Figure 11 shows some selected results of biogas data for a k value of 0.2 for MMS and a k value of 0.1 for EMMS.

One of the interesting observations was the ability of the algorithm to continue linear detection even with noncontinuous clusters (Figures 11(b) and 11(e)). In all data segments there occurred no false detection (there were no outliers in nonoutlier regions and vice versa). Most importantly, the new method required no further windowing, and nonoutliers were detected independent of the window size.

When the general trend was constant and elements were in Gaussian distribution, the sigma filter and LSM were able to identify the linear trend. However, for series with biased elements, both methods failed to identify the general trend. When the general trend was increment or decrement, the sigma filter failed to identify the general trend (a further


(a) Elements: 334
(b) Elements: 372
(c) Elements: 350
(d) Elements: 352
(e) Elements: 356
(f) Elements: 372

Figure 11: Results related to real biogas data (H2 content of biogas, in ppm, versus index) for data sets of different sizes. The first element is the reference element, which is assumed not to be an outlier. Results showed that the algorithm clearly identifies three regions within each data segment: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers. Most importantly, all the nonoutliers lay within a linear border. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.2 and 0.1, respectively.

segment would give a better result, but we used the whole window). The new method was capable of locating 4% to 45% of the elements as outliers with 0% error. Grubbs' test was capable of identifying only a very small number of elements as outliers (0–17), even with a significance level of 0.05. However, all detected outliers were significant, and no wrong detections were reported.

4. Conclusions and Outlook

This paper introduced a new outlier detection method using the relation of the sum of the elements of an arithmetic progression. The results of this work show that the new method is a robust solution for outlier detection in a data set with missing elements. The method is capable of identifying


both significant and nonsignificant outliers when the first value of the data set is not an outlier. Most importantly, the method is a solution for identifying significant outliers in a series with outliers in non-Gaussian distribution. In addition, the outlier detection is nonparametric, has floor and ceiling values, and does not require standardization. When the reference elements are unknown, the method can be used with multiple reference elements to gain optimal output.

If the frequency of the data is sufficient, any nonlinear relation can be represented as a combination of straight lines. Therefore, by using a suitable segmentation technique, it is possible to identify outliers in any data series. This will allow for detecting outliers in a process-oriented data set. Therefore, to bring a data series into a form that is suitable for our method, an intelligent segmentation technique is necessary.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The University of Ruhuna, Sri Lanka, provided the paper processing charges of this paper. The German Academic Exchange Service (German: Deutscher Akademischer Austauschdienst) financed this work.

References

[1] R. J. Beckman and R. D. Cook, "Outlier..........s," Technometrics, vol. 25, no. 2, pp. 119–149, 1983.

[2] F. Molinari, "Missing treatments," Journal of Business and Economic Statistics, vol. 28, no. 1, pp. 82–95, 2010.

[3] J. Qin, B. Zhang, and D. H. Y. Leung, "Empirical likelihood in missing data problems," Journal of the American Statistical Association, vol. 104, no. 488, pp. 1492–1503, 2009.

[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.

[5] Z. X. Niu, S. Shi, J. Sun, and X. He, "A survey of outlier detection methodologies and their applications," in Artificial Intelligence and Computational Intelligence, vol. 7002 of Lecture Notes in Computer Science, pp. 380–387, 2011.

[6] W. Jin, A. K. H. Tung, and J. Han, "Mining top-n local outliers in large databases," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 293–298, ACM, San Francisco, Calif, USA, August 2001.

[7] H.-P. Kriegel, M. S. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 444–452, ACM, Las Vegas, Nev, USA, August 2008.

[8] I. Ben-Gal, "Outlier detection," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 131–146, Springer, New York, NY, USA, 2005.

[9] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, December 2002.

[10] H. Fan, O. R. Zaïane, A. Foss, and J. Wu, "A nonparametric outlier detection for effectively discovering top-N outliers from engineering data," in Advances in Knowledge Discovery and Data Mining, W.-K. Ng, M. Kitsuregawa, J. Li, and K. Chang, Eds., vol. 3918 of Lecture Notes in Computer Science, pp. 557–566, Springer, Berlin, Germany, 2006.

[11] J.-S. Lee, "Digital image smoothing and the sigma filter," Computer Vision, Graphics, and Image Processing, vol. 24, no. 2, pp. 255–269, 1983.

[12] A. Gelb, Applied Optimal Estimation, MIT Press, 1974.

[13] D. Sierociuk, I. Tejado, and B. M. Vinagre, "Improved fractional Kalman filter and its application to estimation over lossy networks," Signal Processing, vol. 91, no. 3, pp. 542–552, 2011.

[14] P. H. Abreu, J. Xavier, D. C. Silva, L. P. Reis, and M. Petry, "Using Kalman filters to reduce noise from RFID location system," The Scientific World Journal, vol. 2014, Article ID 796279, 9 pages, 2014.

[15] H. Liu, S. Shah, and W. Jiang, "On-line outlier detection and data cleaning," Computers and Chemical Engineering, vol. 28, no. 9, pp. 1635–1647, 2004.

[16] T. D. Pigott, "A review of methods for missing data," Educational Research and Evaluation, vol. 7, no. 4, pp. 353–383, 2001.

[17] M. Nakai and W. Ke, "Review of the methods for handling missing data in longitudinal data analysis," International Journal of Mathematical Analysis, vol. 5, no. 1–4, pp. 1–13, 2011.

[18] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.

[19] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, L. House, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Germany, 2004.

[20] A. N. Baraldi and C. K. Enders, "An introduction to modern missing data analyses," Journal of School Psychology, vol. 48, no. 1, pp. 5–37, 2010.

[21] J. Tian, B. Yu, D. Yu, and S. Ma, "Clustering-based multiple imputation via gray relational analysis for missing data and its application to aerospace field," The Scientific World Journal, vol. 2013, Article ID 720392, 10 pages, 2013.

[22] S. Zhaowei, Z. Lingfeng, M. Shangjun, F. Bin, and Z. Taiping, "Incomplete time series prediction using max-margin classification of data with absent features," Mathematical Problems in Engineering, vol. 2010, Article ID 513810, 14 pages, 2010.

[23] J.-F. Cai, Z. Shen, and G.-B. Ye, "Approximation of frame based missing data recovery," Applied and Computational Harmonic Analysis, vol. 31, no. 2, pp. 185–204, 2011.

[24] B. L. Wiens and G. K. Rosenkranz, "Missing data in noninferiority trials," Statistics in Biopharmaceutical Research, vol. 5, no. 4, pp. 383–393, 2013.

[25] Aryabhata, The Aryabhatiya of Aryabhata: An Ancient Indian Work on Mathematics and Astronomy, vol. 1, Kessinger Publishing, 2006.

[26] F. E. Grubbs, "Procedures for detecting outlying observations in samples," Technometrics, vol. 11, no. 1, pp. 1–21, 1969.

[27] B. Rosner, "On the detection of many outliers," Technometrics, vol. 17, no. 2, pp. 221–227, 1975.

[28] R. B. Jain, "Percentage points of many-outlier detection procedures," Technometrics, vol. 23, no. 1, pp. 71–75, 1981.

[29] S. P. Verma, L. Díaz-González, M. Rosales-Rivera, and A. Quiroz-Ruiz, "Comparative performance of four single extreme outlier discordancy tests from Monte Carlo simulations," The Scientific World Journal, vol. 2014, Article ID 746451, 27 pages, 2014.



Table 2: Sample calculations illustrating the relation between 2/n and MMS_max and MMS_min.

a_n         Data set 1   Data set 2   Data set 3   Data set 4   Data set 5
a_1         100          100          100          99.99        1
a_2         101          101          101          101          101
a_3         102          102          102          102          102
a_4         103          103          103          103          103
a_5         104          104.01       204          104          104
MMS (Max)   0.4          0.401        0.945        0.399        0.254
MMS (Min)   0.4          0.399        0.254        0.401        0.945
2/n         0.4          0.4          0.4          0.4          0.4
Outlier     —            Yes-Max      Yes-Max      Yes-Min      Yes-Min

Figure 2: Range of R_w for the original series and for the complement of the original series. [Diagram omitted: R_w runs from R_0 = 2/n to 1; for the original series, R_w spans the maximum and minimum outliers, and the maximum outliers of the complement series represent the minimum outliers of the original series.]

The following equation shows an overview of the MMS process:

  MMS_max = (a_max − a_min) / (S_n − a_min · n):
      if MMS_max > (2/n) + w_1, the maximum is the outlier;
      if MMS_max ≤ (2/n) + w_1, it is not.

  MMS_min = (a_max − a_min) / (a_max · n − S_n):
      if MMS_min > (2/n) + w_1, the minimum is the outlier;
      if MMS_min ≤ (2/n) + w_1, it is not.                          (12)

Table 2 shows sample calculations using (10) and (11) for the same data sets as in Table 1.
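As an illustrative sketch (not the authors' published C++ implementation), the MMS ratios of (10)–(12) can be computed for a window of values and compared against 2/n; the example data below is data set 3 of Table 2:

```python
def mms(values):
    """Return (MMS_max, MMS_min) for one window, per eqs. (10)-(11)."""
    n = len(values)
    s = sum(values)                                # S_n
    a_max, a_min = max(values), min(values)
    mms_max = (a_max - a_min) / (s - a_min * n)    # > (2/n)+w  =>  max is outlier
    mms_min = (a_max - a_min) / (a_max * n - s)    # > (2/n)+w  =>  min is outlier
    return mms_max, mms_min

# Data set 3 of Table 2: the maximum element (204) is a significant outlier
mms_max, mms_min = mms([100, 101, 102, 103, 204])
print(round(mms_max, 3), round(mms_min, 3), 2 / 5)  # 0.945 0.254 0.4
```

For an outlier-free arithmetic progression, both ratios equal 2/n exactly; here MMS_max far exceeds 0.4, reproducing the "Yes-Max" entry of Table 2.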

2.6. Challenge 5: How to Deal with Removed Outliers/Missing Values. In a series, there can be initial missing values. In addition, if there is no replacement after removing an outlier, this also creates a missing-value environment. If there is no filling, the elements after the removed element are transformed into other values, destroying the original relationship of the elements (Figure 3). These transformed values become outliers in relation to the original data. Therefore, for using the relation of the AP, it is compulsory to maintain the original relation of the data even after removing an outlier. Thus, any rejection technique is not feasible. To maintain the original relation, one possible way is replacing the missing value. However, the data we are considering contain a considerable amount of outliers. Therefore, we cannot guarantee that an element derived from existing elements is not an outlier.

To overcome this problem, we considered two different options: (1) recalculate only the data points after (or before) the removed or missing element, thereby maintaining the initial angle in relation to a certain point, or (2) transform the elements into a new series in which the missing value has no effect.

2.6.1. Recalculate the Data Points after (or before) Removed and Missing Elements. If there is a missing element, the next elements will be shifted horizontally and transformed into wrong values in relation to the current index of the elements (Figure 3). However, angular shifting will not introduce such an error (Figure 3).

In Figure 4, the plot consists of elements a_0 to a_{r+1} (r ∈ R+), and element a_r at index r needs to be removed. After removing element r, element r+1 becomes element r, element r+2 becomes element r+1, and so on. However, by shifting while maintaining the same angle with respect to a certain reference element (e.g., the first element), the same form of the series can be maintained. Equation (13) gives the new value after angular shifting. We used this technique with the MMS algorithm to recalculate the series after (or before) missing values or removed elements:

  B_r T_r = (B_{r+1} C_{r+1} / A B_{r+1}) · A B_r = ((a_{r+1} − a_0) / (r + 1)) · r,
  (a_{r+1})_new = a_0 + B_r T_r.                                    (13)
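A minimal sketch of the angular-shift recalculation in (13), assuming the first element a_0 is the reference element; the function name is ours, not from the paper:

```python
def angular_shift(series, r, a0):
    """Recalculate the elements after a removed index r per eq. (13):
    each a_j (j > r) keeps its angle to the reference a0, so the element
    moving from index j to index j - 1 gets a0 + ((a_j - a0) / j) * (j - 1)."""
    return [a0 + ((series[j] - a0) / j) * (j - 1)
            for j in range(r + 1, len(series))]

# Removing index 2 from the line y = x: the shifted tail falls back on the line
print(angular_shift([0, 1, 2, 3, 4], 2, 0))  # [2.0, 3.0]
```

For a perfect line a_k = a_0 + m·k, the shifted element is exactly the line's value at its new index, which is why the original form of the series is preserved.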

2.6.2. Transformation of Data to a Constant Value. A series with a constant value (of the form y = c, where c is a constant) is a series unaffected by missing values. Because of that, if it is possible to transform any linear series into the y = c form, the transformed series is free of any effect of missing values. After that, the transformed series can be used for outlier detection.

Let y^T be a linear series where y^T_k = y_k − y_1 and x^T_k = x_k − x_1, with x_k the initial index of the elements and y_k the kth element of the series, k = 1, 2, ..., n. The gradient of the line (m) is given by Σ(k=1..n) y^T_k / Σ(k=1..n) x^T_k. If one element (e.g., the first element) is (0, 0), this relation is always true, even with missing values; the element (0, 0) can be considered the reference element. Thus y^T is a series with first element (0, 0) and a gradient m that can be calculated even with missing values. It is also possible to derive a new series y′ where y′_k = x_k · m. If there are no outliers, y^T and y′ coincide and y^T − y′ = 0. If y^TT = y^T − y′, then y^TT is of the form y = c without any influence from missing values. Therefore, this is another way to overcome missing values without replacing them (Figure 5).
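The transformation above can be sketched as follows; the function name is illustrative, and gaps are represented simply by absent (x, y) pairs:

```python
def to_constant_form(x, y):
    """Transform a linear series to constant (y = c) form, unaffected by
    missing values: y_TT = y_T - m * x_T with m = sum(y_T) / sum(x_T)."""
    x_t = [xi - x[0] for xi in x]          # x^T_k = x_k - x_1
    y_t = [yi - y[0] for yi in y]          # y^T_k = y_k - y_1
    m = sum(y_t) / sum(x_t)                # gradient through the reference (0, 0)
    return [yt - m * xt for xt, yt in zip(x_t, y_t)]

# A perfect line y = 10 + 2x with indices 3 and 4 missing -> all zeros
print(to_constant_form([0, 1, 2, 5, 6], [10, 12, 14, 20, 22]))
```

Because the gradient estimate uses only the surviving pairs, the constant form comes out the same whether or not elements are missing, which is the property the method exploits.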

2.7. Challenge 6: Locate Outliers That Are Neither the Maximum Nor the Minimum of the Series. When the outlier is


Figure 3: (a) Data set with an outlier at index r. (b) Value autotransformation effect after removing the outlier at index r without replacement, where circles correspond to elements after removing the outlier, red triangles correspond to expected (correct) elements, and squares correspond to initial values of the shifted elements after removing the outlier.

Table 3: "Bad Detection": the wrong (minimum) element is identified as the outlier.

a_n   Data set 6
a_1   100
a_2   101       MMS_max   0.377
a_3   102       MMS_min   0.425
a_4   103.6     2/n       0.4
a_5   104       Outlier   Yes-Min

neither the maximum nor the minimum, MMS is unable to locate the outlier (Table 3). We named this phenomenon "Bad Detection." When R_w reaches the "Bad Detection Level," MMS cannot be applied. To overcome this situation, we introduced an improved version of MMS, the enhanced MMS (EMMS), based on the missing-data imputation technique of Section 2.6.2.

EMMS is expressed as

  EMMS_max = (a^TT_max − a^TT_min) / (S^TT_n − a^TT_min · n),   a^TT_max ≠ 0,   (14)

  EMMS_min = (a^TT_max − a^TT_min) / (a^TT_max · n − S^TT_n),   a^TT_max ≠ 0,   (15)

where a^TT_k = |a^T_k − x_k (G_a^T / G_x)|, a^T_k = a_k − a_0, x_k is the index of the data, a_k is the kth term of the series, k = 0, 1, ..., n − 1, n is the number of elements in the current window, G_a^T = Σ(k=0..n−1) a^T_k, G_x = Σ(k=0..n−1) x_k, and S^TT_n = Σ(k=0..n−1) a^TT_k.

Figure 4: Solution for the value autotransformation phenomenon: use angular shifting instead of a horizontal shift, where (×) corresponds to the horizontal shift and (•) corresponds to the angular shift.

The term a^TT is always ≥ 0; thus the term a^TT_min = 0. Then (14) and (15) simplify to

  EMMS_max = a^TT_max / S^TT_n,   a^TT_max ≠ 0,   (16)

  EMMS_min = a^TT_max / (a^TT_max · n − S^TT_n),   a^TT_max ≠ 0.   (17)

If there are outliers, EMMS_min > 2/n or EMMS_max > 2/n, and the greater value represents the outlier. Table 4 shows


Table 4: EMMS for identifying an outlier.

x (0, ..., n − 1)   y (a_n)   y^T (a_n − a_1)   y^TT (|y^T − x_n (G^T_y / G_x)|)
0                   100       0.000             0.00
1                   101       1.000             0.06
2                   102       2.000             0.12
3                   103.6     3.600             0.42
4                   104       4.000             0.24

G_x = 10, G^T_y = 10.6
EMMS (Max)   0.500
EMMS (Min)   0.333
2/n          0.4
Outlier      Yes-Max
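As an illustrative sketch (not the authors' published C++ implementation), the simplified EMMS of (16)–(17) can reproduce the Table 4 calculation, with the index as x and the first element as the reference:

```python
def emms(values):
    """Return (EMMS_max, EMMS_min) per eqs. (16)-(17)."""
    n = len(values)
    x = range(n)
    a_t = [v - values[0] for v in values]        # a^T_k = a_k - a_0
    m = sum(a_t) / sum(x)                        # G_a^T / G_x
    a_tt = [abs(a_t[k] - k * m) for k in x]      # a^TT_k
    s_tt = sum(a_tt)                             # S^TT_n
    a_max = max(a_tt)
    return a_max / s_tt, a_max / (a_max * n - s_tt)

# Table 4 data: element 103.6 (neither max nor min) is flagged as the outlier
e_max, e_min = emms([100, 101, 102, 103.6, 104])
print(round(e_max, 3), round(e_min, 3))  # 0.5 0.333, versus 2/n = 0.4
```

Since EMMS_max (0.5) exceeds both 2/n (0.4) and EMMS_min (0.333), the maximum residual, that is, the element at index 3, is identified as the outlier, matching the "Yes-Max" entry in Table 4.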

Figure 5: Transformation of data to a constant value to overcome the missing data problem, where red diamonds correspond to the y = f(x) form, yellow squares to the y^T = f(x) − f(x_0) form, crosses to the y′ = m · x_i form (m = Σ y^T_i / Σ x^T_i), red circles to y^TT = y^T − y′, and open circles to missing values.

an example calculation of EMMS, and the following equation shows an overview of the EMMS process:

  EMMS_max = a^TT_max / S^TT_n:
      if EMMS_max > (2/n) + w_2, the maximum is the outlier;
      if EMMS_max ≤ (2/n) + w_2, it is not.

  EMMS_min = a^TT_max / (a^TT_max · n − S^TT_n):
      if EMMS_min > (2/n) + w_2, the minimum is the outlier;
      if EMMS_min ≤ (2/n) + w_2, it is not.                         (18)

However, EMMS uses information derived from the existing data. If there are biased values, this may lead to biased information. Because of that, direct application of EMMS is not good practice. Hence, significant outliers should first be removed using MMS before applying EMMS.

[Flowchart omitted; thresholds R_w = (2/n)(1 + k_1) for MMS and R_w = (2/n)(1 + k_2) for EMMS; each detected outlier is removed, n = n − 1, and the series is recalculated.]

Figure 6: Implementation of MMS and EMMS. Initially, the algorithm checks for significant outliers using MMS. After removing all significant outliers, it removes the nonsignificant outliers using EMMS. There is no removed-data imputation in relation to either MMS or EMMS.
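A compact sketch of the first stage of the Figure 6 flow (the MMS pass only, with threshold R_w = (2/n)(1 + k_1)); it omits the recalculation step of Section 2.6.1 and the subsequent EMMS pass, and all names are ours:

```python
def mms_ratios(vals):
    """MMS_max and MMS_min per eqs. (10)-(11)."""
    n, s = len(vals), sum(vals)
    a_max, a_min = max(vals), min(vals)
    if a_max == a_min:                    # perfectly constant series: no outliers
        return 0.0, 0.0
    return ((a_max - a_min) / (s - a_min * n),
            (a_max - a_min) / (a_max * n - s))

def remove_significant_outliers(vals, k1=0.5):
    """Repeatedly test against R_w = (2/n)*(1 + k1); drop the flagged
    extreme and shrink n; stop when neither ratio exceeds R_w."""
    vals, removed = list(vals), []
    while len(vals) >= 3:
        r_w = (2 / len(vals)) * (1 + k1)
        m_max, m_min = mms_ratios(vals)
        if m_max > r_w and m_max >= m_min:
            removed.append(vals.pop(vals.index(max(vals))))
        elif m_min > r_w:
            removed.append(vals.pop(vals.index(min(vals))))
        else:
            break
    return vals, removed

print(remove_significant_outliers([100, 101, 102, 103, 204]))
# ([100, 101, 102, 103], [204])
```

After this pass, a second loop of the same shape, using EMMS with k_2, would remove the remaining nonsignificant outliers, as the flowchart describes.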

2.8. Challenge 7: Determining the Outlier Detection Criterion (R_w). The value R_w is the factor that determines the outliers: w = 0 (R_0 = 2/n) represents exactly a line, and w > 0 represents a linear border with a certain width. In this section, we propose several possible methods that can be used to determine the outlier detection criterion.

2.8.1. Express the Value w as f(1/n). If the value w is f(1/n), then w = 2k/n, with k ≤ (n/2) − 1 and k ∈ R+. Then R_w = 2/n + 2k/n:

  R_w = (2/n) · (1 + k),
  R_w / R_0 = 1 + k (= constant).                                   (19)

When the MMS or the EMMS is greater than R_w of (19), this implies the existence of outliers. Because R_w/R_0 is constant, it gives a standard for R_w; the determination of k still depends on knowledge of the domain. Figure 6 shows an algorithm based on this technique.

2.8.2. When the First and the Last Items Are Nonoutliers. In the total process, the "Bad Detection Level" is the most important criterion. If R_w of MMS is less than the "Bad Detection Level," it is possible to identify nonoutliers as outliers, as mentioned in Section 2.7. If there is preknowledge about the outliers, it is possible to use a safe value for MMS. Otherwise, there is no 100% guarantee on the "Bad Detection Level."


Table 5: Different environments used to validate the new method.

Type of the original data set:             increment, constant, and decrement
Number of elements:                        10 to 1000
Type of outliers:                          non-Gaussian, Gaussian
Reference (first) element is an outlier:   yes, no
Initial missing values:                    yes, no

[Flowchart omitted; a window whose first and last terms are nonoutliers is processed with MMS (R_w = (2/n)(1 + k_1)) and then EMMS (R_w = (2/n)(1 + k_2)), removing each detected outlier (n = n − 1) and recalculating the series.]

Figure 7: Outlier detection method including the "Bad Detection Level" detection technique. The first and the last data points of the window must be nonoutliers. If the first or the last element is identified as an outlier, this is a contradictory situation; thus, this point can be considered the terminating point of MMS and EMMS.

However, when the first and the last elements are not outliers, the "Bad Detection Level" can be detected automatically: if the first or the last element is identified as an outlier, this is a contradictory situation. Thus, this point can be considered the terminating point of MMS and EMMS. The decision diagram in Figure 7 expresses the new outlier detection method including the "Bad Detection Level" detection technique.

2.9. Validating the Method. We implemented the MMS (with recalculation after an outlier is removed) and EMMS in C++ and conducted the validation process. For the recalculation process, the existing first element of the window was the reference element, and we always used the original value of the element (not the current updated value). To validate the method, we used artificial data sets of different sizes (10 to 1000 elements) representing increasing, decreasing, and constant lines. Then 50% of the items of those data sets were replaced with very small and very large outliers (±1.0e−2 to ±1.0e+2 times the correct value). We checked the data sets for all the environment combinations shown in Table 5. The outlier detection criteria were determined based on (19). For all data sets, the same k values were used (for MMS, k = 0.5, and for EMMS, k = 0.01). Then the percentage of correctly and falsely detected nonoutliers in relation to the number of actual nonoutliers and the percentage of correctly and falsely detected outliers from the total number of outliers (small and large outliers) were determined.
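A sketch of how such a validation series could be generated; the slope, intercept, and random generator below are illustrative assumptions, not the authors' exact setup:

```python
import random

def make_validation_series(n, slope=3.0, intercept=100.0,
                           outlier_frac=0.5, seed=7):
    """Build a line y = intercept + slope*x and replace a fraction of its
    elements (never the first, which serves as the reference) with outliers
    scaled by factors between roughly +/-1.0e-2 and +/-1.0e+2."""
    rng = random.Random(seed)
    y = [intercept + slope * x for x in range(n)]
    chosen = rng.sample(range(1, n), int(outlier_frac * n))
    for i in chosen:
        y[i] *= rng.choice([-1.0, 1.0]) * 10.0 ** rng.uniform(-2.0, 2.0)
    return y, sorted(chosen)
```

Keeping the first element clean mirrors the paper's assumption that the reference element is not an outlier; the detection rate can then be scored against the returned index list.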

2.10. Evaluation Using Real Data. To check the best-linear-fitting identification capability, the algorithm was tested using several real data sets, which were automatically recorded with a frequency of twelve data points per day (i.e., every other hour) from a biogas plant over a period of seven months. Among the different parameters, we selected the H2 content, measured in ppm, which we expected to maintain linear behaviour during stable operation. We selected seven segments of different sizes for evaluating the algorithm. In some data sets there were initial missing elements. We set the R_w for MMS and EMMS by analysing the first and the third data sets. For the recalculation process, the existing first element of the window was the reference element, and we always used the original values of the elements (not the current updated values). Then the percentage of correctly and falsely detected nonoutliers in relation to the total number of nonoutliers and the percentage of correctly and falsely detected outliers from the total number of outliers (small and large outliers) were determined.

We decided to use the LSM, the sigma filter, and Grubbs' test [26–29], also known as the maximum normed residual test or "extreme studentized deviate" (ESD) test, to compare our results. We selected Grubbs' test since it has nearly the same formulation as our method. We checked all the biogas data using the abovementioned methods. We used each of the data segments as a single window. First, we checked the ability of each method to identify the general trend of the series. Then we checked the number of correctly and falsely detected outliers and nonoutliers for each method in relation to the general trend.
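For orientation, a minimal mean ± kσ rule of the kind the sigma filter applies can be sketched as below. This is a simplified stand-in for the comparison baselines, not the authors' sigma-filter implementation and not the Grubbs statistic itself.

```python
def sigma_outliers(y, k=1.5):
    """Flag indices whose deviation from the window mean exceeds
    k population standard deviations (a naive mean +- k*sigma rule)."""
    n = len(y)
    mean = sum(y) / n
    sd = (sum((v - mean) ** 2 for v in y) / n) ** 0.5
    return [i for i, v in enumerate(y) if abs(v - mean) > k * sd]
```

Such a rule presupposes near-Gaussian residuals and a roughly constant trend, which is exactly the limitation the comparison in Section 3 exposes for sloped series.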

3. Results and Discussion

Results related to validation show that when the reference element (the first element) was not an outlier, the algorithm was capable of identifying all outliers with 0% error, regardless of the type of outliers (Gaussian or non-Gaussian) (Figure 8). If the outliers were Gaussian, there were no significant outliers

The Scientific World Journal

[Figure: plots omitted. Panels: (a) data type increment, outlier type non-Gaussian; (b) decrement, non-Gaussian; (c) constant, non-Gaussian; (d) increment, Gaussian; (e) decrement, Gaussian; (f) constant, Gaussian.]

Figure 8: Outlier detection from data sets with ten elements. The first element is the reference element, which is not an outlier. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is not an outlier, the new method is capable of locating all outliers. When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).

and MMS automatically became inactive (Figures 8(d), 8(e), and 8(f)). When the first few elements were outliers and the outliers were non-Gaussian, MMS detected the significant outliers correctly (Figures 9(a), 9(b), and 9(c)). However, EMMS was unable to locate the nonsignificant outliers when the first element for EMMS was an outlier (Figures 9(a) and

9(c)). If the reference element for EMMS was not an outlier, it was still possible to achieve correct results (Figure 9(b)). Though it was impossible to locate all nonoutliers, the detected nonoutliers were 100% correct detections. These values can be used to estimate the other values using methods like LSM, since now all the existing data are cleaned.

[Figure: plots omitted. Panels: (a) data type increment, outlier type non-Gaussian; (b) decrement, non-Gaussian; (c) constant, non-Gaussian; (d) increment, Gaussian; (e) decrement, Gaussian; (f) constant, Gaussian.]

Figure 9: Outlier detection from data sets with ten elements. The first element is the reference element, which is an outlier. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, green squares to nonoutliers, and black arrows to wrong detections. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is an outlier and the outliers are non-Gaussian, the new method identifies only the significant outliers ((a), (b), (c)). When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).

In general, it is fair to state that (1) when the reference element is not an outlier, the method is capable of identifying all outliers, and (2) when the first few elements of the series are outliers and the outliers are non-Gaussian, the method is capable of identifying only the significant outliers and part of the correct elements.

When the first few elements (the reference elements for both MMS and EMMS) were outliers and the outlier distribution was Gaussian, outlier detection was poor (Figures 9(d), 9(e), and 9(f)). Due to the Gaussian distribution of outliers, MMS was inactive, and it was not possible to identify the large outliers. Most importantly, the results highlighted

[Figure: plots omitted.]

Figure 10: Two artificial data samples with 1000 elements each, including missing-value regions of 50, 100, 100, and 50 elements (300 in total). The first element is the reference element, which is not an outlier. (a) corresponds to a data set with non-Gaussian outliers; (b) corresponds to a data set with nearly Gaussian outliers. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. The new method was able to identify all the elements related to the line with 0% error.

the importance of the reference element. If the reference element for MMS and EMMS was not an outlier, good results were guaranteed regardless of other factors.

In the methodology, we derived the method based on the first element. However, it is also possible to use any other element as the reference point and modify the method accordingly. We considered the simplest situation, where the first element is not an outlier. Therefore, if it is possible to segment the data so as to exclude extreme outliers at the beginning, accurate outlier detection is obtained. Another possibility is to replace the first element with an already known element. This leads to a further way of applying the method: if only a single correct element is known, using that element as the reference element, with the method modified accordingly, can yield very accurate results.

Some model-based approaches demand a trained data set for correct output. In contrast, this method requires only one correct element to produce a correct output. In addition, it is possible to use multiple reference points and consider the best fitting: for example, (a) consider each point in the first x% (e.g., 10%) of data points as a reference point, or (b) consider all data points as reference points. Furthermore, it is important to distinguish the purposes of MMS and EMMS: MMS removes only the significant outliers, while EMMS removes nonsignificant outliers. Depending on the requirement, MMS and/or EMMS can be used to remove outliers.

The results show that the new method is a good solution for managing missing values. Figure 10 shows two data sets with 1000 elements each. Each data set contains missing-value regions of 50, 100, 100, and 50 elements (300 in total). When the first element was not an outlier, the new method was able to identify all the elements related to the line with 0% error.

In the real world, it is not possible to find nonoutliers that exactly agree with linear regression. Therefore, 100% accuracy is inapplicable. However, it is very important to have a

significant-outlier-free data set. The new method guaranteed a significant-outlier-free data set when the outliers were non-Gaussian. Furthermore, in real-world situations, outliers in data are not always in Gaussian distribution. Because of that, we expect the new method to be applicable to the majority of outlier detection applications. Our new method is an effective solution, since the most common methods, LSM and the sigma filter, need Gaussian outliers. Some methods, like the sigma filter, cannot be applied directly to a given data segment, and further segmentation (windowing) is required for better results. In contrast, the new method is capable of locating nonoutliers automatically in increment, decrement, or constant form, regardless of the size of the window.

Results related to the biogas data confirmed the abovementioned idea and showed that the algorithm clearly identifies three regions: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers within a data segment (Figure 11). In addition, the results showed that the nonoutliers follow a linear path. Furthermore, the width of the regions can be tuned by changing the relevant R_w values. Figure 11 shows some selected results of the biogas data for a k value of 0.2 for MMS and a k value of 0.1 for EMMS.

One of the interesting observations was the ability of the algorithm to continue linear detection even with noncontinuous clusters (Figures 11(b) and 11(e)). In all data segments, no false detection occurred (there were no outliers in nonoutlier regions and vice versa). Most importantly, the new method required no further windowing, and nonoutliers were detected independent of the window size.

When the general trend was constant and the elements were in Gaussian distribution, the sigma filter and LSM were able to identify the linear trend. However, for series with biased elements, both methods failed to identify the general trend. When the general trend was increment or decrement, the sigma filter failed to identify the general trend (a further

[Figure: plots omitted. Panels (number of elements): (a) 334; (b) 372; (c) 350; (d) 352; (e) 356; (f) 372. Axes: index versus H2 content of biogas (ppm).]

Figure 11: Results related to real biogas data with data sets of different sizes. The first element is the reference element, which is assumed not to be an outlier. The results showed that the algorithm clearly identifies three regions: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers within each data segment. Most importantly, all the nonoutliers lay within a linear border. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.2 and 0.1, respectively.

segment would give a better result, but we used the whole window). The new method was capable of locating 4% to 45% of elements as outliers with 0% error. Grubbs' test was capable of identifying only a very small number of elements as outliers (0–17%), even with a significance level of 0.05. However, all detected outliers were significant outliers, and no wrong detections were reported.

4. Conclusions and Outlook

This paper introduced a new outlier detection method using the relation of the sum of the elements of an arithmetic progression. The results of this work show that the new method is a robust solution for outlier detection in a data set with missing elements. The method is capable of identifying


both significant and nonsignificant outliers when the first value of the data set is not an outlier. Most importantly, the method is a solution for identifying significant outliers in a series whose outliers are in non-Gaussian distribution. In addition, the outlier detection is nonparametric, has floor and ceiling values, and does not require standardization. When the reference elements are unknown, the method can be used with multiple reference elements to gain optimal output.

If the frequency of the data is sufficient, any nonlinear relation can be represented as a combination of straight lines. Therefore, by using a suitable segmentation technique, it is possible to identify outliers in any data series, which will allow for detecting outliers in a process-oriented data set. Hence, to bring a data series into a form that is suitable for our method, an intelligent segmentation technique is necessary.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The University of Ruhuna, Sri Lanka, provided the paper processing charges of this paper. The German Academic Exchange Service (German: Deutscher Akademischer Austauschdienst) financed this work.

References

[1] R. J. Beckman and R. D. Cook, "Outlier..........s," Technometrics, vol. 25, no. 2, pp. 119–149, 1983.

[2] F. Molinari, "Missing treatments," Journal of Business and Economic Statistics, vol. 28, no. 1, pp. 82–95, 2010.

[3] J. Qin, B. Zhang, and D. H. Y. Leung, "Empirical likelihood in missing data problems," Journal of the American Statistical Association, vol. 104, no. 488, pp. 1492–1503, 2009.

[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.

[5] Z. X. Niu, S. Shi, J. Sun, and X. He, "A survey of outlier detection methodologies and their applications," in Artificial Intelligence and Computational Intelligence, vol. 7002 of Lecture Notes in Computer Science, pp. 380–387, 2011.

[6] W. Jin, A. K. H. Tung, and J. Han, "Mining top-n local outliers in large databases," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 293–298, ACM, San Francisco, Calif, USA, August 2001.

[7] H.-P. Kriegel, M. S. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 444–452, ACM, Las Vegas, Nev, USA, August 2008.

[8] I. Ben-Gal, "Outlier detection," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 131–146, Springer, New York, NY, USA, 2005.

[9] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, December 2002.

[10] H. Fan, O. R. Zaïane, A. Foss, and J. Wu, "A nonparametric outlier detection for effectively discovering top-N outliers from engineering data," in Advances in Knowledge Discovery and Data Mining, W.-K. Ng, M. Kitsuregawa, J. Li, and K. Chang, Eds., vol. 3918 of Lecture Notes in Computer Science, pp. 557–566, Springer, Berlin, Germany, 2006.

[11] J.-S. Lee, "Digital image smoothing and the sigma filter," Computer Vision, Graphics, and Image Processing, vol. 24, no. 2, pp. 255–269, 1983.

[12] A. Gelb, Applied Optimal Estimation, MIT Press, 1974.

[13] D. Sierociuk, I. Tejado, and B. M. Vinagre, "Improved fractional Kalman filter and its application to estimation over lossy networks," Signal Processing, vol. 91, no. 3, pp. 542–552, 2011.

[14] P. H. Abreu, J. Xavier, D. C. Silva, L. P. Reis, and M. Petry, "Using Kalman filters to reduce noise from RFID location system," The Scientific World Journal, vol. 2014, Article ID 796279, 9 pages, 2014.

[15] H. Liu, S. Shah, and W. Jiang, "On-line outlier detection and data cleaning," Computers and Chemical Engineering, vol. 28, no. 9, pp. 1635–1647, 2004.

[16] T. D. Pigott, "A review of methods for missing data," Educational Research and Evaluation, vol. 7, no. 4, pp. 353–383, 2001.

[17] M. Nakai and W. Ke, "Review of the methods for handling missing data in longitudinal data analysis," International Journal of Mathematical Analysis, vol. 5, no. 1–4, pp. 1–13, 2011.

[18] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.

[19] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, L. House, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Germany, 2004.

[20] A. N. Baraldi and C. K. Enders, "An introduction to modern missing data analyses," Journal of School Psychology, vol. 48, no. 1, pp. 5–37, 2010.

[21] J. Tian, B. Yu, D. Yu, and S. Ma, "Clustering-based multiple imputation via gray relational analysis for missing data and its application to aerospace field," The Scientific World Journal, vol. 2013, Article ID 720392, 10 pages, 2013.

[22] S. Zhaowei, Z. Lingfeng, M. Shangjun, F. Bin, and Z. Taiping, "Incomplete time series prediction using max-margin classification of data with absent features," Mathematical Problems in Engineering, vol. 2010, Article ID 513810, 14 pages, 2010.

[23] J.-F. Cai, Z. Shen, and G.-B. Ye, "Approximation of frame based missing data recovery," Applied and Computational Harmonic Analysis, vol. 31, no. 2, pp. 185–204, 2011.

[24] B. L. Wiens and G. K. Rosenkranz, "Missing data in noninferiority trials," Statistics in Biopharmaceutical Research, vol. 5, no. 4, pp. 383–393, 2013.

[25] Aryabhata, The Aryabhatiya of Aryabhata: An Ancient Indian Work on Mathematics and Astronomy, vol. 1, Kessinger Publishing, 2006.

[26] F. E. Grubbs, "Procedures for detecting outlying observations in samples," Technometrics, vol. 11, no. 1, pp. 1–21, 1969.

[27] B. Rosner, "On the detection of many outliers," Technometrics, vol. 17, no. 2, pp. 221–227, 1975.

[28] R. B. Jain, "Percentage points of many-outlier detection procedures," Technometrics, vol. 23, no. 1, pp. 71–75, 1981.

[29] S. P. Verma, L. Díaz-González, M. Rosales-Rivera, and A. Quiroz-Ruiz, "Comparative performance of four single extreme outlier discordancy tests from Monte Carlo simulations," The Scientific World Journal, vol. 2014, Article ID 746451, 27 pages, 2014.


[Figure: plots omitted.]

Figure 3: (a) Data set with an outlier at index r. (b) Value autotransformation effect after removing the outlier at index r without replacement, where circles correspond to elements after removing the outlier, red triangles correspond to expected (correct) elements, and squares correspond to initial values of the shifted elements after removing the outlier.

Table 3: "Bad Detection" identified the wrong (minimum) element as the outlier.

Data set 6: a_1 = 100, a_2 = 101, a_3 = 102, a_4 = 103.6, a_5 = 104.
MMS_max = 0.377; MMS_min = 0.425; 2/n = 0.4; Outlier: Yes (Min).
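The figures in Table 3 can be reproduced numerically. The sketch below assumes that the MMS ratios take the same form as (14) and (15) but are applied to the raw window values (the MMS definition itself appears earlier in the paper):

```python
def mms(values):
    """MMS ratios for a window of raw values: the gap between the extremes
    relative to the sums involving the minimum and the maximum."""
    n = len(values)
    a_min, a_max, s = min(values), max(values), sum(values)
    mms_max = (a_max - a_min) / (s - a_min * n)
    mms_min = (a_max - a_min) / (a_max * n - s)
    return mms_max, mms_min

data = [100, 101, 102, 103.6, 104]   # data set 6 of Table 3; 103.6 is the actual outlier
mms_max, mms_min = mms(data)
# mms_min (~0.426) > 2/n (0.4) > mms_max (~0.377): MMS flags the minimum,
# although the real outlier (103.6) is neither the minimum nor the maximum
```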

If the outlier is neither the maximum nor the minimum, MMS is unable to locate the outlier (Table 3). We named this phenomenon "Bad Detection." When R_w reaches the "Bad Detection Level," MMS cannot be applied. To overcome this situation, we introduced an improved version of MMS, the enhanced MMS (EMMS), based on the missing-data imputation technique in Section 2.6.2.

EMMS is expressed as

EMMS_max = (a^TT_max − a^TT_min) / (S^TT_n − a^TT_min · n),   a^TT_max ≠ 0,   (14)

EMMS_min = (a^TT_max − a^TT_min) / (a^TT_max · n − S^TT_n),   a^TT_max ≠ 0,   (15)

where a^TT_k = |a^T_k − x_k (G^T_a / G_x)|, a^T_k = a_k − a_0, x_k is the index of the data, a_k is the kth term of the series, k = 0, 1, …, n − 1, n is the number of elements in the current window, G^T_a = Σ_{k=0}^{n−1} a^T_k, G_x = Σ_{k=0}^{n−1} x_k, and S^TT_n = Σ_{k=0}^{n−1} a^TT_k ≠ 0.

[Figure: plot omitted.]

Figure 4: Solution for the value autotransformation phenomenon: use angular shifting instead of a horizontal shift (the markers distinguish the horizontally shifted and angularly shifted positions).

Because a^TT_k is an absolute deviation, always a^TT_k ≥ 0, and since a^TT_0 = 0 by construction, a^TT_min = 0. Then (14) and (15) are simplified to

EMMS_max = a^TT_max / S^TT_n,   a^TT_max ≠ 0,   (16)

EMMS_min = a^TT_max / (a^TT_max · n − S^TT_n),   a^TT_max ≠ 0.   (17)

If there are outliers, EMMS_min > 2/n or EMMS_max > 2/n, and the greater value represents the outlier. Table 4 shows


Table 4: EMMS for identifying an outlier.

| x (index) | y (a_n) | y^T (a_n − a_1) | y^TT (&#124;y^T − x_n(G^T_y / G_x)&#124;) |
| 0 | 100 | 0.000 | 0.00 |
| 1 | 101 | 1.000 | 0.06 |
| 2 | 102 | 2.000 | 0.12 |
| 3 | 103.6 | 3.600 | 0.42 |
| 4 | 104 | 4.000 | 0.24 |

G_x = 10; G^T_y = 10.6. EMMS (Max) = 0.500; EMMS (Min) = 0.333; 2/n = 0.4; Outlier: Yes (Max).
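The calculation in Table 4 follows from the y^T / y^TT transformation and the simplified ratios (16) and (17); a minimal sketch:

```python
def emms(x, y):
    """EMMS ratios (16)-(17) after transforming the window values."""
    n = len(y)
    yT = [v - y[0] for v in y]                         # y^T = f(x) - f(x0)
    m = sum(yT) / sum(x)                               # G^T_y / G_x
    yTT = [abs(t - xi * m) for t, xi in zip(yT, x)]    # y^TT = |y^T - x * m|
    s = sum(yTT)                                       # S^TT_n
    a_max = max(yTT)
    return a_max / s, a_max / (a_max * n - s), yTT

e_max, e_min, yTT = emms([0, 1, 2, 3, 4], [100, 101, 102, 103.6, 104])
# e_max = 0.500 > 2/n = 0.4 > e_min = 0.333: the maximum y^TT (index 3,
# i.e. the value 103.6) is correctly identified as the outlier
```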

[Figure: plot omitted.]

Figure 5: Transformation of data to a constant value to overcome the missing data problem, where red diamonds correspond to the y = f(x) form, yellow squares to the y^T = f(x) − f(x_0) form, crosses to the y′ = m · x_i form (m = Σ_{i=0}^{n} y^T_i / Σ_{i=0}^{n} x^T_i), red circles to y^TT = y^T − y′, and circles to missing values.

an example calculation of EMMS, and the following decision rule gives an overview of the EMMS process:

EMMS_max = a^TT_max / S^TT_n:  EMMS_max > R_w → the maximum is the outlier; EMMS_max ≤ R_w → no outlier from the maximum.
EMMS_min = a^TT_max / (a^TT_max · n − S^TT_n):  EMMS_min > R_w → the minimum is the outlier; EMMS_min ≤ R_w → no outlier from the minimum.   (18)

When both ratios exceed R_w, the greater one indicates the outlier.

However, EMMS uses information derived from the existing data. If there are biased values, this may lead to biased information. Because of that, direct application of EMMS is not good practice. Hence, significant outliers should first be removed using MMS before applying EMMS.

[Figure: flow chart omitted.]

Figure 6: Implementation of MMS and EMMS. The algorithm first checks for significant outliers using MMS with R_w = (2/n)(1 + k_1); each detected outlier is removed (n = n − 1) and the series is recalculated. After all significant outliers are removed, nonsignificant outliers are removed in the same way using EMMS with R_w = (2/n)(1 + k_2). There is no removed-data imputation in either MMS or EMMS.
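The EMMS branch of Figure 6 can be sketched as an iterative remove-and-recheck loop. This is a simplified illustration, not the authors' full C++ implementation: it covers the EMMS stage only, always removes the element with the largest y^TT, and does no angular recalculation.

```python
def emms_ratios(x, y):
    """EMMS ratios (16)-(17) for a window; returns (emms_max, emms_min, yTT)."""
    n = len(y)
    yT = [v - y[0] for v in y]
    m = sum(yT) / sum(x)
    yTT = [abs(t - xi * m) for t, xi in zip(yT, x)]
    s = sum(yTT)
    if s == 0:                                   # exact line: no outlier possible
        return 0.0, 0.0, yTT
    a_max = max(yTT)
    return a_max / s, a_max / (a_max * n - s), yTT

def remove_nonsignificant_outliers(x, y, k=0.01):
    """Repeat: compare the EMMS ratios with R_w = (2/n)(1 + k) from (19),
    remove the worst element, and recheck, per the EMMS branch of Figure 6."""
    x, y = list(x), list(y)
    removed = []
    while len(y) > 2:
        n = len(y)
        r_w = (2.0 / n) * (1.0 + k)
        e_max, e_min, yTT = emms_ratios(x, y)
        if max(e_max, e_min) <= r_w:
            break                                # no outlier detected
        i = yTT.index(max(yTT))                  # element with the largest deviation
        removed.append(x[i])                     # report the original index
        del x[i], y[i]                           # remove without imputation (n = n - 1)
    return removed

print(remove_nonsignificant_outliers([0, 1, 2, 3, 4], [100, 101, 102, 103.6, 104]))  # → [3]
```

On the Table 4 data the loop removes the element at index 3 (the value 103.6) and then stops, because the remaining points form an exact line.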

2.8. Challenge 7: Determining the Outlier Detection Criteria (R_w). The value R_w is the factor that determines the outliers: w = 0 (R_0 = 2/n) represents exactly a line, and w > 0 represents a linear border with a certain width. In this section we propose several possible methods that can be used to determine the outlier detection criteria.

2.8.1. Express the Value "w" as f(1/n). If the value w is f(1/n), then w = 2k/n, with k ≤ (n/2) − 1 and k ∈ R+. Then

R_w = 2/n + 2k/n,

R_w = (2/n) · (1 + k),

R_w / R_0 = 1 + k (= constant).   (19)

When the MMS or the EMMS is greater than R_w of (19), this implies the existence of outliers. Because R_w / R_0 is constant, it provides a standard for R_w; the determination of k still depends on knowledge of the domain. Figure 6 shows an algorithm based on this technique.
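Equation (19) in code form; the assertions below merely restate that the ratio R_w / R_0 = 1 + k is independent of the window size n (the constraint k ≤ (n/2) − 1 is left to the caller):

```python
def r_w(n, k):
    """Outlier detection threshold R_w = (2/n) * (1 + k), equation (19)."""
    return (2.0 / n) * (1.0 + k)

# R_w / R_0 equals the constant 1 + k for any n, which is what makes
# a single domain-chosen k usable across window sizes
```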

2.8.2. When the First and the Last Items Are Nonoutliers. In the total process, the "Bad Detection Level" is the most important criterion. If the R_w of MMS is less than the "Bad Detection Level," it is possible to identify nonoutliers as outliers, as mentioned in Section 2.7. If there is prior knowledge about the outliers, it is possible to use a safe value for MMS. Otherwise, there is no 100% guarantee on the "Bad Detection Level."


Table 5: Different environments used to validate the new method.

| Type of the original data set | Number of elements | Type of outliers | Reference (first) element is an outlier | Initial missing values |
| Increment, constant, and decrement | 10 to 1000 | Non-Gaussian, Gaussian | Yes, no | Yes, no |
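The environment combinations of Table 5 can be enumerated directly (the string labels below are only illustrative tags; window sizes from 10 to 1000 are handled separately):

```python
from itertools import product

trends = ["increment", "constant", "decrement"]
outlier_types = ["non-Gaussian", "Gaussian"]
reference_is_outlier = [False, True]
initial_missing = [False, True]

# every combination of Table 5's four factors: 3 * 2 * 2 * 2 = 24 per data-set size
environments = list(product(trends, outlier_types, reference_is_outlier, initial_missing))
```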

[Figure: flow chart omitted.]

Figure 7: Outlier detection method including the "Bad Detection Level" detection technique. The window is chosen so that the first and the last data points are nonoutliers. If the first or the last element is identified as an outlier, a contradictory situation arises; thus, this point can be considered the terminating point of MMS and EMMS.

However when the first and the last elements are not out-liers the ldquoBadDetection Levelrdquo can be detected automaticallyIf the first or the last element was identified as an outlier itwill become a contradictory situationThus this point can beconsidered as the terminating point ofMMS and EMMSThedecision diagram elaborated in Figure 7 expresses the newoutlier detectionmethod including the ldquoBadDetection Levelrdquodetection technique

29 Validate the Method We implemented the MMS (withrecalculation after an outlier is removed) and EMMS withC++ and conducted the validation process For the recalcu-lation process the existing first element of the window wasthe reference element and always used the original value ofthe element (not the current updated value of the element)To validate themethod we used artificial data sets of different

sizes (10 to 1000) of a line representing increasing decreasingand constant line Then 50 of items of those data sets werereplaced with very small and very large outliers (plusmn10119890 minus 2 toplusmn10119890 + 2 times of correct value) We checked the data setsfor all the environment combinations shown in Table 5 Theoutlier detection criteria were determined based on (19) Forall data sets the same k value was used (for MMS 119896 = 05and for EMMS 119896 = 001) Then the percentage of correctlyand falsely detected nonoutliers in relation to the number ofactual nonoutliers and the percentage of correctly and falselydetected outliers from the total number of outliers (small andlarge outliers) were determined

210 Evaluation Using Real Data To check the best linearfitting identification capability the algorithmwas tested usingseveral real data sets which were automatically recordedwith a frequency of twelve data points per day (ie everyother hour) from a biogas plant over a period of sevenmonths Among the different parameters we selected the H

2

content measured in ppm which we expected to maintainlinear behaviour during stable operation We selected sevensegments of different size for evaluating the algorithm Insome data sets there were initial missing elements We setthe 119877119908for MMS and EMMS by analysing the first and the

third data sets For the recalculation process the existingfirst element of the window was the reference element andwe always used the original value of the elements (not thecurrent updated value of the element)Then the percentage ofcorrectly falsely detected nonoutliers in relation to the totalnumber of nonoutliers and the percentage of correctly andfalsely detected outliers from the total number of outliers(small and large outliers) were determined

We decided to compare our results against the LSM, the Sigma filter, and Grubbs' test [26–29], also known as the maximum normed residual test or "extreme studentized deviate" (ESD) test. We selected Grubbs' test since it has nearly the same formulation as our method. We checked all the biogas data using the abovementioned methods. We used each of the data segments as a single window. First, we checked the ability of each method to identify the general trend of the series. Then we checked the number of correctly and falsely detected outliers and nonoutliers for each method in relation to the general trend.
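For reference, the statistic behind Grubbs' test [26] is the standard maximum normed residual, G = max|x_i − mean| / s. A minimal sketch of that computation (two-sided, single-outlier form; the critical-value lookup against the significance level is omitted here):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Grubbs' (maximum normed residual) statistic: the largest absolute
// deviation from the sample mean, in units of the sample standard
// deviation. The statistic is then compared against a tabulated
// critical value for the chosen significance level.
double grubbsStatistic(const std::vector<double>& x) {
    const double n = static_cast<double>(x.size());
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= n;
    double ss = 0.0;
    for (double v : x) ss += (v - mean) * (v - mean);
    const double sd = std::sqrt(ss / (n - 1.0));  // sample std. deviation
    double maxDev = 0.0;
    for (double v : x) maxDev = std::max(maxDev, std::fabs(v - mean));
    return maxDev / sd;
}
```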

3. Results and Discussion

Results related to validation show that, when the reference element (the first element) was not an outlier, the algorithm was capable of identifying all outliers with 0% error, regardless of the type of outliers (Gaussian or non-Gaussian) (Figure 8). If the outliers were Gaussian, there were no significant outliers

8 The Scientific World Journal

[Figure 8 omitted: six value-versus-index scatter plots for ten-element data sets; panels (a)–(c) show increment, decrement, and constant data with non-Gaussian outliers, and panels (d)–(f) the same data types with Gaussian outliers.]

Figure 8: Outlier detection from data sets with ten elements. The first element is the reference element, which is not an outlier; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is not an outlier, the new method is capable of locating all outliers. When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).

and MMS automatically became inactive (Figures 8(d), 8(e), and 8(f)). When the first few elements were outliers and the outliers were non-Gaussian, MMS detected the significant outliers correctly (Figures 9(a), 9(b), and 9(c)). However, EMMS was unable to locate the nonsignificant outliers when the first element for EMMS was an outlier (Figures 9(a) and 9(c)). If the reference element for EMMS was not an outlier, it was still possible to achieve correct results (Figure 9(b)). Though it was impossible to locate all nonoutliers, the detected nonoutliers were 100% correct detections. These values can be used to estimate the other values using methods like LSM, since all the remaining data are now cleaned. In


[Figure 9 omitted: six value-versus-index scatter plots for ten-element data sets whose first element is an outlier; panels (a)–(c) show increment, decrement, and constant data with non-Gaussian outliers, and panels (d)–(f) the same data types with Gaussian outliers.]

Figure 9: Outlier detection from data sets with ten elements. The first element is the reference element, which is an outlier; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, green squares to nonoutliers, and black arrows to wrong detections. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is an outlier and the outliers are non-Gaussian, the new method identifies only the significant outliers ((a), (b), (c)). When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).

general, it is fair to state that (1) when the reference element is not an outlier, the method is capable of identifying all outliers, and (2) when the first few elements of the series are outliers and the outliers are non-Gaussian, the method is capable of identifying only the significant outliers and part of the correct elements.
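The estimation step suggested above (using LSM on the cleaned data) amounts to an ordinary least-squares line fit over the retained points; a minimal sketch (our own illustration, not the authors' implementation):

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Ordinary least-squares fit y = m*x + c over the cleaned (nonoutlier)
// points; the fitted line can then predict values at the indices where
// outliers were removed or data were missing.
std::pair<double, double> fitLine(const std::vector<double>& xs,
                                  const std::vector<double>& ys) {
    const double n = static_cast<double>(xs.size());
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < xs.size(); ++i) {
        sx += xs[i];
        sy += ys[i];
        sxx += xs[i] * xs[i];
        sxy += xs[i] * ys[i];
    }
    const double m = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    const double c = (sy - m * sx) / n;
    return {m, c};  // slope, intercept
}
```

For example, fitting the retained points of a cleaned series and evaluating m*x + c at a removed index yields the replacement value.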

When the first few elements (the reference elements for both MMS and EMMS) were outliers and the outlier distribution was Gaussian, outlier detection was poor (Figures 9(d), 9(e), and 9(f)). Due to the Gaussian distribution of outliers, MMS was inactive, and it was not possible to identify the large outliers. Most importantly, the results highlighted


[Figure 10 omitted: two value-versus-index scatter plots of 1000-element data sets with missing-value regions.]

Figure 10: Two artificial data samples with 1000 elements each, including missing-value regions of 50, 100, 100, and 50 elements (300 in total). The first element is the reference element, which is not an outlier; (a) corresponds to a data set with non-Gaussian outliers and (b) to a data set with nearly Gaussian outliers. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. The new method was able to identify all the elements related to the line with 0% error.

the importance of the reference element. If the reference element for MMS and EMMS was not an outlier, it guaranteed good results regardless of other factors.

In the methodology, we derived the method based on the first element. However, it is also possible to use any other element as the reference point and modify the method accordingly. We considered the simplest situation, where the first element is not an outlier. Therefore, if it is possible to segment the data so that extreme outliers at the beginning are excluded, the method provides accurate outlier detection. Another possibility is to replace the first element with an already known element. This leads to a further way of applying the method: if we know only a single correct element, using that element as the reference element, with the method modified according to that reference, can yield very accurate results.

Some model-based approaches demand a trained data set for correct output. In contrast, this method requires only one correct element to produce a correct output. In addition, it is possible to use multiple reference points and consider the best fit: for example, (a) consider each point in the first x% (e.g., 10%) of data points as the reference point, or (b) consider every data point as the reference point. Furthermore, it is important to distinguish the purposes of MMS and EMMS: MMS removes only the significant outliers, while EMMS removes the nonsignificant outliers. Depending on the requirement, MMS and/or EMMS can be used to remove outliers.
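The multiple-reference idea above could be sketched as follows. The paper leaves the selection criterion open, so this is purely our assumption: we run the detector once per candidate reference and keep the reference that retains the most elements; bestReference and the detector interface are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Try several candidate reference indices and keep the one whose run of
// the detector retains the most elements ("best fit"). The detector is
// passed in as a callable that returns the number of nonoutliers found
// when a given index is used as the reference element.
std::size_t bestReference(
    const std::vector<double>& data,
    const std::vector<std::size_t>& candidates,
    const std::function<std::size_t(const std::vector<double>&,
                                    std::size_t)>& countNonoutliers) {
    std::size_t best = candidates.front();
    std::size_t bestCount = 0;
    for (std::size_t ref : candidates) {
        const std::size_t kept = countNonoutliers(data, ref);
        if (kept > bestCount) {
            bestCount = kept;
            best = ref;
        }
    }
    return best;
}
```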

The results show that the new method is a good solution for managing missing values. Figure 10 shows two data sets with 1000 elements each. Each data set contains missing-value regions of 50, 100, 100, and 50 elements (300 in total). When the first element was not an outlier, the new method was able to identify all the elements related to the line with 0% error.

In the real world, it is not possible to find nonoutliers that exactly agree with a linear regression. Therefore, 100% accuracy is inapplicable. However, it is very important to have a

significant-outlier-free data set. The new method guaranteed a significant-outlier-free data set when the outliers were non-Gaussian. Furthermore, in real-world situations, data outliers are not always Gaussian distributed. Because of that, we expect the new method to be applicable to the majority of outlier detection applications. Our new method is also an effective alternative where the most common approaches, LSM and the sigma filter, require Gaussian outliers. Some methods, like the sigma filter, cannot be applied directly to a given data segment, and further segmentation (windowing) is required for better results. In contrast, the new method is capable of locating nonoutliers automatically, in increment, decrement, or constant form, regardless of the size of the window.
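For context, the sigma filter referenced here (Lee [11]) replaces each value by the mean of the window neighbours that lie within a fixed multiple of sigma of the centre value. A minimal one-dimensional sketch (our own simplification; the window size and sigma are treated as fixed inputs):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// 1-D sigma filter (after Lee [11], simplified): each sample is replaced
// by the mean of the window samples lying within +/- 2*sigma of the
// centre value. The centre sample always qualifies, so the count is
// never zero; an isolated outlier therefore survives unless neighbours
// fall within its 2-sigma band.
std::vector<double> sigmaFilter(const std::vector<double>& x,
                                int halfWindow, double sigma) {
    std::vector<double> out(x.size());
    for (int i = 0; i < static_cast<int>(x.size()); ++i) {
        double sum = 0.0;
        int count = 0;
        for (int j = i - halfWindow; j <= i + halfWindow; ++j) {
            if (j < 0 || j >= static_cast<int>(x.size())) continue;
            if (std::fabs(x[j] - x[i]) <= 2.0 * sigma) {
                sum += x[j];
                ++count;
            }
        }
        out[i] = sum / count;
    }
    return out;
}
```

This also illustrates why the filter presumes a near-Gaussian neighbourhood: the 2-sigma acceptance band is only meaningful under that assumption.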

Results related to the biogas data support the abovementioned idea and show that the algorithm clearly identifies three regions within a data segment: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers (Figure 11). In addition, the results showed that the nonoutliers follow a linear path. Furthermore, the width of the regions can be tuned by changing the relevant Rw values. Figure 11 shows some selected results of the biogas data for a k value of 0.2 for MMS and a k value of 0.1 for EMMS.

One of the interesting observations was the ability of the algorithm to continue linear detection even with noncontinuous clusters (Figures 11(b) and 11(e)). In all data segments there occurred no false detections (there were no outliers in nonoutlier regions and vice versa). Most importantly, the new method required no further windowing, and nonoutliers were detected independently of the window size.

When the general trend was constant and the elements were Gaussian distributed, the Sigma filter and LSM were able to identify the linear trend. However, for series with biased elements, both methods failed to identify the general trend. When the general trend was increment or decrement, the Sigma filter failed to identify the general trend (a further


[Figure 11 omitted: six plots of H2 content of biogas (ppm) versus index for real data segments of (a) 334, (b) 372, (c) 350, (d) 352, (e) 356, and (f) 372 elements.]

Figure 11: Results related to real biogas data with data sets of different sizes. The first element is the reference element, which is assumed not to be an outlier. The results show that the algorithm clearly identifies three regions within each data segment: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers. Most importantly, all the nonoutliers lie within a linear border; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.2 and 0.1, respectively.

segment would give a better result, but we used the whole window). The new method was capable of locating 4% to 45% of the elements as outliers with 0% error. Grubbs' test was capable of identifying only a very small share of elements as outliers (0–17%), even with a significance level of 0.05. However, all outliers it reported were significant, and no wrong detections were reported.

4. Conclusions and Outlook

This paper introduced a new outlier detection method using the relation of the sum of the elements of an arithmetic progression. The results of this work show that the new method is a robust solution for outlier detection in a data set with missing elements. The method is capable of identifying


both significant and nonsignificant outliers when the first value of the data set is not an outlier. Most importantly, the method is a solution for identifying significant outliers in a series with non-Gaussian-distributed outliers. In addition, the outlier detection is nonparametric, has floor and ceiling values, and does not require standardization. When the reference elements are unknown, the method can be used with multiple reference elements to obtain optimal output.

If the frequency of the data is sufficient, any nonlinear relation can be represented as a combination of straight lines. Therefore, by using a suitable segmentation technique, it is possible to identify outliers in any data series. This will allow for detecting outliers in a process-oriented data set. Hence, to bring a data series into a form that is suitable for our method, an intelligent segmentation technique is necessary.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The University of Ruhuna, Sri Lanka, provided the paper processing charges of this paper. The German Academic Exchange Service (Deutscher Akademischer Austauschdienst) financed this work.

References

[1] R. J. Beckman and R. D. Cook, "Outlier..........s," Technometrics, vol. 25, no. 2, pp. 119–149, 1983.

[2] F. Molinari, "Missing treatments," Journal of Business and Economic Statistics, vol. 28, no. 1, pp. 82–95, 2010.

[3] J. Qin, B. Zhang, and D. H. Y. Leung, "Empirical likelihood in missing data problems," Journal of the American Statistical Association, vol. 104, no. 488, pp. 1492–1503, 2009.

[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.

[5] Z. X. Niu, S. Shi, J. Sun, and X. He, "A survey of outlier detection methodologies and their applications," in Artificial Intelligence and Computational Intelligence, vol. 7002 of Lecture Notes in Computer Science, pp. 380–387, 2011.

[6] W. Jin, A. K. H. Tung, and J. Han, "Mining top-n local outliers in large databases," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 293–298, ACM, San Francisco, Calif, USA, August 2001.

[7] H.-P. Kriegel, M. S. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 444–452, ACM, Las Vegas, Nev, USA, August 2008.

[8] I. Ben-Gal, "Outlier detection," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 131–146, Springer, New York, NY, USA, 2005.

[9] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, December 2002.

[10] H. Fan, O. R. Zaïane, A. Foss, and J. Wu, "A nonparametric outlier detection for effectively discovering top-N outliers from engineering data," in Advances in Knowledge Discovery and Data Mining, W.-K. Ng, M. Kitsuregawa, J. Li, and K. Chang, Eds., vol. 3918 of Lecture Notes in Computer Science, pp. 557–566, Springer, Berlin, Germany, 2006.

[11] J.-S. Lee, "Digital image smoothing and the sigma filter," Computer Vision, Graphics, and Image Processing, vol. 24, no. 2, pp. 255–269, 1983.

[12] A. Gelb, Applied Optimal Estimation, MIT Press, 1974.

[13] D. Sierociuk, I. Tejado, and B. M. Vinagre, "Improved fractional Kalman filter and its application to estimation over lossy networks," Signal Processing, vol. 91, no. 3, pp. 542–552, 2011.

[14] P. H. Abreu, J. Xavier, D. C. Silva, L. P. Reis, and M. Petry, "Using Kalman filters to reduce noise from RFID location system," The Scientific World Journal, vol. 2014, Article ID 796279, 9 pages, 2014.

[15] H. Liu, S. Shah, and W. Jiang, "On-line outlier detection and data cleaning," Computers and Chemical Engineering, vol. 28, no. 9, pp. 1635–1647, 2004.

[16] T. D. Pigott, "A review of methods for missing data," Educational Research and Evaluation, vol. 7, no. 4, pp. 353–383, 2001.

[17] M. Nakai and W. Ke, "Review of the methods for handling missing data in longitudinal data analysis," International Journal of Mathematical Analysis, vol. 5, no. 1–4, pp. 1–13, 2011.

[18] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.

[19] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, L. House, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Germany, 2004.

[20] A. N. Baraldi and C. K. Enders, "An introduction to modern missing data analyses," Journal of School Psychology, vol. 48, no. 1, pp. 5–37, 2010.

[21] J. Tian, B. Yu, D. Yu, and S. Ma, "Clustering-based multiple imputation via gray relational analysis for missing data and its application to aerospace field," The Scientific World Journal, vol. 2013, Article ID 720392, 10 pages, 2013.

[22] S. Zhaowei, Z. Lingfeng, M. Shangjun, F. Bin, and Z. Taiping, "Incomplete time series prediction using max-margin classification of data with absent features," Mathematical Problems in Engineering, vol. 2010, Article ID 513810, 14 pages, 2010.

[23] J.-F. Cai, Z. Shen, and G.-B. Ye, "Approximation of frame based missing data recovery," Applied and Computational Harmonic Analysis, vol. 31, no. 2, pp. 185–204, 2011.

[24] B. L. Wiens and G. K. Rosenkranz, "Missing data in noninferiority trials," Statistics in Biopharmaceutical Research, vol. 5, no. 4, pp. 383–393, 2013.

[25] Aryabhata, The Aryabhatiya of Aryabhata: An Ancient Indian Work on Mathematics and Astronomy, vol. 1, Kessinger Publishing, 2006.

[26] F. E. Grubbs, "Procedures for detecting outlying observations in samples," Technometrics, vol. 11, no. 1, pp. 1–21, 1969.

[27] B. Rosner, "On the detection of many outliers," Technometrics, vol. 17, no. 2, pp. 221–227, 1975.

[28] R. B. Jain, "Percentage points of many-outlier detection procedures," Technometrics, vol. 23, no. 1, pp. 71–75, 1981.

[29] S. P. Verma, L. Díaz-González, M. Rosales-Rivera, and A. Quiroz-Ruiz, "Comparative performance of four single extreme outlier discordancy tests from Monte Carlo simulations," The Scientific World Journal, vol. 2014, Article ID 746451, 27 pages, 2014.


Table 4: EMMS for identifying an outlier.

x (0 to n − 1)   y (a_n)   y_T (a_n − a_1)   y_TT (|y_T − x_n(G_Ty/G_x)|)
0                100.0     0.000             0.00
1                101.0     1.000             0.06
2                102.0     2.000             0.12
3                103.6     3.600             0.42
4                104.0     4.000             0.24

G_x = 2; G_Ty = 2.120
EMMS (Max) = 0.500; EMMS (Min) = 0.333; 2/n = 0.4; Outlier: Yes (Max)
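The numbers in Table 4 can be reproduced directly from the transformation defined in Figure 5 (y_T = y − y_0, y′ = m·x with m = Σy_T/Σx, y_TT = |y_T − y′|). The following is our own illustrative sketch of that computation; computeEMMS and EmmsResult are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct EmmsResult { double emmsMax; double emmsMin; };

// EMMS over the doubly transformed series y_TT (as in Table 4):
// y_T[i] = y[i] - y[0]; y'[i] = i * (sum(y_T) / sum(x));
// y_TT[i] = |y_T[i] - y'[i]|; then
// EMMS_max = max(y_TT) / sum(y_TT) and
// EMMS_min = max(y_TT) / (max(y_TT) * n - sum(y_TT)).
EmmsResult computeEMMS(const std::vector<double>& y) {
    const int n = static_cast<int>(y.size());
    double sumX = 0.0, sumYT = 0.0;
    for (int i = 0; i < n; ++i) { sumX += i; sumYT += y[i] - y[0]; }
    const double slope = sumYT / sumX;  // m = sum(y_T) / sum(x)
    double maxTT = 0.0, sumTT = 0.0;
    for (int i = 0; i < n; ++i) {
        const double ytt = std::fabs((y[i] - y[0]) - slope * i);
        sumTT += ytt;
        if (ytt > maxTT) maxTT = ytt;
    }
    return {maxTT / sumTT, maxTT / (maxTT * n - sumTT)};
}
```

For the Table 4 data {100, 101, 102, 103.6, 104}, this yields EMMS (Max) = 0.500 and EMMS (Min) = 0.333; with 2/n = 0.4, EMMS (Max) exceeds the threshold, reproducing the table's verdict that the maximum is an outlier.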

[Figure 5 omitted: sketch of the transformation, value versus index 0, 1, 2, ..., n − 1, with missing values marked.]

Figure 5: Transformation of data to a constant value to overcome the missing data problem, where red diamonds correspond to the y = f(x) form, yellow squares to the y_T = f(x) − f(x_0) form, crosses to the y′ = m·x_i form (m = Σ_{i=0}^{n} y_Ti / Σ_{i=0}^{n} x_Ti), red circles to y_TT = y_T − y′, and circles to missing values.

an example calculation of EMMS, and the following equation gives an overview of the EMMS process:

EMMS_max = a_TTmax / S_TTn:
  if EMMS_max > (2/n) + w/2, the maximum is the outlier;
  if EMMS_max ≤ (2/n) + w/2, the maximum is not an outlier.

EMMS_min = a_TTmax / (a_TTmax · n − S_TTn):
  if EMMS_min > (2/n) + w/2, the minimum is the outlier;
  if EMMS_min ≤ (2/n) + w/2, the minimum is not an outlier.
(18)

However, EMMS uses information derived from the existing data. If there are biased values, this may lead to biased information. Because of that, direct application of EMMS is not good practice. Hence, significant outliers should first be removed using MMS before applying EMMS.

[Figure 6 omitted: flowchart. Start with n = number of elements and k = k1, set Rw = 2/n · (1 + k1), and use MMS on the consecutive (recalculated) series; while an outlier is detected, remove it, set n = n − 1, and recalculate the series. When MMS detects no further outlier, set k = k2 and Rw = 2/n · (1 + k2) and apply EMMS in the same remove-and-recalculate loop until no outlier is detected, then stop.]

Figure 6: Implementation of MMS and EMMS. Initially, the algorithm checks for the significant outliers using MMS. After removing all significant outliers, it removes the nonsignificant outliers using EMMS. There is no removed-data imputation in relation to either MMS or EMMS.
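The remove-and-recalculate loop of Figure 6 can be sketched as follows. This is our own illustration of the control flow only: the detector (MMS in the first pass, EMMS in the second) is treated as a pluggable callable, and the names removeOutliers and Detector are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Skeleton of the Figure 6 loop: repeatedly run a detector with the
// current threshold Rw = (2/n)(1 + k); whenever an outlier index is
// returned, remove it, shrink n, and re-run with the updated Rw.
using Detector =
    std::function<int(const std::vector<double>&, double /*Rw*/)>;

void removeOutliers(std::vector<double>& data, double k,
                    const Detector& detect) {
    for (;;) {
        const double n = static_cast<double>(data.size());
        const double rw = (2.0 / n) * (1.0 + k);  // eq. (19)
        const int idx = detect(data, rw);         // -1: no outlier found
        if (idx < 0) break;
        data.erase(data.begin() + idx);           // no imputation (Fig. 6)
    }
}
```

A full run would then call removeOutliers(data, k1, mms) followed by removeOutliers(data, k2, emms), mirroring the two stages of the flowchart.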

2.8. Challenge 7: Determining the Outlier Detection Criterion (Rw). The value Rw is the factor that determines the outliers: w = 0 (R0 = 2/n) represents exactly a line, and w > 0 represents a linear border with a certain width. In this section we propose several possible methods that can be used to determine the outlier detection criterion.

2.8.1. Express the Value "w" as f(1/n). If the value w is f(1/n), then w = 2k/n, with k ≤ (n/2) − 1 and k ∈ R+. Then

Rw = 2/n + 2k/n,

Rw = (2/n)(1 + k),

Rw/R0 = 1 + k (= constant). (19)

When the MMS or the EMMS is greater than the Rw of (19), this implies the existence of outliers. Because Rw/R0 is constant and standardizes Rw, the determination of k still depends on knowledge of the domain. Figure 6 shows an algorithm based on this technique.

2.8.2. When the First and the Last Items Are Nonoutliers. In the total process, the "Bad Detection Level" is the most important criterion. If the Rw of MMS is less than the "Bad Detection Level", it is possible to identify nonoutliers as outliers, as mentioned in Section 2.7. If there is prior knowledge about the outliers, it is possible to use a safe value for MMS. Otherwise, there is no 100% guarantee on the "Bad Detection Level".


Table 5: Different environments used to validate the new method.

Type of the original data set: increment, constant, and decrement
Number of elements: 10 to 1000
Type of outliers: non-Gaussian, Gaussian
Reference (first) element is an outlier: yes, no
Initial missing values: yes, no

[Figure 7 omitted: flowchart. Start by getting a window whose first and last terms are nonoutliers; with n = number of elements and k = k1, set Rw = 2/n · (1 + k1) and use MMS on the consecutive (recalculated) series. If a detected outlier is the first or the last term, stop; otherwise remove it, set n = n − 1, and repeat. When MMS finds no further outlier, set k = k2 and Rw = 2/n · (1 + k2) and apply EMMS with the same first-or-last-term termination check until no outlier is detected, then stop.]

Figure 7: Outlier detection method including the "Bad Detection Level" detection technique. The first and the last data points of the window must be nonoutliers. If the first or the last element is identified as an outlier, a contradictory situation arises; thus, this point can be considered the terminating point of MMS and EMMS.

However, when the first and the last elements are not outliers, the "Bad Detection Level" can be detected automatically. If the first or the last element is identified as an outlier, a contradictory situation arises; thus, this point can be considered the terminating point of MMS and EMMS. The decision diagram elaborated in Figure 7 expresses the new outlier detection method including the "Bad Detection Level" detection technique.
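The stopping rule of Figure 7 can be expressed as a guard inside the removal loop; a sketch under the same assumptions as before (the detector interface and function names are our own):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Removal loop with Figure 7's stopping rule: the window is only valid
// while its first and last terms are nonoutliers. If the detector flags
// either boundary element, that contradicts the premise, so the loop
// terminates ("Bad Detection Level" reached) instead of removing it.
using Detector =
    std::function<int(const std::vector<double>&, double /*Rw*/)>;

void removeOutliersGuarded(std::vector<double>& data, double k,
                           const Detector& detect) {
    for (;;) {
        const double n = static_cast<double>(data.size());
        const int idx = detect(data, (2.0 / n) * (1.0 + k));
        if (idx < 0) break;  // no outlier found
        if (idx == 0 ||
            static_cast<std::size_t>(idx) + 1 == data.size())
            break;           // boundary flagged: terminating point
        data.erase(data.begin() + idx);
    }
}
```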

29 Validate the Method We implemented the MMS (withrecalculation after an outlier is removed) and EMMS withC++ and conducted the validation process For the recalcu-lation process the existing first element of the window wasthe reference element and always used the original value ofthe element (not the current updated value of the element)To validate themethod we used artificial data sets of different

sizes (10 to 1000) of a line representing increasing decreasingand constant line Then 50 of items of those data sets werereplaced with very small and very large outliers (plusmn10119890 minus 2 toplusmn10119890 + 2 times of correct value) We checked the data setsfor all the environment combinations shown in Table 5 Theoutlier detection criteria were determined based on (19) Forall data sets the same k value was used (for MMS 119896 = 05and for EMMS 119896 = 001) Then the percentage of correctlyand falsely detected nonoutliers in relation to the number ofactual nonoutliers and the percentage of correctly and falselydetected outliers from the total number of outliers (small andlarge outliers) were determined

210 Evaluation Using Real Data To check the best linearfitting identification capability the algorithmwas tested usingseveral real data sets which were automatically recordedwith a frequency of twelve data points per day (ie everyother hour) from a biogas plant over a period of sevenmonths Among the different parameters we selected the H

2

content measured in ppm which we expected to maintainlinear behaviour during stable operation We selected sevensegments of different size for evaluating the algorithm Insome data sets there were initial missing elements We setthe 119877119908for MMS and EMMS by analysing the first and the

third data sets For the recalculation process the existingfirst element of the window was the reference element andwe always used the original value of the elements (not thecurrent updated value of the element)Then the percentage ofcorrectly falsely detected nonoutliers in relation to the totalnumber of nonoutliers and the percentage of correctly andfalsely detected outliers from the total number of outliers(small and large outliers) were determined

We decided to use the LSM Sigma filter and Grubbrsquostest [26ndash29] also known as maximum normed residual testor ldquoextreme studentized deviaterdquo (ESD) test to compare ourresults We selected Grubbrsquos test since it has nearly the sameformulation as our method We checked all the biogas datausing abovementioned methods We used each of the datasegments as a single window First we checked the abilityof each method to identify the general trend of the seriesThenwe checked the amount of correctly and falsely detectedoutliers and nonoutliers for each method in relation to thegeneral trend

3 Results and Discussion

Results related to validation show that when the referenceelement (the first element) was not an outlier the algorithmwas capable of identifying all outliers with 0 error despite ofthe type of outliers (Gaussian or non-Gaussian) (Figure 8) Ifthe outliers were Gaussian there were no significant outliers

8 The Scientific World Journal

Valu

e4000

3500

3000

2500

2000

1500

1000

500

0

Index0 42 6 8

10 50 60 80 905005202 705

4100

3000

(a) Data type increment outlier type non-Gaussian

Valu

e

1000

01000 900

7007994

450500 300

Index0 42 6 8

minus1000

minus2000

minus3000

minus4000

minus5000minus6000

minus6000

minus2200

minus7000

minus8000

minus9000

minus10000minus10000

(b) Data type decrement outlier type non-Gaussian

Valu

e

0100 100 100 100 100

1000010000

9000

8000

7000

60006000

5000

4000

3000

2000 2200

35010011000

Index0 42 6 8

(c) Data type constant outlier type non-Gaussian

Valu

e

minus500minus1000minus1500minus2000minus2500minus3000minus3500minus4000

minus4100

40003500300025002000150010005000

10 1995005

725

4000

Index0 42 6 8

50 60 80 90

(d) Data type increment outlier type Gaussian

Valu

e

1000 900 700 400 300

10000100009000

80007000600050004000300020001000

0

Index0 42 6 8

7995 650minus1000minus2000

minus2200minus3000minus4000minus5000minus6000

minus6000

(e) Data type decrement outlier type Gaussian

100100100100100

10000

999 350

Valu

e

10000900080007000600050004000300020001000

0

Index0 42 6 8

minus1000minus2000minus3000minus4000minus5000minus6000

minus6000

minus2200

(f) Data type constant outlier type Gaussian

Figure 8 Outlier detection from data sets with ten elements The first element is the reference element which is not an outlier where redtriangle corresponds to outliers detected by MMS yellow circle corresponds to outliers detected by EMMS and green square corresponds tononoutliers Value of k for MMS and EMMS is 05 and 001 respectively When the reference (first) element is not an outlier the newmethodis capable of locating all outliers When the outliers are Gaussian MMS automatically becomes inactive (now no significant outliers) ((d) (e)(f))

and MMS automatically became inactive (Figures 8(d), 8(e), and 8(f)). When the first few elements were outliers and the outliers were non-Gaussian, MMS detected the significant outliers correctly (Figures 9(a), 9(b), and 9(c)). However, EMMS was unable to locate the nonsignificant outliers when the first element for EMMS was an outlier (Figures 9(a) and 9(c)). If the reference element for EMMS was not an outlier, it was still possible to achieve correct results (Figure 9(b)). Though it was impossible to locate all nonoutliers, the detected nonoutliers were 100% correct detections. These values can be used to estimate the other values using methods like LSM, since all the remaining data are now cleaned. In

The Scientific World Journal 9

[Figure 9, panels (a)–(f): value-versus-index plots (axis data omitted). (a) Data type: increment, outlier type: non-Gaussian. (b) Data type: decrement, outlier type: non-Gaussian. (c) Data type: constant, outlier type: non-Gaussian. (d) Data type: increment, outlier type: Gaussian. (e) Data type: decrement, outlier type: Gaussian. (f) Data type: constant, outlier type: Gaussian.]

Figure 9: Outlier detection from data sets with ten elements. The first element is the reference element, which is an outlier. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, green squares to nonoutliers, and black arrows to wrong detections. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is an outlier and the outliers are non-Gaussian, the new method identifies only the significant outliers ((a), (b), (c)). When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers remain) ((d), (e), (f)).

general, it is fair to state that (1) when the reference element is not an outlier, the method is capable of identifying all outliers, and (2) when the first few elements of the series are outliers and the outliers are non-Gaussian, the method is capable of identifying only the significant outliers and part of the correct elements.
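The capability summarized above rests on the arithmetic-progression property stated in the abstract: for an outlier-free arithmetic progression of n elements, (min + max)/sum = 2/n, and a deviation from 2/n signals an outlier at one extreme. The following is a minimal sketch of that test with a tolerance band R_w = (2/n)(1 + k); the symmetric lower bound and all function names are our assumptions for illustration, not the paper's exact formulation, and positive values are assumed so the ratio is well defined.

```python
def ap_ratio(values):
    """Return r = (min + max) / sum; equals 2/n exactly for an arithmetic progression."""
    return (min(values) + max(values)) / sum(values)

def classify(values, k=0.5):
    """Flag the suspect extreme using the bound R_w = (2/n)(1 + k).

    k = 0.5 mirrors the MMS setting used for the artificial data sets;
    the lower bound (2/n)/(1 + k) is our symmetric assumption.
    """
    n = len(values)
    r = ap_ratio(values)
    if r > (2.0 / n) * (1.0 + k):
        return "max is an outlier"
    if r < (2.0 / n) / (1.0 + k):
        return "min is an outlier"
    return "no significant outlier"
```

For the progression 2, 4, 6, 8, 10 the ratio is exactly 2/5; replacing an element with a large value pushes the ratio above the upper bound and flags the maximum.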

When the first few elements (the reference elements for both MMS and EMMS) were outliers and the outlier distribution was Gaussian, outlier detection was poor (Figures 9(d), 9(e), and 9(f)). Due to the Gaussian distribution of outliers, MMS was inactive, and it was not possible to identify the large outliers. Most importantly, the results highlighted


[Figure 10, panels (a) and (b): value-versus-index plots for 1000-element series (axis data omitted).]

Figure 10: Two artificial data samples with 1000 elements each, including 50-, 100-, 100-, and 50-element (total 300) missing-value regions. The first element is the reference element, which is not an outlier. (a) corresponds to a data set with non-Gaussian outliers; (b) corresponds to a data set with nearly Gaussian outliers. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. The new method was able to identify all the elements related to the line with 0% error.

the importance of the reference element. If the reference element for MMS and EMMS was not an outlier, it guaranteed good results regardless of other factors.

In the methodology, we derived the method based on the first element. However, it is also possible to use any other element as the reference point and modify the method accordingly. We considered the simplest situation, where the first element is not an outlier. Therefore, if it is possible to segment the data so that extreme outliers at the beginning are excluded, the method provides accurate outlier detection. Another possibility is to replace the first element with an already known element. This leads to a further way of applying the method: if only a single correct element is known, using that element as the reference element, with the method modified accordingly, can yield very accurate results.

Some model-based approaches demand a trained data set for correct output. In contrast, this method requires only one correct element to produce a correct output. In addition, it is possible to use multiple reference points and consider the best fit: for example, (a) consider each point in the first x% (e.g., 10%) of data points as the reference point, or (b) consider every data point as the reference point. Furthermore, it is important to distinguish the purposes of MMS and EMMS: MMS removes only the significant outliers, while EMMS removes the nonsignificant outliers. Depending on the requirement, MMS and/or EMMS can be used to remove outliers.
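The two-stage use of MMS and EMMS described above can be approximated as two passes of an extreme-stripping loop, first with a loose tolerance (significant outliers) and then with a tight one. This is only a sketch: in the paper EMMS operates on a transformed series, which is not reproduced here, and `strip_extremes` and its bound are our own naming and simplification.

```python
def strip_extremes(data, k):
    """Iteratively remove min/max while the ratio r = (min + max)/sum
    lies outside the band [(2/n)/(1+k), (2/n)(1+k)].

    A simplified stand-in for the paper's MMS loop; it assumes positive
    values and recomputes the ratio after every removal.
    """
    data = sorted(data)
    removed = []
    while len(data) >= 3:
        n = len(data)
        r = (data[0] + data[-1]) / sum(data)
        hi, lo = (2.0 / n) * (1.0 + k), (2.0 / n) / (1.0 + k)
        if r > hi:
            removed.append(data.pop())      # drop current maximum
        elif r < lo:
            removed.append(data.pop(0))     # drop current minimum
        else:
            break
    return removed, data

# Two-pass use: loose tolerance first (MMS-like), then a tight one (EMMS-like).
significant, rest = strip_extremes([100, 101, 99, 100, 350, 100, 10000], k=0.5)
nonsignificant, clean = strip_extremes(rest, k=0.01)
```

With these settings the first pass strips 10000 and 350, and the second pass finds nothing further to remove in the near-constant remainder.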

The results show that the new method is a good solution for managing missing values. Figure 10 shows two data sets with 1000 elements each. Each data set contains 50-, 100-, 100-, and 50-element (total 300) missing-value regions. When the first element was not an outlier, the new method was able to identify all the elements related to the line with 0% error.
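One way to realize the "transform data into a constant value" idea for missing data (mentioned in the abstract) is a divided difference with respect to the reference element: for points lying on a line, the slope relative to the reference point is the same constant no matter which indices are missing, so gaps do not disturb the transform and outliers stand out as deviating values. The function below is our hedged reconstruction of that idea, assuming the first (reference) point is not an outlier; it is not the paper's exact procedure.

```python
def to_constant(points):
    """Map (x, y) samples of a line to slopes relative to the first point.

    For points exactly on y = a + b*x the result is the constant b,
    regardless of missing indices; an outlier yields a deviating slope.
    Assumes the first (reference) point is not an outlier.
    """
    x0, y0 = points[0]
    return [(y - y0) / (x - x0) for (x, y) in points[1:]]

# Indices 3, 4, and 7 are missing; the transform is unaffected by the gaps,
# while the corrupted last point produces a clearly deviating value.
pts = [(0, 5.0), (1, 7.0), (2, 9.0), (5, 15.0), (6, 17.0), (8, 500.0)]
slopes = to_constant(pts)
```

Here the first four transformed values are exactly 2.0 (the line's slope), while the corrupted point maps to 61.875.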

In the real world, it is not possible to find nonoutliers that exactly agree with linear regression; therefore, 100% accuracy is unattainable. However, it is very important to have a significant-outlier-free data set. The new method guaranteed a significant-outlier-free data set when the outliers were non-Gaussian. Furthermore, in real-world situations, data outliers are not always Gaussian distributed; we therefore expect the new method to be applicable to the majority of outlier detection applications. Our new method is an effective alternative, since the most common methods, LSM and the sigma filter, require Gaussian outliers. Some methods, like the sigma filter, cannot be applied directly to a given data segment, and further segmentation (windowing) is required for better results. In contrast, the new method is capable of locating nonoutliers automatically in increment, decrement, or constant form, regardless of the size of the window.

Results related to the biogas data confirmed the abovementioned idea and showed that the algorithm clearly identifies three regions within a data segment: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers (Figure 11). In addition, the results showed that the nonoutliers follow a linear path. Furthermore, the width of the regions can be tuned by changing the relevant Rw values. Figure 11 shows some selected results of the biogas data for a k value of 0.2 for MMS and a k value of 0.1 for EMMS.

One of the interesting observations was the ability of the algorithm to continue linear detection even across noncontinuous clusters (Figures 11(b) and 11(e)). In all data segments there occurred no false detections (there were no outliers in nonoutlier regions and vice versa). Most importantly, the new method required no further windowing, and nonoutliers were detected independent of the window size.

When the general trend was constant and the elements were Gaussian distributed, the sigma filter and LSM were able to identify the linear trend. However, for series with biased elements, both methods failed to identify the general trend. When the general trend was increment or decrement, the sigma filter failed to identify the general trend (a further


[Figure 11, panels (a)–(f): H2 content of biogas (ppm) versus index (axis data omitted). (a) Elements: 334. (b) Elements: 372. (c) Elements: 350. (d) Elements: 352. (e) Elements: 356. (f) Elements: 372.]

Figure 11: Results related to real biogas data with data sets of different sizes. The first element is the reference element, which is assumed not to be an outlier. The results showed that the algorithm clearly identifies three regions within each data segment: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers. Most importantly, all the nonoutliers lay within a linear border. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.2 and 0.1, respectively.

segment would give a better result, but we used the whole window). The new method was capable of locating 4% to 45% of elements as outliers with 0% error. Grubbs' test was capable of identifying only a very small fraction of elements as outliers (0–17%), even with a significance level of 0.05. However, all of its detected outliers were significant, and no wrong detections were reported.
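For context on the comparison above, a textbook sigma filter flags points beyond a multiple of the standard deviation from the window mean; because it assumes roughly Gaussian scatter about a constant level, a pure trend can fool it. The sketch below is a generic illustration of that failure mode, not the exact filter of [11], and the threshold choice is ours.

```python
import statistics

def sigma_filter(values, n_sigma=2.0):
    """Flag points more than n_sigma population standard deviations from the mean.

    Generic illustration only: the filter presumes near-Gaussian scatter
    about a constant level, so trending data can defeat it.
    """
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return [v for v in values if abs(v - m) > n_sigma * s]
```

On the pure trend 0, 10, ..., 90 with a 1.5-sigma threshold, the endpoints are wrongly flagged even though no outliers exist, which illustrates why a windowed sigma filter struggles with increment or decrement trends while the ratio-based method does not.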

4. Conclusions and Outlook

This paper introduced a new outlier detection method using the relation of the sum of the elements of an arithmetic progression. The results of this work show that the new method is a robust solution for outlier detection in a data set with missing elements. The method is capable of identifying both significant and nonsignificant outliers when the first value of the data set is not an outlier. Most importantly, the method is a solution for identifying significant outliers in a series with non-Gaussian outliers. In addition, the outlier detection is nonparametric, has floor and ceiling values, and does not require standardization. When the reference elements are unknown, the method can be used with multiple reference elements to obtain optimal output.

If the sampling frequency of the data is sufficient, any nonlinear relation can be represented as a combination of straight lines. Therefore, by using a suitable segmentation technique, it is possible to identify outliers in any data series, which will allow for detecting outliers in process-oriented data sets. To bring a data series into a form suitable for our method, an intelligent segmentation technique is thus necessary.
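As a baseline for the segmentation idea above, a fixed-width split shows how a series would be fed segment by segment to a per-segment detector. The intelligent segmentation the text calls for would instead choose breakpoints adaptively; this sketch, with names of our choosing, does not attempt that.

```python
def fixed_segments(values, width):
    """Naive segmentation: split a series into consecutive fixed-width windows.

    A baseline only; adaptive breakpoints (the "intelligent segmentation"
    the paper calls for) would align windows with locally linear stretches.
    """
    return [values[i:i + width] for i in range(0, len(values), width)]
```

Each returned window can then be checked independently, e.g., by the arithmetic-progression ratio test, so that a piecewise-linear series is handled one near-linear piece at a time.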

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The University of Ruhuna, Sri Lanka, provided the processing charges of this paper. The German Academic Exchange Service (German: Deutscher Akademischer Austauschdienst) financed this work.

References

[1] R. J. Beckman and R. D. Cook, "Outlier…s," Technometrics, vol. 25, no. 2, pp. 119–149, 1983.

[2] F. Molinari, "Missing treatments," Journal of Business and Economic Statistics, vol. 28, no. 1, pp. 82–95, 2010.

[3] J. Qui, B. Zhang, and D. H. Y. Leung, "Empirical likelihood in missing data problems," Journal of the American Statistical Association, vol. 104, no. 488, pp. 1492–1503, 2009.

[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.

[5] Z. X. Niu, S. Shi, J. Sun, and X. He, "A survey of outlier detection methodologies and their applications," in Artificial Intelligence and Computational Intelligence, vol. 7002 of Lecture Notes in Computer Science, pp. 380–387, 2011.

[6] W. Jin, A. K. H. Tung, and J. Han, "Mining top-n local outliers in large databases," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 293–298, ACM, San Francisco, Calif, USA, August 2001.

[7] H.-P. Kriegel, M. S. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 444–452, ACM, Las Vegas, Nev, USA, August 2008.

[8] I. Ben-Gal, "Outlier detection," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 131–146, Springer, New York, NY, USA, 2005.

[9] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, December 2002.

[10] H. Fan, O. R. Zaïane, A. Foss, and J. Wu, "A nonparametric outlier detection for effectively discovering top-N outliers from engineering data," in Advances in Knowledge Discovery and Data Mining, W.-K. Ng, M. Kitsuregawa, J. Li, and K. Chang, Eds., vol. 3918 of Lecture Notes in Computer Science, pp. 557–566, Springer, Berlin, Germany, 2006.

[11] J.-S. Lee, "Digital image smoothing and the sigma filter," Computer Vision, Graphics, and Image Processing, vol. 24, no. 2, pp. 255–269, 1983.

[12] A. Gelb, Applied Optimal Estimation, MIT Press, 1974.

[13] D. Sierociuk, I. Tejado, and B. M. Vinagre, "Improved fractional Kalman filter and its application to estimation over lossy networks," Signal Processing, vol. 91, no. 3, pp. 542–552, 2011.

[14] P. H. Abreu, J. Xavier, D. C. Silva, L. P. Reis, and M. Petry, "Using Kalman filters to reduce noise from RFID location system," The Scientific World Journal, vol. 2014, Article ID 796279, 9 pages, 2014.

[15] H. Liu, S. Shah, and W. Jiang, "On-line outlier detection and data cleaning," Computers and Chemical Engineering, vol. 28, no. 9, pp. 1635–1647, 2004.

[16] T. D. Pigott, "A review of methods for missing data," Educational Research and Evaluation, vol. 7, no. 4, pp. 353–383, 2001.

[17] M. Nakai and W. Ke, "Review of the methods for handling missing data in longitudinal data analysis," International Journal of Mathematical Analysis, vol. 5, no. 1–4, pp. 1–13, 2011.

[18] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.

[19] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, L. House, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Germany, 2004.

[20] A. N. Baraldi and C. K. Enders, "An introduction to modern missing data analyses," Journal of School Psychology, vol. 48, no. 1, pp. 5–37, 2010.

[21] J. Tian, B. Yu, D. Yu, and S. Ma, "Clustering-based multiple imputation via gray relational analysis for missing data and its application to aerospace field," The Scientific World Journal, vol. 2013, Article ID 720392, 10 pages, 2013.

[22] S. Zhaowei, Z. Lingfeng, M. Shangjun, F. Bin, and Z. Taiping, "Incomplete time series prediction using max-margin classification of data with absent features," Mathematical Problems in Engineering, vol. 2010, Article ID 513810, 14 pages, 2010.

[23] J.-F. Cai, Z. Shen, and G.-B. Ye, "Approximation of frame based missing data recovery," Applied and Computational Harmonic Analysis, vol. 31, no. 2, pp. 185–204, 2011.

[24] B. L. Wiens and G. K. Rosenkranz, "Missing data in noninferiority trials," Statistics in Biopharmaceutical Research, vol. 5, no. 4, pp. 383–393, 2013.

[25] Aryabhata, The Aryabhatiya of Aryabhata: An Ancient Indian Work on Mathematics and Astronomy, vol. 1, Kessinger Publishing, 2006.

[26] F. E. Grubbs, "Procedures for detecting outlying observations in samples," Technometrics, vol. 11, no. 1, pp. 1–21, 1969.

[27] B. Rosner, "On the detection of many outliers," Technometrics, vol. 17, no. 2, pp. 221–227, 1975.

[28] R. B. Jain, "Percentage points of many-outlier detection procedures," Technometrics, vol. 23, no. 1, pp. 71–75, 1981.

[29] S. P. Verma, L. Díaz-González, M. Rosales-Rivera, and A. Quiroz-Ruiz, "Comparative performance of four single extreme outlier discordancy tests from Monte Carlo simulations," The Scientific World Journal, vol. 2014, Article ID 746451, 27 pages, 2014.




segment would give better result but we used the wholewindow)The newmethodwas capable of locating 4 to 45of elements as outliers with 0 error Grubbsrsquo test was capableof identifying very small amount of elements as outliers(0ndash17) even with the significance level of 005 Howeverall outliers were significant and no wrong detections werereported

4 Conclusions and Outlook

This paper introduced a new outlier detection method usingthe relation of the sum of the elements of an arithmeticprogression The results of this work prove that the newmethod is a robust solution for outlier detection in a data setwith missing elements The method is capable of identifying

12 The Scientific World Journal

both significant and nonsignificant outliers when the firstvalue of the data set is not an outlier Most importantlythe method is a solution for identifying significant outliersin a series with outliers in non-Gaussian distribution Inaddition the outlier detection is nonparametric has floor andceiling values and does not require standardization Whenthe reference elements are unknown the method can be usedwith multiple reference elements to gain optimal output

If the frequency of the data is sufficient any nonlinearrelation can be represented as a combination of straight linesTherefore by using a suitable segmentation technique it ispossible to identify outliers in any data series This will allowfor detecting outliers in a process-oriented data setThereforeto bring a data series into a form that is suitable for ourmethod an intelligent segmentation technique is necessary

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The University of Ruhuna Sri Lanka provided the paperprocessing charges of this paper The German AcademicExchange Service (German Deutscher Akademischer Aus-tauschdienst) financed this work

References

[1] R J Beckman and R D Cook ldquoOutlier srdquoTechnometrics vol 25 no 2 pp 119ndash149 1983

[2] F Molinari ldquoMissing treatmentsrdquo Journal of Business andEconomic Statistics vol 28 no 1 pp 82ndash95 2010

[3] J Qui B Zhang and D H Y Leung ldquoEmpirical likelihoodin missing data problemsrdquo Journal of the American StatisticalAssociation vol 104 no 488 pp 1492ndash1503 2009

[4] V J Hodge and J Austin ldquoA survey of outlier detectionmethodologiesrdquo Artificial Intelligence Review vol 22 no 2 pp85ndash126 2004

[5] Z X Niu S Shi J Sun and X He ldquoA survey of outlier detectionmethodologies and their applicationsrdquo in Artificial Intelligenceand Computational Intelligence vol 7002 of Lecture Notes inComputer Science pp 380ndash387 2011

[6] W Jin A K H Tung and J Han ldquoMining top-n local outliersin large databasesrdquo in Proceedings of the 7th ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining (KDD rsquo01) pp 293ndash298 ACM San Francisco CalifUSA August 2001

[7] H-P Kriegel M S Schubert and A Zimek ldquoAngle-basedoutlier detection in high-dimensional datardquo in Proceedings ofthe 14th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining (KDD rsquo08) pp 444ndash452 ACM LasVegas Nev USA August 2008

[8] I Ben-Gal ldquoOutlier detectionrdquo in Data Mining and KnowledgeDiscovery Handbook O Maimon and L Rokach Eds pp 131ndash146 Springer New York NY USA 2005

[9] G Williams R Baxter H He S Hawkins and L Gu ldquoAcomparative study of RNN for outlier detection in dataminingrdquoin Proceedings of the IEEE International Conference on DataMining (ICDM 02) pp 709ndash712 December 2002

[10] H Fan O R Zaıane A Foss and J Wu ldquoA nonparametricoutlier detection for effectively discovering top-N outliers fromengineering datardquo inAdvances inKnowledgeDiscovery andDataMining W-K Ng M Kitsuregawa J Li and K Chang Edsvol 3918 of Lecture Notes in Computer Science pp 557ndash566Springer Berlin Germany 2006

[11] J-S Lee ldquoDigital image smoothing and the sigma filterrdquoComputer Vision Graphics and Image Processing vol 24 no 2pp 255ndash269 1983

[12] A Gelb Applied Optimal Estimation MIT Press 1974[13] D Sierociuk I Tejado and BM Vinagre ldquoImproved fractional

Kalman filter and its application to estimation over lossynetworksrdquo Signal Processing vol 91 no 3 pp 542ndash552 2011

[14] P H Abreu J Xavier D C Silva L P Reis andM Petry ldquoUsingKalman filters to reduce noise from RFID location systemrdquoTheScientific World Journal vol 2014 Article ID 796279 9 pages2014

[15] H Liu S Shah and W Jiang ldquoOn-line outlier detection anddata cleaningrdquoComputers andChemical Engineering vol 28 no9 pp 1635ndash1647 2004

[16] T D Pigott ldquoA review ofmethods formissing datardquo EducationalResearch and Evaluation vol 7 no 4 pp 353ndash383 2001

[17] M Nakai and W Ke ldquoReview of the methods for handlingmissing data in longitudinal data analysisrdquo International Journalof Mathematical Analysis vol 5 no 1ndash4 pp 1ndash13 2011

[18] J L Schafer and J W Graham ldquoMissing data our view of thestate of the artrdquo Psychological Methods vol 7 no 2 pp 147ndash1772002

[19] E Acuna and C Rodriguez ldquoThe treatment of missing valuesand its effect on classifier accuracyrdquo inClassification Clusteringand Data Mining Applications D Banks L House F RMcMorris P Arabie and W Gaul Eds pp 639ndash647 SpringerBerlin Germany 2004

[20] A N Baraldi and C K Enders ldquoAn introduction to modernmissing data analysesrdquo Journal of School Psychology vol 48 no1 pp 5ndash37 2010

[21] J Tian B Yu D Yu and S Ma ldquoClustering-based multipleimputation via gray relational analysis for missing data and itsapplication to aerospace fieldrdquoThe Scientific World Journal vol2013 Article ID 720392 10 pages 2013

[22] S Zhaowei Z Lingfeng M Shangjun F Bin and Z TaipingldquoIncomplete time series prediction using max-margin classifi-cation of data with absent featuresrdquo Mathematical Problems inEngineering vol 2010 Article ID 513810 14 pages 2010

[23] J-F Cai Z Shen and G-B Ye ldquoApproximation of frame basedmissing data recoveryrdquo Applied and Computational HarmonicAnalysis vol 31 no 2 pp 185ndash204 2011

[24] B L Wiens and G K Rosenkranz ldquoMissing data in noninferi-ority trialsrdquo Statistics in Biopharmaceutical Research vol 5 no4 pp 383ndash393 2013

[25] Aryabhata The Aryabhatiya of Aryabhata An Ancient IndianWork onMathematics and Astronomy vol 1 Kessinger Publish-ing 2006

[26] F E Grubbs ldquoProcedures for detecting outlying observations insamplesrdquo Technometrics vol 11 no 1 pp 1ndash21 1969

[27] B Rosner ldquoOn the detection of many outliersrdquo Technometricsvol 17 no 2 pp 221ndash227 1975

[28] R B Jain ldquoPercentage points of many-outlier detection proce-duresrdquo Technometrics vol 23 no 1 pp 71ndash75 1981

[29] S P Verma L Dıaz-Gonzalez M Rosales-Rivera and AQuiroz-Ruiz ldquoComparative performance of four single extremeoutlier discordancy tests from Monte Carlo simulationsrdquo TheScientific World Journal vol 2014 Article ID 746451 27 pages2014



[Figure 8 appears here: panels (a)-(f) plot Value against Index for ten-element data sets with (a) increment/non-Gaussian, (b) decrement/non-Gaussian, (c) constant/non-Gaussian, (d) increment/Gaussian, (e) decrement/Gaussian, and (f) constant/Gaussian outliers.]

Figure 8: Outlier detection from data sets with ten elements. The first element is the reference element, which is not an outlier; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is not an outlier, the new method is capable of locating all outliers. When the outliers are Gaussian, MMS automatically becomes inactive (there are no significant outliers) ((d), (e), (f)).

and MMS automatically became inactive (Figures 8(d), 8(e), and 8(f)). When the first few elements were outliers and the outliers were non-Gaussian, MMS detected the significant outliers correctly (Figures 9(a), 9(b), and 9(c)). However, EMMS was unable to locate the nonsignificant outliers when the first element for EMMS was an outlier (Figures 9(a) and 9(c)). If the reference element for EMMS was not an outlier, it was still possible to achieve correct results (Figure 9(b)). Though it was impossible to locate all nonoutliers, the detected nonoutliers were 100% correct detections. These values can be used to estimate the other values using methods like LSM, since now all the existing data are cleaned. In general, it is fair to state that (1) when the reference element is not an outlier, the method is capable of identifying all outliers, and (2) when the first few elements of the series are outliers and the outliers are non-Gaussian, the method is capable of identifying only the significant outliers and part of the correct elements.

[Figure 9 appears here: panels (a)-(f) plot Value against Index for ten-element data sets with (a) increment/non-Gaussian, (b) decrement/non-Gaussian, (c) constant/non-Gaussian, (d) increment/Gaussian, (e) decrement/Gaussian, and (f) constant/Gaussian outliers.]

Figure 9: Outlier detection from data sets with ten elements. The first element is the reference element, which is an outlier; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, green squares to nonoutliers, and black arrows to wrong detections. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is an outlier and the outliers are non-Gaussian, the new method identifies only the significant outliers ((a), (b), (c)). When the outliers are Gaussian, MMS automatically becomes inactive (there are no significant outliers) ((d), (e), (f)).

When the first few elements (the reference elements for both MMS and EMMS) were outliers and the outlier distribution was Gaussian, outlier detection was poor (Figures 9(d), 9(e), and 9(f)). Due to the Gaussian distribution of the outliers, MMS was inactive, and it was not possible to identify the large outliers. Most importantly, the results highlighted the importance of the reference element: if the reference element for MMS and EMMS was not an outlier, good results were guaranteed regardless of the other factors.

[Figure 10 appears here: panels (a) and (b) plot Value against Index for two artificial data sets with 1000 elements each.]

Figure 10: Two artificial data samples with 1000 elements each, including missing-value regions of 50, 100, 100, and 50 elements (300 in total). The first element is the reference element, which is not an outlier; (a) corresponds to a data set with non-Gaussian outliers and (b) to a data set with nearly Gaussian outliers. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. The new method was able to identify all the elements related to the line with 0% error.

In the methodology, we derived the method based on the first element. However, it is also possible to use any other element as the reference point and to modify the method accordingly. We considered the simplest situation, in which the first element is not an outlier. Therefore, if it is possible to segment the data so that extreme outliers at the beginning are excluded, the method provides accurate outlier detection. Another possibility is to replace the first element with an already known element. This leads to a further way of applying the method: if only a single correct element is known, using that element as the reference element, with the method modified accordingly, can yield very accurate results.
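As noted above, once the data are cleaned, the surviving (index, value) pairs can be used to estimate removed or missing values with methods like LSM. The paper leaves the imputation step open; the sketch below (a hypothetical helper, not the authors' implementation) fits an ordinary least-squares line to the retained points and evaluates it at every index:

```python
def fill_with_lsm(points, n):
    """Fit a least-squares line to surviving (index, value) pairs and
    estimate a value at every index 0..n-1, e.g. for positions whose
    elements were removed as outliers. Assumes at least two distinct
    indices so the slope is defined."""
    xs = [i for i, _ in points]
    ys = [v for _, v in points]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    # Classic closed-form simple linear regression.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * i for i in range(n)]
```

The index of a missing element never enters the fit, so gaps of any size are handled uniformly.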

Some model-based approaches demand a trained data set for correct output. In contrast, this method requires only one correct element to produce a correct output. In addition, it is possible to use multiple reference points and consider the best fit: for example, (a) consider each point in the first x% (e.g., 10%) of the data points as the reference point, or (b) consider all data points as reference points. Furthermore, it is important to distinguish the purposes of MMS and EMMS: MMS removes only the significant outliers, while EMMS removes the nonsignificant outliers. Depending on the requirement, MMS and/or EMMS can be used to remove outliers.
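The single-correct-element requirement rests on the ratio property stated in the introduction: for an arithmetic progression of n elements, the sum of the minimum and maximum elements divided by the sum of all elements is exactly 2/n, and a deviation from 2/n implies an outlier. A minimal sketch of that core test (the function name and the exact-equality tolerance are our own; the full MMS/EMMS machinery is not reproduced here):

```python
def ratio_test(values):
    """For an arithmetic progression of n elements,
    (min + max) / sum == 2 / n exactly; a deviation signals an outlier.
    r < 2/n usually implies the minimum is an outlier,
    r > 2/n that the maximum is one."""
    n = len(values)
    total = sum(values)
    if total == 0:
        raise ValueError("sum of elements must be nonzero")
    r = (min(values) + max(values)) / total
    expected = 2 / n
    if abs(r - expected) < 1e-12:  # tolerance is an assumption
        return "no outlier"
    return "minimum suspect" if r < expected else "maximum suspect"
```

For example, [10, 20, 30, 40, 50] gives r = 60/150 = 0.4 = 2/5, while replacing 50 with 500 pushes r above 2/5 and flags the maximum.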

The results show that the new method is a good solution for managing missing values. Figure 10 shows two data sets with 1000 elements each; each data set contains missing-value regions of 50, 100, 100, and 50 elements (300 in total). When the first element was not an outlier, the new method was able to identify all the elements related to the line with 0% error.

In the real world, it is not possible to find nonoutliers that exactly agree with a linear regression; therefore, 100% accuracy is unattainable. However, it is very important to have a significant-outlier-free data set, and the new method guaranteed one when the outliers were non-Gaussian. Furthermore, in real-world situations, data outliers are not always in a Gaussian distribution; because of that, we expect the new method to be applicable to the majority of outlier detection applications. Our new method is also an effective alternative where the most common methods, LSM and the sigma filter, need Gaussian outliers. Some methods, like the sigma filter, cannot be applied directly to a certain data segment, and further segmentation (windowing) is required for better results. In contrast, the new method is capable of locating nonoutliers automatically in increment, decrement, or constant form, regardless of the size of the window.
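For comparison, the sigma filter mentioned above smooths each point using only the neighbours that lie within k standard deviations of it inside a sliding window, which is why it presupposes roughly Gaussian noise and benefits from extra windowing. A minimal 1-D sketch (parameter names are ours; Lee's original formulation [11] is for images):

```python
import statistics

def sigma_filter(series, window=5, k=2.0):
    # Replace each point by the mean of the in-window neighbours that lie
    # within k standard deviations of the point itself (after Lee [11]).
    out, half = [], window // 2
    for i, x in enumerate(series):
        neigh = series[max(0, i - half): i + half + 1]
        sigma = statistics.pstdev(neigh)
        close = [v for v in neigh if abs(v - x) <= k * sigma]
        out.append(statistics.fmean(close) if close else x)
    return out
```

Note that the window is centred on the current value, so a large isolated outlier is averaged mostly with itself; this illustrates why the method struggles without careful windowing on non-Gaussian data.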

Results related to biogas data proved the abovementioned idea and showed that the algorithm clearly identifies three regions within a data segment: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers (Figure 11). In addition, the results showed that the nonoutliers follow a linear path. Furthermore, the width of the regions can be tuned by changing the relevant R_w values. Figure 11 shows some selected results for the biogas data, with a k value of 0.2 for MMS and a k value of 0.1 for EMMS.

One of the interesting observations was the ability of the algorithm to continue linear detection even with noncontinuous clusters (Figures 11(b) and 11(e)). In all data segments, no false detection occurred (there were no outliers in nonoutlier regions, and vice versa). Most importantly, the new method required no further windowing, and nonoutliers were detected independently of the window size.

When the general trend was constant and the elements were in a Gaussian distribution, the sigma filter and LSM were able to identify the linear trend. However, for series with biased elements, both methods failed to identify the general trend. When the general trend was increment or decrement, the sigma filter failed to identify the general trend (a further segmentation would give a better result, but we used the whole window). The new method was capable of locating 4% to 45% of the elements as outliers with 0% error. Grubbs' test was capable of identifying only a very small share of the elements as outliers (0%-1.7%), even with a significance level of 0.05. However, all the outliers it detected were significant, and no wrong detections were reported.

[Figure 11 appears here: six panels plot the H2 content of biogas (ppm) against Index for real data segments of (a) 334, (b) 372, (c) 350, (d) 352, (e) 356, and (f) 372 elements.]

Figure 11: Results for real biogas data with data sets of different sizes. The first element is the reference element, which is assumed not to be an outlier. The results showed that the algorithm clearly identifies three regions within each data segment: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers. Most importantly, all the nonoutliers lay within a linear border. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.2 and 0.1, respectively.

4. Conclusions and Outlook

This paper introduced a new outlier detection method based on the relation between the sums of the elements of an arithmetic progression. The results of this work show that the new method is a robust solution for outlier detection in a data set with missing elements. The method is capable of identifying both significant and nonsignificant outliers when the first value of the data set is not an outlier. Most importantly, the method is a solution for identifying significant outliers in a series whose outliers are in a non-Gaussian distribution. In addition, the outlier detection is nonparametric, has floor and ceiling values, and does not require standardization. When the reference elements are unknown, the method can be used with multiple reference elements to gain optimal output.

If the frequency of the data is sufficient, any nonlinear relation can be represented as a combination of straight lines. Therefore, by using a suitable segmentation technique, it is possible to identify outliers in any data series. This will allow for detecting outliers in a process-oriented data set. Hence, to bring a data series into a form that is suitable for our method, an intelligent segmentation technique is necessary.
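The segmentation idea can be illustrated with a deliberately naive splitter: it cuts a series into maximal runs whose first difference is constant, i.e. runs that are themselves arithmetic progressions and hence directly usable by the method. This is our own stand-in, not the "intelligent segmentation technique" the outlook calls for:

```python
def linear_segments(y, tol=1e-6):
    """Split a series into maximal runs with a constant first difference
    (numerically arithmetic progressions). Adjacent segments share their
    breakpoint element, mirroring a piecewise-linear fit."""
    if len(y) < 2:
        return [y]
    segments, start = [], 0
    step = y[1] - y[0]
    for i in range(2, len(y)):
        if abs((y[i] - y[i - 1]) - step) > tol:
            segments.append(y[start:i])          # close the current linear run
            start, step = i - 1, y[i] - y[i - 1]  # new run starts at breakpoint
    segments.append(y[start:])
    return segments
```

A real segmenter would of course have to tolerate noise and outliers before splitting; this sketch only shows the target form of the output.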

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The University of Ruhuna, Sri Lanka, provided the processing charges of this paper. The German Academic Exchange Service (German: Deutscher Akademischer Austauschdienst) financed this work.

References

[1] R. J. Beckman and R. D. Cook, "Outlier…s," Technometrics, vol. 25, no. 2, pp. 119–149, 1983.

[2] F. Molinari, "Missing treatments," Journal of Business and Economic Statistics, vol. 28, no. 1, pp. 82–95, 2010.

[3] J. Qin, B. Zhang, and D. H. Y. Leung, "Empirical likelihood in missing data problems," Journal of the American Statistical Association, vol. 104, no. 488, pp. 1492–1503, 2009.

[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.

[5] Z. X. Niu, S. Shi, J. Sun, and X. He, "A survey of outlier detection methodologies and their applications," in Artificial Intelligence and Computational Intelligence, vol. 7002 of Lecture Notes in Computer Science, pp. 380–387, 2011.

[6] W. Jin, A. K. H. Tung, and J. Han, "Mining top-n local outliers in large databases," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 293–298, ACM, San Francisco, Calif, USA, August 2001.

[7] H.-P. Kriegel, M. S. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 444–452, ACM, Las Vegas, Nev, USA, August 2008.

[8] I. Ben-Gal, "Outlier detection," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 131–146, Springer, New York, NY, USA, 2005.

[9] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, December 2002.

[10] H. Fan, O. R. Zaïane, A. Foss, and J. Wu, "A nonparametric outlier detection for effectively discovering top-N outliers from engineering data," in Advances in Knowledge Discovery and Data Mining, W.-K. Ng, M. Kitsuregawa, J. Li, and K. Chang, Eds., vol. 3918 of Lecture Notes in Computer Science, pp. 557–566, Springer, Berlin, Germany, 2006.

[11] J.-S. Lee, "Digital image smoothing and the sigma filter," Computer Vision, Graphics, and Image Processing, vol. 24, no. 2, pp. 255–269, 1983.

[12] A. Gelb, Applied Optimal Estimation, MIT Press, 1974.

[13] D. Sierociuk, I. Tejado, and B. M. Vinagre, "Improved fractional Kalman filter and its application to estimation over lossy networks," Signal Processing, vol. 91, no. 3, pp. 542–552, 2011.

[14] P. H. Abreu, J. Xavier, D. C. Silva, L. P. Reis, and M. Petry, "Using Kalman filters to reduce noise from RFID location system," The Scientific World Journal, vol. 2014, Article ID 796279, 9 pages, 2014.

[15] H. Liu, S. Shah, and W. Jiang, "On-line outlier detection and data cleaning," Computers and Chemical Engineering, vol. 28, no. 9, pp. 1635–1647, 2004.

[16] T. D. Pigott, "A review of methods for missing data," Educational Research and Evaluation, vol. 7, no. 4, pp. 353–383, 2001.

[17] M. Nakai and W. Ke, "Review of the methods for handling missing data in longitudinal data analysis," International Journal of Mathematical Analysis, vol. 5, no. 1–4, pp. 1–13, 2011.

[18] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.

[19] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, L. House, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Germany, 2004.

[20] A. N. Baraldi and C. K. Enders, "An introduction to modern missing data analyses," Journal of School Psychology, vol. 48, no. 1, pp. 5–37, 2010.

[21] J. Tian, B. Yu, D. Yu, and S. Ma, "Clustering-based multiple imputation via gray relational analysis for missing data and its application to aerospace field," The Scientific World Journal, vol. 2013, Article ID 720392, 10 pages, 2013.

[22] S. Zhaowei, Z. Lingfeng, M. Shangjun, F. Bin, and Z. Taiping, "Incomplete time series prediction using max-margin classification of data with absent features," Mathematical Problems in Engineering, vol. 2010, Article ID 513810, 14 pages, 2010.

[23] J.-F. Cai, Z. Shen, and G.-B. Ye, "Approximation of frame based missing data recovery," Applied and Computational Harmonic Analysis, vol. 31, no. 2, pp. 185–204, 2011.

[24] B. L. Wiens and G. K. Rosenkranz, "Missing data in noninferiority trials," Statistics in Biopharmaceutical Research, vol. 5, no. 4, pp. 383–393, 2013.

[25] Aryabhata, The Aryabhatiya of Aryabhata: An Ancient Indian Work on Mathematics and Astronomy, vol. 1, Kessinger Publishing, 2006.

[26] F. E. Grubbs, "Procedures for detecting outlying observations in samples," Technometrics, vol. 11, no. 1, pp. 1–21, 1969.

[27] B. Rosner, "On the detection of many outliers," Technometrics, vol. 17, no. 2, pp. 221–227, 1975.

[28] R. B. Jain, "Percentage points of many-outlier detection procedures," Technometrics, vol. 23, no. 1, pp. 71–75, 1981.

[29] S. P. Verma, L. Díaz-González, M. Rosales-Rivera, and A. Quiroz-Ruiz, "Comparative performance of four single extreme outlier discordancy tests from Monte Carlo simulations," The Scientific World Journal, vol. 2014, Article ID 746451, 27 pages, 2014.



The Scientific World Journal 9


Figure 9: Outlier detection from data sets with ten elements. The first element is the reference element, which is an outlier; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, green squares to nonoutliers, and black arrows to wrong detections. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is an outlier and the outliers are non-Gaussian, the new method identifies only the significant outliers ((a), (b), (c)). When the outliers are Gaussian, MMS automatically becomes inactive (there are then no significant outliers) ((d), (e), (f)).

In general, it is fair to state that (1) when the reference element is not an outlier, the method is capable of identifying all outliers, and (2) when the first few elements of the series are outliers and the outliers are non-Gaussian, the method is capable of identifying only the significant outliers and part of the correct elements.

When the first few elements (the reference elements for both MMS and EMMS) were outliers and the outlier distribution was Gaussian, outlier detection was poor (Figures 9(d), 9(e), and 9(f)). Due to the Gaussian distribution of the outliers, MMS was inactive, and it was not possible to identify the large outliers. Most importantly, the results highlighted



Figure 10: Two artificial data samples with 1000 elements each, including 50-, 100-, 100-, and 50-element (total 300) missing-value regions. The first element is the reference element, which is not an outlier; (a) corresponds to a data set with non-Gaussian outliers, and (b) corresponds to a data set with nearly Gaussian outliers. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. The new method was able to identify all the elements related to the line with 0% error.

the importance of the reference element. If the reference element for MMS and EMMS was not an outlier, good results were guaranteed regardless of other factors.

In the methodology, we derived the method based on the first element. However, it is also possible to use any other element as the reference point and to modify the method accordingly. We considered the simplest situation, in which the first element is not an outlier. Therefore, if it is possible to segment the data so that extreme outliers at the beginning are excluded, the method provides accurate outlier detection. Another possibility is to replace the first element with an already known element. This leads to a further way of applying the method: if only a single correct element is known, using that element as the reference element, with the method modified accordingly, can yield very accurate results.

Some model-based approaches demand a training data set for correct output. In contrast, this method requires only one correct element to produce correct output. In addition, it is possible to use multiple reference points and consider the best fit: for example, (a) consider each point in the first x% (e.g., 10%) of data points as a reference point, or (b) consider every data point as a reference point. Furthermore, it is important to distinguish the purposes of MMS and EMMS: MMS removes only the significant outliers, while EMMS removes the nonsignificant outliers. Depending on the requirement, MMS and/or EMMS can be used to remove outliers.
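The core check can be sketched in a few lines. The following is a simplified reconstruction, not the authors' exact MMS/EMMS implementation: for an arithmetic progression of n elements, (min + max)/sum equals 2/n, and a deviation from 2/n points to the maximum or the minimum as an outlier, which is stripped repeatedly. The function name and the tolerance parameter `tol` are our assumptions, standing in for the paper's k thresholds.

```python
def ap_ratio_outlier(series, tol=1e-6):
    """Flag outliers via the arithmetic-progression sum ratio.

    For an arithmetic progression of n elements,
    (min + max) / sum == 2 / n; a larger ratio points to the maximum,
    a smaller one to the minimum. Assumes a strictly positive sum,
    as in the paper's example data.
    """
    data = list(series)
    outliers = []
    while len(data) > 2:
        n = len(data)
        dev = (min(data) + max(data)) / sum(data) - 2.0 / n
        if abs(dev) <= tol:
            break  # remaining elements behave like an AP
        bad = max(data) if dev > 0 else min(data)  # offending extreme
        outliers.append(bad)
        data.remove(bad)
    return outliers, data
```

For example, `ap_ratio_outlier([1, 2, 3, 4, 5, 100])` strips 100 (ratio 101/115 far above 2/6) and then stops, since the remaining five elements satisfy 6/15 = 2/5 exactly.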

The results show that the new method is a good solution for managing missing values. Figure 10 shows two data sets with 1000 elements each; each data set contains 50-, 100-, 100-, and 50-element (total 300) missing-value regions. When the first element was not an outlier, the new method was able to identify all the elements related to the line with 0% error.
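One plausible reading of the "transform data into a constant value" technique from the introduction is the slope relative to the reference element: for points on a line, (y_i − y_0)/(x_i − x_0) is the same constant for every remaining point, so gaps in the index do not affect it, and the resulting constant series (an arithmetic progression with zero difference) can be screened with the same 2/n ratio check. This is an illustrative assumption on our part, not the paper's exact transform; the function name is hypothetical.

```python
def slopes_to_reference(xs, ys):
    """Slope of each point relative to the reference (first) element.

    For points on a line the result is a constant series, unaffected by
    missing indices. Assumes the reference element (xs[0], ys[0]) is not
    an outlier, as the method requires.
    """
    x0, y0 = xs[0], ys[0]
    return [(y - y0) / (x - x0) for x, y in zip(xs[1:], ys[1:])]
```

With y = 1 + 2x sampled at indices 0, 1, 2, 5, 6 (a gap at 3 and 4), the transform yields [2.0, 2.0, 2.0, 2.0]; an outlying y-value shows up as a deviating slope that the ratio check can flag.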

In the real world, it is not possible to find nonoutliers that agree exactly with a linear regression; therefore, 100% accuracy is unattainable. However, it is very important to have a

data set free of significant outliers. The new method guaranteed such a data set when the outliers were non-Gaussian. Furthermore, in real-world situations, outliers do not always follow a Gaussian distribution, so we expect the new method to be applicable to the majority of outlier detection applications. Our new method is an effective alternative where the most common approaches, LSM and the sigma filter, require Gaussian outliers. Some methods, like the sigma filter, cannot be applied directly to an arbitrary data segment, and further segmentation (windowing) is required for better results. In contrast, the new method is capable of locating nonoutliers automatically in increment, decrement, or constant form, regardless of the size of the window.

Results related to biogas data confirmed the abovementioned idea and showed that the algorithm clearly identifies three regions within a data segment: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers (Figure 11). In addition, the results showed that the nonoutliers follow a linear path. Furthermore, the width of the regions can be tuned by changing the relevant R_w values. Figure 11 shows some selected results for biogas data with a k value of 0.2 for MMS and a k value of 0.1 for EMMS.

One of the interesting observations was the ability of the algorithm to continue linear detection even across noncontinuous clusters (Figures 11(b) and 11(e)). In all data segments, no false detections occurred (there were no outliers in nonoutlier regions and vice versa). Most importantly, the new method required no further windowing, and nonoutliers were detected independently of the window size.

When the general trend was constant and the elements were Gaussian distributed, the sigma filter and LSM were able to identify the linear trend. However, for series with biased elements, both methods failed to identify the general trend. When the general trend was incrementing or decrementing, the sigma filter failed to identify the general trend (a further


[Figure 11: six panels plotting H2 content of biogas (ppm) versus index; (a) 334 elements, (b) 372 elements, (c) 350 elements, (d) 352 elements, (e) 356 elements, (f) 372 elements.]

Figure 11: Results related to real biogas data with data sets of different sizes. The first element is the reference element, which is assumed not to be an outlier. The results showed that the algorithm clearly identifies three regions within each data segment: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers. Most importantly, all the nonoutliers lay within a linear border; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.2 and 0.1, respectively.

segment would give a better result, but we used the whole window). The new method was capable of locating 4% to 45% of elements as outliers with 0% error. Grubbs' test was capable of identifying only a very small fraction of elements as outliers (0–1.7%), even at a significance level of 0.05; however, all the outliers it found were significant, and no wrong detections were reported.

4. Conclusions and Outlook

This paper introduced a new outlier detection method using the relation between the sums of the elements of an arithmetic progression. The results of this work show that the new method is a robust solution for outlier detection in a data set with missing elements. The method is capable of identifying


both significant and nonsignificant outliers when the first value of the data set is not an outlier. Most importantly, the method is a solution for identifying significant outliers in a series with non-Gaussian outliers. In addition, the outlier detection is nonparametric, has floor and ceiling values, and does not require standardization. When the reference elements are unknown, the method can be used with multiple reference elements to obtain optimal output.

If the sampling frequency of the data is sufficient, any nonlinear relation can be represented as a combination of straight lines. Therefore, by using a suitable segmentation technique, it is possible to identify outliers in any data series, which will allow for detecting outliers in a process-oriented data set. To bring a data series into a form that is suitable for our method, an intelligent segmentation technique is therefore necessary.
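As an illustration of the "combination of straight lines" idea, the sum-ratio check can be applied window by window. The fixed-size windowing below is a deliberate simplification under our own assumptions (function names and tolerance are ours); as noted above, an intelligent, data-driven segmentation would be needed in practice.

```python
def _ratio_outliers(seg, tol=1e-6):
    # (min + max) / sum == 2/n holds for an arithmetic progression;
    # repeatedly strip the offending extreme while the ratio deviates.
    data, out = list(seg), []
    while len(data) > 2:
        dev = (min(data) + max(data)) / sum(data) - 2.0 / len(data)
        if abs(dev) <= tol:
            break
        bad = max(data) if dev > 0 else min(data)
        out.append(bad)
        data.remove(bad)
    return out

def segment_and_detect(series, window=20, tol=1e-6):
    # Treat each fixed-size window as one (approximately) linear piece
    # and screen it independently with the ratio check.
    flagged = []
    for start in range(0, len(series), window):
        flagged.extend(_ratio_outliers(series[start:start + window], tol))
    return flagged
```

On a piecewise series, e.g. an increasing segment followed by a constant one, each window is screened on its own, so one segment's trend does not mask the other's outliers.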

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The University of Ruhuna, Sri Lanka, provided the article processing charges of this paper. The German Academic Exchange Service (German: Deutscher Akademischer Austauschdienst) financed this work.

References

[1] R. J. Beckman and R. D. Cook, "Outlier..........s," Technometrics, vol. 25, no. 2, pp. 119–149, 1983.

[2] F. Molinari, "Missing treatments," Journal of Business and Economic Statistics, vol. 28, no. 1, pp. 82–95, 2010.

[3] J. Qin, B. Zhang, and D. H. Y. Leung, "Empirical likelihood in missing data problems," Journal of the American Statistical Association, vol. 104, no. 488, pp. 1492–1503, 2009.

[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.

[5] Z. X. Niu, S. Shi, J. Sun, and X. He, "A survey of outlier detection methodologies and their applications," in Artificial Intelligence and Computational Intelligence, vol. 7002 of Lecture Notes in Computer Science, pp. 380–387, 2011.

[6] W. Jin, A. K. H. Tung, and J. Han, "Mining top-n local outliers in large databases," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 293–298, ACM, San Francisco, Calif, USA, August 2001.

[7] H.-P. Kriegel, M. S. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 444–452, ACM, Las Vegas, Nev, USA, August 2008.

[8] I. Ben-Gal, "Outlier detection," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 131–146, Springer, New York, NY, USA, 2005.

[9] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, December 2002.

[10] H. Fan, O. R. Zaïane, A. Foss, and J. Wu, "A nonparametric outlier detection for effectively discovering top-N outliers from engineering data," in Advances in Knowledge Discovery and Data Mining, W.-K. Ng, M. Kitsuregawa, J. Li, and K. Chang, Eds., vol. 3918 of Lecture Notes in Computer Science, pp. 557–566, Springer, Berlin, Germany, 2006.

[11] J.-S. Lee, "Digital image smoothing and the sigma filter," Computer Vision, Graphics, and Image Processing, vol. 24, no. 2, pp. 255–269, 1983.

[12] A. Gelb, Applied Optimal Estimation, MIT Press, 1974.

[13] D. Sierociuk, I. Tejado, and B. M. Vinagre, "Improved fractional Kalman filter and its application to estimation over lossy networks," Signal Processing, vol. 91, no. 3, pp. 542–552, 2011.

[14] P. H. Abreu, J. Xavier, D. C. Silva, L. P. Reis, and M. Petry, "Using Kalman filters to reduce noise from RFID location system," The Scientific World Journal, vol. 2014, Article ID 796279, 9 pages, 2014.

[15] H. Liu, S. Shah, and W. Jiang, "On-line outlier detection and data cleaning," Computers and Chemical Engineering, vol. 28, no. 9, pp. 1635–1647, 2004.

[16] T. D. Pigott, "A review of methods for missing data," Educational Research and Evaluation, vol. 7, no. 4, pp. 353–383, 2001.

[17] M. Nakai and W. Ke, "Review of the methods for handling missing data in longitudinal data analysis," International Journal of Mathematical Analysis, vol. 5, no. 1–4, pp. 1–13, 2011.

[18] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.

[19] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, L. House, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Germany, 2004.

[20] A. N. Baraldi and C. K. Enders, "An introduction to modern missing data analyses," Journal of School Psychology, vol. 48, no. 1, pp. 5–37, 2010.

[21] J. Tian, B. Yu, D. Yu, and S. Ma, "Clustering-based multiple imputation via gray relational analysis for missing data and its application to aerospace field," The Scientific World Journal, vol. 2013, Article ID 720392, 10 pages, 2013.

[22] S. Zhaowei, Z. Lingfeng, M. Shangjun, F. Bin, and Z. Taiping, "Incomplete time series prediction using max-margin classification of data with absent features," Mathematical Problems in Engineering, vol. 2010, Article ID 513810, 14 pages, 2010.

[23] J.-F. Cai, Z. Shen, and G.-B. Ye, "Approximation of frame based missing data recovery," Applied and Computational Harmonic Analysis, vol. 31, no. 2, pp. 185–204, 2011.

[24] B. L. Wiens and G. K. Rosenkranz, "Missing data in noninferiority trials," Statistics in Biopharmaceutical Research, vol. 5, no. 4, pp. 383–393, 2013.

[25] Aryabhata, The Aryabhatiya of Aryabhata: An Ancient Indian Work on Mathematics and Astronomy, vol. 1, Kessinger Publishing, 2006.

[26] F. E. Grubbs, "Procedures for detecting outlying observations in samples," Technometrics, vol. 11, no. 1, pp. 1–21, 1969.

[27] B. Rosner, "On the detection of many outliers," Technometrics, vol. 17, no. 2, pp. 221–227, 1975.

[28] R. B. Jain, "Percentage points of many-outlier detection procedures," Technometrics, vol. 23, no. 1, pp. 71–75, 1981.

[29] S. P. Verma, L. Díaz-González, M. Rosales-Rivera, and A. Quiroz-Ruiz, "Comparative performance of four single extreme outlier discordancy tests from Monte Carlo simulations," The Scientific World Journal, vol. 2014, Article ID 746451, 27 pages, 2014.


Page 10: Research Article Outlier Detection Method in Linear ...Research Article Outlier Detection Method in Linear Regression Based on Sum of Arithmetic Progression K.K.L.B.Adikaram, 1,2,3

10 The Scientific World Journal

Index

100000

90000

80000

70000

60000

50000

40000

30000

20000

10000

0

10008006004002000

Valu

e

(a)

Valu

e

900000800000700000600000500000400000300000200000100000

minus100000minus200000minus300000minus400000minus500000minus600000minus700000minus800000minus900000

Index10008006004002000

0

(b)

Figure 10 Two artificial data samples with 1000 elements each including 50 100 100 and 50 (total 300) missing value regions The firstelement is the reference element which is not an outlier where (a) corresponds to a data set with outliers in non-Gaussian (b) correspondsto a data set with outliers in nearly Gaussian red triangle corresponds to outliers detected by MMS yellow circle corresponds to outliersdetected by EMMS and green square corresponds to nonoutliers The value of k for MMS and EMMS is 05 and 001 respectively The newmethod was able to identify all the elements related to the line with 0 error

the importance of the reference element If the referenceelement forMMSandEMMSwas not an outlier it guaranteedgood results despite of other factors

In the methodology we derived the method based onthe first element However it is also possible to use anyother element as reference point and modify the methodWe considered the simplest situation where the first elementis not an outlier Therefore if it is possible to segment thedata excluding extreme outliers at the beginning it providesaccurate outlier detection Another possibility is to replacethe first element with an already known element This leadsto another possibility for applying the method if we knowonly a single correct element the use of that element asreference element and of the modified method according tothe reference element can yield very accurate results

Some model-based approaches demand a trained dataset for correct output In contrast this method requires onlyone correct element to produce a correct output In additionit is possible to use multiple reference points and considerthe best fitting For example (a) consider each point in firstx (eg 10) of data points as reference point and (b)consider all data points as the reference point Furthermoreit is important to distinguish the purpose of MMS andEMMS MMS removes only the significant outliers whileEMMS removes nonsignificant outliers Depending on therequirement MMS orand EMMS can be used to removeoutliers

The results show that the new method is a good solutionfor managing missing values Figure 10 shows two data setswith 1000 elements each Each data set consists of 50 100100 and 50 (total 300) missing value regions When the firstelement was not an outlier the new method was able toidentify all the elements related to the line with 0 error

In real world it is not possible to find nonoutliers thatexactly agreewith linear regressionTherefore 100 accuracyis inapplicable However it is very important to have a

significant outlier-free data set The new method guaranteeda significant outlier-free data set when the outliers were non-Gaussian Furthermore in real world situations dataoutliersare not always in Gaussian distribution Due to that we hopethe new method can be applied to the majority of outlierdetection applications Our new method is an effective solu-tion for most common LSM and sigma filter need Gaussianoutliers Some methods like sigma filter cannot be applieddirectly to a certain data segment and further segmentation(windowing) is required for better results In contrast thenew method is capable of locating nonoutliers automaticallyin increment decrement or constant form regardless of thesize of the window

Results related to biogas data proved the abovementionedidea and showed that the algorithm clearly identifies threeregions as significant outliers (outliers from MMS) non-significant outliers (outliers from EMMS) and nonoutlierswithin a data segment (Figure 11) In addition the resultsshowed that the nonoutliers follow a linear path Further-more the width of the regions can be tuned by changing therelevant 119877

119908values Figure 11 shows some selected results of

biogas data for a k value of 02 for MMS and a k value of 01for EMMS

One of the interesting observations was the ability of thealgorithm to continue linear detection even with the noncon-tinuous clusters (Figures 11(b) and 11(e)) In all data segmentsthere occurred no false detection (there were no outliers innonoutlier regions and vice versa)Most importantly the newmethod required no further windowing and nonoutliers weredetected independent of the window size

When the general trend was constant and elements werein Gaussian distribution the Sigma filter and LSM were ableto identify the linear trend However for series with biasedelements both methods failed to identify the general trendWhen the general trend was increment or decrement theSigma filter failed to identify the general trend (a further

The Scientific World Journal 11

650

600

550

500

450

400

350

300

250

200

150

100

50

0

300250200150100500

H2

cont

ent o

f bio

gas (

ppm

)

Index

(a) Elements 334

55

50

45

40

35

30

25

20

15

10

5

H2

cont

ent o

f bio

gas (

ppm

)

3002001000

Index

(b) Elements 372

H2

cont

ent o

f bio

gas (

ppm

)

70

65

60

55

50

45

40

35

30

25

20

300 350250200150100500

Index

(c) Elements 350

H2

cont

ent o

f bio

gas (

ppm

)

300 350250200150100500

Index

95

90

85

80

75

70

65

60

55

50

45

40

35

(d) Elements 352

H2

cont

ent o

f bio

gas (

ppm

)

8580757065605550454035302520151050

300 350250200150100500

Index

(e) Elements 356

H2

cont

ent o

f bio

gas (

ppm

)

3002001000

Index

140

130

120

110

100

90

80

70

60

50

40

(f) Elements 372

Figure 11 Results related to real biogas data with different size of data sets The first element is the reference element which is assumed notto be an outlier Results showed that the algorithm clearly identifies three regions as significant outliers (outliers fromMMS) nonsignificantoutliers (outliers from EMMS) and nonoutliers within each data segment Most importantly all the nonoutliers lied within a linear borderwhere red triangle corresponds to outliers detected by MMS yellow circle corresponds to outliers detected by EMMS and green squarecorresponds to nonoutliers The value of k for MMS and EMMS is 02 and 01 respectively

segment would give better result but we used the wholewindow)The newmethodwas capable of locating 4 to 45of elements as outliers with 0 error Grubbsrsquo test was capableof identifying very small amount of elements as outliers(0ndash17) even with the significance level of 005 Howeverall outliers were significant and no wrong detections werereported

4 Conclusions and Outlook

This paper introduced a new outlier detection method usingthe relation of the sum of the elements of an arithmeticprogression The results of this work prove that the newmethod is a robust solution for outlier detection in a data setwith missing elements The method is capable of identifying

12 The Scientific World Journal

both significant and nonsignificant outliers when the firstvalue of the data set is not an outlier Most importantlythe method is a solution for identifying significant outliersin a series with outliers in non-Gaussian distribution Inaddition the outlier detection is nonparametric has floor andceiling values and does not require standardization Whenthe reference elements are unknown the method can be usedwith multiple reference elements to gain optimal output

If the frequency of the data is sufficient any nonlinearrelation can be represented as a combination of straight linesTherefore by using a suitable segmentation technique it ispossible to identify outliers in any data series This will allowfor detecting outliers in a process-oriented data setThereforeto bring a data series into a form that is suitable for ourmethod an intelligent segmentation technique is necessary

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The University of Ruhuna Sri Lanka provided the paperprocessing charges of this paper The German AcademicExchange Service (German Deutscher Akademischer Aus-tauschdienst) financed this work

References

[1] R J Beckman and R D Cook ldquoOutlier srdquoTechnometrics vol 25 no 2 pp 119ndash149 1983

[2] F Molinari ldquoMissing treatmentsrdquo Journal of Business andEconomic Statistics vol 28 no 1 pp 82ndash95 2010

[3] J Qui B Zhang and D H Y Leung ldquoEmpirical likelihoodin missing data problemsrdquo Journal of the American StatisticalAssociation vol 104 no 488 pp 1492ndash1503 2009

[4] V J Hodge and J Austin ldquoA survey of outlier detectionmethodologiesrdquo Artificial Intelligence Review vol 22 no 2 pp85ndash126 2004

[5] Z X Niu S Shi J Sun and X He ldquoA survey of outlier detectionmethodologies and their applicationsrdquo in Artificial Intelligenceand Computational Intelligence vol 7002 of Lecture Notes inComputer Science pp 380ndash387 2011

[6] W Jin A K H Tung and J Han ldquoMining top-n local outliersin large databasesrdquo in Proceedings of the 7th ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining (KDD rsquo01) pp 293ndash298 ACM San Francisco CalifUSA August 2001

[7] H-P Kriegel M S Schubert and A Zimek ldquoAngle-basedoutlier detection in high-dimensional datardquo in Proceedings ofthe 14th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining (KDD rsquo08) pp 444ndash452 ACM LasVegas Nev USA August 2008

[8] I Ben-Gal ldquoOutlier detectionrdquo in Data Mining and KnowledgeDiscovery Handbook O Maimon and L Rokach Eds pp 131ndash146 Springer New York NY USA 2005

[9] G Williams R Baxter H He S Hawkins and L Gu ldquoAcomparative study of RNN for outlier detection in dataminingrdquoin Proceedings of the IEEE International Conference on DataMining (ICDM 02) pp 709ndash712 December 2002

[10] H Fan O R Zaıane A Foss and J Wu ldquoA nonparametricoutlier detection for effectively discovering top-N outliers fromengineering datardquo inAdvances inKnowledgeDiscovery andDataMining W-K Ng M Kitsuregawa J Li and K Chang Edsvol 3918 of Lecture Notes in Computer Science pp 557ndash566Springer Berlin Germany 2006

[11] J-S Lee ldquoDigital image smoothing and the sigma filterrdquoComputer Vision Graphics and Image Processing vol 24 no 2pp 255ndash269 1983

[12] A Gelb Applied Optimal Estimation MIT Press 1974[13] D Sierociuk I Tejado and BM Vinagre ldquoImproved fractional

Kalman filter and its application to estimation over lossynetworksrdquo Signal Processing vol 91 no 3 pp 542ndash552 2011

[14] P H Abreu J Xavier D C Silva L P Reis andM Petry ldquoUsingKalman filters to reduce noise from RFID location systemrdquoTheScientific World Journal vol 2014 Article ID 796279 9 pages2014

[15] H Liu S Shah and W Jiang ldquoOn-line outlier detection anddata cleaningrdquoComputers andChemical Engineering vol 28 no9 pp 1635ndash1647 2004

[16] T D Pigott ldquoA review ofmethods formissing datardquo EducationalResearch and Evaluation vol 7 no 4 pp 353ndash383 2001

[17] M Nakai and W Ke ldquoReview of the methods for handlingmissing data in longitudinal data analysisrdquo International Journalof Mathematical Analysis vol 5 no 1ndash4 pp 1ndash13 2011

[18] J L Schafer and J W Graham ldquoMissing data our view of thestate of the artrdquo Psychological Methods vol 7 no 2 pp 147ndash1772002

[19] E Acuna and C Rodriguez ldquoThe treatment of missing valuesand its effect on classifier accuracyrdquo inClassification Clusteringand Data Mining Applications D Banks L House F RMcMorris P Arabie and W Gaul Eds pp 639ndash647 SpringerBerlin Germany 2004

[20] A N Baraldi and C K Enders ldquoAn introduction to modernmissing data analysesrdquo Journal of School Psychology vol 48 no1 pp 5ndash37 2010

[21] J Tian B Yu D Yu and S Ma ldquoClustering-based multipleimputation via gray relational analysis for missing data and itsapplication to aerospace fieldrdquoThe Scientific World Journal vol2013 Article ID 720392 10 pages 2013

[22] S Zhaowei Z Lingfeng M Shangjun F Bin and Z TaipingldquoIncomplete time series prediction using max-margin classifi-cation of data with absent featuresrdquo Mathematical Problems inEngineering vol 2010 Article ID 513810 14 pages 2010

[23] J-F Cai Z Shen and G-B Ye ldquoApproximation of frame basedmissing data recoveryrdquo Applied and Computational HarmonicAnalysis vol 31 no 2 pp 185ndash204 2011

[24] B L Wiens and G K Rosenkranz ldquoMissing data in noninferi-ority trialsrdquo Statistics in Biopharmaceutical Research vol 5 no4 pp 383ndash393 2013

[25] Aryabhata The Aryabhatiya of Aryabhata An Ancient IndianWork onMathematics and Astronomy vol 1 Kessinger Publish-ing 2006

[26] F E Grubbs ldquoProcedures for detecting outlying observations insamplesrdquo Technometrics vol 11 no 1 pp 1ndash21 1969

[27] B Rosner ldquoOn the detection of many outliersrdquo Technometricsvol 17 no 2 pp 221ndash227 1975

[28] R B Jain ldquoPercentage points of many-outlier detection proce-duresrdquo Technometrics vol 23 no 1 pp 71ndash75 1981

[29] S P Verma L Dıaz-Gonzalez M Rosales-Rivera and AQuiroz-Ruiz ldquoComparative performance of four single extremeoutlier discordancy tests from Monte Carlo simulationsrdquo TheScientific World Journal vol 2014 Article ID 746451 27 pages2014


The Scientific World Journal 11

[Figure 11: six panels, (a)–(f), each plotting the H2 content of biogas (ppm) against the index, for data segments of (a) 334, (b) 372, (c) 350, (d) 352, (e) 356, and (f) 372 elements.]

Figure 11: Results related to real biogas data with different sizes of data sets. The first element is the reference element, which is assumed not to be an outlier. Results showed that the algorithm clearly identifies three regions within each data segment: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers. Most importantly, all the nonoutliers lie within a linear border, where red triangles correspond to outliers detected by MMS, yellow circles correspond to outliers detected by EMMS, and green squares correspond to nonoutliers. The value of k for MMS and EMMS is 0.2 and 0.1, respectively.

segment would give a better result, but we used the whole window). The new method was capable of locating 4% to 45% of the elements as outliers with 0% error. Grubbs' test was capable of identifying only a very small fraction of the elements as outliers (0%–1.7%), even with a significance level of 0.05. However, all of its detected outliers were significant and no wrong detections were reported.

4. Conclusions and Outlook

This paper introduced a new outlier detection method using the relation of the sum of the elements of an arithmetic progression. The results of this work prove that the new method is a robust solution for outlier detection in a data set with missing elements. The method is capable of identifying both significant and nonsignificant outliers when the first value of the data set is not an outlier. Most importantly, the method is a solution for identifying significant outliers in a series with outliers in non-Gaussian distribution. In addition, the outlier detection is nonparametric, has floor and ceiling values, and does not require standardization. When the reference elements are unknown, the method can be used with multiple reference elements to gain optimal output.
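The core arithmetic-progression rule can be sketched as follows. This is a minimal, hypothetical illustration only — the function name is ours, positive-valued data is assumed (so the ratio stays in (0, 1]), and the full MMS/EMMS procedure with its k parameter is not reproduced:

```python
def ratio_outlier_check(values, tol=1e-9):
    """For an arithmetic progression of n positive elements, the ratio
    (min + max) / sum equals exactly 2 / n. A smaller ratio suggests the
    minimum is an outlier; a larger ratio suggests the maximum is."""
    n = len(values)
    ratio = (min(values) + max(values)) / sum(values)
    target = 2 / n
    if abs(ratio - target) <= tol:
        return None  # consistent with an outlier-free progression
    return "min" if ratio < target else "max"
```

For example, the progression [10, 20, 30, 40, 50] yields exactly the target ratio 2/5, while replacing 50 with 500 pushes the ratio above the target and flags the maximum.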

If the frequency of the data is sufficient, any nonlinear relation can be represented as a combination of straight lines. Therefore, by using a suitable segmentation technique, it is possible to identify outliers in any data series. This will allow for detecting outliers in a process-oriented data set. To bring a data series into a form that is suitable for our method, however, an intelligent segmentation technique is necessary.
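A piecewise scan of this kind might look like the sketch below. Note that this is our own illustrative assumption (fixed-size windows and a toy median-distance detector, not the paper's MMS/EMMS rule), not the intelligent segmentation the outlook calls for:

```python
import statistics

def windowed_outlier_scan(series, is_outlier_in_window, window=50):
    # Split the series into consecutive fixed-size windows, each assumed
    # to be approximately linear, and run a per-window detector on each.
    flags = []
    for start in range(0, len(series), window):
        flags.extend(is_outlier_in_window(series[start:start + window]))
    return flags

def far_from_median(segment, threshold=100):
    # Toy per-window detector: flag points far from the window median.
    m = statistics.median(segment)
    return [abs(x - m) > threshold for x in segment]
```

Any per-window detector with the same signature (a segment in, one flag per element out) could be substituted for `far_from_median`.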

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The University of Ruhuna, Sri Lanka, provided the paper processing charges of this paper. The German Academic Exchange Service (German: Deutscher Akademischer Austauschdienst) financed this work.

References

[1] R. J. Beckman and R. D. Cook, "Outlier…s," Technometrics, vol. 25, no. 2, pp. 119–149, 1983.

[2] F. Molinari, "Missing treatments," Journal of Business and Economic Statistics, vol. 28, no. 1, pp. 82–95, 2010.

[3] J. Qin, B. Zhang, and D. H. Y. Leung, "Empirical likelihood in missing data problems," Journal of the American Statistical Association, vol. 104, no. 488, pp. 1492–1503, 2009.

[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.

[5] Z. X. Niu, S. Shi, J. Sun, and X. He, "A survey of outlier detection methodologies and their applications," in Artificial Intelligence and Computational Intelligence, vol. 7002 of Lecture Notes in Computer Science, pp. 380–387, 2011.

[6] W. Jin, A. K. H. Tung, and J. Han, "Mining top-n local outliers in large databases," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 293–298, ACM, San Francisco, Calif, USA, August 2001.

[7] H.-P. Kriegel, M. S. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 444–452, ACM, Las Vegas, Nev, USA, August 2008.

[8] I. Ben-Gal, "Outlier detection," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 131–146, Springer, New York, NY, USA, 2005.

[9] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, December 2002.

[10] H. Fan, O. R. Zaïane, A. Foss, and J. Wu, "A nonparametric outlier detection for effectively discovering top-N outliers from engineering data," in Advances in Knowledge Discovery and Data Mining, W.-K. Ng, M. Kitsuregawa, J. Li, and K. Chang, Eds., vol. 3918 of Lecture Notes in Computer Science, pp. 557–566, Springer, Berlin, Germany, 2006.

[11] J.-S. Lee, "Digital image smoothing and the sigma filter," Computer Vision, Graphics, and Image Processing, vol. 24, no. 2, pp. 255–269, 1983.

[12] A. Gelb, Applied Optimal Estimation, MIT Press, 1974.

[13] D. Sierociuk, I. Tejado, and B. M. Vinagre, "Improved fractional Kalman filter and its application to estimation over lossy networks," Signal Processing, vol. 91, no. 3, pp. 542–552, 2011.

[14] P. H. Abreu, J. Xavier, D. C. Silva, L. P. Reis, and M. Petry, "Using Kalman filters to reduce noise from RFID location system," The Scientific World Journal, vol. 2014, Article ID 796279, 9 pages, 2014.

[15] H. Liu, S. Shah, and W. Jiang, "On-line outlier detection and data cleaning," Computers and Chemical Engineering, vol. 28, no. 9, pp. 1635–1647, 2004.

[16] T. D. Pigott, "A review of methods for missing data," Educational Research and Evaluation, vol. 7, no. 4, pp. 353–383, 2001.

[17] M. Nakai and W. Ke, "Review of the methods for handling missing data in longitudinal data analysis," International Journal of Mathematical Analysis, vol. 5, no. 1–4, pp. 1–13, 2011.

[18] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.

[19] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, L. House, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Germany, 2004.

[20] A. N. Baraldi and C. K. Enders, "An introduction to modern missing data analyses," Journal of School Psychology, vol. 48, no. 1, pp. 5–37, 2010.

[21] J. Tian, B. Yu, D. Yu, and S. Ma, "Clustering-based multiple imputation via gray relational analysis for missing data and its application to aerospace field," The Scientific World Journal, vol. 2013, Article ID 720392, 10 pages, 2013.

[22] S. Zhaowei, Z. Lingfeng, M. Shangjun, F. Bin, and Z. Taiping, "Incomplete time series prediction using max-margin classification of data with absent features," Mathematical Problems in Engineering, vol. 2010, Article ID 513810, 14 pages, 2010.

[23] J.-F. Cai, Z. Shen, and G.-B. Ye, "Approximation of frame based missing data recovery," Applied and Computational Harmonic Analysis, vol. 31, no. 2, pp. 185–204, 2011.

[24] B. L. Wiens and G. K. Rosenkranz, "Missing data in noninferiority trials," Statistics in Biopharmaceutical Research, vol. 5, no. 4, pp. 383–393, 2013.

[25] Aryabhata, The Aryabhatiya of Aryabhata: An Ancient Indian Work on Mathematics and Astronomy, vol. 1, Kessinger Publishing, 2006.

[26] F. E. Grubbs, "Procedures for detecting outlying observations in samples," Technometrics, vol. 11, no. 1, pp. 1–21, 1969.

[27] B. Rosner, "On the detection of many outliers," Technometrics, vol. 17, no. 2, pp. 221–227, 1975.

[28] R. B. Jain, "Percentage points of many-outlier detection procedures," Technometrics, vol. 23, no. 1, pp. 71–75, 1981.

[29] S. P. Verma, L. Díaz-González, M. Rosales-Rivera, and A. Quiroz-Ruiz, "Comparative performance of four single extreme outlier discordancy tests from Monte Carlo simulations," The Scientific World Journal, vol. 2014, Article ID 746451, 27 pages, 2014.
