Research Article: Outlier Detection Method in Linear Regression Based on Sum of Arithmetic Progression
K. K. L. B. Adikaram,1,2,3 M. A. Hussein,1 M. Effenberger,2 and T. Becker1
1 Group Bio-Process Analysis Technology, Technische Universität München, Weihenstephaner Steig 20, 85354 Freising, Germany
2 Institut für Landtechnik und Tierhaltung, Vöttinger Straße 36, 85354 Freising, Germany
3 Computer Unit, Faculty of Agriculture, University of Ruhuna, Mapalana, 81100 Kamburupitiya, Sri Lanka
Correspondence should be addressed to K. K. L. B. Adikaram; lasantha@daad-alumni.de
Received 25 March 2014; Revised 23 May 2014; Accepted 26 May 2014; Published 10 July 2014
Academic Editor: Zengyou He
Copyright © 2014 K. K. L. B. Adikaram et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
We introduce a new nonparametric outlier detection method for linear series which requires no imputation of missing or removed data. For an arithmetic progression (a series without outliers) with n elements, the ratio (R) of the sum of the minimum and the maximum elements to the sum of all elements is always 2/n ∈ (0, 1]. R ≠ 2/n always implies the existence of outliers. Usually, R < 2/n implies that the minimum is an outlier, and R > 2/n implies that the maximum is an outlier. Based upon this, we derived a new method for identifying significant and nonsignificant outliers separately. Two different techniques were used to manage missing data and removed outliers: (1) recalculate the terms after (or before) the removed or missing element while maintaining the initial angle in relation to a certain point, or (2) transform the data into a constant value, which is not affected by missing or removed elements. With a reference element which was not an outlier, the method detected all outliers from data sets with 6 to 1000 elements containing 50% outliers which deviated by a factor of ±10e−2 to ±10e+2 from the correct value.
1 Introduction
Outlier detection and management of missing data are the two major steps in the data cleaning/cleansing process [1–3]. For achieving a training set, data mining, and statistical analyses, it is very important to have data sets that have no (or as few as possible) outliers and missing values. Except for model-based approaches, outlier detection and the replacing of detected outliers or missing values are two separate processes.
The existing outlier detection methods are based on statistical, distance, density, distribution, depth, clustering, angle, and model approaches [1, 4–7]. The nonparametric outlier detection methods are independent of the model. For data without prior knowledge, nonparametric methods are known to be a better solution than the statistical (parametric) methods [8–10]. The most common nonparametric methods are based on distance, density, depth, cluster, angle, and resolution techniques. Among the various methods/techniques are the least square method (LSM) [4] and the sigma filter [11], which have been used frequently to remove the outliers of linear regression. These methods require data in Gaussian or near-Gaussian distribution, which cannot always be guaranteed. If the correct model can be identified, model-based approaches like the Kalman filter [12–14] are suitable for removing and replacing outliers. However, if it is not possible to identify the correct model, the model-based approach is not feasible [15].
In addition to noise, missing data is another challenge in the data cleaning/cleansing process. Even if the original data set is without missing elements, removing outliers (without replacement) automatically creates a missing data environment. The two most common techniques to recover from this situation are (1) filling the missing data with an estimated value (filling) or (2) using the data without missing values (rejecting missing values). Complete-case analysis (listwise deletion) and available-case analysis (pairwise deletion) are the most common missing data rejection methods [16–18]. The mentioned methods rest on the assumption that they yield unbiased results. Among the different missing data filling methods, hot deck, cold deck, mean, median, k-nearest
Hindawi Publishing Corporation, The Scientific World Journal, Volume 2014, Article ID 821623, 12 pages, http://dx.doi.org/10.1155/2014/821623
neighbours, model-based methods, maximum likelihood methods, and multiple imputation are the most common [18–22]. Filling methods derive the filling value from the same or other known existing data. If there is a considerable number of outliers, the derived data may be biased due to the influence of the outliers [23, 24]. Therefore, the best way is to remove all outliers and replace them with a suitable method.
In this paper, we introduce a new nonparametric outlier detection method based on the sum of an arithmetic progression, which uses the indicator 2/n, where n is the number of terms in the series. The properties used in existing nonparametric methods, such as distance, density, depth, cluster, angle, and resolution, are domain dependent. In contrast, the value 2/n, which we use in our new method, is independent of the domain conditions.
Contrary to the existing nonparametric methods mentioned earlier, this work addresses identifying outliers in a data set that is expected to have a linear relation. The method is capable of identifying significant and nonsignificant outliers separately. Moreover, until all the outliers are removed, the new method requires no imputation of missing or removed data. This eliminates the negative influence of wrongly filled data points, which is an advantage over methods that require filling the removed data points. The outlier detection method we introduce showed its best performance when the significant outliers are in non-Gaussian distribution. This is an advantage over existing methods such as LSM and the sigma filter. The method uses a single data point as a reference data point, which is assumed to be a nonoutlier. Therefore, the accuracy of the outcome depends on the reference point, especially when locating nonsignificant outliers. If the selected reference point is not an outlier, the method is capable of locating outliers from a data set containing a very high rate of outliers, such as 50% outliers.
In this work, data from biogas plants were used for evaluating the new method. Since the biogas process is very sensitive, these data contain a considerable amount of noise even during apparently stable conditions. This provides a suitable data set for evaluating our method. We were able to get the best outlier-free macroscale data set which agrees with linear (increasing, decreasing, or constant) regression from selected segments of a data set.
2 Methodology
2.1 Arithmetic Progression. An arithmetic progression (AP) or arithmetic sequence is a sequence of numbers (ascending, descending, or constant) such that the difference between successive terms is constant [25]. The nth term of a finite AP with n elements is given by

a_n = d(n − 1) + a_1, (1)

where d is the common difference of successive members and a_1 is the first element of the series. The sum of the elements of a finite AP with n elements is given by

S_n = (n/2)(a_1 + a_n), (2)
Table 1: Sample calculations illustrating the relation between 2/n and (a_1 + a_n)/S_n.

                  Data set 1   Data set 2   Data set 3   Data set 4   Data set 5
a_1               100          100          100          99.99        1
a_2               101          101          101          101          101
a_3               102          102          102          102          102
a_4               103          103          103          103          103
a_5               104          104.01       204          104          104
(a_1 + a_5)/S_5   0.4          0.40001      0.498        0.399        0.255
2/n               0.4          0.4          0.4          0.4          0.4
Outlier           —            Yes (a_5)    Yes (a_5)    Yes (a_1)    Yes (a_1)
where a_1 is the first element and a_n is the last element of the series.

Equation (1) is a function f(n) and fulfils the requirements of a line; in other words, a finite AP is a straight line. In addition, a straight line is a series without outliers: if there are outliers, the series is not a finite AP. Therefore, any arithmetic series that fulfils the requirements of an AP can be considered a series without outliers. Equation (2) can be represented as

2/n = (a_1 + a_n)/S_n, ∞ > n ≥ 2, 0 < 2/n ≤ 1. (3)
For any AP, the right-hand side (RHS) of (3) always evaluates to 2/n, which is independent of the terms of the series. In other words, if there are no outliers, the value (a_1 + a_n)/S_n will always be equal to 2/n. If the RHS of (3) is not 2/n, it always implies that the series contains outliers. Therefore, the value 2/n can be used as a global indicator to identify any AP with outliers.
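As a minimal sketch of this indicator (the function name is ours, not from the paper), the ratio (a_1 + a_n)/S_n can be compared against 2/n in a few lines of Python:

```python
def ap_ratio(series):
    """Ratio R = (first + last) / sum of all elements; equals 2/n for an AP."""
    return (series[0] + series[-1]) / sum(series)

# A pure AP with n = 5 terms: R equals 2/n = 0.4 exactly.
ap = [100, 101, 102, 103, 104]
print(ap_ratio(ap))                          # 0.4

# Replacing the last term with an outlier (data set 3 of Table 1)
# pushes R above 2/n, signalling that the maximum is an outlier.
print(ap_ratio([100, 101, 102, 103, 204]))   # ~0.498
```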
Since we use the relation of an AP, we define that elements lying on or between two lines (a linear border) are nonoutliers and others are outliers. When the distance between the two lines is zero, they represent a single line. In relation to the method presented in this paper, the term nonoutlier implies an element that lies within a certain linear border, and the term outlier implies an element that does not lie within the linear border.
Primary investigations showed that the method is capable of not only indicating the existence of outliers but also locating the outlier: (a_1 + a_n)/S_n > 2/n indicates that the maximum element is the outlier, and (a_1 + a_n)/S_n < 2/n indicates that the minimum element is the outlier. However, (a_1 + a_n)/S_n = 2/n does not imply that the series is free of outliers. Furthermore, primary investigations showed that the method is capable of locating both large and small outliers. Table 1 shows sample calculations illustrating the relation between 2/n and (a_1 + a_n)/S_n.
As a principle, the relation of (3) is capable of identifying and locating the outliers. However, we found seven drawbacks which made relation (3) unusable for identifying outliers in actual data. In Sections 2.2 to 2.8 we address the challenges of making the relation usable.
2.2 Challenge 1: Notation of the Equation. The symbols used in (3), especially a_1 and a_n, create a logical barrier. For example, if there are outliers, the minimum and the maximum can be elements other than a_1 and a_n. Therefore, it is necessary
Figure 1: Distribution of the criteria range (0, 1]: the value 2/n corresponds to no outliers; values on either side of 2/n indicate outliers.
to use meaningful symbols that reflect the purpose of the method. The first and the last elements are either the minimum or the maximum. Therefore, it is possible to replace a_1 and a_n by the minimum (a_min) and the maximum (a_max) of the series. Then (3) can be represented as

2/n = (a_min + a_max)/S_n. (4)

Since the RHS of (4) consists of the minimum, maximum, and sum of the series, the RHS was named MMS, with the meaning of minimum, maximum, and sum:

MMS = (a_min + a_max)/S_n. (5)
2.3 Challenge 2: Set a Range for the Outlier Detection Criterion. According to (3), the outlier detection criterion is 2/n and can be used to check for the elements that exactly agree with a line (Figure 1). To identify elements in a certain range, it is necessary to have a criteria range rather than the single value 2/n.
The left-hand side of (4) is the ratio 2/n and was named R_w by adding a weight w to R. Then

R_w = 2/n + w, 0 ≤ w ≤ 1 − 2/n. (6)
The status w = 0 (R_0) represents a single line, and w > 0 represents a line with a certain width (a linear border). The outlier criteria range has both a floor (0) and a ceiling (1), so standardization is not required. This is an additional advantage over the most common average-, variance-, and standard deviation-based approaches, which require a separate standardization process.
2.4 Challenge 3: Influence of Negative Values. Due to negative values, the numerator or both the numerator and the denominator of the RHS of (5) can be 0 (e.g., −4, −1, 0, 1, 4) even without outliers. When there are outliers, the RHS of (5) can be negative, which cannot be accepted as a valid value for 2/n: 0 < 2/n ≤ 1 must always hold.
Subtracting the minimum element (a_i,new = a_i − a_min) from each element of any AP creates a new transformed AP where a_min = 0 and guarantees a series without negative values. From (5) and a_i,new = a_i − a_min, (7) is derived, which is more robust. Another advantage of (7) is that it performs the transformation automatically:

MMS = ((a_min − a_min) + (a_max − a_min)) / Σ_{i=1}^{n} (a_i − a_min)

MMS = (a_max − a_min) / (S_n − a_min · n). (7)
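A short sketch of (7) (the function name is ours) shows that the implicit shift by a_min makes the indicator insensitive to negative values:

```python
def mms(series):
    """Eq. (7): MMS = (a_max - a_min) / (S_n - a_min * n).
    The subtraction of a_min is built in, so negative values are harmless."""
    n = len(series)
    return (max(series) - min(series)) / (sum(series) - min(series) * n)

# An outlier-free AP containing negatives (sum is 0, which would break
# the plain ratio of eq. (5)) still yields exactly 2/n = 0.4.
print(mms([-4, -2, 0, 2, 4]))   # 0.4
```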
2.5 Challenge 4: Uneven Distribution of the Criteria Range. The ranges (0, 2/n) and (2/n, 1] identify outliers which are minimums and maximums, respectively (Figure 1). When n → ∞, R_0 → 0; then R_w ∈ (0, 1] is not equally distributed, which provides a large range for maximum outliers and a small range for minimum outliers. This is a problem when locating minimum outliers.
To solve this, we used the idea of the complement. For any series, this converts the maximum value into the minimum, the minimum value into the maximum, and intermediate values into their complements. Most importantly, the minimum value now represents the maximum value of the original series and vice versa, while still representing the original series. The complement of an element in a series can be defined as a_i,c = (a_max + a_min) − a_i. From (5) and a_i,c = (a_max + a_min) − a_i, this gives

MMS = ((a_max + a_min − a_max) + (a_max + a_min − a_min)) / Σ_{i=1}^{n} (a_max + a_min − a_i)

MMS = (a_min + a_max) / Σ_{i=1}^{n} (a_max + a_min − a_i). (8)
Applying a_i,new = a_i − a_min (to remove the effect of negative values) gives

MMS = ((a_min − a_min) + (a_max − a_min)) / Σ_{i=1}^{n} ((a_max − a_min) + (a_min − a_min) − (a_i − a_min))

MMS = (a_max − a_min) / Σ_{i=1}^{n} (a_max − a_i)

MMS = (a_max − a_min) / (a_max · n − S_n). (9)
Consequently, the range R_0 > 2/n represents the range for minimum outliers related to the original series and vice versa (Figure 2), and it is possible to ignore the range (0, 2/n). In addition, (9) performs the transformation automatically.
Now there are two equations for MMS, (7) and (9), to check whether the maximum or the minimum of the series is an outlier. We named the two versions of MMS as MMS_max (10) and MMS_min (11):

MMS_max = (a_max − a_min) / (S_n − a_min · n) (10)

MMS_min = (a_max − a_min) / (a_max · n − S_n). (11)
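Equations (10) and (11) can be sketched together (the function name is ours); the example below reproduces data set 2 of Table 2, where the maximum (104.01) is the outlier:

```python
def mms_pair(series):
    """Eqs. (10) and (11): (MMS_max, MMS_min) of a series."""
    n, s = len(series), sum(series)
    lo, hi = min(series), max(series)
    return (hi - lo) / (s - lo * n), (hi - lo) / (hi * n - s)

# Data set 2 of Table 2: MMS_max exceeds 2/n = 0.4 while MMS_min
# falls below it, so the maximum is flagged as the outlier.
mx, mn = mms_pair([100, 101, 102, 103, 104.01])
print(round(mx, 3), round(mn, 3))   # 0.401 0.399
```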
Table 2: Sample calculations illustrating the relation between 2/n and MMS_max and MMS_min.

           Data set 1   Data set 2   Data set 3   Data set 4   Data set 5
a_1        100          100          100          99.99        1
a_2        101          101          101          101          101
a_3        102          102          102          102          102
a_4        103          103          103          103          103
a_5        104          104.01       204          104          104
MMS_max    0.4          0.401        0.945        0.399        0.254
MMS_min    0.4          0.399        0.254        0.401        0.945
2/n        0.4          0.4          0.4          0.4          0.4
Outlier    —            Yes (max)    Yes (max)    Yes (min)    Yes (min)
Figure 2: Range of R_w for the original series and for the complement of the original series (R_0 = 2/n within the range 0 to 1). For the complement series, the maximum outliers represent the minimum outliers of the original series.
The following expression shows an overview of the MMS process:

MMS_max = (a_max − a_min) / (S_n − a_min · n): if MMS_max > (2/n) + w_1, the maximum is the outlier; if MMS_max ≤ (2/n) + w_1, it is not.

MMS_min = (a_max − a_min) / (a_max · n − S_n): if MMS_min > (2/n) + w_1, the minimum is the outlier; if MMS_min ≤ (2/n) + w_1, it is not. (12)
Table 2 shows sample calculations using (10) and (11) for the same data sets as in Table 1.
2.6 Challenge 5: How to Deal with Removed Outliers/Missing Values. In a series there can be initial missing values. In addition, if there is no replacement after removing an outlier, this also creates a missing value environment. If there is no filling, the elements after the removed element are transformed into other values, and the original relationship of the elements is destroyed (Figure 3). These transformed values become outliers in relation to the original data. Therefore, for using the relation of an AP, it is compulsory to maintain the original relation of the data even after removing an outlier. Thus, any rejection technique is not feasible. To maintain the original relation, one possible way is replacing the missing value. However, the data we are considering contain a considerable number of outliers. Therefore, we cannot guarantee that an element derived from existing elements is not an outlier.

To overcome this problem, we considered two different options: (1) recalculate only the data points after (or before) the removed or missing element, thereby maintaining the initial angle in relation to a certain point, or (2) transform the elements into a new series on which the missing value has no effect.
2.6.1 Recalculate the Data Points after (or before) Removed and Missing Elements. If there is a missing element, the next elements will be shifted horizontally and transformed into wrong values in relation to the current index of the elements (Figure 3). However, angular shifting will not introduce such an error (Figure 3).

In Figure 4, the plot consists of elements a_0 to a_{r+1} (r ∈ Z+), and the element a_r at index r needs to be removed. After removing element r, element r + 1 becomes element r, element r + 2 becomes element r + 1, and so on. However, by shifting while maintaining the same angle with respect to a certain reference element (e.g., the first element), the same form of the series can be maintained. Equation (13) shows the new value after angular shifting. We used this technique with the MMS algorithm to recalculate the series after (or before) missing values or removed elements:

B_r T_r = (B_{r+1} C_{r+1} / A B_{r+1}) · A B_r = ((a_{r+1} − a_0) / (r + 1)) · r

(a_{r+1})_new = a_0 + B_r T_r. (13)
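The angular-shift recalculation of (13) can be sketched as follows (the function name is ours; the reference element a_0 is assumed to sit at index 0). Each element after the removed index moves one position left while keeping its angle with respect to a_0, i.e., its new value is a_0 + (a_j − a_0)(j − 1)/j:

```python
def angular_shift(series, r):
    """Remove the element at index r and recalculate the later elements
    with eq. (13): each element keeps its angle w.r.t. the reference a_0
    while moving one index to the left (no imputation of the removed point)."""
    a0 = series[0]
    out = list(series[:r])
    for j in range(r + 1, len(series)):
        out.append(a0 + (series[j] - a0) * (j - 1) / j)
    return out

# Removing the outlier 999 from an otherwise perfect line y = 10x
# leaves a perfect (shorter) line again.
print(angular_shift([0, 10, 999, 30, 40, 50], 2))   # [0, 10, 20.0, 30.0, 40.0]
```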
2.6.2 Transformation of Data to a Constant Value. A series with a constant value (of the form y = c, where c is a constant) is a series that is unaffected by missing values. Because of that, if it is possible to transform any linear series to the y = c form, the transformed series is free of any effect of missing values. After that, the transformed series can be used for outlier detection.

Let y_T be a linear series where y_T,k = y_k − y_1 and x_T,k = x_k − x_1; x_k is the initial index of the elements and y_k is the kth element of the series, k = 1, 2, …, n. The gradient of the line (m) is given by Σ_{k=1}^{n} y_T,k / Σ_{k=1}^{n} x_T,k. If one element (e.g., the first element) is (0, 0), this relation is always true, even with missing values; the element (0, 0) can be considered the reference element. Thus y_T is a series with first element (0, 0) and a gradient m that can be calculated even with missing values. It is also possible to derive a new series y′ where y′_k = x_T,k · m. If there are no outliers, y_T and y′ coincide and y_T − y′ = 0. If y_TT = y_T − y′, then y_TT is of the form y = c without any influence from missing values. Therefore, this is another method to overcome missing values without replacing them (Figure 5).
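A sketch of this transformation (the function name is ours) shows the immunity to missing indices: even with index 3 absent, every point of a clean line maps to the constant 0:

```python
def to_constant_form(x, y):
    """Transform a linear series to y_TT = |y_T - m * x_T|, which is 0 for
    every nonoutlier point of a line, even when some indices are missing."""
    yT = [yi - y[0] for yi in y]
    xT = [xi - x[0] for xi in x]
    m = sum(yT) / sum(xT)   # gradient through the reference element (0, 0)
    return [abs(yTi - m * xTi) for yTi, xTi in zip(yT, xT)]

# The line y = 2x + 1 sampled at x = 0, 1, 2, 4, 5 (index 3 missing):
# the transformed series is constant 0 despite the gap.
print(to_constant_form([0, 1, 2, 4, 5], [1, 3, 5, 9, 11]))
```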
2.7 Challenge 6: Locate Outliers That Are Neither the Maximum Nor the Minimum of the Series. When the outlier is
Figure 3: (a) Data set with an outlier at index r. (b) Value autotransformation effect after removing the outlier at index r without replacement, where circles correspond to the elements after removing the outlier, red triangles correspond to the expected (correct) elements, and squares correspond to the initial values of the shifted elements after removing the outlier.
Table 3: "Bad Detection": the wrong (minimum) element is identified as the outlier.

           Data set 6
a_1        100
a_2        101
a_3        102
a_4        103.6
a_5        104
MMS_max    0.377
MMS_min    0.425
2/n        0.4
Outlier    Yes (min)
neither the maximum nor the minimum, MMS is unable to locate the outlier (Table 3). We named this phenomenon "Bad Detection". When R_w reaches the "Bad Detection Level", MMS cannot be applied. To overcome this situation, we introduced an improved version of MMS, enhanced MMS (EMMS), based on the missing data handling technique in Section 2.6.2.
EMMS is expressed as

EMMS_max = (a_TT,max − a_TT,min) / (S_TT,n − a_TT,min · n), a_TT,max ≠ 0, (14)

EMMS_min = (a_TT,max − a_TT,min) / (a_TT,max · n − S_TT,n), a_TT,max ≠ 0, (15)

where a_TT,k = |a_T,k − x_k (G_aT / G_x)|, a_T,k = a_k − a_0, x_k is the index of the data, a_k is the kth term of the series, k = 0, 1, …, n − 1, n is the number of elements in the current window, G_aT = Σ_{k=0}^{n−1} a_T,k, G_x = Σ_{k=0}^{n−1} x_k, and S_TT,n = Σ_{k=0}^{n−1} a_TT,k.
Figure 4: Solution for the value autotransformation phenomenon: use angular shifting instead of a horizontal shift, where (×) corresponds to the horizontal shift and (•) corresponds to the angular shift.
The term a_TT,k ≥ 0 always; thus a_TT,min = 0. Then (14) and (15) simplify to

EMMS_max = a_TT,max / S_TT,n, a_TT,max ≠ 0, (16)

EMMS_min = a_TT,max / (a_TT,max · n − S_TT,n), a_TT,max ≠ 0. (17)
If there are outliers, EMMS_min > 2/n or EMMS_max > 2/n, and the greater value represents the outlier. Table 4 shows
Table 4: EMMS for identifying an outlier.

x_k   y (a_k)   y_T (a_k − a_0)   y_TT (|y_T − x_k (G_Ty / G_x)|)
0     100       0.000             0.00
1     101       1.000             0.06
2     102       2.000             0.12
3     103.6     3.600             0.42
4     104       4.000             0.24

G_x = 10, G_Ty = 10.6
EMMS (max)   0.500
EMMS (min)   0.333
2/n          0.4
Outlier      Yes (max)
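The worked example of Table 4 can be reproduced with a short sketch of (16) and (17) (the function name is ours; the reference element a_0 is assumed to be a nonoutlier):

```python
def emms(x, a):
    """Eqs. (16) and (17): (EMMS_max, EMMS_min) computed on the a_TT series."""
    aT = [ai - a[0] for ai in a]                      # a_T,k = a_k - a_0
    m = sum(aT) / sum(x)                              # G_aT / G_x
    aTT = [abs(aTi - m * xi) for aTi, xi in zip(aT, x)]
    hi, s, n = max(aTT), sum(aTT), len(aTT)
    return hi / s, hi / (hi * n - s)

# Data set 6 (Tables 3 and 4): the interior element 103.6 is the outlier.
# EMMS_max = 0.5 > 2/n = 0.4, so the maximum of a_TT is flagged,
# correctly locating the element that plain MMS misses.
emx, emn = emms([0, 1, 2, 3, 4], [100, 101, 102, 103.6, 104])
print(round(emx, 3), round(emn, 3))   # 0.5 0.333
```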
Figure 5: Transformation of data to a constant value to overcome the missing data problem, where red diamonds correspond to the y = f(x) form, yellow squares to the y_T = f(x) − f(x_0) form, crosses to the y′ = m · x_T form (m = Σ y_T / Σ x_T), red circles to y_TT = y_T − y′, and open circles to missing values.
an example calculation of EMMS, and the following expression shows an overview of the EMMS process:

EMMS_max = a_TT,max / S_TT,n: if EMMS_max > (2/n) + w_2, the maximum is the outlier; if EMMS_max ≤ (2/n) + w_2, it is not.

EMMS_min = a_TT,max / (a_TT,max · n − S_TT,n): if EMMS_min > (2/n) + w_2, the minimum is the outlier; if EMMS_min ≤ (2/n) + w_2, it is not. (18)
However, EMMS uses information derived from the existing data. If there are biased values, this may lead to biased information. Because of that, direct application of EMMS is not good practice. Hence, significant outliers should first be removed using MMS before applying EMMS.
Figure 6: Implementation of MMS and EMMS. Starting from a consecutive series with n elements, the algorithm first checks for significant outliers using MMS with R_w = (2/n)(1 + k_1); each time an outlier is detected, it is removed (n = n − 1) and the series is recalculated. After all significant outliers are removed, the nonsignificant outliers are removed in the same way using EMMS with R_w = (2/n)(1 + k_2). There is no removed data imputation in relation to either MMS or EMMS.
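The two-pass scheme of Figure 6 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' C++ implementation: removed points are dropped (their original x indices are kept) instead of being angularly recalculated, the reference element is assumed to be a nonoutlier, and all names are ours:

```python
def detect_outliers(points, k1=0.5, k2=0.01):
    """Simplified sketch of Figure 6. points is a list of (x, y) pairs;
    returns (kept_points, removed_outliers). No removed data imputation."""
    pts, removed = list(points), []

    # Pass 1: MMS (eqs. (10)/(11)) with R_w = (2/n)(1 + k1)
    # strips the significant outliers.
    while len(pts) > 2:
        ys = [y for _, y in pts]
        n, s, lo, hi = len(ys), sum(ys), min(ys), max(ys)
        if hi == lo:
            break
        rw = (2 / n) * (1 + k1)
        mms_max = (hi - lo) / (s - lo * n)
        mms_min = (hi - lo) / (hi * n - s)
        if mms_max > rw and mms_max >= mms_min:
            removed.append(pts.pop(ys.index(hi)))
        elif mms_min > rw:
            removed.append(pts.pop(ys.index(lo)))
        else:
            break

    # Pass 2: EMMS (eqs. (16)/(17)) with R_w = (2/n)(1 + k2)
    # strips the nonsignificant outliers; keeping the original x values
    # makes the a_TT transform immune to the gaps left by removals.
    while len(pts) > 2:
        x0, y0 = pts[0]
        aT = [y - y0 for _, y in pts]
        xT = [x - x0 for x, _ in pts]
        m = sum(aT) / sum(xT)
        aTT = [abs(a - m * t) for a, t in zip(aT, xT)]
        n, s, hi = len(aTT), sum(aTT), max(aTT)
        if hi == 0:
            break
        rw = (2 / n) * (1 + k2)
        if max(hi / s, hi / (hi * n - s)) <= rw:
            break
        idx = aTT.index(hi)
        if idx == 0:        # reference flagged: contradictory, terminate
            break
        removed.append(pts.pop(idx))
    return pts, removed

# A line y = 10x with one significant (500) and one nonsignificant (61)
# outlier: MMS removes the first, EMMS the second.
kept, out = detect_outliers(list(enumerate([0, 10, 20, 500, 40, 50, 60, 61, 80, 90])))
print(out)   # [(3, 500), (7, 61)]
```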
2.8 Challenge 7: Determining the Outlier Detection Criteria (R_w). The value R_w is the factor that determines the outliers: w = 0 (R_0 = 2/n) represents exactly a line, and w > 0 represents a linear border with a certain width. In this section we propose several possible methods that can be used to determine the outlier detection criteria.
2.8.1 Express the Value w as f(1/n). If the value w is f(1/n), then w = 2k/n, k ≤ (n/2) − 1, and k ∈ R+. Then R_w = 2/n + 2k/n:

R_w = (2/n)(1 + k), R_w / R_0 = 1 + k (= constant). (19)
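A minimal sketch of (19) (the function name is ours) shows that the criterion scales with the series length while the ratio R_w / R_0 stays at the constant 1 + k:

```python
def rw(n, k):
    """Eq. (19): outlier detection criterion R_w = (2/n)(1 + k)."""
    return (2 / n) * (1 + k)

# R_w shrinks with n, but R_w / R_0 = 1 + k for any series length,
# so a single k covers windows of different sizes.
print(rw(5, 0.5))                    # criterion for n = 5, k = 0.5
print(rw(5, 0.5) / (2 / 5), rw(1000, 0.5) / (2 / 1000))
```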
When MMS or EMMS is greater than the R_w of (19), this implies the existence of outliers. Because R_w / R_0 is constant and gives a standard for R_w, the determination of k still depends on knowledge of the domain. Figure 6 shows an algorithm based on this technique.
2.8.2 When the First and the Last Items Are Nonoutliers. In the total process, the "Bad Detection Level" is the most important criterion. If the R_w of MMS is less than the "Bad Detection Level", it is possible to identify nonoutliers as outliers, as mentioned in Section 2.7. If there is preknowledge about the outliers, it is possible to use a safe value for MMS. Otherwise, there is no 100% guarantee on the "Bad Detection Level".
Table 5: Different environments used to validate the new method.

Type of the original data set        Number of elements   Type of outliers         Reference (first) element is an outlier   Initial missing values
Increment, constant, and decrement   10 to 1000           Non-Gaussian, Gaussian   Yes, no                                   Yes, no
Figure 7: Outlier detection method including the "Bad Detection Level" detection technique. The window is chosen such that the first and the last data points are nonoutliers; MMS is applied with R_w = (2/n)(1 + k_1) and EMMS with R_w = (2/n)(1 + k_2), each detected outlier being removed (n = n − 1) and the series recalculated. If the first or the last element is identified as an outlier, a contradictory situation arises; this point can therefore be considered the terminating point of MMS and EMMS.
However, when the first and the last elements are not outliers, the "Bad Detection Level" can be detected automatically. If the first or the last element is identified as an outlier, a contradictory situation arises. Thus, this point can be considered the terminating point of MMS and EMMS. The decision diagram elaborated in Figure 7 expresses the new outlier detection method including the "Bad Detection Level" detection technique.
2.9 Validate the Method. We implemented MMS (with recalculation after an outlier is removed) and EMMS in C++ and conducted the validation process. For the recalculation process, the existing first element of the window was the reference element, and we always used the original value of the element (not the current updated value). To validate the method, we used artificial data sets of different sizes (10 to 1000 elements) representing increasing, decreasing, and constant lines. Then 50% of the items of those data sets were replaced with very small and very large outliers (±10e−2 to ±10e+2 times the correct value). We checked the data sets for all the environment combinations shown in Table 5. The outlier detection criteria were determined based on (19). For all data sets, the same k values were used (for MMS, k = 0.5, and for EMMS, k = 0.01). Then the percentage of correctly and falsely detected nonoutliers in relation to the number of actual nonoutliers, and the percentage of correctly and falsely detected outliers from the total number of outliers (small and large outliers), were determined.
2.10 Evaluation Using Real Data. To check the best linear fitting identification capability, the algorithm was tested using several real data sets which were automatically recorded with a frequency of twelve data points per day (i.e., every other hour) from a biogas plant over a period of seven months. Among the different parameters, we selected the H2 content, measured in ppm, which we expected to maintain linear behaviour during stable operation. We selected seven segments of different sizes for evaluating the algorithm; in some data sets there were initial missing elements. We set the R_w for MMS and EMMS by analysing the first and the third data sets. For the recalculation process, the existing first element of the window was the reference element, and we always used the original value of the elements (not the current updated value). Then the percentage of correctly and falsely detected nonoutliers in relation to the total number of nonoutliers, and the percentage of correctly and falsely detected outliers from the total number of outliers (small and large outliers), were determined.
We decided to use the LSM, the sigma filter, and Grubbs's test [26–29], also known as the maximum normed residual test or "extreme studentized deviate" (ESD) test, to compare our results. We selected Grubbs's test since it has nearly the same formulation as our method. We checked all the biogas data using the above-mentioned methods, using each of the data segments as a single window. First, we checked the ability of each method to identify the general trend of the series. Then we checked the number of correctly and falsely detected outliers and nonoutliers for each method in relation to the general trend.
3 Results and Discussion
Results related to the validation show that when the reference element (the first element) was not an outlier, the algorithm was capable of identifying all outliers with 0% error, regardless of the type of outliers (Gaussian or non-Gaussian) (Figure 8). If the outliers were Gaussian, there were no significant outliers,
Valu
e4000
3500
3000
2500
2000
1500
1000
500
0
Index0 42 6 8
10 50 60 80 905005202 705
4100
3000
(a) Data type increment outlier type non-Gaussian
Valu
e
1000
01000 900
7007994
450500 300
Index0 42 6 8
minus1000
minus2000
minus3000
minus4000
minus5000minus6000
minus6000
minus2200
minus7000
minus8000
minus9000
minus10000minus10000
(b) Data type decrement outlier type non-Gaussian
Valu
e
0100 100 100 100 100
1000010000
9000
8000
7000
60006000
5000
4000
3000
2000 2200
35010011000
Index0 42 6 8
(c) Data type constant outlier type non-Gaussian
Valu
e
minus500minus1000minus1500minus2000minus2500minus3000minus3500minus4000
minus4100
40003500300025002000150010005000
10 1995005
725
4000
Index0 42 6 8
50 60 80 90
(d) Data type increment outlier type Gaussian
Valu
e
1000 900 700 400 300
10000100009000
80007000600050004000300020001000
0
Index0 42 6 8
7995 650minus1000minus2000
minus2200minus3000minus4000minus5000minus6000
minus6000
(e) Data type decrement outlier type Gaussian
100100100100100
10000
999 350
Valu
e
10000900080007000600050004000300020001000
0
Index0 42 6 8
minus1000minus2000minus3000minus4000minus5000minus6000
minus6000
minus2200
(f) Data type constant outlier type Gaussian
Figure 8: Outlier detection from data sets with ten elements. The first element is the reference element, which is not an outlier. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is not an outlier, the new method is capable of locating all outliers. When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).
and MMS automatically became inactive (Figures 8(d), 8(e), and 8(f)). When the first few elements were outliers and the outliers were non-Gaussian, MMS detected the significant outliers correctly (Figures 9(a), 9(b), and 9(c)). However, EMMS was unable to locate the nonsignificant outliers when the first element for EMMS was an outlier (Figures 9(a) and 9(c)). If the reference element for EMMS was not an outlier, it was still possible to achieve correct results (Figure 9(b)). Though it was impossible to locate all nonoutliers, the detected nonoutliers were 100% correct detections. These values can be used to estimate the other values using methods like LSM, since all the existing data are then cleaned. In
[Figure 9: panels (a)–(f); data types increment, decrement, and constant; outlier types non-Gaussian ((a)–(c)) and Gaussian ((d)–(f)); axis values omitted]
Figure 9: Outlier detection from data sets with ten elements. The first element is the reference element, which is an outlier. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, green squares to nonoutliers, and black arrows to wrong detections. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is an outlier and the outliers are non-Gaussian, the new method identifies only the significant outliers ((a), (b), (c)). When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).
general, it is fair to state that (1) when the reference element is not an outlier, the method is capable of identifying all outliers, and (2) when the first few elements of the series are outliers and the outliers are non-Gaussian, the method is capable of identifying only the significant outliers and part of the correct elements.
When the first few elements (the reference elements for both MMS and EMMS) were outliers and the outlier distribution was Gaussian, outlier detection was poor (Figures 9(d), 9(e), and 9(f)). Due to the Gaussian distribution of outliers, MMS was inactive and it was not possible to identify the large outliers. Most importantly, the results highlighted
Figure 10: Two artificial data samples with 1000 elements each, including missing-value regions of 50, 100, 100, and 50 elements (300 in total). The first element is the reference element, which is not an outlier. Panel (a) corresponds to a data set with non-Gaussian outliers and panel (b) to a data set with nearly Gaussian outliers. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. The new method was able to identify all the elements related to the line with 0% error.
the importance of the reference element. If the reference element for MMS and EMMS was not an outlier, it guaranteed good results regardless of other factors.
In the methodology, we derived the method based on the first element. However, it is also possible to use any other element as the reference point and modify the method accordingly. We considered the simplest situation, where the first element is not an outlier. Therefore, if it is possible to segment the data excluding extreme outliers at the beginning, the method provides accurate outlier detection. Another possibility is to replace the first element with an already known element. This leads to another way of applying the method: if we know only a single correct element, using that element as the reference element, with the method modified accordingly, can yield very accurate results.
Some model-based approaches demand a training data set for correct output. In contrast, this method requires only one correct element to produce a correct output. In addition, it is possible to use multiple reference points and consider the best fit: for example, (a) consider each point in the first x% (e.g., 10%) of data points as a reference point, or (b) consider every data point as a reference point. Furthermore, it is important to distinguish the purposes of MMS and EMMS: MMS removes only the significant outliers, while EMMS removes nonsignificant outliers. Depending on the requirement, MMS and/or EMMS can be used to remove outliers.
The results show that the new method is a good solution for managing missing values. Figure 10 shows two data sets with 1000 elements each. Each data set contains missing-value regions of 50, 100, 100, and 50 elements (300 in total). When the first element was not an outlier, the new method was able to identify all the elements related to the line with 0% error.
In the real world, it is not possible to find nonoutliers that exactly agree with a linear regression. Therefore, 100% accuracy is unattainable. However, it is very important to have a significant-outlier-free data set. The new method guaranteed a significant-outlier-free data set when the outliers were non-Gaussian. Furthermore, in real-world situations, data outliers are not always Gaussian distributed. Because of that, we expect the new method to be applicable to the majority of outlier detection applications. Our new method is an effective solution here, as the most common approaches, LSM and the sigma filter, need Gaussian outliers. Some methods, like the sigma filter, cannot be applied directly to a certain data segment, and further segmentation (windowing) is required for better results. In contrast, the new method is capable of locating nonoutliers automatically in increment, decrement, or constant form, regardless of the size of the window.
Results related to biogas data supported the abovementioned idea and showed that the algorithm clearly identifies three regions within a data segment: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers (Figure 11). In addition, the results showed that the nonoutliers follow a linear path. Furthermore, the width of the regions can be tuned by changing the relevant R_w values. Figure 11 shows some selected results of the biogas data for a k value of 0.2 for MMS and a k value of 0.1 for EMMS.
One interesting observation was the ability of the algorithm to continue linear detection even with noncontinuous clusters (Figures 11(b) and 11(e)). In all data segments there occurred no false detection (there were no outliers in nonoutlier regions and vice versa). Most importantly, the new method required no further windowing, and nonoutliers were detected independent of the window size.
When the general trend was constant and the elements were Gaussian distributed, the sigma filter and LSM were able to identify the linear trend. However, for series with biased elements, both methods failed to identify the general trend. When the general trend was increment or decrement, the sigma filter failed to identify the general trend (a further
[Figure 11: panels (a)–(f) showing H2 content of biogas (ppm) versus index for data segments of 334, 372, 350, 352, 356, and 372 elements, respectively; axis values omitted]
Figure 11: Results for real biogas data with different sizes of data sets. The first element is the reference element, which is assumed not to be an outlier. The results showed that the algorithm clearly identifies three regions within each data segment: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers. Most importantly, all the nonoutliers lay within a linear border. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.2 and 0.1, respectively.
segment would give a better result, but we used the whole window). The new method was capable of locating 4% to 45% of elements as outliers with 0% error. Grubbs' test was capable of identifying only a very small proportion of elements as outliers (0–1.7%), even at a significance level of 0.05. However, all of the outliers it found were significant, and no wrong detections were reported.
4. Conclusions and Outlook
This paper introduced a new outlier detection method using the relation of the sum of the elements of an arithmetic progression. The results of this work show that the new method is a robust solution for outlier detection in a data set with missing elements. The method is capable of identifying
both significant and nonsignificant outliers when the first value of the data set is not an outlier. Most importantly, the method is a solution for identifying significant outliers in a series whose outliers are non-Gaussian distributed. In addition, the outlier detection is nonparametric, has floor and ceiling values, and does not require standardization. When the reference elements are unknown, the method can be used with multiple reference elements to obtain optimal output.
If the frequency of the data is sufficient, any nonlinear relation can be represented as a combination of straight lines. Therefore, by using a suitable segmentation technique, it is possible to identify outliers in any data series. This will allow detecting outliers in a process-oriented data set. Hence, to bring a data series into a form that is suitable for our method, an intelligent segmentation technique is necessary.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The University of Ruhuna, Sri Lanka, provided the paper processing charges of this paper. The German Academic Exchange Service (German: Deutscher Akademischer Austauschdienst) financed this work.
References
[1] R. J. Beckman and R. D. Cook, "Outlier..........s," Technometrics, vol. 25, no. 2, pp. 119–149, 1983.

[2] F. Molinari, "Missing treatments," Journal of Business and Economic Statistics, vol. 28, no. 1, pp. 82–95, 2010.

[3] J. Qui, B. Zhang, and D. H. Y. Leung, "Empirical likelihood in missing data problems," Journal of the American Statistical Association, vol. 104, no. 488, pp. 1492–1503, 2009.

[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.

[5] Z. X. Niu, S. Shi, J. Sun, and X. He, "A survey of outlier detection methodologies and their applications," in Artificial Intelligence and Computational Intelligence, vol. 7002 of Lecture Notes in Computer Science, pp. 380–387, 2011.

[6] W. Jin, A. K. H. Tung, and J. Han, "Mining top-n local outliers in large databases," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 293–298, ACM, San Francisco, Calif, USA, August 2001.

[7] H.-P. Kriegel, M. S. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 444–452, ACM, Las Vegas, Nev, USA, August 2008.

[8] I. Ben-Gal, "Outlier detection," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 131–146, Springer, New York, NY, USA, 2005.

[9] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, December 2002.

[10] H. Fan, O. R. Zaïane, A. Foss, and J. Wu, "A nonparametric outlier detection for effectively discovering top-N outliers from engineering data," in Advances in Knowledge Discovery and Data Mining, W.-K. Ng, M. Kitsuregawa, J. Li, and K. Chang, Eds., vol. 3918 of Lecture Notes in Computer Science, pp. 557–566, Springer, Berlin, Germany, 2006.

[11] J.-S. Lee, "Digital image smoothing and the sigma filter," Computer Vision, Graphics, and Image Processing, vol. 24, no. 2, pp. 255–269, 1983.

[12] A. Gelb, Applied Optimal Estimation, MIT Press, 1974.

[13] D. Sierociuk, I. Tejado, and B. M. Vinagre, "Improved fractional Kalman filter and its application to estimation over lossy networks," Signal Processing, vol. 91, no. 3, pp. 542–552, 2011.

[14] P. H. Abreu, J. Xavier, D. C. Silva, L. P. Reis, and M. Petry, "Using Kalman filters to reduce noise from RFID location system," The Scientific World Journal, vol. 2014, Article ID 796279, 9 pages, 2014.

[15] H. Liu, S. Shah, and W. Jiang, "On-line outlier detection and data cleaning," Computers and Chemical Engineering, vol. 28, no. 9, pp. 1635–1647, 2004.

[16] T. D. Pigott, "A review of methods for missing data," Educational Research and Evaluation, vol. 7, no. 4, pp. 353–383, 2001.

[17] M. Nakai and W. Ke, "Review of the methods for handling missing data in longitudinal data analysis," International Journal of Mathematical Analysis, vol. 5, no. 1–4, pp. 1–13, 2011.

[18] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.

[19] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, L. House, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Germany, 2004.

[20] A. N. Baraldi and C. K. Enders, "An introduction to modern missing data analyses," Journal of School Psychology, vol. 48, no. 1, pp. 5–37, 2010.

[21] J. Tian, B. Yu, D. Yu, and S. Ma, "Clustering-based multiple imputation via gray relational analysis for missing data and its application to aerospace field," The Scientific World Journal, vol. 2013, Article ID 720392, 10 pages, 2013.

[22] S. Zhaowei, Z. Lingfeng, M. Shangjun, F. Bin, and Z. Taiping, "Incomplete time series prediction using max-margin classification of data with absent features," Mathematical Problems in Engineering, vol. 2010, Article ID 513810, 14 pages, 2010.

[23] J.-F. Cai, Z. Shen, and G.-B. Ye, "Approximation of frame based missing data recovery," Applied and Computational Harmonic Analysis, vol. 31, no. 2, pp. 185–204, 2011.

[24] B. L. Wiens and G. K. Rosenkranz, "Missing data in noninferiority trials," Statistics in Biopharmaceutical Research, vol. 5, no. 4, pp. 383–393, 2013.

[25] Aryabhata, The Aryabhatiya of Aryabhata: An Ancient Indian Work on Mathematics and Astronomy, vol. 1, Kessinger Publishing, 2006.

[26] F. E. Grubbs, "Procedures for detecting outlying observations in samples," Technometrics, vol. 11, no. 1, pp. 1–21, 1969.

[27] B. Rosner, "On the detection of many outliers," Technometrics, vol. 17, no. 2, pp. 221–227, 1975.

[28] R. B. Jain, "Percentage points of many-outlier detection procedures," Technometrics, vol. 23, no. 1, pp. 71–75, 1981.

[29] S. P. Verma, L. Díaz-González, M. Rosales-Rivera, and A. Quiroz-Ruiz, "Comparative performance of four single extreme outlier discordancy tests from Monte Carlo simulations," The Scientific World Journal, vol. 2014, Article ID 746451, 27 pages, 2014.
neighbours, model-based methods, maximum likelihood methods, and multiple imputation are the most common methods [18–22]. Filling methods derive the filling value from the same or other known existing data. If there is a considerable number of outliers, the derived data may be biased due to the influence of outliers [23, 24]. Therefore, the best way is to remove all outliers and replace them with a suitable method.
In this paper, we introduce a new nonparametric outlier detection method based on the sum of an arithmetic progression, which uses the indicator 2/n, where n is the number of terms in the series. The properties used in existing nonparametric methods, such as distance, density, depth, cluster, angle, and resolution, are domain dependent. In contrast, the value 2/n, which we use in our new method, is independent of the domain conditions.
Contrary to the existing nonparametric methods mentioned earlier, this work addresses identifying outliers in a data set that is expected to have a linear relation. The method is capable of identifying significant and nonsignificant outliers separately. Moreover, until all the outliers are removed, the new method requires no missing or removed data imputation. This eliminates the negative influence of wrongly filled data points, which is an advantage over methods that require filling the removed data points. The outlier detection method we introduce shows its best performance when the significant outliers are non-Gaussian distributed. This is an advantage over existing methods such as LSM and the sigma filter. The method uses a single data point as a reference data point, which is assumed to be a nonoutlier. Therefore, the accuracy of the outcome depends on the reference point, especially when locating nonsignificant outliers. If the selected reference point was not an outlier, the method was capable of locating outliers from a data set containing a very high rate of outliers, such as 50%.
In this work, data from biogas plants were used for evaluating the new method. Since the biogas process is very sensitive, these data contain a considerable amount of noise even during apparently stable conditions. This provides a suitable data set for evaluating our method. We were able to obtain the best outlier-free macroscale data set, which agrees with a linear (increasing, decreasing, or constant) regression, from selected segments of a data set.
2. Methodology
2.1. Arithmetic Progression. An arithmetic progression (AP), or arithmetic sequence, is a sequence of numbers (ascending, descending, or constant) such that the difference between successive terms is constant [25]. The nth term of a finite AP with n elements is given by

    a_n = d(n − 1) + a_1, (1)

where d is the common difference of successive members and a_1 is the first element of the series. The sum of the elements of a finite AP with n elements is given by

    S_n = (n/2)(a_1 + a_n), (2)

where a_1 is the first element and a_n is the last element of the series.

Table 1: Sample calculations illustrating the relation between 2/n and (a_1 + a_n)/S_n.

                   Data set 1   Data set 2   Data set 3   Data set 4   Data set 5
a_1                100          100          100          99.99        1
a_2                101          101          101          101          101
a_3                102          102          102          102          102
a_4                103          103          103          103          103
a_5                104          104.01       204          104          104
(a_1 + a_5)/S_5    0.4          0.40001      0.498        0.399        0.255
2/n                0.4          0.4          0.4          0.4          0.4
Outlier            —            Yes (a_5)    Yes (a_5)    Yes (a_1)    Yes (a_1)
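As a quick numerical check of (1) and (2), the sketch below (our illustration; the function names are ours) generates a finite AP and confirms that the closed-form sum matches direct summation:

```python
def ap_term(a1, d, n):
    """nth term of an AP: a_n = d*(n - 1) + a1 (equation (1))."""
    return d * (n - 1) + a1

def ap_sum(a1, an, n):
    """Sum of a finite AP: S_n = (n / 2) * (a1 + an) (equation (2))."""
    return (n / 2) * (a1 + an)

a1, d, n = 100, 1, 5
terms = [ap_term(a1, d, k) for k in range(1, n + 1)]   # [100, 101, 102, 103, 104]
assert sum(terms) == ap_sum(a1, terms[-1], n) == 510.0
```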
Equation (1) is an f(n) and fulfils the requirements of a line; in other words, a finite AP is a straight line. In addition, a straight line is a series without outliers. If there are outliers, the series is not a finite AP. Therefore, any arithmetic series that fulfils the requirements of an AP can be considered a series without outliers. Equation (2) can be represented as

    2/n = (a_1 + a_n)/S_n,  ∞ > n ≥ 2,  0 < 2/n ≤ 1. (3)

For any AP, the right-hand side (RHS) of (3) is always 2/n, which is independent of the terms of the series. In other words, if there are no outliers, the value (a_1 + a_n)/S_n will always be equal to 2/n. If the RHS of (3) is not 2/n, it always implies that the series contains outliers. Therefore, the value 2/n can be used as a global indicator to identify any AP with outliers.
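The 2/n indicator of (3) can be sketched in a few lines of code (our illustration, with hypothetical names), reproduced here against data sets 1 and 3 of Table 1:

```python
def ratio(series):
    """(a_1 + a_n) / S_n for a series; equals 2/n when the series is an AP."""
    return (series[0] + series[-1]) / sum(series)

clean = [100, 101, 102, 103, 104]        # data set 1: an AP, no outliers
dirty = [100, 101, 102, 103, 204]        # data set 3: the maximum is an outlier

n = len(clean)
assert ratio(clean) == 2 / n             # 0.4: no outlier indicated
assert ratio(dirty) > 2 / n              # ratio ≈ 0.498 > 0.4: maximum flagged
```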
Since we use the relation of an AP, we define that elements lying on or between two lines (a linear border) are nonoutliers and the others are outliers. When the distance between the two lines is zero, they represent a single line. In relation to the method presented in this paper, the term nonoutlier implies an element that lies within a certain linear border, and the term outlier implies an element that does not lie within the linear border.
Primary investigations showed that the method is capable of not only indicating the existence of outliers but also locating the outlier: (a_1 + a_n)/S_n > 2/n indicates that the maximum element is the outlier, and (a_1 + a_n)/S_n < 2/n indicates that the minimum element is the outlier. However, (a_1 + a_n)/S_n = 2/n does not imply that the series is free of outliers. Furthermore, primary investigations showed that the method is capable of locating both large and small outliers. Table 1 shows sample calculations illustrating the relation between 2/n and (a_1 + a_n)/S_n.

As a principle, the relation of (3) is capable of identifying and locating the outliers. However, we found seven drawbacks which made relation (3) unusable for identifying outliers in actual data. In Sections 2.2 to 2.8 we address the challenges of making the relation usable.
2.2. Challenge 1: Notation of the Equation. The symbols used in (3), especially a_1 and a_n, create a logical barrier. For example, if there are outliers, the minimum and the maximum can be elements other than a_1 and a_n. Therefore, it is necessary
[Figure 1 diagram: the range (0, 1] with 2/n marked "no outliers" and outlier regions on either side]

Figure 1: Distribution of the criteria range (0, 1].
to use meaningful symbols that reflect the purpose of the method. The first and the last elements are either the minimum or the maximum. Therefore, it is possible to replace a_1 and a_n by the minimum (a_min) and the maximum (a_max) of the series. Then (3) can be represented as

    2/n = (a_min + a_max)/S_n. (4)

Since the RHS of (4) consists of the minimum, the maximum, and the sum of the series, the RHS was named MMS, with the meaning of minimum, maximum, and sum:

    MMS = (a_min + a_max)/S_n. (5)
2.3. Challenge 2: Set a Range for the Outlier Detection Criterion. According to (3), the outlier detection criterion is 2/n and can be used to check the elements that exactly agree with a line (Figure 1). To identify elements in a certain range, it is necessary to have a criteria range rather than the single value 2/n.

The left-hand side of (4) is the ratio 2/n and is named R_w by adding a weight "w" to "R". Then

    R_w = 2/n + w,  0 ≤ w ≤ 1 − 2/n. (6)

The status w = 0 (R_0) represents a single line, and w > 0 represents a line with a certain width (linear border). The outlier criteria range has both a floor (0) and a ceiling (1), and standardization is not required. This is an additional advantage over the most common average-, variance-, and standard deviation-based approaches, which require a separate standardization process.
2.4. Challenge 3: Influence of Negative Values. Due to negative values, the numerator or both the numerator and the denominator of the RHS of (5) can be 0 (e.g., −4, −1, 0, 1, 4) even without outliers. When there are outliers, the RHS of (5) can be negative, which cannot be accepted as a valid value for 2/n; 0 < 2/n ≤ 1 must always hold.

Subtracting the minimum element from each element of any AP (a_i,new = a_i − a_min) creates a new transformed AP with a_min = 0 and guarantees a series without negative values. From (5) and a_i,new = a_i − a_min, (7) is derived, which is more robust. Another advantage of (7) is that it performs the transformation automatically:

    MMS = ((a_min − a_min) + (a_max − a_min)) / Σ_{i=1}^{n} (a_i − a_min),

    MMS = (a_max − a_min) / (S_n − a_min · n). (7)
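The min-shifted form (7) can be sketched as below (our code, not the paper's). The shift a_i − a_min is folded into the closed form (a_max − a_min)/(S_n − a_min·n), so no explicitly transformed series is needed:

```python
def mms_max(series):
    """MMSmax = (a_max - a_min) / (S_n - a_min * n), equation (7).

    Assumes a nonconstant series, so the denominator is nonzero."""
    n = len(series)
    lo, hi, s = min(series), max(series), sum(series)
    return (hi - lo) / (s - lo * n)

# Data set 2 of Table 2: the maximum (104.01) deviates from the AP.
value = mms_max([100, 101, 102, 103, 104.01])
assert round(value, 3) == 0.401 and value > 2 / 5   # exceeds 2/n = 0.4
```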
2.5. Challenge 4: Uneven Distribution of the Criteria Range. The ranges (0, 2/n) and (2/n, 1] identify outliers which are minimums and maximums, respectively (Figure 1). When n → ∞, R_0 → 0; then R_w ∈ (0, 1] is not equally distributed, which provides a large range for maximum outliers and a small range for minimum outliers. This is a problem when locating minimum outliers.

To solve this, we used the idea of the complement. For any series, this converts the maximum value into the minimum, the minimum value into the maximum, and intermediate values into their complements. Most importantly, the minimum value now represents the maximum value of the original series and vice versa, while still representing the original series. The complement of an element in a series can be defined as a_i,c = (a_max + a_min) − a_i. From (5) and a_i,c = (a_max + a_min) − a_i, this gives

    MMS = ((a_max + a_min − a_max) + (a_max + a_min − a_min)) / Σ_{i=1}^{n} (a_max + a_min − a_i),

    MMS = (a_min + a_max) / Σ_{i=1}^{n} (a_max + a_min − a_i). (8)

Applying a_i,new = a_i − a_min (to remove the effect of negative values):

    MMS = ((a_min − a_min) + (a_max − a_min)) / Σ_{i=1}^{n} ((a_max − a_min) + (a_min − a_min) − (a_i − a_min)),

    MMS = (a_max − a_min) / Σ_{i=1}^{n} (a_max − a_i),

    MMS = (a_max − a_min) / (a_max · n − S_n). (9)

Consequently, the range R_0 > 2/n represents the range for minimum outliers related to the original series and vice versa (Figure 2), and it is possible to ignore the range (0, 2/n). In addition, (9) performs the transformation automatically.

Now there are two equations for MMS, (7) and (9), to check whether the maximum or the minimum of the series is an outlier. We named the two versions of MMS MMSmax (10) and MMSmin (11):

    MMSmax = (a_max − a_min) / (S_n − a_min · n), (10)

    MMSmin = (a_max − a_min) / (a_max · n − S_n). (11)
Table 2: Sample calculations illustrating the relation between 2/n and MMSmax and MMSmin.

              Data set 1   Data set 2   Data set 3   Data set 4   Data set 5
a_1           100          100          100          99.99        1
a_2           101          101          101          101          101
a_3           102          102          102          102          102
a_4           103          103          103          103          103
a_5           104          104.01       204          104          104
MMSmax        0.4          0.401        0.945        0.399        0.254
MMSmin        0.4          0.399        0.254        0.401        0.945
2/n           0.4          0.4          0.4          0.4          0.4
Outlier       —            Yes (max)    Yes (max)    Yes (min)    Yes (min)
[Figure 2 diagram: R_w ranges for the original series and for the complement series, with R_0 = 2/n separating the minimum outlier and maximum outlier ranges; in the complement series, maximum outliers represent the minimum outliers of the original series]

Figure 2: Range of R_w for the original series and for the complement of the original series.
The following scheme gives an overview of the MMS process:

    MMSmax = (a_max − a_min) / (S_n − a_min · n):
        MMSmax > (2/n) + w1  →  the maximum is an outlier,
        MMSmax ≤ (2/n) + w1  →  the maximum is not an outlier;
    MMSmin = (a_max − a_min) / (a_max · n − S_n):
        MMSmin > (2/n) + w1  →  the minimum is an outlier,
        MMSmin ≤ (2/n) + w1  →  the minimum is not an outlier. (12)

Table 2 shows sample calculations using (10) and (11) for the same data sets as in Table 1.
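The decision scheme of (12) can be sketched end to end as follows. This is our illustrative code; w1 is the width parameter of R_w, and a nonconstant series is assumed so that both denominators are nonzero:

```python
def mms_flags(series, w1=0.0):
    """Apply (12): report whether the maximum and/or the minimum is an outlier."""
    n = len(series)
    lo, hi, s = min(series), max(series), sum(series)
    threshold = 2 / n + w1
    value_max = (hi - lo) / (s - lo * n)      # MMSmax, equation (10)
    value_min = (hi - lo) / (hi * n - s)      # MMSmin, equation (11)
    return value_max > threshold, value_min > threshold

# Data set 3 of Table 2: MMSmax ≈ 0.945 and MMSmin ≈ 0.254 against 2/n = 0.4.
max_is_outlier, min_is_outlier = mms_flags([100, 101, 102, 103, 204])
assert max_is_outlier and not min_is_outlier
```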
2.6. Challenge 5: How to Deal with Removed Outliers/Missing Values. In a series there can be initial missing values. In addition, if there is no replacement after removing an outlier, this also creates a missing-value environment. Without filling, the elements after the removed element are transformed into other values and the original relationship of the elements is destroyed (Figure 3). These transformed values become outliers in relation to the original data. Therefore, to use the relation of the AP, it is compulsory to maintain the original relation of the data even after removing an outlier. Thus, any rejection technique is not feasible. One possible way to maintain the original relation is replacing the missing value. However, the data we are considering contain a considerable number of outliers. Therefore, we cannot guarantee that an element derived from existing elements is not an outlier.
To overcome this problem, we considered two different options: (1) recalculate only the data points after (or before) the removed or missing element, thereby maintaining the initial angle in relation to a certain point, or (2) transform the elements into a new series in which the missing value has no effect.
2.6.1. Recalculate the Data Points after (or before) Removed and Missing Elements. If there is a missing element, the subsequent elements are shifted horizontally and transformed into wrong values in relation to their current indices (Figure 3). Angular shifting, however, does not introduce such an error (Figure 3).

In Figure 4 the plot consists of elements a_0 to a_{r+1} (r ∈ R+), and element a_r at index r needs to be removed. After removing element r, element r+1 becomes element r, element r+2 becomes element r+1, and so on. However, by shifting while maintaining the same angle with respect to a certain reference element (e.g., the first element), the same form of the series can be maintained. Equation (13) gives the new value after angular shifting. We used this technique with the MMS algorithm to recalculate the series after (or before) missing values or removed elements.
B_r T_r = (B_{r+1} C_{r+1} / A B_{r+1}) · A B_r = ((a_{r+1} − a_0) / (r + 1)) · r,
(a_{r+1})_new = a_0 + B_r T_r.  (13)
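Equation (13) amounts to rescaling each element that follows the removed index so that its angle from the reference element a_0 is preserved. A minimal sketch, assuming the first element is the reference and r ≥ 1 (the function name is ours):

```python
def angular_shift(series, r):
    """Remove the element at index r (r >= 1); every later element a_j moves
    to index j-1 and is rescaled so that it keeps its original angle with
    respect to the reference element a_0, as in (13)."""
    a0 = series[0]
    out = list(series[:r])                       # elements before r are untouched
    for j in range(r + 1, len(series)):
        out.append(a0 + (series[j] - a0) * (j - 1) / j)
    return out

# Removing one point from an exact line leaves an exact (shorter) line:
print(angular_shift([0, 2, 4, 6, 8, 10], 2))     # [0, 2, 4.0, 6.0, 8.0]
```

Because the angle to the reference is preserved, the shortened series remains an arithmetic progression whenever the original one was, which is exactly what the MMS test requires.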
2.6.2. Transformation of Data to a Constant Value. A series with a constant value (of the form y = c, where c is a constant) is unaffected by missing values. Hence, if any linear series can be transformed into the form y = c, the transformed series is free of any effect of missing values, and it can then be used for outlier detection.
Let y^T be a linear series where y^T_k = y_k − y_1 and x^T_k = x_k − x_1, x_k is the initial index of the elements, and y_k is the kth element of the series, k = 1, 2, …, n. The gradient of the line (m) is given by Σ_{k=1}^{n} y^T_k / Σ_{k=1}^{n} x^T_k. If one element (e.g., the first element) is (0, 0), this relation is always true, even with missing values; the element (0, 0) can be considered the reference element. Thus y^T is a series with first element (0, 0) and a gradient m that can be calculated even with missing values. It is also possible to derive a new series y′ where y′_k = x^T_k · m. If there are no outliers, y^T and y′ coincide and y^T − y′ = 0. If y^TT = y^T − y′, then y^TT is of the form y = c without any influence from missing values. Therefore this is another method to overcome missing values without replacing them (Figure 5).
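The transformation just described can be sketched directly. Here points are (index, value) pairs with the first pair as the reference element, and gaps in the indices are simply absent; the function name is ours:

```python
def to_constant(points):
    """Transform a linear series into y^TT = y^T - y', which is identically
    zero when there are no outliers, even when index gaps (missing values)
    are present."""
    x1, y1 = points[0]                            # reference element -> (0, 0)
    xt = [x - x1 for x, _ in points]
    yt = [y - y1 for _, y in points]
    m = sum(yt) / sum(xt)                         # gradient from the sums
    return [ytk - m * xtk for xtk, ytk in zip(xt, yt)]

# Line y = 3x + 1 with indices 3 and 4 missing: y^TT is still all zeros.
print(to_constant([(0, 1), (1, 4), (2, 7), (5, 16), (6, 19)]))
```

Any outlier shows up as a nonzero entry in the result, which is what EMMS below exploits.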
2.7. Challenge 6: Locate Outliers That Are Neither the Maximum Nor the Minimum of the Series. When the outlier is
The Scientific World Journal 5
Figure 3: (a) Data set with an outlier at index r. (b) Value autotransformation effect after removing the outlier at index r without replacement, where circle corresponds to elements after removing the outlier, red triangle corresponds to expected (correct) elements, and square corresponds to initial values of the shifted elements after removing the outlier.
Table 3: "Bad Detection": the wrong (minimum) element identified as the outlier.

a_n (data set 6): a_1 = 100, a_2 = 101, a_3 = 102, a_4 = 103.6, a_5 = 104
MMS_max = 0.377; MMS_min = 0.425; 2/n = 0.4; Outlier: Yes-Min
neither the maximum nor the minimum, MMS is unable to locate the outlier (Table 3). We named this phenomenon "Bad Detection". When R_w reaches the "Bad Detection Level", MMS cannot be applied. To overcome this situation we introduced an improved version of MMS, the enhanced MMS (EMMS), based on the missing-data handling technique in Section 2.6.2.
EMMS is expressed as

EMMS_max = (a^TT_max − a^TT_min) / (S^TT_n − a^TT_min · n),  a^TT_max ≠ 0,  (14)

EMMS_min = (a^TT_max − a^TT_min) / (a^TT_max · n − S^TT_n),  a^TT_max ≠ 0,  (15)
where a^TT_k = |a^T_k − x_k (G_{a^T} / G_x)|, a^T_k = a_k − a_0, x_k is the index of the data, a_k is the kth term of the series, k = 0, 1, …, n − 1, n is the number of elements in the current window, G_{a^T} = Σ_{k=0}^{n−1} a^T_k, G_x = Σ_{k=0}^{n−1} x_k, and S^TT_n = Σ_{k=0}^{n−1} a^TT_k.
Figure 4: Solution for the value autotransformation phenomenon: use angular shifting instead of a horizontal shift, where (×) corresponds to the horizontal shift and () corresponds to the angular shift.
The term a^TT_k is always ≥ 0; thus a^TT_min = 0. Then (14) and (15) simplify to

EMMS_max = a^TT_max / S^TT_n,  a^TT_max ≠ 0,  (16)

EMMS_min = a^TT_max / (a^TT_max · n − S^TT_n),  a^TT_max ≠ 0.  (17)
If there are outliers, EMMS_min > 2/n or EMMS_max > 2/n, and the greater value represents the outlier. Table 4 shows
Table 4: EMMS for identifying an outlier.

x_k   y (a_k)   y^T (a_k − a_0)   y^TT (|y^T − x_k(G_{y^T}/G_x)|)
0     100       0.000             0.00
1     101       1.000             0.06
2     102       2.000             0.12
3     103.6     3.600             0.42
4     104       4.000             0.24

G_x = 10; G_{y^T} = 10.6; EMMS(Max) = 0.500; EMMS(Min) = 0.333; 2/n = 0.4; Outlier: Yes-Max.
Figure 5: Transformation of data to a constant value to overcome the missing-data problem, where red diamond corresponds to the y = f(x) form, yellow square corresponds to the y^T = f(x) − f(x_0) form, cross corresponds to the y′ = m · x_i form (m = Σ_{i=0}^{n} y^T_i / Σ_{i=0}^{n} x^T_i), red circle corresponds to y^TT = y^T − y′, and circle corresponds to missing values.
an example calculation of EMMS, and the following equation shows an overview of the EMMS process:

EMMS_max = a^TT_max / S^TT_n: if EMMS_max > (2/n) + w_2, the maximum is the outlier; if EMMS_max ≤ (2/n) + w_2, the maximum is not an outlier.

EMMS_min = a^TT_max / (a^TT_max · n − S^TT_n): if EMMS_min > (2/n) + w_2, the minimum is the outlier; if EMMS_min ≤ (2/n) + w_2, the minimum is not an outlier.  (18)
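The calculation in Table 4 can be reproduced with a short sketch of (16)-(17), assuming consecutive indices 0…n−1 and the first element as reference; the function name is ours:

```python
def emms(a):
    """EMMS ratios (16) and (17) from the doubly transformed series a^TT;
    the element carrying a^TT_max is the outlier candidate."""
    n = len(a)
    x = list(range(n))                            # consecutive indices assumed
    at = [v - a[0] for v in a]                    # a^T_k = a_k - a_0
    m = sum(at) / sum(x)                          # G_aT / G_x
    att = [abs(at[k] - x[k] * m) for k in x]      # a^TT_k
    s_tt, att_max = sum(att), max(att)
    return att_max / s_tt, att_max / (att_max * n - s_tt), att.index(att_max)

emms_max, emms_min, k = emms([100, 101, 102, 103.6, 104])
print(round(emms_max, 3), round(emms_min, 3), k)  # 0.5 0.333 3, as in Table 4
```

Here EMMS_max = 0.5 exceeds 2/n = 0.4, so the element at index 3 (value 103.6) is correctly flagged, where plain MMS suffered "Bad Detection" (Table 3).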
However, EMMS uses information derived from the existing data; if there are biased values, this may lead to biased information. Because of that, direct application of EMMS is not good practice: significant outliers should first be removed using MMS before applying EMMS.
Figure 6: Implementation of MMS and EMMS. Initially the algorithm checks for significant outliers using MMS with R_w = (2/n) · (1 + k_1); each detected outlier is removed and n is decremented. After all significant outliers are removed, the nonsignificant outliers are removed using EMMS with R_w = (2/n) · (1 + k_2). There is no removed-data imputation in either MMS or EMMS.
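The flow in Figure 6 can be condensed into a sketch like the following. It is a simplification of the paper's procedure: removed elements are simply dropped while their original indices are retained (which the transformation of Section 2.6.2 tolerates), instead of performing the angular recalculation of Section 2.6.1; the names and the outlier-picking details are ours.

```python
def detect_outliers(series, k1=0.5, k2=0.01):
    """Figure 6 flow: MMS strips significant outliers first, then EMMS strips
    nonsignificant ones. Returns the original indices of detected outliers."""
    x, v, found = list(range(len(series))), list(series), []
    # Phase 1: MMS with Rw = (2/n)(1 + k1)
    while len(v) > 2:
        n, s = len(v), sum(v)
        lo, hi = min(v), max(v)
        if hi == lo:
            break
        rw = (2.0 / n) * (1 + k1)
        mmax = (hi - lo) / (s - lo * n)
        mmin = (hi - lo) / (hi * n - s)
        if mmax > rw and mmax >= mmin:
            i = v.index(hi)
        elif mmin > rw:
            i = v.index(lo)
        else:
            break
        found.append(x.pop(i)); v.pop(i)
    # Phase 2: EMMS with Rw = (2/n)(1 + k2), on the a^TT transform
    while len(v) > 2:
        n = len(v)
        rw = (2.0 / n) * (1 + k2)
        at = [vk - v[0] for vk in v]
        xt = [xk - x[0] for xk in x]              # original indices: gaps allowed
        m = sum(at) / sum(xt)
        att = [abs(at[j] - xt[j] * m) for j in range(n)]
        s, hi = sum(att), max(att)
        if hi == 0 or max(hi / s, hi / (hi * n - s)) <= rw:
            break
        i = att.index(hi)
        found.append(x.pop(i)); v.pop(i)
    return sorted(found)

# Index 6 is a significant outlier (MMS), index 3 a nonsignificant one (EMMS):
print(detect_outliers([100, 101, 102, 103.6, 104, 105, 1000]))  # [3, 6]
```

The defaults k1 = 0.5 and k2 = 0.01 mirror the values used in the validation below.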
2.8. Challenge 7: Determining the Outlier Detection Criterion (R_w). The value R_w is the factor that determines the outliers: w = 0 (R_0 = 2/n) represents exactly a line, and w > 0 represents a linear border of a certain width. In this section we propose several possible methods that can be used to determine the outlier detection criterion.

2.8.1. Express the Value w as f(1/n). If the value w is f(1/n), then w = 2k/n with k ≤ (n/2) − 1 and k ∈ R+. Then

R_w = 2/n + 2k/n,
R_w = (2/n) · (1 + k),
R_w / R_0 = 1 + k (= constant).  (19)
When the MMS or the EMMS value is greater than the R_w of (19), this implies the existence of outliers. R_w/R_0 is constant and gives a standard for R_w, but the determination of k still depends on knowledge of the domain. Figure 6 shows an algorithm based on this technique.
2.8.2. When the First and the Last Items Are Nonoutliers. In the total process, the "Bad Detection Level" is the most important criterion. If the R_w of MMS is less than the "Bad Detection Level", it is possible to identify nonoutliers as outliers, as mentioned in Section 2.7. If there is prior knowledge about the outliers, a safe value can be used for MMS; otherwise there is no 100% guarantee on the "Bad Detection Level".
Table 5: Different environments used to validate the new method.

Type of the original dataset: increment, constant, and decrement
Number of elements: 10 to 1000
Type of outliers: non-Gaussian, Gaussian
Reference (first) element is an outlier: yes / no
Initial missing values: yes / no
Figure 7: Outlier detection method including the "Bad Detection Level" detection technique. The first and the last data points of the window must be nonoutliers; if the first or the last element is identified as an outlier, a contradictory situation arises, so this point can be considered the terminating point of MMS and EMMS.
However, when the first and the last elements are not outliers, the "Bad Detection Level" can be detected automatically: if the first or the last element is identified as an outlier, a contradictory situation arises. Thus this point can be considered the terminating point of MMS and EMMS. The decision diagram in Figure 7 elaborates the new outlier detection method including the "Bad Detection Level" detection technique.
2.9. Validation of the Method. We implemented MMS (with recalculation after an outlier is removed) and EMMS in C++ and conducted the validation process. For the recalculation process, the existing first element of the window was the reference element, and we always used the original value of that element (not its current updated value). To validate the method we used artificial data sets of different sizes (10 to 1000 elements) representing increasing, decreasing, and constant lines. Then 50% of the items of those data sets were replaced with very small and very large outliers (±10e−2 to ±10e+2 times the correct value). We checked the data sets for all the environment combinations shown in Table 5. The outlier detection criteria were determined based on (19); for all data sets the same k values were used (k = 0.5 for MMS and k = 0.01 for EMMS). Then we determined the percentage of correctly and falsely detected nonoutliers in relation to the number of actual nonoutliers, and the percentage of correctly and falsely detected outliers from the total number of outliers (small and large outliers).
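Artificial validation data of this kind can be sketched as follows; the slope, intercept, outlier magnitudes, and seed are our illustrative choices within the ranges described above:

```python
import random

def make_test_series(n, kind="increment", outlier_share=0.5, seed=7):
    """Line of n elements with roughly outlier_share of them replaced by
    values 1e-2 or 1e+2 times the correct value (a simplification of the
    +/-10e-2 to +/-10e+2 range); the reference (first) element stays clean."""
    rng = random.Random(seed)
    slope = {"increment": 3.0, "decrement": -3.0, "constant": 0.0}[kind]
    clean = [100.0 + slope * i for i in range(n)]
    noisy, outlier_idx = list(clean), []
    for i in range(1, n):                        # index 0 is the reference element
        if rng.random() < outlier_share:
            noisy[i] = clean[i] * rng.choice([1e-2, 1e2])
            outlier_idx.append(i)
    return noisy, outlier_idx

data, truth = make_test_series(20)
print(len(truth), data[0])   # number of injected outliers, and the clean reference
```

Keeping the reference element clean mirrors the validation setting in which the first element is known not to be an outlier.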
2.10. Evaluation Using Real Data. To check the best-linear-fit identification capability, the algorithm was tested using several real data sets, automatically recorded at a frequency of twelve data points per day (i.e., every other hour) from a biogas plant over a period of seven months. Among the different parameters we selected the H2 content, measured in ppm, which we expected to maintain linear behaviour during stable operation. We selected seven segments of different sizes for evaluating the algorithm; in some data sets there were initial missing elements. We set the R_w for MMS and EMMS by analysing the first and the third data sets. For the recalculation process, the existing first element of the window was the reference element, and we always used the original values of the elements (not their current updated values). Then we determined the percentage of correctly and falsely detected nonoutliers in relation to the total number of nonoutliers, and the percentage of correctly and falsely detected outliers from the total number of outliers (small and large outliers).
We decided to use the LSM, the sigma filter, and Grubbs' test [26-29], also known as the maximum normed residual test or "extreme studentized deviate" (ESD) test, to compare our results. We selected Grubbs' test since it has nearly the same formulation as our method. We checked all the biogas data using the abovementioned methods, using each data segment as a single window. First we checked the ability of each method to identify the general trend of the series; then we checked the number of correctly and falsely detected outliers and nonoutliers for each method in relation to the general trend.
3. Results and Discussion

The validation results show that when the reference element (the first element) was not an outlier, the algorithm was capable of identifying all outliers with 0% error, regardless of the type of outliers (Gaussian or non-Gaussian) (Figure 8). If the outliers were Gaussian, there were no significant outliers
Figure 8: Outlier detection from data sets with ten elements. The first element is the reference element, which is not an outlier, where red triangle corresponds to outliers detected by MMS, yellow circle corresponds to outliers detected by EMMS, and green square corresponds to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. Panels: (a) increment/non-Gaussian, (b) decrement/non-Gaussian, (c) constant/non-Gaussian, (d) increment/Gaussian, (e) decrement/Gaussian, (f) constant/Gaussian outliers. When the reference (first) element is not an outlier, the new method is capable of locating all outliers; when the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).
and MMS automatically became inactive (Figures 8(d), 8(e), and 8(f)). When the first few elements were outliers and the outliers were non-Gaussian, MMS detected the significant outliers correctly (Figures 9(a), 9(b), and 9(c)). However, EMMS was unable to locate the nonsignificant outliers when the first element for EMMS was an outlier (Figures 9(a) and 9(c)). If the reference element for EMMS was not an outlier, it was still possible to achieve correct results (Figure 9(b)). Though it was impossible to locate all nonoutliers, the detected nonoutliers were 100% correct detections. These values can be used to estimate the other values using methods like LSM, since all the existing data are now cleaned. In
Figure 9: Outlier detection from data sets with ten elements. The first element is the reference element, which is an outlier, where red triangle corresponds to outliers detected by MMS, yellow circle corresponds to outliers detected by EMMS, green square corresponds to nonoutliers, and black arrow corresponds to wrong detections. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. Panels: (a) increment/non-Gaussian, (b) decrement/non-Gaussian, (c) constant/non-Gaussian, (d) increment/Gaussian, (e) decrement/Gaussian, (f) constant/Gaussian outliers. When the reference (first) element is an outlier and the outliers are non-Gaussian, the new method identifies only the significant outliers ((a), (b), (c)); when the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).
general, it is fair to state that (1) when the reference element is not an outlier, the method is capable of identifying all outliers, and (2) when the first few elements of the series are outliers and the outliers are non-Gaussian, the method is capable of identifying only the significant outliers and part of the correct elements.

When the first few elements (the reference elements for both MMS and EMMS) were outliers and the outlier distribution was Gaussian, outlier detection was poor (Figures 9(d), 9(e), and 9(f)). Due to the Gaussian distribution of outliers, MMS was inactive and it was not possible to identify the large outliers. Most importantly, the results highlighted
Figure 10: Two artificial data samples with 1000 elements each, including missing-value regions of 50, 100, 100, and 50 elements (300 in total). The first element is the reference element, which is not an outlier, where (a) corresponds to a data set with non-Gaussian outliers and (b) to a data set with nearly Gaussian outliers; red triangle corresponds to outliers detected by MMS, yellow circle corresponds to outliers detected by EMMS, and green square corresponds to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. The new method was able to identify all the elements related to the line with 0% error.
the importance of the reference element: if the reference element for MMS and EMMS was not an outlier, good results were guaranteed regardless of other factors.

In the methodology we derived the method based on the first element. However, it is also possible to use any other element as the reference point and modify the method accordingly. We considered the simplest situation, in which the first element is not an outlier. Therefore, if it is possible to segment the data excluding extreme outliers at the beginning, the method provides accurate outlier detection. Another possibility is to replace the first element with an already known element. This leads to a further way of applying the method: if only a single correct element is known, using that element as the reference element, with the method modified accordingly, can yield very accurate results.

Some model-based approaches demand a trained data set for correct output. In contrast, this method requires only one correct element to produce correct output. In addition, it is possible to use multiple reference points and consider the best fit; for example, (a) consider each point in the first x% (e.g., 10%) of data points as a reference point, or (b) consider every data point as a reference point. Furthermore, it is important to distinguish the purposes of MMS and EMMS: MMS removes only the significant outliers, while EMMS removes the nonsignificant outliers. Depending on the requirement, MMS and/or EMMS can be used to remove outliers.
The results show that the new method is a good solution for managing missing values. Figure 10 shows two data sets with 1000 elements each; each consists of missing-value regions of 50, 100, 100, and 50 elements (300 in total). When the first element was not an outlier, the new method was able to identify all the elements related to the line with 0% error.

In the real world it is not possible to find nonoutliers that exactly agree with a linear regression; therefore 100% accuracy is inapplicable. However, it is very important to have a significant-outlier-free data set, and the new method guaranteed a significant-outlier-free data set when the outliers were non-Gaussian. Furthermore, in real-world situations data outliers are not always Gaussian distributed; because of that, we expect the new method to be applicable to the majority of outlier detection applications. Our new method is an effective solution where the most common approaches are not: LSM and the sigma filter need Gaussian outliers, and some methods like the sigma filter cannot be applied directly to a given data segment, requiring further segmentation (windowing) for better results. In contrast, the new method is capable of locating nonoutliers automatically in increment, decrement, or constant form, regardless of the size of the window.
Results related to the biogas data supported the abovementioned idea and showed that the algorithm clearly identifies three regions: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers within a data segment (Figure 11). In addition, the results showed that the nonoutliers follow a linear path. Furthermore, the width of the regions can be tuned by changing the relevant R_w values. Figure 11 shows some selected results of the biogas data for a k value of 0.2 for MMS and a k value of 0.1 for EMMS.
One interesting observation was the ability of the algorithm to continue linear detection even with noncontinuous clusters (Figures 11(b) and 11(e)). In all data segments no false detection occurred (there were no outliers in nonoutlier regions and vice versa). Most importantly, the new method required no further windowing, and nonoutliers were detected independent of the window size.
When the general trend was constant and the elements were Gaussian distributed, the sigma filter and LSM were able to identify the linear trend. However, for series with biased elements, both methods failed to identify the general trend. When the general trend was increasing or decreasing, the sigma filter failed to identify the general trend (a further
Figure 11: Results related to real biogas data (H2 content of biogas in ppm) with data sets of different sizes: (a) 334 elements, (b) 372 elements, (c) 350 elements, (d) 352 elements, (e) 356 elements, (f) 372 elements. The first element is the reference element, which is assumed not to be an outlier. The algorithm clearly identifies three regions: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers within each data segment. Most importantly, all the nonoutliers lie within a linear border, where red triangle corresponds to outliers detected by MMS, yellow circle corresponds to outliers detected by EMMS, and green square corresponds to nonoutliers. The value of k for MMS and EMMS is 0.2 and 0.1, respectively.
segmentation would give a better result, but we used the whole window). The new method was capable of locating 4% to 45% of elements as outliers with 0% error. Grubbs' test was capable of identifying only a very small proportion of elements as outliers (0-1.7%), even at a significance level of 0.05; however, all the outliers it found were significant, and no wrong detections were reported.
4. Conclusions and Outlook

This paper introduced a new outlier detection method using the relation of the sum of the elements of an arithmetic progression. The results of this work show that the new method is a robust solution for outlier detection in a data set with missing elements. The method is capable of identifying both significant and nonsignificant outliers when the first value of the data set is not an outlier. Most importantly, the method is a solution for identifying significant outliers in a series with non-Gaussian-distributed outliers. In addition, the outlier detection is nonparametric, has floor and ceiling values, and does not require standardization. When the reference elements are unknown, the method can be used with multiple reference elements to obtain optimal output.

If the frequency of the data is sufficient, any nonlinear relation can be represented as a combination of straight lines. Therefore, by using a suitable segmentation technique, it is possible to identify outliers in any data series, which will allow detecting outliers in a process-oriented data set. Hence, to bring a data series into a form suitable for our method, an intelligent segmentation technique is necessary.
Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The University of Ruhuna, Sri Lanka, provided the paper processing charges of this paper. The German Academic Exchange Service (Deutscher Akademischer Austauschdienst) financed this work.
References

[1] R. J. Beckman and R. D. Cook, "Outlier……….s," Technometrics, vol. 25, no. 2, pp. 119-149, 1983.
[2] F. Molinari, "Missing treatments," Journal of Business and Economic Statistics, vol. 28, no. 1, pp. 82-95, 2010.
[3] J. Qin, B. Zhang, and D. H. Y. Leung, "Empirical likelihood in missing data problems," Journal of the American Statistical Association, vol. 104, no. 488, pp. 1492-1503, 2009.
[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85-126, 2004.
[5] Z. X. Niu, S. Shi, J. Sun, and X. He, "A survey of outlier detection methodologies and their applications," in Artificial Intelligence and Computational Intelligence, vol. 7002 of Lecture Notes in Computer Science, pp. 380-387, 2011.
[6] W. Jin, A. K. H. Tung, and J. Han, "Mining top-n local outliers in large databases," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 293-298, ACM, San Francisco, Calif, USA, August 2001.
[7] H.-P. Kriegel, M. S. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 444-452, ACM, Las Vegas, Nev, USA, August 2008.
[8] I. Ben-Gal, "Outlier detection," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 131-146, Springer, New York, NY, USA, 2005.
[9] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 709-712, December 2002.
[10] H. Fan, O. R. Zaïane, A. Foss, and J. Wu, "A nonparametric outlier detection for effectively discovering top-N outliers from engineering data," in Advances in Knowledge Discovery and Data Mining, W.-K. Ng, M. Kitsuregawa, J. Li, and K. Chang, Eds., vol. 3918 of Lecture Notes in Computer Science, pp. 557-566, Springer, Berlin, Germany, 2006.
[11] J.-S. Lee, "Digital image smoothing and the sigma filter," Computer Vision, Graphics, and Image Processing, vol. 24, no. 2, pp. 255-269, 1983.
[12] A. Gelb, Applied Optimal Estimation, MIT Press, 1974.
[13] D. Sierociuk, I. Tejado, and B. M. Vinagre, "Improved fractional Kalman filter and its application to estimation over lossy networks," Signal Processing, vol. 91, no. 3, pp. 542-552, 2011.
[14] P. H. Abreu, J. Xavier, D. C. Silva, L. P. Reis, and M. Petry, "Using Kalman filters to reduce noise from RFID location system," The Scientific World Journal, vol. 2014, Article ID 796279, 9 pages, 2014.
[15] H. Liu, S. Shah, and W. Jiang, "On-line outlier detection and data cleaning," Computers and Chemical Engineering, vol. 28, no. 9, pp. 1635-1647, 2004.
[16] T. D. Pigott, "A review of methods for missing data," Educational Research and Evaluation, vol. 7, no. 4, pp. 353-383, 2001.
[17] M. Nakai and W. Ke, "Review of the methods for handling missing data in longitudinal data analysis," International Journal of Mathematical Analysis, vol. 5, no. 1-4, pp. 1-13, 2011.
[18] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147-177, 2002.
[19] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, L. House, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639-647, Springer, Berlin, Germany, 2004.
[20] A. N. Baraldi and C. K. Enders, "An introduction to modern missing data analyses," Journal of School Psychology, vol. 48, no. 1, pp. 5-37, 2010.
[21] J. Tian, B. Yu, D. Yu, and S. Ma, "Clustering-based multiple imputation via gray relational analysis for missing data and its application to aerospace field," The Scientific World Journal, vol. 2013, Article ID 720392, 10 pages, 2013.
[22] S. Zhaowei, Z. Lingfeng, M. Shangjun, F. Bin, and Z. Taiping, "Incomplete time series prediction using max-margin classification of data with absent features," Mathematical Problems in Engineering, vol. 2010, Article ID 513810, 14 pages, 2010.
[23] J.-F. Cai, Z. Shen, and G.-B. Ye, "Approximation of frame based missing data recovery," Applied and Computational Harmonic Analysis, vol. 31, no. 2, pp. 185-204, 2011.
[24] B. L. Wiens and G. K. Rosenkranz, "Missing data in noninferiority trials," Statistics in Biopharmaceutical Research, vol. 5, no. 4, pp. 383-393, 2013.
[25] Aryabhata, The Aryabhatiya of Aryabhata: An Ancient Indian Work on Mathematics and Astronomy, vol. 1, Kessinger Publishing, 2006.
[26] F. E. Grubbs, "Procedures for detecting outlying observations in samples," Technometrics, vol. 11, no. 1, pp. 1-21, 1969.
[27] B. Rosner, "On the detection of many outliers," Technometrics, vol. 17, no. 2, pp. 221-227, 1975.
[28] R. B. Jain, "Percentage points of many-outlier detection procedures," Technometrics, vol. 23, no. 1, pp. 71-75, 1981.
[29] S. P. Verma, L. Díaz-González, M. Rosales-Rivera, and A. Quiroz-Ruiz, "Comparative performance of four single extreme outlier discordancy tests from Monte Carlo simulations," The Scientific World Journal, vol. 2014, Article ID 746451, 27 pages, 2014.
The Scientific World Journal 3
Figure 1: Distribution of the criteria range (0, 1]: values below 2/n flag minimum outliers, values above 2/n flag maximum outliers, and R = 2/n means no outliers.
to use meaningful symbols that reflect the purpose of the method. The first and the last elements are either the minimum or the maximum. Therefore, it is possible to replace a_1 and a_n by the minimum (a_min) and the maximum (a_max) of the series. Then (3) can be represented as

    2/n = (a_min + a_max) / S_n.    (4)

Since the RHS of (4) consists of the minimum, the maximum, and the sum of the series, the RHS was named MMS, with the meaning of minimum, maximum, and sum:

    MMS = (a_min + a_max) / S_n.    (5)
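The identity behind (4) and (5) is easy to verify numerically. The sketch below (the function name `mms` and the sample data are ours) computes the ratio for a clean arithmetic progression and for one with a corrupted maximum:

```python
def mms(series):
    # Ratio of (minimum + maximum) to the sum of the series, per (5)
    return (min(series) + max(series)) / sum(series)

ap = [3 + 4 * i for i in range(10)]   # arithmetic progression, n = 10
print(mms(ap))                        # 0.2, i.e. exactly 2/n

ap[-1] = 390                          # corrupt the maximum
print(mms(ap) > 2 / len(ap))          # True: the ratio rises above 2/n
```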
2.3. Challenge 2: Set a Range for the Outlier Detection Criterion. According to (3), the outlier detection criterion is 2/n, and it can be used to check the elements that exactly agree with a line (Figure 1). To identify elements within a certain range, it is necessary to have a criteria range rather than the single value 2/n.

The left-hand side of (4) is the ratio 2/n; it was named R_w by adding a weight "w" to "R". Then

    R_w = 2/n + w,    0 ≤ w ≤ 1 − 2/n.    (6)

The status w = 0 (R_0) represents a single line, and w > 0 represents a line with a certain width (linear border).

The outlier criteria range has both a floor (0) and a ceiling (1), and standardization is not required. This is an additional advantage over the most common average-, variance-, and standard-deviation-based approaches, which require a separate standardization process.
2.4. Challenge 3: Influence of Negative Values. Due to negative values, the numerator or both the numerator and the denominator of the RHS of (5) can be 0 (e.g., −4, −1, 0, 1, 4) even without outliers. When there are outliers, the RHS of (5) can be negative, which cannot be accepted as a valid value for 2/n: 0 < 2/n ≤ 1 must always hold.

Subtracting the minimum from each element (a_i,new = a_i − a_min) of any AP creates a new, transformed AP where a_min = 0, which guarantees a series without negative values. From (5) and a_i,new = a_i − a_min, (7) is derived, which is more robust. Another advantage of (7) is that it performs the transformation automatically:

    MMS = ((a_min − a_min) + (a_max − a_min)) / Σ_{i=1}^{n} (a_i − a_min),

    MMS = (a_max − a_min) / (S_n − a_min · n).    (7)
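Equation (7) can be exercised on a progression containing negative values, where the plain ratio in (5) degenerates to 0/0; the function name and the example series below are ours:

```python
def mms_shifted(series):
    # MMS after the a_i - a_min shift, per (7); valid with negative values
    n, lo, hi = len(series), min(series), max(series)
    return (hi - lo) / (sum(series) - lo * n)

ap = [-6 + 3 * i for i in range(5)]   # -6, -3, 0, 3, 6: min + max = 0 and sum = 0
print(mms_shifted(ap))                # 0.4, i.e. exactly 2/5
```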
2.5. Challenge 4: Uneven Distribution of the Criteria Range. The ranges (0, 2/n) and (2/n, 1] identify outliers which are minimums and maximums, respectively (Figure 1). When n → ∞, R_0 → 0; then R_w ∈ (0, 1] is not equally distributed, which provides a large range for maximum outliers and a small range for minimum outliers. This is a problem when locating minimum outliers.

To solve this, we used the idea of the complement. For any series, this converts the maximum value into the minimum, the minimum value into the maximum, and intermediate values into their complements. Most importantly, the minimum value now represents the maximum value of the original series and vice versa, while still representing the original series. The complement of an element in a series can be defined as a_i,c = (a_max + a_min) − a_i. From (5) and a_i,c = (a_max + a_min) − a_i, this gives

    MMS = ((a_max + a_min − a_max) + (a_max + a_min − a_min)) / Σ_{i=1}^{n} (a_max + a_min − a_i),

    MMS = (a_min + a_max) / Σ_{i=1}^{n} (a_max + a_min − a_i).    (8)

Applying a_i,new = a_i − a_min (to remove the effect of negative values):

    MMS = ((a_min − a_min) + (a_max − a_min)) / Σ_{i=1}^{n} ((a_max − a_min) + (a_min − a_min) − (a_i − a_min)),

    MMS = (a_max − a_min) / Σ_{i=1}^{n} (a_max − a_i),

    MMS = (a_max − a_min) / (a_max · n − S_n).    (9)

Consequently, for the complement series, the range R_w > 2/n represents minimum outliers of the original series and vice versa (Figure 2), and it is possible to ignore the range (0, 2/n). In addition, (9) automatically performs the transformation.

Now there are two equations for MMS, (7) and (9), to check whether the maximum or the minimum of the series is an outlier. We named the two versions of MMS MMS_max (10) and MMS_min (11):

    MMS_max = (a_max − a_min) / (S_n − a_min · n),    (10)

    MMS_min = (a_max − a_min) / (a_max · n − S_n).    (11)
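A minimal sketch of (10) and (11), reproducing data set 3 of Table 2 (the function names are ours):

```python
def mms_max(series):
    # (10): exceeds 2/n when the maximum is an outlier
    n, lo, hi = len(series), min(series), max(series)
    return (hi - lo) / (sum(series) - lo * n)

def mms_min(series):
    # (11): exceeds 2/n when the minimum is an outlier
    n, lo, hi = len(series), min(series), max(series)
    return (hi - lo) / (hi * n - sum(series))

data3 = [100, 101, 102, 103, 204]        # data set 3 of Table 2
print(round(mms_max(data3), 3))          # 0.945 > 2/5: the maximum is an outlier
print(round(mms_min(data3), 3))          # 0.254
```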
Table 2: Sample calculations illustrating the relation between 2/n and MMS_max and MMS_min.

    a_n       Data set 1   Data set 2   Data set 3   Data set 4   Data set 5
    a_1       100          100          100          99.99        1
    a_2       101          101          101          101          101
    a_3       102          102          102          102          102
    a_4       103          103          103          103          103
    a_5       104          104.01       204          104          104
    MMS_max   0.4          0.401        0.945        0.399        0.254
    MMS_min   0.4          0.399        0.254        0.401        0.945
    2/n       0.4          0.4          0.4          0.4          0.4
    Outlier   —            Yes-Max      Yes-Max      Yes-Min      Yes-Min
Figure 2: Range of R_w for the original series and for the complement of the original series (R_0 = 2/n): in the original series the range above 2/n flags maximum outliers, while for the complement series the maximum outliers represent the minimum outliers of the original series.
The following equation shows an overview of the MMS process:

    MMS_max = (a_max − a_min) / (S_n − a_min · n):
        > (2/n) + w_1  →  the maximum is the outlier,
        ≤ (2/n) + w_1  →  the maximum is not an outlier;

    MMS_min = (a_max − a_min) / (a_max · n − S_n):
        > (2/n) + w_1  →  the minimum is the outlier,
        ≤ (2/n) + w_1  →  the minimum is not an outlier.    (12)

Table 2 shows sample calculations using (10) and (11) for the same data sets as in Table 1.
2.6. Challenge 5: How to Deal with Removed Outliers/Missing Values. In a series, there can be initial missing values. In addition, if there is no replacement after removing an outlier, this also creates a missing-value environment. Without filling, the elements after the removed element would be transformed into other values, destroying the original relationship of the elements (Figure 3). These transformed values become outliers in relation to the original data. Therefore, to use the relation of the AP, it is compulsory to maintain the original relation of the data even after removing an outlier. Thus, any rejection technique is not feasible. One possible way to maintain the original relation is to replace the missing value. However, the data we are considering contain a considerable number of outliers; therefore, we cannot guarantee that an element derived from existing elements is not an outlier.

To overcome this problem, we considered two different options: (1) recalculate only the data points after (or before) the removed or missing element, thereby maintaining the initial angle in relation to a certain point, or (2) transform the elements into a new series on which a missing value has no effect.
2.6.1. Recalculate the Data Points after (or before) Removed and Missing Elements. If there is a missing element, the next elements will be shifted horizontally and transformed into wrong values in relation to the current index of the elements (Figure 3). However, angular shifting will not introduce such an error (Figure 3).

In Figure 4, the plot consists of elements a_0 to a_{r+1} (r ∈ R+), and the element a_r at index r needs to be removed. After removing element r, element r+1 becomes element r, element r+2 becomes element r+1, and so on. However, by shifting while maintaining the same angle with respect to a certain reference element (e.g., the first element), the same form of the series can be maintained. Equation (13) shows the new value after angular shifting. We used this technique with the MMS algorithm to recalculate the series after (or before) missing values or removed elements:

    B_r T_r = (B_{r+1} C_{r+1} / A B_{r+1}) · A B_r = ((a_{r+1} − a_0) / (r + 1)) · r,

    (a_{r+1})_new = a_0 + B_r T_r.    (13)
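The angular shift in (13) generalises to every element after the removed index r: an element at old index j lands at index j − 1 with value a_0 + (a_j − a_0)(j − 1)/j. A sketch (the function name is ours), assuming the reference a_0 sits at index 0:

```python
def angular_shift(values, r):
    # Remove the element at index r and re-map each later element to its
    # new index while keeping its angle to the reference values[0], per (13)
    a0 = values[0]
    out = list(values[:r])
    for j in range(r + 1, len(values)):
        out.append(a0 + (values[j] - a0) * (j - 1) / j)
    return out

# Line 10 + 2k with an outlier (100) at index 3:
print(angular_shift([10, 12, 14, 100, 18, 20], r=3))
# [10, 12, 14, 16.0, 18.0] -- the shifted points fall back on the line
```

Note that r = 0 would remove the reference itself, a case the method excludes by assumption.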
2.6.2. Transformation of Data to a Constant Value. A series with a constant value (of the form y = c, where c is a constant) is a series that is unaffected by missing values. Therefore, if any linear series can be transformed to the form y = c, the transformed series is free of any effect of missing values and can then be used for outlier detection.

Let y^T be a linear series where y^T_k = y_k − y_1 and x^T_k = x_k − x_1, where x_k is the initial index of the elements, y_k is the kth element of the series, and k = 1, 2, …, n. The gradient of the line (m) is given by Σ_{k=1}^{n} y^T_k / Σ_{k=1}^{n} x^T_k. If one element (e.g., the first element) is (0, 0), this relation is always true even with missing values; the element (0, 0) can be considered the reference element. Thus y^T is a series with first element (0, 0) and gradient m that can be calculated even with missing values. It is also possible to derive a new series y′ where y′_k = x^T_k · m. If there are no outliers, y^T and y′ coincide and y^T − y′ = 0. If y^TT = y^T − y′, then y^TT is of the form y = c without any influence from missing values. Therefore, this is another method to overcome missing values without replacing them (Figure 5).
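The transformation can be sketched as follows; `to_constant` and the sample line are ours. Missing indices simply do not appear in the input pairs:

```python
def to_constant(pairs):
    # Map (x_k, y_k) pairs to yTT_k = yT_k - m * xT_k, where
    # yT_k = y_k - y_1, xT_k = x_k - x_1, and m = sum(yT) / sum(xT).
    # For outlier-free linear data every yTT_k is 0, with or without gaps.
    x1, y1 = pairs[0]
    xT = [x - x1 for x, _ in pairs]
    yT = [y - y1 for _, y in pairs]
    m = sum(yT) / sum(xT)
    return [yt - m * xt for xt, yt in zip(xT, yT)]

# The line y = 5 + 3x with x = 2 and x = 5 missing:
pts = [(x, 5 + 3 * x) for x in (0, 1, 3, 4, 6)]
print(to_constant(pts))   # [0.0, 0.0, 0.0, 0.0, 0.0]
```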
2.7. Challenge 6: Locate Outliers That Are Neither the Maximum Nor the Minimum of the Series. When the outlier is
Figure 3: (a) Data set with an outlier at index r. (b) Value autotransformation effect after removing the outlier at index r without replacement, where circles correspond to elements after removing the outlier, red triangles correspond to expected (correct) elements, and squares correspond to initial values of the shifted elements after removing the outlier.
Table 3: "Bad Detection": the wrong (minimum) element is identified as the outlier.

    a_n    Data set 6
    a_1    100
    a_2    101        MMS_max   0.377
    a_3    102        MMS_min   0.425
    a_4    103.6      2/n       0.4
    a_5    104        Outlier   Yes-Min
neither the maximum nor the minimum, MMS is unable to locate the outlier (Table 3). We named this phenomenon "Bad Detection". When R_w reaches the "Bad Detection Level", MMS cannot be applied. To overcome this situation, we introduced an improved version of MMS, the enhanced MMS (EMMS), based on the missing-data imputation technique in Section 2.6.2.
EMMS is expressed as

    EMMS_max = (a^TT_max − a^TT_min) / (S^TT_n − a^TT_min · n),    a^TT_max ≠ 0,    (14)

    EMMS_min = (a^TT_max − a^TT_min) / (a^TT_max · n − S^TT_n),    a^TT_max ≠ 0,    (15)

where a^TT_k = |a^T_k − x_k (G_a^T / G_x)|, a^T_k = a_k − a_0, x_k is the index of the data, a_k is the kth term of the series, k = 0, 1, …, n − 1, n is the number of elements in the current window, G_a^T = Σ_{k=0}^{n−1} a^T_k, G_x = Σ_{k=0}^{n−1} x_k, and S^TT_n = Σ_{k=0}^{n−1} a^TT_k ≠ 0.
Figure 4: Solution for the value autotransformation phenomenon: use angular shifting instead of horizontal shifting, where (×) corresponds to the horizontal shift and (•) corresponds to the angular shift.
The term a^TT_k is always ≥ 0; thus a^TT_min = 0. Then (14) and (15) simplify to

    EMMS_max = a^TT_max / S^TT_n,    a^TT_max ≠ 0,    (16)

    EMMS_min = a^TT_max / (a^TT_max · n − S^TT_n),    a^TT_max ≠ 0.    (17)
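A sketch of (16) and (17) on (index, value) pairs, reproducing Table 4 below; the function name is ours, and the guard a^TT_max ≠ 0 is left to the caller:

```python
def emms(pairs):
    # EMMS per (16)-(17); pairs are (index, value), gaps allowed.
    # Returns (EMMS_max, EMMS_min, index of the element with largest aTT).
    x0, a0 = pairs[0]
    xs = [x - x0 for x, _ in pairs]
    aT = [a - a0 for _, a in pairs]
    ratio = sum(aT) / sum(xs)                      # G_aT / G_x
    aTT = [abs(at - ratio * x) for x, at in zip(xs, aT)]
    n, top, s = len(pairs), max(aTT), sum(aTT)
    return top / s, top / (top * n - s), pairs[aTT.index(top)][0]

vals = [100, 101, 102, 103.6, 104]                 # Table 4 data
e_max, e_min, where = emms(list(enumerate(vals)))
print(round(e_max, 3), round(e_min, 3), where)     # 0.5 0.333 3
```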
If there are outliers, EMMS_min > 2/n or EMMS_max > 2/n, and the greater value indicates the outlier. Table 4 shows
Table 4: EMMS for identifying an outlier.

    x (0 … n−1)    y (a_n)    y^T (a_n − a_1)    y^TT (|y^T − x · (G^T_y / G_x)|)
    0              100        0.000              0.00
    1              101        1.000              0.06
    2              102        2.000              0.12
    3              103.6      3.600              0.42
    4              104        4.000              0.24

    G_x = 10, G^T_y = 10.6
    EMMS (Max): 0.500; EMMS (Min): 0.333; 2/n: 0.4; Outlier: Yes-Max
Figure 5: Transformation of data to a constant value to overcome the missing-data problem, where red diamonds correspond to the y = f(x) form, yellow squares to the y^T = f(x) − f(x_0) form, crosses to the y′ = m · x_i form (m = Σ_{i=0}^{n} y^T_i / Σ_{i=0}^{n} x^T_i), red circles to y^TT = y^T − y′, and open circles to missing values.
an example calculation of EMMS, and the following equation shows an overview of the EMMS process:

    EMMS_max = a^TT_max / S^TT_n:
        > (2/n) + w_2  →  the maximum is the outlier,
        ≤ (2/n) + w_2  →  the maximum is not an outlier;

    EMMS_min = a^TT_max / (a^TT_max · n − S^TT_n):
        > (2/n) + w_2  →  the minimum is the outlier,
        ≤ (2/n) + w_2  →  the minimum is not an outlier.    (18)
However, EMMS uses information derived from the existing data; if there are biased values, this may lead to biased information. Because of that, direct application of EMMS is not good practice: significant outliers should first be removed using MMS before applying EMMS.
Figure 6: Implementation of MMS and EMMS. The algorithm first checks for significant outliers using MMS with R_w = (2/n)(1 + k_1), removing each detected outlier (n = n − 1) and recalculating the series; after all significant outliers are removed, the nonsignificant outliers are removed using EMMS with R_w = (2/n)(1 + k_2). There is no removed-data imputation in either MMS or EMMS.
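The flow of Figure 6 can be sketched end to end. This is a simplified sketch with assumptions of ours: surviving elements keep their original indices instead of being recalculated after a removal, the first element is taken as a nonoutlier reference, and a small tolerance stands in for the a^TT_max ≠ 0 test:

```python
def detect_outliers(values, k1=0.5, k2=0.01):
    # MMS pass for significant outliers, then EMMS for the rest (Figure 6)
    pts = list(enumerate(values))
    found = []
    while len(pts) > 2:                           # MMS, R_w = (2/n)(1 + k1)
        n = len(pts)
        vs = [v for _, v in pts]
        lo, hi, s = min(vs), max(vs), sum(vs)
        if hi == lo:
            break
        rw = (2 / n) * (1 + k1)
        m_max = (hi - lo) / (s - lo * n)          # (10)
        m_min = (hi - lo) / (hi * n - s)          # (11)
        if m_max > rw and m_max >= m_min:
            bad = max(pts, key=lambda p: p[1])
        elif m_min > rw:
            bad = min(pts, key=lambda p: p[1])
        else:
            break
        found.append(bad)
        pts.remove(bad)
    while len(pts) > 2:                           # EMMS, R_w = (2/n)(1 + k2)
        n = len(pts)
        x0, a0 = pts[0]
        xs = [x - x0 for x, _ in pts]
        aT = [a - a0 for _, a in pts]
        ratio = sum(aT) / sum(xs)
        aTT = [abs(at - ratio * x) for x, at in zip(xs, aT)]
        top, s = max(aTT), sum(aTT)
        if top < 1e-12:                           # aTT_max = 0: nothing to flag
            break
        rw = (2 / n) * (1 + k2)
        if max(top / s, top / (top * n - s)) <= rw:
            break
        bad = pts[aTT.index(top)]
        found.append(bad)
        pts.remove(bad)
    return sorted(found), pts

print(detect_outliers([3, 7, 11, 15, 190, 23])[0])      # [(4, 190)]   (via MMS)
print(detect_outliers([100, 101, 102, 103.6, 104])[0])  # [(3, 103.6)] (via EMMS)
```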
2.8. Challenge 7: Determining the Outlier Detection Criterion (R_w). The value R_w is the factor that determines the outliers: w = 0 (R_0 = 2/n) represents exactly a line, and w > 0 represents a linear border with a certain width. In this section we propose several possible methods that can be used to determine the outlier detection criterion.

2.8.1. Express the Value "w" as f(1/n). If the value w is f(1/n), then w = 2k/n with k ≤ (n/2) − 1 and k ∈ R+. Then R_w = 2/n + 2k/n:

    R_w = (2/n) · (1 + k),

    R_w / R_0 = 1 + k (= constant).    (19)

When MMS or EMMS is greater than R_w of (19), this implies the existence of outliers. Because R_w / R_0 is constant and gives a standard for R_w, the determination of k still depends on knowledge of the domain. Figure 6 shows an algorithm based on this technique.
2.8.2. When the First and the Last Items Are Nonoutliers. In the total process, the "Bad Detection Level" is the most important criterion. If R_w of MMS is less than the "Bad Detection Level", it is possible to identify nonoutliers as outliers, as mentioned in Section 2.7. If there is prior knowledge about the outliers, it is possible to use a safe value for MMS; otherwise, there is no 100% guarantee on the "Bad Detection Level".
Table 5: Different environments used to validate the new method.

    Type of the original data set: increment, constant, and decrement
    Number of elements: 10 to 1000
    Type of outliers: non-Gaussian, Gaussian
    Reference (first) element is an outlier: yes, no
    Initial missing values: yes, no
Figure 7: Outlier detection method including the "Bad Detection Level" detection technique. The first and the last data points of the window must be nonoutliers; if the first or the last element is identified as an outlier, a contradictory situation arises, so this point can be considered the terminating point of MMS and EMMS.
However, when the first and the last elements are not outliers, the "Bad Detection Level" can be detected automatically. If the first or the last element is identified as an outlier, a contradictory situation arises; thus, this point can be considered the terminating point of MMS and EMMS. The decision diagram in Figure 7 expresses the new outlier detection method including the "Bad Detection Level" detection technique.
2.9. Validate the Method. We implemented MMS (with recalculation after an outlier is removed) and EMMS in C++ and conducted the validation process. For the recalculation process, the existing first element of the window was the reference element, and the original value of the element (not its current updated value) was always used. To validate the method, we used artificial data sets of different sizes (10 to 1000 elements) representing increasing, decreasing, and constant lines. Then 50% of the items of those data sets were replaced with very small and very large outliers (±10e−2 to ±10e+2 times the correct value). We checked the data sets for all the environment combinations shown in Table 5. The outlier detection criteria were determined based on (19); for all data sets the same k value was used (k = 0.5 for MMS and k = 0.01 for EMMS). Then the percentage of correctly and falsely detected nonoutliers in relation to the number of actual nonoutliers, and the percentage of correctly and falsely detected outliers out of the total number of outliers (small and large), were determined.
2.10. Evaluation Using Real Data. To check the best linear-fitting identification capability, the algorithm was tested using several real data sets which were automatically recorded at a frequency of twelve data points per day (i.e., every other hour) from a biogas plant over a period of seven months. Among the different parameters, we selected the H2 content, measured in ppm, which we expected to maintain linear behaviour during stable operation. We selected seven segments of different sizes for evaluating the algorithm; in some data sets there were initial missing elements. We set the R_w for MMS and EMMS by analysing the first and the third data sets. For the recalculation process, the existing first element of the window was the reference element, and we always used the original value of the elements (not the current updated value). Then the percentage of correctly and falsely detected nonoutliers in relation to the total number of nonoutliers, and the percentage of correctly and falsely detected outliers out of the total number of outliers (small and large), were determined.
We decided to use the LSM, the sigma filter, and Grubbs' test [26–29], also known as the maximum normed residual test or "extreme studentized deviate" (ESD) test, to compare our results. We selected Grubbs' test since it has nearly the same formulation as our method. We checked all the biogas data using the abovementioned methods, using each of the data segments as a single window. First, we checked the ability of each method to identify the general trend of the series; then we checked the number of correctly and falsely detected outliers and nonoutliers for each method in relation to the general trend.
3. Results and Discussion

The results of the validation show that when the reference element (the first element) was not an outlier, the algorithm was capable of identifying all outliers with 0% error, regardless of the type of outliers (Gaussian or non-Gaussian) (Figure 8). If the outliers were Gaussian, there were no significant outliers,
Figure 8: Outlier detection from data sets with ten elements: (a) data type increment, outlier type non-Gaussian; (b) decrement, non-Gaussian; (c) constant, non-Gaussian; (d) increment, Gaussian; (e) decrement, Gaussian; (f) constant, Gaussian. The first element is the reference element, which is not an outlier; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is not an outlier, the new method is capable of locating all outliers. When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).
and MMS automatically became inactive (Figures 8(d), 8(e), and 8(f)). When the first few elements were outliers and the outliers were non-Gaussian, MMS detected the significant outliers correctly (Figures 9(a), 9(b), and 9(c)). However, EMMS was unable to locate the nonsignificant outliers when the first element for EMMS was an outlier (Figures 9(a) and 9(c)). If the reference element for EMMS was not an outlier, it was still possible to achieve correct results (Figure 9(b)). Though it was impossible to locate all nonoutliers, the detected nonoutliers were 100% correct detections. These values can be used to estimate the other values using methods like LSM, since all the remaining data are now cleaned. In
Figure 9: Outlier detection from data sets with ten elements: (a) data type increment, outlier type non-Gaussian; (b) decrement, non-Gaussian; (c) constant, non-Gaussian; (d) increment, Gaussian; (e) decrement, Gaussian; (f) constant, Gaussian. The first element is the reference element, which is an outlier; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, green squares to nonoutliers, and black arrows to wrong detections. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is an outlier and the outliers are non-Gaussian, the new method identifies only the significant outliers ((a), (b), (c)). When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).
general, it is fair to state that (1) when the reference element is not an outlier, the method is capable of identifying all outliers, and (2) when the first few elements of the series are outliers and the outliers are non-Gaussian, the method is capable of identifying only the significant outliers and part of the correct elements.

When the first few elements (the reference elements for both MMS and EMMS) were outliers and the outlier distribution was Gaussian, outlier detection was poor (Figures 9(d), 9(e), and 9(f)). Due to the Gaussian distribution of the outliers, MMS was inactive, and it was not possible to identify the large outliers. Most importantly, the results highlighted
Figure 10: Two artificial data samples with 1000 elements each, including missing-value regions of 50, 100, 100, and 50 elements (300 in total). The first element is the reference element, which is not an outlier; (a) corresponds to a data set with non-Gaussian outliers and (b) to a data set with nearly Gaussian outliers; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. The new method was able to identify all the elements related to the line with 0% error.
the importance of the reference element: if the reference element for MMS and EMMS is not an outlier, good results are guaranteed regardless of the other factors.

In the methodology, we derived the method based on the first element. However, it is also possible to use any other element as the reference point and modify the method accordingly. We considered the simplest situation, in which the first element is not an outlier. Therefore, if it is possible to segment the data so that extreme outliers at the beginning are excluded, accurate outlier detection is obtained. Another possibility is to replace the first element with an already known element. This leads to a further way of applying the method: if only a single correct element is known, using that element as the reference element, with the method modified accordingly, can yield very accurate results.

Some model-based approaches demand a trained data set for correct output. In contrast, this method requires only one correct element to produce a correct output. In addition, it is possible to use multiple reference points and consider the best fit, for example, (a) consider each point in the first x% (e.g., 10%) of data points as the reference point, or (b) consider every data point as the reference point. Furthermore, it is important to distinguish the purposes of MMS and EMMS: MMS removes only the significant outliers, while EMMS removes the nonsignificant outliers. Depending on the requirement, MMS and/or EMMS can be used to remove outliers.
The results show that the new method is a good solution for managing missing values. Figure 10 shows two data sets with 1000 elements each; each data set contains missing-value regions of 50, 100, 100, and 50 elements (300 in total). When the first element was not an outlier, the new method was able to identify all the elements related to the line with 0% error.
In the real world, it is not possible to find nonoutliers that exactly agree with a linear regression; therefore, 100% accuracy is inapplicable. However, it is very important to have a significant-outlier-free data set, and the new method guaranteed one when the outliers were non-Gaussian. Furthermore, in real-world situations, data outliers are not always Gaussian distributed; we therefore expect the new method to be applicable to the majority of outlier detection applications. Our new method is an effective alternative, because the most common methods, LSM and the sigma filter, require Gaussian outliers. Some methods, like the sigma filter, cannot be applied directly to a certain data segment, and further segmentation (windowing) is required for better results. In contrast, the new method is capable of locating nonoutliers automatically in increasing, decreasing, or constant series, regardless of the size of the window.
The results for the biogas data supported the abovementioned idea and showed that the algorithm clearly identifies three regions within a data segment: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers (Figure 11). In addition, the results showed that the nonoutliers follow a linear path. Furthermore, the width of the regions can be tuned by changing the relevant R_w values. Figure 11 shows some selected results of the biogas data for a k value of 0.2 for MMS and a k value of 0.1 for EMMS.
One of the interesting observations was the ability of the algorithm to continue linear detection even with noncontinuous clusters (Figures 11(b) and 11(e)). In all data segments, no false detection occurred (there were no outliers in nonoutlier regions and vice versa). Most importantly, the new method required no further windowing, and nonoutliers were detected independently of the window size.

When the general trend was constant and the elements were Gaussian distributed, the sigma filter and LSM were able to identify the linear trend. However, for series with biased elements, both methods failed to identify the general trend. When the general trend was increasing or decreasing, the sigma filter failed to identify the general trend (a further
Figure 11: Results related to real biogas data (H2 content of biogas, in ppm) for data segments of different sizes: (a) 334 elements, (b) 372 elements, (c) 350 elements, (d) 352 elements, (e) 356 elements, and (f) 372 elements. The first element is the reference element, which is assumed not to be an outlier. The algorithm clearly identifies three regions within each data segment: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers. Most importantly, all the nonoutliers lie within a linear border, where red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.2 and 0.1, respectively.
segment would give a better result, but we used the whole window). The new method was capable of locating 4% to 45% of elements as outliers with 0% error. Grubbs' test was capable of identifying only a very small fraction of elements as outliers (0-1.7%), even with a significance level of 0.05. However, all detected outliers were significant, and no wrong detections were reported.
4. Conclusions and Outlook
This paper introduced a new outlier detection method using the relation of the sum of the elements of an arithmetic progression. The results of this work show that the new method is a robust solution for outlier detection in a data set with missing elements. The method is capable of identifying
both significant and nonsignificant outliers when the first value of the data set is not an outlier. Most importantly, the method is a solution for identifying significant outliers in a series with outliers in a non-Gaussian distribution. In addition, the outlier detection is nonparametric, has floor and ceiling values, and does not require standardization. When the reference elements are unknown, the method can be used with multiple reference elements to obtain optimal output.
If the sampling frequency of the data is sufficient, any nonlinear relation can be represented as a combination of straight lines. Therefore, by using a suitable segmentation technique, it is possible to identify outliers in any data series. This will allow for detecting outliers in a process-oriented data set. Therefore, to bring a data series into a form that is suitable for our method, an intelligent segmentation technique is necessary.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The University of Ruhuna, Sri Lanka, provided the paper processing charges of this paper. The German Academic Exchange Service (German: Deutscher Akademischer Austauschdienst) financed this work.
References
[1] R. J. Beckman and R. D. Cook, "Outlier..........s," Technometrics, vol. 25, no. 2, pp. 119–149, 1983.
[2] F. Molinari, "Missing treatments," Journal of Business and Economic Statistics, vol. 28, no. 1, pp. 82–95, 2010.
[3] J. Qin, B. Zhang, and D. H. Y. Leung, "Empirical likelihood in missing data problems," Journal of the American Statistical Association, vol. 104, no. 488, pp. 1492–1503, 2009.
[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.
[5] Z. X. Niu, S. Shi, J. Sun, and X. He, "A survey of outlier detection methodologies and their applications," in Artificial Intelligence and Computational Intelligence, vol. 7002 of Lecture Notes in Computer Science, pp. 380–387, 2011.
[6] W. Jin, A. K. H. Tung, and J. Han, "Mining top-n local outliers in large databases," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 293–298, ACM, San Francisco, Calif, USA, August 2001.
[7] H.-P. Kriegel, M. S. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 444–452, ACM, Las Vegas, Nev, USA, August 2008.
[8] I. Ben-Gal, "Outlier detection," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 131–146, Springer, New York, NY, USA, 2005.
[9] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, December 2002.
[10] H. Fan, O. R. Zaïane, A. Foss, and J. Wu, "A nonparametric outlier detection for effectively discovering top-N outliers from engineering data," in Advances in Knowledge Discovery and Data Mining, W.-K. Ng, M. Kitsuregawa, J. Li, and K. Chang, Eds., vol. 3918 of Lecture Notes in Computer Science, pp. 557–566, Springer, Berlin, Germany, 2006.
[11] J.-S. Lee, "Digital image smoothing and the sigma filter," Computer Vision, Graphics, and Image Processing, vol. 24, no. 2, pp. 255–269, 1983.
[12] A. Gelb, Applied Optimal Estimation, MIT Press, 1974.
[13] D. Sierociuk, I. Tejado, and B. M. Vinagre, "Improved fractional Kalman filter and its application to estimation over lossy networks," Signal Processing, vol. 91, no. 3, pp. 542–552, 2011.
[14] P. H. Abreu, J. Xavier, D. C. Silva, L. P. Reis, and M. Petry, "Using Kalman filters to reduce noise from RFID location system," The Scientific World Journal, vol. 2014, Article ID 796279, 9 pages, 2014.
[15] H. Liu, S. Shah, and W. Jiang, "On-line outlier detection and data cleaning," Computers and Chemical Engineering, vol. 28, no. 9, pp. 1635–1647, 2004.
[16] T. D. Pigott, "A review of methods for missing data," Educational Research and Evaluation, vol. 7, no. 4, pp. 353–383, 2001.
[17] M. Nakai and W. Ke, "Review of the methods for handling missing data in longitudinal data analysis," International Journal of Mathematical Analysis, vol. 5, no. 1–4, pp. 1–13, 2011.
[18] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.
[19] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, L. House, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Germany, 2004.
[20] A. N. Baraldi and C. K. Enders, "An introduction to modern missing data analyses," Journal of School Psychology, vol. 48, no. 1, pp. 5–37, 2010.
[21] J. Tian, B. Yu, D. Yu, and S. Ma, "Clustering-based multiple imputation via gray relational analysis for missing data and its application to aerospace field," The Scientific World Journal, vol. 2013, Article ID 720392, 10 pages, 2013.
[22] S. Zhaowei, Z. Lingfeng, M. Shangjun, F. Bin, and Z. Taiping, "Incomplete time series prediction using max-margin classification of data with absent features," Mathematical Problems in Engineering, vol. 2010, Article ID 513810, 14 pages, 2010.
[23] J.-F. Cai, Z. Shen, and G.-B. Ye, "Approximation of frame based missing data recovery," Applied and Computational Harmonic Analysis, vol. 31, no. 2, pp. 185–204, 2011.
[24] B. L. Wiens and G. K. Rosenkranz, "Missing data in noninferiority trials," Statistics in Biopharmaceutical Research, vol. 5, no. 4, pp. 383–393, 2013.
[25] Aryabhata, The Aryabhatiya of Aryabhata: An Ancient Indian Work on Mathematics and Astronomy, vol. 1, Kessinger Publishing, 2006.
[26] F. E. Grubbs, "Procedures for detecting outlying observations in samples," Technometrics, vol. 11, no. 1, pp. 1–21, 1969.
[27] B. Rosner, "On the detection of many outliers," Technometrics, vol. 17, no. 2, pp. 221–227, 1975.
[28] R. B. Jain, "Percentage points of many-outlier detection procedures," Technometrics, vol. 23, no. 1, pp. 71–75, 1981.
[29] S. P. Verma, L. Díaz-González, M. Rosales-Rivera, and A. Quiroz-Ruiz, "Comparative performance of four single extreme outlier discordancy tests from Monte Carlo simulations," The Scientific World Journal, vol. 2014, Article ID 746451, 27 pages, 2014.
Table 2: Sample calculations illustrating the relation between 2/n and MMS_max and MMS_min.

a_n        | Data set 1 | Data set 2 | Data set 3 | Data set 4 | Data set 5
a_1        | 100        | 100        | 100        | 99.99      | 1
a_2        | 101        | 101        | 101        | 101        | 101
a_3        | 102        | 102        | 102        | 102        | 102
a_4        | 103        | 103        | 103        | 103        | 103
a_5        | 104        | 104.01     | 204        | 104        | 104
MMS (Max)  | 0.4        | 0.401      | 0.945      | 0.399      | 0.254
MMS (Min)  | 0.4        | 0.399      | 0.254      | 0.401      | 0.945
2/n        | 0.4        | 0.4        | 0.4        | 0.4        | 0.4
Outlier    | -          | Yes-Max    | Yes-Max    | Yes-Min    | Yes-Min
[Figure 2 appears here: the interval (0, 1] for R_w, marked with "maximum outliers" and "minimum outliers" regions around R_0 = 2/n, shown both for the original series and for the complement series, whose maximum outliers represent the minimum outliers of the original series.]

Figure 2: Range of R_w for the original series and for the complement of the original series.
The following equation shows an overview of the MMS process:

    MMS_max = (a_max - a_min) / (S_n - a_min * n):
        if MMS_max > (2/n) + w_1, the maximum is the outlier;
        if MMS_max <= (2/n) + w_1, no outlier is indicated;
    or
    MMS_min = (a_max - a_min) / (a_max * n - S_n):
        if MMS_min > (2/n) + w_1, the minimum is the outlier;
        if MMS_min <= (2/n) + w_1, no outlier is indicated,
    (12)
and Table 2 shows sample calculations using (10) and (11) for the same data sets as in Table 1.
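As an illustration, the ratio test of (10)-(12) can be sketched in a few lines of Python; the helper name `mms` is ours for this sketch (the paper's implementation is in C++), and the example values are taken from Table 2:

```python
def mms(window):
    """MMS ratios for one window: each equals 2/n for a pure
    arithmetic progression and grows when the max/min deviates."""
    n = len(window)
    a_max, a_min, s_n = max(window), min(window), sum(window)
    spread = a_max - a_min                  # assumes a non-constant window
    mms_max = spread / (s_n - a_min * n)    # large => maximum is the outlier
    mms_min = spread / (a_max * n - s_n)    # large => minimum is the outlier
    return mms_max, mms_min

print(mms([100, 101, 102, 103, 104]))       # outlier-free AP: (0.4, 0.4) = 2/n
hi, lo = mms([100, 101, 102, 103, 104.01])  # data set 2 of Table 2
print(round(hi, 3), round(lo, 3))           # 0.401 0.399
```

For an outlier-free arithmetic progression both ratios equal 2/n exactly, which is the floor/ceiling property the method builds on.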
2.6. Challenge 5: How to Deal with Removed Outliers/Missing Values. In a series there can be initial missing values. In addition, if there is no replacement after removing an outlier, this also creates a missing-value environment. If there is no filling, the elements after the removed element are transformed into other values, destroying the original relationship of the elements (Figure 3). These transformed values become outliers in relation to the original data. Therefore, for using the relation of the AP, it is compulsory to maintain the original relation of the data even after removing an outlier. Thus, any rejection technique is not feasible. To maintain the original relation, one possible way is replacing the missing value. However, the data we are considering contain a considerable amount of outliers. Therefore, we cannot guarantee that an element derived from existing elements is not an outlier.
To overcome this problem, we considered two different options: (1) recalculate only the data points after (or before) the removed or missing element, thereby maintaining the initial angle in relation to a certain point, or (2) transform the elements into a new series in which the missing value has no effect.
2.6.1. Recalculate the Data Points after (or before) Removed and Missing Elements. If there is a missing element, the subsequent elements will be shifted horizontally and transformed into wrong values in relation to the current index of the elements (Figure 3). However, angular shifting will not introduce such an error (Figure 4).
In Figure 4, the plot consists of elements a_0 to a_{r+1} (r ∈ R+), and the element a_r at index r needs to be removed. After removing element r, element r+1 becomes element r, element r+2 becomes element r+1, and so on. However, by shifting while maintaining the same angle with respect to a certain reference element (e.g., the first element), the same form of the series can be maintained. Equation (13) shows the new value after angular shifting. We used this technique with the MMS algorithm to recalculate the series after (or before) missing values or removed elements:
    B_r T_r = (B_{r+1} C_{r+1} / A B_{r+1}) * A B_r = ((a_{r+1} - a_0) / (r + 1)) * r,
    (a_{r+1})_new = a_0 + B_r T_r.
    (13)
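A small sketch of this angular recalculation, generalised from (13) to re-map every element after the removed index (the function name is ours, and the code assumes the first element a_0 as the reference):

```python
def shift_after_removal(values, r):
    """Remove the element at index r and angularly re-map each later
    element onto its new index, preserving its angle from a_0
    (eq. (13) applied to every shifted element in turn)."""
    a0 = values[0]
    out = values[:r]                     # elements before r are untouched
    for old_idx in range(r + 1, len(values)):
        new_idx = old_idx - 1            # each later element moves one slot left
        out.append(a0 + (values[old_idx] - a0) / old_idx * new_idx)
    return out

series = [0, 10, 20, 35, 40]             # linear except for index 3
print(shift_after_removal(series, 3))    # [0, 10, 20, 30.0]
```

A plain horizontal shift would have left 40 at index 3 and broken the line y = 10x; the angular shift maps it to 30, so the remaining elements stay on the original line.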
2.6.2. Transformation of Data to a Constant Value. A series with a constant value (of the form y = c, where c is a constant) is a series that is unaffected by missing values. Because of that, if it is possible to transform any linear series to the form y = c, the transformed series is free of any effect of missing values. After that, the transformed series can be used for outlier detection.
Let y^T be a linear series where y^T_k = y_k - y_1 and x^T_k = x_k - x_1, x_k is the initial index of the elements, and y_k is the kth element of the series, k = 1, 2, ..., n. The gradient of the line (m) is given by Σ_{k=1}^{n} y^T_k / Σ_{k=1}^{n} x^T_k. If one element (e.g., the first element) is (0, 0), this relation is always true even with missing values. The element (0, 0) can be considered as the reference element. Thus y^T is a series with first element (0, 0) and a gradient m that can be calculated even with missing values. It is also possible to derive a new series y' where y'_k = x_k * m. If there are no outliers, y^T and y' coincide and y^T - y' = 0. If y^TT = y^T - y', then y^TT is of the form y = c without any influence from missing values. Therefore, this is another method to overcome missing values without replacing them (Figure 5).
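The transformation above can be sketched as follows; this is a minimal illustrative version (the function name is ours) that works on (index, value) pairs so that gaps in the index stay visible, and it assumes the first pair is the trusted (0, 0) reference:

```python
def to_constant_form(points):
    """Transform (x, y) pairs (gaps allowed in x) into y_TT = y_T - y',
    which is 0 for every non-outlier element of a linear series."""
    x0, y0 = points[0]                    # trusted reference element
    xt = [x - x0 for x, _ in points]      # x_T_k = x_k - x_1
    yt = [y - y0 for _, y in points]      # y_T_k = y_k - y_1
    m = sum(yt) / sum(xt)                 # gradient, valid despite missing x
    return [yk - m * xk for xk, yk in zip(xt, yt)]   # y_TT

# Indices 2 and 3 are missing, yet the series is still recognised as y = 5x.
pts = [(0, 0), (1, 5), (4, 20), (5, 25)]
print(to_constant_form(pts))  # [0.0, 0.0, 0.0, 0.0]
```

An outlier among the pairs would appear as a nonzero residual in the returned series, which is exactly what EMMS exploits below.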
2.7. Challenge 6: Locate Outliers That Are Neither the Maximum Nor the Minimum of the Series. When the outlier is
[Figure 3 appears here: two value-versus-index panels (a) and (b); an arrow marks the removed element at index r.]

Figure 3: (a) Data set with an outlier at index r. (b) Value autotransformation effect after removing the outlier at index r without replacement, where circles correspond to elements after removing the outlier, red triangles to the expected (correct) elements, and squares to the initial values of the shifted elements after removing the outlier.
Table 3: "Bad Detection": the wrong (minimum) element is identified as the outlier.

a_n       | Data set 6
a_1       | 100
a_2       | 101
a_3       | 102
a_4       | 103.6
a_5       | 104
MMS (Max) | 0.377
MMS (Min) | 0.425
2/n       | 0.4
Outlier   | Yes-Min
neither the maximum nor the minimum, MMS is unable to locate the outlier (Table 3). We named this phenomenon "Bad Detection." When R_w reaches the "Bad Detection Level," MMS cannot be applied. To overcome this situation, we introduced an improved version of MMS, the enhanced MMS (EMMS), based on the missing data imputation technique in Section 2.6.2.
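The "Bad Detection" case of Table 3 can be reproduced directly (a sketch; variable names are ours). Both ratios remain computable, but the larger one points at the minimum, although the true outlier is the interior element 103.6:

```python
# Data set 6 from Table 3: the true outlier (103.6) is neither the
# maximum nor the minimum, so MMS misfires and blames the minimum.
window = [100, 101, 102, 103.6, 104]
n, a_max, a_min, s_n = len(window), max(window), min(window), sum(window)
mms_max = (a_max - a_min) / (s_n - a_min * n)
mms_min = (a_max - a_min) / (a_max * n - s_n)
print(round(mms_max, 3), round(mms_min, 3), 2 / n)
# 0.377 0.426 0.4  (Table 3 truncates 0.4255... to 0.425)
```

Since mms_min > 2/n > mms_max, plain MMS would flag the minimum (100), which is not an outlier; this is the situation EMMS is designed to handle.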
EMMS is expressed as

    EMMS_max = (a^TT_max - a^TT_min) / (S^TT_n - a^TT_min * n),  a^TT_max ≠ 0,  (14)

    EMMS_min = (a^TT_max - a^TT_min) / (a^TT_max * n - S^TT_n),  a^TT_max ≠ 0,  (15)

where a^TT_k = |a^T_k - x_k * (G_aT / G_x)|, a^T_k = a_k - a_0, x_k is the index of the data, a_k is the kth term of the series, k = 0, 1, ..., n - 1, n is the number of elements in the current window, G_aT = Σ_{k=0}^{n-1} a^T_k, G_x = Σ_{k=0}^{n-1} x_k, and S^TT_n = Σ_{k=0}^{n-1} a^TT_k.
[Figure 4 appears here: a value-versus-index sketch with points a_0, ..., a_{r-1}, a_r, a_{r+1}, a_{r+2}, angles θ_{r+1} and θ_{r+2} at A, and segments B_r, B_{r+1}, C_{r+1}, T_r, T_{r+1}; the element at index r is removed.]

Figure 4: Solution for the value autotransformation phenomenon: use angular shifting instead of a horizontal shift, where (×) corresponds to the horizontal shift and (•) corresponds to the angular shift.
The terms a^TT_k are always ≥ 0; thus a^TT_min = 0, and (14) and (15) simplify to

    EMMS_max = a^TT_max / S^TT_n,  a^TT_max ≠ 0,  (16)

    EMMS_min = a^TT_max / (a^TT_max * n - S^TT_n),  a^TT_max ≠ 0.  (17)
If there are outliers, EMMS_min > 2/n or EMMS_max > 2/n, and the greater value represents the outlier. Table 4 shows
Table 4: EMMS for identifying an outlier.

x (n - 1) | y (a_n) | y^T (a_n - a_1) | y^TT (|y^T - x_n * (G^T_y / G_x)|)
0         | 100     | 0.000           | 0.00
1         | 101     | 1.000           | 0.06
2         | 102     | 2.000           | 0.12
3         | 103.6   | 3.600           | 0.42
4         | 104     | 4.000           | 0.24

G_x = 2, G^T_y = 2.120 (so G^T_y / G_x = 1.06); EMMS (Max) = 0.500; EMMS (Min) = 0.333; 2/n = 0.4; Outlier: Yes-Max.
[Figure 5 appears here: a value-versus-index sketch over indices 0, 1, 2, ..., n - 1, n.]

Figure 5: Transformation of data to a constant value to overcome the missing data problem, where red diamonds correspond to the y = f(x) form, yellow squares to the y^T = f(x) - f(x_0) form, crosses to the y' = m * x_i form (m = Σ_{i=0}^{n} y^T_i / Σ_{i=0}^{n} x^T_i), red circles to y^TT = y^T - y', and open circles to missing values.
an example calculation of EMMS, and the following equation shows an overview of the EMMS process:
    EMMS_max = a^TT_max / S^TT_n:
        if EMMS_max > (2/n) + w_2, the maximum is the outlier;
        if EMMS_max <= (2/n) + w_2, no outlier is indicated;
    or
    EMMS_min = a^TT_max / (a^TT_max * n - S^TT_n):
        if EMMS_min > (2/n) + w_2, the minimum is the outlier;
        if EMMS_min <= (2/n) + w_2, no outlier is indicated.
    (18)
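Equations (16)-(18) can be sketched together in one helper (the name `emms` is ours; the sketch assumes a gap-free window and the first element as reference), reproducing the Table 4 numbers:

```python
def emms(window):
    """Simplified EMMS of eqs. (16)-(17) for a gap-free window,
    using the first element as the (0, 0) reference."""
    n = len(window)
    x = list(range(n))
    at = [a - window[0] for a in window]          # a_T_k = a_k - a_0
    ratio = sum(at) / sum(x)                      # G_aT / G_x
    att = [abs(a - xk * ratio) for a, xk in zip(at, x)]   # a_TT_k
    att_max, s_tt = max(att), sum(att)
    return att_max / s_tt, att_max / (att_max * n - s_tt)

e_max, e_min = emms([100, 101, 102, 103.6, 104])  # Table 4 data set
print(round(e_max, 3), round(e_min, 3))           # 0.5 0.333 (2/n = 0.4)
```

Here EMMS_max = 0.5 > 2/n, so the interior element 103.6 is correctly flagged, which plain MMS failed to do for the same window in Table 3.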
However, EMMS uses information derived from the existing data. If there are biased values, this may lead to biased information. Because of that, direct application of EMMS is not good practice. Hence, significant outliers should first be removed using MMS before applying EMMS.
[Figure 6 appears here: a flowchart. Start → take the consecutive series (n = number of elements, k = k_1) → R_w = (2/n)(1 + k_1), apply MMS → if an outlier is detected, remove it, set n = n - 1, recalculate the series, and repeat; otherwise set k = k_2, R_w = (2/n)(1 + k_2), apply EMMS → if an outlier is detected, remove it, set n = n - 1, and repeat; otherwise stop.]

Figure 6: Implementation of MMS and EMMS. Initially the algorithm checks for significant outliers using MMS. After removing all significant outliers, it removes the nonsignificant outliers using EMMS. There is no removed data imputation in relation to either MMS or EMMS.
2.8. Challenge 7: Determining the Outlier Detection Criteria (R_w). The value R_w is the factor that determines the outliers; w = 0 (R_0 = 2/n) represents exactly a line, and w > 0 represents a linear border of a certain width. In this section we propose several possible methods that can be used to determine the outlier detection criteria.

2.8.1. Express the Value "w" as f(1/n). If the value w is f(1/n), then w = 2k/n, with k ≤ (n/2) - 1 and k ∈ R+. Then R_w = 2/n + 2k/n:

    R_w = (2/n) * (1 + k),
    R_w / R_0 = 1 + k (= constant).
    (19)

When the MMS or the EMMS is greater than R_w of (19), this implies the existence of outliers. Because R_w / R_0 is constant, it gives a standard for R_w; the determination of k still depends on knowledge of the domain. Figure 6 shows an algorithm based on this technique.
2.8.2. When the First and the Last Items Are Nonoutliers. In the total process, the "Bad Detection Level" is the most important criterion. If R_w of MMS is less than the "Bad Detection Level," it is possible to identify nonoutliers as outliers, as mentioned in Section 2.7. If there is preknowledge about the outliers, it is possible to use a safe value for MMS. Otherwise, there is no 100% guarantee on the "Bad Detection Level."
Table 5: Different environments used to validate the new method.

Type of the original data set      | Number of elements | Type of outliers        | Reference (first) element is an outlier | Initial missing values
Increment, constant, and decrement | 10 to 1000         | Non-Gaussian, Gaussian  | Yes, no                                 | Yes, no
[Figure 7 appears here: a flowchart extending Figure 6. Start → get a window whose first and last terms are nonoutliers (n = number of elements, k = k_1) → R_w = (2/n)(1 + k_1), apply MMS → if the detected outlier is the first or the last term, stop; otherwise remove it, set n = n - 1, recalculate the series, and repeat; when no outlier is detected, set k = k_2, R_w = (2/n)(1 + k_2) and apply EMMS with the same first/last-term check.]

Figure 7: Outlier detection method including the "Bad Detection Level" detection technique. The first and the last data points of the window must be nonoutliers. If the first or the last element is identified as an outlier, a contradictory situation arises; thus this point can be considered the terminating point of MMS and EMMS.
However, when the first and the last elements are not outliers, the "Bad Detection Level" can be detected automatically. If the first or the last element is identified as an outlier, a contradictory situation arises. Thus, this point can be considered the terminating point of MMS and EMMS. The decision diagram elaborated in Figure 7 expresses the new outlier detection method including the "Bad Detection Level" detection technique.
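Putting the pieces together, the decision flow of Figures 6 and 7 can be sketched as below. This is a minimal illustrative version under simplifying assumptions, not the paper's C++ implementation: the window is gap-free, the first and last elements are trusted, no angular recalculation is performed after a removal, and EMMS removes the element with the largest transformed residual; all names are ours.

```python
def detect_outliers(series, k1=0.5, k2=0.01):
    """Sketch of the MMS -> EMMS flow of Figs. 6/7 for a gap-free window.
    Returns (significant, nonsignificant) original indices."""
    idx = list(range(len(series)))
    vals = list(series)
    significant, nonsignificant = [], []

    # Pass 1: MMS (eqs. (10)-(12)) repeatedly removes the current maximum
    # or minimum while its ratio exceeds R_w = (2/n)(1 + k1), eq. (19).
    while len(vals) > 2:
        n, hi, lo, s = len(vals), max(vals), min(vals), sum(vals)
        if hi == lo:
            break                                   # constant window
        r_w = (2 / n) * (1 + k1)
        mms_max = (hi - lo) / (s - lo * n)
        mms_min = (hi - lo) / (hi * n - s)
        if max(mms_max, mms_min) <= r_w:
            break                                   # no significant outlier left
        j = vals.index(hi if mms_max > mms_min else lo)
        if j in (0, len(vals) - 1):
            break                                   # "Bad Detection Level" (Fig. 7)
        significant.append(idx[j])
        del idx[j], vals[j]

    # Pass 2: EMMS (eqs. (16)-(18)) on the constant-value transform a_TT;
    # the element with the largest residual is removed each round.
    while len(vals) > 2:
        n = len(vals)
        at = [v - vals[0] for v in vals]
        xt = [i - idx[0] for i in idx]              # gaps stay visible
        ratio = sum(at) / sum(xt)
        att = [abs(a - x * ratio) for a, x in zip(at, xt)]
        hi, s = max(att), sum(att)
        r_w = (2 / n) * (1 + k2)
        if hi == 0 or max(hi / s, hi / (hi * n - s)) <= r_w:
            break
        j = att.index(hi)
        if j in (0, len(vals) - 1):
            break                                   # first/last term flagged
        nonsignificant.append(idx[j])
        del idx[j], vals[j]

    return significant, nonsignificant

# 3000 at index 3 is a significant outlier; 73 at index 7 a nonsignificant one.
print(detect_outliers([0, 10, 20, 3000, 40, 50, 60, 73, 80, 90]))  # ([3], [7])
```

The two k values play the roles of k_1 and k_2 above; MMS catches the gross deviation first, and only then does EMMS resolve the mild one that is neither the window maximum nor minimum.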
2.9. Validate the Method. We implemented the MMS (with recalculation after an outlier is removed) and the EMMS in C++ and conducted the validation process. For the recalculation process, the existing first element of the window was the reference element, and we always used the original value of the element (not the current updated value). To validate the method, we used artificial data sets of different sizes (10 to 1000 elements) representing increasing, decreasing, and constant lines. Then 50% of the items of those data sets were replaced with very small and very large outliers (±10e-2 to ±10e+2 times the correct value). We checked the data sets for all the environment combinations shown in Table 5. The outlier detection criteria were determined based on (19). For all data sets the same k value was used (for MMS k = 0.5 and for EMMS k = 0.01). Then the percentage of correctly and falsely detected nonoutliers in relation to the number of actual nonoutliers and the percentage of correctly and falsely detected outliers from the total number of outliers (small and large outliers) were determined.
2.10. Evaluation Using Real Data. To check the best linear fitting identification capability, the algorithm was tested using several real data sets which were automatically recorded at a frequency of twelve data points per day (i.e., every other hour) from a biogas plant over a period of seven months. Among the different parameters, we selected the H2 content, measured in ppm, which we expected to maintain linear behaviour during stable operation. We selected seven segments of different sizes for evaluating the algorithm. In some data sets there were initial missing elements. We set the R_w for MMS and EMMS by analysing the first and the third data sets. For the recalculation process, the existing first element of the window was the reference element, and we always used the original value of the elements (not the current updated value). Then the percentage of correctly and falsely detected nonoutliers in relation to the total number of nonoutliers and the percentage of correctly and falsely detected outliers from the total number of outliers (small and large outliers) were determined.
We decided to use the LSM, the Sigma filter, and Grubbs' test [26-29], also known as the maximum normed residual test or "extreme studentized deviate" (ESD) test, to compare our results. We selected Grubbs' test since it has nearly the same formulation as our method. We checked all the biogas data using the abovementioned methods. We used each of the data segments as a single window. First, we checked the ability of each method to identify the general trend of the series. Then we checked the amount of correctly and falsely detected outliers and nonoutliers for each method in relation to the general trend.
3. Results and Discussion
Results related to the validation show that when the reference element (the first element) was not an outlier, the algorithm was capable of identifying all outliers with 0% error regardless of the type of outliers (Gaussian or non-Gaussian) (Figure 8). If the outliers were Gaussian, there were no significant outliers,
[Figure 8 appears here: six value-versus-index panels of ten-element series: (a) increment, non-Gaussian outliers; (b) decrement, non-Gaussian outliers; (c) constant, non-Gaussian outliers; (d) increment, Gaussian outliers; (e) decrement, Gaussian outliers; (f) constant, Gaussian outliers.]

Figure 8: Outlier detection from data sets with ten elements. The first element is the reference element, which is not an outlier; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is not an outlier, the new method is capable of locating all outliers. When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).
and MMS automatically became inactive (Figures 8(d), 8(e), and 8(f)). When the first few elements were outliers and the outliers were non-Gaussian, MMS detected the significant outliers correctly (Figures 9(a), 9(b), and 9(c)). However, EMMS was unable to locate the nonsignificant outliers when the first element for EMMS was an outlier (Figures 9(a) and 9(c)). If the reference element for EMMS was not an outlier, it was still possible to achieve correct results (Figure 9(b)). Though it was impossible to locate all nonoutliers, the detected nonoutliers were 100% correct detections. These values can be used to estimate the other values using methods like the LSM, since all the remaining data are now cleaned. In
[Figure 9 appears here: six value-versus-index panels of ten-element series: (a) increment, non-Gaussian outliers; (b) decrement, non-Gaussian outliers; (c) constant, non-Gaussian outliers; (d) increment, Gaussian outliers; (e) decrement, Gaussian outliers; (f) constant, Gaussian outliers.]

Figure 9: Outlier detection from data sets with ten elements. The first element is the reference element, which is an outlier; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, green squares to nonoutliers, and black arrows to wrong detections. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is an outlier and the outliers are non-Gaussian, the new method identifies only the significant outliers ((a), (b), (c)). When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).
general, it is fair to state that (1) when the reference element is not an outlier, the method is capable of identifying all outliers, and (2) when the first few elements of the series are outliers and the outliers are non-Gaussian, the method is capable of identifying only the significant outliers and part of the correct elements.
When the first few elements (the reference elements for both MMS and EMMS) were outliers and the outlier distribution was Gaussian, outlier detection was poor (Figures 9(d), 9(e), and 9(f)). Due to the Gaussian distribution of outliers, MMS was inactive, and it was not possible to identify the large outliers. Most importantly, the results highlighted
[Figure 10 appears here: two value-versus-index panels with 1000 elements each.]

Figure 10: Two artificial data samples with 1000 elements each, including missing-value regions of 50, 100, 100, and 50 elements (300 in total). The first element is the reference element, which is not an outlier; (a) corresponds to a data set with non-Gaussian outliers and (b) to a data set with nearly Gaussian outliers; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. The new method was able to identify all the elements related to the line with 0% error.
the importance of the reference element. If the reference element for MMS and EMMS was not an outlier, it guaranteed good results regardless of other factors.
In the methodology we derived the method based on the first element. However, it is also possible to use any other element as the reference point and modify the method accordingly. We considered the simplest situation, where the first element is not an outlier. Therefore, if it is possible to segment the data excluding extreme outliers at the beginning, the method provides accurate outlier detection. Another possibility is to replace the first element with an already known element. This leads to another possibility for applying the method: if we know only a single correct element, using that element as the reference element, with the method modified according to that reference element, can yield very accurate results.
Some model-based approaches demand a training data set for correct output. In contrast, this method requires only one correct element to produce a correct output. In addition, it is possible to use multiple reference points and consider the best fitting, for example: (a) consider each point in the first x% (e.g., 10%) of data points as a reference point, or (b) consider all data points as reference points. Furthermore, it is important to distinguish the purposes of MMS and EMMS: MMS removes only the significant outliers, while EMMS removes the nonsignificant outliers. Depending on the requirement, MMS and/or EMMS can be used to remove outliers.
The results show that the new method is a good solution for managing missing values. Figure 10 shows two data sets with 1000 elements each. Each data set contains missing-value regions of 50, 100, 100, and 50 elements (300 in total). When the first element was not an outlier, the new method was able to identify all the elements related to the line with 0% error.
In the real world, it is not possible to find nonoutliers that exactly agree with linear regression. Therefore, 100% accuracy is inapplicable. However, it is very important to have a significant-outlier-free data set. The new method guaranteed a significant-outlier-free data set when the outliers were non-Gaussian. Furthermore, in real-world situations, data outliers do not always follow a Gaussian distribution. We therefore expect the new method to be applicable to the majority of outlier detection applications. Our new method is an effective solution here, whereas the most common methods, LSM and the sigma filter, need Gaussian outliers. Some methods, like the sigma filter, cannot be applied directly to a given data segment, and further segmentation (windowing) is required for better results. In contrast, the new method is capable of locating nonoutliers automatically, in increment, decrement, or constant form, regardless of the size of the window.
Results related to the biogas data confirmed the abovementioned idea and showed that the algorithm clearly identifies three regions: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers within a data segment (Figure 11). In addition, the results showed that the nonoutliers follow a linear path. Furthermore, the width of the regions can be tuned by changing the relevant R_w values. Figure 11 shows some selected results of the biogas data for a k value of 0.2 for MMS and a k value of 0.1 for EMMS.
One of the interesting observations was the ability of the algorithm to continue linear detection even with noncontinuous clusters (Figures 11(b) and 11(e)). In all data segments, no false detection occurred (there were no outliers in nonoutlier regions and vice versa). Most importantly, the new method required no further windowing, and nonoutliers were detected independent of the window size.
When the general trend was constant and the elements were in Gaussian distribution, the Sigma filter and LSM were able to identify the linear trend. However, for series with biased elements, both methods failed to identify the general trend. When the general trend was increment or decrement, the Sigma filter failed to identify the general trend (a further
The Scientific World Journal 11
[Figure 11: H2 content of biogas (ppm) versus index, six panels. (a) Elements: 334; (b) Elements: 372; (c) Elements: 350; (d) Elements: 352; (e) Elements: 356; (f) Elements: 372.]
Figure 11: Results related to real biogas data with different sizes of data sets. The first element is the reference element, which is assumed not to be an outlier. Results showed that the algorithm clearly identifies three regions: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers within each data segment. Most importantly, all the nonoutliers lay within a linear border, where red triangle corresponds to outliers detected by MMS, yellow circle corresponds to outliers detected by EMMS, and green square corresponds to nonoutliers. The value of k for MMS and EMMS is 0.2 and 0.1, respectively.
segment would give a better result, but we used the whole window). The new method was capable of locating 4% to 45% of elements as outliers with 0% error. Grubbs' test was capable of identifying only a very small share of elements as outliers (0%–1.7%), even with a significance level of 0.05. However, all of its detected outliers were significant, and no wrong detections were reported.
4. Conclusions and Outlook
This paper introduced a new outlier detection method using the relation of the sum of the elements of an arithmetic progression. The results of this work show that the new method is a robust solution for outlier detection in a data set with missing elements. The method is capable of identifying
both significant and nonsignificant outliers when the first value of the data set is not an outlier. Most importantly, the method is a solution for identifying significant outliers in a series with outliers in non-Gaussian distribution. In addition, the outlier detection is nonparametric, has floor and ceiling values, and does not require standardization. When the reference elements are unknown, the method can be used with multiple reference elements to gain optimal output.
If the frequency of the data is sufficient, any nonlinear relation can be represented as a combination of straight lines. Therefore, by using a suitable segmentation technique, it is possible to identify outliers in any data series. This will allow for detecting outliers in a process-oriented data set. Hence, to bring a data series into a form that is suitable for our method, an intelligent segmentation technique is necessary.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The University of Ruhuna, Sri Lanka, provided the paper processing charges of this paper. The German Academic Exchange Service (German: Deutscher Akademischer Austauschdienst) financed this work.
References
[1] R. J. Beckman and R. D. Cook, "Outlier…s," Technometrics, vol. 25, no. 2, pp. 119–149, 1983.
[2] F. Molinari, "Missing treatments," Journal of Business and Economic Statistics, vol. 28, no. 1, pp. 82–95, 2010.
[3] J. Qin, B. Zhang, and D. H. Y. Leung, "Empirical likelihood in missing data problems," Journal of the American Statistical Association, vol. 104, no. 488, pp. 1492–1503, 2009.
[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.
[5] Z. X. Niu, S. Shi, J. Sun, and X. He, "A survey of outlier detection methodologies and their applications," in Artificial Intelligence and Computational Intelligence, vol. 7002 of Lecture Notes in Computer Science, pp. 380–387, 2011.
[6] W. Jin, A. K. H. Tung, and J. Han, "Mining top-n local outliers in large databases," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 293–298, ACM, San Francisco, Calif, USA, August 2001.
[7] H.-P. Kriegel, M. S. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 444–452, ACM, Las Vegas, Nev, USA, August 2008.
[8] I. Ben-Gal, "Outlier detection," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 131–146, Springer, New York, NY, USA, 2005.
[9] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, December 2002.
[10] H. Fan, O. R. Zaïane, A. Foss, and J. Wu, "A nonparametric outlier detection for effectively discovering top-N outliers from engineering data," in Advances in Knowledge Discovery and Data Mining, W.-K. Ng, M. Kitsuregawa, J. Li, and K. Chang, Eds., vol. 3918 of Lecture Notes in Computer Science, pp. 557–566, Springer, Berlin, Germany, 2006.
[11] J.-S. Lee, "Digital image smoothing and the sigma filter," Computer Vision, Graphics, and Image Processing, vol. 24, no. 2, pp. 255–269, 1983.
[12] A. Gelb, Applied Optimal Estimation, MIT Press, 1974.
[13] D. Sierociuk, I. Tejado, and B. M. Vinagre, "Improved fractional Kalman filter and its application to estimation over lossy networks," Signal Processing, vol. 91, no. 3, pp. 542–552, 2011.
[14] P. H. Abreu, J. Xavier, D. C. Silva, L. P. Reis, and M. Petry, "Using Kalman filters to reduce noise from RFID location system," The Scientific World Journal, vol. 2014, Article ID 796279, 9 pages, 2014.
[15] H. Liu, S. Shah, and W. Jiang, "On-line outlier detection and data cleaning," Computers and Chemical Engineering, vol. 28, no. 9, pp. 1635–1647, 2004.
[16] T. D. Pigott, "A review of methods for missing data," Educational Research and Evaluation, vol. 7, no. 4, pp. 353–383, 2001.
[17] M. Nakai and W. Ke, "Review of the methods for handling missing data in longitudinal data analysis," International Journal of Mathematical Analysis, vol. 5, no. 1–4, pp. 1–13, 2011.
[18] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.
[19] E. Acuña and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, L. House, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Germany, 2004.
[20] A. N. Baraldi and C. K. Enders, "An introduction to modern missing data analyses," Journal of School Psychology, vol. 48, no. 1, pp. 5–37, 2010.
[21] J. Tian, B. Yu, D. Yu, and S. Ma, "Clustering-based multiple imputation via gray relational analysis for missing data and its application to aerospace field," The Scientific World Journal, vol. 2013, Article ID 720392, 10 pages, 2013.
[22] S. Zhaowei, Z. Lingfeng, M. Shangjun, F. Bin, and Z. Taiping, "Incomplete time series prediction using max-margin classification of data with absent features," Mathematical Problems in Engineering, vol. 2010, Article ID 513810, 14 pages, 2010.
[23] J.-F. Cai, Z. Shen, and G.-B. Ye, "Approximation of frame based missing data recovery," Applied and Computational Harmonic Analysis, vol. 31, no. 2, pp. 185–204, 2011.
[24] B. L. Wiens and G. K. Rosenkranz, "Missing data in noninferiority trials," Statistics in Biopharmaceutical Research, vol. 5, no. 4, pp. 383–393, 2013.
[25] Aryabhata, The Aryabhatiya of Aryabhata: An Ancient Indian Work on Mathematics and Astronomy, vol. 1, Kessinger Publishing, 2006.
[26] F. E. Grubbs, "Procedures for detecting outlying observations in samples," Technometrics, vol. 11, no. 1, pp. 1–21, 1969.
[27] B. Rosner, "On the detection of many outliers," Technometrics, vol. 17, no. 2, pp. 221–227, 1975.
[28] R. B. Jain, "Percentage points of many-outlier detection procedures," Technometrics, vol. 23, no. 1, pp. 71–75, 1981.
[29] S. P. Verma, L. Díaz-González, M. Rosales-Rivera, and A. Quiroz-Ruiz, "Comparative performance of four single extreme outlier discordancy tests from Monte Carlo simulations," The Scientific World Journal, vol. 2014, Article ID 746451, 27 pages, 2014.
Figure 3: (a) Data set with an outlier at index r; (b) value autotransformation effect after removing the outlier at index r without replacement, where circle corresponds to elements after removing the outlier, red triangle corresponds to expected (correct) elements, and square corresponds to initial values of the shifted elements after removing the outlier.
Table 3: "Bad Detection": the wrong (minimum) element is identified as the outlier.

    a_n       Data set
    a_1       100
    a_2       101
    a_3       102
    a_4       103.6
    a_5       104

    MMS_max   0.377
    MMS_min   0.425
    2/n       0.4
    Outlier   Yes (minimum)
neither the maximum nor the minimum, MMS is unable to locate the outlier (Table 3). We named this phenomenon "Bad Detection". When R_w reaches the "Bad Detection Level", MMS cannot be applied. To overcome this situation, we introduced an improved version of MMS, the enhanced MMS (EMMS), based on the missing data imputation technique in Section 2.6.2.
EMMS is expressed as

EMMS_max = (a^TT_max − a^TT_min) / (S^TT_n − a^TT_min · n),  a^TT_max ≠ 0,  (14)

EMMS_min = (a^TT_max − a^TT_min) / (a^TT_max · n − S^TT_n),  a^TT_max ≠ 0,  (15)

where a^TT_k = |a^T_k − x_k (G_aT / G_x)|, a^T_k = a_k − a_0, x_k is the index of the data, a_k is the kth term of the series, k = 0, 1, …, n − 1, n is the number of elements in the current window, G_aT = Σ_{k=0}^{n−1} a^T_k, G_x = Σ_{k=0}^{n−1} x_k, and S^TT_n = Σ_{k=0}^{n−1} a^TT_k ≠ 0.
Figure 4: Solution for the value autotransformation phenomenon: use angular shifting instead of horizontal shifting, where (×) corresponds to horizontal shift and (∙) corresponds to angular shift.
The terms a^TT_k are always ≥ 0; thus, the term a^TT_min = 0. Then (14) and (15) are simplified as

EMMS_max = a^TT_max / S^TT_n,  a^TT_max ≠ 0,  (16)

EMMS_min = a^TT_max / (a^TT_max · n − S^TT_n),  a^TT_max ≠ 0.  (17)
If there are outliers, EMMS_min > 2/n or EMMS_max > 2/n, and the greater value represents the outlier. Table 4 shows
Table 4: EMMS for identifying an outlier.

    x (0 … n − 1)   y (a_n)   y^T (a_n − a_1)   y^TT = |y^T − x_n (G^T_y / G_x)|
    0               100       0.000             0.00
    1               101       1.000             0.06
    2               102       2.000             0.12
    3               103.6     3.600             0.42
    4               104       4.000             0.24

    G_x = 2, G^T_y = 2.120
    EMMS (Max)   0.500
    EMMS (Min)   0.333
    2/n          0.4
    Outlier      Yes (maximum)
Figure 5: Transformation of data to a constant value to overcome the missing data problem, where red diamond corresponds to the y = f(x) form, yellow square corresponds to the y^T = f(x) − f(x_0) form, cross corresponds to the y′ = m · x_i form (m = Σ_{i=0}^{n} y^T_i / Σ_{i=0}^{n} x^T_i), red circle corresponds to y^TT = y^T − y′, and circle corresponds to missing values.
an example calculation of EMMS, and the following expression gives an overview of the EMMS process:

EMMS_max = a^TT_max / S^TT_n:  if > (2/n) + w_2, the maximum is the outlier; if ≤ (2/n) + w_2, it is not.
EMMS_min = a^TT_max / (a^TT_max · n − S^TT_n):  if > (2/n) + w_2, the minimum is the outlier; if ≤ (2/n) + w_2, it is not.  (18)
However, EMMS uses information derived from the existing data. If there are biased values, this may lead to biased information. Because of that, direct application of EMMS is not good practice. Hence, significant outliers should first be removed using MMS before applying EMMS.
Figure 6: Implementation of MMS and EMMS. Initially, the algorithm checks for significant outliers using MMS with R_w = (2/n) · (1 + k1), removing each detected outlier and setting n = n − 1 before rechecking. After removing all significant outliers, the nonsignificant outliers are removed using EMMS with R_w = (2/n) · (1 + k2). There is no removed-data imputation in relation to either MMS or EMMS.
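The flow of Figure 6 can be sketched as an iterative clean-up loop: run MMS with R_w = (2/n)(1 + k1) until no significant outlier remains, then run EMMS with the tighter k2. This is a simplified sketch under assumptions made explicit here: removed elements are simply dropped rather than recalculated or imputed, and the MMS ratios are assumed to take the form of (14) and (15) with a^T_k = a_k − a_0 in place of a^TT_k:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// One detection pass; returns the index of the flagged outlier, or -1.
// useEMMS == false: MMS on the shifted terms aT_k = y_k - y_0.
// useEMMS == true : EMMS on aTT_k = |aT_k - x_k * (GaT / Gx)|.
static long detect(const std::vector<Point>& p, double k, bool useEMMS) {
    const std::size_t n = p.size();
    if (n < 3) return -1;
    std::vector<double> t(n);
    double Gx = 0.0, GaT = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        t[i] = p[i].y - p[0].y;
        Gx += p[i].x;
        GaT += t[i];
    }
    if (useEMMS)
        for (std::size_t i = 0; i < n; ++i)
            t[i] = std::fabs(t[i] - p[i].x * (GaT / Gx));
    double sum = 0.0;
    for (double v : t) sum += v;
    const std::size_t iMax = std::max_element(t.begin(), t.end()) - t.begin();
    const std::size_t iMin = std::min_element(t.begin(), t.end()) - t.begin();
    const double mx = t[iMax], mn = t[iMin];
    if (mx == mn) return -1;                  // constant window, or aTT_max = 0
    const double rMax = (mx - mn) / (sum - mn * n); // ratio against the maximum
    const double rMin = (mx - mn) / (mx * n - sum); // ratio against the minimum
    const double rw = (2.0 / n) * (1.0 + k);        // criterion of (19)
    if (rMax <= rw && rMin <= rw) return -1;
    return static_cast<long>(rMax >= rMin ? iMax : iMin);
}

// Figure 6: remove significant outliers via MMS (k1), then nonsignificant
// ones via EMMS (k2); no removed-data imputation.
static std::vector<Point> cleanSeries(std::vector<Point> p, double k1, double k2) {
    long i;
    while ((i = detect(p, k1, false)) >= 0) p.erase(p.begin() + i);
    while ((i = detect(p, k2, true)) >= 0) p.erase(p.begin() + i);
    return p;
}
```

On a straight line with a single spike, the MMS pass removes the spike; the EMMS pass then finds the remainder perfectly linear (a^TT_max = 0) and stops.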
2.8. Challenge 7: Determining the Outlier Detection Criterion (R_w). The value R_w is the factor that determines the outliers: w = 0 (R_0 = 2/n) represents exactly a line, and w > 0 represents a linear border with a certain width. In this section, we propose several possible methods that can be used to determine the outlier detection criterion.
2.8.1. Express the Value "w" as f(1/n). If the value w is f(1/n), then w = 2k/n, with k ≤ (n/2) − 1 and k ∈ R+. Then R_w = 2/n + 2k/n:

R_w = (2/n) · (1 + k),  R_w / R_0 = 1 + k (= constant).  (19)

When the MMS or the EMMS is greater than R_w of (19), this implies the existence of outliers. Because R_w / R_0 is constant and gives a standard for R_w, the determination of k still depends on knowledge of the domain. Figure 6 shows an algorithm based on this technique.
2.8.2. When the First and the Last Items Are Nonoutliers. In the total process, the "Bad Detection Level" is the most important criterion. If R_w of MMS is less than the "Bad Detection Level", it is possible to identify nonoutliers as outliers, as mentioned in Section 2.7. If there is prior knowledge about the outliers, it is possible to use a safe value for MMS. Otherwise, there is no 100% guarantee on the "Bad Detection Level".
Table 5: Different environments used to validate the new method.

    Type of the original data set:            increment, constant, and decrement
    Number of elements:                       10 to 1000
    Type of outliers:                         non-Gaussian, Gaussian
    Reference (first) element is an outlier:  yes, no
    Initial missing values:                   yes, no
Figure 7: Outlier detection method including the "Bad Detection Level" detection technique. The window is chosen so that the first and the last terms are nonoutliers; MMS (R_w = (2/n) · (1 + k1)) and then EMMS (R_w = (2/n) · (1 + k2)) are applied, removing each detected outlier and setting n = n − 1. If the first or the last element is identified as an outlier, this is a contradictory situation, and that point is taken as the terminating point of MMS and EMMS.
However, when the first and the last elements are not outliers, the "Bad Detection Level" can be detected automatically. If the first or the last element is identified as an outlier, a contradictory situation arises; thus, this point can be considered the terminating point of MMS and EMMS. The decision diagram elaborated in Figure 7 expresses the new outlier detection method including the "Bad Detection Level" detection technique.
2.9. Validation of the Method. We implemented MMS (with recalculation after an outlier is removed) and EMMS in C++ and conducted the validation process. For the recalculation process, the existing first element of the window was the reference element, and we always used the original value of the element (not the current updated value). To validate the method, we used artificial data sets of different sizes (10 to 1000 elements) representing increasing, decreasing, and constant lines. Then, 50% of the items of those data sets were replaced with very small and very large outliers (±10e−2 to ±10e+2 times the correct value). We checked the data sets for all the environment combinations shown in Table 5. The outlier detection criteria were determined based on (19). For all data sets, the same k values were used (for MMS, k = 0.5, and for EMMS, k = 0.01). Then, the percentage of correctly and falsely detected nonoutliers in relation to the number of actual nonoutliers and the percentage of correctly and falsely detected outliers from the total number of outliers (small and large outliers) were determined.
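The validation series described above can be generated along the following lines. This is a hedged sketch: the paper fixes only that 50% of the elements become outliers scaled by ±10e−2 to ±10e+2 of the correct value; the alternating outlier pattern, the line parameters, the exact factor set, and the helper name are our choices:

```cpp
#include <cstdlib>
#include <vector>

// Build an increasing line a_i = a0 + slope * i and replace 50% of its
// elements (every second one, keeping element 0 as a clean reference)
// with outliers scaled by a very small or very large signed factor of
// the correct value. Hypothetical helper for illustration only.
static std::vector<double> makeTestSeries(int n, double a0, double slope,
                                          unsigned seed) {
    std::srand(seed);
    const double factors[] = {1e-2, 1e-1, 1e+1, 1e+2};
    std::vector<double> s(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) {
        s[i] = a0 + slope * i;                   // correct value
        if (i % 2 == 1) {                        // 50% become outliers
            double f = factors[std::rand() % 4]; // very small or very large
            if (std::rand() % 2) f = -f;         // allow negative deviations
            s[i] *= f;
        }
    }
    return s;
}
```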
2.10. Evaluation Using Real Data. To check the best-linear-fitting identification capability, the algorithm was tested using several real data sets, which were automatically recorded with a frequency of twelve data points per day (i.e., every other hour) from a biogas plant over a period of seven months. Among the different parameters, we selected the H2 content measured in ppm, which we expected to maintain linear behaviour during stable operation. We selected seven segments of different sizes for evaluating the algorithm. In some data sets, there were initial missing elements. We set the R_w for MMS and EMMS by analysing the first and the third data sets. For the recalculation process, the existing first element of the window was the reference element, and we always used the original value of the elements (not the current updated value). Then, the percentage of correctly and falsely detected nonoutliers in relation to the total number of nonoutliers and the percentage of correctly and falsely detected outliers from the total number of outliers (small and large outliers) were determined.
We decided to use the LSM, the Sigma filter, and Grubbs' test [26–29], also known as the maximum normed residual test or "extreme studentized deviate" (ESD) test, to compare our results. We selected Grubbs' test since it has nearly the same formulation as our method. We checked all the biogas data using the abovementioned methods, using each of the data segments as a single window. First, we checked the ability of each method to identify the general trend of the series. Then, we checked the number of correctly and falsely detected outliers and nonoutliers for each method in relation to the general trend.
3. Results and Discussion

Results related to the validation show that, when the reference element (the first element) was not an outlier, the algorithm was capable of identifying all outliers with 0% error, regardless of the type of outliers (Gaussian or non-Gaussian) (Figure 8). If the outliers were Gaussian, there were no significant outliers,
[Figure 8: value versus index plots for six ten-element data sets. (a) Data type: increment, outlier type: non-Gaussian; (b) data type: decrement, outlier type: non-Gaussian; (c) data type: constant, outlier type: non-Gaussian; (d) data type: increment, outlier type: Gaussian; (e) data type: decrement, outlier type: Gaussian; (f) data type: constant, outlier type: Gaussian.]

Figure 8: Outlier detection from data sets with ten elements. The first element is the reference element, which is not an outlier, where red triangle corresponds to outliers detected by MMS, yellow circle corresponds to outliers detected by EMMS, and green square corresponds to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is not an outlier, the new method is capable of locating all outliers. When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).
and MMS automatically became inactive (Figures 8(d), 8(e), and 8(f)). When the first few elements were outliers and the outliers were non-Gaussian, MMS detected the significant outliers correctly (Figures 9(a), 9(b), and 9(c)). However, EMMS was unable to locate the nonsignificant outliers when the first element for EMMS was an outlier (Figures 9(a) and 9(c)). If the reference element for EMMS was not an outlier, it was still possible to achieve correct results (Figure 9(b)). Though it was impossible to locate all nonoutliers, the detected nonoutliers were 100% correct detections. These values can be used to estimate the other values using methods like LSM, since all the remaining data are now cleaned. In
[Figure 9: value versus index plots for six ten-element data sets. (a) Data type: increment, outlier type: non-Gaussian; (b) data type: decrement, outlier type: non-Gaussian; (c) data type: constant, outlier type: non-Gaussian; (d) data type: increment, outlier type: Gaussian; (e) data type: decrement, outlier type: Gaussian; (f) data type: constant, outlier type: Gaussian.]

Figure 9: Outlier detection from data sets with ten elements. The first element is the reference element, which is an outlier, where red triangle corresponds to outliers detected by MMS, yellow circle corresponds to outliers detected by EMMS, green square corresponds to nonoutliers, and black arrow corresponds to wrong detections. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is an outlier and the outliers are non-Gaussian, the new method identifies only the significant outliers ((a), (b), (c)). When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).
general, it is fair to state that (1) when the reference element is not an outlier, the method is capable of identifying all outliers, and (2) when the first few elements of the series are outliers and the outliers are non-Gaussian, the method is capable of identifying only the significant outliers and part of the correct elements.

When the first few elements (the reference elements for both MMS and EMMS) were outliers and the outlier distribution was Gaussian, outlier detection was poor (Figures 9(d), 9(e), and 9(f)). Due to the Gaussian distribution of outliers, MMS was inactive, and it was not possible to identify the large outliers. Most importantly, the results highlighted
[Figure 10: value versus index plots of the two artificial 1000-element data samples, panels (a) and (b).]
the importance of the reference element.
segment would give better result but we used the wholewindow)The newmethodwas capable of locating 4 to 45of elements as outliers with 0 error Grubbsrsquo test was capableof identifying very small amount of elements as outliers(0ndash17) even with the significance level of 005 Howeverall outliers were significant and no wrong detections werereported
4 Conclusions and Outlook
This paper introduced a new outlier detection method usingthe relation of the sum of the elements of an arithmeticprogression The results of this work prove that the newmethod is a robust solution for outlier detection in a data setwith missing elements The method is capable of identifying
12 The Scientific World Journal
both significant and nonsignificant outliers when the firstvalue of the data set is not an outlier Most importantlythe method is a solution for identifying significant outliersin a series with outliers in non-Gaussian distribution Inaddition the outlier detection is nonparametric has floor andceiling values and does not require standardization Whenthe reference elements are unknown the method can be usedwith multiple reference elements to gain optimal output
If the frequency of the data is sufficient any nonlinearrelation can be represented as a combination of straight linesTherefore by using a suitable segmentation technique it ispossible to identify outliers in any data series This will allowfor detecting outliers in a process-oriented data setThereforeto bring a data series into a form that is suitable for ourmethod an intelligent segmentation technique is necessary
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The University of Ruhuna Sri Lanka provided the paperprocessing charges of this paper The German AcademicExchange Service (German Deutscher Akademischer Aus-tauschdienst) financed this work
References
[1] R J Beckman and R D Cook ldquoOutlier srdquoTechnometrics vol 25 no 2 pp 119ndash149 1983
[2] F Molinari ldquoMissing treatmentsrdquo Journal of Business andEconomic Statistics vol 28 no 1 pp 82ndash95 2010
[3] J Qui B Zhang and D H Y Leung ldquoEmpirical likelihoodin missing data problemsrdquo Journal of the American StatisticalAssociation vol 104 no 488 pp 1492ndash1503 2009
[4] V J Hodge and J Austin ldquoA survey of outlier detectionmethodologiesrdquo Artificial Intelligence Review vol 22 no 2 pp85ndash126 2004
[5] Z X Niu S Shi J Sun and X He ldquoA survey of outlier detectionmethodologies and their applicationsrdquo in Artificial Intelligenceand Computational Intelligence vol 7002 of Lecture Notes inComputer Science pp 380ndash387 2011
[6] W Jin A K H Tung and J Han ldquoMining top-n local outliersin large databasesrdquo in Proceedings of the 7th ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining (KDD rsquo01) pp 293ndash298 ACM San Francisco CalifUSA August 2001
[7] H-P Kriegel M S Schubert and A Zimek ldquoAngle-basedoutlier detection in high-dimensional datardquo in Proceedings ofthe 14th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining (KDD rsquo08) pp 444ndash452 ACM LasVegas Nev USA August 2008
[8] I Ben-Gal ldquoOutlier detectionrdquo in Data Mining and KnowledgeDiscovery Handbook O Maimon and L Rokach Eds pp 131ndash146 Springer New York NY USA 2005
[9] G Williams R Baxter H He S Hawkins and L Gu ldquoAcomparative study of RNN for outlier detection in dataminingrdquoin Proceedings of the IEEE International Conference on DataMining (ICDM 02) pp 709ndash712 December 2002
[10] H Fan O R Zaıane A Foss and J Wu ldquoA nonparametricoutlier detection for effectively discovering top-N outliers fromengineering datardquo inAdvances inKnowledgeDiscovery andDataMining W-K Ng M Kitsuregawa J Li and K Chang Edsvol 3918 of Lecture Notes in Computer Science pp 557ndash566Springer Berlin Germany 2006
[11] J-S Lee ldquoDigital image smoothing and the sigma filterrdquoComputer Vision Graphics and Image Processing vol 24 no 2pp 255ndash269 1983
[12] A Gelb Applied Optimal Estimation MIT Press 1974[13] D Sierociuk I Tejado and BM Vinagre ldquoImproved fractional
Kalman filter and its application to estimation over lossynetworksrdquo Signal Processing vol 91 no 3 pp 542ndash552 2011
[14] P H Abreu J Xavier D C Silva L P Reis andM Petry ldquoUsingKalman filters to reduce noise from RFID location systemrdquoTheScientific World Journal vol 2014 Article ID 796279 9 pages2014
[15] H Liu S Shah and W Jiang ldquoOn-line outlier detection anddata cleaningrdquoComputers andChemical Engineering vol 28 no9 pp 1635ndash1647 2004
[16] T D Pigott ldquoA review ofmethods formissing datardquo EducationalResearch and Evaluation vol 7 no 4 pp 353ndash383 2001
[17] M Nakai and W Ke ldquoReview of the methods for handlingmissing data in longitudinal data analysisrdquo International Journalof Mathematical Analysis vol 5 no 1ndash4 pp 1ndash13 2011
[18] J L Schafer and J W Graham ldquoMissing data our view of thestate of the artrdquo Psychological Methods vol 7 no 2 pp 147ndash1772002
[19] E Acuna and C Rodriguez ldquoThe treatment of missing valuesand its effect on classifier accuracyrdquo inClassification Clusteringand Data Mining Applications D Banks L House F RMcMorris P Arabie and W Gaul Eds pp 639ndash647 SpringerBerlin Germany 2004
[20] A N Baraldi and C K Enders ldquoAn introduction to modernmissing data analysesrdquo Journal of School Psychology vol 48 no1 pp 5ndash37 2010
[21] J Tian B Yu D Yu and S Ma ldquoClustering-based multipleimputation via gray relational analysis for missing data and itsapplication to aerospace fieldrdquoThe Scientific World Journal vol2013 Article ID 720392 10 pages 2013
[22] S Zhaowei Z Lingfeng M Shangjun F Bin and Z TaipingldquoIncomplete time series prediction using max-margin classifi-cation of data with absent featuresrdquo Mathematical Problems inEngineering vol 2010 Article ID 513810 14 pages 2010
[23] J-F Cai Z Shen and G-B Ye ldquoApproximation of frame basedmissing data recoveryrdquo Applied and Computational HarmonicAnalysis vol 31 no 2 pp 185ndash204 2011
[24] B L Wiens and G K Rosenkranz ldquoMissing data in noninferi-ority trialsrdquo Statistics in Biopharmaceutical Research vol 5 no4 pp 383ndash393 2013
[25] Aryabhata The Aryabhatiya of Aryabhata An Ancient IndianWork onMathematics and Astronomy vol 1 Kessinger Publish-ing 2006
[26] F E Grubbs ldquoProcedures for detecting outlying observations insamplesrdquo Technometrics vol 11 no 1 pp 1ndash21 1969
[27] B Rosner ldquoOn the detection of many outliersrdquo Technometricsvol 17 no 2 pp 221ndash227 1975
[28] R B Jain ldquoPercentage points of many-outlier detection proce-duresrdquo Technometrics vol 23 no 1 pp 71ndash75 1981
[29] S P Verma L Dıaz-Gonzalez M Rosales-Rivera and AQuiroz-Ruiz ldquoComparative performance of four single extremeoutlier discordancy tests from Monte Carlo simulationsrdquo TheScientific World Journal vol 2014 Article ID 746451 27 pages2014
6 The Scientific World Journal
Table 4: EMMS for identifying an outlier.

x (0 … n − 1) | y (a_n) | y_T (a_n − a_1) | y_TT (|y_T − x_n (G_Ty / G_x)|)
0 | 100   | 0.000 | 0
1 | 101   | 1.000 | 0.06
2 | 102   | 2.000 | 0.12
3 | 103.6 | 3.600 | 0.42
4 | 104   | 4.000 | 0.24
G_x = 2; G_Ty = 2.120
EMMS (Max) = 0.500; EMMS (Min) = 0.333; 2/n = 0.4; Outlier: Yes (Max)
[Figure 5: plot of Value vs. Index (0, 1, 2, …, n − 1, n).]
Figure 5: Transformation of data to a constant value to overcome the missing data problem, where red diamonds correspond to the y = f(x) form, yellow squares to the y_T = f(x) − f(x_0) form, crosses to the y′ = m · x_i form (m = Σ_{i=0}^{n} y_T,i / Σ_{i=0}^{n} x_T,i), red circles to y_TT = y_T − y′, and open circles to missing values.
an example calculation of EMMS, and the following equation gives an overview of the EMMS process:

    EMMS_max = a_TT,max / S_TT,n,
    EMMS_min = a_TT,max / (a_TT,max · n − S_TT,n);

    if EMMS_max > (2/n) + w/2, the maximum is the outlier;
    if EMMS_min > (2/n) + w/2, the minimum is the outlier;
    if EMMS_max, EMMS_min ≤ (2/n) + w/2, no outlier is detected,    (18)

where a_TT,max is the maximum of the transformed values y_TT and S_TT,n is their sum (cf. Table 4).
However, EMMS uses information derived from the existing data. If there are biased values, this may lead to biased information. Because of that, direct application of EMMS is not good practice; hence, significant outliers should first be removed using MMS before applying EMMS.
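To make the transformation and ratio test concrete, the following is a minimal C++ sketch (the paper's reference implementation is also in C++, but this fragment, including the names `Emms` and `emms`, is ours) reproducing the Table 4 calculation: shift the series by its first element, fit a slope through the origin, take absolute residuals y_TT, and form the EMMS ratios.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// EMMS ratios for one window, following Table 4 and Figure 5:
//   y_T(i)  = y(i) - y(0)              shift so the series starts at 0
//   m       = sum(y_T) / sum(x)        slope of a line through the origin
//   y_TT(i) = |y_T(i) - m * x(i)|      residual against that line
//   EMMS_max = max(y_TT) / sum(y_TT)
//   EMMS_min = max(y_TT) / (max(y_TT) * n - sum(y_TT))
// For an outlier-free window both ratios stay near 2/n.
struct Emms { double maxRatio, minRatio, threshold; };

Emms emms(const std::vector<double>& y) {
    const std::size_t n = y.size();
    double sumYT = 0.0, sumX = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sumYT += y[i] - y[0];
        sumX  += static_cast<double>(i);
    }
    const double m = sumYT / sumX;  // for Table 4: 10.6 / 10 = 1.06
    double sumYTT = 0.0, maxYTT = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double ytt = std::fabs((y[i] - y[0]) - m * static_cast<double>(i));
        sumYTT += ytt;
        if (ytt > maxYTT) maxYTT = ytt;
    }
    return { maxYTT / sumYTT,
             maxYTT / (maxYTT * static_cast<double>(n) - sumYTT),
             2.0 / static_cast<double>(n) };
}
```

For the Table 4 series {100, 101, 102, 103.6, 104}, this yields EMMS_max = 0.500 and EMMS_min = 0.333 against 2/n = 0.4, so the maximum-residual element (103.6) is flagged, matching the table.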
[Figure 6: flowchart — Start → for a consecutive series, use MMS with R_w = 2/n · (1 + k1) (n = number of elements, k = k1); while an outlier is detected, remove it, set n = n − 1, and recalculate the series; then apply EMMS with R_w = 2/n · (1 + k2) (k = k2) in the same loop; Stop when no outlier is detected.]
Figure 6: Implementation of MMS and EMMS. The algorithm initially checks for significant outliers using MMS. After all significant outliers are removed, the nonsignificant outliers are removed using EMMS. There is no removed-data imputation in relation to either MMS or EMMS.
2.8. Challenge 7: Determining the Outlier Detection Criteria (R_w). The value R_w is the factor that determines the outliers: w = 0 (R_0 = 2/n) represents exactly a line, and w > 0 represents a linear border with a certain width. In this section we propose several possible methods that can be used to determine the outlier detection criteria.
2.8.1. Express the Value "w" as f(1/n). If the value w is f(1/n), then w = 2k/n, with k ≤ (n/2) − 1 and k ∈ R+. Then

    R_w = 2/n + 2k/n,
    R_w = (2/n) · (1 + k),
    R_w / R_0 = 1 + k (= constant).    (19)

When the MMS or the EMMS is greater than the R_w of (19), this implies the existence of outliers. Because R_w/R_0 is constant and gives a standard for R_w, the determination of k still depends on knowledge of the domain. Figure 6 shows an algorithm based on this technique.
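As a sketch in C++ (the helper name and the sample values below are ours, not the paper's), the MMS criterion with (19) reduces to comparing R = (a_min + a_max)/S against R_w = (2/n)(1 + k):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// MMS test against the criterion of (19): for an outlier-free arithmetic
// progression, R = (min + max) / sum equals 2/n exactly, so any R above
// R_w = (2/n)(1 + k) signals that the maximum is an outlier.
bool mmsFlagsMaximum(const std::vector<double>& a, double k) {
    const double n   = static_cast<double>(a.size());
    const double sum = std::accumulate(a.begin(), a.end(), 0.0);
    const auto   mm  = std::minmax_element(a.begin(), a.end());
    const double R   = (*mm.first + *mm.second) / sum;
    return R > (2.0 / n) * (1.0 + k);  // R_w from (19)
}
```

For {100, 101, 102, 103, 104}, R = 204/510 = 0.4 = 2/5, so nothing is flagged with k = 0.5 (R_w = 0.6); replacing the last element by 1000 gives R = 1100/1406 ≈ 0.78 > 0.6, and the maximum is flagged. A symmetric comparison below 2/n would analogously flag the minimum.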
2.8.2. When the First and the Last Items Are Nonoutliers. In the total process, the "Bad Detection Level" is the most important criterion. If the R_w of MMS is less than the "Bad Detection Level", it is possible to identify nonoutliers as outliers, as mentioned in Section 2.7. If there is prior knowledge about the outliers, it is possible to use a safe value for MMS. Otherwise, there is no 100% guarantee on the "Bad Detection Level".
Table 5: Different environments used to validate the new method.

Type of the original data set | Number of elements | Type of outliers | Reference (first) element is an outlier | Initial missing values
Increment, constant, and decrement | 10 to 1000 | Non-Gaussian, Gaussian | Yes, no | Yes, no
[Figure 7: flowchart — Start → get a window whose first and last terms are nonoutliers → use MMS with R_w = 2/n · (1 + k1) (n = number of elements, k = k1); while an outlier is detected and it is not the first or last term, remove it, set n = n − 1, and recalculate the series; then apply EMMS with R_w = 2/n · (1 + k2) (k = k2) in the same way; Stop when no outlier is detected or when the first or last term is flagged.]
Figure 7: Outlier detection method including the "Bad Detection Level" detection technique. The first and the last data points of the window must be nonoutliers. If the first or the last element is identified as an outlier, a contradictory situation arises; thus, this point can be considered as the terminating point of MMS and EMMS.
However, when the first and the last elements are not outliers, the "Bad Detection Level" can be detected automatically. If the first or the last element is identified as an outlier, a contradictory situation arises. Thus, this point can be considered as the terminating point of MMS and EMMS. The decision diagram in Figure 7 expresses the new outlier detection method including the "Bad Detection Level" detection technique.
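The loop of Figure 7 can be sketched as follows (C++; the bookkeeping and names are ours, and the paper's recalculation of the series after each removal is omitted for brevity): repeatedly test the window with MMS, remove the flagged maximum, and terminate either when nothing is flagged or when the flagged element is the first or last of the window, i.e., the contradictory "Bad Detection Level" case.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Iterative MMS pass in the spirit of Figure 7. While the ratio
// R = (min + max)/sum exceeds R_w = (2/n)(1 + k), remove the maximum
// and set n = n - 1; stop when nothing is flagged or when the flagged
// element is the first or last of the window (the contradictory case
// that marks the "Bad Detection Level"). Returns the removed indices
// relative to the original window.
std::vector<std::size_t> iterativeMms(std::vector<double> a, double k) {
    std::vector<std::size_t> removed, idx(a.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    while (a.size() >= 3) {
        const double n   = static_cast<double>(a.size());
        const double sum = std::accumulate(a.begin(), a.end(), 0.0);
        const auto   mm  = std::minmax_element(a.begin(), a.end());
        if ((*mm.first + *mm.second) / sum <= (2.0 / n) * (1.0 + k))
            break;                                   // no outlier flagged
        const std::size_t pos = static_cast<std::size_t>(mm.second - a.begin());
        if (pos == 0 || pos + 1 == a.size())
            break;                                   // first/last term: stop
        removed.push_back(idx[pos]);
        a.erase(a.begin() + static_cast<std::ptrdiff_t>(pos));
        idx.erase(idx.begin() + static_cast<std::ptrdiff_t>(pos));
    }
    return removed;
}
```

For {100, 101, 5000, 103, 104, 105} with k = 0.5, the first pass flags index 2 (R ≈ 0.93 > R_w = 0.5) and removes it; the second pass finds R ≈ 0.40 ≤ 0.6 and stops.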
2.9. Validating the Method. We implemented MMS (with recalculation after an outlier is removed) and EMMS in C++ and conducted the validation process. For the recalculation process, the existing first element of the window was the reference element, and we always used the original value of the element (not the current updated value). To validate the method, we used artificial data sets of different sizes (10 to 1000 elements) of a line representing increasing, decreasing, and constant trends. Then 50% of the items of those data sets were replaced with very small and very large outliers (±10e−2 to ±10e+2 times the correct value). We checked the data sets for all the environment combinations shown in Table 5. The outlier detection criteria were determined based on (19). For all data sets the same k value was used (for MMS, k = 0.5; for EMMS, k = 0.01). Then the percentage of correctly and falsely detected nonoutliers in relation to the number of actual nonoutliers, and the percentage of correctly and falsely detected outliers from the total number of outliers (small and large outliers), were determined.
2.10. Evaluation Using Real Data. To check the best-linear-fitting identification capability, the algorithm was tested using several real data sets, which were automatically recorded at a frequency of twelve data points per day (i.e., every other hour) from a biogas plant over a period of seven months. Among the different parameters, we selected the H2 content, measured in ppm, which we expected to maintain linear behaviour during stable operation. We selected seven segments of different size for evaluating the algorithm. In some data sets there were initial missing elements. We set the R_w for MMS and EMMS by analysing the first and the third data sets. For the recalculation process, the existing first element of the window was the reference element, and we always used the original value of the elements (not the current updated value). Then the percentage of correctly and falsely detected nonoutliers in relation to the total number of nonoutliers, and the percentage of correctly and falsely detected outliers from the total number of outliers (small and large outliers), were determined.
We decided to use the LSM, the sigma filter, and Grubbs' test [26–29], also known as the maximum normed residual test or "extreme studentized deviate" (ESD) test, to compare our results. We selected Grubbs' test since it has nearly the same formulation as our method. We checked all the biogas data using the abovementioned methods, using each of the data segments as a single window. First, we checked the ability of each method to identify the general trend of the series. Then we checked the amount of correctly and falsely detected outliers and nonoutliers for each method in relation to the general trend.
3. Results and Discussion
Results related to validation show that when the reference element (the first element) was not an outlier, the algorithm was capable of identifying all outliers with 0% error, regardless of the type of outliers (Gaussian or non-Gaussian) (Figure 8). If the outliers were Gaussian, there were no significant outliers
[Figure 8: six panels of Value vs. Index for ten-element data sets — (a) increment, non-Gaussian outliers; (b) decrement, non-Gaussian outliers; (c) constant, non-Gaussian outliers; (d) increment, Gaussian outliers; (e) decrement, Gaussian outliers; (f) constant, Gaussian outliers.]
Figure 8: Outlier detection from data sets with ten elements. The first element is the reference element, which is not an outlier; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is not an outlier, the new method is capable of locating all outliers. When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).
and MMS automatically became inactive (Figures 8(d), 8(e), and 8(f)). When the first few elements were outliers and the outliers were non-Gaussian, MMS detected the significant outliers correctly (Figures 9(a), 9(b), and 9(c)). However, EMMS was unable to locate the nonsignificant outliers when the first element for EMMS was an outlier (Figures 9(a) and 9(c)). If the reference element for EMMS was not an outlier, it was still possible to achieve correct results (Figure 9(b)). Though it was impossible to locate all nonoutliers, the detected nonoutliers were 100% correct detections. These values can be used to estimate the other values using methods like LSM, since all the remaining data are now cleaned. In
[Figure 9: six panels of Value vs. Index for ten-element data sets — (a) increment, non-Gaussian outliers; (b) decrement, non-Gaussian outliers; (c) constant, non-Gaussian outliers; (d) increment, Gaussian outliers; (e) decrement, Gaussian outliers; (f) constant, Gaussian outliers.]
Figure 9: Outlier detection from data sets with ten elements. The first element is the reference element, which is an outlier; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, green squares to nonoutliers, and black arrows to wrong detections. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is an outlier and the outliers are non-Gaussian, the new method identifies only the significant outliers ((a), (b), (c)). When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).
general, it is fair to state that (1) when the reference element is not an outlier, the method is capable of identifying all outliers, and (2) when the first few elements of the series are outliers and the outliers are non-Gaussian, the method is capable of identifying only the significant outliers and part of the correct elements.
When the first few elements (the reference elements for both MMS and EMMS) were outliers and the outlier distribution was Gaussian, outlier detection was poor (Figures 9(d), 9(e), and 9(f)). Due to the Gaussian distribution of the outliers, MMS was inactive, and it was not possible to identify the large outliers. Most importantly, the results highlighted
[Figure 10: two panels, (a) and (b), of Value vs. Index for 1000-element data sets.]
Figure 10: Two artificial data samples with 1000 elements each, including 50-, 100-, 100-, and 50-element (300 in total) missing-value regions. The first element is the reference element, which is not an outlier; (a) corresponds to a data set with non-Gaussian outliers, and (b) to a data set with nearly Gaussian outliers; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. The new method was able to identify all the elements related to the line with 0% error.
the importance of the reference element: if the reference element for MMS and EMMS was not an outlier, good results were guaranteed regardless of other factors.
In the methodology, we derived the method based on the first element. However, it is also possible to use any other element as the reference point and modify the method accordingly. We considered the simplest situation, in which the first element is not an outlier. Therefore, if it is possible to segment the data so that extreme outliers at the beginning are excluded, the method provides accurate outlier detection. Another possibility is to replace the first element with an already known element. This leads to a further way of applying the method: if only a single correct element is known, using that element as the reference element and modifying the method accordingly can yield very accurate results.
Some model-based approaches demand a trained data set for correct output. In contrast, this method requires only one correct element to produce a correct output. In addition, it is possible to use multiple reference points and consider the best fitting: for example, (a) consider each point in the first x% (e.g., 10%) of the data points as a reference point, or (b) consider all data points as reference points. Furthermore, it is important to distinguish the purposes of MMS and EMMS: MMS removes only the significant outliers, while EMMS removes nonsignificant outliers. Depending on the requirement, MMS and/or EMMS can be used to remove outliers.
The results show that the new method is a good solution for managing missing values. Figure 10 shows two data sets with 1000 elements each. Each data set contains 50-, 100-, 100-, and 50-element (300 in total) missing-value regions. When the first element was not an outlier, the new method was able to identify all the elements related to the line with 0% error.
In the real world, it is not possible to find nonoutliers that exactly agree with a linear regression; therefore, 100% accuracy is inapplicable. However, it is very important to have a significant-outlier-free data set. The new method guaranteed a significant-outlier-free data set when the outliers were non-Gaussian. Furthermore, in real-world situations, data outliers are not always Gaussian distributed, so we expect the new method to be applicable to the majority of outlier detection applications. Our new method is an effective solution, whereas the most common methods, LSM and the sigma filter, need Gaussian outliers. Some methods, like the sigma filter, cannot be applied directly to a certain data segment, and further segmentation (windowing) is required for better results. In contrast, the new method is capable of locating nonoutliers automatically in increasing, decreasing, or constant form regardless of the size of the window.
Results related to the biogas data supported the abovementioned idea and showed that the algorithm clearly identifies three regions: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers within a data segment (Figure 11). In addition, the results showed that the nonoutliers follow a linear path. Furthermore, the width of the regions can be tuned by changing the relevant R_w values. Figure 11 shows some selected results of the biogas data for a k value of 0.2 for MMS and a k value of 0.1 for EMMS.
One of the interesting observations was the ability of the algorithm to continue linear detection even with noncontinuous clusters (Figures 11(b) and 11(e)). In all data segments, no false detection occurred (there were no outliers in nonoutlier regions and vice versa). Most importantly, the new method required no further windowing, and nonoutliers were detected independent of the window size.
When the general trend was constant and the elements were Gaussian distributed, the sigma filter and LSM were able to identify the linear trend. However, for series with biased elements, both methods failed to identify the general trend. When the general trend was increasing or decreasing, the sigma filter failed to identify the general trend (a further
[Figure 11: six panels of H2 content of biogas (ppm) vs. Index — (a) elements: 334; (b) elements: 372; (c) elements: 350; (d) elements: 352; (e) elements: 356; (f) elements: 372.]
Figure 11: Results related to real biogas data with data sets of different sizes. The first element is the reference element, which is assumed not to be an outlier. The results showed that the algorithm clearly identifies three regions: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers within each data segment. Most importantly, all the nonoutliers lay within a linear border; red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.2 and 0.1, respectively.
segment would give a better result, but we used the whole window). The new method was capable of locating 4% to 45% of the elements as outliers with 0% error. Grubbs' test was capable of identifying only a very small fraction of the elements as outliers (0–1.7%), even with a significance level of 0.05. However, all outliers it detected were significant, and no wrong detections were reported.
4. Conclusions and Outlook
This paper introduced a new outlier detection method using the relation of the sums of the elements of an arithmetic progression. The results of this work show that the new method is a robust solution for outlier detection in a data set with missing elements. The method is capable of identifying
both significant and nonsignificant outliers when the first value of the data set is not an outlier. Most importantly, the method is a solution for identifying significant outliers in a series whose outliers are non-Gaussian distributed. In addition, the outlier detection is nonparametric, has floor and ceiling values, and does not require standardization. When the reference elements are unknown, the method can be used with multiple reference elements to gain optimal output.
If the frequency of the data is sufficient, any nonlinear relation can be represented as a combination of straight lines. Therefore, by using a suitable segmentation technique, it is possible to identify outliers in any data series, which will allow for detecting outliers in a process-oriented data set. Hence, to bring a data series into a form that is suitable for our method, an intelligent segmentation technique is necessary.
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The University of Ruhuna Sri Lanka provided the paperprocessing charges of this paper The German AcademicExchange Service (German Deutscher Akademischer Aus-tauschdienst) financed this work
References
[1] R J Beckman and R D Cook ldquoOutlier srdquoTechnometrics vol 25 no 2 pp 119ndash149 1983
[2] F Molinari ldquoMissing treatmentsrdquo Journal of Business andEconomic Statistics vol 28 no 1 pp 82ndash95 2010
[3] J Qui B Zhang and D H Y Leung ldquoEmpirical likelihoodin missing data problemsrdquo Journal of the American StatisticalAssociation vol 104 no 488 pp 1492ndash1503 2009
[4] V J Hodge and J Austin ldquoA survey of outlier detectionmethodologiesrdquo Artificial Intelligence Review vol 22 no 2 pp85ndash126 2004
[5] Z X Niu S Shi J Sun and X He ldquoA survey of outlier detectionmethodologies and their applicationsrdquo in Artificial Intelligenceand Computational Intelligence vol 7002 of Lecture Notes inComputer Science pp 380ndash387 2011
[6] W Jin A K H Tung and J Han ldquoMining top-n local outliersin large databasesrdquo in Proceedings of the 7th ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining (KDD rsquo01) pp 293ndash298 ACM San Francisco CalifUSA August 2001
[7] H-P Kriegel M S Schubert and A Zimek ldquoAngle-basedoutlier detection in high-dimensional datardquo in Proceedings ofthe 14th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining (KDD rsquo08) pp 444ndash452 ACM LasVegas Nev USA August 2008
[8] I Ben-Gal ldquoOutlier detectionrdquo in Data Mining and KnowledgeDiscovery Handbook O Maimon and L Rokach Eds pp 131ndash146 Springer New York NY USA 2005
[9] G Williams R Baxter H He S Hawkins and L Gu ldquoAcomparative study of RNN for outlier detection in dataminingrdquoin Proceedings of the IEEE International Conference on DataMining (ICDM 02) pp 709ndash712 December 2002
[10] H Fan O R Zaıane A Foss and J Wu ldquoA nonparametricoutlier detection for effectively discovering top-N outliers fromengineering datardquo inAdvances inKnowledgeDiscovery andDataMining W-K Ng M Kitsuregawa J Li and K Chang Edsvol 3918 of Lecture Notes in Computer Science pp 557ndash566Springer Berlin Germany 2006
[11] J-S Lee ldquoDigital image smoothing and the sigma filterrdquoComputer Vision Graphics and Image Processing vol 24 no 2pp 255ndash269 1983
[12] A Gelb Applied Optimal Estimation MIT Press 1974[13] D Sierociuk I Tejado and BM Vinagre ldquoImproved fractional
Kalman filter and its application to estimation over lossynetworksrdquo Signal Processing vol 91 no 3 pp 542ndash552 2011
[14] P H Abreu J Xavier D C Silva L P Reis andM Petry ldquoUsingKalman filters to reduce noise from RFID location systemrdquoTheScientific World Journal vol 2014 Article ID 796279 9 pages2014
[15] H Liu S Shah and W Jiang ldquoOn-line outlier detection anddata cleaningrdquoComputers andChemical Engineering vol 28 no9 pp 1635ndash1647 2004
[16] T D Pigott ldquoA review ofmethods formissing datardquo EducationalResearch and Evaluation vol 7 no 4 pp 353ndash383 2001
[17] M Nakai and W Ke ldquoReview of the methods for handlingmissing data in longitudinal data analysisrdquo International Journalof Mathematical Analysis vol 5 no 1ndash4 pp 1ndash13 2011
[18] J L Schafer and J W Graham ldquoMissing data our view of thestate of the artrdquo Psychological Methods vol 7 no 2 pp 147ndash1772002
[19] E Acuna and C Rodriguez ldquoThe treatment of missing valuesand its effect on classifier accuracyrdquo inClassification Clusteringand Data Mining Applications D Banks L House F RMcMorris P Arabie and W Gaul Eds pp 639ndash647 SpringerBerlin Germany 2004
[20] A N Baraldi and C K Enders ldquoAn introduction to modernmissing data analysesrdquo Journal of School Psychology vol 48 no1 pp 5ndash37 2010
[21] J Tian B Yu D Yu and S Ma ldquoClustering-based multipleimputation via gray relational analysis for missing data and itsapplication to aerospace fieldrdquoThe Scientific World Journal vol2013 Article ID 720392 10 pages 2013
[22] S Zhaowei Z Lingfeng M Shangjun F Bin and Z TaipingldquoIncomplete time series prediction using max-margin classifi-cation of data with absent featuresrdquo Mathematical Problems inEngineering vol 2010 Article ID 513810 14 pages 2010
[23] J-F Cai Z Shen and G-B Ye ldquoApproximation of frame basedmissing data recoveryrdquo Applied and Computational HarmonicAnalysis vol 31 no 2 pp 185ndash204 2011
[24] B L Wiens and G K Rosenkranz ldquoMissing data in noninferi-ority trialsrdquo Statistics in Biopharmaceutical Research vol 5 no4 pp 383ndash393 2013
[25] Aryabhata The Aryabhatiya of Aryabhata An Ancient IndianWork onMathematics and Astronomy vol 1 Kessinger Publish-ing 2006
[26] F E Grubbs ldquoProcedures for detecting outlying observations insamplesrdquo Technometrics vol 11 no 1 pp 1ndash21 1969
[27] B Rosner ldquoOn the detection of many outliersrdquo Technometricsvol 17 no 2 pp 221ndash227 1975
[28] R B Jain ldquoPercentage points of many-outlier detection proce-duresrdquo Technometrics vol 23 no 1 pp 71ndash75 1981
[29] S P Verma L Dıaz-Gonzalez M Rosales-Rivera and AQuiroz-Ruiz ldquoComparative performance of four single extremeoutlier discordancy tests from Monte Carlo simulationsrdquo TheScientific World Journal vol 2014 Article ID 746451 27 pages2014
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World Journal 7
Table 5 Different environments used to validate the new method
Type of the original dataset Number of elements Type of outliers Reference (first) elementis an outlier Initial missing values
Increment constant anddecrement 10 to 1000 Non-Gaussian Gaussian Yes no Yes no
[Flowchart omitted. The procedure starts from a window whose first and last terms are nonoutliers, applies MMS with R_w = (2/n)(1 + k1) and then EMMS with R_w = (2/n)(1 + k2); whenever an outlier is detected it is removed, n is set to n - 1, and the (consecutive) series is recalculated, and the loop stops when the first or the last term is flagged as an outlier.]

Figure 7: Outlier detection method including the "Bad Detection Level" detection technique. The first and the last data points of the window must be nonoutliers. If the first or the last element is identified as an outlier, a contradictory situation arises; thus, this point can be considered as the terminating point of MMS and EMMS.
However, when the first and the last elements are not outliers, the "Bad Detection Level" can be detected automatically. If the first or the last element is identified as an outlier, a contradictory situation arises; thus, this point can be considered as the terminating point of MMS and EMMS. The decision diagram elaborated in Figure 7 expresses the new outlier detection method, including the "Bad Detection Level" detection technique.
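The loop described above can be sketched as follows. This is a minimal Python sketch of our reading of the method (the actual implementation was in C++); the function name and the symmetric acceptance band around 2/n are illustrative assumptions, and a positive-valued series is assumed so that the sum is well behaved:

```python
def detect_outliers(series, k):
    """Iteratively flag extreme elements of a window using the ratio
    R = (min + max) / sum, which equals 2/n for an outlier-free
    arithmetic progression; k widens the acceptance band around 2/n."""
    x = list(series)
    outliers = []
    while len(x) >= 3:
        n = len(x)
        r = (min(x) + max(x)) / sum(x)
        lower, upper = (2.0 / n) * (1.0 - k), (2.0 / n) * (1.0 + k)
        if lower <= r <= upper:
            break                       # window is outlier-free within the band
        suspect = max(x) if r > 2.0 / n else min(x)
        if suspect in (x[0], x[-1]):
            break                       # first/last term flagged: contradictory, terminate
        outliers.append(suspect)
        x.remove(suspect)               # remove and recheck the shrunken window
    return outliers
```

For example, `detect_outliers([1, 2, 1000, 4, 5, 6], 0.5)` flags 1000 and then stops, since the remaining window satisfies the 2/n criterion within the band.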
2.9. Validation of the Method. We implemented MMS (with recalculation after an outlier is removed) and EMMS in C++ and conducted the validation process. For the recalculation, the existing first element of the window was the reference element, and we always used the original value of the element (not its current, updated value). To validate the method, we used artificial data sets of different sizes (10 to 1000 elements) representing increasing, decreasing, and constant lines. Then 50% of the items of those data sets were replaced with very small and very large outliers (±10e−2 to ±10e+2 times the correct value). We checked the data sets for all the environment combinations shown in Table 5. The outlier detection criteria were determined based on (19). For all data sets the same k values were used (k = 0.5 for MMS and k = 0.01 for EMMS). Then the percentage of correctly and falsely detected nonoutliers in relation to the number of actual nonoutliers, and the percentage of correctly and falsely detected outliers from the total number of outliers (small and large), were determined.
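The artificial validation sets described above can be generated along these lines (an illustrative Python sketch; the function name, slope, intercept, and seed are our own choices, with the outlier scale factors drawn from the ±10e−2 to ±10e+2 range stated in the text):

```python
import random

def make_test_series(n, outlier_fraction=0.5, slope=3.0, intercept=10.0, seed=1):
    """Build a linear series and replace a fraction of its elements with
    outliers scaled by a factor of 1e-2 to 1e+2 (either sign) of the
    correct value.  The first element is kept clean so that it can serve
    as the reference element.  Returns the series and a ground-truth mask."""
    rng = random.Random(seed)
    y = [intercept + slope * i for i in range(n)]
    is_outlier = [False] * n
    for i in rng.sample(range(1, n), int(outlier_fraction * n)):
        factor = rng.choice([-1.0, 1.0]) * 10.0 ** rng.uniform(-2.0, 2.0)
        y[i] *= factor
        is_outlier[i] = True
    return y, is_outlier
```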
2.10. Evaluation Using Real Data. To check the best-linear-fitting identification capability, the algorithm was tested using several real data sets, which were automatically recorded with a frequency of twelve data points per day (i.e., every other hour) from a biogas plant over a period of seven months. Among the different parameters, we selected the H2 content, measured in ppm, which we expected to maintain linear behaviour during stable operation. We selected seven segments of different sizes for evaluating the algorithm. In some data sets there were initial missing elements. We set the R_w for MMS and EMMS by analysing the first and the third data sets. For the recalculation, the existing first element of the window was the reference element, and we always used the original value of the elements (not their current, updated values). Then the percentage of correctly and falsely detected nonoutliers in relation to the total number of nonoutliers, and the percentage of correctly and falsely detected outliers from the total number of outliers (small and large), were determined.
We decided to use the LSM, the Sigma filter, and Grubbs' test [26–29], also known as the maximum normed residual test or "extreme studentized deviate" (ESD) test, to compare our results. We selected Grubbs' test since it has nearly the same formulation as our method. We checked all the biogas data using the abovementioned methods, using each of the data segments as a single window. First, we checked the ability of each method to identify the general trend of the series. Then we checked the number of correctly and falsely detected outliers and nonoutliers for each method in relation to the general trend.
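For reference, one step of Grubbs' test compares the maximum normed residual against a tabulated critical value. A minimal Python sketch (the function name is ours; critical values for the chosen significance level and n come from standard tables, roughly 1.715 for n = 5 at alpha = 0.05):

```python
import statistics

def grubbs_suspect(x, g_crit):
    """Single step of Grubbs' test: return the most extreme element if
    its normed residual G = max|x_i - mean| / s exceeds g_crit,
    otherwise None (no outlier at the chosen significance level)."""
    m = statistics.mean(x)
    s = statistics.stdev(x)
    g, suspect = max((abs(v - m) / s, v) for v in x)
    return suspect if g > g_crit else None
```

Repeating this step after removing each detected point gives the many-outlier variants discussed in [27, 28].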
3. Results and Discussion
Results related to the validation show that, when the reference element (the first element) was not an outlier, the algorithm was capable of identifying all outliers with 0% error, regardless of the type of outliers (Gaussian or non-Gaussian) (Figure 8). If the outliers were Gaussian, there were no significant outliers
[Plots omitted. Panels: (a) increment, non-Gaussian outliers; (b) decrement, non-Gaussian; (c) constant, non-Gaussian; (d) increment, Gaussian; (e) decrement, Gaussian; (f) constant, Gaussian.]

Figure 8: Outlier detection from data sets with ten elements. The first element is the reference element, which is not an outlier. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is not an outlier, the new method is capable of locating all outliers. When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).
and MMS automatically became inactive (Figures 8(d), 8(e), and 8(f)). When the first few elements were outliers and the outliers were non-Gaussian, MMS detected the significant outliers correctly (Figures 9(a), 9(b), and 9(c)). However, EMMS was unable to locate the nonsignificant outliers when the first element for EMMS was an outlier (Figures 9(a) and 9(c)). If the reference element for EMMS was not an outlier, it was still possible to achieve correct results (Figure 9(b)). Though it was impossible to locate all nonoutliers, the detected nonoutliers were 100% correct detections. These values can be used to estimate the other values using methods like LSM, since all the remaining data are now cleaned.
[Plots omitted. Panels: (a) increment, non-Gaussian outliers; (b) decrement, non-Gaussian; (c) constant, non-Gaussian; (d) increment, Gaussian; (e) decrement, Gaussian; (f) constant, Gaussian.]

Figure 9: Outlier detection from data sets with ten elements. The first element is the reference element, which is an outlier. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, green squares to nonoutliers, and black arrows to wrong detections. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is an outlier and the outliers are non-Gaussian, the new method identifies only the significant outliers ((a), (b), (c)). When the outliers are Gaussian, MMS automatically becomes inactive (no significant outliers) ((d), (e), (f)).
In general, it is fair to state that (1) when the reference element is not an outlier, the method is capable of identifying all outliers, and (2) when the first few elements of the series are outliers and the outliers are non-Gaussian, the method is capable of identifying only the significant outliers and part of the correct elements.

When the first few elements (the reference elements for both MMS and EMMS) were outliers and the outlier distribution was Gaussian, outlier detection was poor (Figures 9(d), 9(e), and 9(f)). Due to the Gaussian distribution of outliers, MMS was inactive, and it was not possible to identify the large outliers. Most importantly, the results highlighted
[Plots omitted.]

Figure 10: Two artificial data samples with 1000 elements each, including 50-, 100-, 100-, and 50-element (total 300) missing-value regions. The first element is the reference element, which is not an outlier. Panel (a) corresponds to a data set with non-Gaussian outliers and panel (b) to a data set with nearly Gaussian outliers. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. The new method was able to identify all the elements related to the line with 0% error.
the importance of the reference element. If the reference element for MMS and EMMS was not an outlier, good results were guaranteed regardless of other factors.

In the methodology, we derived the method based on the first element. However, it is also possible to use any other element as the reference point and modify the method accordingly. We considered the simplest situation, where the first element is not an outlier. Therefore, if it is possible to segment the data excluding extreme outliers at the beginning, the method provides accurate outlier detection. Another possibility is to replace the first element with an already known element. This leads to a further possibility for applying the method: if only a single correct element is known, using that element as the reference element, with the method modified accordingly, can yield very accurate results.

Some model-based approaches demand a trained data set for correct output. In contrast, this method requires only one correct element to produce a correct output. In addition, it is possible to use multiple reference points and consider the best fitting: for example, (a) consider each point in the first x% (e.g., 10%) of data points as a reference point, or (b) consider all data points as reference points. Furthermore, it is important to distinguish the purposes of MMS and EMMS: MMS removes only the significant outliers, while EMMS removes the nonsignificant outliers. Depending on the requirement, MMS and/or EMMS can be used to remove outliers.

The results show that the new method is a good solution for managing missing values. Figure 10 shows two data sets with 1000 elements each. Each data set contains 50-, 100-, 100-, and 50-element (total 300) missing-value regions. When the first element was not an outlier, the new method was able to identify all the elements related to the line with 0% error.
In the real world, it is not possible to find nonoutliers that exactly agree with a linear regression; therefore, 100% accuracy is inapplicable. However, it is very important to have a significant-outlier-free data set, and the new method guaranteed a significant-outlier-free data set when the outliers were non-Gaussian. Furthermore, in real-world situations data outliers do not always follow a Gaussian distribution; because of that, we expect the new method to be applicable to the majority of outlier detection applications. Our new method is an effective alternative here, since the most common methods, LSM and the sigma filter, need Gaussian outliers. Some methods, like the sigma filter, cannot be applied directly to a certain data segment, and further segmentation (windowing) is required for better results. In contrast, the new method is capable of locating nonoutliers automatically, in increment, decrement, or constant form, regardless of the size of the window.
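To illustrate the windowing point: Lee's sigma filter [11] averages only the neighbours that lie within a fixed multiple of sigma of the current value, so an isolated large spike keeps only itself as a neighbour and survives smoothing. A minimal sketch of our understanding (parameter names are ours):

```python
def sigma_filter(x, half_window=1, sigma=1.0, n_sigma=2.0):
    """Lee-style sigma filter: replace each value by the mean of the
    neighbouring values (within half_window) that differ from it by at
    most n_sigma * sigma."""
    out = []
    for i, v in enumerate(x):
        lo, hi = max(0, i - half_window), min(len(x), i + half_window + 1)
        near = [u for u in x[lo:hi] if abs(u - v) <= n_sigma * sigma]
        out.append(sum(near) / len(near))
    return out
```

For example, `sigma_filter([1.0, 1.0, 10.0, 1.0, 1.0])` leaves the spike at 10.0 untouched, which is why such significant outliers should be removed before smoothing.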
Results related to the biogas data confirmed the abovementioned idea and showed that the algorithm clearly identifies three regions: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers within a data segment (Figure 11). In addition, the results showed that the nonoutliers follow a linear path. Furthermore, the width of the regions can be tuned by changing the relevant R_w values. Figure 11 shows some selected results of the biogas data for a k value of 0.2 for MMS and a k value of 0.1 for EMMS.
One of the interesting observations was the ability of the algorithm to continue linear detection even with noncontinuous clusters (Figures 11(b) and 11(e)). In all data segments there occurred no false detections (there were no outliers in nonoutlier regions and vice versa). Most importantly, the new method required no further windowing, and nonoutliers were detected independent of the window size.

When the general trend was constant and the elements were in a Gaussian distribution, the Sigma filter and LSM were able to identify the linear trend. However, for series with biased elements, both methods failed to identify the general trend. When the general trend was increment or decrement, the Sigma filter failed to identify the general trend (a further
[Plots omitted. Each panel shows the H2 content of biogas (ppm) versus index: (a) 334 elements; (b) 372 elements; (c) 350 elements; (d) 352 elements; (e) 356 elements; (f) 372 elements.]

Figure 11: Results related to real biogas data with data sets of different sizes. The first element is the reference element, which is assumed not to be an outlier. The results showed that the algorithm clearly identifies three regions: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers within each data segment. Most importantly, all the nonoutliers lay within a linear border. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.2 and 0.1, respectively.
segment would give a better result, but we used the whole window). The new method was capable of locating 4% to 45% of elements as outliers with 0% error. Grubbs' test was capable of identifying only a very small proportion of elements as outliers (0%–1.7%), even with a significance level of 0.05. However, all the outliers it found were significant, and no wrong detections were reported.
4. Conclusions and Outlook

This paper introduced a new outlier detection method using the relation between the sums of the elements of an arithmetic progression. The results of this work prove that the new method is a robust solution for outlier detection in a data set with missing elements. The method is capable of identifying
both significant and nonsignificant outliers when the first value of the data set is not an outlier. Most importantly, the method is a solution for identifying significant outliers in a series whose outliers follow a non-Gaussian distribution. In addition, the outlier detection is nonparametric, has floor and ceiling values, and does not require standardization. When the reference elements are unknown, the method can be used with multiple reference elements to gain optimal output.

If the frequency of the data is sufficient, any nonlinear relation can be represented as a combination of straight lines. Therefore, by using a suitable segmentation technique, it is possible to identify outliers in any data series, which will allow for detecting outliers in a process-oriented data set. Thus, to bring a data series into a form suitable for our method, an intelligent segmentation technique is necessary.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The University of Ruhuna, Sri Lanka, provided the paper processing charges of this paper. The German Academic Exchange Service (Deutscher Akademischer Austauschdienst) financed this work.
References
[1] R. J. Beckman and R. D. Cook, "Outlier…s," Technometrics, vol. 25, no. 2, pp. 119–149, 1983.
[2] F. Molinari, "Missing treatments," Journal of Business and Economic Statistics, vol. 28, no. 1, pp. 82–95, 2010.
[3] J. Qui, B. Zhang, and D. H. Y. Leung, "Empirical likelihood in missing data problems," Journal of the American Statistical Association, vol. 104, no. 488, pp. 1492–1503, 2009.
[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.
[5] Z. X. Niu, S. Shi, J. Sun, and X. He, "A survey of outlier detection methodologies and their applications," in Artificial Intelligence and Computational Intelligence, vol. 7002 of Lecture Notes in Computer Science, pp. 380–387, 2011.
[6] W. Jin, A. K. H. Tung, and J. Han, "Mining top-n local outliers in large databases," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 293–298, ACM, San Francisco, Calif, USA, August 2001.
[7] H.-P. Kriegel, M. S. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 444–452, ACM, Las Vegas, Nev, USA, August 2008.
[8] I. Ben-Gal, "Outlier detection," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 131–146, Springer, New York, NY, USA, 2005.
[9] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, December 2002.
[10] H. Fan, O. R. Zaïane, A. Foss, and J. Wu, "A nonparametric outlier detection for effectively discovering top-N outliers from engineering data," in Advances in Knowledge Discovery and Data Mining, W.-K. Ng, M. Kitsuregawa, J. Li, and K. Chang, Eds., vol. 3918 of Lecture Notes in Computer Science, pp. 557–566, Springer, Berlin, Germany, 2006.
[11] J.-S. Lee, "Digital image smoothing and the sigma filter," Computer Vision, Graphics, and Image Processing, vol. 24, no. 2, pp. 255–269, 1983.
[12] A. Gelb, Applied Optimal Estimation, MIT Press, 1974.
[13] D. Sierociuk, I. Tejado, and B. M. Vinagre, "Improved fractional Kalman filter and its application to estimation over lossy networks," Signal Processing, vol. 91, no. 3, pp. 542–552, 2011.
[14] P. H. Abreu, J. Xavier, D. C. Silva, L. P. Reis, and M. Petry, "Using Kalman filters to reduce noise from RFID location system," The Scientific World Journal, vol. 2014, Article ID 796279, 9 pages, 2014.
[15] H. Liu, S. Shah, and W. Jiang, "On-line outlier detection and data cleaning," Computers and Chemical Engineering, vol. 28, no. 9, pp. 1635–1647, 2004.
[16] T. D. Pigott, "A review of methods for missing data," Educational Research and Evaluation, vol. 7, no. 4, pp. 353–383, 2001.
[17] M. Nakai and W. Ke, "Review of the methods for handling missing data in longitudinal data analysis," International Journal of Mathematical Analysis, vol. 5, no. 1–4, pp. 1–13, 2011.
[18] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.
[19] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, L. House, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Germany, 2004.
[20] A. N. Baraldi and C. K. Enders, "An introduction to modern missing data analyses," Journal of School Psychology, vol. 48, no. 1, pp. 5–37, 2010.
[21] J. Tian, B. Yu, D. Yu, and S. Ma, "Clustering-based multiple imputation via gray relational analysis for missing data and its application to aerospace field," The Scientific World Journal, vol. 2013, Article ID 720392, 10 pages, 2013.
[22] S. Zhaowei, Z. Lingfeng, M. Shangjun, F. Bin, and Z. Taiping, "Incomplete time series prediction using max-margin classification of data with absent features," Mathematical Problems in Engineering, vol. 2010, Article ID 513810, 14 pages, 2010.
[23] J.-F. Cai, Z. Shen, and G.-B. Ye, "Approximation of frame based missing data recovery," Applied and Computational Harmonic Analysis, vol. 31, no. 2, pp. 185–204, 2011.
[24] B. L. Wiens and G. K. Rosenkranz, "Missing data in noninferiority trials," Statistics in Biopharmaceutical Research, vol. 5, no. 4, pp. 383–393, 2013.
[25] Aryabhata, The Aryabhatiya of Aryabhata: An Ancient Indian Work on Mathematics and Astronomy, vol. 1, Kessinger Publishing, 2006.
[26] F. E. Grubbs, "Procedures for detecting outlying observations in samples," Technometrics, vol. 11, no. 1, pp. 1–21, 1969.
[27] B. Rosner, "On the detection of many outliers," Technometrics, vol. 17, no. 2, pp. 221–227, 1975.
[28] R. B. Jain, "Percentage points of many-outlier detection procedures," Technometrics, vol. 23, no. 1, pp. 71–75, 1981.
[29] S. P. Verma, L. Díaz-González, M. Rosales-Rivera, and A. Quiroz-Ruiz, "Comparative performance of four single extreme outlier discordancy tests from Monte Carlo simulations," The Scientific World Journal, vol. 2014, Article ID 746451, 27 pages, 2014.
The Scientific World Journal
[Figure 8: six panels, (a)–(f), plotting Value against Index for ten-element data sets: (a) increment, non-Gaussian outliers; (b) decrement, non-Gaussian; (c) constant, non-Gaussian; (d) increment, Gaussian; (e) decrement, Gaussian; (f) constant, Gaussian.]
Figure 8: Outlier detection from data sets with ten elements. The first element is the reference element, which is not an outlier. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is not an outlier, the new method is capable of locating all outliers. When the outliers are Gaussian, MMS automatically becomes inactive (there are then no significant outliers) ((d), (e), (f)).
and MMS automatically became inactive (Figures 8(d), 8(e), and 8(f)). When the first few elements were outliers and the outliers were non-Gaussian, MMS detected the significant outliers correctly (Figures 9(a), 9(b), and 9(c)). However, EMMS was unable to locate the nonsignificant outliers when the first element for EMMS was an outlier (Figures 9(a) and 9(c)). If the reference element for EMMS was not an outlier, it was still possible to achieve correct results (Figure 9(b)). Though it was impossible to locate all nonoutliers, the detected nonoutliers were 100% correct detections. These values can be used to estimate the other values using methods such as LSM, since all the remaining data are cleaned. In
[Figure 9: six panels, (a)–(f), plotting Value against Index for ten-element data sets: (a) increment, non-Gaussian outliers; (b) decrement, non-Gaussian; (c) constant, non-Gaussian; (d) increment, Gaussian; (e) decrement, Gaussian; (f) constant, Gaussian.]
Figure 9: Outlier detection from data sets with ten elements. The first element is the reference element, which is an outlier. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, green squares to nonoutliers, and black arrows to wrong detections. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. When the reference (first) element is an outlier and the outliers are non-Gaussian, the new method identifies only the significant outliers ((a), (b), (c)). When the outliers are Gaussian, MMS automatically becomes inactive (there are then no significant outliers) ((d), (e), (f)).
general, it is fair to state that (1) when the reference element is not an outlier, the method is capable of identifying all outliers, and (2) when the first few elements of the series are outliers and the outliers are non-Gaussian, the method is capable of identifying only the significant outliers and part of the correct elements.
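The core test behind this behavior can be sketched in a few lines: for an outlier-free arithmetic progression of n elements, the ratio R of the sum of the minimum and maximum elements to the sum of all elements equals 2/n; R < 2/n points at the minimum and R > 2/n at the maximum. The following is only a minimal Python sketch of that single check (the function name and tolerance are ours, and a series with a positive sum is assumed), not the authors' full MMS/EMMS implementation:

```python
def ratio_check(values, tol=1e-9):
    """Arithmetic-progression ratio test R = (min + max) / sum.

    For an outlier-free arithmetic progression of n elements,
    R equals 2/n exactly; R < 2/n suggests the minimum is an
    outlier, and R > 2/n suggests the maximum is one.
    """
    n = len(values)
    r = (min(values) + max(values)) / sum(values)
    expected = 2 / n
    if abs(r - expected) <= tol:
        return "no outlier indicated"
    return "minimum suspect" if r < expected else "maximum suspect"

print(ratio_check([10, 20, 30, 40, 50]))   # clean progression: no outlier indicated
print(ratio_check([10, 20, 30, 40, 500]))  # inflated maximum: maximum suspect
```

In the paper, a flagged extreme is removed, the resulting gap is managed, and the test is repeated; this sketch does not attempt that loop.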
When the first few elements (the reference elements for both MMS and EMMS) were outliers and the outlier distribution was Gaussian, outlier detection was poor (Figures 9(d), 9(e), and 9(f)). Due to the Gaussian distribution of outliers, MMS was inactive, and it was not possible to identify the large outliers. Most importantly, the results highlighted
[Figure 10: two panels plotting Value against Index for 1000-element data sets: (a) outliers in a non-Gaussian distribution; (b) outliers in a nearly Gaussian distribution.]
Figure 10: Two artificial data samples with 1000 elements each, including 50-, 100-, 100-, and 50-element (300 in total) missing-value regions. The first element is the reference element, which is not an outlier. Panel (a) corresponds to a data set with outliers in a non-Gaussian distribution and panel (b) to a data set with outliers in a nearly Gaussian distribution. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. The new method was able to identify all the elements related to the line with 0% error.
the importance of the reference element. If the reference element for MMS and EMMS was not an outlier, it guaranteed good results regardless of other factors.
In the methodology, we derived the method based on the first element. However, it is also possible to use any other element as the reference point and modify the method accordingly. We considered the simplest situation, in which the first element is not an outlier. Therefore, if it is possible to segment the data so that extreme outliers at the beginning are excluded, accurate outlier detection is obtained. Another possibility is to replace the first element with an already known element. This leads to a further way of applying the method: if only a single correct element is known, using that element as the reference element, with the method modified accordingly, can yield very accurate results.
Some model-based approaches demand a training data set for correct output. In contrast, this method requires only one correct element to produce a correct output. In addition, it is possible to use multiple reference points and select the best fit: for example, (a) consider each point in the first x% (e.g., 10%) of the data points as a reference point, or (b) consider every data point as a reference point. Furthermore, it is important to distinguish the purposes of MMS and EMMS: MMS removes only the significant outliers, while EMMS removes the nonsignificant outliers. Depending on the requirement, MMS and/or EMMS can be used to remove outliers.
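Option (a) above can be sketched as follows. The detector here is a hypothetical stand-in (a simple deviation-from-line rule with an assumed slope), not the paper's MMS/EMMS; the point is only the selection loop, which keeps the candidate reference that leaves the largest consistent (non-outlier) set:

```python
def flag_outliers(series, ref, slope, k=0.5):
    """Hypothetical stand-in for an MMS/EMMS-style detector: flags
    elements deviating from the line through the reference element
    (index ref, given slope) by more than k times the expected value."""
    flags = []
    for i, v in enumerate(series):
        expected = series[ref] + slope * (i - ref)
        flags.append(abs(v - expected) > k * max(abs(expected), 1.0))
    return flags

def best_reference(series, slope, candidates):
    """Try each candidate reference element and keep the one whose
    detection run leaves the most unflagged (consistent) elements."""
    return max(candidates, key=lambda r: flag_outliers(series, r, slope).count(False))

# Linear trend 100, 200, ..., 800 with two planted outliers.
data = [100, 5000, 300, 400, 500, 600, -2000, 800]
ref = best_reference(data, slope=100, candidates=range(3))
print(ref, flag_outliers(data, ref, slope=100))
# 0 [False, True, False, False, False, False, True, False]
```

Choosing among the first few indices mirrors suggestion (a); passing `range(len(data))` would mirror suggestion (b).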
The results show that the new method is a good solution for managing missing values. Figure 10 shows two data sets with 1000 elements each. Each data set contains 50-, 100-, 100-, and 50-element (300 in total) missing-value regions. When the first element was not an outlier, the new method was able to identify all the elements related to the line with 0% error.
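One way to picture transformation technique (2) from the abstract (mapping the data to a constant value that is unaffected by missing or removed elements) is the slope-to-reference transform below. This is our own illustrative construction under the stated assumption of a linear series, not necessarily the authors' exact formula; gaps are represented as None:

```python
def to_constant(series, ref_index=0):
    """Map each present element of a linear series to
    (y_i - y_ref) / (i - ref_index). For points on the underlying
    line, the result is a constant (the slope), regardless of how
    many elements in between are missing (None)."""
    ref = series[ref_index]
    out = {}
    for i, v in enumerate(series):
        if v is None or i == ref_index:
            continue
        out[i] = (v - ref) / (i - ref_index)
    return out

# Two missing elements; every surviving point maps to the slope 100.0.
print(to_constant([100, 200, None, None, 500, 600]))
# {1: 100.0, 4: 100.0, 5: 100.0}
```

A deviating element would map to a value away from the shared constant, which makes it easy to screen even across gaps.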
In the real world, it is not possible to find nonoutliers that exactly agree with a linear regression. Therefore, 100% accuracy is unattainable. However, it is very important to have a data set that is free of significant outliers, and the new method guaranteed such a data set when the outliers were non-Gaussian. Furthermore, in real-world situations, data outliers do not always follow a Gaussian distribution. We therefore expect the new method to be applicable to the majority of outlier detection applications. Our new method is an effective alternative where the most common approaches, LSM and the sigma filter, require Gaussian outliers. Some methods, such as the sigma filter, cannot be applied directly to an arbitrary data segment, and further segmentation (windowing) is required for better results. In contrast, the new method is capable of locating nonoutliers automatically, in increment, decrement, or constant form, regardless of the size of the window.
Results related to the biogas data proved the abovementioned idea and showed that the algorithm clearly identifies three regions within a data segment: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers (Figure 11). In addition, the results showed that the nonoutliers follow a linear path. Furthermore, the width of the regions can be tuned by changing the relevant R_w values. Figure 11 shows some selected results for the biogas data with a k value of 0.2 for MMS and a k value of 0.1 for EMMS.
One of the interesting observations was the ability of the algorithm to continue linear detection even with the noncontinuous clusters (Figures 11(b) and 11(e)). In all data segments, no false detection occurred (there were no outliers in nonoutlier regions and vice versa). Most importantly, the new method required no further windowing, and nonoutliers were detected independently of the window size.
When the general trend was constant and the elements were in a Gaussian distribution, the sigma filter and LSM were able to identify the linear trend. However, for series with biased elements, both methods failed to identify the general trend. When the general trend was increment or decrement, the sigma filter failed to identify the general trend (a further
[Figure 11: six panels plotting H2 content of biogas (ppm) against Index for real biogas data segments: (a) 334 elements; (b) 372 elements; (c) 350 elements; (d) 352 elements; (e) 356 elements; (f) 372 elements.]
Figure 11: Results for real biogas data with data sets of different sizes. The first element is the reference element, which is assumed not to be an outlier. The results showed that the algorithm clearly identifies three regions within each data segment: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers. Most importantly, all the nonoutliers lay within a linear border. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.2 and 0.1, respectively.
segment would give a better result, but we used the whole window). The new method was capable of locating 4% to 45% of the elements as outliers with 0% error. Grubbs' test was capable of identifying only a very small amount of the elements as outliers (0%–1.7%), even with a significance level of 0.05. However, all detected outliers were significant, and no wrong detections were reported.
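For reference, the Grubbs baseline flags at most one extreme value per pass using the statistic G = max_i |x_i - x̄| / s, where s is the sample standard deviation. A minimal computation of G:

```python
import math

def grubbs_statistic(values):
    """Grubbs' test statistic G = max_i |x_i - mean| / s,
    computed with the sample standard deviation (n - 1 divisor)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return max(abs(v - mean) for v in values) / s

# One planted extreme in an otherwise tight sample.
print(round(grubbs_statistic([8, 9, 10, 10, 11, 50]), 3))  # 2.037
```

To declare an outlier, G must then be compared with the Grubbs critical value derived from the t distribution for the given n and significance level, which is omitted here.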
4. Conclusions and Outlook
This paper introduced a new outlier detection method using the relation of the sum of the elements of an arithmetic progression. The results of this work prove that the new method is a robust solution for outlier detection in a data set with missing elements. The method is capable of identifying
both significant and nonsignificant outliers when the first value of the data set is not an outlier. Most importantly, the method is a solution for identifying significant outliers in a series with outliers in a non-Gaussian distribution. In addition, the outlier detection is nonparametric, has floor and ceiling values, and does not require standardization. When the reference elements are unknown, the method can be used with multiple reference elements to gain optimal output.
If the frequency of the data is sufficient, any nonlinear relation can be represented as a combination of straight lines. Therefore, by using a suitable segmentation technique, it is possible to identify outliers in any data series. This will allow for detecting outliers in a process-oriented data set. To bring a data series into a form that is suitable for our method, an intelligent segmentation technique is therefore necessary.
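The segmentation idea can be illustrated with a naive fixed-window scheme, our own stand-in for the "intelligent segmentation" called for above: a nonlinear series that fails an arithmetic-progression check globally can pass it segment by segment when the sampling is dense enough.

```python
def nearly_arithmetic(seg, tol):
    """True when consecutive differences are almost constant,
    i.e., the segment is close to an arithmetic progression."""
    d = [b - a for a, b in zip(seg, seg[1:])]
    return max(d) - min(d) <= tol

def fixed_windows(series, window):
    """Naive fixed-size segmentation; a stand-in for smarter schemes."""
    return [series[i:i + window] for i in range(0, len(series) - window + 1, window)]

ys = [0.5 * x * x for x in range(40)]  # quadratic, i.e., nonlinear overall
print(nearly_arithmetic(ys, tol=5.0))                                    # False
print(all(nearly_arithmetic(s, tol=5.0) for s in fixed_windows(ys, 5)))  # True
```

A practical scheme would place window boundaries adaptively (e.g., where curvature grows), but even this fixed split shows why dense sampling makes a linear-series outlier test locally applicable.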
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The University of Ruhuna, Sri Lanka, provided the paper processing charges for this paper. The German Academic Exchange Service (German: Deutscher Akademischer Austauschdienst) financed this work.
References
[1] R. J. Beckman and R. D. Cook, "Outlier…s," Technometrics, vol. 25, no. 2, pp. 119–149, 1983.
[2] F. Molinari, "Missing treatments," Journal of Business and Economic Statistics, vol. 28, no. 1, pp. 82–95, 2010.
[3] J. Qin, B. Zhang, and D. H. Y. Leung, "Empirical likelihood in missing data problems," Journal of the American Statistical Association, vol. 104, no. 488, pp. 1492–1503, 2009.
[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.
[5] Z. X. Niu, S. Shi, J. Sun, and X. He, "A survey of outlier detection methodologies and their applications," in Artificial Intelligence and Computational Intelligence, vol. 7002 of Lecture Notes in Computer Science, pp. 380–387, 2011.
[6] W. Jin, A. K. H. Tung, and J. Han, "Mining top-n local outliers in large databases," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 293–298, ACM, San Francisco, Calif, USA, August 2001.
[7] H.-P. Kriegel, M. S. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 444–452, ACM, Las Vegas, Nev, USA, August 2008.
[8] I. Ben-Gal, "Outlier detection," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 131–146, Springer, New York, NY, USA, 2005.
[9] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, December 2002.
[10] H. Fan, O. R. Zaïane, A. Foss, and J. Wu, "A nonparametric outlier detection for effectively discovering top-N outliers from engineering data," in Advances in Knowledge Discovery and Data Mining, W.-K. Ng, M. Kitsuregawa, J. Li, and K. Chang, Eds., vol. 3918 of Lecture Notes in Computer Science, pp. 557–566, Springer, Berlin, Germany, 2006.
[11] J.-S. Lee, "Digital image smoothing and the sigma filter," Computer Vision, Graphics, and Image Processing, vol. 24, no. 2, pp. 255–269, 1983.
[12] A. Gelb, Applied Optimal Estimation, MIT Press, 1974.
[13] D. Sierociuk, I. Tejado, and B. M. Vinagre, "Improved fractional Kalman filter and its application to estimation over lossy networks," Signal Processing, vol. 91, no. 3, pp. 542–552, 2011.
[14] P. H. Abreu, J. Xavier, D. C. Silva, L. P. Reis, and M. Petry, "Using Kalman filters to reduce noise from RFID location system," The Scientific World Journal, vol. 2014, Article ID 796279, 9 pages, 2014.
[15] H. Liu, S. Shah, and W. Jiang, "On-line outlier detection and data cleaning," Computers and Chemical Engineering, vol. 28, no. 9, pp. 1635–1647, 2004.
[16] T. D. Pigott, "A review of methods for missing data," Educational Research and Evaluation, vol. 7, no. 4, pp. 353–383, 2001.
[17] M. Nakai and W. Ke, "Review of the methods for handling missing data in longitudinal data analysis," International Journal of Mathematical Analysis, vol. 5, no. 1–4, pp. 1–13, 2011.
[18] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.
[19] E. Acuña and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, L. House, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Germany, 2004.
[20] A. N. Baraldi and C. K. Enders, "An introduction to modern missing data analyses," Journal of School Psychology, vol. 48, no. 1, pp. 5–37, 2010.
[21] J. Tian, B. Yu, D. Yu, and S. Ma, "Clustering-based multiple imputation via gray relational analysis for missing data and its application to aerospace field," The Scientific World Journal, vol. 2013, Article ID 720392, 10 pages, 2013.
[22] S. Zhaowei, Z. Lingfeng, M. Shangjun, F. Bin, and Z. Taiping, "Incomplete time series prediction using max-margin classification of data with absent features," Mathematical Problems in Engineering, vol. 2010, Article ID 513810, 14 pages, 2010.
[23] J.-F. Cai, Z. Shen, and G.-B. Ye, "Approximation of frame based missing data recovery," Applied and Computational Harmonic Analysis, vol. 31, no. 2, pp. 185–204, 2011.
[24] B. L. Wiens and G. K. Rosenkranz, "Missing data in noninferiority trials," Statistics in Biopharmaceutical Research, vol. 5, no. 4, pp. 383–393, 2013.
[25] Aryabhata, The Aryabhatiya of Aryabhata: An Ancient Indian Work on Mathematics and Astronomy, vol. 1, Kessinger Publishing, 2006.
[26] F. E. Grubbs, "Procedures for detecting outlying observations in samples," Technometrics, vol. 11, no. 1, pp. 1–21, 1969.
[27] B. Rosner, "On the detection of many outliers," Technometrics, vol. 17, no. 2, pp. 221–227, 1975.
[28] R. B. Jain, "Percentage points of many-outlier detection procedures," Technometrics, vol. 23, no. 1, pp. 71–75, 1981.
[29] S. P. Verma, L. Díaz-González, M. Rosales-Rivera, and A. Quiroz-Ruiz, "Comparative performance of four single extreme outlier discordancy tests from Monte Carlo simulations," The Scientific World Journal, vol. 2014, Article ID 746451, 27 pages, 2014.
[21] J Tian B Yu D Yu and S Ma ldquoClustering-based multipleimputation via gray relational analysis for missing data and itsapplication to aerospace fieldrdquoThe Scientific World Journal vol2013 Article ID 720392 10 pages 2013
[22] S Zhaowei Z Lingfeng M Shangjun F Bin and Z TaipingldquoIncomplete time series prediction using max-margin classifi-cation of data with absent featuresrdquo Mathematical Problems inEngineering vol 2010 Article ID 513810 14 pages 2010
[23] J-F Cai Z Shen and G-B Ye ldquoApproximation of frame basedmissing data recoveryrdquo Applied and Computational HarmonicAnalysis vol 31 no 2 pp 185ndash204 2011
[24] B L Wiens and G K Rosenkranz ldquoMissing data in noninferi-ority trialsrdquo Statistics in Biopharmaceutical Research vol 5 no4 pp 383ndash393 2013
[25] Aryabhata The Aryabhatiya of Aryabhata An Ancient IndianWork onMathematics and Astronomy vol 1 Kessinger Publish-ing 2006
[26] F E Grubbs ldquoProcedures for detecting outlying observations insamplesrdquo Technometrics vol 11 no 1 pp 1ndash21 1969
[27] B Rosner ldquoOn the detection of many outliersrdquo Technometricsvol 17 no 2 pp 221ndash227 1975
[28] R B Jain ldquoPercentage points of many-outlier detection proce-duresrdquo Technometrics vol 23 no 1 pp 71ndash75 1981
[29] S P Verma L Dıaz-Gonzalez M Rosales-Rivera and AQuiroz-Ruiz ldquoComparative performance of four single extremeoutlier discordancy tests from Monte Carlo simulationsrdquo TheScientific World Journal vol 2014 Article ID 746451 27 pages2014
The Scientific World Journal

Figure 10: Two artificial data samples with 1000 elements each, including missing-value regions of 50, 100, 100, and 50 elements (300 in total); x-axis: index, y-axis: value. The first element is the reference element, which is not an outlier. (a) corresponds to a data set with non-Gaussian outliers; (b) corresponds to a data set with nearly Gaussian outliers. Red triangles correspond to outliers detected by MMS, yellow circles to outliers detected by EMMS, and green squares to nonoutliers. The value of k for MMS and EMMS is 0.5 and 0.01, respectively. The new method was able to identify all the elements related to the line with 0% error.
the importance of the reference element. If the reference element for MMS and EMMS was not an outlier, good results were guaranteed despite other factors.
In the methodology, we derived the method based on the first element. However, it is also possible to use any other element as the reference point and modify the method accordingly. We considered the simplest situation, where the first element is not an outlier. Therefore, if it is possible to segment the data such that extreme outliers are excluded at the beginning, the method provides accurate outlier detection. Another possibility is to replace the first element with an already known element. This leads to a further way of applying the method: if only a single correct element is known, using that element as the reference element, with the method modified accordingly, can yield very accurate results.
Some model-based approaches demand a trained data set for correct output. In contrast, this method requires only one correct element to produce a correct output. In addition, it is possible to use multiple reference points and take the best fit: for example, (a) consider each point in the first x% (e.g., 10%) of the data points as a reference point, or (b) consider every data point as a reference point. Furthermore, it is important to distinguish the purposes of MMS and EMMS: MMS removes only the significant outliers, while EMMS removes nonsignificant outliers. Depending on the requirement, MMS and/or EMMS can be used to remove outliers.
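The core ratio test stated in the abstract can be sketched in a few lines. The sketch below is a minimal illustration, not the paper's MMS/EMMS implementation: the function name and the tolerance parameter are our own, and the k-controlled thresholds of MMS and EMMS are not reproduced here.

```python
def ratio_outlier(values, tol=1e-9):
    """Flag one candidate outlier in a series that is expected to form an
    arithmetic progression, using the ratio R = (min + max) / sum.

    For a clean progression with nonzero sum, R equals 2/n exactly;
    R > 2/n points at the maximum, R < 2/n at the minimum.
    `tol` is an illustrative tolerance, not the paper's k threshold.
    Returns the index of the suspect element, or None.
    """
    n = len(values)
    total = sum(values)
    if total == 0:
        raise ValueError("ratio undefined for a zero-sum series")
    r = (min(values) + max(values)) / total
    target = 2.0 / n
    if r > target + tol:
        return values.index(max(values))  # maximum deviates upward
    if r < target - tol:
        return values.index(min(values))  # minimum deviates downward
    return None  # consistent with an arithmetic progression
```

For the progression 1, 2, ..., 10 the ratio is (1 + 10)/55 = 0.2 = 2/10, so nothing is flagged; replacing the fifth element with 100 raises R to 101/150 and the maximum is flagged.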
The results show that the new method is a good solution for managing missing values. Figure 10 shows two data sets with 1000 elements each. Each data set contains missing-value regions of 50, 100, 100, and 50 elements (300 in total). When the first element was not an outlier, the new method was able to identify all the elements related to the line with 0% error.
In the real world, it is not possible to find nonoutliers that agree exactly with a linear regression, so 100% accuracy is not attainable. However, it is very important to obtain a data set free of significant outliers. The new method guaranteed a significant-outlier-free data set when the outliers were non-Gaussian. Furthermore, in real-world situations, outliers do not always follow a Gaussian distribution. We therefore expect the new method to be applicable to the majority of outlier detection applications. Our new method is an effective solution because the most common approaches, such as LSM and the sigma filter, require Gaussian-distributed outliers. Some methods, like the sigma filter, cannot be applied directly to a given data segment, and further segmentation (windowing) is required for better results. In contrast, the new method is capable of locating nonoutliers automatically in incrementing, decrementing, or constant series, regardless of the size of the window.
Results on the biogas data supported the abovementioned idea and showed that the algorithm clearly identifies three regions within a data segment: significant outliers (outliers from MMS), nonsignificant outliers (outliers from EMMS), and nonoutliers (Figure 11). In addition, the results showed that the nonoutliers follow a linear path. Furthermore, the width of the regions can be tuned by changing the relevant R_w values. Figure 11 shows selected results for the biogas data with a k value of 0.2 for MMS and 0.1 for EMMS.

One interesting observation was the ability of the algorithm to continue linear detection even with noncontinuous clusters (Figures 11(b) and 11(e)). No false detections occurred in any data segment (there were no outliers in nonoutlier regions and vice versa). Most importantly, the new method required no further windowing, and nonoutliers were detected independent of the window size.
When the general trend was constant and the elements were in a Gaussian distribution, the sigma filter and LSM were able to identify the linear trend. However, for series with biased elements, both methods failed to identify the general trend. When the general trend was incrementing or decrementing, the sigma filter failed to identify the general trend (a further segmentation would give a better result, but we used the whole window). The new method was capable of locating 4% to 45% of elements as outliers with 0% error. Grubbs' test was capable of identifying only a very small fraction of elements as outliers (0%–1.7%), even with a significance level of 0.05. However, all outliers it identified were significant, and no wrong detections were reported.

Figure 11: Results for real biogas data with data sets of different sizes: (a) 334 elements, (b) 372 elements, (c) 350 elements, (d) 352 elements, (e) 356 elements, (f) 372 elements; x-axis: index, y-axis: H2 content of biogas (ppm). The first element is the reference element, which is assumed not to be an outlier. The algorithm clearly identifies three regions within each data segment: significant outliers (red triangles, detected by MMS), nonsignificant outliers (yellow circles, detected by EMMS), and nonoutliers (green squares). Most importantly, all the nonoutliers lie within a linear border. The value of k for MMS and EMMS is 0.2 and 0.1, respectively.
4 Conclusions and Outlook
This paper introduced a new outlier detection method using the relation among the sums of the elements of an arithmetic progression. The results of this work show that the new method is a robust solution for outlier detection in a data set with missing elements. The method is capable of identifying both significant and nonsignificant outliers when the first value of the data set is not an outlier. Most importantly, the method offers a solution for identifying significant outliers in a series whose outliers are non-Gaussian. In addition, the outlier detection is nonparametric, has floor and ceiling values, and does not require standardization. When the reference elements are unknown, the method can be used with multiple reference elements to obtain optimal output.
If the frequency of the data is sufficient, any nonlinear relation can be represented as a combination of straight lines. Therefore, by using a suitable segmentation technique, it is possible to identify outliers in any data series; this would allow detection of outliers in a process-oriented data set. To bring a data series into a form that is suitable for our method, an intelligent segmentation technique is therefore necessary.
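The idea above can be illustrated with a naive fixed-width windowing: cut the series into segments and apply the abstract's ratio test per segment. This is only a crude stand-in, under our own assumptions, for the intelligent segmentation the conclusion calls for; the window width and all names here are hypothetical.

```python
def _ratio_flag(seg, tol=1e-9):
    # Ratio test from the abstract: R = (min + max) / sum equals 2/n
    # for an arithmetic progression; deviations point at min or max.
    n, total = len(seg), sum(seg)
    if total == 0:
        return None  # ratio undefined for a zero-sum window; skip it
    r = (min(seg) + max(seg)) / total
    if r > 2.0 / n + tol:
        return seg.index(max(seg))
    if r < 2.0 / n - tol:
        return seg.index(min(seg))
    return None

def windowed_outliers(values, width=20):
    """Apply the ratio test to fixed-width segments and collect the
    global indices of flagged elements (one suspect per window)."""
    flagged = []
    for start in range(0, len(values), width):
        seg = values[start:start + width]
        if len(seg) >= 3:  # ratio test is meaningless on tiny windows
            idx = _ratio_flag(seg)
            if idx is not None:
                flagged.append(start + idx)
    return flagged
```

For a 40-point linear series with a single spike injected at index 25 and a window width of 20, the spike is the only element flagged.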
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The University of Ruhuna, Sri Lanka, provided the paper processing charges of this paper. The German Academic Exchange Service (Deutscher Akademischer Austauschdienst) financed this work.
References
[1] R. J. Beckman and R. D. Cook, "Outlier…s," Technometrics, vol. 25, no. 2, pp. 119–149, 1983.
[2] F. Molinari, "Missing treatments," Journal of Business and Economic Statistics, vol. 28, no. 1, pp. 82–95, 2010.
[3] J. Qin, B. Zhang, and D. H. Y. Leung, "Empirical likelihood in missing data problems," Journal of the American Statistical Association, vol. 104, no. 488, pp. 1492–1503, 2009.
[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.
[5] Z. X. Niu, S. Shi, J. Sun, and X. He, "A survey of outlier detection methodologies and their applications," in Artificial Intelligence and Computational Intelligence, vol. 7002 of Lecture Notes in Computer Science, pp. 380–387, 2011.
[6] W. Jin, A. K. H. Tung, and J. Han, "Mining top-n local outliers in large databases," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 293–298, ACM, San Francisco, Calif, USA, August 2001.
[7] H.-P. Kriegel, M. S. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 444–452, ACM, Las Vegas, Nev, USA, August 2008.
[8] I. Ben-Gal, "Outlier detection," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 131–146, Springer, New York, NY, USA, 2005.
[9] G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, "A comparative study of RNN for outlier detection in data mining," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, December 2002.
[10] H. Fan, O. R. Zaïane, A. Foss, and J. Wu, "A nonparametric outlier detection for effectively discovering top-N outliers from engineering data," in Advances in Knowledge Discovery and Data Mining, W.-K. Ng, M. Kitsuregawa, J. Li, and K. Chang, Eds., vol. 3918 of Lecture Notes in Computer Science, pp. 557–566, Springer, Berlin, Germany, 2006.
[11] J.-S. Lee, "Digital image smoothing and the sigma filter," Computer Vision, Graphics, and Image Processing, vol. 24, no. 2, pp. 255–269, 1983.
[12] A. Gelb, Applied Optimal Estimation, MIT Press, 1974.
[13] D. Sierociuk, I. Tejado, and B. M. Vinagre, "Improved fractional Kalman filter and its application to estimation over lossy networks," Signal Processing, vol. 91, no. 3, pp. 542–552, 2011.
[14] P. H. Abreu, J. Xavier, D. C. Silva, L. P. Reis, and M. Petry, "Using Kalman filters to reduce noise from RFID location system," The Scientific World Journal, vol. 2014, Article ID 796279, 9 pages, 2014.
[15] H. Liu, S. Shah, and W. Jiang, "On-line outlier detection and data cleaning," Computers and Chemical Engineering, vol. 28, no. 9, pp. 1635–1647, 2004.
[16] T. D. Pigott, "A review of methods for missing data," Educational Research and Evaluation, vol. 7, no. 4, pp. 353–383, 2001.
[17] M. Nakai and W. Ke, "Review of the methods for handling missing data in longitudinal data analysis," International Journal of Mathematical Analysis, vol. 5, no. 1–4, pp. 1–13, 2011.
[18] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.
[19] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, L. House, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Germany, 2004.
[20] A. N. Baraldi and C. K. Enders, "An introduction to modern missing data analyses," Journal of School Psychology, vol. 48, no. 1, pp. 5–37, 2010.
[21] J. Tian, B. Yu, D. Yu, and S. Ma, "Clustering-based multiple imputation via gray relational analysis for missing data and its application to aerospace field," The Scientific World Journal, vol. 2013, Article ID 720392, 10 pages, 2013.
[22] S. Zhaowei, Z. Lingfeng, M. Shangjun, F. Bin, and Z. Taiping, "Incomplete time series prediction using max-margin classification of data with absent features," Mathematical Problems in Engineering, vol. 2010, Article ID 513810, 14 pages, 2010.
[23] J.-F. Cai, Z. Shen, and G.-B. Ye, "Approximation of frame based missing data recovery," Applied and Computational Harmonic Analysis, vol. 31, no. 2, pp. 185–204, 2011.
[24] B. L. Wiens and G. K. Rosenkranz, "Missing data in noninferiority trials," Statistics in Biopharmaceutical Research, vol. 5, no. 4, pp. 383–393, 2013.
[25] Aryabhata, The Aryabhatiya of Aryabhata: An Ancient Indian Work on Mathematics and Astronomy, vol. 1, Kessinger Publishing, 2006.
[26] F. E. Grubbs, "Procedures for detecting outlying observations in samples," Technometrics, vol. 11, no. 1, pp. 1–21, 1969.
[27] B. Rosner, "On the detection of many outliers," Technometrics, vol. 17, no. 2, pp. 221–227, 1975.
[28] R. B. Jain, "Percentage points of many-outlier detection procedures," Technometrics, vol. 23, no. 1, pp. 71–75, 1981.
[29] S. P. Verma, L. Díaz-González, M. Rosales-Rivera, and A. Quiroz-Ruiz, "Comparative performance of four single extreme outlier discordancy tests from Monte Carlo simulations," The Scientific World Journal, vol. 2014, Article ID 746451, 27 pages, 2014.
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World Journal 11
650
600
550
500
450
400
350
300
250
200
150
100
50
0
300250200150100500
H2
cont
ent o
f bio
gas (
ppm
)
Index
(a) Elements 334
55
50
45
40
35
30
25
20
15
10
5
H2
cont
ent o
f bio
gas (
ppm
)
3002001000
Index
(b) Elements 372
H2
cont
ent o
f bio
gas (
ppm
)
70
65
60
55
50
45
40
35
30
25
20
300 350250200150100500
Index
(c) Elements 350
H2
cont
ent o
f bio
gas (
ppm
)
300 350250200150100500
Index
95
90
85
80
75
70
65
60
55
50
45
40
35
(d) Elements 352
H2
cont
ent o
f bio
gas (
ppm
)
8580757065605550454035302520151050
300 350250200150100500
Index
(e) Elements 356
H2
cont
ent o
f bio
gas (
ppm
)
3002001000
Index
140
130
120
110
100
90
80
70
60
50
40
(f) Elements 372
Figure 11 Results related to real biogas data with different size of data sets The first element is the reference element which is assumed notto be an outlier Results showed that the algorithm clearly identifies three regions as significant outliers (outliers fromMMS) nonsignificantoutliers (outliers from EMMS) and nonoutliers within each data segment Most importantly all the nonoutliers lied within a linear borderwhere red triangle corresponds to outliers detected by MMS yellow circle corresponds to outliers detected by EMMS and green squarecorresponds to nonoutliers The value of k for MMS and EMMS is 02 and 01 respectively
segment would give better result but we used the wholewindow)The newmethodwas capable of locating 4 to 45of elements as outliers with 0 error Grubbsrsquo test was capableof identifying very small amount of elements as outliers(0ndash17) even with the significance level of 005 Howeverall outliers were significant and no wrong detections werereported
4 Conclusions and Outlook
This paper introduced a new outlier detection method usingthe relation of the sum of the elements of an arithmeticprogression The results of this work prove that the newmethod is a robust solution for outlier detection in a data setwith missing elements The method is capable of identifying
12 The Scientific World Journal
both significant and nonsignificant outliers when the firstvalue of the data set is not an outlier Most importantlythe method is a solution for identifying significant outliersin a series with outliers in non-Gaussian distribution Inaddition the outlier detection is nonparametric has floor andceiling values and does not require standardization Whenthe reference elements are unknown the method can be usedwith multiple reference elements to gain optimal output
If the frequency of the data is sufficient any nonlinearrelation can be represented as a combination of straight linesTherefore by using a suitable segmentation technique it ispossible to identify outliers in any data series This will allowfor detecting outliers in a process-oriented data setThereforeto bring a data series into a form that is suitable for ourmethod an intelligent segmentation technique is necessary
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The University of Ruhuna Sri Lanka provided the paperprocessing charges of this paper The German AcademicExchange Service (German Deutscher Akademischer Aus-tauschdienst) financed this work
References
[1] R J Beckman and R D Cook ldquoOutlier srdquoTechnometrics vol 25 no 2 pp 119ndash149 1983
[2] F Molinari ldquoMissing treatmentsrdquo Journal of Business andEconomic Statistics vol 28 no 1 pp 82ndash95 2010
[3] J Qui B Zhang and D H Y Leung ldquoEmpirical likelihoodin missing data problemsrdquo Journal of the American StatisticalAssociation vol 104 no 488 pp 1492ndash1503 2009
[4] V J Hodge and J Austin ldquoA survey of outlier detectionmethodologiesrdquo Artificial Intelligence Review vol 22 no 2 pp85ndash126 2004
[5] Z X Niu S Shi J Sun and X He ldquoA survey of outlier detectionmethodologies and their applicationsrdquo in Artificial Intelligenceand Computational Intelligence vol 7002 of Lecture Notes inComputer Science pp 380ndash387 2011
[6] W Jin A K H Tung and J Han ldquoMining top-n local outliersin large databasesrdquo in Proceedings of the 7th ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining (KDD rsquo01) pp 293ndash298 ACM San Francisco CalifUSA August 2001
[7] H-P Kriegel M S Schubert and A Zimek ldquoAngle-basedoutlier detection in high-dimensional datardquo in Proceedings ofthe 14th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining (KDD rsquo08) pp 444ndash452 ACM LasVegas Nev USA August 2008
[8] I Ben-Gal ldquoOutlier detectionrdquo in Data Mining and KnowledgeDiscovery Handbook O Maimon and L Rokach Eds pp 131ndash146 Springer New York NY USA 2005
[9] G Williams R Baxter H He S Hawkins and L Gu ldquoAcomparative study of RNN for outlier detection in dataminingrdquoin Proceedings of the IEEE International Conference on DataMining (ICDM 02) pp 709ndash712 December 2002
[10] H Fan O R Zaıane A Foss and J Wu ldquoA nonparametricoutlier detection for effectively discovering top-N outliers fromengineering datardquo inAdvances inKnowledgeDiscovery andDataMining W-K Ng M Kitsuregawa J Li and K Chang Edsvol 3918 of Lecture Notes in Computer Science pp 557ndash566Springer Berlin Germany 2006
[11] J-S Lee ldquoDigital image smoothing and the sigma filterrdquoComputer Vision Graphics and Image Processing vol 24 no 2pp 255ndash269 1983
[12] A Gelb Applied Optimal Estimation MIT Press 1974[13] D Sierociuk I Tejado and BM Vinagre ldquoImproved fractional
Kalman filter and its application to estimation over lossynetworksrdquo Signal Processing vol 91 no 3 pp 542ndash552 2011
[14] P H Abreu J Xavier D C Silva L P Reis andM Petry ldquoUsingKalman filters to reduce noise from RFID location systemrdquoTheScientific World Journal vol 2014 Article ID 796279 9 pages2014
[15] H Liu S Shah and W Jiang ldquoOn-line outlier detection anddata cleaningrdquoComputers andChemical Engineering vol 28 no9 pp 1635ndash1647 2004
[16] T D Pigott ldquoA review ofmethods formissing datardquo EducationalResearch and Evaluation vol 7 no 4 pp 353ndash383 2001
[17] M Nakai and W Ke ldquoReview of the methods for handlingmissing data in longitudinal data analysisrdquo International Journalof Mathematical Analysis vol 5 no 1ndash4 pp 1ndash13 2011
[18] J L Schafer and J W Graham ldquoMissing data our view of thestate of the artrdquo Psychological Methods vol 7 no 2 pp 147ndash1772002
[19] E Acuna and C Rodriguez ldquoThe treatment of missing valuesand its effect on classifier accuracyrdquo inClassification Clusteringand Data Mining Applications D Banks L House F RMcMorris P Arabie and W Gaul Eds pp 639ndash647 SpringerBerlin Germany 2004
[20] A N Baraldi and C K Enders ldquoAn introduction to modernmissing data analysesrdquo Journal of School Psychology vol 48 no1 pp 5ndash37 2010
[21] J Tian B Yu D Yu and S Ma ldquoClustering-based multipleimputation via gray relational analysis for missing data and itsapplication to aerospace fieldrdquoThe Scientific World Journal vol2013 Article ID 720392 10 pages 2013
[22] S Zhaowei Z Lingfeng M Shangjun F Bin and Z TaipingldquoIncomplete time series prediction using max-margin classifi-cation of data with absent featuresrdquo Mathematical Problems inEngineering vol 2010 Article ID 513810 14 pages 2010
[23] J-F Cai Z Shen and G-B Ye ldquoApproximation of frame basedmissing data recoveryrdquo Applied and Computational HarmonicAnalysis vol 31 no 2 pp 185ndash204 2011
[24] B L Wiens and G K Rosenkranz ldquoMissing data in noninferi-ority trialsrdquo Statistics in Biopharmaceutical Research vol 5 no4 pp 383ndash393 2013
[25] Aryabhata The Aryabhatiya of Aryabhata An Ancient IndianWork onMathematics and Astronomy vol 1 Kessinger Publish-ing 2006
[26] F E Grubbs ldquoProcedures for detecting outlying observations insamplesrdquo Technometrics vol 11 no 1 pp 1ndash21 1969
[27] B Rosner ldquoOn the detection of many outliersrdquo Technometricsvol 17 no 2 pp 221ndash227 1975
[28] R B Jain ldquoPercentage points of many-outlier detection proce-duresrdquo Technometrics vol 23 no 1 pp 71ndash75 1981
[29] S P Verma L Dıaz-Gonzalez M Rosales-Rivera and AQuiroz-Ruiz ldquoComparative performance of four single extremeoutlier discordancy tests from Monte Carlo simulationsrdquo TheScientific World Journal vol 2014 Article ID 746451 27 pages2014
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
12 The Scientific World Journal
both significant and nonsignificant outliers when the firstvalue of the data set is not an outlier Most importantlythe method is a solution for identifying significant outliersin a series with outliers in non-Gaussian distribution Inaddition the outlier detection is nonparametric has floor andceiling values and does not require standardization Whenthe reference elements are unknown the method can be usedwith multiple reference elements to gain optimal output
If the frequency of the data is sufficient any nonlinearrelation can be represented as a combination of straight linesTherefore by using a suitable segmentation technique it ispossible to identify outliers in any data series This will allowfor detecting outliers in a process-oriented data setThereforeto bring a data series into a form that is suitable for ourmethod an intelligent segmentation technique is necessary
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The University of Ruhuna, Sri Lanka, provided the processing charges for this paper. The German Academic Exchange Service (Deutscher Akademischer Austauschdienst, DAAD) financed this work.