[ieee 2009 sixth international conference on information technology: new generations - las vegas,...



Page 1: [IEEE 2009 Sixth International Conference on Information Technology: New Generations - Las Vegas, NV, USA (2009.04.27-2009.04.29)] 2009 Sixth International Conference on Information

Outlier Detection in Spatial Databases Using Clustering Data Mining

Amitava Karmaker, Syed M. Rahman

Department of Mathematics, Statistics and Computer Science, University of Wisconsin-Stout, Menomonie, WI 54751, USA.

Department of Computer Science and Software Engineering, University of Wisconsin-Platteville, Platteville, WI 53818, USA.

Abstract

Data mining refers to extracting or “mining” knowledge from large amounts of data. Thus, it plays an important role in extracting spatial patterns and features. It is an essential process where intelligent methods are applied in order to extract data patterns. In this paper, we have proposed a technique with which it is possible to detect whether a given data set is erroneous. Furthermore, our technique locates the possible errors and identifies the pattern of errors in order to minimize outliers. Finally, it ensures the integrity and correctness of large databases. We have made use of some of the existing clustering algorithms (such as PAM, CLARA and CLARANS) to formulate our proposed technique. The proposed outlier detection and minimization method is simpler to implement and more efficient, with respect to both time and memory complexity, than other existing methods.

Key Words: noise in database, outlier analysis.

1. Introduction

Data mining is also known as the process of Knowledge Discovery in Databases (KDD) [3, 4]. Data mining techniques, in general, look for hidden patterns that may exist in large databases. They also discover the interesting relationships and characteristics that may exist implicitly in spatial databases. Because of the huge amount of data that may be obtained from satellite images, medical equipment, surveillance video cameras, etc., it is unrealistic and often expensive for users to investigate spatial data in detail.

To deal with these problems, we explore here whether cluster analysis techniques are applicable. Cluster analysis is a branch of statistics that has been intensely studied over the past three decades and successfully applied in many areas. For the spatial data-mining task at hand, the attraction of cluster analysis is its ability to find structures or clusters directly from the given data, without relying on any hierarchies. However, cluster analysis has been applied rather unsuccessfully in the past to general data mining and machine learning. The complaints are that cluster analysis is ineffective and inefficient. Indeed, for cluster analysis algorithms to work effectively, there needs to be a natural notion of similarity among the "objects" to be clustered. Moreover, traditional cluster analysis algorithms are not designed for large data sets [1].

2. Our Technique

In clustering algorithms, one of the crucial factors is the treatment of data points that lie outside the final clustering. Moreover, detection and minimization of outlier data points improves the usability and scalability of the algorithm and simultaneously fulfils the clustering demand. Outliers are the data objects in a database that do not comply with the general behavior or model of the data. Most data mining methods discard outliers as noise or exceptions. In this paper, we present a technique to detect and minimize outliers in clustering problems for spatial databases [2].

Although there are a number of outlier detection methods and techniques [8], these are expensive in many cases and do not cope consistently with varying data formats. We propose to devise a technique with which it is possible to detect whether a given data set is erroneous, to locate the possible errors, to identify the pattern of errors and finally to ensure the integrity and correctness of large databases. We have made use of some of the existing clustering algorithms (such as PAM [5], CLARA [6] and CLARANS [5, 7]) to formulate our proposed technique.

2.1. Analysis of Algorithm

Let there be a large data set D. We want to detect outliers in this data set using our proposed technique. The algorithm of the proposed technique is given as follows (Figure 1):

2009 Sixth International Conference on Information Technology: New Generations

978-0-7695-3596-8/09 $25.00 © 2009 IEEE

DOI 10.1109/ITNG.2009.198


Input: D = {d1, d2, ..., dn} is a set of spatial data entries; A = {a1, a2, ..., an} is a set of attributes corresponding to the data set D; N = number of outliers.

Output: set of outliers.

Algorithm: Outlier_Detection(D, A, N) {

1. Given a large data set D;
2. Partition the entire data set D into a number of regions k, such that each region contains a considerable number of data objects;
3. Apply an efficient clustering algorithm to each individual region to generate local clusters. Let the number of clusters in each region be n;
4. Each cluster has a cluster center, which is itself a data object;
5. The total number of cluster centers, i.e. data objects (after a single pass), is k × n;
6. Apply the clustering algorithm again to these new data objects; this time it generates a single cluster in the data set D;
7. Let the cluster center of the final cluster (after the second pass) be CP;
8. Apply the clustering algorithm to the entire data set D without considering the regions and take it as a single cluster. Let the cluster center be CT;
9. The dissimilarity between CP and CT indicates the correctness of the database;

}

Figure 1: Our proposed technique
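The steps in Figure 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper does not say how regions are formed, so the sketch simply sorts on the first coordinate; a basic alternating k-medoids routine stands in for PAM, CLARA or CLARANS; Euclidean distance is assumed as the dissimilarity; and all function names (`medoid`, `k_medoids`, `partition`, `center_dissimilarity`) are hypothetical.

```python
import math
import random


def medoid(points):
    """Return the data object with the smallest total distance to all others."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))


def k_medoids(points, n, iters=20, seed=0):
    """Basic alternating k-medoids (assumed stand-in for PAM/CLARA/CLARANS).

    Points must be tuples so they can be used as dictionary keys.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, min(n, len(points)))
    for _ in range(iters):
        # Assign every point to its nearest current center.
        groups = {c: [] for c in centers}
        for p in points:
            groups[min(centers, key=lambda c: math.dist(c, p))].append(p)
        # Re-pick each center as the medoid of its group.
        new_centers = [medoid(g) for g in groups.values() if g]
        if set(new_centers) == set(centers):
            break
        centers = new_centers
    return centers


def partition(points, k):
    """Split D into at most k regions of near-equal size (assumed scheme)."""
    ordered = sorted(points)
    size = math.ceil(len(ordered) / k)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def center_dissimilarity(D, k, n):
    """Return (CP, CT, distance between them) for data set D."""
    centers = []
    for region in partition(D, k):             # step 2
        centers.extend(k_medoids(region, n))   # steps 3-5: about k * n centers
    cp = medoid(centers)                       # steps 6-7: second pass
    ct = medoid(D)                             # step 8: whole data set
    return cp, ct, math.dist(cp, ct)           # step 9


if __name__ == "__main__":
    grid = [(float(x), float(y)) for x in range(10) for y in range(10)]
    cp, ct, d = center_dissimilarity(grid, k=4, n=3)
    print("CP:", cp, "CT:", ct, "dissimilarity:", d)
```

Both CP and CT are actual data objects, as step 4 requires. The sketch stops at step 9 and returns the dissimilarity; how that value is turned into a set of outliers is not specified in Figure 1, so it is left out here.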

2.2. Features

Our proposed technique can be used to detect whether a given data set is erroneous or inconsistent in type. It is also possible to locate the possible errors, to identify the pattern of errors and finally to ensure the integrity and correctness of large databases. As we have made use of existing clustering algorithms (such as PAM, CLARA and CLARANS) to formulate our proposed technique, the time complexity depends on that of the clustering algorithm. Generally the complexity of a single-pass clustering algorithm is O(n²). The total complexity of our technique in two passes is therefore O(n²) + O(n²) = 2·O(n²) = O(n²). The weakness of our method is that we have used clusters of circular shape, so some significant data objects may be left out of consideration. The radius of the circular clusters can be adjusted, but choosing it involves a trade-off: low values of the radius may exclude data objects, while large values may result in overlapping clusters. Users should therefore tune the cluster radii for their data.
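The radius trade-off described above can be made concrete with a small Python sketch. It assumes that a data object belongs to a circular cluster when its Euclidean distance to the cluster center is at most the radius; the function name `radius_report` and the sample points are hypothetical and chosen only for illustration.

```python
import math


def radius_report(points, centers, radius):
    """List points covered by no circle (excluded) and by several (overlapping)."""
    excluded, overlapping = [], []
    for p in points:
        # Number of circular clusters whose radius reaches this point.
        hits = sum(1 for c in centers if math.dist(p, c) <= radius)
        if hits == 0:
            excluded.append(p)
        elif hits > 1:
            overlapping.append(p)
    return excluded, overlapping


if __name__ == "__main__":
    points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]
    centers = [(1, 0), (3, 0)]
    for r in (0.5, 1, 2):
        excluded, overlapping = radius_report(points, centers, r)
        print(f"radius={r}: excluded={excluded}, overlapping={overlapping}")
```

With these sample points, a radius of 0.5 excludes four of the six objects, a radius of 1 excludes only the distant point (10, 0) but already places (2, 0) in both clusters, and a radius of 2 places three objects in both clusters.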

3. Conclusions

Outlier detection and analysis are very useful for fraud detection, customized marketing, medical analysis and many other tasks. We have proposed a technique with which it is possible to ensure the integrity and correctness of large databases. We have devised a method and verified its correctness using simulation. Experimental results and analysis indicate that our proposed technique is more effective and efficient than existing techniques. It is also simpler to implement, so it can be used as an efficient outlier analysis technique in spatial databases. In future work, we plan to conduct further research on clustering-based data mining and to improve the performance of our proposed technique.

4. References

[1] Raymond T. Ng and Jiawei Han, Efficient and Effective Clustering Methods for Spatial Data Mining, VLDB Conference, Santiago, Chile, 1994.

[2] H. Spath, Cluster Dissection and Analysis: Theory, FORTRAN Programs, Examples, Ellis Horwood Ltd., 1984.

[3] Usama M. Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth, Advances in Knowledge Discovery and Data Mining, AAAI Press / The MIT Press, 1996, American Association of Artificial Intelligence, California, USA. Page 35.

[4] W. J. Frawley, Gregory Piatetsky-Shapiro and C. J. Matheus, Knowledge Discovery in Databases. AAAI Press / The MIT Press, 1991. Page 1-27.

[5] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.

[6] A. Jain, M. Murty and P. Flynn, Data Clustering: A Review, ACM Computing Surveys, 31(3), pp. 264–323, 1999.

[7] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.

[8] S. Papadimitriou and C. Faloutsos, Cross-Outlier Detection, Proceedings of 8th International Symposium on Spatial and Temporal Databases, Greece, pp. 199–213, 2003.
