

    Data Clustering: A Review

    A.K. JAIN

    Michigan State University

    M.N. MURTY

    Indian Institute of Science

    AND

    P.J. FLYNN

    The Ohio State University

    Clustering is the unsupervised classification of patterns (observations, data items,

    or feature vectors) into groups (clusters). The clustering problem has been

    addressed in many contexts and by researchers in many disciplines; this reflects its

    broad appeal and usefulness as one of the steps in exploratory data analysis.

    However, clustering is a difficult problem combinatorially, and differences in

assumptions and contexts in different communities have made the transfer of useful

    generic concepts and methodologies slow to occur. This paper presents an overview

    of pattern clustering methods from a statistical pattern recognition perspective,

    with a goal of providing useful advice and references to fundamental concepts

    accessible to the broad community of clustering practitioners. We present a

    taxonomy of clustering techniques, and identify cross-cutting themes and recent

    advances. We also describe some important applications of clustering algorithms

    such as image segmentation, object recognition, and information retrieval.

Categories and Subject Descriptors: I.5.1 [Pattern Recognition]: Models; I.5.3 [Pattern Recognition]: Clustering; I.5.4 [Pattern Recognition]: Applications - Computer vision; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Clustering; I.2.6 [Artificial Intelligence]: Learning - Knowledge acquisition

    General Terms: Algorithms

    Additional Key Words and Phrases: Cluster analysis, clustering applications,

    exploratory data analysis, incremental clustering, similarity indices, unsupervised

    learning

Section 6.1 is based on the chapter "Image Segmentation Using Clustering" by A.K. Jain and P.J. Flynn, Advances in Image Understanding: A Festschrift for Azriel Rosenfeld (K. Bowyer and N. Ahuja, Eds.), 1996 IEEE Computer Society Press, and is used by permission of the IEEE Computer Society.

Authors' addresses: A. Jain, Department of Computer Science, Michigan State University, A714 Wells Hall, East Lansing, MI 48824; M. Murty, Department of Computer Science and Automation, Indian Institute of Science, Bangalore, 560 012, India; P. Flynn, Department of Electrical Engineering, The Ohio State University, Columbus, OH 43210.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. 2000 ACM 0360-0300/99/09000001 $5.00


    1. INTRODUCTION

    1.1 Motivation

Data analysis underlies many computing applications, either in a design phase or as part of their on-line operations. Data analysis procedures can be dichotomized as either exploratory or confirmatory, based on the availability of appropriate models for the data source, but a key element in both types of procedures (whether for hypothesis formation or decision-making) is the grouping, or classification, of measurements based on either (i) goodness-of-fit to a postulated model, or (ii) natural groupings (clustering) revealed through analysis. Cluster analysis is the organization of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity.

Intuitively, patterns within a valid cluster are more similar to each other than they are to a pattern belonging to a different cluster. An example of clustering is depicted in Figure 1. The input patterns are shown in Figure 1(a), and the desired clusters are shown in Figure 1(b). Here, points belonging to the same cluster are given the same label. The variety of techniques for representing data, measuring proximity (similarity) between data elements, and grouping data elements has produced a rich and often confusing assortment of clustering methods.

It is important to understand the difference between clustering (unsupervised classification) and discriminant analysis (supervised classification). In supervised classification, we are provided with a collection of labeled (preclassified) patterns; the problem is to label a newly encountered, yet unlabeled, pattern. Typically, the given labeled (training) patterns are used to learn the descriptions of classes which in turn are used to label a new pattern. In the case of clustering, the problem is to group a given collection of unlabeled patterns into meaningful clusters. In a sense, labels are associated with clusters also, but these category labels are data driven; that is, they are obtained solely from the data.

Clustering is useful in several exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification. However, in many such problems, there is little prior information (e.g., statistical models) available about the data, and the decision-maker must make as few assumptions about the data as possible. It is under these restrictions that clustering methodology is particularly appropriate for the exploration of interrelationships among the data points to make an assessment (perhaps preliminary) of their structure.

The term clustering is used in several research communities to describe methods for grouping of unlabeled data.

    CONTENTS

1. Introduction
   1.1 Motivation
   1.2 Components of a Clustering Task
   1.3 The User's Dilemma and the Role of Expertise
   1.4 History
   1.5 Outline
2. Definitions and Notation
3. Pattern Representation, Feature Selection and Extraction
4. Similarity Measures
5. Clustering Techniques
   5.1 Hierarchical Clustering Algorithms
   5.2 Partitional Algorithms
   5.3 Mixture-Resolving and Mode-Seeking Algorithms
   5.4 Nearest Neighbor Clustering
   5.5 Fuzzy Clustering
   5.6 Representation of Clusters
   5.7 Artificial Neural Networks for Clustering
   5.8 Evolutionary Approaches for Clustering
   5.9 Search-Based Approaches
   5.10 A Comparison of Techniques
   5.11 Incorporating Domain Constraints in Clustering
   5.12 Clustering Large Data Sets
6. Applications
   6.1 Image Segmentation Using Clustering
   6.2 Object and Character Recognition
   6.3 Information Retrieval
   6.4 Data Mining
7. Summary


These communities have different terminologies and assumptions for the components of the clustering process and the contexts in which clustering is used. Thus, we face a dilemma regarding the scope of this survey. The production of a truly comprehensive survey would be a monumental task given the sheer mass of literature in this area. The accessibility of the survey might also be questionable given the need to reconcile very different vocabularies and assumptions regarding clustering in the various communities.

The goal of this paper is to survey the core concepts and techniques in the large subset of cluster analysis with its roots in statistics and decision theory. Where appropriate, references will be made to key concepts and techniques arising from clustering methodology in the machine-learning and other communities.

The audience for this paper includes practitioners in the pattern recognition and image analysis communities (who should view it as a summarization of current practice), practitioners in the machine-learning communities (who should view it as a snapshot of a closely related field with a rich history of well-understood techniques), and the broader audience of scientific professionals (who should view it as an accessible introduction to a mature field that is making important contributions to computing application areas).

    1.2 Components of a Clustering Task

Typical pattern clustering activity involves the following steps [Jain and Dubes 1988]:

(1) pattern representation (optionally including feature extraction and/or selection),

(2) definition of a pattern proximity measure appropriate to the data domain,

    (3) clustering or grouping,

    (4) data abstraction (if needed), and

    (5) assessment of output (if needed).

Figure 2 depicts a typical sequencing of the first three of these steps, including a feedback path where the grouping process output could affect subsequent feature extraction and similarity computations.

Pattern representation refers to the number of classes, the number of available patterns, and the number, type, and scale of the features available to the clustering algorithm. Some of this information may not be controllable by the practitioner.

Figure 1. Data clustering.


Feature selection is the process of identifying the most effective subset of the original features to use in clustering. Feature extraction is the use of one or more transformations of the input features to produce new salient features. Either or both of these techniques can be used to obtain an appropriate set of features to use in clustering.

Pattern proximity is usually measured by a distance function defined on pairs of patterns. A variety of distance measures are in use in the various communities [Anderberg 1973; Jain and Dubes 1988; Diday and Simon 1976]. A simple distance measure like Euclidean distance can often be used to reflect dissimilarity between two patterns, whereas other similarity measures can be used to characterize the conceptual similarity between patterns [Michalski and Stepp 1983]. Distance measures are discussed in Section 4.

The grouping step can be performed in a number of ways. The output clustering (or clusterings) can be hard (a partition of the data into groups) or fuzzy (where each pattern has a variable degree of membership in each of the output clusters). Hierarchical clustering algorithms produce a nested series of partitions based on a criterion for merging or splitting clusters based on similarity. Partitional clustering algorithms identify the partition that optimizes (usually locally) a clustering criterion. Additional techniques for the grouping operation include probabilistic [Brailovski 1991] and graph-theoretic [Zahn 1971] clustering methods. The variety of techniques for cluster formation is described in Section 5.

Data abstraction is the process of extracting a simple and compact representation of a data set. Here, simplicity is either from the perspective of automatic analysis (so that a machine can perform further processing efficiently) or it is human-oriented (so that the representation obtained is easy to comprehend and intuitively appealing). In the clustering context, a typical data abstraction is a compact description of each cluster, usually in terms of cluster prototypes or representative patterns such as the centroid [Diday and Simon 1976].

How is the output of a clustering algorithm evaluated? What characterizes a good clustering result and a poor one? All clustering algorithms will, when presented with data, produce clusters, regardless of whether the data contain clusters or not. If the data does contain clusters, some clustering algorithms may obtain better clusters than others. The assessment of a clustering procedure's output, then, has several facets. One is actually an assessment of the data domain rather than the clustering algorithm itself: data which do not contain clusters should not be processed by a clustering algorithm. The study of cluster tendency, wherein the input data are examined to see if there is any merit to a cluster analysis prior to one being performed, is a relatively inactive research area, and will not be considered further in this survey. The interested reader is referred to Dubes [1987] and Cheng [1995] for information.

Cluster validity analysis, by contrast, is the assessment of a clustering procedure's output. Often this analysis uses a specific criterion of optimality; however, these criteria are usually arrived at subjectively.

Figure 2. Stages in clustering: Patterns -> Feature Selection/Extraction -> Pattern Representations -> Interpattern Similarity -> Grouping -> Clusters, with a feedback loop from the grouping output back to feature extraction and similarity computation.


Hence, little in the way of gold standards exist in clustering except in well-prescribed subdomains. Validity assessments are objective [Dubes 1993] and are performed to determine whether the output is meaningful. A clustering structure is valid if it cannot reasonably have occurred by chance or as an artifact of a clustering algorithm. When statistical approaches to clustering are used, validation is accomplished by carefully applying statistical methods and testing hypotheses. There are three types of validation studies. An external assessment of validity compares the recovered structure to an a priori structure. An internal examination of validity tries to determine if the structure is intrinsically appropriate for the data. A relative test compares two structures and measures their relative merit. Indices used for this comparison are discussed in detail in Jain and Dubes [1988] and Dubes [1993], and are not discussed further in this paper.

1.3 The User's Dilemma and the Role of Expertise

The availability of such a vast collection of clustering algorithms in the literature can easily confound a user attempting to select an algorithm suitable for the problem at hand. In Dubes and Jain [1976], a set of admissibility criteria defined by Fisher and Van Ness [1971] are used to compare clustering algorithms. These admissibility criteria are based on: (1) the manner in which clusters are formed, (2) the structure of the data, and (3) sensitivity of the clustering technique to changes that do not affect the structure of the data. However, there is no critical analysis of clustering algorithms dealing with important questions such as

    How should the data be normalized?

Which similarity measure is appropriate to use in a given situation?

How should domain knowledge be utilized in a particular clustering problem?

How can a very large data set (say, a million patterns) be clustered efficiently?

These issues have motivated this survey, and its aim is to provide a perspective on the state of the art in clustering methodology and algorithms. With such a perspective, an informed practitioner should be able to confidently assess the tradeoffs of different techniques, and ultimately make a competent decision on a technique or suite of techniques to employ in a particular application.

There is no clustering technique that is universally applicable in uncovering the variety of structures present in multidimensional data sets. For example, consider the two-dimensional data set shown in Figure 1(a). Not all clustering techniques can uncover all the clusters present here with equal facility, because clustering algorithms often contain implicit assumptions about cluster shape or multiple-cluster configurations based on the similarity measures and grouping criteria used.

Humans perform competitively with automatic clustering procedures in two dimensions, but most real problems involve clustering in higher dimensions. It is difficult for humans to obtain an intuitive interpretation of data embedded in a high-dimensional space. In addition, data hardly follow the ideal structures (e.g., hyperspherical, linear) shown in Figure 1. This explains the large number of clustering algorithms which continue to appear in the literature; each new clustering algorithm performs slightly better than the existing ones on a specific distribution of patterns.

It is essential for the user of a clustering algorithm to not only have a thorough understanding of the particular technique being utilized, but also to know the details of the data gathering process and to have some domain expertise; the more information the user has about the data at hand, the more likely the user would be able to succeed in assessing its true class structure [Jain and Dubes 1988].


This domain information can also be used to improve the quality of feature extraction, similarity computation, grouping, and cluster representation [Murty and Jain 1995].

Appropriate constraints on the data source can be incorporated into a clustering procedure. One example of this is mixture resolving [Titterington et al. 1985], wherein it is assumed that the data are drawn from a mixture of an unknown number of densities (often assumed to be multivariate Gaussian). The clustering problem here is to identify the number of mixture components and the parameters of each component.

The concept of density clustering and a methodology for decomposition of feature spaces [Bajcsy 1997] have also been incorporated into traditional clustering methodology, yielding a technique for extracting overlapping clusters.

    1.4 History

Even though there is an increasing interest in the use of clustering methods in pattern recognition [Anderberg 1973], image processing [Jain and Flynn 1996] and information retrieval [Rasmussen 1992; Salton 1991], clustering has a rich history in other disciplines [Jain and Dubes 1988] such as biology, psychiatry, psychology, archaeology, geology, geography, and marketing. Other terms more or less synonymous with clustering include unsupervised learning [Jain and Dubes 1988], numerical taxonomy [Sneath and Sokal 1973], vector quantization [Oehler and Gray 1995], and learning by observation [Michalski and Stepp 1983]. The field of spatial analysis of point patterns [Ripley 1988] is also related to cluster analysis. The importance and interdisciplinary nature of clustering is evident through its vast literature.

A number of books on clustering have been published [Jain and Dubes 1988; Anderberg 1973; Hartigan 1975; Spath 1980; Duran and Odell 1974; Everitt 1993; Backer 1995], in addition to some useful and influential review papers. A survey of the state of the art in clustering circa 1978 was reported in Dubes and Jain [1980]. A comparison of various clustering algorithms for constructing the minimal spanning tree and the short spanning path was given in Lee [1981]. Cluster analysis was also surveyed in Jain et al. [1986]. A review of image segmentation by clustering was reported in Jain and Flynn [1996]. Comparisons of various combinatorial optimization schemes, based on experiments, have been reported in Mishra and Raghavan [1994] and Al-Sultan and Khan [1996].

    1.5 Outline

This paper is organized as follows. Section 2 presents definitions of terms to be used throughout the paper. Section 3 summarizes pattern representation, feature extraction, and feature selection. Various approaches to the computation of proximity between patterns are discussed in Section 4. Section 5 presents a taxonomy of clustering approaches, describes the major techniques in use, and discusses emerging techniques for clustering incorporating non-numeric constraints and the clustering of large sets of patterns. Section 6 discusses applications of clustering methods to image analysis and data mining problems. Finally, Section 7 presents some concluding remarks.

    2. DEFINITIONS AND NOTATION

The following terms and notation are used throughout this paper.

A pattern (or feature vector, observation, or datum) $\mathbf{x}$ is a single data item used by the clustering algorithm. It typically consists of a vector of $d$ measurements: $\mathbf{x} = (x_1, \ldots, x_d)$.

The individual scalar components $x_i$ of a pattern $\mathbf{x}$ are called features (or attributes).


$d$ is the dimensionality of the pattern or of the pattern space.

A pattern set is denoted $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$. The $i$th pattern in $\mathcal{X}$ is denoted $\mathbf{x}_i = (x_{i,1}, \ldots, x_{i,d})$. In many cases a pattern set to be clustered is viewed as an $n \times d$ pattern matrix.

A class, in the abstract, refers to a state of nature that governs the pattern generation process in some cases. More concretely, a class can be viewed as a source of patterns whose distribution in feature space is governed by a probability density specific to the class. Clustering techniques attempt to group patterns so that the classes thereby obtained reflect the different pattern generation processes represented in the pattern set.

Hard clustering techniques assign a class label $l_i$ to each pattern $\mathbf{x}_i$, identifying its class. The set of all labels for a pattern set $\mathcal{X}$ is $\mathcal{L} = \{l_1, \ldots, l_n\}$, with $l_i \in \{1, \ldots, k\}$, where $k$ is the number of clusters.

Fuzzy clustering procedures assign to each input pattern $\mathbf{x}_i$ a fractional degree of membership $f_{ij}$ in each output cluster $j$.

A distance measure (a specialization of a proximity measure) is a metric (or quasi-metric) on the feature space used to quantify the similarity of patterns.
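To make this notation concrete, a minimal sketch follows (our illustration, assuming NumPy; the variable names are ours, not the paper's) of a pattern matrix, a hard labeling, and a fuzzy membership matrix.

import numpy as np

# A pattern set of n = 5 patterns in d = 2 dimensions, viewed as an n x d pattern matrix.
X = np.array([[1.0, 2.0],
              [1.2, 1.9],
              [5.0, 5.1],
              [5.2, 4.8],
              [9.0, 0.5]])
n, d = X.shape

# Hard clustering: one label l_i in {1, ..., k} per pattern (k = 3 here).
hard_labels = np.array([1, 1, 2, 2, 3])
k = hard_labels.max()

# Fuzzy clustering: an n x k matrix of membership degrees f_ij in [0, 1].
fuzzy_memberships = np.array([[0.9, 0.1, 0.0],
                              [0.8, 0.2, 0.0],
                              [0.1, 0.8, 0.1],
                              [0.0, 0.9, 0.1],
                              [0.1, 0.1, 0.8]])
assert fuzzy_memberships.shape == (n, k)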

3. PATTERN REPRESENTATION, FEATURE SELECTION AND EXTRACTION

There are no theoretical guidelines that suggest the appropriate patterns and features to use in a specific situation. Indeed, the pattern generation process is often not directly controllable; the user's role in the pattern representation process is to gather facts and conjectures about the data, optionally perform feature selection and extraction, and design the subsequent elements of the clustering system. Because of the difficulties surrounding pattern representation, it is conveniently assumed that the pattern representation is available prior to clustering. Nonetheless, a careful investigation of the available features and any available transformations (even simple ones) can yield significantly improved clustering results. A good pattern representation can often yield a simple and easily understood clustering; a poor pattern representation may yield a complex clustering whose true structure is difficult or impossible to discern. Figure 3 shows a simple example. The points in this 2D feature space are arranged in a curvilinear cluster of approximately constant distance from the origin. If one chooses Cartesian coordinates to represent the patterns, many clustering algorithms would be likely to fragment the cluster into two or more clusters, since it is not compact. If, however, one uses a polar coordinate representation for the clusters, the radius coordinate exhibits tight clustering and a one-cluster solution is likely to be easily obtained.

A pattern can measure either a physical object (e.g., a chair) or an abstract notion (e.g., a style of writing). As noted above, patterns are represented conventionally as multidimensional vectors, where each dimension is a single feature [Duda and Hart 1973]. These features can be either quantitative or qualitative. For example, if weight and color are the two features used, then (20, black) is the representation of a black object with 20 units of weight. The features can be subdivided into the following types [Gowda and Diday 1992]:

(1) Quantitative features, e.g.:
    (a) continuous values (e.g., weight);
    (b) discrete values (e.g., the number of computers);
    (c) interval values (e.g., the duration of an event).

(2) Qualitative features:
    (a) nominal or unordered (e.g., color);


    (b) ordinal (e.g., military rank or qualitative evaluations of temperature ("cool" or "hot") or sound intensity ("quiet" or "loud")).

Quantitative features can be measured on a ratio scale (with a meaningful reference value, such as temperature), or on nominal or ordinal scales.

One can also use structured features [Michalski and Stepp 1983] which are represented as trees, where the parent node represents a generalization of its child nodes. For example, a parent node "vehicle" may be a generalization of children labeled "cars," "buses," "trucks," and "motorcycles." Further, the node "cars" could be a generalization of cars of the type "Toyota," "Ford," "Benz," etc. A generalized representation of patterns, called symbolic objects, was proposed in Diday [1988]. Symbolic objects are defined by a logical conjunction of events. These events link values and features in which the features can take one or more values and all the objects need not be defined on the same set of features.

It is often valuable to isolate only the most descriptive and discriminatory features in the input set, and utilize those features exclusively in subsequent analysis. Feature selection techniques identify a subset of the existing features for subsequent use, while feature extraction techniques compute new features from the original set. In either case, the goal is to improve classification performance and/or computational efficiency. Feature selection is a well-explored topic in statistical pattern recognition [Duda and Hart 1973]; however, in a clustering context (i.e., lacking class labels for patterns), the feature selection process is of necessity ad hoc, and might involve a trial-and-error process where various subsets of features are selected, the resulting patterns clustered, and the output evaluated using a validity index. In contrast, some of the popular feature extraction processes (e.g., principal components analysis [Fukunaga 1990]) do not depend on labeled data and can be used directly. Reduction of the number of features has an additional benefit, namely the ability to produce output that can be visually inspected by a human.

    4. SIMILARITY MEASURES

Since similarity is fundamental to the definition of a cluster, a measure of the similarity between two patterns drawn from the same feature space is essential to most clustering procedures. Because of the variety of feature types and scales, the distance measure (or measures) must be chosen carefully. It is most common to calculate the dissimilarity between two patterns using a distance measure defined on the feature space. We will focus on the well-known distance measures used for patterns whose features are all continuous.

The most popular metric for continuous features is the Euclidean distance

$$d_2(\mathbf{x}_i, \mathbf{x}_j) = \left( \sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2 \right)^{1/2} = \|\mathbf{x}_i - \mathbf{x}_j\|_2,$$

which is a special case ($p = 2$) of the Minkowski metric

Figure 3. A curvilinear cluster whose points are approximately equidistant from the origin. Different pattern representations (coordinate systems) would cause clustering algorithms to yield different results for this data (see text).


$$d_p(\mathbf{x}_i, \mathbf{x}_j) = \left( \sum_{k=1}^{d} |x_{i,k} - x_{j,k}|^p \right)^{1/p} = \|\mathbf{x}_i - \mathbf{x}_j\|_p.$$

The Euclidean distance has an intuitive appeal as it is commonly used to evaluate the proximity of objects in two or three-dimensional space. It works well when a data set has compact or isolated clusters [Mao and Jain 1996]. The drawback to direct use of the Minkowski metrics is the tendency of the largest-scaled feature to dominate the others. Solutions to this problem include normalization of the continuous features (to a common range or variance) or other weighting schemes. Linear correlation among features can also distort distance measures; this distortion can be alleviated by applying a whitening transformation to the data or by using the squared Mahalanobis distance

$$d_M(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j) \, \Sigma^{-1} \, (\mathbf{x}_i - \mathbf{x}_j)^T,$$

where the patterns $\mathbf{x}_i$ and $\mathbf{x}_j$ are assumed to be row vectors, and $\Sigma$ is the sample covariance matrix of the patterns or the known covariance matrix of the pattern generation process; $d_M(\cdot, \cdot)$ assigns different weights to different features based on their variances and pairwise linear correlations. Here, it is implicitly assumed that class conditional densities are unimodal and characterized by multidimensional spread, i.e., that the densities are multivariate Gaussian. The regularized Mahalanobis distance was used in Mao and Jain [1996] to extract hyperellipsoidal clusters. Recently, several researchers [Huttenlocher et al. 1993; Dubuisson and Jain 1994] have used the Hausdorff distance in a point set matching context.
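The distance computations above can be sketched as follows (our illustration, assuming NumPy; the function names are ours). The Minkowski distance with p = 2 reduces to the Euclidean distance, standardizing the features counteracts the scale-dominance problem, and the squared Mahalanobis distance uses the sample covariance of the patterns.

import numpy as np

def minkowski(xi, xj, p=2):
    # Minkowski distance between two patterns; p = 2 gives the Euclidean distance.
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)

def standardize(X):
    # Normalize each feature to zero mean and unit variance so that no
    # single large-scaled feature dominates the distance.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def mahalanobis_sq(xi, xj, cov):
    # Squared Mahalanobis distance with covariance matrix cov.
    diff = xi - xj
    return float(diff @ np.linalg.inv(cov) @ diff)

# Toy usage on a small pattern matrix.
X = np.array([[1.0, 200.0], [2.0, 220.0], [1.5, 260.0], [1.8, 240.0]])
Xs = standardize(X)
d_euclidean = minkowski(Xs[0], Xs[1])                        # p = 2
d_cityblock = minkowski(Xs[0], Xs[1], p=1)                   # p = 1
d_mahal = mahalanobis_sq(X[0], X[1], np.cov(X, rowvar=False))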

Some clustering algorithms work on a matrix of proximity values instead of on the original pattern set. It is useful in such situations to precompute all the $n(n-1)/2$ pairwise distance values for the $n$ patterns and store them in a (symmetric) matrix.
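A sketch of this precomputation (ours; SciPy's pdist stores exactly these n(n-1)/2 values in condensed form, and squareform expands them into the symmetric matrix):

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(0).normal(size=(50, 4))   # n = 50 patterns, d = 4 features
condensed = pdist(X, metric='euclidean')             # the n(n-1)/2 pairwise distances
assert condensed.size == 50 * 49 // 2
D = squareform(condensed)                             # full symmetric n x n distance matrix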

Computation of distances between patterns with some or all features being noncontinuous is problematic, since the different types of features are not comparable and (as an extreme example) the notion of proximity is effectively binary-valued for nominal-scaled features. Nonetheless, practitioners (especially those in machine learning, where mixed-type patterns are common) have developed proximity measures for heterogeneous type patterns. A recent example is Wilson and Martinez [1997], which proposes a combination of a modified Minkowski metric for continuous features and a distance based on counts (population) for nominal attributes. A variety of other metrics have been reported in Diday and Simon [1976] and Ichino and Yaguchi [1994] for computing the similarity between patterns represented using quantitative as well as qualitative features.

Patterns can also be represented using string or tree structures [Knuth 1973]. Strings are used in syntactic clustering [Fu and Lu 1977]. Several measures of similarity between strings are described in Baeza-Yates [1992]. A good summary of similarity measures between trees is given by Zhang [1995].

A comparison of syntactic and statistical approaches for pattern recognition using several criteria was presented in Tanaka [1995] and the conclusion was that syntactic methods are inferior in every aspect. Therefore, we do not consider syntactic methods further in this paper.

There are some distance measures reported in the literature [Gowda and Krishna 1977; Jarvis and Patrick 1973] that take into account the effect of surrounding or neighboring points. These surrounding points are called context in Michalski and Stepp [1983]. The similarity between two points $\mathbf{x}_i$ and $\mathbf{x}_j$, given this context, is given by


$$s(\mathbf{x}_i, \mathbf{x}_j) = f(\mathbf{x}_i, \mathbf{x}_j, \mathcal{E}),$$

where $\mathcal{E}$ is the context (the set of surrounding points). One metric defined using context is the mutual neighbor distance (MND), proposed in Gowda and Krishna [1977], which is given by

$$\mathrm{MND}(\mathbf{x}_i, \mathbf{x}_j) = \mathrm{NN}(\mathbf{x}_i, \mathbf{x}_j) + \mathrm{NN}(\mathbf{x}_j, \mathbf{x}_i),$$

where $\mathrm{NN}(\mathbf{x}_i, \mathbf{x}_j)$ is the neighbor number of $\mathbf{x}_j$ with respect to $\mathbf{x}_i$. Figures 4 and 5 give an example. In Figure 4, the nearest neighbor of A is B, and B's nearest neighbor is A. So, $\mathrm{NN}(A, B) = \mathrm{NN}(B, A) = 1$ and the MND between A and B is 2. However, $\mathrm{NN}(B, C) = 1$ but $\mathrm{NN}(C, B) = 2$, and therefore $\mathrm{MND}(B, C) = 3$. Figure 5 was obtained from Figure 4 by adding three new points D, E, and F. Now $\mathrm{MND}(B, C) = 3$ (as before), but $\mathrm{MND}(A, B) = 5$. The MND between A and B has increased by introducing additional points, even though A and B have not moved. The MND is not a metric (it does not satisfy the triangle inequality [Zhang 1995]). In spite of this, MND has been successfully applied in several clustering applications [Gowda and Diday 1992]. This observation supports the viewpoint that the dissimilarity does not need to be a metric.
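The MND can be computed directly from its definition. The following sketch (ours, assuming NumPy and Euclidean distance as the underlying proximity measure) ranks the neighbors of each point by distance and adds the two neighbor numbers.

import numpy as np

def neighbor_number(X, i, j):
    # Rank of pattern j among the neighbors of pattern i, sorted by distance
    # (rank 1 = nearest neighbor of i; rank 0 is i itself).
    d = np.linalg.norm(X - X[i], axis=1)
    order = np.argsort(d)
    ranks = np.empty(len(X), dtype=int)
    ranks[order] = np.arange(len(X))
    return int(ranks[j])

def mutual_neighbor_distance(X, i, j):
    # MND(x_i, x_j) = NN(x_i, x_j) + NN(x_j, x_i), as in Gowda and Krishna [1977].
    return neighbor_number(X, i, j) + neighbor_number(X, j, i)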

Watanabe's theorem of the ugly duckling [Watanabe 1985] states:

    Insofar as we use a finite set of predicates that are capable of distinguishing any two objects considered, the number of predicates shared by any two such objects is constant, independent of the choice of objects.

This implies that it is possible to make any two arbitrary patterns equally similar by encoding them with a sufficiently large number of features. As a consequence, any two arbitrary patterns are equally similar, unless we use some additional domain information. For example, in the case of conceptual clustering [Michalski and Stepp 1983],

the similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$ is defined as

$$s(\mathbf{x}_i, \mathbf{x}_j) = f(\mathbf{x}_i, \mathbf{x}_j, \mathcal{E}, \mathcal{C}),$$

where $\mathcal{C}$ is a set of pre-defined concepts. This notion is illustrated with the help of Figure 6. Here, the Euclidean distance between points A and B is less than that between B and C. However, B and C can be viewed as more similar than A and B because B and C belong to the same concept (ellipse) and A belongs to a different concept (rectangle). The conceptual similarity measure is the most general similarity measure.

Figure 4. A and B are more similar than A and C.

Figure 5. After a change in context, B and C are more similar than B and A.


We discuss several pragmatic issues associated with its use in Section 5.

    5. CLUSTERING TECHNIQUES

Different approaches to clustering data can be described with the help of the hierarchy shown in Figure 7 (other taxonometric representations of clustering methodology are possible; ours is based on the discussion in Jain and Dubes [1988]). At the top level, there is a distinction between hierarchical and partitional approaches (hierarchical methods produce a nested series of partitions, while partitional methods produce only one).

The taxonomy shown in Figure 7 must be supplemented by a discussion of cross-cutting issues that may (in principle) affect all of the different approaches regardless of their placement in the taxonomy.

Agglomerative vs. divisive: This aspect relates to algorithmic structure and operation. An agglomerative approach begins with each pattern in a distinct (singleton) cluster, and successively merges clusters together until a stopping criterion is satisfied. A divisive method begins with all patterns in a single cluster and performs splitting until a stopping criterion is met.

Monothetic vs. polythetic: This aspect relates to the sequential or simultaneous use of features in the clustering process. Most algorithms are polythetic; that is, all features enter into the computation of distances between patterns, and decisions are based on those distances. A simple monothetic algorithm reported in Anderberg [1973] considers features sequentially to divide the given collection of patterns. This is illustrated in Figure 8. Here, the collection is divided into two groups using feature $x_1$; the vertical broken line V is the separating line. Each of these clusters is further divided independently using feature $x_2$, as depicted by the broken lines H1 and H2. The major problem with this algorithm is that it generates $2^d$ clusters where $d$ is the dimensionality of the patterns. For large values of $d$ ($d > 100$ is typical in information retrieval applications [Salton 1991]), the number of clusters generated by this algorithm is so large that the data set is divided into uninterestingly small and fragmented clusters.

Hard vs. fuzzy: A hard clustering algorithm allocates each pattern to a single cluster during its operation and in its output. A fuzzy clustering method assigns degrees of membership in several clusters to each input pattern. A fuzzy clustering can be converted to a hard clustering by assigning each pattern to the cluster with the largest measure of membership.

Deterministic vs. stochastic: This issue is most relevant to partitional approaches designed to optimize a squared error function. This optimization can be accomplished using traditional techniques or through a random search of the state space consisting of all possible labelings.

Incremental vs. non-incremental: This issue arises when the pattern set to be clustered is large, and constraints on execution time or memory space affect the architecture of the algorithm. The early history of clustering methodology does not contain many examples of clustering algorithms designed to work with large data sets, but the advent of data mining has fostered the development of clustering algorithms that minimize the number of scans through the pattern set, reduce the number of patterns examined during execution, or reduce the size of data structures used in the algorithm's operations.

Figure 6. Conceptual similarity between points.



A cogent observation in Jain and Dubes [1988] is that the specification of an algorithm for clustering usually leaves considerable flexibility in implementation.

    5.1 Hierarchical Clustering Algorithms

The operation of a hierarchical clustering algorithm is illustrated using the two-dimensional data set in Figure 9. This figure depicts seven patterns labeled A, B, C, D, E, F, and G in three clusters. A hierarchical algorithm yields a dendrogram representing the nested grouping of patterns and similarity levels at which groupings change. A dendrogram corresponding to the seven points in Figure 9 (obtained from the single-link algorithm [Jain and Dubes 1988]) is shown in Figure 10. The dendrogram can be broken at different levels to yield different clusterings of the data.

Most hierarchical clustering algorithms are variants of the single-link [Sneath and Sokal 1973], complete-link [King 1967], and minimum-variance [Ward 1963; Murtagh 1984] algorithms. Of these, the single-link and complete-link algorithms are most popular. These two algorithms differ in the way they characterize the similarity between a pair of clusters.

Figure 7. A taxonomy of clustering approaches: clustering splits into hierarchical (single link, complete link) and partitional (square error, including k-means; graph theoretic; mixture resolving, including expectation maximization; mode seeking) methods.

Figure 8. Monothetic partitional clustering.


In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of patterns drawn from the two clusters (one pattern from the first cluster, the other from the second). In the complete-link algorithm, the distance between two clusters is the maximum of all pairwise distances between patterns in the two clusters. In either case, two clusters are merged to form a larger cluster based on minimum distance criteria. The complete-link algorithm produces tightly bound or compact clusters [Baeza-Yates 1992]. The single-link algorithm, by contrast, suffers from a chaining effect [Nagy 1968]. It has a tendency to produce clusters that are straggly or elongated. There are two clusters in Figures 12 and 13 separated by a bridge of noisy patterns. The single-link algorithm produces the clusters shown in Figure 12, whereas the complete-link algorithm obtains the clustering shown in Figure 13. The clusters obtained by the complete-link algorithm are more compact than those obtained by the single-link algorithm; the cluster labeled 1 obtained using the single-link algorithm is elongated because of the noisy patterns labeled *. The single-link algorithm is more versatile than the complete-link algorithm, otherwise. For example, the single-link algorithm can extract the concentric clusters shown in Figure 11, but the complete-link algorithm cannot. However, from a pragmatic viewpoint, it has been observed that the complete-link algorithm produces more useful hierarchies in many applications than the single-link algorithm [Jain and Dubes 1988].

Agglomerative Single-Link Clustering Algorithm

(1) Place each pattern in its own cluster. Construct a list of interpattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order.

(2) Step through the sorted list of distances, forming for each distinct dissimilarity value $d_k$ a graph on the patterns where pairs of patterns closer than $d_k$ are connected by a graph edge. If all the patterns are members of a connected graph, stop. Otherwise, repeat this step.

(3) The output of the algorithm is a nested hierarchy of graphs which can be cut at a desired dissimilarity level forming a partition (clustering) identified by simply connected components in the corresponding graph.

Figure 9. Points falling in three clusters.

Figure 10. The dendrogram obtained using the single-link algorithm.

Figure 11. Two concentric clusters.



Agglomerative Complete-Link Clustering Algorithm

(1) Place each pattern in its own cluster. Construct a list of interpattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order.

(2) Step through the sorted list of distances, forming for each distinct dissimilarity value $d_k$ a graph on the patterns where pairs of patterns closer than $d_k$ are connected by a graph edge. If all the patterns are members of a completely connected graph, stop.

(3) The output of the algorithm is a nested hierarchy of graphs which can be cut at a desired dissimilarity level forming a partition (clustering) identified by completely connected components in the corresponding graph.

Hierarchical algorithms are more versatile than partitional algorithms. For example, the single-link clustering algorithm works well on data sets containing non-isotropic clusters including well-separated, chain-like, and concentric clusters, whereas a typical partitional algorithm such as the k-means algorithm works well only on data sets having isotropic clusters [Nagy 1968]. On the other hand, the time and space complexities [Day 1992] of the partitional algorithms are typically lower than those of the hierarchical algorithms. It is possible to develop hybrid algorithms [Murty and Krishna 1980] that exploit the good features of both categories.

Hierarchical Agglomerative Clustering Algorithm

(1) Compute the proximity matrix containing the distance between each pair of patterns. Treat each pattern as a cluster.

(2) Find the most similar pair of clusters using the proximity matrix. Merge these two clusters into one cluster. Update the proximity matrix to reflect this merge operation.

(3) If all patterns are in one cluster, stop. Otherwise, go to step 2.

Based on the way the proximity matrix is updated in step 2, a variety of agglomerative algorithms can be designed. Hierarchical divisive algorithms start with a single cluster of all the given objects and keep splitting the clusters based on some criterion to obtain a partition of singleton clusters.
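As an illustration of this generic scheme, the sketch below uses SciPy's hierarchical clustering routines (a library choice of ours, not something prescribed by the paper); the 'single' and 'complete' update rules correspond to the single-link and complete-link algorithms described above, and cutting the resulting dendrogram yields a partition.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(1).normal(size=(30, 2))   # toy two-dimensional pattern set

# Build the full merge hierarchy (steps 1-3 above) under two update rules.
Z_single = linkage(X, method='single')
Z_complete = linkage(X, method='complete')

# Cut each dendrogram so that three clusters remain.
labels_single = fcluster(Z_single, t=3, criterion='maxclust')
labels_complete = fcluster(Z_complete, t=3, criterion='maxclust')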

Figure 12. A single-link clustering of a pattern set containing two classes (1 and 2) connected by a chain of noisy patterns (*).

Figure 13. A complete-link clustering of a pattern set containing two classes (1 and 2) connected by a chain of noisy patterns (*).


(2) Assign each pattern to the closest cluster center.

(3) Recompute the cluster centers using the current cluster memberships.

(4) If a convergence criterion is not met, go to step 2. Typical convergence criteria are: no (or minimal) reassignment of patterns to new cluster centers, or minimal decrease in squared error.
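A minimal NumPy sketch of this iteration follows (our illustration; the first line inside the function stands in for the initialization step, choosing k of the patterns as the initial cluster centers).

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initial cluster centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Step (2): assign each pattern to the closest cluster center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step (3): recompute the cluster centers from the current memberships.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step (4): stop when the centers (and hence the assignments) no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers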

Several variants [Anderberg 1973] of the k-means algorithm have been reported in the literature. Some of them attempt to select a good initial partition so that the algorithm is more likely to find the global minimum value.

Another variation is to permit splitting and merging of the resulting clusters. Typically, a cluster is split when its variance is above a pre-specified threshold, and two clusters are merged when the distance between their centroids is below another pre-specified threshold. Using this variant, it is possible to obtain the optimal partition starting from any arbitrary initial partition, provided proper threshold values are specified. The well-known ISODATA [Ball and Hall 1965] algorithm employs this technique of merging and splitting clusters. If ISODATA is given the ellipse partitioning shown in Figure 14 as an initial partitioning, it will produce the optimal three-cluster partitioning. ISODATA will first merge the clusters {A} and {B,C} into one cluster because the distance between their centroids is small and then split the cluster {D,E,F,G}, which has a large variance, into two clusters {D,E} and {F,G}.

Another variation of the k-means algorithm involves selecting a different criterion function altogether. The dynamic clustering algorithm (which permits representations other than the centroid for each cluster) was proposed in Diday [1973] and Symon [1977] and describes a dynamic clustering approach obtained by formulating the clustering problem in the framework of maximum-likelihood estimation. The regularized Mahalanobis distance was used in Mao and Jain [1996] to obtain hyperellipsoidal clusters.

5.2.2 Graph-Theoretic Clustering. The best-known graph-theoretic divisive clustering algorithm is based on construction of the minimal spanning tree (MST) of the data [Zahn 1971], and then deleting the MST edges with the largest lengths to generate clusters. Figure 15 depicts the MST obtained from nine two-dimensional points. By breaking the link labeled CD with a length of 6 units (the edge with the maximum Euclidean length), two clusters ({A, B, C} and {D, E, F, G, H, I}) are obtained. The second cluster can be further divided into two clusters by breaking the edge EF, which has a length of 4.5 units.
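A sketch of this MST-based procedure using SciPy (our illustration; removing the k-1 longest MST edges leaves k connected components, which are reported as clusters; ties at the cutoff length may remove additional edges).

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clusters(X, n_clusters):
    D = squareform(pdist(X))                      # complete graph of Euclidean distances
    mst = minimum_spanning_tree(D).toarray()      # nonzero entries are the MST edge lengths
    if n_clusters > 1:
        edge_lengths = np.sort(mst[mst > 0])
        cutoff = edge_lengths[-(n_clusters - 1)]  # length of the (k-1)-th longest edge
        mst[mst >= cutoff] = 0                    # delete the longest edges
    _, labels = connected_components(mst, directed=False)
    return labels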

The hierarchical approaches are also related to graph-theoretic clustering. Single-link clusters are subgraphs of the minimum spanning tree of the data [Gower and Ross 1969] which are also the connected components [Gotlieb and Kumar 1968]. Complete-link clusters are maximal complete subgraphs, and are related to the node colorability of graphs [Backer and Hubert 1976]. The maximal complete subgraph was considered the strictest definition of a cluster in Augustson and Minker [1970] and Raghavan and Yu [1981]. A graph-oriented approach for non-hierarchical structures and overlapping clusters is presented in Ozawa [1985].

Figure 14. The k-means algorithm is sensitive to the initial partition.


The Delaunay graph (DG) is obtained by connecting all the pairs of points that are Voronoi neighbors. The DG contains all the neighborhood information contained in the MST and the relative neighborhood graph (RNG) [Toussaint 1980].

5.3 Mixture-Resolving and Mode-Seeking Algorithms

The mixture resolving approach to cluster analysis has been addressed in a number of ways. The underlying assumption is that the patterns to be clustered are drawn from one of several distributions, and the goal is to identify the parameters of each and (perhaps) their number. Most of the work in this area has assumed that the individual components of the mixture density are Gaussian, and in this case the parameters of the individual Gaussians are to be estimated by the procedure. Traditional approaches to this problem involve obtaining (iteratively) a maximum likelihood estimate of the parameter vectors of the component densities [Jain and Dubes 1988].

More recently, the Expectation Maximization (EM) algorithm (a general-purpose maximum likelihood algorithm [Dempster et al. 1977] for missing-data problems) has been applied to the problem of parameter estimation. A recent book [Mitchell 1997] provides an accessible description of the technique. In the EM framework, the parameters of the component densities are unknown, as are the mixing parameters, and these are estimated from the patterns. The EM procedure begins with an initial estimate of the parameter vector and iteratively rescores the patterns against the mixture density produced by the parameter vector. The rescored patterns are then used to update the parameter estimates. In a clustering context, the scores of the patterns (which essentially measure their likelihood of being drawn from particular components of the mixture) can be viewed as hints at the class of the pattern. Those patterns, placed (by their scores) in a particular component, would therefore be viewed as belonging to the same cluster.
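A minimal sketch of mixture resolving with EM, here delegating the EM iterations to scikit-learn's GaussianMixture (a library choice of ours; the paper does not tie the discussion to any particular software):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Toy patterns drawn from two Gaussian components.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)
responsibilities = gmm.predict_proba(X)   # per-pattern component scores ("hints" at the class)
cluster_labels = gmm.predict(X)           # each pattern assigned to its highest-scoring component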

Nonparametric techniques for density-based clustering have also been developed [Jain and Dubes 1988]. Inspired by the Parzen window approach to nonparametric density estimation, the corresponding clustering procedure searches for bins with large counts in a multidimensional histogram of the input pattern set. Other approaches include the application of another partitional or hierarchical clustering algorithm using a distance measure based on a nonparametric density estimate.

    5.4 Nearest Neighbor Clustering

Since proximity plays a key role in our intuitive notion of a cluster, nearest-neighbor distances can serve as the basis of clustering procedures. An iterative procedure was proposed in Lu and Fu [1978]; it assigns each unlabeled pattern to the cluster of its nearest labeled neighbor pattern, provided the distance to that labeled neighbor is below a threshold. The process continues until all patterns are labeled or no additional labelings occur. The mutual neighborhood value (described earlier in the context of distance computation) can also be used to grow clusters from near neighbors.
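A rough sketch of such a threshold-based, nearest-labeled-neighbor procedure is given below (our reading of the description above, assuming NumPy; in particular, seeding a new cluster when no pattern can be attached to a labeled neighbor is our assumption, not something stated in the paper).

import numpy as np

def nn_threshold_clustering(X, threshold):
    labels = np.zeros(len(X), dtype=int)               # 0 means "unlabeled"
    next_label = 1
    while (labels == 0).any():
        changed = False
        for i in np.flatnonzero(labels == 0):
            labeled = np.flatnonzero(labels > 0)
            if labeled.size:
                d = np.linalg.norm(X[labeled] - X[i], axis=1)
                nearest = labeled[d.argmin()]
                if d.min() <= threshold:
                    labels[i] = labels[nearest]         # join the nearest labeled neighbor's cluster
                    changed = True
        if not changed:
            # No additional labelings occurred: seed a new cluster (assumption; see above).
            labels[np.flatnonzero(labels == 0)[0]] = next_label
            next_label += 1
    return labels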

Figure 15. Using the minimal spanning tree to form clusters.


    5.5 Fuzzy Clustering

Traditional clustering approaches generate partitions; in a partition, each pattern belongs to one and only one cluster. Hence, the clusters in a hard clustering are disjoint. Fuzzy clustering extends this notion to associate each pattern with every cluster using a membership function [Zadeh 1965]. The output of such algorithms is a clustering, but not a partition. We give a high-level partitional fuzzy clustering algorithm below.

    Fuzzy Clustering Algorithm

(1) Select an initial fuzzy partition of the $N$ objects into $K$ clusters by selecting the $N \times K$ membership matrix $U$. An element $u_{ij}$ of this matrix represents the grade of membership of object $\mathbf{x}_i$ in cluster $c_j$. Typically, $u_{ij} \in [0, 1]$.

(2) Using $U$, find the value of a fuzzy criterion function, e.g., a weighted squared error criterion function, associated with the corresponding partition. One possible fuzzy criterion function is

$$E^2(\mathcal{X}, U) = \sum_{i=1}^{N} \sum_{k=1}^{K} u_{ik} \, \|\mathbf{x}_i - \mathbf{c}_k\|^2,$$

where $\mathbf{c}_k = \sum_{i=1}^{N} u_{ik} \mathbf{x}_i$ is the $k$th fuzzy cluster center.

Reassign patterns to clusters to reduce this criterion function value and recompute $U$.

(3) Repeat step 2 until entries in $U$ do not change significantly.
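The criterion in step (2) can be evaluated directly. The sketch below (ours, assuming NumPy) computes the fuzzy cluster centers and the weighted squared error for a given membership matrix U, and also shows the thresholding mentioned later for recovering a hard clustering. The centers follow the formula above literally; practical fuzzy c-means implementations usually normalize the membership weights.

import numpy as np

def fuzzy_centers(X, U):
    # c_k = sum_i u_ik x_i, the k-th fuzzy cluster center (K x d array).
    return U.T @ X

def fuzzy_criterion(X, U):
    # E^2(X, U) = sum_i sum_k u_ik * ||x_i - c_k||^2.
    C = fuzzy_centers(X, U)
    sq_dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # shape (N, K)
    return float((U * sq_dist).sum())

def harden(U):
    # Convert a fuzzy partition into a hard one: each pattern goes to its
    # largest-membership cluster.
    return U.argmax(axis=1)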

In fuzzy clustering, each cluster is a fuzzy set of all the patterns. Figure 16 illustrates the idea. The rectangles enclose two hard clusters in the data: $H_1 = \{1, 2, 3, 4, 5\}$ and $H_2 = \{6, 7, 8, 9\}$. A fuzzy clustering algorithm might produce the two fuzzy clusters $F_1$ and $F_2$ depicted by ellipses. The patterns will have membership values in $[0, 1]$ for each cluster. For example, fuzzy cluster $F_1$ could be compactly described as

$$\{(1, 0.9), (2, 0.8), (3, 0.7), (4, 0.6), (5, 0.55), (6, 0.2), (7, 0.2), (8, 0.0), (9, 0.0)\}$$

and $F_2$ could be described as

$$\{(1, 0.0), (2, 0.0), (3, 0.0), (4, 0.1), (5, 0.15), (6, 0.4), (7, 0.35), (8, 1.0), (9, 0.9)\}.$$

The ordered pairs $(i, \mu_i)$ in each cluster represent the $i$th pattern and its membership value to the cluster, $\mu_i$. Larger membership values indicate higher confidence in the assignment of the pattern to the cluster. A hard clustering can be obtained from a fuzzy partition by thresholding the membership value.

Fuzzy set theory was initially applied to clustering in Ruspini [1969]. The book by Bezdek [1981] is a good source for material on fuzzy clustering. The most popular fuzzy clustering algorithm is the fuzzy c-means (FCM) algorithm. Even though it is better than the hard k-means algorithm at avoiding local minima, FCM can still converge to local minima of the squared error criterion. The design of membership functions is the most important problem in fuzzy clustering; different choices include those based on similarity decomposition and centroids of clusters.

Figure 16. Fuzzy clusters.


A generalization of the FCM algorithm was proposed by Bezdek [1981] through a family of objective functions. A fuzzy c-shell algorithm and an adaptive variant for detecting circular and elliptical boundaries was presented in Dave [1992].

    5.6 Representation of Clusters

In applications where the number of classes or clusters in a data set must be discovered, a partition of the data set is the end product. Here, a partition gives an idea about the separability of the data points into clusters and whether it is meaningful to employ a supervised classifier that assumes a given number of classes in the data set. However, in many other applications that involve decision making, the resulting clusters have to be represented or described in a compact form to achieve data abstraction. Even though the construction of a cluster representation is an important step in decision making, it has not been examined closely by researchers. The notion of cluster representation was introduced in Duran and Odell [1974] and was subsequently studied in Diday and Simon [1976] and Michalski et al. [1981]. They suggested the following representation schemes:

(1) Represent a cluster of points by their centroid or by a set of distant points in the cluster. Figure 17 depicts these two ideas (see also the sketch after this list).

(2) Represent clusters using nodes in a classification tree. This is illustrated in Figure 18.

(3) Represent clusters by using conjunctive logical expressions. For example, the expression [X1 > 3][X2 < 2] in Figure 18 stands for the logical statement "X1 is greater than 3 and X2 is less than 2."
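Scheme (1) is straightforward to compute. The helper below is an illustrative sketch only, in which the points farthest from the centroid stand in for whatever notion of "distant points" a given application prefers; it returns both representations for a cluster stored as an array of pattern vectors.

    import numpy as np

    def centroid_and_distant_points(cluster, p=3):
        """Represent a cluster by its centroid and by its p points farthest from the centroid."""
        centroid = cluster.mean(axis=0)
        dist = np.linalg.norm(cluster - centroid, axis=1)
        distant = cluster[np.argsort(dist)[-p:]]  # boundary-like points, farthest out
        return centroid, distant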

Use of the centroid to represent a cluster is the most popular scheme. It works well when the clusters are compact or isotropic. However, when the clusters are elongated or non-isotropic, this scheme fails to represent them properly. In such a case, the use of a collection of boundary points in a cluster captures its shape well. The number of points used to represent a cluster should increase as the complexity of its shape increases. The two different representations illustrated in Figure 18 are equivalent. Every path in a classification tree from the root node to a leaf node corresponds to a conjunctive statement. An important limitation of the typical use of the simple conjunctive concept representations is that they can describe only rectangular or isotropic clusters in the feature space.

Data abstraction is useful in decision making because of the following:

(1) It gives a simple and intuitive description of clusters which is easy for human comprehension.

Figure 17. Representation of a cluster by points (by the centroid and by three distant points, shown in the X1-X2 plane).


In both conceptual clustering [Michalski and Stepp 1983] and symbolic clustering [Gowda and Diday 1992] this representation is obtained without using an additional step. These algorithms generate the clusters as well as their descriptions. A set of fuzzy rules can be obtained from fuzzy clusters of a data set. These rules can be used to build fuzzy classifiers and fuzzy controllers.

(2) It helps in achieving data compression that can be exploited further by a computer [Murty and Krishna 1980]. Figure 19(a) shows samples belonging to two chain-like clusters labeled 1 and 2. A partitional clustering like the k-means algorithm cannot separate these two structures properly. The single-link algorithm works well on this data, but is computationally expensive. So a hybrid approach may be used to exploit the desirable properties of both these algorithms. We obtain 8 subclusters of the data using the (computationally efficient) k-means algorithm. Each of these subclusters can be represented by their centroids as shown in Figure 19(a). Now the single-link algorithm can be applied on these centroids alone to cluster them into 2 groups. The resulting groups are shown in Figure 19(b). Here, a data reduction is achieved by representing the subclusters by their centroids. (A code sketch of this hybrid scheme appears after this list.)

(3) It increases the efficiency of the decision making task. In a cluster-based document retrieval technique [Salton 1991], a large collection of documents is clustered and each of the clusters is represented using its centroid. In order to retrieve documents relevant to a query, the query is matched with the cluster centroids rather than with all the documents. This helps in retrieving relevant documents efficiently. Also, in several applications involving large data sets, clustering is used to perform indexing, which helps in efficient decision making [Dorai and Jain 1995].
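The hybrid scheme in item (2) can be prototyped in a few lines. The sketch below is illustrative and assumes that scikit-learn and SciPy are available; it forms 8 k-means subclusters and then applies single-link clustering to their centroids to obtain 2 groups, mirroring the description above. The array X stands for the chain-like patterns of Figure 19.

    import numpy as np
    from sklearn.cluster import KMeans
    from scipy.cluster.hierarchy import linkage, fcluster

    def hybrid_kmeans_single_link(X, n_subclusters=8, n_clusters=2):
        """Hybrid clustering: k-means for data reduction, single-link on the centroids."""
        # Step 1: partition the data into many small subclusters (cheap, local).
        km = KMeans(n_clusters=n_subclusters, n_init=10, random_state=0).fit(X)
        centroids = km.cluster_centers_
        # Step 2: single-link hierarchical clustering of the centroids alone.
        Z = linkage(centroids, method="single")
        centroid_labels = fcluster(Z, t=n_clusters, criterion="maxclust")
        # Step 3: propagate the centroid labels back to the original patterns.
        return centroid_labels[km.labels_]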

5.7 Artificial Neural Networks for Clustering

Artificial neural networks (ANNs) [Hertz et al. 1991] are motivated by biological neural networks. ANNs have been used extensively over the past three decades for both classification and clustering [Sethi and Jain 1991; Jain and Mao 1994].

Figure 18. Representation of clusters by a classification tree or by conjunctive statements. (Points labeled 1, 2, and 3 in the X1-X2 plane; one panel, "Using Nodes in a Classification Tree", splits on X1 < 3 versus X1 > 3, and the other, "Using Conjunctive Statements", lists the corresponding conjunctive descriptions.)


Some of the features of ANNs that are important in pattern clustering are:

(1) ANNs process numerical vectors and so require patterns to be represented using quantitative features only.

(2) ANNs are inherently parallel and distributed processing architectures.

(3) ANNs may learn their interconnection weights adaptively [Jain and Mao 1996; Oja 1982]. More specifically, they can act as pattern normalizers and feature selectors by appropriate selection of weights.

Competitive (or winner-take-all) neural networks [Jain and Mao 1996] are often used to cluster input data. In competitive learning, similar patterns are grouped by the network and represented by a single unit (neuron). This grouping is done automatically based on data correlations. Well-known examples of ANNs used for clustering include Kohonen's learning vector quantization (LVQ) and self-organizing map (SOM) [Kohonen 1984], and adaptive resonance theory models [Carpenter and Grossberg 1990]. The architectures of these ANNs are simple: they are single-layered. Patterns are presented at the input and are associated with the output nodes. The weights between the input nodes and the output nodes are iteratively changed (this is called learning) until a termination criterion is satisfied. Competitive learning has been found to exist in biological neural networks. However, the learning or weight update procedures are quite similar to those in some classical clustering approaches. For example, the relationship between the k-means algorithm and LVQ is addressed in Pal et al. [1993]. The learning algorithm in ART models is similar to the leader clustering algorithm [Moor 1988].
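As an illustration of the winner-take-all update just described, the sketch below implements a generic competitive learning loop (it is not the LVQ, SOM, or ART procedure of the cited works): for each input pattern, the closest weight vector (the winner) is moved toward the pattern by a small learning rate. The function name and parameters (competitive_learning, lr, epochs) are illustrative, and the termination test is simplified to a fixed number of passes over the data.

    import numpy as np

    def competitive_learning(X, n_units=3, lr=0.1, epochs=50, seed=0):
        """Winner-take-all clustering: only the closest unit's weights are updated."""
        rng = np.random.default_rng(seed)
        # Initialize the weight vectors of the output units to randomly chosen patterns.
        W = X[rng.choice(len(X), size=n_units, replace=False)].astype(float)
        for _ in range(epochs):
            for x in X[rng.permutation(len(X))]:
                winner = np.argmin(((W - x) ** 2).sum(axis=1))  # closest unit wins
                W[winner] += lr * (x - W[winner])               # move the winner toward x
        # Each pattern is represented by (assigned to) its winning unit.
        labels = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2), axis=1)
        return W, labels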

The SOM gives an intuitively appealing two-dimensional map of the multidimensional data set, and it has been successfully used for vector quantization and speech recognition [Kohonen 1984]. However, like its sequential counterpart, the SOM generates a suboptimal partition if the initial weights are not chosen properly. Further, its convergence is controlled by various parameters such as the learning rate and a neighborhood of the winning node in which learning takes place. It is possible that a particular input pattern can fire different output units at different iterations; this brings up the stability issue of learning systems. The system is said to be stable if no pattern in the training data changes its category after a finite number of learning iterations. This problem is closely associated with the problem of plasticity, which is the ability of the algorithm to adapt to new data. For stability, the learning rate should be decreased to zero as iterations progress, and this affects the plasticity. The ART models are supposed to be stable and plastic [Carpenter and Grossberg 1990]. However, ART nets are order-dependent; that is, different partitions are obtained for different orders in which the data is presented to the net. Also, the size and number of clusters generated by an ART net depend on the value chosen for the vigilance threshold, which is used to decide whether a pattern is to be assigned to one of the existing clusters or start a new cluster. Further, both SOM and ART are suitable for detecting only hyperspherical clusters [Hertz et al. 1991]. A two-layer network that employs regularized Mahalanobis distance to extract hyperellipsoidal clusters was proposed in Mao and Jain [1994].

Figure 19. Data compression by clustering. (Panels (a) and (b) show the patterns of the two chain-like clusters, labeled 1 and 2, in the X1-X2 plane.)


All these ANNs use a fixed number of output nodes, which limit the number of clusters that can be produced.

5.8 Evolutionary Approaches for Clustering

Evolutionary approaches, motivated by natural evolution, make use of evolutionary operators and a population of solutions to obtain the globally optimal partition of the data. Candidate solutions to the clustering problem are encoded as chromosomes. The most commonly used evolutionary operators are selection, recombination, and mutation. Each transforms one or more input chromosomes into one or more output chromosomes. A fitness function evaluated on a chromosome determines a chromosome's likelihood of surviving into the next generation. We give below a high-level description of an evolutionary algorithm applied to clustering.

An Evolutionary Algorithm for Clustering

(1) Choose a random population of solutions. Each solution here corresponds to a valid k-partition of the data. Associate a fitness value with each solution. Typically, fitness is inversely proportional to the squared error value. A solution with a small squared error will have a larger fitness value.

(2) Use the evolutionary operators selection, recombination, and mutation to generate the next population of solutions. Evaluate the fitness values of these solutions.

(3) Repeat step 2 until some termination condition is satisfied.
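The skeleton below is one possible realization of this loop, under the assumptions that each chromosome is a label string assigning one of K cluster labels to each of the N patterns and that fitness is the reciprocal of the squared error; the helper names (squared_error, evolve) and the tournament-style selection are our own choices rather than a prescription from the literature.

    import numpy as np

    def squared_error(X, labels, K):
        """Sum of squared distances of patterns to their cluster centroids."""
        return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                   for k in range(K) if np.any(labels == k))

    def evolve(X, K, pop_size=20, generations=50, p_mut=0.05, seed=0):
        rng = np.random.default_rng(seed)
        N = len(X)
        # Initial population: random K-partitions encoded as label strings of length N.
        pop = [rng.integers(0, K, size=N) for _ in range(pop_size)]
        fitness = lambda labels: 1.0 / (1.0 + squared_error(X, labels, K))
        for _ in range(generations):
            new_pop = []
            for _ in range(pop_size):
                # Selection: the fitter of two randomly chosen parents survives.
                a, b = rng.choice(pop_size, size=2, replace=False)
                p1 = pop[a] if fitness(pop[a]) >= fitness(pop[b]) else pop[b]
                c, d = rng.choice(pop_size, size=2, replace=False)
                p2 = pop[c] if fitness(pop[c]) >= fitness(pop[d]) else pop[d]
                # Recombination: single-point crossover of the two label strings.
                cut = rng.integers(1, N)
                child = np.concatenate([p1[:cut], p2[cut:]])
                # Mutation: reassign a few labels at random.
                flip = rng.random(N) < p_mut
                child[flip] = rng.integers(0, K, size=flip.sum())
                new_pop.append(child)
            pop = new_pop
        return max(pop, key=fitness)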

The best-known evolutionary techniques are genetic algorithms (GAs) [Holland 1975; Goldberg 1989], evolution strategies (ESs) [Schwefel 1981], and evolutionary programming (EP) [Fogel et al. 1965]. Out of these three approaches, GAs have been most frequently used in clustering. Typically, solutions are binary strings in GAs. In GAs, a selection operator propagates solutions from the current generation to the next generation based on their fitness. Selection employs a probabilistic scheme so that solutions with higher fitness have a higher probability of getting reproduced.

There are a variety of recombination operators in use; crossover is the most popular. Crossover takes as input a pair of chromosomes (called parents) and outputs a new pair of chromosomes (called children or offspring) as depicted in Figure 20. In Figure 20, a single-point crossover operation is depicted. It exchanges the segments of the parents across a crossover point. For example, in Figure 20, the parents are the binary strings 10110101 and 11001110. The segments in the two parents after the crossover point (between the fourth and fifth locations) are exchanged to produce the child chromosomes. Mutation takes as input a chromosome and outputs a chromosome by complementing the bit value at a randomly selected location in the input chromosome. For example, the string 11111110 is generated by applying the mutation operator to the second bit location in the string 10111110 (starting at the left). Both crossover and mutation are applied with some prespecified probabilities which depend on the fitness values.
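The two bit-string operators just described are easy to state in code. The sketch below is our own illustration (the function names are not from the cited works) and reproduces the example values in the text: the parents 10110101 and 11001110 crossed after the fourth bit, and the mutation of the second bit of 10111110.

    def single_point_crossover(parent1, parent2, point):
        """Exchange the tails of two equal-length bit strings after `point` bits."""
        return (parent1[:point] + parent2[point:],
                parent2[:point] + parent1[point:])

    def mutate(chrom, pos):
        """Complement the bit at position `pos` (0-based, counted from the left)."""
        flipped = '1' if chrom[pos] == '0' else '0'
        return chrom[:pos] + flipped + chrom[pos + 1:]

    print(single_point_crossover("10110101", "11001110", 4))  # ('10111110', '11000101')
    print(mutate("10111110", 1))                              # '11111110'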

Figure 20. Crossover operation. (Single-point crossover: the parent strings 10110101 and 11001110 exchange their segments after the crossover point to produce the children 10111110 and 11000101.)


GAs represent points in the search space as binary strings, and rely on the crossover operator to explore the search space. Mutation is used in GAs for the sake of completeness, that is, to make sure that no part of the search space is left unexplored. ESs and EP differ from the GAs in solution representation and type of the mutation operator used; EP does not use a recombination operator, but only selection and mutation. Each of these three approaches has been used to solve the clustering problem by viewing it as a minimization of the squared error criterion. Some of the theoretical issues, such as the convergence of these approaches, were studied in Fogel and Fogel [1994].

GAs perform a globalized search for solutions whereas most other clustering procedures perform a localized search. In a localized search, the solution obtained at the next iteration of the procedure is in the vicinity of the current solution. In this sense, the k-means algorithm, fuzzy clustering algorithms, ANNs used for clustering, various annealing schemes (see below), and tabu search are all localized search techniques. In the case of GAs, the crossover and mutation operators can produce new solutions that are completely different from the current ones. We illustrate this fact in Figure 21. Let us assume that the scalar X is coded using a 5-bit binary representation, and let S1 and S2 be two points in the one-dimensional search space. The decimal values of S1 and S2 are 8 and 31, respectively. Their binary representations are S1 = 01000 and S2 = 11111. Let us apply the single-point crossover to these strings, with the crossover site falling between the second and third most significant bits as shown below.

    01 | 000
    11 | 111

This will produce a new pair of points or chromosomes S3 and S4 as shown in Figure 21. Here, S3 = 01111 and S4 = 11000. The corresponding decimal values are 15 and 24, respectively. Similarly, by mutating the most significant bit in the binary string 01111 (decimal 15), the binary string 11111 (decimal 31) is generated. These jumps, or gaps between points in successive generations, are much larger than those produced by other approaches.

Perhaps the earliest paper on the use of GAs for clustering is by Raghavan and Birchand [1979], where a GA was used to minimize the squared error of a clustering. Here, each point or chromosome represents a partition of N objects into K clusters and is represented by a K-ary string of length N. For example, consider six patterns, A, B, C, D, E, and F, and the string 101001. This six-bit binary (K = 2) string corresponds to placing the six patterns into two clusters. This string represents a two-partition, where one cluster has the first, third, and sixth patterns and the second cluster has the remaining patterns. In other words, the two clusters are {A,C,F} and {B,D,E} (the six-bit binary string 010110 represents the same clustering of the six patterns). When there are K clusters, there are K! different chromosomes corresponding to each K-partition of the data. This increases the effective search space size by a factor of K!. Further, if crossover is applied on two good chromosomes, the resulting

Figure 21. GAs perform globalized search. (Points S1 and S2 and their crossover offspring S3 and S4 shown on the X axis against the objective f(X).)
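As an aside, the K-ary chromosome encoding described above is easy to decode; the helper below is our own illustration (the name decode is not from the cited work) and reproduces the fact that the strings 101001 and 010110 represent the same two-partition {A,C,F}, {B,D,E}, which is the K! redundancy noted above.

    def decode(chromosome, patterns):
        """Map a K-ary label string onto the patterns it partitions."""
        clusters = {}
        for pattern, label in zip(patterns, chromosome):
            clusters.setdefault(label, []).append(pattern)
        # Return the groups only; relabeling the clusters does not change the partition.
        return sorted(map(tuple, clusters.values()))

    patterns = ["A", "B", "C", "D", "E", "F"]
    print(decode("101001", patterns))  # [('A', 'C', 'F'), ('B', 'D', 'E')]
    print(decode("010110", patterns))  # [('A', 'C', 'F'), ('B', 'D', 'E')] -- same partition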


Deterministic search techniques guarantee an optimal partition by performing exhaustive enumeration. On the other hand, the stochastic search techniques generate a near-optimal partition reasonably quickly, and guarantee convergence to an optimal partition asymptotically. Among the techniques considered so far, evolutionary approaches are stochastic and the remainder are deterministic. Other deterministic approaches to clustering include the branch-and-bound technique adopted in Koontz et al. [1975] and Cheng [1995] for generating optimal partitions. This approach generates the optimal partition of the data at the cost of excessive computational requirements. In Rose et al. [1993], a deterministic annealing approach was proposed for clustering. This approach employs an annealing technique in which the error surface is smoothed, but convergence to the global optimum is not guaranteed. The use of deterministic annealing in proximity-mode clustering (where the patterns are specified in terms of pairwise proximities rather than multidimensional points) was explored in Hofmann and Buhmann [1997]; later work applied the deterministic annealing approach to texture segmentation [Hofmann and Buhmann 1998].

The deterministic approaches are typically greedy descent approaches, whereas the stochastic approaches permit perturbations to the solutions in non-locally optimal directions also with nonzero probabilities. The stochastic search techniques are either sequential or parallel, while evolutionary approaches are inherently parallel. The simulated annealing approach (SA) [Kirkpatrick et al. 1983] is a sequential stochastic search technique, whose applicability to clustering is discussed in Klein and Dubes [1989]. Simulated annealing procedures are designed to avoid (or recover from) solutions which correspond to local optima of the objective functions. This is accomplished by accepting with some probability a new solution of lower quality (as measured by the criterion function) for the next iteration. The probability of acceptance is governed by a critical parameter called the temperature (by analogy with annealing in metals), which is typically specified in terms of a starting (first iteration) and final temperature value. Selim and Al-Sultan [1991] studied the effects of control parameters on the performance of the algorithm, and Baeza-Yates [1992] used SA to obtain a near-optimal partition of the data. SA is statistically guaranteed to find the global optimal solution [Aarts and Korst 1989]. A high-level outline of an SA-based algorithm for clustering is given below.

Clustering Based on Simulated Annealing

(1) Randomly select an initial partition P0 and compute its squared error value, E(P0). Select values for the control parameters, the initial and final temperatures T0 and Tf.

(2) Select a neighbor P1 of P0 and compute its squared error value, E(P1). If E(P1) is larger than E(P0), then assign P1 to P0 with a temperature-dependent probability. Else assign P1 to P0. Repeat this step for a fixed number of iterations.

(3) Reduce the value of T0, i.e., T0 = cT0, where c is a predetermined constant. If T0 is greater than Tf, then go to step 2. Else stop.

The SA algorithm can be slow in reaching the optimal solution, because optimal results require the temperature to be decreased very slowly from iteration to iteration.

Tabu search [Glover 1986], like SA, is a method designed to cross boundaries of feasibility or local optimality and to systematically impose and release constraints to permit exploration of otherwise forbidden regions. Tabu search was used to solve the clustering problem in Al-Sultan [1995].


    5.10 A Comparison of Techniques

In this section we have examined various deterministic and stochastic search techniques to approach the clustering problem as an optimization problem. A majority of these methods use the squared error criterion function. Hence, the partitions generated by these approaches are not as versatile as those generated by hierarchical algorithms. The clusters generated are typically hyperspherical in shape. Evolutionary approaches are globalized search techniques, whereas the rest of the approaches are localized search techniques. ANNs and GAs are inherently parallel, so they can be implemented using parallel hardware to improve their speed. Evolutionary approaches are population-based; that is, they search using more than one solution at a time, and the rest are based on using a single solution at a time. ANNs, GAs, SA, and tabu search (TS) are all sensitive to the selection of various learning/control parameters. In theory, all four of these methods are weak methods [Rich 1983] in that they do not use explicit domain knowledge. An important feature of the evolutionary approaches is that they can find the optimal solution even when the criterion function is discontinuous.

An empirical study of the performance of the following heuristics for clustering was presented in Mishra and Raghavan [1994]: SA, GA, TS, randomized branch-and-bound (RBA) [Mishra and Raghavan 1994], and hybrid search (HS) strategies [Ismail and Kamel 1989] were evaluated. The conclusion was that GA performs well in the case of one-dimensional data, while its performance on high-dimensional data sets is not impressive. The performance of SA is not attractive because it is very slow. RBA and TS performed best. HS is good for high-dimensional data. However, none of the methods was found to be superior to the others by a significant margin. An empirical study of k-means, SA, TS, and GA was presented in Al-Sultan and Khan [1996]. TS, GA, and SA were judged comparable in terms of solution quality, and all were better than k-means. However, the k-means method is the most efficient in terms of execution time; the other schemes took more time (by a factor of 500 to 2500) to partition a data set of size 60 into 5 clusters. Further, GA encountered the best solution faster than TS and SA; SA took more time than TS to encounter the best solution. However, GA took the maximum time for convergence, that is, to obtain a population of only the best solutions, followed by TS and SA. An important observation is that in both Mishra and Raghavan [1994] and Al-Sultan and Khan [1996] the sizes of the data sets considered are small; that is, fewer than 200 patterns.

A two-layer network was employed in Mao and Jain [1996], with the first layer including a number of principal component analysis subnets, and the second layer using a competitive net. This network performs partitional clustering using the regularized Mahalanobis distance. This net was trained using a set of 1000 randomly selected pixels from a large image and then used to classify every pixel in the image. Babu et al. [1997] proposed a stochastic connectionist approach (SCA) and compared its performance on standard data sets with both the SA and k-means algorithms. It was observed that SCA is superior to both SA and k-means in terms of solution quality. Evolutionary approaches are good only when the data size is less than 1000 and for low-dimensional data.

In summary, only the k-means algorithm and its ANN equivalent, the Kohonen net [Mao and Jain 1996], have been applied on large data sets; other approaches have been tested, typically, on small data sets. This is because obtaining suitable learning/control parameters for ANNs, GAs, TS, and SA is difficult and their execution times are very high for large data sets.


However, it has been shown [Selim and Ismail 1984] that the k-means method converges to a locally optimal solution. This behavior is linked with the initial seed selection in the k-means algorithm. So if a good initial partition can be obtained quickly using any of the other techniques, then k-means would work well even on problems with large data sets. Even though the various methods discussed in this section are comparatively weak, it was revealed through experimental studies that combining domain knowledge would improve their performance. For example, ANNs work better in classifying images represented using extracted features than with raw images, and hybrid classifiers work better than ANNs [Mohiuddin and Mao 1994]. Similarly, using domain knowledge to hybridize a GA improves its performance [Jones and Beltramo 1991]. So it may be useful in general to use domain knowledge along with approaches like GA, SA, ANN, and TS. However, these approaches (specifically, the criterion functions used in them) have a tendency to generate a partition of hyperspherical clusters, and this could be a limitation. For example, in cluster-based document retrieval, it was observed that the hierarchical algorithms performed better than the partitional algorithms [Rasmussen 1992].

5.11 Incorporating Domain Constraints in Clustering

As a task, clustering is subjective in nature. The same data set may need to be partitioned differently for different purposes. For example, consider a whale, an elephant, and a tuna fish [Watanabe 1985]. Whales and elephants form a cluster of mammals. However, if the user is interested in partitioning them based on the concept of living in water, then whale and tuna fish are clustered together. Typically, this subjectivity is incorporated into the clustering criterion by incorporating domain knowledge in one or more phases of clustering.

Every clustering algorithm uses some type of knowledge either implicitly or explicitly. Implicit knowledge plays a role in (1) selecting a pattern representation scheme (e.g., using one's prior experience to select and encode features), (2) choosing a similarity measure (e.g., using the Mahalanobis distance instead of the Euclidean distance to obtain hyperellipsoidal clusters), and (3) selecting a grouping scheme (e.g., specifying the k-means algorithm when it is known that clusters are hyperspherical). Domain knowledge is used implicitly in ANNs, GAs, TS, and SA to select the control/learning parameter values that affect the performance of these algorithms.

It is also possible to use explicitly available domain knowledge to constrain or guide the clustering process. Such specialized clustering algorithms have been used in several applications. Domain concepts can play several roles in the clustering process, and a variety of choices are available to the practitioner. At one extreme, the available domain concepts might easily serve as an additional feature (or several), and the remainder of the procedure might be otherwise unaffected. At the other extreme, domain concepts might be used to confirm or veto a decision arrived at independently by a traditional clustering algorithm, or used to affect the computation of distance in a clustering algorithm employing proximity. The incorporation of domain knowledge into clustering consists mainly of ad hoc approaches with little in common; accordingly, our discussion of the idea will consist mainly of motivational material and a brief survey of past work. Machine learning research and pattern recognition research intersect in this topical area, and the interested reader is referred to the prominent journals in machine learning (e.g., Machine Learning, J. of AI Research, or Artificial Intelligence) for a fuller treatment of this topic.


As documented in Cheng and Fu [1985], rules in an expert system may be clustered to reduce the size of the knowledge base. This modification of clustering was also explored in the domains of universities, congressional voting records, and terrorist events by Lebowitz [1987].

5.11.1 Similarity Computation. Conceptual knowledge was used explicitly in the similarity computation phase in Michalski and Stepp [1983]. It was assumed that the pattern representations were available and the dynamic clustering algorithm [Diday 1973] was used to group patterns. The clusters formed were described using conjunctive statements in predicate logic. It was stated in Stepp and Michalski [1986] and Michalski and Stepp [1983] that the groupings obtained by conceptual clustering are superior to those obtained by the numerical methods for clustering. A critical analysis of that work appears in Dale [1985], and it was observed that monothetic divisive clustering algorithms generate clusters that can be described by conjunctive statements. For example, consider Figure 8. The four clusters in this figure, obtained using a monothetic algorithm, can be described by using conjunctive concepts as shown below:

Cluster 1: [X ≤ a] ∧ [Y ≤ b]

Cluster 2: [X ≤ a] ∧ [Y > b]

Cluster 3: [X > a] ∧ [Y ≤ c]

Cluster 4: [X > a] ∧ [Y > c]

where ∧ is the Boolean conjunction (and) operator, and a, b, and c are constants.
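Such conjunctive descriptions translate directly into executable predicates. The tiny sketch below is our own illustration, with arbitrary placeholder values for the thresholds a, b, and c; it assigns a point to a cluster by evaluating the four conjunctions above.

    # Placeholder thresholds for the splits of Figure 8 (illustrative values only).
    a, b, c = 3.0, 2.0, 4.0

    def conjunctive_cluster(x, y):
        """Return the cluster label whose conjunctive description the point satisfies."""
        if x <= a and y <= b:
            return 1
        if x <= a and y > b:
            return 2
        if x > a and y <= c:
            return 3
        return 4  # x > a and y > c

    print(conjunctive_cluster(1.0, 1.5))  # 1
    print(conjunctive_cluster(5.0, 6.0))  # 4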

5.11.2 Pattern Representation. It was shown in Srivastava and Murty [1990] that by using knowledge in the pattern representation phase, as is implicitly done in numerical taxonomy approaches, it is possible to obtain the same partitions as those generated by conceptual clustering. In this sense, conceptual clustering and numerical taxonomy are not diametrically opposite, but are equivalent. In the case of conceptual clustering, domain knowledge is explicitly used in interpattern similarity computation, whereas in numerical taxonomy it is implicitly assumed that pattern representations are obtained using the domain knowledge.

5.11.3 Cluster Descriptions. Typically, in knowledge-based clustering, both the clusters and their descriptions or characterizations are generated [Fisher and Langley 1985]. There are some exceptions, for instance Gowda and Diday [1992], where only clustering is performed and no descriptions are generated explicitly. In conceptual clustering, a cluster of objects is described by a conjunctive logical expression [Michalski and Stepp 1983]. Even though a conjunctive statement is one of the most common descriptive forms used by humans, it is a limited form. In Shekar et al. [1987], functional knowledge of objects was used to generate more intuitively appealing cluster descriptions that employ the Boolean implication operator. A system that represents clusters probabilistically was described in Fisher [1987]; these descriptions are more general than conjunctive concepts, and are well suited to hierarchical classification domains (e.g., the animal species hierarchy). A conceptual clustering system in which clustering is done first is described in Fisher and Langley [1985]. These clusters are then described using probabilities. A similar scheme was described in Murty and Jain [1995], but the descriptions are logical expressions that employ both conjunction and disjunction.

An important characteristic of conceptual clustering is that it is possible to group objects represented by both qualitative and quantitative features if the clustering leads to a conjunctive concept. For example, the concept cricket ball might be represented as


[color = red] ∧ [shape = sphere] ∧ [make = leather] ∧ [radius = 1.4 inches],

where radius is a quantitative feature and the rest are all qualitative features. This description is used to describe a cluster of cricket balls.