
International Journal of Fuzzy Systems, Vol. 13, No. 2, June 2011

© 2011 TFSA


Different Objective Functions in Fuzzy c-Means Algorithms and Kernel-Based Clustering

Sadaaki Miyamoto

Abstract

An overview of fuzzy c-means clustering algorithms is given, focusing on different objective functions: they use a regularized dissimilarity, an entropy-based function, and the function for possibilistic clustering. Classification functions for the objective functions and their properties are studied. Fuzzy c-means algorithms using kernel functions are also discussed, together with kernelized cluster validity measures and numerical experiments. New kernel functions derived from the classification functions are moreover studied.

Keywords: cluster validity measure, fuzzy c-means clustering, kernel functions, possibilistic clustering.

1. Introduction

Fuzzy clustering is well known not only in the fuzzy community but also in the related fields of data analysis, neural networks, and other areas of computational intelligence. Among the various techniques of clustering using fuzzy concepts [16, 23, 30, 37], the term fuzzy clustering mostly refers to the fuzzy c-means clustering of Dunn and Bezdek [1, 2, 6, 7, 8, 13]. This paper gives an overview of this method. Nevertheless, we adopt a non-standard formulation: we begin from three different objective functions, none of which is exactly the same as the one by Dunn and Bezdek.

Comparing different objective functions and their solutions, we find theoretical properties of fuzzy c-means clustering: different fuzzy classifiers are derived from different solutions. Moreover, a generalization including a "cluster size" variable and a "covariance" variable is developed. This generalization is shown to be closely related to mixture distributions.

Kernel-based fuzzy c-means clustering is moreover studied with associated cluster validity measures. Many numerical simulations are used to evaluate whether or not the kernelized measures are adequate for ordinary ball-shaped clusters.

Corresponding Author: Sadaaki Miyamoto is with the Department of Risk Engineering, the University of Tsukuba, Ibaraki 305-8573, Japan. E-mail: [email protected] Manuscript received June 2010; revised Nov. 2010; accepted Dec. 2010.

Finally, a new class of kernel functions is proposed; they are derived from fuzzy c-means solutions. Illustrative examples are given.

2. Fuzzy c-Means Clustering

We first give three objective functions. Possibilistic clustering [18] is included as a variation of fuzzy c-means clustering.

A. Preliminary consideration

Let the objects for clustering be points in the $p$-dimensional Euclidean space, denoted by $x_k = (x_k^1, \ldots, x_k^p) \in R^p$ ($k = 1, \ldots, N$). A generic point $x = (x^1, \ldots, x^p)$ stands for a variable in $R^p$. We assume $c$ clusters; the cluster centers are denoted by $v_i$ ($i = 1, \ldots, c$), and we write $V = (v_1, \ldots, v_c)$ for the collection of all cluster centers.

The dissimilarity between an object and a cluster center is the squared Euclidean distance:
$$D(x_k, v_i) = \|x_k - v_i\|^2. \tag{1}$$
We sometimes write $D_{ki} = D(x_k, v_i)$ for simplicity. Moreover, $D(x, v_i)$ means that the variable $x$ is substituted for the object $x_k$.

$U = (u_{ki})$ is the membership matrix: $u_{ki}$ means the degree of belongingness of $x_k$ to cluster $i$. Crisp and fuzzy c-means clustering are based on the minimization of objective functions. Crisp c-means clustering [21] uses the following:

$$J_H(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ki}\, D(x_k, v_i). \tag{2}$$

Alternate minimization with respect to one of $(U, V)$, while the other variable is fixed, is repeated until convergence [1]. Minimization with respect to $U$ uses the following constraint:

$$M = \left\{\, U = (u_{ki}) : \sum_{i=1}^{c} u_{ki} = 1;\ u_{kj} \ge 0,\ \forall k, j \,\right\}. \tag{3}$$


We consider three objective functions:

$$J_B(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} (u_{ki})^m \{ D(x_k, v_i) + \varepsilon \}, \quad (m > 1,\ \varepsilon \ge 0), \tag{4}$$

$$J_E(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} \{ u_{ki} D(x_k, v_i) + \lambda^{-1} u_{ki} (\log u_{ki} - 1) \}, \quad (\lambda > 0), \tag{5}$$

$$J_P(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} \{ (u_{ki})^m D(x_k, v_i) + \zeta^{-1} (1 - u_{ki})^m \}, \quad (\zeta > 0). \tag{6}$$

All of the above differ from the original function proposed by Dunn [7, 8] and Bezdek [1, 2]. $J_B(U, V)$ has a nonnegative parameter $\varepsilon$ proposed by Ichihashi [28]; when $\varepsilon = 0$, $J_B(U, V)$ is the original objective function. $J_E(U, V)$ has an additional entropy term; the use of entropy in fuzzy c-means clustering has been proposed by a number of researchers, e.g., [19, 20, 24]. $J_P(U, V)$ has been proposed by Krishnapuram and Keller [18] for possibilistic clustering; this function can also be used for fuzzy c-means with constraint (3) when $m = 2$.

We use the alternate minimization procedure FCM in the following, where $J(U, V)$ is either $J_B(U, V)$, $J_E(U, V)$, or $J_P(U, V)$. Minimization with respect to $U$ is subject to constraint (3).

FCM Algorithm of Alternate Optimization.
FCM1: Set an initial value of $V$ randomly.
FCM2: Minimize $J(U, V)$ with respect to $U$. Let the optimal solution be $U$.
FCM3: Minimize $J(U, V)$ with respect to $V$. Let the optimal solution be $V$.
FCM4: If $(U, V)$ is convergent, stop. Otherwise go to FCM2.
End FCM.

We show the solutions of FCM2 and FCM3 for each objective function, where the derivations are omitted.

Solution for $J_B$:
$$u_{ki} = \frac{\left( \dfrac{1}{D(x_k, v_i) + \varepsilon} \right)^{\frac{1}{m-1}}}{\displaystyle\sum_{j=1}^{c} \left( \dfrac{1}{D(x_k, v_j) + \varepsilon} \right)^{\frac{1}{m-1}}}, \tag{7}$$
$$v_i = \frac{\sum_{k=1}^{N} (u_{ki})^m x_k}{\sum_{k=1}^{N} (u_{ki})^m}. \tag{8}$$

Solution for $J_E$:
$$u_{ki} = \frac{\exp(-\lambda D(x_k, v_i))}{\sum_{j=1}^{c} \exp(-\lambda D(x_k, v_j))}, \tag{9}$$
$$v_i = \frac{\sum_{k=1}^{N} u_{ki} x_k}{\sum_{k=1}^{N} u_{ki}}. \tag{10}$$

Solution for $J_P$:
$$u_{ki} = \frac{\left( \dfrac{1}{1 + \zeta D(x_k, v_i)} \right)^{\frac{1}{m-1}}}{\displaystyle\sum_{j=1}^{c} \left( \dfrac{1}{1 + \zeta D(x_k, v_j)} \right)^{\frac{1}{m-1}}}, \tag{11}$$
$$v_i = \frac{\sum_{k=1}^{N} (u_{ki})^m x_k}{\sum_{k=1}^{N} (u_{ki})^m}, \tag{12}$$
where $m = 2$.
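Since the updates (9) and (10) are both in closed form, the FCM loop for $J_E$ is only a few lines. The following is a minimal NumPy sketch, not the author's implementation; the function name, the initialization, and the convergence test on the cluster centers are our own choices:

```python
import numpy as np

def fcm_entropy(X, c, lam=1.0, max_iter=100, tol=1e-6, seed=0):
    """FCM alternate optimization for the entropy-based objective J_E (5),
    repeating the updates (9) (FCM2) and (10) (FCM3) until convergence."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    # FCM1: initialize the cluster centers, here by sampling data points.
    V = X[rng.choice(N, size=c, replace=False)].copy()
    for _ in range(max_iter):
        # Squared Euclidean distances (1): D[k, i] = ||x_k - v_i||^2.
        D = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        # FCM2: membership update (9); each row satisfies constraint (3).
        W = np.exp(-lam * D)
        U = W / W.sum(axis=1, keepdims=True)
        # FCM3: center update (10), the membership-weighted means.
        V_new = (U.T @ X) / U.sum(axis=0)[:, None]
        # FCM4: stop when the centers no longer move.
        if np.linalg.norm(V_new - V) < tol:
            return U, V_new
        V = V_new
    return U, V
```

Replacing the pair (9)/(10) with (7)/(8) or (11)/(12) gives the corresponding loops for $J_B$ and $J_P$ in the same way.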

B. Basic Functions

We introduce what we call basic functions in this paper:
$$g_B(x, y) = \frac{1}{(\varepsilon + D(x, y))^{\frac{1}{m-1}}}, \tag{13}$$
$$g_E(x, y) = \exp(-\lambda D(x, y)), \tag{14}$$
$$g_P(x, y) = \frac{1}{(1 + \zeta D(x, y))^{\frac{1}{m-1}}}. \tag{15}$$

We also assume that $g(x, y)$ is either $g_B(x, y)$, $g_E(x, y)$, or $g_P(x, y)$.

A unified representation is now obtained for the optimal $u_{ki}$:
$$u_{ki} = \frac{g(x_k, v_i)}{\sum_{j=1}^{c} g(x_k, v_j)} \tag{16}$$
for all three objective functions, since $g(x, y)$ represents either $g_B(x, y)$, $g_E(x, y)$, or $g_P(x, y)$.
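To see the unified form (16) at work, here is a small sketch, not from the paper, in which the three basic functions (13)-(15) are interchangeable and a single membership update serves all of them; the helper names and default parameter values are ours:

```python
import numpy as np

def g_B(d, eps=0.1, m=2.0):
    # Basic function (13) for the regularized objective J_B (eps > 0).
    return (eps + d) ** (-1.0 / (m - 1.0))

def g_E(d, lam=1.0):
    # Basic function (14) for the entropy-based objective J_E.
    return np.exp(-lam * d)

def g_P(d, zeta=1.0, m=2.0):
    # Basic function (15) for the possibilistic objective J_P.
    return (1.0 + zeta * d) ** (-1.0 / (m - 1.0))

def memberships(X, V, g):
    """Unified update (16): u_ki = g(D_ki) / sum_j g(D_kj), with the
    squared Euclidean distance (1) passed to the basic function g."""
    D = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # shape (N, c)
    G = g(D)
    return G / G.sum(axis=1, keepdims=True)
```

Swapping $g_B$, $g_E$, or $g_P$ changes only the shape of the membership function; the surrounding alternate minimization is untouched.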


C. Possibilistic Clustering

Possibilistic clustering [18] uses $J_P(U, V)$ but with a different constraint:
$$M = \{\, U = (u_{kj}) : u_{kj} > 0,\ \forall k, j \,\}.$$

Note that $J_P(U, V)$ and $M$ in this paper are simpler than the original formulation [18], but the essential discussion is the same.

We cannot use $J_B(U, V)$, which leads to a trivial solution in possibilistic clustering, but $J_E(U, V)$ can be used [4]. The solution of possibilistic clustering for $J_E(U, V)$ is
$$u_{ki} = g_E(x_k, v_i) \tag{17}$$
using basic function $g_E$ with $v_i$ given by (10), while the solution for $J_P(U, V)$ is
$$u_{ki} = g_P(x_k, v_i) \tag{18}$$
using basic function $g_P$ with $v_i$ given by (12). Note that $m = 2$ is not assumed for possibilistic clustering.

D. Fuzzy Classifiers

There have been many discussions of fuzzy classifiers derived from fuzzy clustering; here we show a standard classifier that is naturally derived from the optimal solutions. Note that $u_{ki}$ is given only on the objects $x_k$, whereas what we need are fuzzy classification rules defined by the solutions on the whole space.

To understand classification rules clearly, let us consider crisp c-means, where we use the nearest-prototype allocation rule: when the set of cluster prototypes is determined, we allocate an object to its nearest prototype, i.e.,
$$u_{ki} = \begin{cases} 1 & (i = \arg\min_{1 \le j \le c} D(x_k, v_j)), \\ 0 & (\text{otherwise}). \end{cases}$$
Note that the objective function here is $J_H$.

This allocation rule is applied to all points in the space, and the result is the set of Voronoi regions [17] whose centers are the cluster prototypes. Specifically, we define
$$S_i(V) = \{\, x \in R^p : \|x - v_i\| < \|x - v_j\|,\ \forall j \ne i \,\}$$
as the Voronoi region for a given set of cluster prototypes $V$. We then have
$$\bigcup_{i=1}^{c} \overline{S_i(V)} = R^p, \qquad S_i(V) \cap S_j(V) = \emptyset \ (i \ne j),$$
where $\overline{S_i(V)}$ is the closure of $S_i(V)$. The nearest allocation rule then is as follows:
$$\text{if } x \in S_i(V) \text{ then } x \to \text{cluster } i.$$

When we consider fuzzy rules, a function $U_i(x; V)$ that interpolates $u_{ki}$ is used. We define the following function using the basic function:
$$U_i(x; V) = \frac{g(x, v_i)}{\sum_{j=1}^{c} g(x, v_j)}, \quad x \in R^p, \tag{19}$$
where $g(x, y)$ is either $g_B(x, y)$, $g_E(x, y)$, or $g_P(x, y)$.

Fuzzy rules are simpler in possibilistic clustering:
$$U_i(x; v_i) = g(x, v_i), \quad x \in R^p, \tag{20}$$
where $g(x, y)$ is either $g_E(x, y)$ or $g_P(x, y)$. The rule is thus the same as the basic function in possibilistic clustering.

We show a number of theoretical properties of the fuzzy rules defined by the above functions. The proofs are given in [25, 28] and omitted here.

Proposition 1: Let $U_i(x; V)$ be defined with function $g_B$; in other words, $J_B$ is used. Suppose $\varepsilon \to 0$. Then the maximum point of $U_i(x; V)$ approaches $x = v_i$:
$$\arg\max_{x \in R^p} U_i(x; V) \to v_i, \quad \text{as } \varepsilon \to 0.$$
Moreover, for all $\varepsilon \ge 0$, we have
$$\lim_{\|x\| \to \infty} U_i(x; V) = \frac{1}{c}.$$

Proposition 2: Let $U_i(x; V)$ be defined with function $g_P$; in other words, $J_P$ is used with $m = 2$. Suppose $\zeta \to +\infty$. Then the maximum point of $U_i(x; V)$ approaches $x = v_i$:
$$\arg\max_{x \in R^p} U_i(x; V) \to v_i, \quad \text{as } \zeta \to +\infty.$$
Moreover, for all $\zeta \ge 0$, we have
$$\lim_{\|x\| \to \infty} U_i(x; V) = \frac{1}{c}.$$
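The $1/c$ limit in Propositions 1 and 2 can be checked directly from (19); the following short computation, not spelled out in the paper, uses $g_B$:

$$U_i(x; V) = \left[\, \sum_{j=1}^{c} \left( \frac{D(x, v_i) + \varepsilon}{D(x, v_j) + \varepsilon} \right)^{\frac{1}{m-1}} \right]^{-1}, \qquad \frac{D(x, v_i) + \varepsilon}{D(x, v_j) + \varepsilon} = \frac{\|x - v_i\|^2 + \varepsilon}{\|x - v_j\|^2 + \varepsilon} \to 1 \quad (\|x\| \to \infty),$$

so each of the $c$ terms in the bracket tends to 1 and $U_i(x; V) \to 1/c$; the same argument with $1 + \zeta D$ in place of $D + \varepsilon$ gives the limit for $g_P$.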

Hence the functions of the fuzzy rules for $J_B$ and $J_P$ behave similarly when the point $x$ goes far away, while the maximum point approaches the cluster center as the respective parameter tends to its limit. In contrast, the fuzzy rule $U_i(x; V)$ for $J_E$ has a quite different property. To describe this, we should discuss Voronoi regions again. In many cases, fuzzy clusters are made crisp by the maximum membership rule:
$$\text{if } i = \arg\max_{1 \le j \le c} U_j(x; V) \text{ then } x \to \text{cluster } i.$$


Accordingly we can define the set of points that belong to cluster $i$:
$$T_i(V) = \{\, x \in R^p : i = \arg\max_{1 \le j \le c} U_j(x; V) \,\}.$$
We then have the next proposition.

Proposition 3: For all choices of $g = g_B$, $g = g_E$, and $g = g_P$,
$$T_i(V) = \overline{S_i(V)}.$$
Thus $T_i(V)$ is the closure of the Voronoi region with center $v_i$, and $T_i(V)$ is the same for all three objective functions $J_B$, $J_E$, and $J_P$.

Let us now consider $U_i(x; V)$ for $J_E$.

Proposition 4: Let $U_i(x; V)$ be defined with function $g_E$; in other words, $J_E$ is used. Assume the $v_i$ are in general position in the sense that no three of them are on a line. If a Voronoi region $T_i(V)$ is bounded, then
$$\lim_{\|x\| \to \infty} U_i(x; V) = 0.$$
If a Voronoi region $T_i(V)$ is unbounded and $x$ moves inside $T_i(V)$, then
$$\lim_{\|x\| \to \infty} U_i(x; V) = 1.$$
In both cases, $0 < U_i(x; V) < 1$ for all $x \in R^p$. The proof is given in [25] and omitted here.

Possibilistic clustering

As the fuzzy rules in possibilistic clustering are bell-shaped functions, we have the same property:
$$\arg\max_{x \in R^p} U_i(x; v_i) = v_i, \qquad \lim_{\|x\| \to \infty} U_i(x; v_i) = 0$$
for both $g_E$ and $g_P$.

If possibilistic clusters should be made crisp, we define
$$T_i'(V) = \{\, x \in R^p : i = \arg\max_{1 \le j \le c} U_j(x; v_j) \,\}.$$
We have the next proposition.

Proposition 5: For both $g = g_E$ and $g = g_P$,
$$T_i'(V) = \overline{S_i(V)}.$$
The Voronoi regions are thus derived again.

3. Size and Covariance of a Cluster

We frequently need to recognize a prolonged cluster, but the original fuzzy c-means cannot do this, as the Voronoi regions cannot separate such a prolonged region. To solve this problem, cluster covariances in fuzzy c-means have been considered by Gustafson and Kessel [11]. However, there is another problem: separating a dense cluster from a sparse cluster, for which "density" or "cluster size" has to be considered.

To solve both problems, a generalized objective function with a Kullback-Leibler information term has been proposed by Ichihashi and his colleagues [15, 28]. That is, the following function is used for this purpose:

$$J_{KL}(U, V, A, S) = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ki}\, D(x_k, v_i; S_i) + \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ki} \left\{ \nu^{-1} \log \frac{u_{ki}}{\alpha_i} + \log |S_i|^{\frac{1}{2}} \right\}, \tag{23}$$
where the variable $A = (\alpha_1, \ldots, \alpha_c)$ controls cluster sizes with the constraint
$$\mathcal{A} = \left\{\, A : \sum_{i=1}^{c} \alpha_i = 1,\ \alpha_j \ge 0,\ j = 1, \ldots, c \,\right\}. \tag{24}$$
Another variable is $S = (S_1, \ldots, S_c)$; each $S_i$ ($i = 1, \ldots, c$) is a $p \times p$ positive-definite matrix with determinant $|S_i|$. In addition,
$$D(x, v_i; S_i) = (x - v_i)^T S_i^{-1} (x - v_i) \tag{25}$$

is the squared Mahalanobis distance for cluster $i$.

Since this objective function has four variables, alternate optimization means minimization with respect to one variable while the other three are fixed: after giving initial values for $V$, $A$, and $S$, we repeat
$$U = \arg\min_U J_{KL}(U, V, A, S), \qquad V = \arg\min_V J_{KL}(U, V, A, S),$$
$$A = \arg\min_A J_{KL}(U, V, A, S), \qquad S = \arg\min_S J_{KL}(U, V, A, S)$$
until convergence. The solutions are as follows [28].

Solutions for $J_{KL}$:

$$u_{ki} = \frac{\dfrac{\alpha_i}{|S_i|^{1/2}} \exp\left( -\nu D(x_k, v_i; S_i) \right)}{\displaystyle\sum_{j=1}^{c} \dfrac{\alpha_j}{|S_j|^{1/2}} \exp\left( -\nu D(x_k, v_j; S_j) \right)}, \tag{26}$$
$$v_i = \frac{\sum_{k=1}^{N} u_{ki} x_k}{\sum_{k=1}^{N} u_{ki}}, \tag{27}$$
$$\alpha_i = \frac{\sum_{k=1}^{N} u_{ki}}{N}, \tag{28}$$


$$S_i = \frac{1}{\sum_{k=1}^{N} u_{ki}} \sum_{k=1}^{N} u_{ki} (x_k - v_i)(x_k - v_i)^T. \tag{29}$$

Note that the above solutions are similar to those obtained by the EM algorithm [5] for Gaussian mixture distributions [22, 29]. This model for fuzzy c-means clustering thus has a close relationship with the statistical model of mixture distributions.
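Written out, one round of the alternate optimization is a direct transcription of (26)-(29). The following NumPy sketch assumes the reconstruction of (23)-(29) given above; the function and variable names are ours, not the authors':

```python
import numpy as np

def kl_fcm_round(X, V, alpha, S, nu=1.0):
    """One round of the alternate updates (26)-(29) for J_KL.
    A sketch under the reconstruction above, not the authors' code."""
    N, p = X.shape
    c = V.shape[0]
    # Squared Mahalanobis distances (25): D[k, i] = (x_k - v_i)^T S_i^{-1} (x_k - v_i).
    D = np.empty((N, c))
    for i in range(c):
        diff = X - V[i]                                   # (N, p)
        D[:, i] = np.einsum('kp,pq,kq->k', diff, np.linalg.inv(S[i]), diff)
    # Membership update (26); np.linalg.det(S) has shape (c,).
    W = alpha / np.sqrt(np.linalg.det(S)) * np.exp(-nu * D)
    U = W / W.sum(axis=1, keepdims=True)
    # Center (27), size (28), and covariance (29) updates.
    V_new = (U.T @ X) / U.sum(axis=0)[:, None]
    alpha_new = U.sum(axis=0) / N
    S_new = np.empty((c, p, p))
    for i in range(c):
        diff = X - V_new[i]
        S_new[i] = (U[:, i, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0) / U[:, i].sum()
    return U, V_new, alpha_new, S_new
```

Iterating this round until the memberships stop changing mirrors the EM iteration for a Gaussian mixture, which is the relationship noted above.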

4. Kernel Functions in Fuzzy Clustering

Many studies on kernel functions have been done [31]. Algorithms of kernel-based fuzzy c-means (e.g., [10, 26, 27]) have also been developed. We review the clustering algorithms and also discuss kernelized cluster validity measures. We moreover study a class of new kernel functions.

A. Kernel-based algorithms

Linear cluster boundaries between Voronoi regions are obtained by fuzzy c-means clustering. When we use the KL-information method, we have curved boundaries described by quadratic functions. In contrast, more general nonlinear boundaries can be obtained using kernel functions, as discussed in support vector machines [34]. A high-dimensional feature space $H$ is assumed, while the original space $R^p$ is called the data space. $H$ is an inner product space whose inner product is denoted $\langle \cdot, \cdot \rangle$. The norm of $H$ for $g \in H$ is given by $\|g\|_H^2 = \langle g, g \rangle$.

A transformation $\Phi : R^p \to H$ is used whereby $x_k$ is mapped to $\Phi(x_k)$. An explicit representation of $\Phi(x)$ is unknown in general, but the inner product $\langle \Phi(x), \Phi(y) \rangle$ is assumed to be represented by a kernel function:
$$K(x, y) = \langle \Phi(x), \Phi(y) \rangle. \tag{30}$$
A well-known kernel function is the Gaussian kernel:
$$K(x, y) = \exp\left( -C \|x - y\|^2 \right), \quad (C > 0). \tag{31}$$
Note that $K(x, y) = g_E(x, y)$ holds when $C = \lambda$.

Objective functions $J_B$, $J_E$, and $J_P$ are used, but the dissimilarity is changed as follows:
$$D_{ki} = \|\Phi(x_k) - v_i\|_H^2, \tag{32}$$
where $v_i \in H$.

Note: There is another formulation using $D_{ki} = \|\Phi(x_k) - \Phi(v_i)\|_H^2$ instead of (32), which is omitted here (see, e.g., [35]).

When we derive a kernel-based fuzzy c-means algorithm, we should consider two problems: one is the basic function and the other is the updating scheme.

The basic functions are changed as follows:
$$g_B(x, y) = \frac{1}{(\varepsilon + K(x,x) + K(y,y) - 2K(x,y))^{\frac{1}{m-1}}}, \tag{33}$$
$$g_E(x, y) = \exp(-\lambda (K(x,x) + K(y,y) - 2K(x,y))), \tag{34}$$
$$g_P(x, y) = \frac{1}{(1 + \zeta (K(x,x) + K(y,y) - 2K(x,y)))^{\frac{1}{m-1}}}, \tag{35}$$
whereby the solution $u_{ki}$ is given by (16), with the function $g = g_B$, $g = g_E$, or $g = g_P$ changed as above. The cluster prototype is given by

$$v_i = \frac{\sum_{k=1}^{N} (u_{ki})^m \Phi(x_k)}{\sum_{k=1}^{N} (u_{ki})^m},$$
but the function $\Phi(x_k)$ is generally unknown. Hence we cannot use the FCM algorithm directly. Instead, we update the dissimilarity measure $D_{ki}$:

$$D_{ki} = K_{kk} - \frac{2}{\sum_{j=1}^{N} (u_{ji})^m} \sum_{j=1}^{N} (u_{ji})^m K_{jk} + \frac{1}{\left( \sum_{j=1}^{N} (u_{ji})^m \right)^2} \sum_{j=1}^{N} \sum_{l=1}^{N} (u_{ji})^m (u_{li})^m K_{jl}, \tag{36}$$
where $K_{jk} = K(x_j, x_k)$. Note that $m = 1$ in (36) when $J_E$ is considered. We thus repeat (16) and (36) until convergence.
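Since (36) needs only the Gram matrix $K_{jk}$, the whole iteration runs without ever forming $\Phi$. Below is a sketch for the entropy-based case ($m = 1$); the function name and the random initial partition are our own choices, not specified in the paper:

```python
import numpy as np

def kernel_fcm_entropy(K, c, lam=1.0, n_iter=50, seed=0):
    """Kernel fuzzy c-means with the entropy-based objective (m = 1):
    repeat the membership update (16) and the dissimilarity update (36)
    on the Gram matrix K[j, k] = K(x_j, x_k). A sketch, not the paper's code."""
    rng = np.random.default_rng(seed)
    N = K.shape[0]
    U = rng.random((N, c))
    U /= U.sum(axis=1, keepdims=True)        # start from a random fuzzy partition
    diagK = np.diag(K)                       # K(x_k, x_k)
    for _ in range(n_iter):
        s = U.sum(axis=0)                    # column sums, one per cluster
        # Dissimilarity update (36) with m = 1, vectorized over k and i.
        cross = (K @ U) / s                              # (N, c): middle term of (36)
        quad = np.einsum('ji,jl,li->i', U, K, U) / s**2  # (c,): last term of (36)
        D = diagK[:, None] - 2.0 * cross + quad[None, :]
        # Membership update (16) with the kernelized basic function g_E (34).
        G = np.exp(-lam * D)
        U = G / G.sum(axis=1, keepdims=True)
    return U, D
```

The einsum line evaluates the double sum of (36) once per cluster rather than once per pair $(k, i)$, keeping each sweep at $O(N^2 c)$.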

B. Kernelized Cluster Validity Measures

Various cluster validity measures have been proposed [1, 6] in order to determine an appropriate number of clusters. They are divided into two classes: one class uses the membership values alone. A typical example is the entropy
$$E(U, c) = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ki} \log u_{ki},$$

whereby the number $c$ that maximizes $E(U, c)$ is selected. The other class takes geometrical characteristics into account. A typical method uses the fuzzy covariance matrix for cluster $i$:
$$F_i = \frac{\sum_{k=1}^{N} (u_{ki})^m (x_k - v_i)(x_k - v_i)^T}{\sum_{k=1}^{N} (u_{ki})^m}. \tag{37}$$


Gath and Geva [9] use the sum of the square roots of the determinants of $F_i$:
$$W_{\det} = \sum_{i=1}^{c} \sqrt{\det F_i}. \tag{38}$$
We also consider the sum of the traces of $F_i$:
$$W_{tr} = \sum_{i=1}^{c} \mathrm{tr}\, F_i. \tag{39}$$

Hashimoto et al. [12] showed that the trace works as well as the determinant, by randomly generating many simulation examples and testing different validity measures.

When we use kernel-based clustering, we should also have kernel-based validity measures. Let us consider kernelized versions of (38) and (39) for this purpose. The kernel-based fuzzy covariance matrix is the following:
$$KF_i = \frac{\sum_{k=1}^{N} (u_{ki})^m (\Phi(x_k) - v_i)(\Phi(x_k) - v_i)^T}{\sum_{k=1}^{N} (u_{ki})^m}, \tag{40}$$
where $v_i$ is not explicitly given.

Note that the determinant of the kernelized fuzzy covariance is inappropriate, since the next relation holds:
$$\det KF_i \to 0, \quad \text{as } N \to \infty.$$
The proof of this relation is simple, because the monotone decreasing sequence $\lambda_1, \lambda_2, \ldots$ of the eigenvalues of $KF_i$ converges to zero as $N \to \infty$ (see, e.g., [31]). Hence we have
$$\log \det KF_i = \sum_{j=1}^{N} \log \lambda_j \to -\infty, \quad \text{as } N \to \infty.$$

In contrast, the trace of $KF_i$ is useful. After some calculation, we have
$$\mathrm{tr}\, KF_i = \frac{1}{\sum_{k=1}^{N} (u_{ki})^m} \sum_{k=1}^{N} (u_{ki})^m \|\Phi(x_k) - v_i\|^2 = \frac{1}{\sum_{k=1}^{N} (u_{ki})^m} \sum_{k=1}^{N} (u_{ki})^m D_{ki}, \tag{41}$$
where $D_{ki}$ is given by (36). We hence define
$$KW_{tr} = \sum_{i=1}^{c} \mathrm{tr}\, KF_i. \tag{42}$$
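Since (41) is just a weighted mean of the kernelized dissimilarities, $KW_{tr}$ costs almost nothing once (36) has been computed. A two-line sketch, with names of our own choosing:

```python
import numpy as np

def kw_tr(U, D, m=2.0):
    """Kernelized trace measure (42): the sum over clusters of tr KF_i,
    each computed as the weighted mean (41) of the dissimilarities D (36)."""
    Um = U ** m
    return ((Um * D).sum(axis=0) / Um.sum(axis=0)).sum()
```

Here U and D are the N x c membership and dissimilarity arrays maintained by the kernel algorithm; for the entropy-based objective, m = 1 as in (36).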

Numerical experiments

We now have a question: although kernel-based clustering works well for some typical clusters with nonlinear boundaries (such as those in Fig. 1), does it also work well for ordinary ball-shaped clusters?

To answer this question, we compared the above measures using randomly generated data with artificial clusters and evaluated the numbers of clusters. The conditions of the random data are shown below. The basic condition is shown as No. 1 in Table 1. Then the diameter of each cluster was changed in No. 2. Next, the number of data points in each cluster was randomly changed in No. 3. Finally, both the diameter and the number of members were changed in No. 4. The details of these conditions are shown in Table 1. Note that the randomly generated clusters are ball-shaped, and we tested whether the kernel-based measure can judge the correct number of clusters as well as the non-kernelized measures can.

Table 1. Conditions for random generation of clusters.

Conditions                        No.1      No.2        No.3      No.4
Total number of clusters          4         4           4         4
Total number of data points       400       400         400       400
Dimensions of data set            2, 3      2, 3        2, 3      2, 3
Range of cluster centers          0.0~1.0   0.0~1.0     0.0~1.0   0.0~1.0
Number of data in each cluster    100       100         50~150    50~150
Diameter of each cluster          0.1       0.05~0.193  0.1       0.05~0.193

The process of evaluating the numbers of clusters is as follows:
(1) Generate data sets under conditions No. 1-4.
(2) Perform clustering 100 times with random initial values, and then use the resulting clusters having the minimum value of the objective function for the evaluation.
(3) Evaluate the above clusters by each validity measure.
(4) Give the label "correct" if the judged number of clusters is 4; otherwise give the label "wrong."
(5) Repeat steps (1)-(4) 1000 times.
(6) Calculate the percentage of the label "correct" for each validity measure.

We observe in Table 2 that the kernelized measure $KW_{tr}$ is as effective as the non-kernelized measures. Moreover, it has been shown that $KW_{tr}$ can judge the correct number of clusters for sets of points like the one in Fig. 1 (see, e.g., [28]).

C. Positive-definite kernels derived from fuzzy c-means


As the last topic in this paper, we consider kernel functions again. We note that the Gaussian kernel is the same as the basic function $g_E$. Note that the other basic functions $g_B$ and $g_P$ are also bell-shaped, with "longer tails."

Table 2. The ratio of accurate numbers of clusters for each condition using $J_B$ ($m = 2$) with dimension $p = 2, 3$.

No.1     W_tr    W_det   KW_tr
p = 2    0.958   0.943   0.952
p = 3    0.994   0.995   0.994

No.2     W_tr    W_det   KW_tr
p = 2    0.770   0.931   0.779
p = 3    0.982   0.992   0.980

No.3     W_tr    W_det   KW_tr
p = 2    0.953   0.931   0.949
p = 3    0.993   0.993   0.990

No.4     W_tr    W_det   KW_tr
p = 2    0.710   0.897   0.728
p = 3    0.981   0.982   0.972

Here is a question: are $g_B$ and $g_P$ also positive-definite kernel functions? We also have a second question: are they as useful in kernel-based clustering as the Gaussian kernel?

The first answer is shown in the next proposition.

Proposition 6: The functions
$$g_B(x, y) = \frac{1}{(\varepsilon + D(x, y))^{\frac{1}{m-1}}}, \quad \text{with } \varepsilon > 0,$$
and
$$g_P(x, y) = \frac{1}{(1 + \zeta D(x, y))^{\frac{1}{m-1}}}$$
are positive-definite kernels.

The proof is based on a theorem by Schönberg [32], which states that a class of positive-definite kernels can be derived from completely monotone functions. We proved that $g_B$ and $g_P$ are derived from completely monotone functions [14]; the details are given in [14] and omitted here. Note that $g_B$ is not positive-definite if $\varepsilon = 0$; i.e., the original objective function does not give a kernel function. The regularization parameter $\varepsilon > 0$ is thus necessary. Accordingly, we can use these two functions in kernel-based fuzzy c-means algorithms instead of the Gaussian kernel.
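Proposition 6 can be spot-checked numerically: the Gram matrix of a positive-definite kernel has no negative eigenvalues, up to rounding error. A small sketch, assuming $m = 2$ (so the exponent $1/(m-1)$ is 1) and parameter values of our own choosing:

```python
import numpy as np

def gram(X, g):
    """Gram matrix G[j, k] = g(x_j, x_k) for a candidate kernel g
    taking the squared Euclidean distance as its argument."""
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return g(D)

g_B = lambda D, eps=1.0: 1.0 / (eps + D)          # (13) with m = 2, eps > 0
g_P = lambda D, zeta=1.0: 1.0 / (1.0 + zeta * D)  # (15) with m = 2

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
for name, g in (("g_B", g_B), ("g_P", g_P)):
    lam_min = np.linalg.eigvalsh(gram(X, g)).min()
    print(name, "smallest Gram eigenvalue:", lam_min)  # >= 0 up to rounding
```

With $\varepsilon = 0$ the diagonal entries $1/D(x, x)$ of the $g_B$ Gram matrix are not even finite, consistent with the remark that the unregularized function gives no kernel.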

Illustrative examples

Let us consider the two sets of points shown in Figs. 1 and 2. The former is typical in discussing the effect of kernel functions. Crisp and fuzzy c-means cannot divide the set of points into the outer circle and the inner ball, since they produce linear cluster boundaries. In contrast, the Gaussian kernel is known to successfully divide the two. As expected, $g_B$ and $g_P$ also perfectly succeed in dividing the outer circle and the inner ball [14].

Figure 1. First data set: a ball inside a circle.

The second figure shows the "two crescents," similar to examples in semi-supervised learning [3]. It is more difficult to divide these two sets of points.

Figure 2. Second data set: two crescents.


We summarize the results of classification in Table 3, where misclassifications are fewer for $g_B$ and $g_P$ than for the Gaussian kernel. We thus have the second answer: $g_B$ and $g_P$ are useful in these examples with nonlinear cluster boundaries.

Table 3. Summary of misclassifications by fuzzy c-means with the three kernel functions applied to the two-crescents data. Calculations were repeated 50 times for each kernel with different initial random values. Numbers in parentheses are from the entropy fuzzy c-means. The parameters are $m = 2$, $\varepsilon = 1.0$, $\zeta = 1.0$, and $\lambda = 1.0$.

Percentage of misclassifications   g_B       g_E (Gaussian)   g_P
0~15                               14 (7)    1 (0)            11 (10)
16~30                              10 (12)   3 (5)            13 (13)
31~45                              8 (13)    8 (8)            12 (15)
46~                                18 (18)   38 (37)          14 (12)

5. Conclusions

An overview of fuzzy c-means clustering with three different objective functions has been given, with focus on fuzzy classifiers, a generalization including variables for cluster size and covariance, and kernel functions. The two discussions on kernel functions are kernelized validity measures and new kernels derived from the basic functions of fuzzy c-means. The kernel functions $g_B$ and $g_P$ are useful for the examples given here. We expect that they are also useful in support vector machines, but many more experiments using real numerical data are necessary.

In spite of their importance, the topics herein are relatively unknown to the fuzzy community interested in clustering. They provide, however, many future research opportunities both in theory and applications. For example, applications to semi-supervised clustering [3, 35] and a variety of new fuzzy clustering algorithms [33, 36] will be promising.

References

[1] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[2] J. C. Bezdek, J. Keller, R. Krishnapuram, and N. R. Pal, Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Kluwer, Boston, 1999.
[3] O. Chapelle, B. Schölkopf, and A. Zien, eds., Semi-Supervised Learning, MIT Press, Cambridge, Massachusetts, 2006.
[4] R. N. Davé and R. Krishnapuram, "Robust Clustering Methods: a Unified View," IEEE Trans. on Fuzzy Systems, vol. 5, pp. 270-293, 1997.
[5] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," J. R. Stat. Soc., vol. B39, pp. 1-38, 1977.
[6] D. Dumitrescu, B. Lazzerini, and L. C. Jain, Fuzzy Sets and Their Application to Clustering and Training, CRC Press, Boca Raton, Florida, 2000.
[7] J. C. Dunn, "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters," J. of Cybernetics, vol. 3, pp. 32-57, 1974.
[8] J. C. Dunn, "Well-Separated Clusters and Optimal Fuzzy Partitions," J. of Cybernetics, vol. 4, pp. 95-104, 1974.
[9] I. Gath and A. B. Geva, "Unsupervised Optimal Fuzzy Clustering," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 11, no. 7, pp. 773-781, 1989.
[10] M. Girolami, "Mercer Kernel Based Clustering in Feature Space," IEEE Trans. on Neural Networks, vol. 13, no. 3, pp. 780-784, 2002.
[11] E. E. Gustafson and W. C. Kessel, "Fuzzy Clustering with a Fuzzy Covariance Matrix," IEEE CDC, San Diego, California, pp. 761-766, 1979.
[12] W. Hashimoto, T. Nakamura, and S. Miyamoto, "Comparison and Evaluation of Different Cluster Validity Measures Including Their Kernelization," Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 13, no. 3, pp. 204-209, 2009.
[13] F. Höppner, F. Klawonn, R. Kruse, and T. Runkler, Fuzzy Cluster Analysis, Wiley, Chichester, 1999.
[14] J. S. Hwang and S. Miyamoto, "Kernel Functions Derived from Fuzzy Clustering and Their Application to Kernel Fuzzy c-Means," Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 15, pp. 90-94, 2011.
[15] H. Ichihashi, K. Honda, and N. Tani, "Gaussian Mixture PDF Approximation and Fuzzy c-Means Clustering with Entropy Regularization," Proc. of Fourth Asian Fuzzy Systems Symposium, vol. 1, pp. 217-221, 2000.
[16] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, New York, 1990.
[17] T. Kohonen, Self-Organizing Maps, 2nd Ed., Springer, Berlin, 1997.
[18] R. Krishnapuram and J. M. Keller, "A Possibilistic Approach to Clustering," IEEE Trans. on Fuzzy Systems, vol. 1, pp. 98-110, 1993.
[19] R. P. Li and M. Mukaidono, "A Maximum Entropy Approach to Fuzzy Clustering," Proc. of the 4th IEEE Intern. Conf. on Fuzzy Systems (FUZZ-IEEE/IFES'95), Yokohama, Japan, pp. 2227-2232, March 20-24, 1995.
[20] R. P. Li and M. Mukaidono, "Gaussian Clustering Method Based on Maximum-Fuzzy-Entropy Interpretation," Fuzzy Sets and Systems, vol. 102, pp. 253-258, 1999.
[21] J. B. MacQueen, "Some Methods of Classification and Analysis of Multivariate Observations," Proc. of 5th Berkeley Symposium on Math. Stat. and Prob., pp. 281-297, 1967.
[22] G. McLachlan and D. Peel, Finite Mixture Models, Wiley, New York, 2000.
[23] S. Miyamoto, Fuzzy Sets in Information Retrieval and Cluster Analysis, Kluwer, Dordrecht, 1990.
[24] S. Miyamoto and M. Mukaidono, "Fuzzy c-Means as a Regularization and Maximum Entropy Approach," Proc. of the 7th International Fuzzy Systems Association World Congress (IFSA'97), Prague, Czech, vol. II, pp. 86-92, June 25-30, 1997.
[25] S. Miyamoto, Introduction to Cluster Analysis, Morikita-Shuppan, Tokyo, 1999 (in Japanese).
[26] S. Miyamoto and Y. Nakayama, "Algorithms of Hard c-Means Clustering Using Kernel Functions in Support Vector Machines," Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 7, no. 1, pp. 19-24, 2003.
[27] S. Miyamoto and D. Suizu, "Fuzzy c-Means Clustering Using Kernel Functions in Support Vector Machines," Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 7, no. 1, pp. 25-30, 2003.
[28] S. Miyamoto, H. Ichihashi, and K. Honda, Algorithms for Fuzzy Clustering, Springer, Berlin, 2008.
[29] R. A. Redner and H. F. Walker, "Mixture Densities, Maximum Likelihood and the EM Algorithm," SIAM Review, vol. 26, no. 2, pp. 195-239, 1984.
[30] E. H. Ruspini, "A New Approach to Clustering," Information and Control, vol. 15, pp. 22-32, 1969.
[31] B. Schölkopf and A. Smola, Learning with Kernels, MIT Press, Cambridge, Massachusetts, 2002.
[32] I. J. Schönberg, "Metric Spaces and Completely Monotone Functions," Annals of Mathematics, vol. 39, no. 4, pp. 811-841, 1938.
[33] C.-C. Tsai, C.-C. Chen, C.-K. Chan, and Y.-Y. Li, "Behavior-Based Navigation Using Heuristic Fuzzy Kohonen Clustering Network for Mobile Service Robots," International Journal of Fuzzy Systems, vol. 12, no. 1, pp. 25-32, 2010.
[34] V. N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[35] N. Wang, X. Li, and X. Luo, "Semi-Supervised Kernel-Based Fuzzy c-Means with Pairwise Constraints," WCCI 2008 Proceedings, Hong Kong, China, pp. 1099-1103, June 1-6, 2008.
[36] F. Yu, J. Tang, and R. Cai, "Partially Horizontal Collaborative Fuzzy C-Means," International Journal of Fuzzy Systems, vol. 9, no. 4, pp. 198-204, 2007.
[37] L. A. Zadeh, "Similarity Relations and Fuzzy Orderings," Information Sciences, vol. 3, pp. 177-200, 1971.

Dr. Miyamoto was born in Osaka, Japan, in 1950. He received the B.S., M.S., and Dr. Eng. degrees in Applied Mathematics and Physics Engineering from Kyoto University, Japan, in 1973, 1975, and 1978, respectively. He is now a Professor at the Department of Risk Engineering, the University of Tsukuba, Japan. His current research interests include methodology for uncertainty modeling, data clustering algorithms, multisets, and methods for text mining. He is a member of the Society of Instrument and Control Engineers of Japan, the Information Processing Society of Japan, the Japan Society for Fuzzy Theory and Systems, and IEEE. He is a fellow of the International Fuzzy Systems Association.