National Cheng Kung University
Department of Computer Science and Information Engineering
Master's Thesis

Hierarchical Text Categorization Using
One-Class SVM

Student: Yi-Kun Tu
Advisor: Dr. Jung-Hsien Chiang

July 2003
Hierarchical Text Categorization Using One-Class SVM

Yi-Kun Tu*  Jung-Hsien Chiang**
Institute of Computer Science and Information Engineering, National Cheng Kung University

Chinese Abstract

With the rapid growth of information, automatic text categorization has become an important information analysis technique for processing and organizing data. Experience shows that as the number of data categories increases, performance measures such as precision and recall decrease accordingly; adopting a hierarchical category structure makes it possible to handle problems involving large amounts of data.

In this research, we use a one-class support vector machine to cluster documents, and use the clustering results to build a hierarchical structure that describes the relationships between categories. We then use two-class and multi-class support vector machines for supervised classification training.

Through three designed experiments, we explore the characteristics of the system built on one-class SVM and compare it with other approaches. The experimental results demonstrate that the proposed system achieves better performance.

*Author  **Advisor
Hierarchical Text Categorization Using
One-Class SVM
Yi-Kun Tu*  Jung-Hsien Chiang**
Department of Computer Science & Information Engineering,
National Cheng Kung University

Abstract

With the rapid growth of online information, text categorization has become one
of the key techniques for handling and organizing text data. Experience to date has
demonstrated that both precision and recall decrease as the number of categories increases. Hierarchical categorization affords the ability to deal with very large problems.
We utilize one-class SVM to perform support vector clustering, and then use the clustering results to construct a category hierarchy. Two-class and multi-class SVMs are used to perform the supervised classification.
We explore the one-class SVM model through three experiments. Performance is analyzed by comparison with other approaches; the experimental results show that the proposed category hierarchy works well.
*Author **Advisor
Acknowledgements

The completion of this thesis owes first to my advisor, Professor Jung-Hsien Chiang, for his attentive guidance and encouragement over these two years, and for providing an excellent learning environment. I thank God for placing my girlfriend Shu-Hua beside me as my greatest support and encouragement. Pei-Yi was my greatest knowledge base, never hesitating to help with any question, and my senior Tsung-Hsien was the best supporter, always giving me the best advice when I needed it most.

Of course, there are many more people to thank, and I keep them all in my heart. Thank you for your support and encouragement.

Midsummer 2003
Yi-Kun Tu
ISMP IIR LAB, Institute of Computer Science and Information Engineering, NCKU
Contents

Chinese Abstract
ABSTRACT
FIGURE LISTING
TABLE LISTING
CHAPTER 1 INTRODUCTION
1.1 RESEARCH MOTIVATION
1.2 THE APPROACH
1.3 THESIS ORGANIZATION
CHAPTER 2 LITERATURE REVIEW AND RELATED WORKS
2.1 SUPPORT VECTOR CLUSTERING
2.1.1 One-Class SVM
2.1.2 The Formulation of Support Vector Clustering
2.2 SEQUENTIAL MINIMAL OPTIMIZATION
2.2.1 Optimize Two Lagrange Multipliers
2.2.2 Updating After A Successful Optimization Step
2.3 DIMENSION REDUCTION
2.4 FUZZY C-MEAN
2.5 FEATURE SELECTION METHODS
2.5.1 Document Frequency Thresholding
2.5.2 Information Gain
2.5.3 Mutual Information
2.5.4 χ² Statistic
2.5.5 Term Strength
2.6 MULTI-CLASS SVMs
CHAPTER 3 TEXT CATEGORIZATION USING ONE-CLASS SVM
3.1 PROPOSED MODEL
3.2 DATA PREPROCESSING
3.2.1 Part-Of-Speech Tagger
3.2.2 Stemming
3.2.3 Stop-Word Filter
3.2.4 Feature Selection
3.3 UNSUPERVISED LEARNING
3.3.1 Support Vector Clustering
3.3.2 The Choice Of Kernel Function
3.3.3 Cluster-Finding With Depth First Searching Algorithm
3.3.4 Cluster Validation
3.3.5 One-Cluster And Time-Consuming Problem
3.4 SUPERVISED LEARNING
3.4.1 Reuters Category Construction
3.4.2 The Mapping Strategy
3.4.3 Gateway Node Classifier
3.4.4 Expert Node Classifier
CHAPTER 4 EXPERIMENT DESIGN AND ANALYSIS
4.1 THE CORPUS
4.1.1 Documents
4.1.2 File Format
4.1.3 Document Internal Tags
4.1.4 Categories
4.2 DATA REPRESENTATION ANALYSIS
4.3 THE CHOICE OF KERNEL FUNCTION
4.4 HIERARCHICAL TEXT CATEGORIZATION PERFORMANCE ANALYSIS
CHAPTER 5 CONCLUSIONS AND FUTURE WORKS
5.1 CONCLUSIONS
5.2 FUTURE WORKS
Reference
APPENDIX A Stop-word List
APPENDIX B Part-Of-Speech Tags
Figure Listing

Figure 1.1 Our Learning Processes
Figure 2.1 Different values of q give different clustering results: (a) q=1 (b) q=20 (c) q=24 (d) q=48
Figure 2.2 With q fixed, different values of ν result in different cluster shapes: (a) ν=1/l, where l is the number of data (b) ν=0.4
Figure 3.1 Our proposed text categorization model
Figure 3.2 Data Preprocessing Processes
Figure 3.3 Words with tagging
Figure 3.4 Representing text as a feature vector
Figure 3.5 The SV clustering processes [Ben-Hur et. al., 2000]
Figure 3.6 Reuters basic hierarchy [D'Alessio et. al., 2000]
Figure 3.7 Our Proposed Hierarchy
Figure 4.1 Sample news stories in the Reuters-21578 corpus
Figure 4.2 One-Class SVM with polynomial kernel where d=2
Figure 4.3 One-Class SVM with Gaussian kernel where g=100, ν=0.1
Figure 4.4 Our Proposed Hierarchical Construction
Table Listing

Table 4.1 The list of categories, sorted in decreasing order of frequency
Table 4.2 Number of Training/Testing Items
Table 4.3 F1 measure in One-Class SVM with vector dimension 10
Table 4.4 F1 measure in One-Class SVM with vector dimension 20
Table 4.5 Precision/Recall-breakeven point on the ten most frequent Reuters categories on Basic SVMs
Table 4.6 Comparison of our proposed classifier with k-NN and decision tree
CHAPTER 1
INTRODUCTION
With the rapid growth of online information, text categorization has become one of
the key techniques for handling and organizing text data. It is used to classify news stories [Hayes et. al., 1990], to find interesting information on the WWW [Lang 1995], and to guide a user's search through hypertext. Since building text classifiers by hand is difficult and very time-consuming, it is desirable to learn classifiers from examples.
The most successful paradigm to organize a large mass of information is by
categorizing the documents according to their topics. Recently, various machine learning techniques have been applied to automatic text categorization. These approaches are usually based on the vector space model, in which documents are represented by sparse vectors, with one component for each unique word extracted from the document. Typically, the document vector is very high-dimensional, at least in the thousands for large collections. This is a major stumbling block in applying many machine learning methods, so existing techniques rely heavily on dimension reduction as a preprocessing step, which is computationally expensive. More recently, a new approach has emerged with the introduction of Support Vector Machines (SVMs). This algorithm outperforms other classifiers in text categorization and can also be used as a clustering method.
Most of the computational experience discussed in the literature deals with hierarchies that are trees. Indeed, until recently, most problems discussed dealt with categorization within a simple (non-hierarchical) set of categories [Frakes et. al., 1992]. However, a few hierarchical classification methods have also been proposed recently [D'Alessio et. al., 1998 ; Wang et. al., 2001 ; Weigend et. al., 1999].
In this research, we utilize one-class SVM to perform text clustering, and use the clustering results to build a hierarchical construction. This hierarchical construction illustrates the relationships between Reuters categories.

The Reuters-21578 corpus has been studied extensively. Yang [Yang 1997a] compares 14 categorization algorithms applied to this Reuters corpus as a flat categorization problem on 135 categories. The same corpus has more recently been studied by others treating the categories as a hierarchy [Koller et. al., 1997 ; Yang 1997a ; Ng et. al., 1997]. We construct our own hierarchy through Support Vector (SV) clustering and compare the text categorization results with the state-of-the-art literature.
1.1 RESEARCH MOTIVATION
In the field of document categorization, if we use only one layer of classifiers, we usually need many training samples. This kind of model is usually too complicated and the training is not accurate enough. So we adopt a "divide and conquer" approach, partitioning a problem into many small, easily solved sub-problems. With this procedure, we can simplify our training processes and obtain a more accurate training model.

Heuristically, we know that each document can belong to multiple categories, exactly one category, or no category at all. By building the hierarchical construction between categories, whenever a new document comes in, we can easily assign it to the category (or categories) to which it belongs.

In recent years, the Support Vector Machine has outperformed other classifiers in text classification, and it has also been used to perform text clustering. We want to perform support vector clustering on a text data set and construct a hierarchical construction over the data set.
1.2 THE APPROACH
There are three stages in our proposed approach. It consists of data preprocessing,
unsupervised learning, and supervised learning.
We represent documents as "bags of words": only the presence or absence of words in a document is indicated, not their order or any higher-level linguistic structure. We can thus think of documents as high-dimensional vectors, with one slot for each word in the vocabulary and a value in each slot related to the number of times that word appeared in the document. Note that a large fraction of the total number of slots will have zero value for any given document, so the vectors are quite sparse.
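As a toy illustration of this sparse bag-of-words representation (the vocabulary and documents below are invented examples, not taken from the Reuters corpus):

```python
# Toy illustration of the sparse bag-of-words representation.
from collections import Counter

def build_vocabulary(docs):
    """Assign each unique word a fixed slot (dimension) index."""
    vocab = {}
    for doc in docs:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_sparse_vector(doc, vocab):
    """Keep only the non-zero slots, since most slots are zero."""
    counts = Counter(doc.split())
    return {vocab[w]: c for w, c in counts.items() if w in vocab}

docs = ["oil prices rise", "oil exports fall", "grain prices rise again"]
vocab = build_vocabulary(docs)
vec = to_sparse_vector(docs[0], vocab)   # sparse: 3 of 7 slots are non-zero
```

Storing only the non-zero slots is what makes the very high-dimensional document vectors practical to handle.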
The goal of the first stage is to build the training data for the second stage; a detailed description of the methods we use is given in Section 3.2.

The second stage is unsupervised learning, in which we use one-class SVM to perform support vector clustering. We explain how we achieve text clustering with this learning algorithm in Section 3.3.

In the last stage, we use two-class SVMs and multi-class SVMs to train the tree nodes and finally obtain much more accurate tree node classifiers. The following figure shows stages two and three:
Fig 1.1 Our Learning Process
1.3 THESIS ORGANIZATION
The content of this thesis is partitioned into five chapters and is organized as follows.
Chapter 1 introduces the motivation of our research.
Chapter 2 reviews related techniques such as support vector clustering, one-class SVM, sequential minimal optimization, text categorization, and feature selection methods.

Chapter 3 describes our methodologies and the system.

Chapter 4 describes the corpus we use as the test bed for our three experiments. We report our experimental results on SV clustering and text categorization, and discuss system performance by comparing with other classifiers.

Chapter 5 presents our conclusions based on the results of our experiments and suggests directions for future research.
CHAPTER 2
LITERATURE REVIEW AND RELATED WORKS
Sections 2.1 and 2.2 illustrate the unsupervised learning methods we use: SV clustering, one-class SVM, and sequential minimal optimization. Section 2.3 covers dimension reduction methods, and Section 2.4 is about Fuzzy C-mean. Section 2.5 reviews feature selection methods, and Section 2.6 is about multi-class SVMs. These topics are the major techniques in our proposed model.
2.1 SUPPORT VECTOR CLUSTERING
SV clustering was derived from the study of the one-class support vector machine (SVM) [Scholkopf et. al., 2000 ; Tax et. al., 1999]. The aim of one-class SVM is to characterize the support of a probability distribution. To do so, one needs to solve a quadratic optimization problem, whose optimal solution tells us the positions of the support vectors.

Ben-Hur et. al. generalize the support vectors as the boundary of the clusters [Ben-Hur et. al., 2000 ; Miguel 1997 ; Duda et. al., 2000]. They proposed an algorithm for representing the support of a probability distribution given by a finite data set using the support vectors. We first review the concept of one-class SVM.
2.1.1 One-Class SVM
The term one-class classification originates from Moya [Moya et. al., 1993], but the terms outlier detection [Ritter et. al., 1997], novelty detection [Bishop 1995], and concept learning [Japkowicz et. al., 1995] are also used. The different names exist because one-class classification arises in different applications. The one-class classification problem can be described as follows: one wants to develop an algorithm which returns a function $f$ that takes the value $+1$ in a small region capturing most of the data points, and $-1$ elsewhere.
Scholkopf et. al. [Scholkopf et. al., 2000] proposed two methods for solving the one-class classification problem:

a) Map the data into the feature space corresponding to the kernel function, and separate them from the origin with maximum margin. This is a two-class separation problem in which the only negative example is the origin, and all the training data are positive examples.

b) Search for the sphere of minimum volume that includes all the training data.
We review only the solution to the second problem. Suppose the data points are mapped from the input space to a high-dimensional feature space through a non-linear transformation function $\Phi$. One looks for the smallest sphere that encloses the image of the data.
Scholkopf solved the problem in the following way. Consider the smallest enclosing sphere with radius $R$; the optimization problem is:

$$\min R^2 \quad \text{such that} \quad \|\Phi(x_j) - a\|^2 \le R^2 \quad \forall j \qquad (2.1)$$

where $a$ is the center of the sphere, $\|\cdot\|$ is the Euclidean norm, and $R$ is the radius of the sphere. Soft margin constraints are incorporated by adding slack variables $\xi_j$, so the optimization problem becomes:

$$\min R^2 + C\sum_j \xi_j \quad \text{such that} \quad \|\Phi(x_j) - a\|^2 \le R^2 + \xi_j \qquad (2.2)$$

with $\xi_j \ge 0$. Introducing Lagrange multipliers, the constrained problem becomes:

$$L = R^2 - \sum_j \left(R^2 + \xi_j - \|\Phi(x_j) - a\|^2\right)\beta_j - \sum_j \xi_j u_j + C\sum_j \xi_j \qquad (2.3)$$

where $\beta_j \ge 0$ and $u_j \ge 0$ are Lagrange multipliers and $C$ is a constant. One calculates the partial derivatives of $L$ with respect to $R$, $a$, and $\xi_j$. Setting the partial derivatives to zero, one gets

$$\sum_j \beta_j = 1 \qquad (2.4)$$

$$a = \sum_j \beta_j \Phi(x_j) \qquad (2.5)$$

$$\beta_j = C - u_j \qquad (2.6)$$

The Karush-Kuhn-Tucker (KKT) conditions must hold, which results in

$$\left(R^2 + \xi_j - \|\Phi(x_j) - a\|^2\right)\beta_j = 0 \qquad (2.7)$$

$$\xi_j u_j = 0 \qquad (2.8)$$

One can then eliminate the variables $R$, $a$, and $u_j$ and turn the problem into its Wolfe dual form. The new problem is a function of the variables $\beta_j$ only:

$$W = \sum_j \Phi(x_j)^2\,\beta_j - \sum_{i,j} \beta_i \beta_j\,\Phi(x_i)\cdot\Phi(x_j) \qquad (2.9)$$

with the constraints

$$0 \le \beta_j \le C \qquad (2.10)$$

$$\sum_j \beta_j = 1 \qquad (2.11)$$
The non-linear transformation function $\Phi$ is a feature map, i.e. a map into an inner product space $F$ such that the inner product in the image of $\Phi$ can be computed by evaluating some simple kernel [Boser et. al., 1992 ; Cortes et. al., 1995 ; Scholkopf et. al., 1999]. Define

$$K(x, y) \equiv \Phi(x)\cdot\Phi(y) \qquad (2.12)$$

and the Gaussian kernel function

$$K(x, y) = \exp\left(-q\|x - y\|^2\right) \qquad (2.13)$$

The Lagrangian $W$ in (2.9) can then be written as

$$W = \sum_j K(x_j, x_j)\,\beta_j - \sum_{i,j} \beta_i \beta_j K(x_i, x_j) \qquad (2.14)$$

If one solves the above formula and obtains the optimal solution for $\beta_j$, one can calculate the distance of each point $x$ to the center of the sphere:

$$R^2(x) = \|\Phi(x) - a\|^2 \qquad (2.15)$$

By (2.5), one can rewrite this as

$$R^2(x) = K(x, x) - 2\sum_j \beta_j K(x_j, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j) \qquad (2.16)$$
The radius of the sphere is:

$$R = \{R(x_i) \mid x_i \text{ is a support vector}\} \qquad (2.17)$$

So if one solves the dual problem $W$, one can find all the SVs. Then, by calculating the distance between the SVs and the sphere center, the radius of the sphere is found.
Consider equations (2.7) and (2.8):

$$\left(R^2 + \xi_j - \|\Phi(x_j) - a\|^2\right)\beta_j = 0 \quad \text{and} \quad \xi_j u_j = 0$$

There are three types of data points:

a) If $0 < \beta_j < C$, then the corresponding $\xi_j = 0$; from equation (2.7), these data lie on the sphere surface. They are called Support Vectors (SVs).

b) If $\xi_j > 0$, the corresponding $\beta_j = C$; these data points lie outside the sphere. They are called Bounded Support Vectors (BSVs).

c) All other data, with $\beta_j = 0$, lie inside the sphere.

So, from the discussion above, one can tell whether a data point is inside, outside, or on the sphere surface by the corresponding value of $\beta_j$.
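This three-way classification by multiplier value can be sketched as a small helper (the function name, `C` argument, and tolerance are illustrative, not part of the original formulation):

```python
def point_type(beta, C, tol=1e-8):
    """Classify a data point by its optimal multiplier beta_j (cases a-c above):
    beta_j = 0       -> strictly inside the sphere;
    beta_j = C       -> outside the sphere (Bounded Support Vector);
    0 < beta_j < C   -> on the sphere surface (Support Vector)."""
    if beta < tol:
        return "inside"
    if beta > C - tol:
        return "BSV"
    return "SV"
```

The tolerance is needed in practice because a numerical QP solver returns multipliers only approximately equal to 0 or C.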
2.1.2 The Formulation of Support Vector Clustering
In the training of one-class SVM, considering the use of Gaussian kernel. We map the
data into high-dimensional feature space and find a sphere with minimum volume to
enclose the image of the data. When the sphere be mapped back to the input space,
they form different shape of contours. These contours enclose the data points, they
can be seen as cluster boundary, below we explain how to perform SV clustering.
By the use of Gaussian kernel function, the parameters that one needs to control
are the kernel width q and penalty v . One can use Fig 2.1 and Fig 2.2 [ Ben-Hur et.
10
al., 2000 ; Hao P. Y. et. al., 2002 ] to explain what they work,
Fig 2.1 Different values of $q$ give different clustering results: (a) $q=1$ (b) $q=20$ (c) $q=24$ (d) $q=48$
In Fig 2.1, one sees that as the value of $q$ increases, the contour boundaries become tighter and tighter, and there are more and more contours. This shows that the parameter $q$ controls the compactness of the enclosing contours and also their number. So one can tune the value of $q$ from small to large, or from large to small, to obtain proper results.

The influence of the parameter $\nu$ is as follows:

Fig 2.2 With $q$ fixed, different values of $\nu$ result in different cluster shapes: (a) $\nu = 1/l$, where $l$ is the number of data (b) $\nu = 0.4$
In Fig 2.2 one can see that as $\nu$ increases, more BSVs appear, because $\nu$ controls the percentage of outliers. So one now knows how to control the tightness of the contour boundaries, and one can also let some data points become outliers in order to form clearer contour boundaries.

The problem now is how to find the contours seen in Fig 2.1 and Fig 2.2. [Ben-Hur et. al., 2000 ; Ben-Hur et. al., 2001] proposed using the schema of connected components, defined through an adjacency matrix. Clusters are then defined as the connected components, and one can use any graph searching algorithm to find all of them.
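The cluster-labelling step can be sketched as follows. This is a simplified illustration, not the thesis implementation: `radius` is a toy stand-in for the trained sphere's $R(x)$ from (2.16), and adjacency is tested by sampling points on the segment between two data points, as in Ben-Hur et al.

```python
# Two points are adjacent if every sampled point on the segment between
# them stays inside the sphere (radius(y) <= R); clusters are then the
# connected components of the adjacency relation, found by DFS.

def adjacent(x1, x2, radius, R, n_samples=10):
    for t in (i / n_samples for i in range(n_samples + 1)):
        y = tuple(a + t * (b - a) for a, b in zip(x1, x2))
        if radius(y) > R:
            return False
    return True

def connected_components(points, radius, R):
    labels = [-1] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] != -1:
            continue
        stack = [i]                  # depth-first search from an unlabelled point
        labels[i] = cluster
        while stack:
            j = stack.pop()
            for k in range(len(points)):
                if labels[k] == -1 and adjacent(points[j], points[k], radius, R):
                    labels[k] = cluster
                    stack.append(k)
        cluster += 1
    return labels

# Toy "radius": distance to the nearest of two invented cluster centres.
centres = [(0.0, 0.0), (10.0, 0.0)]
radius = lambda y: min(sum((a - b) ** 2 for a, b in zip(y, c)) ** 0.5
                       for c in centres)
points = [(0.1, 0.0), (0.5, 0.2), (9.8, 0.1), (10.2, -0.1)]
labels = connected_components(points, radius, R=1.0)
```

With the real $R(x)$ of (2.16) plugged in for `radius`, this is exactly the depth-first cluster-finding used in Section 3.3.3.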
2.2 SEQUENTIAL MINIMAL OPTIMIZATION
The new SVM learning algorithm is called Sequential Minimal Optimization (SMO), proposed by John C. Platt [Platt 1998]. The algorithm solves the quadratic programming problem by breaking it into a series of smallest-possible sub-QP problems. These small sub-QP problems are solved analytically, so one does not need to perform matrix computation. Thus the training time of the algorithm scales roughly between linear and quadratic in the training set size for various test problems.

Traditionally, the SVM learning algorithm uses numeric quadratic programming as an inner loop, which takes much time, scaling between linear and cubic in the data set size. SMO goes a different way: it uses an analytic QP step, which can be partitioned into the two steps the following sections describe.
2.2.1 Optimize Two Lagrange Multipliers
In solving the QP problem, every training example corresponds to a single Lagrange multiplier, and one can judge whether a training example is a support vector by the value of its corresponding Lagrange multiplier. The task now is to optimize two Lagrange multipliers at a time.
Platt proposed the following way [Platt 1998]. Consider optimizing over $\alpha_1$ and $\alpha_2$, with all other variables fixed. The original quadratic problem

$$\min_\alpha \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j\,k(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu l}, \;\; \sum_i \alpha_i = 1 \qquad (2.18)$$

can be reduced to

$$\min_{\alpha_1, \alpha_2} \frac{1}{2}\sum_{i,j=1}^{2} \alpha_i \alpha_j K_{ij} + \sum_{i=1}^{2} \alpha_i C_i + C \qquad (2.19)$$

with $C_i = \sum_{j=3}^{l} \alpha_j K_{ij}$ and $C = \sum_{i,j=3}^{l} \alpha_i \alpha_j K_{ij}$,

subject to

$$0 \le \alpha_1, \alpha_2 \le \frac{1}{\nu l}, \quad \sum_{i=1}^{2} \alpha_i = \Delta \qquad (2.20)$$

where $\Delta = 1 - \sum_{i=3}^{l} \alpha_i$.

Since $C$ does not depend on $\alpha_1$ and $\alpha_2$, one can eliminate it and obtain the new form:

$$\min_{\alpha_2} \frac{1}{2}(\Delta - \alpha_2)^2 K_{11} + (\Delta - \alpha_2)\alpha_2 K_{12} + \frac{1}{2}\alpha_2^2 K_{22} + (\Delta - \alpha_2) C_1 + \alpha_2 C_2 \qquad (2.21)$$

with the derivative

$$-(\Delta - \alpha_2) K_{11} + (\Delta - 2\alpha_2) K_{12} + \alpha_2 K_{22} - C_1 + C_2 \qquad (2.22)$$

Setting the derivative to zero gives

$$\alpha_2 = \frac{\Delta (K_{11} - K_{12}) + C_1 - C_2}{K_{11} + K_{22} - 2K_{12}} \qquad (2.23)$$

Once $\alpha_2$ is found, we can calculate $\alpha_1$ from (2.20).

If the new point $(\alpha_1, \alpha_2)$ is outside $[0, 1/(\nu l)]$, the constrained optimum is found by projecting $\alpha_2$ from (2.23) into the region allowed by the constraints and then re-computing $\alpha_1$.

The offset $\rho$ is recomputed after every such step.
2.2.2 Updating After A Successful Optimization Step
Let $\alpha_1^*$, $\alpha_2^*$ be the values of the Lagrange parameters after the step in 2.2.1. The corresponding output is [Platt 1998]

$$O_i = \alpha_1^* K_{1i} + \alpha_2^* K_{2i} + C_i \qquad (2.24)$$

Combining this with (2.23), one obtains the update equation for $\alpha_2$ in which $\alpha_1$ no longer appears:

$$\alpha_2^* = \alpha_2 + \frac{O_1 - O_2}{K_{11} + K_{22} - 2K_{12}} \qquad (2.25)$$
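The analytic two-multiplier step of (2.20)-(2.25) can be sketched as below. The function name and values are illustrative; clipping to the box stands in for the projection step described above, and the pair's sum $\Delta$ is kept fixed.

```python
def smo_pair_update(a1, a2, K11, K12, K22, O1, O2, nu, l):
    """One analytic SMO step on (alpha_1, alpha_2), everything else fixed:
    move alpha_2 by (O1 - O2)/(K11 + K22 - 2*K12) as in (2.25), project it
    back into the box [0, 1/(nu*l)] while keeping alpha_1 + alpha_2 fixed,
    then recover alpha_1 from the sum constraint (2.20)."""
    delta = a1 + a2                         # conserved by the pairwise step
    upper = 1.0 / (nu * l)                  # box constraint 0 <= alpha <= 1/(nu*l)
    a2_new = a2 + (O1 - O2) / (K11 + K22 - 2.0 * K12)
    a2_new = max(a2_new, max(0.0, delta - upper))   # keep alpha_1 <= upper
    a2_new = min(a2_new, min(upper, delta))         # keep alpha_1 >= 0
    return delta - a2_new, a2_new
```

Because the update is a closed-form expression followed by a clip, no matrix computation is needed, which is the point of SMO.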
2.3 DIMENSION REDUCTION
The problem of dimension reduction is introduced as a way to overcome the curse of dimensionality when dealing with vector data in high-dimensional spaces, and as a modeling tool for such data [Miguel 1997].

In most applications, dimension reduction is carried out as a preprocessing step. Selecting the dimensions using principal component analysis (PCA) [Duda et. al., 2000 ; Jolliffe 1986] through singular value decomposition (SVD) [Golub et. al., 1996] is a popular approach for numerical attributes.
PCA is possibly the most widely used technique to perform dimension reduction. Consider a sample

$$\{x_i\}_{i=1}^{n} \subset R^D \qquad (2.26)$$

with mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (2.27)$$

and covariance matrix

$$\Sigma = E\{(x - \bar{x})(x - \bar{x})^T\} \qquad (2.28)$$

with spectral decomposition

$$\Sigma = U \Lambda U^T \qquad (2.29)$$

The principal component transformation

$$y = U^T (x - \bar{x}) \qquad (2.30)$$

yields a reference system in which the sample has mean 0 and diagonal covariance matrix $\Lambda$ containing the eigenvalues of $\Sigma$; the variables are now uncorrelated. One can discard the variables with small variance, i.e. project onto the subspace spanned by the first $L$ principal components, and obtain a good approximation to the original sample.
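A minimal numpy sketch of the projection in (2.26)-(2.30); the function name and sample data are illustrative:

```python
import numpy as np

def pca_project(X, L):
    """Project the rows of X onto the first L principal components."""
    x_bar = X.mean(axis=0)                   # sample mean (2.27)
    cov = np.cov(X - x_bar, rowvar=False)    # covariance matrix (2.28)
    eigvals, U = np.linalg.eigh(cov)         # spectral decomposition (2.29)
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    return (X - x_bar) @ U[:, order[:L]]     # y = U^T (x - x_bar) (2.30)

X = np.array([[2.0, 0.1], [4.0, -0.1], [6.0, 0.2], [8.0, -0.2]])
Y = pca_project(X, 1)   # keep only the dominant direction
```

Here the second coordinate carries almost no variance, so projecting onto a single component preserves nearly all of the spread in the sample.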
2.4 FUZZY C-MEAN
Clustering is one of the most fundamental issues in pattern recognition. It plays a key role in searching for structures in data. Given a finite set of data, the problem is to find several cluster centers that can properly characterize relevant classes of the set. Fuzzy C-mean is based on fuzzy c-partitions; the algorithm is as follows [Georgej et. al., 1995]:
Step 1. Let $t = 0$. Select an initial fuzzy pseudopartition $P^{(0)}$.

Step 2. Calculate the $c$ cluster centers $v_1^{(t)}, \ldots, v_c^{(t)}$ for $P^{(t)}$ and the chosen value of $m$, $m \in (1, \infty)$.

Step 3. Update $P^{(t+1)}$ by the following procedure. For each $x_k \in X$, if

$$\|x_k - v_i^{(t)}\|^2 > 0 \qquad (2.31)$$

for all $i \in N_c$, then define

$$A_i^{(t+1)}(x_k) = \left[\sum_{j=1}^{c}\left(\frac{\|x_k - v_i^{(t)}\|^2}{\|x_k - v_j^{(t)}\|^2}\right)^{\frac{1}{m-1}}\right]^{-1} \qquad (2.32)$$

If $\|x_k - v_i^{(t)}\|^2 = 0$ for some $i \in I \subseteq N_c$, then define $A_i^{(t+1)}(x_k)$ for $i \in I$ by any nonnegative real numbers satisfying

$$\sum_{i \in I} A_i^{(t+1)}(x_k) = 1 \qquad (2.33)$$

and define $A_i^{(t+1)}(x_k) = 0$ for $i \in N_c - I$.

Step 4. Compare $P^{(t)}$ and $P^{(t+1)}$. If $\|P^{(t)} - P^{(t+1)}\| \le \varepsilon$, then stop; otherwise, increase $t$ by one and return to Step 2.
The most obvious disadvantage of the FCM algorithm is that one needs to guess the number of cluster centers. In our implementation, we do know how many clusters we need, so this is not a big problem for us.
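The membership update (2.32)-(2.33) can be sketched as follows. This is a toy implementation: in the exact-hit case it simply assigns full membership to the matching center, which is one valid choice among the splits allowed by (2.33).

```python
def fcm_memberships(points, centres, m=2.0):
    """Membership update (2.32): weight each centre inversely by relative
    squared distance, fuzzified by the exponent 1/(m-1)."""
    result = []
    for x in points:
        d2 = [sum((a - b) ** 2 for a, b in zip(x, v)) for v in centres]
        if 0.0 in d2:
            # Exact hit on a centre: give it full membership (valid by (2.33)).
            row = [1.0 if d == 0.0 else 0.0 for d in d2]
        else:
            row = [1.0 / sum((d2[i] / d2[j]) ** (1.0 / (m - 1.0))
                             for j in range(len(d2)))
                   for i in range(len(d2))]
        result.append(row)
    return result

memberships = fcm_memberships([(1.0, 0.0), (3.0, 0.0)],
                              [(0.0, 0.0), (4.0, 0.0)])
```

Each row sums to one, so every point spreads a unit of membership over the cluster centers, which is exactly the fuzzy c-partition property.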
2.5 FEATURE SELECTION METHODS
In text categorization one is usually confronted with feature spaces containing 10000
dimensions and more, often exceeding the number of available training examples.
Many have noted the need for feature selection to make the use of conventional
learning methods possible, to improve generalization accuracy, and to avoid
“overfitting” [Joachims 1998].
The most popular approach to feature selection is to select a subset of the available features using methods like Document Frequency Thresholding [Yang et. al., 1997b], Information Gain, the χ² statistic [Schutze et. al., 1995], Mutual Information, and Term Strength. The most commonly used, and often most effective [Yang et. al., 1997b], method for selecting features is the information gain criterion. A short description of these methods is given below.
2.5.1 Document Frequency Thresholding
Document frequency is the number of documents in which a word occurs. In
Document Frequency Thresholding one computes the document frequency for each
word in the training corpus and removes those words whose document frequency is
less than some predetermined threshold. The basic assumption is that rare words are
either non-informative for category prediction, or not influential in global
performance. In either case removal of rare words reduces the dimensionality of the
feature space. Improvement in categorization accuracy is also possible if rare words
happen to be noise words.
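A minimal sketch of Document Frequency Thresholding (function name and sample documents are illustrative):

```python
def df_filter(docs, threshold):
    """Keep only words whose document frequency reaches the threshold."""
    df = {}
    for doc in docs:
        for w in set(doc.split()):       # count each word once per document
            df[w] = df.get(w, 0) + 1
    return {w for w, n in df.items() if n >= threshold}

kept = df_filter(["oil rises", "oil falls", "grain rises today"], 2)
```

Words that appear in only one document ("falls", "grain", "today") are dropped; the surviving words define the reduced feature space.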
17
2.5.2 Information Gain
Information Gain measures the number of bits of information obtained for
category prediction by knowing the presence or absence of a word in a document.
Let $c_1, c_2, \ldots, c_M$ denote the set of categories in the target space. The information gain of word $w$ is defined to be:

$$IG(w) = -\sum_{k=1}^{M} P(c_k)\log P(c_k) + P(w)\sum_{k=1}^{M} P(c_k|w)\log P(c_k|w) + P(\bar{w})\sum_{k=1}^{M} P(c_k|\bar{w})\log P(c_k|\bar{w}) \qquad (2.34)$$

where

$P(c_k)$: the fraction of documents in the total collection that belong to class $c_k$.

$P(w)$: the fraction of documents in which the word $w$ occurs.

$P(c_k|w)$: the fraction of documents from class $c_k$ that have at least one occurrence of word $w$.

$P(c_k|\bar{w})$: the fraction of documents from class $c_k$ that do not contain word $w$.
The information gain is computed for each word of the training set, and the
words whose information gain is less than some predetermined threshold are
removed.
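A small sketch of (2.34), estimating all probabilities as document fractions. The data layout, a list of (word-set, category) pairs, is an assumption made for illustration:

```python
from math import log2

def information_gain(labelled_docs, word):
    """IG of `word` per (2.34); probabilities are document fractions.
    `labelled_docs` is a list of (set_of_words, category) pairs."""
    n = len(labelled_docs)
    cats = {c for _, c in labelled_docs}
    p_w = sum(1 for ws, _ in labelled_docs if word in ws) / n

    def plogp(p):
        return p * log2(p) if p > 0 else 0.0

    with_w = [c for ws, c in labelled_docs if word in ws]
    without = [c for ws, c in labelled_docs if word not in ws]
    ig = 0.0
    for c in cats:
        ig -= plogp(sum(1 for _, cc in labelled_docs if cc == c) / n)
        if with_w:
            ig += p_w * plogp(with_w.count(c) / len(with_w))
        if without:
            ig += (1.0 - p_w) * plogp(without.count(c) / len(without))
    return ig
```

A word that perfectly predicts the category attains the full category entropy (one bit for two balanced categories), while an uninformative word scores near zero.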
2.5.3 Mutual Information
Mutual Information considers the two-way contingency table of a word $w$ and a category $c$. The mutual information between $w$ and $c$ is defined to be:

$$MI(w, c) = \log\frac{P(w \wedge c)}{P(w) \times P(c)} \qquad (2.35)$$

and is estimated using

$$MI(w, c) \approx \log\frac{A \times N}{(A + C) \times (A + B)} \qquad (2.36)$$

where $A$ is the number of times $w$ and $c$ co-occur, $B$ is the number of times $w$ occurs without $c$, $C$ is the number of times $c$ occurs without $w$, and $N$ is the total number of documents.
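The estimate (2.36) transcribes directly into code (the base-2 logarithm is an illustrative choice; any base only rescales the score):

```python
from math import log2

def mutual_information(A, B, C, N):
    """MI estimate (2.36): A = documents containing both w and c,
    B = w without c, C = c without w, N = total number of documents."""
    return log2((A * N) / ((A + C) * (A + B)))
```

When the observed co-occurrence matches what independence predicts, the ratio inside the logarithm is 1 and the score is zero; positive association gives a positive score.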
2.5.4 χ² Statistic

The χ² statistic measures the lack of independence between $w$ and class $c$. It is defined to be:

$$\chi^2(w, c) = \frac{N \times (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)} \qquad (2.37)$$

where $A$ is the number of times $w$ and $c$ co-occur, $B$ is the number of times $w$ occurs without $c$, $C$ is the number of times $c$ occurs without $w$, $D$ is the number of times neither $w$ nor $c$ occurs, and $N$ is still the total number of documents.

Two different measures can be computed based on the χ² statistic:

$$\chi^2_{avg}(w) = \sum_{k=1}^{M} P(c_k)\,\chi^2(w, c_k) \qquad (2.38)$$

or

$$\chi^2_{max}(w) = \max_{k=1}^{M} \chi^2(w, c_k)$$
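A direct transcription of (2.37) from the contingency counts:

```python
def chi_square(A, B, C, D):
    """Chi-square statistic (2.37) from the two-way contingency counts:
    A = w and c together, B = w without c, C = c without w, D = neither."""
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))
```

Independence (A·D equal to C·B) gives zero, and complete dependence drives the score up to N, which is why the per-category scores can be combined by (2.38) or by taking the maximum.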
2.5.5 Term Strength
Term Strength estimates word importance based on how commonly a word is
likely to appear in “closely-related” documents. It uses a training set of documents to
derive document pairs whose similarity is above a threshold. Term Strength is
computed based on the estimated conditional probability that a word occurs in the
second half of a pair of related documents given that it occurs in the first half.
Let $x$ and $y$ be an arbitrary pair of distinct but related documents, and $w$ be a word. The strength of the word is defined to be:

$$TS(w) = P(w \in y \mid w \in x) \qquad (2.39)$$
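The conditional probability (2.39) can be estimated over a list of related document pairs; the pair representation below (sets of words) is an illustrative assumption:

```python
def term_strength(word, related_pairs):
    """TS (2.39): the fraction of related pairs (x, y) with `word` in x
    for which `word` also appears in y."""
    in_x = [(x, y) for x, y in related_pairs if word in x]
    if not in_x:
        return 0.0
    return sum(1 for _, y in in_x if word in y) / len(in_x)

pairs = [({"oil", "opec"}, {"oil"}), ({"oil"}, {"gas"}), ({"wheat"}, {"oil"})]
ts = term_strength("oil", pairs)
```

Words that recur across closely related documents score high, while words specific to a single document score near zero.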
2.6 MULTI-CLASS SVMS
There are many methods for SVMs to solve multi-class classification problems. One approach is to decompose the problem into two-class classification problems. There are two ways to solve multi-class SVMs in this approach [Chih-Wei, et. al., 2002 ; Weston et. al., 1998]:

a) one-against-one classifiers.

b) one-against-the-rest classifiers.

In a), suppose there are $k$ classes to be classified. This method constructs $\frac{1}{2}k(k-1)$ SVM models. Each classifier is trained on two classes; for training on the $i$th class and the $j$th class, one solves the following two-class classification problem:
$$\min_{w^{ij},\, b^{ij},\, \xi^{ij}} \frac{1}{2}(w^{ij})^T w^{ij} + C\sum_t \xi_t^{ij} \qquad (2.40)$$

$$(w^{ij})^T \Phi(x_t) + b^{ij} \ge 1 - \xi_t^{ij}, \quad \text{if } y_t = i \qquad (2.41)$$

$$(w^{ij})^T \Phi(x_t) + b^{ij} \le -1 + \xi_t^{ij}, \quad \text{if } y_t = j \qquad (2.42)$$

$$\xi_t^{ij} \ge 0 \qquad (2.43)$$
The testing can be implemented in many ways; one of them is the so-called "Max Wins" voting strategy. If the sign of $(w^{ij})^T\Phi(x) + b^{ij}$ says that the test data is in the $i$th class, the vote for class $i$ is incremented by one; otherwise class $j$ gets the vote. One then predicts the class with the largest vote. In general, the one-against-one method takes much time to accomplish training, especially when there are many classes, but in real implementations, if one wants better performance, one has little choice but to use it.
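The "Max Wins" voting scheme can be sketched generically; the `pairwise_predict` callback stands in for the $\frac{1}{2}k(k-1)$ trained one-against-one classifiers:

```python
from itertools import combinations

def max_wins(classes, pairwise_predict, x):
    """'Max Wins' voting over the k(k-1)/2 one-against-one classifiers.
    `pairwise_predict(i, j, x)` returns the winning class (i or j)."""
    votes = {c: 0 for c in classes}
    for i, j in combinations(classes, 2):
        votes[pairwise_predict(i, j, x)] += 1
    return max(votes, key=votes.get)
```

With k classes, each prediction queries all pairwise classifiers once, which is why training and testing cost grows quadratically in k.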
In b), the one-against-the-rest method, if one has $k$ classes, one needs to train only $k$ SVM models. This method takes much less time than the one-against-one method. The $i$th class is trained with positive labels while all the other classes are trained with negative labels. So if one has $l$ training data $(x_1, y_1), \ldots, (x_l, y_l)$, where $x_i \in R^N$, $i = 1, \ldots, l$, and $y_i \in \{1, \ldots, k\}$ is the class of $x_i$, then the $i$th classifier solves the following problem:

$$\min_{w^{i},\, b^{i},\, \xi^{i}} \frac{1}{2}(w^{i})^T w^{i} + C\sum_j \xi_j^{i} \qquad (2.44)$$

$$(w^{i})^T \Phi(x_j) + b^{i} \ge 1 - \xi_j^{i}, \quad \text{if } y_j = i \qquad (2.45)$$

$$(w^{i})^T \Phi(x_j) + b^{i} \le -1 + \xi_j^{i}, \quad \text{if } y_j \ne i \qquad (2.46)$$

$$\xi_j^{i} \ge 0, \quad j = 1, \ldots, l \qquad (2.47)$$
Testing is the same as for two-class SVMs. In general, the performance is
usually worse than that of the one-against-one method.
CHAPTER 3
TEXT CATEGORIZATION USING ONE-CLASS SVM
3.1 PROPOSED MODEL
There are three main stages in our proposed model:
a) Data preprocessing stage.
b) Unsupervised learning stage.
c) Supervised learning stage.
The first stage includes Part-Of-Speech (POS) tagging, word stemming, stop-word
removal, and feature selection. Through these processes, we transform the original
raw data into normalized data that can be used in the second stage.
The second stage covers the processes of performing SV clustering: the choice of
kernel function, how clusters are found, and the strategy for judging cluster
validity. We also discuss the problems we faced and the solutions we used.
The last stage concerns the training of the internal and expert node classifiers,
which are derived from the clustering result of the second stage. We first
construct a mapping between the raw data and the cluster centers, and then train all
the classifiers with the one-against-one or one-against-the-rest method.
All the main components and procedures are illustrated in Figure 3.1 as
follows:
Figure 3.1 Our proposed text categorization model
Each component is described in detail below.
3.2 DATA PREPROCESSING
There are four main procedures in the data preprocessing stage:
Fig 3.2 Data Preprocessing Processes (News Documents → Part-Of-Speech Tagger → Stemmer → Stop-Word Filter → Feature Selection → Training Data)
3.2.1 Part-Of-Speech Tagger
In this procedure, a POS tagger [Brill 1994] is introduced to provide POS
information. Each news article is first tagged so that every word carries its
appropriate part-of-speech (POS) tag. News articles are mostly natural-language
text expressing human thought, and in this thesis we assume that the concepts
expressing that thought are mostly carried by noun keywords. The POS tagger module
therefore provides the POS tags required by the feature selection step.
Furthermore, POS tags give important information for deciding contextual
relationships between words. In Figure 3.2, the tagger passes noun words to the
next module, the stemmer. In this way, the module employs natural language
technology to help analyze news articles; consequently, it can be considered a
language model.
For natural language understanding, assigning POS tags to a sentence provides the
information needed to analyze its syntax. The tagger employed in this thesis is the
rule-based POS tagger proposed by Eric Brill in 1992, which learns lexical and
contextual rules for tagging words. The precision of Brill's tagger has been
reported to be higher than 90% [Brill 1995]. There are 37 POS tags in total, as
listed in APPENDIX B. As mentioned above, we select only nouns; the relevant noun
tags are NN, NNS, NNP and NNPS. The following are examples of words after POS
tagging.
N.10/CD S.1/CD
"I/NN think/VB it/PRP is/VBZ highly/RB unlikely/JJ that/IN
American/NNP Express/NNP is/VBZ
Fig 3.3 Words with tagging
3.2.2 Stemming
Frequently, the user specifies a word in a query but only a variant of this word is
present in a relevant document. Plurals, gerund forms, and past tense suffixes are
examples of syntactical variations which prevent a perfect match between a query
word and a respective document word [Ricardo et. al., 1999]. This problem can be
partially overcome with the substitution of the words by their respective stems.
A stem is the portion of a word that is left after the removal of its affixes. A
typical example is the word "calculate", which serves as the stem for the variants
calculation, calculating, calculated, and calculations. Stems are thought to be useful
for improving retrieval performance because they reduce variants of the same root
word to a common concept. Furthermore, stemming has the secondary effect of
reducing the size of the indexing structure because the number of distinct index terms
is reduced [Ricardo et. al., 1999].
Because most variants of a word are generated by the introduction of suffixes, and
because suffix removal is intuitive, simple, and can be implemented efficiently,
several well-known suffix-removal algorithms exist. The most popular one is
Porter's, so we use the Porter algorithm [Porter 1980] to do word
stemming.
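The full Porter algorithm applies several ordered rule phases with measure conditions; purely as an illustration of the idea of suffix stripping, here is a drastically simplified sketch whose ad hoc rules are not Porter's:

```python
def mini_stem(word):
    """Toy suffix stripper: remove the longest matching suffix.

    Illustrative only; the real Porter stemmer [Porter 1980] applies
    ordered rule phases with conditions on the remaining stem.
    """
    for suffix in ("ations", "ation", "ating", "ated", "ate", "ing", "ed", "s"):
        # Require a remaining stem of at least 3 letters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

variants = ["calculate", "calculation", "calculating", "calculated", "calculations"]
stems = {w: mini_stem(w) for w in variants}
print(stems)  # all five variants reduce to the common stem "calcul"
```

Reducing all variants to one stem is exactly the conflation effect described above: distinct index terms collapse into a single term.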
3.2.3 Stop-Word Filter
Words that are too frequent among the documents in the collection are not good
discriminators. In fact, a word that occurs in 80% of the documents in the
collection is useless for retrieval purposes. Such words are frequently referred to
as stop-words and are normally filtered out as potential index terms. Articles,
prepositions, and conjunctions are natural candidates for a list of stop-words.
Elimination of stop-words has an additional important benefit. It reduces the size
of the indexing structure considerably. In fact, it is typical to obtain compression in
the size of the indexing structure of 40% or more solely with the elimination of
stop-words [Ricardo et. al., 1999].
Since stop-word elimination also compresses the indexing structure, the list of
stop-words may be extended to include words other than articles, prepositions, and
conjunctions; for example, some verbs, adverbs, and adjectives could be treated as
stop-words. In this thesis, a list of 306 stop-words is used; the detailed list can
be found in the appendix of this thesis.

The stop-word filter takes noun words as input. A few of these nouns contribute
little to what the author wants to express in the document; they are only auxiliary
words that complete the natural language text. We call them stop words, and for
this reason they must be filtered out to keep noise out of the analysis.

Even after the stop words are filtered, the remaining nouns cannot immediately be
assumed to be fully related to what the author wants to express. From common
writing habits, it is believed that a word whose frequency of occurrence is too low
or too high is not important or representative.
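The filtering step itself is a simple set lookup; a minimal sketch follows, with a tiny illustrative stop list standing in for the 306-word list used in the thesis:

```python
# Illustrative stop list; the thesis uses a 306-word list (see appendix).
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "is", "to"}

def filter_stop_words(tokens):
    """Drop stop words (case-insensitive) from a token stream."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["the", "American", "Express", "earnings", "in", "a", "quarter"]
print(filter_stop_words(tokens))  # → ['American', 'Express', 'earnings', 'quarter']
```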
3.2.4 Feature Selection
In many supervised learning problems, feature selection is important for a variety of
reasons: generalization performance, running time requirements, and interpretational
issues imposed by the problem itself.
One approach to feature selection is to select a subset of the available features.
This small feature subset still retains the essential information of the original
attributes. There are some criteria [Meisel 1972]:
(1) low dimensionality
(2) retention of sufficient information
(3) enhancement of distance in pattern space as a measure of the similarity of
physical patterns, and
(4) consistency of features throughout the sample.
Our test bed is the Reuters data set; a complete description is given in Section
4.1. We choose features for each category and use them to represent each document
in the vector space model of the information retrieval field. The feature selection
method we adopt is frequency-based: the so-called TF-IDF weighting,

w_{t,d} = ( tf_{t,d} / max_{t'} tf_{t',d} ) × log( N / n_t )    (3.1)

where tf_{t,d} is the number of times the word t occurs in document d, n_t is the
number of documents in which the word t occurs, and N is the total number of
documents.
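Equation (3.1) can be written out directly; the counts below are toy corpus statistics, not the Reuters features:

```python
import math

def tfidf(tf_td, max_tf_d, n_t, N):
    """Weight of term t in document d per Eq. (3.1):
    normalized term frequency times inverse document frequency
    (natural logarithm used here; the base is a free choice)."""
    return (tf_td / max_tf_d) * math.log(N / n_t)

# Toy statistics: the term occurs 3 times in d, the most frequent term
# in d occurs 6 times, and the term appears in 10 of 1000 documents.
w = tfidf(tf_td=3, max_tf_d=6, n_t=10, N=1000)
print(round(w, 4))  # → 2.3026
```

Rare terms (small n_t) get large weights, while a term appearing in every document gets weight log(1) = 0, which matches the stop-word discussion above.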
From Section 3.2.1 to Section 3.2.4, we perform the preprocessing steps. The
original text document is now represented as a vector, as the following figure
shows.
Fig 3.4 Representing text as a feature vector.
These vectors are all m×1 dimensional, where m is the total number of features we
select for each category. We then use them as the training data in the
unsupervised learning stage.
3.3 UNSUPERVISED LEARNING
In this stage, the aim is to construct the hierarchical news categories. To perform
SV clustering, we first choose a kernel function in order to map the training data
into a high-dimensional feature space. By tuning the main parameters, we generate
connected components of different shapes and numbers. The task is then to find all
the connected components with a search algorithm [Ellis et. al., 1995]; we use an
adjacency matrix together with the Depth-First Search (DFS) algorithm. After
finding the connected components, we check cluster validity; the strategy we use is
discussed in Section 3.3.4.

In our experience, finding the connected components is time-consuming. To address
this, we sample and reduce the dimensionality of the raw data: sampling is done
with Fuzzy C-Means [Georgej et. al., 1995] and dimension reduction with Principal
Component Analysis [Jolliffe 1986]; these are discussed later. In the following
sections we describe the SV clustering procedure and the approaches we use to
solve the problems we face.
3.3.1 Support Vector Clustering
The use of SV clustering helps us construct a Reuters hierarchy. The learning
algorithm we use is the one proposed by Scholkopf [Scholkopf et. al., 2000]. In
solving the QP problem, the optimization is performed with Sequential Minimal
Optimization (SMO) [Platt 1998]. As already mentioned, the training time of this
algorithm scales roughly between linearly and quadratically with the size of the
training data set. It is much faster than the other existing learning algorithms,
and the SMO learning algorithm can easily be modified to fit our one-class SVM.
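As a minimal sketch of fitting a one-class SVM with a Gaussian kernel, the following uses scikit-learn's OneClassSVM as a stand-in for our SMO implementation; the data and parameter values are illustrative only:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=0.3, size=(200, 2))  # one compact blob

# nu bounds the fraction of outliers/BSVs; gamma plays the role of q.
ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma=2.0).fit(X)

inlier_rate = (ocsvm.predict(X) == 1).mean()
print(f"fraction of training points inside the sphere: {inlier_rate:.2f}")
```

With nu = 0.1, at most roughly 10% of the training points can fall outside the enclosing sphere, mirroring the role of ν in the formulation above.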
We now turn to the procedure of performing SV clustering. As mentioned in Section
2.1.2, in performing SV clustering we have to choose proper values of q and ν. The
choice of q determines the compactness of the enclosing sphere and with it the
number of clusters; the choice of ν helps us deal with overlapping
clusters.
The SV clustering processes are as follows:
Fig 3.5 The SV clustering process [Ben-Hur et. al., 2000]. Given the unlabeled
data set X = {x_1, x_2, ..., x_n} ⊆ R^d, choose a kernel function and fix ν;
starting from q = 0, increase q and, for each (q, ν), use the adjacency matrix and
DFS to find all connected components. If at least two clusters exist and they pass
the cluster validity check, stop; if q is exhausted without a valid clustering,
fix q, increase ν, and repeat.
We explain the above procedures as follows:
3.3.2 The Choice of Kernel Function
In 1992, Vapnik and colleagues [Boser et. al., 1992] showed that the order of
operations for constructing a decision function can be interchanged: instead of
making a non-linear transformation of the input vectors followed by dot products
with SVs in feature space, one can first compare two vectors in input space and
then make a non-linear transformation of the resulting value.
Commonly used kernel functions are as follows:

a) Gaussian RBF kernel: K(x, y) = exp( −q ‖x − y‖² )    (3.2)

b) Polynomial kernel: K(x, y) = (x · y + 1)^d    (3.3)

c) Sigmoid kernel: K(x, y) = tanh( x · y − θ )    (3.4)

We use only the Gaussian kernel, since other kernel functions, such as the
polynomial kernel, do not yield tight contour representations of a cluster
[Tax et. al., 1999]; we will show that the Gaussian kernel is indeed the best
choice for SV clustering in Section 4.3.
3.3.3 Cluster-Finding with Depth First Searching Algorithm
We use graph theory to explain the clustering result: every enclosing sphere is a
connected component, and data points in the same connected component are adjacent.
The task is to find all the connected components.

Define an adjacency matrix A_ij between pairs of points x_i and x_j:

A_ij = 1 if R(y) ≤ R for all y on the line segment connecting x_i and x_j;
A_ij = 0 otherwise.    (3.5)

This tells us whether two data points are adjacent; we still need to group all
mutually adjacent data points into connected components. The algorithm we adopt is
the Depth-First Search (DFS) algorithm. Since every training data point, even a
BSV, belongs to some connected component, DFS can find the connected component to
which each data point belongs.
The connected-components and DFS algorithms are as follows [黃曲江 1989 ;
Ellis 1995]:
procedure ConnectedComponents(adjacencyList: HeaderList; n: integer);
var
  mark: array[VertexType] of integer;
  { Each vertex will be marked with the number of the component it is in. }
  v: VertexType;
  componentNumber: integer;

  procedure DFS(v: VertexType);
  { Does a depth-first search beginning at the vertex v. }
  var
    w: VertexType;
    ptr: NodePointer;
  begin
    mark[v] := componentNumber;
    ptr := adjacencyList[v];
    while ptr <> nil do
    begin
      w := ptr^.vertex;
      output(v, w);
      if mark[w] = 0 then DFS(w);
      ptr := ptr^.link
    end {while}
  end; {DFS}

begin {ConnectedComponents}
  { Initialize mark array. }
  for v := 1 to n do mark[v] := 0;
  { Find and number the connected components. }
  componentNumber := 0;
  for v := 1 to n do
    if mark[v] = 0 then
    begin
      componentNumber := componentNumber + 1;
      output heading for a new component;
      DFS(v)
    end { if v was unmarked }
end {ConnectedComponents}
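The same component labeling can be sketched in Python directly over the adjacency matrix A of Eq. (3.5); the matrix below is a hypothetical example, not an actual clustering result:

```python
def connected_components(adj):
    """Label each vertex with its component number via depth-first search.

    adj is a symmetric 0/1 adjacency matrix (list of lists);
    returns mark[v] = component number of vertex v (numbered from 1).
    """
    n = len(adj)
    mark = [0] * n          # 0 means "not yet visited"
    component = 0
    for v in range(n):
        if mark[v] == 0:
            component += 1
            stack = [v]     # explicit stack instead of recursion
            while stack:
                u = stack.pop()
                if mark[u] == 0:
                    mark[u] = component
                    stack.extend(w for w in range(n)
                                 if adj[u][w] and mark[w] == 0)
    return mark

# Two clusters: {0, 1, 2} and {3, 4}.
A = [[1, 1, 0, 0, 0],
     [1, 1, 1, 0, 0],
     [0, 1, 1, 0, 0],
     [0, 0, 0, 1, 1],
     [0, 0, 0, 1, 1]]
print(connected_components(A))  # → [1, 1, 1, 2, 2]
```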
3.3.4 Cluster Validation
When should the clustering procedure stop? It is natural to use the number of SVs
as an indication of a meaningful solution [Ben-Hur et. al., 2000 ; Ben-Hur et. al.,
2001]. We start with a fixed ν and some value of q, then slowly increase q. The
cluster boundaries become tighter and tighter, more clusters are formed, and the
percentage of SVs increases. When this percentage becomes too high, it is time to
stop the clustering process; in general, the percentage of SVs in the training
data set is about 10%.

If no connected components are found over many values of q, we should increase ν
in order to break the overlapping boundaries. In doing so, many data points lying
in the overlapping boundaries are forced to become so-called Bounded SVs (BSVs),
which are not included in the connected components.

Through all these processes we can construct the complete Reuters hierarchical
categories. We will show the advantage of this hierarchy over the basic flat
Reuters classification in Section 4.4.
Below we address two problems: the first is that the clustering may always yield
only one connected component for our data set; the second is that finding connected
components is so time-consuming that we cannot afford it.
3.3.5 One-Cluster and Time-Consuming Problems
We face two problems in our proposed model:
(1) The clustering result may always tell us there is only one connected
component for our training data set.
(2) The clustering process is time-consuming.
For the first problem, our strategy is to perform dimension reduction in order to
see the influence of the dimensionality on the clustering result; we use PCA for
this purpose.
The second problem can be addressed by sampling. Suppose the training data set is
X,

X = {x_1, x_2, ..., x_n} ⊆ R^d    (3.6)

Building the adjacency matrix takes O(n²m) time, where n is the number of training
data points and m is the partition number in each loop (the number of points
sampled along each connecting line segment).
We first find cluster centers for each category by FCM and use all the cluster
centers as our new training data. We also use SMO to solve our QP problem in
order to reduce training time.
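The PCA step above can be sketched with a plain SVD; the data below is a toy three-dimensional set that varies mostly along one direction, not the document vectors used in the thesis:

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                  # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                     # scores in the reduced space

rng = np.random.default_rng(1)
# 100 points in R^3 lying almost on a line, plus a little noise.
t = rng.normal(size=(100, 1))
X = np.hstack([t, 2 * t, 0.5 * t]) + 0.01 * rng.normal(size=(100, 3))

Z = pca_reduce(X, k=1)
print(Z.shape)  # → (100, 1)
```

Running the clustering on Z instead of X shrinks both the vector dimension and the cost of the adjacency-matrix construction.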
3.4 SUPERVISED LEARNING
In this stage we train our gateway node classifiers and expert node classifiers.
In our hierarchy, every gateway node or expert node is a two-class or multi-class
SVM classifier. We first look at the construction of the Reuters categories.
3.4.1 REUTERS CATEGORY CONSTRUCTION
D’Alessio organized the 135 Reuters categories into a 2-level hierarchy, as
summarized in Fig 3.6, which includes counts for selected individual leaf
categories and summaries by upper-level supercategories [D’Alessio et. al., 2000].

Fig 3.6 Reuters basic hierarchy [D’Alessio et. al., 2000]

This is the flat construction of the Reuters categories: there is no structure
defining the relationships among them. We now discuss the motivation for using a
hierarchy.
There are two strong motivations for taking the hierarchy into account. First,
experience to date has demonstrated that both precision and recall decrease as the
number of categories increases [ Yang 1997a ; Apte et. al., 1994 ]. One of the reasons
for this is that as the scope of the corpus increases, terms become increasingly
polysemous. This is particularly evident for acronyms, which are often limited by
the number of 3- and 4-letter combinations and are reused from one domain to
another.
The second motivation for doing categorization within a hierarchical setting is
that it affords the ability to deal with very large problems. As the number of
categories grows, the need for domain-specific vocabulary grows as well. Thus, we
quickly reach the point where the index no longer fits in memory and we are trading
accuracy against speed and software complexity. On the other hand, by treating the
problem hierarchically, we can decompose it into several problems, each involving a
smaller number of categories and a smaller domain-specific vocabulary, and perhaps
yield savings of several orders of magnitude [D’Alessio et. al., 1998].
The hierarchical construction of Reuters categories we build through the use of
SV clustering is as follows:
Fig 3.7 Our Proposed Hierarchy
As mentioned in Section 2.6, there are two ways to train a multi-class SVM
classifier: the one-against-one method and the one-against-the-rest method. We use
both in training the gateway node classifiers and expert node classifiers. We now
address how the node classifiers in the hierarchical categories are
trained.
3.4.2 The Mapping Strategy
When the sampling strategy is used, cluster centers represent the original category
data, so we must build a mapping between the raw data and the category centers. We
propose two methods to solve this problem:

Method 1: Direct-Map
Suppose there are l training data points

x_1, ..., x_l    (3.7)

and m centers,

c_1, ..., c_m    (3.8)

For every x_i, i = 1, ..., l, we calculate the Euclidean distance between the point
x_i and each center c_j, j = 1, ..., m. Define the squared distance:

d_ij = Σ_t ( x_it − c_jt )²    (3.9)

and then find the minimum over the centers for each i:

min_j d_ij    (3.10)

This builds a mapping between the raw data and the centers.
Method 2: Without-Map
Since we already know what the hierarchy looks like, we can directly use the raw
data to build the tree. In training the gateway node and expert node classifiers,
we solve multi-class and two-class classification
problems.
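The Direct-Map rule of Eqs. (3.9)–(3.10) amounts to a nearest-center assignment; a small sketch with toy points and centers:

```python
import numpy as np

def direct_map(X, C):
    """Map each row of X to the index of its nearest center in C,
    using the squared Euclidean distance of Eq. (3.9)."""
    # d[i, j] = ||x_i - c_j||^2, computed for all pairs at once
    d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)             # Eq. (3.10): nearest center per point

X = np.array([[0.1, 0.0], [0.9, 1.1], [0.2, 0.1]])   # toy raw data
C = np.array([[0.0, 0.0], [1.0, 1.0]])               # toy cluster centers
print(direct_map(X, C))  # → [0 1 0]
```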
3.4.3 Gateway Node Classifier
We try two ways to train the gateway node classifiers. The first combines SV
clustering with the Direct-Map strategy; the second trains a two-class or
multi-class SVM classifier directly. With the Direct-Map strategy, the result is
much closer to our original idea, but for convenience in building all the tree node
classifiers, we use both methods in order to obtain better
results.

The gateway node classifier controls where the test data goes next: according to
its output, the test data is routed to the corresponding predefined node.
3.4.4 Expert Node Classifier
The expert node classifiers are two-class or multi-class SVMs. The training of a
two-class SVM is standard, and we use two-class SVMs to implement the multi-class
SVMs.

An expert node classifier tells us which category the test data belongs to. In
Figure 3.7 we can see that the expert node classifier “corporate” has to
distinguish 2 categories, so this node classifier is a two-class SVM, whereas the
expert node “commodity” has to distinguish 53 categories, so this node classifier
is a multi-class SVM.

The categories in each expert node classifier are different, so the training times
also differ. The one-against-the-rest training method would save us much time but
may lead to poor testing results, so we would rather adopt the one-against-one
training method in order to gain better results.
CHAPTER 4
EXPERIMENT DESIGN AND ANALYSIS
4.1 The Corpus
In our experiments, we used Reuters-21578, Distribution 1.0, which comprises
21,578 documents, representing what remains of the original Reuters-22173 corpus
after the elimination of 595 duplicates by Steve Lynch and David Lewis in 1996
[Lewis 1996]. David Lewis made it publicly available for text categorization
research [Lewis 1992b]. Several researchers have used it for their experiments, and
it has the potential to become a standard test bed. In the following sections, we
describe the corpus at the level of documents and categories and present our
hierarchical architecture of the categories in the data set.
4.1.1 Documents
The Reuters-21578 collection is distributed in 22 files. Each of the first 21 files
(reut2-000.sgm through reut2-020.sgm) contains 1000 documents, while the last
(reut2-021.sgm) contains 578 documents. The size of the corpus is 28,329,363 bytes,
yielding an average document size of 1,312 bytes per document. The documents are
“categorized” along five axes: EXCHANGES, ORGANIZATIONS, PEOPLE, PLACES, and
TOPICS. We consider only the categorization along the TOPICS axis. Close to half
of the documents (10,211) have no topic, and we do not include these documents in
either the training or the testing sets. Unlike Lewis (acting for consistency with
earlier studies), the documents that we consider no-category are those that have no
categories listed between the topic tags in the Reuters-21578 corpus’ documents.
This leaves 11,367 documents with one or more topics. Most of these documents
(9,494) have only a single topic.
4.1.2 File Format
Each of the 22 files begins with a document type declaration line:
<!DOCTYPE lewis SYSTEM "lewis.dtd">
Each article starts with an "open tag" of the form
<REUTERS TOPICS=?? LEWISSPLIT=?? CGISPLIT=?? OLDID=?? NEWID=??>
where the ?? are filled in an appropriate fashion. Each article ends with a "close tag"
of the form:
</REUTERS>
Each REUTERS tag contains explicit specifications of the values of five
attributes, TOPICS, LEWISSPLIT, CGISPLIT, OLDID, and NEWID. These attributes
are meant to identify documents and groups of documents. The values of the attributes
determine how the documents are divided into a training set and a testing set. In the
experiments described in this thesis, we use the Modified Lewis Split, which is the
one most used in the literature. This split is achieved by the following choice of
parameters:
Training Set (13,625 docs): LEWISSPLIT="TRAIN"; TOPICS="YES" or "NO"
Test Set ( 6,188 docs) : LEWISSPLIT="TEST"; TOPICS="YES" or "NO"
Unused ( 1,765 docs) : LEWISSPLIT="NOT-USED" or TOPICS="BYPASS"
The attributes CGISPLIT, OLDID, and NEWID were ignored in our experiments.
In our experiments, we used 9603 documents for training, and 3299
documents for testing.
4.1.3 Document Internal Tags
Just as the <REUTERS> and </REUTERS> tags serve to delimit documents
within a file, other tags are used to delimit elements within a document. There are
<DATE>, <MKNOTE>, <TOPICS>, <PLACES>, <PEOPLE>, <ORGS>, <EXCHANGES>,
<COMPANIES>, <UNKNOWN>, <TEXT>. Of these, only <TOPICS> and
<TEXT> were used in our experiments.
<TOPICS>, </TOPICS>
These tags enclose the list of TOPICS categories, if any, for the document. If
TOPICS categories are present, each is delimited by the tags <D> and </D>.
<TEXT>, </TEXT>
All the textual material of each story is delimited between a pair of these tags.
Some control characters and other "junk" material may also be included. The white
space structure of the text has been preserved. The following tags optionally delimit
elements inside the TEXT elements: <AUTHOR>, </AUTHOR>, <DATELINE>,
</DATELINE>, <TITLE>, </TITLE>, <BODY>, </BODY>. Of these, we use only
the last two, which enclose the main text of the documents.
A few sample documents are shown in Figure 4.1. The average length of the
stories is 90.6 words, but some are as short as a sentence and others as long as several
pages.
Figure 4.1 Sample news stories in the Reuters-21578 corpus.
4.1.4 Categories
There are 135 different categories defined for the Reuters data; 96 of them occur
in our training set. The number of categories assigned to each document ranges
from 1 to 14, but the average is only 1.24 categories per document. The frequency
of occurrence varies greatly from category to category. For example, earn appears
in 3987 documents, while 78 categories (i.e., more than 50%) have each been
assigned to fewer than 10 documents. In fact, 15 categories have not been assigned
to any documents. Table 4.1 lists the categories sorted in decreasing order of
frequency (the number of documents with that category).
Table 4.1 The list of categories, sorted in decreasing order of frequency.
4.2 DOCUMENT REPRESENTATION ANALYSIS
We consider the impact of document representation in one-class SVM. In our study
of one-class SVM we found an interesting phenomenon: the choice of data
representation for one-class SVM is somewhat different from that for two-class or
multi-class SVMs. We illustrate the phenomenon as follows.

In the information retrieval field, it is generally agreed that a frequency
representation of the documents is best in the vector space model. In [Manevitz
et. al., 2001], however, Manevitz found that the binary representation achieves
better results than other representations. Since this conflicts with common sense,
we decided to perform one-class classification on the ten most frequent Reuters
categories and compare our results with those of that paper.

Following [Manevitz et. al., 2001], 25% of the data is used for training and the
rest for prediction. The following table shows the data set we use.
Table 4.2 Number of Training/Test Items

Category Name    Num Train    Num Test
Earn             966          2902
Acquisitions     590          1773
Money-fx         187          563
Grain            151          456
Crude            155          465
Trade            133          401
Interest         123          370
Ship             65           195
Wheat            62           186
Corn             62           184
We compare two different data representations: the binary representation and the
tf-idf representation. The feature dimension is fixed at 10 and at 20 by
performing PCA on the original feature set, and the Gaussian kernel function is
used.
According to the text categorization survey by Sebastiani [Sebastiani 2001], the
most commonly used performance measures in flat classification are the classic
information retrieval notions of precision and recall. Neither precision nor
recall can effectively measure classification performance in isolation
[Sebastiani 2001]. Therefore, the performance of text categorization has often been
measured by a combination of the two measures; the most popular is the F_β measure.

The F_β measure was proposed by Rijsbergen [Rijsbergen 1979]. It is a single score
computed from the precision and recall values according to the user-defined
relative importance (i.e., β) of precision and recall. Normally β = 1 is used
[Sebastiani 2001]. Precision, recall, and the F_1 measure are defined as:

Precision = (number of items of the category identified) / (total items assigned to the category)    (4.1)

Recall = (number of items of the category identified) / (number of category members in the test set)    (4.2)

F_1(Precision, Recall) = (2 · Precision · Recall) / (Precision + Recall)    (4.3)
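Equations (4.1)–(4.3) can be computed for a single category as follows; the item sets below are toy predictions, not our experimental data:

```python
def precision_recall_f1(assigned, relevant):
    """Compute Eqs. (4.1)-(4.3) for one category.

    assigned: set of items the classifier put in the category;
    relevant: set of items that truly belong to it.
    """
    tp = len(assigned & relevant)                     # correctly identified
    precision = tp / len(assigned) if assigned else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = precision_recall_f1(assigned={1, 2, 3, 4}, relevant={2, 3, 4, 5, 6})
print(p, r, round(f, 3))  # → 0.75 0.6 0.667
```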
The following two tables show the results: Table 4.3 for data with 10 features and
Table 4.4 for data with 20 features, each comparing the binary and tf-idf
representations. We see that the binary representation achieves better
performance than the tf-idf representation for both the 10-feature and the
20-feature data.
Table 4.3 F_1 measure in one-class SVM with vector dimension 10

              Our results
              binary   tf-idf   Linear   Sigmoid   Polynomial   Rbf
earn          0.676    0.702    0.409    0.676     0.982        0.890
acq           0.483    0.481    0.185    0.482     0.991        0.897
money-fx      0.541    0.516    0.074    0.514     0.960        0.860
grain         0.585    0.533    0.084    0.585     0.970        0.883
crude         0.545    0.532    0.441    0.544     0.920        0.867
trade         0.445    0.476    0.363    0.597     0.965        0.880
interest      0.473    0.454    0.145    0.485     0.918        0.894
ship          0.563    0.518    0.025    0.539     0.917        0.847
wheat         0.474    0.450    0.619    0.474     0.956        0.825
corn          0.293    0.339    0.036    0.298     0.835        0.809
Table 4.4 F_1 measure in one-class SVM with vector dimension 20

              Our results
              binary   tf-idf   Linear   Sigmoid   Polynomial   Rbf
earn          0.652    0.686    0.678    0.321     0.973        0.891
acq           0.488    0.489    0.491    0.194     0.988        0.905
money-fx      0.487    0.494    0.503    0.084     0.956        0.901
grain         0.504    0.504    0.487    0.071     0.974        0.831
crude         0.496    0.496    0.485    0.111     0.912        0.905
trade         0.441    0.441    0.483    0.239     0.960        0.927
interest      0.440    0.425    0.425    0.092     0.915        0.907
ship          0.220    0.219    0.310    0.025     0.913        0.813
wheat         0.449    0.449    0.420    0.097     0.926        0.860
corn          0.376    0.376    0.352    0.029     0.820        0.730
We explain why this happens as follows. Consider X ⊆ R^n, X = {x_1, x_2, ..., x_n}.
For any two vectors x_i, x_j, i, j ∈ {1, 2, ..., n}, the distance between x_i and
x_j in the feature space is

‖Φ(x_i) − Φ(x_j)‖ = sqrt( Φ(x_i)·Φ(x_i) − 2 Φ(x_i)·Φ(x_j) + Φ(x_j)·Φ(x_j) )
                  = sqrt( 2 − 2 exp( −q ‖x_i − x_j‖² ) )    (4.4)

so the distance between x_i and x_j in the feature space increases monotonically
with their distance in the input space.

Now consider the corresponding binary representations of x_i and x_j, say y_i and
y_j, where

y_ik = 1 if x_ik > 0,    (4.5)
y_ik = 0 if x_ik = 0,    (4.6)

and k indexes the features. Then

0 < ‖y_i − y_j‖ ≤ ‖x_i − x_j‖    (4.7)

so the tf-idf representation results in a longer distance between the two vectors
in the feature space than the binary representation. Now consider the original
quadratic program in one-class SVM:

min_{R ∈ R, ξ ∈ R^l, c ∈ F}  R² + (1/(νl)) Σ_i ξ_i    (4.8)

subject to ‖Φ(x_i) − c‖² ≤ R² + ξ_i,  ξ_i ≥ 0 for i ∈ [l]

The goal is to minimize the volume of the enclosing sphere. Since the distance
between any two vectors in the feature space under the Gaussian kernel is at most
√2, and since, as shown above, the tf-idf representations of two input vectors are
farther apart in the feature space than their binary representations, vectors under
the binary representation are much more compact than under the tf-idf
representation. This makes the one-class SVM classification easier, so the
classification result is better than with the tf-idf representation. This is why
the performance of the binary representation is better than that of the tf-idf
representation in our experiments.
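The compactness argument of Eq. (4.4) can be checked numerically: under the Gaussian kernel, the feature-space distance is sqrt(2 − 2·exp(−q‖x − y‖²)), which grows with the input-space distance and is bounded by √2. The vectors below are hypothetical tf-idf values and their binary counterparts:

```python
import numpy as np

def feature_distance(x, y, q):
    """||Phi(x) - Phi(y)|| for the Gaussian kernel, per Eq. (4.4)."""
    d2 = np.sum((np.asarray(x) - np.asarray(y)) ** 2)
    return np.sqrt(2.0 - 2.0 * np.exp(-q * d2))

q = 1.0
tfidf_x, tfidf_y = [2.3, 0.0, 1.7], [0.0, 1.2, 1.7]   # example tf-idf vectors
bin_x,   bin_y   = [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]   # their binary versions

d_tfidf = feature_distance(tfidf_x, tfidf_y, q)
d_bin = feature_distance(bin_x, bin_y, q)
# tf-idf pair is farther apart in feature space; both distances stay below sqrt(2)
print(d_tfidf > d_bin, d_tfidf < np.sqrt(2), d_bin < np.sqrt(2))
```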
4.3 THE CHOICE OF KERNEL FUNCTION
To verify that only the Gaussian kernel function works well for SV clustering, we
first compare different kernel functions on the Iris data set [Blake et. al.,
1998]. The data set contains 150 instances, each composed of four measurements of
an iris flower. There are three types of flowers, represented by 50 instances
each. We use PCA for dimension reduction and choose the first and second principal
components as our training set.

For convenient visualization, we use LIBSVM 2.0 to perform the experiments. This
is an integrated tool for support vector classification and regression that can
handle one-class SVM using Scholkopf et al.'s algorithm. LIBSVM 2.0 is available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm.
We investigate the two most important and frequently used kernel functions, the
polynomial kernel and the Gaussian kernel:

(i) Polynomial kernel: K(x, y) = (x · y + 1)^d    (4.9)

(ii) Radial basis function (Gaussian kernel): K(x, y) = exp( −q ‖x − y‖² )    (4.10)
The experimental results are as follows:

Fig 4.2 One-Class SVM with polynomial kernel where d = 2.

Fig 4.3 One-Class SVM with Gaussian kernel where q = 100, ν = 0.1.

We see that only the Gaussian kernel forms a tight enclosing sphere for the data
in the original space. We now explain this result by examining the two kernel
functions.
The polynomial kernel function is

    K(x_i, x_j) = (x_i · x_j + 1)^d,    (4.11)

for any two vectors x_i, x_j, i, j = 1, ..., m, where m is the number of training vectors. Raising the inner product to a power n gives

    (x_i · x_j)^n = cos^n(θ_ij) ‖x_i‖^n ‖x_j‖^n,    (4.12)

where θ_ij is the angle between x_i and x_j. For data points that are not centered around the origin, this angle is small, the value of cos^n(θ_ij) tends to 1, and so (x_i · x_j)^n ≈ ‖x_i‖^n ‖x_j‖^n. The kernel thus stretches the data in the feature space according to the vector norms, making it hard to form a proper enclosing sphere. In Fig 4.2 we can see that the polynomial kernel is not suitable for SV clustering.
In Fig 4.3 we find that the Gaussian kernel does the best job for SV clustering: the enclosing sphere is tighter. We conclude that the Gaussian kernel is the best choice for SV clustering, and we use it in our experiments.
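The contrast between the two kernels can be checked numerically; the sample points below are arbitrary:

```python
import numpy as np

def poly_kernel(x, y, d=2):
    """Polynomial kernel of Eq. (4.9)."""
    return (np.dot(x, y) + 1) ** d

def gaussian_kernel(x, y, q=1.0):
    """Gaussian kernel of Eq. (4.10)."""
    return np.exp(-q * np.linalg.norm(x - y) ** 2)

x = np.array([3.0, 4.0])
y = np.array([3.5, 4.5])   # two nearby points far from the origin

# The polynomial kernel value is dominated by the vector norms ...
print(poly_kernel(x, y))       # (3*3.5 + 4*4.5 + 1)^2 = 870.25
# ... while the Gaussian kernel depends only on the distance and
# always stays in (0, 1], so the feature-space image is not stretched.
print(gaussian_kernel(x, y))   # exp(-0.5) ≈ 0.6065
```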
4.4 HIERARCHICAL TEXT CATEGORIZATION PERFORMANCE ANALYSIS
The hierarchical construction we build is a category tree. Whenever a document comes in, it is first classified at the root level of the category hierarchy into positive or negative categories. The classification is then repeated in each of the subcategories until the document reaches some internal or leaf category.
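The top-down routing described above can be sketched as a recursive walk over a category tree; the node class, the keyword-based classifiers, and the toy tree are all hypothetical stand-ins for the trained SVMs:

```python
# A minimal sketch of top-down routing in a category tree. Each node holds
# a (hypothetical) classifier deciding whether a document belongs to it.
class CategoryNode:
    def __init__(self, name, accepts, children=()):
        self.name = name
        self.accepts = accepts          # callable: document -> bool
        self.children = list(children)

def route(node, doc, path=None):
    """Push the document down every accepting branch until a leaf
    (or an internal node with no accepting child) is reached."""
    path = (path or []) + [node.name]
    accepting = [c for c in node.children if c.accepts(doc)]
    if not accepting:
        return [path]                   # the document stops at this node
    results = []
    for child in accepting:
        results.extend(route(child, doc, path))
    return results

# Toy tree: root -> {energy -> {grain}, commodity -> {wheat}}
grain = CategoryNode("grain", lambda d: "grain" in d)
energy = CategoryNode("energy", lambda d: "oil" in d or "grain" in d, [grain])
wheat = CategoryNode("wheat", lambda d: "wheat" in d)
commodity = CategoryNode("commodity", lambda d: "wheat" in d, [wheat])
root = CategoryNode("root", lambda d: True, [energy, commodity])

print(route(root, "grain futures report"))
# [['root', 'energy', 'grain']]
```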
We compare our hierarchy with both non-hierarchical and hierarchical classifiers in order to identify the advantages of the classifier built with one-class SVM. We focus on the following two questions:
(1) Does the hierarchical classifier based on our SV clustering results improve
performance when compared to a non-hierarchical classifier?
(2) How does our hierarchical method compare with other text categorization
methods?
We first discuss the performance analysis tools. As mentioned in Section 4.2, precision, recall, and the F1 measure are the standard choices for measuring two-class classification problems. Note, however, that these three measures were designed for flat classification: they largely ignore the parent-child and sibling relationships between categories in a hierarchy. We therefore also consider measures that fit our requirements.
Let TP_i be the set of documents correctly classified into category C_i; FP_i the set of documents wrongly classified into C_i; FN_i the set of documents wrongly rejected from C_i; and TN_i the set of documents correctly rejected. The standard precision and recall are defined as follows:

    Pr_i = |TP_i| / (|TP_i| + |FP_i|)    (4.13)

    Re_i = |TP_i| / (|TP_i| + |FN_i|)    (4.14)
Based on the standard precision and recall for each category, the overall precision and recall for the whole category space can be obtained in two ways: Micro-Average and Macro-Average. Micro-Average gives equal importance to each document, while Macro-Average gives equal importance to each category [Yang 1997a]. In our experiments, we use only the Micro-Average measure.
The Micro-Average is defined as:

    Pr_u = Σ_{i=1}^{m} |TP_i| / Σ_{i=1}^{m} (|TP_i| + |FP_i|)    (4.15)

    Re_u = Σ_{i=1}^{m} |TP_i| / Σ_{i=1}^{m} (|TP_i| + |FN_i|)    (4.16)
where m is the number of categories.
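Equations (4.15) and (4.16) pool the per-category counts before dividing, which is what gives frequent categories more weight; a minimal sketch with illustrative counts:

```python
# Micro-averaged precision/recall: pool TP/FP/FN counts over all
# categories before dividing (Eqs. 4.15-4.16). Counts are illustrative.
def micro_average(counts):
    tp = sum(c["tp"] for c in counts)
    fp = sum(c["fp"] for c in counts)
    fn = sum(c["fn"] for c in counts)
    return tp / (tp + fp), tp / (tp + fn)

per_category = [
    {"tp": 80, "fp": 10, "fn": 5},    # e.g. a frequent category
    {"tp": 5,  "fp": 5,  "fn": 15},   # e.g. a rare category
]
prec, rec = micro_average(per_category)
print(round(prec, 3), round(rec, 3))  # 0.85 0.81
```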
We also use another combination of the precision and recall measures, the Break-Even Point (BEP). The BEP, proposed by Lewis [Lewis 1992a], is defined as the point at which precision and recall are equal. In our experiments there may be several break-even points for a category; we choose the smallest one as the final result.
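Operationally, the BEP can be found by sweeping a decision threshold over the classifier scores and taking the point where precision and recall coincide (or are closest); the scores and labels below are invented for illustration:

```python
# Break-even point sketch: rank documents by score, sweep the cut-off
# rank, and report the point where precision and recall are closest.
def break_even_point(scores, labels):
    pairs = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    best, tp = None, 0
    for rank, (_, y) in enumerate(pairs, start=1):
        tp += y
        prec, rec = tp / rank, tp / total_pos
        if best is None or abs(prec - rec) < abs(best[0] - best[1]):
            best = (prec, rec)
    return (best[0] + best[1]) / 2

scores = [0.9, 0.8, 0.7, 0.6, 0.5]   # illustrative classifier scores
labels = [1,   1,   0,   1,   0]     # 1 = relevant to the category
print(break_even_point(scores, labels))
# 2/3: at rank 3, precision = recall = 2/3
```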
Table 4.5 Precision/Recall break-even point on the ten most frequent Reuters categories for basic SVMs and the proposed approach.

Category    SVM (poly), d =                 SVM (rbf), γ =           Proposed
            1     2     3     4     5       0.6   0.8   1.0   1.2    Approach
earn        98.2  98.4  98.5  98.4  98.3    98.5  98.5  98.4  98.3   99.0
acq         92.6  94.6  95.2  95.2  95.3    95.0  95.3  95.3  95.4   99.1
money-fx    66.9  72.5  75.4  74.9  76.2    74.0  75.4  76.3  75.9   97.5
grain       91.3  93.1  92.4  91.3  89.9    93.1  91.9  91.9  90.6   83.5
crude       86.0  87.3  88.6  88.9  87.8    88.9  89.0  88.9  88.2   88.7
trade       69.2  75.5  76.6  77.3  77.1    76.9  78.0  77.8  76.8   100
interest    69.8  63.3  67.9  73.1  76.2    74.4  75.0  76.2  76.1   94.7
ship        82.0  85.4  86.0  86.5  86.0    85.4  86.5  87.6  87.1   98.9
wheat       83.1  84.5  85.2  85.9  83.8    85.2  85.9  85.9  85.9   76.2
corn        86.0  86.5  85.3  85.7  83.9    85.1  85.7  85.7  84.5   63.0
microavg    84.2  85.1  85.9  86.2  85.9    86.4  86.5  86.3  86.2   95.4
Table 4.5 compares the performance of our proposed hierarchical model with the non-hierarchical SVMs [Joachims 1998] on the 10 most frequent categories. Our proposed hierarchical construction does somewhat better overall, but performs poorly on categories such as grain, crude, wheat, and corn. We now try to explain why these four categories behave worse than the others.
We can use the hierarchical construction itself to explain this. The hierarchical categories are as follows:

Fig 4.4 Our Proposed Hierarchical Construction

The grain category lies under the expert node "energy", which contains 9 subcategories. We build a multi-class SVM classifier to distinguish these 9 subcategories. As mentioned in the literature review, we implement this multi-class SVM with 9 two-class SVMs, each distinguishing one category from the other 8. In our experiments we find that these 9 categories overlap considerably and share many documents, so distinguishing them is difficult.
What about the other three categories: crude, wheat, and corn? These lie under the expert node "commodity", which must distinguish 53 sub-categories, including crude, wheat, and corn. We first tried the one-against-the-rest training method, implementing one classifier per category, but found the classification results poor. We therefore took the alternative, the one-against-one training method, building one classifier for every pair of categories, for a total of C(53, 2) = 1378 classifiers. As before, these 53 categories are very close to one another, so the classification results in the hierarchical construction are poor.
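The classifier counts for the two training schemes follow directly from combinatorics:

```python
from math import comb

def one_vs_rest_count(k):
    return k                 # one binary classifier per category

def one_vs_one_count(k):
    return comb(k, 2)        # one classifier per unordered pair

# The "commodity" expert node distinguishes 53 sub-categories.
print(one_vs_rest_count(53))  # 53
print(one_vs_one_count(53))   # 53 * 52 / 2 = 1378
```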
Table 4.6 Comparison of the proposed classifier with k-NN and decision tree classifiers.

            k-NN (k = 30)            Decision Tree            Proposed Approach
Class       Recall  Prec.   F1       Recall  Prec.   F1       Recall  Prec.   F1
earn        0.950   0.920   0.935    0.953   0.966   0.978    0.995   0.990   0.992
acq         1.000   0.910   0.953    0.961   0.953   0.957    0.991   0.993   0.991
money-fx    0.920   0.650   0.762    0.771   0.758   0.764    0.987   0.975   0.980
grain       0.960   0.700   0.810    0.953   0.916   0.934    0.891   0.887   0.888
crude       0.820   0.750   0.783    0.926   0.850   0.886    0.865   0.835   0.849
trade       0.890   0.660   0.758    0.812   0.704   0.754    1       1       1
interest    0.800   0.710   0.752    0.649   0.933   0.766    0.947   1       0.972
ship        0.850   0.770   0.808    0.769   0.861   0.812    0.930   0.762   0.837
wheat       0.690   0.730   0.709    0.972   0.831   0.894    0.989   1       0.994
corn        0.350   0.760   0.479    0.982   0.821   0.984    0.918   0.630   0.747
Average     0.823   0.756   0.788    0.879   0.879   0.879    0.951   0.907   0.925
Table 4.6 shows the performance of our proposed hierarchical model against the decision tree [Weiss et al., 1999] and the k-NN classifiers [Aas et al., 1999], again for the most frequent categories. The k-NN classifier is widely regarded as the best of the conventional methods, replicating the findings of [Yang 1997b]. Our hierarchical construction performs better than k-NN on every category except grain. For the same reason as in the previous experiment, we perform worse than the decision tree classifier on those four categories, but in general the overall performance of our proposed approach is the best.
CHAPTER 5
CONCLUSIONS AND FUTURE WORK
5.1 Conclusions
Through the use of SV clustering, combining one-class SVM and SMO, the hierarchical construction over the Reuters categories is built automatically. The hierarchical category structure combines supervised and unsupervised learning.

In order to overcome the time-consuming process of finding connected components, we adopt dimension reduction and sampling, and the SMO learning algorithm is used to increase the learning speed. In the training process we then use a mapping strategy that maps each training datum to its corresponding cluster center.
The kernel method is important in our experiments: we show that only the Gaussian kernel works well in the process of SV clustering. One-class SVMs differ from two-class and multi-class SVMs in several respects; one of them is the influence of the data representation. We show that the binary representation yields the best performance among the data representations tested for one-class SVM classification.
Finally, we show that text categorization performance can be raised through our hierarchical construction. We compare the result with non-hierarchical SVM classifiers, and the experimental results tell us that our hierarchical classifiers do perform better. We then compare with non-SVM classifiers such as k-NN and decision trees; our proposed classifier also outperforms these two.
In summary, the contributions of this thesis are as follows:
1) Constructing the hierarchical categories automatically.
2) Exploring the characteristics of one-class SVM through experiments.
5.2 Future Work
We identify two areas for future improvement of our system. The first is the sampling step. Because SV clustering takes too long on the raw data, we perform sampling and dimension reduction with FCM and PCA for every category. This greatly decreases the clustering time, but it forces us to adopt a compensating strategy, and we face the risk that the clustering result is worse than it would be if we performed SV clustering on the original raw data. We hope that techniques such as SV mixtures, or other methods, will let us perform SV clustering on the original data and use that clustering result for text categorization.
The second area is the training of the expert-node classifiers. A category such as commodity contains 53 sub-categories, which results in poor classification performance. A more useful and powerful method for training this kind of classifier could improve its classification accuracy.
References
Aas K. and Eikvil L., Text categorization: a survey. Report No. 941, Norwegian Computing Center, ISBN 82-539-0425-8, June 1999.
Apte C., Damerau F., and Weiss S. M., Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, pages 233-251, 1994.
Ben-Hur A., Horn D., Siegelmann H. T., and Vapnik V., A support vector clustering method. In International Conference on Pattern Recognition, 2000.
Ben-Hur A., Horn D., Siegelmann H. T., and Vapnik V., Support vector clustering. Journal of Machine Learning Research, volume 2, pages 125-137, 2001.
Blake C. L. and Merz C. J., UCI repository of machine learning databases, 1998.
Brill E., Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Computational Linguistics, 1995.
Bishop C., Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995.
Boser B. E., Guyon I., and Vapnik V. N., A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, volume 5, pages 144-152, Pittsburgh, ACM, 1992.
Brill E., Rule-based part of speech tagger, version 1.14, 1994.
Hsu C.-W. and Lin C.-J., A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, volume 13(2), March 2002.
Cortes C. and Vapnik V., Support-vector networks. Machine Learning, volume 20(3), pages 273-297, 1995.
D'Alessio S., Kershenbaum A., Murray K., and Schiaffino R., Category levels in hierarchical text categorization. In Proceedings of the Third Conference on Empirical Methods in Natural Language Processing (EMNLP-3), 1998.
D'Alessio S., Kershenbaum A., Murray K., and Schiaffino R., The effect of using hierarchical classifiers in text categorization. In Proceedings of the 6th International Conference Recherche d'Information Assistée par Ordinateur (RIAO-00), pages 302-313, Paris, France, 2000.
Duda R. O., Hart P. E., and Stork D. G., Pattern Classification, 2nd edition. Wiley, 2000.
Horowitz E., Sahni S., and Mehta D., Fundamentals of Data Structures in C++. Computer Science Press, New York, 1995.
Frakes W. B. and Baeza-Yates R., Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.
Klir G. J. and Yuan B., Fuzzy Sets and Fuzzy Logic. Prentice Hall International Editions, 1995.
Golub G. and Van Loan C., Matrix Computations, 3rd edition. Johns Hopkins, Baltimore, 1996.
Hayes P. and Weinstein S., Construe/TIS: a system for content-based indexing of a database of news stories. In Annual Conference on Innovative Applications of AI, 1990.
Japkowicz N., Myers C., and Gluck M., A novelty detection approach to classification. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 518-523, 1995.
Joachims T., Text categorization with support vector machines: learning with many relevant features. LS-8 Report 23, 1998.
Jolliffe I. T., Principal Component Analysis. Springer-Verlag, 1986.
Koller D. and Sahami M., Hierarchically classifying documents using very few words. In International Conference on Machine Learning, volume 14, Morgan Kaufmann, 1997.
Lang K., NewsWeeder: learning to filter netnews. In International Conference on Machine Learning (ICML), 1995.
Lewis D. D., An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 37-50, 1992a.
Lewis D. D., Representation and learning in information retrieval. Ph.D. thesis, Computer Science Department, University of Massachusetts at Amherst, February 1992b.
Lewis D. D., Reuters-21578 collection, 1996.
Manevitz L. M. and Yousef M., One-class SVMs for document classification. Journal of Machine Learning Research, volume 2, pages 139-154, 2001.
Meisel W. S., Computer-Oriented Approaches to Pattern Recognition. New York and London, 1972.
Carreira-Perpiñán M. Á., A review of dimension reduction techniques. Technical Report CS-96-09, 1997.
Moya M., Koch M., and Hostetler L., One-class classifier networks for target recognition applications. In Proceedings of the World Congress on Neural Networks, pages 797-801, Portland, OR, International Neural Network Society (INNS), 1993.
Ng H.-T., Goh W.-B., and Low K.-L., Feature selection, perceptron learning, and a usability case study. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 67-73, Philadelphia, July 1997.
Platt J. C., Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning. MIT Press, 1998.
Hao P. Y. and Chiang J. H., Support vector clustering: a new geometrical grouping approach. In Proceedings of the 9th Bellman Continuum International Workshop on Uncertain Systems and Soft Computing, volume 2, pages 312-317, Beijing, China, July 2002.
Porter M. F., An algorithm for suffix stripping. Program: Automated Library and Information Systems, volume 14(3), pages 130-137, 1980.
Baeza-Yates R. and Ribeiro-Neto B., Modern Information Retrieval. Addison-Wesley, ACM Press, New York, 1999.
van Rijsbergen C. J., Information Retrieval, 2nd edition. Butterworths, London, 1979.
Ritter G. and Gallegos M., Outliers in statistical pattern recognition and an application to automatic chromosome classification. Pattern Recognition Letters, volume 18, pages 525-539, 1997.
Schölkopf B., Williamson R., Smola A., and Shawe-Taylor J., Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, and N. Tishby, editors, Unsupervised Learning, Dagstuhl-Seminar-Report 235, pages 19-20, 1999.
Schölkopf B., Platt J. C., Shawe-Taylor J., Smola A. J., and Williamson R. C., Estimating the support of a high-dimensional distribution. In Proceedings of the Annual Conference on Neural Information Processing Systems. MIT Press, 2000.
Schütze H., Hull D., and Pedersen J., A comparison of classifiers and document representations for the routing problem. In International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995.
Sebastiani F., Machine learning in automated text categorization: a survey. Technical Report IEI-B4-31-1999, Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, 1999; revised version, 2001.
Tax D. and Duin R., Support vector domain description. Pattern Recognition Letters, volume 20(11-13), pages 1191-1199, 1999.
Wang K., Zhou S., and He Y., Hierarchical classification of real life documents. In Proceedings of the 1st SIAM International Conference on Data Mining, Chicago, 2001.
Weigend A. S., Wiener E. D., and Pedersen J. O., Exploiting hierarchy in text categorization. Information Retrieval, volume 1(3), pages 193-216, 1999.
Weiss S. M., Apte C., Damerau F. J., Johnson D. E., Oles F. J., Goetz T., and Hampp T., Maximizing text-mining performance. IEEE Intelligent Systems, volume 14(4), July-August 1999.
Weston J. and Watkins C., Multi-class support vector machines. Technical Report CSD-TR-98-04, May 1998.
Yang Y. and Wilbur J., Using corpus statistics to remove redundant words in text categorization. Journal of the American Society for Information Science, volume 47(5), pages 357-369, 1996.
Yang Y., An evaluation of statistical approaches to text categorization. Technical Report CMU-CS-97-127, Computer Science Department, Carnegie Mellon University, 1997a.
Yang Y. and Pedersen J. O., A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML-97), pages 412-420, 1997b.
黃曲江 (Huang), 計算機演算法設計與分析 [Design and Analysis of Computer Algorithms]. 格致, 1989 (in Chinese).
APPENDIX A Stop-word List
A total of 306 stop words are used in this thesis.
a
about
above
across
after
afterwards
again
against
albeit
all
almost
alone
along
already
also
although
always
among
amongst
an
and
another
any
anyhow
anyone
anything
anywhere
are
around
as
at
b
be
became
because
become
becomes
becoming
been
before
beforehand
behind
being
below
beside
besides
between
beyond
both
but
by
c
can
cannot
co
could
d
down
during
e
each
eg
eight
either
eleven
else
elsewhere
enough
etc
even
ever
every
everyone
everything
everywhere
except
f
few
five
for
four
former
formerly
from
further
g
h
had
has
have
he
hence
her
here
hereafter
hereby
herein
hereupon
hers
herself
him
himself
his
how
however
i
ie
if
in
inc
indeed
into
is
it
its
itself
j
k
l
last
latter
latterly
least
less
ltd
m
many
may
me
meanwhile
might
more
moreover
most
mostly
much
must
my
myself
n
namely
neither
never
nevertheless
next
nine
no
nobody
none
nor
not
nothing
now
nowhere
o
of
often
on
once
one
only
onto
or
other
others
otherwise
our
ours
ourselves
out
over
own
p
per
perhaps
q
r
rather
s
said
same
seem
seemed
seeming
seems
seven
several
she
should
since
six
so
some
somehow
someone
something
sometime
sometimes
somewhere
still
such
t
ten
than
that
the
their
them
themselves
then
thence
there
thereafter
thereby
therefore
therein
thereupon
these
they
this
those
though
three
through
throughout
thru
thus
to
today
together
too
toward
towards
two
twelve
u
under
until
up
upon
us
v
v
very
via
w
was
we
well
were
what
whatever
whatsoever
when
whence
whenever
whensoever
where
whereafter
whereas
whereat
whereby
wherefrom
wherein
whereinto
whereof
whereon
whereto
whereunto
whereupon
wherever
wherewith
whether
which
whichever
whichsoever
while
whilst
whither
who
whoever
whole
whom
whomever
whomsoever
whose
whosoever
why
will
with
within
without
would
x
y
year
years
yes
yesterday
yet
you
your
yours
yourself
yourselves
z
APPENDIX B Part-of-Speech Tags
There are 37 part-of-speech tags in total.
Part-of-Speech Tag Meaning
1 CC Coordinating conjunction
2 CD Cardinal number
3 DT Determiner
4 EX Existential there
5 FW Foreign word
6 IN Preposition or subordinating conjunction
7 JJ Adjective
8 JJR Adjective, comparative
9 JJS Adjective, superlative
10 LS List item marker
11 MD Modal
12 NN Noun, singular or mass
13 NNS Noun, plural
14 NNP Proper noun, singular
15 NNPS Proper noun, plural
16 PDT Predeterminer
17 POS Possessive ending
18 PRP Personal pronoun
19 PRP$ Possessive pronoun
20 RB Adverb
21 RBR Adverb, comparative
22 RBS Adverb, superlative
23 RP Particle
24 SYM Symbol
25 TO To
26 UH Interjection
27 VB Verb, base form
28 VBD Verb, past tense
29 VBG Verb, gerund or present participle
30 VBN Verb, past participle
31 VBP Verb, non-3rd person singular present
32 VBZ Verb, 3rd person singular present
33 WDT Wh-determiner
34 WP Wh-pronoun
35 WP$ Possessive wh-pronoun
36 WRB Wh-adverb
37 . Period