National Cheng Kung University
Department of Computer Science and Information Engineering
Master's Thesis

Hierarchical Text Categorization Using
One-Class SVM

Student: Yi-Kun Tu
Advisor: Dr. Jung-Hsien Chiang

July 2003
Hierarchical Text Categorization Using One-Class SVM

Yi-Kun Tu*  Jung-Hsien Chiang**
Institute of Computer Science and Information Engineering, National Cheng Kung University

Chinese Abstract

With the rapid growth of information, automatic text categorization has become an important information analysis technique for processing and organizing data. Experience shows that as the number of data categories increases, performance measures such as precision and recall decrease accordingly; adopting a hierarchical category structure makes it possible to handle problems involving large amounts of data.

In this research, we use a one-class support vector machine to cluster documents, and use the clustering results to build a hierarchical structure that describes the relationships between categories. We then use two-class and multi-class support vector machines for supervised classification training.

Through three designed experiments, we explore the characteristics of the system built on one-class SVM and compare it with other approaches. The experimental results demonstrate that the proposed system achieves better performance.

*Author  **Advisor
Hierarchical Text Categorization Using
One-Class SVM
Yi-Kun Tu*  Jung-Hsien Chiang**
Department of Computer Science & Information Engineering,
National Cheng Kung University

Abstract

With the rapid growth of online information, text categorization has become one
of the key techniques for handling and organizing text data. Experience to date has
demonstrated that both precision and recall decrease as the number of categories increases. Hierarchical categorization affords the ability to deal with very large problems.
We utilize one-class SVM to perform support vector clustering, and then use the clustering results to construct a category hierarchy. Two-class and multi-class SVMs are used to perform the supervised classification.
We explore the one-class SVM model through three experiments. Performance is analyzed by comparison with other approaches; the experimental results show that the proposed category hierarchy works well.
*Author **Advisor
Acknowledgements

The completion of this thesis owes first to my advisor, Professor Jung-Hsien Chiang, for his attentive guidance and encouragement over these two years, and for providing an excellent learning environment. I thank God for placing my girlfriend Shu-Hua beside me as my greatest support and encouragement. Pei-Yi was my greatest knowledge base, never hesitating to help with any question, and my senior Tsung-Hsien was the best supporter, always giving me the best advice when I needed it most.

Of course, there are many more people to thank, and I keep them all in my heart. Thank you for your support and encouragement.

Midsummer 2003
Yi-Kun Tu
ISMP IIR LAB, Institute of Computer Science and Information Engineering, NCKU
Contents

Chinese Abstract
ABSTRACT
FIGURE LISTING
TABLE LISTING
CHAPTER 1 INTRODUCTION
1.1 RESEARCH MOTIVATION
1.2 THE APPROACH
1.3 THESIS ORGANIZATION
CHAPTER 2 LITERATURE REVIEW AND RELATED WORKS
2.1 SUPPORT VECTOR CLUSTERING
2.1.1 One-Class SVM
2.1.2 The Formulation of Support Vector Clustering
2.2 SEQUENTIAL MINIMAL OPTIMIZATION
2.2.1 Optimize Two Lagrange Multipliers
2.2.2 Updating After A Successful Optimization Step
2.3 DIMENSION REDUCTION
2.4 FUZZY C-MEAN
2.5 FEATURE SELECTION METHODS
2.5.1 Document Frequency Thresholding
2.5.2 Information Gain
2.5.3 Mutual Information
2.5.4 χ² Statistic
2.5.5 Term Strength
2.6 MULTI-CLASS SVMs
CHAPTER 3 TEXT CATEGORIZATION USING ONE-CLASS SVM
3.1 PROPOSED MODEL
3.2 DATA PREPROCESSING
3.2.1 Part-Of-Speech Tagger
3.2.2 Stemming
3.2.3 Stop-Word Filter
3.2.4 Feature Selection
3.3 UNSUPERVISED LEARNING
3.3.1 Support Vector Clustering
3.3.2 The Choice Of Kernel Function
3.3.3 Cluster-Finding With Depth First Searching Algorithm
3.3.4 Cluster Validation
3.3.5 One-Cluster And Time-Consuming Problem
3.4 SUPERVISED LEARNING
3.4.1 Reuters Category Construction
3.4.2 The Mapping Strategy
3.4.3 Gateway Node Classifier
3.4.4 Expert Node Classifier
CHAPTER 4 EXPERIMENT DESIGN AND ANALYSIS
4.1 THE CORPUS
4.1.1 Documents
4.1.2 File Format
4.1.3 Document Internal Tags
4.1.4 Categories
4.2 DATA REPRESENTATION ANALYSIS
4.3 THE CHOICE OF KERNEL FUNCTION
4.4 HIERARCHICAL TEXT CATEGORIZATION PERFORMANCE ANALYSIS
CHAPTER 5 CONCLUSIONS AND FUTURE WORKS
5.1 CONCLUSIONS
5.2 FUTURE WORKS
Reference
APPENDIX A Stop-word List
APPENDIX B Part-Of-Speech Tags
Figure Listing

Figure 1.1 Our Learning Processes
Figure 2.1 Different values of q give different clustering results: (a) q=1 (b) q=20 (c) q=24 (d) q=48
Figure 2.2 With q fixed, different values of ν result in different cluster shapes: (a) ν=1/l, where l is the number of data (b) ν=0.4
Figure 3.1 Our proposed text categorization model
Figure 3.2 Data Preprocessing Processes
Figure 3.3 Words with tagging
Figure 3.4 Representing text as a feature vector
Figure 3.5 The SV clustering processes [Ben-Hur et. al., 2000]
Figure 3.6 Reuters basic hierarchy [D'Alessio et. al., 2000]
Figure 3.7 Our Proposed Hierarchy
Figure 4.1 Sample news stories in the Reuters-21578 corpus
Figure 4.2 One-Class SVM with polynomial kernel where d=2
Figure 4.3 One-Class SVM with Gaussian kernel where g=100, ν=0.1
Figure 4.4 Our Proposed Hierarchical Construction
Table Listing

Table 4.1 The list of categories, sorted in decreasing order of frequency
Table 4.2 Number of Training/Testing Items
Table 4.3 F1 measure in One-Class SVM with vector dimension 10
Table 4.4 F1 measure in One-Class SVM with vector dimension 20
Table 4.5 Precision/Recall-breakeven point on the ten most frequent Reuters categories on Basic SVMs
Table 4.6 Comparison of our proposed classifier with k-NN and decision tree
CHAPTER 1
INTRODUCTION
With the rapid growth of online information, text categorization has become one of
the key techniques for handling and organizing text data. It is used to classify news stories [Hayes et. al., 1990], to find interesting information on the WWW [Lang 1995], and to guide a user's search through hypertext. Since building text classifiers by hand is difficult and very time-consuming, it is desirable to learn classifiers from examples.
The most successful paradigm to organize a large mass of information is by
categorizing the documents according to their topics. Recently, various machine learning techniques have been applied to automatic text categorization. These approaches are usually based on the vector space model, in which documents are represented by sparse vectors, with one component for each unique word extracted from the document. Typically, the document vector is very high-dimensional, at least in the thousands for large collections. This is a major stumbling block in applying many machine learning methods, so existing techniques rely heavily on dimension reduction as a preprocessing step, which is computationally expensive. More recently, a new approach has emerged with the introduction of Support Vector Machines (SVMs). This algorithm outperforms other classifiers in text categorization and can also be used as a clustering method.
Most of the computational experience discussed in the literature deals with hierarchies that are trees. Indeed, until recently, most problems discussed dealt with categorization within a simple (non-hierarchical) set of categories [Frakes et. al., 1992]. However, a few hierarchical classification methods have also been proposed recently [D'Alessio et. al., 1998 ; Wang et. al., 2001 ; Weigend et. al., 1999].
In this research, we utilize one-class SVM to perform text clustering, and use the clustering results to build a hierarchical construction. This hierarchical construction illustrates the relationships between Reuters categories.

The Reuters-21578 corpus has been studied extensively. Yang [Yang 1997a] compares 14 categorization algorithms applied to this Reuters corpus as a flat categorization problem on 135 categories. The same corpus has more recently been studied by others treating the categories as a hierarchy [Koller et. al., 1997 ; Yang 1997a ; Ng et. al., 1997]. We construct our own hierarchy through Support Vector (SV) clustering and compare the text categorization results with the state-of-the-art literature.
1.1 RESEARCH MOTIVATION
In the field of document categorization, if we use only one layer of classifiers, we usually need many training samples. This kind of model is usually too complicated and the training is not accurate enough. So we adopt a "divide and conquer" approach, partitioning a problem into many small, easily solved sub-problems. With this procedure, we can simplify our training processes and obtain a more accurate training model.

Heuristically, we know that each document can belong to multiple categories, exactly one category, or no category at all. By building the hierarchical construction between categories, whenever a new document comes in, we can easily assign it to the category (or categories) to which it belongs.

In recent years, the Support Vector Machine has outperformed other classifiers in text classification, and it has also been used to perform text clustering. We want to perform support vector clustering on a text data set and construct a hierarchical construction over the data set.
1.2 THE APPROACH
There are three stages in our proposed approach. It consists of data preprocessing,
unsupervised learning, and supervised learning.
We represent documents as "bags of words": only the presence or absence of words in a document is indicated, not their order or any higher-level linguistic structure. We can thus think of documents as high-dimensional vectors, with one slot for each word in the vocabulary and a value in each slot related to the number of times that word appeared in the document. Note that a large fraction of the total number of slots will have zero value for any given document, so the vectors are quite sparse.
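As a toy illustration of this sparse bag-of-words representation (the vocabulary and documents below are invented examples, not taken from the Reuters corpus):

```python
# Toy illustration of the sparse bag-of-words representation.
from collections import Counter

def build_vocabulary(docs):
    """Assign each unique word a fixed slot (dimension) index."""
    vocab = {}
    for doc in docs:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_sparse_vector(doc, vocab):
    """Keep only the non-zero slots, since most slots are zero."""
    counts = Counter(doc.split())
    return {vocab[w]: c for w, c in counts.items() if w in vocab}

docs = ["oil prices rise", "oil exports fall", "grain prices rise again"]
vocab = build_vocabulary(docs)
vec = to_sparse_vector(docs[0], vocab)   # sparse: 3 of 7 slots are non-zero
```

Storing only the non-zero slots is what makes the very high-dimensional document vectors practical to handle.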
The goal of the first stage is to build the training data for the second stage; a detailed description of the methods we use is given in Section 3.2.

The second stage is unsupervised learning, in which we use one-class SVM to perform support vector clustering. We explain how we achieve text clustering with this learning algorithm in Section 3.3.

In the last stage, we use two-class SVMs and multi-class SVMs to train the tree nodes and finally obtain much more accurate tree node classifiers. The following figure shows stages two and three:
Fig 1.1 Our Learning Process
1.3 THESIS ORGANIZATION
The content of this thesis is partitioned into five chapters and is organized as follows.
Chapter 1 introduces the motivation of our research.
Chapter 2 reviews related techniques such as support vector clustering, one-class SVM, sequential minimal optimization, text categorization, and feature selection methods.

Chapter 3 describes our methodologies and the system.

Chapter 4 describes the corpus we use as the test bed for our three experiments. We report our experimental results on SV clustering and text categorization, and discuss system performance by comparing with other classifiers.

Chapter 5 presents our conclusions based on the results of our experiments and suggests directions for future research.
CHAPTER 2
LITERATURE REVIEW AND RELATED WORKS
Sections 2.1 and 2.2 illustrate the unsupervised learning methods we use: SV clustering, one-class SVM, and sequential minimal optimization. Section 2.3 covers dimension reduction methods, and Section 2.4 is about Fuzzy C-mean. Section 2.5 reviews feature selection methods, and Section 2.6 is about multi-class SVMs. These topics are the major techniques in our proposed model.
2.1 SUPPORT VECTOR CLUSTERING
SV clustering was derived from the study of the one-class support vector machine (SVM) [Scholkopf et. al., 2000 ; Tax et. al., 1999]. The aim of one-class SVM is to characterize the support of a probability distribution. To do so, one needs to solve a quadratic optimization problem, whose optimal solution tells us the positions of the support vectors.

Ben-Hur et. al. generalize the support vectors as the boundary of the clusters [Ben-Hur et. al., 2000 ; Miguel 1997 ; Duda et. al., 2000]. They proposed an algorithm for representing the support of a probability distribution given by a finite data set using the support vectors. We first review the concept of one-class SVM.
2.1.1 One-Class SVM
The term one-class classification originates from Moya [Moya et. al., 1993], but the terms outlier detection [Ritter et. al., 1997], novelty detection [Bishop 1995], and concept learning [Japkowicz et. al., 1995] are also used. The different names exist because one-class classification arises in different applications. The one-class classification problem can be described as follows: one wants to develop an algorithm which returns a function $f$ that takes the value $+1$ in a small region capturing most of the data points, and $-1$ elsewhere.
Scholkopf et. al. [Scholkopf et. al., 2000] proposed two methods for solving the one-class classification problem:

a) Map the data into the feature space corresponding to the kernel function, and separate them from the origin with maximum margin. This is a two-class separation problem in which the only negative example is the origin, and all the training data are positive examples.

b) Search for the sphere of minimum volume that includes all the training data.
We review only the solution to the second problem. Suppose the data points are mapped from the input space to a high-dimensional feature space through a non-linear transformation function $\Phi$. One looks for the smallest sphere that encloses the image of the data.
Scholkopf solved the problem in the following way. Consider the smallest enclosing sphere with radius $R$; the optimization problem is:

$$\min R^2 \quad \text{such that} \quad \|\Phi(x_j) - a\|^2 \le R^2 \quad \forall j \qquad (2.1)$$

where $a$ is the center of the sphere, $\|\cdot\|$ is the Euclidean norm, and $R$ is the radius of the sphere. Soft margin constraints are incorporated by adding slack variables $\xi_j$, so the optimization problem becomes:

$$\min R^2 + C\sum_j \xi_j \quad \text{such that} \quad \|\Phi(x_j) - a\|^2 \le R^2 + \xi_j \qquad (2.2)$$

with $\xi_j \ge 0$. Introducing Lagrange multipliers, the constrained problem becomes:

$$L = R^2 - \sum_j \left(R^2 + \xi_j - \|\Phi(x_j) - a\|^2\right)\beta_j - \sum_j \xi_j u_j + C\sum_j \xi_j \qquad (2.3)$$

where $\beta_j \ge 0$ and $u_j \ge 0$ are Lagrange multipliers and $C$ is a constant. One calculates the partial derivatives of $L$ with respect to $R$, $a$, and $\xi_j$. Setting the partial derivatives to zero, one gets

$$\sum_j \beta_j = 1 \qquad (2.4)$$

$$a = \sum_j \beta_j \Phi(x_j) \qquad (2.5)$$

$$\beta_j = C - u_j \qquad (2.6)$$

The Karush-Kuhn-Tucker (KKT) conditions must hold, which results in

$$\left(R^2 + \xi_j - \|\Phi(x_j) - a\|^2\right)\beta_j = 0 \qquad (2.7)$$

$$\xi_j u_j = 0 \qquad (2.8)$$

One can then eliminate the variables $R$, $a$, and $u_j$ and turn the problem into its Wolfe dual form. The new problem is a function of the variables $\beta_j$ only:

$$W = \sum_j \Phi(x_j)^2\,\beta_j - \sum_{i,j} \beta_i \beta_j\,\Phi(x_i)\cdot\Phi(x_j) \qquad (2.9)$$

with the constraints

$$0 \le \beta_j \le C \qquad (2.10)$$

$$\sum_j \beta_j = 1 \qquad (2.11)$$
The non-linear transformation function $\Phi$ is a feature map, i.e. a map into an inner product space $F$ such that the inner product in the image of $\Phi$ can be computed by evaluating some simple kernel [Boser et. al., 1992 ; Cortes et. al., 1995 ; Scholkopf et. al., 1999]. Define

$$K(x, y) \equiv \Phi(x)\cdot\Phi(y) \qquad (2.12)$$

and the Gaussian kernel function

$$K(x, y) = \exp\left(-q\|x - y\|^2\right) \qquad (2.13)$$

The Lagrangian $W$ in (2.9) can then be written as

$$W = \sum_j K(x_j, x_j)\,\beta_j - \sum_{i,j} \beta_i \beta_j K(x_i, x_j) \qquad (2.14)$$

If one solves the above formula and obtains the optimal solution for $\beta_j$, one can calculate the distance of each point $x$ to the center of the sphere:

$$R^2(x) = \|\Phi(x) - a\|^2 \qquad (2.15)$$

By (2.5), one can rewrite this as

$$R^2(x) = K(x, x) - 2\sum_j \beta_j K(x_j, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j) \qquad (2.16)$$
The radius of the sphere is:

$$R = \{R(x_i) \mid x_i \text{ is a support vector}\} \qquad (2.17)$$

So if one solves the dual problem $W$, one can find all the SVs. Then, by calculating the distance between the SVs and the sphere center, the radius of the sphere is found.
Consider equations (2.7) and (2.8):

$$\left(R^2 + \xi_j - \|\Phi(x_j) - a\|^2\right)\beta_j = 0 \quad \text{and} \quad \xi_j u_j = 0$$

There are three types of data points:

a) If $0 < \beta_j < C$, then the corresponding $\xi_j = 0$; from equation (2.7), these data lie on the sphere surface. They are called Support Vectors (SVs).

b) If $\xi_j > 0$, the corresponding $\beta_j = C$; these data points lie outside the sphere. They are called Bounded Support Vectors (BSVs).

c) All other data, with $\beta_j = 0$, lie inside the sphere.

So, from the discussion above, one can tell whether a data point is inside, outside, or on the sphere surface by the corresponding value of $\beta_j$.
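This three-way classification by multiplier value can be sketched as a small helper (the function name, `C` argument, and tolerance are illustrative, not part of the original formulation):

```python
def point_type(beta, C, tol=1e-8):
    """Classify a data point by its optimal multiplier beta_j (cases a-c above):
    beta_j = 0       -> strictly inside the sphere;
    beta_j = C       -> outside the sphere (Bounded Support Vector);
    0 < beta_j < C   -> on the sphere surface (Support Vector)."""
    if beta < tol:
        return "inside"
    if beta > C - tol:
        return "BSV"
    return "SV"
```

The tolerance is needed in practice because a numerical QP solver returns multipliers only approximately equal to 0 or C.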
2.1.2 The Formulation of Support Vector Clustering
In the training of one-class SVM, considering the use of Gaussian kernel. We map the
data into high-dimensional feature space and find a sphere with minimum volume to
enclose the image of the data. When the sphere be mapped back to the input space,
they form different shape of contours. These contours enclose the data points, they
can be seen as cluster boundary, below we explain how to perform SV clustering.
By the use of Gaussian kernel function, the parameters that one needs to control
are the kernel width q and penalty v . One can use Fig 2.1 and Fig 2.2 [ Ben-Hur et.
10
al., 2000 ; Hao P. Y. et. al., 2002 ] to explain what they work,
Fig 2.1 Different values of $q$ give different clustering results: (a) $q=1$ (b) $q=20$ (c) $q=24$ (d) $q=48$
In Fig 2.1, one sees that as the value of $q$ increases, the contour boundaries become tighter and tighter, and there are more and more contours. This shows that the parameter $q$ controls the compactness of the enclosing contours and also their number. So one can tune the value of $q$ from small to large, or from large to small, to obtain proper results.

The influence of the parameter $\nu$ is as follows:

Fig 2.2 With $q$ fixed, different values of $\nu$ result in different cluster shapes: (a) $\nu = 1/l$, where $l$ is the number of data (b) $\nu = 0.4$
In Fig 2.2 one can see that as $\nu$ increases, more BSVs appear, because $\nu$ controls the percentage of outliers. So one now knows how to control the tightness of the contour boundaries, and one can also let some data points become outliers in order to form clearer contour boundaries.

The problem now is how to find the contours seen in Fig 2.1 and Fig 2.2. [Ben-Hur et. al., 2000 ; Ben-Hur et. al., 2001] proposed using the schema of connected components, defined through an adjacency matrix. Clusters are then defined as the connected components, and one can use any graph searching algorithm to find all of them.
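The cluster-labelling step can be sketched as follows. This is a simplified illustration, not the thesis implementation: `radius` is a toy stand-in for the trained sphere's $R(x)$ from (2.16), and adjacency is tested by sampling points on the segment between two data points, as in Ben-Hur et al.

```python
# Two points are adjacent if every sampled point on the segment between
# them stays inside the sphere (radius(y) <= R); clusters are then the
# connected components of the adjacency relation, found by DFS.

def adjacent(x1, x2, radius, R, n_samples=10):
    for t in (i / n_samples for i in range(n_samples + 1)):
        y = tuple(a + t * (b - a) for a, b in zip(x1, x2))
        if radius(y) > R:
            return False
    return True

def connected_components(points, radius, R):
    labels = [-1] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] != -1:
            continue
        stack = [i]                  # depth-first search from an unlabelled point
        labels[i] = cluster
        while stack:
            j = stack.pop()
            for k in range(len(points)):
                if labels[k] == -1 and adjacent(points[j], points[k], radius, R):
                    labels[k] = cluster
                    stack.append(k)
        cluster += 1
    return labels

# Toy "radius": distance to the nearest of two invented cluster centres.
centres = [(0.0, 0.0), (10.0, 0.0)]
radius = lambda y: min(sum((a - b) ** 2 for a, b in zip(y, c)) ** 0.5
                       for c in centres)
points = [(0.1, 0.0), (0.5, 0.2), (9.8, 0.1), (10.2, -0.1)]
labels = connected_components(points, radius, R=1.0)
```

With the real $R(x)$ of (2.16) plugged in for `radius`, this is exactly the depth-first cluster-finding used in Section 3.3.3.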
2.2 SEQUENTIAL MINIMAL OPTIMIZATION
The new SVM learning algorithm is called Sequential Minimal Optimization (SMO), proposed by John C. Platt [Platt 1998]. The algorithm solves the quadratic programming problem by breaking it into a series of smallest-possible sub-QP problems. These small sub-QP problems are solved analytically, so one does not need to perform matrix computation. Thus the training time of the algorithm scales roughly between linear and quadratic in the training set size for various test problems.

Traditionally, the SVM learning algorithm uses numeric quadratic programming as an inner loop, which takes much time, scaling between linear and cubic in the data set size. SMO goes a different way: it uses an analytic QP step, which can be partitioned into the two steps the following sections describe.
2.2.1 Optimize Two Lagrange Multipliers
In solving the QP problem, every training example corresponds to a single Lagrange multiplier, and one can judge whether a training example is a support vector by the value of its corresponding Lagrange multiplier. The task now is to optimize two Lagrange multipliers at a time.
Platt proposed the following way [Platt 1998]. Consider optimizing over $\alpha_1$ and $\alpha_2$, with all other variables fixed. The original quadratic problem

$$\min_\alpha \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j\,k(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu l}, \;\; \sum_i \alpha_i = 1 \qquad (2.18)$$

can be reduced to

$$\min_{\alpha_1, \alpha_2} \frac{1}{2}\sum_{i,j=1}^{2} \alpha_i \alpha_j K_{ij} + \sum_{i=1}^{2} \alpha_i C_i + C \qquad (2.19)$$

with $C_i = \sum_{j=3}^{l} \alpha_j K_{ij}$ and $C = \sum_{i,j=3}^{l} \alpha_i \alpha_j K_{ij}$,

subject to

$$0 \le \alpha_1, \alpha_2 \le \frac{1}{\nu l}, \quad \sum_{i=1}^{2} \alpha_i = \Delta \qquad (2.20)$$

where $\Delta = 1 - \sum_{i=3}^{l} \alpha_i$.

Since $C$ does not depend on $\alpha_1$ and $\alpha_2$, one can eliminate it and obtain the new form:

$$\min_{\alpha_2} \frac{1}{2}(\Delta - \alpha_2)^2 K_{11} + (\Delta - \alpha_2)\alpha_2 K_{12} + \frac{1}{2}\alpha_2^2 K_{22} + (\Delta - \alpha_2) C_1 + \alpha_2 C_2 \qquad (2.21)$$

with the derivative

$$-(\Delta - \alpha_2) K_{11} + (\Delta - 2\alpha_2) K_{12} + \alpha_2 K_{22} - C_1 + C_2 \qquad (2.22)$$

Setting the derivative to zero gives

$$\alpha_2 = \frac{\Delta (K_{11} - K_{12}) + C_1 - C_2}{K_{11} + K_{22} - 2K_{12}} \qquad (2.23)$$

Once $\alpha_2$ is found, we can calculate $\alpha_1$ from (2.20).

If the new point $(\alpha_1, \alpha_2)$ is outside $[0, 1/(\nu l)]$, the constrained optimum is found by projecting $\alpha_2$ from (2.23) into the region allowed by the constraints and then re-computing $\alpha_1$.

The offset $\rho$ is recomputed after every such step.
2.2.2 Updating After A Successful Optimization Step
Let $\alpha_1^*$, $\alpha_2^*$ be the values of the Lagrange parameters after the step in 2.2.1. The corresponding output is [Platt 1998]

$$O_i = \alpha_1^* K_{1i} + \alpha_2^* K_{2i} + C_i \qquad (2.24)$$

Combining this with (2.23), one obtains the update equation for $\alpha_2$ in which $\alpha_1$ no longer appears:

$$\alpha_2^* = \alpha_2 + \frac{O_1 - O_2}{K_{11} + K_{22} - 2K_{12}} \qquad (2.25)$$
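The analytic two-multiplier step of (2.20)-(2.25) can be sketched as below. The function name and values are illustrative; clipping to the box stands in for the projection step described above, and the pair's sum $\Delta$ is kept fixed.

```python
def smo_pair_update(a1, a2, K11, K12, K22, O1, O2, nu, l):
    """One analytic SMO step on (alpha_1, alpha_2), everything else fixed:
    move alpha_2 by (O1 - O2)/(K11 + K22 - 2*K12) as in (2.25), project it
    back into the box [0, 1/(nu*l)] while keeping alpha_1 + alpha_2 fixed,
    then recover alpha_1 from the sum constraint (2.20)."""
    delta = a1 + a2                         # conserved by the pairwise step
    upper = 1.0 / (nu * l)                  # box constraint 0 <= alpha <= 1/(nu*l)
    a2_new = a2 + (O1 - O2) / (K11 + K22 - 2.0 * K12)
    a2_new = max(a2_new, max(0.0, delta - upper))   # keep alpha_1 <= upper
    a2_new = min(a2_new, min(upper, delta))         # keep alpha_1 >= 0
    return delta - a2_new, a2_new
```

Because the update is a closed-form expression followed by a clip, no matrix computation is needed, which is the point of SMO.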
2.3 DIMENSION REDUCTION
The problem of dimension reduction is introduced as a way to overcome the curse of dimensionality when dealing with vector data in high-dimensional spaces, and as a modeling tool for such data [Miguel 1997].

In most applications, dimension reduction is carried out as a preprocessing step. Selecting the dimensions using principal component analysis (PCA) [Duda et. al., 2000 ; Jolliffe 1986] through singular value decomposition (SVD) [Golub et. al., 1996] is a popular approach for numerical attributes.
PCA is possibly the most widely used technique to perform dimension reduction. Consider a sample

$$\{x_i\}_{i=1}^{n} \subset R^D \qquad (2.26)$$

with mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (2.27)$$

and covariance matrix

$$\Sigma = E\{(x - \bar{x})(x - \bar{x})^T\} \qquad (2.28)$$

with spectral decomposition

$$\Sigma = U \Lambda U^T \qquad (2.29)$$

The principal component transformation

$$y = U^T (x - \bar{x}) \qquad (2.30)$$

yields a reference system in which the sample has mean 0 and diagonal covariance matrix $\Lambda$ containing the eigenvalues of $\Sigma$; the variables are now uncorrelated. One can discard the variables with small variance, i.e. project onto the subspace spanned by the first $L$ principal components, and obtain a good approximation to the original sample.
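A minimal numpy sketch of the projection in (2.26)-(2.30); the function name and sample data are illustrative:

```python
import numpy as np

def pca_project(X, L):
    """Project the rows of X onto the first L principal components."""
    x_bar = X.mean(axis=0)                   # sample mean (2.27)
    cov = np.cov(X - x_bar, rowvar=False)    # covariance matrix (2.28)
    eigvals, U = np.linalg.eigh(cov)         # spectral decomposition (2.29)
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    return (X - x_bar) @ U[:, order[:L]]     # y = U^T (x - x_bar) (2.30)

X = np.array([[2.0, 0.1], [4.0, -0.1], [6.0, 0.2], [8.0, -0.2]])
Y = pca_project(X, 1)   # keep only the dominant direction
```

Here the second coordinate carries almost no variance, so projecting onto a single component preserves nearly all of the spread in the sample.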
2.4 FUZZY C-MEAN
Clustering is one of the most fundamental issues in pattern recognition. It plays a key role in searching for structures in data. Given a finite set of data, the problem is to find several cluster centers that can properly characterize relevant classes of the set. Fuzzy C-mean is based on fuzzy c-partitions; the algorithm is as follows [Georgej et. al., 1995]:
Step 1. Let $t = 0$. Select an initial fuzzy pseudopartition $P^{(0)}$.

Step 2. Calculate the $c$ cluster centers $v_1^{(t)}, \ldots, v_c^{(t)}$ for $P^{(t)}$ and the chosen value of $m$, $m \in (1, \infty)$.

Step 3. Update $P^{(t+1)}$ by the following procedure. For each $x_k \in X$, if

$$\|x_k - v_i^{(t)}\|^2 > 0 \qquad (2.31)$$

for all $i \in N_c$, then define

$$A_i^{(t+1)}(x_k) = \left[\sum_{j=1}^{c}\left(\frac{\|x_k - v_i^{(t)}\|^2}{\|x_k - v_j^{(t)}\|^2}\right)^{\frac{1}{m-1}}\right]^{-1} \qquad (2.32)$$

If $\|x_k - v_i^{(t)}\|^2 = 0$ for some $i \in I \subseteq N_c$, then define $A_i^{(t+1)}(x_k)$ for $i \in I$ by any nonnegative real numbers satisfying

$$\sum_{i \in I} A_i^{(t+1)}(x_k) = 1 \qquad (2.33)$$

and define $A_i^{(t+1)}(x_k) = 0$ for $i \in N_c - I$.

Step 4. Compare $P^{(t)}$ and $P^{(t+1)}$. If $\|P^{(t)} - P^{(t+1)}\| \le \varepsilon$, then stop; otherwise, increase $t$ by one and return to Step 2.
The most obvious disadvantage of the FCM algorithm is that one needs to guess the number of cluster centers. In our implementation, we do know how many clusters we need, so this is not a big problem for us.
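The membership update (2.32)-(2.33) can be sketched as follows. This is a toy implementation: in the exact-hit case it simply assigns full membership to the matching center, which is one valid choice among the splits allowed by (2.33).

```python
def fcm_memberships(points, centres, m=2.0):
    """Membership update (2.32): weight each centre inversely by relative
    squared distance, fuzzified by the exponent 1/(m-1)."""
    result = []
    for x in points:
        d2 = [sum((a - b) ** 2 for a, b in zip(x, v)) for v in centres]
        if 0.0 in d2:
            # Exact hit on a centre: give it full membership (valid by (2.33)).
            row = [1.0 if d == 0.0 else 0.0 for d in d2]
        else:
            row = [1.0 / sum((d2[i] / d2[j]) ** (1.0 / (m - 1.0))
                             for j in range(len(d2)))
                   for i in range(len(d2))]
        result.append(row)
    return result

memberships = fcm_memberships([(1.0, 0.0), (3.0, 0.0)],
                              [(0.0, 0.0), (4.0, 0.0)])
```

Each row sums to one, so every point spreads a unit of membership over the cluster centers, which is exactly the fuzzy c-partition property.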
2.5 FEATURE SELECTION METHODS
In text categorization one is usually confronted with feature spaces containing 10000
dimensions and more, often exceeding the number of available training examples.
Many have noted the need for feature selection to make the use of conventional
learning methods possible, to improve generalization accuracy, and to avoid
“overfitting” [Joachims 1998].
The most popular approach to feature selection is to select a subset of the available features using methods like Document Frequency Thresholding [Yang et. al., 1997b], Information Gain, the χ² statistic [Schutze et. al., 1995], Mutual Information, and Term Strength. The most commonly used, and often most effective [Yang et. al., 1997b], method for selecting features is the information gain criterion. A short description of these methods is given below.
2.5.1 Document Frequency Thresholding
Document frequency is the number of documents in which a word occurs. In
Document Frequency Thresholding one computes the document frequency for each
word in the training corpus and removes those words whose document frequency is
less than some predetermined threshold. The basic assumption is that rare words are
either non-informative for category prediction, or not influential in global
performance. In either case removal of rare words reduces the dimensionality of the
feature space. Improvement in categorization accuracy is also possible if rare words
happen to be noise words.
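A minimal sketch of Document Frequency Thresholding (function name and sample documents are illustrative):

```python
def df_filter(docs, threshold):
    """Keep only words whose document frequency reaches the threshold."""
    df = {}
    for doc in docs:
        for w in set(doc.split()):       # count each word once per document
            df[w] = df.get(w, 0) + 1
    return {w for w, n in df.items() if n >= threshold}

kept = df_filter(["oil rises", "oil falls", "grain rises today"], 2)
```

Words that appear in only one document ("falls", "grain", "today") are dropped; the surviving words define the reduced feature space.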
17
2.5.2 Information Gain
Information Gain measures the number of bits of information obtained for
category prediction by knowing the presence or absence of a word in a document.
Let $c_1, c_2, \ldots, c_M$ denote the set of categories in the target space. The information gain of word $w$ is defined to be:

$$IG(w) = -\sum_{k=1}^{M} P(c_k)\log P(c_k) + P(w)\sum_{k=1}^{M} P(c_k|w)\log P(c_k|w) + P(\bar{w})\sum_{k=1}^{M} P(c_k|\bar{w})\log P(c_k|\bar{w}) \qquad (2.34)$$

where

$P(c_k)$: the fraction of documents in the total collection that belong to class $c_k$.

$P(w)$: the fraction of documents in which the word $w$ occurs.

$P(c_k|w)$: the fraction of documents from class $c_k$ that have at least one occurrence of word $w$.

$P(c_k|\bar{w})$: the fraction of documents from class $c_k$ that do not contain word $w$.
The information gain is computed for each word of the training set, and the
words whose information gain is less than some predetermined threshold are
removed.
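A small sketch of (2.34), estimating all probabilities as document fractions. The data layout, a list of (word-set, category) pairs, is an assumption made for illustration:

```python
from math import log2

def information_gain(labelled_docs, word):
    """IG of `word` per (2.34); probabilities are document fractions.
    `labelled_docs` is a list of (set_of_words, category) pairs."""
    n = len(labelled_docs)
    cats = {c for _, c in labelled_docs}
    p_w = sum(1 for ws, _ in labelled_docs if word in ws) / n

    def plogp(p):
        return p * log2(p) if p > 0 else 0.0

    with_w = [c for ws, c in labelled_docs if word in ws]
    without = [c for ws, c in labelled_docs if word not in ws]
    ig = 0.0
    for c in cats:
        ig -= plogp(sum(1 for _, cc in labelled_docs if cc == c) / n)
        if with_w:
            ig += p_w * plogp(with_w.count(c) / len(with_w))
        if without:
            ig += (1.0 - p_w) * plogp(without.count(c) / len(without))
    return ig
```

A word that perfectly predicts the category attains the full category entropy (one bit for two balanced categories), while an uninformative word scores near zero.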
2.5.3 Mutual Information
Mutual Information considers the two-way contingency table of a word $w$ and a category $c$. The mutual information between $w$ and $c$ is defined to be:

$$MI(w, c) = \log\frac{P(w \wedge c)}{P(w) \times P(c)} \qquad (2.35)$$

and is estimated using

$$MI(w, c) \approx \log\frac{A \times N}{(A + C) \times (A + B)} \qquad (2.36)$$

where $A$ is the number of times $w$ and $c$ co-occur, $B$ is the number of times $w$ occurs without $c$, $C$ is the number of times $c$ occurs without $w$, and $N$ is the total number of documents.
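The estimate (2.36) transcribes directly into code (the base-2 logarithm is an illustrative choice; any base only rescales the score):

```python
from math import log2

def mutual_information(A, B, C, N):
    """MI estimate (2.36): A = documents containing both w and c,
    B = w without c, C = c without w, N = total number of documents."""
    return log2((A * N) / ((A + C) * (A + B)))
```

When the observed co-occurrence matches what independence predicts, the ratio inside the logarithm is 1 and the score is zero; positive association gives a positive score.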
2.5.4 χ² Statistic

The χ² statistic measures the lack of independence between $w$ and class $c$. It is defined to be:

$$\chi^2(w, c) = \frac{N \times (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)} \qquad (2.37)$$

where $A$ is the number of times $w$ and $c$ co-occur, $B$ is the number of times $w$ occurs without $c$, $C$ is the number of times $c$ occurs without $w$, $D$ is the number of times neither $w$ nor $c$ occurs, and $N$ is still the total number of documents.

Two different measures can be computed based on the χ² statistic:

$$\chi^2_{avg}(w) = \sum_{k=1}^{M} P(c_k)\,\chi^2(w, c_k) \qquad (2.38)$$

or

$$\chi^2_{max}(w) = \max_{k=1}^{M} \chi^2(w, c_k)$$
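A direct transcription of (2.37) from the contingency counts:

```python
def chi_square(A, B, C, D):
    """Chi-square statistic (2.37) from the two-way contingency counts:
    A = w and c together, B = w without c, C = c without w, D = neither."""
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))
```

Independence (A·D equal to C·B) gives zero, and complete dependence drives the score up to N, which is why the per-category scores can be combined by (2.38) or by taking the maximum.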
2.5.5 Term Strength
Term Strength estimates word importance based on how commonly a word is
likely to appear in “closely-related” documents. It uses a training set of documents to
derive document pairs whose similarity is above a threshold. Term Strength is
computed based on the estimated conditional probability that a word occurs in the
second half of a pair of related documents given that it occurs in the first half.
Let $x$ and $y$ be an arbitrary pair of distinct but related documents, and $w$ be a word. The strength of the word is defined to be:

$$TS(w) = P(w \in y \mid w \in x) \qquad (2.39)$$
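The conditional probability (2.39) can be estimated over a list of related document pairs; the pair representation below (sets of words) is an illustrative assumption:

```python
def term_strength(word, related_pairs):
    """TS (2.39): the fraction of related pairs (x, y) with `word` in x
    for which `word` also appears in y."""
    in_x = [(x, y) for x, y in related_pairs if word in x]
    if not in_x:
        return 0.0
    return sum(1 for _, y in in_x if word in y) / len(in_x)

pairs = [({"oil", "opec"}, {"oil"}), ({"oil"}, {"gas"}), ({"wheat"}, {"oil"})]
ts = term_strength("oil", pairs)
```

Words that recur across closely related documents score high, while words specific to a single document score near zero.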
2.6 MULTI-CLASS SVMS
There are many methods for SVMs to solve multi-class classification problems. One approach is to decompose the problem into two-class classification problems. There are two ways to solve multi-class SVMs in this approach [Chih-Wei, et. al., 2002 ; Weston et. al., 1998]:

a) one-against-one classifiers.

b) one-against-the-rest classifiers.

In a), suppose there are $k$ classes to be classified. This method constructs $\frac{1}{2}k(k-1)$ SVM models. Each classifier is trained on two classes; for training on the $i$th class and the $j$th class, one solves the following two-class classification problem:
$$\min_{w^{ij},\, b^{ij},\, \xi^{ij}} \frac{1}{2}(w^{ij})^T w^{ij} + C\sum_t \xi_t^{ij} \qquad (2.40)$$

$$(w^{ij})^T \Phi(x_t) + b^{ij} \ge 1 - \xi_t^{ij}, \quad \text{if } y_t = i \qquad (2.41)$$

$$(w^{ij})^T \Phi(x_t) + b^{ij} \le -1 + \xi_t^{ij}, \quad \text{if } y_t = j \qquad (2.42)$$

$$\xi_t^{ij} \ge 0 \qquad (2.43)$$
The testing can be implemented in many ways; one of them is the so-called "Max Wins" voting strategy. If the sign of $(w^{ij})^T\Phi(x) + b^{ij}$ says that the test data is in the $i$th class, the vote for class $i$ is incremented by one; otherwise class $j$ gets the vote. One then predicts the class with the largest vote. In general, the one-against-one method takes much time to accomplish training, especially when there are many classes, but in real implementations, if one wants better performance, one has little choice but to use it.
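The "Max Wins" voting scheme can be sketched generically; the `pairwise_predict` callback stands in for the $\frac{1}{2}k(k-1)$ trained one-against-one classifiers:

```python
from itertools import combinations

def max_wins(classes, pairwise_predict, x):
    """'Max Wins' voting over the k(k-1)/2 one-against-one classifiers.
    `pairwise_predict(i, j, x)` returns the winning class (i or j)."""
    votes = {c: 0 for c in classes}
    for i, j in combinations(classes, 2):
        votes[pairwise_predict(i, j, x)] += 1
    return max(votes, key=votes.get)
```

With k classes, each prediction queries all pairwise classifiers once, which is why training and testing cost grows quadratically in k.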
In b), the one-against-the-rest method, if one has $k$ classes, one needs to train only $k$ SVM models. This method takes much less time than the one-against-one method. The $i$th class is trained with positive labels while all the other classes are trained with negative labels. So if one has $l$ training data $(x_1, y_1), \ldots, (x_l, y_l)$, where $x_i \in R^N$, $i = 1, \ldots, l$, and $y_i \in \{1, \ldots, k\}$ is the class of $x_i$, then the $i$th classifier solves the following problem:

$$\min_{w^{i},\, b^{i},\, \xi^{i}} \frac{1}{2}(w^{i})^T w^{i} + C\sum_j \xi_j^{i} \qquad (2.44)$$

$$(w^{i})^T \Phi(x_j) + b^{i} \ge 1 - \xi_j^{i}, \quad \text{if } y_j = i \qquad (2.45)$$

$$(w^{i})^T \Phi(x_j) + b^{i} \le -1 + \xi_j^{i}, \quad \text{if } y_j \ne i \qquad (2.46)$$

$$\xi_j^{i} \ge 0, \quad j = 1, \ldots, l \qquad (2.47)$$
Testing is the same as for two-class SVMs. In general, the performance is
usually worse than that of the one-against-one method.
CHAPTER 3
TEXT CATEGORIZATION USING ONE-CLASS SVM
3.1 PROPOSED MODEL
There are three main stages in our proposed model:
a) Data preprocessing stage.
b) Unsupervised learning stage.
c) Supervised learning stage.
The first stage includes Part-Of-Speech (POS) tagging, word stemming, stop-word
removal, and feature selection. Through these processes, we transform the original
raw data into normalized data that can be used in the second stage.
The second stage covers the processes of performing SV clustering: the choice of
kernel function, how clusters are found, and the strategy for judging cluster
validity. We also discuss the problems we faced and the solutions we used.
The last stage concerns the training of the internal and expert node classifiers,
which are derived from the clustering result of the second stage. We first
construct a mapping between the raw data and the cluster centers, and then train all
the classifiers with the one-against-one or one-against-the-rest method.
All the main components and procedures are illustrated in Figure 3.1 as
follows:
Figure 3.1 Our proposed text categorization model
Each component is described in detail below.
3.2 DATA PREPROCESSING
There are four main procedures in the data preprocessing stage:
Fig 3.2 Data Preprocessing Processes (News Documents → Part-Of-Speech Tagger → Stemmer → Stop-Word Filter → Feature Selection → Training Data)
3.2.1 Part-Of-Speech Tagger
In this procedure, a POS tagger [Brill 1994] is introduced to provide POS
information. Each news article is first tagged so that every word carries its
appropriate part-of-speech (POS) tag. News articles are mostly natural-language
text expressing human thought, and in this thesis we assume that the concepts
expressing that thought are mostly carried by noun keywords. The POS tagger module
therefore provides the POS tags required by the feature selection step.
Furthermore, POS tags give important information for deciding contextual
relationships between words. In Figure 3.2, the tagger passes noun words to the
next module, the stemmer. In this way, the module employs natural language
technology to help analyze news articles; consequently, it can be considered a
language model.
For natural language understanding, assigning POS tags to a sentence provides the
information needed to analyze its syntax. The tagger employed in this thesis is the
rule-based POS tagger proposed by Eric Brill in 1992, which learns lexical and
contextual rules for tagging words. The precision of Brill's tagger has been
reported to be higher than 90% [Brill 1995]. There are 37 POS tags in total, as
listed in APPENDIX B. As mentioned above, we select only nouns; the relevant noun
tags are NN, NNS, NNP and NNPS. The following are examples of words after POS
tagging.
N.10/CD S.1/CD
"I/NN think/VB it/PRP is/VBZ highly/RB unlikely/JJ that/IN
American/NNP Express/NNP is/VBZ
Fig 3.3 Words with tagging
3.2.2 Stemming
Frequently, the user specifies a word in a query but only a variant of this word is
present in a relevant document. Plurals, gerund forms, and past tense suffixes are
examples of syntactical variations which prevent a perfect match between a query
word and a respective document word [Ricardo et. al., 1999]. This problem can be
partially overcome with the substitution of the words by their respective stems.
A stem is the portion of a word that is left after the removal of its affixes. A
typical example is the word "calculate", which serves as the stem for the variants
calculation, calculating, calculated, and calculations. Stems are thought to be useful
for improving retrieval performance because they reduce variants of the same root
word to a common concept. Furthermore, stemming has the secondary effect of
reducing the size of the indexing structure because the number of distinct index terms
is reduced [Ricardo et. al., 1999].
Because most variants of a word are generated by the introduction of suffixes, and
because suffix removal is intuitive, simple, and can be implemented efficiently,
several well-known suffix-removal algorithms exist. The most popular one is
Porter's, so we use the Porter algorithm [Porter 1980] to do word
stemming.
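The full Porter algorithm applies several ordered rule phases with measure conditions; purely as an illustration of the idea of suffix stripping, here is a drastically simplified sketch whose ad hoc rules are not Porter's:

```python
def mini_stem(word):
    """Toy suffix stripper: remove the longest matching suffix.

    Illustrative only; the real Porter stemmer [Porter 1980] applies
    ordered rule phases with conditions on the remaining stem.
    """
    for suffix in ("ations", "ation", "ating", "ated", "ate", "ing", "ed", "s"):
        # Require a remaining stem of at least 3 letters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

variants = ["calculate", "calculation", "calculating", "calculated", "calculations"]
stems = {w: mini_stem(w) for w in variants}
print(stems)  # all five variants reduce to the common stem "calcul"
```

Reducing all variants to one stem is exactly the conflation effect described above: distinct index terms collapse into a single term.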
3.2.3 Stop-Word Filter
Words that are too frequent among the documents in the collection are not good
discriminators. In fact, a word that occurs in 80% of the documents in the
collection is useless for retrieval purposes. Such words are frequently referred to
as stop-words and are normally filtered out as potential index terms. Articles,
prepositions, and conjunctions are natural candidates for a list of stop-words.
Elimination of stop-words has an additional important benefit. It reduces the size
of the indexing structure considerably. In fact, it is typical to obtain compression in
the size of the indexing structure of 40% or more solely with the elimination of
stop-words [Ricardo et. al., 1999].
Since stop-word elimination also compresses the indexing structure, the list of
stop-words may be extended to include words other than articles, prepositions, and
conjunctions; for example, some verbs, adverbs, and adjectives could be treated as
stop-words. In this thesis, a list of 306 stop-words is used; the detailed list can
be found in the appendix of this thesis.

The stop-word filter takes noun words as input. A few of these nouns contribute
little to what the author wants to express in the document; they are only auxiliary
words that complete the natural language text. We call them stop words, and for
this reason they must be filtered out to keep noise out of the analysis.

Even after the stop words are filtered, the remaining nouns cannot immediately be
assumed to be fully related to what the author wants to express. From common
writing habits, it is believed that a word whose frequency of occurrence is too low
or too high is not important or representative.
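The filtering step itself is a simple set lookup; a minimal sketch follows, with a tiny illustrative stop list standing in for the 306-word list used in the thesis:

```python
# Illustrative stop list; the thesis uses a 306-word list (see appendix).
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "is", "to"}

def filter_stop_words(tokens):
    """Drop stop words (case-insensitive) from a token stream."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["the", "American", "Express", "earnings", "in", "a", "quarter"]
print(filter_stop_words(tokens))  # → ['American', 'Express', 'earnings', 'quarter']
```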
3.2.4 Feature Selection
In many supervised learning problems, feature selection is important for a variety of
reasons: generalization performance, running time requirements, and interpretational
issues imposed by the problem itself.
One approach to feature selection is to select a subset of the available features.
This small feature subset still retains the essential information of the original
attributes. There are some criteria [Meisel 1972]:
(1) low dimensionality
(2) retention of sufficient information
(3) enhancement of distance in pattern space as a measure of the similarity of
physical patterns, and
(4) consistency of features throughout the sample.
Our test bed is the Reuters data set; a complete description is given in Section
4.1. We choose features for each category and use them to represent each document
in the vector space model of the information retrieval field. The feature selection
method we adopt is frequency-based: the so-called TF-IDF weighting,

w_{t,d} = ( tf_{t,d} / max_{t'} tf_{t',d} ) × log( N / n_t )    (3.1)

where tf_{t,d} is the number of times the word t occurs in document d, n_t is the
number of documents in which the word t occurs, and N is the total number of
documents.
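Equation (3.1) can be written out directly; the counts below are toy corpus statistics, not the Reuters features:

```python
import math

def tfidf(tf_td, max_tf_d, n_t, N):
    """Weight of term t in document d per Eq. (3.1):
    normalized term frequency times inverse document frequency
    (natural logarithm used here; the base is a free choice)."""
    return (tf_td / max_tf_d) * math.log(N / n_t)

# Toy statistics: the term occurs 3 times in d, the most frequent term
# in d occurs 6 times, and the term appears in 10 of 1000 documents.
w = tfidf(tf_td=3, max_tf_d=6, n_t=10, N=1000)
print(round(w, 4))  # → 2.3026
```

Rare terms (small n_t) get large weights, while a term appearing in every document gets weight log(1) = 0, which matches the stop-word discussion above.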
From Section 3.2.1 to Section 3.2.4, we perform the preprocessing steps. The
original text document is now represented as a vector, as the following figure
shows.
Fig 3.4 Representing text as a feature vector.
These vectors are all m×1 dimensional, where m is the total number of features we
select for each category. We then use them as the training data in the
unsupervised learning stage.
3.3 UNSUPERVISED LEARNING
In this stage, the aim is to construct the hierarchical news categories. To perform
SV clustering, we first choose a kernel function in order to map the training data
into a high-dimensional feature space. By tuning the main parameters, we generate
connected components of different shapes and numbers. The task is then to find all
the connected components with a search algorithm [Ellis et. al., 1995]; we use an
adjacency matrix together with the Depth-First Search (DFS) algorithm. After
finding the connected components, we check cluster validity; the strategy we use is
discussed in Section 3.3.4.

In our experience, finding the connected components is time-consuming. To address
this, we sample and reduce the dimensionality of the raw data: sampling is done
with Fuzzy C-Means [Georgej et. al., 1995] and dimension reduction with Principal
Component Analysis [Jolliffe 1986]; these are discussed later. In the following
sections we describe the SV clustering procedure and the approaches we use to
solve the problems we face.
3.3.1 Support Vector Clustering
The use of SV clustering helps us construct a Reuters hierarchy. The learning
algorithm we use is the one proposed by Scholkopf [Scholkopf et. al., 2000]. In
solving the QP problem, the optimization is performed with Sequential Minimal
Optimization (SMO) [Platt 1998]. As already mentioned, the training time of this
algorithm scales roughly between linearly and quadratically with the size of the
training data set. It is much faster than the other existing learning algorithms,
and the SMO learning algorithm can easily be modified to fit our one-class SVM.
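As a minimal sketch of fitting a one-class SVM with a Gaussian kernel, the following uses scikit-learn's OneClassSVM as a stand-in for our SMO implementation; the data and parameter values are illustrative only:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=0.3, size=(200, 2))  # one compact blob

# nu bounds the fraction of outliers/BSVs; gamma plays the role of q.
ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma=2.0).fit(X)

inlier_rate = (ocsvm.predict(X) == 1).mean()
print(f"fraction of training points inside the sphere: {inlier_rate:.2f}")
```

With nu = 0.1, at most roughly 10% of the training points can fall outside the enclosing sphere, mirroring the role of ν in the formulation above.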
We now turn to the procedure of performing SV clustering. As mentioned in Section
2.1.2, in performing SV clustering we have to choose proper values of q and ν. The
choice of q determines the compactness of the enclosing sphere and with it the
number of clusters; the choice of ν helps us deal with overlapping
clusters.
The SV clustering processes are as follows:
Fig 3.5 The SV clustering process [Ben-Hur et. al., 2000]. Given the unlabeled
data set X = {x_1, x_2, ..., x_n} ⊆ R^d, choose a kernel function and fix ν;
starting from q = 0, increase q and, for each (q, ν), use the adjacency matrix and
DFS to find all connected components. If at least two clusters exist and they pass
the cluster validity check, stop; if q is exhausted without a valid clustering,
fix q, increase ν, and repeat.
We explain the above procedures as follows:
3.3.2 The Choice of Kernel Function
In 1992, Vapnik and colleagues [Boser et. al., 1992] showed that the order of
operations for constructing a decision function can be interchanged: instead of
making a non-linear transformation of the input vectors followed by dot products
with SVs in feature space, one can first compare two vectors in input space and
then make a non-linear transformation of the resulting value.
Commonly used kernel functions are as follows:

a) Gaussian RBF kernel: K(x, y) = exp( −q ‖x − y‖² )    (3.2)

b) Polynomial kernel: K(x, y) = (x · y + 1)^d    (3.3)

c) Sigmoid kernel: K(x, y) = tanh( x · y − θ )    (3.4)

We use only the Gaussian kernel, since other kernel functions, such as the
polynomial kernel, do not yield tight contour representations of a cluster
[Tax et. al., 1999]; we will show that the Gaussian kernel is indeed the best
choice for SV clustering in Section 4.3.
3.3.3 Cluster-Finding with Depth First Searching Algorithm
We use graph theory to explain the clustering result: every enclosing sphere is a
connected component, and data points in the same connected component are adjacent.
The task is to find all the connected components.

Define an adjacency matrix A_ij between pairs of points x_i and x_j:

A_ij = 1 if R(y) ≤ R for all y on the line segment connecting x_i and x_j;
A_ij = 0 otherwise.    (3.5)

This tells us whether two data points are adjacent; we still need to group all
mutually adjacent data points into connected components. The algorithm we adopt is
the Depth-First Search (DFS) algorithm. Since every training data point, even a
BSV, belongs to some connected component, DFS can find the connected component to
which each data point belongs.
The connected-components and DFS algorithms are as follows [黃曲江 1989 ;
Ellis 1995]:
procedure ConnectedComponents(adjacencyList: HeaderList; n: integer);
var
  mark: array[VertexType] of integer;
  { Each vertex will be marked with the number of the component it is in. }
  v: VertexType;
  componentNumber: integer;

  procedure DFS(v: VertexType);
  { Does a depth-first search beginning at the vertex v. }
  var
    w: VertexType;
    ptr: NodePointer;
  begin
    mark[v] := componentNumber;
    ptr := adjacencyList[v];
    while ptr <> nil do
    begin
      w := ptr^.vertex;
      output(v, w);
      if mark[w] = 0 then DFS(w);
      ptr := ptr^.link
    end {while}
  end; {DFS}

begin {ConnectedComponents}
  { Initialize mark array. }
  for v := 1 to n do mark[v] := 0;
  { Find and number the connected components. }
  componentNumber := 0;
  for v := 1 to n do
    if mark[v] = 0 then
    begin
      componentNumber := componentNumber + 1;
      output heading for a new component;
      DFS(v)
    end { if v was unmarked }
end {ConnectedComponents}
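The same component labeling can be sketched in Python directly over the adjacency matrix A of Eq. (3.5); the matrix below is a hypothetical example, not an actual clustering result:

```python
def connected_components(adj):
    """Label each vertex with its component number via depth-first search.

    adj is a symmetric 0/1 adjacency matrix (list of lists);
    returns mark[v] = component number of vertex v (numbered from 1).
    """
    n = len(adj)
    mark = [0] * n          # 0 means "not yet visited"
    component = 0
    for v in range(n):
        if mark[v] == 0:
            component += 1
            stack = [v]     # explicit stack instead of recursion
            while stack:
                u = stack.pop()
                if mark[u] == 0:
                    mark[u] = component
                    stack.extend(w for w in range(n)
                                 if adj[u][w] and mark[w] == 0)
    return mark

# Two clusters: {0, 1, 2} and {3, 4}.
A = [[1, 1, 0, 0, 0],
     [1, 1, 1, 0, 0],
     [0, 1, 1, 0, 0],
     [0, 0, 0, 1, 1],
     [0, 0, 0, 1, 1]]
print(connected_components(A))  # → [1, 1, 1, 2, 2]
```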
3.3.4 Cluster Validation
When should the clustering procedure stop? It is natural to use the number of SVs
as an indication of a meaningful solution [Ben-Hur et. al., 2000 ; Ben-Hur et. al.,
2001]. We start with a fixed ν and some value of q, then slowly increase q. The
cluster boundaries become tighter and tighter, more clusters are formed, and the
percentage of SVs increases. When this percentage becomes too high, it is time to
stop the clustering process; in general, the percentage of SVs in the training
data set is about 10%.

If no connected components are found over many values of q, we should increase ν
in order to break the overlapping boundaries. In doing so, many data points lying
in the overlapping boundaries are forced to become so-called Bounded SVs (BSVs),
which are not included in the connected components.

Through all these processes we can construct the complete Reuters hierarchical
categories. We will show the advantage of this hierarchy over the basic flat
Reuters classification in Section 4.4.
Below we address two problems: the first is that the clustering may always yield
only one connected component for our data set; the second is that finding connected
components is so time-consuming that we cannot afford it.
3.3.5 One-Cluster and Time-Consuming Problems
We face two problems in our proposed model:
(1) The clustering result may always tell us there is only one connected
component for our training data set.
(2) The clustering process is time-consuming.
For the first problem, our strategy is to perform dimension reduction in order to
see the influence of the dimensionality on the clustering result; we use PCA for
this purpose.
The second problem can be addressed by sampling. Suppose the training data set is
X,

X = {x_1, x_2, ..., x_n} ⊆ R^d    (3.6)

Building the adjacency matrix takes O(n²m) time, where n is the number of training
data points and m is the partition number in each loop (the number of points
sampled along each connecting line segment).
We first find cluster centers for each category by FCM and use all the cluster
centers as our new training data. We also use SMO to solve our QP problem in
order to reduce training time.
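The PCA step above can be sketched with a plain SVD; the data below is a toy three-dimensional set that varies mostly along one direction, not the document vectors used in the thesis:

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                  # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                     # scores in the reduced space

rng = np.random.default_rng(1)
# 100 points in R^3 lying almost on a line, plus a little noise.
t = rng.normal(size=(100, 1))
X = np.hstack([t, 2 * t, 0.5 * t]) + 0.01 * rng.normal(size=(100, 3))

Z = pca_reduce(X, k=1)
print(Z.shape)  # → (100, 1)
```

Running the clustering on Z instead of X shrinks both the vector dimension and the cost of the adjacency-matrix construction.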
3.4 SUPERVISED LEARNING
In this stage we train our gateway node classifiers and expert node classifiers.
In our hierarchy, every gateway node or expert node is a two-class or multi-class
SVM classifier. We first look at the construction of the Reuters categories.
3.4.1 REUTERS CATEGORY CONSTRUCTION
D’Alessio organized the 135 Reuters categories into a 2-level hierarchy, as
summarized in Fig 3.6, which includes counts for selected individual leaf
categories and summaries by upper-level supercategories [D’Alessio et. al., 2000].

Fig 3.6 Reuters basic hierarchy [D’Alessio et. al., 2000]

This is the flat construction of the Reuters categories: there is no structure
defining the relationships among them. We now discuss the motivation for using a
hierarchy.
There are two strong motivations for taking the hierarchy into account. First,
experience to date has demonstrated that both precision and recall decrease as the
number of categories increases [ Yang 1997a ; Apte et. al., 1994 ]. One of the reasons
for this is that as the scope of the corpus increases, terms become increasingly
polysemous. This is particularly evident for acronyms, which are often limited by
the number of 3- and 4-letter combinations and are reused from one domain to
another.
The second motivation for doing categorization within a hierarchical setting is
that it affords the ability to deal with very large problems. As the number of
categories grows, the need for domain-specific vocabulary grows as well. Thus, we
quickly reach the point where the index no longer fits in memory and we are trading
accuracy against speed and software complexity. On the other hand, by treating the
problem hierarchically, we can decompose it into several problems, each involving a
smaller number of categories and a smaller domain-specific vocabulary, and perhaps
yield savings of several orders of magnitude [D’Alessio et. al., 1998].
The hierarchical construction of Reuters categories we build through the use of
SV clustering is as follows:
Fig 3.7 Our Proposed Hierarchy
As mentioned in Section 2.6, there are two ways to train a multi-class SVM
classifier: the one-against-one method and the one-against-the-rest method. We use
both in training the gateway node classifiers and expert node classifiers. We now
address how the node classifiers in the hierarchical categories are
trained.
3.4.2 The Mapping Strategy
When the sampling strategy is used, cluster centers represent the original category
data, so we must build a mapping between the raw data and the category centers. We
propose two methods to solve this problem:

Method 1: Direct-Map
Suppose there are l training data points

x_1, ..., x_l    (3.7)

and m centers,

c_1, ..., c_m    (3.8)

For every x_i, i = 1, ..., l, we calculate the Euclidean distance between the point
x_i and each center c_j, j = 1, ..., m. Define the squared distance:

d_ij = Σ_t ( x_it − c_jt )²    (3.9)

and then find the minimum over the centers for each i:

min_j d_ij    (3.10)

This builds a mapping between the raw data and the centers.
Method 2: Without-Map
Since we already know what the hierarchy looks like, we can directly use the raw
data to build the tree. In training the gateway node and expert node classifiers,
we solve multi-class and two-class classification
problems.
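The Direct-Map rule of Eqs. (3.9)–(3.10) amounts to a nearest-center assignment; a small sketch with toy points and centers:

```python
import numpy as np

def direct_map(X, C):
    """Map each row of X to the index of its nearest center in C,
    using the squared Euclidean distance of Eq. (3.9)."""
    # d[i, j] = ||x_i - c_j||^2, computed for all pairs at once
    d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)             # Eq. (3.10): nearest center per point

X = np.array([[0.1, 0.0], [0.9, 1.1], [0.2, 0.1]])   # toy raw data
C = np.array([[0.0, 0.0], [1.0, 1.0]])               # toy cluster centers
print(direct_map(X, C))  # → [0 1 0]
```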
3.4.3 Gateway Node Classifier
We try two ways to train the gateway node classifiers. The first combines SV
clustering with the Direct-Map strategy; the second trains a two-class or
multi-class SVM classifier directly. With the Direct-Map strategy, the result is
much closer to our original idea, but for convenience in building all the tree node
classifiers, we use both methods in order to obtain better
results.

The gateway node classifier controls where the test data goes next: according to
its output, the test data is routed to the corresponding predefined node.
3.4.4 Expert Node Classifier
The expert node classifiers are two-class or multi-class SVMs. The training of a
two-class SVM is standard, and we use two-class SVMs to implement the multi-class
SVMs.

An expert node classifier tells us which category the test data belongs to. In
Figure 3.7 we can see that the expert node classifier “corporate” has to
distinguish 2 categories, so this node classifier is a two-class SVM, whereas the
expert node “commodity” has to distinguish 53 categories, so this node classifier
is a multi-class SVM.

The categories in each expert node classifier are different, so the training times
also differ. The one-against-the-rest training method would save us much time but
may lead to poor testing results, so we would rather adopt the one-against-one
training method in order to gain better results.
CHAPTER 4
EXPERIMENT DESIGN AND ANALYSIS
4.1 The Corpus
In our experiments, we used Reuters-21578, Distribution 1.0, which comprises
21,578 documents, representing what remains of the original Reuters-22173 corpus
after the elimination of 595 duplicates by Steve Lynch and David Lewis in 1996
[Lewis 1996]. David Lewis made it publicly available for text categorization
research [Lewis 1992b]. Several researchers have used it for their experiments, and
it has the potential to become a standard test bed. In the following sections, we
describe the corpus at the level of documents and categories and present our
hierarchical architecture of the categories in the data set.
4.1.1 Documents
The Reuters-21578 collection is distributed in 22 files. Each of the first 21 files
(reut2-000.sgm through reut2-020.sgm) contains 1000 documents, while the last
(reut2-021.sgm) contains 578 documents. The size of the corpus is 28,329,363 bytes,
yielding an average document size of 1,312 bytes per document. The documents are
“categorized” along five axes: EXCHANGES, ORGANIZATIONS, PEOPLE, PLACES, and
TOPICS. We consider only the categorization along the TOPICS axis. Close to half
of the documents (10,211) have no topic, and we do not include these documents in
either the training or the testing sets. Unlike Lewis (acting for consistency with
earlier studies), the documents that we consider no-category are those that have no
categories listed between the topic tags in the Reuters-21578 corpus’ documents.
This leaves 11,367 documents with one or more topics. Most of these documents
(9,494) have only a single topic.
4.1.2 File Format
Each of the 22 files begins with a document type declaration line:
<!DOCTYPE lewis SYSTEM "lewis.dtd">
Each article starts with an "open tag" of the form
<REUTERS TOPICS=?? LEWISSPLIT=?? CGISPLIT=?? OLDID=?? NEWID=??>
where the ?? are filled in an appropriate fashion. Each article ends with a "close tag"
of the form:
</REUTERS>
Each REUTERS tag contains explicit specifications of the values of five
attributes, TOPICS, LEWISSPLIT, CGISPLIT, OLDID, and NEWID. These attributes
are meant to identify documents and groups of documents. The values of the attributes
determine how the documents are divided into a training set and a testing set. In the
experiments described in this thesis, we use the Modified Lewis Split, which is the
one most used in the literature. This split is achieved by the following choice of
parameters:
Training Set (13,625 docs): LEWISSPLIT="TRAIN"; TOPICS="YES" or "NO"
Test Set ( 6,188 docs) : LEWISSPLIT="TEST"; TOPICS="YES" or "NO"
Unused ( 1,765 docs) : LEWISSPLIT="NOT-USED" or TOPICS="BYPASS"
The attributes CGISPLIT, OLDID, and NEWID were ignored in our experiments.
In our experiments, we used 9603 documents for training, and 3299
documents for testing.
4.1.3 Document Internal Tags
Just as the <REUTERS> and </REUTERS> tags serve to delimit documents
within a file, other tags are used to delimit elements within a document. There are
<DATE>, <MKNOTE>, <TOPICS>, <PLACES>, <PEOPLE>, <ORGS>, <EXCHANGES>,
<COMPANIES>, <UNKNOWN>, <TEXT>. Of these, only <TOPICS> and
<TEXT> were used in our experiments.
<TOPICS>, </TOPICS>
These tags enclose the list of TOPICS categories, if any, for the document. If
TOPICS categories are present, each is delimited by the tags <D> and </D>.
<TEXT>, </TEXT>
All the textual material of each story is delimited between a pair of these tags.
Some control characters and other "junk" material may also be included. The white
space structure of the text has been preserved. The following tags optionally delimit
elements inside the TEXT elements: <AUTHOR>, </AUTHOR>, <DATELINE>,
</DATELINE>, <TITLE>, </TITLE>, <BODY>, </BODY>. Of these, we use only
the last two, which enclose the main text of the documents.
A few sample documents are shown in Figure 4.1. The average length of the
stories is 90.6 words, but some are as short as a sentence and others as long as several
pages.
Figure 4.1 Sample news stories in the Reuters-21578 corpus.
4.1.4 Categories
There are 135 different categories defined for the Reuters data; 96 of them occur
in our training set. The number of categories assigned to each document ranges
from 1 to 14, but the average is only 1.24 categories per document. The frequency
of occurrence varies greatly from category to category. For example, earn appears
in 3987 documents, while 78 categories (i.e., more than 50%) have each been
assigned to fewer than 10 documents. In fact, 15 categories have not been assigned
to any documents. Table 4.1 lists the categories sorted in decreasing order of
frequency (the number of documents with that category).
Table 4.1 The list of categories, sorted in decreasing order of frequency.
4.2 DOCUMENT REPRESENTATION ANALYSIS
We consider the impact of document representation in one-class SVM. In our study
of one-class SVM we found an interesting phenomenon: the choice of data
representation for one-class SVM is somewhat different from that for two-class or
multi-class SVMs. We illustrate the phenomenon as follows.

In the information retrieval field, it is generally agreed that a frequency
representation of the documents is best in the vector space model. In [Manevitz
et. al., 2001], however, Manevitz found that the binary representation achieves
better results than other representations. Since this conflicts with common sense,
we decided to perform one-class classification on the ten most frequent Reuters
categories and compare our results with those of that paper.

Following [Manevitz et. al., 2001], 25% of the data is used for training and the
rest for prediction. The following table shows the data set we use.
Table 4.2 Number of Training/Test Items

Category Name    Num Train    Num Test
Earn             966          2902
Acquisitions     590          1773
Money-fx         187          563
Grain            151          456
Crude            155          465
Trade            133          401
Interest         123          370
Ship             65           195
Wheat            62           186
Corn             62           184
We compare two different data representations: the binary representation and the
tf-idf representation. The feature dimension is fixed at 10 and at 20 by
performing PCA on the original feature set, and the Gaussian kernel function is
used.
According to the text categorization survey by Sebastiani [Sebastiani 2001], the
most commonly used performance measures in flat classification are the classic
information retrieval notions of precision and recall. Neither precision nor
recall can effectively measure classification performance in isolation
[Sebastiani 2001]. Therefore, the performance of text categorization has often been
measured by a combination of the two measures; the most popular is the F_β measure.

The F_β measure was proposed by Rijsbergen [Rijsbergen 1979]. It is a single score
computed from the precision and recall values according to the user-defined
relative importance (i.e., β) of precision and recall. Normally β = 1 is used
[Sebastiani 2001]. Precision, recall, and the F_1 measure are defined as:

Precision = (number of items of the category identified) / (total items assigned to the category)    (4.1)

Recall = (number of items of the category identified) / (number of category members in the test set)    (4.2)

F_1(Precision, Recall) = (2 · Precision · Recall) / (Precision + Recall)    (4.3)
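Equations (4.1)–(4.3) can be computed for a single category as follows; the item sets below are toy predictions, not our experimental data:

```python
def precision_recall_f1(assigned, relevant):
    """Compute Eqs. (4.1)-(4.3) for one category.

    assigned: set of items the classifier put in the category;
    relevant: set of items that truly belong to it.
    """
    tp = len(assigned & relevant)                     # correctly identified
    precision = tp / len(assigned) if assigned else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = precision_recall_f1(assigned={1, 2, 3, 4}, relevant={2, 3, 4, 5, 6})
print(p, r, round(f, 3))  # → 0.75 0.6 0.667
```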
The following two tables show the results: Table 4.3 for data with 10 features and
Table 4.4 for data with 20 features, each comparing the binary and tf-idf
representations. We see that the binary representation achieves better
performance than the tf-idf representation for both the 10-feature and the
20-feature data.
Table 4.3 F_1 measure in one-class SVM with vector dimension 10

              Our results
              binary   tf-idf   Linear   Sigmoid   Polynomial   Rbf
earn          0.676    0.702    0.409    0.676     0.982        0.890
acq           0.483    0.481    0.185    0.482     0.991        0.897
money-fx      0.541    0.516    0.074    0.514     0.960        0.860
grain         0.585    0.533    0.084    0.585     0.970        0.883
crude         0.545    0.532    0.441    0.544     0.920        0.867
trade         0.445    0.476    0.363    0.597     0.965        0.880
interest      0.473    0.454    0.145    0.485     0.918        0.894
ship          0.563    0.518    0.025    0.539     0.917        0.847
wheat         0.474    0.450    0.619    0.474     0.956        0.825
corn          0.293    0.339    0.036    0.298     0.835        0.809
Table 4.4 F_1 measure in one-class SVM with vector dimension 20

              Our results
              binary   tf-idf   Linear   Sigmoid   Polynomial   Rbf
earn          0.652    0.686    0.678    0.321     0.973        0.891
acq           0.488    0.489    0.491    0.194     0.988        0.905
money-fx      0.487    0.494    0.503    0.084     0.956        0.901
grain         0.504    0.504    0.487    0.071     0.974        0.831
crude         0.496    0.496    0.485    0.111     0.912        0.905
trade         0.441    0.441    0.483    0.239     0.960        0.927
interest      0.440    0.425    0.425    0.092     0.915        0.907
ship          0.220    0.219    0.310    0.025     0.913        0.813
wheat         0.449    0.449    0.420    0.097     0.926        0.860
corn          0.376    0.376    0.352    0.029     0.820        0.730
We explain why this happens as follows. Consider X ⊆ R^n, X = {x_1, x_2, ..., x_n}.
For any two vectors x_i, x_j, i, j ∈ {1, 2, ..., n}, the distance between x_i and
x_j in the feature space is

‖Φ(x_i) − Φ(x_j)‖ = sqrt( Φ(x_i)·Φ(x_i) − 2 Φ(x_i)·Φ(x_j) + Φ(x_j)·Φ(x_j) )
                  = sqrt( 2 − 2 exp( −q ‖x_i − x_j‖² ) )    (4.4)

so the distance between x_i and x_j in the feature space increases monotonically
with their distance in the input space.

Now consider the corresponding binary representations of x_i and x_j, say y_i and
y_j, where

y_ik = 1 if x_ik > 0,    (4.5)
y_ik = 0 if x_ik = 0,    (4.6)

and k indexes the features. Then

0 < ‖y_i − y_j‖ ≤ ‖x_i − x_j‖    (4.7)

so the tf-idf representation results in a longer distance between the two vectors
in the feature space than the binary representation. Now consider the original
quadratic program in one-class SVM:

min_{R ∈ R, ξ ∈ R^l, c ∈ F}  R² + (1/(νl)) Σ_i ξ_i    (4.8)

subject to ‖Φ(x_i) − c‖² ≤ R² + ξ_i,  ξ_i ≥ 0 for i ∈ [l]

The goal is to minimize the volume of the enclosing sphere. Since the distance
between any two vectors in the feature space under the Gaussian kernel is at most
√2, and since, as shown above, the tf-idf representations of two input vectors are
farther apart in the feature space than their binary representations, vectors under
the binary representation are much more compact than under the tf-idf
representation. This makes the one-class SVM classification easier, so the
classification result is better than with the tf-idf representation. This is why
the performance of the binary representation is better than that of the tf-idf
representation in our experiments.
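The compactness argument of Eq. (4.4) can be checked numerically: under the Gaussian kernel, the feature-space distance is sqrt(2 − 2·exp(−q‖x − y‖²)), which grows with the input-space distance and is bounded by √2. The vectors below are hypothetical tf-idf values and their binary counterparts:

```python
import numpy as np

def feature_distance(x, y, q):
    """||Phi(x) - Phi(y)|| for the Gaussian kernel, per Eq. (4.4)."""
    d2 = np.sum((np.asarray(x) - np.asarray(y)) ** 2)
    return np.sqrt(2.0 - 2.0 * np.exp(-q * d2))

q = 1.0
tfidf_x, tfidf_y = [2.3, 0.0, 1.7], [0.0, 1.2, 1.7]   # example tf-idf vectors
bin_x,   bin_y   = [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]   # their binary versions

d_tfidf = feature_distance(tfidf_x, tfidf_y, q)
d_bin = feature_distance(bin_x, bin_y, q)
# tf-idf pair is farther apart in feature space; both distances stay below sqrt(2)
print(d_tfidf > d_bin, d_tfidf < np.sqrt(2), d_bin < np.sqrt(2))
```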
4.3 THE CHOICE OF KERNEL FUNCTION
To verify that only the Gaussian kernel function works well for SV clustering, we
first compare different kernel functions on the Iris data set [Blake et. al.,
1998]. The data set contains 150 instances, each composed of four measurements of
an iris flower. There are three types of flowers, represented by 50 instances
each. We use PCA for dimension reduction and choose the first and second principal
components as our training set.

For convenient visualization, we use LIBSVM 2.0 to perform the experiments. This
is an integrated tool for support vector classification and regression that can
handle one-class SVM using Scholkopf et al.'s algorithm. LIBSVM 2.0 is available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm.
We investigate the two most important and frequently used kernel functions, the
polynomial kernel and the Gaussian kernel:

(i) Polynomial kernel: K(x, y) = (x · y + 1)^d    (4.9)

(ii) Radial basis function (Gaussian kernel): K(x, y) = exp( −q ‖x − y‖² )    (4.10)
The experimental results are as follows:

Fig 4.2 One-Class SVM with polynomial kernel where d = 2.

Fig 4.3 One-Class SVM with Gaussian kernel where q = 100, ν = 0.1.

We see that only the Gaussian kernel forms a tight enclosing sphere for the data
in the original space. We now explain this result by examining the two kernel
functions.
The polynomial kernel function is

    K(x_i, x_j) = (x_i · x_j + 1)^d,    (4.11)

for any two vectors x_i, x_j, i, j = 1, ..., m, where m is the number of training vectors. Raising the inner product to a power n gives

    (x_i · x_j)^n = cos^n(θ_ij) ‖x_i‖^n ‖x_j‖^n,    (4.12)

where θ_ij is the angle between x_i and x_j. For data points that are not centered around the origin, this angle is small, the value of cos^n(θ_ij) tends to 1, and so (x_i · x_j)^n ≈ ‖x_i‖^n ‖x_j‖^n. The kernel thus stretches the data in the feature space according to the vector norms, making it hard to form a proper enclosing sphere. In Fig 4.2 we can see that the polynomial kernel is not suitable for SV clustering.
In Fig 4.3 we find that the Gaussian kernel does the best job for SV clustering: the enclosing sphere is tighter. We conclude that the Gaussian kernel is the best choice for SV clustering, and we use it in our experiments.
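The contrast between the two kernels can be checked numerically; the sample points below are arbitrary:

```python
import numpy as np

def poly_kernel(x, y, d=2):
    """Polynomial kernel of Eq. (4.9)."""
    return (np.dot(x, y) + 1) ** d

def gaussian_kernel(x, y, q=1.0):
    """Gaussian kernel of Eq. (4.10)."""
    return np.exp(-q * np.linalg.norm(x - y) ** 2)

x = np.array([3.0, 4.0])
y = np.array([3.5, 4.5])   # two nearby points far from the origin

# The polynomial kernel value is dominated by the vector norms ...
print(poly_kernel(x, y))       # (3*3.5 + 4*4.5 + 1)^2 = 870.25
# ... while the Gaussian kernel depends only on the distance and
# always stays in (0, 1], so the feature-space image is not stretched.
print(gaussian_kernel(x, y))   # exp(-0.5) ≈ 0.6065
```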
4.4 HIERARCHICAL TEXT CATEGORIZATION PERFORMANCE ANALYSIS
The hierarchical construction we build is a category tree. Whenever a document comes in, it is first classified at the root level of the category hierarchy into positive or negative categories. The classification is then repeated in each of the subcategories until the document reaches some internal or leaf category.
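The top-down routing described above can be sketched as a recursive walk over a category tree; the node class, the keyword-based classifiers, and the toy tree are all hypothetical stand-ins for the trained SVMs:

```python
# A minimal sketch of top-down routing in a category tree. Each node holds
# a (hypothetical) classifier deciding whether a document belongs to it.
class CategoryNode:
    def __init__(self, name, accepts, children=()):
        self.name = name
        self.accepts = accepts          # callable: document -> bool
        self.children = list(children)

def route(node, doc, path=None):
    """Push the document down every accepting branch until a leaf
    (or an internal node with no accepting child) is reached."""
    path = (path or []) + [node.name]
    accepting = [c for c in node.children if c.accepts(doc)]
    if not accepting:
        return [path]                   # the document stops at this node
    results = []
    for child in accepting:
        results.extend(route(child, doc, path))
    return results

# Toy tree: root -> {energy -> {grain}, commodity -> {wheat}}
grain = CategoryNode("grain", lambda d: "grain" in d)
energy = CategoryNode("energy", lambda d: "oil" in d or "grain" in d, [grain])
wheat = CategoryNode("wheat", lambda d: "wheat" in d)
commodity = CategoryNode("commodity", lambda d: "wheat" in d, [wheat])
root = CategoryNode("root", lambda d: True, [energy, commodity])

print(route(root, "grain futures report"))
# [['root', 'energy', 'grain']]
```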
We compare our hierarchy with both non-hierarchical and hierarchical classifiers in order to identify the advantages of the classifier built with one-class SVM. We focus on the following two questions:
(1) Does the hierarchical classifier based on our SV clustering results improve
performance when compared to a non-hierarchical classifier?
(2) How does our hierarchical method compare with other text categorization
methods?
We first discuss the performance analysis tools. As mentioned in Section 4.2, precision, recall, and the F1 measure are the standard choices for measuring two-class classification problems. Note, however, that these three measures were designed for flat classification: they largely ignore the parent-child and sibling relationships between categories in a hierarchy. We therefore also consider measures that fit our requirements.
Let TP_i be the set of documents correctly classified into category C_i; FP_i the set of documents wrongly classified into C_i; FN_i the set of documents wrongly rejected from C_i; and TN_i the set of documents correctly rejected. The standard precision and recall are defined as follows:

    Pr_i = |TP_i| / (|TP_i| + |FP_i|)    (4.13)

    Re_i = |TP_i| / (|TP_i| + |FN_i|)    (4.14)
Based on the standard precision and recall for each category, the overall precision and recall for the whole category space can be obtained in two ways: Micro-Average and Macro-Average. Micro-Average gives equal importance to each document, while Macro-Average gives equal importance to each category [Yang 1997a]. In our experiments, we use only the Micro-Average measure.
The Micro-Average is defined as:

    Pr_u = Σ_{i=1}^{m} |TP_i| / Σ_{i=1}^{m} (|TP_i| + |FP_i|)    (4.15)

    Re_u = Σ_{i=1}^{m} |TP_i| / Σ_{i=1}^{m} (|TP_i| + |FN_i|)    (4.16)
where m is the number of categories.
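Equations (4.15) and (4.16) pool the per-category counts before dividing, which is what gives frequent categories more weight; a minimal sketch with illustrative counts:

```python
# Micro-averaged precision/recall: pool TP/FP/FN counts over all
# categories before dividing (Eqs. 4.15-4.16). Counts are illustrative.
def micro_average(counts):
    tp = sum(c["tp"] for c in counts)
    fp = sum(c["fp"] for c in counts)
    fn = sum(c["fn"] for c in counts)
    return tp / (tp + fp), tp / (tp + fn)

per_category = [
    {"tp": 80, "fp": 10, "fn": 5},    # e.g. a frequent category
    {"tp": 5,  "fp": 5,  "fn": 15},   # e.g. a rare category
]
prec, rec = micro_average(per_category)
print(round(prec, 3), round(rec, 3))  # 0.85 0.81
```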
We also use another combination of the precision and recall measures, the Break-Even Point (BEP). The BEP, proposed by Lewis [Lewis 1992a], is defined as the point at which precision and recall are equal. In our experiments there may be several break-even points for a category; we choose the smallest one as the final result.
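Operationally, the BEP can be found by sweeping a decision threshold over the classifier scores and taking the point where precision and recall coincide (or are closest); the scores and labels below are invented for illustration:

```python
# Break-even point sketch: rank documents by score, sweep the cut-off
# rank, and report the point where precision and recall are closest.
def break_even_point(scores, labels):
    pairs = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    best, tp = None, 0
    for rank, (_, y) in enumerate(pairs, start=1):
        tp += y
        prec, rec = tp / rank, tp / total_pos
        if best is None or abs(prec - rec) < abs(best[0] - best[1]):
            best = (prec, rec)
    return (best[0] + best[1]) / 2

scores = [0.9, 0.8, 0.7, 0.6, 0.5]   # illustrative classifier scores
labels = [1,   1,   0,   1,   0]     # 1 = relevant to the category
print(break_even_point(scores, labels))
# 2/3: at rank 3, precision = recall = 2/3
```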
Table 4.5 Precision/Recall break-even point on the ten most frequent Reuters categories for basic SVMs and the proposed approach.

Category    SVM (poly), d =                 SVM (rbf), γ =           Proposed
            1     2     3     4     5       0.6   0.8   1.0   1.2    Approach
earn        98.2  98.4  98.5  98.4  98.3    98.5  98.5  98.4  98.3   99.0
acq         92.6  94.6  95.2  95.2  95.3    95.0  95.3  95.3  95.4   99.1
money-fx    66.9  72.5  75.4  74.9  76.2    74.0  75.4  76.3  75.9   97.5
grain       91.3  93.1  92.4  91.3  89.9    93.1  91.9  91.9  90.6   83.5
crude       86.0  87.3  88.6  88.9  87.8    88.9  89.0  88.9  88.2   88.7
trade       69.2  75.5  76.6  77.3  77.1    76.9  78.0  77.8  76.8   100
interest    69.8  63.3  67.9  73.1  76.2    74.4  75.0  76.2  76.1   94.7
ship        82.0  85.4  86.0  86.5  86.0    85.4  86.5  87.6  87.1   98.9
wheat       83.1  84.5  85.2  85.9  83.8    85.2  85.9  85.9  85.9   76.2
corn        86.0  86.5  85.3  85.7  83.9    85.1  85.7  85.7  84.5   63.0
microavg    84.2  85.1  85.9  86.2  85.9    86.4  86.5  86.3  86.2   95.4
Table 4.5 compares the performance of our proposed hierarchical model with the non-hierarchical SVMs [Joachims 1998] on the 10 most frequent categories. Our proposed hierarchical construction does somewhat better overall, but performs poorly on categories such as grain, crude, wheat, and corn. We now try to explain why these four categories behave worse than the others.
We can use the hierarchical construction itself to explain this. The hierarchical categories are as follows:

Fig 4.4 Our Proposed Hierarchical Construction

The grain category lies under the expert node "energy", which contains 9 subcategories. We build a multi-class SVM classifier to distinguish these 9 subcategories. As mentioned in the literature review, we implement this multi-class SVM with 9 two-class SVMs, each distinguishing one category from the other 8. In our experiments we find that these 9 categories overlap considerably and share many documents, so distinguishing them is difficult.
What about the other three categories: crude, wheat, and corn? These lie under the expert node "commodity", which must distinguish 53 sub-categories, including crude, wheat, and corn. We first tried the one-against-the-rest training method, implementing one classifier per category, but found the classification results poor. We therefore took the alternative, the one-against-one training method, building one classifier for every pair of categories, for a total of C(53, 2) = 1378 classifiers. As before, these 53 categories are very close to one another, so the classification results in the hierarchical construction are poor.
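The classifier counts for the two training schemes follow directly from combinatorics:

```python
from math import comb

def one_vs_rest_count(k):
    return k                 # one binary classifier per category

def one_vs_one_count(k):
    return comb(k, 2)        # one classifier per unordered pair

# The "commodity" expert node distinguishes 53 sub-categories.
print(one_vs_rest_count(53))  # 53
print(one_vs_one_count(53))   # 53 * 52 / 2 = 1378
```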
Table 4.6 Comparison of the proposed classifier with k-NN and decision tree classifiers.

            k-NN (k = 30)            Decision Tree            Proposed Approach
Class       Recall  Prec.   F1       Recall  Prec.   F1       Recall  Prec.   F1
earn        0.950   0.920   0.935    0.953   0.966   0.978    0.995   0.990   0.992
acq         1.000   0.910   0.953    0.961   0.953   0.957    0.991   0.993   0.991
money-fx    0.920   0.650   0.762    0.771   0.758   0.764    0.987   0.975   0.980
grain       0.960   0.700   0.810    0.953   0.916   0.934    0.891   0.887   0.888
crude       0.820   0.750   0.783    0.926   0.850   0.886    0.865   0.835   0.849
trade       0.890   0.660   0.758    0.812   0.704   0.754    1       1       1
interest    0.800   0.710   0.752    0.649   0.933   0.766    0.947   1       0.972
ship        0.850   0.770   0.808    0.769   0.861   0.812    0.930   0.762   0.837
wheat       0.690   0.730   0.709    0.972   0.831   0.894    0.989   1       0.994
corn        0.350   0.760   0.479    0.982   0.821   0.984    0.918   0.630   0.747
Average     0.823   0.756   0.788    0.879   0.879   0.879    0.951   0.907   0.925
Table 4.6 shows the performance of our proposed hierarchical model against the decision tree [Weiss et al., 1999] and the k-NN classifiers [Aas et al., 1999], again for the most frequent categories. The k-NN classifier is widely regarded as the best of the conventional methods, replicating the findings of [Yang 1997b]. Our hierarchical construction performs better than k-NN on every category except grain. For the same reason as in the previous experiment, we perform worse than the decision tree classifier on those four categories, but in general the overall performance of our proposed approach is the best.
CHAPTER 5
CONCLUSIONS AND FUTURE WORK
5.1 Conclusions
Through the use of SV clustering, combining one-class SVM and SMO, the hierarchical construction over the Reuters categories is built automatically. The hierarchical category structure combines supervised and unsupervised learning.

In order to overcome the time-consuming process of finding connected components, we adopt dimension reduction and sampling, and the SMO learning algorithm is used to increase the learning speed. In the training process we then use a mapping strategy that maps each training datum to its corresponding cluster center.
The kernel method is important in our experiments: we show that only the Gaussian kernel works well in the process of SV clustering. One-class SVMs differ from two-class and multi-class SVMs in several respects; one of them is the influence of the data representation. We show that the binary representation yields the best performance among the data representations tested for one-class SVM classification.
Finally, we show that text categorization performance can be raised through our hierarchical construction. We compare the result with non-hierarchical SVM classifiers, and the experimental results tell us that our hierarchical classifiers do perform better. We then compare with non-SVM classifiers such as k-NN and decision trees; our proposed classifier also outperforms these two.
In summary, the contributions of this thesis are as follows:
1) Constructing the hierarchical categories automatically.
2) Exploring the characteristics of one-class SVM through experiments.
5.2 Future Work
We identify two areas for future improvement of our system. The first is the sampling step. Because SV clustering takes too long on the raw data, we perform sampling and dimension reduction with FCM and PCA for every category. This greatly decreases the clustering time, but it forces us to adopt a compensating strategy, and we face the risk that the clustering result is worse than it would be if we performed SV clustering on the original raw data. We hope that techniques such as SV mixtures, or other methods, will let us perform SV clustering on the original data and use that clustering result for text categorization.
The second area is the training of the expert-node classifiers. A category such as commodity contains 53 sub-categories, which results in poor classification performance. A more useful and powerful method for training this kind of classifier could improve its classification accuracy.
References
Aas K. and Eikvil L., Text categorization: a survey. Report No. 941, Norwegian Computing Center, ISBN 82-539-0425-8, June 1999.
Apte C., Damerau F., and Weiss S. M., Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, pages 233-251, 1994.
Ben-Hur A., Horn D., Siegelmann H. T., and Vapnik V., A support vector clustering method. In International Conference on Pattern Recognition, 2000.
Ben-Hur A., Horn D., Siegelmann H. T., and Vapnik V., Support vector clustering. Journal of Machine Learning Research, volume 2, pages 125-137, 2001.
Blake C. L. and Merz C. J., UCI repository of machine learning databases, 1998.
Brill E., Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Computational Linguistics, 1995.
Bishop C., Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995.
Boser B. E., Guyon I., and Vapnik V. N., A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, volume 5, pages 144-152, Pittsburgh, ACM, 1992.
Brill E., Rule-based part of speech tagger, version 1.14, 1994.
Hsu C.-W. and Lin C.-J., A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, volume 13(2), March 2002.
Cortes C. and Vapnik V., Support-vector networks. Machine Learning, volume 20(3), pages 273-297, 1995.
D'Alessio S., Kershenbaum A., Murray K., and Schiaffino R., Category levels in hierarchical text categorization. In Proceedings of the Third Conference on Empirical Methods in Natural Language Processing (EMNLP-3), 1998.
D'Alessio S., Kershenbaum A., Murray K., and Schiaffino R., The effect of using hierarchical classifiers in text categorization. In Proceedings of the 6th International Conference Recherche d'Information Assistée par Ordinateur (RIAO-00), pages 302-313, Paris, France, 2000.
Duda R. O., Hart P. E., and Stork D. G., Pattern Classification, 2nd edition. Wiley, 2000.
Horowitz E., Sahni S., and Mehta D., Fundamentals of Data Structures in C++. Computer Science Press, New York, 1995.
Frakes W. B. and Baeza-Yates R., Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.
Klir G. J. and Yuan B., Fuzzy Sets and Fuzzy Logic. Prentice Hall International Editions, 1995.
Golub G. and Van Loan C., Matrix Computations, 3rd edition. Johns Hopkins, Baltimore, 1996.
Hayes P. and Weinstein S., Construe/TIS: a system for content-based indexing of a database of news stories. In Annual Conference on Innovative Applications of AI, 1990.
Japkowicz N., Myers C., and Gluck M., A novelty detection approach to classification. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 518-523, 1995.
Joachims T., Text categorization with support vector machines: learning with many relevant features. LS-8 Report 23, 1998.
Jolliffe I. T., Principal Component Analysis. Springer-Verlag, 1986.
Koller D. and Sahami M., Hierarchically classifying documents using very few words. In International Conference on Machine Learning, volume 14, Morgan Kaufmann, 1997.
Lang K., NewsWeeder: learning to filter netnews. In International Conference on Machine Learning (ICML), 1995.
Lewis D. D., An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 37-50, 1992a.
Lewis D. D., Representation and learning in information retrieval. Ph.D. thesis, Computer Science Department, University of Massachusetts at Amherst, February 1992b.
Lewis D. D., Reuters-21578 collection, 1996.
Manevitz L. M. and Yousef M., One-class SVMs for document classification. Journal of Machine Learning Research, volume 2, pages 139-154, 2001.
Meisel W. S., Computer-Oriented Approaches to Pattern Recognition. New York and London, 1972.
Carreira-Perpiñán M. Á., A review of dimension reduction techniques. Technical Report CS-96-09, 1997.
Moya M., Koch M., and Hostetler L., One-class classifier networks for target recognition applications. In Proceedings of the World Congress on Neural Networks, pages 797-801, Portland, OR, International Neural Network Society (INNS), 1993.
Ng H.-T., Goh W.-B., and Low K.-L., Feature selection, perceptron learning, and a usability case study. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 67-73, Philadelphia, July 1997.
Platt J. C., Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning. MIT Press, 1998.
Hao P. Y. and Chiang J. H., Support vector clustering: a new geometrical grouping approach. In Proceedings of the 9th Bellman Continuum International Workshop on Uncertain Systems and Soft Computing, volume 2, pages 312-317, Beijing, China, July 2002.
Porter M. F., An algorithm for suffix stripping. Program: Automated Library and Information Systems, volume 14(3), pages 130-137, 1980.
Baeza-Yates R. and Ribeiro-Neto B., Modern Information Retrieval. Addison-Wesley, ACM Press, New York, 1999.
van Rijsbergen C. J., Information Retrieval, 2nd edition. Butterworths, London, 1979.
Ritter G. and Gallegos M., Outliers in statistical pattern recognition and an application to automatic chromosome classification. Pattern Recognition Letters, volume 18, pages 525-539, 1997.
Schölkopf B., Williamson R., Smola A., and Shawe-Taylor J., Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, and N. Tishby, editors, Unsupervised Learning, Dagstuhl-Seminar-Report 235, pages 19-20, 1999.
Schölkopf B., Platt J. C., Shawe-Taylor J., Smola A. J., and Williamson R. C., Estimating the support of a high-dimensional distribution. In Proceedings of the Annual Conference on Neural Information Processing Systems. MIT Press, 2000.
Schütze H., Hull D., and Pedersen J., A comparison of classifiers and document representations for the routing problem. In International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995.
Sebastiani F., Machine learning in automated text categorization: a survey. Technical Report IEI-B4-31-1999, Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, 1999; revised version, 2001.
Tax D. and Duin R., Support vector domain description. Pattern Recognition Letters, volume 20(11-13), pages 1191-1199, 1999.
Wang K., Zhou S., and He Y., Hierarchical classification of real life documents. In Proceedings of the 1st SIAM International Conference on Data Mining, Chicago, 2001.
Weigend A. S., Wiener E. D., and Pedersen J. O., Exploiting hierarchy in text categorization. Information Retrieval, volume 1(3), pages 193-216, 1999.
Weiss S. M., Apte C., Damerau F. J., Johnson D. E., Oles F. J., Goetz T., and Hampp T., Maximizing text-mining performance. IEEE Intelligent Systems, volume 14(4), July-August 1999.
Weston J. and Watkins C., Multi-class support vector machines. Technical Report CSD-TR-98-04, May 1998.
Yang Y. and Wilbur J., Using corpus statistics to remove redundant words in text categorization. Journal of the American Society for Information Science, volume 47(5), pages 357-369, 1996.
Yang Y., An evaluation of statistical approaches to text categorization. Technical Report CMU-CS-97-127, Computer Science Department, Carnegie Mellon University, 1997a.
Yang Y. and Pedersen J. O., A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML-97), pages 412-420, 1997b.
黃曲江 (Huang), 計算機演算法設計與分析 [Design and Analysis of Computer Algorithms]. 格致, 1989 (in Chinese).
APPENDIX A Stop-word List
A total of 306 stop words are used in this thesis.
a
about
above
across
after
afterwards
again
against
albeit
all
almost
alone
along
already
also
although
always
among
amongst
an
and
another
any
anyhow
anyone
anything
anywhere
are
around
as
at
b
be
became
because
become
becomes
becoming
been
before
beforehand
behind
being
below
beside
besides
between
beyond
both
but
by
c
can
cannot
co
could
d
down
during
e
each
eg
eight
either
eleven
else
elsewhere
enough
etc
even
ever
every
everyone
everything
everywhere
except
f
few
five
for
four
former
formerly
from
further
g
h
had
has
have
he
hence
her
here
hereafter
hereby
herein
hereupon
hers
herself
him
himself
his
how
however
i
ie
if
in
inc
indeed
into
is
it
its
itself
j
k
l
last
latter
latterly
least
less
ltd
m
many
may
me
meanwhile
might
more
moreover
most
mostly
much
must
my
myself
n
namely
neither
never
nevertheless
next
nine
no
nobody
none
nor
not
nothing
now
nowhere
o
of
often
on
once
one
only
onto
or
other
others
otherwise
our
ours
ourselves
out
over
own
p
per
perhaps
q
r
rather
s
said
same
seem
seemed
seeming
seems
seven
several
she
should
since
six
so
some
somehow
someone
something
sometime
sometimes
somewhere
still
such
t
ten
than
that
the
their
them
themselves
then
thence
there
thereafter
thereby
therefore
therein
thereupon
these
they
this
those
though
three
through
throughout
thru
thus
to
today
together
too
toward
towards
two
twelve
u
under
until
up
upon
us
v
v
very
via
w
was
we
well
were
what
whatever
whatsoever
when
whence
whenever
whensoever
where
whereafter
whereas
whereat
whereby
wherefrom
wherein
whereinto
whereof
whereon
whereto
whereunto
whereupon
wherever
wherewith
whether
which
whichever
whichsoever
while
whilst
whither
who
whoever
whole
whom
whomever
whomsoever
whose
whosoever
why
will
with
within
without
would
x
y
year
years
yes
yesterday
yet
you
your
yours
yourself
yourselves
z
APPENDIX B Part-of-Speech Tags
There are 37 part-of-speech tags in total.
Part-of-Speech Tag Meaning
1 CC Coordinating conjunction
2 CD Cardinal number
3 DT Determiner
4 EX Existential there
5 FW Foreign word
6 IN Preposition or subordinating conjunction
7 JJ Adjective
8 JJR Adjective, comparative
9 JJS Adjective, superlative
10 LS List item marker
11 MD Modal
12 NN Noun, singular or mass
13 NNS Noun, plural
14 NNP Proper noun, singular
15 NNPS Proper noun, plural
16 PDT Predeterminer
17 POS Possessive ending
18 PRP Personal pronoun
19 PRP$ Possessive pronoun
20 RB Adverb
21 RBR Adverb, comparative
22 RBS Adverb, superlative
23 RP Particle
24 SYM Symbol
25 TO To
26 UH Interjection
27 VB Verb, base form
28 VBD Verb, past tense
29 VBG Verb, gerund or present participle
30 VBN Verb, past participle
31 VBP Verb, non-3rd person singular present
32 VBZ Verb, 3rd person singular present
33 WDT Wh-determiner
34 WP Wh-pronoun
35 WP$ Possessive wh-pronoun
36 WRB Wh-adverb
37 . Period