Chapter Ⅳ. Categorization. February 15, 2007, Artificial Intelligence Lab, 김민호. Text: THE TEXT MINING HANDBOOK, pp. 64-81


Page 1:

Chapter Ⅳ. Categorization

February 15, 2007, Artificial Intelligence Lab, 김민호

Text: THE TEXT MINING HANDBOOK, pp. 64-81

Page 2:

Outline

Ⅳ.5 Machine Learning Approach to TC
- Ⅳ.5.1 Probabilistic Classifiers
- Ⅳ.5.2 Bayesian Logistic Regression

Ⅳ.6 Using Unlabeled Data to Improve Classification
Ⅳ.7 Evaluation of Text Classifiers

Page 3:

Ⅳ.5 MACHINE LEARNING APPROACH TO TC

In the ML approach, the classifier is built automatically by learning the properties of categories from a set of preclassified training documents

Four main issues to decide on when developing an application based on text categorization:
- provide a training set for each of the categories
- decide on the features that represent each of the instances
- decide on the algorithm to be used for the categorization

Page 4:

Ⅳ.5.1 Probabilistic Classifiers (1/3)

Bayes' Theorem: let x be the value of the variable, C the class into which we want to classify, P(x) the probability distribution of x over the whole population, P(C) the prior probability that an arbitrary sample belongs to class C, and P(x|C) the conditional probability of obtaining the value x within class C.

P(A|B) is the conditional probability that event A occurs given that event B has occurred
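In this notation, Bayes' theorem gives the posterior probability of the class as

P(C \mid x) = \frac{P(x \mid C)\, P(C)}{P(x)}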

Categorization status value CSV(d, c): probabilistic classifiers view the CSV as the probability P(c | d) that the document d belongs to the category c

Page 5:

Ⅳ.5.1 Probabilistic Classifiers (2/3)

Naïve Bayes (NB) classifiers
- simple probabilistic classifiers
- as a probability model, more accurately described as an independent feature model
- include a strong independence assumption that does not hold in real data
- parameter estimation uses Maximum Likelihood

Likelihood: the likelihood of a hypothesis H is, given the observed outcome (evidence) E of some trial, the degree to which that outcome E would be expected to occur if the hypothesis H were true

Given that the outcome E has occurred, it is a measure for comparing the various possible hypotheses that could have produced that outcome

The likelihood ratio R of the two classes A and B

- if R is greater than 1, class A is chosen as the class to which sample x belongs; if it is less than 1, class B is chosen
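Consistent with this decision rule, the ratio can be written in the form that includes the class priors (the plain likelihood ratio omits P(A) and P(B)):

R = \frac{P(x \mid A)\, P(A)}{P(x \mid B)\, P(B)}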

Page 6:

Ⅳ.5.1 Probabilistic Classifiers (3/3)

H – the event of carrying the HIV virus
¬H – the event of not carrying the HIV virus
Pos – the event that the test comes out positive
Neg – the event that the test comes out negative

Bayes’s theorem

Likelihood ratio
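In this notation, Bayes' theorem and the corresponding ratio for the example take the following symbolic form (no numerical values are assumed here):

P(H \mid Pos) = \frac{P(Pos \mid H)\, P(H)}{P(Pos \mid H)\, P(H) + P(Pos \mid \neg H)\, P(\neg H)}

R = \frac{P(Pos \mid H)\, P(H)}{P(Pos \mid \neg H)\, P(\neg H)}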

Page 7:

Ⅳ.5.2 Bayesian Logistic Regression (1/3)

Assuming the categorization is binary

φ is the logistic link function

c = ±1 is the category membership value, d is the document representation in the feature space, and β is the vector of model parameters
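With these ingredients the model can be written in the usual logistic-regression form (this is the standard form under the notation above, not necessarily the exact formula on the original slide):

P(c \mid d, \beta) = \varphi(c\, \beta \cdot d), \qquad \varphi(x) = \frac{1}{1 + e^{-x}}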

The idea is to use a prior distribution for the parameter vector β
- Gaussian prior
- Laplace prior

Page 8:

Ⅳ.5.2 Bayesian Logistic Regression (2/3)

Gaussian prior

The maximum a posteriori (MAP) estimate
- a method that uses the mode of the posterior distribution as the estimate of the parameters
- similar to maximum likelihood estimation, except that the prior distribution is taken into account

With a Gaussian prior, the MAP estimates of the parameters will rarely be exactly zero; thus, the model will not be sparse

Laplace prior
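For a single parameter β_j the two priors are commonly written as follows, with hyperparameters τ (variance) and λ (scale); the Laplace prior is the one that tends to drive many parameters exactly to zero and therefore yields a sparse model:

p(\beta_j \mid \tau) = \frac{1}{\sqrt{2\pi\tau}} \exp\!\left(-\frac{\beta_j^2}{2\tau}\right), \qquad p(\beta_j \mid \lambda) = \frac{\lambda}{2} \exp\!\left(-\lambda\, |\beta_j|\right)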

Page 9:

Ⅳ.5.2 Bayesian Logistic Regression (3/3)

log-posterior distribution of β

posterior mode
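A sketch of the log-posterior consistent with the model and priors above (up to an additive constant), where the sum runs over the training documents (d_i, c_i) and p(β) is the chosen prior:

l(\beta) = -\sum_i \log\!\left(1 + e^{-c_i\, \beta \cdot d_i}\right) + \log p(\beta)

The posterior mode, i.e. the β that maximizes l(β), is the MAP estimate of the parameters.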

Page 10:

Ⅳ.5.3 Decision Tree Classifiers (1/3)

A decision tree (DT) classifier is a tree in which
- the internal nodes are labeled by features
- the edges leaving a node are labeled by tests on the feature's weight
- the leaves are labeled by categories

A DT categorizes a document by
- starting at the root of the tree
- moving successively downward via the branches whose conditions are satisfied by the document
- until a leaf node is reached

Binary DT: most DT classifiers use a binary document representation; the tree that corresponds to the CONSTRUE rule may look like Figure Ⅳ.1
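As an illustration of how such a binary tree is applied to a document, here is a minimal sketch in Python; the tree, the feature names, and the categories are invented for illustration and are not the actual CONSTRUE tree:

    # Hypothetical binary decision tree: each internal node tests whether a
    # feature (term) occurs in the document; leaves carry the category label.
    tree = {
        "feature": "wheat",
        "present": {"feature": "farm", "present": "WHEAT", "absent": "GRAIN"},
        "absent": "NOT-WHEAT",
    }

    def classify(document_terms, node):
        """Walk from the root down the branch whose condition the document satisfies."""
        while isinstance(node, dict):          # internal node: test its feature
            branch = "present" if node["feature"] in document_terms else "absent"
            node = node[branch]
        return node                            # leaf node: the category

    print(classify({"wheat", "farm"}, tree))   # -> WHEAT
    print(classify({"oil", "price"}, tree))    # -> NOT-WHEAT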

Page 11:

Ⅳ.5.3 Decision Tree Classifiers (2/3)

Figure Ⅳ.1. A Decision Tree classifier.

Page 12:

Ⅳ.5.3 Decision Tree Classifiers (3/3)

Pruning
- trees generated in this way are prone to overfit the training collection
- pruning removes the branches that are too specific

ID3: given a training set consisting of examples and counterexamples of some concept, it builds the classification rules of a decision tree that can discriminate the concept.

Steps of ID3 (a code sketch of this windowing loop follows):

step 1 - randomly select a subset W from the training set
step 2 - build a decision tree from W
step 3 - apply the resulting rules to the remaining examples and find the exceptions
step 4 - add the exceptions to the original W and start again from step 2
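A minimal sketch of the windowing loop in Python; scikit-learn's DecisionTreeClassifier is used here only as a stand-in tree learner (it uses a CART-style criterion, not ID3's exact splitting rule), and the toy data are invented:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Toy data standing in for a binary document-term matrix and category labels.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 20))      # 200 documents, 20 binary features
    y = (X[:, 0] | X[:, 3]).astype(int)         # a simple hidden concept

    window = rng.choice(len(X), size=20, replace=False).tolist()          # step 1
    while True:
        tree = DecisionTreeClassifier().fit(X[window], y[window])         # step 2
        rest = [i for i in range(len(X)) if i not in window]
        wrong = [i for i in rest if tree.predict(X[i:i + 1])[0] != y[i]]  # step 3
        if not wrong:
            break                               # no exceptions left: done
        window += wrong                         # step 4: add exceptions, repeat

    print("final window size:", len(window))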

Page 13:

Ⅳ.5.4 Decision Rule Classifiers

Disjunctive Normal Form (DNF)
- the outermost operators of the formula are all ORs, and there is only one level of nesting allowed, which may only contain literals or conjunctions of literals
- for example, all of the following formulas are in DNF:

- A ∨ B
- (A ∧ B) ∨ C
- (A ∧ ¬B ∧ ¬C) ∨ (¬D ∧ E ∧ F)

Decision Rule (DR) classifiers
- look very much like the DNF rules of the CONSTRUE system
- DR classifiers are built from the training collection using inductive rule learning
- in these rules, the d's are features of the document and c is its category (a small rule-evaluation sketch follows)
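A minimal sketch of evaluating DNF-style rules over a set-of-terms document representation; the rules, terms, and category names are invented for illustration and are not the actual CONSTRUE rules:

    # Each category has a list of conjunctions (ORed together); a literal prefixed
    # with "~" must be absent from the document, otherwise it must be present.
    rules = {
        "WHEAT": [["wheat", "farm"], ["wheat", "commodity"], ["bushels", "export"]],
        "OTHER": [["~wheat"]],
    }

    def matches(document_terms, conjunction):
        # A conjunction holds if every literal in it is satisfied by the document.
        return all(
            (lit[1:] not in document_terms) if lit.startswith("~") else (lit in document_terms)
            for lit in conjunction
        )

    def classify(document_terms):
        # A document gets a category if any of that category's conjunctions matches.
        return [c for c, disjuncts in rules.items()
                if any(matches(document_terms, conj) for conj in disjuncts)]

    print(classify({"wheat", "farm", "price"}))   # -> ['WHEAT']
    print(classify({"oil", "price"}))             # -> ['OTHER']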

Page 14:

Ⅳ.5.5 Regression Methods

Linear least-squares fit (LLSF)
- the category assignment function is viewed as a matrix M describing a linear transformation from the feature space to the space of all possible category assignments
- the matrix is computed by minimizing the error on the training collection according to the formula below
- the matrix M can be computed by performing singular value decomposition on the training data
- each matrix element represents the degree of association between the ith feature and the jth category
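In one common notation (an assumption here, since the slide's own symbols are not shown): if D is the matrix whose columns are the training documents' feature vectors and C the matrix whose columns are their category-assignment vectors, the transformation is chosen as

M = \arg\min_{M} \; \lVert M D - C \rVert_F

where \lVert \cdot \rVert_F is the Frobenius norm; the minimizer can be obtained from the singular value decomposition of the training data.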

Page 15:

Ⅳ.5.6 The Rocchio Methods

a document is categorized by computing its distance to the prototypical examples of the categories

the prototypical example for the category c is a vector (w1, w2, …) in the feature space, computed as shown below

POS(c) – the set of all training documents that belong to the category c
NEG(c) – the set of all training documents that do not belong to the category c
wdi – the weight of the ith feature in the document d
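A common form of the prototype computation using the sets defined above (more general Rocchio variants weight the positive and negative terms with extra parameters, often called β and γ):

w_i = \frac{1}{\lvert POS(c) \rvert} \sum_{d \in POS(c)} w_{di} \;-\; \frac{1}{\lvert NEG(c) \rvert} \sum_{d \in NEG(c)} w_{di}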

- very easy to implement and computationally cheap
- performance is usually mediocre

Page 16:

Ⅳ.5.7 Neural Networks

A neural network can be built to perform text categorization
- input nodes receive the feature values
- output nodes produce the categorization status values
- link weights represent dependence relations

Forward propagation
- the activation of the nodes is propagated forward through the network
- the final values on the output nodes determine the categorization decisions

Back propagation
- if a misclassification occurs, the error is propagated back through the network, modifying the link weights in order to minimize the error

Perceptron
- the simplest kind of neural network
- only two layers: the input and the output nodes (a minimal training-loop sketch follows)
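A minimal perceptron training loop in Python for a single binary category over a binary feature representation; the toy data and number of passes are invented for illustration:

    # Toy training data: binary feature vectors with a +1 / -1 category label.
    docs = [([1, 0, 1, 0], 1), ([1, 1, 0, 0], 1), ([0, 0, 1, 1], -1), ([0, 1, 0, 1], -1)]

    weights = [0.0] * 4
    bias = 0.0
    for _ in range(10):                                   # a few passes over the data
        for features, label in docs:
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            predicted = 1 if activation >= 0 else -1
            if predicted != label:                        # on error, nudge the weights
                weights = [w + label * x for w, x in zip(weights, features)]
                bias += label

    print(weights, bias)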

Page 17:

Ⅳ.5.8 Example-Based Classifiers

computing the similarity between the document to be classified and the training documents

Lazy learners
- defer the decision on how to generalize beyond the training data until each new query instance is encountered

kNN (k-nearest neighbor)
- the most prominent example of an example-based classifier
- checks whether the k training documents most similar to d belong to c
- one of the best-performing text classifiers to this day (a small code sketch follows)
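A minimal kNN sketch in Python using cosine similarity over term-frequency vectors; the training documents, categories, and the value of k are invented for illustration:

    import math
    from collections import Counter

    training = [
        ("wheat exports rose as farm output grew", "WHEAT"),
        ("wheat and corn futures fell on the commodity market", "WHEAT"),
        ("the central bank raised interest rates again", "MONEY"),
        ("interest rates and inflation worry the bank", "MONEY"),
    ]
    k = 3

    def vectorize(text):
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def knn_classify(text):
        query = vectorize(text)
        neighbors = sorted(training, key=lambda ex: cosine(query, vectorize(ex[0])), reverse=True)[:k]
        # Majority vote among the k most similar training documents.
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

    print(knn_classify("farm wheat prices"))   # -> WHEAT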

Page 18:

Ⅳ.5.9 Support Vector Machines

SVMs are very fast and effective for TC problems
- the classifying hyperplane is chosen during training as the unique hyperplane that separates the known positive instances from the known negative instances with the maximal margin
- the margin is the distance from the hyperplane to the nearest point of the positive and negative sets

- its theoretically justified approach to the overfitting problem allows it to perform well irrespective of the dimensionality of the feature space
- it needs no parameter adjustment, since there is a theoretically motivated default choice of parameters (a minimal sketch using library defaults follows)
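A minimal sketch of training a linear SVM text classifier with scikit-learn; the toy corpus and category labels are invented, and the library defaults stand in for the "theoretically motivated default choice of parameters":

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    texts = [
        "wheat exports rose as farm output grew",
        "wheat and corn futures fell on the commodity market",
        "the central bank raised interest rates again",
        "interest rates and inflation worry the bank",
    ]
    labels = ["WHEAT", "WHEAT", "MONEY", "MONEY"]

    # Map documents into a high-dimensional feature space; the SVM then finds a
    # maximal-margin separating hyperplane using its default parameters.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)
    clf = LinearSVC().fit(X, labels)

    print(clf.predict(vectorizer.transform(["farm wheat prices"])))   # expected: ['WHEAT']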

Page 19:

Ⅳ.5.10 Classifier Committees: Bagging and Boosting

Bagging
- individual classifiers are trained in parallel on the same training collection
- to build a single committee classifier, one must choose the method of combining their results

Majority vote
- the category is assigned if at least (k+1)/2 of the k classifiers decide for it

Weighted linear combination
- the final CSV is given by a weighted sum of the CSVs of the k classifiers
- the weights can be estimated on a validation dataset

Page 20:

Ⅳ.5.10 Classifier Committees: Bagging and Boosting

Boosting
- the classifiers are trained sequentially
- greater weight is given to the documents that were misclassified by the previous classifiers
- each individual classifier is a weak learner

Page 21:

Ⅳ.5.10 Classifier Committees: Bagging and Boosting

AdaBoost algorithm
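A compact summary of the standard binary AdaBoost scheme (a sketch; the exact formulation on the original slide may differ in details). With document labels c_i ∈ {±1}, weak classifiers h_t, and initial document weights w_i^{(1)} = 1/N, repeat for t = 1, …, T:

\varepsilon_t = \sum_{i:\, h_t(d_i) \ne c_i} w_i^{(t)}, \qquad \alpha_t = \tfrac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}, \qquad w_i^{(t+1)} \propto w_i^{(t)}\, e^{-\alpha_t c_i h_t(d_i)}

The final committee classifier is H(d) = \operatorname{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(d) \right).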

Page 22:

Ⅳ.7 EVALUATION OF TEXT CLASSIFIERS

The performance of classifiers can be evaluated only experimentally
- the training set, as the name suggests, is used for training the classifier
- the test set is the one on which the performance measures are calculated

n-fold cross-validation
- the whole document collection is divided into n equal parts
- the training-and-testing process is run n times, each time using a different part of the collection as the test set
- the results for the n folds are averaged

Page 23:

Ⅳ.7.1 Performance Measures

recall for a category is defined as the percentage of correctly classified documents among all documents belonging to that category

precision is the percentage of correctly classified documents among all documents that were assigned to the category by the classifier

Breakeven point
- the value of recall and precision at the point on the recall-versus-precision curve where they are equal

F1 = 2 / (1/recall + 1/precision)
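In a standard notation, with TP_i, FP_i, and FN_i the numbers of true positives, false positives, and false negatives for category i, these measures can be written as

\text{recall}_i = \frac{TP_i}{TP_i + FN_i}, \qquad \text{precision}_i = \frac{TP_i}{TP_i + FP_i}, \qquad F1_i = \frac{2\, \text{precision}_i\, \text{recall}_i}{\text{precision}_i + \text{recall}_i}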


Page 24:

Ⅳ.7.2 Benchmark Collections

The best-known publicly available collection is the Reuters set of news stories
- it accounts for most of the experimental work in TC so far
- this does not mean that the results produced by different researchers are directly comparable

For the results of two experiments to be directly comparable, the following conditions must be met
- the experiments must be performed on exactly the same collection, using the same split between training and test sets
- the same performance measure must be chosen

Page 25:

Ⅳ.7.3 Comparison among Classifiers

The top performers are SVM, AdaBoost, kNN, and regression methods
Rocchio and Naïve Bayes have the worst performance
- both are often used as baseline classifiers
- NB is very useful as a member of classifier committees