Programming Collective Intelligence - Chapter 6: Document Filtering

Post on 04-Aug-2015


Document Filtering: Programming Collective Intelligence, Ch. 6

허윤

Document Filtering

Filtering == Classification Problem

Data Mining Problem

Estimation, Classification, Prediction

Clustering, Description

Affinity Grouping

Document? A set of features -> text document, image, etc.

p( document ) = ?

Spam Filtering

Binary Classification Problem

‘Spam’ or ‘Ham’

Techniques

Naïve Bayesian Classifier

Support Vector Machine

Decision Tree

Rule vs. Model: pros and cons

Spam Filtering in Practice

Source: Sahil Puri et al., “COMPARISON AND ANALYSIS OF SPAM DETECTION ALGORITHMS”, 2013, IJAIEM

Source: Rene, “New insights into Gmail’s spam filtering”, 2012, emailmarketingtipps.de

Naïve Bayesian Classifier

Bayes Theorem

Naïve?

Bayes' theorem with a strong independence assumption

The classifier ignores the evidence term

Posterior1 > Posterior2 or Posterior1 < Posterior2
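Why the evidence term can be ignored may be easier to see written out. The following is a sketch in standard notation (the symbols c for category and d for document are introduced here, not taken from the slides): the evidence P(d) is the same for every category, so dropping it does not change which posterior is larger.

```latex
\hat{c}
  = \arg\max_{c} P(c \mid d)
  = \arg\max_{c} \frac{P(d \mid c)\,P(c)}{P(d)}
  = \arg\max_{c} P(d \mid c)\,P(c)
```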

Example

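1. Probability that box A is selected: P( A ) = 7 / 10

2. Probability of drawing a white ball from box A: P( white | A ) = 2 / 10

3. Probability that A is picked from the bag and a white ball is then drawn from box A

4. Probability of a white ball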
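Items 3 and 4 can be worked through numerically. The sketch below uses the two values given on the slide; the slide does not give the numbers for box B, so `p_b` and `p_white_given_b` are assumptions made only for illustration.

```python
from fractions import Fraction

p_a = Fraction(7, 10)              # from the slide: P(A)
p_white_given_a = Fraction(2, 10)  # from the slide: P(white | A)

# ASSUMPTIONS (not on the slide): B is the only other box, and the
# value of P(white | B) below is a made-up placeholder.
p_b = 1 - p_a
p_white_given_b = Fraction(5, 10)

# Item 3: joint probability of picking A and then drawing a white ball.
p_a_and_white = p_a * p_white_given_a

# Item 4: total probability of a white ball, summed over both boxes.
p_white = p_a_and_white + p_b * p_white_given_b

print(p_a_and_white)  # 7/50
print(p_white)        # 29/100
```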

Example (continued)

A white ball turned up from somewhere… What is the probability that it came from A? P( A | white )

What is the probability that it came from B? P( B | white )

P( A | white ) = ?

Bayes Rule

❶ Conditional Prob. A given B ❷ Conditional Prob. B given A

❸ Bayes Rule
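The formulas for these three steps did not survive in the transcript. A standard reconstruction, offered as a sketch rather than the slide's exact notation, is:

```latex
\begin{align}
P(A \mid B) &= \frac{P(A \cap B)}{P(B)} \\
P(B \mid A) &= \frac{P(A \cap B)}{P(A)} \\
P(A \mid B) &= \frac{P(B \mid A)\,P(A)}{P(B)}
\end{align}
```

Solving the second line for P(A ∩ B) and substituting into the first gives the third. Applied to the box example, this reads P(A | white) = P(white | A) P(A) / P(white).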

Document Representation: extracting words from the document
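A minimal sketch of word extraction follows. The function name `getwords` matches the label that appears later in the deck; the splitting rule and the length bounds (3 to 19 characters) are assumptions chosen for illustration.

```python
import re

def getwords(doc):
    """Return the distinct lowercase words of `doc` as a dict {word: 1}."""
    # Split on runs of non-word characters (punctuation, whitespace).
    tokens = re.split(r"\W+", doc)
    # ASSUMPTION: drop very short and very long tokens.
    words = [t.lower() for t in tokens if 2 < len(t) < 20]
    # Each word counts once per document, however often it occurs.
    return {w: 1 for w in words}

print(sorted(getwords("The quick brown fox, the QUICK fox!")))
# ['brown', 'fox', 'quick', 'the']
```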

Implementation: Preparation

Representation of Classifier

{'python': {'bad': 0, 'good': 6}, 'the': {'bad': 3, 'good': 3}}

# getwords

How to access dict
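Accessing the nested dictionary shown above can be sketched as follows. The helper name `fcount` is hypothetical; it returns 0 for a feature or category that has never been seen instead of raising `KeyError`.

```python
# Feature counts copied from the slide: feature -> category -> count.
fc = {'python': {'bad': 0, 'good': 6}, 'the': {'bad': 3, 'good': 3}}

def fcount(fc, feature, category):
    """Count of `feature` in documents of `category` (0 if unseen)."""
    return fc.get(feature, {}).get(category, 0)

print(fcount(fc, 'python', 'good'))  # 6
print(fcount(fc, 'the', 'bad'))      # 3
print(fcount(fc, 'money', 'bad'))    # 0
```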

Implementation: Preparation

Training
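Training amounts to incrementing counts. The sketch below assumes a class named `Classifier` with attributes `fc` (feature counts per category) and `cc` (document counts per category); these names and the simple whitespace feature extractor are illustrative choices, not confirmed by the transcript.

```python
from collections import defaultdict

class Classifier:
    def __init__(self, getfeatures):
        # fc[feature][category] -> number of documents containing feature
        self.fc = defaultdict(lambda: defaultdict(int))
        # cc[category] -> number of training documents in category
        self.cc = defaultdict(int)
        self.getfeatures = getfeatures

    def train(self, item, cat):
        """Record one labelled document."""
        for f in self.getfeatures(item):
            self.fc[f][cat] += 1
        self.cc[cat] += 1

# Hypothetical feature extractor: the set of lowercase tokens.
cl = Classifier(lambda doc: set(doc.lower().split()))
cl.train('the quick brown fox jumps', 'good')
cl.train('make quick money', 'bad')
print(dict(cl.fc['quick']))  # {'good': 1, 'bad': 1}
print(dict(cl.cc))           # {'good': 1, 'bad': 1}
```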

Implementation: Preparation

Result

Implementation: Preparation

Recall

Bayesian Theorem

p( category | doc ) = p( doc | category ) * p( category ) / p( doc )

Implementation : Classifier

P( feature | category ) as prior

Assumed Probability to resolve data sparseness
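One common way to use an assumed probability is a weighted average between the assumed value and the observed ratio, so that a feature seen only once or twice is not pushed to 0 or 1. The sketch below is written under that assumption; the function name and the defaults `weight=1.0` and `assumed=0.5` are illustrative, not taken from the slides.

```python
def weighted_prob(count_in_cat, docs_in_cat, count_total,
                  weight=1.0, assumed=0.5):
    """Smoothed estimate of P(feature | category).

    count_in_cat: documents in the category that contain the feature
    docs_in_cat:  all documents in the category
    count_total:  documents in any category that contain the feature
    """
    basic = count_in_cat / docs_in_cat if docs_in_cat else 0.0
    # Average the assumed value (counted `weight` times) with the
    # observed ratio (counted once per sighting of the feature).
    return (weight * assumed + count_total * basic) / (weight + count_total)

print(weighted_prob(0, 3, 1))  # 0.25 (seen once, never in this category)
print(weighted_prob(0, 3, 0))  # 0.5  (never seen: falls back to assumed)
```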

Implementation : Classifier

Results

Implementation : Classifier

P( document | category ) as likelihood
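Under the independence assumption, the document likelihood is the product of the per-feature probabilities. A minimal sketch, where `featprob` is any function returning P(feature | category) and the probability table is made up for the demonstration:

```python
from math import prod

def docprob(features, category, featprob):
    """P(document | category) as a product of P(feature | category)."""
    return prod(featprob(f, category) for f in features)

# Hypothetical per-feature probabilities, for illustration only.
table = {('quick', 'good'): 0.5, ('money', 'good'): 0.25}
print(docprob(['quick', 'money'], 'good', lambda f, c: table[(f, c)]))
# 0.125
```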

Implementation : Classifier

P( document | category ) * p( category )

Implementation : Classifier

Classifying
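Putting the pieces together, classification picks the category with the largest p( document | category ) * p( category ). The sketch below is one possible self-contained version; the class name `NaiveBayes`, the smoothing defaults, and the training sentences are assumptions for illustration, and no decision thresholds are applied.

```python
import re
from collections import defaultdict
from math import prod

def getwords(doc):
    # ASSUMPTION: features are distinct lowercase words of 3-19 chars.
    return {w.lower() for w in re.split(r"\W+", doc) if 2 < len(w) < 20}

class NaiveBayes:
    def __init__(self, getfeatures=getwords):
        self.fc = defaultdict(lambda: defaultdict(int))  # feature counts
        self.cc = defaultdict(int)                       # category counts
        self.getfeatures = getfeatures

    def train(self, doc, cat):
        for f in self.getfeatures(doc):
            self.fc[f][cat] += 1
        self.cc[cat] += 1

    def weightedprob(self, f, cat, weight=1.0, assumed=0.5):
        counts = self.fc.get(f, {})          # .get avoids creating keys
        basic = counts.get(cat, 0) / self.cc[cat]
        total = sum(counts.values())
        return (weight * assumed + total * basic) / (weight + total)

    def score(self, doc, cat):
        # Unnormalized posterior: likelihood times prior.
        prior = self.cc[cat] / sum(self.cc.values())
        likelihood = prod(self.weightedprob(f, cat)
                          for f in self.getfeatures(doc))
        return likelihood * prior

    def classify(self, doc, default=None):
        if not self.cc:
            return default
        return max(self.cc, key=lambda cat: self.score(doc, cat))

cl = NaiveBayes()
cl.train('Nobody owns the water.', 'good')
cl.train('the quick rabbit jumps fences', 'good')
cl.train('buy pharmaceuticals now', 'bad')
cl.train('make quick money at the online casino', 'bad')
cl.train('the quick brown fox jumps', 'good')
print(cl.classify('quick rabbit'))  # good
print(cl.classify('quick money'))   # bad
```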

Implementation : Classifier

Result

Implementation : Classifier

Recall: Naïve Bayesian Classifier

Fisher’s Method

First, p( document| category ) = p( feature_1| category ) * p( feature_2| category ) … * p( feature_N| category )

p( category | document ) ??

p( category | feature ) = (# of documents having feature in category) / (# of documents having feature)
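The ratio above can be computed directly from the feature counts. This sketch follows the formula as written on the slide, with a hypothetical helper name `cprob`; it does not correct for categories of different sizes, and it covers only this per-feature step, not the way Fisher's method combines the per-feature values.

```python
# Feature counts reused from the earlier slide: feature -> category -> count.
fc = {'python': {'bad': 0, 'good': 6}, 'the': {'bad': 3, 'good': 3}}

def cprob(fc, feature, category):
    """P(category | feature) as a ratio of document counts."""
    counts = fc.get(feature, {})
    total = sum(counts.values())
    if total == 0:
        return 0.0  # feature never seen in training
    return counts.get(category, 0) / total

print(cprob(fc, 'python', 'good'))  # 1.0
print(cprob(fc, 'the', 'bad'))      # 0.5
```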

Q&A

Thank You
