An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages

Authors: Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos, Constantine D. Spyropoulos

Source: SIGIR 2000


Post on 24-Jan-2015

Page 2: An experimental comparison of naive bayesian and keyword based

Outline

Introduction
Feature selection
The Naive Bayesian classifier
Result

Page 3: An experimental comparison of naive bayesian and keyword based

Introduction

There is a great deal of spam e-mail.

This work compares a Naïve Bayesian classifier with keyword-based anti-spam filtering.

Sahami et al. trained a Naïve Bayesian classifier on manually categorized legitimate and spam messages.

Page 4: An experimental comparison of naive bayesian and keyword based

The Naive Bayesian classifier

Each message is represented by a vector x = (x1, x2, x3, ..., xn), where x1, ..., xn are the values of attributes X1, ..., Xn.

Each attribute shows whether or not a particular word (e.g. "adult") is present in the message.

Additional attributes can correspond to phrases (e.g. "be over 21").

Non-textual properties (e.g. whether or not the message contains attachments).
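
The message representation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the attribute list, the function name `to_vector`, and the simple tokenization are assumptions made for the example; only the sample word, the sample phrase, and the attachment property come from the slide.

```python
import re

# Hypothetical attribute list; the word and the phrase are the slide's own examples.
ATTRIBUTES = ["adult", "be over 21"]


def to_vector(message, has_attachment):
    """Return a binary vector: one entry per word/phrase attribute,
    followed by one non-textual attribute (attachment present or not)."""
    # Lowercase and keep only alphanumeric tokens (and "$"), joined by single spaces.
    text = " ".join(re.findall(r"[a-z0-9$]+", message.lower()))
    padded = f" {text} "
    # An attribute is 1 when the word or phrase occurs as whole tokens.
    x = [1 if f" {attr} " in padded else 0 for attr in ATTRIBUTES]
    x.append(1 if has_attachment else 0)
    return x
```

For example, `to_vector("You must BE over 21 to enter!", False)` sets only the phrase attribute, while a message containing the word "adult" and carrying an attachment sets the first and last entries.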

Page 5: An experimental comparison of naive bayesian and keyword based

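Mutual information

Use mutual information (MI) to select among the candidate attributes. For a binary attribute X and the class C:

MI(X; C) = sum over x in {0, 1} and c in {spam, legitimate} of P(X = x, C = c) * log( P(X = x, C = c) / (P(X = x) * P(C = c)) )

Then select the attributes with the highest mutual information values.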
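
Attribute selection by mutual information can be sketched as below. The sketch assumes that probabilities are estimated as relative frequencies over the training messages and that zero-count combinations contribute nothing to the sum; the function names and the dictionary layout of `columns` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, cs):
    """MI(X; C) for paired observations of a binary attribute X and the class C,
    using relative frequencies as probability estimates (an assumption)."""
    n = len(xs)
    joint = Counter(zip(xs, cs))
    px = Counter(xs)
    pc = Counter(cs)
    mi = 0.0
    # Only combinations that actually occur are visited, so 0 * log(0) terms are skipped.
    for (x, c), count in joint.items():
        p_xc = count / n
        mi += p_xc * math.log(p_xc / ((px[x] / n) * (pc[c] / n)))
    return mi


def select_attributes(columns, classes, k):
    """Return the names of the k attributes with the highest MI.
    columns maps an attribute name to its list of 0/1 values, one per message."""
    ranked = sorted(
        columns,
        key=lambda name: mutual_information(columns[name], classes),
        reverse=True,
    )
    return ranked[:k]
```

An attribute that perfectly separates two equally frequent classes scores log 2, and an attribute that is independent of the class scores 0, so the former is ranked first.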

Page 6: An experimental comparison of naive bayesian and keyword based

The Naive Bayesian classifier
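
A minimal sketch of such a classifier, assuming the standard Naive Bayes form in which P(C = c | X = x) is proportional to P(C = c) multiplied by the product of P(Xi = xi | C = c) over all attributes. The add-one smoothing and the function names `train` and `posterior` are assumptions of this example, not details taken from the slides.

```python
def train(vectors, labels):
    """Estimate class priors and P(X_i = 1 | C = c) from binary vectors.
    Add-one smoothing is used here as an assumption to avoid zero probabilities."""
    classes = sorted(set(labels))
    n_attrs = len(vectors[0])
    priors, cond = {}, {}
    for c in classes:
        rows = [v for v, label in zip(vectors, labels) if label == c]
        priors[c] = len(rows) / len(vectors)
        cond[c] = [
            (sum(r[i] for r in rows) + 1) / (len(rows) + 2) for i in range(n_attrs)
        ]
    return priors, cond


def posterior(x, priors, cond):
    """Return P(C = c | X = x) for every class c, treating attributes
    as conditionally independent given the class."""
    scores = {}
    for c in priors:
        score = priors[c]
        for xi, p1 in zip(x, cond[c]):
            score *= p1 if xi == 1 else 1 - p1
        scores[c] = score
    total = sum(scores.values())  # normalizing constant over all classes
    return {c: s / total for c, s in scores.items()}
```

With many attributes the product of probabilities can underflow; summing logarithms instead is a common remedy, omitted here to keep the sketch short.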

Page 9: An experimental comparison of naive bayesian and keyword based

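L -> S (a legitimate message classified as spam) and S -> L (a spam message classified as legitimate) denote the two error types.

We assume that L -> S is λ times more costly than S -> L.

Classify a message as spam if the following classification criterion is met: P(C = spam | X = x) / P(C = legitimate | X = x) > λ, which is equivalent to P(C = spam | X = x) > t with t = λ / (1 + λ).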

Page 11: An experimental comparison of naive bayesian and keyword based

λ = 999 (t = 0.999): mistakenly blocking a legitimate message is taken to be as bad as letting 999 spam messages pass the filter.

λ = 9 (t = 0.9): when a message is blocked, an apology message and a riddle are returned to the sender.

λ = 1 (t = 0.5): the recipient does not care about the extra work imposed on the sender.
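
The three settings are consistent with a threshold t = λ / (1 + λ) applied to the estimated spam probability, where λ is the cost ratio between blocking a legitimate message and passing a spam message. A small sketch under that assumption (the function names are hypothetical):

```python
def threshold(lam):
    """Probability threshold t that corresponds to cost ratio lam."""
    return lam / (1 + lam)


def is_spam(p_spam, lam):
    """Classify as spam only if the estimated spam probability exceeds t."""
    return p_spam > threshold(lam)


for lam in (999, 9, 1):
    print(lam, threshold(lam))
```

A message with an estimated spam probability of 0.95 is blocked when λ = 9 but passes when λ = 999, which shows how a larger λ makes the filter more cautious about blocking.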

Page 12: An experimental comparison of naive bayesian and keyword based

Result

Page 13: An experimental comparison of naive bayesian and keyword based

1789 messages, consisting of 211 legitimate messages that users had saved and 1578 spam messages.

First experiment: only word attributes were used.

Second experiment: candidate attributes corresponding to phrases were added (e.g. "be over 21", "only $").

Third experiment: candidate attributes for non-textual properties were added (e.g. whether or not the message contains attachments, or a high proportion of non-alphanumeric characters).

Page 14: An experimental comparison of naive bayesian and keyword based

Experiments with the PU1 corpus

481 spam messages and 618 legitimate messages.

Naive Bayesian classifier, with ten-fold cross-validation to reduce random variation.

Results were then averaged over the ten runs.

The number of retained attributes was varied from 50 to 700 in steps of 50.

A lemmatizer and a stop-list were used as preprocessing options.
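
The evaluation protocol can be sketched as follows. The fold assignment below (every tenth message goes to the same fold) is an assumption chosen for brevity, and `evaluate` is a hypothetical callback that trains on one index list and scores on the other; the slides do not say how the folds were built.

```python
def ten_fold_splits(n_messages, k=10):
    """Yield (train_indices, test_indices) pairs; each message is tested exactly once."""
    indices = list(range(n_messages))
    for fold in range(k):
        test = indices[fold::k]  # every k-th message, starting at position `fold`
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


def average_over_folds(n_messages, evaluate, k=10):
    """Average the score returned by evaluate(train, test) over the k runs."""
    scores = [evaluate(train, test) for train, test in ten_fold_splits(n_messages, k)]
    return sum(scores) / len(scores)


# Numbers of retained attributes tried: 50, 100, ..., 700.
ATTRIBUTE_COUNTS = list(range(50, 701, 50))
```

With the 481 + 618 = 1099 messages of the corpus, each run would train on roughly nine tenths of the messages and test on the rest, repeated once per entry of `ATTRIBUTE_COUNTS`.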
