an experimental comparison of naive bayesian and keyword based

An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages

Authors:

Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos, Constantine D. Spyropoulos

Source: SIGIR 2000

Outline: Introduction; Feature selection; The Naive Bayesian classifier; Result

Introduction

There is a great deal of spam e-mail. This work compares a Naive Bayesian classifier with a keyword-based anti-spam mechanism.

Sahami et al. trained a Naive Bayesian classifier on manually categorized legitimate and spam messages.

The Naive Bayesian classifier

Each message is represented by a vector $\vec{x} = (x_1, x_2, x_3, \ldots, x_n)$, where $x_1, \ldots, x_n$ are the values of attributes $X_1, \ldots, X_n$.

Each attribute shows whether or not a particular word (e.g. "adult") is present in the message.

Additional attributes can correspond to phrases (e.g. "be over 21").

Non-textual properties (e.g. whether or not the message contains attachments).
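The representation above can be illustrated with a minimal sketch. The function name `message_to_vector`, its parameters, and the tokenization rule are assumptions made for illustration, not the authors' implementation; the word and phrase are the examples given in the text.

```python
import re

def message_to_vector(text, has_attachment, word_attrs, phrase_attrs):
    """Return a binary vector: word attributes, then phrase attributes,
    then one non-textual attribute (attachment present or not)."""
    # Assumed tokenization: lowercase runs of letters, digits and '$'.
    tokens = re.findall(r"[a-z0-9$]+", text.lower())
    token_set = set(tokens)
    # Pad with spaces so phrases only match on whole-token boundaries.
    normalized = " " + " ".join(tokens) + " "
    vector = [1 if w in token_set else 0 for w in word_attrs]
    vector += [1 if f" {p} " in normalized else 0 for p in phrase_attrs]
    vector.append(1 if has_attachment else 0)
    return vector

x = message_to_vector("You must be over 21! ADULT content", False,
                      word_attrs=["adult", "free"],
                      phrase_attrs=["be over 21"])
print(x)
```

Each position in the result is 1 when the corresponding word, phrase, or non-textual property is present and 0 otherwise.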

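Mutual information

Use mutual information (MI) to select among the candidate attributes. For each candidate attribute X and the category variable C:

$$MI(X; C) = \sum_{x \in \{0,1\},\ c \in \{\text{spam}, \text{legitimate}\}} P(X = x, C = c) \cdot \log \frac{P(X = x, C = c)}{P(X = x) \cdot P(C = c)}$$

Then select the attributes with the highest mutual information values.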
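As a rough sketch of this selection step, the snippet below estimates MI for binary attributes from relative frequencies and keeps the highest-scoring ones. The function names and the toy data are hypothetical, and estimating the probabilities by simple counting is an assumption of this sketch.

```python
import math
from collections import Counter

def mutual_information(column, labels):
    """MI(X;C) for one binary attribute column and the class labels,
    with probabilities estimated as relative frequencies."""
    n = len(labels)
    joint = Counter(zip(column, labels))
    px = Counter(column)
    pc = Counter(labels)
    mi = 0.0
    # Only observed (x, c) pairs are visited, so 0 * log 0 terms are skipped.
    for (x, c), count in joint.items():
        p_xc = count / n
        mi += p_xc * math.log(p_xc / ((px[x] / n) * (pc[c] / n)))
    return mi

def select_attributes(vectors, labels, k):
    """Indices of the k attributes with the highest MI scores."""
    n_attrs = len(vectors[0])
    scores = [mutual_information([v[i] for v in vectors], labels)
              for i in range(n_attrs)]
    return sorted(range(n_attrs), key=lambda i: scores[i], reverse=True)[:k]

# Toy data: attribute 0 separates the classes, attribute 1 is uninformative.
vectors = [[1, 0], [1, 1], [0, 0], [0, 1]]
labels = ["spam", "spam", "legit", "legit"]
print(select_attributes(vectors, labels, k=1))
```

In the toy data the first attribute determines the class completely, so it receives the highest score and is the one retained.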

The Naive Bayesian classifier

L->S (legitimate misclassified as spam) and S->L (spam misclassified as legitimate) denote the two error types.

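We assume that L->S is λ times more costly than S->L.

Classify a message as spam if the following classification criterion is met:

$$\frac{P(C = \text{spam} \mid \vec{X} = \vec{x})}{P(C = \text{legitimate} \mid \vec{X} = \vec{x})} > \lambda$$

Equivalently, classify as spam if $P(C = \text{spam} \mid \vec{X} = \vec{x}) > t$, where $t = \lambda / (1 + \lambda)$.

λ = 999 (t = 0.999): mistakenly blocking a legitimate message is taken to be as bad as letting 999 spam messages pass the filter.

λ = 9 (t = 0.9): when a message is blocked, an apology and a riddle are sent back to the sender.

λ = 1 (t = 0.5): the recipient does not care about the extra work imposed on the sender.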
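The cost-sensitive decision can be sketched as follows, writing lam for the cost ratio of the two error types and t = lam / (1 + lam) for the matching probability threshold. This is a minimal illustration, not the authors' code: the function names, the class labels "spam" and "legit", and the add-one (Laplace) smoothing are all assumptions.

```python
import math

def train_naive_bayes(vectors, labels):
    """Estimate P(C=c) and P(X_i=1 | C=c) for binary attributes.
    Add-one smoothing is an assumption of this sketch."""
    n = len(labels)
    n_attrs = len(vectors[0])
    model = {}
    for c in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        prior = len(rows) / n
        p_one = [(sum(r[i] for r in rows) + 1) / (len(rows) + 2)
                 for i in range(n_attrs)]
        model[c] = (prior, p_one)
    return model

def spam_probability(model, x):
    """P(C=spam | X=x) under the naive independence assumption."""
    log_scores = {}
    for c, (prior, p_one) in model.items():
        score = math.log(prior)
        for xi, p in zip(x, p_one):
            score += math.log(p if xi else 1.0 - p)
        log_scores[c] = score
    # Normalize in log space to avoid underflow with many attributes.
    top = max(log_scores.values())
    total = sum(math.exp(s - top) for s in log_scores.values())
    return math.exp(log_scores["spam"] - top) / total

def is_spam(model, x, lam):
    """Block the message only if P(spam | x) exceeds t = lam / (1 + lam)."""
    t = lam / (1.0 + lam)
    return spam_probability(model, x) > t

model = train_naive_bayes([[1, 0], [1, 1], [0, 0], [0, 1]],
                          ["spam", "spam", "legit", "legit"])
for lam in (1, 9, 999):
    print(lam, is_spam(model, [1, 0], lam))
```

With this toy model the message has a spam probability of 0.75, so it is blocked only at the most permissive setting; raising the cost ratio makes the filter more reluctant to block.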

Result

1789 messages, consisting of 211 legitimate messages that users had saved and 1578 spam messages.

In the first experiment, only word attributes were used. In the second, candidate attributes corresponding to phrases were added (e.g. "be over 21", "only $"). In the third, non-textual candidate attributes were added (e.g. whether or not the message contains attachments, or a high proportion of non-alphanumeric characters).

Experiments with the PU1 corpus: 481 spam messages and 618 legitimate messages.

The Naive Bayesian classifier was evaluated with ten-fold cross-validation to reduce random variation, and the results were then averaged over the ten runs.

The number of retained attributes was varied from 50 to 700 in steps of 50. The experiments also involved a lemmatizer and a stop-list.
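The evaluation protocol can be outlined in a short sketch. The helper names are hypothetical, `evaluate` stands in for whatever trains and scores the filter on one fold, and the interleaved fold assignment is an assumption, since the text does not say how the corpus was partitioned.

```python
def ten_fold_splits(n_messages, n_folds=10):
    """Yield (train_indices, test_indices) pairs, one per fold.
    Fold i holds out every n_folds-th message starting at index i."""
    indices = list(range(n_messages))
    for fold in range(n_folds):
        test = indices[fold::n_folds]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

def sweep_attribute_counts(n_messages, evaluate, counts=range(50, 701, 50)):
    """For each number of retained attributes k, average the score
    returned by evaluate(train, test, k) over the ten folds."""
    results = {}
    for k in counts:
        scores = [evaluate(train, test, k)
                  for train, test in ten_fold_splits(n_messages)]
        results[k] = sum(scores) / len(scores)
    return results

# Placeholder scorer: returns the held-out fraction instead of a real metric.
def dummy_evaluate(train, test, k):
    return len(test) / (len(train) + len(test))

results = sweep_attribute_counts(481 + 618, dummy_evaluate)
print(sorted(results))
```

A real `evaluate` would select the k best attributes and train the classifier on the training indices only, then score it on the held-out indices, so that no information from the test fold leaks into attribute selection.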
