data mining & machine learning final project
Post on 24-Feb-2016
DATA MINING & MACHINE LEARNING FINAL PROJECT
Group 2: R95922027 李庭閣, R95922034 孔垂玖, R95922081 許守傑, R95942129 鄭力維
Outline: Experiment setting, Feature extraction, Model training, Hybrid-Model, Conclusion, Reference
Experiment setting
Selected online corpus: Enron
Preprocessing: removing HTML tags and factoring out important headers
Six folders, enron1 to enron6, containing 13,496 spam mails and 15,045 ham mails in total
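The preprocessing step (removing HTML tags, factoring out headers) could look roughly like the sketch below. It uses only the Python standard library; the header selection in `KEPT_HEADERS` and the function names are assumptions made for illustration, not the project's actual code.

```python
from email import message_from_string
from html.parser import HTMLParser

# Assumed selection of "important" headers; the slides do not list them.
KEPT_HEADERS = ("Date", "To", "Subject")


class _TextOnly(HTMLParser):
    """Collects text nodes and drops every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw):
    """Return the text content of `raw` with tags removed and whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(raw)
    parser.close()
    return " ".join("".join(parser.parts).split())


def preprocess(raw_mail):
    """Split a raw mail into selected headers and a tag-free body."""
    msg = message_from_string(raw_mail)
    headers = {name: msg.get(name, "") for name in KEPT_HEADERS}
    payload = msg.get_payload()
    # Multipart mails return a list of parts; this sketch skips them.
    body = payload if isinstance(payload, str) else ""
    return headers, strip_html(body)
```

For a single-part mail, `preprocess` returns a header dictionary plus the body text with all tags stripped.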
Feature Extraction
1. Transmitted Time of the Mail
2. Number of Receivers
3. Existence of Attachment
4. Existence of Images in Mail
5. Existence of Cited URLs in Mail
6. Symbols in Mail Title
7. Mail-body
Transmitted Time of the Mail & Number of Receivers
Transmitted time of spam: non-uniform distribution
Number of receivers of spam: only a single receiver
Probability of being Spam for Transmitted Time & Receiver Size
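\[ P[\mathrm{ham} \mid \mathrm{date} = h] = \frac{P[h \mid \mathrm{ham}]}{P[h \mid \mathrm{ham}] + P[h \mid \mathrm{spam}]} \]
\[ P[\mathrm{ham} \mid \#\ \mathrm{of\ receivers} = r] = \frac{P[r \mid \mathrm{ham}]}{P[r \mid \mathrm{ham}] + P[r \mid \mathrm{spam}]} \]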
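A minimal sketch of this ratio, assuming the class-conditional probabilities are estimated as relative frequencies over the training mails. The function names and the 0.5 fallback for values never seen in training are assumptions, not part of the slides.

```python
from collections import Counter


def likelihoods(values):
    """Relative frequency of each observed value: an estimate of P[value | class]."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def p_ham_given(value, ham_values, spam_values):
    """P[ham | value] = P[value | ham] / (P[value | ham] + P[value | spam])."""
    p_ham = likelihoods(ham_values).get(value, 0.0)
    p_spam = likelihoods(spam_values).get(value, 0.0)
    if p_ham + p_spam == 0.0:
        return 0.5  # value never observed: no evidence either way (assumption)
    return p_ham / (p_ham + p_spam)
```

The same function serves both features: pass sending hours for the transmitted-time feature, or receiver counts for the receiver-size feature.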
Attachment, Images, and URL
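        Attachment   Image     URL
Spam    0.0307%      0.6816%   30.779%
Ham     7.3712%      0%        7.0521%

\[ P(\mathrm{Spam} \mid \mathrm{Mail\ with\ attachment}) = \frac{0.0307}{0.0307 + 7.3712} = 0.004 \]
\[ P(\mathrm{Spam} \mid \mathrm{Mail\ containing\ images}) = 0.999 \]
\[ P(\mathrm{Spam} \mid \mathrm{Mail\ citing\ URLs}) = \frac{30.8}{30.8 + 7.1} = 0.8 \]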
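The arithmetic behind these three probabilities can be checked with a few lines of Python. The rates are the percentages from the slide; the dictionary and function names are illustrative. Note that the plain ratio for images is exactly 1 because the ham rate is 0%, whereas the slide reports 0.999, presumably due to rounding or smoothing that the slides do not show.

```python
# (spam rate %, ham rate %) per feature, as given on the slide.
RATES = {
    "attachment": (0.0307, 7.3712),
    "image": (0.6816, 0.0),
    "url": (30.779, 7.0521),
}


def p_spam_given_feature(spam_rate, ham_rate):
    """P(Spam | feature present) as the spam share of the two rates."""
    return spam_rate / (spam_rate + ham_rate)


if __name__ == "__main__":
    for name, (spam_rate, ham_rate) in RATES.items():
        print(name, round(p_spam_given_feature(spam_rate, ham_rate), 3))
```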
Symbols in Mail Titles
Marks                 Probability of being Spam Mail   Feature Showing Rate
~ ^ | * % [] ! ? =    0.911                            28% in spam
\ / ; &               0.182                            16% in ham

Title absence: spam senders add titles now.
Arabic numerals: almost equal probability in spam and ham (date, ID).
Non-alphanumeric characters & punctuation marks: some appear more often in spam, others appear more often in ham.
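A small sketch of how the two groups of title marks could be turned into features. The mark sets are taken from the slide; the feature names and the presence-only encoding are assumptions.

```python
# Marks the slide associates with spam titles (probability 0.911).
SPAM_MARKS = set("~^|*%[]!?=")
# Marks the slide associates with ham titles (probability 0.182).
HAM_MARKS = set("\\/;&")


def title_mark_features(title):
    """Flag whether a mail title contains any spam-leaning or ham-leaning mark."""
    chars = set(title)
    return {
        "has_spam_mark": bool(chars & SPAM_MARKS),
        "has_ham_mark": bool(chars & HAM_MARKS),
    }
```

For example, `title_mark_features("Free money!!!")` sets only `has_spam_mark`.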
Mail-body
Build the internal structure of words.
Use the NLP tool TreeTagger for word stemming.
Given the stemmed words that appear in each mail, build a sparse-format vector to represent the "semantics" of the mail.
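The sparse representation can be sketched as follows. TreeTagger is an external tool, so `crude_stem` below is only a stand-in suffix stripper used to keep the example self-contained; the tokenizer and all function names are likewise assumptions.

```python
import re
from collections import Counter


def crude_stem(word):
    """Stand-in for TreeTagger: strips a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(mail):
    """Lowercase the mail, keep alphabetic tokens, and stem each one."""
    return [crude_stem(token) for token in re.findall(r"[a-z]+", mail.lower())]


def build_vocabulary(mails):
    """Map every stem seen in the corpus to a column index."""
    vocabulary = sorted({stem for mail in mails for stem in stems(mail)})
    return {stem: index for index, stem in enumerate(vocabulary)}


def to_sparse_vector(mail, vocabulary):
    """Return {index: count} holding only the stems that occur in the mail."""
    counts = Counter(stems(mail))
    return {vocabulary[s]: c for s, c in counts.items() if s in vocabulary}
```

Only non-zero entries are stored, which is what makes the format practical for a vocabulary of many thousands of stems.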
Naïve Bayes
Given a bag of words (x1, x2, x3, …, xn), Naïve Bayes is powerful for document classification.
\[ \log P(x_j \mid C_i) = \log \frac{c(x_j, C_i)}{c(C_i)} = \log c(x_j, C_i) - \log c(C_i) \]
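A minimal Naïve Bayes sketch built on the count-based log-probability. Two points are assumptions rather than something the slides state: c(C_i) is taken as the total word count of class C_i, and add-one smoothing is applied so that unseen words do not produce log 0.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Collect c(x, C) per class, the number of mails per class, and the vocabulary."""
    word_counts = {}
    class_docs = Counter(labels)
    vocabulary = set()
    for words, label in zip(docs, labels):
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return word_counts, class_docs, vocabulary


def classify_nb(words, model):
    """Pick the class maximizing log P(C) + sum_j log P(x_j | C)."""
    word_counts, class_docs, vocabulary = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        total = sum(counts.values())  # c(C_i): all word occurrences in the class
        score = math.log(class_docs[label] / n_docs)
        for word in words:
            # log c(x_j, C_i) - log c(C_i), with add-one smoothing (assumption)
            score += math.log(counts[word] + 1) - math.log(total + len(vocabulary))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Working in log space turns the product of many small probabilities into a sum, which avoids numerical underflow on long mails.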
Vector Space Model
Create a word-document (mail) matrix with SRILM.
For every pair of mails (columns), a similarity value can be calculated.
[Figure: M × N word-document matrix with rows w1 … wM (words), columns d1 … dN (mails), and entries w_ij]
\[ w_{ij} = \frac{c_{ij}}{n_j} \]
\[ \mathrm{similarity}(d_i, d_j) = \frac{d_i^{T} d_j}{\lVert d_i \rVert \cdot \lVert d_j \rVert} \]
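The weighting and the cosine similarity between two mails can be sketched as below, with each mail stored as a sparse dictionary. The slides do not define n_j; the sketch assumes it is the total number of word occurrences in mail j. Function names are illustrative.

```python
import math


def normalize_column(counts):
    """w_ij = c_ij / n_j for one mail, assuming n_j is the mail's total word count."""
    n_j = sum(counts.values())
    if n_j == 0:
        return {}
    return {word: count / n_j for word, count in counts.items()}


def cosine_similarity(d_i, d_j):
    """Dot product of two sparse vectors divided by the product of their norms."""
    dot = sum(value * d_j.get(word, 0.0) for word, value in d_i.items())
    norm_i = math.sqrt(sum(value * value for value in d_i.values()))
    norm_j = math.sqrt(sum(value * value for value in d_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # an empty mail is treated as similar to nothing
    return dot / (norm_i * norm_j)
```

Because cosine similarity is invariant to vector length, a long mail and a short mail with the same word proportions score as identical.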
KNN (Vector Space Model)
With K = 1, the KNN classification model shows the best accuracy.
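A minimal KNN sketch over sparse mail vectors, using cosine similarity as the closeness measure. The data layout (a list of `(vector, label)` pairs) and the function names are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse {word: weight} vectors."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, training, k=1):
    """Label the query by majority vote among its k most similar training mails."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

With `k=1` the query simply inherits the label of its single most similar training mail; larger values of `k` let less similar mails outvote that nearest neighbor.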
Maximum Entropy
Maximize the entropy and minimize the Kullback-Leibler distance between the model and the real distribution.
The elements of the word-document matrix are converted to binary values {0, 1}.
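For two classes, a maximum entropy classifier over binary word features has the same form as logistic regression, so a tiny stand-in can be sketched without the external toolkit. This is a swapped-in illustration, not the toolkit's algorithm: the stochastic gradient training, the hyperparameters, and the function names are all assumptions. Each mail is a set of words, which is the {0, 1} representation.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_maxent(samples, labels, epochs=200, lr=0.5):
    """Fit per-word weights by stochastic gradient ascent on the log-likelihood.

    samples: list of sets of words (binary features); labels: 1 = spam, 0 = ham.
    """
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            z = bias + sum(weights.get(word, 0.0) for word in features)
            error = label - _sigmoid(z)  # gradient of the log-likelihood
            bias += lr * error
            for word in features:
                weights[word] = weights.get(word, 0.0) + lr * error
    return weights, bias


def p_spam(features, model):
    """Model probability that a mail with these binary features is spam."""
    weights, bias = model
    return _sigmoid(bias + sum(weights.get(word, 0.0) for word in features))
```

Words unseen in training contribute nothing, so a mail made only of unknown words falls back to the bias term.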
SVM
Binary: use a binary value {0, 1} to indicate whether a word appears or not.
Normalized: count the occurrences of each word and divide each count by the word's maximum occurrence count.
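The two SVM input representations can be sketched as follows. The slide does not say where the maximum is taken; this sketch assumes it is each word's maximum count over the training mails. Function names are illustrative.

```python
def binary_features(counts):
    """1 for every word that appears in the mail; absent words are implicitly 0."""
    return {word: 1 for word, count in counts.items() if count > 0}


def max_counts_per_word(all_counts):
    """Largest count of each word over all training mails (assumed normalizer)."""
    maxima = {}
    for counts in all_counts:
        for word, count in counts.items():
            maxima[word] = max(maxima.get(word, 0), count)
    return maxima


def normalized_features(counts, maxima):
    """Each word count divided by that word's maximum count, giving values in (0, 1]."""
    return {word: count / maxima[word] for word, count in counts.items() if count > 0}
```

Both outputs stay sparse, so they can be written directly in the `index:value` style used by SVM tools once words are mapped to indices.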
Single-layered-perceptron Hybrid Model
[Figure: single-layer perceptron. Input layer: outputs of Naïve Bayes, KNN, and Maximum Entropy. Output layer.]
The accuracy of the NN-based hybrid model is always the highest.
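A minimal sketch of a single-layer perceptron that combines the three base classifiers. Each input is assumed to be a tuple of 0/1 decisions (Naïve Bayes, KNN, Maximum Entropy) with 1 meaning spam; the classic perceptron update rule, the learning rate, and the toy data in the usage note are assumptions, since the slides do not give training details.

```python
def train_perceptron(inputs, targets, epochs=50, lr=1.0):
    """Learn one weight per base classifier plus a bias with the perceptron rule."""
    weights = [0.0] * len(inputs[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(inputs, targets):
            activation = bias + sum(w * v for w, v in zip(weights, x))
            output = 1 if activation > 0 else 0
            error = target - output  # 0 when correct, so no update
            bias += lr * error
            weights = [w + lr * error * v for w, v in zip(weights, x)]
    return weights, bias


def predict(x, model):
    """Final spam (1) / ham (0) decision from the base classifiers' outputs."""
    weights, bias = model
    return 1 if bias + sum(w * v for w, v in zip(weights, x)) > 0 else 0
```

Unlike a fixed vote, the learned weights let the combiner trust a reliable base classifier more than an unreliable one. For instance, if the true label always follows the first classifier in the training data, the perceptron learns to follow it even when the other two disagree.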
[Figure: Mail (bag of words) → Naïve Bayes, K-nearest neighbor, Maximum entropy → Decision maker (committee)]
Committee-based Hybrid-model
The voting model averages the classification results, which slightly improves the filter. However, voting can sometimes reduce accuracy when the majority misjudges a mail.
1. KNN + Naïve Bayes + Maximum Entropy
2. Naïve Bayes + Maximum Entropy + SVM
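The committee decision can be sketched as a plain majority vote over three members. This assumes each member outputs a hard "spam" or "ham" label; the function name is illustrative.

```python
from collections import Counter


def committee_vote(predictions):
    """Return the label chosen by most committee members."""
    votes = Counter(predictions)
    return votes.most_common(1)[0][0]
```

With three members there is never a tie between two labels. The failure mode is also easy to see: `committee_vote(["spam", "spam", "ham"])` returns `"spam"` even when the single dissenting member was the correct one.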
Conclusion
7 features are shown to discriminate mail type:
  Transmitted Time & Receiver Size
  Attachment, Image, and URL
  Non-alphanumeric Character & Punctuation Marks
5 popular machine learning methods are shown to be suitable for spam filtering:
  Naïve Bayes, KNN, SVM
2 model combination methods are tested:
  Committee-based & Single Neural Network
Reference
[1] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian Approach to Filtering Junk E-Mail," in Proc. AAAI 1998, Jul. 1998.
[2] A plan for spam: http://www.paulgraham.com/spam.html
[3] Enron Corpus: http://www.aueb.gr/users/ion/
[4] Treetagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
[5] Maximum Entropy: http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html
[6] SRILM: http://www.speech.sri.com/projects/srilm/
[7] SVM: http://svmlight.joachims.org/