
Empower Public Health through Social Media
Zhen Wang, Ph.D., Insight Health Data Science

http://54.191.168.240

Text → Cleaning, Tokenizing → Convert to Feature Vectors

"I like food!"
"Food is good!"
"I had some good food."

i, like, food
food, is, good
i, had, some, good, food

e.g., TF-IDF

I'm really good with numbers!

        i   like   food   is   good   had   some
Doc 1   1    1      1     0     0      0     0
Doc 2   0    0      1     1     1      0     0
Doc 3   1    0      1     0     1      1     1
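As a concrete illustration, here is a minimal sketch (assuming scikit-learn 1.x; not the project's actual code) that reproduces this count matrix. The token pattern is widened so the one-letter word "i" survives, and note that scikit-learn orders the vocabulary alphabetically rather than by first appearance as on the slide.

# Bag-of-words sketch for the three toy documents above
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I like food!", "Food is good!", "I had some good food."]
vectorizer = CountVectorizer(lowercase=True, token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['food' 'good' 'had' 'i' 'is' 'like' 'some']
print(X.toarray())                         # 3 documents x 7 terms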

Downweight, Normalize → Numbers → Machine Learning
(Natural Language Processing: turning text into numbers for machine learning)

Text Classification

[Figure: distribution of tweets, number of tweets vs. normalized retweet counts, showing sample imbalance]

Classification (0/1: Not Retweeted / Retweeted) with Logistic Regression

Threshold: 0.005

Misclassification Error: 22%

Train/test split, with downsampling of the majority class in the training set.

Normalized confusion matrix (rows: true class 0/1, columns: predicted class 0/1):

         Pred 0   Pred 1
True 0    0.81     0.19
True 1    0.26     0.74

Code: github.com/zweinstein/SpreadHealth_dev
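A hedged sketch of this training and evaluation flow (X and y_counts are placeholder names for the tf-idf matrix and the normalized retweet counts; the actual code is in the repository above):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# X, y_counts: tf-idf matrix and normalized retweet counts (placeholders)
y = (y_counts > 0.005).astype(int)   # 0/1: not retweeted / retweeted

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Downsample the majority class in the training set only;
# the test set keeps its natural class imbalance.
rng = np.random.default_rng(0)
idx0, idx1 = np.where(y_train == 0)[0], np.where(y_train == 1)[0]
n = min(len(idx0), len(idx1))
keep = np.concatenate([rng.choice(idx0, n, replace=False),
                       rng.choice(idx1, n, replace=False)])

clf = LogisticRegression().fit(X_train[keep], y_train[keep])

cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm / cm.sum(axis=1, keepdims=True))  # row-normalized confusion matrix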

Zhen (Jen) Wang

Beta Tester since 2015
Editor since 2015

Interests: Traditional Medicine, Science Fiction, Public Speaking, Online Education

Ph.D. in Physical Chemistry

Thank you!

See the App in Action: http://54.191.168.240

Text Preprocessing Pipeline

Text Cleaning (see the sketch below):
- Convert to lower case
- Replace URLs, #hashtags, and @mentions
- Remove special characters other than emoticons
- Remove stopwords
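A sketch of these cleaning steps (the regexes and replacement tokens are illustrative assumptions, not the project's exact ones; NLTK's English stopword list is assumed to be downloaded):

import re
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))
EMOTICON_RE = re.compile(r"[:;=]-?[()DPp]")

def clean_tweet(tweet):
    emoticons = EMOTICON_RE.findall(tweet)         # set emoticons aside first
    text = tweet.lower()                           # convert to lower case
    text = re.sub(r"https?://\S+", " url ", text)  # replace URLs
    text = re.sub(r"#(\w+)", r" \1 ", text)        # replace #hashtags
    text = re.sub(r"@\w+", " user ", text)         # replace @mentions
    text = re.sub(r"[^a-z0-9\s]", " ", text)       # drop other special chars
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens + emoticons)            # re-attach emoticons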

Tokenizing:
- Split each document into individual elements (Bag-of-Words or N-grams)
- Stemming: the Porter stemmer was used; the Snowball and Lancaster stemmers are faster but more aggressive
- Lemmatization is computationally more expensive but has little impact on the performance of text classification

Term Frequency-Inverse Document Frequency (tf-idf):
- Term frequency tf(t,d): the number of times a term t occurs in a document d
- Document frequency df(t,d): the number of documents that contain the term t
- The inverse document frequency downweights frequently occurring words in the feature vectors tf(t,d)
- The implementation in Scikit-learn was used
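A sketch wiring the Porter stemmer into scikit-learn's tf-idf (the exact vectorizer settings here are assumptions; cleaned_tweets is a placeholder for the output of the cleaning step). With scikit-learn's default smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents.

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def stem_tokens(doc):
    # split on whitespace, then reduce each token to its Porter stem
    return [stemmer.stem(tok) for tok in doc.split()]

# tf-idf reweights raw counts tf(t,d) by idf(t) to downweight common terms
tfidf = TfidfVectorizer(tokenizer=stem_tokens, lowercase=True)
X = tfidf.fit_transform(cleaned_tweets)  # cleaned_tweets: placeholder corpus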

Lastly, apply L2 normalization to each feature vector.
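For completeness, the L2 step on its own, scaling each document vector v to v / ||v||_2 (TfidfVectorizer already applies this by default via norm="l2"):

from sklearn.preprocessing import normalize

X_norm = normalize(X, norm="l2", axis=1)  # each document vector gets unit length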

Train Dataset: 10,000 tweets on diabetes (4,782 retweeted)

Test Set Accuracy (random chance: 0.49 on the positive class):
- KNN: 60%
- Naive Bayes: 67%
- Logistic Regression: 75% (chosen, and tested on imbalanced test data)

Potential Improvements:
- Decision trees with bagging/boosting (e.g., Random Forest, XGBoost)
- Other features: polarity & sentiment, tweet length
- Out-of-core incremental learning with Stochastic Gradient Descent (an advantage of Logistic Regression; see the sketch after this list)
- Automatic updates to the SQLite database and to the classifier's prediction algorithms
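A sketch of the out-of-core idea from the list above: logistic regression trained incrementally via SGD, so freshly labeled tweets (e.g., pulled from the SQLite database) can update the classifier without retraining from scratch. stream_batches is a hypothetical batch generator, and loss="log_loss" assumes scikit-learn >= 1.1 (older versions use loss="log").

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer is stateless, so new batches need no vocabulary refit
vec = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = SGDClassifier(loss="log_loss")  # logistic regression fitted by SGD

for texts, labels in stream_batches():  # hypothetical source of (texts, labels)
    clf.partial_fit(vec.transform(texts), labels, classes=[0, 1])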