popularity of online news article

Online News Popularity Dataset PRESENTED BY Sumit Kumar Saini, Shivali Advilkar, Chengdong Ben, Hebatalla Zaky, Manan Patel

Upload: sumit-saini

Post on 20-Jan-2017


Page 1: Popularity of Online News Article

Online News Popularity Dataset

PRESENTED BY Sumit Kumar Saini, Shivali Advilkar, Chengdong Ben, Hebatalla Zaky, Manan Patel

Page 2: Popularity of Online News Article

01

Introduction

Page 3: Popularity of Online News Article

Introduction

• Created to analyze how the number of shares depends on the attributes and to predict whether an article will be popular on the internet.
• 39,644 observations
• 61 attributes
• Mashable website: collected over a 2-year period from Jan 2013 to Jan 2015
• No missing values, but some topics were unclassified
• Target: number of shares

Page 4: Popularity of Online News Article

02

Data Set Introduction

Page 5: Popularity of Online News Article

Data Set Introduction

Data accuracy (Data Set vs. Website)

843,330 shares: 12 videos (Data Set), 128 videos (Website)

792 shares: 0 videos (Data Set), 12 videos (Website)

Page 6: Popularity of Online News Article

Attributes

Page 7: Popularity of Online News Article

LDA

The Latent Dirichlet Allocation algorithm was applied to all Mashable texts (known before publication) to first identify the top five relevant topics and then measure the closeness of each article to those topics.
• They were named LDA_00 … LDA_04 (undefined topics)
• LDAs add up to one per observation
• Maximum LDA impurity → overall low shares
• Mean shares: 1,660 (maximum impurity) vs 3,395 (all articles)
• Median shares: 1,100 (maximum impurity) vs 1,400 (all articles)
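The two properties listed for the LDA attributes (the weights sum to one per article, and "impurity" is highest when no topic dominates) can be illustrated with a short sketch. The slides do not define impurity, so Shannon entropy is used here as an assumed stand-in; the function name and the example weights are hypothetical.

```python
import math

def lda_entropy(weights):
    """Shannon entropy (in nats) of one article's LDA topic weights.

    Assumption: entropy stands in for the slides' undefined "impurity".
    """
    return -sum(w * math.log(w) for w in weights if w > 0)

# Hypothetical LDA_00..LDA_04 weights for two articles.
focused = [0.92, 0.02, 0.02, 0.02, 0.02]  # one dominant topic
mixed = [0.20, 0.20, 0.20, 0.20, 0.20]    # no dominant topic

for weights in (focused, mixed):
    # Topic weights form a probability distribution over the five topics.
    assert math.isclose(sum(weights), 1.0)

max_entropy = math.log(5)  # upper bound for five topics
print(lda_entropy(focused) < lda_entropy(mixed))      # True
print(math.isclose(lda_entropy(mixed), max_entropy))  # True
```

Under this assumed measure, an article with equal weight on all five topics sits at the maximum, which matches the "maximum impurity" case the slide associates with lower shares.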

Page 8: Popularity of Online News Article

03

Data Modification And Models

Page 9: Popularity of Online News Article

Data Modification: Recoding

Data channel

0 Viral
1 Lifestyle
2 Entertainment
3 Business
4 Social Media
5 Technology
6 World

Date of publication

1 Monday
2 Tuesday
3 Wednesday
4 Thursday
5 Friday
6 Saturday
7 Sunday
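A minimal sketch of how such a recoding could be done, assuming the raw data stores channel and weekday as 0/1 indicator columns. The column names below and the choice to use code 0 (Viral) for rows with no channel flag are assumptions, not taken from the slides; the `recode` helper is hypothetical.

```python
# Assumed indicator column names; the slides only give the resulting codes.
CHANNEL_CODES = {
    "data_channel_is_lifestyle": 1,
    "data_channel_is_entertainment": 2,
    "data_channel_is_bus": 3,
    "data_channel_is_socmed": 4,
    "data_channel_is_tech": 5,
    "data_channel_is_world": 6,
}
WEEKDAY_CODES = {
    "weekday_is_monday": 1,
    "weekday_is_tuesday": 2,
    "weekday_is_wednesday": 3,
    "weekday_is_thursday": 4,
    "weekday_is_friday": 5,
    "weekday_is_saturday": 6,
    "weekday_is_sunday": 7,
}

def recode(row, codes, default):
    """Return the code of the first indicator column set to 1, else default."""
    for column, code in codes.items():
        if row.get(column, 0) == 1:
            return code
    return default

# Hypothetical article: Technology channel, published on a Friday.
row = {"data_channel_is_tech": 1, "weekday_is_friday": 1}
summary_channel_value = recode(row, CHANNEL_CODES, default=0)  # 0 = Viral (assumed fallback)
summary_weekday = recode(row, WEEKDAY_CODES, default=None)
print(summary_channel_value, summary_weekday)  # 5 5
```

Collapsing the indicators into one column keeps the attribute count down, at the cost of imposing an arbitrary numeric order on categories.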

Page 10: Popularity of Online News Article

Conference Paper
• Shares: Max: 843,300, Mean: 3,395.380, Standard deviation: 11,626.951, Median: 1,400
• Attribute popularity: shares <= 1,400 unpopular; shares > 1,400 popular
• Avoided dealing with a class imbalance problem
• Made it into a binary problem: Popular or Unpopular
• AUC = 0.73
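The thresholding described for the conference paper can be sketched as follows. The share counts are hypothetical illustration values, and `label_popularity` is a made-up helper; only the 1,400 cut-off comes from the slide.

```python
from collections import Counter

THRESHOLD = 1400  # median number of shares reported on the slide

def label_popularity(shares, threshold=THRESHOLD):
    """Label each article 'popular' if shares > threshold, else 'unpopular'."""
    return ["popular" if s > threshold else "unpopular" for s in shares]

# Hypothetical share counts, including one extreme outlier.
shares = [593, 711, 1100, 1400, 1500, 2300, 5600, 843300]

counts = Counter(label_popularity(shares))
print(counts["popular"], counts["unpopular"])  # 4 4
```

Because the cut-off sits at the median, the two classes come out roughly equal in size, which is why the split sidesteps class imbalance even though the raw share counts are heavily skewed.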

Page 11: Popularity of Online News Article

Model 1

• 1500 trees
• All attributes

Page 12: Popularity of Online News Article

Models - Chosen Attributes

Subjective Opinion | Random Forest Importance | Highly Correlated (w/ shares)

• n_tokens_title
• n_tokens_content
• average_token_length
• summary_channel_value
• summary_weekday
• LDA_00
• LDA_01
• LDA_02
• LDA_03
• LDA_04
• global_subjectivity
• global_sentiment_polarity
• global_rate_positive_words
• global_rate_negative_words
• title_subjectivity
• title_sentiment_polarity

• LDA_03
• LDA_02
• kw_max_avg
• kw_avg_avg
• summary_channel_value
• self_reference_min_shares
• self_reference_avg_shares
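One way a "Highly Correlated (w/ shares)" list could be produced is to rank attributes by the absolute Pearson correlation with the number of shares. The sketch below assumes that approach; the slides do not say how the list was actually built, the helper names are hypothetical, and the numbers are toy values for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def rank_by_correlation(columns, target):
    """Attribute names sorted by |correlation with target|, strongest first."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)

# Toy values for five hypothetical articles (not taken from the dataset).
shares = [500, 900, 1400, 2600, 7000]
columns = {
    "n_tokens_title": [10, 12, 9, 11, 10],
    "kw_avg_avg": [1800, 2400, 2900, 3600, 5200],
    "LDA_02": [0.60, 0.45, 0.30, 0.20, 0.05],
}
print(rank_by_correlation(columns, shares))
# ['kw_avg_avg', 'LDA_02', 'n_tokens_title']
```

Ranking by absolute value keeps strongly negative relationships in the list, since an attribute that lowers shares is as informative as one that raises them.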

Page 13: Popularity of Online News Article

Models - Chosen Attributes

Random Forest Importance

R2: -1.376

Highly Correlated (w/ shares)

R2: 0.01434

R2: 0.0148

Subjective Opinion
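A negative R2 such as -1.376 is possible because R2 = 1 - SS_res / SS_tot compares the model's squared error with that of always predicting the mean; it drops below zero whenever the model does worse than that baseline. A minimal sketch with hypothetical values (the `r_squared` helper is illustrative, not the presenters' code):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical share counts; their mean is 3000.
y_true = [800, 1200, 1400, 2000, 9600]

baseline = [3000] * len(y_true)           # always predict the mean
poor_model = [9000, 500, 7000, 100, 1000]  # predictions far from the truth

print(r_squared(y_true, baseline))        # 0.0
print(r_squared(y_true, poor_model) < 0)  # True
```

Values close to zero, like the 0.01434 and 0.0148 reported on this slide, mean the model explains almost none of the variance in shares beyond what the mean already captures.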

Page 14: Popularity of Online News Article

04

Data Insights

Page 15: Popularity of Online News Article

Data Insights

Publication Day:
Most articles published - Tuesday, Wednesday, and Thursday.
Least articles published - Weekends.

Channel:
Most popular topic is Viral, followed by Tech and Business.
Least popular topic is Social Media.

No. of keywords: Generally between 5 and 10.

Page 16: Popularity of Online News Article

Challenges

Page 17: Popularity of Online News Article

Challenges

• Understanding the variables: what is LDA topic #, sentiment, polarity, keywords

• Finding relation among attributes and which attributes are important for modelling.

• Numbers in dataset vs. numbers on Mashable: shares, videos, images

• Can’t do boosting because we don’t have a binary outcome

Page 18: Popularity of Online News Article

Recommendations

Page 19: Popularity of Online News Article

Recommendations

For Mashable
• Publish during the week rather than on the weekend
• Publish about world, technology, and business, and avoid social media articles
• Publish articles closer to the topic (minimize impurity)

For Researchers
• Always identify your attributes
• Collect data ethically and accurately
• To get more accurate results, get data about the number of likes and comments, the number of tweets or hashtags, and the number of URL mentions, and understand the source of shares

Page 20: Popularity of Online News Article

Conclusion

Page 21: Popularity of Online News Article

Conclusion

● R2 is very small regardless of the model
● Using all attributes is the best combination
● Removing attributes, changing the number of trees, and changing the classifier do not improve the R2 value

Page 22: Popularity of Online News Article

THANK YOU!

PRESENTED BY Sumit Kumar Saini, Shivali Advilkar, Chengdong Ben, Hebatalla Zaky, Manan Patel