Introduction to data mining

Big Data: Text Mining, Web Mining, Social Network Analysis (case study on Twitter). Taha Mokfi, Department of Statistics, University of Central Florida. https://www.linkedin.com/in/tahamokfi

Upload: taha-mokfi

Post on 21-Jan-2017


TRANSCRIPT

Page 1: Introduction to data mining

Big Data: Text Mining, Web Mining, Social Network Analysis

Taha Mokfi

Department of Statistics

University of Central Florida

(Case study on Twitter)

https://www.linkedin.com/in/tahamokfi

Page 2: Introduction to data mining

Taha Mokfi

B.S. in Industrial Engineering
M.S. in Statistical Computing, Data Mining (UCF)
4 years of experience in analytics
Proficient in data mining concepts
Published a few papers in journals and conferences
Professional experience with RapidMiner, IBM Modeler, R, and Weka
Working experience with KNIME and SAS
Familiar with Python and Java

Page 3: Introduction to data mining

How much data do we have?

In 2014, every minute of the day:
Email users sent 204,000,000 messages
Google received over 4,000,000 search queries
Facebook users shared 2,460,000 pieces of content
Twitter users tweeted 277,000 times
Apple users downloaded 48,000 apps

By 2017, more than 30% of enterprise access to broadly based big data will be via intermediary data broker services, serving context to business decisions.

Movie 1, Movie 2

Page 4: Introduction to data mining

New sources of data

Web Browsers
Search Engines
Smartphones, Apps
Web Entertainment
Social Networks
Banks, Insurance, Hospitals and …

Page 5: Introduction to data mining

Different sources of data

Page 6: Introduction to data mining

Data mining, Big data, Data science?

Data mining: an interdisciplinary field that brings together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the problem of extracting information from large databases.

Big data: a broad term for data sets so large or complex that traditional data processing applications are inadequate.

Data science: C. F. Jeff Wu, a statistician, used the term in his 1997 talk “Statistics = Data Science?”. Definition: processes and systems to extract knowledge or insights from data.

Page 7: Introduction to data mining

Big data and Data mining

Page 8: Introduction to data mining

History of Big Data

1974 - Peter Naur - published Concise Survey of Computer Methods in Sweden and the United States

1977 - John W. Tukey - published Exploratory Data Analysis

1977 - The International Association for Statistical Computing (IASC) - its mission includes linking traditional statistical methodology and modern computer technology

1989 - Gregory Piatetsky-Shapiro - organized the first Knowledge Discovery in Databases (KDD) workshop

Page 9: Introduction to data mining

History of Big Data

Page 10: Introduction to data mining

History of Big Data

First data mining paper: From Data Mining to Knowledge Discovery in Databases, 1996 (7,179 citations)

Usama Fayyad - PhD, computer science and mathematics - University of Michigan

Gregory Piatetsky-Shapiro - PhD, computer science and mathematics - New York University

Padhraic Smyth - Professor in the Department of Computer Science with a joint appointment in Statistics - University of California, Irvine

Page 11: Introduction to data mining

Big data job trend

Source: WANTED Analytics, a CEB company. The company maintains a database of more than one billion job listings and collects hiring trend data from more than 150 countries.

Page 12: Introduction to data mining

What should a data scientist know?

Page 13: Introduction to data mining

What should a data scientist know?

Math (e.g. linear algebra, calculus)
Statistics (e.g. probability, hypothesis testing, and summary statistics)
Machine learning tools and techniques (e.g. k-nearest neighbors, random forests, ensemble methods)
Data mining
Data cleaning
Data visualization and reporting techniques
R and/or SAS languages
Unstructured data techniques

More computer skills:
SQL databases and database querying languages
Python (most common), C/C++, Java, Perl
Big data platforms like Hadoop, Hive, and Pig
Cloud tools like Amazon S3

Page 14: Introduction to data mining

Tools, Tools, Tools…

SAS / SAS EM
RapidMiner
KNIME
R
Weka
IBM Modeler
Excel, Oracle, SQL, …
Python
Matlab

Page 15: Introduction to data mining

Tools, Tools, Tools…

Page 16: Introduction to data mining

Tools, Tools, Tools…

Page 17: Introduction to data mining

What is the difference?

Page 18: Introduction to data mining

Big Data Specific Tools

Apache Hadoop: open-source software for distributed processing of very large data sets on computer clusters. Its processing model is based on MapReduce.

Apache Spark: an open-source cluster computing engine that generalizes the MapReduce paradigm with in-memory processing and can run up to 100 times faster than Hadoop MapReduce for certain applications.

Page 19: Introduction to data mining

MapReduce Concept
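
The MapReduce idea can be illustrated with the classic word-count task. The sketch below runs in a single Python process, so it only mimics the map, shuffle, and reduce phases that Hadoop would distribute across a cluster; the function names and sample documents are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Hypothetical input: each string stands for one document on one node.
counts = word_count(["big data big tools", "data mining tools"])
print(counts)
```

On a real cluster each call to map_phase would run on the node that holds the document, and the shuffle step would move the pairs over the network to the reducers.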

Page 20: Introduction to data mining

Text Mining

Definition: the process of deriving high-quality information from text.
According to Merrill Lynch, around 80-90% of all potentially usable business information may originate in unstructured form.

Typical text mining applications:
text classification
text clustering
concept/entity extraction
sentiment analysis
anomaly detection
document summarization, …

Natural language processing (NLP): concerned with the interactions between computers and human (natural) languages.

Page 21: Introduction to data mining

Sentiment Analysis

Sentence                          The  weather  is  very  good  today  So  bad  It  raining  Mood
The weather is very good today.   1    1        1   1     1     1      0   0    0   0        Good
So bad, It is raining.            0    0        1   0     0     0      1   1    1   1        Bad
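
The term table for the two example sentences can be reproduced with a few lines of Python. This is a minimal sketch, assuming a simple lowercase, letters-only tokenizer; the helper name tokenize is illustrative and not taken from the slides.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

docs = ["The weather is very good today.", "So bad, It is raining."]
labels = ["Good", "Bad"]

# Vocabulary in order of first appearance, as in the table.
vocab = []
for doc in docs:
    for term in tokenize(doc):
        if term not in vocab:
            vocab.append(term)

# Binary document-term matrix: 1 if the term occurs in the sentence.
matrix = [[1 if term in set(tokenize(doc)) else 0 for term in vocab]
          for doc in docs]

for row, label in zip(matrix, labels):
    print(row, label)
```

Each row, together with its mood label, is one training example for a text classifier.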

Page 22: Introduction to data mining

Typical Text Mining process

Transform cases
Tokenizing
Filter most frequent and least frequent words
Filter stop words
Stemming
Generate n-grams
Concept extraction
Other preparations (length, replace, remove, etc.)
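
Several of these steps can be sketched in plain Python. The stop-word list and the suffix-stripping stemmer below are deliberately tiny stand-ins (a real pipeline would use a full stop-word list and a proper stemmer such as Porter's), and frequency filtering and concept extraction are left out; all names are assumptions made for illustration.

```python
import re

# Tiny illustrative stop-word list, not a complete one.
STOP_WORDS = {"the", "is", "are", "a", "an", "and", "it", "so", "of", "to"}

def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one n-gram."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def preprocess(text, n=2):
    tokens = re.findall(r"[a-z]+", text.lower())         # transform cases, tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # filter stop words
    stems = [crude_stem(t) for t in tokens]              # stemming
    return stems + ngrams(stems, n)                      # add n-grams

result = preprocess("The dogs are chasing a boy")
print(result)
```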

Page 23: Introduction to data mining

Natural Language Processing (NLP)

Example sentence: A dog is chasing a boy on the playground

Lexical analysis (part-of-speech tagging): A/Det dog/Noun is/Aux chasing/Verb a/Det boy/Noun on/Prep the/Det playground/Noun

Syntactic analysis (parsing): Noun Phrase (A dog), Complex Verb (is chasing), Noun Phrase (a boy), Prep Phrase (on the playground); these combine into Verb Phrases and then the Sentence

Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).

Inference: Scared(x) if Chasing(_,x,_). Combined with the facts above, this gives Scared(b1).

Pragmatic analysis (speech act): A person saying this may be reminding another person to get the dog back…
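
The inference step can be mimicked with a very small fact base. The sketch below stores each fact as a tuple and hard-codes the single rule Scared(x) if Chasing(_, x, _); the tuple representation and function name are assumptions, not a general logic engine.

```python
# Facts from the semantic analysis, stored as (predicate, arg1, ...) tuples.
facts = {
    ("Dog", "d1"),
    ("Boy", "b1"),
    ("Playground", "p1"),
    ("Chasing", "d1", "b1", "p1"),
}

def infer_scared(known_facts):
    """Apply the rule Scared(x) if Chasing(_, x, _) to every known fact."""
    derived = set()
    for fact in known_facts:
        if fact[0] == "Chasing" and len(fact) == 4:
            derived.add(("Scared", fact[2]))  # x is the second argument
    return derived

new_facts = infer_scared(facts)
print(new_facts)
```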

Page 24: Introduction to data mining

Web Mining

Definition: the application of data mining and analytics techniques to discover patterns from the World Wide Web.

Applications:
Web usage mining
Search engine optimization
Web site classification and clustering

Web scraping: a software technique for extracting information from websites.
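
As a minimal sketch of web scraping, the example below pulls link targets and link texts out of an HTML page with Python's standard html.parser module. To stay self-contained it parses an inline sample page instead of downloading one; the sample HTML, class name, and paths are hypothetical.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# Hypothetical page content; a real scraper would fetch this over HTTP.
SAMPLE_HTML = (
    '<html><body><h1>Hotels</h1>'
    '<a href="/hotel/1">Sea View</a> <a href="/hotel/2">City Inn</a>'
    '</body></html>'
)

parser = LinkExtractor()
parser.feed(SAMPLE_HTML)
links = parser.links
print(links)
```

Before scraping a live site, check its terms of use and robots.txt.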

Page 25: Introduction to data mining

Social Network Analysis

Social network analysis (SNA): the process of investigating social structures through the use of network and graph theories, or by analyzing shared text and media.

Extracting data from social networks:
Using APIs (very restricted)
Candy Crush installs: 500,000,000
Scraping the Web
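
One basic graph measure used in SNA is degree centrality: the share of other members each member is directly connected to. The sketch below computes it for a small undirected network; the edge list and names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical friendship ties between four users.
edges = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"), ("bob", "cat")]

def degree_centrality(edge_list):
    """Return degree / (n - 1) for each node of an undirected graph."""
    neighbors = defaultdict(set)
    for u, v in edge_list:
        neighbors[u].add(v)
        neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

centrality = degree_centrality(edges)
most_central = max(centrality, key=centrality.get)
print(most_central, centrality)
```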

Page 26: Introduction to data mining

Case Study (Twitter Mining - "Paris Attacks")

Has been viewed 1,400 times, shared and cited by KDnuggets

Twitter API with R
November 15, 2015 (two days after the incident)
200,000 English tweets
Hashtags including #ISIS, #Syria, #SaudiArabia, #Iraq, and #Muslims
30,000 tweets collected with the ParisAttacks hashtag for visualization
Jeffrey Breen's method for sentiment analysis: tries to score each tweet using a scoring function based on pools of negative and positive words
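
A word-pool scoring function of this kind can be sketched as follows. The case study used R; this Python version only shows the idea, counting positive matches minus negative matches. The two mini-lexicons and the sample tweets are hypothetical, and the sign convention (positive minus negative) is an assumption about how the score is defined.

```python
import re

# Hypothetical mini-lexicons; a real analysis would load full word lists.
POSITIVE = {"good", "great", "safe", "love", "peace"}
NEGATIVE = {"bad", "attack", "terror", "hate", "fear"}

def sentiment_score(text, positive=POSITIVE, negative=NEGATIVE):
    """Score = number of positive words minus number of negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in positive for word in words)
    neg = sum(word in negative for word in words)
    return pos - neg

# Made-up example tweets, not data from the case study.
tweets = [
    "Peace and love for Paris",
    "Terror attack spreads fear",
    "Watching the news",
]
scores = [sentiment_score(t) for t in tweets]
print(scores)
```

Flipping the sign of the result would turn it into a negativity score, where larger values mean more negative tweets.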

Page 27: Introduction to data mining

Case Study (Twitter Mining - "Paris Attacks")

Here a higher score means higher negativity in the tweets that include each hashtag.

Page 28: Introduction to data mining

Case Study (Twitter Mining - "Paris Attacks")

Hashtag Word Clouds

Page 29: Introduction to data mining

Case Study (Twitter Mining - "Paris Attacks")

Hashtags graph: graph analysis and clustering technique
Different colors = different clusters
Font size = frequency
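
A hashtag graph like this can be built from co-occurrence: two hashtags are linked when they appear in the same tweet. The sketch below counts hashtag frequencies (the font size) and co-occurrence edges, then groups hashtags into connected components as a crude stand-in for clustering. The slides do not say which clustering technique was used, so that choice, the function names, and the sample tweets are all assumptions.

```python
import re
from collections import Counter
from itertools import combinations

def hashtag_graph(tweets):
    """Return hashtag frequencies and co-occurrence edge counts."""
    freq, edges = Counter(), Counter()
    for tweet in tweets:
        tags = sorted({t.lower() for t in re.findall(r"#(\w+)", tweet)})
        freq.update(tags)
        edges.update(combinations(tags, 2))
    return freq, edges

def connected_components(nodes, edges):
    """Group hashtags linked by any chain of co-occurrences."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)
    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(clusters.values(), key=sorted)

# Made-up tweets, not data from the case study.
tweets = [
    "#ParisAttacks #Syria news",
    "#Syria #Iraq update",
    "#Weather #Florida sunny",
]
freq, edges = hashtag_graph(tweets)
clusters = connected_components(freq, edges)
print(freq.most_common(1), clusters)
```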

Page 30: Introduction to data mining

Text Mining in R

Executing R code on real data: USA election

Page 31: Introduction to data mining

Web Mining in R

Executing R code on real data: TripAdvisor

Page 32: Introduction to data mining

References

Han, J., Kamber, M. and Pei, J., 2011. Data Mining: Concepts and Techniques. Elsevier.
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J. and Zanasi, A., 1998. Discovering Data Mining: From Concept to Implementation. Prentice Hall, Upper Saddle River, NJ.
Forbes, http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#2715e4857a0b159bff6f69fd
Business Insider, http://www.businessinsider.com.au/infographic-heres-how-much-data-is-created-on-the-web-every-minute-2015-8
http://www.mastersindatascience.org/careers/data-scientist/
KDnuggets, http://www.kdnuggets.com/2014/11/9-must-have-skills-data-scientist.html
R4stats, http://r4stats.com/articles/popularity/
Vormetric Inc., www.vormetric.com
Forbes, http://www.forbes.com/sites/gartnergroup/2015/02/12/gartner-predicts-three-big-data-trends-for-business-intelligence/#4a65492d66a2

Page 33: Introduction to data mining

End of Presentation

“In God we trust. All others must bring data.” – W. Edwards Deming