Introduction to Data Mining
TRANSCRIPT
![Page 1: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/1.jpg)
Big Data: Text Mining, Web Mining, Social Network Analysis
Taha Mokfi
Department of Statistics
University of Central Florida
(Case study on Twitter)
https://www.linkedin.com/in/tahamokfi
![Page 2: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/2.jpg)
B.S. in Industrial Engineering
M.S. in Statistical Computing, Data Mining (UCF)
4 years of experience in analytics
Full proficiency in data mining concepts
Published a few papers in journals and conferences
Professional with RapidMiner, IBM Modeler, R, and Weka
Some experience with KNIME and SAS
Familiar with Python and Java
Taha Mokfi
![Page 3: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/3.jpg)
In 2014, every minute of the day:
  Email users sent 204,000,000 messages
  Google received over 4,000,000 search queries
  Facebook users shared 2,460,000 pieces of content
  Twitter users tweeted 277,000 times
  Apple users downloaded 48,000 apps
By 2017, more than 30% of enterprise access to broadly based big data will be via intermediary data broker services, serving context to business decisions.
Movie 1 Movie 2
How much data do we have?
![Page 4: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/4.jpg)
Web Browsers
Search Engines
Smartphones, Apps
New sources of data
Web Entertainment
Social Networks
Banks, Insurance, Hospitals and …
![Page 5: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/5.jpg)
Different sources of data
![Page 6: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/6.jpg)
Data mining: an interdisciplinary field that brings together techniques from machine learning, pattern recognition, statistics, databases, and visualization to extract information from large databases.
Big data: a broad term for data sets so large or complex that traditional data processing applications are inadequate.
Data science: the statistician C. F. Jeff Wu used the term in his 1997 talk “Statistics = Data Science?”. Definition: processes and systems to extract knowledge or insights from data.
Data mining, Big data, Data science?
![Page 7: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/7.jpg)
Big data and Data mining
![Page 8: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/8.jpg)
1974 - Peter Naur - published Concise Survey of Computer Methods in Sweden and the United States
1977 - John W. Tukey - published Exploratory Data Analysis
1977 - The International Association for Statistical Computing (IASC) - its mission includes linking traditional statistical methodology with modern computer technology
1989 - Gregory Piatetsky-Shapiro - organized the first Knowledge Discovery in Databases (KDD) workshop
History of Big Data
![Page 9: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/9.jpg)
History of Big Data
![Page 10: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/10.jpg)
First data mining paper: "From Data Mining to Knowledge Discovery in Databases", 1996 (7,179 citations)
Authors:
  Usama Fayyad, PhD in computer and mathematics, University of Michigan
  Gregory Piatetsky-Shapiro, PhD in computer and mathematics, New York University
  Padhraic Smyth, Professor in the Department of Computer Science with a joint appointment in Statistics, University of California, Irvine
History of Big Data
![Page 11: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/11.jpg)
Big data job trend
Source: WANTED Analytics, a CEB Company
The company maintains a database of more than one billion job listings and collects hiring trend data from more than 150 countries
![Page 12: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/12.jpg)
What should a data scientist know?
![Page 13: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/13.jpg)
Math (e.g. linear algebra, calculus)
Statistics (e.g. probability, hypothesis testing and summary statistics)
Machine learning tools and techniques (e.g. k-nearest neighbors, random forests, ensemble methods, etc.)
Data mining
Data cleaning
Data visualization and reporting techniques
R and/or SAS languages
Unstructured data techniques
More computer skills:
  SQL databases and database querying languages
  Python (most common), C/C++, Java, Perl
  Big data platforms like Hadoop, Hive & Pig
  Cloud tools like Amazon S3
What should a data scientist know?
![Page 14: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/14.jpg)
SAS / SAS EM
RapidMiner
KNIME
R
Weka
IBM Modeler
Excel, Oracle, SQL, …
Python
MATLAB
Tools, Tools, Tools…
![Page 15: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/15.jpg)
Tools, Tools, Tools…
![Page 16: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/16.jpg)
Tools, Tools, Tools…
![Page 17: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/17.jpg)
What is the difference?
![Page 18: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/18.jpg)
Apache Hadoop: open-source software for distributed processing of very large data sets on computer clusters. Its processing model is based on MapReduce.
Spark: builds on the MapReduce paradigm with in-memory processing and can run up to 100 times faster than Hadoop MapReduce for certain applications.
Big Data Specific Tools
![Page 19: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/19.jpg)
MapReduce Concept
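The MapReduce idea can be illustrated with the classic word count. The sketch below is a single-machine toy in plain Python, not Hadoop or Spark code; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample documents are assumptions made for illustration. On a real cluster the map and reduce steps would run in parallel on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key, here by summing them."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    docs = ["big data is big", "data mining finds patterns in data"]
    print(word_count(docs))
```

Each document is mapped independently of the others, which is what allows the map step to be distributed; only the shuffle step needs to bring together pairs that share a key.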
![Page 20: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/20.jpg)
Definition: the process of deriving high-quality information from text
According to Merrill Lynch, around 80-90% of all potentially usable business information may originate in unstructured form
Typical text mining applications:
  text classification
  text clustering
  concept/entity extraction
  sentiment analysis
  anomaly detection
  document summarization, …
Natural language processing (NLP): is concerned with the interactions between computers and human (natural) languages
Text Mining
![Page 21: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/21.jpg)
Sentiment Analysis
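Sentence 1: The weather is very good today.
Sentence 2: So bad, It is raining.

| The | weather | is | very | good | today | So | bad | It | raining | Mood |
|-----|---------|----|------|------|-------|----|-----|----|---------|------|
| 1   | 1       | 1  | 1    | 1    | 1     | 0  | 0   | 0  | 0       | Good |
| 0   | 0       | 1  | 0    | 0    | 0     | 1  | 1   | 1  | 1       | Bad  |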
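The word-presence encoding used for the two example sentences can be reproduced with a few lines of Python. This is a minimal sketch: the helper names `tokenize` and `binary_term_matrix` are made up for illustration, and tokenization is assumed to be simple lowercasing plus splitting on non-letters.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def binary_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] is 1 if word j occurs in document i."""
    vocabulary = []
    token_sets = []
    for doc in documents:
        tokens = tokenize(doc)
        token_sets.append(set(tokens))
        for token in tokens:
            if token not in vocabulary:  # keep first-seen order for the columns
                vocabulary.append(token)
    matrix = [[1 if term in tokens else 0 for term in vocabulary]
              for tokens in token_sets]
    return vocabulary, matrix


if __name__ == "__main__":
    sentences = ["The weather is very good today.", "So bad, It is raining."]
    vocabulary, matrix = binary_term_matrix(sentences)
    print(vocabulary)
    for row in matrix:
        print(row)
```

Each row can then be paired with its mood label (Good, Bad) and passed to any classifier that accepts numeric features.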
![Page 22: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/22.jpg)
Transform cases
Tokenizing
Filter most frequent and least frequent words
Filter stop words
Stemming
Generate n-grams
Concept extraction
Other preparations (length, replace, remove, etc.)
Typical Text Mining process
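Several of the steps above (case transformation, tokenizing, stop-word filtering, stemming and n-gram generation) can be sketched in plain Python. Everything here is an illustrative assumption: the stop-word list is a tiny made-up sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "is", "a", "an", "and", "it", "so", "of", "to", "in"}


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n):
    """Join every run of n consecutive tokens with an underscore."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def preprocess(text, n=2):
    """Lowercase, tokenize, drop stop words, stem, then append n-grams."""
    tokens = re.findall(r"[a-z]+", text.lower())         # transform cases + tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # filter stop words
    stems = [naive_stem(t) for t in tokens]              # stemming
    return stems + ngrams(stems, n)                      # generate n-grams


if __name__ == "__main__":
    print(preprocess("The weather is very good today"))
```

Frequency-based filtering and concept extraction are left out because they need a whole corpus rather than a single sentence.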
![Page 23: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/23.jpg)
Natural Language Processing (NLP)
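Sentence: A dog is chasing a boy on the playground
Lexical analysis (part-of-speech tagging): A/Det dog/Noun is/Aux chasing/Verb a/Det boy/Noun on/Prep the/Det playground/Noun
Syntactic analysis (parsing):
  Noun Phrase: [A dog], [a boy], [the playground]
  Complex Verb: [is chasing]
  Prep Phrase: [on the playground]
  Verb Phrase: [is chasing a boy]
  Verb Phrase: [is chasing a boy on the playground]
  Sentence: [A dog is chasing a boy on the playground]
Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
Inference: Scared(x) if Chasing(_,x,_). Combined with the facts from semantic analysis, this gives Scared(b1).
Pragmatic analysis (speech act): A person saying this may be reminding another person to get the dog back…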
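The semantic-analysis and inference steps can be mimicked with a very small fact base. This is only a sketch: representing facts as (predicate, arguments) tuples and hard-coding the single rule Scared(x) if Chasing(_, x, _) are choices made for illustration, not how a real logic engine works.

```python
# Facts from the semantic analysis of "A dog is chasing a boy on the playground".
FACTS = {
    ("Dog", ("d1",)),
    ("Boy", ("b1",)),
    ("Playground", ("p1",)),
    ("Chasing", ("d1", "b1", "p1")),
}


def infer_scared(facts):
    """Apply the rule Scared(x) if Chasing(_, x, _) to every matching fact."""
    derived = set()
    for predicate, args in facts:
        if predicate == "Chasing" and len(args) == 3:
            # The second argument of Chasing is the one being chased.
            derived.add(("Scared", (args[1],)))
    return derived


if __name__ == "__main__":
    print(infer_scared(FACTS))
```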
![Page 24: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/24.jpg)
Definition: the application of data mining and analytics techniques to discover patterns from the World Wide Web
Applications:
  Web usage mining
  Search engine optimization
  Web site classification and clustering
Web scraping: a software technique for extracting information from websites
Web Mining
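As a rough illustration of web scraping, the sketch below pulls link targets and link text out of an HTML string using only Python's standard `html.parser` module. The class name `LinkExtractor`, the helper `extract_links` and the sample markup are assumptions for illustration; no page is downloaded, and a real scraper would also need to fetch pages and respect each site's terms of use.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def extract_links(html):
    """Parse an HTML string and return the list of (href, text) pairs."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = ('<html><body><h1>Hotels</h1>'
              '<a href="/hotel/1">Sea View</a><p>text</p>'
              '<a href="/hotel/2">City <b>Inn</b></a></body></html>')
    print(extract_links(sample))
```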
![Page 25: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/25.jpg)
Social network analysis (SNA): the process of investigating social structures through the use of network and graph theories, or by analyzing shared text and media.
Extracting data from social networks:
  Using APIs (very restricted)
    Candy Crush installs: 500,000,000
  Web scraping
Social Network Analysis
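One of the simplest graph-theory measures used in social network analysis is degree centrality: the share of other members a person is directly connected to. The sketch below computes it for a made-up friendship network; the names and edges are hypothetical, and the graph is assumed to be undirected.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected graph as a mapping from node to its set of neighbors."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


if __name__ == "__main__":
    # Hypothetical friendship edges.
    edges = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"), ("bob", "cat")]
    centrality = degree_centrality(build_graph(edges))
    for node, score in sorted(centrality.items(), key=lambda item: -item[1]):
        print(node, round(score, 2))
```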
![Page 26: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/26.jpg)
Has been viewed 1,400 times; shared and cited by KDnuggets
Using the Twitter API and R
November 15, 2015 (two days after the incident)
200,000 English tweets
Hashtags including: #ISIS, #Syria, #SaudiArabia, #Iraq, and #Muslims
30,000 tweets collected with the ParisAttacks hashtag for visualization
Jeffrey Breen's method for sentiment analysis
  Scores each tweet with a scoring function based on pools of negative and positive words
Case Study (Twitter Mining - "Paris Attacks")
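A word-list scoring function of the kind described here can be sketched as follows. This is a Python illustration, not the R code used in the case study: the two word lists are tiny made-up samples, and the sign convention (positive matches minus negative matches) is an assumption, so flip it if a higher score should mean more negativity.

```python
import re

# Tiny illustrative word pools (assumptions, not a published lexicon).
POSITIVE_WORDS = {"good", "great", "love", "peace", "hope"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "attack", "fear"}


def score_tweet(text, positive=POSITIVE_WORDS, negative=NEGATIVE_WORDS):
    """Return (number of positive words) minus (number of negative words)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(1 for token in tokens if token in positive)
    negative_hits = sum(1 for token in tokens if token in negative)
    return positive_hits - negative_hits


if __name__ == "__main__":
    for tweet in ["The weather is very good today", "So bad, it is raining"]:
        print(score_tweet(tweet), tweet)
```

Averaging these scores over all tweets that contain a given hashtag gives one number per hashtag, which can then be compared across hashtags.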
![Page 27: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/27.jpg)
Here, a higher score means greater negativity in the tweets containing each hashtag
Case Study (Twitter Mining - "Paris Attacks")
![Page 28: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/28.jpg)
Hashtag Word Clouds
Case Study (Twitter Mining - "Paris Attacks")
![Page 29: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/29.jpg)
Hashtags graph
  Built with graph analysis and a clustering technique
  Different colors = different clusters
  Font size = frequency
Case Study (Twitter Mining - "Paris Attacks")
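The raw material for a hashtag graph is a frequency count per hashtag (which could drive font size) and a count of how often two hashtags appear in the same tweet (the edge weights). The sketch below computes both in Python; the sample tweets are invented, the function names are hypothetical, and the clustering step that would assign colors is not shown.

```python
import re
from collections import Counter
from itertools import combinations


def hashtags(tweet):
    """Return the distinct hashtags of one tweet, lowercased and sorted."""
    return sorted({tag.lower() for tag in re.findall(r"#(\w+)", tweet)})


def cooccurrence(tweets):
    """Return (hashtag frequencies, co-occurrence counts per hashtag pair)."""
    frequency = Counter()
    edges = Counter()
    for tweet in tweets:
        tags = hashtags(tweet)
        frequency.update(tags)
        # Every pair of hashtags in the same tweet adds weight to one edge.
        edges.update(combinations(tags, 2))
    return frequency, edges


if __name__ == "__main__":
    # Invented sample tweets.
    tweets = ["#BigData meets #Rstats", "#bigdata and #python", "#Python #BigData #rstats"]
    frequency, edges = cooccurrence(tweets)
    print(frequency.most_common())
    print(edges.most_common())
```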
![Page 30: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/30.jpg)
Text Mining in R
Executing R code on real data
USA election
![Page 31: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/31.jpg)
Web Mining in R
Executing R code on real data
TripAdvisor
![Page 32: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/32.jpg)
Han, J., Kamber, M. and Pei, J., 2011. Data Mining: Concepts and Techniques. Elsevier.
Forbes, http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#2715e4857a0b159bff6f69fd
Business Insider, http://www.businessinsider.com.au/infographic-heres-how-much-data-is-created-on-the-web-every-minute-2015-8
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J. and Zanasi, A., 1998. Discovering Data Mining: From Concept to Implementation. Prentice Hall, Upper Saddle River, NJ.
http://www.mastersindatascience.org/careers/data-scientist/
KDnuggets, http://www.kdnuggets.com/2014/11/9-must-have-skills-data-scientist.html
R4stats, http://r4stats.com/articles/popularity/
Vormetric Inc., www.vormetric.com
Forbes, http://www.forbes.com/sites/gartnergroup/2015/02/12/gartner-predicts-three-big-data-trends-for-business-intelligence/#4a65492d66a2
References
![Page 33: Introduction to data mining](https://reader035.vdocuments.pub/reader035/viewer/2022062223/588303a61a28abe70d8b5fcd/html5/thumbnails/33.jpg)
End of Presentation
“In God we trust. All others must bring data.” – W. Edwards Deming