290: data mining, informationextraction, and (business)analytics in knowledge services ram akella...

26
290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center Lecture 1 January 19, 2011

Upload: tracy-anderson

Post on 24-Dec-2015

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

290: Data Mining, InformationExtraction, and

(Business)Analytics in Knowledge Services

Ram AkellaUniversity of California

Berkeley & Silicon Valley Center

Lecture 1January 19, 2011

Page 2: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Class Outline

Knowledge Services, Data Mining, and Business Analytics Internet marketing and online ads, financial services,

health services, service centers

Data Mining and Statistics Focus of course Prediction and Classification Data and pre-processing: TSK Ch 2 Review of class Look ahead for next class

Page 3: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Who?

Who Should Take This Course? Graduate Students Engineers and Managers who wish to

Gain depth and/or perspective Move into this area

(Potential) Entrepreneurs who wish to Brainstorm new ideas Create process for startup

Page 4: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

What?

What will you learn in this course? Techniques, software, and perspectives in:

Statistics, Data Mining, and Business Analytics Online marketing, computational advertising,

healthcare services, financial services, service/call centers and text mining

Page 5: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Knowledge Services Examples Online Marketing (Ranking Ads)

Ads

User

TargetPage

TargetingEngine

...

...

Ad Creatives

Landing Pages

...

...

Page 6: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Knowledge Services Examples Opinion Mining (Blog Trend)

Page 7: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Knowledge Services Examples Social Networks

Page 8: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Knowledge Services and Data Mining What are Knowledge Services? What is Data Mining? Business Analytics? What is the connection between all three?

Page 9: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Services What is a service?

http://en.wikipedia.org/wiki/Service_(economics)

A service is the non-material equivalent of a good. A service provision is an economic activity that does not result in ownership

Service professions http://www.bls.gov/oco/oco1006.htm

Management and Business Professionals http://www.bls.gov/oco/oco1001.htm

Page 10: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Knowledge Services

Marketing Internet and other marketing campaigns Online (computational) advertising

Financial Services How should you invest? What are stock and industry trends? Fraud detection

Page 11: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Knowledge Services (Continued) Health Services

Body fat profile and weight prediction Cancer identification Social networks for diabetes knowledge sharing Facebook!

Service Centers Call center management Network prognostics and diagnostics

Anomaly detection

Page 12: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Data Mining and Business Analytics Data Mining and Business Analytics

Techniques to model and solve Knowledge Services problems => This course

Decision Theory is an aspect of business analytics Techniques to solve business management

decision making Later courses E.g. How many experts and technicians of each

type in a service center

Page 13: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Data Mining and Text Mining

Knowledge Services

Data Mining Decision analytics Business Analytics

Data MiningText Mining plus Image/Video Mining

Page 14: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Statistics and Data Mining - 1 How are statistics and data mining

related? Or are they not?

Page 15: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Data Mining: Definitions Data mining is the nontrivial process of identifying, novel,

potentially useful, and ultimately understandable patterns in data. - Fayyad.

Data mining is the process of extracting previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions. - Zekulin.

Data Mining is a set of methods used in the knowledge discovery process to distinguish previously unknown relationships and patterns within data. - Ferruzza.

Data mining is the process of discovering advantageous patterns in data. - John

Page 16: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Statistics Hypothesis testing Experimental design Response surface modeling ANOVA, MANOVA, etc. Linear regression Discriminant analysis Logistic regression GLM Canonical correlation Principal components Factor analysis

Page 17: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Data Mining Decision tree induction (C4.5, CART, CHAID) Rule induction (AQ, CN2, Recon, etc.) Nearest neighbors (case based reasoning) Clustering methods (data segmentation) Association rules (market basket analysis) Feature extraction Visualization In addition, some include: Neural networks Bayesian belief networks (graphical models) Genetic algorithms Self-organizing maps

Page 18: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Course Focus

In this course, we transition from one to the other Good statistical basis enables more powerful

data mining techniques! Every class

Motivated by practical examples in Knowledge Services

Solid grounding in techniques, software, data, for statistics and data mining (machine learning)

Page 19: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Statistics to Data Mining Transition DM packages implement well known

procedures from machine learning, pattern recognition, neural networks and data visualization.

Statistics concentrate on probabilistic inference in information science while DM also finds patterns in the data.

Dimensionality reduction with statistical assumptions can be applied in DM (PCA).

Assessing data quality.

Page 20: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Class Administration Office hours 1-2?, 5-6 pm?, Wed, by appt. Ignore rst for now Assignments and Projects: Postponement by 1 day – lose 10 points; 2 days, 20 points;

then, 0 credit Project grading will be identical; lose 10 points for one day delay, 20 for 2 days, and 0

credit subsequently Quizzes and midterms: No postponement unless serious health or extraordinary work

situation; see TA and then instructor Review website every day; you are responsible for monitoring and responding to

changes Readings will be posted ahead of time; lecture PPTs just a bit before or after class

You are expected to read and be prepared! Homework will be posted by Wednesday (latest Friday) for you to work over weekend

and consult TA on Monday Labs: has computers; similarly Labs, with course software

Page 21: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Prediction and Classification Classification

Classification is the task of assigning objects to one of several predefined categories.

Prediction A prediction is a statement or claim that a

particular event or value will occur in the future in more certain terms than a forecast.

In DM, typically these tasks are performed based on a set of attributes which describe the object to classify or the variable to predict.

Page 22: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Data and Pre-processing

Lecture 1b

Page 23: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Review and Summary of Lecture 1 Introduction to the course:

Problems in Knowledge Services and Analytics. Data Mining definitions and differences between DM and Statistics.

Data types and issues: Types of attributes in data: Nominal, Ordinal,

Interval and Ratio. Types of data sets: Record, Graph, Ordered. Data quality issues: Noise and outliers, missing

values, duplicate data.

Page 24: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Review and Summary of Lecture 1 Data preprocessing:

Aggregation: data reduction, change of scale, combination of features.

Sampling: random, with/without replacement, stratified.

Dimensionality reduction: PCA, SVD. Feature subset selection: brute-force,

embedded, filter, wrapper. Feature creation: extraction (domain specific),

mapping to new space, combination of features.

Page 25: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Look-Ahead for Lecture 2 (and Boot camp 1)

Covariance Matrix. Notions of Linear Algebra Singular Value Decomposition Principal Component Analysis in detail Please take a look at chapter 4 of the

textbook.

Page 26: 290: Data Mining, InformationExtraction, and (Business)Analytics in Knowledge Services Ram Akella University of California Berkeley & Silicon Valley Center

Guest Speaker Jeff Kreuelen

Senior Manager, IBM Almaden Text Mining, Service Centers Service Analytics, Data Mining, Text Mining,

CRM, Marketing, Call Centers, Financial Services