introduction data 資料探勘方法 關聯性法則 (association rules) 分類法 ...
DESCRIPTION
大綱. Introduction Data 資料探勘方法 關聯性法則 (Association Rules) 分類法 (Classification) 叢集法 (Clustering) Applications 結論. What Is Data Mining?. Data mining: knowledge discovery from data) - PowerPoint PPT PresentationTRANSCRIPT
大數據 (Big Data) Introduction Data 資料探勘方法
1. 關聯性法則 (Association Rules)
2. 分類法 (Classification)
3. 叢集法 (Clustering) Applications 結論
What Is Data Mining?
Data mining: knowledge discovery from data) Extraction of interesting (non-trivial, previously
unknown and potentially useful) patterns or knowledge from huge amount of data
Alternative names Knowledge discovery in databases (KDD),
Data Mining: Confluence of Multiple Disciplines
Data Mining
Database Technology Statistics
MachineLearning
A.I.
AlgorithmOther
Disciplines
Visualization
Why Not Traditional Data Analysis?
Tremendous amount of data
High complexity of data
Why Data Mining?
“Necessity is the mother of invention”—Data mining
—Automated analysis of massive data sets.
Business Objectives
[PPS 03]
[BL 97]
[MMPH 03][BM 98]
[LPSHG 01]
[SM 00][FCJ 01][MPT 99]
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-
sequences)
Structure data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
Steps of Data Mining
Steps of a KDD Process Learning the application domain
relevant prior knowledge and goals of application Creating a target data set: data selection Data cleaning and preprocessing: (may take 60% of effort!) Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant representation.
Choosing functions of data mining summarization, classification, regression, association,
clustering. Choosing the mining algorithm(s) Data mining: search for patterns of interest Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc. Use of discovered knowledge
Data Mining Functionalities Outlier analysis
Outlier: Data object that does not comply with the general behavior of the data
Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis Trend and deviation: e.g., regression analysis Sequential pattern mining: e.g., digital camera large SD
memory Periodicity analysis Similarity-based analysis
Association Rule Mining Given a set of transactions, find rules that will predict the occurrence of
an item based on the occurrences of other items in the transaction
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Example of Association Rules
{Diaper} {Beer},{Milk, Bread} {Eggs,Coke},{Beer, Bread} {Milk},
Classification: Definition Given a collection of records (training set )
Each record contains a set of attributes, one of the attributes is the class.
Find a model for class attribute as a function of the values of other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
Classification Task
Apply
Model
Induction
Deduction
Learn
Model
Model
Tid Attrib1 Attrib2 Attrib3 Class
1 Yes Large 125K No
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No
8 No Small 85K Yes
9 No Medium 75K No
10 No Small 90K Yes 10
Tid Attrib1 Attrib2 Attrib3 Class
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
14 No Small 95K ?
15 No Large 67K ? 10
Test Set
TreeInductionalgorithm
Training SetDecision Tree
Weather Data: Play or not Play?
OutlookTemperature
Humidity
Windy
Play?
sunny hot high false No
sunny hot high true No
overcast hot high false Yes
rain mild high false Yes
rain cool normal false Yes
rain cool normal true No
overcast cool normal true Yes
sunny mild high false No
sunny cool normal false Yes
rain mild normal false Yes
sunny mild normal true Yes
overcast mild high true Yes
overcast hot normal false Yes
rain mild high true No
Note:Outlook is theForecast,no relation to Microsoftemail program
overcast
high normal falsetrue
sunny rain
No NoYes Yes
Yes
Example Tree for “Play?”Outlook
HumidityWindy
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City-planning: Identifying groups of houses according to their house type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults
Data Mining for Retail Industry
Retail industry: huge amounts of data on sales, customer shopping history, etc.
Applications of retail data mining Identify customer buying behaviors Discover customer shopping patterns and trends Improve the quality of customer service Achieve better customer retention and satisfaction Enhance goods consumption ratios Design more effective goods transportation and distribution
policies
Data Mining in Retail Industry: Examples
Design and construction of data warehouses based on the benefits of data mining
Multidimensional analysis of sales, customers, products, time, and region
Analysis of the effectiveness of sales campaigns Customer retention: Analysis of customer loyalty
Use customer loyalty card information to register sequences of purchases of particular customers
Use sequential pattern mining to investigate changes in customer consumption or loyalty
Suggest adjustments on the pricing and variety of goods Purchase recommendation and cross-reference of items
Financial Data Mining
Classification and clustering of customers for targeted marketing multidimensional segmentation by nearest-neighbor,
classification, decision trees, etc. to identify customer groups or associate a new customer to an appropriate customer group
Detection of money laundering and other financial crimes integration of from multiple DBs (e.g., bank transactions,
federal/state crime history DBs) Tools: data visualization, linkage analysis, classification,
clustering tools, outlier analysis, and sequential pattern analysis tools (find unusual access sequences)
Financial Data Mining
Classification and clustering of customers for targeted marketing multidimensional segmentation by nearest-neighbor,
classification, decision trees, etc. to identify customer groups or associate a new customer to an appropriate customer group
Detection of money laundering and other financial crimes integration of from multiple DBs (e.g., bank transactions,
federal/state crime history DBs) Tools: data visualization, linkage analysis, classification,
clustering tools, outlier analysis, and sequential pattern analysis tools (find unusual access sequences)
Data Mining for Telecomm. Industry (1)
A rapidly expanding and highly competitive industry and a great demand for data mining
Understand the business involved Identify telecommunication patterns Catch fraudulent activities Make better use of resources Improve the quality of service
Multidimensional analysis of telecommunication data Intrinsically multidimensional: calling-time, duration,
location of caller, location of callee, type of call, etc.
Data Mining for Telecomm. Industry (2)
Fraudulent pattern analysis and the identification of unusual patterns Identify potentially fraudulent users and their atypical usage
patterns Detect attempts to gain fraudulent entry to customer accounts Discover unusual patterns which may need special attention
Multidimensional association and sequential pattern analysis Find usage patterns for a set of communication services by
customer group, by month, etc. Promote the sales of specific services Improve the availability of particular services in a region
Use of visualization tools in telecommunication data analysis
Biomedical and DNA Data Analysis
DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T).
Gene: a sequence of hundreds of individual nucleotides arranged in a particular order
Humans have around 30,000 genes Tremendous number of ways that the nucleotides can be
ordered and sequenced to form distinct genes Semantic integration of heterogeneous, distributed genome
databases Current: highly distributed, uncontrolled generation and use
of a wide variety of DNA data Data cleaning and data integration methods developed in
data mining will help
DNA Analysis: Examples
Similarity search and comparison among DNA sequences Compare the frequently occurring patterns of each class (e.g.,
diseased and healthy) Identify gene sequence patterns that play roles in various diseases
Association analysis: identification of co-occurring gene sequences Most diseases are not triggered by a single gene but by a combination
of genes acting together Association analysis may help determine the kinds of genes that are
likely to co-occur together in target samples Path analysis: linking genes to different disease development stages
Different genes may become active at different stages of the disease Develop pharmaceutical interventions that target the different stages
separately Visualization tools and genetic data analysis
Other Applications Sports
IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
Astronomy JPL and the Palomar Observatory discovered 22 quasars
with the help of data mining Internet Web Surf-Aid
IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.
結論 Data mining is a young discipline with
wide and diverse applications There is still a nontrivial gap between
general principles of data mining and domain-specific, effective data mining tools for particular applications