Mining Knowledge in Data Explosion Age (在資料爆炸時代中挖掘知識), I-En Liao (廖宜恩), Department of Computer Science and Engineering, National Chung Hsing University (NCHU)

Upload: tommy96

Post on 22-Nov-2014


  • 1. Mining Knowledge in Data Explosion Age
  • 2. Outline: Some News Reports; Why Data Mining; What is Data Mining; Knowledge Discovery Process; Data Mining Functionalities; Data Mining Process; Data Mining Tools; Trends in Data Mining; Some Research Results on Data Mining; Conclusions
  • 3. Some News Reports: Time's Person of the Year for 2006; 12 IT skills that employers can't say no to; F.B.I. Data Mining Reached Beyond Initial Targets; MIT names its top 10 emerging technologies for 2008; Effect of US Recession on Data Mining Demand (July 2008)
  • 4. Why Data Mining? The Data Explosion Problem: Data in the world doubles every 20 months! NASA's Earth Observing System: 46 megabytes of data per second, about 4,000,000,000,000 bytes a day (≈ 4 terabytes/day, or twenty 200 GB hard disks). FBI fingerprint image library: 200,000,000,000,000 bytes (≈ 200 TB). In-line image analysis for particle detection: 1 megabyte per second.
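The daily volume quoted for the NASA example can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes), and the helper name `bytes_per_day` is illustrative.

```python
# Decimal storage units (an assumption; the slide does not say which convention it uses).
MB = 10**6
GB = 10**9
TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_day(rate_bytes_per_second):
    """Convert a sustained data rate into the volume produced in one day."""
    return rate_bytes_per_second * SECONDS_PER_DAY


# 46 megabytes per second, as quoted on the slide.
eos_daily_bytes = bytes_per_day(46 * MB)
eos_daily_terabytes = eos_daily_bytes / TB
disks_needed = eos_daily_bytes / (200 * GB)

print(f"{eos_daily_bytes:,} bytes/day")
print(f"about {eos_daily_terabytes:.2f} TB/day")
print(f"about {disks_needed:.1f} hard disks of 200 GB each")
```

The result is just under 4 TB a day, which is where the round figures of 4 terabytes and twenty 200 GB disks come from.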
  • 5. Why Data Mining? Commercial Viewpoint: Lots of data is being collected and warehoused: Web data and e-commerce, purchases at department and grocery stores, bank and credit card transactions. Competitive pressure is strong: provide better, customized services for an edge (e.g. in Customer Relationship Management).
  • 6. Why Data Mining? Scientific Viewpoint: Data is collected and stored at enormous speeds (GB/hour): remote sensors on a satellite, telescopes scanning the skies, microarrays generating gene expression data, scientific simulations generating terabytes of data. Traditional techniques are infeasible for raw data.
  • 7. Mining Large Data Sets - Motivation: There is often information hidden in the data that is not readily evident. Human analysts may take weeks to discover useful information. Much of the data is never analyzed at all. We are drowning in data, but starving for knowledge! [Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999.]
  • 8. What is Data Mining? Data Mining (Knowledge Discovery in Databases, KDD): exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.
  • 9. Knowledge Discovery Process: Data mining is the core of the knowledge discovery process. [Figure: Databases --> Data Cleaning and Data Integration --> Data Warehouse --> Selection --> Task-relevant Data --> Data Mining --> Pattern Evaluation.]
  • 10. Origins of Data Mining: Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems. Traditional techniques may be unsuitable due to the enormity of data, the curse of high dimensionality, and the heterogeneous, distributed nature of data. [Figure: Data Mining at the overlap of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems.]
  • 11. Data Mining Functionalities: (1) Concept description: characterization and discrimination; (2) Classification; (3) Association rule mining; (4) Clustering; (5) Sequence analysis; (6) Anomaly detection
  • 12. Concept Description: Characterization and Discrimination. Characterization provides a concise summarization of a given collection of data. Example: describe the general characteristics of graduate students in the NCHU database. Discrimination provides descriptions comparing two or more collections of data. Example: compare graduate and undergraduate students of NCHU using a discriminant rule.
  • 13. Classification: Given a collection of records (the training set), each record contains a set of attributes, one of which is the class. Find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
  • 14. Decision Tree Classification Task: Learn a model (a decision tree) from the training set, then apply the model to the test set.
      Training set:
      Tid  Attrib1  Attrib2  Attrib3  Class
      1    Yes      Large    125K     No
      2    No       Medium   100K     No
      3    No       Small    70K      No
      4    Yes      Medium   120K     No
      5    No       Large    95K      Yes
      6    No       Medium   60K      No
      7    Yes      Large    220K     No
      8    No       Small    85K      Yes
      9    No       Medium   75K      No
      10   No       Small    90K      Yes
      Test set:
      Tid  Attrib1  Attrib2  Attrib3  Class
      11   No       Small    55K      ?
      12   Yes      Medium   80K      ?
      13   Yes      Large    110K     ?
      14   No       Small    95K      ?
      15   No       Large    67K      ?
  • 15. Example of a Decision Tree
      Training data:
      Tid  Refund  Marital Status  Taxable Income  Cheat
      1    Yes     Single          125K            No
      2    No      Married         100K            No
      3    No      Single          70K             No
      4    Yes     Married         120K            No
      5    No      Divorced        95K             Yes
      6    No      Married         60K             No
      7    Yes     Divorced        220K            No
      8    No      Single          85K             Yes
      9    No      Married         75K             No
      10   No      Single          90K             Yes
      Model (decision tree; the splitting attributes are Refund, MarSt, and TaxInc):
      Refund = Yes: NO
      Refund = No:
          MarSt = Married: NO
          MarSt = Single or Divorced:
              TaxInc < 80K: NO
              TaxInc > 80K: YES
  • 16. Apply Model to Test Data: Start from the root of the tree. Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?. [Figure: the decision tree. Refund (Yes: NO; No: MarSt); MarSt (Married: NO; Single, Divorced: TaxInc); TaxInc (< 80K: NO; > 80K: YES).]
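The Refund / MarSt / TaxInc tree can be written as nested conditions and applied to the test record. This is a minimal sketch: the function name `classify` and the tuple layout are illustrative, and because the slides label the income branches only "< 80K" and "> 80K", sending exactly 80K to the YES branch is an assumption (it does not affect the Married test record).

```python
# Training data from the decision tree example:
# (refund, marital_status, taxable_income_in_thousands, cheat)
TRAINING = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def classify(refund, marital_status, taxable_income_k):
    """Walk the tree from the root and return the predicted Cheat label."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income.
    # Assumption: an income of exactly 80K follows the "> 80K" branch.
    return "No" if taxable_income_k < 80 else "Yes"


# Fraction of training records the tree labels correctly.
training_accuracy = sum(
    classify(refund, status, income) == cheat
    for refund, status, income, cheat in TRAINING
) / len(TRAINING)

# The test record: Refund = No, Marital Status = Married, Taxable Income = 80K.
prediction = classify("No", "Married", 80)
print(training_accuracy, prediction)
```

Walking the test record from the root: Refund = No leads to the MarSt node, and Married leads to a NO leaf, so the record is predicted as Cheat = No.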
  • 17. Examples of Classification Tasks: predicting tumor cells as benign or malignant; classifying credit card transactions as legitimate or fraudulent; classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil; categorizing news stories as finance, weather, entertainment, sports, etc.
  • 18. Association Rule Mining: Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
      Market-basket transactions:
      TID  Items
      1    Bread, Milk
      2    Bread, Diaper, Beer, Eggs
      3    Milk, Diaper, Beer, Coke
      4    Bread, Milk, Diaper, Beer
      5    Bread, Milk, Diaper, Coke
      Examples of association rules: {Diaper} --> {Beer}; {Milk, Bread} --> {Eggs, Coke}; {Beer, Bread} --> {Milk}
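A rule's strength is usually quantified with support and confidence. The slide does not name these measures, so treating them as the scoring method here is an assumption, and the function names are illustrative. A small sketch over the five market-basket transactions:

```python
# The five market-basket transactions from the slide.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def count(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item in the itemset."""
    items = set(itemset)
    return sum(items <= transaction for transaction in transactions)


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    antecedent_count = count(antecedent, transactions)
    if antecedent_count == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / antecedent_count


print(support({"Diaper", "Beer"}), confidence({"Diaper"}, {"Beer"}))
print(support({"Beer", "Bread", "Milk"}), confidence({"Beer", "Bread"}, {"Milk"}))
```

For {Diaper} --> {Beer}, three of the five transactions contain both items and four contain Diaper, giving a support of 0.6 and a confidence of 0.75.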
  • 19. Association Rule Discovery: Application 1 (Marketing and Sales Promotion): Let the rule discovered be {Beer, ...} --> {Potato Chips}. Potato Chips as consequent: can be used to determine what should be done to boost its sales. Beer in the antecedent: can be used to see which products would be affected if the store discontinues selling beer. Beer in the antecedent and Potato Chips in the consequent: can be used to see what products should be sold with beer to promote sales of potato chips!
  • 20. Clustering: Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that data points in one cluster are more similar to one another, and data points in separate clusters are less similar to one another.
  • 21. Illustrating Clustering: Euclidean distance-based clustering in 3-D space. Intracluster distances are minimized; intercluster distances are maximized.
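One way to make the intracluster and intercluster distances concrete is a small k-means run. The slide does not name a clustering algorithm, so k-means (Lloyd's algorithm) is a swapped-in choice here, and the six 3-D points and the starting centroids are hypothetical.

```python
import math
from itertools import combinations


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # New centroid = per-axis mean of the cluster; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids


def mean_intracluster_distance(clusters):
    """Average Euclidean distance between pairs of points in the same cluster."""
    distances = [math.dist(p, q) for cluster in clusters for p, q in combinations(cluster, 2)]
    return sum(distances) / len(distances)


def mean_intercluster_distance(clusters):
    """Average Euclidean distance between pairs of points in different clusters."""
    distances = [math.dist(p, q) for a, b in combinations(clusters, 2) for p in a for q in b]
    return sum(distances) / len(distances)


# Hypothetical 3-D data: two well-separated groups of three points each.
POINTS = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10), (11, 10, 10), (10, 11, 10)]

clusters, centroids = kmeans(POINTS, centroids=[POINTS[0], POINTS[-1]])
intra = mean_intracluster_distance(clusters)
inter = mean_intercluster_distance(clusters)
print(clusters)
print(f"mean intracluster distance {intra:.2f}, mean intercluster distance {inter:.2f}")
```

A good clustering shows a small intracluster figure next to a large intercluster figure, which is the property the slide illustrates.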
  • 22. Clustering: Applications. Market segmentation: the goal is to subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix. Document clustering: the goal is to find groups of documents that are similar to each other based on the important terms appearing in them.
  • 23. Clustering of Microarray Data
  • 24. Sequence Analysis: examples of sequence databases, with the sequence, element (transaction), and event (item) in each.
      Customer: sequence = purchase history of a given customer; element = a set of items bought by a customer at time t; events = books, dairy products, CDs, etc.
      Web data: sequence = browsing activity of a particular Web visitor; element = a collection of files viewed by a Web visitor after a single mouse click; events = home page, index page, contact info, etc.
      Event data: sequence = history of events generated by a given sensor; element = events triggered by a sensor at time t; events = types of alarms generated by sensors
      Genome sequences: sequence = DNA sequence of a particular species; element = an element of the DNA sequence; events = bases A, T, G, C
      [Figure: a sequence drawn as an ordered series of elements (transactions), each containing events (items) labeled E1 to E4.]
  • 25. Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001
  • 26. How does the human genome stack up?
      Organism                              Genome Size (Bases)  Estimated Genes
      Human (Homo sapiens)                  3 billion            25,000
      Laboratory mouse (M. musculus)        2.6 billion          30,000
      Mustard weed (A. thaliana)            100 million          25,000
      Roundworm (C. elegans)                97 million           19,000
      Fruit fly (D. melanogaster)           137 million          13,000
      Yeast (S. cerevisiae)                 12.1 million         6,000
      Bacterium (E. coli)                   4.6 million          3,200
      Human immunodeficiency virus (HIV)    9,700                9
  • 27. Why Is Finding a (15,4) Motif Difficult?
      atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
      acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
      tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
      gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
      tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
      gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
      cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
      aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
      ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
      ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
      Two of the motif instances, aligned:
      AgAAgAAAGGttGGG
      ..|..|||.|..|||
      cAAtAAAAcGGcGGG
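The aligned pair at the end of the slide shows the difficulty: each instance carries four mutations, yet the two instances disagree with each other in more than four positions. The sketch below recomputes that comparison. It assumes that lowercase letters inside an instance mark the mutated positions, which is how the slide appears to use case; the function names are illustrative.

```python
def count_mutations(instance):
    """Count mutated positions, assuming lowercase letters mark the mutations."""
    return sum(ch.islower() for ch in instance)


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ (case-insensitive)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x.upper() != y.upper() for x, y in zip(a, b))


def alignment_marks(a, b):
    """Return '|' where the two sequences agree and '.' where they differ."""
    return "".join("|" if x.upper() == y.upper() else "." for x, y in zip(a, b))


# The two instances aligned on the slide.
first = "AgAAgAAAGGttGGG"
second = "cAAtAAAAcGGcGGG"

print(count_mutations(first), count_mutations(second))
print(first)
print(alignment_marks(first, second))
print(second)
print(hamming_distance(first, second))
```

Because the two instances agree in only 8 of 15 positions, comparing instances pairwise gives a much weaker signal than comparing each instance with the unknown motif itself.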
  • 28. Anomaly Detection: Detect significant deviations from normal behavior. Applications: credit card fraud detection; network intrusion detection. Typical network traffic at the university level may reach over 100 million connections per day.
  • 29. Social Network Analysis (Link Mining): Six Degrees of Separation. Link: a relationship among data objects. Link-Based Object Ranking (LBR): exploit the link structure of a graph to order or prioritize the set of objects within the graph. Web information analysis methods such as PageRank and HITS are typical LBR approaches.
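As an illustration of link-based object ranking, here is a minimal PageRank sketch using power iteration. The four-page link graph is hypothetical, the damping factor of 0.85 is the commonly quoted default rather than a value from the slides, and spreading a dangling page's rank evenly over all pages is one of several possible conventions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank nodes of a directed graph given as {node: [nodes it links to]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly among the pages it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node (no out-links): spread its rank over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link structure: page C is linked to by A, B, and D.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

ranks = pagerank(LINKS)
ordering = sorted(ranks, key=ranks.get, reverse=True)
print(ordering)
```

Page C collects links from three pages and ends up ranked first, while D, which nothing links to, keeps only the random-jump share and ranks last.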
  • 30. Complex Network: A complex network is a network (graph) with certain non-trivial topological features that do not occur in simple networks. Such features include a heavy tail in the degree distribution; a high clustering coefficient; assortativity (a correlation between connected nodes) or disassortativity among vertices; and evidence of a hierarchical structure.
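Two of these features, the degree distribution and the clustering coefficient, are easy to compute on a small graph. The sketch below uses a hypothetical four-node undirected graph stored as adjacency sets, and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def degree_distribution(graph):
    """Map each degree value to the number of nodes that have it."""
    return dict(Counter(len(neighbors) for neighbors in graph.values()))


def local_clustering(graph, node):
    """Fraction of pairs of a node's neighbors that are themselves connected."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    connected_pairs = sum(1 for u, v in combinations(sorted(neighbors), 2) if v in graph[u])
    return 2.0 * connected_pairs / (k * (k - 1))


def average_clustering(graph):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(graph, node) for node in graph) / len(graph)


# Hypothetical undirected graph: a triangle A-B-C with a pendant node D attached to A.
GRAPH = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

print(degree_distribution(GRAPH))
print(local_clustering(GRAPH, "A"), average_clustering(GRAPH))
```

On real networks the interesting signal is in how these quantities compare with a random graph of the same size, which this toy example does not attempt.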
  • 31. Web Mining: Web usage mining; Web structure mining; Web content mining. Google has a precious asset: the Database of Intentions.
  • 32. Graph Mining: Find frequent subgraphs in a given graph database. Graphs are ubiquitous: Web databases and XML databases; cheminformatics (chemical compounds); bioinformatics (protein structures, pathways); workflow analysis; social network analysis.
  • 33. Example (Chemistry-informatics): [Figure: a graph dataset with graphs (A), (B), (C), and the frequent patterns (1), (2) found with a minimum support of 2.]
  • 34. Data Mining Process: (1) Define the problem; (2) Build data mining database; (3) Explore data; (4) Prepare data for modeling; (5) Build model; (6) Evaluate model; (7) Deploy model
  • 35. Examples of Data Mining in Science & Engineering: Data mining in biomedical engineering: robotic arm control using data mining techniques.
  • 36. Data Mining Process: 1. Define the problem. Control a robotic arm by means of EMG signals from the biceps and triceps muscles. Electromyography (EMG) is a medical technique for evaluating and recording physiologic properties of muscles at rest and while contracting. Signal level of each muscle (H or L) for each contraction:
      Contraction  Biceps  Triceps
      Supination   H       H
      Pronation    L       L
      Flexion      H       L
      Extension    L       H
  • 37. Data Mining Process: 2. Build a data mining database. The dataset includes 80 records. There are two input variables: biceps signal and triceps signal. There is one output variable with four possible values: supination, pronation, flexion, and extension.
  • 38. Data Mining Process: 3. Explore data. [Figure: scatter plot of the triceps signal against record number for Flexion, Extension, Supination, and Pronation.]
  • 39. Data Mining Process: 3. Explore data (cont.). [Figure: scatter plot of the biceps signal against record number for Flexion, Extension, Supination, and Pronation.]
  • 40. Data Mining Process: 4. Prepare data for modeling. Build a dataset in the ARFF format:
      @relation EMG
      @attribute Triceps real
      @attribute Biceps real
      @attribute Move {Flexion,Extension,Pronation,Supination}
      @data
      13,31,Flexion
      14,30,Flexion
      10,31,Flexion
      13,29,Flexion
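To show how such a file is structured, the sketch below reads the ARFF fragment with a hand-written parser. It is deliberately minimal and handles only the constructs that appear on the slide (`@relation`, `@attribute` with `real` or a nominal value list, and comma-separated `@data` rows); it is not a full ARFF reader, and the function name `parse_arff` is illustrative.

```python
ARFF_TEXT = """\
@relation EMG
@attribute Triceps real
@attribute Biceps real
@attribute Move {Flexion,Extension,Pronation,Supination}
@data
13,31,Flexion
14,30,Flexion
10,31,Flexion
13,29,Flexion
"""


def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows)."""
    relation = None
    attributes = []  # list of (name, type string) in declaration order
    rows = []
    in_data = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("%"):  # skip blank lines and ARFF comments
            continue
        keyword = line.lower()
        if in_data:
            values = [value.strip() for value in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                # Numeric attributes become floats; nominal values stay as strings.
                row[name] = float(value) if kind.lower() == "real" else value
            rows.append(row)
        elif keyword.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif keyword.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif keyword.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, [name for name, _ in attributes], len(rows))
print(rows[0])
```

In these four records the biceps value is high and the triceps value is low, which matches the Flexion pattern described in the problem definition.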
  • 41. Data Mining Process: 5. Build Model. Classification methods: OneR; Decision Tree; Naive Bayesian; K-Nearest Neighbors; Neural Networks; Linear Discriminant Analysis; Support Vector Machines
  • 42. Data Mining Process: 5. Decision Tree. (1) Find the attribute that best classifies the training data. (2) Use this attribute as the root of the decision tree. (3) Repeat the process for each subtree. [Figure: the resulting decision tree, with split nodes on Triceps and Biceps (the values 37, 14, and 17 appear in the figure) and leaves Flexion, Pronation, Extension, and Supination.]
  • 43. Data Mining Process: 6. Evaluate Models. Validation methods: simple validation (training set and test set); n-fold cross-validation; leave-one-out.
      Accuracy with 10-fold cross-validation:
      OneR                 76%
      Decision Tree        90%
      Naive Bayesian       98%
      1-Nearest Neighbors  100%
      Neural Networks      100%
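The validation schemes differ only in how the records are split. The sketch below generates the train/test index splits for n-fold cross-validation; leave-one-out is the special case where the number of folds equals the number of records. The interleaved fold assignment is a simplification (tools commonly shuffle or stratify the records first), and the function name is illustrative.

```python
def k_fold_indices(n_records, n_folds):
    """Yield (train_indices, test_indices) pairs for n-fold cross-validation."""
    indices = list(range(n_records))
    # Interleaved assignment: record i goes to fold i % n_folds (no shuffling).
    folds = [indices[start::n_folds] for start in range(n_folds)]
    for test_indices in folds:
        held_out = set(test_indices)
        train_indices = [i for i in indices if i not in held_out]
        yield train_indices, test_indices


# 10-fold cross-validation on a dataset of 80 records.
ten_fold = list(k_fold_indices(80, 10))
print(len(ten_fold), len(ten_fold[0][0]), len(ten_fold[0][1]))

# Leave-one-out: as many folds as records, one test record per fold.
leave_one_out = list(k_fold_indices(80, 80))
print(len(leave_one_out), len(leave_one_out[0][1]))
```

With 80 records and 10 folds, each model is trained on 72 records and tested on the remaining 8, and every record is used for testing exactly once.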
  • 44. Data Mining Process: 7. Deploy Model. The neural network model was successfully implemented inside the robotic arm.
  • 45. Data Mining Tools. Commercial tools: SAS Enterprise Miner, IBM Intelligent Miner, SPSS Clementine. Open source tools: WEKA (http://www.cs.waikato.ac.nz/ml/weka), RapidMiner (http://rapid-i.com/index.php?lang=en). Poll: data mining/analytic tools you used in 2006. Good portals for data mining: KDnuggets.
  • 46. Trends in Data Mining: Application exploration: development of application-specific data mining systems. Invisible data mining (mining as a built-in function). Scalable data mining methods. Constraint-based mining: use of constraints to guide data mining systems in their search for interesting patterns. Integration of data mining with database systems, data warehouse systems, and Web database systems.
  • 47. Trends in Data Mining: Web mining. Social network analysis. Recommender systems: Netflix offers a US$1 million prize for a 10% improvement on its Cinematch movie recommender system; see "If You Liked This, You're Sure to Love That" (New York Times, Nov. 21, 2008).
  • 48. Trends in Data Mining: Spam filters: Cost of spam: how much does spam cost you? Google will calculate (http://www.google.com/a/help/intl/en/security/roi_calculator.html). Privacy protection and information security in data mining. Bioinformatics.
  • 49. Some Research Results on DM: Localization system for WLAN; Rogue Access Point Detection System Based on Packet Analysis; Library Recommender System Based on Personal Ontology Model
  • 50. Localization System for WLAN: "Enhancing the Accuracy of WLAN-based Location Determination Systems Using Predicted Orientation Information" (Information Sciences, Vol. 178, No. 4, Feb. 15, 2008, pp. 1049-1068). We proposed the Accumulated Orientation Strength (AOS) algorithm, based on a Bayesian classifier, to predict the orientation of a mobile user and thereby improve the accuracy of the localization system.
  • 51. Rogue Access Point Detection System: A paper entitled "Detecting Rogue Access Points Using Client-side Bottleneck Bandwidth Analysis" has been accepted for publication in Computers & Security.
  • 52. Rogue Access Point Detection System: A big challenge in managing APs on a university campus: NCHU is a class B network with more than 50 departmental networks.
  • 53. Rogue Access Point Detection System: Intruders from the Air
  • 54. Rogue Access Point Detection System: We proposed a novel approach for detecting rogue access points by estimating client-side bottleneck bandwidth based on the ACK packet pair technique. The system is implemented and tested in the Computer and Information Network Center at NCHU. Experimental results show that the accuracy is higher than 90%.
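The general idea behind packet-pair estimation is that two packets sent back to back are spread apart by the slowest link on the path, so the bottleneck bandwidth is roughly the packet size divided by the arrival gap. The sketch below shows only that generic calculation. It is not the authors' ACK-based method, and the packet size, the gap values, and the use of a median to damp noisy samples are all illustrative assumptions.

```python
from statistics import median


def packet_pair_bandwidth(packet_size_bytes, gap_seconds):
    """Estimate bottleneck bandwidth in bits per second from one packet pair."""
    if gap_seconds <= 0:
        raise ValueError("the gap between the two packets must be positive")
    return packet_size_bytes * 8 / gap_seconds


def estimate_bottleneck(packet_size_bytes, gaps_seconds):
    """Combine several packet-pair samples; the median limits the effect of outliers."""
    return median(packet_pair_bandwidth(packet_size_bytes, gap) for gap in gaps_seconds)


# Hypothetical measurements: 1500-byte packets arriving 1.0, 1.2, and 2.0 ms apart.
samples = [0.001, 0.0012, 0.002]
estimate = estimate_bottleneck(1500, samples)
print(f"estimated bottleneck bandwidth: {estimate / 1e6:.1f} Mbit/s")
```

A detector built on this idea would compare such estimates against what is expected for wired and wireless paths; how that comparison is made is specific to the published system and is not reproduced here.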
  • 55. Library Recommender System Based on Personal Ontology Model (PORE): A paper entitled "PORE: A Personal Ontology Recommender System for Digital Library" has been accepted for publication in The Electronic Library. We proposed a personal ontology model for recommending books to library patrons based on keywords extracted from the books borrowed by the user.
  • 56. Library Recommender System Based on Personal Ontology Model (PORE): Collaborative filtering techniques are also incorporated into the PORE system. The PORE system is in service at the NCHU Library.
  • 57. Conclusions: We are drowning in data, but starving for knowledge! Data mining is the key to knowledge discovery. Applications of data mining techniques can be found in almost every research area of computer science and engineering. Even in a recession, data mining services are still in strong demand.
  • 58. References
      1. Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison-Wesley, 2006.
      2. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Ed., Morgan Kaufmann, 2005.
      3. Neil Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
      4. http://www.chem-eng.utoronto.ca/~datamining/
      5. Duncan Watts, Six Degrees, 2004.
      6. Mark Buchanan, Nexus, 2003.
      7. http://www.kdnuggets.com/