Lecture notes: Introduction to Data Mining. Chapter 1. General introduction to data...


Upload: spiro

Post on 09-Jan-2016


DESCRIPTION

LECTURE NOTES: INTRODUCTION TO DATA MINING. CHAPTER 1. GENERAL INTRODUCTION TO DATA MINING. Assoc. Prof. Dr. HÀ QUANG THỤY. HANOI, 09-2013. UNIVERSITY OF ENGINEERING AND TECHNOLOGY, VIETNAM NATIONAL UNIVERSITY, HANOI. Contents. The need for data mining (DM). The concepts of DM and knowledge discovery in databases - PowerPoint PPT Presentation

TRANSCRIPT

  • LECTURE NOTES: INTRODUCTION TO DATA MINING

    CHAPTER 1. GENERAL INTRODUCTION TO DATA MINING
    Assoc. Prof. Dr. HÀ QUANG THỤY
    HANOI, 09-2015
    UNIVERSITY OF ENGINEERING AND TECHNOLOGY, VIETNAM NATIONAL UNIVERSITY, HANOI

  • Contents
    The need for data mining (DM)
    The concepts of DM and knowledge discovery in databases
    DM and traditional database processing
    Types of data in DM
    Types of patterns mined
    Typical DM technologies
    Some typical applications
    Major issues in DM

  • 1. The need for data mining
    The data explosion
      Technological causes
      Social causes
      Manifestations
    The data-driven economy
    The knowledge economy
    Knowledge discovery from data

  • The data explosion: Moore's Law
    Origin: Gordon E. Moore (1965). Cramming more components onto integrated circuits, Electronics, 38 (8), April 19, 1965. An observation and a forecast.
    The "2x" rule of thumb
      The number of transistors integrated on a chip doubles roughly every two years.
      The cost of producing a semiconductor circuit with the same capability halves every two years.
      The 18-month version: a shortened cycle.
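The doubling rule on this slide can be written as count(t) = count(0) * 2^(t / T), where T is the doubling period. The sketch below is only an illustration of that arithmetic; the function name `projected_count` and the starting figure of 1,000 transistors are assumptions, not values from the lecture.

```python
def projected_count(initial_count: float, years: float,
                    doubling_period_years: float = 2.0) -> float:
    """Project a quantity that doubles once every `doubling_period_years`."""
    return initial_count * 2 ** (years / doubling_period_years)


# Hypothetical starting point: a chip with 1,000 transistors.
# Two-year version: 10 years is 5 doublings.
two_year_version = projected_count(1_000, years=10)
# 18-month version: 9 years is 6 doublings.
eighteen_month_version = projected_count(1_000, years=9, doubling_period_years=1.5)

print(two_year_version, eighteen_month_version)
```

Shortening the period from 24 to 18 months gives one extra doubling in slightly less time, which is why the 18-month version is described as a shortened cycle.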

  • Moore's Law and the electronics industry
    Driving the semiconductor industry
      The basic model for the semiconductor industry.
      Paul S. Otellini, President and CEO of Intel Corporation: Moore's Law still provides the basic capability for our growth, and it is still alive and well at Intel ... Moore's Law is not only about semiconductors. It is also about the creative use of semiconductors.
      Daniel Grupp, Director of Advanced Technology Development, Acorn Technologies, Inc. (http://acorntech.com/): The whole cycle of design, development, manufacturing, distribution and sales is considered sustainable when it follows Moore's Law ... If Moore's Law is beaten, the market cannot absorb all the new products and engineers lose their jobs. If we fall behind Moore's Law, there is nothing to buy, and the burden falls on the shoulders of the product distribution chain.
    Driving data processing, storage and transmission technology
      Semiconductor technology is the foundation of the electronics industry.
      Moore's Law and the computer hardware industry: Intel processors over the past 40 years (next slide).
      An explosion in computing power and data storage capacity.
      An impact on the development of database technology (organizing and managing data) and network technology (transmitting data).

  • Moore's Law: Intel processors
    "Another decade is probably straightforward... There is certainly no end to creativity." Gordon Moore, Intel Chairman Emeritus of the Board, speaking of extending Moore's Law at the International Solid-State Circuits Conference (ISSCC), February 2003.

    Moore's Law: transistor densities on a single chip double about every two years. (Source: Intel Web site, "Moore's Law: Made Real by Intel Innovation", www.intel.com/technology/mooreslaw/?iid=search, accessed January 9, 2008.)

  • The system of multiples and submultiples of units of measurement
    Values and names of the typical multiples and submultiples

  • Data collection and storage devices
    Digitization capacity
      A wide variety of digitizing devices
      In every field: management, commerce, science ...
    A typical example: SDSS
      Sloan Digital Sky Survey, http://www.sdss.org/
      Builds a 3-dimensional map containing more than 930,000 galaxies and more than 120,000 quasars.
      The first telescope
        In operation since 2000.
        In its first few weeks it collected as much astronomical data as had been collected in all of prior history. After 10 years: 140 TB.
      The next telescope
        Large Synoptic Survey Telescope
        Starts operating in 2016. It will gather 140 TB every 5 days.

  • Evolution of database technology: 2006
    Evolution of database technology [HK0106]: advanced database systems, data warehousing and data mining, Web-based database systems

  • Evolution of database technology: 2011
    Evolution of database technology [HKP11]: advanced database systems and advanced data analysis (including DM)

  • Database technology: some large databases
    Top 10 largest databases, http://top-10-list.org/2010/02/16/top-10-largest-databases-list/ (04/9/13)
      Library of Congress: 125 million items
      Central Intelligence Agency (CIA): 100 records: population statistics, monthly maps
      Amazon: 250 thousand books, 55 million users, 40 TB
      YouTube: hundreds of millions of clips watched every day
      ChoicePoint: 75 times the Earth-Moon distance
      Sprint: 70,000 telecommunication records
      Google: 90 million searches per day
      AT&T: 310 TB
      World Data Centre for Climate
    The US National Energy Research Scientific Computing Center (NERSC)
      March 2010: about 460 TB, http://www.nersc.gov/news/annual_reports/annrep0809/annrep0809.pdf
    YouTube
      After two years: hundreds of millions of videos -> the YouTube database doubles in size every 5 months

  • The data explosion: network technology
    Total IP traffic on the network
      Source: CISCO white paper, 2010
      2010: 20,396 PB/month; 2009-2014: average annual growth of 34%
    The Web
      13.5 billion indexed web pages (23/01/2011).
      At least 4.2 billion indexed Web pages (04/09/2013).
      Source: http://www.worldwidewebsize.com/

  • The data explosion: new data producers
    A widening range of data producers
      The share of new data created by users keeps growing.
      Online user systems, social networks ...
      The Facebook social network holds up to 40 billion photos.
      2010: 900 EB created by users (out of 1,260 EB overall).
    Source: IDC Digital Universe Study, sponsored by EMC, May 2010

  • The data explosion: cost and manifestation
    Source: IDC Digital Universe Study, sponsored by EMC, May 2010
    Creating data keeps getting cheaper
      The cost of creating new data trends steadily downward.
      From 0.5 US cents per GB in 2009 down to 0.02 US cents per GB in 2020.
    Total volume grows: the growth curve gets ever steeper
      Reaching 35 ZB in 2020.

  • The need to keep up with data
    The data explosion versus growth of the IT workforce
      Information volume grows 67-fold, and the number of data objects grows 67-fold.
      The IT workforce grows 1.4-fold.
    Source: IDC Digital Universe Study, sponsored by EMC, May 2010.

  • The need to acquire knowledge from data
    Jim Gray, Microsoft expert, 1998 Turing Award
      We are deluged by data: scientific data, medical data, demographic data, financial data, and marketing data. People have no time to look at this data. Human attention has become the precious resource. So, we must find ways to automatically analyze the data, to automatically classify it, to automatically summarize it, to automatically discover and characterize trends in it, and to automatically flag anomalies.
      This is one of the most active and exciting areas of the database research community. Researchers in areas including statistics, visualization, artificial intelligence, and machine learning are contributing to this field. The breadth of the field makes it difficult to grasp the extraordinary progress over the last few decades [HK0106].
    Kenneth Cukier
      Information has gone from scarce to superabundant. That brings huge new benefits ... and makes it possible to do many things that previously could not be done: spot business trends, prevent diseases, combat crime ... Managed well, such data can be used to unlock new sources of economic value, provide fresh insights into science, and create benefits for governance. http://www.economist.com/node/15557443?story_id=15557443

  • The knowledge economy
    Knowledge is the basic resource.
    Using knowledge is the key driver of economic growth.
    Figure: in 2003, the contribution of knowledge to Korea's growth in GDP per capita was twice the contribution of labor and capital. TFP: Total Factor Productivity (The World Bank. Korea as a Knowledge Economy, 2006)

  • The service economy: from data to value
    The service economy
      Human society is shifting from a goods economy to a service economy. Service labor overtook agricultural labor (2006).
      Every economy is a service economy.
      The unit of exchange in the economy and in society is the service.
    Services: data & information -> knowledge -> new value
      Science: data & information -> knowledge
      Engineering: knowledge -> services
      Management: acts on the whole process of service delivery
    Jim Spohrer (2006). A Next Frontier in Education, Employment, Innovation, and Economic Growth, IBM Corporation, 2006

  • The data-driven economy
    The data management and analytics industry
      "We are drowning in data but starving for knowledge."
      Worth more than 100 billion US$ in 2010.
      Growing 10% a year, nearly twice as fast as the software business as a whole.
      In recent years large corporations have spent about 15 billion US$ buying data analytics companies.
      Compiled by Kenneth Cukier.
    The data science workforce
      CIOs and data analysts play an increasingly important role.
      The data analyst: programmer + statistician + data "artisan". The US has a standard defining the function.
      See the essay "Tản mạn về cơ hội trong ngành Thống kê (và KHMT)" (Reflections on opportunities in Statistics (and Computer Science)) by Nguyễn Xuân Long, 03/7/2009. http://www.procul.org/blog/2009/07/03/t%e1%ba%a3n-m%e1%ba%a1n-v%e1%bb%81-c%c6%a1-h%e1%bb%99i-trong-nganh-th%e1%bb%91ng-ke-va-khmt/

  • 2. The concepts of KDD and DM
    Knowledge discovery from databases (KDD)
      Extracting interesting patterns or knowledge (non-trivial, implicit, previously unknown and potentially useful) from a large collection of data.
    KDD and DM: are the names interchangeable? According to the two authors:
      Data mining is one step in the KDD process.

  • The KDD process [FPS96]
    [FPS96] Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth (1996). From Data Mining to Knowledge Discovery: An Overview, Advances in Knowledge Discovery and Data Mining 1996: 1-34

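  • Steps in the KDD process
    Learning the application domain
      Relevant prior knowledge and the goals of the application.
    Creating a target data set: data selection.
    Data preparation and preprocessing (may take up to 60% of the effort!).
    Data reduction and transformation
      Finding useful features, reducing dimensions/variables, finding invariant representations.
    Choosing the DM function
      Summarization, classification, regression, association, clustering.
    Choosing the DM algorithm(s).
    The DM step: searching for interesting patterns.
    Pattern evaluation and knowledge presentation
      Visualization, transformation, removal of redundant patterns, etc.
    Using the discovered knowledge.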
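The steps above can be sketched as a chain of small functions. This is a toy illustration only: the records, the debt-to-income feature and the single-threshold "algorithm" are all assumptions made up for the example, not part of the lecture or of any particular DM tool.

```python
# Hypothetical raw records: (customer_id, income, debt, repaid_loan)
RAW = [
    (1, 52.0, 10.0, True),
    (2, 18.0, 15.0, False),
    (3, None, 5.0, True),   # missing income: handled in preprocessing
    (4, 75.0, 20.0, True),
    (5, 22.0, 30.0, False),
    (6, 40.0, 8.0, True),
]


def select(records):
    """Create the target data set (here: drop the id column)."""
    return [(income, debt, repaid) for _, income, debt, repaid in records]


def preprocess(records):
    """Clean the data (here: drop rows with missing values)."""
    return [r for r in records if None not in r]


def transform(records):
    """Reduce to one derived feature, the debt-to-income ratio."""
    return [(debt / income, repaid) for income, debt, repaid in records]


def mine(records):
    """Function = classification, algorithm = a single threshold rule.
    Try each observed ratio as a cut point and keep the most accurate one."""
    best = None
    for threshold, _ in records:
        correct = sum((ratio < threshold) == repaid for ratio, repaid in records)
        if best is None or correct > best[1]:
            best = (threshold, correct)
    return best


def evaluate(pattern, records):
    """Present the rule with its accuracy on the data it was mined from."""
    threshold, correct = pattern
    return f"repaid if debt/income < {threshold:.2f}", correct / len(records)


data = transform(preprocess(select(RAW)))
rule, accuracy = evaluate(mine(data), data)
print(rule, accuracy)
```

Accuracy here is measured on the same records the rule was mined from, so it is optimistic; a real project would evaluate on separate data.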

  • Related concepts
    Alternative names
      knowledge extraction, information discovery, information harvesting, data archaeology/dredging, data/pattern analysis/processing, business intelligence (BI) ...
    Distinction: is everything DM?
      Deductive query processing.
      Expert systems, or small machine learning/statistical programs.

  • The iterative KDD process model [CCG98]
    An improved model of the KDD process
      Business orientation: identify 1-3 questions or goals that the KDD target supports.
      Actionable results: identify the set of actionable results based on the evaluated models.
      Iteration in the style of the software development life cycle.
    [CCG98] Kenneth Collier, Bernard Carey, Ellen Grusy, Curt Marjaniemi, Donald Sautter (1998). A Perspective on Data Mining, Technical Report, Northern Arizona University.

  • The CRISP-DM 2000 model
    The industry reference standard process for DM
      The phases of the CRISP-DM (Cross-Industry Standard Process for Data Mining) process model.
      "Business understanding": understanding the problem and evaluating it.
      Deployment takes place only after the results have been checked against the "business understanding".
    CRISP-DM 2.0 SIG WORKSHOP, LONDON, 18/01/2007
    Source: http://www.crisp-dm.org/Process/index.htm (13/02/2011)

  • The integrated DM-BI model [WW08]
    The knowledge development cycle through data mining
    Wang, H. and S. Wang (2008). A knowledge management approach to data mining process for business intelligence, Industrial Management & Data Systems, 2008. 108(5): 622-634. [Oha09]

  • Data science
    "Data science is an emerging field in industry, and as yet, it is not well defined as an academic subject."
    Van der Aalst
      How can all the information be used to improve processes and machines, raise their efficiency, and prevent malfunctions?
      "How can we use information to influence undesirable behavior? Is there a way to give people feedback on their lifestyle?"

  • Data science

  • Data and patterns
    Data (a data set)
      A set F of finitely many cases (facts).
      In KDD, F must contain a very large number of cases.
    Patterns
      In KDD: a language L for describing subsets of the facts (data) in the fact set F.
      A pattern is an expression E in the language L <-> the subset F_E of the corresponding facts in F. E is called a pattern if it is simpler than enumerating the facts in F_E.
      For example, the expression "INCOME < $t" (a model with a single variable, INCOME).
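The relationship between an expression E and the subset F_E it describes can be made concrete in a few lines. This is a minimal sketch: the four applicants, the threshold of 30 and the helper names `income_below` and `covered` are hypothetical, chosen only to mirror the "INCOME < $t" example (THUNHẬP in the original slide).

```python
# F: a finite set of facts; each fact is a hypothetical loan applicant.
F = [
    {"id": 1, "income": 15, "debt": 9},
    {"id": 2, "income": 48, "debt": 5},
    {"id": 3, "income": 22, "debt": 12},
    {"id": 4, "income": 70, "debt": 30},
]


def income_below(t):
    """The expression E = 'INCOME < $t' in a tiny pattern language L."""
    return lambda fact: fact["income"] < t


def covered(expression, facts):
    """F_E: the subset of F that the expression E describes."""
    return [fact for fact in facts if expression(fact)]


E = income_below(30)
F_E = covered(E, F)
ids = [fact["id"] for fact in F_E]
print(ids)
```

The expression is a single comparison, while F_E would otherwise have to be listed fact by fact; on a large F that difference is what makes E a pattern.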

  • Validity
    A discovered pattern must be valid on new data with some degree of certainty.
    "Validity": a measure of validity (certainty) is a function C that maps an expression in the pattern language L to a (partially or totally) ordered measurement space M_C.
    For example, if the boundary defining the pattern "INCOME < $t" is shifted to the right (the INCOME variable takes a larger value), the certainty drops, because additional good loans are then swept into the "no loan" region.
    With a*INCOME + b*DEBT < 0, the pattern is more valid.
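The effect of shifting the boundary can be checked numerically. The sketch below assumes one particular reading of the certainty function C: the fraction of applicants covered by "INCOME < t" (the refusal region) who really did default. Both that definition and the eight loan records are assumptions for illustration.

```python
# Hypothetical past loans: (income, repaid)
LOANS = [
    (12, False), (18, False), (25, False), (28, True),
    (35, True), (42, False), (55, True), (60, True),
]


def certainty(threshold, loans):
    """C(E, F) for E = 'INCOME < threshold', read as 'refuse the loan':
    the fraction of covered applicants who actually defaulted."""
    covered = [repaid for income, repaid in loans if income < threshold]
    if not covered:
        return 0.0
    return sum(not repaid for repaid in covered) / len(covered)


c_at_30 = certainty(30, LOANS)  # boundary at 30
c_at_50 = certainty(50, LOANS)  # boundary shifted to the right
print(c_at_30, c_at_50)
```

Moving the boundary from 30 to 50 pulls a good loan (income 35) into the refusal region, so the certainty falls, as the slide describes.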

  • Novelty and potential usefulness
    Novelty: a pattern must be new within some frame of reference, at least to the system under consideration.
      Novelty can be measured
        by changes in the data: comparing current values with past or expected values,
        or by knowledge: how the new knowledge relates to existing knowledge.
      In general, this can be measured by a function N(E, F), which is either a measure of novelty or a measure of unexpectedness.
    Potential usefulness: a pattern should be able to lead to useful actions, as measured by a utility function.
      The function U maps expressions in L to a (partially or totally) ordered measurement space M_U: u = U(E, F).
      For example, in the loan data set, this function could be the expected increase in the bank's profit (in monetary units) associated with the decision rule shown in Figure 1.3.

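  • Understandability, interestingness, and knowledge
    Understandability: a pattern must be understandable.
      KDD: patterns that people understand more easily than the underlying data.
      Hard to measure precisely: "understandable" is approximated by simplicity.
      Several simplicity measures exist
        They range from syntactic (the size of the pattern in bits) to semantic (how easily people can comprehend it in some setting).
        Understandability is assumed to be measured by a function S that maps an expression E in L to a (partially or totally) ordered measurement space M_S: s = S(E, F).
    Interestingness: an overall measure of a pattern that combines the criteria of validity, novelty, usefulness and simplicity.
      Either use an interestingness function i = I(E, F, C, N, U, S), mapping expressions in L into a measurement space M_I,
      or define interestingness directly: as an ordering of the discovered patterns.
    Knowledge: a pattern E ∈ L is called knowledge if, for some class of users, a threshold i ∈ M_I can be specified such that the interestingness I(E, F, C, N, U, S) > i.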
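One way to read the definition of knowledge is as a threshold test on a combined score. The sketch below assumes that I is a weighted average of the four measures c, n, u and s, each already scaled to [0, 1]; the weights, the scores and the threshold are invented for illustration, and the slide does not prescribe any particular form for I.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    certainty: float   # c = C(E, F)
    novelty: float     # n = N(E, F)
    utility: float     # u = U(E, F)
    simplicity: float  # s = S(E, F)


def interestingness(scores, weights=(0.4, 0.2, 0.3, 0.1)):
    """One possible I(E, F, C, N, U, S): a weighted average of the four scores."""
    parts = (scores.certainty, scores.novelty, scores.utility, scores.simplicity)
    return sum(w * p for w, p in zip(weights, parts))


def is_knowledge(scores, threshold):
    """A pattern is knowledge when its interestingness exceeds the user's threshold."""
    return interestingness(scores) > threshold


# Hypothetical scores for two candidate patterns.
candidates = {
    "INCOME < 30": Scores(0.75, 0.2, 0.6, 0.9),
    "a*INCOME + b*DEBT < 0": Scores(0.9, 0.7, 0.8, 0.6),
}
knowledge = [name for name, s in candidates.items() if is_knowledge(s, threshold=0.7)]
print(knowledge)
```

Because the threshold belongs to a class of users, the same pattern can count as knowledge for one audience and not for another.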

  • Typical architecture of a DM system

  • 3. Data mining and database management
    Questions for a database management system (DBMS)
      Show the amount for Mr. Smith on 5 January? -> an individual record, retrieved by on-line transaction processing (OLTP).
      How many foreign investors bought stock X last month? -> a statistical record, produced by a statistical decision support system (DSS).
      Show every stock in the database whose face value has risen? -> multidimensional data, produced by on-line analytic processing (OLAP).
    A complete hypothesis about complex domain knowledge is required!

  • The concept of DM: questions for a DMS
    Questions for a data mining system (DMS)
      What characterizes the stocks whose price has risen?
      What characterizes the US$ - DMark exchange rate?
      What can be expected of stock X next week?
      In the coming month, how many union members will be unable to repay their debts?
      What characterizes the people who buy product Y?

    The assumption of complete knowledge is no longer essential; knowledge must be added to the system -> improving (upgrading) the knowledge domain!

  • Database systems and DM systems

  • DM and business intelligence
    Increasing potential to support business decisions (layers from top to bottom)
      Decision making: end user
      Data presentation (visualization techniques): business analyst
      DM (information discovery): data analyst
      Data exploration: OLAP, MDA; statistical analysis, querying and reporting
      Data warehouses / data marts: database administrator (DBA)
      Data sources: paper, files, information providers, database systems, OLTP

  • 4. DM: types of data
    Relational databases
    Data warehouses
    Transactional databases
    Advanced databases and information repositories
      Object-relational databases
      Spatial and temporal data
      Time-series data
      Stream data
      Multimedia data
      Heterogeneous and legacy data
      Text databases & the WWW

  • Types of data analyzed/mined
    http://www.kdnuggets.com/polls/2010/data-types-analyzed.html

  • Data size and DM salaries
    http://www.kdnuggets.com/polls/2010/data-miner-salary.html
    http://www.kdnuggets.com/polls/2009/largest-database-data-mined.htm
    http://www.kdnuggets.com/polls/2010/data-types-analyzed.html

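  • 5. DM: types of patterns mined
    General functions
      Descriptive DM: summarization, clustering, association rules ...
      Predictive DM: classification, regression ...
    Typical tasks
      Concept description
      Association
      Classification
      Clustering
      Regression
      Dependency modeling
      Change and deviation detection
      Pattern-directed analysis, and other tasks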

  • A hierarchy of DM methods
    L. Rokach and O. Maimon (2015). Data Mining with Decision Trees: Theory and Applications. World Scientific Publishing.

  • DM: a classification scheme (by function)
    Concept description: characterization and discrimination
      Find the characteristics and properties of a concept.
      Generalize, summarize, and discover constraining or contrasting characteristics, for example dry regions compared with wet ones.
      A typical descriptive task: summarization (finding a compact description)
        Mean, variance
        Text summarization
    Association
      Associations between data variables (correlation and causality).
      Diaper -> Beer [0.5%, 75%]
      Association rule: X -> Y
      For example, in Web data mining
        Discovering semantic relationships
        Relating web page content to users' interests
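The two numbers attached to a rule such as Diaper -> Beer [0.5%, 75%] are its support and confidence. A minimal sketch of both measures follows; the eight shopping baskets are invented, so the resulting figures differ from those on the slide.

```python
# Hypothetical market baskets, one set of items per transaction.
TRANSACTIONS = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diaper", "beer", "chips"},
    {"milk"},
    {"bread", "chips"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing X, the fraction that also contain Y."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


rule_support = support({"diaper", "beer"}, TRANSACTIONS)
rule_confidence = confidence({"diaper"}, {"beer"}, TRANSACTIONS)
print(f"diaper -> beer [{rule_support:.1%}, {rule_confidence:.0%}]")
```

Support says how often the rule applies at all; confidence says how reliable it is when it does apply.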

  • DM tasks: DM functions
    Classification and prediction
      Build models (functions) that describe and distinguish classes or concepts, for use in future prediction.
      For example, classify countries by climate, or classify cars by fuel consumption.
      Presentation: decision trees, classification rules, neural networks.
      Predict unknown or missing numerical values.

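  • DM: a classification scheme (by function)
    Classification
      Build or describe a predictive model/function that describes or detects classes/concepts for later prediction.
      Learn a function that maps data into one of several known classes.
    Clustering
      Group data into "clusters" (new classes) in order to discover the data distribution patterns of the application domain.
      Similarity.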

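  • DM: a classification scheme by function (2)
    Cluster analysis
      Class labels are unknown: group the data into new classes, for example cluster houses to find distribution patterns.
      Maximize similarity within a cluster and minimize similarity between clusters.
    Outlier analysis
      Outlier: a data object that does not follow the general behavior of the data as a whole. For example, use the sample mean and the sample variance.
      Noise or exception? No! Useful for detecting fraud and analyzing rare events.
    Change and deviation detection
      Mostly changes that are significant relative to a previously known measure or a reference value; provides knowledge about change and deviation.
      Change and deviation detection <-> preprocessing.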
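The slide's hint about using the sample mean and sample variance corresponds to a simple z-score test. This is a minimal sketch; the cutoff of 2 standard deviations and the list of daily amounts are assumptions, and real fraud detection would use far richer models.

```python
from statistics import mean, stdev


def outliers(values, z_threshold=2.0):
    """Flag values more than `z_threshold` sample standard deviations from the mean."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    return [v for v in values if abs(v - m) > z_threshold * s]


# Hypothetical daily transaction amounts for one account.
daily_amounts = [102, 98, 101, 97, 100, 103, 99, 250]
flagged = outliers(daily_amounts)
print(flagged)
```

Note that the extreme value itself inflates both the mean and the standard deviation, which is one reason this simple test can miss outliers in small samples.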

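  • DM: a classification scheme (by function)
    Regression
      Learn a function that maps data so as to determine the real value of one variable from several other variables.
      Typical in statistical analysis and forecasting.
      Predict the value of one or more dependent variables from the values of a set of independent variables.
    Dependency modeling
      Build a dependency model: find a model that describes significant dependencies between variables.
      Structural level: a graph form in which variables depend partially on other variables.
      Quantitative level: the strength of the dependencies measured on a numerical scale.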
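For the simplest case of regression, one dependent variable and one independent variable, the function can be learned by ordinary least squares. This is a sketch with invented data that lie exactly on y = 2x + 1; the helper name `fit_line` is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4, 5]    # independent variable
ys = [3, 5, 7, 9, 11]   # dependent variable, exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predict y for an unseen x = 6
print(slope, intercept, prediction)
```

With noisy data the fitted line no longer passes through every point, and the residuals become the basis for judging how good the model is.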

  • DM: a classification scheme (by function)
    Trend and evolution analysis
      Trend and deviation: regression analysis
      Sequential pattern mining, periodicity analysis
      Similarity-based analysis
    Other pattern-directed or statistical analyses

  • DM: a classification scheme (2)
    Classification by view
      The kind of data to be mined
      The kind of knowledge to be discovered
      The kind of techniques used
      The kind of application domain

  • A multidimensional view of DM
    Data to be mined
      Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
    Knowledge to be mined
      Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, ...
      Multiple/integrated functions and DM at multiple levels
    Techniques used
      Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, ...
    Applications adapted
      Retail, telecommunications, banking, fraud analysis, biological DM, stock market analysis, text mining, Web mining, ...

  • Are all mined patterns interesting?
    DM may generate thousands of patterns: not all of them are interesting.
      Suggested approach: user-directed, query-based, goal-directed DM.
    Interestingness measures
      A pattern is interesting if it is easy to understand, valid on new/test data with some degree of certainty, potentially useful, novel, or confirms a hypothesis that the user seeks to validate.
    Objective and subjective interestingness measures
      Objective: based on the statistics and structure of the patterns, for example support, confidence, ...
      Subjective: based on the user's beliefs about the data, for example unexpectedness, novelty, actionability, ...

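  • Can we find all and only the interesting patterns?
    Finding all the interesting patterns: completeness
      Can a DM system find every interesting pattern?
      Heuristic search versus exhaustive search
      Association versus classification versus clustering
    Finding only the interesting patterns: optimization
      Can a DM system find exactly the interesting patterns?
      Approaches
        First generate all the patterns, then filter out the uninteresting ones.
        Generate only the interesting patterns: optimize the mining query.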

  • 6. DM: the main technologies
    A confluence of multiple disciplines [HK0106]

  • DM: the main technologies
    A confluence of multiple disciplines [HKP11]

  • Mathematical statistics and DM
    DM and statistics have much in common
      Especially exploratory data analysis (EDA) and forecasting [Fied97, HD03].
      KDD systems are often coupled with statistical procedures, especially for modeling data and handling noise within an overall knowledge discovery setting.
      Statistics-based DM methods receive particular attention.

  • Mathematical statistics and DM
    Distinguishing a statistical problem from a data mining problem
      Statistical hypothesis testing: a hypothesis and a set of observed data are given. The task is to check whether the observed data are consistent with the statistical hypothesis, that is, whether the hypothesis holds over all of the observed data.
      The data mining learning problem: no model is given in advance. The resulting model must fit the entire data set -> the model parameters must not depend on how the training set was chosen. A DM learning problem requires the training set and the test set to be "representative" of all the data in the application domain and to be independent of each other. In some cases, both sets (or the test set) are published in a standard form.
    Terminology
      DM: output/target variable, data mining algorithm, attribute/feature, record, ...
      Statistical data processing: dependent variable, statistical procedure, explanatory variable, observation, ...
    See also the essay by Nguyễn Xuân Long.
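The requirement that the training and test sets be separate yet drawn from the same data is commonly met with a shuffled split. The sketch below is one simple way to do that; the 25% test fraction, the fixed seed and the helper name `train_test_split` (written here from scratch, not imported from any library) are assumptions.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and cut it into disjoint train and test parts."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(20))  # stand-ins for 20 hypothetical data records
train, test = train_test_split(records)
print(len(train), len(test))
```

A random split keeps the two sets disjoint and, on average, similarly distributed; it does not by itself guarantee that either set is representative of the whole application domain.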

  • Machine learning and DM
    Machine learning
      How computers can learn (improve their performance) from data.
      Computer programs automatically learn complex patterns and make intelligent decisions based on data, for example learning to recognize handwriting on mail from a set of examples.
      Machine learning is a fast-growing research field.
    Some machine learning topics related to data mining
      Many topics were presented in the previous section.
      Supervised learning is essentially a synonym for classification.
      Unsupervised learning is essentially a synonym for clustering.
      Semi-supervised learning uses both labeled and unlabeled examples.
      Active learning, which can also be called interactive learning, interacts with the user.

  • Information retrieval and DM
    Information retrieval (IR)
      Searching for documents, or for information within documents, according to a query. Documents: text, multimedia, web, ...
      Two assumptions: (i) the data searched are unstructured; (ii) queries take the form of keywords or key phrases rather than complex structures.
    Information retrieval and DM
      Combine retrieval models with DM techniques
        Find the main topics in a document collection and in each document; add important data attributes, ...
      Mining text, the web and social media is closely tied to information retrieval.

  • 7. Basic applications of DM
    Data analysis and decision support
      Market analysis and management
        Target marketing, customer relationship management (CRM), market basket analysis, cross-selling, market segmentation
      Risk analysis and management
        Forecasting, customer retention, improved underwriting, quality control, competitive analysis
      Fraud detection and detection of unusual patterns (outliers)
    Other applications
      Text mining (newsgroups, email, documents) and Web mining
      Stream data mining
      DNA and biological data analysis

  • Market analysis and management
    Where do the data come from?
      Credit card transactions, loyalty cards, discount coupons, customer complaints, plus (public) lifestyle studies.
    Target marketing
      Find clusters of "model" customers who share the same characteristics: interests, income level, spending habits, ...
      Determine customers' purchasing patterns over time.
    Cross-market analysis
      Associations/co-relations between product sales, and prediction based on those associations.
    Customer profiling
      Which types of customers buy which products (clustering and classification).
    Customer requirement analysis
      Identify the best products for (different) customers.
      Predict the factors that will attract new customers.
    Providing summary information
      Multidimensional summary reports
      Statistical summary information (central tendency and variation of the data)

  • Corporate analysis & risk management
    Financial planning and asset evaluation
      Cash flow analysis and forecasting
      Contingent claim analysis to evaluate assets
      Cross-sectional and time-series analysis (financial ratios, trend analysis, ...)
    Resource planning
      Summarize and compare resources and spending
    Competition
      Monitor competitors and market directions
      Group customers into classes and set class-based pricing
      Set a pricing strategy in a highly competitive market

  • Business analysis: process mining
    W.M.P. van der Aalst (2011). Process Mining: Discovery, Conformance and Enhancement of Business Processes, Springer.

  • Fraud detection and mining rare patterns
    Approaches: clustering and building fraud models, outlier analysis.
    Applications: health care, retail, credit card services, telecommunications.
      Auto insurance: rings of collisions
      Money laundering: suspicious monetary transactions
      Medical insurance
        Occupational illness, rings of doctors, and rings of referrals
        Unnecessary or correlated tests
      Telecommunications: fraudulent calls
        Call model: call destination, duration, time of day or week. Analyze patterns that deviate from an expected norm.
      Retail industry
        Analysts estimate that 38% of retail shrink is due to dishonest employees.
      Anti-terrorism

  • Other applications
    Web mining and social media mining
      IBM applies DM algorithms to Web access logs for market-related pages to discover customer preferences and behavior pages, analyze the effectiveness of Web marketing, improve Web site organization, ...
    Sports
      IBM Advanced Scout analyzes NBA game statistics (blocked shots, assists and fouls) to give the New York Knicks and the Miami Heat a competitive advantage.
    Astronomy
      JPL and the Palomar Observatory discovered 22 quasars with the help of DM.

  • 8. Major issues in DM
    Sources of reference on DM
      Data mining and KDD (SIGKDD: CD-ROM)
        Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
        Journals: Data Mining and Knowledge Discovery, KDD Explorations
      Database systems (SIGMOD: CD-ROM)
        Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
        Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
      AI & machine learning
        Conferences: Machine Learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
        Journals: Machine Learning, Artificial Intelligence, etc.
      Statistics
        Conferences: Joint Stat. Meeting, etc.
        Journals: Annals of Statistics, etc.
      Visualization
        Conference proceedings: CHI, ACM-SIGGraph, etc.
        Journals: IEEE Trans. Visualization and Computer Graphics, etc.
    Some other references
      http://www.kdnuggets.com/
      The list of references
      Future Directions in Computer Science

  • http://www.kdnuggets.com/2015/09/free-data-science-books.html

  • A regional breakdown in the US/Canada shows that:
    Data Science Managers earn an average salary of around $177K (11% higher than $165K in 2014).
    Data Scientists earn on average $122K (9% lower than $135K in 2014, probably because more people entered the market).
    Data Analysts earn on average $86K (11% higher than $76K in 2014).
    http://www.kdnuggets.com/2015/03/salary-analytics-data-science-poll-well-compensated.html

  • A brief history of the DM community
    1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
      Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
    1991-1994 Workshops on Knowledge Discovery in Databases
      Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
    1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
      Journal of Data Mining and Knowledge Discovery (1997)
    1998 ACM SIGKDD, SIGKDD'1999-2001 conferences, and SIGKDD Explorations
    More conferences on data mining
      PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.

  • DM: the top 20 keywords
    http://www.researcherid.com/

  • DM-related topics are timely!

  • The KDD web site; DM & climate change
    Causes of climate change
      Nearly 50% of KDnuggets readers believe that current climate change is largely caused by human activity, while a significant number are skeptical.
      The climate is very complex, and scientists do not claim that human activity is the only cause of climate change.
      In agreement with the Intergovernmental Panel on Climate Change: human activity is one of the main causes.
    Opinion mining / sentiment mining

  • Major issues in DM
    Mining methodology
      Mining different kinds of knowledge from diverse data such as biological, stream and web data, ...
      Performance: efficiency, effectiveness, and scalability
      Pattern evaluation: the interestingness problem
      Incorporating domain knowledge: ontologies
      Handling noisy and incomplete data
      Parallel, distributed and incremental mining methods
      Integrating discovered knowledge with existing knowledge: knowledge fusion
    User interaction
      DM query languages and "ad hoc" mining
      Representation and visualization of DM results
      Interactive mining of knowledge at multiple levels of abstraction
    Applications and social impact
      Domain-specific DM and invisible DM
      Protecting data security, integrity and privacy

  • Some initial requirements
    A preliminary list of requirements for a successful DM project
      There must be an expectation of a significant benefit from the DM results
        either directly, by picking easily gathered "low-hanging fruit" (such as a customer acquisition model for marketing and sales),
        or indirectly, by creating high leverage on a vital process that has strong ripple effects (reducing bad debts from 10% to 9.8% involves a large sum of money).
      There must be a project team with the required skills: data selection, data integration, modeling analysis, and report preparation and presentation. Analysts and business people must work well together.
      Capture and maintain accumulated information flows (for example, the resulting models from a series of marketing campaigns).
      Learning proceeds over many cycles and has to "race against practice" (the initial customer acquisition model is not yet optimal).
    A synthesis of lessons from successful and failed DM projects: [NEM09] Robert Nisbet, John Elder, and Gary Miner (2009). Handbook of Statistical Analysis and Data Mining, Elsevier, 2009.