Chapter 2 - Data Mining

VIETNAM MARITIME UNIVERSITY
FACULTY OF INFORMATION TECHNOLOGY
LECTURE NOTES: DATA MINING
Lecturer: Nguyễn Vương Thịnh
Department: Information Systems
Hải Phòng, 2011
CHAPTER 2: ASSOCIATION RULE MINING


Contents:
2.1. BASIC CONCEPTS
2.2. FINDING FREQUENT ITEM SETS WITH THE APRIORI ALGORITHM
2.3. GENERATING ASSOCIATION RULES FROM FREQUENT ITEM SETS

2.1. BASIC CONCEPTS

2.1.1. Items and item sets

Given a set of n objects $I = \{I_1, I_2, I_3, \ldots, I_n\}$, each element $I_i \in I$ is called an item. Any subset $X \subseteq I$ is called an item set.

Given a set $D = \{T_1, T_2, \ldots, T_m\}$, each element $T_j \in D$ is called a transaction and is a subset of I ($T_j \subseteq I$). D is called a transaction database. The number of transactions in D is denoted $|D|$.

Example: $I = \{A, B, C, D, E, F\}$, and $X = \{A, D, E\}$ is an item set. A transaction database D made up of distinct subsets $T_j$ of I:

T1   {A, B, C, D}
T2   {A, C, E}
T3   {A, E}
T4   {A, E, F}
T5   {A, B, C, E, F}

2.1.2. Support of an item set

The support of an item set X is the probability that X occurs in the transaction database D. Equivalently, it is the ratio of the number of transactions that contain X to the total number of transactions in D:

$$\mathrm{sup}(X) = \frac{C(X)}{|D|}$$

where $C(X)$ is the support count of X, i.e. the number of transactions that contain X.

Example: for the database D above, $X = \{A, E\}$ is contained in T2, T3, T4 and T5, so $C(X) = 4$ and $\mathrm{sup}(X) = 4/5 = 80\%$.

Item sets whose support is at least a given threshold minsup are called frequent item sets.

2.1.3. Association rules

Given two item sets $X, Y \subseteq I$ with $X \cap Y = \emptyset$, the association rule $X \rightarrow Y$ expresses a dependence of item set Y on item set X: when X occurs in a transaction, Y also occurs in that transaction at a certain rate. An association rule is characterized by its support and its confidence.

Support of the rule: the ratio (or probability) of transactions in which both X and Y occur.

$$\mathrm{sup}(X \rightarrow Y) = \mathrm{sup}(X \cup Y) = \frac{C(X \cup Y)}{|D|}$$
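The support definitions above can be sketched in a few lines of Python. This is only an illustration using the example database from section 2.1.1. The function names `support_count`, `support` and `rule_support` and the use of `frozenset` for transactions are choices made here, not part of the lecture.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]

# Example database from section 2.1.1 (T1..T5); each letter is one item.
D: List[Transaction] = [
    frozenset("ABCD"),
    frozenset("ACE"),
    frozenset("AE"),
    frozenset("AEF"),
    frozenset("ABCEF"),
]


def support_count(itemset: Iterable[str], db: List[Transaction]) -> int:
    """C(X): number of transactions that contain every item of X."""
    x = frozenset(itemset)
    return sum(1 for t in db if x <= t)


def support(itemset: Iterable[str], db: List[Transaction]) -> float:
    """sup(X) = C(X) / |D|."""
    return support_count(itemset, db) / len(db)


def rule_support(x: Iterable[str], y: Iterable[str], db: List[Transaction]) -> float:
    """sup(X -> Y) = sup(X union Y)."""
    return support(frozenset(x) | frozenset(y), db)


print(support_count("AE", D))     # 4
print(support("AE", D))           # 0.8
print(rule_support("A", "E", D))  # 0.8
```

Representing each transaction as a set makes the containment test a single subset comparison (`x <= t`), which mirrors the definition of $C(X)$ directly.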

Confidence of the rule: the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X.

$$\mathrm{conf}(X \rightarrow Y) = \frac{C(X \cup Y)}{C(X)} = \frac{\mathrm{sup}(X \cup Y)}{\mathrm{sup}(X)}$$

where $C(X \cup Y)$ is the number of transactions that contain both X and Y, and $C(X)$ is the number of transactions that contain X.

Strong rules: rules whose support is at least a given threshold minsup and whose confidence is at least a given threshold minconf are called strong association rules. Specifically, if both $\mathrm{sup}(X \rightarrow Y) \ge minsup$ and $\mathrm{conf}(X \rightarrow Y) \ge minconf$ hold, then $X \rightarrow Y$ is a strong association rule.

2.1.4. The association rule mining problem

Input: a transaction database D; the thresholds minsup and minconf.
Output: all strong association rules.

Solving the association rule mining problem usually takes two phases:

Phase 1: Generate all frequent item sets. This phase uses the Apriori algorithm.

Phase 2: For each frequent item set K found in phase 1, split K into two non-empty, disjoint item sets X and Y ($K = X \cup Y$ and $X \cap Y = \emptyset$). Compute the confidence of the rule $X \rightarrow Y$; if it is at least minconf, the rule is strong.

Note that if K has k elements, the number of non-empty proper subsets of K is $2^k - 2$, so K yields at most $2^k - 2$ rules.

Note: to decide whether an item set is frequent, the Apriori algorithm uses the support count rather than the support. An item set is frequent if its support count in the transaction database is at least the threshold

$$mincount = \lceil minsup \times |D| \rceil$$

2.2. FINDING FREQUENT ITEM SETS WITH THE APRIORI ALGORITHM

2.2.1. The Apriori principle

If an item set is frequent, then every non-empty subset of it is also frequent.

Proof: Consider a non-empty $X' \subseteq X$ and let p denote the support threshold minsup. Every transaction that contains X also contains X', so X' occurs at least as many times as X:

$$C(X') \ge C(X) \quad (1)$$

X is frequent, so:

$$\mathrm{sup}(X) = \frac{C(X)}{|D|} \ge p \;\Rightarrow\; C(X) \ge p\,|D| \quad (2)$$

From (1) and (2):

$$C(X') \ge p\,|D| \;\Rightarrow\; \mathrm{sup}(X') = \frac{C(X')}{|D|} \ge p$$

That is, X' is also frequent (Q.E.D.).

2.2.2. The Apriori algorithm

Purpose: find all frequent item sets.

The algorithm is based on the Apriori principle and works in the manner of dynamic programming: from the sets $F_i = \{c_i \mid c_i \text{ is a frequent item set}, |c_i| = i\}$ of all frequent item sets of length i ($1 \le i \le k$), it finds the set $F_{k+1}$ of all frequent item sets of length k+1.
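The level-wise search described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the lecture's own code. Item sets are represented as sorted tuples. Candidates of length k+1 are formed by joining two frequent item sets that share their first k-1 items and are discarded when any subset of length k is not frequent. The names `apriori` and `apriori_gen`, and the example threshold of 50%, are choices made for this illustration.

```python
from itertools import combinations
from math import ceil


def apriori_gen(frequent_k):
    """Candidates of length k+1 from frequent item sets of length k (sorted tuples)."""
    frequent = set(frequent_k)
    candidates = []
    for a in frequent_k:
        for b in frequent_k:
            # Join: same first k-1 items, and the last item of a precedes that of b.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                c = a + (b[-1],)
                # Prune: by the Apriori principle every k-subset of c must be frequent.
                if all(s in frequent for s in combinations(c, len(a))):
                    candidates.append(c)
    return candidates


def apriori(db, minsup):
    """Return {item set (sorted tuple): support count} for all frequent item sets."""
    # Note: minsup * len(db) is a float product; pass a fractions.Fraction
    # as minsup if exact arithmetic matters for the threshold.
    mincount = ceil(minsup * len(db))
    counts = {}
    for t in db:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {c: n for c, n in counts.items() if n >= mincount}  # F1
    result = {}
    while level:
        result.update(level)
        candidates = apriori_gen(sorted(level))  # C(k+1)
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if set(c) <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= mincount}  # F(k+1)
    return result


# Example database from section 2.1.1, with an assumed threshold minsup = 50%.
D = [frozenset("ABCD"), frozenset("ACE"), frozenset("AE"),
     frozenset("AEF"), frozenset("ABCEF")]
for itemset, n in sorted(apriori(D, 0.5).items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, n)
```

Keeping item sets as sorted tuples makes the prefix comparison in the join a simple slice comparison and guarantees that each candidate is generated only once.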
The items $I_1, I_2, \ldots, I_n$ in the set I are assumed to be arranged in a fixed order.

Input: the transaction database D = {t1, t2, ..., tm}; the minimum support threshold minsup.
Output: the set of all frequent item sets.

mincount = ⌈minsup × |D|⌉;
F1 = {frequent item sets of length 1};
for (k = 1; Fk ≠ ∅; k++)
{
    Ck+1 = Apriori_gen(Fk);
    for each t ∈ D
    {
        Ct = {c | c ∈ Ck+1 and c ⊆ t};
        for each c ∈ Ct
            c.count++;
    }
    Fk+1 = {c ∈ Ck+1 | c.count ≥ mincount};
}
return F = ⋃k Fk;

The Apriori_gen subprocedure

Apriori_gen generates the item sets of length k+1 from the item sets of length k in Fk. It runs in two steps: join the item sets that share a common prefix, then apply the Apriori principle to discard the ones that cannot be frequent. Specifically:

Join step: generate the candidate item sets c of length k+1 by combining two frequent item sets $l_i, l_j \in F_k$ of length k that agree on their first k-1 items: $c = l_i + l_j = \{i_1, i_2, \ldots, i_{k-1}, i_k, i_{k'}\}$, with $l_i = \{i_1, i_2, \ldots, i_{k-1}, i_k\}$, $l_j = \{i_1, i_2, \ldots, i_{k-1}, i_{k'}\}$, and $i_1 < i_2 < \cdots < i_{k-1} < i_k < i_{k'}$.

Prune step: keep only the candidates c that satisfy the Apriori principle, i.e. every subset of c of length k is frequent (for all $s_k \subset c$ with $|s_k| = k$, $s_k \in F_k$).

function Apriori_gen(Fk: set of frequent item sets of length k): set of candidates of length k+1
{
    Ck+1 = ∅;
    for each li ∈ Fk
        for each lj ∈ Fk
            if (li[1] = lj[1]) and (li[2] = lj[2]) and ... and (li[k-1] = lj[k-1]) and (li[k] ...

2.3. GENERATING ASSOCIATION RULES FROM FREQUENT ITEM SETS

function Rules_Generation(F: set of frequent item sets): set of strong association rules
{
    R = ∅;
    F = F \ F1;  // frequent item sets of length 1 are not used to generate rules
    for each X ∈ F
        for each S ⊂ X, S ≠ ∅
            if conf(S → (X \ S)) ≥ minconf then
                R = R ∪ {S → (X \ S)};
    return R;
}

EXERCISES

Exercise 1: Given $I = \{A, B, C, D, E, F\}$ and the transaction database D:

T1   {A, B, C, F}
T2   {A, B, E, F}
T3   {A, C}
T4   {D, E}
T5   {B, F}

Choose the thresholds minsup = 25% and minconf = 75%.
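Determine the strong association rules.

$$mincount = \lceil minsup \times |D| \rceil = \lceil 25\% \times 5 \rceil = \lceil 1.25 \rceil = 2$$

Support counts of the single items:

Item set   Count
{A}        3
{B}        3
{C}        2
{D}        1
{E}        2
{F}        3

F1 (frequent item sets of length 1):

Item set   Count
{A}        3
{B}        3
{C}        2
{E}        2
{F}        3

C2 (item sets of length 2, generated by joining the item sets of length 1):

Item set   Count
{A, B}     2
{A, C}     2
{A, E}     1
{A, F}     2
{B, C}     1
{B, E}     1
{B, F}     3
{C, E}     0
{C, F}     1
{E, F}     1

F2:

Item set   Count
{A, B}     2
{A, C}     2
{A, F}     2
{B, F}     3

Joining the frequent item sets in F2 gives the item sets of length 3: {A, B, C}, {A, B, F}, {A, C, F}. Discarding the ones that violate the Apriori principle ({A, B, C} contains {B, C} and {A, C, F} contains {C, F}, neither of which is frequent) leaves:

C3:

Item set    Count
{A, B, F}   2

F3:

Item set    Count
{A, B, F}   2

F3 has only one element, so no further join is possible to generate F4. The algorithm terminates. The set of frequent item sets is:

F = {{A}, {B}, {C}, {E}, {F}, {A, B}, {A, C}, {A, F}, {B, F}, {A, B, F}}

{A, B} can generate the rules $\{A\} \rightarrow \{B\}$ and $\{B\} \rightarrow \{A\}$:

$$\mathrm{conf}(\{A\} \rightarrow \{B\}) = \frac{C(\{A, B\})}{C(\{A\})} = \frac{2}{3} \approx 66.7\%$$
$$\mathrm{conf}(\{B\} \rightarrow \{A\}) = \frac{C(\{A, B\})}{C(\{B\})} = \frac{2}{3} \approx 66.7\%$$

{A, C} can generate the rules $\{A\} \rightarrow \{C\}$ and $\{C\} \rightarrow \{A\}$:

$$\mathrm{conf}(\{A\} \rightarrow \{C\}) = \frac{C(\{A, C\})}{C(\{A\})} = \frac{2}{3} \approx 66.7\%$$
$$\mathrm{conf}(\{C\} \rightarrow \{A\}) = \frac{C(\{A, C\})}{C(\{C\})} = \frac{2}{2} = 100\%$$

{A, F} can generate the rules $\{A\} \rightarrow \{F\}$ and $\{F\} \rightarrow \{A\}$:

$$\mathrm{conf}(\{A\} \rightarrow \{F\}) = \frac{C(\{A, F\})}{C(\{A\})} = \frac{2}{3} \approx 66.7\%$$
$$\mathrm{conf}(\{F\} \rightarrow \{A\}) = \frac{C(\{A, F\})}{C(\{F\})} = \frac{2}{3} \approx 66.7\%$$

{B, F} can generate the rules $\{B\} \rightarrow \{F\}$ and $\{F\} \rightarrow \{B\}$:

$$\mathrm{conf}(\{B\} \rightarrow \{F\}) = \frac{C(\{B, F\})}{C(\{B\})} = \frac{3}{3} = 100\%$$
$$\mathrm{conf}(\{F\} \rightarrow \{B\}) = \frac{C(\{B, F\})}{C(\{F\})} = \frac{3}{3} = 100\%$$

{A, B, F} can generate the rules $\{A\} \rightarrow \{B, F\}$, $\{A, B\} \rightarrow \{F\}$, $\{B\} \rightarrow \{A, F\}$, $\{B, F\} \rightarrow \{A\}$, $\{F\} \rightarrow \{A, B\}$ and $\{A, F\} \rightarrow \{B\}$:

$$\mathrm{conf}(\{A\} \rightarrow \{B, F\}) = \frac{C(\{A, B, F\})}{C(\{A\})} = \frac{2}{3} \approx 66.7\%$$
$$\mathrm{conf}(\{A, B\} \rightarrow \{F\}) = \frac{C(\{A, B, F\})}{C(\{A, B\})} = \frac{2}{2} = 100\%$$
$$\mathrm{conf}(\{B\} \rightarrow \{A, F\}) = \frac{C(\{A, B, F\})}{C(\{B\})} = \frac{2}{3} \approx 66.7\%$$
$$\mathrm{conf}(\{B, F\} \rightarrow \{A\}) = \frac{C(\{A, B, F\})}{C(\{B, F\})} = \frac{2}{3} \approx 66.7\%$$
$$\mathrm{conf}(\{F\} \rightarrow \{A, B\}) = \frac{C(\{A, B, F\})}{C(\{F\})} = \frac{2}{3} \approx 66.7\%$$
$$\mathrm{conf}(\{A, F\} \rightarrow \{B\}) = \frac{C(\{A, B, F\})}{C(\{A, F\})} = \frac{2}{2} = 100\%$$

The strong association rules obtained are therefore:

1. $\{C\} \rightarrow \{A\}$
2. $\{B\} \rightarrow \{F\}$
3. $\{F\} \rightarrow \{B\}$
4. $\{A, B\} \rightarrow \{F\}$
5. $\{A, F\} \rightarrow \{B\}$

Exercise 2: Given $I = \{A, B, C, D, E, F\}$ and the transaction database D:

T1   {D, E}
T2   {A, B, D, E}
T3   {A, B, D}
T4   {C, D, E}
T5   {F}
T6   {B, C, D}

Choose the thresholds minsup = 20% and minconf = 70%. Determine the strong association rules.

$$mincount = \lceil minsup \times |D| \rceil = \lceil 20\% \times 6 \rceil = \lceil 1.2 \rceil = 2$$

Support counts of the single items:

Item set   Count
{A}        2
{B}        3
{C}        2
{D}        5
{E}        3
{F}        1

F1:

Item set   Count
{A}        2
{B}        3
{C}        2
{D}        5
{E}        3

C2:

Item set   Count
{A, B}     2
{A, C}     0
{A, D}     2
{A, E}     1
{B, C}     1
{B, D}     3
{B, E}     1
{C, D}     2
{C, E}     1
{D, E}     3

F2:

Item set   Count
{A, B}     2
{A, D}     2
{B, D}     3
{C, D}     2
{D, E}     3

C3:

Item set    Count
{A, B, D}   2

F3:

Item set    Count
{A, B, D}   2

F3 has only one element, so no further join is possible to generate candidates for F4. The algorithm terminates. The set of frequent item sets obtained is:

F = {{A}, {B}, {C}, {D}, {E}, {A, B}, {A, D}, {B, D}, {C, D}, {D, E}, {A, B, D}}

{A, B} generates the rules $\{A\} \rightarrow \{B\}$ and $\{B\} \rightarrow \{A\}$:

$$\mathrm{conf}(\{A\} \rightarrow \{B\}) = \frac{C(\{A, B\})}{C(\{A\})} = \frac{2}{2} = 100\%$$
$$\mathrm{conf}(\{B\} \rightarrow \{A\}) = \frac{C(\{A, B\})}{C(\{B\})} = \frac{2}{3} \approx 66.7\%$$

{A, D} generates the rules $\{A\} \rightarrow \{D\}$ and $\{D\} \rightarrow \{A\}$:

$$\mathrm{conf}(\{A\} \rightarrow \{D\}) = \frac{C(\{A, D\})}{C(\{A\})} = \frac{2}{2} = 100\%$$
$$\mathrm{conf}(\{D\} \rightarrow \{A\}) = \frac{C(\{A, D\})}{C(\{D\})} = \frac{2}{5} = 40\%$$

{B, D} generates the rules $\{B\} \rightarrow \{D\}$ and $\{D\} \rightarrow \{B\}$:

$$\mathrm{conf}(\{B\} \rightarrow \{D\}) = \frac{C(\{B, D\})}{C(\{B\})} = \frac{3}{3} = 100\%$$
$$\mathrm{conf}(\{D\} \rightarrow \{B\}) = \frac{C(\{B, D\})}{C(\{D\})} = \frac{3}{5} = 60\%$$

{C, D} generates the rules $\{C\} \rightarrow \{D\}$ and $\{D\} \rightarrow \{C\}$:

$$\mathrm{conf}(\{C\} \rightarrow \{D\}) = \frac{C(\{C, D\})}{C(\{C\})} = \frac{2}{2} = 100\%$$
$$\mathrm{conf}(\{D\} \rightarrow \{C\}) = \frac{C(\{C, D\})}{C(\{D\})} = \frac{2}{5} = 40\%$$

{D, E} generates the rules $\{D\} \rightarrow \{E\}$ and $\{E\} \rightarrow \{D\}$:

$$\mathrm{conf}(\{D\} \rightarrow \{E\}) = \frac{C(\{D, E\})}{C(\{D\})} = \frac{3}{5} = 60\%$$
$$\mathrm{conf}(\{E\} \rightarrow \{D\}) = \frac{C(\{D, E\})}{C(\{E\})} = \frac{3}{3} = 100\%$$

{A, B, D} generates the rules $\{A\} \rightarrow \{B, D\}$, $\{A, B\} \rightarrow \{D\}$, $\{B\} \rightarrow \{A, D\}$, $\{B, D\} \rightarrow \{A\}$, $\{D\} \rightarrow \{A, B\}$ and $\{A, D\} \rightarrow \{B\}$:

$$\mathrm{conf}(\{A\} \rightarrow \{B, D\}) = \frac{C(\{A, B, D\})}{C(\{A\})} = \frac{2}{2} = 100\%$$
$$\mathrm{conf}(\{A, B\} \rightarrow \{D\}) = \frac{C(\{A, B, D\})}{C(\{A, B\})} = \frac{2}{2} = 100\%$$
$$\mathrm{conf}(\{B\} \rightarrow \{A, D\}) = \frac{C(\{A, B, D\})}{C(\{B\})} = \frac{2}{3} \approx 66.7\%$$
$$\mathrm{conf}(\{B, D\} \rightarrow \{A\}) = \frac{C(\{A, B, D\})}{C(\{B, D\})} = \frac{2}{3} \approx 66.7\%$$
$$\mathrm{conf}(\{D\} \rightarrow \{A, B\}) = \frac{C(\{A, B, D\})}{C(\{D\})} = \frac{2}{5} = 40\%$$
$$\mathrm{conf}(\{A, D\} \rightarrow \{B\}) = \frac{C(\{A, B, D\})}{C(\{A, D\})} = \frac{2}{2} = 100\%$$

The strong association rules obtained are:

1. $\{A\} \rightarrow \{B\}$
2. $\{A\} \rightarrow \{D\}$
3. $\{B\} \rightarrow \{D\}$
4. $\{C\} \rightarrow \{D\}$
5. $\{E\} \rightarrow \{D\}$
6. $\{A\} \rightarrow \{B, D\}$
7. $\{A, B\} \rightarrow \{D\}$
8. $\{A, D\} \rightarrow \{B\}$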
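The results of the two exercises can be cross-checked with a short script. The sketch below makes two assumptions that are not in the lecture. First, the frequent item sets are found by brute-force enumeration of all subsets of the items, used here as a stand-in for Apriori to keep the example short. Second, the names `count`, `frequent_itemsets` and `rules_generation` are invented for this illustration. The rule generation itself follows the Rules_Generation idea: for every frequent item set X with at least two items, test each non-empty proper subset S as the left-hand side.

```python
from itertools import combinations
from math import ceil


def count(itemset, db):
    """C(X): number of transactions that contain every item of X."""
    return sum(1 for t in db if set(itemset) <= t)


def frequent_itemsets(db, minsup):
    """Brute-force stand-in for Apriori: test every non-empty subset of the items."""
    mincount = ceil(minsup * len(db))
    items = sorted(set().union(*db))
    found = {}
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            n = count(c, db)
            if n >= mincount:
                found[c] = n
    return found


def rules_generation(frequent, minconf):
    """Strong rules S -> (X minus S) for every frequent X with at least two items."""
    rules = []
    for x, cx in frequent.items():
        for r in range(1, len(x)):
            for s in combinations(x, r):
                # S is frequent by the Apriori principle, so its count is known.
                conf = cx / frequent[s]
                if conf >= minconf:
                    rest = tuple(i for i in x if i not in s)
                    rules.append((s, rest, conf))
    return rules


# Exercise 1: minsup = 25%, minconf = 75%.
ex1 = [set("ABCF"), set("ABEF"), set("AC"), set("DE"), set("BF")]
# Exercise 2: minsup = 20%, minconf = 70%.
ex2 = [set("DE"), set("ABDE"), set("ABD"), set("CDE"), set("F"), set("BCD")]

for name, db, minsup, minconf in [("Exercise 1", ex1, 0.25, 0.75),
                                  ("Exercise 2", ex2, 0.20, 0.70)]:
    print(name)
    for s, rest, conf in rules_generation(frequent_itemsets(db, minsup), minconf):
        print("  ", set(s), "->", set(rest), f"conf = {conf:.1%}")
```

Brute-force enumeration is only practical for a handful of items, since the number of subsets grows as $2^n$; that growth is exactly what the Apriori pruning is meant to avoid.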