cao dinh tri-ch0901058

Upload: vinhxuann

Post on 03-Apr-2018


  • 7/28/2019 Cao Dinh Tri-CH0901058

    1/71

    A STUDY OF AN OPEN DATABASE MANAGEMENT SYSTEM FOR CLOUD COMPUTING

    Advisor: Assoc. Prof. Dr.Sc. Nguyen Phi Khu

    Student: Cao Dinh Tri

    CHAPTER 1: OVERVIEW OF CLOUD COMPUTING

    CHAPTER 2: A STUDY OF AN OPEN DATABASE MANAGEMENT SYSTEM FOR CLOUD COMPUTING

    CHAPTER 3: DEVELOPMENT TRENDS OF CLOUD COMPUTING

    CHAPTER 1: OVERVIEW OF CLOUD COMPUTING

    1.1 Definition
    1.2 Cloud computing architecture
    1.3 Benefits of cloud computing
    1.4 Cloud computing deployment models
    1.5 Advantages and disadvantages of the model

    1.1 Definition

    Many definitions of cloud computing have been proposed.

    A. Cloud computing is a computing model that scales elastically and in which resources are usually virtualized and provided as a service over the Internet.

    B. According to Foster (2008): a large-scale distributed computing paradigm driven by economies of scale, in which a pool of abstracted, virtualized, dynamically scalable computing power, storage, platforms, and services is delivered on demand to external customers over the Internet.

    C. According to Symantec: cloud computing is a network of interconnected, readily available computing resources that are allocated dynamically through virtualization and can scale, so that users can consume services on demand and pay as they go.

    Drawing on the views of analysts and industry experts, a general definition of cloud computing can be given. Cloud computing is:
    * A new computing model.
    * Infrastructure resources (hardware, storage devices, system software) and applications are provided under an X-as-a-Service model and billed by usage.
    * The key characteristics of the cloud are virtualization and elastic scaling on demand.
    * Cloud services can be used through web interfaces or through predefined APIs.

    Figure 1: The cloud computing model

    1.2 Cloud computing architecture

    - Most cloud computing architectures today follow a three-layer model:
    * Infrastructure as a Service (IaaS)
    * Platform as a Service (PaaS)
    * Software as a Service (SaaS)

    Figure 2: The cloud computing architecture model

    1.2.1 Infrastructure as a Service

    IaaS provides customers with basic services: computing capacity, storage space, and network connectivity. Customers (individuals or organizations) can use these infrastructure resources to meet their computing needs or to install their own applications for their users. With this service the customer controls the operating system, the storage, and the applications the customer installs. A typical IaaS customer is anyone who needs a computer and wants to install their own applications on it. A typical example is Amazon's EC2 service: a customer can sign up for a virtual machine on Amazon's service, choose an operating system (for example, Windows or Linux), and install their own applications.

    1.2.2 Platform as a Service

    This layer sits between SaaS and IaaS. Its users are service developers. They can write applications against the standards of a specific platform without caring about the underlying hardware. Users upload their program code to the platform, which can then scale the application automatically as demand for the software grows. The PaaS layer works on top of the standardized interfaces of the IaaS layer to access the available virtualized resources, and in turn provides standard interfaces and development platforms for applications in the SaaS layer.

    1.2.3 Software as a Service

    SaaS is the top layer of the three-layer cloud computing model. It consists of software and applications that are owned, delivered, and managed remotely by one or more service providers, usually under a pay-per-use model. This is the layer most visible to users, because it contains the actual programs and software they access and use.

    1.3 Benefits of cloud computing

    - Use of dynamic computing resources
    - Lower costs for the enterprise
    - Less complexity in the enterprise's structure
    - Higher utilization of computing resources

    1.4 Cloud computing deployment models

    Figure 3: Types of cloud computing models

    1.4.1 Public cloud

    Figure 4: The public cloud model

    Public clouds are cloud services provided by a third party. They exist outside the company firewall and are hosted and managed by the service provider.

    Public clouds offer users the best available IT building blocks: software, application infrastructure, or physical infrastructure.

    The advantages of this model are low cost, because resources are shared among many users, and easy scaling as demand grows. The disadvantage is that the data is stored and managed by the provider and must be exposed over the Internet, which brings security risks.

    1.4.2 Private cloud

    Figure 5: The private cloud model

    Private clouds are cloud services provided inside the enterprise. They exist inside the company firewall and are managed by the enterprise.

    * Benefits for the enterprise:
    - The enterprise has full control over managing and configuring the cloud, so it can choose a suitable configuration.
    - Data is stored on the enterprise's own premises, which improves security and safety.
    * Disadvantages:
    - High up-front investment.
    - System resources may not be fully utilized.
    - Hard to scale as the enterprise grows.

    1.4.3 Hybrid cloud

    Figure 6: The hybrid cloud model

    A hybrid cloud combines public and private clouds. It is usually created by the enterprise, and management responsibilities are split between the enterprise and the public cloud provider. A hybrid cloud uses services from both the public and the private space.

    With a hybrid cloud, the enterprise can take advantage of the provider's available resources while keeping certain applications or data inside the firewall.

    The price is added complexity: applications are distributed across several environments, and both internal and external infrastructure must be monitored, including security and policy. This model is probably not suitable for applications with complex synchronization and database requirements.

    1.4.4 Community cloud

    Figure 7: The community cloud model

    Community clouds are shared by several organizations or by a group of users with a common purpose and common needs. They can be managed by that group or by a third party.

    1.5 Advantages and disadvantages of the model

    1.5.1 Advantages

    The following strengths have helped cloud computing become a widely adopted computing model around the world.
    a. Fast processing: users get fast, low-cost services built on a centralized infrastructure (the cloud).
    b. The up-front investment in infrastructure, equipment, and staff is reduced to a minimum for cloud users.
    c. Independence from device and location: users access the system through a web browser from anywhere and on any device (for example, a PC or a mobile phone).
    d. Resources and costs are shared across a wide area, which benefits users.
    e. High reliability.
    f. Scalability, which improves the quality of the services offered on the cloud.
    g. Better security, because data is centralized.
    h. Easy maintenance and installation.

    1.5.2 Disadvantages

    However, this computing model still has the following weaknesses:
    a. Privacy
    b. Availability
    c. Data loss
    d. Data portability and ownership
    e. Security

    CHAPTER 2: A STUDY OF AN OPEN DATABASE MANAGEMENT SYSTEM FOR CLOUD COMPUTING

    2.1 Introduction
    2.2 System architecture
    2.3 Performance optimization
    2.4 Experiments

    2.1 Introduction

    The emergence of cloud computing and of hosted software as a service has created a new market for data management. At the same time, because data sets keep growing, traditional parallel database solutions can be prohibitively expensive. To perform this kind of analysis cost-effectively, many companies have developed distributed data storage and processing systems on large clusters of servers, including the Google File System [1], BigTable [2], MapReduce [3], Hadoop [4], Amazon's Simple Storage Service (S3) [5], SimpleDB [6], and Microsoft's SDS cloud database [7].

    Many NoSQL databases are used to manage large volumes of data, including MongoDB [8], Apache CouchDB [9], Cassandra [10], and Dynamo [11].

    Many of these databases are designed to run on clusters of hundreds to thousands of nodes and can serve data ranging from hundreds of terabytes to petabytes.

    On the other hand, data management tools usually talk to the database through ODBC or JDBC, so database software that wants to work with these products must accept SQL queries. What is needed, therefore, is a new technology that combines DBMS compatibility with the scalability of the cloud.

    In this study we propose a solution, called SQLMR, that combines SQL programming with the scalability of MapReduce. SQLMR users can write data management programs in the SQL query language, or run existing programs without modification. SQLMR provides a compiler that translates an SQL program into a MapReduce program and executes it on a MapReduce system. To achieve high data-processing performance, we also propose several optimization techniques.

    The main contributions of this study are:

    * An SQL-to-MapReduce compiler, called SQLMR. SQLMR currently supports a broad subset of SQL queries and supports large-scale data management applications such as online analytical processing (OLAP) and data mining.

    * A data file construction technique that quickly converts SQL files into HDFS (Hadoop Distributed File System) files that MapReduce programs accept as input. This technique significantly reduces the conversion time between SQL and MapReduce.

    * Efficient partitioning and indexing techniques that quickly locate the queried data in HDFS and reduce disk I/O for range queries.

    * A query result cache that avoids reprocessing redundant queries.

    * Optimization techniques for the Hadoop MapReduce system that reduce query processing time.

    We ran extensive experiments to evaluate SQLMR. A comparison with Hive and HadoopDB, two well-known database management systems built on Hadoop MapReduce, shows the performance and scalability of SQLMR.

    2.2 System architecture

    Figure 8: SQLMR system architecture

    The goal of SQLMR is a framework that combines the programming advantages of SQL with the scalability and fault tolerance of MapReduce. The system accepts SQL queries as input and translates them into a sequence of MapReduce jobs. When the MapReduce jobs finish, the system returns the query results to the user in SQL form.

    SQLMR has four main components: the SQL-to-MapReduce compiler, the query result manager, the database partitioning and indexing manager, and the optimized Hadoop system. The database partitioning and indexing manager stores table information, index files, and metadata files. The other three components interact with it to request the information they need when processing a query.

    2.2.1 SQL-to-MapReduce compiler

    The compiler takes SQL queries as input and translates them into a sequence of MapReduce jobs. In the following example, the table student_hw has three columns (id, hw, score). To count the students whose score on homework 1 is higher than 80:

    SELECT COUNT(s.id)
    FROM student_hw AS s
    WHERE s.score > 80 AND s.hw = 1

    The SQL query is compiled into two phases. In the first phase, the mapper reads the tuples of table student_hw and emits output records with two parts. The first part is the "key" (s.id) of a tuple whose score is higher than 80; the second part is the "value" 1. In the second phase, the reducer reads the key-value pairs and adds up all the "1" values. The query result returned by SQLMR is a single value representing the count. Note that the number of mappers and reducers that execute a query is decided by the Hadoop system. In the future we may let users decide the number of mappers and reducers used to process a query.
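The two phases described above can be mimicked in a few lines of plain Python. This is only an in-process sketch of the idea, not code generated by SQLMR: the sample rows and the names `STUDENT_HW`, `map_phase`, `reduce_phase`, and `count_query` are assumptions made for illustration.

```python
# Hypothetical sample rows of student_hw: (id, hw, score).
STUDENT_HW = [
    (1, 1, 95),
    (2, 1, 70),
    (3, 1, 88),
    (1, 2, 91),
    (4, 1, 81),
]


def map_phase(rows):
    """Phase 1: emit (key, value) = (id, 1) for rows with score > 80 AND hw = 1."""
    for student_id, hw, score in rows:
        if score > 80 and hw == 1:
            yield student_id, 1


def reduce_phase(pairs):
    """Phase 2: add up every emitted 1 to obtain COUNT(s.id)."""
    return sum(value for _key, value in pairs)


def count_query(rows):
    """Run both phases in-process; a real job would run them on Hadoop."""
    return reduce_phase(map_phase(rows))
```

Calling `count_query(STUDENT_HW)` counts the three sample rows that satisfy both conditions. On a cluster the map calls would be spread over many mappers, which is why the value emitted per row is a plain 1 that any reducer can sum.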

    Query operations supported by SQLMR

    Type                    Supported functions
    Basic Operations        SELECT, WHERE, ATTRIBUTES (Single, Multiple, *)
    Computing Operations    SUM, DISTINCT, JOIN, COUNT, JOIN MULTI-TABLE, SUB-QUERY
    Condition Operations    GROUP BY, BETWEEN-AND, MULTI-CONDITION, ORDER BY (DESC, ASC), DATA OPERATION

    2.2.2 Query result manager

    The query result manager stores the result of every query. When a new query enters SQLMR, the compiler first sends it to the query result manager, which compares it with the queries in its log. If there is a match, the result is returned to the user without reprocessing the query. Otherwise the compiler parses the query and generates optimized MapReduce code. Stored results become invalid when a user updates or deletes data in the database.
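The lookup-then-invalidate behaviour can be sketched with a small dictionary-based cache. This is a minimal illustration under stated assumptions, not the SQLMR implementation: the class `QueryResultCache`, its method names, and the whitespace/case normalization used to decide that two queries match are all hypothetical.

```python
class QueryResultCache:
    """Toy stand-in for the query result manager: a log of past queries and results."""

    def __init__(self):
        # normalized query text -> (set of table names the query reads, result)
        self._log = {}

    @staticmethod
    def _normalize(sql):
        # Simplistic matching rule (an assumption): ignore case and extra whitespace.
        return " ".join(sql.lower().split())

    def lookup(self, sql):
        """Return the stored result, or None if the query must be compiled and run."""
        entry = self._log.get(self._normalize(sql))
        return None if entry is None else entry[1]

    def store(self, sql, tables, result):
        """Record the result of a query together with the tables it depends on."""
        self._log[self._normalize(sql)] = (set(tables), result)

    def invalidate(self, table):
        """Drop every stored result that depends on a table that was updated or deleted from."""
        self._log = {q: e for q, e in self._log.items() if table not in e[0]}
```

A caller would try `lookup` first, fall back to compiling and running the query on a miss, call `store` with the result, and call `invalidate` whenever a table is modified.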

    2.2.3 Database partitioning and indexing manager

    This component (DPIM) manages data files and indexing. When new data is added to the system, DPIM partitions it and builds indexes for it. With smart partitioning and indexing, SQLMR can quickly locate the data blocks a query needs, and can identify exactly which blocks a range query has to access, which reduces disk I/O. These partitioning and indexing techniques differ from related work, which dumps entire data files into the MapReduce system.

    2.2.4 Optimized Hadoop

    Hadoop is a software framework for distributed processing of large data sets on clusters of computers. The compiler generates optimized MapReduce jobs and runs them on Hadoop. We develop optimizations, such as cross-rack communication optimization, to improve the performance of Hadoop.

    2.3 Performance optimization

    2.3.1 Data partitioning and preprocessing

    Figure 9: Connecting SQL to HDFS: the data loading process in HadoopDB

    In this section we describe how SQLMR moves data from a traditional RDBMS into the Hadoop MapReduce system. We also present the approach taken by HadoopDB, which serves as a baseline in our experiments.

    Figure 9 shows the HadoopDB flow. First, the user has to export all data from PostgreSQL into text (CSV) files and put these files into HDFS for partition preprocessing.

    The purpose of the partitioning step is to push query logic (for example, joins) down into the database. The work has two phases. First, the data files are loaded into HDFS. Then a custom HadoopDB Hadoop job, named GlobalHasher, repartitions the data into a number of partitions (for example, the number of nodes in the cluster). The next step downloads all partitioned data from HDFS to local disk and imports it into the local PostgreSQL server on each node. In SQLMR, by contrast, all database files are stored directly in HDFS without this preprocessing. This design can reduce preprocessing time significantly.

    Figure 10: Connecting SQL to HDFS: the data loading process in SQLMR

    To further reduce loading time and speed up data processing, we developed several optimizations in SQLMR, shown in Figure 10. First, SQLMR parses the table schema to obtain the size of one record. Next, SQLMR reads all data from the database server and partitions it according to the parsed schema and the HDFS block size. Finally, the partitioned and indexed tables are stored in HDFS. The details of partitioning are described in the next section.

    HadoopDB implements a hybrid database system by developing a database connector that fetches data from a traditional PostgreSQL database. The database connector in HadoopDB behaves like an ordinary database client implemented with JDBC (Java Database Connectivity). During our experiments we found that HadoopDB spends a lot of time fetching data from PostgreSQL and causes a heavy disk I/O load. To overcome this problem, we developed a data file construction technique that quickly converts SQL database files into a format that the MapReduce system recognizes as input files. The file construction technique implements the InputFormat interface of Hadoop MapReduce. This interface is called by the Mapper function to read the required data from HDFS, and it can be designed to read any data in any format.

    In SQLMR we implement a custom Java InputFormat class, labelled "Data Connector" in Figure 11, that reads MySQL files directly without exporting the database to text files. This technique significantly reduces the conversion time between SQL and the MapReduce system.

    Figure 11: The data connector reads data directly from the database system

    2.3.2 Data indexing

    An index is a data structure that improves the performance of data retrieval and search in a traditional DBMS. In a cloud DBMS it becomes even more important, because the stored data is very large and the data of interest must be located quickly. In SQLMR we implement two indexing techniques to improve search speed. SQLMR picks the suitable technique according to the characteristics of the database. The two techniques are introduced next.

    2.3.2.1 Partition index

    Figure 12: Data partitioning in SQLMR

    In this method, the files that store a database table are divided into files of fixed length. The length can be set from the HDFS block size, so that one file fits in one block. Each file holds the records for one range of keys. The key range is determined by the table schema and the file size. For example, in Figure 12 the table has 4 columns. After parsing the table schema, one record is 2 KB and one file is 64 MB. SQLMR then partitions the table's data file into multiple partition files. The first file holds the rows with keys from 1 to 32,768.

    The next file holds the rows with keys from 32,769 to 65,536, and so on. Inserting, deleting, and searching for a key all take constant time O(1), and the file that holds a record can be computed from the key of the target record. This approach suits data with a dense key space, because space for each record is pre-allocated in a file. In the general case we use the B+ tree index.
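The constant-time lookup follows directly from the fixed record and file sizes in the example (2 KB records, 64 MB files). The sketch below shows the arithmetic; the function name `locate`, the 0-based file numbering, and the assumption that keys are dense integers starting at 1 are illustrative choices, not details taken from SQLMR.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB: one partition file fills one HDFS block
RECORD_SIZE = 2 * 1024          # 2 KB per record, obtained from the table schema
RECORDS_PER_FILE = BLOCK_SIZE // RECORD_SIZE  # 32768 records per partition file


def locate(key):
    """Map a 1-based integer key to (file_index, byte_offset) in O(1).

    file_index is 0-based: file 0 holds keys 1..32768, file 1 holds 32769..65536.
    """
    if key < 1:
        raise ValueError("keys are assumed to start at 1")
    file_index, slot = divmod(key - 1, RECORDS_PER_FILE)
    return file_index, slot * RECORD_SIZE
```

Because the position is computed rather than searched, no index structure has to be read for a point lookup; the cost is that space is reserved for every key in the range, which is why the scheme only pays off for dense key spaces.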

    2.3.2.2 B+ tree index

    Figure 13: B+ tree structure

    The B+ tree is widely used for database indexing. Many database products, such as Oracle [20] and the open-source MySQL, use B+ trees to index data. It is a general approach that applies to many kinds of applications. In SQLMR the B+ tree structure is maintained by HDFS, and we modified the DFSClient module in Hadoop so that a block can be looked up by key through the tree. Search, delete, and insert all take amortized logarithmic time O(log N), where N is the number of nodes in the tree. To build the index tree we query the master node for block information. The master node returns all locations of a block, including the primary copy and the replicas. The B+ tree nodes store the information for all replicas of a block in order to keep the block information available.

    The process is as follows. SQLMR receives a query with a search key and finds the matching blocks through the tree. The internal nodes of the tree store only search keys, and the leaf nodes store the block information, including the block ID and location. Figure 13 shows the structure of the B+ tree. To make merging and splitting blocks easier, the keys in a block must be sorted. We sort the keys of a block when blocks are merged or split. This adds little sorting cost, because the number of keys in a block is bounded by a constant C,

    C = (size of a block) / (size of a record)

    so the sorting complexity is O(1).
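The key-to-block lookup can be illustrated with a much simpler structure than a real B+ tree. The sketch below keeps only a sorted array of each block's smallest key and searches it with binary search, which plays the role of the tree's leaf level; it is a stand-in for illustration, not a B+ tree and not the modified DFSClient. The names `BlockIndex`, `add_block`, `find`, and `keys_per_block`, as well as the block IDs and locations in the usage note, are hypothetical.

```python
from bisect import bisect_right


def keys_per_block(block_size, record_size):
    """The constant C: the largest number of keys one block can hold."""
    return block_size // record_size


class BlockIndex:
    """Sorted-array stand-in for the leaf level of the key-to-block index."""

    def __init__(self):
        self._first_keys = []  # smallest key stored in each block, ascending
        self._blocks = []      # (block_id, location) at the same position

    def add_block(self, first_key, block_id, location):
        """Register a block, keeping the array sorted by first key."""
        pos = bisect_right(self._first_keys, first_key)
        self._first_keys.insert(pos, first_key)
        self._blocks.insert(pos, (block_id, location))

    def find(self, key):
        """Return (block_id, location) of the block whose key range contains key."""
        pos = bisect_right(self._first_keys, key) - 1
        if pos < 0:
            raise KeyError(key)
        return self._blocks[pos]
```

For example, after `add_block(1, "blk_0", "node-a")` and `add_block(32769, "blk_1", "node-b")`, `find(40000)` resolves to the second block. A real B+ tree adds internal nodes so that inserts and deletes stay logarithmic instead of shifting array elements.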

    2.3.3 Hadoop optimization

    User queries are compiled into MapReduce jobs and run on Hadoop, so the performance of Hadoop is critical to the execution of SQLMR. We apply our earlier work on Hadoop optimization to improve its performance, and introduce the technique briefly here.

    In our earlier work on Hadoop [21] we proposed the Reducer Placement Problem (RPP), defined as follows. Given the number of racks, the number of mappers on each rack, and the number of reducers to schedule, how many reducers should run on each rack? We derive a simple traffic model and an objective function that represents the amount of traffic on a rack. In this model the traffic of a rack is a function of the number of reducers (ri) running on it, and we formulate RPP as an optimization problem.

    Suppose there are N racks, M mappers, and R reducers, and the number of mappers on each rack is {m1, m2, ..., mN}. Given that the traffic of rack i is fi(ri), we want to find the number of reducers on each rack, {r1, r2, ..., rN}.
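One way to write the problem down is given below. The per-rack traffic function $f_i$ matches the expression used in the greedy pseudocode for RPP; the unit-traffic derivation in the first comment and the min-max objective are assumptions added for illustration.

```latex
% Assumption: every mapper sends one unit of data to every reducer.
% Traffic on rack i = data leaving its mappers for remote reducers
%                   + data arriving at its reducers from remote mappers.
f_i(r_i) = m_i (R - r_i) + (M - m_i)\, r_i = (M - 2 m_i)\, r_i + m_i R

% Assumed objective: keep the busiest rack as lightly loaded as possible.
\min_{r_1,\dots,r_N} \; \max_{1 \le i \le N} f_i(r_i)
\quad \text{subject to} \quad \sum_{i=1}^{N} r_i = R, \qquad r_i \in \mathbb{Z}_{\ge 0}
```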

    One of the optimal algorithms is a greedy algorithm. It places one reducer on a rack at a time. The main idea is to always place the next reducer on the rack that currently has the smallest traffic.

    Algorithm 1 shows the pseudocode of the greedy algorithm. Note that we use a state tuple to store the number of reducers on each rack. For example, with four racks and ten reducers to schedule, the greedy algorithm may return the state tuple [1, 2, 3, 4], which means placing one reducer on rack 1, two on rack 2, three on rack 3, and four on rack 4.

    Algorithm 1: Greedy algorithm for RPP

    Require: the number of mappers on each rack: {m1, m2, ..., mN}
    Ensure: a reducer state tuple: {r1, r2, ..., rN}

    N: number of racks
    M: total number of mappers
    R: total number of reducers
    state_tuple[N] <- {0, 0, ..., 0}

    for i = 1 to R do
        minimal <- infinity
        for j = 1 to N do
            traffic = (M - 2*mj) * (state_tuple[j] + 1) + mj * R
            if traffic < minimal then
                minimal = traffic
                candidate = j
            end if
        end for
        state_tuple[candidate]++
    end for

    return state_tuple
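The pseudocode above translates almost line for line into Python. This is a sketch rather than the authors' implementation: the function name `greedy_reducer_placement`, the 0-based rack indices, the explicit update of the running minimum, and the tie-breaking rule (the first rack with the smallest traffic wins) are assumptions.

```python
import math


def greedy_reducer_placement(mappers_per_rack, num_reducers):
    """Place reducers one at a time on the rack with the smallest resulting traffic.

    mappers_per_rack: list [m_1, ..., m_N] of mapper counts per rack.
    Returns the state tuple [r_1, ..., r_N] as a list.
    """
    if not mappers_per_rack:
        raise ValueError("at least one rack is required")

    total_mappers = sum(mappers_per_rack)   # M
    state = [0] * len(mappers_per_rack)     # reducers placed on each rack so far

    for _ in range(num_reducers):
        minimal = math.inf
        candidate = 0
        for j, m_j in enumerate(mappers_per_rack):
            # Traffic of rack j if it received one more reducer.
            traffic = (total_mappers - 2 * m_j) * (state[j] + 1) + m_j * num_reducers
            if traffic < minimal:
                minimal = traffic
                candidate = j
        state[candidate] += 1

    return state
```

With mapper counts `[4, 2, 2]` and four reducers, the first two reducers go to the lightly loaded racks 1 and 2 (0-based), and the remaining two go to rack 0 once all three racks tie, giving `[2, 1, 1]`. The running time is O(R * N), one scan of the racks per reducer.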

    2.4 Experiments

    2.4.1 Experimental setup

    In the experiments we use the Sysbench database benchmark and compare SQLMR with other database systems: MySQL on Ceph, in which the data files are stored in the Ceph distributed file system; MySQL Cluster; and two MapReduce-based systems, Hive and HadoopDB. SysBench is a modular, cross-platform, multi-threaded benchmark tool for evaluating the operating system parameters that matter most for a system running a database under intensive load. We use the OLTP module of Sysbench to benchmark real database performance. OLTP can generate very large amounts of sequential data indexed by the id column. It can also generate transactional queries.

    The experiments have two parts: data scalability and system scalability. The first shows scalability with respect to increasing data size while the number of nodes is fixed at 10. The second shows scalability with respect to increasing system size, with the data size fixed at 10 GB per node and up to 64 nodes in total. Each node has 2 CPU cores running at 2.27 GHz, 4 GB of memory, and 200 GB of disk space, and all nodes are connected through a Gigabit Ethernet switch. The elapsed time of each experiment is measured with the "time" command, and each data point is the average of 10 runs.

    2.4.2 Experimental results

    2.4.2.1 Varying the data size

    This set of experiments compares scalability with respect to increasing data size. The number of nodes is fixed at 10, and the data size varies from 512 MB to 1 TB. Figure 14 shows the execution time of MySQL, MySQL Cluster, Hive, HadoopDB, and SQLMR for a SELECT operation at different data sizes. Figures 15 and 16 plot the performance comparison. The query is as follows.

    SELECT sum(id) FROM table
    WHERE id >= max(id)/2 and id

    Data size   MySQL     MySQL(C)  Hive      HadoopDB  SQLMR
    512MB       7.56      3.32      34.34     38.38     32.14
    1GB         11.69     6.02      33.27     41.35     31.64
    2GB         20.32     12.11     34.24     43.36     33.37
    4GB         42.58     23.96     34.27     38.28     33.41
    8GB         84.66     47.15     40.06     40.48     37.29
    16GB        165.25    94.72     49.53     70.17     41.89
    32GB        334.14    188.98    58.11     136.03    48.23
    64GB        659.07    378.50    91.50     281.37    70.39
    128GB       1305.01   -         157.43    706.50    122.19
    256GB       2578.99   -         296.79    1955.53   209.22
    512GB       5180.53   -         586.70    4070.14   387.56
    772GB       7058.54   -         866.25    6820.45   552.35
    1TB         -         -         1145.80   9570.77   717.15

    Figure 14: Execution time of the SELECT query on the different database systems at different data sizes.

    Figure 15: Execution time of the SELECT query on the different systems at small data sizes.

    For small data sizes (Figure 14), MySQL and MySQL Cluster perform better than the MapReduce-based systems when the data is smaller than 4 GB. Once the data size grows to 8 GB, both are outperformed by the MapReduce-based systems. The reason is that MySQL does not process a single query in parallel. MySQL Cluster uses an in-memory database technique that writes all data into memory before database operations start, so it is limited by the size of physical memory. In our experimental environment, MySQL Cluster fails with an out-of-memory error when the data size reaches 64 GB.

    In Figure 15, MySQL hangs when the data size reaches 772 GB. The reason is that a MySQL table can hold only 2^32 records. 772 GB is the largest data size the Sysbench benchmark can generate under this limit. For the MapReduce-based systems, the execution time of HadoopDB grows steeply as the data size increases. The reason is that HadoopDB suffers from the heavy disk I/O workload caused by the multi-stage data preprocessing described in Section 2.3.1.

    SQLMR is always faster than HadoopDB, thanks to the optimizations described in Section 2.3. SQLMR is 2.82 times faster than HadoopDB at a data size of 32 GB and 13.35 times faster at 1 TB. In addition, SQLMR is 1.41 times faster than Hive on average.
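The two HadoopDB speedup figures can be checked against the execution times listed in the Figure 14 table. The snippet is only a worked arithmetic check; the dictionary `TIMES` and the function `speedup` are names introduced here for illustration.

```python
# Execution times copied from the Figure 14 table (same unit for both systems).
TIMES = {
    "32GB": {"HadoopDB": 136.03, "SQLMR": 48.23},
    "1TB": {"HadoopDB": 9570.77, "SQLMR": 717.15},
}


def speedup(size, baseline="HadoopDB", system="SQLMR"):
    """How many times faster `system` is than `baseline` at the given data size."""
    row = TIMES[size]
    return row[baseline] / row[system]
```

Rounded to two decimals, `speedup("32GB")` gives 2.82 and `speedup("1TB")` gives 13.35, matching the figures quoted in the text.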

    2.4.2.2 Varying the system size

    Figure 16: Execution time of the JOIN query on the different database systems at large data sizes.

    Figure 17: Execution time of the SELECT query on the different database systems with a fixed data size per node and a varying number of nodes.

    This set of experiments compares the system scalability of the different database systems with respect to increasing system size. The number of physical nodes varies from 1 to 16. Each physical node hosts four virtual machines (that is, virtual nodes). The data size for each virtual node is fixed at 10 GB. Figure 17 compares the results of the (key range) SELECT queries, and Figure 16 shows the results of the JOIN queries.

    As Figure 17 shows, HadoopDB scales unstably as the system grows, while SQLMR and Hive both scale more stably than HadoopDB. All systems perform worst when the number of virtual nodes (that is, virtual machines) is four. This is because the four virtual machines sit on the same physical node and saturate its resources. As the number of virtual machines grows from 4 to 16, the workload is shared among several physical nodes, which increases parallelism and reduces execution time. When the number of virtual machines grows to 32, the network becomes the bottleneck and execution time increases.

    From Figure 17 we can see that the performance improvement ratio ranges from 4.16 (1 node) to 4.95 (64 nodes). The performance improvement ratio of SQLMR over Hive ranges from 1.67 to 2.05.

    Figure 17 also shows that HadoopDB scales unstably, while SQLMR and Hive both scale more stably than HadoopDB. The main reason for HadoopDB's scaling behaviour is that it uses the local PostgreSQL server on each node as database storage, without the support of the Hadoop distributed storage system. HadoopDB uses the Map function to collect all the queried data and sends all of it to a reducer that computes the final result.

    From Figure 16 we can see that the performance improvement ratio of SQLMR over HadoopDB ranges from 6.03 (2 nodes) to 10.57 (64 nodes). The performance improvement ratio of SQLMR over Hive ranges from 1.65 to 2.24.

    CHAPTER 3: DEVELOPMENT TRENDS OF CLOUD COMPUTING

    - The term cloud computing appeared in mid-2007. Since then it has grown steadily and has been implemented by many large companies around the world: IBM, Sun, Amazon, Google, Microsoft, Yahoo, Salesforce, and others.

    Figure 18: Some companies around the world that implement the cloud computing model

    - Estimate for the next five years: a growth rate of 23.4%, with the global market reaching 74.9 USD.

    Figure 19: Forecast of the growth of cloud computing in the coming years. Source: http://it.marketintelgroup.com