0071610391_chap01-libre.pdf
TRANSCRIPT
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
1Introduction to Data
Warehousing
Informationassetsareimmenselyvaluabletoanyenterprise,andbecauseofthis,theseassetsmustbeproperlystoredandreadilyaccessiblewhentheyareneeded.However,theavailabilityoftoomuchdatamakestheextractionofthemost
importantinformationdifficult,ifnotimpossible.ViewresultsfromanyGooglesearch,andyou’llseethatthedata=informationequationisnotalwayscorrect—thatis,toomuchdataissimplytoomuch.
Datawarehousingisaphenomenonthatgrewfromthehugeamountofelectronicdatastoredinrecentyearsandfromtheurgentneedtousethatdatatoaccomplishgoalsthatgobeyondtheroutinetaskslinkedtodailyprocessing.Inatypicalscenario,alargecorporationhasmanybranches,andseniormanagersneedtoquantifyandevaluatehoweachbranchcontributestotheglobalbusinessperformance.Thecorporatedatabasestoresdetaileddataonthetasksperformedbybranches.Tomeetthemanagers’needs,tailor-madequeriescanbeissuedtoretrievetherequireddata.Inorderforthisprocesstowork,databaseadministratorsmustfirstformulatethedesiredquery(typicallyanaggregateSQLquery)aftercloselystudyingdatabasecatalogs.Thenthequeryisprocessed.Thiscantakeafewhoursbecauseofthehugeamountofdata,thequerycomplexity,andtheconcurrenteffectsofotherregularworkloadqueriesondata.Finally,areportisgeneratedandpassedtoseniormanagersintheformofaspreadsheet.
Manyyearsago,databasedesignersrealizedthatsuchanapproachishardlyfeasible,becauseitisverydemandingintermsoftimeandresources,anditdoesnotalwaysachievethedesiredresults.Moreover,amixofanalyticalquerieswithtransactionalroutinequeriesinevitablyslowsdownthesystem,andthisdoesnotmeettheneedsofusersofeithertypeofquery.Today’sadvanceddatawarehousingprocessesseparateonlineanalyticalprocessing(OLAP)fromonlinetransactionalprocessing(OLTP)bycreatinganewinformationrepositorythatintegratesbasicdatafromvarioussources,properlyarrangesdataformats,andthenmakesdataavailableforanalysisandevaluationaimedatplanninganddecision-makingprocesses(Lechtenbörger,2001).
1
CHAPTER
ch01.indd1 4/21/093:23:27PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
2 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 2 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
Let’sreviewsomefieldsofapplicationforwhichdatawarehousetechnologiesaresuccessfullyused:
• Trade Salesandclaimsanalyses,shipmentandinventorycontrol,customercareandpublicrelations
• Craftsmanship Productioncostcontrol,supplierandordersupport
• Financialservices Riskanalysisandcreditcards,frauddetection
• Transportindustry Vehiclemanagement
• Telecommunicationservices Callflowanalysisandcustomerprofileanalysis
• Healthcareservice Patientadmissionanddischargeanalysisandbookkeepinginaccountsdepartments
Thefieldofapplicationofdatawarehousesystemsisnotonlyrestrictedtoenterprises,butitalsorangesfromepidemiologytodemography,fromnaturalsciencetoeducation.ApropertythatiscommontoallfieldsistheneedforstorageandquerytoolstoretrieveinformationsummarieseasilyandquicklyfromthehugeamountofdatastoredindatabasesormadeavailablebytheInternet.Thiskindofinformationallowsustostudybusinessphenomena,learnaboutmeaningfulcorrelations,andgainusefulknowledgetosupportdecision-makingprocesses.
1.1 Decision Support SystemsUntilthemid-1980s,enterprisedatabasesstoredonlyoperationaldata—datacreatedbybusinessoperationsinvolvedindailymanagementprocesses,suchaspurchasemanagement,salesmanagement,andinvoicing.However,everyenterprisemusthavequick,comprehensiveaccesstotheinformationrequiredbydecision-makingprocesses.ThisstrategicinformationisextractedmainlyfromthehugeamountofoperationaldatastoredinenterprisedatabasesbymeansofaprogressiveselectionandaggregationprocessshowninFigure1-1.
FIGURE 1-1 Information value as a function of quantity
ch01.indd2 4/21/093:23:28PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 3
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 3
Anexponentialincreaseinoperationaldatahasmadecomputerstheonlytoolssuitableforprovidingdatafordecision-makingperformedbybusinessmanagers.Thisfacthasdramaticallyaffectedtheroleofenterprisedatabasesandfosteredtheintroductionofdecisionsupportsystems.Theconceptofdecisionsupportsystemsmainlyevolvedfromtworesearchfields:theoreticalstudiesondecision-makingprocessesfororganizationsandtechnicalresearchoninteractiveITsystems.However,thedecisionsupportsystemconceptisbasedonseveraldisciplines,suchasdatabases,artificialintelligence,man-machineinteraction,andsimulation.Decisionsupportsystemsbecamearesearchfieldinthemid-’70sandbecamemorepopularinthe’80s.
Inpractice,aDSSisanITsystemthathelpsmanagersmakedecisionsorchooseamongdifferentalternatives.Thesystemprovidesvalueestimatesforeachalternative,allowingthemanagertocriticallyreviewtheresults.Table1-1showsapossibleclassificationofDSSsonthebasisoftheirfunctions(Power,2002).
Fromthearchitecturalviewpoint,aDSStypicallyincludesamodel-basedmanagementsystemconnectedtoaknowledgeengineand,ofcourse,aninteractivegraphicaluserinterface(SpragueandCarlson,1982).Datawarehousesystemshavebeenmanagingthedataback-endsofDSSssincethe1990s.Theymustretrieveusefulinformationfromahugeamountofdatastoredonheterogeneousplatforms.Inthisway,decision-makerscanformulatetheirqueriesandconductcomplexanalysesonrelevantinformationwithoutslowingdownoperationalsystems.
TABLE 1-1 Classification of Decision Support Systems
System Description
Passive DSS Supports decision-making processes, but it does not offer explicit suggestions on decisions or solutions.
Active DSS Offers suggestions and solutions.
Collaborative DSS Operates interactively and allows decision-makers to modify, integrate, or refine suggestions given by the system. Suggestions are sent back to the system for validation.
Model-driven DSS Enhances management of statistical, financial, optimization, and simulation models.
Communication-driven DSS Supports a group of people working on a common task.
Data-driven DSS Enhances the access and management of time series of corporate and external data.
Document-driven DSS Manages and processes nonstructured data in many formats.
Knowledge-driven DSS Provides problem-solving features in the form of facts, rules, and procedures.
Decision Support SystemAdecisionsupportsystem(DSS)isasetofexpandable,interactiveITtechniquesandtoolsdesignedforprocessingandanalyzingdataandforsupportingmanagersindecisionmaking.Todothis,thesystemmatchesindividualresourcesofmanagerswithcomputerresourcestoimprovethequalityofthedecisionsmade.
ch01.indd3 4/21/093:23:28PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
4 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 4 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
1.2 Data WarehousingDatawarehousesystemsareprobablythesystemstowhichacademiccommunitiesandindustrialbodieshavebeenpayingthegreatestattentionamongalltheDSSs.Datawarehousingcanbeinformallydefinedasfollows:
Thedefinitionofdatawarehousingpresentedhereisintentionallygeneric;itgivesyouanideaoftheprocessbutdoesnotincludespecificfeaturesoftheprocess.Tounderstandtheroleandtheusefulpropertiesofdatawarehousingcompletely,youmustfirstunderstandtheneedsthatbroughtitintobeing.In1996,R.Kimballefficientlysummedupafewclaimsfrequentlysubmittedbyendusersofclassicinformationsystems:
• “Wehaveheapsofdata,butwecannotaccessit!”Thisshowsthefrustrationofthosewhoareresponsibleforthefutureoftheirenterprisesbuthavenotechnicaltoolstohelpthemextracttherequiredinformationinaproperformat.
• “Howcanpeopleplayingthesameroleachievesubstantiallydifferentresults?”Inmidsizetolargeenterprises,manydatabasesareusuallyavailable,eachdevotedtoaspecificbusinessarea.Theyareoftenstoredondifferentlogicalandphysicalmediathatarenotconceptuallyintegrated.Forthisreason,theresultsachievedineverybusinessareaarelikelytobeinconsistent.
• “Wewanttoselect,group,andmanipulatedataineverypossibleway!”Decision-makingprocessescannotalwaysbeplannedbeforethedecisionsaremade.Endusersneedatoolthatisuser-friendlyandflexibleenoughtoconductadhocanalyses.Theywanttochoosewhichnewcorrelationstheyneedtosearchforinrealtimeastheyanalyzetheinformationretrieved.
• “Showmejustwhatmatters!”Examiningdataatthemaximumlevelofdetailisnotonlyuselessfordecision-makingprocesses,butisalsoself-defeating,becauseitdoesnotallowuserstofocustheirattentiononmeaningfulinformation.
• “Everyoneknowsthatsomedataiswrong!”Thisisanothersorepoint.Anappreciablepercentageoftransactionaldataisnotcorrect—oritisunavailable.Itisclearthatyoucannotachievegoodresultsifyoubaseyouranalysesonincorrectorincompletedata.
Wecanusethepreviouslistofproblemsanddifficultiestoextractalistofkeywordsthatbecomedistinguishingmarksandessentialrequirementsforadatawarehouseprocess,asetoftasksthatallowustoturnoperationaldataintodecision-makingsupportinformation:
• accessibilitytousersnotveryfamiliarwithITanddatastructures;
• integrationofdataonthebasisofastandardenterprisemodel;
• queryflexibilitytomaximizetheadvantagesobtainedfromtheexistinginformation;
Data WarehousingDatawarehousingisacollectionofmethods,techniques,andtoolsusedtosupportknowledgeworkers—seniormanagers,directors,managers,andanalysts—toconductdataanalysesthathelpwithperformingdecision-makingprocessesandimprovinginformationresources.
ch01.indd4 4/21/093:23:28PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 5
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 5
• informationconcisenessallowingfortarget-orientedandeffectiveanalyses;
• multidimensionalrepresentationgivingusersanintuitiveandmanageableviewofinformation;
• correctnessandcompletenessofintegrateddata.
Datawarehousesareplacedrightinthemiddleofthisprocessandactasrepositoriesfordata.Theymakesurethattherequirementssetcanbefulfilled.
Datawarehousesaresubject-orientedbecausetheyhingeonenterprise-specificconcepts,suchascustomers,products,sales,andorders.Onthecontrary,operationaldatabaseshingeonmanydifferententerprise-specificapplications.
Weputemphasisonintegrationandconsistencybecausedatawarehousestakeadvantageofmultipledatasources,suchasdataextractedfromproductionandthenstoredtoenterprisedatabases,orevendatafromathirdparty’sinformationsystems.Adatawarehouseshouldprovideaunifiedviewofallthedata.Generallyspeaking,wecanstatethatcreatingadatawarehousesystemdoesnotrequirethatnewinformationbeadded;rather,existinginformationneedsrearranging.Thisimplicitlymeansthataninformationsystemshouldbepreviouslyavailable.
Operationaldatausuallycoversashortperiodoftime,becausemosttransactionsinvolvethelatestdata.Adatawarehouseshouldenableanalysesthatinsteadcoverafewyears.Forthisreason,datawarehousesareregularlyupdatedfromoperationaldataandkeepongrowing.Ifdatawerevisuallyrepresented,itmightprogresslikeso:Aphotographofoperationaldatawouldbemadeatregularintervals.Thesequenceofphotographswouldbestoredtoadatawarehouse,andresultswouldbeshowninamoviethatrevealsthestatusofanenterprisefromitsfoundationuntilpresent.
Fundamentally,dataisneverdeletedfromdatawarehousesandupdatesarenormallycarriedoutwhendatawarehousesareoffline.Thismeansthatdatawarehousescanbeessentiallyviewedasread-onlydatabases.Thissatisfiestheusers’needforashortanalysisqueryresponsetimeandhasotherimportanteffects.First,itaffectsdatawarehouse–specificdatabasemanagementsystem(DBMS)technologies,becausethereisnoneedforadvancedtransactionmanagementtechniquesrequiredbyoperationalapplications.Second,datawarehousesoperateinread-onlymode,sodatawarehouse–specificlogicaldesignsolutionsarecompletelydifferentfromthoseusedforoperationaldatabases.Forinstance,themostobviousfeatureofdatawarehouserelationalimplementationsisthattablenormalizationcanbegivenuptopartiallydenormalizetablesandimproveperformance.
Otherdifferencesbetweenoperationaldatabasesanddatawarehousesareconnectedwithquerytypes.Operationalqueriesexecutetransactionsthatgenerallyread/writea
Data WarehouseAdatawarehouseisacollectionofdatathatsupportsdecision-makingprocesses.Itprovidesthefollowingfeatures(Inmon,2005):
• Itissubject-oriented.
• Itisintegratedandconsistent.
• Itshowsitsevolutionovertimeanditisnotvolatile.
ch01.indd5 4/21/093:23:29PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
6 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 6 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
smallnumberoftuplesfrom/tomanytablesconnectedbysimplerelations.Forexample,thisappliesifyousearchforthedataofacustomerinordertoinsertanewcustomerorder.ThiskindofqueryisanOLTPquery.Onthecontrary,thetypeofqueryrequiredindatawarehousesisOLAP.Itfeaturesdynamic,multidimensionalanalysesthatneedtoscanahugeamountofrecordstoprocessasetofnumericdatasumminguptheperformanceofanenterprise.ItisimportanttonotethatOLTPsystemshaveanessentialworkloadcore“frozen”inapplicationprograms,andadhocdataqueriesareoccasionallyrunfordatamaintenance.Conversely,datawarehouseinteractivityisanessentialpropertyforanalysissessions,sotheactualworkloadconstantlychangesastimegoesby.
ThedistinctivefeaturesofOLAPqueriessuggestadoptionofamultidimensionalrepresentationfordatawarehousedata.Basically,dataisviewedaspointsinspace,whosedimensionscorrespondtomanypossibleanalysisdimensions.Eachpointrepresentsaneventthatoccursinanenterpriseandisdescribedbyasetofmeasuresrelevanttodecision-makingprocesses.Section1.5givesadetaileddescriptionofthemultidimensionalmodelyouabsolutelyneedtobefamiliarwithtounderstandhowtomodelconceptualandlogicallevelsofadatawarehouseandhowtoquerydatawarehouses.
Table1-2summarizesthemaindifferencesbetweenoperationaldatabasesanddatawarehouses.
NOTE Forfurtherdetailsonthedifferentissuesrelatedtothedatawarehouseprocess,refertoChaudhuriandDayal,1997;Inmon,2005;Jarkeetal.,2000;Kelly,1997;Kimball,1996;Mattison,2006;andWrembelandKoncilia,2007.
TABLE 1-2 Differences Between Operational Databases and Data Warehouses (Kelly, 1997)
Feature Operational Databases Data Warehouses
Users Thousands Hundreds
Workload Preset transactions Specific analysis queries
Access To hundreds of records, write and read mode
To millions of records, mainly read-only mode
Goal Depends on applications Decision-making support
Data Detailed, both numeric and alphanumeric
Summed up, mainly numeric
Data integration Application-based Subject-based
Quality In terms of integrity In terms of consistency
Time coverage Current data only Current and historical data
Updates Continuous Periodical
Model Normalized Denormalized, multidimensional
Optimization For OLTP access to a database part
For OLAP access to most of the database
ch01.indd6 4/21/093:23:29PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 7
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 7
1.3 Data Warehouse ArchitecturesThefollowingarchitecturepropertiesareessentialforadatawarehousesystem(Kelly,1997):
• Separation Analyticalandtransactionalprocessingshouldbekeptapartasmuchaspossible.
• Scalability Hardwareandsoftwarearchitecturesshouldbeeasytoupgradeasthedatavolume,whichhastobemanagedandprocessed,andthenumberofusers’requirements,whichhavetobemet,progressivelyincrease.
• Extensibility Thearchitectureshouldbeabletohostnewapplicationsandtechnologieswithoutredesigningthewholesystem.
• Security Monitoringaccessesisessentialbecauseofthestrategicdatastoredindatawarehouses.
• Administerability Datawarehousemanagementshouldnotbeoverlydifficult.
Twodifferentclassificationsarecommonlyadoptedfordatawarehousearchitectures.Thefirstclassification,describedinsections1.3.1,1.3.2,and1.3.3,isastructure-orientedonethatdependsonthenumberoflayersusedbythearchitecture.Thesecondclassification,describedinsection1.3.4,dependsonhowthedifferentlayersareemployedtocreateenterprise-orientedordepartment-orientedviewsofdatawarehouses.
1.3.1 Single-Layer ArchitectureAsingle-layerarchitectureisnotfrequentlyusedinpractice.Itsgoalistominimizetheamountofdatastored;toreachthisgoal,itremovesdataredundancies.Figure1-2showstheonlylayerphysicallyavailable:thesourcelayer.Inthiscase,datawarehousesarevirtual.
FIGURE 1-2 Single-layer architecture for a data warehouse system
ch01.indd7 4/21/093:23:29PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
8 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 8 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
Thismeansthatadatawarehouseisimplementedasamultidimensionalviewofoperationaldatacreatedbyspecificmiddleware,oranintermediateprocessinglayer(Devlin,1997).
Theweaknessofthisarchitectureliesinitsfailuretomeettherequirementforseparationbetweenanalyticalandtransactionalprocessing.Analysisqueriesaresubmittedtooperationaldataafterthemiddlewareinterpretsthem.Itthisway,thequeriesaffectregulartransactionalworkloads.Inaddition,althoughthisarchitecturecanmeettherequirementforintegrationandcorrectnessofdata,itcannotlogmoredatathansourcesdo.Forthesereasons,avirtualapproachtodatawarehousescanbesuccessfulonlyifanalysisneedsareparticularlyrestrictedandthedatavolumetoanalyzeishuge.
1.3.2 Two-Layer ArchitectureTherequirementforseparationplaysafundamentalroleindefiningthetypicalarchitectureforadatawarehousesystem,asshowninFigure1-3.Althoughitistypicallycalledatwo-layerarchitecturetohighlightaseparationbetweenphysicallyavailablesourcesanddatawarehouses,itactuallyconsistsoffoursubsequentdataflowstages(Lechtenbörger,2001):
1. Sourcelayer Adatawarehousesystemusesheterogeneoussourcesofdata.Thatdataisoriginallystoredtocorporaterelationaldatabasesorlegacy1databases,oritmaycomefrominformationsystemsoutsidethecorporatewalls.
2. Datastaging Thedatastoredtosourcesshouldbeextracted,cleansedtoremoveinconsistenciesandfillgaps,andintegratedtomergeheterogeneoussourcesintoonecommonschema.Theso-calledExtraction,Transformation,andLoadingtools(ETL)canmergeheterogeneousschemata,extract,transform,cleanse,validate,filter,andloadsourcedataintoadatawarehouse(Jarkeetal.,2000).Technologicallyspeaking,thisstagedealswithproblemsthataretypicalfordistributedinformationsystems,suchasinconsistentdatamanagementandincompatibledatastructures(Zhugeetal.,1996).Section1.4dealswithafewpointsthatarerelevanttodatastaging.
3. Datawarehouselayer Informationisstoredtoonelogicallycentralizedsinglerepository:adatawarehouse.Thedatawarehousecanbedirectlyaccessed,butitcanalsobeusedasasourceforcreatingdatamarts,whichpartiallyreplicatedatawarehousecontentsandaredesignedforspecificenterprisedepartments.Meta-datarepositories(section1.6)storeinformationonsources,accessprocedures,datastaging,users,datamartschemata,andsoon.
4. Analysis Inthislayer,integrateddataisefficientlyandflexiblyaccessedtoissuereports,dynamicallyanalyzeinformation,andsimulatehypotheticalbusinessscenarios.Technologicallyspeaking,itshouldfeatureaggregatedatanavigators,complexqueryoptimizers,anduser-friendlyGUIs.Section1.7dealswithdifferenttypesofdecision-makingsupportanalyses.
Thearchitecturaldifferencebetweendatawarehousesanddatamartsneedstobestudiedcloser.ThecomponentmarkedasadatawarehouseinFigure1-3isalsooftencalledtheprimarydatawarehouseorcorporatedatawarehouse.Itactsasacentralizedstoragesystemfor
1Thetermlegacysystemdenotescorporateapplications,typicallyrunningonmainframesorminicomputers,that are currently used for operational tasks but do not meet modern architectural principles and currentstandards.Forthisreason,accessinglegacysystemsandintegratingthemwithmorerecentapplicationsisacomplextask.Allapplicationsthatuseanonrelationaldatabaseareexamplesoflegacysystems.
ch01.indd8 4/21/093:23:30PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 9
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 9
allthedatabeingsummedup.Datamartscanbeviewedassmall,localdatawarehousesreplicating(andsummingupasmuchaspossible)thepartofaprimarydatawarehouserequiredforaspecificapplicationdomain.
Thedatamartspopulatedfromaprimarydatawarehouseareoftencalleddependent.Althoughdatamartsarenotstrictlynecessary,theyareveryusefulfordatawarehousesystemsinmidsizetolargeenterprisesbecause
• theyareusedasbuildingblockswhileincrementallydevelopingdatawarehouses;
• theymarkouttheinformationrequiredbyaspecificgroupofuserstosolvequeries;
• theycandeliverbetterperformancebecausetheyaresmallerthanprimarydatawarehouses.
Data MartsAdatamartisasubsetoranaggregationofthedatastoredtoaprimarydatawarehouse.Itincludesasetofinformationpiecesrelevanttoaspecificbusinessarea,corporatedepartment,orcategoryofusers.
FIGURE 1-3 Two-layer architecture for a data warehouse system
ch01.indd9 4/21/093:23:30PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
10 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 10 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
Sometimes,mainlyfororganizationandpolicypurposes,youshoulduseadifferentarchitectureinwhichsourcesareusedtodirectlypopulatedatamarts.Thesedatamartsarecalledindependent(seesection1.3.4).Ifthereisnoprimarydatawarehouse,thisstreamlinesthedesignprocess,butitleadstotheriskofinconsistenciesbetweendatamarts.Toavoidtheseproblems,youcancreateaprimarydatawarehouseandstillhaveindependentdatamarts.Incomparisonwiththestandardtwo-layerarchitectureofFigure1-3,therolesofdatamartsanddatawarehousesareactuallyinverted.Inthiscase,thedatawarehouseispopulatedfromitsdatamarts,anditcanbedirectlyqueriedtomakeaccesspatternsaseasyaspossible.
Thefollowinglistsumsupallthebenefitsofatwo-layerarchitecture,inwhichadatawarehouseseparatessourcesfromanalysisapplications(Jarkeetal.,2000;Lechtenbörger,2001):
• Indatawarehousesystems,goodqualityinformationisalwaysavailable,evenwhenaccesstosourcesisdeniedtemporarilyfortechnicalororganizationalreasons.
• Datawarehouseanalysisqueriesdonotaffectthemanagementoftransactions,thereliabilityofwhichisvitalforenterprisestoworkproperlyatanoperationallevel.
• Datawarehousesarelogicallystructuredaccordingtothemultidimensionalmodel,whileoperationalsourcesaregenerallybasedonrelationalorsemi-structuredmodels.
• AmismatchintermsoftimeandgranularityoccursbetweenOLTPsystems,whichmanagecurrentdataatamaximumlevelofdetail,andOLAPsystems,whichmanagehistoricalandsummarizeddata.
• Datawarehousescanusespecificdesignsolutionsaimedatperformanceoptimizationofanalysisandreportapplications.
NOTE Afewauthorsusethesameterminologytodefinedifferentconcepts.Inparticular,thoseauthorsconsideradatawarehouseasarepositoryofintegratedandconsistent,yetoperational,data,whiletheyuseamultidimensionalrepresentationofdataonlyindatamarts.Accordingtoourterminology,this“operationalview”ofdatawarehousesessentiallycorrespondstothereconcileddatalayerinthree-layerarchitectures.
1.3.3 Three-Layer ArchitectureInthisarchitecture,thethirdlayeristhereconcileddatalayeroroperationaldatastore.Thislayermaterializesoperationaldataobtainedafterintegratingandcleansingsourcedata.Asaresult,thosedataareintegrated,consistent,correct,current,anddetailed.Figure1-4showsadatawarehousethatisnotpopulatedfromitssourcesdirectly,butfromreconcileddata.
Themainadvantageofthereconcileddatalayeristhatitcreatesacommonreferencedatamodelforawholeenterprise.Atthesametime,itsharplyseparatestheproblemsofsourcedataextractionandintegrationfromthoseofdatawarehousepopulation.Remarkably,insomecases,thereconciledlayerisalsodirectlyusedtobetteraccomplishsomeoperationaltasks,suchasproducingdailyreportsthatcannotbesatisfactorilypreparedusingthecorporateapplications,orgeneratingdataflowstofeedexternalprocessesperiodicallysoastobenefitfromcleaningandintegration.However,reconcileddataleadstomoreredundancyofoperationalsourcedata.Notethatwemayassumethateventwo-layerarchitecturescanhaveareconciledlayerthatisnotspecificallymaterialized,butonlyvirtual,becauseitisdefinedasaconsistentintegratedviewofoperationalsourcedata.
ch01.indd10 4/21/093:23:31PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 11
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 11
Finally,let’sconsiderasupplementaryarchitecturalapproach,whichprovidesacomprehensivepicture.Thisapproachcanbedescribedasahybridsolutionbetweenthesingle-layerarchitectureandthetwo/three-layerarchitecture.Thisapproachassumesthatalthoughadatawarehouseisavailable,itisunabletosolveallthequeriesformulated.Thismeansthatusersmaybeinterestedindirectlyaccessingsourcedatafromaggregatedata(drill-through).Toreachthisgoal,somequerieshavetoberewrittenonthebasisofsourcedata(orreconcileddataifitisavailable).ThistypeofarchitectureisimplementedinaprototypebyCuiandWidom,2000,anditneedstobeabletogodynamicallybacktothesourcedatarequiredforqueriestobesolved(lineage).
FIGURE 1-4 Three-layer architecture for a data warehouse system
ch01.indd11 4/21/093:23:31PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
12 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 12 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
NOTE Gupta,1997a;HullandZhou,1996;andYangetal.,1997discusstheimplicationsofthisapproachfromtheviewpointofperformanceoptimization,andinparticularviewmaterialization.
1.3.4 An Additional Architecture ClassificationThescientificliteratureoftendistinguishesfivetypesofarchitecturefordatawarehousesystems,inwhichthesamebasiclayersmentionedintheprecedingparagraphsarecombinedindifferentways(Rizzi,2008).
Inindependentdatamartsarchitecture,differentdatamartsareseparatelydesignedandbuiltinanonintegratedfashion(Figure1-5).Thisarchitecturecanbeinitiallyadoptedintheabsenceofastrongsponsorshiptowardanenterprise-widewarehousingproject,orwhentheorganizationaldivisionsthatmakeupthecompanyarelooselycoupled.However,ittendstobesoonreplacedbyotherarchitecturesthatbetterachievedataintegrationandcross-reporting.
Thebusarchitecture,recommendedbyRalphKimball,isapparentlysimilartotheprecedingarchitecture,withoneimportantdifference.Abasicsetofconformeddimensions(thatis,analysisdimensionsthatpreservethesamemeaningthroughoutallthefactstheybelongto),derivedbyacarefulanalysisofthemainenterpriseprocesses,isadoptedandsharedasacommondesignguideline.Thisensureslogicalintegrationofdatamartsandanenterprise-wideviewofinformation.
Inthehub-and-spokearchitecture,oneofthemostusedinmediumtolargecontexts,thereismuchattentiontoscalabilityandextensibility,andtoachievinganenterprise-wideview
FIGURE 1-5 Independent data marts architecture
ch01.indd12 4/21/093:23:32PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 13
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 13
ofinformation.Atomic,normalizeddataisstoredinareconciledlayerthatfeedsasetofdatamartscontainingsummarizeddatainmultidimensionalform(Figure1-6).Usersmainlyaccessthedatamarts,buttheymayoccasionallyquerythereconciledlayer.
Thecentralizedarchitecture,recommendedbyBillInmon,canbeseenasaparticularimplementationofthehub-and-spokearchitecture,wherethereconciledlayerandthedatamartsarecollapsedintoasinglephysicalrepository.
Thefederatedarchitectureissometimesadoptedindynamiccontextswherepreexistingdatawarehouses/datamartsaretobenoninvasivelyintegratedtoprovideasingle,cross-organizationdecisionsupportenvironment(forinstance,inthecaseofmergersandacquisitions).Eachdatawarehouse/datamartiseithervirtuallyorphysicallyintegratedwiththeothers,leaningonavarietyofadvancedtechniquessuchasdistributedquerying,ontologies,andmeta-datainteroperability(Figure1-7).
FIGURE 1-6 Hub-and-spoke architecture
ch01.indd13 4/21/093:23:32PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
14 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 14 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
Thefollowinglistincludesthefactorsthatareparticularlyinfluentialwhenitcomestochoosingoneofthesearchitectures:
• Theamountofinterdependentinformationexchangedbetweenorganizationalunitsinanenterpriseandtheorganizationalroleplayedbythedatawarehouseprojectsponsormayleadtotheimplementationofenterprise-widearchitectures,suchasbusarchitectures,ordepartment-specificarchitectures,suchasindependentdatamarts.
• Anurgentneedforadatawarehouseproject,restrictionsoneconomicandhumanresources,aswellaspoorITstaffskillsmaysuggestthatatypeof“quick”architecture,suchasindependentdatamarts,shouldbeimplemented.
• Theminorroleplayedbyadatawarehouseprojectinenterprisestrategiescanmakeyoupreferanarchitecturetypebasedonindependentdatamartsoverahub-and-spokearchitecturetype.
• Thefrequentneedforintegratingpreexistingdatawarehouses,possiblydeployedonheterogeneousplatforms,andthepressingdemandforuniformlyaccessingtheirdatacanrequireafederatedarchitecturetype.
FIGURE 1-7 Federated architecture
ch01.indd14 4/21/093:23:32PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 15
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 15
1.4 Data Staging and ETLNowlet’scloselystudysomebasicfeaturesofthedifferentarchitecturelayers.Wewillstartwiththedatastaginglayer.
ThedatastaginglayerhoststheETLprocessesthatextract,integrate,andcleandatafromoperationalsourcestofeedthedatawarehouselayer.Inathree-layerarchitecture,ETLprocessesactuallyfeedthereconcileddatalayer—asingle,detailed,comprehensive,top-qualitydatasource—thatinitsturnfeedsthedatawarehouse.Forthisreason,theETLprocessoperationsasawholeareoftendefinedasreconciliation.Thesearealsothemostcomplexandtechnicallychallengingamongallthedatawarehouseprocessphases.
ETLtakesplaceoncewhenadatawarehouseispopulatedforthefirsttime,thenitoccurseverytimethedatawarehouseisregularlyupdated.Figure1-8showsthatETLconsistsoffourseparatephases:extraction(orcapture),cleansing(orcleaningorscrubbing),transformation,andloading.Inthefollowingsections,weofferbriefdescriptionsofthesephases.
NOTE RefertoJarkeetal.,2000;Hofferetal.,2005;KimballandCaserta,2004;andEnglish,1999formoredetailsonETL.
Thescientificliteratureshowsthattheboundariesbetweencleansingandtransformingareoftenblurredfromtheterminologicalviewpoint.Forthisreason,aspecificoperationisnotalwaysclearlyassignedtooneofthesephases.Thisisobviouslyaformalproblem,butnotasubstantialone.WewilladopttheapproachusedbyHofferandothers(2005)tomakeourexplanationsasclearaspossible.Theirapproachstatesthatcleansingisessentiallyaimedatrectifyingdatavalues,andtransformationmorespecificallymanagesdataformats.
Chapter10discussesallthedetailsofthedata-stagingdesignphase.Chapter3dealswithanearlydatawarehousedesignphase:integration.Thisphaseisnecessaryifthereareheterogeneoussourcestodefineaschemaforthereconcileddatalayer,andtospecificallytransformoperationaldatainthedata-stagingphase.
1.4.1 ExtractionRelevantdataisobtainedfromsourcesintheextractionphase.Youcanusestaticextractionwhenadatawarehouseneedspopulatingforthefirsttime.Conceptuallyspeaking,thislookslikeasnapshotofoperationaldata.Incrementalextraction,usedtoupdatedatawarehousesregularly,seizesthechangesappliedtosourcedatasincethelatestextraction.IncrementalextractionisoftenbasedonthelogmaintainedbytheoperationalDBMS.Ifatimestampisassociatedwithoperationaldatatorecordexactlywhenthedataischangedoradded,itcanbeusedtostreamlinetheextractionprocess.Extractioncanalsobesource-drivenifyoucanrewriteoperationalapplicationstoasynchronouslynotifyofthechangesbeingapplied,orifyouroperationaldatabasecanimplementtriggersassociatedwithchangetransactionsforrelevantdata.
Thedatatobeextractedismainlyselectedonthebasisofitsquality(English,1999).Inparticular,thisdependsonhowcomprehensiveandaccuratetheconstraintsimplementedinsourcesare,howsuitablethedataformatsare,andhowcleartheschemataare.
ch01.indd15 4/21/093:23:33PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
16 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 16 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
1.4.2 CleansingThecleansingphaseiscrucialinadatawarehousesystembecauseitissupposedtoimprovedataquality—normallyquitepoorinsources(Galhardasetal.,2001).Thefollowinglistincludesthemostfrequentmistakesandinconsistenciesthatmakedata“dirty”:
• Duplicatedata Forexample,apatientisrecordedmanytimesinahospitalpatientmanagementsystem
FIGURE 1-8 Extraction, transformation, and loading
ch01.indd16 4/21/093:23:33PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 17
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 17
• Inconsistentvaluesthatarelogicallyassociated SuchasaddressesandZIPcodes
• Missingdata Suchasacustomer’sjob
• Unexpecteduseoffields Forexample,asocialSecurityNumberfieldcouldbeusedimproperlytostoreofficephonenumbers
• Impossibleorwrongvalues Suchas2/30/2009
• Inconsistentvaluesforasingleentitybecausedifferentpracticeswereused Forexample,tospecifyacountry,youcanuseaninternationalcountryabbreviation(I)orafullcountryname(Italy);similarproblemsarisewithaddresses(HamletRd.andHamletRoad)
• Inconsistentvaluesforoneindividualentitybecauseoftypingmistakes SuchasHametRoadinsteadofHamletRoad
Inparticular,notethatthelasttwotypesofmistakesareveryfrequentwhenyouaremanagingmultiplesourcesandareenteringdatamanually.
ThemaindatacleansingfeaturesfoundinETLtoolsarerectificationandhomogenization.Theyusespecificdictionariestorectifytypingmistakesandtorecognizesynonyms,aswellasrule-basedcleansingtoenforcedomain-specificrulesanddefineappropriateassociationsbetweenvalues.Seesection10.2formoredetailsonthesepoints.
1.4.3 TransformationTransformationisthecoreofthereconciliationphase.Itconvertsdatafromitsoperationalsourceformatintoaspecificdatawarehouseformat.Ifyouimplementathree-layerarchitecture,thisphaseoutputsyourreconcileddatalayer.Independentlyofthepresenceofareconcileddatalayer,establishingamappingbetweenthesourcedatalayerandthedatawarehouselayerisgenerallymadedifficultbythepresenceofmanydifferent,heterogeneoussources.Ifthisisthecase,acomplexintegrationphaseisrequiredwhendesigningyourdatawarehouse.SeeChapter3formoredetails.
Thefollowingpointsmustberectifiedinthisphase:
• Loosetextsmayhidevaluableinformation.Forexample,BigDealLtDdoesnotexplicitlyshowthatthisisaLimitedPartnershipcompany.
• Differentformatscanbeusedforindividualdata.Forexample,adatecanbesavedasastringorasthreeintegers.
Followingarethemaintransformationprocessesaimedatpopulatingthereconcileddatalayer:
• Conversionandnormalizationthatoperateonbothstorageformatsandunitsofmeasuretomakedatauniform
• Matchingthatassociatesequivalentfieldsindifferentsources
• Selectionthatreducesthenumberofsourcefieldsandrecords
Whenpopulatingadatawarehouse,normalizationisreplacedbydenormalizationbecausedatawarehousedataaretypicallydenormalized,andyouneedaggregationtosumupdataproperly.
ch01.indd17 4/21/093:23:33PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
18 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 18 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
CleansingandtransformationprocessesareoftencloselyconnectedinETLtools.Figure1-9showsanexampleofcleansingandtransformationofcustomerdata:afield-basedstructureisextractedfromaloosetext,thenafewvaluesarestandardizedsoastoremoveabbreviations,andeventuallythosevaluesthatarelogicallyassociatedcanberectified.
1.4.4 LoadingLoadingintoadatawarehouseisthelaststeptotake.Loadingcanbecarriedoutintwoways:
• Refresh Datawarehousedataiscompletelyrewritten.Thismeansthatolderdataisreplaced.Refreshisnormallyusedincombinationwithstaticextractiontoinitiallypopulateadatawarehouse.
• Update Onlythosechangesappliedtosourcedataareaddedtothedatawarehouse.Updateistypicallycarriedoutwithoutdeletingormodifyingpreexistingdata.Thistechniqueisusedincombinationwithincrementalextractiontoupdatedatawarehousesregularly.
1.5 Multidimensional ModelThedatawarehouselayerisavitallyimportantpartofthisbook.Here,weintroduceadatawarehousekeyword:multidimensional.Youneedtobecomefamiliarwiththeconceptsandterminologyusedheretounderstandtheinformationpresentedthroughoutthisbook,particularlyinformationregardingconceptualandlogicalmodelinganddesigning.
FIGURE 1-9 Example of cleansing and transforming customer data
ch01.indd18 4/21/093:23:34PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 19
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 19
Overthelastfewyears,multidimensionaldatabaseshavegeneratedmuchresearchandmarketinterestbecausetheyarefundamentalformanydecision-makingsupportapplications,suchasdatawarehousesystems.ThereasonwhythemultidimensionalmodelisusedasaparadigmofdatawarehousedatarepresentationisfundamentallyconnectedtoitseaseofuseandintuitivenessevenforITnewbies.Themultidimensionalmodel’ssuccessisalsolinkedtothewidespreaduseofproductivitytools,suchasspreadsheets,thatadoptthemultidimensionalmodelasavisualizationparadigm.
Perhapsthebeststartingpointtoapproachthemultidimensionalmodeleffectivelyisadefinitionofthetypesofqueriesforwhichthismodelisbestsuited.Section1.7offersmoredetailsontypicaldecision-makingqueriessuchasthoselistedhere(Jarkeetal.,2000):
• “Whatisthetotalamountofreceiptsrecordedlastyearperstateandperproductcategory?”
• “WhatistherelationshipbetweenthetrendofPCmanufacturers’sharesandquartergainsoverthelastfiveyears?”
• “Whichordersmaximizereceipts?”
• “Whichoneoftwonewtreatmentswillresultinadecreaseintheaverageperiodofadmission?”
• “Whatistherelationshipbetweenprofitgainedbytheshipmentsconsistingoflessthan10itemsandtheprofitgainedbytheshipmentsofmorethan10items?”
Itisclearthatusingtraditionallanguages,suchasSQL,toexpressthesetypesofqueriescanbeaverydifficulttaskforinexperiencedusers.Itisalsoclearthatrunningthesetypesofqueriesagainstoperationaldatabaseswouldresultinanunacceptablylongresponsetime.
Themultidimensionalmodelbeginswiththeobservationthatthefactorsaffectingdecision-makingprocessesareenterprise-specificfacts,suchassales,shipments,hospitaladmissions,surgeries,andsoon.Instancesofafactcorrespondtoeventsthatoccurred.Forexample,everysinglesaleorshipmentcarriedoutisanevent.Eachfactisdescribedbythevaluesofasetofrelevantmeasuresthatprovideaquantitativedescriptionofevents.Forexample,salesreceipts,amountsshipped,hospitaladmissioncosts,andsurgerytimearemeasures.
Obviously,ahugenumberofeventsoccurintypicalenterprises—toomanytoanalyzeonebyone.Imagineplacingthemallintoann-dimensionalspacetohelpusquicklyselectandsortthemout.Then-dimensionalspaceaxesarecalledanalysisdimensions,andtheydefinedifferentperspectivestosingleoutevents.Forexample,thesalesinastorechaincanberepresentedinathree-dimensionalspacewhosedimensionsareproducts,stores,anddates.Asfarasshipmentsareconcerned,products,shipmentdates,orders,destinations,andterms&conditionscanbeusedasdimensions.Hospitaladmissionscanbedefinedbythedepartment-date-patientcombination,andyouwouldneedtoaddthetypeofoperationtoclassifysurgeryoperations.
Theconceptofdimensiongavelifetothebroadlyusedmetaphorofcubestorepresentmultidimensionaldata.Accordingtothismetaphor,eventsareassociatedwithcubecellsandcubeedgesstandforanalysisdimensions.Ifmorethanthreedimensionsexist,thecubeiscalledahypercube.Eachcubecellisgivenavalueforeachmeasure.Figure1-10showsanintuitiverepresentationofacubeinwhichthefactisasaleinastorechain.Itsanalysisdimensionsarestore,productanddate.Aneventstandsforaspecificitemsoldinaspecificstoreonaspecificdate,anditisdescribedbytwomeasures:thequantitysoldandthereceipts.Thisfigurehighlightsthatthecubeissparse—thismeansthatmanyeventsdidnotactuallytakeplace.Ofcourse,youcannotselleveryitemeverydayineverystore.
ch01.indd19 4/21/093:23:34PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
20 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 20 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
Ifyouwanttousetherelationalmodeltorepresentthiscube,youcouldusethefollowingrelationalschema:
SALES(store,product,date,quantity,receipts)
Here,theunderlinedattributesmakeuptheprimarykeyandeventsareassociatedwithtuples,suchas<'EverMore','Shiny','04/05/08',10,25>.Theconstraintexpressedbythisprimarykeyspecifiesthattwoeventscannotbeassociatedwithanindividualstore,product,anddatevaluecombination,andthateveryvaluecombinationfunctionallydeterminesauniquevalueforquantityandauniquevalueforreceipts.Thismeansthatthefollowingfunctionaldependency2holds:
store,product,date→quantity,receipts
Toavoidanymisunderstandingofthetermevent,youshouldrealizethatthegroupofdimensionsselectedforafactrepresentationsinglesoutauniqueeventinthemultidimensionalmodel,butthegroupdoesnotnecessarilysingleoutauniqueeventintheapplicationdomain.Tomakethisstatementclearer,consideronceagainthesalesexample.Intheapplicationdomain,onesinglesaleseventissupposedtobeacustomer’spurchaseofasetofproductsfromastoreonaspecificdate.Inpractice,thiscorrespondstoasalesreceipt.Fromtheviewpointofthemultidimensionalmodel,ifthesalesfacthastheproduct,store,anddatedimensions,aneventwillbethedailytotalamountofanitemsoldinastore.Itisclearthatthedifferencebetweenbothinterpretationsdependsonsales
FIGURE 1-10 The three-dimensional cube modeling sales in a store chain: 10 packs of Shiny were sold on 4/5/2008 in the EverMore store, totaling $25.
2The definition of functional dependency belongs to relational theory. Given relation schema R and twoattributesetsX={a1...,an}andY={b1...,bm},XissaidtofunctionallydetermineY(X→Y)ifandonlyif,foreverylegalinstancerofRandforeachpairoftuplest1,t2inr,t1[X]= t2[X]impliest1[Y]= t2[Y].Heret[X/Y]denotesthevaluestakenintfromtheattributesinX/Y.Byextension,wesaythatafunctionaldependencyholdsbetweentwoattributesetsXandYwheneachvaluesetofXalwayscorrespondstoasinglevaluesetofY.Tosimplifythenotation,whenwedenotetheattributesineachset,wedropthebraces.
ch01.indd20 4/21/093:23:35PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 21
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 21
receiptsthatgenerallyincludevariousitems,andonindividualitemsthataregenerallysoldmanytimeseverydayinastore.Inthefollowingsections,weusethetermseventandfacttomakereferencetothegranularitytakenbyeventsandfactsinthemultidimensionalmodel.
Normally,eachdimensionisassociatedwithahierarchyofaggregationlevels,oftencalledroll-uphierarchy.Roll-uphierarchiesgroupaggregationlevelvaluesindifferentways.Hierarchiesconsistoflevelscalleddimensionalattributes.Figure1-11showsasimpleexampleofhierarchiesbuiltontheproductandstoredimensions:productsareclassifiedintotypes,andarethenfurtherclassifiedintocategories.Storesarelocatedincitiesbelongingtostates.Ontopofeachhierarchyisafakelevelthatincludesallthedimension-relatedvalues.Fromtheviewpointofrelationaltheory,youcanuseasetoffunctionaldependenciesbetweendimensionalattributestoexpressahierarchy:
product→type→categorystore→city→state
Insummary,amultidimensionalcubehingesonafactrelevanttodecision-making.Itshowsasetofeventsforwhichnumericmeasuresprovideaquantitativedescription.Eachcubeaxisshowsapossibleanalysisdimension.Eachdimensioncanbeanalyzedatdifferentdetaillevelsspecifiedbyhierarchicallystructuredattributes.
Thescientificliteratureshowsmanyformalexpressionsofthemultidimensionalmodel,whichcanbemoreorlesscomplexandcomprehensive.We’llbrieflymentionalternativetermsusedforthemultidimensionalmodelinthescientificliteratureandincommercialtools.
FIGURE 1-11 Aggregation hierarchies built on the product and store dimensions
ch01.indd21 4/21/093:23:35PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
22 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 22 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
Thefactandcubetermsareofteninterchangeablyused.Essentially,everyoneagreesontheuseofthetermdimensionstospecifythecoordinatesthatclassifyandidentifyfactoccurrences.However,entirehierarchiesaresometimescalleddimensions.Forexample,thetermtimedimensioncanbeusedfortheentirehierarchybuiltonthedateattribute.Measuresaresometimescalledvariables,metrics,properties,attributes,orindicators.Insomemodels,dimensionalattributesofhierarchiesarecalledlevelsorparameters.
NOTE ThemainformalexpressionsofthemultidimensionalmodelintheliteraturewereproposedbyAgrawaletal.,1995;GyssensandLakshmanan,1997;DattaandThomas,1997;Vassiliadis,1998;andCabibboandTorlone,1998.
Theinformationinamultidimensionalcubeisverydifficultforuserstomanagebecauseofitsquantity,evenifitisaconciseversionoftheinformationstoredtooperationaldatabases.If,forexample,astorechainincludes50storesselling1000items,andaspecificdatawarehousecoversthree-year-longtransactions(approximately1000days),thenumberofpotentialeventstotals50×1000×1000=5×107.Assumingthateachstorecansellonly10percentofalltheavailableitemsperday,thenumberofeventstotals5×106.Thisisstilltoomuchdatatobeanalyzedbyuserswithoutrelyingonautomatictools.
Youhaveessentiallytwowaystoreducethequantityofdataandobtainusefulinformation:restrictionandaggregation.Thecubemetaphoroffersaneasy-to-useandintuitivewaytounderstandbothofthesemethods,aswewilldiscussinthefollowingparagraphs.
1.5.1 RestrictionRestrictingdatameansseparatingpartofthedatafromacubetomarkoutananalysisfield.Inrelationalalgebraterminology,thisiscalledmakingselectionsand/orprojections.
Thesimplesttypeofselectionisdataslicing,showninFigure1-12.Whenyouslicedata,youdecreasecubedimensionalitybysettingoneormoredimensionstoaspecificvalue.Forexample,ifyousetoneofthesalescubedimensionstoavalue,suchasstore='EverMore',thisresultsinthesetofeventsassociatedwiththeitemssoldintheEverMorestore.Accordingtothecubemetaphor,thisissimplyaplaneofcells—thatis,adataslicethatcanbeeasilydisplayedinspreadsheets.Inthestorechainexamplegivenearlier,approximately105eventsstillappearinyourresult.Ifyousettwodimensionstoavalue,suchasstore='EverMore'anddate='4/5/2008',thiswillresultinallthedifferentitemssoldintheEverMorestoreonApril5(approximately100events).Graphicallyspeaking,thisinformationisstoredattheintersectionoftwoperpendicularplanesresultinginaline.Ifyousetallthedimensionstoaparticularvalue,youwilldefinejustoneeventthatcorrespondstoapointinthethree-dimensionalspaceofsales.
Dicingisageneralizationofslicing.Itposessomeconstraintsondimensionalattributestoscaledownthesizeofacube.Forexample,youcanselectonlythedailysalesofthefooditemsinApril2008inFlorida(Figure1-12).Inthisway,iffivestoresarelocatedinFloridaand50foodproductsaresold,thenumberofeventstoexaminechangesto5×50×30=7500.
Finally,aprojectioncanbereferredtoasachoicetokeepjustonesubgroupofmeasuresforeveryeventandrejectothermeasures.
ch01.indd22 4/21/093:23:35PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 23
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 23
1.5.2 AggregationAggregationplaysafundamentalroleinmultidimensionaldatabases.Assume,forexample,thatyouwanttoanalyzetheitemssoldmonthlyforathreeyearperiod.Accordingtothecubemetaphor,thismeansthatyouneedtosortallthecellsrelatedtothedaysofeachmonthbyproductandstore,andthenmergethemintoonesinglemacrocell.Intheaggregatecubeobtainedinthisway,thetotalnumberofevents(thatis,thenumberofmacrocells)is50×1000×36.Thisisbecausethegranularityofthetimedimensionsdoesnotdependondaysanylonger,butnowdependsonmonths,and36isthenumberofmonthsinthreeyears.Everyaggregateeventwillthensumupthedataavailableintheeventsitaggregates.Inthisexample,thetotalamountofitemssoldpermonthandthetotalreceiptsarecalculatedbysummingeverysinglevalueoftheirmeasures(Figure1-13).Ifyoufurtheraggregatealongtime,youcanachievejustthreeeventsforeverystore-productcombination:oneforeveryyear.Whenyoucompletelyaggregatealongthetimedimension,eachstore-productcombinationcorrespondstoonesingleevent,whichshowsthetotalamountofitemssoldinastoreoverthreeyearsandthetotalamountofreceipts.
FIGURE 1-12 Slicing and dicing a three-dimensional cube
ch01.indd23 4/21/093:23:36PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
24 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 24 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
FIGURE 1-13 Time hierarchy aggregation of the quantity of items sold per product in three stores. A dash shows that an event did not occur because no item was sold.
EverMore EvenMore SmartMart
1/1/2007 – – –
1/2/2007 10 15 5
1/3/2007 20 – 5
.......... .......... .......... ..........
1/1/2008 – – –
1/2/2008 15 10 20
1/3/2008 20 20 25
.......... .......... .......... ..........
1/1/2009 – – –
1/2/2009 20 8 25
1/3/2009 20 12 20
.......... .......... .......... ..........
EverMore EvenMore SmartMart
January 2007 200 180 150
February 2007 180 150 120
March 2007 220 180 160
.......... .......... .......... ..........
January 2008 350 220 200
February 2008 300 200 250
March 2008 310 180 300
.......... .......... .......... ..........
January 2009 380 200 220
February 2009 310 200 250
March 2009 300 160 280
.......... .......... .......... ..........
EverMore EvenMore SmartMart
2,400 2,000 1,600
3,200 2,300 3,000
2007
2008
2009 3,400 2,200 3,200
EverMore EvenMore SmartMart
Total 9,000 6,500 7,800
ch01.indd24 4/21/093:23:37PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 25
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 25
Youcanaggregatealongvariousdimensionsatthesametime.Forexample,Figure1-14showsthatyoucangroupsalesbymonth,producttype,andstorecity,andbymonthandproducttype.Moreover,selectionsandaggregationscanbecombinedtocarryoutananalysisprocesstargetedexactlytousers’needs.
1.6 Meta-dataThetermmeta-datacanbeappliedtothedatausedtodefineotherdata.Inthescopeofdatawarehousing,meta-dataplaysanessentialrolebecauseitspecifiessource,values,usage,andfeaturesofdatawarehousedataanddefineshowdatacanbechangedandprocessedateveryarchitecturelayer.Figures1-3and1-4showthatthemeta-datarepositoryiscloselyconnectedtothedatawarehouse.Applicationsuseitintensivelytocarryoutdata-stagingandanalysistasks.
AccordingtoKelly’sapproach,youcanclassifymeta-dataintotwopartiallyoverlappingcategories.Thisclassificationisbasedonthewayssystemadministratorsandendusersexploitmeta-data.Systemadministratorsareinterestedininternalmeta-databecauseitdefinesdatasources,transformationprocesses,populationpolicies,logicalandphysicalschemata,constraints,anduserprofiles.Externalmeta-dataisrelevanttoendusers.Forexample,itisaboutdefinitions,qualitystandards,unitsofmeasure,relevantaggregations.
FIGURE 1-14 Two cube aggregation levels. Every macro-event measure value is a sum of its component event values.
ch01.indd25 4/21/093:23:37PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
26 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 26 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
Meta-dataisstoredinameta-datarepositorywhichalltheotherarchitecturecomponentscanaccess.AccordingtoKelly,atoolformeta-datamanagementshould
• allowadministratorstoperformsystemadministrationoperations,andinparticularmanagesecurity;
• allowenduserstonavigateandquerymeta-data;
• useaGUI;
• allowenduserstoextendmeta-data;
• allowmeta-datatobeimported/exportedinto/fromotherstandardtoolsandformats.
Asfarasrepresentationformatsareconcerned,ObjectManagementGroup(OMG,2000)releasedastandardcalledCommonWarehouseMetamodel(CWM)thatreliesonthreefamousstandards:UnifiedModelingLanguage(UML),eXtensibleMarkupLanguage(XML),andXMLMetadataInterchange(XMI).Partners,suchasIBM,Unisys,NCR,andOracle,inacommoneffort,createdthenewstandardformatthatspecifieshowmeta-datacanbeexchangedamongthetechnologiesrelatedtodatawarehouses,businessintelligence,knowledgemanagement,andwebportals.
Figure1-15showsanexampleofadialogboxdisplayingexternalmeta-datarelatedtohierarchiesinMicroStrategyDesktopoftheMicroStrategy8toolsuite.Inparticular,
FIGURE 1-15 Accessing hierarchy meta-data in MicroStrategy
ch01.indd26 4/21/093:23:38PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 27
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 27
thisdialogboxdisplaystheCallingCenterattributeparentattributes.Specifically,itstatesthatacallingcenterreferstoadistributioncenter,belongstoaregion,andismanagedbyamanager.
NOTE SeeBarquinandEdelstein,1996;Jarkeetal.,2000;Jennings,2004;andTozer,1999,foracomprehensivediscussiononmeta-datarepresentationandmanagement.
1.7 Accessing Data WarehousesAnalysisisthelastlevelcommontoalldatawarehousearchitecturetypes.Aftercleansing,integrating,andtransformingdata,youshoulddeterminehowtogetthebestoutofitintermsofinformation.Thefollowingsectionsshowthebestapproachesforenduserstoquerydatawarehouses:reports,OLAP,anddashboards.Endusersoftenusetheinformationstoredtoadatawarehouseasastartingpointforadditionalbusinessintelligenceapplications,suchaswhat-ifanalysesanddatamining.SeeChapter15formoredetailsontheseadvancedapplications.
1.7.1 ReportsThisapproachisorientedtothoseuserswhoneedtohaveregularaccesstotheinformationinanalmoststaticway.Forexample,supposealocalhealthauthoritymustsendtoitsstateofficesmonthlyreportssummingupinformationonpatientadmissioncosts.Thelayoutofthosereportshasbeenpredeterminedandmayvaryonlyifchangesareappliedtocurrentlawsandregulations.Designersissuethequeriestocreatereportswiththedesiredlayoutand“freeze”allthoseinanapplication.Inthisway,enduserscanquerycurrentdatawhenevertheyneedto.
Areportisdefinedbyaqueryandalayout.Aquerygenerallyimpliesarestrictionandanaggregationofmultidimensionaldata.Forexample,youcanlookforthemonthlyreceiptsduringthelastquarterforeveryproductcategory.Alayoutcanlooklikeatableorachart(diagrams,histograms,pies,andsoon).Figure1-16showsafewexamplesoflayoutsforthereceiptsquery.
Areportingtoolshouldbeevaluatednotonlyonthebasisofcomprehensivereportlayouts,butalsoonthebasisofflexiblereportdeliverysystems.Areportcanbeexplicitlyrunbyusersorautomaticallyandregularlysenttoregisteredendusers.Forexample,itcanbesentviae-mail.
Keepinmindthatreportsexistedlongbeforedatawarehousesystemscametobe.Reportshavealwaysbeenthemaintoolusedbymanagersforevaluatingandplanningtaskssincetheinventionofdatabases.However,addingdatawarehousestothemixisbeneficialtoreportsfortwomainreasons:First,theytakeadvantageofreliableandcorrect
ch01.indd27 4/21/093:23:38PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
28 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 28 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
FIGURE 1-16 Report layouts: table (top), line graph (middle), 3-D pie graphs (bottom)
ch01.indd28 4/21/093:23:38PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 29
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 29
resultsbecausethedatasummedupinreportsisconsistentandintegrated.Inaddition,datawarehousesexpeditethereportingprocessbecausethearchitecturalseparationbetweentransactionprocessingandanalysessignificantlyimprovesperformance.
1.7.2 OLAPOLAPmightbethemainwaytoexploitinformationinadatawarehouse.Surelyitisthemostpopularone,anditgivesendusers,whoseanalysisneedsarenoteasytodefinebeforehand,theopportunitytoanalyzeandexploredatainteractivelyonthebasisofthemultidimensionalmodel.Whileusersofreportingtoolsessentiallyplayapassiverole,OLAPusersareabletostartacomplexanalysissessionactively,whereeachstepistheresultoftheoutcomeofprecedingsteps.Real-timepropertiesofOLAPsessions,requiredin-depthknowledgeofdata,complexqueriesthatcanbeissued,anddesignforusersnotfamiliarwithITmakethetoolsinuseplayacrucialrole.TheGUIofthesetoolsmustbeflexible,easy-to-use,andeffective.
AnOLAPsessionconsistsofanavigationpaththatcorrespondstoananalysisprocessforfactsaccordingtodifferentviewpointsandatdifferentdetaillevels.Thispathisturnedintoasequenceofqueries,whichareoftennotissueddirectly,butdifferentiallyexpressedwithreferencetothepreviousquery.Theresultsofqueriesaremultidimensional.Becausewehumanshaveadifficulttimedecipheringdiagramsofmorethanthreedimensions,OLAPtoolstypicallyusetablestodisplaydata,withmultipleheaders,colors,andotherfeaturestohighlightdatadimensions.
EverystepofananalysissessionischaracterizedbyanOLAPoperatorthatturnsthelatestqueryintoanewone.Themostcommonoperatorsareroll-up,drill-down,slice-and-dice,pivot,drill-across,anddrill-through.Thefiguresincludedhereshowdifferentoperators,andweregeneratedusingtheMicroStrategyDesktopfront-endapplicationintheMicroStrategy8toolsuite.TheyarebasedontheV-Mallexample,inwhichalargevirtualmallsellsitemsfromitscatalogviaphoneandtheInternet.Figure1-17showstheattributehierarchiesrelevanttothesalesfactinV-Mall.
Theroll-upoperatorcausesanincreaseindataaggregationandremovesadetaillevelfromahierarchy.Forexample,Figure1-18showsaqueryposedbyauserthatdisplays
FIGURE 1-17 Attribute hierarchies in V-Mall; arrows show functional dependencies
ch01.indd29 4/21/093:23:38PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
30 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 30 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
monthlyrevenuesin2005and2006foreverycustomerregion.Ifyou“rollitup,”youremovethemonthdetailtodisplayquarterlytotalrevenuesperregion.Rolling-upcanalsoreducethenumberofdimensionsinyourresultsifyouremoveallthehierarchydetails.IfyouapplythisprincipletoFigure1-19,youcanremoveinformationoncustomersanddisplayyearlytotalrevenuesperproductcategoryasyouturnthethree-dimensionaltable
FIGURE 1-18 Time hierarchy roll-up
ch01.indd30 4/21/093:23:39PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 31
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 31
FIGURE 1-19 Roll-up removing customer hierarchy
FIGURE 1-20 Rolling-up (left) and drilling-down (right) a cube
intoatwo-dimensionalone.Figure1-20usesthecubemetaphortosketcharoll-upoperationwithandwithoutadecreaseindimensions.
Thedrill-downoperatoristhecomplementtotheroll-upoperator.Figure1-20showsthatitreducesdataaggregationandaddsanewdetailleveltoahierarchy.Figure1-21showsanexamplebasedonabidimensionaltable.Thistableshowsthattheaggregationbasedoncustomerregionsshiftstoanewfine-grainedaggregationbasedoncustomercities.
ch01.indd31 4/21/093:23:39PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
32 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 32 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
FIGURE 1-21 Drilling-down customer hierarchy
FIGURE 1-22 Drilling-down and adding a dimension
InFigure1-22,thedrill-downoperatorcausesanincreaseinthenumberoftabledimensionsafteraddingcustomerregiondetails.
Slice-and-diceisoneofthemostabusedtermsindatawarehouseliteraturebecauseitcanhavemanydifferentmeanings.AfewauthorsuseitgenerallytodefinethewholeOLAPnavigationprocess.Otherauthorsuseittodefineselectionandprojectionoperationsbasedondata.Incompliancewithsection1.5.1,wedefineslicingasanoperationthatreducesthe
ch01.indd32 4/21/093:23:40PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 33
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 33
FIGURE 1-23 Slicing (above) and dicing (below) a cube
FIGURE 1-24 Slicing based on the Year='2006' predicate
numberofcubedimensionsaftersettingoneofthedimensionstoaspecificvalue.Dicingisanoperationthatreducesthesetofdatabeinganalyzedbyaselectioncriterion(Figure1-23).Figures1-24and1-25showafewexamplesofslicinganddicing.
Thepivotoperatorimpliesachangeinlayouts.Itaimsatanalyzinganindividualgroupofinformationfromadifferentviewpoint.Accordingtothemultidimensionalmetaphor,ifyoupivotdata,yourotateyourcubesothatyoucanrearrangecellsonthe
ch01.indd33 4/21/093:23:40PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
34 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 34 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
FIGURE 1-25 Selection based on a complex predicate
FIGURE 1-26 Pivoting a cube
basisofanewperspective.Inpractice,youcanhighlightadifferentcombinationofdimensions(Figure1-26).Figures1-27and1-28showafewexamplesofpivotedtwo-dimensionalandthree-dimensionaltables.
Thetermdrill-acrossstandsfortheopportunitytocreatealinkbetweentwoormoreinterrelatedcubesinordertocomparetheirdata.Forexample,thisappliesifyoucalculate
ch01.indd34 4/21/093:23:41PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 35
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 35
FIGURE 1-27 Pivoting a two-dimensional table
FIGURE 1-28 Pivoting a three-dimensional table
anexpressioninvolvingmeasuresfromtwocubes(Figure1-29).Figure1-30showsanexampleinwhichasalescubeisdrilled-acrossapromotionscubeinordertocomparerevenuesanddiscountsperquarterandproductcategory.
MostOLAPtoolscanperformdrill-throughoperations,thoughwithvaryingeffectiveness.Thisoperationswitchesfrommultidimensionalaggregatedataindatamartstooperationaldatainsourcesorinthereconciledlayer.
Inmanyapplications,anintermediateapproachbetweenstaticreportingandOLAPisbroadlyused.Thisintermediateapproachiscalledsemi-staticreporting.Evenifasemi-staticreportfocusesonagroupofinformationpreviouslyset,itgivesuserssomemarginoffreedom.Thankstothismargin,userscanfollowalimitedsetofnavigationpaths.Forexample,thisapplieswhenyoucanrollupjusttoafewhierarchyattributes.Thissolutioniscommon,becauseitprovidessomeunquestionableadvantages.First,usersneedlessskilltousedatamodelsandanalysistoolsthantheyneedforOLAP.Second,thisavoidstheriskthatoccursinOLAPofachievinginconsistentanalysisresultsorincorrectonesbecauseofanymisuseofaggregationoperators.Third,ifyouposeconstraintsontheanalysesallowed,youwillpreventusersfromunwillinglyslowingdownyoursystemwhenevertheyformulatedemandingqueries.
ch01.indd35 4/21/093:23:41PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
36 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 36 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
1.7.3 DashboardsDashboardsareanothermethodusedfordisplayinginformationstoredtoadatawarehouse.ThetermdashboardreferstoaGUIthatdisplaysalimitedamountofrelevantdatainabriefandeasy-to-readformat.Dashboardscanprovideareal-timeoverviewofthetrendsforaspecificphenomenonorformanyphenomenathatarestrictlyconnectedwitheachother.Thetermisavisualmetaphor:thegroupofindicatorsintheGUIaredisplayedlikeacardashboard.Dashboardsareoftenusedbyseniormanagerswhoneedaquickwaytoviewinformation.However,toconductanddisplayverycomplexanalysesofphenomena,dashboardsmustbematchedwithanalysistools.
Today,mostsoftwarevendorsofferdashboardsforreportcreationanddisplay.Figure1-31showsadashboardcreatedwithMicroStrategyDynamicEnterprise.Theliteraturerelatedtodashboardgraphicdesignhasalsoproventobeveryrich,inparticularinthescopeofenterprises(Few,2006).
FIGURE 1-29 Drilling across two cubes
FIGURE 1-30 Drilling across the sales cube (Revenue measure) and the promotions cube (Discount measure)
ch01.indd36 4/21/093:23:42PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 37
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 37
Keepinmind,however,thatdashboardsarenothingbutperformanceindicatorsbehindGUIs.Theireffectivenessisduetoacarefulselectionoftherelevantmeasures,whileusingdatawarehouseinformationqualitystandards.Forthisreason,dashboardsshouldbeviewedasasophisticatedeffectiveadd-ontodatawarehousesystems,butnotastheprimarygoalofdatawarehousesystems.Infact,theprimarygoalofdatawarehousesystemsshouldalwaysbetoproperlydefineaprocesstotransformdataintoinformation.
1.8 ROLAP, MOLAP, and HOLAPThesethreeacronymsconcealthreemajorapproachestoimplementingdatawarehouses,andtheyarerelatedtothelogicalmodelusedtorepresentdata:
• ROLAPstandsforRelationalOLAP,animplementationbasedonrelationalDBMSs.
• MOLAPstandsforMultidimensionalOLAP,animplementationbasedonmultidimensionalDBMSs.
• HOLAPstandsforHybridOLAP,animplementationusingbothrelationalandmultidimensionaltechniques.
FIGURE 1-31 An example of dashboards
ch01.indd37 4/21/093:23:42PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
38 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 38 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
Theideaofadoptingtherelationaltechnologytostoredatatoadatawarehousehasasolidfoundationifyouconsiderthehugeamountofliteraturewrittenabouttherelationalmodel,thebroadlyavailablecorporateexperiencewithrelationaldatabaseusageandmanagement,andthetopperformanceandflexibilitystandardsofrelationalDBMSs(RDBMSs).Theexpressivepoweroftherelationalmodel,however,doesnotincludetheconceptsofdimension,measure,andhierarchy,soyoumustcreatespecifictypesofschematasothatyoucanrepresentthemultidimensionalmodelintermsofbasicrelationalelementssuchasattributes,relations,andintegrityconstraints.Thistaskismainlyperformedbythewell-knownstarschema.SeeChapter8formoredetailsonstarschemataandstarschemavariants.
ThemainproblemwithROLAPimplementationsresultsfromtheperformancehitcausedbycostlyjoinoperationsbetweenlargetables.Toreducethenumberofjoins,oneofthekeyconceptsofROLAPisdenormalization—aconsciousbreachinthethirdnormalformorientedtoperformancemaximization.Tominimizeexecutioncosts,theotherkeywordisredundancy,whichistheresultofthematerializationofsomederivedtables(views)thatstoreaggregatedatausedfortypicalOLAPqueries.
Fromanarchitecturalviewpoint,adoptingROLAPrequiresspecializedmiddleware,alsocalledamultidimensionalengine,betweenrelationalback-endserversandfront-endcomponents,asshowninFigure1-32.ThemiddlewarereceivesOLAPqueriesformulatedbyusersinafront-endtoolandturnsthemintoSQLinstructionsforarelationalback-endapplicationwiththesupportofmeta-data.Theso-calledaggregatenavigatorisaparticularlyimportantcomponentinthisphase.Incaseofaggregateviews,thiscomponentselectsaviewfromamongallthealternativestosolveaspecificqueryattheminimumaccesscost.
Incommercialproducts,differentfront-endmodules,suchasOLAP,reports,anddashboards,aregenerallystrictlyconnectedtoamultidimensionalengine.Multidimensionalenginesarethemaincomponentsandcanbeconnectedtoanyrelationalserver.Opensourcesolutionshavebeenrecentlyreleased.Theirmultidimensionalengines(Mondrian,2009)aredisconnectedfromfront-endmodules(JPivot,2009).Forthisreason,theycanbemoreflexible
FIGURE 1-32 ROLAP architecture
ch01.indd38 4/21/093:23:43PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 39
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 39
thancommercialsolutionswhenyouhavetocreatethearchitecture(ThomsenandPedersen,2005).AfewcommercialRDBMSsnativelysupportfeaturestypicalformultidimensionalenginestomaximizequeryoptimizationandincreasemeta-datareusability.Forexample,sinceits8iversionwasmadeavailable,Oracle’sRDBMSgivesuserstheopportunitytodefinehierarchiesandmaterializedviews.Moreover,itoffersanavigatorthatcanusemeta-dataandrewritequerieswithoutanyneedforamultidimensionalenginetobeinvolved.
DifferentfromaROLAPsystem,aMOLAPsystemisbasedonanadhoclogicalmodelthatcanbeusedtorepresentmultidimensionaldataandoperationsdirectly.Theunderlyingmultidimensionaldatabasephysicallystoresdataasarraysandtheaccesstoitispositional(GaedeandGünther,1998).Grid-files(Nievergeltetal.,1984;WhangandKrishnamurthy,1991),R*-trees(Beckmannetal.,1990)andUB-trees(Markletal.,2001)areamongthetechniquesusedforthispurpose.
ThegreatestadvantageofMOLAPsystemsincomparisonwithROLAPisthatmultidimensionaloperationscanbeperformedinaneasy,naturalwaywithMOLAPwithoutanyneedforcomplexjoinoperations.Forthisreason,MOLAPsystemperformanceisexcellent.However,MOLAPsystemimplementationshaveverylittleincommon,becausenomultidimensionallogicalmodelstandardhasyetbeenset.Generally,theysimplysharetheusageofoptimizationtechniquesspecificallydesignedforsparsitymanagement.Thelackofacommonstandardisaproblembeingprogressivelysolved.ThismeansthatMOLAPtoolsarebecomingmoreandmoresuccessfulaftertheirlimitedimplementationformanyyears.Thissuccessisalsoprovenbytheinvestmentsinthistechnologybymajorvendors,suchasMicrosoft(AnalysisServices)andOracle(Hyperion).
Theintermediatearchitecturetype,HOLAP,aimsatmixingtheadvantagesofbothbasicsolutions.IttakesadvantageofthestandardizationlevelandtheabilitytomanagelargeamountsofdatafromROLAPimplementations,andthequeryspeedtypicalofMOLAPsystems.HOLAPimpliesthatthelargestamountofdatashouldbestoredinanRDBMStoavoidtheproblemscausedbysparsity,andthatamultidimensionalsystemstoresonlytheinformationusersmostfrequentlyneedtoaccess.Ifthatinformationisnotenoughtosolvequeries,thesystemwilltransparentlyaccessthepartofthedatamanagedbytherelationalsystem.Overthelastfewyears,importantmarketactorssuchasMicroStrategyhaveadoptedHOLAPsolutionstoimprovetheirplatformperformance,joiningothervendorsalreadyusingthissolution,suchasBusinessObjects.
1.9 Additional IssuesTheissuesthatfollowcanplayafundamentalroleintuningupadatawarehousesystem.Thesepointsinvolveverywide-rangingproblemsandarementionedheretogiveyouthemostcomprehensivepicturepossible.
1.9.1 QualityIngeneral,wecansaythatthequalityofaprocessstandsforthewayaprocessmeetsusers’goals.Indatawarehousesystems,qualityisnotonlyusefulforthelevelofdata,butaboveallforthewholeintegratedsystem,becauseofthegoalsandusageofdatawarehouses.Astrictqualitystandardmustbeensuredfromthefirstphasesofthedatawarehouseproject.
ch01.indd39 4/21/093:23:43PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
40 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 40 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
Defining,measuring,andmaximizingthequalityofadatawarehousesystemcanbeverycomplexproblems.Forthisreason,wementiononlyafewpropertiescharacterizingdataqualityhere:
• Accuracy Storedvaluesshouldbecompliantwithreal-worldones.
• Freshness Datashouldnotbeold.
• Completeness Thereshouldbenolackofinformation.
• Consistency Datarepresentationshouldbeuniform.
• Availability Usersshouldhaveeasyaccesstodata.
• Traceability Datacaneasilybetraceddatabacktoitssources.
• Clearness Datacanbeeasilyunderstood.
Technically,checkingfordataqualityrequiresappropriatesetsofmetrics(Abellóetal.,2006).Inthefollowingsections,weprovideanexampleofthemetricsforafewofthequalitypropertiesmentioned:
• Accuracyandcompleteness ReferstothepercentageoftuplesnotloadedbyanETLprocessandcategorizedonthebasisofthetypesofproblemarising.Thispropertyshowsthepercentageofmissing,invalid,andnonstandardvaluesofeveryattribute.
• Freshness Definesthetimeelapsedbetweenthedatewhenaneventtakesplaceandthedatewhenuserscanaccessit.
• Consistency Definesthepercentageoftuplesthatmeetbusinessrulesthatcanbesetformeasuresofanindividualcubeormanycubesandthepercentageoftuplesmeetingstructuralconstraintsimposedbythedatamodel(forexample,uniquenessofprimarykeys,referentialintegrity,andcardinalityconstraintcompliance).
Notethatcorporateorganizationplaysafundamentalroleinreachingdataqualitygoals.Thisrolecanbeeffectivelyplayedonlybycreatinganappropriateandaccuratecertificationsystemthatdefinesalimitedgroupofusersinchargeofdata.Forthisreason,designersmustraiseseniormanagers’awarenessofthistopic.Designersmustalsomotivatemanagementtocreateanaccuratecertificationprocedurespecificallydifferentiatedforeveryenterprisearea.Aboardofcorporatemanagerspromotingdataqualitymaytriggeravirtuouscyclethatismorepowerfulandlesscostlythananydatacleansingsolution.Forexample,youcanachieveawesomeresultsifyouconnectacorporatedepartmentbudgettoaspecificdataqualitythresholdtobereached.
Anadditionaltopicconnectedtothequalityofadatawarehouseprojectisrelatedtodocumentation.Todaymostdocumentationisstillnonstandardized.Itisoftenissuedattheendoftheentiredatawarehouseproject.Designersandimplementersconsiderdocumentationawasteoftime,anddatawarehouseprojectcustomersconsideritanextracostitem.Softwareengineeringteachesthatastandardsystemfordocumentsshouldbeissued,managed,andvalidatedincompliancewithprojectdeadlines.Thissystemcanensurethatdifferentdatawarehouseprojectphasesarecorrectlycarriedoutandthatallanalysisandimplementationpointsareproperlyexaminedandunderstood.Inthemediumandlongterm,correctdocumentsincreasethechancesofreusingdatawarehouseprojectsandensureprojectknow-howmaintenance.
ch01.indd40 4/21/093:23:43PM
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 41
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 41
NOTE Jarkeetal.,2000havecloselystudieddataquality.Theirstudiesprovideusefuldiscussionsontheimpactofdataqualityproblemsfromthemethodologicalpointofview.Kelly,1997describesqualitygoalsstrictlyconnectedtotheviewpointofbusinessorganizations.Serranoetal.,2004,2007;Lechtenbörger,2001;andBouzeghoubandKedad,2000focusonqualitystandardsrespectivelyforconceptual,logical,andphysicaldatawarehouseschemata.
1.9.2 SecurityInformationsecurityisgenerallyafundamentalrequirementforasystem,anditshouldbecarefullyconsideredinsoftwareengineeringateveryprojectdevelopmentstagefromrequirementanalysisthroughimplementationtomaintenance.Securityisparticularlyrelevanttodatawarehouseprojects,becausedatawarehousesareusedtomanageinformationcrucialforstrategicdecision-makingprocesses.Furthermore,multidimensionalpropertiesandaggregationcauseadditionalsecurityproblemssimilartothosethatgenerallyariseinstatisticdatabases,becausetheyimplicitlyoffertheopportunitytoinferinformationfromdata.Finally,thehugeamountofinformationexchangethattakesplaceindatawarehousesinthedata-stagingphasecausesspecificproblemsrelatedtonetworksecurity.
Appropriatemanagementandauditingcontrolsystemsareimportantfordatawarehouses.Managementcontrolsystemscanbeimplementedinfront-endtoolsorcanexploitoperatingsystemservices.Asfarasauditingisconcerned,thetechniquesprovidedbyDBMSserversarenotgenerallyappropriateforthisscope.Forthisreason,youmusttakeadvantageofthesystemsimplementedbyOLAPengines.Fromtheviewpointofusersprofile–baseddataaccess,basicrequirementsarerelatedtohidingwholecubes,specificcubeslices,andspecificcubemeasures.Sometimesyoualsohavetohidecubedatabeyondagivendetaillevel.
NOTE Inthescientificliteraturethereareafewworksspecificallydealingwithsecurityindatawarehousesystems(Kirkgözeetal.,1997;PriebeandPernul,2000;RosenthalandSciore,2000;Katicetal.,1998).Inparticular,PriebeandPernulproposeacomparativestudyonsecuritypropertiesofafewcommercialplatforms.Ferrandez-Medinaetal.,2004andSoleretal.,2008discussanapproachthatcouldbemoreinterestingfordesigners.TheyuseaUMLextensiontomodelspecificsecurityrequirementsfordatawarehousesintheconceptualdesignandrequirementanalysisphases,respectively.
1.9.3 EvolutionManymaturedatawarehouseimplementationsarecurrentlyrunninginmidsizeandlargecompanies.Theunstoppableevolutionofapplicationdomainshighlightsdynamicfeaturesofdatawarehousesconnectedtothewayinformationchangesattwodifferentlevelsastimegoesby:
• Datalevel Evenifmeasureddataisnaturallyloggedindatawarehousesthankstotemporaldimensionsmarkingevents,themultidimensionalmodelimplicitlyassumesthathierarchiesarecompletelystatic.Itisclearthatthisassumptionisnotveryrealistic.Forexample,acompanycanaddnewproductcategoriestoitscatalogandremoveothers,oritcanchangethecategorytowhichanexistingproductbelongsinordertomeetnewmarketingstrategies.
ch01.indd41 4/21/093:23:44PM
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
42 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s
• Schemalevel Adatawarehouseschemacanvarytomeetnewbusinessdomainstandards,newusers’requirements,orchangesindatasources.Newattributesandmeasurescanbecomenecessary.Forexample,youcanaddasubcategorytoaproducthierarchytomakeanalysesricherindetail.Youshouldalsoconsiderthatthesetoffactdimensionscanvaryastimegoesby.
Temporalproblemsareevenmorechallengingindatawarehousesthaninoperationaldatabases,becausequeriesoftencoverlongerperiodsoftime.Forthisreason,datawarehousequeriesfrequentlydealwithdifferentdataand/orschemaversions.Moreover,thispointisparticularlycriticalfordatawarehousesthatrunforalongtime,becauseeveryevolutionnotcompletelycontrolledcausesagrowinggapbetweentherealworldanditsdatabaserepresentation,eventuallymakingthedatawarehousesobsoleteanduseless.
Asfaraschangesindatavaluesareconcerned,differentapproacheshavebeendocumentedinscientificliterature.Somecommercialsystemsalsomakeitpossibletotrackchangesandquerycubesonthebasisofdifferenttemporalscenarios.Seesection8.4formoredetailsondynamichierarchies.Ontheotherhand,managingchangesindataschematahasbeenexploredonlypartiallytodate.Nocommercialtooliscurrentlyavailableonthemarkettosupportapproachestodataschemachangemanagement.
Theapproachestodatawarehouseschemachangemanagementcanbeclassifiedintwocategories:evolution(Quix,1999;Vaismanetal.,2002;Blaschka,2000)andversioning(Ederetal.,2002;Golfarellietal.,2006a).Bothcategoriesmakeitpossibletoalterdataschemata,butonlyversioningcantrackpreviousschemareleases.Afewapproachestoversioningcancreatenotonly“true”versionsgeneratedbychangesinapplicationdomains,butalsoalternativeversionstouseforwhat-ifanalyses(Bebeletal.,2004).
Themainproblemthathasnotbeensolvedinthisfieldisthecreationoftechniquesforversioninganddatamigrationbetweenversionsthatcanflexiblysupportqueriesrelatedtomoreschemaversions.Furthermore,weneedsystemsthatcansemiautomaticallyadjustETLprocedurestochangesinsourceschemata.Inthisdirection,someOLAPtoolsalreadyusetheirmeta-datatosupportanimpactanalysisaimedatidentifyingthefullconsequencesofanychangesinsourceschemata.
ch01.indd42 4/21/093:23:44PM