ew-shopp d3.1 v1.1€¦ · the term "data analytics" is too often used to define a...

78
EW-Shopp GA number: 732590 H2020-ICT-2016-2017/H2020-ICT-2016-1 1 D3.1 – EW-Shopp Components Specification and Design Deliverable n: 3.1 Date: 29 December 2017 Status: Final Version: 1.0 Authors: M. Ciavotta (UNIMIB), F. De Paoli (UNIMIB), A. Košmerlj (JSI), A. Marguglio (ENG), N. Nikolov (SINTEF), M. Zvan (BT) Contributors: F. Rodriguez (JOT), P. Orel (BB), D. Jerončič (CE), O. Melnyk (ME), S. Dalgard (SINTEF) Rreviewers: F. Perales (JOT), D. Čreslovnik (CE) Distribution: Public Grant n. 732590 - H2020-ICT-2016-2017/H2020-ICT-2016-1

Upload: others

Post on 24-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

1

D3.1–EW-ShoppComponentsSpecificationandDesign

Deliverablen: 3.1Date: 29December2017Status: FinalVersion: 1.0Authors: M. Ciavotta (UNIMIB), F. De Paoli (UNIMIB), A. Košmerlj (JSI), A.

Marguglio(ENG),N.Nikolov(SINTEF),M.Zvan(BT)

Contributors: F. Rodriguez (JOT), P. Orel (BB), D. Jerončič (CE), O. Melnyk (ME), S.Dalgard(SINTEF)

Rreviewers: F.Perales(JOT),D.Čreslovnik(CE) Distribution: Public

Grantn.732590-H2020-ICT-2016-2017/H2020-ICT-2016-1

Page 2: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

2

HistoryofChanges

Version Date Description Revisedby

0.1 10/10/2017 DefinitionoftheTableofContents FlaviodePaoli

0.2 15/11/2017 ContributiontoBC4Section FranciscoRodriguez

0.3 20/11/2017 Contribution on the state of the art, DataAnalyzerandQminer

AljažKošmerlj

0.4 22/11/2017 ContributiontoBC1Section Davy Jerončič,DarkoDujič

0.5 24/11/2017 ContributiontoBC3Section OlgaMelnyk

0.6 24/11/2017 SINTEF’scontributiontoseveralsections NikolayNikolov

0.7 27/11/2017 ContributiontoBC1Section Patricija FilipičOrel

0.8 29/11/2017 Chapter1,Chapter2andChapter3completed MicheleCiavotta

0.9 04/12/2017 Chapter4andChapter6completed MicheleCiavotta,FlavioDePaoli

0.10 06/12/2017 Integrationstrategysectionadded MicheleCiavotta

0.11 07/12/2017 Summaryandconclusionsadded MicheleCiavotta

1.0 22/12/2017 Revisionaccordingtoreviewers’commentsandfinalizationofthedeliverable

MicheleCiavotta,FlavioDePaoli

Page 3: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

3

ExecutiveSummary

EW-Shoppaimstosupporte-commerce,retailandmarketingindustriesinimprovingtheirefficiencyand competitiveness throughproviding theability toperformpredictive andprescriptive analyticsoverintegratedandenrichedlargedatasetsthroughtheuseofopenandflexiblesolutions.

ThisdocumentisintendedtobethefirststeptowardstheconstructionoftheEW-Shoppecosystem,aplatformthatseekstosignificantlyreduceintegrationtimeandtoimprovethequalityofanalyticsbyprovidingcutting-edgetoolsandaccesstodataasaservice,withparticularreferencetoweatherandeventinformation.

WeproposeareferencearchitecturespecificallydesignedtohandleBigDatavolumes.Inorderforthis to be possible, we have carried out a thorough requirement gathering phase that takes intoaccount thebestpractices in creatingBigDataarchitecturesand theexpectationsof thepartnersinvolved inbusinesscases.Asa result,wewereablenotonly todefine theplatformarchitecture,butalsotosetupaunifiedcommonlanguagethat,webelieve,willhaveclearbenefits inthenextstagesoftheproject.

Inaddition,weproposeaninitialreferenceimplementationbyselectingoneormoresolutionsforeach reference component. In particular, embracing the lean development approach alreadyintroducedinDeliverableD4.1,wedetailonlythecoreoftheplatform,namelythetoolsonwhichastrongconvergenceexistsamongtheconsortiummembers(theminimumviableproductorMVP).The rest of the platform is constituted by suggestion of tools and/or technologies that at thismomentseemtobegoodcandidatesbutthatmaybereplacedastheprojectdevelops.

Page 4: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

4

TableofcontextHistoryofChanges.................................................................................................................................2

Summary................................................................................................................................................3

Listoffigures.........................................................................................................................................6

Listoftables...........................................................................................................................................7

Acronyms...............................................................................................................................................8

Chapter1 Introduction...................................................................................................................10

1.1 ObjectivesandScope.........................................................................................................11

1.2 RelationshiptoOtherDeliverables....................................................................................11

1.3 DocumentStructure...........................................................................................................12

Chapter2 ReferenceSolutions.......................................................................................................13

2.1 BigDataReferenceArchitectures......................................................................................13

2.1.1 NISTReferenceArchitecture..........................................................................................13

2.1.2 BDVAReferenceModel..................................................................................................16

2.2 DataWrangling...................................................................................................................17

2.3 Analytics.............................................................................................................................19

2.4 VisualizationandReporting...............................................................................................20

Chapter3 DrivingRequirements....................................................................................................22

3.1 ElicitationMethodology.....................................................................................................22

3.2 RequirementsDefinitionSurvey........................................................................................23

3.2.1 Ceneje............................................................................................................................24

3.2.2 BigBang.........................................................................................................................25

3.2.3 Browsetel.......................................................................................................................26

3.2.4 Measurence...................................................................................................................28

3.2.5 JOT.................................................................................................................................29

3.3 ArchitecturalRequirements...............................................................................................31

3.3.1 FunctionalRequirements...............................................................................................31

3.3.2 Non-functionalRequirements........................................................................................37

Chapter4 ArchitectureDesign.......................................................................................................39

4.1.1 EnvisionedWorkflowandActors...................................................................................39

4.2 ArchitecturalSpecifications................................................................................................40

Page 5: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

5

4.2.1 TheDimensionReductionApproach.............................................................................43

4.2.2 CoreDataServices.........................................................................................................44

4.2.3 PlatformServices...........................................................................................................46

4.2.4 CorporateServices.........................................................................................................50

4.2.5 CrosscuttingServices.....................................................................................................51

4.3 ControlFlow.......................................................................................................................51

4.4 DataFormat.......................................................................................................................54

4.5 MappingtheReferenceArchitectureontheBusinessCases.............................................55

4.5.1 BC1–IntegrationforConsumerJourney.......................................................................55

4.5.2 BC3–LocationScouting.................................................................................................56

4.5.3 BC4–Campaign-drivenPurchasingIntentions..............................................................57

Chapter5 ReferenceImplementation............................................................................................59

5.1 Overview............................................................................................................................59

5.2 IntegrationStrategy...........................................................................................................61

5.3 CoreDataServices..............................................................................................................61

5.3.1 EventRegistry.................................................................................................................61

5.3.2 WeatherAPI...................................................................................................................62

5.3.3 ProductsandOtherKnowledgeBases...........................................................................63

5.4 PlatformServices................................................................................................................64

5.4.1 DataWrangler(DW).......................................................................................................64

5.4.2 DataAnalyzer(DA).........................................................................................................67

5.4.3 DataReporter(DR).........................................................................................................69

5.5 CorporateServices.............................................................................................................71

5.5.1 ClusterConfiguration.....................................................................................................71

5.5.2 DataFlowManager........................................................................................................72

5.5.3 Enrichmentandanalysisdatabase.................................................................................72

5.6 Crosscuttingservices..........................................................................................................73

5.6.1 Securityeprivacy...........................................................................................................73

5.6.2 MonitoringandAuditingServices..................................................................................73

5.7 Single-siteConfiguration....................................................................................................75

Chapter6 Conclusions....................................................................................................................76

References...........................................................................................................................................77

Page 6: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

6

Listoffigures

FIGURE1:NISTREFERENCEARCHITECTURE...................................................................................................................16FIGURE2:BDVAREFERENCEMODEL...........................................................................................................................16FIGURE3:REQUIREMENTSDEFINITIONPROCESS..............................................................................................................23FIGURE4:MAPPINGEW-SHOPPPLATFORMCOMPONENTSONNISTREFERENCEARCHITECTUREFORBIGDATA.........................41FIGURE5:EW-SHOPPLOGICALREFERENCEARCHITECTURE...............................................................................................42FIGURE6:EW-SHOPPREFERENCEDATAFLOWPIPELINE....................................................................................................44FIGURE7:DATAWRANGLERARCHITECTURE...................................................................................................................47FIGURE8:DATAANALYZERARCHITECTURE.....................................................................................................................48FIGURE9:DATAREPORTERARCHITECTURE....................................................................................................................49FIGURE10:GENERICCONTROLSEQUENCEFORDATAWRANGLING.....................................................................................52FIGURE11:GENERICCONTROLSEQUENCEFORDATAANALYTICS........................................................................................53FIGURE12:GENERICCONTROLSEQUENCEFORREPORTVISUALIZATION...............................................................................53FIGURE13:BC1ARCHITECTURE..................................................................................................................................56FIGURE14:BC3ARCHITECTURE..................................................................................................................................57FIGURE15:BC4ARCHITECTURE..................................................................................................................................58FIGURE16:ASIAARCHITECTURE.................................................................................................................................66FIGURE17:DATAWRANGLERCOMPONENTSPECIFICATION..............................................................................................67FIGURE18:DATAANALYZERCOMPONENTSPECIFICATION................................................................................................68FIGURE19:DATAREPORTERCOMPONENTSPECIFICATION................................................................................................69FIGURE20:DATASETCREATION...................................................................................................................................70FIGURE21:QUERYINGTHEDATASET.............................................................................................................................70FIGURE22:BIGDATARUNTIMECOMPONENTSPECIFICATION...........................................................................................71FIGURE23:MONITORINGSYSTEM................................................................................................................................74FIGURE24:SELF-CONTAINEDARCHITECTUREANDCOMPONENTSPECIFICATION....................................................................75

Page 7: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

7

Listoftables

TABLE1:REQUIREMENTF.1.......................................................................................................................................32TABLE2:REQUIREMENTF.2.......................................................................................................................................32TABLE3:REQUIREMENTF.3.......................................................................................................................................32TABLE4:REQUIREMENTF.4.......................................................................................................................................32TABLE5:REQUIREMENTF.5.......................................................................................................................................33TABLE6:REQUIREMENTF.6.......................................................................................................................................33TABLE7:REQUIREMENTF.7.......................................................................................................................................33TABLE8:REQUIREMENTF.8.......................................................................................................................................33TABLE9:REQUIREMENTF.9.......................................................................................................................................34TABLE10:REQUIREMENTF.10...................................................................................................................................34TABLE11:REQUIREMENTF.11...................................................................................................................................34TABLE12:REQUIREMENTSP.1...................................................................................................................................35TABLE13:REQUIREMENTSP.2...................................................................................................................................35TABLE14:REQUIREMENTSP.3...................................................................................................................................35TABLE15:REQUIREMENTSP.4...................................................................................................................................36TABLE16:REQUIREMENTSP.5...................................................................................................................................36TABLE17:REQUIREMENTSP.6...................................................................................................................................37TABLE18:REQUIREMENTNF.1...................................................................................................................................37TABLE19:REQUIREMENTNF.2...................................................................................................................................37TABLE20:REQUIREMENTNF.3...................................................................................................................................38TABLE21:REQUIREMENTNF.4...................................................................................................................................38TABLE22:REQUIREMENTNF.5...................................................................................................................................38TABLE23:EW-SHOPPENVISIONEDACTORS...................................................................................................................39

Page 8: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

8

Acronyms

Abbreviation Description

MVP MinimumViableProduct

NIST NationalInstituteofStandardsandTechnology

BDVA EuropeanonecalledBigDataValueAssociation

NBD-PWG NISTBigDataPublicWorkingGroup

ETL Extract,TransformandLoad

QoS QualityofService

DPU DataProcessingUnit

RDF ResourceDescriptionFramework

BI BusinessIntelligence

ROI ReturnOfInvestment

EAN EuropeanArticleNumber

TSV TabSeparatedValues

AWS AmazonWebServices

JSON JavaScriptObjectNotation

XML ExtensibleMarkupLanguage

KML KeyholeMarkupLanguage

OLAP Onlineanalyticalprocessing

NAS Network-attachedstorage

CSV CommaSeparatedValues

PPC PricePaidperClick

GDPR GeneralDataProtectionRegulation

SSO SingleSignOn

DoA DescriptionofAction

DW DataWrangler

Page 9: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

9

DA DataAnalyzer

DR DataReporter

BDR BigDataRuntime

API ApplicationProgrammingInterface

SPARQL SPARQLProtocolandRDFQueryLanguage

KG KnowledgeGraph

KB KnowledgeBase

BC BusinessCase

VM VirtualMachine

PETs Privacy-EnhancingTechnologies

CEP ComplexEventProcessor

MARS MeteorologicalArchivalandRetrievalSystem

ECMWF European Centre for Medium-Range WeatherForecasts

Page 10: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

10

Chapter1 Introduction

For some years now we have been witnessing the flourishing and reinforcement of Big Datamethodologies inacademiaandespecially in industry.ThissuccesscaninpartbeexplainedbythepromiseofBigDatasupporters tobeable todefineanddelivermoreaccurateandefficientdata-drivendecision-makingprocesses.Thisisnottheonlyasset,butitiswhatwearemostinterestedinhighlightingwithintheEW-Shoppproject.

The term "Data Analytics" is too often used to define a generic data driven decision process andtherefore covers heterogeneous activities such as cleaning, linking, enrichment, application ofanalyticmodelsuptobusinessintelligenceandvisualization.Inthisdocument,weproposetodesignthe EW-Shopp platform from its components and their relationships. We present a referencearchitecture obtained from a rigorous process of requirements elicitation that took into accountbothliteratureandbestpracticesaswellastheneedsexpressedbyinvolvedstakeholders.

A considerable amount of work has already been done within the consortium (presented indocumentD4.1 [1]) tooutline the strategy for the implementationofbusiness casepilots. In thatparticular document, not only the functional requirements associated with business cases (theimplementation and reporting of which in marketed services is one of the core aspects of theproject) have been collected and refined, but also the development guidelines based on leanmethodology [2] have been defined. In particular, thebuild-measure-learn feedback loop is alsopresented. Inanutshell, themethodologyenvisions foreachpilotastartingphasewithMinimumViableProduct(MVP)meanttoreducethetime-to-market.TheMVPisgoingtobeenrichedastheproject progresses and the vision matures. Pilots are therefore seen as playgrounds forexperimentingapproachesand technologies thateventuallywill be included in the finalmarketedservices.

Threestagesofthedevelopmentareforeseen:

• Firststage isbuildingthe integrateddataplatformfromheterogeneoussourcesasthekeyinputs.

• In the second stage the technical partners will need to present is analytical expertise tomodel these data to extract relevant analytical patterns, predictive strength and businesssense.

• Thelaststagewillbethedevelopmentofmarketedservicesfocusedontargetsegmentsontopoftheintegratedplatform.

At the timethisdocument iswritten,weareat thebeginningof the firstphaseand it is the righttimetodesignaunified,completeandwell-foundedreferencearchitecture.Thedeclaredobjectiveistoprovideapointofreferencethatwillremainassteadyaspossiblefortheentiredurationoftheprojectandthatcanthusleaditsevolutiononthebasisofasolidandagreedbasis.Todothis,aswewill see,westarted fromthe functional requirementscollected inD4.1 trying to read in themthe

Page 11: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

11

answers to the architectural questions that we asked ourselves. It was immediately evident,however,thattheserequirementswerenotsufficienttodefineamodern,efficient,responsiveandscalableplatform.Asaconsortium,wehavethereforedecidedtoundertakeanin-depthstudyoftheproblemsthatwewanttotackle.Theresultsofsuchteamworkaredetailedinthisdocument.

1.1 ObjectivesandScope

Withtheworktheconsortiumhasdoneoncomponentdesign,theprocessesandoutcomesofwhicharedescribedinthisdocument,weaimtoachievethefollowingtwoobjectives:

1. Defineareferencearchitecture.Thisdeliverableprovidesareferencearchitecturethataimsatbeinggeneralandflexibleenoughtobesuccessfulappliedinthepilots.Thearchitecturedescribes thecorecomponentsand their relationships in termsofdata flow.Moreover, itaims at setting a common language among the partners. Tomake this possible, we havereferred to the pilot descriptions, the partners’ experience, and the best practices of thefieldofBigData.Furthermore,theeffortrequiredtodefinepreciselythepilots,especiallyinterms of processes, data flow and workflow, has spawned awareness in the consortiummembers about the complexity of the problem and the need of a common reference.Finally, it is importanttonotethatdefiningareferencearchitecturedoesnotconflictwiththe lean methodology as we describe the components according to their generalfunctionality,withoutimposingspecificsolutions.

2. Propose an initial implementation of the platform. This deliverable also provides an earlystageimplementationofthereferencearchitecture,whosecomponentscanbedividedintotwogroups;thefirstonecontainstoolsuponwhichthereisageneralagreementamongthepartners, the second one features components that are either left unspecified or areconcretetooltobeconsideredasan initialchoice (subjecttobefurtherrefinement).Suchanapproach isnecessary for two reasons. Firstof all, following the leanmethodology thepilotsneedfreedomtoexperimentinordertomaketheplatformevolve.Secondly,partoftheecosystemmustbeadaptabletotheneedsofindividualpilots,whichmightrequiretheuseofspecifictechnologies.

1.2 RelationshiptoOtherDeliverables

Thisdocument isbasedon theworkdone inDeliverableD4.1 [1]. Inbothdocuments,we seek toprovide the groundwork for realizing the vision of EW-Shopp, but this is done in a largelycomplementaryway.TheD4.1outlinesthedevelopmentstrategyanddefinesthevariouspilotsaswell as their functionalities. The document presents an initial set of requirements, the mainelementsofabaselineplatformanda first versionof thedata flow.Here, the requirementshavebeentakenup,expanded,evaluatedandselectedfortheirimpactonarchitecture.Thearchitecturehasbeendetailedandreorganized.Inaddition,aclearstrategyforhandlinglargescaledatasetshasbeenoutlined.Finally,thedataandcontrolflowshavebeendefined.

Page 12: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

12

ThisdocumentmustalsobeconsideredlinkedtoD2.2[3]tobereleasedinM12aswell.Infact,D2.2presentsthefirstversionoftheEW-Shoppplatformdeploymentinthecloudenvironmentprovidedby SINTEF. The deployment is accompanied by a description of the work done and the varioussoftwaretoolsinvolved.Theobjectivesofthetwodocumentsthereforeoverlapinpart.Intheend,however,D2.2 responds to the specificneed topresent the current versionof theplatformwhilethisdocumentdescribesamoregeneralandcompletearchitectureandtheprocessthat ledto itsdefinition.

1.3 DocumentStructure

Thisdocumentisstructuredasfollows.InChapter2wepresentthereferencearchitecturesforBigData as well as the main solutions for data wrangling, data analytics, business intelligence andreporting. The requirements that have guided the definition of our reference architecture arepresentedanddiscussed inChapter3. TheEW-Shopp referenceplatform isdetailed inChapter4,where both the reference data flow and control flow are also presented. Once the referencearchitecturehasbeendefined,wehavedecidedtopresentapossibleimplementation(inembryonicform)oftheEW-ShoppplatforminChapter5.Finally,Chapter6concludesthedocument.

Page 13: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

13

Chapter2 ReferenceSolutions

The purpose of this chapter is to provide an exhaustive overview of the methodologies andtechnologies commonly adopted to create data preparation, analytics and reporting platformscapableofmanaginglargevolumesofdata.

2.1 BigDataReferenceArchitectures

In recent years,wehavewitnessed theproliferationof references forBigData architectures. Theaimof theseproposalswas to fill thegapbetween industryandacademiaby seeking toestablishshared objectives and taxonomies. In the particular field of Big Data architectures, industry hasassumed the lead and driving role for the entire sector, proposing and experimenting variousapproaches.Manyof the solutions thathavebeendevelopedwith this "trial anderror" approachhave been jealously guarded for a long time, but today we are witnessing a change in trend,especially due to initiatives of the open source community. Frommany quarters, it was deemednecessarytore-organizetheknow-howofthesectorandtoproposeastructuredandascompleteas possible vision of a software architecture that can be defined as Big Data. Both researchinstitutions and large companies have therefore proposed their architectural vision. Governmentinitiativesjoinedforces.Amongthese,certainlyworthmentioningaretheU.S.undertheNationalInstitute of Standards and Technology (NIST), and the European one called Big Data ValueAssociation(BDVA).

The NIST Big Data PublicWorking Group (NBD-PWG), born in 2013, and the BDVA, born in 2015share similar ambitions. In fact, their primaryobjective is to foster theparticipationofAcademia,Industry andGovernment in the creationof a community of interests for the adoptionof theBigData in many sectors. Both initiative, moreover, aims at developing consensus on definitiontaxonomies,referencemodels,andstandards.

2.1.1 NISTReferenceArchitecture

The work of the NBD-PWG results in the NIST Big Data Interoperability Framework, whichaddressesthemostimportanttopicsofBigData,thatis:

• Definitionandtaxonomies

• Usecasesandgeneralrequirements

• Securityandprivacy

• Architecturalsurvey

Page 14: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

14

• ReferenceArchitecture

• StandardsRoadmap

Volumes5[4]and6[5]oftheNISTBigDATAInteroperabilityFramework,whichproposeasurveyofthemostproposedarchitecturesandareferencearchitecture,respectively,fallwithinthescopeofthisproject.

Asfarasvolume5isconcerned,thesurveycoveredninereferencearchitectures:

1. ETStrategies

2. Microsoft

3. UniversityofAmsterdam

4. IBM

5. Oracle

6. Pivotal

7. SAP

8. 9Sight

9. LexisNexis

Theworkhasbeensystematicandmeticulous.Inparticular,foreachspecificimplementationataskhasbeen carriedout to identify the common components, towhich referencehasbeenmadebymeansofasingletaxonomy.Threemacro-componentshavebeenidentified:

• Big Data Management and Storage: provides Big Data capabilities to structured andunstructureddata.

o Structured,semi-structuredandunstructureddata

o Volume,variety,velocity,andvariability

o SQLandNoSQL

o Distributedfilesystem

• BigDataAnalyticsandApplicationInterfaces

o ETL(Extract,TransformandLoad)andMining

o Analytics

o ReportandVisualization

o Governance

• Big Data Infrastructure: infrastructure and system support with capabilities of integratinginternalorexternalsources.

Page 15: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

15

o Inmemorydatagrids

o Operationaldatabase

o Analyticdatabase

o Relationaldatabase

o Flatfiles

o Contentmanagementsystem

o Horizontalscalablearchitecture

Figure1depictstheReferenceArchitectureproposedbytheNIST,whichisexpectedtobeavendor-neutral, technology- and infrastructure-agnostic conceptualmodel of a BigData architecture. It istheresultsofanelicitationandsynthesisprocessbasedonthemajorproposals fromthe IndustryandtheAcademia.

ThemaincomponentsoftheNISTReferenceArchitectureare:

• Data provider: The data providers are the actors that provide the input datasets orinformationfeedstotheBigDataecosystem.

• System orchestration: Its purpose in the architecture is set up and manage the othercomponents to implement one or more workloads. The System Orchestrator may alsomonitortheworkloadsandsystemtoconfirmthattheQualityofService(QoS)requirementsarefulfilledineachphaseofthedataflowpipeline,andmayactuallyelasticallyassignandprovisionadditionalphysicalorvirtualresources.

• BigDataApplicationProvider:Thiscomponentofthearchitectureistheonetheimplementsthebusiness logicand theadded-value functionalities thatareexecutedby theunderlyingarchitecturelayers.TheBigDataApplicationProvidercan,inparticular,providessupportforthe following activities/functionalities: Collection, Preparation,Analytics,Visualization, andAccess.

• Big Data Framework Provider: This is the macro-component, made up of severalhierarchically organized layers, that provides the infrastructure (virtual and physicalresources, storage buckets, networks, etc.) and the tools for the management of highvolumesofdata(SQL,NoSQL, filesystems,etc.)andtoprocessthemefficiently,often inadistributedway(Batch,Interactive,Streamingworkloads).

• DataConsumer:asforthedataprovider,itcanbeanactualuseroranothersystem.Itistheentity that interacts with the platform (e.g., it can execute Search and Retrieve tasks, orimplement Analytics models) and consumes its outcomes (e.g., Reporting, Visualization,Download).

MoredetailsregardingtheNISTReferenceArchitecturecanbefoundin[5]

Page 16: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

16

Figure1:NISTReferenceArchitecture

2.1.2 BDVAReferenceModel

SimilarlytowhathappenedintheU.S.,Europeanpoliticshaveseentheimportanceofpromotingacommon space of public-private aggregation with the aim of fostering growth through thedevelopmentof aData-driveneconomy.This led to the stipulationof apublic-privatepartnershipbetween the European Commission (public partner) and the Big Data Value Association (privatepartner)assignedinOctober2014.

Recently,BDVAhasreleasedtheforthversionofitsresearchagenda[6],whichsummarizestheworkdonebyseveraltaskforcesinthedefinitionofprioritiesand,aReferenceModel(seeFigure2).

.

Figure2:BDVAReferenceModel

Page 17: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

17

TheFigureaboverepresentsgraphicallytwodifferentpointsofview,calledHorizontalandVerticalConcernsintheBDVAlingo.Thehorizontalconcernsregardthedifferentaspectsoftheenvisioneddata processing chain, from data ingestion to reporting and visualization tasks. Interestingly, italthough a clearmapping between the BDVA andNIST architectures can be identified, the BDVAreference model focuses more on the data, whereas the NIST architecture is more process andcomponent centric. The inclusionofdataprotectionanddatamanagementas first-class citizenofthearchitectureprovesit.UnliketheNISTarchitecture,theBDVAseemstoconsiderThins,SensorsandActuatorsasthemaindataprovidersandconsumers.ThedatacentralityintheBDVAplatformisclearlyevident inwhatarecalledverticalconcerns. In thiscontext, thearchitecturemakesexplicitreferencetodatatypesandtheirsemantics,andtodatasharingplatforms.

Finally,wewouldliketopointoutthatintheplatformproposedbyBDVAthereareconsiderationsrelatingtotheengineering,developmentandoperationsprocesses,whicharecompletelyabsentinthe NIST architecture as well as the need for standardization. This, although falling within theobjectivesofNIST,doesnotappearinitsreferencearchitectureasitisfocusedoncomponents,dataflowsandprocesses.MoredetailsontheBDVAreferencemodelcanbefoundin[6].

2.2 DataWrangling

DataWrangling(preparation,curation,andintegration)activitiesstillconsumeuptothemajorityoftheresources investedindataexploitationactions.TheEW-Shoppintegrationarchitectureaimstodeliverthebesttop-to-bottomsolutiontointegratebusinessdatawithdatacomingfromdifferentsectors, by using shared systems of identifiers for CoreData. Recently, several software solutionsappearedinthemarket,whichprovidetransformationtoolboxesanddataflowmanagementagenttorealizesuitablewranglingpipelines.

OpenRefine1 isanopensourcedesktop tool (since2010)previouslyknownasGoogleRefine.Thissoftware solutionprovides the toolingnecessary toperformcleanupand transformationactivitiesondataprovidedintabularform.AnOpenRefineprojectconsistsofasingledatatablethatcanbefiltered and enriched using data from external data sources. Data reconciliation usingWikidata isalsosupported.OpenRefinepresentsawebuserinterfaceservedbyawebserverrunninglocallyonthe user'smachine. This software'smain limitations are related to the size of the datasets it canhandle,thenumberofexternalsourcesanddatamonitoringcapabilities.

Trifacta2 isa commercial suiteofapplications for theexplorationandpreparation (transformationand enrichment) of raw datasets to make them suitable for the subsequent analysis phase. Thedeclaredobjective is toprovidenon-skilleduserswithspecific intelligenttools topreparedatasetsfordifferentanalysistypes.Todothis,advancedtechniquesofmachinelearning,parallelprocessing,andhuman-machineinteractionareemployed.Thecompanythatdevelopsthesetoolswasbornin2012asajointeffortofresearchersfromtheuniversitiesofBerkeley,WashingtonandStanford.Thesuite consists of three software solutions with increasingly advanced features that allow the

1http://openrefine.org2https://www.trifacta.com

Page 18: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

18

processingoflargevolumesofdata.SinceMarch2017,GooglehasbeenprovidingacloudversionofTrifactabrandedasCloudDataprep3.

UnifiedViews4[7]isanopen-sourcetoolforcreatingETLpipelinesthatnativelysupportsdatainRDF(ResourceDescriptionFramework).UnifiedViewsprovidestheabstractionsandtoolboxesneededtodefine,execute,andmonitorDataProcessingUnit(DPU)tasks.TheDPUssupportedbyUnifiedViewsare:

• Importingrawdatafromexternalsources

• TransformationofdataintoRDF

• Cleaningthedata

• Interlinkingwithotherdatasources

• Conflictresolution

• Italsoallowsthesharingoftransformationresults.

Karma5isadataintegrationtoolwhosemainobjectiveistoproviderapidandintuitivetoolsfordataintegration for users. Datasets can be generated from different sources such as spreadsheets,delimitedtextfiles,XML,JSON,KML,andWebAPIs.Moreover,KarmaisopensourceandversionscompatiblewithWindowsLinuxandMacoperatingsystemsareavailable.Karmadevelopersclaimtooffersomeinnovativefeaturesthatdistinguishthistoolfromothers:

• Usability: guaranteed by a graphical interface that gives users the possibility to refine asemanticmodelpreviouslygeneratedinautomaticmode.Inthisway,thecomplexityoftheassociationrulesishiddenfromendusers.

• Flexibility: Itsupportsavarietyofstructureddatasourcesthatcanalwaysbeanalysedandannotated semantically. These range from simple tables to the most complex structuredXML, JSON,andKML files.Data integration service is also supported forWebAPIs.Karmaoffersthepossibilityofcombiningdifferentontologiestoassociatetheirdatawithstandardvocabularyandensuremoreaccurateandaccuratedataintegration.

• Responsiveness and Big Data: karma uses only a subset of the data provided by users tocomplete its services, and thanks to this strategy it offers a responsive graphical interfaceandreducedcalculationtimes;inaddition,itprovidesmanyabatchexecutionmodefortheintegrationoflargedatasets;

ThislastconsiderationisparticularlyinterestinginthescopeofEW-Shoppasitoutlinesanapproachthatsharemanysimilaritiestotheonedesignedandimplementedintheproject.MoredetailscanbefoundinChaptersChapter4andChapter5.

DataGraft6 isaweb-poweredtoolthatfeaturesacollationofservicesformanagingtheprocessoftransformingdatafromtabularformatintoagraphformatanddeployingthedataontoadatabase.3https://cloud.google.com/dataprep4https://unifiedviews.eu5http://usc-isi-i2.github.io/karma6https://datagraft.net

Page 19: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

19

Thetransformationsupportsbothcleaningandmappingsteps.Transformationstepsonrows(add,drop, filter, duplicatedetectionetc.), columns (add,drop, rename,mergeetc.) andentiredataset(sort, aggregate etc.) are provided togetherwith visualization of the result after each step.MoredetailsofDataGraftcanbefoundinSection5.4.1.

2.3 Analytics

Thereisaplethoraofdifferentdataanalyticstoolsandarchitecturesavailabletoday.Ingeneral,thedifferencesbetween themareoftenminuteand the choice comesdown topersonalpreferences.Hereweprovideanoverviewofsomeofthemostknownandwidely-usedanalyticsplatforms.

Weka7 isoneofmostwell-knownmachine-learning suites. Itwasdevelopedand ismaintainedbytheUniversityofWaikato,NewZealandandiswidelyspreadduetoitscomprehensivecollectionofdatapre-processingandmodelling techniques.Themainareaswhere it isusedareeducationandresearch, where it is popular also due to its free availability through the GNU General PublicLicense8. It supports all stages of machine learning including pre-processing, modelling andvisualization.ItisdevelopedinJavaprogramminglanguageandoffersaGUIfrontend.

Orange9 followsthemottoofmakingdatamining“fruitfulandfun”.Developedandmaintainedatthe University of Ljubljana it is a machine-learning suite with a strong support for visualprogramming.Theuser canbuilda solutionbyvisuallydrawinga flowcharton theOrangecanvasusing elements from theOrange library,whichproduces a functional program in thebackground.These functionalities make it popular in education. Research users typically use it directly as aPythonlibrary.

R10differsfromthepreviousplatformsinthesensethat itsrootsaremoreintheareaofstatisticsand not machine learning. It is an open source programming language and an environment forstatisticalcomputingandgraphics. It iswidelyusedthroughoutresearchandbusinessandhasoneof the most extensive libraries of tools. As can be expected, the area of statistical modelling iscoveredbest.

Python11 is a general-purpose scripting programming language and not a data analytics platform.Nevertheless, its selectionofmodules supporting analysis of datamake it a popular choice.Mostfunctionalitiesofthepreviouslydescribedplatformsarefreelyavailablethroughmodulessuchas:

• Scikit–generalpurposeanalyticsandmodelling,• TensorFlow,Theano–powerfullibrariesformodellingusing(deep)neuralnetworks,• Pandas–alibraryofferingversatiledatastructuresfordataanalysis,• Matplotlib,Plotly–librariesfordatavisualization.

RapidMiner12 isacommercialdatasciencesoftwareplatformdevelopedby thecompanywith thesame name. As a general-purpose suite, it has support for data preparation, modelling and

7http://www.cs.waikato.ac.nz/~ml/weka8https://www.gnu.org/licenses/gpl.html9https://orange.biolab.si10https://www.r-project.org11https://www.python.org

Page 20: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

20

visualizationand isused foracademicaswellasbusinessapplications. Itsdesign follows theopencoremodelwiththefreebasicversionbeinglimitedto1logicalprocessorand10,000datarows.Itisdeveloped in JavaandoffersaGUI frontend. Its suiteofmodelsandalgorithmscanbeextendedusingeitherPythonorRlanguages.

Qminer13isanopensourcedataanalyticsplatformwritteninC++,especiallydesignedforprocessingoflarge-scaledatasetsinrealtime.Itsupportsbothstructuredandunstructureddataincludingtext,geospatial data, networks and time streams. The QMiner library contains an extensive array ofanalyticstoolscoveringclassification,regression,andpreprocessing,amongtheothers.MoredetailsonQminercanbefoundinSection5.4.2.

2.4 VisualizationandReporting

Big Data applications require new forms of processing to enable enhanced process optimization,insight discovery and decision making. Effective data visualization is a key part of the discoveryprocessintheeraofBigData,sincevisualizationisanimportantapproachtohelpingBigDatagetacompleteviewofdataanddiscoverdatavalues,byplacingtheminavisualcontext.Patterns,trendsandcorrelationsthatmightgoundetectedintext-baseddatacanbeexposedandrecognizedeasierwithdatavisualizationsoftware.

Data visualization is a Business Intelligence (BI) tool aims to create reports; dashboards and datavisualizationstopresentactionableinformationtousersthatcanmakeinformedbusinessdecisions.

Thepotentialbenefitsofbusinessintelligencetools,asDatavisualization,include:acceleratingandimproving decision-making, optimizing internal business processes, improving collaboration andinformationsharing,providingself-servicecapabilitiestoendusers,increasingReturnOfInvestment(ROI),savingtimeandreducingburdenonIT[8].

Theoutcomeofallvisualizationtoolsisacertainnumberofcharts/reportsprovided,beingvaluablebytheenduserandmakingpossibletoemphasizeaspecificfeatureofthedata.

EW-Shoppprojectaimstotakeadvantageofall thesebenefitsthroughthedevelopingofbusinessservices, which provide insights resulting from the analysis and visualization of large amount ofconsumer and market data, in multiple languages, collected by different business partners andenrichedwithinformationaboutweatherandevents.

AccordingtoEW-Shoppprojectneeds,thevisualizationanddatanavigationtierhastoprovideend-users simplifieddata access and serviceswithin aBigData visualizationplatformand insightfullydisplay of geospatial (transport/traffic), weather, network (linked) data with multiple attributes(multivariatedata)and/oratemporaldimension.

12https://rapidminer.com13http://qminer.ijs.si

Page 21: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

21

Several solutions exist in themarket, but with different dashoarding capabilities, API availability,licensing, and pricing. Most relevant solutions are: Tableau14, Google data studio15, Datahero16,Plotlysaas17,Raw18,Visualizefree19,Datawrapper20,Infogram21,Chartblocks22,Zohoreports23.

Looking at project context and after comparing all those different Big Data visualization tools,discoveringstrengthsandweaknesses,theKnowageSuite(atooldevelopedbyENGINEERING)canbeconsideredagoodchoiceinEW-Shoppprojectthankstoitsfeatures.

Knowage24 is an open source business analyticssuite that combines traditional data andBigDatasources intovaluableandmeaningful information. It represents theevolutionofSpagoBI, the firstsuite released by ENGINEERING and with a ten-year history. It is a suite of autonomous andcombinable products that allows everyone to define their own functional perimeter, with thepossibility of extending it at a later date. Actually, it is available in two versions: KnowageCommunityEdition(CE),withopensourcelicense,ensuringfullutilizationofanalyticsfunctionalitiesandfulloperativenessofthefinaluser;KnowageEnterpriseEdition(EE),withasubscriptionmodel,facilitating management operations for the installed park and ensuring the professional servicesrequiredforuseinbusinesssettings.

Knowageprovidesadvancedself-servicecapabilitiesthatgiveautonomytotheend-user,whoisabletobuildhisownanalysis,getinsightsondataandturnthemintoactionableknowledgeforeffectivedecision-making processes. The software is flexible because it adopts open standards and can beused in various environments without considerable requirements. Itsmodular approach, scalablearchitecture andopen standards ensure easy customization and thedevelopmentof user-friendlysolutions.

Tosummarize,themainKnowagecharacteristicsarethefollowing:

• ItisapureOpenSourceModel• Itisreleasedatnolicensefee• Itpresents“nosoftwareandvendorlock-in”,therighttousethesoftwareisseparatedfrom

thepurchaseofprofessionalservices• Itembracesinnovationandtheresultscomingfromresearchactivitiesandcontributions• ItisofferedalsoasreferenceimplementationsofFIWAREGEspecifications25• It has enabled connectors: the platform can gather information from multifarious data

sourcesbymeansofsomeconnectorsthatarereadytobeusedoff-the-shelf

14https://www.tableau.com15https://datastudio.google.com16https://datahero.com17https://plot.ly18http://rawgraphs.io19https://visualizefree.com20https://www.datawrapper.de21https://infogram.com22https://www.chartblocks.com23https://www.zoho.eu24https://www.knowage-suite.com25https://forge.fiware.org/plugins/mediawiki/wiki/fiware/index.php/Summary_of_FIWARE_Open_Specifications

Page 22: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

22

Chapter3 DrivingRequirements

Themainobjectiveofthissectioninthescopeofthedocumentistoelicit,gatherandorganizethecore requirements to be addressed in EW-Shopp Reference Architecture as well as in its actualimplementationasafirststeptowardstherealizationoftheprojectvision.

3.1 ElicitationMethodology

In order to collect the requirements that will drive the definition of the EW-Shopp Referencearchitecture,thefollowingstepshavebeencarriedout(seeFigure3):

i. Review of the relevant State of the Art in Big Data Architecture; In Chapter 2 of thisdocument,thestateoftheartanalysisofboththetwomainreferencearchitecturesandthemaindatatransformation,analyticsandreportingtoolshasbeenreported.AstheEW-Shoppproject is focusedondefining large-scaledata transformationandanalyticsprocesses, thestateoftheartofdatastorageanddataprocessinghasbeendeliberatelyneglected.Fromthis analysis, we mainly extracted a list of components that (sometimes under differentnames)arepresentintheBigDataplatforms.Someofthesewerenotconsideredbecausethey relate to real time analysis systems.Others, however, such as those relating to dataingestion, are not explicitly included among the requirements, because we have tried tofocusexclusivelyondatatransformation,enrichment,analysisandreportingofdata

ii. Collection of requirements defined in other deliverables; Deliverable D4.1 outlines thedevelopment strategy and defines the various pilots as well as their functionalities. Thedocumentpresentsaninitialsetoffunctionalrequirements,themainelementsofabaselineplatformanda first versionof thedata flow.The requirements that canbe found inD4.1mainlyconcernthebusinesslogicbehindthepilot;however,importantrequirementsaswellasa firstdefinitionof thecomponents tobe involved in thedefinitionof theplatformarepresented.

iii. Developmentof surveys for thestakeholders; Inorder toactively involve theconsortiumstakeholders,i.e.,theEW-Shoppbusinesspartners,intherequirementselicitationasurveyhavebeendeveloped.Thequestionsofthissurveymainlyconcernthecurrentstateofthedata transformation and analytics process, the size of the datasets to be processed, thepresenceornotofBigData solutionsonpremises. In Section3.2, the reader can find thesurveyquestionsandtherelatedanswers.

iv. Elicitationofarchitecturalrequirementsfrom(i),(ii),and(iii);Oncethesethreeparallelstepswerecompleted,twosets,oneoffunctionalandtheotherofnon-functionalrequirementswereidentified(seeSections3.3.1and3.3.2).

Page 23: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

23

v. Prioritizationoftherequirements.Finally,aprioritizationprocesshasbeencarriedout.Thegoalof thisprocess is to identify toppriority requirements.Consequently,both functionalandnon-functionalrequirementshavebeendividedintotwosubsets,namelymust,that isrequirements of primary importance, and should, containing those requirements whosenon-fulfilmentwouldnotaffectthesuccessoftheproject.

Figure3:Requirementsdefinitionprocess

3.2 RequirementsDefinitionSurvey

Below are the requests for clarification addressed to the stakeholders involved in theimplementation of Business cases. In particular, we asked to define and describe the currentplatform (where present), the possible deployment of the entire ecosystem and the expectedworkflow.

1. Describeherethecurrentbusinesscaseplatform(preEW-Shopp),inparticulartrytodetail:• Thecurrentdatatransformationprocesses(eitherbatchorstreaming)• Thedatasources(+externalAPIs)• Theapproximatedatasetsizes• ThecurrentBigDataarchitecture(ifany)–Analytics,Processingtools,database

2. Describetheenvisioneddeploymentfortheentireecosystem(currentbusinesscaseplatform+EW-Shopp tools). Inparticular, specifywhether thesystemwillbedeployedonpremises,entirelyonthecloudoruseahybridconfiguration.

3. Definethespecificworkflowfor thecaseofuse. Inparticular,aneffort isneededtodefinethefrequencyofuseforthetoolsmadeavailablebytheproject.Possibleexamples(amongothers)ofworkflowsare:

Page 24: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

24

• Fully static: The EW-Shopp components are used (once or rarely) to define thetransformationandenrichmentprocess,andvalidate theanalyticalalgorithms.Onceasatisfactory resulthasbeenachieved, thedefined transformationandanalyticsmodelsare implemented ina stable fashionwithin thebusiness caseplatform,giving itaddedvalue.

• Dynamic:Toolsareusedonaregularbasistocontinuouslyimprovethedataflowonthebusinesscaseplatform.

• Fully interactive: in this case the size of the datasets could be small and a Big Dataplatformunnecessary.Themainobjectiveoftheanalysesistogainknowledgeaboutthecorrelationbetweendataandeventsandweather.

3.2.1 Ceneje

CurrentplatformCenejegathersmarketandproductdatafrompartnersandstorestheminanSQLdatabasewhereitis transformed and merged to suit Ceneje business needs. All consumer activity is gathered andstoredinaNoSQLdatabase.Followingdatasetswillbeavailable:

• Cenejedataset–Consumerdata:Purchaseintent:Acollectionofuserjourneydata–pageviews, search terms, redirects to sellers and similar.Data is logged to localdatabases anddata from 1. 1. 2015 is provided. Local databases consist of SQL databases and NoSQLdatabases.FormoreinformationpleaseseeDeliverableD2.1[9]Section6.1.

• Ceneje dataset – Market data: A collection of seller quotes for products. For moreinformationpleaseseeDeliverableD2.1,Section6.16.

• Cenejedataset–Productdata:Acollectionofproductattributes(varyingfromgenericsuchasname,EAN(EuropeanArticleNumber),brand,categorizationandcolortomorespecificas dimensions or technical specifications). Data is collected from more than a thousandonlinestoresin5countriesandthenautomaticallyandmanuallymergedintoanorganizeddataset.FormoreinformationpleaseseeDeliverableD2.1,Section6.9.

AlldatawillbeavailableinTSVandJSONformatonAWS(AmazonWebServices)servers.Basedontheseproperties,theapproximatedatasetsizesare:

• Cenejedataset–Consumerdata:Purchaseintent:>approx.1TB,updatedwith50MBdaily• Cenejedataset–Marketdata:approx.200GB,updatedwith3GBdaily• Cenejedataset–Productdata:approx.100GB,updatedwith100MBdaily

Current Big Data architecture: Ceneje has no Big Data architecture. All current data is on on-premiseshardware,archiveddatahasbeenmovedeithertothecloud(AWS)ortoaNAS.

EnvisioneddeploymentCenejeprefers anon-premisesplatformdeploymentbut canalso settlewitha clouddeployment.Cenejewill be flexible according to EW-Shopp specifications and other partners’ needs. The final

Page 25: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

25

deployment architecture and resources will be customized depending on the needs andfunctionalitiesof thecoreEW-Shoppmodules:DataGraft/ASIA/ABSTAT (Fordatabasisenrichmentandmanipulation),QMINER(analytics)andKnowage(forreportingandvisualization).

WorkflowCenejeBusinesscaseconsistsoftwoservices.Forbothcases,dailyTSVsandJSONswillbesenttoEW-Shoppwiththepreviousdayinformation.Then,therequestedinformationwillbedifferentforBS1andBS2.

• BS1: Themain objective of this service is to knowproducts of interest for a certain user.Whenever a Partner wants to predict product of users’ interest, knowing the effect ofweatherandevents,hewillaskEW-Shoppforthisservice.

• BS2: Themainobjectiveof this service is to know categories of interest for a certainB2Bpartner.WheneveraPartnerwantstopredict interestingcategories,knowingtheeffectofweatherandevents,hewillaskEW-Shoppforthisservice.

Therefore, based on the number of products and user interactions Ceneje business services willinteractwith theEW-Shoppplatform indynamicand fully interactiveapproach (dependingon thedate,season,currentweatherdata,weatherforecastandsoon)

3.2.2 BigBang

CurrentplatformBig Bang stores all data (customer purchase history data, customer intent and interaction data,productattributes,productandcategoriessell-outdata) inrelationaldatabasesonMSSQLServer.ThedatasetsavailableforEW-Shoppprojectwillbe:

• BigBangdataset–Consumerdata–Customerpurchasehistory:Acollectionofsell-outdatamatchedwithcustomerbaskets inadefined timeframe.Dataset isavailable from2013andcanbeprovidedintotalorperpick-uplocation.Dataisloggedtolocaldatabasesthatconsistof SQLdatabases andNoSQLdatabases. Formore informationplease seeDeliverableD2.1.Section6.4.

• BigBangdataset–Consumerdata–Customer Intentand Interaction:A collectionofuserjourney data (page views, page events, search terms, redirects to channels, product views,etc.).Dataset isavailablesince2013.Data is loggedtoGoogleAnalyticsdatabase.FormoreinformationpleaseseeDeliverableD2.1.Section6.5.

• Big Bang dataset – Product and Categories data – Product Attributes and Sales Data: AcollectionofdetailedproductspecificationsforproductswhichareincludedinBigBangsellingportfolio (fromgeneric to specific technicaldetails).Dataset canbeprovided in totalorperstore location. Data is logged to local SQL databases. For more information please seeDeliverableD2.1.Section6.15.

AlldatawillbeavailableinSQLformat,onBigBangservesandcanbetransformedtovariousoutputfiles.Basedontheseproperties,theapproximatedatasetsizesare:

Page 26: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

26

• BigBangdataset–Consumerdata–Customerpurchasehistory: approx.800MB-1.2TB,updatedwith100MBdaily.

• Big Bang dataset – Consumer data – Customer Intent and Interaction: approx. 100 GB,updatedwith1GBdaily.

• BigBangdataset–ProductandCategoriesdata:approx.75GB,updatedwith50daily.

Current Big Data architecture: Big Bang system has a Big Data architecture based on relationaldatabase systems,mostlyworkingwithMySQL database enginewith custom import/export toolsand tools for OLAP (Online analytical processing) analysis. All current data is on on-premiseshardware, archived data has been moved either to the cloud or to a NAS (Network-attachedstorage).

EnvisioneddeploymentBigBangprefersanon-premisesplatformdeploymentbutcanalsosettlewithaclouddeployment.BigBangwillbeflexibleaccordingtoEW-Shoppspecificationsandotherpartners’needs.Thefinaldeployment architecture and resources will be customized depending on the needs andfunctionalitiesof thecoreEW-Shoppmodules:DataGraft/ASIA/ABSTAT (Fordatabasisenrichmentandmanipulation),QMINER(analytics)andKnowage(forreportingandvisualization).

WorkflowBigBangBusinesscaseconsistoftwoservicesforwhichdata inagreedformatwillbesenttoEW-Shoppwiththepreviousdayinformation.

Therequestedinformationwillbeasfollows:

• BS1:ThemainobjectiveofthisserviceistodetermineinfluentialfactorstosupportDecisionMaking i.e. which activities should be implemented by channels and content for specificcategoriesofinterest.

• BS2: This service supports the analysis and the visualization of the possible impact ofweather,events,marketingactivities,pricechangesetc.onsearchandsell-outofcategoriesandproducts.

Therefore, based on the number of products and user interactions Big Bang business servicewillinteractwith theEW-Shoppplatform indynamicand fully interactiveapproach (dependingon thedate,season,currentweatherdata,weatherforecastandsoon).

3.2.3 Browsetel

CurrentplatformCurrentdatatransformationprocesses:Withthegoalofgeneratingsuitabledatasetstobeenrichedwithweatherandevents informationandthefurtherprocessing,severaltransformationprocessesaredevelopedwithinBTITarchitecture.

Data Sources: With the aim of generating the most complete and real dataset possible, BT willcollecttrafficrelateddatawithinthefollowingdataset:

Page 27: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

27

• BTDataset–ContactandConsumerInteractionData:dataset,ofaproprietarystructure,inCSV format, collecting information related to customers’ interactions and other eventswithintheBTsystem(e.g.agents’login/logout,agentsspenttimeincertainstatus,agentstimeofjoiningacampaign,etc.).ItcoversinformationrelatedtoSlovenianregioninEnglishlanguage.FormoreinformationpleaseseeDeliverableD2.1.Section6.7.

Basedontheseproperties,theapproximatedatasetsizesare:

• BTDataset – Contact and Consumer InteractionData: > 600GB, updatedwith 5 GB permonth;

Current Big Data architecture: Actual BT system has a Big Data architecture based on relationaldatabase systems,mostlyworkingwithMySQL database enginewith custom import/export toolsandtoolsforOLAPanalysislikeMondrian.BTisalsoexperimentingwithApacheNifi.Withtheaimoffacilitating the scalability of the Core Data basis and their business partners’ access to theinformation, BT is providing implementation of its solutions on the premises, cloud or hybridplatforms, enabling integration of business partner already existing systems (via standard orproprietarydevelopedAPIs).

EnvisioneddeploymentBTplatformdeploymentwillbeenabledinapremiseorcloudplatformandwillbeflexibleaccordingto EW-Shopp specifications. The final deployment architecture and resources will be customizeddependingontheneedsandfunctionalitiesofthecoreEW-Shoppmodules:DataGraft/ASIA/ABSTAT(Fordatabasisenrichmentandmanipulation),QMINER(analytics)andKnowage(forreportingandvisualization).

WorkflowBTBusinesscase isdivided in threedifferent services.Forall cases, csvwillbe sent toEW-Shopp.PeriodanddatasetfordataupdatingwillbedefinedbyBT.

• BS1:Themainobjectiveof this service is toknowthebestmoment to launchacampaignbasedonthecharacteristicsof theservice,climatic factors,andevents thatmay influencetheobjectives set for that campaign. This bestmoment canbedefineddependingon theexpected goals to be achieved: interactions related to a certain campaign, product, etc.WheneverBTwantstolaunchanewcampaign,knowingtheeffectofweatherandevents,willaskEW-Shoppforthisservice.

• BS2andBS3:Thisservicesupportstheanalysisandthevisualizationofthepossibleimpactofeventsononlinesearches.Thegoalwillbetopredicthowandwhenaneventcanaffectuserbehavior.BTuserwillsetupthehowhewantstobewarnedandthentheplatformwillshow an alert in the period that BT user specified. Analytics (and visualization) can beperformedonabasisofBTdefinedperiod.

Therefore,basedontheamountofcampaignsthecompanyusedtomanageandthefactthattheyare dealing with all possible categories, BT business services will interact with the EW-Shoppplatform between a dynamic and fully interactive approach (depending on the date, weatherforecastandsoon).

Page 28: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

28

3.2.4 Measurence

CurrentPlatformMeasurence’s sensorplatformcomprisesofWi-Fi sensors, sensormanagerand sensormonitoringsystem.Wi-Fisensors,whichareplacedin/outofphysicallocations(i.e.smallandbigshops,out-of-home advertising, etc.) passively collect and anonymize Wi-Fi signals transmitted by nearbysmartphones. After anonymization and filtering from anomalies, unnecessary or private data andstatic devices, data aboutWi-Fi signals, namely time and signal strength of probe requests fromdevicesisprocessedbyMeasurence’sWi-FiAnalyticPipeline.TheWi-FiAnalyticPipelineisacomposedbyaseriesofmicro-servicesdevelopedinScalawiththeAkka framework26 and by a job procedure realized in Scala [10] that is run on anApache Spark27clusterandthattranslatestherawdata intoouranalytics. Inotherwords,Measurencealgorithmsestimatetheflowofpeopleacrosssensorsandlocations,peopletrafficinsideandoutsidealocation,and loyalty and engagement metrics. All described above raw and processed data afteranonymizationstepisstoredinthedatabase.Measurenceprovidestotheclients(retailers)theinformationabouttrafficofpeopleinandoutoftheirphysical location.Thepilotdatasetcontains informationabouthalf-hourly,dailyandweeklynumberofvisitorsandpedestrians(walkby)fortwoshopsandthedailynumberofreceipts(privatedata of the shop) for one of the location. SoMeasurence does not provide to the partners/pilotduring the project the raw data for individual devices, but already processed data – aggregatednumbers.Six months of Measurence data (csv-file) contains the data set of traffic (half-hourly, daily andweeklynumberofpeople)in/outoftheparticularlocation~0.5Mb.

EnvisionedDevelopmentEW-Shoppplatformwilladdtothedataanadditionallayerofintelligence:weatherandeventsdata.During the pilot Measurence expects to build the analytical model and consider/find differentcorrelations between number of receipts/visitors/walkby and different weather attributes liketemperature, precipitation,wind etc.Measurencewill use both its own algorithms / facilities andEW-Shoppmodules: DataGraft (for data basis enrichment andmanipulation), QMINER (analytics)andKnowage(forreportingandvisualization)tofindthemosteffectivewayofdataintegrationanddeliveringoftheanalytictothecustomers.Thetraining(pilot)datasetiscsvfilebutweplantousetheAPIsprovidedbytheEW-Shoppecosystemforthedataintegrationprocess.

Workflow

1. historical analysis – static approach. Measurence business case expects to study thebehavior of customers in particular location using historical data of traffic of people andweather/events.Sincethebehaviorofpeoplemightbedifferentindifferenttypeofstoresthat evidently depends not only on daily deviations of weather conditions but alsoseasonality,physical locationandcharacteristicsofproducts soldby the shop, this kindofhistoricalanalysisshouldbedoneforeveryMeasurencecustomer/locationwhichdecideto

26https://akka.io27https://spark.apache.org

Page 29: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

29

useEW-Shoppservice.Also,weshouldstudyhowtheforecastoftheweatherdependsonthe real weather and how it may influence on the people behavior and accuracy of theforecast of people traffic itself. The output of this study should look like “formula” in theformof“decisiontree”thatwillbeappliedtothedailydata inordertopredictnextweekdata.

2. weekly/daily data – interactive approach. As soon as the historical analysis is done,Measurencemakes a list of physical coordinates/addresses for the locations/storeswhichdesiretouseEW-Shoppservice(atthebeginningthesearetwopilotlocations)andsendsarequest to EW-Shopp platform onweekly (or daily) basis. EW-Shopp ecosystem response(ideally throughAPI) containsneededattributesofpastweekactualweatherand forecastfor a coming week.Measurence applies “formula” to the integrated data and shows thetrafficforecastonitsdashboard.

3.2.5 JOT

CurrentplatformCurrentdatatransformationprocesses:Withthegoalofgeneratingsuitabledatasetstobeenrichedwithweatherandevents informationandthefurtherprocessing,severaltransformationprocessesaredevelopedwithinJOTITarchitecture.Themainonesare:Automationofdataextractionbasedon platform APIs, definition and selection of campaigns and parameters containing relevantinformation,selectionoflinkingattributesforexternaldataintegration(likedata,location,etc.)

Data Sources:With the aim of generating themost complete and real dataset possible, JOT willcollectwebtraffic relateddata fromthreedatasources.Twoof themwillcontainthe informationregardingJOTmarketingcampaignsandtheotheronewillcontaintwittertrends.

• JOTDataset-Consumerdata:Trafficsource(Google):Thisdataset is inCSVformatand itcollects information related to different countries and different languages (Spain andGermanyforthepilot).ItcontainsalltheinformationrelatedtosearchesmadethroughtheGoogle platform. In order to access the data for the campaigns launched on the GoogleAdWordsplatform,GoogleprovidesanAPIwhichallowsustoconsultallthedataweneedquickly and intuitively28. Each row features different fields among which there are thenumberofclicks, impressions,andthe location. TheDataset iscreateddailymeaningthateverydaythedatasetcontainsonlythenewlygenerateddata.FormoreinformationpleaseseeDeliverableD2.1,Section6.19.

• JOTDataset-Consumerdata:Trafficsource(Bing):ThisdatasetisalsoinCSVformatanditcollectsdatagatheredfromdifferentcountriesandindifferentlanguages.ItcontainsalltheinformationrelatedtosearchesmadeinMicrosoftBingplatform,showingdifferentfieldsasthenumberofclicksorthelocation.Dataisupdateddailythatmeanseverydaythedatasetcontainsonlythenewlygenerateddata.FormoreinformationpleaserefertoD2.1,Section6.18.AsfortheBingAdsplatform,thewayweobtainthedataisverysimilartotheGoogle

28https://developers.google.com/adwords/api/docs/reference/v201607/ManagedCustomerService

Page 30: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

30

AdWordsplatformasoutlinedabove,giventhattheyhaveanAPItoobtainthesamedatathatisanalogous29.

• JOTDataset -Market data: Twitter trends: This dataset has a CSV format, containsOpendataandfocusesontrendingtopicsasavailablethroughTwitterAPIs.Dataisupdateddailythat means every day the dataset contains only the data newly generated. For moreinformationpleaseseeD2.1,Section6.20.APIRESTisusedtodownloadalltheinformationneeded

Basedontheseproperties,theapproximatedatasetsizes:

• JOTDataset-Consumerdata:Trafficsource(Google):>3TB,updatedwith4GBdaily• JOTDataset-Consumerdata:Trafficsource(Bing):1TB,updatedwith1.5GBdaily• JOT Dataset - Market data: Twitter trends: 2 million registers, updated with 24,000 new

registersdaily

Current Big Data architecture: Nowadays JOT has no Big Data architecture. With the aim offacilitatingthescalabilityoftheCoreDatabasisandtheaccesstotheinformation,JOThasmigratedthedatastorageprocessfromon-premiseshardwaretothecloud(MicrosoftAzure).

EnvisioneddeploymentJOTplatformdeploymentwillbeinacloudplatform(Azure),anditwillbeflexibleaccordingtoEW-Shoppspecifications.ThefinaldeploymentarchitectureandresourceswillbecustomizeddependingontheneedsandfunctionalitiesofthecoreEW-Shoppmodules:DataGraft/ASIA/ABSTAT(Fordatabasis enrichment and manipulation), QMINER (analytics) and Knowage (for reporting andvisualization).

WorkflowJOTBusinesscase isdivided in fourdifferentservices, so the frequencyofusecanbesplit in two,dependingontheBusinessServiceused.Forallcases,dailycsvwillbesenttoEW-Shoppwiththeinformationaboutthecampaignperformanceonthepreviousday.Then,therequestedinformationwillbedifferentforBS1andBS2,BS3andBS4.

• BS1:Themainobjectiveofthisserviceistoknowthebestmomenttolaunchacampaignbasedon the characteristics of the service, climatic factors, and events that may influence theobjectivessetforthatcampaign.Thisbestmomentcanbedefineddependingontheexpectedgoals tobeachieved:numberof clicks, averageposition,PPC (pricepaidper click).WheneverJOTwants to launchanewcampaign,knowing theeffectofweatherandevents,willaskEW-Shoppforthisservice.

• BS2,BS3andBS4:Thisservicesupportstheanalysisofthepossibleimpactofeventsononlinesearches.Thegoalwillbetopredicthowandwhenaneventcanaffectuserbehavior.JOTuserwillsetuphowhewantstobewarnedandthentheplatformwillshowanalert intheperiodthatJOTuserspecified.

29https://msdn.microsoft.com/enus/library/bing-ads-customer-management-apireference(v=msads.90).aspx

Page 31: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

31

Therefore,basedontheamountofdigitalcampaignsthecompanyusedtomanageandthefactthattheyaredealingwithallpossiblecategories,JOTbusinessserviceswill interactwiththeEW-Shoppplatform between a dynamic and fully interactive approach (depending on the date, weatherforecastandsoon)

3.3 ArchitecturalRequirements

The harvested requirements are presented and described in this section. Compared to a classicclassification into functional and non-functional requirements, we have preferred to make thefunctionalrequirementsofsecurityandprivacyexplicitastheyaredeemedofstrategicimportanceasrecognizedatEuropeanlevelbytheGeneralDataProtectionRegulation-GDPR[11].

It is important tonote that therequirementscollectedare thosethathaveadirect impactonthearchitectureoftheEW-Shoppecosystem.Youcaneasilyrealizethattheyhavebeenleftspeciallyatahighlevelbecausetheyhavetoanswerarchitecturalquestions:

1. What functionality should the platform support? The project is defined by its preciseobjectives,whichconcernonlycertainproblemsofBigDataarchitectures.Itisimportanttodefine what falls within the scope of the project and therefore needs to be analyzed indetail,andwhatcanbeachievedsimplybyfollowingbestpractices.

2. Towhich areas should the identified functionsbe attributed? Inotherwords,wewant todefinearchitecturebyidentifyingitscomponentsbasedontheirresponsibilities.

3. How do architecture components interact? The requirements have served us to reasonabout the relationships between the components, so thatwe candesign their integrationwhilekeepingtheobjectivesofeachcomponentseparatedasmuchaspossible.

4. Whatare theusersof thesystem,andhowdo they interactwith it?Thedefinitionof theworkflowanddataflowfurtherallowsyoutospecifythesystemarchitecture.

Thesequestionshaveguidedusinidentifyingthemaincomponents,dividingthemintogroups.TheyalsoallowedustoclarifymanypointsoftheprojectbyfocusingonimportantissuessuchasBigDataandsecuritymanagement.Finally,thisworkhasallowedustocreateauniquelanguageandalsotobetterdefinethecontoursoftheassociatedpilotprojects.

3.3.1 FunctionalRequirements

Belowarelisted,intabularform,themainfunctionalrequirementstobemetbytheplatform.Eachone of them is associatedwith a unique id, a short textual description and a level of stringency.Stringency can take on two values, namelyMust and Should,meaning that the achievement of aparticularrequirementcannotbecompromisedasit isconsideredafundamentalconditionforthesuccess of the project (must). Other requirements, on the other hand, are desiderata, i.e. theydescribe functionality that the system should possess in order to fully express its potential butwhoselackdoesnotaffectthesuccessoftheproject.

Page 32: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

32

Table1:RequirementF.1

Title Datawrangling

Req.ID F.1

Req.Description The platform must provide a (possibly open source) tool for the datatransformationprocess. The toolwill allow theuser to interactivelybuilddatatransformations. Among the activities supported by the tool there is thetransformationoftheuserdataintoRDFknowledgegraphs.

Req.Stringency Must

Table2:RequirementF.2

Title Dataannotation

Req.ID F.2

Req.Description Theplatformmustprovidea(possiblyopensource)toolforthedataannotation,data linking, and data enrichment. This software componentmust be able tohelptheuserduringthesemanticannotationandenrichmentofthedatasetbyincorporatinginformationavailableopenlyoronlytoconsortiummembers.

Req.Stringency Must

Table3:RequirementF.3

Title CoreData

Req.ID F.3

Req.Description The data enrichment processmust support the EW-Shopp Core Data sources,whichprovideinformationaboutEvents,WeatherandProducts(atleast).Suchdata sources can be accessed and consumed in the form of both remote andlocalAPIs.

Req.Stringency Must

Table4:RequirementF.4

Title DataAnalytics

Req.ID F.4

Req.Description The system must provide a (possibly open source) tool for the data analysisactivity.Theselectedtoolmustprovideefficientandscalable implementations

Page 33: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

33

ofthemainanalysisandmachinelearningalgorithms.

Req.Stringency Must

Table5:RequirementF.5

Title BusinessIntelligenceandReporting

Req.ID F.5

Req.Description The system must provide a (possibly open source) tool for the businessintelligence and reporting activity. The selected tool must provide support tobusiness intelligence, open standards, Big Data processing engine, and NoSQLanddistributeddatabases.

Req.Stringency Must

Table6:RequirementF.6

Title DataFormats

Req.ID F.6

Req.Description The system must guarantee support to the most common data format (e.g.,specifiedwithintheRFCstandards)torepresentdatatables

Req.Stringency Must

Table7:RequirementF.7

Title Interface-serviceseparation

Req.ID F.7

Req.Description Thecomponentsofthereferencearchitectureshouldprovideaclearseparationbetween user interface and service. The consortium, moreover, will favorcommunicationbasedonRESTfulAPIs.

Req.Stringency Should

Table8:RequirementF.8

Title Dataflowmanagement

Req.ID F.8

Req.Description Thesystemshouldprovidea(possiblyopensource)automationtooltosupport

Page 34: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

34

data flow (a.k.a. data logistics) among the different components of EW-Shoppecosystem.

Req.Stringency Must

Table9:RequirementF.9

Title DataStorage

Req.ID F.9

Req.Description The systemmust feature a persistent layer capable to handle data coming indifferent formats. At least column and graph based databased must besupported. Such layer may consist of either different federated solutionsdedicated to a particular database type or a single multi-model distributeddatabase.ThiscomponentmustprovidesuitabledataaccessAPI(fortableandRDF)compliantwiththestandardscommonlyrecognized.

Req.Stringency Must

Table10:RequirementF.10

Title Dataprocessing

Req.ID F.10

Req.Description Thesystemshouldprovidea(possiblyopensource)distributedprocessingtoolto supportdata flow (a.k.a.data logistics) among thedifferent componentsofEW-Shoppecosystem.

Req.Stringency Must

Table11:RequirementF.11

Title Cloudready

Req.ID F.11

Req.Description The software tools selected to be part of the EW-Shopp system shouldpreferablybeweb-basedinordertobereadyforpossiblefuturedeploymentincloudenvironments.

Req.Stringency Should

Page 35: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

35

SecurityandPrivacyrequirementsSecurity and privacy management is one of the main issues in the Big Data area. A debate iscurrently taking place within the scientific community and industry to definemethodologies andarchitectures capable of guaranteeing high safety standards [12]. States are also moving in thisdirection,asevidencedbytherecentlegislativeinterventionoftheEuropeancommunityintheareaofprivacymanagementwithregardtosensitivedatasets(GDPR[11]).

Thissectionpresentsthoserequirementsthatexplicitlyrelatetosecuritymanagementandprivacyin the EW-Shopp system. They concern security aspects, authentication, encryption ofcommunicationbetweenthevariousactors(softwareorhuman)involvedinthesystem.Duringthecollection, definition and selection of requirements, the importance of clearly defining theserequirements was recognized, for this reason a separate section of this document has beendedicatedtothem.

Table12:RequirementSP.1

Title Authentication

Req.ID SP.1

Req.Description Authentication is theprocessofconfirmingthe identityofanactor inorder toavoid possibly malicious access to the system resources and services.Authentication can be defined as the set of actions a software system has toimplementinordertograntuserthepermissiontoexecuteanoperationononeormoreresources.TheEW-Shoppecosystemshouldsupportthemostcommonauthenticationtechnologiesandstandards.

Req.Stringency Must

Table13:RequirementSP.2

Title Authorization

Req.ID SP.2

Req.Description Authorization is the process that implements the access control by specifyingaccess privileges to resources. In particular, suitable access policies should beconfiguredwithinthesysteminordertopermitselectiveaccesstoresourcesinformofdataandprograms.

Req.Stringency Must

Table14:RequirementSP.3

Title Encryption

Page 36: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

36

Req.ID SP.3

Req.Description Encryption is the process of encoding messages in order to set up securecommunication channels. Encryption, generally is based on the concept ofdecoding key, which is a piece of information able to decode the messages.Without the key, the message results unintelligible. EW-Shop consortium iscommittedtoemploymodernencryptionmechanismsasHTTPS,SFTP,SSL,andTLS.

Req.Stringency Must

Table15:RequirementSP.4

Title Anonymization

Req.ID SP.4

Req.Description With the term "anonymization" we mean the process of datatransformation/sanitization aimed at encoding or removing sensitiveinformation fromadataset.Anonymization is required toobfuscateandmakeunintelligible personal information that would allow the identification of aperson, and important business information that constitutes the competitiveadvantageofacertaincompany.TheoverallEW-Shopparchitecturemustoffera proper anonymization task/component in completion with the GDPRrequirements.

Req.Stringency Must

Table16:RequirementSP.5

Title SingleSignOn

Req.ID SP.5

Req.Description EW-Shopp ecosystem should provide a Single SignOn (SSO) to thewrangling,analyzingand reporting solutions. Single sign-on is a functionalityof anaccesscontrol system that allows auser to execute the authenticationprocessonce,gettingaccessrightsvalidformultiplesoftwaresystemsorcomputerresourcestowhichshe/heisenabled.

Req.Stringency Should

Page 37: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

37

Table17:RequirementSP.6

Title MonitoringandAuditing

Req.ID SP.6

Req.Description The EW-Shopp ecosystem should feature suitablemechanisms formonitoring,tracing,andauditingtheprocessesinexecutiononcorporatedata.Thosethreetermsrefertothepropertyoftracking(andanalyzing)theeventshappeninginthesystem.

Req.Stringency Should

3.3.2 Non-functionalRequirements

In software engineering, whilst functional requirements are used to define the behavior the EW-Shoppecosystemmusthave,non-functional requirements specify the criteria tobeused to judgethe operation of the system. In layman’s terms whereas functional requirements define what asystemshoulddo,thenon-functionalrequirementsspecifyhowthesystemshouldbedoingit.

Table18:RequirementNF.1

Title Scalability

Req.ID NF.1

Req.Description TheEW-Shoppapproachshouldbescalable,thatisitmustbeabletostoreandprocess large amounts of data. Ideally the system should be able to processincreasinglylargedatasetwithanegligibleimpactonperformances;however,aresultinwhichgigabyte-sizedatabasescanbetransformedandenrichedintheorderofminutestimecanbeconsideredsatisfactoryoutcomefortheproject.

Req.Stringency Must

Table19:RequirementNF.2

Title Adaptability

Req.ID NF.2

Req.Description Integrationbetweenthevariouscomponentsoftheplatformmusttakeplaceinsuch a way that they result being weakly coupled. In this way, the finalarchitecturemust be flexible so that it can be possible to replace any of thetoolswithanequivalentonewithout thishavingaconsiderable impacton the

Page 38: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

38

entireecosystem.

Req.Stringency Must

Table20:RequirementNF.3

Title Extensibility

Req.ID NF.3

Req.Description The proposed systemmust exhibit extensibility properties. Thismeans that itmustbedesignedinsuchawaytoallowinfuturetoaddnewsolutionsandnewtasksintheprocessofvaluecreationfromdata.

Req.Stringency Must

Table21:RequirementNF.4

Title Usability

Req.ID NF.4

Req.Description The system should exhibit a high degree of usability. In particular, wheneverpossible intuitiveandreactivegraphical interfacesshouldbeavailabletoguidetheuserinthedefinitionofthedatagrindingprocessaswellasinoperationsasmonitoringandmaintenance.

Req.Stringency Must

Table22:RequirementNF.5

Title Elasticity

Req.ID NF.5

Req.Description Thesystemshouldbeasmuchaspossibleelastic,thatistheresourceallocatedtotheitshouldvaryaccordingtotheneeds.Inparticular,itwouldbeadvisableto implement the necessary measures to allocate resources to systemcomponentsonlyfortheamountoftimestrictlynecessary.

Req.Stringency Should

Page 39: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

39

Chapter4 ArchitectureDesign

ThischapteraimstopresentanddiscussthereferencearchitectureofEW-Shoppbyhighlightinganddiscussing,wherenecessary,theprocessthatledtoitsdefinition.

4.1.1 EnvisionedWorkflowandActors

ThissectionisdevotedtothedefinitionoftheworkflowsthatcanbeimplementedontheEW-Shoppsystem. In particular, the requirements collected show the need for tools to design the chain oftransformationsthatbusinessdatamustundergotocreatevalueforprojectpartners.Furthermore,oncethesechangeshavebeendefined,wewantthepartnerstohavethetoolsattheirdisposalfortheirdeploymentandexecution.Thismeansthattheplatformmustsupporttwotypesofworkflow:

• Designtime:definingthelinkingandenrichmentprocess• Runtime:theexecutionofthisprocessinproduction.

InthetablebelowtheactorsinvolvedinoperatingtheEW-Shopparchitecturearedescribed:

Table23:EW-Shoppenvisionedactors

Actor Role

Analyst The data analyst is the actor in charge to interact with the system to definelinkingandenrichmentprocess.She/hemustalsobeanexpert indataanalyticstobeabletodesignsuitablepredictiveandprescriptivealgorithmstobeappliedontheenricheddataset.

Reporter Fromtherequirements, itemerges that theEW-Shoppecosystemmustprovidean application capable of carrying out business intelligence and visualizationactivitiesonaprofessionallevel.Webelievethatspecialskillsareneededtousethistool.Forthisreason,wehaveidentifiedintheuserofthistoolanactorthatwehavecalleddatareporter.

BigDataEngineer

Once the transformation process of the business data is defined, it must becarried out on a platform capable of processing large amounts of data. This isgenerallymadeupofmanycomponentsanditsuserequiresspecificknowledge.Forthisreason,webelievethatoneoftheactorsinvolvedshouldbeanengineerexperiencedinBigDataprocessing.

Page 40: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

40

4.2 ArchitecturalSpecifications

Thefirstandforemostoftheconsortium'sobjectivesindefiningreferencearchitectureistoprovideguidelinesforthedevelopmentofthepilotsandtheirsubsequentrefinementinmarketedservices.Consequently, themost significant requirement tobe satisfied isadaptability, i.e. thearchitecturethatweareabouttodescribemustbeabletoapplywithoutchangestoalltheplannedpilots.Thisisindeed challenging as pilots are characterizedbyuse scenarios,workflowand resources thatmayvary fromoneanother;nevertheless, theyallplan toexploit thedatasetsprovidedby theprojectpartners (hereinafter referred to as Core Data) and specific solutions for data transformation,analyticsandreportingpresentedintheprojectDescriptionofAction(DoA)aswellasindeliverableD4.1[1].

Forthisreason,thevariouscomponentsofthereferencearchitecturehavebeengroupedintothreemacroareas:

• CoreDataServices:Thesecomponentsprovideaccesstodatamadeaccessiblebypartnersorfreelyavailable.Thesedatasetsareusedinbothdatalinkingandenrichmentprocesses.

• PlatformServices:datapreparation,analyticsandvisualizationservices.TheyarelocatedatthetopofthereferencestackforBigDataarchitectures.Thefocusoftheprojectisontheseservices,theirintegrationandexploitation.

• Corporate Services: These components refer to tools for data ingestion, storage andprocessing,dataflowandsecuritymanagement.Ingeneral,thesecomponentsarefoundatthe lowest levelsofBigDataarchitecturesand,moreover, inourcasetheydependheavilyonthetechnologiespresentatcustomers'premises.

Asmentionedabove,thefirsttwogroupsarethecentralfocusoftheprojectandwillthereforebedescribedextensivelyinthisdocument.AsfarastheCorporateServicesgroupisconcerned,wewilllimit ourselves to indicating themain components and describing their functionalities in order toprovidethereaderwithacomprehensiveoverview.

Figure4depicts thereferencearchitecture forBigDataproposedbyNIST inwhichwehighlightedthecomponentsoftheEW-ShoppPlatform(withinthegreenrectangle)distinguishingthemfromalltheothercomponents(in lightred)thatareconsideredasCorporateservices.Asyoucansee,theEW-Shoppecosystemincludesthreemacrocomponentsthatrelatetothepreparation,analyticsandvisualizationprocesses(req.0,F.2,F.4,F.5).EW-ShoppCorporate,ontheotherhand,presentsthecomponentsneededtostorelargevolumesofdata,processtheminadistributedandscalableway,data ingestion systems, data flow orchestration, and security, i.e. authentication, authorization,encryption and anonymization (req. F.8, F.9, F.10, SP.1, SP.2, SP.3, SP.4, SP.5, SP.6). Core DataservicesarenotincludedinthefigureastheyhavenocounterpartintheNISTarchitecture.

Page 41: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

41

Figure4:MappingEW-ShoppplatformcomponentsonNISTreferenceArchitectureforBigData

Ontheotherhand,thefocus inFigure5 isexclusivelyonthecomponentscontributingtotheEW-Shoppreferencearchitecture.UnlikethesolutionsproposedbyNISTorBDVA,thearchitecturethatwe describe in this document intentionally exclude the infrastructure layer, data ingestion andaccess layer.While thesecomponentswillbe considered in the implementationphasewebelievethat presenting and discussing them here would have added no value to the architecturespecification,whichaimstobeespeciallyfocusedonthelarge-scaleexecutionofdatapreparation,analytics, and reporting toolsbelonging to theapplication layer (i.e. theNISTBigDataApplicationProvider).

UsingthesamecolorschemeusedinFigure4asaninterpretationguide,wespecifiedtheplatformcomponentsatapplicationlevelinFigure5.Theyare:

• DataWrangler(DW),whichisthecomponentthatisdesignedtoenabletheusertodefineatdesign time the transformations of data cleaning, linking and enrichment. Such datapreparation processeswill then be carried out by the component referred to as Big DataRuntime.

• Data Analyzer (DA), which is the component that provides a set of predefined tools forpredictiveandprescriptivealgorithmsonenricheddata.

• DataReporter (DR),which is the component that allows theuser to visualize and analyzefromabusinesspointofviewtheoutcomesproducedbytheAnalyzerdata.

Moredetailsonthiscomponentand its implementationcanbefound inSection4.2.3andSection5.4.As regards the so-calledCorporate services, these are gathered in a singlemacro-componentcalled Big Data Runtime (BDR). This part of the architecture provides the capability to executetransformation operations defined by the Data Wrangler and analytics operations of the DataAnalyzeronlargeamountsofdata(req.NF.1,NF.5). Inaddition,thiscomponentprovidesthedatareporterwith specific APIs for data access.Within theBigData Runtime,we recognize (using theterminologyproposedbyNIST)asub-componentinchargeofDataStorageandanotherdedicatedto the Processing of such data (req. F.8, F.9, F.10). Moreover, we included in the picture a

Page 42: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

42

componentresponsiblefororchestratingtheprocessestobeperformedandthedataflow(SystemOrchestration).

Thearchitectureforeseesthepresenceofcrosscuttingservicestoenforceappropriatesecurityandmonitoringpolicies(SecurityServices).

Unlike the previous picture, Figure 5 includes an area (in light blue) that groups together thecomponentsforthemanagementoftheCoreData(req.F.3).TheseareaccessibleviagenericRESTAPIs (req.F.7,F.11,NF.2,NF.3).Withnoambitionforcompleteness, thefigureshowstheAPIs forevents, weather and products data. Other enrichment sourcesmay include free datasets such asDBpedia30.

Clearly,sincetheimageinquestiondoesnotconstituteadeploymentdiagram,havingpresentedthethree logical sections of the architecture separately does not imply that they will be deployedindependently and/or geographically apart. On the contrary, as we will see later on. Thearchitecture, thecontrol flow,and thedataflowpipelinehavebeenconceived insuchawayas torealizedifferenttypologiesofdeployments,fromthecompletelydistributedoneinwhichthethreelogicalregionsarealsophysicallyseparatedanddistant,tothecentralizedone inwhichtheentirearchitectureisimplementedwithinthesamefacility.

Figure5:EW-ShopplogicalreferenceArchitecture

30http://dbpedia.org/ontology/

Page 43: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

43

4.2.1 TheDimensionReductionApproach

The EW-Shopp system aims to support e-commerce, retail andmarketing industries in improvingtheir efficiency and competitiveness through providing the ability to perform predictive andprescriptiveanalyticsover integratedandenrichedlargedatasetsusingopenandflexiblesolutions(req.F.4,F.10).Inaddition,thesetoolsmustprovidearesponsivegraphicaluserinterfacetoguidetheuserindesigningdatatransformations(req.F.7,F.9,NF.4).Theseobservationstogetherwiththeconfidentialityrequirement(req.SP.4)leadustobelievethatthedatatransformationdesignphasehastobecarriedoutonasmallandpossiblyanonymizedsubsetoftheinitialdata.Forthisreason,wehave introduced into theplatforma specialized component, namedSampler,whose task is toproperlygeneratethisdatasubset.

The idea of reducing the size of the dataset to be able to handle it more easily is not new[13][14][15][16][17]andiscommonlyreferredtoasDimensionReductionorBigDataReduction.Theapproach is rather simple in its general lines; it consists in reducing the size of the dataset byidentifying a possible compact representationof it. In thisway, thedata transformation anddataanalyticsoperationsinEW-Shoppcanbedesignedandtestedonasmallersetthantheoriginaldata,which should ensure greater responsiveness and efficiency for the applications involved withoutnegativelyaffectingaccuracy.

Figure 6 graphically illustrates how the reduction approach is implementedwithin the EW-Shopparchitecture. First of all, we point out that only the Data Wrangler and the Data Analyzer areaffectedbythismethodologyasthedatareporterwillvisualizeandinspectdataofsmalldimensionswhichwereobtainedasaresultoftheDataAnalyzerresults.

As faras theDataWranglerandDataAnalyzerare concerned, theoperationsassociatedwith theDimensionReductionApproacharethefollowing:

1. Creatingareduceddataset(calledSampleinthediagram).Suchsamplemaydependontheparticularoperationtobecarriedout(preparationoranalytics)

2. Theuseroperatestheapplicationworkingonthesample.IntheparticularcaseoftheDataWrangler,italsoreceivesininputacollectionofrecommendationstoguidetheuserintheprocessoftableannotation.

3. Theapplicationgeneratesamachine-readabledescriptionoftheuser'soperations

4. Themodel(oftransformationoranalytics)isexecutedontheinitialdatasetbytheBigDataruntimecomponent.

It is importanttonotethatthischoicedoesnotreducetheapplicabilityandgenericityof theEW-Shopp solution as, where possible and desired by the user (e. g. in cases where the data to betransformed ismanageableanddoesnotrequireanonymization), thesamplingcomponentcanbeexcluded.Inthismode,theuserisfreetoworkdirectlyagainsttheoriginaldataset.

Page 44: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

44

Figure6:EW-ShoppreferenceDataflowpipeline

4.2.2 CoreDataServices

The purpose of this section is to introduce the first group of components thatmake up the EW-Shopp architecture, i.e. those that provide access to theCoreData. Since suchCoreDatasets canvary depending on the particular deployment we focus here only on those that are consideredindispensablebytheconsortiummembers.Wethereforeonlyrefertothethreemaindatasets,i.e.weather,eventsandproducts.

Please note that the Core Data and services guaranteeing access to them have been coveredextensivelyinD1.1[18].Thatiswhywewillonlybrieflyintroducetheminthisdocumenttoensurethatarchitecturecanbedefinedasexhaustivelyaspossible.

WeatherThis service, in simple terms, is intended to provide access to weather information related to acertainlocationandtime.Inaddition,theservicemustalsobeabletoprovideweatherforecastsfora240-hourperiodinthefuture.Forthefirst90hours,thetimestepis1hour,fromhours90to144thetimestepis3hoursandfromhours144to240thetimestepis6hours.

Theweather information, as said, is also related to spatial information. Thismeans that it canbequeried providing in input the opposite corners of a reduced Gaussian grid, which is regular inlongitudeandalmostregularinlatitude.

Page 45: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

45

EventsThepurposeofthiscomponentistoprovideaccesstoinformationonpastorplannedevents.Eventsareobtainedandcontinuouslyupdatedbycollectingandprocessingwebcommunicationchannels(suchasnewspapersandsocialmedia).Integrationofevents,withinamarketingcampaignstrategyis strongly connected to the events’ characteristics where each of events can be described as apublic assembly for the purpose of celebration, education, marketing or reunion. Also, otherclassificationsarepossible.Forexample,basedontheirtype,eventscanbeclassifiedas:

• Sportsevents,• Social/life–cycleevents,• Corporateevents,• Educationandcareerevents,• Entertainmentevents,• Politicalevents,• Fundraising/causerelatedevents;

Bytheirdispersionlevel,eventscanbedefinedas:

• Globaleventswith impact to localpopulation (e.g.Olympics,WorldCups,TourdeFrance,DubrovnikSummerFestival),

• Regional events (large local events with great (international) participation, e.g.MarathonFranja,MetalCampTolmin),

• Localevents;

ProductsThis component provides an API for access to GFK's product catalog. GFK provides product datasolutionsforretailers,resellers,wholesalers,manufacturersanddistributors;fromproducttechnicalspecificationsand images, tomarketing text,merchandizing functions, richcontentandmore.Theproduct catalog consisting of more than 20 million products, structured and optimized for SEO,across 56 markets and 30 languages. Category/product data include structured datasets with anumber of dimensions from product attributes, life cycle, price development, though consumerinteraction data within the purchase funnel model divided to interest, research, intent and buybehavior.

To retrieve specific features of products and categories, expressive queries on products andcategories using web-based data access methods must be supported. To this end, products andcategoriesshouldberepresentedinnativeweb-compliantlanguages,liketheonesproposedintheLinkedData paradigm,where, using a SPARQL [19] endpoint, queries canbemade to provide fordataextraction.Asaresult,productdatawillbepublishedasKnowledgeGraphs31(KGs)usingRDF-basedstandardlanguages.

OtherReferenceDatasetsIn addition to the presented services, there may be components that provide access to otherdatasetswithgeographicinformationandotherLinkedOpenDataarereferencedata,whichcanbe

31https://www.google.com/intl/bn/insidesearch/features/search/knowledge.html

Page 46: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

46

usedasbridgestojoinbusinessandexternaldatasources.ThosedatasetswillbealsoprovidedwithaSPARQLendpoint.

Especially important for the project is the access to geographical information, including spatialentities at different levels of resolution, e.g., states, regions, cities, districts, locations andattractions, using standard systems of identifiers, and shared geospatial reference systems andvocabularies.ThereareverycommonandveryusedKGssuchasGeoNames32,LinkedGeoData33,andDBpedia34thatcanbeusedasbasedatasetsforthoseservices.

4.2.3 PlatformServices

In this section,we present the components that form the core of the EW-Shopp platform. Thesetoolscanbelocatedattheapplicationlevel,i.e.atthehighestleveloftheBigDatastack.Usingtheterminology proposed by NIST [5], they cover the areas of Data preparation, Analytics andVisualization. These components are called Data Wrangler, Data Analyzer and Data Reporterrespectively.

DataWrangler(DW)Datatobe importedwill typicallybe intabularformatwithvaryingqualitywithrespecttomissingvalues,invalidvalues,duplicatedrecords,etc.Improvingqualitythroughdatacleaningisimportantwhendifferentdatasetsareconnected in the finaldatagraph.Whendataquality is sufficient, thedatacanbetransformedbyselectingthe importantcolumnsandrowsthrougha filteringprocess.Thiscanreducetheamountofdataandmakethedataseteasiertostoreandquery.Thedatasetwillfurther be enriched with data from other domains like event/weather data. Depending on theapplicationarea, thedataset canbe converted/mapped intographofdata. Themappingmustbedone according to an ontology and vocabulary for the data set. Themapping/annotation step istypically themost time consuming. The data is stored in an appropriate format for the receivingapplications.Asampledatasetisusedforspecifyingandassessingthetransformationandmappingsteps using online tools. A transformation model is generated for batch execution on Big DataRuntimeagainstthefulldataset.

Figure 7 shows a component diagram of the DataWrangler tool with component interactions intermsofdataexchanging. The image represents themost general case, i.e.where thedata tobetransformed is very large andmanaged by the corporate area of EW-Shopp. In this scenario, weapply the Dimension Reduction Approach presented in Section 4.2.1. The components of everylogical area of the platform are involved. We have on the left the Data Wrangler Sampler, thesummarizer and the engine that are components that interact with the Big Data Runtime togeneratethesampletobeusedtodefinethetransformations,torealizeasetofsuggestionsforthetable annotation process based on summarizations of existing knowledge bases or previousannotations, and finally to perform on a large scale the transformations defined by the user,respectively. The transformationprocess is definedby theuser through a graphical interface thatinteractswithabackendservice.32http://www.geonames.org/ontology/documentation.html33http://linkedgeodata.org/ontology34http://dbpedia.org/ontology

Page 47: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

47

Bothforthedefinitionoftheentireprocess(whichrequiresthetransformationoftheworkdataset)andforitsactualexecutiononthebusinessdataset,theDataWranglerprocessmayneeddatafordisambiguation,linkingandenrichment.ThesedataareprovidedbytheCoreDataservices.

Figure7:DataWranglerarchitecture

DataAnalyzer(DA)TheroleoftheDataAnalyzer(DA)componentistosupportalldataanalysisandmodellingneedsofthe platform. It receives enriched data from the data wrangling stage and produces descriptivestatistics offering insight into the data as well as operational data models that support businessfunctionality.Itsoutputsarevisualizedbythereportercomponent.

ThemainworkingmodelsoftheDataAnalyzerareLearningandPrediction,respectively:

• Learning–isusedwhenamodelhastobebuiltfromhistoricaldata.Typically,thisisabatchjob procedure where a large dataset is provided to the analyzer, which processes it andoutputs amodel performing the analytic function. The latter can be a prediction of somevalue, such as the number of page viewswithin a certain category of products on awebstore,oramostlikelycategoryforagivennewproduct.Learningconsists:

• Dataanalysis–Thisincludesseveraltypesofanalyticstasks.Forinstance,classificationof products into categories based on their properties and descriptions, clustering ofsimilar products from different sources (stores) and duplicate detection, time seriesanalysis of sale and web click through logs with identification of factors that impacttrendsetc.

• Modellingspecification–Platformusersarecompaniesthatvaluetheirinternaldataandmaynotbepreparedorabletoshare itoutside.Theanalyzercomponentneedstobeabletoformamodellingspecificationusingjustasampleofthedata.Thisincludesthemodellingalgorithmaswellasfeaturesusedandhowtocomputethemfromthedataandanyotherinformationneeded.Thespecificationcanthenbetransportedfromtheplatformtotheanalyzerinstanceonuserpremiseswherethemodelcanbebuiltontheentiredatasetfollowingthespecification.

Page 48: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

48

• Deployablemodels–Modelsbuiltbytheanalyzeranswerbusinessneedsoftheplatformuser.Forusetheyhavetobeintegratedintotheworkflow.Theirdesignneedstoallowforloosecoupling(req.NF.2,NF.3)andquickdeployment.

• Prediction – This is the production mode of the Data Analyzer component, used foroperationaldeploymentofthelearningmodeoutput.Themodelbuiltinthelearningstageisgivennewexamples toprocess.For the twoexamples fromthepreviousparagraph, thesewouldbedataabouttheweatherandeventcontext foranupcomingdatewiththetargetwebstorecategoryanddescriptionsofnewproductssubmittedtoawebstore.Themodelprocessesthenewexamplesandreturnsthepredictedvalue.

Figure8showsthecomponentdiagramoftheDataAnalyzerwheretheinteractionsintermsofdataexchangingamong theelementsarehighlighted.The image represents themostgeneral case, i.e.wherethedatatobeprocessed isvery largeandmanagedbythecorporateareaofEW-Shopp. Inthis scenario, we apply the Dimension Reduction Approach presented in Section 4.2.1. For thisreason, thecomponentsof theCorporateandPlatform logicalareaare involved.Onthe left-handside of the figure there are the Data Analyzer Sampler and the Data Analyzer engine that arecomponentsthatinteractwiththeBigDataRuntimetogeneratethesampletobeusedtodefinetheanalytics and to perform on a large scale some user-defined analytics, respectively. The analyticsprocess is defined by the user through aweb graphical interface that permits the user to definealgorithmsandtestthemonlocaldata.

Figure8:DataAnalyzerarchitecture

DataReporter(DR)Data Reporter (DR) component, as shown in Figure 5, follows theWrangler (Transformation andEnrichment)andAnalyzer(Analytics)componentsinthelogicalworkflowforeseenintheEW-Shoppplatform. This component should perform different tasks, but one of the most important is toprovideH2M(HumantoMachine)capabilities,tosupportuserswithgettingtheanalyticsresultsinauser-friendly and extremely concise structure, with respect to the volume of processed data, in

Page 49: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

49

ordertoextractoperationalknowledgefromthevaluederivedfromtheanalysis.Forinstance,theseinsightsmaysupportthemanagementteaminthedecision-makingprocess,byexploitingthepasttrendsandhighlightingsomefavorableconditions.This scenariowillbecarriedoutmainly inBC1,BC2andBC4.Thescopeshouldbetodevelopadhocdashboardstohighlightspecificaspectsoftheinvolved data, and make them available and constantly updated to the interested stakeholders.Moreover,inBC3,orothersimilarscenario,enricheddatawouldbemadeavailabletostakeholdersoutsidetheplatform.SuchdatamayinturnbecomeinputforothersystemsinaM2M(MachinetoMachine) collaborative scenario, where the Data Reporter output format will facilitate thisknowledgetransferbyadoptingstandardformats(e.g.XML,CSV,etc.).

ThemainexpectedfeaturesoftheDataReportercomponentfollow:

• togivetheopportunitytoworkwithlargedatasourcesandtraditionalones;• to allow the creation of dashboards (with different graphic representations) providing an

easierandconcisedataanalysisreading;• tomakethesecockpitsavailabletousersbothwithintheplatformandthroughexternallink;• toexportofflinereportinstandardformats(e.g.XML,CSV,etc.).

Finally,Figure9showsthemaincomponents involvedinthedefinitionofthebusiness intelligenceandvisualizationtool:thedatareporter.Unliketheothertwotoolsthat,together,definetheheartoftheEW-Shoppplatform,thedatareporterisnotbasedontheDimensionReductionApproachoftheoriginaldataset,presentedinSection4.2.1.ThisisnotnecessarybecausethedatatobehandledbythispieceofsoftwarearegenerallysmallinsizeasaresultoftheAnalyticsprocess.Inthemoregeneral scenario, these datamust bemoved from the corporate to the platform side of the EW-Shoppecosystem.Todothis,wehaveacomponentcalledtheDataAccessAPIwhichprovideaccessto the analytics dataset. Once the data has been accessed, the user can perform businessintelligence operations and design the final report through a graphical web interface thatcommunicateswithaservicetier.

Figure9:DataReporterArchitecture

Page 50: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

50

4.2.4 CorporateServices

As mentioned above, in the reference architecture we impose a logical separation betweencomponentsbydividingthemintoplatformservicesandcorporateservices.Thisstructurehasbeenconceivedtomakethearchitecturemoreflexibleandadaptabletodifferentdeploymentscenarios.Inouropinion, this isnecessary inorder to implement the leanmethodology in thedevelopmentcyclesofmarketedservices.

Wehave introducedtheplatformcomponents in theprevioussection, in this sectionwe focusonthe services of the corporate section. In themore general case, these components are physicallydistant fromtheplatformcomponentsand interactwiththemviaAPIs. Intheearlystagesofpilotdevelopment,wedonotruleoutthepossibilitythattheinteractionbetweenthecomponentsofthetwologicareascanbecarriedoutmanuallybymeansofoneormorehumanusers.

BigDataRuntime(BDR)TheBigDataRuntime (BDR) component's role in theEW-Shopp solution is toperformbatchdatacleaning, integration, enrichment and data transformation on large amounts of input data. Theobjective of theworkflows is to eventually expose the data in a central database for performinganalytics.

TofulfiltheaforementionedroleintheEW-Shoppplatform,theBigDataRuntimecomponentneedstosupportthefollowingmainfunctionalities:

• Deployment ina cluster configuration– inorder tohandle the scaleofdata, theBigDataRuntimecomponenthastobeabletodistributetheindividualworkflowstepsoveraclusterofvirtualmachines (VMs)orcontainers.TheclusterhastobedeployedonthepremiseoftheEW-Shoppplatformuser tomaintain the confidentialityof their data. The cluster alsoneeds to provide access to the input data to all members of the cluster, for example bydeployingdataareonareachablenetworklocationoronamountedsharedfilesystem.

• Executionofthedataflow–TheBigDataRuntimeexecutesadataflowpipelineconsistingofthestepsspecifiedbytheusers.Toprovideanoverviewoftheexecutionandfeedbackonthe results, the workflow is defined and displayed through a user interface. The dataworkflowsneedtobereusedwhennewdatacomesinthesharedinputdatasteps.Thedataworkflow steps need to be able to be scaled up and down, automatically, or semi-automatically,accordingtotheirload.

• DataStorage–TheBigDataRuntimedeploystheoutputofthedataworkflowinacentraldatabase,whichisaccessiblethroughthenetwork,ordeployedonthesamecluster.

ThetoolsandservicesforprovidinganalyticsandreportingfunctionstotheEW-Shoppconsumethedataproducedby thedataworkflowexecutedby theBigDataRuntime (BDR) through thecentraldatabase.

Page 51: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

51

4.2.5 CrosscuttingServices

The section aims at introducing the crosscutting services, that is those components that are inchargeofimplementingfunctionalitiesthatspanoverseverallogicalareasofthearchitecture.

SecurityServicesEW-Shopp reference architecture encompasses also security and privacy solutions as theirenforcementassumeparamountimportanceinBigData;forthisreason,weapproachthoseaspectsfocusing on suitable Privacy-Enhancing Technologies (PETs) [20] to implement suitableAuthentication, Authorization and Encryption mechanisms as stated by requirements SP.1, SP.2,SP.3,SP.4.Specifically,seekingforflexibilitywesuggesttheadoptionofsecurityprotocolsandtoolsimplementingtherole-basedaccesscontrolmodel.Itbindstheauthenticationprocesstodependonthe actor’s role.Moreover, the authentication and authorizationmechanismsmust be integratedwiththecustomer’sITplatform.

Securing communication assumes a paramount importance, as no trustworthy authentication andauthorization mechanism can be built in a distributed platform without the establishment of asecure communication channel. For this reason, EW-Shopp consortium is committed to employmodern encryption mechanisms to create secure channels over untrustworthy communicationmeans.

Itisimportanttoconsiderthatsecurityservicesmusthavecompetenceovertheentirearchitecture,regardless of its deployment. For this reason, if the system is distributed, thenwith a separationbetween platform and corporate services, it will be necessary to provide an adequate securitysystem inboth sub-systemsandensure that they can interact transparently through somesortofsecurityproxy.

MonitoringandAuditingServicesFinally, Corporate services should include a set of components responsible for monitoring theexecution of the data transformation process (req. SP.6). These services are not among the coreelementsof theplatformas their absencewouldnot limit the functionalityof the systembutarenonethelessvery importantfor itsday-by-dayoperability. Infact,asystemthatdoesnotallowforthetimelydetectionofanomaliesinprocessesorsecuritybreachesiscertainlynotoperational.

Forthistobepossible,thesystemmustbeobservable,i.e.itsvariouscomponentsmustbeproperlyinstrumentedtogenerateauditeventsofadequategranularity.Theseeventsareusuallyrecordedinseverallogfiles(butamessagequeuesystemcanalsobeusedforthispurpose)andanalyzedbyaComplexEventProcessor(CEP)[21].

4.3 ControlFlow

Figure10illustratesagenericcontrolflowassociatedwiththedatapreparationprocess.Theimageentailsboththedesigntimeandruntimephase.Duringthedesigntimephasetheauthenticatedandauthorizeduser (aDataAnalyst)designsthetransformationprocessthedatasetmustundergo.To

Page 52: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

52

dothis,theusergeneratesasampledatasetaccordingtospecificdomaincriteriaandmanipulatesitby interacting with the Data Wrangler Web App. Once the process is fully defined it can bedownloadedinasuitableformatfromtheDataWranglerbackendtoEW-ShoppCorporate.Heretheruntimedata flowcanbesetupand firedby loadingandexecuting thegenerated transformationmodelintotheDataWranglerEngine.Thiscomponentisinchargeofexecutingthetransformation(cleaning,linking,enrichment)againsttheoverallBigDataset.Theresultsaresavedbackinthedatastorage. Please notice that, in the figure, the Corporateworkflowmanagement tool represents asuitableautomationtoolimplementedwithintheBigDataruntime.

Figure10:GenericcontrolsequenceforDataWrangling.

Similarly,thecontrolflowforthedataanalyticsprocessisgraphicallyrepresentedinFigure11.Onceagain,thesystemusermusthavepreviouslyloggedinandbeenauthorized.Afterwards,theusercanthenuse a sample, generatedby the Sampler from theenricheddataset, as a playground for thedefinitionandconfigurationoftheanalyticsmodels.Thesemodelscancoverthevariousdomainsofdata analytics including predictive and prescriptive methods. The user interacts with the systemthroughasuitablegraphical interface.After satisfactory resultshavebeenachieved, theanalyticalmodels can be downloaded to the Corporate platform and executed on a large scale on thecompleteenricheddataset.

Page 53: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

53

Figure11:GenericcontrolsequenceforDataAnalytics

Regardingtheexecutionflowofthedatareporter,thisslightlydiffersfromtheothertwotools. Infact,thedatareporterdoesnotrequiretheuseofareducedsetoftheenricheddatabutisabletoworkontheresultsoftheanalysistoolasthesearegenerallysmall-sized.InFigure12weseehowinessence the user can only interact with the graphical interface of the data reporter to generatevisualizationsandreports.WepointoutthatsincethedefinitionoftheDataReporterarchitecture(Figure9)wehaveadoptedtheoptiontoletit interactwiththedatastoragesolutionbymeansofsuitable APIs. If thiswas not possible, we could still opt for a fallback solution inwhich the useruploadsafilecontainingadumpofanalyticsdataontothePlatform.

Figure12:GenericcontrolsequenceforReportVisualization

Page 54: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

54

4.4 DataFormat

Toenablesimpledataexchange,dataarekeptwithinaCSVformat(req.F.6)followingfurtherrules:

Thesystemissupportinginternationalcharacterencoding:

FileNamesofuploadedfilesarestandardized:

Eachdataisdescribedbyitsglobalidentity(GUId).Standardcolumnnamescanhave$EWS_prefix.

EachdataaccessandmanipulationisrecordedasAudittraillog(req.SP.6)whatenablestrackingoftheprocessesrelatedtoentitiessuchascustomer,user,salesobject,campaign,etc.

CSVfileformatrules

• Datafieldsareseparatedbyasinglecharacter(usually“Comma”,canbe“Tab”or“Pipe”(|)),

• Eachrecordiskeptinaseparateline,startinginitsownline,butcanspanmultiplelines,• Thelastrecordisnotfollowedby“CarriageReturn”,• ThefirstlineofthefileshouldincludetheHeaderwiththelistofthecolumnnamesused

inthefile(optional,butstronglyrecommended–enablesself-documentingofthefile),• The Header list should bedelimited in the sameway as the datawithin the file.This

helps guard against the data fields being transposed in the data when it is loaded, which can lead to getting wrong answers when you query the data.

• Enclosingcharacter(typicallydoublequotes)mustbeusedwhenrequired,suchaswhenthedelimiterappearsinafield;

MorestrictcomplianceaccordingtoRFC4180maybeconsidered.

Characterset

• UNICODEUTF8,• WithBOMHeader;

FileNamesConvention

Theuploadedfilenamecontains:

• CompanyGUID,• Date (Some file systems change the dates when copying file) in a form

YYYYMMDDhhmmss,• BatchGUIDiftherewillbemultipleuploadsthroughtheday;

Page 55: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

55

4.5 MappingtheReferenceArchitectureontheBusinessCases

The platformmay be instantiated differently in different business cases (for example, likely, BC3doesnotneed tohave theanalyticspartavailable), for this reason,weask the representatives toimagineaspecificdeploymentprofileanddataflowsfortheircase.

Basedonthefeedbackobtainedfromthepartnersinvolvedinthedefinitionandimplementationofthepilotprojects,wehaveendeavoredtoimaginethefutureimplementationofthepilotintermsofthereferencearchitecturepresentedinthischapter.

However,abriefdisclaimerisnecessary:atthetimewhenthisdocumentiswrittenwearestillsixmonths away from the actual realization of the pilots, this means that many aspects may beadjusted.Nonetheless,theeffortmadeatthisstage,webelieve,isnotjustvainintellectualexercisebut a fundamental starting point define a common language (components and relations) and agroundbasisarchitecturearoundwhichthepilotswillbebuilt.

4.5.1 BC1–IntegrationforConsumerJourney

Initspilotphase,theEW-Shoppprojectforeseesthreepilotprojects:

• PilotI–Enrichmentofpurchaseinformationforwebplatforms,• PilotII–Integratedplatformforcategoryandmarketingoptimizations(B2B),• Pilot III – External data access API and decision-making systems supporting customized

campaigns;

All pilots are described in the project deliverable D4.1 Chapter 5. (Technical and syntacticinteroperabilityrequirements).ResultsoftheEW-ShoppprojectsshouldreflectintheavailabilitytooffertheEW-Shoppservicestodifferentindustriesandtheirend-customers.

BC1 enables the EW-Shopp partners (BB, CE and BT) integration of their EW-Shopp CorporateServiceswith:

• EW-ShoppPlatformServicesononeside• Their business partners (industries) environment on the other side (via the Local

InformationSystemintegrationAPI);

asdescribedinchapter4.2.4.

AllEW-ShoppBC1solutionsareprovidedbytheprogramminginfrastructureandinterfacesdefinedwithin other chapters of this document. The solution architecture is presented in Figure 13 todescribe the platform (already sketched in D4.1.) for business case 1 (BC1) in detail by using thenomenclatureandcomponentsdetailedinSection4.2.

Page 56: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

56

Figure13:BC1Architecture

Inaddition,wehavebrokendownthedataflowandhighlightedtheneedtomanagethecorporate(i.e.business)datageneratedby the linkingandenrichmentprocessand those resulting from theanalytics phase. As far as the Platform area components are concerned, an initial analysis of thepilotsdidnotleadtoarealneedtoincludetheDataReportertool.Acomponentspecifictotheusecase will ensure access to data for customers. This decision can certainly be questioned in thecomingmonths,duringwhichthepilotswillbeimplemented.

4.5.2 BC3–LocationScouting

Figure14referstobusinesscase3(BC3)andthiswasderivedfromthetextualdescriptionandaninitialrepresentationcontainedindocumentD4.1.Wehavethereforerefineditbydefiningthedataflowandwehavedescribeditintermsofthereferencearchitecture.Thefigureshowstwosectionsdedicatedtothedatapreparation.Inthefirst,whichispartoftheCorporateareaofthesystem,thedatacomingfromthesensorsarecollected,cleanedandfiltered;inthesecond,theresultingdatasetis enrichedwithweather information, products and events. To complete this secondprocess, theDataWranglerandCoreDataareused(lightgreenandlightblue,respectively).

ThedatacollectedandenrichedarestoredinApacheCassandra35,acolumnarNoSQLdatabase[22][23], particularly suitable for handling huge amounts of data arranged in table form. Cassandranatively provides data partitioning and replication capabilities. As for the data processingcomponent, the company currently uses another open source application from the Apachecommunity: Spark. Apache Spark36 is one of the most widely used general purpose processingenginesintheworld.ItenjoysahighreputationintheBigDataworldandiscurrentlyusedbytheworld's leading companies to process their corporate data. Apache Spark allows distributed35http://cassandra.apache.org36https://spark.apache.org

Page 57: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

57

processing of large amounts of data, has a flexible and comprehensive programmingmodel, andincludesalibrarywithmanymachinelearningalgorithms.

Thedataoncelinkedandenrichedaresubjecttoanalyticsalgorithms.Asforthisphase,atthetimethis document is written, we expect it will be managed by means of software and algorithmsspecificallyimplementedbythecompany.Similarly,thebusinessintelligenceandvisualizationphasewillbedirectlymanagedbycomponentscreatedbyMeasurence.Inparticular,thereareAPIsandaspeciallybuiltwebdashboard.

Figure14:BC3Architecture

4.5.3 BC4–Campaign-drivenPurchasingIntentions

As in the case of the twoprevious business cases,we have refined the architecture presented indeliverable D4.1 [1] in the light of a greater awareness of the elements making up a Big Dataarchitecture.Inordertodothis,asinpreviouscases,wehaveavailedourselvesofthecollaborationofthecompanymostinvolvedinthebusinesscase:JOT.Wehavethereforesucceededinoutliningthedataflow,deployment(theMicrosoftcloud:Azure37)andtherequirementsintermsofplatformcomponents thatarecurrentlyexpectedtobepartof thebusinesscasearchitecture.Theresult isshowngraphicallyinFigure15.

JotcollectsdatafromtheAPIsmadeprovidedbyGoogleandMicrosoftviaadownloadagent.Atthisstage,dataarecollected,cleanedandaggregated.TheyarethenloadedintoadatastoreinAzure.

37https://azure.microsoft.com

Page 58: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

58

Oncethedatahasbeentransferredtothecloud,itcanbeprocessedusingoneofthetoolsprovidedbyMicrosoft(Hadoop38,Storm39,Spark40,amongtheothers)orbytherecommendedcomponentoftheEW-Shoppconsortium.Forfurtherdetails,pleaserefertoSection5.5ofthisdocument.

Thedataflowcontinueswiththetransformationandenrichmentprocess,whichmakesuseoftheDataWranglerandtheCoreData(usedbothindefinitionandexecutionoftheprocess).Aswiththeother business cases, it is expected that the volumeof data to be processedwill be considerableand,therefore,itwillbenecessarytouseasampleofdata.Similarly,theDataAnalyzerisappliedona sample of enriched data to define the analytics process thatwill then be applied to the entiredataset.

Finally, theDataReporteraccesses the resultsproducedby theDataAnalyzergeneratinga reportthat,inturn,isusedtooptimizethecampaign.

Figure15:BC4Architecture

38http://hadoop.apache.org39http://storm.apache.org40https://spark.apache.org

Page 59: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

59

Chapter5 ReferenceImplementation

This chapter provides a brief description of the initial system implementation. Details on theimplementationaredescribedinEW-ShoppDeliverableD2.2[3].

5.1 Overview

Asmentionedearlierinthisdocument,EW-Shoppecosystemislogicaldividedintothreeareas,CoreData,Platform,andCorporate,respectively,plusasetofcrosscuttingservices.

TheEW-ShoppCoreDataaredatasetthatcanbeexploitedinthelinkingandenrichingphasesofthedataflowpipeline.ThesedataaremadeavailableviastandardizedAPIsbyspecialservicesthatcanbedeployedlocallyoraccessiblethroughtheWeb.WeatherandEvents informationareservedbyJSI by means of the EventRegistry41 and a Python API42 whereas Product data and other freelyavailabledatasetwill be served via suitableAPIs (such as SPARQLendpoints [19]). Please refer toSection5.3formoredetails.

TheEW-ShoppPlatformmainlyfeaturesthreefunctionalcomponents:

• DataWrangler–Thiscomponentisimplementedbymeansofintegrationofthreesoftwaresolutions: DataGraft, ASIA and ABSTAT. Data transformations are developed online usingDataGraft43 online tools used for data cleaning, transformation and enrichment using asample data set. DataGraft uses ArangoDB44 for storage of transformed sample data. AmodelisexportedasaDocker45imagethatcanbeexecutedontheruntimeframeworkwiththe completedata set as input.ASIA46 is a tool for semantic enrichmentof data tables. ItaimsatguidingtheusersoftheEW-ShoppecosysteminintegratingbusinessdatawiththeCore Data. Semantic reconciliation algorithms are implemented into the interface to helpusersmapthedataschematosharedvocabulariesandontologies,andlinkdatatosharedsystems of identifiers. Data enrichment widgets exploit these links to shared systems ofidentifierstoeasetheextractionofadditionaldatafromthird-partysourcesandtheirfusionintotheoriginaltabulardata.ABSTAT47isatooltoprofileKGsrepresentedinRDFbasedonlinked data summarization mechanisms. The profiles extracted by ABSTAT describe thecontentofKGs,usingabstraction(schema-levelpatterns)andstatistics.Theprofilesresultsin annotation suggestions that help users understand the content of the KGs used in the

41http://eventregistry.org42https://github.com/JozefStefanInstitute/weather-data43https://www.datagraft.io44https://www.arangodb.com45https://www.docker.com46https://github.com/UNIMIBInside/ASIA47https://github.com/siti-disco-unimib/abstat

Page 60: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

60

platform (e.g., linked product data), support ASIA semantic reconciliation algorithms, andprovidedataqualityinsights.

• DataAnalyzer–EW-ShoppDataAnalysisconsistofLearningandPredictionprocesses,whichare provided by QMiner data analytic platform. While the Learning process defines ananalytics model based on existing examples, the Prediction process uses this modelprocessesforreturningthepredictedvalueofnewexamples.ItisnoteworthythatitsroleintheplatformisnotspecifictoQMineranditcouldbeinterchangedwithadifferentanalyticscomponent.Thus,inthetextanyQMinermentionisinterchangeablewiththemoregeneralterm:Analyzer.

• Data Reporter – Data Reporter refers to the delivery of interactive dashboards or staticreportsbyusingamoduleof theKnowageSuite48, that combines traditional data andBigData sources into valuable and meaningful information. The Knowage Suite is developedusingamodularapproach,whereeachmoduleisdesignedforaspecificanalyticaldomain.Theycanbeusedindividuallyascompletesolutionforacertaintask,orcombinedwithoneanothertoensurefullcoverageofuser’requirements,allowingtobuildatailoredproduct.

TheCorporateservices,ontheotherhand,consistsof:

• Big Data Runtime– This macro-component targets processing of datasets that cannot beprocessedbythecloudservicesprovidedduetodatavolumeordataconfidentiality.Itisanexecutionenvironmentfortransformationandanalyticsmodels.Tohandledataownersthatrequiresinhousestorage,theplatformisdeliveredasasetofDocker49imagesdeployedonascalablehostmachineprovidingdataaccessthroughanetworkfilesystem.

Crosscutting concerns relate themanagementof securitypolicies andmonitoring (whichembracemorethanoneareaaltogether:

• Monitoring–supportingthemonitoringofalltheeventsproducedinthedataflowpipelineiscrucialforanyproductionenvironment.Implementationofsystemmonitorscanbedoneon different layers. In order to maximize the usage and minimize the implementationcomplexity, theusageofstandardSYSLOGmessaging logging(permittingtheconsolidationofloggingdatafromdifferentsourcesinacentralrepository)andREDISservicesisforeseenandrecommend.

• Security and privacy – Security (authentication, authorization, encryption) and privacymanagement (dataconfidentiality)are two featuresofparamount importance thatcannotbe implemented in justoneof the logical areasof theEW-Shopp system. The consortiumconsiders that it is necessary to create anecosystem that is globally safe and to take thisobjective into account from the earliest stages of the project. For this reason,we believethatitisnecessarytounify(wherepossible)securitytechniquesandprotocolsandtocreateaservicecapableofmanaging, inatransparentwaytotheuser, rolesandpermissions fortheresourceaccess.

48https://www.knowage-suite.com/site/product/enterprise-reporting49https://www.docker.com

Page 61: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

61

5.2 IntegrationStrategy

AsfarastheintegrationofthevariouscomponentsthatcontributetothecreationoftheEW-Shoppecosystemisconcerned,wemustdistinguishthreemainaspects:

• Integration of the components that form the EW-Shop platform, namely Data Wrangler,DataAnalyzer,andDataReporter.

• IntegrationofthecomponentscomprisingtheDataWrangler,

• Integrationbetweenthevariousareasoftheecosystem,i.e.EW-ShoppPlatform,EW-ShoppCorporate,andEW-ShoppCoreData.

In the first case, we chose to perform a loose integration in which mainly the various toolscommunicateexclusively throughacommondatabase. In thisway,wehave thebenefitofgreatlydecouplingthecomponentsandthusensuringmaximumflexibilityduringbothdeploymentanduse(req.NF.2,NF.3).Inaddition,thecomponentscanbedevelopedinparallelincompleteisolation.

AsforthestructureoftheDataWrangler,thisconsistsofawebapp(GUI)thatinteractswithasetofservices(Grafterizer50andDataGraft,mainly).AcomponentcalledASIAhasbeenaddedtotheGUItoprovidetheuserwithatoolforthetableannotationprocess(moredetailsareavailableinSection4.2.3).ASIA isoneofthecomponentsthatarebeingdevelopedwithintheprojectand iscurrentlyfully integrated in the graphical interface of the Data Wrangler. Integration was not onlytechnologicalbutalsomethodologicalinnature.

Finally, integration between the various logical areas of the architecture is essentially achievedthrough the use of REST APIs. This is particularly evident in the case of Core Data that can beconsumedbyboth thePlatformand corporate areas. The integrationbetween the corporate andplatforming parties depends on the particular business cases. Herewewill simply say that at themoment we plan to use a proxy system that integrates the security aspect and REST API toimplementdataexchange (suchasSample files, transformationandanalyticsmodels).ConcerningdataaccessforthedatareporteratthemomentweconsiderthatthiscanbeimplementedusingtheJDBC51API.

5.3 CoreDataServices

5.3.1 EventRegistry

EventdataisprovidedbytheEventRegistry52–aglobalmediamonitoringplatformdevelopedandmaintained by JSI. Event Registry collects and processes articles from more than 100.000 newssources from all over the world in 15 languages. Through semantic analysis these articles areannotatedwith conceptsmentioned in the text.Basedon these semanticannotations thearticles

50https://github.com/datagraft/grafterizer51http://www.oracle.com/technetwork/java/javase/jdbc/index.html52http://eventregistry.org

Page 62: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

62

arethenclustered intoevents, i.e.groupsofarticlesaboutthesamereal-worldhappening. In thistext,wewillcommonlyusetheterm“event”interchangeablyfortheactualeventandthegroupofarticlesitisrepresentedby.

TheEventRegistrywebsiteoffersawidearrayofoptionsformanualexplorationofeventdatawitha powerful search engine and various visualizations. The data is also available through an open-sourcePythonAPI53.TheAPIsupportsallthefunctionalityofavailableonthewebsiteandhasbeenmadeavailabletoallprojectpartnerswithoutrestrictions.ThedataformatusedbytheAPItoreturndataisJSON.Dataaboutthefollowingdatatypescanberequested:

• events,• newsarticlesintheevents,• conceptsmentionedinthearticlesanddetectedbythesemanticanalysis,• topicalcategoriesintowhichtheeventsandarticlesareclassified,• andsources(i.e.newspublishers)ofthearticles.

The full data model for all data types returned is available in the online documentation54, datamodelsofasubsetofthemostrelevantdatatypesisdetailedindeliverableD1.1[18].

5.3.2 WeatherAPI

WeatherdataisprovidedbytheEuropeanCentreforMedium-RangeWeatherForecasts(ECMWF55)anintergovernmentalorganizationsupportedby34statesthatactsasbotharesearchinstituteanda24/7operationalservice,producinganddisseminatingnumericalweatherpredictions.ECMWFisoneof the leadingmeteorological institutions in theworldwithadecades-longhistoryofweatherdatacollection.Theyhavegranted theprojectaccess to theMeteorologicalArchivalandRetrievalSystem56 (MARS), which is the main repository of meteorological data at ECMWF. It containspetabytesofoperationaland researchdata rangingbackas faras the1950s,aswellasdata fromspecialECMWFprojects.

MARSarchiveisaveryrichdatasourcethatfarexceedstheprojectneeds.Itsdatasetofoperationaldatacontains thecompleteglobal stateof145weatherparameters twiceperday (12o’clockandmidnight). The weather parameters include measurements such as surface temperature, windspeed,cloudcoverage,precipitationtypeandamount,snowdepthetc.

Besidesthecurrentstateatagiventime,thearchivealsocontainstheforecastmadeatthattimeforthenext240hoursin125timepoints.

Thedata is availableoveraRESTfulAPI aswell as aPythonAPI. Thedata is returned in theGRIB(GRIddedBinary) format commonlyused inmeteorology.ToavoidoverheadsandenableefficientworkaninterfaceAPI57willbedevelopedbyJSIforweatherdataaccessintheproject.Thisinterface

53https://github.com/EventRegistry/event-registry-python54https://github.com/EventRegistry/event-registry-python/wiki/Data-models55https://www.ecmwf.int56https://www.ecmwf.int/en/faq/what-mars57https://github.com/JozefStefanInstitute/weather-data

Page 63: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

63

API will transform the raw weather data into a format suitable for integration with the projectdatasets.

5.3.3 ProductsandOtherKnowledgeBases

WithregardtotheProductdataandanyotherknowledgebases(e.g.,GeoName,DBpedia,etc.)thatmay be relevant both in the linking and enrichment phases, we will proceed according to theguidelinespresentedinthissection.

Wehaveidentifiedtwomainwaystoquerytheseknowledgebases:

• Full-text search on countries, administrative divisions, cities, and related information. Theaccessonfull-textsearchisenabledbytheinterlinkingservicesandgeocodingfunctionality.Traditional indexing techniques for text searching can be exploited to implement thisfeature;

• Spatial-query (e.g. retrieve the name of the location nearest to a given coordinates),enabled by reverse geocoding functionality. Spatial queries, on the other hand, leveragesdifferentkindof index,designedspecificallyforspatialdata(suchasQuadtree,OctreeandR-tree).

In order to maximize the maintainability and reliability, the consortium is inclined to useElasticSearch58asindexingengine,sinceitcanmanagebothfull-textandspatialqueries(alongwithother types of queries). ElasticSearch is an open source distributed engine that can be used forindexing,searchingandanalyzingdata.Itimplementsinvertedindexesforfull-textqueries,whileitadoptsBKD-treeindexes[24]forspatialqueries.AnotherimportantfeatureofElasticSearchisthatitscales to handle millions of events per second, while automatically managing how indexes andqueriesaredistributed.

WeintendtoprovideindexesonEuropeancountries,administrativedivisionsandcitiesthatcanbefoundinGeoNames,whicharetheinterestingentitiesfortheEW-Shoppproject.Inaddition,wewillstoresomeotherusefulinformationthatcanbeusedforData.Wewillindexalsocoordinatesaboutcountries, administrative divisions and cities. This scenariomay require the execution of complexspatialqueries, in thecaseElasticSearchfallsshort for thisduty,wewillalsoevaluatetheneedofdeveloping a GIS (Geographic Information System), storing data in a spatial database (e.g.PostgreSQL+PostGIS).

With regard to Product data or other datasets that do not require spatial queries, they can beaccessedandexploitedbymeansofservicesthatexposeastandardizedandpossiblysemanticAPIs(suchasaSPARQLendpoint).Thereareseveralalternativesonthemarketthatcanbeusedforthispurpose,suchasOpenLlinkVirtuoso59andD2RQ60.

58https://www.elastic.co/59https://virtuoso.openlinksw.com60http://d2rq.org

Page 64: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

64

5.4 PlatformServices

5.4.1 DataWrangler(DW)

TheDataWrangler,whosemainfunctionshavebeenexplainedinSection4.2.3,isrealizedthroughtheintegrationandcollaborationofthreeindependentpiecesofsoftware,namelyDataGraft,ASIA,andABSTAT.Inparticular,DataGraftandASIAhavebeenintegratedatcodelevel,mergingthetwooriginal repositories together.ABSTAT,on theotherhand,collaborateswithASIA throughasetofRESTAPIs.Inthefollowing,thesethreetoolsarepresentedseparately.

DataGraftDataGraft is an online tool that consists of GUI operated services for managing the process oftransformingdataanddeployingthedataontoadatabase.

• Grafterizer is used for transforming data from tabular format into a graph format. Thetransformation supports both cleaning andmapping steps. Transformation steps on rows(add, drop, filter, duplicate detection etc.), columns (add, drop, rename,merge etc.) andentire dataset (sort, aggregate etc.) are provided togetherwith visualization of the resultaftereachstep.Themappingtoprojectontologyisbasedoncolumninformationafterthecleaningsteps.TheresultisformattedasedgeandvaluecollectionsreadytobeuploadedtoArangoDB. Grafterizer provides a GUI on top of the Grafter library that’s implemented inClojure[25]–alanguagethatrunsonaJavavirtualmachine.ThismakesiteasytogenerateJAR files that can be executed at customers premise. This will be used to scaletransformationexecutiononBigDataRuntimeforBigDatasets.

• DataGrafthasbeenupdatedtoincludeamodulecalledDBMSmanager,whichimplementsthe data storage in ArangoDB. The DBMSmanager accesses databases for storage of thetransformed data. The manager uses full access privileges for administration of thecollectionsinthedatabase.AdditionaluserswithrestrictedaccessarehandledthroughtheuseradministrationinArangoDB.

• ArangoDBdatabases,collections,andgraphsarehandledinDataGraftusingregistrationofArangoDBassets,whichactastheotherassetsontheplatform(e.g.,SPARQL[19]endpointsforRDFdata).TheArangoDBassetsupportscreate, read,updateanddeleteofcollectionsfor a specific database. This is used to integrate the flow from transforming data touploading the result into collections in the database. The module provides informationaboutthecollectionswhenconnectedtoArangoDB.AllinformationintheArangoDBassetsissynchronizeddirectlywiththeArangoDBserverthroughtheDBMSmanagermodule.Thisisdonetoavoidcomplicatedsynchronizationwhenthecollectionsareaccesseddirectly.

• QueryofcollectionsissupportedusingArangoDBQueryLanguage(AQL)throughDataGraftqueryassetsordirectlyusingArangoDBinterfaces.AQLissimilartoSQLinitspurposewhenitcomestoreadingandmodifyingcollectiondata.OperationssuchascreatinganddroppingdatabasesandcollectionsarenotsupportedusingAQL.

SupportforArangoDBinDataGraftonlinetoolsisimplementedinEW-Shopp.

Page 65: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

65

ASIAASIA (Assisted Semantic Interpretation and Annotation of web sources tool) is a tool for thesemantic enrichment of data in tabular formats, developed by UNIMIB. Joining tabular data thatdoesnotusethesamerecordidsorsomeotheridentifyingvaluesisnotstraightforward,itrequiresto create links from the table values to a shared systemof identifiers.ASIA aims to help users increating these links, by means of especially designed semantic reconciliation algorithms. In thiscontext,theentitylinkingisperformedattwodifferentlevels:

- Schema-level linking: linking table schema values (i.e. the header of a table) to sharedvocabulariesandontologies;

- Instance-levellinking:linkingdatavaluestosharedsystemsofidentifiers.

Schema-level and instance-level links are created by ASIA asannotations for the table. Users cancreate schema-level annotation through the ASIA interface, by validating ASIA suggestions aboutclasses andproperties tobeused. If a user specifies adifferent class (or property),ASIA suggestsclasses (or properties) that syntactically match the user input (autocomplete functionality).Otherwise,theinstance-levelannotationsareexpectedtobecreatedbyASIAautomatically,asthelargesizeofthedatasetoftenimpedestovalidatethevaluessingularly.

TableannotationsunderpinstwodifferentfunctionalitiesofASIA:

- Generationofknowledgegraphsfromatabulardataset:theschema-levelannotationsaretransformed into executable data transformation to publish tabular data as a knowledgegraph;datavalueswillbeusedtocreatenewinstancesandpopulatethegraph.

- Enrichment of tabular data with third-party data: instance-level annotations are used tofacilitate enrichment of business data with data from these reference knowledge graphs(e.g., the link to a product in the products datasets supports the retrieval of the productbrand, stored in thesamedataset)or fromthird-partydatabyusing linksasbridges (e.g.,the linktoaGeoNames locationcanbeusedtoretrievetheGeoNames location identifier,thatisrequiredforretrievingeventsfromtheEventRegistry).

TheASIA interface is developedas componentofGrafterizer. TheASIABackend is developedandmaintainedbyUNIMIB.AlllinkingservicesimplementedintheASIABackendaremadeaccessibleviaRESTAPIs.

ASIAhasbeendesignedwithalayeredarchitecture.TheBusinessLogicisdevelopedwithintheASIABackend component, while the Presentation Layer contains the ASIA Frontend component. WedescribeinthefollowingthetwomaincomponentsofASIA,depictedinFigure16:

• ASIABackend,whichcontainsthelogicrelatedtotheannotationsuggestionserviceandthatis responsible for the intercommunication with third-party services (like ABSTAT); inparticular,ASIABackendiscurrentlyintegratedwithABSTATserviceviaAPI;asaresult,theBackendprovidestoASIA,thereal-timeautocompleteserviceofferedbyABSTATaswellasstatisticscomputedbythesametool.

• ASIAFrontend,whichisthefrontendcustomizedandintegratedwithinGrafterizer.

Page 66: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

66

Figure16:ASIAArchitecture

ABSTATABSTAT(LinkedDataSummarieswithABstractionandSTATistics)[26]isaframeworkdevelopedbyUNIMIBtosupportusersandmachinesinbetterunderstandingbigandcomplexRDFdatasets.GivenanRDFdatasetand,optionally,anontologyABSTATcomputesasummarythatprovidesanabstractbutcompletedescriptionofthedatasetcontent.Summariesarepublishedandmadeaccessibleviawebinterfaces, insuchawaythatthe informationtheycontaincanbeconsumedbyhumanusersandmachines (viaAPIs).ABSTATmakesalsouseofaminimizationmechanismtokeepsummariescompletebutassmallaspossible.

ABSTAT summaries provide answers to questions like:what types of resources are described in adataset?Whatpropertiesareusedtolinktheresources?Whattypesofresourcesarelinkedandbymeansofwhatproperties?Howmanyresourceshaveacertaintypeandhowfrequentistheuseofagivenproperty?Inpractice,ABSTATsummariesdescribetheuseofvocabulariesindatasets.

ABSTATisabackendinfrastructurethataimsatsupportingdifferenttasks:

• Data understanding. ABSTAT summaries provide a complete overviewof the content of adataset; this feature was proved to be useful to support, for example, SPARQL queryformulation.

• Vocabularymatchingfortableannotation.Summariesrecordrichstatisticsabouttheusageof vocabularies/ontologies in datasets. Thus,we can use summaries to provide types andpropertiesthatmatchastring,forexample,theheaderofacolumninatablethatwewantto publish in RDF reusing existing vocabularies. In addition, statistics provide valuableinformation to algorithmsaimedat suggesting thebest properties and types tousewhentransformingatabulardataintoRDF.

Page 67: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

67

Figure17:DataWranglerComponentSpecification

TheroleofABSTATinEW-ShoppplatformistoprovidesuggestionstoASIAforthetableannotationprocess. For this reason, ABSTAT is placed twice in the figure, as it is at the same time thecomponent that provides suggestions about theCoreData (summarizing informationof Products,GeoNames,etc.),andsuggestionaboutbusinessdata(leveraginginformationofprevioususages).

AsregardstheSamplerComponent,ithasnotbeenimplementedinthisversionoftheplatformasitis strongly business-case dependent and therefore its actual implementation is entrusted to therealizationofthesinglepilots.

ItmustbenotedthattheCoreDatasectionofthearchitecture(graphicallyrepresentedbytheblueboxinthefigure)doesnotreflecttheactualdeploymentofthedatasources,whichmeansthateveniftheweatherandeventAPIsrefertoaremotesource(theEventRegistry),othersources(suchastheproductdatasource),maybedeployedinthecorporateinfrastructureorleftremoteoncase-by-caseevaluation.

5.4.2 DataAnalyzer(DA)

The Data Analyzer component is realized by QMiner61 - an open source data analytics platformdevelopedandmaintainedbyJSI.QMinerwasdesignedforprocessingoflarge-scaledatastreamsinreal time. It supports both structured and unstructured data including text, geospatial data,networksandtimestreams.

The QMiner library contains an extensive array of analytics tools covering classification (supportvectormachine,neuralnetworks, logistic regression), regression (ridge regression, recursive linearregression), clustering (k-means, data partitioning), preprocessing (multi-dimensional scaling,principalcomponentanalysis)andothers.Therearealsoanumbertoolsdirectlysupportingfeatureextractionfromunstructureddatasourcessuchastext.

61http://qminer.ijs.si

Page 68: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

68

QMinercoreisimplementedinC++programminglanguagetoensureprocessingefficiency.Ithasacustom built internal dataset representation which provides out-of-the-box support for indexing,queryingandaggregatingusingasimplequerylanguage.DatasetstructureisdefinedusingasimpleJSON specification. Feature extraction working on top of the data layer is specified in the samefashionmakingthesetupeasilytransferablebetweenQMiner instances.This isuseful insituationswhenfinalmodelmaybedevelopedonadifferentinstanceastheonethathasaccesstoallthedataasdescribedinSection4.2.3

Foreaseofuse,QMiner iswrapped ina JavaScript libraryand isdeployedasanode.jspackage62.Thismakesinstallationeasythroughthenodepackagemanager.TheJavaScriptinterfaceallowsforfast development and rapid prototyping.Once themodels are built they are simple to expose tootherprocessesassimpleRESTservicesusingoneofthemanynodepackagesforsettingupHTTPservers.ThiswayQMinermodelscanbeeasily integratedintootherworkflowsaswasspecified intherequirementsinSection3.3.Thoughitdoesnothaveacustomgraphicaluserinterface,QMinercanbemanagedinrealtimeusingthenode.jsconsoleoraninteractivecomputingplatformsuchaJupyter63.

Figure18:DataAnalyzerComponentSpecification

ThetwomainmodesofoperationoftheAnalyzercomponentarelearningandprediction.

Learning is typically a batchoperationperformedbyprocessing a largedataset of historical data.Therearealternativeapproachestothebatchapproach,butwedonotforeseetheneedforthemintheplatform.Shouldsuchaneedariseinfuture,itispossibletomodifythisstagewithoutaffectingthepredictionstage,thusminimizingtheimpactofthechange.

Thedataset is processedbyQMiner,whichbuilds amodel. SinceQMiner parses thedata into itsowninternalrepresentationduetooptimizationforthe learningalgorithm,thesourceofthedata(e.g.fileondiskoratableinadatabase)isnotimportant.Theonlythingneededisthespecificationof the learning problem,where the target variable and the input features are specified. This alsoallows the development of themodel on a separate sample due to hardware or data sensitivityconstraints and then transferring the developed specification to a different QMiner instance (forexampleondataowner’spremises)wherethefinallearningisrun.

62https://www.npmjs.com/package/qminer63https://jupyter.org

Page 69: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

69

Conceptually,theoutputmodelisafunctionwhichmapstheinputfeatures’valuesintothevaluesof the target variable. This function is encoded into a QMiner module which performs thistransformation.ItcanbeserializedandtransferredtootherQMinerinstancesorsavedtodisk.

InthePredictionmode,themodelbuiltusingthe learningphase is loaded intoaQMiner instancewhichcanthenreceivenewexamplesandperformprediction.Thenewexamplesarerepresentedassetsofvaluesofthesamefeaturesaswereusedtobuildthemodelandthepredictionisthevalueofthetargetvariableascomputedbythemodelfunction.Predictionismadeavailableforqueryingtotheexternalprocesses.MostcommonlyasimpleserverissetupusingaRESTAPIforaccess.

ThecontrolflowfortheDataAnalyzerispresentedinSection4.3,Figure11.

5.4.3 DataReporter(DR)

Figure19:DataReporterComponentSpecification

The Knowage Suite (developed by ENGINEERING) has been selected as the main backgroundcomponent able to realize the foreseenData Reportermodule in the overall EW-Shoppplatform.Knowage isanOpenSourcebusinessanalytics suite that combines traditionaldataand largedatasources into valuable and meaningful information. It merges the innovation coming from thecommunitywiththeexperienceandpracticesfromtheenterprise-levelsolutions.

The Knowage Suite is composed of several modules, each one designed for a specific analyticaldomain.Theycanbeusedindividuallyascompletesolutionforacertaintask,orcombinedwithoneanothertoensurefullcoverageofuser’requirements,allowingtobuildatailoredproduct.Knowagegives the opportunity toworkwith BigData sources and traditional ones, federating data sets tobuild different analysis such as: static reports, maps, network views, interactive cockpits anddata/textminingmodels.Moreover, theusercan freelyexplorehisowndatausingadrag&dropquery builder or having immediate insights thanks to advanced visualizations. Advanced datavisualization has to show general phenomenon producing insights, before going in depth. Withrespect to Big Data, this approach becomes central, because the end user needs to be driventowardsrelevantfactswithoutbeingswampedwithdetails.

In Knowage, the concept ofdataset is essential tomanage information on the cockpit,OLAP andanalysis.Thisinformationcanhavedifferentnature:aDBquery,aCSVfileoraNoSQLDBcollection.

Page 70: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

70

Herebelow,Figure20andFigure21showsomebasicdatabaseoperations(CreationandQuerying,respectively)canbeperformedboththroughtheGUIwebinterface(aRESTAPIisavailableaswell).

Figure20:Datasetcreation

Figure21:Queryingthedataset

Page 71: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

71

5.5 CorporateServices

Figure22:BigDataRuntimeComponentSpecification

The envisioned implementation of the Big Data Runtime component contains a set of tools andservices that satisfy theplatformdesign,described inSection4.2.4. In this section,weoutline thedifferent components and their role in the implementation. More details on the currentimplementation and deployment of EW-Shopp ecosystem, inclusive of the Big-Data Runtime, arereportedinDeliverableD2.2[3].

5.5.1 ClusterConfiguration

The cluster configuration in EW-Shopp will be implemented with the help of Docker containers,instead of VMs. Thereby, each of the data workflow steps, which produce the integrated andenrichedoutputdata,willbepackagedinaDockerimage,andthenecessarynumberofinstancesofthe imagewillbedeployed incontainers. Similarly, toVMs,Dockercontainersareable to shareasingle kernel and application libraries. However, containers aremore resource-efficient than VMsand perform isolated tasks faster. Furthermore, Docker containers can work in a heterogeneousenvironment, making them much more flexible than VMs, which need to be coordinated by ahypervisor. The operating environment of Docker is additionally easy to deploy and install on a

Page 72: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

72

proprietaryclusterofservers,whichisoneoftherequirementstointegratewithintheinfrastructureofEW-Shoppplatformusers.

In order to enable easy independent horizontal scaling for each workflow step, the EW-Shoppplatform will make use of a container orchestration engine (e.g., Docker Swarm64, Rancher65,Kubernetes66).Containerorchestrationsystemsallowforseamlessscalingupanddownofindividualimages, whichwill be used by the Dataflow orchestration component to control the load on thedataflowsteps.

5.5.2 DataFlowManager

DataflowpipelineswillbeimplementedandmanagerbytheDataFlowManagercomponent,whichbelongstotheBigDataRuntimeandprovidesfunctionalitiesrelatedto1)specifyingdataflowonahighlevel,and2)scalingupanddowntheflowsteps(req.NF.1,NF.5).Thedataflowswillconsumetheinputdataandtransformit–e.g.,cleaningupdatausingGrafterizer,convertingthedatausingits ArangoDB mapper component, as well as other dataset-specific operations that can becustomized in thedata flow interface. For example, a data pipeline could consist of the followingsteps:

1. Callingascripttounzipinputfilesstoredinafolderandstoringtheunzippeddatainanotherfolder,

2. Cleaninguptheoutputofthepreviousstepbyusingapre-definedself-sustainedexecutablegeneratedbyGrafterizer

3. Transforming the cleaned-up data to ArangoDB collections using the ArangoDB mappercomponent

4. StoringtheresultingdatainthesharedArangoDBdatabase

TheDataFlowManagerwillbeimplementedusingadataflowautomationsystem,suchasApacheNiFi67. The file system that is shared within the cluster will be implemented using a distributedclusteredfilesystem,suchasHDFS68,GlusterFS69,orCeph70. Itwillbeusedtostore inputsamples(extracted froma corporate database or copied there fromother sources), or share intermediateresultsbetweenworkflowstepstobeconsumedbystepsdownstream.

5.5.3 Enrichmentandanalysisdatabase

The Enrichment andAnalysis Database implements the central databasewhere the results of thedata workflows will be stored. The component will be implemented using ArangoDB, which is a

64https://docs.docker.com/engine/swarm65http://rancher.com66https://kubernetes.io67https://nifi.apache.org68https://hortonworks.com/apache/hdfs69https://www.gluster.org70http://ceph.com

Page 73: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

73

multi-model database that provides document, key-value, and graph data storage and accessfunctionalities.

The Enrichment andAnalysisDatabase is usedby theDataAnalyzer component, implementedbyQMiner to produce analytics models and by the Data Reporter component, implemented byKnowagetooutputreports.

More details on the Enrichment and Analysis Database and other connected components can befoundindeliverableD2.2[3].

5.6 Crosscuttingservices

5.6.1 Securityeprivacy

Asmentionedabove,project stakeholders areparticularly sensitive to security andprivacy issues.Forthisreason,weaimatdevelopinganintegratedidentitymanagementsystemthatcanbeusedinthevariouslogicalareasofarchitecture.

It is important to note, however, that platform services implement their own authentication andauthorization technology (see Deliverable D2.2 for more details). In addition, the services of theCorporateareaarehighlydependentontheparticularbusinesscaseandthereforeitispossiblethattheyalsoimplementsolutionsthatvaryfromcasetocase.

Wethereforeexpectthatitwillbenecessaryasinglesign-oncomponentimplementinganidentityfederation mechanism capable of securely proxying and transparently integrating the varioussecurity technologies to the user. In this way, the system would guarantee the Single Sign-Onproperty.

ExamplesoftechnologiesthatcouldbeusedtointegratethevarioussecuritysystemsareKerberosticket-grantingtickets(TGT)[27],ActiveDirectory[28],SecurityAssertionMarkupLanguage(SAML)[29]andOpenID[30].

5.6.2 MonitoringandAuditingServices

By analyzing the logging message, the severity level, facility and the message content, systemadministrators can use the SYSLOG protocol [31] capabilities and analyses to implement a simplesystem monitoring tool. SYSLOG servers, in fact, features sophisticated message loggingfunctionalities as well as different routing mechanisms ranging from logging into a text file todistributedlogging.

Besidemessagingloggingfunctionality,EW-Shoppprojectaimstosupportalsotracingofkeysystemeventsandvariablestoaclusterserver(a.k.a.distributedtracingsystem).

Page 74: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

74

There are several possible alternatives available for the creation of the tracing system. Today, apossible simple implementation can be achieved by using REDIS71, which is an open-source in-memorydatabase implementing a distributed key-value storewithoptional durability. It supportsdifferent kinds of data structure abstractions, such as strings, lists,maps, sets, sorted sets, hyperlogs, bitmaps and spatial indexes. REDIS also supports trivial-to-setupmaster-slave asynchronousreplication, with very fast non-blocking first synchronization, auto-reconnection with partialresynchronizationonnetsplit.

EW-Shopp project, foresees to provide support for distributed tracing by means of a solutionpowered by REDIS. Such a systemwill support both single value variables andmore complex listvariables whichmay be used for storing the entries of repeated events. System variableswill bewrittenintothesystemwithexpirationperiodsonoadditionalfunctionality isneededfordeletingthevalues.

Implementation of system monitoring capabilities based on REDIS server consists of writing asuitable REDIS Client that checks the known system variables, analyzes their values and triggerspredefinedactivitiesonthebaseofaproject-dependentscripting.

Figure23:Monitoringsystem

WhiletheSYSLOGandREDIScomponentscollectinformationabouttheprocessedeventswithintheEW-Shopp ecosystem (regardless of the generation location, Platform or Corporate Services)implementingwhatwerefertoasmonitoringinfrastructure,theMonitoringAgent(seetoFigure23formore details) is providing a script-controlled event processing phase. Reporting is enabled viaNagios72agents,which isanopensource solution thatprovidesdifferent channelsofevent statusalarming(e.g.viaBrowser,SMSore-mail)totheoperationspersonal,orwritetriggerssotheywillbecomeapartofgeneralescalationprocedures.

71https://redis.io72https://www.nagios.org

Page 75: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

75

5.7 Single-siteConfiguration

Whenthefulldatasetcanbemanageddirectly(i.e.,BigDataprocessingsolutionsarenotrequired),a simplified architecture can be envisaged (see Figure 24). In this configuration,DataGraftwill beusedtoregisterandmanagetheEnrichmentandAnalysisdatabaseusingArangoDBassetsandtheDBMSmanagermodulediscussedinSection5.4.1.ThedataflowpipelinescanalsobeimplementedintheinfrastructuremanagedbytheDataWranglercomponent(Grafterizer),whichconsumesinputdatadirectly fromtheCorporateDBandstores it inthe locallyaccessibleEnrichmentandAnalysisdatabase.DataGraftcanbeusedtodirectlyimplementthedataworkflows,butincaseitisneeded,asharedBigDatainfrastructurecanbemadeavailabletoallcomponentsoftheEW-Shoppplatform.TheDataAnalyzer andData Reporter components also have direct access to the Enrichment andAnalysisdatabase.

Figure24:Self-containedArchitectureandComponentSpecification

Page 76: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

76

Chapter6 Conclusions

TheEW-Shoppecosystemaimstosupporte-commerce,RetailandMarketingindustriesinimprovingtheir efficiency and competitiveness through providing the ability to perform predictive andprescriptiveanalyticsoverintegratedandenrichedlargedatasetsusingopenandflexiblesolutions.

Withthisdocument,wehavetriedtopursueseveralobjectivesatonce.Mainly,wehaveoutlinedareference architecture specifically designed to handle large amounts of data. This architecture isbasedontheprincipleofthedatasetDimensionReductiontowork.Wehavetriedtoexplaintothereader how, by appropriately reducing the size of the dataset, it is possible to design a dataenrichmentprocessthatisbothpreciseandmanageable.Thesameprincipleunderpinstheprocessofdefininganalyticsalgorithms.Theproposedarchitecturenativelysupportsthisapproachbutcanalsoworkdirectlyonbusinessdataaslongasitisnottoolarge.

Our second objective was to raise awareness among consortium members about the inherentdifficultiesassociatedwith theprojectand thepossible solutions that couldbeadopted.Wehavethereforeattemptedtocreateaunifiedcommonlanguagethat,webelieve,willhaveclearbenefitsinthenextstagesoftheproject.

Finally,wehavetriedtoproposeaninitialimplementationofthearchitecturebyidentifyingoneormore tools for each componentof the referencearchitecture. Inparticular, at this stagewehaveembracedthe leandevelopmentapproachalready introduced inDeliverableD4.1. Inourcase thishasmeant that only the centerpiece of the platform, namely the tools already identified and onwhichanagreementexistsamongtheconsortiummembers,ispresentedasstableanddetailed.Therest of the platform, on the other hand, is constituted by tools and/or technologies that at thismomentseemtobegoodcandidatesbutthatcanbereplacedastheprojectdevelops.Inaddition,business cases canevolve independentlyandhave individual requirements. This could lead to theadoption of different tools to cover the same functionalities. Nevertheless, we believe that thereferencearchitectureanditsimplementation,howeverincompleteitmaybe,willbeabletoguidethedevelopmentprocessthatwilltakeplaceinthecomingmonths.

Page 77: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

77

References

[1] D.Dujicetal.,“EW-ShoppDeliverableD4.1BusinessCaseRequirements-Businessmodeling,potentialandmonetizationPilotdefinitionanddescriptions,”2017.

[2] E. Ries,The Lean Startup:HowToday’s EntrepreneursUseContinuous Innovation to CreateRadicallySuccessfulBusinesses.2011.

[3] N.Nikolov,S.Dalgard,A.Marguglio,andV.Hoffman,“EW-ShoppDeliverableD2.2D2.2EW-ShoppPlatform–v1,”2017.

[4] N. Big Data Public Working Group, “NIST Big Data Interoperability Framework: Volume 5,ArchitecturesWhitePaperSurvey,”NISTSpec.Publ.,pp.1500–5.

[5] N. B. D. P. W. G. (NBD-PWG), “NIST Big Data Interoperability Framework: Volume 6,ReferenceArchitecture,”Gaithersburg,MD,Oct.2015.

[6] European Big Data Value Association, “European Big Data Value Strategic Research andInnovationV4.0,”BigDataValu.eu,2017.

[7] T.Knapetal.,“UnifiedViews:AnETLToolforRDFDataManagement.”

[8] L.Wang,G.Wang,andC.A.Alexander,“BigDataandVisualization:Methods,ChallengesandTechnologyProgress,”Digit.Technol.,vol.1,no.1,pp.33–38,Jul.2015.

[9] A.Marguglio,A.Maurino,M.Palmonari, andN.Nikolov, “EW-ShoppDeliverableD2.1DataManagementPlan,”2017.

[10] M.Oderskyetal.,“AnOverviewoftheScalaProgrammingLanguage.”

[11] G.Data,P.Regulation,O.Journal,andD.P.Directive,“GDPR :GettingReadyfortheNewEUGeneralDataProtectionRegulation,”pp.1–5,2016.

[12] P.ColomboandE.Ferrari,“PrivacyAwareAccessControlforBigData:AResearchRoadmap,”BigDataRes.,vol.2,no.4,pp.145–154,Dec.2015.

[13] A.GandomiandM.Haider, “Beyond thehype:Bigdataconcepts,methods,andanalytics,”Int.J.Inf.Manage.,vol.35,no.2,pp.137–144,Apr.2015.

[14] M.H.urRehman,C.S.Liew,A.Abbas,P.P.Jayaraman,T.Y.Wah,andS.U.Khan,“BigDataReductionMethods:ASurvey,”DataSci.Eng.,vol.1,no.4,pp.265–284,Dec.2016.

[15] T.ZhangandB.Yang,“BigDataDimensionReductionUsingPCA,”in2016IEEEInternationalConferenceonSmartCloud(SmartCloud),2016,pp.152–157.

[16] XindongWu,XingquanZhu,Gong-QingWu,andWeiDing,“Dataminingwithbigdata,”IEEETrans.Knowl.DataEng.,vol.26,no.1,pp.97–107,Jan.2014.

[17] J.A.Ramos,R.Mary,B.Kery,S.Rosenthal,andA.Dey,“SamplingTechniquestoImproveBigDataExploration.”

[18] M. Palmonari, M. Zvan, V. Cutrona, and M. Dolenc, “EW-Shopp Deliverable D1.1Interoperability-RequirementsSpecification,”2017.

[19] E.Prud’hommeauxandA.Seaborne,“SPARQLQueryLanguageforRDF.”

[20] N. Borisov and I. Goldberg, Privacy Enhancing Technologies - 8th International Symposium2008,vol.7246.2008.

Page 78: EW-Shopp D3.1 v1.1€¦ · The term "Data Analytics" is too often used to define a generic data driven decision process and therefore covers heterogeneous activities such as cleaning,

EW-Shopp GAnumber:732590 H2020-ICT-2016-2017/H2020-ICT-2016-1

78

[21] G.CugolaandA.Margara,“Processingflowsofinformation,”ACMComput.Surv.,vol.44,no.3,pp.1–62,Jun.2012.

[22] Jing Han, Haihong E, Guan Le, and Jian Du, “Survey on NoSQL database,” in 2011 6thInternationalConferenceonPervasiveComputingandApplications,2011,pp.363–366.

[23] R. Cattell, “Scalable SQL andNoSQL data stores,”ACM SIGMODRec., vol. 39, no. 4, p. 12,2011.

[24] J.T.Robinson,“TheK-D-B-tree,”Proc.1981ACMSIGMODInt.Conf.Manag.data-SIGMOD’81,p.10,1981.

[25] R.Hickey, “The Clojure programming language,” inProceedings of the 2008 symposiumonDynamiclanguages-DLS’08,2008,pp.1–1.

[26] M.Palmonari,A.Rula,R.Porrini,A.Maurino,B.Spahiu,andV.Ferme,“ABSTAT:LinkedDataSummarieswithABstractionandSTATistics,”inLectureNotesinComputerScience(includingsubseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol.9341,Springer,Cham,2015,pp.128–132.

[27] B.C.NeumanandT.Ts’o,“Kerberos:anauthenticationserviceforcomputernetworks,”IEEECommun.Mag.,vol.32,no.9,pp.33–38,Sep.1994.

[28] A.G.Lowe-Norris,A.G./Denn,andRobert,Windows2000ActiveDirectory.O’Reilly,2000.

[29] J.HughesandE.Maler,“Securityassertionmarkuplanguage(saml)v2.0technicaloverview,”OASISSSTCWork.Draftsstc-saml-tech-overview-2.0-draft-08,pp.29–38,2005.

[30] D. Recordon and D. Reed, “OpenID 2.0,” in Proceedings of the second ACM workshop onDigitalidentitymanagement-DIM’06,2006,p.11.

[31] R.Gerhards,“TheSyslogProtocol,”Mar.2009.