Datalab 101 (Hadoop, Spark, Elasticsearch) by Jonathan Winandy - Paris Spark Meetup
TRANSCRIPT
Rex > Data Platform
Univalence
Rex > Data Platform > Target architecture
[Diagram labels: Staging, DWH, Business Views; sql1, sql2, sql3; Logs, Events, other; cube, sql; Serving; Metadata]
Staging: storage space used to decouple the platform from upstream sources.
Rex > Data Platform
HADOOP
Rex > Data Platform > Data Warehouse > ETL
ETL workflow:
[Diagram labels: API 1 (file), API 2 (file), Ref (file), DB; API adapter; DB adapter (x2); Files (x3); process (x3); result; serving DB]
Rex > Data Platform > Business views & Reporting
● Creation of business dimensions
● Data visualisation
[Diagram labels: DWH; BV (x3); DB; SQL; Self-service data visualisation]
Rex > Data Platform
Objectives:
● Storage / Warehousing
● Reduce access time
● Elasticity
● Collaboration
● Reuse
Rex > Datalab > Recipe
1. Stage the data
2. Source mapping
3. CoGroup
4. Enrich
5. Make it accessible
Sprint
A. Cardinality study
B. Technical mapping
C. Business-oriented model
Marathon
Rex > Datalab > CoGroup
{ "group": 123,
  "V": [{"c2": true, "c1": 123}],
  "R": [{"c3": "DIRECT", "c2": "boeuf bourguignon", "c1": 123},
        {"c3": "DIRECT", "c2": "nouilles de riz", "c1": 123},
        {"c3": "INDIRECT", "c2": "soupe au melon d’hiver", "c1": 123},
        {"c3": "INDIRECT", "c2": "nouilles de riz", "c1": 123}] }

group int
v array<struct<c1:int, c2:boolean>>
r array<struct<c1:int, c2:string, c3:string>>
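The CoGroup step can be sketched in plain Python, without Pig: every distinct key yields one record that holds the full list of matching rows from each input. The names `cogroup`, `v_rows` and `r_rows`, the sample rows, and the choice of `c1` as the grouping key are assumptions made for illustration; only the shape of the record comes from the slide.

```python
from collections import defaultdict

def cogroup(key, **inputs):
    """Group several lists of dict rows by `key`.

    Returns one record per distinct key value, with one list per input
    (empty when that input has no row for the key).
    """
    groups = defaultdict(lambda: {name: [] for name in inputs})
    for name, rows in inputs.items():
        for row in rows:
            groups[row[key]][name].append(row)
    return [{"group": k, **lists} for k, lists in sorted(groups.items())]

# Hypothetical inputs, modelled on the record shown on the slide.
v_rows = [{"c1": 123, "c2": True}]
r_rows = [
    {"c1": 123, "c2": "boeuf bourguignon", "c3": "DIRECT"},
    {"c1": 123, "c2": "nouilles de riz", "c3": "DIRECT"},
    {"c1": 456, "c2": "nouilles de riz", "c3": "INDIRECT"},
]

records = cogroup("c1", V=v_rows, R=r_rows)
```

Unlike a join, this keeps exactly one record per key and never multiplies rows: key 456 still gets a record, with an empty `V` list.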
Rex > Datalab > Query
Query for nested data (Impala):
select count(*)
from visitor, visitor.session session, session.page page
where visitor.is_robot = false and page.type = 'product'
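To make the semantics of the nested query concrete, here is a rough Python equivalent over in-memory data: each entry in the `from` clause unnests one level (visitor, then session, then page), and the `where` clause filters the flattened rows. The sample `visitors` structure and the function name `count_product_pages` are hypothetical; only the field names come from the query.

```python
# Hypothetical nested data: visitor -> sessions -> pages.
visitors = [
    {"is_robot": False, "session": [
        {"page": [{"type": "product"}, {"type": "home"}]},
        {"page": [{"type": "product"}]},
    ]},
    {"is_robot": True, "session": [
        {"page": [{"type": "product"}]},
    ]},
]

def count_product_pages(visitors):
    """Count pages of type 'product' seen by non-robot visitors."""
    return sum(
        1
        for visitor in visitors
        if not visitor["is_robot"]
        for session in visitor["session"]
        for page in session["page"]
        if page["type"] == "product"
    )

product_pages = count_product_pages(visitors)
```

With this sample, the robot's page view is excluded and the two product pages of the first visitor are counted.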
Rex > Datalab > Summary
CoGroup all your inputs with Pig.
Map the data with Spark.
Store in Elasticsearch.
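As an illustration of the last step, the sketch below serializes cogrouped records into the newline-delimited body expected by the Elasticsearch `_bulk` endpoint: one action line followed by one document line per record. The function name `to_bulk_body`, the index name `datalab`, and the use of `group` as the document id are assumptions; the action format shown matches recent Elasticsearch versions, and older versions also expected a `_type` field.

```python
import json

def to_bulk_body(records, index_name, id_field="group"):
    """Serialize records as an Elasticsearch _bulk request body (NDJSON)."""
    lines = []
    for record in records:
        # Action line: index this document under the given id.
        action = {"index": {"_index": index_name, "_id": str(record[id_field])}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(record, ensure_ascii=False))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)

# Hypothetical record in the cogrouped shape.
records = [{"group": 123, "V": [{"c1": 123, "c2": True}], "R": []}]
body = to_bulk_body(records, index_name="datalab")
```

The resulting string would be sent with the `application/x-ndjson` content type; the HTTP call itself is left out here.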