Datalab 101 (Hadoop, Spark, Elasticsearch) by Jonathan Winandy - Paris Spark Meetup

Univalence

DATALAB 101

Jonathan WINANDY

About me

Jonathan WINANDY, Data Engineer / Entrepreneur, @AHOY_JON

Presentation

What are Datalabs?

Projects to transform an organisation based on its existing data.

Why?

Data is a lever for economic growth.

But?

Data has no value by itself.

Is data the new oil?

How do we start?

By building a data platform?

Data Platform

Awesome pipelines + Big Data technologies

Rex > Data Platform

Rex > Data Platform > Target schema

[Diagram: target architecture. Labels: sources sql1, sql2, sql3, Logs, Events, other; layers Staging, DWH, Business Views; Serving, cube, sql; Metadata.]

Staging: storage space used to decouple the platform from upstream sources.

HADOOP

ETL workflow:

Rex > Data Platform > Data Warehouse > ETL

[Diagram: ETL workflow. Labels: API 1 (file), API 2 (file), Ref (file), DB; API adapter, DB adapter (x2); Files (x3), process (x3); result; serving DB.]

Rex > Data Platform > Business views & Reporting

● Creation of business dimensions
● Data visualisation

[Diagram labels: DWH; BV (x3); DB, SQL; Self-service data visualisation.]

Objectives: storage / warehousing, reduced access time, elasticity, collaboration, reuse.

But?

Building a data platform is a BIG project with no clear return on investment.

“The Datalab as an infrastructure.”

How to grow a Datalab?

Start small with an end-to-end business case.

Rex > Datalab

CoGroup Map

Rex > Datalab > Recipe

Sprint:
1. Stage the data
2. Source mapping
3. CoGroup
4. Enrich
5. Make it accessible

Marathon:
A. Cardinality study
B. Technical mapping
C. Business-oriented model

Rex > Datalab > CoGroup

{
  "group": 123,
  "V": [{"c2": true, "c1": 123}],
  "R": [
    {"c3": "DIRECT", "c2": "boeuf bourguignon", "c1": 123},
    {"c3": "DIRECT", "c2": "nouilles de riz", "c1": 123},
    {"c3": "INDIRECT", "c2": "soupe au melon d’hiver", "c1": 123},
    {"c3": "INDIRECT", "c2": "nouilles de riz", "c1": 123}
  ]
}

group  int
v      array<struct<c1:int, c2:boolean>>
r      array<struct<c1:int, c2:string, c3:string>>
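As a rough illustration of how a record like the one above comes about, here is a minimal Python sketch of a cogroup: rows from several inputs are grouped on a shared key, so each key yields one nested record holding the matching rows of every input. The input names `V` and `R` and the columns `c1` to `c3` follow the example; the function name `cogroup` and the sample rows are assumptions made for illustration, and the talk performs this step with Pig on Hadoop rather than in memory.

```python
from collections import defaultdict

def cogroup(key, **inputs):
    """Group rows from several named inputs by a shared key column.

    Returns one nested record per key, holding the matching rows of
    every input (an empty list when an input has no row for that key).
    """
    groups = defaultdict(lambda: {name: [] for name in inputs})
    for name, rows in inputs.items():
        for row in rows:
            groups[row[key]][name].append(row)
    return [{"group": k, **bags} for k, bags in sorted(groups.items())]

# Hypothetical sample rows shaped like the example record.
V = [{"c1": 123, "c2": True}]
R = [
    {"c1": 123, "c2": "boeuf bourguignon", "c3": "DIRECT"},
    {"c1": 123, "c2": "nouilles de riz", "c3": "DIRECT"},
    {"c1": 456, "c2": "nouilles de riz", "c3": "INDIRECT"},
]

records = cogroup("c1", V=V, R=R)
```

Unlike a join, the cogroup keeps keys that appear in only one input (here `456` has an empty `V` bag), which makes it convenient for studying cardinalities before any business mapping.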

Rex > Datalab > Ex

f: G => Visitor
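The mapping `f` can be pictured as a plain function from one cogrouped record to one business object. The sketch below is heavy on assumptions: the `Visitor` fields (`visitor_id`, `is_robot`, `items`) and the reading of `V.c2` as a robot flag are invented for the example, since the slides do not define them. In the talk this step runs in Spark; plain Python is used here only to show the shape of the function.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Visitor:
    # All field names here are hypothetical; the slides do not define them.
    visitor_id: int
    is_robot: bool
    items: List[str] = field(default_factory=list)

def to_visitor(g: Dict[str, Any]) -> Visitor:
    """f: G => Visitor, mapping one cogrouped record to one business object."""
    # Assumption: a visitor counts as a robot if any V row has c2 set to True.
    is_robot = any(row["c2"] for row in g["V"])
    # Assumption: R rows name items seen by the visitor; keep distinct names in order.
    items = list(dict.fromkeys(row["c2"] for row in g["R"]))
    return Visitor(visitor_id=g["group"], is_robot=is_robot, items=items)

g = {
    "group": 123,
    "V": [{"c2": True, "c1": 123}],
    "R": [
        {"c3": "DIRECT", "c2": "boeuf bourguignon", "c1": 123},
        {"c3": "INDIRECT", "c2": "nouilles de riz", "c1": 123},
        {"c3": "DIRECT", "c2": "nouilles de riz", "c1": 123},
    ],
}
visitor = to_visitor(g)
```

Because each group is self-contained, a function of this shape can be applied record by record, which is what makes the mapping step easy to distribute.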

Rex > Datalab > Query

Query for nested data (Impala):

select count(*)
from visitor, visitor.session session, session.page page
where visitor.is_robot = false
  and page.type = 'product'
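To make the semantics of the nested query concrete, the following Python sketch counts the same thing over in-memory records: every (visitor, session, page) combination where the visitor is not a robot and the page type is `product`. The field names come from the query; the sample data is hypothetical, and treating `product` as a string value is an assumption.

```python
def count_product_pages(visitors):
    """Count (visitor, session, page) combinations matching the query filter."""
    return sum(
        1
        for visitor in visitors
        if not visitor["is_robot"]
        for session in visitor["session"]
        for page in session["page"]
        if page["type"] == "product"
    )

# Hypothetical nested records: one human visitor with two sessions, one robot.
visitors = [
    {"is_robot": False, "session": [
        {"page": [{"type": "product"}, {"type": "home"}]},
        {"page": [{"type": "product"}]},
    ]},
    {"is_robot": True, "session": [
        {"page": [{"type": "product"}]},
    ]},
]

total = count_product_pages(visitors)
```

Each nested collection in the `from` clause behaves like one more `for` loop over the parent row, so no separate join tables are needed.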

Rex > Datalab > Sum up

CoGroup all your inputs with Pig.

Map the data with Spark.

Store in Elasticsearch.
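The slides do not say how the mapped documents are written to Elasticsearch. As one possible illustration, the sketch below builds the newline-delimited JSON body accepted by the Elasticsearch `_bulk` endpoint, using only the Python standard library. The index name `visitors`, the helper `bulk_index_body` and the document fields are assumptions made for the example.

```python
import json

def bulk_index_body(index, docs, id_field):
    """Build the newline-delimited JSON body expected by the _bulk endpoint."""
    lines = []
    for doc in docs:
        # Action line: index this document under the given _id.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc[id_field])}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"

# Hypothetical mapped documents.
docs = [{"visitor_id": 123, "is_robot": False}]
body = bulk_index_body("visitors", docs, "visitor_id")
```

The resulting string would be sent as a POST to `_bulk` with the `application/x-ndjson` content type; reusing the business key as `_id` keeps reloads idempotent, since re-indexing a visitor overwrites the previous document.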

Conclusion

Questions?
