hadoop meetup zhemzhitsky

Company Profile: User segmentation in online advertising, Spark vs Hadoop. Sergey Zhemzhitsky, CTO, CleverDATA, May 22, 2015


Page 1: Hadoop meetup zhemzhitsky

Company Profile

User segmentation in online advertising

Spark vs Hadoop

Sergey Zhemzhitsky, CTO, CleverDATA, May 22, 2015

Page 2: Hadoop meetup zhemzhitsky

cleverdata.ru | [email protected]

International market business development since 2012

• One of three leading IT companies in Russia
• 43 branches in Russia and abroad
• 5,500+ employees
• 100K projects for 10K customers

• Data management innovative platform (Data Exchange Service)
• Cloud Service
• In-house development

Internet advertising solutions
• Data Management Platforms
• Customer Base Management
• Web Analytics
• Marketing automation

• Big Data
• Data Mining
• Digital Intelligence
• Operational Intelligence
• Low Latency and NoSQL
• Cloud Computing

Page 3: Hadoop meetup zhemzhitsky

Agenda

• The problem;
• Hadoop vs. Spark;
• Specifics;
• What's next.

Page 4: Hadoop meetup zhemzhitsky

Real Time Bidding (RTB)

[Diagram: publishers, SSP, ad networks, DSP, advertisers]

Page 5: Hadoop meetup zhemzhitsky

Data Management Platform (DMP)

[Diagram: publishers, SSP, DSP, advertisers, DMP; data sources: tracking data, cookie syncs, access logs, partner's data, 3rd party data, click streams]

Page 6: Hadoop meetup zhemzhitsky

Typical data flows

[Diagram components: Relational Data Store, Raw Data Store & Processing, RealTime Data Store; labels: raw data, 3rd party data, user profiles, aggregates]

Page 7: Hadoop meetup zhemzhitsky

Typical data flows :: RTB

[Diagram components: RTB SRV, Exchange, SSP, Relational Data Store, Raw Data Store & Processing, RealTime Data Store; labels: bid req., bid resp., bid requests, pixels :: impressions :: clicks, user profiles, raw data, 3rd party data, aggregates]

Page 8: Hadoop meetup zhemzhitsky

1st-party data

[Diagram components: RTB SRV, Exchange, SSP, Relational Data Store, Raw Data Store & Processing, RealTime Data Store; labels: bid req., bid resp., bid requests, pixels :: impressions :: clicks, user profiles, raw data, 3rd party data, aggregates]

Page 9: Hadoop meetup zhemzhitsky

1st-party data

• Why monetize?
• How to monetize?
• What to monetize with?

Page 10: Hadoop meetup zhemzhitsky

Why monetize?

Find all users who took part in the “Star Wars” advertising campaign [and] saw one of the banners “Darth Vader” or “Luke Skywalker” within the last 6 days [and] clicked on that banner [and] visited the purchase page for Darth Vader's lightsaber [and] did not buy anything,

in order to retarget them with a personalized banner offering a 40% discount on the lightsaber.

Page 11: Hadoop meetup zhemzhitsky

How to monetize?

find all users [cookies] who have
• taken part in campaign[s] “Star Wars”
• [and] viewed banner[s] “Darth Vader” or “Luke Skywalker” during [last] 6 day[s] [impression]
• [and] clicked banner[s] “Darth Vader's lightsaber” [click]
• [and] visited buying area of “Darth Vader's lightsaber” [tr. pixel]
• [and] not visited order confirmed area of “Darth Vader's lightsaber” [tr. pixel]

id  cookie  event_id                    event_type  campaign_id  timestamp                …
1   c1      “Darth Vader”               impression  “Star Wars”  2015-04-20 14:25:11.462  …
2   c1      “Darth Vader's lightsaber”  click       “Star Wars”  2015-04-21 06:31:12.157  …
3   c1      “Darth Vader's lightsaber”  tr. pixel   “Star Wars”  2015-04-22 18:57:19.628  …

Page 12: Hadoop meetup zhemzhitsky

How to monetize?

find all users who have

map:
• taken part in campaign[s] “Star Wars” → (c1, 0)
• viewed banner[s] “Darth Vader” or “Luke Skywalker” during [last] 6 day[s] → (c1, 1)
• clicked banner[s] “Darth Vader's lightsaber” → (c1, 2)
• visited buying area of “Darth Vader's lightsaber” → (c1, 3)
• not visited order confirmed area of “Darth Vader's lightsaber” → Ø

reduce:
(c1, 0;1;2;3)

true(0) and true(1) and true(2) and true(3) and not false(4) → c1
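The map and reduce steps on this slide can be sketched in plain Java, without Hadoop or Spark. This is only an illustration under stated assumptions: the class and method names are hypothetical, the two tracking pixels are told apart by made-up event ids (the slides do not show how the real system distinguishes them), and the 6-day window is applied to banner views only, which is one possible reading of the query.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SegmentSketch {
    static final class Event {
        final String cookie, eventId, eventType, campaignId;
        final LocalDateTime timestamp;
        Event(String cookie, String eventId, String eventType, String campaignId, LocalDateTime timestamp) {
            this.cookie = cookie; this.eventId = eventId; this.eventType = eventType;
            this.campaignId = campaignId; this.timestamp = timestamp;
        }
    }

    static final String LIGHTSABER = "Darth Vader's lightsaber";
    // Hypothetical ids for the two tracking pixels (an assumption, not from the slides).
    static final String BUY_AREA = LIGHTSABER + " / buy";
    static final String ORDER_CONFIRMED = LIGHTSABER + " / order confirmed";

    /** Map step: indices (0..4) of the query conditions that one event satisfies. */
    static List<Integer> matchedConditions(Event e, LocalDateTime now) {
        List<Integer> hits = new ArrayList<>();
        boolean pixel = e.eventType.equals("tr. pixel");
        if (e.campaignId.equals("Star Wars")) hits.add(0);
        if (e.eventType.equals("impression")
                && (e.eventId.equals("Darth Vader") || e.eventId.equals("Luke Skywalker"))
                && !e.timestamp.isBefore(now.minusDays(6))) hits.add(1);
        if (e.eventType.equals("click") && e.eventId.equals(LIGHTSABER)) hits.add(2);
        if (pixel && e.eventId.equals(BUY_AREA)) hits.add(3);
        if (pixel && e.eventId.equals(ORDER_CONFIRMED)) hits.add(4);
        return hits;
    }

    /** Reduce step: 0 and 1 and 2 and 3 and not 4. */
    static boolean inSegment(Set<Integer> hits) {
        return hits.containsAll(Arrays.asList(0, 1, 2, 3)) && !hits.contains(4);
    }

    /** Groups condition indices by cookie, then keeps the cookies that satisfy the expression. */
    static Set<String> segment(List<Event> events, LocalDateTime now) {
        Map<String, Set<Integer>> byCookie = new HashMap<>();
        for (Event e : events) {
            byCookie.computeIfAbsent(e.cookie, k -> new TreeSet<>()).addAll(matchedConditions(e, now));
        }
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> entry : byCookie.entrySet()) {
            if (inSegment(entry.getValue())) result.add(entry.getKey());
        }
        return result;
    }

    /** Events modelled on the slide's sample rows, with the hypothetical pixel id substituted. */
    static List<Event> sampleEvents() {
        return Arrays.asList(
            new Event("c1", "Darth Vader", "impression", "Star Wars", LocalDateTime.of(2015, 4, 20, 14, 25, 11)),
            new Event("c1", LIGHTSABER, "click", "Star Wars", LocalDateTime.of(2015, 4, 21, 6, 31, 12)),
            new Event("c1", BUY_AREA, "tr. pixel", "Star Wars", LocalDateTime.of(2015, 4, 22, 18, 57, 19)));
    }

    public static void main(String[] args) {
        System.out.println(segment(sampleEvents(), LocalDateTime.of(2015, 4, 23, 0, 0)));
    }
}
```

In a real job the grouping by cookie would be done by the shuffle (MapReduce) or by an operation such as `groupByKey` or `aggregateByKey` (Spark) rather than by an in-memory map.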

Page 13: Hadoop meetup zhemzhitsky

VS.

Page 14: Hadoop meetup zhemzhitsky

MR vs Spark :: The truth of life

• Stylish;
• Trendy;
• Youthful.

Page 15: Hadoop meetup zhemzhitsky

Spark :: Size

Page 16: Hadoop meetup zhemzhitsky

Before looking at Hadoop

Page 17: Hadoop meetup zhemzhitsky

Map-Reduce :: Size

Page 18: Hadoop meetup zhemzhitsky

Materials and tools

Hardware (3 Nodes)
• 12 Core AMD Opteron™ 6338P ~ 2.8 GHz
• 64 GB RAM
• 1 Gbps NICs

Software
• CDH 5.3.1 (Hadoop 2.5.0)
• Spark 1.2.0

Data
• 14.2 GB of raw data
• 61.1 M transactions
• 128 MB block size
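As a rough back-of-the-envelope check of what the data figures imply, the sketch below estimates how many input splits (and hence map tasks or initial partitions) such an input would produce. It assumes a splittable input format, one split per block, and 1 GB = 1024 MB; none of this is stated on the slide, and the class and method names are illustrative.

```java
final class SplitEstimate {
    /** Number of splits needed to cover totalMb of input with blockMb-sized blocks, rounded up. */
    static long estimateSplits(double totalMb, double blockMb) {
        return (long) Math.ceil(totalMb / blockMb);
    }

    /** Average number of records per split (integer division). */
    static long recordsPerSplit(long records, long splits) {
        return records / splits;
    }

    public static void main(String[] args) {
        double totalMb = 14.2 * 1024;                 // 14.2 GB of raw data
        long splits = estimateSplits(totalMb, 128);   // 128 MB block size
        System.out.println(splits + " splits, about "
                + recordsPerSplit(61_100_000L, splits) + " records each");
    }
}
```

Under these assumptions the 14.2 GB input comes to about 114 splits.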

Page 19: Hadoop meetup zhemzhitsky

MR vs Spark :: Execution time

Page 20: Hadoop meetup zhemzhitsky

Spark :: Exec-cores vs Num-execs

Page 21: Hadoop meetup zhemzhitsky

MR vs Spark :: Initialization

MR

• protected void setup(Context ctx)
• o.a.h.c.Configured
• distributed cache

Spark

• mapRegion
• broadcast vars

Page 22: Hadoop meetup zhemzhitsky

MR vs Spark :: Parallelism

MR

• mapred.reduce.tasks
• mapreduce.job.reduces
• splittable formats

Spark

• spark.default.parallelism
• num-executors, executor-cores in YARN
• numTasks in groupByKey, reduceByKey, aggregateByKey…
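To make the Spark settings above more concrete, here is a small arithmetic sketch in plain Java (no Spark dependency). It assumes that on YARN the number of tasks running at once is num-executors × executor-cores, which holds when each task uses one core (the default). The class name, method names, and example figures are illustrative and not taken from the talk.

```java
final class ParallelismSketch {
    /** Tasks that can run at the same time across all executors. */
    static int taskSlots(int numExecutors, int executorCores) {
        return numExecutors * executorCores;
    }

    /** Number of scheduling waves needed to run numTasks on the given slots, rounded up. */
    static int waves(int numTasks, int slots) {
        return (numTasks + slots - 1) / slots;
    }

    public static void main(String[] args) {
        int slots = taskSlots(3, 4);   // e.g. --num-executors 3 --executor-cores 4
        System.out.println(slots + " slots, " + waves(100, slots) + " waves for 100 tasks");
    }
}
```

A task count (for example the numTasks argument of reduceByKey) well below the slot count leaves cores idle, while a count slightly above a multiple of the slot count adds a nearly empty extra wave.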

Page 23: Hadoop meetup zhemzhitsky

MR vs Spark :: Dependencies

MR

• o.a.h.u.Tool
• o.a.h.u.ToolRunner
• -conf app.conf
• -files
• -libjars
• setUserClassesTakesPrecedence

Spark

• --jars
• --files
• --conf
• --driver-java-options
• spark.driver.extraJavaOptions
• spark.executor.extraJavaOptions
• spark.driver.userClassPathFirst
• spark.executor.userClassPathFirst

Page 24: Hadoop meetup zhemzhitsky

MR vs Spark :: Secondary Sort

MR

• setSortComparatorClass
• setGroupingComparatorClass
• setPartitionerClass

Spark

• repartitionAndSortWithinPartitions
• mapPartitions
• Entire partition processing result must be able to fit in memory
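The idea behind secondary sort on both platforms is the same: partition by the grouping key only, sort by the grouping key plus a secondary field, then walk each partition once. The sketch below emulates this with plain JDK collections. It does not call the Hadoop or Spark APIs listed above, and the record type, field names, and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SecondarySortSketch {
    static final class Rec {
        final String cookie;
        final long ts;
        final String eventType;
        Rec(String cookie, long ts, String eventType) {
            this.cookie = cookie; this.ts = ts; this.eventType = eventType;
        }
    }

    /** Partition by cookie only, so all events of one cookie land in the same partition. */
    static int partitionOf(String cookie, int numPartitions) {
        return Math.floorMod(cookie.hashCode(), numPartitions);
    }

    /** Emulates partition-then-sort: each partition ends up ordered by (cookie, ts). */
    static List<List<Rec>> partitionAndSort(List<Rec> input, int numPartitions) {
        List<List<Rec>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) parts.add(new ArrayList<>());
        for (Rec r : input) parts.get(partitionOf(r.cookie, numPartitions)).add(r);
        Comparator<Rec> byCookieThenTs =
                Comparator.comparing((Rec r) -> r.cookie).thenComparingLong(r -> r.ts);
        for (List<Rec> p : parts) p.sort(byCookieThenTs);
        return parts;
    }

    /**
     * Single pass over one sorted partition. Because the input is already ordered,
     * each cookie's event types are appended in timestamp order.
     * The whole result for the partition is held in memory.
     */
    static Map<String, List<String>> eventSequences(List<Rec> sortedPartition) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Rec r : sortedPartition) {
            out.computeIfAbsent(r.cookie, k -> new ArrayList<>()).add(r.eventType);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> input = Arrays.asList(
            new Rec("c1", 3L, "tr. pixel"),
            new Rec("c2", 5L, "click"),
            new Rec("c1", 1L, "impression"),
            new Rec("c1", 2L, "click"));
        for (List<Rec> partition : partitionAndSort(input, 2)) {
            System.out.println(eventSequences(partition));
        }
    }
}
```

The per-partition pass plays the role that mapPartitions plays in Spark, and building the full result map mirrors the slide's caveat that the processing result for an entire partition has to fit in memory.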

Page 25: Hadoop meetup zhemzhitsky

MR vs Spark :: Testing

MR

• MRUnit
• o.a.h.h.MiniDFSCluster
• o.a.h.m.MiniMRCluster
• o.a.h.y.s.MiniYARNCluster
• o.a.h.m.v2.MiniMRYarnCluster

Spark

• Local executor

Page 26: Hadoop meetup zhemzhitsky

What's next, and why Spark?

• Spark Streaming;
• Micro Batches;
• λ-architecture.

without major surgery

Page 27: Hadoop meetup zhemzhitsky

Thank you for your questions!

Page 28: Hadoop meetup zhemzhitsky

[email protected] :: [email protected]

cleverleaf.co.uk :: cleverdata.ru

1dmp.io :: crawler.1dmp.io

facebook.com/CleverData :: +7 (495) 967-66-50