The Paradox of Big Data - Dataiku / Oxalide AperoTech

Post on 12-Jul-2015

Category:

Technology

The Paradox of Big Data

Year   Domain                        Time Spent Coding   Favorite Language   Largest Dataset
2001   Programming Languages         100%                C                   100GB
2004   Natural Language Processing   100%                Exascript           100GB
2006   Social Recommendation         80%                 Exascript           10GB
2008   Distributed Computing         50%                 Exascript           10TB
2009   Web Mining                    20%                 Python              100TB
2010                                 0%                  Powerpoint          100kB
2011   Social Gaming                 10%                 Python              500GB
2012   Advertising                   50%                 Java                100TB
2013   Dataiku                       20%                 None                10TB

I’m Florian and I like data

www.dataiku.com

Dataiku in short

Software vendor behind Data Science Studio, the « Photoshop for Data Science »

COMMUNITY EDITION

http://www.dataiku.com/dss/trynow/

Goals For Today

• Big Data with the bias of what I know of it (Analytics …)

• Big Data: History and Feelings

• What are the key technologies to watch?

• Some practical use cases?

• How to get started?

Dataiku

Motivation

First Hard Drive: 3.75 Megabytes. Access Time: 1 second

IN 2008 MAN INVENTED BIG DATA

Volume Variety Velocity

WHAT IF THE MARKETING GUY HAD CHOSEN ANOTHER LETTER?

Capacity Complexity Celerity

OR SIMPLER

Size Serendipity Speed

OR AFTER A DRINK

Big Blur Blazing

Or Combine

C… B.. S….

Or Combine

Complete Bull Sh..

SOOO WHAT IS BIG DATA?

PARADOX #1 SIMPLEXITY

SUBTLE PATTERNS

"MORE BUSINESS" BUTTONS

PARADOX #2 SELF-AWARE

DATA SCIENTIST AT NIGHT

DATA CLEANER BY DAY

DATA PLUMBER ON THE WEEKEND

WAITING FOR COMPUTATIONS BETWEEN COFFEES

PARADOX #3 WHERE TO STORE DATA?

MY DATA IS WORTH MILLIONS

I SEND IT TO THE MARKETING CLOUD

AND BACK IT UP TO GOOGLE

PARADOX #4 IS IT BIG OR NOT ?

WE ALL LIVE IN A BIG DATA LAKE

ALL MY DATA MAY FIT IN HERE

PARADOX #5 (at last) HUMAN OR NOT ?

TECHCRUNCH SAYS THAT MACHINE LEARNING WILL SAVE US ALL

I JUST WANT MORE REPORTS

BIG DATA TECH TRENDS

ELEPHANTS MAKE BABIES

Dataiku - Pig, Hive and Cascading

WELCOME TO TECHNOSLAVIA

[Map of Technoslavia]

Regions: Machine Learning Mystery Land, Scalability Central, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House, Real-time Island, NoSQL Nihiland

Technologies: Hadoop, Ceph, Sphere, Cassandra, Kafka, Flume, Spark, scikit-learn, GraphLab, prediction.io, Jubatus, Mahout, WEKA, MLbase, LibSVM, RapidMiner, Pandas, Kibana, InfiniDB, Drill, Spark SQL, Hive, Impala, Elasticsearch, Solr, MongoDB, Riak, Membase, Pig, Cascading, Talend, R, Storm

DRIVER 1: BACK TO THE BASICS

RAM - CPU - DISK

         2000        2013
RAM      $1000/GB    $6/GB
Disk     $10/GB      $0.06/GB

Memory cost divided by 150

Disk cost divided by 250

MAP REDUCE times

HACK REDUCE times

A PERSISTENT MEMORY PROBLEM

DATA IS BIGGER

IS USEFUL DATA BIGGER?

WHOLE DATA

REFINED DATA

GOLD

NEEDLE IN HAYSTACK?

OIL

REFINE BEFORE USE

HOW BIG IS BIG DATA?

Web Site

• $1 billion revenue per year

• 10 million unique visitors per month

• 100 million orders / actions per day

10TB RAW DATA

1TB REFINED DATA

1 TERABYTE FITS IN MEMORY

1TB
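The "fits in memory" claim can be checked with a little arithmetic. The sketch below reuses the per-GB prices and dataset sizes printed on the slides; decimal units (1 TB = 1000 GB) and the helper name `cost` are assumptions made for the illustration, and the prices are the slides' rough figures rather than a market survey.

```python
# Prices per GB as printed on the "Driver 1" slide (approximate figures).
RAM_PRICE_PER_GB = {2000: 1000.0, 2013: 6.0}
DISK_PRICE_PER_GB = {2000: 10.0, 2013: 0.06}

# Dataset sizes from the "How big is big data?" slide, assuming 1 TB = 1000 GB.
RAW_DATA_GB = 10_000
REFINED_DATA_GB = 1_000


def cost(size_gb: float, price_per_gb: float) -> float:
    """Hardware cost in dollars for holding size_gb at the given price."""
    return size_gb * price_per_gb


for year in (2000, 2013):
    ram = cost(REFINED_DATA_GB, RAM_PRICE_PER_GB[year])
    disk = cost(RAW_DATA_GB, DISK_PRICE_PER_GB[year])
    print(f"{year}: refined data in RAM ${ram:,.0f}, raw data on disk ${disk:,.0f}")
```

At the slide's 2013 price, the RAM for the refined terabyte costs a few thousand dollars, against a million at the 2000 price. That gap is what makes an in-memory approach realistic.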

DRIVER 2: ECOSYSTEM GROWS

• GOOGLE

• 1st Circle OPEN SOURCE - YAHOO - IBM - LINKEDIN - FACEBOOK

• 2nd Circle - STANFORD - BERKELEY - STARTUPS

STARTUPS

[Figure: startup funding amounts by year, 2009 to 2013]

Amounts shown: $64M, $6.75M, $14M, $2M, $40M, $20M, $20.5M, $19M, $4M, $100M, $1.8M, $17M, $11M, $7.75M, $1.7M

$1B per year invested in Big Data

TECH: $223M, $301M

ALL > SPARK

Real-Time Resilient Distributed Memory Framework

• Abstraction with any DAG operation on data:
  - Filter
  - Map
  - Reduce
  - Cache
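To make the four operations concrete, here is a toy, single-machine sketch in plain Python. It is not Spark's API: the class `LazyDataset` and its methods are hypothetical names invented for this illustration. It only mimics the idea that filter and map build up a chain of pending steps, reduce triggers the computation, and cache keeps an intermediate result in memory for later actions.

```python
from functools import reduce as fold


class LazyDataset:
    """Toy, single-machine stand-in for a distributed dataset (not Spark's API).

    Transformations (map, filter) only extend a chain of pending steps;
    actions (reduce, collect) walk the chain and run it.
    """

    def __init__(self, parent, step=None):
        self._parent = parent          # a plain iterable (root) or a LazyDataset
        self._step = step              # function turning parent rows into new rows
        self._cache_enabled = False
        self._cached = None

    def map(self, fn):
        return LazyDataset(self, lambda rows: (fn(r) for r in rows))

    def filter(self, pred):
        return LazyDataset(self, lambda rows: (r for r in rows if pred(r)))

    def cache(self):
        """Ask for this node's rows to be kept in memory after the first action."""
        self._cache_enabled = True
        return self

    def _rows(self):
        if self._cached is not None:
            return iter(self._cached)
        if self._step is None:
            rows = iter(self._parent)
        else:
            rows = self._step(self._parent._rows())
        if self._cache_enabled:
            self._cached = list(rows)
            return iter(self._cached)
        return rows

    def reduce(self, fn):
        return fold(fn, self._rows())

    def collect(self):
        return list(self._rows())


parse_calls = []


def parse(line):
    parse_calls.append(line)   # track how often the expensive step really runs
    return int(line)


parsed = LazyDataset(["3", "8", "15", "4"]).map(parse).cache()
large = parsed.filter(lambda n: n > 3)
calls_before_action = len(parse_calls)       # nothing has run yet

total = large.reduce(lambda a, b: a + b)     # first action: runs parse, fills the cache
kept = large.collect()                       # second action: served from the cache
```

The second action does not call `parse` again, which is the point of the cache step: repeated passes over the same data, as in iterative learning, avoid recomputing it.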

SPARK AND ITS ECOSYSTEM

SPARK

SHARK: Real-Time Queries

STREAMING: Real-Time Updates

MLBASE: In-Memory Learning

SooOOo WHAT IS IT IN PRACTICE?

Turn Device Logs Into Next Years' Business

Parking ticket machine data

OpenStreetMap data

Cleaning and enrichment of data

Joining data

Data Science Studio

Creation of a predictive algorithm

Availability of the predictions

Each street is segmented into small pieces that are enriched with geospatial information.

The parking ticket history is joined with the points of interest from OpenStreetMap.

The availability of parking spaces is predicted by street segment from the joined data.

The algorithm is finally integrated in the iPhone app « Find me a space ».
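A minimal sketch of the join-then-predict idea, in plain Python. Everything here is assumed for illustration: the records are made-up toy data, the field names (`capacity`, `poi_count`) are hypothetical, and the "model" is a simple historical average per segment and hour, swapped in for whatever predictive algorithm the project actually used.

```python
from collections import defaultdict
from statistics import mean

# Made-up toy history: (segment_id, day, hour, active paid tickets).
ticket_history = [
    ("seg-1", "mon", 9, 8), ("seg-1", "tue", 9, 10), ("seg-1", "mon", 14, 4),
    ("seg-2", "mon", 9, 2), ("seg-2", "tue", 9, 4),
]

# Made-up per-segment enrichment, standing in for OpenStreetMap-derived attributes.
segments = {
    "seg-1": {"capacity": 12, "poi_count": 5},
    "seg-2": {"capacity": 10, "poi_count": 1},
}


def join_history_with_segments(history, segments):
    """Attach segment attributes to each ticket record (inner join on segment id)."""
    joined = []
    for seg_id, day, hour, active in history:
        if seg_id in segments:
            joined.append({"segment": seg_id, "day": day, "hour": hour,
                           "active": active, **segments[seg_id]})
    return joined


def fit_baseline(joined):
    """Average historical occupancy ratio per (segment, hour).

    poi_count is carried along by the join but ignored by this baseline;
    a real model could use it as a feature.
    """
    ratios = defaultdict(list)
    for row in joined:
        ratios[(row["segment"], row["hour"])].append(row["active"] / row["capacity"])
    return {key: mean(values) for key, values in ratios.items()}


def predict_free_spaces(model, segments, seg_id, hour):
    """Estimated number of free spaces, or None when there is no history."""
    occupancy = model.get((seg_id, hour))
    if occupancy is None:
        return None
    return max(0, round(segments[seg_id]["capacity"] * (1 - occupancy)))


model = fit_baseline(join_history_with_segments(ticket_history, segments))
busy_street = predict_free_spaces(model, segments, "seg-1", 9)
quiet_street = predict_free_spaces(model, segments, "seg-2", 9)
```

Keying the model on street segment and hour mirrors the segmentation step described above: the segment is the unit at which both the history and the prediction live.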

Optimizing Last Mile with Data Science Studio

Data Science Studio

Historical delivery and retrieval data

Modeling of a score for each delivery

Cleaning and temporal enrichment of data

Data aggregation by geographic location

Incorporation of new deliveries to the existing model

• Search reformulation

• No answer

• Click on a professional

• Top search

• Navigation or filter click

HOW CAN WE IMPROVE THE RELEVANCE OF OUR ANSWERS BY ANALYZING USER BEHAVIOR?

20M

Analysis & corrections

Automation

>10 occurrences: 1.4M queries

>200M searches

✗ ✓

0.5M prioritized queries

SOLUTION

Machine

Management, Exploration

pagesjaunes.fr directory

Hadoop, Pig + Hive

Index export

Interpretation engine

Crawl, other reference datasets

scikit-learn
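The filtering step on this slide (keep queries seen more than 10 times, then prioritize) can be sketched in a few lines of plain Python. The log format, the outcome labels, and the choice to rank by number of failed searches are all assumptions made for this illustration. The slide only gives the volumes and the list of behavior signals, and the real pipeline ran on Hadoop with Pig and Hive.

```python
from collections import Counter

MIN_OCCURRENCES = 10                             # threshold shown on the slide
FAILURE_EVENTS = {"no_answer", "reformulation"}  # assumed reading of the signals


def prioritise(log, min_occurrences=MIN_OCCURRENCES):
    """Return frequent queries that fail at least once, worst offenders first."""
    seen = Counter()
    failed = Counter()
    for query, outcome in log:
        key = query.strip().lower()              # crude normalisation of the query text
        seen[key] += 1
        if outcome in FAILURE_EVENTS:
            failed[key] += 1
    frequent = [q for q, n in seen.items() if n > min_occurrences and failed[q] > 0]
    return sorted(frequent, key=lambda q: (-failed[q], q))


# Made-up toy log of (query, outcome) pairs.
log = (
    [("plombier paris", "click_on_pro")] * 9
    + [("Plombier Paris", "reformulation")] * 3   # 12 searches, 3 failures
    + [("restaurent lyon", "no_answer")] * 11     # misspelled query, 11 failures
    + [("taxi", "no_answer")] * 4                 # too rare to be kept
)

ranked = prioritise(log)
```

On real volumes the same count-filter-rank logic would be expressed as a grouped aggregation in Pig or Hive rather than an in-memory loop.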

Analyst

Panels

1970 : Birth of Computer Analytics

Computer

Expensive Software

Marketing Studies

Multiple Data Sources

Analyst Team

Many Models

CRM

Logs

2015 : BUILD YOUR FACTORY

Server Cluster

Light Software

Personalised Experience Model

Acquisition Cost Opportunity Model

Stock Optimisation Model

Optimize Delivery

Churn

Volume Forecast

Recommender

Segmentation

Lifetime Value

Risk Score

Hot Location

Pricing

Ranking

Fraud

Event Paths

A MODEL: an automated way to make a computer take a decision from raw (historical) data

The model can be used to take immediate (real-time) actions through an API
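As a sketch of that definition, the snippet below learns one number from labelled history and then answers JSON requests with a decision, the way an API endpoint would. The churn scenario, the field names, the `send_offer` action and the midpoint-threshold rule are all hypothetical choices for the illustration. A real model would typically come from a library such as scikit-learn and sit behind an actual web server.

```python
import json
from statistics import mean

# Made-up history: (days since last order, did the customer leave?).
history = [(5, False), (12, False), (7, False), (40, True), (60, True), (50, True)]


def train(history):
    """Learn a single threshold: the midpoint between the two class averages."""
    churned = [days for days, left in history if left]
    stayed = [days for days, left in history if not left]
    return {"threshold_days": (mean(churned) + mean(stayed)) / 2}


def handle_request(model, body: str) -> str:
    """What an API endpoint would do: parse JSON, score, answer with a decision."""
    payload = json.loads(body)
    at_risk = payload["days_since_last_order"] > model["threshold_days"]
    return json.dumps({
        "customer_id": payload["customer_id"],
        "action": "send_offer" if at_risk else "none",
    })


model = train(history)
response = handle_request(model, '{"customer_id": "c-42", "days_since_last_order": 45}')
print(response)
```

Training happens offline on historical data, while `handle_request` is cheap enough to run on every incoming event. That split is what allows real-time actions.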

SooOOo HOW DO I ENTER WONDERLAND?

STEP 1 : LEARN

• PYTHON + PANDAS + SCIKIT

• R

• SCALA

http://scikit-learn.org/

https://www.coursera.org/course/rprog

STEP 2 : PRACTICE

• Enter a contest on kaggle.com or datascience.net

• Join a meetup

http://www.dataiku.com/dss/trynow/

Dataiku HQ

2 rue Jean Lantier

75001 Paris France

Dataiku West

2423A Durant Avenue

Berkeley, CA 94704

Florian florian.douetteau@dataiku.com

You have ideas

“My data is too dirty. I don’t even know where to start”

“We could probably better understand our users. But how?”

“There’s a trend here, but our full historical data is just too big”

You have data

You need a tool
