spark hadoop

Post on 17-Jan-2017

TRANSCRIPT

DIFFERENCE BETWEEN SPARK AND HADOOP MAPREDUCE

SPARK IS MUCH FASTER

Spark tries to keep intermediate data in memory, whereas MapReduce writes intermediate results to disk and reads them back between stages.

LOGISTIC REGRESSION PERFORMANCE

WORDCOUNT WITH HADOOP

WORDCOUNT WITH SPARK
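The code from this slide did not survive in the transcript. As a rough stand-in, the usual Spark word count chains `flatMap`, `map` and `reduceByKey`. The sketch below imitates those three stages on an ordinary Python list so that it runs without Spark; the helper names `flat_map`, `reduce_by_key` and `word_count` are hypothetical and are not part of any Spark API.

```python
def flat_map(func, records):
    """Apply func to every record and flatten the results into one list."""
    return [out for rec in records for out in func(rec)]


def reduce_by_key(func, pairs):
    """Merge the values of each key with func, like reduceByKey on an RDD."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


def word_count(lines):
    # Stage 1 (flatMap): split every line into words.
    words = flat_map(lambda line: line.lower().split(), lines)
    # Stage 2 (map): emit a (word, 1) pair for every word.
    pairs = [(word, 1) for word in words]
    # Stage 3 (reduceByKey): add up the ones for each word.
    return reduce_by_key(lambda a, b: a + b, pairs)


if __name__ == "__main__":
    print(word_count(["to be or", "not to be"]))
```

On a real cluster the same three steps would be applied to an RDD read from a file, and Spark would distribute the work across partitions.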

It’s easier to develop for Spark.

Spark also adds libraries for machine learning, streaming, graph processing and SQL.

SPARK GENERAL FLOW

SOME ACTIONS AND TRANSFORMATIONS

Transformations: map(func), flatMap(func), groupByKey(), reduceByKey(func), mapValues(func), sample(…), union(other), distinct(), sortByKey(), …

Actions: reduce(func), collect(), count(), first(), take(n), saveAsTextFile(path), countByKey(), foreach(func), …

CREATE INPUT RDDs

SPLIT INTO TRAINING, VALIDATION AND TEST DATASETS
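The slide's code is not in the transcript. One common way to make such a split is to assign every record to a bucket at random, in proportion to a list of weights. The sketch below does this in plain Python; the function name `random_split` and the 60/20/20 weights in the usage lines are assumptions for illustration, not values taken from the slides.

```python
import random
from itertools import accumulate


def random_split(records, weights, seed=0):
    """Randomly assign each record to one bucket, in proportion to weights."""
    total = sum(weights)
    # Cumulative upper bounds in [0, 1], one per bucket.
    bounds = [c / total for c in accumulate(weights)]
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    parts = [[] for _ in weights]
    for rec in records:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            if draw < bound:
                parts[i].append(rec)
                break
        else:
            # Guard against floating-point rounding in the last bound.
            parts[-1].append(rec)
    return parts


if __name__ == "__main__":
    ratings = list(range(1000))  # stand-in for (user, product, rating) records
    training, validation, test = random_split(ratings, [0.6, 0.2, 0.2], seed=42)
    print(len(training), len(validation), len(test))
```

Because every record is drawn independently, the bucket sizes only approximate the requested proportions.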

FIND OUT OPTIMAL RANK AND NUMBER OF ITERATIONS
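A simple way to search for these two settings is to train one model per (rank, iterations) combination and keep the one with the lowest error on the validation set. The sketch below shows only that selection loop. The callables `train` and `validation_error` are hypothetical placeholders for whatever training and scoring code is used, and the candidate values in the usage lines are assumptions.

```python
from itertools import product


def grid_search(train, validation_error, ranks, iteration_counts):
    """Return (error, rank, iterations, model) for the best combination."""
    best = None
    for rank, iterations in product(ranks, iteration_counts):
        model = train(rank, iterations)
        error = validation_error(model)
        # Keep the combination with the lowest validation error so far.
        if best is None or error < best[0]:
            best = (error, rank, iterations, model)
    return best


if __name__ == "__main__":
    # Toy stand-ins: the "model" is just its settings, and the error is
    # made up so that rank 12 with 20 iterations wins.
    toy_train = lambda rank, iterations: (rank, iterations)
    toy_error = lambda model: abs(model[0] - 12) + abs(model[1] - 20)
    print(grid_search(toy_train, toy_error, [8, 12], [10, 20]))
```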

RMSE (ROOT MEAN SQUARE ERROR) CALCULATION METHOD
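RMSE is the square root of the mean of the squared differences between predicted and actual ratings. The slide's own method is not in the transcript; the sketch below is a plain-Python version that assumes both inputs are dictionaries keyed by a (user, product) pair. The name `rmse` and that data layout are assumptions.

```python
import math


def rmse(predicted, actual):
    """Root mean square error over the (user, product) keys present in both."""
    pairs = [(predicted[key], actual[key]) for key in actual if key in predicted]
    if not pairs:
        raise ValueError("no overlapping (user, product) pairs to compare")
    squared_error = sum((p - a) ** 2 for p, a in pairs)
    return math.sqrt(squared_error / len(pairs))


if __name__ == "__main__":
    predictions = {(1, 1): 3.0, (1, 2): 5.0}
    truth = {(1, 1): 4.0, (1, 2): 3.0}
    print(rmse(predictions, truth))
```

Matching on shared keys mirrors the join between predictions and held-out ratings that a distributed version would need.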

EVALUATE THE BEST MODEL ON THE TEST SET

CREATE A NAIVE BASELINE AND COMPARE IT WITH THE BEST MODEL
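The slides do not say which baseline was used. A common naive choice is to predict the mean training rating for every user and product, then report how much the best model lowers the RMSE relative to it. The sketch below assumes that choice; `baseline_rmse` and `improvement_percent` are hypothetical names.

```python
import math


def baseline_rmse(training_ratings, test_ratings):
    """RMSE on the test ratings when every prediction is the training mean."""
    mean_rating = sum(training_ratings) / len(training_ratings)
    squared_error = sum((mean_rating - r) ** 2 for r in test_ratings)
    return math.sqrt(squared_error / len(test_ratings))


def improvement_percent(baseline_error, model_error):
    """Relative reduction in error achieved by the model, in percent."""
    return (baseline_error - model_error) / baseline_error * 100


if __name__ == "__main__":
    base = baseline_rmse([3.0, 5.0], [4.0, 2.0])
    print(base, improvement_percent(base, 1.0))
```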

OUTPUT

RECOMMEND SOME NEW PRODUCTS FOR USER WITH ID #150
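Recommending new products usually means scoring every product the user has not interacted with yet and keeping the highest predicted scores. The sketch below shows that step in plain Python. The `predict` callable stands in for a trained model, and the function name `recommend`, the product IDs and the scores are all hypothetical.

```python
import heapq


def recommend(predict, user_id, product_ids, seen, n=10):
    """Return the n unseen products with the highest predicted score."""
    candidates = [p for p in product_ids if p not in seen]
    return heapq.nlargest(n, candidates, key=lambda p: predict(user_id, p))


if __name__ == "__main__":
    # Made-up scores standing in for model predictions for user 150.
    scores = {(150, 1): 4.5, (150, 2): 3.0, (150, 3): 4.9, (150, 4): 2.0}
    predict = lambda user, product: scores[(user, product)]
    # Product 3 was already rated, so it is excluded from the result.
    print(recommend(predict, 150, [1, 2, 3, 4], seen={3}, n=2))
```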

AND SOME OUTPUT...

USER ALREADY REACTED ON SOME CAMPAIGNS

USE THIS INFORMATION FOR PREDICTION

AND SOME OUTPUT...

RDD FAULT TOLERANCE

SPARK DEPLOYMENT

MACHINE LEARNING

Types of Machine Learning

ALS Algorithm

ALS MODEL AND ALGORITHM

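Model the ratings matrix (U x M) as the product of a user feature matrix A (U x K) and the transpose of a movie feature matrix B (M x K), where U is the number of users, M the number of movies and K the number of features (the rank).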

Alternating Least Squares (ALS)

• Start with random A and B vectors

• Optimize user vectors (A) based on movies

• Optimize movie vectors (B) based on users

• Repeat until converged
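The four steps above can be sketched in plain Python, without Spark or MLlib. This is a minimal sketch under assumptions that are not in the slides: a regularization term `lam` is added so each least-squares system is always solvable, a fixed number of iterations stands in for a convergence test, and the names `solve`, `update`, `als` and `predict` are hypothetical.

```python
import random


def solve(matrix, vector):
    """Solve matrix * x = vector by Gauss-Jordan elimination with pivoting."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def update(fixed, observed, k, lam):
    """Least-squares fit of one factor vector while the other side is fixed.

    observed is a list of (index into fixed, rating) pairs.
    """
    gram = [[lam if i == j else 0.0 for j in range(k)] for i in range(k)]
    rhs = [0.0] * k
    for idx, rating in observed:
        f = fixed[idx]
        for i in range(k):
            rhs[i] += rating * f[i]
            for j in range(k):
                gram[i][j] += f[i] * f[j]
    return solve(gram, rhs)


def als(ratings, n_users, n_movies, k=2, lam=0.01, iterations=20, seed=0):
    """ratings maps (user, movie) -> rating; returns factor matrices A and B."""
    rng = random.Random(seed)
    # Step 1: start with random A and B vectors.
    A = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    B = [[rng.random() for _ in range(k)] for _ in range(n_movies)]
    by_user = [[] for _ in range(n_users)]
    by_movie = [[] for _ in range(n_movies)]
    for (u, m), r in ratings.items():
        by_user[u].append((m, r))
        by_movie[m].append((u, r))
    # Step 4: repeat (a fixed count here instead of a convergence test).
    for _ in range(iterations):
        # Step 2: optimize user vectors with movie vectors held fixed.
        A = [update(B, by_user[u], k, lam) for u in range(n_users)]
        # Step 3: optimize movie vectors with user vectors held fixed.
        B = [update(A, by_movie[m], k, lam) for m in range(n_movies)]
    return A, B


def predict(A, B, user, movie):
    """Predicted rating: dot product of the user and movie vectors."""
    return sum(a * b for a, b in zip(A[user], B[movie]))
```

Each half-step is an ordinary regularized least-squares problem, which is why the alternation is cheap: with one side fixed, every user (or movie) vector can be solved independently, and that independence is what makes the method easy to parallelize.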

ALS ALGORITHM
