
Page 1: The Data Scientist's Guide to Apache Spark

The Data Scientist’s Guide to Apache Spark

Hands on with a practical case study

Page 2: The Data Scientist's Guide to Apache Spark

Jonathan Dinu, VP of Academic Excellence, Galvanize

@clearspandex

Page 3: The Data Scientist's Guide to Apache Spark

Data Science Immersive

Full Stack Immersive

Data Engineering Immersive

Weekend Workshops

Questions? tweet @clearspandex

Page 4: The Data Scientist's Guide to Apache Spark

Spark Fundamentals

Page 5: The Data Scientist's Guide to Apache Spark

5

What Is Spark?

Page 6: The Data Scientist's Guide to Apache Spark

6

What Is Spark?

• Framework for distributed processing

• In-memory, fault tolerant data structures

• Flexible APIs in Scala, Java, Python, SQL… and now R!

• Open Source

Page 7: The Data Scientist's Guide to Apache Spark

7

[Diagram: the data pipeline (Acquisition, Parse, Storage, Transform/Explore, Vectorization, Train, Model, Expose, Presentation) shown both locally (requests, BeautifulSoup4, pymongo, pandas, scikit-learn/NLTK, pickle, Flask) and at scale on a unified platform (Spark Streaming, Dataframes/Spark SQL, HDFS, PySpark, MLlib/spark.ml, model.save()).]

Page 8: The Data Scientist's Guide to Apache Spark

8

Performance

• Very fast at iterative algorithms

• DAG scheduler supports iterative flows (and graph computation)

• Intermediate results kept in memory when possible

• Bring computation to the data (data locality)

Page 9: The Data Scientist's Guide to Apache Spark

9

Rich API

map()      reduce()

filter()   sortBy()

join()     groupByKey()

first()    count()

… and more …

Page 10: The Data Scientist's Guide to Apache Spark

10

Spark Core

[Diagram: the Spark stack: language bindings (PySpark, SparkR, Scala, Java) sit on top of the libraries (Spark Streaming, DataFrames, MLlib, spark.ml, GraphX), which are built on Spark Core and run on the Standalone Scheduler, YARN, or Mesos.]

Page 11: The Data Scientist's Guide to Apache Spark

11

Review

• Framework for distributed processing

• In-memory, fault tolerant data structures

• Flexible APIs in Scala, Java, Python, SQL… and now R!

• Open Source

Page 12: The Data Scientist's Guide to Apache Spark

12

Spark Programming Basics

Page 13: The Data Scientist's Guide to Apache Spark

13

Spark Execution Context

[Diagram: the driver program (containing the SparkContext) runs on your laptop and connects to a cluster manager (Standalone, YARN, or Mesos); the cluster manager allocates executors on the worker nodes, each with its own cache, and the driver sends tasks to those executors.]

Page 14: The Data Scientist's Guide to Apache Spark

14

Terminology

Driver: process that contains the SparkContext

Executor: process that executes one or more Spark tasks

Master: process that manages applications across the cluster

Worker: process that manages executors on a particular node

Page 15: The Data Scientist's Guide to Apache Spark

15

What Is an RDD?

• Resilient: if the data in memory (or on a node) is lost, it can be recreated

• Distributed: data is chunked into partitions and stored in memory across the cluster

• Dataset: the initial data can come from a file or be created programmatically

Note: RDDs are read-only and immutable; we will come back to this later…
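
A minimal sketch of creating an RDD both ways (it assumes the Spark shell's preconfigured SparkContext `sc`; the HDFS path is illustrative):

data_rdd = sc.parallelize([1, 2, 3, 4])              # created programmatically
text_rdd = sc.textFile("hdfs:///path/to/file.txt")   # created from a file (illustrative path)

# cache() keeps partitions in memory across the cluster; if a partition is
# lost, Spark recomputes it from the RDD's lineage
data_rdd.cache()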

Page 16: The Data Scientist's Guide to Apache Spark

16

Functions Deconstructed

import random
flips = 1000000

# lazy eval (a Python generator)
coins = xrange(flips)

# lazy eval, nothing executed yet
heads = sc.parallelize(coins) \
          .map(lambda i: random.random()) \
          .filter(lambda r: r < 0.51) \
          .count()

sc.parallelize() creates the RDD, map() and filter() are transformations, and count() is the action that materializes the result.

Page 17: The Data Scientist's Guide to Apache Spark

17

Functions Deconstructed

import random
flips = 1000000

# lazy eval
coins = xrange(flips)

# lazy eval, nothing executed yet
heads = sc.parallelize(coins) \
          .map(lambda i: random.random()) \
          .filter(lambda r: r < 0.51) \
          .count()

Closures: Spark creates a closure from each lambda function and applies that function to the data.
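
A minimal sketch of what closure capture means here (assumes `sc`; the variable name is illustrative): `threshold` lives on the driver, the lambda closes over it, and Spark serializes that closure and ships it to every executor.

import random

threshold = 0.51
heads = sc.parallelize(xrange(1000000)) \
          .map(lambda i: random.random()) \
          .filter(lambda r: r < threshold) \
          .count()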

Page 18: The Data Scientist's Guide to Apache Spark

18

Spark Functions

Transformations:

• Lazy evaluation (does not immediately evaluate)

• Return a new RDD

Actions:

• Materialize data (evaluate the RDD lineage)

• Return a final value (on the driver)

Page 19: The Data Scientist's Guide to Apache Spark

19

Transformations

# Every Spark application requires a Spark Context
# The Spark shell provides a preconfigured Spark Context called `sc`
nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
squared = nums.map(lambda x: x * x)          # => {1, 4, 9}

# Keep elements passing a predicate
even = squared.filter(lambda x: x % 2 == 0)  # => [4]

# Map each element to zero or more others
nums.flatMap(lambda x: range(x))             # => {0, 0, 1, 0, 1, 2}

Page 20: The Data Scientist's Guide to Apache Spark

20

Actions

nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
nums.collect()   # => [1, 2, 3]

# Return first K elements
nums.take(2)     # => [1, 2]

# Count number of elements
nums.count()     # => 3

# Merge elements with an associative function
nums.reduce(lambda x, y: x + y)   # => 6

# Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")

Page 21: The Data Scientist's Guide to Apache Spark

21

Functions Revisited

import random
flips = 1000000

# lazy eval
coins = xrange(flips)

# lazy eval: nothing runs here
heads_rdd = sc.parallelize(coins) \
              .map(lambda i: random.random()) \
              .filter(lambda r: r < 0.51)

# everything runs here
head_count = heads_rdd.count()

Page 22: The Data Scientist's Guide to Apache Spark

22

Functions Revisited

import random
flips = 1000000

# local sequence
coins = xrange(flips)

# distributed sequence
coin_rdd = sc.parallelize(coins)
flips_rdd = coin_rdd.map(lambda i: random.random())
heads_rdd = flips_rdd.filter(lambda r: r < 0.51)

# local value
head_count = heads_rdd.count()

Page 23: The Data Scientist's Guide to Apache Spark

23

RDD Lineage

[Diagram: coins lives on the driver; sc.parallelize() produces coin_rdd, map() produces flips_rdd, and filter() produces heads_rdd, all distributed across the workers; count() brings head_count back to the driver.]

Page 24: The Data Scientist's Guide to Apache Spark

24

Key-Value Operations

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

pets.reduceByKey(lambda x, y: x + y) # => {(cat, 3), (dog, 1)}

pets.groupByKey() # => {(cat, [1, 2]), (dog, [1])}

pets.sortByKey() # => {(cat, 1), (cat, 2), (dog, 1)}

Page 25: The Data Scientist's Guide to Apache Spark

25

Notebook

https://github.com/Jay-Oh-eN/data-scientists-guide-apache-spark/blob/master/pyspark.ipynb

Page 26: The Data Scientist's Guide to Apache Spark

26

Functional Programming Primer

• Functions are applied to data (RDDs)

• RDDs are Immutable: f(RDD) -> RDD2

• Function application necessitates creation of new data
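
A minimal sketch of this immutability (assumes `sc`): applying map() leaves the original RDD untouched and hands back a new one.

nums = sc.parallelize([1, 2, 3])
squares = nums.map(lambda x: x * x)   # f(RDD) -> RDD2

nums.collect()      # => [1, 2, 3]  (unchanged)
squares.collect()   # => [1, 4, 9]  (new RDD)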

Page 27: The Data Scientist's Guide to Apache Spark

27

Review

• Client-Server execution model

• Spark leverages higher-order functions (map(), filter(), etc.)

• Transformations create new RDDs and are lazily evaluated

• Actions force materialization of RDD on driver

Page 28: The Data Scientist's Guide to Apache Spark

28

Spark Programming APIs

Page 29: The Data Scientist's Guide to Apache Spark

29

[Diagram: locally, the PySpark SparkContext drives a JavaSparkContext through Py4J and reads from the local file system; on the cluster, each Spark Worker pipes serialized closures to Python subprocesses and communicates with them over sockets.]

Page 30: The Data Scientist's Guide to Apache Spark

30

Review

• PySpark (and SparkR) let developers write driver programs in Python (or R)

• In both cases, closures are serialized and sent to the workers

• Execution happens in the native language (Python/R) of the closure

Page 31: The Data Scientist's Guide to Apache Spark

Data Science Applications with Spark

Page 32: The Data Scientist's Guide to Apache Spark

32

[Diagram: the at-scale data pipeline (Acquisition, Parse, Storage, Transform/Explore, Vectorization, Train, Model, Expose, Presentation) mapped to Spark Streaming, Dataframes/Spark SQL, HDFS, PySpark, MLlib/spark.ml, and model.save(); a "We are Here" marker shows the current stage.]

Page 33: The Data Scientist's Guide to Apache Spark

33

What Is Exploratory Data Analysis?

• Developed at Bell Labs in the 1960s by John Tukey

• Techniques used to visualize and summarize data

• Five-number summary: describe()

• Distributions: box plots, stem and leaf, histogram, scatterplot

Page 34: The Data Scientist's Guide to Apache Spark

34

Goals of Exploratory Data Analysis

• Gain greater intuition

• Validate our data (consistency and completeness)

• Make comparisons between distributions

• Find outliers

• Treat missing data

• Summarize data (a statistic is one number that represents many numbers)

Page 35: The Data Scientist's Guide to Apache Spark

35

Case Study: DonorsChoose.org

Page 36: The Data Scientist's Guide to Apache Spark

36

http://data.donorschoose.org/open-data/overview/

Page 37: The Data Scientist's Guide to Apache Spark

37

Unique Values

• rdd.distinct()

• rdd.countApproxDistinct(relative_accuracy)

http://content.research.neustar.biz/blog/hll.html

http://dx.doi.org/10.1145/2452376.2452456
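
A minimal sketch of both calls (assumes `sc`; the values are illustrative):

states = sc.parallelize(["NY", "CA", "NY", "TX", "CA", "NY"])

states.distinct().collect()        # exact unique values => ['CA', 'NY', 'TX'] (order not guaranteed)
states.distinct().count()          # exact cardinality  => 3

# HyperLogLog-based estimate; the argument bounds the relative error
states.countApproxDistinct(0.05)   # => roughly 3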

Page 38: The Data Scientist's Guide to Apache Spark

38

Missing Values

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions

• column.isNull()

• dataframe.fillna()
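
A minimal sketch of both calls (assumes a SQLContext `sqlContext`; the column names are illustrative, not the actual DonorsChoose schema):

df = sqlContext.createDataFrame(
    [("Ms. A", 310.0), ("Mr. B", None), ("Ms. C", 125.5)],
    ["teacher", "total_donations"])

# count rows where the column is null
df.filter(df.total_donations.isNull()).count()    # => 1

# replace missing values with a default
df.fillna(0.0, subset=["total_donations"]).show()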

Page 39: The Data Scientist's Guide to Apache Spark

39

Missing Values

Page 40: The Data Scientist's Guide to Apache Spark

40

Frequently Occurring Values

• dataframe.freqItems(columns, support), where support is the required minimum proportion of rows

http://dl.acm.org/citation.cfm?doid=762471.762473

Note: this is an approximate algorithm that always returns all of the frequent items but may also include false positives.
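
A minimal sketch (assumes a DataFrame `df`; the column name is illustrative): items appearing in at least 10% of rows, possibly with false positives.

freq = df.freqItems(["school_state"], support=0.1)
freq.collect()   # => e.g. [Row(school_state_freqItems=[u'CA', u'NY', ...])]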

Page 41: The Data Scientist's Guide to Apache Spark

Summary Statistics

• dataframe.describe(column_name)

41
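
A minimal sketch (assumes a DataFrame `df` with an illustrative numeric column): describe() returns count, mean, stddev, min, and max as a small DataFrame.

df.describe("total_donations").show()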

Page 42: The Data Scientist's Guide to Apache Spark

Interlude: Sometimes numbers aren’t enough!

42

Page 43: The Data Scientist's Guide to Apache Spark

Anscombe’s Quartet

[Figure: the four scatterplots of Anscombe's quartet (x1 vs. y1 through x4 vs. y4), which all share the summary statistics below.]

Mean (x): 9
Sample Variance (x): 11
Mean (y): 7.50
Sample Variance (y): 4.127
Correlation: 0.816
Linear Regression: y = 3.00 + 0.500x

43

Page 44: The Data Scientist's Guide to Apache Spark

44

Page 45: The Data Scientist's Guide to Apache Spark

45

Page 46: The Data Scientist's Guide to Apache Spark

46

Notebook

https://github.com/Jay-Oh-eN/data-scientists-guide-apache-spark/blob/master/donors_choose_eda.ipynb

Page 47: The Data Scientist's Guide to Apache Spark

Review

47

• The data science process is inherently interactive

• Spans many scales of data and computation

• Data pipelines require linking many diverse tasks (and data)

• Quick insights necessary for fast iteration

Page 48: The Data Scientist's Guide to Apache Spark

How Spark can help

• Interactive REPL

• Rapid computation (especially aggregates) on large amounts of data

• High level abstractions for efficient querying of data

• “Condense” data for easier local exploration and visualization

48

Page 49: The Data Scientist's Guide to Apache Spark

Natural Language Processing

Page 50: The Data Scientist's Guide to Apache Spark

50

[Diagram: the at-scale data pipeline (Acquisition, Parse, Storage, Transform/Explore, Vectorization, Train, Model, Expose, Presentation) mapped to Spark Streaming, Dataframes/Spark SQL, HDFS, PySpark, MLlib/spark.ml, and model.save(); a "We are Here" marker shows the current stage.]

Page 51: The Data Scientist's Guide to Apache Spark

51

Natural Language Processing

[1, 3, 1, 1, 2, 0, 1, 0]
[0, 1, 4, 0, 0, 1, 1, 1]
[3, 0, 1, 1, 2, 2, 3, 2]
[0, 1, 1, 1, 0, 3, 2, 3]
[1, 2, 1, 2, 2, 0, 0, 0]
[1, 0, 1, 1, 0, 1, 1, 1]
[0, 2, 0, 0, 2, 2, 0, 0]
[1, 1, 1, 1, 0, 1, 1, 1]

Page 52: The Data Scientist's Guide to Apache Spark

52

DonorsChoose: Project Essays

Page 53: The Data Scientist's Guide to Apache Spark

53

Bag of Words

• Document: Single row of data/corpus

• Corpus: Entire set of all documents

• Vocabulary: Set of all words in corpus

• Vector: Mathematical representation of a document (counts of word occurrences)

Page 54: The Data Scientist's Guide to Apache Spark

54

Bag of Words

original document → (tokenization) → dictionary of word counts → (vectorization) → feature vector

"The brown fox"  →  { "the": 1, "brown": 1, "fox": 1 }  →  [0, 0, 1, 0, 1, 0, ...]

Page 55: The Data Scientist's Guide to Apache Spark

55

Tokenization

Page 56: The Data Scientist's Guide to Apache Spark

56

Vectorization

Page 57: The Data Scientist's Guide to Apache Spark

57

With MLlib
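
The slide itself is a screenshot; as a minimal sketch of the same step with the spark.ml feature transformers (assumes a SQLContext `sqlContext`; the documents and column names are illustrative):

from pyspark.ml.feature import Tokenizer, HashingTF

docs = sqlContext.createDataFrame([
    (0, "The brown fox"),
    (1, "students need brown paper and markers")
], ["id", "essay"])

tokenizer = Tokenizer(inputCol="essay", outputCol="words")
words = tokenizer.transform(docs)                        # tokenization

hashing_tf = HashingTF(inputCol="words", outputCol="features")
hashing_tf.transform(words).select("features").show()   # vectorization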

Page 58: The Data Scientist's Guide to Apache Spark

58

Vector Space Model

Image by Riclas (own work), CC BY 3.0, via Wikimedia Commons

Similarity is a measure of “distance”
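
In the vector space model, a common choice of "distance" is cosine similarity; a minimal sketch in plain NumPy (the vectors are illustrative word-count vectors):

import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

doc_a = np.array([1, 0, 2, 1])
doc_b = np.array([0, 1, 2, 1])
cosine_similarity(doc_a, doc_b)   # => ~0.83 (1.0 means the same direction)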

Page 59: The Data Scientist's Guide to Apache Spark

59

Interlude: How to Scale

• Start small (data) and fast (development)

• Test

• Increase the size of the data set

• Test again

• Optimize and productionize

• Profit!

Page 60: The Data Scientist's Guide to Apache Spark

60

TF-IDF

• Measure of the discriminatory power of a word (feature)

• Highest when a term occurs many times in a small number of documents

• Lowest when a term occurs few times in a document, or many times across the corpus

• Useful for information retrieval (queries) and keyword extraction (among other things)

tf(t, d) = f_d(t) / |d|

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )
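
A minimal sketch of computing tf-idf with MLlib (assumes an RDD `essays` of raw document strings; the whitespace tokenization is illustrative):

from pyspark.mllib.feature import HashingTF, IDF

tokens = essays.map(lambda doc: doc.lower().split())

tf = HashingTF().transform(tokens)   # term frequencies hashed into fixed-size vectors
tf.cache()                           # IDF makes two passes over the data

idf_model = IDF().fit(tf)            # learn document frequencies
tfidf = idf_model.transform(tf)      # reweight by inverse document frequency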

Page 61: The Data Scientist's Guide to Apache Spark

61

TF-IDF

Page 62: The Data Scientist's Guide to Apache Spark

62

TF-IDF: Most Common vs. Least Common

Page 63: The Data Scientist's Guide to Apache Spark

63

Summarization

Page 64: The Data Scientist's Guide to Apache Spark

64

Scale Up

Page 65: The Data Scientist's Guide to Apache Spark

65

Notebook

https://github.com/Jay-Oh-eN/data-scientists-guide-apache-spark/blob/master/natural_language_processing.ipynb

Page 66: The Data Scientist's Guide to Apache Spark

66

Review

• We need to represent text as vectors to model documents

• The Bag-of-words model uses word counts (tf-idf improves on this)

• In vector space, we can compare documents using linear algebra

• Spark provides feature transformers to handle text input

Page 67: The Data Scientist's Guide to Apache Spark

Word2Vec

Page 68: The Data Scientist's Guide to Apache Spark

68

Vector Space Model

Source: deeplearning4j

Page 69: The Data Scientist's Guide to Apache Spark

69

Vector Space Model

Vector representation of words. Source: Mikolov T., et al.: Linguistic Regularities in Continuous Space Word Representations, NAACL 2013.

Page 70: The Data Scientist's Guide to Apache Spark

70

[Diagram: the at-scale data pipeline (Acquisition, Parse, Storage, Transform/Explore, Vectorization, Train, Model, Expose, Presentation) mapped to Spark Streaming, Dataframes/Spark SQL, HDFS, PySpark, MLlib/spark.ml, and model.save(); a "We are Here" marker shows the current stage.]

Page 71: The Data Scientist's Guide to Apache Spark

71

Predict Context

Source: https://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis
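
A minimal sketch of training word2vec with MLlib (assumes an RDD `essays` of raw document strings; the parameters and query word are illustrative):

from pyspark.mllib.feature import Word2Vec

tokens = essays.map(lambda doc: doc.lower().split())

model = Word2Vec().setVectorSize(100).setSeed(42).fit(tokens)

# nearest neighbors in the learned vector space, as (word, similarity) pairs
model.findSynonyms("science", 5)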

Page 72: The Data Scientist's Guide to Apache Spark

72

doc2vec

Page 73: The Data Scientist's Guide to Apache Spark

73

Notebook

https://github.com/Jay-Oh-eN/data-scientists-guide-apache-spark/blob/master/word2vec_search.ipynb

Page 74: The Data Scientist's Guide to Apache Spark

Search and Results

Page 75: The Data Scientist's Guide to Apache Spark

75

Machine Learning Pipeline

[Diagram: raw text → (tokenization) → words → (vectorization: word2vec) → contextualized documents (doc2vec) → query → search results.]
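
A minimal sketch of chaining stages with the spark.ml Pipeline API (assumes a SQLContext `sqlContext`; the deck's pipeline vectorizes with word2vec/doc2vec, but HashingTF stands in here to keep the example small, and the documents are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF

docs = sqlContext.createDataFrame([
    (0, "students need microscopes for our science lab"),
    (1, "help us buy books for the classroom library")
], ["id", "essay"])

tokenizer = Tokenizer(inputCol="essay", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")

pipeline = Pipeline(stages=[tokenizer, hashing_tf])
model = pipeline.fit(docs)                        # fits each stage in order
model.transform(docs).select("features").show()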

Page 76: The Data Scientist's Guide to Apache Spark

76

Machine Learning Pipeline

Page 77: The Data Scientist's Guide to Apache Spark

77

Machine Learning Pipeline

Page 78: The Data Scientist's Guide to Apache Spark

78

Machine Learning Pipeline

Page 79: The Data Scientist's Guide to Apache Spark

79

Machine Learning Pipeline

Page 80: The Data Scientist's Guide to Apache Spark

Questions?

Thank You!

Jonathan Dinu, VP of Academic Excellence, Galvanize

@clearspandex

Page 81: The Data Scientist's Guide to Apache Spark

Appendix: Installing Spark

Page 82: The Data Scientist's Guide to Apache Spark

82

Installation: Requirements

• Spark binary (version 1.4.1)

• Java JDK 6/7

• Scientific Python (and Jupyter notebook)

• py4j

• (Optional) IRKernel (for Jupyter)

Page 83: The Data Scientist's Guide to Apache Spark

83

Installation: Requirements

• Spark binary (version 1.4.1)

• Java JDK 6/7

• Scientific Python (and Jupyter notebook)

• py4j

• (Optional) IRKernel (for Jupyter)

Page 84: The Data Scientist's Guide to Apache Spark

84

Installation: Requirements

NOTE: Please do not install Spark with:

• Homebrew on OSX

• Cygwin on Windows

Page 85: The Data Scientist's Guide to Apache Spark

85

Installation: Spark

• Find your OS here: http://spark.apache.org/downloads.html

• Select “Pre-built for Hadoop 2.4” or earlier under “Choose a package type”

• Download the tar package for spark-1.4.1-bin-hadoop1.tgz

(If you are not sure, pick the latest version.)

Make sure you are downloading the binary version, not the source version.

Page 86: The Data Scientist's Guide to Apache Spark

86

Installation: Configuration

• Unzip the file and place it in your home directory (/Users/jonathandinu/)

• Set PATH: include the following lines in your ~/.bash_profile (or ~/.bashrc):

Page 87: The Data Scientist's Guide to Apache Spark

87

Installation: Configuration

• Unzip the file and place it in your home directory (/Users/jonathandinu/)

• Set PATH: include the following lines in your ~/.bash_profile (or ~/.bashrc):

export SPARK_HOME=/full/path/to/your/unzipped/spark/folder

export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

Page 88: The Data Scientist's Guide to Apache Spark

88

Installation: Java JDK

• http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

• Find download for your OS

• Follow install instructions/wizard

• Make sure you get JDK instead of JRE

Page 89: The Data Scientist's Guide to Apache Spark

89

Installation: Requirements

• Spark binary

• Java JDK 6/7

• Scientific Python (and Jupyter notebook)

• py4j

• (Optional) IRKernel (for Jupyter)

Page 90: The Data Scientist's Guide to Apache Spark

90

Installation: Scientific Python

• http://continuum.io/downloads

Page 91: The Data Scientist's Guide to Apache Spark

91

Installation: Scientific Python

• http://continuum.io/downloads

• Find download for your OS (make sure it is Python 2.7)

• Follow install instructions/wizard

Page 92: The Data Scientist's Guide to Apache Spark

92

Installation: Scientific Python

• http://continuum.io/downloads

• Find download for your OS (make sure it is Python 2.7)

• Follow install instructions/wizard

To make sure it installed correctly:

ipython notebook

Page 93: The Data Scientist's Guide to Apache Spark

93

And finally: pip install py4j

Page 94: The Data Scientist's Guide to Apache Spark

94

Installation: Test It All Out

jonathan$ ipython

Page 95: The Data Scientist's Guide to Apache Spark

95

Installation: Requirements

• Spark binary

• Java JDK 6/7

• Scientific Python (and Jupyter notebook)

• py4j

• (Optional) IRKernel (for Jupyter)

Page 96: The Data Scientist's Guide to Apache Spark

96

Installation: IRKernel (Jupyter kernel for R)

• Make sure R is installed: https://cran.r-project.org/bin/

• Install kernel via R (get into an R shell):

install.packages(c('rzmq', 'repr', 'IRkernel', 'IRdisplay'),
                 repos = c('http://irkernel.github.io/', getOption('repos')))

IRkernel::installspec()

Page 97: The Data Scientist's Guide to Apache Spark

97

And in the notebook:

# Set this to where Spark is installed

Sys.setenv(SPARK_HOME="/Users/jonathandinu/spark")

# This line loads SparkR from the installed directory

.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

https://github.com/apache/spark/tree/master/R

Installation: IRKernel (Jupyter kernel for R)

Page 98: The Data Scientist's Guide to Apache Spark

98

Installation: IRKernel (Jupyter kernel for R)

And in the notebook: library(SparkR)

https://github.com/apache/spark/tree/master/R

Page 99: The Data Scientist's Guide to Apache Spark

99

Installation: IRKernel (Jupyter kernel for R)

https://github.com/apache/spark/tree/master/R

Page 100: The Data Scientist's Guide to Apache Spark

100

Note: If for any reason you cannot get Spark installed on your OS following these instructions, Cloudera and Hortonworks provide Linux VMs with Spark installed.

• http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-4-x.html

• http://hortonworks.com/products/hortonworks-sandbox/#install

Page 101: The Data Scientist's Guide to Apache Spark

101

Review

• Command-line Spark shell: ./bin/pyspark

• Spark module: import pyspark as ps

• Jupyter Notebook interface: ipython notebook

• Also R support in the notebook (or RStudio)!

Page 102: The Data Scientist's Guide to Apache Spark

Spark Deployment

Configuring a cluster

Page 103: The Data Scientist's Guide to Apache Spark

103

Spark Deployment

Local Mode (see the sketch below for creating a local-mode context):

• Single threaded: SparkContext('local')

• Multi-threaded: SparkContext('local[4]')

• Pseudo-distributed cluster

Cluster Mode:

• Standalone

• Mesos

• YARN

• Amazon EC2
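
A minimal sketch of creating a local-mode context outside the shell (the app name is illustrative):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[4]").setAppName("donors-eda")
sc = SparkContext(conf=conf)

sc.parallelize(xrange(1000)).sum()   # => 499500
sc.stop()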

Page 104: The Data Scientist's Guide to Apache Spark

104

Spark Deployment: Local

Mode: Advantage

Single threaded: sequential execution allows easier debugging of program logic

Multi-threaded: concurrent execution leverages parallelism and allows debugging of coordination

Pseudo-distributed cluster: distributed execution allows debugging of communication and I/O

Page 105: The Data Scientist's Guide to Apache Spark

105

Standalone

• Packaged with Spark core

• Great if all you need is a dedicated Spark cluster

• Doesn’t support integration with any other applications on a cluster.

The Standalone cluster manager also has a high-availability mode that can leverage Apache ZooKeeper to enable standby master nodes.

Page 106: The Data Scientist's Guide to Apache Spark

106

Mesos

• General purpose cluster and global resource manager (Spark, Hadoop, MPI, Cassandra, etc.)

• Two-level scheduler: enables pluggable scheduler algorithms

• Multiple applications can co-locate (like an operating system for a cluster)

Page 107: The Data Scientist's Guide to Apache Spark

107

YARN

• Created to scale Hadoop, optimized for Hadoop (stateless batch jobs with long runtimes)

• Monolithic scheduler: manages cluster resources as well as schedules jobs

• Not well suited for long-running, real-time, or stateful/interactive services (like database queries)

Page 108: The Data Scientist's Guide to Apache Spark

108

EC2

• Launch scripts bundled with Spark

• Elastic and ephemeral cluster

• Sets up: Spark, HDFS, Hadoop MR

Page 109: The Data Scientist's Guide to Apache Spark

109

Spark Deployment: Cluster

Mode: Advantage

Standalone: encapsulated cluster manager isolates complexity

Mesos: global resource manager facilitates multi-tenant and heterogeneous workloads

YARN: integrates with an existing Hadoop cluster and its applications

EC2: elastic scalability and ease of setup

Page 110: The Data Scientist's Guide to Apache Spark

110

Pseudo-distributed local cluster

[Diagram: on a single laptop, a Master (the Standalone Scheduler, with a Web UI for monitoring/logging) manages two Worker Nodes, each running an Executor with a cache and its tasks.]

Page 111: The Data Scientist's Guide to Apache Spark

111

Pseudo-distributed local cluster

Master:

${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.master.Master \
    -h 127.0.0.1 \
    -p 7077 \
    --webui-port 8080

Workers (x 2):

${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.worker.Worker \
    -c 1 \
    -m 1G \
    spark://127.0.0.1:7077

Page 112: The Data Scientist's Guide to Apache Spark

112

Pseudo-distributed local cluster

One Master + two Workers: run each process in a separate terminal window

Page 113: The Data Scientist's Guide to Apache Spark

113

EC2: Setup

1. Create AWS account: https://aws.amazon.com/account/

2. Get Access keys:

Page 114: The Data Scientist's Guide to Apache Spark

114

EC2: Setup

1. Include the following lines in your ~/.bash_profile (or ~/.bashrc):

export AWS_ACCESS_KEY_ID=xxxxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxx

2. Download your EC2 keypair.

Page 115: The Data Scientist's Guide to Apache Spark

115

1. Launch EC2 cluster with script in $SPARK_HOME/ec2:

EC2: Launch

./spark-ec2 -k keyname -i ~/.ssh/keyname.pem --copy-aws-credentials --instance-type=m1.large -m m1.large -s 19 launch spark

Page 116: The Data Scientist's Guide to Apache Spark

116

Page 117: The Data Scientist's Guide to Apache Spark

117

EC2: Scripts

Login to the Master:

./spark-ec2 -k keyname -i ~/.ssh/keyname.pem login spark

Stop the cluster:

./spark-ec2 stop spark

Terminate the cluster:

./spark-ec2 destroy spark

Restart the cluster (after it has been stopped):

./spark-ec2 -k keyname -i ~/.ssh/keyname.pem start spark

Page 118: The Data Scientist's Guide to Apache Spark

118

Setup IPython/Jupyter

Login to the Master:

./spark-ec2 -k keyname -i ~/.ssh/keyname.pem login spark

Install the needed packages (on the master):

# Install all the necessary packages on Master

yum install -y tmux

yum install -y pssh

yum install -y python27 python27-devel

yum install -y freetype-devel libpng-devel

wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27

easy_install-2.7 pip

easy_install py4j

pip2.7 install ipython==2.0.0

pip2.7 install pyzmq==14.6.0

pip2.7 install jinja2==2.7.3

pip2.7 install tornado==4.2

pip2.7 install numpy

pip2.7 install matplotlib

pip2.7 install nltk

Page 119: The Data Scientist's Guide to Apache Spark

119

Setup IPython/Jupyter

Login to the Master:

./spark-ec2 -k keyname -i ~/.ssh/keyname.pem login spark

Install the needed packages (on the workers):

# Install all the necessary packages on Workers

pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel

pssh -h /root/spark-ec2/slaves "wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"

pssh -h /root/spark-ec2/slaves easy_install-2.7 pip

pssh -t 10000 -h /root/spark-ec2/slaves pip2.7 install numpy

pssh -h /root/spark-ec2/slaves pip2.7 install nltk

Page 120: The Data Scientist's Guide to Apache Spark

120

Allow inbound requests to enable the IPython/Jupyter notebook (WARNING: this creates a security risk).

Page 121: The Data Scientist's Guide to Apache Spark

121

IPython/Jupyter Profile

Login to the Master:

./spark-ec2 -k keyname -i ~/.ssh/keyname.pem login spark

Set a notebook password:

ipython profile create default

python -c "from IPython.lib import passwd; print passwd()" \
    > /root/.ipython/profile_default/nbpasswd.txt

cat /root/.ipython/profile_default/nbpasswd.txt
# sha1:128de302ca73:6b9a8bd5bhjde33d48cd65ad9cafb0770c13c9df

Page 122: The Data Scientist's Guide to Apache Spark

122

Configure IPython/Jupyter Settings

/root/.ipython/profile_default/ipython_notebook_config.py:

# Configuration file for ipython-notebook.
c = get_config()

# Notebook config
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
# It is a good idea to put it on a known, fixed port
c.NotebookApp.port = 8888

PWDFILE = "/root/.ipython/profile_default/nbpasswd.txt"
c.NotebookApp.password = open(PWDFILE).read().strip()

Page 123: The Data Scientist's Guide to Apache Spark

123

Configure IPython/Jupyter Settings

/root/.ipython/profile_default/startup/pyspark.py:

# Configure the necessary Spark environment
import os
os.environ['SPARK_HOME'] = '/root/spark/'

# And the Python path
import sys
sys.path.insert(0, '/root/spark/python')

# Detect the PySpark URL
CLUSTER_URL = open('/root/spark-ec2/cluster-url').read().strip()

Page 124: The Data Scientist's Guide to Apache Spark

124

Configure IPython/Jupyter Settings

Add the following to /root/spark/conf/spark-env.sh:

export PYSPARK_PYTHON=python2.7

Sync across the workers:

~/spark-ec2/copy-dir /root/spark/conf

Make sure the master’s environment is correct:

source /root/spark/conf/spark-env.sh

Page 125: The Data Scientist's Guide to Apache Spark

125

IPython/Jupyter initialization (on the master)

Start a remote window manager (screen or tmux):

screen

Start the notebook server:

ipython notebook

Detach from the session:

Ctrl-a d

Page 126: The Data Scientist's Guide to Apache Spark

126

IPython/Jupyter login

On your laptop:

http://[YOUR MASTER IP/DNS HERE]:8888

Page 127: The Data Scientist's Guide to Apache Spark

127

IPython/Jupyter login

Test that it all works:

Page 128: The Data Scientist's Guide to Apache Spark

128

EC2: Data

HDFS (ephemeral): /root/ephemeral-hdfs

HDFS (persistent): /root/persistent-hdfs

Amazon S3: s3n://bucket_name

Page 129: The Data Scientist's Guide to Apache Spark

129

Review

• Local mode and cluster mode each have their benefits

• Spark can be run on a variety of cluster managers

• Amazon EC2 enables elastic scaling and ease of development

• By leveraging IPython/Jupyter you can get the performance of a cluster with the ease of interactive development