
The Data Scientist’s Guide to Apache Spark

Hands-on with a practical case study

Jonathan Dinu, VP of Academic Excellence, Galvanize

@clearspandex

Data Science Immersive

Full Stack Immersive

Data Engineering Immersive

Weekend Workshops

+

Questions? tweet @clearspandex

Spark Fundamentals

5

What Is Spark?

6

What Is Spark?

• Framework for distributed processing

• In-memory, fault-tolerant data structures

• Flexible APIs in Scala, Java, Python, SQL… and now R!

• Open Source

[Diagram: the data pipeline — Acquisition → Parse → Storage → Transform/Explore → Vectorization → Train → Model → Expose → Presentation.

Locally: requests, BeautifulSoup4, pymongo, pandas, scikit-learn/NLTK, pickle, Flask.

At scale: Spark Streaming, Dataframes/Spark SQL, HDFS, PySpark, MLlib/spark.ml, model.save().

One unified platform spans the whole pipeline.]

8

Performance

• Very fast at iterative algorithms

• DAG scheduler supports cyclic flows (and graph computation)

• Intermediate results kept in memory when possible

• Bring computation to the data (data locality)

9

Rich API

map() reduce()

filter() sortBy()

join() groupByKey()

first() count()

… and more …
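A minimal sketch of a few of these operators chained together on a toy RDD (the data and variable names here are made up for illustration; ordering of collect() results may vary):

# hypothetical toy data: (user, rating) pairs
ratings = sc.parallelize([("alice", 4), ("bob", 5), ("alice", 3), ("carol", 2)])
names = sc.parallelize([("alice", "Alice A."), ("bob", "Bob B.")])

ratings.sortBy(lambda pair: pair[1]).first()       # => ('carol', 2)
ratings.groupByKey().mapValues(list).collect()     # => [('alice', [4, 3]), ('bob', [5]), ('carol', [2])]
ratings.join(names).count()                        # => 3 (joined on the user key)
ratings.filter(lambda pair: pair[1] >= 4).count()  # => 2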


10

[Diagram: the Spark stack — language bindings (PySpark, SparkR, Scala, Java) sit on top of the libraries (Spark Streaming, DataFrames, MLlib, spark.ml, GraphX), which are built on Spark Core; Spark Core runs on the Standalone Scheduler, YARN, or Mesos.]

11

Review

• Framework for distributed processing

• In-memory, fault-tolerant data structures

• Flexible APIs in Scala, Java, Python, SQL… and now R!

• Open Source

12

Spark Programming Basics

13

Spark Execution Context

[Diagram: the driver program (containing the SparkContext) runs on your laptop and connects to a cluster manager (Standalone, YARN, or Mesos); the cluster manager allocates worker nodes, each running an executor with its own cache and tasks.]

14

Terminology

Driver: process that contains the SparkContext

Executor: process that executes one or more Spark tasks

Master: process that manages applications across the cluster

Worker: process that manages executors on a particular node

15

What is an RDD?

• Resilient: if the data in memory (or on a node) is lost, it can be recreated

• Distributed: data is chunked into partitions and stored in memory across the cluster

• Dataset: initial data can come from a file or be created programmatically

Note: RDDs are read-only and immutable; we will come back to this later…
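A quick sketch of both ways to create an RDD (the file path below is hypothetical):

# created programmatically from a local Python collection, split into 8 partitions
nums = sc.parallelize(range(1000), 8)
nums.getNumPartitions()  # => 8

# created from a file (local path, HDFS, or S3); hypothetical path
lines = sc.textFile("hdfs:///data/donorschoose/projects.csv")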

16

Functions Deconstructed

import random
flips = 1000000

# lazy eval
coins = xrange(flips)

# lazy eval, nothing executed
heads = sc.parallelize(coins) \
            .map(lambda i: random.random()) \
            .filter(lambda r: r < 0.51) \
            .count()

xrange → Python generator; sc.parallelize() → create RDD; map()/filter() → transformations; count() → action (materialize result)

17

Functions Deconstructed

import random
flips = 1000000

# lazy eval
coins = xrange(flips)

# lazy eval, nothing executed
# the lambdas below create closures, which are then applied to the data
heads = sc.parallelize(coins) \
            .map(lambda i: random.random()) \
            .filter(lambda r: r < 0.51) \
            .count()

Closures

18

Spark Functions

Transformations: lazy evaluation (does not immediately evaluate); returns a new RDD

Actions: materialize data (evaluates the RDD lineage); returns a final value (on the driver)

19

Transformations

# Every Spark application requires a Spark Context
# The Spark shell provides a preconfigured Spark Context called `sc`
nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
squared = nums.map(lambda x: x*x)            # => {1, 4, 9}

# Keep elements passing a predicate
even = squared.filter(lambda x: x % 2 == 0)  # => [4]

# Map each element to zero or more others
nums.flatMap(lambda x: range(x))             # => {0, 0, 1, 0, 1, 2}

20

Actions

nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
nums.collect()  # => [1, 2, 3]

# Return the first K elements
nums.take(2)    # => [1, 2]

# Count the number of elements
nums.count()    # => 3

# Merge elements with an associative function
nums.reduce(lambda x, y: x + y)  # => 6

# Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")

21

Functions Revisited

import random
flips = 1000000

# lazy eval
coins = xrange(flips)

# lazy eval — nothing runs here
heads_rdd = sc.parallelize(coins) \
                .map(lambda i: random.random()) \
                .filter(lambda r: r < 0.51)

# everything runs here
head_count = heads_rdd.count()

22

Functions Revisited

import random
flips = 1000000

# local sequence
coins = xrange(flips)

# distributed sequences
coin_rdd = sc.parallelize(coins)
flips_rdd = coin_rdd.map(lambda i: random.random())
heads_rdd = flips_rdd.filter(lambda r: r < 0.51)

# local value
head_count = heads_rdd.count()

23

RDD Lineage

coins → sc.parallelize() → coin_rdd → map() → flips_rdd → filter() → heads_rdd → count() → head_count

(coins and head_count live on the driver; the RDDs in between are distributed across the workers)

24

Key-Value Operations

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

pets.reduceByKey(lambda x, y: x + y) # => {(cat, 3), (dog, 1)}

pets.groupByKey() # => {(cat, [1, 2]), (dog, [1])}

pets.sortByKey() # => {(cat, 1), (cat, 2), (dog, 1)}

25

Notebook

https://github.com/Jay-Oh-eN/data-scientists-guide-apache-spark/blob/master/pyspark.ipynb

26

Functional Programming Primer

• Functions are applied to data (RDDs)

• RDDs are Immutable: f(RDD) -> RDD2

• Function application necessitates creation of new data

27

Review

• Client-Server execution model

• Spark leverages higher-order functions (map(), filter(), etc.)

• Transformations create new RDDs and are lazily evaluated

• Actions force materialization of RDD on driver

28

Spark Programming APIs

29

[Diagram: PySpark architecture — locally, the Python driver's SparkContext drives a JavaSparkContext over Py4J (and reads the local file system); on the cluster, each Spark Worker's JVM executor communicates with Python worker processes over sockets and pipes, where the Python closures actually run.]

30

Review

• PySpark (and SparkR) enables developers to write driver programs in Python (or R)

• For both PySpark and SparkR, closures are serialized and sent to the workers

• Execution happens in the native language (Python/R) of the closure

Data Science Applications with Spark

32

[Diagram: the data pipeline at scale — Acquisition → Parse → Storage → Transform/Explore → Vectorization → Train → Model → Expose → Presentation, backed by Spark Streaming, Dataframes/Spark SQL, HDFS, PySpark, MLlib/spark.ml, and model.save(). Marker: "We are Here."]

33

What Is Exploratory Data Analysis?

• Developed at Bell Labs in the 1960’s by John Tukey

• Techniques used to visualize and summarize data

• Five-number summary: describe()

• Distributions: box plots, stem and leaf, histogram, scatterplot

34

Goals of Exploratory Data Analysis

• Gain greater intuition

• Validate our data (consistency and completeness)

• Make comparisons between distributions

• Find outliers

• Treat missing data

• Summarize data (a statistic: one number that represents many numbers)

35

Case Study: DonorsChoose.org

36

http://data.donorschoose.org/open-data/overview/

37

Unique Values

• rdd.distinct()

• rdd.countApproxDistinct(relative_accuracy)

http://content.research.neustar.biz/blog/hll.html

http://dx.doi.org/10.1145/2452376.2452456
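A quick sketch on a toy RDD showing exact vs. approximate distinct counts (the 5% relative accuracy below is just an illustrative choice):

ids = sc.parallelize([1, 2, 2, 3, 3, 3] * 1000)

ids.distinct().count()         # exact => 3
ids.countApproxDistinct(0.05)  # approximate (HyperLogLog), within ~5% of 3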

38

Missing Values

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions

• column.isNull()

• dataframe.fillna()
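A minimal sketch of both calls on a hypothetical DataFrame named projects_df with a school_state column:

# count rows where school_state is missing
projects_df.filter(projects_df.school_state.isNull()).count()

# replace missing values with per-column defaults and keep going
clean_df = projects_df.fillna({"school_state": "unknown"})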

39

Missing Values

40

Frequently Occurring Values

• dataframe.freqItems(columns, support)
  (support = the required minimum proportion of rows)

http://dl.acm.org/citation.cfm?doid=762471.762473

Note: this is an approximate algorithm that always returns all the frequent items, but may contain false positives.

Summary Statistics

• dataframe.describe(column_name)
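A sketch of both calls on the hypothetical projects_df (the column names here are illustrative):

# count, mean, stddev, min, max of a numeric column
projects_df.describe("total_price_excluding_optional_support").show()

# items appearing in at least 1% of rows for the given columns (may include false positives)
projects_df.freqItems(["school_state", "resource_type"], support=0.01).collect()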

41

Interlude: Sometimes numbers aren’t enough!

42

Anscombe’s Quartet

[Figure: the four Anscombe scatterplots (x1/y1 through x4/y4), all drawn on the same axes.]

All four datasets share (nearly) identical summary statistics:

Mean (x): 9
Sample variance (x): 11
Mean (y): 7.50
Sample variance (y): 4.127
Correlation: 0.816
Linear regression: y = 3.00 + 0.500x

43


Notebook

https://github.com/Jay-Oh-eN/data-scientists-guide-apache-spark/blob/master/donors_choose_eda.ipynb

Review

47

• The data science process is inherently interactive

• Spans many scales of data and computation

• Data pipelines require linking many diverse tasks (and data)

• Quick insights necessary for fast iteration

How Spark can help

• Interactive REPL

• Rapid computation (especially aggregates) on large amounts of data

• High level abstractions for efficient querying of data

• “Condense” data for easier local exploration and visualization

48

Natural Language Processing

50

[Diagram: the data pipeline at scale — Acquisition → Parse → Storage → Transform/Explore → Vectorization → Train → Model → Expose → Presentation, backed by Spark Streaming, Dataframes/Spark SQL, HDFS, PySpark, MLlib/spark.ml, and model.save(). Marker: "We are Here."]

51

Natural Language Processing

[Illustration: documents represented as rows of word-count vectors, e.g. [1, 3, 1, 1, 2, 0, 1, 0], [0, 1, 4, 0, 0, 1, 1, 1], …]

52

DonorsChoose: Project Essays

53

Bag of Words

• Document: Single row of data/corpus

• Corpus: Entire set of all documents

• Vocabulary: Set of all words in corpus

• Vector: Mathematical representation of document

(counts of word occurrences)

54

Bag of Words

original document → (Tokenization) → dictionary of word counts → (Vectorization) → feature vector

"The brown fox" → { "the": 1, "brown": 1, "fox": 1 } → [0, 0, 1, 0, 1, 0, ...]

55

Tokenization

56

Vectorization

57

With MLlib
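The MLlib code shown on this slide isn't reproduced in the transcript; a minimal sketch of tokenization + vectorization with spark.ml feature transformers, assuming an essays_df DataFrame with an essay string column (both names are hypothetical), might look like:

from pyspark.ml.feature import Tokenizer, HashingTF

tokenizer = Tokenizer(inputCol="essay", outputCol="words")
words_df = tokenizer.transform(essays_df)

hashing_tf = HashingTF(inputCol="words", outputCol="features", numFeatures=2**18)
features_df = hashing_tf.transform(words_df)
features_df.select("features").show(3)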

58

Vector Space Model

By Riclas (Own work), CC BY 3.0, via Wikimedia Commons

Similarity is a measure of “distance”

59

Interlude: How to Scale

Start small (data) and fast (development) → test → increase the size of the data set → test → optimize and productionize → PROFIT! $$$

60

TF-IDF

• Measure of the discriminatory power of a word (feature)

• Highest when a term occurs many times in a small number of documents

• Lowest when a term occurs few times in a document, or many times across the corpus

• Useful for information retrieval (queries) and keyword extraction (among other things)

tf(t, d) = f_d(t) / |d|

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )
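A minimal MLlib sketch of computing TF-IDF vectors, assuming tokens_rdd (a hypothetical name) is an RDD where each element is the list of a document's tokens:

from pyspark.mllib.feature import HashingTF, IDF

tf = HashingTF().transform(tokens_rdd)  # term-frequency vectors
tf.cache()                              # tf is reused by fit() and transform(), so cache it
idf_model = IDF().fit(tf)               # compute idf(t, D) over the corpus
tfidf = idf_model.transform(tf)         # tf-idf weighted vectors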

61

TF-IDF

62

TF-IDF: most common vs. least common terms

63

Summarization

64

Scale Up

65

Notebook

https://github.com/Jay-Oh-eN/data-scientists-guide-apache-spark/blob/master/natural_language_processing.ipynb

66

Review

• We need to represent text as vectors to model documents

• The Bag-of-words model uses word counts (tf-idf improves on this)

• In vector space, we can compare documents using linear algebra

• Spark provides feature transformers to handle text input

Word2Vec

68

Vector Space Model

Source: deeplearning4j

69

Vector Space Model

Source: vector representations of words, from Mikolov T., et al.: Linguistic Regularities in Continuous Space Word Representations, NAACL 2013.

70

[Diagram: the data pipeline at scale — Acquisition → Parse → Storage → Transform/Explore → Vectorization → Train → Model → Expose → Presentation, backed by Spark Streaming, Dataframes/Spark SQL, HDFS, PySpark, MLlib/spark.ml, and model.save(). Marker: "We are Here."]

71

Predict Context

Source: https://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis
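MLlib ships a Word2Vec implementation; a hedged sketch of training it and querying for similar words, again assuming tokens_rdd is an RDD of token lists and using an arbitrary query word:

from pyspark.mllib.feature import Word2Vec

w2v_model = Word2Vec().setVectorSize(100).setSeed(42).fit(tokens_rdd)

# words whose learned vectors are closest to "science" (cosine similarity);
# assumes "science" occurs often enough to be in the vocabulary
for word, similarity in w2v_model.findSynonyms("science", 5):
    print("%s %.3f" % (word, similarity))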

72

doc2vec

73

Notebook

https://github.com/Jay-Oh-eN/data-scientists-guide-apache-spark/blob/master/word2vec_search.ipynb

Search and Results

75

Machine Learning Pipeline

[Diagram: Raw Text → (Tokenization) → Words → (Vectorization: word2vec/doc2vec) → Contextualized Documents; a Query goes through the same tokenization and vectorization and is matched against the documents to produce Search Results.]
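The pipeline code on the following slides isn't reproduced in the transcript; a minimal spark.ml Pipeline sketch in the same spirit (Tokenizer + Word2Vec over the hypothetical essays_df with an essay column; averaging word vectors stands in for doc2vec here):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, Word2Vec

tokenizer = Tokenizer(inputCol="essay", outputCol="words")
word2vec = Word2Vec(inputCol="words", outputCol="doc_vector", vectorSize=100)

pipeline = Pipeline(stages=[tokenizer, word2vec])
model = pipeline.fit(essays_df)              # fits Word2Vec on the tokenized essays
contextualized = model.transform(essays_df)  # adds a doc_vector column (average of each essay's word vectors)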

76

Machine Learning Pipeline

77

Machine Learning Pipeline

78

Machine Learning Pipeline

79

Machine Learning Pipeline

Questions?

Thank You!

Jonathan Dinu, VP of Academic Excellence, Galvanize

@clearspandex

Appendix: Installing Spark

82

Installation: Requirements

• Spark binary (version 1.4.1)

• Java JDK 6/7

• Scientific Python (and Jupyter notebook)

• py4j

• (Optional) IRKernel (for Jupyter)

83

Installation: Requirements

• Spark binary (version 1.4.1)

• Java JDK 6/7

• Scientific Python (and Jupyter notebook)

• py4j

• (Optional) IRKernel (for Jupyter)

84

Installation: Requirements

NOTE: Please do not install Spark with

• Homebrew on OSX

• Cygwin on Windows

85

Installation: Spark

• Find your OS here: http://spark.apache.org/downloads.html

• Select "Pre-built for Hadoop 2.4" (or earlier) under "Choose a package type"

• Download the tar package spark-1.4.1-bin-hadoop1.tgz (if you are not sure, pick the latest version)

Make sure you are downloading the binary version, not the source version.

86

Installation: Configuration

• Unzip the file and place it in your home directory (/Users/jonathandinu/)

• Set PATH: include the following lines in your ~/.bash_profile (or ~/.bashrc):

87

Installation: Configuration

• Unzip the file and place it in your home directory (/Users/jonathandinu/)

• Set PATH: include the following lines in your ~/.bash_profile (or ~/.bashrc):

export SPARK_HOME=/full/path/to/your/unzipped/spark/folder

export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

88

Installation: Java JDK

• http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

• Find download for your OS

• Follow install instructions/wizard

• Make sure you get JDK instead of JRE

89

Installation: Requirements

• Spark binary

• Java JDK 6/7

• Scientific Python (and Jupyter notebook)

• py4j

• (Optional) IRKernel (for Jupyter)

90

Installation: Scientific Python

• http://continuum.io/downloads

91

Installation: Scientific Python

• http://continuum.io/downloads

• Find download for your OS (make sure it is Python 2.7)

• Follow install instructions/wizard

92

Installation: Scientific Python

• http://continuum.io/downloads

• Find download for your OS (make sure it is Python 2.7)

• Follow install instructions/wizard

To make sure it installed correctly:

ipython notebook

93

And finally: pip install py4j

94

Installation: Test It All Out

jonathan$ ipython
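Inside that IPython session, a quick smoke test might look like the following sketch ('local[4]' simply means four local threads):

import pyspark as ps

sc = ps.SparkContext('local[4]')
print(sc.parallelize(range(10)).count())  # => 10
sc.stop()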

95

Installation: Requirements

• Spark binary

• Java JDK 6/7

• Scientific Python (and Jupyter notebook)

• py4j

• (Optional) IRKernel (for Jupyter)

96

Installation: IRKernel (Jupyter kernel for R)

• Make sure R is installed: https://cran.r-project.org/bin/

• Install kernel via R (get into an R shell):

install.packages(c('rzmq','repr','IRkernel','IRdisplay'),

repos = c('http://irkernel.github.io/', getOption('repos')))

IRkernel::installspec()

97

And in the notebook:

# Set this to where Spark is installed

Sys.setenv(SPARK_HOME="/Users/jonathandinu/spark")

# This line loads SparkR from the installed directory

.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

https://github.com/apache/spark/tree/master/R

Installation: IRKernel (Jupyter kernel for R)

98

Installation: IRKernel (Jupyter kernel for R)

And in the notebook: library(SparkR)

https://github.com/apache/spark/tree/master/R

99

Installation: IRKernel (Jupyter kernel for R)

https://github.com/apache/spark/tree/master/R

100

Note: If for any reason you cannot get Spark installed on your OS following these instructions, Cloudera and Hortonworks provide Linux VMs with Spark installed.

• http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-4-x.html

• http://hortonworks.com/products/hortonworks-sandbox/#install

101

Review

• Command-line Spark shell: ./bin/pyspark

• Spark module: import pyspark as ps

• Jupyter Notebook interface: ipython notebook

• Also R support in the notebook (or RStudio)!

Spark Deployment

Configuring a cluster

103

Spark Deployment

Local Mode:

• Single-threaded: SparkContext('local')

• Multi-threaded: SparkContext('local[4]')

• Pseudo-distributed cluster

Cluster Mode:

• Standalone

• Mesos

• YARN

• Amazon EC2

104

Spark Deployment: Local

Single-threaded: sequential execution allows easier debugging of program logic

Multi-threaded: concurrent execution leverages parallelism and allows debugging of coordination

Pseudo-distributed cluster: distributed execution allows debugging of communication and I/O

105

Standalone

• Packaged with Spark core

• Great if all you need is a dedicated Spark cluster

• Doesn’t support integration with any other applications on a cluster.

The Standalone cluster manager also has a high-availability mode that can leverage Apache ZooKeeper to enable standby master nodes.

106

Mesos

• General purpose cluster and global resource manager (Spark, Hadoop, MPI, Cassandra, etc.)

• Two-level scheduler: enables pluggable scheduler algorithms

• Multiple applications can co-locate (like an operating system for a cluster)

107

YARN

• Created to scale Hadoop, optimized for Hadoop (stateless batch jobs with long runtimes)

• Monolithic scheduler: manages cluster resources as well as schedules jobs

• Not well suited for long-running, real-time, or stateful/interactive services (like database queries)

108

EC2

• Launch scripts bundled with Spark

• Elastic and ephemeral cluster

• Sets up: Spark, HDFS, Hadoop MapReduce

109

Spark Deployment: Cluster

Standalone: encapsulated cluster manager isolates complexity

Mesos: global resource manager facilitates multi-tenant and heterogeneous workloads

YARN: integrates with existing Hadoop clusters and applications

EC2: elastic scalability and ease of setup

110

Pseudo-distributed local cluster

[Diagram: a pseudo-distributed cluster running entirely on your laptop — a Master (standalone scheduler) with a web UI for monitoring/logging, plus two Worker Nodes, each running an Executor with a cache and tasks.]

111

Pseudo-distributed local cluster

Master:

${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.master.Master \
    -h 127.0.0.1 \
    -p 7077 \
    --webui-port 8080

Workers (x 2):

${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.worker.Worker \
    -c 1 \
    -m 1G \
    spark://127.0.0.1:7077

112

Pseudo-distributed local cluster

One Master + two Workers: run each process in a separate terminal window
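To check that the cluster works, a driver can then connect to that master URL from another terminal; a sketch:

import pyspark as ps

# point the driver at the pseudo-distributed master started above
sc = ps.SparkContext('spark://127.0.0.1:7077', appName='pseudo-cluster-test')
print(sc.parallelize(range(1000)).sum())  # => 499500
sc.stop()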

113

EC2: Setup

1. Create AWS account: https://aws.amazon.com/account/

2. Get Access keys:

114

EC2: Setup

1. Include the following lines in your ~/.bash_profile (or ~/.bashrc):

export AWS_ACCESS_KEY_ID=xxxxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxx

2. Download an EC2 keypair:

115

EC2: Launch

1. Launch the EC2 cluster with the script in $SPARK_HOME/ec2:

./spark-ec2 -k keyname -i ~/.ssh/keyname.pem --copy-aws-credentials --instance-type=m1.large -m m1.large -s 19 launch spark

116

117

EC2: Scripts

Login to the Master:
./spark-ec2 -k keyname -i ~/.ssh/keyname.pem login spark

Stop cluster:
./spark-ec2 stop spark

Terminate cluster:
./spark-ec2 destroy spark

Restart cluster (after stopping):
./spark-ec2 -k keyname -i ~/.ssh/keyname.pem start spark

118

Setup IPython/Jupyter

Login to the Master:
./spark-ec2 -k keyname -i ~/.ssh/keyname.pem login spark

Install the needed packages (on the master):

# Install all the necessary packages on Master

yum install -y tmux

yum install -y pssh

yum install -y python27 python27-devel

yum install -y freetype-devel libpng-devel

wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27

easy_install-2.7 pip

easy_install py4j

pip2.7 install ipython==2.0.0

pip2.7 install pyzmq==14.6.0

pip2.7 install jinja2==2.7.3

pip2.7 install tornado==4.2

pip2.7 install numpy

pip2.7 install matplotlib

pip2.7 install nltk

119

Setup IPython/Jupyter

Login to the Master:
./spark-ec2 -k keyname -i ~/.ssh/keyname.pem login spark

Install the needed packages (on the workers):

# Install all the necessary packages on Workers

pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel

pssh -h /root/spark-ec2/slaves "wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"

pssh -h /root/spark-ec2/slaves easy_install-2.7 pip

pssh -t 10000 -h /root/spark-ec2/slaves pip2.7 install numpy

pssh -h /root/spark-ec2/slaves pip2.7 install nltk

120

Allow inbound requests to enable the IPython/Jupyter notebook (WARNING: this creates a security risk).

121

IPython/Jupyter Profile

Login to the Master:
./spark-ec2 -k keyname -i ~/.ssh/keyname.pem login spark

Set the notebook password:

ipython profile create default

python -c "from IPython.lib import passwd; print passwd()" \
    > /root/.ipython/profile_default/nbpasswd.txt

cat /root/.ipython/profile_default/nbpasswd.txt
# sha1:128de302ca73:6b9a8bd5bhjde33d48cd65ad9cafb0770c13c9df

122

Configure IPython/Jupyter Settings

/root/.ipython/profile_default/ipython_notebook_config.py:

# Configuration file for ipython-notebook.
c = get_config()

# Notebook config
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
# It is a good idea to put it on a known, fixed port
c.NotebookApp.port = 8888

PWDFILE = "/root/.ipython/profile_default/nbpasswd.txt"
c.NotebookApp.password = open(PWDFILE).read().strip()

123

Configure IPython/Jupyter Settings

/root/.ipython/profile_default/startup/pyspark.py:

# Configure the necessary Spark environment
import os
os.environ['SPARK_HOME'] = '/root/spark/'

# And the Python path
import sys
sys.path.insert(0, '/root/spark/python')

# Detect the PySpark URL
CLUSTER_URL = open('/root/spark-ec2/cluster-url').read().strip()
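With that startup file in place, a notebook cell can create a context against the cluster; a sketch (not shown in the deck) of how CLUSTER_URL would typically be used:

import pyspark as ps

# CLUSTER_URL comes from the startup file above (e.g. spark://<master-dns>:7077)
sc = ps.SparkContext(CLUSTER_URL, appName='notebook')
print(sc.parallelize(range(100)).count())  # => 100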

124

Configure IPython/Jupyter Settings

Add the following to /root/spark/conf/spark-env.sh:

export PYSPARK_PYTHON=python2.7

Sync across the workers:
~/spark-ec2/copy-dir /root/spark/conf

Make sure the master's environment is correct:
source /root/spark/conf/spark-env.sh

125

IPython/Jupyter initialization (on the master)

Start a remote window manager (screen or tmux):
screen

Start the notebook server:
ipython notebook

Detach from the session:
Ctrl-a d

126

IPython/Jupyter login

On your laptop:

http://[YOUR MASTER IP/DNS HERE]:8888

127

IPython/Jupyter login

Test that it all works:

128

EC2: Data

HDFS (ephemeral): /root/ephemeral-hdfs

Amazon S3: s3n://bucket_name

HDFS (persistent): /root/persistent-hdfs

129

Review

• Local mode and cluster mode each have their benefits

• Spark can be run on a variety of cluster managers

• Amazon EC2 enables elastic scaling and ease of development

• By leveraging IPython/Jupyter you can get the performance of a cluster with the ease of interactive development
