Apache Spark Overview part1 (20161107)

Introduction to Apache Spark, Part 1

2016.11.07

민형기

Contents

• MapReduce

• Apache Spark

• Spark SQL

Brief History

MapReduce

MapReduce History

• 1979 – Stanford, MIT, CMU, etc.
  • set/list operations in LISP, Prolog, etc. for parallel processing

• 2004 – Google
  • MapReduce (2004): Simplified Data Processing on Large Clusters
  • http://research.google.com/archive/mapreduce.html

• 2006 – Apache Hadoop: http://hadoop.apache.org/
  • Hadoop, originating from the Nutch Project, Doug Cutting

• 2008 – Yahoo
  • Web scale search indexing
  • Hadoop Summit, HUG, etc.

• 2009 – Amazon AWS
  • Elastic MapReduce
  • Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc.

• 2012.01 – Apache Hadoop 1.0
  • MapReduce 1.0: cluster resource management & data processing

• 2013.10 – Apache Hadoop 2.2
  • MapReduce 2.0: data processing
  • YARN: cluster resource management

Jeff Dean

Doug Cutting

29 Truths about Jeff Dean (in Korean): http://ppss.kr/archives/16672

MapReduce Motivation

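• Processing the data Google uses requires many machines.
  • In particular, the input data is large, and the computation must be distributed across many machines to finish in a reasonable time.
  • Distributed processing is also needed when building a web index, which means processing a vast number of web pages.

• The kinds of data processing keep growing
  • Computing search indexes (inverted indexes), various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries on a given day, etc.

• Most of these are conceptually simple, but the code becomes complex once distributed processing is taken into account (parallelizing the work, distributing the data, handling failures, etc.)

• A distributed data processing framework is needed

http://research.google.com/archive/mapreduce.html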

MapReduce Programming Model

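• Map and Reduce are terms that come from functional languages such as Lisp
  • Map: apply a function to a collection of data to produce a new collection
  • Reduce: apply a function to a collection of data to combine it into a single result

• Map: <key, value> → <key', value'>*

• Reduce: <key', value'*> → value''*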
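The Map and Reduce signatures above can be sketched in plain Python, without Hadoop. This is only a single-process illustration: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names chosen for this sketch and are not part of any framework API.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: <key, value> -> list of <key', value'>. Here: emit (word, 1) per word."""
    return [(word, 1) for word in value.split()]


def reduce_fn(key, values):
    """Reduce: <key', [value', ...]> -> list of value''. Here: sum the counts."""
    return [sum(values)]


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # the "shuffle": group values by key
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


# Input records are (line number, line text) pairs.
records = [(0, "the cat sat on the mat"), (1, "the aardvark sat on the sofa")]
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)
```

In a real cluster the map calls, the grouping step, and the reduce calls run on different machines; the sketch keeps only the data flow.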

MapReduce Process

MapReduce Design Pattern

• Basic MapReduce Patterns
  • Counting, Summing
  • Collating
  • Filtering ("Grepping"), Parsing, and Validation
  • Distributed Task Execution
  • Sorting

• Not-So-Basic MapReduce Patterns
  • Iterative Message Passing (Graph Processing)
  • Distinct Values (Unique Items Counting)
  • Cross-Correlation

• Relational MapReduce Patterns
  • Selection
  • Projection
  • Union
  • Intersection
  • Difference
  • GroupBy and Aggregation
  • Joining

• Machine Learning and Math MapReduce Algorithms

https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/

MapReduce Limitations

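• Programming directly in MR is difficult.
  • It takes a lot of development effort, and performance is hard to guarantee.
  • Performance varies with the developer's skill
  • Productivity is much lower than with existing SQL implementations

• MapReduce performs well for one-pass computation but is inefficient for multi-pass algorithms.
  • Optimized for disk IO / makes poor use of memory
  • Iterative algorithms are inefficient because every iteration incurs disk IO

• MR is not optimized for many kinds of computation.
  • Specialized systems are needed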

https://stanford.edu/~rezab/sparkclass/slides/reza_introtalk.pdf


MapReduce Limitations

[Figure: general batch processing (MapReduce) versus specialized systems for iterative, interactive, and streaming apps (Storm, Giraph, Drill, Tez, Impala, Tajo, Druid, Presto, …)]

Apache Spark

What is Apache Spark?

• Fast and general engine for large-scale data processing.

• Features
  • Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
  • Ease of use: applications are easy to write in Java, Scala, Python, and R
  • Generality: batch, streaming, iterative, interactive
  • Runs everywhere: Hadoop, Mesos, standalone

• Started in 2009 at UC Berkeley AMPLab, open sourced in 2010

Spark Stack (Unified Platform)

[Figure: the Spark stack. Libraries: Spark SQL, Spark Streaming (streaming), MLlib (machine learning), GraphX (graph), all on top of Spark Core / RDD. Cluster managers: Standalone, YARN, Mesos. Languages: Scala, Java, Python, R]

Benefit for Users

The same engine can perform data extraction, model training, and interactive queries.

[Figure: with separate engines, each step (parse, train, query) reads from and writes to the DFS; with Spark, data is read from HDFS once and parse, train, and query run in the same engine, which enables interactive analysis]

https://spark-summit.org/2013/zaharia-the-state-of-spark-and-where-were-going/

Spark History

• 2009 – Development started at UC Berkeley RAD Lab (AMP Lab)

• 2010 – Open sourced

• 2010 – Spark: Cluster Computing with Working Sets

• 2012 – Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

• 2013 – Became an Apache project

• 2014 – Apache Top-Level Project

• 2014 – World record for large-scale sorting with Spark (Databricks)

• May 2014 – 1.0 release

• July 2016 – 2.0 release

http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf

https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

Spark - Motivation

• MapReduce made big data analysis easy
  • But it only suits a directed (one-way) data flow model

• Where MapReduce falls short
  • Iterative jobs: machine learning, graph processing
  • Interactive analytics: ad-hoc queries (Hive, Pig)
  • In both cases, data sharing is slow

• How can this be improved?
  • Fast data sharing
  • General DAGs

https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

http://www.slideshare.net/yongho/rdd-paper-review

Operations in MapReduce

• In MR, data sharing is slow because of replication, serialization, and disk IO

• Most MR applications spend about 90% of their time on HDFS reads and writes

• Iterative Operations • Interactive Operations

https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm


Operations in Spark RDD

• Iterative Operations • Interactive Operations


• RDD: data is shared in memory

• Sharing data through memory is 10 to 100 times faster than through the network or disk


Apache Spark – Time to Sort 100TB

http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east

Scala, Java, Python, R

// Scala:
val lines = sc.textFile(…)
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)

// Java:
JavaRDD<String> lines = sc.textFile("data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

# Python:
lines = sc.textFile(…)
pairs = lines.map(lambda s: (s, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

Spark Context

• Every Spark application needs a SparkContext
  • Main entry point for the Spark API
  • Represents the connection to a Spark cluster

• The Spark shell provides a preconfigured SparkContext named sc

• Scala (spark-shell):

• Python (pyspark):

Master

• The master parameter of SparkContext determines which cluster to use

master | description

local | run Spark locally with one worker thread (no parallelism)

local[K] | run Spark locally with K worker threads (ideally set to # cores)

spark://host:port | connect to a Spark standalone cluster; PORT depends on config (7077 by default)

mesos://host:port | connect to a Mesos cluster; PORT depends on config (5050 by default)

yarn | connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode; the cluster location is found from the HADOOP_CONF_DIR or YARN_CONF_DIR variable

Master

1. Connect to the cluster manager to allocate resources for the application

2. Acquire executors on cluster nodes to run the application's tasks

3. Send the application code to the executors

4. Send tasks to the executors to run

http://spark.apache.org/docs/latest/cluster-overview.html

Master – YARN vs. Standalone

• Comparison by master type (YARN vs. Standalone)

 | YARN Cluster | YARN Client | Spark Standalone

Driver runs in: | Application Master | Client | Client

Who requests resources? | Application Master | Application Master | Client

Who starts executor processes? | YARN NM | YARN NM | Spark Workers

Persistent services | YARN RM / NM | YARN RM / NM | Spark Master / Workers

Supports Spark Shell? | No | Yes | Yes

http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/


Resilient Distributed Datasets(RDD)

• Primary abstraction in Spark
  • An immutable collection of objects that can be operated on in parallel

• RDD
  • Resilient: if data held in memory is lost, it can be recreated
  • Distributed: data is stored in memory across the cluster

• Main idea: Resilient Distributed Datasets
  • Immutable collections of objects, spread across the cluster
  • Users can control the partitioning and persistence (memory, disk, etc.) of the collection
  • RDD creation: only storage → RDD or RDD → RDD
  • Statically typed: RDD[T] has objects of type T
  • Fault-tolerant: only the lineage of how the data was derived is recorded

https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

https://gist.github.com/hellerbarde/2843375

http://www.slideshare.net/yongho/rdd-paper-review



RDD - Operations

• Two types: transformations and actions

• Transformation operations
  • Create a new RDD by transforming an existing one, e.g. rdd.map(…)
  • Lazy: nothing is computed when the transformation is declared
  • Recorded in the lineage

• Action operations
  • Return or save the computed result, e.g. rdd.count()
  • Executed immediately
  • The execution plan is computed from the information in the lineage (the transformation operations)
  • Executed along the optimized path

http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
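The split between lazy transformations and eager actions can be illustrated with a toy class in plain Python. `LazyDataset` is a hypothetical name for this sketch, not Spark's RDD implementation; it only shows that transformations record lineage and that an action replays it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations record lineage, actions run it."""

    def __init__(self, source, lineage=()):
        self._source = source            # base data (storage, in Spark's case)
        self._lineage = tuple(lineage)   # recorded (kind, function) steps

    def map(self, f):
        # Transformation: returns a new dataset and computes nothing.
        return LazyDataset(self._source, self._lineage + (("map", f),))

    def filter(self, f):
        # Transformation: also only extends the lineage.
        return LazyDataset(self._source, self._lineage + (("filter", f),))

    def _compute(self):
        data = iter(self._source)
        for kind, f in self._lineage:
            data = map(f, data) if kind == "map" else filter(f, data)
        return data

    def collect(self):
        # Action: replays the lineage and returns all results.
        return list(self._compute())

    def count(self):
        # Action: replays the lineage and counts the results.
        return sum(1 for _ in self._compute())


calls = []


def double(x):
    calls.append(x)  # record every real invocation
    return x * 2


base = LazyDataset([1, 2, 3, 4])
big = base.map(double).filter(lambda x: x > 4)  # nothing has run yet
calls_before_action = len(calls)
result = big.collect()                          # the action triggers the work
print(calls_before_action, result)
```

Because every transformation returns a new object, `base` itself never changes, which mirrors the immutability of RDDs.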


RDD – Transformations

Transformation | Meaning

map(f: T => U) | RDD[T] => RDD[U]

filter(f: T => Bool) | RDD[T] => RDD[T]

flatMap(f: T => Seq[U]) | RDD[T] => RDD[U]

mapPartitions(f: Iterator[T] => Iterator[U]) | RDD[T] => RDD[U], runs separately on each partition (block)

mapPartitionsWithIndex(f: (Int, Iterator[T]) => Iterator[U]) | RDD[T] => RDD[U], the integer value is the partition index

sample(withReplacement, fraction, seed) | RDD[T] => RDD[T], samples the given fraction of the data

union(otherDataset) | (RDD[T], RDD[T]) => RDD[T], A ∪ B

intersection(otherDataset) | (RDD[T], RDD[T]) => RDD[T], A ∩ B

distinct([numTasks]) | RDD[T] => RDD[T], returns the distinct elements of the source dataset

groupByKey([numTasks]) | RDD[(K,V)] => RDD[(K, Iterable[V])]

reduceByKey(f: (V,V) => V, [numTasks]) | RDD[(K,V)] => RDD[(K,V)], aggregates the values of each key with f

sortByKey([ascending], [numTasks]) | RDD[(K,V)] => RDD[(K,V)], sorted by key

http://spark.apache.org/docs/1.6.2/programming-guide.html

http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf


RDD – Transformations

Transformation | Meaning

join(otherDataset, [numTasks]) | (RDD[(K,V)], RDD[(K,W)]) => RDD[(K,(V,W))], all (v,w) pairs for each key k; outer joins: leftOuterJoin, rightOuterJoin, fullOuterJoin

cogroup(otherDataset, [numTasks]) | (RDD[(K,V)], RDD[(K,W)]) => RDD[(K, (Iterable[V], Iterable[W]))]; alias: groupWith

cartesian(otherDataset) | (RDD[T], RDD[U]) => RDD[(T,U)], the cartesian product of the two RDDs: every pair (a,b) with a in RDD[T] and b in RDD[U]

pipe(command, [envVars]) | RDD[T] => RDD[String], runs a shell command and turns its output into an RDD: elements are written to the process's stdin as lines, and the lines written to stdout are returned as an RDD[String]; see http://blog.madhukaraphatak.com/pipe-in-spark/

coalesce(numPartitions) | RDD[T] => RDD[T], reduces the number of partitions of the RDD to numPartitions

repartition(numPartitions) | RDD[T] => RDD[T], reshuffles the data in the RDD into the given number of partitions (more or fewer); always shuffles all data over the network

repartitionAndSortWithinPartitions(partitioner) | RDD[(K,V)] => RDD[(K,V)], repartitions with the given partitioner and sorts records within each resulting partition; more efficient than calling repartition and then sorting
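The difference between cogroup and join in the table can be seen on small in-memory lists. The sketch below is plain Python, not the Spark API; the functions `cogroup` and `join` are local helpers written for this illustration and operate on ordinary lists of (key, value) pairs.

```python
from collections import defaultdict


def cogroup(left, right):
    """For each key, collect the values from both sides: key -> ([v...], [w...])."""
    groups = defaultdict(lambda: ([], []))
    for key, v in left:
        groups[key][0].append(v)
    for key, w in right:
        groups[key][1].append(w)
    return dict(groups)


def join(left, right):
    """Inner join: every (v, w) combination for keys present on both sides."""
    return [
        (key, (v, w))
        for key, (vs, ws) in cogroup(left, right).items()
        for v in vs
        for w in ws
    ]


left = [("a", 1), ("b", 2), ("a", 3)]
right = [("a", "x"), ("c", "y")]

grouped = cogroup(left, right)
joined = join(left, right)
print(grouped)
print(joined)
```

Keys that appear on only one side ("b" and "c") survive in the cogroup result with an empty list on the other side, but they drop out of the inner join.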



RDD - Transformations

Scala:

val distFile = sc.textFile("README.md")
distFile.map(l => l.split(" ")).collect()
distFile.flatMap(l => l.split(" ")).collect()

Python:

distFile = sc.textFile("README.md")
distFile.map(lambda x: x.split(' ')).collect()
distFile.flatMap(lambda x: x.split(' ')).collect()

RDD – Actions

Action | Meaning

reduce(f: (T,T) => T) | RDD[T] => T, aggregates all elements of the dataset using f and returns the result

collect() | RDD[T] => Array[T], returns all elements of the dataset as an array

count() | RDD[T] => Long, returns the number of elements in the dataset

first() | RDD[T] => T, returns the first element

take(n) | RDD[T] => Array[T], returns the first n elements

takeSample(withReplacement, num, [seed]) | RDD[T] => Array[T], returns a random sample of num elements

takeOrdered(n, [ordering]) | RDD[T] => Array[T], returns the first n elements in sorted order

saveAsTextFile(path) | RDD[T] => Unit, saves all elements as text files on the local filesystem, HDFS, etc.

saveAsSequenceFile(path) | RDD[(K,V)] => Unit, saves all elements as a Hadoop SequenceFile

saveAsObjectFile(path) | RDD[T] => Unit, saves all elements in a simple format using Java serialization

countByKey() | RDD[(K,V)] => Map[K, Long], returns the count for each key

foreach(f: T => Unit) | RDD[T] => Unit, runs the function f on each element

saveAsNewAPIHadoopDataset | RDD[(K,V)] => Unit, saves the dataset through a Hadoop API 'OutputFormat' (mapreduce.OutputFormat) as an MR job; used for HBase BulkLoad




RDD - Actions

Scala:

val f = sc.textFile("README.md")
val words = f.flatMap(l => l.split(" ")).map(word => (word, 1))
words.reduceByKey(_ + _).collect

Python:

from operator import add
f = sc.textFile("README.md")
words = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
words.reduceByKey(add).collect()

RDD - Persistence

• Unlike MapReduce, Spark can persist (or cache) a dataset.

• Each node stores the partitions it computes in memory or storage so that other RDD operations (transformations/actions) can reuse them

• Later actions can be about 10x faster

• One of the most important Spark features

scala> val wordCounts = rdd.flatMap(x => x.split(" ")).map(s => (s, 1)).reduceByKey((a, b) => a + b).cache()
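The effect of cache() can be sketched in plain Python by counting how often the underlying computation runs. `CachedDataset` and `word_pairs` are hypothetical names for this illustration; Spark's real persistence works per partition and per storage level, which the sketch does not model.

```python
class CachedDataset:
    """Toy dataset: recomputes on every action unless cache() was requested."""

    def __init__(self, compute):
        self._compute = compute      # function that produces the data
        self._use_cache = False
        self._cached = None
        self.computations = 0        # how many times the data was really computed

    def cache(self):
        # Like a transformation, cache() only marks the dataset; nothing runs yet.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return list(self._cached)        # reuse the stored result
        self.computations += 1
        data = list(self._compute())
        if self._use_cache:
            self._cached = data              # first action fills the cache
        return list(data)


def word_pairs():
    text = "the cat sat on the mat"
    return ((word, 1) for word in text.split())


uncached = CachedDataset(word_pairs)
uncached.collect()
uncached.collect()

cached = CachedDataset(word_pairs).cache()
cached.collect()
cached.collect()

print(uncached.computations, cached.computations)
```

The uncached dataset is computed once per action, while the cached one is computed only by the first action.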

RDD - Persistence

Storage Level | Meaning

MEMORY_ONLY | Stores the RDD as deserialized Java objects in the JVM heap. This is the default level. If the RDD does not fit in memory, some partitions are not cached and are recomputed when needed.

MEMORY_AND_DISK | Stores the RDD as deserialized Java objects in the JVM heap. If the RDD does not fit in memory, the partitions that do not fit are stored on disk and read from there when needed.

MEMORY_ONLY_SER | Stores the RDD as serialized Java objects (one byte array per partition) in the JVM heap. More space-efficient, but more CPU-intensive to read.

MEMORY_AND_DISK_SER | Stores the RDD as serialized Java objects (one byte array per partition) in the JVM heap. If the RDD does not fit in memory, the partitions that do not fit are stored on disk and read from there when needed.

DISK_ONLY | Stores the RDD partitions only on disk.

MEMORY_ONLY_2 | Same as MEMORY_ONLY, but each partition is replicated on 2 nodes.

MEMORY_AND_DISK_2 | Same as MEMORY_AND_DISK, but each partition is replicated on 2 nodes.

OFF_HEAP | Stores the RDD in a serialized format in Tachyon. Compared with MEMORY_ONLY_SER, GC overhead is reduced. Useful with large heaps or many concurrent applications. Cached data is not lost if an executor crashes.

RDD - Persistence

Scala:

val f = sc.textFile("README.md")
val w = f.flatMap(l => l.split(" ")).map(word => (word, 1)).cache()
w.reduceByKey(_ + _).collect.foreach(println)

Python:

from operator import add
f = sc.textFile("README.md")
w = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).cache()
w.reduceByKey(add).collect()

RDD - Fault Tolerance

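The State of Spark, and Where We're Going Next - Matei Zaharia, Spark Summit (2013), youtu.be/nU6vO2EJAb4

• RDDs record the lineage of each transformation, so lost data can be recovered.

• Narrow Dependencies
  • Single node
  • Uses memory only
  • Fast
  • Recovering a subset of partitions is also fast

• Wide Dependencies
  • Multiple nodes
  • Causes a shuffle
  • Uses the network
  • Recovery takes a long time
  • Checkpointing is recommended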
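Why recovery is cheap for narrow dependencies and expensive for wide ones can be shown with partitions modelled as plain Python lists. This is a sketch under simplifying assumptions: the function names are hypothetical, and keys are assigned to output partitions with `key % num_partitions`, which is an assumption of the sketch rather than Spark's partitioner.

```python
def recompute_narrow(parent_partitions, lineage, index):
    """Narrow dependency: a lost partition needs only the matching parent partition."""
    data = parent_partitions[index]
    for f in lineage:
        data = [f(x) for x in data]   # replay each recorded per-element function
    return data


def recompute_wide(parent_partitions, num_partitions, index):
    """Wide dependency: one output partition needs records from every parent."""
    grouped = {}
    parents_read = 0
    for partition in parent_partitions:
        parents_read += 1
        for key, value in partition:
            if key % num_partitions == index:
                grouped.setdefault(key, []).append(value)
    return grouped, parents_read


# Narrow case: rebuild only partition 1 of a mapped dataset.
parent = [[1, 2], [3, 4], [5, 6]]
lineage = [lambda x: x * 10, lambda x: x + 1]
rebuilt = recompute_narrow(parent, lineage, 1)

# Wide case: rebuild output partition 1 of a group-by-key over 3 parent partitions.
pairs = [[(0, "a"), (1, "b")], [(0, "c"), (3, "d")], [(2, "e"), (1, "f")]]
grouped, parents_read = recompute_wide(pairs, 2, 1)

print(rebuilt, grouped, parents_read)
```

The narrow rebuild touches one parent partition; the wide rebuild has to scan all of them, which is why checkpointing is suggested after wide dependencies.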

RDD – Narrow vs. Wide Dependencies

RDD – Job Scheduling

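• Computation follows the direction of the DAG

• Stages are composed so that they can run locally where possible (i.e. so that they contain narrow dependencies)

• Stages are split wherever a shuffle is needed

• Data locality (HDFS) is taken into account when choosing the node that processes each partition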
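The rule "split stages wherever a shuffle is needed" can be sketched for a linear chain of operations. This is a simplification in plain Python: the set `SHUFFLE_OPS` and the function `split_into_stages` are assumptions of the sketch, a real DAG can branch, and in Spark part of a shuffle operation's work (the map-side write) actually belongs to the preceding stage.

```python
# Assumption for this sketch: these operations always require a shuffle.
SHUFFLE_OPS = {"groupByKey", "reduceByKey", "sortByKey", "repartition"}


def split_into_stages(operations):
    """Cut a linear chain of operations into stages at each shuffle boundary."""
    stages = [[]]
    for op in operations:
        if op in SHUFFLE_OPS and stages[-1]:
            stages.append([])      # a shuffle starts a new stage
        stages[-1].append(op)
    return stages


word_count = ["textFile", "flatMap", "map", "reduceByKey", "saveAsTextFile"]
stages = split_into_stages(word_count)
for number, stage in enumerate(stages):
    print(f"stage {number}: {' -> '.join(stage)}")
```

All operations inside one stage have narrow dependencies, so they can be pipelined on the node that holds the partition.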

Examples – Word Count

Input Data

the cat sat on the mat
the aardvark sat on the sofa

Result

aardvark 1
cat 1
mat 1
on 2
sat 2
sofa 1
the 4

http://www.slideshare.net/cloudera/spark-devwebinarslides-final


Examples – Word Count

[Figure: word count data flow, one column per step]

Input lines: "the cat sat on the mat", "the aardvark sat on the sofa"

After flatMap: the, cat, sat, on, the, mat, the, aardvark, sat, …

After map: (the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1), (the, 1), (aardvark, 1), (sat, 1), …

After reduceByKey: (aardvark, 1), (cat, 1), (mat, 1), (on, 2), (sat, 2), (sofa, 1), (the, 4)

val f = sc.textFile(file)
val w = f.flatMap(l => l.split(" ")).map(word => (word, 1))
val counts = w.reduceByKey(_ + _)
counts.saveAsTextFile(output)

RDD types along the pipeline: HadoopRDD, MapPartitionsRDD, MapPartitionsRDD, ShuffledRDD, Array

Examples - Estimate Pi

• Computing the value of Pi with the Monte Carlo method

• ./bin/run-example SparkPi 2 local

• Algorithm
  1. Draw a square, then inscribe a circle within it.
  2. Uniformly scatter objects of uniform size over the square.
  3. Count the number of objects inside the circle and the total number of objects.
  4. The ratio of the two counts is an estimate of the ratio of the two areas, which is π/4. Multiply the result by 4 to estimate π.
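The four steps can be sketched in plain Python on a single machine. This is not the SparkPi example that ships with Spark; `estimate_pi` is a hypothetical helper, and it samples only the quarter circle inside the unit square, whose area ratio is also π/4.

```python
import random


def estimate_pi(num_samples, seed=42):
    """Estimate pi by sampling points uniformly in the unit square."""
    rng = random.Random(seed)      # fixed seed so the run is repeatable
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:   # the point falls inside the quarter circle
            inside += 1
    # inside / num_samples estimates pi / 4, so multiply by 4.
    return 4.0 * inside / num_samples


pi_estimate = estimate_pi(200_000)
print(pi_estimate)
```

Each sample is independent of the others, which is why the loop parallelizes naturally: every worker can count its own hits and the partial counts are summed at the end.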

https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf

http://demonstrations.wolfram.com/MonteCarloEstimateForPi/

https://en.wikipedia.org/wiki/Monte_Carlo_method

https://en.wikipedia.org/wiki/Inscribed_figure

https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)

Examples – Estimate Pi

[Figure: Pi estimation code annotated with its base RDD, transformed RDD, and action]


Spark SQL

Spark SQL

• Spark module for structured data processing (e.g. DB tables, JSON files)

• Adding Schema to RDDs

• Three ways to manipulate data:
  • SQL (2014.05, Spark 1.0)
  • DataFrame (2015.03, Spark 1.3)
  • Datasets (2016.01, Spark 1.6)

• Same execution engine for all three

• Spark SQL interfaces provide more information about both structure and computation being performed than basic Spark RDD API

Spark SQL Motivation

• Create and Run Spark Programs Faster
  • Write less code
  • Read less data
  • Let the optimizer do the hard work

• Limitations of Shark
  • Limited integration with Spark programs
  • Hive optimizer not designed for Spark

Spark SQL reuses the best parts of Shark

http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014


SQL

• Execute SQL queries written using either a basic SQL syntax or HiveQL

• When running SQL from within another programming language the results will be returned as a DataFrame.

• Interact with the SQL interface using the CLI or JDBC/ODBC

DataFrames

• Distributed collection of data organized into named columns.

• Conceptually equivalent to a table in relational DB or data frame in R/Python

• API available in Scala, Java, Python, and R

• Richer optimizations(significantly faster than RDDs)

• Can be constructed from a wide array of sources
  • data files, tables in Hive, external databases, existing RDDs

DataFrames

http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune

• Constructed from a wide array of sources


Datasets

• New experimental interface added in Spark 1.6

• Tries to provide the benefits of RDDs(strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.

• Unified Dataset API can be used both in Scala and Java.

SQL Context and Hive Context

• SQLContext
  • Entry point into all functionality in Spark SQL
  • Wraps / extends the existing SparkContext

• HiveContext
  • Superset of the functionality provided by the basic SQLContext
  • Read data from Hive tables
  • Access to Hive functions -> UDFs

val sqlContext = new SQLContext(sc)

val hc = new HiveContext(sc)

DataFrame Example

• Reading Data From Table

val df = sqlContext.table("flightsTbl")

df.select("Origin", "Dest", "DepDelay").show(5)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|       8|
|   IAD| TPA|      19|
|   IND| BWI|       8|
|   IND| BWI|      -4|
|   IND| BWI|      34|
+------+----+--------+

DataFrame Example

• Using DataFrame API to Filter Data(show delays more than 15min)

df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15).show(5)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

SQL Example

• Using SQL to Query and Filter Data(again, show delays more than 15 min)

// Register Temporary Table
df.registerTempTable("flights")

// Use SQL to Query Dataset
sqlContext.sql("""SELECT Origin, Dest, DepDelay
                  FROM flights
                  WHERE DepDelay > 15 LIMIT 5""").show

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

RDD vs. DataFrame

• RDD
  • Lower-level API (more control)
  • Lots of existing code & users
  • Compile-time type-safety

• DataFrame
  • Higher-level API (faster development)
  • Faster sorting, hashing, and serialization
  • More opportunities for automatic optimization
  • Lower memory pressure

DataFrames are intuitive

dept | name | age

Bio | H Smith | 48

CS | A Turing | 54

Bio | B Jones | 43

Phys | E Witten | 61

Find average age by department?

RDD Example

Data Frame Example

SQL Example

sqlContext.sql("SELECT avg(age) FROM data GROUP BY dept")

http://www.slideshare.net/hortonworks/intro-to-spark-with-zeppelin
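To make the comparison concrete, the sketch below computes the average age by department by hand in plain Python, roughly the bookkeeping (a running sum and count per key) that a low-level RDD-style solution has to spell out and that the one-line SQL query hides. The function name `average_age_by_dept` is hypothetical; the rows are the four rows of the table above.

```python
from collections import defaultdict

rows = [
    ("Bio", "H Smith", 48),
    ("CS", "A Turing", 54),
    ("Bio", "B Jones", 43),
    ("Phys", "E Witten", 61),
]


def average_age_by_dept(rows):
    """Keep a (sum, count) pair per department, then divide."""
    totals = defaultdict(lambda: [0, 0])
    for dept, _name, age in rows:
        totals[dept][0] += age   # running sum of ages
        totals[dept][1] += 1     # running count of people
    return {dept: total / count for dept, (total, count) in totals.items()}


averages = average_age_by_dept(rows)
print(averages)
```

The declarative versions state only what is wanted (group by dept, average age) and leave the sum-and-count bookkeeping to the engine.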


Spark SQL Optimizations

• Spark SQL uses an underlying optimization engine (Catalyst)
  • Catalyst can perform intelligent optimization since it understands the schema

• Spark SQL does not materialize all the columns (as with RDDs), only what's needed



Plan Optimization & Execution

• Spark SQL uses an underlying optimization engine(Catalyst)



An example query

SELECT name
FROM (
  SELECT id, name
  FROM People) p
WHERE p.id = 1

Logical Plan

Project name

Filter id = 1

Project id, name

People



Optimizing with Rules

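[Figure: the plan after each optimization rule, listed from the top operator down to the table]

Original Plan: Project name, Filter id = 1, Project id, name, People

Filter Push-Down: Project name, Project id, name, Filter id = 1, People

Combine Projection: Project name, Filter id = 1, People

Physical Plan: IndexLookup id = 1, return: name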
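The two logical rewrites in the figure (filter push-down, then combining projections) can be sketched as rules over a small plan tree. This is plain Python with plans encoded as nested tuples, an encoding invented for this sketch; it is not Catalyst's API, and the final step to a physical plan (IndexLookup) is not modelled.

```python
def push_down_filter(plan):
    """Filter over Project becomes Project over Filter if the column is kept."""
    if plan[0] == "Filter" and plan[2][0] == "Project":
        _, predicate, (_, columns, child) = plan
        if predicate[0] in columns:
            return ("Project", columns, ("Filter", predicate, child))
    return plan


def combine_projections(plan):
    """Project over Project becomes one Project if the outer columns are a subset."""
    if plan[0] == "Project" and plan[2][0] == "Project":
        _, outer, (_, inner, child) = plan
        if set(outer) <= set(inner):
            return ("Project", outer, child)
    return plan


RULES = [push_down_filter, combine_projections]


def optimize(plan):
    """Optimize the child first, then apply the rules here until nothing changes."""
    if plan[0] == "Scan":
        return plan
    plan = (plan[0], plan[1], optimize(plan[2]))
    while True:
        rewritten = plan
        for rule in RULES:
            rewritten = rule(rewritten)
        if rewritten == plan:
            return plan
        plan = (rewritten[0], rewritten[1], optimize(rewritten[2]))


# SELECT name FROM (SELECT id, name FROM People) p WHERE p.id = 1
original = (
    "Project", ("name",),
    ("Filter", ("id", 1),
     ("Project", ("id", "name"),
      ("Scan", "People"))),
)
optimized = optimize(original)
print(optimized)
```

Each rule is a small, local pattern match; the optimizer only has to keep applying rules until the plan stops changing.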



References

Papers

• Spark: Cluster Computing with Working Sets: http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf

• Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing : http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf

• Spark SQL: Relational Data Processing in Spark: https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf

RDD

• Spark 의 핵심은 무엇인가? RDD! (RDD paper review): http://www.slideshare.net/yongho/rdd-paper-review

• Apache Spark RDDs: http://www.slideshare.net/deanchen11/scala-bay-spark-talk

Stanford materials

• Intro to Apache Spark: https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf

• Distributed Computing with Spark: https://stanford.edu/~rezab/sparkclass/slides/reza_introtalk.pdf




References

• Apache Spark Overview - MapR: http://www.slideshare.net/caroljmcdonald/apache-spark-overview-52602792

• Introduction to Apache Spark Developer Training - Cloudera: http://www.slideshare.net/cloudera/spark-devwebinarslides-final

• Apache Spark Overview: http://www.slideshare.net/VadimYBichutskiy/apache-spark-overview

• Simplifying Big Data Analytics with Apache Spark: http://www.slideshare.net/databricks/bdtc2

• Intro to Spark with Zeppelin: http://www.slideshare.net/hortonworks/intro-to-spark-with-zeppelin

• Introduction to Big Data Analytics using Apache Spark and Zeppelin: http://www.slideshare.net/alexzeltov/introduction-to-big-data-analytics-using-apache-spark-and-zeppelin-on-hdinsights-on-azure-saas-andor-hdp-on-azurepaas

• Spark overview: http://www.slideshare.net/LisaHua/spark-overview-37479609

• Introduction to real time big data with Apache Spark: http://www.slideshare.net/tmatyashovsky/introduction-to-realtime-big-data-with-apache-spark

• Spark은 왜 이렇게 유명해 지고 있을까?: http://www.slideshare.net/KSLUG/ss-47355270

• Apache Spark Briefing: http://www.slideshare.net/ThomasWDinsmore/apache-spark-briefing-12062013

• Spark overview 이상훈(SK C&C)_스파크 사용자 모임_20141106: http://www.slideshare.net/sanghoonlee982/spark-overview-20141106

• Zeppelin(Spark)으로 데이터 분석하기: http://www.slideshare.net/sangwookimme/zeppelinspark-41329473

• Lightening Fast Big Data Analytics using Apache Spark: http://www.slideshare.net/manishgforce/lightening-fast-big-data-analytics-using-apache-spark

• Apache Hive on Apache Spark: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/

• Apache Spark: The Next Gen toolset for Big Data Processing: http://www.slideshare.net/prajods/apache-spark-the-next-gen-toolset-for-big-data-processing

• Intro to Spark and Spark SQL: http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014

• Spark SQL Deep Dive @ Melbourne Spark Meetup: http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune

• A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

• Apache Spark (big Data) DataFrame - Things to know: https://www.linkedin.com/pulse/apache-spark-big-data-dataframe-things-know-abhishek-choudhary


References

• Spark SQL - Quick Guide: https://www.tutorialspoint.com/spark_sql/spark_sql_quick_guide.htm

• Big Data Processing with Apache Spark - Part 2: Spark SQL: https://www.infoq.com/articles/apache-spark-sql

• Analytics with Apache Spark Tutorial Part 2: Spark SQL: https://dzone.com/articles/analytics-with-apache-spark-tutorial-part-2-spark

• Apache Spark Resource Management and YARN App Models: http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

• Why Spark Is the Next Top (Compute) Model: http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model-39976454

• Introduction to Apache Spark: http://www.slideshare.net/datamantra/introduction-to-apache-spark-45062010

• Apache Spark & Hadoop: http://www.slideshare.net/MapRTechnologies/spark-overviewjune2014

• Spark와 Hadoop, 완벽한 조합 (in Korean): http://www.slideshare.net/pudidic/spark-hadoop

• Big Data visualization with Apache Spark and Zeppelin: http://www.slideshare.net/prajods/big-data-visualization-with-apache-spark-and-zeppelin

• latency: https://gist.github.com/hellerbarde/2843375

• Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San Jose 2015: http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs

• Apache Spark Architecture: http://www.slideshare.net/AGrishchenko/apache-spark-architecture


Appendix

Latency numbers

Google big data technologies

Technology | Year | Description

GFS | 2003 | Google File System: A Distributed Storage

MapReduce | 2004 | Simplified Data Processing on Large Clusters

Sawzall | 2005 | Interpreting the Data: Parallel Analysis with Sawzall

Chubby | 2006 | The Chubby Lock Service for Loosely-Coupled Distributed Systems

BigTable | 2006 | A Distributed Storage System for Structured Data

Paxos | 2007 | Paxos Made Live - An Engineering Perspective

Colossus | 2009 | GFS II

Percolator | 2010 | Large-scale Incremental Processing Using Distributed Transactions and Notifications

Pregel | 2010 | A System for Large-Scale Graph Processing

Dremel | 2010 | Interactive Analysis of Web-Scale Datasets

Tenzing | 2011 | A SQL Implementation On The MapReduce Framework

Megastore | 2011 | Providing Scalable, Highly Available Storage for Interactive Services

Spanner | 2012 | Google's Globally-Distributed Database

F1 | 2012 | The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business

GFS Motivation

• More than 15,000 commodity-class PC's.

• Fault-tolerance provided in software

• More cost-effective solution

• Multiple clusters distributed worldwide

• One query reads 100’s of MB of data

• One query consumes 10’s of billions of CPU cycles

• Thousands of queries served per second.

• Google stores dozens of copies of the entire Web!

• Conclusion: Need Large, distributed, highly fault-tolerant file system

http://www.cs.brandeis.edu/~dilant/WebPage_TA160/The%20Google%20File%20System.pdf


GFS Assumptions

• High component failure rates
  • Monitoring, fault tolerance, and failure recovery have to be built in.

• A "modest" number of HUGE files
  • Just a few million
  • Each is 100MB or larger; multi-GB files typical

• Files are written once and mostly appended to
  • Perhaps concurrently

• Large streaming reads

• High sustained throughput matters more than low latency

http://research.google.com/archive/gfs.html

https://www.cs.umd.edu/class/spring2011/cmsc818k/Lectures/gfs-hdfs.pdf

GFS Architecture

<GFS Architecture>

<GFS file storage structure>

http://research.google.com/archive/gfs.html

https://www.cs.umd.edu/class/spring2011/cmsc818k/Lectures/gfs-hdfs.pdf

SQL JOINS

http://amirulkamil.com/best-describe-join/


YARN Architecture

https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html


YARN Features

Feature | Description

Multi-tenancy | YARN allows multiple access engines (either open-source or proprietary) to use Hadoop as the common standard for batch, interactive and real-time engines that can simultaneously access the same data set. Multi-tenant data processing improves an enterprise's return on its Hadoop investments.

Cluster utilization | YARN's dynamic allocation of cluster resources improves utilization over more static MapReduce rules used in early versions of Hadoop.

Scalability | Data center processing power continues to rapidly expand. YARN's ResourceManager focuses exclusively on scheduling and keeps pace as clusters expand to thousands of nodes managing petabytes of data.

Compatibility | Existing MapReduce applications developed for Hadoop 1 can run on YARN without any disruption to existing processes that already work.

Hive Overview

• Invented at Facebook. Open sourced to Apache in 2008

• A database/data warehouse on top of Hadoop
  • Structured data similar to relational schema
  • Tables, columns, rows and partitions

• SQL-like query language (HiveQL)
  • A subset of SQL with many traditional features
  • It is possible to embed MR scripts in HiveQL

• Queries are compiled into MR jobs that are executed on Hadoop.

Source: http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf

Hive Motivation(Facebook)

• Problem: Data growth was exponential
  • 200GB per day in March 2008
  • 2+TB (compressed) raw data / day in April 2009
  • 4+TB (compressed) raw data / day in Nov. 2009
  • 12+TB (compressed) raw data / day today (2010)

• The Hadoop Experiment
  • Much superior to availability and scalability of commercial DBs
  • Efficiency not that great, but throw more hardware
  • Partial availability/resilience/scale more important than ACID

• Problem: Programmability and Metadata
  • MapReduce hard to program (users know sql/bash/python)
  • Need to publish data in well known schemas

• Solution: SQL + MapReduce = HIVE (2007)

Source:

1) http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation

2) http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop

Hive – Data Flow of Facebook

Sources: http://borthakur.com/ftp/hadoopmicrosoft.pdf, http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf



Hive - Architecture

https://cwiki.apache.org/confluence/display/Hive/Design


Hive - Query Execution and MR Jobs

Sources: YSmart (Yet Another SQL-to-MapReduce Translator), http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf


Hive - Limitations

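• Performance (mainly up to version 0.12)
  • For simple queries, Hive performance is comparable with hand-coded MR jobs
  • The execution time is much longer for complex queries
  • An MR job runs at every operator stage, so heavy I/O causes a performance bottleneck
  • Each stage incurs inefficient data scans and transfers
  • A weak optimizer produces inefficient execution plans
  • The Stinger Initiative improved performance dramatically compared with earlier versions (introduction of Tez and ORC, optimizer improvements)

• Can be used only in a limited way, for DW workloads
  • Limited support for streaming, graph, ML and similar workloads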

RDD Operations in paper



Spark Master - YARN

• YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN. You can throw your entire cluster at a MapReduce job, then use some of it on an Impala query and the rest on a Spark application, without any changes in configuration.

• You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.

• Spark standalone mode requires each application to run an executor on every node in the cluster, whereas with YARN, you choose the number of executors to use.

• Finally, YARN is the only cluster manager for Spark that supports security. With YARN, Spark can run against Kerberized Hadoop clusters and uses secure authentication between its processes.


http://hadoop.apache.org/docs/r2.4.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html


Spark Master - YARN

yarn-client mode

yarn-cluster mode



Spark 2.0 Datasets

Spark 2.0 DataSets

Language   Main Abstraction
Scala      Dataset[T] & DataFrame (alias for Dataset[Row])
Java       Dataset[T]
Python*    DataFrame
R*         DataFrame

Note: Since Python and R have no compile-time type-safety, we only have untyped APIs, namely DataFrames.
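
The difference between untyped and typed access can be sketched in plain Scala without Spark. In this illustrative example, a `Map[String, Any]` stands in for a generic row (similar in spirit to Spark's `Row`) and a case class stands in for a typed Dataset element. `UntypedRow`, `Device`, and the helper functions are hypothetical names, not Spark APIs.

```scala
import scala.util.Try

// Plain-Scala analogy for untyped vs. typed access; no Spark API is used.
object TypedVsUntypedSketch {
  // Untyped: fields are looked up by name at runtime and must be cast.
  type UntypedRow = Map[String, Any]

  // Typed: a hypothetical cut-down record whose fields the compiler knows.
  final case class Device(device_name: String, temp: Long)

  // Compiles for any field name and any cast; mistakes surface only when run.
  def untypedTemp(row: UntypedRow): Long = row("temp").asInstanceOf[Long]

  // Field name and type are checked at compile time.
  def typedTemp(d: Device): Long = d.temp

  def main(args: Array[String]): Unit = {
    val row: UntypedRow = Map("device_name" -> "sensor-pad", "temp" -> 21L)
    val dev = Device("sensor-pad", 21L)

    println(untypedTemp(row))
    println(typedTemp(dev))

    // A misspelled field name compiles for the untyped row and fails at runtime.
    println(Try(row("tmep")).isFailure)
    // The typed equivalent, dev.tmep, is rejected by the compiler,
    // so it cannot even be written in this program.
  }
}
```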

Typed and Un-typed APIs



Static-typing and runtime type-safety



High-level abstraction and custom view

1. Spark reads the JSON, infers the schema, and creates a collection of DataFrames

2. At this point, Spark converts your data into DataFrame = Dataset[Row], a collection of generic Row objects, since it does not know the exact type.

3. Now, Spark converts Dataset[Row] -> Dataset[DeviceIoTData], a collection of type-specific Scala JVM objects, as dictated by the class DeviceIoTData.

case class DeviceIoTData(battery_level: Long, c02_level: Long, cca2: String, cca3: String, cn: String, device_id: Long,
  device_name: String, humidity: Long, ip: String, latitude: Double, lcd: String, longitude: Double, scale: String, temp: Long, timestamp: Long)

{"device_id": 198164, "device_name": "sensor-pad-198164owomcJZ", "ip": "80.55.20.25", "cca2": "PL", "cca3": "POL", "cn": "Poland", "latitude": 53.080000, "longitude": 18.620000, "scale": "Celsius", "temp": 21, "humidity": 65, "battery_level": 8, "c02_level": 1408, "lcd": "red", "timestamp" :1458081226051}

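// Read the JSON file and create the Dataset from the case class DeviceIoTData.
// ds is now a collection of JVM Scala objects of type DeviceIoTData.
import spark.implicits._ // provides the Encoder that .as[DeviceIoTData] needs (already imported in notebooks and spark-shell)
val ds = spark.read.json("/databricks-public-datasets/data/iot/iot_devices.json").as[DeviceIoTData]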



Ease-of-use of APIs with structure

• Structure may limit how much control your Spark program has over the data, but most computations can be accomplished with Dataset’s high-level APIs. For example, it is much simpler to perform agg, select, sum, avg, map, filter, or groupBy operations.

// Use filter(), map(), groupBy() on country, and compute avg()
// for temperature and humidity. This operation results in
// another immutable Dataset. The query is simple to read and expressive.
val dsAvgTmp = ds.filter(d => d.temp > 25).map(d => (d.temp, d.humidity, d.cca3)).groupBy($"_3").avg()

// Display the resulting Dataset (display() is a Databricks notebook helper; elsewhere, dsAvgTmp.show() prints it)
display(dsAvgTmp)
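
To show what the query above computes, here is a rough plain-Scala equivalent over an in-memory collection. It is only an illustration of the semantics (filter by temperature, group by the country code, average the two numeric fields), not Spark code. `Reading` is a hypothetical cut-down version of DeviceIoTData, and `avgByCountry` is an assumed helper name.

```scala
// Plain-Scala sketch of filter -> groupBy(country) -> avg; no Spark involved.
object AvgByCountrySketch {
  // Hypothetical record holding only the three fields the query uses.
  final case class Reading(temp: Long, humidity: Long, cca3: String)

  // Returns cca3 -> (average temp, average humidity)
  // for readings whose temp is above minTemp.
  def avgByCountry(readings: Seq[Reading], minTemp: Long): Map[String, (Double, Double)] =
    readings
      .filter(_.temp > minTemp)
      .groupBy(_.cca3)
      .map { case (country, rs) =>
        val n = rs.size.toDouble
        (country, (rs.map(_.temp).sum / n, rs.map(_.humidity).sum / n))
      }

  def main(args: Array[String]): Unit = {
    val readings = Seq(
      Reading(30, 60, "KOR"),
      Reading(40, 80, "KOR"),
      Reading(20, 50, "KOR"), // dropped by the filter
      Reading(26, 70, "POL"))
    println(avgByCountry(readings, minTemp = 25))
  }
}
```

Unlike this eager in-memory version, the Dataset query is evaluated lazily and can be optimized before it runs.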



Performance and Optimization

• DataFrame and Dataset APIs are built on top of the Spark SQL engine.
  • It uses Catalyst to generate an optimized logical and physical query plan.

• With Dataset’s Tungsten encoders, Spark produces compact bytecode for serialization / deserialization, which gives a speed advantage.



Use DataFrames or Datasets?

• If you want rich semantics, high-level abstractions, and domain-specific APIs, use DataFrame or Dataset.

• If your processing demands high-level expressions, filters, maps, aggregation, averages, sum, SQL queries, columnar access and use of lambda functions on semi-structured data, use DataFrame or Dataset.

• If you want a higher degree of type-safety at compile time, want typed JVM objects, want to take advantage of Catalyst optimization, and want to benefit from Tungsten’s efficient code generation, use Dataset.

• If you want unification and simplification of APIs across Spark libraries, use DataFrame or Dataset.

• If you are an R user, use DataFrames.

• If you are a Python user, use DataFrames, and fall back to RDDs if you need more control.

Spark Streaming vs. Storm

Reliability Models

                                    Core Storm   Storm Trident   Spark Streaming
At Most Once                        Yes          Yes             No
At Least Once                       Yes          Yes             No*
Once and Only Once (Exactly Once)   No           Yes             Yes*

http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
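
The practical difference between at-least-once and exactly-once results can be sketched in plain Scala. The example below assumes that every message carries a unique id and that a retry redelivers one message; it shows how naive processing double-counts while id-based deduplication does not. It is a generic illustration only, and does not reproduce how Storm Trident or Spark Streaming implement their guarantees. `Message`, `naiveTotal`, and `dedupedTotal` are hypothetical names.

```scala
// Plain-Scala sketch: at-least-once delivery plus idempotent processing.
// Assumption: each message has a unique id that survives redelivery.
object DeliverySemanticsSketch {
  final case class Message(id: Long, amount: Long)

  // Naive processing: every delivery is counted, so a redelivered
  // message is counted twice.
  def naiveTotal(delivered: Seq[Message]): Long =
    delivered.map(_.amount).sum

  // Idempotent processing: remember the ids already applied and
  // skip any message whose id has been seen before.
  def dedupedTotal(delivered: Seq[Message]): Long = {
    val (_, total) = delivered.foldLeft((Set.empty[Long], 0L)) {
      case ((seen, sum), m) =>
        if (seen.contains(m.id)) (seen, sum)
        else (seen + m.id, sum + m.amount)
    }
    total
  }

  def main(args: Array[String]): Unit = {
    // Message 1 is delivered twice, as a retry would do.
    val delivered = Seq(Message(1, 10), Message(2, 5), Message(1, 10))
    println(naiveTotal(delivered))
    println(dedupedTotal(delivered))
  }
}
```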


Programming Model

                             Core Storm           Storm Trident                             Spark Streaming
Stream Primitive             Tuple                Tuple, Tuple Batch, Partition             DStream
Stream Sources               Spouts               Spouts, Trident Spouts                    HDFS, Network
Computation/Transformation   Bolts                Filters, Functions, Aggregations, Joins   Transformation, Window Operations
Stateful Operation           No (roll your own)   Yes                                       Yes
Output/Persistence           Bolts                State, MapState                           foreachRDD

2014, http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming


Performance

• Storm capped at 10k msgs/sec/node?

• Spark Streaming 40x faster than Storm?

System                     Performance
Storm (Twitter)            10,000 records/s/node
Spark Streaming            400,000 records/s/node
Apache S4                  7,000 records/s/node
Other Commercial Systems   100,000 records/s/node

2014, http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming

