Building a Unified Data Pipeline in Spark / Big... with Apache Spark
DESCRIPTION
Presentation material with Japanese subtitles, by Mr. Aaron Davidson at ScalaMatsuri 2014: http://scalamatsuri.org/en/
TRANSCRIPT
Aaron Davidson
Slides adapted from Matei Zaharia
spark.apache.org
Building a Unified Data Pipeline in Spark
(A unified data pipeline built with Spark)
What is Apache Spark?
Fast and general cluster computing system interoperable with Hadoop
Improves efficiency through:
» In-memory computing primitives
» General computation graphs
Improves usability through:
» Rich APIs in Java, Scala, Python
» Interactive shell
Up to 100× faster (2-10× on disk)
2-5× less code
(A Hadoop-compatible cluster computing system; improves performance and usability)
Project History
Started at UC Berkeley in 2009, open sourced in 2010
50+ companies now contributing
» Databricks, Yahoo!, Intel, Cloudera, IBM, …
Most active project in Hadoop ecosystem
(Born at UC Berkeley; 50+ companies now contribute to the OSS project)
A General Stack
[Diagram: the Spark stack]
Spark (core)
Spark Streaming (real-time)
Spark SQL (structured)
GraphX (graph)
MLlib (machine learning)
…
(Structured queries, real-time analytics, graph processing, machine learning)
This Talk
Spark introduction & use cases
Modules built on Spark
The power of unification
Demo
(Introduction to Spark and its use cases)
Why a New Programming Model?
MapReduce greatly simplified big data analysis
But once started, users wanted more:
» More complex, multi-pass analytics (e.g. ML, graph)
» More interactive ad-hoc queries
» More real-time stream processing
All 3 need faster data sharing in parallel apps
(What users want after MapReduce: more complex analytics, interactive queries, real-time processing)
Data Sharing in MapReduce
[Diagram: iterative jobs do an HDFS read and an HDFS write around every iteration (iter. 1, iter. 2, …); each ad-hoc query (query 1-3) re-reads the input from HDFS to produce its result (result 1-3)]
Slow due to replication, serialization, and disk IO
(MapReduce data sharing is slow because of disk IO)
What We'd Like
[Diagram: one-time processing loads the input into distributed memory; iterations (iter. 1, iter. 2, …) and queries (query 1-3) then share data in memory]
10-100× faster than network and disk
(We'd like roughly 10-100× speedups over network and disk)
Spark Model
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets (RDDs)
» Collections of objects that can be stored in memory or disk across a cluster
» Built via parallel transformations (map, filter, …)
» Automatically rebuilt on failure
(Self-healing distributed datasets (RDDs); RDDs can be transformed in parallel with methods such as map and filter)
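To make the model concrete, here is a minimal Scala sketch (not from the slides) of building an RDD via parallel transformations; it assumes the SparkContext sc that the spark-shell provides:

// Minimal sketch of the RDD model; sc comes from the spark-shell.
val nums = sc.parallelize(1 to 1000000)     // a distributed dataset
val evens = nums.filter(_ % 2 == 0)         // lazy transformation
val squares = evens.map(n => n.toLong * n)  // another transformation
squares.cache()                             // keep in memory across the cluster
println(squares.count())                    // action: triggers the computation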
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                     # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # Transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()            # Action
messages.filter(lambda s: "bar" in s).count()
. . .

[Diagram: the driver sends tasks to three workers; each worker reads one block of the file (Block 1-3), caches messages in memory (Cache 1-3), and sends results back]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
(Search interactively with various patterns; processing 1 TB drops from 170 sec to 5-7 sec)
Fault Tolerance

file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda (type, count): count > 10)

[Diagram: lineage graph: file → map → reduce → filter; lost partitions are recomputed from the input]

RDDs track lineage info to rebuild lost data
(Lineage info is tracked to rebuild lost data)
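The lineage a pipeline like the one above records can be inspected with toDebugString; the following Scala sketch assumes a hypothetical Record case class standing in for the slide's log records:

// Sketch: inspecting the lineage Spark records for recovery.
case class Record(recType: String)
val file = sc.parallelize(Seq(Record("ERROR"), Record("INFO"), Record("ERROR")))
val counts = file.map(rec => (rec.recType, 1))
                 .reduceByKey(_ + _)
                 .filter { case (_, count) => count > 10 }
// toDebugString prints the chain of parent RDDs Spark would use to
// recompute lost partitions
println(counts.toDebugString)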
Example: Logistic Regression
[Chart: running time (s) vs. number of iterations (1-30). Hadoop: 110 s / iteration. Spark: 80 s for the first iteration, 1 s for further iterations]
(Logistic regression)
Behavior with Less RAM
[Chart: iteration time (s) vs. % of working set in memory]
Cache disabled: 68.8 s | 25%: 58.1 s | 50%: 40.7 s | 75%: 29.7 s | Fully cached: 11.5 s
(Behavior as the cache shrinks)
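A related knob, shown here as a sketch rather than anything from the slides: when the working set does not fit in RAM, an explicit storage level lets partitions spill to disk instead of being dropped and recomputed.

import org.apache.spark.storage.StorageLevel

// Sketch: cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
// MEMORY_AND_DISK spills partitions that do not fit instead of dropping them.
val messages = sc.textFile("hdfs://...").filter(_.contains("ERROR"))
messages.persist(StorageLevel.MEMORY_AND_DISK)
messages.count()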
Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("ERROR"); }
}).count();
Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

// Java 8:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(s -> s.contains("ERROR")).count();
Supported Operators
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin,
reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip,
sample, take, first, partitionBy, mapWith, pipe, save, ...
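As a quick illustration (a sketch with made-up data, not from the slides), several of these operators compose naturally on pair RDDs:

// Hypothetical data: click counts per user, joined with a user table.
val clicks = sc.parallelize(Seq(("alice", 1), ("bob", 1), ("alice", 1)))
val users  = sc.parallelize(Seq(("alice", "US"), ("bob", "JP")))
val perUser = clicks.reduceByKey(_ + _)   // ("alice", 2), ("bob", 1)
val joined  = perUser.join(users)         // ("alice", (2, "US")), ("bob", (1, "JP"))
println(joined.sample(withReplacement = false, 0.5).count())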
Spark Community
250+ developers, 50+ companies contributing
Most active open source project in big data
[Chart: commits over the past 6 months; Spark ahead of MapReduce, YARN, HDFS, and Storm]
(The most active OSS project in big data)
Continuing Growth
[Chart: contributors per month to Spark; source: ohloh.net]
(The number of contributors keeps growing)
Get Started
Visit spark.apache.org for docs & tutorials
Easy to run on just your laptop
Free training materials: spark-summit.org
(Easy to start on a single laptop)
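For example, a self-contained local run might look like the following sketch (the app name is arbitrary); in the spark-shell, the SparkContext sc is created for you:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: "local[*]" runs Spark on all cores of this machine; no cluster needed.
val conf = new SparkConf().setAppName("GettingStarted").setMaster("local[*]")
val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100).reduce(_ + _))   // prints 5050
sc.stop()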
This Talk
Spark introduction & use cases
Modules built on Spark
The power of unification
Demo
(Modules built on Spark)
The Spark Stack
[Diagram: the Spark stack]
Spark (core)
Spark Streaming (real-time)
Spark SQL (structured)
GraphX (graph)
MLlib (machine learning)
…
(The Spark stack)
Spark SQL
Evolution of the Shark project
Allows querying structured data in Spark

From Hive:
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()

From JSON (tweets.json):
{"text": "hi", "user": {"name": "matei", "id": 123}}

c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")

(Successor to Shark; query structured data in Spark)
Spark SQL
Integrates closely with Spark's language APIs

c.registerFunction("hasSpark", lambda text: "Spark" in text)
c.sql("select * from tweets where hasSpark(text)")

Uniform interface for data access
[Diagram: SQL, Python, Scala, and Java on top; Hive, Parquet, JSON, Cassandra, … underneath]
(Integrates with Spark's language APIs; a uniform interface over many data sources)
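The Python snippets above have direct Scala equivalents; here is a hedged sketch against the 1.x-era SQLContext API that the demo later in this talk also uses (tweets.json as above):

import org.apache.spark.sql.SQLContext

// Sketch: the same ideas in Scala with the 1.x-era API.
val ctx = new SQLContext(sc)
val tweets = ctx.jsonFile("tweets.json")
tweets.registerAsTable("tweets")
ctx.registerFunction("hasSpark", (text: String) => text.contains("Spark"))
ctx.sql("select text, user.name from tweets where hasSpark(text)")
   .collect().foreach(println)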
Spark Streaming
Stateful, fault-tolerant stream processing with the same API as batch jobs

sc.twitterStream(...)
  .map(tweet => (tweet.language, 1))
  .reduceByWindow("5s", _ + _)

[Chart: throughput (MB/s/node); Spark vs. Storm]
(Stateful, fault-tolerant stream processing with the same API as batch jobs)
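The snippet above is slide shorthand; spelled out against the actual 1.x streaming API it might look like the sketch below. TwitterUtils lives in the spark-streaming-twitter module, and using the author's language as a stand-in for the tweet's language is an assumption on my part:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// Sketch: count tweets per language over a sliding 5-second window.
val ssc = new StreamingContext(new SparkConf(), Seconds(1))
val tweets = TwitterUtils.createStream(ssc, None)   // None: default OAuth config
tweets.map(status => (status.getUser.getLang, 1))   // author's language (twitter4j)
      .reduceByKeyAndWindow(_ + _, Seconds(5))
      .print()
ssc.start()
ssc.awaitTermination()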
MLlib
Built-in library of machine learning algorithms
» K-means clustering
» Alternating least squares
» Generalized linear models (with L1 / L2 regularization)
» SVD and PCA
» Naïve Bayes

points = sc.textFile(...).map(parsePoint)
model = KMeans.train(points, 10)

(A built-in machine learning library)
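The slide elides parsePoint; a Scala sketch spelling it out, where the file path and the space-separated number format are assumptions:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Sketch: parse each line of space-separated numbers into a dense vector.
def parsePoint(line: String): Vector =
  Vectors.dense(line.split(' ').map(_.toDouble))

val points = sc.textFile("hdfs://.../points.txt").map(parsePoint).cache()
val model = KMeans.train(points, 10, 20)   // k = 10 clusters, 20 iterations
model.clusterCenters.foreach(println)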
This Talk
Spark introduction & use cases
Modules built on Spark
The power of unification
Demo
(The power of a unified stack)
Big Data Systems Today
[Diagram: MapReduce as general batch processing, surrounded by specialized systems for iterative, interactive, and streaming apps: Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Tez, Impala, S4, …]
(Today: a proliferation of specialized big data systems)
Spark's Approach
Instead of specializing, generalize MapReduce to support new apps in the same engine
Two changes (general task DAG & data sharing) are enough to express previous models!
Unification has big benefits
» For the engine
» For users
[Diagram: Spark at the core, with Streaming, GraphX, Shark, MLbase, … built on top]
(Spark's approach: support new apps on one general-purpose engine instead of specializing)
What it Means for Users
Separate frameworks:
[Diagram: ETL, train, and query each run in a different system, with an HDFS read and an HDFS write between every step]
Spark:
[Diagram: one HDFS read, then ETL → train → query sharing data in memory, plus interactive analysis]
(Everything runs on Spark, with interactive analysis on top)
Combining Processing Types

// Load data using SQL
val points = ctx.sql(
  "select latitude, longitude from historic_tweets")

// Train a machine learning model
val model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)

(Combine different processing types: SQL, machine learning, and application to a stream)
This Talk
Spark introduction & use cases
Modules built on Spark
The power of unification
Demo
(Demo)
The Plan
Raw JSON Tweets → SQL → Machine Learning → Streaming
(Load raw JSON from HDFS; extract tweet text with Spark SQL; extract feature vectors and train a k-means model; cluster the tweet stream with the trained model)
Demo!
Summary: What We Did
Raw JSON → SQL → Machine Learning → Streaming
(Load raw JSON from HDFS; extract tweet text with Spark SQL; extract feature vectors and train a k-means model; cluster the tweet stream with the trained model)
import org.apache.spark.sql._
val ctx = new org.apache.spark.sql.SQLContext(sc)
val tweets = sc.textFile("hdfs:/twitter")
val tweetTable = JsonTable.fromRDD(ctx, tweets, Some(0.1))  // JsonTable: helper from the demo environment
tweetTable.registerAsTable("tweetTable")

ctx.sql("SELECT text FROM tweetTable LIMIT 5").collect.foreach(println)
ctx.sql("SELECT lang, COUNT(*) AS cnt FROM tweetTable " +
  "GROUP BY lang ORDER BY cnt DESC LIMIT 10").collect.foreach(println)
val texts = ctx.sql("SELECT text FROM tweetTable").map(_.head.toString)

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector

def featurize(str: String): Vector = { ... }  // turns a tweet into a feature vector
val vectors = texts.map(featurize).cache()
val model = KMeans.train(vectors, 10, 10)     // k = 10 clusters, 10 iterations
sc.makeRDD(model.clusterCenters, 10).saveAsObjectFile("hdfs:/model")

// Streaming (a separate app): load the saved model, then cluster live tweets.
// modelFile and clusterNumber are parameters of the streaming app.
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val ssc = new StreamingContext(new SparkConf(), Seconds(1))
val model = new KMeansModel(
  ssc.sparkContext.objectFile[Vector](modelFile).collect())
val tweets = TwitterUtils.createStream(ssc, /* auth */)
val statuses = tweets.map(_.getText)
val filteredTweets = statuses.filter { t =>
  model.predict(featurize(t)) == clusterNumber  // keep tweets in one chosen cluster
}
filteredTweets.print()
ssc.start()
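The featurize body is elided on the slides; one plausible implementation (purely a sketch, not the demo's actual code) hashes words into a fixed-size term-frequency vector with MLlib's HashingTF:

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

// Sketch only: the demo's real feature extraction is not shown.
val tf = new HashingTF(1000)   // 1000-dimensional hashed term frequencies
def featurize(str: String): Vector =
  tf.transform(str.toLowerCase.split(' ').toSeq)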
Conclusion
Big data analytics is evolving to include:
» More complex analytics (e.g. machine learning)
» More interactive ad-hoc queries
» More real-time stream processing
Spark is a fast platform that unifies these apps
Learn more: spark.apache.org
(Big data analytics is evolving to be more complex, interactive, and real-time; Spark is a fast platform that unifies these apps)