Building a Unified Data Pipeline in Spark / Big... with Apache Spark
DESCRIPTION
Presentation material with Japanese subtitles, by Mr. Aaron Davidson at ScalaMatsuri 2014: http://scalamatsuri.org/en/
TRANSCRIPT
Aaron Davidson
Slides adapted from Matei Zaharia
spark.apache.org
Building a Unified Data Pipeline in Spark
(A unified data pipeline built with Spark)
What is Apache Spark?
Fast and general cluster computing system interoperable with Hadoop
Improves efficiency through:
» In-memory computing primitives
» General computation graphs
Improves usability through:
» Rich APIs in Java, Scala, Python
» Interactive shell
Up to 100× faster (2-10× on disk)
2-5× less code
(A Hadoop-compatible cluster computing system; improves performance and usability)
Project History
Started at UC Berkeley in 2009, open sourced in 2010
50+ companies now contributing
» Databricks, Yahoo!, Intel, Cloudera, IBM, …
Most active project in Hadoop ecosystem
(Born at UC Berkeley; 50+ companies now contribute to the OSS project)
A General Stack
[Diagram: the Spark stack]
Spark (core)
Spark Streaming (real-time)
Spark SQL (structured)
GraphX (graph)
MLlib (machine learning)
…
(Structured queries, real-time analytics, graph processing, machine learning)
This Talk
Spark introduction & use cases
Modules built on Spark
The power of unification
Demo
(Introduction to Spark and its use cases)
Why a New Programming Model?
MapReduce greatly simplified big data analysis
But once started, users wanted more:
» More complex, multi-pass analytics (e.g. ML, graph)
» More interactive ad-hoc queries
» More real-time stream processing
All 3 need faster data sharing in parallel apps
(What users want after MapReduce: more complex analytics, interactive queries, real-time processing)
Data Sharing in MapReduce
[Diagram: iterative jobs do an HDFS read and an HDFS write around every iteration (iter. 1, iter. 2, …); each ad-hoc query (query 1-3) re-reads the input from HDFS to produce its result (result 1-3)]
Slow due to replication, serialization, and disk IO
(MapReduce data sharing is slow because of disk IO)
What We'd Like
[Diagram: one-time processing loads the input into distributed memory; iterations (iter. 1, iter. 2, …) and queries (query 1-3) then share data in memory]
10-100× faster than network and disk
(We'd like roughly 10-100× speedups over network and disk)
Spark Model
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets (RDDs)
» Collections of objects that can be stored in memory or disk across a cluster
» Built via parallel transformations (map, filter, …)
» Automatically rebuilt on failure
(Self-healing distributed datasets (RDDs); RDDs can be transformed in parallel with methods such as map and filter)
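To make the model concrete, here is a minimal Scala sketch (not from the slides) of building an RDD via parallel transformations; it assumes the SparkContext sc that the spark-shell provides:

// Minimal sketch of the RDD model; sc comes from the spark-shell.
val nums = sc.parallelize(1 to 1000000)     // a distributed dataset
val evens = nums.filter(_ % 2 == 0)         // lazy transformation
val squares = evens.map(n => n.toLong * n)  // another transformation
squares.cache()                             // keep in memory across the cluster
println(squares.count())                    // action: triggers the computation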
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                     # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # Transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()            # Action
messages.filter(lambda s: "bar" in s).count()
. . .

[Diagram: the driver sends tasks to three workers; each worker reads one block of the file (Block 1-3), caches messages in memory (Cache 1-3), and sends results back]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
(Search interactively with various patterns; processing 1 TB drops from 170 sec to 5-7 sec)
Fault Tolerance

file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda (type, count): count > 10)

[Diagram: lineage graph: file → map → reduce → filter; lost partitions are recomputed from the input]

RDDs track lineage info to rebuild lost data
(Lineage info is tracked to rebuild lost data)
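The lineage a pipeline like the one above records can be inspected with toDebugString; the following Scala sketch assumes a hypothetical Record case class standing in for the slide's log records:

// Sketch: inspecting the lineage Spark records for recovery.
case class Record(recType: String)
val file = sc.parallelize(Seq(Record("ERROR"), Record("INFO"), Record("ERROR")))
val counts = file.map(rec => (rec.recType, 1))
                 .reduceByKey(_ + _)
                 .filter { case (_, count) => count > 10 }
// toDebugString prints the chain of parent RDDs Spark would use to
// recompute lost partitions
println(counts.toDebugString)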
Example: Logistic Regression
[Chart: running time (s) vs. number of iterations (1-30). Hadoop: 110 s / iteration. Spark: 80 s for the first iteration, 1 s for further iterations]
(Logistic regression)
Behavior with Less RAM
[Chart: iteration time (s) vs. % of working set in memory]
Cache disabled: 68.8 s | 25%: 58.1 s | 50%: 40.7 s | 75%: 29.7 s | Fully cached: 11.5 s
(Behavior as the cache shrinks)
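A related knob, shown here as a sketch rather than anything from the slides: when the working set does not fit in RAM, an explicit storage level lets partitions spill to disk instead of being dropped and recomputed.

import org.apache.spark.storage.StorageLevel

// Sketch: cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
// MEMORY_AND_DISK spills partitions that do not fit instead of dropping them.
val messages = sc.textFile("hdfs://...").filter(_.contains("ERROR"))
messages.persist(StorageLevel.MEMORY_AND_DISK)
messages.count()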
Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("ERROR"); }
}).count();
Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

// Java 8:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(s -> s.contains("ERROR")).count();
Supported Operators
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin,
reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip,
sample, take, first, partitionBy, mapWith, pipe, save, ...
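As a quick illustration (a sketch with made-up data, not from the slides), several of these operators compose naturally on pair RDDs:

// Hypothetical data: click counts per user, joined with a user table.
val clicks = sc.parallelize(Seq(("alice", 1), ("bob", 1), ("alice", 1)))
val users  = sc.parallelize(Seq(("alice", "US"), ("bob", "JP")))
val perUser = clicks.reduceByKey(_ + _)   // ("alice", 2), ("bob", 1)
val joined  = perUser.join(users)         // ("alice", (2, "US")), ("bob", (1, "JP"))
println(joined.sample(withReplacement = false, 0.5).count())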
Spark Community
250+ developers, 50+ companies contributing
Most active open source project in big data
[Chart: commits over the past 6 months; Spark ahead of MapReduce, YARN, HDFS, and Storm]
(The most active OSS project in big data)
Continuing Growth
[Chart: contributors per month to Spark; source: ohloh.net]
(The number of contributors keeps growing)
Get Started
Visit spark.apache.org for docs & tutorials
Easy to run on just your laptop
Free training materials: spark-summit.org
(Easy to start on a single laptop)
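For example, a self-contained local run might look like the following sketch (the app name is arbitrary); in the spark-shell, the SparkContext sc is created for you:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: "local[*]" runs Spark on all cores of this machine; no cluster needed.
val conf = new SparkConf().setAppName("GettingStarted").setMaster("local[*]")
val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100).reduce(_ + _))   // prints 5050
sc.stop()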
This Talk
Spark introduction & use cases
Modules built on Spark
The power of unification
Demo
(Modules built on Spark)
The Spark Stack
[Diagram: the Spark stack]
Spark (core)
Spark Streaming (real-time)
Spark SQL (structured)
GraphX (graph)
MLlib (machine learning)
…
(The Spark stack)
Spark SQL
Evolution of the Shark project
Allows querying structured data in Spark

From Hive:
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()

From JSON (tweets.json):
{"text": "hi", "user": {"name": "matei", "id": 123}}

c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")

(Successor to Shark; query structured data in Spark)
Spark SQL
Integrates closely with Spark's language APIs

c.registerFunction("hasSpark", lambda text: "Spark" in text)
c.sql("select * from tweets where hasSpark(text)")

Uniform interface for data access
[Diagram: SQL, Python, Scala, and Java on top; Hive, Parquet, JSON, Cassandra, … underneath]
(Integrates with Spark's language APIs; a uniform interface over many data sources)
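The Python snippets above have direct Scala equivalents; here is a hedged sketch against the 1.x-era SQLContext API that the demo later in this talk also uses (tweets.json as above):

import org.apache.spark.sql.SQLContext

// Sketch: the same ideas in Scala with the 1.x-era API.
val ctx = new SQLContext(sc)
val tweets = ctx.jsonFile("tweets.json")
tweets.registerAsTable("tweets")
ctx.registerFunction("hasSpark", (text: String) => text.contains("Spark"))
ctx.sql("select text, user.name from tweets where hasSpark(text)")
   .collect().foreach(println)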
Spark Streaming
Stateful, fault-tolerant stream processing with the same API as batch jobs

sc.twitterStream(...)
  .map(tweet => (tweet.language, 1))
  .reduceByWindow("5s", _ + _)

[Chart: throughput (MB/s/node); Spark vs. Storm]
(Stateful, fault-tolerant stream processing with the same API as batch jobs)
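The snippet above is slide shorthand; spelled out against the actual 1.x streaming API it might look like the sketch below. TwitterUtils lives in the spark-streaming-twitter module, and using the author's language as a stand-in for the tweet's language is an assumption on my part:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// Sketch: count tweets per language over a sliding 5-second window.
val ssc = new StreamingContext(new SparkConf(), Seconds(1))
val tweets = TwitterUtils.createStream(ssc, None)   // None: default OAuth config
tweets.map(status => (status.getUser.getLang, 1))   // author's language (twitter4j)
      .reduceByKeyAndWindow(_ + _, Seconds(5))
      .print()
ssc.start()
ssc.awaitTermination()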
MLlib
Built-in library of machine learning algorithms
» K-means clustering
» Alternating least squares
» Generalized linear models (with L1 / L2 regularization)
» SVD and PCA
» Naïve Bayes

points = sc.textFile(...).map(parsePoint)
model = KMeans.train(points, 10)

(A built-in machine learning library)
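The slide elides parsePoint; a Scala sketch spelling it out, where the file path and the space-separated number format are assumptions:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Sketch: parse each line of space-separated numbers into a dense vector.
def parsePoint(line: String): Vector =
  Vectors.dense(line.split(' ').map(_.toDouble))

val points = sc.textFile("hdfs://.../points.txt").map(parsePoint).cache()
val model = KMeans.train(points, 10, 20)   // k = 10 clusters, 20 iterations
model.clusterCenters.foreach(println)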
This Talk
Spark introduction & use cases
Modules built on Spark
The power of unification
Demo
(The power of a unified stack)
Big Data Systems Today
[Diagram: MapReduce as general batch processing, surrounded by specialized systems for iterative, interactive, and streaming apps: Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Tez, Impala, S4, …]
(Today: a proliferation of specialized big data systems)
Spark's Approach
Instead of specializing, generalize MapReduce to support new apps in the same engine
Two changes (general task DAG & data sharing) are enough to express previous models!
Unification has big benefits
» For the engine
» For users
[Diagram: Spark at the core, with Streaming, GraphX, Shark, MLbase, … built on top]
(Spark's approach: support new apps on one general-purpose engine instead of specializing)
What it Means for Users
Separate frameworks:
[Diagram: ETL, train, and query each run in a different system, with an HDFS read and an HDFS write between every step]
Spark:
[Diagram: one HDFS read, then ETL → train → query sharing data in memory, plus interactive analysis]
(Everything runs on Spark, with interactive analysis on top)
Combining Processing Types

// Load data using SQL
val points = ctx.sql(
  "select latitude, longitude from historic_tweets")

// Train a machine learning model
val model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)

(Combine different processing types: SQL, machine learning, and application to a stream)
This Talk
Spark introduction & use cases
Modules built on Spark
The power of unification
Demo
(Demo)
The Plan
Raw JSON Tweets → SQL → Machine Learning → Streaming
(Load raw JSON from HDFS; extract tweet text with Spark SQL; extract feature vectors and train a k-means model; cluster the tweet stream with the trained model)
Demo!
Summary: What We Did
Raw JSON → SQL → Machine Learning → Streaming
(Load raw JSON from HDFS; extract tweet text with Spark SQL; extract feature vectors and train a k-means model; cluster the tweet stream with the trained model)
import org.apache.spark.sql._
val ctx = new org.apache.spark.sql.SQLContext(sc)
val tweets = sc.textFile("hdfs:/twitter")
val tweetTable = JsonTable.fromRDD(ctx, tweets, Some(0.1))  // JsonTable: helper from the demo environment
tweetTable.registerAsTable("tweetTable")

ctx.sql("SELECT text FROM tweetTable LIMIT 5").collect.foreach(println)
ctx.sql("SELECT lang, COUNT(*) AS cnt FROM tweetTable " +
  "GROUP BY lang ORDER BY cnt DESC LIMIT 10").collect.foreach(println)
val texts = ctx.sql("SELECT text FROM tweetTable").map(_.head.toString)

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector

def featurize(str: String): Vector = { ... }  // turns a tweet into a feature vector
val vectors = texts.map(featurize).cache()
val model = KMeans.train(vectors, 10, 10)     // k = 10 clusters, 10 iterations
sc.makeRDD(model.clusterCenters, 10).saveAsObjectFile("hdfs:/model")

// Streaming (a separate app): load the saved model, then cluster live tweets.
// modelFile and clusterNumber are parameters of the streaming app.
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val ssc = new StreamingContext(new SparkConf(), Seconds(1))
val model = new KMeansModel(
  ssc.sparkContext.objectFile[Vector](modelFile).collect())
val tweets = TwitterUtils.createStream(ssc, /* auth */)
val statuses = tweets.map(_.getText)
val filteredTweets = statuses.filter { t =>
  model.predict(featurize(t)) == clusterNumber  // keep tweets in one chosen cluster
}
filteredTweets.print()
ssc.start()
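The featurize body is elided on the slides; one plausible implementation (purely a sketch, not the demo's actual code) hashes words into a fixed-size term-frequency vector with MLlib's HashingTF:

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

// Sketch only: the demo's real feature extraction is not shown.
val tf = new HashingTF(1000)   // 1000-dimensional hashed term frequencies
def featurize(str: String): Vector =
  tf.transform(str.toLowerCase.split(' ').toSeq)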
Conclusion
Big data analytics is evolving to include:
» More complex analytics (e.g. machine learning)
» More interactive ad-hoc queries
» More real-time stream processing
Spark is a fast platform that unifies these apps
Learn more: spark.apache.org
(Big data analytics is evolving to be more complex, interactive, and real-time; Spark is a fast platform that unifies these apps)