Building a Unified Data Pipeline in Spark / Apache Spark を用いた big...


DESCRIPTION

Presentation material with Japanese subtitles, by Aaron Davidson at ScalaMatsuri 2014: http://scalamatsuri.org/en/

TRANSCRIPT

Aaron Davidson
Slides adapted from Matei Zaharia

spark.apache.org

Building a Unified Data Pipeline in Spark

A unified data pipeline built with Spark

What is Apache Spark?

Fast and general cluster computing system, interoperable with Hadoop

Improves efficiency through:
» In-memory computing primitives
» General computation graphs

Improves usability through:
» Rich APIs in Java, Scala, Python
» Interactive shell

Up to 100× faster (2-10× on disk)

2-5× less code

A Hadoop-compatible cluster computing system that improves performance and usability

Project History

Started at UC Berkeley in 2009, open sourced in 2010

50+ companies now contributing
» Databricks, Yahoo!, Intel, Cloudera, IBM, …

Most active project in the Hadoop ecosystem

Born at UC Berkeley; more than 50 companies now contribute to it as open source

A General Stack

Spark
Spark Streaming (real-time)
Spark SQL (structured data)
GraphX (graph)
MLlib (machine learning)

Structured queries, real-time analytics, graph processing, machine learning

This Talk

Spark introduction & use cases

Modules built on Spark

The power of unification

Demo

Spark introduction & use cases

Why a New Programming Model?

MapReduce greatly simplified big data analysis

But once started, users wanted more:
» More complex, multi-pass analytics (e.g. ML, graph)
» More interactive ad-hoc queries
» More real-time stream processing

All 3 need faster data sharing in parallel apps

What users want after MapReduce: more complex analytics, interactive queries, real-time processing

Data Sharing in MapReduce

(Diagram: each iteration and each query reads its input from HDFS and writes its output back to HDFS)

Slow due to replication, serialization, and disk IO

Data sharing in MapReduce is slow because of disk IO

What We'd Like

(Diagram: iterations and queries share the input through distributed memory, with one-time processing of the input)

10-100× faster than network and disk

We want to be roughly 10-100× faster than network and disk

Spark Model

Write programs in terms of transformations on distributed datasets

Resilient Distributed Datasets (RDDs)
» Collections of objects that can be stored in memory or on disk across a cluster
» Built via parallel transformations (map, filter, …)
» Automatically rebuilt on failure

Self-healing distributed datasets (RDDs): an RDD can be transformed in parallel with methods such as map and filter

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
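As a rough illustration of the RDD model described above, here is a minimal Scala sketch (it assumes spark-shell, where sc is predefined; the data and numbers are illustrative, not from the slides):

val nums    = sc.parallelize(1 to 1000000)        // base RDD built from a local collection
val squares = nums.map(n => n.toLong * n)         // parallel transformation (lazy)
val evens   = squares.filter(_ % 2 == 0)          // another transformation
evens.cache()                                     // keep the results in cluster memory
println(evens.count())                            // action: triggers the computation
println(evens.count())                            // second run is served from the cache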

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()             # action
messages.filter(lambda s: "bar" in s).count()
. . .

(Diagram: the driver ships tasks to workers; each worker reads one HDFS block (Block 1-3), keeps its messages in memory (Cache 1-3), and returns results to the driver)

Search interactively with various patterns; processing time for 1 TB drops from 170 s to 5-7 s

Fault Tolerance

file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda (type, count): count > 10)

(Diagram: lineage graph: file → map → reduce → filter)

RDDs track lineage info to rebuild lost data

Lineage information is tracked to rebuild lost data


Example: Logistic Regression

(Chart: running time (s) vs number of iterations, 1 to 30; Hadoop takes about 110 s per iteration, while Spark takes 80 s for the first iteration and about 1 s for each further iteration)

Logistic regression
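A minimal sketch of the iterative logistic-regression pattern the chart refers to; the data path, record format (label followed by features, whitespace-separated), and iteration count are assumptions, not from the slides:

val data = sc.textFile("hdfs://.../lr_data.txt").map { line =>
  val cols = line.split(' ').map(_.toDouble)
  (cols.head, cols.tail)                    // (label in {-1, +1}, feature array)
}.cache()                                   // keep the points in memory across iterations

val d = data.first()._2.length
var w = Array.fill(d)(0.0)                  // weight vector, updated each iteration

def dot(a: Array[Double], b: Array[Double]): Double =
  (a zip b).map { case (x, y) => x * y }.sum

for (i <- 1 to 20) {
  // gradient of the logistic loss, summed over all cached points
  val gradient = data.map { case (y, x) =>
    val scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    x.map(_ * scale)
  }.reduce { (a, b) => (a zip b).map { case (p, q) => p + q } }
  w = (w zip gradient).map { case (wi, gi) => wi - gi }
}

Because the points are cached, only the first iteration pays the cost of reading from HDFS; later iterations reuse the in-memory data, which is the behavior the chart shows.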

Behavior with Less RAM

% of working set in memory    Iteration time (s)
Cache disabled                68.8
25%                           58.1
50%                           40.7
75%                           29.7
Fully cached                  11.5

Behavior with less cache available
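For context, a small sketch of how the storage level is controlled (the path and data are illustrative; which level suits a given job is a judgment call, not something the slides prescribe):

import org.apache.spark.storage.StorageLevel

// cache() keeps partitions in memory only; partitions that do not fit are dropped
// and recomputed from lineage, which is the slowdown the table above measures
val inMemory = sc.textFile("hdfs://.../working_set").cache()

// when the working set exceeds RAM, MEMORY_AND_DISK spills the overflow to local
// disk instead of recomputing it on every iteration
val spillable = sc.textFile("hdfs://.../working_set")
  .persist(StorageLevel.MEMORY_AND_DISK)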

Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("ERROR"); }
}).count();

Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

// Java 8:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(s -> s.contains("ERROR")).count();

Supported Operators

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
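A small sketch combining a few of the operators listed above; the file paths and record layout (tab-separated user and click logs) are assumptions, not from the slides:

val clicks = sc.textFile("hdfs://.../clicks.tsv")
  .map(_.split('\t'))
  .map(f => (f(0), f(1)))                   // (userId, page)

val users = sc.textFile("hdfs://.../users.tsv")
  .map(_.split('\t'))
  .map(f => (f(0), f(2)))                   // (userId, country)

// join the two datasets on userId, count clicks per country, sort descending
val clicksPerCountry = clicks.join(users)
  .map { case (_, (page, country)) => (country, 1) }
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)

clicksPerCountry.take(10).foreach(println)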

Spark Community

250+ developers, 50+ companies contributing

Most active open source project in big data

(Chart: commits over the past 6 months; Spark ahead of MapReduce, YARN, HDFS, and Storm)

The most active OSS project in the big data space

Continuing Growth

(Chart: contributors per month to Spark; source: ohloh.net)

The number of contributors keeps growing

Get Started

Visit spark.apache.org for docs & tutorials

Easy to run on just your laptop

Free training materials: spark-summit.org

You can get started with just a single laptop
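A minimal standalone program that runs Spark in local mode on a laptop; the object name and input file are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

object LocalWordCount {
  def main(args: Array[String]): Unit = {
    // local[*] uses all cores of the local machine, no cluster required
    val conf = new SparkConf().setMaster("local[*]").setAppName("LocalWordCount")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("README.md")
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}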

This Talk

Spark introduction & use cases

Modules built on Spark

The power of unification

Demo

Modules built on Spark

The Spark Stack

Spark
Spark Streaming (real-time)
Spark SQL (structured data)
GraphX (graph)
MLlib (machine learning)

The Spark stack

Spark SQL

Evolution of the Shark project

Allows querying structured data in Spark

From Hive:

c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()

From JSON (tweets.json):

{"text": "hi", "user": {"name": "matei", "id": 123}}

c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")

Successor to Shark; query structured data in Spark

Spark SQL

Integrates closely with Spark's language APIs

c.registerFunction("hasSpark", lambda text: "Spark" in text)
c.sql("select * from tweets where hasSpark(text)")

Uniform interface for data access

(Diagram: SQL, Python, Scala, and Java all query Hive, Parquet, JSON, Cassandra, … through the same interface)

Integration with Spark's language APIs; a uniform interface to many data sources
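A rough Scala counterpart of the Python calls above, sketched under the assumption that the 1.x-era SQLContext API shown on these slides (jsonFile, registerAsTable, registerFunction) behaves the same way from Scala; the file name and UDF are illustrative:

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)

// load JSON, infer the schema, and expose it as a table, as in the Python example
val tweets = sqlCtx.jsonFile("tweets.json")
tweets.registerAsTable("tweets")

// register a Scala function and call it from SQL, mirroring hasSpark above
sqlCtx.registerFunction("hasSpark", (text: String) => text.contains("Spark"))
sqlCtx.sql("select text from tweets where hasSpark(text)").collect().foreach(println)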

Spark Streaming

Stateful, fault-tolerant stream processing with the same API as batch jobs:

sc.twitterStream(...)
  .map(tweet => (tweet.language, 1))
  .reduceByWindow("5s", _ + _)

(Chart: throughput in MB/s/node, Spark vs Storm)
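The snippet on the slide is abbreviated; here is a minimal, self-contained sketch of the same windowed-count idea using a plain socket source (host, port, and window length are assumptions, not from the slides):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations on older Spark versions

val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount")
val ssc = new StreamingContext(conf, Seconds(1))       // 1-second batches

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(5))   // counts over a 5-second window

counts.print()
ssc.start()
ssc.awaitTermination()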

MLlib

Built-in library of machine learning algorithms:
» K-means clustering
» Alternating least squares
» Generalized linear models (with L1 / L2 regularization)
» SVD and PCA
» Naïve Bayes

points = sc.textFile(...).map(parsePoint)
model = KMeans.train(points, 10)

Built-in machine learning library
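A slightly fuller Scala version of the two lines above, sketched under the assumption that the input is whitespace-separated numeric features (parsePoint is not defined on the slide, so the parsing here is illustrative):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// parse whitespace-separated numeric features into MLlib vectors and cache them
val points = sc.textFile("hdfs://.../points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// train k-means with k = 10 clusters and 20 iterations
val model = KMeans.train(points, 10, 20)
model.clusterCenters.foreach(println)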

This Talk

Spark introduction & use cases

Modules built on Spark

The power of unification

Demo

The power of a unified stack

Big Data Systems Today

General batch processing:
MapReduce

Specialized systems (iterative, interactive and streaming apps):
Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Tez, Impala, S4, …

Today: a proliferation of specialized big data systems

Spark's Approach

Instead of specializing, generalize MapReduce to support new apps in the same engine

Two changes (general task DAG & data sharing) are enough to express the previous models!

Unification has big benefits
» For the engine
» For users

(Diagram: Streaming, GraphX, Shark, and MLbase all built on Spark)

Spark's approach: support new apps on the same general-purpose engine instead of specializing

What it Means for Users

Separate frameworks: each step (ETL, train, query) reads its input from HDFS and writes its output back to HDFS

Spark: ETL, train, and query run as one pipeline on HDFS, plus interactive analysis

All processing completes within Spark, and interactive analysis comes with it

Combining Processing Types

// Load data using SQL
val points = ctx.sql("select latitude, longitude from historic_tweets")

// Train a machine learning model
val model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)

Combine different processing types: SQL, machine learning, application to a stream, and more

This Talk

Spark introduction & use cases

Modules built on Spark

The power of unification

Demo

Demo

The Plan

Raw JSON Tweets → SQL → Machine Learning → Streaming

Read raw JSON from HDFS; extract the tweet text with Spark SQL; extract feature vectors and train a k-means model; cluster the tweet stream with the trained model

Demo!

Summary: What We Did

Raw JSON → SQL → Machine Learning → Streaming

- Read raw JSON from HDFS
- Extracted the tweet text with Spark SQL
- Extracted feature vectors and trained a k-means model
- Clustered the tweet stream with the trained model

import org.apache.spark.sql._
val ctx = new org.apache.spark.sql.SQLContext(sc)
val tweets = sc.textFile("hdfs:/twitter")
val tweetTable = JsonTable.fromRDD(ctx, tweets, Some(0.1))
tweetTable.registerAsTable("tweetTable")

ctx.sql("SELECT text FROM tweetTable LIMIT 5").collect.foreach(println)
ctx.sql("SELECT lang, COUNT(*) AS cnt FROM tweetTable GROUP BY lang ORDER BY cnt DESC LIMIT 10").collect.foreach(println)
val texts = ctx.sql("SELECT text FROM tweetTable").map(_.head.toString)

def featurize(str: String): Vector = { ... }
val vectors = texts.map(featurize).cache()
val model = KMeans.train(vectors, 10, 10)

sc.makeRDD(model.clusterCenters, 10).saveAsObjectFile("hdfs:/model")

val ssc = new StreamingContext(new SparkConf(), Seconds(1))
val model = new KMeansModel(ssc.sparkContext.objectFile(modelFile).collect())

// Streaming
val tweets = TwitterUtils.createStream(ssc, /* auth */)
val statuses = tweets.map(_.getText)
val filteredTweets = statuses.filter { t =>
  model.predict(featurize(t)) == clusterNumber
}
filteredTweets.print()
ssc.start()
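The featurize function is elided on the slides. Below is one possible hashed bag-of-words implementation, purely a sketch of what such a function could look like; numFeatures and the hashing scheme are assumptions, not the demo's actual code:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

val numFeatures = 1000

// hash each word into a fixed-size count vector (a simple hashed bag-of-words)
def featurize(str: String): Vector = {
  val counts = new Array[Double](numFeatures)
  for (word <- str.toLowerCase.split("\\s+") if word.nonEmpty) {
    val idx = ((word.hashCode % numFeatures) + numFeatures) % numFeatures
    counts(idx) += 1.0
  }
  Vectors.dense(counts)
}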

Conclusion

Big data analytics is evolving to include:
» More complex analytics (e.g. machine learning)
» More interactive ad-hoc queries
» More real-time stream processing

Spark is a fast platform that unifies these apps

Learn more: spark.apache.org

Big data analytics is evolving toward more complex, interactive, real-time workloads; Spark is a fast platform that unifies these apps
