Real-Time User Attribute Estimation with Spark Streaming
Agenda
• Spark Streaming
• Spark Streaming Tips
Speaker: SAEKI Yoshiyasu
• IT engineer; Web development background, now in R&D
• Works with Hadoop, Kafka, Storm, Spark, Druid
• Hobby: RICOH Theta + Google Cardboard
Spark Streaming
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
Background: Recruit's business model of matching users with clients
http://www.recruit.jp/company/about/structure.html
• Estimating user attributes ≒ assigning users to segments
• Estimated attributes include OS etc.
Data collection:
1. Gather logs from the web front end (JavaScript tag)
2. Forward them via fluentd into Kafka
Log forwarding: fluentd → Kafka
• fluent-plugin-kafka
• https://github.com/htgc/fluent-plugin-kafka
• output type = kafka_buffered (file-based buffering)
• Kafka 0.8.2.2 (0.9.0 adds features such as ACLs)
Suro
• A data pipeline OSS from Netflix
• https://github.com/Netflix/suro
• Input: Kafka Consumer API, Thrift API
• Output sinks:
  • HDFS
  • AWS S3
  • Kafka Producer
  • Elasticsearch
Gobblin
• A data ingestion framework for Hadoop
• Loads data into HDFS

MLlib
• Streaming linear regression (Classification)
• Streaming k-means (Clustering), see the sketch below
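For a feel of the streaming algorithms listed above, here is a minimal sketch of StreamingKMeans against the Spark 1.5 MLlib API; the function name, k, and the dimensionality are illustrative placeholders, not values from the talk.

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

// trainingData / testData are assumed to be existing DStream[Vector]s.
def clusterStreams(trainingData: DStream[Vector], testData: DStream[Vector]): Unit = {
  val model = new StreamingKMeans()
    .setK(3)                                   // number of clusters (placeholder)
    .setDecayFactor(1.0)                       // 1.0 = weight all past data equally
    .setRandomCenters(dim = 10, weight = 0.0)  // random initial centers for 10-dimensional points

  model.trainOn(trainingData)        // update centers as each micro-batch arrives
  model.predictOn(testData).print()  // emit a cluster index per test point
}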
Spark Streaming
Reading from Kafka
• Direct Approach (>= Spark 1.3)
• No receivers or write-ahead logs
• Exactly-once semantics
• Built on the Kafka Simple Consumer API
→ We adopted the Direct Approach (see the sketch below)
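A minimal sketch of the Direct Approach against the Spark 1.5 / Kafka 0.8 integration (spark-streaming-kafka); the broker list and topic name are placeholders.

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// ssc is an existing StreamingContext.
def kafkaStream(ssc: StreamingContext) = {
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
  val topics = Set("access_log")

  // createDirectStream tracks offsets itself through the simple consumer API,
  // so no receivers or write-ahead logs are involved.
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)
}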
Spark Streaming basics 1
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
(Figure: a DStream is a continuous sequence of RDDs: RDD @ time1, RDD @ time2, RDD @ time3, RDD @ time4)
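A minimal sketch following the linked programming guide: a StreamingContext with a 1-second micro-batch interval, reading from a socket source purely for illustration.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-basics").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))  // 1-second micro-batch interval

val lines = ssc.socketTextStream("localhost", 9999)  // DStream[String]
lines.print()

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // block until the job is stopped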
Spark Streaming basics 2
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
(Figure: operations on a DStream are applied to each micro-batch)
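The canonical per-micro-batch example from the guide: an operation applied to a DStream runs on the RDD of every micro-batch.

import org.apache.spark.streaming.dstream.DStream

// lines is assumed to be an existing DStream[String], e.g. from the snippet above.
def toWords(lines: DStream[String]): DStream[String] =
  lines.flatMap(_.split(" "))  // runs on each micro-batch's RDD, yields a new DStream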
Window-based micro-batch
• A single micro-batch does not hold enough events per user (cookie)
• A window groups several micro-batches together; window length and slide interval must be multiples of one micro-batch interval (see the sketch below)
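A minimal sketch of a window-based operation; the key type, reduce function, and the 60s/10s durations are placeholders, not the production settings.

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Count events per cookie over a 60-second window sliding every 10 seconds.
// Both durations must be multiples of the micro-batch interval.
def countsPerCookie(events: DStream[(String, Int)]): DStream[(String, Int)] =
  events.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))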
Micro-batch output
• Write the resulting RDD to HBase in foreachRDD:

dstream.foreachRDD { rdd =>
  val hbaseConf = createHbaseConfiguration()
  val jobConf = new Configuration(hbaseConf)
  jobConf.set("mapreduce.job.output.key.class", classOf[Text].getName)
  jobConf.set("mapreduce.job.output.value.class", classOf[Text].getName)
  jobConf.set("mapreduce.outputformat.class", classOf[TableOutputFormat[Text]].getName)

  new PairRDDFunctions(rdd.map(hbaseConvert)).saveAsNewAPIHadoopDataset(jobConf)
}

// Convert RDD[(String, Map[K,V])] to RDD[(String, Put)]
def hbaseConvert(t: (String, Map[String, String])) = {
  val p = new Put(Bytes.toBytes(t._1))
  t._2.toSeq.foreach(m =>
    p.addColumn(Bytes.toBytes("seg"), Bytes.toBytes(m._1), Bytes.toBytes(m._2)))
  (t._1, p)
}
• Micro-batch interval: 0.5 to 1 second
Spark Streaming tips: getting started
• A DStream is simply a sequence of RDDs
• Existing Spark (batch) knowledge carries straight over to Spark Streaming (see the sketch below)
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
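Since a DStream is just RDDs over time, existing RDD-level code can be reused per micro-batch; a minimal sketch with an illustrative transformation:

import org.apache.spark.streaming.dstream.DStream

def shout(lines: DStream[String]): DStream[String] =
  lines.transform(rdd => rdd.map(_.toUpperCase))  // any RDD → RDD code fits here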
Spark Streaming tips: operations
• Fault tolerance is handled per micro-batch
• Runs on YARN
• Take care with YARN Dynamic Resource Allocation for long-running streaming jobs
Spark Streaming tips: testing
• Structure your logic as pure transformations: RDD → RDD and DStream → DStream
• Then the behavior of a single micro-batch can be tested in isolation:
// RDD → RDD: plain RDDs can be built directly for unit tests
val input: RDD[String] = sparkContext.makeRDD(Seq("a", "b", "c"))

// DStream → DStream: feed test RDDs through a queue-backed stream
val queue = scala.collection.mutable.Queue(rdd)
val dstream: DStream[String] = sparkStreamingContext.queueStream(queue)
Spark Streaming tips: testing with spark-testing-base
• https://github.com/holdenk/spark-testing-base

class JsonElementCountTest extends StreamingSuiteBase {
  test("simple") {
    val input = List(List("aa"), List("bb"))
    val expected = List(List("AA"), List("BB"))
    testOperation[String, String](input, converterMethod _, expected, useSet = true)
  }
}
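The slide does not show converterMethod itself; inferred from the input/expected pairs above, it would be a DStream → DStream transformation along these lines:

import org.apache.spark.streaming.dstream.DStream

def converterMethod(lines: DStream[String]): DStream[String] =
  lines.map(_.toUpperCase)  // "aa" → "AA", "bb" → "BB"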
Spark Streaming tips: testing window operations
• Testing window-based micro-batch logic requires controlling time
• o.a.spark.streaming.util.ManualClock can drive virtual time, but it is a private class and cannot be reached from plain Scala code (see the sketch below)
• http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/
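A sketch of the workaround from the linked blog post: because the clock and the scheduler are private to Spark's packages, a small wrapper is declared inside org.apache.spark.streaming to reach them (in Spark 1.5 the class lives at org.apache.spark.util.ManualClock; verify against your version).

package org.apache.spark.streaming

import org.apache.spark.util.ManualClock

// Requires sparkConf.set("spark.streaming.clock", "org.apache.spark.util.ManualClock")
// before the StreamingContext is created.
class ClockWrapper(ssc: StreamingContext) {
  private def clock: ManualClock = ssc.scheduler.clock.asInstanceOf[ManualClock]

  // Advance virtual time so window-based operations fire deterministically in tests.
  def advance(millis: Long): Unit = clock.advance(millis)
}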
Spark Streaming tips: Scala or Java?
• Sample code for Spark Streaming, Kafka, and HBase is mostly written in Scala
• Java works too; the Java API is a thin wrapper over the Scala API:
// api/java/JavaRDD.scala
object JavaRDD {
  implicit def fromRDD[T: ClassTag](rdd: RDD[T]): JavaRDD[T] = new JavaRDD[T](rdd)
  implicit def toRDD[T](rdd: JavaRDD[T]): RDD[T] = rdd.rdd
}
Summary
• Real-time user attribute estimation (user → segments) built on Spark Streaming
• Future work: MLlib
• Future work: GraphX