Real-Time User Attribute Estimation with Spark Streaming
Agenda
• Spark Streaming
• Spark Streaming Tips
Speaker: SAEKI Yoshiyasu
• IT engineer; Web development background, now in R&D
• Works with Hadoop, Kafka, Storm, Spark, Druid
• Hobby: RICOH Theta + Google Cardboard
Spark Streaming
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
Background: Recruit's business model of matching users with clients
http://www.recruit.jp/company/about/structure.html
• Estimating user attributes ≒ assigning users to segments
• Estimated attributes include OS etc.
Data collection:
1. Gather logs from the web front end (JavaScript tag)
2. Forward them via fluentd into Kafka
Log forwarding: fluentd → Kafka
• fluent-plugin-kafka
• https://github.com/htgc/fluent-plugin-kafka
• output type = kafka_buffered (file-based buffering)
• Kafka 0.8.2.2 (0.9.0 adds features such as ACLs)
Suro
• A data pipeline OSS from Netflix
• https://github.com/Netflix/suro
• Input: Kafka Consumer API, Thrift API
• Output sinks:
  • HDFS
  • AWS S3
  • Kafka Producer
  • Elasticsearch
Gobblin
• A data ingestion framework for Hadoop
• Loads data into HDFS

MLlib
• Streaming linear regression (Classification)
• Streaming k-means (Clustering), see the sketch below
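For a feel of the streaming algorithms listed above, here is a minimal sketch of StreamingKMeans against the Spark 1.5 MLlib API; the function name, k, and the dimensionality are illustrative placeholders, not values from the talk.

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

// trainingData / testData are assumed to be existing DStream[Vector]s.
def clusterStreams(trainingData: DStream[Vector], testData: DStream[Vector]): Unit = {
  val model = new StreamingKMeans()
    .setK(3)                                   // number of clusters (placeholder)
    .setDecayFactor(1.0)                       // 1.0 = weight all past data equally
    .setRandomCenters(dim = 10, weight = 0.0)  // random initial centers for 10-dimensional points

  model.trainOn(trainingData)        // update centers as each micro-batch arrives
  model.predictOn(testData).print()  // emit a cluster index per test point
}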
Spark Streaming
Reading from Kafka
• Direct Approach (>= Spark 1.3)
• No receivers or write-ahead logs
• Exactly-once semantics
• Built on the Kafka Simple Consumer API
→ We adopted the Direct Approach (see the sketch below)
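A minimal sketch of the Direct Approach against the Spark 1.5 / Kafka 0.8 integration (spark-streaming-kafka); the broker list and topic name are placeholders.

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// ssc is an existing StreamingContext.
def kafkaStream(ssc: StreamingContext) = {
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
  val topics = Set("access_log")

  // createDirectStream tracks offsets itself through the simple consumer API,
  // so no receivers or write-ahead logs are involved.
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)
}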
Spark Streaming basics 1
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
(Figure: a DStream is a continuous sequence of RDDs: RDD @ time1, RDD @ time2, RDD @ time3, RDD @ time4)
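A minimal sketch following the linked programming guide: a StreamingContext with a 1-second micro-batch interval, reading from a socket source purely for illustration.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-basics").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))  // 1-second micro-batch interval

val lines = ssc.socketTextStream("localhost", 9999)  // DStream[String]
lines.print()

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // block until the job is stopped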
Spark Streaming basics 2
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
(Figure: operations on a DStream are applied to each micro-batch)
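The canonical per-micro-batch example from the guide: an operation applied to a DStream runs on the RDD of every micro-batch.

import org.apache.spark.streaming.dstream.DStream

// lines is assumed to be an existing DStream[String], e.g. from the snippet above.
def toWords(lines: DStream[String]): DStream[String] =
  lines.flatMap(_.split(" "))  // runs on each micro-batch's RDD, yields a new DStream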
Window-based micro-batch
• A single micro-batch does not hold enough events per user (cookie)
• A window groups several micro-batches together; window length and slide interval must be multiples of one micro-batch interval (see the sketch below)
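A minimal sketch of a window-based operation; the key type, reduce function, and the 60s/10s durations are placeholders, not the production settings.

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Count events per cookie over a 60-second window sliding every 10 seconds.
// Both durations must be multiples of the micro-batch interval.
def countsPerCookie(events: DStream[(String, Int)]): DStream[(String, Int)] =
  events.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))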
Micro-batch output
• Write the resulting RDD to HBase in foreachRDD:

dstream.foreachRDD { rdd =>
  val hbaseConf = createHbaseConfiguration()
  val jobConf = new Configuration(hbaseConf)
  jobConf.set("mapreduce.job.output.key.class", classOf[Text].getName)
  jobConf.set("mapreduce.job.output.value.class", classOf[Text].getName)
  jobConf.set("mapreduce.outputformat.class", classOf[TableOutputFormat[Text]].getName)

  new PairRDDFunctions(rdd.map(hbaseConvert)).saveAsNewAPIHadoopDataset(jobConf)
}

// Convert RDD[(String, Map[K,V])] to RDD[(String, Put)]
def hbaseConvert(t: (String, Map[String, String])) = {
  val p = new Put(Bytes.toBytes(t._1))
  t._2.toSeq.foreach(m =>
    p.addColumn(Bytes.toBytes("seg"), Bytes.toBytes(m._1), Bytes.toBytes(m._2)))
  (t._1, p)
}
• Micro-batch interval: 0.5 to 1 second
Spark Streaming tips: getting started
• A DStream is simply a sequence of RDDs
• Existing Spark (batch) knowledge carries straight over to Spark Streaming (see the sketch below)
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
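Since a DStream is just RDDs over time, existing RDD-level code can be reused per micro-batch; a minimal sketch with an illustrative transformation:

import org.apache.spark.streaming.dstream.DStream

def shout(lines: DStream[String]): DStream[String] =
  lines.transform(rdd => rdd.map(_.toUpperCase))  // any RDD → RDD code fits here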
Spark Streaming tips: operations
• Fault tolerance is handled per micro-batch
• Runs on YARN
• Take care with YARN Dynamic Resource Allocation for long-running streaming jobs
Spark Streaming tips: testing
• Structure your logic as pure transformations: RDD → RDD and DStream → DStream
• Then the behavior of a single micro-batch can be tested in isolation:
// RDD → RDD: plain RDDs can be built directly for unit tests
val input: RDD[String] = sparkContext.makeRDD(Seq("a", "b", "c"))

// DStream → DStream: feed test RDDs through a queue-backed stream
val queue = scala.collection.mutable.Queue(rdd)
val dstream: DStream[String] = sparkStreamingContext.queueStream(queue)
Spark Streaming tips: testing with spark-testing-base
• https://github.com/holdenk/spark-testing-base

class JsonElementCountTest extends StreamingSuiteBase {
  test("simple") {
    val input = List(List("aa"), List("bb"))
    val expected = List(List("AA"), List("BB"))
    testOperation[String, String](input, converterMethod _, expected, useSet = true)
  }
}
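The slide does not show converterMethod itself; inferred from the input/expected pairs above, it would be a DStream → DStream transformation along these lines:

import org.apache.spark.streaming.dstream.DStream

def converterMethod(lines: DStream[String]): DStream[String] =
  lines.map(_.toUpperCase)  // "aa" → "AA", "bb" → "BB"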
Spark Streaming tips: testing window operations
• Testing window-based micro-batch logic requires controlling time
• o.a.spark.streaming.util.ManualClock can drive virtual time, but it is a private class and cannot be reached from plain Scala code (see the sketch below)
• http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/
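A sketch of the workaround from the linked blog post: because the clock and the scheduler are private to Spark's packages, a small wrapper is declared inside org.apache.spark.streaming to reach them (in Spark 1.5 the class lives at org.apache.spark.util.ManualClock; verify against your version).

package org.apache.spark.streaming

import org.apache.spark.util.ManualClock

// Requires sparkConf.set("spark.streaming.clock", "org.apache.spark.util.ManualClock")
// before the StreamingContext is created.
class ClockWrapper(ssc: StreamingContext) {
  private def clock: ManualClock = ssc.scheduler.clock.asInstanceOf[ManualClock]

  // Advance virtual time so window-based operations fire deterministically in tests.
  def advance(millis: Long): Unit = clock.advance(millis)
}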
Spark Streaming tips: Scala or Java?
• Sample code for Spark Streaming, Kafka, and HBase is mostly written in Scala
• Java works too; the Java API is a thin wrapper over the Scala API:
// api/java/JavaRDD.scala
object JavaRDD {
  implicit def fromRDD[T: ClassTag](rdd: RDD[T]): JavaRDD[T] = new JavaRDD[T](rdd)
  implicit def toRDD[T](rdd: JavaRDD[T]): RDD[T] = rdd.rdd
}
Summary
• Real-time user attribute estimation (user → segments) built on Spark Streaming
• Future work: MLlib
• Future work: GraphX