
BigQuery case study in Groovenauts

Dive into the DataflowJavaSDK

BigQuery case study in Groovenauts

Tomoyuki Chikanaga

2015.04.24 #bq_sushi tokyo #1

Groovenauts, Inc.

HQ: Fukuoka

Tokyo branch

Our Business

• MAGELLAN (new)

• Consulting

• Game Server

BigQuery anywhere

• MAGELLAN (new)

• Container Hosting Service

• Support HTTP/MQTT

• Built on Google Cloud Platform

BigQuery in MAGELLAN

• Resource Monitoring (VM/container etc..)

• Developer’s Activity Logs

• Application Logs

• End-user’s Access Logs

Schematic View

[Diagram: end-users send API requests through a router to containers; developers deploy containers via the developers console]

Resource Monitoring

[Diagram: a monitoring system collects usage logs from the platform, watches system usage, and extracts each user's usage for billing (not yet implemented)]

Developer's Activity Logs

[Diagram: operations from the developers console (e.g. deploys) are recorded as developer's activity logs, used to analyze/trace each developer's actions]

Application Logs

[Diagram: containers emit application logs; developers view them by running queries from the developers console]

End User's Access Logs

[Diagram: the router records end users' API requests as access logs; developers query them from the developers console to view logs and metrics]

BigQuery Quota

• Concurrent rate limit: up to 20 concurrent queries.

• Daily limit: 20,000 queries / project

BigQuery Quota

[Diagram: every developer's log-view queries hit BigQuery directly]

• We may reach the quota limit as the number of developers increases.

BigQuery Quota

[Diagram: the same query path, with BigQuery swapped for a yet-undecided storage backend ("??")]

• We plan to migrate these queries to other storage.

BigQuery in Business

• CPG (Maker / Distribution / Retail)

• Automotive after-market

BigQuery in Business

• POS Data Analysis

• Excel + BigQuery

• GPS Telemetric Analysis

• company vehicle utilization / travel distance etc..

POS Data Analysis

• Replace existing system

• RDB → BigQuery

• Excel: SQL Generation, Visualization (Table, Graph)

Excel: SQL Generation

• Generate SQL using Excel functions: parameters are filled into SQL templates

POS Data Analysis

• Result

• Analysis Time: 12x faster

• Running Cost: 95% cut

GPS Telemetric Analysis

[Diagram: vehicle devices upload GPS location data; customers analyze it]

BigQuery Pros. & Cons.

• Pros.

• Running Cost

• Scalability

• Cons.

• Stability

• Query Latency / Quota

Dive into the DataflowJavaSDK

@nagachika

2015.04.24 #bq_sushi tokyo #1

Who are you?

• @nagachika (twitter/github)

• ruby-trunk-changes (d.hatena.ne.jp/nagachika)

• Ruby committer, 2.2 stable branch maintainer

• Fukuoka.rb (Regional Ruby Community)

One Day…

Boss: I've heard about Google Cloud Dataflow! It may unify Batch & Streaming Distributed Processing.

Me: Wow, that sounds awesome.

Boss: I'd like to integrate it with our service.

Me: Eh!? I have to investigate the details...

Boss: I'll leave it to you.

Two Missions

• Port SDK to other Language (Ruby etc..)

• Implement Custom Stream Input (AMQP)

from: https://cloud.google.com/dataflow/what-is-google-cloud-dataflow

Dataflow SDK for Java

Open Source

Open Source

• Apache License Version 2.0

• You can read it

• You can modify it

• You can run it

• locally (PubsubIO is not supported)

• on the Cloud Dataflow service(beta)

http://dataflow-java-sdk-weekly.hatenablog.com/

Read every commit

• catch up on recent hot topics

• see which related components get modified together

• get to know the developers and their territories

Disclaimer

• I’m not good at Java.

• I'm a newbie at distributed computing.

Directory Tree

• sdk/src/

• main/java/com/google/cloud/dataflow/sdk (SDK Source Code)

• test/java/com/google/cloud/dataflow/sdk (Test Code for SDK)

• examples/src/

• main/java/com/google/cloud/dataflow/examples (Example Pipeline Source Code)

• test/java/com/google/cloud/dataflow/examples (Test for Examples)

• contrib/: Community Contributed Library (join-library)

sdk/src/main/java/com/google/cloud/dataflow/sdk/

• coders/

• Coder classes

• io/

• Input/Output (Source/Sink)

• options/

• Command Line Options Utilities

• runners/

• Pipeline runners: drivers to run a pipeline locally or on the service (see the sketch below)

• transforms/

• PTransform classes

• values/

• PCollection classes
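A minimal sketch of wiring the options/ and runners/ classes above together; the project ID and bucket are placeholders, and this assumes the 1.x-era SDK API:

import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;

DataflowPipelineOptions options =
    PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
options.setProject("my-project-id");                   // placeholder project
options.setStagingLocation("gs://my-bucket/staging");  // placeholder bucket
// Run on the managed service; DirectPipelineRunner.class would run locally instead.
options.setRunner(DataflowPipelineRunner.class);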

Pipeline Components

[Diagram: a Source feeds an initial PCollection; each PTransform turns a PCollection into a new PCollection; a Sink consumes the final one]

Pipeline as Code

Pipeline p = Pipeline.create(options);

p.apply(TextIO.Read.named("Read").from(input))
 .apply(new MyTransform())
 .apply(TextIO.Write.named("Write").to(output));

PCollection & PTransform

• Pipeline.apply() / PCollection.apply() signature:

public <Output extends POutput> Output apply(PTransform<? super PCollection<T>, Output> t)

PCollection

• Container of data in Dataflow Pipeline

• Bounded (fixed size) or Unbounded (variable size ≒ streaming)

• A handle to the real data (elements), cf. a file descriptor or pipe
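A minimal sketch of creating bounded PCollections; the variable names and the GCS path are placeholders, and Create and TextIO are SDK transforms:

// Bounded: a fixed, in-memory collection.
PCollection<String> small = p.apply(Create.of("a", "b", "c"));

// Bounded: lines of a text file on GCS (placeholder path).
PCollection<String> lines = p.apply(TextIO.Read.from("gs://my-bucket/input.txt"));

// Unbounded PCollections come from streaming sources such as PubsubIO.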

[Diagram: a Bounded PCollection holds a fixed set of elements; an Unbounded PCollection keeps growing]

Coder

• Data in PCollection = Byte Stream

• Decode/Encode at PTransform’s In/Out

[Diagram: elements travel between PTransforms as byte streams; Coder.encode() serializes each element on the way out, Coder.decode() restores it on the way in]

Coder

• Integer

• Double

• String

• List<T>

• Map<K,V>

• KV<K,V> (Key Value pair)

• TableRow (← BigQuery Table’s row)
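A minimal sketch of setting coders for the types above explicitly (they are usually inferred; the collection names are placeholders):

// A collection of strings, encoded as UTF-8.
words.setCoder(StringUtf8Coder.of());

// A KV collection needs a coder for each side.
pairs.setCoder(KvCoder.of(StringUtf8Coder.of(), VarLongCoder.of()));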

PTransform

• Each step in the pipeline

• Core Transforms

• ParDo / GroupByKey / Combine / Flatten / Join ← User Defined Code plugs in here

• Composite Transforms

• Root Transforms (read, write, create)

• Predefined Transforms (SDK built-in)

Composite Transform

• Construct a Transform from other Transforms

• ex) Sum, Count.Globally<T> etc..

Count

• Override the apply() method:

public class Count {
  public static class Globally<T>
      extends PTransform<PCollection<T>, PCollection<Long>> {
    @Override
    public PCollection<Long> apply(PCollection<T> input) {
      Combine.Globally<Long, Long> sumGlobally;
      …
      sumGlobally = Sum.longsGlobally().withFanout(fanout);
      …
      return input
          .apply(ParDo.named("Init").of(new DoFn<T, Long>() {
            @Override
            public void processElement(ProcessContext c) {
              c.output(1L);
            }
          }))
          .apply(sumGlobally);
    }
  }
}
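A usage sketch for the class above, assuming it can be default-constructed as shown (words is a placeholder PCollection<String>):

// Counts all elements of the input collection.
PCollection<Long> total = words.apply(new Count.Globally<String>());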

PTransform.apply()

public abstract class PTransform<Input extends PInput, Output extends POutput> {
  public Output apply(Input input) { }
}

The apply() call chain:

PCollection.apply()
=> Pipeline.applyTransform()
=> Pipeline.applyInternal()
=> PipelineRunner.apply()
=> PTransform.apply()

apply()

• used in the construction phase

• apply() constructs a Pipeline from Transforms

ParDo & DoFn

• User-defined runtime code = DoFn

return input
    .apply(ParDo.named("Init").of(new DoFn<T, Long>() {  // ← User Defined Code
      @Override
      public void processElement(ProcessContext c) {
        c.output(1L);
      }
    }))
    .apply(sumGlobally);

processElement

void DoFn<I,O>.processElement(ProcessContext context)

• DoFn<I,O>.processElement() receives one element of the input PCollection at a time

• I ProcessContext.element(): the current input element

• void ProcessContext.output(O output): emits an output element

Example of DoFn

static class ExtractWordsFn extends DoFn<String, String> {
  @Override
  public void processElement(ProcessContext c) {
    String[] words = c.element().split("[^a-zA-Z']+");
    for (String word : words) {
      if (!word.isEmpty()) {
        c.output(word);
      }
    }
  }
}

static class FormatCountsFn extends DoFn<KV<String, Long>, String> {
  @Override
  public void processElement(ProcessContext c) {
    c.output(c.element().getKey() + ": " + c.element().getValue());
  }
}

from WordCount.java

Staging

• How is user-defined code loaded into the Dataflow managed service?

• DoFn<I,O> implements Serializable (see the sketch below)

• .jar files on $CLASSPATH are uploaded to the GCS `staging` bucket
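A small sketch of what DoFn's Serializable contract implies; MyFn and prefix are hypothetical names:

// A DoFn instance is serialized and shipped to the workers along with the staged
// jars, so any state captured in its fields must itself be Serializable.
static class MyFn extends DoFn<String, String> {
  private final String prefix;  // captured state travels with the serialized DoFn

  MyFn(String prefix) { this.prefix = prefix; }

  @Override
  public void processElement(ProcessContext c) {
    c.output(prefix + c.element());
  }
}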

Two Missions

• Port SDK to other Language (Ruby etc..)

• Implement Custom Stream Input (AMQP)

The Dataflow service depends on the JVM runtime. (A Python SDK is planned for a future release.)

Source/Sink

• TextIO (GCS)

• DatastoreIO

• BigQueryIO

• PubsubIO (for streaming mode)
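A minimal read/write sketch for the BigQueryIO entry above; the table names are placeholders and schema handling for the write side is elided:

// Read: each row arrives as a TableRow element.
PCollection<TableRow> rows =
    p.apply(BigQueryIO.Read.named("ReadLogs").from("my-project:my_dataset.my_table"));

// Write: append the rows to another table (placeholder name).
// In practice a schema or a CREATE_NEVER disposition would also be needed.
rows.apply(BigQueryIO.Write.named("WriteLogs").to("my-project:my_dataset.output_table"));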

PubsubIO impl. in SDK

• PubsubIO.Read.Bound<T> extends PTransform<PInput, PCollection<T>>

• Bound doesn't contain any runtime implementation

• runners.worker.ReaderFactory translates these objects into a Source/Sink type plus parameters and ships them to the Dataflow service workers
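A usage sketch for PubsubIO; the topic name is a placeholder and streaming mode is assumed:

// Construction only: the actual reading happens inside the managed service workers.
PCollection<String> messages =
    p.apply(PubsubIO.Read.named("ReadPubsub").topic("/topics/my-project/my-topic"));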

Two Missions

• Port SDK to other Language (Ruby etc..)

• Implement Custom Stream Input (AMQP)

Dataflow custom input development is not supported yet. (Is there no future plan?)

Me: I've found that there's no way to accomplish these missions right now...

Boss: OK. But stay tuned for the activities in Dataflow.

Me: Roger.

Official Documentation

https://cloud.google.com/dataflow/

Let's dive into the DataflowJavaSDK

Windowing

• for Streaming mode

• for Combine/GroupByKey

Windowing

Example: counting elements per key.

Input: k1: 1, k1: 2, k1: 3, k2: 2

Group by Key → k1: [1, 2, 3], k2: [2]

Combine (count) → k1: 3, k2: 1

• These transforms require all elements of the input, but in streaming mode inputs are unbounded.
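The same flow as code, a sketch assuming a placeholder KV<String, Integer> input collection:

// Group: KV<String, Integer> -> KV<String, Iterable<Integer>>
PCollection<KV<String, Iterable<Integer>>> grouped =
    input.apply(GroupByKey.<String, Integer>create());

// Combine (count per key): KV<String, Integer> -> KV<String, Long>
PCollection<KV<String, Long>> counts =
    input.apply(Count.<String, Integer>perKey());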

Windowing

• Fixed Time Windows

• Sliding Time Windows

• Session Windows

• Single Global Window

Group elements into windows by timestamp (see the sketch below).
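A minimal sketch of fixed one-minute windows feeding a per-key count; the window size is a placeholder choice, and Duration is org.joda.time.Duration:

PCollection<KV<String, Long>> windowedCounts =
    input
        .apply(Window.<KV<String, Integer>>into(
            FixedWindows.of(Duration.standardMinutes(1))))
        .apply(Count.<String, Integer>perKey());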

Trigger

• Streaming data can arrive with some delay.

• Dataflow should wait for a while after the end of the window in wall-clock time.

• Time-Based Triggers

• Data-Driven Triggers
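A sketch of a time-based trigger; the API names are as in later 1.x releases of the SDK, and the lateness bound is a placeholder choice:

input.apply(Window.<KV<String, Integer>>into(
        FixedWindows.of(Duration.standardMinutes(1)))
    // Fire when the watermark passes the end of the window...
    .triggering(AfterWatermark.pastEndOfWindow())
    // ...but still accept data that arrives up to 30 seconds late.
    .withAllowedLateness(Duration.standardSeconds(30))
    .discardingFiredPanes());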