BigQuery Case Study in Groovenauts & Dive into the DataflowJavaSDK
TRANSCRIPT
BigQuery in MAGELLAN
• Resource Monitoring (VM/container, etc.)
• Developer’s Activity Logs
• Application Logs
• End-user’s Access Logs
Resource Monitoring
End-user
Developer
Containers
router
developers console
API request
Monitoring System
usage logs
Watch System Usage
Extract user’s usage
billing (not yet implemented)
Developer’s Activity Logs
End-user
Developer
Containers
router
developers console
Deploy
Deploy
developer’s activity logs
Analyze/Trace developer’s action
Application logs
End-user
Developer
Containers
router
developers console
API request
application logs
View logs
query
End user’s access logs
End-user
Developer
Containers
router
developers console
API request
access logs
View logs
query
metrics
BigQuery Quota
• Concurrent rate limit: up to 20 concurrent queries.
• Daily limit: 20,000 queries / project
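One way to stay under the concurrent-rate limit is a client-side throttle. A minimal plain-Java sketch using a Semaphore with the 20-query ceiling from the slide (QueryThrottle and runQuery are hypothetical names, not the BigQuery client API):

```java
import java.util.concurrent.Semaphore;

// Client-side throttle mirroring BigQuery's concurrent-query quota.
// QUERY_LIMIT (20) is taken from the slide; runQuery is a stand-in
// for a real BigQuery call.
public class QueryThrottle {
    static final int QUERY_LIMIT = 20;
    private final Semaphore slots = new Semaphore(QUERY_LIMIT);

    public String runQuery(String sql) {
        slots.acquireUninterruptibly(); // blocks when 20 queries are already in flight
        try {
            return "result-of:" + sql;  // placeholder for the real query execution
        } finally {
            slots.release();
        }
    }

    public int availableSlots() {
        return slots.availablePermits();
    }

    public static void main(String[] args) {
        QueryThrottle t = new QueryThrottle();
        System.out.println(t.runQuery("SELECT 1"));
        System.out.println(t.availableSlots()); // back to 20 after the query returns
    }
}
```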
BigQuery Quota
End-user
Developer
Containers
router
developers console
View logs
query
may reach the quota limit as the number of developers increases.
BigQuery Quota
End-user
Developer
Containers
router
developers console
View logs
we plan to migrate to another storage system.
??
BigQuery in Business
• POS Data Analysis
• Excel + BigQuery
• GPS Telemetric Analysis
• company vehicle utilization, travel distance, etc.
POS Data Analysis
• Replace existing system
• RDB → BigQuery
• Excel: SQL Generation, Visualization (Table, Graph)
BigQuery Pros & Cons
• Pros
• Running Cost
• Scalability
• Cons
• Stability
• Query Latency / Quota
• @nagachika (twitter/github)
• ruby-trunk-changes (d.hatena.ne.jp/nagachika)
• Ruby committer, 2.2 stable branch maintainer
• Fukuoka.rb (Regional Ruby Community)
Who are you?
One Day…
Boss:
I’ve heard about Google Cloud Dataflow! It may unify Batch & Streaming Distributed Processing.
Wow, that sounds awesome.
I’d like to integrate it with our service.
Eh!? I have to investigate the details...
I’ll leave it to you.
Open Source
• Apache License Version 2.0
• You can read it
• You can modify it
• You can run it
• locally (PubsubIO is not supported)
• on the Cloud Dataflow service(beta)
Read every commit
• catch-up recent hot topics
• related components are modified concurrently
• know developers and their territory
Directory Tree
• sdk/src/
• main/java/com/google/cloud/dataflow/sdk (SDK Source Code)
• test/java/com/google/cloud/dataflow/sdk (Test Code for SDK)
• examples/src/
• main/java/com/google/cloud/dataflow/examples (Example Pipeline Source Code)
• test/java/com/google/cloud/dataflow/examples (Test for Examples)
• contrib/: Community Contributed Library (join-library)
sdk/src/main/java/com/google/cloud/dataflow/sdk/
• coders/
• Coder classes
• io/
• Input/Output (Source/Sink)
• options/
• Command Line Options Utilities
• runners/
• Pipeline runners: drivers to run a pipeline locally or on the service
• transforms/
• PTransform classes
• values/
• PCollection classes
Pipeline Components
Source → PCollection → PTransform → PCollection → PTransform → … → Sink
Pipeline as Code

Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.named("Read").from(input))
 .apply(new MyTransform())
 .apply(TextIO.Write.named("Write").to(output));
PCollection
PTransform
public <Output extends POutput> Output apply(PTransform<? super PCollection<T>, Output> t)
• Pipeline.apply()/PCollection.apply() Signature
PCollection
• Container of data in Dataflow Pipeline
• Bounded (fixed size) or Unbounded (variable size ≒ streaming)
• A handle to the real data (elements); cf. a file descriptor or pipe
Coder
• Integer
• Double
• String
• List<T>
• Map<K,V>
• KV<K,V> (Key Value pair)
• TableRow (← BigQuery Table’s row)
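A Coder's job is to turn elements into bytes and back so they can be shipped between workers. A plain-Java sketch of that idea for a KV<String, Long>-like pair (KvCoderSketch is illustrative only, not the SDK's Coder API):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative analogue of a Coder: encode a key/value pair as
// [keyLength][keyBytes][value] and decode it back.
public class KvCoderSketch {
    public static byte[] encode(String key, long value) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + k.length + 8);
        buf.putInt(k.length).put(k).putLong(value);
        return buf.array();
    }

    public static String decodeKey(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        byte[] k = new byte[buf.getInt()];
        buf.get(k);
        return new String(k, StandardCharsets.UTF_8);
    }

    public static long decodeValue(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        buf.position(4 + buf.getInt(0)); // skip length prefix + key bytes
        return buf.getLong();
    }

    public static void main(String[] args) {
        byte[] b = encode("word", 42L);
        System.out.println(decodeKey(b) + " -> " + decodeValue(b));
    }
}
```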
PTransform
• Each step in the pipeline
• Core Transforms
• ParDo/GroupByKey/Combine/Flatten/Join
• Composite Transforms
• Root Transforms (read, write, create)
• Predefined Transforms (SDK Builtin)
User Defined Code
Composite Transform
• Construct a Transform from other Transforms
• e.g. Sum, Count.Globally<T>, etc.
Composite Transform
Count
• Override the apply() method

public class Count {
  public static class Globally<T>
      extends PTransform<PCollection<T>, PCollection<Long>> {
    @Override
    public PCollection<Long> apply(PCollection<T> input) {
      Combine.Globally<Long, Long> sumGlobally;
      …
      sumGlobally = Sum.longsGlobally().withFanout(fanout);
      …
      return input
          .apply(ParDo.named("Init")
              .of(new DoFn<T, Long>() {
                @Override
                public void processElement(ProcessContext c) {
                  c.output(1L);
                }
              }))
          .apply(sumGlobally);
    }
  }
}
PTransform.apply()
public abstract class PTransform<Input extends PInput, Output extends POutput> {
  public Output apply(Input input) { }
}
apply()

PCollection.apply()
 => Pipeline.applyTransform()
 => Pipeline.applyInternal()
 => PipelineRunner.apply()
 => PTransform.apply()
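The dispatch chain above can be sketched as plain-Java double dispatch; the class names mimic the SDK but the classes are simplified stand-ins, not the real API:

```java
// PCollection.apply() hands off to the Pipeline, which asks the
// PipelineRunner, which finally invokes PTransform.apply().
public class ApplyChainSketch {
    interface PTransform { String apply(String input); }

    static class PipelineRunner {
        // In the real SDK the runner may intercept or replace the transform here.
        String apply(PTransform t, String input) { return t.apply(input); }
    }

    static class Pipeline {
        final PipelineRunner runner = new PipelineRunner();
        String applyTransform(PTransform t, String input) { return applyInternal(t, input); }
        String applyInternal(PTransform t, String input) { return runner.apply(t, input); }
    }

    static class PCollection {
        final Pipeline pipeline;
        final String contents;
        PCollection(Pipeline p, String c) { pipeline = p; contents = c; }
        PCollection apply(PTransform t) {
            return new PCollection(pipeline, pipeline.applyTransform(t, contents));
        }
    }

    public static void main(String[] args) {
        Pipeline p = new Pipeline();
        PCollection out = new PCollection(p, "hello").apply(s -> s.toUpperCase());
        System.out.println(out.contents); // HELLO
    }
}
```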
ParDo & DoFn
• User-defined runtime code = DoFn

return input
    .apply(ParDo.named("Init")
        .of(new DoFn<T, Long>() {
          @Override
          public void processElement(ProcessContext c) {
            c.output(1L);
          }
        }))
    .apply(sumGlobally);
User Defined Code
processElement
• DoFn<I,O>.processElement()
• Receives an element of the input PCollection
• I ProcessContext.element()
• void ProcessContext.output(O output)
void DoFn<I,O>.processElement(ProcessContext context)
Example of DoFn

static class ExtractWordsFn extends DoFn<String, String> {
  public void processElement(ProcessContext c) {
    String[] words = c.element().split("[^a-zA-Z']+");
    for (String word : words) {
      if (!word.isEmpty()) {
        c.output(word);
      }
    }
  }
}

static class FormatCountsFn extends DoFn<KV<String, Long>, String> {
  public void processElement(ProcessContext c) {
    c.output(c.element().getKey() + ": " + c.element().getValue());
  }
}
from WordCount.java
Staging
• How is user-defined code loaded into the Dataflow managed service?
• DoFn<I,O> implements Serializable
• .jar files in $CLASSPATH are uploaded to the GCS `staging` bucket
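The round trip behind staging can be sketched with plain Java serialization: the service ships user code to workers as serialized objects, which is why DoFn must implement Serializable. WordLengthFn below is a hypothetical stand-in for a DoFn:

```java
import java.io.*;

// Serialize a function object to bytes ("staging") and deserialize it
// ("on the worker"), then run it.
public class StagingSketch {
    static class WordLengthFn implements Serializable {
        int apply(String word) { return word.length(); }
    }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static int roundTrip(String word) {
        try {
            byte[] shipped = serialize(new WordLengthFn());          // "staged"
            WordLengthFn onWorker = (WordLengthFn) deserialize(shipped);
            return onWorker.apply(word);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("dataflow"));
    }
}
```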
Two Missions
• Port the SDK to another language (Ruby, etc.)
• Implement a custom stream input (AMQP)

The Dataflow service depends on the JVM runtime. (A Python SDK is planned for a future release.)
PubsubIO impl. in SDK
• PubsubIO.Read.Bound<T> extends PTransform<PInput, PCollection<T>>
• Bound doesn’t have any runtime implementation.
• runners.worker.ReaderFactory translates these objects into Source/Sink types and parameters and transports them to Dataflow service workers
Two Missions
• Port the SDK to another language (Ruby, etc.)
• Implement Custom Stream Input (AMQP)
Custom input development for Dataflow is not supported yet. (Is there no future plan?)
OK. But stay tuned for the activities in Dataflow.
I’ve found that there’s no way to accomplish these missions right now...
Roger.
Windowing

k1: 1, k1: 2, k1: 3, k2: 2

Group by Key

k1: [1, 2, 3]
k2: [2]

Combine

k1: 3
k2: 1

• These transforms require all elements of the input, but in streaming mode inputs are unbounded.
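On a bounded input the two steps above are a straightforward two-pass computation; a plain-Java sketch using the slide's data (not the SDK's GroupByKey/Combine API) makes the contrast with unbounded input concrete:

```java
import java.util.*;

// GroupByKey then a Count-style Combine over a bounded input.
// On an unbounded (streaming) input we never "have all elements",
// which is why windowing is needed.
public class GroupAndCombineSketch {
    public static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> input) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : input) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return grouped;
    }

    // Combine step from the slide: count the values per key.
    public static Map<String, Integer> countPerKey(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        grouped.forEach((k, vs) -> counts.put(k, vs.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input = List.of(
            Map.entry("k1", 1), Map.entry("k1", 2), Map.entry("k1", 3), Map.entry("k2", 2));
        Map<String, List<Integer>> grouped = groupByKey(input);
        System.out.println(grouped);              // {k1=[1, 2, 3], k2=[2]}
        System.out.println(countPerKey(grouped)); // {k1=3, k2=1}
    }
}
```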
Windowing
• Fixed Time Windows
• Sliding Time Windows
• Session Windows
• Single Global Window
Group elements into windows by timestamp
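Grouping by timestamp can be sketched in plain Java for the fixed-time case: each element lands in the window [t - t % size, t - t % size + size). This mirrors the idea behind the SDK's fixed windows, not its API; all names below are illustrative:

```java
import java.util.*;

// Assign timestamped elements to fixed-size windows keyed by window start.
public class FixedWindowSketch {
    // Start of the fixed window (of the given size, in seconds)
    // that the timestamp falls into.
    public static long windowStart(long timestampSec, long windowSizeSec) {
        return timestampSec - (timestampSec % windowSizeSec);
    }

    public static Map<Long, List<String>> assign(Map<String, Long> elementToTimestamp, long sizeSec) {
        Map<Long, List<String>> windows = new TreeMap<>();
        elementToTimestamp.forEach((element, ts) ->
            windows.computeIfAbsent(windowStart(ts, sizeSec), w -> new ArrayList<>()).add(element));
        return windows;
    }

    public static void main(String[] args) {
        Map<String, Long> events = new LinkedHashMap<>();
        events.put("a", 3L);
        events.put("b", 62L);
        events.put("c", 65L);
        // 60-second fixed windows: "a" -> window 0, "b" and "c" -> window 60
        System.out.println(assign(events, 60L));
    }
}
```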