JCConf 2015 - Applications of Google Dataflow in Cloud Big Data Processing

Post on 11-Apr-2017


Applications of Google Dataflow in Cloud Big Data Processing

Simon Su @ QNAP

https://goo.gl/YuXCw5

var simon = {/** I am at GCPUG.TW **/};

simon.aboutme = 'http://about.me/peihsinsu';

simon.nodejs = 'http://opennodes.arecord.us';

simon.googleshare = 'http://gappsnews.blogspot.tw';

simon.nodejsblog = 'http://nodejs-in-example.blogspot.tw';

simon.blog = 'http://peihsinsu.blogspot.com';

simon.slideshare = 'http://slideshare.net/peihsinsu/';

simon.email = 'simonsu.mail@gmail.com';

simon.say('Good luck to everybody!');

var sunny = {};

sunny.aboutme = 'https://plus.google.com/u/0/+sunnyHU/posts';

sunny.email = 'sunnyhu@mitac.com.tw';

sunny.language = ['Java', '.NET', 'NodeJS', 'SQL'];

sunny.skill = ['Project management', 'System Analysis',

'System design', 'Car ho lan'];

sunny.say('Writing code gets dreary, so keep your mood sunny');

https://www.facebook.com/groups/GCPUG.TW/

https://plus.google.com/u/0/communities/116100913832589966421

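Google Cloud Platform User Group Taiwan

We are the Google Cloud Platform Taiwan User Group. Now that Google's cloud services have started to make their mark in Taiwan, there are many new services, new knowledge, and new ideas. Everyone is welcome to share them and learn about Google's cloud services together...

GCPUG uses the Internet to connect people who enjoy Google Cloud, so they can share and exchange their experience of using GCP. If you are a Google Cloud Platform beginner, you should come and hear how more experienced users work with it. If you are a Google Cloud Platform expert, you should come and share your valuable experience and trade notes with other experts. If you have not started using Google Cloud Platform yet, you should come right away and hear how we use Google Cloud!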

Before Dataflow...

What does Google provide in the big data domain?

Google Cloud Big Data Tools

● Constructs scalable and reliable data pipelines

● Executes processing on Compute Engine instances

● Provides support for:

○ ETL

○ Analytics

○ Real-time computation

○ Process orchestration

● Integrates with GCP services for data processing

○ Cloud Storage

○ Cloud Pub/Sub

○ BigQuery

● Open source Cloud Dataflow Java SDK available

Demo

Run a word count example...

gcloud alpha dataflow jobs list

Install Path: https://dl.google.com/dataflow/eclipse/

Dataflow Programming Models

Pipeline, PCollections, Transforms, Pipeline I/O

• Represents a Data processing job

• Consists of two parts: data and transforms applied to that data

• Consists of a set of operations

○ Read input -> Transform data -> Write output

• May include multiple inputs and multiple outputs

• May encompass many logical MapReduce operations
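The read, transform, write shape described above can be sketched in plain Java. This is not the Dataflow SDK: `PipelineSketch`, `apply`, `run`, and `demo` are hypothetical names chosen for illustration, an in-memory list stands in for the input source, and printing stands in for the output sink.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for a pipeline: an ordered set of operations on data.
final class PipelineSketch {
    // Each step takes the current collection of lines and returns a new one.
    private final List<Function<List<String>, List<String>>> steps = new ArrayList<>();

    // Register a transform; nothing is executed until run() is called.
    PipelineSketch apply(Function<List<String>, List<String>> step) {
        steps.add(step);
        return this;
    }

    // Execute the steps in order, handing each one a read-only copy of the data.
    List<String> run(List<String> input) {
        List<String> current = Collections.unmodifiableList(new ArrayList<>(input));
        for (Function<List<String>, List<String>> step : steps) {
            current = Collections.unmodifiableList(new ArrayList<>(step.apply(current)));
        }
        return current;
    }

    // Example job: trim each line, drop empty lines, then upper-case the rest.
    static List<String> demo() {
        List<String> input = Arrays.asList("  hello ", "", "dataflow");
        return new PipelineSketch()
            .apply(lines -> lines.stream().map(String::trim).collect(Collectors.toList()))
            .apply(lines -> lines.stream().filter(s -> !s.isEmpty()).collect(Collectors.toList()))
            .apply(lines -> lines.stream().map(String::toUpperCase).collect(Collectors.toList()))
            .run(input);
    }

    public static void main(String[] args) {
        // "Write output" is just printing in this sketch.
        System.out.println(demo());
    }
}
```

Separating `apply` from `run` mirrors the idea that a pipeline is first described as data plus transforms and only then executed.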

Input -> Transform -> Output

• AvroIO

○ Read and write Avro files, local or remote (GCS)

• PubSubIO

• Custom Source / Sink API

○ Your source/sink here

• Text files

○ Newline-delimited

○ File can be compressed with gzip or bzip2

• A collection of immutable data of any type in a pipeline

• May be either bounded or unbounded in size

• Bounded: Text, BigQuery, Datastore, custom data

• Unbounded: data source is PubSub; data sinks are PubSub and BigQuery

• Created by using a PTransform to:

○ Build from a java.util.Collection

○ Read from a backing data store

○ Transform an existing PCollection

• Often contains key-value pairs, represented with KV

● A step, or a processing operation, that transforms data

○ Convert format, group, filter data

● Types of transforms

○ ParDo

■ For generic parallel processing; the processing style is similar to a "Mapper"

○ GroupByKey

■ Analogous to the Shuffle phase of a Map/Shuffle/Reduce-style algorithm

■ Use GroupByKey to collect all of the values associated with a unique key

○ Combine

■ Combines the values in your pipeline's PCollection objects, or combines key-grouped values

○ Flatten

■ When multiple PCollection objects contain the same data type, you can merge them into a single logical PCollection using the Flatten transform
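To make the four transform types concrete, here is a minimal sketch in plain Java. It does not use the Dataflow SDK: `TransformSketch` and its method names are hypothetical, ordinary lists and maps stand in for PCollections, `Map.Entry` stands in for KV, and the "key,value" record format is an assumption made only for this example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java analogs of ParDo, GroupByKey, Combine, and Flatten.
final class TransformSketch {
    // Flatten analog: merge two collections of the same element type into one.
    static <T> List<T> flatten(List<T> first, List<T> second) {
        List<T> merged = new ArrayList<>(first);
        merged.addAll(second);
        return merged;
    }

    // Per-element function used by parDo: parse "key,value" into a key-value pair.
    static Map.Entry<String, Integer> parse(String record) {
        String[] parts = record.split(",");
        return new SimpleEntry<>(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    // ParDo analog: process every element independently, like a Mapper.
    static List<Map.Entry<String, Integer>> parDo(List<String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : records) {
            pairs.add(parse(record));
        }
        return pairs;
    }

    // GroupByKey analog: collect all values that share a key, like the Shuffle phase.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Combine analog: reduce each key's grouped values to a single sum.
    static Map<String, Integer> combinePerKey(Map<String, List<Integer>> grouped) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            sums.put(entry.getKey(), sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> all = flatten(Arrays.asList("apple,3", "pear,5"), Arrays.asList("apple,4"));
        System.out.println(combinePerKey(groupByKey(parDo(all))));
    }
}
```

In a real pipeline these steps would run in parallel across workers; the sketch only shows what each transform does to the data.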

Map -> ParDo

Shuffle -> GroupByKey

Reduce -> ParDo

How does WordCount work?

Look into Word Count...
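As a rough illustration of the word count flow, the sketch below follows the Map / Shuffle / Reduce shape in plain Java: extract words from each line, count occurrences per word, then format the result. It is not the demo's actual source and does not use the Dataflow SDK; `WordCountSketch`, its method names, and the word-splitting rule are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java word count: extract words -> count per word -> format.
final class WordCountSketch {
    // Map step (ParDo-like): split every line into words, dropping empty tokens.
    static List<String> extractWords(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("[^a-zA-Z']+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    // Shuffle + Reduce step: group identical words and sum one count per occurrence.
    // Here the grouping and the summing are fused into a single map update.
    static Map<String, Long> countPerWord(List<String> words) {
        Map<String, Long> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    // Final per-element step (ParDo-like): turn each (word, count) pair into a text line.
    static List<String> formatCounts(Map<String, Long> counts) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            lines.add(entry.getKey() + ": " + entry.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("to be or not to be", "that is the question");
        for (String line : formatCounts(countPerWord(extractWords(input)))) {
            System.out.println(line);
        }
    }
}
```

The sorted `TreeMap` is only there to make the printed order stable; a distributed runner would not guarantee any ordering.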

Dataflow use cases

NYC case study

• Functional (transform based) programming model

• Unified programming model for batch & stream processing

• Reduced operational cost of “cluster” management

• Decreased job clock time via platform innovation

• Open source ecosystem of SDKs, extensions, runners...

Summary: where Dataflow fits

Troublesome, scattered data >.<

From: https://whitelassiblog.files.wordpress.com/2010/09/postpaid-flow-basic.png

From: http://rsrit.com/blog/wp-content/uploads/2014/08/Automatically-detect-data-errors-and-inconsistencies-through-ETL-Tools.jpg

After adopting Dataflow, what did we gain?

Honey, I made the data simple~
