Using Embulk at Treasure Data

Muga Nishizawa (西澤 無我)

Uploaded by treasure-data-inc, posted 09-Jan-2017


TRANSCRIPT

Page 1: Using Embulk at Treasure Data

Muga Nishizawa (西澤 無我)

Using Embulk at Treasure Data

Page 2: Using Embulk at Treasure Data

Today’s talk

> What’s Embulk?
> Why do our customers use Embulk?
  > Embulk
  > Data Connector
> Data Connector
  > The architecture
  > The use case
> With MapReduce Executor
  > How we configure MapReduce Executor

Page 3: Using Embulk at Treasure Data

What’s Embulk?

> An open-source parallel bulk data loader
  > loads records from “A” to “B”
> using plugins
  > for various kinds of “A” and “B”
    (storage, RDBMS, NoSQL, cloud services, etc.)
> to make data integration easy
  > which was very painful: broken records, transactions (idempotency), performance, …

Page 4: Using Embulk at Treasure Data

[Diagram: Embulk bulk-loads data between systems such as HDFS, MySQL, Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, and Redis, via input and output plugins]

✓ Parallel execution ✓ Data validation ✓ Error recovery ✓ Deterministic behavior ✓ Resuming

Page 5: Using Embulk at Treasure Data

Why do our customers use Embulk?

> They upload various types of data to TD with Embulk
  > Various file formats: CSV, TSV, JSON, XML, …
  > Various data sources: local disk, RDBMS, SFTP, …
  > Various network environments
> embulk-output-td
  > https://github.com/treasure-data/embulk-output-td
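For the embulk-output-td path above, a minimal config sketch (file paths, database/table names, and the API key are hypothetical; option names follow the plugin’s README):

```yaml
in:
  type: file
  path_prefix: ./mydata/csv_          # hypothetical local CSV files
  parser:
    type: csv
    columns:
      - {name: time, type: timestamp, format: "%Y-%m-%d %H:%M:%S"}
      - {name: user, type: string}
out:
  type: td                            # embulk-output-td
  apikey: "YOUR_TD_API_KEY"
  endpoint: api.treasuredata.com
  database: my_db
  table: my_table
  auto_create_table: true
```

Run with `embulk run config.yml`.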

Page 6: Using Embulk at Treasure Data

Out of scope for Embulk

> Customers develop their own scripts for
  > generating Embulk configs
  > changing the schema on a regular basis
  > logic to select some files but not others
  > managing cron settings
    > e.g. some users want to upload yesterday’s data as a daily batch
> Embulk is just a “bulk loader”
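As a sketch of the daily-batch example above: Embulk expands Liquid templates in `.yml.liquid` configs, so a cron wrapper can export yesterday’s date in an environment variable that the config references (paths and names hypothetical):

```yaml
# daily_load.yml.liquid -- a cron entry might run:
#   TARGET_DATE=$(date -d yesterday +%Y-%m-%d) embulk run daily_load.yml.liquid
in:
  type: file
  path_prefix: /data/logs/{{ env.TARGET_DATE }}/access_   # selects yesterday's files
  parser:
    type: csv
out:
  type: stdout   # stand-in; a real job would use embulk-output-td here
```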

Page 7: Using Embulk at Treasure Data

Best practice to manage Embulk!!

http://www.slideshare.net/GONNakaTaka/embulk5

Page 8: Using Embulk at Treasure Data

Yes, yes,..


Page 9: Using Embulk at Treasure Data

Data Connector

[Diagram: Users/Customers submit connector jobs and see loaded data on Console; components: Guess/Preview API, Connector Worker, PlazmaDB]


Page 11: Using Embulk at Treasure Data

2 types of hosted Embulk service

> Import (Data Connector): MySQL, PostgreSQL, Redshift, AWS S3, Google Cloud Storage, SalesForce, Marketo, …etc
> Export (Result Output): MySQL, PostgreSQL, Redshift, BigQuery, …etc

Page 12: Using Embulk at Treasure Data

Guess/Preview API

[Architecture diagram as on page 9, highlighting the Guess/Preview API]

Page 13: Using Embulk at Treasure Data

Guess/Preview API

> Guesses Embulk config based on sample data
  > Creates parser config
    > Adds schema, escape char, quote char, etc.
  > Creates rename filter config
    > TD requires uncapitalized column names
> Previews data before uploading
> Ensures quick response
  > Embulk performs this functionality running on our web application servers
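The same guessing is exposed on the CLI as `embulk guess`: given a seed config naming only the input, it samples the data and writes a completed config (paths hypothetical):

```yaml
# seed.yml -- run: embulk guess seed.yml -o config.yml
# guess fills in parser details (charset, delimiter, quote/escape chars,
# column names and types) by sampling the input files
in:
  type: file
  path_prefix: ./sample_data/access_
out:
  type: stdout
```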

Page 14: Using Embulk at Treasure Data

Connector Worker

[Architecture diagram as on page 9, highlighting the Connector Worker]

Page 15: Using Embulk at Treasure Data

Connector Worker

> Generates the Embulk config and executes Embulk
> Uses a private output plugin instead of embulk-output-td to upload users’ data to PlazmaDB directly
  > Appropriate retry mechanism
> Embulk runs on our Job Queue clients

Page 16: Using Embulk at Treasure Data

Timestamp parsing

[Architecture diagram as on page 9]

Page 17: Using Embulk at Treasure Data

Timestamp parsing

> Implemented strptime in Java
  > Ported from the CRuby implementation
  > Can precompile the format
  > Faster than JRuby’s strptime
> Has been maintained in the Embulk repo obscurely…
  > It will be merged into JRuby
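The precompilation win can be illustrated with standard java.time as a stand-in for Embulk’s ported strptime (the port itself lives in the Embulk repo; this is only a sketch of the idea):

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class PrecompiledParse {
    // Compile the pattern once, outside the per-record loop -- the same idea
    // Embulk's Java strptime port applies to Ruby-style %Y-%m-%d formats.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    static long epochOf(String s) {
        return LocalDateTime.parse(s, FMT).toEpochSecond(ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        System.out.println(epochOf("2017-01-09 12:34:56")); // prints 1483965296
    }
}
```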

Page 18: Using Embulk at Treasure Data

How we use Data Connector at TD

> a. Monitoring access to our S3 buckets
  > e.g. “Which IAM users accessed our S3 buckets?”, “Access frequency”
  > {in: {type: s3}} and {parser: {type: csv}}
> b. Measuring KPIs for our development process
  > e.g. “phases where we spent a long time in the process”
  > {in: {type: jira}}
> c. Measuring business & support performance
  > {in: {type: Salesforce, Marketo, ZenDesk, …}}
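For use case (a), the {in: {type: s3}} + {parser: {type: csv}} combination sketched as a config (bucket, prefix, and credentials hypothetical; option names follow embulk-input-s3’s README):

```yaml
in:
  type: s3
  bucket: my-s3-access-logs
  path_prefix: logs/
  access_key_id: "XXXXXXXX"
  secret_access_key: "XXXXXXXX"
  parser:
    type: csv
out:
  type: stdout   # stand-in; at TD this goes to PlazmaDB via the private plugin
```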

Page 19: Using Embulk at Treasure Data

Scaling Embulk

> Requests from users for massive data loading
  > e.g. “Upload 150GB of data in an hourly batch”, “Start a PoC and upload 500GB of data today”
> The Local Executor cannot handle this scale
> The MapReduce Executor enables us to scale

Page 20: Using Embulk at Treasure Data

W/ MapReduce

[Architecture diagram as on page 9, with Hadoop clusters added]

Page 21: Using Embulk at Treasure Data

What’s MapReduce Executor?

[Diagram: tasks from the task queue run as Map tasks on Hadoop]

Page 22: Using Embulk at Treasure Data

MapReduce Executor with TimestampPartitioning

[Diagram: tasks from the task queue run on Hadoop as Map tasks, then Shuffle, then Reduce tasks]

Page 23: Using Embulk at Treasure Data

Built Embulk configs

exec:
  type: mapreduce
  job_name: embulk.100000
  config_files:
    - /etc/hadoop/conf/core-site.xml
    - /etc/hadoop/conf/hdfs-site.xml
    - /etc/hadoop/conf/mapred-site.xml
  config:
    fs.defaultFS: "hdfs://my-hdfs.example.net:8020"
    yarn.resourcemanager.hostname: "my-yarn.example.net"
    dfs.replication: 1
    mapreduce.client.submit.file.replication: 1
  state_path: /mnt/xxx/embulk/
  partitioning:
    type: timestamp
    unit: hour
    column: time
    unix_timestamp_unit: hour
  map_side_partition_split: 3
  reducers: 3
in: ...

Connector Workers (single-machine workers) are still able to generate this config.

Page 24: Using Embulk at Treasure Data

Different sized files

[Diagram: differently sized input files flowing through Map tasks, Shuffle, and Reduce tasks]

Page 25: Using Embulk at Treasure Data

Same time range data

[Diagram: records in the same time range flowing through Map tasks, Shuffle, and Reduce tasks]

Page 26: Using Embulk at Treasure Data

Grouping input files - {in: {min_task_size}}

[Diagram: multiple small input files grouped into fewer map tasks before Shuffle and Reduce]

It also reduces the mappers’ launch cost.
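As a config fragment, the grouping above would look like this (the option name is taken from the slide; the threshold value is hypothetical, in bytes):

```yaml
in:
  type: file
  path_prefix: /data/small_files/
  min_task_size: 268435456   # group small inputs into tasks of at least ~256 MB
```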

Page 27: Using Embulk at Treasure Data

One partition into multi-reducers - {exec: {partitioning: {map_side_split}}}

[Diagram: a single timestamp partition split on the map side across multiple reducers]

Page 28: Using Embulk at Treasure Data

Prototype of Console integration

Page 29: Using Embulk at Treasure Data


Page 30: Using Embulk at Treasure Data


Page 31: Using Embulk at Treasure Data

Conclusion

> What’s Embulk?
> Why do we use Embulk?
  > Embulk
  > Data Connector
> Data Connector
  > The architecture of Data Connector
  > The use case
> With MapReduce Executor