Rakuten Technology Conference 2017
A Distributed SQL Database For Data Analysis, Astra Project
2017-10-28
Yosuke Hara (原 陽亮), Rakuten Institute of Technology, Rakuten, Inc.
rev. 1.0.5



R&D Projects

- Skylab: A Microservices Framework
- LeoFS: A Distributed Storage
- Astra: A Distributed SQL Database For Data Analytics

Introducing Astra

* “Astra” is the code name of a product under development

Background

More “Connected Things” In The World
Consumer Applications to Represent 63% of Total IoT Applications in 2017

IoT Units Installed Base by Category (Millions of Units)

 Category                    |  2016   |  2017   |  2018   |   2020
-----------------------------+---------+---------+---------+----------
 Consumer                    | 3,963   | 5,244.3 | 7,036.3 | 12,863
 Business: Cross-Industry    | 1,102.1 | 1,501   | 2,132.6 |  4,381.4
 Business: Vertical-Specific | 1,316.6 | 1,635.4 | 2,027.7 |  3,171
 Total                       | 6.4B    | 8.4B    | 11.2B   | 20.4B

2017 total: 8.4B units, +31% over 2016
2017 share: Consumer 63%, Business: Cross-Industry 18%, Business: Vertical-Specific 19%

Source: Gartner (January 2017)

Initial Concept

Providing A Database With Which Anyone Can Analyze Data

Provides Components of DataLake as a Service

[Figure: DataLake components. Labels: Self-Service Analytics, Data Science, DataLake, Data Governance, Job Scheduler, Distributed Computing, Data Store, Astra, Skylab, Spark, Hadoop]

Current Concept: Advanced Data Analysis In Semi-Realtime At Low Cost

[Figure: flow from Data to Intelligence to Action]
- Streaming Data, Un/Semi-Structured Data
- Store Data Into Astra
- Aggregate and Analyze Data, Find Insights
- Tools / Apps, Automated Systems

Current Concept: Depends on Single Source Of Truth

Astra’s Components
- Self-Service Analytics
- Data Governance
- Distributed Computing For Massive-Parallel Processing
- Distributed Database For Aggregation and Analysis
- Distributed Storage (DataLake Store)

Features

[Figure: Unified Components for In-place Analysis. Labels as shown on the slide:]
- SQL Engine
- Data Science
- Database
- Data Store
- Ad-hoc Query
- Analysis Functions On The Distributed Computing
- Reliability, Scalability, and Massive Parallel Processing
- Various Data Without Limit

The Features - ANSI SQL99 Standard

Conforms To ANSI SQL99 Standard
• Communication With Any BI / Data Visualization Tools, and Apps
• Able To Call All Astra’s Functions, UDFs and ML With SQL

astra:test> SELECT workclass, COUNT(income)
         -> AS income_count
         -> FROM adult_income
         -> WHERE income = '<=50K'
         -> GROUP BY workclass
         -> ORDER BY workclass;
    workclass     | income_count
------------------+--------------
 ?                |         2534
 Federal-gov      |          871
 Local-gov        |         2209
 Never-worked     |           10
 Private          |        26519
 Self-emp-inc     |          757
 Self-emp-not-inc |         2785
 State-gov        |         1451
 Without-pay      |           19
(9 rows)
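The aggregation above is plain standard SQL, so it can be tried outside Astra. The following is a minimal sketch that runs the same statement against Python's built-in SQLite, used here only as a stand-in engine. The sample rows and the helper name `income_counts` are hypothetical and are not taken from the slides.

```python
import sqlite3

# Hypothetical sample rows shaped like the adult_income table (not the real data).
ROWS = [
    ("Private", "<=50K"),
    ("Private", ">50K"),
    ("Private", "<=50K"),
    ("State-gov", "<=50K"),
    ("Local-gov", ">50K"),
]

# Same statement as on the slide, written without the CLI continuation prompts.
QUERY = """
SELECT workclass, COUNT(income) AS income_count
FROM adult_income
WHERE income = '<=50K'
GROUP BY workclass
ORDER BY workclass
"""


def income_counts(rows):
    """Load rows into an in-memory SQLite table and run the aggregation."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE adult_income (workclass TEXT, income TEXT)")
        conn.executemany("INSERT INTO adult_income VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for workclass, count in income_counts(ROWS):
        print(f"{workclass:<12} | {count}")
```

Because the statement uses only standard clauses, the same text should be usable from any client that speaks SQL to the database, which is the point the slide makes about BI tools.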

The Features - Advanced Data Analytics

Advanced Data Analytics On The Distributed Computing, Massive-Parallel Processing
• Built-In Analysis Functions and UDF
• Machine Learning

[Figure: analysis cycle with feedback]
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment

Able To Repeat Trial And Error w/o Limit

The Features - Availability and Scalability

High Availability
• Automated Data Replication And Recovery, and Failover

High Scalability
• An Elastic Cluster - Nodes That Can Flexibly Attach And Detach

[Figure: cluster overview]
- Clients send a Request and receive a Response over HTTP
- Coordinator(s): Monitoring Resources, Scheduling Jobs, Requesting Jobs
- Workers: Message with Gossip Protocol
- Circuit Breaker (Figure: Akka Circuit breaker)

* Circuit Breaker: martinfowler.com/bliki/CircuitBreaker.html
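The circuit breaker pattern referenced on this slide can be sketched in a few lines. This is a generic illustration of the pattern as described in the linked article, not Astra's or Akka's implementation. The class name, the thresholds, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Generic circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, max_failures=3, reset_timeout=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # one trial call is allowed through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("target is unavailable, failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Trip on too many failures, or re-trip after a failed trial call.
            if self.failures >= self.max_failures or self.opened_at is not None:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller such as a coordinator would wrap each request to a worker in `breaker.call(...)`, so that a failing worker is skipped quickly instead of holding up a job. That placement is an assumption, since the slide does not say where the breaker sits.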

Architecture


High-level Architecture

Entry points
- Clients: SQL over ODBC/JDBC
- Astra CLI: Original Data Load, Operate Astra

Layers
- SQL Engine Workers: AstraSQL
- Database Layer: AstraBase (Multi-Coordinator)
- Data Store Layer: Astra DataStore

Responsibilities shown on the diagram
- Original Data, Semi-Structured Data, Cold Data
- Columnar Tables
- Metadata Store, Record Operation, Record Set Cache (Hot Data), Distributed Computing
- Data Analysis, Data Converter (Semi-Structured Data To Columnar Table)

LeoFS For Astra DataStore

LeoFS is a software-defined storage (SDS) for DataLake and Web.

LeoFS is an Enterprise Open Source Storage: a highly available, distributed, eventually consistent object/blob store.

Goals:
- High Availability
- High Cost Performance Ratio
- High Scalability

Processing Flow - Store a CSV file, Then Query Data

[Store a CSV File]
1-1. Put Original Data w/AstraCLI
1-2. Create Metadata
2. Store the Data and Metadata

[Execute Query]
3. Query Data For Aggregation Or Data Analysis

[Convert Data Format At Async]
4. Request Converting Data Format of a Table
5. Convert Data Format of a Table and Change Table’s Metadata
6. Store Converted Data

[Figure: components AstraCLI, AstraSQL, AstraBase Coordinator(s), AstraBase Workers, Resource Monitor + Scheduler, Astra DataStore (LeoFS); protocols REST-API, gRPC, S3-API, ODBC/JDBC]
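The store, query, and convert steps above can be illustrated with a small in-memory model. This is a sketch under stated assumptions: `TinyAstra` and its methods are hypothetical names, a dict stands in for the data store, and the conversion runs synchronously here although the slide describes it as asynchronous.

```python
import csv
import io


class TinyAstra:
    """Hypothetical in-memory stand-in for the store/query/convert flow."""

    def __init__(self):
        self.datastore = {}  # object key -> stored payload (plays the LeoFS role)
        self.metadata = {}   # table name -> {"columns", "format", "key"}

    def put_csv(self, table, text):
        # Steps 1-1, 1-2, 2: keep the original data and record its metadata.
        columns = next(csv.reader(io.StringIO(text)))
        key = f"{table}/original.csv"
        self.datastore[key] = text
        self.metadata[table] = {"columns": columns, "format": "csv", "key": key}

    def convert(self, table):
        # Steps 4-6: rewrite the rows column by column, then switch the metadata.
        meta = self.metadata[table]
        rows = list(csv.DictReader(io.StringIO(self.datastore[meta["key"]])))
        key = f"{table}/columnar"
        self.datastore[key] = {c: [r[c] for r in rows] for c in meta["columns"]}
        self.metadata[table] = {**meta, "format": "columnar", "key": key}

    def column(self, table, name):
        # Step 3: read one column from whichever format the metadata points at.
        meta = self.metadata[table]
        payload = self.datastore[meta["key"]]
        if meta["format"] == "columnar":
            return list(payload[name])
        return [r[name] for r in csv.DictReader(io.StringIO(payload))]


if __name__ == "__main__":
    astra = TinyAstra()
    astra.put_csv("adult_income", "workclass,income\nPrivate,<=50K\nState-gov,>50K\n")
    print(astra.column("adult_income", "workclass"))  # answered from the CSV
    astra.convert("adult_income")
    print(astra.column("adult_income", "workclass"))  # answered from the columnar copy
```

The design point the sketch tries to capture is that the query path consults the table metadata first, so a table stays queryable both before and after its format changes.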

Processing Flow - Query for Advanced Analysis

[Execute Query]
1. Execute SQL For Data Analysis
2-1. Request Data Analysis to AstraBase
2-2. Request Message to AstraBase’s Workers

[Retrieve Records]
3-1. Retrieve Target Records from the Cache
3-2. Retrieve Target Records From LeoFS (Cache Miss)
4. Process Data Analysis in Parallel

[Reply]
5. Reply To AstraBase Coordinator, Then Summarize the Result on the Coordinator

[Figure: components AstraSQL, AstraBase Coordinator(s), AstraBase Workers, Resource Monitor + Scheduler, Astra DataStore (LeoFS); protocols gRPC, S3-API, ODBC/JDBC]
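The query flow above follows a scatter-gather shape: fan work out to workers, let each read from its cache or fall back to the store, then merge the partial results. The sketch below illustrates that shape only. `Worker`, `run_query`, and `COLD_STORE` are hypothetical names, threads stand in for remote workers, and a frequency count stands in for the analysis function.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cold store: chunk id -> records (plays the LeoFS role).
COLD_STORE = {
    "chunk-0": ["Private", "Private", "State-gov"],
    "chunk-1": ["Local-gov", "Private"],
}


class Worker:
    def __init__(self, store):
        self.store = store
        self.cache = {}        # record set cache (hot data)
        self.cache_misses = 0

    def records(self, chunk_id):
        # Step 3-1: try the cache; step 3-2: fall back to the store on a miss.
        if chunk_id not in self.cache:
            self.cache_misses += 1
            self.cache[chunk_id] = self.store[chunk_id]
        return self.cache[chunk_id]

    def analyze(self, chunk_id):
        # Step 4: each worker computes a partial result for its chunk.
        return Counter(self.records(chunk_id))


def run_query(workers, chunk_ids):
    # Steps 2-2 and 5: fan the chunks out, then merge partials on the coordinator.
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [
            pool.submit(workers[i % len(workers)].analyze, chunk_id)
            for i, chunk_id in enumerate(chunk_ids)
        ]
        total = Counter()
        for future in futures:
            total += future.result()
    return dict(total)
```

Running the same query twice against the same workers shows the effect of the cache: the second run is served without touching the store.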

1. Data Load

Store Files Into Astra (Original Data, Semi-Structured Files)
- Data Validation and Data Verification
- Data Type Inference
- Partition Into Multiple Chunks
- Store Chunks and Metadata

2. Data Conversion At Async

CSV / TSV / JSON To Parquet / CarbonData SerDes
- Handles Multiple Data Formats In A Table
- Self-Service Analytics Remains Possible Even During Data Conversion
- Data Is Partitioned By A Condition On A Specified Column
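Partitioning by a column and splitting into chunks can be sketched as follows. The slides do not say what kind of condition is applied to the column, so this example assumes the simplest case, grouping by equal values. The function name `partition_rows` and the `chunk_size` parameter are hypothetical.

```python
from collections import defaultdict


def partition_rows(rows, column, chunk_size):
    """Group rows by the value of `column`, then split each group into chunks."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    chunks = {}
    for value, members in groups.items():
        # Fixed-size slices; the last chunk of a group may be shorter.
        chunks[value] = [
            members[i:i + chunk_size] for i in range(0, len(members), chunk_size)
        ]
    return chunks


if __name__ == "__main__":
    sample = [
        {"workclass": "Private", "age": 39},
        {"workclass": "Private", "age": 50},
        {"workclass": "Private", "age": 28},
        {"workclass": "State-gov", "age": 44},
    ]
    for value, parts in partition_rows(sample, "workclass", chunk_size=2).items():
        print(value, [len(part) for part in parts])
```

Keeping each partition in its own set of chunks means a query that filters on the partition column only needs to read the matching chunks.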

Data Storage

Supports Data Format and SerDes
- CSV, TSV, and Custom Delimiter Files
- JSON
- RegEx SerDes for Unstructured Data
- Parquet SerDes (A Columnar Storage Format)
- CarbonData SerDes (A Columnar Storage Format)

Supports Compression Methods
- SNAPPY
- ZLIB
- GZIP
- LZO
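As a rough illustration of selectable compression methods, the sketch below keeps a small codec table and round-trips a payload through each entry. It covers only ZLIB and GZIP, which ship with Python's standard library. SNAPPY and LZO are left out because they need third-party packages. The `CODECS` table and `roundtrip` helper are hypothetical names, not part of Astra.

```python
import gzip
import zlib

# Codec name -> (compress, decompress). Only standard-library codecs are shown.
CODECS = {
    "ZLIB": (zlib.compress, zlib.decompress),
    "GZIP": (gzip.compress, gzip.decompress),
}


def roundtrip(codec, payload):
    """Compress and restore `payload`, returning (compressed size, restored bytes)."""
    compress, decompress = CODECS[codec]
    packed = compress(payload)
    return len(packed), decompress(packed)


if __name__ == "__main__":
    chunk = b"Private,<=50K\n" * 1000  # repetitive data compresses well
    for name in CODECS:
        size, restored = roundtrip(name, chunk)
        print(name, len(chunk), "->", size, restored == chunk)
```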

Supports Multiple Data Formats And SerDes

An Example of METADATA as JSON

[Figure: JSON metadata for one table. Callouts: Table Schema, Parquet Format, CSV Format, Data Type Inference, Stores Each File Into Astra Data Store, LeoFS]
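Data type inference for delimited text can be approximated by trying progressively looser parsers on each column. The sketch below is one simple way to do that and is not Astra's actual rule set. The type names `integer` and `varchar` match the DESCRIBE output shown elsewhere in this deck, while `double` and the function names are assumptions.

```python
def _parses(values, cast):
    """Return True when every value can be converted with `cast`."""
    try:
        for value in values:
            cast(value)
    except ValueError:
        return False
    return True


def infer_type(values):
    """Infer a column type from string values, ignoring empty cells."""
    present = [v for v in values if v.strip()]
    if not present:
        return "varchar"
    if _parses(present, int):
        return "integer"
    if _parses(present, float):
        return "double"
    return "varchar"


def infer_schema(header, rows):
    """Map each column name to the type inferred from its values."""
    return {
        name: infer_type([row[i] for row in rows])
        for i, name in enumerate(header)
    }


if __name__ == "__main__":
    header = ["age", "workclass", "score"]
    rows = [["39", "Private", "1.5"], ["50", "?", "2"]]
    print(infer_schema(header, rows))
```

A single non-numeric cell, such as a `?` placeholder, is enough to make a column fall back to `varchar` under this rule, which is one reason inferred schemas are usually reviewed after loading.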

Machine Learning on Astra - Modeling

[Request A Modeling]
1. Request A Modeling To An Initiator Of AstraBase

[Create A Model, Then Store It]
2. Generate Tasks From A Job On A Coordinator
3. Request A Task To Workers
4-1. Execute Function(s) In Parallel On Each Worker
4-2. Load Data From Data Store If It Is Not In The Cache
5. Summarize The Result On A Coordinator, Then Store The Model Into The Cluster To Reuse

[Figure: components AstraSQL, AstraBase Coordinator(s), AstraBase Workers, Resource Monitor + Scheduler, Astra DataStore (LeoFS); protocols gRPC, S3-API, ODBC/JDBC]
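The modeling steps above (split a job into tasks, run them in parallel, merge on the coordinator, store the model) can be illustrated with a small example. The slides do not name an algorithm, so this sketch assumes simple least-squares line fitting, chosen because its partial results merge by plain addition. `MODEL_STORE`, `partial_sums`, and `fit_line` are hypothetical names, and threads stand in for workers.

```python
from concurrent.futures import ThreadPoolExecutor

MODEL_STORE = {}  # hypothetical stand-in for "store the model into the cluster"


def partial_sums(points):
    # Step 4-1: one worker reduces its task to sufficient statistics.
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    return n, sx, sy, sxx, sxy


def fit_line(name, points, num_tasks=2):
    # Step 2: split the job into tasks; step 3: hand the tasks to workers.
    tasks = [points[i::num_tasks] for i in range(num_tasks)]
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        partials = list(pool.map(partial_sums, tasks))
    # Step 5: merge partial results on the coordinator and solve least squares.
    n, sx, sy, sxx, sxy = (sum(col) for col in zip(*partials))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    # Keep the fitted model under a name so later requests can reuse it.
    MODEL_STORE[name] = {"slope": slope, "intercept": intercept}
    return MODEL_STORE[name]


if __name__ == "__main__":
    print(fit_line("demo", [(0, 1), (1, 3), (2, 5), (3, 7)]))
```

The input needs at least two distinct x values, otherwise the slope is undefined and the division fails.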

Integration With BI Tools

Integration With Tableau (BI Tool)

astra:test> DESCRIBE adult_income
         -> ;
     Column      |  Type   | Extra | Comment
-----------------+---------+-------+---------
 age             | integer |       |
 workclass       | varchar |       |
 fnlwgt          | integer |       |
 education       | varchar |       |
 educational-num | integer |       |
 marital-status  | varchar |       |
 occupation      | varchar |       |
 relationship    | varchar |       |
 race            | varchar |       |
 gender          | varchar |       |
 capital-gain    | integer |       |
 capital-loss    | integer |       |
 hours-per-week  | varchar |       |
 native-country  | varchar |       |
 income          | varchar |       |
(15 rows)

astra:test> SELECT workclass, COUNT(income)
         -> as income_count
         -> FROM adult_income
         -> WHERE income = '<=50K'
         -> GROUP BY workclass
         -> ORDER BY workclass;
    workclass     | income_count
------------------+--------------
 ?                |         2534
 Federal-gov      |          871
 Local-gov        |         2209
 Never-worked     |           10
 Private          |        26519
 Self-emp-inc     |          757
 Self-emp-not-inc |         2785
 State-gov        |         1451
 Without-pay      |           19
(9 rows)

Visualizing Data With 3rd Party Tools

Communicates With Data Visualization And BI Tools
- Dundas BI
- Qlik Sense
- Microsoft Power BI

Future Plans


Schedule: By end of Oct 2017 | Nov 2017 - end of June 2018 | Q3 2018
Milestones: Alpha | 1st Beta | 2nd Beta | Publish It

Alpha
- Un/Semi-Structured Data and Parquet SerDes Support
- BI Tools and Visualization Tools Integration

1st Beta, Step-Growth Phase
- Record Set Cache
- Distributed Computing For UDF and ML
- Other SerDes Support

THANK YOU