Rakuten Technology Conference 2017
A Distributed SQL Database For Data Analysis, Astra Project
2017-10-28
Yosuke Hara (原 陽亮)
Rakuten Institute of Technology, Rakuten, Inc.
rev. 1.0.5
R&D Projects
- Skylab: A Microservices Framework
- LeoFS: A Distributed Storage
- Astra: A Distributed SQL Database For Data Analytics
Background
More “Connected Things” In The World
Consumer Applications to Represent 63% of Total IoT Applications in 2017

IoT Units Installed Base by Category (Millions of Units)

 Category                    |  2016   |  2017   |  2018   |  2020
-----------------------------+---------+---------+---------+---------
 Consumer                    | 3,963   | 5,244.3 | 7,036.3 | 12,863
 Business: Cross-Industry    | 1,102.1 | 1,501   | 2,132.6 | 4,381.4
 Business: Vertical-Specific | 1,316.6 | 1,635.4 | 2,027.7 | 3,171
 Total                       | 6.4B    | 8.4B    | 11.2B   | 20.4B

2017 share: Consumer 63%, Business: Cross-Industry 18%, Business: Vertical-Specific 19%
2017 total: +31% over 2016
Source: Gartner (January 2017)
Initial Concept
Provides Components of DataLake as a Service
[Figure: DataLake components: Self-Service Analytics, Data Science, Data Governance, Job Scheduler, Distributed Computing, Data Store; products shown: Astra, Skylab, Spark, Hadoop]
Current Concept
Advanced Data Analysis In Semi-Realtime At Low Cost
- Store Data Into Astra: Streaming Data, Un/Semi-Structured Data
- Aggregate and Analyze Data, Find Insights
- Data, Intelligence, Action: Tools / Apps, Automated Systems
Current Concept: Depends on Single Source Of Truth
- Self-Service Analytics
- Data Governance
- Distributed Computing For Massive-Parallel Processing
- Distributed Database For Aggregation and Analysis
- Distributed Storage (DataLake Store)
Astra’s Components
In-place Analysis
Components: SQL Engine, Data Science, Database, Data Store
- Ad-hoc Query
- Analysis Functions On The Distributed Computing
- Reliability, Scalability, and Massive Parallel Processing
- Various Data Without Limit
Unified Components
The Features - ANSI SQL99 Standard
Conforms To ANSI SQL99 Standard
• Communication With Any BI / Data Visualization Tools, and Apps
• Able To Call All Astra’s Functions, UDFs and ML With SQL

astra:test> SELECT workclass, COUNT(income)
         -> AS income_count
         -> FROM adult_income
         -> WHERE income = '<=50K'
         -> GROUP BY workclass
         -> ORDER BY workclass;
    workclass     | income_count
------------------+--------------
 ?                |         2534
 Federal-gov      |          871
 Local-gov        |         2209
 Never-worked     |           10
 Private          |        26519
 Self-emp-inc     |          757
 Self-emp-not-inc |         2785
 State-gov        |         1451
 Without-pay      |           19
(9 rows)
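The same kind of aggregation can be tried locally. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Astra (it does not connect to Astra), with a handful of made-up rows instead of the real adult_income table. Only the table name, column names, and query shape come from the slide.

```python
import sqlite3

# Hypothetical sample rows; the real adult_income table is not reproduced here.
rows = [
    ("Private", "<=50K"),
    ("Private", ">50K"),
    ("Private", "<=50K"),
    ("State-gov", "<=50K"),
    ("Local-gov", ">50K"),
]

conn = sqlite3.connect(":memory:")  # SQLite stands in for Astra
conn.execute("CREATE TABLE adult_income (workclass TEXT, income TEXT)")
conn.executemany("INSERT INTO adult_income VALUES (?, ?)", rows)

# Same query shape as on the slide: filter, group, count, order.
query = """
    SELECT workclass, COUNT(income) AS income_count
    FROM adult_income
    WHERE income = '<=50K'
    GROUP BY workclass
    ORDER BY workclass
"""
result = conn.execute(query).fetchall()
for workclass, income_count in result:
    print(f"{workclass:<12} | {income_count}")
```

Because the query is plain SQL, a BI tool or application could issue the same statement over ODBC/JDBC without any Astra-specific syntax.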
The Features - Advanced Data Analytics
Advanced Data Analytics On The Distributed Computing, Massive-Parallel Processing
• Built-In Analysis Functions and UDF
• Machine Learning
[Figure: analysis cycle: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment, with Feedback]
Able To Repeat Trial And Error w/o Limit
The Features - Availability and Scalability
High Availability
• Automated Data Replication And Recovery, and Failover
High Scalability
• An Elastic Cluster - Nodes That Can Flexibly Attach And Detach
[Figure: cluster architecture: Clients, Coordinator(s), and Workers; Request / Response over HTTP; Message with Gossip Protocol; Monitoring Resources, Scheduling Jobs, Requesting Jobs; Circuit Breaker (figure: Akka circuit breaker)]
* Circuit Breaker: martinfowler.com/bliki/CircuitBreaker.html
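The circuit breaker pattern referenced on this slide can be sketched in a few lines. This is an illustrative Python version of the closed / open / half-open state machine, not Astra's or Akka's implementation. The class name, parameters, and the `flaky_worker` function are all hypothetical.

```python
import time


class CircuitBreaker:
    """Minimal closed / open / half-open circuit breaker (illustrative only)."""

    def __init__(self, max_failures=3, reset_timeout=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # one trial call is allowed through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise RuntimeError("circuit open: request rejected without calling worker")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Trip on repeated failures, or immediately if a half-open trial fails.
            if self.failures >= self.max_failures or self.opened_at is not None:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky_worker():
    # Hypothetical worker call that always fails.
    raise ConnectionError("worker unreachable")


breaker = CircuitBreaker(max_failures=2, reset_timeout=60.0)
for _ in range(2):
    try:
        breaker.call(flaky_worker)
    except ConnectionError:
        pass
print(breaker.state)
```

Once the breaker is open, further job requests fail fast instead of waiting on an unresponsive worker, which is the behavior the pattern is meant to provide.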
High-level Architecture
Clients:
- Astra CLI: Original Data Load, Operate Astra
- Clients: SQL over ODBC/JDBC
Layers:
- SQL Engine: AstraSQL
- Database Layer: AstraBase (Multi-Coordinator, Workers)
- Data Store Layer: Astra DataStore
Functions shown in the diagram:
- Original Data, Semi-Structured Data, Cold Data
- Columnar Tables
- Metadata Store, Record Operation, Record Set Cache (Hot Data), Distributed Computing
- Data Analysis, Data Converter (Semi-Structured Data To Columnar Table)
LeoFS For Astra DataStore
LeoFS is a software-defined storage (SDS) for DataLake and Web.
LeoFS is an Enterprise Open Source Storage: a highly available, distributed, eventually consistent object/blob store.
Goals:
- High Availability
- High Cost Performance Ratio
- High Scalability
Processing Flow - Store a CSV File, Then Query Data
[Store a CSV File]
1-1. Put Original Data w/AstraCLI
1-2. Create Metadata
2. Store the Data and Metadata
[Execute Query]
3. Query Data For Aggregation Or Data Analysis
[Convert Data Format At Async]
4. Request Converting Data Format of a Table
5. Convert Data Format of a Table and Change Table’s Metadata
6. Store Converted Data
[Figure: AstraCLI (REST-API, S3-API), ODBC/JDBC clients, AstraSQL, AstraBase Coordinator(s) with Resource Monitor + Scheduler, AstraBase Workers, and Astra DataStore (LeoFS), connected over gRPC and S3-API]
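The store, query, and convert steps can be simulated in memory. The sketch below assumes a dict as a stand-in for Astra DataStore (LeoFS), a second dict for table metadata, and a JSON dict of column lists in place of a real columnar format such as Parquet. All function names, keys, and metadata fields are hypothetical.

```python
import csv
import io
import json

object_store = {}   # stands in for Astra DataStore (LeoFS)
metadata = {}       # stands in for the metadata kept per table


def put_csv(table, csv_text):
    """Steps 1-1, 1-2, 2: store the original file and record its metadata."""
    key = f"{table}/original.csv"
    object_store[key] = csv_text
    header = next(csv.reader(io.StringIO(csv_text)))
    metadata[table] = {"columns": header, "format": "csv", "key": key}


def read_rows(table):
    """Step 3: read rows using whatever format the metadata points to."""
    meta = metadata[table]
    data = object_store[meta["key"]]
    if meta["format"] == "csv":
        return list(csv.DictReader(io.StringIO(data)))
    # "columnar" here is a plain dict of column lists, not real Parquet.
    columns = json.loads(data)
    names = meta["columns"]
    return [dict(zip(names, values)) for values in zip(*(columns[n] for n in names))]


def convert_table(table):
    """Steps 4-6: convert to a columnar layout, store it, then switch metadata."""
    rows = read_rows(table)
    names = metadata[table]["columns"]
    columnar = {name: [row[name] for row in rows] for name in names}
    key = f"{table}/converted.columnar.json"
    object_store[key] = json.dumps(columnar)
    metadata[table] = {"columns": names, "format": "columnar", "key": key}


put_csv("adult_income", "workclass,income\nPrivate,<=50K\nState-gov,>50K\n")
before = read_rows("adult_income")   # query works on the original CSV
convert_table("adult_income")        # in Astra this step runs asynchronously
after = read_rows("adult_income")    # same rows, now read from the converted data
print(json.dumps(metadata["adult_income"], indent=2))
```

Switching the metadata only after the converted data is stored is what lets queries keep working on the original file while the conversion is still in progress.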
Processing Flow - Query for Advanced Analysis
[Execute Query]
1. Execute SQL For Data Analysis
2-1. Request Data Analysis to AstraBase
2-2. Request Message to AstraBase’s Workers
[Retrieve Records]
3-1. Retrieve Target Records from the Cache
3-2. Retrieve Target Records From LeoFS (Cache Miss)
4. Process Data Analysis in Parallel
[Reply]
5. Reply To AstraBase Coordinator, Then Summarize the Result on the Coordinator
[Figure: ODBC/JDBC clients, AstraSQL, AstraBase Coordinator(s) with Resource Monitor + Scheduler, AstraBase Workers, and Astra DataStore (LeoFS), connected over gRPC and S3-API]
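The query flow above is a scatter-gather with a cache in front of the data store. A minimal sketch, assuming threads in place of AstraBase workers and dicts in place of the Record Set Cache and LeoFS, follows. The chunk ids, data, and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Stand-in for Astra DataStore (LeoFS): chunk id -> records (hypothetical data).
DATA_STORE = {
    "chunk-0": [("Private", "<=50K"), ("State-gov", "<=50K")],
    "chunk-1": [("Private", "<=50K"), ("Private", ">50K")],
}
record_cache = {}  # stand-in for the Record Set Cache (hot data)


def fetch_records(chunk_id):
    """Step 3-1 / 3-2: read from the cache, falling back to the data store."""
    if chunk_id not in record_cache:
        record_cache[chunk_id] = DATA_STORE[chunk_id]  # cache miss
    return record_cache[chunk_id]


def worker_task(chunk_id):
    """Step 4: each worker computes a partial count for its chunk."""
    return Counter(wc for wc, income in fetch_records(chunk_id) if income == "<=50K")


def coordinator(chunk_ids):
    """Steps 2-2 and 5: fan tasks out to workers, then merge partial results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(worker_task, chunk_ids))
    return sum(partials, Counter())


totals = coordinator(sorted(DATA_STORE))
print(dict(totals))
```

Counts are a convenient example because partial results merge by simple addition, so the coordinator only has to sum what the workers return.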
1. Data Load
- Store Files Into Astra (Original Data, Semi-Structured Files)
- Data Validation, Data Verification
- Data Type Inference
- Store Chunks and Metadata
- Partition Into Multiple Chunks, To Handle Multiple Data Formats In A Table

2. Data Conversion At Async
- CSV / TSV / JSON To Parquet / CarbonData SerDes
- Data is partitioned by a condition on a specified column
- Able To Run Self-Service Analytics Even During Data Conversion
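Data type inference and partitioning by a column can be illustrated on a small CSV sample. This is a minimal sketch of one plausible approach, not Astra's actual rules: the type names "integer" and "varchar" follow the DESCRIBE output in this deck, while "double", the function names, and the sample data are assumptions.

```python
import csv
import io
from collections import defaultdict


def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    for type_name, caster in (("integer", int), ("double", float)):
        try:
            for value in values:
                if value != "":
                    caster(value)
            return type_name
        except ValueError:
            continue
    return "varchar"


def load_csv(text, partition_column):
    """Infer a schema, then split rows into chunks keyed by one column's value."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = rows[0].keys() if rows else []
    schema = {name: infer_type([row[name] for row in rows]) for name in columns}
    chunks = defaultdict(list)
    for row in rows:
        chunks[row[partition_column]].append(row)
    return schema, dict(chunks)


sample = "age,workclass,hours\n39,State-gov,40.5\n50,Private,13\n38,Private,40\n"
schema, chunks = load_csv(sample, partition_column="workclass")
print(schema)
print({key: len(rows) for key, rows in chunks.items()})
```

Trying the narrowest type first and widening on failure keeps a column such as `hours` numeric even when only some of its values have a fractional part.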
Data Storage
Supports Data Format and SerDes
- CSV, TSV, and Custom Delimiter Files
- JSON
- RegEx SerDes for Unstructured Data
- Parquet SerDes (A Columnar Storage Format)
- CarbonData SerDes (A Columnar Storage Format)
Supports Compression Methods
- SNAPPY
- ZLIB
- GZIP
- LZO
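A RegEx SerDes maps each raw text line to columns through named capture groups. The sketch below shows the idea with Python's re module. The log layout, pattern, and column names are made up for illustration and say nothing about how Astra configures its SerDes.

```python
import re

# Hypothetical access-log layout; the pattern and column names are assumptions.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)" (?P<status>\d{3})'
)


def deserialize(line):
    """Turn one raw text line into a row dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])  # cast the one numeric column
    return row


lines = [
    '10.0.0.1 [28/Oct/2017:10:00:00] "GET /index.html" 200',
    "not a log line",
]
rows = [row for row in map(deserialize, lines) if row is not None]
print(rows)
```

Lines that do not match are skipped here. A real loader would more likely route them to the validation step rather than drop them silently.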
Supports Multiple Data Formats And SerDes
[Figure: An Example of METADATA as JSON: Table Schema, Parquet Format, CSV Format]
Data Type Inference
Stores Each File Into Astra DataStore (LeoFS)
[Figure: ODBC/JDBC clients, AstraSQL, AstraBase Coordinator(s), AstraBase, and Astra DataStore (LeoFS), connected over gRPC; steps 1, 2, 3, and 5 are marked]
Machine Learning on Astra - Modeling
[Request A Modeling]
1. Request A Modeling To An Initiator Of AstraBase
[Create A Model, Then Store It]
2. Generate Tasks From A Job On A Coordinator
3. Request A Task To Workers
4-1. Execute Function(s) In Parallel On Each Worker
4-2. Load Data From Data Store If Not Exists On Cache
5. Summarize The Result On A Coordinator Then Store The Model Into The Cluster To Reuse
[Figure: AstraBase Coordinator(s) with Resource Monitor + Scheduler, AstraBase Workers, and the data store, connected over gRPC and S3-API]
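The modeling steps follow a split, compute, and merge shape. As a minimal sketch, assume the model is a one-variable least-squares line: each worker reduces its partition to a few sums, and the coordinator combines the sums into the model and stores it. The algorithm choice, data, and names are illustrative assumptions. The slides do not say which models Astra trains this way.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of (x, y) training pairs, one per worker.
PARTITIONS = [
    [(1.0, 3.0), (2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]
model_store = {}  # stand-in for storing the fitted model in the cluster


def partial_stats(pairs):
    """Step 4-1: a worker reduces its partition to sufficient statistics."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    return n, sx, sy, sxx, sxy


def fit_linear_model(name, partitions):
    """Steps 2, 3, 5: create one task per partition, merge, then store the model."""
    with ThreadPoolExecutor() as pool:
        stats = list(pool.map(partial_stats, partitions))
    # Element-wise sum of the per-worker tuples.
    n, sx, sy, sxx, sxy = (sum(column) for column in zip(*stats))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    model_store[name] = {"slope": slope, "intercept": intercept}
    return model_store[name]


model = fit_linear_model("demo_model", PARTITIONS)
print(model)
```

Only the five sums travel from each worker to the coordinator, so the amount of data exchanged does not grow with the partition size.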
Integration With Tableau (BI Tool)
astra:test> DESCRIBE adult_income
         -> ;
     Column      |  Type   | Extra | Comment
-----------------+---------+-------+---------
 age             | integer |       |
 workclass       | varchar |       |
 fnlwgt          | integer |       |
 education       | varchar |       |
 educational-num | integer |       |
 marital-status  | varchar |       |
 occupation      | varchar |       |
 relationship    | varchar |       |
 race            | varchar |       |
 gender          | varchar |       |
 capital-gain    | integer |       |
 capital-loss    | integer |       |
 hours-per-week  | varchar |       |
 native-country  | varchar |       |
 income          | varchar |       |
(15 rows)

astra:test> SELECT workclass, COUNT(income)
         -> as income_count
         -> FROM adult_income
         -> WHERE income = '<=50K'
         -> GROUP BY workclass
         -> ORDER BY workclass;
    workclass     | income_count
------------------+--------------
 ?                |         2534
 Federal-gov      |          871
 Local-gov        |         2209
 Never-worked     |           10
 Private          |        26519
 Self-emp-inc     |          757
 Self-emp-not-inc |         2785
 State-gov        |         1451
 Without-pay      |           19
(9 rows)
Visualizing Data With 3rd Party Tools
Communicates With Data Visualization And BI Tools
- Dundas BI
- Qlik Sense
- Microsoft PowerBI
Future Plans
Schedule: By end of Oct 2017 / Nov 2017 to end of June 2018 / Q3 2018
Milestones: Alpha, 1st Beta, 2nd Beta, Publish It
- Alpha
  - Un/Semi-Structured Data and Parquet SerDes Support
  - BI Tools and Visualization Tools Integration
- 1st Beta, Step-Growth Phase
  - Record Set Cache
  - Distributed Computing For UDF and ML
  - Other SerDes Support