Introduction to DISQL, a distributed programming framework widely used in Baidu
TRANSCRIPT
1
Introduction to DISQL
Chen Xiaoming (陈晓鸣)
Senior Engineer, Baidu IBASE Dept. (Infrastructure Platform Department)
2
What is DISQL?
3
DISQL is a distributed programming framework
widely used in Baidu
4
Contents
Problems
Solution
Examples
Rationales
Adoption
5
Problems
6
Problems
statistical analysis of logs
extraction of fields
in order to generate reports
7
Problems
statistical analysis of features
features of web pages, web sites, ads, user preferences, etc.
in order to provide data for data mining and machine learning
8
Problems
common operations
selecting, filtering, grouping, sorting, joining, etc.
9
Solution
10
A Platform
named Log Statistical Platform, a.k.a. LSP
web-based
convenient for secondary development
convenient for task/data/rights management
11
A Programming Framework
named DIstributed SQL, a.k.a. DISQL
provide SQL-like operators which can be combined arbitrarily
encapsulate distributed algorithms
automatic code generation
12
Application Programming Interfaces
named Distributed Query, a.k.a. DQuery
DSL-style APIs embedded in well-known programming languages
PHP so far, C++/Python,… in the future
uses method chaining to provide a fluent interface
data flow takes the form of a DAG composed of chains of methods
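The idea of chained methods forming a DAG can be illustrated with a minimal sketch. This is not DQuery's actual API (which is PHP-based and not shown in these slides); the `Node` class, the `source` helper, the operator names, and the field names are all hypothetical. Each method returns a new node that points at its input, and an operator with two inputs such as `join` is what turns plain chains into a DAG.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One operator in a data-flow graph (hypothetical, not DQuery's API)."""
    op: str
    arg: str = ""
    inputs: tuple = ()

    # Each method returns a NEW node that records its input, so calls chain.
    def select(self, fields):
        return Node("select", fields, (self,))

    def filter(self, condition):
        return Node("filter", condition, (self,))

    def group(self, key):
        return Node("group", key, (self,))

    def join(self, other, on):
        # Two inputs: this is what turns a chain into a DAG.
        return Node("join", on, (self, other))

    def plan(self):
        """List operators in topological order (inputs before consumers)."""
        ordered, seen = [], set()

        def visit(node):
            if id(node) in seen:      # a shared upstream node is emitted once
                return
            seen.add(id(node))
            for parent in node.inputs:
                visit(parent)
            ordered.append(f"{node.op}({node.arg})")

        visit(self)
        return ordered


def source(name):
    return Node("source", name)


# Two chains branch from the same filtered log and meet again in a join.
logs = source("query_log").filter("status == 200")
clicks = logs.select("site, clicks").group("site")
shows = logs.select("site, shows").group("site")
report = clicks.join(shows, on="site")
plan = report.plan()
print(plan)
```

Nothing is executed while the chain is being written; the calls only describe the graph. A framework is then free to normalize, optimize, and split that graph before generating code for the cluster.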
13
Three Edit Modes – Simple Mode
14
Three Edit Modes – DQuery Mode
15
Three Edit Modes – Complex Mode
16
Hierarchy
Linux
Hadoop,…
DISQL
DQuery
LSP
17
DISQL Architecture
Edit Modes: Simple Mode, DQuery Mode, Complex Mode
APIs: PHP, C++, Python
Translators: Normalizer, Optimizer, Splitter, Planner, Coder
Runtimes: Data-flow, Schema, Storage APIs, Computing APIs
18
LSP Architecture
data presentation & monitoring
data access layer
data management layer
computing layer
storage systems computing systems
third party apps
19
Examples
20
Example 1 – word count
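For orientation, here is a minimal single-machine sketch of word count in plain Python. It is not the DQuery code from the slide; the function names `map_phase` and `reduce_phase` and the sample lines are assumptions made for illustration. It only shows the two steps (emit one pair per word, then sum per word) that a distributed framework would spread across machines.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reduce_phase(pairs):
    """Group the pairs by word and sum the counts."""
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return dict(totals)


sample = ["the quick brown fox", "the lazy dog", "The end"]
counts = reduce_phase(map_phase(sample))
print(counts)
```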
21
Example 2
given a log of query and ad shows
extract site field from url field
filter sites with regex
calculate the amount of query and ad shows per site
output in JSON format
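The steps above can be sketched on a single machine in plain Python. This is not DQuery code. The record layout (`url`, `queries`, `ad_shows`), the sample data, and the regular expression are all assumptions, and "amount of query and ad shows" is read here as two per-site sums.

```python
import json
import re
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical log records; the field names are assumptions.
LOG = [
    {"url": "http://news.example.com/a", "queries": 3, "ad_shows": 5},
    {"url": "http://news.example.com/b", "queries": 1, "ad_shows": 2},
    {"url": "http://shop.example.org/x", "queries": 4, "ad_shows": 0},
    {"url": "http://blog.example.net/p", "queries": 2, "ad_shows": 7},
]

# Hypothetical filter: keep only .com and .org sites under "example".
SITE_PATTERN = re.compile(r"\.example\.(com|org)$")


def summarize(log, pattern):
    totals = defaultdict(lambda: {"queries": 0, "ad_shows": 0})
    for rec in log:
        site = urlparse(rec["url"]).hostname       # extract site from url
        if site is None or not pattern.search(site):
            continue                               # filter sites with regex
        totals[site]["queries"] += rec["queries"]  # aggregate per site
        totals[site]["ad_shows"] += rec["ad_shows"]
    return dict(totals)


result = summarize(LOG, SITE_PATTERN)
print(json.dumps(result, sort_keys=True))          # output in JSON format
```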
22
Code in DQuery Mode
23
Rationales
24
Use Case Driven VS Completeness
[Figure: four items labeled "Problem" and one labeled "Our Solution"]
25
Internal DSL VS External DSL
take advantage of:
parsers, libraries and VMs of the host languages
users and communities
language features
different from Pig, Hive, Sawzall, etc
26
Open/Closed Principle
“open for extension, closed for modification”
open for single machine algorithms, closed for distributed algorithms
also different from Pig, Hive, Sawzall, …
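One way to picture "open for single machine algorithms, closed for distributed algorithms" is the sketch below. It is an illustration under assumptions, not DISQL's implementation: `distributed_group_apply`, `middle_latency`, and the record fields are hypothetical names, and the "cluster" is simulated with in-memory partitions. The framework owns the partitioning and grouping; the user only supplies an ordinary function that sees one group at a time.

```python
from collections import defaultdict


def distributed_group_apply(records, key_fn, user_fn, num_partitions=3):
    """Framework side (closed): partition by key, group, apply user_fn.

    Partitions are simulated with in-memory dicts; a real system would
    place them on different machines.
    """
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for rec in records:
        key = key_fn(rec)
        # All records with the same key land in the same partition.
        partitions[hash(key) % num_partitions][key].append(rec)

    results = {}
    for part in partitions:
        for key, group in part.items():
            results[key] = user_fn(group)  # user code runs per group
    return results


def middle_latency(group):
    """User side (open): a plain single-machine function.

    Returns the middle element of the sorted latencies (the upper
    middle one when the group has an even number of records).
    """
    values = sorted(r["latency_ms"] for r in group)
    return values[len(values) // 2]


records = [
    {"site": "a", "latency_ms": 10},
    {"site": "a", "latency_ms": 50},
    {"site": "b", "latency_ms": 7},
    {"site": "a", "latency_ms": 30},
]
per_site = distributed_group_apply(records, lambda r: r["site"], middle_latency)
print(per_site)
```

The user function never touches partitioning or data movement, so it can be swapped for any other per-group computation without changing the framework side.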
27
Adoption
28
Users
…… ……
29
Usage
throughput/day: hundreds of TB
tasks/day: thousands
total tasks: > 1 million
30
Q&A
You are also welcome to contact me:
• Twitter: @acumon
• Email: [email protected]
• Gmail/Gtalk: [email protected]
31
The End
THANK YOU!