Introduction to DISQL, a distributed programming framework widely used in Baidu

Upload: xiaoming-chen

Post on 24-May-2015

Chen Xiaoming (陈晓鸣)

Senior Engineer, Baidu IBASE Dept.

What is DISQL?

DISQL is a distributed programming framework widely used in Baidu.

Contents

Problems

Solution

Examples

Rationales

Adoption

Problems

statistical analysis of logs

extraction of fields

in order to generate reports

statistical analysis of features

features of web pages, web sites, ads, user preferences, etc.

in order to provide data for data mining and machine learning

common operations: selecting, filtering, grouping, sorting, joining, etc.

Solution

A Platform

named Log Statistical Platform, a.k.a. LSP

web-based

convenient for secondary development

convenient for task/data/rights management

A Programming Framework

named DIstributed SQL, a.k.a. DISQL

provide SQL-like operators which can be combined arbitrarily

encapsulate distributed algorithms

automatic code generation

Application Programming Interfaces

named Distributed Query, a.k.a. DQuery

DSL-style APIs embedded in well-known programming languages

PHP so far; C++, Python, … in the future

uses method chaining to provide a fluent interface

data flow in the form of a DAG composed of chains of methods
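
The two points above (a fluent interface built by method chaining, and a data flow that forms a DAG) can be illustrated with a small sketch. This is not DQuery's actual API, which the slides do not show; the `Node` class, the `source` and `topo_order` helpers, and the operator names are all hypothetical, and Python is used here only for illustration. Each chained call returns a new node pointing at its parent, and `join` takes a second chain, which is what makes the result a DAG rather than a single line.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Node:
    """One operator in the data-flow graph (hypothetical, not DQuery's real classes)."""
    op: str
    args: Tuple = ()
    parents: Tuple["Node", ...] = ()

    # Each chained method returns a NEW node whose parent is `self`,
    # so a chain of calls grows the graph instead of mutating state.
    def select(self, *fields: str) -> "Node":
        return Node("select", fields, (self,))

    def filter(self, condition: str) -> "Node":
        return Node("filter", (condition,), (self,))

    def group(self, *keys: str) -> "Node":
        return Node("group", keys, (self,))

    def join(self, other: "Node", key: str) -> "Node":
        # Two parents: this is what turns a set of chains into a DAG.
        return Node("join", (key,), (self, other))


def source(name: str) -> Node:
    """Start a chain from a named input (hypothetical helper)."""
    return Node("source", (name,))


def topo_order(sink: Node) -> List[Node]:
    """Return every node reachable from `sink`, parents before children."""
    ordered: List[Node] = []
    seen = set()

    def visit(node: Node) -> None:
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)
        ordered.append(node)

    visit(sink)
    return ordered


# Two chains share one source and are merged by a join.
logs = source("query_log")
clicks = logs.filter("clicked = 1").select("site")
shows = logs.select("site", "ad_id")
plan = clicks.join(shows, key="site").group("site")

ops = [node.op for node in topo_order(plan)]
print(ops)
```

Nothing is executed while the chain is being written; the calls only describe the graph. A translator could then walk it in the order returned by `topo_order` and generate code for each operator.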

Three Edit Modes – Simple Mode

Three Edit Modes – DQuery Mode

Three Edit Modes – Complex Mode

Hierarchy

Linux

Hadoop,…

DISQL

DQuery

LSP

DISQL Architecture

Simple Mode, DQuery Mode, Complex Mode

PHP, C++, Python

Data-flow, Schema, Storage APIs, Computing APIs

Normalizer, Optimizer, Splitter, Planner, Coder

Edit Modes

APIs

Translators

Runtimes

LSP Architecture

data presentation & monitoring

data access layer

data management layer

computing layer

storage systems, computing systems

third party apps

Examples

Example 1 – word count
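
The code shown on this slide did not survive in the transcript. As a stand-in, here is a minimal single-process sketch of word count written as the map, shuffle and reduce steps that a distributed framework would normally hide from the user. It is plain Python, not DQuery syntax, and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every whitespace-separated word."""
    for word in line.split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group the emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce step: sum the values of each key."""
    return {word: sum(values) for word, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_words(line))
    return reduce_counts(shuffle(pairs))


result = word_count(["a b a", "b c"])
print(result)
```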

Example 2

given a log of query and ad shows

extract site field from url field

filter sites with regex

calculate the number of queries and ad shows per site

output in JSON format
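
The steps above can be sketched on a single machine as follows. This is not the DQuery code from the next slide, which is not preserved in the transcript. The record layout (one record per query, with a `url` field and an `ad_shows` count), the function name `site_stats` and the sample data are all assumptions.

```python
import json
import re
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlparse


def site_stats(records: Iterable[dict], site_pattern: str) -> List[str]:
    """Return one JSON line per matching site with its query and ad-show totals."""
    keep = re.compile(site_pattern)
    stats: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"queries": 0, "ad_shows": 0}
    )
    for record in records:
        # Step 1: extract the site (host name) from the url field.
        site = urlparse(record["url"]).hostname
        # Step 2: filter sites with a regular expression.
        if site is None or not keep.search(site):
            continue
        # Step 3: aggregate per site (assumes one record is one query).
        stats[site]["queries"] += 1
        stats[site]["ad_shows"] += record["ad_shows"]
    # Step 4: output in JSON format, one object per site.
    return [
        json.dumps({"site": site, **totals}, sort_keys=True)
        for site, totals in sorted(stats.items())
    ]


# Hypothetical sample log.
log = [
    {"url": "http://www.example.com/a?q=1", "ad_shows": 2},
    {"url": "http://www.example.com/b", "ad_shows": 1},
    {"url": "http://news.example.org/x", "ad_shows": 5},
]
lines = site_stats(log, r"\.com$")
for line in lines:
    print(line)
```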

Code in DQuery Mode

Rationales

Use Case Driven VS Completeness

[Figure: four items labeled "Problem" and one labeled "Our Solution"]

Internal DSL VS External DSL

take advantage of:

parsers, libraries and VMs of the host languages

users and communities

language features

different from Pig, Hive, Sawzall, etc

Open/Closed Principle

“open for extension, closed for modification”

open for single machine algorithms, closed for distributed algorithms

also different from Pig, Hive, Sawzall, …
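
One way to read "open for single machine algorithms, closed for distributed algorithms" is sketched below. The framework owns the grouping step, which would be distributed in a real system, and users can only plug ordinary single-machine functions into it. The names `group_apply` and `median_latency` are hypothetical, and the grouping here runs in one process purely for illustration.

```python
from collections import defaultdict
from statistics import median
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Row = Tuple[str, int]


def group_apply(
    rows: Iterable[Row],
    key: Callable[[Row], Hashable],
    per_group: Callable[[List[Row]], float],
) -> Dict[Hashable, float]:
    """Framework side (closed): how rows are partitioned and grouped is fixed.

    In a distributed setting this is where shuffling would happen; users
    cannot change it, they only supply `key` and `per_group`.
    """
    groups: Dict[Hashable, List[Row]] = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return {k: per_group(v) for k, v in groups.items()}


def median_latency(rows: List[Row]) -> float:
    """User side (open): an ordinary single-machine function over one group."""
    return median(latency for _, latency in rows)


rows = [("a", 10), ("a", 30), ("b", 5)]
result = group_apply(rows, key=lambda row: row[0], per_group=median_latency)
print(result)
```

Adding a new per-group computation means writing another plain function; the grouping code itself stays untouched.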

Adoption

Users

…… ……

Usage

throughput/day: hundreds of TB

tasks/day: thousands

total tasks: > 1 million

Q&A

You are also welcome to contact me:

• Twitter: @acumon

• Email: [email protected]

• Gmail/Gtalk: [email protected]

The End

THANK YOU!