Parallelization of Regular Expression Matching and Its Evaluation on Hadoop

Kiminori Matsuzaki, Kento Emoto, Yu Liu. IPSJ Transactions on Programming, Vol.4 No.4, 1-11 (Sep. 2011)



Page 2: Parallelization of regular expression matching and its evaluation on Hadoop

0. INTRODUCTION AND MOTIVATION

Page 3: Parallelization of regular expression matching and its evaluation on Hadoop

REGULAR EXPRESSION

LIST HOMOMORPHISM

HADOOP

FINITE AUTOMATON

PARALLELIZATION

Page 4: Parallelization of regular expression matching and its evaluation on Hadoop

DFA IS BETTER

Page 5: Parallelization of regular expression matching and its evaluation on Hadoop

PROCESSOR SCALABILITY

Page 6: Parallelization of regular expression matching and its evaluation on Hadoop

1. OPTIMIZATION OF REGULAR EXPRESSION MATCHING

Page 7: Parallelization of regular expression matching and its evaluation on Hadoop

Hadoop

hadoop

Hadooooop

hadop

Hadooop

hadoooooop

Hadop

REGULAR EXPRESSION: (H|h)adoo*p
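As a quick check of the slide's example, the sketch below tests the seven spellings against the pattern using Python's standard `re` module. The last two candidates ("hadp" and "HADOOP") are not on the slide; they are added here only as assumed non-matching cases.

```python
import re

# Pattern from the slide: 'H' or 'h', then "ado", then zero or more 'o', then 'p'.
PATTERN = re.compile(r"(H|h)adoo*p")

# The first seven strings are from the slide; the last two are
# hypothetical counterexamples added for illustration.
candidates = ["Hadoop", "hadoop", "Hadooooop", "hadop",
              "Hadooop", "hadoooooop", "Hadop", "hadp", "HADOOP"]

# fullmatch requires the whole string to match, not just a prefix.
matches = [s for s in candidates if PATTERN.fullmatch(s)]
print(matches)
```

Only the seven spellings from the slide should be accepted; "hadp" lacks the mandatory "o" and "HADOOP" has the wrong case.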

Page 8: Parallelization of regular expression matching and its evaluation on Hadoop

full-text search

search engine

XML processing

access log analysis

natural language processing

text replacement

network security

compiler front-end

URL router

ACHIEVED WITH REGULAR EXPRESSIONS

Page 9: Parallelization of regular expression matching and its evaluation on Hadoop

FINITE AUTOMATON

Page 10: Parallelization of regular expression matching and its evaluation on Hadoop

NON-DETERMINISTIC FINITE AUTOMATON

[Figure: state diagram of an NFA with transitions labelled a, ε, a, a]

Page 11: Parallelization of regular expression matching and its evaluation on Hadoop

DETERMINISTIC FINITE AUTOMATON

[Figure: state diagram of a DFA with transitions labelled a, b, c, d, e]

Page 12: Parallelization of regular expression matching and its evaluation on Hadoop

PARALLELISM

Page 13: Parallelization of regular expression matching and its evaluation on Hadoop

2. LIST HOMOMORPHISM

Page 14: Parallelization of regular expression matching and its evaluation on Hadoop

({[a],[b],[a, b],[b, c, d],[e, f],..}, ++)

({1,1,2,3,2,..}, +)

HOMOMORPHISM

Page 15: Parallelization of regular expression matching and its evaluation on Hadoop

[1, 2, 3] ++ [7, 8] = [1, 2, 3, 7, 8]

3 + 2 = 5

HOMOMORPHISM
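The two lines above are the classic example of a homomorphism from lists under concatenation (++) to numbers under addition: the length function. A minimal sketch, with `length` written as an explicit reduction so that the structure is visible (the function name is chosen here for illustration):

```python
from functools import reduce
from operator import add

def length(xs):
    """Length as a homomorphism: each element maps to 1, results combine with +."""
    return reduce(add, (1 for _ in xs), 0)

xs, ys = [1, 2, 3], [7, 8]

# Concatenation on the list side ...
assert xs + ys == [1, 2, 3, 7, 8]
# ... corresponds to addition on the number side: 3 + 2 = 5.
assert length(xs + ys) == length(xs) + length(ys) == 5
```

Because the result for a concatenated list depends only on the results for its parts, the two parts can be processed independently and combined afterwards.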

Page 16: Parallelization of regular expression matching and its evaluation on Hadoop

DIVIDE AND CONQUER

Page 17: Parallelization of regular expression matching and its evaluation on Hadoop

LIST HOMOMORPHISM

Page 18: Parallelization of regular expression matching and its evaluation on Hadoop

B C D A B A ...

foldl

Page 19: Parallelization of regular expression matching and its evaluation on Hadoop

O(n/p + log p), where n is the length of the input string and p is the number of compute nodes

Page 20: Parallelization of regular expression matching and its evaluation on Hadoop

DFA: O((n/p + log p) |Q_D|), where n is the length of the input string, p is the number of compute nodes, and |Q_D| is the number of DFA states
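One way to see where the |Q_D| factor comes from: a chunk of input can be summarized as a table giving, for every possible starting state, the state reached after reading the chunk. Updating the table costs |Q_D| per character, and combining two tables costs |Q_D|. The sketch below is an illustration under assumptions, not the paper's code: the DFA for (H|h)adoo*p, its state numbering, and all function names are invented here.

```python
from functools import reduce

# Hypothetical DFA for (H|h)adoo*p: states 0..5 follow the pattern, 6 is a dead state.
N_STATES, DEAD, ACCEPTING = 7, 6, {5}

def step(state, ch):
    """One DFA transition."""
    if state == 0 and ch in "Hh":
        return 1
    if state == 1 and ch == "a":
        return 2
    if state == 2 and ch == "d":
        return 3
    if state in (3, 4) and ch == "o":
        return 4
    if state == 4 and ch == "p":
        return 5
    return DEAD

def summarize(chunk):
    """table[q] = state reached from q after reading chunk; O(len(chunk) * N_STATES)."""
    table = list(range(N_STATES))  # identity: the empty chunk changes nothing
    for ch in chunk:
        table = [step(q, ch) for q in table]
    return tuple(table)

def compose(first, second):
    """Summary of two adjacent chunks; associative, O(N_STATES)."""
    return tuple(second[q] for q in first)

def matches(text, parts=4):
    size = max(1, -(-len(text) // parts))  # ceiling division
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    tables = [summarize(c) for c in chunks]  # independent, so parallelizable
    total = reduce(compose, tables, tuple(range(N_STATES)))
    return total[0] in ACCEPTING  # start from state 0

print(matches("Hadooooop"), matches("Hadp"))
```

Since `compose` is associative, the tables can also be merged pairwise in a tree, which is where the log p term would come from.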

Page 21: Parallelization of regular expression matching and its evaluation on Hadoop

NFA: O((n/p + log p) |Q_N|^3), where n is the length of the input string, p is the number of compute nodes, and |Q_N| is the number of NFA states
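The cubic factor is what one would expect if a chunk is summarized as a |Q_N| × |Q_N| boolean reachability matrix and summaries are combined by naive boolean matrix multiplication. The sketch below illustrates that reading. The ε-free three-state NFA (strings over {a, b} ending in "ab") and all names are assumptions made for the example and do not come from the paper.

```python
from functools import reduce

# Hypothetical epsilon-free NFA accepting strings over {a, b} that end in "ab".
N, START, ACCEPTING = 3, 0, {2}
DELTA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}

IDENTITY = [[i == j for j in range(N)] for i in range(N)]

def char_matrix(ch):
    """M[s][t] is True when the NFA can move from s to t on ch."""
    return [[t in DELTA.get((s, ch), set()) for t in range(N)] for s in range(N)]

def bool_matmul(a, b):
    """Boolean matrix product: three nested loops over N, hence O(N^3)."""
    return [[any(a[i][k] and b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def summarize(chunk):
    """Reachability matrix for a whole chunk."""
    return reduce(bool_matmul, (char_matrix(ch) for ch in chunk), IDENTITY)

def matches(text, parts=4):
    size = max(1, -(-len(text) // parts))  # ceiling division
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    total = reduce(bool_matmul, (summarize(c) for c in chunks), IDENTITY)
    return any(total[START][q] for q in ACCEPTING)

print(matches("bbaab"), matches("aba"))
```

Compared with the DFA table, each merge step here is a matrix product rather than a table lookup, which is consistent with the slide's |Q_N|^3 versus |Q_D| factors.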

Page 22: Parallelization of regular expression matching and its evaluation on Hadoop

3. EVALUATION ON HADOOP

Page 23: Parallelization of regular expression matching and its evaluation on Hadoop

MAP REDUCE

[Figure: MapReduce data flow with INPUT, four MAPPERs, one REDUCER, and OUTPUT]
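The mapper/reducer split can be imitated in plain Python to show how the pieces might fit together. This is a simulation under stated assumptions, not Hadoop API code: `mapper`, `reducer`, and `run_job` are hypothetical names, the DFA is the illustrative one for (H|h)adoo*p, and tagging each chunk with its input offset is one assumed way of restoring order in the reducer. The slides do not say how the actual implementation handles ordering.

```python
from functools import reduce

# Hypothetical DFA for (H|h)adoo*p; transitions missing from DELTA go to DEAD.
DEAD, ACCEPTING, STATES = 6, {5}, range(7)
DELTA = {(0, "H"): 1, (0, "h"): 1, (1, "a"): 2, (2, "d"): 3,
         (3, "o"): 4, (4, "o"): 4, (4, "p"): 5}

def mapper(offset, chunk):
    """Emit (offset, table); table[q] is the state reached from q after the chunk."""
    table = list(STATES)
    for ch in chunk:
        table = [DELTA.get((q, ch), DEAD) for q in table]
    return offset, tuple(table)

def reducer(pairs):
    """Sort partial results by input offset, then compose them left to right."""
    ordered = [table for _, table in sorted(pairs)]
    total = reduce(lambda f, g: tuple(g[q] for q in f), ordered, tuple(STATES))
    return total[0] in ACCEPTING

def run_job(text, chunk_size=3):
    splits = [(i, text[i:i + chunk_size]) for i in range(0, len(text), chunk_size)]
    # Mappers are independent; reversing simulates results arriving out of order.
    emitted = [mapper(offset, chunk) for offset, chunk in reversed(splits)]
    return reducer(emitted)

print(run_job("Hadooooop"), run_job("Hadoo"))
```

The sort in `reducer` matters because composing transition tables is associative but not commutative: chunks must be combined in input order.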

Page 24: Parallelization of regular expression matching and its evaluation on Hadoop

[Figure: small REGULAR EXPRESSION. Execution time (0s to 500s) vs. Number of Nodes (0 to 40); series: DFA, NFA]

Page 25: Parallelization of regular expression matching and its evaluation on Hadoop

[Figure: LARGE REGULAR EXPRESSION. Execution time (0s to 7000s) vs. Number of Nodes (0 to 40); series: DFA, NFA]

Page 26: Parallelization of regular expression matching and its evaluation on Hadoop

[Figure: DFA. Execution time (0s to 300s) vs. Number of states (0 to 6000): LINEAR]

Page 27: Parallelization of regular expression matching and its evaluation on Hadoop

[Figure: NFA. Execution time (0s to 4000s) vs. Number of states (0 to 40): CUBIC]

Page 28: Parallelization of regular expression matching and its evaluation on Hadoop

4. RELEVANT STUDIES

Page 29: Parallelization of regular expression matching and its evaluation on Hadoop

TREE HOMOMORPHISM

Skillicorn, D.B.: Structured Parallel Computation in Structured Documents, Journal of Universal Computer Science, Vol.3, No.1, pp.42-68 (1997).

Nomura, Y., Emoto, K., Matsuzaki, K., Hu, Z. and Takeichi, M.: Parallelization of XPath Queries Using Tree Skeletons and Its Evaluation (in Japanese), Computer Software, Vol.24, No.3, pp.51-62 (2007).

GPGPU BASED

Naghmouchi, J., Scarpazza, D.P. and Berekovic, M.: Small-ruleset Regular Expression Matching on GPGPUs: Quantitative Performance Analysis and Optimization, Proc. 24th International Conference on Supercomputing, 2010, Tsukuba, Ibaraki, Japan, June 2-4, 2010, Boku, T., Nakashima, H. and Mendelson, A. (Eds.), pp.337-348, ACM (2010).

MAXIMUM MARKING PROBLEMS

Matsuzaki, K., Hu, Z. and Takeichi, M.: Derivation of Parallel Programs for Solving Maximum Marking Problems on Lists (in Japanese), IPSJ Transactions on Programming, Vol.49, No.SIG 3 (PRO 36), pp.16-27 (2008).