Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Page 1: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

北海道大学 Hokkaido University

Lecture on Information Knowledge Network, 2010/12/23

Lecture on Information Knowledge Network

"Information retrieval and pattern matching"

Laboratory of Information Knowledge Network, Division of Computer Science,

Graduate School of Information Science and Technology, Hokkaido University

Takuya KIDA

Page 2: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

The 3rd: Suffix type algorithm

Boyer-Moore algorithm
Galil algorithm
Horspool algorithm
Sunday algorithm

Page 3: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Knuth-Morris-Pratt algorithm (review)

KMP-String-Matching (T, P)
  n ← length[T].
  m ← length[P].
  q ← 1.
  next ← ComputeNext(P).
  for i ← 1 to n do
    while q > 0 and P[q] ≠ T[i] do q ← next[q];
    if q = m then report an occurrence at i − m + 1;
    q ← q + 1.

The next position of comparison is given by the function next (the pattern P is shifted by q − next[q]). When the value of next is 0, the comparison restarts from the next text character. The number of comparisons at each text position is O(1) amortized.

Even in the worst case, matching takes only O(n+m) time (with next computed in a preprocessing step).

Example for P = ababc: next[5] = 3, next[3] = 0.

Text T:    ababbababcbaababc (positions 1 to 17)
Pattern P: ababc
The pattern occurs at position 6 of T.

D. E. Knuth, J. H. Morris, Jr, and V. R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323-350, 1977.
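The KMP review above can be sketched in Python as follows. This is a minimal sketch, not the slide's exact pseudocode: it uses the plain failure function (longest border) rather than the optimized next table, works with 0-based indices internally, and reports 1-based start positions. The names compute_failure and kmp_search are illustrative.

```python
def compute_failure(pattern):
    """fail[q] = length of the longest proper border of pattern[:q]."""
    fail = [0] * (len(pattern) + 1)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[q] != pattern[k]:
            k = fail[k]
        if pattern[q] == pattern[k]:
            k += 1
        fail[q + 1] = k
    return fail


def kmp_search(text, pattern):
    """Return the 1-based start positions of all occurrences of pattern."""
    m = len(pattern)
    if m == 0:
        return []
    fail = compute_failure(pattern)
    hits = []
    q = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while q > 0 and pattern[q] != ch:
            q = fail[q]          # fall back instead of re-reading the text
        if pattern[q] == ch:
            q += 1
        if q == m:
            hits.append(i - m + 2)   # i is 0-based, the result is 1-based
            q = fail[q]
    return hits
```

On the slide's example, kmp_search("ababbababcbaababc", "ababc") finds position 6 and, further to the right, position 13.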

Page 4: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Shift-And algorithm (review)

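Text T:    abababba
Pattern P: ababb

R_i = (R_{i-1} << 1 | 1) & M(T[i])

This can be calculated in O(1) time.

Mask table M (the bits correspond to pattern positions 12345, written left to right):
  M(a) = 10100
  M(b) = 01011

One step, for R_{i-1} = 01000 and T[i] = a:
  (R_{i-1} << 1 | 1) = 10100
  M(a)               = 10100
  R_i (after &)      = 10100

State R_i after reading each character of T:
  i   T[i]  R_i
  0   -     00000
  1   a     10000
  2   b     01000
  3   a     10100
  4   b     01010
  5   a     10100
  6   b     01010
  7   b     00001   (occurrence of P ending at position 7)
  8   a     10000

※ Only the correctly transferred bits are kept by taking the AND with the mask bits M.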

R. A. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Proceedings of the 12th International Conference on Research and Development in Information Retrieval, 168-175. ACM Press, 1989.
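The update rule R_i = (R_{i-1} << 1 | 1) & M(T[i]) can be sketched in Python as below. This is a sketch under assumptions: Python integers stand in for the machine word, bit j−1 of the integer represents pattern position j (so the slide's left-to-right bit strings appear mirrored), and the name shift_and_search is illustrative.

```python
def shift_and_search(text, pattern):
    """Return the 1-based start positions of pattern in text (Shift-And)."""
    m = len(pattern)
    if m == 0:
        return []
    # Mask table M: bit j-1 of mask[c] is set iff pattern position j holds c.
    mask = {}
    for j, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << j)
    accept = 1 << (m - 1)        # bit of the last pattern position
    state = 0                    # R_0 = 0^m
    hits = []
    for i, ch in enumerate(text):
        state = ((state << 1) | 1) & mask.get(ch, 0)
        if state & accept:
            hits.append(i - m + 2)   # i is 0-based, the result is 1-based
    return hits
```

For the slide's example, shift_and_search("abababba", "ababb") reports the occurrence that ends at text position 7, i.e. start position 3.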

Page 5: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


General form of efficient matching algorithms

MatchingAlgorithm (P, T)
1 m ← length[P].
2 n ← length[T].
3 i ← 1.
4 while i ≤ n − m + 1 do
5   decide if i is an occurrence;
6   if i is an occurrence then report the occurrence at i;
7   decide the amount Δ by which the pattern can be shifted safely;
8   i ← i + Δ.

Many efficient algorithms, including KMP and BM, fit this frame.
※ Masayuki Takeda, "High-speed pattern matching algorithms for full text processing," Informatics Symposium, January 1991 (written in Japanese).

Important points for speeding up the algorithm:
• How much work can we save at line 5?
• How large can we make Δ at line 7?
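The frame can be written down almost literally. The sketch below is a hypothetical illustration: frame_search and its optional safe_shift callback are invented names, and with no callback it degenerates to the naive algorithm (Δ = 1), which is the baseline the later algorithms improve on at lines 5 and 7.

```python
def frame_search(text, pattern, safe_shift=None):
    """Generic frame: test position i, then advance by a safe shift."""
    m, n = len(pattern), len(text)
    hits = []
    i = 1                                  # 1-based start of the window
    while i <= n - m + 1:
        window = text[i - 1:i - 1 + m]
        if window == pattern:              # line 5: is i an occurrence?
            hits.append(i)                 # line 6: report it
        # line 7: any shift is allowed as long as no occurrence is skipped
        delta = safe_shift(window) if safe_shift else 1
        i += max(1, delta)                 # line 8
    return hits
```

Calling frame_search("ababbababcbaababc", "ababc") returns [6, 13]; passing a safe_shift function is where a delta table would plug in.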

Page 6: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Boyer-Moore algorithm

(bad-character heuristic)

Text T:      a a b c d a a c b c a b c c a ・・・
Pattern P:   a b c b a b a b
After shift:           a b c b a b a b

Shift P to align the rightmost 'c' in P with the current text position.

delta1(char) := the jump width by which the pattern is shifted so that the rightmost occurrence of char in P is aligned with the current text position (if the pattern does not include char, it is equal to the pattern length).

delta1(c) = 5

Δ = delta1(char) − j + 1 = 5 − 0 = 5, where j is the position of the mismatch counted from the right end of P (here j = 1, so j − 1 = 0 characters had matched).

Features:
• Characters of the pattern are compared from right to left.
• The values of two functions (delta1 and delta2) are compared, and the pattern is shifted by the larger.

Although the time complexity of the BM algorithm is O(mn) in the worst case, it runs in about O(n/m) time on average when the alphabet is large enough (sublinear!).

R. S. Boyer and J. S. Moore. A fast string searching algorithm. Communications of the ACM, 20(10):762-772, 1977.
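The delta1 table and a search that uses it alone can be sketched as follows. This is a simplified sketch, not the full Boyer-Moore algorithm: delta2 is left out, so the shift delta1 − (number of matched characters) is clamped to at least 1 whenever it would be zero or negative. The names build_delta1 and bm_bad_char_search are illustrative.

```python
def build_delta1(pattern):
    """delta1[c] = m - (rightmost 1-based position of c in the pattern)."""
    m = len(pattern)
    delta1 = {}
    for pos, ch in enumerate(pattern, start=1):
        delta1[ch] = m - pos      # later (more rightward) positions overwrite
    return delta1


def bm_bad_char_search(text, pattern):
    """Right-to-left matching with the bad-character heuristic only."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    delta1 = build_delta1(pattern)
    hits = []
    i = 0                          # 0-based start of the current window
    while i <= n - m:
        matched = 0
        while matched < m and pattern[m - 1 - matched] == text[i + m - 1 - matched]:
            matched += 1
        if matched == m:
            hits.append(i + 1)     # report a 1-based position
            i += 1
        else:
            bad = text[i + m - 1 - matched]
            # Characters absent from the pattern get delta1 = m.
            i += max(1, delta1.get(bad, m) - matched)
    return hits
```

For the slide's pattern, build_delta1("abcbabab") gives delta1(c) = 5, matching the value in the example.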

Page 7: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


delta2(j) (good-suffix heuristic)

delta2(j) := the jump width by which the pattern is shifted to align the matched suffix of P, of length j − 1, with another factor of P (or with the longest prefix of P that is also a suffix of that matched string). If there is no such factor, it is equal to the length of P.

Example 1:

Text T:      a a b c d a a b b c a b c c a ・・・
Pattern P:   a b c b a b a b
After shift:             a b c b a b a b

Shift P to align 'ab' with the prefix of P.

delta2(3) = 8

Δ = delta2(3) − 3 + 1 = 8 − 2 = 6

※ There are two candidate positions, 1 and 5, at which 'ab' occurs again in P. However, the character to the left of position 5, namely the 4th, is 'b', the same character that has just failed to match the text character 'a'. Therefore, position 1 is the only candidate.

Example 2:

Text T:      a a b c a b a b b c a b c c a ・・・
Pattern P:   a b c b a b a b
After shift:             a b c b a b a b

Shift P to align 'ab' with the prefix of P.

delta2(5) = 10

Δ = delta2(5) − 5 + 1 = 10 − 4 = 6
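To make the definition concrete, the following sketch computes the good-suffix shift by brute force, straight from the definition. It is an illustration only: it tries every shift for one mismatch position instead of using the linear-time preprocessing, and the names good_suffix_shift and delta2 are illustrative. Following the convention of the examples, delta2(j) is the shift plus the j − 1 matched characters.

```python
def good_suffix_shift(pattern, matched):
    """Smallest safe shift after `matched` characters matched from the right."""
    m = len(pattern)
    mismatch = m - matched - 1          # 0-based index of the failed position
    for shift in range(1, m):
        # The matched suffix must agree with the shifted copy where they overlap.
        suffix_ok = all(
            pattern[k - shift] == pattern[k]
            for k in range(m - matched, m)
            if k - shift >= 0
        )
        # The character moved under the failed position must differ from it.
        differs = (
            mismatch < 0
            or mismatch - shift < 0
            or pattern[mismatch - shift] != pattern[mismatch]
        )
        if suffix_ok and differs:
            return shift
    return m                            # no reoccurrence: shift by the length


def delta2(pattern, j):
    """delta2 in the slide's convention: shift plus the j - 1 matched characters."""
    return good_suffix_shift(pattern, j - 1) + (j - 1)
```

For P = abcbabab this reproduces both examples: delta2(P, 3) is 8 and delta2(P, 5) is 10, each corresponding to a shift of 6.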

Page 8: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Problem of BM method

• It is complicated to compute the values of the delta functions.
  – It takes O(m^2) time in a naive way.
  – Reducing it to O(m) is somewhat troublesome (it is done in a way similar to the preprocessing of KMP).
• It is costly to compare the values of delta1 and delta2 at each iteration.
  – Generally, only delta1 is used. (However, some care is needed to shift the pattern correctly, since delta1 alone can give a shift that is zero or negative.)
• It takes O(mn) time in the worst case.
  – Consider T = a^n and P = ba^m.
• The efficiency of BM declines when the alphabet size is small.
  – For binary strings over Σ = {0,1}, the shifts Δ would be very small.

Page 9: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Galil algorithm

• Since the information about the already matched string is forgotten in the original BM method, it takes O(mn) time in the worst case.
• The idea for improvement is to remember how long a prefix of P has already been matched with the text.
• The Galil algorithm scans the text in O(n) time theoretically, but it is slower in practice because the algorithm becomes much more complicated.

Text T:      a a b c a b a b c b a b a b a ・・・
Pattern P:   a b c b a b a b
After shift:             a b c b a b a b

delta2(5) = 10

Remember that the leading positions of the shifted pattern (the prefix 'ab') have already been compared. Only the remaining positions are to be compared!

Each character of the text is compared at most twice!

Z. Galil. On improving the worst case running time of the Boyer-Moore string searching algorithm. Communications of the ACM, 22(9):505-508, 1979.

Page 10: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Horspool algorithm

If Σ is large enough, delta1 (the bad-character heuristic) mostly gives the best shift amount.
→ A small modification can enlarge the jump width.

Always decide the jump width by the character of the text at the end position of the pattern.

Boyer-Moore (delta1):

Text T:     a a b c d c a d b c a b c c a b a c a ・・・
Pattern P:  a b c b a b a d

delta1(c) = 5

Horspool (delta1'):

Text T:     a a b c d c a d b c a b c c a b a c a ・・・
Pattern P:  a b c b a b a d

delta1'(d) = 10

delta1'(b) = 3

R. N. Horspool. Practical fast searching in strings. Software Practice and Experience, 10(6):501-506, 1980.

Page 11: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Pseudo code

Horspool (P, T)
  m ← length[P].
  n ← length[T].
  Preprocessing:
    for each c ∈ Σ do delta1'[c] ← m.
    for j ← 1 to m − 1 do delta1'[ P[j] ] ← m − j.
  Searching:
    i ← 0.
    while i ≤ n − m do
      j ← m;
      while j > 0 and T[i+j] = P[j] do j ← j − 1;
      if j = 0 then report an occurrence at i+1;
      i ← i + delta1'[ T[i+m] ].
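The pseudocode translates to Python fairly directly. In this sketch the table is a dictionary, so characters that do not occur in P[1..m−1] fall back to the default shift m instead of being initialized for the whole alphabet; indices are 0-based internally and results are 1-based. The names build_horspool_shifts and horspool_search are illustrative.

```python
def build_horspool_shifts(pattern):
    """delta1'[c] = m - j for the rightmost 1-based position j < m holding c."""
    m = len(pattern)
    shifts = {}
    for j in range(m - 1):             # every position except the last one
        shifts[pattern[j]] = m - 1 - j
    return shifts


def horspool_search(text, pattern):
    """Return the 1-based start positions of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    shifts = build_horspool_shifts(pattern)
    hits = []
    i = 0                              # 0-based start of the current window
    while i <= n - m:
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1                     # compare from right to left
        if j < 0:
            hits.append(i + 1)
        # The shift depends only on the text character under the pattern's end.
        i += shifts.get(text[i + m - 1], m)
    return hits
```

For P = abcbabad the table holds a → 1, b → 2, c → 5, and 'd' (which occurs only at the last position) takes the default shift m = 8.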

Page 12: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Sunday algorithm

It is basically based on the BM method. Differences:

– It compares the pattern positions in an arbitrary order. For example, it compares the characters starting from the least frequent one.

– It uses the text character just to the right of the end position of the pattern to determine the value of delta1 (it also calculates delta2, and then takes the larger of the two).

The jump width tends to be longer than that of Horspool.
– However, the memory consumption is larger than that of Horspool.
– Moreover, it takes much more time to decide the jump width than Horspool.

Always decide the jump width by the character of the text just to the right of the end position of the pattern.

Text T:     a a b c d c a b d c a b c c a b a c a ・・・
Pattern P:  a b c b a b a b

delta1'(d) = 9

delta1'(c) = 6

D. M. Sunday. A very fast substring search algorithm. Communications of the ACM, 33(8):132-142, 1990.
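The shift rule based on the character to the right of the window can be sketched as below. This is a simplified variant, not Sunday's full algorithm: it uses the delta1' table only (no delta2) and compares the window in plain left-to-right order rather than in a frequency-based order. The names build_sunday_shifts and sunday_search are illustrative.

```python
def build_sunday_shifts(pattern):
    """delta1'[c] = m + 1 - (rightmost 1-based position of c in the pattern)."""
    m = len(pattern)
    # Later entries overwrite earlier ones, so the rightmost occurrence wins.
    return {ch: m - j for j, ch in enumerate(pattern)}


def sunday_search(text, pattern):
    """Return the 1-based start positions of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    shifts = build_sunday_shifts(pattern)
    hits = []
    i = 0                              # 0-based start of the current window
    while i <= n - m:
        if text[i:i + m] == pattern:
            hits.append(i + 1)
        if i + m >= n:                 # no character to the right of the window
            break
        # Characters absent from the pattern allow the maximal shift m + 1.
        i += shifts.get(text[i + m], m + 1)
    return hits
```

For P = abcbabab this gives delta1'(c) = 6, and 'd' takes the default m + 1 = 9, matching the values on the slide.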

Page 13: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Factor type algorithm

BDM algorithm
BOM algorithm
BNDM algorithm

Page 14: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Backward Dawg Matching (BDM) algorithm

It is basically based on the BM method. Differences:

– It decides whether the pattern can occur at the current position by checking whether the string read so far matches any factor of the pattern, not only a suffix of the pattern.

– It uses a suffix automaton (or a suffix tree) to determine whether the string read is a factor of the pattern.

Features of the suffix automaton (SA):
– It can tell whether a string u is a factor of pattern P in O(|u|) time.
– It can also tell whether a string u is a suffix of P.
– For P = p1p2…pm, there exists an online construction algorithm that runs in O(m) time.

M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Speeding up two string matching algorithms. Algorithmica, 12(4/5):247-267, 1994.

Factor search:

Text T:     a a b c d c a b d c a b c c a b a c a ・・・
Pattern P:  a b c b a b a b

Since 'cc' is not a factor of P and 'c' is not a prefix of P, the pattern can safely be shifted to the next position.

Whether the string read is a prefix of P can be determined from the second feature of the SA (the SA is built for the reversed pattern).

Page 15: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Suffix Automaton

A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M. T. Chen and J. Seiferas. The smallest automaton recognizing the subwords of a text. Theoretical Computer Science (40):31-55, 1985.

[Figure: the suffix trie, the suffix tree, and the suffix automaton (states 0 to 9) for the reversed pattern]

Suffix automaton: an automaton that accepts the suffixes of P^R, the reverse of P = announce.

Page 16: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


On-line construction algorithm

SuffixAutomaton (P = p1p2…pm)
  Create the one-node graph G = DAWG(ε).
  root ← sink ← the node of G. suf[root] ← θ.
  for i ← 1 to m do
    a ← pi;
    create a new node newsink;
    make a solid edge (sink, newsink) labeled by a;
    w ← suf[sink];
    while w ≠ θ and son(w,a) = θ do
      make a non-solid a-edge (w, newsink);
      w ← suf[w];
    v ← son(w,a);
    if w = θ then suf[newsink] ← root
    else if (w,v) is a solid edge then suf[newsink] ← v
    else
      create a node newnode;
      newnode has the same outgoing edges as v, except that they are all non-solid;
      change (w,v) into a solid edge (w, newnode);
      suf[newsink] ← newnode;
      suf[newnode] ← suf[v]; suf[v] ← newnode;
      w ← suf[w];
      while w ≠ θ and (w,v) is a non-solid a-edge do
        redirect this edge to newnode; w ← suf[w].
    sink ← newsink.

This is rather complicated! The online construction of the SA is a hard task!
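For comparison, here is a compact Python sketch of the same online construction in the formulation commonly used in practice. It is not a line-by-line transcription of the pseudocode: instead of marking edges as solid or non-solid it stores for each state the length of its longest string, and an edge (w, v) counts as solid exactly when length[v] = length[w] + 1. The class and method names are illustrative.

```python
class SuffixAutomaton:
    """Suffix automaton of a string, built online one character at a time."""

    def __init__(self, text=""):
        self.next = [{}]       # transitions of each state
        self.link = [-1]       # suffix links (suf); -1 plays the role of θ
        self.length = [0]      # length of the longest string reaching a state
        self.last = 0          # current sink
        for ch in text:
            self.extend(ch)

    def extend(self, ch):
        cur = len(self.next)
        self.next.append({})
        self.link.append(-1)
        self.length.append(self.length[self.last] + 1)
        w = self.last
        while w != -1 and ch not in self.next[w]:
            self.next[w][ch] = cur
            w = self.link[w]
        if w == -1:
            self.link[cur] = 0
        else:
            v = self.next[w][ch]
            if self.length[v] == self.length[w] + 1:    # (w, v) is solid
                self.link[cur] = v
            else:                                       # split v with a clone
                clone = len(self.next)
                self.next.append(dict(self.next[v]))
                self.link.append(self.link[v])
                self.length.append(self.length[w] + 1)
                while w != -1 and self.next[w].get(ch) == v:
                    self.next[w][ch] = clone
                    w = self.link[w]
                self.link[v] = clone
                self.link[cur] = clone
        self.last = cur

    def _walk(self, u):
        state = 0
        for ch in u:
            state = self.next[state].get(ch)
            if state is None:
                return None
        return state

    def is_factor(self, u):
        return self._walk(u) is not None

    def is_suffix(self, u):
        state = self._walk(u)
        t = self.last          # terminal states lie on the suffix-link path
        while state is not None and t != -1:
            if t == state:
                return True
            t = self.link[t]
        return False
```

Built for the reversed pattern "ecnuonna" (P = announce), the automaton answers both kinds of query mentioned above: "nuon" is a factor but not a suffix, "onna" is a suffix, and "cnn" is rejected.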

Page 17: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


BNDM algorithm

The idea is basically the same as that of the BDM algorithm. Differences:

– It uses a non-deterministic version of the suffix automaton to determine whether the string read is a factor of the pattern.

– It simulates the moves of the NFA with the bit-parallel technique.

G. Navarro and M. Raffinot. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics (JEA), 5(4), 2000.

[Figure: an NFA that accepts the suffixes of P^R for pattern P = announce (states 0 to 8, plus an initial state I with ε-transitions into the chain)]

Simulate this NFA:
• Mask table: the same as that of the Shift-And method
• Initial condition: R_0 = 1^m
• State transition: R = (R << 1) & M[ T[i] ]

Page 18: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Pseudo code

BNDM (P, T)
  m ← length[P].
  n ← length[T].
  Preprocessing:
    for c ∈ Σ do M[c] ← 0^m.
    for j ← 1 to m do M[ P[j] ] ← M[ P[j] ] | 0^(j−1) 1 0^(m−j).
  Searching:
    s ← 0.
    while s ≤ n − m do
      j ← m, last ← m, R ← 1^m;
      while R ≠ 0^m do
        R ← R & M[ T[s+j] ];
        j ← j − 1;
        if R & 1 0^(m−1) ≠ 0^m then
          if j > 0 then last ← j;
          else report an occurrence at s+1;
        R ← R << 1;
      s ← s + last.
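A Python rendering of this pseudocode is sketched below. Two points are assumptions of the sketch rather than part of the slide: Python integers are unbounded, so the register is masked back to m bits after each shift, and pattern position j is stored in bit m − j so that the leftmost bit 1 0^(m−1) corresponds to position 1. The name bndm_search is illustrative.

```python
def bndm_search(text, pattern):
    """Return the 1-based start positions of pattern in text (BNDM)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    full = (1 << m) - 1                 # 1^m
    top = 1 << (m - 1)                  # 1 0^(m-1)
    mask = {}
    for j, ch in enumerate(pattern):    # 0-based j is position j + 1
        mask[ch] = mask.get(ch, 0) | (1 << (m - 1 - j))
    hits = []
    s = 0                               # 0-based start of the current window
    while s <= n - m:
        j, last, state = m, m, full
        while state != 0:
            state &= mask.get(text[s + j - 1], 0)   # read the window backward
            j -= 1
            if state & top:             # the string read is a prefix of P
                if j > 0:
                    last = j            # remember it as the next window start
                else:
                    hits.append(s + 1)  # the whole window matched
            state = (state << 1) & full # keep only m bits
        s += last
    return hits
```

Because the register is emptied after at most m shifts, the inner loop never reads to the left of the window.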

Page 19: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Backward Oracle Matching (BOM) algorithm

The idea is basically the same as that of the BDM algorithm. Differences:

– It uses a factor oracle instead of a suffix automaton.
– What BDM really needs is to know that σu is not a factor, rather than that the string u is a factor.

Features of the factor oracle:

– It may accept strings other than the factors of P. For example, in the figure, 'cnn' is accepted although it is not a factor of P^R.

– It can be constructed in O(m) time. Moreover, it is easy to implement and needs little memory.

– The number of states is m+1. The number of state transitions is at most 2m−1.

C. Allauzen, M. Crochemore, and M. Raffinot. Efficient experimental string matching by weak factor recognition. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, LNCS2089:51-72, 2001.

[Figure: a factor oracle of P^R for P = announce (states 0 to 8)]

Page 20: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Construction algorithm of Factor oracle

Oracle-on-line (P = p1p2…pm)
  Create Oracle(ε) with one single initial state 0, S(0) ← θ.
  for i ∈ 1…m do
    Oracle(p1p2…pi) ← Oracle_add_letter (Oracle(p1p2…p(i−1)), pi).

Oracle_add_letter (Oracle(P = p1p2…pm), σ)
  Create a new state m+1.
  δ(m,σ) ← m+1.
  k ← S(m).
  while k ≠ θ and δ(k,σ) = θ do
    δ(k,σ) ← m+1;
    k ← S(k).
  if k = θ then s ← 0;
  else s ← δ(k,σ).
  S(m+1) ← s.
  return Oracle(P = p1p2…pmσ).
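The construction is short enough to sketch in full, together with a BOM-style search that uses it. In this sketch −1 stands for θ, transitions are stored as one dictionary per state, and the search loop follows the usual backward-scan formulation of BOM (it is not given on the slides). The names build_factor_oracle, oracle_accepts, and bom_search are illustrative.

```python
def build_factor_oracle(word):
    """Return the transition tables of the factor oracle of `word`."""
    trans = [{}]                 # state 0
    supply = [-1]                # S(0) = θ
    for i, ch in enumerate(word):
        new = i + 1
        trans.append({})
        supply.append(0)
        trans[i][ch] = new       # δ(m, σ) ← m + 1
        k = supply[i]
        while k != -1 and ch not in trans[k]:
            trans[k][ch] = new   # add an external transition
            k = supply[k]
        supply[new] = 0 if k == -1 else trans[k][ch]
    return trans


def oracle_accepts(trans, u):
    """True if the oracle can read u from the initial state."""
    state = 0
    for ch in u:
        state = trans[state].get(ch)
        if state is None:
            return False
    return True


def bom_search(text, pattern):
    """Return the 1-based start positions of pattern in text (BOM)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    trans = build_factor_oracle(pattern[::-1])   # oracle of the reversed pattern
    hits = []
    pos = 0
    while pos <= n - m:
        state, j = 0, m
        while j > 0 and state is not None:
            state = trans[state].get(text[pos + j - 1])  # read backward
            j -= 1
        if state is not None:
            hits.append(pos + 1)    # m characters read: the window is P
        pos += j + 1                # skip past the character that failed
    return hits
```

For P^R = "ecnuonna" the oracle has 9 states and 15 = 2m − 1 transitions, and it accepts 'cnn' even though 'cnn' is not a factor, as noted for the figure.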

Page 21: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Matching time comparison


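Flexible Pattern Matching in Strings, Navarro & Raffinot, 2002: Fig. 2.22, p. 39.

[Figure: map of the fastest algorithm (Horspool, Shift-Or, BNDM, BOM) as a function of the pattern length m (axis ticks 2 to 256) and the alphabet size |Σ| (axis ticks 2 to 64); the rows for DNA and English text are marked.]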

Page 22: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Extensions of Suffix & Factor type algorithms to multiple patterns

Set Horspool algorithm
Wu-Manber algorithm

Page 23: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Suffix & Factor type algorithms for multiple patterns

Commentz-Walter algorithm
B. Commentz-Walter. A string matching algorithm fast on the average. In Proceedings of the 6th International Colloquium on Automata, Languages and Programming, LNCS71:118-132, 1979.
• A straightforward extension of the BM algorithm.

Set Horspool algorithm
• A simplified version of the Commentz-Walter algorithm based on the idea of Horspool.

Wu-Manber algorithm
S. Wu and U. Manber. A fast algorithm for multi-pattern searching. Report TR-94-17, Department of Computer Science, University of Arizona, Tucson, AZ, 1994.
• A practically fast algorithm based on hashing. Agrep employs this algorithm.

Uratani-Takeda algorithm
• A BM type algorithm with an Aho-Corasick (AC) machine. It is faster than CW.

Set Backward Oracle Matching (SBOM) algorithm
C. Allauzen and M. Raffinot. Factor oracle of a set of words. Technical report 99-11, Institut Gaspard-Monge, Université de Marne-la-Vallée, 1999.
• An extension of BOM obtained by extending the factor oracle to multiple patterns.

Page 24: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Set Horspool algorithm

• First, it makes a trie for the set of the reversed patterns in Π.
• Its matching approach is the same as that of Horspool.
  – It traverses the trie while doing a suffix search.
  – If the string read does not match any suffix of the patterns, it shifts by delta1'.

[Figure: suffix search on the text T with the reversed trie for the patterns. Labels: α, σ, β; "This range doesn't include β"; shift by delta1'.]

※ Cf. The Uratani-Takeda algorithm uses an AC machine in place of the trie, and decides the jump width by the failure function.
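The two steps above (reversed trie, then Horspool-style shifts) can be sketched as follows. The details are assumptions of this sketch: the trie is a list of dictionaries, delta1' of a character is the smallest distance from one of its non-final occurrences to the end of any pattern, capped at ℓmin, and hits are returned as (1-based start, pattern) pairs. The name set_horspool_search is illustrative.

```python
def set_horspool_search(text, patterns):
    """Return (1-based start, pattern) pairs for all pattern occurrences."""
    lmin = min(len(p) for p in patterns)
    # Trie of the reversed patterns; terminal[node] lists patterns ending there.
    trie, terminal = [{}], [[]]
    for p in patterns:
        node = 0
        for ch in reversed(p):
            if ch not in trie[node]:
                trie.append({})
                terminal.append([])
                trie[node][ch] = len(trie) - 1
            node = trie[node][ch]
        terminal[node].append(p)
    # delta1': smallest distance from a non-final occurrence to a pattern end.
    shift = {}
    for p in patterns:
        for j, ch in enumerate(p[:-1]):
            shift[ch] = min(shift.get(ch, lmin), len(p) - 1 - j)
    hits = []
    pos = lmin - 1                     # 0-based index of the window's last char
    while pos < len(text):
        node, k = 0, pos
        while k >= 0 and text[k] in trie[node]:   # suffix search, right to left
            node = trie[node][text[k]]
            for p in terminal[node]:
                hits.append((k + 1, p))
            k -= 1
        pos += shift.get(text[pos], lmin)
    return hits
```

With the patterns announce, annual, and annually and the text "CPM annual conference announce", the search reports annual at position 5 and announce at position 23.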

Page 25: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Reason why the performance decreases

[Figure: text T and the patterns P, whose lengths range from ℓmin to ℓmax]

delta ≤ ℓmin: the maximum jump width is limited to ℓmin.

When the number of patterns increases, the bad-character heuristic cannot work well, since the frequency of each character in the pattern set increases.

Page 26: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Wu-Manber algorithm

It examines whether some patterns occur by reading a block of B characters ending at the current matching position of the text (i.e. T[i−B+1…i]).

– SHIFT[ T[i−B+1…i] ]: if T[i−B+1…i] is a suffix of some pattern, it is 0. Otherwise, it gives the maximum length of a possible shift.

– HASH[ T[i−B+1…i] ]: when SHIFT returns 0 (i.e. T[i−B+1…i] is a suffix of some pattern), it gives the list of patterns that can occur at the position.

S. Wu and U. Manber. A fast algorithm for multi-pattern searching. Report TR-94-17, Department of Computer Science, University of Arizona, Tucson, AZ, 1994.

Example (B = 2):

Text T: CPM annual conference announce

Patterns Π: annually, announce, annual

SHIFT[B] =
  String:          ll no ou an un nc ua al ly nn nu ce *
  Amount of shift: 1 3 4 1 0 0 2 0 5

HASH[B] =
  String:     ce ly ua al *
  Pattern ID: 3, 1 2 φ

Steps shown in the figure:
  SHIFT[an] = 4
  SHIFT[al] = 0: some patterns may occur! HASH[al] = 2, → shift by 1
  SHIFT[l ] = 5

Page 27: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


Pseudo code

Construct_SHIFT (P = {p1, p2, …, pr})
  initialize every entry of the SHIFT table to ℓmin − B + 1.
  for each block Bl = pi[j−B+1…j] do
    if SHIFT[h1(Bl)] > mi − j then SHIFT[h1(Bl)] ← mi − j.
  (mi is the length of pi.)

Wu-Manber (P = {p1, p2, …, pr}, T = T[1…n])
  Preprocessing:
    Computation of B.
    Construction of the hash tables SHIFT and HASH.
  Searching:
    pos ← ℓmin.
    while pos ≤ n do
      i ← h1( T[pos−B+1…pos] );
      if SHIFT[i] = 0 then
        list ← HASH[ h2( T[pos−B+1…pos] ) ];
        verify all the patterns in list one by one against the text;
        pos ← pos + 1;
      else pos ← pos + SHIFT[i].

※ In the implementation of agrep ver. 4.02 (mgrep.c), the size of SHIFT, the size of HASH, and B are 4096, 8192, and 3, respectively.
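The pseudocode can be sketched in Python as below. This sketch simplifies one thing on purpose: instead of the hash functions h1 and h2 and fixed-size tables, it keys two dictionaries directly by the block string, so there are no hash collisions. SHIFT is built as in Construct_SHIFT over every block of every pattern, and HASH maps the last block of each pattern to that pattern. The names build_wm_tables and wu_manber_search are illustrative.

```python
def build_wm_tables(patterns, block=2):
    """Return (shift, bucket, default) for a Wu-Manber style search."""
    lmin = min(len(p) for p in patterns)
    if block > lmin:
        raise ValueError("block size must not exceed the shortest pattern")
    default = lmin - block + 1          # value for blocks absent from all patterns
    shift, bucket = {}, {}
    for p in patterns:
        m = len(p)
        for j in range(block, m + 1):   # j is the 1-based end of the block
            blk = p[j - block:j]
            shift[blk] = min(shift.get(blk, default), m - j)
        bucket.setdefault(p[-block:], []).append(p)
    return shift, bucket, default


def wu_manber_search(text, patterns, block=2):
    """Return (1-based start, pattern) pairs for all pattern occurrences."""
    shift, bucket, default = build_wm_tables(patterns, block)
    lmin = min(len(p) for p in patterns)
    hits = []
    pos = lmin                          # 1-based end of the current block
    while pos <= len(text):
        blk = text[pos - block:pos]
        step = shift.get(blk, default)
        if step == 0:
            # The block ends some pattern: verify the candidates one by one.
            for p in bucket.get(blk, []):
                start = pos - len(p)
                if start >= 0 and text[start:pos] == p:
                    hits.append((start + 1, p))
            pos += 1
        else:
            pos += step
    return hits
```

On the example text "CPM annual conference announce" with B = 2, the first block read is "an" with SHIFT 4, then "al" with SHIFT 0, which triggers the verification of annual, in line with the steps listed for the figure.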

Page 28: Lecture on Information Knowledge Network "Information retrieval and pattern matching"


The 3rd summary

• Suffix type algorithm
  – It matches the pattern from right to left.
  – It takes O(mn) time in the worst case, but about O(n/m) time on average when the alphabet is large enough.
  – Boyer-Moore, Galil, Horspool, and Sunday
• Factor type algorithm
  – It determines whether the string read at the current position is a factor of the pattern, and then skips the text.
  – BDM, BNDM, and BOM algorithms
• Extensions of Suffix & Factor type algorithms to multiple patterns
  – When the number of patterns increases, the bad-character heuristic does not work well, since the frequency of each character increases.
  – Set Horspool and Wu-Manber algorithms
• The next theme
  – Approximate pattern matching: pattern matching that allows errors.