Lecture on Information Knowledge Network: "Information retrieval and pattern matching"
Lecture on Information Knowledge Network, "Information retrieval and pattern matching". Laboratory of Information Knowledge Network, Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University. Takuya KIDA. The 3rd: Suffix type algorithm.
北海道大学 Hokkaido University
Lecture on Information Knowledge Network, 2010/12/23
Lecture on Information Knowledge Network
"Information retrieval and pattern matching"
Laboratory of Information Knowledge Network, Division of Computer Science,
Graduate School of Information Science and Technology, Hokkaido University
Takuya KIDA
The 3rd: Suffix type algorithm
Boyer-Moore algorithm, Galil algorithm
Horspool algorithm, Sunday algorithm
Knuth-Morris-Pratt algorithm (review)
KMP-String-Matching (T, P)
  n ← length[T].
  m ← length[P].
  q ← 1.
  next ← ComputeNext(P).
  for i ← 1 to n do
    while q > 0 and P[q] ≠ T[i] do q ← next[q];
    if q = m then report an occurrence at i-m+1;
    q ← q+1.

The next position of comparison is given by the function next (the pattern P is shifted by q - next[q]). When the value of next is 0, the comparison restarts from the next text character with the first character of P. The number of comparisons per text position is O(1) amortized.
Even in the worst case, matching takes only O(n+m) time in total, including the O(m) preprocessing of next.
Example for P = ababc: next[5] = 3, next[3] = 0.
D. E. Knuth, J. H. Morris, Jr, and V. R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(1):323-350, 1977.
Text T:    ababbababcbaababc  (positions 1 to 17)
Pattern P: ababc
The pattern occurs at position 6 of T.
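The KMP scan reviewed above can be sketched in Python as follows. This is a minimal sketch, not the lecture's code: it uses the textbook failure function (longest proper border) rather than the optimized next table of the slide, and the names compute_failure and kmp_search are illustrative assumptions.

```python
def compute_failure(pattern):
    """failure[q] = length of the longest proper border of pattern[:q + 1]."""
    failure = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = failure[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        failure[q] = k
    return failure


def kmp_search(text, pattern):
    """Return the 1-based start positions of all occurrences of pattern."""
    if not pattern:
        return []
    failure = compute_failure(pattern)
    hits = []
    q = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while q > 0 and pattern[q] != ch:
            q = failure[q - 1]          # fall back without moving in the text
        if pattern[q] == ch:
            q += 1
        if q == len(pattern):
            hits.append(i - len(pattern) + 2)  # convert 0-based end to 1-based start
            q = failure[q - 1]
    return hits


print(kmp_search("ababbababcbaababc", "ababc"))
```

On the text and pattern of the figure, the scan reports position 6 and also the later occurrence at position 13.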
Shift-And algorithm (review)
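Text T:    abababba
Pattern P: ababb

Mask table M (bit positions 1 2 3 4 5; the leftmost bit corresponds to position 1 of P):
  M(a) = 10100
  M(b) = 01011

Update rule: R_i = ((R_{i-1} << 1) | 1) & M(T[i])
This can be calculated in O(1) time.
※ Taking the AND with the mask bits M keeps only the bits that were transferred correctly.

Example step (i = 3, T[3] = a):
  R_2              = 01000
  (R_2 << 1) | 1   = 10100
  M(a)             = 10100
  R_3 (AND result) = 10100

State vectors for T = abababba:
  i     0      1      2      3      4      5      6      7      8
  T[i]  -      a      b      a      b      a      b      b      a
  R_i   00000  10000  01000  10100  01010  10100  01010  00001  10000
Bit 5 of R_7 is set, so an occurrence of P ends at position 7 of T.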
R. A. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Proceedings of the 12th International Conference on Research and Development in Information Retrieval, 168-175. ACM Press, 1989.
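As a rough illustration of the update rule R_i = ((R_{i-1} << 1) | 1) & M(T[i]), the following Python sketch keeps the state in an integer whose bit j-1 stands for pattern position j. The function name shift_and and the dictionary-based mask table are assumptions made for this example.

```python
def shift_and(text, pattern):
    """Return the 1-based start positions of all occurrences of pattern."""
    m = len(pattern)
    if m == 0:
        return []
    # mask[c] has bit j set when pattern[j] == c (j is 0-based).
    mask = {}
    for j, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << j)
    accept = 1 << (m - 1)   # bit of the last pattern position
    state = 0
    hits = []
    for i, ch in enumerate(text):
        # Extend every partial match by one character, start a new one,
        # then keep only the matches that agree with the current character.
        state = ((state << 1) | 1) & mask.get(ch, 0)
        if state & accept:
            hits.append(i - m + 2)
    return hits


print(shift_and("abababba", "ababb"))
```

For the example of this slide the only occurrence ends at position 7, so the reported start position is 3.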
General form of efficient matching algorithms
MatchingAlgorithm (P, T)
1  m ← length[P].
2  n ← length[T].
3  i ← 1.
4  while i ≦ n - m + 1 do
5    decide if i is an occurrence;
6    if i is an occurrence then report the occurrence at i;
7    decide the amount of Δ to shift the pattern safely;
8    i ← i + Δ.

Many efficient algorithms, including KMP and BM, fit into this frame. ※ Masayuki Takeda, "High-speed pattern matching algorithms for full text processing," Informatics symposium, January 1991 (written in Japanese).
Important points for speeding up the algorithm:
• How much work can we save at the 5th line?
• How large can we make Δ at the 7th line?
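The general frame can be written down as a small Python skeleton in which the occurrence test (5th line) and the shift rule (7th line) are passed in as functions. All names here (generic_match, naive_check, unit_shift) are hypothetical; the naive instance below simply compares the whole window and always shifts by 1.

```python
def generic_match(text, pattern, is_occurrence, safe_shift):
    """General frame: test position i, then move on by a safe shift."""
    m, n = len(pattern), len(text)
    hits = []
    i = 1                                        # 1-based, as in the pseudocode
    while i <= n - m + 1:
        found = is_occurrence(text, pattern, i)  # the work of the 5th line
        if found:
            hits.append(i)
        i += safe_shift(text, pattern, i, found)  # the shift of the 7th line
    return hits


def naive_check(text, pattern, i):
    """Compare the whole window starting at 1-based position i."""
    return text[i - 1:i - 1 + len(pattern)] == pattern


def unit_shift(text, pattern, i, found):
    """The smallest safe shift: always move by one position."""
    return 1


print(generic_match("ababbababcbaababc", "ababc", naive_check, unit_shift))
```

The algorithms of this lecture differ only in how they implement these two plug-in steps.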
Boyer-Moore algorithm
Text T:    a a b c d a a c b c a b c c a ・・・
Pattern P: a b c b a b a b
                     a b c b a b a b   (after the shift)
Shift P to align the rightmost 'c' in P with the current text position.

delta1(char) (bad-character heuristic) := the jump width by which we shift the pattern so that the rightmost occurrence of char in P is aligned with the current text position (if the pattern does not include char, it is equal to the pattern length).
delta1(c) = 5
Δ = delta1(char) - j + 1 = 5 - 0 = 5, where j = 1 is the number of characters compared so far (j - 1 = 0 of them matched).

Features:
• Characters of the pattern are compared from right to left.
• The values of the two functions (delta1 and delta2) are compared, and the pattern is shifted by the larger one.
• Although the time complexity of the BM algorithm is O(mn) in the worst case, it becomes about O(n/m) on average when the alphabet is large enough (sublinear!).
R. S. Boyer and J. S. Moore. A fast string searching algorithm. Communications of the ACM, 20(10):762-772, 1977.
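The delta1 table of the bad-character heuristic can be computed in one pass over the pattern. The sketch below is only an illustration of the definition above; the names delta1_table and delta1 are assumptions, and characters that do not occur in P are handled by a default of m.

```python
def delta1_table(pattern):
    """Distance from the rightmost occurrence of each character to the end of P."""
    m = len(pattern)
    table = {}
    for j, ch in enumerate(pattern, start=1):   # j is the 1-based position
        table[ch] = m - j                       # later occurrences overwrite earlier ones
    return table


def delta1(table, ch, m):
    """delta1(ch); characters that do not occur in P get the pattern length m."""
    return table.get(ch, m)


pattern = "abcbabab"
table = delta1_table(pattern)
print(delta1(table, "c", len(pattern)))   # the slide's example: delta1(c) = 5
print(delta1(table, "d", len(pattern)))   # 'd' does not occur in P
```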
delta2(j) (good-suffix heuristic)
delta2(j) := the jump width by which we shift P so that the matched suffix of P of length j-1 is aligned with another occurrence of it as a factor of P (or, if there is none, so that the longest prefix of P that is also a suffix of the matched string is aligned with it). If there is no such factor or prefix, the pattern is shifted by the full length of P.

Example 1:
Text T:    a a b c d a a b b c a b c c a ・・・
Pattern P: a b c b a b a b
                       a b c b a b a b   (after the shift)
Shift P to align 'ab' with the prefix of P.
delta2(3) = 8
Δ = delta2(3) - 3 + 1 = 8 - 2 = 6
※ There are two candidates, positions 1 and 5, for the occurrence of 'ab' used by delta2(3). However, the character to the left of position 5, namely the 4th, is 'b', the same character that has just failed to match the text character 'a'. Therefore, position 1 is the only candidate.

Example 2:
Text T:    a a b c a b a b b c a b c c a ・・・
Pattern P: a b c b a b a b
                       a b c b a b a b   (after the shift)
Shift P to align 'ab' with the prefix of P.
delta2(5) = 10
Δ = delta2(5) - 5 + 1 = 10 - 4 = 6
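To check the two values delta2(3) = 8 and delta2(5) = 10, the definition can be evaluated by brute force. The following Python sketch is an assumption-laden illustration, not the linear-time preprocessing: it tries every shift s and returns s plus the number of matched characters, and the name delta2_naive is hypothetical.

```python
def delta2_naive(pattern, j):
    """Brute-force delta2(j) for 1 <= j <= m, where j - 1 characters have matched."""
    m = len(pattern)
    k = j - 1          # number of matched characters (a suffix of P)
    p = m - k          # 1-based position in P where the mismatch occurred
    for s in range(1, m + 1):
        # The matched suffix must agree with the part of P that slides under it.
        suffix_ok = all(q - s < 1 or pattern[q - s - 1] == pattern[q - 1]
                        for q in range(p + 1, m + 1))
        # The character that slides under the mismatch must differ from P[p].
        differs = p - s < 1 or pattern[p - s - 1] != pattern[p - 1]
        if suffix_ok and differs:
            return s + k
    return m + k       # not reached: s = m always satisfies both conditions


print(delta2_naive("abcbabab", 3), delta2_naive("abcbabab", 5))
```

Each call inspects O(m) shifts with O(m) work per shift, so filling the whole table this way is far slower than the O(m) method, which is why it serves only as a cross-check.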
Problem of BM method
• It is complicated to decide the values of the delta functions.
  – It takes O(m^2) time in a naive way.
  – Reducing it to O(m) is somewhat troublesome → it is done in a similar way to KMP.
• It is costly to compare the values of delta1 and delta2 in each iteration.
  – Generally, only delta1 is used. (However, we have to make sure that the pattern still moves forward, since delta1 alone can give a shift of zero or less.)
• It takes O(mn) time in the worst case.
  – Consider the case T = a^n and P = ba^m.
• The efficiency of BM declines when the alphabet size is small.
  – For binary strings, i.e. strings over Σ = {0,1}, the values of Δ would be very small.
Galil algorithm
Since the original BM method forgets the information about the string it has already matched, it takes O(mn) time in the worst case.
The idea for improvement is to remember how long a prefix of P has already been matched with the text.
The Galil algorithm scans the text in O(n) time in theory, but it slows down in practice since the algorithm becomes much more complicated.

Text T:    a a b c a b a b c b a b a b a ・・・
Pattern P: a b c b a b a b
                       a b c b a b a b   (after the shift; delta2(5) = 10)
• Remember that the leading positions of the shifted pattern (the prefix 'ab') have already been compared.
• Only the remaining positions are to be compared!
• Each character of the text is compared at most twice!

Z. Galil. On improving the worst case running time of the Boyer-Moore string searching algorithm. Communications of the ACM, 22(9):505-508, 1979.
Horspool algorithm
If Σ is large enough, delta1 (the bad-character heuristic) gives the best shift amount in most cases.
→ A small modification can enlarge the jump width.

Boyer-Moore decides the jump width by the mismatched text character:
Text T:    a a b c d c a d b c a b c c a b a c a ・・・
Pattern P: a b c b a b a d
delta1(c) = 5

Horspool always decides the jump width by the character of the text at the end position of the pattern:
Text T:    a a b c d c a d b c a b c c a b a c a ・・・
Pattern P: a b c b a b a d
delta1'(d) = 10
delta1'(b) = 3

R. N. Horspool. Practical fast searching in strings. Software Practice and Experience, 10(6):501-506, 1980.
Pseudo code
Horspool (P, T)
  m ← length[P].
  n ← length[T].
  Preprocessing:
    for each c ∈ Σ do delta1'[c] ← m.
    for j ← 1 to m - 1 do delta1'[ P[j] ] ← m - j.
  Searching:
    i ← 0.
    while i ≦ n - m do
      j ← m;
      while j > 0 and T[i+j] = P[j] do j ← j - 1;
      if j = 0 then report an occurrence at i+1;
      i ← i + delta1'[ T[i+m] ].
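The pseudocode translates almost line by line into Python. The sketch below assumes 0-based strings and stores delta1' in a dictionary with a default of m for characters that are not in the table; the function name horspool is an illustrative choice.

```python
def horspool(text, pattern):
    """Return the 1-based start positions of all occurrences of pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Preprocessing: delta1'[c] = distance from the rightmost occurrence of c
    # among the first m - 1 pattern characters to the end of the pattern.
    shift = {}
    for j in range(m - 1):
        shift[pattern[j]] = m - 1 - j
    # Searching: compare right to left, then shift by the window's last character.
    hits = []
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(i + 1)
        i += shift.get(text[i + m - 1], m)
    return hits


print(horspool("ababbababcbaababc", "ababc"))
```

Because the last pattern character is excluded from the table, the shift is always at least 1, so the loop always makes progress.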
Sunday algorithm
It is basically based on the BM method. Differences:
– It may compare the characters of the pattern in an arbitrary order. For example, it can compare them starting from the most infrequent character.
– It uses the text character just to the right of the end position of the pattern to determine the value of delta1 (it also calculates delta2, and then takes the larger of the two).
The jump width tends to be longer than that of Horspool.
– However, the memory consumption is larger than that of Horspool.
– Moreover, it takes much more time to decide the jump width than Horspool.

Text T:    a a b c d c a b d c a b c c a b a c a ・・・
Pattern P: a b c b a b a b
Always decide the jump width by the character of the text just to the right of the end position of the pattern.
delta1'(d) = 9
delta1'(c) = 6

D. M. Sunday. A very fast substring search algorithm. Communications of the ACM, 33(8):132-142, 1990.
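A simplified sketch of the Sunday idea is given below in Python. It is an assumption-heavy reduction: it uses only the delta1' rule on the character just to the right of the window and compares the window with a plain slice comparison, leaving out delta2 and the frequency-ordered comparison. The names sunday_shift_table and sunday_search are hypothetical.

```python
def sunday_shift_table(pattern):
    """delta1'[c] = m - j + 1 for the rightmost 1-based position j of c in P."""
    m = len(pattern)
    return {ch: m - j for j, ch in enumerate(pattern)}   # j is 0-based here


def sunday_search(text, pattern):
    """Return the 1-based start positions of all occurrences of pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    shift = sunday_shift_table(pattern)
    hits = []
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            hits.append(i + 1)
        if i + m >= n:          # there is no character to the right of the window
            break
        # A character that does not occur in P lets the window jump by m + 1.
        i += shift.get(text[i + m], m + 1)
    return hits


table = sunday_shift_table("abcbabab")
print(table["c"], table.get("d", len("abcbabab") + 1))
print(sunday_search("ababbababcbaababc", "ababc"))
```

The first printed line reproduces the two values of the slide, delta1'(c) = 6 and delta1'(d) = 9.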
Factor type algorithm
BDM algorithm, BOM algorithm
BNDM algorithm
Backward Dawg Matching (BDM) algorithm
It is basically based on the BM method. Differences:
– It decides whether the pattern occurs at the current position by checking whether the string read so far matches any factor of the pattern, not just a suffix of the pattern.
– It uses a suffix automaton (or a suffix tree) to determine whether the string read so far is a factor of the pattern.
Features of the suffix automaton (SA):
– It can tell whether a string u is a factor of pattern P in O(|u|) time.
– It can also tell whether a string u is a suffix of P.
– For P = p1p2…pm, there exists an online construction algorithm that runs in O(m) time.

M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Speeding up two string matching algorithms. Algorithmica, 12(4/5):247-267, 1994.

Text T:    a a b c d c a b d c a b c c a b a c a ・・・
Pattern P: a b c b a b a b
Factor search: the window is read from right to left; u is the factor read so far and σ is the text character just before it (here σu = 'cc').
Since 'cc' is not a factor of P and 'c' is not a prefix of P, the pattern can be shifted safely to the position just after σ.
Whether the string read so far is a prefix of P can be seen from the second feature of the SA.
Suffix Automaton
A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M. T. Chen, and J. Seiferas. The smallest automaton recognizing the subwords of a text. Theoretical Computer Science, 40:31-55, 1985.
[Figure: the suffix trie, the suffix tree, and the suffix automaton for the reverse P^R = ecnuonna of P = announce. The suffix automaton is an automaton that accepts the suffixes of P^R.]
On-line construction algorithm
SuffixAutomaton(P = p1p2…pm)
  Create the one-node graph G = DAWG(ε).
  root ← sink ← the node of G. suf[root] ← θ.
  for i ← 1 to m do
    create a new node newsink;
    make a solid edge (sink, newsink) labeled by a (= pi);
    w ← suf[sink];
    while w ≠ θ and son(w,a) = θ do
      make a non-solid a-edge (w, newsink);
      w ← suf[w];
    v ← son(w,a);
    if w = θ then suf[newsink] ← root
    else if (w,v) is a solid edge then suf[newsink] ← v
    else
      create a node newnode;
      newnode has the same outgoing edges as v except that they are all non-solid;
      change (w,v) into a solid edge (w, newnode);
      suf[newsink] ← newnode;
      suf[newnode] ← suf[v]; suf[v] ← newnode;
      w ← suf[w];
      while w ≠ θ and (w,v) is a non-solid a-edge do
        redirect this edge to newnode; w ← suf[w].
    sink ← newsink.

This is rather complicated! The online construction of the SA is a hard task!
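For comparison, here is a compact Python sketch of an online suffix automaton construction. It follows the commonly taught formulation with a length per state and state cloning instead of the solid/non-solid edge bookkeeping of the pseudocode, so it should be read as an equivalent-in-spirit variant under that assumption; the class and method names are hypothetical.

```python
class SuffixAutomaton:
    """Online suffix automaton; state 0 is the root."""

    def __init__(self, word):
        self.next = [{}]       # outgoing edges of each state
        self.link = [-1]       # suffix link (suf); -1 plays the role of θ
        self.length = [0]      # length of the longest string reaching the state
        self.last = 0          # current sink
        for ch in word:
            self._extend(ch)

    def _extend(self, ch):
        cur = len(self.next)
        self.next.append({})
        self.link.append(-1)
        self.length.append(self.length[self.last] + 1)
        p = self.last
        while p != -1 and ch not in self.next[p]:
            self.next[p][ch] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][ch]
            if self.length[p] + 1 == self.length[q]:   # "solid" edge case
                self.link[cur] = q
            else:                                      # split q by cloning it
                clone = len(self.next)
                self.next.append(dict(self.next[q]))
                self.link.append(self.link[q])
                self.length.append(self.length[p] + 1)
                while p != -1 and self.next[p].get(ch) == q:
                    self.next[p][ch] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def _walk(self, u):
        state = 0
        for ch in u:
            if ch not in self.next[state]:
                return None
            state = self.next[state][ch]
        return state

    def is_factor(self, u):
        return self._walk(u) is not None

    def is_suffix(self, u):
        state = self._walk(u)
        t = self.last              # terminal states lie on the suffix-link path
        while t != -1:
            if t == state:
                return True
            t = self.link[t]
        return False


sa = SuffixAutomaton("ecnuonna")   # P^R for P = announce
print(sa.is_factor("nuon"), sa.is_factor("cnn"), sa.is_suffix("onna"))
```

The two query methods correspond to the first two features of the SA listed for BDM: factor test in O(|u|) time and suffix test.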
BNDM algorithm
The idea is basically the same as that of the BDM algorithm. Differences:
– It uses a non-deterministic version of the suffix automaton to determine whether the string read so far is a factor of the pattern.
– It simulates the moves of the NFA with the bit-parallel technique.

G. Navarro and M. Raffinot. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics (JEA), 5(4), 2000.

[Figure: an NFA that accepts the suffixes of P^R for pattern P = announce; an initial state I is connected by ε-transitions to the states 0 to 8 of the chain that spells e, c, n, u, o, n, n, a.]
Simulating this NFA:
– The mask table M is the same as that of the Shift-And method.
– Initial condition: R_0 = 1^m
– State transition: R = (R << 1) & M[ T[i] ]
Pseudo code
BNDM (P, T)
  m ← length[P].
  n ← length[T].
  Preprocessing:
    for c ∈ Σ do M[c] ← 0^m.
    for j ← 1 to m do M[ P[j] ] ← M[ P[j] ] | 0^(j-1) 1 0^(m-j).
  Searching:
    s ← 0.
    while s ≦ n - m do
      j ← m, last ← m, R ← 1^m;
      while R ≠ 0^m do
        R ← R & M[ T[s+j] ];
        j ← j - 1;
        if R & 1 0^(m-1) ≠ 0^m then
          if j > 0 then last ← j;
          else report an occurrence at s+1;
        R ← R << 1;
      s ← s + last.
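The BNDM pseudocode can be sketched in Python as follows. Since Python integers are unbounded, the shifted state is masked back to m bits explicitly, which is an implementation detail assumed here and not part of the slide; the function name bndm is illustrative.

```python
def bndm(text, pattern):
    """Return the 1-based start positions of all occurrences of pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    full = (1 << m) - 1          # the bit vector 1^m
    high = 1 << (m - 1)          # the bit vector 1 0^(m-1)
    # mask[c] has the bit for 1-based position j set (value 2^(m-j)) when P[j] = c.
    mask = {}
    for j, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << (m - 1 - j))
    hits = []
    s = 0
    while s <= n - m:
        j, last, state = m, m, full
        while state != 0:
            state &= mask.get(text[s + j - 1], 0)   # read the window right to left
            j -= 1
            if state & high:           # the string read so far is a prefix of P
                if j > 0:
                    last = j           # remember the shift that aligns this prefix
                else:
                    hits.append(s + 1)  # the whole window has been read and matched
            state = (state << 1) & full
        s += last
    return hits


print(bndm("ababbababcbaababc", "ababc"))
```

After a full match only the highest bit can be set, so the masked shift empties the state and the inner loop stops without reading outside the window.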
Backward Oracle Matching (BOM) algorithm
The idea is basically the same as that of the BDM algorithm. Differences:
– It uses a factor oracle instead of a suffix automaton.
– What BDM really needs is to know that σu is not a factor, rather than that the string u is a factor.
Features of the factor oracle:
– It may accept strings other than the factors of P. For example, the oracle in the figure accepts 'cnn', which is not a factor of P^R.
– It can be constructed in O(m) time. Moreover, it is easy to implement and needs little memory space.
– The number of states is m+1. The number of state transitions is at most 2m-1.

C. Allauzen, M. Crochemore, and M. Raffinot. Efficient experimental string matching by weak factor recognition. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, LNCS 2089:51-72, 2001.

[Figure: a factor oracle of P^R for P = announce, with states 0 to 8.]
Construction algorithm of Factor oracle
Oracle-on-line (P = p1p2…pm)
  Create Oracle(ε) with one single initial state 0, S(0) ← θ.
  for j ∈ 1…m do
    Oracle(p1p2…pj) ← Oracle_add_letter (Oracle(p1p2…pj-1), pj).

Oracle_add_letter (Oracle(P = p1p2…pm), σ)
  Create a new state m+1.
  δ(m,σ) ← m+1.
  k ← S(m).
  while k ≠ θ and δ(k,σ) = θ do
    δ(k,σ) ← m+1;
    k ← S(k).
  if k = θ then s ← 0;
  else s ← δ(k,σ).
  S(m+1) ← s.
  return Oracle(P = p1p2…pmσ).
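The construction above is short enough to sketch directly in Python, together with a simple BOM-style search that uses it. The names build_factor_oracle, oracle_accepts, and bom_search are assumptions; θ is encoded as -1, and after an occurrence the window is moved by only one position, which is a simplification chosen for this sketch.

```python
def build_factor_oracle(word):
    """Return the transition table of the factor oracle of word (states 0..m)."""
    m = len(word)
    trans = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)            # S(0) = θ is encoded as -1
    for i, ch in enumerate(word):
        new = i + 1
        trans[i][ch] = new
        k = supply[i]
        while k != -1 and ch not in trans[k]:
            trans[k][ch] = new
            k = supply[k]
        supply[new] = 0 if k == -1 else trans[k][ch]
    return trans


def oracle_accepts(trans, u):
    """True if the oracle can read u starting from state 0."""
    state = 0
    for ch in u:
        if ch not in trans[state]:
            return False
        state = trans[state][ch]
    return True


def bom_search(text, pattern):
    """Return the 1-based start positions of all occurrences of pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    trans = build_factor_oracle(pattern[::-1])     # oracle of P^R
    hits = []
    s = 0
    while s <= n - m:
        state, j = 0, m
        while j > 0 and state is not None:
            state = trans[state].get(text[s + j - 1])   # read right to left
            j -= 1
        if state is not None:
            hits.append(s + 1)   # the only word of length m accepted is P^R
            s += 1
        else:
            s += j + 1           # restart just after the character that failed
    return hits


oracle = build_factor_oracle("ecnuonna")
print(sum(len(t) for t in oracle), oracle_accepts(oracle, "cnn"))
print(bom_search("ababbababcbaababc", "ababc"))
```

For P^R = ecnuonna the oracle has 15 = 2m-1 transitions and accepts 'cnn', in line with the features listed for the factor oracle.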
Matching time comparison
Flexible Pattern Matching in Strings, Navarro&Raffinot, 2002: Fig.2.22, p39.
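[Figure: map of the fastest algorithm as a function of the pattern length m (2 to 256) and the alphabet size |Σ| (2 to 64). The regions are labeled Horspool, Shift-Or, BNDM, and BOM; the alphabet sizes of DNA and English are marked.]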
Extensions of Suffix & Factor type algorithms to multiple patterns
Set Horspool algorithm, Wu-Manber algorithm
Suffix & Factor type algorithms for multiple patterns
Commentz-Walter algorithm
  B. Commentz-Walter. A string matching algorithm fast on the average. In Proceedings of the 6th International Colloquium on Automata, Languages and Programming, LNCS 71:118-132, 1979.
  • A straightforward extension of the BM algorithm.
Set Horspool algorithm
  • A simplified version of the Commentz-Walter algorithm based on the idea of Horspool.
Wu-Manber algorithm
  S. Wu and U. Manber. A fast algorithm for multi-pattern searching. Report TR-94-17, Department of Computer Science, University of Arizona, Tucson, AZ, 1994.
  • A practically fast algorithm based on hashing. Agrep employs this algorithm.
Uratani-Takeda algorithm
  • A BM type algorithm with an AC machine. It is faster than CW.
Set Backward Oracle Matching (SBOM) algorithm
  C. Allauzen and M. Raffinot. Factor oracle of a set of words. Technical report 99-11, Institut Gaspard-Monge, Universite de Marne-la-Vallee, 1999.
  • An extension of BOM obtained by extending the factor oracle to multiple patterns.
Set Horspool algorithm
First, it builds a trie for the set of the reversed patterns in Π. Its matching approach is the same as that of Horspool.
– It traverses the trie while doing a suffix search.
– If the string read so far does not match any suffix of the patterns, it shifts by delta1'.
[Figure: suffix search on Text T through the reversed trie for the patterns (α is the string read and σ the character where the search fails); the text is then shifted by delta1' for the character β, over a range of the patterns that does not include β.]
※ Cf. the Uratani-Takeda algorithm uses an AC machine in place of the trie, and decides the jump width by the failure function.
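A minimal Python sketch of Set Horspool is shown below. The nested-dictionary trie, the None key that marks the end of a reversed pattern, and the function name set_horspool are all assumptions made for this example; the shift table is the Horspool table minimized over all patterns and capped at ℓmin.

```python
def set_horspool(text, patterns):
    """Return (1-based start position, pattern) pairs in order of discovery."""
    lmin = min(len(p) for p in patterns)
    # Trie of the reversed patterns; the key None lists the patterns ending at a node.
    root = {}
    for p in patterns:
        node = root
        for ch in reversed(p):
            node = node.setdefault(ch, {})
        node.setdefault(None, []).append(p)
    # delta1'[c]: smallest distance from an occurrence of c (not at the last
    # position of a pattern) to the end of that pattern, capped at lmin.
    shift = {}
    for p in patterns:
        m = len(p)
        for j in range(m - 1):
            shift[p[j]] = min(shift.get(p[j], lmin), m - 1 - j)
    hits = []
    n = len(text)
    pos = lmin - 1                     # 0-based end position of the window
    while pos < n:
        node, k = root, pos
        while k >= 0 and text[k] in node:    # suffix search through the trie
            node = node[text[k]]
            k -= 1
            for p in node.get(None, ()):
                hits.append((k + 2, p))
        pos += shift.get(text[pos], lmin)
    return hits


print(set_horspool("CPM annual conference announce",
                   ["annually", "annual", "announce"]))
```

The cap at ℓmin in the shift table is exactly the limitation discussed for multiple patterns: no jump can be longer than the shortest pattern.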
Reason why the performance decreases
[Figure: Text T aligned with patterns P whose lengths range from ℓmin to ℓmax.]
delta ≦ ℓmin: the maximum jump width is limited to ℓmin.
When the number of patterns increases, the bad-character heuristic cannot work well, since each character occurs more frequently in the patterns.
Wu-Manber algorithm
It examines whether some patterns occur or not by reading B characters from the current matching position of the text (i.e. T[i-B+1…i]).
– SHIFT[ T[i-B+1…i] ]: if T[i-B+1…i] is a suffix of some pattern, it returns 0. Otherwise, it returns the maximum length of the possible shift.
– HASH[ T[i-B+1…i] ]: when SHIFT returns 0 (i.e. T[i-B+1…i] is a suffix of some pattern), it returns the list of patterns that can occur at that position.

S. Wu and U. Manber. A fast algorithm for multi-pattern searching. Report TR-94-17, Department of Computer Science, University of Arizona, Tucson, AZ, 1994.

Example (B = 2):
Text T:     CPM annual conference announce
Patterns Π: annually, announce, annual

SHIFT[B] =
  String           an  nn  nu  ua  al  ll  ly  no  ou  un  nc  ce  *
  Amount of shift  4   3   2   1   0   1   0   4   3   2   1   0   5
HASH[B] =
  String      ce  ly  al  *
  Pattern ID  3   1   2   φ

Trace:
– SHIFT[an] = 4
– SHIFT[al] = 0: some patterns may occur! HASH[al] = 2 → verify, then shift by 1.
– SHIFT[l␣] = 5
Pseudo code
Construct_SHIFT (P = {p1,p2,…,pr})
  Initialize every entry of the SHIFT table to ℓmin-B+1.
  for each block Bl = pi[j-B+1…j] do
    if SHIFT[h1(Bl)] > mi - j then SHIFT[h1(Bl)] ← mi - j.

Wu-Manber (P = {p1,p2,…,pr}, T = T[1…n])
  Preprocessing:
    Computation of B.
    Construction of the hash tables SHIFT and HASH.
  Searching:
    pos ← ℓmin.
    while pos ≦ n do
      i ← h1( T[pos-B+1…pos] );
      if SHIFT[i] = 0 then
        list ← HASH[ h2( T[pos-B+1…pos] ) ];
        Verify all the patterns in list one by one against the text;
        pos ← pos + 1;
      else pos ← pos + SHIFT[i].

※ In the actual implementation of agrep ver. 4.02 (mgrep.c), the size of SHIFT, the size of HASH, and B are 4096, 8192, and 3, respectively.
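The two procedures can be sketched in Python as below. To keep the example small, Python dictionaries keyed by the block string itself stand in for the hash functions h1 and h2, and B is fixed by the caller; these choices, like the names build_tables and wu_manber, are assumptions of this sketch and not the agrep implementation.

```python
def build_tables(patterns, B=2):
    """Build SHIFT and HASH as dictionaries keyed by blocks of B characters."""
    lmin = min(len(p) for p in patterns)
    assert lmin >= B, "every pattern must contain at least one block"
    default = lmin - B + 1              # shift for blocks that occur in no pattern
    shift, bucket = {}, {}
    for p in patterns:
        m = len(p)
        for j in range(B, m + 1):       # j = 1-based end position of the block
            block = p[j - B:j]
            shift[block] = min(shift.get(block, default), m - j)
        bucket.setdefault(p[-B:], []).append(p)   # patterns grouped by last block
    return shift, bucket, default, lmin


def wu_manber(text, patterns, B=2):
    """Return (1-based start position, pattern) pairs in order of discovery."""
    shift, bucket, default, lmin = build_tables(patterns, B)
    hits = []
    n = len(text)
    pos = lmin                          # 1-based end position of the current block
    while pos <= n:
        block = text[pos - B:pos]
        s = shift.get(block, default)
        if s == 0:
            # The block ends some pattern: verify each candidate against the text.
            for p in bucket.get(block, ()):
                start = pos - len(p)
                if start >= 0 and text[start:pos] == p:
                    hits.append((start + 1, p))
            pos += 1
        else:
            pos += s
    return hits


shift, bucket, default, lmin = build_tables(["annually", "annual", "announce"])
print(shift["an"], shift["al"], shift.get("l ", default))
print(wu_manber("CPM annual conference announce",
                ["annually", "annual", "announce"]))
```

With real hash tables of fixed size, different blocks can collide, which only makes shifts smaller and candidate lists longer; the verification step keeps the result correct.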
The 3rd summary
• Suffix type algorithm
  – It matches the pattern from right to left.
  – It takes O(mn) time in the worst case, but about O(n/m) time on average.
  – Boyer-Moore, Galil, Horspool, and Sunday
• Factor type algorithm
  – It determines whether the string read at the current position is a factor of the pattern or not, and then skips the text.
  – BDM, BNDM, and BOM algorithms
• Extensions of Suffix & Factor type algorithms to multiple patterns
  – When the number of patterns increases, the bad-character heuristic does not work well, since each character occurs more frequently in the patterns.
  – Set Horspool and Wu-Manber algorithms
• The next theme
  – Approximate pattern matching: pattern matching that allows errors.