CYK (Cocke-Younger-Kasami) Parsing Algorithm

Post on 14-Jan-2016



Seyed Mohammad Hossein Moattar
Natural Language Processing

Amirkabir University of Technology
Department of Computer Engineering

Parsing Algorithms

CFGs are the basis for describing the (syntactic) structure of NL sentences.

Thus, parsing algorithms are the core of NL analysis systems.

Recognition vs. parsing:

– Recognition: deciding whether the input is a member of the language
– Parsing: recognition plus producing a parse tree for the input

Is parsing more “difficult” than recognition in terms of time complexity?

Ambiguity: an input may have exponentially many parses.

Parsing Algorithms

Parsing general CFLs vs. limited forms.

Efficiency:

– Deterministic (LR) languages can be parsed in linear time
– A number of parsing algorithms for general CFLs require O(n^3) time
– The asymptotically best parsing algorithm for general CFLs requires O(n^2.37) time, but is not practical

Utility: why parse general grammars and not just CNF?

– A grammar is intended to reflect the actual structure of the language
– Conversion to CNF completely destroys the parse structure

CYK (Cocke-Younger-Kasami)

One of the earliest recognition and parsing algorithms.

The standard version of CYK can only recognize languages defined by context-free grammars in Chomsky Normal Form (CNF).

It is also possible to extend the CYK algorithm to handle some grammars that are not in CNF:

– Harder to understand

Based on a “dynamic programming” approach:

– Build solutions compositionally from sub-solutions
– Store sub-solutions and re-use them whenever necessary

Uses the grammar directly (no PDA is used).

Recognition version: decide whether S =>* w, i.e. whether the start symbol S derives the input string w.

CYK Algorithm

The CYK algorithm for the membership problem is as follows:

– Let the input string be a sequence of n letters a1 ... an.
– Let the grammar contain r nonterminal symbols R1 ... Rr, and let R1 be the start symbol.
– Let P[n,n,r] be an array of booleans. Initialize all elements of P to false.
– For each i = 1 to n
    • For each unit production Rj -> ai, set P[i,1,j] = true.
– For each i = 2 to n -- Length of span
    • For each j = 1 to n-i+1 -- Start of span
        – For each k = 1 to i-1 -- Partition of span
            » For each production RA -> RB RC
            » If P[j,k,B] and P[j+k,i-k,C] then set P[j,i,A] = true
– If P[1,n,1] is true
    • Then the string is a member of the language
    • Else the string is not a member of the language
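The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function name `cyk_recognize` and the rule encoding (lists of tuples) are assumptions made for the example, the start position is counted from 0 rather than 1, and the example grammar is the a^n b^n grammar used later in these slides with its S -> ε rule left out, so the empty string is simply rejected.

```python
def cyk_recognize(word, nonterminals, unit_rules, binary_rules):
    """Return True if `word` can be derived from nonterminals[0] (the start symbol).

    unit_rules:   list of (head, terminal) pairs, e.g. ("A", "a")
    binary_rules: list of (head, left, right) triples, e.g. ("S", "A", "B")
    """
    n = len(word)
    r = len(nonterminals)
    index = {name: k for k, name in enumerate(nonterminals)}
    if n == 0:
        return False  # assumption: epsilon rules are handled outside this sketch

    # P[start][length][k] is True when the substring of the given length that
    # begins at `start` (0-based) can be derived from nonterminal number k.
    P = [[[False] * r for _ in range(n + 1)] for _ in range(n)]

    # Spans of length 1: apply the unit productions.
    for start, symbol in enumerate(word):
        for head, terminal in unit_rules:
            if terminal == symbol:
                P[start][1][index[head]] = True

    # Longer spans: try every split point and every binary production.
    for length in range(2, n + 1):
        for start in range(0, n - length + 1):
            for split in range(1, length):
                for head, left, right in binary_rules:
                    if (P[start][split][index[left]]
                            and P[start + split][length - split][index[right]]):
                        P[start][length][index[head]] = True

    return P[0][n][index[nonterminals[0]]]


# Example grammar (S is the start symbol because it is listed first).
NONTERMINALS = ["S", "T", "X", "A", "B"]
UNIT_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [
    ("S", "A", "B"), ("S", "X", "B"),
    ("T", "A", "B"), ("T", "X", "B"),
    ("X", "A", "T"),
]

print(cyk_recognize("aaabbb", NONTERMINALS, UNIT_RULES, BINARY_RULES))  # True
print(cyk_recognize("aaabb", NONTERMINALS, UNIT_RULES, BINARY_RULES))   # False
```

The three nested loops over span length, start and split point, multiplied by the number of productions, give the cubic running time in the input length.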

CYK Pseudocode

On input x = x1x2 … xn:

for (i = 1 to n) // create middle diagonal
    for (each var. A)
        if (A -> xi)
            add A to table[i-1][i]
for (d = 2 to n) // d’th diagonal
    for (i = 0 to n-d)
        for (k = i+1 to i+d-1)
            for (each var. A)
                for (each var. B in table[i][k])
                    for (each var. C in table[k][i+d])
                        if (A -> BC)
                            add A to table[i][i+d]
return S in table[0][n] ? ACCEPT : REJECT
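As a rough Python counterpart of this pseudocode, the sketch below keeps a set of variables per table cell, where cell (i, j) covers the symbols between positions i and j. The function name `cyk_table` and the tuple encoding of the rules are illustrative assumptions; looping over the productions directly replaces the three inner "for each var." loops.

```python
from collections import defaultdict


def cyk_table(x, terminal_rules, binary_rules):
    """Fill the CYK table for input x.

    table[(i, j)] is the set of variables that derive x[i:j]
    (positions are counted between symbols, starting at 0).
    """
    n = len(x)
    table = defaultdict(set)

    # Middle diagonal: cells (i-1, i) hold variables A with A -> x_i.
    for i in range(1, n + 1):
        for head, terminal in terminal_rules:
            if terminal == x[i - 1]:
                table[(i - 1, i)].add(head)

    # d'th diagonal: substrings of length d, split at position k.
    for d in range(2, n + 1):
        for i in range(0, n - d + 1):
            for k in range(i + 1, i + d):
                for head, left, right in binary_rules:
                    if left in table[(i, k)] and right in table[(k, i + d)]:
                        table[(i, i + d)].add(head)
    return table


# Grammar from the worked example in these slides (S -> epsilon omitted).
TERMINAL_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [
    ("S", "A", "B"), ("S", "X", "B"),
    ("T", "A", "B"), ("T", "X", "B"),
    ("X", "A", "T"),
]

table = cyk_table("aaabbb", TERMINAL_RULES, BINARY_RULES)
print("ACCEPT" if "S" in table[(0, 6)] else "REJECT")  # ACCEPT
```

Returning the whole table rather than only the verdict makes it possible to inspect the intermediate cells.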

CYK Algorithm

This algorithm considers every possible consecutive subsequence of the sequence of letters and sets P[i,j,k] to true if the subsequence of length j starting at position i can be generated from Rk.

Once it has considered subsequences of length 1, it goes on to subsequences of length 2, and so on.

For subsequences of length 2 and greater, it considers every possible partition of the subsequence into two parts, and checks whether there is some production RA -> RB RC such that RB matches the first part and RC matches the second part. If so, it records RA as matching the whole subsequence.

Once this process is completed, the sentence is recognized by the grammar if the subsequence containing the entire string is matched by the start symbol.

CYK Algorithm for Deciding Context Free Languages

Q: Consider the grammar G given by

S -> ε | AB | XB
T -> AB | XB
X -> AT
A -> a
B -> b

1. Is x = aaabb in L(G)?

2. Is x = aaabbb in L(G)?

The algorithm is “bottom-up”: we start at the bottom of the derivation tree, with the individual symbols of the input string, and work upward.

a a a b b

1) Write variables for all length 1 substrings of aaabb: each a is derived by A (A -> a) and each b by B (B -> b).

a a a b b
A A A B B

2) Write variables for all length 2 substrings of aaabb:

– ab (symbols 3-4): S, T, by S -> AB and T -> AB
– aa (symbols 1-2), aa (symbols 2-3), bb (symbols 4-5): no variables

3) Write variables for all length 3 substrings of aaabb:

– aab (symbols 2-4): X, by X -> AT, with A deriving a and T deriving ab
– aaa (symbols 1-3), abb (symbols 3-5): no variables

4) Write variables for all length 4 substrings of aaabb:

– aabb (symbols 2-5): S, T, by S -> XB and T -> XB, with X deriving aab and B deriving b
– aaab (symbols 1-4): no variables

5) Write variables for all length 5 substrings of aaabb:

– aaabb (symbols 1-5): X, by X -> AT, with A deriving a and T deriving aabb

S is not among the variables for the whole string, so aaabb is not in L(G): REJECT!

CYK Algorithm for Deciding Context Free Languages

Now look at aaabbb:

a a a b b b

1) Write variables for all length 1 substrings of aaabbb: each a is derived by A and each b by B.

a a a b b b
A A A B B B

2) Write variables for all length 2 substrings of aaabbb:

– ab (symbols 3-4): S, T, by S -> AB and T -> AB
– aa (symbols 1-2), aa (symbols 2-3), bb (symbols 4-5), bb (symbols 5-6): no variables

3) Write variables for all length 3 substrings of aaabbb:

– aab (symbols 2-4): X, by X -> AT, with A deriving a and T deriving ab
– aaa (symbols 1-3), abb (symbols 3-5), bbb (symbols 4-6): no variables

4) Write variables for all length 4 substrings of aaabbb:

– aabb (symbols 2-5): S, T, by S -> XB and T -> XB, with X deriving aab and B deriving b
– aaab (symbols 1-4), abbb (symbols 3-6): no variables

5) Write variables for all length 5 substrings of aaabbb:

– aaabb (symbols 1-5): X, by X -> AT, with A deriving a and T deriving aabb
– aabbb (symbols 2-6): no variables

6) Write variables for all length 6 substrings of aaabbb:

– aaabbb (symbols 1-6): S, T, by S -> XB and T -> XB, with X deriving aaabb and B deriving the final b

S is included, so aaabbb is accepted!

CYK Algorithm for Deciding Context Free Languages

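Can also use a table for the same purpose. Positions are counted between symbols (0 a 1 a 2 a 3 b 4 b 5 b 6); the entry in row i, column j holds the variables that derive the substring starting at position i and ending at position j.

start at \ end at   1     2     3     4     5     6
0
1
2
3
4
5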

1. Variables for length 1 substrings.

start at \ end at   1     2     3     4     5     6
0                   A
1                         A
2                               A
3                                     B
4                                           B
5                                                 B

2. Variables for length 2 substrings.

start at \ end at   1     2     3     4     5     6
0                   A     -
1                         A     -
2                               A     S,T
3                                     B     -
4                                           B     -
5                                                 B

3. Variables for length 3 substrings.

start at \ end at   1     2     3     4     5     6
0                   A     -     -
1                         A     -     X
2                               A     S,T   -
3                                     B     -     -
4                                           B     -
5                                                 B

4. Variables for length 4 substrings.

start at \ end at   1     2     3     4     5     6
0                   A     -     -     -
1                         A     -     X     S,T
2                               A     S,T   -     -
3                                     B     -     -
4                                           B     -
5                                                 B

5. Variables for length 5 substrings.

start at \ end at   1     2     3     4     5     6
0                   A     -     -     -     X
1                         A     -     X     S,T   -
2                               A     S,T   -     -
3                                     B     -     -
4                                           B     -
5                                                 B

6. Variables for aaabbb. S appears in the entry for the whole string (row 0, column 6), so the string is ACCEPTED!

start at \ end at   1     2     3     4     5     6
0                   A     -     -     -     X     S,T
1                         A     -     X     S,T   -
2                               A     S,T   -     -
3                                     B     -     -
4                                           B     -
5                                                 B

Parsing results

We keep the results for every substring wij of the input in a table.

Note that we only need to fill in entries up to the diagonal: the longest substring starting at position i has length n-i+1.
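To make the note about the diagonal concrete, the short sketch below lists the (start, length) pairs that actually need a table entry. The helper name `table_cells` is made up for this illustration; positions are counted from 1, as in the note.

```python
def table_cells(n):
    """List (start, length) for every substring of an n-symbol input.

    A substring starting at position `start` (counted from 1) can be at most
    n - start + 1 symbols long, so only those lengths are enumerated.
    """
    return [(start, length)
            for start in range(1, n + 1)
            for length in range(1, n - start + 2)]


cells = table_cells(6)
print(len(cells))  # 21, i.e. 6 * 7 / 2 entries instead of 6 * 6
```

In general an input of n symbols needs n(n+1)/2 entries, which is the triangular half of the full n-by-n table.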

Constructing parse tree

We need to construct parse trees for the string w.

Idea:

– Keep back-pointers to the table entries that we combine
– At the end, reconstruct a parse from the back-pointers

This allows us to find all parse trees.
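A minimal sketch of the back-pointer idea, under stated assumptions: the function name `cyk_parse`, the rule encoding and the nested-tuple tree format are all choices made for this example. To stay short it keeps only the first back-pointer found for each entry and therefore returns a single tree; storing every back-pointer instead would be needed to recover all parse trees.

```python
def cyk_parse(x, terminal_rules, binary_rules, start="S"):
    """Return one parse tree for x as nested tuples, or None if x is rejected."""
    n = len(x)
    # back[(i, j, A)] records how A derives x[i:j]: either the terminal itself
    # or a triple (k, B, C) meaning A -> B C with B over x[i:k] and C over x[k:j].
    back = {}
    for i, symbol in enumerate(x):
        for head, terminal in terminal_rules:
            if terminal == symbol:
                back[(i, i + 1, head)] = symbol

    for d in range(2, n + 1):
        for i in range(n - d + 1):
            j = i + d
            for k in range(i + 1, j):
                for head, left, right in binary_rules:
                    if (i, k, left) in back and (k, j, right) in back:
                        # Keep only the first derivation found for this entry.
                        back.setdefault((i, j, head), (k, left, right))

    def build(i, j, head):
        entry = back[(i, j, head)]
        if isinstance(entry, str):          # leaf: head -> terminal
            return (head, entry)
        k, left, right = entry              # follow the two back-pointers
        return (head, build(i, k, left), build(k, j, right))

    return build(0, n, start) if (0, n, start) in back else None


TERMINAL_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [
    ("S", "A", "B"), ("S", "X", "B"),
    ("T", "A", "B"), ("T", "X", "B"),
    ("X", "A", "T"),
]

print(cyk_parse("aabb", TERMINAL_RULES, BINARY_RULES))
print(cyk_parse("aaabb", TERMINAL_RULES, BINARY_RULES))  # None
```

The tree is rebuilt only after the table is complete, so no work is spent on entries that do not contribute to a derivation from the start symbol.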

Ambiguity

Efficient representation of ambiguities.

Local ambiguity packing:

– A local ambiguity: multiple ways to derive the same substring from a non-terminal
– All possible ways to derive each non-terminal are stored together
– When creating back-pointers, create a single back-pointer to the “packed” representation

This allows a very large number of ambiguities (even exponentially many) to be represented efficiently.

Unpacking: producing one or more of the packed parse trees by following the back-pointers.
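The packing idea can be illustrated with a small sketch. Everything named here is an assumption for the example: the functions `packed_forest` and `count_trees`, the rule encoding, and the toy ambiguous grammar S -> S S | a, which is not part of these slides. All alternatives for a given (start, end, non-terminal) are stored under one key, and the number of parse trees is computed from the packed structure without unpacking any tree.

```python
from functools import lru_cache


def packed_forest(x, terminal_rules, binary_rules):
    """Map (i, j, A) to the list of all ways A derives x[i:j]."""
    n = len(x)
    forest = {}
    for i, symbol in enumerate(x):
        for head, terminal in terminal_rules:
            if terminal == symbol:
                forest.setdefault((i, i + 1, head), []).append(symbol)
    for d in range(2, n + 1):
        for i in range(n - d + 1):
            j = i + d
            for k in range(i + 1, j):
                for head, left, right in binary_rules:
                    if (i, k, left) in forest and (k, j, right) in forest:
                        # Every alternative is packed under the same key.
                        forest.setdefault((i, j, head), []).append((k, left, right))
    return forest


def count_trees(forest, i, j, head):
    """Count the parse trees packed under (i, j, head) without enumerating them."""
    @lru_cache(maxsize=None)
    def count(i, j, head):
        total = 0
        for alternative in forest.get((i, j, head), []):
            if isinstance(alternative, str):     # head -> terminal
                total += 1
            else:
                k, left, right = alternative
                total += count(i, k, left) * count(k, j, right)
        return total
    return count(i, j, head)


# Toy ambiguous grammar: S -> S S | a
TERMINAL_RULES = [("S", "a")]
BINARY_RULES = [("S", "S", "S")]

for n in range(1, 7):
    forest = packed_forest("a" * n, TERMINAL_RULES, BINARY_RULES)
    print(n, len(forest), count_trees(forest, 0, n, "S"))
```

For this grammar the number of packed entries grows only quadratically with the input length, while the number of trees they represent grows much faster.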

References

Hopcroft and Ullman, “Introduction to Automata Theory, Languages, and Computation”, Section 6.3, pp. 139-141

“CYK algorithm”, Wikipedia, the free encyclopedia

A presentation by Zeph Grunschlag
