designing a syntax based retrieval system03

29
1 Designing a Syntax-Based Retrieval System for Supporting Language Learning Educational Technology & Society Nai-Lung Tsao1, Chin-Hwa Kuo2, David Wible1 and Tsung-Fu Hung2 Kaun-Hua Huo( Kaun-Hua Huo( 霍冠樺 霍冠樺 ) 2009/09/29 2009/09/29

Upload: avelin-huo

Post on 04-Jul-2015

315 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Designing A Syntax Based Retrieval System03

11

Designing a Syntax-Based Retrieval System for Supporting

Language Learning

Educational Technology & Society

Nai-Lung Tsao1, Chin-Hwa Kuo2, David Wible1 and Tsung-Fu Hung2

Kaun-Hua Huo(Kaun-Hua Huo( 霍冠樺霍冠樺 ))2009/09/292009/09/29

Page 2: Designing A Syntax Based Retrieval System03

22

Outline Outline

IntroductionIntroduction Regular expression search Regular expression search

engineengine

Syntax-based text retrieval Syntax-based text retrieval systemsystem

Conclusion and future worksConclusion and future works

Page 3: Designing A Syntax Based Retrieval System03

33

Introduction Introduction

Regular expression search Advantage: scalable querying Disadvantage: execution time Finding example of specific types

The main purpose of the syntax-based text retrieval system is to support grammatical querying of tagged corpora for language learners and teachers.

Page 4: Designing A Syntax Based Retrieval System03

44

Consider the following exampleConsider the following example Example 1Example 1

In ESL teaching...keep + Ving

regex (regular expression)keep\s+ \w+ing

X X keep to try

X X keep in touch Regular expression provide more Regular expression provide more

flexible querying, they still create a flexible querying, they still create a problem in search response time. problem in search response time.

Page 5: Designing A Syntax Based Retrieval System03

55

Most of proposals use k-gram index construction and build efficient index structures for quick searching of index terms.

Example 2Example 2Collocation is a persistent area of

difficulty for ESL learners.

V + problemV + problem

RegexRegex<w POS=\”V\W+\”>\w+<\/w> (<w[^>]+><\/w>){0,5} <w[^>]+>problem</w>

where POS=\”V\W+\” indicates words tagged to verb.

Page 6: Designing A Syntax Based Retrieval System03

66

RegularRegular expressionexpression searchsearch engineengine

The prart of typical regular The prart of typical regular expression search engine most expression search engine most different from the traditional different from the traditional keyword-based search engine keyword-based search engine are the are the index constructionindex construction and and query processing componentsquery processing components. .

Page 7: Designing A Syntax Based Retrieval System03

77

Figure 1: The illustrative figure of a typical regular expression search engine

Page 8: Designing A Syntax Based Retrieval System03

88

Index constructionIndex construction

Notations and definitionsNotations and definitions Index construction algorithmIndex construction algorithm Index construction structureIndex construction structure

Index constructionQuery processing

Page 9: Designing A Syntax Based Retrieval System03

99

Notations and definitionsNotations and definitions

A k-gram is a string x = x1x2…xk , where xi : 1 ≤ i ≤ k is a character.

A data unit = the raw data is partitioned (web page, a paragraph or a sentence in documents)

M(x) denote the number of data units which contain x.

Page 10: Designing A Syntax Based Retrieval System03

1010

The filter factor is denoted byff (x) = 1 − M(x) / N, where N means the total number of

data units.

Set the minimum filter factor minff and only keep index terms with filter factors greater than minff.

This would make the number of index terms smaller and more useful.

Page 11: Designing A Syntax Based Retrieval System03

1111

Minff = 0.95 means the system only keeps the index term which can filter 95% of data units.

This filter scheme can make the index terms more discriminative, which is important for a retrieval system.

We call these index terms useful indices.

Page 12: Designing A Syntax Based Retrieval System03

1212

Every string expanded from a useful index will be useful, too.● How to buy NBA tickets

Therefore, the system only maintains a presuf (prefix and suffix) free index set.● {ab, ac, abc}● {bc, ac, abc}

Page 13: Designing A Syntax Based Retrieval System03

1313

Index construction algorithmIndex construction algorithm

Input : text collectionOutput : index[1] k=1 , Useless={.} // . is a zero-length string[2] while (Useless is not empty)[3] k-grams := all k-grams in text collection

whose (k-1)-prefix Useless or (k-1)-suffix Useless∈ ∈[4] Useless := { }[5] For each x in k-grams[6] If ff(x) ≥ minff Then[7] insert(x,index) //the gram is useful[8] Else[9] Useless := Useless {x}∪[10] k := k+1

We use inverted indices as our index storage structure, which is easily accessed by RDBMS.

Page 14: Designing A Syntax Based Retrieval System03

1414

Index construction structureIndex construction structure

In (Cho and Rajagopalan, 2002), the prefix-free indices are built first and then the suffix-free indices are considered.

The different part of our steps from the original one in (Cho and Rajagopalan, 2002) is we do prefix checking and suffix checking in the same pass.

Page 15: Designing A Syntax Based Retrieval System03

1515

Query processingQuery processing

The query processor has to determine which index term to look up.

Index constructionQuery processing

Page 16: Designing A Syntax Based Retrieval System03

1616

Query processing algorithmQuery processing algorithm

determining candidate data units for regex matching

Input : a regular expression query, rOutput : candidate data units[1] Rewrite regular expression r so that it only uses $,

(, ), | and &, where $ means entire data untis and& means AND operator. The details are as follow:

1. Based on the order in Table 1, perform the action of target symbol in r.

2. If there is no operator in any two operands, insert &[2] Transform the infix expression to postfix expression[3] Get k-gram which has the smallest position set

contained in operand string.[4] Perform the set operation among the position sets

extracted in Step 3 and then get the final candidate data units for regex string matching.

Page 17: Designing A Syntax Based Retrieval System03

1717

Step 1Step 1 Eliminate useless symbols Replace the useless symbols with operators or

specific operands.

<w & POS=& >& <w &> (<w &> & <w &>) <w & >problem< & w>

Step 2Step 2 the query processor transforms the infix expressions

to postfix expressions A&(B|C) becomes ABC|&

Example 2 :V + problem

Page 18: Designing A Syntax Based Retrieval System03

1818

Query processing algorithm

determining candidate data units for regex matching

Input : a regular expression query, rOutput : candidate data units[1] Rewrite regular expression r so that it only uses $,

(, ), | and &, where $ means entire data untis and& means AND operator. The details are as follow:

1. Based on the order in Table 1, perform the action of target symbol in r.

2. If there is no operator in any two operands, insert &[2] Transform the infix expression to postfix expression[3] Get k-gram which has the smallest position set

contained in operand string.[4] Perform the set operation among the position sets

extracted in Step 3 and then get the final candidate data units for regex string matching.

Page 19: Designing A Syntax Based Retrieval System03

1919

Syntax-basedSyntax-based texttext retrievalretrieval systemsystem

Corley et al. (2001) syntactic structure search engine.

grammar rules.

Our system no grammar rules or parsers but linear syntax knowledge.

Tagged corpusFilter factorEvaluation Query Generator

Page 20: Designing A Syntax Based Retrieval System03

2020

Tagged corpusTagged corpus

We decide to add part-of-speech information to the raw text.

The lemmatizing operation should be performed after POS tagging because some words have different lemma with different POS.

Page 21: Designing A Syntax Based Retrieval System03

2121

An example after POS tagging and XML structure formatting

<s>

<w Lemma="do" POS="VDD"> Did </w>

<w Lemma="you" POS="PNP">you</w>

<w Lemma="know" POS="VVI">know</w>

<w Lemma="?" POS="PUN">?</w>

</s>

Page 22: Designing A Syntax Based Retrieval System03

2222

Filter factorFilter factor

As for the length 1 and 2 words, the system build the isolated keyword index for them.

We want all words of length 3 kept in the index set.

The index construction starts on k=4

Page 23: Designing A Syntax Based Retrieval System03

2323

Function

, where k0 means the smallest number in k.

α is a real number smaller than 1.

α can be 1/52 if no matter which letter of the English alphabet is added in front of or in back of the string with length k-1, each of them has the same M(x)

Page 24: Designing A Syntax Based Retrieval System03

2424

Evaluation Evaluation The following shows ten

query samples.

Page 25: Designing A Syntax Based Retrieval System03

2525

Table 2 shows the index sizes and posting sizes of complete index construction and presuf free index construction.

Page 26: Designing A Syntax Based Retrieval System03

2626

Page 27: Designing A Syntax Based Retrieval System03

2727

Query GeneratorQuery Generator

Page 28: Designing A Syntax Based Retrieval System03

2828

Examples Examples

Example 3 Example 3 Single wordSingle word

wonderful

Example 4Example 4 Syntax formatSyntax format

verb keep followed by -ing form verb (keep + <anyword>(0,5) + Ving)

Example 5Example 5 Syntax formatSyntax format

adjective before the word rain (adj + <anyword>(0,3) + rain)

Page 29: Designing A Syntax Based Retrieval System03

2929

Conclusion and future worksConclusion and future works

Regular expression search engine to reduce response time

syntax-based searches to locate tokens of specific types of target

no solution for making class-based queries faster