Designing a Syntax-Based Retrieval System for Supporting
Language Learning
Educational Technology & Society
Nai-Lung Tsao1, Chin-Hwa Kuo2, David Wible1 and Tsung-Fu Hung2
Kaun-Hua Huo (霍冠樺), 2009/09/29
Outline
● Introduction
● Regular expression search engine
● Syntax-based text retrieval system
● Conclusion and future work
Introduction
Regular expression search
● Advantage: scalable querying
● Disadvantage: execution time
Finding examples of specific types
The main purpose of the syntax-based text retrieval system is to support grammatical querying of tagged corpora for language learners and teachers.
Consider the following example
Example 1
In ESL teaching: keep + Ving
regex (regular expression): keep\s+\w+ing
X keep to try
X keep in touch
Although regular expressions provide more flexible querying, they still create a problem in search response time.
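Example 1 can be tried directly with Python's standard `re` module. This is only a small sketch on three hand-written sample strings (the sample list is an assumption for illustration, not data from the paper):

```python
import re

# The regex from Example 1: "keep", whitespace, then a word ending in "ing".
KEEP_VING = re.compile(r"keep\s+\w+ing")

# Hypothetical sample phrases; only the first is a keep + Ving example.
samples = ["keep trying", "keep to try", "keep in touch"]

matches = [s for s in samples if KEEP_VING.search(s)]
print(matches)
```

On raw text the pattern also accepts any word that merely ends in "ing" (for example "keep something"), which is one reason to search a POS-tagged corpus instead.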
Most proposals use k-gram index construction and build efficient index structures for quick searching of index terms.
Example 2
Collocation is a persistent area of difficulty for ESL learners.
V + problem
Regex: <w POS=\"V\W+\">\w+<\/w> (<w[^>]+><\/w>){0,5} <w[^>]+>problem</w>
where POS=\"V\W+\" indicates words tagged as verbs.
Regular expression search engine
The parts of a typical regular expression search engine that differ most from a traditional keyword-based search engine are the index construction and query processing components.
Figure 1: The illustrative figure of a typical regular expression search engine
Index construction
● Notations and definitions
● Index construction algorithm
● Index construction structure
Notations and definitions
A k-gram is a string x = x_1 x_2 … x_k, where each x_i (1 ≤ i ≤ k) is a character.
A data unit is one of the pieces into which the raw data is partitioned (a web page, a paragraph, or a sentence in a document).
M(x) denotes the number of data units that contain x.
The filter factor is denoted by ff(x) = 1 − M(x) / N, where N is the total number of data units.
Set the minimum filter factor minff and only keep index terms with filter factors greater than minff.
This would make the number of index terms smaller and more useful.
minff = 0.95 means the system keeps only the index terms that can filter out 95% of the data units.
This filter scheme can make the index terms more discriminative, which is important for a retrieval system.
We call these index terms useful indices.
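The definitions of M(x), ff(x) and minff can be made concrete with a few lines of Python. This is a minimal sketch: the function name `filter_factor` and the four toy data units are assumptions made for illustration.

```python
def filter_factor(gram, data_units):
    """Return ff(x) = 1 - M(x) / N for the string `gram`."""
    # M(x): number of data units that contain the gram.
    m = sum(1 for unit in data_units if gram in unit)
    # N: total number of data units.
    return 1 - m / len(data_units)

# Hypothetical data units (here, one short sentence each).
units = ["keep trying", "keep in touch", "a problem", "solve it"]

ff_e = filter_factor("e", units)        # "e" occurs in every unit
ff_keep = filter_factor("keep", units)  # "keep" occurs in two of four units
ff_prob = filter_factor("prob", units)  # "prob" occurs in one of four units

minff = 0.7
useful = [g for g in ("e", "keep", "prob") if filter_factor(g, units) >= minff]
print(ff_e, ff_keep, ff_prob, useful)
```

A gram that appears in every data unit has a filter factor of 0 and filters nothing, so it is not worth indexing.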
Every string expanded from a useful index will be useful, too.
● How to buy NBA tickets
Therefore, the system only maintains a presuf (prefix and suffix) free index set.
● {ab, ac, abc}
● {bc, ac, abc}
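One way to read "presuf free" is that no index term is a prefix or a suffix of another index term. The check below is a sketch under that reading; the helper name `is_presuf_free` is hypothetical.

```python
def is_presuf_free(terms):
    """True if no term is a proper prefix or proper suffix of another term."""
    for a in terms:
        for b in terms:
            if a != b and (b.startswith(a) or b.endswith(a)):
                return False
    return True

first = is_presuf_free({"ab", "ac", "abc"})   # "ab" is a prefix of "abc"
second = is_presuf_free({"bc", "ac", "abc"})  # "bc" is a suffix of "abc"
third = is_presuf_free({"ab", "cd"})
print(first, second, third)
```

Under this reading, neither of the two sets listed above is presuf free: the first fails on a prefix and the second on a suffix.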
Index construction algorithm
Input: text collection
Output: index
[1] k = 1, Useless = {.}   // . is a zero-length string
[2] while (Useless is not empty)
[3]     k-grams := all k-grams in text collection whose (k-1)-prefix ∈ Useless or (k-1)-suffix ∈ Useless
[4]     Useless := { }
[5]     For each x in k-grams
[6]         If ff(x) ≥ minff Then
[7]             insert(x, index)   // the gram is useful
[8]         Else
[9]             Useless := Useless ∪ {x}
[10]    k := k+1
We use inverted indices as our index storage structure, which is easily accessed by RDBMS.
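The pseudocode above can be sketched in Python as follows. This is an illustration rather than the authors' implementation: it follows the pseudocode literally (including the "or" in step 3), stores the inverted index as a plain dictionary from gram to a set of data-unit ids instead of an RDBMS table, and uses a hypothetical four-unit collection.

```python
def build_index(data_units, minff):
    """Map each useful k-gram to the set of data-unit ids that contain it."""
    n = len(data_units)
    index = {}
    useless = {""}  # step 1: start from the zero-length string
    k = 1
    while useless:  # step 2
        # Step 3: collect k-grams whose (k-1)-prefix or (k-1)-suffix is useless.
        postings = {}
        for uid, text in enumerate(data_units):
            for i in range(len(text) - k + 1):
                gram = text[i:i + k]
                if gram[:-1] in useless or gram[1:] in useless:
                    postings.setdefault(gram, set()).add(uid)
        useless = set()  # step 4
        for gram, units in postings.items():  # step 5
            if 1 - len(units) / n >= minff:   # step 6: ff(x) >= minff
                index[gram] = units           # step 7: the gram is useful
            else:
                useless.add(gram)             # step 9
        k += 1  # step 10
    return index

index = build_index(["abc", "abd", "xbc", "abe"], minff=0.5)
print(sorted(index))
```

In this toy run "ab" occurs in three of the four units, so it is useless and is expanded to "abc", "abd" and "abe". "xbc" is never considered because neither "xb" nor "bc" is useless.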
Index construction structure
In (Cho and Rajagopalan, 2002), the prefix-free indices are built first and then the suffix-free indices are considered.
Our procedure differs from the original one in (Cho and Rajagopalan, 2002) in that we do prefix checking and suffix checking in the same pass.
Query processing
The query processor has to determine which index term to look up.
Query processing algorithm
determining candidate data units for regex matching
Input: a regular expression query, r
Output: candidate data units
[1] Rewrite regular expression r so that it only uses $, (, ), | and &, where $ means the entire set of data units and & means the AND operator. The details are as follows:
    1. Based on the order in Table 1, perform the action of the target symbol in r.
    2. If there is no operator between any two operands, insert &.
[2] Transform the infix expression to a postfix expression.
[3] Get the k-gram which has the smallest position set contained in the operand string.
[4] Perform the set operations among the position sets extracted in Step 3 and then get the final candidate data units for regex string matching.
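Step 3 can be pictured as a lookup over all substrings of an operand. The sketch below is an assumption-laden illustration: the function name `smallest_posting`, the toy index, and the choice to fall back to all data units (the role of $) when no indexed gram is found are not taken from the paper.

```python
def smallest_posting(operand, index, all_units):
    """Return the smallest posting set among indexed grams inside `operand`."""
    best = None
    for i in range(len(operand)):
        for j in range(i + 1, len(operand) + 1):
            units = index.get(operand[i:j])
            if units is not None and (best is None or len(units) < len(best)):
                best = units
    # Assumed fallback: without any indexed gram, nothing can be filtered out.
    return all_units if best is None else best

# Hypothetical index: gram -> ids of the data units that contain it.
toy_index = {"pro": {1, 2, 5}, "blem": {2, 5}, "lem": {2, 3, 5, 7}}
everything = set(range(10))

hit = smallest_posting("problem", toy_index, everything)
miss = smallest_posting("xyz", toy_index, everything)
print(hit, len(miss))
```

Choosing the smallest set keeps the later set operations and the final regex matching cheap, since every candidate data unit still has to be scanned with the full regular expression.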
Example 2: V + problem
Step 1
Eliminate useless symbols: replace the useless symbols with operators or specific operands.
<w & POS=& >& <w &> (<w &> & <w &>) <w & >problem< & w>
Step 2
The query processor transforms the infix expressions to postfix expressions.
A&(B|C) becomes ABC|&
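The infix-to-postfix step, and the set operations that follow it, can be sketched with a standard shunting-yard conversion. This is a minimal sketch, not the authors' code: it assumes one token per operand, assumes & binds tighter than |, and uses hypothetical posting sets for A, B and C.

```python
PRECEDENCE = {"|": 1, "&": 2}  # assumption: AND binds tighter than OR

def to_postfix(tokens):
    """Convert an infix token list over &, |, ( and ) to postfix order."""
    output, stack = [], []
    for tok in tokens:
        if tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()  # discard the "("
        elif tok in PRECEDENCE:
            while (stack and stack[-1] != "("
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]):
                output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok)  # operand
    while stack:
        output.append(stack.pop())
    return output

def evaluate(postfix, postings):
    """Combine posting sets: & is intersection, | is union."""
    stack = []
    for tok in postfix:
        if tok in PRECEDENCE:
            right, left = stack.pop(), stack.pop()
            stack.append(left & right if tok == "&" else left | right)
        else:
            stack.append(postings[tok])
    return stack.pop()

postfix = "".join(to_postfix(list("A&(B|C)")))
candidates = evaluate(list(postfix), {"A": {1, 2, 3}, "B": {2}, "C": {3, 4}})
print(postfix, candidates)
```

Postfix order lets the processor evaluate the expression in a single left-to-right pass with one stack, with no need to track parentheses.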
Syntax-based text retrieval system
Corley et al. (2001): a syntactic structure search engine that relies on grammar rules.
Our system: no grammar rules or parsers, only linear syntax knowledge.
● Tagged corpus
● Filter factor
● Evaluation
● Query Generator
Tagged corpus
We decide to add part-of-speech information to the raw text.
The lemmatizing operation should be performed after POS tagging because some words have different lemmas depending on their POS.
An example after POS tagging and XML structure formatting
<s>
<w Lemma="do" POS="VDD"> Did </w>
<w Lemma="you" POS="PNP">you</w>
<w Lemma="know" POS="VVI">know</w>
<w Lemma="?" POS="PUN">?</w>
</s>
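The tagged sentence above can be read back into (word, lemma, POS) triples with Python's standard `xml.etree.ElementTree` module. This is only a sketch of how such a corpus might be consumed; the function name `read_tokens` is hypothetical.

```python
import xml.etree.ElementTree as ET

SENTENCE = """<s>
<w Lemma="do" POS="VDD"> Did </w>
<w Lemma="you" POS="PNP">you</w>
<w Lemma="know" POS="VVI">know</w>
<w Lemma="?" POS="PUN">?</w>
</s>"""

def read_tokens(xml_text):
    """Return (surface form, lemma, POS tag) for every <w> element."""
    root = ET.fromstring(xml_text)
    return [(w.text.strip(), w.get("Lemma"), w.get("POS"))
            for w in root.iter("w")]

tokens = read_tokens(SENTENCE)
print(tokens)
```

Keeping the lemma and POS as attributes means a single regular expression can constrain the surface form, the lemma, the tag, or any combination of them.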
Filter factor
For words of length 1 and 2, the system builds a separate keyword index.
We want all words of length 3 kept in the index set.
The index construction starts at k=4.
Function [formula not preserved in the transcript], where k0 means the smallest value of k.
α is a real number smaller than 1.
α can be 1/52 if, no matter which letter of the English alphabet is added to the front or the back of a string of length k-1, each resulting string has the same M(x) (26 letters times 2 positions gives 52 possible extensions).
Evaluation
The following shows ten query samples.
Table 2 shows the index sizes and posting sizes of complete index construction and presuf free index construction.
Query Generator
Examples
Example 3: Single word
wonderful
Example 4: Syntax format
verb keep followed by -ing form verb (keep + <anyword>(0,5) + Ving)
Example 5: Syntax format
adjective before the word rain (adj + <anyword>(0,3) + rain)
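Examples 4 and 5 suggest how a query generator might assemble a regular expression from a syntax format. The sketch below is an assumption, not the system's actual generator: the helpers `word` and `gap` are hypothetical, the corpus is assumed to use the `<w Lemma="..." POS="...">` layout shown earlier, verbs are assumed to carry tags starting with V and adjectives tags starting with AJ, and the test sentences are hand-written.

```python
import re

# Any single tagged word, followed by optional whitespace.
ANY_WORD = r'<w [^>]*>[^<]*</w>\s*'

def word(lemma=None, pos=None, text=None):
    """Regex for one <w> element; unspecified fields match anything."""
    lemma_re = re.escape(lemma) if lemma else r'[^"]*'
    pos_re = pos if pos else r'[^"]*'
    text_re = text if text else r'[^<]*'
    return rf'<w Lemma="{lemma_re}" POS="{pos_re}">\s*{text_re}\s*</w>\s*'

def gap(lo, hi):
    """Regex for <anyword>(lo,hi): between lo and hi arbitrary words."""
    return rf'(?:{ANY_WORD}){{{lo},{hi}}}'

# Example 4: keep + <anyword>(0,5) + Ving
keep_ving = re.compile(word(lemma="keep", pos=r'V\w*') + gap(0, 5)
                       + word(pos=r'V\w*', text=r'\w+ing'))
# Example 5: adj + <anyword>(0,3) + rain
adj_rain = re.compile(word(pos=r'AJ\w*') + gap(0, 3) + word(lemma="rain"))

kept_on = ('<w Lemma="keep" POS="VVD">kept</w> <w Lemma="on" POS="AVP">on</w> '
           '<w Lemma="try" POS="VVG">trying</w>')
in_touch = ('<w Lemma="keep" POS="VVI">keep</w> <w Lemma="in" POS="PRP">in</w> '
            '<w Lemma="touch" POS="NN1">touch</w>')
heavy_rain = '<w Lemma="heavy" POS="AJ0">heavy</w> <w Lemma="rain" POS="NN1">rain</w>'

print(bool(keep_ving.search(kept_on)), bool(keep_ving.search(in_touch)),
      bool(adj_rain.search(heavy_rain)))
```

Because the pattern constrains the lemma rather than the surface form, "kept on trying" is found as a keep + Ving example, while "keep in touch" is rejected.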
Conclusion and future work
Regular expression search engine to reduce response time
syntax-based searches to locate tokens of specific types of target
no solution for making class-based queries faster