chapter 4 : query languages


Page 1: Chapter 4 : Query Languages

Chapter 4 : Query Languages

Student: 曾寶樂
Student ID: 88522070
Instructor: 張嘉惠
Presentation date: 89/10/26

Page 2: Chapter 4 : Query Languages

Outline
- Keyword-Based Querying
- Pattern Matching
- Structural Queries
- Query Protocols
- Trends and Research Issues

Page 3: Chapter 4 : Query Languages

Keyword-Based Querying
- A query is the formulation of a user information need
- Keyword-based queries are popular
  1. Single-Word Queries
  2. Context Queries
  3. Boolean Queries
  4. Natural Language

Page 4: Chapter 4 : Query Languages

Single-Word Queries
- A query is formulated as a single word
- A document is viewed as a long sequence of words
- A word is a sequence of letters surrounded by separators
- What are letters and separators? e.g., 'on-line'
- The division of the text into words is not arbitrary
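The 'on-line' question can be made concrete with a minimal sketch. The function name `split_words` and the default separator set are assumptions made for illustration; they are not from the slides.

```python
import re

def split_words(text, separators=" \t\n.,;:!?"):
    """Split text into words; any run of separator characters ends a word."""
    pattern = "[" + re.escape(separators) + "]+"
    return [w for w in re.split(pattern, text) if w]

# Hyphen treated as a letter: 'on-line' stays one word.
print(split_words("an on-line catalog"))        # ['an', 'on-line', 'catalog']

# Hyphen treated as a separator: 'on-line' becomes two words.
print(split_words("an on-line catalog", " -"))  # ['an', 'on', 'line', 'catalog']
```

A single-word query for 'line' matches the document only under the second choice, which is why the separator set has to be fixed before indexing.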

Page 5: Chapter 4 : Query Languages

Context Queries
Definition
- Search for words in a given context, e.g., near other words
Types
- Phrase
  > a sequence of single-word queries
  > e.g., 'enhance retrieval'
- Proximity
  > a sequence of single words or phrases, together with a maximum allowed distance between them
  > e.g., within distance (enhance, retrieval, 4) will match '...enhance the power of retrieval...'
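Both context query types can be sketched over a list of word positions. This is only an illustration: the helper names are hypothetical, and it assumes distance is counted in words and that the second word must follow the first.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()

def phrase_match(words, phrase):
    """True if the words of `phrase` occur consecutively in `words`."""
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))

def within_distance(words, first, second, max_distance):
    """True if `second` occurs at most `max_distance` words after `first`."""
    pos_first = [i for i, w in enumerate(words) if w == first]
    pos_second = [i for i, w in enumerate(words) if w == second]
    return any(0 < j - i <= max_distance for i in pos_first for j in pos_second)

words = tokenize("to enhance the power of retrieval")
print(phrase_match(words, ["enhance", "retrieval"]))       # False
print(within_distance(words, "enhance", "retrieval", 4))   # True
print(within_distance(words, "enhance", "retrieval", 3))   # False
```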

Page 6: Chapter 4 : Query Languages

Boolean Queries
Definition
- A syntax composed of atoms that retrieve documents, and of Boolean operators which work on their operands
- e.g., translation AND syntax OR syntactic

Page 7: Chapter 4 : Query Languages

Boolean Queries
Operators
- (e1 OR e2) selects all documents which satisfy e1 or e2
- (e1 AND e2) selects all documents which satisfy both e1 and e2
- (e1 BUT e2) selects all documents which satisfy e1 but not e2
"Fuzzy Boolean"
- Retrieves documents appearing in some of the operands (the AND may require a document to appear in more operands than the OR)
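The three operators map directly onto set operations over the document sets that the atoms retrieve. The sketch below assumes a toy inverted index; the document numbers and the parenthesization of the example query are made up for illustration.

```python
# Hypothetical inverted index: word -> set of document ids containing it.
index = {
    "translation": {1, 2, 4},
    "syntax": {2, 3},
    "syntactic": {4, 5},
}

def atom(word):
    """An atom retrieves the set of documents containing the word."""
    return index.get(word, set())

def OR(e1, e2):
    return e1 | e2      # documents satisfying e1 or e2

def AND(e1, e2):
    return e1 & e2      # documents satisfying both e1 and e2

def BUT(e1, e2):
    return e1 - e2      # documents satisfying e1 but not e2

# translation AND (syntax OR syntactic)
print(AND(atom("translation"), OR(atom("syntax"), atom("syntactic"))))  # {2, 4}
# translation BUT syntax
print(BUT(atom("translation"), atom("syntax")))                         # {1, 4}
```

BUT is used instead of a unary NOT so that the result never has to enumerate every document outside a set.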

Page 8: Chapter 4 : Query Languages

Natural Language
- A generalization of "fuzzy Boolean"
- A query is an enumeration of words and context queries
- All the documents matching a portion of the user query are retrieved
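One simple way to read "matching a portion of the query" is to retrieve every document that shares at least one word with the query and to rank by how many words it shares. The scoring rule, the function name `rank`, and the sample documents are assumptions for illustration only.

```python
def rank(query, docs):
    """Return ids of documents matching any query word, best match first."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        matched = terms & set(text.lower().split())
        if matched:                       # a partial match is enough
            scored.append((len(matched), doc_id))
    # More matched words first; ties broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

docs = {
    "d1": "query languages for text retrieval",
    "d2": "text compression",
    "d3": "cooking recipes",
}
print(rank("text retrieval languages", docs))   # ['d1', 'd2']
```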

Page 9: Chapter 4 : Query Languages

Pattern Matching
- A pattern is a set of syntactic features that must occur in a text segment
Types
- Words
- Prefixes, e.g., 'comput' -> 'computer', 'computation', 'computing', etc.
- Suffixes, e.g., 'ters' -> 'computers', 'testers', 'painters', etc.
- Substrings, e.g., 'tal' -> 'coastal', 'talk', 'metallic', etc.
- Ranges, e.g., between 'held' and 'hold' -> 'hoax' and 'hissing'
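Each pattern type above corresponds to a simple string test. The following sketch runs them over a small vocabulary; the vocabulary and the inclusive treatment of the range bounds are assumptions.

```python
vocabulary = [
    "coastal", "computation", "computer", "computers", "hissing",
    "hoax", "house", "metallic", "painters", "talk", "testers",
]

def by_prefix(words, prefix):
    return [w for w in words if w.startswith(prefix)]

def by_suffix(words, suffix):
    return [w for w in words if w.endswith(suffix)]

def by_substring(words, fragment):
    return [w for w in words if fragment in w]

def by_range(words, low, high):
    """Words falling between low and high in lexicographic order."""
    return [w for w in words if low <= w <= high]

print(by_prefix(vocabulary, "comput"))        # ['computation', 'computer', 'computers']
print(by_suffix(vocabulary, "ters"))          # ['computers', 'painters', 'testers']
print(by_substring(vocabulary, "tal"))        # ['coastal', 'metallic', 'talk']
print(by_range(vocabulary, "held", "hold"))   # ['hissing', 'hoax']
```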

Page 10: Chapter 4 : Query Languages

Pattern Matching
Allowing errors
- Retrieve all text words which are 'similar' to the given word
- Edit distance: the minimum number of character insertions, deletions, and replacements needed to make two strings equal, e.g., 'flower' and 'flo wer' are at distance 1
- Maximum allowed edit distance: the query specifies the maximum number of allowed errors for a word to match the pattern
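The edit distance defined above can be computed with the standard dynamic-programming recurrence. This is a minimal sketch; the function names are chosen here for illustration.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and replacements turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,            # delete ca
                            curr[j - 1] + 1,        # insert cb
                            prev[j - 1] + cost))    # replace or keep
        prev = curr
    return prev[-1]

def matches_with_errors(word, pattern, max_errors):
    """True if word is within max_errors edits of pattern."""
    return edit_distance(word, pattern) <= max_errors

print(edit_distance("flower", "flo wer"))           # 1
print(matches_with_errors("flover", "flower", 1))   # True
print(matches_with_errors("floor", "flower", 1))    # False
```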

Page 11: Chapter 4 : Query Languages

Pattern Matching
Regular expressions
- Union: if e1 and e2 are regular expressions, then (e1|e2) matches what e1 or e2 matches
- Concatenation: if e1 and e2 are regular expressions, the occurrences of (e1e2) are formed by the occurrences of e1 immediately followed by those of e2
- Repetition: if e is a regular expression, then (e*) matches a sequence of zero or more contiguous occurrences of e
- e.g., 'pro(blem|tein)(s|ε)(0|1|2)*' matches 'problem2' and 'proteins' (ε denotes the empty string)
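The example expression can be tried with Python's `re` module. In this sketch the empty alternative `(s|)` stands in for ε, and `fullmatch` is used so that the whole word must match; the list of test words is made up.

```python
import re

# Union (|), concatenation and repetition (*) combined as in the slide's example.
pattern = re.compile(r"pro(blem|tein)(s|)(0|1|2)*")

for word in ["problem2", "proteins", "protein012", "prob", "proteinx"]:
    print(word, bool(pattern.fullmatch(word)))
# problem2 True
# proteins True
# protein012 True
# prob False
# proteinx False
```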

Page 12: Chapter 4 : Query Languages

Structural Queries
Mixing contents and structure in queries
- Contents: words, phrases, or patterns
- Structural constraints: containment, proximity, or other restrictions on structural elements
Three main structures
- Fixed structure
- Hypertext structure
- Hierarchical structure

Page 13: Chapter 4 : Query Languages

Fixed Structure
- Document: a fixed set of fields
- Example: a mail has a sender, a receiver, a date, a subject, and a body field
- Query: search for the mails sent to a given person with "football" in the Subject field
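The mail query can be sketched by treating each document as a record with the same five fields. The sample mails, addresses, and the function name `search_mails` are hypothetical.

```python
# Every document has the same fixed set of fields.
mails = [
    {"sender": "ann@example.org", "receiver": "bob@example.org",
     "date": "2000-10-20", "subject": "Football on Sunday",
     "body": "Are you coming?"},
    {"sender": "carl@example.org", "receiver": "bob@example.org",
     "date": "2000-10-21", "subject": "Meeting notes",
     "body": "We talked about football."},
    {"sender": "ann@example.org", "receiver": "dora@example.org",
     "date": "2000-10-22", "subject": "Football tickets",
     "body": "Two seats left."},
]

def search_mails(mails, receiver, subject_word):
    """Mails sent to `receiver` whose Subject field contains `subject_word`."""
    word = subject_word.lower()
    return [m for m in mails
            if m["receiver"] == receiver
            and word in m["subject"].lower().split()]

hits = search_mails(mails, "bob@example.org", "football")
print([m["subject"] for m in hits])   # ['Football on Sunday']
```

The second mail mentions football only in its body, so restricting the word to the Subject field excludes it.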

Page 14: Chapter 4 : Query Languages

Hypertext
- A hypertext is a directed graph
- The nodes hold some text (text contents)
- The links represent connections between nodes or between positions inside nodes (structural connectivity)

Page 15: Chapter 4 : Query Languages

Hypertext: WebGlimpse
- WebGlimpse combines browsing and searching on the Web

Page 16: Chapter 4 : Query Languages

Hierarchical Structure
- Recursive decomposition of the text

Page 20: Chapter 4 : Query Languages

Hierarchical Structure
- PAT Expressions
- Overlapped Lists
- Lists of References
- Proximal Nodes
- Tree Matching

Page 21: Chapter 4 : Query Languages

PAT Expressions
- What is a PAT tree?
- The areas of a region cannot nest or overlap

Page 22: Chapter 4 : Query Languages

PAT Tree
(Figure: PAT tree over the sistrings below; node labels not recoverable. Hsin-Hsi Chen 8-37)

Text        01100100010111 ...
sistring 1  01100100010111 ...
sistring 2  1100100010111 ...
sistring 3  100100010111 ...
sistring 4  00100010111 ...
sistring 5  0100010111 ...
sistring 6  100010111 ...
sistring 7  00010111 ...
sistring 8  0010111 ...

Note: sistrings 3 and 6 need 4 bits to be told apart

Search 00101
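The sistrings and the search for 00101 can be reproduced with a few lines. This sketch uses a plain linear prefix scan as a stand-in; it does not build the PAT tree itself, and the function names are chosen for illustration. Only the bits shown on the slide are used.

```python
text = "01100100010111"

def sistrings(text, count):
    """Sistring i (1-based) is the suffix of the text starting at position i."""
    return {i + 1: text[i:] for i in range(count)}

def prefix_search(sis, prefix):
    """Numbers of the sistrings that start with the given bit pattern."""
    return [i for i, s in sis.items() if s.startswith(prefix)]

def bits_to_distinguish(a, b):
    """1-based position of the first bit where a and b differ, or None."""
    for k, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return k
    return None

sis = sistrings(text, 8)
print(prefix_search(sis, "00101"))            # [8]
print(bits_to_distinguish(sis[3], sis[6]))    # 4
```

A PAT tree answers the same prefix query by following one bit test per internal node instead of scanning every sistring.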

Page 23: Chapter 4 : Query Languages

Overlapped Lists
- The model allows the areas of a region to overlap, but not to nest
- It is not clear whether overlapping is good or not for capturing the structural properties

Page 24: Chapter 4 : Query Languages

Lists of References
- Overlapping and nesting are not allowed
- All elements must be of the same type, e.g., only sections, or only paragraphs
- A reference is a pointer to a region of the database

Page 25: Chapter 4 : Query Languages

Proximal Nodes
- This model tries to find a good compromise between expressiveness and efficiency
- It does not define a specific language, but a model in which it is shown that a number of useful operators can be included while achieving good efficiency

Page 26: Chapter 4 : Query Languages

Tree Matching
- The leaves of the query can be not only structural elements but also text patterns, meaning that the ancestor of the leaf must contain that pattern

Page 27: Chapter 4 : Query Languages

Query Protocols
- Z39.50
- WAIS (Wide Area Information Service)

Page 28: Chapter 4 : Query Languages

Z39.50
- American National Standard Information Retrieval Application Service Definition
- Can be implemented on any platform
- Queries bibliographical information using a standard interface between the client and the host database manager
- The Z39.50 protocol is part of WAIS

Page 29: Chapter 4 : Query Languages

Z39.50
Brief history
- Z39.50-1988 (version 1)
- Z39.50-1992 (version 2)
- Z39.50-1995 (version 3)
- Version 4: development began in Autumn 1995

Page 30: Chapter 4 : Query Languages

Using Z39.50 over the WWW

(Diagram labels: WWW Client; WWW / Z39.50; Z39.50 Client; Z39.50 Server; Repository; Digital library)

Page 31: Chapter 4 : Query Languages

WAIS (Wide Area Information Service)
- Beginning in the 1990s
- Queries databases through the Internet

Page 32: Chapter 4 : Query Languages

Trends and Research Issues

Model          Queries allowed
Boolean        word, set operations
Vector         words
Probabilistic  words
BBN            words

Relationship between types of queries and models

Page 33: Chapter 4 : Query Languages

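Boolean Model
- Boolean expressions have precise semantics, but how to express a document as a Boolean expression is itself a problem.
- The model relies on binary comparison and has no notion of "similarity" or "degree", so it cannot support queries for similar documents.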

Page 34: Chapter 4 : Query Languages

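Vector Model
Advantages:
- (1) Term weighting improves retrieval performance
- (2) It allows related documents to be retrieved
- (3) It can compute the degree of similarity between documents, in order to find the most similar ones
Disadvantage:
- The model assumes that terms are independent. If a keyword appears in every document, its weight is 0, which ignores the information carried by the different frequencies with which the word occurs in each document.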
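The zero-weight effect can be checked numerically. The slides do not state the weighting formula, so this sketch assumes the common tf × log(N/n) scheme, where N is the number of documents and n the number containing the term; the documents are made up.

```python
import math

# Hypothetical collection: document id -> list of terms.
docs = {
    "d1": ["query", "language", "query"],
    "d2": ["query", "model"],
    "d3": ["query", "protocol"],
}

def idf(term, docs):
    """log(N / n); 0.0 if the term occurs in no document."""
    n = sum(1 for words in docs.values() if term in words)
    return math.log(len(docs) / n) if n else 0.0

def weight(term, doc_id, docs):
    """Term frequency in the document times inverse document frequency."""
    return docs[doc_id].count(term) * idf(term, docs)

# 'query' occurs in every document, so its weight is 0 even in d1,
# where it occurs twice.
print(weight("query", "d1", docs))          # 0.0
print(weight("language", "d1", docs) > 0)   # True
```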

Page 35: Chapter 4 : Query Languages

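Probabilistic Model
- The main advantage of the probabilistic model is that it can compute a probability value for similarity.
- It has several drawbacks: (1) the relevant and non-relevant sets of documents have to be guessed; (2) the frequency with which a term occurs in a document is not taken into account; (3) index terms must be assumed to be mutually independent.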

Page 36: Chapter 4 : Query Languages

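Bayesian Belief Network
- A Bayesian belief network is a directed acyclic graph made up of a qualitative part and a quantitative part. The qualitative part is a directed graph formed by the domain variables and the interactions among them; the quantitative part is the joint probability distribution of those domain variables.
- In the directed graph, each node represents a random variable and each link indicates an interaction between two variables. In short, the directed graph is a factored representation of the joint probability distribution of the variables.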

Page 37: Chapter 4 : Query Languages

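Bayesian Belief Network
- Pregnancy (P) causes a change in hormones (H) (i.e., it affects the hormonal state) and a change in the scan shadow (S); the hormonal change in turn changes the results of the blood test (B) and the urine test (U).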
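Read as a directed acyclic graph, this example suggests the following factorization of the joint distribution. It is a sketch under the assumption that the links are P → H, P → S, H → B and H → U; the original figure is not available to confirm the edge set.

```latex
\Pr(P, H, S, B, U) = \Pr(P)\,\Pr(H \mid P)\,\Pr(S \mid P)\,\Pr(B \mid H)\,\Pr(U \mid H)
```

Each factor conditions a variable only on its parents in the graph, which is what "factored representation of the joint probability distribution" means here.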

Page 38: Chapter 4 : Query Languages

Trends and Research Issues

The types of queries covered and how they are structured