

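National Taiwan University, Graduate Institute of Computer Science and Information Engineering

Doctoral Dissertation

An XML Document Retrieval Method Based on a Slot-Filling Mechanism

- (Case Studies on Butterfly and Protein Retrieval)

XML Retrieval - A Slot Filling Approach

Graduate student: 陳鍾誠 (Chen, Chung Chen)

Advisor: 項潔 (Jieh Hsiang)

July 2002 (year 91 of the Republic of China)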


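Acknowledgments

This thesis is the main outcome of my work from 1997 to 2002. It sums up the research and experience I gained in life, at work, and at school. If anything in it is worth consulting, the credit goes to my parents and teachers, and to the colleagues and friends who taught and helped me.

First, I thank my advisor, Professor 項潔 (Jieh Hsiang). Through everyday discussions and formal classes I learned research methods and the right attitude toward research. His patient guidance always benefited me greatly. Whether in building theory, in research intuition, or in observing real systems, his insight was always deep and admirable.

Next, I thank Professor 高成炎, who was my master's advisor and who also gave me a great deal of guidance and support during my doctoral studies, both on research direction and in the field of bioinformatics.

I also thank Professor 許聞廉 of the Institute of Information Science, Academia Sinica. In the first years of my doctoral research he inspired my enthusiasm for and confidence in natural language research, and he gave me great freedom along with careful guidance.

Several people gave me special help, including Professor 洪炯宗 of National Central University, and Professor 楊進木 and Director 黃振剛 of National Chiao Tung University. Their help and guidance on my research papers are among the reasons this thesis could be completed smoothly.

I then thank my classmates, "林耀仁、杜協昌、黃光璿、謝育平、劉文俊、潘家煜、傅國長、陳宏杰、陳必衷、余禎祥、黃子葵、賴勝華、洪智瑋、劉秉涵、陳瑞呈、徐代昕、陳詩沛、陳耀將", for their help and care during these years of study. I hope all of you have a bright future.

Special thanks go to my classmate 謝育平, who offered many valuable suggestions in the final stage of the thesis. The thesis took its present form through our many discussions, and I benefited a great deal from them.

In addition, I thank the successive research assistants, "胡純毓、張慶瑞、許玉霜、鐘淑微、梁素瑜". Without their efforts the laboratory's results could not have accumulated, and we would not have such an excellent research environment.

Finally, I thank my father and mother, who firmly supported me while I studied for the doctorate and helped me get safely through the ups and downs along the way. I regret that during these years I could not fulfil my duty to care for them. I also thank my elder brother. These years have truly been hard on you.

There are truly too many people to thank. I now really understand what 陳之藩 wrote in the essay "謝天" ("Thanking Heaven"): "There are too many people to thank, and we cannot thank them one by one, so we can only thank Heaven to express our boundless gratitude!" In the days to come, I wish everyone health, happiness, and a wonderful day every day.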


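Abstract (Chinese)

Since the W3C introduced the Extensible Markup Language (XML) in 1998, it has been widely used for document exchange and knowledge representation. Because XML documents carry semantic tags and are semi-structured, XML retrieval has considerable potential. To make full use of these properties, this thesis develops a retrieval mechanism for XML documents based on a specially designed knowledge representation.

Computers cannot easily understand natural language documents, which creates a semantic gap between humans and machines. For an XML retrieval system, the semantic gap can be divided into a gap on the query side and a gap on the document side. The query-side gap arises mainly because structured query languages are hard to write, while the document-side gap arises because computers cannot understand XML documents. To address the semantic gap, this thesis proposes a knowledge representation centered on the slot-tree (Slot-Tree Ontology) and uses it to reduce the semantic gap in XML retrieval systems.

The slot-tree is an object-based knowledge representation that is particularly suited to retrieving object-like XML documents. In this thesis we first design slot-trees to represent background knowledge about objects. We then develop a slot-filling mechanism (Slot-Filling Algorithm) that maps XML documents into the slot-tree to capture their semantics. Using the slot-tree and the filling mechanism, we design a semantic retrieval method for XML documents. It includes a multi-slot query interface, a retrieval model that makes full use of semantic tags, and a summarization technique, so that the system can retrieve XML documents precisely and dynamically extract a semantic tree for browsing.

Because constructing a slot-tree is not easy, we develop a data mining algorithm (Slot-Mining Algorithm) that extracts slot-trees automatically from a collection of XML documents. The method statistically analyzes the correlation between semantic tags and terms in order to find characteristic terms to fill into the slots. It thus builds the slot-tree automatically and makes its construction easier.

We test the XML retrieval system on two real cases, the digital museum of butterflies in Taiwan and the protein database Protein Information Resource. We find that the system retrieves XML documents more accurately and organizes the results for browsing. The automatic slot-tree construction program also fills characteristic terms into the slots effectively, although manual revision is still needed to improve the quality of the slot-tree.

Finally, we summarize the contributions of this thesis to XML retrieval and compare it qualitatively with some existing methods to explain its strengths and weaknesses. We also suggest possible directions for future research.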


Abstract

Extensible Markup Language (XML) is widely used in data exchange and knowledge representation, so a retrieval system that manages the content of XML documents is strongly desired. To improve the efficiency of XML retrieval systems, we design a set of methods based on an ontology called the slot-tree, and use slot-trees to support the XML retrieval process.

One obstacle to building smart computers is that computers cannot understand natural language as well as humans can. This is called the semantic gap between humans and computers. For XML retrieval systems, the semantic gap lies on both the query side and the document side. The gap on the query side is due to the difficulty humans have in writing structured queries. The gap on the document side is due to the difficulty computers have in understanding XML documents. To reduce the semantic gap, we design an XML retrieval system based on the notion of a slot-tree ontology.

The slot-tree ontology is an object-based knowledge representation. In this thesis we develop the slot-tree ontology to represent the inner structure of an object. We then introduce a slot-filling algorithm that maps XML documents into the slot-tree ontology in order to capture their semantics. After that, we design an XML retrieval system based on the slot-tree ontology and the slot-filling algorithm. The system includes a slot-based query interface, a semantic retrieval model for XML, and a program that extracts summaries for browsing.

Since constructing a slot-tree is not an easy job, we also develop a slot-mining algorithm that constructs the slot-tree automatically. The slot-mining algorithm is a statistical approach based on correlation analysis between tags and words. Highly correlated terms are filled into the slot-tree as values. This algorithm eases the construction of the slot-tree.

Two XML collections, one on butterflies and the other on proteins, are used as test-beds for our XML retrieval system. We found that the system is easy to use and performs well in retrieval effectiveness and quality of browsing. Furthermore, the slot-mining algorithm can fill important words into each slot. However, the mining results should be revised manually to improve the quality of the slot-tree.

Finally, we summarize our contributions to XML retrieval and compare our methods with some other methods. A qualitative analysis is given in the last chapter. We also suggest directions for further research.


XML Retrieval - A Slot-Filling Approach

Ph.D. Dissertation

Chen, Chung Chen

Department of Computer Science and Information Engineering

National Taiwan University

Taipei, Taiwan

E-mail : [email protected]

Advisor : Jieh Hsiang

23 July 2002


Contents

Part 1 : Tutorial of This Thesis
1 Introduction
1.1 Motivation
1.2 Research problems
1.3 Research approaches
1.4 Outline of this thesis
2 Background – XML and Information Retrieval
2.1 XML
2.2 Information retrieval
2.3 XML querying and retrieval
2.4 Using ontology to help the XML retrieval process
2.5 Discussion

Part 2 : Slot-Tree Based Methods for XML Retrieval
3 Slot-Tree Ontology and Slot-Filling Algorithm
3.1 Introduction
3.2 Slot-tree ontology
3.3 Slot-filling algorithm
3.4 Discussion
4 An Ontology Based Approach for XML Querying, Retrieval and Browsing
4.1 Introduction
4.2 XML documents
4.3 Indexing structure
4.4 Query language and query interface
4.5 Ranking strategy
4.6 Browsing XML documents
4.7 Discussion
5 The Construction of Slot-Tree Ontology
5.1 Introduction
5.2 Background
5.3 The process of building a slot-tree
5.4 Slot-mining algorithm
5.5 Discussion

Part 3 : Case Studies
6 Case Study - A Digital Museum of Butterflies
6.1 Introduction
6.2 The representation of butterflies in XML
6.3 Slot-tree ontology for butterflies
6.4 Query interface
6.5 Slot-filling algorithm
6.6 XML retrieval
6.7 Slot-mining algorithm
6.8 Discussion
7 Case Study - Protein Information Resource
7.1 Introduction
7.2 The representation of proteins in XML
7.3 Slot-tree ontology for proteins
7.4 Query interface
7.5 Slot-filling algorithm
7.6 XML retrieval
7.7 Slot-mining algorithm
7.8 Discussion

Part 4 : Conclusions
8 Conclusions and Contributions
8.1 Comparison
8.2 Contributions
8.3 Discussion and future work

Reference
Appendix 1 : A Museum of Butterflies in Taiwan
Appendix 2 : Protein Information Resource


Part 1 : Tutorial of This Thesis

1 Introduction

This thesis introduces an information retrieval (IR) method for XML. A major problem in information retrieval is that computers cannot understand documents as well as people can. We call this the semantic gap problem. Our goal is to build an information retrieval system that reduces the semantic gap between humans and computers for XML. Our approach uses an ontology to support the search process for XML, including querying, retrieval and browsing. Section 1.1 presents our motivation, section 1.2 states our research problems, section 1.3 describes our research approaches, and section 1.4 outlines the rest of the thesis.

1.1 Motivation

Extensible Markup Language (XML) [XML98] is a standard for encoding semi-structured documents. XML is useful for data representation, data exchange and data publishing on the web. Many people believe that XML will become a widespread standard. For this reason, XML has gained much attention in both the information community and the field of database research.

XML is a markup language with extensible tags. Anyone may define their own markup language based on XML. In fact, hundreds of XML-based specifications were proposed from 1997 to 2002. These specifications are designed to meet the needs of particular domains or applications. For example, Protein Information Resource (PIR) (http://pir.georgetown.edu/) is an XML collection designed to record data about proteins, and UDDI [UDDI00] is an XML specification designed to record the profiles of business companies.

XML is designed to be easily understood by both humans and computers. It is encoded in text format so that humans can read and understand it easily. Tags in XML provide the semantic background a computer needs to "understand" the content correctly. XML can therefore serve as a bridge between human writing and computer understanding.

A smart computer program that understands XML documents would be useful. However, building a program that "understands" XML documents is still very difficult. In this thesis, we propose methods that let a computer "understand" XML documents.

The natural language processing (NLP) community has long focused on processing and understanding natural language documents [Grosz86]. However, understanding natural language documents is still very difficult for computer programs, and no existing approach is powerful enough to solve the understanding problem. Building a smart program that understands natural language text is very difficult because of the "semantic gap", which can be stated as follows.

"Computers cannot understand natural language as well as humans can."

The semantic gap causes difficulties for information retrieval systems. For example, an information retrieval system cannot understand our natural language queries, and so it retrieves many documents that are not semantically related to them.

An information retrieval system faces two semantic gaps, one in understanding queries and the other in understanding documents. These gaps are listed below.

Gap 1 : "Computers cannot understand queries as well as humans can."

Gap 2 : "Computers cannot understand documents as well as humans can."

Figure 1.1 : Semantic gaps of natural language

To reduce the semantic gap, researchers in the NLP community have been trying hard to answer the following question.

"How can we make computers understand natural language?"


However, natural language is currently too difficult for computer programs to understand. Although many people have devoted themselves to the problem for more than thirty years, designing a computer program that understands natural language is still an open research problem.

Since computers do not understand natural language well, why not design a structured language that is easy for computers to understand and easy for humans to write? Such a language would be a common language between humans and computers. People could write documents in this language, and we could build computer programs that understand them.

XML is such a language in that it is easy for humans to write. However, we have no method that lets a computer understand XML documents easily. If we can design such a program, we can reduce the semantic gap for XML, so that XML can act as a bridge between humans and computers.

In this thesis, our goal is to reduce the semantic gap for XML. Our approach is to design methods that let a computer understand XML documents. Our research problem is described in the next section.

1.2 Research problems

XML is a markup language with extensible tags. People have to understand the tags before they can write XML documents. If there are too many tags for a writer to remember, XML documents become hard to write. If a writer has to mark up every word in a document, writing is hard as well. On the other hand, if a writer marks documents up only roughly, they are difficult for a computer to understand. We call this tradeoff between human writing and computer understanding the "human-computer dilemma of XML".

Our goal is to design an XML retrieval system that resolves the "human-computer dilemma of XML". For an XML retrieval system, there are two semantic gaps between humans and computers, one on the query side and one on the document side. Figure 1.2 shows these two gaps.


Figure 1.2 : Semantic gaps of XML

On the document side, an XML document may be easy for a human to write but not so easy for a computer to understand. This is the case when the document contains a lot of natural language text. Example 1.1 shows an XML document that contains natural language text in the "color" block and the "size" block, which makes it hard for a computer to understand.

Example 1.1 : An XML document that is not easy for a computer to understand

<butterfly name="kodairai">
  <color>with black wing and white spots on it</color>
  <size>middle size butterflies, from 50mm to 60mm</size>
</butterfly>

Conversely, an XML document may be easy for a computer to understand but not so easy for a human to read and write. This is the case when every word is marked up. Example 1.2 shows such a document.

Example 1.2 : An XML document that is not easy for a human to read and write

<butterfly name="kodairai">
  <color><wing>black</wing><texture>white spot</texture></color>
  <size>
    <classification>middle size</classification>
    <from>50mm</from><to>60mm</to>
  </size>
</butterfly>

The same thing happens on the query side. An XML query may be easy for a human to write but not so easy for a computer to understand, as is the case when the query is written in natural language. Example 1.3 shows such a query.

Example 1.3 : An XML query that is not easy for a computer to understand

<butterfly>in black color with white spots</butterfly>

Conversely, an XML query may be easy for a computer to understand but not easy for a human to read and write. A structured XML query is of this kind. Example 1.4 shows such a query.

Example 1.4 : An XML query that is not easy for a human to read and write

for $b in //butterfly
where $b/color = "black" and $b/texture = "white spots"
return $b

Two approaches can be used to reduce the semantic gap between humans and computers for XML. The first is to build computer programs that understand XML documents or queries. The second is to build tools that help humans write XML documents or queries.

We adopt the first approach on the document side and the second approach on the query side. That is, we build a computer program that understands roughly tagged XML documents, and we build a tool that lets humans write XML queries easily. The following section describes our approach.

1.3 Research approaches

In this thesis, we build an XML retrieval system that reduces the semantic gap between humans and computers for XML. An ontology called the slot-tree supports the retrieval process. A user can write queries easily through the query interface, and the slot-tree ontology also helps the computer to understand XML documents. Figure 1.3 shows a scenario of our XML retrieval system.


Figure 1.3 : A scenario of our XML retrieval system.

On the document side, we build a computer program that understands XML documents. The "understanding" process is based on an ontology called the slot-tree. A slot-tree is a frame-like representation with embedded XPath [XPATH99] expressions. To make the computer understand XML documents, we designed a slot-filling algorithm that maps XML documents into the slot-tree.
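The idea of mapping a document into slots can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the slot-filling algorithm presented in chapter 3: the dictionary layout of `SLOT_TREE`, the candidate value lists, and the function name `fill_slots` are all hypothetical, and matching is done by simple word lookup on the butterfly document of Example 1.1.

```python
import xml.etree.ElementTree as ET

# Hypothetical slot-tree: each slot holds a path (in the XPath subset that
# ElementTree supports) and the value terms it may be filled with.
# This layout is an assumption made for the sketch.
SLOT_TREE = {
    "color": {"path": "./color", "values": ["black", "white", "yellow"]},
    "size": {"path": "./size", "values": ["small", "middle", "large"]},
}

# The roughly tagged document of Example 1.1.
DOC = """<butterfly name="kodairai">
  <color>with black wing and white spots on it</color>
  <size>middle size butterflies, from 50mm to 60mm</size>
</butterfly>"""


def fill_slots(xml_text, slot_tree):
    """Fill each slot with the candidate values found in the located nodes."""
    root = ET.fromstring(xml_text)
    filled = {}
    for slot, spec in slot_tree.items():
        # Locate the nodes for this slot, then collect their text.
        text = " ".join(node.text or "" for node in root.findall(spec["path"]))
        words = text.lower().split()
        # Keep only the candidate values that occur as whole words.
        filled[slot] = [value for value in spec["values"] if value in words]
    return filled


FILLED = fill_slots(DOC, SLOT_TREE)
```

In this sketch the free text inside the "color" and "size" blocks is reduced to a few slot values, which is the kind of structure a retrieval system can match against a query.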

On the query side, we build a query interface that lets humans write queries easily. The interface is built by transforming the ontology into a web page. A user builds a structured query simply by choosing or typing values into slots.
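As a rough illustration of how slot values could be turned into a structured query, the Python sketch below builds a query string of the same shape as Example 1.4. The function name `build_query` and the choice of output syntax are assumptions made for this sketch. The query language actually generated by the interface is described later in the thesis and may differ.

```python
def build_query(root_tag, slot_values):
    """Turn {slot name: chosen value} into a structured query string.

    Hypothetical helper: the output imitates the query of Example 1.4.
    """
    conditions = [
        f'$b/{slot} = "{value}"'
        for slot, value in slot_values.items()
        if value  # a slot left empty adds no condition
    ]
    query = f"for $b in //{root_tag}"
    if conditions:
        query += "\nwhere " + " and ".join(conditions)
    return query + "\nreturn $b"


# The user fills two slots and leaves the third one empty.
QUERY = build_query(
    "butterfly", {"color": "black", "texture": "white spots", "size": ""}
)
```

The user only supplies the slot values, and the structured syntax is produced mechanically, which is the sense in which the interface reduces the query-side gap.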

In our approach, the slot-tree ontology is a key component for both understanding documents and building queries. It mediates between queries and documents in the retrieval process, and so reduces the semantic gaps on both the query side and the document side.

However, building the slot-tree ontology is not an easy job, and the ontology constructor needs tools to do it. We call the problem of constructing a slot-tree automatically from a set of XML documents the slot-mining problem. It can be stated as follows.

"How can we mine the slot-tree ontology from a collection of XML documents?"


To handle the slot-mining problem, we developed a statistical method that builds the slot-tree automatically. The method, called the slot-mining algorithm, is based on correlation analysis between tags and terms in XML documents.

1.4 Outline of this thesis

This thesis is divided into four parts: the "tutorial part", the "methods part", the "case study part" and the "conclusion part".

Part 1 sets the stage for all the others. Chapter 1 outlines the research problems and approaches. Chapter 2 reviews the background literature for our research, "Designing an XML retrieval system to reduce the semantic gap problem".

Part 2 is a detailed description of our methods. They are based on a knowledge representation structure called the slot-tree, which is used to capture the semantics of XML documents and so helps our XML retrieval system understand them.

Chapter 3 presents the syntax and semantics of the slot-tree ontology, together with the slot-filling algorithm, a method that uses the slot-tree to capture the semantics of XML documents. Chapter 4 outlines an XML information retrieval system based on the slot-tree, in which the slot-tree ontology and the slot-filling algorithm are used to reduce the semantic gap of XML retrieval. Chapter 5 describes the process of constructing a slot-tree ontology. It first outlines the construction steps and then proposes a method that constructs the slot-tree automatically. The method is a statistical program called the slot-mining algorithm, which mines slot-trees from XML documents based on correlation analysis between tags and terms. It helps people construct the slot-tree ontology for a given XML collection.

Part 3 is the test-bed for the slot-tree based approach, which is examined with two cases. Chapter 6 presents the first case, a collection of XML documents in Chinese about butterflies in Taiwan. Chapter 7 presents the second case, the Protein Information Resource (PIR), a large set of XML documents released by Georgetown University. The experiments on these two cases are used to analyze the strengths and weaknesses of the slot-tree based approach.

Part 4 is the conclusion. Chapter 8 analyzes the strengths of the slot-tree based approach. We compare the slot-tree based methods with some other XML retrieval methods, and point out our contributions, conclusions and future work.


2 Background – XML and Information Retrieval

In chapter 1, we introduced our motivation, goals and research approaches. Briefly, we would like to build an XML retrieval system that reduces the semantic gap between humans and computers for XML. In this chapter, we survey related research to provide background for our work. Since our approach uses a slot-tree ontology to support the XML retrieval process, we cover the topics of XML, information retrieval and ontology.

Section 2.1 surveys XML-related specifications and technologies. Section 2.2 surveys information retrieval technologies. Section 2.3 surveys the current status and state of the art in XML retrieval. Finally, section 2.4 outlines the relationship between ontology and XML retrieval.

2.1 XML

We have to understand XML in order to build an XML retrieval system that reduces the semantic gap. In this section, we survey XML-related specifications and technologies, especially the literature on knowledge representation and information retrieval.

XML was proposed by the World Wide Web Consortium (W3C) (http://www.w3c.org) in 1998. It is a tree-structured markup language with extensible tags. The following example is an XML document for a phonebook.

Example 2.1 : An XML document

<?xml version="1.0"?>
<!DOCTYPE phonebooks SYSTEM "phonebook.dtd">
<phonebooks xmlns="http://www.ntu.edu.tw/phonebook">
  <people id="001">
    <name>Johnson Chen</name>
    <tel>02-34134345</tel>
  </people>
  <people id="002">
    <name>Fanny Chen</name>
    <tel>02-33451294</tel>
  </people>
</phonebooks>


In example 2.1, the first line, <?xml version="1.0"?>, indicates that this is an XML document. The second line is the document type definition (DTD) part of the document. A DTD is used to validate the syntax of XML documents. The DTD part is optional and can be removed to skip syntax validation.

The third line, with the "phonebooks" tag, is the root node of the document. An XML document has one and only one root node. In this line, xmlns="http://www.ntu.edu.tw/phonebook" declares the default namespace of the document. Namespaces [XMLNS99] in XML are used to distinguish tags with the same name from each other, so that people can define their own tags and use other people's tags without worrying about the same tag name carrying different meanings.

A node in XML contains a tag, attributes and text. In the example above, "phonebooks", "people", "name" and "tel" are tags, and "xmlns" and "id" are attributes with the values "http://www.ntu.edu.tw/phonebook" and "001". "Johnson Chen" and "02-34134345" are text parts.

XPath [XPATH99] is a specification for locating nodes in XML documents. To locate all the "people" nodes, we can use the XPath expression "//people". The "//" operator matches descendant nodes at any depth. To locate the "people" node with id = "001", we can use the XPath expression "//people[@id='001']". The "@" symbol indicates that "id" is an attribute name. XPath is used in the slot-tree ontology discussed in chapter 3. We embed XPath expressions in the slot-tree to locate nodes in XML and use them to map XML documents into the slot-tree ontology.
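The two XPath expressions can be tried out with a short Python sketch based on the standard library module `xml.etree.ElementTree`. Two points are assumptions of the sketch rather than part of the thesis. First, ElementTree supports only a subset of XPath, so the expressions are written relative to the root as ".//people" instead of "//people". Second, because the phonebook declares a default namespace, the tag names must be qualified, and the prefix "pb" is an arbitrary alias chosen here. The XML declaration and the optional DTD line are omitted.

```python
import xml.etree.ElementTree as ET

# The phonebook of Example 2.1, without the declaration and DOCTYPE lines.
PHONEBOOK = """<phonebooks xmlns="http://www.ntu.edu.tw/phonebook">
  <people id="001">
    <name>Johnson Chen</name>
    <tel>02-34134345</tel>
  </people>
  <people id="002">
    <name>Fanny Chen</name>
    <tel>02-33451294</tel>
  </people>
</phonebooks>"""

# "pb" is a local alias for the document's default namespace (an assumption).
NS = {"pb": "http://www.ntu.edu.tw/phonebook"}

root = ET.fromstring(PHONEBOOK)

# Counterpart of "//people": every people node below the root.
all_people = root.findall(".//pb:people", NS)

# Counterpart of "//people[@id='001']": the people node whose id is 001.
person_001 = root.findall(".//pb:people[@id='001']", NS)

names = [person.find("pb:name", NS).text for person in all_people]
```

The parsed root tag is stored as "{http://www.ntu.edu.tw/phonebook}phonebooks", which shows how the namespace keeps this "phonebooks" tag distinct from a tag of the same name defined elsewhere.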

Many XML-related specifications have been proposed since 1997. XML has become a widespread specification used in many domains and applications, such as "data exchange", "data presentation", "data querying" and "knowledge representation". For data exchange, UDDI and ebXML are used to mediate the exchange of data between business enterprises. For data presentation, XSLT can be used to transform XML into HTML for presentation on the web. For data querying, XQL, XML-QL and XQuery are used to query data in XML documents. For knowledge representation, RDF/RDFS, DAML/DAMLS and XML topic maps have been proposed to represent knowledge in XML format. We survey the data querying specifications in section 2.3, which discusses XML querying and retrieval, and the knowledge representation specifications in section 2.4, which discusses ontology.

2.2 Information retrieval

To build an XML retrieval system that reduces the semantic gap, we have to understand information retrieval technologies and how natural language understanding technologies can be used to reduce the semantic gap of XML.


The evolution of IR techniques is closely related to the structure of the target documents. Each time a new document structure was proposed, a new IR technique was developed. Between 1970 and 1980, the Vector Space Model was developed to retrieve text documents. Between 1990 and 1999, the Random Walk Model was developed to retrieve HTML documents. Today, XML documents are becoming widespread, and many researchers are trying to develop new retrieval models for XML.

Text Retrieval

Text retrieval technology is almost as old as computer technology, and there are many models for it. The best known is the Vector Space Model (VSM) [Salton75]. In this model, each document is represented by a k-dimensional vector of term weights, where k is the number of index terms in the collection. A plain text document d is expressed as follows.

$d = (d_{t_1}, d_{t_2}, \ldots, d_{t_k})$, where $d_{t_i}$ is the weight of term $t_i$ in document d

The order of words in the text is discarded.

A query q is also represented by a k-dimensional vector of term weights.

$q = (q_{t_1}, q_{t_2}, \ldots, q_{t_k})$, where $q_{t_i}$ is the weight of term $t_i$ in query q

The cosine coefficient is a popular measure of the similarity between a document and a query. It is defined as the cosine of the angle between the document vector d and the query vector q.

$$\mathrm{Similarity}(d, q) = \frac{d \cdot q}{\|d\| \, \|q\|} = \frac{\sum_{i=1}^{k} d_{t_i} q_{t_i}}{\sqrt{\sum_{i=1}^{k} d_{t_i}^{2}} \, \sqrt{\sum_{i=1}^{k} q_{t_i}^{2}}}$$
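The cosine measure can be written in a few lines of Python. This is a minimal sketch that assumes both vectors are plain lists of weights over the same k index terms. The function name and the convention of returning 0.0 for an all-zero vector are choices made here, not part of the model.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    if len(d) != len(q):
        raise ValueError("d and q must have the same number of terms")
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_d * norm_q)


DOC = [2.0, 1.0, 0.0]    # weights of three index terms in a document
QUERY = [1.0, 0.0, 0.0]  # the query mentions only the first term
SCORE = cosine_similarity(DOC, QUERY)
```

Because the measure depends only on the angle, a long document and a short one with the same proportions of terms receive the same score for a given query.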

One question is how to set the weights $d_{t_i}$ and $q_{t_i}$ in the vector space model. "tfidf" is a simple and commonly used weighting function. It is defined as the product of the term frequency (tf) and the inverse document frequency (idf).

Term frequency (tf) : tf(t, d) is the number of occurrences of term t in document d.

Document frequency (df) : df(t) is the number of documents that contain term t.

Inverse document frequency (idf) : idf(t) = log(N / df(t)), where N is the number of documents in the collection.

For a given document d, $d_{t_i}$ = tfidf($t_i$, d) = tf($t_i$, d) * idf($t_i$)

For a given query q, $q_{t_i}$ = tfidf($t_i$, q) = tf($t_i$, q) * idf($t_i$)

The SMART system experiments led by Salton [Salton88] show that the "tfidf" term weighting function performed best among the 287 distinct combinations of term-weighting assignments he tested. The "tfidf" weighting function has thus been shown empirically to be a good measure for the vector space model.
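The tf, df and idf definitions translate directly into Python. The sketch below is illustrative only: the tiny three-document collection, the whitespace-free token lists and the helper names `idf_table` and `tfidf_vector` are assumptions, and the natural logarithm is used because the text does not fix a base.

```python
import math
from collections import Counter


def idf_table(docs):
    """idf(t) = log(N / df(t)) for every term in the collection."""
    n_docs = len(docs)
    # df(t): number of documents containing t, so count each term once per doc.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf_vector(tokens, idf, vocabulary):
    """tf * idf weights in vocabulary order; unknown terms get weight 0."""
    tf = Counter(tokens)
    return [tf[term] * idf.get(term, 0.0) for term in vocabulary]


# A made-up collection of three tokenized documents.
DOCS = [
    ["black", "wing", "white", "spot"],
    ["yellow", "wing"],
    ["black", "black", "tail"],
]
IDF = idf_table(DOCS)
VOCAB = sorted(IDF)
VECTORS = [tfidf_vector(doc, IDF, VOCAB) for doc in DOCS]
QUERY_VECTOR = tfidf_vector(["black", "tail"], IDF, VOCAB)
```

A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to the similarity, while a rare term such as "tail" in this collection receives the largest weight.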

HTML Retrieval

The main issue in HTML retrieval is measuring the importance of a document. An HTML retrieval system retrieves the documents that match the query and then sorts them by importance. On the web there are too many matching documents to read, and the importance measure helps the user decide what to read.

Documents on the web differ from a plain text collection because of their hyperlink structure. The importance of an HTML document is measured by hyperlink analysis, which historically grew out of citation analysis. A simple strategy is to count the number of hyperlinks that point to a page: a web page referenced by many other pages is important.

In 1998, a random walk model for weighting the importance of web pages was proposed [Brin98][Page98] and was then used in the Google search engine. In the random walk model, a page is important if it is cited by many important pages. Formally, each web page d has a weight w(d), and an iterative process recalculates the weights in each iteration, where E is the set of hyperlinks and (q, p) denotes a link from page q to page p.

$$w(p) \leftarrow \sum_{q : (q, p) \in E} w(q)$$

Conceptually, the random walk model simulates a person clicking through web pages at random. The random walker chooses a start page at random, then randomly follows a link on that page, and repeats the process on each page visited. In the random walk model, an important page is visited with high probability.
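The iterative update can be sketched in Python as follows. The sketch applies the recurrence as stated, summing the weights of the pages that link to p, and adds one assumption of its own: after each iteration the weights are rescaled to sum to 1 so that they stay bounded. The three-page graph and the function name are made up for illustration. The ranking used by Google additionally divides each weight by the number of outgoing links and applies a damping factor, which this sketch leaves out.

```python
def link_weights(pages, links, iterations=50):
    """Iterate w(p) <- sum of w(q) over links (q, p), rescaled to sum to 1."""
    weights = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        updated = {page: 0.0 for page in pages}
        for source, target in links:
            updated[target] += weights[source]
        total = sum(updated.values())
        if total == 0:
            break  # no links at all: keep the previous weights
        # Rescaling is an added assumption that keeps the weights bounded.
        weights = {page: value / total for page, value in updated.items()}
    return weights


# A made-up web of three pages; ("A", "B") means page A links to page B.
PAGES = ["A", "B", "C"]
LINKS = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
WEIGHTS = link_weights(PAGES, LINKS)
```

In this small graph page C ends up with the largest weight because both A and B link to it, and A ranks above B because it is cited by the heavily weighted page C.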

Kleinberg proposed a hub-authority model to weight the impact of a web page [Kleinberg98]. In this model, web pages are divided into two classes, hub pages and authority pages, and the weights are computed by an iterative process. A hub page is important if it points to many important authority pages. An authority page is important if it is cited by many important hub pages.

Formally, each page d has two weights in the hub-authority model, the hub weight h(d) and the authority weight a(d). An iterative process recalculates h(d) and a(d) in each iteration. Figure 2.1 shows the concept of the hub-authority model.

Figure 2.1 The hub-authority model

A set of web pages D contains a set of hyperlinks E. For each page d in D, h(d) is the hub weight of d and a(d) is the authority weight of d. Initially, we set both h(d) and a(d) to 1/|D|, where |D| is the number of documents in D. After that, h(d) and a(d) are recalculated iteratively with the following recurrence equations.

$$a(p) \leftarrow \sum_{q : (q,p) \in E} h(q)$$

$$h(p) \leftarrow \sum_{q : (p,q) \in E} a(q)$$

The hub-authority model is used to weight the importance of a web page and to decide whether the page is a hub or an authority. Thus, besides weighting importance, the model also provides a mechanism for classifying the type of a web page.
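The two recurrences can be sketched in Python as follows. This is an illustrative sketch, not the implementation of [Kleinberg98]. The function names, the toy link set, the normalisation by the sum of weights, and the choice to update h(d) from the freshly updated a(d) are all assumptions.

```python
def normalise(weights):
    """Scale weights so that they sum to 1 (left unchanged if all are zero)."""
    total = sum(weights.values())
    return {d: v / total for d, v in weights.items()} if total else dict(weights)


def hub_authority(pages, links, iterations=50):
    """Iterate a(p) <- sum h(q) for (q, p) in E and h(p) <- sum a(q) for (p, q) in E."""
    h = {d: 1.0 / len(pages) for d in pages}
    a = {d: 1.0 / len(pages) for d in pages}
    for _ in range(iterations):
        new_a = {d: 0.0 for d in pages}
        for q, p in links:        # link q -> p: p gains authority from hub q
            new_a[p] += h[q]
        a = normalise(new_a)
        new_h = {d: 0.0 for d in pages}
        for p, q in links:        # link p -> q: p gains hub weight from authority q
            new_h[p] += a[q]
        h = normalise(new_h)
    return h, a


# Toy graph (hypothetical): h1 -> x, h1 -> y, h2 -> x
pages = ["h1", "h2", "x", "y"]
links = [("h1", "x"), ("h1", "y"), ("h2", "x")]
hub, auth = hub_authority(pages, links)
```

Here x is the strongest authority because both hubs cite it, and h1 is the stronger hub because it points to both authorities.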

Both the hub-authority model and the random walk model use an iterative approach to decide the importance of a web page. The behavior of the recurrence equations in these models can be studied by convergence analysis based on eigenvalues in linear algebra. The papers of Kleinberg [Kleinberg98] and Page et al. [Page98] discuss the theory of these models further.

2.3 XML Querying and Retrieval

In order to manage XML documents, the database and IR communities have recently focused on storing, indexing, querying, and retrieving XML documents. For storing, database management systems have been extended to support XML documents: one way is to extend a relational database system, and another is to store XML documents in an object-oriented database (OODB) system. For indexing, Patricia tries and inverted files are used to index XML documents. For querying, several XML query languages have been proposed to retrieve XML nodes. For searching, several systems have been designed to search XML documents. In this section, we survey XML query languages and XML retrieval systems.

XML Query Language

Designing query languages is an active research topic for XML. XML query languages are much more complex than text-retrieval and HTML-retrieval queries, and more flexible than database query languages. Many XML query languages have been proposed in recent years, such as Lorel [Loral97], XML-QL [XML-QL98], XML-GL [XML-GL99], and XQuery [XQuery01].

Querying an XML collection is similar to querying a database. In a relational database, we usually query tables using SQL. The following example shows a query that retrieves the name and birthday of United States presidents.

SELECT name, birthday FROM people WHERE nation = 'US' AND job = 'president'

An XML query language has to retrieve nodes in the tree of XML nodes. The following example shows an XQuery query that retrieves the name and birthday of United States presidents.

for $p in //people
let $n := $p/name, $b := $p/birthday
where $p/job = "president" and $p/nation = "US"
return ($n, $b)
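For comparison, the same selection can be sketched in Python with the standard `xml.etree.ElementTree` module. The sample document, its `<root>` wrapper, and the element layout are hypothetical toy data chosen to mirror the query above. They are not taken from any system described in this thesis.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy collection with one <people> element per person.
doc = ET.fromstring("""
<root>
  <people>
    <name>George Washington</name>
    <birthday>1732-02-22</birthday>
    <nation>US</nation>
    <job>president</job>
  </people>
  <people>
    <name>Alan Turing</name>
    <birthday>1912-06-23</birthday>
    <nation>UK</nation>
    <job>mathematician</job>
  </people>
</root>
""")

# Walk every <people> node, keep those that satisfy both conditions,
# and return the (name, birthday) pair of each match.
results = [
    (p.findtext("name"), p.findtext("birthday"))
    for p in doc.iter("people")
    if p.findtext("job") == "president" and p.findtext("nation") == "US"
]
```

The list comprehension plays the role of the for/where/return clauses: it iterates over nodes, filters them, and projects the requested child values.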

XML-GL is a graphical notation used to query XML documents. Figure 2.2 shows an example that retrieves orders shipping books with the title “Introduction to XML” to Los Angeles.

Figure 2.2 An example of XML-GL

XML retrieval systems

Several XML retrieval systems have been proposed in recent years. We survey these systems in this section.

Lore was a pioneering research project on XML retrieval at Stanford University. In this project, an object-oriented database was used to store XML documents, and the XML query language “Lorel” was developed. In addition, a query interface called “DataGuider” was developed for querying XML documents. Figure 2.3 is a screenshot of the DataGuider system.

Figure 2.3 The query interface of DataGuider system

XYZfind is a commercial system that splits the querying process into four steps. The following figures show the retrieval steps of the XYZfind retrieval system.

Step 1 : The user types in a query to start the category searching process.

Step 2 : The XYZfind system finds several related categories. The user has to click the target application.

Step 3 : The user uses the query interface to build a query.

Step 4 : The XYZfind system retrieves XML documents and shows them in the browser.

Figure 2.4 Retrieval steps of the XYZfind system

2.4 Using Ontology to Help the XML Retrieval Process

In order to reduce the semantic gap, we have to survey the technologies used to make computers understand natural language text. The design of XML does not eliminate natural language text from the content of XML documents; natural language text is frequently embedded in them. The natural language understanding technologies used to reduce the semantic gap are therefore still needed for understanding XML documents. In this section, we focus on how to use ontology-based natural language understanding technologies to understand XML documents.

The natural language processing community has been trying to resolve the semantic gap problem for a long time. Natural language understanding is a field that focuses on building computer programs to “understand” natural language text [Grosz86] [Allen94]. However, the word “understanding” is misleading here. Computers do not really understand natural language text as humans do; what computers can do is calculation and symbolic reasoning. Computers “understand” natural language text by mapping the text into an internal representation. The internal representation guides the computer to perform symbolic reasoning and to act as if it knew the meaning of the text.

Alan Turing designed the Turing Test [Turing50] to test whether a computer understands natural language text. For information retrieval, we adopt a definition similar to the Turing Test. If a computer program retrieves what we want, discards what we do not want, and organizes the retrieval results into a form we like to browse, then we say that the program understands the documents and our queries. A computer that can do what we would like it to do is a smart computer. A retrieval system that retrieves only what we want and organizes the results the way we like is a smart retrieval system.

A data structure called an ontology, which represents the concepts in the human mind, is used in the process of understanding. Generally speaking, understanding is the process of mapping natural language text into the ontology. After the mapping, the computer can act based on it. This is the style of computer “understanding”.

An ontology may be represented in different structures. The research topic that focuses on the structure of ontology is called knowledge representation [Brachman85a]. Roughly speaking, there are two approaches to representing knowledge and ontology: the logic-based approach and the object-based approach. We introduce and compare these two approaches below, since they form the basis of the slot-tree ontology discussed in chapter 3.

The logic-based approach encodes knowledge into logic statements for reasoning. It includes propositional logic, first-order logic, probabilistic logic, etc. Prolog is the best-known programming language based on logic. Starting from the predefined logic statements, a reasoning process is used to infer new true statements that were not stated explicitly.

First-order logic is a powerful theory for representing knowledge and deriving conclusions. It is a monotonic logic system whose expressions contain predicates and quantifiers. In first-order logic, we use logical statements to represent the ontology. The following example shows logic statements that describe the inheritance relationship between butterfly, insect and animal.

is(butterfly, insect)

is(insect, animal)

∀x, y, z: is(x,y) ∧ is(y,z) → is(x,z)

The power of first-order logic lies in its ability to reason monotonically. “Monotonic reasoning” means that a conclusion, once made, will never be retracted in the future. This requires that facts, rules and conclusions be 100% certain in the first-order logic reasoning process. The following example shows a reasoning process for the example above, which infers that “butterfly is a kind of animal”.

∀x, y, z: is(x,y) ∧ is(y,z) → is(x,z)    (bind x to butterfly, y to insect, z to animal)
-----------------------------------------------------------------------------------------------
is(butterfly,insect) ∧ is(insect,animal) → is(butterfly,animal)
-----------------------------------------------------------------------------------------------
conclusion : is(butterfly, animal)
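The derivation can also be mimicked mechanically. The following Python sketch repeatedly applies the transitivity rule to a set of is(x, y) facts until no new fact appears. The function name and the set-of-pairs encoding are illustrative assumptions, not part of any logic system cited here.

```python
def infer_is(facts):
    """Close a set of (x, y) pairs, read as is(x, y), under transitivity."""
    closure = set(facts)
    changed = True
    while changed:                      # repeat until a fixed point is reached
        changed = False
        for x, y in list(closure):
            for y2, z in list(closure):
                # is(x, y) and is(y, z) together give is(x, z)
                if y == y2 and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    return closure


facts = {("butterfly", "insect"), ("insect", "animal")}
derived = infer_is(facts)
```

Because facts are only ever added and never removed, the sketch also shows the monotonic character of this kind of reasoning.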

A difficulty is that many uncertain situations are encountered in the natural language understanding process, so the 100% certainty required by first-order logic cannot always be assured. Probabilistic logic and fuzzy logic were developed to handle uncertainty. However, the monotonic property is lost in uncertain reasoning.

Having reviewed the logic-based approach, we now introduce the object-based approach. The object-based approach comprises a set of representation methods, including frames, semantic networks and scripts. Generally speaking, frames are used to represent the internal structure of an object, semantic networks are used to represent the relations between objects, and scripts are used to describe an active scenario involving many objects.

The frame was proposed by Minsky in 1975 [Minsky75] in the seminal paper “A framework for representing knowledge”. A frame is a representation method that organizes knowledge into chunks. Minsky did not formalize the frame concept into a mathematical model; he explicitly argued in favor of staying flexible and informal. Later, several AI systems were built on the frame representation, such as the KL-ONE system [Brachman85b] and the KRL language [Bobrow77].

Generally speaking, a frame describes the internal structure of an object. Frames are composed of slots (attributes) for which fillers (scalar values, references to other frames, or procedures) have to be specified or computed. A slot can be expressed as a tuple of the form (object, slot, filler). It is easy to transform such a tuple into a logic predicate of the form slot(object, filler).

A frame that inherits from another frame is called a sub-frame. Inheritance may be expressed as the “is” relation between frames, in the form is(object, object), and it organizes frames into a hierarchy. Because frames organize statements into object-based structures, they are easy for humans to read and write. The concept was later adopted by object-oriented programming languages to make programs easier to write. The following example shows a frame for “kodairai”, a species of butterfly.

<object name="kodairai">
  <is>butterfly</is>
  <texture>eyespots</texture>
</object>
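The correspondence between frames, tuples and predicates can be made concrete with a short Python sketch. It reads the frame above into (object, slot, filler) tuples and prints each tuple as a predicate. The helper names `frame_to_tuples` and `tuple_to_predicate` are hypothetical.

```python
import xml.etree.ElementTree as ET

FRAME_XML = """
<object name="kodairai">
  <is>butterfly</is>
  <texture>eyespots</texture>
</object>
"""


def frame_to_tuples(xml_text):
    """Read a frame and return one (object, slot, filler) tuple per slot."""
    obj = ET.fromstring(xml_text)
    name = obj.get("name")
    return [(name, child.tag, child.text) for child in obj]


def tuple_to_predicate(triple):
    """Rewrite (object, slot, filler) as the predicate slot(object, filler)."""
    obj, slot, filler = triple
    return f"{slot}({obj}, {filler})"


tuples = frame_to_tuples(FRAME_XML)
predicates = [tuple_to_predicate(t) for t in tuples]
```

The “is” slot becomes the inheritance predicate is(kodairai, butterfly), which is the same form used in the first-order logic example.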

Semantic networks concentrate on categories of objects and the relations between them [Quillian66] [Wood75]. The basic idea of a semantic network is to draw graphs that represent the relationships between objects. In these graphs, a link may be represented as a tuple of the form (object, relation, object). It is easy to transform such a tuple into a logic predicate of the form relation(object, object).

Scripts are used to describe a scenario involving many objects [Schank77]. The steps in the scenario are organized as a lattice: a step may be triggered when its preceding steps are finished. For example, the following script shows the process of making a cup of coffee.

1. Put an empty cup on the table. put_on(cup, table)

2. Put coffee powder into the cup. put_into(coffee powder, cup)

3. Fill hot water into the cup. fill(hot water, cup)

4. Mix the powder and the water with a spoon. mix_by_spoon(powder, water)

5. Process finished.

In fact, object-based representations can be translated into logic rules. The difference between the logic-based and the object-based representation lies in the organization principle. The logic-based representation encodes knowledge into logic expressions, whereas the object-based representation organizes these expressions into frames, semantic networks and scripts.

Reasoning is not a standardized part of object-based systems [Ifikes85]. The information stored in frames has often been treated as the “database” of the knowledge system, whereas the control of reasoning has been left to other parts of the system. The most popular and effective reasoning mechanism for frames is production rules [Stefik83] [Kehler84]. Production rules are rules of the form pattern/action. They are a subset of predicate calculus with an added prescriptive component indicating how the information in the rules is to be used during reasoning. Whenever a pattern is matched, the production system triggers the corresponding frame, and the action is performed to help the “understanding” process. After the pattern/action process, some values are filled into frames as the conclusion. We call the reasoning process that maps natural language text into the slot-tree ontology in an object-based system “the slot-filling process”.
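A pattern/action rule set of this kind can be sketched in a few lines of Python. The sketch assumes that a pattern is a regular expression and that the action is simply "append a value to a slot of the frame". Both choices, and the rules themselves, are illustrative and are not taken from the production systems cited above.

```python
import re

# Each rule pairs a pattern with an action, here given as (slot, value to fill).
RULES = [
    (re.compile(r"\bspots?\b", re.IGNORECASE), ("texture", "spotted")),
    (re.compile(r"\bbrown\b", re.IGNORECASE), ("color", "brown")),
    (re.compile(r"\blines?\b", re.IGNORECASE), ("texture", "lines")),
]


def apply_rules(text, rules):
    """Fire every rule whose pattern matches and fill the frame accordingly."""
    frame = {}
    for pattern, (slot, value) in rules:
        if pattern.search(text):                      # pattern part
            frame.setdefault(slot, []).append(value)  # action part
    return frame


frame = apply_rules("Brown background color, Eye spots in white color", RULES)
```

After the rules have fired, the frame holds the filled values, which corresponds to the conclusion of the pattern/action process described above.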

Both the logic-based and the object-based representation may be used to represent an ontology and to reason over it. Reasoning is helpful but not necessary for computers to understand natural language. However, computers do need a process that maps natural language text into the ontology in order to understand it.

The mapping process for XML documents is easier than that for natural language documents, because tags provide semantic contexts that make the mapping easier. In chapter 3, we propose a slot-filling algorithm that maps XML documents into the slot-tree ontology in order to reduce the semantic gap between human and computer on XML.

2.5 Discussion

In this chapter, we reviewed the research background of XML, information retrieval and ontology. The technology for XML retrieval is not yet mature and needs further research. Indeed, researchers in the information retrieval community have recently been working hard to develop methods for XML retrieval.

In the summary of the ACM SIGIR 2000 workshop on XML and information retrieval, Carmel et al. [Carmel00] discuss several unsolved problems in XML retrieval. We list these problems as follows.

1. Using an XML query language is likely to improve precision. However, XML query languages are not easy for people to use. How can they be made easier to use?

2. A heterogeneous XML collection contains documents coming from different sources, so the tag names and document structures may be different and idiosyncratic. How do we retrieve heterogeneous XML documents?

3. XML is specified using Unicode. The tag names coming from different sources may be given in different languages. Since a word can have more than one translation, or even no translation, finding or making the appropriate translation is an interesting issue for multilingual information retrieval. How do we retrieve multilingual XML documents?

4. Browsing XML retrieval results should be better than browsing text documents. How should the retrieval results be organized for browsing? Is the unit the entire document, a part of the XML tree, or perhaps a graph?

In this thesis, we try to resolve these problems by developing an XML retrieval system. The system is mainly designed to reduce the semantic gap between human and computer. In this system, we develop programs that allow the computer to understand XML documents easily, and that allow humans to write queries and browse query results easily. These methods are based on an ontology representation called slot-tree, and are described in the next part. In chapter 3, we show how to represent a slot-tree and how to map XML documents into it. In chapter 4, we show how to use the slot-tree ontology to help the XML retrieval process. In chapter 5, we design a method to build slot-trees automatically.

Part 2 : Slot-Tree Based Methods for XML Retrieval

3 Slot-Tree Ontology and Slot-Filling Algorithm

In part 1, we introduced our motivation, goals and research approaches in chapter 1, and reviewed the related research on XML, information retrieval and ontology in chapter 2. In part 2, we present our method for reducing the semantic gap in XML retrieval. To reduce the semantic gap, an ontology called slot-tree is used to help the XML retrieval process in our system. This part focuses on the usage of the slot-tree ontology in our XML retrieval system.

Part 2 contains three chapters. In chapter 3, we describe the syntax, semantics and usage of the slot-tree. In chapter 4, we use the slot-tree to reduce the semantic gap in the XML retrieval process. In chapter 5, we show how to construct the slot-tree ontology, and design a mining algorithm to build it automatically.

This chapter contains four sections. In section 3.1, we outline the structure of the slot-tree ontology and its usage in the process of understanding XML documents. In section 3.2, we describe the syntax and semantics of the slot-tree ontology. In section 3.3, we design the slot-filling algorithm, which maps XML documents into the slot-tree ontology and is the core of the understanding process. Finally, we discuss the slot-tree ontology and the slot-filling algorithm in section 3.4.

3.1 Introduction

In this chapter, we design an object-based representation called the slot-tree ontology, and then use the slot-tree to “understand” XML documents. As stated in section 2.4, the word “understand” here means the process of mapping text in XML into the slot-tree. This enables a computer to trigger the corresponding procedure to do what the user would like it to do, such as answering questions or retrieving the documents that the user wants. In this section, we outline the slot-tree ontology and the slot-filling algorithm that maps XML documents into it; the details of the slot-tree are given in section 3.2 and those of the slot-filling algorithm in section 3.3.

The slot-tree representation is an object-based approach that, like a frame, represents the internal structure of objects. In section 2.4 we surveyed the object-based approaches to knowledge representation, including frames, semantic networks and scripts. Generally speaking, frames represent the internal structures of objects, semantic networks represent relations between objects, and scripts represent scenarios that involve many objects. The object-based approach is conceptually consistent with our notion of the world, because we perceive the world as composed of many objects. The difference between a slot-tree and a frame is that a slot in a slot-tree contains a set of paths that locate nodes in XML documents. A path in a slot is in the XPath format described in section 2.1. For example, “//butterfly//color” is used to locate “color” nodes within a “butterfly” block.

In our XML retrieval system, a slot-tree is encoded in XML format like the following

example.

Example 3.1 A simple slot-tree in XML format

<s slot="butterfly" path="//butterfly">
  <s slot="color" path="//butterfly//adult//color">
    <v value="brown"/>
    <v value="white"/>
  </s>
</s>

Based on the slot-tree ontology, we design a slot-filling algorithm that maps XML documents into the slot-tree ontology in the process of understanding. In the slot-filling algorithm, a path in a slot is used, like a hand, to grab a block in the XML document, and a matching process maps the content of the block into the slot. Words in the block that match any value of the slot are filled into the slot. The filled slot-tree that results from the matching process is then used as the semantic structure of the XML document. The slot-tree ontology is detailed in section 3.2 and the slot-filling algorithm in section 3.3.

3.2 Slot-Tree Ontology

In this section, we propose an ontology representation called slot-tree. A slot-tree is an object-based representation that, like a frame (see section 2.4), describes the internal structure of an object. We describe the syntax and semantics of the slot-tree and give examples.

Definition 3.1 : A slot-tree is a tree (T) in which each node contains a tuple $(s, P_s, V_s)$, where $s$ is the name of the slot, $P_s$ is a set of paths, and $V_s$ is a set of values. The name of a slot is a label that uniquely identifies the slot. A path (p) in $P_s$ is a string in XPath format that is used to locate nodes in XML documents. A value (v) in $V_s$ is a term that contains a set of semantically identical words or patterns.

Figure 3.1 shows the structure of a slot-tree. The {p} in each node represents a set of paths and the {v} in each node represents a set of values. For a slot-tree that represents the internal structure of an object, a slot may be used to represent a property of the object, such as “color”, “shape”, “texture” or “size”. A value in the slot is a possible value of the property. For example, “black” is a possible value in the “color” slot.

Figure 3.1. The structure of a slot-tree

A slot-tree can be encoded as an XML document in which each slot is encoded as a node with tag “s”. The attribute “slot” of the node is the label of the slot. The attribute “path” contains a set of paths in XPath format that encodes the {p} part of the slot. Each child node with tag “v” is a value, and together these nodes encode the {v} part of the slot. Example 3.2 shows a slot-tree for butterflies in XML format and figure 3.2 shows the graph representation of the example.

Example 3.2. A slot-tree for butterflies in XML format

<s slot="butterfly" path="//butterfly">
  <s slot="name" path="//butterfly//name"/>
  <s slot="adult" path="//butterfly//adult">
    <s slot="color" path="//butterfly//adult//color">
      <v value="black"/>
      <v value="brown"/>
      <v value="black&amp;white"/>
    </s>
    <s slot="texture" path="//butterfly//adult//color">
      <v value="lines"/>
      <v value="spots"/>
    </s>
  </s>
</s>

Figure 3.2 : The graph representation of slot-tree
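To make the (s, P_s, V_s) structure concrete, the following Python sketch loads a slot-tree encoded in XML into a small in-memory tree. The `Slot` class and the `parse_slot` helper are hypothetical names. Splitting the `path` attribute on “|” to obtain several paths is also an assumption, since the text does not say how multiple paths are separated.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

SLOT_TREE_XML = """
<s slot="butterfly" path="//butterfly">
  <s slot="name" path="//butterfly//name"/>
  <s slot="adult" path="//butterfly//adult">
    <s slot="color" path="//butterfly//adult//color">
      <v value="black"/>
      <v value="brown"/>
    </s>
  </s>
</s>
"""


@dataclass
class Slot:
    name: str                                       # s: the slot label
    paths: list                                     # P_s: XPath strings
    values: list = field(default_factory=list)      # V_s: possible values
    children: list = field(default_factory=list)    # sub-slots


def parse_slot(node):
    """Recursively convert an <s> element into a Slot."""
    return Slot(
        name=node.get("slot"),
        # Assumption: several paths, if present, are separated by "|".
        paths=[p.strip() for p in node.get("path", "").split("|") if p.strip()],
        values=[v.get("value") for v in node.findall("v")],
        children=[parse_slot(child) for child in node.findall("s")],
    )


tree = parse_slot(ET.fromstring(SLOT_TREE_XML))
color = tree.children[1].children[0]
```

Each `Slot` instance corresponds to one node of the graph representation, with `children` playing the role of the edges.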

Formally, the syntax of the slot-tree is defined by the grammar in figure 3.3. A slot (S) contains a label (NAME), a set of paths (P*) and a set of values (V*). A slot may also contain a set of sub-slots (S*). A value (V) contains a label (NAME), a set of keys (KEY*) and a set of matching rules (R*).

S → <s slot="NAME" path="P*"> V* S* </s>

V → <v value="NAME" keys="KEY*" match="R*"/>

NAME → Alphabetical String

KEY → Alphabetical String

where P is a path in XPath format and R is a rule.

Figure 3.3 : The grammar of slot-tree

The symbol “P” used in figure 3.3 is a path in the format of the XML Path Language (XPath). XPath is a specification proposed by the World Wide Web Consortium (W3C) for locating nodes in XML documents. The symbol “/” is used to match child nodes, and the symbol “//” is used to match nodes at any depth inside the current node. A name prefixed with the “@” symbol denotes an attribute. Example 3.3 shows several examples of XPath.

Example 3.3 : Examples of XML path language (XPath)

a. /butterfly/adult/color

b. //insect//color

c. //insect[@type='butterfly']//color

The path of example 3.3.a locates “color” nodes that are children of an “adult” node, where the “adult” node is a child of the “butterfly” root node. The path of example 3.3.b locates any “color” nodes inside the block of an “insect” node. The path of example 3.3.c locates any “color” nodes inside the block of an “insect” node whose “type” attribute has the value ‘butterfly’. For more about XPath, see the XPath specification at http://www.w3.org/TR/xpath.
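The three kinds of path can be tried out with Python's standard `xml.etree.ElementTree` module, which supports only a subset of XPath. In this sketch the paths are written relative to a hypothetical `<collection>` wrapper element (hence the leading “.”), and the sample document is toy data.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document; <collection> is only a wrapper so that the
# relative paths below can reach every <butterfly> and <insect> element.
doc = ET.fromstring("""
<collection>
  <butterfly><adult><color>brown</color></adult></butterfly>
  <insect type="butterfly"><adult><color>white</color></adult></insect>
  <insect type="beetle"><color>black</color></insect>
</collection>
""")

# a. child steps only: color under adult under butterfly
path_a = [n.text for n in doc.findall("./butterfly/adult/color")]

# b. "//" matches color nodes at any depth inside an insect node
path_b = [n.text for n in doc.findall(".//insect//color")]

# c. "[@type='butterfly']" keeps only insect nodes with that attribute value
path_c = [n.text for n in doc.findall(".//insect[@type='butterfly']//color")]
```

A full XPath engine would accept the absolute forms shown in example 3.3 directly; the relative forms are used here only because of the ElementTree subset.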

A rule in the slot-tree is used to match a string in XML. The syntax of a rule (R) is defined by the grammar in figure 3.4. A rule may contain the “&”, “|” and “-” operators. The symbol “E” is an expression that is part of a rule. Each expression is either a single literal “L” or a pattern of the form “L..L”.

R → (R & R)

R → (R | R)

R → E

R → -E

E → L {..L}

Figure 3.4 : The grammar of rules in slot tree

The “&” operator is a logical “and”: a rule “R1 & R2” is satisfied if and only if both R1 and R2 are satisfied. The “|” operator is a logical “or”: a rule “R1 | R2” is satisfied if and only if R1 or R2 is satisfied. The “-” operator is a negation: a rule “-E” is satisfied if and only if E is not matched. The “..” symbol in the syntax of “E” denotes a distant connection: a rule “L1 .. L2” is satisfied if an L1 string is followed by an L2 string within one sentence. The following example shows several rules.

Example 3.4 : Matching rules in slot-tree

a. R = “white & black”

b. R = “lines & -spots”

c. R = “black .. head”

The rule of example 3.4.a matches a sentence such as “a butterfly that is mixed of black and white color” or “a butterfly with white wing and black head”. The rule of example 3.4.b matches a sentence such as “a butterfly with brown lines on wings”, but does not match the sentence “a butterfly with brown lines and white spots on wings”. The rule of example 3.4.c matches a sentence such as “a butterfly with black color on head”, but does not match the sentence “a butterfly with green head and black wings”.
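The semantics of these rules can be sketched as a small evaluator in Python. To keep the sketch short, rules are given as nested tuples instead of being parsed from strings, and literals are matched as case-insensitive substrings. Both are simplifying assumptions, and the function names are hypothetical.

```python
def match_expr(literals, sentence):
    """E -> L {..L}: the literals must occur in this order within the sentence."""
    text = sentence.lower()
    position = 0
    for literal in literals:
        index = text.find(literal.lower(), position)
        if index < 0:
            return False
        position = index + len(literal)   # the next literal must come later
    return True


def match_rule(rule, sentence):
    """Evaluate a rule given as ("and", R, R), ("or", R, R), ("not", E) or ("expr", E)."""
    op = rule[0]
    if op == "and":
        return match_rule(rule[1], sentence) and match_rule(rule[2], sentence)
    if op == "or":
        return match_rule(rule[1], sentence) or match_rule(rule[2], sentence)
    if op == "not":
        return not match_expr(rule[1], sentence)
    if op == "expr":
        return match_expr(rule[1], sentence)
    raise ValueError(f"unknown operator: {op}")


rule_a = ("and", ("expr", ["white"]), ("expr", ["black"]))   # white & black
rule_b = ("and", ("expr", ["lines"]), ("not", ["spots"]))    # lines & -spots
rule_c = ("expr", ["black", "head"])                         # black .. head
```

With these definitions, `match_rule(rule_c, "a butterfly with black color on head")` holds, while the same rule fails on “a butterfly with green head and black wings” because “head” does not follow “black”.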

3.3 Slot-Filling Algorithm

A slot in a slot-tree is a container that may hold several fillers. A filler can be a value of a sub-slot. A slot-filling algorithm is a method for mapping fillers into slots. In this section, we describe how to map an XML document into the slot-tree ontology.

Example 3.5 : An XML document for a butterfly

<butterfly about="Athyma_fortuna_kodairai.jpg">
  <adult>
    <texture>There are some eye spots in each wing</texture>
    <color>Brown background color, Eye spots in white color</color>
    <size>Middle size, 50-60mm</size>
  </adult>
  <geography>
    <taiwan>North-Taiwan, 1000-2000meters mountain area</taiwan>
    <global>Central China Area</global>
  </geography>
</butterfly>

Example 3.6. A slot-tree for butterflies

<s slot="butterfly" path="//butterfly">
  <s slot="name" path="//butterfly//name" type="copy"/>
  <s slot="adult" path="//butterfly//adult">
    <s slot="color" path="//butterfly//adult//color">
      <v value="black"/>
      <v value="brown"/>
      <v value="black&amp;white"/>
    </s>
    <s slot="texture" path="//butterfly//adult//color">
      <v value="lines"/>
      <v value="spots"/>
    </s>
  </s>
</s>

One simple way to fill a value into the corresponding slot is by copying. A copy-slot is a slot with the attribute type=“copy”. The copy-slot is used to extract a value from a specified field. In the slot-filling process for example 3.5, the value “Athyma_fortuna_kodairai” is filled into the “name” slot of example 3.6 simply by copying.

Another way to fill values into slots is by keyword matching. A value is filled into a slot if the value matches a sentence in the target XML document. The following example shows the process of matching the “spotted” value of the “texture” slot against the “color” node of the XML document.

Example 3.7 An example of filling a value into slot by keyword matching

Texture block :

<color>Brown background color, Eye spots in white color</color>

Texture slot :

<s slot="texture" path="//butterfly//texture">
  <v value="single color" keys="single, mono, uniform"/>
  <v value="spotted" keys="spot"/>
  <v value="lines" keys="line"/>
</s>

Matching result :

<s slot="texture" values="spotted"/>

The slot-filling algorithm is designed to fill values into the slots of a slot-tree. In order to “understand” an XML document, we use the slot-filling algorithm to fill the document into the slot-tree. The result is a filled slot-tree, in which each node is filled with values. For a given XML document $d$, $d_s$ is the part of the document covered by slot $s$. The output of the slot-filling algorithm is a set of slot-value pairs $(s, v)$.

$$\text{Slot-Filling}(d, T) = \{\, (s, v) \mid s \in T,\ v \in V_s,\ w(v, d_s) > \varepsilon \,\}$$

The following figure shows the pseudo code of slot-filling algorithm.

Algorithm Slot-Filling(d, T)
    SV = {}
    for each s in T
        ds = {c | (s, p) ∈ M(T), (p, c) ∈ d}
        for each v in s
            if w(v, ds) > ε then put (s, v : w(v, ds)) into SV
        end for
    end for
    return SV

Figure 3.5 : The pseudo code of slot-filling algorithm

The time complexity of the slot-filling algorithm is $\sum_s |d_s| \cdot |V_s|$, where $|d_s|$ is the size of $d_s$ and $|V_s|$ is the number of values in slot $s$.
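The pseudo code can be turned into a runnable Python sketch under a few stated assumptions. The slot-tree is flattened into a list of (slot, path, values) entries. The weight w(v, d_s) is taken to be the number of keyword occurrences in d_s, which is only a stand-in because the weighting function is not defined in this section. The paths are written in the relative form accepted by `xml.etree.ElementTree`, under a hypothetical `<collection>` wrapper. All names are illustrative.

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring("""
<collection>
  <butterfly>
    <adult>
      <texture>There are some eye spots in each wing</texture>
      <color>Brown background color, Eye spots in white color</color>
    </adult>
  </butterfly>
</collection>
""")

# Flattened slot-tree: (slot name, path, {value: [keys]}).
SLOTS = [
    ("color", ".//butterfly//adult//color", {"brown": ["brown"], "black": ["black"]}),
    ("texture", ".//butterfly//adult//texture", {"spotted": ["spot"], "lines": ["line"]}),
]


def weight(keys, blocks):
    """Assumed stand-in for w(v, d_s): count keyword occurrences in d_s."""
    text = " ".join(blocks).lower()
    return sum(text.count(key.lower()) for key in keys)


def slot_filling(doc, slots, epsilon=0):
    """Return (slot, value, weight) for every value whose weight exceeds epsilon."""
    filled = []
    for name, path, values in slots:
        # d_s: the text of every block that the slot's path locates in d
        d_s = ["".join(node.itertext()) for node in doc.findall(path)]
        for value, keys in values.items():
            w = weight(keys, d_s)
            if w > epsilon:
                filled.append((name, value, w))
    return filled


filled = slot_filling(DOC, SLOTS)
```

The two nested loops mirror the pseudo code: the outer loop visits each slot and the inner loop tests each of its values against d_s, which is also where the cost of |d_s| times |V_s| per slot comes from.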

3.4 Discussion

In this chapter, we described the slot-tree ontology in section 3.2 and the slot-filling algorithm in section 3.3. The slot-filling algorithm maps XML documents into the slot-tree ontology in the understanding process. In chapter 4, we use the slot-tree and the slot-filling algorithm to develop an ontology-based XML retrieval method, and use this method to reduce the semantic gap between human and computer.

4 An Ontology-Based Approach for XML Querying, Retrieval and Browsing

In the previous chapter, we presented the slot-tree ontology and its usage. The slot-filling algorithm builds a mapping between the slot-tree and XML documents. This mapping helps our XML retrieval system reduce the semantic gap between human and computer. In this chapter, we outline the relationship between our XML retrieval system and the slot-tree ontology, and show the power of the slot-tree.

In section 4.1, we describe the process of our XML retrieval system and outline its important components. We describe how to represent XML documents for retrieval in section 4.2, and the index structure in section 4.3. The query interface is described in section 4.4 and the ranking strategies in section 4.5. We then show how to organize retrieval results for browsing in section 4.6. Finally, we discuss our XML retrieval system in section 4.7.

4.1 Introduction

Two technologies are needed in the process of searching for documents: retrieving and browsing. Retrieving is the process of selecting documents from a collection. After that, the retrieved documents should be organized for browsing. Browsing is the process of reading and traversing the collection of documents. We usually alternate between retrieving and browsing in a search process. A model that integrates retrieving and browsing may be used to improve the quality of searching.

Our research focuses on using ontology to improve the XML retrieval and browsing process. We

will focus on the following questions in this chapter.

1. How to encode XML documents for retrieval?

2. How to use slot-tree ontology to improve the efficiency of querying?

3. How to use slot-tree ontology to improve the efficiency of retrieval?

4. How to use slot-tree ontology to improve the efficiency of browsing?

Figure 4.1 shows a scenario of our approach to retrieving XML documents. First, a user builds a query by clicking or typing on slots in the query interface, and then submits the query to the XML retrieval system. The retrieval system retrieves XML documents, and then summarizes them for the user to browse.


Figure 4.1 : A scenario of our XML retrieval system

The ontology in figure 4.1 is the slot-tree ontology described in chapter 3. It is the core of our XML retrieval system. The slot-tree ontology is used to build the query interface, retrieve documents and summarize retrieved documents for browsing. XML queries, XML documents and the query interface are the important objects in our system, and retrieval and extraction are its important processes. We introduce these objects and processes in this chapter.

4.2 XML Documents

An XML document is encoded as tree-structured text. Figure 4.2 shows an XML document that describes a butterfly.

<butterfly about="Athyma_fortuna_kodairai.jpg">
  <adult>
    <texture>There are some eye spots in each wing</texture>
    <color>Brown background color, Eye spots in white color</color>
    <size>Middle size, 50-60mm</size>
  </adult>
  <geography>
    <taiwan>North-Taiwan, 1000-2000meters mountain area </taiwan>
    <global>Central China Area</global>
  </geography>
</butterfly>

Figure 4.2 : An XML document of butterfly

For conceptual simplicity, the XML example above is expressed as a sequence of (path,

value) pairs that describe the object.

(butterfly, )

(butterfly@about, Athyma_fortuna_kodairai.jpg)

(butterfly\adult, )

(butterfly\adult\texture, There are some eye spots in each wing)

(butterfly\adult\color, Brown background color, Eye spots in white color)

(butterfly\adult\size, Middle size, 50-60mm)

(butterfly\adult, )

(butterfly\geography, )

(butterfly\geography\taiwan, North-Taiwan, 1000-2000meters mountain area)

(butterfly\geography\global, Central China Area)

(butterfly\geography, )

(butterfly, )

Figure 4.3 : The (path, value) expression of an XML document
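The flattening shown in the figure above can be sketched in Python. This is only an illustration, assuming the standard `xml.etree.ElementTree` parser; the function name `path_value_pairs` and the shortened sample document are hypothetical. Container blocks produce an opening and a closing entry with an empty value, as in the figure, and attributes use the `@` notation.

```python
import xml.etree.ElementTree as ET


def path_value_pairs(xml_text):
    """Flatten an XML document into an ordered list of (path, value) pairs."""
    pairs = []

    def visit(elem, prefix):
        path = prefix + "\\" + elem.tag if prefix else elem.tag
        # Opening entry: leaf elements carry their text, container blocks carry "".
        pairs.append((path, (elem.text or "").strip()))
        for name, value in elem.attrib.items():
            pairs.append((path + "@" + name, value))
        children = list(elem)
        for child in children:
            visit(child, path)
        if children:
            # Closing entry of a container block.
            pairs.append((path, ""))

    visit(ET.fromstring(xml_text), "")
    return pairs


sample = (
    '<butterfly about="kodairai">'
    "<adult><color>brown</color><size>50-60mm</size></adult>"
    "</butterfly>"
)
pairs = path_value_pairs(sample)
```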

The (path, value) expression can be thought of as an object concept model. A “path” specifies a property of an object, and a “value” specifies a value for that property. The object concept model above is a binary relation that may be expressed as path(object, value); that is, a path represents a logical predicate with two arguments. An object in this model is expressed as a set of (path, value) pairs.

Storing Structure

The (path, value) representation does not reflect the tree structure of an XML document. To represent the tree structure, we use a pair of indices to mark the beginning and end of each block. In other words, we extend each (path, value) pair with a (begin, end) pair that identifies the begin node and end node of its block. The butterfly example above is expressed as the following structure.

1, 12 (butterfly, )

2, 2 (butterfly@about, Athyma_fortuna_kodairai.jpg)

3, 7 (butterfly\adult, )

4, 4 (butterfly\adult\texture, There are some eye spots in each wing)

5, 5 (butterfly\adult\color, Brown background color, Eye spots in white color)

6, 6 (butterfly\adult\size, Middle size, 50-60mm)

7, 7 (butterfly\adult, )


8, 11 (butterfly\geography, )

9, 9 (butterfly\geography\taiwan, North-Taiwan, 1000-2000 meters mountain area)

10, 10 (butterfly\geography\global, Central China Area)

11, 11 (butterfly\geography, )

12, 12 (butterfly, )

Figure 4.4 : The storing structure of an XML document

In the example above, each node is preceded by a (begin, end) pair. The begin index of a node is always identical to the ID of the node. A block with the pair (begin, end) covers all nodes from the begin node to the end node. For example, the first block “1, 12 (butterfly, )” covers nodes 1 to 12, and the third block “3, 7 (butterfly\adult, )” covers nodes 3 to 7. In this way, the tree structure of XML is expressed as the cover/covered relations between nodes.

The begin-end pair structure fully captures the hierarchical structure of XML documents. In our XML storage system, we store the (begin, end) pairs in a table instead of storing the document as a tree.
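The (begin, end) numbering can be illustrated with a short Python sketch. It assumes the same node order as the storing structure above (opening node, attributes, children, closing node); the names `storing_structure` and `covers` and the shortened element contents are hypothetical.

```python
import xml.etree.ElementTree as ET


def storing_structure(xml_text):
    """Return rows (begin, end, path, value) in document order."""
    rows = []
    counter = 0

    def visit(elem, prefix):
        nonlocal counter
        path = prefix + "\\" + elem.tag if prefix else elem.tag
        counter += 1
        row = [counter, counter, path, (elem.text or "").strip()]
        rows.append(row)
        for name, value in elem.attrib.items():
            counter += 1
            rows.append([counter, counter, path + "@" + name, value])
        children = list(elem)
        for child in children:
            visit(child, path)
        if children:
            counter += 1
            rows.append([counter, counter, path, ""])  # closing node
            row[1] = counter  # the opening node covers up to its closing node

    visit(ET.fromstring(xml_text), "")
    return [tuple(r) for r in rows]


def covers(a, b):
    """True when block a covers block b in the cover/covered relation."""
    return a[0] <= b[0] and b[1] <= a[1]


doc = (
    '<butterfly about="Athyma_fortuna_kodairai.jpg">'
    "<adult><texture>t</texture><color>c</color><size>s</size></adult>"
    "<geography><taiwan>n</taiwan><global>g</global></geography>"
    "</butterfly>"
)
rows = storing_structure(doc)
```

Because ancestry reduces to interval containment, the rows can be kept in a flat table and a subtree can be selected with a range comparison on the two indices.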

4.3 Indexing structure

Based on the Path Vector Space Model (PVSM, introduced in section 4.5), we index (p, t) pairs, where p is a path and t is a term, instead of single terms (t). There are several data structures for full-text indexing, such as the inverted file, the signature file and the Patricia trie. For simplicity, we use an inverted file as the index structure of our XML retrieval system.

The following example is a simple XML document. We will show how to index it, for both text fields and number fields.

Example 4.4 An XML document for butterfly

<butterfly about="kodairai">
  <adult>
    <color>brown</color>
    <texture>spot</texture>
    <size>50-60mm</size>
  </adult>
</butterfly>

Indexing Text: The following figure shows the inverted file for the example above. The inverted file is currently stored in a relational database.

#path, #term                          #object list
…                                     …
#\butterfly\adult\color, #brown       …, #kodairai, …
…                                     …
#\butterfly\adult\texture, #spot      …, #kodairai, …
…                                     …

Figure 4.5 An example of text index in inverted-file format
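A minimal in-memory sketch of such an inverted file is given below. It is an illustration only: the thesis stores the index in a relational database, whereas this sketch uses a Python dictionary, and the tokenizer (lowercase alphabetic runs) and the second object `example_b` are assumptions added for the example.

```python
import re
from collections import defaultdict


def build_inverted_file(objects):
    """Map each (path, term) pair to the sorted list of objects containing it."""
    index = defaultdict(set)
    for obj_id, pairs in objects.items():
        for path, text in pairs:
            # Assumed tokenizer: lowercase alphabetic runs only.
            for term in re.findall(r"[a-z]+", text.lower()):
                index[(path, term)].add(obj_id)
    return {key: sorted(ids) for key, ids in index.items()}


objects = {
    "kodairai": [
        (r"\butterfly\adult\color", "brown"),
        (r"\butterfly\adult\texture", "spot"),
    ],
    # Hypothetical second object, added only to show a shared posting list.
    "example_b": [(r"\butterfly\adult\color", "brown and white")],
}
index = build_inverted_file(objects)
```

Keying the index on the (path, term) pair keeps a term such as "brown" under the color path separate from the same term under any other path.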

Indexing Number : Traditional full-text indexing technology does not index numbers. In our system, number indexing is important for the browsing process, because the number index lets us sort the search results by a specified numeric field. In the indexing process, we extract numbers from XML documents and put them into a number table as follows.

#object, #path                        Number
…                                     …
#kodairai, #\butterfly\adult\size     50
#kodairai, #\butterfly\adult\size     60
…                                     …

Figure 4.6 An example of number index
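The number table can be sketched as follows. The regular expression used to extract numbers, the rule of sorting by the smallest number found for an object, and the object `example_small` are assumptions made for illustration; the thesis does not specify them.

```python
import re


def build_number_table(objects):
    """Extract every number in every field into (object, path, number) rows."""
    rows = []
    for obj_id, pairs in objects.items():
        for path, text in pairs:
            for token in re.findall(r"\d+(?:\.\d+)?", text):
                rows.append((obj_id, path, float(token)))
    return rows


def sort_by_number(rows, path, descending=False):
    """Order objects by the smallest number indexed under the given path."""
    smallest = {}
    for obj_id, row_path, number in rows:
        if row_path == path:
            smallest[obj_id] = min(number, smallest.get(obj_id, number))
    return sorted(smallest, key=smallest.get, reverse=descending)


SIZE = r"\butterfly\adult\size"
objects = {
    "kodairai": [(SIZE, "50-60mm")],
    "example_small": [(SIZE, "20-30mm")],  # hypothetical object
}
rows = build_number_table(objects)
order = sort_by_number(rows, SIZE)
```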

4.4 Query Language and Query Interface

XML may be used to encode metadata as well as data. Metadata is data that describes other data. We may use metadata to describe objects such as audio, video, people, etc. Based on metadata, we may index images, video and audio in text format, so that objects can be queried by number and text fields in our XML retrieval system.

In our system, we designed a program that transforms the slot-tree into an HTML-based query interface. A template written in Extensible Stylesheet Language Transformations (XSLT) performs the transformation.

In our query interface, a value can be expressed as a string, a range of numbers, or an object. A user may specify the value for a slot simply by clicking a value or an icon in the slot. Our retrieval system is used to retrieve not only text-based documents but also images and video. The following figure shows a query interface for butterflies.


Figure 4.7 : The Query Interface for Butterflies

A user may select a slot with one click, and then select a value in the slot or type keywords into the slot. The user may also specify a field for sorting. When the user presses the submit button, a query is built and submitted to the XML retrieval system.

A query in our system is a filled slot-tree. The following example shows a query “find all

butterflies with broken wing and brown color”.

<s slot="butterfly" path="//butterfly">
  <s slot="color" path="//butterfly//adult//color" keys="brown"/>
  <s slot="shape" path="//butterfly//adult//shape" keys="broken"/>
</s>
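One way such a filled slot-tree might be evaluated is sketched below. The sketch assumes that `keys` holds comma-separated keywords, that all constraints are combined conjunctively (matching the "and" in the example query), and that an index keyed by (path, term) in the same path notation is available; the function names and the toy index are hypothetical.

```python
import xml.etree.ElementTree as ET

QUERY = """<s slot="butterfly" path="//butterfly">
  <s slot="color" path="//butterfly//adult//color" keys="brown"/>
  <s slot="shape" path="//butterfly//adult//shape" keys="broken"/>
</s>"""


def query_constraints(query_xml):
    """Collect one (path, keyword) constraint per keyword of every filled slot."""
    constraints = []
    for slot in ET.fromstring(query_xml).iter("s"):
        keys = slot.get("keys")
        if keys:
            for key in keys.split(","):
                constraints.append((slot.get("path"), key.strip()))
    return constraints


def evaluate(constraints, index):
    """Intersect the posting lists of all constraints (assumed AND semantics)."""
    result = None
    for constraint in constraints:
        ids = set(index.get(constraint, ()))
        result = ids if result is None else result & ids
    return sorted(result or [])


constraints = query_constraints(QUERY)
# Toy index with hypothetical object identifiers.
toy_index = {
    ("//butterfly//adult//color", "brown"): ["a", "b"],
    ("//butterfly//adult//shape", "broken"): ["b", "c"],
}
matches = evaluate(constraints, toy_index)
```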

4.5 Ranking Strategy

The ranking strategy for XML retrieval is closer to that of a database than to that of text retrieval. We may rank the retrieval result by any field in the XML documents. For example, we may sort the retrieval result by the size of butterflies. We may also sort the retrieval result by the similarity between document and query, or by the importance of documents. In this section, we describe the ranking strategies used to sort the retrieval results.

Ranking by Field

To organize the retrieval result for browsing, a user may specify the ranking strategy. As in a database, the user may choose any field by which to sort the result. A field can be sorted numerically or alphabetically, in either increasing or decreasing order. This variety of ranking strategies gives users a way to organize the retrieval result into a list for browsing.

Ranking by Importance

In section 2.2, we introduced how to measure the importance of a web page based on hyperlinks. Hyperlinks in XML may be used to decide the importance of an XML document, too. In our XML retrieval system, ranking by importance is the default ranking strategy. A simple way to measure the importance of an XML document is to count the references to it. We use this strategy in our system for simplicity. In the future, we will try to incorporate the random-walk model and the hub-authority model to measure the importance of XML documents in our XML retrieval system.
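Reference counting can be sketched in a few lines of Python. The link table, the decision to ignore self-references and duplicate links, and the alphabetical tie-break are assumptions made for this illustration.

```python
from collections import Counter


def rank_by_importance(references):
    """references maps a document id to the ids of the documents it links to."""
    counts = Counter()
    for source, targets in references.items():
        # Assumption: each source counts once per target; self-links are ignored.
        counts.update(t for t in set(targets) if t != source)
    # Most-referenced first; ties broken alphabetically for a stable order.
    return sorted(references, key=lambda doc: (-counts[doc], doc))


references = {"a": ["b", "c"], "b": ["c"], "c": []}  # hypothetical link table
ranking = rank_by_importance(references)
```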

Ranking by Similarity

For text retrieval, a ranking strategy based on the vector space model (VSM) and the TFIDF weighting function performs well. A brief survey of VSM and TFIDF was given in section 2.3. However, an XML object is not only a sequence of words like a text, but also contains many tags. For XML, we extend VSM by attaching a path to each term; the result is called the Path Vector Space Model (PVSM). An XML document (d) can be expressed as the following vector v(d).

$v(d) = (d_{p_1,t_1}, \ldots, d_{p_1,t_k}, \ldots, d_{p_n,t_1}, \ldots, d_{p_n,t_k})$, where $d_{p_i,t_j}$ is the weight of the $(p_i, t_j)$ pair in document object d

When several paths have similar meanings, we may cluster them into a slot for retrieval. The model after path clustering is called the Slot Vector Space Model (SVSM).

$v(d) = (d_{s_1,t_1}, \ldots, d_{s_1,t_k}, \ldots, d_{s_n,t_1}, \ldots, d_{s_n,t_k})$, where $d_{s_i,t_j}$ is the weight of the $(s_i, t_j)$ pair in document object d

We may use the cosine-coefficient to measure the similarity between queries and documents in SVSM

just like in VSM.

$\mathrm{Similarity}(d, q) = \dfrac{d \cdot q}{|d| \cdot |q|}$


However, we do not yet know which weighting function is best for the value $d_{s_i,t_j}$: is TFIDF good enough in the SVSM, or do we need another measure? In our system, we express $d_{s_i,t_j}$ as the product of $w_{s_i,t_j}$ and $tf_{s_i,t_j}$, where $tf_{s_i,t_j}$ is the term frequency of the term $t_j$ in slot $s_i$, and $w_{s_i,t_j}$ is the weighting coefficient.
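The slot vector and the cosine coefficient can be sketched as follows. Vectors are stored sparsely as dictionaries keyed by (slot, term); the default weighting coefficient of 1.0 is an assumption for illustration, since the choice of weighting function is left open above.

```python
import math
from collections import Counter


def slot_vector(slot_terms, weights=None):
    """Build d[(s, t)] = w[(s, t)] * tf[(s, t)] from (slot, term) occurrences."""
    tf = Counter(slot_terms)
    weights = weights or {}
    # Assumption: a missing weighting coefficient defaults to 1.0.
    return {key: weights.get(key, 1.0) * count for key, count in tf.items()}


def cosine(d, q):
    """Cosine coefficient between two sparse slot vectors."""
    dot = sum(w * q.get(key, 0.0) for key, w in d.items())
    norm = math.sqrt(sum(w * w for w in d.values())) * math.sqrt(
        sum(w * w for w in q.values())
    )
    return dot / norm if norm else 0.0


d = slot_vector([("color", "brown"), ("color", "white"), ("texture", "spot")])
q = slot_vector([("color", "brown")])
score = cosine(d, q)
```

Here the query shares one of the document's three unit-weight components, so the score equals 1/√3.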

A difficulty for retrieval systems today is that too many documents are retrieved. When there are too many retrieval results to browse, the ranking strategy determines which results are presented to users first. A user may want to see large butterflies, important butterflies or butterflies that are similar to a query. The variety of ranking strategies in XML retrieval lets users retrieve only what they want to browse.

4.6 Browsing XML documents

For an information retrieval system, the retrieved documents should be summarized and organized into a readable format for people to browse. In our XML retrieval system, the slot-filling algorithm is used to map the retrieved documents into filled slot-trees for browsing. A filled slot-tree is a well-organized summary of a document that is easy to browse. In this section, we show an example in which the slot-filling algorithm fills an XML document into a slot-tree. Before that, we show the XML document and the slot-tree used in the example.

The following example shows a simple slot-tree for butterfly.

<s slot="butterfly" path="//butterfly">
  <s slot="name" path="//butterfly//name"/>
  <s slot="adult" path="//butterfly//adult">
    <s slot="color" path="//butterfly//adult//color">
      <v value="black"/>
      <v value="brown"/>
      <v value="black&amp;white"/>
    </s>
    <s slot="texture" path="//butterfly//adult//texture">
      <v value="lines"/>
      <v value="spots"/>
    </s>
  </s>
</s>

We may use the slot-filling algorithm to extract values from the following XML document.

<butterfly about="Athyma_fortuna_kodairai.jpg">
  <adult>
    <texture>There are some eye spots in each wing</texture>
    <color>Brown background color, Eye spots in white color</color>
  </adult>
  <geography>
    <taiwan>North-Taiwan, 1000-2000meters mountain area </taiwan>
    <global>Central China Area</global>
  </geography>
</butterfly>

The slot-filling algorithm fills values into the slot-tree. The following example shows the result of the filling.

<s slot="butterfly" values="Athyma_fortuna_kodairai">
  <s slot="adult">
    <s slot="texture" values="spot"/>
    <s slot="color" values="brown"/>
  </s>
  <s slot="geography">
    <s slot="Taiwan" values="North"/>
    <s slot="Global" values="China"/>
  </s>
</s>

The result of the slot-filling algorithm is a filled slot-tree. For a human reader, it is easier to browse filled slot-trees than the source documents, because the filled slot-tree is a well-organized summary of the XML document.

4.7 Discussion

In this chapter, we designed an XML retrieval system to reduce the semantic gap between human and computer. The slot-tree ontology and the slot-filling algorithm are used in our XML retrieval system to understand XML documents. Based on the slot-tree, we designed a query interface to reduce the semantic gap on the query side. The interface helps people write XML queries easily. Based on the slot-filling algorithm, we designed the slot vector space model (SVSM) to retrieve XML documents. The SVSM helps the computer understand XML documents. In addition, the slot-filling algorithm helps the computer extract summaries from XML documents for browsing. By using the slot-tree as the core representation, we largely achieve our goal of reducing the semantic gap between human and computer.

We will study two cases of our XML retrieval systems in chapter 6 and chapter 7. In

chapter 6, we use the domain of butterflies as an example. In chapter 7, we use the domain of

proteins as an example. We will show the slot-tree, query interface, retrieved results and

summary for butterflies in chapter 6. And we will show the slot-tree, query interface, retrieved

results and summary for proteins in chapter 7.


5 The Construction of Slot-Tree Ontology

We introduced the slot-tree ontology in chapter 3, and then presented an XML retrieval system based on the slot-tree ontology in chapter 4. However, building a slot-tree ontology is not an easy job. To reduce the effort of building it, we developed the slot-mining algorithm, a statistical approach that learns a slot-tree from a collection of XML documents.

An overview of mining approaches is described in section 5.1. Section 5.2 provides

background for the text-mining technology. Section 5.3 shows how to construct slot-tree for a

given XML collection. Section 5.4 describes a method to mine slot-tree from XML documents

called slot-mining algorithm. Finally, we have a discussion for the building of slot-tree in

section 5.5.

5.1 Introduction

The goal of text mining is to find important patterns in a text collection and organize these patterns into an ontology. In this thesis, we use the ontology to help XML retrieval and browsing. Mining technology may be used to help us in the construction of the slot-tree ontology. In this section, we focus on the text-mining problem for XML.

The slot-tree is an ontology representation method. Our mining approach is to build an XML-mining program that induces values for each slot. In this section, we assume for simplicity that each value is represented by a term (or a word). Based on this assumption, we developed a statistical program to mine values for each slot.

The semi-structured property of XML is what makes the mining program work. For a given XML collection, the distribution of a term depends heavily on the tags. For example, the following terms show up more frequently in the <color> block than in other blocks.

<color> “black”, “white”, “yellow”, “blue”, “green” </color>

The problem of mining the important values for each slot is called the Slot-Mining Problem. We propose a mining algorithm based on a simple observation: the distribution of terms depends on the tag. A term that shows up more frequently in a tag than elsewhere is likely to be a key value for the corresponding slot.

5.2 Background


The goal of text mining is to discover regularities in text data. A text-mining program induces rules from text or learns a grammar from a corpus; these rules are used in natural language understanding and information extraction.

For natural language processing, the inside-outside algorithm is a popular tool for learning a probabilistic context-free grammar (PCFG) from a tree-bank corpus. However, a tree-bank corpus is not easy to build: building a tree-bank by hand is a time-consuming job. Other text-learning methods have been developed to learn from a plain text corpus. For example, link grammar is a simple head-driven grammar developed to parse natural language sentences, and a learning algorithm has been developed to learn a link grammar from a text corpus. In addition, transducer learning induces finite-state automata from a given text corpus. Learning a transducer is easier than learning a context-free grammar.

For information extraction, a wrapper-learning algorithm learns a simple grammar from structured text, such as web pages. It induces rules for wrapping the documents. For example, a simple wrapper may learn the prefix and postfix of each field from a collection of program-generated web pages. We may then extract fields from web pages based on these prefixes and postfixes. A transducer may also be used to learn the extraction rules from a collection of web pages.

However, these methods learn the grammar of the input text; they do not learn an ontology from a given document collection. In section 5.4, we propose a learning algorithm, called the slot-mining algorithm, that mines a slot-tree ontology from a given XML collection. This algorithm is a tool to help the domain-knowledge designer design the slot-tree ontology. Before presenting the slot-mining algorithm, we show in section 5.3 how a human builds a slot-tree, in order to observe what is needed in designing such an algorithm.

5.3 The process of building a slot-tree

To show the ontology design process, we trace the design steps of a simple slot-tree for butterflies. There are six steps in designing a slot-tree.

1. Browse XML data.

2. Identify object boundary.

3. List all tags in this domain.

4. Identify slots for this domain.

5. Map each slot to tags (or XPath).

6. Identify values for each slot.


Browsing XML data : The first step in designing a slot-tree is to browse the data in order to understand it. What is the structure of the XML collection? Can we identify the object boundaries in the XML documents? What is the meaning of each tag? Does each tag correspond to a slot? What are the candidate values for a slot? We have to answer these questions before constructing a slot-tree.

Identifying object boundaries : An object-block is an XML block that corresponds to an object. We have to identify the boundaries of object-blocks to find out what objects the collection contains. For example, in our butterfly collection, a <butterfly></butterfly> block is the boundary of a butterfly object.

Listing all tags in this collection : An XML tag usually has a strong semantic meaning. For example, the <color> tag represents the color of a butterfly. We may list all tags to understand the semantics of each tag. For the simple butterfly collection, we list all tags as follows.

butterfly, adult, texture, color, size, geography, taiwan, global
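Listing the tags of a collection is easy to automate. The sketch below assumes the standard `xml.etree.ElementTree` parser and a collection given as a list of XML strings; the function name `list_tags` and the shortened sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def list_tags(xml_documents):
    """Return the distinct element tags of a collection, in first-seen order."""
    tags = []
    for text in xml_documents:
        for elem in ET.fromstring(text).iter():
            if elem.tag not in tags:
                tags.append(elem.tag)
    return tags


collection = [
    "<butterfly><adult><texture>spots</texture><color>brown</color>"
    "<size>50-60mm</size></adult>"
    "<geography><taiwan>north</taiwan><global>China</global></geography>"
    "</butterfly>"
]
tags = list_tags(collection)
```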

Identifying slots for this collection : Fortunately, these tags are not ambiguous. The semantics of the tags are clear and definite, so we may build a slot for each tag.

Mapping slots to tags (or XPath) : For the simple butterfly collection, we can map each tag to one slot directly. The following example shows the schema of the slot-tree.

<s slot="butterfly">
  <s slot="adult">
    <s slot="texture"/>
    <s slot="color"/>
    <s slot="size"/>
  </s>
  <s slot="geography">
    <s slot="Taiwan"/>
    <s slot="Global"/>
  </s>
</s>

Identifying values for each slot : To identify values for each slot, we have to read the data for each slot. For example, if we read the data in the <color> tag, we may find that “black”, “white”, “brown”, “orange”, “yellow”, “green”, “blue”, “purple” and “gray” are key values for this slot. We fill them into the value list of the color slot. After we fill in the values for each slot, the slot-tree building process is finished. The following XML document shows a slot-tree for the simple butterfly collection.


<s slot="butterfly">
  <s slot="adult">
    <s slot="color">
      <v value="black"/><v value="white"/><v value="brown"/>
      <v value="yellow"/><v value="orange"/><v value="green"/>
      <v value="blue"/><v value="purple"/><v value="gray"/>
    </s>
    <s slot="texture">
      <v value="single color" keys="single, mono, uniform"/>
      <v value="spotted" keys="spot"/>
      <v value="lines" keys="line"/>
    </s>
    <s slot="size">
      <v value="small"/><v value="middle"/><v value="large"/>
    </s>
  </s>
  <s slot="geography">
    <s slot="Taiwan">
      <v value="north"/><v value="center"/><v value="south"/><v value="east"/>
    </s>
    <s slot="Global">
      <v value="Europe"/><v value="China"/><v value="India"/>
      <v value="America"/><v value="Australia"/>
    </s>
  </s>
</s>

In the slot-tree example above, a <v> tag represents a value in a slot. The simplest value is a keyword. We may also specify a set of keywords or rules for a value, such as the “single color” value in the “texture” slot.

The last step, “Identifying values for each slot”, is the most labor-intensive step in the whole slot-tree building process. To construct the slot-tree automatically, we develop the slot-mining algorithm in the next section to mine a slot-tree from XML documents.

5.4 Slot-mining algorithm


The slot-mining algorithm mines a slot-tree from XML documents. The first step is to extract the paths in the XML documents to build a schema. The second step is to use statistical correlation analysis to find out which terms are important for these paths. After that, a slot-tree is built in which each slot corresponds to a path in the XML documents. The following figure shows a conceptual model of the slot-mining algorithm.

Figure 5.1 The process of slot-mining algorithm

Before we describe the algorithm, we define some mathematical notation for it.

Definition : Slot-Vector

A slot-vector is a vector indexed by (slot, term) pairs for a given collection of XML blocks (B).

$v(B) = (B_{s_1,t_1}, \ldots, B_{s_1,t_k}, \ldots, B_{s_n,t_1}, \ldots, B_{s_n,t_k})$

$B_{s_i,t_j}$ is the weight with which term $t_j$ shows up in the blocks of B for slot $s_i$

$|B|$ is the abbreviation for $\sum_{s,t} B_{s,t}$

$|B_t|$ is the abbreviation for $\sum_{s} B_{s,t}$

$|B_s|$ is the abbreviation for $\sum_{t} B_{s,t}$

Definition : Slot-Vector Space Model (SVSM)

The model that represents XML documents by slot-vectors is called the Slot-Vector Space Model.

Example

1. A slot-vector for a given collection (D) is represented as the following formula.

$v(D) = (D_{s_1,t_1}, \ldots, D_{s_1,t_k}, \ldots, D_{s_n,t_1}, \ldots, D_{s_n,t_k})$

2. A slot-vector for a specified slot (s) of collection (D) is represented as the following formula.

$v(D_s) = (D_{s,t_1}, \ldots, D_{s,t_k})$

3. A slot-vector for a given document (d) is represented as the following formula.

$v(d) = (d_{s_1,t_1}, \ldots, d_{s_1,t_k}, \ldots, d_{s_n,t_1}, \ldots, d_{s_n,t_k})$

4. A slot-vector for a specified slot (s) of document (d) is represented as the following formula.

$v(d_s) = (d_{s,t_1}, \ldots, d_{s,t_k})$

Slot-Mining Problem

Given an XML document collection (D) and a set of slots (S), find the key values for each slot : v(s).

Slot-Mining Algorithm

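The slot vector for D is $v(D) = (D_{s_1,t_1}, \ldots, D_{s_1,t_k}, \ldots, D_{s_n,t_1}, \ldots, D_{s_n,t_k})$

Let $|D_t| = \sum_{i} D_{s_i,t}$

The slot vector for $D_s$ is $v(D_s) = (D_{s,t_1}, \ldots, D_{s,t_k})$

$v_{>r}(s) = \{\, t \mid D_{s,t} / |D_s| > r \cdot |D_t| / |D| \,\}$

$v_{>r}(s)$ is called the r-key-set for slot (s). That is, a term t is a key value for slot s when its relative frequency within the slot exceeds r times its relative frequency in the whole collection.

In our XML-mining system, we set the parameter r = 2.0 to extract the key values for each slot.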

The following figure shows the pseudo code of slot-mining algorithm.

Algorithm Slot-Mining (D)
    P = {p | p is a path in D}
    PT = {(p,t) | term t occurs under path p in D}
    SV = {}; all counters start at 0
    for each occurrence (p,t) in D
        |Dp,t| = |Dp,t| + 1
        |Dp| = |Dp| + 1
        |Dt| = |Dt| + 1
        |D| = |D| + 1
    end for
    for each (p,t) in PT
        p(t | p) = |Dp,t| / |Dp|
        p(t) = |Dt| / |D|
        if p(t|p) / p(t) > r then put (p,t) into SV
    end for
    return SV

Figure 5.2 The Pseudo Code of Slot-Mining Algorithm
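The pseudocode above translates almost directly into Python. The sketch below is an illustration under assumptions: the collection is given as a flat list of (path, term) occurrences, and the toy data is invented so that the ratio test with r = 2.0 can be checked by hand.

```python
from collections import Counter


def slot_mining(pairs, r=2.0):
    """pairs: (path, term) occurrences in D. Returns {path: set of key terms}."""
    n_pt = Counter(pairs)                   # |Dp,t|
    n_p = Counter(p for p, _ in pairs)      # |Dp|
    n_t = Counter(t for _, t in pairs)      # |Dt|
    n = len(pairs)                          # |D|
    keys = {}
    for (p, t), count in n_pt.items():
        p_t_given_p = count / n_p[p]
        p_t = n_t[t] / n
        if p_t_given_p / p_t > r:
            keys.setdefault(p, set()).add(t)
    return keys


# Hypothetical toy collection: three paths with four term occurrences each.
pairs = [
    ("color", "brown"), ("color", "brown"), ("color", "white"), ("color", "wing"),
    ("texture", "spot"), ("texture", "line"), ("texture", "spot"), ("texture", "wing"),
    ("size", "middle"), ("size", "large"), ("size", "small"), ("size", "wing"),
]
keys = slot_mining(pairs)
```

In this toy data "brown" has p(t|p) = 2/4 under the color path and p(t) = 2/12 overall, a ratio of 3, so it is kept; "wing" is spread evenly over all paths, has a ratio of 1, and is dropped.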

The slot-mining algorithm mines values from the XML collection D. The mined values should then be edited and organized into a slot-tree to improve their quality. Let us look at a mining example for the slot “color”.

Example :

<color> head, brown, yellow, body, white, wing, gray, blue, black, background, line, spot </color>

In the mining result above, “brown, yellow, white, gray, blue, black” are what we want, but “head, body, wing, background, line, spot” are noise words. At present, we cannot distinguish these two groups by statistical methods alone. One possible solution is to use a lexical database such as WordNet to distinguish them. We will try this solution in the future.

5.5 Discussion

To help people construct the slot-tree ontology, we developed a slot-mining algorithm that mines a slot-tree from XML documents. The slot-mining algorithm serves as an authoring tool for constructing the slot-tree ontology.

The slot-mining algorithm mines slot-trees from a collection of XML documents. Our

approach is based on statistical correlation analysis between tags and terms. The correlation

analysis decides what terms are important for a given tag, and fills terms into the slot of this

tag.

The automatically constructed slot-tree needs some modification to improve its quality. First, we have to merge paths with the same meaning into one slot in order to simplify the structure of the slot-tree. Second, we have to delete incorrectly mined values and merge values with the same meaning in order to improve the quality of each slot.


The slot-mining algorithm is used to construct the ontology for butterflies in section 6.7 and the ontology for proteins in section 7.7. We show the full version of the mined slot-trees in these sections.


Part 3 : Case Studies

6 Case Study - A Digital Museum of Butterflies

In part 2, we described our methods, including the slot-tree ontology, the slot-filling algorithm, the slot vector space model and the slot-mining algorithm. These methods are used to build a semantic retrieval system for XML. In this part, we use two XML collections to test our methods: a collection for butterflies and a collection for proteins.

In chapter 6, we will test our methods on the collection of “A Museum of Butterflies in

Taiwan (MBT) ”. In chapter 7, we will test our methods on the collection of “Protein

Information Resource (PIR)”. Both collections are encoded in XML format.

In this chapter, an overview of MBT is given in section 6.1. A source XML document of MBT is shown in section 6.2. A slot-tree for MBT is described in section 6.3. A query interface based on the slot-tree is described in section 6.4. The slot-filling process for MBT is described in section 6.5. The retrieval process for MBT is discussed in section 6.6. The mining process to build the slot-tree for MBT is discussed in section 6.7. A discussion of our approach on MBT is given in section 6.8.

6.1 Introduction

The Digital Museum of Butterflies is a collection of documents on butterflies in Taiwan. Each document in this collection describes one species of butterfly in Taiwan. The following table is a profile of this collection.

Table 6.1 : A Museum of Butterflies in Taiwan

Collection      A Museum of Butterflies in Taiwan (台灣蝴蝶數位博物館)
Working Group   NMNS : National Museum of Natural Science (國立自然科學博物館), Taiwan
                URL : http://www.nmns.edu.tw/
                NCNU : National Chi-Nan University (暨南大學), Taiwan
                URL : http://dlm.ncnu.edu.tw/butterfly/index.htm
                NTU : National Taiwan University (台灣大學), Taiwan
                URL : http://turing.csie.ntu.edu.tw/ncnudlm/
Size            356 species, 356 XML documents.
Language        Tags in English, content in Chinese

The Digital Museum of Butterflies in Taiwan contains XML documents for 356 species of butterflies in Taiwan. Roughly speaking, the tags may be classified into the following groups.


Table 6.2 : XML tags for butterflies in Taiwan

Group : Fields

Classification : name, family, cfamily (Chinese family), genus, species, subspecies

Host : host plant, honey plant

Geography : Taiwan, global

Egg : color, shape, feature, characteristic, days of growth, enemy

Larva : color, shape, feature, characteristic, days of growth, enemy

Pupa : color, shape, feature, characteristic, days of growth, enemy

Adult : color, shape, texture, characteristic, life period, enemy

11.2 The Representation of Butterflies in XML

The following figure shows an XML document for the butterfly “kodairai”.

- <butterfly>

<cname>拉拉山三線蝶</cname>

- <classification>

<family>Nymphalidae</family>

<cfamily>蛺蝶科</cfamily>

<genus>Athyma</genus>

<species>fortuna</species>

<sub_species>kodairai</sub_species></classification>

<hostplant>忍冬科 (Caprifoliaceae) 的松田氏紅子仔 (Viburnum luzonicum var. matsudai)。</hostplant>

<honeyplant>成蝶喜吸食腐熟水果汁液或樹幹流出汁液。</honeyplant>

- <geographic><taiwan>分布於台灣中北部地區,海拔 1000-2000 公尺間山區均有分布。</taiwan>

<global>中國大陸中部有原名亞種分布。</global></geographic>

- <life_stage>

- <egg>

<feature>底部扁平之高饅頭形,表面有明顯六角形格狀花紋,於六角形頂點處,各著生一

細長刺毛…

<color>淡綠。</color> <size>直徑約為 1.1-1.3mm。</size>

<predator>各類卵寄生蜂、蜱等節肢動物。</predator>

<days_of_growth>卵期約為 5-6 天左右。</days_of_growth></egg>

- <larva>

<feature>終齡幼蟲體呈長圓筒狀,頭部密生硬棘,各體節背方及體側皆長有具星狀刺之突

起…

<color>終齡幼蟲頭部褐色,表面密生棘狀突起。體呈翠綠色,各體節背方及體側突起基部

為藍色,星狀刺為黃綠色。</color>


<size>終齡幼蟲體長約為 33-41 mm。</size>

<predator>寄生蜂、寄生蠅、小繭蜂、椿象、蜥蜴及鳥類等。</predator>

<days_of_growth>冬季以二齡幼蟲越冬,幼蟲期長達半年以上。</days_of_growth>

<defense>初齡幼蟲停棲於寄主葉脈,攝食葉脈兩側葉肉,二齡幼蟲會將寄主植物葉片咬成小塊並吐絲將其此碎片及糞便黏於葉脈造一蟲巢,越冬幼蟲即躲藏於蟲巢當中,由於幼蟲褐色之體色與蟲巢上乾枯之小葉片或糞便色澤相近,或許可混淆天敵耳目。</defense></larva>

- <pupa>

<feature>蛹體為垂蛹,中胸背方隆起,腹節末端有一柄狀懸絲器。頭部前端有一對大型明

顯之彎曲角狀突出物,腹節背方均有小型鋸齒狀脊起。</feature>

<color>蛹體底色呈黃褐色,中、後胸背方有銀色斑塊,體側氣門黑褐色。</color>

<size>蛹體長度約為 22-27mm。</size>

<predator>蛹寄生蜂、胡蜂、姬蜂及各種真菌等。</predator>

<days_of_growth>蛹期約為 15-20 天,視溫度而定。</days_of_growth>

<defense>老熟幼蟲化蛹於隱蔽之植物叢間,藉以躲避天敵。</defense></pupa>

- <adult>

<feature>成蟲前翅外觀大致呈現三角形,翅形稍微橫長。後翅卵圓形,外觀接近三角形。雌

蝶翅型較為寬圓。</feature>

<color>雄蝶前、後翅表底色為黑色,前翅中室內有一枚長形白斑,各翅室中橫線部位有一

大型白色橢圓斑,前翅端有兩枚小型白斑。後翅有兩條明顯白色橫帶紋,前後翅緣皆

有不明顯小白紋。雌蟲翅表色澤花紋與雄蟲相似。</color>

<size>本種為中型蝶種,展翅約為 50-60mm。</size>

<characteristic>前翅中室內有一枚長形白斑。</characteristic>

<habitate>台灣中部以北山區均有分布。</habitate>

<predator>蜘蛛、螳螂、青蛙、蜻蜓、鳥類及蜥蜴等捕食性天敵。</predator>

<days_of_growth>前翅中室內有一枚長形白斑。</days_of_growth>

<defense>成蟲飛行快速,外觀與其他多種三線蝶類似,為莫氏擬態的一種。</defense>

<season>夏季較易見到成蟲活動。</season>

<behavior>成蝶喜吸食腐熟水果汁液或樹幹發酵流出之樹液,成蟲活動於開闊林道,常見

成蟲於開闊山徑兩旁樹上佔據地盤驅趕附近飛過蝴蝶,亦可見其活動於溪邊開闊處,

吸食腐果或潮濕地面水分。</behavior>

</adult>

</life_stage>

</butterfly>

Figure 6.1 : An XML document for butterfly (Full List)

11.3 Slot-Tree Ontology for Butterflies


Our ontology is represented as a slot-tree in XML format. The slot-tree we designed for butterflies is consistent with the target collection: both follow the schema below.

<butterfly>

<classification/>

<Geography/>

<life-period>

<Egg/>

<Larva/>

<Pupa/>

<Adult/>

</life-period>

</butterfly>

Each object in the “life period” (egg, larva, pupa, adult) has a sub schema to describe the object. The

schema looks like the following tree.

<object>

<Color/>

<shape/>

<feature/>

<size/>

</object>

The consistency between the slot-tree and the documents eases our design process. It also eliminates ambiguity in the retrieval and browsing processes. In contrast, a poorly designed XML document structure makes the domain knowledge difficult to design, and makes it hard for that domain knowledge to support retrieval and browsing.

A fragment of the slot-tree for butterflies is shown in the following figure. For the full slot-tree, please see appendix 1.

- <butterfly>

- <family slot="種類" path="//butterfly//cfamily//">

<v value="弄蝶" keys="Hesperiidae" /><v value="小灰蝶" keys="Lycaenidae" /> ….</family>

- <adult slot="蝴蝶成蟲" keys="Adult" path="//butterfly//adult//">

- <shape slot="蝴蝶的形狀" keys="Adult:Shape" path="//butterfly//adult//shape//">

<v value="類似燕尾 " image="swallowtail.gif"/> <v value="翅緣波浪狀 " …/>…</shape>

- <color slot="蝴蝶的顏色" keys="Adult:Color" path="//butterfly//adult//color//">

<v value="黑色 " keys="Black" />… <v value="黑白相間 " keys="Black_White"/>…

</color>

- <texture slot="蝴蝶的特徵" keys="Adult:Texture" path="//butterfly//adult//texture//">

<v value="沒有花紋" image="mono.gif" /><v value="少數斑點" image="spot.gif" /> …

</texture>

</adult>

- <pupa slot="蝴蝶的蛹" keys="Pupa" path="//butterfly//pupa//">

- <s slot="蛹的形狀" path="//butterfly//pupa//"><v value="突起" keys="Skin_Stick" /> …</s>

- <s slot="蛹的顏色" keys="Pupa:Color" path="//butterfly//pupa//color//">

<v value="翠綠色" keys="Green"/> <v value="褐色" keys="Wood" /> …</s>

- <s slot="蛹的特徵" keys="Pupa:Feature" path="//butterfly//pupa//feature//">

<v value="帶蛹 " keys="Laying_Pupa"/><v value="垂蛹 " keys="Hanging_Pupa"/>

</s></pupa>

- <egg slot="蝴蝶的卵" keys="Egg" path="//butterfly//egg//">

- <s slot="卵的形狀" keys="Egg:Shape" path="//butterfly//egg//feature//">

<v value="圓球形" keys="Ball" image="egg_ball.jpg" />

<v value="半球形" keys="饅頭形+Half_Ball" image="egg_half_ball.jpg" /> …</s>

- <s slot="卵的顏色" keys="Egg:Color" path="//butterfly//egg//color//">

<v value="乳白" keys="Milk_White" /> …</s>

- <s slot="卵的特徵" keys="Egg:Texture" path="//butterfly//egg//feature//">

<v value="表面光滑" keys="Smooth"/>…<v value="格狀花紋" keys="Square_Texture"/> …</s>

</egg>

- <larva slot="蝴蝶的幼蟲" keys="Larva+毛毛蟲" path="//butterfly//larva//">

- <s slot="幼蟲的形狀" keys="Larva:shape" path="//butterfly//larva//feature//">

<v value="紡棰形" keys="Like_Shuttle" /><v value="鳥糞狀" keys="Like_Bird's_Shit" /> …</s>

- <s slot="幼蟲的顏色" keys="Larva:Color" path="//butterfly//larva//color//">

<v value="綠色" keys="Green" /><v value="褐色" keys="Brown" /> …</s>

- <s slot="幼蟲的特徵" keys="Larva:Texture" path="//butterfly/life_stage/larva/characteristic">

<v value="短毛" keys="Short_Hair" /><v value="長毛" keys="Long_Hair" /> … </s>

</larva>

- <s slot="台灣分布" keys="Taiwan" path="//butterfly//geographic//taiwan//">

<v value="台灣北部" keys="North_Taiwan+北" /> …</s>

- <s slot="全球分布" path="//butterfly//geographic//global//">


<v value="東南亞" keys="South_Asia " /><v value="中國大陸" keys="China" /> …. </s>

- <s slot="體型大小" keys="Size" path="//butterfly//adult//size//">

<v value="小型" keys="Small_Size+小" /><v value="中型" keys="Middle_Size+中" /></s>

- <s slot="棲息地" keys="棲息地=Habitate" path="//butterfly//adult//habitate//">

<v value="平地" keys="Ground" />…<v value="高海拔山區" keys="High_Mountain" /> …

</s>

- <s slot="宿主植物" keys="Hostplant+寄主植物" path="//butterfly//hostplant//">

<v value="豆科" keys="Leguminosae" /><v value="大戟科" keys="Euphorbiaceae" /> …</s>

- <s slot="飲食習慣" keys="Eat Food" path="//butterfly//adult//behavior//;//butterfly//honeyplant//">

<v value="食花蜜" keys="Nectar" /><v value="食腐汁" keys="Juice " />…</s>

</butterfly>

Figure 6.2 : A slot-tree for butterflies
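To make the structure of the slot-tree concrete, the following is a minimal sketch that reads a small, well-formed excerpt modeled on Figure 6.2 and lists each slot with its path and candidate values. The function name `list_slots` and the use of Python's `xml.etree.ElementTree` are assumptions made for illustration; they are not the implementation used in this thesis.

```python
import xml.etree.ElementTree as ET

# A well-formed excerpt modeled on Figure 6.2 (elided values are omitted).
SLOT_TREE = """
<butterfly>
  <adult slot="蝴蝶成蟲" keys="Adult" path="//butterfly//adult//">
    <color slot="蝴蝶的顏色" keys="Adult:Color" path="//butterfly//adult//color//">
      <v value="黑色" keys="Black"/>
      <v value="黑白相間" keys="Black_White"/>
    </color>
  </adult>
  <s slot="台灣分布" keys="Taiwan" path="//butterfly//geographic//taiwan//">
    <v value="台灣北部" keys="North_Taiwan+北"/>
  </s>
</butterfly>
"""

def list_slots(xml_text):
    """Return (slot name, path, [values]) for every element with a slot attribute."""
    root = ET.fromstring(xml_text)
    slots = []
    for node in root.iter():  # document order, root included
        if "slot" in node.attrib:
            # Only direct <v> children are the candidate values of this slot.
            values = [v.get("value") for v in node.findall("v")]
            slots.append((node.get("slot"), node.get("path"), values))
    return slots

for name, path, values in list_slots(SLOT_TREE):
    print(name, path, values)
```

In this sketch a container slot such as `adult` simply has an empty value list, while leaf slots carry the values that a document may fill.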

11.4 Query Interface

The query interface is generated automatically from the slot-tree. An XSLT template transforms the slot-tree into an HTML document, which is then displayed as a web page in the browser. The following figure shows the query interface for the butterfly domain.


Figure 6.3 A Query Interface for Butterflies
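The mapping from slot-tree to query form can be illustrated with a short sketch. The thesis uses an XSLT template for this step; the code below swaps in plain Python (the standard library has no XSLT processor) and only imitates the idea: each slot that has candidate values becomes one drop-down list. The function name `slot_tree_to_form` and the HTML layout are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

def slot_tree_to_form(slot_tree_xml):
    """Render every slot that has <v> values as an HTML <select> element."""
    root = ET.fromstring(slot_tree_xml)
    lines = ["<form>"]
    for node in root.iter():
        values = node.findall("v")
        if "slot" not in node.attrib or not values:
            continue  # skip container nodes and nodes without candidate values
        lines.append("<label>" + escape(node.get("slot")))
        lines.append('<select name="' + escape(node.get("path", "")) + '">')
        lines.append('<option value=""></option>')  # empty option: slot not constrained
        for v in values:
            val = escape(v.get("value", "").strip())
            lines.append('<option value="' + val + '">' + val + "</option>")
        lines.append("</select></label>")
    lines.append("</form>")
    return "\n".join(lines)

example = (
    '<butterfly><s slot="體型大小" path="//butterfly//adult//size//">'
    '<v value="小型"/><v value="中型"/></s></butterfly>'
)
print(slot_tree_to_form(example))
```

Because the form is derived from the slot-tree, adding a value to the ontology changes the interface without any hand-written HTML, which is the property the XSLT-based design relies on.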

The query interface above generates the following query.

<query sort_by="蝴蝶大小" path="/butterfly/adult/size$meter" order="-">
<s slot="蝴蝶花紋" path="//butterfly//adult//texture" value="水平色帶"/>
<s slot="台灣分布" path="//butterfly//geographic//Taiwan" value="恆春半島"/>
</query>

After the interface submits the query to our XML retrieval system, the retrieval results are displayed. The query above specifies both the query expression and the ranking strategy, which orders the results by the size of the adult butterfly in decreasing order. Based on the query, the XML retrieval system retrieves the matching butterfly objects and ranks them by size. The query results are shown later in this chapter.
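The two parts of such a query, the slot constraints and the ranking strategy, can be separated with a few lines of code. The sketch below is only an illustration under stated assumptions: the records are made-up placeholders (names A, B, C and their sizes are not data from the collection), slots are assumed to be already filled into flat dictionaries, and `order="-"` is read as decreasing order, as described above.

```python
import xml.etree.ElementTree as ET

QUERY = """
<query sort_by="蝴蝶大小" path="/butterfly/adult/size$meter" order="-">
  <s slot="蝴蝶花紋" path="//butterfly//adult//texture" value="水平色帶"/>
  <s slot="台灣分布" path="//butterfly//geographic//Taiwan" value="恆春半島"/>
</query>
"""

# Hypothetical, already slot-filled records used only to exercise the sketch.
RECORDS = [
    {"name": "A", "蝴蝶花紋": "水平色帶", "台灣分布": "恆春半島", "蝴蝶大小": 50.0},
    {"name": "B", "蝴蝶花紋": "水平色帶", "台灣分布": "恆春半島", "蝴蝶大小": 60.0},
    {"name": "C", "蝴蝶花紋": "少數斑點", "台灣分布": "恆春半島", "蝴蝶大小": 70.0},
]

def parse_query(xml_text):
    """Split a query into slot constraints, the sort slot, and the sort direction."""
    root = ET.fromstring(xml_text)
    constraints = {s.get("slot"): s.get("value") for s in root.findall("s")}
    descending = root.get("order") == "-"
    return constraints, root.get("sort_by"), descending

def run_query(records, xml_text):
    """Keep records that satisfy every constraint, then rank them by the sort slot."""
    constraints, sort_slot, descending = parse_query(xml_text)
    hits = [r for r in records
            if all(r.get(slot) == value for slot, value in constraints.items())]
    return sorted(hits, key=lambda r: r.get(sort_slot, 0), reverse=descending)

print([r["name"] for r in run_query(RECORDS, QUERY)])
```

Exact matching on slot values is the simplest possible retrieval rule; the slot vector space model described in Part 2 would replace the boolean filter with a similarity score.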

11.5 Slot-Filling Algorithm

We have to parse the XML objects before filling the documents into the slot-tree. For example, the following XML document describes a butterfly called “maraho”.

<butterfly>


<cname>寬尾鳳蝶</cname>

<geographic>

<taiwan>本種分布於海拔較高山區,台灣中北部海拔 1000-1500 公尺山區才可見… </taiwan>

</geographic>

<egg><feature>外觀呈圓球形</feature></egg>

<adult><color>成蟲翅表為黑色</color></adult>

<footnote>本種經行政院農業委員會公告為一級瀕臨滅絕保育……</footnote>

</butterfly>

The example above is parsed into the following sequence of (path, value) pairs.

(butterfly\ cname , 寬尾鳳蝶)

(butterfly\ geographic\ taiwan, 本種分布於海拔較高山區,台灣中北部海拔 1000-1500 公尺山區才可見…)

(butterfly\ egg \ feature, 外觀呈圓球形)

(butterfly\ adult \ color, 成蟲翅表為黑色)

Then we fill each pair into the corresponding slot, as in the following example.

(butterfly\ egg \ feature, 外觀呈圓球形)

<slot name="卵的形狀" path="butterfly//egg//feature">

<value name="圓球形"/>

<value name="半球形"/>

….

卵的形狀 : 圓球形
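The two steps above, parsing a document into (path, value) pairs and filling the pairs into slots, can be sketched as follows. This is a simplified illustration, not the slot-filling algorithm of Part 2: it assumes that a slot matches a pair when the pair's path ends with the slot's path suffix, and that a value is filled when it occurs as a substring of the text. The names `to_pairs`, `fill_slots`, and `SLOTS` are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<butterfly>
  <cname>寬尾鳳蝶</cname>
  <egg><feature>外觀呈圓球形</feature></egg>
  <adult><color>成蟲翅表為黑色</color></adult>
</butterfly>"""

# Hypothetical slot table: slot name -> (path suffix, candidate values).
SLOTS = {
    "卵的形狀": ("egg\\feature", ["圓球形", "半球形"]),
    "蝴蝶的顏色": ("adult\\color", ["黑色", "黑白相間"]),
}

def to_pairs(node, prefix=""):
    """Yield (path, text) for every element that has non-empty text."""
    path = prefix + "\\" + node.tag if prefix else node.tag
    text = (node.text or "").strip()
    if text:
        yield path, text
    for child in node:
        yield from to_pairs(child, path)

def fill_slots(pairs, slots):
    """Fill each slot with the candidate values found in matching pairs."""
    filled = {}
    for path, text in pairs:
        for name, (suffix, values) in slots.items():
            if path.endswith(suffix):
                matched = [v for v in values if v in text]
                if matched:
                    filled.setdefault(name, []).extend(matched)
    return filled

pairs = list(to_pairs(ET.fromstring(DOC)))
print(pairs)
print(fill_slots(pairs, SLOTS))
```

Substring matching is enough for this example because the candidate value 圓球形 appears literally in the text; matching through the `keys` attribute of the slot-tree is left out of the sketch.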

11.6 XML Retrieval

After the user submits the query to the XML retrieval system, the XML retrieval system

retrieves the query results. Then an XML extraction algorithm extracts values for each slot.

After that, a sorting function sorts the result by the size of butterflies. The following figure

shows the query results.


Figure 6.4 A Query Result for Butterflies

11.7 Slot-Mining Algorithm

Chinese Word Learning

A problem specific to Chinese is word boundary detection. In English, the words in a sentence are separated by spaces, but in Chinese there are no spaces between words. This makes XML text mining more difficult. One way to solve the problem is to use a dictionary to find the words that appear in a sentence. The deficiency of this approach is that no dictionary contains all words, and many unknown words are used in a specialized domain. We therefore have to learn words dynamically. In our system, we adopt the keyword-learning algorithm proposed by L.F. Chien [Chien97]. This algorithm is based on the following observation: “Both the right-hand side and the left-hand side of a word should be ‘free’”. Here ‘free’ means that, statistically, a word can be adjacent to many different neighbors. For example, we can extract the word ‘三線蝶’ from the following sentences based on its statistical freedom.

…雄紅三線蝶身上有…

…江崎三線蝶分布於…

…台灣三線蝶是一種…

…埔里三線蝶屬於小…


“三線蝶” left neighbor {紅,崎,灣,里} ,right neighbor {身,分,是,屬 }

“線蝶” left neighbor {三} ,right neighbor {身,分,是,屬}

“三線” left neighbor {紅,崎,灣,里} ,right neighbor {蝶}

The string “三線蝶” has four distinct neighbors on both its left side and its right side. In contrast, “線蝶” has only one left neighbor, and “三線” has only one right neighbor. A string with many distinct neighbors on both sides is very likely to be a word, so “三線蝶” is put into the learned dictionary for the following XML text-mining step.
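The neighbor counts above can be reproduced with a short sketch. It only illustrates the ‘free on both sides’ observation on the four example sentences; it is not the keyword-learning algorithm of [Chien97] itself, and the threshold `min_neighbors=2` is an assumption chosen for this example.

```python
SENTENCES = ["雄紅三線蝶身上有", "江崎三線蝶分布於", "台灣三線蝶是一種", "埔里三線蝶屬於小"]

def neighbors(sentences, candidate):
    """Collect the distinct characters found directly left and right of candidate."""
    left, right = set(), set()
    for s in sentences:
        start = s.find(candidate)
        while start != -1:
            end = start + len(candidate)
            if start > 0:
                left.add(s[start - 1])
            if end < len(s):
                right.add(s[end])
            start = s.find(candidate, start + 1)
    return left, right

def is_free(sentences, candidate, min_neighbors=2):
    """A candidate is 'free' if it has enough distinct neighbors on both sides."""
    left, right = neighbors(sentences, candidate)
    return len(left) >= min_neighbors and len(right) >= min_neighbors

for cand in ["三線蝶", "線蝶", "三線"]:
    print(cand, neighbors(SENTENCES, cand), is_free(SENTENCES, cand))
```

On these sentences only “三線蝶” passes the test, since “線蝶” is always preceded by 三 and “三線” is always followed by 蝶.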

Slot-Mining

After the word-learning step, the slot-mining algorithm described in section 3.5 is used to extract the important words for each slot. The following table shows some of the slot-mining results (for part of the slot-tree).

Table 6.3 : A Result of Slot-Mining Algorithm for Butterflies

Slot Value List

\butterfly\classification\cfamily 鳳蝶科, 蛺蝶科, 蛇目蝶科, 粉蝶科, 斑蝶科, 弄蝶科, 小灰蝶科

\butterfly\classification\family Satyridae, Pieridae, Papilionidae, Papilio, Nymphalidae, Lycaenidae,

Hesperiidae, Danaidae

\butterfly\cname 黃蝶, 鳳蝶, 蛺蝶, 蔭蝶, 胡麻斑粉蝶, 胡麻, 粉蝶, 樺斑蝶, 斑蝶, 弄蝶,台灣

\butterfly\footnote 高冷蔬菜區, 非常, 開發, 開墾, 長達,近年來, 種經, 種族群, 破壞, 生活史, 生

存, 環境, 溫帶果園, 溫帶, 海拔山, 海拔, 植物, 棲息環境, 棲息,, 本種, 更使,

族群分布, 族群分, 族群, 拔山, 情形, 寄主植物, 寄主, 台灣中, 台灣, 分布, 再

加上農藥, 侷限同時, 侷限, 使用, 主植物

\butterfly\geographic\global 馬來半島,非洲,錫金 西部,蘇門達臘,蘇門答臘 蘇門,群島,美洲,緬甸北部,緬甸,

琉球群島,琉球,爪哇,熱帶,澳洲東部,澳洲,泰國,歐洲,東部,東亞,本種尚分布,

本種,朝鮮半島,朝鮮,有分,日本,新幾內亞,斯里蘭卡,廣泛分布,廣泛,幾內亞,巴

基斯坦,尼泊爾,尚分布,婆羅洲,地區皆,地區均,地區,喜馬拉,喀什米爾,印度,南

部,半島,區皆,區均,北部,利亞,分布,分佈,亞種分布,亞熱帶地區,亞熱帶,亞洲,

亞地區,中國大陸,中南半島,中亞,

\butterfly\honeyplant\ 馬櫻丹,馬利筋,馬利,金露花,野花,豐草,菊科野花,菊科, 菊科,花蜜,腐熟,繁星

花,繁星,紫花霍香薊,紫花,流出,汁液,水果汁,水果,樹液,樹幹,植物,果汁液,果

汁,成蟲,成蝶,小型,多種,咸豐草,咸豐,吸食花蜜,吸食腐,吸食,各種野花, 各種,

\butterfly\life_stage\adult\predator ,鳥類,青蛙,螳螂,蜻蜓,蜥蜴,蜘蛛,捕食性天敵,捕食,性天敵,天敵,

\butterfly\life_stage\egg\characteristic ,表面平滑,表面,縱脊,細微,精孔,突起,條縱脊,條細微縱脊,條細微縱脊,明顯縱

脊,明顯,數條,平滑,刻點,光澤,中央精孔,中央,


\butterfly\life_stage\egg\feature\ ,高饅頭形,饅頭,頂點,頂部微凸,頂部,角形,表面,著生,菱形,花紋,縱脊,細長刺

毛,細長,細小突起,細小,精孔,突起,稍微,砲彈形,砲彈,瓶形,球形,明顯縱脊,明

顯,扁平,扁圓形,扁圓,微凹,微凸,底部扁平,底部,小突起,圓球形,圓球,圓球,圓

形精孔,圓形,各著生,半圓球形,半圓,刺毛,凹陷,六角形,佈滿,中央微凹,中央微

凸,中央,

\butterfly\life_stage\adult\color\ 黑褐色,黑褐,黑色細帶紋,黑色斑點,黑色斑紋,黑色性徵,黑色帶紋,黑色小斑,

黑色小圓斑,黑色外框,黑色圓斑,黑色,黑紋,黑眼紋,黑白,黑斑,黃色,鱗片,體型,

體呈,顯眼,面底色,靠近,青藍色,雌蟲,雌蝶色澤,雌蝶翅表,雌蝶外觀,雄蟲相似,

雄蟲相,雄蟲,雄蝶相似,雄蝶前,雄蝶,附近,長型白斑,金屬,部分,部位,角形,規

則,褐色帶紋,褐色帶,褐色,表無,表各,蟲翅,蝶翅,蝶前,藍色,花紋,色鱗,色細,色

紋,色澤,色斜,色斑點,色斑紋,色斑,色帶紋,色帶,色小,色寬,色外框,色外,色圓

斑,色圓,色區域,至亞,腹面,肛角部位,肛角,翅表色澤,翅表底色,翅表,翅腹面底

色,翅腹色澤,翅腹底色,翅腹,翅脈,翅緣,翅第,翅端,翅形,翅外緣部位,翅外,翅

基部位,翅基,翅中,縱貫,緣部,緣毛,線部位,細紋,細小,紫色,紋橫,紋分,紅色,端

角,突起,眼紋,眼狀,相間,相似,白點,白色鱗,白色細帶紋,白色斜帶紋,白色斑紋,

白色斑,白色帶紋,白色,白斑,狹長,狹長,狀細,狀紋,狀突起,狀突,無明顯差異,

灰黑色,灰黑,灰褐色,灰褐,灰白色,灰白,淺黃色,淺黃,深褐色,深褐,深色,淡紫

色,淡紫色,消失,波狀,橫線,橢圓,橙黃色,橙色外,橙色圓斑,橙色,橘黃色,枚白

斑,枚白,枚小白斑,枚小,暗藍色,明顯黑,明顯白色,明顯,斑點,斑紋,數枚白斑,

數枚白,數枚,散生,排列,成蟲,成蝶前,成蝶,性徵,後翅表底色,後翅表,後翅色澤,

後翅腹面,後翅肛角,後翅第,後翅外緣,後翅前緣,後翅中央,後翅,後緣,形黑,形

成,底色,帶金屬光澤,帶金屬,帶金,帶紋,帶狀,差異,差異,尾狀突起,小黑圓斑,

小黑,小部份,小白,小斑,小型白斑,小型,寬圓, 寬圓,室各,室亞外緣,大型,多數,

外觀,外緣部位,外緣,外橫線,外框,外圈,基部,型黑,型白,圓斑,圓形,呈黑,呈灰,

各翅室,及第,區域,前翅表底色,前翅端部,前翅端角,前翅端,前翅前緣,前翅中

室內,前翅中室,前翅中央,前翅中,前翅,前緣,分布,分佈,具性徵,其中,兩枚,光

澤,佈滿,亞外緣部位,亞外緣,中橫線附近,中橫線附近,中橫線,中室內,中室,中

央部位,中央部,中央,三角形

11.8 Discussion

In this chapter, we applied our methods to the case of butterflies, covering the following steps.

1. Modeling XML documents of butterflies.

2. Constructing slot-tree ontology for butterflies.

3. Using slot-filling algorithm to map XML documents into slot-tree of butterflies.


4. Using slot-tree ontology to build query interface for butterflies.

5. Using slot-tree ontology to help XML retrieval for butterflies.

6. Mining slot-tree ontology from XML documents of butterflies.

These methods reduce the semantic gap between human and computer in the domain of butterflies. The query interface enables users to write queries easily. The slot-filling algorithm makes XML documents easier for the computer to understand. Finally, the mining algorithm makes the slot-tree ontology easier to construct.


12 Case Study - Protein Information Resource

In the previous chapter, we tested our methods on the collection “A Museum of Butterflies in Taiwan (MBT)”. In this chapter, we test our methods on the collection “Protein Information Resource (PIR)”. The PIR is a large collection of protein records maintained at Georgetown University.

In this chapter, section 7.1 gives an overview of PIR. Section 7.2 shows some XML data from PIR. Section 7.3 describes a slot-tree for PIR. Section 7.4 describes a query interface for proteins. Section 7.5 describes the slot-filling process for PIR. Section 7.6 discusses the retrieval process for PIR. Section 7.7 discusses the mining process used to build the slot-tree for PIR. Section 7.8 discusses our approach as applied to PIR.

12.1 Introduction

The Protein Information Resource is a general collection of protein and gene records for living organisms, including humans, animals, plants, viruses, and bacteria. Each document in this collection describes a gene or a protein. The following table gives a profile of the collection.

Table 7.1 : The Protein Information Resource

Collection : Protein Information Resource

Working Group : National Biomedical Research Foundation at Georgetown University
  URL : http://pir.georgetown.edu/

Size : The PIR-PSD, Release 72.03, May 17, 2002, contains 283174 entries

Language : English

The Protein Information Resource contains 283174 protein and gene entries. Roughly speaking, the tags can be classified into the following groups.

Table 7.2 : XML tags for Protein Information Resource

Group : Fields

Identification : ID, name

Characteristic : organism, function, classification, feature

Gene : sequence, length, type

Reference : keyword, reference (author, citation), access information

Date : create_date


12.2 The Representation of Proteins in XML

The following figure shows an XML document for the protein entry “S35333”

- <ProteinEntry id="S35333">

<created_date>03-Feb-1994</created_date>

- <protein><name>steroid receptor protein svp44</name></protein>

- <organism><source>zebra fish</source>…</organism>

- <reference>…<author>Fjose, A.</author> …..<citation>EMBO J.</citation>

<volume>12</volume><year>1993</year><pages>1403-1414</pages>

<title>Functional conservation of vertebrate seven-up related genes in neurogenesis and eye

development…

- <xrefs><xref><db>MUID</db><uid>93223680</uid></xref></xrefs>

- <accinfo label="FJO">…<mol-type>mRNA</mol-type> <seq-spec>1-411</seq-spec> -

<xrefs><xref><db>EMBL</db><uid>X70299</uid></xref>…

</reference>

- <classification><superfamily>unassigned erbA-related proteins</superfamily> …

- <keywords>DNA binding, steroid hormone receptor, zinc finger</keywords>

- <feature label="ERBA">

<feature-type>domain</feature-type>

<description>erbA transforming protein homology</description>

<seq-spec>74-320</seq-spec>

</feature>…

- <summary><length>411</length><type>complete</type></summary>

<sequence>MAMVVSVWRDPQEDVAGGPPSGPNPAAQPAREQQQAASAAPHTPQTPSQPGPPSTPGTAGDK…

</ProteinEntry>

Figure 7.1 : An example of XML document in Protein Information Resource

12.3 Slot-Tree Ontology for Proteins

Our ontology is represented as a slot-tree in XML format. The slot-tree we designed for proteins is not fully consistent with the PIR collection. For example, the keyword field in PIR contains any keyword that is important for a protein, whereas our ontology uses several slots to represent a protein, including “protein structure”, “molecular function”, “biological process” and “cellular component”.

The ontology used in this section is based on the Gene Ontology. The Gene Ontology Consortium proposed an ontology system with three dimensions: “molecular function”, “biological process” and “cellular component”. In addition, we add “protein structure”, “protein size”, “molecular type”, and some other information to our slot-tree. A fragment of the slot-tree for proteins is shown in the following figure. For the full slot-tree for proteins, please see appendix 2.

- <frame>

- <s slot="mol-type" path="/ProteinEntry/reference/accinfo/mol-type">

<v value="protein" /><v value="DNA" /><v value="RNA" /></s>

- <structure slot="mol-shape" path="//">

<v value="Alpha Helix"/><v value="Beta Sheet"/>…

- <source_genus slot="organism" path="//">

<v value="Animal" /><v value="Plants" /><v value="Bacteria"/>…

- <body_component slot="body_component" path="//" >

<v value="Heart" /><v value="Lung" /><v value="Liver" />…

- <cell_component slot="cell_component" path="//">

<v value="Nucleus" /><v value="Cytoplasm" />…<v value="Golgi_Bodies"/>…

- <body_function slot="body_function" path="//">

<v value="Digestion" /> <v value="Respiration" /> <v value="Motion" /> …

- <cell_function slot="cell_function" path="//">

<v value="Structural" /><v value="Metabolism" /><v value="Communication" /> …

- <material slot="material" path="protein/target">

<v value="Acid" /><v value="Base" />…<v value="Enzyme" />…

- <s slot="ref-db" path="//db">

<v value="SGD" />…<v value="GDB" /> …<v value="FlyBase" />

</frame>

Figure 7.2 : A Slot-Tree for Proteins

12.4 Query Interface

The query interface is generated automatically from the slot-tree. An XSLT template transforms the slot-tree into an HTML document, which is then displayed as a web page in the browser. The following figure shows the query interface for the protein domain.


Figure 7.3 A Query Interface for Proteins

12.5 Slot-Filling Algorithm

We have to parse XML objects before the extraction. For example, the following XML document is a

protein.

<ProteinEntry>
<protein><name> steroid receptor protein svp44</name>
</protein>
<organism><name> zebra fish </name></organism>
<keyword>DNA binding, steroid hormone receptor, zinc finger</keyword>
<summary><length>411</length>…</summary>
</ProteinEntry>

The example above is parsed into the following sequence of (path, value) pairs.

(ProteinEntry\protein\name, steroid receptor protein svp44)


(ProteinEntry\organism\name,zebra fish)

(ProteinEntry\keyword,DNA binding, steroid hormone receptor, zinc finger)

(ProteinEntry\summary\length,411)

Then we fill each pair into the corresponding slot, as in the following example.

(ProteinEntry\organism\name,zebra fish),

+ <slot name="Organism" path="ProteinEntry/Organism">

<value name="Human"/>

<value name="Plant"/>

<value name="Fish"/>

….

Organism: Fish

The following figure shows the extraction result for the example above.

<slot protein="S35333">
<slot name="Organism" values="Fish"/>
<slot name="Molecular Function" values="binding"/>
</slot>
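An extraction result of this shape can be assembled from the (path, value) pairs with a small sketch. As before, this is an illustration under assumptions rather than the algorithm of Part 2: values are matched case-insensitively as substrings (so “zebra fish” fills the value “Fish”), and the candidate list for “Molecular Function”, including the value “catalysis”, is hypothetical.

```python
import xml.etree.ElementTree as ET

PAIRS = [
    ("ProteinEntry\\organism\\name", "zebra fish"),
    ("ProteinEntry\\keyword", "DNA binding, steroid hormone receptor, zinc finger"),
]

# Hypothetical slot table: slot name -> (path suffix, candidate values).
SLOTS = {
    "Organism": ("organism\\name", ["Human", "Plant", "Fish"]),
    "Molecular Function": ("keyword", ["binding", "catalysis"]),
}

def build_result(entry_id, pairs, slots):
    """Build a <slot protein="..."> element holding one child per filled slot."""
    result = ET.Element("slot", {"protein": entry_id})
    for name, (suffix, values) in slots.items():
        matched = []
        for path, text in pairs:
            if path.endswith(suffix):
                for v in values:
                    # Case-insensitive substring match; avoid duplicate values.
                    if v.lower() in text.lower() and v not in matched:
                        matched.append(v)
        if matched:
            ET.SubElement(result, "slot", {"name": name, "values": ",".join(matched)})
    return result

result = build_result("S35333", PAIRS, SLOTS)
print(ET.tostring(result, encoding="unicode"))
```

Plain substring matching would also accept unwanted hits (for example a value that is part of a longer word), which is one reason the mined slot-tree still needs adjustment by a domain expert.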

12.6 XML Retrieval

After the user submits the query to the XML retrieval system, the system retrieves the query results. Then an XML extraction algorithm extracts the values for each slot. After that, a sorting function sorts the results according to the ranking strategy specified in the query. The following figure shows the query results.


Figure 7.4 : A Query Result for Protein Information Resource

12.7 Slot-Mining Algorithm

The slot-mining algorithm described in section 3.5 is used to extract the important words for each slot. The following table shows some of the slot-mining results (for part of the slot-tree).

Table 7.3 : A Result of Slot-Mining Algorithm for Proteins

Slot Value List

/ProteinEntry/classification/superfamily virus,unassigned,ubiquinone,tyrosine,type,tuberculosis,trypsin,translation,transforming,

transferase,transfer, transcription,transcript,topoisomerase,thioredoxin,tRNA,

synthase,sulfatase,ste,sea,rich,ribosomal,response,repressor,repeat,regulator,region,

reductase,receptor,rat,ras,ran,proteins,protein,probable,polyprotein,phosphate,phage,

permease,peptide,peptidase,oxidase,ornithine,nucleotide,nor,non,mol,min,mer,

membrane,man,long,line,ligase,lactaldehyde,kinesin,kinase,isomerase,inhibitor,inhibit,

immunoglobulin,hypothetical,hydrolyzing,hydrogenase,hydrogen,homology,

homolog,homeobox,glucose,globin,gene,gamma,form,factor,esterase,ester,erbA,

epimerase,enzyme,elegans,edu,domain,dehydrogenase,cytochrome,control,conserved,


coli,chr,cholinesterase,choline,chain,cell,cassette,carrier,binding,bind,beta,barley,

bacterium,antigen,anti,ant,alpha,alcohol,alanine,acid,RNA,NADH,NAD,Mycobacterium,

MTH,III,Escherichia,DNA,Caenorhabditis,Bacillus,ATPase,ATP,ADP

/ProteinEntry/comment ste,protein,phosphorylation,phosphorylated,phosphorylase,phospho,phosphate,non,

molecule,mol,interacts,inhibit,enzyme,covalent,cell,allosterically,allosteric,allo,This,Thi

/ProteinEntry/complex tet,phosphorylase,phospho,mer,homotetramer

/ProteinEntry/feature TMM,SIG,RRH,MAT,KIN,IMM,HOX,FOX,ERBA,ACP,ABC

/ProteinEntry/header/created_date Sep,Oct,Nov,May,Mar,Jun,Jul,Jan,Feb,Dec,Aug,Apr

/ProteinEntry/header/seq-rev_date Sep,Oct,Nov,May,Mar,Jun,Jul,Jan,Feb,Dec,Aug,Apr

/ProteinEntry/genetics/xrefs/xref/db SGD,OMIM,MIPS,MIP,GDB

/ProteinEntry/genetics/start-codon GTG

/ProteinEntry/genetics/map-position qter,pter,circular,chromosome,chr,REV

/ProteinEntry/genetics/gene/db SPDB,SGD,SCOEDB,GDB,CESP,ATSP

/ProteinEntry/function/description sulfate,ran,protein,phospho,phosphate,hydrogenase,hydrogen,glucose,formate,

form,catalyzes,alpha

/ProteinEntry/feature/status predicted,experimental,exp,atypical

/ProteinEntry/feature/feature-type site,region,product,modified,inhibitory,inhibitor,inhibit,domain,disulfide,bonds,binding,

bind,active

/ProteinEntry/keywords/keyword zinc,transmembrane,transferase,transfer,transcription,transcript,tet,ste,ribosome,

regulation,reductase,receptor,rat,ras,ran,pyridoxal,proteinase,protein,polyprotein,

photo,phosphoprotein,phospho,phosphate,oxygen,oxidoreductase,nucleus,nucleotide,

muscle,mol,mitochondrion,min,metalloprotein,metal,mer,membrane,magnesium,lyase,

loop,kinase,isomerase,iron,immunoglobulin,hydrolase,homotetramer,homeobox,

heterotetramer,heme,glycoprotein,finger,erythrocyte,end,edu,duplication,date,complex,

chromoprotein,chr,chloroplast,cell,carrier,carboxyl,carboxy,carbon,blood,biosynthesis,

binding,bind,aminoacyl,amino,amidated,allo,acid,acetylated,NAD,DNA,ATP

/ProteinEntry/feature/description zinc,trypsin,transmembrane,transforming,ste,signal,sequence,seq,response,repeat,

regulator,reductase,rat,ras,ran,pyridoxal,pter,proteinase,protein,potential,phosphorylase,

phospho,phosphate,peptide,oxidase,nucleotide,muscle,motif,molybdopterin,mol,

min,mer,membrane,mature,man,magnesium,low,loop,ligands,ligand,kinase,iron,

inhibitor,inhibit,immunoglobulin,hydrogenase,hydrogen,homology,homolog,homeobox,

heme,glycoprotein,fragment,form,finger,ferroxidase,factor,erbA,end,edu,domain,

dehydrogenase,date,cytochrome,covalent,chr,chain,cassette,carrier,carboxyl,carboxy,

carbohydrate,binding,bind,beta,axial,amino,amidated,alpha,allo,alcohol,acetylated,Thr,

Ser,Lys,Ile,His,Glu,GTP,Cys,Bowman,Birk,Asp,Asn,Arg,ATP,ADP


However, the schema of PIR is not consistent with our ontology as described in section 5.2. This inconsistency causes a mapping problem between slots and paths, and the ontology designer has to spend a lot of time adjusting the automatically generated slot-tree.

12.8 Discussion

In this chapter, we applied our methods to the case of proteins, covering the following steps.

1. Modeling XML documents of proteins.

2. Constructing slot-tree ontology for proteins.

3. Using slot-filling algorithm to map XML documents into slot-tree of proteins.

4. Using slot-tree ontology to build query interface for proteins.

5. Using slot-tree ontology to help XML retrieval for proteins.

6. Mining slot-tree ontology from XML documents of proteins.

These methods reduce the semantic gap between human and computer in the domain of proteins. The query interface enables users to write queries easily. The slot-filling algorithm makes XML documents easier for the computer to understand. Finally, the mining algorithm makes the slot-tree ontology easier to construct.


Part 4 : Conclusions

13 Conclusions and Contributions

In this thesis, our goal is to design an XML retrieval system that reduces the semantic gap between human and computer. We use the slot-tree to help the XML retrieval system achieve this goal. We have shown that the slot-tree ontology can be used to reduce the semantic gap in XML retrieval.

In this thesis, slot-trees are used to generate a query interface that lets users write queries easily. The query interface reduces the semantic gap on the query side. On the other hand, a slot-filling algorithm is designed to help the computer understand XML documents. The slot-filling algorithm reduces the semantic gap on the document side.

To ease the process of building a slot-tree, we propose a slot-mining algorithm that mines a slot-tree from XML documents. The mined slot-tree still has to be modified by a domain expert to improve its quality.

In this chapter, we compare our approach to other approaches in section 8.1. Our contributions are described in section 8.2. Finally, we give our conclusions in section 8.3.

13.1 Comparison

We compare our approach to other approaches based on four measures. Each measure corresponds to one of the questions listed below.

1. Can people write queries easily?

2. Can people write documents easily?

3. Can machine understand queries easily?

4. Can machine understand documents easily?

a. A comparison of knowledge representation approaches

First, we compare four knowledge representation approaches that try to resolve the semantic gap problem: the natural language approach, the database approach, the logic based approach and the XML based approach. Figure 8.1 shows the comparison of these approaches.


Figure 8.1 : A comparison of knowledge representation approaches

Natural language approach (NL) : Both documents and queries are written in natural

language. A typical text retrieval system adopts natural language approach. Natural language is

easy for user to read and write. However, natural language is not easy for computer to

understand.

Database approach (DB) : Documents are encoded as a set of tables in a relational database. Reading and writing data directly in a database system is not easy for users, so a designer has to build a user interface to help them. Database query languages such as SQL are also hard for end users to write. However, data in a database are easy for a computer to understand.

Logic based approach (Logic) : Both logic queries and logic data are very easy for a computer to understand. However, people cannot write logic rules and queries easily. Besides, not all documents can be represented as logic rules.

XML based approach (XML) : XML queries are easy for a computer to understand. However, XML queries are not easy for humans to write, and at present computers cannot easily understand XML documents. In this thesis, we use the slot-tree ontology to help the computer understand XML documents. We also use the slot-tree ontology to build the query interface, which helps humans write XML queries easily. The slot-tree based methods therefore move the XML based approach toward the easy side in Figure 8.1, and the slot-tree ontology reduces the gap between human and computer on XML.


b. A comparison of XML-based representations

Next, we compare three XML-based representations: XML, RDF and DAML. XML has already been described several times in this thesis, so we introduce only RDF and DAML before the comparison.

RDF is a recommendation of the W3C Semantic-Web project. It is an object-based representation that encodes objects in XML format. Each object in RDF is called a resource and has a unique URI. The following example shows an RDF document.

<rdf:RDF>

<rdf:Description about="Athyma_fortuna_kodairai">

<rdf:type resource="http://description.org/schema/butterfly"/>

<color>with brown wing and black head</color>

<texture>has white spots on wings</texture>

</rdf:Description>

</rdf:RDF>
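To make the object-based reading of this example concrete, the following Python sketch parses a namespace-qualified version of the document and lists the properties of the resource. The namespace declarations and the qualified `rdf:about` and `rdf:resource` attributes are our additions for well-formedness; in particular the default namespace `http://description.org/schema/` for `color` and `texture` is an assumption, since the excerpt above omits it. The helper name `read_resources` is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumed default namespace for the color/texture properties (not given in the thesis).
SCHEMA_NS = "http://description.org/schema/"

doc = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns="{SCHEMA_NS}">
  <rdf:Description rdf:about="Athyma_fortuna_kodairai">
    <rdf:type rdf:resource="http://description.org/schema/butterfly"/>
    <color>with brown wing and black head</color>
    <texture>has white spots on wings</texture>
  </rdf:Description>
</rdf:RDF>"""

def read_resources(rdf_xml):
    """Return {resource id: {property name: value}} for each rdf:Description."""
    result = {}
    root = ET.fromstring(rdf_xml)
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        about = desc.get(f"{{{RDF_NS}}}about")
        props = {}
        for child in desc:
            name = child.tag.split("}", 1)[-1]   # drop the namespace part of the tag
            # A property is either a reference (rdf:resource) or literal text.
            props[name] = child.get(f"{{{RDF_NS}}}resource") or (child.text or "").strip()
        result[about] = props
    return result

resources = read_resources(doc)
print(resources["Athyma_fortuna_kodairai"])
```

Each resource becomes a small property table, which is the sense in which RDF is object-based.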

We have to use the tags defined in the RDF specification to describe objects and their inheritance relations. People therefore have to understand the RDF tags before writing RDF documents. Nevertheless, RDF is simple and easy to use.

DAML is a representation that encodes logic rules into frame-based XML documents. DAML extends the tags in RDF to accommodate frequently used logic predicates, such as “disjointWith”, “cardinality”, “intersectionOf”, etc. The following is an example of DAML.

<rdfs:Class rdf:ID="Athyma_fortuna_kodairai">

<rdfs:subClassOf rdf:resource="#butterfly"/>

<daml:disjointWith rdf:resource="#Moth"/>

</rdfs:Class>

Writing a DAML document is not an easy job. People have to understand many DAML tags and express the content as logic predicates. The following figure shows the comparison between XML, RDF and DAML.


Figure 8.2 : A comparison of XML-based representations

c. A comparison of XML retrieval systems

Finally, we compare several XML retrieval systems, including XML-GL, XYZfind, Lore, and our slot-tree system. We have described the XML-GL, XYZfind and Lore systems in section 2.3. Briefly, XML-GL is a graphical XML query language, XYZfind is a two-level XML search system, and Lore is an XML retrieval system based on an object-oriented database.

Figure 8.3 shows the comparison of these approaches. In the figure, our slot-tree based system is labeled “slot”, XML-GL is labeled “X-GL”, XYZfind is labeled “XYZ”, and Lore is labeled “Lore”.

We found that the slot-tree approach performs well on all four questions. The slot-tree ontology lets people write queries easily. Our approach does not require people to write XML documents with a prescribed set of tags, so people can write documents easily. XML queries are always easy for a computer to understand. Finally, the slot-filling algorithm lets the computer understand XML documents easily.


Figure 8.3 : A comparison of XML retrieval systems

13.2 Contributions

Based on the analysis in section 8.1, we may summarize our contribution as follows.

“The slot-tree approach reduces the semantic gap between human and computer on XML.”

The contribution is further described in the following parts.

1. “The slot-tree based query interface lets humans write XML queries easily.”

2. “The slot-filling algorithm lets the computer understand XML documents easily.”

3. “A retrieval system based on the slot-tree is built to reduce the semantic gap on XML.”

4. “The slot-mining algorithm lets people construct a slot-tree ontology easily.”

However, the proposed slot-tree based XML retrieval method focuses on one specific domain at a time. We have to construct a slot-tree for each domain before releasing the XML retrieval system. The method is good at retrieving object-based XML documents such as those describing butterflies and proteins. However, we are not sure whether the method can be used to retrieve XML collections that are not object-based. Besides, the method has to be extended before it can support an XML retrieval system covering more than one domain.


13.3 Discussion and Future Work

In this thesis, we use the slot-tree ontology and the slot-filling algorithm to reduce the semantic gap of XML. The slot-tree is used to generate a query interface that reduces the semantic gap on the query side. The slot-filling algorithm maps XML documents into the slot-tree ontology in order to reduce the semantic gap on the document side. Our XML retrieval system works well on object-based XML collections, such as the butterfly collection in chapter 6 and the protein collection in chapter 7.

However, not all XML documents describe objects. Some XML documents encode categories, scripts and other structures. How to integrate these structures into an XML retrieval system is a good question for our future research.

Another question is the integration of XML collections from several domains. For example, integrating XML documents that describe genes, proteins and biological species into one XML retrieval system is a good case to study. The integration of several domains needs further research.

Finally, a scalable XML retrieval system should be useful on a web with many XML documents. Such a system should be able to retrieve from a large collection of XML documents in a variety of domains. We will try to build such an XML retrieval system in the future.




Appendix 1 : A Museum of Butterflies in Taiwan

a. XML example

<butterfly>

<cname>拉拉山三線蝶</cname>

<nickname />

<present_SN_record>

<present_SN>Athyma_fortuna_kodairai</present_SN>

<present_SN_author>Sonan</present_SN_author>

<present_SN_year>1938</present_SN_year>

</present_SN_record>

<classification>

<family>Nymphalidae</family>

<cfamily>蛺蝶科</cfamily>

<genus>Athyma</genus>

<species>fortuna</species>

<sub_species>kodairai</sub_species>

</classification>

<hostplant>忍冬科 (Caprifoliaceae) 的松田氏紅子仔 (Viburnum luzonicum var. matsudai)。</hostplant>

<honeyplant>成蝶喜吸食腐熟水果汁液或樹幹流出汁液。</honeyplant>

<geographic>

<taiwan>分布於台灣中北部地區,海拔 1000-2000 公尺間山區均有分布。</taiwan>

<global>中國大陸中部有原名亞種分布。</global>

</geographic>

<life_stage>

<egg>

<feature>底部扁平之高饅頭形,表面有明顯六角形格狀花紋,於六角形頂點處,各著生

一細長刺毛。</feature>

<color>淡綠。</color>

<size>直徑約為 1.1-1.3mm。</size>

<characteristic />

<habitate />

<predator>各類卵寄生蜂、蜱等節肢動物。</predator>

<days_of_growth>卵期約為 5-6 天左右。</days_of_growth>

</egg>

<larva>

<feature>終齡幼蟲體呈長圓筒狀,頭部密生硬棘,各體節背方及體側皆長有具星狀刺之

突起。</feature>


<color>終齡幼蟲頭部褐色,表面密生棘狀突起。體呈翠綠色,各體節背方及體側突起基部

為藍色,星狀刺為黃綠色。</color>

<size>終齡幼蟲體長約為 33-41 mm。</size>

<characteristic />

<habitate />

<predator>寄生蜂、寄生蠅、小繭蜂、椿象、蜥蜴及鳥類等。</predator>

<days_of_growth>冬季以二齡幼蟲越冬,幼蟲期長達半年以上。</days_of_growth>

<defense>初齡幼蟲停棲於寄主葉脈,攝食葉脈兩側葉肉,二齡幼蟲會將寄主植物葉片咬

成小塊並吐絲將其此碎片及糞便黏於葉脈造一蟲巢,越冬幼蟲即躲藏於蟲巢當中,

由於幼蟲褐色之體色與蟲巢上乾枯之小葉片或糞便色澤相近,或許可混淆天敵耳目。

</defense>

</larva>

<pupa>

<feature>蛹體為垂蛹,中胸背方隆起,腹節末端有一柄狀懸絲器。頭部前端有一對大型明

顯之彎曲角狀突出物,腹節背方均有小型鋸齒狀脊起。</feature>

<color>蛹體底色呈黃褐色,中、後胸背方有銀色斑塊,體側氣門黑褐色。</color>

<size>蛹體長度約為 22-27mm。</size>

<characteristic />

<habitate />

<predator>蛹寄生蜂、胡蜂、姬蜂及各種真菌等。</predator>

<days_of_growth>蛹期約為 15-20 天,視溫度而定。</days_of_growth>

<defense>老熟幼蟲化蛹於隱蔽之植物叢間,藉以躲避天敵。</defense>

</pupa>

<adult>

<feature>成蟲前翅外觀大致呈現三角形,翅形稍微橫長。後翅卵圓形,外觀接近三角形。

雌蝶翅型較為寬圓。</feature>

<color>雄蝶前、後翅表底色為黑色,前翅中室內有一枚長形白斑,各翅室中橫線部位有一

大型白色橢圓斑,前翅端有兩枚小型白斑。後翅有兩條明顯白色橫帶紋,前後翅緣皆

有不明顯小白紋。雌蟲翅表色澤花紋與雄蟲相似。</color>

<size>本種為中型蝶種,展翅約為 50-60mm。</size>

<characteristic>前翅中室內有一枚長形白斑。</characteristic>

<habitate>台灣中部以北山區均有分布。</habitate>

<predator>蜘蛛、螳螂、青蛙、蜻蜓、鳥類及蜥蜴等捕食性天敵。</predator>

<days_of_growth>前翅中室內有一枚長形白斑。</days_of_growth>

<defense>成蟲飛行快速,外觀與其他多種三線蝶類似,為莫氏擬態的一種。</defense>

<season>夏季較易見到成蟲活動。</season>

<behavior>成蝶喜吸食腐熟水果汁液或樹幹發酵流出之樹液,成蟲活動於開闊林道,常見

成蟲於開闊山徑兩旁樹上佔據地盤驅趕附近飛過蝴蝶,亦可見其活動於溪邊開闊處,


吸食腐果或潮濕地面水分。</behavior>

</adult>

</life_stage>

<update>2000/11/7</update>

<footnote />

</butterfly>
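The slot definitions in the next part address fields of such a document with descendant paths like `//butterfly//egg//color//`. The Python sketch below shows one way such a path could be evaluated, assuming that each `//` step means "any descendant with this tag" and that the trailing `//` can be ignored; both readings are our assumptions about the path notation. The function name `select` is hypothetical, and the sample is an abbreviated copy of the document above.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the butterfly document shown above.
DOC = """<butterfly>
  <cname>拉拉山三線蝶</cname>
  <life_stage>
    <egg><color>淡綠。</color><size>直徑約為 1.1-1.3mm。</size></egg>
    <pupa><size>蛹體長度約為 22-27mm。</size></pupa>
  </life_stage>
</butterfly>"""

def select(root, path):
    """Evaluate a path such as '//butterfly//egg//color//' by descendant steps."""
    names = [n for n in path.split("//") if n]        # tag names, empty parts dropped
    nodes = list(root.iter(names[0]))                 # first step may match the root itself
    for name in names[1:]:
        # Each further step selects matching descendants, excluding the node itself.
        nodes = [d for n in nodes for d in n.iter(name) if d is not n]
    return nodes

root = ET.fromstring(DOC)
egg_color = [e.text for e in select(root, "//butterfly//egg//color//")]
pupa_size = [e.text for e in select(root, "//butterfly//pupa//size//")]
all_sizes = [e.text for e in select(root, "//butterfly//size//")]
print(egg_color, pupa_size)
```

Because every step is a descendant step, the path still matches when an intermediate element such as `life_stage` is not named in it.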

b. Domain Knowledge

<frame language="big5" database="xir" showPath="//butterfly//cname//">

<butterfly>

<family slot="種類" path="//butterfly//cfamily//" menu="yes">

<v value="弄蝶" keys="弄蝶=Hesperiidae" />

<v value="小灰蝶" keys="小灰蝶=Lycaenidae" />

<v value="斑蝶" keys="斑蝶=Danaidae" />

<v value="粉蝶" keys="粉蝶=Pieridae" />

<v value="鳳蝶" keys="鳳蝶=Papilionidae" />

<v value="蛇目蝶" keys="蛇目蝶=Satyridae" />

<v value="蛺蝶" keys="蛺蝶=Nymphalidae" />

<v value="小灰蛺蝶" keys="小灰蛺蝶=Riodinidae" />

<v value="長鬚蝶" keys="長鬚蝶=Libytheidae" />

</family>

<adult slot="蝴蝶成蟲" keys="Adult" path="//butterfly//adult//">

<shape slot="蝴蝶的形狀" keys="Adult:Shape" path="//butterfly//adult//shape//" menu="yes">

<v value="類似燕尾" keys="Swallowtail+突出" image="swallowtail.gif"/>

<v value="細小尾突" keys="little_tail" image="little_tail.gif" />

<v value="翅緣破裂" keys="broken+破裂" image="broken.gif" />

<v value="翅緣波浪狀" keys="鋸齒狀+wave" image="wave.gif" />

<v value="似蛾狀" keys="Moth+蛾" image="moth.gif" />

<v value="似枯葉狀" keys="Leaf+枯葉" image="leaf.gif" />

</shape>

<color slot="蝴蝶的顏色" keys="Adult:Color" path="//butterfly//adult//color//" menu="yes">

<v value="大致黑色" keys="Black" />

<v value="大致深棕色" keys="Dark_Wood" />

<v value="大致淺棕色" keys="Light_Wood" />

<v value="大致橘紅色" keys="Orange_Red" />

<v value="大致橘黃色" keys="Orange_Yellow" />

<v value="大致黃色" keys="Yellow" />

<v value="大致綠色" keys="Green" />


<v value="大致藍色" keys="Blue" />

<v value="大致紫色" keys="Purple" />

<v value="大致灰色" keys="Gray" />

<v value="大致白色" keys="White" />

<v value="黑白相間" keys="Black_White" />

<v value="黑黃相間" keys="Black_Yellow" />

<v value="黑橘相間" keys="Black_Orange" />

<v value="黑藍相間" keys="Black_Blue" />

<v value="黑紅相間" keys="Black_Red+" />

<v value="棕白相間" keys="Wood_White" />

<v value="超過三種顏色" keys="many" />

</color>

<texture slot="蝴蝶的特徵" keys="Adult:Texture" path="//butterfly//adult//texture//" menu="yes">

<v value="沒有花紋" keys="Mono+無..花紋" image="mono.gif" />

<v value="垂直色帶" keys="v_Band" image="v_band.gif" />

<v value="水平色帶" keys="h_Band" image="h_band.gif" />

<v value="一條細線" keys="1_Line" image="1_line.gif" />

<v value="多條細線" keys="lines" image="lines.gif" />

<v value="翅脈明顯" keys="Vein+翅脈" image="vein.gif" />

<v value="格子斑紋" keys="Grid+格狀" image="grid.gif" />

<v value="眼睛狀點" keys="Eyes+圓斑+眼" image="eyes.gif" />

<v value="少數斑點" keys="Spot" image="spot.gif" />

<v value="一些斑點" keys="Some_Spots" image="some_spots.gif" />

<v value="滿佈著斑點" keys="Spots" image="spots.gif" />

<v value="複雜木紋" keys="Complex_Wood" image="complex_wood_t.gif" />

<v value="翅緣有花紋" keys="Edge" image="edge.gif" />

<v value="有零星小點" keys="Stars" image="stars.gif" />

<v value="前翅前半異色" keys="Fore_Half" image="fore_half.gif" />

</texture>

</adult>

<pupa slot="蝴蝶的蛹" keys="Pupa" path="//butterfly//pupa//">

<s slot="蛹的形狀" path="//butterfly//pupa//" menu="yes">

<v value="突起" keys="Skin_Stick" />

<v value="環紋" keys="Ring_Texture" />

<v value="粗糙" keys="Rough_Skin" />

<v value="光滑" keys="Smooth_Skin" />

<v value="橢圓形" keys="Ellipse_shape" />


</s>

<s slot="蛹的顏色" keys="Pupa:Color" path="//butterfly//pupa//color//" menu="yes">

<v value="翠綠色" keys="Green=翠綠色" />

<v value="黃綠色" keys="Light_Green=淡綠色" />

<v value="褐色" keys="Wood" />

<v value="灰色" keys="Gray" />

<v value="白色" keys="White" />

<v value="金黃色" keys="Gold" />

</s>

<s slot="蛹的特徵" keys="Pupa:Feature" path="//butterfly//pupa//feature//" menu="yes">

<v value="帶蛹" keys="Laying_Pupa" image="pupa_bag.jpg" />

<v value="垂蛹" keys="Hanging_Pupa" image="pupa_hang.jpg" />

</s>

</pupa>

<egg slot="蝴蝶的卵" keys="Egg" path="//butterfly//egg//">

<s slot="底部"> <v value="扁平" /> </s>

<s slot="表面">

<s slot="縱脊" />

<s slot="突出物" />

</s>

<s slot="卵的形狀" keys="Egg:Shape" path="//butterfly//egg//feature//" menu="yes">

<v value="圓球形" keys="Ball" image="egg_ball.jpg" />

<v value="半球形" keys="饅頭形+Half_Ball" image="egg_half_ball.jpg" />

<v value="扁平盤狀" keys="Plate" image="egg_plate.jpg" />

<v value="梭子形" keys="酒瓶形+瓶形+Shuttle" image="egg_shuttle.jpg" />

<v value="砲彈形" keys="Bullet" image="egg_bullet.jpg" />

</s>

<s slot="卵的顏色" keys="Egg:Color" path="//butterfly//egg//color//" menu="yes">

<v value="乳白" keys="Milk_White" />

<v value="淡黃" keys="Light_Yellow" />

<v value="棕褐色" keys="Wood+棕+褐" />

<v value="淡綠" keys="Light_Green" />

<v value="橙黃" keys="Yellow" />

<v value="光澤" keys="Shining" />

</s>

<s slot="卵的特徵" keys="Egg:Texture" path="//butterfly//egg//feature//" menu="yes">

<v value="表面光滑" keys="Smooth+光滑" />

<v value="六角形花紋" keys="Haxagon Texture+六角形" />


<v value="有縱脊" keys="Ridge+縱脊" />

<v value="菱形花紋" keys="Rhombus_Texture+菱形" />

<v value="格狀花紋" keys="Square_Texture" />

</s>

</egg>

<larva slot="蝴蝶的幼蟲" keys="Larva+毛毛蟲" path="//butterfly//larva//">

<s slot="軀體" keys="蟲體" />

<s slot="頭部" />

<s slot="體節" />

<s slot="表面" />

<s slot="體側">

<s slot="氣門" />

</s>

<s slot="肛板" />

<s slot="體長" />

<s slot="幼蟲的形狀" keys="Larva:shape" path="//butterfly//larva//feature//" menu="yes">

<v value="細長" keys="Thin" />

<v value="扁平" keys="Like_Plate" />

<v value="紡棰形" keys="Like_Shuttle" />

<v value="鳥糞狀" keys="Like_Bird's_Shit" />

</s>

<s slot="幼蟲的顏色" keys="Larva:Color" path="//butterfly//larva//color//" menu="yes">

<v value="翠綠色" keys="Green+綠色" />

<v value="黃綠色" keys="Yellow_Green" />

<v value="淡黃色" keys="Light_Yellow" />

<v value="灰色" keys="Gray" />

<v value="白色" keys="White" />

<v value="黑色" keys="Black" />

<v value="褐色" keys="Brown" />

</s>

<s slot="幼蟲的特徵" keys="Larva:Texture" path="//butterfly/life_stage/larva/characteristic" menu="yes">

<v value="短毛" keys="Short_Hair" />

<v value="長毛" keys="Long_Hair" />

<v value="細毛" keys="Thin_Hair" />

<v value="肉突" keys="Skin_Stick+突起" />

<v value="橫紋" keys="Line_Texture" />

<v value="圈紋" keys="Ring_Textrue+圈狀眼紋+環紋" />


</s>

</larva>

<s slot="台灣分布" keys="Taiwan" path="//butterfly//geographic//taiwan//" menu="yes">

<v value="台灣全島" keys="Whole_Taiwan" />

<v value="台灣北部" keys="North_Taiwan+北" />

<v value="台灣東部" keys="East_Taiwan+東" />

<v value="台灣南部" keys="South_Taiwan+南" />

<v value="恆春半島" keys="HunChan" />

<v value="綠島" keys="GreenIsland" />

<v value="蘭嶼" keys="LanYu" />

</s>

<s slot="全球分布" path="//butterfly//geographic//global//" menu="yes">

<v value="東亞" keys="East_Asia+朝鮮半島+韓國+日本" />

<v value="東南亞" keys="South_Asia+中南半島+印尼+泰國+馬來+緬甸+菲律賓+婆羅洲" />

<v value="中國大陸" keys="China" />

<v value="喜馬拉亞地區" keys="Himalayas+喜馬拉亞" />

<v value="中亞地區" keys="Middle_Asia+中亞" />

<v value="西伯利亞" keys="Siberia" />

<v value="新幾內亞" keys="New_Guinea" />

<v value="澳洲" keys="Australia" />

<v value="歐洲" keys="Europe+歐" />

<v value="美洲" keys="America+北美+中美+南美" />

<v value="非洲" keys="Africa" />

</s>

<s slot="體型大小" keys="Size" path="//butterfly//adult//size//" menu="yes">

<v value="小型" keys="Small_Size+小" />

<v value="中型" keys="Middle_Size+中" />

<v value="大型" keys="Large_Size+大" />

</s>

<s slot="棲息地" keys="棲息地=Habitate" path="//butterfly//adult//habitate//" menu="yes">

<v value="平地" keys="平地=Level_Ground" />

<v value="低海拔山區" keys="Low_Mountain+低海拔" />

<v value="中海拔山區" keys="Middle_Mountain+中海拔" />

<v value="高海拔山區" keys="High_Mountain+高海拔" />

</s>

<s slot="宿主植物" keys="Hostplant+寄主植物" path="//butterfly//hostplant//" menu="yes">

<v value="豆科" keys="Leguminosae" />

<v value="大戟科" keys="Euphorbiaceae" />


<v value="白花菜科" keys="Capparidaceae" />

<v value="蘇鐵科" keys="Cycadaceae" />

<v value="蕁麻科" keys="Urticaceae" />

<v value="禾本科" keys="Gramineae" />

<v value="殼斗科" keys="Fagaceae" />

<v value="芸香科" keys="Rutaceae" />

<v value="榆科" keys="Ulmaceae" />

<v value="樟科" keys="Lauraceae" />

<v value="木犀科" keys="Oleaceae" />

<v value="桑寄生科" keys="Loranthaceae" />

<v value="肉食性" keys="Carnivore" />

</s>

<s slot="飲食習慣" keys="Eat Food" path="//butterfly//adult//behavior//;//butterfly//honeyplant//" menu="yes">

<v value="食花蜜" keys="Nectar+蜜" />

<v value="食腐汁" keys="Juice+腐+果汁+汁液+液" />

</s>

<s slot="飛行速度" keys="Fly Speed" path="//butterfly//adult//behavior//;//butterfly//adult//defense//" menu="yes">

<v value="飛行迅速" keys="速+快" />

<v value="飛行緩慢" keys="緩+慢" />

</s>

- <s slot="禦敵方式" path="//butterfly//defense//" menu="yes">

<v value="有毒" keys="毒" />

<v value="擬態+保護色" keys="擬態+欺騙+環境融合+混淆" />

<v value="有臭味" keys="臭" />

</s>

- <s slot="現存數量" path="//butterfly//footnote//" menu="yes">

<v value="已絕種" keys="已滅絕" />

<v value="瀕臨絕種" keys="瀕臨滅絕" />

<v value="罕見稀少" keys="稀少+罕見" />

<v value="普通常見" keys="常見" />

</s>

<s slot="棲息地高度" path="//butterfly//geographic//taiwan//text()$meter" sortable="yes" />

<s slot="蝴蝶的大小" path="//butterfly//life_stage//adult//size//text()$meter" sortable="yes" />

<s slot="蝴蝶的壽命" path="//butterfly//life_stage//adult//days_of_growth//text()$day" sortable="yes" />

</butterfly>

</frame>
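
The frame above pairs each slot value with a `keys` attribute. A minimal sketch of how such a definition could be used to fill a slot from free text is given below. It assumes, without confirmation from the listing itself, that `+` separates alternative keywords, that the value string also counts as a keyword, and that matching is plain substring search. The function names `load_slots` and `fill_slot` and the sample sentence are hypothetical.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the frame above (the "體型大小" slot).
FRAME = """
<frame>
  <s slot="體型大小" keys="Size" path="//butterfly//adult//size//" menu="yes">
    <v value="小型" keys="Small_Size+小" />
    <v value="中型" keys="Middle_Size+中" />
    <v value="大型" keys="Large_Size+大" />
  </s>
</frame>
"""


def load_slots(frame_xml):
    """Return {slot name: {value: [keywords]}} for every <s> element."""
    slots = {}
    for s in ET.fromstring(frame_xml).iter("s"):
        values = {}
        for v in s.findall("v"):
            value = v.get("value")
            # Assumption: "+" separates alternative keywords, and the
            # value string itself is also treated as a keyword.
            keys = [k for k in v.get("keys", "").split("+") if k]
            values[value] = [value] + keys
        slots[s.get("slot")] = values
    return slots


def fill_slot(values, text):
    """Return the values whose keywords occur in text (substring match)."""
    return [value for value, keys in values.items()
            if any(k in text for k in keys)]


slots = load_slots(FRAME)
# Hypothetical sentence taken from the size description of a butterfly.
matched = fill_slot(slots["體型大小"], "本種為大型蝶類")
print(matched)
```

Substring matching keeps the sketch short, but single-character keys such as `大` will also fire inside unrelated words, so a real matcher would need stricter rules than the ones assumed here.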

Appendix 2 : Protein Information Resource

a. XML example

- <ProteinEntry id="S35333">

- <header>

<uid>S35333</uid>

<accession>S35333</accession>

<created_date>03-Feb-1994</created_date>

<seq-rev_date>03-Feb-1994</seq-rev_date>

<txt-rev_date>24-Sep-1999</txt-rev_date>

</header>

- <protein><name>steroid receptor protein svp44</name></protein>

- <organism>

<source>zebra fish</source>

<common>zebra fish</common>

<formal>Brachydanio rerio</formal>

</organism>

- <reference>

- <refinfo refid="S35333">

- <authors>

<author>Fjose, A.</author>

<author>Nornes, S.</author>

<author>Weber, U.</author>

<author>Mlodzik, M.</author>

</authors>

<citation>EMBO J.</citation>

<volume>12</volume>

<year>1993</year>

<pages>1403-1414</pages>

<title>Functional conservation of vertebrate seven-up related genes in neurogenesis and eye development.</title>

- <xrefs><xref><db>MUID</db><uid>93223680</uid></xref></xrefs>

</refinfo>

- <accinfo label="FJO">

<accession>S35333</accession>

<mol-type>mRNA</mol-type>

<seq-spec>1-411</seq-spec>

- <xrefs>

- <xref><db>EMBL</db><uid>X70299</uid></xref>

- <xref><db>NID</db><uid>g296418</uid></xref>

- <xref><db>PIDN</db><uid>CAA49780.1</uid></xref>

- <xref><db>PID</db><uid>g296419</uid></xref>

</xrefs>

</accinfo>

</reference>

- <genetics><gene><uid>svp44</uid></gene></genetics>

- <classification>

<superfamily>unassigned erbA-related proteins</superfamily>

<superfamily>erbA transforming protein homology</superfamily>

</classification>

- <keywords>

<keyword>DNA binding</keyword>

<keyword>steroid hormone receptor</keyword>

<keyword>zinc finger</keyword>

</keywords>

- <feature label="ERBA">

<feature-type>domain</feature-type>

<description>erbA transforming protein homology</description>

<seq-spec>74-320</seq-spec>

</feature>

- <feature>

<feature-type>region</feature-type>

<description>zinc finger</description>

<seq-spec>76-96</seq-spec>

</feature>

- <feature>

<feature-type>region</feature-type>

<description>zinc finger</description>

<seq-spec>112-136</seq-spec>

</feature>

- <summary><length>411</length><type>complete</type></summary>

<sequence>MAMVVSVWRDPQEDVAGGPPSGPNPAAQPAREQQQAASAAPHTPQTPSQPGPPSTP

GTAGDKGSQNSGQSQQHIECVVCGDKSSGKHYGQFTCEGCKSFFKRSVRRNLTYTCRANRNCPI

DQHHRNQCQYCRLKKCLKVGMRREAVQRGRMPPTQPNPGQYALTNGDPLNGHCYLSGYISLLL

RAEPYPTSRYGSQCMQPNNIMGIENICELAARLLFSAVEWARNIPFFPDLQITDQVSLLRLTWSEL

FVLNAAQCSMPLHVAPLLAAAGLHASPMSADRVVAFMDHIRIFQEQVEKLKALHVDSAEYSCIK

AIVLFTSDACGLSDAAHIESLQEKSQCALEEYVRSQYPNQPSRFGKLLLRLPSLRTVSSSVIEQLFF

VRLVGKTPIETLIRDMLLSGSSFNWPYMSIQ</sequence>

</ProteinEntry>
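
The slot definitions in this appendix address parts of such an entry through a `path` attribute, for example `/ProteinEntry/reference/accinfo/mol-type`. The sketch below shows one way these two absolute paths could be evaluated against the entry with Python's standard `xml.etree.ElementTree`. It is an illustration under stated assumptions, not the retrieval system's own code: the helper `select_text` is hypothetical, the entry is abridged, and only simple child-step paths are handled.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the entry above; only the elements used here are kept.
ENTRY = """
<ProteinEntry id="S35333">
  <organism><formal>Brachydanio rerio</formal></organism>
  <reference>
    <accinfo label="FJO">
      <mol-type>mRNA</mol-type>
      <xrefs>
        <xref><db>EMBL</db><uid>X70299</uid></xref>
        <xref><db>NID</db><uid>g296418</uid></xref>
      </xrefs>
    </accinfo>
  </reference>
  <summary><length>411</length><type>complete</type></summary>
</ProteinEntry>
"""


def select_text(root, path):
    """Collect the text of every element matched by an absolute path.

    ElementTree only accepts paths relative to an element, so the leading
    "/<root tag>/" is stripped first (hypothetical helper, for illustration).
    """
    prefix = "/" + root.tag + "/"
    if not path.startswith(prefix):
        raise ValueError("path must start with " + prefix)
    relative = path[len(prefix):]
    return [el.text for el in root.findall(relative)]


root = ET.fromstring(ENTRY)
mol_types = select_text(root, "/ProteinEntry/reference/accinfo/mol-type")
dbs = select_text(root, "/ProteinEntry/reference/accinfo/xrefs/xref/db")
print(mol_types, dbs)
```

The values returned this way are the candidates that a slot such as `mol-type` or `record-db` would compare against its `<v>` entries.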

b. Domain Knowledge

- <frame>

- <s slot="分子種類=mol-type" path="/ProteinEntry/reference/accinfo/mol-type" menu="yes">

<v value="protein" />

<v value="DNA" />

<v value="RNA" />

<v value="mRNA" />

<v value="genomic RNA" />

</s>

- <structure slot="分子形狀=mol-shape" path="//ProteinEntry" menu="yes">

<v value="螺旋=Alpha" keys="螺旋=Helix" image="motif/Alpha.gif" />

<v value="平板=Beta" keys="平板=Sheet" image="motif/Beta.gif" />

<v value="Alpha+Beta" />

<v value="Parallel-Beta" />

<v value="AntiParallel-Beta" />

</structure>

- <source_genus slot="分子來源=organism" path="//ProteinEntry//organism" menu="yes">

<v value="動物=Animal" />

<v value="植物=Plants" />

- <v value="細菌=Bacteria"><v value="大腸桿菌=E_coli"/></v>

- <v value="病毒=Virus"><v value="噬菌體=Bacteriophage" /></v>

<v value="昆蟲=Insects" />

<v value="酵母=Yeast" />

<v value="人=Human" />

<v value="牛=Cow" keys="牛=Ox" />

<v value="雞=Chicken" />

<v value="豬=Pig" />

<v value="兔=Rabbit" />

<v value="鼠=Mouse" keys="rat=鼠" />

<v value="魚=Fish" keys="Whale,Dolphin" />

<v value="鳥類=Bird" />

<v value="昆蟲=Insect" />

<v value="真菌=Fungi" />

<v value="線蟲=Nematodes" />

</source_genus>

- <body_component slot="身體部位=body_component" path="//ProteinEntry" menu="yes">

<v value="心臟=Heart" />

<v value="肺臟=Lung" />

<v value="肝臟=Liver" />

<v value="腎臟=Kidney" />

<v value="胰臟=Pancreas" />

<v value="脾臟=Spleen" />

<v value="腸道=Intestine" />

<v value="大腦=Brain" />

<v value="皮膚=Skin" keys="皮膚=skin" />

<v value="肌肉=Muscle" keys="肌肉=Myosin" />

<v value="毛髮=Hair" />

<v value="神經=Nerve_System" />

<v value="血液=Blood" />

<v value="骨骼=Bone" />

<v value="副甲狀腺=Parathyroid" />

<v value="荷爾蒙=Hormone" />

<v value="羽毛=Feather" />

<v value="植物的根=Root" />

<v value="植物的莖=Stem+Trunk" />

<v value="植物的葉=Leaf" />

</body_component>

- <cell_component slot="細胞部位=cell_component" path="//ProteinEntry" menu="yes">

<v value="細胞核=Nucleus" />

<v value="細胞質=Cytoplasm" />

<v value="細胞膜=Membrane" />

<v value="細胞壁=Cell_Wall" />

<v value="內質網=Endoplasmic_reticulum" />

<v value="高基氏體=Golgi_Bodies" />

<v value="溶小體=Lysosomes" />

<v value="粒腺體=Mitochondria" />

<v value="運輸系統=Transport" />

<v value="植物的質體=Plastids" />

</cell_component>

- <body_function slot="身體功能=body_function" path="//ProteinEntry" menu="yes">

<v value="消化=Digestion" />

<v value="呼吸=Respiration" />

<v value="運動=Motion" />

<v value="學習=Memory" />

<v value="感覺=Perception" />

<v value="幼兒=Larval" />

<v value="成長=Adult" />

<v value="懷孕=Pregnancy" />

<v value="交配=Mating" />

</body_function>

- <cell_function slot="細胞功能=cell_function" path="//ProteinEntry" menu="yes">

<v value="骨架=Structural" />

<v value="成長=Growth" />

<v value="吞噬=Phagocytosis" />

<v value="訊息=Communication" />

<v value="轉錄=Transcription" />

<v value="代謝=Metabolism" />

<v value="平衡=Ion_homeostasis" />

<v value="分解=Catabolism" />

<v value="調節=Regulation" />

<v value="催化=Enzyme" />

<v value="免疫=Immune" />

<v value="色素=Cytochrome" />

<v value="結合=Binding" />

<v value="水解=Hydrolase" />

<v value="循環=Circulation" />

<v value="毒素=Toxin" />

</cell_function>

- <material slot="相關元素=material" path="//ProteinEntry" menu="yes">

<v value="DNA" />

<v value="RNA" />

<v value="酸=Acid" />

<v value="鹼=Base" />

<v value="鹽=Salt" />

<v value="醣=Carbohydrate" />

<v value="脢=Enzyme" />

<v value="核酸=Nucleotides" />

<v value="脂肪=Lipid" />

<v value="維生素=vitamin" />

<v value="離子=Anion/Cation" />

<v value="碳=Carbon" />

<v value="磷=Phosphatase" />

<v value="能量=ATP" />

- <v value="金屬">

<v value="鈉" />

<v value="鉀=Potassium" />

<v value="鈣=calcium" />

<v value="鐵=iron" keys="ferric" />

<v value="銅=copper" />

<v value="鋁=aluminum" />

<v value="鎂=magnesium" />

<v value="重金屬=heavy_metal" />

</v>

</material>

- <property slot="特性=property" path="//ProteinEntry" menu="yes">

<v value="親水性=Hydrophilic" />

<v value="斥水性=Hydrophobic" />

<v value="帶正電=Positive_Charged" />

<v value="帶負電=Negative_Charged" />

</property>

- <s slot="蛋白質大小=size" path="//ProteinEntry//size" menu="yes" sortable="yes">

<v value="10-20R" />

<v value="20-50R" />

<v value="50-100R" />

<v value="100-500R" />

<v value="500-1000R" />

<v value="1000R-*" />

</s>

- <s slot="全部/片斷=whole/part" path="//ProteinEntry/summary/type" menu="yes">

<v value="fragment" />

<v value="complete" />

<v value="fragments" />

</s>

- <s slot="database" path="//db" menu="yes">

</s>

- <s slot="記錄資料庫=record-db" path="/ProteinEntry/reference/accinfo/xrefs/xref/db" menu="yes">

<!-- gene db -->

<v value="SGD" />

<v value="OMIM" />

<v value="MIPS" />

<v value="MIP" />

<v value="GDB" />

<v value="FlyBase" />

<!-- ref db -->

<v value="XFSC" />

<v value="UWGP" />

<v value="TIGR" />

<v value="SPDB" />

<v value="SCOEDB" />

<v value="PIDN" />

<v value="PID" />

<v value="PASP" />

<v value="NMASP" />

<v value="NMA" />

<v value="NID" />

<v value="MIPS" />

<v value="MIP" />

<v value="GSPDB" />

<v value="EMBL" />

<v value="DDBJ" />

<v value="CJSP" />

<v value="CESP" />

<v value="ATSP" />

<!-- ref db -->

<v value="PMID" />

<v value="MUID" />

</s>

- <source_genus slot="出版日期" path="//date" menu="yes" sortable="yes">

<v value="2002" />

<v value="2001" />

<v value="2000" />

<v value="1999" />

<v value="1998" />

<v value="1997" />

<v value="1996" />

<v value="1990-1995" />

<v value="1980-1990" />

<v value="1970-1980" />

<v value="before 1970" />

</source_genus>

</frame>