Information Retrieval and Extraction
Hsin-Hsi Chen (陳信希)
Department of Computer Science and Information Engineering, National Taiwan University

Information Retrieval

• generic information retrieval system
  select and return to the user desired documents from a large set of documents in accordance with criteria specified by the user
• functions
  – document search
    the selection of documents from an existing collection of documents
  – document routing
    the dissemination of incoming documents to appropriate users on the basis of user interest profiles

Detection Need

• Definition
  a set of criteria specified by the user which describes the kind of information desired
  – queries in document search task
  – profiles in routing task
• forms
  – keywords
  – keywords with Boolean operators
  – free text
  – example documents
  – ...
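As a rough illustration of the "keywords with Boolean operators" form, the sketch below evaluates Boolean combinations as set operations. The `index` contents, the document ids, and the helper `docs` are hypothetical and not taken from the slides.

```python
# Hypothetical inverted index: keyword -> set of ids of documents containing it.
index = {
    "image": {"d1", "d3"},
    "scanner": {"d1"},
    "retrieval": {"d2", "d3"},
}


def docs(keyword: str) -> set[str]:
    """Look up the documents for a keyword; unknown keywords match nothing."""
    return index.get(keyword, set())


# image AND retrieval
both = docs("image") & docs("retrieval")
# scanner OR retrieval
either = docs("scanner") | docs("retrieval")
# image AND NOT scanner
without = docs("image") - docs("scanner")

print(sorted(both), sorted(either), sorted(without))
```

Plain keyword queries correspond to a single lookup; free text and example documents need the preprocessing steps described later in the deck.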

Example

<head> Tipster Topic Description
<num> Number: 033
<dom> Domain: Science and Technology
<title> Topic: Companies Capable of Producing Document Management
<des> Description:
Document must identify a company who has the capability to produce document management system by obtaining a turnkey-system or by obtaining and integrating the basic components.
<narr> Narrative:
To be relevant, the document must identify a turnkey document management system or components which could be integrated to form a document management system and the name of either the company developing the system or the company using the system. These components are: a computer, image scanner or optical character recognition system, and an information retrieval or text management system.

Example (Continued)

<con> Concepts:
1. document management, document processing, office automation, electronic imaging
2. image scanner, optical character recognition (OCR)
3. text management, text retrieval, text database
4. optical disk
<fac> Factors:
<def> Definitions
Document Management - The creation, storage and retrieval of documents containing, text, images, and graphics.
Image Scanner - A device that converts a printed image into a video image, without recognizing the actual content of the text or pictures.
Optical Disk - A disk that is written and read by light, and are sometimes associated with the storage of digital images because of their high storage capacity.
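A topic in this tagged form can be split into fields before any further processing. The sketch below is one possible way to do that, assuming every field starts with a tag such as `<num>` and runs until the next tag; the function name `parse_topic` is hypothetical, and text that itself contains something shaped like a tag would be split incorrectly.

```python
import re

# A shortened copy of the example topic, used only as test input.
topic = (
    "<num> Number: 033\n"
    "<dom> Domain: Science and Technology\n"
    "<title> Topic: Companies Capable of Producing Document Management\n"
)


def parse_topic(text: str) -> dict[str, str]:
    """Split a topic into {tag: content}; each field runs until the next tag."""
    fields = {}
    pattern = r"<(\w+)>\s*(.*?)(?=<\w+>|\Z)"
    for match in re.finditer(pattern, text, flags=re.DOTALL):
        tag, content = match.groups()
        fields[tag] = " ".join(content.split())  # collapse line breaks and spaces
    return fields


fields = parse_topic(topic)
print(fields["title"])
```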

search vs. routing

• The search process matches a single Detection Need against the stored corpus to return a subset of documents.

• Routing matches a single document against a group of Profiles to determine which users are interested in the document.

• Profiles are long-term expressions of user needs.

• Search queries are ad hoc in nature.

• A generic detection architecture can be used for both search and routing.

Search

• retrieval of desired documents from an existing corpus
• Retrospective search is frequently interactive.
• Methods
  – indexing the corpus by keyword, stem and/or phrase
  – applying statistical and/or learning techniques to better understand the content of the corpus
  – analyzing free text Detection Needs to compare with the indexed corpus or a single document
  – ...

Document Detection: Search (diagram not preserved in this transcript)

Document Detection: Search (Continued)

• Document Corpus
  – the content of the corpus may have a significant effect on performance in some applications
• Preprocessing of Document Corpus
  – stemming
  – a list of stop words
  – phrases, multi-term items
  – ...
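The preprocessing steps listed above can be sketched as follows. This is only a minimal illustration: the stop list, the suffix rules, and the function names are assumptions, and `crude_stem` is a toy suffix stripper rather than a real stemming algorithm.

```python
import re

# Hypothetical stop list and suffix rules, chosen only for illustration.
STOP_WORDS = {"a", "an", "and", "by", "of", "or", "the", "to"}
SUFFIXES = ("ing", "es", "s")


def crude_stem(word: str) -> str:
    """Strip one common suffix; a toy stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def phrases(stems: list[str]) -> list[str]:
    """Form two-term items from neighbouring stems (after stop-word removal)."""
    return [f"{a} {b}" for a, b in zip(stems, stems[1:])]


stems = preprocess("Indexing the scanned documents")
print(stems)
print(phrases(stems))
```

Note that the phrases are built after stop words are removed, so a pair may join words that were not adjacent in the original text; a real system would have to decide whether that is acceptable.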

Document Detection: Search (Continued)

• Building Index from Stems
  – key place for optimizing run-time performance
  – cost to build the index for a large corpus
• Document Index
  – a list of terms, stems, phrases, etc.
  – frequency of terms in the document and corpus
  – frequency of the co-occurrence of terms within the corpus
  – index may be as large as the original document corpus
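The contents of such a document index can be sketched with a few dictionaries. The toy corpus and the names `postings`, `corpus_freq`, and `cooccurrence` are hypothetical; co-occurrence is counted here per document, which is one of several possible definitions.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy corpus of already preprocessed documents (hypothetical data).
corpus = {
    "d1": ["image", "scanner", "document"],
    "d2": ["document", "retrieval", "document"],
    "d3": ["image", "retrieval"],
}

postings = defaultdict(dict)  # term -> {doc_id: frequency in that document}
corpus_freq = Counter()       # term -> frequency in the whole corpus
cooccurrence = Counter()      # (term_a, term_b) -> documents containing both

for doc_id, terms in corpus.items():
    counts = Counter(terms)
    for term, freq in counts.items():
        postings[term][doc_id] = freq
        corpus_freq[term] += freq
    # Count each unordered pair of distinct terms once per document.
    for pair in combinations(sorted(counts), 2):
        cooccurrence[pair] += 1

print(postings["document"])
print(corpus_freq["document"])
print(cooccurrence[("document", "image")])
```

Even in this toy form the index stores an entry per term per document plus an entry per term pair, which suggests why an index may grow as large as the corpus itself.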

Document Detection: Search (Continued)

• Detection Need
  – the user's criteria for a relevant document
• Convert Detection Need to System Specific Query
  – first transformed into a detection query, and then a retrieval query
  – detection query: specific to the retrieval engine, but independent of the corpus
  – retrieval query: specific to the retrieval engine, and to the corpus
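One way to picture the two-stage conversion is sketched below. It assumes an engine that represents queries as bags of terms, and it uses inverse document frequency (IDF) weighting as the corpus-specific step; neither choice comes from the slides, and the function names, stop list, and document frequencies are hypothetical.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "the", "of", "for", "that", "and"}  # hypothetical stop list


def to_detection_query(detection_need: str) -> Counter:
    """Engine-specific but corpus-independent: a bag of query terms."""
    tokens = re.findall(r"[a-z]+", detection_need.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def to_retrieval_query(detection_query: Counter, doc_freq: dict, n_docs: int) -> dict:
    """Corpus-specific: drop terms absent from the corpus, weight the rest by IDF."""
    return {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in detection_query.items()
        if doc_freq.get(term, 0) > 0
    }


# Hypothetical document frequencies for a corpus of 100 documents.
doc_freq = {"document": 50, "management": 10, "company": 25}

detection_query = to_detection_query("document management systems for the company")
retrieval_query = to_retrieval_query(detection_query, doc_freq, n_docs=100)
print(sorted(retrieval_query))
```

The same detection query could be reused against a different corpus; only the second step would need to be repeated with that corpus's statistics.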

Document Detection: Search (Continued)

• Compare Query with Index
• Resultant Rank Ordered List of Documents
  – Return the top 'N' documents
  – Rank the list of relevant documents from the most relevant to the query to the least relevant
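The comparison and ranking step might look like the following sketch. The scoring function (term frequency times IDF, summed over query terms) is an assumption chosen for brevity, not a method stated in the slides; the corpus and the function name `rank` are hypothetical.

```python
import heapq
import math
from collections import Counter

# Toy preprocessed corpus (hypothetical data).
corpus = {
    "d1": ["document", "management", "system"],
    "d2": ["image", "scanner"],
    "d3": ["document", "image", "document"],
}


def rank(query_terms: list[str], n: int) -> list[tuple[str, float]]:
    """Score documents by summed tf * idf over the query terms; return the top n."""
    n_docs = len(corpus)
    doc_freq = Counter(term for terms in corpus.values() for term in set(terms))
    scores = {}
    for doc_id, terms in corpus.items():
        tf = Counter(terms)
        score = sum(
            tf[t] * math.log(n_docs / doc_freq[t]) for t in query_terms if t in tf
        )
        if score > 0:  # documents sharing no weighted term are not returned
            scores[doc_id] = score
    # Most relevant first, truncated to the top n.
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


for doc_id, score in rank(["document", "management"], n=2):
    print(doc_id, round(score, 3))
```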

Routing (diagram not preserved in this transcript)

Routing (Continued)

• Profile of Multiple Detection Needs
  – A Profile is a group of individual Detection Needs that describes a user's areas of interest.
  – All Profiles will be compared to each incoming document (via the Profile index).
  – If a document matches a Profile, the user is notified about the existence of a relevant document.

Routing (Continued)

• Convert Detection Need to System Specific Query
• Building Index from Queries
  – similar to building the corpus index for searching
  – the quantity of source data (Profiles) is usually much less than a document corpus
  – Profiles may have more specific, structured data in the form of SGML tagged fields

Routing (Continued)

• Routing Profile Index
  – The index will be system specific and will make use of all the preprocessing techniques employed by a particular detection system.
• Document to be routed
  – A stream of incoming documents is handled one at a time to determine where each should be directed.
  – Routing implementation may handle multiple document streams and multiple Profiles.

Routing (Continued)

• Preprocessing of Document
  – A document is preprocessed in the same manner that a query would be set up in a search.
  – The document and query roles are reversed compared with the search process.
• Compare Document with Index
  – Identify which Profiles are relevant to the document.
  – Given a document, which of the indexed Profiles match it?

Routing (Continued)

• Resultant List of Profiles
  – The list of Profiles identifies which users should receive the document.
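The routing steps can be sketched end to end as follows. The Profiles, user names, and the matching rule (a Detection Need matches when all of its terms occur in the document) are assumptions made for illustration; a real system would apply the same preprocessing and scoring it uses for search.

```python
from collections import defaultdict

# Hypothetical Profiles: each user has several Detection Needs, reduced to term sets.
profiles = {
    "alice": [{"image", "scanner"}, {"optical", "disk"}],
    "bob": [{"text", "retrieval"}],
}

# Profile index: term -> list of (user, need number) entries containing that term.
profile_index = defaultdict(list)
for user, needs in profiles.items():
    for need_no, terms in enumerate(needs):
        for term in terms:
            profile_index[term].append((user, need_no))


def route(document_terms: set[str]) -> list[str]:
    """Return users with at least one Detection Need fully covered by the document."""
    hits = defaultdict(int)  # (user, need number) -> number of matched terms
    for term in document_terms:
        for entry in profile_index.get(term, []):
            hits[entry] += 1
    matched = {
        user
        for (user, need_no), count in hits.items()
        if count == len(profiles[user][need_no])
    }
    return sorted(matched)


# A stream of incoming (already preprocessed) documents, handled one at a time.
incoming = [{"image", "scanner", "price"}, {"text", "retrieval", "optical"}]
for doc in incoming:
    print(route(doc))
```

Here the document plays the role that the query plays in search: it is looked up in an index built from the Profiles.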

Summary

• Generate a representation of the meaning or content of each object based on its description.

• Generate a representation of the meaning of the information need.

• Compare these two representations to select those objects that are most likely to match the information need.

Figure: Basic Architecture of an Information Retrieval System. Documents are mapped to a Document Representation and Queries to a Query Representation; the two representations are then passed to Comparison.
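The three boxes of this architecture can be mirrored by three small functions. In this sketch the representation is a plain set of words and the comparison is Jaccard overlap; both are assumptions picked for brevity, and the function names are hypothetical.

```python
def represent_document(text: str) -> set[str]:
    """Document representation: here simply the set of lowercased words."""
    return set(text.lower().split())


def represent_query(text: str) -> set[str]:
    """Query representation: built the same way so the two are comparable."""
    return set(text.lower().split())


def compare(doc_rep: set[str], query_rep: set[str]) -> float:
    """Comparison: Jaccard overlap between the two representations."""
    union = doc_rep | query_rep
    return len(doc_rep & query_rep) / len(union) if union else 0.0


documents = ["optical character recognition", "text retrieval system"]
query = "text retrieval"
best = max(
    documents,
    key=lambda d: compare(represent_document(d), represent_query(query)),
)
print(best)
```

Swapping in a richer representation or a different comparison function leaves the overall shape unchanged.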

Research Issues

• Given a set of descriptions for objects in the collection and a description of an information need, we must consider the following issues.
• Issue 1
  – What makes a good document representation?
  – What are retrievable units and how are they organized?
  – How can a representation be generated from a description of the document?

Research Issues (Continued)

• Issue 2
  How can we represent the information need and how can we acquire this representation either from a description of the information need or through interaction with the user?
• Issue 3
  How can we compare representations to judge the likelihood that a document matches an information need?

Research Issues (Continued)

• Issue 4
  How can we evaluate the effectiveness of the retrieval process?
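The slide does not name a measure. As one common possibility, the sketch below computes precision and recall from a list of retrieved documents and a set of documents judged relevant; the function name and the sample ids are hypothetical.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision: share of retrieved documents that are relevant.
    Recall: share of relevant documents that were retrieved."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical system output and relevance judgments.
retrieved = ["d1", "d3", "d7", "d9"]
relevant = {"d1", "d2", "d3"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, round(recall, 3))
```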

Information Extraction

• Generic Information Extraction System
  An information extraction system is a cascade of transducers or modules that at each step add structure and often lose information, hopefully irrelevant, by applying rules that are acquired manually and/or automatically.
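The idea of a cascade can be sketched as function composition, where each stage consumes the previous stage's output. The helper `cascade` and the three toy stages are hypothetical; they only illustrate how each step adds some structure and discards some information.

```python
from functools import reduce
from typing import Any, Callable

Transducer = Callable[[Any], Any]


def cascade(*stages: Transducer) -> Transducer:
    """Compose transducers so each stage consumes the previous stage's output."""
    return lambda data: reduce(lambda value, stage: stage(value), stages, data)


def tokenize(text: str) -> list[str]:
    # Adds token boundaries; loses the original spacing.
    return text.split()


def tag(tokens: list[str]) -> list[tuple[str, str]]:
    # Adds a label to every token.
    return [(t, "NUM" if t.isdigit() else "WORD") for t in tokens]


def keep_numbers(tagged: list[tuple[str, str]]) -> list[int]:
    # Keeps the numeric items; loses all the words.
    return [int(t) for t, label in tagged if label == "NUM"]


extract = cascade(tokenize, tag, keep_numbers)
print(extract("Acme hired 300 engineers in 1996"))
```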

Information Extraction (Continued)

• What are the transducers or modules?

• What are their input and output?

• What structure is added?

• What information is lost?

• What is the form of the rules?

• How are the rules applied?

• How are the rules acquired?

Example: Parser

• transducer: parser
• input: the sequence of words or lexical items
• output: a parse tree
• information added: predicate-argument and modification relations
• information lost: none
• rule form: unification grammars
• application method: chart parser
• acquisition method: manual

Modules

• Text Zoner
  turn a text into a set of text segments
• Preprocessor
  turn a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes
• Filter
  turn a set of sentences into a smaller set of sentences by filtering out the irrelevant ones
• Preparser
  take a sequence of lexical items and try to identify various reliably determinable, small-scale structures
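The four modules above can be sketched as small functions with matching inputs and outputs. Everything concrete here is an assumption made for illustration: segments are paragraphs, the only lexical attribute is capitalization, relevance is a keyword test, and the small-scale structures are runs of capitalized words treated as candidate names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalItem:
    word: str
    is_capitalized: bool  # a single, hypothetical lexical attribute


def text_zoner(text: str) -> list[str]:
    """Turn a text into segments; here, paragraphs separated by blank lines."""
    return [seg.strip() for seg in text.split("\n\n") if seg.strip()]


def preprocessor(segment: str) -> list[list[LexicalItem]]:
    """Turn a segment into sentences, each a sequence of lexical items."""
    sentences = [s.split() for s in segment.split(".") if s.strip()]
    return [[LexicalItem(w, w[0].isupper()) for w in s] for s in sentences]


def sentence_filter(sentences: list[list[LexicalItem]], keyword: str) -> list[list[LexicalItem]]:
    """Keep only the sentences that mention the keyword."""
    return [s for s in sentences if any(item.word.lower() == keyword for item in s)]


def preparser(sentence: list[LexicalItem]) -> list[str]:
    """Group runs of capitalized words into candidate names."""
    names, run = [], []
    for item in sentence:
        if item.is_capitalized:
            run.append(item.word)
        else:
            if run:
                names.append(" ".join(run))
            run = []
    if run:
        names.append(" ".join(run))
    return names


text = "Acme Corp acquired Bolt Systems. Analysts were surprised.\n\nUnrelated footer"
for segment in text_zoner(text):
    for sentence in sentence_filter(preprocessor(segment), "acquired"):
        print(preparser(sentence))
```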

Modules (Continued)

• Parser
  input a sequence of lexical items and perhaps small-scale structures (phrases) and output a set of parse tree fragments, possibly complete
• Fragment Combiner
  turn a set of parse tree or logical form fragments into a parse tree or logical form for the whole sentence
• Semantic Interpreter
  generate a semantic structure or logical form from a parse tree or from parse tree fragments

Modules (Continued)

• Lexical Disambiguation
  turn a semantic structure with general or ambiguous predicates into a semantic structure with specific, unambiguous predicates
• Coreference Resolution, or Discourse Processing
  turn a tree-like structure into a network-like structure by identifying different descriptions of the same entity in different parts of the text
• Template Generator
  derive the templates from the semantic structures

Topics

1. Introduction to Information Retrieval and Extraction
2. Conventional Text-Retrieval Systems (Salton, Chapter 8)
   - Database Management and Information Retrieval
   - Text Retrieval Using Inverted Indexing Methods
   - Extensions of the Inverted Index Operations
   - Typical File Organization
   - Text-Scanning Systems
3. Automatic Indexing (Salton, Chapter 9)
   - Indexing Environment
   - Indexing Aims
   - Single-Term Indexing Theories
   - Term Relationships in Indexing
   - Term-Phrase Formulation
   - Thesaurus-Group Generation

Topics (Continued)

4. Advanced Information-Retrieval Models (Salton, Chapter 10)
   - The Vector Space Model
   - Automatic Document Classification
   - Probabilistic Retrieval Model
   - Extended Boolean Retrieval Model
5. File Structures (Frakes & Baeza-Yates, Chapters 3-5)
   - Inverted Files
   - Signature Files
   - PAT trees
6. Term and Query Operations (Frakes & Baeza-Yates, Chapters 7-9, 10)
   - Lexical Analysis and Stoplists
   - Stemming Algorithms
   - Thesaurus Construction
   - Relevance Feedback
7. Evaluation Metrics (Jones & Willett, Chapter 4)
   - The Pragmatics of Information Retrieval Experimentation, Revisited
   - The TREC Conferences

Topics (Continued)

8. IR on the World Wide Web (Cheong, Chapter 4)
   - Spiders for Indexing the Web
   - Web Indexing Spiders
   - WebCrawler: Finding What People Want
   - Lycos: Hunting WWW Information
   - Harvest: Gathering and Brokering Information
   - WebAnts: Hunting in Packs
   - Issues of Web Indexing
   - Spiders of the Future
9. Cross-Language Information Retrieval (Hsin-Hsi Chen)
10. Information Extraction (Jerry R. Hobbs)
   - What information extraction is
   - What is involved in building information extraction systems, and some how to?
   - What kinds of resources and tools are needed, and how to access them

Information Sources

• Books
  – Salton, G. (1989) Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Reading, MA: Addison-Wesley.
  – Frakes, W.B. and Baeza-Yates, R. (Eds.) (1992) Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, NJ: Prentice Hall.
  – Cheong, F. (1996) Internet Agents: Spiders, Wanderers, Brokers, and Bots. Indianapolis, IN: New Riders.
  – Sparck Jones, K. and Willett, P. (Eds.) (1997) Readings in Information Retrieval. CA: Morgan Kaufmann Publishers.

Information Sources (Continued)

• Conference Proceedings
  – ACM SIGIR Annual International Conference on Research and Development in Information Retrieval (1978-)
• Journals
  – ACM Transactions on Information Systems
  – Information Processing and Management (formerly Information Storage and Retrieval)
  – Journal of the American Society for Information Science (formerly American Documentation)
  – Journal of Documentation