Information Retrieval and Extraction 資訊檢索與擷取
1
Information Retrieval and Extraction 資訊檢索與擷取
Chia-Hui Chang, Assistant Professor
Dept. of Computer Science & Information Engineering
National Central University, Taiwan
2
Information Retrieval
A generic information retrieval system selects and returns to the user the desired documents from a large set of documents, in accordance with criteria specified by the user.
Functions
» document search: the selection of documents from an existing collection of documents
» document routing: the dissemination of incoming documents to appropriate users on the basis of user interest profiles
3
Detection Need
Definition: a set of criteria specified by the user which describes the kind of information desired.
» queries in document search task
» profiles in routing task
Forms
» keywords
» keywords with Boolean operators
» free text
» example documents
» ...
4
Example
<head> Tipster Topic Description
<num> Number: 033
<dom> Domain: Science and Technology
<title> Topic: Companies Capable of Producing Document Management
<des> Description:
Document must identify a company who has the capability to produce document management system by obtaining a turnkey-system or by obtaining and integrating the basic components.
<narr> Narrative:
To be relevant, the document must identify a turnkey document management system or components which could be integrated to form a document management system and the name of either the company developing the system or the company using the system. These components are: a computer, image scanner or optical character recognition system, and an information retrieval or text management system.
5
Example (Continued)
<con> Concepts:
1. document management, document processing, office automation electronic imaging
2. image scanner, optical character recognition (OCR)
3. text management, text retrieval, text database
4. optical disk
<fac> Factors:
<def> Definitions
Document Management-The creation, storage and retrieval of documents containing, text, images, and graphics.
Image Scanner-A device that converts a printed image into a video image, without recognizing the actual content of the text or pictures.
Optical Disk-A disk that is written and read by light, and are sometimes associated with the storage of digital images because of their high storage capacity.
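The topic above uses unclosed SGML-style tags, so each field runs until the next tag. A minimal sketch of splitting such a topic into fields follows; the function name `parse_topic` and the shortened sample text are illustrative assumptions, not part of any Tipster tooling.

```python
import re

# Shortened copy of the topic shown above (illustrative sample only).
TOPIC = """<head> Tipster Topic Description
<num> Number: 033
<dom> Domain: Science and Technology
<title> Topic: Companies Capable of Producing Document Management
<con> Concepts:
1. document management, document processing
2. image scanner, optical character recognition (OCR)
"""

def parse_topic(text):
    """Split a topic into {tag: content}; each field ends where the next tag begins."""
    fields = {}
    # A capturing group makes re.split keep the tag names in the result list:
    # [text before first tag, tag, content, tag, content, ...]
    parts = re.split(r"<(\w+)>", text)
    for tag, content in zip(parts[1::2], parts[2::2]):
        fields[tag] = " ".join(content.split())  # collapse line breaks and extra spaces
    return fields

topic = parse_topic(TOPIC)
```

Here `topic["dom"]` holds the domain line and `topic["con"]` the numbered concept list as one string, which a system could then feed into its own query construction.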
6
search vs. routing
The search process matches a single Detection Need against the stored corpus to return a subset of documents.
Routing matches a single document against a group of Profiles to determine which users are interested in the document.
Profiles are long-term expressions of user needs.
Search queries are ad hoc in nature.
A generic detection architecture can be used for both search and routing.
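To illustrate how one matching component might serve both tasks, here is a minimal sketch. The term-overlap `score` function, the threshold, and the sample data are all assumptions chosen for brevity; a real system would use a much richer representation.

```python
def terms(text):
    """Lowercased word set; a deliberately crude stand-in for a real representation."""
    return set(text.lower().split())

def score(need, document):
    """Fraction of the need's terms that appear in the document."""
    need_terms = terms(need)
    if not need_terms:
        return 0.0
    return len(need_terms & terms(document)) / len(need_terms)

def search(query, corpus, threshold=0.6):
    """One Detection Need against many stored documents: returns matching doc ids."""
    return [doc_id for doc_id, text in corpus.items() if score(query, text) >= threshold]

def route(document, profiles, threshold=0.6):
    """One incoming document against many Profiles: returns interested users."""
    return [user for user, profile in profiles.items() if score(profile, document) >= threshold]

corpus = {
    "d1": "optical character recognition for scanned documents",
    "d2": "stock market report for the quarter",
}
profiles = {
    "alice": "document scanner",
    "bob": "stock market",
}
hits = search("character recognition", corpus)
users = route("a new document scanner was announced", profiles)
```

Both functions call the same `score`; only the roles of the single item and the collection are swapped.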
7
Search
Retrieval of desired documents from an existing corpus.
Retrospective search is frequently interactive.
Methods
» index the corpus by keyword, stem and/or phrase
» apply statistical and/or learning techniques to better understand the content of the corpus
» analyze free-text Detection Needs to compare with the indexed corpus or a single document
» ...
8
Document Detection: Search
9
Document Detection: Search (Continued)
Document Corpus
» the content of the corpus may have a significant effect on performance in some applications
Preprocessing of Document Corpus
» stemming
» a list of stop words
» phrases, multi-term items
» ...
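The preprocessing steps listed above can be sketched as follows. The stop-word list and the suffix rules in `naive_stem` are toy assumptions (this is not the Porter stemmer), and adjacent stem pairs stand in for multi-term items.

```python
import re

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "by", "in", "is", "of", "or", "that", "the", "to"}

def naive_stem(word):
    """Strip a few common English suffixes, keeping at least three characters."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

def phrases(stems):
    """Adjacent stem pairs as a simple approximation of multi-term items."""
    return list(zip(stems, stems[1:]))

stems = preprocess("The scanners are converting printed images")
pairs = phrases(stems)
```

Stems need not be real words ("images" becomes "imag" here); what matters is that variants of a word map to the same index entry.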
10
Document Detection: Search (Continued)
Building Index from Stems
» key place for optimizing run-time performance
» cost to build the index for a large corpus
Document Index
» a list of terms, stems, phrases, etc.
» frequency of terms in the document and corpus
» frequency of the co-occurrence of terms within the corpus
» index may be as large as the original document corpus
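A minimal sketch of an index holding the three kinds of statistics listed above. The layout (an inverted index plus two counters) and the name `build_index` are assumptions; co-occurrence is counted here per document, which is only one of several possible definitions.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_index(docs):
    """docs maps doc id -> list of stems (already preprocessed)."""
    postings = defaultdict(dict)   # stem -> {doc id: frequency in that document}
    corpus_freq = Counter()        # stem -> frequency in the whole corpus
    cooccurrence = Counter()       # (stem_a, stem_b) -> number of documents containing both
    for doc_id, stems in docs.items():
        counts = Counter(stems)
        for stem, n in counts.items():
            postings[stem][doc_id] = n
        corpus_freq.update(counts)
        # Sorting gives each unordered pair a single canonical key.
        for pair in combinations(sorted(counts), 2):
            cooccurrence[pair] += 1
    return postings, corpus_freq, cooccurrence

docs = {
    "d1": ["image", "scanner", "image"],
    "d2": ["text", "retrieval", "scanner"],
}
postings, corpus_freq, cooccurrence = build_index(docs)
```

The co-occurrence table grows with the square of the vocabulary per document, which is one reason the index may approach the size of the corpus itself.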
11
Document Detection: Search (Continued)
Detection Need
» the user’s criteria for a relevant document
Convert Detection Need to System Specific Query
» first transformed into a detection query, and then a retrieval query
» detection query: specific to the retrieval engine, but independent of the corpus
» retrieval query: specific to the retrieval engine, and to the corpus
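One way to picture the two-stage conversion is sketched below. The slide does not say what each stage contains, so this is an assumed reading: the detection query is normalized terms, and the retrieval query adds weights from corpus statistics (inverse document frequency). The function names and the `doc_freq` numbers are hypothetical.

```python
import math

def to_detection_query(need_text):
    """Engine-specific but corpus-independent: here, sorted unique lowercased terms."""
    return sorted(set(need_text.lower().split()))

def to_retrieval_query(detection_query, doc_freq, n_docs):
    """Corpus-specific: drop terms the corpus lacks and weight the rest by IDF."""
    return {
        term: math.log(n_docs / doc_freq[term])
        for term in detection_query
        if doc_freq.get(term, 0) > 0
    }

# Hypothetical corpus statistics: term -> number of documents containing it.
doc_freq = {"scanner": 2, "optical": 1, "report": 4}

detection_query = to_detection_query("Optical scanner vendors")
retrieval_query = to_retrieval_query(detection_query, doc_freq, n_docs=4)
```

The same detection query could be reused against a different corpus; only the second step would need to be rerun.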
12
Document Detection: Search (Continued)
Compare Query with Index
Resultant Rank Ordered List of Documents
» Return the top ‘N’ documents
» Rank the list of relevant documents from the most relevant to the query to the least relevant
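Comparing a query with the index and returning the top N documents can be sketched as below. The scoring rule (sum of query weight times term frequency) is an assumption picked for simplicity, and the postings data is made up.

```python
import heapq
from collections import defaultdict

def rank(query_weights, postings, top_n=2):
    """Score each document by summing weight * term frequency, then keep the best top_n."""
    scores = defaultdict(float)
    for term, weight in query_weights.items():
        for doc_id, tf in postings.get(term, {}).items():
            scores[doc_id] += weight * tf
    # nlargest returns (doc id, score) pairs ordered from most to least relevant.
    return heapq.nlargest(top_n, scores.items(), key=lambda item: item[1])

# Made-up inverted index: term -> {doc id: term frequency}.
postings = {
    "scanner": {"d1": 2, "d2": 1, "d3": 1},
    "optical": {"d2": 3},
}
ranked = rank({"scanner": 1.0, "optical": 2.0}, postings, top_n=2)
```

Only documents sharing at least one term with the query are ever scored, which is what makes the index the key place for run-time optimization.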
13
Routing
14
Routing (Continued)
Profile of Multiple Detection Needs
» A Profile is a group of individual Detection Needs that describes a user’s areas of interest.
» All Profiles will be compared to each incoming document (via the Profile index).
» If a document matches a Profile, the user is notified about the existence of a relevant document.
15
Routing (Continued)
Convert Detection Need to System Specific Query
Building Index from Queries
» similar to building the corpus index for searching
» the quantity of source data (Profiles) is usually much less than a document corpus
» Profiles may have more specific, structured data in the form of SGML tagged fields
16
Routing (Continued)
Routing Profile Index
» The index will be system specific and will make use of all the preprocessing techniques employed by a particular detection system.
Document to be routed
» A stream of incoming documents is handled one at a time to determine where each should be directed.
» Routing implementation may handle multiple document streams and multiple Profiles.
17
Routing (Continued)
Preprocessing of Document
» A document is preprocessed in the same manner that a query would be set up in a search
» The document and query roles are reversed compared with the search process
Compare Document with Index
» Identify which Profiles are relevant to the document
» Given a document, which of the indexed profiles match it?
18
Routing (Continued)
Resultant List of Profiles
» The list of Profiles identifies which users should receive the document
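The routing flow (index the Profiles, then look up each incoming document) can be sketched as follows. The matching rule, that every term of at least one Detection Need must occur in the document, is an assumption for illustration, as are the function names and sample Profiles.

```python
from collections import defaultdict

def build_profile_index(profiles):
    """profiles maps user -> list of Detection Needs, each a list of terms."""
    index = defaultdict(set)   # term -> {(user, need number)}
    need_size = {}             # (user, need number) -> number of distinct terms
    for user, needs in profiles.items():
        for i, need in enumerate(needs):
            distinct = set(need)
            need_size[(user, i)] = len(distinct)
            for term in distinct:
                index[term].add((user, i))
    return index, need_size

def route(doc_terms, index, need_size):
    """Return users with at least one Detection Need whose terms all occur in the document."""
    matched = defaultdict(int)
    for term in set(doc_terms):
        for key in index.get(term, ()):
            matched[key] += 1
    return sorted({key[0] for key, n in matched.items() if n == need_size[key]})

profiles = {
    "alice": [["optical", "disk"], ["text", "retrieval"]],
    "bob": [["image", "scanner"]],
}
index, need_size = build_profile_index(profiles)

# Incoming documents are handled one at a time.
stream = [
    ["new", "optical", "disk", "drive"],
    ["image", "scanner", "text"],
]
notified = [route(doc, index, need_size) for doc in stream]
```

Here the document plays the role a query plays in search: its terms are looked up in the Profile index rather than the other way round.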
19
Summary
Generate a representation of the meaning or content of each object based on its description.
Generate a representation of the meaning of the information need.
Compare these two representations to select those objects that are most likely to match the information need.
20
[Figure: Basic Architecture of an Information Retrieval System. Documents are mapped to a Document Representation, Queries to a Query Representation, and the two representations meet in Comparison.]
21
Research Issues
Given a set of descriptions for objects in the collection and a description of an information need, we must consider:
Issue 1
» What makes a good document representation?
» How can a representation be generated from a description of the document?
» What are retrievable units and how are they organized?
22
Research Issues (Continued)
Issue 2
How can we represent the information need and how can we acquire this representation?
» from a description of the information need or
» through interaction with the user?
Issue 3
How can we compare representations to judge likelihood that a document matches an information need?
Issue 4
How can we evaluate the effectiveness of the retrieval process?
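The slides leave Issue 4 open. As one common starting point, precision and recall compare the retrieved set with a set of documents judged relevant. The sketch below assumes such relevance judgments are available; the document ids are made up.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved documents that are relevant.
    Recall: share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up example: four documents returned, three judged relevant overall.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
```

Two of the four returned documents are relevant, and two of the three relevant documents were found, so the two measures capture different kinds of error.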
23
Information Extraction
Generic Information Extraction System
An information extraction system is a cascade of transducers or modules that at each step add structure and often lose information, hopefully irrelevant, by applying rules that are acquired manually and/or automatically.
24
Information Extraction (Continued)
» What are the transducers or modules?
» What are their input and output?
» What structure is added?
» What information is lost?
» What is the form of the rules?
» How are the rules applied?
» How are the rules acquired?
25
Example: Parser
Transducer: parser
Input: the sequence of words or lexical items
Output: a parse tree
Information added: predicate-argument and modification relations
Information lost: none
Rule form: unification grammars
Application method: chart parser
Acquisition method: manually
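The description above follows a fixed set of attributes, so it can be recorded as a small data structure. The class name `TransducerSpec` and its field names are assumptions made for this sketch; the values are the ones given for the parser.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class TransducerSpec:
    """One record per module, answering the questions asked of every transducer."""
    name: str
    input: str
    output: str
    information_added: str
    information_lost: str
    rule_form: str
    application_method: str
    acquisition_method: str

PARSER = TransducerSpec(
    name="parser",
    input="the sequence of words or lexical items",
    output="a parse tree",
    information_added="predicate-argument and modification relations",
    information_lost="none",
    rule_form="unification grammars",
    application_method="chart parser",
    acquisition_method="manually",
)

# The attribute names double as a checklist when describing another module.
checklist = [f.name for f in fields(TransducerSpec)]
```

Filling in one such record for each module in a system is a compact way to compare designs.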
26
Modules
Text Zoner: turn a text into a set of text segments
Preprocessor: turn a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes
Filter: turn a set of sentences into a smaller set of sentences by filtering out the irrelevant ones
Preparser: take a sequence of lexical items and try to identify various reliably determinable, small-scale structures
27
Modules (Continued)
Parser: input a sequence of lexical items and perhaps small-scale structures (phrases) and output a set of parse tree fragments, possibly complete
Fragment Combiner: turn a set of parse tree or logical form fragments into a parse tree or logical form for the whole sentence
Semantic Interpreter: generate a semantic structure or logical form from a parse tree or from parse tree fragments
28
Modules (Continued)
Lexical Disambiguation: turn a semantic structure with general or ambiguous predicates into a semantic structure with specific, unambiguous predicates
Coreference Resolution, or Discourse Processing: turn a tree-like structure into a network-like structure by identifying different descriptions of the same entity in different parts of the text
Template Generator: derive the templates from the semantic structures
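The cascade idea can be sketched by chaining the first three modules as plain functions. Everything below is a toy stand-in under stated assumptions: segments are paragraphs, lexical items are bare lowercased words without attributes, and relevance is a keyword test. The parsing and semantic modules are omitted.

```python
import re
from functools import reduce

def text_zoner(text):
    """Text -> list of text segments (here: paragraphs separated by blank lines)."""
    return [seg.strip() for seg in re.split(r"\n\s*\n", text) if seg.strip()]

def preprocessor(segments):
    """Segments -> list of sentences, each a list of lowercased word tokens."""
    sentences = []
    for seg in segments:
        for sent in re.split(r"(?<=[.!?])\s+", seg):
            tokens = re.findall(r"[A-Za-z0-9]+", sent.lower())
            if tokens:
                sentences.append(tokens)
    return sentences

def make_filter(keywords):
    """Build a filter module that keeps sentences containing any keyword."""
    def sentence_filter(sentences):
        return [s for s in sentences if keywords & set(s)]
    return sentence_filter

def run_cascade(modules, data):
    """Feed the output of each module into the next one."""
    return reduce(lambda value, module: module(value), modules, data)

text = "Acme builds scanners. The weather was mild.\n\nBeta sells optical disks."
cascade = [text_zoner, preprocessor, make_filter({"scanners", "disks"})]
relevant = run_cascade(cascade, text)
```

Each step adds structure (segments, then sentences, then tokens) and discards something (layout, case and punctuation, off-topic sentences), which is the trade-off the generic definition describes.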