2002.10.03 - SLIDE 1 - IS 202 – FALL 2002
Prof. Ray Larson & Prof. Marc Davis
UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm
Fall 2002
http://www.sims.berkeley.edu/academics/courses/is202/f02/
SIMS 202: Information Organization and Retrieval
Lecture 11: Thesaurus Design
SLIDE 2
Lecture Overview
• Review
  – Name Authority Control
  – Types of Controlled Vocabularies
• Thesaurus Design and Development
  – Developing Controlled Vocabularies
  – Thesaurus Design
  – Steps in Thesaurus Development
  – Indexing
SLIDE 4
Types of Indexing Languages
• Uncontrolled keyword indexing
• Indexing languages
  – Controlled, but not structured
• Thesauri
  – Controlled and structured
• Classification systems
  – Controlled, structured, and coded
• Faceted classification systems
SLIDE 5
Uses of Controlled Vocabularies
• Library subject headings, classification and authority files
• Commercial journal indexing services and databases
• Yahoo and other web classification schemes
• Online and manual systems within organizations
  – SunSolve
  – MacArthur
SLIDE 6
Indexing Languages
• An index is a systematic guide designed to indicate topics or features of documents in order to facilitate retrieval of documents or parts of documents
• An indexing language is the set of terms used in an index to represent topics or features of documents, and the rules for combining or using those terms
SLIDE 7
Classification Systems
• A classification system is an indexing language often based on a broad ordering of topical areas
• Thesauri and classification systems both use this broad ordering and maintain a structure of broader, narrower, and related topics
• Classification schemes commonly use a coded notation for representing a topic and its place in relation to other terms
SLIDE 8
Automatic Indexing and Classification
• Automatic indexing is typically the simple derivation of keywords from a document, providing access to all of those words
• More complex automatic indexing systems attempt to select controlled vocabulary terms based on terms in the document
• Automatic classification attempts to automatically group similar documents using either
  – A fully automatic clustering method
  – An established classification scheme and a set of documents already indexed by that scheme
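The simplest case above (derive keywords, provide access to all of them) can be sketched as an inverted index. This is a minimal illustration, not a method from the lecture: the stop list, the tokenizer, and the sample documents are all assumptions.

```python
import re
from collections import defaultdict

# Assumed stop list; a real system would use a much longer one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}

def derive_keywords(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_inverted_index(docs):
    """Map every derived keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in derive_keywords(text):
            index[word].add(doc_id)
    return index

# Hypothetical sample documents.
docs = {
    "d1": "Design of a thesaurus for retrieval",
    "d2": "Automatic indexing and retrieval of documents",
}
index = build_inverted_index(docs)
print(sorted(index["retrieval"]))  # ['d1', 'd2']
```

Nothing here is controlled: every non-stop word becomes an access point, which is what distinguishes this from the vocabulary-based approaches in the rest of the lecture.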
SLIDE 9
Clustering
Agglomerative methods: polythetic, exclusive or overlapping, unordered; clusters are order-dependent
[Figure: documents (Doc) to be clustered]
1. Select initial centers (i.e., seed the space)
2. Assign docs to highest matching centers and compute centroids
3. Reassign all documents to centroid(s)
Rocchio’s method (yes, the same Rocchio as relevance feedback)
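The three steps can be sketched as a single seed-assign-reassign pass. This is an illustrative simplification rather than Rocchio's published procedure: it assumes term-count vectors, cosine similarity as the matching function, and exclusive (non-overlapping) assignment, and the sample documents are invented.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term -> weight mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Average the term weights of a group of document vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()}

def assign(docs, centers):
    """Return, for each document, the index of its best-matching center."""
    return [max(range(len(centers)), key=lambda i: cosine(d, centers[i])) for d in docs]

def cluster(docs, seed_ids):
    centers = [docs[i] for i in seed_ids]          # 1. seed the space
    labels = assign(docs, centers)                 # 2. assign docs to best center ...
    centers = [
        centroid([d for d, lab in zip(docs, labels) if lab == i]) or centers[i]
        for i in range(len(centers))
    ]                                              #    ... and compute centroids
    return assign(docs, centers)                   # 3. reassign all docs to centroids

# Hypothetical documents as bags of words.
docs = [
    Counter("thesaurus term descriptor".split()),
    Counter("thesaurus descriptor indexing".split()),
    Counter("cluster centroid document".split()),
    Counter("document cluster similarity".split()),
]
print(cluster(docs, seed_ids=[0, 2]))  # [0, 0, 1, 1]
```

The result depends on which documents are chosen as seeds, which is one way to read the slide's remark that these clusters are order-dependent.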
SLIDE 10
Automatic Class Assignment
[Figure: documents (Doc) submitted to a search engine]
1. Create pseudo-documents representing intellectually derived classes
2. Search using document contents
3. Obtain ranked list
4. Assign document to the N categories ranked over threshold, OR assign to top-ranked category
Automatic class assignment: polythetic, exclusive or overlapping, usually ordered; clusters are order-independent, usually based on an intellectually derived scheme
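A rough sketch of the four steps follows. The class names, the pseudo-document contents, the cosine ranking, and the threshold value are all assumptions chosen for illustration; the slide does not specify a particular search engine or scoring function.

```python
import math
from collections import Counter

def vectorize(text):
    """Turn text into a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Hypothetical pseudo-documents, one per intellectually derived class.
PSEUDO_DOCS = {
    "indexing": vectorize("index indexing descriptor term thesaurus vocabulary"),
    "clustering": vectorize("cluster clustering centroid similarity group"),
}

def rank_classes(text):
    """2-3. Use the document contents as the query and return a ranked list."""
    query = vectorize(text)
    scores = [(cosine(query, pd), name) for name, pd in PSEUDO_DOCS.items()]
    return sorted(scores, reverse=True)

def assign_classes(text, threshold=0.2, top_only=False):
    """4. Assign every class scoring over the threshold, or only the top-ranked one."""
    ranked = rank_classes(text)
    if top_only:
        return [ranked[0][1]]
    return [name for score, name in ranked if score >= threshold]

print(assign_classes("thesaurus term selection for indexing"))  # ['indexing']
```

With the threshold variant a document can land in several classes (overlapping assignment); with `top_only=True` the assignment is exclusive.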
SLIDE 12
Developing Controlled Vocabularies
• Origins and uses of controlled vocabularies for information retrieval
• Types of indexing languages, thesauri and classification systems
• Process of design and development of thesauri
SLIDE 13
Origins
• Very early history of content representation
  – Sumerian tokens and “envelopes”
  – Alexandria - pinakes
  – Indices
SLIDE 14
Origins
• Biblical Indexes and Concordances
  – Hugo de St. Caro, 1247 A.D.: 500 monks, KWOC
  – Book indexes (Nuremberg Chronicle)
• Library Catalogs
• Journal Indexes
• “Information Explosion” following WWII
  – Cranfield Studies of indexing languages and information retrieval
  – Development of bibliographic databases
    • Index Medicus: production and MEDLARS searching
SLIDE 15
Origins
• Communication theory revisited
• Problems with transmission of meaning
[Figure: communication model. Source → Encoding → Channel (subject to Noise) → Decoding → Destination, with a Message on each side of the channel]
[Figure: the same model applied to retrieval. Source → Encoding (writing/indexing) → Storage → Decoding (retrieval/reading) → Destination, with a Message on each side of the store]
SLIDE 16
Structure of an IR System
[Figure: Information Storage and Retrieval System, adapted from Soergel, p. 19]
• Search line: interest profiles & queries; formulating query in terms of descriptors; storage of profiles; Store 1: profiles/search requests
• Storage line: documents & data; indexing (descriptive and subject); storage of documents; Store 2: document representations
• Rules of the game = rules for subject indexing + thesaurus (which consists of lead-in vocabulary and indexing language)
• Comparison/matching yields potentially relevant documents
SLIDE 17
What is a Controlled Vocabulary?
• “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)
• Similarly, there are too many ways of expressing or explaining the topic of a document
• A controlled vocabulary is a set of rules for topic identification and indexing, plus a THESAURUS, which consists of a “lead-in vocabulary” and a limited and selective “indexing language”, sometimes with special coding or structures
SLIDE 19
Thesauri
• A Thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among synonymous, equivalent, broader, narrower and other related terms
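One way to picture this definition is as a record per descriptor with a set of links for each relationship type. The sketch below is only an illustration: the class name `Descriptor`, the field names, and the sample terms are assumptions, and the UF/BT/NT/RT comments use the conventional thesaurus abbreviations.

```python
from dataclasses import dataclass, field

@dataclass
class Descriptor:
    term: str                                      # the preferred term
    used_for: set = field(default_factory=set)     # UF: synonymous/equivalent entry terms
    broader: set = field(default_factory=set)      # BT: broader terms
    narrower: set = field(default_factory=set)     # NT: narrower terms
    related: set = field(default_factory=set)      # RT: other related terms

# Hypothetical entries, invented for illustration.
thesaurus = {
    "Vocabularies": Descriptor("Vocabularies", narrower={"Thesauri"}),
    "Thesauri": Descriptor(
        "Thesauri",
        used_for={"Thesauruses"},
        broader={"Vocabularies"},
        related={"Classification systems"},
    ),
    "Classification systems": Descriptor("Classification systems", related={"Thesauri"}),
}

entry = thesaurus["Thesauri"]
print(entry.term, sorted(entry.broader), sorted(entry.used_for))
```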
SLIDE 20
Thesauri (cont.)
• National and International Standards for Thesauri
  – ANSI/NISO Z39.19-1994: American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
  – ANSI/NISO Draft Standard Z39.4-199x: American National Standard Guidelines for Indexes in Information Retrieval
  – ISO 2788: Documentation. Guidelines for the establishment and development of monolingual thesauri
  – ISO 5964: Documentation. Guidelines for the establishment and development of multilingual thesauri
SLIDE 21
Thesauri (cont.)
• Examples
  – The ERIC Thesaurus of Descriptors
  – The Medical Subject Headings (MeSH) of the National Library of Medicine
  – The Art and Architecture Thesaurus
SLIDE 22
Why Develop a Thesaurus?
• To provide a conceptual structure or “space” for a body of information
  – To make it possible to adequately describe the topical contents of informational objects at an appropriate level of generality or specificity
  – To provide enhanced search capabilities and to improve the effectiveness of searching (i.e., to retrieve most of the relevant material without too much irrelevant material)
SLIDE 23
Why Develop a Thesaurus?
• To provide vocabulary (or terminological) control
  – When there are several possible terms designating a single concept, the thesaurus should lead the indexer or searcher to the appropriate concept, regardless of the terms they start with
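Vocabulary control of this kind can be sketched as a lookup from any entry term to a single preferred term. The mapping and the function name `preferred_term` below are hypothetical; a real lead-in vocabulary would be far larger and maintained alongside the thesaurus.

```python
# Hypothetical lead-in vocabulary: entry term -> preferred term (descriptor).
LEAD_IN = {
    "cars": "Automobiles",
    "autos": "Automobiles",
    "motor cars": "Automobiles",
    "automobiles": "Automobiles",
}

def preferred_term(entry):
    """Lead the indexer or searcher to the descriptor, whatever term they start with.

    Returns None when the entry term is not covered, which signals a gap
    in the lead-in vocabulary rather than an error.
    """
    return LEAD_IN.get(entry.strip().lower())

print(preferred_term("Autos"))       # Automobiles
print(preferred_term("motor cars"))  # Automobiles
print(preferred_term("trucks"))      # None
```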
SLIDE 25
Preliminary Considerations
• What is used now?
  – Continue using an existing thesaurus?
  – Ad hoc modification of existing thesaurus?
  – Develop a new well-structured thesaurus?
• What is the scope and complexity of the subject field?
• What kind of retrieval objects or data will be dealt with?
• How exhaustive and specific is the desired description of objects?
SLIDE 26
Preliminary Considerations
• The scope and complexity of the field will provide some indication of the scope and complexity of the thesaurus
  – It is better to plan for a larger and more comprehensive system than for a smaller system that will rapidly become inadequate as the database grows
• Development of a good thesaurus requires a major intellectual effort as well as clerical operations like data entry and production of sorted lists
SLIDE 27
Development of a Thesaurus
• Term selection
• Merging and development of concept classes
• Definition of broad subject fields and subfields
• Development of classificatory structure
• Review, testing, application, revision
SLIDE 28
Flow of Work in Thesaurus Construction
[Flowchart, based on Soergel, pp. 327-333]
1. Select sources
2. Assign codes
3. Select terms
4. Record selected terms
5. Sort terms
6. Merge identical terms
7. Merge terms in same concept class
8. Define broad subject fields
9. Sort terms into broad subject fields
10. Define subfields within one subject field
11. Work out detailed structure of the subject field
12. Select preferred terms
13. All subfields of broad subject finished? If no, continue with the next subfield
14. All broad subjects finished? If no, continue with the next broad subject
15. Improve class structure
16. Print classified index and review
17. Discuss with experts and users
18. Select descriptors and checklist items
19. Assign notation
20. Produce full thesaurus and check references
21. Review and test
22. Many modifications? If yes, revise as needed
SLIDE 29
1. Term Selection
• Select sources for the collection of terms
  – Prearranged sources
  – Open-ended sources
• Assign codes to each source
• Selection of terms
  – For part of prearranged and for all open-ended sources
• Enter terms into database with all information
SLIDE 30
1.1 Kinds of Sources
• Prearranged sources
  – Existing descriptor lists, classification schemes, thesauri
    • This includes universal schemes like DDC or LCSH
  – Nomenclatures of single disciplines
  – Treatises on the terminology of a field
  – Encyclopedias, lexica, dictionaries and glossaries
  – Tables of contents of textbooks and handbooks
  – Indexes of journals or abstracting journals
  – Indexes of other publications in the field
SLIDE 31
1.1 Kinds of Sources
• Open-ended sources
  – Lists of search requests or interest profiles
  – Description of projects/activities to be served by the information retrieval system
  – Discussion with specialists in the field
  – Sample of documents in the field
    • Ask users why and how these documents relate to the field
    • Have documents indexed by experts in the field
  – Lists of titles of documents in the field
  – Abstracts and reviews of documents
  – Your own knowledge
SLIDE 32
Selection of Sources
• Prearranged sources require less effort in gathering the material, and may already indicate some relationships between terms and concepts, and among the terms themselves
• Open-ended sources can reflect current terminology and may provide more complete coverage
• Choose a set of sources that are current, as complete as possible, and considered authoritative
SLIDE 33
Selection of Sources
• Each selected source is assigned an ID for tracking its use in the development of the thesaurus
  – Useful when making decisions about which terms to prefer
  – Useful for backtracking when questions arise (where did this come from?)
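Source tracking can be sketched by storing, for every candidate term, the set of source IDs it was taken from. The source codes, descriptions, and helper names below are hypothetical; counting sources is just one plausible input when deciding which terms to prefer.

```python
from collections import defaultdict

# Hypothetical source register: source ID -> description.
SOURCES = {
    "S1": "Existing descriptor list",
    "S2": "Sample of search requests",
}

term_sources = defaultdict(set)  # normalized term -> IDs of the sources it came from

def record_term(term, source_id):
    """Record a selected term together with the code of its source."""
    if source_id not in SOURCES:
        raise KeyError(f"unknown source: {source_id}")
    term_sources[term.lower()].add(source_id)

def where_from(term):
    """Backtracking: which sources contributed this term?"""
    return sorted(term_sources.get(term.lower(), ()))

def by_support():
    """Terms ordered by the number of sources that use them (most first)."""
    return sorted(term_sources, key=lambda t: (-len(term_sources[t]), t))

record_term("Thesauri", "S1")
record_term("thesauri", "S2")
record_term("Word lists", "S2")

print(where_from("Thesauri"))  # ['S1', 'S2']
print(by_support())            # ['thesauri', 'word lists']
```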
SLIDE 34
Selection of Terms
• Terms can be transferred directly from prearranged sources to the recording medium (cards or database)
  – You have to decide which terms and references to include, or whether to take the whole source
SLIDE 35
Selection of Terms
• In open-ended sources you read through the source and pick out terms (i.e. words and phrases) that might be useful in retrieval or as references to other terms
• Alternatively, use keyword and phrase extraction software to create lists of terms and select from those
• Transfer selected terms to the recording medium (cards or database)
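The software-assisted alternative can be approximated with a frequency count of words and adjacent-word phrases, from which a person then selects. This is a deliberately crude sketch: the stop list, the two-word phrase limit, and the `min_count` cutoff are assumptions, and real extraction tools do considerably more.

```python
import re
from collections import Counter

# Assumed stop list; phrases containing a stop word are skipped.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "are"}

def candidate_terms(text, min_count=2):
    """Count single words and two-word phrases, sentence by sentence."""
    counts = Counter()
    for sentence in re.split(r"[.!?]", text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        counts.update(w for w in words if w not in STOP_WORDS)
        counts.update(
            f"{a} {b}"
            for a, b in zip(words, words[1:])
            if a not in STOP_WORDS and b not in STOP_WORDS
        )
    return {term: n for term, n in counts.items() if n >= min_count}

# Hypothetical source text.
text = (
    "Controlled vocabulary design. A controlled vocabulary is a set of terms. "
    "Terms in the vocabulary are selected."
)
print(candidate_terms(text))
```

The output is only a candidate list; deciding which entries are useful for retrieval remains the human selection step the slide describes.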
SLIDE 37
2. Merging and Development of Concept Classes
• Sort Term DB into alphabetical order
• First Round
  – Merge information for identical terms, possibly pulling info from additional sources
• Second Round
  – Merge synonyms or terms in the same concept class
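The two rounds can be sketched over a list of (term, source code) records. The records, the `SAME_CONCEPT` table, and the function names are assumptions for illustration; in practice the second round is an intellectual decision by the thesaurus builder, represented here by a hand-written mapping.

```python
from collections import defaultdict

# Hypothetical (term, source code) records from term selection.
records = [
    ("Thesauri", "S1"),
    ("thesauri", "S2"),
    ("Thesauruses", "S2"),
    ("Word lists", "S3"),
]

def merge_identical(records):
    """First round: sort alphabetically and merge information for identical terms."""
    merged = defaultdict(set)
    for term, source in sorted(records, key=lambda r: r[0].lower()):
        merged[term.lower()].add(source)
    return dict(merged)

# Editor-supplied decisions: term -> the concept class it belongs to.
SAME_CONCEPT = {"thesauruses": "thesauri"}

def merge_concept_classes(merged, same_concept):
    """Second round: pool synonyms and other terms in the same concept class."""
    classes = defaultdict(lambda: {"terms": set(), "sources": set()})
    for term, sources in merged.items():
        key = same_concept.get(term, term)
        classes[key]["terms"].add(term)
        classes[key]["sources"] |= sources
    return dict(classes)

round1 = merge_identical(records)
round2 = merge_concept_classes(round1, SAME_CONCEPT)
print(sorted(round2["thesauri"]["terms"]))  # ['thesauri', 'thesauruses']
```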
SLIDE 38
3. Definition of Broad Subject Fields and Subfields
• Define broad subject fields and sort terms into these broad fields
• Define subfields within each broad field and sort terms into these subfields
• Work out the detailed structure
  – Select preferred terms
  – Merge information for terms in the same concept class
• Repeat these steps
  – For each subfield within a broad field
  – And for each broad field
  – Until all terms have been consolidated and preferred terms selected
SLIDE 39
4. Development of Classificatory Structure
• Produce preliminary version of classified index and update the working database
• Improve classificatory structure
• Reality check
  – Produce and distribute a version of the classified index
  – Distribute to users/experts
SLIDE 41
Review
• Discuss classified index with users/experts
  – Select descriptors and checklist descriptors
• Assign notational symbols
• Produce main thesaurus and indexes
SLIDE 42
Review (cont.)
• Check cross references and insert where needed
• Produce test version
• Test by indexing
• Modify as needed
• Produce production version
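The cross-reference check in the first step lends itself to automation. The sketch below assumes that BT/NT links should mirror each other and that RT links should be symmetric; the data layout, the sample entries, and the function name are hypothetical.

```python
# Hypothetical thesaurus entries: term -> links by relationship type.
thesaurus = {
    "Vocabularies": {"BT": [], "NT": ["Thesauri"], "RT": []},
    "Thesauri": {"BT": ["Vocabularies"], "NT": [], "RT": ["Classification systems"]},
    "Classification systems": {"BT": [], "NT": [], "RT": []},
}

# Each relationship and the reference expected in the opposite direction.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT"}

def missing_references(thesaurus):
    """List references that point nowhere or lack their reciprocal."""
    problems = []
    for term, links in thesaurus.items():
        for rel, targets in links.items():
            for target in targets:
                if target not in thesaurus:
                    problems.append((term, rel, target, "target not in thesaurus"))
                elif term not in thesaurus[target][RECIPROCAL[rel]]:
                    problems.append((target, RECIPROCAL[rel], term, "reciprocal reference missing"))
    return problems

for problem in missing_references(thesaurus):
    print(problem)
# ('Classification systems', 'RT', 'Thesauri', 'reciprocal reference missing')
```

Each reported tuple names the entry where a reference should be inserted, the relationship type, and the term it should point to.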
SLIDE 43
Testing a Thesaurus
• Assign descriptors to a sample set of NEW documents (use enough to get an idea of any gaps in the thesaurus)
• Test retrieval using sample questions and see how effectively the thesaurus maps them to the appropriate descriptors
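The retrieval test can be sketched as a comparison between where the thesaurus leads a sample question and the descriptor an expert expected. The lead-in table, the sample questions, and the `evaluate_mapping` helper are invented for illustration; the hit rate is one possible summary, not a measure prescribed by the lecture.

```python
# Hypothetical lead-in vocabulary: entry term -> descriptor.
LEAD_IN = {"cars": "Automobiles", "autos": "Automobiles", "lorries": "Trucks"}

# Hypothetical sample questions with the descriptor an expert would expect.
SAMPLE_QUESTIONS = [
    ("cars", "Automobiles"),
    ("lorries", "Trucks"),
    ("vans", "Vans"),
]

def evaluate_mapping(lead_in, samples):
    """Return the share of questions mapped correctly and the list of gaps."""
    gaps = [(q, expected) for q, expected in samples if lead_in.get(q) != expected]
    hit_rate = 1 - len(gaps) / len(samples)
    return hit_rate, gaps

hit_rate, gaps = evaluate_mapping(LEAD_IN, SAMPLE_QUESTIONS)
print(gaps)  # [('vans', 'Vans')]
```

The `gaps` list is the useful output: each entry is a question the thesaurus could not route, and so a candidate for the next revision.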
SLIDE 45
The Indexing Process
• Concept identification
• Term selection (via thesaurus)
• Term assignment
SLIDE 46
Application: The Indexing Process (Manual)
[Flowchart, adapted from ISO 5963, p. 5. Boxes and decision points, connected by YES/NO branches:]
• Start
• Examine document and identify significant concepts
• Consider first concept
• Establish term denoting concept
• Decision: Does thesaurus contain term for concept?
• Decision: Preferred term?
• Consider preferred term / Select preferred term
• Decision: Is term suitable?
• Consider any associated terms in thesaurus (NT, BT)
• Decision: Would concept be better represented by one of these terms?
• Select alternative term to represent concept / Prefer alternative term(s)
• Decision: Can concept be expressed combining terms?
• Consider each of these terms
• Admit new term into thesaurus
• Assign terms to document
• Decision: Is there another concept?
• End
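A much-simplified reading of this process can be written as a loop over the concepts found in a document. The sketch keeps only three of the decisions (does the thesaurus contain a term, can the concept be expressed by combining terms, admit a new term) and leaves out the suitability check and the NT/BT review, which need human judgment. The function name, the lead-in table, and the sample concepts are all assumptions.

```python
def index_document(concepts, lead_in):
    """Assign descriptors to a document, one concept at a time.

    lead_in maps entry terms (lowercase) to preferred terms.
    Returns (assigned descriptors, candidate new terms).
    """
    assigned, new_terms = [], []
    for concept in concepts:                     # is there another concept?
        term = concept.lower()                   # establish term denoting concept
        if term in lead_in:                      # thesaurus contains a term for it
            assigned.append(lead_in[term])       # select the preferred term
            continue
        parts = [lead_in.get(word) for word in term.split()]
        if len(parts) > 1 and all(parts):        # concept expressible by combining terms
            assigned.extend(parts)
        else:
            new_terms.append(concept)            # candidate for admission as a new term
    return assigned, new_terms

# Hypothetical lead-in vocabulary and document concepts.
lead_in = {"lorries": "Trucks", "cars": "Automobiles", "safety": "Safety"}
assigned, new_terms = index_document(["lorries", "cars safety", "airbags"], lead_in)
print(assigned)   # ['Trucks', 'Automobiles', 'Safety']
print(new_terms)  # ['airbags']
```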
SLIDE 47
Thesaurus Revision and Updates
• There will always be new concepts, products, or expressions that need to be added to the thesaurus
  – Set a regular schedule of reviews and revisions
  – Collect complaints, problems, etc. and fold them into the revision of the thesaurus
SLIDE 48
References
• Soergel, D. Indexing Languages and Thesauri: Construction and Maintenance. Los Angeles: Melville Publishing Co., 1974.
• Foskett, A.C. The Subject Approach to Information. London: Clive Bingley, 1982.
• Standards:
  – ANSI/NISO Z39.19-1994: American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
  – ANSI/NISO Draft Standard Z39.4-199x: American National Standard Guidelines for Indexes in Information Retrieval
  – ISO 2788: Documentation. Guidelines for the establishment and development of monolingual thesauri
  – ISO 5964: Documentation. Guidelines for the establishment and development of multilingual thesauri