Indexing Knowledge
Daniel Vasicek
2014 March 27
Introduction
• Basic topic is: All Human Knowledge
• Who Cares?
• Simple Examples
Basic Ideas
• Concepts instead of key words
– Thesauri instead of key words
– Recognize emerging concepts
– Classification
• Facilitate communication between environments (data translation)
• Metadata for publications (XML, SQL, TXT)
– Indexing information
Topics to Cover
• Programming language constructs needed: what functionality do we need?
• What do people pay Access Innovations to do?
• Typical programming problems that I encounter
Input Data
• Formats
– XML-tagged metadata for publications
– SQL database
– Raw text
– Pictures of text
• Quantities
– AIP
  • 304,910 authors as XML files
  • 807,005 XML files containing title, abstract, and metadata
– NICEM (National Information Center for Educational Media)
  • 503,534 XML files describing available educational media
  • 26,144 XML files describing suppliers of educational media
Programming Languages Used
• Visual Basic (1990s)
• C++
• Java (currently)
Who Cares?
• AIP – American Institute of Physics (17 journals + conference proceedings)
• IEEE – Institute of Electrical and Electronics Engineers (journals, standards, patents, …)
• SPIE – International Society for Optics and Photonics
• ACM – Association for Computing Machinery
• Wolters Kluwer
• PubMed
More Clients
• Parliament of Victoria (5000 articles per day)
• JSTOR (~10 million documents, some journals back to 1665)
• PLOS (quick path to electronic publication)
• DuPont
• Dow
• Council of Europe
• Triumph Learning
• ASCE, SAGE, SafetyLit, OSA, NICEM, NPR …
Useful Tools
• Controlled Vocabulary – an organizational tool for capturing concepts
• Proximity – a tool for capturing context
• Hash Table (Content Addressable Array)
– Convenience
– Uniqueness
– Fast access
• Regular Expressions
What’s a taxonomy?
• Knowledge organization system
• Words
– Controlled vocabulary for a subject area
• Descriptive labels
• Hierarchy
– Simple hierarchical view of a thesaurus
• Storage and retrieval aid
Thesaurus Elements
• Hierarchy
– Broader and Narrower concepts
– Multiply connected “treelike” structure
• Nodes in the thesaurus structure contain descriptions of concepts and links to broader, narrower, related, and similar concepts
• Subject specific?
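A node of this kind can be pictured as a small record holding the concept and its links. The following is a minimal Python sketch; the class name `ThesaurusNode`, its field names, and the sample description are illustrative assumptions, not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusNode:
    """One concept in a thesaurus (hypothetical structure for illustration)."""
    term: str                                    # preferred label for the concept
    description: str = ""                        # free-text description of the concept
    broader: set = field(default_factory=set)    # links to broader concepts
    narrower: set = field(default_factory=set)   # links to narrower concepts
    related: set = field(default_factory=set)    # links to related concepts
    synonyms: set = field(default_factory=set)   # similar / non-preferred labels


# Because `broader` is a set, a node may have several broader terms,
# which is what makes the structure "treelike" rather than a strict tree.
biology = ThesaurusNode(
    term="Biology",
    description="Illustrative description only.",
    broader={"Science"},
    synonyms={"Science of Life"},
)
```

Storing links as term strings rather than object references keeps each node self-contained, at the cost of needing a separate lookup table to follow a link.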
Structure of Controlled Vocabularies
• Flat List → Synonym Ring → Taxonomy → Thesaurus → Ontology (increasing meaning and control)
• Controls shown: ambiguity, synonym, hierarchy, relationships, additional types of relationships
• After ANSI/NISO Z39.19-2005, Figure 5
Thesaurus Node (Term)
• Term: Biology
• Broader Term: Science
• Synonym: Science of Life
• Narrower Term (of Science): Biology
Thesaurus Implementation
• Terms (Concepts, Preferred Terms)
• Broader Terms
• Narrower Terms
• Related Terms
• Other Concepts
– Synonyms
– History
– Responsibility
– Backup
• Rules to help identify the concept in text
• Methods for maintaining the thesaurus
Thesaurus Text Representation
<TermInfo><T>Biology</T><BT>Science</BT><UF>Science of Life</UF></TermInfo>
<TermInfo><T>Science</T><NT>Biology</NT></TermInfo>
<TermInfo><T>Science of Life</T></TermInfo>
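Records in this form can be loaded with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the wrapping `<Thesaurus>` root element and the function name `parse_term_records` are assumptions made so the three records form one well-formed document.

```python
import xml.etree.ElementTree as ET

RECORDS = """
<TermInfo><T>Biology</T><BT>Science</BT><UF>Science of Life</UF></TermInfo>
<TermInfo><T>Science</T><NT>Biology</NT></TermInfo>
<TermInfo><T>Science of Life</T></TermInfo>
"""


def parse_term_records(text):
    """Return {term: {"BT": [...], "NT": [...], "UF": [...]}} from TermInfo records."""
    # The records have no common parent, so wrap them in a synthetic root
    # element ("Thesaurus" is an assumed name) to get well-formed XML.
    root = ET.fromstring("<Thesaurus>" + text + "</Thesaurus>")
    thesaurus = {}
    for info in root.findall("TermInfo"):
        term = info.findtext("T").strip()
        entry = {"BT": [], "NT": [], "UF": []}
        for child in info:
            # Collect only the link tags; <T> itself is the dictionary key.
            if child.tag in entry:
                entry[child.tag].append(child.text.strip())
        thesaurus[term] = entry
    return thesaurus


thesaurus = parse_term_records(RECORDS)
print(thesaurus["Biology"])
# {'BT': ['Science'], 'NT': [], 'UF': ['Science of Life']}
```

Lists are used for each link type because a term may carry several BT, NT, or UF entries.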
Thesaurus Problems
• Missing Terms - pointer links to a term that is not present
• Broken loops
– Narrower term without matching broader term
– Broader term without matching narrower term
– Related term without a matching return relationship
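Both kinds of problem can be detected mechanically. The sketch below is one possible check, assuming the thesaurus is held as a set of terms plus three link tables (term to set of targets); the function name `check_thesaurus` and the extra sample terms "Physics" and "Chemistry" are illustrative, not from the slides.

```python
def check_thesaurus(terms, bt, nt, rt):
    """Return a list of problem descriptions for the given link tables.

    terms: set of all terms present in the thesaurus
    bt, nt, rt: dicts mapping a term to the set of terms it links to
    """
    problems = []
    # Each link type must be matched by a return link: BT <-> NT, RT <-> RT.
    tables = {"BT": (bt, nt, "NT"), "NT": (nt, bt, "BT"), "RT": (rt, rt, "RT")}
    for label, (table, inverse, inverse_label) in tables.items():
        for term, targets in sorted(table.items()):
            for target in sorted(targets):
                if target not in terms:
                    # The link points at a term that is not present.
                    problems.append(f"missing term: {term} {label} {target}")
                elif term not in inverse.get(target, set()):
                    # The target exists but does not link back.
                    problems.append(
                        f"broken loop: {term} {label} {target} has no {inverse_label} back"
                    )
    return problems


terms = {"Science", "Biology", "Physics"}
bt = {"Biology": {"Science"}, "Physics": {"Science"}}
nt = {"Science": {"Biology"}}      # "Physics" deliberately left out
rt = {"Biology": {"Chemistry"}}    # "Chemistry" is not a term at all

for problem in check_thesaurus(terms, bt, nt, rt):
    print(problem)
# broken loop: Physics BT Science has no NT back
# missing term: Biology RT Chemistry
```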
Proximity of Words
• Adjacent
– Before
– After
• Same sentence
• Same paragraph
• Within 50 words
• Phrases (n-grams)
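These proximity tests are simple to express once the text is tokenized. The following Python sketch shows one possible version of each; the tokenizer, the sentence-splitting rule, and all function names are simplifying assumptions rather than the method used in the talk.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def is_before(tokens, a, b):
    """True if word a occurs immediately before word b (adjacent, in that order)."""
    return any(x == a and y == b for x, y in zip(tokens, tokens[1:]))


def within(tokens, a, b, window):
    """True if words a and b occur within `window` word positions of each other."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)


def same_sentence(text, a, b):
    """True if both words appear in one sentence (split after ., ! or ?)."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return any(a in tokenize(s) and b in tokenize(s) for s in sentences)


def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "Biology is the science of life. Physics is another science."
tokens = tokenize(text)
print(is_before(tokens, "science", "of"))              # True
print(within(tokens, "biology", "life", 50))           # True
print(same_sentence(text, "biology", "physics"))       # False
print(("science", "of", "life") in ngrams(tokens, 3))  # True
```

"After" is the same test as "before" with the two words swapped, and "same paragraph" follows the same pattern as `same_sentence` with a paragraph splitter.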
Content Addressable Array
T["Science"] = 1;
T["Biology"] = 1;
T["Science of Life"] = 1;
BT["Biology"] = "Science";
NT["Science"] = "Biology";
UF["Science of Life"] = "Biology";
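In Python the same tables map directly onto the built-in `set` and `dict` types, which are hash tables. The sketch below keeps the slide's table names and single-valued links; the helper functions `preferred` and `broader_chain` are hypothetical additions showing how the tables might be used.

```python
# Terms in the vocabulary; a set gives uniqueness and fast membership tests.
T = {"Science", "Biology", "Science of Life"}
BT = {"Biology": "Science"}          # term -> broader term
NT = {"Science": "Biology"}          # term -> narrower term
UF = {"Science of Life": "Biology"}  # synonym -> preferred term, as on the slide


def preferred(term):
    """Map a term to its preferred form, or None if it is not in the vocabulary."""
    if term not in T:
        return None
    return UF.get(term, term)


def broader_chain(term):
    """Follow BT links upward from a term; stops at the top or on a cycle."""
    chain = []
    seen = set()
    current = preferred(term)
    while current in BT and current not in seen:
        seen.add(current)
        current = BT[current]
        chain.append(current)
    return chain


print(preferred("Science of Life"))      # Biology
print(broader_chain("Science of Life"))  # ['Science']
print(preferred("Chemistry"))            # None
```

A fuller thesaurus would map each term to a set of targets, since a term can have several broader, narrower, or synonymous terms.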
Regular Expressions
• /^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,4})$/ – Email addresses?
• / [A-Z][a-z]* / – Capitalized words
• /[A-Z][a-zA-Z0-9,\"\- ]*\. / – Sentence?
• Paragraph?
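The patterns above can be tried out with Python's `re` module. In this sketch the email pattern is copied from the slide; the capitalized-word pattern uses word boundaries instead of surrounding spaces, the sentence pattern drops the trailing space, and the blank-line paragraph rule is an assumption, since the slide leaves that question open.

```python
import re

EMAIL = re.compile(
    r"^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,4})$"
)
CAPITALIZED = re.compile(r"\b[A-Z][a-z]*\b")        # \b instead of literal spaces
SENTENCE = re.compile(r'[A-Z][a-zA-Z0-9,"\- ]*\.')  # slide pattern without trailing space
PARAGRAPH_BREAK = re.compile(r"\n\s*\n")            # assumed rule: blank line between paragraphs

text = "Biology is a science. It studies life.\n\nPhysics is a science too."

print(bool(EMAIL.match("jane.doe@example.org")))
# True
print(CAPITALIZED.findall("Biology is the Science of Life"))
# ['Biology', 'Science', 'Life']
print(SENTENCE.findall(text))
# ['Biology is a science.', 'It studies life.', 'Physics is a science too.']
print(PARAGRAPH_BREAK.split(text))
# ['Biology is a science. It studies life.', 'Physics is a science too.']
```

The question marks on the slide are warranted: the email pattern rejects top-level domains longer than four letters, and the sentence pattern stops at the first period, so an abbreviation such as "Dr." ends a match early.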