Page 2: Indexing Knowledge

Indexing Knowledge

Daniel Vasicek
2014 March 27

Page 3: Indexing Knowledge

Introduction

• Basic topic is: All Human Knowledge
• Who Cares?
• Simple Examples

Page 4: Indexing Knowledge

Basic Ideas

• Concepts instead of key words
  – Thesauri instead of key words
  – Recognize Emerging concepts
  – Classification
• Facilitate communication between environments (Data translation)
• Meta data for publications (xml, sql, txt)
  – Indexing information

Page 5: Indexing Knowledge

Topics to Cover

• Programming language constructs needed. What functionality do we need?
• What do people pay Access Innovations to do?
• Typical programming problems that I encounter.

Page 6: Indexing Knowledge

Input Data

• Formats
  – XML tagged meta data for publications
  – SQL data base
  – RAW text
  – Pictures of text
• Quantities
  – AIP
    • 304,910 authors as xml files
    • 807,005 xml files containing title, abstract, + meta data
  – NICEM (National Information Center for Educational Media)
    • 503,534 xml files describing available educational media
    • 26,144 xml files describing suppliers of educational media

Page 7: Indexing Knowledge

Programming Languages Used

• Visual Basic (1990s)
• C++
• Java (currently)

Page 8: Indexing Knowledge

Who Cares?

• AIP – American Institute of Physics (17 journals + conference proceedings)
• IEEE – Institute of Electrical and Electronics Engineers (journals, standards, patents, …)
• SPIE – International Society for Optics and Photonics
• ACM – Association for Computing Machinery
• Wolters Kluwer
• PubMed

Page 9: Indexing Knowledge

More Clients

• Parliament of Victoria (5000 articles per day)
• JSTOR (~10 million documents, some journals back to 1665)
• PLOS (quick path to electronic publication)
• DuPont
• Dow
• Council of Europe
• Triumph Learning
• ASCE, SAGE, SafetyLit, OSA, NICEM, NPR …

Page 10: Indexing Knowledge

Useful Tools

• Controlled Vocabulary – an organizational tool for capturing concepts
• Proximity – a tool for capturing context
• Hash Table (Content Addressable Array)
  – Convenience
  – Uniqueness
  – Fast access
• Regular Expressions

Page 11: Indexing Knowledge

What’s a taxonomy?

• Knowledge organization system
• Words
  – Controlled vocabulary for a subject area
• Descriptive labels
• Hierarchy
  – Simple hierarchical view of a thesaurus
• Storage and retrieval aid

Page 12: Indexing Knowledge

Thesaurus Elements

• Hierarchy
  – Broader and Narrower concepts
  – Multiply connected “treelike” structure

• Nodes in the thesaurus structure contain descriptions of concepts and links to broader, narrower, related, and similar concepts

• Subject specific?
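The node description above can be sketched as a small data structure. This is only an illustration, not the implementation used in production: the class name `ThesaurusNode`, the helper `link_broader`, and the terms "Physics" and "Biophysics" are assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusNode:
    """One concept in a thesaurus; links are stored as term names."""
    term: str                                   # preferred term
    description: str = ""                       # description of the concept
    broader: set = field(default_factory=set)
    narrower: set = field(default_factory=set)
    related: set = field(default_factory=set)
    synonyms: set = field(default_factory=set)


def link_broader(nodes, child, parent):
    """Record child -> parent together with the matching parent -> child link."""
    nodes[child].broader.add(parent)
    nodes[parent].narrower.add(child)


names = ("Science", "Biology", "Physics", "Biophysics")
nodes = {name: ThesaurusNode(name) for name in names}
link_broader(nodes, "Biology", "Science")
link_broader(nodes, "Physics", "Science")
# A term may have more than one broader term, which is why the structure
# is "treelike" but multiply connected rather than a strict tree.
link_broader(nodes, "Biophysics", "Biology")
link_broader(nodes, "Biophysics", "Physics")
nodes["Biology"].synonyms.add("Science of Life")

print(sorted(nodes["Biophysics"].broader))
```

Storing links as names rather than object references keeps each node easy to serialize, at the cost of one extra dictionary lookup when following a link.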

Page 13: Indexing Knowledge

Structure of Controlled Vocabularies

[Figure: types of controlled vocabulary, in order of INCREASING MEANING and CONTROL: Flat List, Synonym Ring, Taxonomy, Thesaurus, Ontology. Control labels shown under the types: Ambiguity, Synonym, Hierarchy, Relationships, Additional Types of Relationships.]

After ANSI/NISO Z39.19-2005, Figure 5

Page 14: Indexing Knowledge

Thesaurus Node (Term)

[Diagram: the term "Biology" with Broader Term "Science" and Synonym "Science of Life"; "Biology" is a Narrower Term of "Science".]

Page 15: Indexing Knowledge

Thesaurus Implementation

• Terms (Concepts, Preferred Terms)
• Broader Terms
• Narrower Terms
• Related Terms
• Other Concepts
  – Synonyms
  – History
  – Responsibility
  – Backup
• Rules to help identify the concept in text
• Methods for maintaining the thesaurus

Page 16: Indexing Knowledge

Thesaurus Text Representation

<TermInfo><T>Biology</T><BT>Science</BT><UF>Science of Life</UF></TermInfo>

<TermInfo><T>Science</T><NT>Biology</NT></TermInfo>

<TermInfo><T>Science of Life</T></TermInfo>
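As a rough sketch of how records like these could be loaded, the following parses the three `TermInfo` records into a dictionary keyed by term. The wrapper element name `Thesaurus` and the function name `parse_term_records` are assumptions for the example; the tag names T, BT, NT, and UF come from the records above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RECORDS = """
<TermInfo><T>Biology</T><BT>Science</BT><UF>Science of Life</UF></TermInfo>
<TermInfo><T>Science</T><NT>Biology</NT></TermInfo>
<TermInfo><T>Science of Life</T></TermInfo>
"""


def parse_term_records(text):
    """Return {term: {tag: [values]}} for every <TermInfo> record in text."""
    # The records have no common parent, so wrap them in a root element
    # (the name "Thesaurus" is an assumption) to get well-formed XML.
    root = ET.fromstring("<Thesaurus>" + text + "</Thesaurus>")
    thesaurus = {}
    for record in root.findall("TermInfo"):
        term = record.findtext("T")
        fields = defaultdict(list)
        for child in record:
            if child.tag != "T":
                # A term can carry several BT/NT/UF entries, so keep lists.
                fields[child.tag].append(child.text)
        thesaurus[term] = dict(fields)
    return thesaurus


thesaurus = parse_term_records(RECORDS)
print(thesaurus["Biology"])
```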

Page 17: Indexing Knowledge

Thesaurus Problems

• Missing Terms – pointer links to a term that is not present
• Broken loops
  – Narrower term without matching broader term
  – Broader term without matching narrower term
  – Related term without a matching return relationship
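Both kinds of problem can be detected mechanically. The sketch below assumes the thesaurus is held as a dictionary of the form {term: {"BT"/"NT"/"RT": [terms]}}; the tag "RT" for related terms, the function name `check_thesaurus`, and the sample terms are assumptions chosen for illustration.

```python
def check_thesaurus(thesaurus):
    """Return a list of problems found in {term: {"BT"/"NT"/"RT": [terms]}}."""
    # Each relationship, and the return link that must exist on its target.
    reciprocal = {"BT": "NT", "NT": "BT", "RT": "RT"}
    problems = []
    for term, fields in thesaurus.items():
        for tag, back in reciprocal.items():
            for target in fields.get(tag, []):
                if target not in thesaurus:
                    problems.append(f"missing term: {term} {tag} {target}")
                elif term not in thesaurus[target].get(back, []):
                    problems.append(f"broken loop: {term} {tag} {target}")
    return problems


sample = {
    "Science": {"NT": ["Biology", "Chemistry"]},   # Chemistry is never defined
    "Biology": {"BT": ["Science"], "RT": ["Medicine"]},
    "Medicine": {},                                # no RT back to Biology
}
for problem in check_thesaurus(sample):
    print(problem)
```

Because every lookup is a dictionary access, the check takes time proportional to the number of links, which matters for vocabularies with hundreds of thousands of terms.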

Page 18: Indexing Knowledge

Proximity of Words

• Adjacent
  – Before
  – After
• Same sentence
• Same Paragraph
• Within 50 words
• Phrases (n-Grams)
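A few of these proximity tests can be sketched with a simple word tokenizer. The function names, the tokenizer, and the sample sentence are assumptions for illustration; a real indexing rule would likely also need sentence and paragraph boundaries, which this sketch ignores.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def within(text, first, second, window=50):
    """True if `first` and `second` occur within `window` words of each other."""
    words = tokenize(text)
    first_at = [i for i, w in enumerate(words) if w == first.lower()]
    second_at = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= window for i in first_at for j in second_at)


def adjacent_before(text, first, second):
    """True if `first` appears immediately before `second`."""
    words = tokenize(text)
    return (first.lower(), second.lower()) in zip(words, words[1:])


def ngrams(text, n):
    """All runs of n consecutive words, as tuples."""
    words = tokenize(text)
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


text = "The mercury level in the thermometer rose. Mercury is also a planet."
print(within(text, "mercury", "planet", window=4))
print(adjacent_before(text, "mercury", "level"))
print(ngrams("science of life", 2))
```

Context of this kind is what lets a rule tell "mercury" near "thermometer" apart from "mercury" near "planet".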

Page 19: Indexing Knowledge

Content Addressable Array

T["Science"] = 1;
T["Biology"] = 1;
T["Science of Life"] = 1;
BT["Biology"] = "Science";
NT["Science"] = "Biology";
UF["Science of Life"] = "Biology";
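The same arrays can be written with Python's built-in hash tables (`set` and `dict`), shown here as one possible rendering rather than the code actually used. The helper names `preferred` and `broader_chain` are assumptions; the data is taken from the assignments above.

```python
# T: term membership. A set gives uniqueness; adding a term twice has no effect.
terms = {"Science", "Biology", "Science of Life"}
terms.add("Biology")

broader = {"Biology": "Science"}            # BT
narrower = {"Science": ["Biology"]}         # NT, a list since a term can have several
use_for = {"Science of Life": "Biology"}    # UF: synonym -> preferred term


def preferred(term):
    """Map a synonym to its preferred term; other terms map to themselves."""
    return use_for.get(term, term)


def broader_chain(term):
    """Follow BT links upward from a term, stopping if a link repeats."""
    chain = []
    seen = set()
    term = preferred(term)
    while term in broader and term not in seen:
        seen.add(term)
        term = broader[term]
        chain.append(term)
    return chain


print(len(terms))
print(preferred("Science of Life"))
print(broader_chain("Science of Life"))
```

Each lookup is a single hashed access on the term string itself, which is the convenience and fast access the hash table is listed for under Useful Tools.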

Page 20: Indexing Knowledge

Regular Expressions

• /^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,4})$/ – Email addresses?

• / [A-Z][a-z]* / – Capitalized words

• /[A-Z][a-zA-Z0-9,\"\- ]*\. / – Sentence?

• Paragraph?
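The question marks are deserved, as a quick trial of the three patterns shows. The sketch below uses Python's `re` module; the curly quote in the sentence pattern is assumed to be a plain double quote, and the sample strings are made up for illustration.

```python
import re

EMAIL = re.compile(
    r"^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,4})$"
)
CAPITALIZED = re.compile(r" [A-Z][a-z]* ")
SENTENCE = re.compile(r'[A-Z][a-zA-Z0-9,"\- ]*\. ')

# The final {2,4} rejects addresses whose last label is longer than 4 letters.
email_ok = bool(EMAIL.match("jane.doe@example.org"))
email_long_tld = bool(EMAIL.match("jane@example.museum"))

# Each match consumes its surrounding spaces, so "We" (no leading space)
# and "Cooper" (its leading space was already consumed) are both missed.
capitalized = CAPITALIZED.findall("We met Alice Cooper in Paris today.")

# An abbreviation followed by a space is reported as a sentence of its own.
sentences = SENTENCE.findall("Dr. Smith studies biology. It is a science. ")

print(email_ok, email_long_tld)
print(capitalized)
print(sentences)
```

A paragraph has no comparably short pattern in plain text; splitting on blank lines with `re.split(r"\n\s*\n", text)` is one common approximation.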

Page 21: Indexing Knowledge

Structure of Controlled Vocabularies

[Figure: types of controlled vocabulary, in order of INCREASING MEANING and CONTROL: Flat List, Synonym Ring, Taxonomy, Thesaurus, Ontology. Control labels shown under the types: Ambiguity, Synonym, Hierarchy, Relationships, Additional Types of Relationships.]

After ANSI/NISO Z39.19-2005, Figure 5