Thesaurus-Based Index Term Extraction
Olena Medelyan
Digital Library Laboratory
Index Terms vs. Keyphrases
• Describe the topics in a document
• Index terms: taken from a controlled vocabulary (e.g. predatory birds, damage, aquaculture)
• Keyphrases: freely chosen (e.g. techniques, bird predation, aquaculture)
• Purposes:
– Organize a library's holdings
– Provide thematic access to documents
– Represent documents as a brief summary
– Aid navigation in search results
• Manual assignment: expensive, time-consuming
Example document: "Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities"
Extraction vs. Assignment
Extraction:
• Select significant n-grams or NPs according to their characteristics
+ Easy and fast implementation
+ Not much training required
- Restriction to syntax
- Bad quality phrases
- No consistency
Assignment:
• Classify documents according to their content words into classes (labels = keyphrases)
+ Word co-occurrence
+ High accuracy
- Need large corpora
- Long computational time
- Not practical
KEA++
• Combines extraction with controlled vocabulary
• Considers semantic relations
• Controlled vocabulary = thesaurus
• Experiment:
– Agricultural documents (www.fao.org/docrep)
– Agrovoc thesaurus (www.fao.org/agrovoc)
How Does It Work?
1. Extract n-grams, transform them to pseudo-phrases, and map them to the pseudo-phrases of the thesaurus' descriptors (e.g. bird predation → predat bird)
2. Each document = set of candidate phrases
3. Training (documents + manually assigned phrases)
a. Compute the features
b. Compute the model
4. Testing (new documents, no phrases)
a. Compute the features
b. Compute probabilities according to the model
5. Classification model: Naïve Bayes
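Step 1 above can be sketched roughly as follows. This is a minimal illustration, not KEA++'s actual code: the stopword list, the toy suffix-stripping stemmer (`toy_stem`), the alphabetical ordering of stems, and the short descriptor list standing in for a thesaurus are all assumptions made for this example. Because the stems are sorted here, the slide's "predat bird" shows up as `("bird", "predat")`.

```python
# Assumed, tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "at", "for", "of", "the", "to"}

# Toy suffix list standing in for a real stemmer (assumption for this sketch).
SUFFIXES = ("ory", "ion", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def pseudo_phrase(phrase):
    """Lowercase, drop stopwords, stem, and sort the stems (order-insensitive key)."""
    words = [w for w in phrase.lower().split() if w not in STOPWORDS]
    return tuple(sorted(toy_stem(w) for w in words))


def candidates(text, descriptors, max_n=3):
    """Return the descriptors whose pseudo-phrase matches some n-gram of the text."""
    index = {pseudo_phrase(d): d for d in descriptors}
    tokens = text.lower().split()
    found = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            key = pseudo_phrase(" ".join(tokens[i : i + n]))
            if key and key in index:
                found.add(index[key])
    return found


# Hypothetical mini-vocabulary, not the real Agrovoc data.
vocabulary = ["predatory birds", "aquaculture", "fencing"]
print(candidates("Reducing bird predation at aquaculture facilities", vocabulary))
```

With this toy stemmer, "bird predation" and "predatory birds" reduce to the same key, so the n-gram in the text is mapped onto the controlled-vocabulary term.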
Features
• TF×IDF – phrases that are specific to a given document are significant
• First Position – phrases that occur at the beginning (or the end) of the document are significant
• Phrase Length – phrases with a certain number of words are significant (2!)
• Node Degree – phrases that are related to the most other phrases in the document are significant
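The four features could be computed along these lines. The slide only names the features, so the exact formulas below (TF×IDF as relative frequency times `-log2(df / N)`, first position as a relative token offset, node degree as a count of related candidates) are assumed, commonly used formulations rather than confirmed KEA++ definitions, and the `related` mapping is a hypothetical stand-in for thesaurus relations.

```python
import math


def tf_idf(count_in_doc, doc_size, doc_freq, num_docs):
    """Relative frequency in the document times inverse document frequency."""
    tf = count_in_doc / doc_size
    idf = -math.log2(doc_freq / num_docs)
    return tf * idf


def first_position(tokens, phrase_tokens):
    """Relative offset (0..1) of the first occurrence, or None if absent."""
    n = len(phrase_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i : i + n] == phrase_tokens:
            return i / len(tokens)
    return None


def phrase_length(phrase):
    """Number of words in the phrase."""
    return len(phrase.split())


def node_degree(phrase, doc_candidates, related):
    """Number of other candidates in the document that the thesaurus relates to phrase."""
    return sum(
        1
        for other in related.get(phrase, ())
        if other in doc_candidates and other != phrase
    )


# Hypothetical relations and candidates, for illustration only.
related = {"fish culture": {"aquaculture", "fish ponds", "fisheries"}}
doc_candidates = {"fish culture", "aquaculture", "fish ponds"}
print(node_degree("fish culture", doc_candidates, related))
```

In a pipeline like the one described on the previous slide, these values would form the feature vector that the classifier sees for each candidate phrase.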
Example
[Figure: Agrovoc terms for the example document drawn as a graph. The legend distinguishes terms chosen by indexers 1–6, Agrovoc relations between terms, and terms selected by KEA++.]
Terms shown: fisheries, fish culture, aquaculture, fish ponds, aquaculture techniques, bird control, predatory birds, noxious birds, scares, pest control, control methods, monitoring, methods, equipment, protective structures, electrical installation, fencing, damage, noise, North America, techniques, fishery production, predation, predators, birds, ropes, fishing operations
Evaluation I
• Standard evaluation:
– Number of exact matches in the test set
– Precision, Recall, F-measure
• Problem:
– Semantic similarity is not considered
– Comparison only to one indexer, although indexing is subjective
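As a small sketch of the standard evaluation, precision, recall and F-measure can be computed from exact matches between an extracted set and a manually assigned set. The function name and the two example sets are assumptions chosen for illustration; they are not the evaluation data from the experiment.

```python
def precision_recall_f(extracted, assigned):
    """Exact-match precision, recall and F-measure between two sets of phrases."""
    matches = len(extracted & assigned)
    precision = matches / len(extracted) if extracted else 0.0
    recall = matches / len(assigned) if assigned else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical sets: "birds" does not count as a match for "predatory birds",
# which is exactly the weakness of exact matching noted above.
extracted = {"aquaculture", "damage", "birds", "ropes"}
assigned = {"aquaculture", "damage", "predatory birds"}
print(precision_recall_f(extracted, assigned))
```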
Evaluation II
• Inter-indexer consistency, e.g. Rolling's measure:
Rolling's IIC = 2C / (A + B)
C – number of phrases in common
A – number of phrases in the first set
B – number of phrases in the second set

Consistency (%):
Indexer   vs. other indexers   vs. KEA   vs. KEA++
1         42                   7         29
2         39                   8         28
3         37                   9         26
4         37                   6         31
5         37                   6         25
6         36                   4         20
avg       38                   7         27

KEA++ vs. other indexers (avg): -11%
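Rolling's measure as defined on this slide is straightforward to compute on two sets of phrases. The sketch below follows the slide's symbols C, A and B; the function name and the example sets are hypothetical.

```python
def rolling_iic(first, second):
    """Rolling's inter-indexer consistency: 2C / (A + B) for two sets of phrases."""
    common = len(first & second)       # C
    total = len(first) + len(second)   # A + B
    return 2 * common / total if total else 0.0


# Hypothetical phrase sets from two indexers.
indexer_1 = {"aquaculture", "damage", "fencing", "noise"}
indexer_2 = {"aquaculture", "damage", "birds", "ropes"}
print(rolling_iic(indexer_1, indexer_2))
```

Computed on sets, this quantity is the same as the Dice coefficient, so it reaches 1 only when both indexers assign identical phrases.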
Results
Document: "Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities"

            Indexer            KEA++
Exact       aquaculture        aquaculture
            damage             damage
            fencing            fencing
            scares             scares
            noise*             noise*
Similar     bird control       birds
            predatory birds    predators
            fish culture       fishing operations
                               fishery production
No match    noxious birds, control methods, ropes

*Selected by only one indexer
Problems & Future Work
• Trivial problems (e.g. stemming errors)
• Document chunking
– What are the important and the disturbing parts of the document?
• Topic coverage
– Exploring the thesaurus' structure
– Lexical chains
• Term occurrence
– Including other NLP resources (e.g. WordNet)
• Multi-linguality, other domains