a taxonomy of similarity mechanisms for case-based reasoning
Intelligent Database Systems Lab
國立雲林科技大學 National Yunlin University of Science and Technology
A Taxonomy of Similarity Mechanisms for Case-Based Reasoning
Pádraig Cunningham
TKDE, Vol.21, 2009, pp. 1532–1543.
Presenter : Wei-Shen Tai
2009/11/17
N.Y.U.S.T.
I. M.
Outline
  Introduction
  Representation
  Similarity measures
    Direct similarity mechanisms
    Transformation-based measures
    Information-theoretic measures
    Emergent measures
  Implications for CBR research
  Conclusion
  Comments
Motivation
  Similarity is central to CBR.
  More recently, a number of novel mechanisms have emerged that introduce interesting alternative perspectives on similarity.
Objective
  Review novel similarity mechanisms (SM).
  Present a taxonomy of similarity mechanisms that places these new techniques in the context of established CBR techniques.
Feature value representation
  Cases are described in terms of attribute values, one set per case (instance).
  Enhancement
    Discover word associations in a text corpus, then use these associations to add terms to the representation.
    Example: Bill Gates -> software, CEO, Microsoft
    This allows texts to be represented by more features.
Structural representations
  Hierarchical structure
    Feature values themselves reference non-atomic objects.
  Network structure
    Typically a semantic network.
    The Semantic Web describes the relationships between things (for example, a tire is a part of a car, and John Lennon was a member of the Beatles) and the properties of things (such as size, weight, age, and price).
  Flow structure
    Shares many of the characteristics of hierarchical and network representations.
    Examples: workflows or jobs.
String and sequence representations
  The most straightforward representation for free text (unstructured data).
  Similarity assessment on such text can be supported by the bag-of-words strategy from information retrieval.
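The bag-of-words strategy can be illustrated with a minimal sketch. It assumes whitespace tokenization and cosine similarity over raw term counts; neither choice is stated on the slide, and the function names `bag_of_words` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each term."""
    return Counter(text.lower().split())


def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    shared_terms = bag_a.keys() & bag_b.keys()
    dot = sum(bag_a[t] * bag_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_a * norm_b)


doc1 = bag_of_words("case based reasoning retrieves similar cases")
doc2 = bag_of_words("similar cases support case based reasoning")
doc3 = bag_of_words("green tea and black tea")
print(cosine_similarity(doc1, doc2))  # high: five of six terms are shared
print(cosine_similarity(doc1, doc3))  # 0.0: no terms in common
```

Word order is ignored entirely, which is what makes this a bag rather than a sequence representation.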
Direct similarity mechanisms
  Similarity and distance metrics
    k-NN retrieval
  Set-theoretic measures
    Jaccard index, Dice similarity
  Kullback-Leibler divergence and the χ2 statistic
    Compare two images described as histograms.
  Symbolic attributes in taxonomies
    The feature values of the case representation are organized into a taxonomy of is-a relationships.
    Example taxonomy:
      root
        tea
          Green tea
          Black tea
        carbonated
          PepsiCola
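The set-theoretic and histogram measures named on this slide can be sketched as follows. The sketch assumes normalized histograms for the two divergences. The χ2 form shown is one common symmetric variant and may differ in normalization from the one used in the paper. All function names are hypothetical.

```python
import math


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def dice(a, b):
    """Dice similarity: twice the intersection over the sum of set sizes."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def kl_divergence(p, q):
    """KL(p || q) for normalized histograms.

    Assumes q[i] > 0 wherever p[i] > 0. Note that it is not symmetric.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def chi_square(p, q):
    """Symmetric chi-square distance: sum of (p - q)^2 / (p + q) over bins."""
    return sum((pi - qi) ** 2 / (pi + qi) for pi, qi in zip(p, q) if pi + qi > 0)


print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 shared of 4 distinct items
print(dice({1, 2, 3}, {2, 3, 4}))     # 2 * 2 shared over 3 + 3 items
print(kl_divergence([0.5, 0.5], [0.25, 0.75]))
print(chi_square([1.0, 0.0], [0.0, 1.0]))
```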
Transformation-based measures I
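  Edit distance
    The number of edit operations (insertions, deletions, substitutions) needed to transform one string into another.
    From "cat" to "rat" is 1; from "cats" to "cat" is 1.
  Alignment measures for biological sequences
    A variety of sequence alignment techniques are used in biology (for example, on DNA).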
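The edit distance examples above can be reproduced with a short dynamic-programming sketch. It assumes the Levenshtein variant, where each insertion, deletion, or substitution costs 1; the slide does not say which variant or cost scheme is intended, and `edit_distance` is a hypothetical name.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit costs, computed row by row."""
    # previous[j] holds the distance from the processed prefix of
    # source to the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("cat", "rat"))   # 1: substitute c -> r
print(edit_distance("cats", "cat"))  # 1: delete s
```

Sequence alignment scores in biology follow the same recurrence pattern but replace the unit costs with domain-specific substitution and gap penalties.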
Transformation-based measures II
  Earth mover's distance
    A transformation-based distance for image data.
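As a rough illustration of the idea, the sketch below computes the earth mover's distance for a deliberately simplified case: two one-dimensional histograms with equal total mass, where moving one unit of mass to an adjacent bin costs 1. These restrictions are assumptions made for brevity. The general measure for image signatures requires solving a transportation problem, which is not shown here. The name `emd_1d` is hypothetical.

```python
def emd_1d(p, q):
    """Earth mover's distance between two 1-D histograms of equal mass."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must have the same total mass")
    carried = 0.0  # surplus (or deficit) of earth carried to the next bin
    work = 0.0     # total mass-times-distance moved so far
    for p_mass, q_mass in zip(p, q):
        carried += p_mass - q_mass
        work += abs(carried)
    return work


print(emd_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))  # one unit moved two bins
print(emd_1d([1.0, 0.0], [0.5, 0.5]))            # half a unit moved one bin
```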
Transformation-based measures III
  Similarity for networks and graphs
    Structure mapping engine (SME)
      Identify the appropriate mapping between the two domains.
Information-theoretic measures
  These measures work directly on the raw case representation.
  Compression-based similarity for text
    For two very similar documents, the compressed size of both together will not be much greater than the compressed size of one alone.
  Information-based similarity for biological sequences
    Specialized algorithms are required to compress them.
  Similarity in a taxonomy
    Distinguishes the weights of the is-a relationships between features.
    The information content of a node in the taxonomy can be quantified as its negative log likelihood.
    The similarity of two nodes is the value of their common parent node with the highest information content.
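Two of the measures on this slide can be sketched briefly. The first function assumes the normalized compression distance computed with `zlib` as the compressor, which is only one possible choice. The second assumes a Resnik-style reading of the taxonomy bullet points, in which a node counts as its own ancestor. The small tea taxonomy and its probabilities are invented for illustration, and all function names are hypothetical.

```python
import math
import zlib


def compressed_size(data):
    """Length in bytes of the zlib-compressed input."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized compression distance: near 0 for similar texts."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def taxonomy_similarity(a, b, parent, prob):
    """Information content (-log p) of the most informative common ancestor."""
    def ancestors(node):
        found = set()
        while node is not None:
            found.add(node)  # a node counts as its own ancestor
            node = parent.get(node)
        return found

    common = ancestors(a) & ancestors(b)
    return max(-math.log(prob[node]) for node in common)


# Hypothetical taxonomy: child -> parent, with made-up node probabilities.
parent = {"tea": "root", "carbonated": "root", "green tea": "tea",
          "black tea": "tea", "PepsiCola": "carbonated"}
prob = {"root": 1.0, "tea": 0.5, "carbonated": 0.5,
        "green tea": 0.25, "black tea": 0.25, "PepsiCola": 0.5}

doc_a = "the quick brown fox jumps over the lazy dog " * 20
doc_b = "pack my box with five dozen liquor jugs " * 20
print(ncd(doc_a, doc_a), ncd(doc_a, doc_b))
print(taxonomy_similarity("green tea", "black tea", parent, prob))
print(taxonomy_similarity("green tea", "PepsiCola", parent, prob))
```

The two teas meet at the node "tea", which carries information, while a tea and PepsiCola meet only at the root, which carries none.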
Emergent measures I
  Random forests
    An ensemble of decision trees.
    For each ensemble member, build a decision tree on a resampled set of n cases drawn from the N available, considering only a small subset of the features (m << M).
    Track the frequency with which pairs of cases are located at the same leaf node.
    The more often two cases share a leaf node, the more similar they are.
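The leaf-sharing step can be sketched independently of how the forest is trained. The sketch assumes the leaf reached by every case in every tree is already available as a table `leaves[i][t]`. That input format is an assumption made here; with scikit-learn, a table of this shape is what `RandomForestClassifier.apply` returns. The function name `forest_proximity` is hypothetical.

```python
def forest_proximity(leaves):
    """Fraction of trees in which each pair of cases lands in the same leaf.

    leaves[i][t] is the id of the leaf that case i reaches in tree t.
    """
    n_cases = len(leaves)
    n_trees = len(leaves[0])
    proximity = [[0.0] * n_cases for _ in range(n_cases)]
    for i in range(n_cases):
        for j in range(n_cases):
            shared = sum(1 for a, b in zip(leaves[i], leaves[j]) if a == b)
            proximity[i][j] = shared / n_trees
    return proximity


# Three cases, three trees (leaf ids are only comparable within one tree).
leaves = [[1, 4, 2],
          [1, 4, 3],
          [2, 5, 3]]
proximity = forest_proximity(leaves)
print(proximity[0][1])  # cases 0 and 1 share a leaf in two of three trees
print(proximity[0][2])  # cases 0 and 2 never share a leaf
```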
Emergent measures II
  Cluster kernels
    A semi-supervised learning technique, for settings where only some of the available data are labeled.
    Assumes that class labels do not change in regions of high density.
    Cluster kernels allow the unlabelled data to influence similarity.
    In the combined kernel, K_orig(xi, xj) is a basic neighborhood kernel and K_bag(xi, xj) is a kernel derived from repeated clustering of all the data.
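A minimal sketch of how a bagged kernel of this kind might be computed is given below. Two assumptions are made. First, K_bag is taken to be the fraction of clustering runs in which two cases fall in the same cluster. Second, the two kernels are combined by an element-wise product, as in the bagged cluster kernel of Weston et al.; the transcript does not preserve the combining formula, so the product is an assumption rather than the slide's stated method. The cluster assignments are supplied as input rather than produced by an actual clustering algorithm, and all names are hypothetical.

```python
def bagged_kernel(assignments):
    """K_bag[i][j]: fraction of runs placing cases i and j in one cluster.

    assignments[r][i] is the cluster id of case i in clustering run r.
    """
    n_runs = len(assignments)
    n_cases = len(assignments[0])
    return [[sum(1 for run in assignments if run[i] == run[j]) / n_runs
             for j in range(n_cases)]
            for i in range(n_cases)]


def combined_kernel(k_orig, k_bag):
    """Element-wise product of the two kernels (assumed combination)."""
    return [[orig * bag for orig, bag in zip(row_orig, row_bag)]
            for row_orig, row_bag in zip(k_orig, k_bag)]


# Two clustering runs over three cases, plus a made-up original kernel.
assignments = [[0, 0, 1],
               [0, 1, 1]]
k_orig = [[1.0, 0.8, 0.6],
          [0.8, 1.0, 0.4],
          [0.6, 0.4, 1.0]]
k_bag = bagged_kernel(assignments)
k = combined_kernel(k_orig, k_bag)
print(k_bag[0][1], k[0][1])  # same cluster in one of two runs
print(k_bag[0][2], k[0][2])  # never clustered together, so similarity is 0
```

Because the clustering uses labeled and unlabeled cases alike, pairs that never cluster together are pushed apart even when the original kernel rates them as fairly similar.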
Emergent measures III
  Web-based kernel
    Measures the similarity of text snippets using the documents returned by a Web search.
Implications for CBR research
  Vocabulary knowledge container
    In some circumstances (e.g., information-theoretic measures) the role of the similarity knowledge container is increased.
  Speed-up techniques
    The new methodologies are typically computationally intensive, so strategies for speeding up case retrieval become more important.
Conclusions
  Similarity measurement taxonomy
    Organizes the broad range of strategies for similarity assessment in CBR into a coherent taxonomy.
  Improving the effectiveness of CBR
    Alternative metrics may simply offer better accuracy because they embody specific knowledge about the data.
Comments
  Advantage
    This paper introduces and discusses alternative metrics for similarity assessment in CBR.
  Drawback
  Application
    Similarity measurement.