Model of Web Clustering Engine Enrichment with a Taxonomy, Ontologies and User Information
Carlos Cobos-Lozada MSc. Ph.D. (c) ccobos@unicauca.edu.co / [email protected]
Advisor: Elizabeth León Ph.D. [email protected]
Visiting scholar of Modern Heuristic Research Group
LISI-MIDAS: Universidad Nacional de Colombia, Sede Bogotá
GTI: Universidad del Cauca
Idaho Falls, October 5, 2011
Agenda
Preliminaries
Latent Semantic Indexing
Web Clustering Engines
Proposed Model

Preliminaries
[Figure: Information Retrieval System. Labeled elements: User, Query, Auto complete, Extended Query, Retrieval Process, Indexing Process, Indexes, Documents, Results, Visualization and browsing, Feedback]

Preliminaries
Information Retrieval Models (Retrieval)
• Classic Models: Boolean, Vector Space, Probabilistic
• Set Theoretic: Fuzzy, Extended Boolean
• Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
• Probabilistic: Inference Network, Belief Network
• Structured Models: Non-Overlapping Lists, Proximal Nodes

Preliminaries
Classic Models – Basic Concepts
Each document is represented by a set of representative keywords or index terms
An index term is a document word useful for remembering the document's main themes
Usually, index terms are nouns because nouns have meaning by themselves
However, some search engines assume that all words are index terms (full text representation)
Not all terms are equally useful for representing the document contents, e.g. less frequent terms allow identifying a narrower set of documents
The importance of the index terms is represented by weights associated with them

Preliminaries
Indexing Process
[Figure: indexing pipeline. Document → recognition of structure → Tokenization → Filters → Stop words rem. → Noun groups rem. → Stemming → Vocabulary rest. → Key words. Other labeled outputs: Structure, Full text representation]

Preliminaries
Indexing Process - Sample
Original: WASHINGTON - The House of Representatives on Tuesday passed a bill that puts the government on stable financial footing for six weeks but does nothing to resolve a battle over spending that is likely to flare again.
Tokens: WASHINGTON The House of Representatives on Tuesday passed a bill that puts the government on stable financial footing for six weeks but does nothing to resolve a battle over spending that is likely to flare again
Filters: washington the house of representatives on tuesday passed a bill that puts the government on stable financial footing for six weeks but does nothing to resolve a battle over spending that is likely to flare again
Stop: washington house representatives tuesday passed bill puts government stable financial footing weeks resolve battle spending flare
Stem: washington hous repres tuesdai pass bill put govern stabl financi foot week resolv battl spend flare

Preliminaries
Indexing Process - Sample
Original: TRENTON, New Jersey - New Jersey Governor Chris Christie dashed hopes on Tuesday he might make a late leap into the 2012 Republican presidential race, in a move that sets up a battle between Mitt Romney and Rick Perry.
Tokens: TRENTON New Jersey New Jersey Governor Chris Christie dashed hopes on Tuesday he might make a late leap into the 2012 Republican presidential race in a move that sets up a battle between Mitt Romney and Rick Perry
Filters: trenton new jersey new jersey governor chris christie dashed hopes on tuesday he might make a late leap into the 2012 republican presidential race in a move that sets up a battle between mitt romney and rick perry
Stop: trenton jersey jersey governor chris christie dashed hopes tuesday make late leap 2012 republican presidential race move sets battle mitt romney rick perry
Stem: trenton jersei jersei governor chri christi dash hope tuesdai make late leap 2012 republican presidenti race move set battl mitt romnei rick perri
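
The first three stages of the two samples (tokens, filters, stop words) can be sketched in a few lines of Python. This is only an illustration: the function names and the `STOP_WORDS` set are assumptions, chosen so that the output matches the "Stop" lines of the samples, and the stemming stage is left out (the stemmed forms in the samples look like the output of a Porter-style stemmer, which is not part of the standard library).

```python
import re

# Hypothetical stop list, chosen only to reproduce the two samples above.
STOP_WORDS = {
    "the", "of", "on", "a", "that", "for", "six", "but", "does", "nothing",
    "to", "over", "is", "likely", "again", "new", "he", "might", "into",
    "in", "up", "between", "and",
}

def tokenize(text):
    """Split the text into alphanumeric tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)

def apply_filters(tokens):
    """Filter stage of the samples: convert every token to lower case."""
    return [token.lower() for token in tokens]

def remove_stop_words(tokens):
    """Drop the tokens that appear in the stop list."""
    return [token for token in tokens if token not in STOP_WORDS]

def preprocess(text):
    """Tokens -> Filters -> Stop, in the order shown in the samples."""
    return remove_stop_words(apply_filters(tokenize(text)))

if __name__ == "__main__":
    sample = ("WASHINGTON - The House of Representatives on Tuesday passed "
              "a bill that puts the government on stable financial footing "
              "for six weeks but does nothing to resolve a battle over "
              "spending that is likely to flare again.")
    print(" ".join(preprocess(sample)))
```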

Preliminaries
Term-Document Matrix (TDM)
[Figure: matrix with documents d1 … dN as rows and terms t1 … tF as columns; each cell holds the observed frequency f_{i,j}. Annotated example values: max(f_i) = 4 for a row, n_j = 2 for a column]
$$w_{i,j} = \frac{f_{i,j}}{\max(f_i)} \times \log\left(\frac{N}{n_j + 1}\right)$$
TF-IDF or Term-Document Matrix
Stored in an Inverted Index

Preliminaries
Cosine Similarity
$$Sim(d, q) = \frac{\sum_{i=1}^{M} W_{i,d} \times W_{i,q}}{\sqrt{\sum_{i=1}^{M} W_{i,d}^{2}} \times \sqrt{\sum_{i=1}^{M} W_{i,q}^{2}}}$$

Preliminaries
Sample 1: Vector Space Model
[Figure: documents d1 … d7 and query q plotted as vectors in the term space (t1, t2, t3)]

Observed frequencies (N = 7):

| | t1 | t2 | t3 | max freq |
|---|---|---|---|---|
| d1 | 1 | 0 | 1 | 1 |
| d2 | 1 | 0 | 0 | 1 |
| d3 | 0 | 1 | 1 | 1 |
| d4 | 1 | 0 | 0 | 1 |
| d5 | 1 | 1 | 1 | 1 |
| d6 | 1 | 1 | 0 | 1 |
| d7 | 0 | 1 | 0 | 1 |
| q | 1 | 1 | 1 | 1 |
| n_i | 5 | 4 | 3 | |
| idf_i | 0.3364722 | 0.5596158 | 0.8472979 | |

Weights, similarity and ranking:

| | t1 | t2 | t3 | norm | sim(dj,q) | ranking |
|---|---|---|---|---|---|---|
| d1 | 0.3364722 | 0 | 0.8472979 | 0.9116618 | 0.85224481 | 3 |
| d2 | 0.3364722 | 0 | 0 | 0.33647224 | 0.31454287 | 6 |
| d3 | 0 | 0.5596158 | 0.8472979 | 1.01542282 | 0.94924327 | 2 |
| d4 | 0.3364722 | 0 | 0 | 0.33647224 | 0.31454287 | 7 |
| d5 | 0.3364722 | 0.5596158 | 0.8472979 | 1.06971822 | 1 | 1 |
| d6 | 0.3364722 | 0.5596158 | 0 | 0.65298039 | 0.61042281 | 4 |
| d7 | 0 | 0.5596158 | 0 | 0.55961579 | 0.52314318 | 5 |
| q | 0.5047084 | 0.8394237 | 1.2709468 | 1.60457732 | | |

Preliminaries
Sample 2: Vector Space Model
[Figure: documents d1 … d7 and query q plotted as vectors in the term space (t1, t2, t3)]

Observed frequencies (N = 7):

| | t1 | t2 | t3 | max freq |
|---|---|---|---|---|
| d1 | 1 | 0 | 1 | 1 |
| d2 | 1 | 0 | 0 | 1 |
| d3 | 0 | 1 | 1 | 1 |
| d4 | 1 | 0 | 0 | 1 |
| d5 | 1 | 1 | 1 | 1 |
| d6 | 1 | 1 | 0 | 1 |
| d7 | 0 | 1 | 0 | 1 |
| q | 1 | 2 | 3 | 3 |
| n_i | 5 | 4 | 3 | |
| idf_i | 0.3364722 | 0.5596158 | 0.8472979 | |

Weights, similarity and ranking:

| | t1 | t2 | t3 | norm | sim(dj,q) | ranking |
|---|---|---|---|---|---|---|
| d1 | 0.3364722 | 0 | 0.8472979 | 0.9116618 | 0.88229947 | 3 |
| d2 | 0.3364722 | 0 | 0 | 0.33647224 | 0.19256666 | 6 |
| d3 | 0 | 0.5596158 | 0.8472979 | 1.01542282 | 0.97544391 | 2 |
| d4 | 0.3364722 | 0 | 0 | 0.33647224 | 0.19256666 | 7 |
| d5 | 0.3364722 | 0.5596158 | 0.8472979 | 1.06971822 | 0.98650404 | 1 |
| d6 | 0.3364722 | 0.5596158 | 0 | 0.65298039 | 0.48349989 | 4 |
| d7 | 0 | 0.5596158 | 0 | 0.55961579 | 0.44838373 | 5 |
| q | 0.2803935 | 0.6528851 | 1.2709468 | 1.45608558 | | |
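
The numbers in the two Vector Space Model samples can be reproduced with a short script. It is a sketch under assumptions inferred from the sample values, not taken from a stated formula: the idf values equal the natural logarithm of N / n_i (ln(7/5) ≈ 0.3364722), document weights are (freq / max freq) × idf, and the query weights are consistent with (0.5 + freq / max freq) × idf. All function names are illustrative.

```python
import math

# Observed frequencies of t1, t2, t3 in the seven sample documents.
DOCS = {
    "d1": [1, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 1], "d4": [1, 0, 0],
    "d5": [1, 1, 1], "d6": [1, 1, 0], "d7": [0, 1, 0],
}

def inverse_document_frequencies(docs):
    """idf_i = ln(N / n_i), the form that matches the sample values."""
    n_docs = len(docs)
    n_terms = len(next(iter(docs.values())))
    idf = []
    for i in range(n_terms):
        n_i = sum(1 for freqs in docs.values() if freqs[i] > 0)
        idf.append(math.log(n_docs / n_i))
    return idf

def document_weights(freqs, idf):
    """(freq / max freq) * idf for every term of one document."""
    top = max(freqs)
    return [f / top * w for f, w in zip(freqs, idf)]

def query_weights(freqs, idf):
    """(0.5 + freq / max freq) * idf, inferred from the sample numbers."""
    top = max(freqs)
    return [(0.5 + f / top) * w for f, w in zip(freqs, idf)]

def cosine(a, b):
    """Cosine similarity between two weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query_freqs, docs=DOCS):
    """Return (document, similarity) pairs sorted by decreasing similarity."""
    idf = inverse_document_frequencies(docs)
    q = query_weights(query_freqs, idf)
    sims = {name: cosine(document_weights(f, idf), q)
            for name, f in docs.items()}
    return sorted(sims.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    for name, sim in rank([1, 1, 1]):   # query of Sample 1
        print(name, round(sim, 8))
    for name, sim in rank([1, 2, 3]):   # query of Sample 2
        print(name, round(sim, 8))
```

Documents d2 and d4 have identical vectors, so their relative order in the ranking is arbitrary; the sample tables list them as positions 6 and 7.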

Preliminaries
Vector Space Model
Advantages:
• Simple model based on linear algebra
• Term weights
• Allows computing a continuous degree of similarity between queries and documents
• Allows ranking documents according to their possible relevance
• Allows partial matching
Limitations:
• Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality)
• Word substrings might result in a "false positive match"
• Semantic sensitivity: documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match"
• The order in which the terms appear in the document is lost in the vector space representation
• Assumes terms are statistically independent

Latent Semantic Indexing
It is an indexing and retrieval method that uses a mathematical technique called Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text
SVD:
◦ Also, it can be used to reduce noise in the data (SVD moves data to a reduced dimension)

Latent Semantic Indexing
SVD
Let A denote an m × n matrix of real-valued data and rank r, where without loss of generality m ≥ n, and therefore r ≤ n.
$$A_{m \times n} = U_{m \times n} \, \Sigma_{n \times n} \, V^{T}_{n \times n}$$
Where:
◦ The columns of U are called the left singular vectors and form an orthonormal basis for the column space of A. They are eigenvectors of $AA^T$ (orthogonal)
◦ The rows of $V^T$ contain the elements of the right singular vectors and form an orthonormal basis for the row space of A. The columns of V are eigenvectors of $A^TA$ (orthogonal)
◦ Σ holds on its diagonal the square roots of the eigenvalues of $A^TA$ (its nonzero eigenvalues are the same as those of $AA^T$), so it is a sorted diagonal matrix: $\Sigma_{i,i} \ge \Sigma_{j,j}$ where i < j, and $\Sigma_{i,i} = 0$ where i > r, with r ≤ n

Latent Semantic Indexing
A (m × n = 10 × 8; rows: Docs, columns: Terms):
$$A_{10 \times 8} = \begin{bmatrix}
3 & 5 & 5 & 5 & 4 & 3 & 1 & 2 \\
3 & 4 & 5 & 4 & 3 & 3 & 4 & 3 \\
3 & 4 & 5 & 3 & 3 & 4 & 3 & 2 \\
5 & 5 & 3 & 5 & 4 & 5 & 5 & 5 \\
3 & 4 & 4 & 4 & 5 & 4 & 3 & 4 \\
4 & 5 & 4 & 5 & 4 & 3 & 5 & 2 \\
2 & 5 & 4 & 3 & 3 & 3 & 3 & 2 \\
5 & 5 & 4 & 5 & 3 & 5 & 5 & 4 \\
5 & 5 & 5 & 5 & 4 & 5 & 4 & 4 \\
4 & 3 & 4 & 4 & 1 & 2 & 4 & 3
\end{bmatrix}$$
U (m × n = 10 × 8):
$$U_{10 \times 8} = \begin{bmatrix}
0.29 & 0.64 & -0.01 & -0.29 & -0.56 & -0.19 & 0.12 & 0.15 \\
0.30 & 0.06 & 0.28 & -0.02 & 0.24 & 0.48 & 0.12 & 0.47 \\
0.28 & 0.24 & 0.13 & -0.12 & 0.59 & -0.24 & -0.42 & -0.08 \\
0.37 & -0.44 & -0.44 & 0.09 & -0.14 & -0.08 & 0.24 & -0.15 \\
0.31 & 0.15 & -0.52 & -0.03 & 0.08 & 0.61 & -0.09 & -0.01 \\
0.33 & -0.01 & 0.27 & 0.71 & -0.35 & 0.06 & -0.43 & -0.07 \\
0.26 & 0.28 & 0.09 & 0.39 & 0.33 & -0.18 & 0.66 & -0.27 \\
0.37 & -0.35 & -0.02 & -0.06 & 0.04 & -0.41 & 0.04 & 0.62 \\
0.38 & -0.04 & -0.12 & -0.30 & 0.04 & -0.18 & -0.27 & -0.40 \\
0.26 & -0.33 & 0.58 & -0.38 & -0.16 & 0.25 & 0.18 & -0.32
\end{bmatrix}$$
Σ (n × n = 8 × 8):
$$\Sigma_{8 \times 8} = \mathrm{diag}(34.89,\ 4.63,\ 3.36,\ 2.33,\ 2.21,\ 1.73,\ 1.22,\ 0.35)$$
$V^T$ (n × n = 8 × 8):
$$V^{T}_{8 \times 8} = \begin{bmatrix}
0.34 & 0.41 & 0.38 & 0.39 & 0.31 & 0.34 & 0.34 & 0.29 \\
-0.35 & 0.28 & 0.48 & 0.03 & 0.38 & -0.06 & -0.55 & -0.35 \\
0.10 & 0.05 & 0.51 & 0.13 & -0.53 & -0.42 & 0.33 & -0.38 \\
-0.28 & 0.36 & -0.38 & -0.09 & 0.36 & -0.16 & 0.56 & -0.41 \\
-0.30 & -0.04 & 0.38 & -0.68 & -0.11 & 0.45 & 0.29 & 0.08 \\
-0.29 & -0.43 & 0.26 & 0.05 & 0.39 & -0.50 & 0.24 & 0.44 \\
-0.40 & 0.60 & -0.10 & 0.02 & -0.38 & -0.22 & -0.10 & 0.52 \\
-0.58 & -0.27 & -0.04 & 0.60 & 0.20 & 0.41 & 0.12 & -0.11
\end{bmatrix}$$

Latent Semantic Indexing
Using SVD to reduce noise
◦ Take r instead of n in matrix Σ
◦ What value of r? e.g. 90% of Frobenius norm
In this case r = 5, where r < n (n = 8)
$$A_{10 \times 8} \approx U_{10 \times 5} \, \Sigma_{5 \times 5} \, V^{T}_{5 \times 8}$$

Latent Semantic Indexing
With r = 5, the numeric matrices of the full decomposition are truncated: A stays m × n = 10 × 8 (Docs × Terms), U keeps its first 5 columns (m × r = 10 × 5), Σ keeps its leading 5 × 5 block (r × r) with diagonal 34.89, 4.63, 3.36, 2.33, 2.21, and $V^T$ keeps its first 5 rows (r × n = 5 × 8).

Latent Semantic Indexing
Value of r?
Sum ← 0
For i ← 0 to n − 1 do
    Sum ← Sum + Ʃ(i, i)
End for
Percentage ← Sum * 0.9 // 90% of Frobenius Norm
r ← 0
Temp ← 0
For i ← 0 to n − 1 do
    Temp ← Temp + Ʃ(i, i)
    r ← r + 1
    If Temp ≥ Percentage then
        break
    End if
End for
Return r
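
The rank-selection pseudocode translates almost line by line into Python. The sketch below follows the pseudocode as written, that is, it accumulates the singular values themselves until they reach 90% of their total (the Frobenius norm proper is the square root of the sum of their squares, so a criterion on squared values would select a different r). The function name and the `fraction` parameter are illustrative.

```python
def choose_rank(singular_values, fraction=0.9):
    """Smallest r whose leading singular values reach `fraction` of their sum.

    `singular_values` is the diagonal of the sorted matrix Sigma,
    from largest to smallest.
    """
    threshold = fraction * sum(singular_values)
    accumulated = 0.0
    for r, value in enumerate(singular_values, start=1):
        accumulated += value
        if accumulated >= threshold:
            return r
    return len(singular_values)

if __name__ == "__main__":
    # Diagonal of the 8 x 8 Sigma matrix from the numeric example.
    sigma = [34.89, 4.63, 3.36, 2.33, 2.21, 1.73, 1.22, 0.35]
    print(choose_rank(sigma))
```

With the diagonal of the example Σ, the accumulated sum first reaches 90% of the total (45.648 out of 50.72) at the fifth value, which agrees with r = 5 in the example.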

Latent Semantic Indexing
Retrieved documents in latent space
◦ Documents in the latent space:
$$D'_{m \times r} = A_{m \times n} \, V_{n \times r} \, \Sigma^{-1}_{r \times r}$$
◦ Terms in latent space:
$$T'_{n \times r} = (A_{m \times n})^{T} \, U_{m \times r} \, \Sigma^{-1}_{r \times r}$$

Latent Semantic Indexing
Query in the latent space:
$$q'_{1 \times r} = q_{1 \times n} \, V_{n \times r} \, \Sigma^{-1}_{r \times r}$$
Cosine similarity
$$Sim(d, q) = \frac{\sum_{i=1}^{r} W_{i,d} \times W_{i,q}}{\sqrt{\sum_{i=1}^{r} W_{i,d}^{2}} \times \sqrt{\sum_{i=1}^{r} W_{i,q}^{2}}}$$
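
Projecting documents and a query into the latent space, and comparing them there, can be sketched with plain Python lists. The data are the 10 × 8 matrix A, the first r = 5 rows of V^T and the first 5 singular values from the numeric example (rounded to two decimals, so results are approximate); the function names and the choice of query are illustrative assumptions.

```python
import math

A = [  # documents (rows) by terms (columns)
    [3, 5, 5, 5, 4, 3, 1, 2], [3, 4, 5, 4, 3, 3, 4, 3],
    [3, 4, 5, 3, 3, 4, 3, 2], [5, 5, 3, 5, 4, 5, 5, 5],
    [3, 4, 4, 4, 5, 4, 3, 4], [4, 5, 4, 5, 4, 3, 5, 2],
    [2, 5, 4, 3, 3, 3, 3, 2], [5, 5, 4, 5, 3, 5, 5, 4],
    [5, 5, 5, 5, 4, 5, 4, 4], [4, 3, 4, 4, 1, 2, 4, 3],
]
VT_R = [  # first r = 5 rows of V^T (each row is one right singular vector)
    [0.34, 0.41, 0.38, 0.39, 0.31, 0.34, 0.34, 0.29],
    [-0.35, 0.28, 0.48, 0.03, 0.38, -0.06, -0.55, -0.35],
    [0.10, 0.05, 0.51, 0.13, -0.53, -0.42, 0.33, -0.38],
    [-0.28, 0.36, -0.38, -0.09, 0.36, -0.16, 0.56, -0.41],
    [-0.30, -0.04, 0.38, -0.68, -0.11, 0.45, 0.29, 0.08],
]
SIGMA_R = [34.89, 4.63, 3.36, 2.33, 2.21]  # first r singular values

def fold_in(vector):
    """Project a 1 x n term vector: vector * V (n x r) * Sigma^-1 (r x r)."""
    return [sum(x * v for x, v in zip(vector, row)) / sigma
            for row, sigma in zip(VT_R, SIGMA_R)]

def cosine(a, b):
    """Cosine similarity between two latent-space vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_documents(query):
    """Return (document index, similarity) pairs, most similar first."""
    q_latent = fold_in(query)
    sims = [(i, cosine(fold_in(doc), q_latent)) for i, doc in enumerate(A)]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    # Using the first document as the query: it should rank itself first.
    for index, sim in rank_documents(A[0]):
        print(index, round(sim, 4))
```

Because A V = U Σ, the projected documents should be close to the first five columns of U; for example, the first coordinate of the first document comes out near 0.29.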
Web Clustering Engines

Web Clustering Engines
The search aspects where WCE can be most useful in complementing the output of plain search engines are:
◦ Fast subtopic retrieval: documents can be accessed in logarithmic rather than linear time
◦ Topic exploration: clusters provide a high-level view of the whole query topic, including terms for query reformulation (particularly useful for informational searches in unknown or dynamic domains)
◦ Alleviating information overlook: users may review hundreds of potentially relevant results without the need to download and scroll to subsequent pages

Web Clustering Engines
WDC poses new requirements and challenges for clustering technology:
◦ Meaningful labels
◦ Computational efficiency (response time)
◦ Short input data description (snippets)
◦ Unknown number of clusters
◦ Work with noisy data
◦ Overlapping clusters

General Model
[Figure: Query → Search results acquisition → Snippets → Preprocessing → Features → Cluster construction and labeling → Clusters → Visualization]

Proposed Model
[Figure: the general pipeline (Query → Search results acquisition → Snippets → Preprocessing → Features → Cluster construction and labeling → Clusters → Visualization) annotated with the proposed enrichments: Query Expansion; Concepts instead of Terms; Evolutionary approach: Online and Offline; Feedback; Taxonomy, Ontologies and User Information]

Query Expansion Process
1. A registered user submits a query (based on keywords, in a common graphical interface like Google). He/she receives on-line help (auto complete) based on his/her user profile
[Figure: step 1, Query Expansion Process. The User enters a Query by keywords and gets an Auto complete Dropdown List; the output is the Extended Query. Labeled sub-steps: 1. Pre-processing and semantic relationship; 2. Related Concepts with user profile; 3. External service. Resources: General Taxonomy of Knowledge, User Profile, Specific Ontology (0 … *) with concepts, relations (is-a, is-part-of) and instances, Inverted Index of Concepts]

Query Expansion Process (B)
1. The General Taxonomy of Knowledge (GTK) and the specific ontologies are multilingual (collaborative editing process)
2. The user profile has:
• Nodes from GTK used by the user
• A relation with the Inverted Index of Concepts (ontologies), to support the rating process:
  • Manage concepts that have been previously evaluated for a specific ontology (good/bad)
[Figure: step 1, Query Expansion Process. User → Query by keywords → Extended Query, supported by the General Taxonomy of Knowledge]

Term-Document Matrix - Observed Frequency - TDM-OF Building Process
Extended query: original keywords + other concepts + selected nodes from GTK (ontologies)
In parallel (independent threads), each web search result is processed:
1. Pre-processing
• Tokenization
• Filters (special characters and lower case)
• Stop words removal
• Define the language
• Stemming (English / Spanish)
2. For each document, accumulate the observed frequency of each term
3. Mark the document as processed
[Figure: step 2, TDM-OF Building Process. Results from the Google API, Yahoo! API and Bing API are processed by independent threads into the Term-Document Matrix (Observed Frequency)]
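
A minimal sketch of the TDM-OF building step, assuming the search results are already available as a list of snippet strings. Each snippet is pre-processed in its own worker thread and its observed term frequencies are accumulated; language detection and stemming are omitted, and the stop list, the function names and the `workers` parameter are hypothetical.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

STOP_WORDS = {"the", "a", "of", "on", "and", "in", "to"}  # hypothetical list

def term_frequencies(snippet):
    """Tokenize, lower-case, drop stop words and count the remaining terms."""
    tokens = re.findall(r"[a-z0-9]+", snippet.lower())
    return Counter(token for token in tokens if token not in STOP_WORDS)

def build_tdm_of(snippets, workers=3):
    """Return (vocabulary, matrix) with one row of frequencies per snippet."""
    # Each snippet is processed independently; map() keeps the input order
    # and leaving the `with` block waits for every worker to finish.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(term_frequencies, snippets))
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix

if __name__ == "__main__":
    vocabulary, matrix = build_tdm_of([
        "The house passed a bill",
        "Bill on the house budget, the bill",
    ])
    print(vocabulary)
    print(matrix)
```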

Concept-Document Matrix - Observed Frequency - CDM-OF Building Process
In parallel, for each document marked as processed:
1. Join terms belonging to the same concept in the selected specific ontologies (from the extended query)
2. Accumulate the observed frequency of the terms that were joined into the same concept
3. End this process when all web search results are processed (thread synchronization)
[Figure: step 3, CDM-OF Building Process. Uses the Specific Ontology and, after thread synchronization, produces the Concept-Document Matrix (Observed Frequency)]
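
Joining terms into concepts can be illustrated with a plain dictionary standing in for the selected specific ontologies. This is a sketch under assumptions: the `TERM_TO_CONCEPT` mapping and the function names are hypothetical, and terms with no concept in the mapping are kept under their own name, which the slides do not specify.

```python
from collections import Counter

# Hypothetical mapping from (stemmed) terms to ontology concepts.
TERM_TO_CONCEPT = {
    "car": "vehicle",
    "automobile": "vehicle",
    "truck": "vehicle",
    "physician": "doctor",
}

def concept_frequencies(term_counts, term_to_concept=TERM_TO_CONCEPT):
    """Accumulate the frequencies of the terms that share a concept."""
    concepts = Counter()
    for term, frequency in term_counts.items():
        # Assumption: terms without a concept are kept as they are.
        concepts[term_to_concept.get(term, term)] += frequency
    return concepts

def build_cdm_of(documents_term_counts):
    """Return (concepts, matrix) with one row of frequencies per document."""
    counts = [concept_frequencies(tc) for tc in documents_term_counts]
    concepts = sorted(set().union(*counts))
    matrix = [[count[concept] for concept in concepts] for count in counts]
    return concepts, matrix

if __name__ == "__main__":
    concepts, matrix = build_cdm_of([
        {"car": 2, "automobile": 1, "road": 1},
        {"truck": 1, "physician": 3},
    ])
    print(concepts)
    print(matrix)
```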

Concept-Document Matrix (CDM) Building Process
1. Calculate weight (TF-IDF) of concepts in documents
$$w_{i,j} = \frac{freq_{i,j}}{\max(freq_i)} \times \log\left(\frac{N}{n_j + 1}\right)$$
[Figure: step 4, building process that takes the observed frequencies and produces the Concept-Document Matrix (CDM)]

Clustering Process
Three algorithms of our own:
1. A hybridization of the Global-Best Harmony Search with the K-means algorithm
2. A Memetic Algorithm with niching techniques (restricted competition replacement and restrictive mating)
3. A Memetic Algorithm (roulette wheel, K-means, and replace the worst)
All algorithms:
4. Define the number of clusters automatically (BIC)
5. Can use a standard Term-Document Matrix (TDM), Frequent Term-Document Matrix (FTDM), Concept-Document Matrix (CDM) or Frequent Concept-Document Matrix (FCDM)
6. Test with data sets based on Reuters-21578 and DMOZ
7. Test by users
[Figure: step 5, Clustering Process, producing Clustered Documents]
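
All three algorithms rely on K-means as a local optimizer. The sketch below shows only that K-means component on small dense vectors; it is not the Global-Best Harmony Search or memetic hybridization, and it does not select the number of clusters with BIC. Squared Euclidean distance, the caller-supplied initial centroids and the function names are assumptions made for the illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def k_means(vectors, centroids, max_iterations=100):
    """Plain K-means starting from the given centroids.

    Returns (assignment, centroids), where assignment[i] is the index of
    the cluster of vectors[i].
    """
    centroids = [list(c) for c in centroids]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(v, centroids[k]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no vector changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = mean(members)
    return assignment, centroids

if __name__ == "__main__":
    rows = [[0, 0], [0, 1], [10, 10], [10, 11]]
    print(k_means(rows, centroids=[[0, 0], [10, 10]]))
```

In the setting of the slides, the rows would be the weighted document vectors of a TDM or CDM rather than the toy points used here.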

Labeling Process
Statistically Representative Terms:
1. Initialize algorithm parameters
2. Building of the "Others" label and cluster
3. Candidate label induction
4. Eliminate repeated terms
5. Visual improvement of labels
Frequent Phrases:
6. Conversion of the representation
7. Document concatenation
8. Complete phrase discovery
9. Final selection
10. Building of the "Others" label and cluster
11. Cluster label induction
Overlapping clusters
[Figure: step 6, Labeling Process, producing Clustered Documents and Labeled]

Visualization and Rating Process
On experimentation → for each cluster, the user answered:
• (Q1) whether the cluster label is in general representative of the cluster (much, little, or nothing)
• (Q2) whether the cluster is useful, moderately useful or useless
Then, for each document in each cluster, the user answered:
• (Q3) whether the document matches the cluster (very well matching, moderately matching, or not-matching)
• (Q4) whether the document relevance (location) in the cluster was adequate (adequate, moderately suitable, or inadequate)
[Figure: Visualization and Rating Process. Elements: Clustered Documents and Labeled, User Profile]

Visualization and Rating Process
On production → the user can answer if each document is useful (relevant) or not
[Figure: Visualization and Rating Process. Elements: Clustered Documents and Labeled, User Profile, General Taxonomy of Knowledge, Specific Ontology (0 … *), Inverted Index of Concepts]
Proposed model

Collaborative Editing Process of Ontologies
1. Select node (ontology associated)
2. Edit the ontology: concepts, synonyms in different languages, relations, instances
3. Supported by general ontologies
4. Supported by concepts used by the user
5. Update index automatically when saving
[Figure: the Editor works on the General Taxonomy of Knowledge and the Specific Ontology (0 … *). Other elements: WordNet, User Profile, Inverted Index of Concepts; annotation: "Can be automatically"]
Questions?