density link-based methods for clustering web pages
Density link-based methods for clustering web pages. Morteza Haghir Chehreghani, Hassan Abolhassani, Mostafa Haghir Chehreghani. DSS, 2009. Presented by Jun-Yi Wu, 2010/09/08.
Intelligent Database Systems Lab
國立雲林科技大學 National Yunlin University of Science and Technology
Density link-based methods for clustering web pages
Morteza Haghir Chehreghani, Hassan Abolhassani, Mostafa Haghir Chehreghani. DSS, 2009
Presented by Jun-Yi Wu, 2010/09/08
N.Y.U.S.T.
I. M.
Outline
· Motivation
· Objectives
· Methodology
· Experiments
· Conclusions
· Comments
Motivation
· Web information is very useful for supporting decision making, but the information explosion on the web makes it hard to obtain the required knowledge.
· Effective web clustering facilitates relevant document retrieval, which in turn facilitates decision making.
· High-quality clustering helps users access relevant information much more conveniently.
Objectives
· Use both content and link information on top of density-based algorithms.
· Density-based methods have the advantages of creating clusters of various shapes and removing noisy data.
· Propose a method that uses the web hyperlink structure to find dense units and to improve the joining process for creating hierarchical clusters.
Methodology
· This paper proposes two methods:
─ New density-based method
─ Density link-based method
Density-based Method: DBSCAN
· DBSCAN was the first density-based algorithm.
· To create a new cluster or expand an existing one, a neighborhood of radius Eps must contain at least a minimum number of points, denoted by MinPts.
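The Eps/MinPts rule can be sketched in a few lines of Python. This is a minimal illustration of standard DBSCAN, not code from the paper: the function names `region_query` and `dbscan`, the Euclidean distance, and the sample points are all assumptions chosen for the example (the paper works on web documents and similarities rather than 2-D points).

```python
import math

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels

# Hypothetical sample data: two tight groups and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(sample, eps=1.5, min_pts=3)
```

With these sample values, the two groups of three points become separate clusters and the isolated point is labeled as noise.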
New density-based method
· Clustering web data using only textual contents of documents.
· Extending the basic algorithm to use hyperlinks between the web documents.
New density-based method
· The method has some limitations, including:
─ A constant value for mutation to a higher level is not appropriate. A smaller value may be appropriate for smaller clusters, but larger ones must take larger values.
─ It is developed for web data clustering, but it doesn't use the hyperlink structure of the web.
─ Setting accurate values for the parameters of the proposed method may be difficult.
Link-based algorithm
· Hyperlink structure brings some interesting ideas:
─ In combination with text content, it can help construct hierarchical clusters, with the link-based clusters as the base clusters.
─ Link structure can be a good hint for finding dense units.
Link-based algorithm
· First step: finding dense units
─ A sub-hyperlink structure is an LD_Unit if, for each core node N inside the unit, there is a subset of N's neighbors such that (i) it has at most MaxN members and (ii) the sum of the similarities between N and the nodes of this subset is at least W.
· Second step: joining dense units
─ Node a is said to be an external node of b if a and b do not exist in the same LD_Unit.
Example: W=2.5 and MaxN=4
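The core-node condition and the external-node definition can be illustrated with a short Python sketch. It is not the authors' implementation: the function names, the toy link graph, and the similarity values are hypothetical, and the check assumes similarities are non-negative, so that the best subset of at most MaxN neighbors is simply the MaxN most similar ones.

```python
def is_core_node(node, links, similarity, w, max_n):
    """True if some subset of at most max_n neighbors of node has total similarity >= w.

    Assumes non-negative similarities, so it is enough to test the
    max_n most similar neighbors.
    """
    sims = sorted((similarity(node, nb) for nb in links[node]), reverse=True)
    return sum(sims[:max_n]) >= w

def is_external(a, b, units):
    """True if a and b do not appear together in any LD_Unit."""
    return not any(a in unit and b in unit for unit in units)

# Hypothetical toy data: page "a" links to three similar pages; "e" is loosely attached.
links = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a", "e"], "e": ["d"]}
pair_sim = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.9,
    frozenset(("a", "d")): 0.8,
    frozenset(("d", "e")): 0.2,
}

def similarity(x, y):
    return pair_sim[frozenset((x, y))]

# Same parameter values as the example: W=2.5 and MaxN=4.
a_is_core = is_core_node("a", links, similarity, w=2.5, max_n=4)
b_is_core = is_core_node("b", links, similarity, w=2.5, max_n=4)
units = [{"a", "b", "c", "d"}]
e_external_to_a = is_external("e", "a", units)
```

Here "a" qualifies as a core node because its three neighbors have a similarity sum of 2.6, while "b" does not because its only neighbor contributes 0.9. Lowering MaxN to 2 would also disqualify "a", since its two best neighbors sum to only 1.8.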
Experiments
· Comparison of density based algorithms from different aspects.
Experiments
· Use of the density based method for clustering web pages.
Experiments
· Examination of link-based method for clustering web pages
Conclusions
· The proposed method has the advantage of low complexity (O(n log n)), and the resulting clusters are of high quality.
· The results reveal that the link-based method has some advantages over the density-based method.
Comments
· Advantages
─ Low complexity
─ High quality
· Applications
─ Data clustering