Density link-based methods for clustering web pages

Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology. Morteza Haghir Chehreghani, Hassan Abolhassani, Mostafa Haghir Chehreghani, DSS, 2009. Presented by Jun-Yi Wu, 2010/09/08.

Page 1: Density link-based methods for clustering web pages

Intelligent Database Systems Lab

國立雲林科技大學 National Yunlin University of Science and Technology

Density link-based methods for clustering web pages

Morteza Haghir Chehreghani, Hassan Abolhassani, Mostafa Haghir Chehreghani
DSS, 2009

Presented by Jun-Yi Wu
2010/09/08

Page 2: Density link-based methods for clustering web pages

Outline

· Motivation
· Objectives
· Methodology
· Experiments
· Conclusions
· Comments

Page 3: Density link-based methods for clustering web pages

Motivation

· Web information is very useful for supporting decision making, but the information explosion on the web makes it hard to obtain the required knowledge.

· Effective web clustering makes relevant documents easier to retrieve, which in turn supports decision making.

· High-quality clustering helps users access relevant information much more conveniently.

Page 4: Density link-based methods for clustering web pages

Objectives

· Use both content and link information on top of density-based algorithms.

· Density-based methods have the advantages of creating clusters of various shapes and removing noisy data.

· Propose a method that uses the web hyperlink structure to find the dense units and to improve the joining process that creates hierarchical clusters.

Page 5: Density link-based methods for clustering web pages

Methodology

· The paper proposes two methods:
─ New density-based method
─ Density link-based method

Page 6: Density link-based methods for clustering web pages

Density-based Method: DBSCAN

· DBSCAN was the first density-based algorithm.

· To create a new cluster or expand an existing one, the neighborhood of radius Eps around a point must contain at least a minimum number of points, denoted by MinPts.
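The Eps/MinPts rule can be illustrated with a minimal sketch of textbook DBSCAN. This is not the paper's new density-based method. It assumes points are numeric tuples compared by Euclidean distance, and the names `dbscan`, `region_query`, and `NOISE` are chosen here for illustration only.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label for points that end up in no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (i included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # not dense enough to start a cluster
            continue
        cluster += 1  # points[i] is a core point: create a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:  # expand the cluster through dense neighborhoods
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned to some cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point as well
                queue.extend(j_neighbors)
    return labels


# Two tight groups of four points and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(sample, eps=1.5, min_pts=3))
```

The neighborhood search here is a linear scan, so the sketch runs in quadratic time. Practical implementations usually replace `region_query` with a spatial index.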

Page 7: Density link-based methods for clustering web pages

New density-based method

· The basic algorithm clusters web data using only the textual content of the documents.

· The basic algorithm is then extended to use the hyperlinks between the web documents.

Page 8: Density link-based methods for clustering web pages

New density-based method

· The method has some limitations, including:
─ A constant value for mutation to a higher level is not appropriate. A smaller value may be appropriate for smaller clusters, but larger ones must take larger values.
─ It was developed for web data clustering, but it does not use the hyperlink structure of the web.
─ Setting accurate values for the parameters of the proposed method may be difficult.

Page 9: Density link-based methods for clustering web pages

Link-based algorithm

· The hyperlink structure brings some interesting ideas:
─ Combined with text content, it can help to construct hierarchical clusters, with the link-based clusters as the base clusters.
─ The link structure is a good hint for finding dense units.

Page 10: Density link-based methods for clustering web pages

Link-based algorithm

· First step: finding dense units
─ A sub-hyperlink structure is an LD_Unit if, for each core node N inside the unit, there is a subset of N's neighbors such that (1) it has at most MaxN members and (2) the sum of the similarities between N and the nodes of this subset is at least W.

· Second step: joining dense units
─ Node a is said to be an external node of b if a and b do not exist in the same LD_Unit.

Example: W=2.5 and MaxN=4
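The two definitions above can be sketched in a few lines. This is an illustrative reading, not the authors' implementation. The function names, the `similarity` callback, and the representation of LD_Units as a list of node sets are all assumptions made here. The check relies on the observation that the best subset of at most MaxN neighbors is simply the MaxN most similar ones, ignoring non-positive similarities.

```python
import heapq


def satisfies_core_condition(node, neighbors, similarity, w, max_n):
    """True if some subset of at most max_n neighbors has similarity sum >= w.

    `similarity(node, other)` is a hypothetical callback returning a number.
    The largest achievable sum uses the max_n largest positive similarities.
    """
    sims = (similarity(node, other) for other in neighbors)
    best = heapq.nlargest(max_n, (s for s in sims if s > 0))
    return sum(best) >= w


def is_external(a, b, units):
    """True if no LD_Unit (given as a set of nodes) contains both a and b."""
    return not any(a in unit and b in unit for unit in units)


# Hypothetical similarities between a node "N" and its link neighbors.
sims_to_n = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.4, "E": 0.1}


def lookup(node, other):
    # Stand-in similarity function for this example only.
    return sims_to_n[other]


print(satisfies_core_condition("N", list(sims_to_n), lookup, w=2.5, max_n=4))
print(is_external("N", "C", [{"N", "A", "B"}, {"C", "D"}]))
```

With W=2.5 and MaxN=4, node N passes because its four most similar neighbors sum to about 2.8. With MaxN=2 the best achievable sum is about 1.7, so the same node would fail.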

Page 11: Density link-based methods for clustering web pages

Experiments

· Comparison of density-based algorithms from different aspects.

Page 12: Density link-based methods for clustering web pages

Experiments

· Use of the density-based method for clustering web pages.

Page 13: Density link-based methods for clustering web pages

Experiments

· Examination of the link-based method for clustering web pages.

Page 14: Density link-based methods for clustering web pages

Conclusions

· The proposed method has the advantage of low complexity, O(n log n), and the resulting clusters are of high quality.

· The link-based method has some advantages over the density-based method.

Page 15: Density link-based methods for clustering web pages

Comments

· Advantages
─ Low complexity
─ High quality

· Applications
─ Data clustering