Mining web data: Techniques for understanding the user behavior in the Web

Juan D. Velásquez Silva
PhD Information Engineering, University of Tokyo, Japan
[email protected]
http://wi.dii.uchile.cl
Department of Industrial Engineering, University of Chile

Mining web data, Juan D. Velásquez © 2006 (http://wi.dii.uchile.cl/)



Outline

• Motivation.
• Web data.
• Web mining.
• Applications.
• Summary.

1.- Motivation

The new sales window

• Markets are moving towards business virtualization.
• Many companies have begun to use the Web as a new channel with massive reach and worldwide coverage.
• How can we improve our web site? By understanding what describes the user behavior [Brusilovsky96].

OK, but how can we do it?

• Analyzing the user browsing behavior.
• Analyzing the web page preferences.
• Analyzing the user profile.

In fact, “extracting knowledge from web data” … by applying web mining techniques.

2.- Data originated in the Web: Web Data

Web server and web browser

Web logs

# IP Id Access Time Method/URL/Protocol Status Bytes Referer Agent

1 165.182.168.101 - - 16/06/2002:16:24:06 GET p1.htm HTTP/1.1 200 3821 out.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)

2 165.182.168.101 - - 16/06/2002:16:24:10 GET A.gif HTTP/1.1 200 3766 p1.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)

3 165.182.168.101 - - 16/06/2002:16:24:57 GET B.gif HTTP/1.1 200 2878 p1.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)

4 204.231.180.195 - - 16/06/2002:16:32:06 GET p3.htm HTTP/1.1 304 0 - Mozilla/4.0 (MSIE 6.0; Win98)

5 204.231.180.195 - - 16/06/2002:16:32:20 GET C.gif HTTP/1.1 304 0 - Mozilla/4.0 (MSIE 6.0; Win98)

6 204.231.180.195 - - 16/06/2002:16:34:10 GET p1.htm HTTP/1.1 200 3821 p3.htm Mozilla/4.0 (MSIE 6.0; Win98)

7 204.231.180.195 - - 16/06/2002:16:34:31 GET A.gif HTTP/1.1 200 3766 p1.htm Mozilla/4.0 (MSIE 6.0; Win98)

8 204.231.180.195 - - 16/06/2002:16:34:53 GET B.gif HTTP/1.1 200 2878 p1.htm Mozilla/4.0 (MSIE 6.0; Win98)

9 204.231.180.195 - - 16/06/2002:16:38:40 GET p2.htm HTTP/1.1 200 2960 p1.htm Mozilla/4.0 (MSIE 6.0; Win98)

10 165.182.168.101 - - 16/06/2002:16:39:02 GET p1.htm HTTP/1.1 200 3821 out.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)

11 165.182.168.101 - - 16/06/2002:16:39:15 GET A.gif HTTP/1.1 200 3766 p1.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)

12 165.182.168.101 - - 16/06/2002:16:39:45 GET B.gif HTTP/1.1 200 2878 p1.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)

13 165.182.168.101 - - 16/06/2002:16:39:58 GET p2.htm HTTP/1.1 200 2960 p1.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)

14 165.182.168.101 - - 16/06/2002:16:42:03 GET p3.htm HTTP/1.1 200 4036 p2.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)

15 165.182.168.101 - - 16/06/2002:16:42:07 GET p2.htm HTTP/1.1 200 2960 p1.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)

16 165.182.168.101 - - 16/06/2002:16:42:08 GET C.gif HTTP/1.1 200 3423 p2.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)

17 204.231.180.195 - - 16/06/2002:17:34:20 GET p3.htm HTTP/1.1 200 2342 out.htm Mozilla/4.0 (MSIE 6.0; Win98)

18 204.231.180.195 - - 16/06/2002:17:34:48 GET C.gif HTTP/1.1 200 3423 p2.htm Mozilla/4.0 (MSIE 6.0; Win98)

19 204.231.180.195 - - 16/06/2002:17:35:45 GET p4.htm HTTP/1.1 200 3523 p3.htm Mozilla/4.0 (MSIE 6.0; Win98)

20 204.231.180.195 - - 16/06/2002:17:35:56 GET D.gif HTTP/1.1 200 3231 p4.htm Mozilla/4.0 (MSIE 6.0; Win98)

21 204.231.180.195 - - 16/06/2002:17:36:06 GET E.gif HTTP/1.1 404 0 p4.htm Mozilla/4.0 (MSIE 6.0; Win98)
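
Before any mining, each row of a log like the one above has to be split into its fields. The following is a minimal sketch, assuming the whitespace-separated layout shown in the table (row number, IP, two identity fields, timestamp, method, URL, protocol, status, bytes, referer, agent). The function name `parse_log_line` and the dictionary keys are illustrative, and real server logs usually need a more tolerant parser.

```python
from datetime import datetime


def parse_log_line(line):
    """Split one row of the example log into named fields.

    Assumes the layout of the table above; the agent string is the only
    field that may contain spaces, so it is taken as the remainder.
    """
    parts = line.split(maxsplit=11)
    return {
        "n": int(parts[0]),
        "ip": parts[1],
        # parts[2] and parts[3] are the Id and Access columns ("-" in the example)
        "time": datetime.strptime(parts[4], "%d/%m/%Y:%H:%M:%S"),
        "method": parts[5],
        "url": parts[6],
        "protocol": parts[7],
        "status": int(parts[8]),
        "bytes": int(parts[9]),
        "referer": parts[10],
        "agent": parts[11],
    }


row = parse_log_line(
    "1 165.182.168.101 - - 16/06/2002:16:24:06 GET p1.htm HTTP/1.1 "
    "200 3821 out.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)"
)
print(row["ip"], row["url"], row["status"], row["agent"])
```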

Web site contents

You can put anything you want on a Web page, from family to business info…

Hyperlink structure

Nature of web data

• Content. The web page content, i.e., pictures, free text, sounds, etc.
• Structure. Data that show the internal web page structure. In general, they are HTML or XML tags, some of which contain information about hyperlink connections with other web pages.
• Usage. Data that describe the visitor preferences while browsing a web site. They can be found in web log files.
• User profile. A collection of information about a user, combining personal information (e.g. name, age), usage information (e.g. pages visited) and interests.

Understanding the visitor behavior in a web site

• Visitor browsing behavior: web logs.
• Visitor preferences: web pages.
• Problems:
  • Web logs contain a lot of irrelevant data.
  • A web site is a huge collection of heterogeneous, unlabelled, distributed, time-variant, semi-structured and high-dimensional data.

The dream

“Transform the visitors into customers and retain the existing ones”

Some solutions:
• Continuous improvement of the web site structure and content.
• Personalization of the relationship between the user and the web site.
• Understanding the user behavior in the web site.

Data cleaning and preprocessing

• Web logs. Apply the sessionization process to identify the real visitor sessions.
• Web pages. Vector space model: transform the web pages into feature vectors.
• Web hyperlink structure. Identify social networks.

WUM – Heuristics

The session identification heuristics:
• Timeout: if the time between page requests exceeds a certain limit, it is assumed that the user is starting a new session.
• IP/Agent: each different agent type for an IP address represents a different session.
• Referring page: if the referring page file for a request is not part of an open session, it is assumed that the request is coming from a different session.
• Same IP-Agent/different sessions (Closest): assigns the request to the session that is closest to the referring page at the time of the request.
• Same IP-Agent/different sessions (Recent): in the case where multiple sessions are the same distance from a page request, assigns the request to the session with the most recent referrer access in terms of time.
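
The first two heuristics (timeout and IP/Agent) can be sketched in a few lines. This is a minimal sketch, assuming the log has already been reduced to page views given as `(ip, agent, time, page)` tuples; the function name `sessionize`, the tuple layout and the helper `t` are illustrative assumptions. The referrer-based heuristics are not implemented here.

```python
from datetime import datetime, timedelta


def sessionize(requests, timeout=timedelta(minutes=30)):
    """Group page views into sessions using the timeout and IP/Agent heuristics.

    A new session starts when an (ip, agent) pair is seen for the first time
    or when the gap to that visitor's previous request exceeds `timeout`.
    """
    latest = {}     # (ip, agent) -> most recent session of that visitor
    sessions = []
    for ip, agent, when, page in sorted(requests, key=lambda r: r[2]):
        visitor = (ip, agent)
        session = latest.get(visitor)
        if session is None or when - session["end"] > timeout:
            session = {"visitor": visitor, "pages": [], "end": when}
            latest[visitor] = session
            sessions.append(session)
        session["pages"].append(page)
        session["end"] = when
    return sessions


def t(clock):
    # Hypothetical helper: build a timestamp on the day used in the example log.
    return datetime.strptime("16/06/2002:" + clock, "%d/%m/%Y:%H:%M:%S")


ip, agent = "204.231.180.195", "Mozilla/4.0 (MSIE 6.0; Win98)"
requests = [
    (ip, agent, t("16:32:06"), "p3.htm"),
    (ip, agent, t("16:34:10"), "p1.htm"),
    (ip, agent, t("16:38:40"), "p2.htm"),
    (ip, agent, t("17:34:20"), "p3.htm"),
    (ip, agent, t("17:35:45"), "p4.htm"),
]
sessions = sessionize(requests)
for s in sessions:
    print(s["visitor"][0], s["pages"])
```

With a 30-minute timeout, the gap between 16:38:40 and 17:34:20 splits this visitor's page views into two sessions.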

Sessionization - Example
Based on examples in tutorial PKDD-2002 by Berendt et al.

Sessionization - Example (2)
Sort the users (IP+Agent).

Sessionization - Example (3)
Sessionize using heuristics (h1 with 30 min).
The h1 heuristic (timeout = 30 min) results in two sessions.

Sessionization - Example (4)
Sessionize using heuristics (with href).
By using the referrer-based heuristics, we have only a single session.

Sessionization - Example (5)
Path completion.

Web text content

• Among the different kinds of web page content, free text receives special attention.
• For the moment, searching is performed by using keywords.
• The text information must be represented as a feature vector before a mining process is applied.
• The representation must consider that the words in a web page don't all have the same importance.

Web text content: filters

• Some words carry no information (articles, prepositions, conjunctions, etc.).
• A cleaning process is necessary:
  • HTML tags.
  • Stop words (e.g. pronouns, prepositions, conjunctions, etc.).
  • Word stemming. Process of suffix removal, to generate word stems.
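
The three cleaning steps can be sketched as a small pipeline. This is a minimal sketch under loud assumptions: the stop-word list is a tiny illustrative sample, and `strip_suffix` is a crude stand-in for a real stemming algorithm, not the stemmer used by the author.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "or", "to", "is", "for"}
# Crude suffix list standing in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def strip_suffix(word):
    """Remove the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_page(html):
    """Apply the three filters: HTML tags, stop words, suffix removal."""
    text = re.sub(r"<[^>]+>", " ", html)           # drop HTML tags
    words = re.findall(r"[a-z]+", text.lower())    # keep alphabetic tokens
    words = [w for w in words if w not in STOP_WORDS]
    return [strip_suffix(w) for w in words]


print(clean_page("<p>The visitor <b>browsed</b> the linked documents</p>"))
```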

Processing the web site: Vector space model [Salton75]

• The model associates a weight to each word in the page, based on its frequency in the whole web site.
• Let $n_i$ be the number of pages containing word $i$ and $Q$ the number of pages. A simple estimation of the relevance of a word is

$$w_{i,j} = \frac{n_i}{Q}$$

• The inverse document frequency can be used as a weight:

$$IDF = \log\left(\frac{Q}{n_i}\right)$$

• A variation of the last expression is known as TF*IDF, where $f_{ij}$ is the frequency of word $i$ in page $j$:

$$TF*IDF = f_{ij} \cdot \log\left(\frac{Q}{n_i}\right)$$
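
The TF*IDF weighting can be computed directly from tokenized pages. This is a minimal sketch, assuming each page is already a list of cleaned words and that the logarithm is the natural one (the slides do not state the base); the function name `tfidf` and the toy pages are illustrative.

```python
import math


def tfidf(pages):
    """Return (vocabulary, weights), with weights[i][j] = f_ij * log(Q / n_i).

    pages: list of word lists, one list per page.
    """
    Q = len(pages)
    vocabulary = sorted({word for page in pages for word in page})
    weights = []
    for word in vocabulary:
        n_i = sum(1 for page in pages if word in page)   # pages containing the word
        idf = math.log(Q / n_i)                          # natural log is an assumption
        weights.append([page.count(word) * idf for page in pages])
    return vocabulary, weights


pages = [["web", "mining", "web"], ["web", "log"], ["log", "session"]]
vocabulary, M = tfidf(pages)
for word, row in zip(vocabulary, M):
    print(word, [round(w, 3) for w in row])
```

A word that appears in every page gets weight zero, since log(Q/Q) = 0, which is the intended effect of the inverse document frequency.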

Web page: vectorial representation

• Its vectorial representation would be a matrix of R×Q.
• Q is the number of pages in the web site and R is the number of different words in P.

     Word      1   2   …   Q
1    advise    1   0   …   1
2    business  0   1   …   0
.    …         .   .   …   .
R    zambia    1   0   …   0

Vector space model: a graphic example

Comparing web pages

$$M = (m_{ij}), \qquad m_{ij} = f_{ij} \cdot \log\left(\frac{Q}{n_i}\right)$$

$$p_i \rightarrow (m_{1i}, \ldots, m_{Ri}), \qquad p_j \rightarrow (m_{1j}, \ldots, m_{Rj})$$

$$dp(p_i, p_j) = \cos\theta = \frac{\sum_{k=1}^{R} m_{ki}\, m_{kj}}{\sqrt{\sum_{k=1}^{R} (m_{ki})^2}\; \sqrt{\sum_{k=1}^{R} (m_{kj})^2}}$$

where $\theta$ is the angle between the vectors $p_i$ and $p_j$.
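
The cosine measure between two page vectors can be sketched as follows. This is a minimal sketch; the function name `cosine_similarity` is illustrative, and returning 0.0 for an all-zero vector is a convention chosen here, not something stated in the slides.

```python
import math


def cosine_similarity(p_i, p_j):
    """Cosine of the angle between two equally long weight vectors."""
    dot = sum(a * b for a, b in zip(p_i, p_j))
    norm_i = math.sqrt(sum(a * a for a in p_i))
    norm_j = math.sqrt(sum(b * b for b in p_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0   # convention chosen here for empty pages
    return dot / (norm_i * norm_j)


print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # no shared words
```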

Vector Model, Summarized

• The best term-weighting schemes use tf-idf weights:
  wij = f(i,j) * log(Q/ni)
• For the query term weights, a suggestion is:
  wiq = (0.5 + [0.5 * freq(i,q) / max(freq(l,q))]) * log(Q/ni)
• This model is very good in practice:
  • tf-idf works well with general collections.
  • Simple and fast to compute.
  • The vector model is usually as good as the known ranking alternatives.

Pros & Cons of Vector Model

Advantages:
• Term-weighting improves the quality of the answer set.
• Partial matching allows retrieval of docs that approximate the query conditions.
• The cosine ranking formula sorts documents according to their degree of similarity to the query.

Disadvantages:
• Assumes independence of index terms; it is not clear whether this is a good or bad assumption.

Web hyperlinks structure

• Why does a web page point to another one?
• Link analysis: use the link structure to determine credibility.
• If a web page is pointed to by other ones, maybe it is because the page contains relevant information.
• We can understand the formation of a web community.
• We can improve our web site.

Processing the hyperlinks structure

Processing the hyperlinks structure (2)

Identifying which pages contain more relevant information than others [Kleinberg99]:
• Authorities. A natural information repository for the community.
• Hubs. These concentrate links to authority web pages, for instance, “my favorite sites”.

$$x_p = \sum_{q \text{ such that } q \rightarrow p} y_q, \qquad y_p = \sum_{q \text{ such that } p \rightarrow q} x_q$$

where $x_p$ is the authority weight and $y_p$ the hub weight of page $p$.
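
The two update rules can be iterated until the weights stabilize. This is a minimal sketch, assuming the link structure is given as a dictionary from each page to the pages it points to; the function names, the fixed iteration count and the toy graph are illustrative. The normalization step after each round comes from Kleinberg's formulation and is not shown on the slide.

```python
import math


def normalize(weights):
    """Scale a weight dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    return {page: v / norm for page, v in weights.items()}


def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it points to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    authority = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # x_p: sum of the hub weights of the pages q that point to p
        authority = {
            p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages
        }
        # y_p: sum of the authority weights of the pages q that p points to
        hub = {p: sum(authority[q] for q in links.get(p, ())) for p in pages}
        authority, hub = normalize(authority), normalize(hub)
    return authority, hub


# Hypothetical toy graph: two pages linking to shared targets.
links = {"favorites": ["docs", "news"], "blog": ["docs"]}
authority, hub = hits(links)
print(max(authority, key=authority.get), max(hub, key=hub.get))
```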

4.- Web mining

Data mining techniques

• Association rules.
• Clustering.
• Classification.
• Prediction.
• Probabilistic models.
• Regression.

What happens with web data?

“The web is a huge collection of heterogeneous, unlabelled, distributed, time variant, semi-structured and high dimensional data”

S.K. Pal 2002

Web Mining Taxonomy [Jooshi00,Lu03]

Web Mining
• Web Content Mining
• Web Usage Mining
• Web Structure Mining

Web Structure Mining

• It deals with the mining of the web hyperlink structure (inter-document structure).
• A web site is represented by a graph of its links, within the site or between sites.
• Facts like the popularity of a web page can be studied, for instance, whether a page is referred to by a lot of other pages in the web.
• The web link structure allows the development of a notion of hyperlinked communities.
• It can be used by search engines, like Google or AltaVista, in order to get the set of pages most cited for a particular subject.

WSM (2)

• Finding authoritative web pages:
  • Retrieving pages that are not only relevant, but also of high quality, or authoritative on the topic.
• Hyperlinks can infer the notion of authority:
  • The Web consists not only of pages, but also of hyperlinks pointing from one page to another.
  • These hyperlinks contain an enormous amount of latent human annotation.
  • A hyperlink pointing to another web page can be considered as the author's endorsement of the other page.

Google’s PageRank (Brin/Page 98)

• Mine the structure of the web graph independently of the query!
  • Each web page is a node, each hyperlink is a directed edge.
• Assumes a random walk (surf) through the web:
  • Start at a random page.
  • At each step, the surfer proceeds
    • to a randomly chosen web page with probability 1 - d;
    • to a randomly chosen successor of the current page with probability d.
• The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.

PageRank Algorithm

The assumption is that the importance of a page is given by the importance of the pages that point to it.

$$x_p^{(k+1)} = \frac{1-d}{n} + d \sum_{q \in P,\; q \rightarrow p} \frac{x_q^{(k)}}{N_q}$$

where
• $x_p$ is the importance of page p,
• $N_q$ is the number of out-links from page q,
• $n$ is the number of pages in the web graph,
• $x_q$ is the importance of page q,
• $d$ is the probability that the surfer follows one of the out-links of q when visiting that page.

PageRank Algorithm (2)

Using a matrix representation, the PageRank equation is rewritten as:

$$X^{(k+1)} = (1-d)\,D + d\,M X^{(k)}$$

where

$$X^{(k)} = \left(x_{p_1}^{(k)}, \ldots, x_{p_n}^{(k)}\right), \qquad D = \left(\frac{1}{n}, \ldots, \frac{1}{n}\right) \qquad \text{and} \qquad M = (m_{ij}), \; m_{ij} = \frac{1}{N_j}$$

with $N_j$ the number of out-links from page $j$, for $j$ such that $j \rightarrow i$ (and $m_{ij} = 0$ otherwise).

PageRank Algorithm (3)

1. Initialize $X^{(0)} = (0, \ldots, 0)$ and $D = \left(\frac{1}{n}, \ldots, \frac{1}{n}\right)$.
2. Set d, usually d = 0.85.
3. Calculate $X^{(k+1)} = (1-d)\,D + d\,M X^{(k)}$.
4. If $\|X^{(k+1)} - X^{(k)}\| < \epsilon$, stop and return $X^{(k+1)}$.
5. Else set $X^{(k)} = X^{(k+1)}$ and go to point 3.
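
The iteration X(k+1) = (1 - d) D + d M X(k) can be sketched in plain Python. This is a minimal sketch: the function name `pagerank`, the tolerance, the iteration cap and the use of the maximum norm for the stopping test are assumptions, and the four-page graph is a hypothetical example rather than the one drawn on the slides.

```python
def pagerank(M, d=0.85, eps=1e-6, max_iter=1000):
    """Iterate X <- (1 - d) * D + d * M * X starting from X = 0.

    M[i][j] = 1 / N_j if page j links to page i, else 0.
    """
    n = len(M)
    D = [1.0 / n] * n
    X = [0.0] * n
    for _ in range(max_iter):
        X_next = [
            (1 - d) * D[i] + d * sum(M[i][j] * X[j] for j in range(n))
            for i in range(n)
        ]
        # Stopping test with the maximum norm (an assumption; any norm works).
        if max(abs(a - b) for a, b in zip(X_next, X)) < eps:
            return X_next
        X = X_next
    return X


# Hypothetical graph with pages A, B, C, D (indices 0..3):
# A -> C, B -> C, B -> D, C -> D.
M = [
    [0, 0,   0, 0],   # nothing links to A
    [0, 0,   0, 0],   # nothing links to B
    [1, 0.5, 0, 0],   # A (1 out-link) and B (2 out-links) link to C
    [0, 0.5, 1, 0],   # B (2 out-links) and C (1 out-link) link to D
]
ranks = pagerank(M)
print([round(x, 4) for x in ranks])
```

For this graph the iteration settles after a few steps at approximately (0.0375, 0.0375, 0.0853, 0.126): pages without in-links keep the baseline (1 - d)/n, and page D accumulates the most importance.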

PageRank Algorithm (4)

[Example with n = 4 pages: a 4×4 matrix M with entries 0, 0.5 and 1, and $D = (0.25, 0.25, 0.25, 0.25)^T$. The row and column layout of M is not recoverable from the transcript.]

PageRank Algorithm (5)

[Example iterations: $X^{(1)} = (0.0375, 0.0375, 0.0375, 0.0375)^T$; $X^{(2)}$ and $X^{(3)}$ contain the values 0.0853, 0.0375, 0.0375 and 0.126. Figure: the four-node example graph, with nodes labeled first with 0, then with 0.0375, then with these values.]

PageRank Algorithm (6)

About the disadvantages:
• If a page is pointed to by another one, it means the page receives a vote in the PageRank calculation.
• If a page is pointed to by a lot of pages, it means that the page is important.
• “Only the good pages are pointed to by other ones”, but:
  • Reciprocal links. If page A links to page B, then page B will link to page A.
  • Link requirements. Some web pages give electronic gifts, like a program, a document, etc., if another page points to them.
  • Close-knit communities. For instance, friends and relatives who point from their pages to another friend or relative, only because of the human relationship between them.

WSM summary

• HITS and PageRank use back-links as a means of adjusting the “worthiness” or “importance” of a page.
• Both use an iterative process over matrix/vector values to reach a convergence point.
• HITS is query-dependent (thus too expensive to compute in general).
• PageRank is query-independent and considered more stable.
• Just one piece of the overall picture…

Web Content Mining

• The goal is to find useful information in the web content.
• In this sense, WCM is similar to Information Retrieval (IR).
• However, web content is not only free text; other objects like pictures, sound and movies also belong to the content.
• There are two main areas in WCM:
  • the mining of document contents (web page content mining);
  • the improvement of content search in tools like search engines (search result mining).

Issues in Web Content Mining

• Developing intelligent tools for IR
• Finding keywords and key phrases
• Discovering grammatical rules and collocations
• Hypertext classification/categorization
• Extracting key phrases from text documents
• Learning extraction models/rules
• Hierarchical clustering
• Predicting (words) relationship

WEBSOM

• It is a means for organizing miscellaneous text documents into meaningful maps for exploration and search.
• It is based on SOM (Self-Organizing Map), which automatically organizes documents into a two-dimensional grid so that related documents appear close to each other.
• http://www.cis.hut.fi/websom

WEBSOM

• All words of a document are mapped into the word category map.
• A histogram of “hits” on it is formed.
• Self-organizing semantic map: 15x21 neurons. Interrelated words that have similar contexts appear close to each other on the map.
• Self-organizing map. Largest experiments have used:
  • word-category map: 315 neurons with 270 inputs each;
  • document map: 104040 neurons with 315 inputs each;
  • training done with 1124134 documents.

WEBSOM

Sample search

• A new document or any document’s description can be used for finding related documents.
• The circle on the map denotes the location of the most representative document for the question.

How to use the Maps

• As a filter that notifies the user of interesting documents.
• As a search engine.
• As a new indexing method.

Extraction of key-text components from web pages

• The key-text components are parts of an entire document, for instance a paragraph, a phrase or even a word, that contain significant information about a particular topic, from the web site user's point of view.
• A web site keyword is “a word or possibly a set of words that make a web page more attractive for an eventual user during his visit to the web site” [Velasquez05b].
• The assumption is that there exists a correlation between the time that the user spends on a page and his/her interest in its content [Velasquez04b].
• Usually, the keywords in a web site have been related to the “most frequently used words”.
• In [Buyukkokten01] a method to extract keywords from a huge set of web pages is introduced.

WCM summary

• The vector space model is a recurrent method to represent a document as a feature vector.
• Because the set of words used in the construction of the web site can be very large, it is necessary to apply stop-word cleaning and a stemming process.
• Web page content is different from a common document: a web page contains semi-structured text, i.e., text with tags that give additional information about the text component.
• A page can also contain pictures, sounds, movies, etc.
• Sometimes the page text content is a short text or even a set of unconnected words.

Web Usage Mining (WUM)

• Also known as web log mining.
• Mining techniques to discover interesting usage patterns from the secondary data derived from the interactions of the users while surfing the web.

WUM: Considerations [Pierrakos03]

• WUM applies traditional data mining methods in order to analyze usage data.
• The sessionization process is necessary to correct the problems detected in the data.
• The goal is to discover patterns in usage data by applying different kinds of data mining techniques.
• Applications of WUM can be grouped in two main categories:
  • User modeling in adaptive interfaces, known as personalization.
  • User navigation patterns, in order to improve the web site structure.

WUM: Applications

• Target potential customers for electronic commerce
• Enhance the quality and delivery of Internet information services to the end user
• Improve Web server system performance
• Identify potential prime advertisement locations
• Facilitate personalization/adaptive sites
• Improve site design
• Fraud/intrusion detection
• Predict user’s actions (allows prefetching)

Statistics

• Several tools use conventional statistics for analyzing the user behavior in a web site (http://www.accrue.com/, http://www.netgenesis.com, http://www.webtrends.com).
• Each of them uses graphical interfaces, like histograms, with associated statistics.
  • For instance, the number of clicks per page during the last month.
• By applying conventional statistics to web logs, one can perform different kinds of analysis [Boving04].

Statistics (2)

• These basic statistics are used to improve the web site structure and content in many institutions, above all the commercial ones [Dellmann04].
• Their application is simple: there are many web traffic analysis tools, and their reports are easy to understand.
• Although statistical analysis may seem a very simple tool for mining web logs, it solves a lot of problems related to the performance of the web site.

Examples:

• 60% of clients who accessed /products/ also accessed /products/software/webminer.htm.
• 30% of clients who accessed /special-offer.html placed an online order in /products/software/.
• (Example from IBM official Olympics Site) {Badminton, Diving} ===> {Table Tennis} (α = 69.7%, σ = 0.35%)

Clustering

• Groups together a set of items having similar characteristics.
• User clusters:
  • Discover groups of users exhibiting similar browsing patterns.
• Page recommendation:
  • The user’s partial session is classified into a single cluster.
  • The links contained in this cluster are recommended.

Clustering

• Grouping users with common browsing behavior and pages with similar content [Srivastava00].
• An important element of any clustering technique is the similarity measure or method used to compare the input vectors when creating the clusters.
• In the WUM case, the similarities are constructed based on the usage data of the web site, i.e., the set of pages used or visited during the user session.
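
One simple way to compare two sessions by the set of pages they visited is the Jaccard coefficient. This is only an illustrative choice, not the similarity measure used by the author; the function name `session_similarity` and the sample sessions are assumptions.

```python
def session_similarity(session_a, session_b):
    """Jaccard coefficient between the sets of pages of two sessions."""
    a, b = set(session_a), set(session_b)
    if not a and not b:
        return 1.0   # convention chosen here: two empty sessions are identical
    return len(a & b) / len(a | b)


s1 = ["p1.htm", "p2.htm", "p3.htm"]
s2 = ["p2.htm", "p3.htm", "p4.htm"]
print(session_similarity(s1, s2))
```

A measure like this ignores the order of the visit and the time spent on each page, so it is best read as a baseline.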

WUM: Clustering example

Clusters of user browsing behavior

Association Rule Generation

• Discovers the correlations between pages that are most often referenced together in a single server session.
• Provides information such as:
  • What are the sets of pages frequently accessed together by Web users?
  • What page will be fetched next?
  • What are the paths frequently accessed by Web users?
• Association rule: A ===> B [ Support = 60%, Confidence = 80% ]
• Example: “50% of visitors who accessed URLs /diplomas.html and BI/info.html also visited webmining.html”
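
Support and confidence of a rule A ===> B can be counted directly over the sessions. This is a minimal sketch, assuming each session is a list of visited URLs; the function name `support_confidence` and the four sample sessions are made up for illustration and are not data from the slides.

```python
def support_confidence(sessions, antecedent, consequent):
    """Support and confidence of the rule antecedent ===> consequent.

    support    = share of all sessions containing both sides of the rule
    confidence = share of the sessions containing the antecedent that
                 also contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    with_a = [s for s in sessions if antecedent <= set(s)]
    with_both = [s for s in with_a if consequent <= set(s)]
    support = len(with_both) / len(sessions) if sessions else 0.0
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence


# Hypothetical sessions, for illustration only.
sessions = [
    ["/diplomas.html", "BI/info.html", "webmining.html"],
    ["/diplomas.html", "BI/info.html"],
    ["/diplomas.html", "webmining.html"],
    ["index.html"],
]
support, confidence = support_confidence(
    sessions, ["/diplomas.html", "BI/info.html"], ["webmining.html"]
)
print(support, confidence)
```

In this toy data, two sessions contain both antecedent pages and one of them also contains webmining.html, so the confidence is 50% and the support is 25%.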

Association rules

• In WUM, association rules focus mainly on the discovery of relations between pages visited by the users [Mobasher01].
• For instance, an association rule for an MBA program is
  mba/seminar.html ===> mba/speakers.html
• From a different point of view, in [Schwarzkopf01] a Bayesian network is used for defining taxonomic relations between topics shown in a web site.


4.- Applications

Improving the relationship between the web site and the user

Recommendations to modify the web site structure and content.

Web personalization (online navigation recommendations).

Intelligent web site.

Web personalization [Lu03, Mobasher00]

It is the process by which the web server and related applications dynamically customize the content for a particular user, based on information about his/her behavior in the web site.

This differs from the related concept of “customization”, where the visitor interacts with the web server through an interface to create her/his own web site, e.g., “MyBanking Page”.

Intelligent Web site [Coenen00, Perkowitz98, Velasquez04b]

It represents the main concepts behind the “new portal generation”.

They are systems that “based on the user behavior, allow to implement changes in the current web site structure and content”.

What should we do? [Mobasher01]

Modeling the visitor behavior.

A model of the visitor browsing behavior must consider the visited pages, the time spent and the page sequence.

A model of the visitor preferences must consider the content of the pages and the time spent on each of them in a session.


Visitor browsing behavior

¿ ?

Visitor browsing behavior (2)

Three variables are considered: the web page path, its content and the time spent on it by the visitor.

The visitor behavior vector is defined and a similarity measure between visitor sessions is introduced:

$$\vec{v} = [(p_1, t_1), \ldots, (p_n, t_n)]$$

where $(p_i, t_i)$ is the $i$-th component, representing the page content, its path and the percentage of time spent on the page visited.

The vector maintains the page visit order.
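A minimal sketch of how a visitor behavior vector could be built from one session. It assumes that each page is identified by an integer and that the raw input is the number of seconds spent per page; both the function name and the input format are illustrative assumptions rather than the author's implementation.

```python
from typing import List, Tuple


def behavior_vector(visits: List[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Turn (page id, seconds spent) pairs into (page id, fraction of time).

    The visit order of the input is preserved, as the slide requires.
    """
    total = sum(seconds for _, seconds in visits)
    if total <= 0:
        raise ValueError("the session must have a positive total time")
    return [(page, seconds / total) for page, seconds in visits]


# Hypothetical session: pages 1, 3 and 6 visited in that order.
v = behavior_vector([(1, 30.0), (3, 60.0), (6, 10.0)])
```

Storing the fraction of the session time rather than raw seconds makes sessions of different lengths comparable.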

Vector space model: A variation proposed

$$M = (m_{ij}), \qquad m_{ij} = f_{ij}\left(1 + \frac{sw(i)}{TR}\right)\log\left(\frac{Q}{n_i}\right)$$

sw: special words array

TR: total special words

$$p_i \rightarrow (m_{1i}, \ldots, m_{Ri}), \qquad p_j \rightarrow (m_{1j}, \ldots, m_{Rj})$$

$$dp(p_i, p_j) = \cos\theta = \frac{\sum_{k=1}^{R} m_{ki}\, m_{kj}}{\sqrt{\sum_{k=1}^{R} (m_{ki})^2}\,\sqrt{\sum_{k=1}^{R} (m_{kj})^2}}$$

where $\theta$ is the angle between the page vectors $p_i$ and $p_j$.
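The word weighting and the cosine page distance can be sketched as follows. This is an illustration under assumptions: `f_ij` is taken to be the frequency of word i in page j, `q` the number of pages, `n_i` the number of pages containing word i, and the natural logarithm is used because the slide does not state the base. The function and parameter names are hypothetical.

```python
import math
from typing import Sequence


def word_weight(f_ij: float, sw_i: float, tr: float, q: int, n_i: int) -> float:
    """m_ij = f_ij * (1 + sw(i) / TR) * log(Q / n_i), natural log assumed."""
    return f_ij * (1.0 + sw_i / tr) * math.log(q / n_i)


def page_distance(p_i: Sequence[float], p_j: Sequence[float]) -> float:
    """Cosine of the angle between two page vectors of word weights."""
    dot = sum(a * b for a, b in zip(p_i, p_j))
    norm_i = math.sqrt(sum(a * a for a in p_i))
    norm_j = math.sqrt(sum(b * b for b in p_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # convention chosen here for empty pages
    return dot / (norm_i * norm_j)
```

Despite its name, dp behaves like a similarity: it is 1 for pages whose word vectors point in the same direction and 0 for pages that share no words.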

An example of the proposed variation

Comparing the sequences [Runkler03, Velasquez04a]

The sequence of a navigation can be represented by a graph. Each page is identified by an identification number.

$$G_1 = \{1 \rightarrow 2,\; 2 \rightarrow 6,\; 2 \rightarrow 5,\; 5 \rightarrow 8\}, \qquad E(G_1) = \{1, 2, 6, 5, 8\}$$

$$G_2 = \{1 \rightarrow 3,\; 3 \rightarrow 6,\; 3 \rightarrow 7\}, \qquad E(G_2) = \{1, 3, 6, 7\}$$

We need to know how similar or different the two sequence representations are.

Notion of similarity:

[Equation: piecewise definition of $dG(G_1, G_2)$ in terms of $E(G_1)$ and $E(G_2)$; the individual cases are not recoverable from this transcript.]

Comparing sequences: An example

$$S_1 = \text{"12648"}, \quad S_2 = \text{"1367"}, \quad L(S_1, S_2) = 3$$

$$S_3 = \text{"12856"}, \quad S_4 = \text{"1367"}, \quad L(S_3, S_4) = 4$$

$$L(S_1, S_2) \begin{cases} = 0 & S_1 = S_2 \\ \in\; ]0, \max\{\|E(G_1)\|, \|E(G_2)\|\}] & \text{else} \end{cases}$$

$$dG(G_1, G_2) = 1 - \frac{2\,L(S_1, S_2)}{\|E(G_1)\| + \|E(G_2)\|} = 1 - \frac{2 \cdot 3}{5 + 4} \approx 0.3$$
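The slide does not name the function L. The values 3 and 4 in the example are what the Levenshtein (edit) distance gives for those strings, so the sketch below assumes that L is the edit distance and that the norms in the denominator are the sequence lengths. Function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def graph_distance(s1: str, s2: str) -> float:
    """dG = 1 - 2 * L(S1, S2) / (|S1| + |S2|), with L assumed to be Levenshtein."""
    if not s1 and not s2:
        return 1.0  # two empty sequences are treated as identical
    return 1.0 - 2.0 * levenshtein(s1, s2) / (len(s1) + len(s2))


d_example = graph_distance("12648", "1367")  # 1 - 2*3/(5+4)
```

Each page is encoded as one character here, which only works for single-digit page identifiers; a list of page ids would be needed for a real site.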

Comparing browsing behavior

Then the similarity measure is:

$$sm(\vec{\alpha}, \vec{\beta}) = dG(p^{h}_{\alpha}, p^{h}_{\beta})\,\frac{1}{L}\sum_{k=1}^{L} \min\left(\frac{t^{\alpha}_{k}}{t^{\beta}_{k}}, \frac{t^{\beta}_{k}}{t^{\alpha}_{k}}\right) dp(p^{c}_{\alpha,k}, p^{c}_{\beta,k})$$

where $\min\left(\frac{t^{\alpha}_{k}}{t^{\beta}_{k}}, \frac{t^{\beta}_{k}}{t^{\alpha}_{k}}\right)$ is an indicator of visitor interest, $dp(p^{c}_{\alpha,k}, p^{c}_{\beta,k})$ is the “page distance” between the content of the visited pages, and $dG$ is a “graph distance”, i.e., how similar the paths of the two sessions are.

What is she/he looking for?

Free text.

Movies.

Pictures.

Sounds.

Visitor preferences

The main focus is to understand the text preferences.

Web site keywords: the word or set of words that makes the web page more attractive to the visitor.

From the visitor behavior vector, we want to select the most important pages, assuming that the degree of importance is correlated with the percentage of time spent on each page.

Important pages vector [Velasquez05b]

We can suppose that the most important content for the visitors is included in the web pages on which they spend most of the time during their visit.

Then, in order to know the most important web page content, we can select a set of components from $v$. Let $\iota$ be the quantity selected:

$$\vartheta_{\iota}(v) = [(\rho_1, \tau_1), \ldots, (\rho_{\iota}, \tau_{\iota})]$$

with $(\rho_k, \tau_k)$ being the component in $v$ that represents the $k$-th most important page and the time spent on it.

This information is obtained by sorting the components of $v$ by time and selecting the first $\iota$ ones.
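The selection step can be sketched in a few lines. The function name is an assumption, and the slide does not say how ties in time are broken; Python's stable sort keeps tied pages in their original visit order.

```python
from typing import List, Tuple


def important_pages(v: List[Tuple[int, float]], iota: int) -> List[Tuple[int, float]]:
    """Return the iota components of v with the largest time spent."""
    if iota < 0:
        raise ValueError("iota must be non-negative")
    ranked = sorted(v, key=lambda component: component[1], reverse=True)
    return ranked[:iota]


# Hypothetical behavior vector: (page id, fraction of session time).
v = [(1, 0.3), (3, 0.6), (6, 0.1)]
top_two = important_pages(v, 2)
```

Here the page with 60% of the session time comes first and the page with only 10% is dropped.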

Similarity measure using text

$$st(\vartheta_{\iota}(\alpha), \vartheta_{\iota}(\beta)) = \frac{1}{\iota}\sum_{k=1}^{\iota} \min\left\{\frac{\tau^{\alpha}_{k}}{\tau^{\beta}_{k}}, \frac{\tau^{\beta}_{k}}{\tau^{\alpha}_{k}}\right\} * dp(\rho^{\alpha}_{k}, \rho^{\beta}_{k})$$

This equation combines the content of the most important visited pages with the time spent on each of them by means of a multiplication.

In this way we can distinguish between two pages with similar content but different times spent.
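A small sketch of this measure, assuming two important pages vectors of equal length and a page-distance function passed in as a parameter. The names `text_similarity` and `toy_dp` are hypothetical, and `toy_dp` is a stand-in for the cosine page distance so that the example stays self-contained.

```python
from typing import Callable, List, Tuple

Component = Tuple[int, float]  # (page id, fraction of time spent)


def text_similarity(a: List[Component], b: List[Component],
                    dp: Callable[[int, int], float]) -> float:
    """Mean over k of min(tau_a/tau_b, tau_b/tau_a) * dp(page_a, page_b)."""
    if not a or len(a) != len(b):
        raise ValueError("both vectors need the same non-zero length")
    total = 0.0
    for (page_a, tau_a), (page_b, tau_b) in zip(a, b):
        if tau_a <= 0 or tau_b <= 0:
            interest = 0.0  # convention chosen here for pages with no time
        else:
            interest = min(tau_a / tau_b, tau_b / tau_a)
        total += interest * dp(page_a, page_b)
    return total / len(a)


def toy_dp(page_a: int, page_b: int) -> float:
    """Stand-in page distance: identical pages score 1, others 0.5."""
    return 1.0 if page_a == page_b else 0.5


alpha = [(1, 0.6), (2, 0.4)]
beta = [(1, 0.3), (2, 0.4)]
score = text_similarity(alpha, beta, toy_dp)
```

Both visitors saw the same two pages, yet the score is below 1 because the first visitor spent twice as long on page 1, which is the distinction the slide describes.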

Identifying web site keywords

A clustering algorithm can be applied to find groups of similar visitor sessions.

Based on the cluster centroids, the most important words for each cluster will be identified.

Using the vector space model, a method to determine the most important keywords and their importance in each cluster is proposed.

A measure (geometric mean) to calculate the importance of each word relative to each cluster is defined as:

$$kw[i] = \sqrt[\iota]{\prod_{p \in \zeta} m_{ip}}, \qquad i = 1, \ldots, R$$

$\zeta$: cluster with $\iota$ vectors
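The geometric mean of the word weights over the pages of a cluster can be sketched as below. The dictionary layout, the function name and the weights are assumptions made up for illustration; only the page identifiers 6, 8 and 190 come from the talk.

```python
import math
from typing import Dict, List, Sequence


def keyword_importance(weights: Dict[int, List[float]],
                       cluster: Sequence[int]) -> List[float]:
    """kw[i] = (product over pages p in the cluster of m_ip) ** (1 / cluster size)."""
    if not cluster:
        raise ValueError("the cluster must contain at least one page")
    size = len(cluster)
    n_words = len(weights[cluster[0]])
    return [
        math.prod(weights[page][i] for page in cluster) ** (1.0 / size)
        for i in range(n_words)
    ]


# Invented weights for R = 2 words on three pages.
weights = {6: [1.0, 4.0], 8: [1.0, 2.0], 190: [1.0, 1.0]}
kw = keyword_importance(weights, [6, 8, 190])
```

A word that is absent from any page of the cluster gets importance 0, so only words shared by all pages of the cluster can rank as keywords.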

Applying clustering algorithms [Hinnserbury03, Theodoridis99, Velasquez03]

For example, Self Organizing Feature Maps (SOFM).

Schematically, a SOFM is presented as a two-dimensional array in whose positions the neurons are located.

Each neuron is constituted by an n-dimensional vector, whose components are the synaptic weights.

The notion of neighborhood among the neurons provides diverse topologies.

In this case a toroidal topology was used, which means that the neurons closest to those on the upper edge are located on the lower and lateral edges.
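To make the toroidal neighborhood concrete, the sketch below measures the distance between two neurons on a grid whose edges wrap around. Counting grid steps (Manhattan distance) is an assumption made here; the slide does not say which neighborhood metric was used.

```python
from typing import Tuple


def toroidal_distance(a: Tuple[int, int], b: Tuple[int, int],
                      rows: int, cols: int) -> int:
    """Grid steps between neurons a and b when both axes wrap around."""
    d_row = abs(a[0] - b[0])
    d_col = abs(a[1] - b[1])
    # On a torus the shorter way may cross the edge of the map.
    d_row = min(d_row, rows - d_row)
    d_col = min(d_col, cols - d_col)
    return d_row + d_col


# On a 10 x 10 map, a neuron in the top row neighbors the bottom row.
top_to_bottom = toroidal_distance((0, 0), (9, 0), rows=10, cols=10)
```

With wrap-around no neuron sits on a border, so every neuron has the same number of neighbors during training.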


Working with real web sites

Bank web site

It belongs to a Chilean virtual bank, i.e., a bank that does not have physical branches and where all transactions are made by electronic means, such as e-mail, portals, etc.

Written in Spanish.

217 static web pages.

Approximately eight million raw web log records, corresponding to the period January to March 2003.

Visitor Behavior Vectors

Only 16% of the visitors visit 10 or more pages and 18% fewer than 4.

The average number of visited pages is 6.

With these data, we fixed 6 as the maximum number of components in a visitor behavior vector.

Finally, applying the filters described above, approximately 200,000 vectors were identified.


Visitor Behavior Vectors (2)

Cluster analysis

Cluster identification and validation is generally a subjective task.

The support of a business expert is necessary, in this case a bank business expert.

Eight clusters were identified, but only four of them were accepted by the business expert.

The criterion is “which clusters make sense to the business expert”.


Results


Important page clustering

Cluster analysis

Each cluster must contain only pages related by a main topic.

The business expert defines the main topic of each page.

Of twelve identified clusters, only eight were accepted.

Results

Cluster   Pages visited
1         (6, 8, 190)
2         (100, 128, 30)
3         (86, 150, 97)
4         (101, 105, 1)
5         (3, 9, 147)
6         (100, 126, 58)
7         (70, 186, 137)
8         (157, 169, 180)

Cluster analysis

Review which words in each page cluster are the same.

Using the page vector definition, it is possible to get the keywords and their relative importance in each cluster.

$$M = (m_{ij}), \qquad m_{ij} = f_{ij}\left(1 + \frac{sw(i)}{TR}\right)\log\left(\frac{Q}{n_i}\right)$$

$$kw[i] = \sqrt[\iota]{\prod_{p \in \zeta} m_{ip}}$$

Example:

$$\zeta = \{6, 8, 190\}, \qquad kw_i = \sqrt[3]{m_{i6}\, m_{i8}\, m_{i190}}$$


Cluster analysis

Some identified keywords

#   Keyword       English
1   Crédito       Credit
2   Hipotecario   Mortgage
3   Tarjeta       Card
4   Promoción     Promotion
5   Concurso      Contest
6   Puntos        Points
7   Descuento     Discount
8   Cuenta        Account

Offline recommendations

Structure.

Add intra-cluster links, e.g., a link from page 150 to page 137.

Add inter-cluster links, e.g., a link from page 105 to page 126.

Eliminate links, e.g., from page 150 to page 186.

Content. Use the web site keywords as links or content in the page.

Online recommendations

The current user session is classified into one of the clusters found.

The online navigation recommendation is created as a set of links to pages belonging to the current web site.

The user may or may not select some of the links.

Default case: no recommendation.

It is too risky to apply the recommendations in the real web site. However, it is possible to run a simulation.
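A deliberately simplified sketch of the online step. It assigns the partial session to the cluster with which it shares the most pages and recommends that cluster's unvisited pages. The page-overlap rule is an assumption that stands in for the similarity measures of the talk; the cluster page triples are the ones reported in the results earlier in the talk, and the function name is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Accepted clusters of important pages, as reported in the results.
CLUSTERS: Dict[int, Tuple[int, ...]] = {
    1: (6, 8, 190), 2: (100, 128, 30), 3: (86, 150, 97), 4: (101, 105, 1),
    5: (3, 9, 147), 6: (100, 126, 58), 7: (70, 186, 137), 8: (157, 169, 180),
}


def recommend(partial_session: Sequence[int],
              clusters: Dict[int, Tuple[int, ...]],
              max_links: int = 3) -> List[int]:
    """Recommend unvisited pages of the cluster that best matches the session."""
    visited = set(partial_session)
    best_id, best_overlap = None, 0
    for cluster_id, pages in clusters.items():
        overlap = len(visited & set(pages))
        if overlap > best_overlap:
            best_id, best_overlap = cluster_id, overlap
    if best_id is None:
        return []  # default case: no recommendation
    return [page for page in clusters[best_id] if page not in visited][:max_links]


links = recommend([6, 8], CLUSTERS)
```

A session that matches no cluster falls through to the default case and gets an empty recommendation list.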

A methodology to test online navigation recommendations

$$\alpha = [(p_1, t_1), \ldots, (p_m, t_m), \ldots, (p_H, t_H)]$$

Recommendations for $p_{m+1}$?

$$R_{m+1}(\alpha) = \{l^{\alpha}_{m+1,0}, \ldots, l^{\alpha}_{m+1,k}, \ldots, l^{\alpha}_{m+1,j}\}$$

$$p_{m+1} \in Cl_q, \qquad \forall\, l^{\alpha}_{m+1,j} \in R_{m+1}(\alpha) \,/\, l^{\alpha}_{m+1,j} \in Cl_q,\; j > 0$$

[Diagram residue: cluster $Cl_q$, page set $\{p_1, \ldots, p_n\}$, link $l^{\alpha}_{m+1,x}$.]


Online recommendations (2)

Testing the online navigation effectiveness

[Figure: percentage of acceptance (0 to 100) versus number of page links per suggestion (1 to 3), with one series each for the fourth, fifth and sixth page.]


Intelligent web site [Velasquez04b]

Knowledge Base

Parametric rules

Historic pattern repository


5.- Summary

Summary

Web data are a real source for analyzing user behavior on the web.

An important step is the cleaning and pre-processing of the web data.

The application of web mining techniques allows unknown patterns to be found.

These patterns must be validated or rejected by an expert in the business under investigation.

We can personalize a web site.

Thank you very much for your attention