06 webmining


Page 1: 06 WebMining

8/8/2019 06 WebMining

http://slidepdf.com/reader/full/06-webmining 1/15

1

Web Mining

2

What is Web Mining?

Web mining is the use of data mining techniques to automatically discover and extract information from Web documents/services (Etzioni, 1996, CACM 39(11))

(another definition: mining of data related to the World Wide Web)

Motivation / Opportunity - The WWW is a huge, widely distributed, global information service centre and, therefore, constitutes a rich source for data mining

3

The Web

Over 1 billion HTML pages, 15 terabytes
Wealth of information
Bookstores, restaurants, travel, malls, dictionaries, news, stock quotes, yellow & white pages, maps, markets, ...
Diverse media types: text, images, audio, video
Heterogeneous formats: HTML, XML, PostScript, PDF, JPEG, MPEG, MP3

Highly Dynamic
1 million new pages each day
Average page changes in a few weeks

Graph structure with links between pages
Average page has 7-10 links
In-links and out-links follow a power-law distribution

Hundreds of millions of queries per day

4

Abundance and authority crisis

Liberal and informal culture of content generation and dissemination.
Redundancy and non-standard form and content.
Millions of qualifying pages for most broad queries
Example: java or kayaking
No authoritative information about the reliability of a site
Little support for adapting to the background of specific users.


5

How do you suggest we could estimate the size of the web?

6

One Interesting Approach

The number of web servers was estimated by sampling and testing random IP addresses and determining the fraction of such tests that successfully located a web server

The estimate of the average number of pages per server was obtained by crawling a sample of the servers identified in the first experiment

Lawrence, S. and Giles, C. L. (1999). Accessibility of information on the web. Nature, 400(6740): 107–109.
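The two estimates above multiply out into a single size estimate. A minimal sketch of that arithmetic, assuming a caller-supplied `probe` function that reports whether one random IPv4 address answers as a web server (the function names and sample values are illustrative, not from Lawrence and Giles):

```python
def estimate_web_size(probe, n_probes, pages_per_server_sample):
    """Two-step estimate in the spirit of Lawrence & Giles (1999):
    (fraction of random IPv4 addresses that host a web server) times
    (size of the IPv4 address space) gives the server count; multiplying
    by the average pages per crawled server gives the page count."""
    hits = sum(1 for _ in range(n_probes) if probe())
    servers = (hits / n_probes) * 2**32          # scale up to full IPv4 space
    avg_pages = sum(pages_per_server_sample) / len(pages_per_server_sample)
    return servers * avg_pages
```

With a probe that answers once in 100 tries and crawled servers averaging 500 pages, the estimate is simply (2^32 / 100) * 500.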

7

The Web

The Web is a huge collection of documents, except that it also offers
Hyper-link information
Access and usage information

Lots of data on user access patterns
Web logs contain sequences of URLs accessed by users

Challenge: Develop new Web mining algorithms and adapt traditional data mining algorithms to
Exploit hyper-links and access patterns

8

Applications of web mining

E-commerce (Infrastructure)
Generate user profiles -> improve customization and provide users with pages and advertisements of interest
Targeted advertising -> Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and E-commerce sites. Internet advertising is probably the "hottest" web mining application today
Fraud -> Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought). If the buying pattern changes significantly, then signal fraud

Network Management
Performance management -> Annual bandwidth demand is increasing ten-fold on average, while annual bandwidth supply is rising only by a factor of three. The result is frequent congestion. During a major event (World Cup), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world
Fault management -> Analyze alarm and traffic data to carry out root cause analysis of faults


9

Applications of web mining

Information retrieval (Search) on the Web
Automated generation of topic hierarchies
Web knowledge bases

10

Why is Web Information Retrieval Important?

According to most predictions, the majority of human information will be available on the Web in ten years

Effective information retrieval can aid in
Research: Find all papers about web mining
Health/Medicine: What could be the reason for symptoms of "yellow eyes", high fever and frequent vomiting
Travel: Find information on the tropical island of St. Lucia
Business: Find companies that manufacture digital signal processors
Entertainment: Find all movies starring Marilyn Monroe between 1960 and 1970
Arts: Find all short stories written by Jhumpa Lahiri

11

Why is Web Information Retrieval Difficult?

The Abundance Problem (99% of information is of no interest to 99% of people)
Hundreds of irrelevant documents returned in response to a search query
Limited Coverage of the Web (Internet sources hidden behind search interfaces)
Largest crawlers cover less than 18% of Web pages
The Web is extremely dynamic
Lots of pages added, removed and changed every day
Very high dimensionality (thousands of dimensions)
Limited query interface based on keyword-oriented search
Limited customization to individual users

12

http://www.searchengineshowdown.com/stats/size.shtml

Search Engine Relative Size


13

Search Engine Web Coverage Overlap

From http://www.searchengineshowdown.com/stats/overlap.shtml

Coverage – about 40% in 1999

4 searches were defined that returned 141 web pages.

14

Web Mining Taxonomy

Web Mining divides into three branches:
Web Structure Mining
Web Content Mining
Web Usage Mining

15

Web Mining Taxonomy

Web content mining: focuses on techniques for assisting a user in finding documents that meet a certain criterion (text mining)

Web structure mining: aims at developing techniques to take advantage of the collective judgement of web page quality which is available in the form of hyperlinks

Web usage mining: focuses on techniques to study user behaviour when navigating the web (also known as Web log mining and clickstream analysis)

16

Web Content Mining

Examines the content of web pages as well as results of web searching.


17

Web Content Mining

Can be thought of as extending the work performed by basic search engines.

Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to the users.

18

Database Approaches

One approach is to build a local knowledge base: model data on the web and integrate them in a way that enables specifically designed query languages to query the data
Store locally abstract characterizations of web pages. A query language allows querying the local repository at several levels of abstraction. As a result of the query, the system may have to request pages from the web if more detail is needed
Zaiane, O. R. and Han, J. (2000). WebML: Querying the world-wide web for resources and knowledge. In Proc. Workshop on Web Information and Data Management, pages 9–12

Build a computer-understandable knowledge base whose contents mirror that of the web and which is created by providing training examples that characterize the wanted document classes
Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., and Slattery, S. (1998). Learning to extract symbolic knowledge from the world wide web. In Proc. National Conference on Artificial Intelligence, pages 509–516

19

Agent-Based Approach

Agents search for relevant information using domain characteristics and user profiles
A system for extracting a relation from the web, for example, a list of all the books referenced on the web. The system is given a set of training examples which are used to search the web for similar documents. Another application of this tool could be to build a relation with the name and address of restaurants referenced on the web
Brin, S. (1998). Extracting patterns and relations from the world wide web. In Int. Workshop on Web and Databases, pages 172–183

Personalized Web Agents -> Web agents learn user preferences and discover Web information sources based on these preferences, and those of other individuals with similar interests
SiteHelper is a local agent that keeps track of pages viewed by a given user in previous visits and gives him advice on new pages of interest in the next visit
Ngu, D. S. W. and Wu, X. (1997). SiteHelper: A localized agent that helps incremental exploration of the world wide web. In Proc. WWW Conference, pages 691–700

20

Web Structure Mining

Exploiting Hyperlink Structure


21

First generation of search engines

Early days: keyword based searches
Keywords: "web mining"
Retrieves documents with "web" and "mining"

Later on: cope with
the synonymy problem
the polysemy problem
stop words

Common characteristic: Only information on the pages is used

22

Modern search engines

Link structure is very important
Adding a link: a deliberate act
Harder to fool systems using in-links
Link is a "quality mark"

Modern search engines use link structure as an important source of information

23

Central Question:

Which useful information can be derived from the link structure of the web?

24

Some answers

1. Structure of Internet

2. Google

3. HITS: Hubs and Authorities


25

1. The Web Structure

A study was conducted on a graph inferred from two large AltaVista crawls.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the web. In Proc. WWW Conference.

The study confirmed the hypothesis that the number of in-links and out-links of a page approximately follows a Zipf distribution (a particular case of a power law)
If the web is treated as an undirected graph, 90% of the pages form a single connected component
If the web is treated as a directed graph, four distinct components are identified, all four of similar size

26

General Topology

Bow-tie picture: SCC (56 million pages) with IN (44 million) and OUT (44 million) on either side, plus Tendrils (44 million), tubes, and disconnected components.

SCC: set of pages that can reach one another
IN: pages that have a path to the SCC but not from it
OUT: pages that can be reached from the SCC but cannot reach it
TENDRILS: pages that can neither reach nor be reached from the SCC pages

27

Some statistics

For only 25% of page pairs is there a connecting path

BUT

If there is a path:
Directed: average length < 17
Undirected: average length < 7 (!!!)

It's a "small world" -> between two people there is a chain of length only 6!

Small World Graphs
High number of relatively small cliques
Small diameter

The Internet (SCC) is a small world graph

28

2. Google

Search engine that uses link structure to calculate a quality ranking (PageRank) for each page
Intuition: PageRank can be seen as the probability that a "random surfer" visits a page
Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proc. WWW Conference, pages 107–117

Keywords w entered by user
Select pages containing w and pages which have in-links with caption w
Anchor text
Provides more accurate descriptions of Web pages
Anchors exist for un-indexable documents (e.g., images)
Font sizes of words in text: Words in larger or bolder font are assigned higher weights
Rank pages according to importance


29

PageRank

Link i → j: i considers j important; the more important i is, the more important j becomes; if i has many out-links, each link counts for less.

Initially: all importances p_i = 1. Iteratively, p_i is refined.

Page Rank: A page is important if many important pages link to it.

Let OutDegree(i) = # out-links of page i

Adjust p_j:

PageRank(j) = p + (1 - p) * Σ_{i → j} PageRank(i) / OutDegree(i)

This is the weighted sum of the importance of the pages referring to P_j. Parameter p is the probability that the surfer gets bored and starts on a new random page; (1 - p) is the probability that the random surfer follows a link on the current page.
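The update rule above can be run as a short fixed-point iteration. A minimal sketch (the graph representation, parameter defaults, and function name are illustrative assumptions, not Google's implementation):

```python
def pagerank(out_links, p=0.15, iters=50):
    """Iterate PageRank(j) = p + (1 - p) * sum over links i -> j of
    PageRank(i) / OutDegree(i), starting from all importances = 1.
    `out_links` maps each page to the list of pages it links to."""
    pages = set(out_links) | {j for js in out_links.values() for j in js}
    rank = {j: 1.0 for j in pages}
    for _ in range(iters):
        new = {j: p for j in pages}          # the "bored surfer" term
        for i, targets in out_links.items():
            for j in targets:
                # each out-link of i passes on an equal share of i's rank
                new[j] += (1 - p) * rank[i] / len(targets)
        rank = new
    return rank
```

On a toy graph where two pages both link to a third, the third page ends up with the highest rank, matching the "important if many important pages link to it" intuition.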

30

3. HITS (Hyperlink-Induced Topic Search)

HITS uses hyperlink structure to identify authoritative Web sources for broad-topic information discovery
Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632.

Premise: Sufficiently broad topics contain communities consisting of two types of hyperlinked pages:
Authorities: highly-referenced pages on a topic
Hubs: pages that "point" to authorities
A good authority is pointed to by many good hubs; a good hub points to many good authorities

31

Hubs and Authorities

A on the left is an authority
A on the right is a hub

32

HITS

Steps for Discovering Hubs and Authorities on a specific topic
Collect a seed set of pages S (returned by a search engine)
Expand the seed set to contain pages that point to or are pointed to by pages in the seed set (links inside a site are removed)
Iteratively update the hub weight h(p) and authority weight a(p) for each page:

a(p) = Σ_{q → p} h(q)        h(p) = Σ_{p → q} a(q)

After a fixed number of iterations, pages with the highest hub/authority weights form the core of the community

Extensions proposed in Clever
Assign links different weights based on the relevance of link anchor text
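The mutual hub/authority update can be sketched in a few lines (a hedged illustration: the toy graph and the L2 normalisation per iteration are assumptions; Kleinberg's paper normalises similarly but the details here are for demonstration only):

```python
def hits(links, iters=20):
    """Iterate a(p) = sum of h(q) over pages q linking to p, and
    h(p) = sum of a(q) over pages q that p links to, normalising
    each iteration so scores stay bounded."""
    pages = set(links) | {q for qs in links.values() for q in qs}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iters):
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        for d in (auth, hub):
            norm = sum(v * v for v in d.values()) ** 0.5 or 1.0
            for k in d:
                d[k] /= norm
    return auth, hub
```

A page pointed to by two hubs scores higher authority than one pointed to by a single hub, and a hub pointing to two authorities scores higher than one pointing to one.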


33

Applications of HITS

Search engine querying

Finding web communities

Finding related pages

Populating categories in web directories.

Citation analysis

34

Web Usage Mining

analyzing user web navigation

35

Web Usage Mining

Pages contain information

Links are “roads”

How do people navigate over the Internet? ⇒ Web usage mining (Clickstream Analysis)

Information on navigation paths is available in log files.

Logs can be examined from either a client or a server perspective.

36

Website Usage Analysis

Why analyze Website usage?

Knowledge about how visitors use a Website could
Provide guidelines for web site reorganization; help prevent disorientation
Help designers place important information where the visitors look for it
Pre-fetching and caching web pages
Provide an adaptive Website (Personalization)

Questions which could be answered
What are the differences in usage and access patterns among users?
What user behaviors change over time?
How do usage patterns change with quality of service (slow/fast)?
What is the distribution of network traffic over time?


37

Website Usage Analysis

38

Data Sources

39

Data Sources

Server level collection: the server stores data regarding requests performed by the client, thus the data generally regard just one source;

Client level collection: it is the client itself which sends information regarding the user's behaviour to a repository (can be implemented by using a remote agent (such as Javascript or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities);

Proxy level collection: information is stored at the proxy side, thus the Web data regard several Websites, but only users whose Web clients pass through the proxy.

40

An Example of a Web Server Log


41

Analog – Web Log File Analyser

Gives basic statistics such as
number of hits
average hits per time period
what are the popular pages in your site
who is visiting your site
what keywords are users searching for to get to you
what is being downloaded

http://www.analog.cx/

42

Web Usage Mining Process

Web Server Log Data + Site Data → Data Preparation → Clean Data → Data Mining → Usage Patterns

43

Data Preparation

Data cleaning
By checking the suffix of the URL name: for example, remove all log entries with filename suffixes such as gif, jpeg, etc.

User identification
If a page is requested that is not directly linked to the previous pages, multiple users are assumed to exist on the same machine
Other heuristics involve using a combination of IP address, machine name, browser agent, and temporal information to identify users

Transaction identification
All of the page references made by a user during a single visit to a site
The size of a transaction can range from a single page reference to all of the page references
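A minimal sketch of the cleaning and user-identification heuristics just described, combining the suffix filter with the (IP address, browser agent) heuristic (the entry format, field names, and suffix list are assumptions for illustration):

```python
def clean_and_group(entries, junk=(".gif", ".jpeg", ".jpg", ".css")):
    """Drop image/style requests by URL suffix, then group the remaining
    requests by (IP address, browser agent) as a crude user identifier.
    Each entry is a dict with 'ip', 'agent' and 'url' keys."""
    users = {}
    for e in entries:
        if e["url"].lower().endswith(junk):   # data cleaning step
            continue
        users.setdefault((e["ip"], e["agent"]), []).append(e["url"])
    return users
```

Two different browser agents on the same IP address are treated as two users, as the slide's heuristic suggests.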

44

Sessionizing

Main Questions:
how to identify unique users
how to identify/define a user transaction

Problems:
user ids are often suppressed due to security concerns
individual IP addresses are sometimes hidden behind proxy servers
client-side & proxy caching makes server log data less reliable

Standard Solutions/Practices:
user registration – practical?
client-side cookies – not foolproof
cache busting – increases network traffic


45

Sessionizing

Time oriented
By total duration of session
not more than 30 minutes
By page stay times (good for short sessions)
not more than 10 minutes per page

Navigation oriented (good for short sessions and when timestamps are unreliable)
Referrer is the previous page in the session, or
Referrer is undefined but the request is within 10 secs, or
Link from previous to current page exists in the web site

The task of identifying the sequence of requests from a user is not trivial - see Berendt et al., Measuring the Accuracy of Sessionizers for Web Usage Analysis, SIAM-DM01
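The two time-oriented cut-offs above can be sketched together (a hedged illustration; the input format, function name, and the choice to apply both thresholds at once are assumptions, not from Berendt et al.):

```python
def sessionize(requests, max_stay=600, max_session=1800):
    """Time-oriented sessionizing: start a new session when the stay
    on a page exceeds `max_stay` seconds (10 min) or the session's
    total duration exceeds `max_session` seconds (30 min).
    `requests` is a time-ordered list of (timestamp, url) pairs
    for one already-identified user."""
    sessions = []
    for ts, url in requests:
        if (not sessions
                or ts - sessions[-1][-1][0] > max_stay      # page-stay cut
                or ts - sessions[-1][0][0] > max_session):  # duration cut
            sessions.append([])
        sessions[-1].append((ts, url))
    return sessions
```

A gap of more than 10 minutes between two requests, or a running total above 30 minutes, each starts a fresh session.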

46

Web Usage Mining

Commonly used approaches

Preprocessing data and adapting existing data mining techniques
For example association rules: these do not take into account the order of the page requests

Developing novel data mining models

47

An Example of Preprocessing Data and Adapting Existing Data Mining Techniques

Chen, M.-S., Park, J. S., and Yu, P. S. (1998). Efficient data mining for traversal patterns. IEEE Transactions on Knowledge and Data Engineering, 10(2): 209–221.

The log data is converted into a tree, from which a set of maximal forward references is inferred. The maximal forward references are then processed by existing association rules techniques. Two algorithms are given to mine for the rules, which in this context consist of large itemsets with the additional restriction that references must be consecutive in a transaction.
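A sketch of how maximal forward references can be extracted from a single traversal sequence: whenever the user backtracks to a page already on the current path, the path so far is emitted and then truncated. (The function name and sample session are illustrative; Chen et al. work via a tree construction, while this operates directly on the sequence.)

```python
def maximal_forward_references(session):
    """Split one page-request sequence into maximal forward references:
    forward paths that end just before the user backtracks."""
    path, refs = [], []
    extending = False
    for page in session:
        if page in path:                     # backward reference
            if extending:                    # path grew since last emit
                refs.append(list(path))
            path = path[:path.index(page) + 1]
            extending = False
        else:
            path.append(page)
            extending = True
    if extending:                            # flush the final forward path
        refs.append(list(path))
    return refs
```

On the sequence A B C D C B E G H G W A O U O V this yields the forward paths ABCD, ABEGH, ABEGW, AOU and AOV.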

48

Mining Navigation Patterns

Each session induces a user trail through the site
A trail is a sequence of web pages followed by a user during a session, ordered by time of access.
A pattern in this context is a frequent trail.

Co-occurrence of web pages is important, e.g. shopping basket and checkout.

Use a Markov chain model to model the user navigation records, inferred from log data: the Hypertext Probabilistic Grammar.


49

Hypertext Probabilistic Grammar Model

Log Files → Navigation Sessions → Hypertext Weighted Grammar → Data Mining Algorithms (BFS, IFE, FG) → User Navigation Patterns

Hypertext Probabilistic Grammar
Ngram
Dynamic model
To identify paths with higher probability

50

Hypertext Weighted Grammar

A1 → A2 → A3 → A4
A1 → A5 → A3 → A4 → A1
A5 → A2 → A4 → A6
A5 → A2 → A3
A5 → A2 → A3 → A6
A4 → A1 → A5 → A3

A parameter, α, is used when converting the weighted grammar to the corresponding probabilistic grammar.
α = 1 – Initial probability proportional to the number of page visits
α = 0 – Initial probability proportional to the number of sessions starting on a page

51

Ngram model

We make use of the Ngram concept in order to improve the model accuracy in representing user sessions.

The Ngram model assumes that only the previous n-1 visited pages have a direct effect on the probability of the next page chosen.

A state corresponds to a navigation trail with n-1 pages

A Chi-square test is used to assess the order of the model (in most cases N=3 is enough)

Experiments have shown that the number of states is manageable
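For N=2 the Ngram model reduces to a first-order Markov chain over pages: P(next page | current page) estimated from observed trails. A minimal sketch using the six trails of the Hypertext Weighted Grammar example (the function name is an assumption; α-weighted initial probabilities are omitted):

```python
def transition_probs(trails):
    """Estimate first-order transition probabilities
    P(next page | current page) from a list of navigation trails."""
    counts = {}
    for trail in trails:
        for cur, nxt in zip(trail, trail[1:]):
            counts.setdefault(cur, {}).setdefault(nxt, 0)
            counts[cur][nxt] += 1
    # normalise each page's outgoing counts into probabilities
    return {cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for cur, nxts in counts.items()}
```

With the example trails, A2 is followed by A3 three times out of four, so P(A3 | A2) = 0.75.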

52

Ngram model

A1 → A2 → A3 → A4
A1 → A5 → A3 → A4 → A1
A5 → A2 → A4 → A6
A5 → A2 → A3
A5 → A2 → A3 → A6
A4 → A1 → A5 → A3


53

Ongoing Work

Cloning states in order to increase the model accuracy

54

Applications of the HPG Model

Provide guidelines for the optimisation of a web site structure.
Work as a model of the user's preferences in the creation of adaptive web sites.
Improve search engine technologies by enhancing the random surfer concept.
Web personal assistant.
Visualisation tool.
Use the model to learn access patterns and predict future accesses.
Pre-fetch predicted pages to reduce latency.
Also cache results of popular search engine queries.

55

Future Work

Conduct a set of experiments to evaluate the usefulness of the model to the end user.
Individual User
Web site owner

Incorporate the categories that users are navigating through so we may better understand their activities.
E.g. what type of book is the user interested in; this may be used for recommendation.

Devise methods to compare the precision of different order models.

56

References

Data Mining: Introductory and Advanced Topics, Margaret Dunham (Prentice Hall, 2002)

Mining the Web - Discovering Knowledge from Hypertext Data, Soumen Chakrabarti, Morgan Kaufmann Publishers


57

Thank you!!!