web mining

Web Mining

Sanjay Kumar Madria
Department of Computer Science
University of Missouri-Rolla, MO

Web Mining

(Etzioni, 1996) Web mining: data mining techniques for automatically discovering and extracting information from Web documents and services.

(Kosala and Blockeel, July 2000) “ Web mining refers to the overall process of

discovering potentially useful and previously unknown information or knowledge from the Web data.”

Research areas of Web mining: it integrates work from several research fields
– Database (DB)
– Information retrieval (IR)
– The sub-areas of machine learning (ML)
– Natural language processing (NLP)

Web Mining : Subtasks

Resource Finding: task of retrieving intended web documents
Information Selection & Pre-processing: automatic selection and pre-processing of specific information from retrieved web resources
Generalization: automatic discovery of patterns in web sites
Analysis: validation and/or interpretation of mined patterns

Web Mining: Not IR or IE

Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible
– Web document classification, which is a Web Mining task, could be part of an IR system (e.g. indexing for a search engine)
Information extraction (IE) aims to extract the relevant facts from given documents, while IR aims to select the relevant documents
– IE systems for the general Web are not feasible
– Most focus on specific Web sites or content

The WWW is a huge source of information with widely varied content:
– Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
– Hyper-link information
– Access and usage information
– Web site contents and organization
Growing and changing very rapidly
– Broad diversity of user communities
Nevertheless, only a very small fraction of the information on the Web is actually useful to Web users.
How can we find high-quality Web pages on a specific topic?
This calls for data mining techniques that work on Web data.

Mining the World-Wide Web

Challenges on WWW Interactions

Finding relevant information
Creating knowledge from the information available
Personalization of the information: providing information that fits each individual's needs
Learning about customers / individual users: building learning models for the customer base as a whole or for individual customers

Web Mining can play an important Role!

Web Mining: more challenging

Searches for
– Web access patterns
– Web structures
– Regularity and dynamics of Web contents
Problems
– The "abundance" problem
– Limited coverage of the Web: hidden Web sources, majority of data in DBMS
– Limited query interface based on keyword-oriented search
– Limited customization to individual users
– Dynamic and semistructured

Web Mining Taxonomy

Web Mining

Web Content Mining

Web Usage Mining

Web Structure Mining

Web Content Mining

Discovery of useful information from web contents / data / documents
– Web data contents: text, image, audio, video, metadata and hyperlinks
Information Retrieval View (Structured + Semi-Structured)
– Assist / improve information finding
– Filtering information to users on user profiles
Database View
– Model data on the web
– Integrate them for more sophisticated queries

Issues in Web Content Mining

Developing intelligent tools for IR
– Finding keywords and key phrases
– Discovering grammatical rules and collocations
– Hypertext classification/categorization
– Extracting key phrases from text documents
– Learning extraction models/rules
– Hierarchical clustering
– Predicting (words) relationship
Developing Web query systems
– WebOQL, XML-QL
Mining multimedia data
– Mining image from satellite (Fayyad, et al. 1996)
– Mining image to identify small volcanoes on Venus (Smyth, et al 1996)

Web Structure Mining

To discover the link structure of the hyperlinks at the inter-document level to generate structural summary about the Website and Web page.

Direction 1: based on the hyperlinks, categorizing the Web pages and generated information.

Direction 2: discovering the structure of Web document itself.

Direction 3: discovering the nature of the hierarchy or network of hyperlinks in the Website of a particular domain.


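Finding authoritative Web pages
– Retrieving pages that are both relevant to a given topic and of high quality
Hyperlinks can infer the notion of authority
– The Web consists of pages and the hyperlinks that connect one page to another
– These hyperlinks carry a latent judgment, expressed through the page author's annotation
– A hyperlink pointing to another Web page can be read as the author's endorsement of that page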


Web pages categorization (Chakrabarti, et al., 1998)

Discovering micro communities on the web
– Clever system (Chakrabarti, et al., 1999), Google (Brin and Page, 1998)

Schema Discovery in Semistructured Environment

Web Usage Mining

Web usage mining, also known as Web log mining
– mining techniques to discover interesting usage patterns from the secondary data derived from the interactions of the users while surfing the web

Web Usage Mining Applications

Target potential customers for electronic commerce

Enhance the quality and delivery of Internet information services to the end user

Improve Web server system performance
Identify potential prime advertisement locations
Facilitates personalization/adaptive sites
Improve site design
Fraud/intrusion detection
Predict user's actions (allows prefetching)

Problems with Web Logs

Identifying users
– Clients may have multiple streams
– Clients may access web from multiple hosts
– Proxy servers: many clients/one address
– Proxy servers: one client/many addresses
Data not in log
– POST data (i.e., CGI request) not recorded
– Cookie data stored elsewhere

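Proxy server: a server (cache) that, when it receives a request for information at some Internet address, first looks for that address in its store of previously fetched content. If the content is there, it returns it immediately; if not, it fetches it from the server at that address, copies it into its store, and then returns it to the requester.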

Cont…
Missing data
– Pages may be cached
– Referring page requires client cooperation
When does a session end?
– Use of forward and backward pointers
– Typically a 30 minute timeout is used
Web content may be dynamic
– May not be able to reconstruct what the user saw
Use of spiders and automated agents, which automatically request web pages
Like most data mining tasks, web log mining requires preprocessing
– To identify users
– To match sessions to other data
– To fill in missing data
– Essentially, to reconstruct the click stream

Log Data - Simple Analysis

Statistical analysis of users
– Length of path
– Viewing time
– Number of page views
Statistical analysis of site
– Most common pages viewed
– Most common invalid URL
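The simple statistics listed above can be sketched with counters over parsed log records. This is only an illustration: the `records` list, its `(user, url, status)` layout, and the choice of status 404 as the marker for an invalid URL are assumptions made for the example, not something the slides specify.

```python
from collections import Counter

# Hypothetical, already-parsed log records: (user, url, HTTP status).
records = [
    ("u1", "/index.html", 200),
    ("u1", "/products/", 200),
    ("u2", "/index.html", 200),
    ("u2", "/old-page.html", 404),
    ("u3", "/old-page.html", 404),
    ("u3", "/index.html", 200),
    ("u3", "/missing.html", 404),
]

# Site statistics: most common pages viewed and most common invalid URLs.
page_views = Counter(url for _, url, status in records if status == 200)
invalid_urls = Counter(url for _, url, status in records if status == 404)

# User statistics: number of successful page views per user.
views_per_user = Counter(user for user, _, status in records if status == 200)

print(page_views.most_common(1))
print(invalid_urls.most_common(1))
print(dict(views_per_user))
```

Path length and viewing time need per-session ordering and timestamps, so they are better computed after sessions have been reconstructed.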

Web Log – Data Mining Applications

Association rules
– Find pages that are often viewed together
Clustering
– Cluster users based on browsing patterns
– Cluster pages based on content
Classification
– Relate user attributes to patterns

Web Logs

Web servers have the ability to log all requests

Web server log formats:
– Most use the Common Log Format (CLF)
– New, Extended Log Format allows configuration of log file

Generate vast amounts of data

Common Log Format
Remotehost: browser hostname or IP #
Remote log name of user (almost always "-" meaning "unknown")
Authuser: authenticated username
Date: date and time of the request
"request": exact request line from client
Status: the HTTP status code returned
Bytes: the content-length of response

Server Logs

Fields

Client IP: 128.101.228.20
Authenticated User ID: - -
Time/Date: [10/Nov/1999:10:16:39 -0600]
Request: "GET / HTTP/1.0"
Status: 200
Bytes: -
Referrer: "-"
Agent: "Mozilla/4.61 [en] (WinNT; I)"
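The fields above correspond to one line of a server log. A minimal parsing sketch follows; the regular expression, the function name `parse_log_line`, and the assembled sample line are assumptions built from the field values on this slide (CLF fields followed by optional referrer and agent), not a reference parser.

```python
import re
from datetime import datetime

# CLF fields, optionally followed by quoted referrer and user agent.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<authuser>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # "-" means the response size was not recorded.
    entry["bytes"] = None if entry["bytes"] == "-" else int(entry["bytes"])
    # Split the request line into method, URI and protocol.
    parts = entry["request"].split()
    entry["method"], entry["uri"], entry["protocol"] = (parts + [None] * 3)[:3]
    return entry

# Sample line assembled from the field values listed on the slide.
line = ('128.101.228.20 - - [10/Nov/1999:10:16:39 -0600] '
        '"GET / HTTP/1.0" 200 - "-" "Mozilla/4.61 [en] (WinNT; I)"')
entry = parse_log_line(line)
print(entry["host"], entry["method"], entry["uri"], entry["status"])
```

Month-name parsing with `%b` depends on the process locale, so the sketch assumes an English or C locale.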

Web Usage Mining
Commonly used approaches (Borges and Levene, 1999)
– Maps the log data into relational tables before an adapted data mining technique is performed.
– Uses the log data directly by utilizing special pre-processing techniques.
Typical problems
– Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers (McCallum, et al., 2000; Srivastava, et al., 2000).

Request

Method: GET
– Other common methods are POST and HEAD
URI: /
– This is the file that is being accessed. When a directory is specified, it is up to the server to decide what to return. Usually, it will be the file named "index.html" or "home.html"
Protocol: HTTP/1.0

Status

Status codes are defined by the HTTP protocol. Common codes include:

– 200: OK

– 3xx: Some sort of Redirection

– 4xx: Some sort of Client Error

– 5xx: Some sort of Server Error

Web Mining Taxonomy
[Figure: taxonomy tree. Web Mining splits into Web Structure Mining, Web Content Mining (Web Page Content Mining, Search Result Mining), and Web Usage Mining (General Access Pattern Tracking, Customized Usage Tracking).]

Web Content Mining: Web Page Content Mining
Web Page Summarization
– WebOQL (Mendelzon et al., 1998), …: Web structuring query languages; can identify information within given web pages
– (Etzioni et al., 1997): uses heuristics to distinguish personal home pages from other web pages
– ShopBot (Etzioni et al., 1997): looks for product prices within web pages

Mining the World Wide Web

Web Content Mining: Search Result Mining
Search Engine Result Summarization
– Clustering Search Result (Leouski and Croft, 1996; Zamir and Etzioni, 1997): categorizes documents using phrases in titles and snippets

Web Structure Mining
Using Links
– PageRank (Brin et al., 1998)
– CLEVER (Chakrabarti et al., 1998)
Use interconnections between web pages to give weight to pages.
Using Generalization
– MLDB (1994)
Uses a multi-level database representation of the Web. Counters (popularity) and link lists are used for capturing structure.

Web Usage Mining: General Access Pattern Tracking
– Web Log Mining (Zaïane, Xin and Han, 1998)
Uses KDD techniques to understand general access patterns and trends. Can shed light on better structure and grouping of resource providers.



Web Usage Mining: Customized Usage Tracking
– Adaptive Sites (Perkowitz and Etzioni, 1997)
Analyzes access patterns of each user at a time. Web site restructures itself automatically by learning from user access patterns.

Web Content Mining

Agent-based Approaches:
– Intelligent Search Agents
– Information Filtering/Categorization
– Personalized Web Agents
Database Approaches:
– Multilevel Databases
– Web Query Systems

Intelligent Search Agents

Locating documents and services on the Web:
– WebCrawler, Alta Vista (http://www.altavista.com): scan millions of Web documents and create index of words (too many irrelevant, outdated responses)
– MetaCrawler: mines robot-created indices
Retrieve product information from a variety of vendor sites using only general information about the product domain:
– ShopBot

Intelligent Search Agents (Cont’d)

Rely either on pre-specified domain information about particular types of documents, or on hard-coded models of the information sources to retrieve and interpret documents:
– Harvest
– FAQ-Finder
– Information Manifold
– OCCAM
– Parasite
Learns models of various information sources and translates these into its own concept hierarchy:
– ILA (Internet Learning Agent)

Information Filtering/Categorization

Using various information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them.
– HyPursuit: uses semantic information embedded in link structures and document content to create cluster hierarchies of hypertext documents, and structure an information space
– BO (Bookmark Organizer): combines hierarchical clustering techniques and user interaction to organize a collection of Web documents based on conceptual information

Personalized Web Agents

This category of Web agents learns user preferences and discovers Web information sources based on these preferences, and those of other individuals with similar interests (using collaborative filtering)
– WebWatcher
– PAINT
– Syskill&Webert
– GroupLens
– Firefly
– others

Multiple Layered Web Architecture
[Figure: stack of layers Layer0, Layer1, ..., Layern; the higher layers hold Generalized Descriptions and More Generalized Descriptions.]

Multilevel Databases

At the higher levels, meta data or generalizations are extracted from lower levels and organized in structured collections, i.e. relational or object-oriented databases.
At the lowest level, semi-structured information is stored in various Web repositories, such as hypertext documents

Multilevel Databases (Cont’d)

(Han, et. al.): use a multi-layered database where each layer is obtained via generalization and transformation operations performed on the lower layers
(Kholsa, et. al.): propose the creation and maintenance of meta-databases at each information providing domain and the use of a global schema for the meta-database

Multilevel Databases (Cont’d)

(King, et. al.): propose the incremental integration of a portion of the schema from each information source, rather than relying on a global heterogeneous database schema
The ARANEUS system: extracts relevant information from hypertext documents and integrates these into higher-level derived Web Hypertexts which are generalizations of the notion of database views

Multi-Layered Database (MLDB)
A multiple layered database model
– based on semi-structured data hypothesis
– queried by NetQL using a syntax similar to the relational language SQL
Layer-0:
– An unstructured, massive, primitive, diverse global information-base.
Layer-1:
– A relatively structured, descriptor-like, massive, distributed database by data analysis, transformation and generalization techniques.
– Tools to be developed for descriptor extraction.
Higher-layers:
– Further generalization to form progressively smaller, better structured, and less remote databases for efficient browsing, retrieval, and information discovery.

Three major components in MLDB

S (a database schema):
– outlines the overall database structure of the global MLDB
– presents a route map for data and meta-data (i.e., schema) browsing
– describes how the generalization is performed
H (a set of concept hierarchies):
– provides a set of concept hierarchies which assist the system to generalize lower layer information to higher layers and map queries to appropriate concept layers for processing
D (a set of database relations):
– the whole global information base at the primitive information level (i.e., layer-0)
– the generalized database relations at the nonprimitive layers

The general architecture of WebLogMiner (a global MLDB)
[Figure: Site 1, Site 2 and Site 3 feed Generalized Data and Concept Hierarchies into higher layers. Resource Discovery (MLDB). Knowledge Discovery (WLM): Characteristic Rules, Discriminant Rules, Association Rules.]

Techniques for Web usage mining

Construct multidimensional view on the Weblog database
– Perform multidimensional OLAP analysis to find the top N users, top N accessed Web pages, most frequently accessed time periods, etc.
Perform data mining on Weblog records
– Find association patterns, sequential patterns, and trends of Web accessing
– May need additional information, e.g., user browsing sequences of the Web pages in the Web server buffer
Conduct studies to
– Analyze system performance, improve system design by Web caching, Web page prefetching, and Web page swapping

Web Usage Mining - Phases

Three distinctive phases: preprocessing, pattern discovery, and pattern analysis

Preprocessing - the process of converting the raw data into the data abstraction needed to apply the data mining algorithm

Resources: server-side, client-side, proxy servers, or database.

Raw data: Web usage logs, Web page descriptions, Web site topology, user registries, and questionnaire.

Conversion: Content converting, Structure converting, Usage converting

User: The principal using a client to interactively retrieve and render resources or resource manifestations.

Page view: Visual rendering of a Web page in a specific client environment at a specific point of time

Click stream: a sequential series of page view requests

User session: a delimited set of user clicks (click stream) across one or more Web servers.

Server session (visit): a collection of user clicks to a single Web server during a user session.

Episode: a subset of related user clicks that occur within a user session.

Content Preprocessing - the process of converting text, images, scripts and other files into forms that can be used by the usage mining.

Structure Preprocessing - the structure of a Website is formed by the hyperlinks between page views; structure preprocessing can be done by parsing and reformatting this information.

Usage Preprocessing - the most difficult task in the usage mining process; data cleaning techniques are applied to eliminate the impact of irrelevant items on the analysis result.

Pattern Discovery
Pattern Discovery is the key component of Web mining; it brings together algorithms and techniques from data mining, machine learning, statistics, pattern recognition and other research areas.

Separate subsections: statistical analysis, association rules, clustering, classification, sequential pattern, dependency modeling.

Statistical Analysis - the analysts may perform different kinds of descriptive statistical analyses based on different variables when analyzing the session file ; powerful tools in extracting knowledge about visitors to a Web site.

Association Rules - refers to sets of pages that are accessed together with a support value exceeding some specified threshold.

Clustering: a technique to group together users or data items (pages) with similar characteristics. It can facilitate the development and execution of future marketing strategies.

Classification: the technique to map a data item into one of several predefined classes, which helps to establish a profile of users belonging to a particular class or category.

Pattern Analysis

Pattern Analysis - final stage of the Web usage mining.

To eliminate the irrelevant rules or patterns and to extract the interesting rules or patterns from the output of the pattern discovery process.

Analysis methodologies and tools: query mechanisms like SQL, OLAP, visualization, etc.

WUM – Pre-Processing
Data Cleaning
– Removes log entries that are not needed for the mining process
Data Integration
– Synchronize data from multiple server logs, metadata
User Identification
– Associates page references with different users
Session/Episode Identification
– Groups user's page references into user sessions
Page View Identification
Path Completion
– Fills in page references missing due to browser and proxy caching

WUM – Issues in User Session Identification

A single IP address is used by many users

Different IP addresses in a single session

Missing cache hits in the server logs

[Figure: several different users reach the Web server through one proxy server; a single user reaches the Web server through an ISP server.]

User and Session Identification Issues

Distinguish among different users to a site
Reconstruct the activities of the users within the site
Proxy servers and anonymizers
Rotating IP addresses, connections through ISPs
Missing references due to caching
Inability of servers to distinguish among different visits

WUM – Solutions

Remote Agent
– A remote agent is implemented as a Java applet
– It is loaded into the client only once, when the first page is accessed
– The subsequent requests are captured and sent back to the server
Modified Browser
– The source code of the existing browser can be modified to gain user-specific data at the client side
Dynamic page rewriting
– When the user first submits a request, the server returns the requested page rewritten to include a session-specific ID
– Each subsequent request will supply this ID to the server
Heuristics
– Use a set of assumptions to identify user sessions and find the missing cache hits in the server log

WUM – Heuristics

The session identification heuristics
– Timeout: if the time between page requests exceeds a certain limit, it is assumed that the user is starting a new session
– IP/Agent: each different agent type for an IP address represents a different session
– Referring page: if the referring page file for a request is not part of an open session, it is assumed that the request is coming from a different session
– Same IP-Agent/different sessions (Closest): assigns the request to the session that is closest to the referring page at the time of the request
– Same IP-Agent/different sessions (Recent): in the case where multiple sessions are the same distance from a page request, assigns the request to the session with the most recent referrer access in terms of time
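The first two heuristics (Timeout and IP/Agent) can be sketched as follows. This is a simplified illustration under stated assumptions: requests arrive as hypothetical `(ip, agent, timestamp, url)` tuples, the 30-minute limit is the typical value mentioned earlier in these slides, and the referrer-based heuristics are not implemented.

```python
from datetime import datetime, timedelta

def sessionize(requests, timeout=timedelta(minutes=30)):
    """Group (ip, agent, timestamp, url) requests into sessions.

    IP/Agent heuristic: each (ip, agent) pair is treated as one visitor.
    Timeout heuristic: a gap longer than `timeout` starts a new session.
    """
    sessions = []
    open_session = {}   # (ip, agent) -> index of the visitor's latest session
    last_seen = {}      # (ip, agent) -> timestamp of the visitor's last request
    for ip, agent, ts, url in sorted(requests, key=lambda r: r[2]):
        key = (ip, agent)
        if key not in open_session or ts - last_seen[key] > timeout:
            sessions.append({"ip": ip, "agent": agent, "pages": []})
            open_session[key] = len(sessions) - 1
        sessions[open_session[key]]["pages"].append(url)
        last_seen[key] = ts
    return sessions

# Hypothetical requests: one IP address shared by two browser agents.
t0 = datetime(1999, 11, 10, 10, 0, 0)
requests = [
    ("10.0.0.1", "AgentA", t0, "/a"),
    ("10.0.0.1", "AgentA", t0 + timedelta(minutes=10), "/b"),
    ("10.0.0.1", "AgentB", t0 + timedelta(minutes=12), "/a"),
    ("10.0.0.1", "AgentA", t0 + timedelta(minutes=50), "/c"),
]
sessions = sessionize(requests)
for s in sessions:
    print(s["agent"], s["pages"])
```

The 40-minute gap before the request for `/c` exceeds the timeout, so that request opens a second session for the same IP/agent pair.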

Cont.

The path completion heuristics
– If the referring page file of a session is not part of the previous page file of that session, the user must have accessed a cached page
– The "back" button method is used to refer to a cached page
– Assigns a constant view time for each of the cached page files

WUM – Association Rule Generation

Discovers the correlations between pages that are most often referenced together in a single server session

Provides the information
– What are the sets of pages frequently accessed together by Web users?
– What page will be fetched next?
– What are the paths frequently accessed by Web users?

Association rule

A ⇒ B [ Support = 60%, Confidence = 80% ]

Example

“50% of visitors who accessed URLs /infor-f.html and labo/infos.html also visited situation.html”
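Support and confidence of a page association rule can be computed directly from sessions. The sketch below uses the standard definitions (support is the fraction of sessions containing all pages of the rule; confidence is that support divided by the support of the antecedent). The page names and the five sessions are made-up illustration data.

```python
def support(sessions, pages):
    """Fraction of sessions that contain every page in `pages`."""
    pages = set(pages)
    return sum(pages <= session for session in sessions) / len(sessions)

def confidence(sessions, antecedent, consequent):
    """Of the sessions containing the antecedent, the fraction that also
    contain the consequent (assumes the antecedent occurs at least once)."""
    both = set(antecedent) | set(consequent)
    return support(sessions, both) / support(sessions, antecedent)

# Hypothetical sessions, each reduced to the set of pages visited.
sessions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "B"},
    {"D"},
]

rule_support = support(sessions, {"A", "B", "C"})
rule_confidence = confidence(sessions, {"A", "B"}, {"C"})
print(rule_support, rule_confidence)
```

Read in the style of the example above: half of the visitors who accessed both A and B also visited C.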

Associations & Correlations

Page associations from usage data
– User sessions
– User transactions
Page associations from content data
– similarity based on content analysis
Page associations based on structure
– link connectivity between pages
==> Obtain frequent itemsets

Examples:

60% of clients who accessed /products/, also accessed /products/software/webminer.htm.

30% of clients who accessed /special-offer.html, placed an online order in /products/software/.

(Example from IBM official Olympics Site)
{Badminton, Diving} ===> {Table Tennis} (69.7%, .35%)

WUM – Clustering

Groups together a set of items having similar characteristics

User Clusters
– Discover groups of users exhibiting similar browsing patterns
Page recommendation
– User's partial session is classified into a single cluster
– The links contained in this cluster are recommended

Cont..

Page clusters
– Discover groups of pages having related content
Usage based frequent pages
Page recommendation
– The links are presented based on how often URL references occur together across user sessions

Website Usage Analysis

Why develop a Website usage / utilization analysis tool?
Knowledge about how visitors use a Website could
- Prevent disorientation and help designers place important information/functions exactly where visitors look for them and in the way users need them
- Build up an adaptive Website server

Clustering and Classification

Clients who often access /products/software/webminer.html tend to be from educational institutions.

Clients who placed an online order for software tend to be students in the 20-25 age group and live in the United States.

75% of clients who download software from /products/software/demos/ visit between 7:00 and 11:00 pm on weekends.


Discover user navigation patterns in using the Website
- Establish an aggregated log structure as a preprocessor to reduce the search space before the actual log mining phase
- Introduce a model for Website usage pattern discovery by extending the classical mining model, and establish the processing framework of this model

Sequential Patterns & Clusters

30% of clients who visited /products/software/, had done a search in Yahoo using the keyword “software” before their visit

60% of clients who placed an online order for WEBMINER, placed another online order for software within 15 days


Website client-server architecture facilitates recording user behaviors in every step by
- submitting client-side log files to the server when users use clear functions or exit windows/modules
The special design for local and universal back/forward/clear functions makes the user's navigation pattern clearer for the designer by
- analyzing local back/forward history and incorporating it with universal back/forward history


What will be included in SUA
1. Identify and collect log data
2. Transfer the data to the server side and save them in a structure suitable for analysis
3. Prepare mined data by establishing a customized aggregated log tree/frame
4. Use modifications of the typical data mining methods, particularly an extension of a traditional sequence discovery algorithm, to mine user navigation patterns


Problems that need to be considered:
- How to identify the log data when a user goes through an uninteresting function/module
- What marks the end of a user session?
- Clients connect to the Website through proxy servers

Differences between Website usage analysis and common Web usage mining
- Client-side log files are available
- Log file format (Web server log files typically follow the Common Log Format, a de facto logging convention rather than a part of the HTTP protocol itself)
- Log file cleaning/filtering is not necessary (it is usually performed in the preprocessing step of Web log mining)

Web Usage Mining - Patterns Discovery Algorithms

(Chen et. al.) Design algorithms for Path Traversal Patterns, finding maximal forward references and large reference sequences.

Path Traversal Patterns

Procedure for mining traversal patterns:
(Step 1) Determine maximal forward references from the original log data (Algorithm MF)
(Step 2) Determine large reference sequences (i.e., $L_k$, $k \geq 1$) from the set of maximal forward references (Algorithm FS and SS)
(Step 3) Determine maximal reference sequences from large reference sequences

Focus on Steps 1 and 2, and devise algorithms for the efficient determination of large reference sequences
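Step 1 can be sketched as follows. This is a reading of the maximal forward reference idea, not a transcription of Chen et al.'s Algorithm MF: a forward path is extended while the user visits new pages, and the path is emitted as a maximal forward reference when the user first moves backward to a page already on the path. The function name and the sample traversal are illustrative assumptions.

```python
def maximal_forward_references(traversal):
    """Split one user's traversal into maximal forward references."""
    path = []            # current forward path from the starting page
    references = []      # maximal forward references found so far
    moving_forward = False
    for page in traversal:
        if page in path:
            # Backward reference: emit the path once, then back up to `page`.
            if moving_forward:
                references.append(list(path))
            path = path[: path.index(page) + 1]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    if moving_forward and path:
        references.append(list(path))   # the traversal ended on a forward move
    return references

# Hypothetical click stream of a single user session.
traversal = list("ABCDCBEGHGWAOUOV")
refs = ["".join(r) for r in maximal_forward_references(traversal)]
print(refs)
```

Each emitted path is a candidate input for Step 2, where large reference sequences are counted across all users.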

Determine large reference sequences

Algorithm FS:
– Utilizes the key ideas of algorithm DHP: employs hashing and pruning techniques
– DHP is very efficient for the generation of candidate itemsets, in particular for the large two-itemsets, thus greatly relieving the performance bottleneck of the whole process
Algorithm SS:
– employs hashing and pruning techniques to reduce both CPU and I/O costs
– by properly utilizing the information in candidate references in prior passes, is able to avoid database scans in some passes, thus further reducing the disk I/O cost

Patterns Analysis Tools

WebViz [pitkwa94]: provides appropriate tools and techniques to understand, visualize, and interpret access patterns.

[dyreua et al]: proposes OLAP techniques such as data cubes for the purpose of simplifying the analysis of usage statistics from server access logs.

Patterns Discovery and Analysis Tools

The emerging tools for user pattern discovery use sophisticated techniques from AI, data mining, psychology, and information theory to mine for knowledge from collected data:
– (Pirolli et. al.) use information foraging theory to combine path traversal patterns, Web page typing, and site topology information to categorize pages for easier access by users.

(Cont’d)

WEBMINER:
– introduces a general architecture for Web usage mining, automatically discovering association rules and sequential patterns from server access logs.
– proposes an SQL-like query mechanism for querying the discovered knowledge in the form of association rules and sequential patterns.
WebLogMiner
– Web log is filtered to generate a relational database
– Data mining on web log data cube and web log database

WEBMINER

SQL-like Query
A framework for Web mining: the application of data mining and knowledge discovery techniques (association rules and sequential patterns) to Web data:
– Association rules: using apriori algorithm
  40% of clients who accessed the Web page with URL /company/products/product1.html, also accessed /company/products/product2.html
– Sequential patterns: using modified apriori algorithm
  60% of clients who placed an online order in /company/products/product1.html, also placed an online order in /company/products/product4.html within 15 days

WebLogMiner

Database construction from server log file: data cleaning data transformation

Multi-dimensional web log data cube construction and manipulation

Data mining on web log data cube and web log database

Mining the World-Wide Web

Design of a Web Log Miner
– Web log is filtered to generate a relational database
– A data cube is generated from database
– OLAP is used to drill-down and roll-up in the cube
– OLAM is used for mining interesting knowledge
[Figure: Web log → 1 Data Cleaning → Database → 2 Data Cube Creation → Data Cube → 3 OLAP → Sliced and diced cube → 4 Data Mining → Knowledge]

Construction of Data Cubes (http://db.cs.sfu.ca/sections/publication/slides/slides.html)
[Figure: data cube with dimensions Amount (0-20K, 20-40K, 40-60K, 60K-, sum), Province (B.C., Prairies, Ontario, ..., sum) and Discipline (Comp_Method, Database, ..., sum); one cell is labeled "All Amount, Comp_Method, B.C.".]
Each dimension contains a hierarchy of values for one attribute
A cube cell stores aggregate values, e.g., count, sum, max, etc.
A "sum" cell stores dimension summation values.
Sparse-cube technology and MOLAP/ROLAP integration.
"Chunk"-based multi-way aggregation and single-pass computation.
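The idea of cube cells and "sum" cells can be sketched with a small in-memory aggregation. This is an illustrative toy, not the sparse-cube or chunk-based computation named above: every input row contributes to each combination of its own dimension values and the roll-up value "sum". The dimension names echo the figure; the rows and counts are made up.

```python
from collections import defaultdict
from itertools import product

ALL = "sum"  # marker for a dimension that has been rolled up

def build_cube(rows, dims, measure):
    """Aggregate `measure` for every combination of kept/rolled-up dims."""
    cube = defaultdict(int)
    for row in rows:
        # Each dimension either keeps its own value or rolls up to ALL.
        options = [(row[d], ALL) for d in dims]
        for cell in product(*options):
            cube[cell] += row[measure]
    return dict(cube)

# Hypothetical fact rows with two dimensions and a count measure.
rows = [
    {"province": "B.C.", "discipline": "Comp_Method", "count": 3},
    {"province": "B.C.", "discipline": "Database", "count": 2},
    {"province": "Ontario", "discipline": "Database", "count": 4},
]
cube = build_cube(rows, dims=("province", "discipline"), measure="count")

print(cube[("B.C.", ALL)])       # roll-up over discipline for B.C.
print(cube[(ALL, "Database")])   # roll-up over province for Database
print(cube[(ALL, ALL)])          # grand total
```

Rolling up corresponds to reading a cell with more dimensions set to "sum"; drilling down goes the other way.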

WebLogMiner Architecture

Web log is filtered to generate a relational database
A data cube is generated from database
OLAP is used to drill-down and roll-up in the cube
OLAM is used for mining interesting knowledge



1. Page-Rank Method
2. CLEVER Method
3. Connectivity-Server Method

1. Page-Rank Method

Introduced by Brin and Page (1998)
Mines the hyperlink structure of the web to produce a 'global' importance ranking of every web page
Used in Google Search Engine
Web search result is returned in the rank order
Treats a link like an academic citation
Assumption: highly linked pages are more 'important' than pages with a few links
A page has a high rank if the sum of the ranks of its back-links is high

Page Rank: Computation

Assume:
$R(u)$: rank of a web page $u$
$F_u$: set of pages which $u$ points to
$B_u$: set of pages that point to $u$
$N_u$: number of links from $u$
$c$: normalization factor
$E(u)$: vector of web pages as source of rank

Page Rank computation:

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c\,E(u)$$

Page Rank: Implementation

Stanford WebBase project
– Complete crawling and indexing system with a current repository of 24 million web pages (old data)
Store each URL as a unique integer and each hyperlink using integer IDs
Remove dangling links by iterative procedures
Make initial assignment of the ranks
Propagate page ranks in iterative manner
Upon convergence, add the dangling links back and recompute the rankings
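The iterative propagation can be sketched on a toy graph. The code uses the widely known damped form of the computation, in which the source of rank is uniform over all pages; the damping value 0.85, the uniform source, and the handling of dangling pages (their rank is spread evenly instead of the links being removed and re-added) are assumptions of this sketch and differ from the procedure listed above.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively propagate rank over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # initial assignment of ranks
    for _ in range(iterations):
        # Uniform source of rank, playing the role of the E(u) term.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for u in pages:
            targets = links.get(u, [])
            if targets:
                # u passes R(u) / N_u to each page it points to.
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling page: spread its rank evenly (sketch assumption).
                share = damping * rank[u] / n
                for v in pages:
                    new_rank[v] += share
        rank = new_rank
    return rank

# Hypothetical three-page web: A links to B and C, B to C, C back to A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Page C collects rank from both A and B, so it ends up ranked highest, which matches the assumption that a page is important when the ranks of its back-links sum high.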

Page Rank: Results

Google utilizes a number of factors to rank the search results: proximity, anchor text, page rank

The benefits of Page Rank are the greatest for underspecified queries. Example: a 'Stanford University' query using Page Rank lists the university home page first

Page Rank: Advantages

Global ranking of all web pages – regardless of their content, based solely on their location in web graph structure

Higher quality search results – central, important, and authoritative web pages are given preference

Help find representative pages to display for a cluster center

Other applications: traffic estimation, back-link predictor, user navigation, personalized page rank

Mining structure of web graph is very useful for various information retrieval

References

L. Page, S. Brin, "PageRank: Bringing order to the Web," Stanford Digital Libraries working paper 1997-0072.

Chakrabarti, Dom, Kumar, “Mining the link structure of the World Wide Web,” IEEE Computer, 32(8), August 1999

K. Bharat, A. Broder, “The Connectivity Server: Fast access to linkage information on the Web.” In Ashman and Thistlewaite [2], pages 469--477. Brisbane, Australia, 1998

B. Allan, “Finding Authorities and Hubs from Link Structures on the World Wide Web”, ACM, May 2001

Jeffrey Dean “Finding Related Pages in the World Wide Web” http://citeseer.nj.nec.com/dean99finding.html

A. Z. Broder,… Graph structure in the web: experiments and models. Proc. 9th WWW Conf., 2000.

S. R. Kumar,… Trawling emerging cyber-communities automatically. Proc. 8th WWW Conf., 1999.

References

Principles of Data Mining, Hand, Mannila, Smyth. MIT Press, 2001.

Notes from Dr. M.V. Ramakrishna, http://goanna.cs.rmit.edu.au/~rama/cs442/info.html

Notes from CS 395T: Large-Scale Data Mining, Inderjit Dhillon, http://www.cs.utexas.edu/users/inderjit/courses/dm2000.html

Link Analysis in Web Information Retrieval, Monika Henzinger. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2000. research.microsoft.com/research/db/debull/A00sept/henzinge.ps

Slides from Data Mining: Concepts and Techniques, Han and Kamber, Morgan Kaufmann, 2001.

1. J. Srivastava, R. Cooley, M. Deshpande, Pang-Ning Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, SIGKDD Explorations, Vol. 1, Issue 2, 2000.

2. B. Mobasher, R. Cooley and J. Srivastava, Web Mining: Information and Pattern Discovery on the World Wide Web, Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), November 1997.

3. B. Mobasher, Namit Jain, Eui-Hong (Sam) Han, Jaideep Srivastava. Web Mining: Pattern Discovery from World Wide Web Transactions. Technical Report TR 96-060, University of Minnesota, Dept. of Computer Science, Minneapolis, 1996

4. R. Cooley, P. N. Tan, and J. Srivastava. (1999). WebSIFT: the Web site information filter system. In Proceedings of the 1999 KDD Workshop on Web Mining, San Diego, CA. Springer-Verlag, in press.

5. R. W. Cooley, Web Usage Mining: Discovery and Application of Interesting Patterns from Web data. PhD Thesis, Dept of Computer Science, University of Minnesota, May 2000.

6. Cooley, R., Mobasher, B., and Srivastava, J. Web Mining: Information and pattern Discovery on the World Wide Web. IEEE Computer, pages 558-566, 1997.

7. Etzioni, O. The world wide web: Quagmire or gold mine. Communications of the ACM, 39(11):65-68, 1996.

8. Kosala, R. and Blockeel, H. Web Mining Research: A summary. SIGKDD Explorations, 2(1):1-15, 2000.

Fayyad, U., Djorgovski, S., and Weir, N. Automating the analysis and cataloging of sky surveys. In Advances in Knowledge Discovery and Data Mining, pages 471-493. AAAI Press, 1996.

Langley, P. User modeling in adaptive interfaces. In Proceedings of the Seventh International Conference on User Modeling, pages 357-370, 1999.

Madria, S.K., Bhowmick, S.S., Ng, W.K., and Lim, E.-P. Research issues in web data mining. In Proceedings of Data Warehousing and Knowledge Discovery, First International Conference, DaWaK ‘99, pages 303-312, 1999.

Masand, B. and Spiliopoulou, M. Webkdd-99: Workshop on web usage analysis and user profiling. SIGKDD Explorations, 1(2), 2000.

Smyth, P., Fayyad, U.M., Burl, M.C., and Perona, P. Modeling subjective uncertainty in image annotation. In Advances in Knowledge Discovery and Data Mining, pages 517-539, 1996.

Spiliopoulou, M. Data mining for the web. In Principles of Data Mining and Knowledge Discovery, Second European Symposium, PKDD ‘99, pages 588-589, 1999.

Srivastava, J., Cooley, R., Deshpande, M., and Tan, P.-N. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2), 2000.

Zaiane, O.R., Xin, M., and Han, J. Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs. IEEE, pages 19-29, 1998.

Page Ranking

o The PageRank Citation Ranking: Bringing Order to the Web (1998), Larry Page, Sergey Brin, R. Motwani, T. Winograd, Stanford Digital Library Technologies Project.

o Authoritative Sources in a Hyperlinked Environment (1998), Jon Kleinberg, Journal of the ACM.

o The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Sergey Brin, Lawrence Page, Computer Networks and ISDN Systems.

o Web Search Via Hub Synthesis (1998) Dimitris Achlioptas, Amos Fiat, Anna R. Karlin, Frank McSherry.

o What is this Page Known for? Computing Web Page Reputations (2000) Davood Rafiei, Alberto O Mendelzon.

o Link Analysis in Web Information Retrieval, Monika Henzinger. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2000.

Finding Authorities and Hubs From Link Structures on the World Wide Web, Allan Borodin, Gareth O. Roberts, Jeffrey S. Rosenthal, Panayiotis Tsaparas, 2002.

Web Communities and Classification

Enhanced hypertext categorization using hyperlinks (1998) Soumen Chakrabarti, Byron Dom, and Piotr Indyk, Proceedings of SIGMOD-98, ACM International Conference on Management of Data.

Automatic Resource list Compilation by Analyzing Hyperlink Structure and Associated Text (1998) S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan, Proceedings of the 7th International World Wide Web Conference.

Inferring Web Communities from Link Topology (1998) David Gibson, Jon Kleinberg, Prabhakar Raghavan, UK Conference on Hypertext.

o Trawling the web for emerging cyber-communities (1999) Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, WWW8 / Computer Networks.

o Finding Related Pages in the World Wide Web (1999) Jeffrey Dean, Monika R. Henzinger, WWW8 / Computer Networks.

o A System for Collaborative Web Resource Categorization and Ranking, Maxim Lifantsev.

A Study of Approaches to Hypertext Categorization (2002) Yiming Yang, Sean Slattery, Rayid Ghani, Journal of Intelligent Information Systems.

o Hypertext Categorization using Hyperlink Patterns and Meta Data (2001) Rayid Ghani, Sean Slattery, Yiming Yang.
