group: highflyers

Data Mining for Web Personalization Presented by the Highflyers group

Upload: tommy96

Post on 28-Jan-2015

TRANSCRIPT

Page 1: group: highflyers

Data Mining for Web Personalization
Presented by the Highflyers group

Page 2: group: highflyers

Who are the Highflyers?

• Irfan Butt – Introduction and Traditional Approaches to Web Personalization
• Joel Gascoigne – Data Collection, Preprocessing and Modelling
• James Silver – Pattern Discovery: Predictive Web User Modelling, Part 1
• Aaron John-Baptiste – Pattern Discovery: Predictive Web User Modelling, Part 2
• Asad Qazi – Evaluating Personalized Models and Conclusion

Page 3: group: highflyers

Introduction

•Paper titled: Data Mining for Web Personalization

•Author: Bamshad Mobasher

Page 4: group: highflyers

Irfan Butt
Introduction and Traditional Approaches to Web Personalization

Page 5: group: highflyers

Introduction to Web Personalization

• Personalization
  ▫ Delivery of content tailored to a particular user
• Web Personalization
  ▫ Delivery of dynamic content, such as text and links, tailored to a particular user or segment of users

Page 6: group: highflyers

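Automatic Personalization vs. Customization

• Similarity: both refer to the delivery of tailored content
• Difference: how the user profile is created and updated (manually by the user in customization, by the system in automatic personalization)
• Examples
  ▫ Customization: My Yahoo, Dell website
  ▫ Automatic personalization: Amazon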

Page 7: group: highflyers

Personalization in Traditional Approaches

• Two phases in the process of personalization
  1) Data collection phase
  2) Learning phase
• Classification based on learning from data
  1. Memory-based learning (lazy)
     ▫ Examples: user-based collaborative system, content-based filtering system
  2. Model-based learning (eager)
     ▫ Examples: item-based system

Page 8: group: highflyers

Memory Based Learning VS Model Based Learning

• Memory-based learning (lazy)
  ▫ Huge memory required
  ▫ Scalability issues
  ▫ Adaptable to changes
• Model-based learning (eager)
  ▫ Limited memory required
  ▫ Easily scalable
  ▫ Learning phase is offline
  ▫ Not adaptable to changes

Page 9: group: highflyers

Traditional Approaches to Web Personalization

• Rule-Based Personalization Systems
  ▫ Rules are used to recommend items
  ▫ Rules are based on personal characteristics of the user
  ▫ Static profiles cause the system to degrade over time

Page 10: group: highflyers

Traditional Approaches to Web Personalization

• Content-based Filtering Systems

▫User profile built on content descriptions of items

▫Profile based on previous rating of items

Page 11: group: highflyers

Traditional Approaches to Web Personalization

• Collaborative Filtering Systems
  ▫ A single profile is built in the same way as in content-based filtering systems
  ▫ Items from more than one profile are used to recommend a new item or content
  ▫ These profiles are the K nearest neighbours, selected by the previous item ratings in each profile
  ▫ Poor results as the system grows
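The neighbour-based prediction described on this slide can be sketched roughly as follows. This is a minimal illustration, not the method of any particular system: the cosine similarity measure, the similarity-weighted average, the function names and the sample ratings are all assumptions made for the example.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two rating dicts (missing items count as 0)."""
    dot = sum(a[i] * b[i] for i in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict(profiles, target, item, k=2):
    """Similarity-weighted average of the k nearest neighbours' ratings for item."""
    # Only users who actually rated the item can act as neighbours.
    neighbours = sorted(
        ((cosine(profiles[target], p), p[item])
         for user, p in profiles.items()
         if user != target and item in p),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return None  # no usable neighbour: the score stays undefined
    return sum(sim * rating for sim, rating in neighbours) / total

# Hypothetical ratings, invented for illustration only.
profiles = {
    "alice": {"a": 5, "b": 3},
    "bob": {"a": 5, "b": 3, "c": 4},
    "carol": {"a": 1, "b": 5, "c": 2},
}
nearest_only = predict(profiles, "alice", "c", k=1)
both = predict(profiles, "alice", "c", k=2)
```

Every prediction compares the target against all stored profiles, which is one way to see why results and response times suffer as the number of users grows.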

Page 12: group: highflyers

Data Mining Approach to Personalization

• Data Mining (or Web Usage Mining)
  ▫ The automatic discovery and analysis of patterns in clickstream and associated data collected or generated as a result of user interactions with Web resources on one or more Web sites
• Data Mining Cycle:
  ▫ Data preparation and transformation phase
  ▫ Pattern discovery phase
  ▫ Recommendation phase

Page 13: group: highflyers

Joel Gascoigne
Data Collection, Preprocessing and Modelling

Page 14: group: highflyers

Data Modelling and Representation

• Assume the existence of a set of m users:
  ▫ U = {u1, u2, …, um}
• Set of n items:
  ▫ I = {i1, i2, …, in}

Page 15: group: highflyers

Data Modelling and Representation

• The profile for a user u ∈ U is an n-dimensional vector of ordered pairs:
  ▫ u(n) = {(i1, s_u(i1)), (i2, s_u(i2)), …, (in, s_u(in))}, where s_u(i) is the interest score of user u for item i
• Typically, such profiles are collected over time and stored
  ▫ Can be represented as an m x n matrix, UP (one row per user, one column per item)

Page 16: group: highflyers

Data Modelling and Representation

• A Personalisation System, PS, can be viewed as a mapping of user profiles and items to a rating of interest
• The mapping is not generally defined for the whole domain of user-item pairs
  ▫ The system must predict the missing interest scores
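As a rough sketch of this representation, the snippet below stores the profiles as a matrix with one row per user and one column per item, and treats the personalisation system as a function that returns an observed score when one exists and otherwise falls back to a prediction. The item-mean fallback, the names `observed` and `interest`, and the sample scores are assumptions made for illustration; the slides do not prescribe a predictor.

```python
users = ["u1", "u2", "u3"]
items = ["i1", "i2", "i3"]

# Observed interest scores s_u(i); user-item pairs with no score are absent.
observed = {
    ("u1", "i1"): 1.0, ("u1", "i2"): 0.5,
    ("u2", "i1"): 0.8, ("u2", "i3"): 0.2,
    ("u3", "i2"): 0.4,
}

# UP: one row per user, one column per item; None marks an undefined pair.
UP = [[observed.get((u, i)) for i in items] for u in users]

def interest(u, i):
    """PS(u, i): the observed score if defined, else a placeholder prediction."""
    row, col = users.index(u), items.index(i)
    if UP[row][col] is not None:
        return UP[row][col]
    # Placeholder predictor (assumption): mean of the other users' scores for i.
    column = [r[col] for r in UP if r[col] is not None]
    return sum(column) / len(column) if column else None

known = interest("u1", "i1")      # defined in the data
predicted = interest("u3", "i1")  # undefined, so it is predicted
```

In the data mining approach the placeholder would be replaced by a model learned from UP, such as the clustering or association-rule models covered later in the deck.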

Page 17: group: highflyers

Data Modelling and Representation

• This general framework can be used with most approaches to personalisation
• In the data mining approach:
  ▫ A variety of machine learning techniques are applied to UP to discover aggregate user models
  ▫ These user models are used to make a prediction for the target user

Page 18: group: highflyers

Data Sources for Web Usage Mining

• The main data sources used in web usage mining are server log files
  ▫ Clickstream data
• Other data sources include the site files and meta-data

Page 19: group: highflyers

Data Sources for Web Usage Mining

• This data needs to be abstracted
  ▫ Pageview: a representation of a collection of web objects
  ▫ Session: a sequence of pageviews by a single user
• All sessions belonging to a user can be aggregated to create the profile for that user

Page 20: group: highflyers

Data Sources for Web Usage Mining

• Content data
  ▫ Collection of objects and relationships conveyed to the user: text, images
  ▫ Also, semantic or structural meta-data embedded within the site
  ▫ Domain ontology: could use an ontology language such as RDF, or a database schema

Page 21: group: highflyers

Data Sources for Web Usage Mining

• Also, operational databases for the site may include additional information about users and items
  ▫ Geographic information
  ▫ User ratings

Page 22: group: highflyers

Primary Tasks in Data Preprocessing for Web Usage Mining

Page 23: group: highflyers

Data Preprocessing for Web Usage Mining

• Goal:
  ▫ Transform clickstream data into a set of user profiles
• This "sessionized" data can be used as the input for a variety of data mining algorithms, or abstracted further

Page 24: group: highflyers

Data Preprocessing for Web Usage Mining

• Tasks in usage data preprocessing:
  ▫ Data Fusion
  ▫ Data Cleaning
  ▫ Pageview Identification
  ▫ Sessionization
  ▫ Episode Identification

Page 25: group: highflyers

Data Preprocessing for Web Usage Mining

• Data Fusion:
  ▫ Merging of log files from web and application servers
• Data Cleaning:
  ▫ Tasks such as:
    - Removing extraneous references to embedded objects
    - Removing references due to spider navigations

Page 26: group: highflyers

Data Preprocessing for Web Usage Mining

• Pageview Identification:
  ▫ Aggregation of a collection of objects or pages that should be considered a unit
  ▫ This process depends on the linkage structure of the site
  ▫ In the simplest case, each HTML file corresponds one-to-one with a pageview
  ▫ Must distinguish between users, using an authentication system or cookies

Page 27: group: highflyers

Data Preprocessing for Web Usage Mining

• Sessionization:
  ▫ Process of segmenting the activity log of each user into sessions, each representing a single visit to the site
• Episode Identification:
  ▫ An episode is a subsequence of a session comprised of related pageviews
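One common way to sessionize a log is a time-out heuristic: a new session starts whenever the gap between two consecutive requests from the same user exceeds a threshold. The sketch below assumes that heuristic, a 30-minute threshold, and a log of `(user, timestamp, pageview)` tuples; none of these details come from the slides.

```python
from datetime import datetime, timedelta

def sessionize(log, timeout=timedelta(minutes=30)):
    """Split (user, timestamp, pageview) records into per-user sessions."""
    sessions = {}   # user -> list of sessions, each a list of pageviews
    last_seen = {}  # user -> timestamp of that user's previous request
    for user, ts, page in sorted(log, key=lambda r: (r[0], r[1])):
        gap_exceeded = user in last_seen and ts - last_seen[user] > timeout
        if user not in sessions or gap_exceeded:
            sessions.setdefault(user, []).append([])  # open a new session
        sessions[user][-1].append(page)
        last_seen[user] = ts
    return sessions

# Hypothetical cleaned log, invented for illustration.
log = [
    ("u1", datetime(2015, 1, 28, 10, 0), "A"),
    ("u2", datetime(2015, 1, 28, 10, 1), "A"),
    ("u1", datetime(2015, 1, 28, 10, 5), "B"),
    ("u1", datetime(2015, 1, 28, 11, 0), "C"),
]
result = sessionize(log)
```

Here the 55-minute gap before `u1` requests page `C` opens a second session for that user. The function assumes users have already been told apart, for example through authentication or cookies.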

Page 28: group: highflyers

Data Preprocessing for Web Usage Mining

• These tasks ultimately result in a set of n pageviews
  ▫ P = {p1, p2, …, pn}
• A set of v user transactions
  ▫ T = {t1, t2, …, tv}
• A user transaction captures the activity of a user during a particular session

Page 29: group: highflyers

Data Preprocessing for Web Usage Mining

• Finally, one or more transactions or sessions associated with a given user can be aggregated to form the final profile for that user
  ▫ If the profile is generated from a single session, it represents short-term interests
  ▫ Aggregation of multiple sessions results in profiles that capture long-term interests

Page 30: group: highflyers

Data Preprocessing for Web Usage Mining

• The collection of these profiles comprises the m x n matrix UP, which can be used to perform various data mining tasks
• After the basic clickstream preprocessing steps, data from other sources is integrated:
  ▫ Content, structure and user data

Page 31: group: highflyers

James Silver
Pattern Discovery: Predictive Web User Modelling, Part 1

Page 32: group: highflyers

Model-Based Collaborative Techniques

• Two-stage recommendation process:
  ▫ (A) Offline model-building
  ▫ (B) Real-time scoring
  (Explicit and implicit user behavioural data are used)
• Offline model-building algorithms:
  (1) Clustering
  (2) Association Rule Discovery
  (3) Sequential Pattern Discovery
  (4) Latent Variable Models (part 2)

We also look at hybrid models (part 2)

Page 33: group: highflyers

(1) Clustering

• Clustering divides data into groups where:
  ▫ Inter-cluster similarities are minimised
  ▫ Intra-cluster similarities are maximised
• Generalization to Web usage mining
  ▫ User-based vs. item-based clustering
  ▫ Efficiency and scalability improvements

Page 34: group: highflyers

(1) Clustering: User-based

• User profiles
• Partitions the matrix UP
  ▫ Clusters represent user segments based on common navigational behaviour
• Recommendations (target user u, target item i)
  ▫ Centroid vector v_k computed for each cluster C_k
  ▫ Neighbourhood: all user segments that have a score for i and whose v_k is most similar to u
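The recommendation step can be sketched as follows, assuming the clusters have already been found offline. The cosine similarity, the choice to take the score from the single most similar centroid, and the convention that a centroid weight of 0 means "no score" are assumptions for this example.

```python
from math import sqrt

def centroid(cluster):
    """Mean vector v_k of the user profiles in cluster C_k."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict(clusters, u, item):
    """Score for item from the most similar centroid that has a score for it."""
    # Assumption: a centroid weight of 0 means the segment has no score.
    candidates = [v for v in map(centroid, clusters) if v[item] > 0]
    if not candidates:
        return None
    best = max(candidates, key=lambda v: cosine(u, v))
    return best[item]

# Hypothetical binary profiles over four pageviews, already grouped.
clusters = [
    [[1, 1, 0, 0], [1, 1, 1, 0]],  # segment C_1
    [[0, 0, 1, 1], [0, 1, 1, 1]],  # segment C_2
]
target = [1, 1, 0, 0]
score_item_2 = predict(clusters, target, 2)
score_item_3 = predict(clusters, target, 3)
```

Because the target is compared with a handful of centroids rather than every stored profile, the real-time scoring stage stays cheap even when the user base is large.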

Page 35: group: highflyers

(1) Clustering: Other

• Fuzzy Clustering
  ▫ Desirable to group users into many categories
• Distance issues
  ▫ Consider web transactions as sequences
• Association Rule Hypergraph Partitioning (ARHP)

Page 36: group: highflyers

(2) Association Rule Discovery

Finding groups of pages or items that are commonly accessed or purchased together

•Originally for mining supermarket basket data

• Discovering association rules involves:
  1) Discovering frequent itemsets
     ▫ Satisfying a minimum support threshold
  2) Discovering association rules
     ▫ Satisfying a minimum confidence threshold

Page 37: group: highflyers

(2) Association Rules: Concepts

• Transaction set T
• Itemsets I = {I1, I2, ..., Ik} over T
• An association rule r has the form X => Y (s_r, c_r)
  ▫ s_r = the support of X ∪ Y (i.e. the probability that X and Y occur together in a transaction)
  ▫ c_r = the confidence of the rule r (i.e. the conditional probability that Y occurs in a transaction, given that X has occurred in that transaction)
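Both measures follow directly from these definitions and can be computed by counting transactions. The sketch below uses invented transactions and hypothetical function names; it illustrates the definitions only and is not a frequent-itemset mining algorithm.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """Estimate of P(Y | X): support(X ∪ Y) / support(X)."""
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0
    return support(transactions, set(x) | set(y)) / support_x

# Hypothetical transactions over pages A, B and C.
T = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
s_r = support(T, {"A", "B"})       # A and B occur together in 2 of 4
c_r = confidence(T, {"A"}, {"B"})  # B occurs in 2 of the 3 containing A
```

A rule X => Y would be kept only if `s_r` and `c_r` meet the chosen minimum support and minimum confidence thresholds.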

Page 38: group: highflyers

(2) Recommendations

• Matching rule antecedents with target user profiles
  ▫ Sliding window solution
  ▫ Naive approach
  ▫ Frequent Itemset Graph
• Finding candidate pages:
  ▫ Match the current user session window with previously discovered frequent itemsets
• Recommendation value
  ▫ Confidence of the corresponding association rule

Page 39: group: highflyers

(2) Recommendations

Page 40: group: highflyers

(3) Sequential Models

• Now the order of items matters when discovering frequently occurring itemsets
• So, given the user transaction {i1, i2, i3}:
  ▫ Association rules (i1 => i2) and (i2 => i1) are both fine
  ▫ But the sequential pattern (i2 => i1) is not supported
• Two types of sequences, for the pattern i1, i2 => i3:
  ▫ Contiguous (closed) sequence: {i1, i2, i3}
  ▫ Open sequence: {i1, i2, i4, i3}
• Frequent navigational paths
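The difference between the two sequence types can be checked mechanically: a contiguous match requires the pattern items to appear next to each other, while an open match only requires them to appear in the same order. The function names below are hypothetical and the sessions are the ones used on the slide.

```python
def contains_contiguous(session, pattern):
    """True if pattern occurs in session as an unbroken run of pageviews."""
    n = len(pattern)
    return any(session[k:k + n] == pattern
               for k in range(len(session) - n + 1))

def contains_open(session, pattern):
    """True if pattern occurs in order, allowing other pageviews in between."""
    remaining = iter(session)
    # Each membership test consumes the iterator up to the matched item,
    # so later pattern items must appear later in the session.
    return all(page in remaining for page in pattern)

pattern = ["i1", "i2", "i3"]
closed_session = ["i1", "i2", "i3"]
open_session = ["i1", "i2", "i4", "i3"]
```

With these inputs the closed session satisfies both checks, while the open session satisfies only `contains_open`. The reversed pattern `["i2", "i1"]` fails both checks on `["i1", "i2", "i3"]`, which is why the sequential pattern (i2 => i1) is not supported.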

Page 41: group: highflyers

(3) Recommendations

• Trie structure (aggregate tree)
  ▫ Each node is an item; the root is the empty sequence
• Recommendation generation
  ▫ Found in O(s) by traversing the tree, where s is the length of the current user transaction deemed useful in recommending the next set of items
  ▫ Sliding window w: the maximum depth of the tree is therefore |w|+1
  ▫ Controlling the size of the tree

Page 42: group: highflyers

(3) Sequential Models: Contiguous

• Contiguous sequence patterns are particularly restrictive
  ▫ Valuable in page pre-fetching applications
  ▫ Rather than in the general context of recommendation generation

Page 43: group: highflyers

(3) Sequential Models: Markov

• Another approach to sequential modelling
  ▫ Based on stochastic methods
• Modelling the navigational activity in the website as a Markov chain

Page 44: group: highflyers

(3) Sequential Models: Markov

• A Markov model is represented by the 3-tuple <A, S, T>
  ▫ A: set of possible actions (items)
  ▫ S: set of n states for which the model is built (visitor's navigation history)
  ▫ T = [p_i,j] (n x n): transition probability matrix, where p_i,j is the probability of a transition from state s_i to state s_j
• Order: number of prior events used in predicting each future event
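For a first-order model, where each state is just the current page, the transition probabilities can be estimated by counting consecutive page pairs in past sessions. The sketch below assumes that maximum-likelihood estimate and stores T sparsely as nested dictionaries; the function names and sample sessions are invented for illustration.

```python
from collections import Counter, defaultdict

def transition_matrix(sessions):
    """Estimate p_i,j as count(s_i -> s_j) / count(s_i -> any state)."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for current, nexts in counts.items()
    }

def predict_next(T, state):
    """Most probable next state, or None if state has no outgoing transitions."""
    nexts = T.get(state)
    return max(nexts, key=nexts.get) if nexts else None

# Hypothetical navigation sessions.
sessions = [
    ["home", "products", "cart"],
    ["home", "products", "reviews"],
    ["home", "about"],
    ["products", "cart"],
]
T = transition_matrix(sessions)
next_after_home = predict_next(T, "home")
```

A higher-order model would use the last k pages as the state instead of only the current one, at the cost of a much larger state space.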

Page 45: group: highflyers

(3) Markov for Web-mining

• Designed to predict the next user action based on the user's previous surfing behaviour
• Also used to discover high-probability user navigational paths in a website
  ▫ User-preferred trails
• Various optimization methods
• Apart from Markov: mixture models

Page 46: group: highflyers

Aaron John-Baptiste
Pattern Discovery: Predictive Web User Modelling, Part 2

Page 47: group: highflyers

(4) Latent Variable Models (LVMs)

• Latent variables are variables that have not been directly observed but have instead been inferred
  ▫ E.g. morale is not measured directly but inferred
• Have more recently become popular as a modelling approach in web usage mining
• Two commonly used LVMs
  ▫ Finite Mixture Models (FMM)
  ▫ Factor Analysis (FA)

Page 48: group: highflyers

(4) FA and FMM

• Factor Analysis
  ▫ Aims to summarise and find relationships within the observed data (all data)
  ▫ Used in pattern recognition, collaborative filtering and personalization based on web usage mining
• Finite Mixture Models (FMM)
  ▫ Use a finite number of components to model an observation (a page view or a user rating)

Page 49: group: highflyers

(4) Drawbacks to pure usage-based models

• Pure usage-based models have drawbacks

▫Process relies on user transactions or rating data

▫New items or pages are therefore never recommended (“new item problem”)

▫Also do not use knowledge from underlying domain and so cannot make more complex recommendations

Page 50: group: highflyers

(5) Hybrid models

•Uses a combination of user-based and content-based modelling.

• Three main types used in web mining
  ▫ Integrating content features
  ▫ Integrating semantic knowledge
  ▫ Using linkage structure

Page 51: group: highflyers

(5) Integrating content features with usage-based models

• Solves the "new item problem"
  ▫ Use content characteristics of pages together with user-based data
  ▫ Extract keywords from content to be used to discover patterns
  ▫ Because the model does not rely on user data alone, new pages with relevant content can be recommended
  ▫ Users' interests can be mapped to content (concepts or topics)

Page 52: group: highflyers

(5) Integrating structured semantic knowledge with usage-based models

• Content feature integration is useful when pages are rich in text and keywords
• However, it cannot capture more complex relationships where items have underlying properties

•Idea is to take the underlying meanings of objects and add them to the user-based data. Recommendations can then be made to pages or items with similar semantic meanings

Page 53: group: highflyers

(5) Using linkage structure for model learning and selection

• Other semantic data can be used, such as relational databases and the hyperlink structure on a web page

•Mobasher proposes a hybrid recommendation system that switches between different algorithms based on the degree of connectivity in the site and user

•E.g. in a highly connected website, with short paths, non sequential models performed better

Page 54: group: highflyers

Asad Qazi
Evaluating Personalized Models and Conclusion

Page 55: group: highflyers

Evaluating Personalization models

The primary goal of this section is to evaluate the accuracy and effectiveness of web personalization models

Page 56: group: highflyers

Why Evaluate?

• More complex web-based applications and more complex user interactions require the selection of more sophisticated models

• Need to further explore the impact of the recommended model on user behaviour

• There are several different modelling approaches to web personalization

• Evaluating personalized models is inherently challenging. First, different models require different evaluation metrics. Second, the required personalization actions may differ considerably depending on the underlying domain, the relevant data and the intended application

• Finally, there is a lack of consensus among researchers as to which factors affect quality of service in personalized systems and which elements contribute to user satisfaction

Page 57: group: highflyers

Common evaluation approaches

• A number of metrics have been proposed in the literature for evaluating the robustness and predictive accuracy of a recommender system, including:
  ▫ Mean Absolute Error (MAE)
  ▫ Classification metrics (precision and recall)
  ▫ Receiver Operating Characteristic (ROC)
• Business metrics that measure customer loyalty and satisfaction, such as Recency, Frequency, Monetary (RFM)
• Other key dimensions alongside these metrics: accuracy, coverage, utility, explainability, robustness, scalability and user satisfaction
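The first two families of metrics are simple to compute once predictions and ground truth are available. The sketch below uses their standard definitions; the function names and the sample values are invented for illustration.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def precision_recall(recommended, relevant):
    """Precision: share of recommendations that are relevant.
    Recall: share of relevant items that were recommended."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical predicted vs. actual ratings for three items.
mae = mean_absolute_error([4, 3, 5], [5, 3, 3])

# Hypothetical recommendation list vs. items the user actually wanted.
precision, recall = precision_recall(["a", "b", "c", "d"], ["a", "c", "e"])
```

MAE suits systems that predict numeric ratings, while precision and recall suit systems that output a list of recommended items, which is one reason different models call for different metrics.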

Page 58: group: highflyers

Conclusions

• Web personalisation is viewed as an application of data mining which dynamically serves customized content (pages, products, recommendations, etc.) to users based on their profiles, preferences, or expected interests

• We have also discussed the various sources of data available to personalization systems, the modelling approaches employed and the current approaches to evaluating these systems

• Recent user studies have found that a number of issues can affect the perceived usefulness of personalization systems, including trust in the system, transparency of the recommendation logic, the ability for a user to refine the system-generated profile, and diversity of recommendations

• Most personalization systems tend to use a static profile of the user. However user interests are not static, changing with time and context. Few systems have attempted to handle the dynamics within the user profile.

Page 59: group: highflyers

Any Questions?