
Mining Rich Session Context to Improve Web Search

Guangyu Zhu and Gilad Mishne

in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2009), Paris, France, 2009.

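Data Mining Final Paper Report

彭嘉宏, Computer Science and Information Engineering, first-year master's student, 69821014

賴家瑞, Computer Science and Information Engineering, first-year PhD student, 89821004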

Author

• Guangyu Zhu ( 朱光宇 )
– Institute for Advanced Computer Studies
– University of Maryland
– College Park, MD 20742
– [email protected]
– http://www.ece.umd.edu/~zhugy/

• Gilad Mishne
– Yahoo! Labs
– 701 First Ave.
– Sunnyvale, CA 94089
– [email protected]
– http://labs.yahoo.com/user/155

Outline

• Author
• INTRODUCTION
– Web Search
– PageRank
– Web Session
• MINING WEB SESSIONS
• ClickRank
• Applications to web search
– Site ranking
– Page ranking
– Mining dynamic quicklinks
• CONCLUSIONS

INTRODUCTION

Web Search

PageRank

http://www.ndhu.edu.tw/

PageRank

http://www.nerdmodo.com/2009/07/what-is-a-pagerank/

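With each of the four pages starting at PageRank 0.25, and B, C, and D each linking only to A:

PageRank of A = PageRank of B + PageRank of C + PageRank of D = 0.25 + 0.25 + 0.25 = 0.75

Further, if page B also links to page D, B's PageRank is split between its two outgoing links, so the PageRank of A becomes:

PageRank of A = PageRank of B/2 + PageRank of C + PageRank of D = 0.125 + 0.25 + 0.25 = 0.625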
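The arithmetic in this example can be reproduced with a short sketch. It assumes four pages that each start with PageRank 0.25, no damping factor, and a single update step; the function and variable names are illustrative only.

```python
def pagerank_step(links, rank):
    """One simplified PageRank update: each page splits its current rank
    evenly among the pages it links to (no damping factor)."""
    new_rank = {page: 0.0 for page in rank}
    for page, targets in links.items():
        for target in targets:
            new_rank[target] += rank[page] / len(targets)
    return new_rank


# Assumed starting point: four pages with equal PageRank 0.25.
rank = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}

# Case 1: B, C, and D each link only to A.
links_1 = {"B": ["A"], "C": ["A"], "D": ["A"]}
pr_a_case_1 = pagerank_step(links_1, rank)["A"]

# Case 2: B additionally links to D, so B's vote for A is halved.
links_2 = {"B": ["A", "D"], "C": ["A"], "D": ["A"]}
pr_a_case_2 = pagerank_step(links_2, rank)["A"]

print(pr_a_case_1, pr_a_case_2)
```

A production PageRank iterates this update to convergence and adds a damping factor; the single step is enough to show how an extra outgoing link dilutes a page's vote.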

PageRank problems

1. First, user browsing behaviors are driven by intents, and they deviate significantly from the random surfer model that PageRank is based on.

2. Second, static modeling of the link structure favors old pages, because a new page is less likely to be linked to within a short period of time, even if it has very good quality.

3. Third, link structures are prone to manipulation, as adversarial links can be generated to artificially inflate ranking more quickly than quality links, which typically originate in manual editing.

4. Last, as the web grows at an explosive speed, computing page importance at web scale by link analysis becomes very computationally expensive, even with various optimization schemes.

Web Session

• http://www.seomoz.org/blog/controlling-search-engine-access-with-cookies-session-ids

HTTP Protocol

• Stateless
• Connectionless

Session identification

• We define a session as an active trail of user clicks represented by the URL referral structure

• A new session starts
– After 30 minutes of inactivity
– On occurrence of a URL without a referrer URL
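The two session-boundary rules can be sketched as follows. This is a minimal illustration, assuming each event carries a timestamp in seconds, a URL, and an optional referrer; the `Event` class and `split_sessions` function are hypothetical names, not part of the paper.

```python
from dataclasses import dataclass
from typing import List, Optional

SESSION_TIMEOUT_SECONDS = 30 * 60  # 30 minutes of inactivity


@dataclass
class Event:
    timestamp: float          # seconds (assumed unit)
    url: str
    referrer: Optional[str]   # None when the click has no referrer URL


def split_sessions(events: List[Event]) -> List[List[Event]]:
    """Split one user's events into sessions using the two rules:
    a gap longer than 30 minutes, or a URL without a referrer."""
    sessions: List[List[Event]] = []
    for event in sorted(events, key=lambda e: e.timestamp):
        starts_new = (
            not sessions
            or event.referrer is None
            or event.timestamp - sessions[-1][-1].timestamp > SESSION_TIMEOUT_SECONDS
        )
        if starts_new:
            sessions.append([event])
        else:
            sessions[-1].append(event)
    return sessions


events = [
    Event(0, "http://a.example/", None),                       # no referrer: new session
    Event(60, "http://a.example/x", "http://a.example/"),      # continues the trail
    Event(2000, "http://a.example/y", "http://a.example/x"),   # gap > 30 min: new session
    Event(2100, "http://b.example/", None),                    # no referrer: new session
]
sessions = split_sessions(events)
print([len(s) for s in sessions])
```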

Session data collected

• We used aggregate, anonymous general user behavior data collected by Yahoo! Toolbar
– 30 billion events over a 6-month period in 2008
– {cookie, timestamp, URL, referral URL, event attributes}
– No personal information in source data

Mining web sessions

Motivations

• To propose an efficient and scalable framework for mining general web user behavior data
– Query/click logs are useful, but limited (< 5% of traffic)
– All user actions count
– The web and web user behaviors both constantly evolve

• Focus on sessions of general web browsing activities
– A logical unit that is general across all categories
– To learn the preferences, intents, and judgment of users from rich contextual information

• To learn session context models to improve core web search ranking and other aspects of the web search experience

Session Clustering

• In this experiment, we mapped each URL to an event category based on five high-level intents:

– Search

– Email

– Information/reference

– Rich content (e.g., social networking and multimedia)

– Shopping
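The URL-to-intent mapping can be pictured as a lookup on the host name. The table below is entirely hypothetical (the slides do not say how URLs were categorized); it only shows the shape of such a mapping.

```python
from urllib.parse import urlparse

# Hypothetical host-to-intent table, for illustration only.
HOST_TO_INTENT = {
    "search.example.com": "search",
    "mail.example.com": "email",
    "wiki.example.org": "information",
    "video.example.com": "rich_content",
    "shop.example.com": "shopping",
}


def categorize(url: str) -> str:
    """Return the intent category for a URL, or 'unknown' if the host is not listed."""
    host = urlparse(url).hostname or ""
    return HOST_TO_INTENT.get(host, "unknown")


print(categorize("http://mail.example.com/inbox"))
print(categorize("http://other.example.net/"))
```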

Session Clustering

Search sessions account for less than 5% of users' online activities

A web session contains significantly richer activity context and diversity than a search session

Percentage of search sessions: 4.85%

Average events per session: 9.1

Average session duration (seconds): 420.3

Sessions per user per day: 15.5

Histogram session representation

• We compute a distribution of activities over structured intents, given a list of URLs and their intent interpretations

Histogram representation of the session

Session duration

Total number of events in the session

Sessions are highly diverse

Use PCA to reduce dimensions

The first 6 eigenvalues are significant

 UK   SE   MA   IN   RC   SH   Events   Duration
 22    0    0   77    0    0        9       1783
 16   50    0   33    0    0        6         92
  0    0    0    0  100    0       27       1032
 75   25    0    0    0    0        4         73
 33    6    0   60    0    0       15       1563
 80    0    0    0    0   20        5         88
 75    0    0    0    0   25        4         38
 20   60   20    0    0    0        5        378
  0    0    0    0    0  100        2        227
 12    0   87    0    0    0        8        928

7-dimensional feature vector for each session
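A session's feature vector can be built as in the sketch below. It assumes the seven dimensions are the five intent percentages plus the event count and the duration; that reading, the intent names, and the function name are assumptions for illustration.

```python
from collections import Counter

# Assumed intent order for the histogram part of the vector.
INTENTS = ["search", "email", "information", "rich_content", "shopping"]


def session_features(intents, duration_seconds):
    """Return [percentage per intent..., total events, duration in seconds]."""
    total = len(intents)
    if total == 0:
        raise ValueError("a session needs at least one event")
    counts = Counter(intents)
    histogram = [100.0 * counts[name] / total for name in INTENTS]
    return histogram + [total, duration_seconds]


features = session_features(
    ["search", "search", "shopping", "shopping",
     "shopping", "search", "shopping", "shopping"],
    300,
)
print(features)
```

Because the histogram is expressed in percentages, sessions of very different lengths become comparable, while the last two dimensions keep the length and duration information.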

Session categorization

Table 2: Unsupervised clustering of session histograms reveals various Web user browsing patterns. Significant features associated with each cluster are highlighted in bold.

Session categorization

Cluster#
Attribute      Full Data         1         2         3         4         5         6         7         8         9        10
Share               100%     29.8%     16.6%     14.3%     11.9%     11.0%      4.7%      4.6%      3.5%      2.1%      1.5%
============================================================================================================================
Search            23.630     0.340    98.430     1.190     2.350     2.350    56.180    41.520    52.230     6.460     0.090
Mail              16.810     0.070     0.660    97.250     0.390     0.400     1.290    51.790     0.710     9.790     0.080
Information       12.260     0.040     0.270     0.390     1.030    96.500    24.580     2.650     0.500     5.970     0.020
Rich content      34.320    99.420     0.370     0.650     0.450     0.360     0.640     0.950    45.250    60.510    99.540
Shopping          12.850     0.080     0.240     0.410    95.670     0.290    16.920     2.600     0.860    16.840     0.060
Total events       9.040    11.140     2.890     5.660     6.250     5.330     4.240     5.380     4.260     7.850   151.680
Total time       420.300   532.490   261.370   303.850   235.780   298.910   228.400   455.580   218.010   439.780  4237.650

Cluster centroids

Intent-driven web browsing patterns emerge from session clusters

K-means clustering is sufficient to reveal meaningful intent patterns, such as long sessions of content browsing and query reformulation

Simple and effective

Browsing content-rich websites

Reformulating search queries

Reading email

Navigational queries

Addiction to content-rich websites

Collecting info during shopping

Informational queries
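The clustering step behind these patterns can be illustrated with a plain k-means (Lloyd's algorithm) sketch. The toy data uses only two histogram dimensions, and the starting centroids are supplied by hand so the run is deterministic; both choices are assumptions made for illustration, not the setup used in the paper.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm; `centroids` are caller-supplied starting centers."""
    centroids = [list(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Toy session histograms: [search %, rich content %].
points = [[98, 2], [95, 5], [3, 97], [1, 99]]
centroids, assignment = kmeans(points, centroids=[[100, 0], [0, 100]])
print(centroids, assignment)
```

Each resulting centroid is itself a histogram, which is why the centroids can be read directly as browsing patterns such as "mostly search" or "mostly rich content".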

ClickRank

ClickRank Overview

• ClickRank is derived from contextual indicators of user preferences and judgment in general web sessions
– Dwell time on the page
– Click order in the session
– Page load time
– Frequency of occurrence in the session

• Compute a local ClickRank function for each visited page in a session by incorporating session context models, and then aggregate these values to obtain the global ClickRank

Local ClickRank

• The local ClickRank function of a page in a session combines three components

– A weight function computed from the rank of the page visit event in the session

– A weight function computed from temporal information associated with browsing of the page

– An indicator function

ClickRank incorporates click order

• Define a weight function for the event at a given rank in a session with a given total number of events
– Motivated by experiments on implicit user preference judgments in Joachims et al., SIGIR 2005
– The weight is a monotonically decreasing function w.r.t. the rank of the event within a session
– The mean and variance of the local ClickRank function are finite
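As an illustration of a rank-based weight with these properties, the sketch below assumes a linearly decreasing weight, w(i, n) = 2(n + 1 - i) / (n(n + 1)), normalized so that the weights in a session sum to one. This specific form is an assumption chosen for the example and should not be read as the paper's definition.

```python
def rank_weight(rank: int, total_events: int) -> float:
    """Assumed rank weight: decreases linearly with the click's position
    and sums to 1 over the ranks 1..total_events."""
    if not 1 <= rank <= total_events:
        raise ValueError("rank must be between 1 and total_events")
    n = total_events
    return 2.0 * (n + 1 - rank) / (n * (n + 1))


weights = [rank_weight(i, 4) for i in range(1, 5)]
print(weights)
```

Since every weight lies between 0 and 1 and a session has finitely many events, any score built from such weights is bounded, which is consistent with the finite mean and variance noted on the slide.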

ClickRank incorporates temporal signals

• The paper defines another weight function to incorporate more temporal information; its inputs are the dwell time on the page and the page load time, both normalized w.r.t. the entire session

• The indicator function above defines a filter that factors in the time range of interest
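A temporal weight and a time-range indicator could look like the sketch below. The functional form is an assumption: it rewards longer normalized dwell time and penalizes longer normalized load time using exponentials, and the rate parameters are hypothetical knobs.

```python
import math


def temporal_weight(dwell, load, dwell_rate=1.0, load_rate=1.0):
    """Assumed form: grows with normalized dwell time, shrinks with
    normalized page load time. Both inputs are expected in [0, 1]."""
    return (1.0 - math.exp(-dwell_rate * dwell)) * math.exp(-load_rate * load)


def in_time_range(timestamp, start, end):
    """Indicator function: 1 if the event falls in the range of interest, else 0."""
    return 1 if start <= timestamp <= end else 0


# A page read for most of the session and loaded quickly scores higher
# than one glanced at briefly after a slow load.
print(temporal_weight(dwell=0.8, load=0.1))
print(temporal_weight(dwell=0.1, load=0.8))
print(in_time_range(150, start=100, end=200))
```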

Global ClickRank

• Given a set of web sessions, the global ClickRank is computed from local ClickRank functions by an aggregation function

• Aggregation operators to compute global ClickRank are more general
– Sum, average, and filter, e.g. by criteria such as time and demography
– Filtering sessions is much more flexible than filtering links
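The aggregation step can be sketched as a sum of per-session scores with an optional session filter. The local score used here is a stand-in (an assumed linearly decreasing rank weight summed over each page's visits), and all names are illustrative.

```python
from collections import defaultdict


def local_clickrank(session):
    """Assumed local score: sum of linearly decreasing rank weights
    over every visit to a page within one session."""
    n = len(session)
    scores = defaultdict(float)
    for rank, url in enumerate(session, start=1):
        scores[url] += 2.0 * (n + 1 - rank) / (n * (n + 1))
    return scores


def global_clickrank(sessions, keep=lambda session: True):
    """Sum local scores over all sessions accepted by the `keep` filter."""
    totals = defaultdict(float)
    for session in sessions:
        if not keep(session):
            continue
        for url, score in local_clickrank(session).items():
            totals[url] += score
    return dict(totals)


sessions = [["a", "b", "a"], ["b", "c"]]
all_scores = global_clickrank(sessions)
long_only = global_clickrank(sessions, keep=lambda s: len(s) >= 3)
print(all_scores, long_only)
```

Swapping the `keep` predicate is all it takes to restrict the ranking to a time window or a user segment, which is the flexibility the slide contrasts with filtering links in a graph.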

Theoretical framework of ClickRank

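• The local ClickRank function defines a random variable associated with a web page, given an observed session

• This random variable has finite mean and variance

• Convergence property: as the number of observed sessions grows, the sample average of the local ClickRank values converges to the mean by the strong law of large numbers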
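The convergence property can be written out as follows. The notation is assumed for this note (it is not recovered from the slides): $X_j$ is the local ClickRank of a fixed page $p$ in session $s_j$, the sessions are treated as independent and identically distributed, and $\mu_p$ is the finite mean.

```latex
\[
X_j = \mathrm{CR}(p, s_j), \qquad \mathbb{E}[X_j] = \mu_p < \infty, \qquad j = 1, \dots, N,
\]
\[
\frac{1}{N} \sum_{j=1}^{N} X_j \;\xrightarrow{\text{a.s.}}\; \mu_p
\quad \text{as } N \to \infty .
\]
```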

Relation to graph-based models

• ClickRank is based on an intentional surfer model

• ClickRank is data driven
– ClickRank does not embed rigid assumptions on the traversing scheme over the web
– Better reflects users' information needs and adapts faster to constantly changing user behaviors

• Significantly more efficient and scalable compared to approaches based on explicit graph formulations
– The ClickRank computational framework is well suited for distributed computing
– ClickRank can be computed incrementally
– One pass over the entire data and memory friendly
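With a sum aggregation, incremental and distributed computation reduces to merging partial totals, as the sketch below shows. The shard contents are made-up numbers; `merge` is a hypothetical helper built on `collections.Counter`, whose `update` method adds values key by key.

```python
from collections import Counter


def merge(total: Counter, partial: dict) -> Counter:
    """Fold one shard's (or one day's) partial scores into the running total."""
    total.update(partial)  # Counter.update adds to existing values
    return total


shard_1 = {"a": 0.5, "b": 0.5}      # illustrative partial sums
shard_2 = {"b": 0.25, "c": 0.75}

running = Counter()
merge(running, shard_1)
merge(running, shard_2)

# Merging in the opposite order gives the same totals.
reverse = Counter()
merge(reverse, shard_2)
merge(reverse, shard_1)

print(dict(running))
```

Because addition is associative and commutative, shards can be processed in parallel and new data can be folded in later without revisiting old sessions.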

Applications to web search

Applications to web search

• Datasets
– 3.3 billion web sessions extracted from Yahoo! Toolbar data over 6 months in 2008

• Site ranking
– Compute ClickRank of 16.3 million websites in 56 minutes

• Page ranking
– Compute ClickRank of 3.1 billion web pages in 1 hour and 32 minutes

Site ranking

• ClickRank is more reliable and richer than results computed using only static link structure

* The BrowseRank results are cited from Liu et al., SIGIR'08, which used MSN Toolbar data

Page ranking

• The ClickRank feature brings 1.02%, 0.97%, 1.11%, and 1.33% web search improvements in DCG(1), DCG(5), DCG(10), and NDCG, respectively

• A 1% gain over a production system is very significant

• ClickRank affects 81.2% of over 9,000 queries and covers 62.5% of documents

Query    Number of   Affected   Improvements in                      Significance test
length   queries     queries    DCG(1)   DCG(5)   DCG(10)   NDCG     p-value
1        1484        1232       0.45%    0.71%    1.00%     0.38%    5.33 × 10^-2
2        2992        2450       0.56%    0.99%    1.12%     1.07%    4.65 × 10^-4
3        2153        1722       1.62%    1.08%    1.41%     2.18%    1.10 × 10^-4
4+       2412        1937       0.92%    0.86%    0.78%     1.43%    1.61 × 10^-5
All      9041        7341       1.02%    0.97%    1.11%     1.33%    9.98 × 10^-5
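For readers unfamiliar with the metrics in this table, the sketch below computes DCG at a cutoff and NDCG for a ranked list of graded relevance labels. It uses one common formulation, gain (2^rel - 1) discounted by log2(rank + 1); whether the evaluation here used exactly this variant is an assumption.

```python
import math


def dcg(relevances, k):
    """DCG at cutoff k, using gain (2^rel - 1) and discount log2(rank + 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


ranking = [3, 2, 0, 1]  # graded relevance of results in ranked order
print(dcg(ranking, 1), dcg(ranking, 5), ndcg(ranking, 10))
```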

Mining dynamic quicklinks

• Many commercial search engines provide quick access links to popular destinations within the site

• These links are traditionally mined from search engine query logs
– Query or search session logs are limited in scope and coverage
– Query logs favor old, navigational links


• We demonstrate ClickRank for discovering recent, dynamic content

• We adapt the time range in the temporal weight function w.r.t. the content refresh rate found by crawler

• Use the indicator function as a term that specifies recency of the content

Conclusion

• This paper expands the use of general user behavior data for web search ranking and other applications

• This paper introduces ClickRank, an efficient, scalable algorithm for estimating web page importance by incorporating rich contextual information

• ClickRank is shown to be a novel and effective query-independent ranking signal, especially on long queries

• Our results highlight the potential of data-driven user behavior modeling at the web scale

THANK YOU
