mining rich session context to improve web search guangyu zhu and gilad mishne in proceedings of...
TRANSCRIPT
Mining Rich Session Context to Improve Web Search
Guangyu Zhu and Gilad Mishne
In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2009), Paris, France, 2009.
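Data Mining Final Paper Report
彭嘉宏, first-year master's student, Computer Science and Information Engineering, 69821014
賴家瑞, first-year doctoral student, Computer Science and Information Engineering, 89821004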
Author
• Guangyu Zhu (朱光宇)
– Institute for Advanced Computer Studies
– University of Maryland
– College Park, MD 20742
– [email protected]
– http://www.ece.umd.edu/~zhugy/
• Gilad Mishne
– Yahoo! Labs
– 701 First Ave.
– Sunnyvale, CA 94089
– [email protected]
– http://labs.yahoo.com/user/155
Outline
• Author
• INTRODUCTION
– Web Search
– PageRank
– Web Session
• MINING WEB SESSIONS
• ClickRank
• Applications to web search
– Site ranking
– Page ranking
– Mining dynamic quicklinks
• CONCLUSIONS
INTRODUCTION
Web Search
PageRank
http://www.ndhu.edu.tw/
PageRank
http://www.nerdmodo.com/2009/07/what-is-a-pagerank/
PageRank of A = PageRank of B + PageRank of C + PageRank of D = 0.75
Further, if page B also links to page D, then the PageRank of A becomes:
PageRank of A = PageRank of B/2 + PageRank of C + PageRank of D = 0.625
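The arithmetic above can be checked with a short sketch. It assumes the simplified, undamped transfer rule used in this example and a PageRank of 0.25 for each of B, C, and D (the value implied by the 0.75 total); the function name `incoming_rank` is illustrative only.

```python
def incoming_rank(rank, out_links, target):
    """Sum rank[p] / outdegree(p) over every page p that links to target."""
    return sum(
        rank[page] / len(links)
        for page, links in out_links.items()
        if target in links
    )


# Assumed starting values: B, C and D each hold 0.25.
rank = {"B": 0.25, "C": 0.25, "D": 0.25}

# Case 1: B, C and D each link only to A.
links_1 = {"B": ["A"], "C": ["A"], "D": ["A"]}
pr_a_1 = incoming_rank(rank, links_1, "A")

# Case 2: B also links to D, so B's vote is split between A and D.
links_2 = {"B": ["A", "D"], "C": ["A"], "D": ["A"]}
pr_a_2 = incoming_rank(rank, links_2, "A")

print(pr_a_1, pr_a_2)  # 0.75 0.625
```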
PageRank problems
1. First, user browsing behaviors are driven by intents, and they significantly deviate from the random surfer model that PageRank is based on.
2. Second, static modeling of the link structure favors old pages, because a new page is less likely to be linked to within a short period of time, even if it is of very good quality.
3. Third, link structures are prone to manipulation, as adversarial links can be generated to artificially inflate ranking more quickly than quality links, which typically originate in manual editing.
4. Last, as the web grows at an explosive speed, computing page importance at web scale by link analysis becomes very computationally expensive, even with various optimization schemes.
Web Session
• http://www.seomoz.org/blog/controlling-search-engine-access-with-cookies-session-ids
HTTP Protocol
• Stateless
• Connectionless
Session identification
• We define a session as an active trail of user clicks presented by the URL referral structure
• A new session starts
– After 30 minutes of inactivity
– On the occurrence of a URL without a referrer URL
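The two session-boundary rules can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes one user's events are already sorted by time and stored as `(timestamp_seconds, url, referrer)` tuples, and it does not follow interleaved referral trails. The name `split_sessions` is hypothetical.

```python
SESSION_TIMEOUT = 30 * 60  # 30 minutes of inactivity, in seconds


def split_sessions(events, timeout=SESSION_TIMEOUT):
    """Split one user's time-ordered events into sessions.

    Each event is a (timestamp_seconds, url, referrer_or_None) tuple.
    """
    sessions = []
    last_ts = None
    for ts, url, referrer in events:
        starts_new = (
            last_ts is None              # very first event
            or ts - last_ts > timeout    # rule 1: inactivity gap
            or not referrer              # rule 2: URL without a referrer
        )
        if starts_new:
            sessions.append([])
        sessions[-1].append((ts, url, referrer))
        last_ts = ts
    return sessions


events = [
    (0, "http://a.example/", None),
    (60, "http://a.example/x", "http://a.example/"),
    (4000, "http://a.example/y", "http://a.example/x"),  # gap > 30 minutes
    (4030, "http://b.example/", None),                   # no referrer
]
sessions = split_sessions(events)
print([len(s) for s in sessions])  # [2, 1, 1]
```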
Session data collected
• We used aggregate, anonymous general user behavior data collected by Yahoo! Toolbar
– 30 billion events over a 6-month period in 2008
– {cookie, timestamp, URL, referral URL, event attributes}
– No personal information in source data
Mining web sessions
Motivations
• To propose an efficient and scalable framework for mining general web user behavior data
– Query/click logs are useful, but limited (< 5% of traffic)
– All user actions count
– The web and web user behaviors both constantly evolve
• Focus on sessions of general web browsing activities
– A logical unit that is general across all categories
– To learn the preferences, intents, and judgment of users from rich contextual information
• To learn session context models to improve core web search ranking and other aspects of the web search experience
Session Clustering
• In this experiment, we mapped each URL to an event category based on five high-level intents:
– Search
– Mail
– Information/reference
– Rich content (e.g., social networking and multimedia)
– Shopping
Session Clustering
Search sessions account for less than 5% of users' online activities
A web session contains significantly richer activity context and diversity than a search session
Percentage of search sessions 4.85%
Average events per session 9.1
Average session duration (seconds) 420.3
Sessions per user per day 15.5
Histogram session representation
• We compute a distribution of activities over structured intents, given a list of URLs and their intent interpretations
Histogram representation of the session
Session duration
Total number of events in the session
Sessions are highly diverse
Use PCA to reduce dimensions
The first 6 eigenvalues are significant
UK   SE   MA   IN   RC   SH   Events   Duration
22    0    0   77    0    0        9       1783
16   50    0   33    0    0        6         92
 0    0    0    0  100    0       27       1032
75   25    0    0    0    0        4         73
33    6    0   60    0    0       15       1563
80    0    0    0    0   20        5         88
75    0    0    0    0   25        4         38
20   60   20    0    0    0        5        378
 0    0    0    0    0  100        2        227
12    0   87    0    0    0        8        928
7 dimensional feature vector for each session
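A minimal sketch of how such a 7-dimensional vector could be built: five intent shares (in percent) plus the event count and the session duration. The URL-to-intent mapping is not given in the slides, so `toy_categorize` and its hostnames are purely hypothetical, and events with an unknown intent are simply left out of the histogram here.

```python
INTENTS = ["search", "mail", "information", "rich_content", "shopping"]


def session_features(events, categorize):
    """Return [share per intent in %, ..., total events, duration in seconds].

    events: time-ordered list of (timestamp_seconds, url) tuples.
    categorize: function mapping a URL to an intent name (or None).
    """
    if not events:
        raise ValueError("a session needs at least one event")
    counts = {intent: 0 for intent in INTENTS}
    for _, url in events:
        intent = categorize(url)
        if intent in counts:
            counts[intent] += 1
    n = len(events)
    shares = [100.0 * counts[intent] / n for intent in INTENTS]
    duration = events[-1][0] - events[0][0]
    return shares + [n, duration]


def toy_categorize(url):
    """Hypothetical lookup from hostname to intent, for illustration only."""
    table = {
        "search.example": "search",
        "mail.example": "mail",
        "shop.example": "shopping",
    }
    host = url.split("/")[2]
    return table.get(host)


events = [
    (0, "http://search.example/?q=tv"),
    (30, "http://shop.example/tv"),
    (90, "http://shop.example/cart"),
    (120, "http://shop.example/pay"),
]
features = session_features(events, toy_categorize)
print(features)  # [25.0, 0.0, 0.0, 0.0, 75.0, 4, 120]
```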
Session categorization
Table 2: Unsupervised clustering of session histograms reveals various Web user browsing patterns. Significant features associated with each cluster are highlighted in bold.
Session categorization
Attribute      Full Data   Cluster 1         2         3         4         5         6         7         8         9        10
                    100%       29.8%     16.6%     14.3%     11.9%     11.0%      4.7%      4.6%      3.5%      2.1%      1.5%
=============================================================================================================================
Search            23.630       0.340    98.430     1.190     2.350     2.350    56.180    41.520    52.230     6.460     0.090
Mail              16.810       0.070     0.660    97.250     0.390     0.400     1.290    51.790     0.710     9.790     0.080
Information       12.260       0.040     0.270     0.390     1.030    96.500    24.580     2.650     0.500     5.970     0.020
Rich content      34.320      99.420     0.370     0.650     0.450     0.360     0.640     0.950    45.250    60.510    99.540
Shopping          12.850       0.080     0.240     0.410    95.670     0.290    16.920     2.600     0.860    16.840     0.060
Total events       9.040      11.140     2.890     5.660     6.250     5.330     4.240     5.380     4.260     7.850   151.680
Total time       420.300     532.490   261.370   303.850   235.780   298.910   228.400   455.580   218.010   439.780  4237.650
Cluster centroids
Intent-driven web browsing patterns emerge from session clusters
K-means clustering is sufficient to reveal meaningful intent patterns, such as long sessions of content browsing and query reformulation
Simple and effective
Browsing content-rich websites
Reformulating search queries
Reading email
Navigational queries
Addiction to content-rich websites
Collecting info during shopping
Informational queries
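For readers unfamiliar with K-means, the clustering step can be sketched with plain Lloyd iterations. This is a toy illustration under stated assumptions: the points are made-up two-feature histograms (search share, rich-content share), the starting centroids are picked by hand, and the initialization and distance measure used in the paper are not given in these slides.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm; `centroids` are the starting centers."""
    centroids = [list(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignment = [
            min(
                range(len(centroids)),
                key=lambda k: sum(
                    (x - c) ** 2 for x, c in zip(p, centroids[k])
                ),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Toy session histograms: [search %, rich content %].
points = [[98, 1], [95, 3], [2, 99], [0, 97]]
centroids, assignment = kmeans(points, [points[0], points[2]])
print(centroids, assignment)  # [[96.5, 2.0], [1.0, 98.0]] [0, 0, 1, 1]
```

The two centroids that come out play the same role as the cluster centroids in Table 2: one is dominated by search, the other by rich content.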
ClickRank
ClickRank Overview
• ClickRank is derived from contextual indicators of user preferences and judgment in general web sessions
– Dwell time on the page
– Click order in the session
– Page load time
– Frequency of occurrence in the session
• Compute a local ClickRank function for each visited page in a session by incorporating session context models, and then aggregate these values to obtain the global ClickRank
Local ClickRank
• Define the local ClickRank function as
– The weight function is computed from the rank of the page visit event in session
– The weight function is computed from temporal information associated with browsing of the page
– is the indicator function
ClickRank incorporates click order
• The weight function for click order is defined per event, from the rank of the event in the session and the total number of events in the session
– Motivated by experiments on implicit user preference judgments in Joachims et al., SIGIR 2005
– The weight function is monotonically decreasing w.r.t. the rank of the event within a session
– The mean and variance of the local ClickRank function are finite
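The exact formula is not preserved in this transcript. As an illustration only, one weight with the stated properties (monotonically decreasing in the rank, bounded, and summing to 1 over a session) is the linearly decaying function below; `rank_weight` and its form are assumptions, not necessarily the definition used in the paper.

```python
def rank_weight(i, n):
    """Hypothetical weight for the event at rank i (1-based) among n events."""
    if not 1 <= i <= n:
        raise ValueError("rank must satisfy 1 <= i <= n")
    # Linear decay, normalized so that the weights of a session sum to 1.
    return 2.0 * (n + 1 - i) / (n * (n + 1))


weights = [rank_weight(i, 4) for i in range(1, 5)]
print(weights)  # [0.4, 0.3, 0.2, 0.1]
```

Earlier clicks in a session receive more weight than later ones, and the weights of every session add up to 1 regardless of its length.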
ClickRank incorporates temporal signals
• The paper defines another weight function to incorporate more temporal information, where the inputs are the dwell time on the page and the page load time, both normalized w.r.t. the entire session
• The indicator function above defines a filter that factors in the time range of interest
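How the three ingredients might combine is sketched below. The formulas themselves are missing from the transcript, so everything here is an assumption made for illustration: the product-and-sum form, the linearly decaying `rank_weight`, the saturating `temporal_weight` with its rate parameters, and the event dictionary layout are all hypothetical.

```python
import math


def rank_weight(i, n):
    """Hypothetical click-order weight, decreasing in the rank i (1-based)."""
    return 2.0 * (n + 1 - i) / (n * (n + 1))


def temporal_weight(dwell, load, dwell_rate=1.0, load_rate=1.0):
    """Hypothetical temporal weight: grows with dwell time, shrinks with load time."""
    return (1.0 - math.exp(-dwell_rate * dwell)) * math.exp(-load_rate * load)


def local_clickrank(page, session, time_range=None):
    """Score one page within one session.

    session: ordered list of dicts with keys url, timestamp, dwell, load,
    where dwell and load are already normalized w.r.t. the session.
    time_range: optional (start, end) filter applied by the indicator.
    """
    n = len(session)
    score = 0.0
    for i, event in enumerate(session, start=1):
        in_range = (
            time_range is None
            or time_range[0] <= event["timestamp"] <= time_range[1]
        )
        indicator = 1.0 if event["url"] == page and in_range else 0.0
        score += (
            rank_weight(i, n)
            * temporal_weight(event["dwell"], event["load"])
            * indicator
        )
    return score


session = [
    {"url": "p1", "timestamp": 100, "dwell": 0.5, "load": 0.0},
    {"url": "p2", "timestamp": 160, "dwell": 0.5, "load": 0.0},
]
print(local_clickrank("p1", session) > local_clickrank("p2", session))  # True
```

With equal dwell and load times, the page clicked first scores higher, which is the click-order effect described above; passing a `time_range` that excludes an event zeroes out its contribution.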
Global ClickRank
• Given a set of web sessions, the global ClickRank is computed from local ClickRank functions by an aggregation function
• Aggregation operators to compute global ClickRank are more general
– Sum, average, and filter, e.g. by criteria such as time and demography
– Filtering sessions is much more flexible than filtering links
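The aggregation step can be sketched as follows, assuming the local scores have already been computed per session. The function name `global_clickrank`, the metadata dictionaries, and the choice to average over the sessions in which a page occurs are all assumptions made for this illustration.

```python
from collections import defaultdict


def global_clickrank(session_scores, aggregate="sum", keep=None):
    """Aggregate local scores into one global score per page.

    session_scores: iterable of (metadata, {page: local_score}) pairs.
    keep: optional predicate on the metadata, used to filter whole sessions.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for meta, scores in session_scores:
        if keep is not None and not keep(meta):
            continue  # drop the whole session, e.g. by date or demography
        for page, value in scores.items():
            totals[page] += value
            counts[page] += 1
    if aggregate == "sum":
        return dict(totals)
    if aggregate == "average":
        return {page: totals[page] / counts[page] for page in totals}
    raise ValueError("aggregate must be 'sum' or 'average'")


data = [
    ({"month": 1}, {"a": 0.5, "b": 0.5}),
    ({"month": 2}, {"a": 1.0}),
]
print(global_clickrank(data))                                  # {'a': 1.5, 'b': 0.5}
print(global_clickrank(data, aggregate="average"))             # {'a': 0.75, 'b': 0.5}
print(global_clickrank(data, keep=lambda m: m["month"] == 2))  # {'a': 1.0}
```

Filtering is just a predicate on session metadata, which is why restricting by time or user group is cheap here compared with recomputing a link graph.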
Theoretical framework of ClickRank
• The local ClickRank function defines a random variable associated with a web page, given an observed session
• Convergence property: as the number of observed sessions grows, the average of the local ClickRank values converges to its expected value by the strong law of large numbers
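In symbols, the convergence property can be written as below. The notation is introduced here as an assumption, since the slide's own formulas are not preserved: $X_j(p)$ denotes the local ClickRank of page $p$ in the $j$-th observed session, the $X_j(p)$ are treated as independent and identically distributed, and $\mu_p$ is their finite mean.

```latex
\bar{X}_n(p) \;=\; \frac{1}{n} \sum_{j=1}^{n} X_j(p)
\;\xrightarrow{\ \text{a.s.}\ }\;
\mu_p \;=\; \mathbb{E}\left[ X_1(p) \right]
\qquad \text{as } n \to \infty .
```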
Relation to graph-based models
• ClickRank is based on an intentional surfer model
• ClickRank is data driven
– ClickRank does not embed rigid assumptions on the traversing scheme over the web
– Better reflects users' information need and adapts faster to constantly changing user behaviors
• Significantly more efficient and scalable compared to approaches based on explicit graph formulations
– The ClickRank computational framework is well suited for distributed computing
– ClickRank can be computed incrementally
– One pass over the entire data and memory friendly
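The incremental, one-pass, and distributed properties follow from the fact that a sum-style aggregate only needs a running total per page. A minimal sketch, with a hypothetical class name and interface:

```python
class RunningClickRank:
    """Keeps one running total per page; each session is seen exactly once."""

    def __init__(self):
        self.total = {}
        self.sessions = 0

    def update(self, local_scores):
        """Fold in the {page: local_score} map of one new session."""
        self.sessions += 1
        for page, value in local_scores.items():
            self.total[page] = self.total.get(page, 0.0) + value

    def merge(self, other):
        """Combine partial results, e.g. computed on different machines."""
        self.sessions += other.sessions
        for page, value in other.total.items():
            self.total[page] = self.total.get(page, 0.0) + value

    def mean(self, page):
        """Average local score of a page over all sessions seen so far."""
        if self.sessions == 0:
            return 0.0
        return self.total.get(page, 0.0) / self.sessions


shard_a = RunningClickRank()
shard_a.update({"a": 0.5, "b": 0.5})
shard_b = RunningClickRank()
shard_b.update({"a": 1.0})
shard_a.merge(shard_b)
print(shard_a.total, shard_a.mean("a"))  # {'a': 1.5, 'b': 0.5} 0.75
```

Memory grows with the number of distinct pages, not with the number of sessions, and new data can be folded in without revisiting old sessions.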
Applications to web search
Applications to web search
• Datasets
– 3.3 billion web sessions extracted from Yahoo! Toolbar data over 6 months in 2008
• Site ranking
– Compute ClickRank of 16.3 million websites in 56 minutes
• Page ranking
– Compute ClickRank of 3.1 billion web pages in 1 hour and 32 minutes
Site ranking
• ClickRank is more reliable and richer than results computed using only static link structure
* The BrowseRank results are cited from Liu et al., SIGIR '08, which used MSN Toolbar data
Page ranking
• The ClickRank feature brings 1.02%, 0.97%, 1.11%, and 1.33% web search improvements in DCG(1), DCG(5), DCG(10), and NDCG, respectively
• A 1% gain over a production system is very significant
• ClickRank affects 81.2% of over 9,000 queries and covers 62.5% of documents
Query    Number of   Affected   Improvements in                       Significance test
length   queries     queries    DCG(1)   DCG(5)   DCG(10)   NDCG      p-value
1        1484        1232       0.45%    0.71%    1.00%     0.38%     5.33 × 10^-2
2        2992        2450       0.56%    0.99%    1.12%     1.07%     4.65 × 10^-4
3        2153        1722       1.62%    1.08%    1.41%     2.18%     1.10 × 10^-4
4+       2412        1937       0.92%    0.86%    0.78%     1.43%     1.61 × 10^-5
All      9041        7341       1.02%    0.97%    1.11%     1.33%     9.98 × 10^-5
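For reference, DCG(k) and NDCG can be computed as in the sketch below. It uses a common formulation (gain 2^rel − 1 with a log2 position discount); the slides do not state which variant the evaluation used, so treat this choice and the sample relevance grades as assumptions.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the top k results."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Made-up relevance grades of the top results before and after re-ranking.
baseline = [1, 3, 0, 2]
reranked = [3, 1, 2, 0]

print(dcg(baseline, 1), dcg(reranked, 1))  # 1.0 7.0
print(ndcg(reranked, 4) > ndcg(baseline, 4))  # True
```

The percentages in the table are relative changes in these metrics, averaged over queries, between the ranking with and without the ClickRank feature.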
Mining dynamic quicklinks
• Many commercial search engines provide quick access links to popular destinations within the site
• These links are traditionally mined from search engine query logs
– Query or search session logs are limited in scope and coverage
– Query logs favor old, navigational links
Mining dynamic quicklinks
• We demonstrate ClickRank for discovering recent, dynamic content
• We adapt the time range in the temporal weight function w.r.t. the content refresh rate found by crawler
• Use the indicator function as a term that specifies recency of the content
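The recency filter can be sketched as follows. This is an illustration under assumptions, not the production method: visits are given as `(timestamp, url, local_score)` tuples, the refresh interval is supplied by the caller, and `dynamic_quicklinks` is a hypothetical name.

```python
from urllib.parse import urlparse


def dynamic_quicklinks(visits, site, now, refresh_interval, top_k=3):
    """Rank a site's pages using only visits inside the recency window."""
    scores = {}
    for ts, url, score in visits:
        # Indicator term: 1 only for visits within the last refresh interval.
        recent = now - refresh_interval <= ts <= now
        if recent and urlparse(url).netloc == site:
            scores[url] = scores.get(url, 0.0) + score
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda u: (-scores[u], u))[:top_k]


visits = [
    (100, "http://news.example/old", 5.0),
    (950, "http://news.example/today", 1.0),
    (980, "http://news.example/today", 1.0),
    (990, "http://news.example/live", 1.5),
    (995, "http://other.example/x", 9.0),
]
links = dynamic_quicklinks(visits, "news.example", now=1000, refresh_interval=100)
print(links)  # ['http://news.example/today', 'http://news.example/live']
```

The heavily visited but stale page drops out because its visits fall outside the window, which is the behavior wanted for frequently refreshed sites.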
Conclusion
• This paper expands the use of general user behavior data for web search ranking and other applications
• This paper introduces ClickRank, an efficient, scalable algorithm for estimating web page importance by incorporating rich contextual information
• ClickRank is shown to be a novel and effective query-independent ranking signal, especially on long queries
• Our results highlight the potential of data-driven user behavior modeling at the web scale
THANK YOU