
DM.Lab in University of Seoul

Data Mining Laboratory

December 10, 2008

이방래

EigenRank: A Ranking-Oriented Approach to Collaborative Filtering

SIGIR '08

Nathan N. Liu & Qiang Yang



Table of contents

Introduction
Related Works
Rating-oriented collaborative filtering
Ranking-oriented collaborative filtering
Experiments
Conclusion



Introduction

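Recommender systems
  Content-based filtering: analyzes content information about items and users, builds (user, item) feature sets, and recommends from them
  Collaborative filtering: recommends using only many users' ratings on items, without any content information

Application scenarios of collaborative filtering
  Present items to the user one at a time, based on ratings predicted to reflect the user's latent interests
  Present a top-N item list to the user (ranking-based: the item at the top is predicted to be the most preferred)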



Introduction

Starting point of the study
  Higher accuracy in rating prediction does not necessarily lead to better ranking effectiveness
  Example: rating1 and rating2 have the same absolute deviation from the true values, but rating2 puts the items in the wrong order
  In a recommender system, ranking matters more than rating

Main contributions
  Similarity measure
  Item ranking algorithms: greedy algorithm, random walk



Related works

Two approaches to collaborative filtering

Neighborhood-based approaches: predict the target user's rating from the ratings of the most similar users
  User-based / item-based
  User similarity measures: Pearson Correlation Coefficient / vector similarity

Model-based approaches: use the user-item ratings to train a model
  Clustering methods, aspect models, Bayesian networks, etc.


Rating-oriented collaborative filtering: Notational framework

Users: U = {u_1, u_2, ..., u_m}
Items: I = {i_1, i_2, ..., i_n}
Users' ratings on items: m × n matrix R

U_i: the set of users who have rated item i
I_u: the set of items that user u has rated

r_{u,i}: user u's rating on item i; r_{u,i} = 0 if u has not rated i.


Rating-oriented collaborative filtering: Pearson Correlation Coefficient

User-user similarity: Pearson Correlation Coefficient or vector similarity (cosine)

Item-item similarity: adjusted cosine similarity

The similarity between two users u and v is computed from their standardized ratings over the set of items that both have rated.
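The slide's formula image did not survive extraction. As a minimal sketch of a user-user Pearson correlation, assuming each user's ratings are stored in a hypothetical dict mapping item to rating and that each user's mean is taken over all items that user rated:

```python
from math import sqrt


def pearson_similarity(ratings_u, ratings_v):
    """Pearson correlation between two users over their co-rated items.

    ratings_u, ratings_v: dicts mapping item -> rating (assumed layout).
    """
    common = set(ratings_u) & set(ratings_v)
    if len(common) < 2:
        return 0.0  # too few co-rated items to correlate
    # Assumption: the mean is over everything the user rated.
    mean_u = sum(ratings_u.values()) / len(ratings_u)
    mean_v = sum(ratings_v.values()) / len(ratings_v)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den_u = sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common))
    den_v = sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    if den_u == 0 or den_v == 0:
        return 0.0  # a user with constant ratings carries no signal
    return num / (den_u * den_v)
```

Two users whose ratings rise and fall together score close to 1, and users with opposite tastes score close to -1.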


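Rating-oriented collaborative filtering: Rating Prediction

User-based collaborative filtering (CF): the prediction is made from the ratings of N_u, the set of the k users most similar to u.

Item-based CF: the prediction is made from the ratings on N_i, the set of the k items most similar to i.

Predicted rating of user u on item i (u has not rated i):
  User-based: each user v in the sum belongs to N_u and has rated item i. The mean-rating term for user u reflects whether u tends to rate on a high or a low scale.
  Item-based: each item j in the sum belongs to the similar-item set N_i and to I_u, the set of items that u has rated.

[Figure: user u, item i, rated-item set I_u, similar-item set N_i, item j]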
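The user-based prediction described on this slide can be sketched as follows. This is an illustration under assumptions, not the authors' code: the mean-centred weighted average is the common form of the rule, and `ratings`, `sims`, and `k` are hypothetical names.

```python
def mean_rating(user_ratings):
    """Average of all ratings a user has given."""
    return sum(user_ratings.values()) / len(user_ratings)


def predict_user_based(u, i, ratings, sims, k=100):
    """Predict user u's rating on item i.

    ratings: dict user -> {item: rating}
    sims:    dict v -> similarity between u and v (assumed precomputed)
    """
    mean_u = mean_rating(ratings[u])
    # N_u: the k users most similar to u (positive similarity only, an assumption).
    candidates = [v for v in sims if v != u and sims[v] > 0]
    n_u = sorted(candidates, key=lambda v: sims[v], reverse=True)[:k]
    # Keep only the neighbors who actually rated item i.
    raters = [v for v in n_u if i in ratings[v]]
    if not raters:
        return mean_u  # fall back to u's own average
    num = sum(sims[v] * (ratings[v][i] - mean_rating(ratings[v])) for v in raters)
    den = sum(sims[v] for v in raters)
    # Adding mean_u restores u's personal rating scale.
    return mean_u + num / den
```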


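Ranking-oriented collaborative filtering

In the ranking-oriented approach, the similarity between users is determined by their preferences (not ratings) over the items.

Kendall Rank Correlation Coefficient: measures the similarity between two users' rankings of the same set of items.

It is computed over all item pairs in the set of items that both users have rated.

When the two rankings disagree on a pair, the product of the rating differences is (+) × (-) = (-), which lowers the coefficient. The coefficient is negatively related to the number of item pairs on which the rankings disagree.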
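A minimal sketch of a Kendall-style similarity over co-rated items. The exact formula on the slide was lost, so the normalization below (1 minus 4 times the discordant-pair count over n(n-1)) is an assumption based on the usual definition, and tied pairs are simply counted as not discordant.

```python
from itertools import combinations


def kendall_similarity(ratings_u, ratings_v):
    """Kendall rank correlation between two users (dicts item -> rating)."""
    common = sorted(set(ratings_u) & set(ratings_v))
    n = len(common)
    if n < 2:
        return 0.0  # no item pair to compare
    discordant = 0
    for i, j in combinations(common, 2):
        # A negative product means the two users order i and j oppositely.
        if (ratings_u[i] - ratings_u[j]) * (ratings_v[i] - ratings_v[j]) < 0:
            discordant += 1
    # n * (n - 1) / 2 unordered pairs in total, so this lies in [-1, 1].
    return 1.0 - 4.0 * discordant / (n * (n - 1))
```

Unlike the Pearson coefficient, this depends only on the relative order of the ratings, not on their magnitudes.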


Ranking-oriented collaborative filtering: Preference Functions

The goal of this work is to predict the ranking of items, not their ratings.

Modeling a user's preference function:

  $\Psi : I \times I \rightarrow \mathbb{R}$
  $\Psi(i, j) > 0$: item $i$ is preferred to item $j$; $\Psi(i, j) = 0$: no preference between them
  $|\Psi(i, j)|$: strength of the preference
  $\Psi(i, i) = 0$ and $\Psi(i, j) = -\Psi(j, i)$: anti-symmetric
  Transitivity is not required: $\Psi(i, j) > 0$ and $\Psi(j, k) > 0$ do not imply $\Psi(i, k) > 0$


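Ranking-oriented collaborative filtering: Preference Functions (cont.)

Many machine learning techniques cannot be applied to CF because features are lacking.

In ranking-oriented CF, the key task is obtaining preference information for item pairs that have not been explicitly rated.

The preference function uses N_u, the set of users with preferences similar to u: if many users in N_u have r_i > r_j, this is taken as evidence that Ψ(i, j) > 0.

N_u^{i,j}: the set of neighbors of user u who have rated both items i and j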
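One way to turn the neighbors' ratings into a preference score is a similarity-weighted average of rating differences. The sketch below assumes that form (the slide's formula was not preserved); `neighbors`, `ratings`, and `sims` are hypothetical names.

```python
def preference(i, j, neighbors, ratings, sims):
    """Estimate Psi(i, j) for a target user from that user's neighbors.

    neighbors: iterable of neighbor ids (N_u)
    ratings:   dict user -> {item: rating}
    sims:      dict neighbor -> similarity to the target user
    """
    # N_u^{i,j}: neighbors who rated both items.
    both = [v for v in neighbors if i in ratings[v] and j in ratings[v]]
    den = sum(sims[v] for v in both)
    if den == 0:
        return 0.0  # no evidence either way
    # Positive when similar users tend to rate i above j.
    num = sum(sims[v] * (ratings[v][i] - ratings[v][j]) for v in both)
    return num / den
```

By construction the result is anti-symmetric, and Ψ(i, i) is 0.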


Ranking-oriented collaborative filtering: Preference Functions (cont.)

Every item pair (i, j) with i, j ∈ I is scored with the preference function.

Finding the item ranking
  ρ: a ranking of the items; ρ(i) > ρ(j) if and only if i is ranked higher than j
  Value function: measures how well a ranking ρ agrees with the preference function Ψ

The ranking that maximizes the value function is the optimal ranking ρ*.

Finding the optimal ranking is an NP-complete problem, based on a reduction from the Cyclic-Ordering problem [M. R. Garey and D. S. Johnson, 1979].
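To make the value function concrete, here is a small sketch that assumes it is the sum of Ψ(i, j) over all pairs where i is ranked above j. The brute-force search is for illustration only: it tries every permutation, which is exactly what becomes infeasible for realistic item counts.

```python
from itertools import permutations


def ranking_value(rho, psi):
    """Sum of psi(i, j) over all pairs where i is ranked above j."""
    return sum(psi(i, j) for i in rho for j in rho if rho[i] > rho[j])


def best_ranking_brute_force(items, psi):
    """Exhaustively search all n! rankings (tiny inputs only)."""
    best, best_val = None, float("-inf")
    for order in permutations(items):
        # order[0] is the top item, so it receives the largest rank value.
        rho = {item: len(order) - pos for pos, item in enumerate(order)}
        val = ranking_value(rho, psi)
        if val > best_val:
            best, best_val = rho, val
    return best, best_val
```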


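Ranking-oriented collaborative filtering: Greedy Order Algorithm

Finds an approximately optimal ranking.

Potential value: the more items j with Ψ(i, j) > 0 (items j less preferred than i), the larger the potential of item i.

Item t: the item with the largest potential at the current step.

The rank of t is set to the number of items remaining (a larger value means a higher rank).

Item t is removed from the set.

The influence of the removed item t is subtracted from the remaining potentials.

The greedy order algorithm searches for the rank of each item directly.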
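The steps above can be sketched as follows. This is a minimal illustration, assuming the potential of an item is its total outgoing preference minus its total incoming preference; `psi` is a hypothetical callable returning Ψ(i, j).

```python
def greedy_order(items, psi):
    """Return rho: item -> rank value (larger means ranked higher)."""
    remaining = list(items)
    # Initial potential: how strongly i is preferred over everything else.
    potential = {
        i: sum(psi(i, j) for j in remaining) - sum(psi(j, i) for j in remaining)
        for i in remaining
    }
    rho = {}
    while remaining:
        # Pick the item with the largest potential right now.
        t = max(remaining, key=lambda i: potential[i])
        rho[t] = len(remaining)  # rank = number of items still remaining
        remaining.remove(t)
        # Remove t's contribution from the other potentials.
        for i in remaining:
            potential[i] += psi(t, i) - psi(i, t)
    return rho
```

Each round costs time linear in the number of remaining items, so the whole procedure is quadratic once the pairwise preferences are available.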


Ranking-oriented collaborative filtering: Random Walk Model for Item Ranking

Uses a Markov chain model
  States = items
  Transition probabilities depend on a user's preference function
  The stationary distribution of this Markov chain can be used to produce a ranking

The Markov chain model is effective at aggregating the incomplete preference information of many users.

A Markov chain is a sequence of random variables X1, X2, X3, ... with the Markov property, namely that, given the present state, the future and past states are independent. <Source: http://en.wikipedia.org/wiki/Markov_chain>


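Ranking-oriented collaborative filtering: Random Walk based on User Preferences

Problem: some users prefer i to j and others prefer j to k, but very few have rated all of i, j, and k.
Goal: infer the relation between i and k.

Multi-step random walks extract this relation effectively
  They explore the connectivity of the directed graph underlying the Markov chain
  Very similar to the PageRank scheme

In PageRank
  A random surfer picks a hyperlink at random on the current page
  A directed link p → q is regarded as p recommending q
  The stationary probability of being at a page p reflects the influence of p

Here, a less preferred item j links to a more preferred item i.

The transition probability p(i|j) depends on the strength of preference |Ψ(i, j)|.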


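Ranking-oriented collaborative filtering: Random Walk based on User Preferences (cont.)

Imagine
  The user picks an item i at random
  The user then moves to an item j according to preference: the move depends on the transition probability p(j|i), which is high when j is preferred to i
  Repeating this process, the user ends up at the most preferred items; the stationary distribution is then used as the item ranking

The transition probabilities are non-negative values.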
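The slide's formula for p(j|i) was lost in extraction. A minimal sketch, assuming a softmax-style form in which p(j|i) is proportional to exp(Ψ(j, i)); the exponential is an assumption chosen because it keeps every entry non-negative, and `psi` is a hypothetical callable.

```python
from math import exp


def transition_matrix(items, psi):
    """Build P with P[a][b] = p(items[b] | items[a]).

    Assumed form: p(j | i) is proportional to exp(psi(j, i)), so moving
    toward an item preferred over the current one is more likely.
    """
    P = []
    for i in items:
        weights = [exp(psi(j, i)) for j in items]
        total = sum(weights)
        P.append([w / total for w in weights])  # each row sums to 1
    return P
```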


Ranking-oriented collaborative filtering: Compute the Item Rankings

P: transition matrix, with $p_{i,j} = p(j|i)$

Given an initial probability distribution $\pi_0$ over the items, $\pi_t$ is computed with the power iteration method.

$\pi_t = \left(p_t(1), p_t(2), \ldots, p_t(n)\right)^T$, where $p_t(i)$ is the probability of being at item $i$ after $t$ steps of walking.


Ranking-oriented collaborative filtering: Compute the Item Rankings (cont.)

Power iteration method: a technique for computing the principal eigenvector.

Stationary probability: in general, $\pi_t$ converges to the principal eigenvector of the transition matrix P.

$\pi^* = \lim_{t \to \infty} \pi_t$

The existence and uniqueness of the stationary distribution is guaranteed if and only if the matrix P is irreducible.

In our model, the entries of P are all non-negative, which could guarantee the existence and uniqueness of the stationary distribution.


Ranking-oriented collaborative filtering: Power Method

The power iteration algorithm starts with a vector b_0, which may be an approximation to the dominant eigenvector or a random vector. The method is described by the iteration $b_{k+1} = A b_k / \lVert A b_k \rVert$. So, at every iteration, the vector b_k is multiplied by the matrix A and normalized.

Under the assumptions:
  A has an eigenvalue that is strictly greater in magnitude than its other eigenvalues
  The starting vector b_0 has a nonzero component in the direction of an eigenvector associated with the dominant eigenvalue

then:
  A subsequence of (b_k) converges to an eigenvector associated with the dominant eigenvalue

<Source: http://en.wikipedia.org/wiki/Power_method>
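Applied to a Markov chain, power iteration repeatedly pushes a probability vector through the transition matrix until it stops changing. A minimal pure-Python sketch, assuming P is row-stochastic (P[i][j] = p(j|i)) and starting from the uniform distribution; `tol` and `max_iter` are hypothetical parameters.

```python
def stationary_distribution(P, tol=1e-12, max_iter=10000):
    """Approximate the stationary distribution of a row-stochastic matrix P."""
    n = len(P)
    pi = [1.0 / n] * n  # assumed start: uniform over the items
    for _ in range(max_iter):
        # One step of the walk: new_pi[j] = sum_i pi[i] * p(j | i).
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi  # best estimate if the tolerance was never reached
```

Because each row of P sums to 1, the vector stays a probability distribution and no explicit renormalization is needed. Sorting the items by their stationary probability then yields a ranking.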


Ranking-oriented collaborative filtering: Personalization Vector

To avoid the reducibility of the stochastic matrix, Brin and Page [1998] proposed a revised transition matrix.

In our random walk model, the personalization vector is modified.

Perturbation matrix
  e: the vector whose entries are all 1
  v: the personalization vector

ε: a scalar between 0 and 1 that expresses how often the web surfer teleports to another page according to the personalization vector v, regardless of the hyperlinks on the current page.

$v_u = \left(p_u(1), p_u(2), \ldots, p_u(n)\right)^T$

Unrated items are all visited via teleport with the same probability.
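The revised matrix itself is missing from the extracted slide. A minimal sketch, assuming the standard PageRank-style mixture in which the walk follows P with probability 1 - ε and teleports according to v with probability ε; the function name is hypothetical.

```python
def add_teleport(P, v, eps):
    """Return (1 - eps) * P + eps * e v^T for a row-stochastic P.

    v:   personalization vector (non-negative, sums to 1)
    eps: teleport probability in [0, 1]
    """
    n = len(P)
    # Every row receives the same teleport term eps * v[j].
    return [
        [(1.0 - eps) * P[i][j] + eps * v[j] for j in range(n)]
        for i in range(n)
    ]
```

If every entry of v is positive and ε > 0, every entry of the result is positive, so the chain can no longer split into disconnected parts.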



Experiments

Experiments on two data sets

Main questions of the experiments
  Does the proposed method outperform the (user-based / item-based) rating-oriented methods?
  Which is better, greedy order or the random walk model?
  Is the Kendall Rank Correlation Coefficient effective?



Experiments

Data Sets

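Experimental setup
  10,600 users were selected at random (users who had rated at least 40 movies)
  10,000 users were used for training
  100 users were used for parameter tuning
  500 users were used for testing

Only the 2,000 most frequently rated movies were used in the experiments.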



Experiments

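Evaluation Metric

Rating prediction accuracy is usually measured with Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).

Because this work is concerned with ranking, it is evaluated with Normalized Discounted Cumulative Gain (NDCG).

The top k items of the ranked list are evaluated.

NDCG(Q, k): the metric value at the k-th position over the test user set Q

Q: the set of test users

R(u, p): the rating that user u gave to the item at position p of the ranked list

Z_u: a normalization factor chosen so that the optimal ranking gives NDCG = 1

The metric takes values between 0 and 1; a larger value means a more effective ranking. It is sensitive to the ratings of the highest-ranked items.

The contribution of an item decreases as its position p increases: users rarely look far down the list, so the first few items matter much more.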
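A minimal sketch of NDCG at position k, assuming the common gain 2^R(u,p) - 1 and logarithmic discount log(1 + p); the slide's formula image was lost, so this exact form is an assumption. Each test user is represented by a hypothetical list of true ratings in the order the system ranked the items.

```python
from math import log2


def dcg_at_k(ranked_ratings, k):
    """Discounted cumulative gain of the first k positions."""
    return sum(
        (2 ** r - 1) / log2(1 + p)
        for p, r in enumerate(ranked_ratings[:k], start=1)
    )


def ndcg_at_k(ranked_ratings, k):
    """DCG divided by the DCG of the ideal ordering (the role of Z_u)."""
    ideal = dcg_at_k(sorted(ranked_ratings, reverse=True), k)
    return dcg_at_k(ranked_ratings, k) / ideal if ideal > 0 else 0.0


def mean_ndcg(ranked_lists, k):
    """NDCG(Q, k): average over the test users' ranked lists."""
    return sum(ndcg_at_k(r, k) for r in ranked_lists) / len(ranked_lists)
```

The base of the logarithm cancels in the normalization, so it does not affect the result.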



Experiments

Impact of Neighborhood Size

Experiment
  Training users: fixed at 5,000
  Performance measure: NDCG(Q, 1), the value at the 1st position
  Greedy order and random walk were run while varying the neighborhood size

[Figure: performance versus neighborhood size]

Conclusion: the optimal neighborhood size is 100 users.

Beyond 100 users, users with low similarity are included and performance drops.



Experiments

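Impact of ε (teleport)

  ε controls how often the walk teleports
  Tested over the range 0 to 0.9
  NDCG was measured at the 1st, 3rd, and 5th positions
  NDCG increases as ε increases
  When ε becomes too large, NDCG decreases again

With a very large ε, the random walk model does little more than teleport; because the same probability is applied to every unrated item, it becomes hard to rank items effectively.

Conclusion: the optimal ε is 0.6.

[Figure: NDCG versus ε]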



Experiments

Comparisons with other Algorithms

Rating-based
  Item-based model using Pearson Correlation Coefficient (IPCC)
  Item-based model using Vector Similarity (IVS)
  User-based model using Pearson Correlation Coefficient (UPCC)
  User-based model using Vector Similarity (UVS)

Ranking-based (random walk model)
  Random walk using Pearson Correlation Coefficient (RWPCC)
  Random walk using Vector Similarity (RWVS)
  Random walk using Kendall Rank Correlation Coefficient (RWKRCC)

Ranking-based (greedy order model)
  Greedy order using Pearson Correlation Coefficient (GOPCC)
  Greedy order using Vector Similarity (GOVS)
  Greedy order using Kendall Rank Correlation Coefficient (GOKRCC)

The item-based models use a neighborhood of 50 items (see the references); the other methods use a neighborhood of 100.



Experiments

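Results
  Random walk consistently performs best
  For greedy order, the Kendall… coefficient performs best
  The rating-based methods generally improve as the number of training users increases

Baseline: rating-based methods

Improvement of the best ranking-based method over the best rating-based method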



Conclusions

We propose a ranking-oriented framework for collaborative filtering (without predicting a user's ratings).

Methods for computing item rankings
  Greedy order
  Random walk model

Our approach outperforms existing CF algorithms in terms of ranking effectiveness.



Q&A
