Contextual Linear Bandit Problem & Applications

Feng Bi, Joon Sik Kim, Leiya Ma, Pengchuan Zhang


Page 1: Contextual Linear Bandit Problem & Applications

Contextual Linear Bandit Problem & Applications

Feng Bi, Joon Sik Kim, Leiya Ma, Pengchuan Zhang

Page 2: Contextual Linear Bandit Problem & Applications

Contents

• Recap of last lecture: Multi-armed Bandit Problem / UCB1 Algorithm
• Application to News Article Recommendation
• Contextual Linear Bandit Problem
• LinUCB Algorithm
• Disjoint Model & General Model
• Derivation of LinUCB
• News Article Recommendation Revisited
• Conclusion

Page 3: Contextual Linear Bandit Problem & Applications

Last Lecture: Multi-armed Bandit Problem

• K actions (feature-free)
• Each action has an unknown average reward $\mu_k$
• For t = 1, …, T (T unknown):
  – Choose an action $a_t$ from the K actions {1, …, K}
  – Observe a random reward $y_t$, bounded in [0, 1]
  – $\mathbb{E}[y_t] = \mu_{a_t}$: expected reward of action $a_t$
• Minimizing regret: $R_T = T\max_k \mu_k - \mathbb{E}\big[\sum_{t=1}^T y_t\big]$

Q. How to choose actions to minimize regret?

Page 4: Contextual Linear Bandit Problem & Applications

Last Lecture: UCB1 Algorithm

• At each iteration, choose the action with the highest Upper Confidence Bound (UCB):

  $a_t = \arg\max_{k \in \{1,\dots,K\}} \Big[\underbrace{\bar\mu_{t,k}}_{\text{exploitation term}} + \underbrace{\sqrt{2\log t / n_{t,k}}}_{\text{exploration term}}\Big]$

• Regret bound: with high probability, sublinear w.r.t. the time horizon T,

  $R_T = O\big((K/\Delta)\log T\big)$,

  where K is the number of actions and $\Delta = \mu_1 - \mu_2$ is the gap between the best and second-best actions.
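To make the selection rule concrete, here is a minimal UCB1 sketch in Python (not from the original slides; the Bernoulli arm means in the example are made-up values):

```python
import math
import random

def ucb1(true_means, T):
    """Minimal UCB1 sketch: play each arm once, then pick the arm
    maximizing empirical mean + sqrt(2 log t / n)."""
    K = len(true_means)
    counts = [0] * K     # n_k: number of pulls of arm k
    sums = [0.0] * K     # running reward sum of arm k
    total_reward = 0.0
    for t in range(1, T + 1):
        if t <= K:       # initialization: try every arm once
            k = t - 1
        else:
            k = max(range(K), key=lambda a: sums[a] / counts[a]
                    + math.sqrt(2 * math.log(t) / counts[a]))
        y = 1.0 if random.random() < true_means[k] else 0.0  # Bernoulli reward
        counts[k] += 1
        sums[k] += y
        total_reward += y
    return total_reward

# Example: 4 arms with means unknown to the learner (made-up values).
print(ucb1([0.2, 0.4, 0.5, 0.8], T=10000))
```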

Page 5: Contextual Linear Bandit Problem & Applications

Application of the Multi-armed Bandit Problem?

Page 6: Contextual Linear Bandit Problem & Applications

News Article Recommendation

• Various algorithms for personalized recommendation
  – Collaborative filtering, content-based filtering, hybrid approaches
• But…
  – Web-based content undergoes frequent changes
  – Some users have no previous data to learn from (cold-start problem)
• Exploration vs. exploitation
  – Need to gather more information about users with more trials
  – Optimize the article selection with past user experience

⇒ Use the multi-armed bandit setting

Page 7: Contextual Linear Bandit Problem & Applications

Article Recommendation in the Feature-Free Bandit Setting

Users u1 with age YOUNG and u2 with age OLD

[Figure: the two users facing the four candidate articles]

Page 8: Contextual Linear Bandit Problem & Applications

Article Recommendation in the Feature-Free Bandit Setting

User u1 with age YOUNG

[Figure: user u1 facing the four candidate articles]

Page 9: Contextual Linear Bandit Problem & Applications

Article Recommendation in the Feature-Free Bandit Setting

User u1 with age YOUNG

[Figure: the four articles with estimated average payoffs 0.2, 0.4, 0.5, 0.8]


Page 12: Contextual Linear Bandit Problem & Applications

Article Recommendation in the Feature-Free Bandit Setting

User u2 with age OLD

[Figure: user u2 sees the same per-article estimates 0.2, 0.4, 0.5, 0.8]


Page 14: Contextual Linear Bandit Problem & Applications

Article Recommendation in the Feature-Free Bandit Setting

User u2 with age OLD

[Figure: the estimates 0.2, 0.4, 0.5, 0.8 were learned from u1; ??? — it is unclear whether they apply to the OLD user u2]


Page 16: Contextual Linear Bandit Problem & Applications

Contextual-Bandit Problem

• For t = 1, …, T (T unknown):
  – Observe user $u_t$ and a set $A_t$ of actions a
  – A feature vector (context) $x_{t,a}$ summarizes both user $u_t$ and action a
  – Based on previous results, choose $a_t$ from $A_t$
  – Receive payoff $r_{t,a_t}$
  – Improve the selection strategy with the new observation $(x_{t,a_t}, a_t, r_{t,a_t})$

• Minimizing regret:

  $R_T = \mathbb{E}\Big[\sum_{t=1}^{T} r_{t,a_t^*}\Big] - \mathbb{E}\Big[\sum_{t=1}^{T} r_{t,a_t}\Big]$,

  where $a_t^* = \arg\max_{a \in A_t}\mathbb{E}[r_{t,a} \mid x_{t,a}]$ is the action with maximum expected payoff at time t.
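The protocol above can be written as a short loop. The following Python sketch is illustrative only: `LinearEnv`, its payoff model, and the uniformly random example policy are assumptions, not part of the slides:

```python
import numpy as np

class LinearEnv:
    """Synthetic environment: K arms, expected payoff x^T theta_a (assumed values)."""
    def __init__(self, thetas, d=2, rng=None):
        self.thetas, self.d = thetas, d
        self.rng = rng or np.random.default_rng(0)
    def observe(self):
        # context x_{t,a} for each action a (here: random features)
        return {a: self.rng.random(self.d) for a in range(len(self.thetas))}
    def payoff(self, contexts, a):
        return float(contexts[a] @ self.thetas[a] + 0.1 * self.rng.normal())

def run_contextual_bandit(env, select, update, T):
    """The interaction protocol from the slide."""
    total = 0.0
    for t in range(T):
        contexts = env.observe()          # features x_{t,a} for a in A_t
        a_t = select(contexts)            # choose an action from past results
        r_t = env.payoff(contexts, a_t)   # receive the payoff
        update(contexts[a_t], a_t, r_t)   # improve the selection strategy
        total += r_t
    return total

# Example with a uniformly random policy and made-up theta vectors:
env = LinearEnv([np.array([0.1, 0.6]), np.array([0.9, 0.2])])
rng = np.random.default_rng(1)
print(run_contextual_bandit(env,
                            select=lambda c: int(rng.integers(len(c))),
                            update=lambda x, a, r: None, T=1000))
```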

Page 17: Contextual Linear Bandit Problem & Applications

Difference?

• The contextual bandit problem becomes the K-armed bandit problem when
  – the action set $A_t$ is unchanged and contains the same K actions for all t, and
  – the user $u_t$ (or the context) is the same for all t.
• The K-armed setting is therefore also called the context-free bandit problem.


Page 19: Contextual Linear Bandit Problem & Applications

Article Recommendation in the Contextual Linear Bandit Setting

Users u1 with age YOUNG and u2 with age OLD

[Figure: each article a now carries its own coefficient vector $\theta_a$ ($\theta_1$, $\theta_2$, $\theta_3$, $\theta_4$)]

Page 20: Contextual Linear Bandit Problem & Applications

Article Recommendation in the Contextual Linear Bandit Setting

Users u1 with age YOUNG and u2 with age OLD

[Figure: the articles with coefficient vectors $\theta_1$, $\theta_2$, $\theta_3$, $\theta_4$]

Linear payoff = $x^\top\theta$

Page 21: Contextual Linear Bandit Problem & Applications

Article Recommendation in the Contextual Linear Bandit Setting

Users u1 with age YOUNG and u2 with age OLD

[Figure: the articles' coefficient vectors, e.g. $\theta_1 = [0.1, 0.6]$, $\theta_2 = [0.5, 0.1]$, $\theta_3 = [0.6, 0.1]$, $\theta_4 = [0.9, 0.2]$]

Linear payoff = $x^\top\theta$


Page 23: Contextual Linear Bandit Problem & Applications

LinUCB Algorithm

• Assumption: the payoff model is linear
  – Most intuitive choice: a linear model
  – Advantage: the confidence interval can be computed efficiently in closed form
• Tempting to apply UCB to general contextual bandit problems
  – asymptotic optimality
  – strong regret bound

⇒ Called the LinUCB algorithm.

Page 24: Contextual Linear Bandit Problem & Applications

[Figure: illustration of the per-arm rewards $r_{\text{arm}}$]

Page 25: Contextual Linear Bandit Problem & Applications

Two Models

• For convenience of exposition, we first describe the simpler form: the disjoint linear model.
• Then we consider the general case: the hybrid model.

⇒ LinUCB is a generic contextual bandit algorithm which applies to applications other than personalized news article recommendation.

Page 26: Contextual Linear Bandit Problem & Applications

Linear Disjoint Model

• We assume the expected payoff of an arm a is linear in its d-dimensional feature $x_{t,a}$ with some unknown coefficient vector $\theta_a^*$; namely, for all t,

  $\mathbb{E}[r_{t,a} \mid x_{t,a}] = x_{t,a}^\top \theta_a^*$.

• The model is called disjoint because the parameters are not shared among different arms.

Page 27: Contextual Linear Bandit Problem & Applications

"Disjoint" in the Article Recommendation Example

Users u1 with age YOUNG and u2 with age OLD

[Figure: each article keeps its own coefficient vector, not shared across articles]

Page 28: Contextual Linear Bandit Problem & Applications

Algorithm

[Figure: LinUCB pseudocode for the disjoint model; the exploration bonus is $c_t\sqrt{x_{t,a}^\top A_a^{-1} x_{t,a}}$]
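Since the pseudocode image did not survive extraction, here is a minimal Python sketch of disjoint LinUCB, following the ridge-regression formulas on the later slides ($A_a = D_a^\top D_a + I_d$, $\hat\theta_a = A_a^{-1} b_a$); treating the confidence multiplier $c_t$ as a fixed input `alpha` is a simplification:

```python
import numpy as np

class LinUCBDisjoint:
    """Minimal sketch of disjoint LinUCB (Li et al., 2010).
    One ridge-regression model (A_a, b_a) per arm a."""
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha                             # exploration constant c_t
        self.A = [np.eye(d) for _ in range(n_arms)]    # A_a = D_a^T D_a + I_d
        self.b = [np.zeros(d) for _ in range(n_arms)]  # b_a = D_a^T y_a

    def select(self, contexts):
        """contexts[a] is the feature vector x_{t,a}; return the arm with the
        highest UCB: x^T theta_hat + alpha * sqrt(x^T A^{-1} x)."""
        ucbs = []
        for a, x in enumerate(contexts):
            A_inv = np.linalg.inv(self.A[a])
            theta_hat = A_inv @ self.b[a]
            ucbs.append(x @ theta_hat + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(ucbs))

    def update(self, x, a, reward):
        self.A[a] += np.outer(x, x)   # rank-one update of A_a
        self.b[a] += reward * x
```

It plugs directly into the interaction loop sketched after Page 16.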

Page 29: Contextual Linear Bandit Problem & Applications

Visual Representation

[Figure: geometric view of the estimate $\hat\theta_a^\top x_{t,a}$ and the confidence width $c_t\sqrt{x_{t,a}^\top A_a^{-1} x_{t,a}}$]

Page 30: Contextual Linear Bandit Problem & Applications

Feature-free bandit vs. linear bandit

Feature-free bandit:
- $\mathbb{E}[r_{t,a} \mid x_{t,a}] = \mu_a^*$, where $\mu_a^*$ is not known a priori.
- Confidence interval: $C_{t,a} = \big\{\mu_a : \tfrac{|\mu_a - \bar\mu_{t,a}|}{1/\sqrt{n_{t,a}}} \le \sqrt{2\log t}\big\}$
- $a_t = \arg\max_{a\in\{1,\dots,K\}} \max_{\mu_a \in C_{t,a}} \mu_a = \arg\max_{a\in\{1,\dots,K\}} \big[\bar\mu_{t,a} + \sqrt{2\log t / n_{t,a}}\big]$

Linear bandit:
- $\mathbb{E}[r_{t,a} \mid x_{t,a}] = x_{t,a}^\top \theta_a^*$, where $\theta_a^*$ is not known a priori.
- Confidence ellipsoid: $C_{t,a} = \{\theta_a : \|\theta_a - \hat\theta_{t,a}\|_{A_{t,a}} \le c_t\}$, where $\|x\|_A \equiv \sqrt{x^\top A x}$.
- $a_t = \arg\max_{a\in\{1,\dots,K\}} \max_{\theta_a \in C_{t,a}} x_{t,a}^\top \theta_a = \arg\max_{a\in\{1,\dots,K\}} \big[x_{t,a}^\top \hat\theta_{t,a} + c_t\sqrt{x_{t,a}^\top A_{t,a}^{-1} x_{t,a}}\big]$


Page 32: Contextual Linear Bandit Problem & Applications

Confidence ellipsoid $C_{t,a} = \{\theta_a : \|\theta_a - \hat\theta_{t,a}\|_{A_{t,a}} \le c_t\}$

[Figure: the confidence ellipsoid $C_{t,a}$ around the coefficient estimate $\hat\theta_{t,a}$; projecting it along the feature vector $x_{t,a}$ gives a confidence interval for $x_{t,a}^\top\theta_a$]

$x_{t,a}^\top\hat\theta_{t,a} + c_t\sqrt{x_{t,a}^\top A_{t,a}^{-1} x_{t,a}} = \max_{\theta_a} x_{t,a}^\top\theta_a \quad \text{s.t.} \quad (\theta_a - \hat\theta_{t,a})^\top A_{t,a}(\theta_a - \hat\theta_{t,a}) \le c_t^2$
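The closed form can be checked in one step (a verification sketch, not on the slides). Writing $\theta_a = \hat\theta_{t,a} + A_{t,a}^{-1/2}u$ with $\|u\|_2 \le c_t$, Cauchy–Schwarz gives

$$x_{t,a}^\top\theta_a = x_{t,a}^\top\hat\theta_{t,a} + \big(A_{t,a}^{-1/2}x_{t,a}\big)^\top u \le x_{t,a}^\top\hat\theta_{t,a} + c_t\big\|A_{t,a}^{-1/2}x_{t,a}\big\|_2 = x_{t,a}^\top\hat\theta_{t,a} + c_t\sqrt{x_{t,a}^\top A_{t,a}^{-1}x_{t,a}},$$

with equality at $u = c_t A_{t,a}^{-1/2}x_{t,a} / \|A_{t,a}^{-1/2}x_{t,a}\|_2$.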

Page 33: Contextual Linear Bandit Problem & Applications

Feature-free bandit = linear bandit with $x_{t,a} \equiv 1$, $\theta_a = \mu_a$

Feature-free bandit:
- $\mathbb{E}[r_{t,a} \mid x_{t,a}] = 1^\top \mu_a^*$, where $\mu_a^*$ is not known a priori.
- Confidence interval: $C_{t,a} = \{\mu_a : \|\mu_a - \bar\mu_{t,a}\|_{n_{t,a}} \le \sqrt{2\log t}\}$, where $\|\mu\|_n \equiv \sqrt{\mu^\top n\,\mu}$.
- $a_t = \arg\max_{a\in\{1,\dots,K\}} \max_{\mu_a\in C_{t,a}} 1^\top\mu_a = \arg\max_{a\in\{1,\dots,K\}} \big[1^\top\bar\mu_{t,a} + \sqrt{2\log t / n_{t,a}}\big]$

Linear bandit:
- $\mathbb{E}[r_{t,a} \mid x_{t,a}] = x_{t,a}^\top\theta_a^*$, where $\theta_a^*$ is not known a priori.
- Confidence ellipsoid: $C_{t,a} = \{\theta_a : \|\theta_a - \hat\theta_{t,a}\|_{A_{t,a}} \le c_t\}$, where $\|x\|_A \equiv \sqrt{x^\top A x}$.
- $a_t = \arg\max_{a\in\{1,\dots,K\}} \max_{\theta_a\in C_{t,a}} x_{t,a}^\top\theta_a = \arg\max_{a\in\{1,\dots,K\}} \big[x_{t,a}^\top\hat\theta_{t,a} + c_t\sqrt{x_{t,a}^\top A_{t,a}^{-1} x_{t,a}}\big]$

Page 34: Contextual Linear Bandit Problem & Applications

A Bayesian approach to derive $\hat\theta_{t,a}$ and $A_{t,a}^{-1}$

Gaussian prior: $p_0(\theta_a) \sim N(0, \lambda I_d)$.

$n_{t,a}$ noisy measurements: $y_{t,a} \sim N(D_{t,a}\theta_a, I_{n_{t,a}})$, i.e., row by row, $y_{t,a}(i) = x_{i,a}^\top\theta_a + \eta_{i,a}$.

Posterior distribution: $p_{t,a}(\theta_a) \sim N(\hat\theta_{t,a}, A_{t,a}^{-1})$, with

$\hat\theta_{t,a} = (D_{t,a}^\top D_{t,a} + I_d)^{-1}\underbrace{D_{t,a}^\top y_{t,a}}_{b_{t,a}}, \qquad A_{t,a} = D_{t,a}^\top D_{t,a} + I_d.$

The reward $x_{t,a}^\top\theta_a \sim N\big(x_{t,a}^\top\hat\theta_{t,a},\; x_{t,a}^\top A_{t,a}^{-1} x_{t,a}\big)$. The upper confidence bound (UCB) is $x_{t,a}^\top\hat\theta_{t,a} + c_t\sqrt{x_{t,a}^\top A_{t,a}^{-1} x_{t,a}}$.
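A small numerical sketch of these formulas, assuming unit-variance noise and λ = 1 so that the prior matches the $I_d$ regularizer above (the data and the multiplier $c_t = 2$ are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 50
theta_true = np.array([0.9, 0.2])          # unknown coefficient vector theta_a^*
D = rng.normal(size=(n, d))                # design matrix D_{t,a}: one row x_{i,a} per pull
y = D @ theta_true + rng.normal(size=n)    # noisy payoffs y(i) = x_{i,a}^T theta + eta

A = D.T @ D + np.eye(d)                    # A_{t,a} = D^T D + I_d (posterior precision)
b = D.T @ y                                # b_{t,a} = D^T y
theta_hat = np.linalg.solve(A, b)          # posterior mean = ridge-regression estimate

x = np.array([1.0, 0.5])                   # a candidate context x_{t,a}
mean = x @ theta_hat                       # posterior mean of the reward x^T theta
std = np.sqrt(x @ np.linalg.solve(A, x))   # posterior std = sqrt(x^T A^{-1} x)
c_t = 2.0                                  # an illustrative confidence multiplier
print(f"UCB = {mean + c_t * std:.3f}  (mean {mean:.3f}, width {c_t * std:.3f})")
```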


Page 38: Contextual Linear Bandit Problem & Applications

A general form of linear bandit

Disjoint linear model:
- $\mathbb{E}[r_{t,a} \mid x_{t,a}] = x_{t,a}^\top\theta_a^*$
- $a_t = \arg\max_{a\in\{1,\dots,K\}} \max_{\theta_a\in C_{t,a}} x_{t,a}^\top\theta_a$

A general linear model:
- $\mathbb{E}[r_t \mid x_t] = x_t^\top\theta^*$
- $x_t = \arg\max_{x\in A_t} \max_{\theta\in C_t} x^\top\theta$

The disjoint model is the special case obtained by stacking the per-arm parameters and embedding each arm's features in the corresponding block:

$\theta = \begin{pmatrix}\theta_1\\ \vdots\\ \theta_a\\ \vdots\\ \theta_K\end{pmatrix}, \qquad A_t = \left\{\begin{pmatrix}0\\ \vdots\\ x_{t,a}\\ \vdots\\ 0\end{pmatrix} : a = 1, 2, \dots, K\right\}$


Page 40: Contextual Linear Bandit Problem & Applications

A hybrid linear model

Hybrid linear model:
- $\mathbb{E}[r_{t,a} \mid z_{t,a}, x_{t,a}] = z_{t,a}^\top\beta^* + x_{t,a}^\top\theta_a^*$
- $a_t = \arg\max_{a\in\{1,\dots,K\}} \max_{\beta,\theta_a\in C_t} \big[z_{t,a}^\top\beta + x_{t,a}^\top\theta_a\big]$

A general linear model:
- $\mathbb{E}[r_t \mid x_t] = x_t^\top\theta^*$
- $x_t = \arg\max_{x\in A_t} \max_{\theta\in C_t} x^\top\theta$

Again a special case of the general model, now with the shared coefficients $\beta$ stacked on top:

$\theta = \begin{pmatrix}\beta\\ \theta_1\\ \vdots\\ \theta_a\\ \vdots\\ \theta_K\end{pmatrix}, \qquad A_t = \left\{\begin{pmatrix}z_{t,a}\\ 0\\ \vdots\\ x_{t,a}\\ \vdots\\ 0\end{pmatrix} : a = 1, 2, \dots, K\right\}$
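The block construction can be made concrete in a few lines of Python; the helper `stack_hybrid_action` and the example dimensions are hypothetical:

```python
import numpy as np

def stack_hybrid_action(z, x, arm, n_arms):
    """Embed a hybrid-model action (z_{t,a}, x_{t,a}) for arm `arm`
    into the long feature vector of the general linear model:
    [ z | 0 ... 0 | x | 0 ... 0 ] with x placed in block `arm`."""
    d_z, d_x = len(z), len(x)
    v = np.zeros(d_z + n_arms * d_x)
    v[:d_z] = z                        # shared features multiply beta
    start = d_z + arm * d_x
    v[start:start + d_x] = x           # arm-specific block multiplies theta_arm
    return v

# Example with made-up 2-dim features and K = 3 arms:
z, x = np.array([0.3, 0.7]), np.array([0.9, 0.2])
print(stack_hybrid_action(z, x, arm=1, n_arms=3))
# -> [0.3 0.7 0.  0.  0.9 0.2 0.  0. ]
```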

Page 41: Contextual Linear Bandit Problem & Applications

A general form of linear bandit, continued

Disjoint linear model:
- $a_t = \arg\max_{a\in\{1,\dots,K\}} \max_{\theta_a\in C_{t,a}} x_{t,a}^\top\theta_a$
- $C_{t,a} = \{\theta_a : \|\theta_a - \hat\theta_{t,a}\|_{A_{t,a}} \le c_t\}$
- $\hat\theta_{t,a} = (D_{t,a}^\top D_{t,a} + I_d)^{-1}D_{t,a}^\top y_{t,a}$, $\quad A_{t,a} = D_{t,a}^\top D_{t,a} + I_d$

A general linear model:
- $x_t = \arg\max_{x\in A_t} \max_{\theta\in C_t} x^\top\theta$
- $C_t = \{\theta : \|\theta - \hat\theta_t\|_{A_t} \le c_t\}$
- $\hat\theta_t = (D_t^\top D_t + I_d)^{-1}D_t^\top y_t$, $\quad A_t = D_t^\top D_t + I_d$


Page 43: Contextual Linear Bandit Problem & Applications

An $O(d\sqrt{T})$ regret bound

Theorem (Theorem 2 + Theorem 3 in APS 2011). Assume that
1. the measurement noise $\eta_t$ is independent of everything else and is $\sigma$-sub-Gaussian for some $\sigma > 0$, i.e., $\mathbb{E}[e^{\lambda\eta_t}] \le \exp(\lambda^2\sigma^2/2)$ for all $\lambda\in\mathbb{R}$; and
2. for all t and all $x\in A_t$, $x^\top\theta^* \in [-1, 1]$.

Then, for any $\delta > 0$, with probability at least $1-\delta$, for all $t \ge 0$:
1. $\theta^*$ lies in the confidence ellipsoid

   $C_t = \Big\{\theta : \|\theta - \hat\theta_t\|_{A_t} \le c_t := \sigma\sqrt{\log\det A_t - d\log\lambda + 2\log\tfrac{1}{\delta}} + \|\theta^*\|\sqrt{\lambda}\Big\}$

2. The regret of the LinUCB algorithm satisfies

   $R_t \le \underbrace{\sqrt{8t}}_{\text{I}}\;\underbrace{\sqrt{\log\det A_t - d\log\lambda}}_{\text{II}}\;\underbrace{\Big(\sigma\sqrt{\log\det A_t - d\log\lambda + 2\log\tfrac{1}{\delta}} + \|\theta^*\|\sqrt{\lambda}\Big)}_{\text{III}:\,c_t}$

Lemma (Determinant–Trace Inequality, Lemma 10 in APS 2011). If $\|x_t\|_2 \le L$ for all $t \ge 0$, then

$\log\det A_t \le d\log\Big(\lambda + \frac{tL^2}{d}\Big).$
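Plugging the lemma into terms II and III makes the $\tilde O(d\sqrt{T})$ rate explicit (a sketch, suppressing constants and treating $\sigma$, $L$, $\lambda$, $\|\theta^*\|$ as fixed):

$$R_T \le \sqrt{8T}\cdot\sqrt{d\log\Big(1 + \tfrac{TL^2}{\lambda d}\Big)}\cdot\Big(\sigma\sqrt{d\log\Big(1 + \tfrac{TL^2}{\lambda d}\Big) + 2\log\tfrac{1}{\delta}} + \|\theta^*\|\sqrt{\lambda}\Big) = \tilde O\big(d\sqrt{T}\big).$$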


Page 45: Contextual Linear Bandit Problem & Applications

Sketch of the proof

We work on the high-probability event $\theta^* \in C_t$ for all $t \ge 0$. Let $(x_t, \tilde\theta_t) = \arg\max_{x\in A_t}\max_{\theta\in C_t}\langle x, \theta\rangle$. Then

$r_t = \langle x_t^*, \theta^*\rangle - \langle x_t, \theta^*\rangle$
$\le \langle x_t, \tilde\theta_t\rangle - \langle x_t, \theta^*\rangle \qquad$ (optimism, since $\theta^*\in C_t$)
$= \langle x_t, \tilde\theta_t - \theta^*\rangle$
$= \langle x_t, \hat\theta_t - \theta^*\rangle + \langle x_t, \tilde\theta_t - \hat\theta_t\rangle$
$\le \|x_t\|_{A_t^{-1}}\|\hat\theta_t - \theta^*\|_{A_t} + \|x_t\|_{A_t^{-1}}\|\tilde\theta_t - \hat\theta_t\|_{A_t} \qquad$ (Cauchy–Schwarz)
$\le 2c_t\|x_t\|_{A_t^{-1}} \qquad$ ($\theta^*, \tilde\theta_t \in C_t = \{\theta : \|\theta - \hat\theta_t\|_{A_t} \le c_t\}$)

Since $x^\top\theta^* \in [-1,1]$ for all $x\in A_t$, we have $r_t \le 2$. Therefore,

$r_t \le \min\{2c_t\|x_t\|_{A_t^{-1}},\, 2\} \le 2c_t\min\{\|x_t\|_{A_t^{-1}},\, 1\}$


Page 47: Contextual Linear Bandit Problem & Applications

Sketch of the proof, continued

$r_t^2 \le 4c_t^2\min\{\|x_t\|^2_{A_t^{-1}},\, 1\} \qquad (1)$

Consider the regret $R_T \equiv \sum_{t=1}^T r_t$:

$R_T \le \sqrt{T\sum_{t=1}^T r_t^2} \underbrace{\le}_{\text{by (1)}} \sqrt{T\sum_{t=1}^T 4c_t^2\min\{\|x_t\|^2_{A_t^{-1}},\, 1\}} \le \underbrace{2\sqrt{T}}_{\text{I}}\,\underbrace{c_T}_{\text{III}}\,\underbrace{\sqrt{\sum_{t=1}^T\min\{\|x_t\|^2_{A_t^{-1}},\, 1\}}}_{\text{II}}$

since $c_t$ is monotonically increasing.

Since $x \le 2\log(1+x)$ for $x\in[0,1]$, we have

$\sum_{t=1}^T\min\{\|x_t\|^2_{A_t^{-1}},\, 1\} \le 2\sum_{t=1}^T\log(1+\|x_t\|^2_{A_t^{-1}}) = 2(\log\det A_T - d\log\lambda).$

The last equality is proved as Lemma 11 in APS 2011.


Page 50: Contextual Linear Bandit Problem & Applications

Algorithm Evaluation

How do we evaluate the performance of a recommendation algorithm?

• Can we just run the algorithm on "live" data?
• Build a simulator to model the bandit process, and evaluate the algorithm on the simulated data?

Page 51: Contextual Linear Bandit Problem & Applications

Algorithm Evaluation

How do we evaluate the performance of a recommendation algorithm?

• Can we just run the algorithm on "live" data? Difficult logistically.
• Build a simulator to model the bandit process, and evaluate the algorithm on the simulated data? May introduce bias from the simulator.
• Instead: use the Yahoo! Today Module logs, where articles were served at random.

Page 52: Contextual Linear Bandit Problem & Applications

Algorithm Evaluation

[Figure: schematic of the offline replay evaluation procedure]

Page 53: Contextual Linear Bandit Problem & Applications

Algorithm Evaluation

No bias! About T events are accepted from TK logged trials.

Page 54: Contextual Linear Bandit Problem & Applications

Algorithm Evaluation (data collection)

Yahoo! News: randomly show the user an article a as the highlighted news story.

Page 55: Contextual Linear Bandit Problem & Applications

Algorithm Evaluation (data collection)

Yahoo! News: randomly show the user an article a as the highlighted news story.

Each logged event contains the article a, the feature vector x, and the response r = 1/0 (click / no click).

Accept an event iff the algorithm being evaluated predicts the same article a.
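A minimal Python sketch of this replay evaluator; the event tuple layout and the `policy` interface (matching the LinUCB sketch from the Page 28 section) are assumptions:

```python
def replay_evaluate(policy, logged_events):
    """Offline replay evaluation on uniformly-random logged events.
    Each event is (contexts, served_article, reward); an event counts only
    if the policy would have served the same article that was logged."""
    accepted, total_reward = 0, 0.0
    for contexts, served_article, reward in logged_events:
        chosen = policy.select(contexts)
        if chosen == served_article:      # accept the event iff the choices match
            policy.update(contexts[chosen], chosen, reward)
            accepted += 1
            total_reward += reward
    return total_reward / max(accepted, 1)   # CTR over accepted events

# With K articles served uniformly at random, roughly T of the T*K logged
# trials are accepted, and the CTR estimate is unbiased (Li et al., 2010).
```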

Page 56: Contextual Linear Bandit Problem & Applications

Algorithm Evaluation (constructing features)

In Yahoo!'s database, each user and each article is described by hundreds of raw features, so we need to reduce the feature dimensionality.

Page 57: Contextual Linear Bandit Problem & Applications

Algorithm Evaluation (constructing features)

In Yahoo!'s database, each user and each article is described by hundreds of raw features, so we need to reduce the feature dimensionality.

Raw user features:    Gender = 1, Age>20 = 1, Age>40 = 0, Student = 1, …
Raw article features: Long = 1, Domestic = 1, Tech = 0, Politics = 1, …

Page 58: Contextual Linear Bandit Problem & Applications

Algorithm Evaluation (constructing features)

Raw user features:    Gender = 1, Age>20 = 1, Age>40 = 0, Student = 1, …
Raw article features: Long = 1, Domestic = 1, Tech = 0, Politics = 1, …

Suppose there is a weight matrix W such that the probability of the user clicking on article a is bilinear in the raw features: $P(\text{click}) \approx \phi_u^\top W \phi_a$, where $\phi_u$ and $\phi_a$ are the raw user and article feature vectors.

Page 59: Contextual Linear Bandit Problem & Applications

Algorithm Evaluation (constructing features)

Logistic regression is used to fit W; the projected user feature vector $\phi_u^\top W$ is then clustered with K-means to find clusters/groups of users.

Page 60: Contextual Linear Bandit Problem & Applications

Algorithm Evaluation (constructing features)

Logistic regression is used to fit W; the projected user feature vector is then clustered with K-means to find clusters/groups of users.

• The constructed feature vector for a user is the vector of probabilities of belonging to the different groups.
• The same procedure is applied to the articles to get constructed article feature vectors.

➢ Disjoint LinUCB uses these constructed user and article feature vectors as its input data.
➢ In the hybrid model, the outer product of the constructed user and article feature vectors is also included as global (shared) features. A sketch of this pipeline follows below.
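A sketch of this pipeline with numpy/scikit-learn. Fitting W as a logistic regression on vectorized outer products, and turning K-means distances into soft memberships via a softmax, are assumptions made for illustration; the slides specify only "logistic regression to get W" and "K-means to find clusters":

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def fit_W(user_raw, article_raw, clicks):
    """Fit the bilinear weight matrix W by logistic regression, using
    phi_u^T W phi_a = vec(W) . vec(outer(phi_u, phi_a))."""
    Z = np.stack([np.outer(u, a).ravel() for u, a in zip(user_raw, article_raw)])
    lr = LogisticRegression(max_iter=1000).fit(Z, clicks)
    d_u, d_a = user_raw.shape[1], article_raw.shape[1]
    return lr.coef_.reshape(d_u, d_a)

def constructed_features(raw, W, n_groups=5):
    """Project raw features with W, cluster with K-means, and return soft
    group memberships (softmax over negative squared distances; an
    assumption, since the slides only say 'probabilities of being in
    different groups')."""
    projected = raw @ W
    km = KMeans(n_clusters=n_groups, n_init=10).fit(projected)
    neg_sq_dist = -km.transform(projected) ** 2   # distances to each centroid
    e = np.exp(neg_sq_dist - neg_sq_dist.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)       # rows sum to 1
```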

Page 61: Contextual Linear Bandit Problem & Applications

Algorithm Evaluation (constructing features)

For example, we have binary raw feature vectors:

Page 62: Contextual Linear Bandit Problem & Applications

Algorithm Evaluation (constructing features)

For example, we have binary raw feature vectors for a user and an article. After clustering:

User:    Group       A     B     C     D     E
         Membership  0.2   0.1   0.35  0.3   0.05

Article: Group       1     2     3     4     5
         Membership  0.7   0.05  0.15  0.05  0.05

Page 63: Contextual Linear Bandit Problem & Applications

Algorithm Evaluation (constructing features)

For example, we have binary raw feature vectors for a user and an article. After clustering:

User:    Group       A     B     C     D     E
         Membership  0.2   0.1   0.35  0.3   0.05

Article: Group       1     2     3     4     5
         Membership  0.7   0.05  0.15  0.05  0.05

• The outer product of the two constructed feature vectors gives the shared features $z_{t,a}$ (25-dimensional here).
• Hybrid model: $\mathbb{E}[r_{t,a}] = z_{t,a}^\top\beta^* + x_{t,a}^\top\theta_a^*$; removing the $z_{t,a}^\top\beta^*$ term recovers disjoint LinUCB.
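For instance, the 25-dimensional shared feature vector computed from the membership tables above:

```python
import numpy as np

user = np.array([0.20, 0.10, 0.35, 0.30, 0.05])     # memberships in groups A-E
article = np.array([0.70, 0.05, 0.15, 0.05, 0.05])  # memberships in groups 1-5

z = np.outer(user, article).ravel()   # shared features z_{t,a} for the hybrid model
print(z.shape)                        # (25,)
```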

Page 64: Contextual Linear Bandit Problem & Applications

Algorithm Evaluation

Algorithms without features:
• purely random
• ε-greedy
• UCB
• omniscient

Algorithms with features:
• ε-greedy (run on different clusters)
• ε-greedy (hybrid; epoch-greedy)
• UCB (cluster)
• LinUCB (disjoint)
• LinUCB (hybrid)

Page 65: Contextual Linear Bandit Problem & Applications

Algorithm Evaluation

[Figure: CTRs of the algorithms, normalized by the CTR of the random recommendation]


Page 67: Contextual Linear Bandit Problem & Applications

Conclusion

• For the multi-armed bandit problem, the feature-free UCB algorithm has regret bound $O\big((K/\Delta)\log T\big)$.
• LinUCB, which uses feature vectors, has regret bound $O(d\sqrt{T})$.
• Evaluated using Yahoo! Front Page Today Module data.
• Introducing contextual information (features) into the recommendation algorithm improved the CTR (reward) by about 10%.

Page 68: Contextual Linear Bandit Problem & Applications

References

• Li, Lihong, et al. "A contextual-bandit approach to personalized news article recommendation." Proceedings of the 19th International Conference on World Wide Web. ACM, 2010.
• Abbasi-Yadkori, Yasin, Dávid Pál, and Csaba Szepesvári. "Improved algorithms for linear stochastic bandits." Advances in Neural Information Processing Systems. 2011.
• Auer, Peter, Nicolò Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine Learning 47.2-3 (2002): 235-256.
• Auer, Peter. "Using confidence bounds for exploitation-exploration trade-offs." Journal of Machine Learning Research 3 (2003): 397-422.