uplift modeling - instytut informatyki politechniki … a model which predicts thedi erencebetween...
Post on 19-May-2018
215 Views
Preview:
TRANSCRIPT
Uplift modeling
Szymon Jaroszewicz
Institute of Computer SciencePolish Academy of Sciences
Warsaw, Poland
National Institute of TelecommunicationsWarsaw, Poland
Szymon Jaroszewicz Uplift modeling
Joint work with:
Piotr Rzepakowski
Maciej Jaskowski
Lukasz Zaniewicz
Micha l So ltys
Szymon Jaroszewicz Uplift modeling
What is uplift modeling?
A typical marketing campaign
SamplePilot
campaignModel P(buy |X)
Selecttargets forcampaign
But this is not what we need!
We want people who bought because of the campaign
Not people who bought after the campaign
Szymon Jaroszewicz Uplift modeling
What is uplift modeling?
A typical marketing campaign
SamplePilot
campaignModel P(buy |X)
Selecttargets forcampaign
But this is not what we need!
We want people who bought because of the campaign
Not people who bought after the campaign
Szymon Jaroszewicz Uplift modeling
A typical marketing campaign
We can divide potential customers into four groups
1 Responded because of the action(the people we want)
2 Responded, but would have responded anyway(unnecessary costs)
3 Did not respond and the action had no impact(unnecessary costs)
4 Did not respond because the action had a(negative impact)
Szymon Jaroszewicz Uplift modeling
Solution: Uplift modeling
Solution: Uplift modeling
Two training sets:1 the treatment group
on which the action was taken2 the control group
on which no action was takenused as background
Build a model which predicts the difference between classprobabilities in the treatment and control groups
Szymon Jaroszewicz Uplift modeling
Difference with traditional classification
Notation:
PT probabilities in the treatment group
PC probabilities in the control group
Traditional models predict the conditional probability
PT (Y | X1, . . . ,Xm)
Uplift models predict change in behaviour resulting from the action
PT (Y | X1, . . . ,Xm)− PC (Y | X1, . . . ,Xm)
Szymon Jaroszewicz Uplift modeling
Marketing campaign (uplift modeling approach)
Marketing campaign (uplift modeling approach)
Treatmentsample
Controlsample
Pilotcampaign
ModelPT (buy |X ) −
PC (buy |X )
Selecttargets forcampaign
Szymon Jaroszewicz Uplift modeling
Applications in medicine
A typical medical trial:
treatment group: gets the treatmentcontrol group: gets placebo (or another treatment)do a statistical test to show that the treatment is better thanplacebo
With uplift modeling we can find out for whom the treatmentworks best
Personalized medicine
Szymon Jaroszewicz Uplift modeling
Main difficulty of uplift modeling
The fundamental problem of causal inference
Our knowledge is always incomplete
For each training case we know either
what happened after the treatment, orwhat happened if no treatment was given
Never both!
This makes designing uplift algorithms challenging
... and the intuitions are often hard to grasp
Szymon Jaroszewicz Uplift modeling
Evaluating uplift models
Szymon Jaroszewicz Uplift modeling
Evaluating uplift models
We have two separate test sets:
a treatment test seta control test set
Problem
To assess the gain for a customer we need to know both treatmentand control responses, but only one of them is known
Solution
Assess gains for groups of customers
Szymon Jaroszewicz Uplift modeling
Evaluating uplift classifiers
For example:
Gain for the 10% highest scoring customers =
% of successes for top 10% treated customers
− % of successes for top 10% control customers
Szymon Jaroszewicz Uplift modeling
Uplift curves
Uplift curves are a more convenient tool:
Draw separate lift curves on treatment and control data(TPR on the Y axis is replaced with percentage of successes inthe whole population)Uplift curve =
lift curve on treatment data – lift curve on control data
Interpretation: net gain in success rate if a given percentage ofthe population is treated
We can of course compute the Area Under the Uplift Curve(AUUC)
Szymon Jaroszewicz Uplift modeling
An uplift curve for marketing data
0 20 40 60 80 100percent targeted
0.01
0.00
0.01
0.02
0.03
0.04
0.05
cum
ulat
ive
gain
[% to
tal]
Hillstrom.Hillstrom_T1C1_visit_w
MultiP.Logit: area = 0.7599Zet.Logit: area = 0.7573T.Logit: area = 0.4872
Szymon Jaroszewicz Uplift modeling
An uplift curve for a medical trial dataset comparing twocancer treatments
Szymon Jaroszewicz Uplift modeling
Uplift modeling algorithms
Szymon Jaroszewicz Uplift modeling
The two model approach
An obvious approach to uplift modeling:
1 Build a classifier MT modeling PT (Y |X) on the treatmentsample
2 Build a classifier MC modeling PC (Y |X) on the controlsample
3 The uplift model subtracts probabilities predicted by bothclassifiers
MU(Y |X) = MT (Y |X)−MC (Y |X)
Szymon Jaroszewicz Uplift modeling
Two model approach
Advantages:
Works with existing classification models
Good probability predictions ⇒ good uplift prediction
Disadvantages:
Differences between class probabilities can follow a differentpattern than the probabilities themselves
each classifier focuses on changes in class probabilities butignores the weaker ‘uplift signal’algorithms designed to focus directly on uplift can give betterresults
Szymon Jaroszewicz Uplift modeling
Decision trees for uplift modeling
Szymon Jaroszewicz Uplift modeling
Hansotia, Rukstales 2002
The ∆∆P criterion
PT (Y = 1) = 5%PC (Y = 1) = 3%
∆P = 2%
PT (Y = 1) = 8%PC (Y = 1) = 3.5%
∆P = 4.5%
X < a
PT (Y = 1) = 3.7%PC (Y = 1) = 2.8%
∆P = 0.9%
X >= a
∆∆P = 3.6%
Pick a test with highest ∆∆P
Szymon Jaroszewicz Uplift modeling
Hansotia, Rukstales 2002
It is not in line with modern decision tree learning such asC4.5
splitting criterion directly maximizes the difference betweenprobabilities (target criterion)no pruning
Rzepakowski, Jaroszewicz 2010, 2012
splitting criterion based on Information Theory, more in linewith modern decision treespruning designed for uplift modelingmulticlass problems and multiway splits possibleif the control group is empty, the algorithm reduces to classicaldecision tree learning
Szymon Jaroszewicz Uplift modeling
KL divergence as a splitting criterion for uplift trees
Measure difference between treatment and control groupsusing KL divergence
KL(
PT (Y ) : PC (Y ))
=∑
y∈Dom(Y )
PT (y) logPT (y)
PC (y)
KL-divergence conditional on a test X
KL(PT (Y ) : PC (Y ) | X ) =∑x∈Dom(X )
NT (X = x) + NC (X = x)
NT + NCKL(PT (Y |X = x) : PC (Y |X = x)
)
note the weighting factorsNT and NC denote counts in the treatment and controldatasets
Szymon Jaroszewicz Uplift modeling
KL divergence as a splitting criterion for uplift trees
Measure difference between treatment and control groupsusing KL divergence
KL(
PT (Y ) : PC (Y ))
=∑
y∈Dom(Y )
PT (y) logPT (y)
PC (y)
KL-divergence conditional on a test X
KL(PT (Y ) : PC (Y ) | X ) =∑x∈Dom(X )
NT (X = x) + NC (X = x)
NT + NCKL(PT (Y |X = x) : PC (Y |X = x)
)
note the weighting factorsNT and NC denote counts in the treatment and controldatasets
Szymon Jaroszewicz Uplift modeling
The KLgain
How much larger does the difference between class distributions inT and C groups become after a split on X ?
KLgain(X ) =
KL(
PT (Y ) : PC (Y )|X)− KL
(PT (Y ) : PC (Y )
)
Szymon Jaroszewicz Uplift modeling
Properties of the KLgain
Properties:
If Y ⊥ X then KLgain(X ) = 0
If PT (Y |X ) = PC (Y |X ) then
KLgain(X ) = minimum
If the control group is empty, KLgain reduces to entropy gain(Laplace correction is used on P(Y ))
Szymon Jaroszewicz Uplift modeling
Negative values of KLgain
Classification decision trees: gain(X ) ≥ 0
KLgain(X ) can be negative:
PT (Y = 1) = 0.28PC (Y = 1) = 0.85
PT (Y = 1) = 0.7PC (Y = 1) = 0.9
X = 0
PT (Y = 1) = 0.1PC (Y = 1) = 0.4
X = 1
PT (X = 0) = 0.3PC (X = 0) = 0.9
Note the dependence of X on T /C group selection
Szymon Jaroszewicz Uplift modeling
Negative values of KLgain
Negative gain values are only possible when X depends ongroup selection
This a variant of the Simpson’s paradox
Theorem
If X is independent of the selection of the T and C groups then
KLgain(X ) ≥ 0
In practice we want X to be independent of the T /C groupselection
In medical research great care is taken to ensure this
Szymon Jaroszewicz Uplift modeling
The KLgain ratio
In standard decision trees, the gain is divided by test’s entropyto punish tests with large number of outcomes
In our case:
KLratio(X ) =KLgain(X )
I (X )
where
I (X ) = H
(NT
N,NC
N
)KL(PT (X ) : PC (X )) +
NT
NH(PT (X )) +
NC
NH(PC (X )) +
1
2
Tests with large numbers of outcomes are punished
Tests for which PT (X ) and PC (X ) differ are punished
This prevents splits correlated with the division into treatmentand control groups
Szymon Jaroszewicz Uplift modeling
Uplift modeling through class variable
transformation
Szymon Jaroszewicz Uplift modeling
Uplift modeling through class variable transformation
Introduced in Jaskowski, Jaroszewicz, 2012
Allows for adapting an arbitrary classifier to uplift modeling
Let G ∈ {T ,C} denote the group membership (treatment orcontrol)
Define an r.v.
Z =
1 if G = T and Y = 1,
1 if G = C and Y = 0,
0 otherwise.
In plain English: flip the class in the control dataset
Szymon Jaroszewicz Uplift modeling
Uplift modeling through class variable transformation
Now
P(Z = 1|X1, . . . ,Xm)
= PT (Y = 1|X1, . . . ,Xm)P(G = T |X1, . . . ,Xm)
+ PC (Y = 0|X1, . . . ,Xm)P(G = C |X1, . . . ,Xm)
Assume that G is independent of X1, . . . ,Xm (otherwise thestudy is badly constructed):
P(Z = 1|X1, . . . ,Xm) = PT (Y = 1|X1, . . . ,Xm)P(G = T )
+ PC (Y = 0|X1, . . . ,Xm)P(G = C )
Szymon Jaroszewicz Uplift modeling
Uplift modeling through class variable transformation
Assume P(G = T ) = P(G = C ) = 12 (otherwise reweight one
of the datasets):
2P(Z = 1|X1, . . . ,Xm)
= PT (Y = 1|X1, . . . ,Xm) + PC (Y = 0|X1, . . . ,Xm)
= PT (Y = 1|X1, . . . ,Xm) + 1− PC (Y = 1|X1, . . . ,Xm)
Finally
PT (Y = 1|X1, . . . ,Xm)− PC (Y = 1|X1, . . . ,Xm)
= 2P(Z = 1|X1, . . . ,Xm)− 1
Szymon Jaroszewicz Uplift modeling
Uplift modeling through class variable transformation
Conclusion
Modeling P(Z = 1|X ) is equivalent to modeling the differencebetween class probabilities in the treatment and control groups
The algorithm:
1 Flip the class in DC
2 Concatenate D = DT ∪DC
3 Build any classifier on D
4 The classifier is actually an uplift model
Szymon Jaroszewicz Uplift modeling
Advantages
Any classifier can be turned into an uplift model
A single model is built
coefficients are easier to interpret than for the double modelthe model predicts uplift directly
(will not focus on predicting classes themselves)a single model is built on a large dataset
(double model method subtracts two models built on smalldatasets)
Disadvantage: the double model may represent a morecomplex decision surface
Szymon Jaroszewicz Uplift modeling
Uplift Support Vector Machines
Szymon Jaroszewicz Uplift modeling
Uplift Support Vector Machines
Introduced in Zaniewicz, Jaroszewicz, 2013
Recall that the outcome of an action can be
positivenegativeneutral
Main idea
Use two parallel hyperplanes dividing the sample space into threeareas:
positive (+1)
neutral (0)
negative (-1)
Szymon Jaroszewicz Uplift modeling
Uplift Support Vector Machines
Introduced in Zaniewicz, Jaroszewicz, 2013
Recall that the outcome of an action can be
positivenegativeneutral
Main idea
Use two parallel hyperplanes dividing the sample space into threeareas:
positive (+1)
neutral (0)
negative (-1)
Szymon Jaroszewicz Uplift modeling
Uplift Support Vector Machines
H1
H2
+1
0
−1H1 : 〈w, x〉+ b1 = 0H2 : 〈w, x〉+ b2 = 0
Szymon Jaroszewicz Uplift modeling
Uplift Support Vector Machines
Fundamental problem of causal inference⇒ We never know if a point was classified correctly!
Need to use as much information as possible
Four types of points: T+, T−, C+, C−
Positive area (+1):
T−, C+ definitely misclassifiedT+, C− may be correct and definitely not a loss
(true outcome may only be neutral)
Negative area (-1):
T+, C− definitely misclassifiedT−, C+ may be correct and definitely not a loss
(true outcome may only be neutral)
Neutral area (0):
all predictions may be correct or incorrect
Szymon Jaroszewicz Uplift modeling
Uplift Support Vector Machines – problem formulation
Penalize points for being on the wrong side of eachhyperplane separately
Points in the neutral area are penalized for crossing onehyperplane
this prevents all points from being classified as neutral
Points which are definitely misclassified are penalized forcrossing two hyperplanes
such points should be avoided, thus the higher penalty
Other points are not penalized
Szymon Jaroszewicz Uplift modeling
Uplift Support Vector Machines – problem formulation
H1
H2T+
ξi ,1C+
ξi ,2
T+C−
T+
ξi ,1ξi2
T−ξi ,2ξi ,1
C+
Szymon Jaroszewicz Uplift modeling
Optimization task – primal form
minw,b1,b2∈Rm+2
1
2〈w,w〉+ C1
∑DT
+∪DC−
ξi ,1 + C2
∑DT−∪DC
+
ξi ,1
+ C2
∑DT
+∪DC−
ξi ,2 + C1
∑DT−∪DC
+
ξi ,2,
subject to:
〈w, xi 〉+ b1 ≤ −1 + ξi ,1, for (xi , yi ) ∈ DT+ ∪DC
−,
〈w, xi 〉+ b1 ≥ +1− ξi ,1, for (xi , yi ) ∈ DT− ∪DC
+,
〈w, xi 〉+ b2 ≤ −1 + ξi ,2, for (xi , yi ) ∈ DT+ ∪DC
−,
〈w, xi 〉+ b2 ≥ +1− ξi ,2, for (xi , yi ) ∈ DT− ∪DC
+,
ξi ,j ≥ 0, dla i = 1, . . . , n, j ∈ {1, 2},
Szymon Jaroszewicz Uplift modeling
Optimization task – primal form
We have two penalty parameters:
C1 penalty coefficient for being on the wrong side of onehyperplane
C2 coefficient of additional penalty for crossing also thesecond hyperplane
All points classified as neutral are penalized with C1ξ
All definitely misclassified points are penalized with C1ξ andC2ξ
How do C1 and C2 influence the model?
Szymon Jaroszewicz Uplift modeling
Influence of penalty coefficients C1 and C2 on the model
Lemma
For a well defined model C2 ≥ C1. Otherwise the order of thehyperplanes would be reversed.
Lemma
If C2 = C1 then no points are classified as neutral.
Lemma
For sufficiently large ratio C2/C1 no point is penalized for crossingboth hyperplanes. (Almost all points are classified as neutral.)
Szymon Jaroszewicz Uplift modeling
Influence of penalty coefficients C1 and C2 on the model
The C1 coefficient plays the role of the penalty in classicalSVMs
The ratio C2/C1 decides on the proportion of cases classifiedas neutral
Szymon Jaroszewicz Uplift modeling
Example: the tamoxifen drug trial data
1.00 1.02 1.04 1.06 1.08 1.10 1.12 1.14 1.16C2 /C1
40
80
120
160
200
240
280
320nu
mbe
r of c
ases
tamoxifen
classified negativeclassified neutralclassified positive
Szymon Jaroszewicz Uplift modeling
Example: the tamoxifen drug trial data
1.00 1.02 1.04 1.06 1.08 1.10 1.12 1.14 1.16C2 /C1
16
12
8
4
0
4
8
12up
lift [
%]
tamoxifen
classified negativeclassified neutralclassified positive
Szymon Jaroszewicz Uplift modeling
Literature
Nicholas J. Radcliffe and Patrick D. Surry. Real-World UpliftModelling with Significance-Based Uplift Trees, White paper,Stochastic solution ltd, 2011 (PDF) [a good overview paper]
P. Rzepakowski, S. Jaroszewicz. Decision Trees for UpliftModeling, ICDM’2010 (PDF)
M. Jaskowski, S. Jaroszewicz. Uplift Modeling for ClinicalTrial Data, In ICML 2012 Workshop on Machine Learning forClinical Data Analysis (PDF)
L. Zaniewicz, S. Jaroszewicz. Support Vector Machines forUplift Modeling, IEEE ICDM Workshop on Causal Discovery(CD 2013) at ICDM 2013
Szymon Jaroszewicz Uplift modeling
top related