Estimation of Causal Direction in the Presence of Latent Confounders
Using a Bayesian LiNGAM Mixture Model
Naoki Tanaka, Shohei Shimizu, Takashi Washio
The Institute of Scientific and Industrial Research, Osaka University
Outline
1. Motivation
2. Background
3. Our Approach
4. Our Model: Bayesian LiNGAM Mixture
5. Simulation Experiments
Motivation
• Estimating causal structure has recently attracted much attention in machine learning, for example in:
– Epidemiology
– Genetics
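[Figure: two observed variables, x1 and x2 (example labels: "Sleep problems" and "Depression mood"), connected by an arrow labeled "Cause", with a latent confounder f affecting both.]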
• The estimation results can be biased if there are latent confounders.
→ Latent confounders are unobserved variables that have more than one observed child variable.
• We propose a new estimation approach that can solve the problem.
LiNGAM (Linear Non-Gaussian Acyclic Model) [Shimizu et al., 2006]
• The relations between variables are linear.
• Observed variables are generated from a DAG (directed acyclic graph).
$$x_1 = 1.4\,x_3 + e_1$$
$$x_2 = -0.8\,x_1 + 0.5\,x_3 + e_2$$
$$x_3 = e_3$$
• External influences $e_i$ are non-Gaussian.
• No latent confounders.
→ The $e_i$ are mutually independent.
• LiNGAM is an identifiable causal model.
[Figure: DAG with edges x3 → x1 (1.4), x3 → x2 (0.5) and x1 → x2 (-0.8), and external influences e1, e2, e3.]
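The three-variable example on this slide can be simulated directly. The sketch below is only an illustration: the Laplace noise, the function names and the seed handling are assumptions made here, not part of the slides, which only require the external influences to be non-Gaussian and independent.

```python
import random


def sample_laplace(rng, scale=1.0):
    """Zero-mean Laplace draw, built as the difference of two unit exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def generate_lingam(n, seed=0):
    """Sample n observations (x1, x2, x3) from the slide's three-variable example.

    Assumption: all external influences are unit-scale Laplace variables.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        e1, e2, e3 = (sample_laplace(rng) for _ in range(3))
        x3 = e3                          # x3 has no parents
        x1 = 1.4 * x3 + e1               # x3 -> x1
        x2 = -0.8 * x1 + 0.5 * x3 + e2   # x1 -> x2 and x3 -> x2
        data.append((x1, x2, x3))
    return data


sample = generate_lingam(5)
```

Because the graph is acyclic, each variable can be computed as soon as its parents are known, which is why the loop follows the causal order x3, x1, x2.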
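A Problem of LiNGAM
• Latent confounders make the external influences $e_i$ dependent.
→ The estimation results can be biased.
With $x_3$ observed:
$$x_1 = 1.4\,x_3 + e_1$$
$$x_2 = -0.8\,x_1 + 0.5\,x_3 + e_2$$
With $x_3 = f$ unobserved (a latent confounder):
$$x_1 = e_1'$$
$$x_2 = -0.8\,x_1 + e_2'$$
where $e_1' = 1.4\,f + e_1$ and $e_2' = 0.5\,f + e_2$ are dependent because both contain $f$.
[Figure: example relating "Medicine A" to "Survival rate", shown separately for patients whose condition is mild and patients whose condition is serious.]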
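A small simulation makes the bias concrete. The sketch below is a hedged illustration, not the estimation procedure of LiNGAM: it uses an ordinary least-squares slope as a stand-in estimator, assumes unit-scale Laplace variables for f, e1 and e2, and all function names are hypothetical.

```python
import random


def laplace(rng):
    """Zero-mean Laplace draw (difference of two unit exponentials)."""
    return rng.expovariate(1.0) - rng.expovariate(1.0)


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs (with intercept)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def confounded_pair(n, seed=0):
    """Sample (x1, x2) from the slide's example with x3 = f left unobserved."""
    rng = random.Random(seed)
    x1s, x2s = [], []
    for _ in range(n):
        f, e1, e2 = laplace(rng), laplace(rng), laplace(rng)
        x1 = 1.4 * f + e1
        x2 = -0.8 * x1 + 0.5 * f + e2
        x1s.append(x1)
        x2s.append(x2)
    return x1s, x2s


x1s, x2s = confounded_pair(50000)
naive_slope = ols_slope(x1s, x2s)
print(f"true effect: -0.8, slope when f is ignored: {naive_slope:.3f}")
```

Under these assumptions the population value of the naive slope is -0.8 + 0.7/2.96, roughly -0.56, so ignoring f understates the strength of the effect of x1 on x2.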
LiNGAM with Latent Confounders [Hoyer et al., 2008]
• LvLiNGAM (Latent variable LiNGAM)
– $\lambda_{id}$: represents the effect of latent variable $f_d$ on observed variable $x_i$
– $f_d$: latent variables (independent, non-Gaussian)
A Problem in Estimation of LiNGAM with Latent Confounders
• Existing methods:
– An estimation method using overcomplete ICA. [Hoyer et al., 2008]
→ Suffers from local optima and requires large sample sizes.
– Methods that estimate unconfounded causal relations. [Entner and Hoyer, 2011; Tashiro et al., 2012]
→ Cannot estimate the causal direction between two observed variables that are affected by latent confounders.
• We propose an alternative.
– Computationally simpler.
– Capable of finding a causal direction in the presence of latent confounders.
Basic Idea of Our Approach
• Assumption
– Continuous latent confounders can be approximated by discrete variables.
→ LiNGAM with latent confounders reduces to a LiNGAM mixture model. [Shimizu et al., 2008]
• Estimation
– Estimation of the LiNGAM mixture. [Mollah et al., 2006]
  • Also suffers from local optima.
– We propose to use a Bayesian approach.
  • Bayesian approach for basic LiNGAM. [Hoyer et al., 2009]
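LiNGAM Mixture Model [Shimizu et al., 2008]
• A data generating model of observed variable $x_i$ within class $c$ is
$$x_i = \mu_i^{(c)} + \sum_{k(j)<k(i)} b_{ij}^{(c)}\,\bigl(x_j - \mu_j^{(c)}\bigr) + e_i^{(c)}$$
Matrix form:
$$\mathbf{x} = \mathbf{B}^{(c)}\mathbf{x} + \bigl(\mathbf{I} - \mathbf{B}^{(c)}\bigr)\boldsymbol{\mu}^{(c)} + \mathbf{e}^{(c)}$$
• Existing estimation methods for the LiNGAM mixture model also suffer from local optima. [Mollah et al., 2006]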
[Figure: scatter plot of data from a two-class LiNGAM mixture. In both Class 1 and Class 2 the edge between x1 and x2 has coefficient 0.8; the listed means are 0, 0 and 7, 6.]
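The matrix form above can be turned into a small sampler for a two-variable, two-class mixture like the one in the figure. This is a minimal sketch under several assumptions that the transcript does not confirm: the edge direction is taken to be x1 → x2, the class means are read as (0, 0) and (7, 6), the classes are equally likely, and the noise is unit-scale Laplace. The function names are hypothetical.

```python
import random


def laplace(rng):
    """Zero-mean Laplace draw (difference of two unit exponentials)."""
    return rng.expovariate(1.0) - rng.expovariate(1.0)


def sample_mixture(n, means, b21=0.8, weights=None, seed=0):
    """Draw n points (x1, x2, c) from a LiNGAM mixture with assumed direction x1 -> x2.

    means[c] = (mu1, mu2) is the class-specific vector mu^(c). Within a class,
    x - mu = B (x - mu) + e, where B has the single nonzero entry b21.
    """
    rng = random.Random(seed)
    classes = list(range(len(means)))
    out = []
    for _ in range(n):
        c = rng.choices(classes, weights=weights)[0]  # equal weights if None
        mu1, mu2 = means[c]
        e1, e2 = laplace(rng), laplace(rng)
        x1 = mu1 + e1
        x2 = mu2 + b21 * (x1 - mu1) + e2
        out.append((x1, x2, c))
    return out


data = sample_mixture(20000, means=[(0.0, 0.0), (7.0, 6.0)])
```

Only the mean vector differs between the classes here, while b21 and the noise law are shared, which mirrors the shape of the scatter plot: two clouds with the same slope, shifted relative to each other.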
Relation of Latent Variable LiNGAM and LiNGAM Mixture (1)
• We assume that continuous latent confounders can be approximated with good precision by discrete variables that take several values.
– The combination of the discrete values determines which "class" an observation belongs to.
→ The $e_i$ within the same class are mutually independent.
→ It is simpler than incorporating latent confounders in LiNGAM directly.
Latent variable LiNGAM:
$$x_1 = 1.4\,f + e_1$$
$$x_2 = -0.8\,x_1 + 0.5\,f + e_2$$
LiNGAM mixture (within class $c$, where $e_1$ and $e_2$ are independent):
$$x_1 = \mu_1^{(c)} + e_1$$
$$x_2 = -0.8\,x_1 + \mu_2^{(c)} + e_2$$
Relation of Latent Variable LiNGAM and LiNGAM Mixture (2)
• A simple example
– If latent confounders $f_3$ and $f_4$ can be approximated by 0 and 1 …
Latent variable LiNGAM:
$$x_i = \sum_{k(j)<k(i)} b_{ij}^{(c)}\,x_j + \sum_d \lambda_{id}\,f_d + e_i^{(c)}$$
reduces to the LiNGAM mixture:
$$x_i = \sum_{k(j)<k(i)} b_{ij}^{(c)}\,\bigl(x_j - \mu_j^{(c)}\bigr) + e_i^{(c)} + \mu_i^{(c)}$$
[Figure: x1 and x2 with latent confounders f3 and f4 and edge labels 0.3, 0.7, 0.9 and 0.6. For each class, the values of f3 and f4 are folded into μ1 and μ2 of that class.]
Class   f3   f4   μ1(c)   μ2(c)
1       0    0    0       0
2       0    1    0.9     0.6
3       1    0    0.3     0.7
4       1    1    1.2     1.3
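The class-specific values listed on this slide can be reproduced by enumerating the binary confounder values. The sketch below rests on an assumption inferred from those values rather than stated in the transcript: f3 is taken to load on x1 and x2 with 0.3 and 0.7, and f4 with 0.9 and 0.6. The names `LOADINGS` and `class_intercepts` are hypothetical.

```python
from itertools import product

# Assumed assignment of the figure's edge labels to confounder loadings.
LOADINGS = {
    "x1": {"f3": 0.3, "f4": 0.9},
    "x2": {"f3": 0.7, "f4": 0.6},
}


def class_intercepts(loadings):
    """Map each (f3, f4) in {0, 1}^2 to the pair (mu1, mu2) it induces."""
    table = {}
    for f3, f4 in product((0, 1), repeat=2):
        mu = tuple(
            round(w["f3"] * f3 + w["f4"] * f4, 10)  # rounding hides float noise
            for w in (loadings["x1"], loadings["x2"])
        )
        table[(f3, f4)] = mu
    return table


for label, (key, mu) in enumerate(class_intercepts(LOADINGS).items(), start=1):
    print(f"Class {label}: (f3, f4) = {key}, (mu1, mu2) = {mu}")
```

With two binary confounders there are four combinations, so four classes; in general the number of classes grows with the number of discrete values per confounder.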
Bayesian LiNGAM Mixture Model (1)
• The data within class $c$ are assumed to be generated by the LiNGAM model.
→ $b_{ij}$ and $p(e_i)$, the densities of $e_i$, have no relation to the latent confounders $f_d$, so they do not differ between classes.
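$$x_1 = 1.4\,f + e_1$$
$$x_2 = -0.8\,x_1 + 0.5\,f + e_2$$
[Figure annotations: although f changes between classes, the densities of e1 and e2 do not change, and the coefficient does not change.]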
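• $\mathbf{B}^{(c)}$ and $\mathbf{e}^{(c)}$ are the same between classes, so we replace $\mathbf{B}^{(c)}$ and $\mathbf{e}^{(c)}$ of the LiNGAM mixture model by $\mathbf{B}$ and $\mathbf{e}$:
$$\mathbf{x} = \mathbf{B}\mathbf{x} + (\mathbf{I} - \mathbf{B})\boldsymbol{\mu}^{(c)} + \mathbf{e}$$
• Then the probability density of $\mathbf{x}$ within class $c$ is
$$p(\mathbf{x} \mid c) = \prod_i p_{e_i}\Bigl(x_i - \mu_i^{(c)} - \sum_{k(j)<k(i)} b_{ij}\,\bigl(x_j - \mu_j^{(c)}\bigr)\Bigr)$$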
Bayesian LiNGAM Mixture Model (2)
• The probability densities of the data within each class are mixed according to some weights.
• Class membership: multinomial distribution.
• The parameters of the multinomial distribution: Dirichlet distribution
– A typical prior for the parameters of the multinomial distribution.
– Conjugate prior for the multinomial distribution.
Compare Three LiNGAM Mixture Models
• Select the model with the largest log-marginal likelihood.
• There are only three models ($G_1$, $G_2$ and $G_3$) between two observed variables because of the assumption of acyclicity.
[Figure: the three candidate models G1, G2 and G3 over x1 and x2, each with a "class" node.]
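Log-marginal Likelihood of Our Model
• Bayes' theorem
• The assumption of i.i.d. data
• Log-marginal likelihood is calculated as follows, with $\theta$ denoting the parameters of model $G$ and $\mathbf{x}^{(n)}$ the $n$-th observation:
$$\log p(D \mid G) = \log \int \prod_{n} p\bigl(\mathbf{x}^{(n)} \mid \theta, G\bigr)\; p(\theta \mid G)\, d\theta$$
(first factor: LiNGAM mixture; second factor: prior distribution)
• We use Monte Carlo integration to compute the integral.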
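Monte Carlo integration of a marginal likelihood can be sketched as follows: draw parameters from the prior, evaluate the likelihood of the data at each draw, and average. The example is an assumption-laden illustration, not the authors' implementation. It uses a toy stand-in model (Gaussian data with a Gaussian prior on the mean) because its exact marginal likelihood is known in closed form, and every function name is hypothetical.

```python
import math
import random


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_marginal_mc(log_lik, sample_prior, n_samples=20000, seed=0):
    """Estimate log of the integral of p(D|theta) p(theta) using prior draws."""
    rng = random.Random(seed)
    logs = [log_lik(sample_prior(rng)) for _ in range(n_samples)]
    return log_sum_exp(logs) - math.log(n_samples)


# Toy stand-in model (NOT the LiNGAM mixture): x_n ~ N(theta, 1), theta ~ N(0, 1).
DATA = [0.3, -0.1, 0.5]


def toy_log_lik(theta):
    """Log-likelihood of DATA under the i.i.d. assumption."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2 for x in DATA)


def toy_prior(rng):
    """One draw from the N(0, 1) prior on theta."""
    return rng.gauss(0.0, 1.0)


def toy_exact():
    """Closed-form log-marginal likelihood of the toy model, for comparison."""
    n, s, ss = len(DATA), sum(DATA), sum(x * x for x in DATA)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * math.log(1 + n)
            - 0.5 * (ss - s * s / (1 + n)))


estimate = log_marginal_mc(toy_log_lik, toy_prior)
print(f"Monte Carlo: {estimate:.3f}, exact: {toy_exact():.3f}")
```

Model comparison then amounts to computing such an estimate for each candidate model and keeping the largest one. Working in log space with `log_sum_exp` avoids underflow when the product of many per-observation densities becomes very small.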
Distribution of $e_i$
• $e_i$ follows a generalized Gaussian distribution with zero mean.
→ Includes Gaussian, Laplace, continuous uniform and many other non-Gaussian distributions.
– $\Gamma$ is the Gamma function.
– $\mathrm{Var}(e_i) = 1$
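A unit-variance generalized Gaussian density can be written down as below. The slide's own formula did not survive, so this sketch assumes the standard parameterization with shape β and scale α, where α is fixed by the constraint Var(e) = 1 through α² Γ(3/β) / Γ(1/β) = 1. The function names are hypothetical.

```python
import math


def gen_gaussian_pdf(e, beta):
    """Zero-mean, unit-variance generalized Gaussian density with shape beta > 0.

    Assumed form: p(e) = beta / (2 alpha Gamma(1/beta)) * exp(-(|e| / alpha)^beta),
    with alpha chosen so that the variance equals 1.
    """
    alpha = math.sqrt(math.gamma(1.0 / beta) / math.gamma(3.0 / beta))
    norm = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return norm * math.exp(-((abs(e) / alpha) ** beta))


def integrate(func, lo=-12.0, hi=12.0, steps=48000):
    """Plain midpoint rule, adequate for these light-tailed densities."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (i + 0.5) * h) for i in range(steps))


for beta in (1.0, 2.0, 4.0):
    mass = integrate(lambda e: gen_gaussian_pdf(e, beta))
    var = integrate(lambda e: e * e * gen_gaussian_pdf(e, beta))
    print(f"beta={beta}: total mass {mass:.4f}, variance {var:.4f}")
```

With β = 2 this is the standard normal density and with β = 1 a Laplace density; larger β gives flatter, more uniform-like shapes.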
Prior Distributions and the Number of Classes
• Prior distribution
– … and …
– …, … and …
– … can be calculated by using the equation of …
Inv-Gamma(3,3)
• How to select the number of classes
– Note that a "true" number of classes does not exist.
Number of classes:   1     …     …
                     0.6   0.1   0.3
                     0.3   0.8   0.5
                     0.1   0.1   0.2
① Selects the best model. (letter in red)
② Selects the best number of classes. (painted in orange)
In a Dirichlet process mixture model,
[Antoniak, 1974]
Simulation Settings (1)
• Generated data using a LiNGAM with latent confounders. [Hoyer et al., 2008]
• 100 trials.
[Figure: data-generating graph with observed variables x1 and x2, external influences e1 and e2, and latent confounders f3, f4 and f5; edge labels 0.3, 0.8, 0.7, -1, 0.8, 0.9 and 0.6.]
(This graph is .)
• The distributions of the latent variables ($e_1$, $e_2$, $f_3$, $f_4$ and $f_5$) are randomly selected from the following three non-Gaussian distributions:
– Laplace distribution
– Mixture of two Gaussian distributions (symmetric)
– Mixture of two Gaussian distributions (asymmetric)
Simulation Settings (2)
• Two methods for comparison:
– Pairwise likelihood ratios for estimation of non-Gaussian SEMs [Hyvärinen et al., 2013]
→ Assumes no latent confounders.
– PairwiseLvLiNGAM [Entner and Hoyer, 2011]
→ Finds variable pairs that are not affected by latent confounders and then estimates the causal ordering of one relative to the other.
Simulation Results
• Our method is the most robust to the presence of latent confounders.
[Figure: three panels labeled by the true graph, including ( → ) and ( ← ). x-axis: sample size (50, 100, 200); y-axis: the number of correct answers (0 to 100); legend: our method, pairwise measure, PairwiseLvLiNGAM (number of outputs).]
PairwiseLvLiNGAM: correct answers / number of outputs
Sample size   Panel 1   Panel 2   Panel 3
50            64/64     6/12      6/16
100           52/52     7/20      5/24
200           42/42     0/14      2/14
• "(Number of outputs)" is the number of estimates output by PairwiseLvLiNGAM.
– For the details,
Conclusions and Future Work
• A challenging problem: estimation of causal direction in the presence of latent confounders.
– Latent confounders violate the assumption of LiNGAM and can bias the estimation results.
• Proposed a Bayesian LiNGAM mixture approach.
– Capable of finding causal direction in the presence of latent confounders.
– Computationally simpler: no iterative estimation in the parameter space.
• In this simulation, our method performed better than the two existing methods.
• Future work
– Test our method on a wide variety of real datasets.
Histograms of
[Figure: nine histograms, for G1, G2 and G3 at sample sizes 50, 100 and 200. x-axis: 1 to 7 (sample size 50), 1 to 9 (sample size 100), 1 to 10 (sample size 200); y-axis: 0 to 25.]
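Density of a Transformation [Hyvärinen et al., 2001]
• $p_x$ is the density of $\mathbf{x}$ and $p_e$ is the density of $\mathbf{e}$.
– The data are i.i.d., so their joint density is the product of the densities of the individual observations.
• We can rewrite LiNGAM in matrix form: $\mathbf{x} = \mathbf{B}\mathbf{x} + \mathbf{e}$.
• $\mathbf{B}$ can be permuted by simultaneous equal row and column permutations to be strictly lower triangular due to the acyclicity assumption. [Bollen, 1989]
→ $\mathbf{I} - \mathbf{B}$ is then lower triangular with all diagonal elements equal to 1.
• The determinant of a lower triangular matrix equals the product of its diagonal elements.
→ $\det(\mathbf{I} - \mathbf{B}) = 1$, so $p_x(\mathbf{x}) = p_e\bigl((\mathbf{I} - \mathbf{B})\mathbf{x}\bigr)$.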
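The determinant argument can be checked numerically on the three-variable LiNGAM example used earlier in the deck. The sketch below assumes the variables are ordered causally as (x3, x1, x2) so that B is strictly lower triangular, and assumes unit-scale Laplace external influences for the density; the helper names are hypothetical.

```python
import math

# Variables ordered causally as (x3, x1, x2); B[i][j] is the effect of j on i.
B = [[0.0, 0.0, 0.0],
     [1.4, 0.0, 0.0],
     [0.5, -0.8, 0.0]]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def residuals(x, b):
    """e = (I - B) x, that is e_i = x_i - sum_j b_ij x_j."""
    n = len(x)
    return [x[i] - sum(b[i][j] * x[j] for j in range(n)) for i in range(n)]


def laplace_logpdf(e):
    """Log density of a unit-scale Laplace variable (assumed noise law)."""
    return -math.log(2.0) - abs(e)


def log_density(x, b):
    """log p_x(x) = sum_i log p_e(e_i) + log|det(I - B)|; the last term is 0 here."""
    return sum(laplace_logpdf(e) for e in residuals(x, b))


I_minus_B = [[(1.0 if i == j else 0.0) - B[i][j] for j in range(3)]
             for i in range(3)]
print("det(I - B) =", det3(I_minus_B))
```

Because the Jacobian term vanishes, the log-density of an observation is just the sum of the noise log-densities evaluated at the residuals, which is what makes the likelihood cheap to evaluate.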
Gaussian vs. Non-Gaussian
[Figure: scatter plots of x1 against x2 in a 2 × 2 grid: Gaussian and non-Gaussian (uniform) external influences, each under ( → ) and ( ← ).]