naoki tanaka , shohei shimizu, takashi washio

Estimation of Causal Direction in the Presence of Latent Confounders Using a Bayesian LiNGAM Mixture Model

Naoki Tanaka, Shohei Shimizu, Takashi Washio

The Institute of Scientific and Industrial Research, Osaka University

Outline

1. Motivation
2. Background
3. Our Approach
4. Our Model: Bayesian LiNGAM Mixture
5. Simulation Experiments

Motivation

• Recently, estimation of causal structure has attracted much attention in machine learning.
– Epidemiology
– Genetics

[Figure: two observed variables x_1 and x_2 (sleep problems, depression mood) and a latent confounder f; the question is which observed variable is the cause.]

• The estimation results can be biased if there are latent confounders.

→ Latent confounders are unobserved variables that have more than one observed child variable.

• We propose a new estimation approach that can solve the problem.


LiNGAM (Linear Non-Gaussian Acyclic Model) [Shimizu et al., 2006]

• The relations between variables are linear.
• Observed variables are generated from a DAG (Directed Acyclic Graph).

x_1 = 1.4 x_3 + e_1
x_2 = -0.8 x_1 + 0.5 x_3 + e_2
x_3 = e_3

• External influences e_i are non-Gaussian.
• No latent confounders.
→ e_i are mutually independent.
• LiNGAM is an identifiable causal model.

[Figure: causal graph over x_1, x_2 and x_3 with edge weights 0.5, 1.4 and -0.8 and external influences e_1, e_2 and e_3.]
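A minimal sketch of how data could be drawn from this slide's three-variable example. The coefficients are the ones read from the slide (x3 = e3, x1 = 1.4 x3 + e1, x2 = -0.8 x1 + 0.5 x3 + e2). The Laplace noise and the names `laplace` and `sample_lingam` are assumptions made for illustration; the model only requires the external influences to be non-Gaussian and independent.

```python
import random

def laplace(rng, scale=1.0):
    """Zero-mean Laplace draw: the difference of two Exp(1) variables."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def sample_lingam(n, seed=0):
    """Draw n observations (x1, x2, x3), generated in causal order x3, x1, x2."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        e1, e2, e3 = laplace(rng), laplace(rng), laplace(rng)
        x3 = e3                          # x3 has no parents
        x1 = 1.4 * x3 + e1               # x3 -> x1
        x2 = -0.8 * x1 + 0.5 * x3 + e2   # x1 -> x2 and x3 -> x2
        rows.append((x1, x2, x3))
    return rows

data = sample_lingam(1000)
```

Generating the variables in causal order is possible because the graph is acyclic, which is the property LiNGAM relies on.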

A Problem of LiNGAM

• Latent confounders make the external influences e_i dependent.
→ The estimation results can be biased.

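With x_3 = f latent, the model

x_1 = 1.4 x_3 + e_1
x_2 = -0.8 x_1 + 0.5 x_3 + e_2

becomes, in terms of the observed variables only,

x_1 = e_1'
x_2 = -0.8 x_1 + e_2'

where e_1' and e_2' are dependent.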

[Figure: example with Medicine A and survival rate, shown separately for patients whose condition is mild and patients whose condition is serious.]
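The dependence can be checked numerically. The sketch below assumes the slide's example coefficients (1.4 and 0.5 for the confounder's effects on x1 and x2) and Laplace-distributed f, e1 and e2 with equal variances; these distributional choices and the helper names are illustrative assumptions. When f is unobserved, the residual terms become e1' = 1.4 f + e1 and e2' = 0.5 f + e2, which share f and are therefore correlated (about 0.36 in theory under these assumptions).

```python
import math
import random

def laplace(rng):
    """Zero-mean Laplace draw: the difference of two Exp(1) variables."""
    return rng.expovariate(1.0) - rng.expovariate(1.0)

def correlation(a, b):
    """Sample Pearson correlation of two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

rng = random.Random(0)
e1_prime, e2_prime = [], []
for _ in range(20000):
    f, e1, e2 = laplace(rng), laplace(rng), laplace(rng)
    e1_prime.append(1.4 * f + e1)   # x1 = e1'
    e2_prime.append(0.5 * f + e2)   # x2 = -0.8 * x1 + e2'

corr = correlation(e1_prime, e2_prime)
print(round(corr, 2))
```

A clearly non-zero correlation violates the independence assumption of basic LiNGAM, which is why its estimate can be biased here.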

LiNGAM with Latent Confounders [Hoyer et al., 2008]

• LvLiNGAM (Latent variable LiNGAM)
λ_id: Represents the effect of f_d on x_i
f_d: Latent variables
・ Independent
・ Non-Gaussian

A Problem in Estimation of LiNGAM with Latent Confounders

• Existing methods:
– An estimation method using overcomplete ICA [Hoyer et al., 2008].
→ Suffers from local optima and requires large sample sizes.
– Methods that estimate unconfounded causal relations [Entner and Hoyer, 2011; Tashiro et al., 2012].
→ Cannot estimate the causal direction between two observed variables that are affected by latent confounders.
• We propose an alternative.
– Computationally simpler.
– Capable of finding a causal direction in the presence of latent confounders.


Basic Idea of Our Approach

• Assumption
– Continuous latent confounders can be approximated by discrete variables.
→ LiNGAM with latent confounders reduces to a LiNGAM mixture model [Shimizu et al., 2008].
• Estimation
– Estimation of LiNGAM mixture [Mollah et al., 2006].
  • Also suffers from local optima.
– We propose to use a Bayesian approach.
  • Bayesian approach for basic LiNGAM [Hoyer et al., 2009].

LiNGAM Mixture Model [Shimizu et al., 2008]

• A data generating model of the observed variables x within class c is, in matrix form,

x = B^{(c)} x + (I - B^{(c)}) \mu^{(c)} + e^{(c)}

• Existing estimation methods for the LiNGAM mixture model also suffer from local optima [Mollah et al., 2006].

[Figure: scatter plot of data from a two-class LiNGAM mixture. In both Class 1 and Class 2 the coefficient between x_1 and x_2 is 0.8; the class means shown are 0, 0 and 7, 6.]
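A minimal sketch of sampling from a two-class LiNGAM mixture of this kind, written row by row from x = B x + (I - B) mu + e. It assumes the direction x1 -> x2 with coefficient 0.8 in both classes, class means (0, 0) and (7, 6), equal class weights and Laplace noise; the assignment of the means to classes, the weights, the noise and the names `MEANS`, `B21` and `sample_mixture` are illustrative assumptions.

```python
import random

def laplace(rng):
    """Zero-mean Laplace draw: the difference of two Exp(1) variables."""
    return rng.expovariate(1.0) - rng.expovariate(1.0)

# Assumed class means (mu1, mu2) and a shared coefficient for x1 -> x2.
MEANS = {1: (0.0, 0.0), 2: (7.0, 6.0)}
B21 = 0.8

def sample_mixture(n, weights=(0.5, 0.5), seed=0):
    """Return n rows (class, x1, x2) drawn from the two-class mixture."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = rng.choices([1, 2], weights=weights)[0]
        mu1, mu2 = MEANS[c]
        x1 = mu1 + laplace(rng)                     # first row of the matrix form
        x2 = B21 * (x1 - mu1) + mu2 + laplace(rng)  # second row of the matrix form
        rows.append((c, x1, x2))
    return rows

rows = sample_mixture(4000)
```

Within each class the data follow an ordinary LiNGAM around that class's means; only the means differ between the two clusters.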

Relation of Latent Variable LiNGAM and LiNGAM Mixture (1)

• We assume that continuous latent confounders can be approximated with good precision by discrete variables having several values.
– The combination of the discrete values determines which "class" an observation belongs to.
→ The external influences e_i within the same class are mutually independent.
→ It is simpler than incorporating latent confounders in LiNGAM directly.

LiNGAM with a latent confounder f:

x_1 = 1.4 f + e_1
x_2 = -0.8 x_1 + 0.5 f + e_2

LiNGAM mixture (within class c, e_1 and e_2 are independent):

x_1 = \mu_1^{(c)} + e_1
x_2 = -0.8 x_1 + \mu_2^{(c)} + e_2

Relation of Latent Variable LiNGAM and LiNGAM Mixture (2)

• A simple example
– If latent confounders f_3 and f_4 can be approximated by 0 and 1 …

Latent Variable LiNGAM:

x_i = \sum_{k(j)<k(i)} b_{ij}^{(c)} x_j + \sum_d \lambda_{id} f_d + e_i^{(c)}

reduces to

LiNGAM Mixture:

x_i = \sum_{k(j)<k(i)} b_{ij}^{(c)} (x_j - \mu_j^{(c)}) + e_i^{(c)} + \mu_i^{(c)}

[Figure: graph over x_1 and x_2 with latent confounders f_3 and f_4; edge weights shown: 0.3, 0.7, 0.9, 0.6. The animation builds one class per combination of f_3 and f_4.]

Class   f_3   f_4   \mu_1^{(c)}   \mu_2^{(c)}
1       0     0     0             0
2       0     1     0.9           0.6
3       1     0     0.3           0.7
4       1     1     1.2           1.3
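The class means in this example can be reproduced by enumerating the combinations of the two binary confounders. In the sketch below the loadings (0.3 and 0.9 on x1, 0.7 and 0.6 on x2) are inferred from the class means listed on the slide, and the names `LOADINGS` and `class_means` are hypothetical.

```python
from itertools import product

# Loadings of the binary confounders (f3, f4) on each observed variable,
# inferred from the class means listed on the slide.
LOADINGS = {"x1": (0.3, 0.9), "x2": (0.7, 0.6)}

def class_means():
    """Map each class number to its (f3, f4) combination and its means."""
    table = {}
    for c, (f3, f4) in enumerate(product((0, 1), repeat=2), start=1):
        mu = {name: l3 * f3 + l4 * f4 for name, (l3, l4) in LOADINGS.items()}
        table[c] = ((f3, f4), mu)
    return table

for c, (f, mu) in class_means().items():
    print(c, f, round(mu["x1"], 2), round(mu["x2"], 2))
```

With D binary confounders this enumeration yields 2^D classes, which is why the number of classes in the mixture grows with the number of confounders being approximated.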


Bayesian LiNGAM Mixture Model (1)

• The data within class c are assumed to be generated by the LiNGAM model.
→ The connection strengths B and the densities of the external influences e_i have no relation to the latent confounders f, so they do not differ between classes.

x_1 = 1.4 f + e_1
x_2 = -0.8 x_1 + 0.5 f + e_2

(Although f changes, the densities of e_1 and e_2 do not change and the coefficients do not change.)

• B and e are the same between classes, so we replace B^{(c)} and e^{(c)} of the LiNGAM mixture model by B and e:

x = B x + (I - B) \mu^{(c)} + e

• Then their probability density is
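One way to evaluate a within-class density of this form is sketched below, assuming a model x = B x + (I - B) mu + e with B strictly lower triangular in the causal order and shared across classes, and mu specific to the class. Because det(I - B) = 1 in that case, the density of x is the product of the densities of the residuals e_i. The unit-variance Laplace density is a stand-in chosen for brevity (the slides use a generalized Gaussian), and the function names are hypothetical.

```python
import math

def log_laplace_unit_var(e):
    """Log density of a zero-mean Laplace variable with variance 1."""
    scale = 1.0 / math.sqrt(2.0)
    return -math.log(2.0 * scale) - abs(e) / scale

def log_density_in_class(x, B, mu):
    """log p(x | class) for x = B x + (I - B) mu + e, B strictly lower triangular.

    The residuals are e_i = (x_i - mu_i) - sum_j B[i][j] * (x_j - mu_j), and
    det(I - B) = 1, so log p(x | class) is the sum of their log densities.
    """
    total = 0.0
    for i in range(len(x)):
        e_i = (x[i] - mu[i]) - sum(B[i][j] * (x[j] - mu[j]) for j in range(i))
        total += log_laplace_unit_var(e_i)
    return total

# Illustrative values: x1 -> x2 with coefficient 0.8 and class means (7, 6).
B = [[0.0, 0.0], [0.8, 0.0]]
value = log_density_in_class([7.0, 6.0], B, [7.0, 6.0])
```

At the class mean both residuals are zero, so the value is twice the log of the Laplace density at zero.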

Bayesian LiNGAM Mixture Model (2)

• The probability density of the data within each class is mixed according to some weights.
• Class c: multinomial distribution.
• The parameters of the multinomial distribution: Dirichlet distribution
– A typical prior for the parameters of the multinomial distribution.
– Conjugate prior for the multinomial distribution.
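A small sketch of this prior structure: mixing weights are drawn from a Dirichlet distribution and class labels are then drawn from the resulting multinomial. The symmetric concentration parameters (all 1), the choice of three classes and the name `sample_dirichlet` are assumptions for illustration; the slides do not state the values used.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw mixing weights from Dirichlet(alpha) by normalising Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]

rng = random.Random(0)
weights = sample_dirichlet([1.0, 1.0, 1.0], rng)   # symmetric prior, an assumption
classes = rng.choices(range(1, 4), weights=weights, k=10)
print(weights, classes)
```

Conjugacy means that after observing class counts n_c the posterior over the weights is again Dirichlet, with parameters alpha_c + n_c.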

Compare Three LiNGAM Mixture Models

• Select the model with the largest log-marginal likelihood.
• There are only three models (G_1, G_2 and G_3) between two observed variables because of the assumption of acyclicity.

[Figure: the three candidate models G_1, G_2 and G_3 over x_1 and x_2, each drawn with a class variable.]

Log-marginal Likelihood of Our Model

• Bayes' theorem
• Log-marginal likelihood is calculated as follows: the integrand is the product of the LiNGAM-mixture likelihood and the prior distribution.
• We use Monte Carlo integration to compute the integral.
• The assumption of i.i.d. data
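Monte Carlo integration of a marginal likelihood can be sketched as averaging the likelihood over draws from the prior. The example below is a generic illustration, not the authors' implementation: the toy model (one Gaussian observation with a Gaussian prior on its mean) is chosen only because its exact marginal is known, and the function names are hypothetical.

```python
import math
import random

def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def log_marginal_likelihood(log_lik, sample_prior, n_samples, rng):
    """Estimate log of the integral of p(D | theta) p(theta) using prior draws."""
    return log_mean_exp([log_lik(sample_prior(rng)) for _ in range(n_samples)])

# Toy check (not the LiNGAM mixture): x ~ N(theta, 1) with theta ~ N(0, 1),
# whose exact marginal is N(0, 2).
x = 0.5

def log_lik(theta):
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2

def sample_prior(rng):
    return rng.gauss(0.0, 1.0)

estimate = log_marginal_likelihood(log_lik, sample_prior, 20000, random.Random(0))
exact = -0.5 * math.log(2 * math.pi * 2.0) - x ** 2 / (2 * 2.0)
```

Averaging in log space with `log_mean_exp` avoids underflow when the likelihood of a whole dataset is a product of many small terms.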

Distribution of e_i

• e_i follows a generalized Gaussian distribution with zero mean.
→ Includes Gaussian, Laplace, continuous uniform and many non-Gaussian distributions.
– Γ is the Gamma function.
– Var(e_i) = 1
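A sketch of a zero-mean, unit-variance generalized Gaussian density in the standard shape and scale parameterization, p(e) = beta / (2 alpha Γ(1/beta)) exp(-(|e| / alpha)^beta). The slide's own parameterization is not recoverable from the text, so this form and the name `generalized_gaussian_pdf` are assumptions. The scale alpha is fixed by requiring Var(e) = alpha^2 Γ(3/beta) / Γ(1/beta) = 1.

```python
import math

def generalized_gaussian_pdf(e, beta):
    """Zero-mean, unit-variance generalized Gaussian density with shape beta > 0."""
    # Scale alpha chosen so that alpha^2 * Gamma(3/beta) / Gamma(1/beta) = 1.
    alpha = math.sqrt(math.gamma(1.0 / beta) / math.gamma(3.0 / beta))
    norm = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return norm * math.exp(-((abs(e) / alpha) ** beta))

gauss_at_zero = generalized_gaussian_pdf(0.0, 2.0)     # beta = 2: standard Gaussian
laplace_at_zero = generalized_gaussian_pdf(0.0, 1.0)   # beta = 1: unit-variance Laplace
```

Shape beta = 2 gives the Gaussian, beta = 1 the Laplace, and large beta approaches the continuous uniform distribution.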

Prior Distributions and the Number of Classes

• Prior distribution
– […] and […]
– […], […] and […]
– […] can be calculated by using the equation of […].

Inv-Gamma(3,3)

• How to select the number of classes.
– Note that a 'true' number of classes does not exist.

① Selects the best model. (letter in red)
② Selects the best number of classes. (painted in orange)

1     …     …
0.6   0.1   0.3
0.3   0.8   0.5
0.1   0.1   0.2

In a Dirichlet process mixture model,
[Antoniak, 1974]


Simulation Settings (1)

• Generated data using a LiNGAM with latent confounders [Hoyer et al., 2008].
• 100 trials.

[Figure: data-generating graph with observed variables x_1 and x_2, external influences e_1 and e_2, and latent confounders f_3, f_4 and f_5; edge weights shown: 0.3, 0.8, 0.7, -1, 0.8, 0.9, 0.6. (This graph is […].)]

• The distributions of the latent variables (e_1, e_2, f_3, f_4 and f_5) are randomly selected from the following three non-Gaussian distributions:
– Laplace distribution
– Mixture of two Gaussian distributions (symmetric)
– Mixture of two Gaussian distributions (asymmetric)

Simulation Settings (2)

• Two methods for comparison:
– Pairwise likelihood ratios for estimation of non-Gaussian SEMs [Hyvärinen et al., 2013]
→ Assumes no latent confounders.
– PairwiseLvLiNGAM [Entner and Hoyer, 2011]
→ Finds variable pairs that are not affected by latent confounders and then estimates a causal ordering of one to the other.

Simulation Results

• Our method is most robust against existing latent confounders.

[Figure: three panels, one per true model, each plotting the number of correct answers (0 to 100) against sample size (50, 100, 200) for our method, the pairwise measure and PairwiseLvLiNGAM.]

PairwiseLvLiNGAM: correct answers / number of outputs (one column per panel)

Sample size
50     64/64   6/12   6/16
100    52/52   7/20   5/24
200    42/42   0/14   2/14

• "(Number of outputs)" is the number of estimates output by PairwiseLvLiNGAM.
– For the details,

Conclusions and Future Work

• A challenging problem: estimation of causal direction in the presence of latent confounders.
– Latent confounders violate the assumption of LiNGAM and can bias the estimation results.
• Proposed a Bayesian LiNGAM mixture approach.
– Capable of finding causal direction in the presence of latent confounders.
– Computationally simpler: no iterative estimation in the parameter space.
• In this simulation, our method was better than two existing methods.
• Future work
– Test our method on a wide variety of real datasets.

Histograms of

[Figure: nine histogram panels, for G1, G2 and G3 at sample sizes 50, 100 and 200. Horizontal axis: 1 to 7 (sample size 50), 1 to 9 (sample size 100), 1 to 10 (sample size 200). Vertical axis: 0 to 25.]

Density of a Transformation [Hyvärinen et al., 2001]

• e.g. […]
• […] is the density of […] and […] is the density of […].
– […] is i.i.d. data, so […]. Similarly, […]
• We can rewrite LiNGAM in a matrix form: x = B x + e.
• B could be permuted by simultaneous equal row and column permutations to be strictly lower triangular due to the acyclicity assumption [Bollen, 1989].
→ I - B is lower triangular with all diagonal elements equal to 1.
• The determinant of a lower triangular matrix equals the product of its diagonal elements.
→ det(I - B) = 1
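The determinant argument can be checked on a small example. The sketch below uses the three-variable coefficients from the earlier LiNGAM slide, arranged in the causal order (x3, x1, x2) so that B is strictly lower triangular; the helper `determinant` is a plain cofactor expansion written for illustration.

```python
def determinant(m):
    """Determinant by cofactor expansion along the first row (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for col, value in enumerate(m[0]):
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        total += (-1) ** col * value * determinant(minor)
    return total

# Causal order (x3, x1, x2): x1 = 1.4 x3 + e1, x2 = 0.5 x3 - 0.8 x1 + e2.
B = [[0.0, 0.0, 0.0],
     [1.4, 0.0, 0.0],
     [0.5, -0.8, 0.0]]
I_minus_B = [[(1.0 if i == j else 0.0) - B[i][j] for j in range(3)] for i in range(3)]
det = determinant(I_minus_B)
```

Since the Jacobian factor is 1, the density of x reduces to the product of the densities of the external influences evaluated at the residuals.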

Gaussian vs. Non-Gaussian

[Figure: scatter plots of x_1 against x_2 for the two causal directions (→ and ←), in the Gaussian case and in the non-Gaussian (uniform) case.]
