bayesian scoring functions for bayesian belief networks

BAYESIAN SCORING FUNCTIONS FOR BAYESIAN BELIEF NETWORKS (BBNS)

Jee Vang, [email protected]

Version 3.1. This work is licensed under a Creative Commons Attribution 3.0 Unported License.


http://creativecommons.org/licenses/by/3.0/deed.en_US


PURPOSE AND OUTLINE

Purpose: concisely illustrate how some Bayesian scoring functions have been established to score Bayesian belief networks (BBNs)

- Define a BBN
- Cover what Bayesian scoring functions are based on
  - Basic mathematical functions (factorial, gamma, and Beta functions)
  - Probability distributions (multinomial, Dirichlet, Dirichlet-multinomial)
  - Bayes’ Theorem
  - Assumptions
- Give a few Bayesian scoring function examples (BD, K2, BDe, BDeu)



DEFINITION OF A BBN

A BBN is defined as a pair (G, P), where G and P are defined as follows.

G is a pair, G = (V, E)
- V = {V1, …, Vn} is a set of vertices with a one-to-one correspondence with a set of random variables X = {X1, …, Xn} (sometimes, Vi and Xi are used interchangeably)
- E is a set of directed edges; Eij denotes Vi → Vj or, equivalently, Xi → Xj
- G is a special type of graph, called an acyclic directed graph (ADG): there is no path starting with any Vi and leading back to itself in the direction of the arrows
- For any Xi → Xj, Xi is said to be a parent of Xj
- All the parents of Xi are denoted pa(Xi)
- G is called the structure of a BBN

P is a joint probability distribution over X
- Chain rule: $P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{pa}(X_i))$
- P satisfies the Markov condition, which states that a variable is conditionally independent of its non-descendants given its parents
- P is called the parameters of a BBN
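As a rough illustration of the pair (G, P) and the chain rule, the sketch below stores a structure as a parent map and the parameters as conditional probability tables. The three variables (`Rain`, `Sprinkler`, `WetGrass`) and all probability values are hypothetical and are not taken from the slides.

```python
from itertools import product

# Hypothetical structure G: Rain -> WetGrass <- Sprinkler.
parents = {"Rain": (), "Sprinkler": (), "WetGrass": ("Rain", "Sprinkler")}

# Hypothetical parameters P: cpt[X][parent values][x] = P(X = x | pa(X)).
cpt = {
    "Rain": {(): {True: 0.2, False: 0.8}},
    "Sprinkler": {(): {True: 0.1, False: 0.9}},
    "WetGrass": {
        (True, True): {True: 0.99, False: 0.01},
        (True, False): {True: 0.9, False: 0.1},
        (False, True): {True: 0.8, False: 0.2},
        (False, False): {True: 0.0, False: 1.0},
    },
}

def joint_probability(assignment):
    """Chain rule: multiply P(x_i | pa(x_i)) over all variables."""
    p = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p *= cpt[var][pa_values][assignment[var]]
    return p

# Summing the joint over every assignment should give 1.
names = list(parents)
total = sum(
    joint_probability(dict(zip(names, values)))
    for values in product([True, False], repeat=len(names))
)
```

Storing G as a parent map keeps the factorization explicit: each variable contributes exactly one factor, looked up by the values of its parents.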



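BASIC MATHEMATICAL FUNCTIONS

The factorial function for a positive integer, $n$, is defined as follows.

$$n! = \prod_{i=1}^{n} i = n \times (n-1) \times \cdots \times 1$$

The gamma function generalizes the factorial function, and for a positive integer, $n$, is defined as follows.

$$\Gamma(n) = (n-1)!$$

The Beta function for a set of k positive integers, $\alpha = \{\alpha_1, \ldots, \alpha_k\}$, is defined as follows.

$$B(\alpha) = \frac{\prod_{i=1}^{k} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}$$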
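These three functions can be checked numerically with Python's standard `math` module. The helper names `log_beta` and `beta` are hypothetical; the sketch works in log space with `math.lgamma`, since gamma values overflow quickly for the counts that appear in scoring functions.

```python
import math

def log_beta(xs):
    """log B(x_1..x_k) = sum_i log Gamma(x_i) - log Gamma(sum_i x_i)."""
    return sum(math.lgamma(x) for x in xs) - math.lgamma(sum(xs))

def beta(xs):
    """Multivariate Beta function, computed via its logarithm."""
    return math.exp(log_beta(xs))

# Gamma generalizes the factorial: Gamma(n) = (n - 1)! for positive integers.
gamma_matches_factorial = all(
    math.isclose(math.gamma(n), math.factorial(n - 1)) for n in range(1, 15)
)
```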



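PROBABILITY DISTRIBUTIONS: MULTINOMIAL AND DIRICHLET

The multinomial probability mass function (PMF) for a discrete random variable with $k$ values is defined as follows.

$$P(N \mid \theta) = \frac{\left(\sum_{i=1}^{k} N_i\right)!}{\prod_{i=1}^{k} N_i!} \prod_{i=1}^{k} \theta_i^{N_i}$$

The Dirichlet probability density function (PDF) for a continuous random variable $\theta$ is defined as follows, where $B(\alpha)$ is the Beta function.

$$P(\theta \mid \alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}$$

Note the following.
- $\theta = \{\theta_1, \ldots, \theta_k\}$ is a set of parameters
- $N = \{N_1, \ldots, N_k\}$ is a set of counts (frequencies)
- $\alpha = \{\alpha_1, \ldots, \alpha_k\}$ is a set of hyperparameters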
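A minimal sketch of the two distributions, assuming the standard textbook forms of the multinomial PMF and the Dirichlet PDF. The function names are hypothetical, and the Dirichlet sketch assumes every component of `theta` is strictly positive.

```python
import math

def multinomial_pmf(counts, theta):
    """P(N | theta) = (sum N_i)! / prod N_i! * prod theta_i^N_i."""
    n = sum(counts)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    # Skip zero counts so that theta_i = 0 with N_i = 0 contributes a factor of 1.
    log_prob = sum(c * math.log(t) for c, t in zip(counts, theta) if c > 0)
    return math.exp(log_coef + log_prob)

def dirichlet_pdf(theta, alpha):
    """P(theta | alpha) = prod theta_i^(alpha_i - 1) / B(alpha)."""
    log_b = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    log_kernel = sum((a - 1) * math.log(t) for a, t in zip(alpha, theta))
    return math.exp(log_kernel - log_b)
```

With all hyperparameters equal to 1 the Dirichlet density is flat, which is why that setting is often read as an uninformative prior over the parameters.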



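PROBABILITY DISTRIBUTIONS: DIRICHLET-MULTINOMIAL

The Dirichlet-multinomial PMF for a discrete random variable with $k$ values is defined as follows.

$$P(N, \theta \mid \alpha) = P(N \mid \theta)\, P(\theta \mid \alpha) = \frac{\left(\sum_{i=1}^{k} N_i\right)!}{\prod_{i=1}^{k} N_i!} \prod_{i=1}^{k} \theta_i^{N_i} \cdot \frac{1}{B(\alpha)} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}$$

The Dirichlet-multinomial PMF states that the underlying model generating the data is multinomial and that its parameters are Dirichlet distributed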



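PROBABILITY DISTRIBUTIONS: DIRICHLET-MULTINOMIAL (CONTINUED)

Integrating the Dirichlet-multinomial over $\theta$ gives the marginal joint probability of the data

$$P(N \mid \alpha) = \int P(N \mid \theta)\, P(\theta \mid \alpha)\, d\theta = \frac{\left(\sum_{i=1}^{k} N_i\right)!}{\prod_{i=1}^{k} N_i!} \frac{1}{B(\alpha)} \int \prod_{i=1}^{k} \theta_i^{N_i + \alpha_i - 1}\, d\theta$$

Note that $\prod_{i=1}^{k} \theta_i^{N_i + \alpha_i - 1}$ takes the form of the Dirichlet PDF with hyperparameters $\alpha + N$, up to the normalizing constant $1 / B(\alpha + N)$

Using substitution, we get the following

$$P(N \mid \alpha) = \frac{\left(\sum_{i=1}^{k} N_i\right)!}{\prod_{i=1}^{k} N_i!} \frac{B(\alpha + N)}{B(\alpha)} \int \frac{1}{B(\alpha + N)} \prod_{i=1}^{k} \theta_i^{N_i + \alpha_i - 1}\, d\theta = \frac{\left(\sum_{i=1}^{k} N_i\right)!}{\prod_{i=1}^{k} N_i!} \frac{B(\alpha + N)}{B(\alpha)}$$

Note $\int \frac{1}{B(\alpha + N)} \prod_{i=1}^{k} \theta_i^{N_i + \alpha_i - 1}\, d\theta = 1$, since this integration is over the Dirichlet PDF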




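Expand the Beta functions $B(\alpha + N)$ and $B(\alpha)$ in $P(N \mid \alpha) = \frac{\left(\sum_{i} N_i\right)!}{\prod_{i} N_i!} \frac{B(\alpha + N)}{B(\alpha)}$, where $\alpha_i$ are the hyperparameters and $N_i$ are the counts

$$P(N \mid \alpha) = \frac{\left(\sum_{i=1}^{k} N_i\right)!}{\prod_{i=1}^{k} N_i!} \cdot \frac{\prod_{i=1}^{k} \Gamma(\alpha_i + N_i)}{\Gamma\left(\sum_{i=1}^{k} (\alpha_i + N_i)\right)} \cdot \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}$$

Rearrange the terms

$$P(N \mid \alpha) = \frac{\left(\sum_{i=1}^{k} N_i\right)!}{\prod_{i=1}^{k} N_i!} \cdot \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\Gamma\left(\sum_{i=1}^{k} (\alpha_i + N_i)\right)} \prod_{i=1}^{k} \frac{\Gamma(\alpha_i + N_i)}{\Gamma(\alpha_i)}$$

Drop $\frac{\left(\sum_{i} N_i\right)!}{\prod_{i} N_i!}$ to get the following proportional relationship

$$P(N \mid \alpha) \propto \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\Gamma\left(\sum_{i=1}^{k} (\alpha_i + N_i)\right)} \prod_{i=1}^{k} \frac{\Gamma(\alpha_i + N_i)}{\Gamma(\alpha_i)}$$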
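The closed form obtained by integrating out the parameters can be sanity-checked numerically. The sketch below assumes the standard Dirichlet-multinomial marginal, a ratio of gamma functions optionally multiplied by the multinomial coefficient. The function name `log_marginal` and the example hyperparameters are hypothetical. Because the version with the coefficient is a proper PMF over count vectors, summing it over every count vector with a fixed total should give 1.

```python
import math

def log_marginal(counts, alpha, include_coefficient=True):
    """log P(N | alpha) for one discrete variable with len(counts) values."""
    a_sum, n = sum(alpha), sum(counts)
    value = math.lgamma(a_sum) - math.lgamma(a_sum + n)
    value += sum(math.lgamma(a + c) - math.lgamma(a) for a, c in zip(alpha, counts))
    if include_coefficient:
        # Multinomial coefficient; dropping it gives the proportional form.
        value += math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return value

# Hypothetical check for k = 2 values and 6 observations.
alpha = [0.5, 2.0]
n = 6
total = sum(math.exp(log_marginal([c, n - c], alpha)) for c in range(n + 1))
```

The coefficient depends only on the counts, not on the hyperparameters, so dropping it does not change how the remaining expression responds to the choice of hyperparameters.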




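We can extend to $n$ discrete random variables, $X_1, \ldots, X_n$, as follows

$$P(N \mid \alpha) \propto \prod_{i=1}^{n} \frac{\Gamma\left(\sum_{k=1}^{r_i} \alpha_{ik}\right)}{\Gamma\left(\sum_{k=1}^{r_i} (\alpha_{ik} + N_{ik})\right)} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ik} + N_{ik})}{\Gamma(\alpha_{ik})}$$

Note
- $N = \{N_1, \ldots, N_n\}$ is a vector of count vectors
- $N_i = \{N_{i1}, \ldots, N_{ir_i}\}$ is a count vector
- $\alpha = \{\alpha_1, \ldots, \alpha_n\}$ is a vector of hyperparameter vectors
- $\alpha_i = \{\alpha_{i1}, \ldots, \alpha_{ir_i}\}$ is a hyperparameter vector
- $r = \{r_1, \ldots, r_n\}$ is a vector of the number of values corresponding to each $X_i$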



BAYES’ THEOREM

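Use Bayes’ Theorem to compute the probability of the BBN structure, $G$, given the data, $D$, written as $P(G \mid D)$

$$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}$$

Drop out the prior and marginal likelihood terms because
- typically, $P(G)$ is assumed to have a uniform distribution
- $P(D)$ is a normalizing constant

Thus,

$$P(G \mid D) \propto P(D \mid G)$$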
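Under a uniform prior over structures, the proportional relationship means that ranking candidate structures by likelihood is the same as ranking them by posterior. The sketch below normalizes a few log-likelihood values into posterior probabilities. The structure labels and the log-likelihood numbers are hypothetical placeholders, not results from any data set.

```python
import math

def structure_posterior(log_likelihoods):
    """Turn log P(D | G) values into P(G | D), assuming a uniform prior P(G)."""
    m = max(log_likelihoods.values())  # subtract the max for numerical stability
    weights = {g: math.exp(v - m) for g, v in log_likelihoods.items()}
    z = sum(weights.values())          # plays the role of the normalizing constant
    return {g: w / z for g, w in weights.items()}

# Hypothetical log P(D | G) values for three candidate structures.
scores = {"G1": -1050.2, "G2": -1048.9, "G3": -1055.0}
posterior = structure_posterior(scores)
best = max(posterior, key=posterior.get)
```

In structure search the normalization is usually skipped, since only the ordering of the scores matters.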



ASSUMPTIONS

To compute $P(D \mid G)$, some assumptions are needed

The following assumptions have been reported
- The data is generated from multinomial distributions (multinomial sample),
- The parameters associated with each variable in a BBN are independent (parameter independence),
- If a variable has the same parents in two different networks, then the probability density functions of the parameters associated with this node are identical in both networks (parameter modularity),
- The parameters are distributed according to the Dirichlet distribution (Dirichlet),
- There is no missing data (complete data),
- Data should not help discriminate network structures that represent the same assertions of conditional independence (likelihood equivalence), and
- For any complete DAG, G, the probability of G, $P(G)$, is greater than zero (structure possibility)

Note that not all of these assumptions are needed at the same time
- In general, these assumptions are needed: multinomial sample, parameter independence, parameter modularity, complete data
- In general, some of these assumptions may be relaxed (complete data, multinomial sample, Dirichlet, etc.)



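BAYESIAN DIRICHLET (BD) SCORING FUNCTION

Define the following
- $X = \{X_1, \ldots, X_n\}$ is the set of n discrete random variables
- $\mathrm{pa}(X_i)$ is the set of parents of $X_i$ and $q_i$ is the number of unique instantiations (configurations) of $\mathrm{pa}(X_i)$
- $r_i$ is the number of values for $X_i$
- $N_{ijk}$ is the number of times $X_i = k$ and $\mathrm{pa}(X_i) = j$, and $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$
- $\alpha_{ijk}$ is the hyperparameter for $X_i = k$ and $\mathrm{pa}(X_i) = j$, and $\alpha_{ij} = \sum_{k=1}^{r_i} \alpha_{ijk}$

Then $P(D \mid G)$ may be defined as follows

$$P(D \mid G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$$

- This equation represents the Bayesian Dirichlet (BD) scoring function
- Note that this equation looks very similar to the Dirichlet-multinomial result for n discrete random variables, $P(N \mid \alpha) \propto \prod_{i=1}^{n} \frac{\Gamma\left(\sum_{k} \alpha_{ik}\right)}{\Gamma\left(\sum_{k} (\alpha_{ik} + N_{ik})\right)} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ik} + N_{ik})}{\Gamma(\alpha_{ik})}$
- $P(D \mid G)$ differs from $P(N \mid \alpha)$ by having an extra product term for the parents, $\prod_{j=1}^{q_i}$
- In fact, if $\mathrm{pa}(X_i) = \emptyset$ for every $i$ (no variable has any parents), then $q_i = 1$ and the two expressions coincide
- Hyperparameters $\alpha_{ijk}$ correspond to $\alpha_{ik}$ and counts $N_{ijk}$ correspond to $N_{ik}$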
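The BD scoring function can be sketched in log space as a triple loop over variables, parent configurations, and values. The version below assumes the standard BD form (a product of gamma-function ratios over i, j, and k) and complete discrete data. The function name, the argument layout, and the toy data set are all hypothetical.

```python
import math
from collections import Counter

def log_bd_score(data, parents, values, alpha):
    """Return log P(D | G) under the BD scoring function.

    data    : list of dicts, one complete record per row
    parents : {variable: tuple of parent variables}, i.e. the structure G
    values  : {variable: list of possible values}
    alpha   : callable (variable, j, k) -> hyperparameter alpha_ijk > 0
    """
    score = 0.0
    for x, pa in parents.items():
        # N_ijk keyed by (parent configuration j, value k).
        n_ijk = Counter((tuple(row[p] for p in pa), row[x]) for row in data)
        # Parent configurations never observed contribute log(1) = 0, so skip them.
        for j in dict.fromkeys(j for j, _ in n_ijk):
            a_ij = sum(alpha(x, j, k) for k in values[x])
            n_ij = sum(n_ijk[(j, k)] for k in values[x])
            score += math.lgamma(a_ij) - math.lgamma(a_ij + n_ij)
            for k in values[x]:
                a = alpha(x, j, k)
                score += math.lgamma(a + n_ijk[(j, k)]) - math.lgamma(a)
    return score

# Hypothetical toy data in which B always copies A.
data = [{"A": 0, "B": 0}, {"A": 0, "B": 0}, {"A": 1, "B": 1}, {"A": 1, "B": 1}]
values = {"A": [0, 1], "B": [0, 1]}
ones = lambda x, j, k: 1.0  # every hyperparameter set to 1
independent = log_bd_score(data, {"A": (), "B": ()}, values, ones)
dependent = log_bd_score(data, {"A": (), "B": ("A",)}, values, ones)
```

On this toy data the structure with the edge A → B scores higher than the structure with no edges, as one would hope when B is a copy of A.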



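SOME DIFFERENT BAYESIAN SCORING FUNCTIONS

Some different Bayesian scoring functions are variants of the BD scoring function
- Kutató (K2),
- Bayesian Dirichlet equivalent (BDe),
- Bayesian Dirichlet equivalent uniform (BDeu)

K2, BDe, and BDeu differ in the way the values for the hyperparameters, $\alpha_{ijk}$, are set (for value $k$ of variable $X_i$ under parent configuration $j$, where $X_i$ has $r_i$ values and $q_i$ parent configurations)
- K2: $\alpha_{ijk} = 1$
- BDe: $\alpha_{ijk} = N' \times P(X_i = k, \mathrm{pa}(X_i) = j \mid G)$
- BDeu: $\alpha_{ijk} = \frac{N'}{r_i q_i}$

For BDe and BDeu, $N'$ is called the equivalent sample size (ESS)

There is no widely accepted approach to setting the ESS

The maximum a posteriori (MAP) BBN structure is very sensitive to the ESS; choosing different values of the ESS may lead to different MAP structures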
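The hyperparameter settings for K2 and BDeu are simple enough to write as one-line functions. The sketch assumes the standard definitions: K2 sets every hyperparameter to 1, and BDeu divides the equivalent sample size evenly over the r_i × q_i cells of a variable's table. The function names and the example sizes are hypothetical.

```python
def k2_hyperparameter(r_i, q_i):
    """K2: every alpha_ijk is 1, regardless of r_i and q_i."""
    return 1.0

def bdeu_hyperparameter(r_i, q_i, ess):
    """BDeu: spread the equivalent sample size evenly over r_i * q_i cells."""
    return ess / (r_i * q_i)

# Hypothetical variable with 3 values and 4 parent configurations.
r_i, q_i = 3, 4
for ess in (1.0, 10.0):
    a = bdeu_hyperparameter(r_i, q_i, ess)
    # The hyperparameters of one variable always sum back to the ESS.
    print(ess, a, a * r_i * q_i)
```

Under BDeu a variable with many parent configurations gets a very small hyperparameter per cell, which is one way to see why the choice of ESS can change which structure scores best.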



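SOME DIFFERENT BAYESIAN SCORING FUNCTIONS (CONTINUED)

In full form, K2 and BDeu are defined as follows, where $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$ and $N'$ is the equivalent sample size

K2

$$P(D \mid G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

BDeu

$$P(D \mid G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma\left(\frac{N'}{q_i}\right)}{\Gamma\left(\frac{N'}{q_i} + N_{ij}\right)} \prod_{k=1}^{r_i} \frac{\Gamma\left(\frac{N'}{r_i q_i} + N_{ijk}\right)}{\Gamma\left(\frac{N'}{r_i q_i}\right)}$$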
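The two full forms can be compared on a single variable and a single parent configuration. The sketch assumes the standard K2 expression (factorials) and the standard BDeu expression (gamma functions with hyperparameters ESS / (r_i q_i)). The function names and counts are hypothetical. When the ESS equals r_i × q_i, every BDeu hyperparameter becomes 1 and the two terms should agree.

```python
import math

def log_k2_term(counts):
    """log[(r - 1)! / (N_ij + r - 1)! * prod_k N_ijk!] for one (i, j) pair."""
    r, n = len(counts), sum(counts)
    return (math.lgamma(r) - math.lgamma(n + r)
            + sum(math.lgamma(c + 1) for c in counts))

def log_bdeu_term(counts, q, ess):
    """BDeu term for one (i, j) pair, with q parent configurations in total."""
    r, n = len(counts), sum(counts)
    a_ij, a_ijk = ess / q, ess / (r * q)
    return (math.lgamma(a_ij) - math.lgamma(a_ij + n)
            + sum(math.lgamma(a_ijk + c) - math.lgamma(a_ijk) for c in counts))

# Hypothetical counts N_ijk for a binary variable under one parent configuration.
counts = [2, 0]
k2_value = math.exp(log_k2_term(counts))
bdeu_value = math.exp(log_bdeu_term(counts, q=1, ess=2.0))
```

A full score is the sum of such log terms over all variables and all parent configurations.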



SUMMARY

The Bayesian scoring functions discussed are based on some basic mathematical functions (factorial, gamma, and Beta), probability distributions (multinomial, Dirichlet, Dirichlet-multinomial), Bayes’ Theorem, and assumptions

The Dirichlet-multinomial compound distribution and the integration of this PMF over the parameters, $\theta$, are key to understanding how some Bayesian scoring functions have been established

Variations of some Bayesian scoring functions primarily deal with setting the values of hyperparameters
