
NIKHEF-06-01

Introduction to Bayesian Inference

M. Botje

NIKHEF, PO Box 41882, 1009DB Amsterdam, the Netherlands

June 21, 2006

[Cover figure: a surface plot over the (µ, σ) plane]
Abstract

This is the write-up of a NIKHEF topical lecture series on Bayesian inference. The topics covered are the definition of probability, elementary probability calculus and assignment, selection of least informative probabilities by the maximum entropy principle, parameter estimation, systematic error propagation, model selection and the stopping problem in counting experiments.


Contents

1 Introduction

2 Bayesian Probability
  2.1 Plausible inference
  2.2 Probability calculus
  2.3 Exhaustive and exclusive sets of hypotheses
  2.4 Continuous variables
  2.5 Bayesian versus Frequentist inference

3 Posterior Representation
  3.1 Mean, variance and covariance
  3.2 Transformations
  3.3 The covariance matrix

4 Basic Probability Assignment
  4.1 Bernoulli's urn
  4.2 Binomial distribution
  4.3 Multinomial distribution
  4.4 Poisson distribution
  4.5 Gauss distribution

5 Least Informative Probabilities
  5.1 Impact of prior knowledge
  5.2 Symmetry considerations
  5.3 Maximum entropy principle

6 Parameter Estimation
  6.1 Gaussian sampling
  6.2 Least squares
  6.3 Example: no signal
  6.4 Systematic error propagation
  6.5 Model selection

7 Counting
  7.1 Binomial counting
  7.2 The negative binomial

A Solutions to Selected Exercises


1 Introduction

The Frequentist and Bayesian approaches to statistics differ in the definition of probability. For a Frequentist, probability is the relative frequency of the occurrence of an event in a large set of repetitions of the experiment (or in a large ensemble of identical systems) and is, as such, a property of a so-called random variable. In Bayesian statistics, on the other hand, probability is not defined as a frequency of occurrence but as the plausibility that a proposition is true, given the available information. Probabilities are then—in the Bayesian view—not properties of random variables but a quantitative encoding of our state of knowledge about these variables. This view has far-reaching consequences when it comes to data analysis since Bayesians can assign probabilities to propositions, or hypotheses, while Frequentists cannot.

In these lectures we present the basic principles and techniques underlying Bayesian statistics or, rather, Bayesian inference. Such inference is the process of determining the plausibility of a conclusion, or a set of conclusions, which we draw from the available data and prior information.

Since we derive in this write-up (almost) everything from scratch, little reference is made to the literature. So let us start by giving below our own small library on the subject.

Perhaps one of the best review articles (from an astronomer's perspective) is T. Loredo, 'From Laplace to Supernova SN 1987A: Bayesian Inference in Astrophysics' [1]. Very illuminating case studies are presented in 'An Introduction to Parameter Estimation using Bayesian Probability Theory' [2] and 'An Introduction to Model Selection using Probability Theory as Logic' [3] by G.L. Bretthorst. A nice summary of Bayesian statistics from a particle physics perspective can be found in the article 'Bayesian Inference in Processing Experimental Data' by G. D'Agostini [4].

A good introduction to Bayesian methods is given in the book by Sivia, 'Data Analysis—a Bayesian Tutorial' [5]. More extensive, with many worked-out examples in Mathematica, is the book by P. Gregory (also an astronomer), 'Bayesian Logical Data Analysis for the Physical Sciences' [6].1 The ultimate reference, but certainly not for the fainthearted, is the monumental work by Jaynes, 'Probability Theory—The Logic of Science' [7]. Unfortunately Jaynes died before the book was finished so that it is incomplete. It is available in print (Cambridge University Press) but a free copy can still be found on the website given in [7]. For those who want to refresh their memory on Frequentist methods we recommend 'Statistical Data Analysis' by G. Cowan [8].

A rich collection of very interesting articles (including those cited above) can be found on the web site of T. Loredo, http://www.astro.cornell.edu/staff/loredo/bayes.

Finally there are of course these lecture notes which can be found, together with the lectures themselves, on

http://www.nikhef.nl/user/h24/bayes.

Exercise 1.1: The literature list above suggests that Bayesian methods are more popular in astronomy than in our field of particle physics. Can you give reasons why astronomers would be more inclined to turn Bayesian?

1The Mathematica notebooks can be downloaded, for free, from www.cambridge.org/052184150X.


2 Bayesian Probability

2.1 Plausible inference

In Aristotelian logic a proposition can be either true or false. In the following we will denote a proposition by a capital letter like A and represent 'true' or 'false' by the Boolean values 1 and 0, respectively. The operation of negation (denoted by Ā) turns a true proposition into a false one and vice versa.

Two propositions can be linked together to form a compound proposition. The state of such a compound proposition depends on the states of the two input propositions and on the way these are linked together. It is not difficult to see that there are exactly 16 different ways in which two propositions can be combined.2 All these have specific names and symbols in formal logic but here we will be concerned with only a few of these, namely, the tautology (⊤), the contradiction (⊥), the and (×),3 the or (+) and the implication (⇒). Below we give the truth tables of these binary operations

A  B   A ⊤ B   A ⊥ B   A × B   A + B   A ⇒ B
0  0     1       0       0       0       1
0  1     1       0       0       1       1
1  0     1       0       0       1       0
1  1     1       0       1       1       1

(2.1)

Note that the tautology is always true and the contradiction always false, independent of the value of the input propositions.

Exercise 2.1: Show that A × Ā is a contradiction and A + Ā a tautology.

Two important relations between the logical operations 'and' and 'or' are given by de Morgan's laws

$\overline{A \times B} = \overline{A} + \overline{B}$ and $\overline{A + B} = \overline{A} \times \overline{B}$. (2.2)

We note here a remarkable duality possessed by logical equations in that they can be transformed into other valid equations by interchanging the operations × and +.

Exercise 2.2: Prove (2.2). Hint: this is easiest done by verifying that the truth tables of the left and right-hand sides of the equations are the same. Once it is shown that the first equation in (2.2) is valid, then duality guarantees that the second equation is also valid.

The reasoning process by which conclusions are drawn from a set of input propositions is called inference. If there is enough input information we apply deductive inference which allows us to draw firm conclusions, that is, the conclusion can be shown to be either true or false. Mathematical proofs, for instance, are based on deductive inferences.

2The two input propositions define four possible input states which each give rise to two possible output states (true or false). These output states can thus be encoded as bits in a 4-bit word. Five out of the 16 possible output words are listed in the truth table (2.1).

3The conjunction A × B will often be written as the juxtaposition AB since it looks neat in long expressions, or as (A, B) since we are accustomed to that in mathematical notation.


If there is not enough input information we apply inductive inference which does not allow us to draw a firm conclusion. The difference between deductive and inductive reasoning can be illustrated by the following simple example:

P1: Roses are red
P2: This flower is a rose
⇒ This flower is red (deduction)

P1: Roses are red
P2: This flower is red
⇒ This flower is perhaps a rose (induction)

Induction thus leaves us in a state of uncertainty about our conclusion. However, the statement that the flower is red increases the probability that we are dealing with a rose, as can easily be seen from the fact that—provided all roses are red—the fraction of roses in the population of red flowers must be larger than that in the population of all flowers.

The process of deductive and inductive reasoning can formally be summarized in terms of the following two syllogisms4

                Deduction                      Induction
Proposition 1:  If A is true then B is true    If A is true then B is true
Proposition 2:  A is true                      B is true
Conclusion:     B is true                      A is more probable

The first proposition in both syllogisms can be recognized as the implication A ⇒ B for which the truth table is given in (2.1). It is straightforward to check from this truth table the validity of the conclusions derived above; in particular it is seen that A can be true or false in the second (inductive) syllogism.

Exercise 2.3: (i) Show that it follows from A ⇒ B that B̄ ⇒ Ā; (ii) What are the conclusions in the two syllogisms above if we replace the second proposition by, respectively, 'A is false' and 'B is false'?

In inductive reasoning then, we are in a state of uncertainty about the validity (true or false) of the conclusion we wish to draw. However, it makes sense to define a measure P(A|I) of the plausibility (or degree of belief) that a proposition A is true, given the information I. It may seem quite an arbitrary business to attempt to quantify something like a 'degree of belief' but this is not so.

Cox (1946) has, in a landmark paper [9], formulated the rules of plausible inference and plausibility calculus by basing them on several desiderata. These desiderata are so few that we can summarize them here: (i) boundedness—the degree of plausibility of a proposition A is a bounded quantity; (ii) transitivity—if A, B and C are propositions and the plausibility of C is larger than that of B which, in turn, is larger than that of A, then the plausibility of C must be larger than that of A; (iii) consistency—the plausibility of a proposition A depends only on the relevant information on A and not on the path of reasoning followed to arrive at A.

4A syllogism is a triplet of related propositions consisting of a major premise, a minor premise and a conclusion.


It turns out that these desiderata are so restrictive that they completely determine the algebra of plausibilities. To the surprise of many, this algebra appeared to be identical to that of classical probability as defined by the axioms of Kolmogorov (see below for these axioms). Plausibility is thus—for a Bayesian—identical to probability.5

2.2 Probability calculus

We will now derive several useful formulas starting from the fundamental axioms of probability calculus, taking the viewpoint of a Homo Bayesiensis when appropriate. First, as already mentioned above, P(A|I) is a bounded quantity; by convention P(A|I) = 1 (0) when we are certain that the proposition A is true (false). Next, the two Kolmogorov

axioms are the sum rule

P (A + B|I) = P (A|I) + P (B|I) − P (AB|I) (2.3)

and the product rule

P (AB|I) = P (A|BI)P (B|I). (2.4)

Let us, at this point, spell out the difference between AB ('A and B') and A|B ('A given B'): in AB, B can be true or false, while in A|B, B is assumed to be true and cannot be false. The following terminology is often used for the probabilities occurring in the product rule (2.4): P(AB|I) is called the joint probability, P(A|BI) the conditional probability and P(B|I) the marginal probability.

Probabilities can be represented in a Venn diagram by the (normalized) areas of the subsets A and B of a given set I. The sum rule is then trivially understood from the following diagram

[Venn diagram: a set I containing two overlapping subsets A and B]

while the product rule can be seen to give the relation between different normalizations of the area AB: P(AB|I) corresponds to the area AB normalized to I, P(A|BI) corresponds to AB normalized to B and P(B|I) corresponds to B normalized to I.

Because A + Ā is a tautology (always true) and AĀ a contradiction (always false) we find from (2.3)

P(A|I) + P(Ā|I) = 1 (2.5)

which is sometimes taken as an axiom instead of (2.3).

Exercise 2.4: Derive the sum rule (2.3) from the axioms (2.4) and (2.5).

5The intimate connection between probability and logic is reflected in the title of Jaynes' book: 'Probability Theory—The Logic of Science'.


If A and B are mutually exclusive propositions (they cannot both be true) then, because AB is a contradiction, P(AB|I) = 0 and (2.3) becomes

P (A + B|I) = P (A|I) + P (B|I) (A and B exclusive). (2.6)

If A and B are independent (the knowledge of B does not give us information on A and vice versa),6 then P(A|BI) = P(A|I) and (2.4) becomes

P (AB|I) = P (A|I)P (B|I) (A and B independent). (2.7)

Because AB = BA we see from (2.4) that P(A|BI)P(B|I) = P(B|AI)P(A|I). From this we obtain the rule for conditional probability inversion, also known as Bayes' theorem:

P(H|DI) = P(D|HI) P(H|I) / P(D|I). (2.8)

In (2.8) we replaced A and B by D and H to indicate that in the following these propositions will refer to 'data' and 'hypothesis', respectively. Ignoring for the moment the normalization constant P(D|I) in the denominator, Bayes' theorem tells us the following: the probability P(H|DI) of a hypothesis, given the data, is equal to the probability P(H|I) of the hypothesis, given the background information alone (that is, without considering the data), multiplied by the probability P(D|HI) that the hypothesis, when true, just yields that data. In Bayesian parlance P(H|DI) is called the posterior probability, P(D|HI) the likelihood, P(H|I) the prior probability and P(D|I) the evidence.

Bayes' theorem describes a learning process in the sense that it specifies how to update the knowledge on H when new information D becomes available.

We remark that Bayes’ theorem is valid in both the Bayesian and Frequentist worldsbecause it follows directly from axiom (2.4) of probability calculus. What differs isinterpretation of probability: for a Bayesian, probability is a measure of plausibility sothat it makes perfect sense to convert P (D|HI) into P (H|DI) for data D and hypothesisH . For a Frequentist however, probabilities are properties of random variables and,although it makes sense to talk about P (D|HI), it does not makes sense to talk aboutP (H|DI) because a hypothesis H is a proposition and not a random variable. SeeSection 2.5 for more on Bayesian versus Frequentist.

Many people are not always fully aware of the consequences of probability inversion. Tosee this, consider the case of Mr. White who goes to a doctor for an AIDS test. Thistest is known to be 100% efficient (the test never fails to detect AIDS). A few days laterpoor Mr. White learns that he is positive. Does this mean that he has AIDS?

Most people (including, perhaps, Mr. White himself and his doctor) would say 'yes' because they fail to realize that

P(AIDS|positive) ≠ P(positive|AIDS).

6We are talking here about a logical dependence which could be defined as follows: A and B are logically dependent when learning about A implies that we also will learn something about B. Note that logical dependence does not necessarily imply causal dependence. Causal dependence does, on the other hand, always imply logical dependence.


That inverted probabilities are, in general, not equal should be even more clear from the following example:

P(rain|clouds) ≠ P(clouds|rain).

Right?

In the next section we will learn how to deal with Mr. White's test (and with the opinion of his doctor).

2.3 Exhaustive and exclusive sets of hypotheses

Let us now consider the important case that H can be decomposed into an exhaustive set of mutually exclusive hypotheses {Hi}, that is, into a set of which one and only one hypothesis is true.7 Notice that this implies, by definition, that H itself is a tautology. Trivial properties of such a complete set of hypotheses are8

P (Hi, Hj|I) = P (Hi|I) δij (2.9)

and

∑i P(Hi|I) = P(∑i Hi|I) = 1 (normalization) (2.10)

where we used the sum rule (2.6) in the first equality and the fact that the logical sum of the Hi is a tautology in the second equality. Eq. (2.10) is the extension of the sum-rule axiom (2.5).

Similarly it is straightforward to show that

∑i P(D, Hi|I) = P(D, ∑i Hi|I) = P(D|I). (2.11)

This operation is called marginalization9 and plays a very important role in Bayesian analysis since it allows us to eliminate sets of hypotheses which are necessary in the formulation of a problem but are otherwise of no interest to us ('nuisance parameters').

The inverse of marginalization is the decomposition of a probability: using the product rule we can re-write (2.11) as

P(D|I) = ∑i P(D, Hi|I) = ∑i P(D|Hi, I) P(Hi|I) (2.12)

which states that the probability of D can be written as the weighted sum of the probabilities of a complete set of hypotheses {Hi}. The weights are just given by the probability that Hi, when true, gives D. In this way we have expanded P(D|I) on a basis of probabilities P(Hi|I).10 Decomposition is often used in probability assignment because it allows us to express a compound probability in terms of known elementary probabilities.

7A trivial example is the complete set H1 : x < a and H2 : x ≥ a with x and a real numbers.

8Here and in the following we write the conjunction AB as A, B.

9A projection of a two-dimensional distribution f(x, y) on the x or y axis is called a marginal distribution. Because (2.11) is projecting out P(D|I) it is called marginalization.

10Note that (2.12) is similar to the closure relation in quantum mechanics ⟨D|I⟩ = ∑i ⟨D|Hi⟩⟨Hi|I⟩.


Using (2.12), Bayes’ theorem (2.8) can, for a complete set of hypotheses, be written as

P (Hi|D, I) =P (D|Hi, I)P (Hi|I)∑

i P (D|Hi, I)P (Hi|I), (2.13)

from which it is seen that the denominator is just a normalization constant.

If we calculate with (2.13) the posteriors for all the hypotheses Hi in the set we obtain aspectrum of probabilities which, in the continuum limit, goes over to a probability den-sity distribution (see Section 2.4). Note that in computing this spectrum the likelihoodP (D|Hi, I) is viewed as a function of the hypotheses for fixed data. In case P (D|Hi, I)is regarded as a function of the data for fixed hypothesis it is not called a likelihood but,instead, a sampling distribution.
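As an aside (not part of the original notes), equation (2.13) is easy to evaluate numerically. The short Python sketch below updates a two-hypothesis set with made-up priors and likelihoods; all numbers are purely illustrative assumptions.

    # Posterior for an exhaustive, exclusive set of hypotheses, eq. (2.13).
    # All numbers below are illustrative assumptions, not values from the text.
    priors = {"H1": 0.02, "H2": 0.98}        # P(Hi|I), must sum to one
    likelihoods = {"H1": 0.95, "H2": 0.05}   # P(D|Hi,I) for the observed data D

    evidence = sum(likelihoods[h] * priors[h] for h in priors)      # P(D|I), eq. (2.12)
    posterior = {h: likelihoods[h] * priors[h] / evidence for h in priors}
    print(posterior)   # {'H1': 0.279..., 'H2': 0.720...}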

Exercise 2.5: Mr. White is positive on an AIDS test. The probability of a positive test is 98% for a person who has AIDS (efficiency) and 3% for a person who has not (contamination). Given that a fraction µ = 1% of the population is infected, what is the probability that Mr. White has AIDS? What would be this probability for full efficiency and for zero contamination?

Exercise 2.6: What would be the probability that Mr. White has AIDS given the prior information µ = 0 (nobody has AIDS) or µ = 1 (everybody has AIDS)? Note that both these statements on µ encode prior certainties. Convince yourself, by studying Bayes' theorem, of the fact that no amount of data can ever change a prior certainty.

2.4 Continuous variables

The formalism presented above describes the probability calculus of propositions or, equivalently, of discrete variables (which can be thought of as an index labeling a set of propositions). To extend this discrete algebra to continuous variables, consider the propositions

A : r < a, B : r < b, C : a ≤ r < b

for a real variable r and two fixed real numbers a and b with a < b. Because we have the Boolean relation B = A + C and because A and C are mutually exclusive we find from the sum rule (2.6)

P (a ≤ r < b|I) = P (r < b|I) − P (r < a|I) ≡ G(b) − G(a). (2.14)

In (2.14) we have introduced the cumulant G(x) ≡ P(r < x|I) which obviously is a monotonically increasing function of x. The probability density p is defined by

p(x|I) = lim(δ→0) P(x ≤ r < x + δ|I) / δ = dG(x)/dx (2.15)

(note that p is positive definite) so that (2.14) can also be written as

P(a ≤ r < b|I) = ∫_a^b p(r|I) dr. (2.16)

In terms of probability densities, the product rule (2.4) can now be written as

p(x, y|I) = p(x|y, I) p(y|I). (2.17)


Likewise, the normalization (2.10) can be written as

∫ p(x|I) dx = 1, (2.18)

the marginalization/decomposition (2.12) as

p(x|I) = ∫ p(x, y|I) dy = ∫ p(x|y, I) p(y|I) dy (2.19)

and Bayes’ theorem (2.13) as

p(y|x, I) = p(x|y, I) p(y|I) / ∫ p(x|y, I) p(y|I) dy. (2.20)

Exercise 2.7: A counter produces a yes/no signal S when it is traversed by a pion. Given are the efficiency P(S|π, I) = ε and the mis-identification probability P(S|π̄, I) = δ. The fraction of pions in the beam is P(π|I) = µ. What is the probability P(π|S, I) that a particle which generates a signal is a pion in case (i) µ is known and (ii) µ is unknown? In the latter case assume a uniform prior distribution for µ in the range 0 ≤ µ ≤ 1.

We make four remarks: (1)—Probabilities are dimensionless numbers so that the dimension of a density is the reciprocal of the dimension of the variable. This implies that p(x) transforms when we make a change of variable x → f(x). The size of the infinitesimal element dx corresponding to df is given by dx = |dx/df| df; because the probability content of this element must be invariant we have

p(f|I) df = p(x|I) dx = p(x|I) |dx/df| df and thus p(f|I) = p(x|I) |dx/df|. (2.21)
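A quick numerical check of (2.21), our own sketch rather than part of the notes: draw x uniformly on (0, 1), transform to f = exp(x), and compare a histogram of f with the transformed density p(f|I) = p(x|I)|dx/df| = 1/f.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0.0, 1.0, 200_000)   # p(x|I) = 1 on (0, 1)
    f = np.exp(x)                        # change of variable f = exp(x), so x = ln f

    # Eq. (2.21): p(f|I) = p(x|I) |dx/df| = 1/f on (1, e)
    hist, edges = np.histogram(f, bins=20, range=(1.0, np.e), density=True)
    centres = 0.5 * (edges[:-1] + edges[1:])
    print(np.abs(hist - 1.0 / centres).max())   # small: only statistical fluctuations remain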

Exercise 2.8: A lighthouse at sea is positioned a distance d from the coast.

[Figure: the lighthouse lies a distance d off a straight coast, opposite the coastal position x0; a pulse emitted at angle ϑ hits the coast at position x.]

This lighthouse emits collimated light pulses at random times in random directions, that is, the distribution of pulses is uniform in ϑ. Derive an expression for the probability to observe a light pulse as a function of the position x along the coast. (From Sivia [5].)

(2)—Without prior information it is tempting to choose a uniform distribution for the prior density p(y|I) in Bayes' theorem (2.20). However, the distribution of a transformed variable z = f(y) will then, in general, not be uniform. An ambiguity arises in the Bayesian formalism if our lack of information can be encoded equally well by a uniform distribution in y as by a uniform distribution in z. How to select an optimal 'non-informative' prior among (perhaps many) alternatives is an important—and sometimes controversial—issue: we will come back to it in Section 5. Note, however, that in an iterative learning process the choice of initial prior becomes increasingly less important because the posterior obtained at one step can be taken as the prior in the next step.11

Exercise 2.9: We put two pion counters in the beam, both with efficiency P(S|π, I) = ε and mis-identification probability P(S|π̄, I) = δ. The fraction of pions in the beam is P(π|I) = µ. A particle traverses and both counters give a positive signal. What is the probability that the particle is a pion? Calculate this probability first by taking the posterior of the measurement in counter (1) as the prior for the measurement in counter (2), and second by considering the two responses (S1, S2) as one measurement and using Bayes' theorem directly. In order to get the same result in both calculations an assumption has to be made on the measurements in counters (1) and (2). What is this assumption?

(3)—The equations (2.19) and (2.20) are, together with their discrete equivalents, all you need in Bayesian inference. Indeed, apart from the approximations and transformations described in Section 3 and the maximum entropy principle described in Section 5, the remainder of these lectures will be nothing more than repeated applications of decomposition, probability inversion and marginalization.

(4)—Plausible inference is, strictly speaking, always conducted in terms of probabilities instead of probability densities. A density p(x|I) is turned into a probability by multiplying it with the infinitesimal element dx. For conditional probabilities dx should refer to the random variable (in front of the vertical bar) and not to the condition (behind the vertical bar); thus p(x|y, I)dx is a probability but p(x|y, I)dy is not.

2.5 Bayesian versus Frequentist inference

In the previous sections some mention was made of the differences between the Bayesian and Frequentist approaches. Although it is not the intention of these lectures to make a detailed comparison of the two methods, we think it is appropriate to make a few remarks. For this we will follow the quasi-historical account of T. Loredo [1] which helps to put matters in a proper perspective.

First, it is important to realize that the calculus presented in the previous sections gives us the rules for manipulating probabilities but not how to assign them. Sampling distributions (or likelihoods) are not much of a problem since they can, at least in principle, be assigned by deductive reasoning within a suitable model (a Monte Carlo simulation, for instance). However, the situation is less clear for the assignment of priors.

Bernoulli (1713) was the first to formulate a rule for a probability assignment which he called the principle of insufficient reason:

If for a set of N exclusive and exhaustive propositions there is no evidence to prefer any proposition over the others, then each proposition must be assigned an equal probability 1/N.

11The rate of convergence can be much affected by the initial choice of prior, see Section 5.1 for a nice example taken from the book by Sivia.


While this can be considered as a solid basis for dealing with discrete variables, problems arise when we try to extend the principle to infinitely large sets, that is, to continuous variables. This is because a uniform distribution ('uninformative') can be turned into a non-uniform distribution ('informative') by a simple coordinate transformation. A solution is provided by not using uniformity but, instead, maximum entropy as a criterion to select uninformative probability densities, see Section 5.

People were also concerned by the lack of rationale to use the axioms (2.3) and (2.4) for the calculus of a 'degree of belief' which was, indeed, the concept of probability for Bernoulli (1713), Bayes (1763) and, above all, Laplace (1812) who was the first to formulate Bayesian inference as we know it today. In spite of the considerable success of Laplace in applying Bayesian methods to astronomy, medicine and other fields of investigation, the concept of plausibility was considered to be too vague, and also too subjective, for a proper mathematical formulation. This nail in the coffin of Bayesianism has of course been removed by the work of Cox (1946), see Section 2.1.

An apparent masterstroke to eliminate all subjectiveness was to define probability as the limiting frequency of the occurrence of an event in an infinite sample of repetitions or, equivalently, in an infinite ensemble of identical copies of the process under study. Such a definition of probability is consistent with the rules (2.3) and (2.4) but inconsistent with the notion of a probability assignment to a hypothesis since in the repetitions of an experiment such a hypothesis can only be true or false and nothing else.12 This then invalidates the inversion of P(D|HI) to P(H|DI) with Bayes' theorem, which is quite convenient since it removes, together with the theorem itself, the need to specify this disturbing prior probability P(H|I).

However, the problem is now how to judge the validity of a hypothesis without allowing access via Bayes' theorem. The Frequentist answer is the construction of a so-called statistic. Such a statistic is a function of the data and thus a random variable with a distribution which can be derived directly from the sampling distribution. The idea is now to construct a statistic which is sensitive to the hypothesis being tested and to compare the data with the long-term behavior of this statistic. Well known examples of a statistic are the sample mean, variance, χ2 and so on. A disadvantage is that it is far from obvious how to construct a statistic when we have to deal with complicated problems. As a guidance many criteria have been invented, like unbiasedness, efficiency, consistency and sufficiency, but this still does not lead to a unique definition.

Exercise 2.10: The number n of radioactive decays observed during a time interval ∆t is Poisson distributed (Section 4.4)

P(n|R∆t) = (R∆t)^n e^(−R∆t) / n!,

where R is the decay rate. Motivate the use of n/∆t as a statistic for R (consult a statistics textbook if necessary).

Another unattractive feature of the Frequentist approach is that conclusions are based on hypothetical repetitions of the experiment which never took place. This is in stark contrast to Bayesian methods where the evidence is provided by the available data and prior information. Many repetitions ('events') actually do take place in particle physics experiments but this is not so in observational sciences like astronomy, for instance. Also, it turns out that the construction of a sampling distribution is not as unambiguous as one might think. Sampling distributions may depend, in fact, not only on the relevant information carried by the data but also on how the data were actually obtained. A well known example of this is the so-called stopping problem which we will discuss in Section 7.13

12A model parameter thus has for a Frequentist a fixed, but unknown, value. This is also the view of a Bayesian since the probability distribution that he assigns to the parameter does not describe how it fluctuates but how uncertain we are of its value.

13Many people consider the stopping problem the nail in the coffin of Frequentism.

Guidance in the construction of priors, or of any probability, is provided by sampling theory (the study of random processes like those occurring in games of chance), group theory (the study of symmetries and invariants) and, above all, the principle of maximizing information entropy (see Section 5.3). However, one may take as a rule of thumb that when the prior is wide compared to the likelihood, it does not matter very much what distribution we choose. On the other hand, when the likelihood is so wide that it competes with any reasonable prior, then this simply means that the experiment does not carry much information on the problem at hand. In such a case it should not come as a surprise that answers become dominated by prior knowledge or assumptions. Of course there is nothing wrong with that, as long as these prior assumptions are clearly stated. The prior also plays an important role when the likelihood peaks near a physical boundary or, as very well may happen, resides in an unphysical region (likelihoods related to neutrino mass measurements are a famous example; for another example see Section 6.3 in this write-up). Note that in case the prior is important, Bayesian and Frequentist methods might yield different answers to your problem.

With these remarks we leave the Bayesian-Frequentist debate for what it is and refer to the abundant literature on the subject, see e.g. [10] for recent discussions.

In the previous sections we have explicitly kept the probability densities conditional to 'I' in all expressions as a reminder that Bayesian probabilities are always defined in relation to some background information and also that this information must be the same for all probabilities in a given expression; if this is not the case, calculations may lead to paradoxical results.14 This background information should not be regarded as encompassing 'all that is known' but simply as a list of what is needed to unambiguously define all the probabilities in an expression. In the following we will be a bit liberal and sometimes omit 'I' when it clutters the notation.

14This happens, for instance, when we calculate a posterior P1(H|DI) ∝ P(D|HI)P(H|I) and feed this posterior back into Bayes' theorem as an 'improved' prior to calculate P2(H|DI) ∝ P(D|HI)P1(H|DI) using the same data D. It is obvious already from the inconsistent notation that this kind of bootstrapping is not allowed.

3 Posterior Representation

The full result of Bayesian inference is the posterior distribution. However, instead of publishing this distribution in the form of a parameterization, table, plot or computer program it is often more convenient to summarize the posterior in terms of a few parameters. In the following subsection we present the mean, variance and covariance as a measure of position and width. The remaining two subsections are devoted to transformations of random variables and to the properties of the covariance matrix.

3.1 Mean, variance and covariance

The expectation value of a function f(x) is defined by15

<f> = ∫ f(x) p(x|I) dx (3.1)

where the integration domain is understood to be the definition range of the distribution p(x|I). The k-th moment of a distribution is the expectation value <x^k>. From (2.18) it immediately follows that the zeroth moment <x^0> = 1. The first moment is called the mean of the distribution and is a location measure

µ = x̄ = <x> = ∫ x p(x|I) dx. (3.2)

The variance σ² is the second moment about the mean

σ² = <∆x²> = <(x − µ)²> = ∫ (x − µ)² p(x|I) dx. (3.3)

The square root of the variance is called the standard deviation and is a measure of the width of the distribution.

Exercise 3.1: Show that the variance is related to the first and second moments by <∆x²> = <x²> − <x>².

The width of a multivariate distribution is characterized by the covariance matrix:

µi = x̄i = <xi> = ∫···∫ xi p(x1, . . . , xn|I) dx1 · · · dxn

Vij = <∆xi ∆xj> = ∫···∫ (xi − µi)(xj − µj) p(x1, . . . , xn|I) dx1 · · · dxn. (3.4)

The covariance matrix is obviously symmetric.

Exercise 3.2: Show that the off-diagonal elements of Vij vanish when x1, . . . , xn are independent variables.

A correlation between the variables is better judged from the matrix of correlation coefficients which is defined by

ρij = Vij / √(Vii Vjj) = Vij / (σi σj). (3.5)

15We discuss here only continuous variables; the expressions for discrete variables are obtained by replacing the integrals with sums.


It can be shown that −1 ≤ ρij ≤ +1.
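For illustration (ours, not from the notes), the covariance and correlation matrices of (3.4) and (3.5) can be estimated from a Monte Carlo sample; the true covariance used below is an arbitrary assumption.

    import numpy as np

    rng = np.random.default_rng(0)
    cov_true = [[4.0, 1.5],
                [1.5, 1.0]]                 # an assumed covariance matrix V
    x = rng.multivariate_normal([1.0, -2.0], cov_true, size=200_000)

    V = np.cov(x, rowvar=False)             # sample estimate of Vij = <dxi dxj>
    rho = np.corrcoef(x, rowvar=False)      # rho_ij = Vij / (sigma_i sigma_j), eq. (3.5)
    print(V)                                # close to cov_true
    print(rho[0, 1])                        # close to 1.5 / sqrt(4.0 * 1.0) = 0.75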

The position of the maximum of a probability density function is called the mode16 which often is taken as a location parameter (provided the distribution has a single maximum). For the general case of an n-dimensional distribution one finds the mode by minimizing the function L(x) = − ln p(x|I). Expanding L around the point x̂ (to be specified below) we obtain

L(x) = L(x̂) + ∑_{i=1}^{n} [∂L(x̂)/∂xi] ∆xi + (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} [∂²L(x̂)/∂xi∂xj] ∆xi ∆xj + · · · (3.6)

with ∆xi ≡ xi − x̂i. We now take x̂ to be the point where L is minimum. With this choice x̂ is a solution of the set of equations

∂L(x̂)/∂xi = 0 (3.7)

so that the second term in (3.6) vanishes. Up to second order, the expansion can now be written in matrix notation as

L(x) = L(x̂) + ½ (x − x̂) H (x − x̂) + · · · , (3.8)

where the Hessian matrix of second derivatives is defined by

Hij ≡ ∂²L(x̂)/∂xi∂xj. (3.9)

Exponentiating (3.8) we find for our approximation of the probability density in the neighborhood of the mode:

p(x|I) ≈ C exp[−½ (x − x̂) H (x − x̂)] (3.10)

where C is a normalization constant. Now if we identify the inverse of the Hessian with a covariance matrix V then the approximation (3.10) is just a multivariate Gaussian in x-space17

p(x|I) ≈ 1/√((2π)^n |V|) exp[−½ (x − x̂) V⁻¹ (x − x̂)], (3.11)

where |V| denotes the determinant of V.18

Sometimes the posterior is such that the mean and covariance matrix (or Hessian) can be calculated analytically. In most cases, however, minimization programs like minuit are used to determine numerically x̂ and V from L(x) (the function L is calculated in the subroutine fcn provided by the user).

16We denote the mode by x̂ to distinguish it from the mean x̄. For symmetric distributions this distinction is irrelevant since then x̂ = x̄.

17This is why we have expanded in (3.6) the logarithm instead of the distribution itself: only then the inverse of the second derivative matrix is equal to the covariance matrix of a multivariate Gaussian.

18There is no problem with √|V| since |V| is positive definite, as will be shown in Section 3.3.
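The same numerical procedure can be sketched with scipy in place of minuit (a substitution of ours, purely for illustration): minimize L(x) = −ln p(x|I) and take an estimate of the inverse Hessian at the minimum as the covariance matrix. The toy posterior below is an assumed two-dimensional Gaussian, so the exact answer is known.

    import numpy as np
    from scipy.optimize import minimize

    V_true = np.array([[2.0, 0.6],
                       [0.6, 1.0]])        # assumed covariance of the toy posterior
    A = np.linalg.inv(V_true)

    def L(x):
        # L(x) = -ln p(x|I) up to a constant; the mode is at the origin
        return 0.5 * x @ A @ x

    res = minimize(L, x0=[3.0, -3.0], method="BFGS")
    print(res.x)          # the mode, close to [0, 0]
    print(res.hess_inv)   # BFGS estimate of the inverse Hessian, i.e. the covariance matrix;
                          # close to V_true (a finite-difference Hessian would be more precise)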

The approximation (3.11) is easily marginalized. It can be shown that integrating a multivariate Gaussian over one variable xi is equivalent to deleting the corresponding row and column i in the Hessian matrix H. This defines a new Hessian H′ and, by inversion, a new covariance matrix V′. Replacing V by V′ and n by (n − 1) in (3.11) then obtains the integrated Gaussian. It is now easy to see that integration over all but one xi gives

p(xi|I) = 1/(σi√(2π)) exp[−½ ((xi − x̂i)/σi)²] (3.12)

where σi² is the diagonal element Vii of the covariance matrix V.19

Let us close this section by making the remark that communicating a posterior by only giving the mean (or mode) and covariance is inappropriate—and perhaps even misleading—when the distribution is multi-modal, strongly asymmetric or exhibits long tails.20 Note also that there are distributions for which mean and variance are ill defined as is, for instance, the case for a uniform distribution on [−∞, +∞]. Mean and variance may not even exist because the integrals (3.2) or (3.3) are divergent. An example of this is the Cauchy distribution

p(x|I) = (1/π) · 1/(1 + x²).

Exercise 3.3: The Cauchy distribution is often called the Breit-Wigner distribution which usually is parameterized as

p(x|x0, Γ) = (1/π) · (Γ/2) / [(Γ/2)² + (x − x0)²].

Use (3.6)—in one dimension—to characterize the position and width of the Breit-Wigner distribution. Show also that Γ is the FWHM (full width at half maximum).

If the choice of an optimal value is of strategic importance, like in business or finance, then this choice is often based on decision theory (not described here) instead of just taking the mean or mode.

3.2 Transformations

In this section we briefly describe how to construct probability densities of functions of a (multi-dimensional) random variable. We start by calculating the probability density p(z|I) of a single function z = f(x) from a given distribution p(x|I) of n variables x. Decomposition gives

p(z|I) = ∫ p(z, x|I) dx = ∫ p(z|x, I) p(x|I) dx = ∫ δ[z − f(x)] p(x|I) dx (3.13)

19The error on a 'fitted' parameter given by minuit is the diagonal element of the covariance matrix and is thus the width of the marginal distribution of this parameter.

20Without additional information people will assume that the posterior is unimodal and that a one standard deviation interval contains roughly 68.3% probability, as is the case for a Gauss distribution.


where we have made the trivial assignment p(z|x, I) = δ[z − f(x)]. This assignment guarantees that the integral only receives contributions from the hyperplane f(x) = z.

As an example consider two independent variables x and y distributed according to p(x, y|I) = f(x)g(y). Using (3.13) we find that the distribution of the sum z = x + y is given by the Fourier convolution of f and g

p(z|I) = ∫ f(x) g(z − x) dx = ∫ f(z − y) g(y) dy. (3.14)

Likewise we find that the product z = xy is distributed according to the Mellin convolution of f and g

p(z|I) = ∫ f(x) g(z/x) dx/|x| = ∫ f(z/y) g(y) dy/|y|, (3.15)

provided that the definition ranges do not include x = 0 and y = 0.
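The convolution formulas are easily verified by Monte Carlo. The sketch below (ours) checks the Fourier convolution (3.14) for two unit exponential densities, for which the convolution integral gives p(z) = z exp(−z).

    import numpy as np

    rng = np.random.default_rng(2)
    z = rng.exponential(1.0, 500_000) + rng.exponential(1.0, 500_000)   # z = x + y

    # Eq. (3.14): p(z) = int f(x) g(z-x) dx = z * exp(-z) for f(x) = g(x) = exp(-x)
    hist, edges = np.histogram(z, bins=50, range=(0.0, 10.0), density=True)
    centres = 0.5 * (edges[:-1] + edges[1:])
    print(np.abs(hist - centres * np.exp(-centres)).max())   # small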

Exercise 3.4: Use (3.13) to derive Eqs. (3.14) and (3.15).

In case of a coordinate transformation it may be convenient to just use the Jacobian as we have done in (2.21) in Section 2.4. By a coordinate transformation we mean a mapping Rⁿ → Rⁿ by a set of n functions

z(x) = {z1(x), . . . , zn(x)},

for which there exists an inverse transformation

x(z) = {x1(z), . . . , xn(z)}.

The probability density of the transformed variables is then given by

q(z|I) = p[x(z)|I] |J | (3.16)

where |J| is the absolute value of the determinant of the Jacobian matrix

Jik = ∂xi/∂zk. (3.17)

Exercise 3.5: Let x and y be two independent variables distributed according to p(x, y|I) = f(x)g(y). Let u = x + y and v = x − y. Use (3.16) to obtain an expression for q(u, v|I) in terms of f and g and show, by integrating over v, that the marginal distribution of u is given by (3.14). Likewise define u = xy and v = x/y and show that the marginal distribution of u is given by (3.15).

The above, although it formally settles the issue of how to deal with functions of random variables, often gives rise to tedious algebra as can be seen from the following exercise:21

Exercise 3.6: Two variables x1 and x2 are independently Gaussian distributed:

p(xi|I) = 1/(σi√(2π)) exp[−½ ((xi − µi)/σi)²], i = 1, 2.

Show, by carrying out the integral in (3.14), that the variable z = x1 + x2 is Gaussian distributed with mean µ = µ1 + µ2 and variance σ² = σ1² + σ2².

21Later on we will use Fourier transforms (characteristic functions) to make life much easier.


An easy way to deal with any function of any distribution is to generate p(x|I) by Monte Carlo, calculate F(x) at each generation and histogram the result. However, if we are content with summarizing the distributions by a location parameter and a covariance matrix, then there is a very simple transformation rule, known as linear error propagation.

Let Fλ(x) be one of a set of m functions of x. Linear approximation gives

∆Fλ ≡ Fλ(x) − Fλ(x̂) = ∑_{i=1}^{n} [∂Fλ(x̂)/∂xi] ∆xi (3.18)

with ∆xi = xi − x̂i. Here x̂ stands for your favorite location parameter (usually mean or mode). Now multiplying (3.18) by the expression for ∆Fµ and averaging obtains

<∆Fλ ∆Fµ> = ∑_{i=1}^{n} ∑_{j=1}^{n} (∂Fλ/∂xi)(∂Fµ/∂xj) <∆xi ∆xj>. (3.19)

Eq. (3.19) can be written in compact matrix notation as

VF = D Vx Dᵀ (3.20)

where Dᵀ denotes the transpose of the m × n derivative matrix Dλi = ∂Fλ/∂xi. Well known applications are the quadratic addition of errors for a sum of independent variables

σ² = σ1² + σ2² + · · · + σn² for z = x1 + x2 + · · · + xn (3.21)

and the quadratic addition of relative errors for a product of independent variables

(σ/z)² = (σ1/x1)² + (σ2/x2)² + · · · + (σn/xn)² for z = x1 x2 · · · xn. (3.22)
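As a quick illustration (ours), the product rule (3.22) can be compared with a direct Monte Carlo estimate; for the small relative errors assumed below the linear approximation works very well.

    import numpy as np

    rng = np.random.default_rng(3)
    x1 = rng.normal(10.0, 0.2, 1_000_000)   # assumed independent inputs with
    x2 = rng.normal(5.0, 0.1, 1_000_000)    # small (2%) relative errors
    z = x1 * x2

    # Linear propagation, eq. (3.22): sigma_z = z * sqrt((0.2/10)^2 + (0.1/5)^2)
    sigma_linear = 50.0 * np.sqrt((0.2 / 10.0) ** 2 + (0.1 / 5.0) ** 2)
    print(sigma_linear)   # about 1.414
    print(z.std())        # Monte Carlo estimate, close to the linear result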

Exercise 3.7: Use (3.19) to derive the two propagation rules (3.21) and (3.22).

Exercise 3.8: A counter is traversed by N particles and fires n times. Since n ⊆ N these counts are not independent, but n and m = N − n are. Assume Poisson errors (Section 4.4) σn = √n and σm = √m and use (3.19) to show that the error on the efficiency ε = n/N is given by

σε = √(ε(1 − ε)/N).

This is known as the binomial error, see Section 4.2.

3.3 The covariance matrix

In this section we investigate in some more detail the properties of the covariance matrix which, together with the mean, fully characterizes the multivariate Gaussian (3.11).

In Section 3.1 we have already remarked that V is symmetric, but not every symmetric matrix can serve as a covariance matrix. To see this, consider a function f(x) of a set of Gaussian random variables x. For the variance of f we have according to (3.20)

σ² = <∆f²> = d V d,


where d is the vector of derivatives ∂f/∂xi. But since σ² is positive for any function f, it follows that the following inequality must hold:

d V d > 0 for any vector d. (3.23)

A matrix which possesses this property is called positive definite.

A covariance matrix can be diagonalized by a unitary transformation. This can be seen from Fig. 1 where we show the one standard deviation contour of two correlated Gaussian variables (x1, x2) and two uncorrelated variables (y1, y2). It is clear from these plots that the two error ellipses are related by a simple rotation. The rotation matrix is unique, provided that the variables x are linearly independent. A pure rotation is not the only way to diagonalize the covariance matrix since the rotation can be combined with a scale transformation along y1 or y2.

Figure 1: The one standard deviation contour of a two-dimensional Gaussian for (a) correlated variables x1 and x2 and (b) uncorrelated variables y1 and y2. The marginal distributions of x1 and x2 have a standard deviation of σ1 and σ2, respectively.

The rotation U which diagonalizes V must, according to the transformation rule (3.19), satisfy the relation

U V Uᵀ = L ⇒ V Uᵀ = Uᵀ L (3.24)

where L = diag(λ1, . . . , λn) denotes a diagonal matrix. In (3.24) we have used the property U⁻¹ = Uᵀ of an orthogonal transformation. Let the columns i of Uᵀ be denoted by the set of vectors ui, that is,

(ui)j = Uᵀji = Uij. (3.25)

It is then easy to see that (3.24) corresponds to the set of eigenvalue equations

V ui = λiui. (3.26)

Thus, the rotation matrix U and the vector of diagonal elements λi are determined by the complete set of eigenvectors and eigenvalues of the covariance matrix V.
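In practice the eigenvalue problem (3.26) is solved numerically; a sketch (ours) with numpy for an arbitrary 2 × 2 covariance matrix:

    import numpy as np

    V = np.array([[4.0, 1.5],
                  [1.5, 1.0]])              # an assumed symmetric, positive definite V

    lam, u = np.linalg.eigh(V)              # eigenvalues lambda_i, eigenvectors as columns of u
    U = u.T                                 # rows of U are the eigenvectors u_i

    print(U @ V @ U.T)                      # close to diag(lambda_1, lambda_2), eq. (3.24)
    print(lam, np.prod(lam), np.linalg.det(V))   # all positive; product equals |V| = 1.75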

Exercise 3.9: Show that (3.24) is equivalent to (3.26).

Exercise 3.10: (i) Show that for a symmetric matrix V and two arbitrary vectors x and y the following relation holds: y V x = x V y; (ii) Show that the eigenvectors ui and uj of a symmetric matrix V are orthogonal, that is, ui·uj = 0 for i ≠ j; (iii) Show that the eigenvalues of a positive definite symmetric matrix V are all positive.


The normalization factor of the multivariate Gaussian (3.11) is proportional to the square root of the determinant |V|, which only makes sense if this determinant is positive. Indeed,

∏i λi = |L| = |U V Uᵀ| = |U| |V| |Uᵀ| = |V| > 0,

where we have used the fact that |U| = 1 and that all the eigenvalues of V are positive.

4 Basic Probability Assignment

In Section 2.3 we have introduced the operation of decomposition which allows us to assign a compound probability by writing it as a sum of known elementary probabilities.

As a very basic application of decomposition in combination with Bernoulli's principle of insufficient reason we will discuss, in the next section, drawing balls from an urn. In passing we will make some observations about probabilities which are very Bayesian and which you may find quite surprising if you have never encountered them before.

We then proceed by deriving the Binomial distribution in subsection 4.2 using nothing else but the sum and product rules of probability calculus. Multinomial and Poisson distributions are introduced in the subsections 4.3 and 4.4. In the last subsection we will introduce a new tool, the characteristic function, and derive the Gauss distribution as the limit of a sum of arbitrarily distributed random variables.

4.1 Bernoulli’s urn

Consider an experiment where balls are drawn from an urn. Let the urn contain N balls and let the balls be labeled i = 1, . . . , N. We can now define the exhaustive and exclusive set of hypotheses

Hi = ‘this ball has label i’, i = 1, . . . , N.

Since we have no information on which ball we will draw, we use the principle of insufficient reason to assign the probability to get ball 'i' at the first draw:

P(Hi|N, I) = 1/N. (4.1)

Next, we consider the case that R balls are colored red and W = N − R are colored white. We define the exhaustive and exclusive set of hypotheses

HR = ‘this ball is red’

HW = ‘this ball is white’.


We now want to assign the probability that the first ball we draw will be red. To solve this problem we decompose this probability into the hypothesis space {Hi} which gives

P(HR|I) = ∑_{i=1}^{N} P(HR, Hi|I) = ∑_{i=1}^{N} P(HR|Hi, I) P(Hi|I) = (1/N) ∑_{i=1}^{N} P(HR|Hi, I) = R/N (4.2)

where, in the last step, we have made the trivial probability assignment

P(HR|Hi, I) = 1 if ball 'i' is red, and 0 otherwise.

Next, we assign the probability that the second ball will be red. This probability depends on how we draw the balls:

1. We draw the first ball, put it back in the urn and shake the urn. The latter action may be called 'randomization' but from a Bayesian point of view the purpose of shaking the urn is, in fact, to destroy all information we might have on the whereabouts of this ball after it was put back in the urn (it would most likely end up in the top layer of balls). Since this drawing with replacement does not change the contents of the urn and since the shaking destroys all previously accumulated information, the probability of drawing a red ball a second time is equal to that of drawing a red ball the first time:

   P(R2|I) = R/N,

   where R2 stands for the hypothesis 'the second ball is red'.

2. We record the color of the first ball, lay it aside, and then draw the second ball. Obviously the content of the urn has changed after the first draw. Depending on the color of the first ball we assign:

   P(R2|R1, I) = (R − 1)/(N − 1) and P(R2|W1, I) = R/(N − 1).

Exercise 4.1: Draw a first ball and put it aside without recording its color. Show that the probability for the second draw to be red is now P(R2|I) = R/N.

In the above we have shown that the probability of the second draw may depend on the outcome of the first draw. We will now show that the probability of the first draw may depend on the outcome of the second draw! Consider the following situation: a first ball is drawn and put aside without recording its color. A second ball is drawn and it turns out to be red. What is the probability that the first ball was red? Bayes' theorem immediately shows that it is not R/N:

P(R1|R2, I) = P(R2|R1, I) P(R1|I) / [P(R2|R1, I) P(R1|I) + P(R2|W1, I) P(W1|I)] = (R − 1)/(N − 1).

If this argument fails to convince you, take the extreme case of an urn containing one red and one white ball. The probability of a red ball at the first draw is 1/2. Lay the ball aside and take the second ball. If it is red, then the probability that the first ball was red is zero and not 1/2. The fact that the second draw influences the probability of the first draw has of course nothing to do with a causal relationship but, instead, with a logical relationship.
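The result P(R1|R2, I) = (R − 1)/(N − 1) is also easy to confirm by simulation; the urn content below (R = 3 red out of N = 10) is just an illustrative choice of ours.

    import numpy as np

    rng = np.random.default_rng(4)
    R, N = 3, 10                         # illustrative urn content
    first_and_second_red = second_red = 0

    for _ in range(100_000):
        urn = rng.permutation([1] * R + [0] * (N - R))   # 1 = red, 0 = white
        if urn[1] == 1:                                  # second draw is red
            second_red += 1
            first_and_second_red += urn[0]               # was the first draw red?

    print(first_and_second_red / second_red)   # close to (R-1)/(N-1) = 2/9 = 0.222...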

Exercise 4.2: Draw a first ball and put it back in the urn without recording its color. The color of a second draw is red. What is the probability that the first draw was red?

4.2 Binomial distribution

We now draw n balls from the urn, putting the ball back after each draw and shaking the urn. In this way the probability that a draw is red is the same for all draws: p = R/N. What is the probability that we find r red balls in our sample of n draws? Again, we seek to decompose this probability into a combination of elementary ones which are easy to assign. Let us start with the hypothesis

Sj = “the n balls are drawn in the sequence labeled ‘j’ ”

where j = 1, . . . , 2ⁿ is the index in a list of all possible sequences (of length n) of white and red draws. The set of hypotheses {Sj} is obviously exclusive and exhaustive. The draws are independent, that is, the probability of the kth draw does not depend on the outcome of the other draws (remember that this is only true for drawing with replacement). Thus we find from the product rule

P(Sj|I) = P(C1, . . . , Cn|I) = ∏_{k=1}^{n} P(Ck|I) = p^rj (1 − p)^(n−rj), (4.3)

where Ck stands for red or white at the kth draw and where rj is the number of red draws in the sequence j. Having assigned the probability of each element in the set {Sj}, we now decompose our probability of r red balls into this set:

P(r|I) = \sum_{j=1}^{2^n} P(r, S_j|I) = \sum_{j=1}^{2^n} P(r|S_j, I)\, P(S_j|I)
       = \left[ \sum_{j=1}^{2^n} δ(r - r_j) \right] p^r (1-p)^{n-r}   (4.4)

where we have assigned the trivial probability

P(r|S_j, I) = δ(r - r_j) = \begin{cases} 1 & \text{when the sequence } S_j \text{ contains } r \text{ red draws} \\ 0 & \text{otherwise.} \end{cases}


The sum inside the square brackets in (4.4) counts the number of sequences in the set {S_j} which have just r red draws. It is an exercise in combinatorics to show that this number is given by the binomial coefficient. Thus we obtain

P(r|p, n) = \frac{n!}{r!(n-r)!}\, p^r (1-p)^{n-r} = \binom{n}{r} p^r (1-p)^{n-r}.   (4.5)

This is called the binomial distribution which applies to all processes where the outcome is binary (red or white, head or tail, yes or no, absent or present etc.), provided that the probability p of the outcome of a single draw is the same for all draws. In Fig. 2 we show the distribution of red draws for n = (10, 20, 40) trials for an urn with p = 0.25.


Figure 2: The binomial distribution to observe r red balls in n = (10, 20, 40) draws from an urn containing a fraction p = 0.25 of red balls.

The binomial probabilities are just the terms of the binomial expansion

(a + b)^n = \sum_{r=0}^{n} \binom{n}{r} a^r b^{n-r}   (4.6)

with a = p and b = 1 − p. From this it follows immediately that

\sum_{r=0}^{n} P(r|p, n) = 1.

The condition of independence of the trials is important and may not be fulfilled: for instance, suppose we scoop a handful of balls out of the urn and count the number r of red balls in this sample. Does r follow the binomial distribution? The answer is ‘no’ since we did not perform draws with replacement, as required. This can also be seen from the extreme situation where we take all balls out of the urn. Then r would not be distributed at all: it would just be R.

The first and second moments and the variance of the binomial distribution are

<r> = \sum_{r=0}^{n} r\, P(r|p, n) = np

<r^2> = \sum_{r=0}^{n} r^2\, P(r|p, n) = np(1-p) + n^2 p^2

<∆r^2> = <r^2> - <r>^2 = np(1-p)   (4.7)


If we now define the ratio µ = r/n then it follows immediately from (4.7) that

<µ> = p, \qquad <∆µ^2> = \frac{p(1-p)}{n}   (4.8)

The square root of this variance is called the binomial error which we have already encountered in Exercise 3.8. It is seen that the variance vanishes in the limit of large n and thus that µ converges to p in that limit. This fundamental relation between a probability and a limiting relative frequency was first discovered by Bernoulli and is called the law of large numbers. This law is, of course, the basis for the Frequentist definition of probability.
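As a quick sanity check of (4.8) and the law of large numbers, the following sketch (added here for illustration, with p = 0.25 as in Fig. 2 and arbitrary sample sizes) draws many binomial samples and compares the spread of µ = r/n with p(1 − p)/n.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.25
for n in (10, 100, 10_000):
    r = rng.binomial(n, p, size=100_000)   # 100k repetitions of n draws with replacement
    mu = r / n                             # observed red fraction in each repetition
    print(n, mu.mean(), mu.var(), p * (1 - p) / n)
```

The mean of µ stays at p while its variance shrinks like 1/n, which is the convergence expressed by the law of large numbers.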

4.3 Multinomial distribution

A generalization of the binomial distribution is the multinomial distribution which applies to N independent trials where the outcome of each trial is among a set of k alternatives with probability p_i. Examples are drawing from an urn containing balls with k different colors, the throwing of a die (k = 6) or distributing N independent events over the bins of a histogram.

The multinomial distribution can be written as

P(n|p, N) = \frac{N!}{n_1! \cdots n_k!}\, p_1^{n_1} \cdots p_k^{n_k}   (4.9)

where n = (n_1, . . . , n_k) and p = (p_1, . . . , p_k) are vectors subject to the constraints

\sum_{i=1}^{k} n_i = N \qquad \text{and} \qquad \sum_{i=1}^{k} p_i = 1.   (4.10)

The multinomial probabilities are just the terms of the expansion

(p_1 + \cdots + p_k)^N

from which the normalization of P(n|p, N) immediately follows. The average, variance and covariance are given by

<n_i> = N p_i

<∆n_i^2> = N p_i (1 - p_i)

<∆n_i ∆n_j> = -N p_i p_j \qquad \text{for } i \neq j.   (4.11)

Marginalization is achieved by adding in (4.9) two or more variables n_i and their corresponding probabilities p_i.

Exercise 4.3: Use the addition rule above to show that the marginal distribution of each n_i in (4.9) is given by the binomial distribution P(n_i|p_i, N) as defined in (4.5).

The conditional distribution on, say, the count n_k is given by

P(m|n_k, q, M) = \frac{M!}{n_1! \cdots n_{k-1}!}\, q_1^{n_1} \cdots q_{k-1}^{n_{k-1}}

where

m = (n_1, . . . , n_{k-1}), \qquad q = \frac{1}{s}(p_1, . . . , p_{k-1}), \qquad s = \sum_{i=1}^{k-1} p_i \qquad \text{and} \qquad M = N - n_k.

Exercise 4.4: Derive the expression for the conditional probability by dividing the joint probability (4.9) by the marginal (binomial) probability P(n_k|p_k, N).
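The moments (4.11) can be verified directly by sampling. The sketch below is an added illustration (k = 3 and the probabilities are arbitrary choices); it uses numpy’s multinomial generator and compares sample averages, variances and one covariance with the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 60, np.array([0.5, 0.3, 0.2])            # k = 3 alternatives
n = rng.multinomial(N, p, size=200_000)          # samples of (n1, n2, n3)

print(n.mean(axis=0), N * p)                     # <ni>      = N pi
print(n.var(axis=0), N * p * (1 - p))            # <Δni²>    = N pi (1 - pi)
print(np.cov(n[:, 0], n[:, 1])[0, 1], -N * p[0] * p[1])   # <Δn1Δn2> = -N p1 p2
```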

4.4 Poisson distribution

Here we consider ‘events’ or ‘counts’ which occur randomly in time (or space). The counting rate R is supposed to be given, that is, we know the average number of counts µ = R∆t in a given time interval ∆t. There are several ways to derive an expression for the probability P(n|µ) to observe n events in a time interval which contains, on average, µ events. Our derivation is based on the fact that this probability distribution is a limiting case of the binomial distribution.

Assume we divide the interval ∆t in N sub-intervals δt. The probability to observe an event in such a sub-interval is then p = µ/N, see (4.2). Now we can always make N so large and δt so small that the number of events in each sub-interval is either one or zero. The probability to find n events in N sub-intervals is then equal to the (binomial) probability to find n successes in N trials:

P(n|N) = \frac{N!}{n!(N-n)!} \left( \frac{µ}{N} \right)^n \left( 1 - \frac{µ}{N} \right)^{N-n}.

Taking the limit N → ∞ then yields the desired result

P(n|µ) = \lim_{N \to \infty} \frac{N!}{(N-n)!\,(N-µ)^n}\, \frac{µ^n}{n!} \left( 1 - \frac{µ}{N} \right)^N = \frac{µ^n}{n!}\, e^{-µ}.   (4.12)

This distribution is known as the Poisson distribution. The normalization, average and variance are given by, respectively,

\sum_{n=0}^{\infty} P(n|µ) = 1, \qquad <n> = µ \qquad \text{and} \qquad <∆n^2> = µ.   (4.13)

Exercise 4.5: A counter is traversed by beam particles at an average rate of R particles per second. (i) If we observe n counts in a time interval ∆t, derive an expression for the posterior distribution of µ = R∆t, given that the prior for µ is uniform. Calculate mean, variance, mode and width (inverse of the Hessian) of this posterior. (ii) Give an expression for the probability p(τ|R, I) dτ that the time interval between the passage of two particles is between τ and τ + dτ.
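The limiting procedure leading to (4.12) can be made concrete with a few lines of code. The sketch below is an added illustration (µ = 3 and n = 5 are arbitrary choices); it evaluates the binomial probability P(n|N) for increasing N and compares it with the Poisson value.

```python
from math import comb, exp, factorial

mu, n = 3.0, 5                      # average counts, and the count at which we compare
for N in (10, 100, 10_000):
    p = mu / N                      # probability per sub-interval
    binom = comb(N, n) * p**n * (1 - p)**(N - n)
    print(N, binom)
print('Poisson', mu**n / factorial(n) * exp(-mu))
```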

4.5 Gauss distribution

The sum of many small random fluctuations leads to a Gauss distribution, irrespective of the distribution of each of the terms contributing to the sum. This fact, known as the central limit theorem, is responsible for the dominant presence of the Gauss distribution in statistical data analysis. To prove the central limit theorem we first have to introduce the characteristic function which is nothing else than the Fourier transform of a probability density.

The Fourier transform, and its inverse, of a distribution p(x) is defined by

φ(k) = \int_{-\infty}^{\infty} e^{ikx} p(x)\, dx, \qquad p(x) = \frac{1}{2π} \int_{-\infty}^{\infty} e^{-ikx} φ(k)\, dk   (4.14)

This transformation plays an important role in proving many theorems related to sums of random variables and moments of probability distributions. This is because a Fourier transform turns a Fourier convolution in x-space, see (3.14), into a product in k-space.22

To see this, consider a joint distribution of n independent variables

p(x|I) = f_1(x_1) \cdots f_n(x_n).

Using (3.13) we write for the transform of the distribution of the sum z = \sum x_i

φ(k) = \int_{-\infty}^{\infty} dz \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} dx_1 \cdots dx_n\, \exp(ikz)\, f_1(x_1) \cdots f_n(x_n)\, δ\!\left(z - \sum_{i=1}^{n} x_i\right)
     = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} dx_1 \cdots dx_n\, \exp\!\left(ik \sum_{i=1}^{n} x_i\right) f_1(x_1) \cdots f_n(x_n)
     = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} dx_1 \cdots dx_n\, \exp(ikx_1) f_1(x_1) \cdots \exp(ikx_n) f_n(x_n)
     = φ_1(k) \cdots φ_n(k).   (4.15)

The transform of a sum of independent random variables is thus the product of the transforms of each variable.

The moments of a distribution are related to the derivatives of the transform at k = 0:

\frac{d^n φ(k)}{dk^n} = \int_{-\infty}^{\infty} (ix)^n e^{ikx} p(x)\, dx \quad \Rightarrow \quad \frac{d^n φ(0)}{dk^n} = i^n <x^n>.   (4.16)

The characteristic functions (Fourier transforms) of many distributions can be found in, for instance, the particle data book [11]. Of importance to us is the Gauss distribution and its transform

p(x) = \frac{1}{σ\sqrt{2π}} \exp\left[ -\frac{1}{2} \left( \frac{x-µ}{σ} \right)^2 \right], \qquad φ(k) = \exp\left( iµk - \frac{1}{2}σ^2 k^2 \right)   (4.17)

22 A Mellin transform turns a Mellin convolution (3.15) in x-space into a product in k-space. We will not discuss Mellin transforms in these notes.

To prove the central limit theorem we consider the sum of a large set of n random variables s = \sum x_j. Each x_j is distributed independently according to f_j(x_j) with mean µ_j and a standard deviation σ which we take, for the moment, to be the same for all f_j. To simplify the algebra, we do not consider the sum itself but rather

z = \sum_{j=1}^{n} y_j = \sum_{j=1}^{n} \frac{x_j - µ_j}{\sqrt{n}} = \frac{s - µ}{\sqrt{n}}   (4.18)

where we have set µ = \sum µ_j. Now take the Fourier transform φ_j(k) of the distribution of y_j and make a Taylor expansion around k = 0. Using (4.16) we find

φ_j(k) = \sum_{m=0}^{\infty} \frac{k^m}{m!} \frac{d^m φ_j(0)}{dk^m} = \sum_{m=0}^{\infty} \frac{(ik)^m <y_j^m>}{m!}
       = 1 + \sum_{m=2}^{\infty} \frac{(ik)^m <(x_j - µ_j)^m>}{m!\, n^{m/2}} = 1 - \frac{k^2 σ^2}{2n} + O(n^{-3/2})   (4.19)

Taking only the first two terms of this expansion we find from (4.15) for the characteristic function of z

φ(k) = \left( 1 - \frac{k^2 σ^2}{2n} \right)^n \to \exp\left( -\frac{1}{2}σ^2 k^2 \right) \qquad \text{for } n \to \infty.   (4.20)

But this is just the characteristic function of a Gaussian with mean zero and width σ. Transforming back to the sum s we find

p(s) = \frac{1}{σ\sqrt{2πn}} \exp\left[ -\frac{(s-µ)^2}{2nσ^2} \right]   (4.21)

It can be shown that the central limit theorem also applies when the widths of the individual distributions are different, in which case the variance of the Gauss is σ^2 = \sum σ_i^2 instead of nσ^2 as in (4.21). However, the theorem breaks down when one or more individual widths are much larger than the others, allowing for one or more variables x_i to occasionally dominate the sum. It is also required that all µ_i and σ_i exist so that the theorem does not apply to, for instance, a sum of Cauchy distributed variables.
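Both statements, convergence for well-behaved distributions and failure for Cauchy variables, can be illustrated numerically. The sketch below is an added example (the choice of an exponential distribution and the sample sizes are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 100_000
x = rng.exponential(scale=1.0, size=(reps, n))   # far from Gaussian; mean 1, sigma 1
s = x.sum(axis=1)
print(s.mean(), n * 1.0)            # mean of the sum:   n * mu
print(s.std(), np.sqrt(n) * 1.0)    # width of the sum:  sqrt(n) * sigma

# A Cauchy variable has no mean or variance, so the theorem does not apply:
c = rng.standard_cauchy(size=(reps, n)).sum(axis=1) / n
print(np.percentile(c, [16, 84]))   # stays as wide as a single Cauchy variable
```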

Exercise 4.6: Apply (4.16) to the characteristic function (4.17) to show that the mean and variance of a Gauss distribution are µ and σ^2, respectively. Show also that all moments beyond the second vanish (the Gauss distribution is the only one that has this property).

Exercise 4.7: In Exercise 3.6 we have derived the distribution of the sum of two Gaussian distributed variables by explicitly calculating the convolution integral (3.14). Derive the same result by using the characteristic function (4.17). Convince yourself that the ‘central limit theorem’ always applies to sums of Gaussian distributed variables, even for a finite number of terms or large differences in width.

5 Least Informative Probabilities

5.1 Impact of prior knowledge

The impact of the prior on the outcome of plausible inference is nicely illustrated by a very instructive example, taken from Sivia [5], where the bias of a coin is determined from the observation of the number of heads in N throws.


Let us first recapitulate what constitutes a well posed problem so that we can apply Bayesian inference.

• First, we need to define a complete set of hypotheses. For our coin flipping experiment this will be the value of the probability h to obtain a head in a single throw. The definition range is 0 ≤ h ≤ 1.

• Second, we need a model which relates the set of hypotheses to the data. In other words, we need to construct the likelihood P(D|H, I) for all the hypotheses in the set. In our case this is the binomial probability to observe n heads in N throws of a coin with bias h

P(n|h, N, I) = \frac{N!}{n!(N-n)!}\, h^n (1-h)^{N-n}.   (5.1)

• Finally we need to specify the prior probability p(h|I) dh.

If we observe n heads in N throws, the posterior distribution of h is given by Bayes’ theorem

p(h|n, N, I)\, dh = C\, \frac{N!}{n!(N-n)!}\, h^n (1-h)^{N-n}\, p(h|I)\, dh,   (5.2)

where C is a normalization constant which presently is of no interest to us. In Fig. 3 these posteriors are shown for 10, 100 and 1000 throws of a coin with bias h = 0.25 for a flat prior (top row of plots), a strong prior preference for h = 0.5 (middle row) and a prior which excludes the possibility that h < 0.5 (bottom row).


Figure 3: The posterior density p(h|n, N) for n heads in N flips of a coin with bias h = 0.25. In the top row of plots the prior is uniform, in the middle row it is strongly peaked around h = 0.5 while in the bottom row the region h < 0.5 has been excluded. The posterior densities are scaled to unit maximum for ease of comparison.

It is seen that the flat prior converges nicely to the correct answer h = 0.25 when the number of throws increases.


The second prior does this too, but more slowly. This is not surprising because we have encoded, in this case, a quite strong prior belief that the coin is unbiased and it takes a lot of evidence from the data to change that belief. In the last choice of prior we see that the posterior cannot go below h = 0.5 since we have excluded this region by setting the prior to zero. This is an illustration of the fact that no amount of data can change certainties encoded by the prior, as we have already remarked in Exercise 2.6. This can of course be turned into an advantage since it allows us to exclude physically forbidden regions from the posterior, like a negative mass for instance.

From this example it is clear that unsupported information should not enter into the prior because it may need a lot of data to converge to the correct result in case this information turns out to be wrong. The maximum entropy principle provides a means to construct priors which have the property that they are consistent with given boundary conditions but are otherwise maximally un-informative. Before we proceed with the maximum entropy principle let us first, in the next section, make a few remarks on symmetry considerations in the assignment of probabilities.
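The behaviour shown in Fig. 3 is easy to reproduce on a grid of h values. The sketch below is an added illustration (the width 0.05 of the peaked prior and the grid spacing are arbitrary choices); it evaluates the posterior (5.2) for the three priors and prints the location of its maximum for the same head counts as in the figure.

```python
import numpy as np

h = np.linspace(0.0, 1.0, 1001)

def posterior(n, N, prior):
    post = h**n * (1 - h)**(N - n) * prior     # Eq. (5.2) up to normalization
    return post / np.trapz(post, h)

flat   = np.ones_like(h)                        # uniform prior
peaked = np.exp(-0.5 * ((h - 0.5) / 0.05)**2)   # strong belief in a fair coin
cut    = np.where(h >= 0.5, 1.0, 0.0)           # region h < 0.5 excluded

for n, N in [(3, 10), (21, 100), (255, 1000)]:
    modes = [h[np.argmax(posterior(n, N, prior))] for prior in (flat, peaked, cut)]
    print(N, modes)
```

With the flat prior the mode approaches 0.25, the peaked prior is pulled towards 0.5 for small N, and the cut prior never moves below 0.5, exactly as described above.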

5.2 Symmetry considerations

In Section 2.5 we have introduced Bernoulli’s principle of insufficient reason which states that, in absence of relevant information, equal probability should be assigned to members of an enumerable set of hypotheses. While this assignment strikes us as being very reasonable it is worthwhile to investigate if we can find some deeper—or at least alternative—reason behind this principle.

Suppose we would plot the probabilities assigned to each hypothesis in a bar chart. If we are in a state of complete ignorance about these hypotheses then it obviously should not matter how they would be ordered in such a chart. Stated differently, in absence of additional information the set of hypotheses is invariant under permutations. But our bar chart of probabilities can only be invariant under permutations if all the probabilities are the same. Hence the statement that Bernoulli’s principle is due to our complete ignorance is equivalent to the statement that it is due to permutation invariance.

Similarly, translation invariance implies that the least informative probability distribution of a so-called location parameter is uniform. If, for instance, we are completely ignorant of the actual whereabouts of the train from Utrecht to Amsterdam then any location on the track should be, for us, equally probable. In other words, this probability should obey the relation

p(x|I) dx = p(x + a|I) d(x + a) = p(x + a|I) dx, (5.3)

which can be satisfied only when p(x|I) is a constant.

A somewhat less intuitive assignment is related to scale parameters. Suppose we would be completely ignorant of distance scales in the universe so that a star in the night sky could, for us, very well be light years, light decades, light centuries or further away. Then each scale would seem equally probable, that is, for r > 0 and α > 0

p(r|I) dr = p(αr|I) d(αr) = α p(αr|I) dr. (5.4)


But this is only possible when p(r|I) ∝ 1/r. This probability assignment, which applies to positive definite scale parameters, is called a Jeffreys prior. Note that a Jeffreys prior is uniform in ln(r), which means that it assigns equal probability per decade instead of per unit interval as does a uniform prior.

Both the uniform and the Jeffreys prior cannot be normalized when the variable ranges are x ∈ [−∞,∞] or r ∈ [0,∞]. Such un-normalizable distributions are called improper. The way to deal with improper distributions is to normalize them on a finite interval and take the limits to infinity (or zero) at the end of the calculation. This is, by the way, good practice in any mathematical calculation involving limits. The posterior should, of course, always remain finite. If not, it is very likely that your problem is either ill posed or that your data do not carry enough information.

Exercise 5.1: We make an inference on a counting rate R by observing the number of counts n in a time interval ∆t. Assume that the likelihood P(n|R∆t) is Poisson distributed as defined by (4.12). Assume further a Jeffreys prior for R, defined on the positive interval R ∈ [a, b]. Show that (i) for ∆t = 0 the posterior is equal to the prior and that we cannot take the limits a → 0 or b → ∞; (ii) when ∆t > 0 but still n = 0 we can take the limit b → ∞ but not a → 0; (iii) that the latter limit can be taken once n > 0. (From Gull [12].)

Finally, let us remark that we have only touched here upon very simple cases where ‘no information’, ‘equal probability’ and ‘invariance’ are more or less synonymous so that the above may seem quite trivial. However, in Jaynes [7] you can find several examples which are far from trivial.

5.3 Maximum entropy principle

In the assignments we have made up to now (mainly in Section 4), there was always enough information to unambiguously determine the probability distribution. For instance if we apply the principle of insufficient reason to a fair die then this leads to a probability assignment of 1/6 for each face i = 1, . . . , 6. Note that this corresponds to an expectation value of <i> = 3.5. But what probability should we assign to each of the six faces when no information is given about the die except that, say, <i> = 4.5? There are obviously an infinite number of probability distributions which satisfy this constraint so that we have to look elsewhere for a criterion to select one of these.

Jaynes (1957) has proposed to take the distribution which is the least informative by maximizing the entropy, a concept first introduced by Shannon (1948) in his pioneering paper on information theory [13]. The entropy carried by a probability distribution is defined by

S(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \ln\left( \frac{p_i}{m_i} \right) \qquad \text{(discrete case)}

S(p) = -\int p(x) \ln\left[ \frac{p(x)}{m(x)} \right] dx \qquad \text{(continuous case)}   (5.5)

where m_i or m(x) is the so-called Lebesgue measure (see below) which satisfies

\sum_{i=1}^{n} m_i = 1 \qquad \text{or} \qquad \int m(x)\, dx = 1.   (5.6)


The definition (5.5) is such that larger entropies correspond to smaller information content of the probability distribution. Note that the Lebesgue measure makes the entropy invariant under coordinate transformations since both p and m transform in the same way.

To get some feeling for the meaning of this Lebesgue measure imagine a set of swimming pools on a rainy day. The amount of water collected by each pool will depend on the distribution of the falling raindrops and on the surface size of each pool. These surface sizes then play the role of the Lebesgue measure. Some formal insight can be gained by maximizing (5.5) imposing only the normalization constraint and nothing else. Restricting ourselves to the discrete case we find, using the method of Lagrange multipliers, that the following equation has to be satisfied

δ\left[ \sum_{i=1}^{n} p_i \ln\left( \frac{p_i}{m_i} \right) + λ\left( \sum_{i=1}^{n} p_i - 1 \right) \right] = 0.   (5.7)

Differentiation of (5.7) with respect to p_i leads to the equation

\ln\left( \frac{p_i}{m_i} \right) + 1 + λ = 0 \quad \Rightarrow \quad p_i = m_i\, e^{-(λ+1)}.

Imposing the normalization constraint \sum p_i = 1 we find, using (5.6),

\sum_{i=1}^{n} p_i = \left( \sum_{i=1}^{n} m_i \right) e^{-(λ+1)} = e^{-(λ+1)} = 1

from which it immediately follows that

p_i = m_i \quad \text{(discrete case)}, \qquad p(x)\, dx = m(x)\, dx \quad \text{(continuous case)}.   (5.8)

The Lebesgue measure m_i or m(x) is thus the least informative probability distribution in complete absence of information. To specify this ‘Ur-prior’ we have, again, to look elsewhere and use symmetry arguments (see Section 5.2) or just common sense. In practice a uniform or very simple Lebesgue measure is often adequate to describe the structure of the sampling space.

Let us now suppose that the Lebesgue measure is known (we will simply take it to be uniform in the following) and proceed by imposing further constraints in the form of so-called testable information. Such testable information is nothing else than constraints on the probability distributions themselves like specifying moments, expectation values, etc. Generically, we can write a set of m such constraints as

\sum_{i=1}^{n} f_{ki}\, p_i = β_k \qquad \text{or} \qquad \int f_k(x)\, p(x)\, dx = β_k, \qquad k = 1, \ldots, m.   (5.9)

Using Lagrange multipliers we maximize the entropy by solving the equation, in the discrete case,

δ\left[ \sum_{i=1}^{n} p_i \ln\left( \frac{p_i}{m_i} \right) + λ_0\left( \sum_{i=1}^{n} p_i - 1 \right) + \sum_{k=1}^{m} λ_k\left( \sum_{i=1}^{n} f_{ki} p_i - β_k \right) \right] = 0.   (5.10)


Differentiating with respect to p_i gives the equation

\ln\left( \frac{p_i}{m_i} \right) + 1 + λ_0 + \sum_{k=1}^{m} λ_k f_{ki} = 0 \quad \Rightarrow \quad p_i = m_i \exp(-1-λ_0) \exp\left( -\sum_{k=1}^{m} λ_k f_{ki} \right).

Imposing the normalization condition \sum p_i = 1 to solve for λ_0, we find

p_i = \frac{1}{Z}\, m_i \exp\left( -\sum_{k=1}^{m} λ_k f_{ki} \right)   (5.11)

where we have introduced the partition function

Z(λ_1, \ldots, λ_m) = \sum_{i=1}^{n} m_i \exp\left( -\sum_{k=1}^{m} λ_k f_{ki} \right).   (5.12)

Such partition functions play a very important role in statistical mechanics and thermodynamics since they contain all the information on the system under consideration, somewhat like the Lagrangian in dynamics. Indeed, our constraints (5.9) are encoded in Z through

-\frac{∂ \ln Z}{∂ λ_k} = β_k.   (5.13)

Exercise 5.2: Prove (5.13) by differentiating the logarithm of (5.12).

The formal solution (5.11) guarantees that the normalization condition is obeyed but is still void of content since we have not yet determined the unknown Lagrange multipliers λ_1, . . . , λ_m from the equations (5.9) or, equivalently, from (5.13). These equations often have to be solved numerically; a few simple cases are presented below.
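As a concrete example of such a numerical solution, the sketch below (an added illustration using scipy’s root finder) determines the single Lagrange multiplier for the die mentioned at the start of this section, with the constraint <i> = 4.5 and a uniform Lebesgue measure.

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)

def mean_for(lam):
    w = np.exp(-lam * faces)        # the uniform measure m_i = 1/6 cancels out
    p = w / w.sum()
    return (faces * p).sum()

lam = brentq(lambda l: mean_for(l) - 4.5, -5.0, 5.0)   # solve the constraint <i> = 4.5
p = np.exp(-lam * faces)
p /= p.sum()
print(lam, p, (faces * p).sum())
```

The resulting probabilities increase monotonically from face 1 to face 6, as one would expect for a mean pulled above 3.5.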

Assuming a uniform Lebesgue measure, it immediately follows from (5.8) that in absence of any information we have

p_i = p(x_i|I) = \text{constant},

in accordance with Bernoulli’s principle of insufficient reason. In the continuum limit this goes over into p(x|I) = constant.

Let us now consider a continuous distribution defined on [0,∞] and impose a constraint on the mean

<x> = \int_{0}^{\infty} x\, p(x|I)\, dx = µ.

Taking the continuum limit of (5.11) we have, assuming a uniform Lebesgue measure,

p(x|I) = e^{-λx} \left[ \int_{0}^{\infty} e^{-λx}\, dx \right]^{-1} = λ e^{-λx}.

Imposing the constraint on <x> leads to

\int_{0}^{\infty} λx\, e^{-λx}\, dx = \frac{1}{λ} = µ


from which we find that x follows an exponential distribution

p(x|µ, I) = \frac{1}{µ} \exp\left( -\frac{x}{µ} \right).   (5.14)

The moments of this distribution are given by

<x^n> = n!\, µ^n \quad \text{so that} \quad <x> = µ, \quad <x^2> = 2µ^2 \quad \text{and} \quad <∆x^2> = µ^2.

Another interesting case is a continuous distribution defined on [−∞,∞] with a constraint on the variance

<∆x^2> = \int_{-\infty}^{\infty} (x-µ)^2\, p(x|I)\, dx = σ^2.

We find from the continuum limit of (5.11) after normalization

p(x|I) = \sqrt{\frac{λ}{π}} \exp\left[ -λ(x-µ)^2 \right].

The constraint on the variance gives

<∆x^2> = \sqrt{\frac{λ}{π}} \int_{-\infty}^{\infty} (x-µ)^2 \exp\left[ -λ(x-µ)^2 \right] dx = \frac{1}{2λ} = σ^2

so that x turns out to be Gaussian distributed

p(x|µ, σ, I) = \frac{1}{σ\sqrt{2π}} \exp\left[ -\frac{1}{2} \left( \frac{x-µ}{σ} \right)^2 \right].   (5.15)

This is then the third time we encounter the Gaussian: in Section 3.1 as a convenient approximation of the posterior in the neighborhood of the mode, in Section 4.5 as the limiting distribution of a sum of random variables and in this section as the least informative distribution consistent with a constraint on the variance.

As an example of a non-uniform Lebesgue measure we will now derive the Poisson distribution from the maximum entropy principle. We want to know the distribution P(n|I) of n counts in a time interval ∆t when there is nothing given but the average

\sum_{n=0}^{\infty} n\, P(n|I) = µ.   (5.16)

To find the Lebesgue measure of the time interval ∆t we divide it into a very large number (M) of intervals δt. A particular distribution of counts over these infinitesimal boxes is called a micro-state. If the micro-states are independent and equally probable then it follows from the sum rule that the probability of observing n counts is proportional to the number of micro-states which have n boxes occupied. It is easy to see that for M ≫ n this number is given by M^n/n!. Upon normalization we then have for the Lebesgue measure

m(n) = \frac{M^n}{n!}\, e^{-M}.   (5.17)


Inserting this result in (5.11) we get

P(n|I) = \frac{C\, e^{-M} \left( M e^{-λ} \right)^n}{n!} \qquad \text{with} \qquad \frac{1}{C} = e^{-M} \sum_{n=0}^{\infty} \frac{\left( M e^{-λ} \right)^n}{n!} = e^{-M} \exp\left( M e^{-λ} \right),

so that

P(n|I) = \frac{\left( M e^{-λ} \right)^n}{\exp\left( M e^{-λ} \right)\, n!}.   (5.18)

To calculate the average we observe that

\sum_{n=0}^{\infty} n\, \frac{\left( M e^{-λ} \right)^n}{n!} = -\frac{∂}{∂λ} \sum_{n=0}^{\infty} \frac{\left( M e^{-λ} \right)^n}{n!} = -\frac{∂}{∂λ} \exp\left( M e^{-λ} \right) = M e^{-λ} \exp\left( M e^{-λ} \right).

Combining this with (5.18) we find from the constraint (5.16)

M e^{-λ} = µ \quad \Rightarrow \quad P(n|µ) = \frac{µ^n}{n!}\, e^{-µ}   (5.19)

which is the same result as derived in Section 4.4.

6 Parameter Estimation

In data analysis the measurements are often described by a parameterized model. In hypothesis testing such a model is called a composite hypothesis (i.e. one with parameters) in contrast to a simple hypothesis (without parameters). Given a composite hypothesis, the problem is how to extract information on the parameters from the data. This is called parameter estimation. It is important to realize that the composite hypothesis is assumed here to be true; investigating the plausibility of the hypothesis itself, by comparing it to a set of alternatives, is called ‘model selection’. This will be the subject of Section 6.5.

The relation between the model and the data is encoded in the likelihood function

p(d|a, s, I)

where d denotes a vector of data points and a and s are the model parameters which we have subdivided into two classes:

1. The class a of parameters of interest;

2. The class s of so-called nuisance parameters which are necessary to model the data but are otherwise of no interest. These parameters usually describe the systematic uncertainties due to detector calibration, acceptance corrections and so on. Input parameters also belong to this class like, for instance, an input value of the strong coupling constant αs ± ∆αs taken from the literature.


There may also be parameters in the model which have known values. These are, if not explicitly listed, included in the background information ‘I’.

Given a prior distribution for the parameters a and s, Bayes’ theorem gives for the joint posterior distribution

p(a, s|d, I)\, da\, ds = \frac{p(d|a, s, I)\, p(a, s|I)\, da\, ds}{\int p(d|a, s, I)\, p(a, s|I)\, da\, ds}.   (6.1)

The posterior of the parameters a is then obtained by marginalization of the nuisance parameters s:

p(a|d, I) = \int p(a, s|d, I)\, ds.   (6.2)

As we have discussed in Section 5, some care has to be taken in choosing appropriate priors for the parameters a. However, a very nice feature is that unphysical regions can be excluded from the posterior by simply setting the prior to zero. In this way it is—to give an example—impossible to obtain a negative value for the neutrino mass even if that would be preferred by the likelihood. The priors for s are assumed to be known from detector studies (Monte Carlo simulations) or, in case of external parameters, from the literature. Note that the marginalization (6.2) provides a very elegant way to propagate the uncertainties in the parameters s to the posterior distribution of a (‘systematic error propagation’, see Section 6.4).

Bayesian parameter estimation is thus fully described by the equations (6.1) and (6.2). However, these two innocent looking formulae often represent a large amount of numerical computation to evaluate the integrals. This may be a far from trivial task, in particular when our estimation problem is multi-dimensional. Considerable simplifications occur when two or more variables are independent (the probability distributions then factorize), when the distributions are Gaussian or when the model is linear in the parameters.

In the following subsections we will discuss a few simple cases which are frequently encountered in data analysis.

6.1 Gaussian sampling

Suppose we make a series of measurements of the temperature in a room. Let these measurements be randomly distributed according to a Gaussian with a known width σ (resolution of the thermometer). We now ask the question: what is the best estimate µ of the temperature?

Since the measurements are independent, we can use the product rule (2.7) and write for the likelihood

p(d|µ, σ) = \prod_{i=1}^{n} p(d_i|µ, σ) = \frac{1}{(2πσ^2)^{n/2}} \exp\left[ -\frac{1}{2} \sum_{i=1}^{n} \left( \frac{d_i - µ}{σ} \right)^2 \right].

Assuming a uniform prior for µ the posterior becomes

p(µ|d, σ) = C \exp\left[ -\frac{1}{2} \sum_{i=1}^{n} \left( \frac{d_i - µ}{σ} \right)^2 \right].   (6.3)


To calculate the normalization constant it is convenient to write

\sum_{i=1}^{n} (d_i - µ)^2 = V + n(\bar{d} - µ)^2

where

\bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i \qquad \text{and} \qquad V = \sum_{i=1}^{n} (d_i - \bar{d})^2.

The constant C is now obtained from

\frac{1}{C} = \exp\left( -\frac{V}{2σ^2} \right) \int_{-\infty}^{\infty} \exp\left[ -\frac{n(\bar{d} - µ)^2}{2σ^2} \right] dµ = \sqrt{\frac{2πσ^2}{n}} \exp\left( -\frac{V}{2σ^2} \right).

Inserting this result in (6.3) we find

p(µ|\bar{d}, σ, n) = \sqrt{\frac{n}{2πσ^2}} \exp\left[ -\frac{n}{2} \left( \frac{\bar{d} - µ}{σ} \right)^2 \right].   (6.4)

But this is just a Gaussian with mean \bar{d} and width σ/\sqrt{n}. Thus we have the well known result

µ = \bar{d} \pm \frac{σ}{\sqrt{n}}.   (6.5)

Exercise 6.1: Derive (6.5) directly from (6.3) by expanding L = −ln p using equations (3.6), (3.7) and (3.9) in Section 3.1. Calculate the width as the inverse of the Hessian.

Next, suppose that σ is unknown. Assuming a uniform prior

p(σ|I) = \begin{cases} 0 & \text{for } σ \leq 0 \\ \text{constant} & \text{for } σ > 0 \end{cases}

we find for the posterior

p(µ, σ|\bar{d}, V, n) = \frac{C}{σ^n} \exp\left[ -\frac{V + n(\bar{d} - µ)^2}{2σ^2} \right]   (6.6)

where, again, C is a normalization constant. In Fig. 4a we show this joint distribution for four samples (n = 4) drawn from a Gauss distribution of zero mean and unit width. Here we have set the random variables in (6.6) to \bar{d} = 0 and V = 4.

The posterior for µ is found by integrating over σ:

p(µ|\bar{d}, V, n) = \int_{0}^{\infty} p(µ, σ|\bar{d}, V, n)\, dσ = C' \left[ \frac{1}{V + n(\bar{d} - µ)^2} \right]^{(n-1)/2}.   (6.7)

This integral converges only for n ≥ 2 which is not surprising since one measurement cannot carry information on σ. When n = 2, the distribution (6.7) is improper (cannot be normalized on [−∞,∞]). Calculating the normalization constant C' for n ≥ 3 we find the Student-t distribution for ν = n − 2 degrees of freedom:

p(µ|\bar{d}, V, n) = \frac{Γ[(n-1)/2]}{Γ[(n-2)/2]} \sqrt{\frac{n}{πV}} \left[ \frac{V}{V + n(\bar{d} - µ)^2} \right]^{(n-1)/2}.   (6.8)

This distribution can be written in a form which has the number of degrees of freedom ν as the only parameter.


Figure 4: (a) The joint distribution p(µ, σ|\bar{d}, V, n) for \bar{d} = 0, V = 4 and n = 4; (b) the marginal distribution p(µ|\bar{d}, V, n) (full curve) compared to a Gaussian with σ = 1/\sqrt{3} (dotted curve); (c) the marginal distribution p(σ|V, n).

Exercise 6.2: Transform t^2 = ν(ν + 2)(\bar{d} − µ)^2/V with ν = n − 2 > 0 and show that (6.8) can be written as

p(µ|\bar{d}, V, n)\, dµ \to p(t|ν)\, dt = \frac{Γ[(ν+1)/2]}{Γ(ν/2)\sqrt{πν}} \left[ \frac{1}{1 + t^2/ν} \right]^{(ν+1)/2} dt.

This is the expression for the Student-t distribution usually found in the literature.

Exercise 6.3: Show by expanding the negative logarithm of (6.7) that the maximum of the posterior is located at µ = \bar{d} and that the Hessian is given by H = n(n − 1)/V. Note that we do not need to know the normalization constant C' for this calculation.

Characterizing the posterior by mode and width we find a result similar to (6.5)

µ = \bar{d} \pm \frac{S}{\sqrt{n}} \qquad \text{with} \qquad S^2 = \frac{V}{n-1}.   (6.9)

In Fig. 4b we show the marginal distribution (6.8) for \bar{d} = 0, V = 4 and n = 4 (full curve) compared to a Gaussian with zero mean and width \sqrt{V/[n(n-1)]} = 1/\sqrt{3}. It is seen that the Student-t distribution is similar to a Gaussian but has much longer tails.
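In practice the summaries (6.5) and (6.9) are obtained from a data sample with a few lines of code. The sketch below is an added illustration; the four ‘temperature’ readings and the resolution σ = 0.6 are invented numbers.

```python
import numpy as np

d = np.array([21.3, 20.7, 21.9, 20.5])   # hypothetical thermometer readings
n = len(d)
dbar = d.mean()
V = ((d - dbar)**2).sum()

# Known resolution sigma: Eq. (6.5)
sigma = 0.6                               # assumed thermometer resolution
print(dbar, sigma / np.sqrt(n))

# Unknown sigma: mode and width of the Student-t posterior, Eq. (6.9)
S = np.sqrt(V / (n - 1))
print(dbar, S / np.sqrt(n))
```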

Likewise we can integrate (6.6) over µ to obtain the posterior for σ:

p(σ|\bar{d}, V, n) = \int_{-\infty}^{\infty} p(µ, σ|\bar{d}, V, n)\, dµ = \frac{C'}{σ^{n-1}} \exp\left( -\frac{V}{2σ^2} \right).   (6.10)

Integrating this equation over σ to calculate the normalization constant C' we find the χ² distribution for ν = n − 2 degrees of freedom

p(σ|V, n) = 2 \left( \frac{V}{2} \right)^{ν/2} \frac{1}{Γ(ν/2)\, σ^{ν+1}} \exp\left( -\frac{V}{2σ^2} \right).   (6.11)

This distribution is shown for V = 4 and n = 4 in Fig. 4c.


Exercise 6.4: Transform χ² = V/σ² and show that (6.11) can be written in the more familiar form with ν as the only parameter

p(σ|V, n)\, dσ \to p(χ²|ν)\, dχ² = \frac{(χ²)^{α-1} \exp(-\frac{1}{2}χ²)}{2^α Γ(α)}\, dχ²

with α = ν/2. Use the definition of the Gamma function

Γ(z) = \int_{0}^{\infty} t^{z-1} e^{-t}\, dt

and the property Γ(z + 1) = zΓ(z) to show that the mean and variance of the χ² distribution are given by ν and 2ν, respectively.

Exercise 6.5: Show by expanding the negative logarithm of (6.10) that the maximum of the posterior is located at σ = \sqrt{V/(n-1)} and that the Hessian is given by H = 2(n-1)^2/V.

The mode and width of p(σ|V, n) are given by

σ = \sqrt{\frac{V}{n-1}} \pm \frac{1}{n-1}\sqrt{\frac{V}{2}}.   (6.12)

More familiar measures of the χ² distribution for ν degrees of freedom are the average and the variance (see Exercise 6.4)

<χ²> = ν \qquad \text{and} \qquad <∆χ²> = 2ν.

6.2 Least squares

In this section we consider the case that the data can be described by a function f(x; a) of a variable x depending on a set of parameters a. For simplicity we will consider only functions of one variable; the extension to more dimensions is trivial. Suppose that we have made a series of measurements {d_i} at the sample points {x_i} and that each measurement is distributed according to some sampling distribution p_i(d_i|µ_i, σ_i). Here µ_i and σ_i characterize the position and the width of the sampling distribution p_i of data point d_i. We parameterize the positions µ_i by the function f:

µ_i(a) = f(x_i; a).

If the measurements are independent we can write for the likelihood

p(d|a, I) = \prod_{i=1}^{n} p_i[d_i|µ_i(a), σ_i].

Introducing the somewhat more compact notation p_i(d_i|a, I) for the sampling distributions, we write for the posterior distribution

p(a|d, I) = C \left( \prod_{i=1}^{n} p_i(d_i|a, I) \right) p(a|I)   (6.13)


where C is a normalization constant and p(a|I) is the joint prior distribution of the parameters a. The position and width of the posterior can be found by minimizing

L(a) = -\ln[p(a|d, I)] = -\ln(C) - \ln[p(a|I)] - \sum_{i=1}^{n} \ln[p_i(d_i|a, I)]   (6.14)

as described in Section 3.1, equations (3.6)–(3.9). In practice this is often done numerically by presenting L(a) to a minimization program like minuit. Note that this procedure can be carried out for any sampling distribution p_i, be it Binomial, Poisson, Cauchy, Gauss or whatever. In case the prior p(a|I) is chosen to be uniform, the second term in (6.14) is constant and the procedure is called a maximum likelihood fit.

The most common case encountered in data analysis is when the sampling distributions are Gaussian. For a uniform prior, (6.14) reduces to

L(a) = \text{constant} + \frac{1}{2}χ^2 = \text{constant} + \frac{1}{2} \sum_{i=1}^{n} \left[ \frac{d_i - µ_i(a)}{σ_i} \right]^2.   (6.15)

We then speak of χ² minimization or least squares minimization. When the function f(x; a) is linear in the parameters, the minimization can be reduced to a single matrix inversion, as we will now show.

A function which is linear in the parameters can generically be expressed by

f(x; a) = \sum_{λ=1}^{m} a_λ f_λ(x)   (6.16)

where the f_λ are a set of functions of x and the a_λ are the coefficients to be determined from the data. We denote by w_i ≡ 1/σ_i^2 the weight of each data point and write for the log posterior

L(a) = \frac{1}{2} \sum_{i=1}^{n} w_i \left[ d_i - f(x_i; a) \right]^2 = \frac{1}{2} \sum_{i=1}^{n} w_i \left[ d_i - \sum_{λ=1}^{m} a_λ f_λ(x_i) \right]^2.   (6.17)

The mode \hat{a} is found by setting the derivatives of L with respect to all parameters to zero:

\frac{∂L(a)}{∂a_µ} = -\sum_{i=1}^{n} w_i \left[ d_i - \sum_{λ=1}^{m} a_λ f_λ(x_i) \right] f_µ(x_i) = 0.   (6.18)

We can write this equation in vector notation as

b = W\hat{a} \qquad \text{so that} \qquad \hat{a} = W^{-1} b   (6.19)

where the (symmetric) matrix W and the vector b are given by

W_{λµ} = \sum_{i=1}^{n} w_i\, f_λ(x_i) f_µ(x_i) \qquad \text{and} \qquad b_µ = \sum_{i=1}^{n} w_i\, d_i\, f_µ(x_i).   (6.20)

Differentiating (6.17) twice with respect to the a_λ yields an expression for the Hessian

H_{λµ} = \frac{∂^2 L(a)}{∂a_λ ∂a_µ} = W_{λµ}.   (6.21)


Higher derivatives vanish so that the quadratic expansion (3.8) in Section 3.1 is exact.

To summarize, when the function to be fitted is linear in the parameters we can build a vector b and a matrix W as defined by (6.20). The posterior (assuming uniform priors) is then a multivariate Gaussian with mean \hat{a} = W^{-1} b and covariance matrix V ≡ H^{-1} = W^{-1}. In this way, a fit to the data is reduced to one matrix inversion and does not need starting values for the parameters, nor iterations, nor convergence criteria.

Exercise 6.6: Calculate the matrix W and the vector b of (6.20) for a polynomial parameterization of the data

f(x; a) = a_1 + a_2 x + a_3 x^2 + a_4 x^3 + \cdots.

Exercise 6.7: Show that a fit to a constant results in the weighted average of the data

a_1 = \frac{\sum_i w_i d_i}{\sum_i w_i} \pm \frac{1}{\sqrt{\sum_i w_i}}.
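A minimal implementation of this linear least squares procedure, following (6.19)–(6.21), could look as follows. This is an added sketch; the straight-line data at the bottom are invented for illustration.

```python
import numpy as np

def linear_fit(x, d, w, funcs):
    """Solve Eq. (6.19), a = W^-1 b, for a model f(x;a) = sum_l a_l f_l(x)."""
    F = np.array([[f(xi) for f in funcs] for xi in x])   # F[i, l] = f_l(x_i)
    W = (F.T * w) @ F                                    # Eq. (6.20)
    b = F.T @ (w * d)
    cov = np.linalg.inv(W)                               # covariance = H^-1 = W^-1
    return cov @ b, cov

# Example: straight-line fit d = a1 + a2 x (numbers are illustrative)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
d = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
w = np.full_like(x, 1 / 0.1**2)                          # all sigma_i = 0.1
a, cov = linear_fit(x, d, w, [lambda t: 1.0, lambda t: t])
print(a, np.sqrt(np.diag(cov)))
```

The returned covariance matrix is W^{-1}, so the parameter errors are simply the square roots of its diagonal elements.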

6.3 Example: no signal

In this section we describe a typical case where the likelihood prefers a negative value for a positive definite quantity. Defining confidence limits in such a case is a notoriously difficult problem in Frequentist statistics. But in our Bayesian approach the solution is trivial, as is illustrated by the following example where a negative counting rate is found after background subtraction.

A search was made by the NA49 experiment at the CERN SPS for D0 production in a large sample of 4 × 10^6 Pb-Pb collisions at a beam energy of 158 GeV per nucleon [18]. Since NA49 does not have secondary vertex reconstruction capabilities, all pairs of positive and negative tracks in the event were accumulated in invariant mass spectra assuming that the tracks originate from the decays D0 → K−π+ or D̄0 → K+π−. In the left-hand side plot of Fig. 5 we show the invariant mass spectrum of the D0 candidates. The vertical lines indicate a ±90 MeV window around the nominal D0 mass. The large combinatorial background is due to a multiplicity of approximately 1400 charged tracks per event giving, for each event, about 5 × 10^5 entries in the histogram. In the right-hand side plot we show the invariant mass spectrum after background subtraction. Clearly no signal is observed.

A least squares fit to the data of a Cauchy line shape on top of a polynomial background yielded a negative value N(D0 + D̄0) = −0.36 ± 0.74 per event, as shown by the full curve in the right-hand side plot of Fig. 5. As already mentioned above, this is a typical example of a case where the likelihood favors an outcome which is unphysical.

To calculate an upper limit on the D0 yield, Bayesian inference is used as follows. First, the likelihood of the data d is written as a multivariate Gaussian in the parameters a which describe the D0 yield and the background shape

p(d|a, I) = \frac{1}{\sqrt{(2π)^n |V|}} \exp\left[ -\frac{1}{2} (a - \hat{a})\, V^{-1} (a - \hat{a}) \right]   (6.22)


Figure 5: Left: The invariant mass distribution of D0 candidates in 158A GeV Pb-Pb collisions at the CERN SPS. The open (shaded) histograms are before (after) applying cuts to improve the significance. The vertical lines indicate a ±90 MeV window around the nominal D0 mass. Right: The D0 + D̄0 invariant mass spectrum after background subtraction. The full curve is a fit to the data assuming a fixed signal shape. The other curves correspond to model predictions of the D0 yield.

where n is the number of parameters and \hat{a} and V are the best values and covariance matrix as obtained from minuit, see also Section 3.1. Taking flat priors for the background parameters leads to the posterior distribution

p(a|d, I) = \frac{C}{\sqrt{(2π)^n |V|}} \exp\left[ -\frac{1}{2} (a - \hat{a})\, V^{-1} (a - \hat{a}) \right] p(N|I)   (6.23)

where C is a normalization constant and p(N|I) is the prior for the D0 yield N. The posterior for N is now obtained by integrating (6.23) over the background parameters. As explained in Section 3.1 this yields a one-dimensional Gauss with a variance given by the diagonal element σ² = V_{NN} of the covariance matrix. Thus we have

p(N|d, I) = \frac{C}{σ\sqrt{2π}} \exp\left[ -\frac{1}{2} \left( \frac{N - \hat{N}}{σ} \right)^2 \right] p(N|I) = C\, g(N; \hat{N}, σ)\, p(N|I)   (6.24)

where we have introduced the short-hand notation g() for a one-dimensional Gaussian density. In (6.24), \hat{N} = −0.36 and σ = 0.74 as obtained from the fit to the data.

As a last step we encode in the prior p(N|I) our knowledge that N is positive definite

P(N|I) ∝ θ(N) \qquad \text{with} \qquad θ(N) = \begin{cases} 0 & \text{for } N < 0 \\ 1 & \text{for } N ≥ 0. \end{cases}   (6.25)

Inserting (6.25) in (6.24) and integrating over N to calculate the constant C we find

p(N|d, I) = g(N; \hat{N}, σ)\, θ(N) \left[ \int_{0}^{\infty} g(N; \hat{N}, σ)\, dN \right]^{-1}.   (6.26)

The posterior distribution is thus a Gaussian with mean and variance as obtained from the fit. This Gaussian is set to zero for N < 0 and re-normalized to unity for N ≥ 0.


The upper limit (N_{max}) corresponding to a given confidence level (CL) is then calculated from

CL = \int_{0}^{N_{max}} p(N|d, I)\, dN = \int_{0}^{N_{max}} g(N; \hat{N}, σ)\, dN \left[ \int_{0}^{\infty} g(N; \hat{N}, σ)\, dN \right]^{-1}.   (6.27)

Using the numbers quoted above we find N(D0 + D̄0) < 1.5 per event at 98% CL.
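The limit (6.27) amounts to integrating a truncated Gaussian, which can be written compactly in terms of the normal CDF. The sketch below is an added illustration using scipy.stats; it reproduces the quoted limit from \hat{N} = −0.36 and σ = 0.74.

```python
from scipy.stats import norm

Nhat, sigma, CL = -0.36, 0.74, 0.98                 # fit result and confidence level
below_zero = norm.cdf(0.0, loc=Nhat, scale=sigma)   # Gaussian mass in the unphysical region
target = below_zero + CL * (1.0 - below_zero)       # Eq. (6.27) rewritten with the CDF
Nmax = norm.ppf(target, loc=Nhat, scale=sigma)
print(Nmax)                                         # about 1.5, as quoted in the text
```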

6.4 Systematic error propagation

Data are often subject to sources of uncertainty which cause a simultaneous fluctuation of more than one data point. We will call these correlated uncertainties systematic errors in contrast to statistical errors which are—by definition—point to point uncorrelated.

To propagate the systematic uncertainties to the parameters a of interest one often offsets the data by each systematic error in turn, redoes the analysis, and then adds the deviations from the optimal values \hat{a} in quadrature. Such an intuitive ad hoc procedure (offset method) has no sound theoretical foundation and may even spoil your result by assigning errors which are far too large, see [14] for an illustrative example and also Exercise 6.10 below.

To take systematic errors into account we will include them in the data model. This can of course be done in many ways, depending on the experiment being analyzed. Here we restrict ourselves to a linear parameterization which has the advantage that it is easily incorporated in any least squares minimization procedure. This model, as it stands, does not handle asymmetric errors. However, in case we deal with several systematic sources these asymmetries tend to vanish by virtue of the central limit theorem.

In Fig. 6 we show a systematic distortion of a set of data points

d_i → d_i + s∆_i.

Here ∆_i is a list of systematic deviations and s is an interpolation parameter which controls the amount of systematic shift applied to the data.


Figure 6: Systematic distortion (black symbols) of a set of data points (gray symbols) for two values of the interpolation parameter s = +1 (left) and s = −1 (right).

Usually there are several sources of systematic error stemming from uncertainties in the detector calibration, acceptance, efficiency and so on. For m such sources, the data model can be written as

d_i = t_i(a) + r_i + \sum_{λ=1}^{m} s_λ ∆_{iλ},   (6.28)

where t_i(a) = f(x_i; a) is the theory prediction containing the parameters a of interest and ∆_{iλ} is the correlated error on point i stemming from source λ. In (6.28), the uncorrelated statistical fluctuations of the data are described by the independent Gaussian random variables r_i of zero mean and variance σ_i^2. The s_λ are independent Gaussian random variables of zero mean and unit variance which account for the systematic fluctuations. The joint distribution of r and s is thus given by

p(r, s|I) = \prod_{i=1}^{n} \frac{1}{σ_i\sqrt{2π}} \exp\left( -\frac{r_i^2}{2σ_i^2} \right) \prod_{λ=1}^{m} \frac{1}{\sqrt{2π}} \exp\left( -\frac{s_λ^2}{2} \right).

The covariance matrix of this joint distribution is trivial:

<r_i r_j> = σ_i^2 δ_{ij}, \qquad <s_λ s_µ> = δ_{λµ}, \qquad <r_i s_λ> = 0.   (6.29)

Because the data are a linear combination of the Gaussian random variables r and s it follows that d is also Gaussian distributed

p(d|I) = \frac{1}{\sqrt{(2π)^n |V|}} \exp\left[ -\frac{1}{2} (d - \bar{d})\, V^{-1} (d - \bar{d}) \right].   (6.30)

The mean \bar{d} is found by taking the average of (6.28)

\bar{d}_i = <d_i> = t_i(a) + <r_i> + \sum_{λ=1}^{m} <s_λ>\, ∆_{iλ} = t_i(a).   (6.31)

A transformation of (6.29) by linear error propagation (see Section 3.2) gives for the covariance matrix

V = \begin{pmatrix} σ_1^2 + S_{11} & S_{12} & \cdots & S_{1n} \\ S_{21} & σ_2^2 + S_{22} & \cdots & S_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ S_{n1} & S_{n2} & \cdots & σ_n^2 + S_{nn} \end{pmatrix} \qquad \text{with} \qquad S_{ij} = \sum_{λ=1}^{m} ∆_{iλ}∆_{jλ}.   (6.32)

Exercise 6.8: Use the propagation formula (3.19) in Section 3.2 to derive (6.32) from (6.28) and (6.29).

It is also easy to calculate this covariance matrix by directly averaging the product ∆d_i ∆d_j. Because all the cross terms vanish by virtue of (6.29) we immediately obtain

V_{ij} = <∆d_i ∆d_j> = <(r_i + \sum_λ s_λ ∆_{iλ})(r_j + \sum_λ s_λ ∆_{jλ})>
       = <r_i r_j> + \sum_λ \sum_µ ∆_{iλ}∆_{jµ} <s_λ s_µ> + \cdots
       = σ_i^2 δ_{ij} + \sum_λ ∆_{iλ}∆_{jλ}.


Inserting (6.31) in (6.30) and assuming a uniform prior, the log posterior of the parameters a can be written as

L(a) = -\ln[p(a|d)] = \text{constant} + \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} [d_i - t_i(a)]\, V^{-1}_{ij}\, [d_j - t_j(a)].   (6.33)

Minimizing L defined by (6.33) is impractical because we need the inverse of the covariance matrix (6.32) which can become very large. Furthermore, when the systematic errors dominate, the matrix might—numerically—be uncomfortably close to a matrix with the simple structure V_{ij} = ∆_i∆_j, which is singular.

Fortunately (6.33) can be cast into an alternative form which avoids the inversion of large matrices [15]. Our derivation of this result is based on the standard steps taken in a Bayesian inference: (i) use the data model to write an expression for the likelihood; (ii) define prior probabilities; (iii) calculate posterior probabilities with Bayes’ theorem and (iv) integrate over the nuisance parameters.

The likelihood p(d|a, s) is calculated from the decomposition in the variables r

p(d|a, s) = \int dr\, p(d, r|a, s) = \int dr\, p(d|r, a, s)\, p(r|a, s)   (6.34)

The data model (6.28) is incorporated through the trivial assignment

p(d|r, a, s) = \prod_{i=1}^{n} δ\left[ r_i + t_i(a) + \sum_{λ=1}^{m} s_λ ∆_{iλ} - d_i \right].   (6.35)

As already discussed above, the distribution of r is taken to be

p(r|a, s) = \prod_{i=1}^{n} \frac{1}{σ_i\sqrt{2π}} \exp\left( -\frac{r_i^2}{2σ_i^2} \right).   (6.36)

Inserting (6.35) and (6.36) in (6.34) yields, after integration over r,

p(d|a, s) = \prod_{i=1}^{n} \frac{1}{σ_i\sqrt{2π}} \exp\left[ -\frac{1}{2} \left( \frac{d_i - t_i(a) - \sum_λ s_λ∆_{iλ}}{σ_i} \right)^2 \right].   (6.37)

Assuming a Gaussian prior for s

p(s|I) = (2π)^{-m/2} \prod_{λ=1}^{m} \exp(-\tfrac{1}{2} s_λ^2)

and a uniform prior for a, the joint posterior distribution can be written as

p(a, s|d) = C \exp\left[ -\frac{1}{2} \sum_{i=1}^{n} w_i \left( d_i - t_i(a) - \sum_{λ=1}^{m} s_λ∆_{iλ} \right)^2 - \frac{1}{2} \sum_{λ=1}^{m} s_λ^2 \right]   (6.38)

where w_i = 1/σ_i^2. The log posterior L = −ln p can now be minimized numerically (for instance by minuit) with respect to the parameters a and s. Marginalization of the nuisance parameters s, as described in Section 3.1, then yields the desired result. Clearly we have now got rid of our large covariance matrix (6.32), at the expense of extending the parameter space from a to a × s. In global data analysis where many experiments are combined the number of systematic sources s can become quite large, so that minimizing L of (6.38) may still not be very attractive.

However, the fact that L is quadratic in s allows us to carry out the minimization and marginalization with respect to s analytically. For this, we expand L like in (3.6) but only in s and not in a (it is easy to show that this expansion is exact, i.e. that higher derivatives in s_λ vanish):

L(a, s) = L(a, \hat{s}) + \sum_λ \frac{∂L(a, \hat{s})}{∂s_λ} ∆s_λ + \frac{1}{2} \sum_λ \sum_µ \frac{∂^2 L(a, \hat{s})}{∂s_λ ∂s_µ} ∆s_λ ∆s_µ.

Solving the equations ∂L/∂s_µ = 0 we find

L(a, s) = L(a, \hat{s}) + \frac{1}{2} (s - \hat{s})\, S\, (s - \hat{s})   (6.39)

with

\hat{s} = S^{-1} b, \qquad S_{λµ} = δ_{λµ} + \sum_{i=1}^{n} w_i ∆_{iλ}∆_{iµ}, \qquad b_λ(a) = \sum_{i=1}^{n} w_i [d_i - t_i(a)]\, ∆_{iλ},   (6.40)

and

L(a, \hat{s}) = \frac{1}{2} \sum_{i=1}^{n} w_i \left( d_i - t_i(a) - \sum_{λ=1}^{m} \hat{s}_λ∆_{iλ} \right)^2 + \frac{1}{2} \sum_{λ=1}^{m} \hat{s}_λ^2.   (6.41)

Exercise 6.9: Derive the equations (6.39)–(6.41).

Exponentiating (6.39) we find for the posterior

p(a, s|d) = C \exp\left[ -L(a, \hat{s}) \right] \exp\left[ -\frac{1}{2} (s - \hat{s})\, S\, (s - \hat{s}) \right]

which yields upon integrating over the nuisance parameters s

p(a|d) = C' \exp\left[ -L(a, \hat{s}) \right]   (6.42)

The log posterior (6.41) can now be minimized numerically with respect to the parameters a. Instead of the n × n covariance matrix V of (6.32) only an m × m matrix S has to be inverted, with m the number of systematic sources.

The solution (6.40) for \hat{s} can be substituted back into (6.41). This leads after straightforward algebra to the following very compact and elegant representation of the posterior [15]

p(a|d) = C' \exp\left\{ -\frac{1}{2} \left[ \sum_{i=1}^{n} w_i (d_i - t_i)^2 - b\, S^{-1} b \right] \right\}.   (6.43)


The first term in the exponent is the usual χ² in absence of correlated errors while the second term takes into account the systematic correlations. Note that S does not depend on a so that S^{-1} can be computed once and for all. The vector b, on the other hand, does depend on a so that it has to be recalculated at each step in the minimization loop.
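A direct implementation of the quantity in the exponent of (6.43) might look as follows. This is an added sketch; the data, errors and systematic deviations at the bottom are random numbers used only to cross-check the algebra, and the last two lines verify numerically that (6.33) and (6.43) give the same χ², in line with the remark below that the two forms are identical.

```python
import numpy as np

def chi2_with_systematics(d, t, sigma, Delta):
    """Chi-square of Eq. (6.43): sum_i w_i (d_i - t_i)^2 - b S^-1 b.
    Delta[i, lam] is the correlated error on point i from source lam."""
    w = 1.0 / sigma**2
    e = d - t
    S = np.eye(Delta.shape[1]) + (Delta.T * w) @ Delta   # Eq. (6.40)
    b = Delta.T @ (w * e)
    return (w * e**2).sum() - b @ np.linalg.solve(S, b)

# Cross-check against the full covariance form of Eq. (6.33) on made-up numbers
rng = np.random.default_rng(0)
n, m = 8, 2
d, t = rng.normal(size=n), np.zeros(n)
sigma = np.full(n, 0.5)
Delta = 0.2 * rng.normal(size=(n, m))
V = np.diag(sigma**2) + Delta @ Delta.T                  # Eq. (6.32)
print(chi2_with_systematics(d, t, sigma, Delta))
print(d @ np.linalg.solve(V, d))                         # identical, see [16]
```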

Although the posteriors defined by (6.33) and (6.43) look very different, it can be shown (tedious algebra) that they are mathematically identical [16]. In other words, minimizing (6.33) leads to the same result for a as minimizing the negative logarithm of (6.43).

It is clear that the uncertainty on parameters derived from a given data sample should decrease when new data are added to the sample. This is because additional data will always increase the available information even when these data are very uncertain. From this it follows immediately that the error obtained from an analysis of the total sample can never be larger than the error obtained from an analysis of any sub-sample of the data. It turns out that error estimates based on equations (6.40)–(6.43) do meet this requirement but that the offset method—mentioned in the beginning of this section—does not. This issue is investigated further in the following exercise.

Exercise 6.10: We make n measurements d_i ± σ_i of the temperature in a room. The measurements have a common systematic offset error ∆. Calculate the best estimate µ of the temperature and the total error (statistical ⊕ systematic) by: (i) using the offset method mentioned at the beginning of this section and (ii) using (6.43). To simplify the algebra assume that all data and errors have the same value: d_i = d, σ_i = σ.

A second set of n measurements is added using another thermometer which has the same resolution σ but no offset uncertainty. Calculate the best estimate µ and the total error from both data sets using either the offset method or (6.43). Again, assume that d_i = d and σ_i = σ to simplify the algebra.

Now let n → ∞. Which error estimate makes sense in this limit and which does not?

The systematic errors described in this section were treated as offsets. Another important class are scale or normalization errors. Scale uncertainties are non-linear and their treatment is beyond the scope of this write-up although it is similar to that described above for offset errors: include scale parameters s in the data model, calculate the joint posterior p(a, s|d) assuming a normal, or perhaps a lognormal,23 prior distribution centered around unity and, finally, integrate over s. Considerable care should be taken in correctly formulating the log likelihood (or log posterior) since scale uncertainties not only affect the position but also the widths of the sampling distributions. Therefore normalization terms—which usually are ignored—suddenly become dependent on the (scale) parameters so that they should be taken into account in the minimization. For more on the treatment of normalization uncertainties and possible large biases in the fit results we refer to [17].

6.5 Model selection

In the previous sections we have derived posterior distributions of model parameters under the assumption that the model hypothesis is true. In this section we enlarge the hypothesis space to allow for several competing models and ask the question which model should be preferred in light of the evidence provided by the data. The Bayesian procedure which provides, at least in principle, an answer to this question is called model selection. As we will see, this model selection is not only based on the quality of the data description (‘goodness of fit’) but also on a factor which penalizes models which have a larger number of parameters. Bayesian model selection thus automatically applies Occam’s razor in preferring, to a certain degree, the simpler models.

23 When x is normally (i.e. Gaussian) distributed then y = e^x follows the lognormal distribution.

As already mentioned several times before, Bayesian inference can only assess the plausibility of an hypothesis when this hypothesis is a member of an exclusive and exhaustive set. One can of course always complement a given hypothesis H by its negation H̄ but this does not bring us very far since H̄ is usually too vague a condition for a meaningful probability assignment.²⁴ Thus, in general, we have to include our model H into a finite set of mutually exclusive and exhaustive alternatives {Hk}. This obviously restricts the outcome of our selection procedure to one of these alternatives but allows, on the other hand, to use Bayes' theorem (2.13) to assign posterior probabilities to each of the Hk

$$P(H_k|D,I) = \frac{P(D|H_k,I)\,P(H_k|I)}{\sum_i P(D|H_i,I)\,P(H_i|I)}. \tag{6.44}$$

To avoid calculating the denominator, one often works with the so-called odds ratio (i.e. a ratio of probabilities)

$$O_{kj} = \underbrace{\frac{P(H_k|D,I)}{P(H_j|D,I)}}_{\text{Posterior odds}} = \underbrace{\frac{P(H_k|I)}{P(H_j|I)}}_{\text{Prior odds}} \times \underbrace{\frac{P(D|H_k,I)}{P(D|H_j,I)}}_{\text{Bayes factor}}. \tag{6.45}$$

The first term on the right-hand side is called the prior odds and the second term the Bayes factor.

The selection problem can thus be solved by calculating the odds with (6.45) and accept hypothesis k if Okj is much larger than one, declare the data to be inconclusive if the ratio is about unity and reject k in favor of one of the alternatives if Okj turns out to be much smaller than one. The prior odds are usually set to unity unless there is strong prior information in favor of one of the hypotheses. However, when the hypotheses in (6.45) are composite, then not only the prior odds depend on prior information but also the Bayes factor.

To see this, we follow Sivia [5] in working out an illustrative example where the choice is between a parameter-free hypothesis H0 and an alternative H1 with one free parameter λ. Let us denote a set of data points by d and decompose the probability density p(d|H1) into

$$p(d|H_1) = \int p(d,\lambda|H_1)\,\mathrm{d}\lambda = \int p(d|\lambda,H_1)\,p(\lambda|H_1)\,\mathrm{d}\lambda. \tag{6.46}$$

To evaluate (6.46) we assume a uniform prior for λ in a finite range ∆λ and write

$$p(\lambda|H_1) = \frac{1}{\Delta\lambda}. \tag{6.47}$$

²⁴ We could, for instance, describe a detector response to pions by a probability P(d|π). However, it would be very hard to assign something like a 'not-pion probability' P(d|π̄) without specifying the detector response to members of an alternative set of particles like electrons, kaons, protons etc.



Next, an expansion of the logarithm of the likelihood up to second order in λ gives

$$L(\lambda) = -\ln[\,p(d|\lambda, H_1)\,] = L(\hat\lambda) + \tfrac{1}{2}\,[(\lambda - \hat\lambda)/\sigma]^2 + \cdots$$

where λ̂ is the position of the maximum likelihood and σ is the width as characterized by the inverse of the Hessian. By exponentiating L we obtain

$$p(d|\lambda, H_1) \approx p(d|\hat\lambda, H_1)\, \exp\left[-\frac{1}{2}\left(\frac{\lambda - \hat\lambda}{\sigma}\right)^2\right]. \tag{6.48}$$

Finally, inserting (6.47) and (6.48) in (6.46) gives, upon integration over λ,

$$p(d|H_1) \approx p(d|\hat\lambda, H_1)\, \frac{\sigma\sqrt{2\pi}}{\Delta\lambda}. \tag{6.49}$$

Exercise 6.11: Derive an expression for the posterior of λ by inserting Eqs. (6.47), (6.48) and (6.49) in Bayes' theorem

$$p(\lambda|d, H_1) = \frac{p(d|\lambda, H_1)\, p(\lambda|H_1)}{p(d|H_1)}.$$

Thus we find for the odds ratio (6.45)

$$\underbrace{\frac{P(H_0|d,I)}{P(H_1|d,I)}}_{\text{Posterior odds}} = \underbrace{\frac{P(H_0|I)}{P(H_1|I)}}_{\text{Prior odds}} \times \underbrace{\frac{p(d|H_0)}{p(d|\hat\lambda, H_1)}}_{\text{Likelihood ratio}} \times \underbrace{\frac{\Delta\lambda}{\sigma\sqrt{2\pi}}}_{\text{Occam factor}} \tag{6.50}$$

As already mentioned above, the prior odds can be set to unity unless there is information which gives us prior preference for one model over another. The likelihood ratio will, in general, be smaller than unity and therefore favor the model H1. This is because the additional flexibility of an adjustable parameter usually yields a better description of the data. This preference for models with more parameters leads to the well known phenomenon that one can 'fit an elephant' with enough free parameters.²⁵ This clearly illustrates the inadequacy of using the fit quality as the only criterion in model selection. Indeed, such a criterion alone could never favor a simpler model.

Intuitively we would prefer a model that gives a good description of the data in a wide range of parameter values over one with many fine-tuned parameters, unless the latter would provide a significantly better fit. Such an application of Occam's razor is encoded by the so-called Occam factor in (6.50). This factor tends to favor H0 since it penalizes H1 for reducing a wide parameter range ∆λ to a smaller range σ allowed by the fit. Here we immediately face the problem that H0 would always be favored in case ∆λ is set to infinity. As far as we know, there is no other way but setting prior parameter ranges as honestly as possible when dealing with model selection problems.
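As a concrete, entirely invented illustration of (6.50), the sketch below compares a 'no signal' hypothesis H0 against an alternative H1 with one free location parameter λ, using the Gaussian approximation (6.48) for the likelihood. The prior range ∆λ is an assumption that, as stressed above, has to be set honestly.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Toy data: is there a constant signal lam on top of zero (H0)?
# All numbers below are invented for illustration.
rng = np.random.default_rng(1)
sigma_d = 1.0
d = rng.normal(loc=0.3, scale=sigma_d, size=25)

# H0: lam = 0 fixed;  H1: lam free with a uniform prior of width dlam
def neg_log_like(lam):
    return -np.sum(norm.logpdf(d, loc=lam, scale=sigma_d))

best = minimize_scalar(neg_log_like)           # locates lam_hat
lam_hat = best.x
hessian = len(d) / sigma_d**2                  # d2L/dlam2 for Gaussian sampling
sigma_lam = 1.0 / np.sqrt(hessian)             # width of the likelihood in lam
dlam = 10.0                                    # assumed prior range for lam

log_like_ratio = -neg_log_like(0.0) + best.fun       # ln[p(d|H0)/p(d|lam_hat,H1)]
log_occam = np.log(dlam / (sigma_lam * np.sqrt(2.0 * np.pi)))
log_odds_01 = log_like_ratio + log_occam       # prior odds set to unity

print(lam_hat, sigma_lam, np.exp(log_odds_01))  # odds of H0 against H1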

In case Hi and Hj both have one free parameter (µ and λ) the odds ratio becomes

$$\frac{P(H_i|d,I)}{P(H_j|d,I)} = \frac{P(H_i|I)}{P(H_j|I)} \times \frac{p(d|\hat\mu, H_i)}{p(d|\hat\lambda, H_j)} \times \frac{\Delta\lambda}{\sigma_\lambda}\,\frac{\sigma_\mu}{\Delta\mu}. \tag{6.51}$$

²⁵ Including as many parameters as data points will cause any model to perfectly describe the data.



For similar prior ranges ∆λ and ∆µ the likelihood ratio has to overcome the penalty factor σµ/σλ. This factor favors the model for which the likelihood has the largest width. It may seem a bit strange that the less discriminating model is favored but inspection of (6.49) shows that the evidence P(D|H) carried by the data tends to be larger for models with a larger ratio σ/∆λ, that is, for models which cause a smaller collapse of the hypothesis space when confronted with the data. Note that the choice of prior range is less critical in (6.51) than in (6.50) so that this poses not much of a problem when we use Bayesian model selection to choose between, say, a Breit-Wigner or a Gaussian peak shape in an invariant mass spectrum.

Finally, let us remark that the Occam factor varies like the power of the number of free parameters so that models with many parameters may get very strongly disfavored by this factor.

Exercise 6.12: Generalize (6.51) to the case where Hi has n free parameters λ and Hj has m free parameters µ (with n ≠ m).

This presentation of model selection had to remain very sketchy since practical applications depend strongly on the details of the selection problem at hand. For many worked-out examples we refer to Sivia [5], Gregory [6], Bretthorst [3] and Loredo [1] which also contains a discussion on goodness-of-fit tests used in Frequentist model selection.

7 Counting

It is often thought that Frequentist inference is 'objective' because it is only based on sampling distributions or on distributions derived thereof. However, in the construction of such a sampling distribution we need to specify how repetitions of the experiment are actually carried out. The potential difficulties arising from this are perhaps best illustrated by the so-called stopping problem, already mentioned in Section 2.5.

The stopping problem occurs in counting experiments because the sampling distribution (or likelihood) depends on the strategy we have adopted to halt the experiment. In a simple coin flipping experiment, for instance, the sampling distribution is different when we choose to stop after a fixed number N of throws or choose to stop after the observation of a fixed number n of heads. In the first case the number of heads is the random variable while in the second case it is the number of throws.

In the following subsections we will have a closer look at these two stopping strategies. As we will see, Frequentist inference is sensitive to the stopping rules but Bayesian inference is not.

7.1 Binomial counting

We denote the probability of 'heads' in coin flipping by h and that of 'tails' by 1 − h. We assume that the coin flips are independent so that the probability of observing n heads in N throws is given by the binomial distribution

$$P(n|N,h,I) = \frac{N!}{n!(N-n)!}\, h^n (1-h)^{N-n}. \tag{7.1}$$

If we define as our statistic for h the ratio R = n/N, the average and variance of R are given by (4.8)

$$\langle R\rangle = \frac{\langle n\rangle}{N} = h, \qquad \langle\Delta R^2\rangle = \frac{h(1-h)}{N}. \tag{7.2}$$
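The sampling properties (7.2) are easily checked by simulation; the snippet below (with arbitrarily chosen h and N) is only such a sanity check and not part of the original derivation.

import numpy as np

# Monte Carlo check of (7.2): R = n/N has mean h and variance h(1-h)/N.
rng = np.random.default_rng(0)
h, N, trials = 0.25, 100, 200_000
R = rng.binomial(N, h, size=trials) / N
print(R.mean(), h)                      # ~0.25
print(R.var(), h * (1.0 - h) / N)       # ~0.001875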

From Bayes' theorem we obtain for the posterior of h

$$p(h|n,N,I)\,\mathrm{d}h = C\, P(n|N,h,I)\, p(h|I)\,\mathrm{d}h \tag{7.3}$$

where C is a normalization constant. Assuming a uniform prior p(h|I) = 1 we find by integrating (7.3):

$$\frac{1}{C} = \int_0^1 \frac{N!}{n!(N-n)!}\, h^n (1-h)^{N-n}\,\mathrm{d}h = \frac{1}{N+1}$$

so that we obtain for the normalized posterior

$$p(h|n,N)\,\mathrm{d}h = \frac{(N+1)!}{n!(N-n)!}\, h^n (1-h)^{N-n}\,\mathrm{d}h. \tag{7.4}$$

In Fig. 7 we plot the evolution of the posterior for the first three throws (for more throws see Fig. 3 in Section 5).

[Figure 7: The posterior density p(h|n, N) for n heads in the first three flips of a coin with bias h = 0.25. A uniform prior distribution of h has been assumed. The densities are scaled to unit maximum for ease of comparison.]

From this plot, and also from (7.4), it is seen that the distribution vanishes at the edges of the interval only when at least one head and one tail has been observed. Thus for 1 ≤ n ≤ N − 1 the distribution has its peak inside the interval so that it makes sense to calculate the position and width from L = − ln p. This gives

$$\hat h = \frac{n}{N}, \qquad \sigma_h^2 = \frac{\hat h(1-\hat h)}{N} \qquad (1 \le n \le N-1). \tag{7.5}$$

Note, as a curiosity, that the expressions given above are similar to those for ⟨R⟩ and ⟨∆R²⟩ given in (7.2).
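As an aside, for a uniform prior the posterior (7.4) is a Beta(n+1, N−n+1) density, which allows a quick numerical comparison (with invented counts) of the exact distribution with the mode and Gaussian-approximation width of (7.5); this check is not part of the original text.

import numpy as np
from scipy.stats import beta

n, N = 7, 20
post = beta(n + 1, N - n + 1)           # the posterior (7.4)

h_hat = n / N
sigma = np.sqrt(h_hat * (1.0 - h_hat) / N)

h = np.linspace(1e-6, 1 - 1e-6, 100_001)
print(h[np.argmax(post.pdf(h))], h_hat)  # mode of (7.4) versus n/N
print(post.std(), sigma)                 # exact std versus width from (7.5)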

Exercise 7.1: Derive (7.5)



Exercise 7.2: A counter is traversed by N particles and fires N times. Calculating the efficiency and error from (7.2) gives ε = 1 ± 0 which is an unacceptable estimate for the error. Derive from (7.4) an expression for the lower limit εα corresponding to the α confidence interval defined by the equation

$$\int_{\varepsilon_\alpha}^1 p(\varepsilon|N,N)\,\mathrm{d}\varepsilon = \alpha.$$

Show that for N = 4 and α = 0.65 the result on the efficiency can be reported as

$$\varepsilon = 1^{+0}_{-0.19} \quad (65\%\ \mathrm{CL}).$$

7.2 The negative binomial

Instead of throwing the coin N times we may decide to throw the coin as many times as is necessary to observe n heads. Because the last throw must by definition be a head, and because the probability of this throw does not depend on the previous throws, the probability of N throws is given by

$$P(N|n,h) = P(\text{$n{-}1$ heads in $N{-}1$ throws}) \times P(\text{one head in one throw}) = \frac{(N-1)!}{(n-1)!(N-n)!}\, h^n (1-h)^{N-n} \qquad n \ge 1,\ N \ge n. \tag{7.6}$$

This distribution is known as the negative binomial. In Fig. 8 we show this distribution for a fair coin (h = 0.5) and n = (3, 9, 20) heads.

[Figure 8: The negative binomial P(N|n, h) distribution of the number of trials N needed to observe n = (3, 9, 20) heads in flips of a fair coin (h = 0.5).]

It can be shown that P(N|n,h) is properly normalized:

$$\sum_{N=n}^{\infty} P(N|n,h) = 1.$$

The first and second moments and the variance of this distribution are

$$\langle N\rangle = \sum_{N=n}^{\infty} N\,P(N|n,h) = \frac{n}{h}, \qquad \langle N^2\rangle = \sum_{N=n}^{\infty} N^2 P(N|n,h) = \frac{n(1-h)}{h^2} + \frac{n^2}{h^2}, \qquad \langle\Delta N^2\rangle = \frac{n(1-h)}{h^2}. \tag{7.7}$$
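These sums are easily verified numerically; the check below (with invented n and h, and the infinite sum truncated) is not part of the original text.

import numpy as np
from scipy.special import comb

n, h = 9, 0.5
N = np.arange(n, 2000)
P = comb(N - 1, n - 1) * h**n * (1.0 - h)**(N - n)   # the likelihood (7.6)

print(P.sum())                                   # ~1 (normalization)
print((N * P).sum(), n / h)                      # <N>     = n/h
print((N**2 * P).sum() - (N * P).sum()**2,       # <dN^2>
      n * (1.0 - h) / h**2)                      #         = n(1-h)/h^2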



If we define the ratio Q = N/n as our statistic for z ≡ 1/h it follows directly from (7.7) that the average and variance of Q are given by

$$\langle Q\rangle = z, \qquad \langle\Delta Q^2\rangle = \frac{z(z-1)}{n}. \tag{7.8}$$

In the previous section we took R = n/N as a statistic for h because it had the property that ⟨R⟩ = h, when averaged over the binomial distribution. But if we average R over the negative binomial (7.6) then ⟨R⟩ ≠ h. This can easily be seen, without any explicit calculation, from the fact that N and not n is the random variable and that the reciprocal of an average is not the average of the reciprocal:

$$\langle R\rangle = \left\langle\frac{n}{N}\right\rangle = n\left\langle\frac{1}{N}\right\rangle \ne \frac{n}{\langle N\rangle} = h.$$

To calculate the posterior we again assume a flat prior for h so that we can write

$$p(h|N,n)\,\mathrm{d}h = C\,\frac{(N-1)!}{(n-1)!\,(N-n)!}\, h^n (1-h)^{N-n}\,\mathrm{d}h$$

which gives, upon integration, a value for the normalization constant

$$C = \frac{N(N+1)}{n}.$$

The normalized posterior is thus

$$p(h|N,n)\,\mathrm{d}h = \frac{(N+1)!}{n!(N-n)!}\, h^n (1-h)^{N-n}\,\mathrm{d}h. \tag{7.9}$$

This posterior is the same as that given in (7.4) as it should be since the relevant information on h is carried by the observation of how many heads there are in a certain number of throws and not by how the experiment was halted.

It is worthwhile to have a closer look at the likelihoods (7.1) and (7.6) to understand why the posteriors come out to be the same. It is seen that the dependence of both likelihoods on h is given by the term

$$h^n (1-h)^{N-n}.$$

The terms in front are different because of the different stopping rules but these terms do not enter in the inference on h since they do not depend on h. They can thus be absorbed in the normalization constant of the posterior which can in both cases be written as

$$p(h|n,N)\,\mathrm{d}h = C\, h^n (1-h)^{N-n}\,\mathrm{d}h$$

with, obviously, the same constant C.
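A short numerical illustration of this point (with invented counts): the two likelihoods differ only by an h-independent factor, so after normalization with a flat prior the posteriors coincide.

import numpy as np
from scipy.special import comb

n, N = 7, 20
h = np.linspace(1e-6, 1 - 1e-6, 10_001)

like_binom = comb(N, n) * h**n * (1.0 - h)**(N - n)            # (7.1)
like_negbin = comb(N - 1, n - 1) * h**n * (1.0 - h)**(N - n)   # (7.6)

def normalize(pdf, x):
    return pdf / (pdf.sum() * (x[1] - x[0]))

post1 = normalize(like_binom, h)     # flat prior: posterior ∝ likelihood
post2 = normalize(like_negbin, h)
print(np.max(np.abs(post1 - post2)))  # ~0: the posteriors are identical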

Here we have encountered a very important property of Bayesian inference, namely its ability to discard information which is irrelevant. This is in accordance with Cox' desideratum of consistency which states that conclusions should depend on relevant information only. Frequentist analysis does not possess this property since the stopping rule must be specified in order to construct the sampling distribution and a meaningful statistic. Such inference therefore violates at least one of Cox' desiderata.



A Solutions to Selected Exercises

Exercise 1.1

This is because astronomers often have to draw conclusions from the observation of rare events. Bayesian inference is well suited for this since it is based solely on the evidence carried by the data (and prior information) instead of being based on hypothetical repetitions of the experiment.

Exercise 2.3

(i) From the truth table (2.1) it is seen that both B̄ and A ⇒ B are true if and only if A is false. But this is just the implication B̄ ⇒ Ā.

(ii) Negating the minor premises in the syllogisms leads to the conclusions

                     Induction                        Deduction
  Proposition 1:     If A is true then B is true      If A is true then B is true
  Proposition 2:     A is false                       B is false
  Conclusion:        B is less probable               A is false

Exercise 2.4

From de Morgan's law and repeated application of the product and sum rules (2.4) and (2.5) we find

$$\begin{aligned}
P(A+B) &= 1 - P(\overline{A+B}) = 1 - P(\bar A\bar B) \\
&= 1 - P(\bar B|\bar A)P(\bar A) = 1 - P(\bar A)[1 - P(B|\bar A)] \\
&= 1 - P(\bar A) + P(B|\bar A)P(\bar A) = P(A) + P(\bar A B) \\
&= P(A) + P(\bar A|B)P(B) = P(A) + P(B)[1 - P(A|B)] \\
&= P(A) + P(B) - P(A|B)P(B) = P(A) + P(B) - P(AB).
\end{aligned}$$

Exercise 2.5

(i) The probability for Mr. White to have AIDS is

$$P(A|T) = \frac{P(T|A)P(A)}{P(T|A)P(A) + P(T|\bar A)P(\bar A)} = \frac{0.98 \times 0.01}{0.98 \times 0.01 + 0.03 \times 0.99} = 0.25.$$

(ii) For full efficiency P(T|A) = 1 so that

$$P(A|T) = \frac{1 \times 0.01}{1 \times 0.01 + 0.03 \times 0.99} = 0.25.$$

(iii) For zero contamination P(T|Ā) = 0 so that

$$P(A|T) = \frac{0.98 \times 0.01}{0.98 \times 0.01 + 0 \times 0.99} = 1.$$

Exercise 2.6

(i) If nobody has AIDS then P(A) = 0 and thus

$$P(A|T) = \frac{P(T|A) \times 0}{P(T|A) \times 0 + P(T|\bar A) \times 1} = 0.$$

(ii) If everybody has AIDS then P(A) = 1 and thus

$$P(A|T) = \frac{P(T|A) \times 1}{P(T|A) \times 1 + P(T|\bar A) \times 0} = 1.$$

In both these cases the posterior is thus equal to the prior, independent of the likelihood P(T|A).



Exercise 2.7

(i) When µ is known we can write for the posterior distribution

$$P(\pi|S,\mu) = \frac{P(S|\pi)P(\pi)}{P(S|\pi)P(\pi) + P(S|\bar\pi)P(\bar\pi)} = \frac{\epsilon\mu}{\epsilon\mu + \delta(1-\mu)}.$$

(ii) When µ is unknown we decompose P(π|S) in µ which gives

$$P(\pi|S) = \int_0^1 P(\pi,\mu|S)\,\mathrm{d}\mu = \int_0^1 P(\pi|S,\mu)\,p(\mu)\,\mathrm{d}\mu.$$

Assuming a uniform prior p(µ) = 1 gives for the probability that the signal S corresponds to a pion

$$P(\pi|S) = \int_0^1 \frac{\epsilon\mu}{\epsilon\mu + \delta(1-\mu)}\,\mathrm{d}\mu = \frac{\epsilon\,[\epsilon - \delta + \delta\ln(\delta/\epsilon)]}{(\delta-\epsilon)^2},$$

where we have used the Mathematica program to evaluate the integral.
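For readers without Mathematica, the integral is easily cross-checked by numerical quadrature; the values of ε and δ below are invented.

import numpy as np
from scipy.integrate import quad

eps, delta = 0.9, 0.1

closed_form = eps * (eps - delta + delta * np.log(delta / eps)) / (delta - eps) ** 2
numeric, _ = quad(lambda mu: eps * mu / (eps * mu + delta * (1.0 - mu)), 0.0, 1.0)
print(closed_form, numeric)      # the two numbers agree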

Exercise 2.8

The quantities x, x0, d and ϑ are related by x = x0 + d tan ϑ. With p(ϑ|I) = 1/π it follows that p(x|I) is Cauchy distributed

$$p(x|I) = p(\vartheta|I)\left|\frac{\mathrm{d}\vartheta}{\mathrm{d}x}\right| = \frac{1}{\pi}\,\frac{\cos^2\vartheta}{d} = \frac{1}{\pi}\,\frac{d}{(x-x_0)^2 + d^2}.$$

The first and second derivatives of L = − ln p are

$$\frac{\mathrm{d}L}{\mathrm{d}x} = \frac{2(x-x_0)}{(x-x_0)^2+d^2}, \qquad \frac{\mathrm{d}^2L}{\mathrm{d}x^2} = -\frac{4(x-x_0)^2}{[(x-x_0)^2+d^2]^2} + \frac{2}{(x-x_0)^2+d^2}.$$

This gives for the position and width of p(x)

$$\frac{\mathrm{d}L(\hat x)}{\mathrm{d}x} = 0 \;\Rightarrow\; \hat x = x_0, \qquad \frac{1}{\sigma^2} = \frac{\mathrm{d}^2L(\hat x)}{\mathrm{d}x^2} = \frac{2}{d^2} \;\Rightarrow\; \sigma = \frac{d}{\sqrt{2}}.$$

Exercise 2.9

(i) The posterior distribution of the first measurement is

$$P(\pi|S_1) = \frac{P(S_1|\pi)P(\pi)}{P(S_1|\pi)P(\pi) + P(S_1|\bar\pi)P(\bar\pi)} = \frac{\epsilon\mu}{\epsilon\mu + \delta(1-\mu)}.$$

Using this as the prior for the second measurement we have

$$P(\pi|S_1,S_2) = \frac{P(S_2|\pi,S_1)P(\pi|S_1)}{P(S_2|\pi,S_1)P(\pi|S_1) + P(S_2|\bar\pi,S_1)P(\bar\pi|S_1)} = \cdots = \frac{\epsilon^2\mu}{\epsilon^2\mu + \delta^2(1-\mu)}.$$

Here we have assumed that the two measurements are independent, that is,

$$P(S_2|\pi,S_1) = P(S_2|\pi) = \epsilon, \qquad P(S_2|\bar\pi,S_1) = P(S_2|\bar\pi) = \delta.$$

(ii) Direct application of Bayes' theorem gives

$$P(\pi|S_1,S_2) = \frac{P(S_1,S_2|\pi)P(\pi)}{P(S_1,S_2|\pi)P(\pi) + P(S_1,S_2|\bar\pi)P(\bar\pi)} = \frac{\epsilon^2\mu}{\epsilon^2\mu + \delta^2(1-\mu)}.$$

Here we have again assumed that the two measurements are independent, that is,

$$P(S_1,S_2|\pi) = P(S_1|\pi)P(S_2|\pi) = \epsilon^2, \qquad P(S_1,S_2|\bar\pi) = P(S_1|\bar\pi)P(S_2|\bar\pi) = \delta^2.$$

Both results are thus the same when we assume that the two measurements are independent.



Exercise 3.1

Because averaging is a linear operation we have

$$\langle\Delta x^2\rangle = \langle(x-\bar x)^2\rangle = \langle x^2\rangle - 2\bar x\langle x\rangle + \bar x^2 = \langle x^2\rangle - 2\langle x\rangle^2 + \langle x\rangle^2 = \langle x^2\rangle - \langle x\rangle^2.$$

Exercise 3.2

The covariance matrix can be written as

$$V_{ij} = \langle(x_i - \bar x_i)(x_j - \bar x_j)\rangle = \langle x_i x_j\rangle - \langle x_i\rangle\langle x_j\rangle.$$

For independent variables the joint probability factorizes, p(xi, xj|I) = p(xi|I) p(xj|I), so that

$$\langle x_i x_j\rangle = \int\mathrm{d}x_i\, x_i\, p(x_i|I) \int\mathrm{d}x_j\, x_j\, p(x_j|I) = \langle x_i\rangle\langle x_j\rangle.$$

This implies that the off-diagonal elements of Vij vanish.

Exercise 3.3

(i) It is easiest to make the transformation y = 2(x − x0)/Γ so that the Breit-Wigner transforms into

$$p(x|x_0,\Gamma,I) \to p(y|I) = p(x|I)\left|\frac{\mathrm{d}x}{\mathrm{d}y}\right| = \frac{1}{\pi}\,\frac{1}{1+y^2}.$$

For this distribution L = ln π + ln(1 + y²). The first and second derivatives of L are given by

$$\frac{\mathrm{d}L}{\mathrm{d}y} = \frac{2y}{1+y^2}, \qquad \frac{\mathrm{d}^2L}{\mathrm{d}y^2} = -\frac{4y^2}{(1+y^2)^2} + \frac{2}{1+y^2}.$$

From dL/dy = 0 we find ŷ = 0. It follows that the second derivative of L at ŷ is 2 so that the width of the distribution is σy = 1/√2. Transforming back to the variable x = x0 + Γy/2 gives

$$\hat x = x_0 \qquad\text{and}\qquad \sigma_x = \sigma_y\left|\frac{\mathrm{d}x}{\mathrm{d}y}\right| = \frac{\Gamma}{2\sqrt{2}}.$$

(ii) Substituting x = x0 in the Breit-Wigner formula gives for the maximum value 2/(πΓ). Substituting x = x0 ± Γ/2 gives a value of 1/(πΓ) which is just half the maximum.

Exercise 3.4

For z = x + y and z = xy we have

$$p(z|I) = \iint \delta(z-x-y)\, f(x)g(y)\,\mathrm{d}x\,\mathrm{d}y = \int f(z-y)\,g(y)\,\mathrm{d}y \qquad\text{and}$$

$$p(z|I) = \iint \delta(z-xy)\, f(x)g(y)\,\mathrm{d}x\,\mathrm{d}y = \iint \delta(z-w)\, f(w/y)\,g(y)\,\frac{\mathrm{d}w}{|y|}\,\mathrm{d}y = \int f(z/y)\,g(y)\,\frac{\mathrm{d}y}{|y|}.$$

Exercise 3.5

(i) The inverse transformations are

$$x = \frac{u+v}{2}, \qquad y = \frac{u-v}{2}$$

so that the determinant of the Jacobian is

$$|J| = \begin{vmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{vmatrix} = \begin{vmatrix} 1/2 & 1/2 \\ 1/2 & -1/2 \end{vmatrix} = \frac{1}{2}.$$



The joint distribution of u and v is thus

$$p(u,v) = p(x,y)\,|J| = \tfrac{1}{2}\,f(x)g(y) = \tfrac{1}{2}\,f\!\left(\frac{u+v}{2}\right) g\!\left(\frac{u-v}{2}\right).$$

Integration over v gives

$$p(u) = \int p(u,v)\,\mathrm{d}v = \frac{1}{2}\int f\!\left(\frac{u+v}{2}\right) g\!\left(\frac{u-v}{2}\right)\mathrm{d}v = \int f(w)\,g(u-w)\,\mathrm{d}w$$

which is just (3.14).

(ii) Here we have for the inverse transformation and the Jacobian determinant

$$x = \sqrt{uv}, \qquad y = \sqrt{\frac{u}{v}}, \qquad |J| = \begin{vmatrix} v(4uv)^{-1/2} & u(4uv)^{-1/2} \\ (4uv)^{-1/2} & -u(4uv^3)^{-1/2} \end{vmatrix} = \frac{1}{2v}$$

which gives for the joint distribution

$$p(u,v) = \frac{1}{2v}\, f(\sqrt{uv})\, g\!\left(\sqrt{\frac{u}{v}}\right).$$

Marginalization of v gives

$$p(u) = \int p(u,v)\,\mathrm{d}v = \int \frac{\mathrm{d}v}{2v}\, f(\sqrt{uv})\, g\!\left(\sqrt{\frac{u}{v}}\right) = \int \frac{\mathrm{d}w}{w}\, f(w)\, g\!\left(\frac{u}{w}\right)$$

which is just (3.15).

Exercise 3.6

According to (3.14) the distribution of z is given by

$$p(z|I) = \frac{1}{2\pi\sigma_1\sigma_2}\int_{-\infty}^{\infty}\mathrm{d}x\,\exp\left\{-\frac{1}{2}\left[\left(\frac{x-\mu_1}{\sigma_1}\right)^2 + \left(\frac{z-x-\mu_2}{\sigma_2}\right)^2\right]\right\}.$$

The standard way to deal with such an integral is to 'complete the squares', that is, to write the exponent in the form a(x − b)² + c. This allows to carry out the integral

$$\int_{-\infty}^{\infty}\mathrm{d}x\,\exp\left\{-\tfrac{1}{2}\left[a(x-b)^2 + c\right]\right\} = \exp\left(-\tfrac{1}{2}c\right)\int_{-\infty}^{\infty}\mathrm{d}y\,\exp\left(-\tfrac{1}{2}ay^2\right) = \sqrt{\frac{2\pi}{a}}\,\exp\left(-\tfrac{1}{2}c\right).$$

Our problem is now reduced to finding the coefficients a, b and c such that the following equation holds

$$p(x-r)^2 + q(x-s)^2 = a(x-b)^2 + c$$

where the left-hand side is a generic expression for the exponent of the convolution of our two Gaussians. Since the coefficients of the powers of x must be equal at both sides of the equation we have

$$p + q = a, \qquad pr + qs = ab, \qquad pr^2 + qs^2 = ab^2 + c.$$

Solving these equations for a, b and c gives

$$a = p + q, \qquad b = \frac{pr + qs}{p+q}, \qquad c = \frac{pq}{p+q}\,(s-r)^2.$$

Then substituting

$$p = 1/\sigma_1^2, \qquad q = 1/\sigma_2^2, \qquad r = \mu_1, \qquad s = z - \mu_2$$

yields the desired result

$$p(z|I) = \frac{1}{\sqrt{2\pi(\sigma_1^2+\sigma_2^2)}}\,\exp\left[-\frac{(z-\mu_1-\mu_2)^2}{2(\sigma_1^2+\sigma_2^2)}\right].$$



Exercise 3.7

For independent random variables xi with variance σi² we have ⟨∆xi∆xj⟩ = σiσjδij.

(i) For the sum z = Σi xi we have ∂z/∂xi = 1 so that (3.19) gives

$$\langle\Delta z^2\rangle = \sum_i\sum_j \sigma_i\sigma_j\delta_{ij} = \sum_i \sigma_i^2.$$

(ii) For the product z = Πi xi we have ∂z/∂xi = z/xi so that (3.19) gives

$$\langle\Delta z^2\rangle = \sum_i\sum_j \frac{z}{x_i}\,\frac{z}{x_j}\,\sigma_i\sigma_j\delta_{ij} = z^2\sum_i\left(\frac{\sigma_i}{x_i}\right)^2.$$

Exercise 3.8

We have

$$\epsilon = \frac{n}{N} = \frac{n}{n+m}, \qquad \frac{\partial\epsilon}{\partial n} = \frac{m}{(n+m)^2} = \frac{N-n}{N^2}, \qquad \frac{\partial\epsilon}{\partial m} = -\frac{n}{(n+m)^2} = -\frac{n}{N^2}$$

and

$$\langle\Delta n^2\rangle = n, \qquad \langle\Delta m^2\rangle = m = N-n, \qquad \langle\Delta n\,\Delta m\rangle = 0.$$

Inserting this in (3.19) gives for the variance of ε

$$\langle\Delta\epsilon^2\rangle = \left(\frac{\partial\epsilon}{\partial n}\right)^2\langle\Delta n^2\rangle + \left(\frac{\partial\epsilon}{\partial m}\right)^2\langle\Delta m^2\rangle = \left(\frac{N-n}{N^2}\right)^2 n + \left(\frac{n}{N^2}\right)^2(N-n) = \frac{\epsilon(1-\epsilon)}{N}.$$

Exercise 3.9

Writing out (3.24) in components gives, using the definition (3.25),

$$\sum_k V_{ik}\,U^{\mathrm{T}}_{kj} = \sum_k U^{\mathrm{T}}_{ik}\,\lambda_k\delta_{kj} \;\Rightarrow\; \sum_k V_{ik}\,u^j_k = \lambda_j u^j_i \;\Rightarrow\; V u^j = \lambda_j u^j.$$

Exercise 3.10

(i) For a symmetric matrix V and two arbitrary vectors x and y we have

$$y\,V x = \sum_{ij} y_i V_{ij} x_j = \sum_{ij} x_j V^{\mathrm{T}}_{ji} y_i = \sum_{ij} x_j V_{ji} y_i = x\,V y.$$

(ii) Because V is symmetric we have

$$u^i V u^j = u^j V u^i \;\Rightarrow\; \lambda_i u^i u^j = \lambda_j u^i u^j \quad\text{or}\quad (\lambda_i - \lambda_j)\,u^i u^j = 0 \;\Rightarrow\; u^i u^j = 0 \;\text{for}\; \lambda_i \ne \lambda_j.$$

(iii) If V is positive definite we have $u^i V u^i = \lambda_i\,u^i u^i = \lambda_i > 0$.

Exercise 4.1

Decomposition of P(R2|I) in the hypothesis space {R1, W1} gives

$$\begin{aligned}
P(R_2|I) &= P(R_2,R_1|I) + P(R_2,W_1|I) \\
&= P(R_2|R_1,I)P(R_1|I) + P(R_2|W_1,I)P(W_1|I) \\
&= \frac{R-1}{N-1}\,\frac{R}{N} + \frac{R}{N-1}\,\frac{W}{N} = \frac{R}{N}.
\end{aligned}$$



Exercise 4.2

For draws with replacement we have P(R2|R1, I) = P(R2|W1, I) = P(R1|I) = R/N and P(W1|I) = W/N. Inserting this in Bayes' theorem gives

$$P(R_1|R_2,I) = \frac{P(R_2|R_1,I)P(R_1|I)}{P(R_2|R_1,I)P(R_1|I) + P(R_2|W_1,I)P(W_1|I)} = \frac{R}{N}.$$

Exercise 4.3

Without loss of generality we can consider marginalization of the multinomial distribution over all but the first probability. According to (4.10) we have

$$n_2' = \sum_{i=2}^k n_i = N - n_1 \qquad\text{and}\qquad p_2' = \sum_{i=2}^k p_i = 1 - p_1.$$

Inserting this in (4.9) gives the binomial distribution

$$P(n_1, n_2'\,|\,p_1, p_2', N) = \frac{N!}{n_1!(N-n_1)!}\, p_1^{n_1}(1-p_1)^{N-n_1}.$$

Exercise 4.4

From the product rule we have

$$P(n_1,\ldots,n_k|I) = P(n_1,\ldots,n_{k-1}|n_k,I)\,P(n_k|I).$$

From this we find for the conditional distribution

$$\begin{aligned}
P(n_1,\ldots,n_{k-1}|n_k,I) &= \frac{P(n_1,\ldots,n_k|I)}{P(n_k|I)} \\
&= \frac{N!}{n_1!\cdots n_k!}\,p_1^{n_1}\cdots p_{k-1}^{n_{k-1}} p_k^{n_k} \times \frac{n_k!(N-n_k)!}{N!}\,\frac{1}{p_k^{n_k}(1-p_k)^{N-n_k}} \\
&= \frac{(N-n_k)!}{n_1!\cdots n_{k-1}!}\left(\frac{p_1}{1-p_k}\right)^{n_1}\cdots\left(\frac{p_{k-1}}{1-p_k}\right)^{n_{k-1}}.
\end{aligned}$$

Exercise 4.5

(i) The likelihood to observe n counts in a time window ∆t is given by the Poisson distribution

$$P(n|\mu) = \frac{\mu^n}{n!}\exp(-\mu)$$

with µ = R∆t and R the average counting rate. Assuming a flat prior for µ ∈ [0, ∞] the posterior is

$$p(\mu|n) = C\,\frac{\mu^n}{n!}\exp(-\mu)$$

with C a normalization constant which turns out to be unity: the Poisson distribution has the remarkable property that it is normalized with respect to both n and µ:

$$\sum_{n=0}^{\infty} P(n|\mu) = \sum_{n=0}^{\infty}\frac{\mu^n}{n!}\exp(-\mu) = 1 \qquad\text{and}\qquad \int_0^{\infty} p(\mu|n)\,\mathrm{d}\mu = \int_0^{\infty}\frac{\mu^n}{n!}\exp(-\mu)\,\mathrm{d}\mu = 1.$$

The mean, second moment and the variance of the posterior are

$$\langle\mu\rangle = n+1, \qquad \langle\mu^2\rangle = (n+1)(n+2), \qquad \langle\Delta\mu^2\rangle = n+1.$$

The log posterior and the derivatives are

$$L = \text{constant} - n\ln(\mu) + \mu, \qquad \frac{\mathrm{d}L}{\mathrm{d}\mu} = 1 - \frac{n}{\mu}, \qquad \frac{\mathrm{d}^2L}{\mathrm{d}\mu^2} = \frac{n}{\mu^2}.$$



Setting the derivative to zero we find µ̂ = n. The square root of the inverse of the Hessian at the mode gives for the width σ = √n.

(ii) The probability p(τ|R,I) dτ that the time interval between the passage of two particles is between τ and τ + dτ is given by

$$p(\tau|R,I)\,\mathrm{d}\tau = P(\text{'no particle passes during }\tau\text{'}) \times P(\text{'one particle passes during }\mathrm{d}\tau\text{'}) = \exp(-R\tau)\times R\,\mathrm{d}\tau.$$

Exercise 4.6

The derivatives of the characteristic function (4.17) are

$$\frac{\mathrm{d}\phi(k)}{\mathrm{d}k} = (i\mu - k\sigma^2)\exp\left(i\mu k - \tfrac{1}{2}\sigma^2k^2\right), \qquad \frac{\mathrm{d}^2\phi(k)}{\mathrm{d}k^2} = \left[(i\mu - k\sigma^2)^2 - \sigma^2\right]\exp\left(i\mu k - \tfrac{1}{2}\sigma^2k^2\right).$$

From (4.16) we find for the first and second moment

$$\langle x\rangle = \frac{1}{i}\,\frac{\mathrm{d}\phi(0)}{\mathrm{d}k} = \mu, \qquad \langle x^2\rangle = \frac{1}{i^2}\,\frac{\mathrm{d}^2\phi(0)}{\mathrm{d}k^2} = \mu^2 + \sigma^2$$

from which it immediately follows that the variance is given by ⟨x²⟩ − ⟨x⟩² = σ².

Exercise 4.7

From (4.15) and (4.17) we have

$$\phi(k) = \phi_1(k)\,\phi_2(k) = \exp\left[i(\mu_1+\mu_2)k - \tfrac{1}{2}(\sigma_1^2+\sigma_2^2)k^2\right]$$

which is just the characteristic function of a Gauss with mean µ1 + µ2 and variance σ1² + σ2².

Exercise 5.2

From differentiating the logarithm of (5.12) we get

$$-\frac{\partial\ln Z}{\partial\lambda_k} = -\frac{1}{Z}\frac{\partial Z}{\partial\lambda_k} = -\frac{1}{Z}\sum_{i=1}^n m_i\,\frac{\partial}{\partial\lambda_k}\exp\Bigl(-\sum_{j=1}^m\lambda_j f_{ji}\Bigr) = \frac{1}{Z}\sum_{i=1}^n f_{ki}\,m_i\exp\Bigl(-\sum_{j=1}^m\lambda_j f_{ji}\Bigr) = \sum_{i=1}^n f_{ki}\,p_i = \beta_k.$$

Exercise 6.1

The log posterior of (6.3) is given by

$$L = \text{Constant} + \frac{1}{2}\sum_{i=1}^n\left(\frac{d_i-\mu}{\sigma}\right)^2.$$

Setting the first derivative to zero gives

$$\frac{\mathrm{d}L}{\mathrm{d}\mu} = -\sum_{i=1}^n\frac{d_i-\mu}{\sigma^2} = 0 \;\Rightarrow\; \hat\mu = \frac{1}{n}\sum_{i=1}^n d_i.$$

The Hessian and the square root of its inverse at µ̂ are

$$\frac{\mathrm{d}^2L}{\mathrm{d}\mu^2} = \sum_{i=1}^n\frac{1}{\sigma^2} = \frac{n}{\sigma^2} \;\Rightarrow\; \left[\frac{\mathrm{d}^2L(\hat\mu)}{\mathrm{d}\mu^2}\right]^{-\frac{1}{2}} = \frac{\sigma}{\sqrt{n}}.$$



Exercise 6.3

The negative logarithm of the posterior (6.7) is given by

$$L = \text{Constant} + \frac{n-1}{2}\ln\left(V + nz^2\right)$$

where we have set z = µ − d̄. The first derivative to µ is

$$\frac{\mathrm{d}L}{\mathrm{d}\mu} = \frac{\mathrm{d}L}{\mathrm{d}z} = \frac{(n-1)nz}{V+nz^2} = 0 \;\Rightarrow\; z = 0 \;\Rightarrow\; \hat\mu = \bar d.$$

For the second derivative we find

$$\frac{\mathrm{d}^2L}{\mathrm{d}\mu^2} = \frac{\mathrm{d}^2L}{\mathrm{d}z^2} = \frac{(n-1)n}{V+nz^2} - \frac{2(n-1)n^2z^2}{(V+nz^2)^2} \;\Rightarrow\; H = \frac{\mathrm{d}^2L(\hat\mu)}{\mathrm{d}\mu^2} = \frac{\mathrm{d}^2L(0)}{\mathrm{d}z^2} = \frac{(n-1)n}{V}.$$

Exercise 6.4

By substituting t = ½χ² we find for the average

$$\langle\chi^2\rangle = \int_0^\infty \chi^2\,p(\chi^2|\nu)\,\mathrm{d}\chi^2 = \frac{2}{\Gamma(\alpha)}\int_0^\infty t^\alpha e^{-t}\,\mathrm{d}t = \frac{2\Gamma(\alpha+1)}{\Gamma(\alpha)} = 2\alpha = \nu.$$

Likewise, the second moment is found to be

$$\langle\chi^4\rangle = \int_0^\infty \chi^4\,p(\chi^2|\nu)\,\mathrm{d}\chi^2 = \frac{4}{\Gamma(\alpha)}\int_0^\infty t^{\alpha+1}e^{-t}\,\mathrm{d}t = \frac{4\Gamma(\alpha+2)}{\Gamma(\alpha)} = 4\alpha(\alpha+1) = \nu(\nu+2).$$

Therefore the variance is

$$\langle\chi^4\rangle - \langle\chi^2\rangle^2 = \nu(\nu+2) - \nu^2 = 2\nu.$$

Exercise 6.5

The negative logarithm of (6.10) is given by

$$L = \text{Constant} + (n-1)\ln(\sigma) + \frac{V}{2\sigma^2}$$

whence we obtain

$$\frac{\mathrm{d}L}{\mathrm{d}\sigma} = \frac{n-1}{\sigma} - \frac{V}{\sigma^3} = 0 \;\Rightarrow\; \hat\sigma^2 = \frac{V}{n-1}$$

$$\frac{\mathrm{d}^2L}{\mathrm{d}\sigma^2} = \frac{1-n}{\sigma^2} + \frac{3V}{\sigma^4} \;\Rightarrow\; H = \frac{\mathrm{d}^2L(\hat\sigma)}{\mathrm{d}\sigma^2} = \frac{2(n-1)^2}{V}.$$

Exercise 6.6

In case of a polynomial parameterization

$$f(x;a) = a_1 + a_2 x + a_3 x^2 + a_4 x^3 + \cdots$$

the basis functions are given by fλ(x) = x^(λ−1). To give an example, for a quadratic polynomial the equation (6.19) takes the form

$$\begin{pmatrix} \sum_i w_i & \sum_i w_i x_i & \sum_i w_i x_i^2 \\ \sum_i w_i x_i & \sum_i w_i x_i^2 & \sum_i w_i x_i^3 \\ \sum_i w_i x_i^2 & \sum_i w_i x_i^3 & \sum_i w_i x_i^4 \end{pmatrix}\begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix} = \begin{pmatrix} \sum_i w_i d_i \\ \sum_i w_i d_i x_i \\ \sum_i w_i d_i x_i^2 \end{pmatrix}.$$

Exercise 6.7

In case f(x; a) = a1 (fit to a constant) the matrix W and the vector b are one-dimensional:

$$W = \sum_i w_i, \qquad b = \sum_i w_i d_i.$$



It then immediately follows from Eqs. (6.19) and (6.21) that

$$a_1 = \frac{\sum_i w_i d_i}{\sum_i w_i} \pm \frac{1}{\sqrt{\sum_i w_i}}.$$

Exercise 6.8

The covariance matrix V′ of r and s is given by (6.29) as

$$V'_{ij} = \sigma_i^2\delta_{ij}, \qquad V'_{\lambda\mu} = \delta_{\lambda\mu}, \qquad V'_{i\lambda} = 0.$$

Differentiation of (6.28) gives for the derivative matrix D

$$D_{ij} = \frac{\partial d_i}{\partial r_j} = \delta_{ij}, \qquad D_{i\lambda} = \frac{\partial d_i}{\partial s_\lambda} = \Delta_{i\lambda}.$$

Carrying out the matrix multiplication V = DV′Dᵀ we find

$$\begin{aligned}
V_{ij} &= \sum_k\sum_l D_{ik}V'_{kl}D^{\mathrm{T}}_{lj} + \sum_\lambda\sum_\mu D_{i\lambda}V'_{\lambda\mu}D^{\mathrm{T}}_{\mu j} \\
&= \sum_k\sum_l \delta_{ik}\sigma_k^2\delta_{kl}\delta_{lj} + \sum_\lambda\sum_\mu \Delta_{i\lambda}\delta_{\lambda\mu}\Delta_{j\mu} \\
&= \sigma_i^2\delta_{ij} + \sum_\lambda \Delta_{i\lambda}\Delta_{j\lambda}.
\end{aligned}$$

Exercise 6.9

From (6.38) we have for the log posterior

$$L(a,s) = \text{Constant} + \tfrac{1}{2}\sum_{i=1}^n w_i\Bigl(d_i - t_i(a) - \sum_{\lambda=1}^m s_\lambda\Delta_{i\lambda}\Bigr)^2 + \tfrac{1}{2}\sum_{\lambda=1}^m s_\lambda^2.$$

Setting the derivative to zero leads to

$$\frac{\partial L(a,s)}{\partial s_\lambda} = -\sum_{i=1}^n w_i(d_i-t_i)\Delta_{i\lambda} + \sum_{\mu=1}^m s_\mu\sum_{i=1}^n w_i\Delta_{i\lambda}\Delta_{i\mu} + s_\lambda = 0.$$

This equation can be rewritten as

$$\sum_{\mu=1}^m\Bigl(\delta_{\lambda\mu} + \sum_{i=1}^n w_i\Delta_{i\lambda}\Delta_{i\mu}\Bigr)s_\mu = \sum_{i=1}^n w_i(d_i-t_i)\Delta_{i\lambda}.$$

But this is just the matrix equation $\sum_\mu S_{\lambda\mu}s_\mu = b_\lambda$ as given in (6.40).

Exercise 6.10

The best estimate µ of the temperature is, according to Exercise 6.7, given by the weighted average. Setting di = d and wi = 1/σi² = 1/σ², we find for the average and variance of n measurements

$$\mu = \frac{\sum w_i d_i}{\sum w_i} = d \qquad\text{and}\qquad \langle\Delta\mu^2\rangle = \frac{1}{\sum w_i} = \frac{\sigma^2}{n}.$$

1. Offsetting all data points by an amount ±∆ gives for the best estimate µ± = d ± ∆. Adding the statistical and systematic deviations in quadrature we thus find from the offset method that

$$\mu = d \pm \sqrt{\frac{\sigma^2}{n} + \Delta^2} \;\Rightarrow\; \mu = d \pm \Delta \quad\text{for } n\to\infty.$$



2. The matrix S and the vector b defined in (6.40) are in this case one-dimensional. We have

$$S = 1 + n\left(\frac{\Delta}{\sigma}\right)^2, \qquad b = \frac{n(d-\mu)\Delta}{\sigma^2}, \qquad \sum w_i(d_i-t_i)^2 = \frac{n(d-\mu)^2}{\sigma^2}.$$

Inserting this in (6.43) we find, after some algebra,

$$L(\mu) = \frac{1}{2}\,\frac{n(d-\mu)^2}{\sigma^2+n\Delta^2}, \qquad \frac{\mathrm{d}L(\mu)}{\mathrm{d}\mu} = -\frac{n(d-\mu)}{\sigma^2+n\Delta^2}, \qquad \frac{\mathrm{d}^2L(\mu)}{\mathrm{d}\mu^2} = \frac{n}{\sigma^2+n\Delta^2}.$$

Setting the first derivative to zero gives µ = d. Taking the error to be the square root of the inverse of the Hessian (second derivative at µ) gives the same result as obtained from the offset method:

$$\mu = d \pm \sqrt{\frac{\sigma^2}{n} + \Delta^2} \;\Rightarrow\; \mu = d \pm \Delta \quad\text{for } n\to\infty.$$

We now add a second set of n measurements which do not have a systematic error ∆. The weighted average gives for the best estimate µ and the variance of the combined data

$$\mu = d \qquad\text{and}\qquad \langle\Delta\mu^2\rangle = \frac{\sigma^2}{2n}.$$

1. Offsetting the first set of n data points by an amount ±∆ but leaving the second set intact gives for the best estimate µ± = d ± ∆/2. Adding the statistical and systematic deviations in quadrature we thus find from the offset method that

$$\mu = d \pm \sqrt{\frac{\sigma^2}{2n} + \left(\frac{\Delta}{2}\right)^2} \;\Rightarrow\; \mu = d \pm \frac{\Delta}{2} \quad\text{for } n\to\infty.$$

But this error is larger than if we would have considered only the second data set:

$$\mu = d \pm \frac{\sigma}{\sqrt{n}} \;\Rightarrow\; \mu = d \pm 0 \quad\text{for } n\to\infty.$$

In other words, the offset method violates the requirement that the error derived from all available data must always be smaller than that derived from a subset of the data.

2. The matrix S and the vector b defined in (6.40) are now two-dimensional but with many zero-valued elements since the systematic error ∆ of the second data set is zero. We find

$$\boldsymbol{S} = \begin{pmatrix} S & 0 \\ 0 & 1 \end{pmatrix}, \qquad \boldsymbol{b} = \begin{pmatrix} b \\ 0 \end{pmatrix},$$

where S and b are defined above. The log posterior of (6.43) is found to be

$$L(\mu) = \frac{1}{2}\left[\,2n\left(\frac{d-\mu}{\sigma}\right)^2 - bS^{-1}b\,\right] = \frac{1}{2}\,\frac{n(d-\mu)^2}{\sigma^2+n\Delta^2}\left(2 + \frac{n\Delta^2}{\sigma^2}\right).$$

Solving the equation dL(µ)/dµ = 0 immediately yields µ = d. The inverse of the second derivative gives an estimate of the squared error. After some straightforward algebra we find

$$\mu = d \pm \sqrt{\left(\frac{\sigma^2}{n}+\Delta^2\right)\left(2 + \frac{n\Delta^2}{\sigma^2}\right)^{-1}}.$$

It is seen that the error vanishes in the limit n → ∞, as it should be.
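A quick numerical look at these formulas (with invented σ and ∆) makes the difference between the two methods explicit; this is merely an illustration of the algebra above.

import numpy as np

# Error estimates of Exercise 6.10 for the combined data set as a function
# of n: offset method versus (6.43), compared with the offset-free set alone.
sigma, Delta = 0.5, 1.0
for n in (1, 10, 100, 10_000):
    offset_two_sets = np.sqrt(sigma**2 / (2 * n) + (Delta / 2) ** 2)
    bayes_two_sets = np.sqrt((sigma**2 / n + Delta**2) /
                             (2.0 + n * Delta**2 / sigma**2))
    second_set_only = sigma / np.sqrt(n)
    print(n, offset_two_sets, bayes_two_sets, second_set_only)
# The offset-method error saturates at Delta/2, while the error from (6.43)
# stays below that of the offset-free data set alone and vanishes as n grows.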

Exercise 6.11

In (3.11) we state that the posterior p(λ|d, H1) is Gaussian distributed in the neighborhood of the mode λ̂. Indeed, that is exactly what we find inserting Eqs. (6.47), (6.48) and (6.49) in Bayes' theorem,

$$p(\lambda|d,H_1) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(\frac{\lambda-\hat\lambda}{\sigma}\right)^2\right].$$



The approximations made in Eqs. (6.47) and (6.48) are thus consistent with those made in (3.11).

Exercise 6.12

When Hi has n free parameters λ and Hj has m free parameters µ, the Occam factor in (6.51) becomes the ratio of multivariate Gaussian normalization factors, see (3.11):

$$\sqrt{\frac{(2\pi)^m\,|V_\mu|}{(2\pi)^n\,|V_\lambda|}}.$$

Exercise 7.1

For the negative log posterior of the binomial distribution we have

$$L = \text{Constant} - n\ln(h) - (N-n)\ln(1-h).$$

The first and second derivatives of L to h are

$$\frac{\mathrm{d}L}{\mathrm{d}h} = \frac{hN-n}{h(1-h)}, \qquad \frac{\mathrm{d}^2L}{\mathrm{d}h^2} = \frac{h^2N+n-2hn}{h^2(1-h)^2}.$$

From dL/dh = 0 we obtain ĥ = n/N. The Hessian is thus d²L(ĥ)/dh² = N/[ĥ(1−ĥ)] from which (7.5) follows.

Exercise 7.2

The posterior of ε for N counts in N events is, from (7.4),

$$p(\varepsilon|N,N) = (N+1)\,\varepsilon^N.$$

Integrating the posterior we find for the α confidence level

$$\alpha = \int_a^1 p(\varepsilon|N,N)\,\mathrm{d}\varepsilon = 1 - \int_0^a p(\varepsilon|N,N)\,\mathrm{d}\varepsilon = 1 - a^{N+1} \;\Rightarrow\; a = (1-\alpha)^{1/(N+1)}.$$

For N = 4 and α = 0.65 we find a = 0.81 so that the result on the efficiency can be reported as

$$\varepsilon = 1^{+0}_{-0.19} \quad (65\%\ \mathrm{CL}).$$
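The same number follows from a two-line numerical check (nothing here beyond the algebra above):

import numpy as np

# Lower limit eps_a from the closed form and from brute-force integration
# of the posterior p(eps|N,N) = (N+1) eps^N.
N, alpha = 4, 0.65
print((1.0 - alpha) ** (1.0 / (N + 1)))          # closed form, ~0.81

eps = np.linspace(0.0, 1.0, 1_000_001)
cdf = np.cumsum((N + 1) * eps**N) * (eps[1] - eps[0])
print(eps[np.searchsorted(cdf, 1.0 - alpha)])    # same number numerically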

References

[1] T. Loredo, 'From Laplace to Supernova SN 1987A: Bayesian Inference in Astrophysics' in 'Maximum Entropy and Bayesian Methods', ed. P.F. Fougere, Kluwer Academic Publishers, Dordrecht (1990).

[2] G.L. Bretthorst, 'An Introduction to Parameter Estimation using Bayesian Probability Theory' in 'Maximum Entropy and Bayesian Methods', ed. P.F. Fougere, Kluwer Academic Publishers, Dordrecht (1990).

[3] G.L. Bretthorst, 'An Introduction to Model Selection using Probability Theory as Logic' in 'Maximum Entropy and Bayesian Methods', ed. G.L. Heidbreder, Kluwer Academic Publishers, Dordrecht (1996).

[4] G. D'Agostini, 'Bayesian Inference in Processing Experimental Data—Principles and Basic Applications', arXiv:physics/0304102 (2003).

[5] D.S. Sivia, 'Data Analysis—a Bayesian Tutorial', Oxford University Press (1997).

[6] P. Gregory, 'Bayesian Logical Data Analysis for the Physical Sciences', Cambridge University Press (2005).

[7] E.T. Jaynes, 'Probability Theory—The Logic of Science', Cambridge University Press (2003). See also http://omega.albany.edu:8008/JaynesBook.html (1998).

[8] G. Cowan, 'Statistical Data Analysis', Oxford University Press (1998).

[9] R.T. Cox, 'Probability, frequency and reasonable expectation', Am. J. Phys. 14, 1 (1946).

[10] F. James et al. eds., 'Proc. Workshop on Confidence Limits', CERN Yellow Report 2000-005. See also http://www.cern.ch/cern/divisions/ep/events/clw.

[11] PDG, S. Eidelman et al., Phys. Lett. B592, 1 (2004).

[12] S.F. Gull, 'Bayesian Inductive Inference and Maximum Entropy' in 'Maximum-Entropy and Bayesian Methods in Science and Engineering', G.J. Ericson and C.R. Smith eds., Vol. I, 53, Kluwer Academic Publishers (1988).

[13] C.E. Shannon, 'The mathematical theory of communication', Bell Systems Tech. J. 27, 379 (1948).

[14] S.I. Alekhin, 'Statistical properties of estimators using the covariance matrix', hep-ex/0005042.

[15] D. Stump et al., 'Uncertainties of predictions from parton distribution functions', Phys. Rev. D65, 014012 (2002), hep-ph/0101051.

[16] D. Stump, private communication.

[17] G. D'Agostini, Nucl. Inst. Meth. A346, 306 (1994); T. Takeuchi, Prog. Theor. Phys. Suppl. 123, 247 (1996), hep-ph/9603415.

[18] NA49 Collab., C. Alt et al., 'Upper limit of D0 production in central Pb-Pb collisions at 158A GeV', Phys. Rev. C73, 034910 (2006), nucl-ex/0507031.
