
Machine Learning

Probability Theory

WS2013/2014 – Dmitrij Schlesinger, Carsten Rother

Probability space

A probability space is a triple (Ω, σ, P) with:

• Ω − the set of elementary events

• σ − a σ-algebra

• P − a probability measure

A σ-algebra over Ω is a system of subsets, i.e. σ ⊆ 𝒫(Ω) (𝒫 is the power set), with:

• Ω ∈ σ

• A ∈ σ ⇒ Ω ∖ A ∈ σ

• Aᵢ ∈ σ, i = 1, 2, … ⇒ ⋂ᵢ Aᵢ ∈ σ

σ is closed with respect to complement and countable conjunction (intersection). It follows that ∅ ∈ σ and that σ is also closed with respect to countable disjunction (union), due to De Morgan's laws.


Probability space

Examples:

• σ = {∅, Ω} (smallest) and σ = 𝒫(Ω) (largest) σ-algebras over Ω

• the minimal σ-algebra over Ω containing a particular subset A ⊆ Ω is σ = {∅, A, Ω ∖ A, Ω}

• Ω is discrete and finite: σ = 2^Ω

• Ω = ℝ: the Borel algebra (contains all intervals, among others)

• etc.
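For a finite Ω the axioms can be checked mechanically. The following Python sketch is my own illustration (not part of the slides); the set names are made up, and for a finite system closure under pairwise intersection already covers the countable case:

```python
from itertools import combinations

def is_sigma_algebra(omega, sigma):
    """Check the sigma-algebra axioms for a finite system of subsets of omega."""
    omega = frozenset(omega)
    sigma = {frozenset(s) for s in sigma}
    if omega not in sigma:                                           # Ω ∈ σ
        return False
    if any(omega - a not in sigma for a in sigma):                   # closed under complement
        return False
    if any(a & b not in sigma for a, b in combinations(sigma, 2)):   # closed under intersection
        return False
    return True

omega = {"a", "b", "c", "d", "e", "f"}     # e.g. the six faces of a die
A = {"a", "b"}
print(is_sigma_algebra(omega, [set(), A, omega - A, omega]))   # True: minimal σ-algebra containing A
print(is_sigma_algebra(omega, [set(), A, omega]))              # False: the complement of A is missing
```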


Probability measure

P: σ → [0, 1] is a "measure", with the normalization P(Ω) = 1

σ-additivity: let Aᵢ ∈ σ be pairwise disjoint subsets, i.e. Aᵢ ∩ Aᵢ′ = ∅ for i ≠ i′; then

P(⋃ᵢ Aᵢ) = Σᵢ P(Aᵢ)

Note: there are sets for which no (suitable) measure can be defined.

Examples: (uniform measures on) the set of irrational numbers, function spaces ℝ^∞, etc.

A striking illustration is the Banach–Tarski paradox (see Wikipedia).


(For us) practically relevant cases

• The set Ω is "well-behaved", e.g. ℝⁿ, discrete finite sets, etc.

• σ = 𝒫(Ω), i.e. the σ-algebra is the power set

• We often consider a (composite) "event" A ⊆ Ω as the union of elementary ones

• The probability of an event is

P(A) = Σ_{ω∈A} P(ω)


Random variables

Here we consider a special case: real-valued random variables.

A random variable ξ on a probability space (Ω, σ, P) is a mapping ξ: Ω → ℝ satisfying

{ω: ξ(ω) ≤ r} ∈ σ  ∀ r ∈ ℝ

(always holds for power sets)

Note: elementary events are not numbers – they are elements of a general set Ω.

Random variables, in contrast, take numerical values, i.e. they can be added, subtracted, squared, etc.
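As a toy illustration (my own, not from the slides), a random variable over a finite Ω is just a mapping; the sketch below also builds the event {ω: ξ(ω) ≤ r} from the condition above:

```python
# ξ maps elementary events (die faces) to real numbers
xi = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6}

def preimage_leq(xi, r):
    """The event {ω : ξ(ω) ≤ r}; with σ = 𝒫(Ω) it always belongs to σ."""
    return {w for w, value in xi.items() if value <= r}

print(preimage_leq(xi, 3))   # {'a', 'b', 'c'}
```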


Distributions

Cumulative distribution function of a random variable ξ:

F_ξ(r) = P({ω: ξ(ω) ≤ r})

Probability distribution of a discrete random variable ξ: Ω → ℤ:

p_ξ(r) = P({ω: ξ(ω) = r})

Probability density of a continuous random variable ξ: Ω → ℝ:

p_ξ(r) = ∂F_ξ(r) / ∂r


Distributions

Why is it necessary to proceed in such a roundabout way (via the cumulative distribution function)?

Example – a Gaussian

The probability of any particular real value is zero → a "direct" definition of a "probability distribution" makes no sense.

Via the cumulative distribution function, however, the definition works.
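A numerical sanity check for the Gaussian case (my addition, assuming the standard Gaussian formulas): differentiating the CDF numerically reproduces the usual closed-form density.

```python
import math

def gauss_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of a Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def gauss_pdf(x, mu=0.0, sigma=1.0):
    """Closed-form Gaussian density, for comparison."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

x, h = 0.7, 1e-5
numeric_density = (gauss_cdf(x + h) - gauss_cdf(x - h)) / (2 * h)   # ∂F_ξ/∂r by central differences
print(numeric_density, gauss_pdf(x))                                # both ≈ 0.312
```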


Mean

The mean (expectation, average, …) of a random variable ξ is

𝔼_P(ξ) = Σ_{ω∈Ω} P(ω)·ξ(ω) = Σ_r Σ_{ω: ξ(ω)=r} P(ω)·r = Σ_r p_ξ(r)·r

The arithmetic mean is a special case:

x̄ = (1/N) Σ_{i=1}^{N} xᵢ = Σ_r p_ξ(r)·r

with x ≡ r and p_ξ(r) = 1/N (the uniform probability distribution).
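The following sketch (my addition, with the fair-die names used above) computes the mean in all three ways; the values agree, as expected:

```python
from collections import defaultdict

P  = {w: 1/6 for w in "abcdef"}                       # fair die
xi = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6}

# E_P(ξ) as a sum over elementary events ω
mean_over_omega = sum(P[w] * xi[w] for w in P)

# The same mean as a sum over values r via the distribution p_ξ
p = defaultdict(float)
for w in P:
    p[xi[w]] += P[w]
mean_over_values = sum(r * pr for r, pr in p.items())

# Arithmetic mean as the special case p_ξ(r) = 1/N
samples = [1, 2, 3, 4, 5, 6]
arithmetic_mean = sum(samples) / len(samples)

print(mean_over_omega, mean_over_values, arithmetic_mean)   # all 3.5
```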


Mean

The probability of an event A ⊆ Ω can be expressed as the mean value of a corresponding "indicator" variable:

P(A) = Σ_{ω∈A} P(ω) = Σ_{ω∈Ω} P(ω)·ξ(ω)

with ξ(ω) = 1 if ω ∈ A, and ξ(ω) = 0 otherwise.

Often, the set of elementary events can be associated with a random variable (just enumerate all ω ∈ Ω).

Then one can speak of a "probability distribution over Ω" (instead of the probability measure).
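A short check of this identity for the die example (my addition):

```python
P = {w: 1/6 for w in "abcdef"}      # fair die
A = {"c", "f"}                      # some event A ⊆ Ω

indicator = {w: 1.0 if w in A else 0.0 for w in P}     # the "indicator" random variable
p_direct  = sum(P[w] for w in A)                       # P(A) = Σ_{ω∈A} P(ω)
p_as_mean = sum(P[w] * indicator[w] for w in P)        # P(A) = E_P(indicator)

print(p_direct, p_as_mean)          # both 1/3
```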


Example 1 – numbers of a die

The set of elementary events: Ω = {a, b, c, d, e, f}

Probability measure: P(a) = 1/6, P({c, f}) = 1/3, …

Random variable (the number shown by the die): ξ(a) = 1, ξ(b) = 2, …, ξ(f) = 6

Cumulative distribution: F_ξ(3) = 1/2, F_ξ(4.5) = 2/3, …

Probability distribution: p_ξ(1) = p_ξ(2) = … = p_ξ(6) = 1/6

Mean value: 𝔼_P(ξ) = 3.5

Another random variable (the squared number of the die): ξ′(a) = 1, ξ′(b) = 4, …, ξ′(f) = 36

Mean value: 𝔼_P(ξ′) = 91/6 ≈ 15.17

Note: 𝔼_P(ξ′) ≠ (𝔼_P(ξ))²
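These numbers can be verified with a few lines of Python (my addition; the face-to-number mapping is as above):

```python
P  = {w: 1/6 for w in "abcdef"}
xi = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6}
xi_sq = {w: v * v for w, v in xi.items()}              # ξ′, the squared number

mean    = sum(P[w] * xi[w] for w in P)                 # E_P(ξ)  = 3.5
mean_sq = sum(P[w] * xi_sq[w] for w in P)              # E_P(ξ′) = 91/6 ≈ 15.17
F_3     = sum(P[w] for w in P if xi[w] <= 3)           # F_ξ(3)   = 1/2
F_4_5   = sum(P[w] for w in P if xi[w] <= 4.5)         # F_ξ(4.5) = 2/3

print(mean, mean_sq, F_3, F_4_5)
print(mean_sq != mean ** 2)                            # E_P(ξ′) ≠ (E_P(ξ))² → True
```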


Example 2 – two independent dice numbers

The set of elementary events (6 × 6 faces):

Ω = {a, b, c, d, e, f} × {a, b, c, d, e, f}

Probability measure: P(ab) = 1/36, P({cd, fa}) = 1/18, …

Two random variables:

1) The number of the first die: ξ₁(ab) = 1, ξ₁(ac) = 1, ξ₁(ef) = 5, …

2) The number of the second die: ξ₂(ab) = 2, ξ₂(ac) = 3, ξ₂(ef) = 6, …

Probability distributions:

p_ξ₁(1) = p_ξ₁(2) = ⋯ = p_ξ₁(6) = 1/6

p_ξ₂(1) = p_ξ₂(2) = ⋯ = p_ξ₂(6) = 1/6


Example 2 – two independent dice numbers

Consider the new random variable ξ = ξ₁ + ξ₂.

The probability distribution p_ξ is no longer uniform:

p_ξ ∝ (1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)

The mean value is 𝔼_P(ξ) = 7.

In general, for mean values:

𝔼_P(ξ₁ + ξ₂) = Σ_{ω∈Ω} P(ω)·(ξ₁(ω) + ξ₂(ω)) = 𝔼_P(ξ₁) + 𝔼_P(ξ₂)


Random variables of higher dimension

Analogously: let ξ: Ω → ℝⁿ be a mapping (n = 2 for simplicity), with ξ = (ξ₁, ξ₂), ξ₁: Ω → ℝ and ξ₂: Ω → ℝ.

Cumulative distribution function:

F_ξ(r, s) = P({ω: ξ₁(ω) ≤ r} ∩ {ω: ξ₂(ω) ≤ s})

Joint probability distribution (discrete):

p_ξ(r, s) = P({ω: ξ₁(ω) = r} ∩ {ω: ξ₂(ω) = s})

Joint probability density (continuous):

p_ξ(r, s) = ∂²F_ξ(r, s) / (∂r ∂s)

Independence

Two events A ∈ σ and B ∈ σ are independent if

P(A ∩ B) = P(A) · P(B)

Interesting: A and the complement Ω ∖ B are independent if A and B are independent.

Two random variables ξ₁ and ξ₂ are independent if

F_ξ(r, s) = F_ξ₁(r) · F_ξ₂(s)  ∀ r, s

It follows (example for continuous ξ):

p_ξ(r, s) = ∂²F_ξ(r, s) / (∂r ∂s) = (∂F_ξ₁(r) / ∂r) · (∂F_ξ₂(s) / ∂s) = p_ξ₁(r) · p_ξ₂(s)


Conditional probabilities

Conditional probability:

P(A | B) = P(A ∩ B) / P(B)

Independence (almost equivalent): A and B are independent if

P(A | B) = P(A) and/or P(B | A) = P(B)

Bayes' theorem (formula, rule):

P(A | B) = P(B | A) · P(A) / P(B)


Further definitions (for random variables)

Shorthand: p(x, y) ≡ p_ξ(x, y)

Marginal probability distribution:

p(x) = Σ_y p(x, y)

Conditional probability distribution:

p(x | y) = p(x, y) / p(y)

Note: Σ_x p(x | y) = 1

Independent probability distributions: p(x, y) = p(x) · p(y)
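A small sketch (my addition) with a made-up 2 × 2 joint table p(x, y), computing the marginal and the conditional exactly as defined above:

```python
# A small made-up joint distribution p(x, y) over x ∈ {0, 1}, y ∈ {0, 1}
p_xy = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def marginal_x(x):
    """p(x) = Σ_y p(x, y)"""
    return sum(pr for (xv, _), pr in p_xy.items() if xv == x)

def conditional_x_given_y(x, y):
    """p(x | y) = p(x, y) / p(y)"""
    p_y = sum(pr for (_, yv), pr in p_xy.items() if yv == y)
    return p_xy[(x, y)] / p_y

print(marginal_x(0))                                              # 0.4
print(conditional_x_given_y(0, 1) + conditional_x_given_y(1, 1))  # Σ_x p(x|y) = 1 (up to rounding)
```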


Example

Let the probability of being ill be

p(ill) = 0.02

Let the conditional probability of having a (raised) temperature in that case be

p(temp | ill) = 0.9

However, one may also have a temperature without being ill, i.e.

p(temp | ¬ill) = 0.05

What is the probability of being ill, given that one has a temperature?


Example

Bayes’ rule:

p(ill | temp) = p(temp | ill) · p(ill) / p(temp)

(with the marginal probability in the denominator)

= p(temp | ill) · p(ill) / (p(temp | ill) · p(ill) + p(temp | ¬ill) · p(¬ill))

= 0.9 · 0.02 / (0.9 · 0.02 + 0.05 · 0.98) ≈ 0.27

– not as high as one might expect; the reason is the very low prior probability of being ill.
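The same computation in a few lines of Python (my addition):

```python
p_ill = 0.02
p_temp_given_ill = 0.9
p_temp_given_not_ill = 0.05

# Marginal probability of a temperature (denominator of Bayes' rule)
p_temp = p_temp_given_ill * p_ill + p_temp_given_not_ill * (1 - p_ill)

# Bayes' rule
p_ill_given_temp = p_temp_given_ill * p_ill / p_temp
print(round(p_ill_given_temp, 3))   # ≈ 0.269
```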


Further topics

The model

Let two random variables be given:

• The first one is typically discrete (i.e. k ∈ K) and is called the "class"

• The second one is often continuous (x ∈ X) and is called the "observation"

Let the joint probability distribution p(x, k) be "given".

Since k is discrete, it is often specified as p(x, k) = p(k) · p(x|k)

The recognition task: given x, estimate k.

Usual problems (questions):

• How can k be estimated from x?

• The joint probability is not always explicitly specified.

• The set K is sometimes huge.
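One simple possibility for the first question, sketched here with made-up priors and Gaussian class-conditional densities (my own illustration, not necessarily the decision rule developed later in the course), is to pick the k maximizing p(k) · p(x|k):

```python
import math

# Toy model p(x, k) = p(k) · p(x | k) with two classes; all numbers are hypothetical.
prior  = {0: 0.7, 1: 0.3}
params = {0: (0.0, 1.0), 1: (2.0, 1.0)}      # (mean, std) of p(x | k)

def gauss(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def recognize(x):
    """One simple answer to 'given x, estimate k': the k maximizing p(k) · p(x | k)."""
    return max(prior, key=lambda k: prior[k] * gauss(x, *params[k]))

print(recognize(0.2), recognize(1.8))        # 0 1
```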


Further topics

The learning task:

Often (almost always) the probability distribution is known only up to free parameters. How should they be chosen (learned from examples)?

Next classes:

1. Recognition, Bayesian Decision Theory

2. Probabilistic learning, Maximum-Likelihood principle

3. Discriminative models, recognition and learning

…
