Lectures 3 & 4: Estimators
• Maximum likelihood estimation (MLE)
• Likelihood function, information matrix
• Least squares estimation
• Total least squares
• General least squares
• Regularization
• Maximum a Posteriori estimation (MAP)
• Posterior density via Bayes' rule
• Confidence regions
Hilary Term 2007 A. Zisserman
Maximum Likelihood Estimation
In the line fitting (linear regression) example, the estimate of the line parameters θ involved two steps:

1. Write down the likelihood function expressing the probability of the data z given the parameters θ
2. Maximize the likelihood to determine θ, i.e. choose the value of θ so as to make the data as likely as possible:

min_{a,b} Σ_i (y_i − (a x_i + b))²

Note: the solution does not depend on the variance σ² of the noise.

[Figure: data points (x, y) with a fitted line]
Example application: Navigation
Take multiple measurements of beacons to improve location estimates
The Likelihood function
More formally, the likelihood is defined as

L(θ) = p(z|θ)

• the likelihood is the conditional probability of obtaining measurements z = (z1, z2, …, zn) given that the true value is θ

[Figure: p(z|θ) as a function of θ and z]

Note, in general:
• The likelihood is a function of θ, but it is not a probability distribution over θ, and its integral with respect to θ does not (necessarily) equal one.
• It would be incorrect to refer to this as "the likelihood of the data".
Example: Gaussian Sensors
The likelihood is a Gaussian response in 2D for true value θ = x.

[Figure: 2D Gaussian likelihood surfaces centred on measurements z1 and z2, with the true value x]
L(θ) = p(z|θ)
Given observations z and a likelihood function L(θ), the Maximum Likelihood estimate of θ is the value of θ which maximizes the likelihood function:

θ̂ = argmax_θ L(θ)
Now, look at several problems involving combining multiple measurements from Gaussian sensors
MLE Example 1
Suppose we have two independent sonar measurements (z1, z2) of position x, and the sensor error may be modelled as p(z_i|x) = N(x, σ²). Determine the MLE of x.
Solution

Since the sensors are independent the likelihood factorizes, and since the sensors are Gaussian:

L(x) = p(z1, z2|x) = p(z1|x) p(z2|x)

L(x) ∼ e^{−(z1−x)²/(2σ²)} × e^{−(z2−x)²/(2σ²)} = e^{−[(z1−x)² + (z2−x)²]/(2σ²)}

(ignoring irrelevant normalization constants)

[Figure: the measurement Gaussians p(z1|x) and p(z2|x) centred at z1 and z2, and the combined likelihood L(x)]

The Maximum Likelihood estimate of x is

x̂ = argmax_x L(x) = argmin_x {−ln L(x)}

The negative log likelihood is

−ln L(x) = [(z1 − x)² + (z2 − x)²]/(2σ²)
         = [2x² − 2x(z1 + z2) + z1² + z2²]/(2σ²)
         = (x − x̄)²/σ² + c(z1, z2)

with x̄ = (z1 + z2)/2, so that L(x) ∼ e^{−(x−x̄)²/σ²}.

Compute the minimum by differentiating w.r.t. x:

d{−ln L(x)}/dx = 2(x − x̄)/σ² = 0  ⇒  x̂ = x̄ = (z1 + z2)/2

Note:
• the likelihood is a Gaussian
• the variance is reduced to σ²/2 (cf the original sensor variance σ²)
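As a quick numerical check of this derivation (a sketch in NumPy, not part of the lecture; the readings z1 = 3, z2 = 7 and σ = 2 are made up), minimising the negative log likelihood on a grid recovers the mean of the two measurements:

```python
import numpy as np

# Two measurements from sensors with equal variance sigma^2
z1, z2, sigma = 3.0, 7.0, 2.0

# Negative log likelihood (up to a constant): [(z1-x)^2 + (z2-x)^2] / (2 sigma^2)
x = np.linspace(-10.0, 20.0, 30001)
nll = ((z1 - x) ** 2 + (z2 - x) ** 2) / (2 * sigma ** 2)

x_mle = x[np.argmin(nll)]
print(x_mle)  # the minimiser is the mean (z1 + z2)/2 = 5.0
```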
MLE Example 2: Gaussian fusion with different variances

Suppose we have two independent sonar measurements (z1, z2) of position x, and the sensor errors may be modelled as p(z1|x) = N(x, σ1²) and p(z2|x) = N(x, σ2²). Determine the MLE of x.

Solution

Again, as the sensors are independent:

L(x) = p(z1, z2|x) = p(z1|x) p(z2|x)

L(x) ∼ e^{−(z1−x)²/(2σ1²)} × e^{−(z2−x)²/(2σ2²)}

The negative log likelihood is then

−ln L(x) = ½[σ1⁻²(z1 − x)² + σ2⁻²(z2 − x)²] + const
         = ½[(σ1⁻² + σ2⁻²)x² − 2(σ1⁻² z1 + σ2⁻² z2)x] + const
         = ½(σ1⁻² + σ2⁻²)[x − (σ1⁻² z1 + σ2⁻² z2)/(σ1⁻² + σ2⁻²)]² + const

which is minimised (and the likelihood maximised) w.r.t. x when

x̂_MLE = (σ1⁻² z1 + σ2⁻² z2)/(σ1⁻² + σ2⁻²)

e.g. if the sensors are p(z1|x) ∼ N(x, 10²) and p(z2|x) ∼ N(x, 20²), and we obtain sensor readings z1 = 130, z2 = 170, then

x̂_MLE = (130/10² + 170/20²)/(1/10² + 1/20²) = 138.0

• the ML estimate is closer to the more confident measurement
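The information-weighted mean is easy to check directly; this NumPy sketch (the helper name `fuse` is made up) uses the slide's readings and assumes nothing beyond the formula just derived:

```python
import numpy as np

def fuse(z, var):
    """Information-weighted fusion of independent Gaussian measurements."""
    z, var = np.asarray(z, float), np.asarray(var, float)
    info = 1.0 / var                       # statistical information of each sensor
    return np.sum(info * z) / np.sum(info)

# Readings from the slides: sigma1 = 10, sigma2 = 20, z1 = 130, z2 = 170
x_mle = fuse([130, 170], [10**2, 20**2])
print(round(x_mle, 1))  # 138.0
```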
The Information Matrix

We have seen that for two independent Normal variables:

x̂ = (σ1⁻² z1 + σ2⁻² z2)/(σ1⁻² + σ2⁻²)   • information weighted mean
σ_x̂⁻² = σ1⁻² + σ2⁻²                      • additivity of statistical information (like the parallel resistors formula)

This motivates the definition of the information matrix

S = Σ⁻¹,  S_x̂ = S_z1 + S_z2

In 2D with z = (z_x, z_y)ᵀ:

S_x̂ = Σ_x̂⁻¹ = S_z1 + S_z2
x̂ = Σ_x̂ (S_z1 z1 + S_z2 z2)

If Σ = [σx² 0; 0 σy²] then S = [1/σx² 0; 0 1/σy²]

• large covariance → uncertainty, lack of information
• small covariance → some confidence, or strong information

A singular covariance matrix implies "infinite" or "perfect" information.
MLE Example 3: Gaussian fusion with 2D sensors

Suppose we have two independent 2D vision measurements (z1, z2) of position x, and the sensor error may be modelled as p(z1|x) = N(x, Σ1) and p(z2|x) = N(x, Σ2). Determine the MLE of x, where

z1 = (1, 1)ᵀ, Σ1 = [1 0; 0 4]  and  z2 = (2, −1)ᵀ, Σ2 = [4 0; 0 1]

Solution

Again, the sensors are independent so we sum their information:

Σ_x̂⁻¹ = Σ1⁻¹ + Σ2⁻¹ = [1 0; 0 0.25] + [0.25 0; 0 1] = [1.25 0; 0 1.25]

x̂ = Σ_x̂ (S_z1 z1 + S_z2 z2) = [0.8 0; 0 0.8] ( [1 0; 0 0.25](1, 1)ᵀ + [0.25 0; 0 1](2, −1)ᵀ ) = (1.2, −0.6)ᵀ
[Figure: likelihood L(x) = p(z1|x) p(z2|x) in the x–y plane, showing the measurement Gaussians N(z1, Σ1) and N(z2, Σ2)]
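The 2D fusion can be verified numerically; this NumPy sketch uses the slide's measurements and covariances:

```python
import numpy as np

# Measurements and covariances from the slides
z1 = np.array([1.0, 1.0]);  S1 = np.linalg.inv(np.array([[1.0, 0.0], [0.0, 4.0]]))
z2 = np.array([2.0, -1.0]); S2 = np.linalg.inv(np.array([[4.0, 0.0], [0.0, 1.0]]))

# Sum the information matrices, then form the information-weighted mean
S = S1 + S2                     # S = Sigma_x^{-1} = [[1.25, 0], [0, 1.25]]
Sigma_x = np.linalg.inv(S)
x_hat = Sigma_x @ (S1 @ z1 + S2 @ z2)
print(x_hat)  # approximately (1.2, -0.6)
```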
Line fitting with errors in both coordinates

Suppose points are measured from a true line ȳ = a x̄ + b with errors:

x_i = x̄_i + v_i,  v_i ∼ N(0, σ²)
y_i = ȳ_i + w_i,  w_i ∼ N(0, σ²)

In the case of linear regression, the ML estimate gave the cost function

min_{a,b} Σ_i (y_i − (a x_i + b))²

(measured value minus predicted value). In the case of errors in both coordinates, the ML cost function is

min_{a,b} Σ_i (x_i − x̂_i)² + (y_i − ŷ_i)²  subject to  ŷ_i = a x̂_i + b

i.e. it is necessary also to estimate (predict) the corrected points (x̂_i, ŷ_i).

This estimator is called total least squares, or orthogonal regression.

Solution [exercise – non-examinable]

With centroid μ = (1/n) Σ_i x_i and covariance Σ = Σ_i (x_i − μ)(x_i − μ)ᵀ (here x_i = (x_i, y_i)ᵀ):

• the line passes through the centroid of the points
• the line normal is parallel to the eigenvector of the covariance matrix corresponding to the least eigenvalue (equivalently, the line direction is the eigenvector with the greatest eigenvalue)
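A minimal NumPy sketch of total least squares along these lines (the helper name `fit_line_tls` and the test points are made up; it assumes a non-vertical line so the slope is defined):

```python
import numpy as np

def fit_line_tls(x, y):
    """Total least squares (orthogonal regression) line fit.

    Returns (a, b) for y = a*x + b. The line passes through the centroid;
    its direction is the principal eigenvector of the point covariance.
    """
    pts = np.column_stack([x, y]).astype(float)
    mu = pts.mean(axis=0)                   # centroid
    cov = (pts - mu).T @ (pts - mu)         # (unnormalised) covariance
    evals, evecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    d = evecs[:, -1]                        # direction = largest-eigenvalue eigenvector
    a = d[1] / d[0]                         # slope (assumes a non-vertical line)
    b = mu[1] - a * mu[0]
    return a, b

# Noise-free points on y = 2x + 1 are recovered exactly
a, b = fit_line_tls([0, 1, 2, 3], [1, 3, 5, 7])
print(round(a, 6), round(b, 6))  # 2.0 1.0
```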
Application: Estimating Geometric Transformations

The line estimation problem is equivalent to estimating a 1D affine transformation

x′ = αx + β

from noisy point correspondences {x′_i ↔ x_i}.

Can also compute the ML estimate of 2D transformations.
Reminder: Plane Projective transformation

• A projective transformation is also called a "homography" and a "collineation".
• H has 8 degrees of freedom.
• It can be estimated from 4 or more point correspondences.
Maximum Likelihood estimation of homography H

If the measurement error is Gaussian, then the ML estimate of H and the corrected correspondences is given by minimizing

Σ_i d(x_i, x̂_i)² + d(x′_i, x̂′_i)²   subject to   x̂′_i = H x̂_i

Cost function minimization:
• for 2D affine transformations there is a matrix solution (non-examinable)
• for 2D projective transformations, minimize numerically using e.g. non-linear gradient descent
e.g. Camera rotating about its centre
image plane 1
image plane 2
• The two image planes are related by a homography H
• H only depends on the camera centre, C, and the planes, not on the 3D structure
Example : Building panoramic mosaics
4 frames from a sequence of 30
The camera rotates (and zooms) with fixed centre
Example : Building panoramic mosaics
30 frames
The camera rotates (and zooms) with fixed centre
Homography between two frames
ML estimate of homography from these 100s of point correspondences
Choice of mosaic frame
This produces the classic "bow-tie" mosaic.
Choose central image as reference
General Linear Least Squares

For the cost

min_{a,b} Σ_i (y_i − a x_i − b)²

write the residuals as an n-vector w = y − Hθ:

(w1, w2, …, wn)ᵀ = (y1, y2, …, yn)ᵀ − [x1 1; x2 1; … ; xn 1] (a, b)ᵀ

where H is an n × 2 matrix. Then the sum of squared residuals becomes

wᵀw = (y − Hθ)ᵀ(y − Hθ)

We want to minimize this w.r.t. θ.
Summary

• If the generative model is linear: y = Hθ
• then the ML solution for Gaussian noise is found by setting the derivative to zero:

d/dθ (y − Hθ)ᵀ(y − Hθ) = −2Hᵀ(y − Hθ) = 0  ⇒  HᵀHθ = Hᵀy

θ̂ = (HᵀH)⁻¹Hᵀy = H⁺y

• where the matrix H⁺ = (HᵀH)⁻¹Hᵀ is the pseudo-inverse of H
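The pseudo-inverse solution can be sketched as follows (NumPy; the data are made up and noise-free so the recovered parameters are exact):

```python
import numpy as np

# Fit y = a*x + b by least squares using the pseudo-inverse (a sketch;
# np.linalg.lstsq is the numerically preferred route in practice).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0                            # noise-free data for a clean check

H = np.column_stack([x, np.ones_like(x)])    # n x 2 design matrix
theta = np.linalg.pinv(H) @ y                # theta = (H^T H)^{-1} H^T y
print(theta)  # approximately [2. 1.]
```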
Generalization I: multi-linear regression

y_i = θ2 x_i + θ1 z_i + θ0 + w_i

H = [x1 z1 1; x2 z2 1; … ; xn zn 1],  θ̂ = (HᵀH)⁻¹Hᵀy

More general still, with basis functions φ_j(x):

y_i = Σ_j θ_j φ_j(x_i) + w_i
Generalization II: regression with basis functions

y_i = Σ_j θ_j φ_j(x_i) + w_i

for example, quadratic regression:

y_i = θ2 x_i² + θ1 x_i + θ0 + w_i

H = [φ2(x1) φ1(x1) 1; φ2(x2) φ1(x2) 1; … ; φ2(xn) φ1(xn) 1] = [x1² x1 1; x2² x2 1; … ; xn² xn 1]

• what is the correct function to fit?
• how is the quality of the estimate assessed?
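For example, quadratic regression with basis functions φ = (x², x, 1) reduces to the same normal equations; a NumPy sketch on made-up, noise-free data:

```python
import numpy as np

# Quadratic regression y = th2*x^2 + th1*x + th0 via basis functions
# phi = (x^2, x, 1); same least-squares machinery as the linear case.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 3.0 * x**2 - x + 0.5                     # noise-free quadratic for a clean check

H = np.column_stack([x**2, x, np.ones_like(x)])
theta, *_ = np.linalg.lstsq(H, y, rcond=None)
print(theta)  # approximately [ 3.  -1.   0.5]
```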
Summary point

We have seen:
• An estimator is a function f(z) which computes an estimate θ̂ = f(z) for the parameters θ from measurements z = (z1, z2, …, zn)
• The Maximum Likelihood estimator is θ̂ = argmax_θ p(z|θ)

Now, we will compute the posterior, and hence the Maximum a Posteriori (MAP) estimator.
Bayes' rule and the Posterior distribution

Suppose we have:
• p(z|θ), the conditional density of the sensor(s)
• additionally, p(θ), a prior density for θ, the parameter to be estimated.

Use Bayes' rule to get the posterior density:

p(θ|z) = p(z|θ) p(θ) / p(z) = (likelihood × prior) / normalizing factor

[Figure: likelihood p(z|θ), prior p(θ), and posterior p(θ|z) ∼ p(z|θ) p(θ) plotted against θ]
Example: posterior for uniform distributions

[Figure: likelihood p(z|θ), prior p(θ), and posterior p(θ|z) ∼ p(z|θ) p(θ) for uniform distributions]
MAP Estimator

θ̂ = argmax_θ p(θ|z)

[Figure: posterior with the MAP estimate marked, compared with the MLE]
MAP Example: Gaussian fusion

Consider again two sensors modelled as independent normal distributions:

z1 = θ + w1,  p(z1|θ) ∼ N(θ, σ1²)
z2 = θ + w2,  p(z2|θ) ∼ N(θ, σ2²)

so that the joint likelihood is

p(z1, z2|θ) = p(z1|θ) × p(z2|θ)

Suppose in addition we have prior information that p(θ) ∼ N(θp, σp²). Then the posterior density is

p(θ|z1, z2) ∼ p(z1, z2|θ) × p(θ)

The posterior is a Gaussian, and we can write down its mean and variance simply by adding statistical information: p(θ|z1, z2) ∼ N(μpos, σpos²), where

σpos⁻² = σ1⁻² + σ2⁻² + σp⁻²
μpos = (σ1⁻² z1 + σ2⁻² z2 + σp⁻² θp)/(σ1⁻² + σ2⁻² + σp⁻²) = θ̂_MAP

e.g. if the sensors are p(z1|x) ∼ N(x, 10²) and p(z2|x) ∼ N(x, 20²), the prior is p(x) ∼ N(150, 30²), and we obtain sensor readings z1 = 130, z2 = 170, then

x̂_MAP = (130/10² + 170/20² + 150/30²)/(1/10² + 1/20² + 1/30²) = 139.0

σpos = 1/√(1/10² + 1/20² + 1/30²) = 8.57
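A NumPy sketch of this MAP fusion, treating the prior as one more information-weighted term (the helper name `map_fuse` is made up; the numbers are the slide's):

```python
import numpy as np

def map_fuse(readings, variances, prior_mean, prior_var):
    """MAP estimate for independent Gaussian sensors with a Gaussian prior:
    the prior is treated as just one more information-weighted term."""
    info = np.append(1.0 / np.asarray(variances, float), 1.0 / prior_var)
    vals = np.append(np.asarray(readings, float), prior_mean)
    post_var = 1.0 / info.sum()
    return (info * vals).sum() * post_var, np.sqrt(post_var)

# Slide numbers: sensors N(x, 10^2) and N(x, 20^2), prior N(150, 30^2)
x_map, sigma_pos = map_fuse([130, 170], [10**2, 20**2], 150, 30**2)
print(round(x_map, 1), round(sigma_pos, 2))  # 139.0 8.57
```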
Matlab example

[Figure: plot of p(z1|x) (red), p(z2|x) (green), the prior p(θ) (black dotted), and the posterior p(θ|z1, z2) (blue)]

% normal(t, mu, var) is assumed to evaluate the density N(mu, var) at t
inc = 1; t = [50:inc:250];
pz1 = normal(t, 130, 10^2); pz2 = normal(t, 170, 20^2); prx = normal(t, 150, 30^2);
posx = pz1 .* pz2 .* prx / (inc*(sum(pz1 .* pz2 .* prx)));
plot(t, pz1, 'r', t, pz2, 'g', t, prx, 'k:', t, posx, 'b');
MAP vs MLE

For the posterior:

σpos⁻² = σ1⁻² + σ2⁻² + σp⁻²,  θ̂_MAP = (σ1⁻² z1 + σ2⁻² z2 + σp⁻² θp)/(σ1⁻² + σ2⁻² + σp⁻²)

Recall that for the likelihood:

σL⁻² = σ1⁻² + σ2⁻²,  θ̂_MLE = (σ1⁻² z1 + σ2⁻² z2)/(σ1⁻² + σ2⁻²)

so

σpos⁻² = σL⁻² + σp⁻²,  θ̂_MAP = (σp⁻² θp + σL⁻² θ̂_MLE)/(σp⁻² + σL⁻²)

i.e. the prior acts as an additional sensor.

• Decrease in posterior variance relative to the prior and likelihood:

1/σpos² = 1/σL² + 1/σp²

e.g. σL = 1/√(1/10² + 1/20²) = 8.94,  σp = 30,  σpos = 1/√(1/10² + 1/20² + 1/30²) = 8.57

If σp → ∞ then σpos → σL (the prior is very uninformative).
• What is the effect of a uniform prior?

• Update to the prior:

θ̂_MAP = θp + [σp²/(σp² + σL²)] (θ̂_MLE − θp)

• If θ̂_MLE = θp then θ̂_MAP is unchanged from the prior (but its variance is decreased)
• If σL is large compared to σp (noisy sensor) then θ̂_MAP ≈ θp
• If σp is large compared to σL (weak prior) then θ̂_MAP ≈ θ̂_MLE
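The "update to the prior" form can be checked against the direct information-weighted formula (a NumPy-free sketch with the slide's numbers; the two expressions agree to rounding):

```python
# Combining the MLE with the prior reproduces the direct MAP estimate:
# theta_MAP = theta_p + sigma_p^2/(sigma_p^2 + sigma_L^2) * (theta_MLE - theta_p)
z1, z2, s1, s2 = 130.0, 170.0, 10.0, 20.0
theta_p, sp = 150.0, 30.0

theta_mle = (z1 / s1**2 + z2 / s2**2) / (1 / s1**2 + 1 / s2**2)
sL2 = 1 / (1 / s1**2 + 1 / s2**2)            # likelihood variance sigma_L^2

theta_map_update = theta_p + sp**2 / (sp**2 + sL2) * (theta_mle - theta_p)
theta_map_direct = (z1 / s1**2 + z2 / s2**2 + theta_p / sp**2) / (
    1 / s1**2 + 1 / s2**2 + 1 / sp**2)
print(round(theta_map_update, 3), round(theta_map_direct, 3))
```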
MAP Application 1: Berkeley traffic tracking system
Tracks cars by:
• computing corner features on each video frame
• tracking corners (2D) from frame to frame
Objective: estimate image corner position x(t)
[Figure: features, tracks, grouping]
Tracking corner features
Berkeley visual traffic monitoring system
Traffic by day and by night
ground plane homography
Velocity measurement gate
Tracking Corners

[Figure: true corner position and noisy measurement]
Corners need 2D Gaussians
A corner feature in 2D coordinates is a vector x = (x, y)ᵀ. Suppose x ∼ N(0, σx²) and y ∼ N(0, σy²); then

p(x, y) = 1/(2π σx σy) exp{ −½ (x²/σx² + y²/σy²) }

and the covariance matrix is

Σ = [σx² 0; 0 σy²]
Corner fusion in 2D

Given a vehicle at position x with prior distribution N(x0, P0) and a corner measurement z ∼ N(x, R), summing information:

S = P⁻¹ = R⁻¹ + P0⁻¹

Information weighted mean:

x̂ = P (R⁻¹ z + P0⁻¹ x0)

Prior: x0 = (−1, −1)ᵀ, P0 = [2 1; 1 3]
Likelihood (isotropic measurement error): z = (1, 2)ᵀ, R = [1 0; 0 1]
Posterior: x̂ = (0.55, 1.37)ᵀ, P = [0.64 0.09; 0.09 0.73]
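This fusion is easy to reproduce (NumPy sketch with the slide's numbers; the exact posterior mean is (0.545, 1.364), which the slide rounds):

```python
import numpy as np

# Fuse a Gaussian prior N(x0, P0) with a measurement z ~ N(x, R) by
# summing information (numbers from the slides).
P0 = np.array([[2.0, 1.0], [1.0, 3.0]]); x0 = np.array([-1.0, -1.0])
R  = np.eye(2);                          z  = np.array([1.0, 2.0])

S = np.linalg.inv(R) + np.linalg.inv(P0)     # posterior information
P = np.linalg.inv(S)                         # posterior covariance
x = P @ (np.linalg.inv(R) @ z + np.linalg.inv(P0) @ x0)
print(np.round(P, 2))   # approximately [[0.64 0.09], [0.09 0.73]]
print(np.round(x, 2))   # approximately [0.55 1.36]
```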
Where does the prior come from?

Use the posterior from the previous time step, "updated" to take account of the dynamics:

x_t − x_{t−1} ∼ N(v, σ² I)

Iterate:
• prediction (posterior + dynamics) to give the prior
• make measurement z
• fuse measurement and prior to give the new posterior

Compute the estimate and covariance for a tracked corner.

Prior: x0 = (−1, −1)ᵀ, P0 = [2 1; 1 3]
Dynamics: v = (0, 0)ᵀ, PD = [0.3 0; 0 0.3]
Likelihood: z1 = (6, 2)ᵀ, R = [1 0; 0 1]

a) Prediction at t = 1:

P̄1 = P0 + PD = [2.3 1.0; 1.0 3.3];  x̄1 = x0 + v = (−1, −1)ᵀ

b) Fuse measurement at t = 1:

P1 = (P̄1⁻¹ + R⁻¹)⁻¹ = [0.674 0.076; 0.076 0.750]

x̂1 = P1 (P̄1⁻¹ x̄1 + R⁻¹ z1) = (3.95, 1.78)ᵀ
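The predict–fuse iteration can be sketched in a few lines of NumPy (the function names `predict` and `fuse` are made up; the numbers are the slide's):

```python
import numpy as np

def predict(x, P, v, PD):
    """Prediction step: push the previous posterior through the dynamics."""
    return x + v, P + PD

def fuse(x_prior, P_prior, z, R):
    """Fusion step: sum information from the prior and the measurement."""
    P = np.linalg.inv(np.linalg.inv(P_prior) + np.linalg.inv(R))
    x = P @ (np.linalg.inv(P_prior) @ x_prior + np.linalg.inv(R) @ z)
    return x, P

# Numbers from the slides
x0 = np.array([-1.0, -1.0]); P0 = np.array([[2.0, 1.0], [1.0, 3.0]])
v  = np.array([0.0, 0.0]);   PD = 0.3 * np.eye(2)
z1 = np.array([6.0, 2.0]);   R  = np.eye(2)

x_bar, P_bar = predict(x0, P0, v, PD)     # prior at t = 1
x1, P1 = fuse(x_bar, P_bar, z1, R)        # posterior at t = 1
print(np.round(x1, 2))  # approximately [3.95 1.78]
```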
Berkeley tracker
In action …
Confidence regions

• To quantify confidence and uncertainty, define a confidence region R about a point x (e.g. the mode) such that, at a confidence level c ≤ 1,

p(x ∈ R) = c

• we can then say (for example) there is a 99% probability that the true value is in R
• e.g. for a univariate normal distribution N(μ, σ²):

p(|x − μ| < σ) ≈ 0.68
p(|x − μ| < 2σ) ≈ 0.95
p(|x − μ| < 3σ) ≈ 0.997

[Figure: standard normal density with the ±σ, ±2σ, ±3σ regions]
• In 2D, confidence regions are usually chosen as ellipses

[Figure: 2D point cloud with nested confidence ellipses]

d | confidence
1 | 39%
2 | 86%
3 | 99%
For an n-dimensional normal distribution with mean zero and covariance Σ, the quantity

xᵀ Σ⁻¹ x

is distributed as a χ²-distribution on n degrees of freedom.

In 2D, the region enclosed by the ellipse xᵀ Σ⁻¹ x = d² defines a confidence region with

p(χ²₂ < d²) = c

where c can be obtained from standard χ² tables (e.g. HLT) or computed numerically.
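For 2 degrees of freedom the χ² CDF has a closed form, c = 1 − e^{−d²/2}, so the table values can be computed without consulting tables; a small sketch:

```python
import math

def conf_2d(d):
    """Confidence mass inside the ellipse x^T Sigma^{-1} x = d^2 for a 2D
    Gaussian: the chi-square CDF with 2 dof is 1 - exp(-d^2/2)."""
    return 1.0 - math.exp(-d * d / 2.0)

for d in (1, 2, 3):
    print(d, round(100 * conf_2d(d)))  # 39%, 86%, 99%, matching the table
```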
MAP Application 2: Image Restoration

Suppose the sensor model is

z_i = u_i + w_i

where u_i ∈ {0, 1} and w_i ∼ N(0, σ²). Then the likelihood is

p(z|u) ∼ Π_i e^{−(z_i − u_i)²/(2σ²)}

Assume the prior has the form

p(u) ∼ Π_i e^{−λ(u_i − u_{i−1})²}

Then the negative log of the posterior p(u|z) is (after scaling by 2σ², so that d = 2σ²λ)

C = Σ_i (z_i − u_i)² + d (u_i − u_{i−1})²

and the MAP estimate is u_MAP = argmin_u C(u).
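Since each u_i is binary, the 1D MAP estimate can be computed exactly by dynamic programming over the chain (a sketch; the function name and the short test signal are made up, not the lecture's images):

```python
import numpy as np

def map_restore(z, d):
    """Exact MAP restoration of a binary 1D signal by dynamic programming:
    minimise C(u) = sum_i (z_i - u_i)^2 + d*(u_i - u_{i-1})^2, u_i in {0,1}."""
    states = (0.0, 1.0)
    cost = [(z[0] - u) ** 2 for u in states]   # best cost of a path ending in u
    back = []                                  # backpointers per position
    for zi in z[1:]:
        new_cost, choices = [], []
        for u in states:
            trans = [cost[j] + d * (u - states[j]) ** 2 for j in (0, 1)]
            j = int(np.argmin(trans))
            new_cost.append(trans[j] + (zi - u) ** 2)
            choices.append(j)
        cost = new_cost
        back.append(choices)
    u_seq = [int(np.argmin(cost))]             # backtrack the optimal sequence
    for choices in reversed(back):
        u_seq.append(choices[u_seq[-1]])
    return list(reversed(u_seq))

# A short noisy step signal: the restoration snaps to the clean step
z = np.array([0.1, -0.2, 0.9, 1.1])
print(map_restore(z, d=0.5))  # [0, 0, 1, 1]
```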
[Figure: original binary signal, original plus noise, and the MAP restoration with d = 10]

[Figure: original binary signal, original plus noise, and the MAP restoration with d = 60]
Loss Functions

Given the posterior p(θ|z) ∼ p(z|θ) p(θ), MAP is one possible choice of estimate; more generally, choose the estimate to minimize an expected loss.

Example: hitting a target of width 2t

Loss function (x̂ is the chosen value, x the actual value):

L(x, x̂) = 1 if |x − x̂| > t,  0 if |x − x̂| ≤ t

Expected loss:

L̄(x̂) = ∫ p(x|z) L(x, x̂) dx

Choose x̂ to minimize the expected loss:

x̂ = argmin_x̂ ∫ p(x|z) L(x, x̂) dx
Example 1: squared loss function

L(x, x̂) = (x − x̂)²

This results in choosing the mean value:

μ = argmin_u ∫ (x − u)² p(x) dx

Sketch proof: differentiate w.r.t. u:

∫ 2(u − x) p(x) dx = 0  ⇒  u = ∫ x p(x) dx = μ
Example 2: absolute loss function

L(x, x̂) = |x − x̂|

This results in choosing the median value:

x_med = argmin_u ∫ |x − u| p(x) dx

[proof: exercise]
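Both results can be checked numerically by minimising the sample-average loss on a grid (a sketch; the skewed exponential sample is made up so that the mean and median are clearly different):

```python
import numpy as np

# Minimise the expected loss under a sampled distribution: squared loss
# selects the mean, absolute loss selects the median.
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=5001)     # a skewed sample

u = np.linspace(0.0, 8.0, 8001)               # candidate estimates
sq_loss = [np.mean((x - ui) ** 2) for ui in u]
abs_loss = [np.mean(np.abs(x - ui)) for ui in u]

print(u[np.argmin(sq_loss)], x.mean())        # grid minimiser ~ sample mean
print(u[np.argmin(abs_loss)], np.median(x))   # grid minimiser ~ sample median
```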
Warning: MAP and change of parameterization

Under a non-linear map θ′ = g(θ), the posterior p(θ|z) transforms to p(θ′|z), and the MAP estimate of one does not in general map to the MAP estimate of the other.

Problem:
• MAP finds the maximum value of the posterior, but the posterior is a probability density, and a density picks up a Jacobian factor under a change of variables

Solution:
• Loss functions correctly transform under change of coordinates (reparameterization)

Example of a non-linear map: perspective projection