System Identification
Lecture 11: Statistical properties of parameter estimators, instrumental variable methods
Roy Smith
2018-11-28 11.1
Statistical basis for estimation methods

Parametrised models:
    G = G(θ, z),  H = H(θ, z)   (pulse response, ARX, ARMAX, ..., state-space)

Estimation:
    θ̂ = argmin_θ J(θ, Z^K),   (Z^K: finite-length measured noisy data)

Examples: least squares (linear regression), prediction error methods, correlation methods.

How do the statistical properties of the data (i.e. noise effects) influence our choice of methods and our results?
Maximum likelihood estimation

Basic formulation

Consider K observations, z_1, ..., z_K.

Each is a realisation of a random variable, with joint probability distribution,
    f(x_1, ..., x_K; θ)   ← family of distributions parametrised by θ
                            (x_1, ..., x_K are the random variables).

Another common notation is,
    f(x_1, ..., x_K | θ)   ← the pdf for x_1, ..., x_K given θ.

For independent variables,
    f(x_1, ..., x_K; θ) = f_1(x_1; θ) f_2(x_2; θ) ··· f_K(x_K; θ) = ∏_{i=1}^K f_i(x_i; θ).
Maximum likelihood estimation

Likelihood function

Substituting the observations, Z^K = {z_1, ..., z_K}, gives a function of θ,
    L(θ) = f(x_1, ..., x_K; θ)|_{x_i = z_i, i = 1, ..., K}.   (likelihood function)

Maximum likelihood estimator:
    θ̂_ML = argmax_θ L(θ).

The value chosen for θ̂ is the one that gives the most "agreement" with the observation.
Maximum likelihood estimation

Estimating the mean of a Gaussian distribution (σ² = 0.5)

[Figure: surface plot over x and θ of the family of pdfs,
    f(x; θ) = (1/√(2πσ²)) e^{-(x-θ)²/(2σ²)},
with the ridge lying along x = θ.]
Maximum likelihood estimation

Estimating the mean of a Gaussian distribution (σ² = 0.5)

Datum: z = 7.0

[Figure: the slice of f(x; θ) = (1/√(2πσ²)) e^{-(x-θ)²/(2σ²)} at x = z is the likelihood L(θ) = f(z; θ), which is maximised at θ̂_ML = 7.00.]
Maximum likelihood estimation

Log-likelihood function

It is often mathematically easier to consider,
    θ̂_ML = argmax_θ ln L(θ).

As the ln function is monotonic, this gives the same θ̂.

The natural logarithm is typically used because it simplifies the exponentials appearing in typical pdfs.
Example

Estimation of the mean of a set of samples

    z_i, i = 1, ..., K,   z_i ∼ N(θ_0, σ_i²).   (note: different variances)

Sample mean estimate:
    θ̂_SM = (1/K) Σ_{i=1}^K z_i.

Probability density functions (pdfs): θ is the common mean of the distributions,
    f_i(x_i; θ) = (1/√(2πσ_i²)) exp( -(x_i - θ)² / (2σ_i²) ).

For independent samples the joint pdf is:
    f(x_1, ..., x_K; θ) = ∏_{i=1}^K (1/√(2πσ_i²)) exp( -(x_i - θ)² / (2σ_i²) ).
Example

Estimation of the mean of a set of samples

    θ̂_ML = argmax_θ ln f(x_1, ..., x_K; θ)|_{x_i = z_i, i = 1, ..., K}
         = argmax_θ ln L(θ)
         = argmax_θ ( -(K/2) ln(2π) - Σ_{i=1}^K (1/2) ln(σ_i²) - (1/2) Σ_{i=1}^K (z_i - θ)²/σ_i² ).

This gives (differentiate and equate to zero),

    θ̂_ML = ( Σ_{i=1}^K 1/σ_i² )^{-1} Σ_{i=1}^K z_i/σ_i².
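The closed-form estimate above is easy to check numerically. A minimal sketch (NumPy assumed; the true mean, variances, and seed are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = 3.0                                  # true common mean (hypothetical)
sigma2 = np.tile([0.5, 1.0, 2.0, 4.0], 250)   # K = 1000 known, unequal variances
z = theta0 + np.sqrt(sigma2) * rng.standard_normal(sigma2.size)

theta_sm = z.mean()                           # plain sample mean
w = 1.0 / sigma2                              # ML weights: inverse variances
theta_ml = np.sum(w * z) / np.sum(w)          # = (sum_i 1/sigma_i^2)^{-1} sum_i z_i/sigma_i^2
# theta_ml weights low-variance samples more heavily; its variance,
# (sum_i 1/sigma_i^2)^{-1}, is smaller than that of the plain sample mean.
```

With equal variances the weighted estimate reduces to the sample mean; with unequal variances it is strictly better.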
Bayesian approach

Random parameter framework

Consider θ to be a random variable with pdf: f_θ(x).

This is an a priori distribution (assumed before the experiment).

Conditional distribution (inference from the experiment)

Our model (plus assumptions) gives a conditional distribution,
    f(x_1, ..., x_K | θ).

On the basis of the experiment (x_i = z_i), Bayes' rule gives,

    Prob(θ | z_1, ..., z_K) = Prob(Z^K | θ) Prob(θ) / Prob(Z^K).

So, as the denominator does not depend on θ,

    argmax_θ f(θ | z_1, ..., z_K) = argmax_θ f(Z^K | θ) f_θ(θ).
Maximum a posteriori (MAP) estimation

Estimator

Given data, Z^K,
    θ̂_MAP = argmax_θ f(Z^K | θ) f_θ(θ).

We can interpret the maximum likelihood estimator as,
    θ̂_ML = argmax_θ f(x_1, ..., x_K; θ)|_{x_i = z_i, i = 1, ..., K} = argmax_θ f(Z^K | θ).

These estimates coincide if we assume a uniform distribution for θ.
MAP estimation

A priori parameter distribution

    f_θ(θ) = (1/√(2πσ_θ²)) e^{-(θ-θ_a)²/(2σ_θ²)},   θ_a = 5,  σ_θ² = 1.

[Figure: the Gaussian prior f_θ(θ), centred at θ_a = 5 with θ_a ± σ_θ marked.]
MAP estimation

Estimating the mean: Gaussian distribution (σ² = 0.5, θ_a = 5, σ_θ² = 1)

[Figure: surface plot over x and θ of the product]
    f(x; θ) f_θ(θ) = (1/√(2πσ²)) e^{-(x-θ)²/(2σ²)} · (1/√(2πσ_θ²)) e^{-(θ-θ_a)²/(2σ_θ²)}.
MAP estimation

Estimating the mean: Gaussian distribution (σ² = 0.5, θ_a = 5, σ_θ² = 1)

Datum: z = 7.0

[Figure: the slice at x = z of the product
    f(x; θ) f_θ(θ) = (1/√(2πσ²)) e^{-(x-θ)²/(2σ²)} · (1/√(2πσ_θ²)) e^{-(θ-θ_a)²/(2σ_θ²)}
is maximised at θ̂_MAP = 6.33; the prior pulls the estimate from z = 7.0 towards θ_a = 5.]
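For this Gaussian-likelihood, Gaussian-prior case the MAP estimate has a closed form: the log-posterior is quadratic in θ, so the maximiser is a precision-weighted average of the datum and the prior mean. A quick numerical check of the slide's numbers (NumPy assumed):

```python
import numpy as np

z, sigma2 = 7.0, 0.5           # datum and noise variance
theta_a, sigma2_th = 5.0, 1.0  # prior mean and prior variance

# ln f(z; theta) + ln f_theta(theta) is quadratic in theta, so the maximiser
# is the precision-weighted average of the datum and the prior mean.
theta_map = (z / sigma2 + theta_a / sigma2_th) / (1 / sigma2 + 1 / sigma2_th)

# Cross-check by brute-force maximisation of the (unnormalised) posterior.
thetas = np.linspace(0.0, 12.0, 120001)
post = (np.exp(-(z - thetas) ** 2 / (2 * sigma2))
        * np.exp(-(thetas - theta_a) ** 2 / (2 * sigma2_th)))
theta_grid = thetas[post.argmax()]

print(round(theta_map, 2))  # → 6.33, as in the figure
```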
Cramér-Rao bound

Mean-square error matrix

    P = E{ (θ̂(Z^K) - θ_0)(θ̂(Z^K) - θ_0)^T }.

Assume that the pdf for Z^K is f(Z^K; θ).

Cramér-Rao inequality

Assume E{θ̂(Z^K)} = θ_0, and Z^K ⊂ R^K. Then,

    P ≥ M^{-1}    (M is the Fisher information matrix),

    M = E{ (d/dθ ln f(Z^K; θ)) (d/dθ ln f(Z^K; θ))^T }|_{θ=θ_0}
      = -E{ d²/dθ² ln f(Z^K; θ) }|_{θ=θ_0}.
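For the earlier Gaussian-mean example (equal variances), M = K/σ², so the bound says no unbiased estimator can have variance below σ²/K; the sample mean attains it. A Monte Carlo sketch (NumPy assumed; the parameters are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, sigma2, K = 2.0, 0.5, 50

# Fisher information for K i.i.d. N(theta, sigma2) samples: M = K / sigma2,
# so the Cramér-Rao bound on the variance of any unbiased estimator is:
crb = sigma2 / K

# Monte Carlo: mean-square error of the sample mean over many experiments.
est = rng.normal(theta0, np.sqrt(sigma2), size=(20000, K)).mean(axis=1)
mse = ((est - theta0) ** 2).mean()
# mse ≈ crb: the sample mean attains the bound (it is efficient).
```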
Maximum likelihood: statistical properties

Asymptotic results for i.i.d. variables

Consider a parametrised family of pdfs,
    f(x_1, ..., x_K; θ) = ∏_{i=1}^K f_i(x_i; θ).

Then,
    lim_{K→∞} θ̂_ML = θ_0   with probability 1,
and
    lim_{K→∞} √K (θ̂_ML(Z^K) - θ_0) ∼ N(0, M^{-1}).
Prediction error statistics

Prediction error framework

    ε(k, θ) = y(k) - ŷ(k, θ).

Assume that ε(k, θ) is i.i.d. with pdf: f_e(x; θ).

For example, in the ARX case, ε(k, θ_0) = e(k) ∼ N(0, σ²).

Joint pdf for the prediction:
    f(X^K; θ) = ∏_{k=1}^K f_e(ε(k, θ); θ).
Prediction error statistics

Maximum likelihood approach

    θ̂_ML = argmax_θ f(X^K; θ)|_{X^K = Z^K} = argmax_θ L(θ)
         = argmax_θ ln f(Z^K | θ)
         = argmax_θ (1/K) Σ_{k=1}^K ln f_e(ε(k, θ); θ).

If we choose the prediction error cost function as,
    l(ε, θ) = -ln f_e(ε; θ),
then,
    θ̂_PE = argmin_θ (1/K) Σ_{k=1}^K l(ε(k, θ), θ) = θ̂_ML.
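The correspondence θ̂_PE = θ̂_ML depends on matching the cost to the noise pdf. For Laplacian noise, for instance, l(ε) = -ln f_e(ε) = |ε| + constant, so the prediction error estimate of a constant level is the sample median. A sketch (NumPy assumed; the scalar setup is hypothetical, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
theta0 = 1.5
z = theta0 + rng.laplace(0.0, 1.0, size=4001)   # Laplacian "prediction errors"

# l(eps) = -ln f_e(eps) = |eps| + constant, so the prediction error estimate
# minimises sum_i |z_i - theta|, whose minimiser is the sample median.
theta_pe = np.median(z)

# Brute-force maximum likelihood over a grid, for comparison.
thetas = np.linspace(-2.0, 5.0, 7001)
loglik = np.array([-np.abs(z - t).sum() for t in thetas])
theta_ml = thetas[loglik.argmax()]
# theta_ml ≈ theta_pe: the two estimators coincide (up to the grid resolution).
```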
Prediction error statistics

Example

Gaussian noise case, ε(k) ∼ N(0, σ²):

    l(ε(k, θ), θ) = -ln f_e(ε; θ) = constant + (1/2) ln σ² + (1/2) ε(k, θ)²/σ².

If σ² is constant (and not a parameter to be estimated) then,

    θ̂_ML = argmax_θ L(θ) = argmin_θ (1/K) Σ_{k=1}^K l(ε(k, θ), θ)
         = argmin_θ ||ε(k, θ)||²_2 = θ̂_PE.
Prediction error statistics

Example

If we have a linear predictor, and independent Gaussian noise, then,
    θ̂ = argmin_θ ||ε(k, θ)||²_2,

§ is a linear, least-squares problem;
§ is equivalent to minimizing Σ_{k=1}^K -ln f_e(ε; θ);
§ is equivalent to a maximum likelihood estimation;
§ gives (asymptotically) the minimum-variance parameter estimates.
Linear regression statistics

One-step ahead predictor

    ŷ(k|θ) = φ^T(k) θ + μ(k).

In the ARX case μ(k) = e(k). In other special cases μ(k) can depend on Z^K.

Prediction error: ε(k) = y(k) - φ^T(k) θ.

A typical cost function is:
    J(θ, Z^K) = (1/K) Σ_{k=0}^{K-1} ε(k)².

Least-squares criterion:

    θ̂_LS = ( (1/K) Σ_{k=0}^{K-1} φ(k) φ^T(k) )^{-1} ( (1/K) Σ_{k=0}^{K-1} φ(k) y(k) )
          = R_K^{-1} f_K,    R_K ∈ R^{d×d},  f_K ∈ R^d.
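A minimal numerical sketch of the least-squares criterion for a first-order ARX model (NumPy assumed; the system, signals, and seed are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 5000
a1, b1 = -0.7, 0.5                      # true ARX parameters, theta_0 = [a1, b1]
u = rng.standard_normal(K)              # persistently exciting input
e = 0.1 * rng.standard_normal(K)        # white noise
y = np.zeros(K)
for k in range(1, K):
    # ARX: y(k) + a1 y(k-1) = b1 u(k-1) + e(k)
    y[k] = -a1 * y[k - 1] + b1 * u[k - 1] + e[k]

# Regressor phi(k) = [-y(k-1), u(k-1)]^T, so yhat(k) = phi^T(k) theta
Phi = np.column_stack([-y[:-1], u[:-1]])
R_K = Phi.T @ Phi / (K - 1)
f_K = Phi.T @ y[1:] / (K - 1)
theta_ls = np.linalg.solve(R_K, f_K)    # = R_K^{-1} f_K
# theta_ls ≈ [a1, b1]: consistent here because e(k) is white (the ARX case).
```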
Linear regression statistics

[Block diagram: u(k) → B(θ, z) → +v(k) → 1/A(θ, z) → y(k), i.e. A(θ, z) y(k) = B(θ, z) u(k) + v(k).]

Least-squares estimator properties

The least-squares estimate can be expressed as,
    θ̂_LS = R_K^{-1} f_K.

True plant:
    y(k) = φ^T(k) θ_0 + v(k).

Asymptotic bias:

    lim_{K→∞} θ̂_LS - θ_0 = lim_{K→∞} R_K^{-1} (1/K) Σ_{k=0}^{K-1} φ(k) v(k) = (R*)^{-1} f*,

    R* = E{φ(k) φ^T(k)},   f* = E{φ(k) v(k)}.
Linear regression statistics

Consistency of the LS estimator

For consistency, lim_{K→∞} θ̂_LS = θ_0, we require,
    (R*)^{-1} f* = 0.

So,
1. R* must be non-singular: a persistency of excitation requirement.
2. f* = E{φ(k) v(k)} = 0. This happens if either:
   2a. v(k) is zero-mean and independent of φ(k); or
   2b. u(k) is independent of v(k) and G is FIR (n = 0).

This gives,
    lim_{K→∞} √K (θ̂_LS - θ_0) ∼ N(0, σ_0² (R*)^{-1}).
Correlation methods

Ideal prediction error estimator

    y(k) - ŷ(k|k-1) = ε(k) = e(k)   (ideally).

The sequence of prediction errors, {e(k), k = 0, ..., K-1}, is white.

If the estimator is optimal (θ̂ = θ_0) then the prediction errors contain no further information about the process.

Another interpretation: the prediction errors, ε(k), are uncorrelated with the experimental data, Z^K.
Correlation methods

Approach

Select a sequence, ζ(k), derived from the past data, Z^K.

Require that the error, ε(k, θ), is uncorrelated with ζ(k),

    (1/K) Σ_{k=0}^{K-1} ζ(k) ε(k, θ) = 0   (could also use a function, α(ε)).

We can view the identification problem as finding θ̂ such that this relationship is satisfied.

The values, ζ(k), are known as instruments.

Typically ζ(k) ∈ R^{d×n_y}, where θ ∈ R^d, y(k) ∈ R^{n_y}.
Correlation methods

Procedure

Choose a linear filter, F(z), for the prediction errors,
    ε_F(k, θ) = F(z) ε(k, θ)   (this is optional).

Choose a sequence of correlation vectors, ζ(k, Z^K, θ), constructed from the data (and possibly θ).

Choose a function α(ε) (the default is α(ε) = ε). Then,

    θ̂ = θ solving f_K(θ, Z^K) = (1/K) Σ_{k=0}^{K-1} ζ(k, θ) α(ε(k, θ)) = 0.
Pseudo-linear regressions

Regression-based one-step ahead predictors

For ARX, ARMAX, etc., model structures we can write the predictor,
    ŷ(k|θ) = φ^T(k, θ) θ.

We previously solved this via LS (or iterative LS, or optimisation) methods.

Correlation based solution

    θ̂_PLR = θ solving (1/K) Σ_{k=0}^{K-1} φ(k, θ) ( y(k) - φ^T(k, θ) θ ) = 0,

where y(k) - φ^T(k, θ) θ is the prediction error.

The prediction errors are orthogonal to the regressor, φ(k, θ).
Instrumental variable methods

Instrumental variables

    θ̂_IV = θ solving (1/K) Σ_{k=0}^{K-1} ζ(k, θ) ( y(k) - φ^T(k, θ) θ ) = 0.

This is solved by,

    θ̂_IV = ( (1/K) Σ_{k=0}^{K-1} ζ(k) φ^T(k) )^{-1} (1/K) Σ_{k=0}^{K-1} ζ(k) y(k).

So, for consistency we require,

    E{ζ(k) φ^T(k)}   to be nonsingular,
and
    E{ζ(k) v(k)} = 0   (instruments uncorrelated w.r.t. the prediction error).
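A minimal sketch of the IV formula on a first-order example where the disturbance is coloured, so least squares is biased but delayed inputs are valid instruments (NumPy assumed; the system, signals, and seed are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
K = 20000
a1, b1 = -0.7, 0.5                      # true parameters, theta_0 = [a1, b1]
u = rng.standard_normal(K)
e = rng.standard_normal(K)
# Coloured disturbance v(k) = 0.3 e(k) + 0.27 e(k-1): correlated with y(k-1),
# so the LS regressor and the disturbance are correlated (f* != 0).
v = 0.3 * e + 0.27 * np.concatenate(([0.0], e[:-1]))
y = np.zeros(K)
for k in range(1, K):
    y[k] = -a1 * y[k - 1] + b1 * u[k - 1] + v[k]

Phi = np.column_stack([-y[1:-1], u[1:-1]])   # phi(k) = [-y(k-1), u(k-1)]^T
Y = y[2:]
Z = np.column_stack([u[:-2], u[1:-1]])       # instruments zeta(k) = [u(k-2), u(k-1)]^T

theta_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)   # biased: E{phi v} != 0
theta_iv = np.linalg.solve(Z.T @ Phi, Z.T @ Y)       # consistent: E{zeta v} = 0
# theta_iv ≈ [a1, b1], while theta_ls shows a clear bias in the a1 estimate.
```

Here E{ζ(k)φ^T(k)} is nonsingular because u(k-2) reaches y(k-1) through b1, while the instruments are independent of the disturbance by construction.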
Example

ARX model

    y(k) + a_1 y(k-1) + ··· + a_n y(k-n) = b_1 u(k-1) + ··· + b_m u(k-m) + v(k).

One approach: filtered input signals as instruments.

[Block diagram: plant A(θ, z) y(k) = B(θ, z) u(k) + v(k), with a parallel noise-free filter generating x(k) = (P(z)/Q(z)) u(k).]

    x(k) + q_1 x(k-1) + ··· + q_n x(k-n) = p_1 u(k-1) + ··· + p_m u(k-m).
Instrumental variable example

[Block diagram: plant A(θ, z) y(k) = B(θ, z) u(k) + v(k); instrument filter x(k) = (P(z)/Q(z)) u(k).]

    ζ(k) = [ -x(k-1)  ...  -x(k-n)  u(k-1)  ...  u(k-m) ]^T.

Here,

    R_K = (1/K) Σ_{k=0}^{K-1} ζ(k) φ^T(k)   is required to be invertible,

and we also need,

    E{ (1/K) Σ_{k=0}^{K-1} ζ(k) v(k) } = 0.
Instrumental variable example

Invertibility of R_K?

    y = (B(z)/A(z)) u + (1/A(z)) v,    x = (P(z)/Q(z)) u.

So, ζ(k) φ^T(k) has the form,

    ζ(k) φ^T(k) = [ x_0^{k-1} ; u_0^{k-1} ] [ y_0^{k-1}  u_0^{k-1} ]
                = [ (P/Q) u_0^{k-1} ; u_0^{k-1} ] [ ( (B/A) u_0^{k-1} + (1/A) v_0^{k-1} )  u_0^{k-1} ]
                = [ (P/Q) u_0^{k-1} ; u_0^{k-1} ] [ (B/A) u_0^{k-1}  u_0^{k-1} ]   (invertible?)
                + [ (P/Q) u_0^{k-1} ; u_0^{k-1} ] [ (1/A) v_0^{k-1}  0 ]           (vanishing? → 0)
Instrumental variable example

    y = (B(z)/A(z)) u + (1/A(z)) v,    x = (P(z)/Q(z)) u.

    ζ(k) φ^T(k) = [ (P(z)/Q(z)) u_0^{k-1} ; u_0^{k-1} ] [ (B(z)/A(z)) u_0^{k-1}  u_0^{k-1} ]
                + [ (P(z)/Q(z)) u_0^{k-1} ; u_0^{k-1} ] [ (1/A(z)) v_0^{k-1}  0 ].

This will be invertible if:
§ v(k) and u(k) are uncorrelated;
§ u(k) and x(k) = (P(z)/Q(z)) u(k) are sufficiently exciting;
§ there are no pole/zero cancellations between P(z)/Q(z) and B(z)/A(z).
Instrumental variable approach

A nonlinear estimation problem

[Block diagram: plant y(k) = (B(θ, z)/A(θ, z)) u(k) + v(k); instrument filter x(k) = (P(z)/Q(z)) u(k).]

Choosing P(z) and Q(z)

The procedure works well when P(z) ≈ B(z) and Q(z) ≈ A(z).

Approach:
1. Estimate θ̂_LS via linear regression.
2. Select Q(z) = A_LS(z) and P(z) = B_LS(z).
3. Calculate θ̂_IV.
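The three steps can be sketched on a first-order example with a coloured disturbance (NumPy and scipy.signal.lfilter assumed available; the system, signals, and seed are hypothetical):

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(5)
K = 20000
a1, b1 = -0.7, 0.5                      # true parameters, theta_0 = [a1, b1]
u = rng.standard_normal(K)
e = rng.standard_normal(K)
v = 0.3 * e + 0.27 * np.concatenate(([0.0], e[:-1]))   # coloured disturbance
y = np.zeros(K)
for k in range(1, K):
    y[k] = -a1 * y[k - 1] + b1 * u[k - 1] + v[k]

Phi = np.column_stack([-y[:-1], u[:-1]])   # phi(k) = [-y(k-1), u(k-1)]^T
Y = y[1:]

# Step 1: least-squares estimate (biased here, but a usable starting point).
a1_ls, b1_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)

# Step 2: instruments from the noise-free model output x = (P/Q) u,
# with Q(z) = A_LS(z) and P(z) = B_LS(z):  x(k) + a1_ls x(k-1) = b1_ls u(k-1).
x = lfilter([0.0, b1_ls], [1.0, a1_ls], u)

# Step 3: instrumental variable estimate, zeta(k) = [-x(k-1), u(k-1)]^T.
Z = np.column_stack([-x[:-1], u[:-1]])
theta_iv = np.linalg.solve(Z.T @ Phi, Z.T @ Y)
# theta_iv ≈ [a1, b1]: x depends only on u, so E{zeta(k) v(k)} = 0.
```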
Instrumental variable approach

Considerations

§ Variance and MSE depend on the choice of instruments.
§ Consistency (asymptotic unbiasedness) is lost if:
  § noise and instruments are correlated (for example, in closed loop, generating instruments from u);
  § the model order selection is incorrect;
  § filter dynamics cancel plant dynamics;
  § the true system is not in the model set.
§ Closed-loop approaches: generate instruments from the excitation, r.
Bibliography

Prediction error minimization:
Lennart Ljung, System Identification: Theory for the User, 2nd Ed., Prentice-Hall, 1999, sections 7.1, 7.2 & 7.3.

Parameter estimation statistics:
Lennart Ljung, System Identification: Theory for the User, 2nd Ed., Prentice-Hall, 1999, section 7.4.

Correlation and instrumental variable methods:
Lennart Ljung, System Identification: Theory for the User, 2nd Ed., Prentice-Hall, 1999, sections 7.5 & 7.6.