A Flavour of Errors in Variables Modelling
Jonathan Gillard
Constructing the Model
• We have two variables, ξ and η.
• ξ and η are linearly related in the form η = α + βξ.
• Instead of observing the n pairs (ξi, ηi), we observe the n data pairs (xi, yi), where
xi = ξi + δi
yi = ηi + εi
and it is assumed that δi and εi are independent error terms having zero mean and variances σδ² and σε² respectively.
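A minimal sketch of generating data from this model, for illustration only. The slides specify only zero means and fixed variances for the errors; the Gaussian distributions, the function name `simulate_eiv` and all parameter values below are assumptions.

```python
import random

def simulate_eiv(n, alpha, beta, sigma_delta, sigma_eps, xi_mean, xi_sd, seed=0):
    """Simulate n observations from the errors in variables model.

    Assumption (not from the slides): the true values xi and both error
    terms are drawn from normal distributions.
    """
    rng = random.Random(seed)
    # True, unobserved values and their exact linear relationship.
    xi = [rng.gauss(xi_mean, xi_sd) for _ in range(n)]
    eta = [alpha + beta * t for t in xi]
    # Observed values: true value plus an independent zero-mean error.
    x = [t + rng.gauss(0.0, sigma_delta) for t in xi]
    y = [e + rng.gauss(0.0, sigma_eps) for e in eta]
    return xi, eta, x, y
```

For example, `simulate_eiv(200, 1.0, 2.0, 2.0, 3.0, 118.0, 8.0)` returns the hidden pairs (ξi, ηi) alongside the observed pairs (xi, yi), so estimators can be checked against known parameter values.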
Down’s Syndrome
• Affects 1 in 1000 children born in the UK.
• Down’s is caused by the presence of an extra chromosome. An extra copy of chromosome 21 is included when the sperm and the egg combine to form the embryo.
• Screening tests are used to calculate the chance of a baby having the condition.
The Data Set
[Figure: plot of the data, Log AFP (vertical axis) against Gestational Age (horizontal axis).]
How can we fit a line?
• There are clearly errors in both variables.
• “To use standard statistical techniques of estimation to estimate β, one needs additional information about the variance of the estimators” – Madansky (1959)
• We know the dating error is ±2 days – this is enough information!
Method of Moments
• “The method of moments has a long history, involves an enormous amount of literature, has been through periods of severe turmoil associated with its sampling properties compared to other estimation procedures, yet survives as an effective tool, easily implemented and of wide generality”
– Bowman and Shenton
Method of Moments
• “The maximum likelihood approach to estimation is primarily justified by asymptotic (as the sample size goes to infinity) considerations”
– Cheng and Van Ness
Estimating the Parameters
• As the dating error is ±2 days, we take σδ = 2.
• Use a modified ‘y on x’ regression estimator: β = sxy / (sxx − σδ²).
• Other parameters, e.g. the intercept α, can be estimated from the method of moments equations.
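A minimal sketch of the modified ‘y on x’ estimator with σδ known. The intercept is taken from the first-moment equation ȳ = α + βx̄; the function names and the choice of divisor n for the sample moments are assumptions, not taken from the slides.

```python
from statistics import fmean

def sample_moments(x, y):
    """Return the means of x and y, the variance of x and the covariance (divisor n)."""
    n = len(x)
    xbar, ybar = fmean(x), fmean(y)
    sxx = sum((a - xbar) ** 2 for a in x) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / n
    return xbar, ybar, sxx, sxy

def fit_known_sigma_delta(x, y, sigma_delta):
    """Method of moments fit when the error standard deviation of x is known.

    beta  = sxy / (sxx - sigma_delta**2)
    alpha = ybar - beta * xbar
    """
    xbar, ybar, sxx, sxy = sample_moments(x, y)
    denom = sxx - sigma_delta ** 2
    if denom <= 0:
        # The estimated variance of the true values would not be positive.
        raise ValueError("sxx must exceed sigma_delta**2")
    beta = sxy / denom
    alpha = ybar - beta * xbar
    return alpha, beta
```

Setting `sigma_delta=0` recovers the ordinary ‘y on x’ slope sxy / sxx, which makes the size of the correction easy to inspect.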
Regression Lines
[Figure: Log AFP (vertical axis) against Gestational Age (horizontal axis), showing the (x, y) pairs and three fitted lines: y on x, x on y, and σδ known.]
Typology of Residuals (Haslett)
• Conditional residuals: “local”
• Innovation residuals: “white noise”
• Marginal residuals: “global”
What are residuals used for?
1. Prediction
2. Model checking
3. Leverage
4. Influence
5. Deletion
Estimating the true points
• Two naive m.m.e.’s of ξi:
xi, with Var[xi] = σδ²
(yi − α)/β, with Var[(yi − α)/β] = σε²/β²
• The optimal linear combination is:
ξ̂i = (σε² xi + β σδ² (yi − α)) / (σε² + β² σδ²)
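A minimal sketch of estimating the true points by inverse-variance weighting of the two naive estimators, x (variance σδ²) and (y − α)/β (variance σε²/β²). The function name is hypothetical, and α, β, σδ and σε are treated as already known or estimated.

```python
def estimate_true_x(x, y, alpha, beta, sigma_delta, sigma_eps):
    """Combine x and (y - alpha) / beta, weighting each by the other's variance.

    Algebraically equal to
    (sigma_eps**2 * x + beta * sigma_delta**2 * (y - alpha))
        / (sigma_eps**2 + beta**2 * sigma_delta**2).
    """
    var_from_x = sigma_delta ** 2           # variance of x about xi
    var_from_y = (sigma_eps / beta) ** 2    # variance of (y - alpha)/beta about xi
    # Weight on x; the less noisy estimator receives the larger weight.
    w = var_from_y / (var_from_x + var_from_y)
    return [w * a + (1.0 - w) * (b - alpha) / beta for a, b in zip(x, y)]
```

The corresponding estimate of ηi then follows from the fitted line as α + β times the estimated ξi.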
The Estimated True Points
[Figure: Log AFP (vertical axis) against Gestational Age (horizontal axis), showing the observed (x, y) pairs and the estimated (ξ, η) pairs.]
Estimated true against observed
[Figure: estimated ξ (vertical axis) against observed x (horizontal axis), with the line y = x for reference.]
A residual?
• Attempt to write as a usual regression model:
y = α + βx + (ε - βδ)
1. x is always random due to random error
2. Cov(x, ε − βδ) = −βσδ²
3. Using ordinary least squares estimates leads to inconsistent estimators
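A minimal simulation sketch of points 2 and 3. Since x = ξ + δ contains δ, the composite error ε − βδ is correlated with x, and the ordinary ‘y on x’ slope tends to βσξ² / (σξ² + σδ²) rather than β, where σξ² denotes the variance of the true values. The Gaussian distributions, the function names and all parameter values are assumptions for illustration.

```python
import random
from statistics import fmean

def cov(u, v):
    """Sample covariance with divisor n."""
    ubar, vbar = fmean(u), fmean(v)
    return sum((a - ubar) * (b - vbar) for a, b in zip(u, v)) / len(u)

def attenuation_demo(n=20000, alpha=5.0, beta=2.0, sigma_delta=2.0,
                     sigma_eps=3.0, xi_sd=8.0, seed=1):
    """Compare simulated quantities with their theoretical limits."""
    rng = random.Random(seed)
    xi = [rng.gauss(118.0, xi_sd) for _ in range(n)]  # hypothetical true values
    delta = [rng.gauss(0.0, sigma_delta) for _ in range(n)]
    eps = [rng.gauss(0.0, sigma_eps) for _ in range(n)]
    x = [t + d for t, d in zip(xi, delta)]
    y = [alpha + beta * t + e for t, e in zip(xi, eps)]
    # Error term when the model is rewritten as y = alpha + beta*x + (eps - beta*delta).
    composite = [e - beta * d for e, d in zip(eps, delta)]
    return {
        "cov_x_error": cov(x, composite),
        "theory_cov": -beta * sigma_delta ** 2,
        "ols_slope": cov(x, y) / cov(x, x),
        "theory_ols": beta * xi_sd ** 2 / (xi_sd ** 2 + sigma_delta ** 2),
    }
```

Calling `attenuation_demo()` returns the simulated covariance and ordinary least squares slope next to their theoretical values, so the bias towards zero can be read off directly.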
Residuals
[Figure: residuals (vertical axis) against Gestational Age (horizontal axis).]
Residuals again!
[Figure: residuals (vertical axis) against Estimated Gestational Age (horizontal axis).]
Questions?