lecture 1. introduction to variational data assimilation...

Data assimilation

데이터 동화 deita donghwa

Variational Data Assimilation
Lecture 1. Introduction to Variational Data Assimilation.

Adrian Sandu (1)

(1) Computational Science Laboratory (CSL), Department of Computer Science, Virginia Tech

Ewha International School on Data Assimilation (EISDA 2012)
Seoul, Korea, 22-24 August 2012

Lecture

설교 seolgyo

1. Introduction to variational d.a.. Title. [1/82]Lecture given at EISDA 2012, Seoul, Korea, Aug. 22-24, 2012. (http://csl.cs.vt.edu)

Hello

안녕하세요 An-yeong-ha-se-yo


Data assimilation

Data assimilation = the fusion of information from imperfect model predictions and from noisy data, to obtain a consistent description of the state of a physical system, such as the atmosphere.

Approaches to solving data assimilation:
- Variational (rooted in control theory)
- Ensemble-based (rooted in statistical estimation theory)

Drivers for improvements in data assimilation are:
- Better algorithms
- Better observing systems
- Better computational platforms

Observation coverage

Figure: Observation coverage on 6 February 2009, 00 UTC ± 3h (Tremolet, 2009).


Improvements in data assimilation capabilities

Figure: Anomaly correlation of 500hPa height forecasts at D+3, D+5, and D+7, Northern hemisphere and Southern hemisphere, 1992-2007, for operations, ERA-40, and ERA-Interim. Forecast performance has increased regularly over the years. (Tremolet, 2009)


Data assimilation optimally combines three sources of information

The true state (sampled at the model grid points) $\mathbf{x}^{\rm true} \in \mathbb{R}^n$ is unknown and needs to be estimated from the available information. In order to obtain an estimate of $\mathbf{x}^{\rm true}$, data assimilation combines three different sources of information:

1. the prior information encapsulates our current knowledge of the state

2. the model encapsulates our knowledge about the physical and chemical laws that govern the evolution of the system

3. the observations are noisy and sparse snapshots of reality available at discrete times

The best estimate that optimally fuses all these sources of information is called the analysis, and is denoted by $\mathbf{x}^a$.

Source of information #1: The prior I

- The background (prior) probability density $\mathcal{P}^b(\mathbf{x})$ encapsulates our current knowledge of the tracer distribution.
- Specifically, $\mathcal{P}^b(\mathbf{x})$ describes the uncertainty with which one knows $\mathbf{x}^{\rm true}$ at present, before any (new) measurements are taken.
- The mean taken with respect to this pdf is denoted by

$$E^b[f] = \int f(\mathbf{x})\, \mathcal{P}^b(\mathbf{x})\, d\mathbf{x} .$$

- The current best estimate of the true state is called the a priori, or background, state $\mathbf{x}^b$. (This is often taken to be the mean of the background distribution, $\mathbf{x}^b = E^b[\mathbf{x}]$.)


Source of information #1: The prior II

- A typical assumption is that the random background errors $\varepsilon^b = \mathbf{x}^b - \mathbf{x}^{\rm true}$ are unbiased and have a normal pdf, i.e.,

$$\varepsilon^b = \mathbf{x}^b - \mathbf{x}^{\rm true} \in \mathcal{N}(0, \mathbf{B}) .$$

  Here $\mathbf{B} = E^b\left[\varepsilon^b (\varepsilon^b)^T\right] \in \mathbb{R}^{n \times n}$ is the background error covariance matrix.
- With many nonlinear models the normality assumption is difficult to justify, but it is nevertheless widely used because of its convenience.


Source of information #2: The model

- The model encapsulates our knowledge about the physical and chemical laws that govern the evolution of the atmospheric composition.
- The model evolves an initial state $\mathbf{x}_0 \in \mathbb{R}^n$ at the initial time $t_0$ to future state values $\mathbf{x}_i \in \mathbb{R}^n$ at future times $t_i$,

$$\mathbf{x}_i = \mathcal{M}_{t_0 \to t_i}(\mathbf{x}_0) .$$

  The size of the state space in realistic chemical transport models is very large, typically $n \in \mathcal{O}(10^8)$ variables.
- The model is always imperfect (why?). The model error over $[t_{i-1}, t_i]$ is

$$\mu_i = \mathcal{M}_{t_{i-1} \to t_i}\left(\mathbf{x}^{\rm true}_{i-1}\right) - \mathbf{x}^{\rm true}_i .$$


Source of information #3: The observations I

- Observations represent sparse and noisy snapshots of reality that are available at several discrete time moments.
- Specifically, measurements $\mathbf{y}_i \in \mathbb{R}^m$ of the true state are taken at times $t_i$, $i = 1, \dots, N$:

$$\mathbf{y}_i = \mathcal{H}^t\left(\mathbf{x}^{\rm true}_i\right) - \eta^{\rm obs}_i , \quad i = 1, \dots, N.$$

- The observation operator $\mathcal{H}^t$ maps the physical state space onto the observation space. In many practical situations $\mathcal{H}^t$ is a highly nonlinear mapping (as is the case, e.g., with satellite observation operators).
- The measurement (instrument) errors are denoted by $\eta^{\rm obs}_i$.
- At present the chemical observations are sparsely distributed, and their number is small compared to the dimension of the state space, $m \ll n$.


Source of information #3: The observations II

- The observation equation relates the true state to the observations. In order to relate the model state to the observations we also consider the relation

$$\mathbf{y}_i = \mathcal{H}(\mathbf{x}_i) - \varepsilon^{\rm obs}_i , \quad i = 1, \dots, N ,$$

$$\varepsilon^{\rm obs}_i = \mathcal{H}(\mathbf{x}_i) - \mathcal{H}\left(\mathbf{x}^{\rm true}_i\right) + \mathcal{H}\left(\mathbf{x}^{\rm true}_i\right) - \mathcal{H}^t\left(\mathbf{x}^{\rm true}_i\right) + \eta^{\rm obs}_i .$$

- The observation operator $\mathcal{H}$ maps the model state space onto the observation space.
- The observation error term $\varepsilon^{\rm obs}_i$ accounts for
  1. measurement (instrument) errors, as well as
  2. representativeness errors (i.e., errors in the accuracy with which the model can reproduce reality, and with which the numerical operator $\mathcal{H}$ approximates $\mathcal{H}^t$).


Source of information #3: The observations III

- Typically observation errors are assumed to be unbiased and normally distributed

$$\varepsilon^{\rm obs}_i \in \mathcal{N}(0, \mathbf{R}_i) , \quad i = 1, \dots, N .$$

  Moreover, observation errors at different times ($\varepsilon^{\rm obs}_i$ and $\varepsilon^{\rm obs}_j$ for $i \ne j$) are assumed to be independent.


Example: the observation operator maps the model state space into observation space

Figure: To allow model-data comparison, observation operators map the model state space to observation space. The operator $\mathcal{H}$ maps the model-computed T and q to a model-predicted radiance, which is compared with the satellite-observed radiance. (Lars Isaksen, http://www.ecmwf.int)


Result of data assimilation: The analysis

- Based on these three sources of information, data assimilation computes the analysis (posterior) probability density $\mathcal{P}^a(\mathbf{x})$. This density describes the uncertainty with which one knows $\mathbf{x}^{\rm true}$ after all the information available from measurements has been accounted for.
- The mean taken with respect to this pdf is denoted by

$$E^a[f] = \int f(\mathbf{x})\, \mathcal{P}^a(\mathbf{x})\, d\mathbf{x} .$$

- The best estimate $\mathbf{x}^a$ of the true state obtained from the analysis distribution is called the a posteriori, or analysis, state. (This can be the posterior mean $\mathbf{x}^a = E^a[\mathbf{x}]$, or a posterior mode.)
- The analysis estimation errors $\varepsilon^a = \mathbf{x}^a - \mathbf{x}^{\rm true}$ are characterized by
  - the analysis mean error (bias): $\beta^a = E^a[\varepsilon^a]$
  - the analysis error covariance matrix:

$$\mathbf{A} = E^a\left[(\varepsilon^a - \beta^a)(\varepsilon^a - \beta^a)^T\right] \in \mathbb{R}^{n \times n} .$$

The Bayesian estimation framework I

- The analysis probability density is the probability density of the state conditioned by all the available observations $\mathbf{y} = [\mathbf{y}_1, \dots, \mathbf{y}_N]$. Bayes' theorem allows one to express the analysis probability density as follows:

$$\mathcal{P}^a(\mathbf{x}) = \mathcal{P}(\mathbf{x}|\mathbf{y}) = \frac{\mathcal{P}(\mathbf{y}|\mathbf{x}) \cdot \mathcal{P}^b(\mathbf{x})}{\mathcal{P}(\mathbf{y})} .$$

- The denominator $\mathcal{P}(\mathbf{y})$ is the marginal probability density of the observations and plays the role of a scaling factor.

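The Bayesian estimation framework II

- The probability of the observations conditioned by the state, $\mathcal{P}(\mathbf{y}|\mathbf{x})$, is the probability that the observation errors assume the values $\mathcal{H}(\mathbf{x}) - \mathbf{y}$:

$$\mathcal{P}(\mathbf{y}|\mathbf{x}) = \mathcal{P}^{\rm obs}\left(\varepsilon^{\rm obs} = \mathcal{H}(\mathbf{x}) - \mathbf{y}\right) .$$

  Since the observation errors $\varepsilon^{\rm obs}_1, \dots, \varepsilon^{\rm obs}_N$ at different times $t_1, \dots, t_N$ are (considered to be) independent, we have that:

$$\mathcal{P}(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^{N} \mathcal{P}^{\rm obs}\left(\varepsilon^{\rm obs}_i\right) = \prod_{i=1}^{N} \mathcal{P}^{\rm obs}\left(\mathcal{H}(\mathbf{x}_i) - \mathbf{y}_i\right) .$$

- In practice we want to define estimators $\mathbf{x}^a$ of the true state $\mathbf{x}^{\rm true}$ that are optimal in a certain sense.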


Bayesian example. Background.

Bayesian example. Background and observation.

Bayesian example. Background, observation, and analysis.
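The Bayesian example can be mimicked numerically. The following is a minimal sketch, assuming a scalar state, a Gaussian background, and a direct observation with Gaussian error; the values `xb`, `var_b`, `y`, and `var_o` are illustrative assumptions and are not taken from the lecture. It applies Bayes' theorem on a grid and normalizes the product of the likelihood and the prior.

```python
import numpy as np

# Illustrative values (assumptions, not from the lecture).
xb, var_b = 1.0, 4.0      # background mean and error variance
y, var_o = 3.0, 1.0       # observation and observation error variance

x = np.linspace(-20.0, 25.0, 9001)   # grid over the scalar state
dx = x[1] - x[0]

prior = np.exp(-0.5 * (x - xb) ** 2 / var_b)       # P^b(x), unnormalized
likelihood = np.exp(-0.5 * (x - y) ** 2 / var_o)   # P(y|x), unnormalized

posterior = prior * likelihood                     # Bayes: P^a ~ P(y|x) P^b(x)
posterior /= posterior.sum() * dx                  # scaling by P(y)

xa_mean = (x * posterior).sum() * dx               # posterior mean
xa_mode = x[np.argmax(posterior)]                  # posterior mode
var_a = ((x - xa_mean) ** 2 * posterior).sum() * dx  # posterior variance
```

In this Gaussian setting the posterior mean and mode coincide, and the analysis variance is smaller than both the background and the observation variances.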


The meaning of “best” estimator

1. The minimum mean square error (MMSE) estimator minimizes the expected value of the square error, $\min E^a\left[\|\mathbf{x}^a - \mathbf{x}^{\rm true}\|^2\right]$. It is the mean of the posterior distribution, $\mathbf{x}^a = E^a[\mathbf{x}]$. This estimator is not practical for large scale systems, as it requires an integration in the high dimensional state space. Practical estimators are obtained by taking the mean of an approximation of the posterior distribution; see for example EnKF.

2. The maximum a posteriori (MAP) estimator is a computationally feasible estimator based on the mode of the posterior distribution; see for example variational methods.

3. The minimum variance unbiased estimator (MVUE) $\mathbf{x}^a$ has the smallest total variance ($\min \operatorname{trace} E^a\left[(\mathbf{x}^a - E^a[\mathbf{x}^a])(\mathbf{x}^a - E^a[\mathbf{x}^a])^T\right]$) among all unbiased estimators. An unbiased estimator is characterized by a zero posterior error mean (i.e., zero bias, $\beta^a = 0$). MVUE estimators are not guaranteed to exist, and when they do, they are difficult to compute in practical problems.


Different “best” estimators



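Analytical solution in the Gaussian and linear case I

Consider the ideal case where the observation operator is linear,

$$\mathcal{H}(\mathbf{x}) = \mathbf{H} \cdot \mathbf{x} , \quad \mathbf{H} \in \mathbb{R}^{m \times n} ,$$

and both the background errors and the observation errors are normally distributed:

$$\mathcal{P}^b(\mathbf{x}) = (2\pi)^{-n/2} (\det \mathbf{B})^{-1/2} \exp\left(-\tfrac{1}{2} (\mathbf{x} - \mathbf{x}^b)^T \mathbf{B}^{-1} (\mathbf{x} - \mathbf{x}^b)\right)$$

$$\mathcal{P}^{\rm obs}(\mathbf{y}|\mathbf{x}) = (2\pi)^{-m/2} (\det \mathbf{R})^{-1/2} \exp\left(-\tfrac{1}{2} (\mathbf{H}\mathbf{x} - \mathbf{y})^T \mathbf{R}^{-1} (\mathbf{H}\mathbf{x} - \mathbf{y})\right)$$

Use these probabilities in Bayes' formula. A direct calculation shows that the posterior probability density is also Gaussian, $\mathcal{P}^a(\mathbf{x}) = \mathcal{N}(\mathbf{x}^a, \mathbf{A})$,

$$\mathcal{P}^a(\mathbf{x}) = (2\pi)^{-n/2} (\det \mathbf{A})^{-1/2} \exp\left(-\tfrac{1}{2} (\mathbf{x} - \mathbf{x}^a)^T \mathbf{A}^{-1} (\mathbf{x} - \mathbf{x}^a)\right)$$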


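Analytical solution in the Gaussian and linear case II

The analysis mean $\mathbf{x}^a$ and covariance $\mathbf{A}$ are given by the Kalman filter formulae:

$$\mathbf{K} = \mathbf{B}\mathbf{H}^T \left(\mathbf{H}\mathbf{B}\mathbf{H}^T + \mathbf{R}\right)^{-1}$$

$$\mathbf{x}^a = \mathbf{x}^b + \mathbf{K}\left(\mathbf{y} - \mathbf{H}\mathbf{x}^b\right)$$

$$\mathbf{A} = \left(\mathbf{I} - \mathbf{K}\mathbf{H}\right)\mathbf{B}$$

The matrix $\mathbf{K} \in \mathbb{R}^{n \times m}$ is called the “Kalman gain” operator.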
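The Kalman filter formulae translate almost line by line into NumPy. The sketch below is illustrative only: the function name `kalman_analysis` and the small test problem (two state variables, one observation of the first variable) are assumptions made for this example, and dense matrices of this kind are feasible only for small $n$.

```python
import numpy as np

def kalman_analysis(xb, B, y, H, R):
    """Analysis mean and covariance for a linear H and Gaussian errors."""
    S = H @ B @ H.T + R                     # innovation covariance H B H^T + R
    K = np.linalg.solve(S, H @ B).T         # K = B H^T S^{-1} (S and B symmetric)
    xa = xb + K @ (y - H @ xb)              # analysis mean
    A = (np.eye(len(xb)) - K @ H) @ B       # analysis covariance
    return xa, A, K

# Hypothetical small problem.
xb = np.array([1.0, 2.0])
B = np.array([[1.0, 0.5],
              [0.5, 2.0]])
H = np.array([[1.0, 0.0]])
R = np.array([[1.0]])
y = np.array([2.0])

xa, A, K = kalman_analysis(xb, B, y, H, R)
```

Although only the first variable is observed, the second variable is also corrected, through the off-diagonal entry of $\mathbf{B}$.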


Maximum a posteriori estimator I

In the maximum a posteriori approach one looks for the argument that maximizes the posterior distribution or, equivalently, minimizes its negative logarithm:

$$\mathbf{x}^a = \arg\max_{\mathbf{x}} \mathcal{P}^a(\mathbf{x}) = \arg\min_{\mathbf{x}} \mathcal{J}(\mathbf{x}) , \quad \mathcal{J}(\mathbf{x}) = -\ln \mathcal{P}^a(\mathbf{x}) .$$

The above equation defines the maximum a posteriori (MAP) estimator. In this context the data assimilation problem is formulated as an optimization problem. Using Bayes' theorem, the cost function to be minimized can be written as

$$\mathcal{J}(\mathbf{x}) = -\ln \mathcal{P}^a(\mathbf{x}) = -\ln \mathcal{P}^b(\mathbf{x}) - \ln \mathcal{P}(\mathbf{y}|\mathbf{x}) + \text{const} .$$

The scaling factors of the probability densities, as well as the term $-\ln \mathcal{P}(\mathbf{y})$, are constant in $\mathbf{x}$ and do not influence the minimization.

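Maximum a posteriori estimator II

Under the assumption that the background errors are normally distributed we have that

$$-\ln \mathcal{P}^b(\mathbf{x}) = \tfrac{1}{2} (\mathbf{x} - \mathbf{x}^b)^T \mathbf{B}^{-1} (\mathbf{x} - \mathbf{x}^b) + \text{const} .$$

Similarly, under the assumption that the observation errors are independent and normally distributed we have that

$$-\ln \mathcal{P}(\mathbf{y}|\mathbf{x}) = \tfrac{1}{2} (\mathcal{H}(\mathbf{x}) - \mathbf{y})^T \mathbf{R}^{-1} (\mathcal{H}(\mathbf{x}) - \mathbf{y}) + \text{const} .$$

The MAP estimator is obtained as the minimizer of the cost function

$$\mathcal{J}(\mathbf{x}) = \tfrac{1}{2} (\mathbf{x} - \mathbf{x}^b)^T \mathbf{B}^{-1} (\mathbf{x} - \mathbf{x}^b) + \tfrac{1}{2} (\mathcal{H}(\mathbf{x}) - \mathbf{y})^T \mathbf{R}^{-1} (\mathcal{H}(\mathbf{x}) - \mathbf{y}) ,$$

where the constant terms have been left out.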


Maximum a posteriori estimator III

Note that if, in addition, the observation operator is linear, then the cost function is quadratic, and the minimizer can be computed explicitly by setting the gradient to zero:

$$\nabla_{\mathbf{x}} \mathcal{J}(\mathbf{x}^a) = \mathbf{B}^{-1} (\mathbf{x}^a - \mathbf{x}^b) + \mathbf{H}^T \mathbf{R}^{-1} (\mathbf{H}\mathbf{x}^a - \mathbf{y}) = 0 .$$

The result is the Kalman filter estimate for the mean. Moreover, the Hessian of the cost function coincides with the inverse of the Kalman filter analysis covariance matrix:

$$\nabla^2_{\mathbf{x},\mathbf{x}} \mathcal{J} = \mathbf{B}^{-1} + \mathbf{H}^T \mathbf{R}^{-1} \mathbf{H} = \mathbf{A}^{-1} .$$
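The equivalence between the MAP estimate and the Kalman filter analysis in the linear, Gaussian case can be checked numerically. The sketch below uses a randomly generated small problem (all sizes and matrices are assumptions made for illustration): it solves the zero-gradient condition as a linear system and compares the result with the Kalman filter formulae.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2                                  # illustrative sizes

G = rng.standard_normal((n, n))
B = G @ G.T + n * np.eye(n)                  # symmetric positive definite
R = np.diag([0.5, 2.0])
H = rng.standard_normal((m, n))
xb = rng.standard_normal(n)
y = rng.standard_normal(m)

Binv = np.linalg.inv(B)
Rinv = np.linalg.inv(R)

# MAP: (B^{-1} + H^T R^{-1} H) xa = B^{-1} xb + H^T R^{-1} y
hessian = Binv + H.T @ Rinv @ H
xa_map = np.linalg.solve(hessian, Binv @ xb + H.T @ Rinv @ y)

# Kalman filter analysis
K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
xa_kf = xb + K @ (y - H @ xb)
A = (np.eye(n) - K @ H) @ B

same_mean = np.allclose(xa_map, xa_kf)
same_cov = np.allclose(np.linalg.inv(hessian), A)
```

Both flags evaluate to `True` up to rounding errors; explicit inverses are used here only because the problem is tiny.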


Three dimensional variational data assimilation (3D-Var) I

- Variational methods solve DA in an optimal control framework.
- In 3D-Var data assimilation the observations are considered successively at times $t_1, \dots, t_N$.

Three dimensional variational data assimilation (3D-Var) II

Figure: Example of 3D-Var sequential solution procedure.

- The background state (i.e., the best state estimate at time $t_i$) is given by the model forecast, starting from the previous analysis (i.e., the best estimate at time $t_{i-1}$):

$$\mathbf{x}^b_i = \mathcal{M}_{t_{i-1} \to t_i}\left(\mathbf{x}^a_{i-1}\right) .$$

- The discrepancy between the model state $\mathbf{x}_i$ and the observations at time $t_i$, together with the departure of the state from the model forecast $\mathbf{x}^b_i$, are measured by the 3D-Var cost function:

$$\mathcal{J}(\mathbf{x}_i) = \tfrac{1}{2} (\mathbf{x}_i - \mathbf{x}^b_i)^T \mathbf{B}_i^{-1} (\mathbf{x}_i - \mathbf{x}^b_i) + \tfrac{1}{2} (\mathcal{H}(\mathbf{x}_i) - \mathbf{y}_i)^T \mathbf{R}_i^{-1} (\mathcal{H}(\mathbf{x}_i) - \mathbf{y}_i) .$$


Three dimensional variational data assimilation (3D-Var) III

- While in principle a different background covariance matrix should be used at each time, in practice the same matrix is re-used throughout the assimilation window, $\mathbf{B}_i = \mathbf{B}$, $i = 1, \dots, N$.
- The 3D-Var analysis is the MAP estimator, and is computed as the state which minimizes the cost function:

$$\mathbf{x}^a_i = \arg\min \mathcal{J}(\mathbf{x}_i) .$$

- Typically a gradient-based numerical optimization procedure is employed for the minimization. The gradient of the cost function is

$$\nabla_{\mathbf{x}_i} \mathcal{J}(\mathbf{x}_i) = \mathbf{B}_i^{-1} (\mathbf{x}_i - \mathbf{x}^b_i) + \mathbf{H}_i^T \mathbf{R}_i^{-1} (\mathcal{H}(\mathbf{x}_i) - \mathbf{y}_i) .$$

  Note that the gradient requires the computation of the linearized observation operator $\mathbf{H}_i = \mathcal{H}'(\mathbf{x}_i)$ about the current state.


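Three dimensional variational data assimilation (3D-Var) IV

- For a linear observation operator:

$$\nabla_{\mathbf{x}_i} \mathcal{J}(\mathbf{x}^a_i) = 0 \;\Rightarrow\; \left(\mathbf{B}_i^{-1} + \mathbf{H}_i^T \mathbf{R}_i^{-1} \mathbf{H}_i\right) \cdot \mathbf{x}^a_i = \mathbf{B}_i^{-1} \cdot \mathbf{x}^b_i + \mathbf{H}_i^T \mathbf{R}_i^{-1} \mathbf{y}_i .$$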
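For a nonlinear observation operator the 3D-Var gradient needs the Jacobian $\mathcal{H}'(\mathbf{x})$. The following sketch evaluates the 3D-Var cost function and its gradient and checks the gradient against central finite differences. The observation operator `obs_op`, its Jacobian `obs_jac`, and all numerical values are hypothetical and chosen only to keep the example small.

```python
import numpy as np

def obs_op(x):
    # Hypothetical nonlinear observation operator H(x).
    return np.array([x[0] ** 2, x[0] * x[1]])

def obs_jac(x):
    # Its Jacobian H'(x), the linearized observation operator.
    return np.array([[2.0 * x[0], 0.0],
                     [x[1], x[0]]])

def cost(x, xb, Binv, y, Rinv):
    db = x - xb
    do = obs_op(x) - y
    return 0.5 * db @ Binv @ db + 0.5 * do @ Rinv @ do

def gradient(x, xb, Binv, y, Rinv):
    return Binv @ (x - xb) + obs_jac(x).T @ Rinv @ (obs_op(x) - y)

def fd_gradient(f, x, eps=1e-6):
    # Central finite differences, used only as a correctness check.
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return g

xb = np.array([1.0, 2.0])
Binv = np.linalg.inv(np.array([[1.0, 0.3], [0.3, 0.5]]))
y = np.array([1.5, 2.5])
Rinv = np.diag([1.0 / 0.1, 1.0 / 0.2])

x = np.array([1.2, 1.7])
g = gradient(x, xb, Binv, y, Rinv)
g_fd = fd_gradient(lambda z: cost(z, xb, Binv, y, Rinv), x)
```

A gradient check of this kind is a common sanity test before handing `cost` and `gradient` to an optimization routine.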


Four dimensional variational data assimilation (4D-Var) I

- 4D-Var = 3D-Var + time
- In 4D-Var data assimilation all observations at all times $t_1, \dots, t_N$ are considered simultaneously.


Four dimensional variational data assimilation (4D-Var) II

Figure: Example of 4D-Var smoothing procedure.

- The control parameters are the initial conditions $\mathbf{x}_0$; they uniquely determine the future states of the system via the model equation.
- The MAP estimate $\mathbf{x}^a_0$ is the minimizer of the 4D-Var cost function:

$$\mathcal{J}(\mathbf{x}_0) = \tfrac{1}{2} (\mathbf{x}_0 - \mathbf{x}^b_0)^T \mathbf{B}_0^{-1} (\mathbf{x}_0 - \mathbf{x}^b_0) + \tfrac{1}{2} \sum_{i=1}^{N} (\mathcal{H}(\mathbf{x}_i) - \mathbf{y}_i)^T \mathbf{R}_i^{-1} (\mathcal{H}(\mathbf{x}_i) - \mathbf{y}_i)$$

  Note that the departure of the initial conditions from the background is weighted by the inverse background covariance matrix, while the differences between the model predictions $\mathcal{H}(\mathbf{x}_i)$ and the observations $\mathbf{y}_i$ are weighted by the inverse observation error covariances.

Four dimensional variational data assimilation (4D-Var) III

- The 4D-Var analysis is computed as the initial condition which minimizes the cost function subject to the model equation constraints:

$$\mathbf{x}^a_0 = \arg\min \mathcal{J}(\mathbf{x}_0) \quad \text{subject to:} \quad \mathbf{x}_i = \mathcal{M}_{t_0 \to t_i}(\mathbf{x}_0) , \quad i = 1, \dots, N.$$

- The model propagates the optimal initial condition forward in time to provide the analysis at future times, $\mathbf{x}^a_i = \mathcal{M}_{t_0 \to t_i}(\mathbf{x}^a_0)$.


Four dimensional variational data assimilation (4D-Var) IV

- The large scale optimization problem is solved numerically using a gradient-based technique. The gradient reads

$$\nabla_{\mathbf{x}_0} \mathcal{J}(\mathbf{x}_0) = \mathbf{B}_0^{-1} (\mathbf{x}_0 - \mathbf{x}^b_0) + \sum_{i=1}^{N} \left(\frac{\partial \mathbf{x}_i}{\partial \mathbf{x}_0}\right)^T \mathbf{H}_i^T \mathbf{R}_i^{-1} (\mathcal{H}(\mathbf{x}_i) - \mathbf{y}_i)$$

- The 4D-Var gradient requires:
  - the linearized observation operator $\mathbf{H}_i = \mathcal{H}'(\mathbf{x}_i)$, and
  - the transposed derivatives of future states with respect to the initial conditions, $(\partial \mathbf{x}_i / \partial \mathbf{x}_0)^T$.
- The 4D-Var gradient can be obtained effectively by forcing the adjoint model with observation increments, and running it backwards in time. The construction of an adjoint model is a nontrivial task.


Example: Lorenz (three variables). “True” solution and observations

Figure: The Lorenz three-variable system. “True” solution and observations.

Example: Lorenz (three variables). “True” and “background” solutions

Figure: The Lorenz three-variable system. “True” and background solutions.

Example: Lorenz (three variables). 4D-Var solution after 2 iterations

Figure: The Lorenz three-variable system. 4D-Var solution, 2 optimization iterations.

Example: Lorenz (three variables). 4D-Var solution after 20 iterations

Figure: The Lorenz three-variable system. 4D-Var solution, 20 optimization iterations.


Three dimensional variational data assimilation (3D-Var) I

- The background state (i.e., the best state estimate at time $t_i$):

$$\mathbf{x}^b_i = \mathcal{M}_{t_{i-1} \to t_i}\left(\mathbf{x}^a_{i-1}\right) .$$

- The discrepancy between the model state $\mathbf{x}_i$ and the observations at time $t_i$, together with the departure of the state from the model forecast $\mathbf{x}^b_i$, are measured by the 3D-Var cost function:

$$\mathcal{J}(\mathbf{x}_i) = \tfrac{1}{2} (\mathbf{x}_i - \mathbf{x}^b_i)^T \mathbf{B}_i^{-1} (\mathbf{x}_i - \mathbf{x}^b_i) + \tfrac{1}{2} (\mathcal{H}(\mathbf{x}_i) - \mathbf{y}_i)^T \mathbf{R}_i^{-1} (\mathcal{H}(\mathbf{x}_i) - \mathbf{y}_i) .$$

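Three dimensional variational data assimilation (3D-Var) II

- The 3D-Var analysis is the MAP estimator, and is computed as the state which minimizes the cost function:

$$\mathbf{x}^a_i = \arg\min \mathcal{J}(\mathbf{x}_i) .$$

- The optimality condition is:

$$\nabla_{\mathbf{x}_i} \mathcal{J}(\mathbf{x}^a_i) = \mathbf{B}_i^{-1} (\mathbf{x}^a_i - \mathbf{x}^b_i) + \mathbf{H}_i^T \mathbf{R}_i^{-1} (\mathcal{H}(\mathbf{x}^a_i) - \mathbf{y}_i) = 0 .$$

- The gradient requires the computation of the linearized observation operator $\mathbf{H}_i = \mathcal{H}'(\mathbf{x}_i)$ about the current state.
- If the observation operator is linear this is a linear system:

$$\left(\mathbf{B}_i^{-1} + \mathbf{H}_i^T \mathbf{R}_i^{-1} \mathbf{H}_i\right) \cdot \mathbf{x}^a_i = \mathbf{H}_i^T \mathbf{R}_i^{-1} \mathbf{y}_i + \mathbf{B}_i^{-1} \mathbf{x}^b_i .$$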


Three dimensional variational data assimilation (3D-Var) III

- Typically a nonlinear gradient-based unconstrained minimization procedure is employed.
- Preconditioning is often used to improve the convergence of the numerical optimization. A change of variables is performed by shifting the state and scaling it with the inverse square root of the background covariance,

$$\widetilde{\mathbf{x}}_i = \mathbf{B}_i^{-1/2} (\mathbf{x}_i - \mathbf{x}^b_i) ,$$

  and carrying out the optimization with the new variables $\widetilde{\mathbf{x}}_i$. In these variables the background term of the cost function becomes $\tfrac{1}{2} \widetilde{\mathbf{x}}_i^T \widetilde{\mathbf{x}}_i$.


Strongly constrained 4D-Var I

The cost function:

$$\mathcal{J}(\mathbf{x}_0, \dots, \mathbf{x}_N) = \tfrac{1}{2} (\mathbf{x}_0 - \mathbf{x}^b_0)^T \mathbf{B}_0^{-1} (\mathbf{x}_0 - \mathbf{x}^b_0) + \tfrac{1}{2} \sum_{i=1}^{N} (\mathcal{H}(\mathbf{x}_i) - \mathbf{y}_i)^T \mathbf{R}_i^{-1} (\mathcal{H}(\mathbf{x}_i) - \mathbf{y}_i)$$

The 4D-Var analysis is computed as the initial condition which minimizes the cost function subject to the model equation constraints:

$$[\mathbf{x}^a_0, \dots, \mathbf{x}^a_N] = \arg\min \mathcal{J}(\mathbf{x}_0, \dots, \mathbf{x}_N) \quad \text{s.t.:} \quad \mathbf{x}_i = \mathcal{M}_{t_{i-1} \to t_i}(\mathbf{x}_{i-1}) , \quad i = 1, \dots, N.$$

The model propagates the optimal initial condition forward in time to provide the analysis at future times, $\mathbf{x}^a_i = \mathcal{M}_{t_0 \to t_i}(\mathbf{x}^a_0)$.

Strongly constrained 4D-Var II

Comments.
- 4D-Var determines the analysis state at every gridpoint and at every time within the analysis window, i.e., provides a “four-dimensional analysis”.
- Strongly constrained 4D-Var assumes that the observation operators and the model are perfect. As a consequence, the analysis corresponds to a trajectory (i.e., an integration) of the model.

Full space solution I

- “Full space”: all model states $\mathbf{x} = [\mathbf{x}_0, \dots, \mathbf{x}_N]$ are optimization variables.
- Use the Lagrange multipliers approach to transform the equality-constrained problem into an unconstrained minimization problem.
- Use a Lagrange multiplier for each of the constraints and build the Lagrangian function

$$\mathcal{L}(\mathbf{x}, \lambda) = \mathcal{J}(\mathbf{x}) - \sum_{i=1}^{N} \lambda_i^T \cdot \left(\mathbf{x}_i - \mathcal{M}_{t_{i-1} \to t_i}(\mathbf{x}_{i-1})\right) .$$

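Full space solution II

- The necessary conditions for a function minimizer are

$$(O) \quad \frac{d\mathcal{L}}{d\mathbf{x}_0} = \mathbf{B}_0^{-1} (\mathbf{x}_0 - \mathbf{x}^b_0) + \left(\frac{d}{d\mathbf{x}_0} \mathcal{M}_{t_0 \to t_1}(\mathbf{x}_0)\right)^T \cdot \lambda_1 = 0$$

$$(A) \quad \frac{d\mathcal{L}}{d\mathbf{x}_i} = \mathbf{H}_i^T \mathbf{R}_i^{-1} (\mathcal{H}(\mathbf{x}_i) - \mathbf{y}_i) - \lambda_i + \left(\frac{d}{d\mathbf{x}_i} \mathcal{M}_{t_i \to t_{i+1}}(\mathbf{x}_i)\right)^T \cdot \lambda_{i+1} = 0 , \quad i = 1, \dots, N-1,$$

$$(A) \quad \frac{d\mathcal{L}}{d\mathbf{x}_N} = \mathbf{H}_N^T \mathbf{R}_N^{-1} (\mathcal{H}(\mathbf{x}_N) - \mathbf{y}_N) - \lambda_N = 0$$

$$(F) \quad \frac{d\mathcal{L}}{d\lambda_i} = -\left(\mathbf{x}_i - \mathcal{M}_{t_{i-1} \to t_i}(\mathbf{x}_{i-1})\right) = 0 , \quad i = 1, \dots, N .$$

- It is convenient to impose the “forward model” (F) condition first, to obtain the state variables:

$$\mathbf{x}_i = \mathcal{M}_{t_{i-1} \to t_i}(\mathbf{x}_{i-1}) , \quad i = 1, \dots, N .$$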


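Full space solution III

- Next we impose the “adjoint model” (A) conditions, which define the Lagrange multipliers (a.k.a. adjoint variables):

$$\lambda_N = \mathbf{H}_N^T \mathbf{R}_N^{-1} (\mathcal{H}(\mathbf{x}_N) - \mathbf{y}_N)$$

$$\begin{aligned}
\lambda_i &= \left(\frac{d}{d\mathbf{x}_i} \mathcal{M}_{t_i \to t_{i+1}}(\mathbf{x}_i)\right)^T \cdot \lambda_{i+1} + \mathbf{H}_i^T \mathbf{R}_i^{-1} (\mathcal{H}(\mathbf{x}_i) - \mathbf{y}_i) \\
&= \left(\frac{d\mathbf{x}_{i+1}}{d\mathbf{x}_i}\right)^T \cdot \lambda_{i+1} + \mathbf{H}_i^T \mathbf{R}_i^{-1} (\mathcal{H}(\mathbf{x}_i) - \mathbf{y}_i) , \quad i = N-1, \dots, 1 ,
\end{aligned}$$

$$\lambda_0 = \left(\frac{d}{d\mathbf{x}_0} \mathcal{M}_{t_0 \to t_1}(\mathbf{x}_0)\right)^T \cdot \lambda_1 = \left(\frac{d\mathbf{x}_1}{d\mathbf{x}_0}\right)^T \cdot \lambda_1 .$$

Note that
- the adjoint model runs backwards in time, and
- the model-observation mismatch is a forcing term in the adjoint model.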


Full space solution IV

- The remaining “optimality” (O) condition reads

$$\mathbf{B}_0^{-1} (\mathbf{x}_0 - \mathbf{x}^b_0) + \lambda_0 = 0 .$$


Reduced space solution I

- “Reduced space”: only $\mathbf{x}_0$ is an optimization variable.
- Eliminate the constraints by running the model in each iteration, $\mathbf{x}_i = \mathcal{M}_{t_0 \to t_i}(\mathbf{x}_0)$.
- The reduced gradient reads

$$\nabla_{\mathbf{x}_0} \mathcal{J}(\mathbf{x}_0) = \mathbf{B}_0^{-1} (\mathbf{x}_0 - \mathbf{x}^b_0) + \sum_{i=1}^{N} \left(\frac{\partial \mathbf{x}_i}{\partial \mathbf{x}_0}\right)^T \mathbf{H}_i^T \mathbf{R}_i^{-1} (\mathcal{H}(\mathbf{x}_i) - \mathbf{y}_i)$$

- The 4D-Var gradient requires not only the linearized observation operator $\mathbf{H}_i = \mathcal{H}'(\mathbf{x}_i)$, but also the transposed derivatives of future states with respect to the initial conditions, $(\partial \mathbf{x}_i / \partial \mathbf{x}_0)^T$.

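Reduced space solution II

- The transposed chain rule gives us that

$$\frac{\partial \mathbf{x}_i}{\partial \mathbf{x}_0} = \frac{\partial \mathbf{x}_i}{\partial \mathbf{x}_{i-1}} \cdot \frac{\partial \mathbf{x}_{i-1}}{\partial \mathbf{x}_{i-2}} \cdots \frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}$$

$$\left(\frac{\partial \mathbf{x}_i}{\partial \mathbf{x}_0}\right)^T \cdot \mathbf{v} = \left(\frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}\right)^T \cdot \left(\frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1}\right)^T \cdots \left(\frac{\partial \mathbf{x}_i}{\partial \mathbf{x}_{i-1}}\right)^T \cdot \mathbf{v} .$$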


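Reduced space solution III

- The sum of transposed derivatives times a vector can be obtained by forcing the adjoint model with observation increments, and running it backwards in time. For example, for $N = 3$:

$$\mathbf{v}_i = \mathbf{H}_i^T \mathbf{R}_i^{-1} (\mathcal{H}(\mathbf{x}_i) - \mathbf{y}_i) , \quad i = 1, 2, 3$$

$$\begin{aligned}
E &= \left(\frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}\right)^T \cdot \mathbf{v}_1 + \left(\frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_0}\right)^T \cdot \mathbf{v}_2 + \left(\frac{\partial \mathbf{x}_3}{\partial \mathbf{x}_0}\right)^T \cdot \mathbf{v}_3 \\
&= \left(\frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}\right)^T \cdot \mathbf{v}_1 + \left(\frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}\right)^T \cdot \left(\frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1}\right)^T \cdot \mathbf{v}_2 \\
&\quad + \left(\frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}\right)^T \cdot \left(\frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1}\right)^T \cdot \left(\frac{\partial \mathbf{x}_3}{\partial \mathbf{x}_2}\right)^T \cdot \mathbf{v}_3
\end{aligned}$$

  This expression can be evaluated iteratively.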


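Reduced space solution IV

- Iteration 3:

$$\lambda_3 = \mathbf{v}_3$$

$$\begin{aligned}
E &= \left(\frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}\right)^T \cdot \mathbf{v}_1 + \left(\frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}\right)^T \cdot \left(\frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1}\right)^T \cdot \mathbf{v}_2 \\
&\quad + \left(\frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}\right)^T \cdot \left(\frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1}\right)^T \cdot \left(\frac{\partial \mathbf{x}_3}{\partial \mathbf{x}_2}\right)^T \cdot \lambda_3
\end{aligned}$$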


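Reduced space solution V

- Iteration 2:

$$\lambda_2 = \left(\frac{\partial \mathbf{x}_3}{\partial \mathbf{x}_2}\right)^T \cdot \lambda_3 + \mathbf{v}_2$$

$$\begin{aligned}
E &= \left(\frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}\right)^T \cdot \mathbf{v}_1 + \left(\frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}\right)^T \cdot \left(\frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1}\right)^T \cdot \mathbf{v}_2 \\
&\quad + \left(\frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}\right)^T \cdot \left(\frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1}\right)^T \cdot \left(\frac{\partial \mathbf{x}_3}{\partial \mathbf{x}_2}\right)^T \cdot \lambda_3 \\
&= \left(\frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}\right)^T \cdot \mathbf{v}_1 + \left(\frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}\right)^T \cdot \left(\frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1}\right)^T \cdot \lambda_2
\end{aligned}$$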


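Reduced space solution VI

- Iteration 1:

$$\lambda_1 = \left(\frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1}\right)^T \cdot \lambda_2 + \mathbf{v}_1$$

$$\begin{aligned}
E &= \left(\frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}\right)^T \cdot \mathbf{v}_1 + \left(\frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}\right)^T \cdot \left(\frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1}\right)^T \cdot \lambda_2 \\
&= \left(\frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}\right)^T \cdot \lambda_1
\end{aligned}$$

- Iteration 0:

$$E = \lambda_0 = \left(\frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}\right)^T \cdot \lambda_1$$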


Reduced space solution VII

- The reduced gradient reads

$$\nabla_{\mathbf{x}_0} \mathcal{J}(\mathbf{x}_0) = \mathbf{B}_0^{-1} (\mathbf{x}_0 - \mathbf{x}^b_0) + \lambda_0$$

  We note that the optimality condition in the full space approach simply states that $\nabla_{\mathbf{x}_0} \mathcal{J}(\mathbf{x}_0) = 0$.
- The construction of an adjoint model is a nontrivial task.
- A typical gradient-based minimization requires 10-100 iterations.
- Each iteration requires one forward model solution, plus one backward adjoint solution (cost 2-3 times that of the forward model).
- The total cost is therefore 30 to 400 times that of the forward model integration.
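The backward recursion for the adjoint variables can be sketched in a few lines. The example below is a simplified illustration, not the lecture's implementation: it assumes a linear, time-invariant model $\mathbf{x}_{i+1} = \mathbf{M}\mathbf{x}_i$, a linear observation operator, and a constant $\mathbf{R}$, so that the adjoint of the model step is simply $\mathbf{M}^T$. The function name `fourdvar_cost_and_gradient` and all numerical values are hypothetical. The adjoint gradient is checked against central finite differences.

```python
import numpy as np

def fourdvar_cost_and_gradient(x0, xb0, B0inv, M, H, Rinv, ys):
    """Cost and reduced gradient for the linear model x_{i+1} = M x_i."""
    N = len(ys)
    xs = [x0]
    for _ in range(N):                       # forward model sweep
        xs.append(M @ xs[-1])

    d0 = x0 - xb0
    J = 0.5 * d0 @ B0inv @ d0
    for i in range(1, N + 1):
        r = H @ xs[i] - ys[i - 1]
        J += 0.5 * r @ Rinv @ r

    # Adjoint sweep, backwards in time: lam_N = v_N, lam_i = M^T lam_{i+1} + v_i
    lam = H.T @ Rinv @ (H @ xs[N] - ys[N - 1])
    for i in range(N - 1, 0, -1):
        lam = M.T @ lam + H.T @ Rinv @ (H @ xs[i] - ys[i - 1])
    lam0 = M.T @ lam                         # lam_0 = M^T lam_1

    return J, B0inv @ d0 + lam0

rng = np.random.default_rng(1)
n, N = 3, 4
M = np.array([[0.9, 0.1, 0.0],
              [-0.1, 0.9, 0.1],
              [0.0, -0.1, 0.9]])
H = np.array([[1.0, 0.0, 0.0]])
Rinv = np.array([[4.0]])
B0inv = np.eye(n)
xb0 = np.zeros(n)
ys = [rng.standard_normal(1) for _ in range(N)]
x0 = rng.standard_normal(n)

J, g = fourdvar_cost_and_gradient(x0, xb0, B0inv, M, H, Rinv, ys)

# Finite-difference check of the adjoint gradient.
cost_only = lambda z: fourdvar_cost_and_gradient(z, xb0, B0inv, M, H, Rinv, ys)[0]
eps = 1e-6
g_fd = np.array([(cost_only(x0 + eps * e) - cost_only(x0 - eps * e)) / (2 * eps)
                 for e in np.eye(n)])
```

One forward sweep plus one backward sweep yields the full gradient, whereas the finite-difference check needs two forward sweeps per state variable; this is why the adjoint approach is preferred for large $n$.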


Control flow of the adjoint-based optimization procedure

The construction of adjoint models is a labor-intensive, error-prone task: O(10) FTEs.

Figure: Control flow of the adjoint-based optimization. Components: observations, forward CTM model evolution, check-pointing files, backward adjoint model integration, cost function, gradients, optimization, update of control variables.


Steepest descent method

At the current step we have an estimate $\mathbf{x}_0^{(k)}$ of the optimal solution. Denote:

- $\mathcal{J}^{(k)} = \mathcal{J}\left(\mathbf{x}_0^{(k)}\right)$, the current cost function value
- $\mathbf{g}^{(k)} = \nabla_{\mathbf{x}_0} \mathcal{J}\left(\mathbf{x}_0^{(k)}\right)$, the current gradient

Update the initial condition:

$$\mathbf{x}_0^{(k+1)} = \mathbf{x}_0^{(k)} - \alpha^{(k)} \mathbf{g}^{(k)} .$$

The step size is computed by a line search procedure:

$$\alpha^{(k)} = \arg\min_{\alpha} \mathcal{J}\left(\mathbf{x}_0^{(k)} - \alpha\, \mathbf{g}^{(k)}\right) .$$

Issues with the steepest descent method:
- The convergence is slow: it needs very many iterations.
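The steepest descent iteration can be sketched on a quadratic cost function, for which the line search has a closed form. This is an illustrative assumption: the cost $\mathcal{J}(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T \mathbf{A}\mathbf{x} - \mathbf{b}^T\mathbf{x}$, the matrix `A`, and the vector `b` below are made up for the example, and a general nonlinear cost would need a numerical line search instead.

```python
import numpy as np

def steepest_descent(A, b, x, iters, tol=1e-14):
    """Minimize J(x) = 0.5 x^T A x - b^T x, A symmetric positive definite."""
    costs = []
    for _ in range(iters):
        g = A @ x - b                        # current gradient g^(k)
        costs.append(0.5 * x @ A @ x - b @ x)
        if np.linalg.norm(g) < tol:
            break
        alpha = (g @ g) / (g @ A @ g)        # exact line search for a quadratic
        x = x - alpha * g                    # x^(k+1) = x^(k) - alpha^(k) g^(k)
    return x, costs

A = np.diag([1.0, 10.0])                     # condition number 10
b = np.array([1.0, 1.0])
x_sd, costs = steepest_descent(A, b, np.zeros(2), iters=200)
```

Even for this two-variable problem with a modest condition number the iteration needs many steps to reach high accuracy, which illustrates the slow convergence noted above.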

Practical optimization algorithms

See the coming lecture by Dr. Xu:
- Newton's method (gradient & Hessian)
- Quasi-Newton methods (gradient only)
- Nonlinear conjugate gradients (gradient only)
- Truncated Newton, Gauss-Newton, etc. (gradient & approximate Hessian)


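Newton's method

At the current step we have an estimate $\mathbf{x}_0^{(k)}$ of the optimal solution. Denote:

- $\mathcal{J}^{(k)} = \mathcal{J}\left(\mathbf{x}_0^{(k)}\right)$, the current cost function value
- $\mathbf{g}^{(k)} = \nabla_{\mathbf{x}_0} \mathcal{J}\left(\mathbf{x}_0^{(k)}\right)$, the current gradient
- $\mathbf{H}^{(k)} = \nabla^2_{\mathbf{x}_0,\mathbf{x}_0} \mathcal{J}\left(\mathbf{x}_0^{(k)}\right)$, the current Hessian

Build a quadratic model of $\mathcal{J}(\mathbf{x}_0)$, valid in a neighborhood of $\mathbf{x}_0^{(k)}$, using a Taylor expansion:

$$\mathcal{J}\left(\mathbf{x}_0^{(k)} + \mathbf{s}\right) \approx \mathcal{J}^{(k)} + \mathbf{s}^T \cdot \mathbf{g}^{(k)} + \tfrac{1}{2}\, \mathbf{s}^T \mathbf{H}^{(k)} \mathbf{s} .$$

If $\mathbf{H}^{(k)}$ is positive definite, then there is a unique minimizer of the quadratic model. Find the point that minimizes the quadratic model by solving the linear system

$$\mathbf{H}^{(k)} \cdot \mathbf{s}^{(k)} = -\mathbf{g}^{(k)} .$$

Use this solution to update the initial condition:

$$\mathbf{x}_0^{(k+1)} = \mathbf{x}_0^{(k)} + \mathbf{s}^{(k)} .$$

Issues with Newton's method:
- $\mathbf{H}^{(k)} \in \mathbb{R}^{n \times n}$ is huge, and difficult to compute
- The solution of the linear system is very expensive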


Truncated Newton method

1. Solve the Newton equation at each step by a linear conjugate gradient procedure. This gives an “inner iteration”, which is truncated early, since we are only interested in obtaining a descent direction, not in a quality solution. The full procedure is called truncated Newton.

2. The CG method avoids the use of the full Hessian, and requires only Hessian times vector products. They can be obtained:

- By finite differencing the adjoint gradient

$$\nabla^2 \mathcal{J}(\mathbf{x}_0) \cdot \mathbf{v} \approx \frac{\nabla \mathcal{J}(\mathbf{x}_0 + \varepsilon\, \mathbf{v}) - \nabla \mathcal{J}(\mathbf{x}_0)}{\varepsilon}$$

- By implementing and running a second order adjoint model
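The two ingredients of truncated Newton, Hessian-vector products by finite differences and an early-terminated inner CG iteration, can be sketched as follows. The cost function (a quadratic plus a quartic term), the matrix `A`, the vector `b`, and the function names are assumptions chosen for illustration; in 4D-Var the call to `grad` would be an adjoint model run.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(x):
    # Gradient of the illustrative cost 0.5 x^T A x - b^T x + 0.25 sum(x^4).
    return A @ x - b + x ** 3

def hess_vec_fd(x, v, eps=1e-6):
    # Hessian-vector product from two gradient evaluations.
    return (grad(x + eps * v) - grad(x)) / eps

def truncated_cg(x, g, max_inner):
    # Approximately solve (Hessian) s = -g with at most max_inner CG steps.
    s = np.zeros_like(g)
    r = -g                                   # residual of the Newton system at s = 0
    p = r.copy()
    for _ in range(max_inner):
        Hp = hess_vec_fd(x, p)
        alpha = (r @ r) / (p @ Hp)
        s = s + alpha * p
        r_new = r - alpha * Hp
        if np.linalg.norm(r_new) < 1e-12:
            break
        p = r_new + (r_new @ r_new) / (r @ r) * p
        r = r_new
    return s

x = np.array([1.0, -0.5])
g = grad(x)
s = truncated_cg(x, g, max_inner=1)          # truncated after one inner iteration

v = np.array([1.0, 2.0])
hv_fd = hess_vec_fd(x, v)
hv_exact = (A + 3.0 * np.diag(x ** 2)) @ v   # exact Hessian, for comparison only
```

Even after a single inner iteration the direction `s` satisfies `s @ g < 0`, i.e., it is a descent direction, which is all the outer iteration requires.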


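Nonlinear conjugate gradients

The nonlinear conjugate gradients method extends the linear CG algorithm to minimize nonlinear functions. The algorithm is sketched below:

First search direction: $\mathbf{s}^{(0)} = -\mathbf{g}^{(0)}$

for $k = 0, 1, \dots$

  Compute the step size by line search: $\alpha^{(k)} = \arg\min_{\alpha} \mathcal{J}\left(\mathbf{x}_0^{(k)} + \alpha\, \mathbf{s}^{(k)}\right)$

  Update the solution: $\mathbf{x}_0^{(k+1)} = \mathbf{x}_0^{(k)} + \alpha^{(k)} \mathbf{s}^{(k)}$

  Update the gradient: $\mathbf{g}^{(k+1)} = \nabla \mathcal{J}\left(\mathbf{x}_0^{(k+1)}\right)$

  Compute

$$\beta^{(k+1)} = \underbrace{\frac{\mathbf{g}^{(k+1)\,T} \mathbf{g}^{(k+1)}}{\mathbf{g}^{(k)\,T} \mathbf{g}^{(k)}}}_{\text{Fletcher-Reeves}} \quad \text{or} \quad \underbrace{\frac{\left(\mathbf{g}^{(k+1)} - \mathbf{g}^{(k)}\right)^T \mathbf{g}^{(k+1)}}{\mathbf{g}^{(k)\,T} \mathbf{g}^{(k)}}}_{\text{Polak-Ribiere}} \quad \text{or} \ \dots$$

  Update the search direction: $\mathbf{s}^{(k+1)} = -\mathbf{g}^{(k+1)} + \beta^{(k+1)} \mathbf{s}^{(k)}$

end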
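The sketched algorithm maps directly to code. The example below uses the Fletcher-Reeves formula and, as a simplifying assumption, tests it on a quadratic cost $\tfrac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{b}^T\mathbf{x}$ for which the line search is available in closed form; `A`, `b`, and the function names are hypothetical. For a genuinely nonlinear cost the `line_search` argument would be replaced by a numerical line search.

```python
import numpy as np

def nonlinear_cg(grad, line_search, x, iters, tol=1e-12):
    g = grad(x)
    s = -g                                   # first search direction
    for _ in range(iters):
        if np.linalg.norm(g) < tol:
            break
        alpha = line_search(x, s)            # step size along s
        x = x + alpha * s                    # update solution
        g_new = grad(x)                      # update gradient
        beta = (g_new @ g_new) / (g @ g)     # Fletcher-Reeves
        s = -g_new + beta * s                # update search direction
        g = g_new
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])              # symmetric positive definite
b = np.array([1.0, 2.0, 3.0])

grad = lambda x: A @ x - b
exact_line_search = lambda x, s: -(grad(x) @ s) / (s @ A @ s)

x_cg = nonlinear_cg(grad, exact_line_search, np.zeros(3), iters=3)
```

On a quadratic with exact line searches the method coincides with linear CG and reaches the minimizer in at most $n$ iterations (here $n = 3$), up to rounding errors.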


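Quasi-Newton methods I

Quasi-Newton methods approximate the inverse of the Hessian $\mathbf{H}^{-1}$ by a symmetric, positive definite matrix $\mathbf{B}$ which is updated at each step:

$$\left(\mathbf{H}^{(k)}\right)^{-1} \approx \mathbf{B}^{(k)}$$

The algorithm proceeds as follows:

1. Compute the search direction:

$$\mathbf{s}^{(k)} = -\mathbf{B}^{(k)} \cdot \mathbf{g}^{(k)} \approx -\left(\mathbf{H}^{(k)}\right)^{-1} \cdot \mathbf{g}^{(k)}$$

2. Find the step size by line search:

$$\alpha^{(k)} = \arg\min_{\alpha} \mathcal{J}\left(\mathbf{x}_0^{(k)} + \alpha\, \mathbf{s}^{(k)}\right)$$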


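Quasi-Newton methods II

3. Update the solution:

$$\mathbf{x}_0^{(k+1)} = \mathbf{x}_0^{(k)} + \alpha^{(k)} \mathbf{s}^{(k)}$$

4. Update the Hessian inverse approximation $\mathbf{B}^{(k+1)}$

Hessian approximations and update formulae. Consider the differences:

$$\xi^{(k)} = \mathbf{x}_0^{(k+1)} - \mathbf{x}_0^{(k)} = \alpha^{(k)} \mathbf{s}^{(k)} \in \mathbb{R}^n$$

$$\gamma^{(k)} = \mathbf{g}^{(k+1)} - \mathbf{g}^{(k)} \overset{\text{Taylor}}{=} \mathbf{H}^{(k)} \xi^{(k)} + o\left(\|\xi^{(k)}\|\right) \in \mathbb{R}^n .$$

In the quasi-Newton approach, the Hessian approximations are chosen to satisfy the “quasi-Newton” condition

$$\mathbf{B}^{(k+1)} \cdot \gamma^{(k)} = \xi^{(k)} .$$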


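Quasi-Newton methods III

How can we update $\mathbf{B}^{(k)}$ to $\mathbf{B}^{(k+1)}$ so as to satisfy the quasi-Newton equation? The most successful methods are symmetric rank-two updates:

$$\mathbf{B}^{(k+1)} = \mathbf{B}^{(k)} + a\, \mathbf{u}\mathbf{u}^T + b\, \mathbf{v}\mathbf{v}^T$$

The choice of $a$, $b$ and $\mathbf{u}$, $\mathbf{v}$ is not unique. We can tune them to get $\mathbf{B}^{(k+1)}$ symmetric, positive definite.

DFP (Davidon-Fletcher-Powell):

$$\mathbf{B}^{(k+1)}_{\rm DFP} = \mathbf{B}^{(k)} + \frac{\xi^{(k)} \xi^{(k)\,T}}{\xi^{(k)\,T} \gamma^{(k)}} - \frac{\mathbf{B}^{(k)} \gamma^{(k)} \gamma^{(k)\,T} \mathbf{B}^{(k)}}{\gamma^{(k)\,T} \mathbf{B}^{(k)} \gamma^{(k)}}$$

BFGS (Broyden-Fletcher-Goldfarb-Shanno):

$$\mathbf{B}^{(k+1)}_{\rm BFGS} = \mathbf{B}^{(k)} + \left(1 + \frac{\gamma^{(k)\,T} \mathbf{B}^{(k)} \gamma^{(k)}}{\xi^{(k)\,T} \gamma^{(k)}}\right) \frac{\xi^{(k)} \xi^{(k)\,T}}{\xi^{(k)\,T} \gamma^{(k)}} - \frac{\xi^{(k)} \gamma^{(k)\,T} \mathbf{B}^{(k)} + \mathbf{B}^{(k)} \gamma^{(k)} \xi^{(k)\,T}}{\xi^{(k)\,T} \gamma^{(k)}}$$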
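The BFGS update of the inverse Hessian approximation can be written compactly and checked against the quasi-Newton condition. In the sketch below the function name `bfgs_inverse_update` and the vectors `xi` and `gamma` are illustrative assumptions; the update assumes a symmetric $\mathbf{B}^{(k)}$ and $\xi^T\gamma > 0$.

```python
import numpy as np

def bfgs_inverse_update(B, xi, gamma):
    """BFGS update of the inverse Hessian approximation (B symmetric)."""
    rho = xi @ gamma                         # xi^T gamma, assumed positive
    Bg = B @ gamma                           # B gamma
    return (B
            + (1.0 + (gamma @ Bg) / rho) * np.outer(xi, xi) / rho
            - (np.outer(xi, Bg) + np.outer(Bg, xi)) / rho)

B = np.eye(3)                                # initial approximation B^(0)
xi = np.array([1.0, 0.5, -0.2])              # step x^(k+1) - x^(k)
gamma = np.array([2.0, 1.5, 0.1])            # gradient change g^(k+1) - g^(k)

B_new = bfgs_inverse_update(B, xi, gamma)

secant_ok = np.allclose(B_new @ gamma, xi)   # quasi-Newton condition
symmetric = np.allclose(B_new, B_new.T)
positive_definite = np.linalg.eigvalsh(B_new).min() > 0.0
```

The three flags confirm that the updated matrix satisfies $\mathbf{B}^{(k+1)}\gamma^{(k)} = \xi^{(k)}$ and stays symmetric positive definite for this example.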


Quasi-Newton methods IV

Formally, the $\mathbf{B}^{(k)} \in \mathbb{R}^{n \times n}$ are full matrices. For large $n$, the storage of these matrices is prohibitive. A better way to represent these matrices is:

1. store $\mathbf{B}^{(0)}$ (the initial approximation) and the pairs $\{\xi^{(j)}, \gamma^{(j)}\}_{1 \le j \le k}$

2. keep the number of stored vectors small, $k \le K$, by dropping the older pairs. This is the limited-memory quasi-Newton approach

3. L-BFGS is the current gold standard for full nonlinear 4D-Var data assimilation problems


Incremental 4D-Var I

1. Apply a sequential quadratic programming (SQP) approach (Fisher, 2009). The nonlinear optimization problem is approximated by a sequence of quadratic optimization problems.

2. Express

$$\mathbf{x}_i = \mathbf{x}_i^{(k)} + \xi_i = \text{“current solution”} + \text{“increment”} , \quad i = 1, \dots, N .$$

   The estimation problem is linearized around the current solution trajectory $\mathbf{x}^{(k)}$ (Bennett, 2002; Lewis, 2005).

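Incremental 4D-Var II

3. This leads to a quadratic cost function:

$$\mathcal{J}^{\rm incr}(\xi_0) = \tfrac{1}{2} \left(\Delta^b \mathbf{x}_0^{(k)} + \xi_0\right)^T \mathbf{B}_0^{-1} \left(\Delta^b \mathbf{x}_0^{(k)} + \xi_0\right) + \tfrac{1}{2} \sum_{i=1}^{N} \left(\mathbf{H}_i \xi_i + \mathbf{d}_i^{(k)}\right)^T \mathbf{R}_i^{-1} \left(\mathbf{H}_i \xi_i + \mathbf{d}_i^{(k)}\right) ,$$

   where

$$\mathbf{d}_i^{(k)} = \mathcal{H}\left(\mathbf{x}_i^{(k)}\right) - \mathbf{y}_i , \qquad \Delta^b \mathbf{x}_0^{(k)} = \mathbf{x}_0^{(k)} - \mathbf{x}_0^b .$$

4. Quadratic optimization gives the optimal increment:

$$\xi_0^a = \arg\min \mathcal{J}^{\rm incr}(\xi_0) \quad \text{subject to:} \quad \xi_i = \mathbf{M}_{t_{i-1} \to t_i}\, \xi_{i-1} , \quad i = 1, \dots, N .$$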


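Incremental 4D-Var III

5. Update the solution:

$$\mathbf{x}_0^{(k+1)} = \mathbf{x}_0^{(k)} + \xi_0^a .$$

   A new linearization is performed about $\mathbf{x}^{(k+1)}$ and the incremental problem is solved again to improve the resulting analysis.

6. The gradient of the incremental 4D-Var cost function reads

$$\nabla_{\xi_0} \mathcal{J}^{\rm incr}(\xi_0) = \mathbf{B}_0^{-1} \left(\Delta^b \mathbf{x}_0^{(k)} + \xi_0\right) + \sum_{i=1}^{N} \mathbf{M}_i^T \mathbf{H}_i^T \mathbf{R}_i^{-1} \left(\mathbf{H}_i \xi_i + \mathbf{d}_i^{(k)}\right)$$

   It requires the tangent linear (TL) and adjoint (Adj) observation operators, and the adjoint model solution operator.

7. The Hessian of the incremental 4D-Var cost function is

$$\nabla^2_{\mathbf{x}_0,\mathbf{x}_0} \mathcal{J}(\mathbf{x}_0) = \mathbf{B}_0^{-1} + \sum_{i=1}^{N} \mathbf{M}_i^T \mathbf{H}_i^T \mathbf{R}_i^{-1} \mathbf{H}_i \mathbf{M}_i .$$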


Incremental 4D-Var IV

8. The solution of the incremental 4D-Var problem is obtained by solving the following linear system:

$$\nabla^2_{\mathbf{x}_0,\mathbf{x}_0} \mathcal{J}(\mathbf{x}_0) \cdot \xi_0^a = -\mathbf{B}_0^{-1} \Delta^b \mathbf{x}_0^{(k)} - \sum_{i=1}^{N} \mathbf{M}_i^T \mathbf{H}_i^T \mathbf{R}_i^{-1} \mathbf{d}_i^{(k)} .$$

   The right hand side of this linear system is obtained by one adjoint integration.

9. The symmetric (hopefully positive definite) system can be solved by a Lanczos iterative procedure (e.g., conjugate gradients). At each iteration one Hessian-vector product is required. This is obtained by ...


Incremental 4D-Var V

Comments (Tremolet, 2009):
• A coarse resolution, simplified physics, linearized model is used in incremental 4D-Var.
• The state $\mathbf{x}_i^{(k)} = \mathcal{M}_{t_0 \to t_i}(\mathbf{x}_0^{(k)})$ is computed with the full resolution nonlinear model.
• Innovations are computed with the nonlinear observation operator.
• The low resolution increments $\xi_0^a$ are interpolated to the full model resolution to perform the incremental 4D-Var update.
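The outer/inner loop structure of incremental 4D-Var (linearize about the current trajectory, solve the quadratic problem, update, repeat) can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the two-variable pendulum model `model_step`, its Jacobian `model_step_tl`, the linear observation operator `H`, and the error variances. Unlike the operational setting described in the comments above, the inner problem uses the same resolution as the outer loop, so no interpolation of the increment is needed, and the small linear system is solved directly rather than iteratively.

```python
import numpy as np

DT = 0.1  # time step of the toy model (illustrative)

def model_step(x):
    """Hypothetical nonlinear model: one forward Euler step of a pendulum."""
    return x + DT * np.array([x[1], -np.sin(x[0])])

def model_step_tl(x):
    """Tangent linear model (Jacobian of model_step) evaluated at x."""
    return np.eye(2) + DT * np.array([[0.0, 1.0], [-np.cos(x[0]), 0.0]])

H = np.array([[1.0, 0.0]])      # observe the angle only (linear operator)
B_inv = np.eye(2) / 0.2**2      # inverse background error covariance
R_inv = np.eye(1) / 0.1**2      # inverse observation error covariance

def full_cost(x0, xb, ys):
    """Nonlinear strongly constrained 4D-Var cost function."""
    J = 0.5 * (x0 - xb) @ B_inv @ (x0 - xb)
    x = x0
    for y in ys:
        x = model_step(x)
        r = H @ x - y
        J += 0.5 * r @ R_inv @ r
    return J

def incremental_4dvar(xb, ys, n_outer=5):
    x0 = xb.copy()
    for _ in range(n_outer):
        # Outer loop: nonlinear trajectory x_i^(k) and TL propagators M_i.
        x, M = x0.copy(), np.eye(2)
        hess = B_inv.copy()
        rhs = -B_inv @ (x0 - xb)
        for y in ys:
            M = model_step_tl(x) @ M   # TL model from t_0 to t_i
            x = model_step(x)          # nonlinear state at t_i
            d = H @ x - y              # innovation d_i^(k)
            G = H @ M
            hess += G.T @ R_inv @ G
            rhs -= G.T @ R_inv @ d
        # Inner problem: quadratic, solved here by a direct linear solve.
        x0 = x0 + np.linalg.solve(hess, rhs)
    return x0

# Synthetic experiment: noise-free observations of a reference trajectory.
x_true = np.array([1.0, 0.0])
ys, x = [], x_true
for _ in range(10):
    x = model_step(x)
    ys.append(H @ x)
xb = np.array([1.2, 0.1])
xa = incremental_4dvar(xb, ys)
```

Comparing `full_cost(xa, xb, ys)` with `full_cost(xb, xb, ys)` shows how much the nonlinear cost decreases after a few outer iterations.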



Preconditioning I

The convergence speed depends on the condition number of the Hessian

$$\kappa = \lambda_{\max}\left(\nabla^2_{\mathbf{x}_0,\mathbf{x}_0} \mathcal{J}\right) / \lambda_{\min}\left(\nabla^2_{\mathbf{x}_0,\mathbf{x}_0} \mathcal{J}\right).$$

The closer to one it is, the faster the convergence; the higher it is, the slower the convergence. The conditioning depends on:

1. the background,

2. the dynamics of the system through $\mathbf{M}_i$ and $\mathbf{M}_i^T$,

3. the observations used through $\mathbf{H}_i$ and $\mathbf{H}_i^T$; note that the addition of new observations can change the conditioning of the system.
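A two-variable example makes the last point concrete. The setup is purely illustrative (unit background variances, trivial dynamics, one accurate observation of the first component), and the helper names `hessian` and `condition_number` are hypothetical.

```python
import numpy as np

def hessian(B_inv, propagators, obs_ops, R_invs):
    """B0^{-1} + sum_i M_i^T H_i^T R_i^{-1} H_i M_i, with M_i from t_0 to t_i."""
    A = B_inv.copy()
    for M, H, R_inv in zip(propagators, obs_ops, R_invs):
        G = H @ M
        A = A + G.T @ R_inv @ G
    return A

def condition_number(A):
    """kappa = lambda_max / lambda_min for a symmetric positive definite A."""
    lam = np.linalg.eigvalsh(A)  # eigenvalues in ascending order
    return lam[-1] / lam[0]

B_inv = np.eye(2)               # unit background variances
M = np.eye(2)                   # trivial dynamics, for clarity
H = np.array([[1.0, 0.0]])      # observe the first component only
R_inv = np.array([[100.0]])     # observation error variance 0.01

kappa_background_only = condition_number(hessian(B_inv, [], [], []))
kappa_with_observation = condition_number(hessian(B_inv, [M], [H], [R_inv]))
print(kappa_background_only, kappa_with_observation)
```

With the background term alone the Hessian is the identity and $\kappa = 1$. Adding the single accurate observation gives the Hessian $\mathrm{diag}(101, 1)$ and $\kappa = 101$.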


Preconditioning II

Preconditioning is important to speed up iterations. The solution is carried out in the transformed variables $\mathbf{z} = \mathbf{X}^{-1}\, \delta\mathbf{x}$, where $\mathbf{X} \approx \left(\nabla^2_{\mathbf{x}_0,\mathbf{x}_0} \mathcal{J}\right)^{-1/2}$. The transformed matrix is

$$\nabla^2_{\mathbf{z}_0,\mathbf{z}_0} \mathcal{J}(\mathbf{x}_0) = \mathbf{X}^T\, \nabla^2_{\mathbf{x}_0,\mathbf{x}_0} \mathcal{J}(\mathbf{x}_0)\, \mathbf{X}.$$

A common choice is $\mathbf{X} = \mathbf{B}_0^{1/2}$, which makes the background term equal to identity, and the smallest eigenvalue equal to one. We assume below that this transformation has been applied and that all Hessian eigenvalues are greater than or equal to one.

A more involved idea (Tremolet, 2009) is to use the solution process from the previous time window to precondition the iterations in the current time window. Let $\lambda_j$ and $\mathbf{v}_j$ be the eigenvalues and eigenvectors of the Hessian in the current assimilation window $w$, sorted in decreasing order of the eigenvalues.
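The effect of the choice $\mathbf{X} = \mathbf{B}_0^{1/2}$ can be checked numerically. The sketch below uses a hypothetical diagonal background covariance with variances spanning four orders of magnitude and two direct observations with unit error variance. The matrix `G` stands for the stacked products $\mathbf{H}_i \mathbf{M}_i$, and all values are chosen only so that the result is easy to verify by hand.

```python
import numpy as np

def sym_sqrt(B):
    """Symmetric square root of a symmetric positive definite matrix."""
    lam, V = np.linalg.eigh(B)
    return (V * np.sqrt(lam)) @ V.T

def condition_number(A):
    lam = np.linalg.eigvalsh(A)  # ascending order
    return lam[-1] / lam[0]

B0 = np.diag([0.01, 0.1, 1.0, 10.0, 100.0])   # background covariance
G = np.zeros((2, 5))                          # stacked H_i M_i
G[0, 2] = 1.0                                 # observe variable 2
G[1, 3] = 1.0                                 # observe variable 3
R_inv = np.eye(2)

hess_x = np.linalg.inv(B0) + G.T @ R_inv @ G  # Hessian in x variables
X = sym_sqrt(B0)
hess_z = X.T @ hess_x @ X                     # Hessian in z variables

kappa_x = condition_number(hess_x)
kappa_z = condition_number(hess_z)
eig_z = np.linalg.eigvalsh(hess_z)
print(kappa_x, kappa_z, eig_z)
```

Here the untransformed Hessian is $\mathrm{diag}(100, 10, 2, 1.1, 0.01)$ with $\kappa = 10^4$, while the transformed Hessian is $\mathrm{diag}(1, 1, 2, 11, 1)$ with $\kappa = 11$ and all eigenvalues at least one.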

Lecture

설교 seolgyo


Preconditioning III



Weakly Constrained 4D-Var I

1. Weakly constrained 4D-Var avoids the assumption of a perfect model, implicit in the traditional formulation, at the expense of solving a larger optimization problem.

2. The state $\mathbf{x}_i$ at $t_i$ is allowed to differ from the model prediction; the difference is the model error, considered to be a random variable. With the assumptions that the model is not biased, the model error is normally distributed, and model errors at different times are not correlated, we have that

$$\mathbf{x}_i = \mathcal{M}_{t_{i-1} \to t_i}(\mathbf{x}_{i-1}) + \eta_i, \quad \eta_i \in \mathcal{N}(0, \mathbf{Q}_i), \quad i = 1, \dots, N.$$


Weakly Constrained 4D-Var II

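3. The weakly constrained 4D-Var estimate of $\mathbf{x} = [\mathbf{x}_0, \mathbf{x}_1, \dots, \mathbf{x}_N]$ is the unconstrained minimizer of the following cost function:

$$\begin{aligned}
\mathcal{J}^{\rm wk}(\mathbf{x}) ={}& \frac{1}{2} \left(\mathbf{x}_0 - \mathbf{x}_0^b\right)^T \mathbf{B}_0^{-1} \left(\mathbf{x}_0 - \mathbf{x}_0^b\right) \\
&+ \frac{1}{2} \sum_{i=1}^{N} \left(\mathcal{H}(\mathbf{x}_i) - \mathbf{y}_i\right)^T \mathbf{R}_i^{-1} \left(\mathcal{H}(\mathbf{x}_i) - \mathbf{y}_i\right) \\
&+ \frac{1}{2} \sum_{i=1}^{N} \left(\mathbf{x}_i - \mathcal{M}_{t_{i-1} \to t_i}(\mathbf{x}_{i-1})\right)^T \mathbf{Q}_i^{-1} \left(\mathbf{x}_i - \mathcal{M}_{t_{i-1} \to t_i}(\mathbf{x}_{i-1})\right).
\end{aligned}$$

4. The model is not imposed exactly. Rather, it is treated as a weak constraint.

5. The optimization variables are the model states at all times $\mathbf{x} \in \mathbb{R}^{n(N+1)}$, and therefore the resulting optimization problem is of larger dimension than that for strongly-constrained 4D-Var.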



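Weakly Constrained 4D-Var III

6. An alternative is to treat the model as a strong constraint, but account for the model bias which contributes to the discrepancy between model predictions and observations.

$$\begin{aligned}
\mathcal{J}^{\rm wk}(\mathbf{x}_0, \beta) ={}& \frac{1}{2} \left(\mathbf{x}_0 - \mathbf{x}_0^b\right)^T \mathbf{B}_0^{-1} \left(\mathbf{x}_0 - \mathbf{x}_0^b\right) \\
&+ \frac{1}{2} \sum_{i=1}^{N} \left(\mathcal{H}(\mathbf{x}_i + \beta_i) - \mathbf{y}_i\right)^T \mathbf{R}_i^{-1} \left(\mathcal{H}(\mathbf{x}_i + \beta_i) - \mathbf{y}_i\right) \\
&+ \frac{1}{2} \sum_{i=1}^{N} \beta_i^T \mathbf{Q}_i^{-1} \beta_i, \\
\mathbf{x}_i ={}& \mathcal{M}_{t_0 \to t_i}(\mathbf{x}_0).
\end{aligned}$$

7. Difficult issues are related to the calibration of the covariances $\mathbf{Q}_i$ and to building temporal correlation models for biases and model errors.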
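The weakly constrained cost function of item 3 can be evaluated with a few lines of Python. The function name `weak_constraint_cost`, the linear toy model `A`, the observation operator, and the covariance values are all assumptions made for illustration; for brevity the same $\mathbf{R}^{-1}$ and $\mathbf{Q}^{-1}$ are used at every time.

```python
import numpy as np

def weak_constraint_cost(xs, xb, ys, model, obs_op, B_inv, R_inv, Q_inv):
    """Weak-constraint 4D-Var cost.

    xs : list of states [x_0, ..., x_N] (the optimization variables)
    ys : list of observations [y_1, ..., y_N]
    """
    db = xs[0] - xb
    J = 0.5 * db @ B_inv @ db
    for i in range(1, len(xs)):
        r = obs_op(xs[i]) - ys[i - 1]        # observation mismatch
        eta = xs[i] - model(xs[i - 1])       # model error at t_i
        J += 0.5 * r @ R_inv @ r + 0.5 * eta @ Q_inv @ eta
    return J

A = np.array([[0.9, 0.1], [-0.1, 0.9]])      # hypothetical linear model

def model(x):
    return A @ x

def obs_op(x):
    return x[:1]                             # observe the first component

xb = np.array([1.0, 0.0])
xs = [xb]
for _ in range(3):
    xs.append(model(xs[-1]))                 # trajectory that obeys the model
ys = [obs_op(x) for x in xs[1:]]             # perfect observations
B_inv, R_inv, Q_inv = np.eye(2), np.eye(1), 10.0 * np.eye(2)

J_exact = weak_constraint_cost(xs, xb, ys, model, obs_op, B_inv, R_inv, Q_inv)

# Perturb one intermediate state: the model is no longer satisfied exactly.
xs_pert = [x.copy() for x in xs]
xs_pert[2] = xs_pert[2] + np.array([0.1, 0.0])
J_pert = weak_constraint_cost(xs_pert, xb, ys, model, obs_op,
                              B_inv, R_inv, Q_inv)
```

A trajectory that satisfies the model exactly and matches the background and the observations has zero cost. Perturbing a single intermediate state is penalized through the observation term at that time and through the model error terms at that time and the next one, which is how the weak constraint couples neighboring states.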



Models of the background and observation error covariances I

• The quality of the assimilation depends on the accuracy with which the background and observation error covariances are known.

• Models of observation errors include information about the measuring instrument noise and bias (measurement error), and about the resolution with which the model reproduces the pointwise variability of the physical system (representativeness error).

• Background error covariances determine the relative weighting between observations and a priori data, and dictate how the information is spread in space and among variables.


Models of the background and observation error covariances II

• Background covariances are based on models of the error at the current time (or at initial time in 4D-Var). In case of cyclic data assimilation the analysis covariance from the previous cycle, transported to the current time, becomes the new background covariance.

• Background covariance matrices need to:
  • capture the spatial error correlations created by the flow (transport and diffusion),
  • capture the inter-species error correlations created by the chemical interactions,
  • have full rank, such that terms of the form $\mathbf{x}^T \mathbf{B}^{-1} \mathbf{x}$ make sense, and
  • allow for computationally efficient evaluations of matrix vector operations of the form $\mathbf{B}\,\mathbf{x}$, $\mathbf{B}^{1/2}\,\mathbf{x}$, and $\mathbf{B}^{-1}\,\mathbf{x}$.



Models of the background and observation error covariances III

• In (Chai, 2007) the CMAQ error statistics are estimated through both the NMC (National Meteorological Center) and the Hollingsworth-Lönnberg methods.

Background error statistics can be estimated from ensembles of data assimilation runs.

Figure: Run an ensemble of analyses with random observation and state perturbations, plus stochastic model error representation. Form differences between pairs of background fields. (Schematic: two parallel analysis/forecast cycles produce the backgrounds $\mathbf{x}^b + \varepsilon^b$ and $\mathbf{x}^b + \eta^b$; their differences are the background differences.)
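The statistics of the pairwise differences can be turned into a covariance estimate as sketched below. The sketch assumes that the two perturbations in each pair are independent draws from the same zero-mean background error distribution, in which case each difference has covariance $2\mathbf{B}$. The function name and the synthetic sampling (which stands in for the actual ensemble of assimilation runs) are hypothetical.

```python
import numpy as np

def background_cov_from_differences(backgrounds_a, backgrounds_b):
    """Estimate B from differences between pairs of background fields.

    Inputs have shape (n_pairs, n). If the two members of each pair carry
    independent errors with covariance B, their difference has covariance
    2B, hence the factor 1/2.
    """
    diffs = backgrounds_a - backgrounds_b
    return diffs.T @ diffs / (2.0 * diffs.shape[0])

# Synthetic stand-in for the ensemble of assimilation runs.
rng = np.random.default_rng(0)
B_true = np.array([[1.0, 0.5, 0.0],
                   [0.5, 1.0, 0.5],
                   [0.0, 0.5, 1.0]])
L = np.linalg.cholesky(B_true)
n_pairs = 20000
xb = np.array([1.0, 2.0, 3.0])
members_a = xb + rng.standard_normal((n_pairs, 3)) @ L.T   # x^b + eps^b
members_b = xb + rng.standard_normal((n_pairs, 3)) @ L.T   # x^b + eta^b

B_est = background_cov_from_differences(members_a, members_b)
```

Working with differences removes the unknown common state $\mathbf{x}^b$, so no estimate of the ensemble mean is required.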



Models of the background and observation error covariances IV

• An autoregressive (AR) model approach to represent background error covariance matrices has been proposed in (Sandu, 2007). The background state error field is modeled as a multilateral AR of the form

$$\varepsilon^b_{i,j,k} = \alpha_{i\pm1,j\pm1,k\pm1}\, \varepsilon^b_{i\pm1,j,k} + \sigma_{i,j,k}\, \xi_{i,j,k}.$$

Here $(i, j, k)$ are gridpoint indices on a three dimensional structured grid. The model captures the correlations among neighboring grid points, with $\alpha$ representing the correlation coefficients in the $x$, $y$ and $z$ directions. The last term represents the additional uncertainty at each grid point, with $\xi \in \mathcal{N}(0,1)$ normal random variables and $\sigma$ local error variances.
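A much simplified, one-dimensional and one-sided analogue shows why an AR description is attractive. This is not the multilateral three-dimensional model of the slide: the sketch assumes constant coefficients `alpha` and `sigma`, a single grid direction, and dependence on the left neighbor only, with the first grid point started from the stationary variance. Writing the recursion as $\mathbf{A}\,\varepsilon = \mathbf{S}\,\xi$ gives $\mathbf{B} = \mathbf{A}^{-1} \mathbf{S}^2 \mathbf{A}^{-T}$ and a sparse inverse $\mathbf{B}^{-1} = \mathbf{A}^T \mathbf{S}^{-2} \mathbf{A}$.

```python
import numpy as np

def ar1_background_cov(n, alpha, sigma):
    """Covariance implied by eps_i = alpha * eps_{i-1} + sigma * xi_i in 1D.

    Returns (B, B_inv). The first point uses the stationary standard
    deviation sigma / sqrt(1 - alpha^2) so that all variances are equal.
    """
    A = np.eye(n) - alpha * np.eye(n, k=-1)     # A eps = S xi
    s = np.full(n, sigma)
    s[0] = sigma / np.sqrt(1.0 - alpha**2)
    A_inv = np.linalg.inv(A)
    B = A_inv @ np.diag(s**2) @ A_inv.T
    B_inv = A.T @ np.diag(1.0 / s**2) @ A       # tridiagonal
    return B, B_inv

n, alpha, sigma = 6, 0.8, 0.5
B, B_inv = ar1_background_cov(n, alpha, sigma)

# Correlations implied by the model decay geometrically with distance.
idx = np.arange(n)
expected = sigma**2 / (1.0 - alpha**2) * alpha ** np.abs(idx[:, None] - idx[None, :])
```

In this simplified setting the implied covariance is full (correlation $\alpha^{|i-j|}$ between points $i$ and $j$) while its inverse is tridiagonal, so products with $\mathbf{B}^{-1}$ are cheap.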



Models of the background and observation error covariances V

Correct models of background (prior) errors are very important for data assimilation

• Background error representation determines the spread of information, and impacts the assimilation results

• Needs: high rank, capture dynamic dependencies, efficient computations

• Traditionally estimated empirically (NMC, Hollingsworth-Lönnberg)

1. Tensor products of 1d correlations, decreasing with distance (Singh et al, 2010)

2. Multilateral AR model (Constantinescu et al 2007)

3. Hybrid methods in the context of 4D-Var (Cheng et al, 2009)

Figure: Correlations built by the AR model follow the flow lines. [Constantinescu and Sandu, 2007]



Models of the background and observation error covariances VI

• A simplified approach proposed in (Singh et al, 2011) constructs multidimensional correlation matrices as tensor products of one-dimensional correlations. This method has resulted in improved chemical data assimilation results with GEOS-Chem.

• The hybrid approach (Chen et al, 2010) estimates the analysis covariance at the end of each assimilation window. An ensemble drawn from the background distribution is run side by side with the optimization process, the subspace of errors corrected by 4D-Var is identified, and the background ensemble modified into one that samples the analysis distribution.
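The tensor product construction in the first bullet can be sketched as follows. The exponential form of the one-dimensional correlations, the correlation lengths, and the function names are assumptions made for illustration and are not taken from the cited work. The point of the sketch is that the product of the three-dimensional correlation matrix with a vector can be computed one direction at a time, without ever forming the full matrix.

```python
import numpy as np

def corr_1d(n, length):
    """1D correlation matrix decreasing with distance (exponential form,
    chosen here only as an example)."""
    idx = np.arange(n)
    return np.exp(-np.abs(idx[:, None] - idx[None, :]) / length)

def kron_matvec(Cx, Cy, Cz, v):
    """Apply (Cz kron Cy kron Cx) to v without forming the full matrix.

    v is a field on an (nz, ny, nx) grid, flattened in C order.
    """
    nx, ny, nz = Cx.shape[0], Cy.shape[0], Cz.shape[0]
    V = v.reshape(nz, ny, nx)
    V = np.einsum('ab,bjk->ajk', Cz, V)   # correlate along z
    V = np.einsum('ab,ibk->iak', Cy, V)   # correlate along y
    V = np.einsum('ab,ijb->ija', Cx, V)   # correlate along x
    return V.reshape(-1)

nx, ny, nz = 4, 3, 2
Cx, Cy, Cz = corr_1d(nx, 2.0), corr_1d(ny, 1.5), corr_1d(nz, 1.0)

rng = np.random.default_rng(0)
v = rng.standard_normal(nx * ny * nz)
w = kron_matvec(Cx, Cy, Cz, v)

# Reference result with the explicitly assembled matrix (small grids only).
C_full = np.kron(Cz, np.kron(Cy, Cx))
```

For a grid with $n_x n_y n_z$ points the explicit matrix has $(n_x n_y n_z)^2$ entries, whereas the factored product only stores the three small one-dimensional matrices.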



Thank you

감사합니다 gamsahabnida
