

Optimization and Engineering, 2, 75–129, 2001

© 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.

    A Primer on Differentiation

    MARK S. GOCKENBACH

    Department of Mathematical Sciences, Michigan Technological University, 1400 Townsend Drive, Houghton,

MI 49931-1295, USA

    Received February 4, 2000; Revised April 4, 2001

    Abstract. The central idea of differential calculus is that the derivative of a function defines the best local linear

    approximation to the function near a given point. This basic idea, together with some representation theorems

from linear algebra, unifies the various derivatives (gradients, Jacobians, Hessians, and so forth) encountered

    in engineering and optimization. The basic differentiation rules presented in calculus classes, notably the product

    and chain rules, allow the computation of the gradients and Hessians needed by optimization algorithms, even

    when the underlying operators are quite complex. Examples include the solution operators of time-dependent and

    steady-state partial differential equations. Alternatives to the hand-coding of derivatives are finite differences and

    automatic differentiation, both of which save programming time at the possible cost of run-time efficiency.

    Keywords: differentiation, solution operators, finite differences, automatic differentiation

    1. Introduction

    Throughout their study of calculus, students are introduced to derivatives of various types.

    These include:

The (ordinary) derivative $f'(x)$ of a real-valued function $f$ of a single variable. The number $f'(x_0)$ is the slope of the line tangent to the graph of $y = f(x)$ at $x = x_0$. It is also interpreted as the instantaneous rate of change of $y = f(x)$ at $x = x_0$.

The partial derivatives

$$\frac{\partial g}{\partial x_1}(x_1, x_2, \ldots, x_n),\ \frac{\partial g}{\partial x_2}(x_1, x_2, \ldots, x_n),\ \ldots,\ \frac{\partial g}{\partial x_n}(x_1, x_2, \ldots, x_n)$$

of a real-valued function of several variables. These numbers are interpreted as the instantaneous rates of change of $y = g(x_1, x_2, \ldots, x_n)$ as one variable is changed and the others held fixed.

The gradient vector

$$\nabla g(x_1, x_2, \ldots, x_n) = \begin{bmatrix} \frac{\partial g}{\partial x_1}(x_1, x_2, \ldots, x_n) \\ \frac{\partial g}{\partial x_2}(x_1, x_2, \ldots, x_n) \\ \vdots \\ \frac{\partial g}{\partial x_n}(x_1, x_2, \ldots, x_n) \end{bmatrix}.$$


    least in the United States) without encountering a course that makes this principle explicit.1

Moreover, the elementary rules of differentiation as learned in calculus courses (the product rule, chain rule, and so forth) can leave a student ill-prepared to compute derivatives

    of the complicated functions and operators that arise in advanced engineering and applied

    mathematics research.

    The purpose of this paper is to explain the concept of derivative from the point of view

    of local linear approximation, to show how the various types of derivatives mentioned

above fit into the concept, and to work through several important and nontrivial examples. In the following section, I discuss the basic definitions and notation needed. The setting for these definitions is a normed vector space, that is, a vector space with a norm. For this reason, linear algebra is important. In Section 3, I present the elementary representation theorems of linear algebra, and show how they lead to the various scalars, vectors, and matrices that arise in calculus courses in the context of differentiation. This is followed by a brief discussion of the rules of differentiation (Section 4), simple representations for operators on infinite-dimensional spaces (Section 5), and second derivatives (Section 6). In addition to several examples included in the sections described above, I discuss two more involved examples: the adjoint state method for handling finite difference solution operators (Section 7), and a direct computation of the derivative (and its adjoint) of a finite element solution operator (Section 8). Finally, in Section 9, I discuss two alternatives to programming derivatives by hand: finite differences and automatic differentiation.

    Throughout this paper, the emphasis is on the structure of maps and their derivatives, not

    on the analytic details. Therefore, most technical proofs are omitted.

    2. Definitions and notation

    2.1. Normed vector spaces; inner products

    The various derivatives described in the introduction can all be discussed in the context of

a function (operator, map) $f$ mapping one Euclidean space into another. I will write $\mathbb{R}^n$ for Euclidean $n$-space, and denote a vector $x \in \mathbb{R}^n$ as $x = (x_1, x_2, \ldots, x_n)$ or

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.$$

Note that $\mathbb{R}^1$ is (isomorphic to) $\mathbb{R}$, the set of real numbers.

    The following examples were discussed in the introduction:

$f : \mathbb{R} \to \mathbb{R}$, a real-valued function of a single variable;
$f : \mathbb{R}^n \to \mathbb{R}$, a real-valued function of several variables;
$f : \mathbb{R}^n \to \mathbb{R}^m$, a vector-valued function of several variables;
$f : \mathbb{R} \to \mathbb{R}^n$, a vector-valued function of a single variable.


From now on I adopt vector notation and write, for example, $g(x)$ instead of $g(x_1, x_2, \ldots, x_n)$. Also, I distinguish vectors from scalars only by context.

Now, Euclidean $n$-space is equipped with an inner product, namely, the dot product:

$$(x, y) = x \cdot y = \sum_{i=1}^{n} x_i y_i, \qquad x, y \in \mathbb{R}^n.$$

A more general setting is an inner product space (which need not be Euclidean or finite-dimensional). An inner product space is just a vector space $V$ with an inner product $(\cdot, \cdot)_V$, which is a mapping from $V \times V$ into $\mathbb{R}$ satisfying the following properties:

$(\alpha u + \beta v, w)_V = \alpha (u, w)_V + \beta (v, w)_V$ for all $u, v, w \in V$, $\alpha, \beta \in \mathbb{R}$;
$(u, v)_V = (v, u)_V$ for all $u, v \in V$;
$(v, v)_V \ge 0$ for all $v \in V$, and $(v, v)_V = 0$ if and only if $v = 0$.

An inner product on $V$ induces a norm $\|\cdot\|_V$ on $V$:

$$\|v\|_V = \sqrt{(v, v)_V} \quad \text{for all } v \in V.$$

It is sometimes necessary to work with norms that are not defined by inner products. A general norm $\|\cdot\|_U$ on a vector space $U$ is a mapping from $U$ into $\mathbb{R}$ satisfying

$\|u\|_U \ge 0$ for all $u \in U$, and $\|u\|_U = 0$ if and only if $u = 0$;
$\|\alpha u\|_U = |\alpha|\,\|u\|_U$ for all $u \in U$, $\alpha \in \mathbb{R}$;
$\|u + v\|_U \le \|u\|_U + \|v\|_U$ for all $u, v \in U$ (the triangle inequality).

It can be shown that if $(\cdot, \cdot)_V$ is an inner product on a vector space $V$, then $\|v\|_V = \sqrt{(v, v)_V}$ defines a norm on $V$.

The reason I discuss vector spaces more general than Euclidean space is that many practical problems cannot be described using finite-dimensional spaces. For example, suppose $\Omega$ is an open subset of $\mathbb{R}^n$ and $f : \Omega \to \mathbb{R}^n$. Then under appropriate conditions on $f$, for any closed and bounded subset $W$ of $\Omega$, there exists $\tau > 0$ such that, for each $x_0 \in W$, the Initial Value Problem (IVP)

$$\dot{x} = f(x), \quad x(0) = x_0 \tag{1}$$

has a unique solution $x : [-\tau, \tau] \to \mathbb{R}^n$. Thus the IVP (1) defines an operator $S : W \to (C[-\tau, \tau])^n$, where $(C[-\tau, \tau])^n$ is the space of all continuous functions $u : [-\tau, \tau] \to \mathbb{R}^n$. This space is infinite-dimensional and therefore cannot be identified with any Euclidean space. I pursue this example in Section 5.5, where I compute the derivative of $S$.


    2.2. Definition of the derivative

Now suppose $X$ and $Y$ are normed linear spaces, suppose $U$ is an open subset of $X$, and assume that $f : U \to Y$. As I explained in Section 2.1, the types of functions encountered in calculus all fit under this description, as do many other important examples.

    First recall the following definition.

Definition 2.1. Suppose $X$ and $Y$ are vector spaces, and $L : X \to Y$. The operator $L$ is linear if

$$L(x + z) = Lx + Lz \quad \text{for all } x, z \in X$$

and

$$L(\alpha x) = \alpha Lx \quad \text{for all } x \in X,\ \alpha \in \mathbb{R}$$

(or, more concisely, $L(\alpha x + \beta z) = \alpha Lx + \beta Lz$ for all $x, z \in X$ and $\alpha, \beta \in \mathbb{R}$).

    Next is the fundamental definition in this paper.

Definition 2.2. Let $x \in U$. Suppose there is a continuous linear operator $L : X \to Y$ such that

$$\lim_{\Delta x \to 0} \frac{\|f(x + \Delta x) - f(x) - L\,\Delta x\|_Y}{\|\Delta x\|_X} = 0.$$

Then $f$ is said to be differentiable at $x$, and $L$ is called the derivative of $f$ at $x$, denoted $L = Df(x)$.

According to this definition, if $f$ is differentiable at $x$, then $Df(x)$ defines a linear approximation to $f$ near $x$; indeed, if

$$E(x, \Delta x) = f(x + \Delta x) - f(x) - Df(x)\,\Delta x,$$

then

$$f(x + \Delta x) = f(x) + Df(x)\,\Delta x + E(x, \Delta x)$$

and

$$\frac{\|E(x, \Delta x)\|_Y}{\|\Delta x\|_X} \to 0 \ \text{ as } \Delta x \to 0.$$

This last condition is abbreviated by

$$E(x, \Delta x) = o(\|\Delta x\|_X) \ \text{ as } \Delta x \to 0$$


(read: $E(x, \Delta x)$ is little-oh of $\|\Delta x\|_X$), which indicates that the error $E(x, \Delta x)$ is small compared to $\|\Delta x\|_X$ when $\|\Delta x\|_X$ is small. It is easy to show that, if $f$ is differentiable at $x$, then no other linear map $K : X \to Y$ defines a better local linear approximation to $f$ near $x$; that is, if $K \ne Df(x)$, then the error in the approximation

$$f(x + \Delta x) = f(x) + K\,\Delta x$$

is larger than the error in the approximation

$$f(x + \Delta x) = f(x) + Df(x)\,\Delta x$$

in the sense that

$$\lim_{\Delta x \to 0} \frac{\|f(x + \Delta x) - f(x) - K\,\Delta x\|_Y}{\|\Delta x\|_X} \ne 0.$$

Now, in addition to the basic definition of derivative just given, there is really just one key idea in this paper: the linear map $Df(x)$ has different representations, depending on the particular $X$ and $Y$ involved. There is an underlying question here, which properly belongs to linear algebra (or functional analysis when the spaces are infinite-dimensional): Given two normed vector spaces $X$ and $Y$, find a convenient representation for a continuous linear map $L : X \to Y$. I will address this question in Sections 3 and 5 below; here I preview those sections by answering the question for $X = Y = \mathbb{R}$.

Now suppose that $L : \mathbb{R} \to \mathbb{R}$ is linear. If $a = L(1)$, which is a real number, then, since $x = x \cdot 1$ for all $x \in \mathbb{R}$,

$$Lx = x\,L(1) = ax.$$

Thus, if $L : \mathbb{R} \to \mathbb{R}$ is linear, there is a real number $a \in \mathbb{R}$ such that

$$Lx = ax \quad \text{for all } x \in \mathbb{R}.$$

That is, a linear map from $\mathbb{R}$ to $\mathbb{R}$ is represented by a real number. Therefore, if $f : \mathbb{R} \to \mathbb{R}$ has a derivative at $x$, it is customary in elementary calculus courses to define $f'(x)$ to be the number

$$f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x},$$

which is equivalent to

$$\lim_{\Delta x \to 0} \frac{|f(x + \Delta x) - f(x) - f'(x)\,\Delta x|}{|\Delta x|} = 0.$$

It now becomes clear that, under this definition, the number $f'(x)$ is just the representer of the linear map $Df(x)$.


    It may seem overly pedantic to distinguish between the linear map and its representer.

However, when the vector spaces $X$ and $Y$ are not both one-dimensional, I believe it is

    essential to make the distinction. Before I go on to more examples in Section 3, where this

    should become clearer, I need to define continuity of the derivative, and also the concept of

    partial derivatives.

    2.3. Continuity of the derivative

I assume again that $f : U \to Y$, where $X$ and $Y$ are normed vector spaces and $U \subseteq X$ is open. If $f$ is differentiable at each $x \in U$, then $Df$ becomes an operator; for each $x \in U$, $Df(x)$ belongs to $\mathcal{L}(X, Y)$, the space of all continuous linear maps from $X$ into $Y$:

$$Df : U \to \mathcal{L}(X, Y).$$

When $f$ is differentiable at every $x \in U$, $f$ is simply said to be differentiable. Now, $\mathcal{L}(X, Y)$ is a vector space, since there is a natural way to add operators and multiply them by scalars, and continuity and linearity are obviously preserved by these operations. Also, $\mathcal{L}(X, Y)$ has a natural norm:

$$\|L\|_{\mathcal{L}(X,Y)} = \sup\left\{\frac{\|Lx\|_Y}{\|x\|_X} : x \in X,\ x \ne 0\right\}.$$

Note that this definition of norm implies

$$\|Lx\|_Y \le \|L\|_{\mathcal{L}(X,Y)}\,\|x\|_X \quad \text{for all } x \in X.$$

The norm of an operator thus measures the largest factor by which the operator stretches or magnifies any vector in its domain. It can be shown that a linear operator $L : X \to Y$ is continuous if and only if $\|L\|_{\mathcal{L}(X,Y)} < \infty$.


2.4. Partial derivatives

Suppose now that $X$, $Y$, and $Z$ are normed vector spaces and $f : X \times Y \to Z$. (Of course, $f$ might be defined on a subset of $X \times Y$, but the exposition is simpler if I assume that the domain of $f$ is all of $X \times Y$.) Since $X \times Y$ is a vector space, an operator like $f$ is just another example that fits into the discussion above: $f$ is differentiable at $(x, y)$ if there is a continuous linear operator $L : X \times Y \to Z$ such that

$$\lim_{(\Delta x, \Delta y) \to (0, 0)} \frac{\|f(x + \Delta x, y + \Delta y) - f(x, y) - L(\Delta x, \Delta y)\|_Z}{\|(\Delta x, \Delta y)\|_{X \times Y}} = 0.$$

(For the norm on $X \times Y$, the obvious choices are

$$\|(x, y)\| = \sqrt{\|x\|_X^2 + \|y\|_Y^2}, \qquad \|(x, y)\| = \max\{\|x\|_X, \|y\|_Y\}, \qquad \|(x, y)\| = \|x\|_X + \|y\|_Y.$$

I will only use the property that $\|(x, 0)\|_{X \times Y} = \|x\|_X$ and $\|(0, y)\|_{X \times Y} = \|y\|_Y$, which holds for any of the above.) On the other hand, given any $y \in Y$,

$$g(x) = f(x, y) \quad \text{for all } x \in X$$

defines an operator $g : X \to Z$. Similarly, for any $x \in X$,

$$h(y) = f(x, y) \quad \text{for all } y \in Y$$

defines an operator $h : Y \to Z$. The question now arises: What is the relationship between $Df$, $Dg$, and $Dh$?

The answer to this question is very simple when the structure of operators in $\mathcal{L}(X \times Y, Z)$ is understood.

Theorem 2.3. Let $X$, $Y$, and $Z$ be normed linear spaces. Then $L \in \mathcal{L}(X \times Y, Z)$ if and only if there exist $L_1 \in \mathcal{L}(X, Z)$ and $L_2 \in \mathcal{L}(Y, Z)$ such that

$$L(x, y) = L_1 x + L_2 y \quad \text{for all } x \in X,\ y \in Y.$$

Proof: Suppose $L \in \mathcal{L}(X \times Y, Z)$. Define $L_1 \in \mathcal{L}(X, Z)$ by

$$L_1 x = L(x, 0) \quad \text{for all } x \in X,$$

and $L_2 \in \mathcal{L}(Y, Z)$ by

$$L_2 y = L(0, y) \quad \text{for all } y \in Y.$$

It is easy to prove that $L_1$ and $L_2$ are indeed linear and bounded. Moreover, for any $(x, y) \in X \times Y$,

$$L(x, y) = L((x, 0) + (0, y)) = L(x, 0) + L(0, y) = L_1 x + L_2 y,$$

as desired.


On the other hand, it is easy to verify that if $L_1 \in \mathcal{L}(X, Z)$, $L_2 \in \mathcal{L}(Y, Z)$, and $L : X \times Y \to Z$ is defined by

$$L(x, y) = L_1 x + L_2 y \quad \text{for all } (x, y) \in X \times Y,$$

then $L \in \mathcal{L}(X \times Y, Z)$.

    It is now easy to prove the following theorem.

Theorem 2.4. Let $X$, $Y$, and $Z$ be normed linear spaces, suppose $f : X \times Y \to Z$, and let $(x_0, y_0) \in X \times Y$. Define $g : X \to Z$ by $g(x) = f(x, y_0)$ and $h : Y \to Z$ by $h(y) = f(x_0, y)$. Suppose $f$ is differentiable at $(x_0, y_0)$. Then $g$ is differentiable at $x_0$, $h$ is differentiable at $y_0$, and

$$Df(x_0, y_0)(\Delta x, \Delta y) = Dg(x_0)\,\Delta x + Dh(y_0)\,\Delta y.$$

The operators $Dg(x_0)$ and $Dh(y_0)$ are called the partial derivatives of $f$, and are denoted $D_x f(x_0, y_0)$ and $D_y f(x_0, y_0)$, respectively. Thus

$$Df(x_0, y_0)(\Delta x, \Delta y) = D_x f(x_0, y_0)\,\Delta x + D_y f(x_0, y_0)\,\Delta y.$$

Proof: By the preceding theorem, there exist $L_1 \in \mathcal{L}(X, Z)$ and $L_2 \in \mathcal{L}(Y, Z)$ such that

$$Df(x_0, y_0)(\Delta x, \Delta y) = L_1\,\Delta x + L_2\,\Delta y \quad \text{for all } \Delta x \in X,\ \Delta y \in Y.$$

In particular,

$$Df(x_0, y_0)(\Delta x, 0) = L_1\,\Delta x,$$

so

$$\lim_{\Delta x \to 0} \frac{\|f(x_0 + \Delta x, y_0) - f(x_0, y_0) - L_1\,\Delta x\|_Z}{\|(\Delta x, 0)\|_{X \times Y}} = 0.$$

This is equivalent to

$$\lim_{\Delta x \to 0} \frac{\|g(x_0 + \Delta x) - g(x_0) - L_1\,\Delta x\|_Z}{\|\Delta x\|_X} = 0,$$

so $L_1 = Dg(x_0)$. Similarly, $L_2 = Dh(y_0)$, and the proof is complete.

    Note that, for example,

$$D_x f(x, y) \in \mathcal{L}(X, Z),$$


    that is,

$$D_x f : X \times Y \to \mathcal{L}(X, Z).$$

    Similarly,

$$D_y f : X \times Y \to \mathcal{L}(Y, Z).$$

    The following theorem is only slightly harder to prove.

Theorem 2.5. Suppose $X$, $Y$, and $Z$ are normed linear spaces, $f : X \times Y \to Z$, and the partial derivatives of $f$, $D_x f(x, y)$ and $D_y f(x, y)$, exist and are continuous on an open set $U \subseteq X \times Y$. Then $f$ is $C^1$ on $U$, and

$$Df(x, y)(\Delta x, \Delta y) = D_x f(x, y)\,\Delta x + D_y f(x, y)\,\Delta y.$$

Note that the continuity of $D_x f$ and $D_y f$ is necessary; it is not the case that if $D_x f(x_0, y_0)$ and $D_y f(x_0, y_0)$ exist, then $Df(x_0, y_0)$ must exist.

These results obviously generalize to an operator of the form $f : X_1 \times X_2 \times \cdots \times X_n \to Z$; the basic equation is

$$Df(x)\,\Delta x = D_{x_1} f(x)\,\Delta x_1 + D_{x_2} f(x)\,\Delta x_2 + \cdots + D_{x_n} f(x)\,\Delta x_n,$$

where $x, \Delta x \in X_1 \times X_2 \times \cdots \times X_n$.

    3. Representation of linear operators on Euclidean spaces

    3.1. The basic theorem

    I will now give the fundamental representation theorem for linear operators on Euclidean

spaces. Specializing this result to the various contexts described in the introduction ($\mathbb{R} \to \mathbb{R}$, $\mathbb{R}^n \to \mathbb{R}$, $\mathbb{R}^n \to \mathbb{R}^m$, and $\mathbb{R} \to \mathbb{R}^n$) will account for the various types of derivatives described there.

Theorem 3.1. Let $L : \mathbb{R}^n \to \mathbb{R}^m$. Then $L$ is linear if and only if there is an $m \times n$ matrix $A$ such that

$$Lx = Ax \quad \text{for all } x \in \mathbb{R}^n.$$

Proof: Let $\{e_1, e_2, \ldots, e_n\}$ be the standard basis for $\mathbb{R}^n$ (so the $i$th component of $e_i$ is one and all other components are zero). Let $c_1 = Le_1$, $c_2 = Le_2$, ..., $c_n = Le_n$, and define $A$ to be the $m \times n$ matrix whose columns are the vectors $c_1, c_2, \ldots, c_n$. That is, $A_{ij}$ is $(c_j)_i$, the $i$th component of the vector $c_j$. Then, since

$$x = x_1 e_1 + x_2 e_2 + \cdots + x_n e_n,$$


the linearity of $L$ yields

$$Lx = x_1 Le_1 + x_2 Le_2 + \cdots + x_n Le_n = x_1 c_1 + x_2 c_2 + \cdots + x_n c_n.$$

However, by the definition of matrix multiplication,

$$Ax = x_1 c_1 + x_2 c_2 + \cdots + x_n c_n$$

also holds. Thus

$$Lx = Ax \quad \text{for all } x \in \mathbb{R}^n.$$
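Theorem 3.1 translates directly into a computation: the matrix representing a linear map can be assembled column by column from the images of the standard basis vectors, exactly as in the proof. A minimal Python sketch (the particular map $L$ below is an arbitrary illustration, not an example from the text):

```python
import numpy as np

def matrix_of(L, n):
    """Assemble the matrix A whose columns are L(e_1), ..., L(e_n)."""
    return np.column_stack([L(np.eye(n)[:, j]) for j in range(n)])

# An arbitrary linear map from R^3 to R^2, for illustration only:
L = lambda x: np.array([2.0 * x[0] - x[2], x[1] + 3.0 * x[2]])

A = matrix_of(L, 3)
x = np.array([1.0, -2.0, 0.5])
assert np.allclose(L(x), A @ x)   # L x = A x, as the theorem asserts
```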

Thus every linear operator $L : \mathbb{R}^n \to \mathbb{R}^m$ can be represented by an $m \times n$ matrix. If $f : \mathbb{R}^n \to \mathbb{R}^m$ is differentiable, then $Df(x) : \mathbb{R}^n \to \mathbb{R}^m$ is linear. Therefore, there is an $m \times n$ matrix $J$ representing $Df(x)$. This matrix $J$ turns out to be the Jacobian matrix mentioned in the introduction. To see this, it is convenient to first consider certain special cases.

    3.2. Representation of derivatives in special cases

3.2.1. $m = n = 1$. In the special case $m = n = 1$, so that $f$ is a real-valued function of a real variable, $Df(x)$ is represented by a single number $f'(x)$, as was already shown.

3.2.2. $m = 1$, $n > 1$. In the case $m = 1$, $n > 1$, so that $f$ is a real-valued function of several variables, the result of Section 2.4 applies:

$$Df(x)\,\Delta x = D_{x_1} f(x)\,\Delta x_1 + D_{x_2} f(x)\,\Delta x_2 + \cdots + D_{x_n} f(x)\,\Delta x_n.$$

Here $\Delta x_1, \Delta x_2, \ldots, \Delta x_n$ are the components of the vector $\Delta x \in \mathbb{R}^n$. Moreover, regarded as a function of $x_i$ with the other components of $x$ held fixed, $f$ defines a real-valued function of a real variable. Thus $D_{x_i} f(x)$ can be represented by a single number, which is usually denoted

$$\frac{\partial f}{\partial x_i}(x).$$

Thus

$$Df(x)\,\Delta x = \frac{\partial f}{\partial x_1}(x)\,\Delta x_1 + \frac{\partial f}{\partial x_2}(x)\,\Delta x_2 + \cdots + \frac{\partial f}{\partial x_n}(x)\,\Delta x_n, \tag{2}$$

which can be recognized as a matrix-vector product if the numbers

$$\frac{\partial f}{\partial x_1}(x),\ \frac{\partial f}{\partial x_2}(x),\ \ldots,\ \frac{\partial f}{\partial x_n}(x)$$

are gathered in a row, that is, a $1 \times n$ matrix:

$$\left[\frac{\partial f}{\partial x_1}(x)\ \ \frac{\partial f}{\partial x_2}(x)\ \ \cdots\ \ \frac{\partial f}{\partial x_n}(x)\right].$$

This is the representer of $Df(x)$ suggested by Theorem 3.1. There is a slightly different representer of $Df(x)$ that is to be preferred because it generalizes to the infinite-dimensional case: Eq. (2) is recognized as the inner product of $\Delta x$ with the vector

$$\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) \\ \frac{\partial f}{\partial x_2}(x) \\ \vdots \\ \frac{\partial f}{\partial x_n}(x) \end{bmatrix},$$

which is called the gradient of $f$ at $x$. Thus

$$Df(x)\,\Delta x = (\nabla f(x), \Delta x)_{\mathbb{R}^n} \quad \text{for all } \Delta x \in \mathbb{R}^n.$$

The gradient $\nabla f(x)$ is the usual representer of $Df(x)$.

3.2.3. $m > 1$, $n = 1$. In the case $m > 1$, $n = 1$, so that $f$ is a vector-valued function of a real variable (often called a curve, since its image is a curve in $m$-space), $Df(t) : \mathbb{R} \to \mathbb{R}^m$ is represented by an $m \times 1$ matrix. Now, $f$ can be written

$$f(t) = \begin{bmatrix} f_1(t) \\ f_2(t) \\ \vdots \\ f_m(t) \end{bmatrix},$$

where each $f_i$ is a real-valued function of a real variable. By considering the definition of the derivative in this case, which implies that

$$\lim_{\Delta t \to 0} \frac{|f_i(t + \Delta t) - f_i(t) - (Df(t)\,\Delta t)_i|}{|\Delta t|} = 0,$$

it is easy to see that $(Df(t)\,\Delta t)_i = f_i'(t)\,\Delta t$. Therefore the representer of $Df(t)$ is

$$\begin{bmatrix} f_1'(t) \\ f_2'(t) \\ \vdots \\ f_m'(t) \end{bmatrix},$$

which is written as $f'(t)$ or $\dot{f}(t)$ in calculus courses. Since an $m \times 1$ matrix can be thought of as a vector, the usual interpretation of $f'(t)$ as the tangent vector to the curve $x = f(t)$ holds. If the curve is traced out by a particle as $t$ varies, then, at time $t$, the particle is at $x = f(t)$, while at time $t + \Delta t$, it is approximately at $f(t) + f'(t)\,\Delta t$.

3.2.4. The general case $m > 1$, $n > 1$. Finally, consider the case $m > 1$, $n > 1$, so that $f$ is a vector-valued function of a vector variable. Then, by the results on partial derivatives,

$$Df(x)\,\Delta x = D_{x_1} f(x)\,\Delta x_1 + D_{x_2} f(x)\,\Delta x_2 + \cdots + D_{x_n} f(x)\,\Delta x_n.$$

Now, regarded as a function of $x_i$ with the other components of $x$ held fixed, $f$ defines a function of the type considered in Section 3.2.3. The representer of $D_{x_i} f(x)$ is the column vector

$$\frac{\partial f}{\partial x_i}(x) = \begin{bmatrix} \frac{\partial f_1}{\partial x_i}(x) \\ \frac{\partial f_2}{\partial x_i}(x) \\ \vdots \\ \frac{\partial f_m}{\partial x_i}(x) \end{bmatrix},$$

and it follows from this that the matrix $J$ representing $Df(x)$ has

$$\frac{\partial f}{\partial x_1}(x),\ \frac{\partial f}{\partial x_2}(x),\ \ldots,\ \frac{\partial f}{\partial x_n}(x)$$

as columns. Thus

$$J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1}(x) & \frac{\partial f_1}{\partial x_2}(x) & \cdots & \frac{\partial f_1}{\partial x_n}(x) \\ \frac{\partial f_2}{\partial x_1}(x) & \frac{\partial f_2}{\partial x_2}(x) & \cdots & \frac{\partial f_2}{\partial x_n}(x) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}(x) & \frac{\partial f_m}{\partial x_2}(x) & \cdots & \frac{\partial f_m}{\partial x_n}(x) \end{bmatrix},$$

    which is the Jacobian matrix mentioned in the introduction. (Note that the gradients of the

component functions $f_i(x)$ form the rows of $J$.)

    3.3. Summary

    Here is a summary of the results that I have presented:

If $f : \mathbb{R} \to \mathbb{R}$, then the representer of $Df(x)$ is the scalar $f'(x)$.


If $f : \mathbb{R}^n \to \mathbb{R}$, then the representer of $Df(x)$ is the vector $\nabla f(x)$. Recall that this is a slight departure from the general framework, as $Df(x)$ is represented via inner product with a (column) vector rather than via matrix multiplication with a row vector.

If $f : \mathbb{R} \to \mathbb{R}^m$, then the representer of $Df(t)$ is the (column) vector $f'(t)$.

If $f : \mathbb{R}^n \to \mathbb{R}^m$, then the representer of $Df(x)$ is the Jacobian matrix $J$ defined by

$$J_{ij} = \frac{\partial f_i}{\partial x_j}(x).$$

    3.4. Example: A quadratic function

Suppose $A$ is an $n \times n$ symmetric matrix ($A^T = A$), $b \in \mathbb{R}^n$, $\gamma \in \mathbb{R}$, and $f : \mathbb{R}^n \to \mathbb{R}$ is defined by

$$f(x) = \frac{1}{2}(x, Ax)_{\mathbb{R}^n} + (b, x)_{\mathbb{R}^n} + \gamma.$$

To compute $\nabla f(x)$, one method is to write $f$ in terms of the components of $x$,

$$f(x) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} A_{ij}\,x_i\,x_j + \sum_{i=1}^{n} b_i\,x_i + \gamma,$$

and compute the partial derivatives of $f$. While this is possible, it is easier to proceed from the definition: write $f(x + \Delta x) - f(x)$ as a term that is linear in $\Delta x$ plus a smaller remainder. Now,

$$f(x + \Delta x) - f(x) = \frac{1}{2}(x + \Delta x, A(x + \Delta x)) + (b, x + \Delta x) + \gamma - \frac{1}{2}(x, Ax) - (b, x) - \gamma = (Ax + b, \Delta x) + \frac{1}{2}(\Delta x, A\,\Delta x).$$

Note that the symmetry of $A$ was used to conclude that $(\Delta x, Ax) = (x, A\,\Delta x)$. The term $(\Delta x, A\,\Delta x)/2$ is small compared to $\Delta x$ when $\Delta x$ is small; in fact,

$$\frac{1}{2}(\Delta x, A\,\Delta x) = O(\|\Delta x\|^2) = o(\|\Delta x\|).$$

Therefore,

$$f(x + \Delta x) - f(x) = (Ax + b, \Delta x) + o(\|\Delta x\|),$$

that is,

$$Df(x)\,\Delta x = (Ax + b, \Delta x).$$

This equation exhibits the representer $\nabla f(x)$ of $Df(x)$:

$$\nabla f(x) = Ax + b.$$
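This formula is easy to check numerically. In the following sketch, the test data $A$, $b$, and $\gamma$ are randomly generated illustrations (not from the text), and $\nabla f(x) = Ax + b$ is compared with forward-difference approximations of the partial derivatives:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
A = 0.5 * (A + A.T)                  # symmetrize, so (x, Ax) is a quadratic form
b = rng.standard_normal(n)
gamma = 1.7

f = lambda x: 0.5 * x @ A @ x + b @ x + gamma
grad = lambda x: A @ x + b           # the representer derived above

x = rng.standard_normal(n)
h = 1e-6
fd = np.array([(f(x + h * e) - f(x)) / h for e in np.eye(n)])
print(np.max(np.abs(fd - grad(x))))  # O(h) agreement, roughly 1e-5
```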


    4. Rules for differentiation

    I will now review the important rules for differentiating functions.

    4.1. The derivative of a linear function

    Suppose f : X Yis linear and continuous. Thenf(x+ x) f(x)= f(x) + f(x) f(x)= f(x).

    It follows that

    D f(x)x= f(x),that is,D f(x)= f. This holds independently ofx X, and is the analogue of the rule fromcalculus which states that the derivative of a linear function is a constant.

    4.2. The chain rule

    Now suppose thath : Y Z,g : X Y areC1, and f : X Z is the composition ofhand g:

    f(x)=h (g(x)) for allx X.Then

    f(x+ x)=h (g(x+ x))= h (g(x) +Dg(x)x+ o(x))= h (g(x)) +D h(g(x))(Dg(x)x+ o(x))

    + o(Dg(x)x+ o(x))= f(x) +D h(g(x))Dg(x)x+ o(x).

    Thus

    D f(x)x= Dh (g(x))Dg(x)x.This is thechain rule.
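In the finite-dimensional case, the chain rule reads $J_f(x) = J_h(g(x))\,J_g(x)$ in terms of the Jacobian matrices of Section 3. A sketch verifying this with finite-difference Jacobians (the maps $g$ and $h$ below are arbitrary illustrations):

```python
import numpy as np

def jacobian(F, x, h=1e-6):
    """Forward-difference approximation to the Jacobian of F at x."""
    Fx = F(x)
    return np.column_stack([(F(x + h * e) - Fx) / h for e in np.eye(len(x))])

g = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])   # R^2 -> R^3
h_ = lambda y: np.array([y[0] + y[1] * y[2], np.exp(y[0])])      # R^3 -> R^2
f = lambda x: h_(g(x))                                           # composition

x = np.array([0.3, -1.2])
lhs = jacobian(f, x)
rhs = jacobian(h_, g(x)) @ jacobian(g, x)
print(np.max(np.abs(lhs - rhs)))   # small: equal up to finite-difference error
```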

    4.3. The product rule

    SupposeX,Y, Z, andWare normed vector spaces, and P : YZWis continuous andbilinear; that is

    P(1y1+ 2y2,z)=1 P(y1,z) + 2 P(y2,z)for all y1,y2Y,z Z, 1, 2R,

    P(y, 1z1+ 2z2)=1 P(y,z1) + 2 P(y,z2)for all yY,z1,z2 Z, 1, 2R.


Thus

$$\nabla h(x) = G^T f(x) + F^T g(x).$$

    4.5. Example: Differentiating an inverse

Suppose $L : X \to \mathcal{L}(Y, Z)$, and assume that $L(x)^{-1}$ exists and is continuous for each $x \in X$. Define $f : X \to \mathcal{L}(Z, Y)$ by

$$f(x) = L(x)^{-1}.$$

Assuming that $L$ is $C^1$, what is $Df(x)$? Both the chain rule and the product rule are involved in the answer. Define $\Phi : U \to \mathcal{L}(Z, Y)$ by $\Phi(K) = K^{-1}$, where

$$U = \{K \in \mathcal{L}(Y, Z) : K^{-1} \text{ exists and is continuous}\}.$$

Then

$$K\,\Phi(K) = I,$$

where $I : Z \to Z$ is the identity operator. Differentiating both sides yields

$$K\,(D\Phi(K)\,\Delta K) + \Delta K\,\Phi(K) = 0. \tag{3}$$

To obtain this result, the product rule was applied to the mapping $P : \mathcal{L}(Y, Z) \times \mathcal{L}(Z, Y) \to \mathcal{L}(Z, Z)$ defined by $P(K, L) = KL$. Also, note that $I \in \mathcal{L}(Z, Z)$ is constant, so its derivative is zero. Now, $\Phi(K) = K^{-1}$, so (3) yields

$$K\,(D\Phi(K)\,\Delta K) = -\Delta K\,K^{-1}$$

or

$$D\Phi(K)\,\Delta K = -K^{-1}\,\Delta K\,K^{-1}. \tag{4}$$

(The reader will notice the shadow of the calculus rule

$$f(x) = \frac{1}{x} \implies f'(x) = -\frac{1}{x^2}$$

here.)

Equation (4) can now be combined with the chain rule to find $Df(x)$ for $f(x) = L(x)^{-1}$. Since $f$ is the composition of $\Phi$ and $L$, the chain rule yields

$$Df(x)\,\Delta x = D\Phi(L(x))\,DL(x)\,\Delta x = -L(x)^{-1}\,(DL(x)\,\Delta x)\,L(x)^{-1}.$$


This expression for $Df(x)\,\Delta x$ is the product (composition) of three linear operators, $L(x)^{-1}$, $DL(x)\,\Delta x$, and $L(x)^{-1}$. Note that since the product of linear operators is not commutative, order is important in this formula.
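In the matrix case, Eq. (4) can be verified directly. The following sketch (with arbitrary, well-conditioned test matrices) compares $-K^{-1}\,\Delta K\,K^{-1}$ to a finite-difference approximation of the derivative of $K \mapsto K^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
K = rng.standard_normal((n, n)) + n * np.eye(n)    # diagonally shifted: invertible
dK = rng.standard_normal((n, n))                   # an arbitrary perturbation

t = 1e-7
fd = (np.linalg.inv(K + t * dK) - np.linalg.inv(K)) / t
exact = -np.linalg.inv(K) @ dK @ np.linalg.inv(K)  # Eq. (4)
print(np.max(np.abs(fd - exact)))                  # agreement up to O(t)
```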

    5. Simple representations on infinite-dimensional spaces

If $X$ and $Y$ are both infinite-dimensional, then little can be said in general about the representation of $L \in \mathcal{L}(X, Y)$. However, if one of the spaces is finite-dimensional, then it is not difficult to derive some useful results.

    5.1. Real-valued functions defined on Hilbert spaces

First I take the special case of $f : X \to \mathbb{R}$, where $X$ is a Hilbert space, that is, a complete inner product space. In this case, the Riesz Representation Theorem is available.

Theorem 5.1 (Riesz Representation Theorem). Let $X$ be a Hilbert space. Then if $L : X \to \mathbb{R}$ is linear and continuous, there exists a unique $v \in X$ such that

$$L(x) = (x, v)_X \quad \text{for all } x \in X.$$

This theorem follows immediately from Theorem 3.1 if $X$ is finite-dimensional. For a proof in the infinite-dimensional case, see any book on Hilbert spaces or functional analysis.

Now suppose $f : X \to \mathbb{R}$ is differentiable at $x \in X$. Then $Df(x)$ is a continuous linear function defined on $X$, and, by the Riesz representation theorem, there exists a vector $v \in X$ satisfying

$$Df(x)\,\Delta x = (\Delta x, v)_X \quad \text{for all } \Delta x \in X.$$

Just as in the finite-dimensional case, the vector $v$ is called the gradient of $f$ at $x$, and is denoted by $\nabla f(x)$.

    5.2. Example: The gradient of a nonlinear least-squares function

A common optimization problem is to minimize the nonlinear least-squares function

$$f(x) = \frac{1}{2}\|F(x)\|_Y^2 = \frac{1}{2}(F(x), F(x))_Y,$$

where $F : X \to Y$ is a nonlinear operator and $X$, $Y$ are Hilbert spaces. I will now apply the results developed above to compute the gradient of $f$. I will also specialize the results to $X = \mathbb{R}^n$, $Y = \mathbb{R}^m$. By the product rule,

$$Df(x)\,\Delta x = \frac{1}{2}(F(x), DF(x)\,\Delta x)_Y + \frac{1}{2}(DF(x)\,\Delta x, F(x))_Y = (DF(x)\,\Delta x, F(x))_Y.$$


Now, an $m \times n$ matrix $A$ has the property that

$$(Ax, y)_{\mathbb{R}^m} = (x, A^T y)_{\mathbb{R}^n} \quad \text{for all } x \in \mathbb{R}^n,\ y \in \mathbb{R}^m.$$

Similarly, for every operator $L \in \mathcal{L}(X, Y)$, there is a unique adjoint operator $L^*$ defined by the equation

$$(Lx, y)_Y = (x, L^* y)_X \quad \text{for all } x \in X,\ y \in Y.$$

(The existence and uniqueness of $L^*$ can be proved using the Riesz representation theorem.) Therefore,

$$Df(x)\,\Delta x = (DF(x)\,\Delta x, F(x))_Y = (\Delta x, DF(x)^* F(x))_X,$$

which shows that

$$\nabla f(x) = DF(x)^* F(x).$$

Computing the adjoint of $DF(x)$ can be quite challenging in some applications; see Section 7 for a nontrivial example. In the case of $X = \mathbb{R}^n$, $Y = \mathbb{R}^m$, $DF(x)$ is represented by the Jacobian matrix $J$, and therefore $DF(x)^*$ is represented by its transpose. It follows that

$$\nabla f(x) = J^T F(x),$$

where $J$ is the Jacobian matrix of $F$ at $x$.
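For a quick check of $\nabla f(x) = J^T F(x)$, here is a sketch with a small made-up residual map $F : \mathbb{R}^2 \to \mathbb{R}^3$ and its hand-coded Jacobian:

```python
import numpy as np

F = lambda x: np.array([x[0] ** 2 - x[1], np.sin(x[1]), x[0] * x[1] - 1.0])
J = lambda x: np.array([[2 * x[0], -1.0],
                        [0.0, np.cos(x[1])],
                        [x[1], x[0]]])      # Jacobian of F

f = lambda x: 0.5 * F(x) @ F(x)             # least-squares function
grad = lambda x: J(x).T @ F(x)              # the formula derived above

x = np.array([0.7, 1.3])
h = 1e-6
fd = np.array([(f(x + h * e) - f(x)) / h for e in np.eye(2)])
print(np.max(np.abs(fd - grad(x))))         # small
```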

    5.3. Finite-dimensional operators on Hilbert space

Next I consider the case of $F : X \to \mathbb{R}^m$, where $X$ is again a Hilbert space. Clearly $F$ can be represented as

$$F(x) = \begin{bmatrix} F_1(x) \\ F_2(x) \\ \vdots \\ F_m(x) \end{bmatrix},$$

where $F_i : X \to \mathbb{R}$, $i = 1, 2, \ldots, m$. It follows that

$$DF(x)\,\Delta x = \begin{bmatrix} DF_1(x)\,\Delta x \\ DF_2(x)\,\Delta x \\ \vdots \\ DF_m(x)\,\Delta x \end{bmatrix} = \begin{bmatrix} (\nabla F_1(x), \Delta x)_X \\ (\nabla F_2(x), \Delta x)_X \\ \vdots \\ (\nabla F_m(x), \Delta x)_X \end{bmatrix}.$$

Thus the derivative of $F$ can be represented by $m$ vectors in $X$, namely,

$$\nabla F_1(x), \nabla F_2(x), \ldots, \nabla F_m(x).$$

By analogy with the finite-dimensional case, these vectors can be thought of as forming the rows of a matrix (with infinitely many columns).

5.4. Operators with a finite-dimensional domain

5.4.1. The case of a one-dimensional domain. Suppose $Y$ is a normed linear space, and assume $F : \mathbb{R} \to Y$ is differentiable. Then $DF(t)$ is a continuous linear operator from $\mathbb{R}$ into $Y$. It is simple to represent such operators, for if $L \in \mathcal{L}(\mathbb{R}, Y)$ and $z = L(1)$, then

$$L\,\Delta t = \Delta t\,L(1) = \Delta t\,z \quad \text{for all } \Delta t \in \mathbb{R}.$$

Thus $L$ is represented by an element $z$ of $Y$. It follows that $DF(t)$ is represented by an element of $Y$, which is denoted $F'(t)$:

$$DF(t)\,\Delta t = \Delta t\,F'(t) \quad \text{for all } \Delta t \in \mathbb{R}.$$

5.4.2. The case of a finite-dimensional domain. Now suppose $F : \mathbb{R}^n \to Y$ is differentiable. Then $DF(x) \in \mathcal{L}(\mathbb{R}^n, Y)$, and the structure of such an operator must be determined. Let $L \in \mathcal{L}(\mathbb{R}^n, Y)$ and let $\{e_1, e_2, \ldots, e_n\}$ be the standard basis for $\mathbb{R}^n$. Then for any $x \in \mathbb{R}^n$,

$$x = \sum_{i=1}^{n} x_i\,e_i,$$

and so

$$Lx = \sum_{i=1}^{n} x_i\,Le_i.$$

That is, there are $n$ vectors $Le_1, Le_2, \ldots, Le_n$ in $Y$, and each image $Lx$ is a linear combination of these $n$ vectors. These $n$ vectors represent $L$. By analogy with the case $Y = \mathbb{R}^m$, one can think of the representer of $L$ as a matrix with $n$ columns (each of which is a vector in $Y$).

It is now easy to see that $DF(x)$ is represented by $n$ vectors, each of which is the representer of a partial derivative of $F$ at $x$. Again, one can think of the representer of $DF(x)$ as a matrix with $n$ columns, each a representer of a partial derivative of $F$ at $x$.


    5.5. Example: The solution operator of an IVP

As an important example of the previous section, consider a vector field $f : \Omega \to \mathbb{R}^n$, where $\Omega \subseteq \mathbb{R}^n$ is an open set. The vector field $f$ defines an autonomous (i.e. time-independent) ordinary differential equation (ODE)

$$\dot{x} = f(x).$$

By a standard result of the theory of ODEs, if $W \subset \Omega$ is closed and bounded and $f$ is $C^1$, then there exists a positive number $\tau$ such that, for each $x_0 \in W$, there exists $x \in (C[-\tau, \tau])^n$ satisfying the Initial Value Problem (IVP)

$$\dot{x} = f(x), \quad x(0) = x_0. \tag{5}$$

That is, there is an operator $S : W \to (C[-\tau, \tau])^n$ with $S(x_0) = x$, the solution of (5). I call $S$ the solution operator of the IVP.

Recall that $(C[-\tau, \tau])^n$ is the space of all continuous, vector-valued functions defined on $[-\tau, \tau]$. The usual norm on $(C[-\tau, \tau])^n$ is

$$\|u\|_\infty = \max\{\|u(t)\|_2 : t \in [-\tau, \tau]\}.$$

This definition implies that if $\|u - v\|_\infty$ is small, then $\|u(t) - v(t)\|_2$ is uniformly small on the interval $[-\tau, \tau]$. For this reason, $\|\cdot\|_\infty$ is sometimes called the uniform norm.

The derivative $DS(x_0)$ is computed by finding the local linear approximation to $S(x_0 + \Delta x_0) - S(x_0)$. Write $z = S(x_0 + \Delta x_0)$ and $x = S(x_0)$. Then $z$ satisfies

$$\dot{z} = f(z), \quad z(0) = x_0 + \Delta x_0,$$

and $x$ satisfies (5). Therefore, if $w = z - x$, then

$$\dot{w} = \dot{z} - \dot{x} = f(z) - f(x) = Df(x)(z - x) + o(\|z - x\|) = Df(x)\,w + o(\|w\|)$$

and

$$w(0) = z(0) - x(0) = x_0 + \Delta x_0 - x_0 = \Delta x_0.$$

Since a linear (in $\Delta x_0$) approximation to $w$ is desired, it is reasonable to drop the $o(\|w\|)$ term from the ODE and consider the solution $u$ to

$$\dot{u} = Df(x(t))\,u, \quad u(0) = \Delta x_0. \tag{6}$$


Note that $u$ really does depend linearly on $\Delta x_0$. Indeed, if $u$ solves (6) and $v$ solves

$$\dot{v} = Df(x(t))\,v, \quad v(0) = \Delta y_0,$$

then $y = \alpha u + \beta v$ satisfies

$$\dot{y} = \alpha\dot{u} + \beta\dot{v} = \alpha\,Df(x(t))\,u + \beta\,Df(x(t))\,v = Df(x(t))(\alpha u + \beta v) = Df(x(t))\,y,$$

and

$$y(0) = \alpha\,u(0) + \beta\,v(0) = \alpha\,\Delta x_0 + \beta\,\Delta y_0.$$

Therefore $u$, the solution of (6), depends linearly on $\Delta x_0$, and it is an approximation to $w$ since it is obtained by solving an IVP with the same initial conditions as that satisfied by $w$ and with a slightly changed vector field. It can be proved, in fact, that $w = u + o(\|w\|) = u + o(\|\Delta x_0\|_2)$. (This is a standard theorem about the continuous dependence of the solution to an IVP on the vector field.) Therefore $DS(x_0)\,\Delta x_0 = u$, where $u$ is the solution of the IVP (6).
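The result $DS(x_0)\,\Delta x_0 = u$ can be illustrated by integrating (5) and (6) side by side. The sketch below uses a damped-pendulum vector field (an arbitrary choice, not from the text) and plain Euler time stepping, and compares $u$ at the final time against a divided difference of the solution operator:

```python
import numpy as np

f = lambda x: np.array([x[1], -np.sin(x[0]) - 0.1 * x[1]])    # example vector field
Df = lambda x: np.array([[0.0, 1.0], [-np.cos(x[0]), -0.1]])  # its Jacobian

def solve(x0, T=1.0, N=2000):
    """Euler approximation to S(x0), returning the state at time T."""
    x = x0.copy()
    for _ in range(N):
        x = x + (T / N) * f(x)
    return x

def sensitivity(x0, dx0, T=1.0, N=2000):
    """Integrate the variational equation (6) along the trajectory."""
    x, u = x0.copy(), dx0.copy()
    for _ in range(N):
        u = u + (T / N) * (Df(x) @ u)   # u' = Df(x(t)) u, u(0) = dx0
        x = x + (T / N) * f(x)
    return u

x0 = np.array([1.0, 0.0]); dx0 = np.array([0.3, -0.2])
u = sensitivity(x0, dx0)
eps = 1e-6
fd = (solve(x0 + eps * dx0) - solve(x0)) / eps
print(np.max(np.abs(fd - u)))           # small: u approximates DS(x0) dx0
```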

    6. Second derivatives

In elementary calculus, if $f : \mathbb{R} \to \mathbb{R}$ is twice differentiable, then the scalar $f'(x)$ representing $Df(x)$ is called the first derivative. Since, in this way of looking at things, $f' : \mathbb{R} \to \mathbb{R}$ is the same type of function as is $f$ itself, it is natural to define $f''(x)$ as the derivative of $f'$ at $x$, so that $f''(x)$ is also a scalar and $f''$, like $f$ and $f'$, maps $\mathbb{R}$ into $\mathbb{R}$. As I will now explain, this is another instance in which the one-dimensional case gives a completely misleading picture.

Suppose $X$ and $Y$ are normed linear spaces, $U \subseteq X$ is open, and $f : U \to Y$ is differentiable. Then the derivative $Df$ is also an operator mapping one normed linear space into another; however, it is not of the same type as $f$, since $Df$ maps $U$ into $\mathcal{L}(X, Y)$. It does make sense to ask whether $Df : U \to \mathcal{L}(X, Y)$ is differentiable; to examine this question, Definition 2.2 is applied. The operator $Df$ is differentiable at $x \in U$ if there exists a continuous linear operator $L \in \mathcal{L}(X, \mathcal{L}(X, Y))$ such that

$$\lim_{\Delta x \to 0} \frac{\|Df(x + \Delta x) - Df(x) - L\,\Delta x\|_{\mathcal{L}(X,Y)}}{\|\Delta x\|_X} = 0.$$

If such an $L$ exists, then $f$ is said to be twice-differentiable at $x$, and $L$ is denoted by $D^2 f(x)$. If $f$ is twice-differentiable at each $x \in U$, then $f$ is called twice-differentiable, in which


case $D^2 f$ is an operator mapping $U$ into $\mathcal{L}(X, \mathcal{L}(X, Y))$. If this operator is continuous, then $f$ is called $C^2$.

Now, clearly Definition 2.2 can be used to discuss derivatives of order three and higher. However, things become quite awkward. For example, if $f$ is three times differentiable, then

$$D^3 f(x) \in \mathcal{L}(X, \mathcal{L}(X, \mathcal{L}(X, Y))),$$

and if $D^4 f(x)$ exists, then

$$D^4 f(x) \in \mathcal{L}(X, \mathcal{L}(X, \mathcal{L}(X, \mathcal{L}(X, Y)))).$$

Fortunately, a simplification is afforded by the nature of the spaces

$$\mathcal{L}(X, \mathcal{L}(X, Y)),\ \mathcal{L}(X, \mathcal{L}(X, \mathcal{L}(X, Y))),\ \ldots$$

Consider $L \in \mathcal{L}(X, \mathcal{L}(X, Y))$. By definition, $L(x) \in \mathcal{L}(X, Y)$ and $L(x)z \in Y$ for each $x, z \in X$. In other words, $L$ defines an operator $B : X \times X \to Y$ by

$$B(x, z) = L(x)z.$$

It is easy to see that $B$ is bilinear, that is, that

$$B(\alpha_1 x_1 + \alpha_2 x_2, z) = \alpha_1 B(x_1, z) + \alpha_2 B(x_2, z) \quad \text{for all } x_1, x_2, z \in X,\ \alpha_1, \alpha_2 \in \mathbb{R},$$
$$B(z, \alpha_1 x_1 + \alpha_2 x_2) = \alpha_1 B(z, x_1) + \alpha_2 B(z, x_2) \quad \text{for all } x_1, x_2, z \in X,\ \alpha_1, \alpha_2 \in \mathbb{R}.$$

Indeed,

$$L(\alpha_1 x_1 + \alpha_2 x_2) = \alpha_1 L(x_1) + \alpha_2 L(x_2),$$

from which it follows that

$$B(\alpha_1 x_1 + \alpha_2 x_2, z) = L(\alpha_1 x_1 + \alpha_2 x_2)z = \alpha_1 L(x_1)z + \alpha_2 L(x_2)z = \alpha_1 B(x_1, z) + \alpha_2 B(x_2, z).$$

Similarly, $L(z)$ is linear, so

$$B(z, \alpha_1 x_1 + \alpha_2 x_2) = L(z)(\alpha_1 x_1 + \alpha_2 x_2) = \alpha_1 L(z)x_1 + \alpha_2 L(z)x_2 = \alpha_1 B(z, x_1) + \alpha_2 B(z, x_2).$$


If $X$, $Y$, and $Z$ are normed linear spaces, then the space of continuous bilinear operators $B : X \times Y \to Z$ is denoted by $\mathcal{L}^2(X, Y, Z)$. This space has a natural norm:

$$\|B\|_{\mathcal{L}^2(X,Y,Z)} = \sup\left\{\frac{\|B(x, y)\|_Z}{\|x\|_X\,\|y\|_Y} : x \in X,\ y \in Y,\ x \ne 0,\ y \ne 0\right\}.$$

It is a standard result that a bilinear operator $B : X \times Y \to Z$ is continuous if and only if $\|B\|_{\mathcal{L}^2(X,Y,Z)} < \infty$.


Note that the notation $D^2_{xy} f$ denotes the partial derivative with respect to $y$ of $D_x f$, and similarly for $D^2_{yx} f$. By Theorem 6.1, each of $D^2 f(x, y)$, $D^2_{xx} f(x, y)$, and $D^2_{yy} f(x, y)$ is symmetric. It follows easily that

$$D^2_{xy} f(x, y)\,\Delta z\,\Delta x = D^2_{yx} f(x, y)\,\Delta x\,\Delta z \quad \text{for all } \Delta x \in X,\ \Delta z \in Y.$$

Of course, these results can be generalized to the case of $f : X_1 \times X_2 \times \cdots \times X_n \to Z$, in which case the fundamental formulas are

$$D^2 f(x)\,\Delta x\,r = \sum_{i=1}^{n}\sum_{j=1}^{n} D^2_{x_i x_j} f(x)\,\Delta x_j\,r_i,$$

and

$$D^2_{x_i x_j} f(x)\,\Delta x_j\,r_i = D^2_{x_j x_i} f(x)\,r_i\,\Delta x_j.$$

6.3. Representation of second derivatives on finite-dimensional spaces

Now I return to the case of $f : \mathbb{R}^n \to \mathbb{R}^m$ and derive the formula for the 3-tensor representing $D^2 f(x)$. Since $\mathbb{R}^n$ can be regarded as the product of $n$ copies of $\mathbb{R}$, $D^2 f(x)$ can be expressed in terms of the second partial derivatives of $f$:

$$D^2 f(x)\,\Delta x\,r = \sum_{j=1}^{n}\sum_{k=1}^{n} D^2_{x_j x_k} f(x)\,\Delta x_k\,r_j.$$

Recall that $D^2_{x_j x_k} f$ is the derivative with respect to $x_k$ of $D_{x_j} f$; also recall that

$$D_{x_j} f : \mathbb{R}^n \to \mathcal{L}(\mathbb{R}, \mathbb{R}^m),$$

or, effectively,

$$D_{x_j} f : \mathbb{R}^n \to \mathbb{R}^m$$

(since each operator in $\mathcal{L}(\mathbb{R}, \mathbb{R}^m)$ is represented by a vector in $\mathbb{R}^m$, and vice versa). Specifically, $D_{x_j} f(x)$ is represented by the vector

$$\frac{\partial f}{\partial x_j}(x).$$

By the same reasoning, $D^2_{x_j x_k} f(x)$ is also represented by a vector in $\mathbb{R}^m$, namely,

$$\frac{\partial^2 f}{\partial x_k\,\partial x_j}(x).$$


It follows that

$$D^2 f(x)\,\Delta x\,r = \sum_{j=1}^{n}\sum_{k=1}^{n} \frac{\partial^2 f}{\partial x_k\,\partial x_j}(x)\,\Delta x_k\,r_j.$$

Thus,

$$\left(D^2 f(x)\,\Delta x\,r\right)_i = \sum_{j=1}^{n}\sum_{k=1}^{n} \frac{\partial^2 f_i}{\partial x_k\,\partial x_j}(x)\,\Delta x_k\,r_j,$$

which shows that $D^2 f(x)$ is represented by the 3-tensor $T$, with

$$T_{ijk} = \frac{\partial^2 f_i}{\partial x_k\,\partial x_j}(x).$$

    6.4. The Hessian

In Sections 3.2.2 and 5.1, I showed that, in the case of $f : X \to \mathbb{R}$, the usual representer of $Df(x)$ is the gradient vector $\nabla f(x)$. When $X = \mathbb{R}^n$, this is not quite the same as the Jacobian matrix (which is a row matrix in that case); the gradient is adopted instead precisely because it generalizes to the case in which $X$ is a Hilbert space. In the same way, the representer for $D^2 f(x)$ has a special form when the range of $f$ is $\mathbb{R}$. Indeed, in this case $D^2 f(x)$ is a bilinear operator mapping $X \times X$ into $\mathbb{R}$, and the following theorem holds.

Theorem 6.4. Suppose $X$ is a Hilbert space and $B \in \mathcal{L}^2(X, X, \mathbb{R})$ is a bilinear form. Then there exists a linear operator $L \in \mathcal{L}(X, X)$ such that

$$B(x, y) = (Lx, y)_X.$$

This theorem can be proved using the Riesz representation theorem, since, for fixed $x$, the map

$$y \mapsto B(x, y)$$

defines a continuous, linear, real-valued function on $X$. In the case of $D^2 f(x)$, where $f : X \to \mathbb{R}$, the linear operator representing the bilinear operator is called the Hessian operator and is denoted $\nabla^2 f(x)$. That is,

$$\nabla^2 f(x) \in \mathcal{L}(X, X)$$

is defined by

$$D^2 f(x)\,\Delta x\,\Delta y = \left(\nabla^2 f(x)\,\Delta x, \Delta y\right)_X.$$


In the case of $f : \mathbb{R}^n \to \mathbb{R}$, the Hessian matrix is just the specialization of the tensor $T$ discussed above. Since $m = 1$ in this case, the 3-tensor can obviously be identified with a 2-tensor, i.e. a matrix (just as the Jacobian matrix can be identified with a vector, the gradient, in this case). Therefore,

$$\left(\nabla^2 f(x)\right)_{ij} = \frac{\partial^2 f}{\partial x_j\,\partial x_i}(x).$$

    6.5. Example: The Hessian of a nonlinear least-squares function

I now return to the example of Section 5.2. Let $F : X \to Y$, where $X$ and $Y$ are Hilbert spaces, and define

$$f(x) = \frac{1}{2}(F(x), F(x))_Y.$$

I showed earlier that

$$Df(x)\,\Delta x = (DF(x)\,\Delta x, F(x))_Y.$$

By the product rule, it follows that

$$D^2 f(x)\,\Delta x\,r = (DF(x)\,\Delta x, DF(x)\,r)_Y + (D^2 F(x)\,\Delta x\,r, F(x))_Y. \tag{7}$$

This gives a formula for the second derivative of $f$, but it must be rearranged to exhibit the Hessian operator. The first term is easy to handle, since

$$(DF(x)\,\Delta x, DF(x)\,r)_Y = (\Delta x, DF(x)^* DF(x)\,r)_X.$$

The operator $DF(x)^* DF(x)$ thus forms part of the Hessian; indeed, in small-residual least-squares problems, this operator is a good approximation to the Hessian, at least for $x$ near the minimizer. For this reason, it is often used as an approximation to the Hessian, and is referred to as the Gauss-Newton Hessian.

To handle the second term, write

$$B(\Delta x, r) = D^2 F(x)\,\Delta x\,r;$$

then

$$(D^2 F(x)\,\Delta x\,r, F(x))_Y = (B(\Delta x, r), F(x))_Y = (\Delta x, (B(\cdot, r))^* F(x))_X,$$

where I write $B(\cdot, r)$ for the linear operator defined by

$$\Delta x \mapsto B(\Delta x, r).$$


Now, it is easy to see that

$$r \mapsto (B(\cdot, r))^* F(x)$$

defines a linear operator mapping $X$ to $X$, and it can be shown to be bounded (continuous). This operator depends on $x$ through $F(x)$ and $D^2 F(x)$, and I will denote it by $S(x)$, so that

$$S(x)\,r = (B(\cdot, r))^* F(x).$$

With this notation,

$$(D^2 F(x)\,\Delta x\,r, F(x))_Y = (\Delta x, S(x)\,r)_X,$$

and therefore

$$\nabla^2 f(x) = DF(x)^* DF(x) + S(x).$$

Lastly, I will compute $S(x)$ in the case $F : \mathbb{R}^n \to \mathbb{R}^m$. In that case, $D^2 F(x)$ is represented by the 3-tensor $T$, where

$$T_{ijk} = \frac{\partial^2 F_i}{\partial x_k\,\partial x_j}(x).$$

Therefore, setting $z = F(x)$,

$$(D^2 F(x)\,\Delta x\,r, z)_{\mathbb{R}^m} = \sum_{i=1}^{m} (D^2 F(x)\,\Delta x\,r)_i\,z_i = \sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{k=1}^{n} T_{ijk}\,\Delta x_k\,r_j\,z_i = \sum_{k=1}^{n}\left(\sum_{j=1}^{n}\sum_{i=1}^{m} T_{ijk}\,z_i\,r_j\right)\Delta x_k = (\Delta x, v)_{\mathbb{R}^n}, \quad v_k = \sum_{j=1}^{n}\sum_{i=1}^{m} T_{ijk}\,z_i\,r_j,$$

and so

$$(S(x)\,r)_k = \sum_{j=1}^{n}\sum_{i=1}^{m} T_{ijk}\,z_i\,r_j = \sum_{j=1}^{n}\left(\sum_{i=1}^{m} \frac{\partial^2 F_i}{\partial x_k\,\partial x_j}(x)\,F_i(x)\right) r_j.$$


This shows that $S(x)$ is represented by the matrix whose $(k, j)$ entry is

$$\sum_{i=1}^{m} \frac{\partial^2 F_i}{\partial x_k\,\partial x_j}(x)\,F_i(x),$$

and hence

$$S(x) = \sum_{i=1}^{m} F_i(x)\,\nabla^2 F_i(x).$$

The matrix representing $S(x)$ has been referred to as the "mess matrix," and the above formula shows that it is expensive to compute. This explains the popularity of the Gauss-Newton Hessian. However, in large-residual least-squares problems, use of the full Hessian (or an approximation to it) is necessary.
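The decomposition $\nabla^2 f(x) = DF(x)^* DF(x) + S(x)$ can be checked numerically. The sketch below reuses the made-up residual map from the gradient example above, hand-codes the component Hessians $\nabla^2 F_i$, and compares the assembled Hessian with finite differences of the gradient:

```python
import numpy as np

F = lambda x: np.array([x[0] ** 2 - x[1], np.sin(x[1]), x[0] * x[1] - 1.0])
J = lambda x: np.array([[2 * x[0], -1.0],
                        [0.0, np.cos(x[1])],
                        [x[1], x[0]]])
# Hessians of the components F_1, F_2, F_3:
HF = lambda x: [np.array([[2.0, 0.0], [0.0, 0.0]]),
                np.array([[0.0, 0.0], [0.0, -np.sin(x[1])]]),
                np.array([[0.0, 1.0], [1.0, 0.0]])]

def hessian(x):
    gauss_newton = J(x).T @ J(x)                     # first term
    S = sum(Fi * Hi for Fi, Hi in zip(F(x), HF(x)))  # S(x) = sum F_i Hess(F_i)
    return gauss_newton + S

grad = lambda x: J(x).T @ F(x)
x = np.array([0.7, 1.3])
h = 1e-6
fd = np.column_stack([(grad(x + h * e) - grad(x)) / h for e in np.eye(2)])
print(np.max(np.abs(fd - hessian(x))))               # small
```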

    7. Example: The adjoint state method

As a more involved example, I will discuss the computation of $DG(c)$ and $DG(c)^*$ for a nonlinear operator $G$ defined by an (explicit) finite-difference simulation. This discussion is taken from the paper (Gockenbach et al., in press). The problem described here arises, for example, when one or more coefficients in a partial differential equation are to be estimated by the Output Least-Squares (OLS) technique. In this technique, the parameters are chosen to produce simulated data as close as possible (in a norm induced by an inner product) to observed data. Specifically, the OLS problem is

$$\min_c J(c), \qquad J(c) = \frac{1}{2}\|G(c) - D_{\mathrm{obs}}\|^2, \tag{8}$$

where $c \in C$ denotes the unknown parameters, $D_{\mathrm{obs}}$ is the observed data, and $G$ is the forward map, that is, the operator embodying the mathematical model of the dependence of the data on the parameters. Thus the OLS problem is just a nonlinear least-squares problem of the type discussed above, and

$$\nabla J(c) = DG(c)^*(G(c) - D_{\mathrm{obs}}).$$

In the application I consider here, $G$ is defined by an explicit finite-difference simulation followed by sampling (in many applications, only part of the field simulated by finite differences is observable). I will therefore assume that

$$G(c) = SU = \sum_{n=0}^{N} S_n U^n,$$

where $U^n \in \mathcal{U}$ is (related to) the $n$th time level of the simulated field, $S : \mathcal{U}^{N+1} \to \mathcal{D}$ is the sampling operator, and $\mathcal{D}$ is the data space (that is, $D_{\mathrm{obs}} \in \mathcal{D}$).


Note that $S$ is defined by

$$SU = \sum_{n=0}^{N} S_n U^n,$$

where $S_n : \mathcal{U} \to \mathcal{D}$ for $n = 0, 1, \ldots, N$. That is, each time level of the computed field is sampled, and the results are accumulated as the data. This formalism provides an efficient way to abstractly represent several different sampling possibilities. For example, the entire time level $U^n$ may be recorded for certain values of $n$, in which case $S_n$ is the zero operator for all other values of $n$. Alternatively, every time level could be sampled at a few receiver locations (as in the typical seismic experiment), and the results recorded as time series. At the other extreme, the entire history of the field could be retained. All of these possibilities can be accommodated within the above formalism by appropriate choice of $S$.

Any finite-difference scheme can be considered to be formally two-level, by concatenating several time levels if necessary. Therefore,

$$U^{n+1} = H^n(c, U^n), \quad n = 0, 1, \ldots, N-1.$$

I call $H^n : C \times \mathcal{U} \to \mathcal{U}$ the stencil operator.

    7.1. A convection-diffusion example

I will now pause to give an explicit example of the situation described above. Consider the following initial-boundary value problem for the convection-diffusion equation:

$$\begin{aligned} u_t + a(x)\,u_x &= 0, && 0 < x < 1,\ t > 0,\\ u(0, t) &= 0, && t > 0,\\ u(x, 0) &= \phi(x), && 0 < x < 1, \end{aligned}$$

where $a(x) > 0$ for all $x \in [0, 1]$. Define a grid on the rectangle $(x, t) \in [0, 1] \times [0, T]$ by setting

$$x_j = j\,\Delta x, \quad \Delta x = \frac{1}{M}, \qquad t_n = n\,\Delta t, \quad \Delta t = \frac{T}{N},$$

and write $u^n_j$ for the approximation to $u(x_j, t_n)$. Since the characteristics of the PDE point up and to the right in the $(x, t)$ plane, it is natural to discretize using a forward difference in time and a backward difference in space to obtain

$$\frac{u^{n+1}_j - u^n_j}{\Delta t} + a_j\,\frac{u^n_j - u^n_{j-1}}{\Delta x} = 0,$$


where $a_j = a(x_j)$. Taking into account the initial and boundary values yields

$$u^{n+1}_j = \begin{cases} 0, & j = 0,\ n = 0, 1, \ldots, N-1,\\ \phi_j - a_j\,\dfrac{\Delta t}{\Delta x}\,(\phi_j - \phi_{j-1}), & n = 0,\ j = 1, 2, \ldots, M,\\ u^n_1 - a_1\,\dfrac{\Delta t}{\Delta x}\,u^n_1, & n = 1, 2, \ldots, N-1,\ j = 1,\\ u^n_j - a_j\,\dfrac{\Delta t}{\Delta x}\,\left(u^n_j - u^n_{j-1}\right), & n = 1, 2, \ldots, N-1,\ j = 2, 3, \ldots, M. \end{cases} \tag{9}$$

The stencil operator $H$ for this example is therefore defined by $H(a, u^n) = u^{n+1}$, where $u^{n+1}$ is defined by (9). In terms of the above notation, $\mathcal{U}$ is $(M+1)$-dimensional space, while $C$ is $M$-dimensional space.
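A direct implementation of the stencil operator and the resulting time stepping is straightforward. The sketch below (with illustrative grid sizes and a made-up coefficient $a$ and initial condition $\phi$, none of which come from the text) applies scheme (9):

```python
import numpy as np

M, N, T = 100, 200, 1.0
dx, dt = 1.0 / M, T / N
x = np.linspace(0.0, 1.0, M + 1)
a = 0.5 + 0.5 * x                        # a(x) > 0; a*dt/dx <= 0.5 here (stable)
phi = np.exp(-100.0 * (x - 0.3) ** 2)    # initial condition
phi[0] = 0.0                             # consistent with u(0, t) = 0

def H(a, u):
    """Stencil operator: one forward-in-time, backward-in-space step of (9)."""
    unew = np.empty_like(u)
    unew[0] = 0.0                        # boundary value
    unew[1:] = u[1:] - a[1:] * (dt / dx) * (u[1:] - u[:-1])
    return unew

U = [phi]                                # the field: time levels U^0, ..., U^N
for n in range(N):
    U.append(H(a, U[-1]))
```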

To introduce sampling, suppose that sensors are placed at several grid points on the spatial grid, say at $x_{j_1}, x_{j_2}, \ldots, x_{j_\ell}$, and that the observed data consist of the time series

$$u^1_{j_i}, u^2_{j_i}, \ldots, u^N_{j_i}, \quad i = 1, 2, \ldots, \ell.$$

If each time series forms a column of a matrix, then the data $D$ is an $(N+1) \times \ell$ matrix, and we have

$$D = \sum_{n=0}^{N} S_n U^n,$$

where $S_n U^n$ is the matrix with every row equal to zero except the $n$th row, which has entries

$$u^n_{j_1}, u^n_{j_2}, \ldots, u^n_{j_\ell}.$$

    7.2. Back to the general case

The linearization of the map $c \mapsto G(c)$ is the result of first-order perturbation of the time-stepping equations:

$$DG(c)\,\delta c = \sum_{n=0}^{N} S_n\,\delta U^n,$$

where

$$\delta U^{n+1} = D_c H^n(c, U^n)\,\delta c + D_U H^n(c, U^n)\,\delta U^n$$

and $\delta U^n = (DU(c)\,\delta c)^n$. Note that if the original finite-difference scheme is linear (really affine: linear plus constant), then it can be written as

$$U^{n+1} = A(c)\,U^n + F^n,$$


where $A(c) = D_U H^n(c, U)$ ($D_U H^n$ is independent of the time level $n$ in this case). It follows that $\delta U$ satisfies

$$\delta U^{n+1} = A(c)\,\delta U^n + (DA(c)\,\delta c)\,U^n.$$

Therefore, in this common case, the linearization is computed by a finite-difference simulation identical to the original, except that the right-hand side $F^n$ is replaced by $(DA(c)\,\delta c)\,U^n$.

I will now show how to compute the adjoint of $DG(c)$. The spaces $C$ and $\mathcal{U}$ require inner products; these inner products will be denoted $(\cdot, \cdot)_C$ and $(\cdot, \cdot)_{\mathcal{U}}$, respectively. The field $U$ belongs to $\mathcal{U}^{N+1}$, and I define the inner product on $\mathcal{U}^{N+1}$ by

$$(U, V)_{\mathcal{U}^{N+1}} = \sum_{n=0}^{N} (U^n, V^n)_{\mathcal{U}}.$$

For convenience, and suppressing the dependence on $c$, write $A_n$ for $D_U H^n(c, U^n)$ and $F^{n+1}$ for $D_c H^n(c, U^n)\,\delta c$, with $F^0 = 0$, so that the linearized scheme can be written as

$$\delta U^0 = 0, \qquad \delta U^{n+1} - A_n\,\delta U^n = F^{n+1}, \quad n = 0, 1, \ldots, N-1.$$

This can also be written as

$$M\,\delta U = F,$$

where $M : \mathcal{U}^{N+1} \to \mathcal{U}^{N+1}$ is the block linear operator

$$M = \begin{bmatrix} I & 0 & \cdots & 0 & 0\\ -A_0 & I & \cdots & 0 & 0\\ 0 & -A_1 & \ddots & & \vdots\\ \vdots & & \ddots & \ddots & 0\\ 0 & 0 & \cdots & -A_{N-1} & I \end{bmatrix}$$

(note that $M$ depends on $c$, but I suppress this dependence). Then $\delta U = M^{-1} F$, and the explicit time-stepping scheme is equivalent to solving $M\,\delta U = F$ by forward substitution.

Now, write $B$ for the operator mapping $\delta c$ to $F$:

$$(B\,\delta c)^n = \begin{cases} 0, & n = 0,\\ D_c H^{n-1}(c, U^{n-1})\,\delta c, & n = 1, 2, \ldots, N \end{cases}$$

(again suppressing the fact that $B$ depends on $c$). Then

$$DG(c) = S\,M^{-1} B,$$


and so

$$DG(c)^* = B^* M^{-*} S^*.$$

Assuming that $S_n$, $S_n^*$, and the stencil operator $H^n$ and its derivatives and adjoints $D_c H^n(c, U)$, $D_U H^n(c, U)$, $D_c H^n(c, U)^*$, and $D_U H^n(c, U)^*$ are known (the reader might find it instructive to compute these derivatives and adjoints for the convection-diffusion example given above), I will now show how to compute $DG(c)^*$ from them. Note that $DG(c)^* D = B^* M^{-*} S^* D$. Write $V = S^* D$. Then, as is easy to verify,

$$V^n = (S^* D)^n = S_n^* D, \quad n = 0, 1, \ldots, N.$$

From my choice of inner product on $\mathcal{U}^{N+1}$, it follows that $M^*$ is the block linear operator

$$M^* = \begin{bmatrix} I & -A_0^* & 0 & \cdots & 0\\ 0 & I & -A_1^* & & \vdots\\ \vdots & & \ddots & \ddots & 0\\ 0 & 0 & \cdots & I & -A_{N-1}^*\\ 0 & 0 & \cdots & 0 & I \end{bmatrix}.$$

Write $W = M^{-*} V$, so that $W$ solves $M^* W = V$. Since $M^*$ is block upper triangular, $W$ can be found by back substitution, which is equivalent to the following reverse time-stepping scheme:

$$W^N = V^N, \qquad W^{n-1} = A_{n-1}^* W^n + V^{n-1}, \quad n = N, N-1, \ldots, 1.$$

I will refer to $W$ as the adjoint state and to the equation $M^* W = V$ as the adjoint state equation.

Next I compute $B^*$. Note that

$$(B\,\delta c, W)_{\mathcal{U}^{N+1}} = (0, W^0)_{\mathcal{U}} + \sum_{n=1}^{N} (D_c H^{n-1}(c, U^{n-1})\,\delta c, W^n)_{\mathcal{U}} = \sum_{n=1}^{N} (\delta c, D_c H^{n-1}(c, U^{n-1})^* W^n)_C = \left(\delta c, \sum_{n=1}^{N} D_c H^{n-1}(c, U^{n-1})^* W^n\right)_C.$$

This shows that

$$B^* W = \sum_{n=1}^{N} D_c H^{n-1}(c, U^{n-1})^* W^n.$$


Thus the procedure for computing $DG(c)^* D$, for $D \in \mathcal{D}$, is:

1. Solve the simulation problem to produce the field $U$ (needed in steps 3(b) and 3(c)).

2. Set $\delta c$ to zero.

3. For $n = N, N-1, \ldots, 1$:
(a) Compute $V^n = S_n^* D$.
(b) Compute $W^n$ by taking one step (backward in time) on the adjoint state equation (or simply $W^N = V^N$).
(c) Add $D_c H^{n-1}(c, U^{n-1})^* W^n$ to the output vector $\delta c$.
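Here is a runnable sketch of this procedure for the upwind scheme of Section 7.1, under the simplifying assumption (made here for illustration, not taken from the text) that only the final time level is sampled, so that $J(a) = \frac{1}{2}\|u^N - d_{\mathrm{obs}}\|^2$ and all inner products are Euclidean (adjoints become transposes):

```python
import numpy as np

M, N = 100, 200
dx, dt = 1.0 / M, 1.0 / N
r = dt / dx
x = np.linspace(0.0, 1.0, M + 1)
phi = np.exp(-100.0 * (x - 0.3) ** 2); phi[0] = 0.0

def H(a, u):                          # stencil operator of scheme (9)
    unew = np.empty_like(u)
    unew[0] = 0.0
    unew[1:] = u[1:] - a[1:] * r * (u[1:] - u[:-1])
    return unew

def DU_H_adj(a, w):                   # (D_U H)^* w; D_U H is lower bidiagonal
    v = np.zeros_like(w)
    v[1:] += (1.0 - a[1:] * r) * w[1:]
    v[:-1] += a[1:] * r * w[1:]
    return v

def Dc_H_adj(a, u, w):                # (D_a H(a, u))^* w
    g = np.zeros_like(a)
    g[1:] = -r * (u[1:] - u[:-1]) * w[1:]
    return g

def grad_J(a, d_obs):
    U = [phi]                         # step 1: forward sweep, storing the field
    for n in range(N):
        U.append(H(a, U[-1]))
    dc = np.zeros_like(a)             # step 2
    w = U[N] - d_obs                  # W^N = V^N (the sampled residual)
    dc += Dc_H_adj(a, U[N - 1], w)    # step 3(c) for n = N
    for n in range(N - 1, 0, -1):     # steps 3(a)-(c); V^n = 0 for n < N here
        w = DU_H_adj(a, w)            # one backward step of the adjoint state equation
        dc += Dc_H_adj(a, U[n - 1], w)
    return dc

def J(a, d_obs):
    u = phi
    for n in range(N):
        u = H(a, u)
    return 0.5 * np.sum((u - d_obs) ** 2)

a = 0.5 + 0.5 * x
d_obs = np.zeros(M + 1)
da = np.random.default_rng(2).standard_normal(M + 1)
eps = 1e-6
print((J(a + eps * da, d_obs) - J(a, d_obs)) / eps,
      grad_J(a, d_obs) @ da)          # the two numbers should agree
```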

A logistical problem immediately asserts itself: $U$ is produced by stepping forward in time, $W$ by stepping backwards. Unless the state space has small dimension (which is certainly not the typical case), storage of the entire time history of the reference field $U$ is very expensive in terms of memory. On the other hand, one could, at each step of the backward time-stepping algorithm, re-compute the needed time level $U^n$ by forward time-stepping from $U^0$. This is obviously expensive in terms of computation time.

To balance the need for storage and recomputation, a checkpointing scheme due to Andreas Griewank (1992), extended in Symes et al. (1998), can be employed. The idea is to save (checkpoint) various time levels $U^n$ to use as intermediate initial data to restart the computation of $U$ during the solution of the adjoint state system. A complete description of the algorithm appears in Gockenbach et al. (in press).

8. Example: Differentiating a finite element solution operator in an inverse problem

As my final example, I will discuss the computation of the derivative and its adjoint when the operator is the (approximate) solution operator, as implemented using the finite element method, of an elliptic partial differential equation. Suppose $\Omega$ is a bounded polygonal region in the plane and consider the boundary value problem (BVP)

$$-\nabla \cdot (a\,\nabla u) = p \ \text{ in } \Omega, \qquad u = 0 \ \text{ on } \partial\Omega. \tag{10}$$

This BVP models, for example, the small transverse displacements $u$ of an elastic membrane under a transverse pressure $p$. The coefficient $a$ describes the elastic properties of the membrane, and when the membrane is heterogeneous, $a$ is a function of space: $a = a(x)$. The usual direct problem is to compute $u$ given the functions $p$ and $a$; that is, given the elastic properties of the membrane and the pressure to which it is subjected, determine its displacement. In many applications, it is necessary to solve an inverse problem, such as: Given $p$ and a measurement of $u$, estimate $a$; that is, by observing the displacement of the membrane under a known pressure, estimate the elastic properties of the membrane. (It would also be possible to consider $p$ as needing to be measured, that is, that it forms part of the data of the problem. To simplify the presentation, I will assume that the pressure $p$ is known.)


One way to solve the inverse problem numerically is to use the Output Least-Squares approach, as described in the previous section, in conjunction with the finite element method for solving the BVP. Suppose that the measured data is denoted $u_{\mathrm{obs}}$ and $a$ is to be chosen so that the predicted displacement $u$, as simulated by piecewise linear finite elements, is as close to $u_{\mathrm{obs}}$ as possible in the $L^2(\Omega)$ norm. It is necessary to have a representation for the unknown coefficient $a$, and I will represent it using a piecewise linear function. Let $\mathcal{T}^{(h)}$ be a triangulation of $\Omega$, and define

$$\mathcal{P}^{(h)} = \{\phi : \Omega \to \mathbb{R} \mid \phi \text{ is continuous and piecewise linear relative to } \mathcal{T}^{(h)}\},$$

$$\mathcal{P}^{(h)}_0 = \{\phi \in \mathcal{P}^{(h)} \mid \phi = 0 \text{ on } \partial\Omega\}.$$

Suppose the nodes of the triangulation $\mathcal{T}^{(h)}$ are $x_1, x_2, \ldots, x_m$ and $\phi_i$ is the element of $\mathcal{P}^{(h)}$ defined by

$$\phi_i(x_j) = \begin{cases} 1, & j = i,\\ 0, & j \ne i. \end{cases}$$

Then $\{\phi_1, \phi_2, \ldots, \phi_m\}$ is the standard basis for the space $\mathcal{P}^{(h)}$, and every element $u \in \mathcal{P}^{(h)}$ satisfies

$$u = \sum_{i=1}^{m} u(x_i)\,\phi_i.$$

The basis functions that correspond to interior nodes comprise a basis for $\mathcal{P}^{(h)}_0$; I will denote this basis by $\{\psi_1, \psi_2, \ldots, \psi_n\}$ (there exists a sequence $i_1, i_2, \ldots, i_n$ such that $\psi_j = \phi_{i_j}$, $j = 1, 2, \ldots, n$).

The finite element method for estimating the solution of (10) takes the form: find $u \in \mathcal{P}^{(h)}_0$ such that

$$\int_\Omega a\,\nabla u \cdot \nabla\psi_i = \int_\Omega p\,\psi_i, \quad i = 1, 2, \ldots, n. \tag{11}$$

Upon substituting

$$u = \sum_{j=1}^{n} U_j\,\psi_j,$$

(11) can be written as the matrix-vector equation $KU = P$, where

$$K_{ij} = \int_\Omega a\,\nabla\psi_j \cdot \nabla\psi_i, \quad i, j = 1, 2, \ldots, n, \qquad P_i = \int_\Omega p\,\psi_i, \quad i = 1, 2, \ldots, n.$$

Note that $K \in \mathbb{R}^{n \times n}$ is symmetric and positive definite.


Now define the (approximate) solution operator of (10) as

$$f : \mathcal{P}^{(h)} \to \mathcal{P}^{(h)}_0,$$

where

$$f : a \mapsto u = \sum_{i=1}^{n} U_i\,\psi_i, \qquad U = K^{-1} P,$$

and $K \in \mathbb{R}^{n \times n}$ and $P \in \mathbb{R}^n$ are defined as above. The OLS approach is then to minimize the function $J : \mathcal{P}^{(h)} \to \mathbb{R}$ defined by

$$J(a) = \frac{1}{2}\|f(a) - u_{\mathrm{obs}}\|^2_{L^2(\Omega)}.$$

This is another nonlinear least-squares function, and its gradient is given by

$$\nabla J(a) = Df(a)^*(f(a) - u_{\mathrm{obs}}).$$

It is easier to compute $Df(a)$ and $Df(a)^*$ if I explicitly recognize the fact that the bases for $\mathcal{P}^{(h)}$ and $\mathcal{P}^{(h)}_0$ make it possible to identify them with $\mathbb{R}^m$ and $\mathbb{R}^n$, respectively. Define $E : \mathbb{R}^m \to \mathcal{P}^{(h)}$ by

$$EA = \sum_{i=1}^{m} A_i\,\phi_i,$$

and note that, as discussed above, $E^{-1}$ is defined by

$$(E^{-1}a)_i = a(x_i), \quad i = 1, 2, \ldots, m.$$

Similarly, define $E_0 : \mathbb{R}^n \to \mathcal{P}^{(h)}_0$ by

$$E_0 U = \sum_{i=1}^{n} U_i\,\psi_i;$$

then $(E_0^{-1} u)_j = u(x_{i_j})$, $j = 1, 2, \ldots, n$. I can then write

$$f = E_0 \circ F \circ E^{-1},$$

where $F : \mathbb{R}^m \to \mathbb{R}^n$ is defined by

$$F(A) = U, \qquad a = \sum_{i=1}^{m} A_i\,\phi_i, \quad u = \sum_{i=1}^{n} U_i\,\psi_i, \quad U = K^{-1} P.$$


I will now show how to compute $DF(A)$ and $DF(A)^*$. The matrix $K$ depends on $A$, so I will write $K = K(A)$. With

$$a = \sum_{k=1}^{m} A_k\,\phi_k,$$

it follows that

$$K_{ij}(A) = \int_\Omega a\,\nabla\psi_j \cdot \nabla\psi_i = \sum_{k=1}^{m} A_k \int_\Omega \phi_k\,\nabla\psi_j \cdot \nabla\psi_i = \sum_{k=1}^{m} T_{ijk}\,A_k,$$

where

$$T_{ijk} = \int_\Omega \phi_k\,\nabla\psi_j \cdot \nabla\psi_i, \quad i, j = 1, 2, \ldots, n,\ k = 1, 2, \ldots, m.$$

It then follows that, for any $A, \Delta A \in \mathbb{R}^m$,

$$(DK(A)\,\Delta A)_{ij} = \sum_{k=1}^{m} T_{ijk}\,\Delta A_k = K_{ij}(\Delta A).$$

This result, $DK(A)\,\Delta A = K(\Delta A)$, is to be expected because the operator $K : A \mapsto K(A)$ is linear in $A$. Since $F$ is defined by

$$F(A) = K(A)^{-1} P,$$

the result from Section 4.5 applies, and

$$DF(A)\,\Delta A = -K(A)^{-1}(DK(A)\,\Delta A)\,K(A)^{-1} P = -K(A)^{-1} K(\Delta A)\,U,$$

where $U = F(A)$. This formula shows that computing $DF(A)\,\Delta A$ for a given $\Delta A$ is no more expensive than computing the simulated displacement $U$ (assuming $U$ is computed first, so $K(A)$ and $U$ are already known), and may be much less expensive if the matrix $K(A)$ has already been factored.
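The formula $DF(A)\,\Delta A = -K(A)^{-1} K(\Delta A)\,U$ depends only on the linearity of $A \mapsto K(A)$, so it can be illustrated without assembling an actual finite element stiffness matrix. In the sketch below, a synthetic tensor $T_{ijk}$ with the required symmetry stands in for the integrals $\int_\Omega \phi_k\,\nabla\psi_j \cdot \nabla\psi_i$ (all data are made up for the purpose of the check):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 6, 8
V = rng.standard_normal((m, n))
T = np.einsum('ki,kj->ijk', V, V)        # T[i,j,k] = V[k,i] V[k,j]: symmetric in i, j
P = rng.standard_normal(n)

K = lambda A: np.einsum('ijk,k->ij', T, A)   # linear in A; SPD for A > 0
F = lambda A: np.linalg.solve(K(A), P)       # F(A) = K(A)^{-1} P

A = 1.0 + rng.random(m)                      # positive coefficients
dA = rng.standard_normal(m)

exact = -np.linalg.solve(K(A), K(dA) @ F(A)) # -K(A)^{-1} K(dA) U
eps = 1e-7
fd = (F(A + eps * dA) - F(A)) / eps
print(np.max(np.abs(fd - exact)))            # small
```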


I will now turn to the computation of $DF(A)^*$. Note that

$$(K(\Delta A)\,U)_i = \sum_{j=1}^{n} (K(\Delta A))_{ij}\,U_j = \sum_{j=1}^{n}\sum_{k=1}^{m} T_{ijk}\,\Delta A_k\,U_j = \sum_{k=1}^{m}\left(\sum_{j=1}^{n} T_{ijk}\,U_j\right)\Delta A_k.$$

I now define the matrix $L = L(U)$ by

$$L(U)_{ik} = \sum_{j=1}^{n} T_{ijk}\,U_j,$$

which allows me to write

$$K(\Delta A)\,U = L(U)\,\Delta A$$

and

$$DF(A)\,\Delta A = -K(A)^{-1} L(U)\,\Delta A.$$

The formula for $DF(A)^*$ now follows:

$$DF(A)^*\,\Delta U = -L(U)^T K(A)^{-1}\,\Delta U$$

(where I used the fact that $K$ is symmetric).

The relationship between $Df(a)$ and $DF(A)$ is straightforward, and, indeed, is exactly analogous to the relationship between $f$ and $F$. If $a = EA$ and $\Delta a = E\,\Delta A$, that is,

$$a = \sum_{i=1}^{m} A_i\,\phi_i, \qquad \Delta a = \sum_{i=1}^{m} \Delta A_i\,\phi_i,$$

then $\Delta u = Df(a)\,\Delta a$ and $\Delta U = DF(A)\,\Delta A$ satisfy

$$\Delta u = \sum_{i=1}^{n} \Delta U_i\,\psi_i.$$

Indeed, this follows from the chain rule applied to the relationship

$$f(a) = E_0\,F(E^{-1} a).$$


by the fundamental rule $(AB)^* = B^* A^*$, which shows that

$$(E^{-1})^* = (E^*)^{-1}. \tag{14}$$

The calculation of $E^*$ is exactly the same as for $E_0^*$; the result is

$$E^* = M E^{-1}, \tag{15}$$

where $M \in \mathbb{R}^{m \times m}$ is defined by

$$M_{ij} = (\phi_i, \phi_j)_{L^2(\Omega)}.$$

Together, (14) and (15) yield

$$(E^{-1})^* = E M^{-1}.$$

The matrix $M$ is symmetric and positive definite and hence invertible; this follows from the fact that $\{\phi_1, \phi_2, \ldots, \phi_m\}$ is linearly independent. Using the expressions for $E_0^*$ and $(E^{-1})^*$, (13) yields

$$Df(a)^* = E M^{-1}\,DF(E^{-1}a)^*\,M_0\,E_0^{-1}.$$

The appearance of the trivial mappings $E$ and $E_0^{-1}$ in this formula is no more significant than it was in the formula for $Df(a)$. On the other hand, the Gram matrices $M^{-1}$ and $M_0$ appear because of the different inner products used for the two pairs of isomorphic spaces.

    9. Avoiding the need to program derivatives

    The user of a software package implementing numerical optimization algorithms is required

    to provide some computer code (usually a subroutine written in a given language) to evaluate

    the objective and constraint functions. (This is how the user specifies his or her problem

    to the optimization code.) Typically, the optimization code will need values of various

    derivatives, which can be obtained in several ways:

    1. The user can provide hand-written computer codes to evaluate the derivatives.

2. The optimization code can estimate the derivatives using finite differences.

    3. The derivatives can be produced, either by the user or the optimization code, using

    automatic differentiation.

    The emphasis of my presentation so far has been on understanding the basic theory of

    derivatives, particularly the linear algebraic foundations, and on using this theory to derive

    formulas for derivatives of specific functions. Such understanding is essential for hand-

    coding derivatives.

Suppose, though, that a user wishes to avoid the labor (and risk²) of programming the derivatives of the problem functions. In this final section, I will briefly discuss the advantages


    and disadvantages of the other two approaches to the computation of derivatives, finite

    differences and automatic differentiation.

    9.1. Finite difference estimation of derivatives

    Optimization codes generally use the representers of the relevant derivatives: the gradient

    and Hessian of a real-valued function, the Jacobian matrix of a vector-valued function. In

    order to be concise, I will mostly limit my discussion to the computation of the gradient of

    a real-valued function.

Suppose³ f: R^n → R. Then ∇f(x) is the vector in R^n whose i-th component is

∂f/∂x_i (x) = lim_{h→0} [f(x + he_i) − f(x)]/h, (16)

where e_i is the i-th standard basis vector (that is, the vector with every component equal to zero, except the i-th, which is one). When the only information available about f is a black box that will return its value for a given x, it is not possible to implement (16) exactly, as the limit operation implies an infinite calculation.

A natural way of approximating ∂f/∂x_i is to simply truncate the limit operation by choosing a small but nonzero value of h:

∂f/∂x_i (x) ≈ [f(x + he_i) − f(x)]/h. (17)

Indeed, Taylor's theorem,

f(x + he_i) = f(x) + ∂f/∂x_i (x) h + (1/2) ∂²f/∂x_i² (x + θhe_i) h², θ ∈ (0, 1),

can easily be rearranged to show that

∂f/∂x_i (x) = [f(x + he_i) − f(x)]/h − (1/2) ∂²f/∂x_i² (x + θhe_i) h, θ ∈ (0, 1).

Thus, the error in (17) is O(h); this error is referred to as the truncation error. The question now arises: What value of h should be chosen in practice?

At first glance, it would appear that smaller values of h (the smaller, the better) would tend to lead to better approximations of the partial derivative. Though this is true in exact arithmetic, it does not take into account the effects of floating point (computer) arithmetic. First of all, h cannot be chosen too small in comparison to x_i; otherwise, the values of x_i and x_i + h, rounded to the nearest floating point number, will be identical (and therefore, necessarily, so will be f(x + he_i) and f(x)). More subtly, the magnitude of ∂²f/∂x_i² plays a part. A computer subroutine implementing the evaluation of f will inevitably return inexact results, because of round-off error if for no other reason. Suppose the implemented function


actually returns

f̃(x) = f(x) + ε(x),

with

|ε(x)| ≤ ε̄

for all relevant values of x. Then formula (17) will be implemented as

∂f/∂x_i (x) ≈ [f̃(x + he_i) − f̃(x)]/h
            = [f(x + he_i) + ε(x + he_i) − f(x) − ε(x)]/h
            = [f(x + he_i) − f(x)]/h + [ε(x + he_i) − ε(x)]/h
            = ∂f/∂x_i (x) + (1/2) ∂²f/∂x_i² (x + θhe_i) h + [ε(x + he_i) − ε(x)]/h.

There is no reason to expect that the function ε is differentiable, so all that can be said about the last term in the above expression is that

|[ε(x + he_i) − ε(x)]/h| ≤ 2ε̄/h.

If the second partial derivative of f is bounded by M, then

|∂f/∂x_i (x) − [f̃(x + he_i) − f̃(x)]/h| ≤ Mh/2 + 2ε̄/h. (18)

This bound suggests that the total error in the approximation can grow as h → 0, since the round-off error (or at least the bound for it) grows as h is decreased.

Thus smaller values of h are not necessarily better in practice, and so the question remains: How should h be chosen in practice? One idea would be to choose h to minimize the bound in (18); setting the derivative of Mh/2 + 2ε̄/h to zero gives M/2 − 2ε̄/h² = 0, which leads to the value

h = 2√(ε̄/M).

However, this result is of limited use, since the value M is not available in general. It does suggest, however, that

h = O(√ε̄)


is reasonable, and an estimate of ε̄ might be available. The usual choice is

h = sign(x_i) √ε̄ |x_i|, (19)

with some adjustment made if |x_i| is too close to zero. With h determined by some variation on (19), the error in the computed partial derivative is O(√ε̄). This leads to the first disadvantage of using finite difference estimates for partial derivatives, and thus for gradients: the attainable accuracy in the computed minimizer is limited. After all, algorithms for numerical optimization are based on the necessary condition that

∇f(x) = 0

at a minimizer (or the analogous Lagrange multiplier conditions, which also involve ∇f(x), for a constrained minimization problem). It is easy to see that the minimizer cannot be reliably computed to an accuracy greater than the accuracy with which the gradient is computed.

The foregoing disadvantage of finite differences is only important for small problems, when it is reasonable (and may be important) to compute the solution to a high degree of accuracy. A more serious objection is related to the computational cost of using finite differences: to estimate ∇f(x) costs n evaluations of the function f (assuming that f(x) must be computed anyway in the course of the optimization algorithm). For any problem in which it is expensive to evaluate f, or n is large, or both, this cost may be unacceptable. By comparison, the examples given in Sections 7 and 8 yield formulas that will result in the computation of the gradient at a cost equal to a small multiple of the cost of computing the function itself. (Note, though, in the case of the adjoint state method detailed in Section 7, this efficiency depends on the use of the checkpointing scheme that was only briefly mentioned.)

The major advantage of using finite differences is obvious: the user need only implement the problem functions and not their derivatives. The optimization code can then take care of all details concerning the estimation of derivatives, including the choice of the step size h (although the user may need to provide an estimate of ε̄). When the cost is affordable and there is no need for high accuracy solutions, this makes finite differences an attractive option. Although I will not discuss it here, finite difference methods can also be devised for computing Jacobian and Hessian matrices.
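Formulas (17) and (19) translate directly into code. The following subroutine is a minimal sketch in the style of the Fortran examples below; the name fdgrad, the assumed user-function interface call fcn(n, x, fx), and the guard max(|x_i|, 1) for small |x_i| are illustrative choices, not taken from any particular optimization package:

      subroutine fdgrad(n, x, fx, fcn, epsbar, grad)
c     Estimate the gradient of f at x by the forward difference (17),
c     with the step size h chosen as in (19).  On entry, fx holds the
c     already-computed value f(x), so only n additional evaluations
c     of the user subroutine fcn are performed.
      implicit none
      integer n, i
      double precision x(n), fx, epsbar, grad(n)
      double precision h, xi, fxi
      external fcn
      do 10 i = 1, n
         xi = x(i)
c        step size (19), guarded when |x(i)| is close to zero
         h = sqrt(epsbar)*max(abs(xi), 1.0d0)
         if (xi .lt. 0.0d0) h = -h
         x(i) = xi + h
c        recompute h so that it equals the step actually taken
         h = x(i) - xi
         call fcn(n, x, fxi)
         grad(i) = (fxi - fx)/h
         x(i) = xi
   10 continue
      return
      end

Recomputing h as x(i) − xi after the addition is a standard precaution: the divisor is then exactly the step taken in floating point arithmetic.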

    9.2. Automatic differentiation

Automatic, or algorithmic, differentiation (AD) is a term applied to a collection of techniques for automatically producing derivatives of functions implemented in computer subroutines.

    AD tools can analyze a computer program that implements a mathematical function or oper-

    ator, and systematically apply the rules of differentiation, notably the chain rule, producing

    a new computer program implementing the desired derivative.

    There are two primary approaches to automatic differentiation, operator overloading and

    source transformation. I will only discuss the source transformation approach, and, indeed,

    will focus on a single AD tool, TAMC (Giering, 1999). Another source transformation


    tool is ADIFOR (Bischof et al., 1992). An AD package that uses the operator overloading

    approach is ADOL-C (Griewank et al., 1996). For a more complete discussion of automatic

    differentiation, see Griewank (2000). The following discussion is taken from the report

    (Gockenbach, 2000), which may be consulted for more details.

The Tangent linear and Adjoint Model Compiler (TAMC), designed and implemented by Ralf Giering (1999), is an Automatic Differentiation (AD) package that produces linearized and adjoint code for nonlinear operators. To be more precise, given a Fortran subroutine implementing an operator of the form F: R^n → R^m, TAMC can produce code that computes DF(x)δx and DF(x)*δy. TAMC can also produce derivatives and adjoints for operators defined on product spaces, such as G: R^n × R^k → R^m.

Although TAMC produces correct and efficient code, the exact operation of TAMC-generated code can be slightly counter-intuitive to those not well-versed in AD. I will present an explicit mathematical model for an operator as implemented by a computer program, and explain the output of TAMC in terms of this model.

The following simple example will serve to introduce some of the issues encountered in using TAMC. Define F: R → R by y = F(x) = x². This operator is implemented in the following Fortran subroutine:

    subroutine F(x,y)

    double precision x,y

y = x*x
return

    end

    TAMC generates the following adjoint code (stripped of TAMC-generated comments):

    subroutine adF(x,adx,ady)

    implicit none

    double precision adx

    double precision ady

    double precision x

adx = adx+2*ady*x
ady = 0.d0

    end

This code correctly computes the adjoint of DF(x); however, the value DF(x)*δy is added to (rather than assigned to) the output variable. Moreover, the input variable is assigned the value of zero after it is used. That is, instead of implementing

δx ← DF(x)*δy,

the TAMC-generated code implements

δx ← δx + DF(x)*δy,
δy ← 0.

    Below I will show how this result could have been predicted.

9.2.1. The mathematical structure of a subroutine implementing an operator. Consider an operator F: R^n → R^m. A Fortran subroutine implementing y ← F(x) would have arguments x, y, as well as (possibly) other arguments involved in the definition of the operator (grid parameters, constants, etc.). The subroutine (which may call other subroutines) consists of a sequence of statements which together perform the desired calculation. A number of variables are involved: x and y, any variables required to hold intermediate quantities, loop control indices, etc. Some of the variables merely serve to control the flow of the executable statements, and are not important in developing a mathematical model of the subroutine. The crucial variables are the active variables, which are necessarily of floating point type. A variable u is active if

• the final value of an output variable (i.e., one of the components of y) depends on the value of u at some step i; and
• the value of u at step i depends on the initial value of an input variable (one of the components of x).

The input and output variables are active by definition. The phrase "variable w at step j depends on the value of variable z at step i" will be left undefined. The intuitive meaning is that the value of w at step j is linked to the value of z by a sequence of assignment statements, the last of which assigns a value to w, and the first of which (and perhaps others) has z, holding its value from step i, in the right-hand side. (A precise definition of this concept would be required to implement a package such as TAMC, but is not needed to understand how it works.)

Now let S be the set of all active variables appearing in the subroutine (or in subroutines called by it; I will not bother to make this distinction). Identifying S with a Euclidean space R^N, F: R^n → R^m can be viewed as the composition of operators

F = P ∘ F_M ∘ F_{M−1} ∘ ··· ∘ F_1 ∘ Q,

where

F_i: S → S, i = 1, 2, ..., M,

and Q and P are the natural projections onto the domain and range, respectively, of F. (That is, P: S → R^m is defined by

(Ps)_i = s_{j_i}, i = 1, 2, ..., m,


where y_i = s_{j_i} (recall that each active variable, including every component of y, is identified with a component of s). The mapping Q: R^n → S is defined similarly.) Each statement assigning a value to an active variable can be thought of as implicitly defining one of the operators F_i, and it is in the sense of these operators that I spoke of "steps" in the previous paragraph. Of course, most steps will only involve a few variables, so most of the active variables will retain their previous values. The role of the mappings Q and P should be clear: Q assigns to the input variables their initial values and assigns zero to all other variables; P extracts from the set of all active variables the output F(x).
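It is worth recording explicitly what this decomposition buys (the intermediate states s_0 = Qx, s_i = F_i(s_{i−1}) are my notation). Since P and Q are linear, the chain rule and the rule (AB)* = B*A* give

DF(x) = P DF_M(s_{M−1}) ··· DF_1(s_0) Q,
DF(x)* = Q* DF_1(s_0)* ··· DF_M(s_{M−1})* P*.

The adjoint factors appear in reverse order, which is why TAMC-generated adjoint code, such as the examples below, works through the derivative statements backwards.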

There may be other floating point variables that are not active by this definition. For example, if the final value of an output variable depends on the value of z at some step, but that value of z does not depend on the initial value of any input variable, then z plays the role of a constant. (Input arguments to the subroutine other than x can be constants.) It is also possible to have variables which depend on the input variables, but do not influence any output variable. Such a variable can be called diagnostic. For an example of a diagnostic variable, consider the variable z in the following program fragment:

z = x(1)*x(1)+x(2)*x(2)
if (z.gt.1.0d0) then
y(1) = 2.0d0*x(1)
else
y(1) = x(1)*x(1)
endif

    Constant and diagnostic variables are called passive variables.

By way of example, consider an assignment statement of the form

w = g(u).

Assuming u and w are active variables, say u = s_{i_1}, w = s_{i_2} (s ∈ S), this implicitly defines the operator F_k: S → S,

(F_k(s))_j = g(s_{i_1}), j = i_2,
(F_k(s))_j = s_j, j ≠ i_2.

It is instructive to compute DF_k(s) and (DF_k(s))*. The derivative is given by

(DF_k(s)δs)_j = g′(s_{i_1}) δs_{i_1}, j = i_2,
(DF_k(s)δs)_j = δs_j, j ≠ i_2.

Therefore,

(DF_k(s)δs, r) = ∑_{j≠i_2} δs_j r_j + g′(s_{i_1}) δs_{i_1} r_{i_2}
               = ∑_{j≠i_1,i_2} δs_j r_j + δs_{i_1}(r_{i_1} + g′(s_{i_1}) r_{i_2}) + δs_{i_2} · 0,

from which the adjoint can be read off:

((DF_k(s))*r)_j = r_j, j ≠ i_1, i_2,
((DF_k(s))*r)_{i_1} = r_{i_1} + g′(s_{i_1}) r_{i_2},
((DF_k(s))*r)_{i_2} = 0.


Returning to the first example, y = F(x) = x², the active variables are x and y, so S = R², and the single assignment defines

F_1(x, y) = (x, x²), P(x, y) = y.

The interpretation adopted by TAMC is that the subroutine implements the operator

G_1(x, y) = (x, x²).

Note that

DG_1(x, y)(δx, δy) = (δx, 2x δx),
(DG_1(x, y))*(δx, δy) = (δx + 2x δy, 0).

The derivative of F is given by

DF(x)δx = 2x δx.

    TAMC produces the following subroutine for the derivative:

    subroutine g_f(x,g_x,g_y)

    implicit none

    double precision g_x

    double precision g_y

    double precision x

g_y = 2*g_x*x

    end

This subroutine performs the computation

DG_1(x, y)(δx, δy) = (δx, 2x δx);

that is, it performs the operation

δx ← δx (implicitly),
δy ← 2x δx.

In this case, the subroutine can also be regarded as performing the desired operation

δy ← DF(x)δx.

The adjoint of DF(x) is given by

DF(x)*δy = 2x δy.


    TAMC produces the following adjoint code:

    subroutine adf(x,adx,ady)

    implicit none

    double precision adx

    double precision ady

    double precision x

adx = adx+2*ady*x
ady = 0.d0

    end

This subroutine performs the operation

(δx, δy) ← (DG_1(x, y))*(δx, δy),

as would be predicted by my discussion above.

In terms of the original operator F, the Fortran command

call adf(x,dx,dy)

does the following:

• adds the value DF(x)*δy to dx, assuming that x and dy have been initialized to hold the values of x and δy, respectively;
• sets dy to zero.

To use adf in the desired manner (given x, δy, compute δx = DF(x)*δy), one would have to perform the following steps (a driver sketch follows the list):

1. initialize x and dy to the values of x and δy, respectively;
2. set dx to zero;
3. save dy (assuming that its value is wanted later);
4. call adf(x,dx,dy);
5. restore the value of dy.
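The following program is a minimal sketch of these five steps for the scalar example above, assuming the TAMC-generated adf is compiled alongside; the driver name, variable names, and the test value x = 3 are my own choices:

      program addriver
      implicit none
      double precision x, dx, dy, dysave
c     steps 1 and 2: initialize the inputs and clear the output, so
c     that the add-to statement in adf acts as a plain assignment
      x = 3.0d0
      dy = 1.0d0
      dx = 0.0d0
c     step 3: save dy, since adf overwrites it with zero
      dysave = dy
c     step 4: on return, dx holds DF(x)*dy = 2*x*dy
      call adf(x,dx,dy)
c     step 5: restore the value of dy
      dy = dysave
      write(*,*) 'DF(x)* applied to dy:', dx
      end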

Alternatively, one could hand-edit the routine adf (the result of such an edit is sketched below) so as to

1. replace the "add-to" statements with simple assignments;
2. remove the statements that change the input variable (dy in this example).
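For the example above, such an edit would produce something like the following (my hand-edited version, not TAMC output); it assigns δx ← DF(x)*δy directly and leaves ady untouched:

      subroutine adf(x,adx,ady)
      implicit none
      double precision adx
      double precision ady
      double precision x
c     plain assignment: adx = DF(x)* ady = 2*x*ady
      adx = 2*ady*x
      end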

    As a second example, suppose the operator

F(x, y) = x²y


is implemented so that the result F(x, y) overwrites one of the inputs, say (arbitrarily) y.

    This is done in the following subroutine:

    subroutine F(x, y)

    double precision x,y,w

w = x*x
y = w*y

    return

    end

The active variables are now x, y, w, and F can be written as F = P ∘ F_2 ∘ F_1 ∘ Q, where

F_1(x, y, w) = (x, y, x²),
F_2(x, y, w) = (x, wy, w).

TAMC regards the subroutine as implementing G_2 ∘ G_1, where, again, G_1 = F_1 and G_2 = F_2. Now,

DG_1(x, y, w)(δx, δy, δw) = (δx, δy, 2x δx),
DG_2(x, y, w)(δx, δy, δw) = (δx, w δy + y δw, δw),

and

(DG_1(x, y, w))*(δx, δy, δw) = (δx + 2x δw, δy, 0),
(DG_2(x, y, w))*(δx, δy, δw) = (δx, w δy, δw + y δy).

    The TAMC-generated code for the derivative is:

    subroutine g_f(x,y,g_x,g_y)

    implicit none

    double precision g_x

    double precision g_y

    double precision x

    double precision y

    double precision g_w

    double precision w

g_w = 2*g_x*x
w = x*x


g_y = g_w*y+g_y*w
end

The first executable statement computes the action of DG_1(x, y, w), the second computes G_1(x, y, w), and the third computes the action of DG_2(G_1(x, y, w)). The behavior of this subroutine is exactly as expected, and as desired; the Fortran command

call g_f(x,y,dx,dy)

overwrites dy with DF(x, y)(δx, δy), assuming that x, y, dx, and dy have previously been initialized with the values x, y, δx, and δy, respectively.

    TAMC generates the following code for the adjoint:

    subroutine adf(x,y,adx,ady)

    implicit none

    double precision adx

    double precision ady

    double precision x

    double precision y

    double precision adw

    double precision w

    adw = 0.d0

w = x*x
adw = adw+ady*y
ady = ady*w
adx = adx+2*adw*x
adw = 0.d0

    end

This subroutine initializes the local variable adw to zero (the first executable statement), computes G_1(x, y, w) (the second statement), applies (DG_2(G_1(x, y, w)))* (statements three and four), and applies (DG_1(x, y, w))* (statements five and six). This behavior is consistent with my description above; note that its effect is the following:

δy ← x² δy,
δx ← δx + 2xy δy.

The desired behavior of the subroutine is

δy ← x² δy,
δx ← 2xy δy.
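Thus, for this example, only adx needs to be cleared before the call; ady is a genuine output (it receives x²δy) and should not be zeroed or saved. A minimal usage sketch, assuming the adf just shown (variable names and test values are mine):

      program addriver2
      implicit none
      double precision x, y, adx, ady
      x = 2.0d0
      y = 3.0d0
c     ady holds the given adjoint input; clearing adx makes the
c     add-to statement in adf act as a plain assignment
      ady = 1.0d0
      adx = 0.0d0
      call adf(x,y,adx,ady)
c     on return, adx = 2*x*y and ady = x**2 (times the input ady),
c     the two components of the desired adjoint DF(x,y)*
      write(*,*) adx, ady
      end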


The apparent advantages of AD are:

• The user avoids the labor-intensive and error-prone task of implementing derivatives by hand.
• If it is necessary to modify the original function, its derivatives can be modified automatically by the AD tool, again saving time spent writing and debugging code.
• Modern AD tools can handle very complex code and produce, in many cases, efficient derivative code.

The disadvantages are not quite so obvious, and stem from the fact that the foregoing advantages are not quite realized:

• My discussion of the code generated by TAMC shows that, in fact, the code produced by a fully automatic AD tool may need some modification by hand to achieve the desired results as efficiently as possible.
• Code produced by a fully automatic AD tool can be inefficient for certain applications. An example of this is provided by the adjoint state calculation of Section 7; an AD tool would tend to either save or re-compute all of the intermediate computations needed for the reverse time-stepping calculation. Either approach is significantly inefficient in some applications.

In summary, my view is that automatic differentiation is very useful in a variety of situations, but it must be used with care if efficiency is a prime consideration. In particular, fully automatic differentiation may not be satisfactory for some applications.⁴

    Notes

1. I am far from the first to notice this. See, for example, Groetsch (1980):

   A closely guarded secret in some elementary calculus courses is the fact that the basic idea of differential calculus is the local approximation of a nonlinear function by a linear function. To quote from Dieudonné (1969), "In the classical teaching of calculus, this idea is immediately obscured by the accidental fact that, on a one-dimensional vector space, there is a one-to-one correspondence between linear forms and numbers, and therefore the derivative at a point is defined to be a number instead of a linear form."

   The texts of Dieudonné and Groetsch are general references for the material in this paper; see also the survey paper by Tapia (1971).

2. Probably the most common difficulty encountered when using packaged optimization software results from the user's providing incorrect derivatives.

3. It is possible to compute gradients without referring to coordinates explicitly, for instance in the formula ∇J(x) = DF(x)*(F(x) − d) for J(x) = (1/2)‖F(x) − d‖². However, finite difference derivatives depend on an explicit coordinate representation, and so I may as well assume that the function is defined on R^n.

4. See Griewank (2000), page 92:

   As a rule, a general-purpose AD tool will not produce transformed code as efficient as that produced by a special-purpose translator designed to work only with underlying code of a particular structure, since the latter can make assumptions (often with far-reaching consequences), whereas the former can only guess.


   A tool that requires the user to rewrite the underlying code in order to make explicit such factors as variable dependence, structural sparsity, interface width, and memory access patterns will be able to produce efficient transformed code more easily than a tool that must use internal analysis of the underlying program to deduce these structural features, but that does not require such a great effort from the user. On the other hand, the first sort of tool is difficult to apply to legacy code. A possible compromise is a tool that allows, but does not require, the user to insert directives into the program text.

   The latest version of TAMC does allow user-defined directives, making efficient code more attainable.

    References

C. Bischof, A. Carle, G. Corliss, A. Griewank, and P. Hovland, "ADIFOR: Generating derivative code from Fortran programs," Scientific Programming, Vol. 1, pp. 1–29, 1992.

J. Dieudonné, Foundations of Modern Analysis, Academic Press: New York, London, 1969.

Ralf Giering, Tangent Linear and Adjoint Model Compiler, User Manual 1.4, 1999. URL: http://puddle.mit.edu/ralf/tamc

M. S. Gockenbach, "Understanding code generated by TAMC," Department of Computational and Applied Mathematics, Rice University, Houston, TX, Technical Report TR00-30, 2000.

M. S. Gockenbach, D. R. Reynolds, and W. W. Symes, "Efficient and automatic implementation of the adjoint state method," in press.

A. Griewank, "Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation," Optimization Methods and Software, Vol. 1, pp. 35–54, 1992.

A. Griewank, D. Juedes, and J. Utke, "ADOL-C, a package for the automatic differentiation of algorithms written in C/C++," ACM TOMS, Vol. 22, pp. 131–167, 1996.

Andreas Griewank, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, SIAM: Philadelphia, 2000.

Charles W. Groetsch, Elements of Applicable Functional Analysis, Marcel Dekker: New York, 1980.

W. W. Symes, J. O. Blanch, and R. Versteeg, "A numerical study of linear inversion in layered viscoacoustic media," in Comparison of Seismic Inversion Methods on a Single Real Dataset, R. Keys and D. Foster, eds., Society of Exploration Geophysicists: Tulsa, 1998.

R. Tapia, "The differentiation and integration of nonlinear operators," in Nonlinear Functional Analysis and Applications, L. Rall, ed., Academic Press: New York, 1971.