Theory and Algorithms for Forecasting Non-Stationary Time Series


  • Theory and Algorithms for Forecasting Non-Stationary Time Series

    MEHRYAR MOHRI, Courant Institute & Google Research

    VITALY KUZNETSOV, Google Research

  • Motivation

    Time series prediction:

    stock values, economic variables, weather (e.g., local and global temperature), sensors (Internet-of-Things), earthquakes, energy demand, signal processing, sales forecasting.

  • Google Trends

    [Figures: Google Trends examples.]

  • Challenges

    Standard supervised learning:

    IID assumption. Same distribution for training and test data. Distributions fixed over time (stationarity).

    None of these assumptions holds for time series!

  • Outline

    Introduction to time series analysis.

    Learning theory for forecasting non-stationary time series.

    Algorithms for forecasting non-stationary time series.

    Time series prediction and on-line learning.

  • Introduction to Time Series Analysis

  • Classical Framework

    Postulate a particular form of a parametric model that is assumed to generate the data.

    Use the given sample to estimate the unknown parameters of the model.

    Use the estimated model to make predictions.

  • Autoregressive (AR) Models

    Definition: the AR(p) model is a linear generative model based on the p-th order Markov assumption:

    $$\forall t,\quad Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t,$$

    where

    the $\epsilon_t$s are zero-mean uncorrelated random variables with variance $\sigma^2$;
    $a_1, \ldots, a_p$ are the autoregressive coefficients;
    $Y_t$ is the observed stochastic process.
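    To make the definition concrete, the following is a minimal numpy sketch that simulates an AR(p) process; the coefficient values, Gaussian noise, and burn-in length are illustrative choices, not part of the slides.

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_ar(coeffs, T, burn_in=100):
        # Simulate Y_t = sum_i a_i * Y_{t-i} + eps_t with i.i.d. standard normal noise.
        a = np.asarray(coeffs, dtype=float)
        p = len(a)
        y = np.zeros(T + burn_in)
        for t in range(p, T + burn_in):
            y[t] = a @ y[t - p:t][::-1] + rng.standard_normal()
        return y[burn_in:]

    # Example: a stationary AR(2) process.
    y = simulate_ar([0.6, -0.2], T=500)
    print(y.mean(), y.var())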

  • Moving Averages (MA)

    Definition: the MA(q) model is a linear generative model for the noise term based on the q-th order Markov assumption:

    $$\forall t,\quad Y_t = \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j},$$

    where $b_1, \ldots, b_q$ are the moving average coefficients.

  • ARMA Model

    (Whittle, 1951; Box & Jenkins, 1971)

    Definition: the ARMA(p, q) model is a generative linear model that combines the AR(p) and MA(q) models:

    $$\forall t,\quad Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}.$$

  • ARMA

    [Figure: ARMA sample path illustration.]

  • Stationarity

    Definition: a sequence of random variables $\mathbf{Z} = \{Z_t\}_{-\infty}^{+\infty}$ is stationary if its distribution is invariant to shifting in time: $(Z_t, \ldots, Z_{t+m})$ and $(Z_{t+k}, \ldots, Z_{t+m+k})$ have the same distribution.

  • Weak Stationarity

    Definition: a sequence of random variables $\mathbf{Z} = \{Z_t\}_{-\infty}^{+\infty}$ is weakly stationary if its first and second moments are invariant to shifting in time, that is,

    $\mathbb{E}[Z_t]$ is independent of $t$, and $\mathbb{E}[Z_t Z_{t-j}] = f(j)$ for some function $f$.

  • Lag Operator

    The lag operator $L$ is defined by $L Y_t = Y_{t-1}$.

    ARMA model in terms of the lag operator:

    $$\Big(1 - \sum_{i=1}^{p} a_i L^i\Big) Y_t = \Big(1 + \sum_{j=1}^{q} b_j L^j\Big)\epsilon_t.$$

    The characteristic polynomial

    $$P(z) = 1 - \sum_{i=1}^{p} a_i z^i$$

    can be used to study properties of this stochastic process.

  • Weak Stationarity of ARMA

    Theorem: an ARMA(p, q) process is weakly stationary if the roots of the characteristic polynomial $P(z)$ are outside the unit circle.
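    A minimal numpy sketch of this criterion: it forms the characteristic polynomial from given autoregressive coefficients and checks whether all of its roots lie outside the unit circle (the function name and tolerance are illustrative).

    import numpy as np

    def is_weakly_stationary(ar_coeffs, tol=1e-10):
        # Roots of P(z) = 1 - a_1 z - ... - a_p z^p must lie outside the unit circle.
        a = np.asarray(ar_coeffs, dtype=float)
        # np.roots expects coefficients ordered from the highest degree to the constant term.
        poly = np.concatenate([-a[::-1], [1.0]])
        roots = np.roots(poly)
        return bool(np.all(np.abs(roots) > 1.0 + tol))

    print(is_weakly_stationary([0.5]))  # True: AR(1) with a_1 = 0.5
    print(is_weakly_stationary([1.0]))  # False: unit root (random walk)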

  • Proof

    If the roots $\lambda_i$ of the characteristic polynomial are outside the unit circle, then

    $$P(z) = 1 - \sum_{i=1}^{p} a_i z^i = c(\lambda_1 - z)\cdots(\lambda_p - z) = c'(1 - \lambda_1^{-1} z)\cdots(1 - \lambda_p^{-1} z),$$

    where $|\lambda_i| > 1$ for all $i = 1, \ldots, p$ and $c, c'$ are constants.

  • Proof

    Therefore, the ARMA(p, q) process

    $$\Big(1 - \sum_{i=1}^{p} a_i L^i\Big) Y_t = \Big(1 + \sum_{j=1}^{q} b_j L^j\Big)\epsilon_t$$

    admits an MA($\infty$) representation:

    $$Y_t = \big(1 - \lambda_1^{-1} L\big)^{-1}\cdots\big(1 - \lambda_p^{-1} L\big)^{-1}\Big(1 + \sum_{j=1}^{q} b_j L^j\Big)\epsilon_t,$$

    where $\big(1 - \lambda_i^{-1} L\big)^{-1} = \sum_{k=0}^{\infty}\big(\lambda_i^{-1} L\big)^k$ is well-defined since $|\lambda_i^{-1}| < 1$.

  • Proof

    Therefore, it suffices to show that

    $$Y_t = \sum_{j=0}^{\infty} \psi_j \epsilon_{t-j}$$

    is weakly stationary.

    The mean is constant:

    $$\mathbb{E}[Y_t] = \sum_{j=0}^{\infty} \psi_j\, \mathbb{E}[\epsilon_{t-j}] = 0.$$

    The covariance function $\mathbb{E}[Y_t Y_{t-l}]$ only depends on the lag $l$:

    $$\mathbb{E}[Y_t Y_{t-l}] = \sum_{k=0}^{\infty}\sum_{j=0}^{\infty} \psi_k \psi_j\, \mathbb{E}[\epsilon_{t-j}\epsilon_{t-l-k}] = \sigma^2\sum_{j=0}^{\infty} \psi_j \psi_{j+l}.$$

  • ARIMA

    Non-stationary processes can be modeled using processes whose characteristic polynomial has unit roots.

    A characteristic polynomial with unit roots can be factored as

    $$P(z) = R(z)(1 - z)^D,$$

    where $R(z)$ has no unit roots.

    Definition: the ARIMA(p, D, q) model is an ARMA(p, q) model for $(1 - L)^D Y_t$:

    $$\Big(1 - \sum_{i=1}^{p} a_i L^i\Big)(1 - L)^D Y_t = \Big(1 + \sum_{j=1}^{q} b_j L^j\Big)\epsilon_t.$$
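    As a small illustration of the differencing step $(1 - L)^D$ that reduces an ARIMA model to an ARMA model, here is a minimal numpy sketch; the random-walk example is illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    # A random walk Y_t = Y_{t-1} + eps_t is non-stationary: its characteristic
    # polynomial P(z) = 1 - z has a unit root.
    eps = rng.standard_normal(1000)
    y = np.cumsum(eps)

    # Applying the difference operator (1 - L) once (D = 1) recovers the noise,
    # so Y_t follows an ARIMA(0, 1, 0) model.
    dy = np.diff(y)
    print(y.var(), dy.var())  # the differenced series has stable variance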


  • Other Extensions

    Further variants:

    models with seasonal components (SARIMA); models with side information (ARIMAX); models with long memory (ARFIMA); multivariate time series models (VAR); models with time-varying coefficients; other non-linear models.

  • Modeling Variance

    (Engle, 1982; Bollerslev, 1986)

    Definition: the generalized autoregressive conditional heteroscedasticity GARCH(p, q) model is an ARMA(p, q) model for the variance $\sigma_t^2$ of the noise term $\epsilon_t$:

    $$\forall t,\quad \sigma_{t+1}^2 = \omega + \sum_{i=0}^{p-1} \alpha_i \sigma_{t-i}^2 + \sum_{j=0}^{q-1} \beta_j \epsilon_{t-j}^2,$$

    where

    the $\epsilon_t$s are zero-mean Gaussian random variables with variance $\sigma_t^2$, conditioned on $\{Y_{t-1}, Y_{t-2}, \ldots\}$;
    $\omega > 0$ is the mean parameter.
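    A minimal numpy sketch of a GARCH(1,1) process following the recursion above; the parameter values are illustrative and chosen so that alpha + beta < 1.

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_garch_1_1(T, omega=0.1, alpha=0.6, beta=0.3):
        # sigma2[t+1] = omega + alpha * sigma2[t] + beta * eps[t]**2
        sigma2 = np.empty(T)
        eps = np.empty(T)
        sigma2[0] = omega / (1.0 - alpha - beta)  # unconditional variance
        for t in range(T):
            eps[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
            if t + 1 < T:
                sigma2[t + 1] = omega + alpha * sigma2[t] + beta * eps[t] ** 2
        return eps, sigma2

    eps, sigma2 = simulate_garch_1_1(5000)
    print(eps.var(), sigma2.mean())  # both close to omega / (1 - alpha - beta)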


  • State-Space Models

    Continuous state-space version of Hidden Markov Models:

    $$X_{t+1} = B X_t + U_t, \qquad Y_t = A X_t + \epsilon_t,$$

    where

    $X_t$ is an $n$-dimensional state vector;
    $Y_t$ is the observed stochastic process;
    $A$ and $B$ are model parameters;
    $U_t$ and $\epsilon_t$ are noise terms.
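    A minimal numpy sketch simulating the two equations above; the rotation dynamics, noise scales, and observation matrix are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_state_space(T, A, B, state_noise=0.1, obs_noise=0.1):
        # X_{t+1} = B X_t + U_t (state update), Y_t = A X_t + eps_t (observation).
        n = B.shape[0]
        x = np.zeros(n)
        ys = []
        for _ in range(T):
            ys.append(A @ x + obs_noise * rng.standard_normal(A.shape[0]))
            x = B @ x + state_noise * rng.standard_normal(n)
        return np.array(ys)

    # Example: a 2-dimensional rotating state observed through a 1-dimensional output.
    theta = 0.05
    B = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
    A = np.array([[1.0, 0.0]])
    print(simulate_state_space(500, A, B).shape)  # (500, 1)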

  • State-Space Models

    [Figure: state-space model illustration.]

  • Estimation

    Different methods for estimating model parameters:

    Maximum likelihood estimation: requires further parametric assumptions on the noise distribution (e.g., Gaussian).

    Method of moments (Yule-Walker estimator).

    Conditional and unconditional least squares estimation: restricted to certain models.

  • Invertibility of ARMA

    Definition: an ARMA(p, q) process is invertible if the roots of the polynomial

    $$Q(z) = 1 + \sum_{j=1}^{q} b_j z^j$$

    are outside the unit circle.

  • Learning Guarantee

    Theorem: assume that the ARMA(p, q) process $Y_t$ is weakly stationary and invertible. Let $\widehat{a}_T$ denote the least squares estimate of $a = (a_1, \ldots, a_p)$ and assume that $q$ is known. Then, $\|\widehat{a}_T - a\|$ converges in probability to zero.

    Similar results hold for other estimators and other models.

  • Notes

    Many other generative models exist.

    Learning guarantees are asymptotic.

    The model needs to be correctly specified.

    Non-stationarity needs to be modeled explicitly.

  • Theory

  • Time Series Forecasting

    Training data: a finite sample realization of some stochastic process,

    $$(X_1, Y_1), \ldots, (X_T, Y_T) \in \mathcal{Z} = \mathcal{X} \times \mathcal{Y}.$$

    Loss function: $L : \mathcal{H} \times \mathcal{Z} \to [0, 1]$, where $\mathcal{H}$ is a hypothesis set of functions mapping from $\mathcal{X}$ to $\mathcal{Y}$.

    Problem: find $h \in \mathcal{H}$ with small path-dependent expected loss,

    $$\mathcal{L}(h, \mathbf{Z}_1^T) = \mathbb{E}_{Z_{T+1}}\big[L(h, Z_{T+1}) \,\big|\, \mathbf{Z}_1^T\big].$$

  • Standard Assumptions

    Stationarity: $(Z_t, \ldots, Z_{t+m})$ and $(Z_{t+k}, \ldots, Z_{t+m+k})$ have the same distribution.

    Mixing: the dependence between an event $B$ determined by the past up to time $n$ and an event $A$ determined by the future from time $n + k$ decays with $k$.

  • Learning Theory

    Stationary and β-mixing process: generalization bounds.

    PAC-learning preserved in that setting (Vidyasagar, 1997). VC-dimension bounds for binary classification (Yu, 1994). Covering-number bounds for regression (Meir, 2000). Rademacher complexity bounds for general loss functions (MM and Rostamizadeh, 2009). PAC-Bayesian bounds (Alquier et al., 2014).

  • Learning Theory

    Stationarity and mixing: algorithm-dependent bounds.

    AdaBoost (Lozano et al., 2006). General stability bounds (MM and Rostamizadeh, 2010). Regularized ERM (Steinwart and Christmann, 2009). Stable on-line algorithms (Agarwal and Duchi, 2013).

  • Problem

    Stationarity and mixing assumptions:

    often do not hold (think trends or periodic signals). Not testable. Estimating mixing parameters can be hard, even if the general functional form is known. The hypothesis set and the loss function are ignored.

  • Questions

    Is learning with general (non-stationary, non-mixing) stochastic processes possible?

    Can we design algorithms with theoretical guarantees?

    We need a new tool for the analysis.

  • Key Quantity - Fixed h

    Key difference: $\mathcal{L}(h, \mathbf{Z}_1^{t-1})$ at time $t$ versus $\mathcal{L}(h, \mathbf{Z}_1^T)$ at the target time $T + 1$.

    Key average quantity:

    $$\frac{1}{T}\sum_{t=1}^{T}\Big[\mathcal{L}(h, \mathbf{Z}_1^T) - \mathcal{L}(h, \mathbf{Z}_1^{t-1})\Big].$$

  • Discrepancy

    Definition:

    $$\Delta = \sup_{h \in \mathcal{H}}\left[\mathcal{L}(h, \mathbf{Z}_1^T) - \frac{1}{T}\sum_{t=1}^{T}\mathcal{L}(h, \mathbf{Z}_1^{t-1})\right].$$

    Captures the hypothesis set and the loss function. Can be estimated from data, under mild assumptions. $\Delta = 0$ in the IID case or for weakly stationary processes with linear hypotheses and squared loss (K and MM, 2014).

  • Weighted Discrepancy

    Definition: extension to weights $\mathbf{q} = (q_1, \ldots, q_T)$,

    $$\Delta(\mathbf{q}) = \sup_{h \in \mathcal{H}}\left[\mathcal{L}(h, \mathbf{Z}_1^T) - \sum_{t=1}^{T} q_t\, \mathcal{L}(h, \mathbf{Z}_1^{t-1})\right].$$

    Strictly extends the discrepancy definition used for drifting (MM and Muñoz Medina, 2012) or domain adaptation (Mansour, MM, Rostamizadeh 2009; Cortes and MM 2011, 2014), or for binary loss (Devroye et al., 1996; Ben-David et al., 2007).

    Admits upper bounds in terms of relative entropy, or in terms of β-mixing coefficients of asymptotic stationarity for an asymptotically stationary process.

  • Estimation

    Decomposition: $\Delta(\mathbf{q}) \leq \Delta_0(\mathbf{q}) + \Delta_s$, where, using the last $s$ observations,

    $$\Delta(\mathbf{q}) \leq \underbrace{\sup_{h \in \mathcal{H}}\left[\frac{1}{s}\sum_{t=T-s+1}^{T}\mathcal{L}(h, \mathbf{Z}_1^{t-1}) - \sum_{t=1}^{T} q_t\, \mathcal{L}(h, \mathbf{Z}_1^{t-1})\right]}_{\Delta_0(\mathbf{q})} + \underbrace{\sup_{h \in \mathcal{H}}\left[\mathcal{L}(h, \mathbf{Z}_1^T) - \frac{1}{s}\sum_{t=T-s+1}^{T}\mathcal{L}(h, \mathbf{Z}_1^{t-1})\right]}_{\Delta_s}.$$

  • Learning Guarantee

    Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in \mathcal{H}$ and all $\alpha > 0$,

    $$\mathcal{L}(h, \mathbf{Z}_1^T) \leq \sum_{t=1}^{T} q_t L(h, Z_t) + \Delta(\mathbf{q}) + 2\alpha + \|\mathbf{q}\|_2\sqrt{2\log\frac{\mathbb{E}_{\mathbf{z}}[\mathcal{N}_1(\alpha, \mathcal{G}, \mathbf{z})]}{\delta}},$$

    where $\mathcal{G} = \{z \mapsto L(h, z) : h \in \mathcal{H}\}$.

  • Bound with Empirical Discrepancy

    Corollary: for any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in \mathcal{H}$ and all $\alpha > 0$,

    $$\mathcal{L}(h, \mathbf{Z}_1^T) \leq \sum_{t=1}^{T} q_t L(h, Z_t) + \widehat{\Delta}(\mathbf{q}) + \Delta_s + 4\alpha + \big[\|\mathbf{q}\|_2 + \|\mathbf{q} - \mathbf{u}_s\|_2\big]\sqrt{2\log\frac{2\,\mathbb{E}_{\mathbf{z}}[\mathcal{N}_1(\alpha, \mathcal{G}, \mathbf{z})]}{\delta}},$$

    where

    $$\widehat{\Delta}(\mathbf{q}) = \sup_{h \in \mathcal{H}}\left[\frac{1}{s}\sum_{t=T-s+1}^{T} L(h, Z_t) - \sum_{t=1}^{T} q_t L(h, Z_t)\right],$$

    $\mathbf{u}_s$ is the uniform distribution over $[T - s, T]$, and $\mathcal{G} = \{z \mapsto L(h, z) : h \in \mathcal{H}\}$.

  • Weighted Sequential α-Cover

    (Rakhlin et al., 2010; K and MM, 2015)

    Definition: let $\mathbf{z}$ be a $\mathcal{Z}$-valued full binary tree of depth $T$. Then, a set of trees $\mathbf{V}$ is an $l_1$-norm $\mathbf{q}$-weighted $\alpha$-cover of a function class $\mathcal{G}$ on $\mathbf{z}$ if

    $$\forall g \in \mathcal{G},\ \forall \sigma \in \{\pm 1\}^T,\ \exists \mathbf{v} \in \mathbf{V}: \quad \sum_{t=1}^{T} q_t\,\big|v_t(\sigma) - g(z_t(\sigma))\big| \leq \alpha.$$

    [Figure: a full binary tree with nodes labeled $(g(z_1), v_1), (g(z_2), v_2), (g(z_3), v_3), \ldots$ and edges labeled $\pm 1$; a path $\sigma$ selects the entries $v_1 - g(z_1), v_3 - g(z_3), v_6 - g(z_6), v_{12} - g(z_{12})$ compared in the $l_1$ norm.]

  • Sequential Covering Numbers

    Definitions:

    Sequential covering number:

    $$\mathcal{N}_1(\alpha, \mathcal{G}, \mathbf{z}) = \min\{|\mathbf{V}| : \mathbf{V} \text{ is an } l_1\text{-norm } \mathbf{q}\text{-weighted } \alpha\text{-cover of } \mathcal{G}\}.$$

    Expected sequential covering number: $\mathbb{E}_{\mathbf{z} \sim \mathbf{Z}_T}[\mathcal{N}_1(\alpha, \mathcal{G}, \mathbf{z})]$, where $\mathbf{Z}_T$ is a distribution over trees built from the $Z_t$s: $Z_1, Z_1' \sim D_1$; $Z_2, Z_2' \sim D_2(\cdot \mid Z_1)$, $Z_3, Z_3' \sim D_2(\cdot \mid Z_1')$; $Z_4, Z_4' \sim D_3(\cdot \mid Z_1, Z_2)$, $Z_5, Z_5' \sim D_3(\cdot \mid Z_1, Z_2')$; and so on.

  • Proof

    Key quantities:

    $$\Delta(\mathbf{q}) = \sup_{h \in \mathcal{H}}\Big[\mathcal{L}(h, \mathbf{Z}_1^T) - \sum_{t=1}^{T} q_t\, \mathcal{L}(h, \mathbf{Z}_1^{t-1})\Big], \qquad \Phi(\mathbf{Z}_1^T) = \sup_{h \in \mathcal{H}}\Big[\mathcal{L}(h, \mathbf{Z}_1^T) - \sum_{t=1}^{T} q_t L(h, Z_t)\Big].$$

    Chernoff technique: for any $t > 0$,

    $$\begin{aligned}
    \mathbb{P}\big(\Phi(\mathbf{Z}_1^T) - \Delta(\mathbf{q}) > \epsilon\big)
    &\leq \mathbb{P}\Big[\sup_{h \in \mathcal{H}}\sum_{t=1}^{T} q_t\big(\mathcal{L}(h, \mathbf{Z}_1^{t-1}) - L(h, Z_t)\big) > \epsilon\Big] && \text{(sub-add. of sup)}\\
    &= \mathbb{P}\Big[\exp\Big(t\sup_{h \in \mathcal{H}}\sum_{t=1}^{T} q_t\big(\mathcal{L}(h, \mathbf{Z}_1^{t-1}) - L(h, Z_t)\big)\Big) > e^{t\epsilon}\Big] && (t > 0)\\
    &\leq e^{-t\epsilon}\,\mathbb{E}\Big[\exp\Big(t\sup_{h \in \mathcal{H}}\sum_{t=1}^{T} q_t\big(\mathcal{L}(h, \mathbf{Z}_1^{t-1}) - L(h, Z_t)\big)\Big)\Big]. && \text{(Markov's ineq.)}
    \end{aligned}$$

  • Symmetrization

    Key tool: a decoupled tangent sequence $\mathbf{Z'}_1^T$ associated to $\mathbf{Z}_1^T$: $Z_t$ and $Z_t'$ are i.i.d. given $\mathbf{Z}_1^{t-1}$.

    $$\begin{aligned}
    \mathbb{P}\big(\Phi(\mathbf{Z}_1^T) - \Delta(\mathbf{q}) > \epsilon\big)
    &\leq e^{-t\epsilon}\,\mathbb{E}\Big[\exp\Big(t\sup_{h \in \mathcal{H}}\sum_{t=1}^{T} q_t\big(\mathcal{L}(h, \mathbf{Z}_1^{t-1}) - L(h, Z_t)\big)\Big)\Big]\\
    &= e^{-t\epsilon}\,\mathbb{E}\Big[\exp\Big(t\sup_{h \in \mathcal{H}}\sum_{t=1}^{T} q_t\big(\mathbb{E}[L(h, Z_t') \mid \mathbf{Z}_1^{t-1}] - L(h, Z_t)\big)\Big)\Big] && \text{(tangent seq.)}\\
    &= e^{-t\epsilon}\,\mathbb{E}\Big[\exp\Big(t\sup_{h \in \mathcal{H}}\mathbb{E}\Big[\sum_{t=1}^{T} q_t\big(L(h, Z_t') - L(h, Z_t)\big) \,\Big|\, \mathbf{Z}_1^T\Big]\Big)\Big] && \text{(lin. of expectation)}\\
    &\leq e^{-t\epsilon}\,\mathbb{E}\Big[\exp\Big(t\sup_{h \in \mathcal{H}}\sum_{t=1}^{T} q_t\big(L(h, Z_t') - L(h, Z_t)\big)\Big)\Big]. && \text{(Jensen's ineq.)}
    \end{aligned}$$

  • Symmetrization

    $$\begin{aligned}
    \mathbb{P}\big(\Phi(\mathbf{Z}_1^T) - \Delta(\mathbf{q}) > \epsilon\big)
    &\leq e^{-t\epsilon}\,\mathbb{E}\Big[\exp\Big(t\sup_{h \in \mathcal{H}}\sum_{t=1}^{T} q_t\big(L(h, Z_t') - L(h, Z_t)\big)\Big)\Big]\\
    &= e^{-t\epsilon}\,\mathbb{E}_{(\mathbf{z},\mathbf{z}')}\mathbb{E}_{\sigma}\Big[\exp\Big(t\sup_{h \in \mathcal{H}}\sum_{t=1}^{T} q_t\sigma_t\big(L(h, z_t'(\sigma)) - L(h, z_t(\sigma))\big)\Big)\Big] && \text{(tangent seq. prop.)}\\
    &\leq e^{-t\epsilon}\,\mathbb{E}_{(\mathbf{z},\mathbf{z}')}\mathbb{E}_{\sigma}\Big[\exp\Big(t\sup_{h \in \mathcal{H}}\sum_{t=1}^{T} q_t\sigma_t L(h, z_t'(\sigma)) + t\sup_{h \in \mathcal{H}}\sum_{t=1}^{T}(-q_t\sigma_t) L(h, z_t(\sigma))\Big)\Big] && \text{(sub-add. of sup)}\\
    &\leq e^{-t\epsilon}\,\mathbb{E}_{(\mathbf{z},\mathbf{z}')}\mathbb{E}_{\sigma}\Big[\tfrac{1}{2}\exp\Big(2t\sup_{h \in \mathcal{H}}\sum_{t=1}^{T} q_t\sigma_t L(h, z_t'(\sigma))\Big) + \tfrac{1}{2}\exp\Big(2t\sup_{h \in \mathcal{H}}\sum_{t=1}^{T}(-q_t\sigma_t) L(h, z_t(\sigma))\Big)\Big] && \text{(convexity of exp)}\\
    &= e^{-t\epsilon}\,\mathbb{E}_{\mathbf{z}}\mathbb{E}_{\sigma}\Big[\exp\Big(2t\sup_{h \in \mathcal{H}}\sum_{t=1}^{T} q_t\sigma_t L(h, z_t(\sigma))\Big)\Big].
    \end{aligned}$$

  • Covering Number

    $$\begin{aligned}
    \mathbb{P}\big(\Phi(\mathbf{Z}_1^T) - \Delta(\mathbf{q}) > \epsilon\big)
    &\leq e^{-t\epsilon}\,\mathbb{E}_{\mathbf{z}}\mathbb{E}_{\sigma}\Big[\exp\Big(2t\sup_{h \in \mathcal{H}}\sum_{t=1}^{T} q_t\sigma_t L(h, z_t(\sigma))\Big)\Big]\\
    &\leq e^{-t\epsilon}\,\mathbb{E}_{\mathbf{z}}\mathbb{E}_{\sigma}\Big[\exp\Big(2t\Big[\max_{\mathbf{v} \in \mathbf{V}}\sum_{t=1}^{T} q_t\sigma_t v_t(\sigma) + \alpha\Big]\Big)\Big] && (\alpha\text{-covering})\\
    &\leq e^{-t(\epsilon - 2\alpha)}\,\mathbb{E}_{\mathbf{z}}\Big[\sum_{\mathbf{v} \in \mathbf{V}}\mathbb{E}_{\sigma}\Big[\exp\Big(2t\sum_{t=1}^{T} q_t\sigma_t v_t(\sigma)\Big)\Big]\Big] && \text{(monotonicity of exp)}\\
    &\leq e^{-t(\epsilon - 2\alpha)}\,\mathbb{E}_{\mathbf{z}}\Big[\sum_{\mathbf{v} \in \mathbf{V}}\exp\Big(\frac{t^2\|\mathbf{q}\|_2^2}{2}\Big)\Big] && \text{(Hoeffding's ineq.)}\\
    &\leq \mathbb{E}_{\mathbf{z}}\big[\mathcal{N}_1(\alpha, \mathcal{G}, \mathbf{z})\big]\exp\Big(-t(\epsilon - 2\alpha) + \frac{t^2\|\mathbf{q}\|_2^2}{2}\Big).
    \end{aligned}$$

  • Algorithms

  • Review

    Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in \mathcal{H}$ and all $\alpha > 0$,

    $$\mathcal{L}(h, \mathbf{Z}_1^T) \leq \sum_{t=1}^{T} q_t L(h, Z_t) + \widehat{\Delta}(\mathbf{q}) + \Delta_s + 4\alpha + \big[\|\mathbf{q}\|_2 + \|\mathbf{q} - \mathbf{u}_s\|_2\big]\sqrt{2\log\frac{2\,\mathbb{E}_{\mathbf{z}}[\mathcal{N}_1(\alpha, \mathcal{G}, \mathbf{z})]}{\delta}}.$$

    This bound can be extended to hold uniformly over $\mathbf{q}$ at the price of the additional term $\widetilde{O}\big(\|\mathbf{q} - \mathbf{u}\|_1\sqrt{\log_2\log_2(1 - \|\mathbf{q} - \mathbf{u}\|)^{-1}}\big)$.

    Data-dependent learning guarantee.

  • Discrepancy-Risk Minimization

    Key idea: directly optimize the upper bound on generalization over $\mathbf{q}$ and $h$.

    This problem can be solved efficiently for some losses $L$ and hypothesis sets $\mathcal{H}$.

  • Kernel-Based Regression

    Squared loss function: $L(y, y') = (y - y')^2$.

    Hypothesis set: for a PDS kernel $K$ with feature map $\Phi_K$,

    $$\mathcal{H} = \big\{x \mapsto \mathbf{w} \cdot \Phi_K(x) : \|\mathbf{w}\|_{\mathbb{H}} \leq \Lambda\big\}.$$

    The complexity term can be bounded by $O\big(\log^{3/2} T\big)\sup_x K(x, x)\,\|\mathbf{q}\|_2$.

  • Instantaneous Discrepancy

    The empirical discrepancy can be further upper bounded in terms of instantaneous discrepancies:

    $$\widehat{\Delta}(\mathbf{q}) \leq \sum_{t=1}^{T} q_t d_t + M\|\mathbf{q} - \mathbf{u}\|_1,$$

    where $M = \sup_{y, y'} L(y, y')$ and

    $$d_t = \sup_{h \in \mathcal{H}}\left(\frac{1}{s}\sum_{\tau=T-s+1}^{T} L(h, Z_\tau) - L(h, Z_t)\right).$$

  • Proof

    By sub-additivity of the supremum,

    $$\begin{aligned}
    \widehat{\Delta}(\mathbf{q}) &= \sup_{h \in \mathcal{H}}\left\{\frac{1}{s}\sum_{\tau=T-s+1}^{T} L(h, Z_\tau) - \sum_{t=1}^{T} q_t L(h, Z_t)\right\}\\
    &= \sup_{h \in \mathcal{H}}\left\{\sum_{t=1}^{T} q_t\left(\frac{1}{s}\sum_{\tau=T-s+1}^{T} L(h, Z_\tau) - L(h, Z_t)\right) + \sum_{t=1}^{T}\Big(\frac{1}{T} - q_t\Big)\frac{1}{s}\sum_{\tau=T-s+1}^{T} L(h, Z_\tau)\right\}\\
    &\leq \sum_{t=1}^{T} q_t\sup_{h \in \mathcal{H}}\left(\frac{1}{s}\sum_{\tau=T-s+1}^{T} L(h, Z_\tau) - L(h, Z_t)\right) + M\|\mathbf{u} - \mathbf{q}\|_1.
    \end{aligned}$$

  • Computing Discrepancies

    Instantaneous discrepancy for kernel-based hypotheses with squared loss:

    $$d_t = \sup_{\|\mathbf{w}'\| \leq \Lambda}\left(\sum_{s=1}^{T} u_s\big(\mathbf{w}' \cdot \Phi_K(x_s) - y_s\big)^2 - \big(\mathbf{w}' \cdot \Phi_K(x_t) - y_t\big)^2\right).$$

    This is a difference of convex (DC) functions.

    Global optimum via DC programming (Tao and Ahn, 1998).
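    For illustration only, the sketch below approximates the instantaneous discrepancies $d_t$ in a linear-kernel setting by maximizing over a random sample of candidate weight vectors $\mathbf{w}'$; the exact supremum requires DC programming as noted above, and all names and parameters here are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def instantaneous_discrepancies(X, y, u, radius=1.0, n_candidates=2000):
        # Crude approximation of d_t = sup_{||w'|| <= radius}
        # ( sum_s u_s (w'.x_s - y_s)^2 - (w'.x_t - y_t)^2 ) for a linear kernel,
        # by maximizing over randomly sampled candidates w' in the ball.
        T, d = X.shape
        W = rng.standard_normal((n_candidates, d))
        W *= radius * rng.uniform(size=(n_candidates, 1)) / np.linalg.norm(W, axis=1, keepdims=True)
        residuals = (W @ X.T - y) ** 2        # squared loss of each candidate at each t
        weighted = residuals @ u              # u-weighted loss of each candidate
        return np.max(weighted[:, None] - residuals, axis=0)

    # Toy usage: weights u supported on the last s points.
    T, s = 200, 20
    X = rng.standard_normal((T, 3))
    y = X @ np.array([1.0, -0.5, 0.2]) + 0.1 * rng.standard_normal(T)
    u = np.zeros(T); u[-s:] = 1.0 / s
    print(instantaneous_discrepancies(X, y, u)[:5])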

  • Discrepancy-Based Forecasting

    Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, for all kernel-based hypotheses $h \in \mathcal{H}$ and all $0 < \|\mathbf{q} - \mathbf{u}\|_1 \leq 1$,

    $$\mathcal{L}(h, \mathbf{Z}_1^T) \leq \sum_{t=1}^{T} q_t L(h, Z_t) + \widehat{\Delta}(\mathbf{q}) + \Delta_s + \widetilde{O}\Big(\log^{3/2} T\,\sup_x K(x, x) + \|\mathbf{q} - \mathbf{u}\|_1\Big).$$

    Corresponding optimization problem:

    $$\min_{\mathbf{q} \in [0,1]^T,\, \mathbf{w}}\left\{\sum_{t=1}^{T} q_t\big(\mathbf{w} \cdot \Phi_K(x_t) - y_t\big)^2 + \lambda_1\sum_{t=1}^{T} q_t d_t + \lambda_2\|\mathbf{w}\|_{\mathbb{H}} + \lambda_3\|\mathbf{q} - \mathbf{u}\|_1\right\}.$$


  • Convex Problem

    Change of variable: $r_t = 1/q_t$.

    Upper bound: $|r_t^{-1} - 1/T| \leq T^{-1}|r_t - T|$.

    $$\min_{\mathbf{r} \in \mathcal{D},\, \mathbf{w}}\left\{\sum_{t=1}^{T}\frac{\big(\mathbf{w} \cdot \Phi_K(x_t) - y_t\big)^2 + \lambda_1 d_t}{r_t} + \lambda_2\|\mathbf{w}\|_{\mathbb{H}} + \lambda_3\sum_{t=1}^{T}|r_t - T|\right\},$$

    where $\mathcal{D} = \{\mathbf{r} : r_t \geq 1\}$. This is a convex optimization problem.

  • Two-Stage Algorithm

    Minimize the empirical discrepancy $\widehat{\Delta}(\mathbf{q})$ over $\mathbf{q}$ (convex optimization).

    Solve the (weighted) kernel ridge regression problem

    $$\min_{\mathbf{w}}\left\{\sum_{t=1}^{T} q_t^*\big(\mathbf{w} \cdot \Phi_K(x_t) - y_t\big)^2 + \lambda\|\mathbf{w}\|_{\mathbb{H}}\right\},$$

    where $\mathbf{q}^*$ is the solution to the discrepancy minimization problem.
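    A minimal sketch of the second stage, assuming a Gaussian kernel and the squared RKHS norm (a common variant of the regularizer above); the stage-one weights $\mathbf{q}^*$ and the regularization parameter are inputs, and all names here are illustrative.

    import numpy as np

    def gaussian_kernel(X1, X2, gamma=1.0):
        # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def weighted_krr_fit(X, y, q, lam=0.1, gamma=1.0):
        # min_w sum_t q_t (w . Phi(x_t) - y_t)^2 + lam ||w||^2, solved in the dual:
        # alpha = (diag(q) K + lam I)^{-1} diag(q) y, with h(x) = sum_t alpha_t K(x_t, x).
        K = gaussian_kernel(X, X, gamma)
        Q = np.diag(q)
        return np.linalg.solve(Q @ K + lam * np.eye(len(y)), Q @ y)

    def weighted_krr_predict(alpha, X_train, X_new, gamma=1.0):
        return gaussian_kernel(X_new, X_train, gamma) @ alpha

    # Toy usage with weights favoring the most recent observations (stand-in for q*).
    rng = np.random.default_rng(0)
    T = 100
    X = rng.standard_normal((T, 2))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(T)
    q = np.linspace(0.5, 1.5, T); q /= q.sum()
    alpha = weighted_krr_fit(X, y, q)
    print(weighted_krr_predict(alpha, X, X[-1:]))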

  • Preliminary Experiments

    Artificial data sets. [Figures describing the synthetic processes.]

  • True vs. Empirical Discrepancies

    [Figures: discrepancies and weights.]

  • Running MSE

    [Figure: running mean squared error.]

  • Real-World Data

    Commodity prices, exchange rates, temperatures & climate.

  • Time Series Prediction & On-line Learning

  • Two Learning Scenarios

    Stochastic scenario: distributional assumption; performance measure: expected loss; guarantees: generalization bounds.

    On-line scenario: no distributional assumption; performance measure: regret; guarantees: regret bounds. Active research area (Cesa-Bianchi and Lugosi, 2006; Anava et al., 2013, 2015, 2016; Bousquet and Warmuth, 2002; Herbster and Warmuth, 1998, 2001; Koolen et al., 2015).

  • On-Line Learning Setup

    Adversarial setting with hypothesis/action set $\mathcal{H}$.

    For $t = 1$ to $T$ do:

    the player receives $x_t \in \mathcal{X}$; the player selects $h_t \in \mathcal{H}$; the adversary selects $y_t \in \mathcal{Y}$; the player incurs loss $L(h_t(x_t), y_t)$.

    Objective: minimize the (external) regret

    $$\mathrm{Reg}_T = \sum_{t=1}^{T} L(h_t(x_t), y_t) - \min_{h \in \mathcal{H}}\sum_{t=1}^{T} L(h(x_t), y_t).$$

  • Example: Exponential Weights (EW)

    Expert set $H = \{E_1, \ldots, E_N\}$, hypothesis set $\mathcal{H} = \mathrm{conv}(H)$.

    EW({E_1, ..., E_N})
     1  for i <- 1 to N do
     2      w_{1,i} <- 1
     3  for t <- 1 to T do
     4      Receive(x_t)
     5      h_t <- (sum_{i=1}^N w_{t,i} E_i) / (sum_{i=1}^N w_{t,i})
     6      Receive(y_t)
     7      Incur-Loss(L(h_t(x_t), y_t))
     8      for i <- 1 to N do
     9          w_{t+1,i} <- w_{t,i} exp(-eta L(E_i(x_t), y_t))    (parameter eta > 0)
    10  return h_T
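    A runnable numpy version of the pseudocode above; the squared loss, the constant experts, and the choice of eta from the guarantee on the next slide are illustrative.

    import numpy as np

    def exponential_weights(expert_preds, y, eta, loss=lambda p, t: (p - t) ** 2):
        # expert_preds has shape (T, N): the prediction of each of the N experts
        # at each round. Returns the player's predictions and cumulative loss.
        T, N = expert_preds.shape
        w = np.ones(N)
        preds = np.empty(T)
        total = 0.0
        for t in range(T):
            preds[t] = np.dot(w, expert_preds[t]) / w.sum()   # convex combination of experts
            total += loss(preds[t], y[t])
            w *= np.exp(-eta * loss(expert_preds[t], y[t]))   # multiplicative update
        return preds, total

    # Toy usage: two constant experts tracking a noisy target.
    rng = np.random.default_rng(0)
    T = 500
    y = 0.7 + 0.05 * rng.standard_normal(T)
    experts = np.column_stack([np.full(T, 0.2), np.full(T, 0.8)])
    eta = np.sqrt(8 * np.log(experts.shape[1]) / T)
    _, total = exponential_weights(experts, y, eta)
    print(total / T)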

  • EW Guarantee

    Theorem: assume that $L$ is convex in its first argument and takes values in $[0, 1]$. Then, for any $\eta > 0$ and any sequence $y_1, \ldots, y_T \in \mathcal{Y}$, the regret of EW at time $T$ satisfies

    $$\mathrm{Reg}_T \leq \frac{\log N}{\eta} + \frac{\eta T}{8}.$$

    For $\eta = \sqrt{8\log N / T}$,

    $$\mathrm{Reg}_T \leq \sqrt{(T/2)\log N}, \qquad \frac{\mathrm{Reg}_T}{T} = O\left(\sqrt{\frac{\log N}{T}}\right).$$

  • EW - Proof

    Potential: $\Phi_t = \log\sum_{i=1}^{N} w_{t,i}$.

    Upper bound:

    $$\begin{aligned}
    \Phi_t - \Phi_{t-1} &= \log\frac{\sum_{i=1}^{N} w_{t-1,i}\, e^{-\eta L(E_i(x_t), y_t)}}{\sum_{i=1}^{N} w_{t-1,i}} = \log\mathbb{E}_{w_{t-1}}\big[e^{-\eta L(E_i(x_t), y_t)}\big]\\
    &= \log\mathbb{E}_{w_{t-1}}\Big[\exp\Big(-\eta\big(L(E_i(x_t), y_t) - \mathbb{E}_{w_{t-1}}[L(E_i(x_t), y_t)]\big)\Big)\Big] - \eta\,\mathbb{E}_{w_{t-1}}[L(E_i(x_t), y_t)]\\
    &\leq -\eta\,\mathbb{E}_{w_{t-1}}[L(E_i(x_t), y_t)] + \frac{\eta^2}{8} && \text{(Hoeffding's ineq.)}\\
    &\leq -\eta\, L\big(\mathbb{E}_{w_{t-1}}[E_i(x_t)], y_t\big) + \frac{\eta^2}{8} && \text{(convexity of first arg. of } L)\\
    &= -\eta\, L(h_t(x_t), y_t) + \frac{\eta^2}{8}.
    \end{aligned}$$

  • EW - Proof

    Upper bound: summing up the inequalities yields

    $$\Phi_T - \Phi_0 \leq -\eta\sum_{t=1}^{T} L(h_t(x_t), y_t) + \frac{\eta^2 T}{8}.$$

    Lower bound:

    $$\Phi_T - \Phi_0 = \log\sum_{i=1}^{N} e^{-\eta\sum_{t=1}^{T} L(E_i(x_t), y_t)} - \log N \geq \log\max_{i=1}^{N} e^{-\eta\sum_{t=1}^{T} L(E_i(x_t), y_t)} - \log N = -\eta\min_{i=1}^{N}\sum_{t=1}^{T} L(E_i(x_t), y_t) - \log N.$$

    Comparison of the two bounds:

    $$\sum_{t=1}^{T} L(h_t(x_t), y_t) - \min_{i=1}^{N}\sum_{t=1}^{T} L(E_i(x_t), y_t) \leq \frac{\log N}{\eta} + \frac{\eta T}{8}.$$

  • Questions

    Can we exploit both the batch and the on-line scenarios to

    design flexible algorithms for time series prediction with stochastic guarantees?

    tackle notoriously difficult time series problems, e.g., model selection and learning ensembles?

  • Model Selection

    Problem: given $N$ time series models, how should we use the sample $\mathbf{Z}_1^T$ to select a single best model?

    In the i.i.d. case, cross-validation can be shown to be close to the structural risk minimization solution.

    But how do we select a validation set for general stochastic processes? Use the most recent data? Use the most distant data? Use various splits?

    The models may have been pre-trained on $\mathbf{Z}_1^T$.

  • Learning Ensembles

    Problem: given a hypothesis set $H$ and a sample $\mathbf{Z}_1^T$, find an accurate convex combination $h = \sum_{t=1}^{T} q_t h_t$ with $\mathbf{h} \in H_A$ and $\mathbf{q}$ in the simplex.

    In the most general case, the hypotheses may have been pre-trained on $\mathbf{Z}_1^T$.

    Needed: on-line-to-batch conversion for general non-stationary, non-mixing processes.

  • On-Line-to-Batch (OTB)

    Input: a sequence of hypotheses $\mathbf{h} = (h_1, \ldots, h_T)$ returned after $T$ rounds by an on-line algorithm $\mathcal{A}$ minimizing the general regret

    $$\mathrm{Reg}_T = \sum_{t=1}^{T} L(h_t, Z_t) - \inf_{h \in \mathcal{H}}\sum_{t=1}^{T} L(h, Z_t).$$
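    A minimal sketch of an on-line-to-batch conversion in the spirit of this section: an on-line learner (here, online gradient descent on linear hypotheses with squared loss, an illustrative choice) is run over the sample, and the output hypothesis is the q-weighted average $h = \sum_t q_t h_t$ of the iterates.

    import numpy as np

    def online_to_batch(X, y, q, lr=0.1):
        # Run online gradient descent, record the hypothesis w_t played at each
        # round, and return the q-weighted average of the iterates.
        T, d = X.shape
        w = np.zeros(d)
        iterates = np.empty((T, d))
        for t in range(T):
            iterates[t] = w                        # hypothesis h_t used at round t
            grad = 2 * (w @ X[t] - y[t]) * X[t]    # gradient of the squared loss
            w -= lr * grad
        return q @ iterates

    # Toy usage with weights q favoring later (more relevant) rounds.
    rng = np.random.default_rng(0)
    T = 300
    X = rng.standard_normal((T, 3))
    y = X @ np.array([0.5, -1.0, 0.3]) + 0.05 * rng.standard_normal(T)
    q = np.arange(1, T + 1, dtype=float); q /= q.sum()
    print(online_to_batch(X, y, q))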

  • On-Line-to-Batch (OTB)

    Problem: use $\mathbf{h} = (h_1, \ldots, h_T)$ to derive a hypothesis $h \in \mathcal{H}$ with small path-dependent expected loss,

    $$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) = \mathbb{E}_{Z_{T+1}}\big[L(h, Z_{T+1}) \,\big|\, \mathbf{Z}_1^T\big].$$

    The i.i.d. problem is standard: (Littlestone, 1989), (Cesa-Bianchi et al., 2004).

    But how do we design solutions for the general time-series scenario?

  • Questions

    Is OTB with general (non-stationary, non-mixing) stochastic processes possible?

    Can we design algorithms with theoretical guarantees?

    We need a new tool for the analysis.

  • Relevant Quantity

    Key difference: $\mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})$ at time $t$ versus $\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T)$ at the target time $T + 1$.

    Average difference:

    $$\frac{1}{T}\sum_{t=1}^{T}\Big[\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})\Big].$$

  • On-line Discrepancy

    Definition:

    $$\mathrm{disc}(\mathbf{q}) = \sup_{\mathbf{h} \in H_A}\sum_{t=1}^{T} q_t\Big[\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})\Big].$$

    $H_A$: the sequences that the on-line algorithm $\mathcal{A}$ can return. $\mathbf{q} = (q_1, \ldots, q_T)$: an arbitrary weight vector.

    A natural measure of non-stationarity or dependency. Captures the hypothesis set and the loss function. Can be efficiently estimated under mild assumptions. Generalization of the discrepancy definition of (Kuznetsov and MM, 2015).

  • Discrepancy Estimation

    Batch discrepancy estimation method.

    Alternative method:

    assume that the loss is $\mu$-Lipschitz; assume that there exists an accurate hypothesis $h^*$:

    $$\eta = \inf_{h^*}\ \mathbb{E}\big[L(h^*(X_{T+1}), Y_{T+1}) \,\big|\, \mathbf{Z}_1^T\big] \ll 1.$$

  • Discrepancy Estimation

    Lemma: fix a sequence $\mathbf{Z}_1^T$ in $\mathcal{Z}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $\alpha > 0$:

    $$\mathrm{disc}(\mathbf{q}) \leq \widehat{\mathrm{disc}}_{H}(\mathbf{q}) + \eta + 2\alpha + M\|\mathbf{q}\|_2\sqrt{2\log\frac{\mathbb{E}_{\mathbf{z}}[\mathcal{N}_1(\alpha, \mathcal{G}, \mathbf{z})]}{\delta}},$$

    where

    $$\widehat{\mathrm{disc}}_{H}(\mathbf{q}) = \sup_{h \in H,\, \mathbf{h} \in H_A}\sum_{t=1}^{T} q_t\Big[L\big(h_t(X_{T+1}), h(X_{T+1})\big) - L(h_t, Z_t)\Big].$$

  • Proof Sketch

    $$\begin{aligned}
    \mathrm{disc}(\mathbf{q}) &= \sup_{\mathbf{h} \in H_A}\sum_{t=1}^{T} q_t\Big[\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})\Big]\\
    &\leq \sup_{\mathbf{h} \in H_A}\sum_{t=1}^{T} q_t\Big[\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathbb{E}\big[L(h_t(X_{T+1}), h^*(X_{T+1})) \,\big|\, \mathbf{Z}_1^T\big]\Big]\\
    &\qquad + \sup_{\mathbf{h} \in H_A}\sum_{t=1}^{T} q_t\Big[\mathbb{E}\big[L(h_t(X_{T+1}), h^*(X_{T+1})) \,\big|\, \mathbf{Z}_1^T\big] - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})\Big]\\
    &\leq \sup_{\mathbf{h} \in H_A}\sum_{t=1}^{T} q_t\, \mathbb{E}\big[L(h^*(X_{T+1}), Y_{T+1}) \,\big|\, \mathbf{Z}_1^T\big] + \sup_{\mathbf{h} \in H_A}\sum_{t=1}^{T} q_t\Big[\mathbb{E}\big[L(h_t(X_{T+1}), h^*(X_{T+1})) \,\big|\, \mathbf{Z}_1^T\big] - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})\Big]\\
    &= \mathbb{E}\big[L(h^*(X_{T+1}), Y_{T+1}) \,\big|\, \mathbf{Z}_1^T\big] + \sup_{\mathbf{h} \in H_A}\sum_{t=1}^{T} q_t\Big[\mathbb{E}\big[L(h_t(X_{T+1}), h^*(X_{T+1})) \,\big|\, \mathbf{Z}_1^T\big] - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})\Big].
    \end{aligned}$$

  • Learning Guarantee

    Lemma: let $L$ be a convex loss bounded by $M$ and $\mathbf{h}$ a hypothesis sequence adapted to $\mathbf{Z}_1^T$. Fix a weight vector $\mathbf{q}$ in the simplex. Then, for any $\delta > 0$, the following holds with probability at least $1 - \delta$ for the hypothesis $h = \sum_{t=1}^{T} q_t h_t$:

    $$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) \leq \sum_{t=1}^{T} q_t L(h_t, Z_t) + \mathrm{disc}(\mathbf{q}) + M\|\mathbf{q}\|_2\sqrt{2\log\frac{1}{\delta}}.$$

  • Proof

    By definition of the on-line discrepancy,

    $$\sum_{t=1}^{T} q_t\Big[\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})\Big] \leq \mathrm{disc}(\mathbf{q}).$$

    $A_t = q_t\big[\mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) - L(h_t, Z_t)\big]$ is a martingale difference, thus by Azuma's inequality, with high probability,

    $$\sum_{t=1}^{T} q_t\, \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \leq \sum_{t=1}^{T} q_t L(h_t, Z_t) + M\|\mathbf{q}\|_2\sqrt{2\log\frac{1}{\delta}}.$$

    By convexity of the loss:

    $$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) \leq \sum_{t=1}^{T} q_t\, \mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T).$$

  • Learning Guarantee

    Theorem: let $L$ be a convex loss bounded by $M$ and $H_A$ a set of hypothesis sequences adapted to $\mathbf{Z}_1^T$. Fix a weight vector $\mathbf{q}$ in the simplex. Then, for any $\delta > 0$, the following holds with probability at least $1 - \delta$ for the hypothesis $h = \sum_{t=1}^{T} q_t h_t$:

    $$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) \leq \inf_{h^* \in \mathcal{H}}\mathcal{L}_{T+1}(h^*, \mathbf{Z}_1^T) + 2\,\mathrm{disc}(\mathbf{q}) + \frac{\mathrm{Reg}_T}{T} + M\|\mathbf{q} - \mathbf{u}\|_1 + 2M\|\mathbf{q}\|_2\sqrt{2\log\frac{2}{\delta}}.$$

  • Conclusion

    Time series forecasting:

    a key learning problem in many important tasks. Very challenging: theory, algorithms, applications. New and general data-dependent learning guarantees for non-mixing, non-stationary processes. Algorithms with guarantees.

    Time series prediction and on-line learning:

    guarantees for flexible solutions derived via OTB. Application to model selection. Application to learning ensembles.

  • Time Series Workshop

    Join us in Room 117, Friday, December 9th.

  • References

    T. M. Adams and A. B. Nobel. Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling. The Annals of Probability, 38(4):1345-1367, 2010.

    A. Agarwal, J. Duchi. The generalization ability of online algorithms for dependent data. IEEE Transactions on Information Theory, 59(1):573-587, 2013.

    P. Alquier, X. Li, O. Wintenberger. Prediction of time series by statistical learning: general losses and fast rates. Dependence Modelling, 1:65-93, 2014.

    O. Anava, E. Hazan, S. Mannor, and O. Shamir. Online learning for time series prediction. COLT, 2013.

    O. Anava, E. Hazan, and A. Zeevi. Online time series prediction with missing data. ICML, 2015.

    O. Anava and S. Mannor. Online learning for heteroscedastic sequences. ICML, 2016.

    P. L. Bartlett. Learning with a slowly changing distribution. COLT, 1992.

    R. D. Barve and P. M. Long. On the complexity of learning from drifting distributions. Information and Computation, 138(2):101-123, 1997.

    P. Berti and P. Rigo. A Glivenko-Cantelli theorem for exchangeable random variables. Statistics & Probability Letters, 32(4):385-391, 1997.

    S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. NIPS, 2007.

    T. Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 1986.

    G. E. P. Box, G. Jenkins. Time Series Analysis, Forecasting and Control. 1990.

    O. Bousquet and M. K. Warmuth. Tracking a small set of experts by mixing past posteriors. COLT, 2001.

    N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9), 2004.

    N. Cesa-Bianchi and C. Gentile. Tracking the best hyperplane with a simple budget perceptron. COLT, 2006.

    C. Cortes and M. Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519, 2014.

    V. H. De la Peña and E. Giné. Decoupling: From Dependence to Independence: Randomly Stopped Processes, U-statistics and Processes, Martingales and Beyond. Probability and its Applications. Springer, New York, 1999.

    P. Doukhan. Mixing: Properties and Examples. Lecture Notes in Statistics. Springer-Verlag, New York, 1994.

    R. Engle. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4):987-1007, 1982.

    D. P. Helmbold and P. M. Long. Tracking drifting concepts by minimizing disagreements. Machine Learning, 14(1):27-46, 1994.

    M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32(2), 1998.

    M. Herbster and M. K. Warmuth. Tracking the best linear predictor. JMLR, 2001.

    D. Hsu, A. Kontorovich, and C. Szepesvári. Mixing time estimation in reversible Markov chains from a single sample path. NIPS, 2015.

    W. M. Koolen, A. Malek, P. L. Bartlett, and Y. Abbasi. Minimax time series prediction. NIPS, 2015.

    V. Kuznetsov, M. Mohri. Generalization bounds for time series prediction with non-stationary processes. ALT, 2014.

    V. Kuznetsov, M. Mohri. Learning theory and algorithms for forecasting non-stationary time series. NIPS, 2015.

    V. Kuznetsov, M. Mohri. Time series prediction and on-line learning. COLT, 2016.

    A. C. Lozano, S. R. Kulkarni, and R. E. Schapire. Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. NIPS, pages 819-826, 2006.

    Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. COLT, 2009.

    R. Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, pages 5-34, 2000.

    D. Modha, E. Masry. Memory-universal prediction of stationary random processes. IEEE Transactions on Information Theory, 44(1):117-133, 1998.

    M. Mohri, A. Muñoz Medina. New analysis and algorithm for learning with drifting distributions. ALT, 2012.

    M. Mohri, A. Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. NIPS, 2009.

    M. Mohri, A. Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11:789-814, 2010.

    V. Pestov. Predictive PAC learnability: A paradigm for learning from exchangeable input data. GRC, 2010.

    A. Rakhlin, K. Sridharan, A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. NIPS, 2010.

    L. Ralaivola, M. Szafranski, G. Stempfel. Chromatic PAC-Bayes bounds for non-iid data: Applications to ranking and stationary β-mixing processes. JMLR, 11:1927-1956, 2010.

    C. Shalizi and A. Kontorovich. Predictive PAC-learning and process decompositions. NIPS, 2013.

    I. Steinwart, A. Christmann. Fast learning from non-i.i.d. observations. NIPS, 2009.

    M. Vidyasagar. A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems. Springer-Verlag, New York, 1997.

    V. Vovk. Competing with stationary prediction strategies. COLT, 2007.

    B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94-116, 1994.

  • β-Mixing

    Definition: a sequence of random variables $\mathbf{Z} = \{Z_t\}_{-\infty}^{+\infty}$ is β-mixing if

    $$\beta(k) = \sup_{n}\ \mathbb{E}_{B \in \sigma(Z_{-\infty}^{n})}\Big[\sup_{A \in \sigma(Z_{n+k}^{+\infty})}\big|\mathbb{P}[A \mid B] - \mathbb{P}[A]\big|\Big] \longrightarrow 0,$$

    that is, the dependence between events decays with $k$.