Time Series Analysis Project
The following is a time series analysis report on the number of employed persons
in Australia per month from February 1978 to April 1991. The data come from the
Australian Bureau of Statistics and were provided by the Time Series Data Library. The
series has 159 observations, and each value is in thousands of persons, as shown in
Exhibit 1. From this data, we generated a time series plot, shown in Exhibit 2.
The plot shows a strong increasing trend that may be
linear or curved.
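As a minimal sketch of how such a monthly series can be set up in R, the block below builds a `ts` object that starts in February 1978 and confirms that 159 monthly observations end in April 1991. The vector `employed` holds placeholder values only (an assumption for illustration); the real figures are the ones listed in Exhibit 1. The report's own object `popts` appears to have more than one column, whereas this sketch uses a single series.

```r
# Placeholder values only: substitute the 159 figures from Exhibit 1.
employed <- seq(6000, 7750, length.out = 159)

# Monthly series (frequency = 12) whose first observation is February 1978.
popts <- ts(employed, start = c(1978, 2), frequency = 12)

print(length(popts))  # number of observations
print(start(popts))   # year and month of the first observation
print(end(popts))     # year and month of the last observation
```

Calling `plot(popts)` on the real data would give a plot of the kind described for Exhibit 2.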
At this point, we begin fitting a model to the data. After converting the
data to a time series object in R, the first step was to fit a
linear regression on time. The summary of this fit gave the following output:
Call:
lm(formula = Population ~ time(popts))
Residuals:
     Min       1Q   Median       3Q      Max
 -435.56  -130.54    19.95   135.83   365.58
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 5785.3774    29.1055  198.77   <2e-16 ***
time(popts)   12.5220     0.3156   39.68   <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 182.6 on 157 degrees of freedom
Multiple R-squared: 0.9093, Adjusted R-squared: 0.9088
F-statistic: 1575 on 1 and 157 DF, p-value: < 2.2e-16
With a multiple R-squared of 0.9093 (about 91%) and highly significant regression coefficients,
this seems like an excellent model for the data. However, the residuals
(plotted in Exhibit 3) do not appear to be random, which suggests that we
should instead try a quadratic fit. Doing this gives us:
Call:
lm(formula = Population ~ time(popts) + I(time(popts)^2))
Residuals:
     Min       1Q   Median       3Q      Max
 -336.51   -71.57    12.83    94.33   230.32
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      6.097e+03  2.900e+01 210.273   <2e-16 ***
time(popts)      9.011e-01  8.367e-01   1.077    0.283
I(time(popts)^2) 7.263e-02  5.066e-03  14.338   <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 120.3 on 156 degrees of freedom
Multiple R-squared: 0.9609, Adjusted R-squared: 0.9604
F-statistic: 1916 on 2 and 156 DF, p-value: < 2.2e-16
Once again, with a multiple R-squared of 0.9609 (about 96%) and a highly significant
quadratic term, this seems like an excellent model for the data, although the linear
term is no longer significant (p = 0.283). However, once again
the residuals (plotted in Exhibit 4) do not appear to be random. Because of this,
it was decided that a
Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test should be performed to assess the
stationarity of the data and whether or not differencing would be required. This gave the
output:
KPSS Test for Level Stationarity
data: popts[, 2]
KPSS Level = 5.0025, Truncation lag parameter = 2, p-value = 0.01
Warning message:
In kpss.test(popts[, 2]) : p-value smaller than printed p-value
Since the p-value is less than 0.05, we reject the null hypothesis of level stationarity
and conclude that differencing will be required.
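The trend-fitting steps above can be sketched as follows. This is only an illustration under stated assumptions: the series `y` is simulated (it is not the Exhibit 1 data), the regressor `t_idx` is a simple 1 to 159 index rather than `time(popts)`, and the KPSS test runs only if the `tseries` package happens to be installed.

```r
set.seed(1)

# Simulated stand-in for the employment series (NOT the Exhibit 1 data):
# a curved upward trend plus noise, 159 monthly points.
n     <- 159
t_idx <- seq_len(n)
y     <- 6000 + 0.9 * t_idx + 0.07 * t_idx^2 + rnorm(n, sd = 120)

# Linear and quadratic trend fits; note the ^ inside I() for the squared term.
fit_lin  <- lm(y ~ t_idx)
fit_quad <- lm(y ~ t_idx + I(t_idx^2))

r2_lin  <- summary(fit_lin)$r.squared
r2_quad <- summary(fit_quad)$r.squared
cat("R-squared, linear:   ", round(r2_lin, 4), "\n")
cat("R-squared, quadratic:", round(r2_quad, 4), "\n")

# Lag-1 autocorrelation of the residuals: a quick numeric companion to a
# residual plot (values near 0 are consistent with random residuals).
acf_lin <- acf(residuals(fit_lin), lag.max = 1, plot = FALSE)$acf[2]
cat("Lag-1 residual autocorrelation, linear fit:", round(acf_lin, 3), "\n")

# KPSS test for level stationarity, only if tseries is available.
if (requireNamespace("tseries", quietly = TRUE)) {
  print(tseries::kpss.test(y, null = "Level"))
}
```

Because the quadratic model nests the linear one, its R-squared can never be lower, so a higher R-squared alone does not show that the extra term is needed; the residual diagnostics carry that argument.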
The first step in this process is to determine the “best” lambda value for a power
transformation of the data. To do this, a Box-Cox analysis was performed, as shown in
Exhibit 5, giving a lambda value of 0.04. The
transformed values are shown in the time series plot in Exhibit 6. The transformation appears to have
stabilized the variance, but there is still a strong increasing trend that must be
accounted for. We account for this trend by taking the first difference of the transformed
values, which gives the plot in Exhibit 7. While the trend is gone, there appears to be
a recurring spike in December of each year, which could indicate seasonality. However, since it occurs in only
one month of the year rather than across an entire “season”, we treated it as an outlier that creates a bit of
drift in the model rather than as seasonality. From here we look for an acceptable
ARIMA model based on the ACF and PACF, which are shown in Exhibit 8.
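A rough sketch of the transform-then-difference step is given below. It uses `boxcox()` from the `MASS` package as a stand-in, since the report does not say which Box-Cox routine produced Exhibit 5, and it again runs on simulated data rather than the Exhibit 1 values. The helper `box_cox()` is a hypothetical name introduced here for illustration.

```r
library(MASS)  # recommended package shipped with R; provides boxcox()

set.seed(2)

# Simulated positive series with an upward trend (NOT the Exhibit 1 data).
n     <- 159
t_idx <- seq_len(n)
y     <- 6000 + 0.9 * t_idx + 0.07 * t_idx^2 + rnorm(n, sd = 120)

# Profile log-likelihood over a grid of lambda values; keep the maximizer.
trend_fit  <- lm(y ~ t_idx, y = TRUE, qr = TRUE)
bc         <- boxcox(trend_fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
cat("Estimated lambda:", lambda_hat, "\n")

# Box-Cox power transformation; lambda = 0 corresponds to the natural log.
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

z  <- box_cox(y, lambda_hat)  # variance-stabilized series
dz <- diff(z)                 # first difference removes the trend

cat("Length before differencing:", length(z), "\n")
cat("Length after differencing: ", length(dz), "\n")
```

A lambda close to zero, such as the 0.04 reported above, means the transformation behaves much like a log transform.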
The significant spikes at lags 3 and 5 in the ACF suggest either an MA(3) or an MA(5)
component. Therefore, we compared the AIC values of the two candidate ARIMA models.
The AIC of ARIMA(0,1,3) is 1780.831, and that of ARIMA(0,1,5) is 1765.55. Since
ARIMA(0,1,5) gives the smaller AIC value, it is the model we chose. Its
residuals are plotted in Exhibit 9. All the spikes before lag 12 are within the
significance limits, but the spike at lag 12 exceeds them. This makes
sense since, according to the plot in Exhibit 7, there was a large jump every twelfth
point. To double-check this result, a Box-Ljung test was performed, which yielded the
following result:
Box-Ljung test
data: res
X-squared = 11.106, df = 6, p-value = 0.08516
Since the p-value is greater than 0.05, we fail to reject the null hypothesis that the
residuals are uncorrelated, which supports our decision not to be concerned with the spike at lag 12.
Now that we have our model, the next step is to estimate its parameters.
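The model comparison and residual check described above can be sketched with base R as follows. The series is simulated with arbitrary coefficients (it is not the Exhibit 1 data), and the `lag` and `fitdf` values passed to `Box.test()` are assumptions, since the report does not state which settings produced df = 6.

```r
set.seed(3)

# Simulated stand-in (NOT the Exhibit 1 data): an ARIMA(0,1,2) process with
# arbitrary MA coefficients; 158 innovations give 159 points after integration.
y <- arima.sim(model = list(order = c(0, 1, 2), ma = c(-0.5, 0.3)), n = 158)

# Fit the two candidate models and compare their AIC values.
fit3 <- arima(y, order = c(0, 1, 3), method = "ML")
fit5 <- arima(y, order = c(0, 1, 5), method = "ML")

aics <- c("ARIMA(0,1,3)" = AIC(fit3), "ARIMA(0,1,5)" = AIC(fit5))
print(aics)
best <- names(aics)[which.min(aics)]
cat("Smaller AIC:", best, "\n")

# Ljung-Box test on the residuals of the MA(5) fit. fitdf removes one degree
# of freedom per estimated MA coefficient, so df = lag - fitdf.
lb <- Box.test(residuals(fit5), lag = 12, type = "Ljung-Box", fitdf = 5)
print(lb)
```

A large p-value here means there is no evidence of remaining autocorrelation up to the chosen lag; it does not prove that every individual lag is negligible.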
The first method used was the method-of-moments estimate, which gave the output:
Coefficients:
         ma1     ma2     ma3     ma4     ma5
      0.5645  0.1440  0.4102  0.2659  0.3320
s.e.  0.0773  0.0817  0.0869  0.0816  0.0657
sigma^2 estimated as 3669: log likelihood=-873.44 AIC=1760.88
Next, the conditional least squares estimate was used, which resulted in:
Call:
arima(x = popts[, 2], order = c(0, 1, 5), method = "CSS")
Coefficients:
         ma1     ma2     ma3     ma4     ma5
      0.5289  0.1213  0.4046  0.2310  0.3589
s.e.  0.0774  0.0785  0.0854  0.0771  0.0670
sigma^2 estimated as 3859: part log likelihood = -876.59
The values here are close to the values generated using method-of-moments, but the
smaller sigma-squared value of the method-of-moments fit (3669 versus 3859) hints that it may be the better
fit. Lastly, a maximum likelihood estimate was computed to be:
Call:
arima(x = popts[, 2], order = c(0, 1, 5), method = "ML")
Coefficients:
         ma1     ma2     ma3     ma4     ma5
      0.5289  0.1223  0.4158  0.2322  0.3555
s.e.  0.0774  0.0797  0.0861  0.0791  0.0664
sigma^2 estimated as 3823: log likelihood = -876.78, aic = 1763.55
While the maximum likelihood and conditional least squares estimates are close in their values, the
method-of-moments estimate has the lowest AIC value (1760.88 versus 1763.55 for maximum likelihood). Therefore, it was determined
that the method-of-moments estimate was very likely the most accurate of the
three.
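The two `arima()` calls shown above can be reproduced in outline with the sketch below, which fits the same MA(5) structure by conditional sum of squares and by maximum likelihood and places the estimates side by side. The data are simulated with arbitrary MA coefficients (an assumption for illustration, not the Exhibit 1 series). Base R's `arima()` has no method-of-moments option, so that estimate is not covered here.

```r
set.seed(4)

# Simulated stand-in (NOT the Exhibit 1 data): ARIMA(0,1,5) with arbitrary
# MA coefficients chosen only for illustration.
y <- arima.sim(
  model = list(order = c(0, 1, 5), ma = c(-0.5, -0.1, 0.4, -0.2, 0.3)),
  n = 158
)

# Same model, two estimation methods.
fit_css <- arima(y, order = c(0, 1, 5), method = "CSS")
fit_ml  <- arima(y, order = c(0, 1, 5), method = "ML")

# Coefficient estimates side by side.
est <- rbind(CSS = coef(fit_css), ML = coef(fit_ml))
print(round(est, 4))

# Innovation variance estimates.
cat("sigma^2 (CSS):", round(fit_css$sigma2, 4), "\n")
cat("sigma^2 (ML): ", round(fit_ml$sigma2, 4), "\n")

# arima() reports an AIC only for likelihood-based fits; for CSS it is NA.
cat("AIC (CSS):", fit_css$aic, "\n")
cat("AIC (ML): ", fit_ml$aic, "\n")
```

This is why the conditional least squares output above shows only a partial log likelihood and no AIC, so the AIC comparison in the text is effectively between the method-of-moments and maximum likelihood fits.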
The final step is to forecast the next five observations of this data
series. Using the forecast function in R gave us the following output:
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
160 7707.165 7612.882 7801.449 7562.971 7851.359
161 7676.880 7576.026 7777.734 7522.638 7831.122
162 7649.566 7538.754 7760.377 7480.094 7819.037
163 7624.930 7501.260 7748.601 7435.792 7814.068
164 7602.711 7463.936 7741.487 7390.473 7814.950
This indicates that the next five observations should show a downward trend, as
shown in Exhibit 10. Unfortunately, the fact that drift exists means that this forecast is
not entirely reliable. Because of this unreliability, the prediction intervals for the next five
observations are wider than would normally be desired. However,
despite this unreliability, all of the point forecasts indicate a downward trend, which
suggests that even if the first forecast is off, the subsequent
observations are likely to have lower values.
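A table like the one above can be approximated in base R with `predict()`, as sketched below. This is not the forecast function used in the report: the 80% and 95% limits are built by hand from the forecast standard errors under a normality assumption, and the series is simulated rather than taken from Exhibit 1.

```r
set.seed(5)

# Simulated stand-in (NOT the Exhibit 1 data): ARIMA(0,1,5) with arbitrary
# MA coefficients.
y <- arima.sim(
  model = list(order = c(0, 1, 5), ma = c(-0.5, -0.1, 0.4, -0.2, 0.3)),
  n = 158
)
fit <- arima(y, order = c(0, 1, 5), method = "ML")

# Point forecasts and standard errors for the next five observations.
fc <- predict(fit, n.ahead = 5)

# Normal-theory prediction limits: point forecast +/- z * standard error.
z80 <- qnorm(0.90)
z95 <- qnorm(0.975)
forecasts <- data.frame(
  point = as.numeric(fc$pred),
  lo80  = as.numeric(fc$pred - z80 * fc$se),
  hi80  = as.numeric(fc$pred + z80 * fc$se),
  lo95  = as.numeric(fc$pred - z95 * fc$se),
  hi95  = as.numeric(fc$pred + z95 * fc$se)
)
print(round(forecasts, 3))
```

Because the model includes one difference, the standard errors grow with the horizon, which is why intervals of this kind widen from the first forecast to the fifth.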
In conclusion, we can safely say that this time series follows an ARIMA(0,1,5)
model that follows the equation
$$Y_t = Y_{t-1} + e_t + 0.5645\,e_{t-1} + 0.144\,e_{t-2} - 0.4102\,e_{t-3} + 0.2659\,e_{t-4} - 0.332\,e_{t-5}.$$
Additionally, we
predicted the next five observations to be 7707.165, 7676.88, 7649.566, 7624.93, and
7602.711. However, as previously stated, these values may not be very accurate on
account of the drift. The drift in the data set proved to be a very frustrating component,
since all the previous sets we had worked with had been very “clean” data sets that
didn’t include drift or seasonality. Because of this, a great deal of time was spent
trying to determine an accurate model simply because of a lack of knowledge.
Originally, because of what I took to be seasonality, I thought I was going to
have to start all over with a new data set, but once I determined that what I was dealing
with wasn’t seasonality, it became much easier. While the tip to avoid forecasting
financial series was appreciated, it might also help in the future to include a tip to
be wary when dealing with employment numbers, or to devote a bit more time
to explaining drift and seasonality so that, if encountered, they don’t cause
future students so much trouble.
Exhibit 1
Employed persons in Australia (thousands), February 1978 to April 1991
Year     Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec
1978          5985.7  6040.6  6054.2  6038.3  6031.3  6036.1  6005.4  6024.3  6045.9  6033.8  6125.4
1979  5971.3  6050.7  6096.2  6087.7  6075.6  6095.7  6103.9  6078.5  6157.8  6164.0  6188.8  6257.2
1980  6112.9  6207.2  6278.7  6224.9  6273.4  6269.9  6314.1  6281.4  6360.0  6320.2  6342.0  6426.6
1981  6253.0  6356.5  6428.1  6426.3  6412.4  6413.9  6425.3  6393.7  6502.7  6445.3  6433.3  6506.9
1982  6355.5  6432.4  6497.4  6431.6  6440.9  6414.3  6425.9  6379.3  6443.5  6421.1  6367.0  6370.2
1983  6172.2  6264.1  6310.4  6254.5  6272.8  6266.5  6295.0  6241.1  6358.2  6336.2  6377.5  6456.4
1984  6251.4  6365.4  6503.2  6477.6  6489.7  6499.0  6528.7  6466.1  6579.8  6553.2  6576.1  6636.0
1985  6452.4  6595.7  6657.4  6588.8  6658.0  6659.4  6703.4  6675.6  6814.7  6771.1  6882.0  6910.8
1986  6753.6  6861.9  6961.9  6997.9  6979.0  7007.7  6991.5  6918.6  7040.6  7030.4  7034.2  7116.8
1987  6902.5  7022.3  7133.4  7109.6  7103.6  7128.9  7175.6  7092.3  7186.5  7177.4  7182.2  7330.7
1988  7169.4  7247.4  7397.4  7383.4  7354.9  7378.3  7383.1  7353.4  7503.2  7477.3  7508.7  7622.9
1989  7424.9  7569.7  7638.3  7683.2  7729.3  7720.5  7751.8  7727.6  7854.4  7817.8  7870.7  7941.6
1990  7712.6  7809.1  7877.5  7894.3  7916.1  7910.0  7932.9  7825.0  7925.5  7870.5  7849.9  7941.2
1991  7668.8  7739.3  7746.5  7750.5
Exhibit 2
Exhibit 3
Exhibit 4
Exhibit 5
Exhibit 6
Exhibit 7
Exhibit 10