You can find the Jupyter notebook as a Gist.

Let’s say you have a series of numbers and wish to forecast it;

A basic plot of the data looks like the following

*(figure: plot of the raw series)*

The precise time axis is not relevant at this point, and if you create a time series object (using the ts command) out of this raw data
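A minimal sketch of this step (the vector name `y` for the raw numbers is hypothetical):

```r
# turn the raw vector into a time series object;
# without a frequency argument, ts() defaults to frequency = 1
x <- ts(y)
x
```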

you get an object with frequency one and an abstract or indexed time axis;

The ts command does not automatically detect any cycles or frequencies, so if you ask for the cycle information you get
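For instance (continuing with the hypothetical `x` from above):

```r
frequency(x)  # 1, the default
cycle(x)      # a 1 for every observation
```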

All the 1's say that all the numbers belong to the same cycle, and a frequency equal to one is essentially the same statement. From the plot one can see, however, that there is a rhythm, and if you wish to have a suggestion for it you can use the findfrequency command from the forecast package
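A sketch:

```r
library(forecast)
findfrequency(x)  # suggests the dominant cycle length, here 7
```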

So, we'd better adjust the time series with this finding
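Something like the following, again assuming the raw numbers sit in the vector `y`:

```r
# rebuild the series with the suggested frequency of 7
x7 <- ts(y, frequency = 7)
x7
```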

and now the series correctly says that there are 25 cycles of frequency 7, with the last one truncated to only 4 items;

which can also be seen in the cycle information

Another way to detect cycles is by means of Fourier decomposition. This, in a nutshell, is based on the fact that any continuous function can be decomposed into sines and cosines of increasing frequencies. The intensity or scale of these sines/cosines then gives an indication of the data's underlying rhythm. The plot generated by spectrum internally uses a log scale which hides the peaks; you can get rid of this like so
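A sketch of the two calls (the series is passed as a plain numeric vector so the frequency axis is in cycles per observation):

```r
spectrum(as.numeric(x7))              # default plot, log scale on the y-axis
spectrum(as.numeric(x7), log = "no")  # same periodogram without the log scale
```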

*(figure: periodogram with the default log scale)*

*(figure: periodogram with log = "no")*

and if you wish to get the prominent frequency out of this you need to read the documentation of the spectrum function, which tells you that the axis is scaled with 1/frequency, turning the spectrum into the interval [-frequency/2, +frequency/2]. So, we get the main frequency by means of
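For instance (a sketch; with a plain numeric vector the peak frequency is in cycles per observation, so its inverse is the period):

```r
spec <- spectrum(as.numeric(x7), plot = FALSE)
1 / spec$freq[which.max(spec$spec)]   # period of the dominant peak
```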

thus giving seven days, as it should be.

Now, time series data is sometimes too noisy or too erratic to extract e.g. a trend from. A way to make a time series smoother is by averaging the data. Technically this is called a simple moving average or SMA. One simply takes the average of the data points within a predefined time window (the number of points being averaged). The procedure has as a side-effect that some points are sacrificed: the larger the averaging window, the more it shortens the time series. As an extreme case, a window equal in size to the length of the series would reduce the series to one point.

Mathematically speaking, the SMA with a window of size $N$ is a map from the original series $\{x_i\}$ to $\{s_i\}$ via

$$ s_n = \frac{x_{n-1} + x_{n-2} + \cdots + x_{n-N}}{N} $$

and upon applying this to our series (with window size 7) we get

*(figure: the series with its 7-point simple moving average)*

We used above the sub-series [5:172] to accommodate the NA's in the average. The TTR library contains various other averaging commands (exponential moving average, zero-lag exponential moving average and many more).
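A sketch of the averaging step with TTR (plotting both series; the exact subsetting used in the original notebook is not reproduced here):

```r
library(TTR)

sma7 <- SMA(as.numeric(x7), n = 7)      # simple moving average, window of 7
plot(as.numeric(x7), type = "l", col = "grey")
lines(sma7, col = "red")                # the first 6 values are NA
```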

Another way of smoothing a series is by subtracting elements a fixed distance apart. The diff operation with lag $N$ is simply

$$ s_n = x_n - x_{n-N} $$

The lag operator on a time series $x_t$ is defined as $L x_t := x_{t-1}$, and powers of it thus lead to the same difference operation: $L^N x_t = x_{t-N}$ and $(1 - L^N)\,x_t = x_t - x_{t-N}$. Since this operation appears often, one usually defines $\Delta_N := 1 - L^N$. A general lag polynomial is simply a series

$$ p(L) = \sum_{k=0}^{\infty} p_k L^k = p_0 + p_1 L + p_2 L^2 + \cdots $$

and we define a corresponding polynomial operator:

$$ p(L)\, x_t = p_0 x_t + p_1 x_{t-1} + p_2 x_{t-2} + \cdots $$

Usually the polynomial is finite, and if one sets $p(L)\, x_t = 0$ this describes a general difference equation. Difference equations are the simpler cousins of auto-regressive processes (see below) in the sense that they do not include anything random.

When applied to our series we get
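For instance (a sketch; a lag equal to the cycle length of 7 is a natural choice, though the lag used for the original plot is not recorded here):

```r
d7 <- diff(as.numeric(x7), lag = 7)
plot(d7, type = "l")   # fluctuates around zero
```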

*(figure: the differenced series, fluctuating around zero)*

Note that the difference operation turns our series into a series which fluctuates around zero. This is an indication that the underlying series is a so-called stationary series. A stationary series is one which keeps its characteristics when translated by an arbitrary amount in time. That is, the overall characteristics of

$$ X = x_k, x_{k+1}, x_{k+2}, \ldots, x_{k+N} $$

are the same as those of

$$ L^{-\delta} X = x_{k+\delta}, x_{k+1+\delta}, x_{k+2+\delta}, \ldots, x_{k+N+\delta} $$

This immediately leads to the fact that, for a stationary series, the average of the differenced series is zero:

$$ \bar{s} = 0. $$

This fact is often used to test whether a time series is stationary even though it’s not bullet-proof.

Of course, besides these smoothing techniques you can use any mapping on a time series suggested by the business context or domain expertise which helps to extract insights more effectively.

The concept of correlation in any series (not just time series) is the idea that values can be related to previous ones, that there is correlation between values within the series. Obviously, if one finds correlation in a series it helps to predict the next values. Classic series in mathematical analysis are correlated by definition. The Fibonacci series for instance

$$ x_n = x_{n-1} + x_{n-2} $$

clearly defines future values by means of past values and hence has a strong (auto)correlation. The question is of course whether one can find this out from the data alone, without knowing anything else (i.e. how the data was created).

The technique to find this out is based on the autocorrelation function. This is the linear dependence of a variable with itself at two points in time. More specifically, if $x_t$ is a time series with means $\mu_s, \mu_t$ at times $s, t$ then the auto-correlation function (ACF) is defined as

$$ \mathrm{ACF}(s,t) = \frac{\mathbb{E}\left[(x_s - \mu_s)(x_t - \mu_t)\right]}{\sigma_s \sigma_t} $$

For stationary processes $\mu := \mu_s = \mu_t$ and the autocorrelation between any two observations only depends on the time lag $h = t - s$ between them:

$$ \mathrm{ACF}(s,t) = \frac{\mathbb{E}\left[(x_0 - \mu)(x_h - \mu)\right]}{\sigma^2}. $$

Let’s take the first few items of the Fibonacci sequence and see what this gives:
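A sketch (the number of terms, 20, is an arbitrary choice):

```r
# the first 20 Fibonacci numbers
fib <- numeric(20)
fib[1:2] <- c(1, 1)
for (i in 3:20) fib[i] <- fib[i - 1] + fib[i - 2]

acf(fib)
```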

The acf function creates a plot

*(figure: ACF plot of the Fibonacci terms)*

and what you need to see in this is, first, that the autocorrelation keeps going and in a way accumulates correlations. Second, that the blue dotted line gives a boundary inside which one can expect the correlation to be part of noise effects. In our case this means that lags 0, 1 and 2 can be considered as resulting from correlations while all the other lags are due to noise. Obviously a term is always strongly correlated with itself, so lag 0 can always be ignored. So, our ACF tells us that the Fibonacci numbers are related to the previous and previous-previous terms. It doesn't tell us how they are correlated, simply that one can expect a linear dependency. Non-linear dependencies in time series are a whole separate domain.

Correlation between two variables can result from a mutual linear dependence on other variables (confounding). Autocorrelation functions of stationary finite order autoregressive processes are always sequences that converge to zero but do not break off. This makes it difficult to distinguish between processes of different orders when using the autocorrelation function. To cope with this problem the partial autocorrelation function (PACF) is used. The partial correlation between two variables is the correlation that remains if the possible impact of all other random variables has been eliminated:

$$ x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \cdots + \phi_k x_{t-k} + \eta_t $$

here the $\phi$'s are the PACF coefficients and $\eta$ is pure noise. The precise calculation of the coefficients is beyond the scope of this document; what matters is the conceptual point that the PACF can tell (suggest) you the order of an auto-regressive (AR) process.
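Continuing with the hypothetical `fib` vector from above:

```r
pacf(fib)   # partial autocorrelations of the Fibonacci terms
```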

*(figure: PACF plot of the Fibonacci terms)*

Of course, the Fibonacci sequence is not noisy, but if it were to be modelled as a so-called AR process then the PACF tells us that it would best be modelled as

$$ x_t = \phi_1 x_{t-1} + \eta_t $$

because of the spike at lag 1 in the PACF plot.

Another way of detecting correlation is by means of a lag-plot, which indicates correlations if the points are somewhat aligned along the diagonal. That is, if the points $x_t$ and $x_{t-1}$ are linearly related then the ratio $x_t / x_{t-1}$ should be a constant (slope). We give an example of a lag-plot below.

Mathematical analysis deals with pure series in the sense that they are defined by strict relations and are hence always exactly predictable. Real-world series are usually noisy and one thus includes 'noise' as an integral ingredient in the mix, resulting in difference equations with a noise term. This leads to the following definition: the first-order auto-regressive process AR(1) is defined as

$$ x_t = \delta + \alpha x_{t-1} + u_t $$

where $\alpha, \delta$ are constants and $u_t$ is stochastic noise.

Note that dropping this stochastic noise leads to a simple difference equation. With some coefficients redefined one can cast the AR(1) process into

$$ p(L)\, x_t = u_t $$

with a polynomial of order one. If this is not a first-order polynomial one gets the general AR(p) process. The well-known random walk corresponds to the AR(1) process with $\delta = 0, \alpha = 1$. If you expand a few times the way that an AR(1) process depends on previous points you can easily see that at any point

$$ x_k = \alpha^k x_0 + \frac{1 - \alpha^k}{1 - \alpha}\,\delta + \sum_{j=0}^{k-1} \alpha^j u_{k-j}. $$

From this general representation one can see that if $k \to \infty$ and $|\alpha| < 1$ then

$$ x_k \;\longrightarrow\; \frac{1}{1 - \alpha}\,\delta + \sum_{j=0}^{\infty} \alpha^j u_{k-j} $$

and the long-term behavior is independent of the initial value $x_0$; it is a purely stochastic dependence. Note that in general the $\delta$ series $1 + \alpha + \alpha^2 + \alpha^3 + \cdots$ will not converge if $|\alpha|$ is not less than one. For this reason one always assumes that an AR(1) process has $|\alpha| < 1$. If you try to define one in R with a value of one, for instance, you will get an error:
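For instance (a sketch; arima.sim checks the stationarity of the AR part):

```r
# this stops with an error because an AR coefficient of 1 is not stationary
arima.sim(model = list(ar = 1), n = 100)
```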

The arima.sim command creates examples of AR (and more general ones, see below) processes. You can specify the list of coefficients like so
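A sketch (the coefficient, length and seed are illustrative, not necessarily the ones behind the original plots):

```r
set.seed(42)
ar1 <- arima.sim(model = list(ar = 0.5), n = 200)   # AR(1) with alpha = 0.5
plot(ar1)
acf(ar1)
```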

*(figures: the simulated AR process)*

The lag-plot of this process is
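Using the simulated `ar1` series from the sketch above:

```r
lag.plot(ar1, lags = 1)   # x_t plotted against x_{t-1}
```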

*(figure: lag-plot of the simulated AR process)*

If you defined an AR(1) process with a higher (first) coefficient it would mean that the value is more strongly coupled to the immediately previous one, and this can be seen in the lag-plot;

*(figure: lag-plot of an AR(1) process with a larger coefficient)*

An AR process with $|\alpha| < 1$ is stationary, and the diff-plot also suggests this. Can one make this more precise? Yes, the thing to use here is the so-called augmented Dickey–Fuller test, which in our case gives
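A sketch with the tseries package, applied to the simulated `ar1` series:

```r
library(tseries)
adf.test(ar1)   # null hypothesis: the series has a unit root (non-stationary)
```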

and with a p-value of less than 0.05 one has to reject the null hypothesis, which in this case is that the series is non-stationary. Strictly speaking, the null is 'having a unit root', which corresponds to a non-stationary series.

While the AR process correlates values with past values, the moving average (MA) process correlates noise with past noise. The MA(1) process is defined as

$$ x_t = \mu + \eta_t + \theta_1 \eta_{t-1} $$

and more generically the MA(r) process is defined as

$$ x_t = \mu + \eta_t + \theta_1 \eta_{t-1} + \cdots + \theta_r \eta_{t-r} $$

with $\mu$ the mean value of the series, as can easily be seen since the mean value of the noise is zero.

Taken together, the AR and MA processes give

$$ x_t = \mu + \eta_t + \sum_{i=1}^{p} \phi_i x_{t-i} + \sum_{i=1}^{q} \theta_i \eta_{t-i} $$

which is called an ARMA(p,q) process. The $\mu$ term is related to, but not identical with, the mean of the series. It's easy to see that the mean is equal to

$$ \langle x \rangle = \frac{\mu}{1 - \sum_k \phi_k} $$

which also says that if the sum of the $\phi$ coefficients nears one the ARMA process is ill-defined.

If one drops the constant then the definition is equivalent to the following

$$ \left(1 - \sum_{i=1}^{p} \phi_i L^i\right) x_t = \left(1 - \sum_{i=1}^{q} \theta_i L^i\right) \eta_t. $$

If the left-hand polynomial (of some order $p'$ now) has a unit root with multiplicity $d$ then one can write

$$ \left(1 - \sum_{i=1}^{p'} \phi'_i L^i\right) = \left(1 - \sum_{i=1}^{p} \phi_i L^i\right)(1 - L)^d $$

with $p' = p + d$. If that's the case then

$$ \left(1 - \sum_{i=1}^{p} \phi_i L^i\right)(1 - L)^d x_t = \left(1 - \sum_{i=1}^{q} \theta_i L^i\right) \eta_t $$

is called an ARIMA(p, d, q) process (also known as a Box–Jenkins process), and one can add the $\mu$ term again to include a non-zero mean. This is usually called the drift of the series.

Here is an interesting analogy when dealing with differencing (i.e. a d-parameter). The parameter d=2 in an ARIMA(1,2,0) process means

$$ (1 - \phi_1 L)(1 - L)^2 x_t = \eta_t $$

and if you write out the differencing then you see that this amounts to looking at the acceleration of the data, much like one would look in mechanics at the acceleration rather than the position;

$$ a_t = (1 - L)^2 x_t = (x_t - x_{t-1}) - (x_{t-1} - x_{t-2}) \;\sim\; \frac{d^2 x(t)}{dt^2} = a(t) $$

and with this the ARIMA(1,2,0) is actually an AR(1) process of the acceleration

$$ (1 - \phi L)\, a_t = \eta_t. $$

One can also represent ARIMA models like Feynman diagrams;

which raises the question whether a tadpole diagram makes sense

but this would correspond to a model where

$$ x_t \sim \int \eta_t^2 \, dt $$

and higher orders giving loops on loops, all of which lies outside the linear domain of ARIMA models and leads to the so-called GARCH or (generalized) autoregressive conditional heteroskedasticity models, i.e. non-linear models of time series.

Besides SMA and lagging, one can also smooth a series by means of exponential smoothing, which correlates current values with past values via a damping factor. That is, the value at some point is related to all values in the past, but with an exponentially decreasing weight. Technically this amounts to mapping the series $x_t$ to another series $s_t$ via

$$ s_t = \alpha x_{t-1} + (1 - \alpha)\, s_{t-1} $$

and the parameter $\alpha \in [0, 1]$ puts more or less emphasis on the actual previous value. The name 'exponential' relates to the fact that if you write out a few terms you see that the past actual values are weighted with a geometric factor $\alpha (1 - \alpha)^k$.

One can see the effect of α by means of the ses command in the forecast package;
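A sketch; the two alpha values are arbitrary illustrations:

```r
library(forecast)
plot(x7, col = "grey")
lines(fitted(ses(x7, alpha = 0.2)), col = "blue")   # heavy smoothing
lines(fitted(ses(x7, alpha = 0.8)), col = "red")    # follows the data closely
```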

*(figure: simple exponential smoothing for different values of alpha)*

One can use simple exponential smoothing for forecasting (and it is usually considered in that context). The method can best be seen by looking at the equivalence of the ARIMA(0,1,1) process and simple exponential smoothing.

By definition the simple smoothing is

$$ s_t = s_{t-1} + \alpha (x_{t-1} - s_{t-1}) $$

and if you define $e_t = x_t - s_t$ then

$$ x_t - x_{t-1} = (e_t - e_{t-1}) + (s_t - s_{t-1}) $$

and hence

$$ (1 - L)\, x_t = e_t - (1 - \alpha)\, e_{t-1} $$

which, as stated, corresponds to an ARIMA(0,1,1) with $\theta = 1 - \alpha$ (in the sign convention of the polynomials used above). Strictly speaking one also needs to demonstrate that $e_t$ has a normal noise distribution $N(0, \sigma^2)$.

Now, the strength of the smoothing equation is that it does not contain the noise anymore so if α can be found then the recursion leads to a forecast. Because of the equivalence we can try to fit an ARIMA process and use the α value in the smoothing equation.
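A sketch of that idea; note that stats::arima writes its MA part with a plus sign, so its ma1 coefficient corresponds to $-(1-\alpha)$ and hence $\alpha \approx 1 + \mathrm{ma1}$:

```r
fit011 <- arima(x7, order = c(0, 1, 1))
alpha_hat <- 1 + coef(fit011)["ma1"]   # recovered smoothing parameter
alpha_hat
```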

Besides being interesting examples of parametric time series, ARIMA processes are a way to forecast series. If you can parametrize an ARIMA model based on the given data you can obviously continue the series, based on the very definition of the ARIMA process. So, forecasting with ARIMA amounts to finding the most appropriate parameters for the model.

How can this be done? What does it mean to have a good parametrization?

The following is a toolbox of quick results;

  • the fastest way to estimate the ARIMA parameters of a series is by means of the forecast::auto.arima command which will give all the info you can wish for
  • if the series is non-stationary you need differencing; ARMA models only deal with stationary data
  • the first few peaks of the PACF give you the AR order
  • the first few peaks of the ACF give you the MA order
  • the so-called Akaike Information Criterion (AIC) is an indication of how well the parametrization works for the data; the lower the value, the better. The criterion is a measure of how much information is lost by the model, and hence a lower value means less loss.
  • the residuals (differences between model values and actual values) should not show any correlations and should be pure noise. If not, it means that the model did not fully capture all the correlations in the data. A good model should account for the information content of the data, and what remains should be in the hands of chance/noise.

Obviously, all of these tricks have deep roots and proofs but this goes beyond this text.

Example: using auto.arima

Let's take again the data used above and use auto.arima to fit and forecast it.
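A sketch, using the hypothetical `x7` series from above:

```r
library(forecast)
fit <- auto.arima(x7)
fit
```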

The suggestion from auto.arima is that the model is an ARIMA(1,1,1)(2,0,0)[7]. So, the original data is not stationary, since d=1 is needed. In addition, the second triple of numbers refers to a seasonal component in the data, which can be modelled through an AR(2) process.

The AIC here is 1941.6, which on its own does not say a lot. If you wish to see visually how well the model fits the data you can use the fitted command like so
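A sketch:

```r
plot(x7, col = "grey")
lines(fitted(fit), col = "red")   # model values on top of the actual data
```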

*(figure: fitted model values on top of the actual data)*

Finally, the forecast command tells you the (probabilistic) future. A forecast a number of time steps ahead looks like the following.
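A sketch of the call (the 14-step horizon is a hypothetical choice; the horizon used for the original plot is not recorded here):

```r
plot(forecast(fit, h = 14))   # 14 steps ahead, with prediction intervals
```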

*(figure: forecast with prediction intervals)*

Example: using simple exponential smoothing

Let's see how exponential smoothing does on the data above
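A sketch:

```r
fit_ses <- ses(x7)
plot(fit_ses)
fit_ses$model   # printing the underlying model also shows its AIC
```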

*(figure: simple exponential smoothing fit of the data)*

Just by looking at this plot it's clear that this approach to the data is not as effective as the ARIMA one. This can be seen numerically from the AIC, which is much higher than the ARIMA one above (2502 vs. 1941).

Example: parametrizing ARIMA

The auto.arima command contains a lot of high tech and certainly does a good job of inferring a model, but if you wish to explore things yourself, follow the recipe above. Let's take the well-known tutorial series describing the age at death of successive kings of England.
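A sketch of loading and inspecting the data; the URL is the one used in the well-known tutorial and may have moved:

```r
kings_raw <- scan("http://robjhyndman.com/tsdldata/misc/kings.dat", skip = 3)
kings <- ts(kings_raw)
plot(kings)

library(tseries)
adf.test(kings)   # unit-root test on the raw series
```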

*(figure: the kings series, age at death of successive kings of England)*

Looking at the unit-root test we see that the p-value of 0.529 is above the significance level of 0.05, so we cannot reject the null hypothesis that the series is non-stationary. This can also be seen from the plot, but the test is more solid.

So, we need to take a difference in order to use ARMA (remember, ARMA only deals with stationary series). Taking a diff of order one indeed leads to a more stationary-looking plot (which can also be confirmed via the Dickey–Fuller test);
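A sketch:

```r
kings_d1 <- diff(kings, differences = 1)
plot(kings_d1)
adf.test(kings_d1)   # the unit-root null should now be rejected
```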

*(figure: the differenced kings series)*

The fact that d=1 works means that we'll have an ARIMA(·,1,·) model. Looking at the ACF we see that lag 1 is outside the significance bounds, and this means that ARIMA(0,1,1) could be a good model.

*(figure: ACF of the differenced kings series)*

Looking at the PACF we see that lag 3 is outside the bounds, and this suggests that ARIMA(3,1,0) could be a good model.

*(figure: PACF of the differenced kings series)*

A combination of both is also possible (i.e. ARIMA(3,1,1)), but one usually takes the model with the smallest number of parameters. This is the so-called principle of parsimony. In our case this leads to the ARIMA(0,1,1) model. If you try auto.arima this is also what it suggests.

So, let's define our model
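A sketch:

```r
kings_arima <- arima(kings, order = c(0, 1, 1))
kings_arima
```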

and a fitted model to the data looks like the following;

*(figure: ARIMA(0,1,1) fit of the kings series)*

It might look like the fit is not very good but if you look at a QQ plot of the residuals you can see that it really is noise.
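For instance:

```r
qqnorm(residuals(kings_arima))
qqline(residuals(kings_arima))
```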

*(figure: QQ plot of the residuals)*

You can, in addition, also look at the ACF of the residuals and confirm in this fashion that the residuals are not correlated.

*(figure: ACF of the residuals)*

And if you really want to go a step further you can use the Box–Pierce or Ljung–Box test, which tests the null hypothesis that the data is independent;
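A sketch (the number of lags is an arbitrary choice):

```r
Box.test(residuals(kings_arima), lag = 20, type = "Ljung-Box")
```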

and since the p-value is above 0.05 we cannot reject the null hypothesis, i.e. the residuals are consistent with pure noise.
Moving on, now that we have a model we can forecast the series;

*(figure: forecast of the kings series)*

You can check that if we had taken an ARIMA(3,1,1) model we would have had a slightly higher loss of information (348 vs. 344);
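A sketch of that check:

```r
AIC(arima(kings, order = c(3, 1, 1)))
```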

348.587871362848

Example: non-linear fitting

The following is an example of a series which cannot be cast into the ARIMA framework and cannot be differenced to any order into a stationary series. The reason is that the exponential on which it's based does not decrease under differentiation (of which differencing is the discrete analogue).

*(figure: the exponential-like series)*

A plot of the second difference gives

*(figure: second difference of the series)*

There are various ways to deal with such data. One way is to 'guess' the shape of the underlying function and let the nls function do the work of finding out which parameters fit best.
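A sketch of that idea; the functional form and starting values below are hypothetical guesses (the actual generating function is not shown here), and the raw values are assumed to sit in a vector `z`:

```r
tt <- seq_along(z)
fit_nls <- nls(z ~ a * exp(b * tt),            # guessed exponential shape
               start = list(a = 1, b = 0.01))
coef(fit_nls)
AIC(fit_nls)   # information loss of the fitted model
```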

The values computed here are very close to the ones the series was based on and the information loss is

809.06629402229

Let's take a closer look at the seasonal ARIMA models now, sometimes called SARIMA. The ARIMA(p,d,q) model corresponds to the equation

$$ \phi(L)\,(1 - L)^d x_t = \theta(L)\, \eta_t $$

where $\phi, \theta$ are the AR and MA polynomials respectively. As a side remark, one can also define the so-called ARFIMA models (auto-regressive fractionally integrated moving average), where the differencing order is not a natural number anymore but a real number.

The SARIMA model takes the seasonality into account by performing the same ARIMA-trick but on a larger scale;

$$ \Phi_P(L^s)\, \phi_p(L)\, (1 - L^s)^D (1 - L)^d x_t = \Theta_Q(L^s)\, \theta_q(L)\, \eta_t $$

where $s$ refers to the cycle (the seasonal period) and $\Phi, \Theta$ are polynomials in powers of $L^s$. This model gets the notation

(p,d,q)(P,D,Q)[s]

For example, the ARIMA(0, 0, 0)(1, 0, 0)[9] model corresponds to the equation

$$ x_t = x_{t-9} + \eta_t $$

and looks like the following
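A sketch of a closely related simulation using a recursive filter; to keep the example stationary the seasonal coefficient is set to 0.8 instead of 1, and the length and seed are arbitrary:

```r
set.seed(1)
eta <- rnorm(180)
x9 <- stats::filter(eta, filter = c(rep(0, 8), 0.8), method = "recursive")
plot(x9, type = "l")   # x_t = 0.8 * x_{t-9} + eta_t
```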

*(figure: simulated seasonal ARIMA(0,0,0)(1,0,0)[9]-type series)*

If you want to use the arima command here and parametrize the model you do it like so
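A sketch of the call, applied to the simulated `x9` series from above:

```r
fit_s <- arima(x9, order = c(0, 0, 0),
               seasonal = list(order = c(1, 0, 0), period = 9))
fit_s
```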

*(figure: the fitted seasonal model)*
