10. Autoregressive Processes

- So far in this course, we have restricted ourselves to models with very few free parameters
- these models are simplifications of the true system and represent key features, such as relaxation times, oscillatory modes, SEIR-model parameters
- In each case, it was clear what we could learn from each parameter and why it was useful to have the parameter in the model
- Including more parameters, we can always get a better fit of the model to the data

The AR(1) model \[x_{t} = a x_{t-1} + \sigma {\xi}_t,\] where the noise has zero mean and unit variance, has two parameters

- There are different ways to fit these parameters, for example:
- the autocorrelation function
- the TAMSD
- a maximum likelihood estimator

- If the model is only an approximate description of the data, we cannot expect these fits to yield the same result, as the sketch below illustrates
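As a concrete illustration, here is a minimal Python sketch (synthetic data, illustrative parameter values) comparing two of these fits for the AR(1) parameter a: one based on the lag-1 autocorrelation, and one based on conditional least squares, which coincides with the maximum likelihood estimator for Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic AR(1) data: x_t = a x_{t-1} + sigma * xi_t  (illustrative values)
a_true, sigma, T = 0.7, 1.0, 10_000
x = np.zeros(T)
for t in range(1, T):
    x[t] = a_true * x[t - 1] + sigma * rng.standard_normal()

# fit 1: lag-1 autocorrelation (for AR(1), the theoretical ACF gives c_1 = a)
x0 = x - x.mean()
a_acf = np.mean(x0[1:] * x0[:-1]) / np.var(x0)

# fit 2: conditional least squares = maximum likelihood for Gaussian noise
a_ls = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
sigma2_ls = np.mean((x[1:] - a_ls * x[:-1]) ** 2)

print(f"ACF fit: a = {a_acf:.3f};  LS/ML fit: a = {a_ls:.3f}, sigma^2 = {sigma2_ls:.3f}")
```

For data that truly follow an AR(1) process, the two estimates agree up to finite-sample fluctuations; for real data that are only approximately AR(1), they will in general differ.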

**Bias:** bias is the part of the fit error that is unavoidable when the model does not have sufficient parameters to represent the full complexity of the data

- models with fewer parameters have more bias

**Variance:** variance is the part of the fit error that comes from the sensitivity of the fitted parameters to the particular noisy, finite data set

- models with more parameters typically have a higher variance

The squared error of the model can be decomposed into bias and variance
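In its simplest form, for a prediction \(\hat{x}\) of a true value x, the decomposition reads \[ \langle (\hat{x}-x)^2 \rangle = \underbrace{\left(\langle \hat{x}\rangle - x\right)^2}_{\text{bias}^2} + \underbrace{\langle (\hat{x}-\langle \hat{x}\rangle)^2 \rangle}_{\text{variance}}, \]where the average is taken over realizations of the data used for the fit (an additional irreducible noise term appears if x itself is observed with noise).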

More information on the fundamental concepts of learning and fitting models can be found at https://work.caltech.edu/telecourse

Remember: Mean squared errors

Data: z(k,l); model: x(k,l) = ak + bl + c. For the best model, minimize \[ \sum_k\sum_l (ak+bl+c-z(k,l))^2 \]

Ridge regression: favor small a, b (a small impact of each parameter, which is especially realistic for models with many dimensions) by adding a term that penalizes large slopes, \[ \sum_k\sum_l (ak+bl+c-z(k,l))^2 + \lambda (a^2 + b^2), \] with a suitable weight λ

Additional advantage: models with more dimensions (parameters) than data points can still be fitted, because the penalty term makes the problem well-posed
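A minimal Python sketch of this ridge fit in closed form (the grid of (k, l) values, the underlying plane, and the weight λ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# illustrative data z(k, l) on a small grid: a plane plus noise
K, L = np.meshgrid(np.arange(10), np.arange(10), indexing="ij")
z = 0.5 * K + 1.5 * L + 2.0 + rng.normal(scale=0.3, size=K.shape)

# design matrix with columns (k, l, 1) for the model a*k + b*l + c
X = np.column_stack([K.ravel(), L.ravel(), np.ones(K.size)])
y = z.ravel()

lam = 1.0                       # ridge weight lambda
P = np.diag([1.0, 1.0, 0.0])    # penalize only the slopes a and b, not the offset c

# closed-form ridge solution: (X^T X + lambda P) w = X^T y
a, b, c = np.linalg.solve(X.T @ X + lam * P, X.T @ y)
print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}")
```

The same closed-form expression stays usable when the design matrix has more columns (parameters) than rows (data points), since the penalty keeps the matrix invertible.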

**Real data is noisy**

**Extreme case:** a model that learns every situation from the past exactly is 'memorizing' these situations; however, predictions are about situations outside of the past data, so such a model is not helpful.

If the model has too many free parameters, it 'overfits': it fits the noise

Overfitting is increased by

- low number of data points
- noise
- complexity of the true underlying dynamics

Backshift operator \[ Bx_t = x_{t-1} \Rightarrow B^jx_t = x_{t-j} \]

The AR(p) process can be written as \[ x_t=\sum_{j=1}^p a_jB^j x_t + \xi_t \]

- These models have an infinite impulse response
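A short Python sketch (coefficients and length are illustrative) that simulates an AR(2) process directly from the recursion above:

```python
import numpy as np

rng = np.random.default_rng(1)

# AR(2): x_t = a_1 x_{t-1} + a_2 x_{t-2} + xi_t  (illustrative coefficients)
a = np.array([1.2, -0.5])
p, T = len(a), 5_000
x = np.zeros(T)
for t in range(p, T):
    # sum_{j=1}^{p} a_j x_{t-j}: the reversed slice is (x_{t-1}, ..., x_{t-p})
    x[t] = np.dot(a, x[t - p:t][::-1]) + rng.standard_normal()
```

Because each x_t feeds back into all later values, a single noise impulse influences the series at all later times, which is what "infinite impulse response" refers to.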

The standard literature on this topic is the book *Time Series Analysis: Forecasting and Control* by Box and Jenkins.

For an AR(2) process, solve

\[ 1-a_1G^{-1}-a_2G^{-2} = 0 \] \[ G_{1,2} = \frac{a_1\pm \sqrt{a_1^2+4a_2}}{2} \quad [{\Rightarrow}\; a_1=G_1+G_2, \;\;\; a_2=-G_1G_2] \]The roots G_1, G_2 can be real or complex (complex roots lead to oscillations)

The autocorrelations have the form: \[ c_\Delta = h_1G_1^\Delta+h_2G_2^\Delta \]

Stationarity: \[ |G_i| < 1 \;\; \forall\, i \]

From the AR(p) equation we define

\[ \xi_t = \Phi(B)x_t. \]This polynomial can be factored as

\[ \Phi(B) = 1 - \sum_{j=1}^p a_jB^j = \prod_{j=1}^p (1-G_jB), \]where the G_j^{-1} are the roots of \[ \Phi(B)=0 \]

The process is stationary if |G_j| < 1 for all j, i.e. if all roots of \[ \Phi(B)=0 \] lie outside the unit circle
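A small sketch (coefficients illustrative) that computes G_1, G_2 for an AR(2) model and checks stationarity and whether the autocorrelations oscillate:

```python
import numpy as np

a1, a2 = 1.2, -0.5   # illustrative AR(2) coefficients

# G_1, G_2 solve G^2 - a1 G - a2 = 0 (the characteristic equation multiplied by G^2)
G = np.roots([1.0, -a1, -a2])

print("G =", G)
print("stationary:", np.all(np.abs(G) < 1))
print("oscillatory (complex roots):", np.any(np.imag(G) != 0))
```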

**Partial autocorrelation:** the correlation at some lag k that remains after removing the effect of correlations due to the terms at shorter lags.

Yule-Walker equations

\[ \left(\begin{array}{c c c c c}1&c_1&c_2&\cdots &c_{k-1}\\ c_1&1&c_1&\cdots &c_{k-2}\\ \vdots &\vdots &\vdots &\cdots &\vdots\\ c_{k-1}&c_{k-2}&c_{k-3}&\cdots &1\end{array}\right)\left(\begin{array}{c}\Phi_{k1}\\ \Phi_{k2}\\ \vdots\\ \Phi_{kk}\end{array}\right) = \left(\begin{array}{c}c_{1}\\ c_{2}\\ \vdots\\ c_{k}\end{array}\right) \]The solution is the partial autocorrelation function

\[ \Phi_{kk} = Corr(x_t,x_{t-k}|x_{t-1},x_{t-2},...,x_{t-k+1}) \]If the process is of order p, the partial autocorrelation function has cutoff after lag p \[\Phi_{kk}=0 \;\; \forall \; k>p\]
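A Python sketch of the partial autocorrelation function obtained by solving the Yule-Walker system lag by lag (the biased sample ACF used here is one common convention; the function names are illustrative):

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation c_k for k = 0..max_lag (biased estimator)."""
    x = x - x.mean()
    c0 = np.dot(x, x) / len(x)
    return np.array([np.dot(x[k:], x[:len(x) - k]) / (len(x) * c0)
                     for k in range(max_lag + 1)])

def pacf_yule_walker(x, max_lag):
    """Partial autocorrelation Phi_kk from the Yule-Walker equations."""
    c = sample_acf(x, max_lag)
    pacf = np.empty(max_lag)
    for k in range(1, max_lag + 1):
        # Toeplitz matrix of autocorrelations c_0 .. c_{k-1}
        R = np.array([[c[abs(i - j)] for j in range(k)] for i in range(k)])
        phi = np.linalg.solve(R, c[1:k + 1])
        pacf[k - 1] = phi[-1]      # the last component is Phi_kk
    return pacf
```

For an AR(p) series, the estimated Φ_kk fluctuate around zero for k > p, which is how the order p is read off in practice.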

Negative conditional log-likelihood

\[ L(a,v)=\frac{1}{2}\log(v) + \frac{\frac{1}{T-p}\sum_{t=p}^{T-1} (x_t-\sum_{j=1}^p a_jx_{t-j})^2}{2v}, \]is a useful approximation to the true negative log-likelihood.

A more accurate approximation is given by \[ L(a,v)=\frac{1}{2}\log(v) + \frac{\frac{1}{T-1}\sum_{t=-\infty}^{T-1} \langle\tilde{\xi}_t^2(a|x,\xi)\rangle}{2v}, \]
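A minimal sketch of the conditional fit in Python: minimizing L(a, v) over a is an ordinary least-squares problem in the lagged values, and the optimal v is then the mean squared residual (the function name is illustrative):

```python
import numpy as np

def fit_ar_conditional_ml(x, p):
    """Conditional maximum-likelihood fit of an AR(p) model.

    Minimizing L(a, v) over a is a least-squares problem; the optimal
    v is the mean squared residual of that fit.
    """
    T = len(x)
    # design matrix: the row for time t contains x_{t-1}, ..., x_{t-p}
    X = np.column_stack([x[p - j:T - j] for j in range(1, p + 1)])
    y = x[p:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    v = np.mean((y - X @ a) ** 2)
    nll = 0.5 * np.log(v) + 0.5      # value of L(a, v) at the optimum
    return a, v, nll
```

Applied to a series simulated from known coefficients, the returned a reproduces them up to fluctuations that shrink with the length of the series.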

A moving-average process of order q, MA(q), can be defined as \[ x_t = (1-\sum_{j=1}^{q} m_jB^j) \xi_t \]

It has the autocorrelation function \[ c_k = \left\lbrace \begin{array}{l l} \frac{-m_k + m_1m_{k+1}+...+ m_{q-k} m_q}{1 + m_1^2 + ... + m_q^2} & k=1,2,...,q\\ 0 & k>q \end{array} \right. \]

Find order q: last non-zero lag of the autocorrelation function

- MA models have a finite impulse response
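A small sketch computing the theoretical MA(q) autocorrelation function from the formula above (the coefficients are illustrative):

```python
import numpy as np

def ma_acf(m, max_lag):
    """Theoretical autocorrelation c_k of an MA(q) process
    x_t = (1 - sum_j m_j B^j) xi_t, for k = 1..max_lag."""
    m = np.asarray(m, dtype=float)
    q = len(m)
    psi = np.concatenate(([1.0], -m))   # impulse response: psi_0 = 1, psi_j = -m_j
    denom = np.sum(psi ** 2)            # 1 + m_1^2 + ... + m_q^2
    c = np.zeros(max_lag)
    for k in range(1, max_lag + 1):
        if k <= q:
            c[k - 1] = np.sum(psi[:q + 1 - k] * psi[k:]) / denom
        # c_k = 0 for k > q (finite impulse response)
    return c

print(ma_acf([0.6, -0.3], max_lag=5))   # zero beyond lag q = 2
```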

- Example: AR(1): \[ x_t = a x_{t-1} + \xi_t \] \[ (1-aB) x_t = \xi_t \] \[ x_t = \frac{1}{1-aB} \xi_t = (1+aB+a^2B^2+...)\, \xi_t \]
- AR(p) models can be written in the moving-average representation as an infinite sum \[ x_t = \sum_{j=1}^p a_jB^j x_t + \xi_t = \xi_t + \sum_{j=1}^\infty \psi_j\, \xi_{t-j} \] with suitable weights ψ_j
- On the other hand, MA(q) models can be written as an infinite sum in the autoregressive representation \[ x_t = \xi_t - \sum_{j=1}^q m_jB^j \xi_t = \xi_t + \sum_{j=1}^\infty \pi_j\, x_{t-j} \] with suitable weights π_j
- 'Duality' of the two types of models

AR and MA models can be mixed; this is called an ARMA(p,q) model \[ x_t = \sum_{j=1}^p a_jB^j x_t + \xi_t - \sum_{j=1}^q m_jB^j \xi_t \]

- the orders p and q can be determined independently from the partial autocorrelation function and the autocorrelation function, respectively
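In practice such models are usually fitted with an existing library. A minimal sketch, assuming statsmodels is installed and using placeholder data (note that statsmodels writes the MA part with a plus sign, so its coefficients correspond to -m_j in the notation used here):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
x = rng.standard_normal(500)        # placeholder series; use the real data here

# ARMA(p, q) is ARIMA(p, 0, q); the orders p = 2, q = 1 are illustrative
result = ARIMA(x, order=(2, 0, 1)).fit()
print(result.params)                # AR coefficients, MA coefficients, noise variance
```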

**Fractionally integrated noise:** for non-integer d, the process defined by \[ (1-B)^d x_t = \xi_t \]

Autoregressive representation: \[ x_t=\sum_{k=1}^\infty (-1)^{k+1}\frac{\prod_{a=0}^{k-1}(d-a)}{k!}x_{t-k}+\xi_t \]

Moving-average representation: \[ x_t = (1-B)^{-d}\xi_t = \xi_t + \sum_{k=1}^\infty \frac{\prod_{a=0}^{k-1}(d+a)}{k!}\,\xi_{t-k} \]

- Both are infinite series
- The process has power-law decay of correlations; the autocorrelation time diverges for d > 0, i.e. for a Hurst exponent H = d + 1/2 > 0.5

ARFIMA(p,d,q): \[ (1-\sum_{j=1}^p a_jB^j)(1-B)^d x_t = (1-\sum_{j=1}^q m_j B^j) \xi_t \]
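A short Python sketch of fractional differencing, i.e. applying (1-B)^d to a series by truncating the binomial expansion after a finite number of lags (d and the truncation length are illustrative; the function names are ad hoc):

```python
import numpy as np

def frac_diff_weights(d, n):
    """Coefficients of (1-B)^d up to lag n: w_0 = 1, w_k = -w_{k-1} (d - k + 1) / k."""
    w = np.empty(n + 1)
    w[0] = 1.0
    for k in range(1, n + 1):
        w[k] = -w[k - 1] * (d - k + 1) / k
    return w

def frac_diff(x, d, n_lags=100):
    """Apply (1-B)^d to x, truncating the infinite expansion after n_lags terms."""
    w = frac_diff_weights(d, n_lags)
    return np.convolve(x, w, mode="full")[:len(x)]

# example: fractionally difference a placeholder series with d = 0.3
y = frac_diff(np.random.default_rng(3).standard_normal(1000), d=0.3)
```

Applying the analogous weights of (1-B)^{-d} to white noise instead gives a simple way to simulate a long-memory series.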