Datenanalyse und Stochastische Modellierung
9. Kalman Filter

### The AR(1) process as a low pass filter

• Treat the noise term as your signal
• The AR(1) process smooths this signal
• In other words, it increases autocorrelations at short time lags

### Moving Averages

• Sliding window average
• Similar effect to AR(1) process
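Both smoothing effects can be illustrated in a short sketch (assuming Python with NumPy; the coefficient a = 0.8 and the window length 9 are arbitrary illustrative choices, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.normal(size=5000)          # white-noise "signal"

# AR(1): x_t = a * x_{t-1} + eps_t acts as a low-pass filter on eps
a = 0.8
x = np.zeros_like(eps)
for t in range(1, len(eps)):
    x[t] = a * x[t - 1] + eps[t]

# A sliding-window (moving) average has a similar smoothing effect
window = 9
ma = np.convolve(eps, np.ones(window) / window, mode="same")

# Both outputs are strongly autocorrelated at lag 1, unlike white noise
def lag1_autocorr(y):
    y = y - y.mean()
    return np.dot(y[:-1], y[1:]) / np.dot(y, y)

print(lag1_autocorr(eps), lag1_autocorr(x), lag1_autocorr(ma))
```

For a = 0.8 the lag-1 autocorrelation of x should be near 0.8, while the raw noise stays near 0.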

### 'Online Learning'

• We have a series of measurements for a variable Z
• Each measurement has some uncertainty
• What is the best estimate after each new measurement?
$x_{t} = \frac{1}{t} \sum_{n=1}^{t} z_{n} = \frac{1}{t} \sum_{n=1}^{t-1} z_n + \frac{1}{t} z_t = \frac{t-1}{t} \frac{1}{t-1} \sum_{n=1}^{t-1} z_n + \frac{1}{t} z_t =$ $= \frac{t-1}{t} x_{t-1} + \frac{1}{t} z_t = x_{t-1} + \frac{1}{t} (z_t-x_{t-1})$
• Much less storage is necessary, compared to recomputing the mean from all measurements after each step
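The recursion above needs only the running estimate and the counter t; a minimal sketch in plain Python:

```python
# Incremental (online) mean: x_t = x_{t-1} + (z_t - x_{t-1}) / t
# Only the current estimate and the count are stored, never all measurements.
def online_mean(measurements):
    x = 0.0
    for t, z in enumerate(measurements, start=1):
        x += (z - x) / t
    return x

zs = [2.0, 4.0, 6.0, 8.0]
print(online_mean(zs))  # equals sum(zs) / len(zs) = 5.0
```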

### Alpha-Beta-Filter

What if we want to estimate a parameter that is not constant?

Update

$x_{t|t} = x_{t|t-1} + \alpha ( z_t - x_{t|t-1} )$ $\dot{x}_{t|t} = \dot{x}_{t|t-1} + \beta \left( \frac{z_t - x_{t|t-1}}{\Delta t} \right)$

Predict

$x_{t+1|t} = x_{t|t} + \dot{x}_{t|t} \cdot \Delta t$ $\dot{x}_{t+1|t} = \dot{x}_{t|t}$

### Alpha-Beta-Filter: Example

Tracking a random walker
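The random-walker example can be sketched as follows (assuming Python with NumPy; the drift, noise levels, and the gains α = 0.5, β = 0.1 are illustrative choices, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)

# True position: random walk with constant drift; noisy measurements z
dt, T = 1.0, 200
truth = np.cumsum(0.5 + rng.normal(0, 0.1, T))     # drift 0.5 per step
z = truth + rng.normal(0, 2.0, T)                  # measurement noise

alpha, beta = 0.5, 0.1
x, v = z[0], 0.0              # initial position and velocity estimates
estimates = []
for zt in z:
    # Predict
    x_pred = x + v * dt
    # Update with the innovation (residual)
    r = zt - x_pred
    x = x_pred + alpha * r
    v = v + beta * r / dt
    estimates.append(x)

estimates = np.asarray(estimates)
rmse_filter = np.sqrt(np.mean((estimates - truth) ** 2))
rmse_raw = np.sqrt(np.mean((z - truth) ** 2))
print(rmse_filter, rmse_raw)   # the filter should track better than raw data
```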

### Kalman Filter weights

Similar to the alpha-beta filter, the Kalman filter makes estimates based on the previous estimate and the latest measurement z

$\hat{x}_{t} = Kz_t + (1-K) \hat{x}_{t-1}$

Treating x as a stochastic variable, we can look at the variance

$p_{t} = K^2 {\sigma_z^2}_t + (1-K)^2 p_{t-1}$

Now we minimize p with respect to K

$\frac{\mathrm{d}p_{t}}{\mathrm{d}K} = 2K{\sigma_z^2}_t - 2(1-K)p_{t-1} = 0$ $K({\sigma_z^2}_t + p_{t-1}) - p_{t-1} = 0 \; \; \; \Rightarrow \; \; \; K = \frac{ p_{t-1} }{ p_{t-1} + {\sigma_z^2}_t }$

### Kalman Filter 1D constant state

Update

$x_{t|t} = x_{t|t-1} + \frac{p_{t|t-1}}{p_{t|t-1}+r_t} ( z_t - x_{t|t-1} )$ $p_{t|t} = \left( 1 - \frac{p_{t|t-1}}{p_{t|t-1}+r_t} \right) p_{t|t-1}$

Predict

$x_{t+1|t} = x_{t|t}$ $p_{t+1|t} = p_{t|t}$
• Assumption: Gaussian errors, so the distribution is fully specified by its mean and variance
• estimates of the variance p are independent of the measurements
• the weight is called the Kalman gain $K_t = \frac{p_{t|t-1}}{p_{t|t-1}+r_t} , \; \; \mbox{ with } \; \; 0<K_t<1$
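A minimal sketch of the 1D constant-state filter (assuming Python with NumPy; the true value, the noise variance r, and the broad prior variance are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

true_value = 10.0
r = 4.0                                   # measurement noise variance
z = true_value + rng.normal(0, np.sqrt(r), 100)

x, p = 0.0, 100.0                         # prior mean and (large) variance
for zt in z:
    K = p / (p + r)                       # Kalman gain, 0 < K < 1
    x = x + K * (zt - x)                  # update the mean
    p = (1 - K) * p                       # update the variance
    # prediction step for a constant state: x and p stay unchanged

print(x, p)   # x approaches the true value, p shrinks toward 0
```

Note that p evolves independently of the measurements, exactly as stated in the bullet above.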

### Bayes Theorem

$P( M | D ) = \frac{P(M\cap D)}{P(D)} = \frac{\frac{P(M\cap D)}{P(M)}P(M)}{P(D)} = \frac{P( D | M )P(M)}{P(D)}$
• Prior P(M)
• Posterior P(M|D)
• Likelihood P(D|M)
• Evidence P(D)

### Simple example

A test that can be positive (p) or negative (n) is applied to a person who can be infected (i) or healthy (h)

$\mbox{Likelihood: }\; P(p|i)=0.93 \;\;\;\;\;\; P(n|h)=0.99 \;\;\;\;\; \mbox{Prior:}\; P(i)=0.00148$ $P(i\& p) = P(i)P(p|i) = 0.0013764 \;\;\;\;\; P(p)=P(h\& p)+P(i\& p) = 0.0113616$ $P(i|p) = P(i\& p)/P(p) = 0.12$

Second test with prior P(i)=0.12

$P(i|pp) = \frac{P(i)P(pp|i)}{P(pp)} = \frac{P(i)P(pp|i)}{P(i)P(pp|i)+P(h)P(pp|h)} = \frac{0.12\cdot 0.93}{0.12\cdot 0.93 + (1-0.12)\cdot (1-0.99)} = 0.93$
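The arithmetic of both tests can be checked numerically (plain Python; the helper name `posterior` is an illustrative choice, not from the lecture):

```python
# Reproducing the test example with the values from the slides
p_i = 0.00148          # prior probability of infection, P(i)
sens = 0.93            # P(p|i), sensitivity
spec = 0.99            # P(n|h), specificity

def posterior(prior, sens, spec):
    p_ip = prior * sens                     # P(i & p)
    p_hp = (1 - prior) * (1 - spec)         # P(h & p), false positives
    return p_ip / (p_ip + p_hp)             # Bayes: P(i | p)

first = posterior(p_i, sens, spec)          # first positive test
second = posterior(first, sens, spec)       # second test, updated prior
print(round(first, 2), round(second, 2))
```

The first posterior is about 0.12 and, used as the new prior, a second positive test raises it to about 0.93.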

### Bayesian Filter

In our case, Bayes' rule defines the update step in general as $P(x_t|z_{1...t}) = \frac{P(z_t|x_t) P(x_t|z_{1...t-1})}{P(z_t|z_{1...t-1})} = \frac{P(z_t|x_t) P(x_t|z_{1...t-1})}{\int P(z_t|x_t) P(x_t|z_{1...t-1})\mathrm{d}x_t}$

$P(x_{t|t}) \propto P(z_t|x_t) P(x_{t|t-1})$

### Linear Systems (The Kalman Filter)

Hidden Markov model

$x_{t} = F x_{t-1} + q_{t-1}$ $z_t = Hx_t + r_t$ where q is Gaussian white noise with covariance Q, r is Gaussian white noise with covariance R, x is a vector, and F and H are matrices

Accordingly, the algorithm is update: $K_t = p_{t|t-1} H^T \left( Hp_{t|t-1} H^T + R_t \right)^{-1}$ $x_{t|t} = x_{t|t-1} + K_t ( z_t-Hx_{t|t-1} ) \;\;\;\; p_{t|t} = (\mathbb{1}-K_t H) p_{t|t-1} (\mathbb{1} - K_t H)^T + K_t R_t K_t^T$ and prediction: $x_{t+1|t} = F x_{t|t} \;\;\;\;\; p_{t+1|t} = F p_{t|t} F^T + Q_t$
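A sketch of the matrix form for a toy constant-velocity model (assuming Python with NumPy; F, H, Q, R below are illustrative choices, and the covariance update uses the Joseph form as on the slide):

```python
import numpy as np

rng = np.random.default_rng(3)

# Constant-velocity model: state x = (position, velocity), observe position
dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)                    # process noise covariance
R = np.array([[4.0]])                   # measurement noise covariance

# Simulate hidden truth and noisy measurements
x_true = np.array([0.0, 1.0])
zs = []
for _ in range(100):
    x_true = F @ x_true + rng.multivariate_normal([0, 0], Q)
    zs.append(H @ x_true + rng.normal(0, np.sqrt(R[0, 0]), 1))

# Kalman filter: update then predict, as on the slide
x, P = np.zeros(2), 10.0 * np.eye(2)
for z in zs:
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)       # Kalman gain
    x = x + K @ (z - H @ x)                            # state update
    I_KH = np.eye(2) - K @ H
    P = I_KH @ P @ I_KH.T + K @ R @ K.T                # Joseph form
    x = F @ x                                          # predict state
    P = F @ P @ F.T + Q                                # predict covariance

print(x, P)
```

The covariance recursion does not depend on the data, so P converges to a steady state well below the raw measurement variance.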

Find a slower introduction at www.kalmanfilter.net

### Extended Kalman Filter

• If F and H are non-linear, we can use a linearized version
$F(x) \approx F(\mu) + (\partial_x F)(\mu)(x-\mu)$ $H(x) \approx H(\mu) + (\partial_x H)(\mu)(x-\mu)$

where the derivatives are given by the Jacobian matrices

### Monte Carlo Approaches

• Recall the definition of the ensemble average $\langle f(x) \rangle = \int f(x)\rho(x)\;\mathrm{d}x$
• In practice (in the exercise) this was measured by simulating M realizations of x_t (different initial conditions or process noise), so the ensemble $x_t^{(m)}$ represents the distribution $\rho(x_t)$ at time t
• In the case of discrete states the probabilities can be easily identified as weights of the states: $\sum_{i} f(x_i)P(x=x_i) \approx \frac{1}{M} \sum_m f(x^{(m)})$
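A minimal Monte Carlo sketch of the ensemble average (assuming Python with NumPy; here ρ(x) is taken to be a standard normal so that the exact value ⟨x²⟩ = 1 is known):

```python
import numpy as np

rng = np.random.default_rng(4)

# Ensemble average <f(x)> ~ (1/M) sum_m f(x^(m)) over M realizations
M = 200_000
samples = rng.normal(0.0, 1.0, M)      # x ~ rho(x) = standard normal

def f(x):
    return x ** 2

ensemble_avg = np.mean(f(samples))     # Monte Carlo estimate of <x^2> = 1
print(ensemble_avg)
```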

### Ensemble Kalman filter

The update step with $N\times N$ matrices in the Kalman filter is computationally costly. The ensemble Kalman filter approximates some quantities and reduces the dimensions to $N\times n$ with $n<N$

• Create an ensemble $X = [x^{(1)},...,x^{(n)}]$
• Create a corresponding ensemble of the data, by adding a random vector to each copy of the measurement vector $Z = [z^{(1)},...,z^{(n)}]$
• Then we can calculate $X_{t|t} = X_{t|t-1} + K(Z-HX_{t|t-1})$ with $K_t=Q_tH^T (HQ_tH^T + R_t)^{-1}$ where the covariance Q can now be computed from the ensemble X_t

Such an approximation is called a Monte-Carlo approximation
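A scalar sketch of the ensemble update (assuming Python with NumPy, H = 1, and illustrative numbers; the measurement is perturbed once per ensemble member, as described above):

```python
import numpy as np

rng = np.random.default_rng(5)

# Ensemble Kalman filter update for a scalar state (H = 1)
n = 500                                  # ensemble size
R = 4.0                                  # measurement noise variance
X = rng.normal(0.0, 3.0, n)              # forecast ensemble X_{t|t-1}
z = 10.0                                 # measurement

# Perturbed-observation ensemble: one noisy copy of z per member
Z = z + rng.normal(0.0, np.sqrt(R), n)

Q = np.var(X, ddof=1)                    # covariance estimated from the ensemble
K = Q / (Q + R)                          # Kalman gain (scalar, H = 1)
X_updated = X + K * (Z - X)

print(np.mean(X), np.mean(X_updated))    # mean moves toward the measurement
```

The ensemble mean shifts toward the measurement while the ensemble spread shrinks, mirroring the mean and variance updates of the exact filter.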

### Back to Bayes' Theorem

Optimizing the parameters theta

$P( \Theta | D,M ) P(D|M) = P( \Theta,D|M ) = P( D|\Theta,M ) P(\Theta|M)$ $P( \Theta | D,M ) = \frac{P( D|\Theta,M ) P( \Theta|M ) }{ P(D|M) }$

### The task of parameter estimation

$p(\Theta | z_{1...T}) \propto p( z_{1...T} | \Theta ) p(\Theta)$

Calculating the Likelihood

$p( z_{1...T} | \Theta ) = \prod_{t=1}^T p( z_t | z_{1...t-1} , \Theta )$

with

$p( z_t, x_t | z_{1...t-1}, \Theta ) = p( z_t | x_t, z_{1...t-1}, \Theta ) p( x_t | z_{1...t-1}, \Theta ) = p( z_t | x_t, \Theta ) p( x_t | z_{1...t-1}, \Theta )$ $p( z_t | z_{1...t-1}, \Theta ) = \int p( z_t | x_t, \Theta ) p( x_t | z_{1...t-1}, \Theta ) \mathrm{d}x_t$

The update and prediction steps are performed with some Bayesian filter (e.g. the Kalman filter)

$p(x_t | z_{1...t-1},\Theta ) = \int p(x_t | x_{t-1},\Theta) p(x_{t-1}|z_{1...t-1},\Theta) \mathrm{d}x_{t-1}$ $p(x_t|z_{1...t},\Theta) = \frac{p(z_t|x_t,\Theta)p(x_t|z_{1...t-1},\Theta)}{p(z_t|z_{1...t-1},\Theta)}$

### From the likelihood to the optimal parameter

$p(\Theta | z_{1...T}) \propto p( z_{1...T} | \Theta ) p(\Theta)$

Numerically more stable: the energy function

$\phi_T(\Theta) = -\log p(z_{1...T}|\Theta)-\log p(\Theta)$

Solve by

$\phi_0(\Theta) = -\log p(\Theta)$ $\phi_t(\Theta) = \phi_{t-1}(\Theta) - \log p(z_t|z_{1...t-1},\Theta)$

So the optimal parameter is given by

$\Theta^* = \mathrm{argmin}_\Theta [\phi_T(\Theta)]$
• Expectation values can then be calculated from the posterior
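As a minimal sketch (assuming Python with NumPy, a flat prior, and the degenerate case of a constant hidden state observed with known Gaussian noise, so the prediction density $p(z_t|z_{1...t-1},\Theta)$ reduces to a Gaussian around $\Theta$), the energy function can be accumulated term by term and minimized over a parameter grid:

```python
import numpy as np

rng = np.random.default_rng(6)

# Data generated with true parameter Theta = 2.5 and noise variance r
r = 1.0
z = 2.5 + rng.normal(0, np.sqrt(r), 200)

def energy(theta, z, r=1.0):
    phi = 0.0                                   # phi_0 = -log p(Theta), flat prior
    for zt in z:
        # phi_t = phi_{t-1} - log p(z_t | z_{1..t-1}, Theta), Gaussian density
        phi += 0.5 * np.log(2 * np.pi * r) + (zt - theta) ** 2 / (2 * r)
    return phi

# Minimize phi_T(Theta) over a grid of candidate parameters
thetas = np.linspace(0.0, 5.0, 101)
phis = [energy(th, z, r) for th in thetas]
theta_star = thetas[int(np.argmin(phis))]       # argmin_Theta phi_T(Theta)
print(theta_star)
```

For a general state-space model the Gaussian density inside the loop would be replaced by the prediction density delivered by the Kalman filter run with parameter $\Theta$.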

### Example: the SEIR model

$\dot{S} = m- ( m + bI ) S$ $\dot{E} = bSI - ( m + a ) E$ $\dot{I} = aE - ( m + g ) I$ $\dot{R} = gI - mR$

With N=S+E+I+R constant

$m = (m+bI^*)S^* \;\;\;\;\;\; bS^*I^*=(m+a)E^* \;\;\;\;\;\; aE^* = (m+g)I^* \;\;\;\;\;\; gI^* = mR^*$

No birth and death: m->0

Reproduction number

$r = \frac{1}{S^*} = \frac{ab}{(m+a)(m+g)} \rightarrow \frac{b}{g}$

Some parameters can be estimated directly, e.g. g=1/3 per day, a=1/5.2 per day
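A sketch of the deterministic SEIR dynamics with these rates (plain Python with simple Euler steps; m = 0 as above, and b = 1 is an illustrative choice giving a reproduction number b/g = 3, not a value from the lecture):

```python
# Euler integration of the SEIR equations without birth/death (m = 0)
a, g, b = 1 / 5.2, 1 / 3, 1.0        # b assumed; reproduction number b/g = 3
dt, days = 0.01, 200
S, E, I, R = 0.999, 0.0, 0.001, 0.0  # fractions of the total population

for _ in range(int(days / dt)):
    dS = -b * S * I
    dE = b * S * I - a * E
    dI = a * E - g * I
    dR = g * I
    S, E, I, R = S + dS * dt, E + dE * dt, I + dI * dt, R + dR * dt

print(S + E + I + R, R)    # the total stays 1; R is the final epidemic size
```

The four increments sum to zero, so N = S + E + I + R is conserved at every step, as stated above.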

### The probabilistic SEIR model

$p(S,E,I,R\rightarrow S-1,E+1,I,R) = bSI/N$ $p(S,E,I,R\rightarrow S,E-1,I+1,R) = aE$ $p(S,E,I,R\rightarrow S,E,I-1,R+1) = gI$

With X=(S,E,I,R); Observable state: Y=I+R; corresponding measurements z

Kalman filter: $X_{t|t} = X_{t|t-1} - \frac{1}{2} K_{t} (Y_{t|t-1} + \langle Y _{t|t-1} \rangle - 2 z_t )$ Kalman gain $K_t = \frac{\langle (X_{t|t-1}-\langle X_{t|t-1}\rangle)[(I_{t|t-1}-\langle I_{t|t-1}\rangle)+(R_{t|t-1}-\langle R_{t|t-1}\rangle)] \rangle}{\langle (I-\langle I \rangle)^2 \rangle + 2\langle (R-\langle R\rangle)(I-\langle I\rangle ) \rangle + \langle (R-\langle R \rangle)^2 \rangle + r}$ estimated observation error r=10

### SEIR Parameter Estimation

Approximate the negative log-likelihood as $L(t,b) = \frac{1}{2} \frac{|(\langle I \rangle-z_I)+(\langle R \rangle-z_R)|^2}{\langle (I-\langle I \rangle)^2 \rangle + 2\langle (R-\langle R \rangle)(I-\langle I\rangle) \rangle + \langle (R-\langle R \rangle)^2 \rangle + r}$ $\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; + \frac{1}{2} \log\left(\langle (I-\langle I \rangle)^2 \rangle + 2\langle (R-\langle R\rangle)(I-\langle I\rangle ) \rangle + \langle (R-\langle R \rangle)^2 \rangle + r\right),$

Assuming a Gaussian likelihood function, $-\log\left( \frac{1}{\sigma_Y}e^{-\Delta Y^2/(2\sigma_Y^2)} \right) = \frac{1}{2}\log(\sigma_Y^2) + \frac{\Delta Y^2}{2\sigma_Y^2}$ (up to a constant). Obtain b by minimizing: $b^* = \mathrm{argmin}_b \sum_t L(t,b)$

### Sequential data assimilation of the SEIR model for COVID-19

Engbert, R., Rabe, M. M., Kliegl, R., & Reich, S. (2021). Bulletin of Mathematical Biology, 83(1)

### Markov Chain Monte Carlo

If we do not assume Gaussian distributions, observables like the mean and variance have to be calculated from the posterior. Markov Chain Monte Carlo is a method to approximate the distribution $\rho(x)$

Require:

• Ergodicity
• Detailed Balance
• Markov property

Construct a series x_t of length T that has the probability distribution $\rho(x)$, then $\langle f(x) \rangle = \frac{1}{T} \sum_t f(x_t)$

### The Metropolis-Hastings Algorithm

Detailed Balance

$p(\Theta_{t+1}|\Theta_t)p(\Theta_t) = p(\Theta_{t}|\Theta_{t+1})p(\Theta_{t+1}) \;\;\rightarrow \;\; \frac{p(\Theta_{t+1}|\Theta_t)}{p(\Theta_{t}|\Theta_{t+1})} = \frac{ p(\Theta_{t+1})}{p(\Theta_t)}$

Defining the proposal distribution Q and the acceptance distribution T $p(\Theta_{t+1}|\Theta_t)= Q(\Theta_{t+1}|\Theta_t) T(\Theta_{t+1}|\Theta_t)$

So $\frac{T(\Theta_{t+1}|\Theta_t)}{T(\Theta_{t}|\Theta_{t+1})} = \frac{p(\Theta_{t+1})}{p(\Theta_t)} \frac{Q(\Theta_{t}|\Theta_{t+1})}{Q(\Theta_{t+1}|\Theta_t)}$

This is fulfilled by $T(\Theta_{t+1}|\Theta_t) = \min\left[1, \frac{p(\Theta_{t+1})}{p(\Theta_t)} \frac{Q(\Theta_{t}|\Theta_{t+1})}{Q(\Theta_{t+1}|\Theta_t)} \right]$

At each point in time, draw a sample from Q and compute the corresponding transition probability T. Then generate a random number u between 0 and 1. If u is smaller than T, accept the move; otherwise reject it
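The accept/reject loop can be sketched as follows (plain Python; the target p(Θ) is taken as a standard normal and the proposal Q as a symmetric Gaussian step, so the Q-ratio in T cancels):

```python
import math
import random

random.seed(7)

# Metropolis-Hastings sampling from a standard normal target p(Theta)
def log_p(theta):
    return -0.5 * theta ** 2

theta, chain, accepted = 0.0, [], 0
for _ in range(50_000):
    proposal = theta + random.gauss(0.0, 1.0)      # draw from symmetric Q
    # acceptance probability T = min(1, p(Theta') / p(Theta))
    T = min(1.0, math.exp(log_p(proposal) - log_p(theta)))
    if random.random() < T:                        # accept with probability T
        theta = proposal
        accepted += 1
    chain.append(theta)                            # rejected moves repeat theta

burn_in = 1000                 # discard the initial transient
samples = chain[burn_in:]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)               # should be close to 0 and 1
```

Note that a rejected move repeats the previous value in the chain, which is exactly the source of the autocorrelations discussed in the next section.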

### Effective Sample Size

• Rejected moves lead to autocorrelations and, therefore, reduced effective sample sizes
• The first iterations of the Markov chain cannot be used, as they depend on the initial value of the series, which might itself have a very low probability