Vanilla Kalman Filter

A BS-free introduction to the math of the Kalman filter


Graphical model

Hidden states are denoted as $x$ and observables as $z$.

Kindergarten physics

Time evolution

$$
\mathbf{r}(t+\delta t) = \mathbf{r}(t) + \dot{\mathbf{r}}(t)\,\delta t + \frac{1}{2}\ddot{\mathbf{r}}(t)\,\delta t^2 + {\rm h.o.t.}
$$

If we assume $\langle \ddot{\mathbf{r}} \rangle = 0$, then the state $x$ is characterized by $(\mathbf{r}, \dot{\mathbf{r}})$, i.e.

$$
x_i(t+\delta t) = x_i(t) + v_i(t)\,\delta t + \frac{1}{2}\varepsilon_i(t)\,\delta t^2, \qquad
v_i(t+\delta t) = v_i(t) + \varepsilon_i(t)\,\delta t
$$

or, in matrix form,

$$
x_a(t+\delta t) = F_{ab}\,x_b(t) + \eta_a, \qquad
F_{ab} = \begin{pmatrix} 1 & \delta t \\ 0 & 1 \end{pmatrix}, \qquad
\eta_a \sim \mathcal{N}(0, Q)
$$

Observation or measurement

$$
z_a(t) = H_{ab}\,x_b(t) + \xi_a, \qquad
H_{ab} = \begin{pmatrix} 1 & 0 \end{pmatrix}, \qquad
\xi_a \sim \mathcal{N}(0, R)
$$
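To make the model concrete, here is a minimal NumPy sketch of these matrices for the one-dimensional constant-velocity model; the time step and the noise scales are illustrative assumptions, not values from the post.

```python
import numpy as np

dt = 0.1   # time step δt (assumed)
q = 0.05   # scale of the acceleration noise ε (assumed)
r = 0.5    # scale of the measurement noise ξ (assumed)

# State x = (position, velocity); constant-velocity evolution.
F = np.array([[1.0, dt],
              [0.0, 1.0]])

# Only the position is observed.
H = np.array([[1.0, 0.0]])

# The acceleration noise enters position as ε·δt²/2 and velocity as ε·δt,
# so Q = q² · g gᵀ with g = (δt²/2, δt).
g = np.array([[0.5 * dt**2], [dt]])
Q = q**2 * (g @ g.T)

# Measurement noise covariance (1×1, position only).
R = np.array([[r**2]])
```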

Inference

Given observations $z_{<t}$ up to $t-1$, what is the distribution of the state $x_t$?

$$
p(x_t | z_{<t}) = \int {\rm d}x_{t-1}\; p(x_t | x_{t-1})\, p(x_{t-1} | z_{\le t-1})
$$

If there is one more observation $z_t$, what is the distribution of the state $x_t$?

$$
p(x_t | z_{\le t}) = p(x_t | z_t, z_{<t}) \propto p(z_t | x_t)\, p(x_t | z_{<t})
$$

Expanding the recursion, we obtain a Feynman-Kac formula, or path integral,

$$
p(x_t | z_{<t}) \propto \int \left[ \prod_{\tau=0}^{t-1} {\rm d}x_\tau \right]
\prod_{\tau=1}^{t} p(x_\tau | x_{\tau-1}) \prod_{\tau=0}^{t-1} p(z_\tau | x_\tau)\; p(x_0)
$$

If we assume the priors, transition probabilities, and emission probabilities are all normal, then the posterior of the state $x_t$ is also normal:

$$
p(x_t | z_{<t}) = \mathcal N(\hat x_{t|t-1}, P_{t|t-1}), \qquad
p(x_t | z_{\le t}) = \mathcal N(\hat x_{t|t}, P_{t|t})
$$
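To see why the recursion stays inside the Gaussian family, recall the marginalization identity for linear-Gaussian models (a textbook result, quoted here rather than derived): applied to the predict integral above,

$$
\int \mathcal N(x_t;\, F x_{t-1},\, Q)\, \mathcal N(x_{t-1};\, \hat x_{t-1|t-1},\, P_{t-1|t-1})\, {\rm d}x_{t-1}
= \mathcal N(x_t;\, F \hat x_{t-1|t-1},\, Q + F P_{t-1|t-1} F^\top)
$$

which is exactly the forecast step recovered below.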
State means

Use the recursion again,

$$
p(x_t | z_{\le t}) \propto p(z_t | x_t)\, p(x_t | z_{<t})
$$

$$
\mathcal N(x_t;\, \hat x_{t|t}, P_{t|t}) \propto \mathcal N(z_t;\, H x_t, R)\, \mathcal N(x_t;\, \hat x_{t|t-1}, P_{t|t-1})
$$

Maximize both sides over $x_t$. The LHS is maximized at $x_t = \hat x_{t|t}$; for the RHS, minus twice the log gives the quadratic form

$$
{\rm rhs} = (z_t - H x_t \,|\, R^{-1} \,|\, z_t - H x_t) + (x_t - \hat x_{t|t-1} \,|\, P_{t|t-1}^{-1} \,|\, x_t - \hat x_{t|t-1})
$$

$$
0 = \partial_{x_t^\top} {\rm rhs} = -H^\top R^{-1}(z_t - H x_t) + P_{t|t-1}^{-1}(x_t - \hat x_{t|t-1})
$$

Setting $x_t = \hat x_{t|t}$, we must have

$$
\begin{aligned}
\hat x_{t|t} - \hat x_{t|t-1} &= P_{t|t-1} H^\top R^{-1} (z_t - H \hat x_{t|t}) \\
&= P_{t|t-1} H^\top R^{-1} \left( z_t - H \hat x_{t|t-1} - H(\hat x_{t|t} - \hat x_{t|t-1}) \right)
\end{aligned}
$$

Therefore,

$$
\hat x_{t|t} - \hat x_{t|t-1} = (I + P_{t|t-1} H^\top R^{-1} H)^{-1} P_{t|t-1} H^\top R^{-1} (z_t - H \hat x_{t|t-1}) \equiv K (z_t - H \hat x_{t|t-1})
$$

where $K = (I + P_{t|t-1} H^\top R^{-1} H)^{-1} P_{t|t-1} H^\top R^{-1}$ is the Kalman gain.
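As a numerical sanity check of this update (a minimal sketch; all matrices are arbitrary made-up values, and the Joseph-form posterior covariance used here is derived in the next section): the claimed posterior $\mathcal N(\hat x_{t|t}, P_{t|t})$ should be proportional to $\mathcal N(z_t;\, H x_t, R)\, \mathcal N(x_t;\, \hat x_{t|t-1}, P_{t|t-1})$, so the difference of their log-densities should be constant in $x_t$:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mean, cov):
    """Log-density of N(x; mean, cov), including normalization."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(x) * np.log(2 * np.pi))

# Arbitrary prior, measurement model, and observation.
x_prior = np.array([1.0, -0.5])
P_prior = np.array([[2.0, 0.3], [0.3, 1.0]])
H = np.array([[1.0, 0.0]])
R = np.array([[0.5]])
z = np.array([1.7])

# Gain in the form just derived: K = (I + P Hᵀ R⁻¹ H)⁻¹ P Hᵀ R⁻¹.
K = np.linalg.inv(np.eye(2) + P_prior @ H.T @ np.linalg.inv(R) @ H) @ P_prior @ H.T @ np.linalg.inv(R)
x_post = x_prior + K @ (z - H @ x_prior)
I_KH = np.eye(2) - K @ H
P_post = I_KH @ P_prior @ I_KH.T + K @ R @ K.T  # Joseph form, see next section

# log N(x; x_post, P_post) - log[N(z; Hx, R) N(x; x_prior, P_prior)]
# should be the same constant (minus the log-evidence) for every x.
diffs = [log_gauss(x, x_post, P_post)
         - log_gauss(z, H @ x, R) - log_gauss(x, x_prior, P_prior)
         for x in rng.normal(size=(5, 2))]
print(np.ptp(diffs))  # ~1e-15, i.e. constant up to floating point
```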

State covariances

The covariances

$$
(P_{t|t-1})_{ab} \equiv \langle (x_t - \hat x_{t|t-1})_a (x_t - \hat x_{t|t-1})_b \rangle, \qquad
(P_{t|t})_{ab} \equiv \langle (x_t - \hat x_{t|t})_a (x_t - \hat x_{t|t})_b \rangle
$$

Inserting the mean update and the measurement model $z_t = H x_t + \xi$,

$$
\begin{aligned}
(P_{t|t})_{ab} &= \langle [x_t - \hat x_{t|t-1} - (\hat x_{t|t} - \hat x_{t|t-1})]_a [\cdots]_b \rangle \\
&= \langle [x_t - \hat x_{t|t-1} - K (z_t - H \hat x_{t|t-1})]_a [\cdots]_b \rangle \\
&= \langle [x_t - \hat x_{t|t-1} - K (H (x_t - \hat x_{t|t-1}) + \xi)]_a [\cdots]_b \rangle \\
&= \langle [(I - KH)_{ac} (x_t - \hat x_{t|t-1})_c - K_{ac} \xi_c][(I - KH)_{bd} (x_t - \hat x_{t|t-1})_d - K_{bd} \xi_d] \rangle
\end{aligned}
$$

where $[\cdots]_b$ repeats the bracket indexed by $a$. Therefore,

$$
P_{t|t} = (I - KH) P_{t|t-1} (I - KH)^\top + K R K^\top
$$

Given the prior covariance, minimize the posterior MSE $\min_K (P_{t|t})_{aa}$:

$$
\begin{aligned}
\frac{\partial\, {\rm tr}\, P_{t|t}}{\partial K}
&= \left( -H P_{t|t-1} (I - KH)^\top + R K^\top \right)^\top \\
&= K R - (I - KH) P_{t|t-1}^\top H^\top \\
&= K (R + H P_{t|t-1}^\top H^\top) - P_{t|t-1}^\top H^\top
\end{aligned}
$$

$$
K^* = P_{t|t-1}^\top H^\top (H P_{t|t-1}^\top H^\top + R)^{-1}
$$

But previously we got $K = (I + P_{t|t-1} H^\top R^{-1} H)^{-1} P_{t|t-1} H^\top R^{-1}$. Are they consistent? To show that, invert $K$:

$$
\begin{aligned}
K^{-1} &= R (P_{t|t-1} H^\top)^{-1} (I + P_{t|t-1} H^\top R^{-1} H) \\
&= R (P_{t|t-1} H^\top)^{-1} + H \\
&= (R + H P_{t|t-1} H^\top)(P_{t|t-1} H^\top)^{-1}
\end{aligned}
$$

Thus, $K = (P_{t|t-1} H^\top)(R + H P_{t|t-1} H^\top)^{-1} = K^*$.
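The same equivalence is easy to verify numerically (a minimal sketch with arbitrary made-up matrices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary symmetric positive-definite prior covariance and noise.
A = rng.normal(size=(2, 2))
P = A @ A.T + np.eye(2)       # P_{t|t-1}
H = rng.normal(size=(1, 2))   # measurement matrix
R = np.array([[0.7]])         # measurement noise covariance

# MAP form: K = (I + P Hᵀ R⁻¹ H)⁻¹ P Hᵀ R⁻¹
K_map = np.linalg.inv(np.eye(2) + P @ H.T @ np.linalg.inv(R) @ H) @ P @ H.T @ np.linalg.inv(R)

# MMSE form: K* = P Hᵀ (H P Hᵀ + R)⁻¹
K_mmse = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)

print(np.allclose(K_map, K_mmse))  # True
```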

Given the posterior, the prior covariance for the next step is

$$
\begin{aligned}
(P_{t|t-1})_{ab} &\equiv \langle (x_t - \hat x_{t|t-1})_a (x_t - \hat x_{t|t-1})_b \rangle \\
&= \langle (x_t - F \hat x_{t-1|t-1})_a (x_t - F \hat x_{t-1|t-1})_b \rangle \\
&= \langle \left( x_t - F x_{t-1} + F (x_{t-1} - \hat x_{t-1|t-1}) \right)_a \left( x_t - F x_{t-1} + F (x_{t-1} - \hat x_{t-1|t-1}) \right)_b \rangle
\end{aligned}
$$

Since $x_t - F x_{t-1} = \eta$ is independent of the posterior error $x_{t-1} - \hat x_{t-1|t-1}$, the forecasted covariance is composed of the evolution noise and the propagated posterior covariance:

$$
P_{t|t-1} = Q + F P_{t-1|t-1} F^\top
$$
Algorithm

  • Forecast the state mean and covariance, using the posterior as prior: $\hat x_{t|t-1} = F \hat x_{t-1|t-1}$, $P_{t|t-1} = Q + F P_{t-1|t-1} F^\top$
  • Compute the Kalman gain $K = (P_{t|t-1} H^\top)(R + H P_{t|t-1} H^\top)^{-1}$
  • Make a measurement $z_t$ and a prediction using the forecast state, $H \hat x_{t|t-1}$
  • Compute the posterior mean using the new observation: $\hat x_{t|t} = \hat x_{t|t-1} + K (z_t - H \hat x_{t|t-1})$
  • Compute the posterior covariance: $P_{t|t} = (I - KH) P_{t|t-1} (I - KH)^\top + K R K^\top$

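Putting the loop together: a minimal NumPy sketch of the full filter on the constant-velocity model above (the function name, simulation setup, and tuning values are my own illustrative choices):

```python
import numpy as np

def kalman_step(x_hat, P, z, F, H, Q, R):
    """One forecast/update cycle of the vanilla Kalman filter."""
    # Forecast state mean and covariance, using the posterior as prior.
    x_pred = F @ x_hat
    P_pred = Q + F @ P @ F.T
    # Kalman gain.
    K = P_pred @ H.T @ np.linalg.inv(R + H @ P_pred @ H.T)
    # Posterior mean from the new observation.
    x_new = x_pred + K @ (z - H @ x_pred)
    # Posterior covariance (Joseph form).
    I_KH = np.eye(len(x_hat)) - K @ H
    P_new = I_KH @ P_pred @ I_KH.T + K @ R @ K.T
    return x_new, P_new

# Constant-velocity model, as in the Kindergarten physics section.
dt, q, r = 0.1, 0.05, 0.5
F = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
g = np.array([0.5 * dt**2, dt])
Q = q**2 * np.outer(g, g)
R = np.array([[r**2]])

# Simulate a noisy trajectory and filter it.
rng = np.random.default_rng(42)
x_true = np.array([0.0, 1.0])          # true initial state
x_hat, P = np.zeros(2), np.eye(2)      # broad initial guess
for t in range(100):
    x_true = F @ x_true + g * q * rng.normal()   # evolution noise η
    z = H @ x_true + r * rng.normal(size=1)      # measurement noise ξ
    x_hat, P = kalman_step(x_hat, P, z, F, H, Q, R)

print("true state:", x_true, "estimate:", x_hat)
```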