State space / Kalman filtering advice for inverse model

Hi Mc-Stan, I’m trying again to improve my biogeochemical inverse modeling using Stan.

There is a phenomenon we can model like so:

\frac{dC}{dt} = A_t + R_t + M_t
A_t = u_t\beta_t
M_t = K_t \frac{(C_t - S_t)}{z}

I have observations of C, u, S and good prior knowledge of K and z.
I want to know R and \beta.
dt is short enough (half-hourly) that I’m not worrying about trying to solve it with an ODE solver; assuming everything is static between time steps is fine. I have good reason to believe \beta and R change slowly, over a day or so, but this is one of the reasons for the study. Also, R is constrained to be positive by definition.
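In other words, each half-hourly step is just a forward-Euler update,

C_{t+1} = C_t + \Delta t (A_t + R_t + M_t)

with \Delta t = 1800 s.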

I’ve been doing a lot of reading on state-space models, which I believe this falls under given that many of my parameters vary over time, including Jeff Arnold’s page and many of the questions posted here.

Currently my model, which mostly works (and by works I mean it reproduces my synthetic data’s parameters), is a kind of multi-level model such that…

C_{t+1} = C_{t} + u_{t}\beta_j + M_{t} - R_i

\hat{C} \sim \mathcal{N}(C, \sigma)
\beta \sim \mathcal{N}(\bar{\beta}, \sigma_\beta)
\bar{\beta} \sim \mathcal{N}(0, 1)
R \sim \mathcal{N}(\bar{R}, \sigma_R)
\bar{R} \sim \mathcal{N}(0, 1)
\sigma \sim \text{Cauchy}(0, 1)
\sigma_\beta \sim \text{Cauchy}(0, 1)
\sigma_R \sim \text{Cauchy}(0, 1)
...

The issues are:
This model’s ability to reproduce R from a known dataset is quite dependent on the size of j and i, that is to say the time window.
It’s very slow.
I’ve non-centred the multi-level components, which did help.

I’m still pretty new to this sort of modeling, and the Bayesian state-space resources tend to be mostly about prediction rather than uncovering the “true” hidden variables.
I feel like a Kalman-type model would perform much better, but having read through https://jrnold.github.io/ssmodels-in-stan/ I don’t really understand which bits I need to implement to get what I want.
I see there is now a gaussian_dlm_obs sampling statement; is this what I want? I’m struggling to find an example which fits what I’m trying to do.
I also just saw http://www.juhokokkala.fi/blog/posts/kalman-filter-style-recursion-to-marginalize-state-variables-to-speed-up-stan-inference/. Where would this sit compared to gaussian_dlm_obs?
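For what it’s worth, my current reading of the functions reference is that gaussian_dlm_obs wants the model already in DLM form with fixed matrices (observation matrix F, transition matrix G, observation and state covariances V and W, and an initial state mean/covariance), and marginalizes the states with a Kalman filter internally. Roughly like the sketch below, which may well be wrong in the details; I don’t see where a time-varying input like u_t or the positivity of R would fit:

// Sketch only: check the functions reference for the exact conventions,
// in particular whether rows or columns of y index time.
// State:       theta_t = G * theta_{t-1} + w_t,   w_t ~ N(0, W)
// Observation: y_t     = F' * theta_t    + v_t,   v_t ~ N(0, V)
data {
  int<lower=1> T;      // number of time steps
  int<lower=1> n;      // number of latent states
  int<lower=1> r;      // number of observed series
  matrix[r, T] y;      // observations
  matrix[n, r] F;      // observation (design) matrix
  matrix[n, n] G;      // state transition matrix
  vector[n] m0;        // prior mean of the initial state
  cov_matrix[n] C0;    // prior covariance of the initial state
}
parameters {
  cov_matrix[r] V;     // observation covariance
  cov_matrix[n] W;     // state (evolution) covariance
}
model {
  // priors on V and W would go here
  y ~ gaussian_dlm_obs(F, G, V, W, m0, C0);
}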

Similarly, I’ve tried to fit this with walker, using a simplified version of the DGF with just \frac{dC}{dt} = A_t + R_t, substituting y for C and x1 for u and using rw2. But while it’s very quick to run, the parameters it returned do not look at all like my DGF.

I’m looking for examples closer to my use case, or perhaps an explanation of why my case is actually similar to X.
Plus, if anyone wants to weigh in on whether what I’m attempting is reasonable.
Oh, and any coding tips for anything stupid I’ve done in the multi-level version would also be much appreciated.
In particular, I’d be very interested to hear from @jrnold and @james_savage

// demo version

data {
  int<lower=1> N;  // number of samples
  int<lower=1> N_day; // number of days
  int<lower=1> N_filter; // number of filters
  int<lower=1> day[N];
  int<lower=1> filter[N];
  vector[N] C_hat; // observed O2
  vector[N] Cs; // observed O2 at surf
  vector[N] u; // east-west velocity
  real z_hat; // thermocline thickness
  real dt; // time step
}
transformed data {
  vector[N-1] u_avg; // u is semi-sinusoidal, improved estimate by taking the average
  for(i in 1:N-1){
    u_avg[i] = (u[i] + u[i+1]) * 0.5;
  }
}

parameters {
  real<lower=0> sigma;
  real<lower=0> R_mu;
  real<lower=0> R_tau;
  vector<lower=0>[N_filter] R_tilde;
  real<lower=1> z;
  real<lower=0> kz;
  real u_mu;
  real<lower=0> u_tau;
  vector[N_day] u_tilde;
  real init;
}

transformed parameters {
  vector[N_filter] R_raw;
  vector[N_day] beta_u;
  real C[N]; // true oxygen
  C[1] = C_hat[1] + init;
  
    // non-centering
  beta_u = u_mu + u_tau * u_tilde;
  R_raw = (R_mu + R_tau * R_tilde) * 0.0001;

  for(i in 1:(N-1)){
    C[i+1] = C[i] +
    (u_avg[i] * dt * beta_u[day[i]] * 0.0001) +
    ((0.00001 * kz * dt * ((Cs[i] - C[i]) / z))) +
    -((R_raw[filter[i]] * dt));
  }
}

model {
  sigma ~ cauchy(0, 1);
  init ~ normal(0, 0.25);
  
  R_mu ~ normal(0, 1);
  R_tau ~ normal(0, 1);
  R_tilde ~ normal(0, 1);
  
  u_mu ~ normal(0, 1);
  u_tau ~ cauchy(0, 1);
  u_tilde ~ normal(0, 1);
  
  z_hat ~ normal(z, 2);
  kz ~ normal(4.5, 1);
  
  C_hat ~ normal(C, sigma);
}

generated quantities{
  vector[N] R; // mmol m-2 s-1
  vector[N-1] dC; // mmol m-3
  vector[N] Mz; // mmol m-2 s-1
  vector[N] adv_u; // mmol m-2 d-1
  for(i in 1:N-1){
    dC[i] = C[i+1] - C[i];
  }

  for(i in 1:N){
    R[i] = R_raw[filter[i]]; // R flux = mmol m-2 s-1
    Mz[i] = (0.00001 * kz * ((Cs[i] - C[i]) / z) ); // flux = mmol m-3 s-1
    adv_u[i] = (u[i] * beta_u[day[i]] * 0.0001); // mmol m-3 s-1
  }
}

and the DGF in R

  # test of adv_fast.stan
require(ggplot2)
require(data.table)
require(rstan) # for stan_model() / sampling() below

out = list()
ti = 0:((60*60*24*30)+1) # month of data
dat = data.table(
  time = ti,
  hour = ti/60/60
)

freq = (max(ti) / (6*60*60)) / max(ti)
dat[, spring_neap := (max(ti) / (14*24*60*60)) / max(ti)]
dat[, spring_neap := 0.66 + (0.35 * sin(spring_neap * pi * ti))]

# east-west
dat[, u := 0.7 * sin(freq * pi * ti + (8*60*60))]
dat[, u := u * spring_neap]
dat[, u := u + spline(c(0, 928001, 1100000, 2500000), c(-0.02, 0.01, 0.02, -0.02), xout=time)$y]

dat[, dateTime := as.POSIXct(time, origin = "2015-06-01", tz = "UTC")]

R_spline = spline(as.POSIXct(c("2015-06-01", "2015-06-07", "2015-06-09", "2015-06-15", "2015-06-30"), tz="UTC"),
                  c(0.5, 0.5, 0.1, 0.5, 0.8)/(60*60*24), n=nrow(dat))
dat[, R := R_spline$y]
dat[R < 0, R := 0] # R can't be negative

  # gradients, [mmol m-4]
dat[, beta_u := spline(as.POSIXct(c("2015-06-01", "2015-06-10", "2015-06-20"), tz="UTC"), c(2e-04, 1e-04, 2e-04), n=nrow(dat))$y]

dat$Cs = 320

initial_DO2 = 300

  #### x
# x = dat[second(dateTime) == 0]
# x = dat[second(dateTime) == 0 & minute(dateTime) %in% c(0,10, 20, 30, 40, 50)]
x = dat[second(dateTime) == 0 & minute(dateTime) %in% c(0, 30)] # subsample to 30-minute intervals
x[, dt := unique(diff(time))]

true = vector(length=nrow(x))
true[1] = initial_DO2

x$k = 4.5e-5 # m2 s-1 (0.45 cm^2 s-1 from palmer2008)
x$z = 15

  # loop to generate true C
for(i in 2:(nrow(x))){
  u_avg = (x$u[i] + x$u[i-1])/2 # m s-1
  A_u = u_avg * x$dt[i] * ((x$beta_u[i-1] + x$beta_u[i]) * 0.5) # average beta over the step
  Mz = (x$k[i-1] * x$dt[i-1] * ((x$Cs[i-1] - true[i-1]) / x$z[i-1]))
  R_ = x$R[i-1] * x$dt[i] # mmol m-3
  true[i] = true[i-1] + (A_u + Mz) - R_
}
x[, true := true]
x[, C := rnorm(.N, true, 0.25)]
x[, day := 1 + yday(dateTime) - min(yday(dateTime))]
x = x[1:(.N-1)] # drop the last sample (the lone point on the final day)

ggplot(x) +
  geom_line(aes(dateTime, true), color="red", alpha=0.5, size=1) +
  geom_line(aes(dateTime, C), color="blue")
ggplot(x) + geom_line(aes(dateTime, u), color="red", alpha=0.5, size=1)
ggplot(x) + geom_line(aes(dateTime, R), color="red", alpha=0.5, size=1)
ggplot(x) + geom_line(aes(dateTime, beta_u), color="red", alpha=0.5, size=1)

x[, filter := cut(day, max(day)/3, label=F)]

dx = list(
  N = nrow(x),
  day = x$day,
  N_day = max(x$day),
  filter = x$filter,
  N_filter = max(x$filter),
  C_hat = x$C, # measured C
  Cs = x$Cs,
  z_hat = 15,
  u = x$u,
  dt = 1800
)

adv_fast = stan_model("adv_fast.stan")
fit= sampling(adv_fast, data=dx, chains=1, iter=2000, control=list(max_treedepth=12, adapt_delta=0.90))

Check out https://cran.r-project.org/web/packages/ctsem/index.html (by @Charles_Driver). That does this sorta time series stuff. There are optimization/full inference options if you’re having performance issues.

For the model you’re working with, maybe ditching the Cauchy priors (replacing them with normals) will help performance? The reasoning is that Cauchys have really heavy tails that are hard to sample, so unless you really need them, ditch them.

Thanks for your reply Ben,
Tweaking the priors does speed things up a little.

I should have mentioned I’ve looked at ctsem and I know it can do discrete time-series stuff as well as continuous time.
However I have similar problems: the examples don’t appear to be much like my problem, and the terminology used in ctsem is quite different from everything else I’ve been reading (e.g. Durbin & Koopman), presumably because of the social/behavioural science focus.
Plus I feel I’ll be starting from scratch in terms of understanding how the model may or may not behave in a new framework, given I won’t have written it…

That being said, if someone felt like giving an example of fitting ctsem to my toy DGF to recover the \beta and R coefficients, that would put me well on my way, I think.

After further thought, I think what I need is an example of a multivariate Kalman filter in Stan where a DGF is included. Every example I’ve found so far is either based on real example data with unknown structure, with a view to prediction, or has a single state variable modelled as a random walk.

Hi,

I’ve never implemented a Kalman filter in Stan, but I have implemented particle filters from scratch (the general case, which usually has no analytical solution and requires stochastic simulation instead) and embedded them in inference methods, like what the POMP R package does, or in MCMC (the so-called particle MCMC). That seems to be the kind of thing you are looking for, as opposed to other kinds of time-series analysis.

I don’t know how familiar you are with Bayesian inference and/or Stan, and maybe you know all or most of this already, but since I can’t really know if you do, and in case you don’t, here’s what I think may be helpful:

There are a couple of things that may be confusing when the model itself is Bayesian even without any parameter inference involved (prediction, like you said). Namely, for each MCMC step a single parameter set is sampled and the Kalman filter (alternatively an EKF, UKF, or, more generally, a particle filter) is applied to compute the likelihood of that individual parameter set while taking stochasticity in the process/observation into account (which is the whole point of these filters in the first place). This is a full Bayesian description on its own, where the prior comes from the model prediction and the posterior takes the observation into account (i). On top of that, MCMC Bayesian inference uses each of those likelihoods, together with the priors on the model parameters, to produce a trace of the posterior and its distribution (ii).

If you do know all of that already and the issue is just with the Stan implementation, the easiest way may be to reproduce the model you have (i) with algorithm="Fixed_param" in the sampling command (which doesn’t actually propose new parameter values, and therefore doesn’t do (ii) as described above), reducing it to what you seem to have already in another implementation, against which you can validate the Stan version.

Hope that helps.


Can you dump some example data to a file? I tried to run the example script but my R doesn’t have cefasMOS (and said it wasn’t able to install the package). Something like:

stan_rdump(names(dx), "mydata.dat", envir = list2env(dx))

should do the trick. I think there should be a way to just do this with brms’s splines and use an expansion so it’s faster, but I wanna check first myself before I go giving you recommendations.

Whoops sorry, I included a work package…

The DGF now uses the data.table functions only.
Unfortunately I’m travelling now, so won’t be able to do that until tomorrow.

Will your implementation put \beta and R on the _t scale, as in, estimated for each time point?
I’d really like that flexibility, but my attempts at modelling them as a random walk, as mentioned, tend to blow up (tree-depth limits and/or divergences).

Things such as this (before even including R):
But I think this will tend to over-smooth given the fixed \tau error.

// RW version

data {
  int<lower=1> N;  // number of samples
  vector[N] y; // observed O2
  vector[N] u; // east-west velocity
  real dt; // time step
}
transformed data {
}

parameters {
  real<lower=0> sigma;
  real<lower=0> tau;
  vector[N] beta_u;
  real init;
}

transformed parameters {
  vector[N] C;
  C[1] = y[1] + init;
  
    // process model
  for(i in 2:N){
    C[i] = C[i-1] + (0.0001 * beta_u[i-1] * dt * u[i-1]); // put beta on unit scale
  }
}

model {
  sigma ~ normal(0, 1);
  init ~ normal(0, 1);
  tau ~ normal(0, 1);
    // random walk for beta
  beta_u[1] ~ normal(0, 10);
  for(i in 2:N){
    beta_u[i] ~ normal(beta_u[i-1], tau);
  }
  y ~ normal(C, sigma); // observational error
}
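For reference, a non-centred version of that same random walk, sampling standardized innovations and building beta_u with cumulative_sum, would look roughly like the sketch below. I don’t know yet whether it fixes the divergences here.

// Sketch: non-centred random walk for beta_u, same process/observation
// model as the RW version above, built from standardized innovations.
data {
  int<lower=1> N;  // number of samples
  vector[N] y;     // observed O2
  vector[N] u;     // east-west velocity
  real dt;         // time step
}
parameters {
  real<lower=0> sigma;
  real<lower=0> tau;
  real beta1;            // initial beta
  vector[N-1] z_innov;   // standardized random-walk innovations
  real init;
}
transformed parameters {
  vector[N] beta_u;
  vector[N] C;
  beta_u[1] = beta1;
  beta_u[2:N] = beta1 + cumulative_sum(tau * z_innov); // the random walk
  C[1] = y[1] + init;
  for(i in 2:N){
    C[i] = C[i-1] + (0.0001 * beta_u[i-1] * dt * u[i-1]); // put beta on unit scale
  }
}
model {
  sigma ~ normal(0, 1);
  init ~ normal(0, 1);
  tau ~ normal(0, 1);
  beta1 ~ normal(0, 10);
  z_innov ~ normal(0, 1); // implies beta_u[i] ~ normal(beta_u[i-1], tau)
  y ~ normal(C, sigma);   // observational error
}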

I played around with this a little. I didn’t do the exact model cause I was too lazy to figure it out. The regression we’re fitting here is C_{t + 1} \sim N(\alpha C_{t} + s(\text{time}), \sigma)

library(brms)
library(tidyverse) # for tibble / %>% / ggplot used below

df = tibble(Cp = x$C[2:1440],
            C = x$C[1:1439],
            time = x$time[2:1440]) %>%
  mutate(time = time / max(time))

fit = brm(bf(Cp ~ C + s(time, k = 10)), data = df, family = gaussian(), chains = 1)

summary(fit)

plot(marginal_effects(fit, effects = "C"), points = TRUE, ask = FALSE)
plot(marginal_effects(fit, effects = "time"), points = TRUE, ask = FALSE)

predict(fit) %>%
  as.tibble %>%
  bind_cols(df) %>%
  ggplot(aes(time, Cp)) +
  geom_ribbon(aes(ymin = `Q2.5`, ymax = `Q97.5`), fill = "blue", alpha = 0.5) +
  geom_point(size = 0.1)

The code “s(time, k = 10)” means fit a non-parametric spline term represented with 10 basis functions. By using the basis-function representation, there are only k parameters to fit (instead of one at every point). Higher k -> higher frequency. The fact that you only care about slow-scale stuff helps here.
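If you wanted the same basis-function idea in raw Stan rather than brms, it would look roughly like this (a sketch, not what brms actually generates; the basis matrix B would be built in R, e.g. with splines::bs(time, df = k), and passed in as data):

// Sketch: time-varying beta via a k-column spline basis supplied as data.
data {
  int<lower=1> N;
  int<lower=1> k;
  matrix[N, k] B;   // spline basis evaluated at each time point
  vector[N] y;      // observed C
  vector[N] u;      // east-west velocity
  real dt;
}
parameters {
  real<lower=0> sigma;
  vector[k] theta;  // k spline coefficients instead of one beta per time point
  real init;
}
transformed parameters {
  vector[N] beta_u = B * theta; // smooth beta_t at every time point
  vector[N] C;
  C[1] = y[1] + init;
  for(i in 2:N){
    C[i] = C[i-1] + (0.0001 * beta_u[i-1] * dt * u[i-1]);
  }
}
model {
  sigma ~ normal(0, 1);
  init ~ normal(0, 1);
  theta ~ normal(0, 1);
  y ~ normal(C, sigma);
}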

There’s also gp stuff in brms that you can use too (you could do the above fit with gp(time, k = 10, c = 5/4) and probably get something).

That high frequency term is probably something you gotta do something special for if you want to extract it.

That useful at all?

Thanks Ben

I do, but I think the explanatory variable u is required to separate \beta and R. Otherwise I just get an overall trend… which I could fit in lots of different ways.
Put another way, I know the main control on C is u\beta, I’m interested in the bit left over, i.e. the bit that doesn’t vary with u.

This is useful in showing how I could use a spline approach rather than the bins/boxes I have going on with my first model. Thanks for that.
I see the brms-generated model does lots of target += ... stuff; I got the impression that wasn’t the preferred way of writing models now. I don’t know if that is a requirement of how this model is written, or just an older coding style.

POMP does indeed look like the sort of thing I do.

Could you provide an example of one of your particle filters, maybe with a toy DGF?
Perhaps I could learn something there.

There are lots of variations on state-space modelling frameworks about, all pretty similar really, all using different terminology depending on whether their field is general stats, econometrics, control theory, ecology (fisheries), social science or pharma.

here’s another one :) https://nwfsc-timeseries.github.io/atsa-labs/index.html

I’m generally thinking my approach is correct; I mean, I have good reason to think my model is an accurate representation of the real-world DGF. It feels natural to keep my model close to the “known” way the world works, and it will certainly make writing the paper more convincing if my statistical model looks like the established real-world process run in reverse.
All I’m trying to do is build a model which can determine the underlying latent variables I believe to exist, and provide uncertainties for them given all the prior knowledge I have.
e.g. there are various bits of prior knowledge I want to include, such as a constant but uncertain term which scales everything at certain times. But in the interest of starting small I didn’t mention that.

e.g.
h \frac{dC}{dt} = A_t + R_t + u_t\beta_t

Which is why it’s important for me to code the model myself rather than use a framework; it needs to be flexible enough for me to add informative priors to bits of it.

Yeah, you could try it with ctsem; it will handle all the integration for you, but yes, it will restrict your prior choices, or at least make more complex priors a bit trickier to specify. You’re right that it wasn’t designed from the beginning to handle this sort of model; the scope has evolved with time, and usability and better docs are a few steps behind, but hopefully getting there… I was curious to try specifying the model, maybe it helps.

library(ctsem)

DIFFUSION=matrix(0,4,4)
DIFFUSION[3,3] <- 'diffusion_logR'
time=c(0,.3)
C=c(2,1.3)
u=c(5,2)
s=c(3,9)
id=1
dat=cbind(id,time,C,u,s)

m<-ctModel(n.latent=4, n.manifest=3, type='stanct',
  latentNames=c('C','u','logR','s'),manifestNames=c('C','u','s'),
  LAMBDA=matrix(c(1,0,0, 0,1,0, 0,0,0, 0,0,1),3,4),
  DRIFT=matrix(c(
    'state[2] * PARS[1,1] + exp(state[3]) + PARS[2,1] * (state[1]-state[4]) / PARS[3,1]',0,0,0,
    0,'drift_u',0,0,
    0,0,'drift_logR',0,
    0,0,0,'drift_s'),byrow=TRUE,4,4),
  PARS=matrix(c('B','K','z'),3,1),
  CINT=matrix(c(0,'cint_u','cint_logR','cint_s'),4,1),
  DIFFUSION=DIFFUSION,
  MANIFESTMEANS=matrix(0,3,1),
  T0VAR=diag(.01,4),
  MANIFESTVAR=diag(1e-5,3))

# ctModelLatex(m,textsize = 'tiny') #ctsem 2.9.2+ only

m$pars$indvarying<-FALSE

plot(m)

fit <- ctStanFit(datalong = dat,ctstanmodel = m,optimize=TRUE, optimcontrol=list(isloops=5,tdf=50))

summary(fit)
k=ctStanKalman(fit)
e=extract(fit)
g=ctStanGenerateData(fit,nsamples = 100,fullposterior = TRUE)

I played with models with the u covariate added in. I just ended up going with a linear model, nothing time varying.

I tried doing something like:

C_{t + 1} \sim N(C_t + u_t \beta_t + R_t + \beta_0, \sigma)

where \beta_t and R_t were low order splines, but when I looked at the time marginal effects there wasn’t really much going on.

I ended up fitting a simple linear model (assuming \beta and R are constant):

C_{t + 1} \sim N(C_t + u_t \beta + R, \sigma)

and this seemed to work pretty well, but that doesn’t seem like the level of detail you were going for.

I was going by: Estimating Non-Linear Models with brms to figure out how to do the fits.

This is the first model (with time-varying coefficients):

library(brms)
library(tidyverse) # for tibble / %>% / ggplot used below

df = tibble(Cp = x$C[2:1440],
            C = x$C[1:1439],
            u = x$u[2:1440],
            time = x$time[2:1440]) %>%
  mutate(time = time / max(time))

fit = brm(bf(Cp ~ C +
               beta * u +
               R,
             beta ~ s(time, k = 10) - 1,
             R ~ s(time, k = 10),
             nl = TRUE),
          data = df,
          family = gaussian(),
          chains = 1,
          control = list(adapt_delta = 0.9))

summary(fit)

plot(marginal_effects(fit, effects = "C"), points = TRUE, ask = FALSE)
plot(marginal_effects(fit, effects = "time"), points = TRUE, ask = FALSE)

predict(fit) %>%
  as.tibble %>%
  bind_cols(df) %>%
  ggplot(aes(time, Cp)) +
  geom_ribbon(aes(ymin = `Q2.5`, ymax = `Q97.5`), fill = "blue", alpha = 0.5) +
  geom_point(size = 0.1)

This is the second model (with constant coefficients):

priors2 = prior(normal(0, 1), nlpar = "beta") +
  prior(normal(0, 1), nlpar = "R")

fit2 = brm(bf(Cp ~ C +
               beta * u +
               R,
             beta ~ 1,
             R ~ 1,
             nl = TRUE),
          data = df,
          family = gaussian(),
          chains = 1,
          prior = priors2)

summary(fit2)

predict(fit2) %>%
  as.tibble %>%
  bind_cols(df) %>%
  ggplot(aes(time, Cp)) +
  geom_ribbon(aes(ymin = `Q2.5`, ymax = `Q97.5`), fill = "blue", alpha = 0.5) +
  geom_point(size = 0.1)

Here is a Stan function for doing the Kalman filtering, using notation mostly based on https://users.aalto.fi/~ssarkka/pub/cup_book_online_20131111.pdf:

dX/dt = A_c X_t + B_c U_t + N(0, Q)
y_t = HX_t + N(0, R),

real kf(
  matrix Ac,    // q*q, continuous-time state transition matrix
  matrix Bc,    // q*m, continuous-time input matrix
  matrix H,     // p*q, measurement matrix
  matrix U,     // N*m, inputs (one row per time step)
  matrix y,     // N*p, measurements (one row per time step)
  vector w,     // q, process noise variances
  vector r_obs, // p, measurement noise variances
  real dt,      // time interval
  vector X1,    // q, initial X at t=1
  matrix P1     // q*q, covariance at t=1
) {
  int N = dims(y)[1];  // number of time steps
  int p = dims(y)[2];  // number of observation variables
  int m = dims(Bc)[2]; // number of input variables
  int q = dims(Ac)[1]; // number of states

  vector[q] X[N];
  vector[p] v;                         // innovation (output residual)
  matrix[q, q] Q = diag_matrix(w);     // process noise covariance
  matrix[p, p] R = diag_matrix(r_obs); // observation noise covariance
  matrix[q, q] P;     // state covariance
  matrix[q, p] K;     // Kalman gain
  matrix[p, p] S;     // innovation covariance, aka F
  matrix[p, p] S_inv;
  matrix[q, q] A;
  matrix[q, m] B;
  matrix[q, q] I = diag_matrix(rep_vector(1., q));
  real ll = 0;        // log-likelihood

  // continuous time to discrete time
  A = matrix_exp(Ac * dt);
  B = (A - I) / Ac * Bc;

  // initialization
  X[1] = X1;
  P = P1;

  for (k in 1:N) {
    int k__1 = max(k - 1, 1);     // k-1, clamped to 1 for the first step

    /* time update */
    X[k] = A * X[k__1] + B * U[k]'; // predict state mean values
    P = A * P * A' + Q;             // predict state covariance matrix

    /* measurement update */
    v = y[k]' - H * X[k];       // output mean residual (innovation)
    S = H * P * H' + R;         // output covariance
    S_inv = inverse(S);
    K = P * H' * S_inv;         // Kalman gain

    // update state covariance (Joseph form, more stable)
    P = (I - K * H) * P * (I - K * H)' + K * R * K';
    //P = (I - K*H)*P;          // less stable

    X[k] += K * v;              // update states

    // accumulate log-likelihood
    ll -= (log_determinant(S) + v' * S_inv * v);
  }
  return 0.5 * (ll - p * N * log(2 * pi()));
}

I just simplified the code above from more specific code of mine. I haven’t tested it and probably missed a parenthesis or something, but it should give you the general idea of how to implement a Kalman filter in Stan. @Charles_Driver’s post above shows how to get your model into a general state-space representation.
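To actually use it, the function goes in a functions block and its return value gets added to the target. Roughly like the sketch below; the names and the 1-state / 1-observation setup are placeholders to show the wiring, not taken from the model earlier in the thread:

// Hypothetical wiring only.
functions {
  // ... paste the kf() function from above here ...
}
data {
  int<lower=1> N;   // number of time steps
  matrix[N, 1] y;   // observations
  matrix[N, 1] U;   // inputs
  real dt;          // sampling interval
}
parameters {
  real<lower=0> w1; // process noise variance
  real<lower=0> v1; // measurement noise variance
  real<upper=0> a;  // continuous-time dynamics coefficient (kept negative for stability)
  real b;           // input coefficient
}
model {
  matrix[1, 1] Ac = rep_matrix(a, 1, 1);
  matrix[1, 1] Bc = rep_matrix(b, 1, 1);
  matrix[1, 1] H  = rep_matrix(1.0, 1, 1);
  vector[1] X1    = rep_vector(y[1, 1], 1); // start the filter at the first observation
  matrix[1, 1] P1 = rep_matrix(1.0, 1, 1);

  a ~ normal(0, 1);
  b ~ normal(0, 1);
  w1 ~ normal(0, 1);
  v1 ~ normal(0, 1);

  // marginal log-likelihood from the Kalman filter
  target += kf(Ac, Bc, H, U, y, rep_vector(w1, 1), rep_vector(v1, 1), dt, X1, P1);
}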

A great resource on Bayesian filters is found here https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-Python.


Unfortunately I don’t have that example at hand (maybe in a backup drive somewhere in the world). I tend to agree with you that filtering is the best approach to modeling dynamic systems – but of course that depends on what you are trying to describe/explain, and “accurate representation” is always subjective when dealing with real data (“all models are wrong” kind of thing and all).

There are plenty of examples of filters online, though, and the ones @lukas-rokka mentioned seem to be very accessible ones in Python, and the books will have comprehensive descriptions from scratch. Since he also went through the trouble of writing a Kalman filter function, you have nearly everything you need to reproduce your previous R implementation with algorithm="Fixed_param", as I mentioned before.

In case you don’t want to go through an entire book on the subject, at opposite ends of the spectrum are an intuitive primer on the Kalman filter, and a more general/formal introduction to particle filters (or this one).


Thanks very much everyone for these insights and ideas,

I’ve got a lots of things to follow up :)
