# Kalman-type model

## (Simplified) model

I have a system that evolves in discrete time, and a signal x \in \mathbb{R} can be measured at every time step. The state of the system s \in \mathbb{R} has the constraint s > 0, and the signal x depends on the state as

x = as + b,

where a \in \mathbb{R} and b \sim \text{normal}(\bar{b}, \sigma_0).

The state s is only known for a few times because it is expensive to measure it directly. The model for the evolution of the state is

\Delta s_n := s_n - s_{n-1} = \Delta s_{n-1} + c,

where c \sim \text{normal}(\bar{c}, \sigma_1), \bar{c} < 0, and \Delta s > 0.

My task is to infer the current state s_N from x_N.

Note: the model is more complex, but I have simplified it to highlight my main questions.

## Code

```stan
data {
  // N is the total number of observations.
  int<lower=1> N;
  vector[N] x;

  // Ny is the number of direct observations of the state.
  // y denotes the observed states.
  int<lower=1, upper=N> Ny;
  array[Ny] int<lower=1, upper=N> id_y;
  vector<lower=0>[Ny] y;
}
parameters {
  // Parameters describing the evolution of states.
  real<lower=0> s_mu, s_sigma;
  // Unknown states are modelled as parameters.
  vector<lower=0>[N - Ny] s_unknown;

  // Parameters relating x and s.
  real x_alpha, x_beta;
  real<lower=0> x_sigma;
}
transformed parameters {
  // Assign s to the known (y) and unknown (s_unknown) states.
  vector<lower=0>[N] s;
  {
    int count_y = 1, count_s_u = 1;
    for (n in 1:N) {
      if (n == id_y[count_y]) {
        s[n] = y[count_y];
        if (count_y < Ny) {
          count_y = count_y + 1;
        }
      } else {
        s[n] = s_unknown[count_s_u];
        count_s_u = count_s_u + 1;
      }
    }
  }
  // If I remove <lower=0>, then I no longer have any problem with the sampling.
  vector<lower=0>[N - 1] d_s;
  d_s = s[2:] - s[1:N - 1];
}
model {
  d_s[2:N - 1] ~ normal(-s_mu + d_s[:N - 2], s_sigma);
  x ~ normal(x_alpha * s + x_beta, x_sigma);
}
```


## Questions

1. My time series x has a length of around 500, but I only know around 20 states. Is modelling the unknown s as parameters sensible, or should I try something else? Do you have any advice or tricks for handling this situation?

I read the page on HMMs, and it is insightful, but to follow the semi-supervised section I would need something like log_integral_exp, which does not seem quite right. One option would be to use an HMM and discretize the problem, but a continuous state feels more natural for my setting.

2. When I sample, I get errors because at some times \Delta s < 0, which is forbidden by the model.

3. I enclose the count_ variables in braces because otherwise cmdstanpy complains that I am not allowed to use a parameter of type int. Is there another option for using local variables in the blocks?

Not sure why this didn’t show up on the orphans page. Sorry it hasn’t been answered. Hard model questions often languish here.

Yes, you can do that. With some time series models, you can integrate/marginalize out the missing data.

This is because your definition of d_s doesn’t respect the positivity constraint. It’s important in Stan to make sure that all parameter values satisfying the declared constraints are in the support of the posterior. If they are not, then Hamiltonian Monte Carlo can devolve into a very inefficient rejection sampler.

One way to keep something like a time series positive is to model it as unconstrained and then exponentiate. The other way is to work through all the algebra on the natural scale and constrain/transform the components making it up so that positivity is preserved.
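A minimal sketch of the first (exponentiation) approach under the simplified model. Note that putting the random walk on the log scale changes the model for the increments, so the names log_d_mu and log_d_sigma below are illustrative, not the original parameters:

```stan
parameters {
  real<lower=0> s_1;          // first state
  vector[N - 1] log_d_s;      // unconstrained log-increments
  real log_d_mu;              // illustrative drift on the log scale
  real<lower=0> log_d_sigma;
}
transformed parameters {
  // exp() makes every increment positive, so s is increasing and positive.
  vector<lower=0>[N - 1] d_s = exp(log_d_s);
  vector<lower=0>[N] s = cumulative_sum(append_row(s_1, d_s));
}
model {
  // A random walk on the log scale; this replaces the original
  // normal(-s_mu + d_s[n-1], s_sigma) dynamics, so it is a different model.
  log_d_s[2:] ~ normal(log_d_mu + log_d_s[:N - 2], log_d_sigma);
}
```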

Also, some notes on the Stan code.

```stan
if (count_y < Ny) {
  count_y = count_y + 1;
}
```


can be simplified to:

```stan
count_y += count_y < Ny;
```


This can be a one-liner:

```stan
vector<lower=0>[N - 1] d_s = s[2:] - s[1:N - 1];
```


The biggest thing you could do for efficiency is remove the n == id_y[count_y] test, which can be precomputed, and unfold that whole loop in the transformed data block. All you need from the block is an indication of which index defines s[n] in the cases where n == id_y[count_y] and which index in the other cases. If it matters and this is unclear, I’d be happy to explain in more detail.
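A minimal sketch of that precomputation. The array id_s_unknown is a made-up name for illustration; it is built once as data, and multiple indexing then replaces the per-iteration test:

```stan
transformed data {
  // Indices of the states without direct observations,
  // computed once instead of testing n == id_y[count_y] every evaluation.
  array[N - Ny] int id_s_unknown;
  {
    int count_y = 1;
    int count_s_u = 1;
    for (n in 1:N) {
      if (count_y <= Ny && n == id_y[count_y]) {
        count_y += 1;
      } else {
        id_s_unknown[count_s_u] = n;
        count_s_u += 1;
      }
    }
  }
}
transformed parameters {
  vector<lower=0>[N] s;
  s[id_y] = y;                  // known states
  s[id_s_unknown] = s_unknown;  // unknown states
}
```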

That’s good. The braces define a local scope and thus the count_XXX variables are local variables, which can be integers.

Stan won’t let you declare a parameter of type int. And no, there are no other options than using the block syntax to make a variable local that would otherwise be block-level.

Thank you for your time and advice; it is much appreciated. I am not a Bayesian; I just have to choose the right tools for the problems the company poses (little data, uncertainty, costly experiments), so it helps to know that sampling from a very high dimensional space, filling in missing data with parameters, is not wrong.

The model has evolved since then. Now I declare \Delta s > 0 first and then define s as a cumulative sum, and it seems to work fine. Known states are modelled as data, Y[n_k] \sim \text{normal}(s[n_k], \sigma), with a “known” \sigma coming from measurement errors.
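For reference, a minimal sketch of that reparameterization. Here sigma_y is an assumed data variable holding the known measurement error; the name is made up for illustration:

```stan
data {
  int<lower=1> N;
  int<lower=1, upper=N> Ny;
  array[Ny] int<lower=1, upper=N> id_y;
  vector<lower=0>[Ny] y;
  real<lower=0> sigma_y;        // assumed known measurement error
}
parameters {
  real<lower=0> s_1;
  // Declaring the increments with <lower=0> makes Delta s > 0 hold
  // by construction, so every draw satisfies the constraint.
  vector<lower=0>[N - 1] d_s;
  real<lower=0> s_mu, s_sigma;
}
transformed parameters {
  vector<lower=0>[N] s = cumulative_sum(append_row(s_1, d_s));
}
model {
  // Note: the truncation implied by <lower=0> is not normalized here.
  d_s[2:] ~ normal(-s_mu + d_s[:N - 2], s_sigma);
  // Known states enter as noisy measurements of the latent s.
  y ~ normal(s[id_y], sigma_y);
}
```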

> The biggest thing you could do for efficiency is remove the n == id_y[count_y] test,

Not sure what to do. Probably, something like

```stan
for (n in 1:Ny) {
  s[id_y[n]] = y[n];
}
```


And then compute in the transformed data block an array with the indices of the unknown states, and run another for loop over them. I suppose the problem with my posted code is that for every sample Stan has to test n == id_y[count_y], which could instead be computed only once as data. What did you have in mind?