Covariates in Hidden Markov Model with hmm_marginal

Hi everyone,

I am working on a Hidden Markov Model, which will eventually be used for analysing species detections from camera trap data.

Currently I have a model which uses the new hmm_marginal function - the full code for the model is below. The model has two hidden states (1 = unoccupied, 2 = occupied) and a Poisson emission process representing the number of detections at each time step (e.g. the number of camera trap photos taken in each 30-minute block of time). The model takes data from multiple sites, each with a different number of time steps. The model appears to be working well, and successfully recovers the true parameter values when I give it simulated data.

I would like to extend the model to include one or more time-varying covariates for the state transition probabilities. However, my understanding is that doing so requires assuming that the transition probability matrix is also time-varying, and I am unsure whether hmm_marginal allows for this. My question is whether I can adapt my current model to include covariates, or whether I need to re-write it in a way that avoids using hmm_marginal.

Thanks!

My model code:

data{
  int sites; // Number of sites, indexed s
  array[sites] int t_steps; // Number of time steps, indexed t - differs per site
  int t_max; // Number of time steps at the site with the longest time series
  array[sites, t_max] int y; // Observed data - NA padding replaced with -9999 (never read, since each site is only indexed up to t_steps[s])
  int<lower=0> K; // Number of hidden states (K = 2: 1 = unoccupied, 2 = occupied)
}

parameters{
  // Rows of the transition matrix
  simplex[K] t1;
  simplex[K] t2;
  
  // Initial state probabilities
  simplex[K] delta;
  
  // Parameters of the measurement (emission) model
  positive_ordered[K] lambda; 
}

transformed parameters{
  matrix[K, K] Theta;
  
  // Build the transition matrix from its rows
  Theta[1, :] = t1';
  Theta[2, :] = t2';
}

model{
  // Priors
  lambda ~ exponential(1);
  
  delta ~ dirichlet([9, 1]'); // More likely to begin in state 1 (unoccupied)
  
  t1 ~ dirichlet([1, 1]');
  t2 ~ dirichlet([1, 1]');
  
  // One HMM likelihood term per site, computed over only the first
  // t_steps[s] time steps so the -9999 padding is never evaluated
  for(s in 1:sites){
    matrix[K, t_steps[s]] log_omega;
    
    // Log likelihood of each observation in each possible state
    for(t in 1:t_steps[s]){
      log_omega[1, t] = poisson_lpmf(y[s, t] | lambda[1]);
      log_omega[2, t] = poisson_lpmf(y[s, t] | lambda[2]);
    }
    
    target += hmm_marginal(log_omega, Theta, delta);
  }
}
3 Likes

Last I checked, time-varying transition probabilities were not supported: Are time-varying transition probabilities supported by the new hmm_* functions?

AFAIK support for time-varying transition matrices has not been added to hmm_marginal, but I’m not super in the loop.

5 Likes

Here to agree with Max. @charlesm93 would know!

3 Likes

Welcome to Discourse! To my knowledge @mbjoseph is correct (see here 27 Hidden Markov Models | Stan Functions Reference) [edit: I now see that @vianeylb posted to say the same thing while I was typing; she’s the real expert here].

Do you need any resources/tips about implementing the forward algorithm in Stan to avoid hmm_marginal? Relevant documentation to get started is here 2.6 Hidden Markov models | Stan User’s Guide

Note that for problems like yours, you can exploit the fact that the true state is known whenever a timestep has a detection to reduce the number of terms that the forward algorithm needs to sum over. If a lot of timesteps have nonzero detections, then this can be a big efficiency gain (though I have no idea whether it is likely to be competitive with whatever hmm_marginal is doing under the hood).

Tangentially related: note that in the model without time-varying transition probabilities, it might be reasonable for you to simplify your model by computing delta based on the equilibrium frequencies implied by the transition matrix, rather than estimating them separately. But once you no longer assume constant transition probabilities, this idea goes out the window.
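For the two-state model above, a minimal sketch of that simplification (this assumes delta is removed from the parameters block and its Dirichlet prior is dropped):

transformed parameters{
  // Stationary distribution implied by the transition matrix, i.e. the
  // solution of delta' * Theta = delta' for a two-state chain
  simplex[K] delta;
  delta[1] = Theta[2, 1] / (Theta[1, 2] + Theta[2, 1]);
  delta[2] = Theta[1, 2] / (Theta[1, 2] + Theta[2, 1]);
}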

2 Likes

Thanks everyone for your advice! It’s unfortunate that the feature isn’t in hmm_marginal yet, although I see from the replies to the link that @mbjoseph posted that it’s coming in the next release.

I’ve had a look at implementing the forward algorithm by adapting the code in Luis Damiano’s tutorial, starting with just one site (code below). It seems to work with my simulated data, so now I’ll look at extending it to include multiple sites and then covariates.

@jsocolar, for some species at least there could be quite a lot of timesteps with non-zero detections (although I guess this depends to an extent on how long a timestep is defined to be, which is something I’ll need to think about some more before analysing the real data). It could definitely be useful to have an increase in efficiency in these cases. How would I go about implementing this pruning in the forward algorithm?

data{
  int<lower=0> t_steps; // Number of timesteps, indexed t
  int<lower=0> K; // Number of hidden states (here K = 2: 1 = unoccupied, 2 = occupied)
  array[t_steps] int<lower=0> y; // Observed data (counts of detections at each time t)
}

parameters{
  // State model
  simplex[K] delta; // Initial state probabilities
  array[K] simplex[K] Theta; // Rows of the transition probability matrix
  
  // Observation (emission) model
  positive_ordered[K] lambda; // Expected count of detections for each state k
}

transformed parameters{
  array[t_steps] vector[K] logalpha; 
  
  { // Forward algorithm: logalpha[t, j] = log p(y_{1:t}, z_t = j)
    array[K] real accumulator;

    // Initialisation: loop over states so each gets its own emission term
    for (j in 1:K)
      logalpha[1, j] = log(delta[j]) + poisson_lpmf(y[1] | lambda[j]);

    for (t in 2:t_steps) {
      for (j in 1:K) { // j = current (t)
        for (i in 1:K) { // i = previous (t-1)
                         // Murphy (2012) Eq. 17.48
                         // belief state      + transition prob  + local evidence at t
          accumulator[i] = logalpha[t-1, i] + log(Theta[i, j]) + poisson_lpmf(y[t] | lambda[j]);
        }
        logalpha[t, j] = log_sum_exp(accumulator);
      }
    }
  } // Forward
}

model {
  // Priors
  lambda ~ exponential(1);
  
  delta ~ dirichlet([9, 1]'); // More likely to begin in state 1 (unoccupied)
  
  Theta[1] ~ dirichlet([1, 1]');
  Theta[2] ~ dirichlet([1, 1]');
  
  // Increment target
  target += log_sum_exp(logalpha[t_steps]); // Note: update based only on last logalpha
}
2 Likes

Inside the loop over timesteps, where you loop over i and j, you are computing the likelihood increments associated with four transitions (0 → 0, 0 → 1, 1 → 0, 1 → 1). If you have non-detections in both periods, then all four transitions contribute to the likelihood. But if you have detections in both periods, then only the last transition (1 → 1) contributes to the likelihood. If you have a detection followed by a non-detection, then only transitions 3 and 4 (1 → 0 and 1 → 1) matter. If you have a non-detection followed by a detection, then only transitions 2 and 4 (0 → 1 and 1 → 1) matter. You can use if/else statements to break these cases apart and thus avoid the computation associated with any of the disallowed transitions; a sketch is below.
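Here is a minimal sketch of what that could look like in the forward recursion from your code, under the extra assumption (implicit in the case analysis above, but not enforced by a Poisson emission with positive lambda[1]) that an unoccupied site cannot generate detections, so y[t] > 0 implies state 2:

// Sketch only: assumes y[t] > 0 implies the occupied state (state 2)
for (t in 2:t_steps) {
  if (y[t] > 0) {
    // Current state known to be occupied; state 1 is disallowed
    logalpha[t, 1] = negative_infinity();
    if (y[t-1] > 0) {
      // Transition 1 -> 1 only
      logalpha[t, 2] = logalpha[t-1, 2] + log(Theta[2, 2])
                       + poisson_lpmf(y[t] | lambda[2]);
    } else {
      // Transitions 0 -> 1 and 1 -> 1
      logalpha[t, 2] = log_sum_exp(logalpha[t-1, 1] + log(Theta[1, 2]),
                                   logalpha[t-1, 2] + log(Theta[2, 2]))
                       + poisson_lpmf(y[t] | lambda[2]);
    }
  } else if (y[t-1] > 0) {
    // Previous state known to be occupied; transitions 1 -> 0 and 1 -> 1
    for (j in 1:K)
      logalpha[t, j] = logalpha[t-1, 2] + log(Theta[2, j])
                       + poisson_lpmf(y[t] | lambda[j]);
  } else {
    // Neither state known; full update over all four transitions
    for (j in 1:K) {
      for (i in 1:K)
        accumulator[i] = logalpha[t-1, i] + log(Theta[i, j])
                         + poisson_lpmf(y[t] | lambda[j]);
      logalpha[t, j] = log_sum_exp(accumulator);
    }
  }
}

Note that the t = 1 initialisation needs the same case split, and the pruning is only exactly equivalent to the full recursion if the state-1 emission really does put zero probability on detections.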

1 Like

That’s great, thanks!

In some particularly simple special cases it might be possible to encode time-varying state transition structure by augmenting the space of hidden states and adding some additional state to track some internal “clock”.

I doubt this would apply to your application: if there are time-varying covariates that influence transition probabilities (I am guessing this could be something like the observed temperature at each site at timestep t, or some measurement of the relative attractiveness of site s at time t compared to neighbouring sites), it seems like it should be relatively straightforward to include those dependencies directly in the expression for the log-probability term in the forward algorithm.

If we didn’t have time-varying covariates but instead had a particularly simple special case of time-varying state transition structure, it might be possible to encode the time variation by adding/augmenting artificial states into the state space and end up with a time-independent model:

Suppose the time-dependent state transition probabilities had some cyclical relationship, in sync with the time step duration, with period three. We could take the product of the original 2-element unoccupied/occupied state space \{u, o\} with a 3-element phase state space \{0, 1, 2\} to give a resulting 6-element product state space \{(u, 0), (u, 1), \ldots, (o, 2)\}. Then we could define a single 6 \times 6 time-independent state transition matrix that encodes three sets of transition probabilities for u \rightarrow u, u \rightarrow o, o \rightarrow u and o \rightarrow o – one set for each of the three “clock” phases \{0, 1, 2\} – as well as encoding the rule forcing the clock phase component of the state to advance with time, i.e. the probability of any transition (x, \tau) \rightarrow (x^{\prime}, \tau^{\prime}) must be zero unless \tau^{\prime} = \tau + 1 \mod 3.
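Concretely, ordering the product states as (u, 0), (o, 0), (u, 1), (o, 1), (u, 2), (o, 2) and writing \Gamma^{(\tau)} for the 2 \times 2 matrix of unoccupied/occupied transition probabilities used at clock phase \tau, the 6 \times 6 matrix would have the block-cyclic form

\boldsymbol{\Gamma} = \begin{pmatrix} \mathbf{0} & \boldsymbol{\Gamma}^{(0)} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \boldsymbol{\Gamma}^{(1)} \\ \boldsymbol{\Gamma}^{(2)} & \mathbf{0} & \mathbf{0} \end{pmatrix}

where each \mathbf{0} is a 2 \times 2 block of zeros (this ordering is just one possible convention).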

This technique to turn a time-dependent model into a time-independent one by embedding some kind of artificial “clock” in the state space probably isn’t a great idea if it can be avoided, as the computational work required for HMM algorithms tends to scale like \mathcal{O}(S^2) or worse for S hidden states – and if you need to infer model parameters that control the state transition or observation (emission) models then maybe inflating the state space also explodes the number of parameters that need to be estimated.

1 Like

Hi all,
It’s cool to see people are using the hmm routines :)

Regarding adding a new feature, the first step is to create a GitHub issue on the stan-math repo. If I understand correctly, we want to overload hmm_marginal so that users can pass an array of transition matrices, so an object

matrix[2, 2] theta[K]

where K is the number of hidden states, right? We’ll also want to extend hmm_latent_rng and hmm_hidden_state_prob.

I don’t think we got around to testing the performance of hmm_marginal and the other hmm routines (beyond unit tests). I’d be interested to hear feedback from users.

2 Likes

Hi,

Thanks for linking the repo! I see from @betanalpha’s reply to @mbjoseph’s post (Are time-varying transition probabilities supported by the new hmm_* functions? - #3 by martinmodrak) that time-varying transition matrices are coming in the feature’s next iteration - is it still worth creating an issue on the repo in this case?

From my understanding (which may be wrong, as I’m pretty new to HMMs!) we want to pass an array which looks something like

matrix[K,K] theta[T]

where K is the number of hidden states and T is the number of time steps, with the values of theta at each timestep depending on one or more covariates.

There’s also the issue of having multiple time series (in my case camera trap sites) which are generally not the same length (camera traps may end up being deployed for different numbers of days, especially when in remote areas which take time to access). I’m guessing this means we actually need an array containing a separate transition matrix for each site/time combination (as this is what I did in R when simulating data to test the model on), but I’m not certain.
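For instance, here is a hypothetical sketch of what I have in mind - the covariate x and the coefficients beta0 and beta1 are made-up names, and the logit link is just one choice:

transformed parameters{
  // One transition matrix per site/time combination (K = 2 assumed);
  // x[s, t] would be declared in data, beta0 and beta1 in parameters.
  // Entries beyond t_steps[s] are left unset and never used.
  array[sites, t_max] matrix[K, K] Gamma;
  for (s in 1:sites) {
    for (t in 1:t_steps[s]) {
      real p12 = inv_logit(beta0[1] + beta1[1] * x[s, t]); // P(unoccupied -> occupied)
      real p21 = inv_logit(beta0[2] + beta1[2] * x[s, t]); // P(occupied -> unoccupied)
      Gamma[s, t] = [[1 - p12, p12], [p21, 1 - p21]];
    }
  }
}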

That makes sense to me – for a time-varying transition matrix, in general you need to store a whole state transition matrix for each time step.

Makes sense. The HMM for each site contributes a term to the overall log-probability. So perhaps in Stan you could have an outer loop over sites, then inside that loop perform the HMM forward calculation with all the site-specific data, such as the observation sequence, covariate sequence, and site-dependent transition matrices.

To encode the site-specific data (sequences of observations, for example) into the value types offered by Stan, one technique could be to pack them all into a single array (with respect to some ordering of the sites), then use an auxiliary array for book-keeping to store the lengths of each subarray, so you can figure out the range of indices of each site-specific subarray at runtime and avoid adding spurious transitions between observations made in different sites.

Some ideas in: 8.2 Ragged data structures | Stan User’s Guide
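A minimal sketch of that packing idea, with illustrative names (not from the model above):

data{
  int<lower=1> sites;
  int<lower=0> N_total; // Total number of observations across all sites
  array[N_total] int<lower=0> y_flat; // All sites' observations, concatenated
  array[sites] int<lower=0> t_steps; // Length of each site's subsequence
}
// ... parameters etc. ...
model{
  int pos = 1; // Start index of the current site's subsequence in y_flat
  for (s in 1:sites) {
    // Slice out site s's observations; the forward algorithm then runs
    // entirely within the slice, so no transitions cross site boundaries
    array[t_steps[s]] int y_s = segment(y_flat, pos, t_steps[s]);
    // ... forward algorithm for site s on y_s, incrementing target ...
    pos += t_steps[s];
  }
}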

1 Like

Yes. @betanalpha, @vianeylb and I wrote the hmm routines in Stan (with @Bob_Carpenter reviewing the PR). This is it: you’re talking to the folks developing the next iteration of the feature. We just never got to it.

Let’s use the notation in the reference manual. By “number of hidden states”, I meant the number of unobserved discrete variables, rather than the cardinality of the hidden state space. I realize I misread your code and didn’t use the correct notation. We’ll go with what you have and use N instead of T, and \Gamma instead of \theta, to be consistent with the manual. So we want to pass in

matrix[K, K] Gamma[N]

The other arguments and the outcome of hmm_marginal, hmm_latent_rng and hmm_hidden_state_prob remain unchanged. @vianeylb can you confirm? I’ll then write a GitHub issue.

3 Likes

That’s great, thanks for your help!

1 Like

A few comments: the original prototype code was written to allow for varying transition matrices, but we implemented the simpler functionality in Stan first. Extending the current implementation to varying transition matrices is actually straightforward, because the Jacobian-adjoint products are composed of products of the form \boldsymbol{\Gamma} \cdot \boldsymbol{\alpha}_{n} and the only change needed is replacing each \boldsymbol{\Gamma} with \boldsymbol{\Gamma}_{n}.
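For reference, a sketch of the forward recursion with time-indexed transition matrices, in the notation of the functions reference and under the convention that \boldsymbol{\Gamma}_{n} governs the transition into step n:

\boldsymbol{\alpha}_{1} = \boldsymbol{\omega}_{1} \circ \boldsymbol{\rho}, \qquad \boldsymbol{\alpha}_{n} = \boldsymbol{\omega}_{n} \circ \left( \boldsymbol{\Gamma}_{n}^{\top} \boldsymbol{\alpha}_{n-1} \right) \text{ for } n = 2, \ldots, N,

where \boldsymbol{\rho} is the initial state distribution, \boldsymbol{\omega}_{n} is the vector of emission likelihoods at step n, \circ is the elementwise product, and the marginal likelihood is the sum of the entries of \boldsymbol{\alpha}_{N}.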

Introducing the intermediate matrix[K, K] Gamma[N] into a program may impose a substantial memory/performance burden, especially if K and/or N is large, but that’s going to be unavoidable for a general HMM implementation.

In general, yes, each time series will require a separate HMM. That said, there are numerous ways to implement these multiple HMMs efficiently, for example defining a matrix[K, K] Gamma[N] that covers all days once and then subsetting the array when calling each individual HMM (presuming that the transition matrices are the same across different sites).

2 Likes

Just wanted to see if there’s any update on this front - I’ve left a comment on the GitHub issue too (Allowing changing transition matrix for hmm suites · Issue #2678 · stan-dev/math · GitHub) but would love to have this feature in and lend a hand if I can.

1 Like

Hi @adi_r,
No updates. Developing this feature is on my to-do list. I’ll try and get to it in the coming weeks. Thanks for offering to help.

Hi @charlesm93 - thanks for the reply. Just wanted to check: how difficult/feasible would it be to also allow sparse transition matrices in the hmm_marginal and hmm_hidden_state_prob functions? I was wondering if this would provide efficiency gains in cases where the state space is massive but most transitions are impossible (my use case is one where the only possible transitions are from t->t and t->t+1).

1 Like

Hi,
It’s feasible and it’s difficult :)

There is a path to doing this in C++, using Eigen’s support for sparse matrices, but what’s not clear to me is whether the automatic differentiation will immediately take advantage of the sparsity or whether this is something we need to configure. Another question is how we make this available in the Stan language. We need to come up with an API that lets users specify sparse matrices, while enabling operations on these matrices. This would mean supporting sparse matrices in general.

I know a few devs have looked into it, notably @stevebronder.

In your case, it sounds like the transition matrix is block diagonal. In theory, I could add an argument that allows you to specify the size of the block and then does operations on block diagonal matrices, but this seems too specific a case to warrant adding a new feature.

A short-term solution would be to code the relevant algorithm directly in Stan, using a sequence of 2 x 2 transition matrices. There are examples where the operations used by the Stan function are custom-written in Stan (for example [1806.10639] An Introduction to Animal Movement Modeling with Hidden Markov Models using Stan for Bayesian Inference by @vianeylb and colleagues). You can then see how the math simplifies in your problem and write functions accordingly.
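For example, here is a minimal sketch of a forward update exploiting that structure, assuming S states where state j can be reached only from j and j - 1, and reusing logalpha, Gamma, and log_omega in the style of the forward-algorithm code earlier in the thread (names are illustrative). This brings the per-step cost from \mathcal{O}(S^2) down to \mathcal{O}(S):

for (t in 2:T) {
  // State 1 can only be reached from state 1
  logalpha[t, 1] = logalpha[t-1, 1] + log(Gamma[1, 1]) + log_omega[1, t];
  for (j in 2:S) {
    // State j can only be reached from states j - 1 and j: two terms
    // instead of a sum over all S predecessors
    logalpha[t, j] = log_sum_exp(logalpha[t-1, j] + log(Gamma[j, j]),
                                 logalpha[t-1, j-1] + log(Gamma[j-1, j]))
                     + log_omega[j, t];
  }
}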

Would it simplify the implementation to consider an alternative where the transition probabilities have some parametric form?

So for instance, if you assume it is something like

for (t in 1:T) {
    for (i in 1:K) {
        Gamma[t][i, i] = Phi_approx(X[t] * B[i]); // Probability of remaining in state i
        Gamma[t][i, 3 - i] = 1 - Gamma[t][i, i];  // The other entry of row i (valid for K = 2)
    }
}

then that would be one possibility if K = 2. You wouldn’t need to save the whole Gamma matrix: you just need to estimate B and then pass in X and B.