Forward algorithm for hidden Markov models

I’ve been using Stan to fit HMMs (in an unsupervised manner) for a few months now and am curious about how to speed up evaluation of the forward algorithm, if anyone has any suggestions.

The data sets we commonly work with can be quite large (100,000+ to 1 million+) so any computational time we can save would be helpful. I asked this same question on twitter yesterday, but am not clear about the response, “Yes. But my guess is the problem isn’t in the computation time of the forward algorithm.” I’ve read through Michael Betancourt’s post on mixture models as well as through other posts here on HMMs, and of course through the manual.


Welcome to the forums!

Just fyi, it helps to provide a working Stan program. It’s really hard to give specific advice when it’s guesswork as to what you’ve done.

The first thing to define is what you consider “computational time;” it might seem like it’s easy to pin down, but it’s tricky. For anyone following along, this is the twitter thread. Computational time could reasonably mean these things:

  • wall time (what you’ve quoted in that thread)
  • time per iteration
  • time per effective sample (if you’re using MCMC, this is the quantity you should be interested in)

Then there’s scaling. That’s another can of worms, but increasing data can make time per effective sample faster. (I can think of at least two ways data can scale here: more observations with the same groups or more groups with the same number of observations. Those will have different properties.)

If this is what you’re really interested in, then here are a few things that can be done. If you provide the Stan program, I’m sure we can come up with more suggestions that are specific:

From Stan (easier):

  • use the data structures correctly so there’s memory locality as you access data and parameters
  • minimize recomputation; it’s not just a matter of the computation, but this adds to the expression graph
  • prefer transformed data and generated quantities to defining anything in the transformed parameters or model blocks
  • use compound functions in Stan whenever possible; these are typically written so they don’t add as much to the autodiff stack.

From C++ (harder):

  • try to make things as memory local as possible; this will be different for different sizes of data!
  • custom code to add parallelization
  • add custom code to add specialized derivative calculations
  • reduce the expression graph using custom functions

To actually get better time per effective sample (this is probably where you should focus unless you’ve coded your model really inefficiently):

  • look at treedepth; find a way to have that number drop by one or two (that’ll roughly make it go 100% faster or 400% faster; this is why speeding up the computation per evaluation is ok, but it’s really better to not focus just on computation)
  • better priors
  • look at stepsize and inverse mass matrix
  • use non-centerered reparameterization when appropriate (you’ve got to determine when it is first if you want to be optimal about it)

Hope that helps a little. The next thing to do is to post the full model if you want something deeper than generic guidelines.

1 Like

Thanks! I’ll try to go through each suggestion more carefully to fully understand each possibility. I implement the forward algorithm as described in the manual.

Here’s one model for the earthquake data example in “HMMs for time series, an intro using R”.

dat <- read.table(“”) <- list(K=dim(dat)[1], N=3, obs=dat$V2)

// Fitting an HMM to the earthquake data

data {
  int<lower=0> K; // length of the time series
  int<lower=0> obs[K]; // data
  int<lower=1> N; // number of states 

parameters {
  simplex[N] theta[N]; // tpm
  positive_ordered[N] mu; // mean of poisson - ordered

model {
  vector[N] log_theta_tr[N]; // log, transposed tpm 
  vector[N] lp; // for forward variables
  vector[N] lp_p1; // for forward variables

  mu ~ gamma(0.1, 0.01); // assigning exchangeable priors 
                //(mus are ordered for sampling purposes)

  // transposing tpm and taking the log of each entry
  for (n_from in 1:N)
    for (n in 1:N)
      log_theta_tr[n, n_from] = log(theta[n_from, n]);

  lp = rep_vector(-log(N), N); // 

  for (k in 1:K) {
    for (n in 1:N)
      lp_p1[n] = log_sum_exp(log_theta_tr[n] + lp) + poisson_lpmf(obs[k] | mu[n]); 
    lp = lp_p1;

  target += log_sum_exp(lp);

[edit: escaped model code]

fit <- stan(file=“earthquakeHMM.stan”,,

As we’ll be fitting HMMs using a variety of state-dependent distributions for different problems/data sets, sometimes spline-based densities, incorporating covariates into both the state process and observation process, I focused on the forward algorithm, rather than ask a specific question about one model. I assume (and have discovered) that different formulations will have their own problems, so thought that maybe increasing evaluation speed of forward algorithm could at least help in general. I appreciate all of your suggestions!

I’ve run into loads of issues not uncommon for mixture models, perhaps related to not specifying initial values or good enough priors(?), and I frequently get the warning that I should increase adapt_delta, increase max_treedepth.

I think that the lp initialization and target increment is in the wrong place. The identation is right, but it doesn’t match the braces, so it looks like you only update the target once. I think it should be

for (k in 1:K) {
  vector[N] lp_p1 = rep_vector(-log(N), N);
  for (n in 1:N)
    lp_p1[n] += log_sum_exp(log_theta_tr[n] + lp) + poisson_lpmf(obs[k] | mu[n]); 
  target += log_sum_exp(lp_p1);

Thanks for looking at the code!

For a single time series, K being the number of observations and N the states, I’m not sure I understand why the target wouldn’t only be updated once (only after implementing the forward algorithm). And in this case, lp_p1 would be reset to the initial distribution after each observation? I’ll have a closer look.

Is there anything in the works for an HMM case study or possibilities for an HMM package (like moveHMM or similar) in the near future? I noticed this has been mentioned over the past year or so.

1 Like

Sorry, you’re absolutely right. That initialization is right, and it just keeps updating as you go along. I was getting confused about a standard mixture model, where you’d have everything inside the N loop.

We’d love to get an HMM case study if you’d like to contribute one. It’s stacked up on my to-do list. I’ll probably be rolling out one on multi-armed bandits before that.

I’d be happy to contribute an HMM case study! Théo Michelot and I are working on it at the moment. In part, we’re reproducing the example given in the moveHMM paper on analyzing movements of the wild haggis. If alright with you, will email over a draft once we’ve finished.

1 Like

We do have a StanCon notebook on HMMs:

But these seem to be pretty popular models, so the more good examples we have the better!

I’m working on adding a proper forward-backward algorithm to extract sufficient statistics for an efficient Stan implementation (at least of the basic stochastic matrix transition without covariates).

I’m also working on writing up all the basic algorithms.

As Jonah says, can’t have too many. I’d particularly like to see the movement HMMs done with circular statistics—the change-in-bearing for foraging states seem like they’d benefit particularly from this.

1 Like

Yes, I’ve seen Luis Damiano’s HMM implementations.

This week, Théo Michelot (who wrote all the code for moveHMM) and I wrote up the Stan code to replicate the haggis movement analysis example here:

The forward algorithm is of course implemented and we use an alternate parametrization for the parameters of the von Mises distribution. We included an example on how to include covariates in the transition probability matrix using the multinomial logit link. In the generated quantities blocks, we have the forward-backward algorithm and Viterbi (though still unsure how this should be applied). Is this what you’re referring to for the movement HMMs?

I need to finish writing everything but there’s not much left to do. We didn’t write code for the pseudo-residuals, though it shouldn’t be difficult to include. I’m hoping I can finish this soon and then any feedback would be greatly appreciated! Hopefully this can be useful as a Stan case study or just a general example of how to implement HMMs for analysis of movement data in Stan.

1 Like

I was referring to how the change-in-bearing component was parameterized. The approach of using a constrained real for an arbitrary angle,

real<lower=-pi(), upper=pi()> angle;

has the problem that even though -pi() and pi() are right next to each other on the circle, they are mapped to negative infinity and positive infinity on the unconstrained scale.

Instead, I was suggesting this:

parameters {
  unit_vector[2] xy;
transformed parameters {
  real <lower=-pi(), upper=pi()> angle = atan2(xy[2], xy[1]);

This way, angles that are close to each other are close to each other in parameter space. The parameter space is also kept relatively compact. I don’t know how well it’ll work. This would be for representing the parameter. As defined, you get a default uniform prior on angle. Putting a more complicated prior would involve dealing with the Jacobian, which is a bit tricky because of how the unit vectors are finessed (two unconstrained parameters with standard normal priors leading to a uniform on the circle).

1 Like

Ah, right! We’re using the same transformation of the concentration parameter (kappa) and angle mean (mu) of the von Mises distribution that’s used in moveHMM.

    # compute the working parameters for the angle distribution
    x <- kappa*cos(angleMean)
    y <- kappa*sin(angleMean)

Our parameter block contains the xy, just like you mentioned. We then compute anglemean and kappa in the transformed parameters block.

Code from moveHMM here: