Built-in HMM design


#1

If we are willing to assume that the transitions are a multi-logistic regression based on predictors, then I think we can get away with this interface:

real hmm_lpdf(T0[] y | matrix x, 
                       matrix beta,
                       vector phi,
                       (T0, T1, T2, ..., TN):real f,
                       T1[] alpha1, ..., TN[] alphaN);

where

  • y is an N-array of emissions, each of type T0
  • x is an (N x K)-matrix of covariates/predictors
  • beta is a (K x K) matrix of multi-logistic regression coefficients
  • phi is the initial distribution for the first emission
  • f is the emission function, taking parameter args of type T1,…,TN and returning density of T0 argument
  • alpha1, …, alphaN are K-vectors of parameters for the emission mixture

Implementing this would require the C++11 parameter packs, as we’ve started using elsewhere, to deal with the polymorphism of T1, …, TN (parameters to f).

As usual with multi-logit, there is a decision to make as to whether to use the identified (K-1) x K matrix of regression coefficients or the redundant (K x K). Given all our recent work, I’m thinking we want the redundant ones with soft centering.

This is also not vectorized. Usually in HMMs, the sequences are ragged. We could do something like map_rect and hack some of this in, but it’ll be messy. There’s no shared
computation across sequences, although there could be the usual reduction in expression graph size given shared paramters beta, phi, alpha. So a question remains of whether we put off vectorization until we have ragged arrays.

Also, this will restrict us to emissions thar can be represented by Stan data types. HMMs don’t have this restriction—their emissions could be a count, a boolean, and a vector.


#2

Happy to follow whatever is most feasible at the moment.

In this case, if I understood everything correctly from before, we’ll use the complete-data likelihood (using the forward-backward algorithm) rather than the likelihood (which only uses the forward-algorithm) to implement this more efficiently in Stan.

Is the next step to program the forward-backward algorithm in C++? Should @LuisDamiano and I write pseudo code to outline what everything should look like in the background?


#3

The complete data likelihood is is p(y, z \mid \cdots) where y are observations and z the latent state sequence and \cdots represents model parameters. We don’t know z and can’t sample it because it’s discrete.

The likelihood is p(y \mid \cdots). Frequentists like to treat z as “missing data” (i.e., potentially observable), because otherwise the expression p(y, z \mid \cdots) isn’t coherent. That’s why they call p(y, z \mid \cdots) the “complete data likelihood”. In the Bayesian world, we’re fine with a p(y, z \mid \cdots) that treats z as an unobservable parameter. This doesn’t change anything algorithmically, just philosophically.

Forward algorithm: computes the likelihood p(y \mid ...) by marginalizing z out of p(y, z \mid \cdots).

Forward-backward algorithm: Computes two marginal posteriors,

  1. p(z_n = k \mid y, \cdots)

  2. p(z_n = k, z_{n+1} = k' \mid y, \cdots)
     
    Together, these compute the E-step of the expectation-maximization (EM) algorithm.

For Stan, we can implement HMMs in one of two ways:

  1. Use the forward algorithm and add \log p(y \mid \cdots) to the target. This is what the examples in the manual do. And it’s the best you can do from the Stan object language.

  2. Use the forward-backward algorithm and use the two marginal posteriors as sufficient statistics. This has to be done in C++ by pulling out the double values of y and anything in the parameters \cdots. We then add the following terms to the target:

    • Emissions:
      \sum_n \sum_k p(z_n = k \mid y, \cdots) \times \log p(y_n \mid z_n = k, \cdots)

    • Transitions:
      \sum_n \sum_k \sum_{k'} p(z_n = k, z_{n+1} = k' \mid y, \cdots) \times \log p(z_n = k, z_{n+1} = k' \mid y, \cdots)

    The transitions further simplify if there are no predictors in \cdots and just a stochastic matrix.

The exercise is to show that you get a term equal to the log likelihood plus a constant either way, which is what we need.

The best common presentation I know of this is in Bishop’s Pattern Recognition and Machine Learning or Durbin et al.'s Biological Sequence Analysis (I’m particularly fond the of the latter). But there are probably good class notes for this from natural language processing or biostatistics and more general versions in more general graphical modeling books/classes (this is a simple case where observations form a linear sequence—the algorithm can be generalized to directed acyclic graphs).


#4

Right, I should have been clear and said that the form of the complete-data likelihood is used in implementation of the EM algorithm, not that we’ll use it to sample from the states. Thanks for adding a clear description and for the resources.

I’ll read over https://github.com/stan-dev/stan/wiki/Introduction-to-Stan-for-New-Developers and get started on item 2. Will likely have some questions in the near future.


#5

Model

This is my understanding of the proposed model, please correct me if I’m wrong.

  1. We’re assuming univariate emissions.
  2. We’re allowing for time-varying transition probabilities (input-driven transitions for the hidden states).
  3. We’re also allowing for one user-chosen emission density.
  4. No multivariate emissions.
  5. No state-dependent density. For example, we rule out cases where the emissions follow one distribution in one state and another distribution (say, fatter tails) in other states.

Questions

  1. I’d personally start with a softmax regression with soft centering, but I’d also assume that the K-1 parametrization may make convergence way easier. Would it be reasonably easy to change this parametrization later?
  2. I’m not following the ragged sequences bit. I do understand that Stan has no built-in ragged structure, but I’m not sure why we’d need it.
  3. Assuming the emission array is univariate, will the emission function really take T1, …, TN parameters? Say we want to fit a model with two states and Gaussian density, wouldn’t we not need only two means and two standard deviations? If the latter is true, then there’s no fixed number of density parameters.
  4. What happens if we need to fit a time-homogeneous HMM (i.e. no covariates for the transition model)? Using a constant input would feel hacky and computationally inefficient. Would we need to implement two interfaces (perhaps same name but different number of arguments, assuming we can overload functions)?
  5. Is there a straightforward way to allow for regressors in the emission mean? Think for example of an Markov-Switching AutoRegressive model for time series (i.e. AR1 with a pair of regression coefficients for each hidden states). Does the proposed interface accept non-parameters for alpha1, …, alphaN? If so, I think we may use these quantities to model emission means when needed (e.g. instead of being free parameters, these could be a linear combination of some elements of the data block and some free parameters).

Inference

I believe the main goal of having HMM built-in is to gain performance, mostly by avoiding autodifferentiation. As such, I’d stick with the forward algorithm since adding the forward quantity to the target is enough to get the sampler working. The forward-backward pass is optional and can be programmed in the generated quantities block. Same with Viterbi decoding.

Don’t get me wrong, I like smoothed probabilities a lot, but I think the first step should be getting the right posterior. After that, we can provide with convenience functions to compute the other optional quantities (e.g. an overloaded instance of hmm_lpdf with an extra argument where we store the smoothed probability or the jointly most likely path).

Finally, @Bob_Carpenter: I assume we’d need to implement this built-in density as a “template”. Is there a template for another model that you’d recommend us to take a look at? I was considering taking a loot at the GP work but you may know of another built-in model that is more or less similar than our use case.


#6

We can allow multivariate emissions that can be encapsulated in a Stan type. So the movement HMMs with bearing and distance could be represented either as vector or real[] emission types. What we won’t be able to do until we have tuples is allow heterogeneous emissions, like tuple<vector, matrix, int, real>

We don’t need to start there, but it seems like we’ll need it in the long run.

I was assuming a single mixture family, with parameters varying by state. Thus if you choose an appropriate density like Student-t, then the tails can vary based on the degrees-of-freedom parameter.

Is there motivation for letting the families vary—do they even need the same support?

I want to keep the complexity bounded rather than targeting the most general thing possible. But this is a judgement call.

We’ve found soft-centering to make convergence easier than the hard sum-to-zero constraint.

If we want to vectorize, we’ll observe multiple emission sequences of potentially different sizes.

Not univariate, but we can’t bound the number of parameters—the idea is that if there are K states, then you need T1[K], …, TN[K] so the mixture components each get their own parameters.

For efficiency, that’d make sense. There’s a lot of reduction that can happen if we know it’s time-homogeneous. It might also be easier to start there rather than with the more complicated case.

Yes, we can overload functions.

That wasn’t part of the design I laid out. I want to avoid having a huge interface that’s difficult to configure. Maybe we can build something simple to start and then build out.

That will only work if you can analytically compute the derivatives of the log density w.r.t. all the parameters. The forward-backward algorithm computes sufficient statistics, so will be much more efficient.

It might be useful to expose the marginals to users (forward-backward) or first-best (Viterbi, or even better, the generalized form that does efficient n-best).

What are smoothed probabilities in this context?

It’s not only going to require templates, it’s going to specifically require parameter packs to deal with the heterogeneity of arguments (different emission families require different parameters). The best thing to look at is the adjoint-vector product abstraction—it’s the only place we use parameter packs in the current code.

We can’t really do that because arguments in Stan are constant references. I do not want to add the complexity of allowing mutable references. Instead, we’ll solve this problem with tuple outputs.