Identifiability and convergence (Input-Output Hidden Markov Model)

Hey there!

I’m a student in Applied Stats trying to write three models in Stan for a Google Summer of Code Project. As stated in the description, part of the goal is to contribute my code as a Case Study and I’d personally love to present in the next StanCon if you find it interesting.

So far I’ve written a sampler for an Input-Output Hidden Markov Model. Basically, it’s a HMM where latent states dynamics are driven by input observable variables instead of a Markov chain (a not-so-Markov model in the end). In a sense, it’s just an arbitrary observation model whose parameters switch according to the (latent) output of a softmax regression. More about the model in the writeup.

I implemented the forward, forwards-backwards, Viterbi and some sort of backwards-sampling algorithms. It works but convergence is a bit sketchy. I’m confident about the results because they make sense based on what I know about the subject, but effective sample size and the Rhat statistics are worryingly bad (see by yourself).

Since the states are latent and the observation model is a mixture of Gaussian, there are so many hard-to-identify components that I can’t run more than one chain. I set up an ordered constrain to the mixture means and made them hierarchical hoping to improve convergence speed but there’s between-chain switching nonetheless.

I’ve read Michael Betancourt’s Case Study a few times but I still can’t come up with a better way to improve identifiability. (1) Do you think lack of natural location for the softmax regression parameters could be part of the problem? I added a prior normal(0, 0) but maybe it’s not enough. (2) Do you think I can break symmetry in some other way? (3) Any general advice about code style or performance will be much appreciated!


Does your I/O HMM just use covariates to help model the transitions? That’s not uncommon and there are Stan models doing that in many domains (like movement HMMs and population models in ecology). You may also want to look at conditional random fields (CRFs), which are Markov random fields with predictors, which have been applied to a lot of chain examples where you might otherwise use HMMs.

The mixture of Gaussian observation model is going to be problematic for R-hat. Especially if they’re multivariate (I always wonder how anyone got all those old speech recognition models to fit).

Yes, the lack of identifiability for softmax may be a problem.

I didn’t see the Stan models in the link to “the writeup”. Is there something else I should be looking at, like a GitHub repo?

And yes, we’re always happy to take submissions for StanCon and/or other case studies. I’ve been meaning to do something myself with HMMs and CRFs for ages but haven’t even gotten around to implementing CRFs.

Hey Bob! Thanks for your answer.

1. The model
The model uses covariates (input variables or control signals) to model the transitions. We run a softmax regression on the covariates to estimate the latent states at a given point in time (this is unsupervised - the output of the regression is not known), and we then use these probabilities to marginalize our observation models for each state.

Thanks for the suggestions, I didn’t know some of those models so I’ll try to check them out and get some inspiration.

2. Gaussian Mixture
Fortunately the output is univariate. Thing is: there are three Gaussian components for each hidden state (ie 3x4 components), and hidden states are latent too.

3. Softmax and identifiability
Any tip on this? The best I could come up with is a prior normal(0, 5) on all the parameters of the softmax regression. It’s a non-intercept regression btw.

4. Code
The whole thing is stored in this repo. It’s got a lot of research code, but you may want to see (1) the Stan code and (2) the R code that downloads the data and runs the model. The latter should run the model out of the box, next you can use shinystan to check the diags by yourself.

5. StanCon
The more I hear about the lineup the more hyped I get. Hope I can make my topic as interesting as possible to make it to the conf! Also, I’ll try to present this work or some further derivations at next’s year R-finance conference. I presented some Stan code this year and got to meet a few people interested in Stan and more generally Bayesian stats. I’m always happy to help spread the word.

For (3), you may want to try modeling only K-1 categories. Then each set of parameters is interpreted as log odds vs. that fixed category.

(2) sounds like a recipe for some pretty strong non-identifiability as you have moving components and states.

For the code (4), can’t you declare unA_ij as a matrix and define it by matrix multiplication? That’s more efficient than the loop. Every time I look at one of these complicated mixtures it makes me realize we need better vectorized mixtures built-in. But it’s a design challenge figuring out how to specify all the shared computations efficiently and on the right scale (log, log odds, linear). If you define accumulator in the forward algorithm as a vector, you can vectorize the innermost loop. Do you really need those alpha_tk or can you define them as generated quantities? As generated quantities, there’s no autodiff overhead. I didn’t follow the forwards-backwards bit. I found I only had to calculate the forward algorithm and then I could read the likelihood off the final row with a final log_sum_exp. I’ve been wondering if using more sufficient stats in forward-backward would be a win.

You want to vectorize that look over K in the model. It’s much more efficient. I don’t like variable names like “hyperparams”–it’s too much like naming a variable “variable”. They’re also too easy to get confused that way. The ony tricky vectorization is mu_kl, which will be

mu_kl ~ normal(hypermu_k, hyperparams[3]);

The others you just drop the loop and the j index.

The moveHMM package in R has a nice overview of the movement HMMs with covariates. The CRF paper by McCallum, Lafferty and Pereira also compares some similar alterntives and introduces CRFs—it’s in a natural languae setting, but there have been lots of other applications.

Do you really need to run Viterbi within the algorithm? What do you use it for? Viterbi only finds the first-best output (or the n-best in the natural generalization with backtracking).

1 Like