Mixture Transition Distribution model



Thanks for the help last time. I am trying to cluster a number of time series sequences into C clusters. Every time series has N states, and there are K unique states in the system. I am trying to build an MTD model. Each sequence is modelled like an HMM, except that the latent states form an L-th order Markov chain that follows a Mixture Transition Distribution. Every latent state has an emission distribution for the observable sequence Y. The transition distributions form a K x K column-major matrix in which each column is Dirichlet-distributed. The hyperparameters of these transition distributions are a property of the cluster. The lag distribution is also Dirichlet-distributed, and its hyperparameter is likewise a property of the cluster. Can anyone help me write a working model? I got stuck because Stan cannot handle discrete parameters.

Sorry, I'm not familiar with that kind of model. Maybe @martinmodrak can help you find the right person for it.


Hi @martinmodrak ,

Can you help me build this model in Stan? It is very similar to AR models. I got stuck because Stan doesn’t support discrete parameters, and there are two of them here: the cluster assignment c and the unique state k at lag l.


Looks like a tough one. One quick thought: couldn’t you rewrite this as a pure HMM? Assuming you have two clusters, each with K states and transition matrices \mathbf{T_1}, \mathbf{T_2}, you could create an HMM with 2K states and transition matrix:

\begin{pmatrix} \mathbf{T_1} & \mathbf{0}\\ \mathbf{0} & \mathbf{T_2} \end{pmatrix}

So the choice among clusters would be purely in the distribution of initial states. Credit where credit is due, this is inspired by some of the stuff @vianeylb (the resident HMM expert ;-) taught me.
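To make the block-diagonal idea concrete, here is a minimal Python sketch. The matrices `T1` and `T2` are made-up numbers, and the sketch assumes row-stochastic matrices (each row sums to 1); with a column convention the indices would be swapped. It only illustrates that a chain started in one block never leaves it, so the cluster is fixed entirely by the initial state.

```python
def block_diagonal(blocks):
    """Place square matrices along the diagonal of a larger zero matrix."""
    size = sum(len(b) for b in blocks)
    out = [[0.0] * size for _ in range(size)]
    offset = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, p in enumerate(row):
                out[offset + i][offset + j] = p
        offset += len(b)
    return out


def step(dist, T):
    """One step of the chain: new_dist[j] = sum_i dist[i] * T[i][j]."""
    n = len(T)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]


# Hypothetical per-cluster transition matrices with K = 2 states each.
T1 = [[0.9, 0.1], [0.2, 0.8]]
T2 = [[0.5, 0.5], [0.3, 0.7]]
T = block_diagonal([T1, T2])  # 2K x 2K = 4 x 4

# Start in the first state of cluster 1 and run the chain for a few steps.
dist = [1.0, 0.0, 0.0, 0.0]
for _ in range(5):
    dist = step(dist, T)

# Probability mass that has leaked into cluster 2's states (indices 2 and 3).
mass_cluster_2 = dist[2] + dist[3]
print(mass_cluster_2)
```

Because the off-diagonal blocks are zero, all uncertainty about the cluster lives in the initial state distribution, which is a continuous parameter that Stan can sample.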

In any case you would need to place some strong constraints on the transition matrices / observation matrices to avoid non-identifiability due to label switching (i.e. the likelihood being the same if you switch the transition matrices of two clusters). Alternatively, hard-labelling the clusters for a subset of the time series should also avoid this non-identifiability.

Does that make sense?


I need to write it as a higher-order Markov model, and MTD has some advantages there. The Stan documentation has code for AR models, which are almost the same as MTD, so I was thinking that someone from the Stan team might be able to help. I have problems inferring the cluster labels because they are discrete parameters. I also had problems assigning a unique state k to lag l. Can you explain how to marginalize the likelihood over the cluster assignment so that I can derive a cluster assignment for each sequence? And how to marginalize the likelihood over the k-th unique state being the l-th lag state, so that I can derive which unique state is the l-th lag state?
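For readers following along, the lag part of that question can be illustrated with a small Python sketch of the standard MTD transition probability, P(next = j | history) = sum over l of lambda_l * Q[state at lag l][j]. The function name, the numbers, and the row convention for `Q` are assumptions made for illustration (the model described above stores distributions in columns), and the history is treated as known here, which is not the case when the states are latent.

```python
def mtd_next_state_probs(history, lag_weights, Q):
    """Next-state distribution under an MTD model, with the lag summed out.

    history[l]     : state observed at lag l + 1 (history[0] is the most recent)
    lag_weights[l] : mixture weight of lag l + 1; the weights sum to 1
    Q[i][j]        : P(next state = j | lagged state = i), rows sum to 1
    """
    K = len(Q)
    probs = [0.0] * K
    for lam, i in zip(lag_weights, history):
        for j in range(K):
            probs[j] += lam * Q[i][j]
    return probs


# Hypothetical example with K = 2 states and L = 2 lags.
Q = [[0.9, 0.1], [0.2, 0.8]]
lag_weights = [0.7, 0.3]
history = [0, 1]  # state 0 at lag 1, state 1 at lag 2

probs = mtd_next_state_probs(history, lag_weights, Q)
print(probs)  # approximately [0.69, 0.31]
```

The discrete "which lag was used" indicator never appears as a parameter: it is summed out with the continuous weights `lag_weights`, which is the form a Stan model would need.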

I am probably misunderstanding something - I’ve never seen MTD models before and only had a very quick look at them before answering. It looks to me that the main problem for you is the clustering, not the MTD part, right? So, although not properly explained (sorry for that), my suggestion above is how to move from an HMM to clusters of HMMs, in the hope that if you can implement a non-clustered MTD in Stan you’ll know how to apply the same trick to model clusters of MTDs. So let’s figure out where you are stuck:

  1. Do you know how to implement an HMM in Stan?
  2. Do you know how to implement an MTD model in Stan? If so, could you share the code?
  3. Do you know how to implement any clustering in Stan (e.g. something like K-means)?
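As background for the third question, the usual way around discrete cluster labels is to sum them out of the likelihood and recover the assignment probabilities afterwards. The Python sketch below shows that arithmetic for a single sequence; the function names and the numbers are hypothetical, and the per-cluster log-likelihoods are assumed to come from whatever sequence model is used.

```python
import math


def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def cluster_posterior(log_weights, log_liks):
    """Marginalize a discrete cluster label for one sequence.

    log_weights[c] : log prior probability of cluster c
    log_liks[c]    : log p(sequence | parameters of cluster c)
    Returns the marginal log-likelihood and P(cluster = c | sequence).
    """
    joint = [w + l for w, l in zip(log_weights, log_liks)]
    total = log_sum_exp(joint)
    return total, [math.exp(j - total) for j in joint]


# Hypothetical numbers: two equally likely clusters, and the sequence
# fits cluster 0 better by 2 log units.
log_weights = [math.log(0.5), math.log(0.5)]
log_liks = [-10.0, -12.0]

marginal, posterior = cluster_posterior(log_weights, log_liks)
print(marginal, posterior)
```

In a Stan model the marginal term would be added to the target for each sequence (Stan provides a `log_sum_exp` function), and the posterior probabilities could be computed in generated quantities.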

I don’t really think the connection between AR processes and HMM/MTD is very strong (because the discreteness makes it very different), so I don’t think AR models would be helpful to you. Instead, understanding HMMs and the forward algorithm is almost certainly a necessary ingredient.
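For reference, the forward algorithm computes the likelihood of an observed sequence with the latent states summed out, which is what lets an HMM be fitted without discrete parameters. The following is a minimal Python sketch for a first-order HMM with discrete emissions; all names and numbers are made up for illustration, and it is not an MTD or a Stan implementation.

```python
import math


def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def hmm_log_likelihood(obs, log_init, log_trans, log_emit):
    """Forward algorithm in log space.

    obs             : list of observed symbol indices
    log_init[k]     : log P(first latent state = k)
    log_trans[i][j] : log P(next latent state = j | current latent state = i)
    log_emit[k][y]  : log P(observation = y | latent state = k)
    """
    K = len(log_init)
    # alpha[k] = log P(observations so far, current latent state = k)
    alpha = [log_init[k] + log_emit[k][obs[0]] for k in range(K)]
    for y in obs[1:]:
        alpha = [
            log_sum_exp([alpha[i] + log_trans[i][j] for i in range(K)])
            + log_emit[j][y]
            for j in range(K)
        ]
    return log_sum_exp(alpha)


def to_log(matrix):
    return [[math.log(p) for p in row] for row in matrix]


# Hypothetical two-state HMM with two observable symbols.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]

log_lik = hmm_log_likelihood(
    obs, [math.log(p) for p in init], to_log(trans), to_log(emit)
)
print(log_lik)
```

The cost is linear in the sequence length and quadratic in the number of states, instead of the exponential cost of enumerating every latent path.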

In any case, Bayesian clustering without any labels given from the outside is very hard to do well, because of the “label switching” non-identifiabilities. There probably are some ways to work around those (e.g. the label.switching R package), but it is a tough problem all on its own, so unless you are 100% sure you need the clustering I would try to avoid it. In particular, I would first fit an HMM; only if the HMM proved problematic (e.g. via posterior predictive checks) would I move to MTD, and only if MTD was problematic would I try something else (not necessarily clusters).
