Name for hidden Markov model lpdf

Hi everyone,

This PR implements an lpdf for hidden Markov models with a discrete latent state that is marginalized out. There has been a bit of back-and-forth about what the name of the density should be, with sensible arguments made on both sides.

Here are the candidates:

  • hmm_marginal_lpdf
  • hmm_lpdf
  • hidden_markov_model_lpdf

All are good options to me: the first because we do marginalize out the hidden states; the second because the user doesn’t pass the hidden states, so it should be obvious that they are marginalized; the third because it is a bit more explicit. The main concern is how these names interact with the conventions practitioners already use.

@vianeylb, @betanalpha, @Bob_Carpenter

6 Likes

For context, a hidden Markov model is specified through a joint probability density function over observations y_{1:N} and discrete hidden states z_{1:N} with the conditional decomposition

\pi(y_{1:N}, z_{1:N} \mid \theta) = \pi(y_{1:N} \mid z_{1:N}, \theta) \, \pi(z_{1:N} \mid \theta).

The particular conditional structure of \pi(y_{1:N} \mid z_{1:N}, \theta) and \pi(z_{1:N} \mid \theta) allows us to explicitly integrate out the discrete hidden states, giving the marginal probability density function

\pi(y_{1:N} \mid \theta) = \sum_{z_{1:N}} \pi(y_{1:N}, z_{1:N} \mid \theta).

It is this marginal probability density function that has been implemented.
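
For concreteness, here is a minimal numpy sketch (not the Stan Math implementation) of the forward algorithm that computes this marginal on the log scale. The Gaussian emission density and all variable names are illustrative assumptions:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def hmm_marginal_lpdf(y, log_Gamma, log_rho, mu, sigma):
    """log pi(y_{1:N} | theta) with the hidden states z_{1:N} summed out.

    y         : (N,) observations
    log_Gamma : (K, K) log transition matrix, [j, k] = log Pr(z_n = k | z_{n-1} = j)
    log_rho   : (K,) log initial state distribution
    mu, sigma : (K,) per-state Gaussian emission parameters (illustrative choice)
    """
    # forward recursion: log_alpha[k] = log pi(y_{1:n}, z_n = k | theta)
    log_alpha = log_rho + norm.logpdf(y[0], mu, sigma)
    for y_n in y[1:]:
        log_alpha = logsumexp(log_alpha[:, None] + log_Gamma, axis=0) \
                    + norm.logpdf(y_n, mu, sigma)
    # summing over the final hidden state yields the marginal density
    return logsumexp(log_alpha)
```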

The problem with calling this marginal density hmm_lpdf or similar is that most applied fields consider the joint density \pi(y_{1:N}, z_{1:N} \mid \theta) to be the hidden Markov model itself. This is reflected in the many hidden Markov model software packages that work with both the observed data and the hidden states.

The preliminary name hmm_marginal_lpdf was chosen to make explicit the automatic marginalization of the hidden states, relative to other packages that work with the full joint model.

Others have complained about “marginal” being ambiguous, but in this context there are only two natural marginals,

\pi(y_{1:N} \mid \theta) = \sum_{z_{1:N}} \pi(y_{1:N}, z_{1:N} \mid \theta)

and

\pi(z_{1:N} \mid \theta) = \int \mathrm{d} y_{1:N} \, \pi(y_{1:N}, z_{1:N} \mid \theta),

with the second being both a trivial marginal and irrelevant to the inferential context where we condition on observed data. Consequently \pi(y_{1:N} \mid \theta) is the only nontrivial, natural marginal density function. Having only one natural marginal is common in other models as well, for example in “collapsed” latent Dirichlet allocation, where “collapsed” refers by convention only to the discrete parameters, without any ambiguity.

4 Likes

A hidden Markov model is conceptually similar to a Kalman filter, which is currently implemented in Stan under the name Gaussian Dynamic Linear Model. The full name of the function is gaussian_dlm_obs_lpdf. Marginalization over the latent states is indicated by the infix _obs.

Imitating this convention suggests the HMM function name

  • categorical_hmm_obs_lpmf
2 Likes

My vote remains for hmm_lpdf.

My preference isn’t because I think marginal is wrong, but because I think it’s redundant. Nobody ever implements the joint density of the observed data and “missing data”, nor is that in the proposal for Stan.

Are there example packages that implement \pi(y \mid \theta) and call it the “HMM marginal lpdf” in a function name?

That was me. It is technically ambiguous and the \pi(z \mid \theta) marginal is part of the standard generative story and precisely the distribution that is Markovian. That is, it’s not some obscure marginal I pulled out to make a point.

I agree it’s the one of interest (both are natural). And that’s common when there are unobserved latent states. In fact, I agree so strongly that I think writing “marginal” isn’t worth the extra characters.

@betanalpha : Would you have preferred log_mix to be called log_marginal_mixture? It’s exactly the same marginalization as you get in an HMM, namely of the so-called “missing data”. Similarly, we don’t put “marginal” in the name of the beta-binomial (Dirichlet-multinomial)—it’s just taken for granted that we marginalize out the probability (simplex) parameter.
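
(For reference, here is a numpy restatement of what the two-component log_mix computes; the variable names are mine, not Stan’s. It is exactly this kind of marginalization of a discrete indicator, just without the Markov structure.)

```python
import numpy as np
from scipy.special import logsumexp

def log_mix(theta, lp1, lp2):
    """Two-component mixture on the log scale: the discrete component
    indicator is marginalized out, log(theta * exp(lp1) + (1 - theta) * exp(lp2))."""
    return logsumexp([np.log(theta) + lp1, np.log1p(-theta) + lp2])
```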

Yup—obs has the advantage over marginal of being explicit about which outcome variable we’re talking about. I just don’t think it’s necessary here. I would’ve preferred gaussian_dlm_lpdf there. Or even just dlm_lpdf, with non-Gaussian ones getting more qualified names.

Similarly, we write just log for the natural logarithm, but we qualify log2 and log10.

I’m confused – how is the marginal HMM \pi(y_{1:N} \mid \theta) “the only thing anyone ever implements” and yet the marginal itself would be ambiguous in the context of HMMs?

People absolutely do implement the joint HMM over the observed data and hidden states, and in many cases it’s more common than the marginalized approach. As I commented in my earlier post in this thread, people in multiple fields, especially ecology, continually make the mistake of assuming that Stan cannot fit HMMs because it cannot handle the discrete hidden states needed for the joint implementation. Implementing the forward algorithm to marginalize out the hidden states is only half of the challenge; one first has to recognize that the hidden states can be marginalized to yield equivalent inferences for the parameters.
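
To make the contrast concrete, here is a numpy sketch of the complete-data log density that such joint implementations target; as with the earlier sketch, the Gaussian emission and the variable names are illustrative assumptions:

```python
from scipy.stats import norm

def hmm_joint_lpdf(y, z, log_Gamma, log_rho, mu, sigma):
    """log pi(y_{1:N}, z_{1:N} | theta): the hidden state sequence z is an
    explicit (discrete) argument rather than being marginalized out."""
    lp = log_rho[z[0]] + norm.logpdf(y[0], mu[z[0]], sigma[z[0]])
    for n in range(1, len(y)):
        lp += log_Gamma[z[n - 1], z[n]] + norm.logpdf(y[n], mu[z[n]], sigma[z[n]])
    return lp
```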

Once again, the name hmm_marginal was chosen to distinguish it from the joint HMM, with the presumption that once one is marginalizing in the inferential context, \pi(y_{1:N} \mid \theta) is the only relevant marginal.

1 Like

Thanks for the input. I think both perspectives are sensible, but I want to make a decision and move on.

Given the input parameters, there’s little room for confusion. So the question is of marginal importance and I’d rather move on to the next step. If a person read hmm_lpdf and thought this was a joint density, they’d soon realize that they cannot pass the discrete states.

But this is the convincing point to me: if someone fitted an HMM with continuous hidden states in Stan, those hidden states would be parameters and they would write the log joint in the model block. The marginalization happens here because we cannot run HMC over discrete parameters. This is distinct from what we would put in the model block in the continuous case. So I’ll keep marginal. For what it’s worth, it’s consistent with what’s done for the nested Laplace approximation.

1 Like

Because as soon as you say “marginal”, people are going to start trying to figure out which marginal.

Once again from this side, I completely understand your motivation. I just think it’s wrongheaded and will lead to all of our functions having really long names.

Do you think we should rename log_mix to log_marginal_mix on the same grounds? It’s exactly the same argument that we should mention that there’s a discrete parameter being marginalized.

You could try to do that and define the sampler via Gibbs, but that’s a really bad implementation strategy. I’ve never seen an interface that implements the “complete data” likelihood as an interface function that users were supposed to use.

The states themselves are always discrete in an HMM.

I’m very disappointed, but I suspect I’m in for even more disappointment if you keep the simplexes as columns. I think it’s a very bad idea to go against established convention.

The simplexes were changed to rows weeks ago. This was never contentious and I immediately rolled with your proposal.

It really bothers me to disappoint anyone… Again, I think both sides make sensible points. But I also want to move forward. I’ll bring it up at the Stan meeting on Thursday. When I wrote this post, I wanted to hear from more people in addition to @betanalpha and @Bob_Carpenter, who have already voiced their opinion, though I appreciate your clarifying your positions. @nhuurre thank you for chiming in.

I agree that you have to make a decision. You’re going to disappoint someone with this one. I’ll try not to roll my eyes every time I have to say hmm_marginal :-)

None of this makes sense to me.

We have three possible distributions under the name “hidden Markov model”.

You are arguing that the name “hidden Markov model” alone implies the marginal \pi(y_{1:N} \mid \theta). In particular this implies both an inferential context, requiring the data variables to be present, and a marginal consistent with that context.

If marginal is already implied then adding “marginal” to the name should not change anything within this context other than adding redundancy. But you then argue that explicitly saying marginal somehow overrides everything you claim is already implicit in the name “hidden Markov model”.

I think what we can agree on is that there is an implicit inferential context.

What we disagree on is whether the marginal is implied or not. I am constantly encountering people who argue that Stan cannot fit hidden Markov models because we don’t do discrete parameters. That is an argument from people who attribute “hidden Markov model” to the joint model and aren’t aware that the discrete parameters can be marginalized out without affecting the inferences. It’s not a question of people not knowing how to do the marginalization; it’s a question of people not even knowing that it’s possible, in which case “marginal” is definitely not implied.

I find this to be a slippery slope fallacy. We already have function names that are as long or longer (integrate_ode_*, multi_normal*, etc) that are needed to ensure clear meaning, as we are proposing here.

It is not. The joint model here is not a mixture but rather a categorical model; we get a mixture only when the discrete categorical variables are marginalized out. Here there is no ambiguity about which distribution “mixture” refers to.

Regardless, were there ambiguity I would have no problem adding “_marginal” given the short length of the initial name.

This topic is moot in the sense that Charles already made a decision in your favor and I’m not going to try to fight this any further. So please take the rest in the spirit of my explaining why I’m right and I think the wrong decision was made. I’m not trying to change the decision. It’s not worth going back and forth on this much other than for people as pedantic as me and @betanalpha.

None of this makes sense to me.

I know and that’s how we got into the mess of putting _marginal on our HMM density function. The problem is that you’re not a typical user and language doesn’t work from first principles.

You are arguing that the name “hidden Markov model” alone implies the marginal

Not quite. I’m arguing that adding marginal isn’t necessary. Have you looked at other HMM packages? Do they add “data-marginal”, “state-marginal”, or “joint” to all their names?

The relevant linguistic point is that there’s often an “unmarked case” in distinctions which is taken as a default. We don’t write cos_radians just because someone might think cos applies to degrees, even though degrees are a thing in the math world. We just write cos and assume radians because that’s the default. Similarly, we write log for the natural log, and call out log2 and log10 with special suffixes. It’s not that log is unambiguous—it needs a base for that. It’s that there’s a default people expect.

I’m just saying that adding “marginal” to “HMM” is like adding “radians” to “cos”. It’s not wrong, just redundant.

My point about there being two marginals is that just saying “marginal” doesn’t actually serve the role of uniquely picking out the density of interest. So I’m arguing it’s failing on its own terms of trying to be explicit.

People say that Stan can’t do any model that involves discrete parameters either because they don’t want to (or don’t know how to) marginalize.

You mean it’s not a slippery slope leading to longer names?

HMMs are just a kind of mixture model with a Markovian condition on the category. If the inputs are all size 1, then an HMM literally reduces to a mixture model.

The Wikipedia article on mixture models treats the mixture responsibilities as parameters in the model. You could say Wikipedia’s confused, but I think we’re discussing popular conceptions of models.

2 Likes