Marginalization of latent discrete states

Hello all,

I’ve been hesitant to convert from JAGS to STAN, specifically because of my lack of understanding of how STAN deals with latent discrete states. This topic has gotten peripheral mention in a couple of places. I was speaking with a colleague who admitted he was also waiting for some clarity before embarking on learning STAN.

I’d like to reopen this topic and hear the experience of others.

For example, consider the classic N-mixture model from Kéry’s book, where there is a binomial observation process on some counts, with an underlying Poisson process model generating those counts. Similar translated examples can be found here for Kéry’s BPA book.

What if we want to track the parameter that is being marginalized? That is, the latent N in an N-mixture model? In some cases we want to know the estimated discrete state for a given observation, not just the predicted observed data generated from that latent process.

I like the idea floated by Bob and Dave here

https://groups.google.com/forum/#!topic/stan-users/9mMsp1oB69g

of approximating a latent state with a normal with low variance (which seems reasonable for ecological counts, animal populations, etc.), but what about when the states constitute the basis of a mixture model?

A verbal example:

In animal movement ecology, we might estimate the spatial coordinates of an animal’s movement based on the autocorrelation in step lengths and turning angles. We could assume that these step lengths (e.g., speed) come from some underlying discrete behavioral process (‘resting’ versus ‘foraging’); as such, we might model them with a mixture model, with some prior information to avoid degeneracy. I would want to know the estimated latent state per observation. I don’t want to marginalize out the latent behavioral state, but rather track that state! Is this antithetical to the STAN mindset?

Finally, even when the model is formulated in a STAN fashion, slide 39 would suggest that the effective number of samples/second is far lower? Am I reading this correctly?

Thanks!

Ben

PS: I would have added more links, but new users only get 2!

In this case you simulate from the process model in the generated quantities block using the parameters. It is typically much faster.
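For example, a minimal sketch of that for the latent abundance in an N-mixture model might look like the following; the names n_sites and lambda are illustrative assumptions about what the rest of the program would declare, not code from any particular example:

    generated quantities {
      int N_rep[n_sites];                // simulated latent abundance per site
      for (i in 1:n_sites)
        N_rep[i] = poisson_rng(lambda);  // draw from the Poisson process model given the fitted rate
    }

Each posterior draw of lambda then yields draws of the latent count from the process model, without any discrete sampling inside the model block itself.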

That does appear to be what the slide suggests, and there may be models where OpenBUGS/JAGS is faster, but I don’t think this needs to be one of them. It probably needs some troubleshooting… but it doesn’t list the parameterization, so I don’t know what’s happening in there. I’m pretty sure that entire model can be written as something like

    target += poisson_lpmf(y | lambda * p);
I've been hesitant to convert from JAGS to STAN, specifically because of my lack of understanding of how STAN deals with latent discrete states.

Did you try reading the manual chapter on latent discrete parameters? It explains how this all works and why it’s preferable, even in JAGS or BUGS, to marginalize if at all possible. Discrete sampling is terribly behaved in many cases.

What if we want to track the parameter that is being marginalized?

You’re also way better off marginalizing here as explained in the manual.

I like the idea floated by Bob and Dave here https://groups.google.com/forum/#!topic/stan-users/9mMsp1oB69g of approximating a latent state with a normal with low variance

Not a good idea either statistically or computationally.

In animal movement ecology, we might estimate the spatial coordinates of an animal’s movement based on the autocorrelation in step lengths and turning angles

Ben Lambert coded this up and validated Stan’s results versus moveHMM, including with covariates. See
https://groups.google.com/forum/#!msg/stan-users/72BEgxLIZjo/f2JAXaaEBgAJ

I don't want to marginalize out the latent behavioral state, but rather track that state! Is this antithetical to the STAN mindset?

Not at all. You get to have your cake and eat it, and it's better cake. You can see in the simple example in the manual for change points (HMMs are also covered in the manual, by the way, as are circular statistics) that you get *much* better tail estimates by marginalizing. That's because if there's a small probability of a discrete state, a sampler that has to visit that state is going to have a very hard time estimating it.

would suggest that the effective number of samples/second is *far* lower? Am I reading this correctly?

I have no idea what's up with that slide, but the computing times for things like the CJS model and movement HMMs are going to be *much* faster in Stan than in OpenBUGS. And it's going to be much more robust.

You might want to check out this careful and unbiased report (in the sense that the Stan developers didn’t write it):

http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12681/full

P.S. It’s “Stan”, not “STAN”, because it’s not an acronym.

It’s not antithetical to Stan, it’s antithetical to statistics as a whole. You cannot identify the latent state exactly and so you cannot “track” it. The best you can do from a Bayesian perspective is quantify all of the latent states that are consistent with the measured data with the posterior distribution.

Marginalization replaces the posterior over the discrete latent states with a posterior over a summary statistic, such as the probability of each latent state. As Bob notes, the posterior over the summary statistics is much better behaved statistically because it’s easier to estimate the probability of rarer latent states.

Even ignoring that benefit, however, a very important property of marginalization is that you can always reconstruct the posterior over discrete states. If you want a random latent state consistent with the data, just sample from the distribution that you marginalized over originally, p(discrete latent state | latent summary statistic). If you want to choose a “best” latent state, then you play around with point estimates.
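To make that concrete, here is a minimal sketch of recovering the per-observation state posterior, and a random latent state, in generated quantities for a simple two-component normal mixture. The names (y, N, theta, mu, sigma) are illustrative assumptions about what the data and model blocks would declare, not code from any particular example:

    generated quantities {
      vector[N] pr_state2;    // posterior probability that observation n is in state 2
      int z[N];               // a random latent state consistent with the data
      for (n in 1:N) {
        vector[2] lp;
        vector[2] prob;
        // theta is the mixing proportion for state 2; mu and sigma hold the component parameters
        lp[1] = log1m(theta) + normal_lpdf(y[n] | mu[1], sigma[1]);
        lp[2] = log(theta) + normal_lpdf(y[n] | mu[2], sigma[2]);
        prob = softmax(lp);                // p(state | y[n], parameters)
        pr_state2[n] = prob[2];
        z[n] = categorical_rng(prob);      // draw from the distribution that was marginalized over
      }
    }

Averaging pr_state2 over posterior draws gives the membership probabilities, and z gives draws of the latent state itself.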

You lose nothing by marginalizing. I’m not sure where the opposite impression was introduced into ecology, but it has become a common misconception! It’s really important, however, so let me repeat: you lose nothing by marginalizing.


A pair of over-optimistic BUGS enthusiasts: https://www.amazon.com/Hierarchical-Modeling-Inference-Ecology-Metapopulations-ebook/dp/B0089WNPP8/ref=sr_1_fkmr0_2?ie=UTF8&qid=1496888263&sr=8-2-fkmr0&keywords=royale+and+dorazio

This book (and a few others) was critical in encouraging the craze for heavy stats in ecology (which in general resulted in a lot of cool stuff), but I think there may not have been enough appreciation for the dismal state of computational statistics relative to ecologists’ ambitions.

It wasn’t just Dorazio and Royle. There was a lot of detailed stats in ecology before the Bayesian craze. Not surprisingly, they were using EM and other optimization techniques and all the early papers in the 80s have all the marginalizations you need.

Nevertheless, many researchers in ecology (and other fields) have found marginalization challenging. Many others seem to have no problem with it. I would agree that it’s clunky to code. But the benefits are well worth it. I just think we have to make it easier for people.

P.S. When I shared my Stan replication of his and Royle’s occupancy model paper, Dorazio told me he’d abandoned BUGS for complex hierarchical models and was coding his own samplers.

Of course there were other stats, but the wide adoption of “state-space formulations” (i.e., no marginalization) was driven by BUGS/JAGS implementations of some really popular models. Royle and Dorazio is definitely the source people cite for those models, though there were others too. I was a grad student in an ecology lab in the middle of this stuff until 2015. I saw firsthand a lot of people forget that you need data about your unobserved state to make any inference on it (people who in any other context would know better!). It was totally magical thinking, but it’s understandable because it let people fit and visualize process models that otherwise required much heavier stats expertise or very indirect inference. Ultimately it was over-optimistic not because of ecologists but because the samplers weren’t there (even for the marginalized versions of some of these models).

We can’t win over everyone :) Of course now the literature is clogged with papers that extol the virtues of un-marginalized state-space models, and that’s why there are a lot of people irritating Michael. It’s harder to publish papers with a title like “That was all a horrible mistake, learn more math instead.”

Well, the easily accessible samplers.

I apologize in advance, as I am a bit of a novice with Stan. My work often utilizes Bayesian and dynamic Bayesian networks. I typically use something like Netica or BUGS/JAGS to handle these models (depending on whether I want/need EM or MCMC approaches to estimation).

I’m planning a Monte Carlo simulation study and would like to use Stan (RStan, in particular) given the speed and ease of integration with R. How will marginalization impact me in this case? My concern is that one of the aspects of performance I am interested in is classification accuracy of examinees into latent performance categories (e.g., “master” or “non-master”). Is Stan going to allow me to perform inference given a particular response string?

For example, assume I have a DBN consisting of a single dichotomous LV at each time slice serving as the parent of five dichotomous, observed performance indicators. I need to know the posterior probability of membership in each of the two latent classes given a particular pattern of responses on the five performance tasks. Would this require additional sampling in Stan? Are there some simple BN/DBN examples in Stan that I can draw from that include both learning and inference? Thanks, again!

For N-mixture models, the number of latent states (the true population abundance, N) is theoretically infinite, but in practice the likelihood that marginalizes over N is approximated by summing over all possible integer values of population size up to some large integer (declared as K in slide 38), which represents an upper bound on the population size. I suspect the summation over possible values of N may explain the low n_eff/time particularly when K is large (as it should be).
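For reference, a minimal sketch of that truncated summation for a basic binomial-Poisson N-mixture model with R sites and T repeat counts might look like the following. The variable names and the gamma prior are my own illustrative choices, not taken from the slides or from any particular published translation:

    data {
      int<lower=1> R;              // number of sites
      int<lower=1> T;              // number of repeat visits per site
      int<lower=0> y[R, T];        // observed counts
      int<lower=1> K;              // upper bound on latent abundance (must be at least max(y))
    }
    transformed data {
      int max_y[R];
      for (r in 1:R)
        max_y[r] = max(y[r]);      // smallest abundance consistent with the data at site r
    }
    parameters {
      real<lower=0> lambda;              // mean abundance
      real<lower=0, upper=1> p;          // detection probability
    }
    model {
      lambda ~ gamma(2, 0.05);           // illustrative weakly informative prior
      for (r in 1:R) {
        vector[K - max_y[r] + 1] lp;
        for (n in max_y[r]:K)            // sum only over feasible values of N
          lp[n - max_y[r] + 1] = poisson_lpmf(n | lambda)
                                 + binomial_lpmf(y[r] | n, p);
        target += log_sum_exp(lp);       // marginalizes N out of the likelihood
      }
    }

The inner loop runs from max(y[r]) to K for every site at every gradient evaluation, which is why n_eff per unit time can suffer when K has to be large, as noted above.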

This issue is discussed more in a paper “Computational aspects of N-mixture models” by Dennis et al. in the journal Biometrics. That paper also presents a multivariate Poisson parameterization of N-mixture models that does not require approximating this infinite series, and also does not have discrete parameters.

For the other models discussed in the above slides, there are just two or three latent states, so the summation over all possibilities is relatively inexpensive.

It will make everything faster and give you much more accurate expectations. This is the Rao-Blackwell theorem in action. Check out the first example in the Stan manual’s chapter on latent discrete parameters. Sampling discrete parameters is much less efficient: you still need to calculate all the category probabilities, but rather than using them all, you replace them with a much less informative draw.

So you should marginalize wherever possible in BUGS, too. You might find the following paper useful in understanding all this:

  • Cole C. Monnahan, James T. Thorson, and Trevor A. Branch. 2016. Faster estimation of Bayesian models in ecology using Hamiltonian Monte Carlo. Methods in Ecology and Evolution.

Why would you want an MLE when you could get a Bayesian posterior?

I think “Bayesian Network” is just what we’d call a model. I don’t know what a “dynamic Bayesian network” is.

That can be true if K is very large. Stan and BUGS will both have to calculate the conditional probability; BUGS will then sample, while Stan will calculate the log density and gradients. The latter can be more work, but it should be more than compensated for by much better mixing.

That sounds useful.

You may also want to check out Michael Betancourt’s case study on mixture models on the Stan web pages (users >> documentation >> case studies).

Thanks for the reply.

  1. I personally would prefer the Bayesian posterior. Many off-the-shelf BN software packages, however (Netica, Hugin, maybe BayesiaLab), only offer ML estimates. Furthermore, many of the options for MCMC estimation (Stan, BUGS, JAGS, some of the C++ libraries, etc.) have a high cost of entry (i.e., a steep learning curve) for most applied researchers and practitioners. Under many realistic research conditions, the two approaches are likely to yield results similar enough so as to not impact the conclusions that a researcher would draw. Therefore, I think it’s important to study both approaches so as to better understand where one might succeed while the other fails. More specifically, it would be nice to have some empirical guidance for practitioners with respect to where they can safely use some of the more accessible (often GUI-based) software packages and at what point they need to take the plunge in opting for something like Stan.

  2. By “dynamic Bayesian network” I was referring to the general class of extensions to Bayesian networks meant to handle observations occurring over multiple time points, with HMMs (and other state-space models) being a special case of a DBN, for example.

  3. Thank you for the reference. In looking through the manual, I didn’t see an example that seemed to map onto my application, although my search was only a cursory one, admittedly. What additional steps, if any, am I going to need to take in order to build person/case-specific posterior distributions for the discrete latent variables? I am particularly interested in knowing the time slice at which a particular individual reaches, or is classified into, a particular state in a set of ordinal, non-regressive states. For example, when an examinee is classified as having “mastered” the exam content, where “mastery” is the highest state possible and a student is assumed to never regress to a lower state once a higher state is achieved. If this is covered in the manual, then I apologize and would appreciate being directed to the page/example number.

Thanks, again!

To a first approximation, always use Stan. ML can only predict, rather than estimate, that which it integrates out to form the likelihood function, so you don’t get good uncertainty estimates for the latent states. Bayesian approaches that are not Stan (or at least that do not integrate out the latent states) have a considerable probability of being wrong by some unknown amount for any finite number of MCMC draws. And if you are going to do the analytical work to figure out the marginalization anyway, you might as well use Stan at that point.


@bgoodri already answered the first point about MLE vs. Bayes.

You want to look at the first example in the latent discrete parameters chapter. It’s a change-point example, which is a simple HMM with two states that always starts in one state and then can only move to the second state. The example shows how to do inference for the probability that the change is at any given time. It also illustrates why inference is so much better with marginalization, as it constructs the sampling version and compares the two.

The chapter goes on to other examples of how to do inference over discrete states: CJS mark-recapture, which is a more complex HMM-like structure, and Dawid-Skene noisy data coding, which has item-level marginalization and inference.

You can also look at the mixture and HMM examples—they’re also in their own chapter. I need to write a case study for HMMs as this question keeps coming up.


EDIT: Upon further thought, perhaps the better question would be to ask whether it’s possible to get some pointers on translating my existing BUGS code for the models below into Stan language? I’m not sure if that task is better suited to this forum, another forum, or email. I didn’t want to be presumptuous and clog up this thread with chunks of BUGS code :)

Thank you both for the replies. One additional question – do you have, or know of, any code examples for running a latent transition analysis with Stan? I’m interested in coding three models in total:

(1) a simple HMM (one latent variable per time slice with two states; one observed variable per time slice with two states; repeated over, say, 15 time slices; need to estimate a static transition model and static observation model)
(2) a simple LTA (one latent variable with two states; 3-5 observed variables each with two states; ~15 time slices; estimate static transition and observation models)
(3) something approximating a two factor model (two latent variables each with two states; 3-5 observed variables per latent variable all with two states; ~15 time slices; estimate static transition and observation models)

To the best of my understanding, all of these models are possible in Stan and none should be overly taxing from a computational perspective (I’d be training these models with no more than 2000-3000 cases, most likely). Performing inference on a few thousand cases (via sampling) might take a while, I’m guessing. Unfortunately, I need to know the posteriors for each latent variable at each time slice for each case. Any code examples would be appreciated. I think I can figure out (1) from the example in 15.2 of the manual. I can probably engineer (2) and (3) using that as a starting point as well, though I figured you might have seen something that is a better approximation of those models that someone else has already done the legwork on. Thank you in advance!

Feel free to post BUGS code. Just please indent it and quote it so it’s readable. That’s not to say anyone’s going to have time to help you translate.

There are general pointers in the manual and lots of examples—we’ve translated all the built-in examples and the BUGS examples in several books.

There’s an example in the manual and some two-output HMM movement models discussed on the old Google list with code that works. I need to write a case study for these to make it easier for everyone.

There are also some custom models like this for mark-recapture models in the latent discrete parameters chapter of the manual.
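For (1), a minimal sketch of the forward-algorithm marginalization for a single sequence (two latent states, one binary observation per slice) might look like the following. The names, the single-sequence setup, and the flat priors implied by the simplex declarations are illustrative assumptions, not code from the manual:

    data {
      int<lower=1> T;                  // number of time slices
      int<lower=1, upper=2> y[T];      // observed indicator, coded 1 or 2
    }
    parameters {
      simplex[2] pi1;                  // initial state distribution
      simplex[2] theta[2];             // transition probabilities, theta[from][to]
      simplex[2] phi[2];               // emission probabilities, phi[state][outcome]
    }
    model {
      vector[2] log_alpha;             // forward variables on the log scale
      vector[2] acc;
      for (k in 1:2)
        log_alpha[k] = log(pi1[k]) + log(phi[k, y[1]]);
      for (t in 2:T) {
        vector[2] log_alpha_new;
        for (k in 1:2) {               // state at time t
          for (j in 1:2)               // state at time t - 1
            acc[j] = log_alpha[j] + log(theta[j, k]) + log(phi[k, y[t]]);
          log_alpha_new[k] = log_sum_exp(acc);
        }
        log_alpha = log_alpha_new;
      }
      target += log_sum_exp(log_alpha);  // marginalizes out the whole latent state sequence
    }

The per-slice, per-case state posteriors would then come from a forward-backward pass in generated quantities, analogous to the mixture sketch earlier in this thread, and multiple cases just add an outer loop over sequences. Label switching is the main identifiability issue to watch for.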

I’m afraid I don’t know what an LTA is.

You just code those as you would in math. There are identifiability problems, as there are in BUGS. I don’t know that we have any examples lying around conveniently. I’ve done some, but never organized them well enough to dump them in the manual.

The speed is determined not only by data size, but also by how well the data fit the model and how well the program is implemented in Stan. There’s lots of advice on efficiency in the manual.

Oh, I should’ve also mentioned there’s a multinomial latent mixture model (latent Dirichlet allocation) in the manual, which is a kind of latent-factor model for multinomial data.