Bi-annual question on discrete parameters

So I recognize that discrete parameters have been asked for many times, and I’m fully convinced that most models would be much more efficiently computed by marginalization/Rao-Blackwellization and that many models that can’t be marginalized are intractable anyway. However, I’m seeing a bunch of other Bayesian inference packages offer mixture-type methods for this problem. For example, Turing.jl offers “compositional inference” that mixes Gibbs samplers and NUTS to sample from both discrete and continuous variables, and PyMC3 lets you do something similar.
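For readers who haven’t seen the marginalization being referred to, here is a minimal sketch in Python. It assumes a toy two-component normal mixture with a shared, known scale; the function names and data are made up for illustration and aren’t taken from any of the packages mentioned. The discrete indicators are summed out with log-sum-exp, and a brute-force sum over every assignment checks that the result matches.

```python
import itertools
import math


def normal_logpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at y."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2


def log_sum_exp(terms):
    """Numerically stable log(sum(exp(t) for t in terms))."""
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


def marginal_loglik(ys, theta, mus, sigma):
    """log p(y | theta, mus, sigma) with each indicator z_i summed out analytically."""
    total = 0.0
    for y in ys:
        total += log_sum_exp([
            math.log(theta) + normal_logpdf(y, mus[0], sigma),     # z_i = 0
            math.log1p(-theta) + normal_logpdf(y, mus[1], sigma),  # z_i = 1
        ])
    return total


def brute_force_loglik(ys, theta, mus, sigma):
    """Same quantity, computed by enumerating all 2^n indicator assignments."""
    total = 0.0
    for zs in itertools.product((0, 1), repeat=len(ys)):
        log_joint = 0.0
        for y, z in zip(ys, zs):
            log_prior = math.log(theta) if z == 0 else math.log1p(-theta)
            log_joint += log_prior + normal_logpdf(y, mus[z], sigma)
        total += math.exp(log_joint)
    return math.log(total)


# Hypothetical toy data and parameter values, chosen only for illustration.
ys = [-1.2, 0.3, 2.5]
theta, mus, sigma = 0.3, (-1.0, 2.0), 1.0

marginal = marginal_loglik(ys, theta, mus, sigma)
brute = brute_force_loglik(ys, theta, mus, sigma)
print(marginal, brute)
```

The marginal version costs two terms per observation instead of 2^n terms overall, and what is left is a smooth function of the continuous parameters, which is the form a gradient-based sampler needs.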

Given the quality of governance I’ve seen on this project, I’m assuming there’s a reason it’s not in Stan. As someone with a passable understanding of HMC and essentially no understanding of how Gibbs works, is there a reason I shouldn’t trust these methods?

Good question. I was going to try to respond, but you’d be better served with a response from someone who has thought more deeply about this particular issue than I have. Tagging @betanalpha and @Bob_Carpenter in case they have time to respond.

I’m at a three-letter agency about to make a decision where a million lives hang in the balance, and the decision hinges on an analysis that blindly uses Turing.jl’s compositional inference to sample missing counts in a Poisson regression! Quickly, I swear to God I’m gonna push the inference button!

Okay, none of that is true, but bump? Am I allowed to do that here?

There are two problems with adding discrete parameters:

  1. If you add them, there is a 100% chance that people will start using them even when they really should be marginalizing those parameters out.
  2. Adding features is hard! It takes dev time and effort that we could put into other features, or takes that time away from our studies (since most of us are academics or students). PyMC3 and Turing have an advantage here in that they’re written in much simpler programming languages (Python and Julia, rather than C++). Turing has another major advantage from working in Julia – it’s much more modular, because of Julia’s multiple dispatch features. Adding an extra sampler to Turing is as easy as making a new package, then loading it alongside Turing. As long as the package implements a couple of methods defined by the AbstractMCMC API, the Turing devs don’t have to do anything to support it. We can just leave the package on its own to mature before we accept or reject it (pun fully intended). We’re doing something like that with Annealed Importance Sampling right now – the package started off as its own thing, but now that we’ve seen it and like it we have plans to integrate it into Turing directly.

We are actually starting to build a Stan-compatible sampler that includes discrete variables and conditionally continuous variables such that we can, for example, do NUTS-within-Gibbs and NUTS-within-RJMCMC with Stan. We aren’t quite there yet, but I would be interested in hearing about models that people would like to sample from and which involve discrete variables.

@s.maskell I think models with multiple changepoints would be something that people would find interesting/useful.

@dmuck: are you imagining that the number of changepoints would be discrete but the other parameters of the model would be continuous? If so, that fits within the set of things we are thinking about.

@s.maskell yes, that’s right

@s.maskell @dmuck note that even if the locations of the changepoints are continuous and the number of changepoints is known, the likelihood is discontinuous in the locations of the changepoints (it jumps whenever a changepoint moves past one of the data points) and therefore can be difficult to sample. Presumably in part for this reason, the Stan User’s Guide (SUG) introduces changepoint models in a context where the locations of the changepoints themselves are discrete.
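To make the discontinuity concrete, here is a small Python sketch. It is not the User’s Guide example; the data, the rates, and the function names are all invented for illustration. It evaluates a Poisson changepoint log-likelihood at a continuous changepoint location `tau` and shows that the value stays constant while `tau` moves between two observation times and jumps when `tau` crosses one.

```python
import math


def poisson_logpmf(k, rate):
    """Log probability of count k under Poisson(rate)."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def changepoint_loglik(tau, times, counts, early_rate, late_rate):
    """Log-likelihood with rate early_rate for t < tau and late_rate for t >= tau."""
    total = 0.0
    for t, k in zip(times, counts):
        rate = early_rate if t < tau else late_rate
        total += poisson_logpmf(k, rate)
    return total


# Hypothetical observation times, counts, and rates (assumptions for the sketch).
times = [1.0, 2.0, 3.0, 4.0, 5.0]
counts = [4, 5, 3, 1, 0]
early, late = 4.0, 1.0

# Both values of tau lie between t = 3 and t = 4, so the same observations
# fall on each side of the changepoint and the log-likelihood is unchanged.
flat_a = changepoint_loglik(3.1, times, counts, early, late)
flat_b = changepoint_loglik(3.9, times, counts, early, late)

# Moving tau past t = 4 reassigns that observation to the early rate,
# so the log-likelihood changes by a finite jump.
after_jump = changepoint_loglik(4.1, times, counts, early, late)

print(flat_a, flat_b, after_jump)
```

Because the function is piecewise constant in `tau`, its gradient with respect to `tau` is zero wherever it exists, so a gradient-based sampler gets no information about where to move the changepoint.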

I wrote a long thread about this on Twitter this summer that covers the main reasons why “compositional inference” is not the benefit it might naively seem: https://twitter.com/betanalpha/status/1412488546351562762. Were it shorter I would have copied and pasted it here, but it should be accessible to anyone.

Long story short, just because you can build an asymptotically consistent Markov chain Monte Carlo algorithm doesn’t mean that it will do anything useful in practice, and engineering Markov chain Monte Carlo algorithms that work well on specific discrete spaces, let alone generic discrete spaces, and offer useful diagnostics is very, very, very hard. The challenge is that the desire for these methods to work is so strong, and the workings of Markov chain Monte Carlo estimation so subtle, that people will keep reimplementing these ideas, and funding those reimplementations, based solely on hope.

If you have any questions about anything in the Twitter thread then don’t hesitate to ask them here.

Sorry for the delayed response; I hadn’t spotted that this thread had progressed.

I completely agree that there are significant challenges in developing and validating the performance of algorithms for models of this kind. Our current work aims to provide an environment where we can demonstrate the challenges (e.g., by having generated quantities for discrete-variable models that allow us to see that Rhat is not what we want) and to develop novel numerical Bayesian inference algorithms that address the deficiencies of the current state of the art in MCMC.

For what it’s worth, my fear is that a lot of research is currently going into developing numerical Bayesian inference algorithms without a clear view of what users need. I think the world needs to get better at exposing the challenges users would like to see solved, which would help research focus on tackling the “real issues” in a way that could migrate into future variants of Stan. That’s probably more aligned with another topic (here: Reimplementing the inference algorithms - Algorithms - The Stan Forums (mc-stan.org)).

I hope it is OK to revive this topic.

I just wanted to ask if some of the thoughts on Twitter (basically, if you do whatever-within-Gibbs, for a nontrivial choice of “whatever” that requires adaptation, you would need to adapt at every iteration, which kills efficiency) are discussed in a paper, either as a practical experiment or based on theory.
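Here is a toy Python sketch of how I understand that argument. It is my own construction, not something from the Twitter thread. It assumes a continuous parameter whose conditional distribution is Normal(0, 0.1) when a discrete variable z is 0 and Normal(0, 10) when z is 1, and it uses random-walk Metropolis as a stand-in for the inner sampler because its single tuning parameter (the proposal step) is easy to reason about. The step size is tuned for z = 1 with the common 2.4 × scale rule of thumb for one-dimensional Gaussian targets, then reused unchanged for z = 0.

```python
import math
import random


def rw_metropolis_accept_rate(sigma, step, n_iter, seed):
    """Acceptance rate of random-walk Metropolis targeting Normal(0, sigma)."""
    rng = random.Random(seed)
    x = 0.0
    accepted = 0
    for _ in range(n_iter):
        proposal = x + rng.gauss(0.0, step)
        # Log of the target density ratio p(proposal) / p(x).
        log_ratio = 0.5 * (x * x - proposal * proposal) / (sigma * sigma)
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            x = proposal
            accepted += 1
    return accepted / n_iter


# Hypothetical conditional scales of the continuous parameter given z.
SIGMA_GIVEN_Z = {0: 0.1, 1: 10.0}

# Step size tuned for the z = 1 conditional only (rule-of-thumb scaling).
step_tuned_for_z1 = 2.4 * SIGMA_GIVEN_Z[1]

rate_z1 = rw_metropolis_accept_rate(SIGMA_GIVEN_Z[1], step_tuned_for_z1, 20000, seed=1)
rate_z0 = rw_metropolis_accept_rate(SIGMA_GIVEN_Z[0], step_tuned_for_z1, 20000, seed=1)

print(rate_z1, rate_z0)
```

With the tuning held fixed, the sampler accepts a healthy fraction of proposals under one conditional and almost none under the other. Every time the Gibbs step changes z, the inner sampler’s tuning can go stale in the same way, which is the situation the question above describes as needing to adapt at every iteration.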

Note that I find the argument compelling and intuitively appealing, just curious.
