A question on "stan cannot deal with discrete parameters"


I read a short stan handout and it says that the major limitation of stan is it cannot deal with discrete parameters. I wonder what does this exactly mean?

The question I am facing now is like this, for example: I want to estimate the coefficients of a linear regression model Y= X*Beta. I want to introduce an indicator function H to help me decide which coefficients are 0 and which are not 0 (This is like, besides estimating the value of the regression coefficients, I also want to decide which variables I want to keep). So my question is, is stan able to estimate H?

In the process of estimating H, what usually has been done is put a probability on each H[i] (how likely it is to include that predictor), is stan able to give a posterior estimate for that probability?

I am a new user of stan and have a good experience in using this platform. I appreciate people’s answer. Thanks.


No. You could, in principle, marginalize out the H variables and then compute their posterior distribution in generated quantities, but that entails 2^K possibilities where K is the number of predictors in X.

But, Stan developers tend to have a negative opinion toward these spike-and-slab priors anyway. There are several examples out there of using continuous priors that are not quite spike-and-slab but are symmetric, have a lot of density at zero, and a lot of volume in the tails. Aki has a couple of papers about horseshoe priors and there are some other options in rstanarm as well. Finally, if the models being considered are just subsets of a large model with large K, then the projection approach to choosing a submodel is probably a better idea.

which also has a few citations at the end of the README.md.


Thanks. I see your point. But I still want to use Stan as an initial step. So I probably will use a different continuous prior for H just to start with.


You can use horseshoe prior and projpred with Stan. For Stan horseshoe code see the appendix in https://projecteuclid.org/euclid.ejs/1513306866 or use brms package to generate Stan code. See projpred package documentation how to use it with Stan.


May I inquire as to the source?

@bgoodri already answered about trying to drop parameters, but you may be interested that the answer is often positive in cases like state-space models (e.g., HMMs) or simple mixtures—there’s a manual chapter on latent discrete parameters that shows how to handle them in Stan.


stan_handbook.pdf (487.3 KB)
Section 1, paragraph 4


Thanks. I don’t think the authors (our own developers!) understood at the time how many discrete parameters could be marginalized out.

From the user’s perspective, the main limitation of Stan is that it does not allow inference for discrete parameters. Stan allows discrete data and discrete-data models such as logistic regressions, but it cannot perform inference for discrete unknowns. Many statistical applications with discrete parameters are mixture models and can be rewritten more efficiently in terms of continuous parame- ters as discussed in the chapter on latent discrete parameters in the Stan manual (Stan Development Team, 2015), but there are some examples where this sort of re-expression does not exist.

While we don’t support declaring discrete parameters, we perform inference with discrete parameter models quite well except in the case of unbounded unknowns (e.g., missing count data) and combinatorial discrete parametres (e.g., variable selection in regression; discrete Markov random fields). There’s a chapter in the manual on latent discrete parameters and how to deal with them. And also a mixtures chapter and a discussion of HMMs in the time-series chapter.

After they wrote that paper (I begged off, it being in a non-open access journal), I had this conversation with Andrew, which I then posted on his blog), where I point out that you can even code up HMMs efficiently in Stan:


Then I found a sort of contradiction here. If Stan can estimate discrete unknown without decaring it, see my example above, in the model, I would say H_ij~Bernoulli(.2) and in the parameter section I would define H_ij as real something (because if I say int, Stan won’t even buy it) then Stan would report mistake: No real~Bernoulli(). I don’t think the model can work out.


Not a contradiction. We can’t literally declare an integer parameter. What we can do is marginalize out an integer parameter. That lets us do inference for it more efficiently than if we’d done sampling. See the first example in the manual chapter on latent discrete parameters for an example of the efficiency of marginalization, both statistically and computationally.

You can marginalize these parameters out, but the problem is that it causes a combinatorial explosion if there are more than a few.

Trying to do something like Gibbs is tricky here as it’s very hard to mix through the indicator variables.

The case you can solve exactly is MLE for L1-regularized regressions. But that’s a little different than the mixture formulation in that everything remains continuous (even if non-differentiable at zero).


I should’ve qualified that the two cases that come up regularly and Stan has problems with are:

  • missing count data

  • feature selection priors

The latter is because of combinatorics, the former because we can’t do the marginalization exactly. BUGS/JAGS can deal with the former, but the latter are very challenging for anyone. If you try to run, make sure you run multiple times and get the same zeroes. Then it’s hard to figure out what posteriors are, because even with spike-and-slab priors, the posterior’s unlikely to be truly zero unless the indicator function is very very near zero and you luck out with underflow.


For the discrete case, any eureka moments (or at least some incremental progress :-) ) here maybe?

Nishimura, A., Dunson, D., & Lu, J. (Feb 20, 2018). Discontinuous Hamiltonian Monte Carlo for sampling discrete parameters. arXiv:1705.08510v2 [stat.CO]


Nope. I did some very quick tests and it doesn’t seem to work. Also the choice of embedding is really hard and very important. I don’t think this is the way forward here. (Or, the more optimistic version: the current state of the method isn’t good enough. But I’ll always read new papers just in case :p)


May I have your code that you did for the tests?

I want to take a look at how you set up the test to help me see whether it is the same goal that I had initially and then decide whether I will pursue stan or not.


It died on a previous laptop unfortunately. (it wasn’t a very serious test - it didn’t do well enough to move onto that. So it’s possible that I’ve dismissed it prematurely, but I don’t think I did.) But you can definitely break their discrete algorithm if you set it up so a trajectory ends at a discontinuity (I don’t quite remember the details, but I worked it was easy to see once I noticed it in a simulation)

From a Stan point of view, I’d probably think that the biggest barrier would be that there’s just no way of telling how that first embedding step should be done. It’s definitely important (they even say so in the paper), but there’s no guidance as to how to build it.

I feel like it’s been discussed somewhere on this forum before, so there might be more info in a search.


Wow. It feels like we wrote that ages ago! It was definitely written for an audience coming from JAGS where we thought the biggest language hurdle would be the discrete parameters.

The second part of paragraph 4 states that we can rewrite a lot of problems as continuous parameters. Just not all problems.


Not sure what your criteria are, but even if you use something else, marginalizing out discrete parameters is hugely effective. So is marginalizing out continuous parameters where possible. The efficiency follows from the Rao-Blackwell theorem, which basically states you’re better off working in expectation where possible than with samples.


This is a really nice case study about marginalising out discrete parameters http://elevanth.org/blog/2018/01/29/algebra-and-missingness/


Hi, Daniel

Thank you for being so warm-hearted. I read this post and it is pretty interesting. As I was thinking whether it would be helpful for my own research question, I read every detail of it and found something missing…
// priors
p_cheat ~ beta(2,2);
p_drink ~ beta(2,2);
sigma ~ beta(2,2);

// probability of tea
for ( i in 1:N_children ) {
if ( s[i] == -1 ) {
// ox unobserved
target += log_mix(
sigma ,
bernoulli_lpmf( tea[i] | p_drink ) ,
bernoulli_lpmf( tea[i] | p_cheat ) );
} else {
// ox observed
tea[i] ~ bernoulli( s[i]*p_drink + (1-s[i])*p_cheat );
s[i] ~ bernoulli( sigma );
target is undefined any else and I think it is a key step to understand the marginalization. What is it?


target is a built in. It’s the target log density accumulator. In general,

y ~ foo(theta);

is just shorthand for

target += foo_lpdf(y | theta);

The ~ form also drops constants that aren’t needed for sampling, but otherwise they’re identical.


Just wanted to throw this out there, post a while back about using CONCRETE or REBAR methods in stan for discrete params

Though it appears no one took the final steps to do the comparisons from the stan manual on marginalization vs. these methods