Finally A Way to Model Discrete Parameters in Stan


#1

Inspired by a talk at NIPS yesterday on including discrete parameters in a neural network, while still allowing for calculation of derivatives with back propagation, I set out to see if I could model discrete parameters in Stan using the same approach.

The method uses the CONCRETE (“CONtinuous RElaxation of discreTE random variables”) distribution approach (see http://www.stats.ox.ac.uk/~cmaddis/pubs/concrete.pdf ) and REBAR, an improved version (get it?) found here: https://arxiv.org/pdf/1703.07370.pdf

The distributions produce a variable that comes arbitrarily close to a one-hot encoding of a discrete parameter.

The method avoids having to marginalize over a hidden category allocation probability, as it the current recommended approach when dealing with discrete parameters in a graph.

As an example, if you are modeling a distribution with two modes, mu[1] and mu[2], where mu = [mu[1], mu[2]], and X is a N x 2 matrix where each row is a one-hot encoding of the unknown category, then it would be nice to model this in Stan using the line:

y ~ normal( X * mu, sigma);

Modeling each row of X as a CONCRETE or REBAR allows this model to be constructed without the need to resort to marginalizing over the alpha parameter.

I produced a toy example with Stan and R code which can be found here:

Please let me know if you find this useful, if you have any ideas to enhance it, or have useful applications. This method produces a neat and more intuitive alternative to the current marginalization approach.


How to model a choice between 2 distributions
#2

So what I’m curious about with these methods is if they actually work for things that you can’t do in Stan. It’s a pretty small class of models… Infinite mixtures that don’t converge to something known?


#3

I tried the Rebar approach on a problem in the wild – see How to model a choice between 2 distributions – and it seemed to work nicely, but the standard marginalization approach of Stan produced very unsatisfactory results (perhaps I had an error in the code for the latter?).

There might be a class of problems where Rebar would be a preferable approach even though there could be alternatives in Stan. TBD!


#4

Experiments showing how similar result you can get with this approximation compared to analytic marginalization would be nice. Also you mention that you had to choose tau which is a parameter for the approximation. Could you provide advice on how to choose it?

I don’t find rebar more intuitive as it requires understanding the approximation, potential bias caused by the approximation and potential problems it causes for HMC due likely having nasty density for tau->0. Currently we lack proper experiments giving at least some intuition when rebar would be better choice to marginalization.


#5

I wouldn’t say I am an expert or anything but do you think a good case example to try out Rebar v. Marginalization would be the LDA example in the Stan Manual? There is the original model as well as a Correlated Topic model with added complexity.


#6

Sure, models in Stan manual should be tested and thus good comparison cases. I think it would be also good to have an example which doesn’t have multimodality or label switching problem. Topic models are difficult even with marginalization. One example where marginalization is difficult is missing data in discrete covariates. In this case it would be possible to also make a small test case where marginalization is possible, but when there is a large number of missing values, rebar might be useful (if multimodality doesn’t cause too much trouble).


#7

The CONCRETE distribution is based upon Extreme value distribution (GUMBEL). In my experiences I find out that its hard to sample for Stan especially in “high” (>5) dimensions.
My 2 cents.


#8

Well this is very interesting. I replicated the toy example from your paper, “Sparsity information and regularization in the horseshoe and other shrinkage priors”, Juho Piironen and Aki Vehtari. I believe this paper describes the state of art in Stan for feature selection in a sparsity problem.

To summarize the toy problem challenge for readers:

  • We have 100 data points.
  • Each point consists of a vector of 400 measurements or features.
  • 20 of these measurements are signal with a distribution of Normal(A,1) and the rest are noise with a distribution of Normal(0,1), for some value A.

The goal is to estimate a 400-element vector, “beta”, consisting of the means of the noise and signal measures. Hopefully, 20 of the estimated beta values will be A and 380 will be zero.

The “Sparsity” paper uses a horseshoe prior to estimate the betas and calculates the MSE of the estimated mean beta vector against actual for different values of A. From the “Sparsity” paper, MSEs for different A are (the “tau 0” line represents the horseshoe prior results):

I duplicated the above chart using a Rebar model. The code is shown in the files “sparse_01.R” and “sparse_02.stan” in the github: https://github.com/howardnewyork/rebar

The results using Rebar for the first six A values:

mse

This shows a reduction in MSE but two orders of magnitude! Given the magnitude of the improvement, I am concerned I missed something in the setup of the toy problem, but if not, these are very promising results.

Here for example, are the estimated beta values using horseshoe prior and Rebar

Horseshoe Prior: Estimated Beta from “Sparsity” Paper: A=6

sparsity_beta
(red = true, solid black is estimated beta)

Rebar Method: Estimated Beta using Rebar: A=6

A_6

The “Adjusted Beta” is beta .* one_hot. Rebar does better at identifying noise features and shrinking them to zero and not shrinking the signal features towards zero.

As can be seen, it looks like Rebar is a very promising tool for feature selection in Stan.

(With regard to how to choose a value of tau in Rebar, I do not have any particular scientific insight. I followed the heuristic of setting it to a small enough value to force the Rebar parameters to be close to zero or 1. Too small a value will slow down the MCMC and require higher adapt_delta and tree depth values, so you have to pick a reasonable value that works and the model is not too slow.)


How to model a choice between 2 distributions
#9

Your rebar model assumes all non-zero beta’s are almost equal (almost as tau is not 0). In this simulated data non-zero betas happen to be equal, and thus the rebar prior matches the simulation more accurately than horseshoe. Horseshoe is usually used when we assume that effectively non-zero betas may have different values. It may be possible to adjust rebar type of model to allow more variation in non-zero betas, and although interesting itself, this goes towards continuous priors and away from the original topic of modeling discrete parameters.

@andre.pfeuffer mentioned potential problems with EV distribution and I also would assume rebar to produce often nasty posterior (horseshoe can also produce nasty posterior). Have you checked all the convergence diagnostics as recommended here http://mc-stan.org/users/documentation/case-studies/rstan_workflow.html ?


#10

If I understand the simulation setting in “Sparsity” paper correctly, it’s different from your model. The y_i are one-dimensional and are generated as y_i = beta_i + noise, where beta_i, i=1,…,20 are non-zero and beta_i, i = 20,…,400 are zeros. The 100 data realizations are just replications of this setting. In your model, you have 100 samples instead of 1 of each y_i.


#11

Good catch!


#12

Great, thanks. This is very helpful. So if I understand it then, the toy problem estimates MSE based on a single 400 element data sample, and then averages over 100 such MSEs, as opposed to my method of one calculated MSE based on a single model with 100 vectors (i.e. it is the average over 100 models with one data point each, rather than one model with 100 data points).

So to close the loop, I tried this approach and calculated the MSE for a single data point and it seems as if the MSE using Rebar is similar to the horseshoe results. (I would need to leave the model running a long time to do all 100 runs - perhaps another day for this). I used a single prior, beta ~ normal(0, 10), across all signal (“A”) values to ensure that we have one model across all possible signal values.

Here is the estimated MSE for a single run on a single data 400-element vector for different signal values (“A”):

mse_one_data_point

Some final comments:

With regard to diagnostics, you can generally find a balance between tau, tree_depth and adapt_delta that produces a good result, using the workflow referenced by Aki above.

With regard to the ability to manage higher dimensional problems: this case has 400 features of 2 dimensional Rebar parameters. Also, you can find another example of a single Rebar parameter with 70 dimensions here: How to model a choice between 2 distributions
You do have to adjust adapt_delta and tree_depth depending on your choice of tau.

With regard to whether Rebar is a exactly equivalent to a discrete distribution, the answer is no. It is a continuous distribution but is close to discrete distribution (or a “RElaxation” as was termed in the original “ConcREte” paper), which may be good enough in many circumstances.

For example, I uploaded an automatic feature selection model in a regression framework to https://github.com/howardnewyork/rebar (see feature_selection_01.R and feature_selection_01.stan) where Rebar nicely identifies the signal features. The examples shows a challenging task of identifying the important features in a high dimensionality problem with limited data. At some stage, it would be worth comparing this approach to a Bayesian implementation of Lasso, forward stepwise regression or other feature selection algorithms.


#13

While I’m a bit more excited about Rebar than @avehtari, I do agree with many of the points he raises. If you can spare the time, comparisons with some of the examples in the Stan manual would be very helpful. Additionally, an extra example where the marginalised analytical answer is known would be cool too. The crux of the matter is to determine when Rebar breaks and when it shines.

Regarding feature selection, Rebar has some appeal (as do the horseshoe approaches) because, among other things, the Bayesian lasso just doesn’t work. Someone please call me out here but I think one of the things with feature selection is that there’s no way of marginalising over the 2^P models, which would make Rebar the only viable approach if one wants to have a discrete underlying structure a la Kuo & Mallick.


#14

@howard Thanks for the additional information.

Good question is that when we would like to have a discrete underlying structure?


#15

It really is a good question. I am not arguing in favour of it, mind you. One of the arguments one might advance is that if you frame your model (“feature”) selection in terms of inclusion probabilities (and associated binary variables), you can get Bayes factors for “support” of a given covariate/predictor analytically, which I think is attractive at first glance.

Of course, specially under multicolinearity, you’ll have an unavoidable problem with multimodality, which I guess is nastier in the discrete setting – but again make no strong claims about.


#16

Probably not. That model has a terribly multimodal posterior. Stan can’t fit it as is. We just get stuck in a mode like every other approach to LDA. We don’t care so much about models we can’t fit in principle like LDA.

Specifically count variates. If it’s just a finite categorical thing, we can marginalize (unless the number of options is too huge).


#17

I also want to point out that we usually don’t want to sample discrete parameters if we can marginalize them out. Marginalization is way more effcient, especially in unbalanced cases (small discrete probabilities of interest).


#18

I’d suggest something like

  • simple mixtures (one discrete parameter per data point)

  • change point (one discrete parameter with 200 possible values)

  • HMM decoding (lots of linked discrete parameters); that is, set up training data (x, y), use it to estimate parameters p(theta | x, y), then give it some new x’ and let it predict new y’.

  • Cormack-Jolly-Seber mark-recapture : the animals live/death state is discrete and marginalized out in the Stan example

  • Dawid-Skene model – discrete parameter per item (not quit per data point), marginalized out in the Stan model