PyMC3 lets you do this. As does Edward. I don’t think anyone’s ever evaluated what happens when mixing HMC/NUTS with other samplers. If there are evals, I’d very much like to see them. We are afraid that HMC/NUTS is so sensitive to tuning parameters that just a naive alternation will lead to highly biased samples because NUTS won’t be able to set the right step size or mass matrix to move effectively.

You can’t get a sparsity-inducing prior in full Bayes with a continuous prior on the parameters; you need something like spike-and-slab, which puts a finite probability mass on the zero solution. So something like L1 (a Laplace prior, aka double-exponential) regularization won’t lead to a sparse posterior the way it will with penalized MLE point estimates.
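To see the contrast numerically, here’s a minimal one-dimensional sketch (all numbers are hypothetical): with a Laplace prior, the MAP estimate soft-thresholds to exactly zero, but the posterior is continuous, puts zero mass on the point {beta = 0}, and its mean stays away from zero.

```python
import numpy as np

# Hypothetical 1-D setup: observe y ~ Normal(beta, 1) with a
# Laplace(0, 1/lam) prior on beta, i.e. log prior = -lam * |beta| + const.
y, lam = 0.5, 1.0

# Evaluate the log posterior on a fine grid that includes 0.
beta = np.linspace(-5.0, 5.0, 200_001)
log_post = -0.5 * (y - beta) ** 2 - lam * np.abs(beta)

# Normalize to a density on the grid.
dx = beta[1] - beta[0]
post = np.exp(log_post - log_post.max())
post /= post.sum() * dx

# Soft-thresholding: the MAP is exactly 0 because |y| <= lam.
map_est = beta[np.argmax(post)]
# The posterior mean is nonzero: the posterior is a continuous density.
post_mean = (beta * post).sum() * dx
print(map_est, post_mean)
```

The same mechanism is why L1-penalized point estimation produces sparse solutions while the corresponding full posterior never does.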

Yes, Aki’s looking into this, but I’m pretty sure that it’s a Bayesian posterior plus post-processing to get exact zero coefficients. I didn’t actually see it in the paper, so hopefully @avehtari or someone else can clarify.

This isn’t totally true. In rstanarm we have the product_normal prior, which has a stochastic representation as a product of n zero-mean normal variates, but the product has an “explicit” (in terms of the Meijer G-function) density function. As explained in section 1.1 of https://arxiv.org/abs/1309.4344
the key properties are that

The kernel is ( -log(abs(x) / sigma) ) ^ (n - 1) as x approaches zero

The kernel is (abs(x) / sigma) ^ (1 / (n - 1)) * exp(-0.5 * n * abs(x / sigma) ^ (2 / n)) as abs(x) approaches infinity

Thus, I call this prior “spike-and-slabish”, but it is proper. Of course, we don’t have the Meijer G-function in Stan, and even if we did, using it this way would be a disaster for NUTS, with the absolute values and the infinite spike at zero and all. But it works decently well (albeit with a large adapt_delta) if you use the stochastic representation and multiply n zero-mean normal random variates together. That creates its own problems, because the order of these n zero-mean normal random variates can be permuted without affecting the target, but the solution is not to declare them as ordered[n], because that makes the sampling worse. Just let them be unordered and ignore the Rhats of the things you put the zero-mean normal priors on.
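As a quick sanity check on the spike-and-slabish shape, here’s a small simulation sketch using the stochastic representation (n and sigma are arbitrary illustrative choices, not rstanarm defaults), comparing how much mass the product piles up near zero versus a single normal:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, draws = 3, 1.0, 100_000   # hypothetical settings

# Stochastic representation: multiply n zero-mean normal variates.
x = np.prod(rng.normal(0.0, sigma, size=(draws, n)), axis=1)
z = rng.normal(0.0, sigma, size=draws)  # a single normal, for comparison

# Fraction of draws in a small window around zero: the product's
# log-power spike concentrates far more mass there than the plain normal.
eps = 0.05
frac_product = np.mean(np.abs(x) < eps)
frac_single = np.mean(np.abs(z) < eps)
print(frac_product, frac_single)
```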

All that said, I am not sure how good an idea using this prior is. If you were mode finding, then the mode would be at the origin. If you are doing MCMC, then there will be two modes — one at the origin and one somewhere else — and neither the mean nor the median will be exactly zero.

This paper, soon to appear in the Electronic Journal of Statistics, https://arxiv.org/abs/1707.01694
describes how the new regularized horseshoe prior can be considered the continuous counterpart of the spike-and-slab prior with a finite slab width.
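For intuition, here’s a small sketch of prior draws from the regularized horseshoe (tau and the slab scale c are arbitrary illustrative values, not recommendations). Small local scales leave a coefficient in the spike near zero, while large local scales saturate at a slab of finite width c instead of escaping to infinity as under the original horseshoe:

```python
import numpy as np

rng = np.random.default_rng(1)
tau, c, draws = 0.1, 2.0, 50_000   # hypothetical global scale and slab width

lam = np.abs(rng.standard_cauchy(draws))   # local scales, half-Cauchy(0, 1)
# Regularized local scale: lam_tilde ~= lam for small lam (spike),
# lam_tilde ~= c / tau for large lam (slab with finite sd c).
lam_tilde = c * lam / np.sqrt(c**2 + tau**2 * lam**2)
beta = rng.normal(0.0, tau * lam_tilde)    # coefficient draws

# Most draws sit near zero, but no draw can wander beyond the c-scale slab.
print(np.median(np.abs(beta)), np.max(np.abs(beta)))
```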

Combining this with the projection predictive approach will also provide exactly zero coefficients while still (partly) taking into account the uncertainty about whether those coefficients are zero.

I think Radford touched on HMC updates within Gibbs blocks in his HMC tutorial. If I recall correctly, the problem is that updating different blocks of parameters separately in the Gibbs step bottlenecks how independent your samples are, negating the point of using HMC in the first place.

He did. And he cites some earlier work. His application was the hierarchical parameters in a hierarchical model; basically to try to deal with the funnel. As far as I recall, he didn’t consider discrete parameters or adaptive forms of HMC. And for the funnel-like cases, you can reparameterize and stick to HMC.
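For reference, the usual reparameterization for funnel-like geometry is the non-centered parameterization: the sampler works on unit-scale variables and the funnel shape is recovered by a deterministic transform. A minimal sketch in the notation of Neal’s funnel (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# Neal's funnel: y ~ Normal(0, 3), x_i | y ~ Normal(0, exp(y / 2)).
# Centered form: the scale of x varies over orders of magnitude with y,
# so no single HMC step size or mass matrix works everywhere.

# Non-centered form: sample unit-scale variables, transform afterwards.
# (In Stan: x_raw ~ normal(0, 1); then x = exp(y / 2) * x_raw.)
draws, dim = 10_000, 9                    # arbitrary sizes
y = 3.0 * rng.standard_normal(draws)
x_raw = rng.standard_normal((draws, dim))  # what the sampler sees: unit scale
x = np.exp(y / 2)[:, None] * x_raw         # implied centered values

print(x_raw.std(), x.std())  # unit scale vs. heavy-tailed funnel scale
```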

I know that this is an old thread, but I’m curious: What was the end of the story? Did you get a matrix factorisation model to work with Stan? It would be super useful…

I wasn’t fully successful. I could fit models that didn’t seem entirely wrong on toy data with 2 latent factors using autodiff variational inference. They could get close to the MAP; who knows how well the variances were estimated (probably not well). On more complex data the variational approximation was more obviously bad. I could never get a model to fit using Stan’s MCMC. I did try JAGS, and it seemed to work OK as long as I started all the chains in the same mode. I presented a poster on this, but abandoned the project because I wouldn’t trust the results on real data.