Hi all,
I’m working on a model where the likelihood is a matrix normal distribution and the prior over each edge of the graph is a spike-and-slab-inspired Gaussian mixture: the spike component is a narrow Gaussian centered at zero, and the slab components are wider Gaussians with a mixture over multiple variances. I want to perform variational inference to approximate the posterior over the graph matrix G. Since the prior is a non-conjugate Gaussian mixture and the likelihood is matrix normal, I’m wondering:
- Do I need to derive a custom variational family for this setup?
- Is it okay to use a default like a mean-field Gaussian, or would that be too simple given the structure of the matrix normal?
- Would black-box variational inference with Monte Carlo estimation of the ELBO work well here?
- Has anyone tried using a matrix-variate variational distribution (like a matrix normal) instead of a factorized normal?
- Would using Pathfinder for initialization and then running NUTS be more practical here? I wonder whether this would shorten the NUTS warmup stage and start the chains near the posterior mode, making sampling more stable and faster.
I already have the code working with NUTS, but it doesn’t scale well, so I’m mainly looking for a faster, more scalable variational alternative. Would love guidance on how to proceed, or any examples if something similar has been done.
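For concreteness, the model I have in mind looks roughly like the sketch below. This is simplified: the matrix normal mean is taken to be G itself, the row/column covariances U and V are treated as known data, all mixture components are centered at zero, and the names (U, V, slab_sd, spike_sd, K) are just placeholders.

```stan
data {
  int<lower=1> N;                  // rows of Y
  int<lower=1> P;                  // columns of Y
  matrix[N, P] Y;                  // observed matrix
  cov_matrix[N] U;                 // among-row covariance (treated as known here)
  cov_matrix[P] V;                 // among-column covariance (treated as known here)
  int<lower=1> K;                  // number of slab components
  vector<lower=0>[K] slab_sd;      // slab standard deviations
  real<lower=0> spike_sd;          // narrow spike standard deviation
}
parameters {
  matrix[N, P] G;                  // graph / edge-weight matrix
  simplex[K + 1] w;                // mixture weights: spike + K slabs
}
model {
  // spike-and-slab Gaussian mixture prior on each edge, components marginalized out
  for (i in 1:N) {
    for (j in 1:P) {
      vector[K + 1] lp;
      lp[1] = log(w[1]) + normal_lpdf(G[i, j] | 0, spike_sd);
      for (k in 1:K)
        lp[k + 1] = log(w[k + 1]) + normal_lpdf(G[i, j] | 0, slab_sd[k]);
      target += log_sum_exp(lp);
    }
  }

  // matrix normal likelihood kernel: -0.5 * tr(V^-1 * (Y - G)' * U^-1 * (Y - G));
  // the |U| and |V| determinant terms are constant because U and V are data
  {
    matrix[N, P] E = Y - G;
    target += -0.5 * trace(mdivide_left_spd(V, E' * mdivide_left_spd(U, E)));
  }
}
```

The spike/slab component memberships are marginalized out entry-wise, which is why there is no discrete indicator in the parameters block.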
Thanks!
I’m afraid this isn’t going to work well in Stan, because it’s just doing a soft version of the combinatorial problem you get with a proper discrete slab.
I understand matrix normals, but I don’t understand what graph you are talking about. Did the matrix come from a graph?
That’s a tall order. None of the variational inference methods produce answers as good as NUTS, and in some cases they have convergence issues and can be slow.
You could always try a custom variational family for your problem, but that takes you well outside of Stan.
Yes, the matrix comes from the graph.
One question I have: is it generally better to use Pathfinder as an initialization strategy for NUTS, compared to running NUTS alone with default initializations?
If so, do you recommend running a single Pathfinder per NUTS chain (e.g., 3 parallel chains), possibly adding some noise to each Pathfinder result to encourage exploration of different modes in a multimodal posterior?
Alternatively, is it better to use a multi-path Pathfinder run and split the resulting paths to initialize the 3 NUTS chains?
I’ve been exploring the challenges of implementing spike-and-slab priors in NUTS too.
I don’t think it can be solved today by any choice of prior or mixture of priors, because there is a more fundamental problem at the level of the Markov chain itself.
More specifically, I believe the core issue relates to how the slab component loses its connection to the likelihood when the mixture heavily favors the spike.
The Problem
Consider a simplified spike-and-slab formulation in the spirit of your Gaussian mixture proposal:
theta1 ~ N(0, 1e-3), with theta1 ≥ 0   (spike component)
theta2 ~ N(6e-3, 1e2), with theta2 ≥ 0   (slab component)
p ~ U(0, 1)   (mixture weight)
beta = p*theta1 + (1-p)*theta2   (parameter used in the likelihood)
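Written out in Stan, this parameterization looks something like the sketch below (with a made-up scalar normal likelihood on data y, just so there is a gradient to reason about):

```stan
data {
  int<lower=1> N;
  vector[N] y;
  real<lower=0> sigma;             // known observation noise (placeholder)
}
parameters {
  real<lower=0> theta1;            // spike component, truncated at zero
  real<lower=0> theta2;            // slab component, truncated at zero
  real<lower=0, upper=1> p;        // mixture weight; uniform(0, 1) is implied by the bounds
}
transformed parameters {
  real beta = p * theta1 + (1 - p) * theta2;   // parameter used in the likelihood
}
model {
  theta1 ~ normal(0, 1e-3);        // narrow spike
  theta2 ~ normal(6e-3, 1e2);      // wide slab
  y ~ normal(beta, sigma);         // placeholder likelihood
}
```

The likelihood only ever sees beta, which is the root of the problem described next.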
When p approaches 1:
- The contribution of theta2 to beta becomes negligible
- The likelihood becomes insensitive to changes in theta2
- The gradient of the log-likelihood with respect to theta2 approaches zero

This creates a situation where theta2 can “wander” without constraint in the posterior, as the Hamiltonian dynamics provide no meaningful direction for this parameter.
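To spell out the gradient argument with the parameterization above (chain rule through beta):

$$
\frac{\partial \log p(y \mid \beta)}{\partial \theta_2}
  = \frac{\partial \beta}{\partial \theta_2}\,
    \frac{\partial \log p(y \mid \beta)}{\partial \beta}
  = (1 - p)\,\frac{\partial \log p(y \mid \beta)}{\partial \beta}
  \;\longrightarrow\; 0
  \quad \text{as } p \to 1,
$$

so the only force left acting on theta2 comes from its own prior, which is exactly the unconstrained wandering described above.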
Proposed Solution
I have been wondering about the following solution, though it would require new functionality in Stan.
I’d value any input on this.
Could Stan implement a specialized “SPIKE” prior operator that:
- Takes a parameter name as an argument
- During the leapfrog integration:
  - When spike = 1: Forces the parameter to zero and skips momentum updates
  - When spike = 0: Applies normal HMC/NUTS sampling
Expected Behavior
This would result in:
- When spike = 1: Parameter trace shows exact zeros
- When spike = 0: Parameter follows its conditional posterior
This approach would maintain sparsity while potentially improving computational efficiency by avoiding unnecessary calculations for parameters that are effectively zero. It should also be faster than approaches where several subset models are fitted and then compared with a posterior predictive benchmark to select the most parsimonious predictive model, since only a single MCMC run is needed.
I’m curious if this approach seems feasible within Stan’s architecture, and whether it might address the fundamental challenges of spike-and-slab priors in gradient-based samplers.
Any thoughts or feedback would be greatly appreciated! (And some direction on where to post the idea, if it seems worth trying out.)
Kind regards.
Note: SPIKE could also take a hyperparameter alpha in (0, 1) that controls a ‘parsimony’ weight, with the spike indicator drawn as spike = I(u > alpha), where u ~ U(0, 1).
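For comparison, the closest thing I can write in current Stan marginalizes the binary spike indicator out, which gives exactly the “soft” mixture Bob mentions above rather than exact zeros. A minimal sketch, with made-up spike/slab scales and the likelihood left as a placeholder:

```stan
data {
  real<lower=0, upper=1> alpha;    // 'parsimony' weight: P(slab), so P(spike) = 1 - alpha
}
parameters {
  real beta;                       // edge weight
}
model {
  // spike indicator marginalized out:
  // p(beta) = alpha * normal(beta | 0, 10) + (1 - alpha) * normal(beta | 0, 1e-3)
  target += log_mix(alpha,
                    normal_lpdf(beta | 0, 10),      // slab (made-up scale)
                    normal_lpdf(beta | 0, 1e-3));   // spike (made-up scale)
  // ... likelihood involving beta goes here ...
}
```

The proposed SPIKE operator would instead keep the indicator discrete inside the sampler and produce exact zeros in the parameter trace.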