Hi all,
I’m working on a model where the likelihood is a matrix normal distribution and the prior over each edge of the graph is a spike-and-slab-inspired Gaussian mixture: the spike component is a narrow Gaussian centered at zero, and the slab components are wider Gaussians with a mixture over multiple variances. I want to perform variational inference to approximate the posterior over the graph matrix G. Since the prior is a non-conjugate Gaussian mixture and the likelihood is matrix normal, I’m wondering:
- Do I need to derive a custom variational family for this setup?
- Is it okay to use a default like a mean-field Gaussian, or would that be too simple given the structure of the matrix normal?
- Would black-box variational inference with Monte Carlo estimation of the ELBO work well here?
- Has anyone tried using a matrix-variate variational distribution (like a matrix normal) instead of a factorized normal?
- Would using Pathfinder for initialization and then running NUTS be more practical here? I wonder whether this would shorten the NUTS warmup stage and start the chains near the posterior mode, making sampling more stable and faster.
I already have the code working with NUTS, but it doesn’t scale well, so I’m mainly looking for a faster, more scalable variational alternative. Would love guidance on how to proceed, or any examples if something similar has been done.
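For concreteness, the model I have in mind looks roughly like the sketch below. This is simplified: the matrix normal mean is taken to be G itself, the row/column covariances U and V are treated as known data, all mixture components are centered at zero, and the names (U, V, slab_sd, spike_sd, K) are just placeholders.

```stan
data {
  int<lower=1> N;                  // rows of Y
  int<lower=1> P;                  // columns of Y
  matrix[N, P] Y;                  // observed matrix
  cov_matrix[N] U;                 // among-row covariance (treated as known here)
  cov_matrix[P] V;                 // among-column covariance (treated as known here)
  int<lower=1> K;                  // number of slab components
  vector<lower=0>[K] slab_sd;      // slab standard deviations
  real<lower=0> spike_sd;          // narrow spike standard deviation
}
parameters {
  matrix[N, P] G;                  // graph / edge-weight matrix
  simplex[K + 1] w;                // mixture weights: spike + K slabs
}
model {
  // spike-and-slab Gaussian mixture prior on each edge, components marginalized out
  for (i in 1:N) {
    for (j in 1:P) {
      vector[K + 1] lp;
      lp[1] = log(w[1]) + normal_lpdf(G[i, j] | 0, spike_sd);
      for (k in 1:K)
        lp[k + 1] = log(w[k + 1]) + normal_lpdf(G[i, j] | 0, slab_sd[k]);
      target += log_sum_exp(lp);
    }
  }

  // matrix normal likelihood kernel: -0.5 * tr(V^-1 * (Y - G)' * U^-1 * (Y - G));
  // the |U| and |V| determinant terms are constant because U and V are data
  {
    matrix[N, P] E = Y - G;
    target += -0.5 * trace(mdivide_left_spd(V, E' * mdivide_left_spd(U, E)));
  }
}
```

The spike/slab component memberships are marginalized out entry-wise, which is why there is no discrete indicator in the parameters block.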
Thanks!
I’m afraid this isn’t going to work well in Stan, because it’s just doing a soft version of the combinatorial problem you get with a proper discrete slab.
I understand matrix normals, but I don’t understand what graph you are talking about. Did the matrix come from a graph?
That’s a tall order. None of the variational inference methods produce answers as good as NUTS, and in some cases they have convergence issues and can be slow.
You could always try a custom variational family for your problem, but that takes you well outside of Stan.
Yes, the matrix comes from the graph.
One question I have: is it generally better to use Pathfinder as an initialization strategy for NUTS, compared to running NUTS alone with default initializations?
If so, do you recommend running a single Pathfinder per NUTS chain (e.g., 3 parallel chains), possibly adding some noise to each Pathfinder result to encourage exploration of different modes in a multimodal posterior?
Alternatively, is it better to use a multi-path Pathfinder run and split the resulting paths to initialize the 3 NUTS chains?
I’ve been exploring the challenges of implementing spike-and-slab priors in NUTS too.
I don’t think it can be solved today by any choice of prior or mixture of priors, because there is a more fundamental problem at the level of the Markov chain itself.
More specifically, I believe the core issue relates to how the slab component loses its connection to the likelihood when the mixture heavily favors the spike.
The Problem
Consider a simplified spike-and-slab formulation in the spirit of your Gaussian mixture proposal:
theta1 ~ N(0, 1e-3), with theta1 ≥ 0   (spike component)
theta2 ~ N(6e-3, 1e2), with theta2 ≥ 0   (slab component)
p ~ U(0, 1)   (mixture weight)
beta = p*theta1 + (1-p)*theta2   (parameter used in the likelihood)
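Written out in Stan, this parameterization looks something like the sketch below (with a made-up scalar normal likelihood on data y, just so there is a gradient to reason about):

```stan
data {
  int<lower=1> N;
  vector[N] y;
  real<lower=0> sigma;             // known observation noise (placeholder)
}
parameters {
  real<lower=0> theta1;            // spike component, truncated at zero
  real<lower=0> theta2;            // slab component, truncated at zero
  real<lower=0, upper=1> p;        // mixture weight; uniform(0, 1) is implied by the bounds
}
transformed parameters {
  real beta = p * theta1 + (1 - p) * theta2;   // parameter used in the likelihood
}
model {
  theta1 ~ normal(0, 1e-3);        // narrow spike
  theta2 ~ normal(6e-3, 1e2);      // wide slab
  y ~ normal(beta, sigma);         // placeholder likelihood
}
```

The likelihood only ever sees beta, which is the root of the problem described next.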
When p approaches 1:
- The contribution of theta2 to beta becomes negligible
- The likelihood becomes insensitive to changes in theta2
- The gradient of the log-likelihood with respect to theta2 approaches zero

This creates a situation where theta2 can “wander” without constraint in the posterior, as the Hamiltonian dynamics provide no meaningful direction for this parameter.
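To spell out the gradient argument with the parameterization above (chain rule through beta):

$$
\frac{\partial \log p(y \mid \beta)}{\partial \theta_2}
  = \frac{\partial \beta}{\partial \theta_2}\,
    \frac{\partial \log p(y \mid \beta)}{\partial \beta}
  = (1 - p)\,\frac{\partial \log p(y \mid \beta)}{\partial \beta}
  \;\longrightarrow\; 0
  \quad \text{as } p \to 1,
$$

so the only force left acting on theta2 comes from its own prior, which is exactly the unconstrained wandering described above.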
Proposed Solution
I have been wondering about the following solution, though it would require new functionality in Stan.
I’d value any input on this.
Could Stan implement a specialized “SPIKE” prior operator that:
- Takes a parameter name as an argument
- During the leapfrog integration:
  - When spike = 1: Forces the parameter to zero and skips momentum updates
  - When spike = 0: Applies normal HMC/NUTS sampling
Expected Behavior
This would result in:
- When spike = 1: Parameter trace shows exact zeros
- When spike = 0: Parameter follows its conditional posterior
This approach would maintain sparsity while potentially improving computational efficiency by avoiding unnecessary calculations for parameters that are effectively zero. It should also be faster than approaches where several subset models are fitted and then compared with a posterior predictive benchmark to select the most parsimonious predictive model, since only a single MCMC run is needed.
I’m curious if this approach seems feasible within Stan’s architecture, and whether it might address the fundamental challenges of spike-and-slab priors in gradient-based samplers.
Any thoughts or feedback would be greatly appreciated! (And some direction on where to post the idea, if it seems worth trying out.)
Kind regards.
Note: SPIKE could also take a hyperparameter alpha in (0, 1) that controls a ‘parsimony’ weight, with the spike indicator drawn as spike = I(u > alpha), where u ~ U(0, 1).
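For comparison, the closest thing I can write in current Stan marginalizes the binary spike indicator out, which gives exactly the “soft” mixture Bob mentions above rather than exact zeros. A minimal sketch, with made-up spike/slab scales and the likelihood left as a placeholder:

```stan
data {
  real<lower=0, upper=1> alpha;    // 'parsimony' weight: P(slab), so P(spike) = 1 - alpha
}
parameters {
  real beta;                       // edge weight
}
model {
  // spike indicator marginalized out:
  // p(beta) = alpha * normal(beta | 0, 10) + (1 - alpha) * normal(beta | 0, 1e-3)
  target += log_mix(alpha,
                    normal_lpdf(beta | 0, 10),      // slab (made-up scale)
                    normal_lpdf(beta | 0, 1e-3));   // spike (made-up scale)
  // ... likelihood involving beta goes here ...
}
```

The proposed SPIKE operator would instead keep the indicator discrete inside the sampler and produce exact zeros in the parameter trace.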