Sparse NUTS: preconditioning with sparse matrix operations

drezap · May 6, 2026, 3:29am

Software is important, did you develop the same algorithms side by side, in the same language with the same dataset/models? If the methods focus on hierarchical models, did you fit linear models with categorical covariates against hierarchical models with categ.? There’s many degrees of freedom you’re not adjusting for. Moreover, correlation is helpful but meaningless for 1,000 parameters with 1,000 different means, no?

drezap · May 6, 2026, 3:59am

I guess what I’d like to see is a Monte Carlo simulation (N = 10,000 runs) benchmarking runtime in the same programming language/backend, all adjusted for, on the same models, using SNUTS and NUTS, against hierarchical and non-hierarchical models. Then I might be convinced. I’m not saying it doesn’t work, but the metrics you’re judging on are meh.

I.e., there’s no treatment and control group or experimental design. Max correlation does not in any way prove that this method is faster and yields the same parameter estimates for every single parameter. I can show counterexamples, but this is trivial to mathematicians. Exercise for the reader.

monnahc · May 6, 2026, 3:44pm

> Software is important, did you develop the same algorithms side by side, in the same language with the same dataset/models?

Yes.

> If the methods focus on hierarchical models, did you fit linear models with categorical covariates against hierarchical models with categ.?

We are using “hierarchical” in a much broader sense than categorical variables in regressions. In fact none of our examples (off the top of my head) are categorical. This is not a manuscript about hierarchical vs. non-hierarchical, sum-to-zero constraints, etc. It demonstrates that if a user has a hierarchical model, using information from the marginal mode (the covariance of the full posterior) can eliminate the need to adapt a mass matrix and improve sampling efficiency.

> There’s many degrees of freedom you’re not adjusting for. Moreover, correlation is helpful but meaningless for 1,000 parameters with 1,000 different means, no?

The maximum correlation is only a quantity that is reported to give context. It shows that large correlations are common in many types of models, and that Q can readily estimate it. The main empirical proof is in the improvement in sampling efficiency shown in the figure above.

monnahc · May 6, 2026, 3:46pm

May I suggest you read the MS more thoroughly as you seem to misunderstand the method and key message? All we’re saying is that it is possible and effective to decorrelate a posterior prior to sampling using sparse linear algebra, and when the model has high correlations this is particularly advantageous. We apply this to a set of models well beyond what you’re assuming. The “control” is NUTS defaults in Stan, if that even makes sense.
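For readers following along, here is a minimal sketch of the general idea of decorrelating with a sparse Cholesky factor. It is not the manuscript's implementation. It assumes Q is a sparse precision matrix whose inverse approximates the posterior covariance, and it uses a toy tridiagonal AR(1) precision as a stand-in; the function names and the AR(1) example are hypothetical. If Q = L Lᵀ, then z = Lᵀx has identity covariance, so a sampler working in z sees no correlation.

```python
import math

def tridiag_cholesky(diag, off):
    """Lower bidiagonal L with Q = L L^T for an SPD tridiagonal Q. O(n), no fill-in."""
    n = len(diag)
    l_diag, l_off = [0.0] * n, [0.0] * (n - 1)
    l_diag[0] = math.sqrt(diag[0])
    for i in range(1, n):
        l_off[i - 1] = off[i - 1] / l_diag[i - 1]
        l_diag[i] = math.sqrt(diag[i] - l_off[i - 1] ** 2)
    return l_diag, l_off

def solve_q(l_diag, l_off, b):
    """Solve Q x = b using the factor: forward solve with L, then back solve with L^T."""
    n = len(b)
    y = [0.0] * n
    y[0] = b[0] / l_diag[0]
    for i in range(1, n):
        y[i] = (b[i] - l_off[i - 1] * y[i - 1]) / l_diag[i]
    x = [0.0] * n
    x[n - 1] = y[n - 1] / l_diag[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (y[i] - l_off[i] * x[i + 1]) / l_diag[i]
    return x

def lt_mul(l_diag, l_off, v):
    """Compute L^T v, where L^T is upper bidiagonal."""
    n = len(v)
    return [l_diag[i] * v[i] + (l_off[i] * v[i + 1] if i + 1 < n else 0.0)
            for i in range(n)]

def max_abs_corr(cov):
    """Largest absolute off-diagonal correlation of a symmetric covariance matrix."""
    n = len(cov)
    return max(abs(cov[i][j]) / math.sqrt(cov[i][i] * cov[j][j])
               for i in range(n) for j in range(n) if i != j)

# Toy example (an assumption, not from the manuscript): stationary AR(1)
# precision with unit innovation variance. Adjacent correlation equals rho.
n, rho = 50, 0.95
q_diag = [1.0] + [1.0 + rho ** 2] * (n - 2) + [1.0]
q_off = [-rho] * (n - 1)
l_diag, l_off = tridiag_cholesky(q_diag, q_off)

# Covariance of x: Sigma = Q^{-1}, built one column at a time (Sigma is symmetric).
sigma = [solve_q(l_diag, l_off, [1.0 if i == j else 0.0 for i in range(n)])
         for j in range(n)]
max_corr_before = max_abs_corr(sigma)

# Covariance of z = L^T x is L^T Sigma L, which should be the identity.
a_cols = [lt_mul(l_diag, l_off, col) for col in sigma]         # columns of L^T Sigma
a_rows = [[a_cols[j][i] for j in range(n)] for i in range(n)]  # rows of L^T Sigma
cov_z = [lt_mul(l_diag, l_off, row) for row in a_rows]         # L^T Sigma L
max_corr_after = max_abs_corr(cov_z)

print(max_corr_before, max_corr_after)
```

In this toy case the maximum correlation before the transform is rho and after it is zero up to rounding. A real model would need a general sparse Cholesky with a fill-reducing ordering rather than this tridiagonal special case.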
