Sparse NUTS: preconditioning with sparse matrix operations

Software matters: did you develop the same algorithms side by side, in the same language, with the same datasets and models? If the methods focus on hierarchical models, did you fit linear models with categorical covariates against hierarchical models with categorical covariates? There are many degrees of freedom you’re not adjusting for. Moreover, correlation is helpful but meaningless for 1,000 parameters with 1,000 different means, no?

I guess what I’d like to see is a Monte Carlo simulation (N = 10,000 runs) benchmarking runtime in the same programming language and backend, with all of that adjusted for, on the same models, using SNUTS and NUTS, on both hierarchical and non-hierarchical models. Then I might be convinced. I’m not saying it doesn’t work, but the metrics you’re judging on are meh.

I.e., there’s no treatment and control group or experimental design. Max correlation does not in any way prove that this method is faster and yields the same parameter estimates for every single parameter. I can show counterexamples, but this is trivial to mathematicians. Exercise for the reader.

Software matters: did you develop the same algorithms side by side, in the same language, with the same datasets and models?

Yes.

If the methods focus on hierarchical models, did you fit linear models with categorical covariates against hierarchical models with categorical covariates?

We are using “hierarchical” in a much broader sense than categorical variables in regressions. In fact, none of our examples (off the top of my head) are categorical. This is not a manuscript about hierarchical vs. non-hierarchical models, sum-to-zero constraints, etc. It demonstrates that if a user has a hierarchical model, using information from the marginal mode (the covariance of the full posterior) can eliminate the need to adapt a mass matrix and improve sampling efficiency.

There are many degrees of freedom you’re not adjusting for. Moreover, correlation is helpful but meaningless for 1,000 parameters with 1,000 different means, no?

The maximum correlation is reported only to give context. It shows that large correlations are common in many types of models, and that Q can readily estimate it. The main empirical evidence is the improvement in sampling efficiency shown in the figure above.
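
For readers who want to see what "estimating the maximum correlation from Q" can look like, here is a minimal sketch. It is not the manuscript's code: the function name `max_abs_correlation` and the toy tridiagonal `Q` are invented for illustration, and it uses a dense inverse, which is only sensible for small examples.

```python
import numpy as np


def max_abs_correlation(Q):
    """Largest absolute pairwise correlation implied by a precision matrix Q.

    Illustrative helper (hypothetical name): inverts Q densely, so it is
    only meant for small matrices.
    """
    Sigma = np.linalg.inv(Q)             # covariance implied by the precision
    sd = np.sqrt(np.diag(Sigma))         # marginal standard deviations
    corr = Sigma / np.outer(sd, sd)      # rescale covariance to correlation
    np.fill_diagonal(corr, 0.0)          # ignore the trivial self-correlations
    return np.max(np.abs(corr))


# Toy precision matrix (an assumption, not from the manuscript):
# tridiagonal, so each parameter is directly coupled only to its neighbours.
n = 5
Q = 2.0 * np.eye(n) - 0.9 * (np.eye(n, k=1) + np.eye(n, k=-1))
print(max_abs_correlation(Q))
```

Note that the correlation depends only on the covariance, not on the means, which is why it can be summarized for any number of parameters.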

May I suggest you read the MS more thoroughly as you seem to misunderstand the method and key message? All we’re saying is that it is possible and effective to decorrelate a posterior prior to sampling using sparse linear algebra, and when the model has high correlations this is particularly advantageous. We apply this to a set of models well beyond what you’re assuming. The “control” is NUTS defaults in Stan, if that even makes sense.
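
To make "decorrelate a posterior prior to sampling using sparse linear algebra" concrete, here is a minimal sketch under stated assumptions. It is not the SNUTS implementation: the AR(1)-style tridiagonal precision `Q`, the value of `rho`, and the helper names `to_x` and `to_z` are all made up for illustration, and a banded Cholesky from SciPy stands in for a general sparse Cholesky. The idea is that if Q = L Lᵀ, then z = Lᵀx has identity covariance whenever x has covariance Q⁻¹, so a sampler working on z sees an uncorrelated target.

```python
import numpy as np
from scipy import sparse
from scipy.linalg import cholesky_banded
from scipy.sparse.linalg import spsolve_triangular

n = 200
rho = 0.95  # hypothetical autocorrelation for the toy example

# Toy sparse precision: the tridiagonal precision of an AR(1) process.
main = np.full(n, 1.0 + rho**2)
main[0] = main[-1] = 1.0
off = np.full(n - 1, -rho)
Q = sparse.diags([off, main, off], [-1, 0, 1], format="csr")

# Banded Cholesky Q = L L^T; the factor L is lower bidiagonal, so it stays sparse.
ab = np.zeros((2, n))
ab[0] = main        # main diagonal of Q
ab[1, :-1] = off    # sub-diagonal of Q
c = cholesky_banded(ab, lower=True)
L = sparse.diags([c[1, :-1], c[0]], [-1, 0], format="csr")
U = L.T.tocsr()     # upper-triangular factor used in both directions


def to_z(x):
    """Map the correlated scale to the decorrelated scale: z = L^T x."""
    return U @ x


def to_x(z):
    """Map back via a sparse triangular solve: x = L^{-T} z."""
    return spsolve_triangular(U, z, lower=False)


# Standard normal draws on the z scale become draws with covariance Q^{-1}.
rng = np.random.default_rng(1)
z = rng.standard_normal((n, 2000))
x = to_x(z)
print(np.corrcoef(x[0], x[1])[0, 1])                # strongly correlated neighbours
print(np.corrcoef(to_z(x)[0], to_z(x)[1])[0, 1])    # near zero after decorrelation
```

Both directions cost a sparse matrix-vector product or a sparse triangular solve, so no dense n-by-n matrix is ever formed.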