Models where Stan outperforms Nutpie/Walnuts

I think it would be incredibly valuable to add more models to posteriordb. It has a lot of small models, and a few very big but close to intractable models. Adding more big …

I agree. Not sure whether this is the right place to chime in, but I’ve been excited at the idea of speeding up Stan using Nutpie, Walnuts, etc. For pretty much all the models I would want to reach for this, though, I haven’t actually been able to get these to do better and in most cases they don’t converge or have lower ESS/draw.

These models I’m referring can have 1+GB compressed data with several 100K parameters. So I think quite large for Stan models.

The Stan models converge but the Nutpie or Walnuts implementations would not. In a few models I was able to get Walnuts to converge, and although its time per draw was up to 2x Stan, it’s ESS/draw was far lower so I would need to double or more the iterations to get the same effective draws. I don’t have a strong enough grasp of the why and I’m not able to share these models, unfortunately, so having much larger complex models to test against would hopefully be a step in understanding what I’m seeing that doesn’t seem to match the performance charts floating around.

This isn’t a baseball model by any chance? I still have scars from some of those. They break everything.

I’m definitely interested to hear more about that, but if you can’t share the model that of course makes it a bit tricky. Maybe you can open a separate thread with what info you can give? (and ping me there).

One thing I’ve seen before what that nuts was choosing a too short trajectory length for some reason. nutpie has an extra_doubling argument that you could set to 1 or so, which will make it run one extra tree doubling after it detected a u-turn. It would be interesting if that fixes the problem. Does stan get good ess (let’s say > 300) for all of the parameters?

Ah, those baseball models were a few of the models I thought of!

But it’s consistent with others, like most recently I have a large soccer model, hierarchical in several ways, 6 likelihoods, parameter sharing across them, 86 leagues of data, etc. Same issue.

In each of these cases ESS from Stan is fine.

If it is the trajectory being too short, one of the main differences in result between Stan and Nutpie is that Stan is more conservative in finding an optimal tree depth, and Nutpie is more aggressive in making the trajectories shorter, which is where much of the speed up comes from, to over simplify.

I could set extra_doubling, and if it fixes it, it would also make the speed per draw more similar to Stan. Which means that in the models that Nutpie succeed in posteriordb there is something about the geometry and model structures where Stan’s approach is too conservative, and in the models I’m running, Nutpie is too aggressive. But I don’t know of an obvious pattern to generalize.

I had also forked Flatiron Walnuts repo to give myself extra tuning parameters, one of which is a continuous --unit-mass to float between the two ideas but I haven’t had a lot of time to actually investigate:

Feel free to move this into a new thread if this is too adjacent to the Sparse Nuts.

I think this discussion is interesting and could go on for some time, so I have split it off

Figuring something like this out without the model itself is a bit tricky. So I wrote you a little checklist, I hope that helps:

  • First, just to make sure we are comparing the same way. Ess/s can mean very different things. Is it the minimum over all parameters? Bulk ess? On the constrained or unconstrained space? Does it also include the logp? Is the absolute min ess value big enough so that it is not mostly noise? Are there divergences that might indicate we don’t actually sample the full posterior? I’m also not so sure how well the max energy cutoff for a divergence actually works in models that are that big. If the model only looks like it is converging but actually isn’t, all sampler comparisons would be pretty pointless.

  • With big models checking parallelization is always good. Are you seeing consistent times for time per grad eval? There could also be a difference in how things are parallelized. Always good to check.

  • Is tuning long enough for all samplers? 1000 tune nutpie is not necessarily the same as 1000 tune in Stan, they might have used a vastly different amount of grad evals. (There was a small change in nutpie recently, now the window size goes to inf as the tuning length increases.)
    In nutpie you can save the mass matrix (store_mass_matrix) and check if it continues to change. You can also have a look at the fisher_distance statistic, which should also more or less settle near the end of tuning. Stan doesn’t expose either of those however…

  • I’d also check the acceptance statistic (different from the acceptance rate!). Both nutpie and stan store this as a sampler statistic. The mean of this over the posterior should be roughly the target acceptance rate you specified (0.8 by default). Stan has a tendency to give you a higher acceptance statistic than you asked for, and sometimes that actually happens to benefit the efficiency. I’m not actually sure why that is, but it certainly happens.

  • (later edit) Check if there are multiple modes. If one of the samplers only returns one mode but the other sampler finds multiple, then the convergence statistics of the second sampler will look horrible, even though it might be doing a better job.

Past the checklist, a few more thoughts on the mass matrix specifically:

Let’s first assume that the posterior is mvnormal, because otherwise we don’t really have good theory about the whole thing. We have decent theory about how good a mass matrix is for a perfectly tuned hmc sampler.
If we assume for a moment that this coincides with what nuts does, we can use kappa’ as defined in the paper to tell how good our choice is. I recently had a closer look at how that depends on the spectrum of the original covariance, and the picture looks something like this (probably worth it to update the paper at some point…): If the covariance is close to diagonal, then the diagonal nutpie mass matrix is always better than the stan mass matrix. It converges faster, and it converges to a mass matrix with better kappa’ than stan. If the covariance is not at all like a diagonal matrix, it depends a bit on the concrete spectrum. Single large eigenvalues favor diagonal nutpie a lot over diagonal stan, and single small eigenvalues slightly favor stan. It is not symmetric though, if you have both large and small ones (so eg a lognormal distribution of eigenvalues) nutpie is quite a bit better in expectation over stan.

Things as always get more complicated if we add the dynamic nature of nuts to the mix: we don’t have a perfect hmc sampler, we have nuts. And it might choose good trajectory lengths or it might not. I’m afraid I really don’t understand well what the circumstances of this are. It definitely also depends on the eigenspectrum of the covariance again. I’ve been reading this paper recently to finally understand that a bit better, and what I get so far is that nuts likes it if (\sum \sigma_i)^2/\sum\sigma_i^2 is large, and I think that should be better with the nutpie mass matrix? I’m not 100% about this yet however. So I don’t think the uturn criterion should be more problematic with nutpie than with stan, but I could be wrong about this. But anyway, the extra_doublings parameter gives you a knob to try that out. If you increase it nutpie will definitely be slower in wall time, but if the problem really is that the trajectory was too short, it should much more than make up for this in terms of ess/s.

Then of course there’s nonlinear stuff, and in a 100k hierarchical model there might be a lot of that. The posterior is not a mvnormal, and I don’t have any theory about that, and I don’t think someone else really does either. I’d love to be wrong about this though. The theory behind the nutpie mass matrix (especially appendix B) I think suggests that it should usually be better with an ideal hmc sampler, but I have no idea how one would quantify that. And again, NUTS might be doing who-knows-what.

Thanks for the checklist.

I’ve moved on from apples-to-apples at the moment (hoping more development that irons whatever kinks); but to give a brief example I pulled information from an email on the difference between the same model in nutpie and stan on 1000 samples from a fairly recent correspondence.

The Nutpie had around 1000 parameters with low ESS bulk or tail or both:

> nutpie_summary |> filter(ess_bulk < 100 | ess_tail < 100) 

# A tibble: 996 × 10

   variable            mean    median        sd       mad        q5       q95  rhat ess_bulk ess_tail

<chr>              <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl> <dbl>    <dbl>    <dbl>

 1 xi[1]          0.0985    0.0985    0.0169    0.0167    0.0705    0.127     1.03     51.8      92.4

 2 xi[2]          0.317     0.317     0.0167    0.0166    0.288     0.344     1.03     49.0      66.2

 3 eta_lg[4]      1.12      1.12      0.0285    0.0297    1.08      1.17      1.00     97.9     261. 

 4 eta_lg[6]      1.03      1.03      0.0270    0.0267    0.991     1.08      1.02     89.1     215. 

 5 eta_lg[7]      0.918     0.919     0.0253    0.0243    0.873     0.959     1.01     82.3     210. 

 6 beta_z[1]      2.33      2.30      0.185     0.172     2.08      2.69      1.02     46.6     110. 

 7 beta_z[2]      2.46      2.38      0.705     0.749     1.40      3.78      1.08     11.5      23.6

 8 T_extra        0.0000360 0.0000364 0.0000153 0.0000206 0.0000155 0.0000619 1.41      2.19     12.7

 9 tau_clock[2,4] 0.113     0.113     0.000432  0.000393  0.113     0.114     1.000    79.8     141. 

10 tau_clock[2,6] 0.130     0.130     0.000681  0.000661  0.129     0.131     1.00     94.2     231. 

# ℹ 986 more rows

# ℹ Use `print(n = ...)` to see more rows

The Stan sampler did have a single low ESS parameter:

> stan_summary |> filter(ess_bulk < 100 | ess_tail < 100)

# A tibble: 1 × 10

  variable        mean median      sd     mad    q5   q95  rhat ess_bulk ess_tail

<chr>          <dbl>  <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>

1 rho_z          0.171  0.170 0.0323  0.0333  0.120 0.229  1.04     48.1     163.

This kind of difference has been the norm of my experience when comparing. No divergences with either. I’ve seen a similar flavor trying it on literally every model I’ve reached for it to try, all large and complex.

I’ll pull the papers and take a look.

That’s an interesting example, thanks for sharing.

Cases like this, where both samplers have parameters with low ess are common, but I never could really figure out how do deal with those in benchmarks. In the paper we simply exclude all models where none of the samplers got a decent min-ess. That’s not because I don’t care about those, but I really don’t know how to compare things there. ess/second is a measure of efficiency, and efficiency only makes sense once you’ve established that you’re getting to the right place. If you’re not, faster isn’t better, it’s just producing wrong draws faster. If all samplers fail to some degree the question has to shift a bit: Instead of “how efficient are we” it should be “how close to the true posterior do we get”. And I just don’t know how to answer that in general.

A very natural thing to do is to just count how many parameters look bad, or to look at the median ess or something like this. But it isn’t hard to come up with counter examples to this:

Let’s take this simple model with a few coupled mild funnels, where we do know the true posterior anyway. Both nutpie and stan struggle with this quite a bit, and in essentially the same way.

data {
    int N;
    vector[N] scale;
}
parameters {
    real log_std;
    vector[N] y;
}
model {
    log_std ~ normal(0, 1);
    y ~ normal(0, exp(scale * log_std));
}

With these settings

rng = np.random.default_rng(1234)
scale = rng.normal(size=1000) * 0.2
compiled = nutpie.compile_stan_model(filename="model.stan")

If we sample with seed 1234 we get something that looks superficially ok based on most diagnostics:

tr = nutpie.sample(compiled.with_data(N=len(scale), scale=scale), chains=4, seed=1234)

No divergences, all the y variables have decent ess (even tail ess doesn’t look too bad), and rhat is at most 1.02 for y. Only log_std has an rhat of 1.07 and a low ess.

But the posterior for y is completely incorrect. Here is a plot of the actual standard deviations vs the standard deviations the draws give us for the different y values:

Suffice it to say that the dots are not on the diagonal…

If we just change the seed to 123456 we get way worse diagnostics. Suddenly rhat of log_std is 1.23, and lots of y values have bad ess and bad rhat values. But still, the standard deviations of y look much better:

This is of course a somewhat synthetic example, and I’m not trying to say that this is what’s going on in your model. But comparing non-convergent or only somewhat convergent model runs is a hard problem that I wish I knew how to do.

Agree this is a real challenge for comparing software/samplers and I have no good ideas/solutions. One pragmatic way to deal with it is to run multiple analyses with different chains to quantify this behavior.

FWIW I ran this example through SNUTS too using the following code.

library(RTMB)
library(SparseNUTS)
set.seed(12345)
scale <- rnorm(1000)*.2
dat <- list(scale=scale)
pars <- list(log_std=1, y=rep(1,1000))
f <- function(pars){
  getAll(pars,dat)
  lp <- dnorm(log_std,0,1, log=TRUE) +
    sum(dnorm(y, 0, sd=exp(scale*log_std), log=TRUE))
  return(-lp)
}
obj <- MakeADFun(f, parameters=pars, random='y')
# SNUTS on the full posterior, fails as expected
snuts <- sample_snuts(obj, globals=list(dat=dat), seed=514)

And it has the same issues, to no surprise. All SNUTS is doing here is descaling based on Q since there are no correlations, as is apparent by the console output prior to sampling:

Optimizing marginal posterior with 1 fixed effects and 1000 random effects...
Getting Q and its stats...
Q is 99.95% sparse | Ratio of marginal SDs=1 | Max abs cor >=0
Rebuilding RTMB obj without random effects...
diag metric selected b/c low max cor (0)
log-posterior at inits=(-1513.51,-1168.45,-1125.59,-1313.2); at conditional mode=-919.857

However I can use SNUTS to sample the marginal posterior (in this case just log_std) using “embedded Laplace approximation” (ELA) and it works just fine. Wall time is less, and way more effective samples.

# SNUTS on the marginalized posterior (only 1 parameter)
snuts_ela <- sample_snuts(obj, globals=list(dat=dat), laplace=TRUE, seed=514)

Model 'RTMB' has 1 pars and was fit using SNUTS with a 'diag' metric
4 chain(s) of 1150 total iterations (150 warmup) were used
Run time per chain: average= 5.73 and max= 5.8 seconds 
Min bulk ESS=1561.3 (39.03%) [log_std] and maximum Rhat=1.001 [log_std]
There were 0 divergences after warmup

Even for this funnel example the Laplace approximation does a good job of approximating the posterior of log_std . The asymptotic estimate from Q is N(0,1) which is the prior. If we had a non-normal prior on log_std then the approximation wouldn’t be exact, but we’d still be able to efficiently sample the posterior of the parameter. We mention this in the SNUTS paper that ELA + SNUTS may be useful in some cases with challenging geometries like this. So while decorrelating may not help in cases like this, the ELA option is trivial to apply and has some promise.

I just tried with the flow adaptation in nutpie too, and that works quite well. It has to learn the dependence structure with neural nets though, which takes some time (one chain takes about 10min). It would be really cool to combine the conditional independence info you have in SNUTS with this. I think it would make it much more efficient, if we can figure out how to build good coupling flows or autoregressive flows based on it.

I wonder if fitting an R vine copula is faster and leads to similar results for the dependency structure as the flow model? Can you point me to the flow code in nutpie? A relatively quick test is to see if the cpp library https://github.com/vinecopulib/vinecopulib can estimate the dependency structure and output a decent mass matrix

I never really thought about vine copulas in this context at all, but it looks like that could actually be quite interesting.

The way I have been thinking about this so far is that we could use the dependence structure to inform our choices about how to split the variables into transformed and untransformed parameters in a Coupling layer. Usually people just choose that by random, but I think we can do much better. In a funnel for instance, we’d only need one Coupling layer if we know to put the scale parameter in the untransformed group and the funnel variables in the transformed group. I’ve been experimenting with using a max-cut algorithm based on the cross-covariance of transformed gradients and draws in the transformed space to get a more informed split, but I think that’s probably not nearly as good as knowing the correct split from the model structure right away.

The nutpie flow adaptation code is here:

It is still a bit research code at the moment though. It works, but could use quite a bit of cleanup.

The basic algorithm has some similarity to VI. In standard VI with a normalizing flow, you have a flow F that maps a standard normal to a proposal q, and you train F by drawing from the standard normal, pushing samples forward through F, and minimizing the KL divergence KL(q, p) (an integral over q).

We flip both of those. We take HMC draws from the actual posterior p, push them backward through F, and ask whether the result looks like a standard normal. The integral is over p, not q, and we use the Fisher divergence rather than KL.

We then run a window of HMC draws in the new transformed space; equivalent to Riemannian HMC in the original space with the identity metric on the transformed side pulled back through F. (So exactly the Riemannian HMC that’s scalable with d and can still use explicit leapfrog steps that Bob wants, I think 😉). F plays the role of an adaptive, nonlinear mass matrix or metric. This gives us new training data for F, and we repeat just like we repeatedly change the mass matrix during tuning in stan/nutpie.

This is really interesting but way over my head at the moment. If you want to use TMB to get the conditional independence structure for testing, that would be straightforward. E.g., for this model above it is a returned object

snuts$mle$Q |> str()

Which is a dscMatrix object from the Matrix R package. Or without running MCMC

opt <- with(obj, nlminb(par,fn,gr)) # optimize to marginal mode
Q <- sdreport(obj, getJointPrecision=TRUE, skip.delta.method=TRUE)$jointPrecision

10 minutes is surprisingly slow. It takes 0.02s to optimize and 0.06s to calculate Q here.

I think we should add one item to the checklist: Check if there is multi-modality that one sampler is picking up and the other isn’t.

I did some more comparisons for a couple models, and also found two models where nutpie seemed to perform much worse. Turns out the difference was that nutpie finds different modes and stan doesn’t. I have no idea why that would happen, but if that happens systematically I’d consider that a feature, not a bug. If there are multiple modes we want to know about them. I could also reproduce them in stan by initializing at nutpie draws from the different chains.

Something like this would also explain the pattern you are seeing, where stan shows more or less decent ess/rhat and nutpie has lots of very bad rhat values.

You can verify that those really are multiple modes if you initialize stan with points from the nutpie posterior:

inits = []
for chain in range(4):
    point = nutpie_trace.posterior.isel(draw=0, chain=chain).dataset.to_dict()
    inits.append({name: val["data"] for name, val in point["data_vars"].items()})

model = cmdstanpy.CmdStanModel(...)
stan_trace = model.sample(data=data, inits=inits)