Any benefit to supplying gradient myself?

I’ve got a Stan program that fits a model with a Gaussian sampling distribution, whose mean is a nonlinear function of the parameters. The model seems to run well when the number of parameters is relatively small: I can recover the parameters from data simulated using the nonlinear function, and with real data I get seemingly reasonable results. “Small” meaning small in the context of the specific problem I’m solving (10–50 parameters). I’d like to increase the number of parameters by quite a bit (~200), but the runtime starts to get prohibitively long.

I have a closed-form expression for the gradient of the nonlinear function, and was curious whether there would be any speed benefit to supplying Stan with the gradient instead of using autodiff.
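For concreteness, the setup is roughly of this shape (a minimal sketch — the nonlinear function `f` here is a toy placeholder, not the actual physical model):

```stan
functions {
  // placeholder for the slow nonlinear "physical model";
  // in the real problem this is where most of the runtime goes,
  // and autodiff has to propagate through all of it
  vector f(vector theta, data vector x) {
    return theta[2] * exp(-theta[1] * x);
  }
}
data {
  int<lower=1> M;
  vector[M] x;
  vector[M] y;
}
parameters {
  vector[2] theta;
  real<lower=0> sigma;
}
model {
  theta ~ normal(0, 1);
  sigma ~ normal(0, 1);
  y ~ normal(f(theta, x), sigma);  // Gaussian sampling distribution
}
```

Since the reverse-mode autodiff pass through `f` costs roughly another evaluation of `f`, replacing it with a cheap closed-form gradient is where a speedup would have to come from.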

When you have more parameters in the model, do the algorithm diagnostics (treedepth, stepsize, acceptance rate, divergences for HMC) still look good? If so, then likely yes. If not, then you have bigger problems.

To be completely honest, I’ve never actually gotten a ~200 parameter model to complete (I gave up after about a day of runtime on a cluster).

With a 50-parameter model, the run completes with no divergent transitions, no treedepth saturations, and shrink factors (Rhat) well beneath 1.5, although this is with adapt_delta=0.99 and max_treedepth=15.

If the model really needs that adapt_delta and treedepth, then supplying the gradient might be the way to go. It’s not a big deal if you know some C++.

Is there a place to start learning how to do it in C++? The wiki, perhaps?

There’s a guide on the wiki about how to contribute functions to Stan, but that’s only if you want to try to get something incorporated into Stan proper. There’s another doc @bgoodri started (can’t find it, so tagging him) about how to just write the right kind of C++ function and #include it via rstan or CmdStan.
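For reference, the basic shape of that approach, as I understand it (the function name and header are placeholders): you declare the function in the `functions` block without a body, and supply the C++ definition — which can return precomputed gradients — at compile time.

```stan
functions {
  // no body here: the definition (with its hand-coded gradient)
  // lives in an external C++ header supplied when compiling
  vector f(vector theta, data vector x);
}
```

In rstan this is the `allow_undefined = TRUE` plus `includes = '#include "f.hpp"'` route to `stan_model()`; with CmdStan, you compile with `STANCFLAGS=--allow-undefined` and `USER_HEADER` pointing at your header. On the C++ side, `stan::math::precomputed_gradients()` is the usual way to return a value together with its analytic gradient so that autodiff never touches the expensive function.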

Yes, thanks!

Thanks!

Hi folks,

A brief (edit: lie) update on this: I wound up thinking a bit more on sakrejda’s original comment, and decided to question the statistical model before trying to speed up the physical model (the physical model being the slow bit that gives me the mean function). One thing I had been ignoring in the model diagnostics was the estimated Bayesian fraction of missing information (E-BFMI) warnings that this model was throwing quite frequently. The prior I’m using is actually relatively weak while introducing a healthy bit of between-parameter correlation (I use N-1 Gaussian distributions centered on the first differences of the N parameters).
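In Stan notation, that first-difference prior looks something like this (`theta` and `tau` are stand-ins for the actual names and scales):

```stan
// N-1 Gaussians centered on the first differences: a random-walk-style
// prior that ties neighboring parameters together, but says little
// about the overall level of theta
theta[2:N] ~ normal(theta[1:(N - 1)], tau);
```

Worth noting that a pure difference prior like this leaves the overall level of `theta` to the likelihood alone, which fits the “relatively weak prior with between-parameter correlation” description.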

I sat down and thought about the model a bit more, and realized that in the literature it’s known that for this specific problem, assuming infinite, perfect (noise-free) data and no regularization, there’s actually a closed-form solution for the ML parameter values: a finite number of delta functions superimposed on a background whose value approaches infinity. This seemed to jibe with the shape of the posterior distributions I was getting, since the marginal posteriors I observe (with a “regularizing” prior like the one mentioned above) tend to be very heavily positively skewed, with only a handful of marginal distributions looking “nicely constrained”. This led me to believe I was seeing something similar to the “delta functions on an infinite background” scenario from the closed-form solution: the well-constrained parameters correspond to the delta functions, the heavy positive tails for the rest of the parameters are related to the infinite background, and whatever modes do appear in the skewed, heavy-tailed distributions are primarily due to the prior. On top of this, I’m actually purposefully “overparameterizing” the model, to where I expect a significant portion of the parameters to be very poorly constrained by the data.

So, I imposed an “additional” Gaussian prior on the parameter values (i.e. N mean-zero Gaussians for the N parameters). This allowed me to bring max_treedepth and adapt_delta back to their default values while still getting the model to estimate 100 parameters relatively quickly. The low E-BFMI warnings still appear from time to time, but for the most part they only consistently appear when I try to fit “garbage” (low-SNR) data. With “good” data they still show up sporadically, but very infrequently (which I’m tempted to call “good enough”).
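For what it’s worth, the “two prior” setup is just two additive terms in the log prior; sketched in a model block (names hypothetical, matching the difference-prior description above):

```stan
model {
  theta[2:N] ~ normal(theta[1:(N - 1)], tau);  // smoothness prior on first differences
  theta ~ normal(0, s);                        // added mean-zero prior pinning down the level
  y ~ normal(f(theta, x), sigma);
}
```

Each `~` statement just adds its log density to `target`, so this is still a single well-defined joint prior rather than two competing ones.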

I don’t know that I’m necessarily super fond of this “two prior” approach, since it feels like it borders on choosing a prior for computational convenience, but I’m also starting to feel like the problem is just intrinsically poorly constrained. The tl;dr, then, is that I think I’m still going to take a crack at implementing the analytic gradient in C++ (since then I could try running really long warmup and sampling periods). I’m wondering if the quickest route would be to implement the gradient computation directly in the Stan language, take the C++ generated by Stan for the mean function and its gradient, and then try to combine those using the tutorial for adding a function and its gradient to Stan.

Sorry, that devolved into more of a thought dump than anything else!

Thanks,
Taylor

Given these considerations, I think implementing the analytic gradient is still premature; it’s better to push further to understand the experimental design and what is and isn’t reasonable to infer in your model. This will provide a better foundation for choosing priors, especially if you consider simulated data — sampling ground truths from the prior and data from the corresponding likelihood — and analyze the interaction between the two. See, for example, the discussion in Section 2 of https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html.