It sounds like integrating with Stan at the C++ level would be the way to go. There seem to be a lot of advantages, and things like repeatedly reading and writing values out to files are a deal breaker.
I confess that right now my programming experience is limited to higher-level languages, although studying C++ has been on my to-do list for a while now. Some things you’ve written do suggest it might not be pleasant, though.
I’m very interested in this, so I’ll start taking a serious look within the next couple of weeks.
I do recall seeing discussions about getting higher-order derivatives via AD in Stan.
Finite differencing would be okay; the step is only performed once per model fit (at the mode on the unconstrained space). ForwardDiff in Julia does give Hessians via AD (though I imagine, as with the Stan ODE case, only for pure Julia functions), so users would have a choice of API.
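For concreteness, here’s a minimal Julia sketch of both routes. ForwardDiff and its `hessian` function are real; the log density `lp`, the mode `θhat`, and the finite-difference helper are stand-ins made up for illustration.

```julia
using ForwardDiff                     # real package; hessian(f, x) is its actual API

lp(θ) = -0.5 * sum(abs2, θ)           # stand-in unnormalized log density
θhat  = zeros(3)                      # stand-in mode on the unconstrained space

H_ad = ForwardDiff.hessian(lp, θhat)  # AD Hessian; works for pure Julia functions

# Central finite differences as a fallback; acceptable since it runs once per fit.
function fd_hessian(f, x; h = 1e-4)
    p = length(x)
    H = zeros(p, p)
    for i in 1:p, j in 1:p
        eᵢ = zeros(p); eᵢ[i] = h
        eⱼ = zeros(p); eⱼ[j] = h
        H[i, j] = (f(x + eᵢ + eⱼ) - f(x + eᵢ - eⱼ) -
                   f(x - eᵢ + eⱼ) + f(x - eᵢ - eⱼ)) / (4h^2)
    end
    (H + H') / 2                      # symmetrize away rounding noise
end

H_fd = fd_hessian(lp, θhat)
```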
I imagine most people would prefer to be able to define models using Stan’s syntax.
Dropping the additive constants in a model is fine, as grids are easy to renormalize. The one caveat is that keeping the constants, so that no renormalization is needed, would give a convenient error estimate for free: the grid’s total mass should come out to exactly 1, and the deviation from 1 measures the quadrature error. But alternatives exist for building adaptive grids, so I don’t think it’s necessary.
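To make the renormalization concrete, a toy sketch; the grid and names here are hypothetical, and any rule with nodes and weights works the same way.

```julia
unnorm_logpdf(θ) = -0.5 * θ^2                  # additive constant already dropped

nodes   = range(-8, 8; length = 101)           # crude equally spaced grid
weights = fill(step(nodes), length(nodes))     # rectangle-rule weights

vals = exp.(unnorm_logpdf.(nodes))
Z    = sum(weights .* vals)                    # estimate of the dropped constant
post = vals ./ Z                               # renormalized density on the grid

# Had the constants been kept, Z should come out ≈ 1, and |Z - 1| is exactly
# the kind of built-in error estimate mentioned above.
```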
Why grids?
Of course, traditional tensor product grids are pretty awful beyond a small number of dimensions.
We could roughly define badness as error relative to the number of function evaluations.
To put it in notation: MCMC error is O(n^(-1/2)), while grid error is more like O(n^(-1/p)), where p is the number of parameters. At p = 3, MCMC is already looking better. If p is 20, or 100…
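A back-of-envelope version of those rates, just solving error = n^(-1/2) and error = n^(-1/p) for n at a target error:

```julia
# Function evaluations needed to hit target error ε under each convergence rate.
n_mcmc(ε)    = ε^(-2)    # error ~ n^(-1/2)
n_grid(ε, p) = ε^(-p)    # error ~ n^(-1/p) with p parameters

n_mcmc(0.01)             # 1.0e4
n_grid(0.01, 3)          # 1.0e6: grids already behind at p = 3
n_grid(0.01, 20)         # 1.0e40: hopeless
```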
In a nutshell, simple grids are bad because of the tensor product blowup: if our grid is a 2 x 2 x 2 x … product of Gauss rules and p = 20, we need over a million function evaluations (2^20 = 1,048,576).
Yet our estimates are actually going to be pretty horrible, because a univariate Gauss rule with only n = 2 nodes is exact only for polynomials up to degree 2n - 1 = 3. So if any of those 20 marginals isn’t well approximated by a cubic, our answer is going to be pretty bad.
Letting theta be our parameter vector, what that grid is exact for is 60th-degree multivariate monomials of the form theta_1^3 * theta_2^3 * theta_3^3 * … In other words, with our huge number of function evaluations we are buying resolution in the interaction terms.
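Here’s that blowup in code, using FastGaussQuadrature (a real Julia package) for the 2-node marginal rule; the tensor grid itself is just a lazy iterator product.

```julia
using FastGaussQuadrature    # gausshermite(n) returns (nodes, weights)

x1, w1 = gausshermite(2)     # n = 2 nodes: exact only up to degree 2n - 1 = 3 per marginal
p = 20

grid = Iterators.product(ntuple(_ -> 1:2, p)...)   # index tuples, built lazily
length(grid)                                       # 2^20 = 1_048_576 evaluations

# Node and weight for one index tuple; the weights multiply across dimensions,
# which is exactly where the resolution gets spent on interactions.
node(idx)   = [x1[i] for i in idx]
weight(idx) = prod(w1[i] for i in idx)
```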
Realistically, we want to spread our resolution more evenly instead of concentrating it on interactions. Or, given transforms that make our parameter vector asymptotically independent standard normals (in which case interactions vanish), we’d buy the most overall accuracy by weighting the marginals a little more heavily than the interactions.
For comparison, the simplest pattern for constructing sparse grids (Smolyak), with a node sequence that exactly integrates marginal polynomials of degree > 50, needs only 189,161 function evaluations at p = 20.
This grid has 35 marginal nodes; if you built the standard interaction-focused tensor grid with the same marginal accuracy, you’d need 35^20 (roughly 10^31) function evaluations.
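A hedged sketch of where counts like that come from, for a nested 1D sequence where m[l] is the node count at level l (so level l contributes m[l] - m[l-1] new points). The sequence below is illustrative, not the one behind the 189,161 figure.

```julia
new_points(l, m) = l == 1 ? m[1] : m[l] - m[l-1]

# Points in the level-L Smolyak grid in d dimensions: sum over multi-indices
# (l_1, …, l_d) with every l_j ≥ 1 and Σ l_j ≤ L + d - 1 of Π new_points(l_j).
function smolyak_count(d, budget, m)
    d == 0 && return 1
    total = 0
    for l in 1:min(budget - (d - 1), length(m))
        total += new_points(l, m) * smolyak_count(d - 1, budget - l, m)
    end
    total
end

m = [1, 3, 7, 15, 31]                # a hypothetical nested doubling-type sequence
L = 4
smolyak_count(20, L + 20 - 1, m)     # 13_201 points at p = 20 for this level
```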
One paper (https://arxiv.org/abs/1604.08466) presents a construction that appears to remove the dependence on p entirely. In other words, the number of function evaluations you need for a given accuracy [under conditions outlined in that paper] can be made independent of the number of parameters, and with O(n^-2) or so convergence instead of O(n^-1/2).
When I saw this I naturally got excited about Bayes, but I confess I have a lot of reading and work to do before I begin to understand the technical details.
What I’ve written so far generates grids using a simpler construction that does still ultimately fail as p increases, but it has led to great performance on a few babied problems with small p. Long term, I want to see if it can be made general enough not to need any babying, but the immediate application is sample size determination (rerunning a model thousands of times at each of a large number of sample sizes to get an idea of power), where model-specific tuning and hand-holding can be justified.