It sounds like integrating with Stan at the C++ level would be the way to go. There seem to be a lot of advantages, and things like repeatedly reading and writing values out to files are a deal breaker.
I confess that right now my programming experience is limited to higher-level languages, although studying C++ has been on my to-do list for a while now. Some things you’ve written do suggest it might not be pleasant, though.
I’m very interested in this, so I’ll start taking a serious look within the next couple of weeks.
I do recall seeing discussions about getting higher-order derivatives via AD in Stan.
Finite differencing would be okay; the step is only performed once per model fit (at the mode on the unconstrained space). ForwardDiff in Julia does give Hessians via AD (though I imagine, as with the Stan ODE case, only for pure Julia functions), so users would have a choice of API.
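For concreteness, here’s a minimal Julia sketch of both routes. ForwardDiff and its `hessian` function are real; the log density `lp`, the mode `θhat`, and the finite-difference helper are stand-ins made up for illustration.

```julia
using ForwardDiff                     # real package; hessian(f, x) is its actual API

lp(θ) = -0.5 * sum(abs2, θ)           # stand-in unnormalized log density
θhat  = zeros(3)                      # stand-in mode on the unconstrained space

H_ad = ForwardDiff.hessian(lp, θhat)  # AD Hessian; works for pure Julia functions

# Central finite differences as a fallback; acceptable since it runs once per fit.
function fd_hessian(f, x; h = 1e-4)
    p = length(x)
    H = zeros(p, p)
    for i in 1:p, j in 1:p
        eᵢ = zeros(p); eᵢ[i] = h
        eⱼ = zeros(p); eⱼ[j] = h
        H[i, j] = (f(x + eᵢ + eⱼ) - f(x + eᵢ - eⱼ) -
                   f(x - eᵢ + eⱼ) + f(x - eᵢ - eⱼ)) / (4h^2)
    end
    (H + H') / 2                      # symmetrize away rounding noise
end

H_fd = fd_hessian(lp, θhat)
```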
I imagine most people would prefer to be able to define models using Stan’s syntax.
Dropping the additive constants in a model is fine, as grids are easy to renormalize. The one caveat is that keeping the constants, so that no renormalization is needed, would give a convenient error estimate for free: the grid’s total mass should come out to exactly 1, and the deviation from 1 measures the quadrature error. But alternatives exist for building adaptive grids, so I don’t think it’s necessary.
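To make the renormalization concrete, a toy sketch; the grid and names here are hypothetical, and any rule with nodes and weights works the same way.

```julia
unnorm_logpdf(θ) = -0.5 * θ^2                  # additive constant already dropped

nodes   = range(-8, 8; length = 101)           # crude equally spaced grid
weights = fill(step(nodes), length(nodes))     # rectangle-rule weights

vals = exp.(unnorm_logpdf.(nodes))
Z    = sum(weights .* vals)                    # estimate of the dropped constant
post = vals ./ Z                               # renormalized density on the grid

# Had the constants been kept, Z should come out ≈ 1, and |Z - 1| is exactly
# the kind of built-in error estimate mentioned above.
```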
Why grids?
Of course, traditional tensor product grids are pretty awful beyond a small number of dimensions.
We could roughly define badness as error relative to the number of function evaluations.
To put it in notation: MCMC error is O(n^(-1/2)), while grid error is more like O(n^(-1/p)), where p is the number of parameters. At p = 3, MCMC is already looking better. If p is 20, or 100…
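A back-of-envelope version of those rates, just solving error = n^(-1/2) and error = n^(-1/p) for n at a target error:

```julia
# Function evaluations needed to hit target error ε under each convergence rate.
n_mcmc(ε)    = ε^(-2)    # error ~ n^(-1/2)
n_grid(ε, p) = ε^(-p)    # error ~ n^(-1/p) with p parameters

n_mcmc(0.01)             # 1.0e4
n_grid(0.01, 3)          # 1.0e6: grids already behind at p = 3
n_grid(0.01, 20)         # 1.0e40: hopeless
```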
In a nutshell, simple grids are bad because of the tensor product blowup: if our grid is a 2 x 2 x 2 x … product of Gauss rules and p = 20, we need over a million function evaluations (2^20 = 1,048,576).
Yet our estimates are actually going to be pretty horrible, because a univariate Gauss rule with only n = 2 nodes is exact only for polynomials up to degree 2n - 1 = 3. So if any of those 20 marginals isn’t well approximated by a cubic, our answer is going to be pretty bad.
Letting theta be our parameter vector, what that grid is exact for is 60th-degree multivariate monomials of the form theta_1^3 * theta_2^3 * theta_3^3 * … In other words, with our huge number of function evaluations we are buying resolution in the interaction terms.
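Here’s that blowup in code, using FastGaussQuadrature (a real Julia package) for the 2-node marginal rule; the tensor grid itself is just a lazy iterator product.

```julia
using FastGaussQuadrature    # gausshermite(n) returns (nodes, weights)

x1, w1 = gausshermite(2)     # n = 2 nodes: exact only up to degree 2n - 1 = 3 per marginal
p = 20

grid = Iterators.product(ntuple(_ -> 1:2, p)...)   # index tuples, built lazily
length(grid)                                       # 2^20 = 1_048_576 evaluations

# Node and weight for one index tuple; the weights multiply across dimensions,
# which is exactly where the resolution gets spent on interactions.
node(idx)   = [x1[i] for i in idx]
weight(idx) = prod(w1[i] for i in idx)
```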
Realistically, we want to spread our resolution more evenly instead of concentrating it on interactions. Or, given transforms that make our parameter vector asymptotically independent standard normals (in which case interactions vanish), we’d buy the most overall accuracy by weighting the marginals a little more heavily than the interactions.
For comparison, the simplest pattern for constructing sparse grids (Smolyak), with a node sequence that exactly integrates marginal polynomials of degree > 50, needs only 189,161 function evaluations at p = 20.
This grid has 35 marginal nodes; if you built the standard interaction-focused tensor grid with the same marginal accuracy, you’d need 35^20 (roughly 10^31) function evaluations.
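A hedged sketch of where counts like that come from, for a nested 1D sequence where m[l] is the node count at level l (so level l contributes m[l] - m[l-1] new points). The sequence below is illustrative, not the one behind the 189,161 figure.

```julia
new_points(l, m) = l == 1 ? m[1] : m[l] - m[l-1]

# Points in the level-L Smolyak grid in d dimensions: sum over multi-indices
# (l_1, …, l_d) with every l_j ≥ 1 and Σ l_j ≤ L + d - 1 of Π new_points(l_j).
function smolyak_count(d, budget, m)
    d == 0 && return 1
    total = 0
    for l in 1:min(budget - (d - 1), length(m))
        total += new_points(l, m) * smolyak_count(d - 1, budget - l, m)
    end
    total
end

m = [1, 3, 7, 15, 31]                # a hypothetical nested doubling-type sequence
L = 4
smolyak_count(20, L + 20 - 1, m)     # 13_201 points at p = 20 for this level
```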
One paper (https://arxiv.org/abs/1604.08466) presents a construction that appears to remove the dependence on p entirely. In other words, the number of function evaluations you need for a given accuracy [under conditions outlined in that paper] can be made independent of the number of parameters, and with O(n^-2) or so convergence instead of O(n^-1/2).
When I saw this I naturally got excited about Bayes, but I confess I have a lot of reading and work to do before I begin to understand the technical details.
What I’ve written so far generates grids using a simpler construction that does still ultimately fail as p increases, but it has led to great performance on a few babied problems with small p. Long term, I want to see if it can be made general enough not to need any babying, but the immediate application is sample size determination (rerunning a model thousands of times at each of a large number of sample sizes to get an idea of power), where model-specific tuning and hand-holding can be justified.