Making arbitrary C++ functions available to Stan


#1

I am trying to implement a custom function in C++ that will be ultimately used to compute the likelihood (and can be used to compute its gradient) – an example of a different gaussian process kernel and function to obtain the “K” matrix is shown below.

I am trying to follow instructions on Adding a new function with known gradients, but I gather from another thread that some function templating may be required, but I’m missing quite a few details on how to actually get it working. I understand that this is more straightforwardly done in CmdStan, so I installed it for that specifically (I have used RStan and PyStan just a couple times before that in the “standard” way).

I assume it will go something like this: write the templated C++ function, include the file under some Stan-math library folder, write the Stan program using the function and compiling it with the CmdStan interface.
Bearing in mind that I am new to Stan, can handle some C++ but I’m no developer, I’d appreciate if I could get help with the steps described above. Thanks.

#include <stan/math/rev.hpp>

namespace stan {
  namespace math {

    double exp_quad_kernel(double xa, double xb, double l1, double l2) {
        double k = exp(-pow(abs(xa-xb),2)/(pow(l1,2)+pow(l2,2)));
        return k;
    }

    MatrixXd k_matrix(VectorXd x, ArrayXd lengthscales, double sigmaf) {

    MatrixXd K(x.rows(), x.rows());
    
    double l1 = lengthscales[0];
    double l2 = lengthscales[1];
    for (int i=0; i<x.cols(); i++) {
        for (int j=0; j<x.cols(); j++) {
            K(i,j) = pow(sigmaf,2)*exp_quad_kernel(x(i), x(j), l1, l2);
        }
    }
    return K;
    }
    
  }
}

#2

That’s a very tall order. Our doc’s the best we’ve done so far. And we’re always happy to take contributions to make it easier.

If you only need the function for a single use, then you don’t need to template it—it just needs to implement the types that will actually be used. But you will need to use the reverse-mode precomputed_gradients class to implement the vari.

The best doc we have for the whole autodiff stack is the arXiv paper.


#3

Yea, I understand, it’s just that I’m struggling with the limitations of the modeling language, and the interfaces are just that, so I thought C++ was a direct route to flexibility in the modeling even if its not the friendliest language.
It turns out that I’m struggling with the C++ development as well. It’s probably feasible but it’s probably too much work as well.

It’s great that the autodiff works smoothly with most things, but the move to HMC makes things somewhat less modular than say MH, where all you need was to plug the likelihood. Of course there are good reasons for that, but I was hoping to be able to take advantage of other features of Stan, whether it’s MCMC in C++, or using HMC with an analytically provided gradient, and so on. But well, the tradeoffs are always there.

In any case, I’m not sure if I could contribute directly to this, but thanks for the support.


#4

It’s not so much lack of modularity as a differentiability requirement on the log density. That’s the price we have to pay for efficient transitions (the good reason you mention).

C++ is a huge pain on many different fronts, and our autodiff interface isn’t super easy. We’ve thought about how to make it easier, but it’s hard right now given the heterogeneity of the functions—you need the Jacobian w.r.t. each argument. For simple functions, it might be useful for us to provide a simpler user-defined function interface that is purely double-based. Something like:

Eigen::VectorXd
user_defined_function(
    const Eigen::VectorXd& alpha,
    const Eigen::VectorXd& x_r,
    const std::vector<double>& x_i,
    Eigen::MatrixXd& dy_dalpha);

where there is an argument alpha that can be parameters, then real and integer arguments that have to be data, then a mutable last argument for the Jacobian of the output vector with respect to alpha.

We could also simplify for the case of single-output functions:

struct foo {
  static double apply(const Eigen::VectorXd& alpha,
                      const Eigen::VectorXd& x_r,
                      const std::vector<double>& x_i,
                      Eigen::VectorXd& dy_dalpha);
};

If a user could implement the above, then we could easily wrap it in a templated metafunction that a user would invoke like this to define the function bar:

vector bar(vector alpha, vector x_r, int[] x_i) {
  return stan::apply_user_defined<foo>(alpha, x_r, x_i);
}

where we would write the apply_user_defined function.

It’s limiting in that it doesn’t efficiently degrade to cases where you don’t want derivatives. We could handle that limiting case by letting users define a signature with two static functions, simpler function:

struct foo {
  static double apply(const Eigen::VectorXd& alpha,
                      const Eigen::VectorXd& x_r,
                      const std::vector<double>& x_i,
                      Eigen::VectorXd& dy_dalpha);

  static double apply(const Eigen::VectorXd& alpha,
                      const Eigen::VectorXd& x_r,
                      const std::vector<double>& x_i);
};

so that the second could be called in cases where derivatives aren’t needed.

This also rigidly fixes the calling patter to be alpha data or parameters and x_r and x_i data only.


#5

Yes, it’s a good trade-off but a trade-off nonetheless. That kind of function would be very useful for (when convenient) using analytically provided gradients or (worse comes to worst) finite differences, and maybe the same would go for the generally simpler likelihood function – I’m thinking Kalman filters and particle filters more generally that would require stochastic simulations.
I guess many of us modelers got used to being able to do inference by computing the likelihood in whatever way possible but for HMC the gradients are needed as well. For me the generality of being able to use a programming language to specify generic likelihood/gradient would outweigh the inconvenience of not being able to model in a purely statistical language. But, yeah, like verything there’s a tradeoff there as well.


#6

To be pedantic, many people got used to running algorithms that required only likelihood/joint density evaluations – there has been little guarantee that the output of those algorithms was accurate enough to generate faithful inferences.

Gradients aren’t just about speed, they’re about maintaining accurate Bayesian computation in challenging problems, especially in high-dimensions. If you want as many guarantees as possible then the use of gradient-based algorithms isn’t a luxury but a necessity. There isn’t a trade off so much as required investment to be able to accurately explore your model.


#7

well, there are many ways of being pedantic: the bayesian statisticians may say gradients are necessary, others may say sequential monte carlo is the only way to go, modelers will insist that the most complex model is used regardless of the underlying inference algorithm, the experimentalists won’t care if you’re using the most sophisticated method as long as you get an answer that makes sense to them.
Most non-statisticians won’t even care if the method is bayesian or not (some geneticists are still using maximum parsimony or neighbor joining to build phylogenetic trees).

I’ve suffered through enough seemingly endless MCMC chains that I’m all for HMC, but I am yet to see it solve all my problems. Also because as soon as you have a better algorithm for some previously intractable model, people will want to try a more complex one that may be intractable with the newer algorithm – I know I would.


#8

Your response convolves modeling with computation – one of the driving motivation of Stan is to abstract computation away from the users so that they can focus on building models appropriate for their analyses. We can’t do that all the way, but we can provide as many principled diagnostics driven by theory as we can to inform the user when their computation is inaccurate and they’re not getting faithful inferences from their model. Given how poorly theoretical statistics communicates these concepts to applied users I understand why their importance is not obvious, but as someone who spends their time running back and forth between the two communities let me confidently state that if you want a reasonable chance at accurate computation in high dimensions for any model then gradients are going to be a necessity. And if the practical hurdles of implementing sufficiently fast gradients seems insurmountable I ask that you do what you can to verify that other gradient-free algorithms are actually recovering a useful answer.


#9

Inasmuch as it specifies the likelihood, the model formulation is always somewhat convoluted with the computation: the complexity and/or size of the model can make gradient computation infeasible in practice, or simulation-based approximations of the likelihood may be infeasible in principle, like particle filters (I’m not sure, didn’t really think about it). It may be the case that if you choose computing gradients you may not be able to model the process the way you’d like.
I don’t disagree with you I haven’t personally been able to compare HMC with MH for models that are actually useful to me – most people don’t, they just use whatever they can get away with, that’s what the other examples I gave further downstream from the computation and modeling hinted at.
My point for this topic was “simply” that allowing users to specify an arbitrary likelihood and a gradient functions directly (which presumably would be more straightforward in C++) would make the modeling more flexible in some cases while allowing the rest of the underlying implementation of Stan to be used.


#10

Only for efficient computation for general models circa 2018. In lots of specific cases, you can get by without gradients or even without MCMC. There are a lot of specific models where you can do things without gradients. Kalman filtering’s a good example—you don’t even need MCMC for that.

I hope nobody thinks we’re claiming NUTS or Stan does that!

In my experience, that depends on the inferences you get. We try to stop before introducing models that can’t be estimated with the data at hand.

Sure. But they’re not our audience.

How so? I can specify the likelihood symbolically in mathematics.

It can. That’s why we’re also interested in approximate methods. But most of our effort’s devoted to extending the envelope of where gradient-based computation may be employed.

Many of us are pretty devoted to an educational mission in addition to software development. But it’s slow going.

There isn’t much else to the implemntation of Stan! Most of it’s keyed off of gradient calculations. Almost every one of our math functions would be pretty straightforward to implement if it weren’t for gradients. And all of our algorithms use gradients. Seriously, there’s not much left outside of things that use gradients.

You could take the Stan language and use it to define models that you could plug in elsewhere. But most of those other techniques you mention like SMC are not black-box methods in their efficient forms. There are generic methods like MH, but we’ve never been able to make them work to our satisfaction. And while you’re certainly right that a lot of people will use tools that fail to do what they advertise, we don’t want to be purveyors of such tools!


#11

To be clear, I’m no trying to win an argument here, but to give a counterpoint to the developer/bayesian statistician’s point of view, which I’m not.

Coming from the physical/natural sciences, often we’re trying to model a specific problem, so it may not really be a choice of adding more terms or not: either you can model the system or you can’t. Of course we may try to simplify, but the result may be grossly unsatisfactory.

Well, numerical solutions are an example where you can’t do it symbolically (although there are still strategies to get useful numerical solutions for the likelihood and gradients), but again the example of particle filters rely not only on simulation, but also on stochastic states approximating the likelihood, which may preclude strategies to get numerical results. Hence the request for a flexible function, where things like finite differences could be used.

I think it’s totally fine that Stan has its main audience of statisticians, but I don’t see myself as too far away to be able to interact with the community, and I don’t think many other researchers farther off need be either. But that’s just my opinion.

Anyway, thanks for taking the time for entertaining this discussion, I really appreciate it.


#12

I hope we’re not coming off as too argumentative here. That’s certainly not my intent.

That’s not just the sciences—there are models that are dealbreakers for a lot of people.

In my own experience in the sciences, the statistical models are often pretty flexible. For instance, there may be enough data to estimate a two-pool soil carbon respiration model but not a three-pool model. Not because the science says there are two or three pools—the whole thing’s just an approximation to build a predictive model. Whehter to include covariates such as temperature, humidity, etc., as well as whether to model spatial characteristics of hierarchical models are often also pretty flexible in geological models.

Whereas if you’re trying to estimate the gravitational constant from observations of dropping balls, you have a pretty good idea of what the model is (though even still, how to model things like drag, etc., or even whether to model them can be up for grabs depending on the goal and the data).

You mean like implicit functions? We have an algebraic solver, but I’m sure that’s not general enough for every kind of numerical solution. And we’re only just getting around to adding a 1D integrator, which some densities require.

You mean taking finite diffs of a black-box function?

It’s possible right now to take a multivariate function and use Stan’s built-in first-order finite difference functional to implement a function that could be embedded in Stan. Or to do something custom if the function is a different shape—that’s what makes these things difficult to architect into our API and we wind up working through sequences so often in our interfaces like ODE solvers.

With a few examples like that which worked, I’d be tempted to integrate the functionality directly into Stan somehow. Right now, we don’t even have an automatic way to take in user-defined gradients which is probably a slightly higher priority for us.

We’re certainly not trying to exclude anyone! Maybe I should’ve phrased that not as “statisticians” as “audience of people working on applied statistics problems”. I’m a computer scientist/linguist! I’ve never taken a stats class in my life and would utterly fail Columbia’s Ph.D. quals!

Given our limited resources, we tend to focus on problems we can solve. Now if you’re volunteering to do open-source development on Stan, we can talk about how such functionality would have to be engineered to be integrated into Stan.

In case people have the wrong idea, there are only three of us being paid at Columbia to work on the math lib and language: me, Sean, and Mitzi, and I barely have time to code. Ben and Jonah both concentrate on RStan. Breck’s working on the business side and Andrew doesn’t write code. Charles works on the math lib, but he’s a student and doesn’t have much time these days.

It’s tough to find time to do anything!

My priorities now are (a) MPI [multi-core, multi-machine], (b) GPU, © tuples, ragged arrays, and function types in the math lib, (d) testing forward-mode autodiff, and (e) rearchitecting the service writer patterns. I’d also very much like to make it easier to install and manage Stan and its interfaces.


#13

The ODE solver is a great thing to have, but there are discrete stochastic models where I’m not sure there are solution other than finite differences, maybe there are. The code you propose above would go a long way to solve cases not yet implemented or generic cases where the user just wants to use their own gradients in whatever way they can, and from what you say it may not be that complicated.

I understand resources are limited, and I’m happy to work through a specific case study – like I mentioned what I have in mind right now are particle filters for dynamic models, where the likelihood is approximated through stochastic simulation. Right now I’m not actively working on this kind of MCMC inference, but I have a few examples in the drawer. I also read some post on BEAST and bayesian phylogenetic inference. That’s another one, but I think it’s probably more complicated and I don’t have anything at hand.


#14

To put things into context, I was originally trained as an experimental particle physicist and not a statistician. I focus on getting the statistics and computation right because I’ve seen case after case of scientists doing bad science using bad algorithms because they rationalized that the methods were “good enough” for their problem without actually knowing how to verify that. These are not afterthoughts to an analysis, and I claim that one can’t do really great work on the frontiers of science without them, but that’s me.


#15

Well, I think we probably have somewhat similar views on this, I’m not a statistician either, or computer scientist, but was trained in the natural sciences as well (among the Fortran-programming quantum chemists and physicists), but I’m now doing research in the biological sciences, where the care for method robustness are arguably even poorer (not to judge, but already judging, I can’t get my head around researchers using maximum parsimony estimation).

So I care deeply about proper bayesian inference, but also about treating stochasticity in dynamics systems appropriately, for instance, so problems begin when the methods are not in place to treat everything correctly and I have to choose between them, or write everything myself – which probably takes more time than I’m allowed to spend on doing so.