Changing Function Implementation Alters Results

In the stan_math/stan/math/prim/mat/fun directory we have cumulative_sum.hpp with the simple implementation:

template <typename T, int R, int C>
inline Eigen::Matrix<T, R, C> cumulative_sum(const Eigen::Matrix<T, R, C>& m) {
    Eigen::Matrix<T, R, C> result(m.rows(), m.cols()));
    if(m.size() == 0)
        return result;
    result(0) = m(0);
    for(int i = 1; i < result.size(); i++)
        result(i) = m(i) + result(i - 1);
    return result;

if we alter this to:

template <typename T, int R, int C>
inline Eigen::Matrix<T, R, C> cumulative_sum_new_way(const Eigen::Matrix<T, R, C>& m) {
    Eigen::Matrix<T, R, C> result(m.rows(), m.cols()));
    if(m.size() == 0)
        return result;
    // Here's the change. Using an array to do the calculations.
    T resultArray[m.size()];

    resultArray[0] = ((stan::math::var)m(0)).val();
    for(int i = 1; i < m.size(); ++i) {
        resultArray[i] = ((stan::math::var)m(i)).val() + resultArray[i - 1];

    // Moving the values back to the Eigen matrix.
    for(int i=0; i < m.size(); ++i) {
        result(i) = resultArray[i];
    return result;

the two functions give the same values doing a stepwise comparison (“Values not equal” never prints):

for(int i = 0; i < size; i++) {
    if(originalWay(i) != newWay(i)) {
        printf("Values not equal");

But running my model using cumulative_sum_new_way() takes about 19 times longer and yields slightly different results (i.e. a mean of 0.52381 vs 0.51734).

I’d like the new implementation to yield the same results in terms of speed and values, but I’m unable to figure out how. BTW, this is just an example with a simple function. I wouldn’t actually do things this way for this function.

Statements like ((stan::math::var)m(0)).val(); will mess up the autodiff. Your gradients will end up being wrong.

When you take the val of a var, you throw away the autodiff information that is being built up in the forward pass. The best place to figure out what is going on is here:

1 Like

Thanks @bbbales2 makes sense. What I’m ultimately trying to do is offload work to the GPU with CUDA. Since CUDA doesn’t work with stan::math::var, I’m trying to figure out how to move data out and back.

Cool beans! I know @rok_cesnovar and @stevebronder are doing a bunch of OpenCL GPU stuff in Stan. Maybe they could point you to example code for stuff like this. There’s an open pull request for some GPU stuff that might be a place to look as well:

If you’re getting rid of vars, that means you’ll need to inject your own gradient calculations. If you’re not sure how that works, I’ll point you to some stuff, but I’ll spare you the long response otherwise :D.

I was hoping to sneak out to the GPU and back without doing any gradient calculations, but I’m guessing that isn’t possible. Yes please, any info on the gradient stuff would be greatly appreciated. Other than I know what that is (HMC), I don’t know too much about it.

I did see their OpenCL work. I actually got the Stan math code for that from the CmdStan gp3_example branch. I wanted to try my hand at adding support for other functions using CUDA that we might be using in our projects.

@bbbales2 The function I’ve been working with resides in the “prim” directory. I see if I add a similar function inline Matrix<var, R, C> cumulative_sum_new_way(const Matrix<var, R, C> &m) () to the “rev” directory and just return m (or construct a Matrix of new vars), I can get a dramatic speed up (even if final calculated values are still off). Should I assume that any gradient calculations I need to do must reside in that “rev” cumulative_sum_new_way.hpp file? Or do I still need to add gradient calculations to the “prim” cumulative_sum_new_way.hpp file as well?

Oh wow apologies for the delay. I missed your previous message.

You’re making a specialization of cumulative_sum, so in the simplest form that would go in rev, yes. The generic templated version would go in prim (no custom reverse mode stuff goes in there). There are other folders for different forward mode autodiff specializations.

To understand what rev is, you’ll want to read: . That’ll explain what a var is and what it does and it has lots of implementation details. There might be some little changes in some of the code – I’m not sure. It should all still be really accurate though.

The way I’d go about this is look at simple examples.

Here is the prim version of exp:

Here is the reverse mode specialization:

The code that tests the prim version is:

The code to test the reverse mode implementation is:

To run those tests you should be able to go into your Math folder and type:

./ test/unit/math/prim/scal/fun/exp_test.cpp
./ test/unit/math/rev/scal/fun/exp_test.cpp

If you were going to GPU-ize the exp function as an example (which would be really slow since it’s just a single operation), you’d add your GPU code here: to compute the exp on the GPU and then you’d put the code here: to do the reverse mode autodiff stuff (that’s explained better in the Arxiv paper).

This make sense?

1 Like

@bbbales2 Thanks! This will help.

Before I look at the PDF, I want to try to clear up some confusion. Would I add the GPU code to just the rev version? Or should I add GPU code to the prim version as well? It’s confusing that both the prim and rev files have code that calls std::exp().

Looking at softmax() as a more complicated example, it appears that the prim version does the actual softmax calculation, while the rev version has to code to calculate the gradient. To add to the confusion, the rev version of softmax() calls the prim version, but the rev version of exp() doesn’t call its prim version. So I’m unclear on the rev and prim relationship.

That’s it, that’s the thing. The rest is calling conventions. The autodidff paper describes it in gory details with diagrams and examples but you don’t absolutely need all that detail. It’s helpful to look at a two arg function, for example multiplication, to have a clear example of the calls needed in rev.

Yeah, if the rev version is implemented, then the prim version is only used for higher order autodiff (and the higher order stuff isn’t used in the current Stan algorithms).

Thanks @bbbales2 and @sakrejda I’m making some headway. Assuming I’m doing derivative for the cumulative sum function correctly:

When i = j:

\frac{\partial( \sum_{j=1}^N a_j)}{\partial a_i} = \sum_{j=1}^N 1

when i \ne j:

\frac{\partial( \sum_{j=1}^N a_j)}{\partial a_i} = \sum_{j=1}^N 0

I arrive at the code for the rev version of the cumulative sum (note: I’m using prim and rev cumulative_sum.hpp to test modifications):

class cumulative_sum_vari : public vari {
        vari **vi_array_;
        const int size_;
        const int idx_;
        cumulative_sum_vari(double val, vari **vi_array, int size, int idx) : vari(val), vi_array_(vi_array),
                                                                              idx_(idx) {}
        void chain() {
            for (int i = 0; i < size_; ++i) {
                if (i == idx_) {
                    vi_array_[i]->adj_ += adj_ * 1;
                    break; //No need to continue. If i ! != idx_ derivative is zero and adds no information to adj_

inline Eigen::Matrix<stan::math::var, Eigen::Dynamic, 1>
        cumulative_sum(const Eigen::Matrix<stan::math::var, Eigen::Dynamic, 1> &m) {
    using Eigen::Dynamic;
    using Eigen::Matrix;

    vari **m_vi_array = reinterpret_cast<vari **>(
            ChainableStack::instance().memalloc_.alloc(sizeof(vari *) * m.size()));
    for (int i = 0; i < m.size(); ++i) {
        m_vi_array[i] = m(i).vi_;

    Matrix<double, Dynamic, 1> m_d(m.size());
    for (int i = 0; i < m_d.size(); i++) {
        m_d(i) = m(i).val();

    Matrix<double, Dynamic, 1> cumulative_sum_d = cumulative_sum(m_d);

    Matrix<var, Dynamic, 1> cumulative_sum_return(m.size());
    for (int i = 0; i < cumulative_sum_return.size(); ++i) {
        cumulative_sum_return(i) = var(
                new cumulative_sum_vari(cumulative_sum_d[i], m_vi_array, (int) m.size(), i));

    return cumulative_sum_return;

With this in place, I do see a speed up with my new array approach for calculating the cumulative sum, however it’s not near as fast (~ 5 times slower) as it was using the original cumulative sum implementation without the rev version. Also the final values after sampling are still off a bit.

Is this normal (seeing different values and times) when calculating the gradients manually? Or am I still missing something? I leaning toward I’m missing something.

You probably want to write your derivatives for your partial sums (up to index m, m < N)

\frac{\partial \sum_{j = 1}^m a_j}{\partial a_i} = 1 if i < m, and zero otherwise

The mth output gets contributions from everything up to and including the mth input.

And your chain should be:

void chain() {
  for(int i = 0; i <= idx_; ++i)
    vi_array_[i]->adj_ += adj_ ;

or something like that

The regular autodiff is pretty fast. It can be hard to beat if you don’t have a bunch of math that lets you simplify things or linear algebra operations you can write more efficiently.

Sampling is gonna be random. The way to test your functions is with the unit tests in the math library, especially for something like this. It’ll probably be more confusing than anything to try to set up tests at the sampling level for functions like this.

I probably made in a mistake in what I typed out too. This stuff is really easy to make mistakes on. I don’t trust it till I have my little unit tests set up.

@bbbales2 That was it (the derivative and chain() implementation)! It’s only a tiny bit slower and the resulting values are identical (I’m using the same seed and parameter init values).

Thanks a bunch for walking me through this. It’s been an enormous help!

Thanks for sharing the process, we really need more examples.