The autodiff tree can get large pretty quickly. The thing to remember is that most variables in Stan are autodiff variables, so even though they act like doubles, they carry around more information. For a scalar there is the value (8 bytes), the adjoint (used by autodiff, another 8 bytes), and a pointer that the C++ implementation needs behind the scenes (another 8 bytes). That is roughly 24 bytes per scalar instead of 8.
And then for every operation you do, there will be temporaries and these temporary variables also take up space. Even though they aren’t visible in the code, they need to be saved so reverse mode autodiff will work.
So, for example, in this:
real a = 1.0;
real b = 2.0;
real c = 3.0;
real d = a + b + c;
There will be a hidden expression for either a + b or b + c, and that will take memory as well. This is true for all the loops and everything in them in your model, so I assume that's where the blow-up is happening.
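To make the hidden temporary concrete, here is a minimal toy sketch of a reverse-mode tape in C++. It is not Stan Math's actual implementation: `Node`, `Tape`, and `record_sum` are hypothetical names made up for illustration, and a real node also stores extra bookkeeping. The point is only that recording d = a + b + c leaves five nodes on the tape, not four, because the intermediate sum has to be kept for the reverse pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy tape node (hypothetical; not Stan Math's real classes).
struct Node {
  double val;  // value of this variable
  double adj;  // adjoint, filled in by the reverse pass
  int lhs;     // index of the left operand, or -1 for a leaf
  int rhs;     // index of the right operand, or -1 for a leaf
};

struct Tape {
  std::vector<Node> nodes;

  // Record an input variable.
  int leaf(double v) {
    nodes.push_back({v, 0.0, -1, -1});
    return static_cast<int>(nodes.size()) - 1;
  }

  // Record a + b as a new node that remembers its operands.
  int add(int a, int b) {
    assert(a >= 0 && b >= 0);
    nodes.push_back({nodes[a].val + nodes[b].val, 0.0, a, b});
    return static_cast<int>(nodes.size()) - 1;
  }

  // Reverse pass: walk the tape backwards and push adjoints to operands.
  void backward(int out) {
    nodes[out].adj = 1.0;
    for (int i = out; i >= 0; --i) {
      if (nodes[i].lhs < 0) continue;  // leaf: nothing to propagate
      nodes[nodes[i].lhs].adj += nodes[i].adj;
      nodes[nodes[i].rhs].adj += nodes[i].adj;
    }
  }
};

// Records d = a + b + c and runs the reverse pass.
Tape record_sum() {
  Tape t;
  int a = t.leaf(1.0);
  int b = t.leaf(2.0);
  int c = t.leaf(3.0);
  int ab = t.add(a, b);  // the hidden temporary
  int d = t.add(ab, c);
  t.backward(d);
  return t;
}
```

In this toy version the tape for that one line holds a, b, c, the hidden a + b, and d. Multiply that kind of overhead by every operation inside every loop iteration and the memory adds up fast.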
Have you tested that partial_sum with reduce_sum yet? I think the way reduce_sum works, you should be able to limit how much of the autodiff tree is in memory at any point. Can you try the tests you have with reduce_sum and see if the peak memory usage changes?
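Here is a rough toy sketch in C++ of why slicing could help with peak memory. It assumes each slice is differentiated on its own small tape and only the value and gradient are kept afterwards, which is my understanding of the idea and not a copy of how reduce_sum is actually written. All names here (`Tape`, `slice_sum`, `chunked_sum`, `grainsize` as a plain argument) are hypothetical, and it runs serially, whereas reduce_sum can run slices in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy tape node with local partial derivatives (hypothetical, illustration only).
struct Node {
  double val, adj;
  int lhs, rhs;   // operand indices, -1 if unused
  double dl, dr;  // partial derivatives with respect to lhs and rhs
};

struct Tape {
  std::vector<Node> nodes;
  int push(Node n) {
    nodes.push_back(n);
    return static_cast<int>(nodes.size()) - 1;
  }
  int leaf(double v) { return push({v, 0.0, -1, -1, 0.0, 0.0}); }
  int scale(int a, double k) { return push({nodes[a].val * k, 0.0, a, -1, k, 0.0}); }
  int add(int a, int b) { return push({nodes[a].val + nodes[b].val, 0.0, a, b, 1.0, 1.0}); }
  void backward(int out) {
    nodes[out].adj = 1.0;
    for (int i = out; i >= 0; --i) {
      if (nodes[i].lhs >= 0) nodes[nodes[i].lhs].adj += nodes[i].dl * nodes[i].adj;
      if (nodes[i].rhs >= 0) nodes[nodes[i].rhs].adj += nodes[i].dr * nodes[i].adj;
    }
  }
};

struct Result {
  double value;            // sum over the slice
  double dtheta;           // derivative with respect to theta
  std::size_t peak_nodes;  // largest tape that was alive at once
};

// Sum of theta * x[i] over [begin, end), recorded on a fresh tape that is
// thrown away when the function returns.
Result slice_sum(const std::vector<double>& x, std::size_t begin,
                 std::size_t end, double theta) {
  Tape t;
  int th = t.leaf(theta);
  int acc = t.leaf(0.0);
  for (std::size_t i = begin; i < end; ++i) {
    int term = t.scale(th, x[i]);
    acc = t.add(acc, term);
  }
  t.backward(acc);
  return {t.nodes[acc].val, t.nodes[th].adj, t.nodes.size()};
}

// Same total, but only one slice's tape exists at any time.
Result chunked_sum(const std::vector<double>& x, double theta,
                   std::size_t grainsize) {
  Result total{0.0, 0.0, 0};
  for (std::size_t begin = 0; begin < x.size(); begin += grainsize) {
    std::size_t end = std::min(begin + grainsize, x.size());
    Result r = slice_sum(x, begin, end, theta);
    total.value += r.value;
    total.dtheta += r.dtheta;
    total.peak_nodes = std::max(total.peak_nodes, r.peak_nodes);
  }
  return total;
}
```

Calling `slice_sum` over the whole data set and `chunked_sum` with a small `grainsize` gives the same value and gradient, but the largest tape in the chunked version only scales with the slice length. Whether the real thing behaves this way for your model is exactly what the memory test should show.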