Hi,
I wrote the likelihood for the difference between two gamma distributions with the same rate using the saddlepoint approximation (thanks to @martinmodrak’s blog post!). I was able to find an analytical solution for the saddlepoint equation, and the function is below.
The thing is that it’s quite slow (~10 minutes for 1000 observations), and I don’t think it can be optimized further (I’ll be happy to be proved wrong). Would I gain a significant speedup if I coded it in C++ together with its analytical derivatives? (I’m not fluent in C++ and it would be major work for me, so I’m trying to figure out whether it makes sense first.) Or is there something else that could be tried first?
real gammadiffs_lpdf(real y, real alpha1, real alpha2, real beta) {
  // Saddlepoint approximation for the difference of two gammas with shared rate beta:
  // K(t)   = -alpha1*log(1 - t/beta) - alpha2*log(1 + t/beta)
  // K'(t)  = alpha1/(beta - t) - alpha2/(beta + t)
  // K''(t) = alpha1/(beta^2 - 2*beta*t + t^2) + alpha2/(beta^2 + 2*beta*t + t^2)
  // with t between [-beta, beta].
  // Solve the saddlepoint equation K'(s) = y,
  // or K'(beta*2/(1 + exp(-s_raw)) - beta) = y, where s_raw is unbounded and
  // s = beta*2 / (1 + exp(-s_raw)) - beta.
  // Root: s_raw = -log(2.0) - log(alpha1)
  //               + log(sqrt((alpha1 - alpha2 - 2.0*beta*y)^2 + 4.0*alpha1*alpha2)
  //                     - alpha1 + alpha2 + 2.0*beta*y)
  // PDF = 1/sqrt(2*pi*K''(s)) * exp(K(s) - s*y)
  real sqrtterm = sqrt(4.0 * alpha1 * alpha2 + (alpha1 - alpha2 - 2.0 * beta * y)^2.0);
  real term = (-alpha1 + alpha2 + 2.0*beta*y + sqrtterm);
  real term2 = (2.0*alpha1/term + 1.0);
  real rterm2 = 1/term2;
  real k1 = - 2.0*log(1.0/sqrt(2.0)); // = log(2) = 0.6931472
  real tau = pi()*2.0;
  real logalpha1 = log(alpha1);
  real logpdf = -alpha1*(log(2) + log1m(1.0/term2))
                + alpha2*log(term2)
                - k1*alpha2
                + beta*y
                - 2.0*beta*y*rterm2
                - log(tau)
                - log_sum_exp(log(alpha1) - log(2.0*beta^2 - 4.0*beta^2*rterm2 + beta^2.0 + (-beta + 2.0*beta*rterm2)^2.0),
                              log(alpha2) - log(-2.0*beta^2 + 4.0*beta^2*rterm2 + beta^2.0 + (-beta + 2.0*beta*rterm2)^2.0))/2;
  return logpdf;
}
This is a small cog in a large model, and I need to save brute force for later - I will need it!
I know it has to run faster with analytical gradients, but my question is whether that means it’ll take 9 minutes instead of 10, or 2 instead of 10. I’m not expecting an exact answer, just a ballpark estimate to decide whether to go that way.
But you could compare the speed of multi_normal_cholesky_lpdf, for which we have analytical gradients, vs a custom-coded one in Stan. This way you get a “feel” for this… but what that’s worth for your problem, I don’t know.
Another thing which usually buys you a lot is vectorisation. If you find a way to take advantage of shared terms by calculating the log-lik not just for a scalar “y” but for an entire vector of multiple “y”s, then you can gain a lot.
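For example, a hand-coded version along these lines could be timed against the built-in in an otherwise identical model (just a sketch of the standard MVN log density given a Cholesky factor; the name is made up to avoid clashing with the built-in):

functions {
  // hand-coded MVN log density given the Cholesky factor L of the covariance;
  // same math as multi_normal_cholesky_lpdf, but relying on autodiff instead
  // of the built-in's analytic gradients
  real my_multi_normal_cholesky_lpdf(vector y, vector mu, matrix L) {
    int K = num_elements(y);
    vector[K] z = mdivide_left_tri_low(L, y - mu);  // z = L^-1 (y - mu)
    return -0.5 * K * log(2.0 * pi())
           - sum(log(diagonal(L)))                  // -log det(L)
           - 0.5 * dot_self(z);
  }
}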
Are there plans to make it possible for users to add the derivatives in Stan? That would be sooo convenient, since in many cases I can just get the analytical derivatives by copy-pasting from Wolfram Alpha…
While analytical gradients could help, my (limited) experience is that they only help a lot if you actually save computation, i.e. when there are a lot of terms shared between the derivatives w.r.t. the individual variables and the value itself. For some functions the derivative is just a big, long, hairy formula, and in that case I’ve found the gains from moving to C++ underwhelming (N = 1).
Anyway, some things I would totally try before inflicting the pain of moving to C++ on yourself:
I used the unbounded s_raw parameter in my blog only because algebra_solver broke otherwise. Since you’ve found an analytical solution, you don’t need to work with an unbounded parameter - maybe using s directly would lead to less computation?
Vectorization - rewriting the function to real gammadiffs_lpdf(vector y, vector alpha1, vector alpha2, vector beta) should be pretty straightforward - all of the functions you use should happily accept vectors, and you shouldn’t need any loops in your code, except maybe for the log_sum_exp term. (Obviously, if some parameters don’t differ between observations, you can keep them as real.)
Reuse subexpressions - there is a bunch of subexpressions that you use repeatedly (log(alpha1), beta^2, (-beta + 2.0*beta*rterm2)^2.0, …). Compute them just once and store them in local variables for reuse - see the sketch below for both of these points.
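To make those two points concrete, here is a rough, untested sketch of what I have in mind - I’m assuming alpha1, alpha2 and beta are shared across observations (so they stay real), and the _vec name is just to avoid clashing with your scalar version:

functions {
  // Rough, untested sketch: vectorize over y, compute shared subexpressions
  // only once, and loop only for the log_sum_exp term.
  real gammadiffs_vec_lpdf(vector y, real alpha1, real alpha2, real beta) {
    int N = num_elements(y);
    real logalpha1 = log(alpha1);
    real logalpha2 = log(alpha2);
    real beta2 = square(beta);
    vector[N] by = beta * y;
    vector[N] sqrtterm = sqrt(4.0 * alpha1 * alpha2
                              + square(alpha1 - alpha2 - 2.0 * by));
    vector[N] term = -alpha1 + alpha2 + 2.0 * by + sqrtterm;
    vector[N] term2 = 1.0 + 2.0 * alpha1 * inv(term);
    vector[N] rterm2 = inv(term2);
    vector[N] s = 2.0 * beta * rterm2 - beta;  // the saddlepoint itself
    vector[N] s2 = square(s);                  // reused in both K'' terms
    // everything except the final log_sum_exp is vectorized
    vector[N] lp = -alpha1 * (log2() + log1m(rterm2))
                   + alpha2 * log(term2)
                   - log2() * alpha2
                   + by
                   - 2.0 * by .* rterm2
                   - log(2.0 * pi());
    for (n in 1:N)
      lp[n] += -0.5 * log_sum_exp(
                 logalpha1 - log(2.0 * beta2 - 4.0 * beta2 * rterm2[n] + beta2 + s2[n]),
                 logalpha2 - log(-2.0 * beta2 + 4.0 * beta2 * rterm2[n] + beta2 + s2[n]));
    return sum(lp);
  }
}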
Hope those small changes will reduce your runtime substantially
I did that because I had to find roots for the equation where s was bounded (between -beta and beta). I replaced s with a function of an unbounded s_raw so that I was sure my solution was valid. I’m not sure the root of the equation is going to be valid if I ignore that constraint, but maybe it’s worth a try…
Thanks for the tips, I’ll also try to reuse all the expressions I can. (I won’t be able to use vectorization when I use this function for real, since it goes into a mixture.)
My (limited) understanding of the math is that (for non-crazy distributions), the solution is guaranteed to exist on the interval. If you are like me and use Wolfram Alpha to find the solutions, you usually can just add the constraints to the system to be solved and it works :-)
Maybe I am missing something, but you totally should be able to vectorize even when there is a mixture - you just need to return not the sum of all the log-likelihoods, but the vector of individual log-likelihoods. You can then use this vector in the mixture - I would guess this would still give you most of the benefits of vectorization (you basically just skip a final sum call, but vectorize all the previous steps).
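Roughly like this (a sketch only - gammadiffs_pointwise would be the vectorized function from my sketch above with return lp; in place of return sum(lp);, and theta, alpha1, alpha2 etc. are made-up names for a two-component mixture):

model {
  // pointwise log-densities for each mixture component, computed vectorized
  vector[N] lp1 = gammadiffs_pointwise(y, alpha1[1], alpha2[1], beta);
  vector[N] lp2 = gammadiffs_pointwise(y, alpha1[2], alpha2[2], beta);
  // mix per observation with weight theta
  for (n in 1:N)
    target += log_mix(theta, lp1[n], lp2[n]);
}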
For more complicated stuff I use the free online notebooks and the Reduce function - I don’t really understand the Wolfram language, but I’ve found it reasonably easy to apply as long as I don’t need anything fancy; see e.g. the first few lines of my Wolfram Cloud document, where I solve a bunch of equalities and inequalities for rho and k (I then realized I really just need one of the solutions and work with that in the later blocks).