Hi,
I wrote the likelihood for the difference between two gamma distributions with the same rate using the saddlepoint approximation (thanks to @martinmodrak’s blog post!). I was able to find an analytical solution for the saddlepoint equation, and the function is below.
The thing is that it’s quite slow (~10 minutes for 1000 observations), and I don’t think it can be optimized further (I’ll be happy to be proved wrong). Would I gain a significant speedup if I coded it in C++ together with its analytical derivatives? (I’m not fluent in C++ and it would be major work for me, so I’m trying to figure out whether it makes sense first.) Or is there something else that could be tried first?
real gammadiffs_lpdf(real y, real alpha1, real alpha2, real beta) {
  // Saddlepoint approximation for the difference of two gammas with shared rate beta:
  // K(t)   = -alpha1*log(1 - t/beta) - alpha2*log(1 + t/beta)
  // K'(t)  = alpha1/(beta - t) - alpha2/(beta + t)
  // K''(t) = alpha1/(beta^2 - 2*beta*t + t^2) + alpha2/(beta^2 + 2*beta*t + t^2)
  // with t between [-beta, beta].
  // Solve the saddlepoint equation K'(s) = y,
  // or K'(beta*2/(1 + exp(-s_raw)) - beta) = y, where s_raw is unbounded and
  // s = beta*2 / (1 + exp(-s_raw)) - beta.
  // Root: s_raw = -log(2.0) - log(alpha1)
  //               + log(sqrt((alpha1 - alpha2 - 2.0*beta*y)^2 + 4.0*alpha1*alpha2)
  //                     - alpha1 + alpha2 + 2.0*beta*y)
  // PDF = 1/sqrt(2*pi*K''(s)) * exp(K(s) - s*y)
  real sqrtterm = sqrt(4.0 * alpha1 * alpha2 + (alpha1 - alpha2 - 2.0 * beta * y)^2.0);
  real term = (-alpha1 + alpha2 + 2.0*beta*y + sqrtterm);
  real term2 = (2.0*alpha1/term + 1.0);
  real rterm2 = 1/term2;
  real k1 = - 2.0*log(1.0/sqrt(2.0)); // = log(2) = 0.6931472
  real tau = pi()*2.0;
  real logalpha1 = log(alpha1);
  real logpdf = -alpha1*(log(2) + log1m(1.0/term2))
                + alpha2*log(term2)
                - k1*alpha2
                + beta*y
                - 2.0*beta*y*rterm2
                - log(tau)
                - log_sum_exp(log(alpha1) - log(2.0*beta^2 - 4.0*beta^2*rterm2 + beta^2.0 + (-beta + 2.0*beta*rterm2)^2.0),
                              log(alpha2) - log(-2.0*beta^2 + 4.0*beta^2*rterm2 + beta^2.0 + (-beta + 2.0*beta*rterm2)^2.0))/2;
  return logpdf;
}
This is a small cog in a large model, and I need to save brute force for later - I will need it!
I know it has to run faster with analytical gradients, but my question is whether that means it’ll take 9 minutes instead of 10, or 2 instead of 10. I’m not expecting an exact answer, just a ballpark estimate to decide whether to go that way.
But you could compare the speed of multi_normal_cholesky_lpdf, for which we have analytical gradients, vs a custom-coded one in Stan. This way you get a “feel” for this… but what that’s worth for your problem, I don’t know.
Another thing which usually buys you a lot is vectorisation. If you find a way to take advantage of shared terms by calculating the log-lik not just for a scalar “y” but for an entire vector of multiple “y”s, then you can gain a lot.
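For example, a hand-coded version along these lines could be timed against the built-in in an otherwise identical model (just a sketch of the standard MVN log density given a Cholesky factor; the name is made up to avoid clashing with the built-in):

functions {
  // hand-coded MVN log density given the Cholesky factor L of the covariance;
  // same math as multi_normal_cholesky_lpdf, but relying on autodiff instead
  // of the built-in's analytic gradients
  real my_multi_normal_cholesky_lpdf(vector y, vector mu, matrix L) {
    int K = num_elements(y);
    vector[K] z = mdivide_left_tri_low(L, y - mu);  // z = L^-1 (y - mu)
    return -0.5 * K * log(2.0 * pi())
           - sum(log(diagonal(L)))                  // -log det(L)
           - 0.5 * dot_self(z);
  }
}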
Are there plans to make it possible for users to add the derivatives in Stan? That would be sooo convenient, since in many cases I can just get the analytical derivatives by copy-pasting from Wolfram Alpha…
While analytical gradients could help, my (limited) experience is that they only help a lot if you actually save computation, i.e. when there are a lot of terms shared between the derivatives w.r.t. the individual variables and the value itself. For some functions the derivative is just a big, long, hairy formula, and in that case I’ve found the gains from moving to C++ underwhelming (N = 1).
Anyway, some things I would totally try before inflicting the pain of moving to C++ on yourself:
I used the unbounded s_raw parameter in my blog only because algebra_solver broke otherwise. Since you’ve found an analytical solution, you don’t need to work with an unbounded parameter - maybe using s directly would lead to less computation?
Vectorization - rewriting the function to real gammadiffs_lpdf(vector y, vector alpha1, vector alpha2, vector beta) should be pretty straightforward - all of the functions you use should happily accept vectors, and you shouldn’t need any loops in your code, except maybe for the log_sum_exp term. (Obviously, if some parameters don’t differ between observations, you can keep them as real.)
Reuse subexpressions - there is a bunch of subexpressions that you use repeatedly (log(alpha1), beta^2, (-beta + 2.0*beta*rterm2)^2.0, …). Compute them just once and store them in local variables for reuse - see the sketch below for both of these points.
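To make those two points concrete, here is a rough, untested sketch of what I have in mind - I’m assuming alpha1, alpha2 and beta are shared across observations (so they stay real), and the _vec name is just to avoid clashing with your scalar version:

functions {
  // Rough, untested sketch: vectorize over y, compute shared subexpressions
  // only once, and loop only for the log_sum_exp term.
  real gammadiffs_vec_lpdf(vector y, real alpha1, real alpha2, real beta) {
    int N = num_elements(y);
    real logalpha1 = log(alpha1);
    real logalpha2 = log(alpha2);
    real beta2 = square(beta);
    vector[N] by = beta * y;
    vector[N] sqrtterm = sqrt(4.0 * alpha1 * alpha2
                              + square(alpha1 - alpha2 - 2.0 * by));
    vector[N] term = -alpha1 + alpha2 + 2.0 * by + sqrtterm;
    vector[N] term2 = 1.0 + 2.0 * alpha1 * inv(term);
    vector[N] rterm2 = inv(term2);
    vector[N] s = 2.0 * beta * rterm2 - beta;  // the saddlepoint itself
    vector[N] s2 = square(s);                  // reused in both K'' terms
    // everything except the final log_sum_exp is vectorized
    vector[N] lp = -alpha1 * (log2() + log1m(rterm2))
                   + alpha2 * log(term2)
                   - log2() * alpha2
                   + by
                   - 2.0 * by .* rterm2
                   - log(2.0 * pi());
    for (n in 1:N)
      lp[n] += -0.5 * log_sum_exp(
                 logalpha1 - log(2.0 * beta2 - 4.0 * beta2 * rterm2[n] + beta2 + s2[n]),
                 logalpha2 - log(-2.0 * beta2 + 4.0 * beta2 * rterm2[n] + beta2 + s2[n]));
    return sum(lp);
  }
}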
Hope those small changes will reduce your runtime substantially
I did that because I had to find roots for the equation where s was bounded (between -beta and beta). I replaced s with a function of an unbounded s_raw so that I was sure my solution was valid. I’m not sure the root of the equation is going to be valid if I ignore that constraint, but maybe it’s worth a try…
Thanks for the tips, I’ll also try to reuse all the expressions I can. (I won’t be able to use vectorization when I use this function for real, since it goes into a mixture.)
My (limited) understanding of the math is that (for non-crazy distributions), the solution is guaranteed to exist on the interval. If you are like me and use Wolfram Alpha to find the solutions, you usually can just add the constraints to the system to be solved and it works :-)
Maybe I am missing something, but you totally should be able to vectorize even when there is a mixture - you just need to return not the sum of all the log-likelihoods, but the vector of individual log-likelihoods. You can then use this vector in the mixture - I would guess this would still give you most of the benefits of vectorization (you basically just skip a final sum call, but vectorize all the previous steps).
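Roughly like this (a sketch only - gammadiffs_pointwise would be the vectorized function from my sketch above with return lp; in place of return sum(lp);, and theta, alpha1, alpha2 etc. are made-up names for a two-component mixture):

model {
  // pointwise log-densities for each mixture component, computed vectorized
  vector[N] lp1 = gammadiffs_pointwise(y, alpha1[1], alpha2[1], beta);
  vector[N] lp2 = gammadiffs_pointwise(y, alpha1[2], alpha2[2], beta);
  // mix per observation with weight theta
  for (n in 1:N)
    target += log_mix(theta, lp1[n], lp2[n]);
}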
For more complicated stuff I use the free online notebooks and the Reduce function - I don’t really understand the Wolfram language, but I’ve found it reasonably easy to apply as long as I don’t need anything fancy; see e.g. the first few lines of my Wolfram Cloud document, where I solve a bunch of equalities and inequalities for rho and k (I then realized I really just need one of the solutions and work with that in the later blocks).