Fine tuning for polynomial Posteriors

Hello,

I just started using Stan, and this is my first post here. I am looking for a way to accelerate the runtime of Stan for the kind of problem I have.

My workflow consists of sampling many Log posteriors, each one of which has a polynomial form. I tried a toy model to get a feeling of how things would work out. My model is just a 1-D polynomial, for which I wrote the code:

parameters {
    real x; // Parameter to sample
}
model {
    target += -pow(x,4) +pow(x,2);
}

When I run the stan executable with RunStan I got the usual messages about sampler options (I am just running everything in the default) and what surprised me was the gradient evaluation time which printed like:

Gradient evaluation took 4e-06 seconds
1000 transitions using 10 leapfrog steps per transition would take \
0.04 seconds.
Adjust your expectations accordingly!

I thought it should be faster since autodiff (according to ChatGPT) should calculate the gradient in a time comparable to the time of calculating the function, which I estimated to be around 10^-8 seconds (I must say that I don’t do C-coding, so I calculated this number by compiling the function to C using Mathematica and then evaluated the timing there).

So my question is: Am I doing something wrong? Is there a way to make this faster, considering you will be dealing specifically with polynomials ? (generally high dimensions ~ 10, and high orders ~ 10). I also got a huge compilation time, around 0.2 seconds, so I think I must be doing something wrong there as well.

Hi, @Felipe_Augusto and welcome to the Stan forums. Sorry it took a while to get a response out, but a lot of the team’s been out for the holidays.

I don’t trust the timing here, but there is constant overhead from the Stan function calls that you won’t see in straight C calls.

The time to calculate gradients is about 5 times the time to calculate a function—it’s a constant cost multiple. It varies a bit depending on which gradient. You have to be very careful doing micro-benchmarking of C because a lot of the actual work can get optimized away, though 10^-8s isn’t unreasonable for this function. It’ll almost certainly be more efficient if you do it this way:

model {
  real x_sq = square(x);
  real x_sq_sq = square(x_sq);
  target += x_sq - x_sq_sq;
}

Similarly, if you have higher order ones, just put this in a loop.