Gradient evaluation time differs across chains

Dear Stan users,

I have a somewhat complex Gaussian process model, and the gradient evaluation time differs slightly across chains. Does anyone know what causes the difference? Below is a single run in rstan with 4 chains; I have also seen >0.006 seconds for the same model.

Gradient evaluation took 0.004218 seconds
1000 transitions using 10 leapfrog steps per transition would take 42.18 seconds.
Gradient evaluation took 0.004234 seconds
1000 transitions using 10 leapfrog steps per transition would take 42.34 seconds.
Gradient evaluation took 0.003572 seconds
1000 transitions using 10 leapfrog steps per transition would take 35.72 seconds.
Gradient evaluation took 0.003702 seconds
1000 transitions using 10 leapfrog steps per transition would take 37.02 seconds.

Any help is appreciated, thanks.

Computation timing is a stochastic measurement that will vary from hardware to hardware and run to run, based on the exact environment in which your code is executed. This variation becomes more pronounced the smaller the time interval you wish to measure, so it is no surprise that you see this much variation when trying to time down to the millisecond.
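
As a quick illustration (a minimal sketch in plain R, nothing Stan-specific): timing the exact same computation repeatedly on the same machine already gives a spread of values, purely from the execution environment.

```r
# Time one fixed computation many times; the spread comes entirely from the
# environment (scheduler, cache state, CPU frequency scaling, ...), since the
# work being done is identical every time.
x <- matrix(rnorm(500 * 500), 500, 500)
times <- replicate(20, system.time(solve(x))["elapsed"])
summary(times)  # min, median, and max will differ across the 20 runs
```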


Those actually look quite similar. I sometimes get things like this:

Gradient evaluation took 0.001223 seconds
1000 transitions using 10 leapfrog steps per transition would take 12.23 seconds.
Adjust your expectations accordingly!


Iteration:   1 / 400 [  0%]  (Warmup)
Iteration:  25 / 400 [  6%]  (Warmup)
Iteration:  50 / 400 [ 12%]  (Warmup)
Iteration:  75 / 400 [ 18%]  (Warmup)

SAMPLING FOR MODEL '3a010658fb94b613e3bcd4bd00c2cfe2' NOW (CHAIN 2).

Gradient evaluation took 0.017012 seconds
1000 transitions using 10 leapfrog steps per transition would take 170.12 seconds.
Adjust your expectations accordingly!

One thing that I have done to evaluate the performance of a model is to use `system.time`, `replicate`, and the `log_prob` function with random starting parameters, which gives much more consistent estimates of the running time.
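
Here is a rough sketch of that idea, assuming you already have a fitted stanfit object named `fit` (the name is just a placeholder); I use `grad_log_prob` rather than `log_prob` since the thread is about gradient timings:

```r
library(rstan)

# Assumes `fit` is an existing stanfit object for the model of interest.
# Draw random unconstrained parameter values and time repeated gradient
# evaluations of the log posterior; averaging over many calls smooths out
# the run-to-run noise in any single timing.
n_pars <- get_num_upars(fit)

elapsed <- system.time(
  replicate(1000, {
    theta <- rnorm(n_pars)
    grad_log_prob(fit, upars = theta)
  })
)["elapsed"]

elapsed / 1000  # average time per gradient evaluation, in seconds
```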

Thanks, I see. It's just that small variations can lead to big differences when the number of leapfrog steps is large. I was trying to see if I could reduce the time a bit by finding the reason for the difference in gradient evaluation time. Apparently not :(

Thanks for sharing your approaches to estimating the running time, I'll try those out.

I don’t think it has anything to do with the leapfrog steps; it's just that computational timing can be stochastic. For example, the kernel may do a context switch in the middle of an evaluation, some floating point functions can take a variable number of clock cycles depending on the values passed in, and some functions are approximate, computing up to a specified tolerance, so the number of iterations can vary with the inputs.

The number of leapfrog steps will also vary based on where you are in the posterior. Time to evaluate each log density should be consistent, but it’s subject to communication load and other processor load inside the computer (as @betanalpha already noted).

In general, yes, the dynamic HMC algorithm that drives Stan uses a different number of leapfrog steps depending on where in parameter space the sampler is. For the time extrapolation being discussed here, however, a constant number of leapfrog steps is used, so the suspects are communication and processor load.
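
If you want to see how much the actual number of leapfrog steps varies over a run (a small sketch, again assuming a fitted stanfit object named `fit`), the per-iteration sampler diagnostics include the leapfrog count:

```r
library(rstan)

# Assumes `fit` is an existing stanfit object.
# get_sampler_params() returns one matrix of per-iteration diagnostics per
# chain; the n_leapfrog__ column is the number of leapfrog steps taken in
# each post-warmup iteration.
sp <- get_sampler_params(fit, inc_warmup = FALSE)
sapply(sp, function(chain) summary(chain[, "n_leapfrog__"]))
```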