Hey @jsocolar, sorry I forgot to reply to your post. Here it is:
> If the number of parameters grows with N, then a super-linear performance hit is guaranteed regardless of the curvature. The number of partial derivatives you need to evaluate grows linearly with the number of parameters, and the cost of evaluating them grows positively (linearly?) with N.
As I mentioned in my reply to @avehtari, I’m (now) focusing on measuring efficiency in terms of Min ESS / grad. Hence, how long each gradient evaluation takes (i.e. its wall-clock cost) is irrelevant, and so are computational issues such as the speed of memory access. Those only matter when measuring the “real life” efficiency (e.g. Min ESS / second).
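For concreteness, this is the kind of calculation I mean (a minimal sketch; `min_ess_per_grad` is just an illustrative name, and it assumes one gradient evaluation per leapfrog step):

```python
import numpy as np
import arviz as az

def min_ess_per_grad(draws, n_leapfrog):
    """draws: array of shape (chains, iterations, parameters), post warm-up.
    n_leapfrog: array of shape (chains, iterations) with leapfrog steps per iteration."""
    ess = az.ess(az.convert_to_dataset(draws))   # bulk ESS, per parameter
    min_ess = float(ess.to_array().min())        # worst parameter
    n_grad = float(np.sum(n_leapfrog))           # ~one gradient evaluation per leapfrog step
    return min_ess / n_grad
```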
> What matters is not the magnitude of the curvature, or the “number of points with high curvature” but rather the variability in the curvature. For example, consider an isotropic multivariate Gaussian. If the standard deviation is large, the curvature is small. If the standard deviation is small, the curvature is larger. But this difference is easily handled by the adaptation of the diagonal mass matrix. In fact, in this case mass-matrix adaptation wouldn’t even be necessary, because the step size adaptation itself provides an isotropic rescaling of the parameter space.
Thanks very much for this useful info! Helps me understand curvature better.
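Just to check my understanding of that example: for an isotropic Gaussian with standard deviation \sigma, the negative log density is quadratic,

$$
-\log p(\theta) = \frac{1}{2\sigma^2}\,\theta^\top \theta + \text{const},
\qquad
\nabla^2\left(-\log p(\theta)\right) = \frac{1}{\sigma^2} I,
$$

so the curvature is the same everywhere, and a single global rescaling (which the step size adaptation already provides) absorbs it - no mass matrix needed in that case, as you say.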
As N increases, in both Stan and the custom algorithm the number of leapfrog steps (L) goes up and the step size ( \epsilon ) goes down. As I mentioned, I have some parameters (the “nuisance” parameters) whose S.D. does not change as N increases; however, for all of the “main” parameters, the S.D. goes down as N increases. Hence the mass matrix M is crucial, since the parameters have greatly different scales, and the difference between the scales grows as N increases.
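One way I could check this directly is sketched below (a rough sketch, assuming the adapted diagonal inverse metric and the post-warm-up draws can be extracted; the idea is that the inverse metric should roughly match the posterior variances if adaptation kept up):

```python
import numpy as np

def metric_vs_posterior(inv_metric_diag, draws):
    """inv_metric_diag: adapted diagonal of the inverse mass matrix (one entry per parameter).
    draws: array of shape (total draws, parameters), post warm-up."""
    post_var = draws.var(axis=0)
    ratio = np.asarray(inv_metric_diag) / post_var            # ~1 per parameter if adaptation matched the posterior
    scale_spread = np.sqrt(post_var.max() / post_var.min())   # largest-to-smallest posterior S.D. ratio
    return ratio, scale_spread
```

If the ratios drift away from 1 as N grows while the scale spread increases, that would support the adaptation explanation.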
So currently, I’m still leaning towards what I suggested in my other comment, i.e. that Stan (and the custom algorithm) are having a harder time adapting M for larger N. I can’t think of any other reason why the ESS / grad (NOT ESS / second) would decrease as N increases - can you?
I have not got around to calculating the Hessian yet (I’ll probably just do it using finite differences, as a start), so that might shed some more light on what’s going on.
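For what it’s worth, this is the kind of thing I have in mind (a rough sketch of the standard central-difference formula; `log_post` and `eps` are placeholders, and the cost is O(d^2) log-density evaluations, so only feasible for moderate d):

```python
import numpy as np

def fd_hessian(log_post, theta, eps=1e-4):
    """Central-difference Hessian of log_post (a function R^d -> R) at theta."""
    d = len(theta)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i = np.zeros(d); e_i[i] = eps
            e_j = np.zeros(d); e_j[j] = eps
            H[i, j] = (log_post(theta + e_i + e_j) - log_post(theta + e_i - e_j)
                       - log_post(theta - e_i + e_j) + log_post(theta - e_i - e_j)) / (4 * eps**2)
    return 0.5 * (H + H.T)   # symmetrise to reduce numerical noise
```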
Also, I checked the correlations between the parameters using the traces and they’re all pretty low.
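(The check was roughly along these lines, a quick sketch assuming `draws` is a (draws × parameters) array of post-warm-up samples:)

```python
import numpy as np

def max_pairwise_corr(draws):
    """draws: array of shape (total draws, parameters), post warm-up."""
    corr = np.corrcoef(draws, rowvar=False)               # parameter-by-parameter correlations
    return np.max(np.abs(corr - np.eye(corr.shape[0])))   # largest absolute off-diagonal entry
```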