Definition / example of "strong curvature"?

In the literature on algorithms, I often see people refer to some posterior distributions as having “strong” or “high” curvature.

It would be greatly appreciated if somebody could give an example of this and/or explain what it means.

I *think* it is the same thing as saying that the parameters are highly correlated, but I’m not 100% sure. Also, do models tend to increase in curvature when the number of observations N increases?

3 Likes

Big Hessian, i.e. large \det(H).

For example, even with a single parameter (so there is nothing to be correlated with), a posterior can be highly curved if the second derivative is big.
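
As a rough sketch of that one-parameter case (plain NumPy, nothing Stan-specific): the second derivative of a Gaussian log density is -1/\sigma^2, so a narrow posterior is strongly curved even though there is nothing for the parameter to be correlated with.

```python
import numpy as np

def log_density(theta, sigma):
    # Zero-mean Gaussian log density, up to an additive constant.
    return -0.5 * theta**2 / sigma**2

def second_derivative(theta, sigma, h=1e-4):
    # Central finite-difference estimate of the curvature of log p at theta.
    return (log_density(theta + h, sigma)
            - 2.0 * log_density(theta, sigma)
            + log_density(theta - h, sigma)) / h**2

# Wide posterior (sigma = 10): nearly flat. Narrow posterior (sigma = 0.01):
# strongly curved. The analytic value is -1/sigma^2 in both cases.
for sigma in (10.0, 0.01):
    print(sigma, second_derivative(0.0, sigma))
```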

4 Likes

Just to flesh out the correct answer from @laifuthegreat, the posterior distribution can be expressed as a probability density function over the (unconstrained) parameters declared in a Stan model. We care particularly about the logarithm of the posterior density function. When we talk about “curvature” we are talking about how “curved” this function is. Visualization can be a little bit tricky in high dimensions (and posteriors are often high-dimensional objects), but the notion of a function’s second derivative as a measure of how “curved” it is readily generalizes.
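
To sketch how the one-dimensional picture generalizes (my notation): write \ell(\theta) for the log posterior density. The curvature along a unit direction v is the quadratic form in the Hessian,

$$
\left.\frac{d^2}{dt^2}\,\ell(\theta + t v)\right|_{t=0} = v^\top H(\theta)\, v,
\qquad
H_{ij}(\theta) = \frac{\partial^2 \ell(\theta)}{\partial \theta_i \,\partial \theta_j},
$$

so the eigenvalues of H(\theta) play the role of the second derivative from the one-dimensional case, and “strong curvature” corresponds to eigenvalues that are large in magnitude and/or change rapidly as \theta moves around the parameter space.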

The reason this matters is that, as the integrator moves around the posterior, it uses gradient information to understand which way to point. If the posterior is relatively “straight” (so the gradient doesn’t change quickly), then the integrator can take long steps without worrying about how the gradient changes at shorter length scales. If the posterior is “curved”, then short steps are required to stay in a regime where the gradient is approximately constant over the length of a step. If the curvature is variable, then there is a risk that Stan’s algorithm could adapt to the curved parts and then waste huge amounts of time with tiny steps through the straight parts, or alternatively that it could adapt to the straight parts and then diverge when it hits a curved part.
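
To make the “long steps vs. short steps” picture concrete, here is a bare-bones leapfrog integrator (a generic Python sketch, not Stan’s actual implementation): the only local information it uses is the gradient, so the step size eps has to be small relative to the length scale over which that gradient changes.

```python
import numpy as np

def leapfrog(theta, momentum, grad_log_p, eps, n_steps):
    """Simulate a Hamiltonian trajectory with step size eps.

    grad_log_p(theta) returns the gradient of the log posterior density.
    Each update implicitly treats the gradient as roughly constant over a
    step of length eps, which is exactly what high curvature breaks.
    """
    g = grad_log_p(theta)
    for _ in range(n_steps):
        momentum = momentum + 0.5 * eps * g   # half step on the momentum
        theta = theta + eps * momentum        # full step on the position
        g = grad_log_p(theta)
        momentum = momentum + 0.5 * eps * g   # half step on the momentum
    return theta, momentum
```

If eps is too large for the most curved region the trajectory’s energy error blows up (a divergence); if eps is chosen small enough for that region, it is wastefully small in the flat regions, which is the adaptation tension described above.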

You’re right that correlated posteriors can induce strong and variable curvature (e.g. near the tip of a “ridge” in the posterior). However, the forms of variable curvature that really give Stan trouble tend to be more extreme than what is produced by correlated parameters, as long as the correlation isn’t associated with a seriously degenerate posterior (see Identity Crisis).

Edit: Note that the curvature is a property of the probability density function, which itself depends on the choice of parameterization. Thus, the solution to problems with curvature is often to reparameterize such that the posterior distribution gets encoded in a probability density function with less variable/extreme curvature.
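
A standard illustration of this (my example, written as plain log density functions rather than a Stan model) is a funnel-shaped hierarchical posterior: in the centered form the curvature in theta varies wildly with the scale parameter, while the non-centered form has roughly constant curvature.

```python
import numpy as np

def log_p_centered(log_tau, theta):
    # Centered: log_tau ~ normal(0, 3), theta ~ normal(0, exp(log_tau)).
    # The curvature in theta is exp(-2 * log_tau), which explodes as log_tau -> -inf.
    tau = np.exp(log_tau)
    return (-0.5 * (log_tau / 3.0) ** 2
            - np.log(tau)                 # normalization of normal(0, tau)
            - 0.5 * (theta / tau) ** 2)

def log_p_noncentered(log_tau, theta_raw):
    # Non-centered: theta = exp(log_tau) * theta_raw with theta_raw ~ normal(0, 1).
    # Both coordinates are standard normals, so the curvature is tame everywhere.
    return -0.5 * (log_tau / 3.0) ** 2 - 0.5 * theta_raw ** 2
```

Both functions describe the same model; only the parameterization, and hence the density the sampler actually sees, differs.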

6 Likes

Thanks @jsocolar and @laifuthegreat !

So the curvature for a given parameter can vary during sampling? e.g. from one iteration to the next?

Is there any way for me to determine how curved my posterior is without having to compute the Hessian?

Basically I am working on efficient algorithms for the multivariate probit model and have implemented an algorithm which greatly outperforms Stan (over 20-fold more efficient), based on an adaptive randomized HMC algorithm called “SNAPER-HMC” (Sountsov et al., 2022). A lot of the speed-up is due to using manual gradients rather than autodiff, not just the MCMC algorithm itself. However, despite being much more efficient than Stan, it still shares the same issue with scalability, i.e. the algorithm’s efficiency does not scale linearly with N.

So for example, going from an N = 2000 to an N = 8000 dataset gives around 6-fold (rather than 4-fold) less ESS/second (or ESS/grad) with either Stan or the custom algorithm, and going from N = 8000 to N = 32,000 gives around 30-fold (!!) less efficiency rather than 4-fold less. So despite being much more efficient, like Stan it still quickly becomes impractical for very large N (it’s just that such an N will end up being a lot bigger than what you can do with Stan).

The algorithm is Euclidean (just like NUTS), so I am wondering whether a manifold-based method (e.g. with a diagonal metric based on the diagonal of the negative Hessian, or a faster approximation such as the Monge patch metric) would solve the scalability issue and make it scale linearly with N. However, before attempting to implement such an algorithm (which would be a huge amount of work, especially deriving second-order derivatives, and depending on the specific metric used sometimes even third-order derivatives are needed), I would want to make sure it would actually be worth it. Hence I’m wondering if there is a way I can “check” how curved my posterior is (with a strongly curved posterior suggesting possible benefit from Riemannian manifold methods) without having to compute the Hessian. Is this possible?

Thanks again for the explanation!

1 Like

Yes, in the sense that the curvature varies across parameter space, and different iterations land at different locations in parameter space. But I think what is of interest here is not just the curvature at the locations where each iteration lands, but rather the curvature all along the trajectories traced out by the numerical integrators. It is more precise to think of the curvature as a property of the PDF that changes across space, rather than as something that changes from iteration to iteration (after all, the PDF itself does not change from iteration to iteration).

The curvature is so closely related to the Hessian that I think if there’s another way it will simply be a good way to approximate the Hessian. It sounds like you already know quite a bit more than I about possible approximations here.

2 Likes

If you don’t want to work out a matrix of second derivatives, the fastest way (in terms of implementation) to approximate the Hessian is to just use finite differences on the first derivatives. The tricky thing in higher dimensions will be to pick out the points in space to examine.
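
For concreteness, a minimal version of that finite-difference idea might look like this (generic NumPy sketch; grad_log_p stands in for whatever gradient function you already have):

```python
import numpy as np

def hessian_from_gradients(grad_log_p, theta, h=1e-5):
    """Approximate the Hessian of the log density at theta by taking
    central finite differences of the gradient function grad_log_p."""
    d = theta.size
    H = np.empty((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = h
        H[:, j] = (grad_log_p(theta + e) - grad_log_p(theta - e)) / (2.0 * h)
    return 0.5 * (H + H.T)  # symmetrize away finite-difference noise

# Evaluate at a handful of posterior draws (the "which points to examine"
# problem mentioned above) and compare, e.g., the eigenvalue spectra across
# those points to see how variable the curvature is:
# eigs = np.linalg.eigvalsh(hessian_from_gradients(grad_log_p, draw))
```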

However, if the dimensionality of your parameter space is fixed or growing more slowly than N, but the performance is degrading super-linearly with N, something is more fundamentally wrong. For any fixed set of parameters, as N grows the region of high probability should be compressing and getting easier to sample.

1 Like

The dimensionality of the parameter space grows linearly with N because there are nuisance parameters / latent variables for each individual. However, the number of “main” model parameters (coefficients, correlations) stays constant as N grows.

What if there are more points in parameter space with high curvature (a big Hessian) as N grows? (Are there cases where this might happen?) Then performance might degrade super-linearly with N, since as N increases so does the Hessian. @jsocolar @laifuthegreat

If the number of parameters grows with N, then a super-linear performance hit is guaranteed regardless of the curvature. The number of partial derivatives you need to evaluate grows linearly with the number of parameters, and the cost of evaluating them grows positively (linearly?) with N.

What matters is not the magnitude of the curvature, or the “number of points with high curvature” but rather the variability in the curvature. For example, consider an isotropic multivariate Gaussian. If the standard deviation is large, the curvature is small. If the standard deviation is small, the curvature is larger. But this difference is easily handled by the adaptation of the diagonal mass matrix. In fact, in this case mass-matrix adaptation wouldn’t even be necessary, because the step size adaptation itself provides an isotropic rescaling of the parameter space. Very heuristically, what matters is the maximum curvature within the “typical set” (after rescaling by the adapted mass matrix) compared to the maximum distance between points within the “typical set”. That is, can the sampler traverse the typical set in a reasonable number of steps, while also taking steps small enough to handle the highest curvature that it is likely to encounter during sampling?
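
Spelling the isotropic example out (my notation): for an isotropic Gaussian the Hessian of the negative log density is constant,

$$
-\log p(\theta) = \frac{1}{2\sigma^2}\,\lVert \theta - \mu \rVert^2 + \text{const},
\qquad
-\nabla^2 \log p(\theta) = \frac{1}{\sigma^2} I,
$$

so however large 1/\sigma^2 is, a single global rescaling (via the step size, or a mass matrix proportional to \sigma^{-2} I) absorbs it completely. Trouble only starts when the curvature differs across directions or across regions of the parameter space.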

Note also that when you speak of “the number of points with high curvature”, it sounds like you’re imagining the curvature being expressed over a space defined by the response variables. But this isn’t true. The data points do not reside in the space where we are interested in the curvature; the parameter vector does.

1 Like

Part of the decrease in efficiency is caused by memory access getting slower with more data; see a detailed explanation in The Myth of RAM.

2 Likes

@jsocolar @laifuthegreat I think I might have worked out why the ESS / grad (this adjusts for memory access getting slower, so it should remain roughly constant as N increases!) isn’t remaining constant as N increases.

The variances of the main (i.e., non-nuisance - so coefficients and correlations) model parameters decrease as N increases. However, the variances of the nuisance parameters do not (they stay the same).

So, even if we empirically estimate the metric (M) during the burn-in to adjust for these varying scales, we start out with an inferior M (a diagonal matrix with 1’s on the diagonal) and hence we begin with poor sampling. In other words, as N grows, so does the difference between the variances of the main model parameters and the variances of the nuisance parameters, which means worse initial sampling, which in turn means a poorer M gets estimated during burn-in (estimating M is a chicken-and-egg situation: to estimate a good M you need good initial sampling, and vice versa).

So, I might try experimenting with different initial M’s and try to figure out some “default” initial M’s to use for various N, and see whether this makes the efficiency scale more linearly with N. If that doesn’t work, I’ll look at estimating the Hessian to directly explore the curvature and, depending on the results, look at implementing a Riemannian manifold-based sampler.
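
For concreteness, the kind of default initial M I have in mind is something like the sketch below. This is purely a heuristic of mine: it assumes the main-parameter posterior SDs shrink like 1/sqrt(N) while the nuisance-parameter SDs stay roughly constant, and fills the diagonal of the inverse metric with the corresponding guessed variances instead of starting warmup from the identity.

```python
import numpy as np

def initial_inv_metric_diag(n_main, n_nuisance, N):
    """Heuristic initial diagonal of the inverse metric (guessed posterior
    variances), rather than starting adaptation from all 1's.

    Assumes main-parameter SDs shrink like 1/sqrt(N) while nuisance-parameter
    SDs stay O(1); both assumptions would need checking against pilot runs."""
    main_var = np.full(n_main, 1.0 / N)        # (1 / sqrt(N))^2
    nuisance_var = np.full(n_nuisance, 1.0)    # roughly constant scale
    return np.concatenate([main_var, nuisance_var])

# e.g. with n_main fixed and one nuisance parameter per individual:
# inv_metric = initial_inv_metric_diag(n_main, N, N)
```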

@avehtari I’m looking at ESS/grad which will take this into account - this should remain constant regardless of slower memory access.

ESS / sec will, however, not stay constant, since the time taken to complete a single gradient evaluation grows non-linearly with N due to memory access getting slower and slower, as you point out.

edit: I should have just used ESS / grad in my comment before to avoid confusion; as @avehtari points out, the raw efficiency (i.e. per unit time, such as ESS / sec) won’t scale linearly with N due to additional overhead / slower memory access, even if ESS / grad stays constant.

1 Like

Hey @jsocolar, sorry I forgot to reply to your post. Here it is:

If the number of parameters grows with N, then a super-linear performance hit is guaranteed regardless of the curvature. The number of partial derivatives you need to evaluate grows linearly with the number of parameters, and the cost of evaluating them grows positively (linearly?) with N.

As I mentioned in my reply to @avehtari, I’m (now) focusing on measuring efficiency in terms of the Min ESS / grad. Hence, how long each gradient evaluation takes (i.e. the cost of each gradient evaluation) is irrelevant and so are computational issues such as the speed of memory access. Those only matter when measuring the “real life” efficiency (e.g., Min ESS / second).

What matters is not the magnitude of the curvature, or the “number of points with high curvature” but rather the variability in the curvature. For example, consider an isotropic multivariate Gaussian. If the standard deviation is large, the curvature is small. If the standard deviation is small, the curvature is larger. But this difference is easily handled by the adaptation of the diagonal mass matrix. In fact, in this case mass-matrix adaptation wouldn’t even be necessary, because the step size adaptation itself provides an isotropic rescaling of the parameter space.

Thanks very much for this useful info! Helps me understand curvature better.

As N increases, in both Stan and the custom algorithm L (the number of leapfrog steps) goes up and the step size ( \epsilon ) goes down. As I mentioned, I have some parameters (the “nuisance” parameters) for which the SD does not change as N increases; however, for all of the “main” parameters, the SD goes down as N increases. Hence the mass matrix M is crucial, since the parameters have greatly different scales, with the difference between the scales becoming larger as N increases.

So currently, I’m still leaning towards what I suggested in my other comment, i.e. that Stan (and the custom algorithm) are having a harder time adapting M for larger N. I can’t think of any other reason why the ESS / grad (NOT ESS / second) would decrease as N increases - can you?

I have not got around to calculating the Hessian yet (I’ll probably just do it using finite differences, as a start), so that might shed some more light on what’s going on.

Also, I checked the correlations between the parameters using the traces and they’re all pretty low.

1 Like