Definition / example of "strong curvature"?

In the literature on algorithms, I often see people refer to some posterior distributions as having “strong” or “high” curvature.

It would be greatly appreciated if somebody could give an example of this and/or explain what it means.

I *think* it is the same thing as saying that the parameters are highly correlated, but I’m not 100% sure. Also, do models tend to increase in curvature when the number of observations N increases?

3 Likes

Big Hessian, i.e. large \det(H).

For example, even with a single parameter (so there is nothing to be correlated with), a posterior can be highly curved if the second derivative is big.
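
As a rough sketch of that one-parameter case (plain NumPy, nothing Stan-specific): the second derivative of a Gaussian log density is -1/\sigma^2, so a narrow posterior is strongly curved even though there is nothing for the parameter to be correlated with.

```python
import numpy as np

def log_density(theta, sigma):
    # Zero-mean Gaussian log density, up to an additive constant.
    return -0.5 * theta**2 / sigma**2

def second_derivative(theta, sigma, h=1e-4):
    # Central finite-difference estimate of the curvature of log p at theta.
    return (log_density(theta + h, sigma)
            - 2.0 * log_density(theta, sigma)
            + log_density(theta - h, sigma)) / h**2

# Wide posterior (sigma = 10): nearly flat. Narrow posterior (sigma = 0.01):
# strongly curved. The analytic value is -1/sigma^2 in both cases.
for sigma in (10.0, 0.01):
    print(sigma, second_derivative(0.0, sigma))
```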

4 Likes

Just to flesh out the correct answer from @laifuthegreat, the posterior distribution can be expressed as a probability density function over the (unconstrained) parameters declared in a Stan model. We care particularly about the logarithm of the posterior density function. When we talk about “curvature” we are talking about how “curved” this function is. Visualization can be a little bit tricky in high dimensions (and posteriors are often high-dimensional objects), but the notion of a function’s second derivative as a measure of how “curved” it is readily generalizes.
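
To sketch how the one-dimensional picture generalizes (my notation): write \ell(\theta) for the log posterior density. The curvature along a unit direction v is the quadratic form in the Hessian,

$$
\left.\frac{d^2}{dt^2}\,\ell(\theta + t v)\right|_{t=0} = v^\top H(\theta)\, v,
\qquad
H_{ij}(\theta) = \frac{\partial^2 \ell(\theta)}{\partial \theta_i \,\partial \theta_j},
$$

so the eigenvalues of H(\theta) play the role of the second derivative from the one-dimensional case, and “strong curvature” corresponds to eigenvalues that are large in magnitude and/or change rapidly as \theta moves around the parameter space.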

The reason this matters is that, as the integrator moves around the posterior, it uses gradient information to understand which way to point. If the posterior is relatively “straight” (so the gradient doesn’t change quickly), then the integrator can take long steps without worrying about how the gradient changes at shorter length scales. If the posterior is “curved”, then short steps are required to stay in a regime where the gradient is approximately constant over the length of a step. If the curvature is variable, then there is a risk that Stan’s algorithm could adapt to the curved parts and then waste huge amounts of time with tiny steps through the straight parts, or alternatively that it could adapt to the straight parts and then diverge when it hits a curved part.
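
To make the “long steps vs. short steps” picture concrete, here is a bare-bones leapfrog integrator (a generic Python sketch, not Stan’s actual implementation): the only local information it uses is the gradient, so the step size eps has to be small relative to the length scale over which that gradient changes.

```python
import numpy as np

def leapfrog(theta, momentum, grad_log_p, eps, n_steps):
    """Simulate a Hamiltonian trajectory with step size eps.

    grad_log_p(theta) returns the gradient of the log posterior density.
    Each update implicitly treats the gradient as roughly constant over a
    step of length eps, which is exactly what high curvature breaks.
    """
    g = grad_log_p(theta)
    for _ in range(n_steps):
        momentum = momentum + 0.5 * eps * g   # half step on the momentum
        theta = theta + eps * momentum        # full step on the position
        g = grad_log_p(theta)
        momentum = momentum + 0.5 * eps * g   # half step on the momentum
    return theta, momentum
```

If eps is too large for the most curved region the trajectory’s energy error blows up (a divergence); if eps is chosen small enough for that region, it is wastefully small in the flat regions, which is the adaptation tension described above.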

You’re right that correlated posteriors can induce strong and variable curvature (e.g. near the tip of a “ridge” in the posterior). However, the forms of variable curvature that really give Stan trouble tend to be more extreme than what is produced by correlated parameters, as long as the correlation isn’t associated with a seriously degenerate posterior (see Identity Crisis).

Edit: Note that the curvature is a property of the probability density function, which itself depends on the choice of parameterization. Thus, the solution to problems with curvature is often to reparameterize such that the posterior distribution gets encoded in a probability density function with less variable/extreme curvature.
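
A standard illustration of this (my example, written as plain log density functions rather than a Stan model) is a funnel-shaped hierarchical posterior: in the centered form the curvature in theta varies wildly with the scale parameter, while the non-centered form has roughly constant curvature.

```python
import numpy as np

def log_p_centered(log_tau, theta):
    # Centered: log_tau ~ normal(0, 3), theta ~ normal(0, exp(log_tau)).
    # The curvature in theta is exp(-2 * log_tau), which explodes as log_tau -> -inf.
    tau = np.exp(log_tau)
    return (-0.5 * (log_tau / 3.0) ** 2
            - np.log(tau)                 # normalization of normal(0, tau)
            - 0.5 * (theta / tau) ** 2)

def log_p_noncentered(log_tau, theta_raw):
    # Non-centered: theta = exp(log_tau) * theta_raw with theta_raw ~ normal(0, 1).
    # Both coordinates are standard normals, so the curvature is tame everywhere.
    return -0.5 * (log_tau / 3.0) ** 2 - 0.5 * theta_raw ** 2
```

Both functions describe the same model; only the parameterization, and hence the density the sampler actually sees, differs.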

6 Likes

Thanks @jsocolar and @laifuthegreat !

So the curvature for a given parameter can vary during sampling? e.g. from one iteration to the next?

Is there any way for me to determine how curved my posterior is without having to compute the Hessian?

Basically I am working on efficient algorithms for the multivariate probit model and have implemented an algorithm which greatly outperforms Stan (over 20-fold more efficient), based on an adaptive randomized HMC algorithm called “SNAPER-HMC” (Sountsov et al., 2022). A lot of the speed-up is due to using manual gradients rather than autodiff, not just the MCMC algorithm itself. However, despite being much more efficient than Stan, it still shares the same issue with scalability, i.e. the algorithm’s efficiency does not scale linearly with N.

So for example, going from an N = 2000 to an N = 8000 dataset gives around 6-fold (rather than 4-fold) less ESS/second (or ESS/grad) with either Stan or the custom algorithm, and going from N = 8000 to N = 32,000 gives around 30-fold (!!) less efficiency rather than 4-fold less. So despite being much more efficient, like Stan it still quickly becomes impractical for very large N (it’s just that such an N will end up being a lot bigger than what you can do with Stan).

The algorithm is Euclidean (just like NUTS), so I am wondering whether a manifold-based method (e.g. with a diagonal metric based on the diagonal of the negative Hessian, or a faster approximation such as the Monge patch metric) would solve the scalability issue and make it scale linearly with N. However, before attempting to implement such an algorithm (which would be a huge amount of work, especially deriving second-order derivatives, and depending on the specific metric used sometimes even third-order derivatives are needed), I would want to make sure it would actually be worth it. Hence I’m wondering if there is a way I can “check” how curved my posterior is (with a strongly curved posterior suggesting possible benefit from Riemannian manifold methods) without having to compute the Hessian. Is this possible?

Thanks again for the explanation!

1 Like

Yes, in the sense that the curvature varies across parameter space, and different iterations land at different locations in parameter space. But I think what is of interest here is not just the curvature at the locations where each iteration lands, but rather the curvature all along the trajectories traced out by the numerical integrators. It is more precise to think of the curvature as a property of the PDF that changes across space, rather than as something that changes from iteration to iteration (after all, the PDF itself does not change from iteration to iteration).

The curvature is so closely related to the Hessian that I think if there’s another way it will simply be a good way to approximate the Hessian. It sounds like you already know quite a bit more than I about possible approximations here.

2 Likes

If you don’t want to work out a matrix of second derivatives, the fastest way (in terms of implementation) to approximate the Hessian is to just use finite differences on the first derivatives. The tricky thing in higher dimensions will be to pick out the points in space to examine.
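
For concreteness, a minimal version of that finite-difference idea might look like this (generic NumPy sketch; grad_log_p stands in for whatever gradient function you already have):

```python
import numpy as np

def hessian_from_gradients(grad_log_p, theta, h=1e-5):
    """Approximate the Hessian of the log density at theta by taking
    central finite differences of the gradient function grad_log_p."""
    d = theta.size
    H = np.empty((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = h
        H[:, j] = (grad_log_p(theta + e) - grad_log_p(theta - e)) / (2.0 * h)
    return 0.5 * (H + H.T)  # symmetrize away finite-difference noise

# Evaluate at a handful of posterior draws (the "which points to examine"
# problem mentioned above) and compare, e.g., the eigenvalue spectra across
# those points to see how variable the curvature is:
# eigs = np.linalg.eigvalsh(hessian_from_gradients(grad_log_p, draw))
```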

However, if the dimensionality of your parameter space is fixed or growing more slowly than N, but the performance is degrading super-linearly with N, something is more fundamentally wrong. For any fixed set of parameters, as N grows the region of high probability should be compressing and getting easier to sample.

1 Like

The dimensionality of the parameter space grows linearly with N because there are nuisance parameters / latent variables for each individual. However, the number of “main” model parameters (coefficients, correlations) stays constant as N grows.

What if there are more points in parameter space with high curvature (a big Hessian) as N grows? (Are there cases where this might happen?) Then performance might degrade super-linearly with N, since as N increases so does the Hessian. @jsocolar @laifuthegreat

If the number of parameters grows with N, then a super-linear performance hit is guaranteed regardless of the curvature. The number of partial derivatives you need to evaluate grows linearly with the number of parameters, and the cost of evaluating them grows positively (linearly?) with N.

What matters is not the magnitude of the curvature, or the “number of points with high curvature” but rather the variability in the curvature. For example, consider an isotropic multivariate Gaussian. If the standard deviation is large, the curvature is small. If the standard deviation is small, the curvature is larger. But this difference is easily handled by the adaptation of the diagonal mass matrix. In fact, in this case mass-matrix adaptation wouldn’t even be necessary, because the step size adaptation itself provides an isotropic rescaling of the parameter space. Very heuristically, what matters is the maximum curvature within the “typical set” (after rescaling by the adapted mass matrix) compared to the maximum distance between points within the “typical set”. That is, can the sampler traverse the typical set in a reasonable number of steps, while also taking steps small enough to handle the highest curvature that it is likely to encounter during sampling?
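
Spelling the isotropic example out (my notation): for an isotropic Gaussian the Hessian of the negative log density is constant,

$$
-\log p(\theta) = \frac{1}{2\sigma^2}\,\lVert \theta - \mu \rVert^2 + \text{const},
\qquad
-\nabla^2 \log p(\theta) = \frac{1}{\sigma^2} I,
$$

so however large 1/\sigma^2 is, a single global rescaling (via the step size, or a mass matrix proportional to \sigma^{-2} I) absorbs it completely. Trouble only starts when the curvature differs across directions or across regions of the parameter space.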

Note also that when you speak of “the number of points with high curvature”, it sounds like you’re imagining the curvature being expressed over a space defined by the response variables. But this isn’t true. The data points do not reside in the space where we are interested in the curvature; the parameter vector does.

1 Like

Part of the decrease in efficiency is caused by memory access getting slower with more data; see a detailed explanation in The Myth of RAM.

2 Likes

@jsocolar @laifuthegreat I think I might have worked out why the ESS / grad (this adjusts for memory access getting slower, so it should remain roughly constant as N increases!) isn’t remaining constant as N increases.

The variances of the main (i.e., non-nuisance - so coefficients and correlations) model parameters decrease as N increases. However, the variances of the nuisance parameters do not (they stay the same).

So, even if we empirically estimate the metric (M) during the burn-in to adjust for these varying scales, we start out with an inferior M (a diagonal matrix with 1’s on the diagonal) and hence we begin with poor sampling. In other words, as N grows, so does the difference between the variances of the main model parameters and the variances of the nuisance parameters, which means worse initial sampling, which in turn means a poorer M gets estimated during burn-in (estimating M is a chicken-and-egg situation: to estimate a good M you need good initial sampling, and vice versa).

So, I might try experimenting with different initial M’s and try to figure out some “default” initial M’s to use for various N, and see whether this makes the efficiency scale more linearly with N. If that doesn’t work, I’ll look at estimating the Hessian to directly explore the curvature and, depending on the results, look at implementing a Riemannian manifold-based sampler.
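
For concreteness, the kind of default initial M I have in mind is something like the sketch below. This is purely a heuristic of mine: it assumes the main-parameter posterior SDs shrink like 1/sqrt(N) while the nuisance-parameter SDs stay roughly constant, and fills the diagonal of the inverse metric with the corresponding guessed variances instead of starting warmup from the identity.

```python
import numpy as np

def initial_inv_metric_diag(n_main, n_nuisance, N):
    """Heuristic initial diagonal of the inverse metric (guessed posterior
    variances), rather than starting adaptation from all 1's.

    Assumes main-parameter SDs shrink like 1/sqrt(N) while nuisance-parameter
    SDs stay O(1); both assumptions would need checking against pilot runs."""
    main_var = np.full(n_main, 1.0 / N)        # (1 / sqrt(N))^2
    nuisance_var = np.full(n_nuisance, 1.0)    # roughly constant scale
    return np.concatenate([main_var, nuisance_var])

# e.g. with n_main fixed and one nuisance parameter per individual:
# inv_metric = initial_inv_metric_diag(n_main, N, N)
```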

@avehtari I’m looking at ESS/grad which will take this into account - this should remain constant regardless of slower memory access.

ESS / sec will, however, not stay constant, since the time taken to complete a single gradient evaluation grows non-linearly with N due to memory access getting slower and slower, as you point out.

edit: I should have just used ESS / grad in my comment before to avoid confusion; as @avehtari points out, the raw efficiency (i.e. per unit time, such as ESS / sec) won’t scale linearly with N due to additional overhead / slower memory access, even if ESS / grad stays constant.

1 Like

Hey @jsocolar, sorry I forgot to reply to your post. Here it is:

If the number of parameters grows with N, then a super-linear performance hit is guaranteed regardless of the curvature. The number of partial derivatives you need to evaluate grows linearly with the number of parameters, and the cost of evaluating them grows positively (linearly?) with N.

As I mentioned in my reply to @avehtari, I’m (now) focusing on measuring efficiency in terms of the Min ESS / grad. Hence, how long each gradient evaluation takes (i.e. the cost of each gradient evaluation) is irrelevant and so are computational issues such as the speed of memory access. Those only matter when measuring the “real life” efficiency (e.g., Min ESS / second).

What matters is not the magnitude of the curvature, or the “number of points with high curvature” but rather the variability in the curvature. For example, consider an isotropic multivariate Gaussian. If the standard deviation is large, the curvature is small. If the standard deviation is small, the curvature is larger. But this difference is easily handled by the adaptation of the diagonal mass matrix. In fact, in this case mass-matrix adaptation wouldn’t even be necessary, because the step size adaptation itself provides an isotropic rescaling of the parameter space.

Thanks very much for this useful info! Helps me understand curvature better.

As N increases, in both Stan and the custom algorithm L (the number of leapfrog steps) goes up and the step size ( \epsilon ) goes down. As I mentioned, I have some parameters (the “nuisance” parameters) for which the SD does not change as N increases; however, for all of the “main” parameters, the SD goes down as N increases. Hence the mass matrix M is crucial, since the parameters have greatly different scales, with the difference between the scales becoming larger as N increases.

So currently, I’m still leaning towards what I suggested in my other comment, i.e. that Stan (and the custom algorithm) are having a harder time adapting M for larger N. I can’t think of any other reason why the ESS / grad (NOT ESS / second) would decrease as N increases - can you?

I have not got around to calculating the Hessian yet (I’ll probably just do it using finite differences, as a start), so that might shed some more light on what’s going on.

Also, I checked the correlations between the parameters using the traces and they’re all pretty low.

1 Like