Concentration of measure and typical sets

stan-math
discourse

#21

Maybe we need to define something like a set A’s “thinness” as ∫​​​_A m(A ∩ B(x,e))dx, where B(x,e) denotes a ball radius of e centered at x and m() is Lebesgue measure.

This at least might help capture the intuition for why the Metropolis Hastings algorithm doesn’t do well in high dimensions.


#22

Concentration of measure will always be weird in a non-compact space like the reals and reference measures like Lebesgue, especially if you consider probability mass relative to some neighborhood that you allow to move with dimension. The phenomenon manifests in properties like vanishing tail probabilities relative to any fixed neighborhood (i.e. mass rapidly moves away from the mode) and relative width (width divided by location) concentrating.

If you really want to see explicit concentration at some boundary then consider a compact space such as a ball of fixed radius or a sphere. There probability mass concentrates in neighborhoods of increasingly smaller uniform reference measure.

Again, the utility of this phenomenon is conceptual to understand what kind of properties you want in your algorithms. For example, Random Walk Metropolis fails because in high dimensions because isotropic jumps will increasingly fall away from the mode where the density quickly vanishes and the Metropolis step rejects. Hamiltonian Monte Carlo scales well because measure-preserving flows follow the typical set even in high dimensions.


#23

Another way to think of it is that “all the steps go steeply in the wrong direction”. In the case of a distribution with spherical symmetry like the multivariate normal, from the current point a proposal with higher (or at least similar) probability would be nice and that means that we would like to go closer to the center. But most of the proposals would have much lower probability. In high dimensions almost all the directions are “out” and even small steps mean that the probability density at the destination is much lower than at the origin.


#24

In high dimensions almost all the directions are “out” and even small steps mean that the probability density at the destination is much lower than at the origin.

I thought I understood this intuition, but I have trouble deriving analytical results or confirming this idea in simulation.

For example, consider a Metropolis Hastings algorithm with proposal distribution q_{t+1} ~ N(q_t, 1/sqrt(n) I_n), where q_{t+1} is the candidate draw, q_t is the current location of the chain, n denotes the dimension of the parameter space, and I_n is the n-dimensional identity matrix. The covariance of the proposal is scaled to ensure that the average distance between the proposal and current state remains at 1.

Draw q_t from a N(0, I_n) and then compute the acceptance probability min(1,f(q_{t+1})/f(q_t)). I don’t find that this acceptance probability declines with n. Doesn’t that seem to contradict your statement?

Here’s the simple R code:

library(“mvtnorm”)

n = 100 # number of dimensions
N = 1000 # number of draws
q = rnorm(n) # current state
qprime = rep(q, each = N) + matrix(rnorm(n*N, sd = 1/sqrt(n)), ncol = n, nrow = N) # draw N proposals

mean(pmin(1,dmvnorm(qprime)/dmvnorm(q))) # compute average acceptance probability


#25

I wrote that without thinking too much about it. I’d say that you’re correcting the issue by scaling the step size. In some sense you are keeping it constant, but as a matter of fact it’s getting small relative to the size of the region that you need to sample. So maybe the intuition remains valid that to have constant acceptance you need to do “small” steps and it will take longer to visit the distribution as n grows. But I don’t dare to guess how does the time to obtain an independent sample grow with n…


#26

Yes but now we’re back to this tricky question of what “small relative to the size of the region” means.

To me, the main thing I have gotten out of this thread is the need to base heuristics and intuition off some more precise definitions.

Maybe @betanalpha can help by precisely stating the mathematical meaning behind his statements like “the typical set becomes more singular with increasing dimension” or “every Random Walk Metropolis proposal will produce a point on the outside of the typical set”. In both cases, what I thought the statement meant (the Lebesgue measure of the typical set goes to zero as dimension increases / the acceptance probability of a Metropolis-Hastings algorithm that keeps the step-size constant will approach zero) turned out to be wrong. Which is not to say that his ideas are incorrect, just that I failed understand them.


#27

The length of the MH steps you proposed is of order ~1 and the diameter of the high probability region (or the typical set, if you will) is of order ~sqrt(n), so there is a clear divergence in scales as n goes to infinity. But I agree that determining what is the precise consequence is tricky and I won’t speculate further. The example of a multivariate standard normal may not be very representative of the general case anyway.


#28

@ungil is correct that your example is misleading because you change your step size – the need to scale the step size is exactly a manifestation of concentration of measure as he describes.

Concentration of measure captures how the behavior of probability distributions change as their dimensionality increases (one of the reasons this can be confusing is because there’s no canonical way to “increase the dimensionality of a distribution” and we often rely on IID products). That means you have to fix everything and then see what happens as the dimensionality of your distribution increases.

In particular, your focus on Lebesgue measure is misleading because the product measure in R^{N} is very different from the base Lebesgue measure over R. Concentration of measure does not describe concentration relative to the base Lebesgue measures but rather the product. In other words, a distribution exhibits concentration of measure if its typical set grows more slowly than the product Lebesgue measure expands.


#29

Just to try to sum up, the typical set is where the draws fall that you take them at random (forget MCMC—just take independent draws from the density of interest). These random draws will have bounded log densities and very often the mode falls outside the typical set even though the mode has higher density.

It may not even contain the higest density areas, so it’s not the smallest volume covering some fraction of probability.

I didn’t mean to imply that! I didn’t realize there was a limit.

Thanks. I can fix that.

It’s really just convention—there’s a parallel world of electrical engineers working with different terminology like the AEP (which I see just got referenced in the next reply).

The Bernoulli example is traditional. If it wasn’t clear, I borrowed it from MacKay’s awesome info theory book.

No! Their probabilities vary by a lot. Just look at the binary case going from a 50% chance of success (all outcomes are equivalent—typical set is everything) and 90% chance of success (where you need a pretty big span of probability to cover most of the mass).

The typical set covers orders of magnitude. But they have similar average log densities per element compared to elements in support but not in the typical set.

Wide typical sets (in terms of energy) can be a hard case for Stan because the change in energy level from one iteration to the next is bounded by chi-squared(N), where N is the number of dimensions.

No, it’s just the “typical” one. You’d get a bit smaller shaving off a bit and adding the volume around a mode of high density. It’s just that you’ll never get near that mode if you just take random draws.

As to size of the typical set, it depends on the variance of the elements. But we often do this in the continuous case where you have to think in terms of volumes rather than combinations.


#30

The problem is that if you try to move any distance at all with RWM, you leave the typical set and encounter too many rejections. So you need a small step size to stay in the typical set, which means diffusion rather than directed movement.

Stan uses gradient information to simulate the Hamiltonian for a fictional particle. The potential energy (negative log density) has the effect of keeping the particle in the typical set while still allowing it to move long distances in a single iteration (at the cost of potentially many leapfrog steps incurring evaluations of log densities and gradients).

The discussion of typical sets in my blog post was mainly to explain why ensemble methods that do interpolation won’t work in high dimensions because the interpolated points aren’t in the typical set you want to sample from to calculate expectations over the posterior.