Concentration of measure and typical sets

Ok but why do we care about entropy in this context? Is there some particularly useful property of log() that I’m missing? I guess that’s one key point I don’t seem to understand.

In his case study, @Bob_Carpenter cites Cover & Thomas (2006), which details the connections between (differential) entropy and the typical set. I think the Bernoulli example is taken from that book too. I guess what you’re looking for in order to understand the connection between entropy and typical sets is the asymptotic equipartition property. See this lecture, especially slide 11, which shows how one can bound the probability of the typical set using the entropy of the generating distribution.

Hope this helps.

Those lecture notes are very helpful - thank you!

Here are the main points I got out of the slides:

  1. As n grows, almost all points are in the typical set and they are almost all equally probable
  2. The typical set is the smallest set with probability near 1 (although this point isn’t proven)
  3. The typical set has at most 2^(n(H(X) + ε)) elements, while the sample space has |X|^n elements

The fact that all points in the typical set have approximately the same probability is interesting and something I definitely overlooked. Point 2 justifies our focus on the typical set, but a proof of this point would be helpful (maybe it’s in the Cover and Thomas book?). Point 3 gives the rate at which the size of the typical set declines relative to the size of the sample space. That gives some meaning to Betancourt’s phrase, but it’s unclear if/how points 2 and 3 apply to the continuous case.
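
To make point 1 concrete, here is a minimal R sketch (my own, assuming an arbitrary Bernoulli(0.3) source) showing that the per-symbol log2-probability of an iid sequence concentrates around the entropy H(X) as n grows, which is exactly the asymptotic equipartition property:

p <- 0.3                                      # Bernoulli success probability (arbitrary choice)
H <- -p*log2(p) - (1 - p)*log2(1 - p)         # source entropy in bits, about 0.881
for (n in c(10, 100, 1000, 10000)) {
  k <- rbinom(5000, n, p)                     # number of successes in 5000 iid sequences of length n
  bits <- -(k*log2(p) + (n - k)*log2(1 - p))/n   # per-symbol log2-probability of each sequence
  cat("n =", n, " mean =", mean(bits), " sd =", sd(bits), "\n")
}

The mean stays at H(X) while the spread shrinks like 1/sqrt(n), so the set of sequences whose per-symbol log-probability is within ε of H(X) (the typical set) eventually captures almost all of the probability.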

(2) is not correct. The smallest set with probability 1-delta is the high probability set, which is “similar” to the typical set but not identical (of course both contain almost all the mass, so they can only differ a bit). In particular, it will include the high probability points which are not present in the typical set (in the multivariate normal example, the high probability set is a solid ball including the center).
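
Here is a quick numerical illustration of that last point (my own sketch, using an arbitrary 100-dimensional standard normal): the density is highest at the center, yet essentially no draws land near it, so the typical set is a thin shell that excludes the mode, while the smallest set holding 99% of the mass is a solid ball that includes it:

D <- 100                                      # dimension (arbitrary choice)
x <- matrix(rnorm(10000*D), ncol = D)         # 10000 draws from a standard D-dimensional normal
r <- sqrt(rowSums(x^2))                       # distance of each draw from the mode
r99 <- sqrt(qchisq(0.99, df = D))             # radius of the smallest ball holding 99% of the mass
c(mean(r), sd(r), r99)                        # radii pile up near sqrt(D) = 10, well inside that ball
mean(r < 0.5*sqrt(D))                         # essentially no draws within half that distance of the mode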


Thanks for the information! I had not heard of the “high probability set” previously. These notes indicate that the difference in size between the high probability set and the typical set diminishes as n grows.

I still haven’t been able to find much on the asymptotic behavior of typical sets in the continuous case, which is of primary interest to HMC. In what sense does the typical set for a sequence of iid continuous random variables “become more singular with increasing dimension”? And how is this related to concentration of measure?

Bob’s case study already covers a continuous example of concentration of measure, as does section 4.2 of https://github.com/betanalpha/stan_intro.

I cannot emphasize enough how unenlightening it is to chase analytic results about concentration of measure. What matters is not the details but the gross conceptual consequences, such as the neighborhood around the mode being a terrible characterization of a distribution in high dimensions.


I agree that analytic results may not be particularly helpful, but it’s the general concept that I’m confused about. I thought the sentence “the typical set becomes more singular with increasing dimension” meant that the Lebesgue measure of this set converges to zero. But apparently that isn’t the right way to think about it. So what is the right way? Maybe it would help if you defined what you mean by a “singular set”.

Here’s a related sentence from the paper you linked:

Almost all of our target probability can be found in a thin shell at r = σ√D and the relative width of that shell decreases as we add more and more dimensions

What precisely do you mean by “relative” width? And why is the typical set’s relative width important when thinking about integration in high dimensions?

I share your confusion. I can see how, in the proposed example, if we think of the 1-d distribution across entropy levels, we could say that “the typical set becomes more singular with increasing dimension” and talk about “concentration of measure”. But these remarks do not seem applicable in the context of the multi-dimensional space where the probability distribution is defined.

Most of the volume of a hyper-ball is close to the surface. If we have a probability distribution which is constant on a spherical region in high dimension, I wouldn’t say that the probability is “concentrated” in the outer shell, as I wouldn’t say that most of the population of England is “concentrated” outside of London.

In the end, the discussion about entropy and the typical set seems to be a way to give some intuition about geometry in high dimensions. Personally, I find that addressing the geometry directly is easier. The typical-set reasoning requires an i.i.d. distribution with n growing to infinity. In my opinion, geometric arguments are easier to understand and generalize better to arbitrary distributions and finite-dimensional spaces.

By the way, in the multivariate normal example the relative width of the spherical shell (relative to its radius, that is) does indeed decrease (as I mentioned in the second paragraph). But of course its volume is huge, and the shell accounts for essentially all the volume of the sphere.
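
For what it’s worth, a few lines of R (my own sketch, with σ = 1) confirm that relative-width claim: the radius of a standard D-dimensional normal concentrates around √D with a spread of roughly 1/√2, so the width relative to the radius shrinks like 1/√(2D):

for (D in c(2, 20, 200, 2000)) {
  x <- matrix(rnorm(2000*D), ncol = D)        # 2000 draws from a standard D-dimensional normal
  r <- sqrt(rowSums(x^2))                     # distance from the mode for each draw
  cat("D =", D, " mean(r) =", mean(r), " sd(r) =", sd(r),
      " relative width =", sd(r)/mean(r), "\n")
}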


Maybe this is confusing because the suburbs of London are not particularly high-dimensional.

Does “concentrated” mean something different in high-dimensional spaces?

I thought the main point here was that the spatial intuition we have from 2-3D space just doesn’t transfer to higher-dimensional spaces, so we can’t expect these words to have the same meaning. If London were high-dimensional (e.g. add age, height, etc.), most of the population would be in the “suburbs”.

But the “size” of the suburbs also grows, no?

In case you missed my point, in low-dimensional England most of the population does live outside of London. Do you think that the population of England is “concentrated” outside of London? If not, what is “concentrated” supposed to mean in high dimensional spaces?


Yes, I guess there’s some confusion from organizing this whole thing around radius.


I was thinking of the London metropolitan area, but now I see you meant all of England. The term ‘concentration’ comes up from thinking of everything in terms of radius, and we’re already mixing math and common-language concepts, so I’m not really sure what you’re asking.

I suppose when considering all of England and looking at it in high dimension you would concentrate the population on the shell even faster than in London (since the countrywide population distribution is flatter in multiple dimensions). But this is all so sloppy I don’t think it helps much (?)

I’m not asking for anything, nor do I want to extend England into higher dimensions. It was just an example to illustrate why I wouldn’t say that the probability mass of a hyperspherical (or multinormal, for that matter) distribution is “concentrated” on some spherical shell. Even though I do understand that most of the volume of a hypersphere (or more precisely, hyperball) is close to its surface. And I understand that you can say that the radius is typically large, or concentrated around some value, etc. But this is a property of the radius (or, in the multinormal distribution example, a property of the probability density and entropy which are directly related to the radius). I think the distinction is sometimes not clear enough, and adding the “typical set” to the mix instead of keeping it purely geometrical doesn’t help in my opinion.

Maybe we need to define something like a set A’s “thinness” as ∫_A m(A ∩ B(x,ε)) dx, where B(x,ε) denotes a ball of radius ε centered at x and m() is Lebesgue measure.

This at least might help capture the intuition for why the Metropolis-Hastings algorithm doesn’t do well in high dimensions.

Concentration of measure will always be weird in a non-compact space like the reals with a reference measure like Lebesgue, especially if you consider probability mass relative to some neighborhood that you allow to move with dimension. The phenomenon manifests in properties like vanishing tail probabilities relative to any fixed neighborhood (i.e. mass rapidly moves away from the mode) and the relative width (width divided by location) concentrating.

If you really want to see explicit concentration at some boundary then consider a compact space such as a ball of fixed radius or a sphere. There probability mass concentrates in neighborhoods of increasingly smaller uniform reference measure.

Again, the utility of this phenomenon is conceptual: it tells you what kind of properties you want in your algorithms. For example, Random Walk Metropolis fails in high dimensions because isotropic jumps will increasingly fall away from the mode, where the density quickly vanishes, and the Metropolis step rejects. Hamiltonian Monte Carlo scales well because measure-preserving flows follow the typical set even in high dimensions.
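
To put a number on the mass moving away from the mode: for a standard D-dimensional normal the squared radius is chi-square with D degrees of freedom, so the probability inside a fixed ball around the mode (radius 2 in this sketch, an arbitrary choice) can be read off directly in R and collapses to zero as D grows:

for (D in c(1, 2, 5, 10, 20, 50)) {
  mass <- pchisq(2^2, df = D)                 # P(|q| < 2) when q ~ N(0, I_D)
  cat("D =", D, " P(|q| < 2) =", mass, "\n")
}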


Another way to think of it is that “all the steps go steeply in the wrong direction”. In the case of a distribution with spherical symmetry like the multivariate normal, from the current point a proposal with higher (or at least similar) probability would be nice and that means that we would like to go closer to the center. But most of the proposals would have much lower probability. In high dimensions almost all the directions are “out” and even small steps mean that the probability density at the destination is much lower than at the origin.
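
A small R sketch of that effect (my own, assuming a fixed per-coordinate proposal standard deviation of 0.5 that does not shrink with dimension): starting from a typical point of an n-dimensional standard normal, the typical change in log density from an isotropic jump becomes more and more negative, and the fraction of proposals that go “uphill” vanishes:

library("mvtnorm")

sigma <- 0.5                                  # fixed per-coordinate proposal sd (arbitrary choice)
for (n in c(2, 10, 50, 250)) {
  q <- rnorm(n)                               # a typical point of the n-dimensional standard normal
  prop <- rep(q, each = 2000) + matrix(rnorm(2000*n, sd = sigma), nrow = 2000, ncol = n)  # 2000 proposals
  dlogp <- dmvnorm(prop, log = TRUE) - dmvnorm(q, log = TRUE)   # change in log density for each proposal
  cat("n =", n, " mean(dlogp) =", mean(dlogp), " fraction uphill =", mean(dlogp > 0), "\n")
}

With the step size held fixed, the average log-density drop is about n*sigma^2/2, which is one way to see why a random-walk proposal has to shrink its steps as the dimension grows.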

In high dimensions almost all the directions are “out” and even small steps mean that the probability density at the destination is much lower than at the origin.

I thought I understood this intuition, but I have trouble deriving analytical results or confirming this idea in simulation.

For example, consider a Metropolis-Hastings algorithm with proposal distribution q_{t+1} ~ N(q_t, (1/n) I_n), where q_{t+1} is the candidate draw, q_t is the current location of the chain, n denotes the dimension of the parameter space, and I_n is the n-dimensional identity matrix. The proposal covariance is scaled so that the average distance between the proposal and the current state stays near 1.

Draw q_t from a N(0, I_n) and then compute the acceptance probability min(1,f(q_{t+1})/f(q_t)). I don’t find that this acceptance probability declines with n. Doesn’t that seem to contradict your statement?

Here’s the simple R code:

library("mvtnorm")

n = 100 # number of dimensions
N = 1000 # number of proposals to test
q = rnorm(n) # current state, a draw from N(0, I_n)
qprime = rep(q, each = N) + matrix(rnorm(n*N, sd = 1/sqrt(n)), ncol = n, nrow = N) # N proposals, one per row

mean(pmin(1, dmvnorm(qprime)/dmvnorm(q))) # compute average acceptance probability
