Concentration of measure and typical sets

discourse
stan-math

#1

So I spent today trying to better understand typical sets. I first went through Michael Betancourt’s discussion and then studied Bob Carpenter’s case study carefully. This part confused me:

“Not only are the intervals moving away from the mode, they are concentrating into a narrow band in higher dimensions. This result follows here from the central limit theorem in this example and in more generality from concentration of measure” (section 4 of Carpenter’s case study)

Maybe I mis-interpreted this statement, but it seems to imply that the width of the confidence bands vanishes to zero as the number of dimensions increases. This claim simply isn’t true: using the Central Limit Theorem and Delta Method, I found that the length of the confidence intervals in fact approaches 3.642. This figure can be confirmed by looking more closely at the graph in Bob’s simulation. The “narrowing” of the intervals after 8 dimensions is just a combination of a little simulation noise and an optical illusion from the graph’s curvature.

For fixed ϵ > 0, Bob defines the typical set A_n^ϵ as

A_n^ϵ = {(x_1, … x_n): |1/n ∑ log p(x_i)− H_X | ≤ ϵ}

where H_X is the differential entropy of p. He then claims that the typical set “covers most sequences” in the sense that

Pr[(X_1,…,X_n) ∈ A^n_ϵ] > 1−ϵ.

The confusing part of this statement is that the coverage does not hold for all n. In the Gaussian example, it is easy to explicitly compute

A_n^ϵ = {(x_1, ⋯, x_n): |1/n ∑x_i^2 - 1| ≤ 2ϵ)​ = ​{(x_1, ⋯, x_n): n(1 - 2 ϵ) ≤ ∑x_i^2 ≤ n(1 +2ϵ)}​

Since ∑x_i^2 is distributed as a chi-square, we can explicitly compute Pr[(X_1,…,X_n) ∈ A^n_ϵ] and observe that Pr[(X_1,…,X_n) ∈ A^n_ϵ] > 1 - ϵ only holds asymptotically. The convergence is also not uniform over ϵ > 0.

Finally, consider the following statement from Betancourt’s paper:

“As the dimension of parameter space increases, the tension between the density and
the volume grows and the regions where the density and volume are both large enough to
yield a significant contribution becomes more and more narrow. Consequently the typical
set becomes more singular with increasing dimension, a manifestation of concentration of
measure.” (pg. 7)

First of all, the “thickness” of the typical set in the above example grows at rate sqrt(n), so it’s not really “narrowing” in a straight-forward sense. I thought maybe Michael meant that the Lebesgue measure of A_n^e vanishes as n increases. But this statement also turns out to be false. A_n^e is the region between two spheres centered at the origin, one with radius sqrt(n(1 + 2e)) and the other with radius sqrt(n(1 - 2e)). Using the formula for the volume of an N-ball, it is easy to check that the Lebesgue measure of this region in fact grows quickly with the dimension of the space.

So I’m not sure what to make of statements like “the typical set becomes more singular with increasing dimension”. The concentration of measure results generally apply to the distribution of functions of independent random variables. For example, Terry Tao’s blogpost shows that a 1-Lipschitz function f satisfies

P(|f(X_1, …, X_n) - Ef(X_1, …, X_n)| > t) < 2 exp (-c t^2)

when the X_i’s are drawn independently from the standard normal distribution. Although this results says f(X_1, …, X_N) concentrates as if it were Gaussian, it doesn’t say anything about how the “size” of the set

{(x_1, …, x_n) : |f(x_1, …, x_n) - Ef(X_1, …, X_n)| > t}

changes with n.


#2

I have not checked your calculations, but note that the “covers most sequences” claim is only valid for n sufficiently large (for any given epsilon, there exist an n_0 such that etc.). This detail is not clear in Carpenter’s use case.

I also think that the references to the typical set are a bit confusing. If you have not looked at it already, you may find the discussion at Gelman’s blog interesting even though you will remain confused (there is a link at the Acknowledgements section of the use case).


#3

In that case, the statement is really just the Law of Large Numbers. Why use entropy at all? For any measurable function F that satisfies E |F(X)| < infty, we have

lim_{n -> infty} Pr({|1/n ∑ F(X_i) - E(F(X))| > ϵ} -> 0


#4

That’s true. For example, the summary of http://mat.hjg.com.ar/tic/img/lecture4.pdf is “The AEP is the weak law of large numbers as applied to the estimation of entropy.”


#5

Ok but why do we care about entropy in this context? Is there some particularly useful property of log() that I’m missing? I guess that’s one key point I don’t seem to understand.


#6

In his case study, @Bob_Carpenter cites Cover & Thomas (2006) which details the connections between (differential) entropy and the typical set. I think the Bernoulli example is taken from that book too. I guess what you’re looking for in order to understand the connection between entropy and typical sets is the asymptotic equipartition property. See this lecture, specially slide 11 that shows how one can bound the probability of the typical set from the entropy of the generating distribution.

Hope this helps.


#7

Those lecture notes are very helpful - thank you!

Here are the main points I got out of the slides:

  1. As n grows, almost all points are in the typical set and they are almost all equally probable
  2. The typical set is the smallest set with probability near 1 (although this point isn’t proven)
  3. The size of the typical set has 2^n(H(X) + e) elements while the size of the sample space has |X|^n elements

The fact that all points in the typical set have approximately the same probability is interesting and something I definitely overlooked. Point 2 justifies our focus on the typical set, but a proof of this point would be helpful (maybe it’s in the Cover and Thomas book?). Point 3 provides the rate at which the size of the typical set declines relative to the size of the sample size. That gives some meaning to Betancourt’s phrase, but it’s unclear if/how points 2 and 3 apply to the continuous case.


#8

(2) is not correct. The smallest set with probability 1-delta is the high probability set, which is “similar” to the typical set but not identical (of course both contain almost all the mass, so they can only differ a bit). In particular, it will include the high probability points which are not present in the typical set (in the multivariate normal example, the high probability set is a solid ball including the center).


#9

Thanks for the information! I had not heard of the “high probability set” previously. These notes indicate that the difference in size between the high probability set and the typical set diminishes as n grows.

I still haven’t been able to find much on the asymptotic behavior of typical sets in the continuous case, which is of primary interest to HMC. In what sense does the typical set for a sequence of iid continuous random variables “become more singular with increasing dimension”? And how is this related to concentration of measure?


#10

Bob’s case study already covers a continuous example of concentration of measure, as does section 4.2 of https://github.com/betanalpha/stan_intro.

I cannot emphasize enough not un-elightening it is to chase analytic results about concentration of measure. What matters is not the details but the gross conceptual consequences, such as the neighborhood around the mode being a terrible characterization of a distribution in high dimensions.


#11

I agree that analytic results may not be particularly helpful, but it’s the general concept that I’m confused about. I thought the sentence “the typical set becomes more singular with increasing dimension” meant that the Lebesgue measure of this set converges to zero. But apparently that isn’t the right way to think about it. So what is the right way? Maybe it would help if you defined what you mean by a “singular set”.

Here’s a related sentence from the paper you linked:

Almost all of our target probability can be found in a thin shell at r = σ√D and the relative width of that shell decreases as we add more and more dimensions

What precisely do you mean by “relative” width? And why is the typical set’s relative width important when thinking about integration in high dimensions?


#12

I share your confusion. I can see how in the proposed example if we are thinking of the 1-d distribution across entropy levels we could say that “the typical set becomes more singular with increasing dimension” and talk about “concentration of measure”. But these remarks do not seem applicable in the context of the multi-dimensional space where the probability distribution is defined.

Most of the volume of a hyper-ball is close to the surface. If we have a probability distribution which is constant on a spherical region in high dimension, I wouldn’t say that the probability is “concentrated” in the outer shell, as I wouldn’t say that most of the population of England is “concentrated” outside of London.

In the end, the discussion about the entropy and the typical set seems to be a way to give some intuition about geometry in high dimensions. Personally, I find that adressing the geometry directly is easier. The typical set reasoning requires a i.i.d distribution with n growing to infinity. In my opinion, geometry arguments are easier to understand and generalize better to arbitrary distributions and finite-dimensional spaces.

By the way, in the multivariate normal example the relative width of the sphericall shell (relative to its radius, that is) does indeed decrease (as I mentioned in the second paragraph). But of course its volume is huge, and the shell accounts for essentially all the volume of the sphere.


#13

Maybe this is confusing because the suburbs of London are not particularly high-dimensional.


#14

Does “concentrated” mean something different in high-dimensional spaces?


#15

I thought the main point here was that the spatial intuition we have from 2-3D space just doesn’t transfer to higher dimensional spaces so we can’t expect these words to have the same meaning. If London were high dimensional (e.g.-add age, height, etc…) most of the population would be in the “suburbs”.


#16

But the “size” of the suburbs also grows, no?


#17

In case you missed my point, in low-dimensional England most of the population does live outside of London. Do you think that the population of England is “concentrated” outside of London? If not, what is “concentrated” supposed to mean in high dimensional spaces?


#18

Yes, I guess there’s some confusion to organizing this whole thing around radius.


#19

I was thinking of the London metropolitan area, but now I see you meant all of England. The term ‘concentration’ comes up from thinking of everything in terms of radius and we’re already mixing math and common language concepts so I’m not really sure what you’re asking.

I suppose when considering all of England and looking at it in high dimension you would concentrate the population on the shell even faster than in London (since the countrywide population distribution is flatter in multiple dimensions). But this is all so sloppy I don’t think it helps much (?)


#20

I’m not asking for anything, nor do I want to extend England into higher dimensions. It was just an example to illustrate why I wouldn’t say that the probability mass of a hyperspherical (or multinormal, for that matter) distribution is “concentrated” on some spherical shell. Even though I do understand that most of the volume of a hypersphere (or more precisely, hyperball) is close to its surface. And I understand that you can say that the radius is typically large, or concentrated around some value, etc. But this is a property of the radius (or, in the multinormal distribution example, a property of the probability density and entropy which are directly related to the radius). I think the distinction is sometimes not clear enough, and adding the “typical set” to the mix instead of keeping it purely geometrical doesn’t help in my opinion.