So I spent today trying to better understand typical sets. I first went through Michael Betancourt’s discussion and then studied Bob Carpenter’s case study carefully. This part confused me:
“Not only are the intervals moving away from the mode, they are concentrating into a narrow band in higher dimensions. This result follows here from the central limit theorem in this example and in more generality from concentration of measure” (section 4 of Carpenter’s case study)
Maybe I misinterpreted this statement, but it seems to imply that the width of the confidence bands shrinks to zero as the number of dimensions increases. That simply isn’t true: using the central limit theorem and the delta method, I found that the length of the confidence intervals in fact approaches 3.642. This figure can be confirmed by looking more closely at the graph in Bob’s simulation. The apparent “narrowing” of the intervals after 8 dimensions is just a combination of a little simulation noise and an optical illusion created by the graph’s curvature.
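To sanity-check that figure, here is a quick computation (assuming the intervals are central 99% intervals of the distance ‖x‖ from the mode, which is what 3.64 corresponds to): since ∑ x_i^2 is chi-square with n degrees of freedom, the interval endpoints are square roots of chi-square quantiles, and the delta method gives sd(‖X‖) → 1/sqrt(2), so the width tends to 2 × 2.576/sqrt(2) ≈ 3.64.

```python
# Exact width of the central 99% interval of ||X|| for X ~ N(0, I_n),
# computed from chi-square quantiles (no simulation noise involved).
import numpy as np
from scipy import stats

for n in [1, 2, 4, 8, 16, 64, 256, 1024]:
    lo, hi = stats.chi2.ppf([0.005, 0.995], df=n)
    print(f"n = {n:4d}   width = {np.sqrt(hi) - np.sqrt(lo):.3f}")

# Delta-method limit: sd(||X||) -> 1/sqrt(2), so width -> 2 * 2.576 / sqrt(2)
print("limit:", 2 * stats.norm.ppf(0.995) / np.sqrt(2))  # ~3.643
```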
For fixed ϵ > 0, Bob defines the typical set A_n^ϵ as
A_n^ϵ = {(x_1, …, x_n) : |−1/n ∑ log p(x_i) − H_X| ≤ ϵ}
where H_X is the differential entropy of p. He then claims that the typical set “covers most sequences” in the sense that
Pr[(X_1, …, X_n) ∈ A_n^ϵ] > 1 − ϵ.
The confusing part of this statement is that the coverage does not hold for all n. In the Gaussian example, it is easy to explicitly compute
A_n^ϵ = {(x_1, …, x_n) : |1/n ∑ x_i^2 − 1| ≤ 2ϵ} = {(x_1, …, x_n) : n(1 − 2ϵ) ≤ ∑ x_i^2 ≤ n(1 + 2ϵ)}
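For completeness, here is the short calculation behind that identity, writing p for the standard normal density:

```latex
\begin{aligned}
\log p(x_i) &= -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}x_i^2,
\qquad
H_X = -\mathbb{E}\bigl[\log p(X)\bigr] = \tfrac{1}{2}\log(2\pi) + \tfrac{1}{2},\\[4pt]
\Bigl|-\tfrac{1}{n}\textstyle\sum_i \log p(x_i) - H_X\Bigr|
 &= \tfrac{1}{2}\Bigl|\tfrac{1}{n}\textstyle\sum_i x_i^2 - 1\Bigr| \le \epsilon
 \;\Longleftrightarrow\;
 \Bigl|\tfrac{1}{n}\textstyle\sum_i x_i^2 - 1\Bigr| \le 2\epsilon.
\end{aligned}
```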
Since ∑ x_i^2 is chi-square with n degrees of freedom, we can compute Pr[(X_1, …, X_n) ∈ A_n^ϵ] explicitly and observe that the bound Pr[(X_1, …, X_n) ∈ A_n^ϵ] > 1 − ϵ holds only once n is large enough, with how large depending on ϵ. The convergence is also not uniform over ϵ > 0.
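Here is that calculation done with the chi-square CDF for one illustrative choice of ϵ (the value 0.05 below is mine, chosen only to show where the bound kicks in):

```python
# Pr[(X_1,...,X_n) in A_n^eps] = Pr[ n(1-2eps) <= chi2_n <= n(1+2eps) ],
# compared against the claimed lower bound 1 - eps.
from scipy import stats

eps = 0.05  # illustrative choice
for n in [1, 10, 100, 500, 1000, 5000]:
    cover = (stats.chi2.cdf(n * (1 + 2 * eps), df=n)
             - stats.chi2.cdf(n * (1 - 2 * eps), df=n))
    print(f"n = {n:5d}   coverage = {cover:.4f}   (claimed > {1 - eps:.2f})")
```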
Finally, consider the following statement from Betancourt’s paper:
“As the dimension of parameter space increases, the tension between the density and the volume grows and the regions where the density and volume are both large enough to yield a significant contribution becomes more and more narrow. Consequently the typical set becomes more singular with increasing dimension, a manifestation of concentration of measure.” (pg. 7)
First of all, the “thickness” of the typical set in the above example grows at rate sqrt(n), so it’s not really “narrowing” in any straightforward sense. I thought maybe Michael meant that the Lebesgue measure of A_n^ϵ vanishes as n increases, but this also turns out to be false. A_n^ϵ is the region between two spheres centered at the origin, one with radius sqrt(n(1 + 2ϵ)) and the other with radius sqrt(n(1 − 2ϵ)). Using the formula for the volume of an n-ball, it is easy to check that the Lebesgue measure of this region in fact grows quickly with the dimension of the space.
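A quick check of that claim (again with an illustrative ϵ = 0.05), using log Γ to keep the n-ball volume formula V_n(r) = π^(n/2) r^n / Γ(n/2 + 1) numerically stable:

```python
# Lebesgue measure of A_n^eps = {x : n(1-2eps) <= ||x||^2 <= n(1+2eps)},
# i.e. the difference of two n-ball volumes, computed on the log scale.
import numpy as np
from scipy.special import gammaln

def log_ball_volume(n, r):
    """log of V_n(r) = pi^(n/2) r^n / Gamma(n/2 + 1)."""
    return 0.5 * n * np.log(np.pi) + n * np.log(r) - gammaln(0.5 * n + 1)

eps = 0.05  # illustrative choice
for n in [1, 2, 10, 50, 100, 500]:
    hi = log_ball_volume(n, np.sqrt(n * (1 + 2 * eps)))
    lo = log_ball_volume(n, np.sqrt(n * (1 - 2 * eps)))
    log_shell = hi + np.log1p(-np.exp(lo - hi))  # log(V_outer - V_inner)
    print(f"n = {n:4d}   log Lebesgue measure of A_n^eps = {log_shell:8.1f}")
```

On the log scale the measure grows roughly linearly in n, so the shell’s volume explodes rather than vanishes.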
So I’m not sure what to make of statements like “the typical set becomes more singular with increasing dimension”. Concentration-of-measure results generally apply to the distribution of functions of independent random variables. For example, Terry Tao’s blog post shows that a 1-Lipschitz function f satisfies
P(|f(X_1, …, X_n) - Ef(X_1, …, X_n)| > t) < 2 exp (-c t^2)
when the X_i’s are drawn independently from the standard normal distribution. Although this result says f(X_1, …, X_n) concentrates as if it were Gaussian, it doesn’t say anything about how the “size” of the set
{(x_1, …, x_n) : |f(x_1, …, x_n) - Ef(X_1, …, X_n)| > t}
changes with n.
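To make the distinction concrete, here is a small numerical sketch with the 1-Lipschitz function f(x) = ‖x‖ (my own example, not one from Tao’s post, and with t = 2 chosen only for illustration): the probability of deviating from the mean by more than t stays small for every n, in line with the dimension-free bound above, yet the Lebesgue measure of the complementary shell where f stays within t of its mean, the analogue of the typical set here, keeps growing with n.

```python
# f(x) = ||x|| is 1-Lipschitz.  Its deviations from the mean stay small in
# probability for every n (concentration), while the Lebesgue measure of the
# shell {x : |f(x) - Ef| <= t}, where nearly all the probability sits, grows.
import numpy as np
from scipy.special import gammaln

def log_ball_volume(n, r):
    """log of V_n(r) = pi^(n/2) r^n / Gamma(n/2 + 1)."""
    return 0.5 * n * np.log(np.pi) + n * np.log(r) - gammaln(0.5 * n + 1)

rng = np.random.default_rng(0)
t = 2.0  # deviation threshold, chosen only for illustration

for n in [10, 50, 100, 200]:
    norms = np.linalg.norm(rng.standard_normal((20_000, n)), axis=1)
    e_f = norms.mean()                              # Monte Carlo estimate of E f(X)
    tail = np.mean(np.abs(norms - e_f) > t)         # concentration: stays small
    hi = log_ball_volume(n, e_f + t)
    lo = log_ball_volume(n, e_f - t)
    log_shell = hi + np.log1p(-np.exp(lo - hi))     # log Lebesgue measure of shell
    print(f"n = {n:3d}   P(|f - Ef| > 2) ~ {tail:.4f}   log vol of shell = {log_shell:7.1f}")
```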