The typical set and its relevance to Bayesian computation

tl;dr

The typical set (at some level of coverage) is the set of parameter values for which the log density (the target function) is close to its expected value. As has been much discussed, this is not the same as the posterior mode. In a d-dimensional unit normal distribution with a high value of d, the 99% typical set looks like an annulus or sphere or doughnut, corresponding to the points whose distance from the origin is approximately \sqrt{d}. This intuition is important because it helps understand the challenges of HMC and similar algorithms, which essentially have to cycle around the sphere rather than directly cutting through the center.
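As a quick numerical check of the \sqrt{d} claim, one can look at how far standard normal draws land from the origin, for example in R:

d <- 100
theta <- matrix(rnorm(1e4 * d), ncol = d)   # 10,000 draws from a d-dimensional unit normal
r <- sqrt(rowSums(theta^2))                 # distance of each draw from the origin
c(mean(r), sqrt(d))                         # both close to 10
sd(r)                                       # about 0.7: a thin shell relative to its radius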

The term “typical set” is a standard term in information theory and is more general than the “typical set” concept discussed in Stan (as in the above paragraph). There are some differences between the official definition of typical set and the use of the term in Stan discussions. We can continue to work with the concept of typical set, and it will help for us to clarify the definition when we use the term.

Background

The concept of the “typical set” comes up a lot when we’re discussing Hamiltonian Monte Carlo and Stan. For example, in his case study, “Typical Sets and the Curse of Dimensionality,” Bob Carpenter writes, “Roughly speaking, the typical set is where draws from a given distribution tend to lie. That is, it covers most of the mass of the distribution. This particular choice of volume not only covers most of the mass of a distribution, its members reflect the properties of draws from that distribution.” He also writes that the typical set “is roughly defined as the central log density band into which almost all random draws from that distribution will fall.”

Bob also quotes Bernhard Schölkopf as saying, “a high-dimensional space is a lonely place,” which reminds me of my own line that I use when teaching this stuff: “In high-dimensional space, no one can hear you scream.” The funny thing is, I never saw that movie—I don’t like horror movies—but I remember the slogan.

In his article, “A Conceptual Introduction to Hamiltonian Monte Carlo,” Michael Betancourt refers to “the neighborhood between these two extremes [the neighborhood immediately around the mode, and the complementary neighborhood far away from the mode] known as the typical set” and writes, “In high-dimensional parameter spaces probability mass . . . and hence the dominant contributions to expectations, concentrates in a neighborhood called the typical set.” Michael also writes, “because probability densities and volumes transform oppositely under any reparameterization, the typical set is an invariant object that does not depend on the irrelevant details of any particular choice of parameters,” which I don’t think is correct (see below).

There’s a lot of consensus that the typical set is important. But what exactly is it?

Notation and definition

For consistency and clarity I’ll first define some notation. Assume we have a d-dimensional continuous probability density p(\theta) defined on some space \Theta, so that \int_\Theta p(\theta) d \theta = 1. In Bayesian notation we would write p(\theta|y), but here I’m suppressing the conditioning on y. So it’s just a math problem now. I’ll also assume we can compute p(\theta) and that we can use Stan to get random draws \theta from the probability distribution with density p(\theta). And I’ll define the target function, u(\theta) = \log p(\theta), the log density function. Finally, the entropy is defined as H = - E(\log(p(\theta))) = - \int \log(p(\theta)) \, p(\theta) \, d \theta.

I’m being a little sloppy here because typically we can only compute p(\theta) up to an arbitrary multiplicative constant, so we can only compute \log p(\theta) up to an arbitrary additive constant, and we usually define the target function only up to an arbitrary additive constant too. But for our purposes here we are only working with relative values of the target function, so for simplicity I’ll just pretend the normalizing constant is known. It won’t matter.

The official definition of typical set comes from information theory. Here’s what it currently says on wikipedia: “In information theory, the typical set is a set of sequences whose probability is close to two raised to the negative power of the entropy of their source distribution.” The set is defined in the space of sequences \theta_1,\dots,\theta_n. Thus, the typical set is a subset of \Theta^n. Wikipedia’s definition is using x where we’re using \theta, and it’s using \log_2 where we’re using \log_e, otherwise its notation is consistent with ours.

In Stan, however, people seem to be talking about the typical set as a set of values of \theta, that is, the typical set is a subset of \Theta, not a subset of \Theta^n.

But then we can just take the information-theory definition and use the special case n=1. Then the definition of typical set is: all the values of \theta for which \log p(\theta) is within some \epsilon of its expectation. That is, the set of values of \theta for which | \log p(\theta) + H | < \epsilon. (OK, the wikipedia definition is less than or equal, but that doesn’t matter here because we’re working with continuous distributions.)

10-dimensional normal example

I’ll now illustrate with a 10-dimensional normal example in R:

library("mvtnorm")
n_sims <- 1e4
d <- 10
theta <- rmvnorm(n_sims, rep(0, d), diag(d))         # draws from the d-dimensional unit normal
u <- dmvnorm(theta, rep(0, d), diag(d), log=TRUE)    # log density at each draw
H <- - mean(u)                                       # Monte Carlo estimate of the entropy

H is the entropy, which can be calculated analytically for this example as 0.5 d \log(2 \pi e). But just to keep things simple and general, I estimated it based on 10,000 simulations; it’s just the negative of the expectation of the log density, and it comes to 14.18. The log density at the mode is -(d/2) \log(2 \pi) = -9.19. But, as we well know, the mode is not a typical value of the distribution! The maximum of u in my 10,000 simulations is max(u) = -9.62.
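As a quick check, the analytic values line up with the simulation (continuing the R session above):

0.5 * d * log(2 * pi * exp(1))   # analytic entropy, 14.19
H                                # Monte Carlo estimate, about 14.18
-(d / 2) * log(2 * pi)           # log density at the mode, -9.19
max(u)                           # largest simulated log density, still below the value at the mode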

Using the wikipedia definition, the typical set for any \epsilon corresponds to the values of \theta for which u(\theta) is within \epsilon of - H. For example, if \epsilon = 1, it’s the values for which u(\theta) is in the range [-15.18, -13.18].

We can figure out coverage with a little R function:

coverage <- function(epsilon) {
  mean(abs(u + H) < epsilon)   # fraction of draws whose log density is within epsilon of -H
}

Then we can try some values:

coverage(0)
[1] 0
coverage(1)
[1] 0.3403
coverage(2)
[1] 0.6372
coverage(3)
[1] 0.8475
coverage(4)
[1] 0.9413
coverage(5)
[1] 0.971

As \epsilon increases, the coverage approaches 100%.

But this official definition of typical set is not quite what we’ve been talking about with Stan! The problem is that, to get high enough coverage using this interval centered at - H, you’ll end up including the mode after all! Check it out: - H is -14.18, and the log density at the mode is -9.19. So if we set \epsilon = 5, the typical set will include the mode.
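In R, using the numbers above (note that for the unit normal the gap between the log density at the mode and -H is exactly d/2, which is 5 here, so \epsilon = 5 is right at the threshold):

epsilon <- 5
c(-H - epsilon, -H + epsilon)                      # roughly (-19.18, -9.18)
dmvnorm(rep(0, d), rep(0, d), diag(d), log=TRUE)   # log density at the mode, -9.19, inside the interval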

Because of the law of large numbers, this problem does not arise with the classical definition of typical set as n gets large. But since we’re working with n=1, it’s a concern. And the definitions implicitly used by the Stan people are clearly working within \Theta, not \Theta^n. See for example the quotes by Bob and Michael above. Just to be clear, the n in the official definition of the typical set is not the same as the dimension of \theta, which is 10 in my example. In my example, \Theta is 10-dimensional, and \Theta^n, if we were to work with it, would have dimension 10n.

Differences between the official definition of “typical set” and some of how it’s discussed in the Stan community

  1. The first difference is that the official definition is in \Theta^n, while in Stan it’s used to refer to a subset of \Theta. So the use in Stan is a special case of the general definition.

  2. The second difference is that, as noted above, the 99% typical set for the 10-dimensional normal as officially defined includes the mode, which is not quite how we’ve been thinking about things. The Stan intuition is not all wrong—as dimensionality increases, 99% typical sets will ultimately exclude the mode. Everything’s fine if d=100, for example. But for low dimensions, such as d=10, the skew in the posterior distribution of \log(p(\theta)) makes a difference. 10 is not a high number of dimensions, but 10-dimensional problems do come up in applications.

  3. The third difference is that the typical set is not invariant to nonlinear transformations of \theta. This can be seen as follows: if you transform \theta nonlinearly, you’ll need to divide the density by the absolute value of the determinant of the Jacobian of the transformation, that is, you’re adding -log |det(Jacobian)| to the log density. The typical set of \theta is the inverse image of a set defined on the log density (that is, you take the values of \theta for which \log(p(\theta)) is in the range -H \pm \epsilon), and changing \log(p(\theta)) by subtracting a log abs det Jacobian will change the mapping and thus change the inverse image. This is not a big deal, nor is it a problem—if you nonlinearly transform \theta, you’re changing the geometry of the problem, so it makes sense that the typical set changes too. We should just be clear on it.

You can also see this by continuing with the above simulation in R. Just define phi = exp(theta), that is, the pointwise exponential, taking the exp() of each of the 10 elements of theta. We can then compute the determinant of the Jacobian in R:

phi <- exp(theta)                  # elementwise exp of each draw
jacobian <- apply(phi, 1, prod)    # |det| of the Jacobian of the transformation, per draw
u_phi <- - log(jacobian) + u       # log density of phi under the transformed distribution

Under the new parameterization, we can compute typical sets of \phi by working with u_\phi. The typical sets of \phi will be inverse images of ranges of u_\phi near its expectation. The point is that the inverse image of a particular value of u_\phi will not be the same as the inverse image of a particular value of u. So the typical sets will be different. The issue here is not about cancellation of density and area in computing total probability, nor is it about E(u_\phi) being different from E(u). Rather, the issue is that sets of constant u_\phi are different from sets of constant u. The rule for being in the typical set of \theta is different from the rule for being in the typical set of \phi, because the inverse image sets are different.
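Continuing the simulation, here’s a quick way to see that the two membership rules disagree (using \epsilon = 1 for both and estimating the entropy of \phi from the same draws):

H_phi <- - mean(u_phi)                     # estimated entropy under the phi parameterization
in_typical_theta <- abs(u + H) < 1         # membership rule for theta
in_typical_phi <- abs(u_phi + H_phi) < 1   # membership rule for phi
mean(in_typical_theta == in_typical_phi)   # noticeably less than 1: the draws selected differ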

Of the three items listed above, I think the one that could be most confusing is point #2.

There are various ways to handle point #2:

  • We can just accept that in moderate dimensions such as d=10, the 99% typical set can contain the mode.

  • We can talk about 50% typical sets instead of typical sets more generally. (Somehow from the discussions of typical sets within Stan, I’ve been picturing something like 99%, but that’s never really been specified.) But that won’t work if we’re using “typical set” to include most of the domain of integration.

  • We can work with a different summary of the distribution of \log p(\theta). Instead of an interval centered at E(\log p(\theta)), we could use a central interval or shortest interval of \log p(\theta) that contains specified probability. Take the inverse image of either of these sets and you’ll get an alternative “typical set” that is doughnut-shaped and excludes the mode in that 10-dimensional normal example. Bob’s case study would work fine if you take the typical set to be the central or shortest posterior interval of \log p(\theta). (See the sketch just after this list.)

  • We can talk less about the typical set and more about the distribution of the log density or target function. For example, instead of saying, “If we start our algorithm far from the typical set, then Nuts takes you toward the typical set, and not until then do you want to start sampling,” we can say, “If we start our algorithm with the log density far from its expectation, then Nuts takes the log density toward that expectation, and not until you are close to that expectation do you want to start sampling.”
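To make that central-interval option concrete, here is a rough sketch continuing the 10-dimensional normal simulation above:

central_99 <- quantile(u, c(0.005, 0.995))      # central 99% interval of the log density
central_99                                      # upper endpoint is below the mode's log density of -9.19
mean(u >= central_99[1] & u <= central_99[2])   # 0.99 coverage by construction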

I kind of like that last solution. Often when we talk about typical sets we can just talk directly about the log density or target function. I think this could be helpful for users (or developers) because this is the function that Stan is computing anyway!

Also, this last solution reinforces the message that the mode is not “typical.” If we want our simulations to be in the areas of \Theta where \log(p(\theta)) is near its expectation, then we don’t particularly want to be at the mode. By definition, the log density is at an extreme value, not at its expectation, when \theta is at its mode.

Again, I think most of the things that people in the Stan community have been saying about typical sets are vaguely correct. It’s just been hard for me to follow some of it without the precise definition, hence this post.


Thanks for writing this up Andrew.

I like the idea of thinking about this in terms of the log density being close or far from its expectation. The mode having the highest density but not being “typical” is a tricky point to get across, so I like that this is also a way to reinforce that.

I’m tagging @betanalpha and @Bob_Carpenter to make sure they see this because they are mentioned in this post. I’m curious to hear their thoughts since they’ve thought more about this than I have.

In relation to Stan pedagogy, I’m curious what other people reading this think about Andrew’s suggestion to move away from using the term “typical set” and speaking directly about the log density and its expectation so please chime in.

I like it. It will then provide a more concrete explanation of why looking at the split-Rhat of the log density (lp__) is useful. When we are still approaching the expectation, there is a trend, or the between-chain variance is bigger than the within-chain variance (assuming we are not locally stuck), and split-Rhat will be big. When any of the diagnostics indicate (the usual disclaimers about the limitations of the diagnostics hold) that we don’t have a relatively accurate estimate of the expectation of the log density, then we can’t know whether we are where we want to start the sampling. It will also intuitively explain that we need more than one draw to figure out what the expectation is and what being close to the expectation means.
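For example, here’s a minimal sketch of that kind of monitoring with the posterior package, assuming a cmdstanr fit object named fit (the name is just illustrative):

library("posterior")
draws <- as_draws(fit)                         # `fit` is a hypothetical cmdstanr fit
lp <- extract_variable_matrix(draws, "lp__")   # iterations x chains matrix of log densities
rhat(lp)                                       # split-Rhat of the log density
ess_bulk(lp)                                   # bulk effective sample size of the log density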


Talking about log p rather than the typical set has been very liberating, and now it is making me think of some other questions.

  1. One way to think of the typical set is in terms of level sets of log p. The typical set of theta corresponds to the level sets of log p(theta) near its expected value. But then this makes me think of . . . slice sampling! In slice sampling, we’re implicitly reparameterizing the density in terms of level sets of p, which is mathematically equivalent to level sets of log p.

In particular, slice sampling has two steps: Draw theta from within the level set of constant p(theta), and draw a value of p(theta) given theta. That first step (draw theta given p(theta)) is a lot like HMC. This is no surprise–the original NUTS algorithm uses slice sampling–but at the very least it’s helping me understand what people have been talking about when they say “typical set.”

  2. Hamiltonian paths have constant energy: that’s potential energy (i.e., - log p) and kinetic energy (from the speed of the particle moving through space). In his famous 2011 article, Radford talked about the challenge of HMC that it stays in a narrow band of potential energy. HMC can move efficiently within the level sets of log p, but it has random walk behavior in the dimension of log p itself.

But this makes me wonder: are there some HMC diagnostics that would help us understand this? We’re already monitoring log p and have a sense of how it varies. But we also know the distribution of kinetic energy of the Hamiltonian paths. As Bob put it (https://statmodeling.stat.columbia.edu/2020/08/02/the-typical-set-and-its-relevance-to-bayesian-computation/#comment-1399953), “In HMC, we use the gradient not to literally climb toward the mode, but to exert force toward the mode on our simulated particle that’s balanced with the particle’s momentum to keep the flow in the volume where we want to sample.”

So it seems that we should have internal evidence from our HMC runs (in warmup as well as in sampling) as to how the sampler is moving within and between level sets of log p. I’m wondering if this sort of thinking could move us beyond talking about the typical set?


I’ve a couple of times discussed that the No-U-Turn criterion is optimizing the efficiency of estimating E[theta] within the level sets of log p, but the efficiency of estimating E[g(theta)] in general is limited by the random walk behavior in the dimension of log p. A simple example is the multivariate unit iid normal, for which we may observe an effective sample size for estimating E[theta_i] that is 3 times bigger than the actual sample size, but an effective sample size for estimating E[theta_i^2] that is less than the actual sample size. Thus, if we are interested in efficient estimation of both E[theta_i] and E[theta_i^2], then we could modify the NUTS criterion to stop earlier and spend less computation time between energy re-samplings. This doesn’t reduce the random walk in energy, but it would give a better effective sample size per second for estimating E[theta_i^2].
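A rough way to see this from a fitted model, assuming an iid unit normal target with a parameter vector named theta and a hypothetical cmdstanr fit object named fit:

library("posterior")
theta1 <- extract_variable_matrix(as_draws(fit), "theta[1]")   # iterations x chains
ess_bulk(theta1)     # can exceed the number of post-warmup draws (antithetic exploration)
ess_bulk(theta1^2)   # typically much smaller, limited by the random walk in log p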

Mike’s energy Bayesian fraction of missing information (E-BFMI) (in Section 6.1 of A Conceptual Introduction to Hamiltonian Monte Carlo) is related: “ill-suited kinetic energies are easy to identify by visualizing the marginal energy density and energy transition density using the Markov chain itself” and “visualization quickly identifies when the momentum resampling will inefficiently explore the energy level sets, especially when aggregating multiple chains initialized from diffuse starting points”


Yeah like Aki said I think the energy diagnostics introduced by @betanalpha are related to this.

For example, the bayesplot package has a function bayesplot::mcmc_nuts_energy() that produces a plot (inspired by Michael’s paper) contrasting the marginal energy distribution with the transition energy distribution (in practice computed using first differences):
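A minimal sketch of the call, assuming a fitted rstan or cmdstanr object named fit (the name is illustrative):

library("bayesplot")
np <- nuts_params(fit)    # per-iteration NUTS diagnostics, including energy__
mcmc_nuts_energy(np)      # overlaid histograms of marginal energy and energy transitions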

This picture is for 4 chains trying to fit the centered parameterization of 8-schools. For the non-centered parameterization the histograms look the same as each other.

The typical set, at least as defined in information theory for the N = 1 case, doesn’t quite match what we’ve been talking about in lower dimensions. So I’m all in favor of trying to clean up our language and get more precise about what we’re talking about. At least I shouldn’t be saying “typical set” for the concept I have in mind (though it gets more similar as dimensionality increases because then the dimensions in the multivariate normal look like the elements of the typical set definition for N > 1). My case study mistakenly conflated central intervals and typical sets. I’m not so keen on highest-density intervals, but the one-dimensional case below shows where they’ll be different than central intervals.

I worked through the same examples as Andrew to try to figure out what he was getting at. I also created some plots and simulations to help understand what’s going on. Hope this helps.

30-dimensional standard normal

For example, in my case study, I wind up plotting 99.99% intervals of the log density, which does not match the bounds of a 99.99% coverage typical set following the definition with N = 1.

Given

Y \sim \textrm{normal}(0, \textrm{I}_{30})

here’s a histogram of \log \textrm{normal}(Y \mid 0, \textrm{I}_{30}).

The mean (plotted as a yellow vertical line) is the negative entropy, where entropy is defined as

\textrm{H}[Y] = -\mathbb{E}\!\left[\log \textrm{normal}(Y \mid 0, \textrm{I}_{30})\right].

The log density at the mode (plotted as a vertical red line) is \log \textrm{normal}(y^* \mid 0, \textrm{I}_{30}), where

y^* = \textrm{arg max}_{y} \, \textrm{normal}(y \mid 0, \textrm{I}_{30}) = 0.

The bounds of the 99.9% coverage typical set (plotted as a pair of vertical blue lines), using the definition with N = 1, are given by the smallest \epsilon such that

\textrm{Pr}\!\left[- \log \textrm{normal}(Y \mid 0, \textrm{I}_{30}) \in \left( \textrm{H}[Y] - \epsilon, \textrm{H}[Y] + \epsilon \right)\right] \geq 0.999.

The central 99.9% interval of \log \textrm{normal}(Y \mid 0, \textrm{I}_{30}) is plotted as a pair of vertical green lines.

The typical set boundaries for the N = 1 definition extend almost to the mode because the distribution of \log \textrm{normal}(Y \mid 0, \textrm{I}_{30}) is so skewed (the central interval extends much farther below the mean than above it, as plotted). In fewer dimensions, the mode would be included in the 99.9% typical set, while still being excluded from the 99.9% central interval. This is why we don’t think it’s doing what we were saying it’s doing in practice. At least not what I thought it was doing, because I thought the typical set would be almost identical to the 99.9% central interval.

Here’s a summary of 5 million draws for comparison to see how extreme they can get:

> summary(log_p_y, digits = 2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    -71     -45     -42     -43     -40     -30 

The tails are much fatter than you can see in the histogram with this many draws (I didn’t clip, which is why there’s so much space on left and right). But the maximum log density is still -30, whereas the log density at the mode is -28.

One dimension

The skew gets worse in fewer dimensions. In one dimension, here’s just the histogram with no styling:

The tails are pretty wide as can be seen from the whitespace in the histogram, but can be seen more clearly in the summary:

> summary(log_p_y1, digits = 2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -14.00   -1.60   -1.10   -1.40   -0.97   -0.92 

The log density at the mode is -.92, so sampling 5 million draws gets you to within two decimal places of the log density at the mode.

R code to try it at home

Here’s the R code

log_p <- function(y) {                     # std normal log density
  sum(dnorm(y, log = TRUE))
}

D <- 30                                    # dimensions to evaluate

H_Y <- D * 0.5 * log(2 * pi * exp(1))      # entropy H[Y], Y ~ multi_normal(0, I_D)
                                           # equal to D * H[U], U ~ normal(0, 1)

log_p_0 <- log_p(rep(0, D))                # log density at mode = 0

M <- 5e6
y <- matrix(rnorm(M * D), M, D)            # M standard normals of dimension D

log_p_y <- rep(NA, M)                      # log densities of M random draws
for (m in 1:M) {
  log_p_y[m] <- log_p(y[m, ])
}

printf <- function(msg, ...) cat(sprintf(msg, ...), "\n")

printf("dimensionality D = %d;  simulations M = %d\n", D, M)
printf("entropy H[Y] = %7.2f\n", H_Y)

printf("log density of some random draws:")
print(log_p_y[1:20], digits = 3)
printf("\n")

lb99_9 <- quantile(log_p_y, 0.0005)
ub99_9 <- quantile(log_p_y, 0.9995)
printf("99.9 pct interval of log density = (%7.2f, %7.2f)\n", lb99_9, ub99_9)


printf("at mode = 0, log p(0) = %7.2f\n", log_p_0)

for (epsilon in seq(10, 15, by=0.25)) {
  printf("epsilon = %5.2f  interval = (%7.2f, %7.2f)  coverage =  %7.3f",
         epsilon, H_Y - epsilon, H_Y + epsilon,
	 sum(abs(-log_p_y - H_Y) <= epsilon) / M)
}

library('ggplot2')
plot <-
  ggplot(data.frame(log_p_y = log_p_y), aes(x = log_p_y)) +
  geom_histogram(bins=100, color="white", size = 0.25) +
  geom_vline(xintercept = -H_Y, color = "yellow") +
  geom_vline(xintercept = -H_Y - 14.25, color = "blue") +
  geom_vline(xintercept = -H_Y + 14.25, color = "blue") +
  geom_vline(xintercept = lb99_9, color = "green") +
  geom_vline(xintercept = ub99_9, color = "green") +
  geom_vline(xintercept = 30 * dnorm(0, log = TRUE), color = "red") +
  scale_x_continuous() +
  scale_y_continuous(expand = c(0, 1e4)) +
  ggtitle(expression(paste("30-dim std normal ", Y, " ~ ", normal(0, I[30]))),
          subtitle = "99.9% typical (blue), central (green); -entropy (yellow), mode (red)") +
  xlab("log normal(Y | 0, 1)") +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
ggsave('log_p_y.jpg', plot = plot, width = 5, height = 4)

This is easily seen in practice by monitoring the lp__ in Stan, which is log density plus a constant that doesn’t vary across draws. It’s why I keep suggesting monitoring R-hat on lp__ as the single best convergence diagnostic. I haven’t tried to prove it’s sufficient, but I can’t recall having seen cases where it mixes more quickly than the slowest mixing parameter.

When the momentum p is drawn standard normal, the kinetic energy \frac{1}{2} p^{\top} p winds up being distributed as half a chi-square random variable(*). That bounds how much the log density can move because the Hamiltonian (negative log density plus kinetic energy) is conserved through the dynamics.

Livingstone, Faulkner, and Roberts have a more in-depth analysis in their paper Kinetic energy choice in Hamiltonian/hybrid Monte Carlo.

(*) This is for an identity mass matrix. In general, with a positive definite mass matrix M as we estimate in Stan during warmup, the kinetic energy becomes \frac{1}{2} p^{\top}M^{-1}p and the random momenta are drawn multinormal with covariance M, so the kinetic energy of a random momentum draw remains (half a) chi-squared.
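As a quick numerical check of the identity-mass-matrix case:

D <- 30
p <- matrix(rnorm(1e5 * D), ncol = D)   # momentum draws, standard normal in D dimensions
ke <- 0.5 * rowSums(p^2)                # kinetic energy of each draw
c(mean(ke), D / 2)                      # mean of half a chi-squared(D) is D/2
c(var(ke), D / 2)                       # variance is also D/2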


This is not how typical sets work in continuous spaces.

Those reading the Wikipedia page carefully will see that the explicit definition given is for discrete spaces with a finite number of elements. On these discrete spaces an invariant concept of entropy can be defined as the Kullback-Leibler divergence between some target distribution and the canonical, invariant counting measure, from which concepts like the typical set can be properly derived.

Even then, however, the explicit definition of the typical set is designed for asymptotic properties. The condition |\log \pi(x) - H| < \epsilon does not uniquely define a set – for any \epsilon there are an infinite number of sets satisfying this criterion. It’s only when the sequence gets large and asymptotic equipartition sets in that the typical set becomes more uniquely defined. When trying to construct an explicit set one has to speak of a typical set and not the typical set. Alternatively one can instead think about “the typical set” as the equivalence class of sets satisfying the defining property, in which case the idiosyncratic properties of any single element of the equivalence class are not relevant to the class itself.

There is not a corresponding definition for the typical set in continuous spaces because there is no longer an invariant reference measure and hence an invariant definition of entropy. Any attempt to generalize based on the non-invariant Lebesgue measure will of course lead to non-invariant consequences, which is exactly why people don’t do it. Note also that the concept of entropy on continuous spaces comes from classical physics, but those systems are defined not on the real numbers but rather special symplectic manifolds for which there is a unique reference measure that can be used to define the entropy as a Kullback-Leibler divergence.

What does generalize, however, is the defining qualitative feature of the discrete typical set: on continuous spaces a sequence of samples (i.e. an ensemble) will concentrate into relatively narrower and narrower neighborhoods as the dimensionality of the ambient space increases. It is based on this qualitative behavior that I introduced the concept of the typical set for statistical computation (intentionally avoiding the ill-defined generalizations based on entropy).

By definition an ensemble of samples approximately represents a given target distribution – empirical averages of functions converge to the target expectation values as the ensemble grows. The concept of a typical set allows us to ask where those samples congregate and hence build intuition for statistical computation. In particular the smallest set containing the samples provides a rough generalization of the classic information theoretical typical set, including the interpretation as a neighborhood that most effectively “compresses” the ambient space while preserving information about the target distribution.

For those interested in reading more about well-defined treatments of the typical set and its relationship to probabilistic computation I encourage you to read https://betanalpha.github.io/assets/case_studies/probabilistic_computation.html, especially Section 2, which has now been publicly available for over a year.

Regarding the energy diagnostic – the situation is different for Hamiltonian Monte Carlo. Once we lift to the cotangent bundle we induce a natural symplectic form and its associated invariant reference measure. This then allows us to well-define “entropy” as the Kullback-Leibler divergence from the reference measure to the lifted target distribution. The information theoretic generalization of the typical set will then align with the qualitative generalization based on sample behavior. In particular one could find a way of defining the typical set as a collection of nested Hamiltonian level sets, with the “width” of the typical set quantified in terms of an interval of energies that label each of the Hamiltonian level sets.

Hamiltonian dynamics are constrained to level sets of the Hamiltonian function, and a diffusion over the momenta fibers (i.e. occasional momentum resampling) is introduced to generate a one-dimensional random walk between those Hamiltonian level sets. The energy fraction of missing information or E-FMI (the “Bayesian” in the original paper is incorrect, based on my misunderstanding of the summary being used) quantifies the efficacy of this random walk transition and how limited the overall sampler behavior will be due to the random walk, regardless of how effective the trajectories are at exploring each level set. Because it is capturing only the shape of the random walk transition relative to the marginal energy distribution, however, it is not directly related to the geometry of the Hamiltonian typical set.


No wonder I was confused. I hadn’t realized you’d coined your own usage of the term.

Cover and Thomas, in their book Elements of Information Theory, introduce differential entropy (p. 224) and a notion of (weakly) typical set for i.i.d. draws from continuous distributions (p. 226). I have the first edition, but I suspect there might be a second. It’s exactly the same idea as for discrete variables only with differential entropy (which I agree is a weird concept given that it can be negative, but hey, that’s just densities not being bounded by 1). The concentration results are given in Theorem 9.2.2 in the same section on the AEP.

Right after the main definition, under “Properties”, the Wikipedia article mentions that the notion generalizes to continuous processes satisfying the AEP using differential entropy.

MacKay, in his book Information Theory, Inference, and Learning Algorithms (nice to see CUP using the Oxford comma), introduces a notion of typical set for continuous variables on p. 363 while discussing importance sampling to highlight the curse of dimensionality. This is very much how I was using it informally in my case study. But there’s no crisp definition of a subset of the sample space. He just says that draws from \textrm{normal}(y \mid 0, \textrm{I}_N) fall in a typical set around a radius of \sqrt{N}\sigma from the origin. The notion of typical set he does define precisely early in the book is just for sequences of discrete variables.

Whatever else is going on in these definitions, it’s useful to think about the distribution of \log p_Y(Y)!

I did an experiment to try to use this to replace the initial 75 draws with some sort of automatic thing (the assumption being that if you’re drawing p samples and they are vaguely what they should be, maybe the qs are verging on alright).

In retrospect I don’t think I evaluated it that well at all (I think it was just some plots on one model and never actually as part of warmup or anything). You can look at the kinetic energy all together or you can look at it for each dimension individually.

Hi Bob,

Many thanks for providing the R code and clear graphic. A few minor points.

  1. For the simple D=30 multivariate normal case, the lower and upper bounds can be determined from the chi-squared distribution. Thus this R code might be useful:

expected_mean <- -(D / 2) * log(2 * pi) - (D / 2)
expected_lb99_9 <- -(D / 2) * log(2 * pi) - qchisq(0.9995, df = D) / 2
expected_ub99_9 <- -(D / 2) * log(2 * pi) - qchisq(0.0005, df = D) / 2

These yield values for the lower and upper bounds (your green lines) which are very similar to the ones you found in the simulation. In the 1 dimensional case, a chi-squared on 1 df looks like it should replicate your second graph.

  2. For a more complex multivariate normal distribution (e.g. a D=30 with all off-diagonals 0.999 etc.), we might anticipate that it would look more like 30*(D=1 case) rather than the D=30 case. My logic here is that all 30 points will be very similar.

  3. If you are looking at the log density distribution as a mechanism for determining whether we have reached a stationary distribution, there are clearly many measures (e.g. ‘convergence’ in terms of absolute (relative) changes in parameters, log density etc.). I recall Monolix uses a Stochastic Approximation EM type algorithm (= not gradient based), whereby it has a ‘warm up’ phase, and then a ‘post warm up’ phase, which I think was based on something as simple as the slope (of their ‘density’) being considered ‘near flat’ in the ‘warm up’ phase. I think it worked very well, so you might want to check that out.

Best of luck,
Al

Let’s be very careful about the Cover and Thomas definitions. They do not define a typical set for a probability distribution over a continuous space but rather for a continuous random variable. The difference is subtle but important as “random variables” define not just a continuous ambient space but also a preferred parameterization of that space. In other words their definition is for the typical set of a specific probability density function, and shares all of the same transformation issues as probability density functions with respect to the Lebesgue measure. The consequences of this ambiguity are all over the place, especially when they try to build up intuition about the typical set as the “smallest volume” containing a given probability – any definition of small and volume will depend on a given parameterization. In other words their typical set of a random variable will share the same problems as trying to define “highest density intervals”.

This isn’t an issue for “discrete random variables” because of the invariance of the counting measure under bijections.

Because there are many parameterizations, or “random variables” in the Cover and Thomas parlance, that specify the same probability distribution there would be many typical sets corresponding to a single distribution, and that’s beyond the \epsilon dependence of the definition. Most of these typical sets, however, share the same qualitative features, especially distance away from the mode of the target density function within each parameterization.

I’m speculating that this is why MacKay abandoned the formal definition when discussing the consequences for importance sampling and it’s definitely why I didn’t bother when generalizing MacKay’s intuition to general statistical computation. Once you recognize that there is no unique typical set for a given probability distribution, but rather a collection with similar qualitative properties, you can demonstrate those qualitative properties with any convenient choice. In particular a typical set can be defined in terms of differential probability mass which provides an invariant definition (really covariant) that avoids the parameterization issues.

The samples generated by Markov chain Monte Carlo, and hence Markov chain Monte Carlo estimators, are invariant (again covariant if people want to get pedantic) to reparameterizations. Consequently the behavior of those estimators, such as their convergence, will not depend on any particular choice of parameterization. A probability density function within a given parameterization may be useful as a function that is sensitive to all of the parameters but in general there is nothing special about its convergence relative to that of estimators of other functions.


Michael:

You write: “The condition |\log \pi(x) - H| < \epsilon does not uniquely define a set – for any \epsilon there are an infinite number of sets satisfying this criterion.”

I’m confused. Assuming that pi(x) is the same as what I’m calling p(theta) in my original post above, it seems to me that for any epsilon that set is defined precisely, i.e. if you specify p(theta) (or pi(x)) and you specify epsilon, a unique set is defined. Bob and I did it in our examples with the R code.

Given a parameterization with parameter vector x and an associated p(x), then {x : abs(log(p(x)) - H) < epsilon} is a well defined set of x values.

And for epsilon chosen appropriately and dimension high enough, sampling uniformly on that set is close to the same as doing proper sampling. I’ve often wanted to utilize that insight to create a cheaper sampling procedure.


From Probabilistic Computation

This sampling perspective is particularly helpful for understanding some of the more subtle properties of the typical set. For example, the typical set is not the highest probability set for a given volume, especially for small volumes.

Also, for a given volume the typical set will be much less likely than the high density set and for a given probability coverage the typical set will be much larger than the high density set (as the volume/probability coverage gets larger they will converge and the difference between them will be negligible).

By the way, I think it may be better to use the “typical set” label to refer only to the case where it contains essentially all the probability mass and not to refer, for example, to a unit-volume typical set with probability 0.000000001.

Because we don’t know how an exact sample will fluctuate, and where in the typical set it will manifest, our best guess of where to find a single exact sample will be a neighborhood around the mode that encompasses small fluctuations in all directions. It’s only when we consider many samples that we are better off guessing neighborhoods in the typical set.

One immediate application of this intuition is in the context of gambling on the games in a tournament, such as the college basketball championship. Absent all other information, the best strategy for a single game is to bet on the favorite.

Betting on the favorite may or may not be the best strategy; it depends on how the odds offered relate to the actual probability of winning. Unless you just want to pick the winner: in that case, and assuming that the favorite actually has more than a 50% probability of winning, it will indeed be the right choice. In the following I will consider that interpretation.

Taking the favorite for all of the games, however, ends up being a terrible strategy.

Say there are four matches, the favorites have 3/4 chance of winning (no draws), and we pay $1 to bet on a set of results with a fixed payoff of $10 if we get all of them right.

The optimal bet is picking the favorites. The probability of winning is 32%. The expected payoff is $3.16 ($2.16 profit on our $1 ticket).

While an upset in any single game is rare, having no upsets over the entire tournament is rarer.

The probability of a favorite losing at least one of the four matches is 68%, which is indeed higher than 32%. But to bet on that event you have to buy 15 tickets, which will cost you $15. Which strategy is more terrible, spending $1 for an expected payoff of 32% x $10 = $3.16 (a 216% gain) or spending $15 for an expected payoff of 68% x $10 = $6.84 (a 54% loss)?

At the same time we don’t know which games will produce upsets, and so any single bet on the outcome of every game is exceedingly unlikely to win. In order to increase the odds of winning one has to make many bets, filling out the typical set of tournament outcomes.

Buying the typical set of tickets increases our chance of winning but lowers the expected payoff (if there were more matches in this example, it would lead to an expected loss). The typical outcome will be one favorite losing; there are four such tickets, so we spend $4 to buy them. The probability of winning when we hold the four typical-set tickets (42%) is higher than when we have the single optimal ticket (32%), but the payoff is much worse. Spending $4 on typical-set tickets we get $10 if we win (expected payoff $4.22, barely coming out ahead); spending $4 on the favorites ticket we get $40 if we win (expected payoff $12.66, tripling our money).
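For reference, the numbers in this example check out in R:

p_fav <- 3 / 4                             # probability the favorite wins each of the 4 games
p_all_fav <- p_fav^4                       # 0.32: the single favorites ticket wins
p_one_upset <- 4 * (1 - p_fav) * p_fav^3   # 0.42: the four one-upset ("typical") tickets combined
10 * p_all_fav                             # $3.16 expected payoff on $1 of the favorites ticket
10 * (1 - p_all_fav)                       # $6.84 expected payoff on $15 covering every other ticket
10 * p_one_upset                           # $4.22 expected payoff on $4 of one-upset tickets
40 * p_all_fav                             # $12.66 expected payoff on $4 of favorites tickets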

This is one way that casinos and professional gamblers are able to maintain steady profits while even the smartest casual gambler enjoys only small expected gains.


Regarding the college basketball championship: Indeed, it’s not clear why taking the favorite for all games should be considered “a terrible strategy.” What’s a good betting strategy depends on the betting odds. If all tickets cost the same amount, I agree with ungil that betting on the favorites is actually the best strategy. It’s true that to increase the odds of winning one has to make many bets, but in a betting situation it is not clear that making more bets is a good strategy: unless you have special knowledge, it will just increase your expected loss to buy more tickets.

The situation could be different, however, if you are not betting on the winners but instead you are making some sort of trick bet. For example, if the tournament has 63 games and you are betting not on the winner but on the number of upsets, then, yes, 0 upsets is an unlikely outcome. To connect to the discussion of typical sets: the mode of the distribution of u is not the same as the mode of the distribution of theta. (It can be confusing to write this, because u = log p(theta), and if we talk about “the mode of p(theta)” it sounds like we’re talking about the mode of theta.)

I’ve had this idea going around in my head for a while:

suppose you have a unimodal distribution. you find a single point inside the set of x such that log(p(x)) > E(lp) - dlp

now, sample from the following piecewise deterministic stochastic process:

choose a direction v at random, step the point forward as x(t) = x(t-1) + v*dt until you step outside the region. then go back to the last spot and choose a new velocity until the next point stays in the region… just continue doing this forever. record every nth step.

this should sample from the distribution “uniform on the set x such that p(x) > exp(mean(lp)-dlp)”

in dimensions more than around 25 or so, this distribution is a very good approximation to p(x). in dimensions higher than say a few hundred it’s as close as makes no difference

if you in addition take your sample and equilibrate it with a post-processing step of diffusive Metropolis Hastings, you get a sample from p(x) after not too much MH even for dimensions like 10. it may not be great for dimensions like 3 but the diffusive MH correction will work even there.

now, zero derivatives are required. the big problem with this sampler is the same as HMC where it can’t sample into or across funnely bits.

I have a very simple version of this written in Julia. anyone want to suggest a good sample problem for this sampler?

Btw I call it the white box sampler. It’s basically sampling from the distribution of positions for photons bouncing around inside a white-painted box.
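A minimal sketch of the process as described, under illustrative choices of step size, thinning, and level (standard normal target; nothing here is tuned):

# Sketch of the "white box" process described above: move in a straight line inside the
# region {x : log_p(x) > level}, and resample the direction whenever a step would exit.
white_box <- function(log_p, x0, level, dt = 0.1, n_keep = 1000, thin = 25) {
  d <- length(x0)
  stopifnot(log_p(x0) > level)                 # per the post, start from a point inside the region
  x <- x0
  v <- rnorm(d); v <- v / sqrt(sum(v^2))       # random unit direction
  draws <- matrix(NA_real_, n_keep, d)
  for (i in 1:(n_keep * thin)) {
    x_new <- x + dt * v
    if (log_p(x_new) > level) {
      x <- x_new                               # still inside: keep moving
    } else {
      v <- rnorm(d); v <- v / sqrt(sum(v^2))   # would exit: stay put, pick a new direction
    }
    if (i %% thin == 0) draws[i / thin, ] <- x # record every thin-th step
  }
  draws
}

# Illustrative run: 100-dimensional standard normal, region defined by E(lp) - dlp as in the post
d <- 100
log_p <- function(x) sum(dnorm(x, log = TRUE))
E_lp <- -(d / 2) * log(2 * pi) - d / 2         # expected log density of the standard normal
draws <- white_box(log_p, x0 = rep(0, d), level = E_lp - 2)
qqnorm(draws[, 1])                             # marginals of the (claimed) uniform-on-region draws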

I don’t think this algorithm would work well for example in the 100-dimensional normal distribution, as it would have random walk behavior within the set. HMC should work better.