Hello everyone,
I am reading the paper A Conceptual Introduction to Hamiltonian Monte Carlo, and have a question about the typical set.
My understanding is that the whole point of designing HMC is to efficiently explore the typical set in high-dimensional space, and that in high dimensions the typical set is the more appropriate concept if we want to compute expectations. The typical set is a different concept from the mode of the distribution.
Generally speaking, a probability distribution can have multiple modes; a Gaussian distribution has just one. For a complicated probability distribution in high-dimensional space, do we have multiple typical sets? When reading the HMC paper I can see how the design aims to explore the typical set, but I do not see how it can help the simulation find different typical sets.
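For context, here is a quick simulation I used to convince myself that the typical set sits away from the mode (a minimal numpy sketch; the dimensions and sample count are arbitrary choices of mine). Draws from a d-dimensional standard Gaussian concentrate in a thin shell of radius about \sqrt{d}, far from the mode at the origin:

```python
# Quick check that draws from a d-dimensional standard Gaussian concentrate
# in a thin shell of radius ~ sqrt(d), far from the mode at the origin.
# Dimensions and sample count are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
for d in (1, 10, 100, 1000):
    radii = np.linalg.norm(rng.normal(size=(10_000, d)), axis=1)
    print(f"d={d:5d}  mean radius={radii.mean():7.2f}  sd={radii.std():.2f}  sqrt(d)={d**0.5:.2f}")
```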
Thanks.
I guess you could always say that there is only one typical set, but for multimodal distributions it is disconnected. And this is a problem for HMC, as it is difficult to move from the region close to one mode to another.
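To see this concretely, here is a minimal HMC sketch in plain numpy, run on a mixture of two unit Gaussians centred at \pm 10 (the step size, trajectory length and mode separation are arbitrary illustrative choices of mine, and this is not how Stan implements HMC). Starting near the left mode, essentially no draws ever reach the right mode:

```python
# Minimal HMC on a well-separated 1D Gaussian mixture, to illustrate how
# trajectories rarely cross the low-density region between the modes.
import numpy as np

rng = np.random.default_rng(0)
MU = 10.0  # half-distance between the two modes

def neg_log_density(q):
    # Target: 0.5 * N(-MU, 1) + 0.5 * N(+MU, 1), up to a constant.
    return -np.logaddexp(-0.5 * (q - MU) ** 2, -0.5 * (q + MU) ** 2)

def grad_neg_log_density(q, h=1e-5):
    # A finite-difference gradient keeps the sketch short; use autodiff in practice.
    return (neg_log_density(q + h) - neg_log_density(q - h)) / (2.0 * h)

def leapfrog(q, p, step_size, n_steps):
    p = p - 0.5 * step_size * grad_neg_log_density(q)
    for _ in range(n_steps - 1):
        q = q + step_size * p
        p = p - step_size * grad_neg_log_density(q)
    q = q + step_size * p
    p = p - 0.5 * step_size * grad_neg_log_density(q)
    return q, p

def hmc_step(q, step_size=0.2, n_steps=20):
    p = rng.normal()  # resample the momentum
    q_new, p_new = leapfrog(q, p, step_size, n_steps)
    h_old = neg_log_density(q) + 0.5 * p ** 2
    h_new = neg_log_density(q_new) + 0.5 * p_new ** 2
    # Metropolis correction for the discretisation error of leapfrog.
    if rng.random() < np.exp(min(0.0, h_old - h_new)):
        return q_new
    return q

q, draws = -MU, np.empty(5000)
for i in range(draws.size):
    q = hmc_step(q)
    draws[i] = q

print("fraction of draws in the right mode:", np.mean(draws > 0.0))
```

Crossing between the modes would require a trajectory to traverse a region of very low density, i.e. a high potential-energy barrier, which the momentum resampling makes exponentially unlikely. So the chain effectively explores only the component of the typical set it started in.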
do we have multiple typical sets?
One way to interpret this is in the sense that @jtimonen suggests: is the typical set guaranteed to be connected, or could it consist of multiple disconnected regions?
Another (probably far less useful) way to interpret this is to ask: is the concept of a typical set uniquely defined?
I’m not familiar with HMC, but I had a quick skim through the paper where the typical set terminology is introduced. It seems that the concept of a typical set is used informally to give the reader intuition, but there isn’t a formal mathematical definition of precisely what a typical set is. Fair enough too: a formal definition might derail a conceptual introduction with unhelpful formality.
Figure 3 depicts the typical set as a subset of the sample space Q where \pi(q) \, \mathrm{d} q is large.
One naive way to attempt to formalise the idea of a typical set could be in terms of how good an approximation is produced. For example, suppose we have Q, \mathrm{d}q and \pi(q) such that \int_Q \mathrm{d}q \, \pi(q) = 1. Fix a constant \epsilon > 0 expressing the maximum approximation error we are willing to tolerate. Then we might define any subset A \subseteq Q to be an “\epsilon-approximate typical set with respect to (Q, \mathrm{d}q, \pi(q))” iff A satisfies the bound \int_Q \mathrm{d}q \, \pi(q) - \int_A \mathrm{d}q \, \pi(q) \leq \epsilon.
That definition has the very modest advantage that we know at least one such set exists: we can always trivially take A = Q. But in any interesting scenario the definition is non-unique: there will be many such sets A that are reasonable approximations of Q.
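To make the non-uniqueness concrete, here is a quick numerical check (a sketch using scipy.stats.norm, with \epsilon = 0.01 an arbitrary choice of mine) that two rather different intervals both qualify as \epsilon-approximate typical sets for a standard 1-dimensional Gaussian:

```python
# Both a symmetric 99% interval and a lopsided one-sided 99% interval
# capture at least 1 - eps of the mass of a standard normal, so both
# qualify as eps-approximate typical sets. eps = 0.01 is arbitrary.
from scipy.stats import norm

eps = 0.01

def mass(a, b):
    return norm.cdf(b) - norm.cdf(a)

for a, b in [(-2.576, 2.576), (-2.33, float("inf"))]:
    print((a, b), "mass:", round(mass(a, b), 4), "ok:", mass(a, b) >= 1.0 - eps)
```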
Perhaps we could constrain our candidate typical sets to be superlevel sets: we might wish to select all the points q \in Q where \pi(q) \, \mathrm{d} q is sufficiently large, say where \pi(q) \, \mathrm{d} q \geq \delta for some constant \delta \geq 0, where the choice of \delta will depend upon our desired approximation error \epsilon > 0 and the geometry of (Q, \mathrm{d}q, \pi(q)).
Maybe for our second attempt at a formal definition of a typical set we could say that any subset A \subseteq Q is an “\epsilon-approximate superlevel typical set with respect to (Q, \mathrm{d}q, \pi(q))” iff there exists some \delta \geq 0 such that A = \{q : q \in Q, \mathrm{d}q \, \pi(q) \geq \delta \} and A satisfies the bound \int_Q \mathrm{d}q \, \pi(q) - \int_A \mathrm{d}q \, \pi(q) \leq \epsilon.
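As a sanity check that this second definition can produce something nontrivial, here is a sketch for a standard 1-dimensional Gaussian (scipy again, \epsilon = 0.01 an arbitrary choice): its superlevel sets are intervals [-c, c], so we can solve for the threshold \delta whose superlevel set captures exactly 1 - \epsilon of the mass.

```python
# For a standard normal, {q : pi(q) >= delta} is an interval [-c, c] with
# pi(c) = delta, so we can solve for the c whose interval captures exactly
# 1 - eps of the mass, then read off delta. eps = 0.01 is arbitrary.
from scipy.optimize import brentq
from scipy.stats import norm

eps = 0.01
c = brentq(lambda x: (norm.cdf(x) - norm.cdf(-x)) - (1.0 - eps), 0.0, 10.0)
delta = norm.pdf(c)
print("c =", c)          # about 2.576
print("delta =", delta)  # the superlevel threshold pi(c)
```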
With this very simple definition in terms of superlevel sets, it isn’t necessarily true that a nontrivial typical set will exist: consider the case where Q is the 1-dimensional unit interval, \mathrm{d}q = 1 and \pi(q) = 1 for all q \in Q. Then if we pick \delta \leq 1 we get A = \{q : q \in Q, \mathrm{d}q \, \pi(q) \geq \delta \} = Q (since \mathrm{d}q \, \pi(q) = 1 \geq \delta everywhere), and if instead we pick \delta > 1 then we get A = \emptyset. There are many subsets A \subseteq Q that would produce decent approximations of the integral \int_Q \mathrm{d}q \, \pi(q), but it doesn’t seem like we have a sensible way to pick any nontrivial typical set using superlevel sets.
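A quick numerical illustration of that last point (a trivial sketch, with \epsilon = 0.01 again arbitrary): under the uniform target on the unit interval, any sub-interval of length at least 1 - \epsilon is a perfectly good \epsilon-approximate typical set, yet none of them arises as a nontrivial superlevel set, because the density is flat.

```python
# Under the uniform target on [0, 1], every sub-interval of length at least
# 1 - eps captures at least 1 - eps of the mass, so there are many decent
# approximating sets, yet none is a nontrivial superlevel set, since
# dq * pi(q) = 1 everywhere. eps = 0.01 is arbitrary.
eps = 0.01
for a, b in [(0.0, 1.0 - eps), (eps, 1.0), (eps / 2, 1.0)]:
    print((a, b), "mass:", b - a, "ok:", (b - a) >= 1.0 - eps)
```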
So it might be more accurate to talk about a typical set rather than the typical set, since it doesn’t seem like a typical set is guaranteed to be uniquely defined. Hopefully it doesn’t matter exactly which typical set we pick as an approximation of the full set Q, as long as we can pick an approximation that has the right properties for whatever it is we are attempting to approximate.
There has been some discussion about the concept of the typical set in the thread The typical set and its relevance to Bayesian computation, which you might find interesting.
I discuss the typical set in far more detail in Section 2 of Probabilistic Computation. As @rfc notes, I stick to a qualitative definition of the typical set, as the quantitative definition from information theory doesn’t, in my opinion, provide much insight. In particular I present the typical set as a “fuzzy” set rather than an exact set, to abstract away some of the degrees of freedom in the formal definition.
From this qualitative perspective the most useful interpretation of the typical set of a multimodal target distribution would be the union of the “local” typical sets around each mode. In this case the typical set would appear to be disconnected (although remember that each component typical set is “fuzzy” and there’s some overlap in the fuzz). For some visualizations and more discussion see Section 3.2.2.2 of Markov Chain Monte Carlo in Practice.