I’m learning HMC and Stan. I just finished reading Betancourt 2017 (A Conceptual Introduction to Hamiltonian Monte Carlo) and realized that HMC excels at exploring the typical set: posterior density \times volume. This worries me because I have always thought that MCMC, including HMC, is a way to sample from the posterior distribution, so that we can plug the MCMC sample into a function and average to approximate the population mean of that function (i.e. Monte Carlo). If the samples are actually drawn from the typical set rather than from the posterior, whose distributions are different (see Fig 3 in Betancourt 2017), what am I calculating when I do Monte Carlo with this sample (i.e. plugging the sample into a function and then calculating the average)?

I guess I’m missing something in my understanding of HMC or Stan. Any help is greatly appreciated. Thanks!

From page 7 of the paper you cite: “The immediate consequence of concentration of measure is that the only significant contributions to any expectation come from the typical set; evaluating the integrand outside of the typical set has negligible effect on expectations and hence is a waste of precious computational resources.”
It’s not that HMC “samples from the typical set” — this is not even a well-defined sentence, mathematically speaking. It’s that only samples that belong to the typical set will have any real bearing on expectations. The typical set is the region of the sample space \mathcal{X} that has any appreciable mass (density \times volume) under the target distribution \pi.

I understand and appreciate what you said: were I to calculate the posterior mean, then HMC sampling mostly from the typical set would be great for that purpose. However, that still doesn’t fully answer my question. Please allow me to try to ask it in another way. If I plot the samples I got from HMC, will I get the posterior distribution? Or do I get a distribution of the typical set?

The typical set is defined in relation to the target distribution. Hence, “a distribution of the typical set” is not a well-defined mathematical object. Suppose you have a target distribution \pi defined on a sample space \mathcal{X}. If you are trying to obtain samples from \pi, a good sampler will give you samples that belong to the typical set \mathcal{T} \subset \mathcal{X}, since states y \in \mathcal{X} \setminus \mathcal{T} have negligible mass under \pi and hence do not contribute to any expectations. For the sake of intuition, think unidimensionally: if the target is \text{normal}(0, 1), any sampler worth its salt won’t produce samples in the neighbourhood of -100, since this region of \mathcal{X} has negligible mass under \pi. Just bear in mind that unidimensional intuition usually breaks down in higher dimensions.
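To make the unidimensional-vs-high-dimensional point concrete, here is a minimal numpy sketch (a standard normal standing in for a posterior — not Stan output). It shows concentration of measure directly: the typical distance of a draw from the mode grows like \sqrt{d}, so in high dimensions virtually all of the mass sits far from the point of highest density.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (1, 10, 100, 1000):
    # draws from a d-dimensional standard normal, standing in for a posterior
    x = rng.standard_normal((10_000, d))
    # distance of each draw from the mode (the origin)
    r = np.linalg.norm(x, axis=1)
    print(d, r.mean())  # mean distance grows like sqrt(d)
```

Every one of these draws is a perfectly valid sample from the distribution; the “shell” pattern is just where the mass (density \times volume) actually lives.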

Thanks again for the answer. Now I understand better. I do, however, feel I still need a bit more explanation or confirmation. Let me explain. The question that motivated me to create this post is that I want to find not only the posterior mean, but also the posterior mode and median. So whether the sample I got from HMC correctly reflects these quantities is crucial. In the unidimensional case you just mentioned, the sample indeed covers the mode, median, and mean, all of which lie in the typical set. In the high-dimensional situation, the typical set seems to be a ring around, but not including, the posterior mode (e.g. Fig 12 in Betancourt 2017). Can I still recover the posterior mode and median in this situation?
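To make my worry concrete, here is a quick numpy sketch (a 100-dimensional standard normal standing in for a posterior — hypothetical, not actual HMC output): the density is maximal at the mode, yet essentially no draw ever lands near it.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
x = rng.standard_normal((10_000, d))  # stand-in for posterior draws
r = np.linalg.norm(x, axis=1)         # distance from the mode (the origin)

# Density is maximal at the mode, yet essentially no draw lands near it:
# the probability mass lives in a thin shell at radius ~ sqrt(d).
print(np.mean(r < 5.0))  # fraction of draws within distance 5 of the mode
print(r.min())           # even the closest draw is far from the origin
```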

I learned that there are loss functions whose minimizers give the posterior mode and median. But those loss functions are non-differentiable, at least in some regions, so one can’t get analytical formulae to calculate these two quantities using the expectation approach. In other words, I cannot calculate the posterior mode and median by calculating expectations of some functions because these functions don’t exist. Is that right?

Alright, there’s a lot to unpack there. First, anything you may be interested in can be cast as an expectation. The posterior median, or 50% quantile, for instance, can be seen as the expectation of an indicator function. Again, the discussion about the typical set is about what it means to efficiently sample from a distribution. Obtaining good samples of \pi is equivalent to having samples that belong to \mathcal{T}. That’s what it means to sample from a (high-dimensional) distribution. So, under normal circumstances, HMC does sample from the correct distribution, and those samples are all you need to compute whatever quantities are well-defined (i.e. expectations of measurable functions which actually exist). You’re fixating on an illusory distinction between the (target) posterior \pi defined on a sample space \mathcal{X} and its typical set \mathcal{T} \subset \mathcal{X}. Maybe reading this by @Bob_Carpenter will help - don’t get thrown off by the catchy title and please heed his point (2).
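To make the “median as an expectation” point concrete, a minimal numpy sketch (hypothetical draws from a normal posterior with true median 2.0, not real MCMC output): the CDF at a point q is the expectation of the indicator 1\{x \le q\}, and the median is the q at which that expectation equals 0.5.

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical MCMC draws: a normal posterior with true median 2.0
draws = rng.normal(loc=2.0, scale=1.0, size=100_000)

def cdf_at(q, draws):
    # The CDF at q is the expectation of the indicator 1{x <= q},
    # estimated here by a plain Monte Carlo average over the draws.
    return np.mean(draws <= q)

print(cdf_at(2.0, draws))   # close to 0.5: 2.0 is (near) the median
print(np.median(draws))     # close to the true median 2.0
```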

Now, there are a few subtleties in all this. For instance, you can sample from a Cauchy distribution, but it doesn’t make much sense to compute the “posterior” mean. See this. In general, when things do not have well-defined expectations, we can’t employ (a variant of) the central limit theorem to analyse MCMC output, and things get complicated really quickly. See this and links therein.
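A quick numpy illustration of the Cauchy point (a direct simulation, not MCMC): the sample median behaves perfectly well, while the mean is simply not a meaningful target because the expectation does not exist.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_cauchy(100_000)

# The sample median converges to the true median (0) ...
print(np.median(x))
# ... but the mean is undefined: heavy tails keep producing extreme draws
# that dominate any running average.
print(np.mean(np.abs(x) > 1_000))  # a nonzero fraction of very extreme draws
```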

@maxbiostat, thanks very much for the prompt reply and the links, especially on a Sunday : )

Thanks to your explanations and Bob’s blog post, I think I understand it. In the following I try to summarize what I learned. Feel free to correct or add anything.

Typical set is a subset of the domain of the posterior distribution.

All Bayesian analysis is based on some expectations with respect to the posterior distribution.

To calculate the expectations efficiently we want to draw samples that belong to the typical set, which means we want to draw samples with frequencies proportional not just to the posterior density, but also (in high dimensions) to volume.

We can cast any of posterior mean, median, and any other quantile of the posterior distribution into expectations with respect to the posterior distribution.

Because the quantiles of the sample distribution (counted using indicator functions) (approximately) coincide with the quantiles of the posterior distribution (expectations of the same indicator functions), the sample distribution actually is the (approximate) posterior distribution. And therefore a scatter plot of the sample is an (approximate) plot of the posterior distribution.
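A quick numpy check of this last point (a standard normal standing in for well-mixed MCMC draws — hypothetical, not Stan output): if the draws really come from the posterior, empirical frequencies of the sample match the posterior’s probabilities.

```python
import numpy as np

rng = np.random.default_rng(3)
draws = rng.normal(0.0, 1.0, size=200_000)  # stand-in for well-mixed MCMC draws

# Empirical frequency of the event {-1 < x < 1} vs. its posterior probability.
inside = np.mean((draws > -1.0) & (draws < 1.0))
print(inside)  # close to 0.6827 for a standard normal
```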

On the right track. The crux of Bayesian analysis (others can feel free to interject here) is that everything you want to know is contained in the posterior distribution. And yes, many inferences take the form of expectations. It is important, however, to separate inference from computation. Not all Bayesian analyses necessitate MCMC and not all applications of MCMC are in Bayesian analysis or even to sample from distributions.

Correct.

Yup.

Assuming MCMC is working fine, this is correct.

Wish you the best of luck in your journey through statistical learning.