I’ll try it out in the 100 dimensional unit normal and get back to you. I think it’s going to work well myself. I agree that HMC should work better, but HMC also has a lot higher computational cost, so per unit wall time they may not be that far from each other.
@andrewgelman Ok, here are the preliminary results from 100 dimensional unit normal sampling:
First, we sample from a 100 dimensional unit multivariate normal… bouncing off when lp reaches the 99th percentile (as measured from an iid sample of mvnormal ahead of time… finding the correct lp threshold is a separate “warmup/tuning” type issue)
How long does it take?
julia> @time foo = wbsample(mvmaker(zeros(100),1.0),pth,zeros(100),1000,200,.25);
0.324761 seconds (1.03 M allocations: 880.091 MiB, 20.13% gc time)
So 1000 samples from 200 timesteps between samples with velocity 0.25 takes 0.32 seconds, which is pretty nice.
How well mixed is it? Let’s take a look at the traceplot of the first component:
This looks pretty well mixed at first glance. So we aren’t seeing obviously terrible diffusive type behavior like with small-scale diffusive MH
Here’s the density plot of the first component compared to iid sampling from a normal:
And a scatterplot of the first two dimensions (blue) compared to 2D iid sampling (red)
I think this would benefit from post-sampling “metropolization”. Basically this is kind of an “umbrella” sampler that gives you samples near equilibrium, but if each sample had further diffusive equilibriation, I think it’d pretty soon be indistinguishable from iid honestly.
Here’s a qqnorm plot comparing the first component to theoretical normal:
And here’s the autocorrelation plot for the first dimension:
Where I think this might struggle is when the geometry is less isotropic. So for example if the standard deviations of the dimensions varied from 0.1 to 50 for example.
WHOOPS: I used the 99th percentile highest value of the lp, but what I meant to do was to bounce off when we get to unusually small values of lp… which is why the tails were relatively light… most other plots look similar, though the density plot and the qqnorm are improved (this is for scattering when lp(x) hits 0.01 quantile)
and
Also note that this sampler is an HMC sampler on a distribution which is specially constructed to have gradient of lp = 0 everywhere that it’s defined, and is infinite at the boundary (ie. when it hits the “white wall”). Also, since lp is a constant, we can immediately accept every transition that takes us to a point where lp is defined, without doing any metropolis type correction.
Of course, this distribution isn’t exactly the distribution we want to sample from, but it’s a shockingly good approximation to that distribution. if we do replica exchange of this distribution and diffusive metropolis, every say 5 steps, we’re going to very often allow replica exchange, and we’re going to get the correct distribution we want from the diffusive metropolis chain. That chain will look like diffusion plus jumps when we do replica exchange.
Thanks—I noted that in my case study, but wanted to keep this discussion more general so I was avoiding analytic mathematics.
The goal’s not just to get the first draw (reaching stationarity) but to mix through the distribution well enough to estimate expectations accurately (large effective sample size for quantities of interest). The reference manual explains our R-hat measure of non-convergence and effective sample size estimators.
Right. That’s what we found in simulation.
And this explains our confusion about how general the invariance was.
That’s useful. But I don’t want to introduce a new use for “typical set” when there are already definitions. That just seems confusing.
We used the “standard” definition from Cover and Thomas. I think @betanalpha’s just agreeing with you here that it’s not the only definition that works for what we use typical sets for—the domain of integration or sampling.
Thanks for sharing.
I have a prototype for a parallel algorithm that does this using a combination of optimization and cheap MCMC.
Your whole approach is reminiscent of slice sampling along a trajectory with a fixed bound on the slice that’s not necessarily below the current position.
This doesn’t tell us much in and of itself (it helps that it at least looks like it fit!) We need time / effective sample size or its inverse to compare methods. If we can’t evaluate ESS, we can run multiple times and empirically estimate the variance in square error and use it to back out ESS. That’s what we did to calibrate our ESS estimators.
For comparison, RStan ran 4 chains in parallel with 1000 warmup iterations and 1000 sampling iterations each for this in 0.1s. Effective sample size per parameter is around 6000. NUTS makes anti-correlated draws for normal distributions so this is higher than the number of draws, which is only 4000. So that’s around 60K ESS/second on my iMac using 4 cores.
It’s also important to measure how well alpha[i]^2 converges to get a sense for how well we can estimate variances. With anti-correlated draws like from NUTS, that can be a lot slower than alpha[i]. For instance, in Stan’s NUTS version, ESS for the squared parameters is only around 3000 (and commensurately higher with fewer warmup iterations). This can get really extreme with HMC when the integration time is close to a half orbit, at which point the means converge ridiculously fast and the variances much more slowly.
If I cut warmup down to 200 rather than 1000 iterations, which is more than enough here, effective sample size per 0.1s goes up to 100K ESS/second for parameters and about half that for their squares. This is all being done with R’s data structures to save results.
I should’ve also mentioned that Matt Hoffman’s working on similar ideas of taking a bunch of very short MCMC chains (that haven’t gotten to coupling and thus technically aren’t quite from the stationary distribution yet) and averaging the results. Andrew and Yuling want to do the combination with stacking. So lots of stuff to explore here. The nice part about the just run once for the typical set thing is that it’s massively parallelizable.
The idea of just starting at random places and then running an optimizer to get past the lp threshold, and then running MCMC for a short time to equilibriate that single sample, is another one I have thought of. It’s problematic it obviously depends on how you choose starting points and other details. However the “white box” sampler really does sample from Uniform(x : lp(x) > threshold), it’s an honest to goodness markov process. So you can think of it as a very good proposal distribution for importance sampling. The replica-exchange between it and a diffusive MCMC is an importance sampling algorithm in essence.
I have Julia code to do the white box sampler. I would like to do a generic julia code to replica-exchange, and to diffusive MCMC, and then compose the three functions and throw a bunch of tests at it. If you’re up for working on this, I’d like to collaborate. If Matt Hoffman is interested, perhaps you can send an email to the three of us or something and we could have an offline discussion?
I think you have my email from previous interactions, but if not Andrew certainly has it.
That’s essentially what I built, but you have to figure out where that log density threshold is, so for points along the optimization paths I spawned short MCMC chains. If the log density of those went down, they were in the tail, if they went up, they were past the typical set toward the mode, and if they bounced around, they were in the typical set.
How do you use importance sampling? I thought you needed to know the proposal distribution.
OK.
I think if the log density goes down… they were towards the mode, and if it goes up they were in the tail… (remember lp is negative of the potential energy in HMC, this confuses me sometimes as well).
This is exactly the idea I was thinking for the “warmup phase” to figure out where the threshold should be… Just run an optimization algorithm N steps, then spawn off short MCMC runs to see if you’re at equilibrium lp yet. Then take the 0.01 quantile of a short chain that bounced around… and call it the threshold.
It’s not quite importance sampling. The idea of replica exchange (AKA “simulated tempering” AKA Umbrella sampling, AKA a lot of other things) is you run two chains in parallel, and then every so often you try to swap their states. The swap is allowed according to the typical Metropolis algorithm probability. In this case, since you’re sampling on “uniform in the high probability set” in one chain, and “diffusive MCMC in the proper posterior” in the other, you can pretty easily work out the metropolis acceptance probability. But from the perspective of the diffusive MCMC it’s more or less like “someone just proposed this uniform on the high probability set sample, now I’ll accept that with what’s equivalent to an importance weight… to get a sample from my posterior”
So the diffusive MCMC is just a way to kind of “warp” the “uniform on the high probability set” sample into an appropriate posterior sample in a way like when you warp the proposal distribution with importance weights. Except instead of providing weights, you accept swaps according to the weights. It’s a little like SMC (sequential monte carlo) reweighting.
Note here are some of the advantages:
-
The white box sampler NEVER rejects a sample, since all locations in the set have equal probability, it just goes around bouncing off the walls always forward. The main place it will have problems is in tight funnely bits of the posterior, where it may hit the boundary all the time basically just rattling around in a narrow neck.
-
The white box sampler doesn’t need ANY gradients yet it moves long distances. It is in fact an HMC sampler on a posterior with zero gradient everywhere (a uniform distribution on a complex strangely shaped set).
-
The white box sampler can be run on models that have no gradient that is, models like likelihood free / ABC / agent based models where you’re just simulating a dataset and then comparing it to your data set of interest.
I think it’s really a neat idea and would love to work on it with people who are interested in this kind of thing.
Yup. It wasn’t the potential energy, but all the negation I had to do to get the default optimi
I have a simple evaluation of this. The devil’s in the details of how long to run the short chains, how many leapfrog steps to allow, and then which chain to choose. I have an example code for the funnel in my opt-path-typical repo. The basic idea works for high dimensional normals and funnels.
Thanks. I often feel like I’m wandering through ML and stats without a Rosetta stone.
This bad conditioning is the bane of most ODE solvers. A bunch of people have looked at replacing the leapfrog with the implicit midpoint integrator, which is implicit, but still symplectic. The early results are encouraging from my perspective, but also require lots of tuning to make real.
Which being like importance sampling, can have real problems with variance and collapse to a single draw. How’s that get mitigated?
Is there a writeup of the algorithm somewhere I could see? I can’t quite follow the narrative details.
I think Wikipedia is readable and has the expression for the associated metropolis acceptance…
it also seems like
http://www.math.pitt.edu/~cbsg/Materials/Earl_ParallelTempering.pdf
is a good overview.
I think the equivalence to the problem of SMC collapse is the problem of never succeeding in exchanges, in which case your diffusive corrector just degenerates to diffusive Metropolis… so the key is to have a coupled chain that has sufficient overlap. in this case it will by construction
So here is my proposed algorithm…
initialization:
for each chain 1:c you want to run, run optimization for Max n steps… spawn a short diffusive Monte Carlo and see if it stays relatively stationary… continue until it does…
choose an lp threshold on the basis of say the 0.01 quantile of the above lp search.
run c parallel tempered chains, with the white box sampler that just goes forward until it hits the lp threshold and then bounces off randomly… in parallel with standard diffusive Metropolis, with probability of attempting swaps a tuning parameter, but 0.1 being a good first guess
finally, for each of the c chains, return the output of the diffusive Metropolis chain as your sample
Thanks. That’s what I was looking for, not a description of parallel tempering. I find it hard to piece together algorithms from narrative descriptions.
Apologies for reviving such an old thread - the “typical set” concept came up in some projects I’m working on right now, and this seems to be the most relevant discussion for some questions that have come up.
First, I was hoping to clarify something @betanalpha said above. @andrewgelman seems to have asked a similar question to mine, but none of the subsequent replies seem to address the part that’s confusing me.
How can this be true for fixed \pi and \epsilon? I may be missing something, but my thought is that, tautologically, there should be only one such set: all of the x-values which satisfy the condition. If there is an equivalence class of different sets (again, for fixed \pi and \epsilon), what would distinct members of this class look like?
Second question is more general and pertains to this:
If this is the case, why is it desirable for HMC to target the typical set? Of course, there is significant overlap between these sets - if I’ve understood correctly, the high-density set is simply the typical set with the mode added and the “outermost shell” excluded. In other words, we remove the volume at the boundary of the typical set, replace it with the smaller volume around the mode, and that’s the high-density set. But these sets have roughly equal mass, so the integral over one should be approximately equal to the integral over the other.
Thus, if posterior inference is just a matter of taking averages over the posterior, it shouldn’t matter whether the draws come from the typical set or the high-density set. Why is it preferable for HMC to sample from the “outermost shell” of the typical set instead of the area near the mode, given that the mass contribution of either would be about equal?
Shaun:
I think the quick answer is that there is a technical definition of “typical set,” but it does not correspond with the way that we’ve all been casually using the term “typical set” in Stan discussions. As described for example in wikipedia, “In information theory, the typical set is a set of sequences . . .”, but in Stan discussions we’ve used “typical set” to refer to a set of values of theta, with no sequences involved. That’s why I’m now trying to avoid the term “typical set”–it already has a definition, which is not the implicit definition we’re talking about. Safer, I think, to just talk about values of theta for which the target function (log p(theta|y), in a Bayesian context) is close to its expectation.
Regarding your final question: we want HMC to sample from the entire posterior distribution, which includes values of theta for which the target is close to its expectation and also other values, all in proportional to the posterior density. The point is that Hamiltonian paths will take you toward the target being close to its expectation, and HMC can be slow to move in the space of the target function (see Radford Neal’s 2011 article for some discussion of this point).
Because samples from the typical set are drawn from the joint posterior distribution of the parameters, and samples from the high-density set are not.
Suppose that I measure the position of a particle to be at the origin, but I know that my measurement is noisy. Suppose that the distribution of the measurement error is known. Now suppose that I want the posterior distribution for the true distance of the particle from the origin. If I draw independent samples from the posterior distribution, and I calculate their distance from the origin, and I take the mean of those distances, I get the posterior expectation for the particle’s distance from the origin. If instead I perform the equivalent process on samples from the high-density set, then I get the wrong expectation.
Jsocolar:
Again, I’d just be careful using terms such as “samples from the typical set,” because the concept of “typical set” is not really clearly defined in this context. I recommend saying something like “parameter values whose target function is close to its expectation.”
How so? In the limit where the typical set is defined it is not meaningfully different from a comparable high-density set. Both contain essentially all the mass of the distribution. Sampling from one or the other would be the same for all practical purposes as they overlap almost completely.
Leaving side all this talk about typical sets, are “sampling from the typical set” or “sampling from the high-denstity set” a thing?
They don’t overlap completely.
Where there is very high density, there may be very low volume. Probability mass is density integrated over volume.
If you were to sample only “high density” sets, you’d have a very, very small set compared to where most mass actually is. Even in, e.g., a high dimensional gaussian — The highest density region is right in the middle of the ellipsoid; but most of the mass is around it, in a elliptical shell. You’ll rarely even sample within the very center, simply because most of the volume is not where the small ball of high density exists. Curse of dimensionality, and whatnot.
The intuition that high density sets are near the expected values is misleading in high dimensions. In low dimensions, it may be a reasonable guess; but as dimensionality increases, the volume increases very rapidly, and the bulk of the mass is further and further away from the highest density regions.
What do we talk about when we talk about high-density sets? (Or about typical sets, for that matter.)
For example, Hyndman (“Computing and Graphing Highest Density Regions”, 1996) gives the following definition for high-density regions:
Let 𝑓(𝑥) be the density function of a random variable 𝑋. Then the 100(1−𝛼)% HDR is the subset 𝑅(𝑓𝛼) of the sample space of 𝑋 such that 𝑅(𝑓𝛼)={𝑥:𝑓(𝑥)≥𝑓𝛼}, where 𝑓𝛼 is the largest constant such that 𝑃(𝑋∈𝑅(𝑓𝛼))≥1−𝛼.
Is a 0.01% HDR bad? Sure.
When I wrote “comparable high-density set” I meant a high-density region with similar probability mass, close to 100%, as that not-so-well-defined “typical set” (or similiar volume and higher probability mass).
Defining a high-density set which is almost empty will make it look bad but an empty typical set may not be much better.
This is essentially the point I was getting at. The difference between a 99.9% “typical set” (I’ll try to move away from this terminology in subsequent replies, but for now it’s more convenient than “the set where \log p is close to its expectation”) and a 99.9% HDR is that we remove a thin outer shell and replace it with the mode. Why would that be less preferable? Either way the draws are from the posterior, with only a very low-mass difference in where exactly some of them come from.
Or am I thinking about this the wrong way? Is HMC designed to favour the “typical set” over the HDR, or does it go there simply because that’s what MCMC algorithms tend to do? In other words, I’ve been assuming that NUTS is meant to focus on the typical set “on purpose”, but now I’m wondering if I was mistaken.
Just to make sure I understand, the concept you’re talking about (values of theta for which the target function is close to its expectation) is the same thing as the typical set (as defined on Wikipedia) when n = 1, correct? If so, is the concern just to avoid ambiguity about sequences vs. not-sequences?
I’m a bit lost now in the terminology of the “typical set” versus the HDR, but I think it’s beside the point.
If the computation is working properly, an MCMC algorithm should be sampling from parameter space with probability proportional to the posterior density. It should not strictly confine itself to any particular set, but it should confine itself to any region of parameter space with probability proportional to the posterior mass that sits in that region.
An interesting observation (at least it’s interesting to me), is that in high dimensions the posterior mass concentrates far away from the posterior mode. The sampler does target the neighborhood of the mode with probability proportional to the posterior mass in that neighborhood–i.e. with higher probability than any equivalently-sized region of parameter space.
Neal’s funnel shows that HMC would sample most region well except the sharp neck, which is an HDR. So in this sense I think HMC does favor typical set, due to its fixed step size integrator (at least in this case). With ad-hoc proposals that bias on certain direction I can imagine one can deal with Neal’s funnel more effectively, so probably the conclusion doesn’t apply to all MCMC methods.