Pseudo-extended MCMC

That does look very discouraging. Did they report the samples just ignoring the warnings? I tried running it now too, and I get similar warnings on the banana. I agree that NUTS with higher adapt_delta appears to perform better.

It should be noted that the corrected samples actually have good Rhat; the problem seems to be with sampling the pseudo-samples. This is likely because low temperatures can induce near-uniform marginal densities. Inspecting the samples, there are indeed grotesquely large pseudo-samples; they just never get sampled by the post-hoc correction. Indeed, visually inspecting the pseudo-sample vs. pseudo-sample plot for one of the parameters
[plot: pseudo-sample vs. pseudo-sample scatter for one parameter]
indicates that the construction seems to allow exploration by one pseudo-sample as long as the other stays in the normal sample range. Still, that hardly redeems it. My gut feeling says that tweaking the temperature prior might reduce this issue, but it seems I have already been too optimistic in reading this.
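
For anyone reading along without the paper in front of them, my (possibly imperfect) reading of the construction is that the extended target over $N$ pseudo-samples is

$$
\tilde{\pi}(x_{1:N}) \;\propto\; \left[\frac{1}{N}\sum_{i=1}^{N}\frac{\pi(x_i)}{q(x_i)}\right]\prod_{j=1}^{N} q(x_j),
$$

with $q$ a tempered instrumental density, and the post-hoc correction then picks one $x_i$ per iteration with probability proportional to $\pi(x_i)/q(x_i)$. That weighting would explain why the grotesquely large pseudo-samples never show up in the corrected draws: they carry essentially zero weight.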

1 Like

Okay, my gut feeling was not completely off. Imposing a simple beta(3,1) prior on the temperatures pretty much took care of the Rhat issues. Still a little bit of treedepth saturation left though (but that’s more about efficiency than validity, no?).

WARNING:pystan:n_eff / iter for parameter min_lr is -1844674407370955.2!
WARNING:pystan:n_eff / iter below 0.001 indicates that the effective sample size has likely been overestimated
WARNING:pystan:Rhat for parameter min_lr is nan!
WARNING:pystan:Rhat above 1.1 or below 0.9 indicates that the chains very likely have not mixed
WARNING:pystan:8 of 5000 iterations saturated the maximum tree depth of 10 (0.16%)
WARNING:pystan:Run again with max_treedepth larger than 10 to avoid saturation
Inference for Stan model: anon_model_7b8e20df96cef2338124de5997ec2e88.
4 chains, each with iter=2500; warmup=1250; thin=1; 
post-warmup draws per chain=1250, total post-warmup draws=5000.

             mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
x[1,1]       -0.5    0.53  14.41 -30.84  -8.53   0.05   8.06  27.53    729   1.01
x[2,1]       0.33    0.45  14.18 -28.08  -8.58   0.23   8.97  30.07    976    1.0
x[3,1]       0.56    0.38  13.58 -25.33  -7.74   0.33   8.85  28.66   1289   1.01
x[1,2]      10.78    2.48  42.53 -11.21   -8.2  -3.12   12.8 105.49    293   1.03
x[2,2]       10.1    1.74  38.15 -11.12  -7.95  -2.26  11.86 109.84    483    1.0
x[3,2]       8.45    1.36  35.16 -11.12  -8.22  -3.07  10.62 108.33    668   1.01
beta[1]      0.63  5.8e-3   0.25   0.13   0.43   0.66   0.85   0.99   1904    1.0
beta[2]      0.62  6.5e-3   0.25   0.12   0.41   0.65   0.84   0.99   1514    1.0
beta[3]      0.62  5.7e-3   0.25   0.13   0.43   0.64   0.83   0.98   1901    1.0
y[1,1]      -0.05    0.05   1.44  -3.08  -0.85 5.0e-3   0.81   2.75    729   1.01
y[2,1]       0.03    0.05   1.42  -2.81  -0.86   0.02    0.9   3.01    976    1.0
y[3,1]       0.06    0.04   1.36  -2.53  -0.77   0.03   0.89   2.87   1289   1.01
y[1,2]      -0.02    0.02   1.46  -2.76  -0.81  -0.01   0.78    2.7   4893    1.0
y[2,2]      -0.02    0.02   1.45  -2.92  -0.85  -0.01   0.82   2.93   4975    1.0
y[3,2]      -0.01    0.02   1.43  -2.76  -0.85  -0.03   0.83   2.81   5074    1.0
index        0.99    0.01   0.82    0.0    0.0    1.0    2.0    2.0   3393    1.0
weights[1]   0.34  5.9e-3   0.23 1.4e-4   0.15   0.33   0.49   0.84   1490    1.0
weights[2]   0.33  7.1e-3   0.23 1.2e-4   0.14   0.32   0.48   0.84   1028    1.0
weights[3]   0.33  5.9e-3   0.23 1.2e-4   0.14   0.32   0.48   0.86   1565    1.0
z[1]         0.18    0.16  10.07 -19.51  -6.81   0.07   6.96  19.87   4213    1.0
z[2]         0.12    0.27  14.39 -11.06  -8.77  -5.24   3.27  40.99   2806    1.0
num[1]      -3.95    0.17   4.91 -11.74  -4.13  -2.87  -2.24  -1.88    803   1.01
num[2]      -3.89    0.12   3.16 -12.11  -4.23  -2.91  -2.27  -1.88    690    1.0
num[3]      -3.78     0.1   3.09 -11.76  -4.13  -2.87  -2.22  -1.87    888    1.0
denom[1]    -2.07    0.02   1.04  -4.75  -2.55  -1.88  -1.38  -0.55   2593    1.0
denom[2]    -2.06    0.02   1.08   -4.7  -2.53  -1.87  -1.37  -0.51   2723    1.0
denom[3]    -2.02    0.02   1.03  -4.63   -2.5  -1.83  -1.34  -0.54   2999    1.0
ratio[1]    -1.88    0.17   4.78  -8.87  -1.87  -0.96   -0.4  -0.04    835   1.01
ratio[2]    -1.83    0.11   2.95  -9.22  -2.02  -0.98  -0.43  -0.04    721    1.0
ratio[3]    -1.76     0.1   2.87  -9.18  -1.89  -0.99  -0.44  -0.04    909    1.0
min_lr        nan     nan    nan    nan    nan    nan    nan    nan-9.2e18    nan
lp__       -16.57    0.08   2.49 -22.44 -17.99 -16.22 -14.75 -12.79   1075   1.01

To compare, here is NUTS/HMC:

Inference for Stan model: anon_model_fbe80d3e795c159b146019f693007686.
4 chains, each with iter=2500; warmup=1250; thin=1; 
post-warmup draws per chain=1250, total post-warmup draws=5000.

       mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
x[1]   0.22    0.37   9.59 -18.37  -6.74   0.22   7.04   18.6    660    1.0
x[2]  -0.77    0.53  11.66  -11.0  -8.64  -5.17   2.62   33.3    493    1.0
y[1]   0.02    0.04   0.96  -1.84  -0.67   0.02    0.7   1.86    660    1.0
y[2]   0.03    0.02    1.0  -1.92  -0.65   0.03    0.7   2.06   3498    1.0
lp__   -2.8    0.03   0.94  -5.28  -3.17  -2.52  -2.13  -1.86   1029    1.0

The z variable in the pseudo-sample setting corresponds to x, so it does actually appear to help considerably with n_eff? But the jury is still out on whether it is worth it. HMC is still several times faster.

edit: I guess one of the main conclusions is that this is just not a difficult enough benchmark to really challenge HMC.

edit: for the record, the flower geometry appears to be a lot tougher for HMC, which remains stuck in a few lobes while PE navigates the whole density. The Rhats are around 1.05 in the worst case though, so it does have some problems navigating it.

2 Likes

What’s the flower geometry? Is it some kind of multivariate density?

One of the examples from the paper.

Thanks. Here’s the Stan code:

transformed data {
  real r = 10;
  real A = 6;
  real omega = 6;
}
parameters {
  vector[2] x;
}
model {
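  // exponential decay in the radius, modulated by a cosine of the polar angle (omega petals)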
  target += -0.5 * (hypot(x[1], x[2]) + r + A * cos(omega * atan2(x[2], x[1])));
}

This is probably another good case for convergence diagnostics. Here's the RStan call:

fit <- sampling(model,
                control = list(stepsize = 0.01, adapt_delta = 0.999, max_treedepth=11), 
                chains = 4, iter = 50000)

There was still one iteration that exceeded max treedepth 11 and one divergent iteration among the 100K iterations.

     mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
x[1]  0.15    0.15 3.35 -7.06 -1.32  0.03  1.79  7.22   499 1.01
x[2] -0.09    0.15 3.42 -7.15 -2.10 -0.17  1.95  6.78   518 1.01
lp__ -4.53    0.02 1.60 -8.54 -5.35 -4.20 -3.35 -2.41  7576 1.00

Nevertheless, it looks like the true posterior means (which should be zero) are within standard error, but that n_eff is terrible: this is 100K draws! We start to get nervous when the effective sample size is around 200 draws. Stan's definitely not exploring this posterior well, though I get the feeling it'd get there with more draws and ideally more chains.

Because of the low n_eff, the standard error for the means is pretty bad. I’m not sure what the true posterior sd is supposed to be. Although it appears to be roughly symmetric for x[1] and x[2], it may not be getting far enough into the tails.
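
As a rough sanity check on that standard error (assuming, as I believe is the case, that the se_mean column is just sd / sqrt(n_eff)):

sd_x1   <- 3.35          # sd of x[1] from the table above
neff_x1 <- 499           # n_eff of x[1]
sd_x1 / sqrt(neff_x1)    # about 0.15, matching the reported se_mean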

Here’s the code for the plot:

> post_x <- extract(fit)$x
> post_x1 <- post_x[ , 1]
> post_x2 <- post_x[ , 2]
> df <- data.frame(x1 = post_x1, x2 = post_x2)
> plot(post_x1, post_x2)
> library(ggplot2)
> ggplot(df, aes(x1, x2)) + geom_point()

Which is why we don't evaluate these things by eye:

So I tried 16 chains with some thinning so as not to kill my machine:

fit <- sampling(model, chains=16, iter=50000, thin = 25,
                control=list(adapt_delta=0.999,
                             stepsize=0.999, 
                             max_treedepth=11))

It takes about 20s/chain on my Mac notebook (6-year old i7). This produces much better results, as one might expect, with a total of 16K saved posterior draws.

> print(fit)
Inference for Stan model: flower.
16 chains, each with iter=50000; warmup=25000; thin=25; 
post-warmup draws per chain=1000, total post-warmup draws=16000.

      mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
x[1] -0.04    0.07 3.45 -7.37 -1.71 -0.02  1.61  7.35  2322 1.01
x[2] -0.05    0.07 3.44 -7.04 -2.06 -0.19  2.00  6.79  2271 1.01
lp__ -4.56    0.01 1.62 -8.60 -5.39 -4.21 -3.37 -2.44 15039 1.00

The sds are even a little higher. As you can see, the 2.5% and 97.5% intervals are terrible. This may indicate poor tail exploration, but they're still within MCMC standard error here, because 2.5% tail probabilities only use about 1/40 of the draws, meaning roughly 1/sqrt(40) of the effective sample size, which is still decent here at around 300.
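
Spelling out that last bit of arithmetic as a quick sketch, starting from the n_eff for the mean of x[1]:

neff_mean <- 2322        # n_eff for the mean of x[1] above
neff_mean / sqrt(40)     # roughly 370, i.e. a few hundred effective draws for the 2.5% tail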

3 Likes

Note that the n_eff above is for the mean, and you can't infer n_eff for quantiles from it. The flower is an interesting example, where the relative efficiency is higher in the tails. This happens because of slow mixing between the leaves of the flower, but fast mixing at the ends of the leaves.

Here’s the relative efficiency for 100 different quantiles of x[1]

For the 2.5% and 97.5% quantiles of x[1] the relative efficiency is about 0.85, which is about 8 times better than for the mean or median (0.1). For x[2] the corresponding relative efficiencies are about 0.75 and 0.1. These relative efficiencies are after the thinning by 25, so the actual relative efficiencies are quite bad compared to the usual performance of Stan.
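
In case it helps others reproduce this kind of check, here is a minimal sketch of one way to get an ESS (and hence a relative efficiency) for a quantile: plug in the empirical quantile, turn it into an indicator variable, and compute the ESS of that indicator. The function name is illustrative, and in newer rstan versions the ESS columns of monitor() are named differently (Bulk_ESS / Tail_ESS).

# draws: iterations x chains matrix of post-warmup draws for one parameter
quantile_reff <- function(draws, prob) {
  q_hat <- quantile(as.vector(draws), probs = prob)   # plug-in quantile estimate
  ind <- array(as.numeric(draws <= q_hat),
               dim = c(nrow(draws), ncol(draws), 1))  # indicator I(x <= q_hat)
  ess <- rstan::monitor(ind, warmup = 0, print = FALSE)[1, "n_eff"]
  ess / length(draws)                                 # relative efficiency = n_eff / total draws
}

x1_draws <- as.array(fit)[, , "x[1]"]   # iterations x chains array for x[1]
quantile_reff(x1_draws, 0.025)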

And here's a proposal for how the monitor(fit) output could look in the future:

> monitor(fit, probs=c(0.025,0.5,0.975))
Inference for the input samples (16 chains: each with iter = 50000; warmup = 25000):

      Q2.5   Q50 Q97.5 SE_Q2.5 SE_Q50 SE_Q97.5 Rhat Bulk_Reff Tail_Reff
x[1] -7.39 -0.01  7.33    0.13   0.02     0.13 1.01      0.11      0.26
x[2] -6.87  0.10  7.25    0.12   0.27     0.11 1.01      0.12      0.54
lp__ -8.58 -4.24 -2.41    0.05   0.01     0.01 1.00      0.94      0.99

For each parameter, Bulk_Reff and Tail_Reff are crude measures of relative
effective sample size for bulk and tail quantities respectively (good mixing
Reff > 0.1), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).

The interesting part is that the Monte Carlo error for the median of x[2] is larger than the Monte Carlo error for the 2.5% and 97.5% quantiles, but this is explained by the multimodal nature of the flower (which is stronger along that axis).

2 Likes

Why does slower mixing indicate higher relative efficiency?

Slower mixing indicates lower relative efficiency. Sorry if that paragraph wasn't clear on that.

I refer to the general methodology of adding momenta and simulating deterministic Hamiltonian trajectories as “Hamiltonian Monte Carlo”. Within that general methodology there are a myriad of particular implementations, depending on the distribution for the momenta, how trajectories are simulated, how points are chosen from the trajectories, how chosen points are corrected, etc. The “vanilla” HMC introduced in Duane et al and popularized by Radford Neal is one implementation. NUTS is another. The implementation in Stan is another still.

The biggest jump that NUTS introduced was using a dynamic integration time which significantly improves the performance and robustness of the resulting method.

Fun story: Gauss was Riemann's thesis advisor. In the German system doctoral candidates were given three possible topics by their committee, and Gauss made two of them intentionally impractical, leaving the third: "make up differential geometry". Riemann did not disappoint.

If we had to empirically verify every algorithm on the arXiv then we would get nothing done. Judging by history we really shouldn't be spending much time on it at all: look at all of the papers in the stats and machine learning literature that have claimed huge advances, only never to be talked about again. Building useful implementations of algorithms isn't easy, but if an algorithm were truly revolutionary then even poor implementations would quickly become popular (as evidenced by all of the poor implementations of bad algorithms that quickly became popular).

The only way to judge algorithms reasonably accurately in a reasonable amount of time is to step back and ask why would this be better? What additional information is being exploited or how is the geometry of the target simplifying? That information alone motivates elucidating empirical experiments.

Hope has no place in statistics and unfortunately most of us don’t have time for “but maybe it could work?”. At the same time hopefully our track record demonstrates that our intuition and judgement of new methods is reasonable.

4 Likes

This particular thread has been really nice because there was something that looked like it could work well enough to try and the answer fell out. It’s a shame it turned out to be a negative conclusion, but it’s nice to see the community trying out something (after it passed the triage point of “being possible enough that someone was willing to put the time in”)

1 Like

I also now ran pseudo-extended_flower.stan, for completeness.

With default options there are a lot of divergences and max treedepth exceedances.
With Bob's options for flower_target.stan:

fit <- sampling(model, chains=16, iter=50000, thin = 25,
                control=list(adapt_delta=0.999,
                             stepsize=0.999, 
                             max_treedepth=11))

There are no divergences, but a lot of max treedepth exceedances. Each chain takes 400-500 seconds, compared to less than 20 s without the pseudo-extension.

Rhat and relative efficiencies are worse than without pseudo-extension, and I would state that this has not yet converged.

> monitor(fit, probs=c(0.025,0.5,0.975))
Inference for the input samples (16 chains: each with iter = 50000; warmup = 25000):

                 Q2.5   Q50  Q97.5   SE_Q2.5 SE_Q50 SE_Q97.5 Rhat Bulk_Reff Tail_Reff
z[1]            -6.56  0.93   6.70      0.08   0.57     0.07 1.03      0.02      0.04
z[2]            -6.12 -0.03   6.09      0.03   0.07     0.04 1.03      0.06      0.04

Another win for convergence diagnostics instead of just eyeballing scatter plots of the draws.

3 Likes

Hi all,

I'm one of the co-authors on the pseudo-extended paper. Thanks to @Bonnevie for contacting me about this forum, and thanks to everyone else who has contributed to this very interesting discussion. I'm not an expert in Stan and I've learnt a lot about the software from following these posts. The paper is still a work in progress, so your thoughts will be very helpful when we revise it.

I definitely agree with @avehtari that we should have done a better job on the simulations for the flower and banana target. As already noted above, to avoid divergence issues you may need to use non-default Stan settings, or as @Bonnevie suggested, modify the prior on beta. However, I suspect that the best improvement would come from rethinking q, the proposal. The focus of the paper is on modifying the target space to improve the sampling from multimodal targets. We used HMC/NUTS because Stan is a great piece of software, but because we’re only modifying the target, we have the flexibility to use other MCMC algorithms as well, like random-walk or MALA.

In the multi-modal setting, taking q to be a tempered version of the target seems sensible and relates to other work in the literature. However, the pseudo-extended framework is quite general, and it's likely that for the flower or banana models there's a better q that could be used, although it's not obvious to me what this would be, and it's certainly going to be very model specific. For example, I noticed that the funnel example was mentioned. I haven't looked at this myself, but if I recall correctly there's an issue with calculating the derivatives at the peak of the funnel. I doubt that the pseudo-extended approach would help if q is the tempered target, but there may be a better choice of q that helps in a similar way to how re-parameterising the model helps.

The major issue with the pseudo-extended approach is the increased computational cost. On the extended space, we're increasing the dimension linearly with the number of pseudo-samples, and the main benefit seems to come from going from N=1 to N=2 pseudo-samples, with marginal gains thereafter. The trade-off is that for the increased cost we hope to better explore new regions of the state space. Fortunately, HMC scales better with dimension than many alternative algorithms (e.g. a random walk), but I suspect that our novice Stan code could be made more efficient, which would hopefully make things faster.

Thanks again for the comments. We’re still in the process of testing out other models and the suggestion from @avehtari of a regularised horseshoe model is something that we’re looking into at the moment. We’re aiming to release a new version of the paper early next year, and when we do I’ll add a comment to this thread regarding the changes we’ve made.

Best wishes,

Chris

3 Likes

Thanks @chris_nemeth for joining the discussion! It's great to know that the discussion has been helpful for you and that you still have time to react to the feedback. Even if the banana and flower turned out not to be such great examples, I still hope that there are cases where pseudo-extended MCMC is useful, and that we can all learn more about sampling from challenging distributions.

1 Like

Great example. Thanks.

How do I calculate effective sample size for 2.5% quantiles, given that they're a quantile calculation, not an expectation? Do I look at Pr[X > inv_cdf(0.025)]? That introduces a kind of chicken-and-egg problem, given that we don't know the inverse CDF.

That’s been my line, exactly. Fool me once, shame on you. Fool me a dozen times, shame on me.

This is precisely why I want a set of tools that will let other people do some empirical evals before suggesting we build something into Stan.

Like our early NUTS implementations in Stan 1.0! It wasn’t so much that the NUTS implementation was lousy, but that it had a bit of a ways to go to be more robust in adaptation and also was backed by a very inefficient math library compared to what we have now.

I sort of dislike that you’ve brought me around to your point of view. I’m generally both skeptical and hopeful, but my hope’s been dashed by trends in ML and stats, which I thought would be better than natural language processing, but turn out to be more of the same with more math.

This is a hard lesson to learn. For me, it was brought home most clearly when looking at some mixture lung clearance models and soil carbon models. There’s just no way that anyone could see the mixture components from looking at the data, but magically, solving the inverse problem is very reliable for simulated data.

Hey, I was just about to write that!

There is much to be hopeful about in some of the great modeling being done by our colleagues in the general Stan community! That's what keeps me going and ever vigilant about the user experience that we provide.