Subtree acceptance probablity in base_nuts.hpp:166

The relevant section in 2.39.0 base_nuts.hpp:163-172:

    if (log_sum_weight_subtree > log_sum_weight) {
      z_sample = z_propose;                           // always accept
    } else {
      double accept_prob = std::exp(log_sum_weight_subtree - log_sum_weight);
      if (this->rand_uniform_() < accept_prob)
        z_sample = z_propose;
    }

    log_sum_weight                                    // ← this is the update
        = math::log_sum_exp(log_sum_weight, log_sum_weight_subtree);

The update on line 171-172 is for the next iteration’s denominator. The acceptance decision on line 166 uses the old log_sum_weight (before the new subtree is included).

The problem: the acceptance probability should be W_new / W_total where W_total = W_old + W_new, i.e., including the subtree being sampled.

Concrete example:

  • W_old = 2, W_new = 1
  • Correct: P = 1 / (2+1) = 0.333
  • 2.39.0: P = 1 / 2 = 0.500 — over-accepts by 50%

Comparison to random_unform() which is [0,1] but accept_prob can be greater than 1!

I don’t understand this. What am I missing?

Best,

Andre

This code has precisely the form of a metropolis step comparing the new subtree to the tree so far, and the desirable effect of biasing selection towards the latter parts of the tree expansion. Stan doesn’t choose its update at random along the trajectory, but rather with bias towards the later parts of the trajectory, while preserving detailed balance. Perhaps @Bob_Carpenter might comment more authoritatively.

As @jsocolar says, this is intentional and correct. It’s mentioned in the reference manual, though looks like the link given there is wrong. Instead, you can find some discussion in Betancourt (2017) “Conceptual introduction to HMC”, specifically section A.3.2 Biased Progressive Sampling in the appendix.

It favors later trajectory more likely even it’s worse for the sake of avoiding correlations in sampling.

Typically the far half of the trajectory is much less correlated than the near half (that’s the entire reason we go to the trouble of constructing the full trajectory). I guess the difference is minimal at the point where the trajectory starts to curve back (and that’s where the u-turn criterion stops) but even then I doubt the latter half is ever significantly worse.

Nope. You got it right.

The motivating factor for designing NUTS was trying to maximize expected squared jump distance. The original NUTS paper talks about this and it’s all implicit in their code.

We talk about it in much more detail in our GIST papers and I have what I think is a much cleaner C++ implementation in the Walnuts repository here:

The expected square jump distance is proportional to (square, square root?) effective sample size. @andrewgelman said this to @matthewdhoffman about a bajillion times as Matt was developing NUTS and Matt’s solution to use “biased progressive sampling” (not Matt’s name for it) was brilliant. It does have one drawback, though—it tends to boost the effective sample size of parameter estimation, but slightly deflate variance effective sample size. I did a post on this ages ago that (I hope) helps visualize this issue with a 1000-dimensional standard normal target:

One thing that’s apparent here is that some step sizes lead to strongly anti-correlated draws, and when you go too far with that, the variance estimates are terrible. So we can’t just maximize expected square jump distance—we need to balance it, or as Chris Sherlock pointed out in his discussion of his Apogee-to-Apogee sampler, we can try to maximize expected square jump distance of our parameters squared rather than our parameters.

There’s more analysis of this in our Walnuts paper, specifically figure 3 which shows the result of biased progressive sampling on expected path length.