Subtree acceptance probablity in base_nuts.hpp:166

andre.pfeuffer · May 21, 2026, 8:44am

The relevant section in 2.39.0 base_nuts.hpp:163-172:

    if (log_sum_weight_subtree > log_sum_weight) {
      z_sample = z_propose;                           // always accept
    } else {
      double accept_prob = std::exp(log_sum_weight_subtree - log_sum_weight);
      if (this->rand_uniform_() < accept_prob)
        z_sample = z_propose;
    }

    log_sum_weight                                    // ← this is the update
        = math::log_sum_exp(log_sum_weight, log_sum_weight_subtree);

The update on line 171-172 is for the next iteration’s denominator. The acceptance decision on line 166 uses the old log_sum_weight (before the new subtree is included).

The problem: the acceptance probability should be W_new / W_total where W_total = W_old + W_new, i.e., including the subtree being sampled.

Concrete example:

W_old = 2, W_new = 1
Correct: P = 1 / (2+1) = 0.333
2.39.0: P = 1 / 2 = 0.500 — over-accepts by 50%

Comparison to random_unform() which is [0,1] but accept_prob can be greater than 1!

I don’t understand this. What am I missing?

Best,

Andre

jsocolar · May 21, 2026, 11:20am

This code has precisely the form of a metropolis step comparing the new subtree to the tree so far, and the desirable effect of biasing selection towards the latter parts of the tree expansion. Stan doesn’t choose its update at random along the trajectory, but rather with bias towards the later parts of the trajectory, while preserving detailed balance. Perhaps @Bob_Carpenter might comment more authoritatively.

nhuurre · May 21, 2026, 4:28pm

As @jsocolar says, this is intentional and correct. It’s mentioned in the reference manual, though looks like the link given there is wrong. Instead, you can find some discussion in Betancourt (2017) “Conceptual introduction to HMC”, specifically section A.3.2 Biased Progressive Sampling in the appendix.

andre.pfeuffer · May 22, 2026, 6:05am

It favors later trajectory more likely even it’s worse for the sake of avoiding correlations in sampling.

nhuurre · May 22, 2026, 6:56pm

Typically the far half of the trajectory is much less correlated than the near half (that’s the entire reason we go to the trouble of constructing the full trajectory). I guess the difference is minimal at the point where the trajectory starts to curve back (and that’s where the u-turn criterion stops) but even then I doubt the latter half is ever significantly worse.

Bob_Carpenter · June 5, 2026, 7:40pm

Nope. You got it right.

The motivating factor for designing NUTS was trying to maximize expected squared jump distance. The original NUTS paper talks about this and it’s all implicit in their code.

We talk about it in much more detail in our GIST papers and I have what I think is a much cleaner C++ implementation in the Walnuts repository here:

github.com/flatironinstitute/walnuts

include/walnuts/walnuts.hpp

main

#pragma once

#include <cmath>
#include <cstdlib>
#include <functional>
#include <limits>
#include <optional>
#include <random>
#include <stdexcept>
#include <string>
#include <tuple>
#include <utility>

#include <Eigen/Dense>

#include <walnuts/concepts.hpp>
#include <walnuts/util.hpp>
#include <walnuts/validate.hpp>

namespace walnuts::detail {

This file has been truncated. show original

The expected square jump distance is proportional to (square, square root?) effective sample size. @andrewgelman said this to @matthewdhoffman about a bajillion times as Matt was developing NUTS and Matt’s solution to use “biased progressive sampling” (not Matt’s name for it) was brilliant. It does have one drawback, though—it tends to boost the effective sample size of parameter estimation, but slightly deflate variance effective sample size. I did a post on this ages ago that (I hope) helps visualize this issue with a 1000-dimensional standard normal target:

Stan forums (i.e., here!): HMC (jittered) vs. NUTS on 1000-dimensional standard normal

One thing that’s apparent here is that some step sizes lead to strongly anti-correlated draws, and when you go too far with that, the variance estimates are terrible. So we can’t just maximize expected square jump distance—we need to balance it, or as Chris Sherlock pointed out in his discussion of his Apogee-to-Apogee sampler, we can try to maximize expected square jump distance of our parameters squared rather than our parameters.

There’s more analysis of this in our Walnuts paper, specifically figure 3 which shows the result of biased progressive sampling on expected path length.

Topic		Replies	Views
Multinomial sampling when joining a new tree in dynamic NUTS implementation Algorithms	5	901	December 7, 2018
Walking through NUTS code Developers	1	650	December 20, 2016
NUTS vs HMC Algorithms	8	7367	August 4, 2020
Purpose of log_sum_weight Algorithms	2	812	February 8, 2018
Confused about accept_stat__ and delta Algorithms mcmc	10	2333	August 21, 2019

Subtree acceptance probablity in base_nuts.hpp:166

Related topics