Dual Averaging Explanations in NUTS

ygao43 · May 9, 2021, 1:22am

Hi, I am learning about dual averaging from the NUTS paper (2014). Could anyone help explain why the updates in equation (6) are defined the way they are? What are the exact expressions of the dual and primal problems here? The paper by Nesterov (2009) states the primal and dual problems very clearly, but the 2014 paper mentions neither. In particular, why is the average of x_{t+1} defined this way? Why is the update of x_{t+1} defined in terms of the sum of the H_i rather than in terms of x_{t}? What is the relationship between x_{t+1} and x_{t}, then? And why is this update called dual averaging specifically?

Thank you for your help in advance.

Yan

caesoma · May 13, 2021, 3:45pm

Can you elaborate a bit, for instance show how the expressions from Nesterov (2009) for the primal/dual problems are clear, and how that is different from what is presented in Hoffman, Gelman (2014)?

I never went into the details of the step size optimization for HMC, but just above that passage, on the same page, it is stated that x_{t+1} \leftarrow x_t - \eta_t H_t is used because it guarantees that h(x_t) converges to zero (with h(x) defined as the expected value of H_t conditioned on x), given suitable conditions on the step sizes \eta_t. That would be the answer to your question.
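As a toy illustration of the update x_{t+1} \leftarrow x_t - \eta_t H_t quoted above, here is a minimal sketch in Python. Everything specific in it is an assumption made for the example and not taken from the paper: the function name `robbins_monro`, the choice \eta_t = 1/t, and the made-up noisy statistic whose expectation is h(x) = x - 2.

```python
import random


def robbins_monro(noisy_h, x0, n_steps):
    """Iterate x_{t+1} = x_t - eta_t * H_t with eta_t = 1/t (an assumed schedule)."""
    x = x0
    for t in range(1, n_steps + 1):
        eta = 1.0 / t          # step sizes shrink, but their sum still diverges
        x = x - eta * noisy_h(x)
    return x


rng = random.Random(0)         # fixed seed so the run is reproducible


def noisy_h(x):
    # Hypothetical statistic H_t: its expectation is h(x) = x - 2,
    # so the root we hope to find is x = 2.
    return (x - 2.0) + rng.gauss(0.0, 1.0)


x_final = robbins_monro(noisy_h, x0=-5.0, n_steps=20000)
print(x_final)                 # should land near the root of h
```

Only noisy evaluations of h are ever used, yet the iterate settles near the point where the expected statistic is zero. That is the property the step size adaptation relies on.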

But I’m not giving you a real explanation. Another relevant paper cited there is Andrieu, Thoms (2008). I cannot explain it better without delving further into those two papers, but @andrewgelman and others here may be able to give a quick, informal explanation that makes this clearer.

Sorry I can’t be more helpful right now.

ygao43 · June 15, 2021, 9:32pm

I think it is still an open question to me :) Thanks

nhuurre · June 16, 2021, 9:39am

I’m not sure if it helps, but the Stan source has separate expressions for the dual and primal parts. I think s_bar_ is the average gradient (dual).

github.com/stan-dev/stan/blob/fc3fe7970d264818e3e948109d9c24f7abea5655/src/stan/mcmc/stepsize_adaptation.hpp#L60-L68

          // Nesterov Dual-Averaging of log(epsilon)
          const double eta = 1.0 / (counter_ + t0_);
          
          s_bar_ = (1.0 - eta) * s_bar_ + eta * (delta_ - adapt_stat);
          
          const double x = mu_ - s_bar_ * std::sqrt(counter_) / gamma_;
          const double x_eta = std::pow(counter_, -kappa_);
          
          x_bar_ = (1.0 - x_eta) * x_bar_ + x_eta * x;
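
To see how the running average in the snippet relates to the sum in equation (6), here is a rough Python transcription. It is only a sketch under assumptions: `counter` is taken to start at 1 (the increment is not shown in the excerpt), `s_bar` and `x_bar` are taken to start at 0, `H` stands for the sequence of `delta_ - adapt_stat` values, and the function names and parameter values are made up for the example.

```python
import math


def stan_style(H, mu, gamma, t0, kappa):
    """Running-average form, mirroring the C++ excerpt line by line."""
    s_bar, x_bar = 0.0, 0.0    # assumed initial values
    xs, x_bars = [], []
    for counter, h in enumerate(H, start=1):
        eta = 1.0 / (counter + t0)
        s_bar = (1.0 - eta) * s_bar + eta * h            # dual: averaged H
        x = mu - s_bar * math.sqrt(counter) / gamma      # primal iterate
        x_eta = counter ** (-kappa)
        x_bar = (1.0 - x_eta) * x_bar + x_eta * x        # averaged iterate
        xs.append(x)
        x_bars.append(x_bar)
    return xs, x_bars


def equation_6(H, mu, gamma, t0):
    """Closed form: x_{t+1} = mu - sqrt(t)/gamma * 1/(t + t0) * sum_{i<=t} H_i."""
    xs, total = [], 0.0
    for t, h in enumerate(H, start=1):
        total += h
        xs.append(mu - math.sqrt(t) / gamma * total / (t + t0))
    return xs


H = [0.3, -0.1, 0.25, -0.4, 0.05]          # illustrative values only
xs_stan, x_bars = stan_style(H, mu=1.0, gamma=0.05, t0=10.0, kappa=0.75)
xs_eq6 = equation_6(H, mu=1.0, gamma=0.05, t0=10.0)
print(all(math.isclose(a, b) for a, b in zip(xs_stan, xs_eq6)))
```

With `s_bar` starting at 0, the recursion gives (t + t0) * s_bar_t = H_1 + ... + H_t, so `x` here is the x_{t+1} of equation (6), and `x_bar` is the weighted average of the iterates.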

Rearranging equation (6) gives

\gamma\frac{t_{0}+t}{\sqrt{t}}\left(\mu-x_{t+1}\right)=\sum_{i=1}^{t}H_{i}

which implies that, for t \geq 2,

\gamma\frac{t_{0}+t}{\sqrt{t}}\left(\mu-x_{t+1}\right)=\gamma\frac{t_{0}+t-1}{\sqrt{t-1}}\left(\mu-x_{t}\right)+H_{t}

and now we can derive

x_{t+1}=x_{t}+\left(1-\sqrt{\frac{t}{t-1}}\left(1-\frac{1}{t_{0}+t}\right)\right)\left(\mu-x_{t}\right)-\frac{\sqrt{t}}{t_{0}+t}\frac{H_{t}}{\gamma}
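
The one-step form can be checked numerically against equation (6). The sketch below does that in Python. The function names and input values are made up for the example, and the recursion is started from x_2 computed directly from equation (6), since the one-step form has t-1 in a denominator and only applies for t >= 2.

```python
import math


def equation_6(H, mu, gamma, t0):
    """Closed form: x_{t+1} = mu - sqrt(t)/gamma * 1/(t + t0) * sum_{i<=t} H_i."""
    xs, total = [], 0.0
    for t, h in enumerate(H, start=1):
        total += h
        xs.append(mu - math.sqrt(t) / gamma * total / (t + t0))
    return xs


def one_step_form(H, mu, gamma, t0):
    """Same iterates, computed from x_t and H_t only (valid for t >= 2)."""
    x = mu - H[0] / (gamma * (1.0 + t0))   # x_2, from equation (6) with t = 1
    xs = [x]
    for t in range(2, len(H) + 1):
        h_t = H[t - 1]
        shrink = math.sqrt(t / (t - 1)) * (1.0 - 1.0 / (t0 + t))
        x = x + (1.0 - shrink) * (mu - x) - math.sqrt(t) / (t0 + t) * h_t / gamma
        xs.append(x)
    return xs


H = [0.3, -0.1, 0.25, -0.4, 0.05, 0.2]     # illustrative values only
a = equation_6(H, mu=1.0, gamma=0.05, t0=10.0)
b = one_step_form(H, mu=1.0, gamma=0.05, t0=10.0)
print(max(abs(u - v) for u, v in zip(a, b)))   # difference is rounding error only
```

Read this way, each step adjusts the deviation of x_t from \mu by a t-dependent factor and then subtracts the newest H_t scaled by \sqrt{t}/(\gamma(t_0+t)).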
