I have a small query related to the scaling of likelihood value while using inside the HMC. I was hoping if somebody can shed some light on the issue described below.

Below is the equation used to accept or reject the proposal.

Let’s, say I am estimating a discrete choice model (MNL) with 1000 individuals. Now, if I take the sum of log-likelihood value for all 1000 individuals, the combined sum will be way to bigger in absolute magnitude than the kinetic energy. Next, if the proposed Hamiltonian is smaller in magnitude than the initial Hamiltonian (say even by 10 points), alpha will be 1.

Next, we use the dual-averaging algorithm to tune the step-size (epsilon) using the equations mentioned below.

where t0, mu, gamma_value are fixed parameter values for the algorithm as suggested in your paper.

Now what happens is that because alpha is 1, step-size (epsilon) explodes (goes from 1 to 50 in 10 to 20 iterations) and consequently, the model parameter values (beta’s) explode as well.

I was wondering if anyone has encountered such issues and is there some way to scale the log-likelihood (LL) value to avoid these issues.
I tried working with the average of the LL value which works fine if you have few parameters (say around 5). However, if the number of parameters increases (say around 20 or more), the average of LL value approach won’t work as total kinetic energy is way bigger in magnitude than the average LL value.

It’s not the likelihood inside HMC, but the posterior density up to a constant. So it also includes priors.

This is all well described in introductions to HMC (references on our web page, users >> documentation >> external).

This isn’t a problem as far as static HMC itself goes—you don’t need to scale anything here. The Hamiltonian should be preserved exactly if you simulate it exactly, so all the Metropolis step is doing in static HMC is throwing away draws where the Hamiltonian diverges from its true value. (Note: NUTS works very differently by the way and doesn’t involve a Metropolis step)

I don’t know anything about the dual averaging used for adaptation, but @betanalpha may be able to help here.

Thanks for the reply. Well, I am estimating the model using frequentist
approach rather than a Bayesian approach. Therefore I have the likelihood
and it’s analytic gradients. Finally, in stead of using the BFGS algorithm
for optimization, I use HMC approach.

So, I don’t assume any prior (and the constant can be just assumed to be
zero) and thus the posterior becomes a likelihood function.
I am testing the approach to estimate an MNP model. I have also attached
the paper which describes the formulation and the likelihood function in
the context of frequentist approach.

I have also tried the static HMC but the results aren’t promising.

I’m not sure what you think is going wrong here – if the difference in Hamiltonian is that large and negative then the proposal will absolutely be accepted. Yes, in high dimensions the likelihood has huge variations, which is why it’s so hard to fit high dimensional models in practice, but that’s a property of your model that cannot be removed without a reparameterization.

The kinetic energy will automatically adapt and absorb as much of this energy as it can. Sometimes this limits how far even an exact trajectory can move through parameter space, but the resolution is not to change the model but rather adjust the kinetic energy or consider a reparameterization. This is, for example, what happens when you try to fit a centered hierarchical model when the data are sparse. See https://arxiv.org/abs/1312.0906 for more details.

One concern could be that numerical issues might arise due to the limitations of floating point arithmetic, but all of the calculations in our implementation of HMC are very careful designed to accommodate these potential problems.

Also note that the dynamic HMC implementation in Stan does not simply evolve for some time and then accept or reject with the min(1, exp(-\DeltaH)), rather it samples a point from the entire trajectory all at once.

Thanks a lot for your prompt reply. I understand that things can go south
quickly in high dimension. However, at the moment I am testing the
Multinomial Probit model for a smaller dimension (maximum dimension of
integration is just 8). Further, the data used for estimating the model is
also synthesized.

When I estimate the same model using BFGS algorithm, I recover the
parameter absolutely fine (as I generate the data with known
parameter value). On the other hand, when I estimate the same model using
the same dataset using HMC (both static and NUTS), the recovery is not
good. I don’t sample from the whole trajectory in one iteration of HMC but
rather take the last value and use MH method.

However, even with this, I expected it to perform better than what I am
getting. I will be happy to share my code if you think I am making some
fundamental mistake.

Even though I am not using STAN, I have run the code in both Python and
Gauss. Both store value up to 16 significant digits, so don’t
think floating point/machine precision is an issue here.

I have been trying unsuccessfully to do a MNP model in various ways using Stan for five years now. Until shown otherwise, I am going to continue to believe that the MNP model cannot be done well with HMC, and doing it well with HMC is a prerequisite to seeing if any non-HMC approach accurately captures the uncertainty.

Hamiltonian Monte Carlo is not guaranteed to always work, which is why the implementation of dynamic Hamiltonian Monte Carlo in Stan is heavily instrumented to detect problems. For more information see https://arxiv.org/abs/1701.02434.

Presumably your implementation does not (at least you have not informed us of any diagnostics), in which case it’s impossible for us to tell what could be going on. But in all likelihood your model is pathological (it might have a reasonable mode, but the shape around that mode is not well-behaved) and even our best implementation of Hamiltonian Monte Carlo can’t fit it. Indeed as @bgoodri notes the kind of model that you are trying to fit is notoriously nasty.

I would replace “I don’t assume any prior” with “I assume improper, flat priors”. Those improper priors can be awfully strong in practice as they are saying that values of 1,000,000 are just as likely as values of 1 for a parameter.

I would suggest trying to add priors.

I didn’t understand the scope of the “I don’t sample from…” because what comes after is static HMC.

When you say “the recovery is not good”, what exactly do you mean? Do multiple chains not converge? Is the interval coverage not well calibrated vs. simulated data?

Floating point is always an issue. exp(-400) rounds to 0 and exp(400) overflows to +infinity. The closest you can get to 1 is about 1 - 1e-15, whereas you can have numbers very close to zero, such as 1e-300. There are always floating point issues to deal with. That’s why we do everything on the log scale, as multiplying long sequences of probabilities will round down to zero very quickly (e.g, exp(-400)). Even with 16 decimal places, if you multiple two Cholesky factors, the results won’t be symmetric because floating point isn’t associative (a + b) + c isn’t guarnateed to give the same result as a + (b + c).

So true. Even a simple MNP model with only fixed parameters is just so difficult to estimate. I got around the problem for a small number of parameters using Relativistic Monte Carlo (https://arxiv.org/abs/1609.04388). However, the HMC’s behaviour gets weird as the number of parameters increases.