NUTS and ridges

Hi everyone,

I’m currently using NUTS (not Stan but my own implementation) to sample from a distribution which describes the probability of the parameters \theta_n for a set of sinusoidal basis functions, given some time series data d (where the number of basis functions is fixed). This corresponds to the general linear model: d=G_\theta b+\epsilon, where G_\theta is the matrix of basis functions with amplitudes b and \epsilon is white Gaussian noise.
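To make the setup concrete, here is a minimal Python sketch of simulating data from a simplified, real-valued version of this model (the actual problem described further down uses complex sinusoids and a sensor array); all numerical values are illustrative only.

```python
import numpy as np

# Minimal sketch of the general linear model d = G_theta * b + epsilon with
# sinusoidal basis functions. All values here are illustrative only.
rng = np.random.default_rng(0)

t = np.arange(0.0, 1.0, 1e-3)                  # 1 s of data at 1 kHz
freqs = np.array([100.0, 115.0])               # basis-function frequencies (Hz)
theta = np.array([np.pi / 4, 3 * np.pi / 4])   # parameters of interest (here: phases)

# Design matrix G_theta: one sinusoidal basis function per column.
G = np.column_stack([np.cos(2 * np.pi * f * t + th) for f, th in zip(freqs, theta)])

b = np.array([1.0, 1.0])                       # amplitudes
sigma = 0.3                                    # white Gaussian noise level
d = G @ b + sigma * rng.standard_normal(len(t))
```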

Assuming we simulate d using the same number of basis functions as in the model, NUTS works fine for n=1, where the distribution has a single large peak. However, for n=2 NUTS seems to have a hard time finding the peak. In this case we can get a situation where one parameter value is correct but the other isn’t, which results in flat ridges in the distribution (see below).


Note the ridges I’m talking about are the ones close to 3\pi/4 and between 0 and \pi/4, which correspond to the true parameter values of the two sinusoids. The other narrow ridges around \pi/2 are a numerical artifact. Also note that the distribution is reflected about the lines \theta_1=\theta_2, \theta_1=\pi, and \theta_2=\pi, and is periodic on the interval [0,2\pi].

In this case I find that the NUTS algorithm very often gets stuck on these ridges and does not get close to the big peak even after 1000 iterations.

I realise this isn’t strictly related to Stan but since Stan uses NUTS I was wondering if anyone had seen similar behaviour from NUTS? (or maybe someone has a better idea for how to tackle such a distribution?)

Thanks for reading! :)

Are you maybe encountering the multimodality of periodic likelihoods that I describe here?

I suspect @betanalpha will take issue with this passage, but he might also be able to provide some insight into these pathologies.


Does your HMC implementation do something like Stan’s metric adaptation, allowing the integrator to explore anisotropically along the ridges?

What I’m suggesting here is that HMC on this surface might require a transient period of metric adaptation to the geometry of the ridge on which the sampler finds itself, followed by a period of re-adaptation to the peak. This is what Stan’s memoryless, windowed adaptation was born to do.
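For readers who haven’t looked at how that adaptation works, here is a rough Python sketch of the per-window diagonal metric update; the window handling and the regularisation constants are simplified illustrations, not Stan’s actual code.

```python
import numpy as np

# Rough sketch of windowed, diagonal ("memoryless") metric adaptation: after
# each warmup window, the inverse metric is re-estimated from only that
# window's draws and earlier draws are forgotten. Constants are illustrative.
def adapt_diag_inv_metric(window_draws):
    """window_draws: (n_draws, n_params) array from the current warmup window."""
    n = window_draws.shape[0]
    var = np.var(window_draws, axis=0)
    # Shrink towards a small multiple of the identity for stability.
    return (n / (n + 5.0)) * var + (5.0 / (n + 5.0)) * 1e-3

# Usage sketch: sample a window with the current metric, then
#   inv_metric = adapt_diag_inv_metric(np.asarray(window_draws))
# and start the next (typically doubled) window with the updated metric.
```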

This might be part of the issue. Small peaks occur where both parameters are equal to the actual value of one of the sinusoids. There are also some small sidelobes to the main peak. Still, the chain seems to get stuck on the ridges specifically, rather than on these local maxima.

No, it’s a very basic implementation of the efficient NUTS algorithm. When I was talking to @s.maskell about this yesterday he also suggested that metric adaptation might help the chain move along the ridge. At the moment my workaround has been to use a simple tempering scheme to flatten the distribution in the first iterations, although it would definitely be better to use something that was statistically sound.
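For what it’s worth, a toy version of that tempering workaround might look like the sketch below; the schedule and constants are assumptions for illustration, not my actual code.

```python
# Toy sketch of the tempering workaround described above: sample from
# p(theta | d)^beta with beta < 1 early on (flattening the ridges), then anneal
# beta -> 1. Schedule and constants are illustrative.
def tempered_logp(logp_target, beta):
    """Log-density of the flattened target, p(theta | d)^beta."""
    return beta * logp_target

def beta_schedule(iteration, n_anneal=200, beta0=0.1):
    """Linear ramp from beta0 to 1 over the first n_anneal iterations."""
    if iteration >= n_anneal:
        return 1.0
    return beta0 + (1.0 - beta0) * iteration / n_anneal

# Draws taken while beta < 1 are not from the target distribution and would be
# discarded as burn-in, which is why this is only a heuristic workaround.
```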

@PhilClemson: I think we need to assess the version of NUTS that is in Stan (i.e. use mass-matrix and step-size adaptation). Maybe the thing to do is to describe this pdf as a Stan file and see what happens with Stan.

@jsocolar: are you confident that Stan’s variant of NUTS would navigate this kind of surface?


@s.maskell nope! But that seems like a good candidate for what’s going wrong?

@jsocolar: while Phil’s example problem is 2D and involves going up a ridge then along to a peak, in general, we’re interested in a higher-dimensional version of the problem that involves the sampler having to navigate between a sequence of saddle points before it gets to the peak. As I see it, we then need to change the mass matrix each time we get to the next saddle point. I’m not confident that Stan’s mass-matrix adaptation will work well in that context.

@PhilClemson the sinusoids are of substantially different frequencies, right? Can you post the model?

Sure, the actual problem I’m considering is for data measured by a spatial sensor array. The design matrix for an array of L sensors measuring M sinusoids is:

G_{\theta,\omega}=\left[\begin{array}{c c c c} e^{j(\omega t+\alpha_{1}(\theta_{1},\omega))} & e^{j(\omega t+\alpha_{1}(\theta_{2},\omega))} & \ldots & e^{j(\omega t+\alpha_{1}(\theta_{M},\omega))} \\ e^{j(\omega t+\alpha_{2}(\theta_{1},\omega))} & e^{j(\omega t+\alpha_{2}(\theta_{2},\omega))} & \ldots & e^{j(\omega t+\alpha_{2}(\theta_{M},\omega))} \\ \vdots & \vdots & \ddots & \vdots \\ e^{j(\omega t+\alpha_{L}(\theta_{1},\omega))} & e^{j(\omega t+\alpha_{L}(\theta_{2},\omega))} & \ldots & e^{j(\omega t+\alpha_{L}(\theta_{M},\omega))} \\ \end{array}\right]

The temporal frequency \omega does vary between the sinusoids (e.g. I think I used 100 Hz and 115 Hz in the example above) but I’m calculating a product over all frequencies in the likelihood so this doesn’t have much impact. The parameters of interest are the angles \theta_n relative to the array, which cause different phase offsets in the sinusoids at each position in the array.
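A small Python sketch of assembling that design matrix for one frequency and one time instant might look like this; the phase model alpha is passed in as a function, since its concrete form (given further down) depends on the array geometry.

```python
import numpy as np

# Sketch of building the L x M design matrix G_{theta, omega} above for a
# single frequency omega and time instant t. The phase model
# alpha(l, theta_n, omega) is supplied by the caller; the concrete form for
# the linear array in this example appears later in the thread.
def design_matrix(theta, omega, t, L, alpha):
    """theta: length-M array of angles; returns a complex L x M matrix."""
    M = len(theta)
    G = np.empty((L, M), dtype=complex)
    for l in range(1, L + 1):            # sensor index l = 1, ..., L
        for n in range(M):
            G[l - 1, n] = np.exp(1j * (omega * t + alpha(l, theta[n], omega)))
    return G
```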

I did wonder about the multimodality aspect of the likelihood due to the periodicity. I think the only issue with this is that if the mode is at \tilde{\theta} then you don’t have p(\theta|d)\rightarrow0 for |\theta-\tilde{\theta}|\rightarrow\infty, which I think is usually assumed. However, I don’t believe (quite possibly wrongly!) that this is a requirement for NUTS. Instead, I think the usual issue with applying HMC / NUTS to multimodal distributions is that you don’t get mixing between the modes (if the modes are simply reflections of a unimodal distribution then the samples can be transformed to the base mode, so I don’t think mixing is a problem in this case?).
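As a concrete example of the “transform to the base mode” idea, assuming the relevant symmetries are just the \theta_1 \leftrightarrow \theta_2 label swap and the 2\pi periodicity, one could fold the draws as follows:

```python
import numpy as np

# Fold draws back to a single "base" mode, assuming the only symmetries that
# matter are the theta_1 <-> theta_2 label swap and 2*pi periodicity
# (the reflection about theta = pi mentioned earlier is ignored in this sketch).
def fold_to_base_mode(samples):
    """samples: (n_draws, 2) array; wrap into [0, 2*pi) and order each pair."""
    wrapped = np.mod(samples, 2.0 * np.pi)
    return np.sort(wrapped, axis=1)
```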


So the thetas defining your axes in the plots reflect the phase, right? Operationalized with a parameter in the model bound to 0-2*pi? Are you doing anything special to relate the two phases such that theta2 always seems to be greater than theta1?

I was curious about the likelihood topography of this scenario, so I modified my likelihood mapper (attached below) for two additive sinusoids with equal amplitude, frequencies of 100 Hz & 115 Hz, true phases of pi/4 & 3*pi/4, a sample rate of 1 kHz, and an SNR of 10. Here’s the likelihood topography:

That’s with the point at phase1=0 & phase2=0 selected; here’s with the true phases selected (or as close as I could click):

And just to be sure I was remapping your apparent phase2>phase1 constraint in my head properly, here’s the topography with an explicit phase2>phase1 constraint:

So, pretty unimodal likelihood after all, but maybe there’s something in how you’ve parameterized the phases & the phase2>phase1 constraint that’s causing the behaviour you observe?
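In rough Python terms, that grid evaluation looks like the sketch below (the attached R script is the actual implementation; the Gaussian-noise likelihood and power-ratio definition of SNR here are assumptions of the sketch):

```python
import numpy as np

# Rough sketch of a likelihood map for two additive, equal-amplitude sinusoids
# at 100 Hz and 115 Hz, 1 kHz sampling, SNR 10 (power ratio), with the
# log-likelihood evaluated on a grid over the two phases.
rng = np.random.default_rng(1)
t = np.arange(0.0, 1.0, 1e-3)
f1, f2 = 100.0, 115.0
true_p1, true_p2 = np.pi / 4, 3 * np.pi / 4

signal = np.sin(2*np.pi*f1*t + true_p1) + np.sin(2*np.pi*f2*t + true_p2)
sigma = np.std(signal) / np.sqrt(10.0)        # noise level for an SNR of 10
d = signal + sigma * rng.standard_normal(len(t))

phases = np.linspace(0.0, 2.0*np.pi, 200)
loglik = np.empty((len(phases), len(phases)))
for i, p1 in enumerate(phases):
    for j, p2 in enumerate(phases):
        mu = np.sin(2*np.pi*f1*t + p1) + np.sin(2*np.pi*f2*t + p2)
        loglik[i, j] = -0.5 * np.sum((d - mu) ** 2) / sigma**2
```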

periodic_likelihood.r (12.0 KB)


Thanks for spending time on this! It’s really appreciated and it definitely helps rule out some sources for the problem.

Sorry, I probably should have included more specifics on this. In my case the model uses complex sinusoids, which means that the phase offset for each individual sinusoid is contained in the amplitudes b. The significance of the phases \alpha_l(\theta_n,\omega) shown above is that they model the difference in phase between the sinusoids in each time series (i.e. measured by different sensors). I didn’t want to get into the specifics of this phase relation as it varies depending on the array shape, but in the simple case shown here it was
\alpha_l(\theta_n,\omega)=\frac{2\pi\omega a(l-1)}{\cos(\theta_n)}
where a is a constant. In the example I used a=0.000675, so the phase difference between the sinusoids measured by adjacent sensors was about 0.004\omega/\cos(\theta_n).
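Taking the formula exactly as written, a small Python version of that phase model (and the resulting adjacent-sensor phase difference) would be:

```python
import numpy as np

# The phase model quoted above, implemented as written:
#   alpha_l(theta_n, omega) = 2*pi*omega*a*(l - 1) / cos(theta_n), a = 0.000675.
# Note the divergence as theta_n -> pi/2, where cos(theta_n) = 0; this is the
# numerical instability mentioned below.
def alpha(l, theta_n, omega, a=0.000675):
    return 2 * np.pi * omega * a * (l - 1) / np.cos(theta_n)

# Phase difference between adjacent sensors (l and l + 1):
#   2*pi*omega*a / cos(theta_n), i.e. roughly 0.004 * omega / cos(theta_n).
```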

In the case of one sinusoid (which I think is almost analogous to your case) I also found the likelihood to be pretty unimodal, although with small sidelobes. There’s also the region of numerical instability around \pi/2 where \cos(\theta)=0 and \alpha_1,\alpha_2\ldots\alpha_L are undefined. NUTS seems to work better in this case.


Adaptive Path Sampling in Metastable Posterior Distributions might help (it helps with multimodality and entropy barriers). There is also Stan-compatible code at GitHub - yao-yl/path-tempering: continuous tempering by path sampling.


I think that there are a few misconceptions about Bayesian inference and Markov chain Monte Carlo here, none of which have anything to do with Hamiltonian Monte Carlo in general.

Recall that the goal of Bayesian inference is not to find any single “best fit” model configuration but rather to explore all of the model configurations consistent with the observed data. When a model can be configured in many different ways that are all similarly consistent with the observed data then the posterior distribution will exhibit a degenerate geometry, such as the one seen here. For more discussion see Identity Crisis.

This is a modeling problem, not an algorithm problem. For example the symmetries shown here indicate that the basis expansion is ill-posed, at least for the limited data available.

Markov chain Monte Carlo is an algorithm that can implement Bayesian inference by generating Markov chains that meander through the neighborhoods of consistent model configurations. The more degenerate the posterior distribution the more difficult this exploration will be, even for sophisticated Markov transitions such as those constructed by Hamiltonian Monte Carlo.

Almost all Markov transitions are based on local moves away from one state to another, which makes exploring multiple modes nearly impossible. Running multiple Markov chains – not to mention other hacks that claim to find multiple modes – can identify isolated neighborhoods of consistent model configurations, but without a Markov transition capable of moving between those modes there will be no information to quantify the relative importance of those modes.

Even if we assume a unimodal posterior distribution, degeneracies can manifest as ridges and other complex geometries. These complicate Markov chain Monte Carlo because the optimal transitions will vary strongly from state to state. Hamiltonian Monte Carlo uses gradient information to adapt the transitions to the local geometry of the posterior distribution, but this can do only so much.

Sometimes more sophisticated Hamiltonian Monte Carlo algorithms, such as Riemannian Hamiltonian Monte Carlo, can provide better adaptation but they are very fragile and require expert tuning to perform well. None of these methods are exposed in Stan.

To summarize, I would not classify this as an algorithm problem. The model is extremely ill-posed, which leads to atrocious posterior geometries. Your Hamiltonian Monte Carlo implementation is doing the best that it can, but no algorithm is up for quantifying the entire posterior distribution here. The most productive step forward is developing the basis expansion to be more compatible with the data, or introducing a far more informative prior model to suppress the bulk of that degenerate behavior.
