NUTS and ridges

Hi everyone,

I’m currently using NUTS (not Stan but my own implementation) to sample from a distribution which describes the probability of the parameters \theta_n for a set of sinusoidal basis functions, given some time series data d (where the number of basis functions is fixed). This corresponds to the general linear model: d=G_\theta b+\epsilon, where G_\theta is the matrix of basis functions with amplitude b and \epsilon is white Gaussian noise.

Assuming we simulate d using the same number of basis functions as the model, NUTS works fine for n=1 where the distribution has a single large peak. However, for n=2 NUTS seems to have a hard time finding the peak. In this case we can get a situation where one parameter value is correct but the other isn’t, which results in flat ridges in the distribution (see below).


Note the ridges I’m talking about are the ones close to 3\pi/4 and between 0 and \pi/4, which correspond to the true parameter values of the two sinusoids. The other narrow ridges around \pi/2 are a numerical artifact. Also note that the distribution is reflected along the lines \theta_1=\theta_2 and \theta_1,\theta_2=\pi and is periodic on the interval [0,2\pi].

In this case I find that the NUTS algorithm very often gets stuck on these ridges and does not get close to the big peak even after 1000 iterations.

I realise this isn’t strictly related to Stan but since Stan uses NUTS I was wondering if anyone had seen similar behaviour from NUTS? (or maybe someone has a better idea for how to tackle such a distribution?)

Thanks for reading! :)

Are you maybe encountering the multimodality of periodic likelihoods that I describe here?

I suspect @betanalpha will take issue with this passage, but he might be able to also provide some insight into these pathologies.

2 Likes

Does your HMC implementation do something like Stan’s metric adaptation, allowing the integrator to explore anisotropically along the ridges?

What I’m suggesting here is that HMC on this surface might require a transient period of metric adaption to the geometry of the ridge on which the sampler finds itself, followed by a period of re-adaptation to the peak. This is what Stan’s memoryless, windowed adaptation was born to do.

This might be part of the issue. Small peaks occur where both parameters are equal to the actual value of one of the sinusoids. There are also some small sidelobes to the main peak. Still, the chain seems to get fixed to the ridges specifically, rather than getting stuck on local maxima.

No it’s a very basic implementation of the efficient NUTS algorithm. When I was talking to @s.maskell about this yesterday he also suggested that metric adaptation might help the chain move along the ridge. At the moment my workaround has been to use a simple tempering scheme to flatten the distribution in the first iterations, although it would definitely be better to use something that was statistically sound.

@PhilClemson: I think we need to assess the version of NUTS that is in Stan (i.e. use mass matrix and step-size adaption). Maybe the thing to do is to describe this pdf as a Stan file and see what happens with Stan.

@jsocolar: are you confident that Stan’s variant of NUTS would navigate this kind of surface?

1 Like

@s.maskell nope! But that seems like a good candidate for what’s going wrong?

@jsocolar: while Phil’s example problem is 2D and involves going up a ridge then along to a peak, in general, we’re interested in a higher-dimensional version of the problem that involves the sample having to navigate between a sequence of saddle points before it gets to the peak. As I see it, we then need to change the mass matrix each time we get to the next saddle point. I’m not confident that Stan’s mass-matrix adaption will work well in that context.

@PhilClemson the sinusoids are of substantially different frequencies, right? Can you post the model?

Sure, the actual problem I’m considering is for data measured by a spatial sensor array. The design matrix for an array of L sensors measuring M sinusoids is:

G_{\theta,\omega}=\left[\begin{array}{c c c c} e^{j(\omega t+\alpha_{1}(\theta_{1},\omega))} & e^{j(\omega t+\alpha_{1}(\theta_{2},\omega))} & \ldots & e^{j(\omega t+\alpha_{1}(\theta_{M},\omega))} \\ e^{j(\omega t+\alpha_{2}(\theta_{1},\omega))} & e^{j(\omega t+\alpha_{2}(\theta_{2},\omega))} & \ldots & e^{j(\omega t+\alpha_{2}(\theta_{M},\omega))} \\ \vdots & \vdots & \ddots & \vdots \\ e^{j(\omega t+\alpha_{L}(\theta_{1},\omega))} & e^{j(\omega t+\alpha_{L}(\theta_{2},\omega))} & \ldots & e^{j(\omega t+\alpha_{L}(\theta_{M},\omega))} \\ \end{array}\right]

The temporal frequency \omega does vary between the sinusoids (e.g. I think I used 100Hz and 115Hz in the example above) but I’m calculating a product over all frequencies in the likelihood so this doesn’t have much impact. The parameters of interest are the angles \theta_n relative to the array, which cause different phase offsets in the sinusoids at each position in the array.

I did wonder about the multimodality aspect of the likelihood due to the periodicity. I think the only issue with this is that if the mode is at \tilde{\theta} then you don’t have p(\theta|d)\rightarrow0 for |\theta-\tilde{\theta}|\rightarrow\infty, which I think is usually assumed. However, I don’t believe (quite possibly wrongly!) that this is a requirement for NUTS. Instead, I think the usual issue with applying HMC / NUTS to multimodal distributions is that it you don’t get mixing between the modes (if the modes are simply reflections of a unimodal distribution then the samples can be transformed to the base mode, so I don’t think mixing is a problem in this case?).

1 Like

So the thetas defining your axes in the plots reflect the phase, right? Operationalized with a parameter in the model bound to 0-2*pi? Are you doing anything special to relate the two phases such that the theta2 seems to always be greater than theta1?

I was curious about the likelihood topography of this scenario, so I modified my likelihood mapper (attached below)
for two additive sinusoids with equal amplitude, frequencies of 100Hz & 115Hz, true phases of pi/4 & pi/4*3 and a sample rate of 1kHz and SNR of 10, and here’s the likelihood topography:

That’s with the point at phase1=0 & phase2=0 selected, here’s with the true-phases selected (or as close as I could click):

And just to be sure I was remapping your seeming phase2>phase1 constraint in my head properly, here’s with an explicit phase2>phase1 constraint of the topography:

So, pretty unimodal likellihood after all, but maybe there’s something in how you’ve parameterized the phases & the phase2>phase1 constraint that’s causing the behaviour you observe?

periodic_likelihood.r (12.0 KB)

2 Likes

Thanks for spending time on this! It’s really appreciated and it definitely helps rule out some sources for the problem.

Sorry, I probably should have included more specifics on this. In my case the model uses complex sinusoids, which means that the phase offset for each individual sinusoid is contained in the amplitudes b. The significance of the phases \alpha_l(\theta_n,\omega) shown above is that they model the difference in phase between the sinusoids in each time series (i.e. measured by different sensors). I didn’t want to get into the specifics of this phase relation as it varies depending on the array shape, but in the simple case shown here it was
\alpha_l(\theta_n,\omega)=\frac{2\pi\omega a(l-1)}{\cos(\theta_n)}
where a is a constant. In the example I used a=0‭.000675, so the phase difference between the sinusoids measured by adjacent sensors was about 0.004\omega\cos(\theta_n).

In the case of one sinusoid (which I think is almost analogous to your case) I also found the likelihood to be pretty unimodal, although with small sidelobes. There’s also the region of numerical instability around \pi/2 where \cos(\theta)=0 and \alpha_1,\alpha_2\ldots\alpha_L are undefined. NUTS seems to work better in this case.

1 Like

Adaptive Path Sampling in Metastable Posterior Distributions might help (helps with mutlimodality and entropy barriers). There is also a Stan compatible code at GitHub - yao-yl/path-tempering: continuous tempering by path sampling

3 Likes

I think that there are a few misconceptions about Bayesian inference and Markov chain Monte Carlo here, none of which have anything to do with Hamiltonian Monte Carlo in general.

Recall that the goal of Bayesian inference is not to find any single “best fit” model configuration but rather to explore all of the model configurations consistent with the observed data. When a model can be configured in many different ways that are all similarly consistent with the observed data then the posterior distribution will exhibit a degenerate geometry, such as the one seen here. For more discussion see Identity Crisis.

This is a modeling problem, not an algorithm problem. For example the symmetries shown here indicate that the basis expansion is ill-posed, at least for the limited data available.

Markov chain Monte Carlo is an algorithm that can implement Bayesian inference by generating Markov chains that meander through the neighborhoods of consistency model configurations. The more degenerate the posterior distribution the more difficult this exploration will be, even for sophisticated Markov transitions such as those constructed by Hamiltonian Monte Carlo.

Almost all Markov transitions are based on local moves away from one state to another, which makes exploring multiple modes nearly impossible. Running multiple Markov chains – not to mention other hacks that claim to find multiple modes – can identify isolated neighborhoods of consistent model configurations, but without a Markov transition capable of moving between those modes there will be no information to quantify the relative importance of those modes.

Even if we assume a unimodal posterior distribution degeneracies can manifest as ridges and other complex geometries. These complicate Markov chain Monte Carlo because the optimal transitions will vary strongly from state to state. Hamiltonian Monte Carlo uses gradient information to adapt the transitions to the local geometry of the posterior distribution, but this can do only so much.

Sometimes more sophisticated Hamiltonian Monte Carlo algorithms, such as Riemannian Hamiltonian Monte Carlo, can provide better adaptation but they are very fragile and require expert tuning to perform well. None of these methods are exposed in Stan.

To summarize I would not classify this as an algorithm problem. The model is extremely ill-posed which leads to atrocious posterior geometries. Your Hamiltonian Monte Carlo implementation is doing the best that it can, but no algorithm is up for quantifying the entire posterior distribution here. The most productive step forwards is developing the basis expansion to be more compatible with the data, or introducing an far more informative prior model to suppress the bulk of that degenerate behavior.

2 Likes

Hi again eveyone,

Sorry for my delayed response but I thought I’d give an update.

I’ve had a go at translating my model into Stan in order to double-check there wasn’t anything wrong with my implementation. Unfortunately there’s no equation solvers for complex matrices in the current version of Stan and after spending some time on a workaround for this I eventually gave up. Instead I’ve reverted to a simpler model which assumes a 0 phase offset for the sinusoids but does not require any complex numbers. The code is provided below in case anyone is interested.

data {
  int<lower=1> M;            // number of signals 
  int<lower=2> L;            // number of sensors
  int<lower=2> T;            // number of samples for each sensor
  real<lower=0> fs;          // sampling frequency
  real<lower=0> wave_speed;  // wave speed
  vector[T] d[L];            // samples from the sensors 
  real<lower=0> sep;         // separation between sensors
}

transformed data{
  int N = T*L;
  int f_n = T/2;                  // Nyquist frequency
  real N_2 = N/2.0;
  real dTd;
  array[f_n, L, 2] real GHd_factor;
  real sinfunc;
  array[f_n] real sumsinsq;
  array[f_n] real sumcossq;
  array[f_n] real sumsincos;
  
  dTd = 0.0;
  for (l in 1:L){
      dTd = dTd + sum(d[l].*d[l]);
  }
  for (l in 1:L){
      GHd_factor[:, l, 1] = rep_array(0.0, f_n);
      GHd_factor[:, l, 2] = rep_array(0.0, f_n);
  }
  for (omega in 2:f_n){
      // calculate Fourier transform
      for (t in 0:(T-1)){
          for (l in 1:L){
              GHd_factor[omega, l, 1] = GHd_factor[omega, l, 1] + sin(omega*t/fs)*d[l,t+1];
              GHd_factor[omega, l, 2] = GHd_factor[omega, l, 2] + cos(omega*t/fs)*d[l,t+1];
          }
      }
      // calculate other dot products
      sinfunc = sin(4*pi()*omega/fs)/(8*pi()*omega/(T*fs));
      sumsinsq[omega] = T/2 - sinfunc;
      sumcossq[omega] = T/2 + sinfunc;
      sumsincos[omega] = (1-cos(4*pi()*omega/fs))/(8*pi()*omega/(T*fs));
  }
}

parameters {
  real<lower=0, upper=2*pi()> theta[M];  // directions of arrival for each signal 
  real<lower=0> delta2;                  // signal to noise ratio
}

model {
  matrix[M,M] GHG;
  vector[M] GHd;
  real delta2_const;
  real alpha_factor;
  array[M, L] real alpha;
  
  delta2 ~ inv_gamma(2,0.1);
  delta2_const = 1+1/(N*delta2);
  
  for (omega in 1:f_n){
      GHd = rep_vector(0.0, M);
      GHG = rep_matrix(rep_row_vector(0.0, M), M);
      for (m in 1:M){
          GHG[m,m] = N;
          alpha_factor = 2*pi()*(omega/fs)*sep*cos(theta[m])/(wave_speed);
          for (l in 1:L){
              alpha[m,l] = (l-1)*alpha_factor;
              GHd[m] = GHd[m] + cos(alpha[m,l])*GHd_factor[omega,l,1] + sin(alpha[m,l])*GHd_factor[omega,l,2];
          }
      }
      for (m in 1:M){
          for (m1 in m+1:M){
              for (l in 1:L){
                  GHG[m,m1] = GHG[m,m1] + sin(alpha[m,l])*sin(alpha[m1,l])*sumcossq[omega] + (cos(alpha[m,1])*sin(alpha[m1,l])+cos(alpha[m1,l])*sin(alpha[m,l]))*sumsincos[omega] + cos(alpha[m,l])*cos(alpha[m1,l])*sumsinsq[omega];
              }
              GHG[m,m1] = delta2_const*GHG[m,m1];
              GHG[m1,m] = GHG[m,m1];
          }
      }
      target += -N_2*log((dTd - (GHd'/GHG)*GHd)/2);
  }
  
  target += -f_n*log(N*delta2 + delta2_const)/2;
}

I also changed the part that was causing numerical instability - after I found out what was causing it I realise I’d written down the model wrong (I needed to be multiplying by \cos(\theta) rather than dividing!). Here is what the posterior looks like for the updated model:

In any case, the issue still remains and the drawn samples do not match the analytic posterior. I believe this is not an issue with mutlimodality as we can always assume that \theta_2>\theta_1. As @betanalpha alluded to, I think the problem lies with the metric used in HMC which does not allow movement along the ridges.

1 Like

To add to the response from @PhilClemson, the ridges described above are examples of saddle points. Saddle points occur when sampling in other contexts (eg (bayesian) neural networks) so I suspect there may be ideas/techniques used in those worlds that might be able to help, assuming we agree that NUTS, as instantiated and configured in Stan, doesn’t play nicely with saddle points. Do we agree that saddle points are currently problematic?

A posterior density function doesn’t have to be multimodal to be problematic!

In my opinion what this example demonstrates is that (what I’m assuming to be) the Fourier deconvolution of the signal just isn’t able to recover the latent source, for example because there aren’t enough sensors, the sensors are placed in the right positions, the measurement times aren’t close enough or far enough apart, etc. In other words this is an experimental design problem more than a computational one, and the most effective resolutions concern the design of the experiment (improving the measurement, incorporating auxiliary measurements) or the prior model for the latent source (deconvolutions often require strong constraints on the length scales of the latent source in order to avoid degeneracies like these).

A more accurate quantification of the complex posterior here would technically allow for more robust decision making about the latent signal, but with these extreme degeneracies even with perfect computation there wouldn’t be much information to inform many useful decisions. One could attempt a more accurate quantification by, for example, carefully tuning a fragile thermodynamic method but the chances that one gets a complete quantification will still be slim and, again, one would still have to deal with the consequences of those degenerate inferences.

2 Likes

The experimental design definitely does make this a tricky problem. However, people have been successfully applying beamforming methods to these sort of problems for a long time and I’d like to think a computational Bayesian approach can do better!

I think I misdiagnosed the problem in the original example. While I thought the NUTS chains were getting stuck on the saddle points / ridges, they were actually converging to small sidelobes which were present on the ridge. These resemble an Airy pattern and are due to the finite resolution of the sensor array.

I’ve since slightly altered the parameters of the problem so that neither of the angles \theta_n are close to 0 or \pi. I also adjusted the colormap to highlight the local peaks. The trajectories for 10 chains using NUTS are shown below.

In this case the problem is more severe than before. The chains struggle to find even the ridges and in many cases jump over them.

I wanted to see how an optimizer would perform in this case so I did some further tests using Adam which are shown below.

As you can see, the algorithm suffers from similar issues. However, it does find the ridge in most cases.

I think saddle points are still an issue for NUTS although this is probably not the best problem to illustrate this.

I’ll push back on “successful” a bit here. Yes these inversion/deconvolution methods have been successful in some cases, but often only because of implicit heuristics to compensate for the degeneracy problems without really understanding those problems. There are also survival bias considerations (the applications where the degeneracy results in poor deconvolutions are often heuristically separated out as “inapplicable”, leaving applications where any point along those ridges results in equivalent results).

The benefit of Bayesian methods is that they make this degeneracy, and the experimental design problems behind them, impossible to ignore. Only by confronting these problems that have always been there but largely ignored historically can substantial improvements be made. It’s a pattern I see over and over again across a diversity of fields.

Looking at those plots it looks like ADAM converges to the ridges only when initialized close to the ridges. Assuming “NUTS trajectories” refers to the states of independent Markov chains, and not for example the numerical Hamiltonian trajectories that are used to generate Hamiltonian Markov transitions, I argue that the Markov chain Monte Carlo method is doing a far better job of exploring the likelihood function here. Individual chains are jumping between the ridge and the fringes pretty well, which isn’t surprising given how close the log likelihood values are (I’m assuming an effectively flat prior so that log posterior density is equal to log likelihood up to an offset). The difference between the ridge and the fringes (both high and low parts of the fringes!) is in the third decimal place so there’s little to keep the Hamiltonian Markov chains confined to the ridges.

1 Like