Divergence / treedepth issues with unit_vector

We tried that in 2012 and ran into the problem that 0 and 2\pi were far apart in the unconstrained space, which is why we changed the underlying parameterization of unit_vector to what it is now. You can unwind it in a different way, but then \infty and -\infty in the unconstrained space both map to the same point in the constrained space, so that region has to have no posterior volume.


Yes, should have been light tails.

I just played a bit with different distributions, and it appears that a log-normal or a Gamma would both be more reasonable than the truncated normal. All plotted distributions have unit mean and a standard deviation of 0.1. I would probably take the Gamma(shape = 100, rate = 100) if I had to choose.
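For concreteness, here is a minimal Stan sketch of such a donut prior for a single angle, with the three candidate radius distributions as alternatives (the parameter values are my choices to give unit mean and a standard deviation of roughly 0.1, and the variable names are only illustrative):

parameters {
  vector[2] xy;                        // unconstrained helper vector
}
transformed parameters {
  real r = sqrt(dot_self(xy));         // radius of the "donut"
  real angle = atan2(xy[2], xy[1]);    // derived circular parameter
}
model {
  // one of the three candidate radius priors, all with (roughly)
  // unit mean and standard deviation 0.1:
  r ~ gamma(100, 100);                 // mean 1, sd 0.1
  // r ~ normal(1, 0.1);               // truncated at 0 by construction
  // r ~ lognormal(0, 0.1);            // mean exp(0.005) ~ 1, sd ~ 0.1
}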

Wow, those are some beautiful pictures!

I’d be happy to contribute. From my perspective combining makes more sense than writing something alone, because this would provide context, making it more “complete”. It would be difficult for me to achieve that on my own.

I’m not sure if I can follow you there. I understand that if we look at the two dimensions of the donut independently they have varying curvature, but from the perspective of the full 2-dimensional structure it should be almost uniform (although anisotropic), right? (Except in the center, of course.) Having it uniform in 1d would be nicer of course, but if this is not possible, making it uniform in 2d is probably the next-best solution.
Of course, if we change the truncated normal to a log-normal or gamma this might decrease uniformity, and the Jacobian might as well in theory, though in practice it might effectively improve things (by correcting for the curvature of the donut, helping the sampler to get around the curve). I’d have to think more about that.

Well, not trying to unwind it is exactly what led to this solution, so maybe we just have to deal with that, because if we unwind it we probably get the problem with the topology mismatch again. The nice thing about the donut is that while it’s living in a space that doesn’t match the topology of the circular variable, the shape of the donut actually kind of does match it, and thus kind of circumvents the topology mismatch problem.

That said, given a single-variable representation of a circular parameter, I don’t quite understand why one can’t just project the leapfrog position over to the other side when the sampler crosses the boundary of the circular variable’s range (e.g. -\pi to \pi). So, e.g., when the leapfrog jumps to -\pi - 0.123, move it over to \pi - 0.123, keeping the momentum the same. Intuitively I’d think that this would work fine with HMC, but it might be problematic for NUTS because it might break the detection of U-turns? I think such a solution would be ok for the \hat{R} and n_eff computations, as these values look ok at first sight for the angle derived from the donut representation.

Doesn’t that also produce the same problem as just using the space from 0 to 2pi, that the endpoints are not connected?

Yes, although infinity is a better place for that to occur than anything finite. Still, if your constrained point could be near -1, then it won’t work well.

I’d be happy to contribute as well. Maybe we could start with a case study and then expand into a more complete computational stats paper. As a starting point, we could compare different ways to represent the unit sphere as discussed here.

From what I understand, it is not possible to faithfully represent an angle with a 1-dimensional Euclidean parameter due to its circular geometry. The same would be true for the unit sphere in higher dimensions. Instead, such objects can be embedded into a Euclidean space of higher dimension, e.g. as done here with a unit vector in a (d+1)-dimensional space. Then, one could either impose a hard constraint to stay exactly on the d-dimensional subspace corresponding to the unit sphere or a soft constraint to keep the sampler close to it. The mathematically proper way of handling the hard constraint in HMC seems to be discussed in the paper Geodesic Monte Carlo on Embedded Manifolds - PMC.
This would require changing the internals of the sampler, as the Hamiltonian integration has to be projected back onto the subspace at each step. The soft constraint, as discussed above, could be added without such changes. Here, the sampler is free to move in \mathbb{R}^{d+1} while the length of the vector is constrained by a suitable prior, e.g. the Gamma suggested above. I don’t know which method works better in practice, but both should be an improvement over the current implementation, where the soft constraint is too weak except in high dimensions.
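To make the soft-constraint bookkeeping explicit (a sketch of the standard change of variables, writing D = d + 1): for \mathbf{x} \in \mathbb{R}^{D} with r = \lVert \mathbf{x} \rVert and \mathbf{u} = \mathbf{x} / r, the volume element factorizes as

d\mathbf{x} = r^{D-1} \, dr \, dS(\mathbf{u}),

so a prior p(r) on the radius with \mathbf{u} uniform on the sphere corresponds to the log density

\log p(\mathbf{x}) = \log p(r) - (D - 1) \log r + \text{const},

which is where the Jacobian term -(D-1)\log r discussed below comes from.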

Don’t think there is a silver bullet with respect to the curvature. In the end, the sampler has to run around the origin in order to traverse the angular space. At a larger radius the curvature will be reduced, yet the distance/scale that needs to be travelled increases. Maybe there is an optimum for the width of the donut though?


This is an excellent review of the situation. The changes required for implementing non-Euclidean parameter spaces are surprisingly comprehensive. Not only would the integrator have to change, but the storage of all of the parameters would also have to change to separate them into their varying topologies. Even with all of that, many of the diagnostics and the default chain output would have to change, because means, variances, and quantiles are no longer uniquely defined on non-Euclidean spaces.


I’ve never heard the term “anisotropic”, so I had to look it up. I just mean that if we have a donut, then the second-derivative matrix varies around the posterior. Maybe it doesn’t; it won’t in the hyper-donut defining the typical set for the normal, for instance.

This would all break our simple Euclidean NUTS in R^N. You could do that wrapping with custom samplers; it just won’t work with Stan as currently written. People have built HMC to work directly on hyperspheres [I see the Byrne and Girolami geodesics paper; Simon Byrne’s the one who did a lot of this work]. @betanalpha has written about how to translate NUTS to more general manifolds, so we can also have that. It just needs custom code for everything and it wouldn’t be clear how to integrate it with Stan’s sampler.

That would be really great.

To help decide on an implementation, I did some more tests to find out which donut shape is optimal and whether to use the Jacobian.

Test 1:

I tested two cases: a highly informative likelihood (-> narrow posterior angular distribution) and a completely uninformative likelihood (-> complete donut). The first test checks the occurrence of divergences due to the radial sharpening of the likelihood through the circular projection; the second test checks how fast the sampler can navigate the donut. It turned out that even in the second test divergences can occur; more on that below.

I tested both conditions with all three proposed donut shapes (normal, lognormal, gamma), with and without the Jacobian, and with various widths of the donut. The mean of the donut prior (the “donut radius”) was always approximately 1.

For the test of the uninformative likelihood I simply left the likelihood out completely.

For the test of the highly informative likelihood I fixed the von Mises precision parameter to 300 and fed the model 15 datapoints from a von Mises distribution of this precision. It turns out that this produces approximately the same posterior shape as in my first bunch of tests (posted above) with 100 datapoints and von Mises precision 300, because the precision prior in the first tests was quite stupid and led to a very biased precision estimate (~42), which made the likelihood wider than it would otherwise have been (which doesn’t matter because I didn’t rely on it anywhere). As in the first tests the performance and divergence figures stabilized at precision 300 and didn’t change much for higher precisions, the test at this value should be ±representative for “infinite” precision likelihoods.
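(For reference, a minimal sketch of what this informative-likelihood test model could look like, assuming the 2-D donut parameterization; the variable names are illustrative and the commented-out line shows the optional Jacobian variant:)

data {
  int<lower=1> N;                      // 15 datapoints in this test
  vector[N] y;                         // angles drawn from a von Mises with precision 300
}
parameters {
  vector[2] meanAngleVector;           // unconstrained helper vector
}
transformed parameters {
  real r = sqrt(dot_self(meanAngleVector));
  real meanAngle = atan2(meanAngleVector[2], meanAngleVector[1]);
}
model {
  r ~ normal(1, 0.1);                  // donut prior (or the lognormal / gamma variant)
  // target += -log(r);                // optional Jacobian term, -(D - 1) * log(r) with D = 2
  y ~ von_mises(meanAngle, 300);       // precision fixed at 300
}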

All tests were repeated 100 times with new random datasets.

Here are the results:

Test with highly informative likelihood:

Test without likelihood:

Plot explanation:
The x-axis shows which donut shape was used. The first number on each column denotes which distribution was used (1 = normal, 2 = lognormal, 3 = gamma) and the second number tells whether the Jacobian was applied (0 = no, 1 = yes). Everything else is as in the graphs in my posts further up.

Results summary:

Even if the likelihood is completely uninformative (in the test I just removed the likelihood altogether) there can be divergences. I think this can happen for two reasons:

  1. For the zero-avoiding distributions, if the sampler gets too close to zero there is a crazy gradient.

  2. In any case, if the sampler manages to ±completely leapfrog over the region around the origin, the Hamiltonian integration is probably very bad (not sure if this would lead to a divergence warning, as I don’t know exactly how divergences are detected).

Within the range of values tested, broad donuts are faster for uninformative likelihoods (easier for the sampler to navigate), while narrow donuts are faster for informative ones (because this reduces the radial sharpening of the likelihood). Outside the tested range other effects can apply, but that range is not interesting for our purpose (at the moment).

But for the sake of completeness:

  1. For uninformative likelihoods, if the donut sd is >0.3 it seems to stop getting faster and at some point actually gets slower again; I didn’t test this much. Presumably this is because the sampler then more often reaches the problematic central part of the distribution (but if sd >> 0.3 this doesn’t register as divergences). We can already see that starting at 0.3 in the plot. The special case of the normal donut without the Jacobian might be ±immune to this.

  2. For highly informative likelihoods I suspect that if the donut sd is made very small, the warmup time will increase again because it takes time to find the correct scale for the sampler (I didn’t test this).

The initially chosen donut width of sd=0.1 seems to be close to optimal.

With a highly informative likelihood the lognormal is best, followed by the gamma and then the normal. Applying the Jacobian makes the results worse. Without the likelihood it’s all reversed. Overall it doesn’t make much difference which of the distributions is chosen or whether the Jacobian is applied. This might be different in higher dimensions (especially for the Jacobian); here I tested only the two-dimensional case.

Discussion:

In real applications the energy distribution might vary (is this true?). I’m not sure what effect this would have, but I guess it might help the sampler to move into the problematic regions, thus increasing the chance of divergences or an increase in treedepth.

In real applications the shape of the angular distribution might be less “nice” than the von Mises, also increasing the risk of divergences etc. Such a less nice distribution could, amongst other things, also arise from the interaction with other model parameters; e.g. the estimate of the precision of the von Mises might be uncertain, leading to a “Neal’s funnel” type problem.

For these reasons it might be wise to err on the safe side and choose a narrower donut than sd = 0.1 * radius, although, of course, this might be forgone for the sake of performance. If issues with divergences occur, this can be countered by increasing the acceptance target.

I’m not quite sure if, in this special case, divergences do signal a bias in the posterior. I suspect that they do cause a bias because they alter the number of possible pathways between different parts of the posterior angular distribution, and thus the probability of moving from a certain angle to a certain other angle.

Test 2:

I also partly tested the effect of the Jacobian in higher dimensions. I used the normal(1, 0.1) donut prior and left out the likelihood (because I’m not knowledgeable in higher-dimensional angular math).

As expected, the length distribution of the vector grows with dimension without the Jacobian and is stable with the Jacobian:
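This is what the change of variables sketched further up predicts: without the Jacobian term, the density in (r, \mathbf{u}) coordinates is

p(r, \mathbf{u}) \propto p_0(r) \, r^{D-1},

so the implied radius distribution picks up an extra factor of r^{D-1} and shifts outward as the dimension D grows; with the -(D-1)\log r term included, r keeps the intended marginal p_0(r).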


What I don’t understand is the performance graph: it has weird patterns in it, but these seem to be consistent:

If we ignore the strange pattern, it seems that performance in high dimensions is good and that, again, it doesn’t matter much whether the Jacobian is applied or not, at least up to 128 dimensions. With the Jacobian I couldn’t go to 256 dimensions, because then initial values were rejected. I guess the Jacobian just puts out too crazy numbers in that case. Without the Jacobian it is possible to go higher.

(edit: What happened to the colours of the last 2 images?)


That is very interesting. In order to test with an informative likelihood you could use the von Mises-Fisher distribution. It generalizes the von Mises distribution to higher dimensions and has two parameters: a direction, specified as a unit vector \mathbf{\mu}, and a precision \kappa controlling how strongly the density concentrates around the mean direction. When we do not want to fit \kappa, the normalization constant can be dropped and the density supported on the unit sphere is just p(\mathbf{u}) \propto e^{\kappa \mathbf{\mu}^T \mathbf{u}}.

In two dimensions this reduces to the von Mises distribution with the same \kappa. The only difference is that the mean angle is not specified as \theta \in [0, 2\pi), but as a unit vector \mathbf{\mu} = (\cos \theta, \sin \theta)^T.

functions {
  real vonMises_Fisher_lpdf(vector x, vector mu, real kappa) {
    // Note: Does not include normalization and cannot be used to fit kappa!
    return kappa * dot_product(mu, x);
  }
}
data {
  int<lower = 2> D;
  unit_vector[D] mu;
  real kappa;
}
parameters {
  vector[D] x;
}
transformed parameters {
  real r = sqrt(dot_self(x));
  unit_vector[D] u = x / r;
}
model {
  r ~ gamma(100, 100);
  target += - (D - 1) * log(r);

  u ~ vonMises_Fisher(mu, kappa);
}

I quickly tested it using the above program. With the Jacobian included I did not encounter any problems with 256 dimensions (either with an informative likelihood, \kappa = 100, or without one, \kappa = 0).

Isn’t there one right way to either include or not include the Jacobian? Otherwise I’d think the sampling would be from the wrong target.

Right. That’s like zero data points—you just get the prior.

This is a good illustration of why the prior and likelihood matter. We run into this all the time, like with non-centered parameterizations for hierarchical models. It’s not the quantity of data, it’s the informativeness of data plus prior.

What was the y-axis on those plots?

You mean like in a hierarchical model where curvature varies around the posterior?

How are you measuring goodness here? I suspect it has something to do with the y-axis on those plots! What I want to see is simulation-based calibration to make sure we’re truly uniformly generating without a likelihood.

What looked strange to you? I didn’t know what to expect, so nothing looks strange to me. I still don’t even understand the y-axes of the graph, though.

No, I don’t think so, because the Jacobian only contains information about the length of the helper vector, which is not part of the inference we’re trying to do. It gets eliminated when the vector is projected onto the unit hypersphere.

The meaning of the y-axis is in the grey box that is stuck to the right-hand side of each row of subplots, or alternatively in the color legend. Runtime is measured in seconds.

I don’t know, that’s why I’m asking. I just thought that if the energy distribution has heavier tails, it probably is more likely that the sampler gets close to the center of the donut where the problematic region is. I have no theory about what shapes the energy distribution, so from my point of view it might be constant, like some intrinsic property of the HMC method, or it might depend on the target distribution. Since the energy is available as a model diagnostic, I assume that the model does indeed have an influence on the energy distribution. If so, we should try to test the donut prior with a “mean” energy distribution, that is, one that has a heavy tail of high-energy states. I don’t know how to construct such a test, but I also didn’t really think much about it yet.

good = robust against divergences, with lowest possible runtime/treedepth.

I’m not quite sure what you have in mind. Could you elaborate? What do you mean by “simulation-based”? (Isn’t my test above simulation-based?)
Checking uniformity on the unit hypersphere within my tests above should be doable with relatively little effort. While we’re at it we might also check whether the model returns the right answer when a likelihood is involved. But this would be more like a sanity check, because in theory nothing can go wrong, right? It’s not like I’m violating any of the basic principles that give Stan its robustness, right? (Except for the high-dimensional example, apparently? These strange patterns must come from somewhere!)

The treedepth has a very complicated dependency on the number of dimensions. I expected a very simple relationship. In addition, the shape of the treedepth distribution also varies by dimension in very complex ways, sometimes being heavy tailed, sometimes not, sometimes with high and sometimes with low variance, and so forth.

High dimensions are weird, and high dimensions with constraints are even weirder. The more dimensions you have, the further the sampler will explore away from the origin, which will complicate the radial regularization.

That said, the sampler also expands its trajectories exponentially. When the trajectories need to grow linearly (say with increasing dimension) then even when things are working you can sometimes see thresholding effects where the exponential growth overshoots the optimal length.

The analogy to physical systems is that the negative log density is the potential energy and then we impart a momentum vector that’s independently standard normal in every dimension. Because the density comes into play, reparameterizations can matter (where we typically move some parameters to be generated from a different parameterization).

Together the potential and momentum determine the Hamiltonian and give you the distribution over total energy (Hamiltonian = potential plus kinetic). The squared norm of the random momentum is chi-square(n) distributed, where n is the number of dimensions, and the kinetic energy is half the squared momentum (scaled by the inverse mass matrix if that’s not the unit matrix).
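In symbols (assuming a unit mass matrix), the Hamiltonian is

H(q, p) = V(q) + K(p), \qquad V(q) = -\log \pi(q), \qquad K(p) = \tfrac{1}{2} p^\top p,

with p \sim \mathcal{N}(0, I_n), so twice the kinetic energy, p^\top p, follows a \chi^2_n distribution.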

@Bob_Carpenter
Yes, I understood that much. I just haven’t thought much about how the posterior shape and energy shape are interrelated yet, and how that plays into the probability of the sampler moving near the donut-center. My first intuition would be that the shape of the posterior in other dimensions has no effect on the sampler movements in the donut-dimensions, regardless of its effect on the energy distribution. But I’m not sure. Guess I’ll have to do my homework before coming back to this.

Hi Raoul:

Thanks for your thorough investigation.

Following your parameterization with noninformative priors for the mean angle (i.e. either on the mean vector length or on the coordinates of the mean vector), I wonder how I should proceed if I want to place an informative prior on the mean angle, say, meanAngle ~ von_mises(pi()/4, pi()/36).

Thanks in advance!

You can simply add an informative prior on the constrained scale. In the model in post number three in this thread, for example, meanAngleUnitVector represents the mean angle of the circular distribution in unconstrained space and meanAngle represents it in constrained space. So in that example model, adding this line should do it:

meanAngle ~ von_mises(0, 1); // prior for mean angle

In other words: you don’t have to know anything about how the noninformative prior works; you can just add “the information” on top of it (by defining an informative prior in any way you like).
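(A minimal sketch of the pattern, assuming the 2-vector donut parameterization discussed in this thread; meanAngleUnitVector and meanAngle follow the naming of that example model, although the exact declarations there may differ, and mu0 and kappa0 are hypothetical data inputs standing in for whatever informative prior you have in mind:)

data {
  real mu0;                            // prior mean direction, e.g. pi() / 4
  real<lower=0> kappa0;                // prior concentration
}
parameters {
  vector[2] meanAngleUnitVector;       // unconstrained helper vector
}
transformed parameters {
  real meanAngle = atan2(meanAngleUnitVector[2], meanAngleUnitVector[1]);
}
model {
  // noninformative "donut" prior on the length of the helper vector
  sqrt(dot_self(meanAngleUnitVector)) ~ normal(1, 0.1);
  // informative prior added on the constrained scale
  meanAngle ~ von_mises(mu0, kappa0);
  // ... likelihood for the circular data goes here ...
}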

(hope I’ll find the time to continue working on this thread at some point)

Raoul:

Thank you so much for your quick reply. I will implement the informative priors and report back if I have any issues.

It will be extremely helpful if you could kindly summarize all the scattered discussions in the original post and include further analysis/suggestions/comments.

Sorry, I just realized I mistakenly gave a noninformative prior in the example. It should be something like meanAngle ~ von_mises(pi()/4, 50).

Unfortunately this thread is still a work in progress, so it’s a bit early for a conclusive summary.
I think what can be said at the moment is that while using a donut prior to parameterize a circular distribution is probably the way to go, the limitations of this method still need to be explored further.
