Divergence /Treedepth issues with unit_vector

That is very interesting. In order to test with an informative likelihood you could use the von Mises-Fisher distribution. It generalizes the von Mises distribution to higher-dimensions and has two parameters: A direction – specified as a unit vector \mathbf{\mu} – and a precision \kappa controlling how strongly the density concentrates around the mean direction. When we do not want to fit \kappa, the normalization constant can be dropped and the density supported on the unit sphere is just p(\mathbf{u}) \propto e^{\kappa \mathbf{\mu}^T \mathbf{u}}.

In two dimensions this reduces to the von Mises distribution with the same \kappa. The only difference being that the mean angle is not specified as \theta \in [0, 2\pi), but as a unit vector \mathbf{\mu} = (\cos \theta, \sin \theta)^T.

functions {
  real vonMises_Fisher_lpdf(vector x, vector mu, real kappa) {
    // Note: Does not include normalization and cannot be used to fit kappa!
    return kappa * dot_product(mu, x);
data {
  int<lower = 2> D;
  unit_vector[D] mu;
  real kappa;
parameters {
  vector[D] x;
transformed parameters {
  real r = sqrt(dot_self(x));
  unit_vector[D] u = x / r;
model {
  r ~ gamma(100, 100);
  target += - (D - 1) * log(r);

  u ~ vonMises_Fisher(mu, kappa);

I quickly tested it using the above program. With the Jacobian included I did not encounter any problems with 256 dimensions (with \kappa = 100 or without \kappa = 0 informative likelihoods).

Isn’t there one right way to either include or not include the Jacobian? Otherwise I’d think the sampling would be from the wrong target.

Right. That’s like zero data points—you just get the prior.

This is a good illustration of why the prior and likelihood matter. We run into this all the time, like with non-centered parameterizations for hierarchical models. It’s not the quantity of data, it’s the informativeness of data plus prior.

What was the y-axis on those plots?

You mean like in a hierarchical model where curvature varies around the posterior?

How are you measuring goodness here? I suspect it has something to do with the y-axis on those plots! What I want to see is simulation-based calibration to make sure we’re truly uniformly generating without a likelihood.

What looked strange to you? I didn’t know what to expect, so nothing looks strange to me. I still don’t even understand the y-axes of the graph, though.

No, I don’t think so. Because the Yacobian only contains information about the length of the helper-vector, which is not part of the inference we’re trying to do. It get’s eliminated when the vector is projected to the unit-hypersphere.

The meaning of the Y-Axis is in the grey box that is stuck to the right hand side of each row of subplots, or alternatively in the color legend. Runtime is measured in seconds.

I don’t know, that’s why I’m asking. I just thought that if the energy distribution has heavier tails it probably is more likely that the sampler gets close to the center of the donut where the problematic region is. I have no theory about what shapes the energy distribution, so from my point of view it might be constant, like some intrinsic property of the HMC method, or it might depend on the target distribution. Since the energy is available as a model disgnostic I assume that the model does indeed have an influence on the energy distribution. If so, we should try to test the donut-prior with a “mean” energy distribution, that is one that has a heavy tail of high-energy states. I don’t know how to construct such a test, but also didn’t really think much about it yet.

good = robust against divergences, with lowest possible runtime/treedepth.

I’m not quite sure what you have in mind. Could you elaborate? What do you mean with “simulation based” (Isn’t my test above simulation based?).
Checking uniformity on the unit-hypersphere within my tests above should be doabl with relatively little efforte. While we’re at it we might also check whether the model returns the right answer when a likeliood is involved. But this would be more like a sanity check, because from theory nothing can go wrong, right? I’ts not like I’m violating any of the basic principles that give Stan its robustness, right? (Except for the high-dimensional example, apparrently? These strange patterns must come from somewhere!)

The treedepth has a very complicated dependency on the number of dimensions. I expected a very simple relationship. In addition, the shape of the treedepth distribution also varies by dimension in very complex ways, sometimes being heavy tailed, sometimes not, sometimes with high and sometimes with low variance, and so forth.

High-dimensions are weird, and high-dimensions with constraints are even weirder. The more dimensions you have the further the sampler will explore away from the origin which will complicate the radial regularization.

That said, the sampler also expands its trajectories exponentially. When the trajectories need to grow linearly (say with increasing dimension) then even when things are working you can sometimes see thresholding effects where the exponential growth overshoots the optimal length.

The analogy to physical systems is that the negative log density is the potential energy and then we impart a momentum vector that’s independently standard normal in every dimension. Because the density comes into play, reparameterizations can matter (where we typically move some parameters to be generated from a different parameterization).

Together the potential and momentum determine the Hamiltonian and give you the distribution over total energy (Hamiltonian = potential plus kinetic). The distribution of random momenta is chi-square(n) where n is the number of dimensions, as the kinetic energy is the squared momentum (perhaps divided by the mass matrix if that’s not the unit matrix).

Yes, I understood that much. I just haven’t thought much about how the posterior shape and energy shape are interrelated yet, and how that plays into the probability of the sampler moving near the donut-center. My first intuition would be that the shape of the posterior in other dimensions has no effect on the sampler movements in the donut-dimensions, regardless of its effect on the energy distribution. But I’m not sure. Guess I’ll have to do my homework before coming back to this.

Hi Raoul:

Thanks for your thorough investigation.

Following your parameterization for noninformative priors for mean angle (i.e. either on mean vector length or coordinates of mean vector), I wonder how I should proceed if I want to place informative priors on the mean angle, say, meanAngle ~ von_mises(pi()/4, pi()/36).

Thanks in advance!

You can simply add an informative prior on the constrained scale. In the model in post numer three in this thread, for example, meanAngleUnitVector represents the mean angle of the circular distribution in unconstrained space and meanAngle represents it in constrained space. So in that example model adding this line should do it:

meanAngle ~ von_mises(0, 1). // prior for mean angle

In other words: you don’t have to know anything about how the noninformative prior works, you can just add “the information” on top of it (by defining an informative prior in any way you like).

(hope I’ll find the time to continue working on this thread at some point)


Thank you so much for your quick reply. I will implement the informative priors and report back if I have any issues.

It will be extremely helpful if you could kindly summarize all the scattered discussions in the original post and include further analysis/suggestions/comments.

Sorry I just realized I mistakenly gave a noninformative prior in the example. It should be something like meanAngle ~ von_mises(pi()/4, 50).

Unfortunately this thread is still work in progress, so it’s a bit early for a conclusive summary.
I think what can be said at the moment is that while using a donut prior to parametrise a circular distribution is probably the way to go, the limitations of this method still need to be explored further.

1 Like