I’ve seen that there were several discussions about divergence/treedepth problems when using unit_vector for circular statistics in the past year.
Has this issue been resolved in the meantime?
If not, I think I might be able to explain the cause and give a solution. If there’s interest I can write It up and post some theory and a small case study here.
Go ahead. I think we know the cause, namely Euclidean-based HMC schemes can face weird geometry when trying to draw from posterior distributions that are defined on manifolds. And in particular with unit_vectors we originally were using a one-to-one but periodic transformation using polar coordinates and trigonometry but ran into the problem that 0 was far from 2 \pi in the unconstrained space. Then we switched to normalizing independent standard normals, which is better but isn’t quite one-to-one because the radius of the hypersphere is not really defined although we can take it to be 1 when converting initial values to the unconstrained space. It works well as a prior, but conditional on the data, it can have a weird shape. I’ve long thought that we could utilize the ambiguity about the radius to tune the sampler but if you have other solutions, feel free to post them.
I hope this is not just some old Idea that has already been rejected due to some reason. However, it seems quite logical and natural to me and seems to work quite well, so I wondered why I didn’t see any mention of it, suggesting that it might actually hasn’t been considered before.
After I had this Idea I went looking for similar things in the forums an it turned out that @bertschi already proposed a very similar / the same solution here a few months back, also including some theory and a case study:
I can’t see much discussion of it, hovewer, or evidence that the idea has been picked up by the devs. Therefore I decided to highlight it again, which is the purpose of this thread. In addition, in that old thread some possibly interesting aspects remained undiscussed, and I’m also coming from a slightly different perspective.
I split this into two post. In this first post I present the basic idea, in the second post I discuss implementation.
First a simple example that highlights the problem.
Let’s say I have measured a bunch of angles and want to estimate the mean angle. I assume that the angles follow a von mises distribution. I parametrise the mean angle using an angle derived from a 2-dimensional unit vector via atan2().
All model Runs in the following are conducted in the following way unless any deviations are mentioned:
They are fit via rstan 2.18.2 using the default settings (4 chains with each 2000 iterations, half of which is warmup). They are conducted on a simulated dataset of 100 measured angles coming from a von mises distribution with mean=1 and precision =3.
Running the model took 41 seconds (not including compilation) with a mean treedepth of 3.518 and produced 28 divergent transitions. (the previous run had >100 divergences).
As you can see there is some problem even with such a simple model.
The rest of this post is based on the assumption that I understood the implementation of the unit vector in Stan correctly. Based on what is written in the manual, and on forum posts. I understood it to work as follows:
If I declare a unit vector of dimension k, then Stan internally creates an unconstrained vector of dimension k and puts an independend standard-normal prior on each dimension. Samples of the unit vector are derived from this unconstrained vector by scaling each of them individually to length 1.
To be able to look at what happens inside this implementation I re-implemented the unit vector myself in the same way in my stan program. The program then looks like this:
Reassuringly, if I run this second model it produces an identical (within stochastic variation) runtime, treedepth and number of divergences as the first model (which used the native implementation of the unit vector).
With this background, I now present my interpretation of what is going wrong there:
I think the key to understanding this, is to think about how the posterior distribution of the internal reprentation of the unit vector looks like. Its shape will on one hand be influenced by the internal standard-normal priors on each of its dimensions, and on the other hand by the other influences which are transmitted to it through the unit vector. As the unit vector contains no info about the length of the internal representations unconstrained vector, it can only transmit back information about the direction of the internal vector, not about its length.
In the following plot I simulate how this internal posterior density looks like when the information relayed back to it through its unit vector representation corresponds to a von mises density with precision=10. I calculated this density by multiplying the 2d-standard-normal prior density with the von mises density.
The point is exactly at the origin of the internal unconstrained vector.
It can be seen that the distribution is quite smooth away from the origin, but as one comes close to the origin it becomes sharper and sharper, leading to areas of fairly high curvature close to the origin (and a discontinuity exactly at the origin). I think these high curvature regions are what causes the divergences and/or treedepth issues.
So what causes the divergences seems to be the fact that due to the radial nature of the info relayed through the unit-vector, the originally smooth von mises distribution gets progressively "sharpened" towards the center of the internal unconstrained representation.
Here are two more examples.
The first uses a very narrow von mises distribution, simulating the posterior of a highly informative dataset. It agains shows how the curvature gradually increases towards the center. It can be seen that the posterior has a very sharp spike with high curvature edges, somewhat reminiscent of what is known as "neals funnel".
This illustrates that the issue should be worse if the angular distribution is sharper to begin with. Accordingly, the narrower the von mises distribution is, the slower the sampler, the higher the treedepth and the higher the change of divergences. Tho following plot illustrates this with model 2 (result for model 1 looked similar).
The von mises precision here describes the distribution of the measured angles in the simulated dataset. Each bar shows the distribution from 40 runs of 1 chain each, each run using a new simulated dataset. The width of each bar follows the empirical cumulative distribution function of the datapoints.
The solution Idea:
For the purposes of the unit-vector we use only the directional info of the internal unconstrained vector, not its length. This means that we are to some degree free to choose its length as we see fits. As can be seen from the plots above, problems with high curvature arise when the sampler is close to the origin, that is when the vector is short. Unfortunately, the current implementation uses standard-normal priors, which draw the sampler to this problematic region.
To avoid problems with the high curvature areas I suggest to change the prior of the unit vectors internal representation to one that keeps the sampler away from the these areas. I suggest that the typical set should not span much more than a 2-times difference in distance from the origin, such that the radial compression of the angular distribution can’t change the curvature by more than a factor of 2. This would present the sampler a relatively uniform amount of curvature, to which it can adapt well.
A simple prior that achieves this goal is if we replace the independend standard-normal distributions with one single normal(1,0.1) distribution that acts only on the length of the unconstrained vector (optimal choice and jacobian correction are discussed in the second post). This achieves a "donut" shaped distribution.
The following plots show the simulated shape of the posterior of the internal unconstrained vector. The plots are identical to the previous three ones, except that the prior has been changed from unit-normal to the "donut prior" suggested above.
Here we can see that for highly informative datasets the posterior is very close to the shape of a multivariate normal. This should be very easy to sample from. If the density is even more narrow the distribution becomes a narrow ridge, but again, similar shape as a multivariate normal, so shouldn’t be a problem.
The plot doesn’t show the strong relationship between runtime and distribution which appeared in the previous model, and there were no divergences. Update: I also tried it with much higher von-mises-precisions, and the runtimes stabilize at about 4.5 seconds.
Bob Carpenter once wrote that that the unit-vector was originally implemented to create a data structure where the points which are close together on e.g. a circle are also close together in its unconstrained representation. While the implementation achieves this goal, it unfortunately also puts close together such points which are not close on the circle. This is because all points of the circle are close together at the origin of the unconstrained representation. What the "donut-prior" achieves is that not only points that are close on a circle are close in the unconstrained representation, but also that points which should be far away from each other are so in the unconstrained representation, too. At least if we only consider the typical set, or the length of the path which the sampler typically travels between the points.
It also makes sense intuitively: If you want to model a circular variable, use a posterior that looks like a circle (donut). If you want to model coordinates on a sphere use a distribution shaped like (the surface of) a sphere, and so forth …
“I suggest that the typical set should not span much more than a 2-times difference in distance from the origin, such that the radial compression of the von mises distribution (or whatever other angular distribution) can’t change the curvature by more than a factor of 2. This would present the sampler a relatively uniform amount of curvature, to which it can adapt well.”
The proposed normal(1,0.1) prior on the vector length was chosen according to this rule. If we make the ring too broad this will not solve the problem in full, leading to higher than optimal treedepth and running time, and a risk of divergences. If we make it much narrower, the sampler would also have to make smaller steps to navigate the distribution, which wouldn’t lead to divergences, but make it slower. So besides the desire to avoid divergences, this is primarily a question of what gives optimal performance.
We might also discuss what the optimal shape of the donuts flanks may be, although I think the normal distribution is probably already the optimal choice. But on theoretical grounds it would be desireable to have a prior that completely forbids the sampler to go to the origin, which the normal distribution doesn’t do.
We should also think of what happens in higher dimensions. While in low dimensions the current implementation of the unit vector draws the sampler towards the origin, in higher dimensions this is offset by the proportionately much lower volume there, so that the sampler rarely goes there. Therefore I suspect that the current implementation already works well for high-dimensional unit-vectors, and only low dimensions show the problematic behaviour. Something I didn’t test because I’m not very firm with higher dimensional calculations.
Another question is how the proposed "donut-prior" may behave in high dimensions. I suspect that the ring would probably increase in radius and become narrower, which may or may not be optimal for performance.
When @bertschi suggested this same Idea a few months ago he proposed to apply a jacobian correction to the “donut prior” (he used exactly the same “donut prior” as I did). He only provided the jacobian for the 2-dimensional case, where it is 1/vectorLength. I think the general form is vectorLength^(1-dimension). This jacobian correction effectively prevents that the ring icreases in radius or narrows with increasing dimension, possibly ensuring optimal performance for all dimensionalities. There are also some potential problems with this jacobian, however. Maybe @bertschi can say something about his experiences with that, I didn’t test it. So, based on theory, the following points might speak against using the jacobian:
it increases the density near the places where the high curvature lies (around the origin), increasing the chance that the sampler goes there.
it creates new high curvature shapes by itself (only around the origin).
it creates a very small very high target-density (not posterior-density) region (around the origin), which means that the sampler could get stuck there. This might act like what is known as a “phase change”. I don’t have any experience with how HMC acts when it gets in such a situation, but guess that it would either lead to a lot of rejections/divergences, or bad exploration, depending on the adaptation.
it creates one point with infinite density (the origin).
In practice, the jacobian might in fact be totally unproblematic, as due to the combined effect of the dimensionality (which exactly cancels out with the jacobian) and the "donut-prior" the sampler will be unlikely go near the origin anyway, and even if it does, most the target-density spike near the origin is not very high compared to the target-density of the donut. The exception beeing if the chain is initialised there, but as far as I understood the unconstrained representation of unit vectors is by default initialised with length 1 (is that true?).
The following plot shows the target (not posterior) density of the combination of the "donut-prior" with the jacobian for the two-dimensional case in black, and the 4-dimensional case in red. Note that the y-axis has a log-scale.
One point that I’m not sure about is how the jacobian might behave numerically or interact with the sampler in high dimensions, as I think the strength of the jacobian grows very fast with dimension. (How many dimensions might a unit-vector maximally have in practice?)
It might be that all possible variants work just fine. In that case it’s just a matter of performance optimisation. In this case some of the things to keep in mind are to keep treedepth at minimum, and to minimize the time required to derive the gradients.
I hope these 2 posts raise some useful points and apologise for the abuse of the words "donut" and "prior".
This looks promising. The Stan language is in the process of getting these “multiplier” keywords (really just scalings) for non-centered parameterizations. Perhaps we could have something similar for unit_vectors like