Divergence/Treedepth issues with unit_vector

To help decide on an implementation, I ran some more tests to find out which donut shape is optimal and whether to use the Jacobian.

Test 1:

I tested two cases: a highly informative likelihood (-> narrow posterior angular distribution) and a completely uninformative likelihood (-> complete donut). The first test checks for divergences due to the radial sharpening of the likelihood through the circular projection; the second checks how fast the sampler can navigate the donut. It turned out that divergences can occur even in the second test, more on that below.

I tested both conditions with all three proposed donut shapes (normal, lognormal, gamma), with and without the Jacobian, and with various widths of the donut. The mean of the donut prior (the “donut radius”) was always approximately 1.

For the test with the uninformative likelihood I simply left out the likelihood altogether.
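For concreteness, here is a minimal Stan sketch of the prior-only (uninformative) setup, written out for the normal donut with a data flag to toggle the Jacobian; the lognormal and gamma variants just swap the `_lpdf` call. Variable names are for illustration only, this is a sketch of the construction rather than my exact test code:

```stan
data {
  int<lower=2> D;                      // number of dimensions (2 in test 1)
  int<lower=0, upper=1> use_jacobian;  // toggle the Jacobian adjustment
}
parameters {
  vector[D] y;                         // unconstrained vector; its direction is the unit vector
}
transformed parameters {
  real<lower=0> r = sqrt(dot_self(y)); // length of y
  vector[D] u = y / r;                 // the unit vector of interest
}
model {
  // donut prior on the length, here the normal variant with sd = 0.1
  target += normal_lpdf(r | 1, 0.1);
  // the Cartesian-to-polar change of variables contributes a factor r^(D-1);
  // subtracting (D - 1) * log(r) cancels it so that r itself follows the donut prior
  if (use_jacobian)
    target += -(D - 1) * log(r);
}
```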

For the test with the highly informative likelihood I fixed the von Mises precision parameter to 300 and fed the model 15 datapoints from a von Mises distribution with this precision. It turns out that this produces approximately the same posterior shape as in my first batch of tests (posted above) with 100 datapoints and von Mises precision 300, because the precision prior in the first tests was poorly chosen and led to a very biased precision estimate (~42), which made the likelihood wider than it would otherwise have been (this doesn’t matter because I didn’t rely on it anywhere). Since in the first tests the performance and divergence figures stabilized at precision 300 and didn’t change much for higher precisions, the test at this value should be ±representative of “infinite”-precision likelihoods.
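A sketch of the 2-D model for the informative case (again only the normal donut variant is shown and the Jacobian switch is left out; the precision is passed in fixed at 300 rather than estimated):

```stan
data {
  int<lower=1> N;       // 15 datapoints in this test
  vector[N] theta;      // angles drawn from a von Mises with precision 300
}
parameters {
  vector[2] y;          // unconstrained 2-D vector
}
transformed parameters {
  real<lower=0> r = sqrt(dot_self(y));
  vector[2] u = y / r;  // unit vector
}
model {
  real mu = atan2(u[2], u[1]);        // angle represented by the unit vector
  target += normal_lpdf(r | 1, 0.1);  // donut prior (add -log(r) here for the 2-D Jacobian)
  theta ~ von_mises(mu, 300);         // likelihood with fixed precision
}
```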

All tests were repeated 100 times with new random datasets.

Here are the results:

Test with highly informative likelihood:

Test without likelihood:

Plot explanation:
The x-axis shows which donut shape was used. The first number on each column denotes which distribution was used (1 = normal, 2 = lognormal, 3 = gamma) and the second number indicates whether the Jacobian was applied (0 = no, 1 = yes). Everything else is as in the graphs in my posts further up.

Results summary:

Even if the likelihood is completely uninformative (in the test I just removed the likelihood altogether), there can be divergences. I think this can have two causes:

  1. For the zero-avoiding distributions (lognormal, gamma), if the sampler gets too close to zero the gradient blows up (see the short derivation after this list).

  2. In any case, if the sampler manages to ±completely leapfrog over the region around the origin, the Hamiltonian integration is probably very poor (I’m not sure whether this would lead to a divergence warning, as I don’t know exactly how divergences are detected).
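To illustrate point 1: writing the donut prior on the length $r$ as lognormal$(\mu, \sigma)$ or gamma$(\alpha, \beta)$ with $\alpha > 1$, the gradient of the log density diverges as $r \to 0^+$:

$$
\frac{d}{dr}\log\,\text{lognormal}(r \mid \mu, \sigma) = -\frac{1}{r}\left(1 + \frac{\log r - \mu}{\sigma^2}\right) \to +\infty,
\qquad
\frac{d}{dr}\log\,\text{gamma}(r \mid \alpha, \beta) = \frac{\alpha - 1}{r} - \beta \to +\infty.
$$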

Within the range of values tested, broad donuts are faster for uninformative likelihoods (easier for the sampler to navigate), while narrow ones are faster for informative likelihoods (because narrowing reduces the radial sharpening of the likelihood). Outside the tested range other effects come into play, but that range is not interesting for our purpose (at the moment).

But for the sake of completeness:

  1. For uninformative likelihoods, once the donut sd is > 0.3 the sampler seems to stop getting faster and at some point actually gets slower again; I didn’t test this much. Presumably this is because the sampler then reaches the problematic central part of the distribution more often (although if sd >> 0.3 this no longer registers as divergences). This effect is already visible at 0.3 in the plot. The special case of the normal donut without the Jacobian might be ±immune to this.

  2. For highly informative likelihoods I suspect that if the donut sd is made very small the warmup time will increase again, because it takes the sampler time to find the correct scale (I didn’t test this).

The initially chosen donut width of sd=0.1 seems to be close to optimal.

With a highly informative likelihood the lognormal is best, followed by the gamma and then the normal. Applying the Jacobian makes the results worse. Without the likelihood it’s all reversed. Overall it doesn’t make much difference which of the distributions is chosen or whether the Jacobian is applied. This might be different in higher dimensions (especially for the Jacobian); here I tested only the two-dimensional case.

Discussion:

In real applications the energy distribution might vary (is this true?). I’m not sure what effect this would have, but I guess it might help the sampler move into the problematic regions, thus increasing the chance of divergences or of treedepth increases.

In real applications the shape of the angular distribution might be less “nice” than the von Mises, which also increases the risk of divergences etc. Such a less nice distribution could, amongst other things, arise from interactions with other model parameters; e.g. the estimate of the von Mises precision might be uncertain, leading to a “Neal’s funnel”-type problem.

For these reasons it might be wise to err on the safe side and choose a narrower donut than sd = 0.1 * radius, although of course this might be forgone for the sake of performance. If issues with divergences occur, they can be countered by increasing the acceptance target (adapt_delta).

I’m not quite sure if, in this special case, divergences do signal a bias in the posterior. I suspect that they do cause a bias because they alter the number of possible pathways between different parts of the posterior angular distribution, and thus the probability of moving from a certain angle to a certain other angle.

Test 2:

I also partially tested the effect of the Jacobian in higher dimensions. I used the normal(1, 0.1) donut prior and left out the likelihood (because I’m not knowledgeable in higher-dimensional angular math).

As expected, without the Jacobian the length distribution of the vector grows with dimension, while with the Jacobian it stays stable:


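The growth without the Jacobian follows from the change of variables to polar coordinates: if the donut prior $\pi(r)$ is written directly as a density on the unconstrained vector, the implied density on the length picks up the volume factor $r^{D-1}$,

$$
p(r) \propto \pi(r)\, r^{D-1},
$$

which pushes the bulk of the length distribution outward as $D$ grows. Adding $-(D-1)\log r$ to the target cancels this factor, so the length follows $\pi(r)$ regardless of dimension.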
What I don’t understand is the performance graph; it has weird patterns in it, but these seem to be consistent:

If we ignore the strange pattern, it seems that performance in high dimensions is good and that, again, it doesn’t matter much whether the Jacobian is applied or not, at least up to 128 dimensions. With the Jacobian I couldn’t go to 256 dimensions, because then the initial values were rejected; I guess the Jacobian term just produces too-extreme numbers in that case. Without the Jacobian it is possible to go higher.

(edit: What happened to the colours of the last 2 images?)
