von Mises documentation suggestion


I have a suggestion for the manual for the section about von Mises distributions.
Ignore this if you think it is obvious or, worse, already in the documentation. Maybe this will save someone some pain.
------ Suggestion: ------------

You may be using a von Mises distribution, or perhaps a joint probability that includes one. There are two reasons it may end in tears.
Firstly, there is a modified Bessel function lurking there, which is likely to explode if kappa is large (more than around 100).
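To see how fast that Bessel function grows, here is a small Python sketch (mine, not part of the suggested manual text) using the leading asymptotic term of I0, which behaves like exp(kappa)/sqrt(2*pi*kappa):

```python
import math

# Sketch: the von Mises normalizing constant contains I0(kappa),
# which grows like exp(kappa) / sqrt(2*pi*kappa). Evaluating that
# growth directly overflows double precision quickly.
def i0_leading_term(kappa):
    """Leading-order asymptotic of the modified Bessel function I0."""
    return math.exp(kappa) / math.sqrt(2 * math.pi * kappa)

print(i0_leading_term(100))   # already ~1e42

try:
    i0_leading_term(1000)
except OverflowError:
    print("overflow at kappa = 1000")
```

Working on the log scale postpones the overflow, but accuracy of the density still degrades for large kappa, hence the Gaussian approximation below.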

… here the existing section on using the approximation to a gaussian for large kappa…

Secondly, von_mises_lpdf() works as advertised, but there is a potential problem with the formulation and Stan’s gradient-driven search. If you have a line like

parameters {
    real <lower = 0, upper = 2 * pi()> mu;
}

then it is possible for the gradient to send Stan towards zero or 2 pi. This leads to Stan pushing against the wall you have built and rejecting moves. This can be avoided by saying

parameters {
    unit_vector [2] mu_vec;
}

and add a section

transformed parameters {
    real mu = atan2(mu_vec[2], mu_vec[1]);
}

and in your model section, just use mu as you naively would

model {
   ... other stuff ...
   mu ~ von_mises(3, 0.5);  // a relatively flat prior

   y ~ von_mises(mu, kappa);
}

Now, when Stan pushes mu, there is no limit. The derivative acting on mu pushes the unit vector, and it painlessly wraps.
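The wrapping behaviour can be seen outside Stan with a small Python sketch (mine, not part of the suggested manual text): rotate a unit vector past pi and the angle recovered by atan2 wraps instead of hitting a constraint boundary.

```python
import math

# Sketch: recover the angle from a rotating unit vector with atan2.
# Past pi the angle wraps to -pi instead of hitting a hard boundary.
def angle_of(theta):
    vec = (math.cos(theta), math.sin(theta))  # a point on the unit circle
    return math.atan2(vec[1], vec[0])

print(angle_of(3.0))  # ~3.0, still inside (-pi, pi]
print(angle_of(3.5))  # ~-2.78, i.e. 3.5 - 2*pi: it wrapped, no wall
```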

Divergence / treedepth issues with unit_vector

Shouldn’t there be a Jacobian for the non-linear transform of mu?

There’s a chapter on circular statistics in the manual where this would be appropriate, and it could be referenced from the von Mises distribution section.


Ahh yes.

Shouldn’t there be a Jacobian for the non-linear transform of mu?

I was hoping nobody would notice.
If you agree that the warning is helpful, I will fix my text, but I probably cannot get to it until next week.


This sounds like it could be a useful trick.

I think this is fine without the Jacobian. The unit vector is uniformly distributed on the boundary of the circle. You can verify this quickly with:

parameters {
  unit_vector[2] mu_vec;
}
generated quantities {
  real mu = atan2(mu_vec[2], mu_vec[1]);
}


library(rstan)

fit = stan("Downloads/angle.stan")
hist(extract(fit, "mu")$mu, breaks = seq(-pi, pi, length = 10))
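The same sanity check can be done without Stan at all; here is a Python sketch (mine): normalizing 2-D standard normals gives points uniform on the unit circle, and their atan2 angles come out uniform on (-pi, pi].

```python
import math
import random

random.seed(0)

# Normalizing 2-D standard normals gives points uniform on the unit
# circle; the atan2 angles of those points should then be uniform.
angles = []
for _ in range(20000):
    x, y = random.gauss(0, 1), random.gauss(0, 1)
    r = math.hypot(x, y)
    angles.append(math.atan2(y / r, x / r))

# Rough uniformity check: each quarter of the circle gets ~25% of draws.
quarter = sum(1 for a in angles if a <= -math.pi / 2) / len(angles)
print(abs(quarter - 0.25) < 0.02)  # True
```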

The issue with unit_vector is that it’s actually 3 parameters that reduce to 2 (details in the docs), which might cause the sampler headaches in some circumstances. @betanalpha has mentioned a couple of times that there are real difficulties with this when unit_vector pops up.

If we just try this out in a test model:

data {
  int N;
  vector[N] y;
}
parameters {
  unit_vector[2] mu_vec;
}
transformed parameters {
  real mu = atan2(mu_vec[2], mu_vec[1]);
}
model {
  y ~ von_mises(mu, 1.0);
}


library(rstan)
library(circular)  # for rvonmises()

N = 100
y = as.vector(rvonmises(N, -pi, 1.0))
fit = stan("Downloads/angle.stan", data = list(N = N, y = y))
hist(extract(fit, "mu")$mu, breaks = seq(-pi, pi, length = 10))

I got 149 divergences with the default 4 chain 4 cores 2k iters.

Output is:

           mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu_vec[1] -0.99    0.00 0.02 -1.00 -1.00 -0.99 -0.98 -0.94  1355 1.00
mu_vec[2] -0.05    0.00 0.14 -0.33 -0.15 -0.05  0.05  0.24  1685 1.00
mu        -0.80    0.07 2.91 -3.13 -3.05 -2.93  3.02  3.13  1835 1.00
lp__      46.94    0.05 1.26 43.78 46.38 47.28 47.87 48.32   684 1.01

So you’d probably have to take care when doing something like this.

I like this though. Seems like it could be useful! Thanks for bringing it up Andrew. I wonder if the source of those divergences could be nailed down? Is it just me being silly or is this intrinsic to the parameterization?


I am confused.
Jacobian: when I look at a circle, I do not want a Jacobian. When I look at the derivative of mu with respect to mu_vec[1] or mu_vec[2], I want one.
I had done very similar tests to you. I generated data from some hand-built code, and fit it using the built-in functions.
I get correct values for the mean and kappa, but I did not write that before. I thought I was doing something naughty (ignoring a numerical error), because it did not seem to make any difference.

Number of parameters: I have missed something.

The issue with unit_vector is that it’s actually 3 parameters that reduce to 2 (details in the docs)

Do you mean 2 parameters that get reduced to 1, or are you including kappa, or have I missed something?

I get lots of divergences and cannot imagine where they come from. If I understand right, this means the leap-frog has fallen off the cost-function surface. Where I come from, this means a derivative is bigger than expected.
When I look at the variable traces, I cannot see where this could be happening. After warmup, kappa never goes anywhere it shouldn’t and the atan2 function should be friendly.

I like this though. Seems like it could be useful!

My first thought was that I had wasted a few days, finding something that the experts regarded as too simple to write in the manual.


Do you mean 2 parameters that get reduced to 1, or are you including kappa, or have I missed something?

Err, lol, 2 to 1. Good catch, my bad.

Re Jacobian

And eek, I’m double wrong. Maybe I shouldn’t look at the forums right when I wake up haha. We’d need the Jacobian if we placed mu on the left-hand side of a sampling statement.

We’d need it if we did something like:

mu ~ normal(1.4, 1.0);

finding something that the experts

I’m not an expert in the slightest haha. As evidenced by above. Oh well. Thanks for correcting me.


I did some more thinking about this.

If you were having troubles with the sampler running up against the wall with:

real <lower = 0, upper = 2 * pi()> mu;

I think you’d just get rid of those bounds. The mean in the von_mises distribution can be any real. You’d end up with a posterior that is multimodal, but you could just map that back to [0, 2 * pi) or whatever you wanted.
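That mapping back is a one-liner; a Python sketch (mine) of the fold, which sends a chain sitting at mu and one sitting at mu + 2*pi to the same angle:

```python
import math

# Chains that settle at mu and mu + 2*pi are the same angle;
# folding everything back into [0, 2*pi) merges the modes.
def wrap_to_2pi(mu):
    return mu % (2 * math.pi)

print(wrap_to_2pi(-math.pi))           # pi
print(wrap_to_2pi(1.0 + 4 * math.pi))  # ~1.0
```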

I messed around with the Jacobian for a bit, but it quickly became clear I was trying to reverse engineer something I shouldn’t haha.


I do not like letting mu run free.

There is the question of elegance. If a quantity wraps, I like the idea that it wraps. I grant you, this is not very convincing.

Multi-modal posteriors are bad, especially when they are not truly multi-modal but periodic images (sorry, you can probably hear I am a chemist). The terrible thing is that you lose the convergence statistics.
Chain 1 has found x. Chain 2 finds x + 2 pi. Stan looks at the chains and thinks you have found rubbish. I have an example: a model with some other parameters. They converge nicely. mu_0 is the mean of my distribution. Stan tells me it is all broken (Rhat):

        n_eff   Rhat
mu        823   1.01
mu_0        3 172.74
kappa     600   1.01
rho_1     449   1.00
rho_2     420   1.01
sigma     393   1.02
rho_sq    635   1.01
sigma_c  1457   1.00

Next is the killer problem.
Von Mises distributions are OK, but much of the real world is a joint probability over periodic data and something else. My data is a mixture model involving joint probabilities. Imagine the soup of groups one will have if class 1 has mu=x, class 2 has mu=x+2pi, class 3 has mu=x, but differs in the other part of the joint probability.

Why do you not like the idea of a Jacobian correction?
If the scraps of paper on my desk are roughly right, I think I will end up with some log (1/(1 + mu_vec[1] * mu_vec[1])) and log(1/(1 + mu_vec[2] * mu_vec[2])) terms.

I would not mind if somebody tells me I am thinking the wrong way.


Next is the killer problem.

Fair enough, I have a problem where I’m estimating a crystal orientation. I just live with the multimodality in the posterior. I agree it is pretty inconvenient (cause I always have to convert between fundamental zones n’ such and that is a pretty error-prone sorta process). But in this case I think just looking at all the numbers mod 2 pi should be fine (you could do that in generated quantities or something).

Why do you not like the idea of a Jacobian correction?

I’m not convinced there is a simple Jacobian transformation, or that it makes sense. Going from 2 variables to 1 isn’t as simple as the examples in the manual. I’m just thumbing through: http://www.math.uah.edu/stat/dist/Transformations.html.

If you don’t have a statement where you need to evaluate the probability of mu (like a sampling statement), you don’t need the Jacobian adjustment, and you can use the unit_vector parameterization fine.

Hopefully this is all correct :D.


I just wanted to say thanks for the link and the interest, but I can’t see any personal message button.


Good intuition. It only normalizes properly if it’s constrained to an interval.

If you hit boundaries, you’re running into circularity problems and may want to use a unit vector.

You don’t need a Jacobian if you apply a function and use the result on the right-hand side of a sampling statement. But you need it if you’re using it on the left-hand side. There’s a discussion in the manual chapter on transforming variables.


We’d much rather keep discussions on list.


And just to clarify:

parameters {
  real mu;
}
transformed parameters {
  real<lower=0, upper=1> theta = inv_logit(mu);
}
model {
  y ~ normal(theta, 1);  // Jacobian not needed
  theta ~ foo(...);      // Jacobian needed
}
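For that inv_logit case the adjustment has a closed form, log|d theta / d mu| = log(theta * (1 - theta)); a quick numerical check in Python (my sketch, not Stan code):

```python
import math

# For theta = inv_logit(mu), the Jacobian adjustment on the second
# sampling statement is log|d theta / d mu| = log(theta * (1 - theta)).
def inv_logit(mu):
    return 1.0 / (1.0 + math.exp(-mu))

def log_jacobian(mu):
    theta = inv_logit(mu)
    return math.log(theta * (1 - theta))

# Check against a central finite difference of theta with respect to mu.
mu, h = 0.7, 1e-6
fd = (inv_logit(mu + h) - inv_logit(mu - h)) / (2 * h)
print(abs(math.log(fd) - log_jacobian(mu)) < 1e-6)  # True
```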


Thanks for all the comments.
Let me summarise.
Transforming an angle to a unit vector removes much pain. It lets the values wrap, gives you a unimodal posterior estimate, and doesn’t mess with Stan’s convergence diagnostics.

There is no correction necessary in the sampling statement. Nobody is messing with the data; there are just parameters on the right-hand side of the sampling statement.
A correction is necessary on the silly prior I put on the distribution’s mean.

This prior is unnecessary, unless you really have some prior belief. Because one is sampling from a unit vector, the system never goes off the rails or visits anywhere too weird.
Suggestion: If one has prior knowledge about the mean of the distribution, I would consider coding it as a prior on the x and y elements of the vector.

I have a lovely picture on my screen from a stan_trace() where you can see stan wrapping the distribution’s mean during the warmup phase.

This thread started with my question as to whether this is worth including in the manual. It might be, if I remove my silly prior.


Again thanks.
Here is an example of working code for a von Mises distribution, using the unit vector.
It eats periodic data. It gives you an angle from -pi to pi for the centre.
It does not ask you to tolerate multi-modal posteriors, and Stan’s convergence statistics remain meaningful.
It allows you to put a von Mises prior on mu, the distribution centre.
It does not need a Jacobian correction. The prior is a wrapper which acts on the unit vector directly.
It is suitable for vegans.

functions {
    real von_mises_prior_lpdf(vector in_vec, real mu, real kappa) {
        real angle = atan2(in_vec[2], in_vec[1]);
        return von_mises_lpdf(angle | mu, kappa);
    }
}

data {
    int NP;
    vector [NP] angles;
}

parameters {
    unit_vector [2] mu_vec;
    real <lower = 0> kappa;
}

transformed parameters {
    real mu = atan2(mu_vec[2], mu_vec[1]);
}

model {
    kappa ~ normal(4, 3);            // my kappas are not more than 7 or 8
    mu_vec ~ von_mises_prior(3, 1);  // prior, centred on 3, kappa = 1
    angles ~ von_mises(mu, kappa);
}

Imagine you are working with a simple von Mises distribution.
Is there any reason not to do it this way routinely?

This thread started with a suggestion for the manual. This version is better, but maybe it is too long. Maybe it is overkill for just fitting to circular data, when convergence is not an issue and you don’t mind unwrapping your posteriors. If this is going into part of a model with more terms, then I think it is important.


Thanks for bringing this back up – I think I was wrong about the Jacobian stuff. You don’t need it here. If you have a random variable that lives on a unit circle, then you can map little arclengths there directly onto 0 to 2*pi or whatever range atan2 uses.

I think it’s completely fine to just say mu ~ von_mises(3, 1) even when mu is a transformed variable here, because you’re just mapping a line wrapped around a circle in 2D to one stretched out along one axis. The arclengths are arclengths. What you have coded up here is the same thing.

Hope I finally got it straight in my head this time… If not oh well :D. What you’re doing seems good to me.


Incorrect. You can map little arclengths onto the interval [0, 2pi] locally, but the transformations that we’re talking about here are global. Again, the problem is that the transform from an interval to a circle is not 1-1, and it can’t be, because the topologies are different. In fact, if you tried to calculate the Jacobian you’d find that you get an infinite sum of terms which cannot be evaluated – again, a manifestation of the topological problems.

You can embed the circle in R^{n+1}, but then you have the problem of trying to map R^n -> R^{n+1}, which is also not 1-1 and hence doesn’t admit a clean transformation between densities.
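For what it’s worth, here is a sketch (mine, with the usual caveats) of where the infinite sum comes from: the map from the real line to the circle, theta = mu mod 2*pi, is many-to-one, so pushing a density forward sums over every preimage rather than producing a single Jacobian factor:

```latex
% Every preimage mu = theta + 2*pi*k contributes to the density at theta,
% so the pushforward is a wrapped sum, not a single Jacobian factor.
p_{S^1}(\theta) \;=\; \sum_{k=-\infty}^{\infty} p_{\mathbb{R}}(\theta + 2\pi k),
\qquad \theta \in [0, 2\pi)
```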

Circles are hard.


Thanks for stepping in. One of these days I’m gonna know what a topology is haha.


As someone who had a rigorous education but not much formal mathematics, I’d highly recommend John Lee’s “Topological Manifolds”. It gets really complicated after a few chapters, but the first few are comprehensive and carefully explain all of the concepts.

The basic idea is that you have a space of points and then endow it with more structure by defining which sets of points are open. After you work with this structure, you see that the different possible assignments of open sets, or topologies, manifest as what we would think of as different shapes. So tori, spheres, and real numbers end up being spaces defined by different topologies.

There’s a lot of rigorous math you can do with topology, but there are also lots of concepts that you can take away and apply more informally. The relevant notion here is that probability distributions are almost always defined by a space’s topology. If you want to map a distribution from one space to another then you also need to map the corresponding topologies. If the topologies are inconsistent (sphere vs real numbers, etc.) then you can’t (usually) construct such a map and hence can’t map distributions back and forth as you might naively expect that you could.

The really cool thing is that, like measure theory itself, all of these formal math concepts help to identify what would otherwise be really subtle pathologies in your analysis/algorithms that would take forever to find if ever found at all!


I’d recommend starting where @betanalpha pointed me, with John Baez and Javier P. Muniain’s Gauge Fields, Knots and Gravity, which sounds like it’s going to be even more hairy, but starts with a lovely introduction to smooth manifolds. It does presuppose you know a bit of algebra and analysis before you start.