Von_mises documentation suggestion

I just wanted to say thanks for the link and the interest, but I can’t see any personal message button.

Good intuition. It only normalizes properly if it’s constrained to an interval.

If you hit boundaries, you’re running into circularity problems and may want to use a unit vector.

You don’t need a Jacobian if you apply a function to parameters and use the result on the right-hand side of a sampling statement. But you do need one if the transformed quantity is on the left-hand side. There’s a discussion in the manual chapter on transforming variables.

We’d much rather keep discussions on list.

And just to clarify:

parameters {
  real mu;
}
transformed parameters {
  real<lower=0, upper=1> theta = inv_logit(mu);
}
model {
  y ~ normal(theta, 1);  // Jacobian not needed
  theta ~ foo(...);  // Jacobian needed
}
2 Likes

Thanks for all the comments.
Let me summarise.
Transforming an angle to a unit vector removes much pain. It lets the values wrap, gives you a unimodal posterior and doesn’t interfere with Stan’s convergence diagnostics.

There is no correction necessary in the sampling statement. Nobody is messing with the data - just transformed parameters on the right-hand side of the sampling statement.
A correction is necessary on the silly prior I put on the distribution’s mean.

This prior is unnecessary, unless you really have some prior belief. Because one is sampling from a unit vector, the system never goes off the rails or visits anywhere too weird.
Suggestion: If one has prior knowledge about the mean of the distribution, I would consider coding it as a prior on the x and y elements of the vector.
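
For example, a minimal sketch of that idea, reusing the mu_vec name from the code later in this thread (the centring angle of 3 and the scale of 0.5 are placeholders, not values from the thread):

parameters {
  unit_vector[2] mu_vec;
}
model {
  // soft prior pull towards the direction (cos(3), sin(3))
  mu_vec[1] ~ normal(cos(3), 0.5);
  mu_vec[2] ~ normal(sin(3), 0.5);
}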

I have a lovely picture on my screen from a stan_trace() where you can see Stan wrapping the distribution’s mean during the warmup phase.

This thread started with my question as to whether this is worth including in the manual. It might be, if I remove my silly prior.

Again thanks.
Here is an example of working code for a von Mises distribution, using the unit vector.
It eats periodic data. It gives you an angle from -pi to pi for the centre.
It does not ask you to tolerate multi-modal posteriors and Stan’s convergence statistics remain meaningful.
It allows you to put a von Mises prior on mu, the distribution centre.
It does not need a Jacobian correction. The prior is a wrapper which acts on the unit vector directly.
It is suitable for vegans.

functions {
    // von Mises prior on the angle implied by a 2D unit vector
    real von_mises_prior_lpdf(vector in_vec, real mu, real kappa) {
        real angle = atan2(in_vec[2], in_vec[1]);
        return von_mises_lpdf(angle | mu, kappa);
    }
}

data {
    int NP;                  // number of observed angles
    vector[NP] angles;       // observations in radians
}

parameters {
    unit_vector[2] mu_vec;   // distribution centre as a point on the unit circle
    real<lower=0> kappa;     // concentration
}

transformed parameters {
    real mu = atan2(mu_vec[2], mu_vec[1]);   // centre as an angle in (-pi, pi]
}

model {
    kappa ~ normal(4, 3);              // my kappas are not more than 7 or 8
    mu_vec ~ von_mises_prior(3, 1);    // prior, centred on 3, kappa = 1
    angles ~ von_mises(mu, kappa);
}
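
A possible follow-up (not part of the post above): a generated quantities block one could append to this model to draw posterior predictive angles for comparison against the observed data.

generated quantities {
    vector[NP] angles_rep;
    for (n in 1:NP) {
        angles_rep[n] = von_mises_rng(mu, kappa);
    }
}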

Imagine you are working with a simple von Mises distribution.
Is there any reason not to do it this way routinely?

This thread started with a suggestion for the manual. This version is better, but maybe it is too long. Maybe it is overkill for just fitting to circular data, when convergence is not an issue and you don’t mind unwrapping your posteriors. If this is going to be part of a model with more terms, then I think it is important.

3 Likes

Thanks for bringing this back up – I think I was wrong about the Jacobian stuff. You don’t need it here. If you have a random variable that lives on a unit circle, then you can map little arclengths there directly onto 0 to 2*pi or whatever range atan2 uses.

I think it’s completely fine to just say mu ~ von_mises(3, 1) even when mu is a transformed variable here, because you’re just mapping a line wrapped around a circle in 2D to one stretched out on one axis. The arclengths are arclengths. What you have coded up here is the same thing.

Hope I finally got it straight in my head this time… If not oh well :D. What you’re doing seems good to me.

Incorrect. You can map little arclengths onto the interval [0, 2pi] locally, but the transformations that we’re talking about here are global. Again, the problem is that the transform from an interval to a circle is not 1-1, and it can’t be because the topologies are different. In fact, if you tried to calculate the Jacobian you’d find that you get an infinite sum of terms which cannot be evaluated - again, a manifestation of the topological problems.
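
(One standard way to make that infinite sum concrete - a textbook wrapped-distribution identity, not something derived in this thread: wrap a density p_line on the whole real line onto the circle via x -> x mod 2*pi, and the pushforward is p_circle(theta) = sum_{k = -inf..inf} p_line(theta + 2*pi*k). Every angle has infinitely many preimages, so no single Jacobian factor relates the two densities.)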

You can embed the circle in R^{n+1}, but then you have the problem of trying to map R^n -> R^{n+1}, which is also not 1-1 and hence doesn’t admit a clean transformation between densities.

Circles are hard.

1 Like

Thanks for stepping in. One of these days I’m gonna know what a topology is haha.

As someone who had a rigorous education but not much formal mathematics, I’d highly recommend John Lee’s “Topological Manifolds”. It gets really complicated after a few chapters, but the first few are comprehensive and carefully explain all of the concepts.

The basic idea is that you have a space of points and then you endow it with more structure by defining which sets of points are open. After you work with this structure you see that the different possible assignments of open sets, or topologies, manifest as what we would think of as different shapes. So tori, spheres, and real numbers end up being spaces defined by different topologies.

There’s a lot of rigorous math you can do with topology, but there are also lots of concepts that you can take away and apply more informally. The relevant notion here is that probability distributions are almost always defined by a space’s topology. If you want to map a distribution from one space to another then you also need to map the corresponding topologies. If the topologies are inconsistent (sphere vs real numbers, etc.) then you can’t (usually) construct such a map and hence can’t map distributions back and forth as you might naively expect.

The really cool thing is that, like measure theory itself, all of these formal math concepts help to identify what would otherwise be really subtle pathologies in your analysis/algorithms that would take forever to find if ever found at all!

2 Likes

I’d recommend starting where @betanalpha pointed me, with John Baez and Javier P. Muniain’s Gauge Fields, Knots and Gravity, which sounds like it’s going to be even more hairy, but starts with a lovely introduction to smooth manifolds. It does presuppose you know a bit of algebra and analysis before you start.

2 Likes

Good - one needs a correction, but one question has not been answered.
By moving the transformation to the right-hand side (in von_mises_prior_lpdf()), does the problem go away?

If you are confused about Jacobians then I highly recommend that you spend some time going over probability densities and their properties. They’re sufficiently non-intuitive that it’s worth putting time into the theory so that you don’t end up trying to guess in every application.

The important property here is that you don’t need Jacobians for transformed quantities that you condition on, but you do need them for transforms of the variate whose density is being evaluated. So let pi(f | g) be a conditional probability density function, but let f and g be specified as functions of other parameters, f(x) and g(y).

What is pi(f | y)? Just pi(f | g(y)).

What is pi(x | g)? Now we need the Jacobian: pi(f(x) | g) |df/dx|.
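
For a concrete instance (a hypothetical choice, not one from this thread), take f(x) = exp(x): then pi(x | g) = pi(exp(x) | g) * exp(x), because |df/dx| = exp(x), and its log, x, is exactly the term a target += statement would have to add.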

But in the example above (immediately above, not the start of the thread), the prior acts directly on the unit vector parameter (mu_vec). It takes the raw unit vector as its argument, not the version transformed to an angle.

It certainly does not trigger Stan’s warning about needing a Jacobian. Have I simply stumbled across a way to obscure the structure, so as to hide a mistake from Stan?

If you are referring to the mu_vec ~ von_mises_prior(3, 1) statement, the reason that you don’t get the warning is that you’ve “hidden” the nonlinear transformation in the user-defined function so that the parser can’t see it. That construction still needs a Jacobian to be correct.

That construction still needs a Jacobian to be correct

Thanks. That is what I wanted spelled out.
When I look at the derivative, I think I have to do something like
target += -log(1 + x^2);
and
target += -log(1 + y^2);

for the two vector components and remove the helper function, but I better check my signs and experiment a bit. I want it to be correct, but I am struggling to find a good numerical test - something that goes wrong when I have the wrong derivatives.

It is embarrassing that what I wrote in the earlier post seems to be patently wrong.
Thanks for the comments.

Probability is really hard. No shame in struggling provided that one respects the math and doesn’t try to hack something together.

You want to look at MCMC output to verify that you’re quantifying the right distribution, so it helps to have analytic expectations (means, medians, variances, quantiles, etc.) for comparison. Keep in mind that expectations on circles are weird (there is ambiguity in defining the mean, hence the variance, etc.).
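
One concrete check along those lines (a hedged sketch, reusing the angles and mu names from the model earlier in the thread): for a von Mises model the circular mean of the data, atan2(mean(sin(angles)), mean(cos(angles))), is the maximum-likelihood estimate of mu, so under a weak prior the posterior for mu should concentrate near it.

generated quantities {
  // circular mean of the observed angles; compare against the posterior draws of mu
  real circ_mean = atan2(mean(sin(angles)), mean(cos(angles)));
}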

2 Likes

Simulation is the best way to make sure you’re doing the right thing. You’ll see that things won’t get the right answer if you get the Jacobians wrong—you’ll be sampling from the wrong distribution.

No, here I do not really agree. Obviously I work with simulated data.
I have two extremes in order to highlight problems: first, plenty of data - a few thousand points - and second, a small amount of data - 50 points.
The problem is that the Jacobian is necessary in the prior, but the prior is so unimportant.

  • If you have no prior, Stan finds the centre and kappa.
  • If one puts in a weak prior (mu ~ von_mises(x, y)) and no Jacobian, Stan finds the centre and kappa with a difference in the second digit after the decimal point.
  • If I put in what I think is correct (the log of the derivative of arctan), the values also come out correct. Maybe they are.

I think I picked my words correctly before. I am struggling to find a good numeric test. If you use a unit vector, it is hard for the system to go wrong. The dominant derivatives will always send mu towards the middle of the data.

I would really like to get this correct in principle, not just correct in the sense of the results look OK.

That means making sure you are doing the right math. So either convince yourself or find somebody else to compare to. The gateway drug here is the log-normal density, where y is data and the model is

log(y) ~ normal(mu, sigma)
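
For what it is worth, here is a minimal sketch of that gateway example written both ways (the names N and y are placeholders):

data {
  int<lower=0> N;
  vector<lower=0>[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  // Either use the built-in density on the original scale:
  //   y ~ lognormal(mu, sigma);
  // or put a normal density on log(y). Because the data are being transformed,
  // matching the lognormal exactly requires the log Jacobian
  // log |d log(y)/dy| = -log(y); it is constant in the parameters, so dropping
  // it leaves the posterior unchanged but shifts the reported target value.
  log(y) ~ normal(mu, sigma);
  target += -sum(log(y));
}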

At least one of Bob’s case studies goes through this in detail, which I found helpful at the beginning. Another fun example is sampling from a donut defined by angle/radius (easy) and then transforming so that the sampler is actually moving in Cartesian coordinates but sampling from the same density. Maybe a bivariate log-normal would be a better second step, but it’s less fun.

If you want to try it on your density I would skip the function business the first time around so you don’t add an extra layer of confusion.