Yes, I saw that. I didn’t read the relevant part of the paper because it’s behind a paywall.

Read the paper now!

Here’s my summary:

He first reviews 3 previous methods of producing points on a 3d sphere, then proposes a new one. I assume this new one is what we call the “Marsaglia method”.

This method consists of 2 steps:

- A rejection sampling step to create 2d-vectors distributed uniformly over the interior of a circle with radius 1.
- A method to map these vectors to the surface of a 3d sphere such that they are uniformly distributed on the sphere.
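The two steps above can be sketched in Python (my own minimal illustration, not code from the paper):

```python
import math
import random

def marsaglia_sphere_point(rng=random):
    # Step 1: rejection sampling -- draw from the enclosing square until the
    # point lands inside the open unit disk, giving a uniform disk sample.
    while True:
        v1 = rng.uniform(-1.0, 1.0)
        v2 = rng.uniform(-1.0, 1.0)
        s = v1 * v1 + v2 * v2
        if s < 1.0:
            break
    # Step 2: Marsaglia's map from the disk to the surface of the unit sphere.
    return (2 * v1 * math.sqrt(1 - s),
            2 * v2 * math.sqrt(1 - s),
            1 - 2 * s)

p = marsaglia_sphere_point()
# the image always has unit norm: 4*s*(1-s) + (1-2*s)^2 = 1
assert abs(p[0] ** 2 + p[1] ** 2 + p[2] ** 2 - 1.0) < 1e-12
```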

The method that Stan uses for unit vectors, as far as I know, is:

- Draw a vector from a 3d standard multivariate normal distribution.
- Divide the vector by its length to project it onto the surface of the 3d sphere.
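That recipe is a one-liner in most languages; a Python sketch for comparison (my own illustration, not Stan's internal code):

```python
import math
import random

def unit_vector_via_gaussian(n=3, rng=random):
    # draw an n-dimensional standard normal vector ...
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # ... and divide by its length to project onto the unit sphere
    r = math.sqrt(sum(c * c for c in y))
    return [c / r for c in y]

x = unit_vector_via_gaussian()
assert abs(sum(c * c for c in x) - 1.0) < 1e-12
```

The result is uniform on the sphere because the standard normal density is rotationally symmetric.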

What I called “sphere method” is what I introduced in the other thread and is basically identical to what Stan does, except I start with a different prior so the sampler doesn’t go near (0,0,0). (That distribution already looks almost like a sphere surface even before projecting onto the sphere surface.)

What I called “polar method” above is:

- Create a vector of longitude and latitude coordinates describing a position on a sphere.
- Use Jacobian corrections to ensure the points are uniformly distributed.
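One standard way to realize this is to sample the longitude uniformly and to fold the latitude Jacobian (the sine factor in the spherical area element) in by sampling the cosine of the colatitude uniformly. A Python sketch of that construction (my own illustration, not the Stan parameterisation itself):

```python
import math
import random

def polar_sphere_point(rng=random):
    # longitude: uniform on [0, 2*pi)
    lon = rng.uniform(0.0, 2.0 * math.pi)
    # colatitude theta has density proportional to sin(theta) on [0, pi],
    # which is equivalent to cos(theta) being uniform on [-1, 1] --
    # exactly the Jacobian correction needed for uniformity on the sphere
    cos_t = rng.uniform(-1.0, 1.0)
    sin_t = math.sqrt(1.0 - cos_t * cos_t)
    return (sin_t * math.cos(lon), sin_t * math.sin(lon), cos_t)

q = polar_sphere_point()
assert abs(q[0] ** 2 + q[1] ** 2 + q[2] ** 2 - 1.0) < 1e-12
```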

So we have 4 different methods. I don’t see why the Stan manual refers to Marsaglia; they seem to me like two completely different methods. The method Stan uses is one of the three methods reviewed at the beginning of the Marsaglia paper. In the paper Marsaglia notes that there was an earlier paper by Muller which discussed this method. Maybe the Stan manual should refer to this paper instead of the one by Marsaglia? Maybe we should call the method used by Stan the Muller method?

The Marsaglia polar transform is something different. It has nothing to do with spheres; it is a method to create a 2d standard normal by:

- sample a uniformly filled 2d-circle via rejection sampling from a uniform square.
- project these samples to a 2d-standard normal
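For concreteness, those two steps look like this in Python (a minimal sketch of the classic polar method):

```python
import math
import random

def marsaglia_polar_normal(rng=random):
    # step 1: rejection-sample a point uniform on the open unit disk
    while True:
        u = rng.uniform(-1.0, 1.0)
        v = rng.uniform(-1.0, 1.0)
        s = u * u + v * v
        if 0.0 < s < 1.0:
            break
    # step 2: project the disk sample to two independent standard normals
    f = math.sqrt(-2.0 * math.log(s) / s)
    return (u * f, v * f)
```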

Is it the case that Stan’s current method and the “sphere method” converge in high dimensions due to concentration of measure?

I think so, but I didn’t test it. There might be a difference in performance because the sphere method might have a less nicely shaped gradient. In the other thread we were investigating different gradient shapes, but didn’t test for this particular aspect as far as I remember.

I’ve been thinking more about the method introduced in the Marsaglia paper (which I’m going to call the Marsaglia method, hopefully not confusing anyone given the previous use of this name for Stan’s current method).

Using this method in Stan (I will come up with a Stan implementation when I come back from work) would have some nice properties. It is applicable to 3d spheres; I’m not sure whether it also works in higher dimensions. For the 2d case it wouldn’t offer any advantage over a polar parameterisation (which in 2d just encodes the position on the circle as an angle from 0 to 2pi). In the 3d case, however, it does have an advantage, which I’m going to explain now.

The Marsaglia method maps a filled circle to a sphere surface. Imagine you’re standing on (embedded in) the surface of the sphere and you have 360° surround vision. Logically, the sphere fills exactly 50% of your field of view. The part of your field of view that is covered by the sphere has the shape of a filled disk. This gives us a rough intuition of how the transform works: points in the center of the disk are on the opposite side of the sphere, while points very close to the edge of the disk are very close to us on our side of the sphere. This means that all points near the edge are very close to each other.

Unlike the polar method, there is no “seam” on the sphere surface, and there is only one “pole” (where we are standing), rather than two. Therefore the “points close on the sphere are not close in unconstrained space” problem arises less than in the polar parameterisation. It does arise near the pole, but it isn’t nearly as problematic near a pole as near a seam. The reason is that all points near the pole are arranged in a single continuous patch of space in the unconstrained representation (in the Marsaglia method this patch is the circumference of the disk). Therefore, although the points are not close together in unconstrained space, there is always a path the sampler can use to travel between them: since they are close together on the sphere, they have similar posterior density, which allows the sampler to travel freely along the patch in unconstrained space (the disk circumference in the Marsaglia method).

Edit: note of course a disk cannot be an unconstrained representation in Stan. But it can be mapped to a standard-normal which serves as the unconstrained representation. See my next post for actual code.

Here it is:

```
functions {
  // inverse Marsaglia polar transform, sensu
  // https://en.wikipedia.org/wiki/Marsaglia_polar_method
  vector mapFromGaussianToDisk(
    vector v2 // length-2 numeric vector
  ) {
    real x2 = v2[1];
    real y2 = v2[2];
    real r2 = sqrt(x2^2 + y2^2);
    real r1 = sqrt(exp((x2^2 + y2^2) / -2));
    vector[2] v1;
    v1[1] = x2 * r1 / r2;
    v1[2] = y2 * r1 / r2;
    return v1;
  }
  // Marsaglia transform, sensu Marsaglia 1972 (doi: 10.1214/aoms/1177692644)
  // (https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-43/issue-2/Choosing-a-Point-from-the-Surface-of-a-Sphere/10.1214/aoms/1177692644.full)
  vector mapFromDiskToSphereSurface(
    vector v // length-2 numeric vector
  ) {
    real s = v[1]^2 + v[2]^2;
    vector[3] p;
    p[1] = 2 * v[1] * sqrt(1 - s);
    p[2] = 2 * v[2] * sqrt(1 - s);
    p[3] = 1 - 2 * s;
    return p;
  }
}
data {
}
parameters {
  vector[2] pGaussian;
}
transformed parameters {
  vector[2] disk = mapFromGaussianToDisk(pGaussian);
  vector[3] sphere = mapFromDiskToSphereSurface(disk);
}
model {
  pGaussian ~ normal(0, 1);
}
```

And here the R-Code to run it:

```
library(cmdstanr)
library(tidyverse)
library(tidybayes)
library(rstan)

stanModel =
  cmdstan_model("stanCode_3dSphereOriginalMarsagliaMethod.stan")
model_cmdstan =
  stanModel$sample(
    data = list()
  )
model =
  rstan::read_stan_csv(
    model_cmdstan$output_files()
  )

# make a nice pairplot
get_variables(model)
model %>%
  spread_draws(
    pGaussian.1
    , pGaussian.2
    , disk.1
    , disk.2
    , sphere.1
    , sphere.2
    , sphere.3
    , lp__
    , accept_stat__
    , treedepth__
    , stepsize__
    , divergent__
  ) %>%
  mutate(
    .chain = .chain + runif(n(), -.3, .3)
    , lp__ = lp__ + runif(n(), -.3, .3)
    , treedepth__ = treedepth__ + runif(n(), -.3, .3)
    , divergent__ = divergent__ + runif(n(), -.3, .3)
  ) %>%
  {
    par(bg = "#202020")
    pairs(
      .
      , pch = 46
      , upper.panel = NULL
      # , col = "#ffffffff"
      , col = hsv(scales::rescale(round(.$treedepth__)) * .6)
    )
    par(bg = "white")
  }
```

This is of course not a 1-to-1 implementation of the method from the Marsaglia paper (since that is a method to sample uniformly from a sphere, rather than to parameterize a spherical distribution in a sampler). But I think it’s a fairly straightforward translation to our purpose.

I coded up a 2d standard normal as usual, then transformed it to a distribution representing a uniformly filled 2d unit circle using an inverted version of the Marsaglia polar method (see the Wikipedia article), then transformed that filled circle to a sphere surface with the method introduced in the 1972 Marsaglia paper (the one referenced by the Stan manual section on unit vectors).
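The composed transform can be sanity-checked outside Stan; here is a Python sketch mirroring the two Stan functions above (same formulas, my own function names):

```python
import math

def gaussian_to_disk(x, y):
    # inverse of the Marsaglia polar transform: maps a standard-normal pair
    # to a point uniformly distributed on the open unit disk
    s = x * x + y * y                        # assumes (x, y) != (0, 0)
    r_disk = math.sqrt(math.exp(-s / 2.0))   # disk radius implied by s
    r = math.sqrt(s)
    return (x * r_disk / r, y * r_disk / r)

def disk_to_sphere(u, v):
    # Marsaglia (1972) map from the unit disk to the unit-sphere surface
    s = u * u + v * v
    return (2 * u * math.sqrt(1 - s),
            2 * v * math.sqrt(1 - s),
            1 - 2 * s)

u, v = gaussian_to_disk(0.3, -1.2)
assert u * u + v * v < 1.0       # lands strictly inside the unit disk
p = disk_to_sphere(u, v)
assert abs(p[0] ** 2 + p[1] ** 2 + p[2] ** 2 - 1.0) < 1e-9
```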

I didn’t include a likelihood in this barebones test.

I don’t know if this concept is applicable to higher-dimensional spheres, simply because I don’t have training in higher-dimensional math.

That’s right.

Me, either, now that I look at Marsaglia’s paper. I probably wrote that doc without consulting the original reference and just thought that’s what the method was called. I may have just misunderstood what someone better informed told me.

I created an issue to update the doc:

I think people have mostly converged to a similar perspective but let me add a few words.

As I’ve discussed various times in other places (see for example Understanding non-bijective variable transformations - #19 by betanalpha) a sphere cannot be directly parameterized with real numbers. One can parameterize *most* of a sphere using real numbers but there will always be some artifact due to the incomplete parameterization. Whether or not that artifact has practical consequences depends on the details of a particular application.

For example let’s consider the one-dimensional circle which can be *almost* completely parameterized by real intervals such as \theta \in (0, 2 \pi). Notice that the point \theta = 0 = 2 \pi is not included in this interval; the exclusion of this point is the artifact. Whether or not that artifact will be problematic will depend on the target distribution over this space.

If our target distribution is specified by a uniform density function,

then the artifact will be pretty negligible. In particular we can implement this model in Stan with

where y is unconstrained. Taking the Jacobian into account gives a well-behaved unconstrained density function \pi(y) with which Stan will have no trouble.

But what happens if our target distribution is specified by a von Mises density function,

The problem here is that the target density function concentrates right on that artifact, and we won’t be able to ignore it so readily. Indeed once we transform to the unconstrained space as before we get a much less pleasant, multimodal density function.

Notice that the von Mises density function is unimodal – this multimodality is a consequence of not completely parameterizing the circle.

At the same time if we shift the von Mises density function to the other side of the circle,

then the transformed density function will be manageable again.

In higher dimensions the artifact of using hyperspherical coordinates goes from being a point to a line, but the qualitative consequences are the same. This incomplete parameterization can be useful, but only if the target distribution allocates negligible probability to the neighborhood around the artifact, which has to be validated by the user for each particular application.

The current `unit_vector` method is based on embedding a sphere in a higher-dimensional real space and then lifting any target distribution over that initial sphere to the higher-dimensional space. In particular an implicit distribution is imposed over the radial direction transverse to the embedded sphere.

As with the previous approach, this method works reasonably well for a uniform distribution over the sphere but can run into problems when the distribution concentrates into smaller neighborhoods, especially in higher dimensions. Conceptually, the lifted distribution in the higher-dimensional embedding space can become “wedge”-like and frustrate the sampler.

These methods aren’t quite complementary, but they can both be useful in different circumstances. I wouldn’t be against presenting both provided that their limitations are clearly stated. In particular the hyperspherical coordinate method has to be presented as an approximation that will be valid only when the spherical coordinates are properly aligned with the target distribution.

Suppose we want to sample x \in \mathbb{S}^n \subset \mathbb{R}^{n+1} with the spherical embedding approach, i.e. sampling y \in \mathbb{R}^{n+1}, where x = \frac{y}{\lVert y \rVert}. As @betanalpha explained, the log-density “correction” of -\lVert y \rVert^2/2 that Stan uses puts an implicit density on r=\lVert y \rVert.

Specifically, we can think of the transformation from x to y as two steps:

- augment the density \pi_x(x) with a prior \pi_r(r) for r > 0 to get a new joint density \pi_x(x)\pi_r(r).
- bijectively map (x, r) \in \mathbb{S}^n \times \mathbb{R}_{>0} to \mathbb{R}^{n+1} \setminus \{0\} with y = r x, and use the Jacobian correction -n \log r to get the log-density \log\pi_y(y) = \log\pi_x(y/\lVert y \rVert) + \log\pi_r(\lVert y \rVert) - n\log \lVert y \rVert.

We are free to choose any continuous proper density for \pi_r(r). Stan implicitly chooses the Chi distribution with n+1 degrees of freedom, which has the log-density \log\pi_r(r) = n \log(r) - \frac{1}{2} r^2. The n \log r terms, which attract draws toward 0 in one expression and repel them in the other, cancel perfectly, so that we are left with \log \pi_y(y) = -\frac{1}{2} \lVert y\rVert^2.
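The cancellation is easy to verify numerically (a quick Python check; `chi_logpdf_unnorm` is my own helper name):

```python
import math

def chi_logpdf_unnorm(r, k):
    # unnormalized log-density of a Chi distribution with k degrees of freedom
    return (k - 1) * math.log(r) - 0.5 * r * r

n = 4  # sphere S^n embedded in R^(n+1)
for r in (0.5, 1.0, 2.3):
    # prior with k = n + 1 plus the -n*log(r) Jacobian term ...
    lhs = chi_logpdf_unnorm(r, n + 1) - n * math.log(r)
    # ... leaves exactly the radial part of a standard normal
    assert abs(lhs - (-0.5 * r * r)) < 1e-12
```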

We still have a singularity at y=0 that we need to avoid, and as noted above, when \pi_x(x) is concentrated, \pi_y(y) has a wedge geometry that is challenging to sample. Due to concentration of measure, for large n, draws near y=0 should be rare, so this shouldn’t be as much of a problem. But for low n, it will manifest as divergences.

An alternative solution to those mentioned so far is to use \log\pi_r(r) = -\frac{1}{2} r^2 + (n + a) \log r for a \ge 0, i.e. a Chi distribution with n+a+1 degrees of freedom. This repels y from y=0, and the degree of repulsion can be tuned by the user by increasing a.
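After the -n \log r Jacobian term cancels, the effective radial log-density in the unconstrained space is a \log r - \frac{1}{2} r^2, whose mode sits at r = \sqrt{a}, so increasing a indeed pushes mass away from the singularity. A quick numeric check (my own helper, not Stan code):

```python
import math

def radial_logdensity(r, a):
    # effective unconstrained radial log-density once the n*log(r)
    # Jacobian term has cancelled: a*log(r) - r^2/2
    return a * math.log(r) - 0.5 * r * r

# setting the derivative a/r - r to zero gives the mode r = sqrt(a)
for a in (1.0, 4.0, 9.0):
    mode = math.sqrt(a)
    for r in (0.9 * mode, 1.1 * mode):
        assert radial_logdensity(mode, a) > radial_logdensity(r, a)
```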

Here we see the log-density for a von Mises distribution with concentration of 100 transformed to the latent space for various values of a:

Yet another alternative is to use \log \pi_r(r) = -\frac{1}{2} (r-m)^2 + n \log r for m \ge 0:

I’m not sure how this could be supported in Stan.

As requested by @spinkney, here are Stan models for both of these two parameterizations:

```
data {
  int<lower=0> N;
  real<lower=0> a;
  unit_vector[N + 1] mu;
  real<lower=0> kappa;
}
parameters {
  vector[N + 1] z;
}
transformed parameters {
  real<lower=0> r = sqrt(dot_self(z));
  unit_vector[N + 1] x = z / r;
}
model {
  target += -r^2 / 2 + a * log(r);
  target += kappa * dot_product(mu, x);
}
```

```
data {
  int<lower=0> N;
  real<lower=0> m;
  unit_vector[N + 1] mu;
  real<lower=0> kappa;
}
parameters {
  vector[N + 1] z;
}
transformed parameters {
  real<lower=0> r = sqrt(dot_self(z));
  unit_vector[N + 1] x = z / r;
}
model {
  target += -(r - m)^2 / 2;
  target += kappa * dot_product(mu, x);
}
```

At first glance this looks to me like another way to do what I did in the thread over here:

Is that correct?

Indeed! Seems you already arrived at the same approach (and tried a few others it seems). Thanks for the link to the thread. Did you ever get around to turning that analysis into a small paper?

No. But I intend to at least finish the validation work that others requested in the thread. Maybe that will bring some new life to the topic.

Back then a few people said they could imagine working together on a paper version. If that ever happens I’d be happy to collaborate.

Thanks, @Seth_Axen. I really appreciate working code.

Hi, @Raoul-Kima. @mjhajharia and I are going to spend the summer trying to figure out a general framework for evaluating these things and then do a bunch of evaluations. We’re happy to collaborate and will reach out soon to potential co-authors.

@Adam_Haber did some interesting and related work on correlation matrix parameterizations. See: Discrepancy between TFP and Stan's CholeskyLKJ log prob · Issue #694 · tensorflow/probability · GitHub

More sophisticated constraints on the auxiliary radius definitely offer more flexibility, but my intuition is that the tuning problem will be difficult in general.

Any likelihood function that depends only on x will constrain only the angular extent of the posterior distribution over y. In general this constraint over x will be lifted to y by smearing it out to larger and smaller radii, where the smearing is determined by the implicit prior on the radius. As we go to higher and higher dimensions the smearing to larger radii will be stronger than the smearing to smaller radii due to concentration of measure, at least without some dimensionality-dependent prior model for the radius carefully tuned to compensate.

My hesitation is that it is this asymmetry, and not overlap with the singularity at zero, that causes many of the fitting problems with the embedded unit vector implementation. The posterior over y will appear pinched towards smaller radii, which can amplify any awkward geometry in the angular constraint even when the singularity is strongly excluded.

If an effective radial prior model has to be tuned for each problem then I don’t think it makes a great constrained-type implementation. That said, one can always implement it directly, as in @Seth_Axen’s nice demonstrations.

Great to hear there is progress! (sorry for taking so long to respond)

It may be out of scope at the moment, but:

As far as I understand it, tuning is only necessary when using a diagonal mass matrix, not when using a dense one. If so, a solution for cases where a dense matrix isn’t suitable might be a “partially dense” mass matrix, which only models the correlations between those dimensions that are part of the same unit vector.

This would suggest that the existing unit-vector approach works better at low than at high dimensions, but as far as I know the opposite is the case. The way I understand it is that since in higher dimensions (due to concentration of measure) the average radius grows larger relative to the spread of the radius, at high dimensions the “smearing” is in a sense less pronounced. In other words: I think it doesn’t matter that the typical radius is larger than that of the “radius prior” in high dimensions; the only thing that matters is that mean(radius)/sd(radius) is as large as possible.
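The mean(radius)/sd(radius) argument is easy to check numerically: the radius of an n-dimensional standard normal draw follows a Chi(n) distribution, whose mean grows like sqrt(n) while its standard deviation approaches a constant. A Monte Carlo sketch (my own illustration):

```python
import math
import random

random.seed(1)

def radius_stats(n, draws=2000):
    # radius of an n-dimensional standard normal draw (a Chi(n) variable)
    rs = [math.sqrt(sum(random.gauss(0, 1) ** 2 for _ in range(n)))
          for _ in range(draws)]
    mean = sum(rs) / draws
    sd = math.sqrt(sum((r - mean) ** 2 for r in rs) / draws)
    return mean, sd

# mean/sd grows with dimension (roughly like sqrt(2*n)), so the implicit
# radial distribution concentrates away from y = 0 in high dimensions
m3, s3 = radius_stats(3)
m50, s50 = radius_stats(50)
assert m50 / s50 > m3 / s3
```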

Therefore, the “donut-prior” I introduced in the other thread should only be needed in low dimensions.

This also means that the tuning problem is only relevant in low dimensions.

I suspect the tuning is sufficiently non-critical that it can be ignored in most cases. In the worst case the model would sample inefficiently, but still return correct results. I will put that on my todo list of stuff to test when I resume that project.

In the post containing this quote, and my followup post, I discussed (and provided working Stan code for) another way to code a unit vector, which I hadn’t seen discussed before. Nobody responded to that as far as I can see. I think the method has promise! Shall I open a new thread for that, or shall we keep the discussion here?

This is getting long. Go ahead and open a new post and reference this one so people can get the background if they want.

Thanks, @Raoul-Kima. I’m working with @mjhajharia, @Seth_Axen, and @Adam_Haber on evaluating potential transforms over a range of target models. We can throw yours into the mix. Is there a derivation of this somewhere other than the code? And is there a generalization to arbitrary `N`-vectors?

Essentially what we need for our evals is a Stan program that induces a uniform distribution (perhaps improper) over the relevant constrained variables.