A better unit vector

It’s widely known in developer circles that the unit-vector type is a bad apple. It doesn’t play nicely with HMC, and there are plenty of cases where divergences or treedepth issues occur because of how the constraint is formed.

There’s a particularly informative thread on the issues at Divergence/Treedepth issues with unit_vector. Unfortunately, I don’t believe the Jacobian adjustment suggested in that thread is correct. The derivative with respect to the unconstrained parameters results in a vector - not a square matrix - and one needs to introduce auxiliary parameters to turn it into a square matrix transform. Fortunately, that transform is just the spherical coordinate transform. Then there’s no need to fiddle with a prior on r (the radius of the sphere) because we can set it equal to 1. You’ll notice that this part bears similarity to the Marsaglia transform that Stan currently uses.
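Concretely, using one standard convention (the code below orders the components a bit differently, but the structure is the same): for x \in \mathbb{R}^n with radius r, angles \theta_1, \dots, \theta_{n-2} \in (0, \pi), and \theta_{n-1} \in (0, 2\pi),

x_1 = r \cos\theta_1, \qquad x_k = r \cos\theta_k \prod_{j=1}^{k-1} \sin\theta_j \ \ (2 \le k \le n-1), \qquad x_n = r \prod_{j=1}^{n-1} \sin\theta_j,

with Jacobian determinant

|J| = r^{n-1} \prod_{k=1}^{n-2} \sin^{n-1-k}(\theta_k).

Fixing r = 1 drops the r^{n-1} factor, leaving only the sine terms as the adjustment (these are what the target += loop in the > 2d code below accumulates on the log scale).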

The potential issue with spherical coordinates (one I believe @betanalpha has pointed out) is that an angle constrained between 0 and 2\pi maps to -\infty to \infty in the unconstrained space, yet we want 2\pi to be a neighbor of 0. Also, @bgoodri pointed out that a spherical transform was used in Stan at some point. I do not know if it was written this way or what issues were discovered.

I understand that concern; however, in my tests, where the space is restricted to between 0 and \pi for 2d and only one 0 to 2\pi parameter is introduced for > 2d, I don’t encounter any issues. In fact, I am able to resolve the current Marsaglia parameterization issues completely.

For example, in the post below (@mike-lawrence may be interested to see)

I can use the 2d circle parameterization and fit the model without divergences (it’s also twice as fast)

// saved as periodic_sphere.stan
functions {
  // case when N == 2
  vector unit_vector_transform_lp(vector theta, int N) {
    vector[N] unit_vec;

    unit_vec[1] = sin(theta[1]);
    unit_vec[2] = cos(theta[1]);

    return unit_vec;
  }
}
data {
  int n;
  vector[n] y;
  vector[n] x;
  real<lower=0> rate;
}
parameters {
  // unit_vector[2] phase_helper;
  vector<lower=0, upper=2 * pi()>[1] theta;
  real<lower=0> amplitude;
  real<lower=0> noise;
}
transformed parameters {
  vector[2] phase_helper = unit_vector_transform_lp(theta, 2);
  real phase = atan2(phase_helper[1], phase_helper[2]);
}
model {
  amplitude ~ weibull(2, 2);
  noise ~ normal(0, 1);
  y ~ normal(amplitude * sin(x * rate * 2 * pi() + phase), noise);
}

The R code is


library(cmdstanr)

n = 1e3
x = seq(0, 10, length.out = n)
rate = .5
noise = .1
phase = pi / 2
amplitude = 1

f = amplitude * sin(x * rate * 2 * pi + phase)
y = f + rnorm(n, 0, noise)
plot(x, y, type = 'l')
lines(x, f, col = 'green', lwd = 2)

stan_dat <- list(
  n = length(y)
  , x = x
  , y = y
  , rate = rate
)

periodic_sphere_mod <- cmdstan_model("periodic_sphere.stan")
mod_sphere_out <- periodic_sphere_mod$sample(
  data = stan_dat,
  parallel_chains = 4,
  adapt_delta = 0.8
)

In the case of 3 or more dimensions the Stan code is

// Copyright (c) 2022, Sean Pinkney
// BSD 3-Clause License
functions {
  // case when N > 2
  vector unit_vector_transform_lp(vector theta1_, vector theta, int N) {
    vector[N] unit_vec;
    real theta1 = theta1_[1];

    unit_vec[1] = sin(theta1) * prod(sin(theta));
    unit_vec[2] = cos(theta1) * prod(sin(theta));
    for (i in 1:N - 3)
      unit_vec[i + 2] = cos(theta[i]) * prod(sin(theta[i + 1:N - 2]));

    unit_vec[N] = cos(theta[N - 2]);

    // log Jacobian of the n-dimensional polar transform with r = 1
    for (i in 2:N - 1)
      target += (i - 1) * log(sin(theta[i - 1]));

    return unit_vec;
  }

  // case when N == 2
  vector unit_vector_transform_lp(vector theta, int N) {
    vector[N] unit_vec;

    unit_vec[1] = sin(theta[1]);
    unit_vec[2] = cos(theta[1]);

    return unit_vec;
  }
}
data {
  int uv;
}
parameters {
  vector<lower=0, upper=2 * pi()>[1] theta1;
  vector<lower=0, upper=pi()>[uv == 2 ? 0 : uv - 2] theta;
  // unit_vector[uv] unit_stan;
}
transformed parameters {
  vector[uv] unit_vec = uv == 2 ? unit_vector_transform_lp(theta1, uv)
                                : unit_vector_transform_lp(theta1, theta, uv);
}
model {
}
generated quantities {
  real r_out = sqrt(sum(pow(unit_vec, 2)));
}

edit: fixed the issue in N = 2. @Raoul-Kima pointed out that the bound needs to be 0 to 2\pi. Also, no Jacobian adjustment is needed in the 2d case because it’s the radius, which equals 1, a constant.


I’m not sure if I understand you correctly, so maybe I just misunderstood your post when I wrote the following answers.

I assume you’re talking about the “Jacobian correction” on the sphere radius.
It doesn’t need to be correct, as it is orthogonal to all model parameters; it can be left out or modified. The sphere radius prior can (and should) be freely chosen to optimize sampler performance and robustness. That is why in that thread we discussed a variety of different terms for this purpose, and these do not all include a Jacobian correction. Still, I’d be interested to hear what you think the correct Jacobian correction would be.

I don’t understand this sentence (it has been a while since I last thought about this topic). Is there another way to say this?

It is my understanding that “setting it equal to 1” would re-introduce the problems which unit-vectors are supposed to solve in the first place. As far as I can tell, it is necessary to have an extra dimension to solve them. That is, unless you modify the sampler internally to “teleport” the particle to the sphere surface at each leapfrog step and set the momentum orthogonal to the sphere surface to zero. If that is what you mean, then I agree.

That’s the (or at least “a”) reason for having unit-vectors.
For example, when using a 0 to 2\pi parameterisation, issues arise with posteriors which have meaningful mass on both sides of the boundary but negligible mass away from it. If 0 and 2\pi are not neighbours, the result is a strongly bimodal distribution, which leads to sampling problems.
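A minimal toy model illustrating that failure mode (a hypothetical sketch, not code from the thread):

parameters {
  real<lower=0, upper=2 * pi()> theta;
}
model {
  // mass concentrated at theta = 0 (equivalently 2 * pi); on the
  // constrained interval this shows up as two modes, one at each end
  theta ~ von_mises(0, 50);
}

Chains tend to settle near 0 or near 2\pi and rarely cross between the two ends.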

I can’t follow you there. In the previous paragraph (which you seem to refer to in the quoted sentence) you were talking about a 0 to 2\pi boundary rather than 0 to \pi.

I don’t understand the example Stan code, but maybe I’m missing something; I haven’t tried to run your code to check. Why is there a target += statement in the transformed parameters block (hidden in a function)? I thought that isn’t allowed. Does it have any effect? I also don’t understand the purpose of that target += statement. The code looks to me as if, were this target statement to have an effect, the model wouldn’t give the correct results, since (it seems to me) it essentially defines an (unwanted) prior on theta. Did you try fitting the model to datasets with phase angles other than \pi/2? E.g. a phase angle of -\pi/2 (which I would think doesn’t work, because you restricted theta to 0 to \pi).
I’ll stop here, as I’m probably missing something basic. I haven’t looked at the second Stan model yet.


The derivative of sqrt(dot_self(x)) with respect to x is

\frac{\partial}{\partial x} (x^\top x)^{1/2} = x \, (x^\top x)^{-1/2}.

If x is size n, then this derivative is also a size-n vector. There’s no determinant of this because the function is many-to-one. We need to introduce an auxiliary variable and then change coordinates. That is, if we let the function above be r = \sqrt{x^2 + y^2}, then the density of r can be expressed as

\pi(r) = \int d\theta \, \pi(r, \theta) = \int d\theta \, \pi\big( x(r, \theta), y(r, \theta) \big) \, | J(r, \theta) |

which is just the parameterization I’m speaking of, generalized to the n-d case.
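In the 2d case this is explicit: with x = r \cos\theta and y = r \sin\theta,

J(r, \theta) = \begin{pmatrix} \cos\theta & -r \sin\theta \\ \sin\theta & r \cos\theta \end{pmatrix}, \qquad |\det J(r, \theta)| = r,

so at r = 1 the adjustment is a constant, which is why the 2d code above needs no target increment.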

I am following the derivations from this note below
n_d_polarcoordinates_note1.pdf (193.6 KB)

See 9.5 Function bodies | Stan Reference Manual. You can access the target using a user-defined function that has the suffix _lp.
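For example, a minimal sketch (the function and variable names here are hypothetical):

functions {
  // the _lp suffix lets the function increment target; such functions
  // may be called from the transformed parameters and model blocks
  real exp_transform_lp(real y) {
    target += y; // log |d/dy exp(y)| = y
    return exp(y);
  }
}
parameters {
  real y;
}
transformed parameters {
  real z = exp_transform_lp(y); // z > 0, with the Jacobian accounted for
}
model {
  z ~ lognormal(0, 1);
}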

In the examples I have included above, one could leave this adjustment off because a prior is not put on the unit vector. All the constrained types in Stan include the adjustment in case a user wishes to put a prior on the output. See the attached pdf for the derivation of the Jacobian determinant.

This is what I’m also most uncertain about, and discussing it is my motivation for posting. It’s possible that this should be 0 to 2\pi. My hope is that someone tests and confirms that it works, or finds the mistake and can propose a correction.


Should that work also for _lpdf and _lpmf? I thought _lp was deprecated?

I’m not sure about having _lpdf/_lpmf access the target. I bet @WardBrian knows. The _lp functions are great! I use them all the time. There’s a bit more about them in the SUG at 20.2 Functions as statements | Stan User’s Guide.

Re: _lp. This is not the same as _log, which is deprecated and should be changed to _lpdf or _lpmf. Its only purpose is to have side effects on the target.

I don’t believe target += can be used in an _lpdf/_lpmf function. I think it is restricted to _lp functions and the model block.

Also, the use of _lp functions in the transformed parameters block is a sort of “known bug” which was carried over from stanc2 to preserve backwards compatibility. Discussions to directly allow target += in that block stalled, so it is in a sort of inconsistent state right now.

@Raoul-Kima I fixed the 2d case in the original post (the code is updated) by changing the 2d upper limit to 2\pi. Also, there’s no need for the Jacobian adjustment in that case, because I believe it comes out to be r, which is just 1. Testing with a phase of -\pi/2 did not work originally; the model now recovers the parameters with no divergences.

I tried out your (new) 2d code now.
One can see the multimodality issue I talked about above by setting phase in the data simulation to something that falls on the boundary of theta, e.g. phase = 0. Be sure to run a bunch of chains and have the other parameters set as in the code you posted. The different chains will not give the same results. No single chain will give the correct answer, but when running several chains, sometimes their combined samples randomly happen to give the correct distribution.

As far as I know this is the only issue with this type of parameterisation, and it is what unit vectors are supposed to solve.

I also tested the Jacobian that was in the original version of the 2d model (target += log(sin(theta))). It seems incorrect: it biases the results and causes divergences when the sampler crosses multiples of \pi. In the original version of the model the divergence issue didn’t show, because theta was restricted to 0 to \pi. From what you wrote later, it seems you changed your view on what the correct 2d Jacobian would be, as you then wrote “I believe it comes out to be r which is just 1”, which sounds plausible to me.

I agree that a Jacobian correction is needed when parameterising the sphere with polar coordinates. The unit-vector parameterisation doesn’t need this. The Jacobian that was discussed in the unit-vector thread you linked is optional and solely serves the purpose of optimizing sampler performance by controlling how the sphere radius at which the sampler particle moves scales with sphere dimensionality. It’s the correction needed for the sampler to correctly follow the distribution specified for the radius (which may or may not be desirable).


Yes, that makes sense. My proposal is that we need both parameterizations in Stan. I nearly always get divergences with the Marsaglia method in 2-3 dimensions, and I have treedepth issues when d gets larger. I often don’t actually care about the unit vector itself but use it in constructing other objects that need a random unit vector; I would use this spherical method for that. In the case of the example model, if I had reason to believe it was near phase 0, I would use the Marsaglia method or the smaller sd of 0.1 proposed in the linked thread.

Now that you say it, that makes sense: when using polar coordinates, at the poles things that are far apart in unconstrained space are close together on the sphere. So it suffers from the same problems as the (still current?) unit-vector implementation of Stan, the problems I explained in the opening post of the other thread which you linked.

The sphere method scales well to high dimensions. Also, if your model is reasonably complex, I think it wouldn’t cause much slowdown. It just needs a certain minimum treedepth so the sampler can follow the sphere surface. It would probably only strongly slow down models which would have a lower treedepth without the sphere, but not models which have such a treedepth anyway (but I didn’t test that). And the required treedepth isn’t very high either (see the opening post in the other thread).
Another way it might cause slowdowns is when a large proportion of the model dimensions belong to low-dimensional spheres. In this case the one additional dimension per sphere, compared to the polar method, noticeably increases the posterior dimensionality.

So to sum up: the sphere method can sometimes be (not much) slower than the polar method, but it seems robust and works in all cases. In other words, I suggest that users should default to the sphere method and view the polar method as a high-performance special case that only works for particular posterior shapes and requires additional detailed knowledge of its pitfalls to use safely (it can lead to biased posteriors without triggering any of the diagnostics, e.g. when using phase = 0 in the example code above and all chains end up in the same mode).
Given the difficulty of its use and the potential for silent bias, I’m not sure whether it is a good idea to have a “convenient” implementation of it in Stan. Or at least one could try to somehow discourage users from using it without thinking about the implications.

Hm, I think there are two mistakes in my previous post:

  1. Stuff being far apart in the polar posterior but (potentially infinitely) close on the sphere is not the problem from the other thread, but the opposite problem. I suspect it’s less problematic, but it can still cause the same problems under certain circumstances.

  2. I think there’s something that can cause large slowdowns and treedepth problems with the sphere method when using a diagonal mass matrix. Back when the other thread was active, I think I falsely assumed that Stan’s default setting is to use a dense metric. But when using a diagonal one, if the model posterior is very narrow, then the sphere posterior can be a narrow ridge, which causes large treedepth. I didn’t test that yet. I theorize that this means it makes sense to adjust the width of the radius distribution to match the width of the model posterior (i.e. the components along the sphere surface) when using a diagonal metric.

So for both parameterizations an additional tuning parameter would remove the problems, wouldn’t it?

No, it removes some issues but not all. I initially started coding the spherical parameterization when tuning the radius with the Marsaglia method still didn’t alleviate the divergence/treedepth issues for my model. The spherical parameterization did, without any additional tuning.

In the spirit of the title of this post, the answer is perhaps to be clearer about the issues in the SUG. An initial start is to write up the findings from the linked post started by @Raoul-Kima and compare the two parameterizations. A nice thing would be to have clear guidance on when these issues occur and, if it is possible in HMC, different ways to mitigate them.

If anyone is interested in helping to put this together, message me here, on Slack, or on GitHub.


My current understanding is that the sphere method doesn’t require tuning to work, but can benefit from it in terms of performance.

Although that other thread hasn’t been active for a while, it’s still on my to-do list to finish the testing that people asked for there, to be as sure as possible whether the sphere method is reliable in all situations (although I think it’s already pretty safe to use) and in what cases its performance isn’t optimal and how to improve it. When I find the time to work on these things properly again, my priority is to finish that first. After that, my next step is to write up the findings of that thread.

But I guess a quick summary so far is: the method seems applicable to all situations and reliable. Which shape the radius “prior” has doesn’t matter much. Whether you use the radius Jacobian doesn’t matter much. The donut shouldn’t be wider than sd = 0.1 if the mean is 1, preferably a bit narrower. So I suggest just using a normal(1, 0.05) “prior” on the radius. If your posterior is very narrow, a narrower prior (e.g. normal(1, 0.005)) might provide better performance.
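As a concrete sketch of that recommendation (my reading of the method from the other thread; the variable names are hypothetical):

data {
  int<lower=2> N;
}
parameters {
  vector[N] v; // unconstrained; one extra dimension vs. the polar method
}
transformed parameters {
  real r = sqrt(dot_self(v));
  vector[N] unit_vec = v / r; // the direction is the quantity of interest
}
model {
  r ~ normal(1, 0.05); // the "donut" keeping the sampler near radius 1
}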


The original code for unit vector never worked. That’s why we switched to the current version.

What’s the sphere method and what’s the polar method? I’m getting lost in the naming. I think it’s going to be confusing to give users two different data types that represent unit vectors. It’s not inconsistent with the language; we’d just have to call one something other than unit_vector, such as unit_vector_polar or something like that.

My original intention was to allow target += in the transformed parameters block, but it somehow didn’t get implemented except through _lp functions. My thinking is that’s where we would define the Jacobian for transforms, in the same block as the transform. But there were objections to adding target += to the transformed parameters block, primarily from @betanalpha.

I would personally be in favor of adding a jacobian += statement to the transformed parameters block, so that we could actually code our built-in transforms in Stan. As is, we don’t have a way to code a Jacobian that gets turned off during optimization within Stan itself, even though that’s how our built-in transforms work. I’ve just never gotten around to putting the code together and getting the votes to overrule the objections.
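Hypothetically, a transform coded that way might look like this (a sketch of the proposal, not syntax that exists today):

parameters {
  real log_sigma; // unconstrained
}
transformed parameters {
  real<lower=0> sigma = exp(log_sigma);
  // proposed: counted during sampling, dropped during optimization
  jacobian += log_sigma; // log |d sigma / d log_sigma|
}
model {
  sigma ~ lognormal(0, 1);
}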


Using this model

parameters {
  unit_vector[2] alpha;
}

I can run in cmdstanr,

> library('cmdstanr')
> mod <- cmdstan_model('units.stan')
> fit <- mod$sample()

I don’t see any divergences.

> fit <- mod$sample()
Running MCMC with 4 sequential chains...

Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 1 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 1 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 1 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 1 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 1 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 1 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 1 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.7 seconds.

The effective sample sizes are estimated to be roughly the number of draws (4000).

> fit
 variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
 lp__     -1.03  -0.73 1.02 0.73 -3.12 -0.05 1.00     1577     2141
 alpha[1]  0.00   0.00 0.70 1.03 -0.99  0.99 1.00     4236     3550
 alpha[2]  0.01   0.03 0.71 1.06 -0.98  0.99 1.00     3749     3278

I also don’t see any divergences running in 10, 100, or 1000 dimensions (that’s as far as I went). In fact, the ESS gets better per draw as the number of dimensions goes up, averaging around ESS = 6000 for 4000 draws in 100 and ESS = 8000 for 4000 draws in 1000 dimensions.
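For reference, the higher-dimensional runs presumably used a model along these lines (a hypothetical generalization of units.stan):

data {
  int<lower=2> N; // e.g., 10, 100, or 1000
}
parameters {
  unit_vector[N] alpha;
}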

There seems to be some terminology confusion here.

When I said “sphere method” I meant the method I introduced in the other thread that is linked in the opening post. I realized that @spinkney probably used the term “sphere method” to mean what I called the “polar method”, which is to represent, e.g., coordinates on a 3d sphere using two coordinates, in the same way geographic coordinate systems often work (longitude and latitude). I called this polar coordinates because this coordinate system has points where coordinates converge, called poles.

When I wrote “unit-vector” I assumed it uses a parameterisation that doesn’t have anything to do with polar coordinates, and I assumed that not having to use polar coordinates is the reason for the existence of unit-vectors. But then it dawned on me that (if I understood correctly) the unit vector was itself once based on polar coordinates, so people might not understand the term the way I used it. I’ll try not to associate the term “unit-vector” with any particular implementation from now on.

I’m not sure what @spinkney means by “Marsaglia method”. I read up again on what the Marsaglia method is on Wikipedia (Marsaglia polar method - Wikipedia), and clearly it is not directly applicable to HMC sampling without some modifications. So I’m not sure what @spinkney meant when he wrote in the opening post “I am able to resolve the current Marsaglia parameterization issues completely.” At first I thought the 2d Stan code he had in that post was supposed to be an example of the Marsaglia method, but now I think it’s an example of what he called the sphere method and I called the polar method. This is where some confusion on my side came from: I thought that was the Marsaglia method.
I’ve been thinking about how the Marsaglia method (sensu Wikipedia) could be adapted to HMC, and came to the conclusion that if we think that through to the end, we probably end up with exactly the implementation I presented in the other thread linked in the opening post. Which means that maybe what @spinkney meant by the Marsaglia method is my method from the other thread? Or the current unit-vector implementation of Stan?
If that is the case, we have our terminologies completely reversed: what I called the sphere method he called the Marsaglia method, and what he called the sphere method I called the polar method (and suspected, but didn’t write, to be the Marsaglia method).

Maybe @spinkney could state whether I now read his terminology correctly, or alternatively clarify what he means by “Marsaglia method”, what he thinks are the issues with it that he wrote he resolved in the opening post, and how he resolved them?

I didn’t fully check/understand the higher-dimensional Stan code @spinkney provided yet. (If he could provide the associated R code, that would help.)

Assuming the unit_vector is still implemented the same way as back when I started the other thread, the divergences arise from the interaction of the internal parameterisation with the likelihood. So if you don’t have a likelihood, no divergences occur. Weak likelihoods give fewer divergences than strong ones. It’s all explained in the opening post of that other thread.


Marsaglia is the author of the method that Stan uses; the paper is referred to in the docs. See 10.8 Unit vector | Stan Reference Manual.

Marsaglia, George. 1972. “Choosing a Point from the Surface of a Sphere.” The Annals of Mathematical Statistics 43 (2): 645–46.

The method I originally proposed I have been calling spherical, but I should refer to it as the n-dimensional polar transform.


I also need to test choosing theta and phi with the initial transform from https://math.stackexchange.com/questions/3954611/choosing-point-uniformly-on-surface-of-sphere