A better-er unit vector

This is a follow-up to A better unit vector.

I’ve decided to put these things up on a blog because writing up a paper feels like too much. The post is at The Hyper-Tanh Peel: A Novel Parameterization for Bayesian Spheres – Pinkney’s Sufficient Stats. What follows is mostly a copy-paste from the blog.

I’m finding the parameterization to be very stable, but I might’ve missed some issues with it.

1. Background

  1. Hyperspherical coordinates (angles) involve boundaries (0, \pi) that hinder Hamiltonian Monte Carlo (HMC) sampling.
  2. Muller’s method (normalizing standard Gaussians, how Stan parameterizes unit_vector) is isotropic but over-parameterized (K parameters for K-1 dimensions) and suffers from a gradient singularity at the origin.

I’m calling this the hyper-tanh peel bijective parameterization. It is a mapping \mathbb{R}^{K-1} \to S^{K-1}. It uses “Logistic Geometry” to create a smooth, unconstrained manifold that is numerically stable and avoids the singularities of traditional methods.

2. Mathematical Formulation

The Transformation

The method “peels” dimensions one by one using the hyperbolic tangent function (\tanh), which maps (-\infty, \infty) \to (-1, 1), and closes the remaining 2D subspace with a stereographic projection.

Algorithm:

Let r_1 = 1. For i = 1, \dots, K-2:

\begin{aligned} y_i &= r_i \cdot \tanh(x_i) \\ r_{i+1} &= r_i \cdot \text{sech}(x_i) \quad \text{where } \text{sech}(x) = \frac{1}{\cosh(x)} = \sqrt{1 - \tanh^2(x)} \end{aligned}

The final two dimensions are parameterized as a stereographic coordinate:

\begin{aligned} \theta &= 2 \arctan(x_{K-1}) \\ y_{K-1} &= r_{K-1} \cos(\theta) = r_{K-1} \frac{1 - x_{K-1}^2}{1 + x_{K-1}^2} \\ y_{K} &= r_{K-1} \sin(\theta) = r_{K-1} \frac{2 x_{K-1}}{1 + x_{K-1}^2} \end{aligned}
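One step worth making explicit (my note, not in the blog): the output lands on the unit sphere by construction, because each peel step preserves the remaining squared radius, and telescoping that identity through the stereographic close gives a unit norm:

\begin{aligned} y_i^2 + r_{i+1}^2 &= r_i^2 \left( \tanh^2(x_i) + \text{sech}^2(x_i) \right) = r_i^2 \\ \sum_{i=1}^{K-2} y_i^2 + y_{K-1}^2 + y_K^2 &= \sum_{i=1}^{K-2} y_i^2 + r_{K-1}^2 \left( \cos^2\theta + \sin^2\theta \right) = r_1^2 = 1 \end{aligned}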

The Jacobian Adjustment

The log-determinant of the Jacobian is:

\log |J| = \sum_{i=1}^{K-2} -(K-i) \log(\cosh(x_i)) + \log(2) - \log(1 + x_{K-1}^2)
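A quick way to check this (my sketch, not from the blog) is to identify the peel with standard hyperspherical angles via \cos\phi_i = \tanh(x_i), so that \sin\phi_i = \text{sech}(x_i) and |d\phi_i| = \text{sech}(x_i)\, dx_i, and to recall the spherical surface element \prod_{i=1}^{K-2} \sin^{K-1-i}(\phi_i)\, d\phi_i \, d\theta:

\begin{aligned} \prod_{i=1}^{K-2} \sin^{K-1-i}(\phi_i)\, d\phi_i \cdot d\theta &= \prod_{i=1}^{K-2} \text{sech}^{K-1-i}(x_i)\, \text{sech}(x_i)\, dx_i \cdot \frac{2}{1+x_{K-1}^2}\, dx_{K-1} \\ &= \prod_{i=1}^{K-2} \text{sech}^{K-i}(x_i)\, dx_i \cdot \frac{2}{1+x_{K-1}^2}\, dx_{K-1} \end{aligned}

Taking logs recovers the expression above.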

This implies that:

  1. The peeling parameters follow a sech distribution: p(x_i) \propto \text{sech}^{K-i}(x_i).
  2. The core parameter follows a Cauchy distribution: p(x_{K-1}) \propto (1+x_{K-1}^2)^{-1}.

Both densities come directly from exponentiating the corresponding terms of \log |J|.

3. Logistic Geometry & Gradient Stability

A problem with Muller’s method when used with HMC is the singularity at the origin. In Muller’s parameterization, \mathbf{y} = \mathbf{z} / \|\mathbf{z}\|. As \mathbf{z} \to 0, the gradient \nabla_\mathbf{z} \mathbf{y} explodes to infinity. This creates a “funnel” that traps HMC samplers.

The Hyper-Tanh Peel maps the origin of the parameter space \mathbf{x}=\mathbf{0} to a fixed pole of the sphere, \mathbf{y}=(0, \dots, 0, 1, 0) (the coordinate singled out by the stereographic step).

The derivative of the mapping is governed by \frac{d}{dx} \tanh(x) = \text{sech}^2(x).

  * At x=0, \text{sech}^2(0) = 1.
  * Near the origin the map is approximately linear, and the gradient is bounded everywhere. There is no singularity.
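Writing the two Jacobians side by side makes the contrast explicit (the Muller expression is just the standard gradient of normalization, included here for reference):

\begin{aligned} \text{Muller:} \quad & \frac{\partial \mathbf{y}}{\partial \mathbf{z}} = \frac{1}{\|\mathbf{z}\|} \left( I - \mathbf{y}\mathbf{y}^\top \right), \quad \text{which is unbounded as } \|\mathbf{z}\| \to 0 \\ \text{Hyper-Tanh Peel:} \quad & \frac{d}{dx}\tanh(x) = \text{sech}^2(x) \le 1 \quad \text{for all } x \end{aligned}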

4. Visual Verification

This demonstrates two key properties using R:

  1. Uniformity: With the Jacobian correction, Hyper-Tanh matches Muller’s isotropy.
  2. Stability: Hyper-Tanh has bounded gradients where Muller’s explode.

Experiment A: Uniformity Check

I generate 2,000 points on the sphere S^2 (embedded in 3D) using both methods and compare how they cover the projected 2D disk.
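The experiment itself was done in R on the blog; that code isn’t reproduced here, so below is a minimal Stan sketch of the same comparison (my construction: K = 3 hard-coded, the hyper-tanh map inlined, and the Jacobian terms added via target += so it doesn’t rely on the newer jacobian statement). Fitting this and plotting proj_ht against proj_muller should give indistinguishable point clouds.

parameters {
  vector[2] x_ht;      // hyper-tanh parameters (K - 1 = 2)
  vector[3] z_muller;  // Muller / normalized-Gaussian parameters
}
transformed parameters {
  vector[3] y_ht;
  vector[3] y_muller = z_muller / norm2(z_muller);
  {
    real r = inv(cosh(x_ht[1]));                  // remaining radius after the single peel step
    real denom = 1 + square(x_ht[2]);
    y_ht[1] = tanh(x_ht[1]);                      // peel step
    y_ht[2] = r * (1 - square(x_ht[2])) / denom;  // stereographic close
    y_ht[3] = r * 2 * x_ht[2] / denom;
  }
}
model {
  // hyper-tanh: Jacobian adjustment only, so y_ht should be uniform on S^2
  target += -2 * log(cosh(x_ht[1]));
  target += log(2) - log1p(square(x_ht[2]));
  // Muller: standard normal on z, so y_muller is uniform on S^2
  z_muller ~ std_normal();
}
generated quantities {
  // project both onto the first two coordinates for the disk plots
  vector[2] proj_ht = y_ht[1:2];
  vector[2] proj_muller = y_muller[1:2];
}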

Result: The Hyper-Tanh parameterization, when sampled with the correct Jacobian prior, is indistinguishable from Muller’s method. It is perfectly uniform.

Experiment B: Gradient Stability (The “y=0” Singularity)

The magnitude of the gradient \left\| \frac{d\mathbf{y}}{dp} \right\| is calculated as the parameter p passes through the origin.

  1. Muller: p=z. \mathbf{y} = z/|z|. Gradient \propto 1/|z|.
  2. Hyper-Tanh: p=x. y = \tanh(x). Gradient \propto \text{sech}^2(x).

Result: The Hyper-Tanh Peel eliminates the topological singularity. The sampler can pass through the origin without experiencing infinite forces.

5. Stan Implementation

Copy this block directly into your Stan program. (It uses the jacobian += statement inside a _jacobian function, so it needs a recent Stan version.)

functions {
  /**
   * Maps unconstrained R^(K-1) vector x to Unit Vector S^(K-1).
   *
   * @param x Unconstrained vector of length K-1
   * @param K Dimension of the embedding space (output vector size)
   * @return Unit vector of length K
   */
  vector hyper_tanh_to_unit_jacobian(vector x, int K) {
    vector[K] y;
    real r = 1.0;
    
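    // peel off one dimension at a time and shrink the remaining radius r by sech(x[i])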
    for (i in 1:K - 2) {
      real val = x[i];
      y[i] = r * tanh(val);
      real cosh_val = cosh(val);
      r = r * inv(cosh_val);
      real power = K - i; 
      jacobian += -power * log(cosh_val);
    }

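    // close the remaining 2D subspace with a stereographic map of the last parameter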
    real last_x = x[K-1];
    jacobian += log(2.0) - log1p(square(last_x));
  
    real denom = 1.0 + square(last_x);
    
    y[K-1] = r * (1.0 - square(last_x)) / denom;
    y[K] = r * (2.0 * last_x) / denom;
    
    return y;
  }
}

data {
  int<lower=2> K;
}

parameters {
  vector[K - 1] x_raw; // Unconstrained
}

transformed parameters {
  vector[K] mu = hyper_tanh_to_unit_jacobian(x_raw, K);
}

model {
  // uniform on the sphere
}
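As a quick usage sketch (not from the post): once mu is on the sphere it can feed a directional likelihood directly. Below is a hypothetical von Mises–Fisher-style example with the concentration kappa passed as data, so its normalizing constant is constant and can be dropped; N, y_obs, and kappa are made-up names, and the functions block above is assumed.

data {
  int<lower=2> K;
  int<lower=0> N;
  array[N] vector[K] y_obs;  // observed unit vectors (hypothetical data)
  real<lower=0> kappa;       // fixed concentration, so the vMF normalizer is a constant
}
parameters {
  vector[K - 1] x_raw;
}
transformed parameters {
  vector[K] mu = hyper_tanh_to_unit_jacobian(x_raw, K);
}
model {
  // von Mises-Fisher likelihood up to an additive constant
  for (n in 1:N)
    target += kappa * dot_product(mu, y_obs[n]);
}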

This chunk seems vectorizable:

    for (i in 1:K - 2) {
      real val = x[i];
      y[i] = r * tanh(val);
      real cosh_val = cosh(val);
      r = r * inv(cosh_val);
      real power = K - i; 
      jacobian += -power * log(cosh_val);
    }

something like this (n.b. this is off the brain compiler only, so there are likely indexing errors etc.):

vector[K-2] tanhval = tanh(x[1:K-2]);
vector[K-2] coshval = cosh(x[1:K-2]);
vector[K-2] power = reverse(linspaced_vector(K-2, 2, K-1));
vector[K-1] rval;
rval[1] = 1.0;
for (i in 1:K - 2) {
  rval[i+1] = rval[i] * inv(coshval[i]); // implementing cumulative_prod function here 
}
y[1:K-2] = rval[1:K-2] .* tanhval;
jacobian += -dot_product(power, log(coshval));
real r = rval[K-1];
// ... rest as above

Are there reasons to prefer a non-vectorized version?

Nice name!

No, I didn’t try to optimize much. Your version is more memory intensive with all the vector declarations. If we wrote this in Stan math we’d try our best to balance performance and memory, and we’d write a specialized rev-mode derivative.

Yeah, I get that sometimes. No relation, as far as I know.

Sure - memory is the main potential downside. Would warrant a bit of performance testing, but easy enough to do an intermediate version:

  vector hyper_tanh_to_unit_jacobian(vector x, int K) {
    vector[K] y;
    y[1:K-2] = tanh(x[1:K-2]);
    vector[K-2] cosh_val = cosh(x[1:K-2]);
    real r = 1.0;
    
    for (i in 1:K - 2) {
      y[i] *= r;
      r *= inv(cosh_val[i]);
      jacobian += (i - K) * log(cosh_val[i]);
    }

    real last_x = x[K-1];
    real sqlast_x = square(last_x);

    jacobian += log(2.0) - log1p(sqlast_x);
  
    real mult = r / (1.0 + sqlast_x);
    
    y[K-1] = mult * (1.0 - sqlast_x);
    y[K] = mult * (2.0 * last_x);
    
    return y;
  }

(or just the tanh map for absolute minimal vector op).

For the uniformity test, what are points on the sphere supposed to look like projected down to 2D? It looks like way too many points on the surface, but is that just the geometry? What are y1 and y2? If I sample on a circle in 2D (OK, a 1D sphere if you want to measure manifold dimensions), then I wouldn’t expect the distribution of y to be uniform. I would expect the distribution of the angle to be uniform. I take it the boundary artifacts in the histograms are because it’s trying to extend beyond -1.0 and 1.0.

If I’m understanding the next part right, the derivatives happen in the unconstrained space, which I think is z here. I didn’t understand what p = z or p = x*y were doing. What you’re calling the Muller transform here does look bad.

Have you tried it with actual examples? Does the method extend beyond unit vectors to general orthonormal matrices?

Muller is the one used in Stan (you updated the docs for this ;)). That parameterization is pretty good except around the singularity at 0. The other unit vector post goes into this in detail. We’d really need to add some multiplier to the existing parameterization to make this better. It’s not a lost cause.

The benefit of this new parameterization is that it handles that situation well. I think one of the last two coordinates in this parameterization might have lower ESS due to the stereographic mapping of one parameter to two and declaring one of the poles as the anchor.

I should really mention that we’re still mapping from the real line to the sphere, and this dictates there will be artifacts. Let me reiterate that the stereographic part just picks a pole. There is no parameterization that maps the base Euclidean points onto the sphere completely (see @betanalpha’s great answer in the aforementioned post). Picking a pole helps when densities wrap around, because the sampler stays near one pole instead of bouncing between both poles. I wanted something that behaved better at the singularity at 0 and under this wrapping behavior, so I tried combining some ideas.

On the uniformity plot, if you compare the posterior marginals to uniform sampling on the sphere, the means/sds all come out equal up to sampling error for both parameterizations. I think when you plot this stuff, things near the boundary clump with things on the boundary, and it looks less uniform than you might intuitively expect.

Thanks, that’s what I suspected, but I just couldn’t work through the geometry in my head and was too lazy to simulate (I shouldn’t let that stop me these days with LLMs).

Cool stuff. What’s your plan for this?

First is to update the unit vector built-in type so we don’t have to add a multiplier keyword.

The second is that you can actually use this to create a Cholesky correlation matrix by taking exp(x_last) of the last value and adding jacobian += x_last to the log target for each row of the lower Cholesky factor. I don’t think it’s faster than Updated cholesky corr parameterization testing, but it’s neat because of the implied sech and half-Cauchy priors before standardization. Maybe these can be used to make some interesting priors?
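A hedged sketch of how I read that construction (mine, untested): apply the peel to each row of the factor, but exponentiate the row’s last raw parameter before the stereographic step so the diagonal lands on the positive half-circle, and add the extra x_last term to the Jacobian.

  // Hypothetical: my reading of the row-wise construction above, untested.
  // x has K * (K - 1) / 2 unconstrained entries, consumed row by row.
  matrix hyper_tanh_cholesky_corr_jacobian(vector x, int K) {
    matrix[K, K] L = rep_matrix(0.0, K, K);
    int pos = 1;
    L[1, 1] = 1.0;
    for (i in 2:K) {
      real r = 1.0;
      // peel the first i - 2 entries of row i, exactly as for a unit vector in R^i
      for (j in 1:i - 2) {
        real cosh_val = cosh(x[pos]);
        L[i, j] = r * tanh(x[pos]);
        r *= inv(cosh_val);
        jacobian += -(i - j) * log(cosh_val);
        pos += 1;
      }
      // exponentiate the last raw value so the stereographic step keeps L[i, i] > 0
      real t = exp(x[pos]);
      jacobian += x[pos];                       // log |d/dx exp(x)|
      jacobian += log(2.0) - log1p(square(t));  // stereographic term
      real denom = 1.0 + square(t);
      L[i, i - 1] = r * (1.0 - square(t)) / denom;
      L[i, i] = r * (2.0 * t) / denom;
      pos += 1;
    }
    return L;
  }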
