A better-er unit vector

This is a follow-up to A better unit vector.

I’ve decided to put these things up on a blog because writing up a paper feels like too much. The post is at The Hyper-Tanh Peel: A Novel Parameterization for Bayesian Spheres – Pinkney’s Sufficient Stats. What follows is mostly a copy-paste from the blog.

I’m finding the parameterization to be very stable, but I might’ve missed some issues with it.

1. Background

  1. Hyperspherical coordinates (angles) involve boundaries (0, \pi) that hinder Hamiltonian Monte Carlo (HMC) sampling.
  2. Muller’s method (normalizing standard Gaussians, how Stan parameterizes unit_vector) is isotropic but over-parameterized (K parameters for K-1 dimensions) and suffers from a gradient singularity at the origin.

I’m calling this the hyper-tanh peel bijective parameterization. It is a mapping \mathbb{R}^{K-1} \to S^{K-1}. It uses “Logistic Geometry” to create a smooth, unconstrained manifold that is numerically stable and avoids the singularities of traditional methods.

2. Mathematical Formulation

The Transformation

The method “peels” dimensions one by one using the hyperbolic tangent function (\tanh), which maps (-\infty, \infty) \to (-1, 1), and closes the remaining 2D subspace with a stereographic projection.

Algorithm:

Let r_1 = 1. For i = 1, \dots, K-2:

\begin{aligned} y_i &= r_i \cdot \tanh(x_i) \\ r_{i+1} &= r_i \cdot \text{sech}(x_i) \quad \text{where } \text{sech}(x) = \frac{1}{\cosh(x)} = \sqrt{1 - \tanh^2(x)} \end{aligned}

The final two dimensions are parameterized as a stereographic coordinate:

\begin{aligned} \theta &= 2 \arctan(x_{K-1}) \\ y_{K-1} &= r_{K-1} \cos(\theta) = r_{K-1} \frac{1 - x_{K-1}^2}{1 + x_{K-1}^2} \\ y_{K} &= r_{K-1} \sin(\theta) = r_{K-1} \frac{2 x_{K-1}}{1 + x_{K-1}^2} \end{aligned}
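One step worth making explicit (my note, not in the blog): the output lands on the unit sphere by construction, because each peel step preserves the remaining squared radius, and telescoping that identity through the stereographic close gives a unit norm:

\begin{aligned} y_i^2 + r_{i+1}^2 &= r_i^2 \left( \tanh^2(x_i) + \text{sech}^2(x_i) \right) = r_i^2 \\ \sum_{i=1}^{K-2} y_i^2 + y_{K-1}^2 + y_K^2 &= \sum_{i=1}^{K-2} y_i^2 + r_{K-1}^2 \left( \cos^2\theta + \sin^2\theta \right) = r_1^2 = 1 \end{aligned}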

The Jacobian Adjustment

The log-determinant of the Jacobian is:

\log |J| = \sum_{i=1}^{K-2} -(K-i) \log(\cosh(x_i)) + \log(2) - \log(1 + x_{K-1}^2)
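A quick way to check this (my sketch, not from the blog) is to identify the peel with standard hyperspherical angles via \cos\phi_i = \tanh(x_i), so that \sin\phi_i = \text{sech}(x_i) and |d\phi_i| = \text{sech}(x_i)\, dx_i, and to recall the spherical surface element \prod_{i=1}^{K-2} \sin^{K-1-i}(\phi_i)\, d\phi_i \, d\theta:

\begin{aligned} \prod_{i=1}^{K-2} \sin^{K-1-i}(\phi_i)\, d\phi_i \cdot d\theta &= \prod_{i=1}^{K-2} \text{sech}^{K-1-i}(x_i)\, \text{sech}(x_i)\, dx_i \cdot \frac{2}{1+x_{K-1}^2}\, dx_{K-1} \\ &= \prod_{i=1}^{K-2} \text{sech}^{K-i}(x_i)\, dx_i \cdot \frac{2}{1+x_{K-1}^2}\, dx_{K-1} \end{aligned}

Taking logs recovers the expression above.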

This implies that:

  1. The peeling parameters follow a sech distribution: p(x_i) \propto \text{sech}^{K-i}(x_i).
  2. The core parameter follows a Cauchy distribution: p(x_{K-1}) \propto (1+x_{K-1}^2)^{-1}.

Both densities come directly from exponentiating the corresponding terms of \log |J|.

3. Logistic Geometry & Gradient Stability

A problem with Muller’s method when used with HMC is the singularity at the origin. In Muller’s parameterization, \mathbf{y} = \mathbf{z} / \|\mathbf{z}\|. As \mathbf{z} \to 0, the gradient \nabla_\mathbf{z} \mathbf{y} explodes to infinity. This creates a “funnel” that traps HMC samplers.

The Hyper-Tanh Peel maps the origin of the parameter space \mathbf{x}=\mathbf{0} to a fixed pole of the sphere, \mathbf{y}=(0, \dots, 0, 1, 0) (the coordinate singled out by the stereographic step).

The derivative of the mapping is governed by \frac{d}{dx} \tanh(x) = \text{sech}^2(x).

  * At x=0, \text{sech}^2(0) = 1.
  * Near the origin the map is approximately linear, and the gradient is bounded everywhere. There is no singularity.
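Writing the two Jacobians side by side makes the contrast explicit (the Muller expression is just the standard gradient of normalization, included here for reference):

\begin{aligned} \text{Muller:} \quad & \frac{\partial \mathbf{y}}{\partial \mathbf{z}} = \frac{1}{\|\mathbf{z}\|} \left( I - \mathbf{y}\mathbf{y}^\top \right), \quad \text{which is unbounded as } \|\mathbf{z}\| \to 0 \\ \text{Hyper-Tanh Peel:} \quad & \frac{d}{dx}\tanh(x) = \text{sech}^2(x) \le 1 \quad \text{for all } x \end{aligned}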

4. Visual Verification

This demonstrates two key properties using R:

  1. Uniformity: With the Jacobian correction, Hyper-Tanh matches Muller’s isotropy.
  2. Stability: Hyper-Tanh has bounded gradients where Muller’s explode.

Experiment A: Uniformity Check

I generate 2,000 points on the sphere S^2 (embedded in 3D) using both methods and compare how they cover the projected 2D disk.
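The experiment itself was done in R on the blog; that code isn’t reproduced here, so below is a minimal Stan sketch of the same comparison (my construction: K = 3 hard-coded, the hyper-tanh map inlined, and the Jacobian terms added via target += so it doesn’t rely on the newer jacobian statement). Fitting this and plotting proj_ht against proj_muller should give indistinguishable point clouds.

parameters {
  vector[2] x_ht;      // hyper-tanh parameters (K - 1 = 2)
  vector[3] z_muller;  // Muller / normalized-Gaussian parameters
}
transformed parameters {
  vector[3] y_ht;
  vector[3] y_muller = z_muller / norm2(z_muller);
  {
    real r = inv(cosh(x_ht[1]));                  // remaining radius after the single peel step
    real denom = 1 + square(x_ht[2]);
    y_ht[1] = tanh(x_ht[1]);                      // peel step
    y_ht[2] = r * (1 - square(x_ht[2])) / denom;  // stereographic close
    y_ht[3] = r * 2 * x_ht[2] / denom;
  }
}
model {
  // hyper-tanh: Jacobian adjustment only, so y_ht should be uniform on S^2
  target += -2 * log(cosh(x_ht[1]));
  target += log(2) - log1p(square(x_ht[2]));
  // Muller: standard normal on z, so y_muller is uniform on S^2
  z_muller ~ std_normal();
}
generated quantities {
  // project both onto the first two coordinates for the disk plots
  vector[2] proj_ht = y_ht[1:2];
  vector[2] proj_muller = y_muller[1:2];
}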

Result: The Hyper-Tanh parameterization, when sampled with the correct Jacobian prior, is indistinguishable from Muller’s method. It is perfectly uniform.

Experiment B: Gradient Stability (The “y=0” Singularity)

The magnitude of the gradient \left\| \frac{d\mathbf{y}}{dp} \right\| is calculated as the parameter p passes through the origin.

  1. Muller: p=z. \mathbf{y} = z/|z|. Gradient \propto 1/|z|.
  2. Hyper-Tanh: p=x. y = \tanh(x). Gradient \propto \text{sech}^2(x).

Result: The Hyper-Tanh Peel eliminates the topological singularity. The sampler can pass through the origin without experiencing infinite forces.

5. Stan Implementation

Copy this block directly into your Stan program. (It uses the jacobian += statement inside a _jacobian function, so it needs a recent Stan version.)

functions {
  /**
   * Maps unconstrained R^(K-1) vector x to Unit Vector S^(K-1).
   *
   * @param x Unconstrained vector of length K-1
   * @param K Dimension of the embedding space (output vector size)
   * @return Unit vector of length K
   */
  vector hyper_tanh_to_unit_jacobian(vector x, int K) {
    vector[K] y;
    real r = 1.0;
    
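    // peel off one dimension at a time and shrink the remaining radius r by sech(x[i])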
    for (i in 1:K - 2) {
      real val = x[i];
      y[i] = r * tanh(val);
      real cosh_val = cosh(val);
      r = r * inv(cosh_val);
      real power = K - i; 
      jacobian += -power * log(cosh_val);
    }

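    // close the remaining 2D subspace with a stereographic map of the last parameter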
    real last_x = x[K-1];
    jacobian += log(2.0) - log1p(square(last_x));
  
    real denom = 1.0 + square(last_x);
    
    y[K-1] = r * (1.0 - square(last_x)) / denom;
    y[K] = r * (2.0 * last_x) / denom;
    
    return y;
  }
}

data {
  int<lower=2> K;
}

parameters {
  vector[K - 1] x_raw; // Unconstrained
}

transformed parameters {
  vector[K] mu = hyper_tanh_to_unit_jacobian(x_raw, K);
}

model {
  // uniform on the sphere
}
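As a quick usage sketch (not from the post): once mu is on the sphere it can feed a directional likelihood directly. Below is a hypothetical von Mises–Fisher-style example with the concentration kappa passed as data, so its normalizing constant is constant and can be dropped; N, y_obs, and kappa are made-up names, and the functions block above is assumed.

data {
  int<lower=2> K;
  int<lower=0> N;
  array[N] vector[K] y_obs;  // observed unit vectors (hypothetical data)
  real<lower=0> kappa;       // fixed concentration, so the vMF normalizer is a constant
}
parameters {
  vector[K - 1] x_raw;
}
transformed parameters {
  vector[K] mu = hyper_tanh_to_unit_jacobian(x_raw, K);
}
model {
  // von Mises-Fisher likelihood up to an additive constant
  for (n in 1:N)
    target += kappa * dot_product(mu, y_obs[n]);
}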

This chunk seems vectorizable:

    for (i in 1:K - 2) {
      real val = x[i];
      y[i] = r * tanh(val);
      real cosh_val = cosh(val);
      r = r * inv(cosh_val);
      real power = K - i; 
      jacobian += -power * log(cosh_val);
    }

something like this (n.b. this is off the brain compiler only, so there are likely indexing errors etc.):

vector[K-2] tanhval = tanh(x[1:K-2]);
vector[K-2] coshval = cosh(x[1:K-2]);
vector[K-2] power = reverse(linspaced_vector(K-2, 2, K-1));
vector[K-1] rval;
rval[1] = 1.0;
for (i in 1:K - 2) {
  rval[i+1] = rval[i] * inv(coshval[i]); // implementing cumulative_prod function here 
}
y[1:K-2] = rval[1:K-2] .* tanhval;
jacobian += -dot_product(power, log(coshval));
real r = rval[K-1];
// ... rest as above

Are there reasons to prefer a non-vectorized version?

Nice name!

No, I didn’t try to optimize much. Your version is more memory intensive with all the vector declarations. If we wrote this in Stan math we’d try our best to balance performance and memory, and we’d write a specialized rev-mode derivative.

Yeah, I get that sometimes. No relation, as far as I know.

Sure - memory is the main potential downside. Would warrant a bit of performance testing, but easy enough to do an intermediate version:

  vector hyper_tanh_to_unit_jacobian(vector x, int K) {
    vector[K] y;
    y[1:K-2] = tanh(x[1:K-2]);
    vector[K-2] cosh_val = cosh(x[1:K-2]);
    real r = 1.0;
    
    for (i in 1:K - 2) {
      y[i] *= r;
      r *= inv(cosh_val[i]);
      jacobian += (i - K) * log(cosh_val[i]);
    }

    real last_x = x[K-1];
    real sqlast_x = square(last_x);

    jacobian += log(2.0) - log1p(sqlast_x);
  
    real mult = r / (1.0 + sqlast_x);
    
    y[K-1] = mult * (1.0 - sqlast_x);
    y[K] = mult * (2.0 * last_x);
    
    return y;
  }

(or just the tanh map for absolute minimal vector op).

For the uniformity test, what are points on the sphere supposed to look like projected down to 2D? It looks like way too many points on the surface, but is that just the geometry? What are y1 and y2? If I sample on a circle in 2D (OK, a 1D sphere if you want to measure manifold dimensions), then I wouldn’t expect the distribution of y to be uniform. I would expect the distribution of the angle to be uniform. I take it the boundary artifacts in the histograms are because it’s trying to extend beyond -1.0 and 1.0.

If I’m understanding the next part right, the derivatives happen in the unconstrained space, which I think is z here. I didn’t understand what p = z or p = x*y were doing. What you’re calling the Muller transform here does look bad.

Have you tried it with actual examples? Does the method extend beyond unit vectors to general orthonormal matrices?

Muller is the one used in Stan (you updated the docs for this ;)). That parameterization is pretty good except around the singularity at 0. The other unit vector post goes into this in detail. We’d really need to add some multiplier to the existing parameterization to make this better. It’s not a lost cause.

The benefit of this new parameterization is that it handles that situation well. I think one of the last two coordinates in this parameterization might have lower ESS due to the stereographic mapping of one parameter to two and declaring one of the poles as the anchor.

I should really mention that we’re still mapping from the real line to the sphere, and this dictates there will be artifacts. Let me reiterate that the stereographic part just picks a pole. There is no parameterization that maps the base Euclidean points onto the sphere completely (see @betanalpha’s great answer in the aforementioned post). Picking a pole helps when densities wrap around, because the sampler stays near one pole instead of bouncing between both poles. I wanted something that behaved better at the singularity at 0 and under this wrapping behavior, so I tried combining some ideas.

On the uniformity plot, if you compare the posterior marginals to uniform sampling on the sphere, the means/sds all come out equal up to sampling error for both parameterizations. I think when you plot this stuff, things near the boundary clump with things on the boundary, and it looks less uniform than you might intuitively expect.

Thanks, that’s what I suspected, but I just couldn’t work through the geometry in my head and was too lazy to simulate (I shouldn’t let that stop me these days with LLMs).

Cool stuff. What’s your plan for this?

First is to update the unit vector built-in type so we don’t have to add a multiplier keyword.

The second is that you can actually use this to create a Cholesky correlation matrix by taking exp(x_last) of the last value and adding jacobian += x_last to the log target for each row of the lower Cholesky factor. I don’t think it’s faster than Updated cholesky corr parameterization testing, but it’s neat because of the implied sech and half-Cauchy priors before standardization. Maybe these can be used to make some interesting priors?
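A hedged sketch of how I read that construction (mine, untested): apply the peel to each row of the factor, but exponentiate the row’s last raw parameter before the stereographic step so the diagonal lands on the positive half-circle, and add the extra x_last term to the Jacobian.

  // Hypothetical: my reading of the row-wise construction above, untested.
  // x has K * (K - 1) / 2 unconstrained entries, consumed row by row.
  matrix hyper_tanh_cholesky_corr_jacobian(vector x, int K) {
    matrix[K, K] L = rep_matrix(0.0, K, K);
    int pos = 1;
    L[1, 1] = 1.0;
    for (i in 2:K) {
      real r = 1.0;
      // peel the first i - 2 entries of row i, exactly as for a unit vector in R^i
      for (j in 1:i - 2) {
        real cosh_val = cosh(x[pos]);
        L[i, j] = r * tanh(x[pos]);
        r *= inv(cosh_val);
        jacobian += -(i - j) * log(cosh_val);
        pos += 1;
      }
      // exponentiate the last raw value so the stereographic step keeps L[i, i] > 0
      real t = exp(x[pos]);
      jacobian += x[pos];                       // log |d/dx exp(x)|
      jacobian += log(2.0) - log1p(square(t));  // stereographic term
      real denom = 1.0 + square(t);
      L[i, i - 1] = r * (1.0 - square(t)) / denom;
      L[i, i] = r * (2.0 * t) / denom;
      pos += 1;
    }
    return L;
  }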
