Zero-sum vector and normal distribution

Zero-sum vector and normal priors

The release of 2.36.0 introduces a new constraint transform in the Stan language. This constraint has been discussed here and, I’m sure, in other topics over the years.

A sum-to-zero vector is exactly what the name suggests: a vector whose elements sum to 0. If you put a normal prior on the zero-sum vector, the resulting variance will be smaller than the intended normal variance. To get the same variance as the intended normal prior, do

parameters {
  sum_to_zero_vector[N] z;
}
model {
  z ~ normal(0, sqrt(N * inv(N - 1)) * sigma);
}

where sigma is the intended standard deviation. FYI, it’s a bit more efficient to pre-calculate sqrt(N * inv(N - 1)) in the transformed data block. The general result for getting a given variance from a normal with linear constraints is in: Fraser, D. A. S. (1951). Normal Samples With Linear Constraints and Given Variances. Canadian Journal of Mathematics, 3, 363–366. doi:10.4153/CJM-1951-041-9.
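A minimal sketch of that pre-calculation (assuming N is declared in data and sigma is a parameter; names here are placeholders):

data {
  int<lower=1> N;
}
transformed data {
  // compute the correction factor once instead of at every leapfrog step
  real zs_scale = sqrt(N * inv(N - 1));
}
parameters {
  real<lower=0> sigma;
  sum_to_zero_vector[N] z;
}
model {
  sigma ~ normal(0, 1);  // placeholder prior
  z ~ normal(0, zs_scale * sigma);
}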

I hope in the next release we add a zero_sum_normal distribution as syntactic sugar for the above.
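Until then, one way to approximate that syntactic sugar is a user-defined density. A hypothetical sketch (the name and signature are mine, not an accepted design):

functions {
  // hypothetical zero_sum_normal: a normal prior on a sum-to-zero vector
  // with the variance correction from Fraser (1951) baked in
  real zero_sum_normal_lpdf(vector z, real sigma) {
    int N = num_elements(z);
    return normal_lpdf(z | 0, sqrt(N * inv(N - 1)) * sigma);
  }
}

which would let you write z ~ zero_sum_normal(sigma); in the model block.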

Connection with a simplex

We know that a simplex of dimension N is composed of N - 1 free parameters. The N^{th} parameter is determined by the other N - 1 parameters. Similarly, a zero-sum vector is determined by N - 1 free parameters.
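Concretely, in both cases the last element is pinned down by the rest: for a simplex, $s_N = 1 - \sum_{i=1}^{N-1} s_i$, and for a zero-sum vector, $z_N = -\sum_{i=1}^{N-1} z_i$.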

A common way to transform an N-vector to a simplex is to fix one of the elements at 0 and apply the softmax function. The sum-to-zero vector we now have is probably better because it distributes the N - 1 free elements evenly across all N, which should be easier for the sampler to explore.
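For reference, that common approach looks roughly like this (a sketch, not from the original post):

parameters {
  vector[N - 1] z_free;
}
transformed parameters {
  // pin one element at 0, then softmax
  simplex[N] s = softmax(append_row(z_free, 0));
}
model {
  // if you place a density on s you also need the ALR log Jacobian:
  // target += sum(log(s));
}

The sum_to_zero_vector version below spreads the degrees of freedom over all N elements instead of singling one out: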

parameters {
  sum_to_zero_vector[N] z;
}
transformed parameters {
  simplex[N] s = softmax(z);
}
model {
  // log Jacobian for the softmax:
  // sum(log(s)) = sum(z) - N * log_sum_exp(z),
  // and sum(z) is 0 here, so only this term remains
  target += -N * log_sum_exp(z);

  // co-area correction
  target += 0.5 * log(N);
  // the above is not necessary in Stan because it's a constant,
  // but if you test this in something like JAX
  // you'll see the log Jacobian determinant is off by this factor;
  // it comes from the coarea formula, a generalization of the
  // Jacobian determinant for maps from R^m -> R^n where m >= n
}

There’s more on the coarea at coarea.pdf (578.6 KB).

Hopefully the simplex paper I’m writing with others, where we explore this and other simplex transforms, will be out soon.

Zero-sum background for those interested

Our version cropped up from a circuitous research route. I’m working on a simplex parameterization paper with @mjhajharia, @Bob_Carpenter, @Seth_Axen, and @Niko in which we study the Aitchison parameterizations, specifically the inverse isometric log-ratio (ILR). The simplex transform results from first generating a zero-sum vector from a set of orthonormal basis vectors. This seemed to work well but was a bit computationally expensive, since it builds a full orthogonal matrix and needs matrix-vector products to construct the resulting zero-sum vector. @Seth_Axen noticed that this can be simplified into a series of sums, which @WardBrian condensed into a single loop through the elements!
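To make the orthonormal-basis construction concrete, here is a rough sketch (it uses a Helmert-style basis purely as an illustration, which is not necessarily the exact basis we used, and it is the expensive matrix-vector version rather than the single-loop one that went into Stan):

functions {
  // map N - 1 free values to a zero-sum N-vector by multiplying with an
  // orthonormal basis of the sum-to-zero subspace (Helmert-style columns)
  vector zero_sum_from_free(vector y) {
    int N = num_elements(y) + 1;
    matrix[N, N - 1] V = rep_matrix(0, N, N - 1);
    for (k in 1:(N - 1)) {
      real scale = inv_sqrt(k * (k + 1.0));
      for (i in 1:k) {
        V[i, k] = scale;
      }
      V[k + 1, k] = -k * scale;
    }
    return V * y;  // each column of V sums to zero, so V * y does too
  }
}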

The “different flavors” of implementation from our PyMC friends come from choosing a different orthonormal basis, and any orthonormal basis will result in the same outcome. I believe they are doing an implementation of a Householder reflection. PyMC also allows a tensor of any dimension to be zero-sum across chosen axes. In Stan we currently don’t have any n-array constraint transforms like that, but I have worked out the sum_to_zero_matrix and hopefully that will be in the next release. The stan-math GitHub issue is Add sum_to_zero_matrix · Issue #3132 · stan-dev/math · GitHub.

13 Likes

Just came here to thank @spinkney. This is really a breakthrough in terms of the efficiency of sum-to-zero. The isometric log-ratio transform he used rocks; it’s speeding up everything we do with sum-to-zero and making it more robust.

We not only need to finish the paper, we need to get the new simplex transform into Stan. It has the same benefits as this sum-to-zero constraint. I feel bad because this has been on the back burner for years now.

We didn’t put too much thought into any of the transforms other than the correlation matrix transforms, which we borrowed from Lewandowski, Kurowicka, and Joe (hence the LKJ name of the correlation density; they’re random matrix theorists who needed to generate random correlation matrices concentrated around the identity).

2 Likes

The new simplex parameterization will be really nice. We should add the log version too, as we can often stay on the log scale until generated quantities.

With the addition of row/col stochastic matrices, I want to add doubly stochastic matrices. I’ve worked out how to do this with the ILR parameterization, so it will be similar to the better simplex parameterization rather than stick breaking.

The sum to zero matrix will be great too. After working through the doubly stochastic parameterization I believe I can add hard row/col sum constraints for a random rectangular matrix (row sums must equal col sums). This will be a game changer for Bayesian survey raking and ecological type Bayesian inference.

I believe the Cholesky factor of correlation matrix transform and the correlation matrix transform can both be improved as well. For the Cholesky factor I have 2 candidates I need to test extensively. In practice they do not give the initialization warnings I get with the built-in transform, and I can construct matrices of dimension > 30 without failure. For the correlation matrix, we currently construct the Cholesky factor and then multiply it by its transpose. I believe the vine parameterization in the LKJ paper should be faster and more stable (which I also have).

Is there any demand for a unit vector that divides by the L1 norm instead of the L2 norm?

Yes, do you have a use case?

I’m in the intermediate stages of working on something that makes use of it, like simulating a prior that uses it. I can talk more about it after I make more progress.

1 Like

Let me know if you need any help, sounds interesting.

1 Like

pymc.ZeroSumNormal — PyMC dev documentation

Is this similar to this PyMC implementation, or something completely different?

It results in the same thing. I wonder if they know about the Fraser paper; that’s the earliest reference to this that I could find.

1 Like

The sum to zero matrix will be great too. After working through the doubly stochastic parameterization I believe I can add hard row/col sum constraints for a random rectangular matrix (row sums must equal col sums). This will be a game changer for Bayesian survey raking and ecological type Bayesian inference.

Hey Sean, thanks for this thread and for writing clear explanations for the great code you produced. I was wondering if you had any examples of where doubly stochastic matrices are used in ecological models? Most of the ecological models I’m familiar with tend to parameterise the stochastic matrices in terms of underlying probabilities, such as multistate models where transitioning between states is usually parameterised as some sort of survival probability and a state transition probability (e.g., going from alive and uninfected with a pathogen to alive and infected).

1 Like

I’m using “ecological inference” to mean inference about individuals from population aggregates; see Ecological Inference | GARY KING.

I see that I wasn’t super clear in what I wrote. I meant that the rectangular N x M sum-to-zero matrix can be used for ecological inference, where we have information about the marginal totals and possibly small-sample or side information about the conditionals (the interior cells of the matrix).

As I was working through the doubly stochastic matrix, which must be square, I saw that a similar procedure could be used for the rectangular matrix one needs for the ecological inference.

The thing I see doubly stochastic matrices enabling is sampling from convex polyhedra: either the problem can be expressed as a convex polyhedron constraint, or one wants to directly investigate a distribution over some convex polyhedron. I don’t have as much to discuss here because I haven’t thought too much about it.

1 Like

Let me know if you need any help, sounds interesting.

Maybe once I get a little further along I’ll reach out, if you don’t mind giving the code a once over.

Thanks a lot Sean! Both for the work and for the careful explanation. I put this right to use in a model, in fact.

However, when running moment matching in loo, I get an error related to the zero sum vector:

Compiling additional model methods...
Error: Exception: stan::math::sum_to_zero_free: sum_to_zero variable does not sum to zero. 
sum(sum_to_zero variable) = -2e-06, but should be 0 (in [irrelevant file information removed])

I’m guessing this is a machine precision or rounding issue? Or is it indicative of some problem with the sum_to_zero_vector? The sum that throws the error certainly is close enough to zero for me, but the moment matching in loo seems to complain. Perhaps it’s an issue for the loo maintainers?

That does indeed sound like a precision error. You may be able to fix it by requesting higher-precision output from cmdstanR; otherwise, I think there is a compiler flag to change the tolerance level.

1 Like

Ah, nice. I got a similar error when trying to run the standalone generated quantities method, in fact. I’ll try to turn up the precision.

Update edit: You were right - setting sig_figs = 9 solved it. Thanks for the quick and useful response, @WardBrian!

1 Like

Great! Although mathematically this should be exactly 0, with finite-precision computation you’ll often get some small nonzero number.

I’m interested in how the zero-sum vector worked for your model. Are you finding it to be useful?

Absolutely useful, although I’m not using it for anything very fancy! I’m modeling brain imaging (MRI) data from adolescents with OCD and early-onset psychosis, as well as healthy controls. We fit a hierarchical loglinear regression model to measures of thickness and surface area for different parts of the cortex, and to volumes of subcortical structures. The zero-sum vector is used for the varying intercepts per brain area, to identify the model (we also have varying intercepts per participant). It allowed me to stop fixing the intercept of an arbitrary brain structure to 0 for identification, which is nice, as no specific brain region now plays a special role in the model.

I think this easily accessible zero-sum vector (with the code you kindly provided to correct the variance) is a better way to identify our model than fixing an arbitrary parameter to 0, as the interpretation and hence the communication of the model is a bit simpler: the intercept keeps a more conventional meaning. Also, the code is cleaner and easier to read and understand, which is really important as well.
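For readers who want to see the pattern, a minimal sketch of that identification trick might look like the following (variable names, priors, and the likelihood are placeholders, not the actual model described above):

data {
  int<lower=1> N_obs;
  int<lower=1> N_area;
  array[N_obs] int<lower=1, upper=N_area> area;
  vector[N_obs] y;  // e.g. log thickness, surface area, or volume
}
parameters {
  real intercept;
  real<lower=0> sigma_area;
  real<lower=0> sigma;
  sum_to_zero_vector[N_area] z_area;  // varying intercepts per brain area
}
model {
  // variance correction so the area intercepts have marginal sd sigma_area
  z_area ~ normal(0, sqrt(N_area * inv(N_area - 1)) * sigma_area);
  intercept ~ normal(0, 1);
  sigma_area ~ normal(0, 1);
  sigma ~ normal(0, 1);
  y ~ normal(intercept + z_area[area], sigma);
}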

2 Likes