New Stan data type: zero_sum_vector

I just created a stanc3 GitHub issue about adding a zero_sum_vector type and have attached the link below. I’d like this feature because I’m tired of coding this by hand, which requires boilerplate parameter and transformed parameter declarations that bloat the code and make it hard to read.

This didn’t seem like it deserved a whole design doc, but I did write out some considerations for declaring priors.

[Edit: I forgot to add that the reason I’m posting here is to solicit feedback before adding this to the language.]

12 Likes

Just to note (and I don’t expect that this is news to @Bob_Carpenter) that the Stan Users Guide has a section on parameterizing vectors of this sort as well as some discussion of how to set priors on the margins of a centered vector: 1.7 Parameterizing centered vectors | Stan User’s Guide

Beyond saying that this seems useful and worthwhile, I have one minor comment on the Github issue. For pedagogical purposes I wouldn’t say that

There is no Jacobian because the transform is a constant linear function

Instead, I would say that there is no useful adjustment to perform because the Jacobian isn’t square. The whole point of a Jacobian adjustment is to ensure that the probability density functions that syntactically appear to correspond to prior statements over transformed parameters actually yield the corresponding pushforward densities over those parameters. For constrained variable types like centered vectors, it isn’t the case that the prior pushforwards will correspond to the “syntactic priors”. If there were a fully general adjustment that would enable this to be the case, then I imagine we would apply it, just like a non-constant Jacobian adjustment for a bounded parameter.

I think this distinction is pedagogically important because when a user hears that the transform is linear and no Jacobian adjustment is required, they might reasonably assume that they can write down prior statements that encapsulate domain knowledge in the usual straightforward way. And then they’d write down the wrong Stan model without realizing it.

4 Likes

I don’t think there will be any debate about the design and implementation of the type, but I strongly disagree with the modeling motivation (not to mention the heavy use of the royal “we” when discussing that motivation).

Imposing a zero-sum constraint on individual parameters addresses identifiability issues only by pretty substantially changing the assumed population model.

For example, from the mathematical perspective it breaks the infinite exchangeability/conditional independence of the model, making it fundamentally different from classic hierarchical models. One consequence is that the zero-sum model cannot consistently generalize beyond the given contexts, and so inferences and predictions can no longer be made for new contexts.

From a more statistical perspective, the zero-sum constraint fixes the population mean to exactly the average of the individual parameters, which is extremely unlikely to happen if the population model is actually conditionally independently normal.

As with so many popular hacks, the results are achieved only at the expense of fundamentally changing the model, which is, in my opinion, very poor practice. That said, separating the design/implementation from these (to my mind, at least) controversial modeling applications would avoid these issues in the software discussion entirely.

1 Like

Thanks, @jsocolar. This is super cool. I didn’t do the math to figure out the correction and I don’t even think I reviewed the PR.

There is a Jacobian, but it’s constant (i.e., doesn’t depend on parameters).

The Jacobian is always square in a change of variables, but it’s not always apparent what those variables are (and sometimes there’s a choice). For example, if I have variables (x, y) and want a transformed variable \exp(x + y), then I can’t just put a prior on \exp(x + y) and get a distribution over (x, y)—the result is the same for (x - c, \ y + c), so I need more constraint. For example, I could put a prior on x or y in addition to \exp(x + y) and get a proper prior on (x, y). The typical fix is to think of the transform as mapping both x and y, which means you need two outputs. A simple choice is (x, y) \rightarrow (\exp(x + y), \ y); you get a nice triangular Jacobian and an easy determinant calculation, and then you can put a prior on \exp(x + y) and on y and get a proper prior on (x, y).
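To make the triangular Jacobian concrete (a worked example, not from the original post), for (u, v) = (\exp(x + y), \ y) we have

J = \begin{bmatrix} \partial u / \partial x & \partial u / \partial y \\ \partial v / \partial x & \partial v / \partial y \end{bmatrix} = \begin{bmatrix} \exp(x + y) & \exp(x + y) \\ 0 & 1 \end{bmatrix}, \qquad |\det J| = \exp(x + y),

so the log adjustment x + y depends on the parameters. For the linear map (x, y) \rightarrow (-(x + y), \ y) in the next example, the corresponding matrix has constant entries and |\det J| = 1, so the adjustment is constant and can be dropped.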

Let’s look at a simple example, where we map (x, y) \rightarrow (x, y, -(x + y)). This is indeed not square. We have too many degrees of freedom in the output. So instead we need to pick and choose. (x, y) \rightarrow (-(x + y), \ y) works. Then what’s happening in this model is what @betanalpha is complaining about—we’re overconstraining the solution by putting priors on all of x, y, and -(x + y). That’s why we need the adjustment to the scale of the prior indicated in the user’s guide.

Is there an example you had in mind?

1 Like

Thanks for pointing that out. The model becomes an “intrinsic” model like ICAR instead of a full model like CAR.

@betanalpha: Which do you consider the “classic hierarchical model”? The model that pins one of the coefficients is quite popular in frequentist circles, but it introduces asymmetries into the prior. I’m guessing you mean the symmetric prior \alpha_{1:N} \sim \textrm{normal}(0, \sigma) along with an intercept \mu so that \mu + \alpha is not identified in the likelihood.

I agree. It’s bad practice for me to use “we” because the scope isn’t clear (even to myself!).

1 Like

I was curious about how the posterior would differ between enforcing the sum-to-zero constraint and not. So I’m going to do a little simulation, because it’s easier than doing real math.

Simple hierarchical model

Let’s consider a simple, canonical random-effects model for binary data organized into K groups.

\mu \sim \textrm{normal}(0, 1) \\ \sigma \sim \textrm{lognormal}(0, 1) \\ \alpha_k \sim \textrm{normal}(\mu, \sigma) \\ y_{k, n} \sim \textrm{Bernoulli}(\textrm{logit}^{-1}(\alpha_k))

This is essentially the same model as I used in my case study, Hierarchical Partial Pooling for Repeated Binary Trials. If you’re not familiar with these models, you might also want to check out Betancourt and Girolami’s paper, Hamiltonian Monte Carlo for Hierarchical Models, which discusses what all the fuss is about.

An alternative formulation of the same model keeps the same priors on \mu and \sigma, but moves the global mean parameter \mu from acting as the prior location parameter to an additive term in the sampling distribution.

\alpha_k \sim \textrm{normal}(0, \sigma) \\ y_{k, n} \sim \textrm{Bernoulli}(\textrm{logit}^{-1}(\mu + \alpha_k))

It’s an alternative formulation of the same model in the sense that the posteriors on \alpha, \mu, and \sigma will be identical.

Stan code

There are a lot of ways to code this model in Stan. Here’s the traditional parameterization that codes the first model pretty much as written.

data {
  int<lower=0> K;
  int<lower=0> N;
  array[K, N] int<lower=0, upper=1> y;
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector<offset=mu, multiplier=sigma>[K] alpha;
}
model {
  alpha ~ normal(mu, sigma);
  mu ~ normal(0, 1);
  sigma ~ lognormal(0, 1);
  for (k in 1:K)
    y[k] ~ bernoulli_logit(alpha[k]);
}

The variable alpha is declared to have an offset of mu and multiplier of sigma, which provides a non-centered parameterization under the hood.

In contrast, here’s a slightly different, but closely related model that uses only K parameters instead of K + 1 by setting

\alpha_K = -\textrm{sum}(\alpha_{1:K-1})

and adjusting the prior scale accordingly; see the Stan User’s Guide, section 1.7 on parameterizing centered vectors for more details on the adjustment.

data {
  int<lower=0> K;
  int<lower=0> N;
  array[K, N] int<lower=0, upper=1> y;
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector[K - 1] alpha_pre;
}
transformed parameters {
  vector[K] alpha = mu + append_row(alpha_pre, -sum(alpha_pre));
}
model {
  alpha ~ normal(0, inv(sqrt(1 - inv(K))) * sigma);
  mu ~ normal(0, 1);
  sigma ~ lognormal(0, 1);
  for (k in 1:K)
    y[k] ~ bernoulli_logit(alpha[k]);
}
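As a rough sanity check of where that inv(sqrt(1 - inv(K))) factor comes from (a sketch, not part of the original post): centering a K-vector of iid normals drawn with scale \sigma / \sqrt{1 - 1/K} gives a sum-to-zero vector whose elements have marginal standard deviation approximately \sigma, which is the marginal scale the adjusted prior is aiming for. In R, with illustrative values of K and sigma:

set.seed(1)
K <- 20
sigma <- 2.3
draws <- matrix(rnorm(1e5 * K, 0, sigma / sqrt(1 - 1 / K)), ncol = K)
centered <- draws - rowMeans(draws)   # each row now sums to zero
summary(apply(centered, 2, sd))       # marginal sds all close to sigma = 2.3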

In both cases, I also snuck in a generated quantities block that reports the posterior for the sample mean and standard deviation of the coefficients \alpha,

generated quantities {
  real mean_alpha = mean(alpha);
  real<lower=0> sd_alpha = sd(alpha);
}

Simulated data and fit

Now I’m going to simulate data for K = 20 and N = 200 and fit using cmdstanr with the following R script.

library('cmdstanr')
ilogit <- function(u) 1 / (1 + exp(-u))

set.seed(1234)

K <- 20
mu <- 1.3
sigma <- 2.3
alpha <- rnorm(K, mu, sigma)

N <- 200
y <- matrix(NA, K, N)
for (k in 1:K)
    y[k, ] <- rbinom(N, 1, ilogit(alpha[k]))
data = list(K = K, N = N, y = y);

mod <- cmdstan_model('hier.stan')
fit <- mod$sample(data = data, chains=2, parallel_chains=4)

modc <- cmdstan_model('hier-ctr.stan')
fitc <- modc$sample(data = data, chains=2, parallel_chains=4)

And here are the results (for seed 1234), first for the standard model (fit) and then for the sum-to-zero model (fitc).

   variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
 mu          0.53   0.56 0.45 0.41 -0.27  1.24 1.02      130      146
 sigma       2.20   2.16 0.38 0.37  1.68  2.92 1.02      152      345
 alpha[1]   -1.61  -1.61 0.19 0.20 -1.92 -1.30 1.00     2786     1454
 alpha[11]   0.29   0.29 0.14 0.14  0.07  0.51 1.00     2684     1168
 alpha[20]   5.62   5.49 1.04 1.02  4.16  7.52 1.00      750     1321
 mean_alpha  0.69   0.69 0.08 0.08  0.58  0.83 1.00     1141     1266
 sd_alpha    2.17   2.15 0.16 0.15  1.93  2.46 1.00      990     1332
   variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
 mu          0.69   0.68 0.08 0.08  0.56  0.82 1.01      575      859
 sigma       2.22   2.17 0.41 0.37  1.66  2.98 1.00     1585     1202
 alpha[1]   -1.62  -1.62 0.19 0.19 -1.95 -1.32 1.00     3043     1547
 alpha[11]   0.28   0.28 0.14 0.14  0.05  0.51 1.00     2892     1550
 alpha[20]   5.64   5.50 1.12 1.07  4.11  7.69 1.00      755     1064
 mean_alpha  0.69   0.68 0.08 0.08  0.56  0.82 1.01      575      859
 sd_alpha    2.17   2.15 0.17 0.16  1.93  2.48 1.00      818     1048

All of the parameters other than \mu have indistinguishable posteriors. But \mu is both larger and has a much narrower posterior 90% interval (q5 to q95) in the second model.

Did we just get lucky? Let’s try a different seed (123456).

> prt(fit)
   variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
 mu          1.90   1.90 0.49 0.48  1.06  2.66 1.00      250      412
 sigma       2.31   2.25 0.46 0.39  1.69  3.14 1.01      303      594
 alpha[1]    3.90   3.87 0.49 0.47  3.16  4.74 1.00     2530     1208
 alpha[11]  -0.94  -0.94 0.16 0.15 -1.20 -0.68 1.00     2193     1619
 alpha[20]   2.60   2.59 0.29 0.30  2.16  3.08 1.00     2625     1474
 mean_alpha  2.38   2.37 0.13 0.12  2.18  2.60 1.00     1477     1356
 sd_alpha    2.24   2.21 0.20 0.17  1.97  2.62 1.00     1226     1259
> prt(fitc)
   variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
 mu          2.38   2.37 0.13 0.13  2.19  2.62 1.01      377      584
 sigma       3.22   3.15 0.57 0.53  2.41  4.22 1.00     1047     1246
 alpha[1]    3.86   3.81 0.48 0.47  3.16  4.66 1.00     1495     1416
 alpha[11]  -0.95  -0.95 0.16 0.16 -1.22 -0.69 1.00     3066     1528
 alpha[20]   2.60   2.58 0.29 0.27  2.15  3.12 1.00     1971      818
 mean_alpha  2.38   2.37 0.13 0.13  2.19  2.62 1.01      377      584
 sd_alpha    2.28   2.24 0.23 0.20  2.00  2.72 1.01      348      419

@betanalpha is right

So we can conclude that @betanalpha is right—changing the model to use a sum-to-zero constraint does change the posterior. And arguably the non-identifiable model is the one we want, because as @betanalpha said, this is the one where we can naturally make posterior predictions for new groups. Making predictions for new groups is important. There might not be new states in a political model, but there are new testing sites for Covid modeling, and Andrew and I used posterior predictions for new groups in our Covid test sensitivity and specificity paper.

Help with intuitions?

I’d like to develop some intuition around why these results are so different. Of course, they have completely different correlation structure, but then so does the traditional centered vs. non-centered parameterization (though in that case you can easily show the transformed variables have the same distributions). What I’m really surprised about is that the posteriors for the coefficients \alpha_k don’t change even though the posteriors on \mu and \sigma are so different.

Given the posteriors for \alpha were indistinguishable between the two models, it’s not surprising that the sample mean and standard deviations were identical. But I was surprised that \mu, \sigma didn’t closely track the sample mean and standard deviation of \alpha in either model. With both seeds, the value of \mu matches the sample mean of \alpha_{1:K} in the second model. But the standard deviations are way off in both cases.

11 Likes

@avehtari and @bgoodri found my bug during the Stan meeting. Here’s the revised hierarchical model that takes the prior for alpha to be centered at mu. This brings the results in line.

Here’s the revised hier-ctr.stan.

data {
  int<lower=0> K;
  int<lower=0> N;
  array[K, N] int<lower=0, upper=1> y;
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector[K - 1] alpha_pre;
}
transformed parameters {
  vector[K] alpha = mu + append_row(alpha_pre, -sum(alpha_pre));
}
model {
  alpha ~ normal(mu, inv(sqrt(1 - inv(K))) * sigma);
  mu ~ normal(0, 1);
  sigma ~ lognormal(0, 1);
  for (k in 1:K)
    y[k] ~ bernoulli_logit(alpha[k]);
}
generated quantities {
  real mean_alpha = mean(alpha);
  real<lower=0> sd_alpha = sd(alpha);
}

Note that this is equivalent to replacing the sampling statement for alpha with

append_row(alpha_pre, -sum(alpha_pre)) ~ normal(0, inv(sqrt(1 - inv(K))) * sigma);
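To spell out the equivalence: writing c for append_row(alpha_pre, -sum(alpha_pre)), we have \alpha = \mu + c, and the normal density depends on \alpha only through \alpha - \mu, so

\textrm{normal}(\alpha \mid \mu, s) = \textrm{normal}(\mu + c \mid \mu, s) = \textrm{normal}(c \mid 0, s),

and both statements add the same term to the target density.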

And then here are the results, first for the standard model and then for the sum-to-zero model.

> prt(fit)
   variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
 mu          0.53   0.56 0.45 0.41 -0.27  1.24 1.02      130      146
 sigma       2.20   2.16 0.38 0.37  1.68  2.92 1.02      152      345
 alpha[1]   -1.61  -1.61 0.19 0.20 -1.92 -1.30 1.00     2786     1454
 alpha[11]   0.29   0.29 0.14 0.14  0.07  0.51 1.00     2684     1168
 alpha[20]   5.62   5.49 1.04 1.02  4.16  7.52 1.00      750     1321
 mean_alpha  0.69   0.69 0.08 0.08  0.58  0.83 1.00     1141     1266
 sd_alpha    2.17   2.15 0.16 0.15  1.93  2.46 1.00      990     1332
> prt(fitc)
   variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
 mu          0.69   0.69 0.08 0.08  0.57  0.82 1.00     1040     1201
 sigma       2.10   2.05 0.40 0.37  1.57  2.81 1.00     1993     1455
 alpha[1]   -1.60  -1.60 0.19 0.18 -1.92 -1.31 1.00     2544     1508
 alpha[11]   0.29   0.28 0.15 0.15  0.05  0.54 1.00     2704     1610
 alpha[20]   5.60   5.45 1.10 1.02  4.10  7.59 1.00     1284     1173
 mean_alpha  0.69   0.69 0.08 0.08  0.57  0.82 1.00     1040     1201
 sd_alpha    2.16   2.14 0.17 0.15  1.93  2.46 1.00     1322     1251

The marginal distributions for the parameters are still fine, and you can see the big gap in ESS for the hierarchical case (the first model above). The estimates for mu and sigma look different, but they have high posterior sd and low ESS in the first model, so the MCMC standard error is on the order of 0.05 for the first model and less than 0.01 for the second.

Now the question I have is why mu has such a wide distribution in the first model compared to the second. Uncertainty for sigma is similar in both models.

5 Likes

I wondered about exactly the same thing a while ago. It turns out that with a normal prior on the intercept, you can parametrize the model with a sum-to-zero constraint, resolve the identifiability issues and remove one parameter, and still recover exactly the same inferences as you would have with the original model: Correlated posterior - Sum to zero constraint for varying intercepts?! - #24 by martinmodrak

(there’s also a lot of interesting ideas from other contributors to the thread)

The specific way I did it, however, didn’t result in improved performance in the cases I cared about, but I hope there’s still something to learn from the attempt.

I’ll also tag @paul.buerkner who seemed to be interested in the previous discussion on the topic.

1 Like

My intuition is that mu in the non-centered program must span a wider range due to the range of values that alpha may take. In the centered program, regardless of where the K - 1 alphas end up on the real line, they will always be as close together or closer than the K alphas in the non-centered program. If a bunch of alphas are negative, the Kth one has to be large and positive, but the prior doesn’t care about the Kth one, so the sd on it is tighter around wherever it ends up on the real line.

The answer to this is something you have already written yourself:

The problem with the second model can easily be visualized for the case of K = 2 and N = \infty, i.e., when the marginal posteriors for the alphas are Dirac deltas.

Let’s say these marginal posteriors of the alphas are point masses at -1 and +1, respectively. Then in the second model, mu (which really is the sample mean and not the population mean) is forced to always be zero due to the sum-to-zero constraint. In the first model, on the other hand, mu (which now actually is the population mean) is free to explore the posterior.
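A quick R sketch of this (toy numbers, not from the original post), conditioning on \alpha = (-1, 1) and \sigma = 1 with the standard normal prior on \mu used in the models above:

alpha <- c(-1, 1)
sigma <- 1
prior_prec <- 1                                     # mu ~ normal(0, 1)
post_prec <- prior_prec + length(alpha) / sigma^2   # conjugate normal update for mu
post_mean <- (sum(alpha) / sigma^2) / post_prec
post_sd <- 1 / sqrt(post_prec)
c(post_mean, post_sd)                               # hierarchical model: mean 0, sd ~0.58
mean(alpha)                                         # sum-to-zero model: mu pinned at exactly 0

The hierarchical model keeps genuine posterior width for \mu even with the alphas pinned, while the constrained model collapses \mu to the sample mean.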

1 Like

I was trying to put this in a multivariate normal specification to see how we can derive the standard errors using matrix algebra.

We want a matrix M that, when a vector is multiplied by it, returns the vector’s leading elements followed by the negative of their sum.

\alpha M = \bigg[\alpha_1 \; \alpha_2 \; \cdots \; \alpha_{K-1} \; -\sum\limits_{i=1}^{K-1} \alpha_i \bigg]

The vector \alpha needs to be length K, so I pad it with an extra 0. An M that satisfies this is a matrix with ones on the diagonal and negative ones in the last column (the (K, K) entry can be anything, since \alpha has a 0 in its Kth position).

M = \begin{bmatrix} 1 & 0 & 0 & \dots & 0 & -1 \\ 0 & 1 & 0 & \dots & 0 & -1 \\ 0 & 0 & 1 & \dots & 0 & -1 \\ \vdots & & & \ddots & & \vdots \\ 0 & 0 & 0 & \dots & 0 & 1 \end{bmatrix}
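A quick R check of this construction for K = 4 (a sketch, not part of the original post):

K <- 4
M <- diag(K)
M[1:(K - 1), K] <- -1               # -1s in the last column; M[K, K] stays 1
alpha_pad <- c(1.2, -0.5, 0.3, 0)   # length-K vector padded with a trailing 0
alpha_pad %*% M                     # 1.2 -0.5 0.3 -1.0; last entry is minus the sum of the others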

Let’s say that we want to put a standard normal prior on alpha. The transform on this standard normal is

\alpha M \sim MVN(0, MIM^T).

Using the fact that the variance of a constant matrix A times a random vector X is

\begin{align} \mathbb{V}[AX] &= \mathbb{E}[(A(X - \mu))(A(X - \mu))^T] \\ &= \mathbb{E}[A(X - \mu)(X - \mu)^T A^T] \\ &= A \, \mathbb{E}[(X - \mu)(X - \mu)^T] \, A^T \\ &= A \, \mathbb{V}[X] \, A^T. \end{align}

Great! Now we can write some code. To set this up using @Bob_Carpenter’s data we just need the covariance matrix. Of course, we’ll use the Cholesky factor of it too.

id_mat <- diag(K - 1)
id_mat <- rbind(id_mat, rep(0, K - 1))
id_mat <- cbind(id_mat, -1)
id_mat[K, K] <- 1

Sigma <- cov2cor(id_mat %*% diag(K) %*% t(id_mat))

mod_mvn <- cmdstan_model('mvn_ctr.stan')
data = list(K = K, 
            N = N, 
            y = y,
            L = t(chol(Sigma)),
            s = rep(2, K)
)

Stan program

data {
  int<lower=0> K;
  int<lower=0> N;
  array[K, N] int<lower=0, upper=1> y;
  cholesky_factor_corr[K] L;
  vector[K] s;
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector[K - 1] alpha_pre;
}
transformed parameters {
  vector[K] alpha = mu + append_row(alpha_pre, -sum(alpha_pre));
}
model {
  alpha ~ multi_normal_cholesky(rep_vector(mu, K), diag_pre_multiply(s .* rep_vector(sigma, K), L));
  mu ~ normal(0, 1);
  sigma ~ lognormal(0, 1);
  for (k in 1:K)
    y[k] ~ bernoulli_logit(alpha[k]);
}
generated quantities {
  real mean_alpha = mean(alpha);
  real<lower=0> sd_alpha = sd(alpha);
}

The s vector is there because M M^T has a diagonal of 2 (except for the last entry). When converting to a correlation matrix, cov2cor divides by the square root of the diagonal; I’m adding it back in as the square of that (which is 2).

Which gives

     variable     mean   median   sd  mad       q5      q95 rhat ess_bulk ess_tail
 lp__         -1952.19 -1951.91 3.29 3.21 -1957.81 -1947.40 1.00      876     1318
 mu               0.63     0.63 0.06 0.06     0.53     0.74 1.00     1607     1609
 sigma            2.13     2.08 0.41 0.38     1.56     2.88 1.00     2624     1560
 alpha_pre[1]    -2.25    -2.25 0.19 0.20    -2.54    -1.95 1.00     2871     1658
 alpha_pre[2]     1.20     1.19 0.20 0.21     0.87     1.54 1.00     3618     1509
 alpha_pre[3]     3.06     3.03 0.43 0.42     2.42     3.83 1.00     2778     1513
 alpha_pre[4]    -4.75    -4.70 0.53 0.49    -5.73    -3.94 1.00     2698     1318
 alpha_pre[5]     1.12     1.12 0.21 0.21     0.78     1.48 1.00     3641     1586
 alpha_pre[6]     2.15     2.14 0.29 0.29     1.70     2.66 1.00     3783     1527
 alpha_pre[7]    -0.67    -0.67 0.14 0.14    -0.90    -0.43 1.00     2674     1601

vs Bob’s

     variable     mean   median   sd  mad       q5      q95 rhat ess_bulk ess_tail
 lp__         -1942.57 -1942.25 3.31 3.32 -1948.41 -1937.71 1.00      955     1529
 mu               0.69     0.68 0.08 0.07     0.57     0.82 1.00      785      743
 sigma            2.08     2.03 0.35 0.34     1.58     2.72 1.00     2049     1760
 alpha_pre[1]    -2.31    -2.30 0.19 0.19    -2.64    -1.99 1.00     1784     1323
 alpha_pre[2]     1.13     1.12 0.21 0.20     0.77     1.48 1.00     2022     1255
 alpha_pre[3]     2.92     2.89 0.42 0.41     2.29     3.67 1.00     2024     1533
 alpha_pre[4]    -4.70    -4.65 0.52 0.50    -5.62    -3.92 1.00     1801     1250
 alpha_pre[5]     1.05     1.04 0.21 0.20     0.72     1.40 1.00     1991     1607
 alpha_pre[6]     2.05     2.05 0.28 0.27     1.62     2.53 1.00     2119     1412
 alpha_pre[7]    -0.73    -0.72 0.16 0.15    -0.99    -0.47 1.00     1748     1612

Now, what’s really interesting is that you can clearly see why mu will have much tighter estimates based on the correlation matrix derived from M. I’m going to show a 4 x 4 example, but the last row and the overall pattern stay the same as K gets bigger:

        [,1]       [,2]       [,3]       [,4]
[1,]  1.0000000  0.5000000  0.5000000 -0.7071068
[2,]  0.5000000  1.0000000  0.5000000 -0.7071068
[3,]  0.5000000  0.5000000  1.0000000 -0.7071068
[4,] -0.7071068 -0.7071068 -0.7071068  1.0000000

Whereas the non-centered version assumes an identity correlation matrix.

What I don’t know is how this can be put into the univariate normal specification.

3 Likes

Right.

“Classic” is one of those dangerous words that I try to avoid.

In the older Bayesian literature “hierarchical” is typically motivated from the infinite exchangeability perspective through de Finetti’s theorem; it’s more often about subjective domain expertise rather than modeling a physical population. Degeneracy of the “hyperparameters” wasn’t much of a concern because it was just an honest reflection of that domain expertise.

From what I’ve read the older frequentist literature takes the population model perspective because it’s only when the individual variables are missing data corresponding to a physical population, and not parameters, that they can be integrated out of the observational model. This is why older references will default to “random effects” only when modeling subsets of a population and “fixed effects” when modeling any finite population. Any degeneracy problems were problems of experimental design, remedied by more observations in the right places.

For various reasons these more careful perspectives eroded as “hierarchical” became synonymous with regularization rather than an aspect of the model. Based on my (limited) research, heuristic techniques like “set one of the coefficients to zero” and “set the sum to zero” are more recent and largely come out of this more superficial regularization perspective.

For example, what domain expertise or narratively generative process would result in an exact sum-to-zero constraint for a finite population of individuals? The lack of conditional exchangeability implies that there is some strong interaction between the individuals that results in the sum being conserved. I can definitely come up with circumstances where this might be possible, but I don’t think they’re appropriate to most uses of hierarchical models.

The model \pi(\alpha_{n} \mid \mu, \tau) = \text{normal}(\alpha_{n} \mid \mu, \tau) is formally identified; given enough observations informing each \alpha_{n}, all N + 2 parameters will be constrained. The parameters are degenerate – \mu goes down when all of the \alpha_{n} go up together and vice versa – only when there is no or little constraint from a likelihood function. A non-centered parameterization helps here, as the relative deviations are independent of the population parameters in the prior model alone.

Also, we should be careful to separate a single hierarchical model from multiple “additive” hierarchical models, i.e., what I refer to as a “first-order overlapping multifactor model” or what are more commonly called “additive random effect models”. If we have

y_{n} \sim \pi(\mu_{n}, \phi) \\ \mu_{n} = \alpha_{0} + \sum_{k = 1}^{K} \alpha_{k, j(n)} \\ \alpha_{k, j} \sim \text{normal}(\mu_{k}, \tau_{k})

then \alpha_{0} and the \mu_{k} will not be identified. Eliminating the individual location parameters,

y_{n} \sim \pi(\mu_{n}, \phi) \\ \mu_{n} = \alpha_{0} + \sum_{k = 1}^{K} \alpha_{k, j(n)} \\ \alpha_{k, j} \sim \text{normal}(0, \tau_{k})

formally identifies the model, but without enough information from observed data (in particular when the individual levels in each factor are not all observed) there can be all kinds of nasty degeneracies.

Again historically these degeneracies were largely considered a problem of experimental design, hence all of those Latin squares and similar designs to ensure that enough levels are observed.

The fuss is discussed more thoroughly in Hierarchical Modeling.

As @spinkney and @Niko mentioned, in the sum-to-zero constraint model \mu is exactly equal to the sample mean of the individual parameters; there is no uncertainty beyond that. The standard deviation in its marginal posterior is just the uncertainty of the \alpha_{k} propagating to the sample mean.
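Concretely, in the sum-to-zero parameterization above, \alpha_k = \mu + c_k with \sum_k c_k = 0, so the sample mean collapses to \mu identically:

\bar{\alpha} = \frac{1}{K} \sum_{k=1}^{K} \alpha_k = \mu + \frac{1}{K} \sum_{k=1}^{K} c_k = \mu.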

In the more typical hierarchical model, however, \mu is meant to model the population mean, which will in general be different from the sample mean. When the prior model for \mu is diffuse, and the posterior distribution is dominated by the likelihood function, the marginal posterior distribution for \mu will be centered around the sample mean of the individual parameters (that’s the most consistent configuration), but there will be a wide breadth because many other population means will also be consistent with the observed sample mean.

2 Likes

Just the examples that we are already discussing. I have no quarrel with any of the math here, but I think it is useful to be careful about how we communicate this stuff to users. In my experience, most users are essentially following three rules of thumb:

  1. Anything declared in the parameters block is a parameter.
  2. The purpose of a Jacobian adjustment is to make the sampling statements that appear syntactically to correspond to prior statements over (transformed) parameters actually yield the corresponding pushforward densities over those parameters.
  3. Jacobian adjustments are the determinant of the Jacobian matrix, and they “work” when the transformation is a bijection.

Notice that these rules of thumb aren’t always strictly correct! For example, if we do

parameters{
   real<lower = 0> x;
}
model{
  x ~ std_normal();
}

then even though Stan correctly applies the Jacobian adjustment we (of course!) don’t get a standard normal margin for x; we get a half-normal, in violation of the second rule-of-thumb.

I find it revealing to look at what happens if we try to apply these rules of thumb to a zero_sum_vector, parameterized by taking the final element as the negative sum of the other elements.

If we take rule-of-thumb 1 to be true, and we additionally claim that the Jacobian adjustment exists and is constant (and therefore can be ignored), then we lose both of the other rules of thumb. Sampling statements over the elements of the zero_sum_vector won’t yield the corresponding pushforwards without additional adjustment, in violation of rule-of-thumb 2. And we are talking about a Jacobian adjustment over a transformation that isn’t a bijection. If we take the transformation to be real y = -sum(x) then the transformation is not one-to-one. If we take the transformation to be vector[N] y = append_row(x, -sum(x)) then the transformation is not onto. In either case, the Jacobian matrix is not square and it has no determinant. Does it really make sense to talk about a Jacobian adjustment at all, even to say that it is constant and can be ignored?
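(For concreteness, with K = 3 the map (x_1, x_2) \rightarrow (x_1, x_2, -(x_1 + x_2)) has the 3 \times 2 Jacobian matrix

\frac{\partial y}{\partial x} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ -1 & -1 \end{bmatrix},

which indeed has no determinant to take.)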

Another way to think about this is instead to violate rule-of-thumb 1. That is, given

parameters{
   zero_sum_vector[N] y;
}
model{
   y ~ std_normal();
}

we can say that the final element of y is not really a parameter but rather a generated quantity hiding in the parameters block. Now the Jacobian is square, its determinant is constant, and the adjustment is trivial. The cost (which seems large to me) is that now we have a generated quantity hiding in the parameters block and moreover we can manipulate that quantity as though it is a parameter, including by “putting a prior on it” (i.e. we can put it on the left-hand-side of a sampling statement). If we want to just understand this bit of syntax as a nice shorthand for writing down some principled prior density function over the N-1 actual parameters, then there’s no problem. But I don’t think that is how most users will think about putting priors on quantities that are declared in the parameters block. Note that the same problem already exists for simplex vectors, and I’m not arguing against including such transformations in Stan! I’m just searching for the right language to clarify what is really happening here. Curious if @betanalpha has further thoughts about how to be precise in discussing this stuff.

Simplex is different because the prior is almost always a Dirichlet distribution, which is readily interpretable since simplexes are the distribution’s natural support.
With a “normal” distribution on sum-to-zero vectors, the constraint and the prior are in conflict.

1 Like

There’s yet another way to achieve this sum-to-zero constraint, and in my tests it’s the fastest and has the highest ESS, especially for large numbers of groups. It uses the fact that this sum-to-zero multivariate normal distribution has a singular covariance matrix, so we can generate samples using its stochastic representation in Stan.

Even better, all the slow linear algebra can be precomputed.

The math:

The covariance matrix can be decomposed by the SVD into

\Sigma = U \Lambda^2 U^T.

Because we have K - 1 free variables, there will be K - 1 positive entries in the diagonal matrix \Lambda and one zero entry. Then we can generate samples via Y = \mu + U\Lambda X. To generate the singular \Sigma matrix for any K, we can sandwich a (K - 1) \times (K - 1) diagonal matrix between the sum-to-zero contrast matrix for K groups and its transpose. All of this can be precomputed in R or in the transformed data block!

Sigma <- contr.sum(K) %*% diag(K - 1) %*% t(contr.sum(K))

Then we take the SVD of this

s <- svd(cov2cor(Sigma))
s$d <- abs(zapsmall(s$d))

All we need is the U matrix and the d vector

data = list(K = K, N = N, y = y,
            U = s$u,
            d_raw = sqrt(s$d))
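As a standalone sanity check (a sketch, not from the original post; it uses the raw singular \Sigma rather than the cov2cor-rescaled version above), the stochastic representation does produce vectors that sum to zero, because the vector of ones lies in the null space of \Sigma:

K_chk <- 5
Sig_chk <- contr.sum(K_chk) %*% diag(K_chk - 1) %*% t(contr.sum(K_chk))
svd_chk <- svd(Sig_chk)
Lambda <- sqrt(abs(zapsmall(svd_chk$d)))  # K_chk - 1 positive values, last is 0
x <- rnorm(K_chk)                         # last coordinate is killed by the zero singular value
y <- svd_chk$u %*% (Lambda * x)
sum(y)                                    # ~ 0 up to floating point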

mod_svd <- cmdstan_model('hier_ctr_svd.stan')
fit_svd <- mod_svd$sample(data = data, 
                          chains=2, 
                          parallel_chains=4)

Because we want to estimate a specific \sigma and \mu, we need to take those into account in the program. We can take care of \sigma by giving x \sim \mathcal{N}(0, \sigma).

data {
  int<lower=0> K;
  int<lower=0> N;
  array[K, N] int<lower=0, upper=1> y;
  matrix[K, K] U;
  vector[K] d_raw;
}
transformed data {
 matrix[K, K] alpha_mat = diag_post_multiply(U, d_raw);
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector[K - 1] alpha_pre;
}
transformed parameters {
  vector[K] alpha = mu + alpha_mat * append_row(alpha_pre, 0);
}
model {
  alpha ~ normal(mu, inv(sqrt(1 - inv(K))) * sigma);
  mu ~ normal(0, 1);
  sigma ~ lognormal(0, 1);
  for (k in 1:K)
    y[k] ~ bernoulli_logit(alpha[k]);
}
generated quantities {
  real mean_alpha = mean(alpha);
  real<lower=0> sd_alpha = sd(alpha);
}

edit 3/2/2022: Updated the Stan code because there’s no need to split adding and subtracting mu. From Bob’s post it’s sufficient to add mu to alpha and put a normal prior on alpha with mu. Also, moved the diag_post_multiply into the transformed data block.

8 Likes

3 posts were split to a new topic: Sum-to-zero with multiple constraints

For those who are further interested in the distinction between the finite-population mean and sd, and the superpopulation mean and sd, I recommend chapter 21 of my 2007 book with Jennifer and also this recent blog post in the context of meta-analysis: Exploring some questions about meta-analysis (using ivermectin as an example), with R and Stan code | Statistical Modeling, Causal Inference, and Social Science

As I was discussing with Bob the other day, one trick that can help make these ideas more clear is to consider what your model would be if you have one more group for which no data are observed, as this corresponds to a new draw from the population distribution.

1 Like

Interestingly, this all came up recently when @andrewgelman wrote to me with another “Gibbs is faster than HMC” paper, this time from Giacomo Zanella and Gareth Roberts:

Their “fast” version uses the sum-to-zero constraint and their second fastest approach sets one of the coefficients to zero.

I saw this stuff first not in Bayes, but in frequentist (which we’re not going to call “classical”!) analyses where they’re trying to write identifiable likelihoods.


I’m not sure what users you’re talking to, but I think that includes me. But then I don’t know that “parameter” has a formal definition. There’s all that “missing data” fudging going on in frequentist analyses that want to marginalize with a clear philosophical conscience.

I would’ve said we just need the Jacobian adjustment when we have p(x) and want to work out p(y) when y = f(x) for some monotonic and smooth function f.

I’m not sure what they’re supposed to correspond to. I wasn’t expecting the model to be the same and I don’t know how to transform so that it is. I was just trying to understand the three different models (identify with prior, identify by pinning a value, identify by pinning the sum).

I wasn’t talking about a Jacobian adjustment at all. The bijection here is between unconstrained (K-1)-dimensional vectors and K-dimensional vectors that sum to 0. It’s like the bijection between (K-1) unconstrained parameters and a simplex, or (N choose 2) unconstrained parameters and an N x N correlation matrix.

I see @nhuure responded above on the difference between this and simplexes.

This is interesting. The Dirichlet is defined as log Dirichlet(theta | alpha) = (alpha - 1)' * log(theta), which puts a “prior” on each entry (but it all normalizes properly). I see the difference with normal, which is typically defined over an unconstrained space. Thus I think it’s closer to when we do this:

parameters {
  real<lower=0> sigma;
}
model {
  sigma ~ normal(0, 2);
}

This applies a normal distribution to a parameter with only half support. Now it doesn’t matter here because the normalizing factor is just 2, and log 2 drops out as a constant. It’s similar with the sum-to-zero and independent normals, I think. We could probably work out the lower rank distribution (as I think @spinkney is doing).

I’m talking to you but not about you when I say “most users”. I’m talking about relatively less experienced users who are trying to wrap their heads around Jacobian adjustments. I think that if we tell them no Jacobian adjustment is required, then they might reasonably expect that writing theta ~ std_normal(); will yield standard normal pushforwards on the margins of theta. The reason I think this might be true is that I believe users understand the Jacobian adjustment in a very abstract way to mean “the required adjustment such that the prior pushforwards correspond to the sampling statements that look like priors”.

I don’t really know whether or how this concern is actionable, other than by discussing the issue in the doc, which is already done in the SUG and you clearly intend to do with the new zero_sum_vector type. And I certainly could be wrong about “most users” anyway. I doubt I have anything useful to add at this point, beyond clearing up the possible misunderstanding about whom I was talking to/about. :)